[nzlug] Data processing.....

Andras Farago andras_farago at yahoo.com.au
Mon May 28 14:34:41 NZST 2007


  The data are in five CVS files currently but it can be converted to anything I want the fastest solution.
  The queries are really simple and I have buckets between 1500-3000.
  I was done with “awk” earlier and it was too slow. MySQL was fast with 1 million records but when I tried to import all the data into a table it exploded. After 5 million rows the database size on the disk was 30 GB and I found it too much. The raw data takes 7GB at all.
   
  Andrew
   
  Ps: Thank you for the script.
  

Michael Field <michael.field at concepts.co.nz> wrote: Hi Andrew,

As a follow-up, here is how I would do a 'cheap-and-dirty' for the query
in you email.

Assume you have this file (in 'test.csv'):
 key1,junk,1
 key1,junk,100
 key2,junk,1000
 key2,junk,1000
 key3,junk,1000
(Of cause you will have 25 fields and 2,000,000 rows!)

You could then use a script like:
 cut -f 1,3 -d ',' < test.csv | sort | awk '
 BEGIN {
    total = 0;
   key = "zzz"
   FS=","
     }
 
     {
       if(key == "zzz") key = $1;
       if(key != $1)
       {
         print total " " key;
         key = $1;
    total = 0;
       }
       total += $2
    }

 END {
         print total " " key;
     }' | sort -rn

To do the following:
 - cut out the fields of interest (1 and three in this case)
   (This is important to limit the data being sorted on)
 - sort that subset into order of the key for the totaling
 - produce running totals for the different key values 
   (ordered total then key)
 - sort the totals in the reverse order

Key performance hog will be the initial 'sort'.

It seems to work:
 # ./test.sh
 2000 key2
 1000 key3
 101 key1

Regards

Mike



 	      
---------------------------------
How would you spend $50,000 to create a more sustainable environment in Australia?  Go to Yahoo!7 Answers and share your idea.


More information about the NZLUG mailing list