[nzlug] Data processing.....
Andras Farago
andras_farago at yahoo.com.au
Mon May 28 14:34:41 NZST 2007
The data are in five CVS files currently but it can be converted to anything I want the fastest solution.
The queries are really simple and I have buckets between 1500-3000.
I was done with awk earlier and it was too slow. MySQL was fast with 1 million records but when I tried to import all the data into a table it exploded. After 5 million rows the database size on the disk was 30 GB and I found it too much. The raw data takes 7GB at all.
Andrew
Ps: Thank you for the script.
Michael Field <michael.field at concepts.co.nz> wrote: Hi Andrew,
As a follow-up, here is how I would do a 'cheap-and-dirty' for the query
in you email.
Assume you have this file (in 'test.csv'):
key1,junk,1
key1,junk,100
key2,junk,1000
key2,junk,1000
key3,junk,1000
(Of cause you will have 25 fields and 2,000,000 rows!)
You could then use a script like:
cut -f 1,3 -d ',' < test.csv | sort | awk '
BEGIN {
total = 0;
key = "zzz"
FS=","
}
{
if(key == "zzz") key = $1;
if(key != $1)
{
print total " " key;
key = $1;
total = 0;
}
total += $2
}
END {
print total " " key;
}' | sort -rn
To do the following:
- cut out the fields of interest (1 and three in this case)
(This is important to limit the data being sorted on)
- sort that subset into order of the key for the totaling
- produce running totals for the different key values
(ordered total then key)
- sort the totals in the reverse order
Key performance hog will be the initial 'sort'.
It seems to work:
# ./test.sh
2000 key2
1000 key3
101 key1
Regards
Mike
---------------------------------
How would you spend $50,000 to create a more sustainable environment in Australia? Go to Yahoo!7 Answers and share your idea.
More information about the NZLUG
mailing list