[nzlug] Data processing.....
Michael Field
michael.field at concepts.co.nz
Mon May 28 14:57:05 NZST 2007
Hi Andrew,
You would be best off to do a "summarize each file then merge" process
for each file.
The limit is set by the time taken to read the files from disks -
8MB/sec sounds reasonable for a modern PC - 7GB is approx 15*60 * 8.
If you want a 'really fast' solution, email me direct - I would be able
to do a bit of ad-hoc 'C' or Perl programming for you if you are able to
give requirements (at our usual rates).
I should be able to answer all queries in one pass of the data files
(RAM willing), so I guess under 15 minutes for the size you are talking
about. If you want anything less, then you will need to look at what
hardware you are running it on...
But, it will cost... the mantra of S/W development is something "good,
cheap, fast - pick any two"
Mike
-----Original Message-----
From: nzlug-bounces at linux.net.nz [mailto:nzlug-bounces at linux.net.nz] On
Behalf Of Andras Farago
Sent: Monday, 28 May 2007 2:35 p.m.
To: NZLUG Mailing List
Subject: RE: [nzlug] Data processing.....
The data are in five CVS files currently but it can be converted to
anything I want the fastest solution.
The queries are really simple and I have buckets between 1500-3000.
I was done with "awk" earlier and it was too slow. MySQL was fast with
1 million records but when I tried to import all the data into a table
it exploded. After 5 million rows the database size on the disk was 30
GB and I found it too much. The raw data takes 7GB at all.
Andrew
Ps: Thank you for the script.
Michael Field <michael.field at concepts.co.nz> wrote: Hi Andrew,
As a follow-up, here is how I would do a 'cheap-and-dirty' for the query
in you email.
Assume you have this file (in 'test.csv'):
key1,junk,1
key1,junk,100
key2,junk,1000
key2,junk,1000
key3,junk,1000
(Of cause you will have 25 fields and 2,000,000 rows!)
You could then use a script like:
cut -f 1,3 -d ',' < test.csv | sort | awk '
BEGIN {
total = 0;
key = "zzz"
FS=","
}
{
if(key == "zzz") key = $1;
if(key != $1)
{
print total " " key;
key = $1;
total = 0;
}
total += $2
}
END {
print total " " key;
}' | sort -rn
To do the following:
- cut out the fields of interest (1 and three in this case)
(This is important to limit the data being sorted on)
- sort that subset into order of the key for the totaling
- produce running totals for the different key values
(ordered total then key)
- sort the totals in the reverse order
Key performance hog will be the initial 'sort'.
It seems to work:
# ./test.sh
2000 key2
1000 key3
101 key1
Regards
Mike
---------------------------------
How would you spend $50,000 to create a more sustainable environment in
Australia? Go to Yahoo!7 Answers and share your idea.
_______________________________________________
NZLUG mailing list NZLUG at linux.net.nz
http://www.linux.net.nz/cgi-bin/mailman/listinfo/nzlug
Computer Concepts Limited
25 Leslie Hills Drive
PO Box 8744 Riccarton
Christchurch, New Zealand
Phone: +64-3-348-2500
Fax: +64-3-343-7569
Notice of confidential information:
The information contained in this e-mail message is
confidential information and may also be legally privileged,
intended only for the individual or entity named above.
If you are not the intended recipient you are hereby
notified that any use, review, dissemination, distribution
or copying of this document is strictly prohibited.
If you have received this document in error, please
immediately notify the sender by telephone and destroy the
message. Thank you.
All prices quoted in this email are exclusive of GST & Freight and
valid only while stocks last.
More information about the NZLUG
mailing list