[nzlug] Data processing.....
Andrew Farago
andras_farago at yahoo.com.au
Wed May 30 00:19:56 NZST 2007
Hello everyone,
I am here again!
I could not sleep so I tried to do the offline data processing with
MySQL and awk as well.
There was no success with MySQL. Believe me I tried my best but it
took too much time and disk space. I think MySQL is not the right
tool to do this job.
I think the indexing can not help at all because I need to examine
every data row. In that case does not matter what do you use because
it depend only on your disk reading speed and any indexing or other
'administrative' work just waste of time and disk space. Using awk I
have got a good result within a 'reasonable' time.
There was 18,326,533 rows. My awk program could do the job 53', 54'
and 56' minutes. I did the processing three times and the only
difference was how many questions I had.
And here is my little dirty script.
==========>8===================
BEGIN {
print "Let's rock and roll!!!"
FS=","
line=0;
}
{
#Here is what I want to know.
aaa[$2]+=$11; # A sum
bbb[$7]+=1; # A count
ccc[$5]+=$19; # An other sum
ddd[$8]+=1; # And an other count
print line++; # Running numbers on the screen looks impressive :-)
}
END {
# Print out the results in a mostly human readable format.
print "aaa" > "result.txt";
print "==========================" > "result.txt";
for(idx in aaa) {
print aaa[idx]","idx > "result.txt";
}
print "==========================" > "result.txt";
print "bbb" > "result.txt";
print "==========================" > "result.txt";
for(idx in bbb) {
print bbb[idx]","idx > "result.txt";
}
print "==========================" > "result.txt";
print "ccc" > "result.txt";
print "==========================" > "result.txt";
for(idx in ccc) {
print ccc[idx]","idx > "result.txt";
}
print "==========================" > "result.txt";
print "ddd" > "result.txt";
print "==========================" > "result.txt";
for(idx in ddd) {
print ddd[idx]","idx > "result.txt";
}
print "==========================" > "result.txt";
print "This is the end!";
}
==========8<===================
Have a good night.
Andrew
On 29/05/2007, at 11:05 PM, Nevyn wrote:
> Hi Michael - all very interesting points. The only problem being - out
> of the box.
>
> If you're a coder, this approach is fine and dandy and does brilliant
> things (and is able to jump over tall buildings and is faster than a
> speeding bullet). However, for the most part, people aren't coders and
> end up paying someone else to code it for them.
>
> Enter an out of the box solution - install mysql, run a query, go
> have a coffee
> vs.
> Pay a coder, wait for coder to finish (an hour?), argue with coder
> when coder tries to charge you for 2 hours, find the code doesn't
> quite do what you want, go back to coder, get code back, run script
> and find it finally works.
>
> I know this all sounds cynical but does the time of interacting with
> someone to get something done factor into your rant?
>
> On 5/29/07, Michael Field <michael.field at concepts.co.nz> wrote:
>> Hi Nevyn,
>>
>> This is written in the style of a rant, because ex-programmers do
>> that
>> sort of thing. No malice is intended!
>>
>> <asbestos underwear>I think that your is thinking wrong </asbestos
>> underwear>
>>
>> I can justify that as follows -
>>
>> Assuming that
>> - the size of database table is at least that of the CSV file
>> - ths CSV file is too big to fit in RAM.
>> - each database index would only index one or two fields
>> - three or four different result 'group by' orderings are
>> required
>> - The data goes stale quickly (so incremental loads are not
>> possible)
>> - The result sets are only a tiny fraction of the raw data
>> size.
>> - The result sets usually includes data from all records
>> - The result sets are small enough to fit in RAM.
>>
>> All of which sound reasonable to me based on the in the original
>> post.
>>
>> Following you suggestion:
>> 1. The database (tables + indexes) could be maybe 2 or three
>> times the size of the raw data set, depending on how much
>> padding will be needed in the character fields to
>> accommodate the longest values.
>>
>> 2. The database's query parser will realize that it needs to
>> visit every record in the table, and will do a table scan
>> not an index scan. The I/O volume will be at least the
>> same size as the raw data.
>>
>> 3. If it does use the index, then odds are that the data
>> you are sum()ing will not be an index's fields, forcing
>> the row to be fetched from disk (adding to the I/O and
>> causing many disk seeks vs a sequential table scan).
>> Performance will suck unless the indexes are on different
>> disks to the data. Even then it will be sucky.
>> The same data block may be loaded multiple times depending
>> on how they are scattered around the table's natural
>> order.
>>
>> 4. The work done to generate the index will exceed that work
>> Required parse the table once to process the 'Group by' /
>> 'order by' clauses in memory.
>>
>>
>> All of which points to the most efficient process being to
>>
>> 1. Read in the raw data from the raw CSV file.
>>
>> 2. Throw away as much data as you can, keeping only the stuff
>> you are interested in to answer the query.
>>
>> 3. Process the 'sum()' and 'group by' clause in memory (using
>> a hash table). Processing multiple queries at once costs
>> you only a RAM and a smidgen of CPU
>>
>> 4. Finally sort and report the results to the user, sorting
>> in memory.
>>
>> This could all be done in a Perl script and run in less than the time
>> taken to generate one index. The pseudo code being:
>>
>> while ( read a line)
>> do
>> split the line into keyfields and totalfields on the
>> commas
>> hash = keyfield1 + ',' + keyfield2....
>> totals[hash][1] += totalfield1;
>> totals[hash][2] += totalfield2;
>> totals[hash][3] += 1;
>> ...
>> done
>> Sort the hash keys based on required critter.
>> foreach key in the (now sorted) hash
>> print the hash key, the totals[hash][1],
>> totals[hash][2]...
>>
>> Even a junior PHP program could knock that up before their mother
>> brings
>> them milk and cookies and reads him penguin stories!!!!
>>
>> Mike
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: nzlug-bounces at linux.net.nz [mailto:nzlug-
>> bounces at linux.net.nz] On
>> Behalf Of Nevyn
>> Sent: Tuesday, 29 May 2007 2:32 p.m.
>> To: NZLUG Mailing List
>> Subject: Re: [nzlug] Data processing.....
>>
>> On 5/28/07, Cliff Pratt <enkidu at cliffp.com> wrote:
>> > Andras Farago wrote:
>> > > Hi guys, What software can you recommend to process a database/
>> txt
>> > > file/whatever with approx 20 million rows, 25 fields in each
>> row? I
>> > > need to perform only very simple queries like "SELECT a, Sum
>> (b) FROM
>> > > data GROUP BY a ORDER BY Sum(b) DESC;". Another important
>> option is
>> > > the time, I can't wait all day to get the result.
>> > >
>> > If you loaded it into a database you could build an index on it and
>> get
>> > quick results that way. Though loading it and building the index
>> could
>> > take some time.
>> >
>> > Cheers,
>> >
>> > Cliff
>>
>> I'd have to say that the best approach here would be chose a platform
>> whether it be postgre or mysql and learn the little tips and tricks
>> for making things fast.
>>
>> Of course a database is going to take up more room - there's more
>> data
>> to store. Loading things generally take longer too initially just
>> because there's more data to generate. However, once all that data
>> has
>> been generated and stored, you've now got speedy queries due to that
>> data that took oh so long to generate.
>>
>> You can turn off indexing while loading in the data but the indexing
>> has to be done at some point.
>>
>> _______________________________________________
>> NZLUG mailing list NZLUG at linux.net.nz
>> http://www.linux.net.nz/cgi-bin/mailman/listinfo/nzlug
>> Computer Concepts Limited
>> 25 Leslie Hills Drive
>> PO Box 8744 Riccarton
>> Christchurch, New Zealand
>
> _______________________________________________
> NZLUG mailing list NZLUG at linux.net.nz
> http://www.linux.net.nz/cgi-bin/mailman/listinfo/nzlug
More information about the NZLUG
mailing list