[nzlug] Data processing.....
Michael Field
michael.field at concepts.co.nz
Tue May 29 16:00:22 NZST 2007
Hi Nevyn,
This is written in the style of a rant, because ex-programmers do that
sort of thing. No malice is intended!
<asbestos underwear>I think that your is thinking wrong </asbestos
underwear>
I can justify that as follows -
Assuming that
- the size of database table is at least that of the CSV file
- ths CSV file is too big to fit in RAM.
- each database index would only index one or two fields
- three or four different result 'group by' orderings are
required
- The data goes stale quickly (so incremental loads are not
possible)
- The result sets are only a tiny fraction of the raw data size.
- The result sets usually includes data from all records
- The result sets are small enough to fit in RAM.
All of which sound reasonable to me based on the in the original post.
Following you suggestion:
1. The database (tables + indexes) could be maybe 2 or three
times the size of the raw data set, depending on how much
padding will be needed in the character fields to
accommodate the longest values.
2. The database's query parser will realize that it needs to
visit every record in the table, and will do a table scan
not an index scan. The I/O volume will be at least the
same size as the raw data.
3. If it does use the index, then odds are that the data
you are sum()ing will not be an index's fields, forcing
the row to be fetched from disk (adding to the I/O and
causing many disk seeks vs a sequential table scan).
Performance will suck unless the indexes are on different
disks to the data. Even then it will be sucky.
The same data block may be loaded multiple times depending
on how they are scattered around the table's natural order.
4. The work done to generate the index will exceed that work
Required parse the table once to process the 'Group by' /
'order by' clauses in memory.
All of which points to the most efficient process being to
1. Read in the raw data from the raw CSV file.
2. Throw away as much data as you can, keeping only the stuff
you are interested in to answer the query.
3. Process the 'sum()' and 'group by' clause in memory (using
a hash table). Processing multiple queries at once costs
you only a RAM and a smidgen of CPU
4. Finally sort and report the results to the user, sorting
in memory.
This could all be done in a Perl script and run in less than the time
taken to generate one index. The pseudo code being:
while ( read a line)
do
split the line into keyfields and totalfields on the
commas
hash = keyfield1 + ',' + keyfield2....
totals[hash][1] += totalfield1;
totals[hash][2] += totalfield2;
totals[hash][3] += 1;
...
done
Sort the hash keys based on required critter.
foreach key in the (now sorted) hash
print the hash key, the totals[hash][1],
totals[hash][2]...
Even a junior PHP program could knock that up before their mother brings
them milk and cookies and reads him penguin stories!!!!
Mike
-----Original Message-----
From: nzlug-bounces at linux.net.nz [mailto:nzlug-bounces at linux.net.nz] On
Behalf Of Nevyn
Sent: Tuesday, 29 May 2007 2:32 p.m.
To: NZLUG Mailing List
Subject: Re: [nzlug] Data processing.....
On 5/28/07, Cliff Pratt <enkidu at cliffp.com> wrote:
> Andras Farago wrote:
> > Hi guys, What software can you recommend to process a database/txt
> > file/whatever with approx 20 million rows, 25 fields in each row? I
> > need to perform only very simple queries like "SELECT a, Sum(b) FROM
> > data GROUP BY a ORDER BY Sum(b) DESC;". Another important option is
> > the time, I can't wait all day to get the result.
> >
> If you loaded it into a database you could build an index on it and
get
> quick results that way. Though loading it and building the index could
> take some time.
>
> Cheers,
>
> Cliff
I'd have to say that the best approach here would be chose a platform
whether it be postgre or mysql and learn the little tips and tricks
for making things fast.
Of course a database is going to take up more room - there's more data
to store. Loading things generally take longer too initially just
because there's more data to generate. However, once all that data has
been generated and stored, you've now got speedy queries due to that
data that took oh so long to generate.
You can turn off indexing while loading in the data but the indexing
has to be done at some point.
_______________________________________________
NZLUG mailing list NZLUG at linux.net.nz
http://www.linux.net.nz/cgi-bin/mailman/listinfo/nzlug
Computer Concepts Limited
25 Leslie Hills Drive
PO Box 8744 Riccarton
Christchurch, New Zealand
Phone: +64-3-348-2500
Fax: +64-3-343-7569
Notice of confidential information:
The information contained in this e-mail message is
confidential information and may also be legally privileged,
intended only for the individual or entity named above.
If you are not the intended recipient you are hereby
notified that any use, review, dissemination, distribution
or copying of this document is strictly prohibited.
If you have received this document in error, please
immediately notify the sender by telephone and destroy the
message. Thank you.
All prices quoted in this email are exclusive of GST & Freight and
valid only while stocks last.
More information about the NZLUG
mailing list