[nzlug] Data processing.....
Nevyn
nevynh at gmail.com
Tue May 29 23:05:51 NZST 2007
Hi Michael - all very interesting points. The only problem being - out
of the box.
If you're a coder, this approach is fine and dandy and does brilliant
things (and is able to jump over tall buildings and is faster than a
speeding bullet). However, for the most part, people aren't coders and
end up paying someone else to code it for them.
Enter an out of the box solution - install mysql, run a query, go have a coffee
vs.
Pay a coder, wait for coder to finish (an hour?), argue with coder
when coder tries to charge you for 2 hours, find the code doesn't
quite do what you want, go back to coder, get code back, run script
and find it finally works.
I know this all sounds cynical but does the time of interacting with
someone to get something done factor into your rant?
On 5/29/07, Michael Field <michael.field at concepts.co.nz> wrote:
> Hi Nevyn,
>
> This is written in the style of a rant, because ex-programmers do that
> sort of thing. No malice is intended!
>
> <asbestos underwear>I think that your is thinking wrong </asbestos
> underwear>
>
> I can justify that as follows -
>
> Assuming that
> - the size of database table is at least that of the CSV file
> - ths CSV file is too big to fit in RAM.
> - each database index would only index one or two fields
> - three or four different result 'group by' orderings are
> required
> - The data goes stale quickly (so incremental loads are not
> possible)
> - The result sets are only a tiny fraction of the raw data size.
> - The result sets usually includes data from all records
> - The result sets are small enough to fit in RAM.
>
> All of which sound reasonable to me based on the in the original post.
>
> Following you suggestion:
> 1. The database (tables + indexes) could be maybe 2 or three
> times the size of the raw data set, depending on how much
> padding will be needed in the character fields to
> accommodate the longest values.
>
> 2. The database's query parser will realize that it needs to
> visit every record in the table, and will do a table scan
> not an index scan. The I/O volume will be at least the
> same size as the raw data.
>
> 3. If it does use the index, then odds are that the data
> you are sum()ing will not be an index's fields, forcing
> the row to be fetched from disk (adding to the I/O and
> causing many disk seeks vs a sequential table scan).
> Performance will suck unless the indexes are on different
> disks to the data. Even then it will be sucky.
> The same data block may be loaded multiple times depending
> on how they are scattered around the table's natural order.
>
> 4. The work done to generate the index will exceed that work
> Required parse the table once to process the 'Group by' /
> 'order by' clauses in memory.
>
>
> All of which points to the most efficient process being to
>
> 1. Read in the raw data from the raw CSV file.
>
> 2. Throw away as much data as you can, keeping only the stuff
> you are interested in to answer the query.
>
> 3. Process the 'sum()' and 'group by' clause in memory (using
> a hash table). Processing multiple queries at once costs
> you only a RAM and a smidgen of CPU
>
> 4. Finally sort and report the results to the user, sorting
> in memory.
>
> This could all be done in a Perl script and run in less than the time
> taken to generate one index. The pseudo code being:
>
> while ( read a line)
> do
> split the line into keyfields and totalfields on the
> commas
> hash = keyfield1 + ',' + keyfield2....
> totals[hash][1] += totalfield1;
> totals[hash][2] += totalfield2;
> totals[hash][3] += 1;
> ...
> done
> Sort the hash keys based on required critter.
> foreach key in the (now sorted) hash
> print the hash key, the totals[hash][1],
> totals[hash][2]...
>
> Even a junior PHP program could knock that up before their mother brings
> them milk and cookies and reads him penguin stories!!!!
>
> Mike
>
>
>
>
>
> -----Original Message-----
> From: nzlug-bounces at linux.net.nz [mailto:nzlug-bounces at linux.net.nz] On
> Behalf Of Nevyn
> Sent: Tuesday, 29 May 2007 2:32 p.m.
> To: NZLUG Mailing List
> Subject: Re: [nzlug] Data processing.....
>
> On 5/28/07, Cliff Pratt <enkidu at cliffp.com> wrote:
> > Andras Farago wrote:
> > > Hi guys, What software can you recommend to process a database/txt
> > > file/whatever with approx 20 million rows, 25 fields in each row? I
> > > need to perform only very simple queries like "SELECT a, Sum(b) FROM
> > > data GROUP BY a ORDER BY Sum(b) DESC;". Another important option is
> > > the time, I can't wait all day to get the result.
> > >
> > If you loaded it into a database you could build an index on it and
> get
> > quick results that way. Though loading it and building the index could
> > take some time.
> >
> > Cheers,
> >
> > Cliff
>
> I'd have to say that the best approach here would be chose a platform
> whether it be postgre or mysql and learn the little tips and tricks
> for making things fast.
>
> Of course a database is going to take up more room - there's more data
> to store. Loading things generally take longer too initially just
> because there's more data to generate. However, once all that data has
> been generated and stored, you've now got speedy queries due to that
> data that took oh so long to generate.
>
> You can turn off indexing while loading in the data but the indexing
> has to be done at some point.
>
> _______________________________________________
> NZLUG mailing list NZLUG at linux.net.nz
> http://www.linux.net.nz/cgi-bin/mailman/listinfo/nzlug
> Computer Concepts Limited
> 25 Leslie Hills Drive
> PO Box 8744 Riccarton
> Christchurch, New Zealand
More information about the NZLUG
mailing list