Populating a database with data - performance question

Soldato | Joined: 1 Mar 2003 | Posts: 5,508 | Location: Cotham, Bristol
Ok, I have about 400 CSV-type files (about 50MB in total; they're also remote, i.e. I read each one from a URL). I read all of these files into memory, populate a couple of MySQL temporary tables from that data, then populate 8 permanent tables from those temporary tables.

Across all 8 tables there are roughly 192k rows.

All of this takes my Python code about 9.5 minutes when the database is empty, and 7 minutes when it's just doing updates. I realise this is only a high-level description, but does that sound like reasonable performance?
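
For context, the rough shape of the load is something like this - a cut-down sketch, with the URLs, driver choice and column list as placeholders rather than my real schema:

Code:
import csv
import io
import urllib.request

import mysql.connector  # assuming MySQL Connector/Python; any DB-API driver works the same way

# Placeholder URLs - in reality there are ~400 of these
urls = ["http://example.com/listings/file001.csv"]

conn = mysql.connector.connect(user="user", password="pass", database="tvdb")
cur = conn.cursor()

for url in urls:
    # read each remote CSV fully into memory first
    with urllib.request.urlopen(url) as resp:
        rows = list(csv.reader(io.TextIOWrapper(resp, encoding="utf-8")))
    # bulk-insert into the temporary table (placeholder column list)
    cur.executemany(
        "INSERT INTO TVTEMPTABLE (PROGTITLE, CHANNELID, PROGDATE) VALUES (%s, %s, %s)",
        rows,
    )

conn.commit()
cur.close()
conn.close()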
 
Soldato | Joined: 3 Jun 2005 | Posts: 3,119 | Location: The South
As Spunkey mentions, it doesn't sound too bad if that 9.5 minutes is for the whole shebang - although it's a bit of a 'how long is a piece of string' question as we don't know what you're doing in terms of importing/queries etc.

The easiest option is just to time every section of your code, as that will give a clear indication of where the majority of the time is spent, and then take it from there.
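
Something as simple as this would do it - a rough sketch, where the stage names and comments are just placeholders for whatever your script actually does:

Code:
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block of code took."""
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# wrap each stage of the script so the slow part stands out
with timed("download CSV files"):
    pass  # fetch the ~400 remote files

with timed("populate temporary tables"):
    pass  # INSERTs into the MySQL temp tables

with timed("populate permanent tables"):
    pass  # INSERT ... SELECT into the 8 permanent tables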

Edit - Can you post any of the code?
 
Soldato | Joined: 11 May 2011 | Posts: 2,901 | Location: Farnborough
As some of the other posts say, it depends on how you need your application to perform. I would say that's a reasonable time, but do you need it to be faster? If so then you possibly need to do some refactoring.
 
Soldato (OP) | Joined: 1 Mar 2003 | Posts: 5,508 | Location: Cotham, Bristol
Thanks all, need no, want maybe :p

I have been working on it for a while (I've cut 10 minutes off) and I'm just trying to trim down the time as much as I can. I already have timings in there; most of the time is inevitably spent on INSERT/SELECT statements on the large tables. The following, for example, takes 100 seconds or so to insert 120k rows into an empty table:

Code:
          sqlString = """INSERT INTO CHANNELPROGRAMME (
                             CHANNELID, PROGRAMMEID, EPISODEID,
                             RPEAT, NEWSERIES, PREMIERE, CHOICE, SUBTITLES,
                             DEAFSIGNED, STARRATING, PROGDATE, STARTTIME, ENDTIME,
                             DURATION)
                         SELECT
                             T.CHANNELID, P.PROGRAMMEID, E.EPISODEID, T.RPEAT, T.NEWSERIES,
                             T.PREMIERE, T.CHOICE, T.SUBTITLES, T.DEAFSIGNED, T.STARRATING,
                             T.PROGDATE, T.STARTTIME, T.ENDTIME, T.DURATION
                         FROM
                             TVTEMPTABLE T
                             INNER JOIN PROGRAMME P ON P.PROGTITLE=T.PROGTITLE
                             INNER JOIN CHANNELS CH ON CH.CHANNELID=T.CHANNELID
                             INNER JOIN EPISODE E ON E.SUBTITLE=T.SUBTITLE AND E.EPISODE=T.EPISODE"""

And I have already indexed the columns, so I'm not sure there's any more 'fat' to trim.
 

Pho | Soldato | Joined: 18 Oct 2002 | Posts: 9,325 | Location: Derbyshire
Can you post the execution plan for that query?

Have you put indexes on your temp tables?

You may find it runs quicker if you disable indexes on CHANNELPROGRAMME before inserting then re-enable them afterwards; indexes make inserts slower.
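
Something along these lines would cover both - just a sketch, assuming the cursor/connection from your script, that selectPart is the SELECT half of your INSERT ... SELECT, and that CHANNELPROGRAMME is MyISAM (DISABLE KEYS only skips non-unique indexes and is ignored by InnoDB):

Code:
# sketch only - 'cur'/'conn' are the open MySQL cursor/connection from the script,
# and selectPart is just the SELECT ... FROM ... half of the INSERT ... SELECT above
cur.execute("EXPLAIN " + selectPart)
for row in cur.fetchall():
    print(row)  # look for 'ALL' (full table scan) against the joined tables

# skip index maintenance during the big insert, then rebuild in one pass
cur.execute("ALTER TABLE CHANNELPROGRAMME DISABLE KEYS")
cur.execute(sqlString)  # the INSERT ... SELECT from the earlier post
cur.execute("ALTER TABLE CHANNELPROGRAMME ENABLE KEYS")
conn.commit()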
 
Soldato | Joined: 18 Oct 2002 | Posts: 3,926 | Location: SW London
"And I have already indexed the columns, so I'm not sure there's any more 'fat' to trim."

Just want to check which columns you're referring to here.
If you're talking about indexes on the CHANNELPROGRAMME table - the one you're inserting into - then you are aware that indexes actually slow down inserts?

The db has to update the index for each row you insert, which slows things down.
The only columns I can see that would benefit from indexes are the ones in your joins from the temp table, so that you can do index scans there rather than full table scans.
Even then, as the tables on the other side of the join probably won't be that big, there's limited benefit to gain from using indexes for the load.

Obviously when querying your main table the indexes will be useful though.
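
If you do decide they're worth having, it's only a few statements - a sketch with made-up index names, and skip any column that's already a primary key (CHANNELS.CHANNELID probably is):

Code:
# sketch - indexes on the lookup side of each join, so every temp-table row
# is matched with an index lookup instead of a full table scan
cur.execute("CREATE INDEX IDX_PROGRAMME_TITLE ON PROGRAMME (PROGTITLE)")
cur.execute("CREATE INDEX IDX_EPISODE_SUB_EP ON EPISODE (SUBTITLE, EPISODE)")
cur.execute("CREATE INDEX IDX_CHANNELS_ID ON CHANNELS (CHANNELID)")  # redundant if it's already the PK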
 
Soldato | Joined: 3 Jun 2005 | Posts: 3,119 | Location: The South
As others have mentioned, disable indexes during mass inserts as that'll help dramatically.

I'd also have a quick read of THIS, especially regarding 'bulk_insert_buffer_size' (if using MyISAM) and 'LOAD DATA INFILE' - both can massively help with bulk-loading data.
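
Rough idea of both, assuming you save each file locally first - the path and buffer size are just examples, bulk_insert_buffer_size is MyISAM-only, and local-infile needs to be enabled on both server and client:

Code:
# sketch only - requires local_infile enabled on the server and in the client/driver
cur.execute("SET SESSION bulk_insert_buffer_size = 256 * 1024 * 1024")  # 256MB, value is a guess
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/listings.csv'
    INTO TABLE TVTEMPTABLE
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
""")
conn.commit()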
 
Associate | Joined: 2 May 2012 | Posts: 565 | Location: /dev/null
Don't forget to factor in how long it takes to rebuild your indexes after you have disabled them for the insert... It's usually quicker to rebuild the indexes at the end than to leave them active and have them modified after each row is inserted.
 
Soldato (OP) | Joined: 1 Mar 2003 | Posts: 5,508 | Location: Cotham, Bristol
Thanks all. I haven't done anything about the indexes in the end, as I moved the script to my always-on work PC, which is significantly newer than my laptop. It now runs in about 3 minutes, so I'm happy to leave it as is :)
 

Pho | Soldato | Joined: 18 Oct 2002 | Posts: 9,325 | Location: Derbyshire
By the way, another option you may want to think about in the future is using ETL tools rather than doing it manually... basically they give you a GUI/scripting engine to let you grab data from one source (extract), do something with it (transform), and plonk it somewhere else (load) = ETL.

I've used Talend in the past to do this (the open-source community edition is more than enough), but there are lots of others out there, like Kettle.

These sorts of tools are especially useful if you need to connect to lots of different data sources without the hassle of writing your own parsers, data mappers etc. Talend, for example, can connect to a lot of stuff!
 