Populating a database with data - performance question

OK, I have about 400 CSV-type files (about 50MB in total; they're also remote, i.e. I read each one from a URL). I read all of these files into memory, populate a couple of MySQL temporary tables from this data, then populate 8 permanent tables from those temporary tables.

Across all 8 tables there are roughly 192k rows.

All of this takes my Python code about 9.5 minutes when the database is empty, and 7 minutes when it's just doing updates. I realise this is only a high-level description, but does that sound like reasonable performance?
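
Roughly, the pipeline looks something like this (a simplified sketch assuming MySQL Connector/Python; the connection details, URLs and abbreviated temp-table schema are placeholders rather than the real script):

Code:
    import csv
    import io
    import urllib.request

    import mysql.connector  # assumes the MySQL Connector/Python driver

    # Placeholder connection details
    conn = mysql.connector.connect(host="localhost", user="tvuser",
                                   password="secret", database="tvguide")
    cur = conn.cursor()

    # Temporary table only exists for this connection (abbreviated column list)
    cur.execute("""CREATE TEMPORARY TABLE TVTEMPTABLE (
                       CHANNELID INT, PROGTITLE VARCHAR(255), SUBTITLE VARCHAR(255),
                       EPISODE VARCHAR(255), PROGDATE DATE, STARTTIME TIME,
                       ENDTIME TIME, DURATION INT)""")

    urls = ["http://example.com/listings/file%03d.csv" % n for n in range(400)]  # placeholder URLs

    for url in urls:
        with urllib.request.urlopen(url) as resp:
            rows = list(csv.reader(io.TextIOWrapper(resp, encoding="utf-8")))
        # One multi-row insert per file is much faster than one INSERT per row
        cur.executemany("INSERT INTO TVTEMPTABLE VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
                        rows)

    conn.commit()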
 
As Spunkey mentions, it doesn't sound too bad if that 9.5 minutes is for the whole shebang - although it's a bit of a 'how long is a piece of string' question as we don't know what you're doing in terms of importing/queries etc.

The easiest option is just to time every section of your code, as that will give a clear indication of where the majority of the time is spent, and then take it from there.
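
Something as simple as a context manager around each stage works; this is just an illustrative sketch (the stage bodies are stubbed out here):

Code:
    import time
    from contextlib import contextmanager

    @contextmanager
    def timed(label):
        """Print how long the wrapped block of the script took."""
        start = time.perf_counter()
        yield
        print("%-25s %6.1f s" % (label, time.perf_counter() - start))

    # Wrap each stage of the import script (bodies omitted in this sketch)
    with timed("download CSVs"):
        pass  # ...fetch the 400 remote files...
    with timed("populate temp tables"):
        pass  # ...executemany into the temporary tables...
    with timed("populate main tables"):
        pass  # ...the INSERT ... SELECT statements...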

Edit - Can you post any of the code?
 
As some of the other posts say, it depends on how you need your application to perform. I would say that's a reasonable time, but do you need it to be faster? If so, then you possibly need to do some refactoring.
 
Thanks all, need no, want maybe :p

I have been working on it a while (I've cut 10 minutes off) and I'm just trying to trim down the time as much as I can. I already have timings in there, most of the time is inevitably spent on insert/select statements on the large tables. The following for example takes 100 seconds or so to insert 120k rows into an empty table

Code:
          sqlString = """INSERT INTO CHANNELPROGRAMME (
                             CHANNELID, PROGRAMMEID, EPISODEID,
                             RPEAT, NEWSERIES, PREMIERE, CHOICE, SUBTITLES,
                             DEAFSIGNED, STARRATING, PROGDATE, STARTTIME, ENDTIME,
                             DURATION)
                         SELECT
                             T.CHANNELID, P.PROGRAMMEID, E.EPISODEID, T.RPEAT, T.NEWSERIES,
                             T.PREMIERE, T.CHOICE, T.SUBTITLES, T.DEAFSIGNED, T.STARRATING,
                             T.PROGDATE, T.STARTTIME, T.ENDTIME, T.DURATION
                         FROM
                             TVTEMPTABLE T
                             INNER JOIN PROGRAMME P ON P.PROGTITLE=T.PROGTITLE
                             INNER JOIN CHANNELS CH ON CH.CHANNELID=T.CHANNELID
                             INNER JOIN EPISODE E ON E.SUBTITLE=T.SUBTITLE AND E.EPISODE=T.EPISODE"""

And I have already indexed the columns, so I'm not sure there's any more 'fat' to trim.
 
Can you post an execution plan of what that query looks like?

Have you put indexes on your temp tables?

You may find it runs quicker if you disable indexes on CHANNELPROGRAMME before inserting then re-enable them afterwards; indexes make inserts slower.
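
If the table is MyISAM, that could look something like this around the existing statement (for InnoDB you'd drop and re-create the secondary indexes instead); `cur` and `sqlString` are assumed to be the cursor and the INSERT ... SELECT from the earlier post:

Code:
    # Assumes `cur` is an open MySQL cursor and CHANNELPROGRAMME is MyISAM
    cur.execute("ALTER TABLE CHANNELPROGRAMME DISABLE KEYS")  # stop maintaining non-unique indexes
    cur.execute(sqlString)                                    # the big INSERT ... SELECT shown above
    cur.execute("ALTER TABLE CHANNELPROGRAMME ENABLE KEYS")   # rebuild the indexes in one pass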
 
And I have already indexed the columns, so I'm not sure there's any more 'fat' to trim.

Just want to check which columns you're referring to here.
If you're talking about indexes on the CHANNELPROGRAMME table - the one you're inserting into - then you are aware that indexes actually slow down inserts?

The db has to update the index for each row you insert, which slows things down.
The only columns I can see that would benefit from indexes are the ones in your joins from the temp table, so that you can do index scans there rather than full table scans.
Even then, as the tables on the other side of the join probably won't be that big, there's limited benefit to gain from using indexes for the load.

Obviously when querying your main table the indexes will be useful though.
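
To illustrate, for the query posted above that would mean indexes like these on the lookup tables' join columns (the index names here are made up, and CHANNELID is presumably a primary key already):

Code:
    # Index the join columns on the lookup-table side so each temp-table row is
    # matched by an index lookup rather than a scan (index names are placeholders)
    cur.execute("CREATE INDEX IX_PROGRAMME_PROGTITLE ON PROGRAMME (PROGTITLE)")
    cur.execute("CREATE INDEX IX_EPISODE_SUB_EP ON EPISODE (SUBTITLE, EPISODE)")
    # CHANNELS.CHANNELID should already be the primary key, so it needs no extra index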
 
Err, nope, I was unaware of that. In the SQL you can see I have indexed all the columns in the joins, not just the ones on the temporary table.
 
As others have mentioned, disable indexes during mass inserts as that'll help dramatically.

I'd also have a quick read of THIS, especially regarding 'bulk_insert_buffer_size' (if using MyISAM) and 'LOAD DATA INFILE' - both can massively help when bulk loading data.
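
As a rough sketch of what those two look like from Python, assuming MyISAM tables and that `local_infile` is enabled on both server and client (the file path and buffer size are just example values):

Code:
    # Raise the MyISAM bulk-insert buffer for this session (example value: 64MB)
    cur.execute("SET SESSION bulk_insert_buffer_size = 64 * 1024 * 1024")

    # Write the downloaded rows out to one local file, then let the server parse it;
    # LOAD DATA is generally far faster than issuing many individual INSERTs.
    cur.execute("""LOAD DATA LOCAL INFILE '/tmp/listings.csv'
                   INTO TABLE TVTEMPTABLE
                   FIELDS TERMINATED BY ',' ENCLOSED BY '"'
                   LINES TERMINATED BY '\\n'""")
    conn.commit()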
 
Don't forget to factor in how long it takes to rebuild your indexes after you have disabled them for the insert...
It's usually quicker to rebuild the index at the end than to leave it active and have it modified after each row is inserted.
 
Thanks all, I haven't done anything about the indexes in the end as I moved the script to my always on work PC which is significantly newer than my laptop. It now runs in about 3 minutes, so I'm happy to leave it as is :)
 
By the way, another option you may want to think of in the future is using ETL tools rather than doing it manually... basically they give you a GUI/scripting engine to let you grab data from one source (extract), do something with it (transform), and plonk it somewhere else (load) = ETL.

I've used Talend in the past to do this (the open-source community edition is more than enough), but there are lots of others out there, like Kettle.

These sorts of tools are especially useful if you need to connect to lots of different data-sources without the hassle of writing your own parsers, data mappers etc. Talend for example can connect to a lot of stuff!
 