Efficient database querying/updating

Hi Guys,

I have a set of 400 csv files (average of 350 lines per file) that I use to insert data into a MySQL database (via python). I'm trying to improve the speed with which I can update this database.

The process works by reading through the contents of one file at a time, storing each row in a list. I originally had it inserting into the tables a line at a time, but this was very slow, so I updated the code to build a list of rows and insert them all at once (every row for the current file).

The problem I now have is that before I insert a row into a table I have to check whether it already exists, so I end up querying for every row of every file, which is obviously slowing things down somewhat. (At 5 select queries per line that works out to be a minimum of 700,000 select queries, not to mention the time taken to insert 350 rows into a table at a time.)

Anyway, here's the code to give an idea of what's going on.

Python:
def InsertChannelProgrammes(self,channelID,listDict):
   chanProgsqlString = """INSERT INTO CHANNELPROGRAMME(CHANNELID,PROGRAMMEID,EPISODEID,`REPEAT`,NEWSERIES,PREMIERE,CHOICE,
                       SUBTITLES,DEAFSIGNED,DATE,STARTTIME,ENDTIME,DURATION) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
   progsqlString = """INSERT INTO PROGRAMME(GENREID,PROGTITLE,YEAR,DIRECTOR,PERFORMERS,FILM,WIDESCREEN,
                   BNW,STARRATING,CERTIFICATE,DESCRIPTION) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
   episodesqlString = """INSERT INTO EPISODE(PROGRAMMEID,SUBTITLE,EPISODE,DESCRIPTION) VALUES(%s,%s,%s,%s)"""
   genresqlString = """INSERT INTO GENRE (GENRENAME) VALUES(%s)"""

   chanProgcolumns = []
   progcolumns = []
   episodecolumns = []
   genrecolumns = []
   for row in listDict:
      #Insert Genres into GENRE table
      if row["GENRE"] != None:
         #Check to see if genre exists
         self._cursor.execute("SELECT * FROM `GENRE` WHERE GENRENAME=\""+row["GENRE"]+"\"")
         genResult = self._cursor.fetchone()
         if genResult == None:
            #New Genre add row for inserting
            genValuesTuple = (row["GENRE"],)
            genrecolumns.append(genValuesTuple)
      #Insert Programmes into PROGRAMMES table
      #Check to see if programmes exists
      self._cursor.execute("""SELECT PROGRAMMEID, PROGTITLE FROM PROGRAMME WHERE PROGTITLE=%s"""
                              ,(row["PROGTITLE"]))
      progResult = self._cursor.fetchone()
      if progResult == None:
         #New Programme add row for inserting
         description = ""
         if row["SUBTITLE"] == "" and row["EPISODE"] == "":
            description = row["DESCRIPTION"]
         progValuesTuple =  (self.GetGenreId(row["GENRE"]),row["PROGTITLE"],row["YEAR"],row["DIRECTOR"],row["PERFORMERS"],
                         row["FILM"],row["WIDESCREEN"],row["BNW"],row["STARRATING"],
                         row["CERTIFICATE"],description)
         progcolumns.append(progValuesTuple)
      #Insert Episodes into EPISODES table
      #Check to see if episode exists
      self._cursor.execute("""SELECT * FROM EPISODE WHERE SUBTITLE=%s and EPISODE=%s""",
                              (row["SUBTITLE"],row["EPISODE"]))
      epResult = self._cursor.fetchone()
      if epResult == None:
         #New Episode add row for inserting
         episodeValuesTuple = (self.GetProgId(row["PROGTITLE"]),row["SUBTITLE"],row["EPISODE"],row["DESCRIPTION"])
         episodecolumns.append(episodeValuesTuple)
      #Insert Programmes into CHANNELPROGRAMME table
      chanProgvaluesTuple = (channelID,self.GetProgId(row["PROGTITLE"]),self.GetEpId(row["SUBTITLE"],row["EPISODE"]),
                    row["REPEAT"],row["NEWSERIES"],row["PREMIERE"],row["CHOICE"],
                    row["SUBTITLES"],row["DEAFSIGNED"],self.FormatDate(row["DATE"]),
                    row["STARTTIME"],row["ENDTIME"],row["DURATION"])
      chanProgcolumns.append(chanProgvaluesTuple)
   if len(genrecolumns) > 0:
      self._cursor.executemany(genresqlString,genrecolumns)
   if len(progcolumns) > 0:
      self._cursor.executemany(progsqlString,progcolumns)
   if len(episodecolumns) > 0:
      self._cursor.executemany(episodesqlString,episodecolumns)
   self._cursor.executemany(chanProgsqlString,chanProgcolumns)

Is there something database wise I can do to the tables so I don't have to do quite so many queries?

"Uniqueness" is defined in the tables like so

GENRE: GENRENAME
PROGRAMME: PROGTITLE
EPISODE: SUBTITLE + EPISODE

Thanks,
Paul


Edit: The tables I have defined are

CHANNELS

CHANNELID | CHANNELNAME

GENRE

GENREID | GENRENAME

EPISODE

EPISODEID | SUBTITLE | EPISODE

PROGRAMME

PROGRAMMEID | GENREID | PROGTITLE | YEAR | DIRECTOR | PERFORMERS | PREMIERE | FILM | REPEAT | SUBTITLES | WIDESCREEN | NEWSERIES | DEAFSIGNED | BNW | STARRATING | CERTIFICATE | DESCRIPTION | CHOICE

CHANNELPROGRAMMES

CHANNELPROGRAMMEID | CHANNELID | PROGRAMMEID | EPISODEID | DATE | STARTTIME | ENDTIME | DURATION
 
Do you have to load the data programmatically in Python? Can you not use the built-in import / mass-load functionality in MySQL?
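For instance, something along these lines (an untested sketch: the staging table, file name and connection details are made up, and LOAD DATA LOCAL has to be enabled on both the server and the client):

Python:
# Untested sketch: bulk-load one CSV straight into a staging table with LOAD DATA.
# Assumes a MySQLdb/pymysql-style driver, local_infile enabled on server and client,
# and a hypothetical STAGING table whose columns match the CSV column order.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="pass",
                       db="tvguide", local_infile=1)
cur = conn.cursor()
cur.execute("""
    LOAD DATA LOCAL INFILE 'listings_001.csv'
    INTO TABLE STAGING
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
""")
conn.commit()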

If you do have to insert from Python then you shouldn't need to check whether the data exists first; there should be a primary key on the table which enforces the uniqueness of the records. All you then have to do is insert the row and make sure that you get a successful return code from MySQL.
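A rough sketch of that idea, assuming a MySQLdb/pymysql-style driver where a duplicate key raises IntegrityError, and a unique key on GENRE.GENRENAME:

Python:
# Sketch: rely on a unique key instead of a SELECT-before-INSERT.
# A duplicate row makes the INSERT fail with IntegrityError, which we simply
# treat as "already there".
import MySQLdb

def insert_genre(cursor, genre_name):
    try:
        cursor.execute("INSERT INTO GENRE (GENRENAME) VALUES (%s)", (genre_name,))
        return True   # row was new and has been inserted
    except MySQLdb.IntegrityError:
        return False  # row already existed, nothing to do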
 
What database are you using, and are you using indexes?

Is genrename always a unique field, and how many fields does the table have?
 
Do you have to load the data programmatically in Python? Can you not use the built-in import / mass-load functionality in MySQL?

If you do have to insert from Python then you shouldn't need to check whether the data exists first; there should be a primary key on the table which enforces the uniqueness of the records. All you then have to do is insert the row and make sure that you get a successful return code from MySQL.

The reading of the contents of each file is actually really quick, it's purely the querying of the database which is taking a long time. Plus I've looked at the MySQL documentation for the mass load thing, and I really don't get it. In fact I find most of the online MySQL docs confusing :o

Each table does have a primary key which is a unique integer id, but I also want the columns I mentioned above to be unique. I just don't know how to make text based columns unique. Also what actually happens if something has to be unique? Does it just ignore the insert if that row already exists?

What database are you using, and are you using indexes?

Is genrename always a unique field, and how many fields does the table have?

I'm using MySQL

I've updated the OP with the table structure I'm using
 
Each table does have a primary key which is a unique integer id, but I also want the columns I mentioned above to be unique. I just don't know how to make text based columns unique. Also what actually happens if something has to be unique? Does it just ignore the insert if that row already exists?
You should be able to create a unique index on the text columns.

If you try to insert something which violates the unique index then the insert will fail; you just need to trap and deal with the error in the Python code.
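For the uniqueness rules listed in the first post, something like this should do it (a sketch; it assumes the columns are VARCHARs, since a TEXT column would need a key prefix length, and that the tables don't already contain duplicates):

Python:
# Sketch: add unique indexes matching the uniqueness rules from the first post.
# Run once; MySQL will refuse to add an index if the table already holds duplicates.
def add_unique_indexes(cursor):
    cursor.execute("ALTER TABLE GENRE ADD UNIQUE KEY uq_genre (GENRENAME)")
    cursor.execute("ALTER TABLE PROGRAMME ADD UNIQUE KEY uq_prog (PROGTITLE)")
    cursor.execute("ALTER TABLE EPISODE ADD UNIQUE KEY uq_episode (SUBTITLE, EPISODE)")

Once the indexes are in place, a duplicate insert raises the error to trap (as in the sketch a couple of posts up), or INSERT IGNORE can be used to skip it silently.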
 
Thanks, I'll give it a go.

Also, as part of the inserts I get some ID values using another select, for example:

Python:
      chanProgvaluesTuple = (channelID,self.GetProgId(row["PROGTITLE"]),self.GetEpId(row["SUBTITLE"],row["EPISODE"]),
                    row["REPEAT"],row["NEWSERIES"],row["PREMIERE"],row["CHOICE"],
                    row["SUBTITLES"],row["DEAFSIGNED"],self.FormatDate(row["DATE"]),
                    row["STARTTIME"],row["ENDTIME"],row["DURATION"])

GetProgId and GetEpId are two select queries; is this a good way of getting the IDs to link between the tables?
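For illustration, a GetProgId-style lookup presumably boils down to one SELECT per row, something like this hypothetical sketch (not the actual helper):

Python:
# Hypothetical reconstruction of a GetProgId-style helper: one SELECT per row
# to fetch the id for a given programme title (not the actual code).
def GetProgId(cursor, progtitle):
    cursor.execute("SELECT PROGRAMMEID FROM PROGRAMME WHERE PROGTITLE=%s",
                   (progtitle,))
    result = cursor.fetchone()
    return result[0] if result else None

It works, but it is an extra round trip to the database for every single row.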
 
I don't follow why you're having to do any selects as part of the inserts. It seems to me that you've got a data model that ticks all the theoretical boxes but doesn't actually work with the source data you have. Normalisation to the Nth degree is fine but the database design has to work with the application and data.

There's nothing wrong with using a bit more grunt in the database layer if it simplifies the application flow.
 
Perhaps there's something I'm not understanding, but when I'm inserting into the CHANNELPROGRAMME table, for example, I have 3 foreign keys: CHANNELID, PROGRAMMEID and EPISODEID.

Now I know the CHANNELID already so that's fine, but by the time I come around to inserting into the CHANNELPROGRAMME table the relevant programme and episode already exist in the PROGRAMME and EPISODE tables, so to make sure I'm referencing the correct ones I use a select to get their IDs.

As far as checking for duplication goes: say the database is empty; fine, I don't have to check for any duplicates, and it populates the programme, episode and channelprogramme tables without any problems. The next day I want to update the database with the next lot of programmes. The channelprogramme table is emptied, as it represents every programme that is on every channel over a two-week period and so can contain repeats etc.; the programme and episode tables, however, should have unique entries. So for example the programme table could have a progId of 1 which represents CSI: NY. That can only occur once. The episode table could have an epId of 1, a progId of 1 and an episode title of "1/24 series 1", and that can only occur once.

So when I'm scanning through each entry I need to know whether or not an instance of that programme and that episode already exists, and if not, insert it. That's why I'm doing the selects before the inserts in that scenario. Is this not the way it should be done? If not, what should I be doing?
 
First thing that hits me - you're doing this serially by pulling/pushing data row-by-row. Instead use the set based power of the database and minimise row-based processing (preferably eliminate it). This is the reason your process is slow.

Use an import, validate, commit pattern (sketched below the list) by:

1. Opening a transaction
2. Importing the entire data set into a temporary table, say one called TEMP_IMPORT, instead of using a Python list to hold the values. Don't validate the data against the other tables yet.
3. Process/validate the TEMP_IMPORT table using set-based processing or stored procedures. This should cope with duplicates (see below).
4. When all the updates are made, commit the transaction.
5. If a problem occurred, roll back the transaction, which undoes the changes across all the tables in one go.
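A rough sketch of steps 1-5, assuming a MySQLdb/pymysql-style driver, InnoDB tables (MyISAM will not roll back) and an invented column list for TEMP_IMPORT:

Python:
# Sketch of the import/validate/commit flow above. The TEMP_IMPORT columns and
# connection details are invented for illustration.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="tvguide")
cur = conn.cursor()

# Stand-in for the rows read out of the CSV files.
rows = [("CSI: NY", "1/24 series 1", "1", "Crime Drama")]

try:
    # 1. one transaction around the whole load
    cur.execute("START TRANSACTION")

    # 2. dump every CSV row into the staging table, unvalidated
    cur.executemany(
        "INSERT INTO TEMP_IMPORT (PROGTITLE, SUBTITLE, EPISODE, GENRENAME) "
        "VALUES (%s, %s, %s, %s)", rows)

    # 3. set-based validation/population of the real tables goes here
    #    (INSERT ... SELECT ... WHERE NOT EXISTS, as in the example further down)

    # 4. everything worked, so make it permanent
    conn.commit()
except Exception:
    # 5. any problem: roll back and nothing in any table has changed
    conn.rollback()
    raise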

Now if you want to prompt the user as to what to do with duplicates, then simply offload the duplicates into a DUPLICATES table, and after processing ask what the user wants to do with them by presenting them as a full list that they can tick or cross.
Then simply update the DUPLICATES table with the user's input.

The great thing with a duplicates table is that it simplifies the SQL, and you can set off another import while the duplicates table continues to build. The user can take their time to walk through the duplicates as the data is loaded in the background.
A second process can then be kicked off to apply the user's actions for the duplicates in one go.
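As a sketch of the offloading step (the DUPLICATES table and its columns are illustrative only):

Python:
# Sketch: rows from TEMP_IMPORT that already exist in PROGRAMME are copied into
# a DUPLICATES table for the user to review later.
# Table and column names are illustrative only.
def offload_duplicates(cursor):
    cursor.execute("""
        INSERT INTO DUPLICATES (PROGTITLE, SUBTITLE, EPISODE)
        SELECT t.PROGTITLE, t.SUBTITLE, t.EPISODE
        FROM TEMP_IMPORT t
        WHERE EXISTS (
            SELECT 1 FROM PROGRAMME p WHERE p.PROGTITLE = t.PROGTITLE
        )
    """)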

Just a thought.
 
^ Ok I think I understand what you're saying. The TEMP_IMPORT table contains all the data in a table that isn't normalised.

And then I

"3. Process/Validate the TEMP_IMPORT table using set based processing or embedded procedures."

I'm not sure I understand how to get data from one table and populate the normalised tables correctly though?
 
^ Ok I think I understand what you're saying. The TEMP_IMPORT table contains all the data in a table that isn't normalised.

And then I

"3. Process/Validate the TEMP_IMPORT table using set based processing or embedded procedures."

I'm not sure I understand how to get data from one table and populate the normalised tables correctly though?

Normalisation is just removing duplicate information. So you may end up with a query for each table to insert the missing rows, but the data set (i.e. multiple rows) is processed using just one SQL command.

I'm very rusty at SQL (hope it doesn't suck too much)... I've done Oracle and MySQL in the past, so here's a simple example where the temp table's 'name' column is checked against table_a.
There are more efficient forms but this shows what I mean:
Code:
INSERT INTO table_a (name, film)
  SELECT name, film
  FROM temp_table
  WHERE
     NOT EXISTS (
        SELECT * FROM table_a moo WHERE temp_table.name = moo.name
     )

So I tell the database to give me all the rows from temp_table where there's no match from moo (table_a alias) and then insert them into table_a.

It's faster to run like this than to process row-by-row programmatically. The difference is that the database uses its engine to do the work, the data is held in the most efficient form, and it doesn't travel repeatedly across the application-database boundary.
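Applied to the tables from the first post it might look something like this. This is only a sketch: it assumes the TEMP_IMPORT staging table discussed earlier, uses the uniqueness columns from the thread, and trims the column lists for brevity:

Python:
# Sketch: set-based population of the normalised tables from TEMP_IMPORT,
# then CHANNELPROGRAMME filled by joining back to pick up the ids, so no
# per-row GetProgId/GetEpId selects are needed. Column lists are abbreviated.
def populate_from_staging(cursor, channel_id):
    # new genres only
    cursor.execute("""
        INSERT INTO GENRE (GENRENAME)
        SELECT DISTINCT t.GENRENAME
        FROM TEMP_IMPORT t
        WHERE t.GENRENAME IS NOT NULL
          AND NOT EXISTS (SELECT 1 FROM GENRE g WHERE g.GENRENAME = t.GENRENAME)
    """)
    # new programmes only, picking up the genre id via a join
    cursor.execute("""
        INSERT INTO PROGRAMME (GENREID, PROGTITLE)
        SELECT DISTINCT g.GENREID, t.PROGTITLE
        FROM TEMP_IMPORT t
        LEFT JOIN GENRE g ON g.GENRENAME = t.GENRENAME
        WHERE NOT EXISTS (SELECT 1 FROM PROGRAMME p WHERE p.PROGTITLE = t.PROGTITLE)
    """)
    # new episodes only
    cursor.execute("""
        INSERT INTO EPISODE (SUBTITLE, EPISODE)
        SELECT DISTINCT t.SUBTITLE, t.EPISODE
        FROM TEMP_IMPORT t
        WHERE NOT EXISTS (SELECT 1 FROM EPISODE e
                          WHERE e.SUBTITLE = t.SUBTITLE AND e.EPISODE = t.EPISODE)
    """)
    # schedule rows: join back to PROGRAMME/EPISODE to fetch the ids in one go
    cursor.execute("""
        INSERT INTO CHANNELPROGRAMME (CHANNELID, PROGRAMMEID, EPISODEID, DATE,
                                      STARTTIME, ENDTIME, DURATION)
        SELECT %s, p.PROGRAMMEID, e.EPISODEID, t.DATE,
               t.STARTTIME, t.ENDTIME, t.DURATION
        FROM TEMP_IMPORT t
        JOIN PROGRAMME p ON p.PROGTITLE = t.PROGTITLE
        LEFT JOIN EPISODE e ON e.SUBTITLE = t.SUBTITLE AND e.EPISODE = t.EPISODE
    """, (channel_id,))

Each statement processes the whole batch in one round trip, and the final one picks up PROGRAMMEID and EPISODEID through the joins rather than doing a select per row.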
 