Efficient database querying/updating

PaulStat · 26 Oct 2010 at 21:01

Hi Guys,

I have a set of 400 csv files (average of 350 lines per file) that I use to insert data into a MySQL database (via python). I'm trying to improve the speed with which I can update this database.

The process I do this by reading through the contents of a file at a time storing each row in a list. I did have it so that it inserted into the tables a line at a time, but this was very slow, I then updated the code so it built a list of columns to insert all at once (every row for the current file)

The problem I now have is before I insert a row into a table I have to check if it already exists, so I have to query every row of every file, which is obviously slowing things down somewhat. (At 5 select queries per line that works out to be a minimum of 700,000 select queries, not to mention the time taken to insert 350 rows into a table at a time)

Anyway here's the code so to give an idea of what's going on.

PHP:

def InsertChannelProgrammes(self,channelID,listDict):
   chanProgsqlString = """INSERT INTO CHANNELPROGRAMME(CHANNELID,PROGRAMMEID,EPISODEID,`REPEAT`,NEWSERIES,PREMIERE,CHOICE,
                       SUBTITLES,DEAFSIGNED,DATE,STARTTIME,ENDTIME,DURATION) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
   progsqlString = """INSERT INTO PROGRAMME(GENREID,PROGTITLE,YEAR,DIRECTOR,PERFORMERS,FILM,WIDESCREEN,
                   BNW,STARRATING,CERTIFICATE,DESCRIPTION) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
   episodesqlString = """INSERT INTO EPISODE(PROGRAMMEID,SUBTITLE,EPISODE,DESCRIPTION) VALUES(%s,%s,%s,%s)"""
   genresqlString = """INSERT INTO GENRE (GENRENAME) VALUES(%s)"""

   chanProgcolumns = []
   progcolumns = []
   episodecolumns = []
   genrecolumns = []
   for row in listDict:
      #Insert Genres into GENRE table
      if row["GENRE"] != None:
         #Check to see if genre exists
         self._cursor.execute("SELECT * FROM `GENRE` WHERE GENRENAME=\""+row["GENRE"]+"\"")
         genResult = self._cursor.fetchone()
         if genResult == None:
            #New Genre add row for inserting
            genValuesTuple = (row["GENRE"],)
            genrecolumns.append(genValuesTuple)
      #Insert Programmes into PROGRAMMES table
      #Check to see if programmes exists
      self._cursor.execute("""SELECT PROGRAMMEID, PROGTITLE FROM PROGRAMME WHERE PROGTITLE=%s"""
                              ,(row["PROGTITLE"]))
      progResult = self._cursor.fetchone()
      if progResult == None:
         #New Programme add row for inserting
         description = ""
         if row["SUBTITLE"] == "" and row["EPISODE"] == "":
            description = row["DESCRIPTION"]
         progValuesTuple =  (self.GetGenreId(row["GENRE"]),row["PROGTITLE"],row["YEAR"],row["DIRECTOR"],row["PERFORMERS"],
                         row["FILM"],row["WIDESCREEN"],row["BNW"],row["STARRATING"],
                         row["CERTIFICATE"],description)
         progcolumns.append(progValuesTuple)
      #Insert Episodes into EPISODES table
      #Check to see if episode exists
      self._cursor.execute("""SELECT * FROM EPISODE WHERE SUBTITLE=%s and EPISODE=%s""",
                              (row["SUBTITLE"],row["EPISODE"]))
      epResult = self._cursor.fetchone()
      if epResult == None:
         #New Episode add row for inserting
         episodeValuesTuple = (self.GetProgId(row["PROGTITLE"]),row["SUBTITLE"],row["EPISODE"],row["DESCRIPTION"])
         episdoecolumns.append(episodeValuesTuple)
      #Insert Programmes into CHANNELPROGRAMME table
      chanProgvaluesTuple = (channelID,self.GetProgId(row["PROGTITLE"]),self.GetEpId(row["SUBTITLE"],row["EPISODE"]),
                    row["REPEAT"],row["NEWSERIES"],row["PREMIERE"],row["CHOICE"],
                    row["SUBTITLES"],row["DEAFSIGNED"],self.FormatDate(row["DATE"]),
                    row["STARTTIME"],row["ENDTIME"],row["DURATION"])
      chanProgcolumns.append(chanProgvaluesTuple)
   if len(genrecolumns) > 0:
      self._cursor.executemany(genresqlString,genrecolumns)
   if len(progcolumns) > 0:
      self._cursor.executemany(progsqlString,progcolumns)
   if len(episodecolumns) > 0:
      self._cursor.executemany(episodesqlString,episodecolumns)
   self._cursor.executemany(chanProgsqlString,chanProgcolumns)

PaulStat · 26 Oct 2010 at 21:23

rpstewart said:
Do you have to load the data programatically in the python? Can you not use the built in import / mass load functionality in MySQL?

If you do have to insert from the python then you shouldn't be having to check that the data exists first, there should be a primary key on the table which enforces the uniqueness of the records. All you then have to do is insert the row and make sure that you get a successful return code from MySQL.

The reading of the contents of each file is actually really quick, it's purely the querying of the database which is taking a long time. Plus I've looked at the MySQL documentation for the mass load thing, and I really don't get it. In fact I find most of the online MySQL docs confusing

Each table does have a primary key which is a unique integer id, but I also want the columns I mentioned above to be unique. I just don't know how to make text based columns unique. Also what actually happens if something has to be unique? Does it just ignore the insert if that row already exists?

SMB30 said:
What database are you using, and are you using indexes?

Is genrename always a unique field, and how many fields does the table have?

I'm using MySQL

I've updated the OP with the table structure I'm using

PaulStat · 26 Oct 2010 at 23:01

Thanks, I'll give it a go.

Also as part of the inserts I get some ID values using another select for example

PHP:

      chanProgvaluesTuple = (channelID,self.GetProgId(row["PROGTITLE"]),self.GetEpId(row["SUBTITLE"],row["EPISODE"]),
                    row["REPEAT"],row["NEWSERIES"],row["PREMIERE"],row["CHOICE"],
                    row["SUBTITLES"],row["DEAFSIGNED"],self.FormatDate(row["DATE"]),
                    row["STARTTIME"],row["ENDTIME"],row["DURATION"])

The GetProgId and GetEpId are two select queries, is this a good way of getting the id's to link between tables?

PaulStat · 27 Oct 2010 at 06:44

perhaps there's something I'm not understanding, but when I'm inserting into the channelprogramme table for example I have 3 foreign keys CHANNELID, PROGRAMMEID + EPISODEID

Now I know the CHANNELID already so that's fine, but by the time I come around to inserting to the CHANNELPROGRAMME table the relevant programme and episode already exist in PROGRAMME and EPISODE tables, so to make sure I'm referencing the correct ones I use a select to get their ID's.

As far as checking for duplication go's. Say the database is empty, fine I don't have to check for any duplicates, it populates the programme, episode and channelprogramme tables without any problems. The next day I want to update the database with the next lot of programmes. The channelprogramme table is emptied as this represents every programme that is on every channel over a two week period so it can contain repeats etc. , the programme and episode tables however should have unique entries. So for example the programme table could have a progId of one which represents CSI: NY. That can only occur once. The episode table could have an epId of 1, a progId of 1 and an episode title of "1/24 series 1", that can only occur once.

So when I'm scanning through each entry I need to know whether or not an instance for that programme and that episode already exist, if not insert. That's why I'm doing inserts in that scenario. Is this not the way it should be done? If not what should I be doing?

PaulStat · 27 Oct 2010 at 08:42

^ Ok I think I understand what you're saying. The TEMP_IMPORT table contains all the data in a table that isn't normalised.

And then I

"3. Process/Validate the TEMP_IMPORT table using set based processing or embedded procedures."

I'm not sure I understand how to get data from one table and populate the normalised tables correctly though?