Hi Guys,
I have a set of 400 CSV files (averaging 350 lines per file) that I use to insert data into a MySQL database (via Python). I'm trying to improve the speed with which I can update this database.
The process works like this: I read through one file at a time, storing each row in a list. I originally had it inserting into the tables one line at a time, but this was very slow, so I updated the code to build a list of value tuples and insert them all at once (every row for the current file).
The problem I now have is that before I insert a row into a table I have to check whether it already exists, so I end up querying for every row of every file, which is obviously slowing things down somewhat. At 5 SELECT queries per line, that works out to a minimum of 400 files × 350 lines × 5 = 700,000 SELECT queries, not to mention the time taken to insert 350 rows into a table at a time.
Anyway, here's the code to give an idea of what's going on.
Python:
def InsertChannelProgrammes(self, channelID, listDict):
    chanProgsqlString = """INSERT INTO CHANNELPROGRAMME(CHANNELID,PROGRAMMEID,EPISODEID,`REPEAT`,NEWSERIES,PREMIERE,CHOICE,
                           SUBTITLES,DEAFSIGNED,DATE,STARTTIME,ENDTIME,DURATION)
                           VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    progsqlString = """INSERT INTO PROGRAMME(GENREID,PROGTITLE,YEAR,DIRECTOR,PERFORMERS,FILM,WIDESCREEN,
                       BNW,STARRATING,CERTIFICATE,DESCRIPTION) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    episodesqlString = """INSERT INTO EPISODE(PROGRAMMEID,SUBTITLE,EPISODE,DESCRIPTION) VALUES(%s,%s,%s,%s)"""
    genresqlString = """INSERT INTO GENRE (GENRENAME) VALUES(%s)"""
    chanProgcolumns = []
    progcolumns = []
    episodecolumns = []
    genrecolumns = []
    for row in listDict:
        # Genres: check whether the genre already exists before queueing it
        if row["GENRE"] is not None:
            self._cursor.execute("SELECT * FROM GENRE WHERE GENRENAME=%s", (row["GENRE"],))
            genResult = self._cursor.fetchone()
            if genResult is None:
                # New genre: add a row for inserting
                genrecolumns.append((row["GENRE"],))
        # Programmes: check whether the programme already exists
        self._cursor.execute("SELECT PROGRAMMEID, PROGTITLE FROM PROGRAMME WHERE PROGTITLE=%s",
                             (row["PROGTITLE"],))
        progResult = self._cursor.fetchone()
        if progResult is None:
            # New programme: add a row for inserting
            description = ""
            if row["SUBTITLE"] == "" and row["EPISODE"] == "":
                description = row["DESCRIPTION"]
            progValuesTuple = (self.GetGenreId(row["GENRE"]), row["PROGTITLE"], row["YEAR"],
                               row["DIRECTOR"], row["PERFORMERS"], row["FILM"], row["WIDESCREEN"],
                               row["BNW"], row["STARRATING"], row["CERTIFICATE"], description)
            progcolumns.append(progValuesTuple)
        # Episodes: check whether the episode already exists
        self._cursor.execute("SELECT * FROM EPISODE WHERE SUBTITLE=%s AND EPISODE=%s",
                             (row["SUBTITLE"], row["EPISODE"]))
        epResult = self._cursor.fetchone()
        if epResult is None:
            # New episode: add a row for inserting
            episodeValuesTuple = (self.GetProgId(row["PROGTITLE"]), row["SUBTITLE"],
                                  row["EPISODE"], row["DESCRIPTION"])
            episodecolumns.append(episodeValuesTuple)
        # Schedule entry for the CHANNELPROGRAMME table (always inserted)
        chanProgvaluesTuple = (channelID, self.GetProgId(row["PROGTITLE"]),
                               self.GetEpId(row["SUBTITLE"], row["EPISODE"]),
                               row["REPEAT"], row["NEWSERIES"], row["PREMIERE"], row["CHOICE"],
                               row["SUBTITLES"], row["DEAFSIGNED"], self.FormatDate(row["DATE"]),
                               row["STARTTIME"], row["ENDTIME"], row["DURATION"])
        chanProgcolumns.append(chanProgvaluesTuple)
    # Batch-insert everything collected for this file
    if genrecolumns:
        self._cursor.executemany(genresqlString, genrecolumns)
    if progcolumns:
        self._cursor.executemany(progsqlString, progcolumns)
    if episodecolumns:
        self._cursor.executemany(episodesqlString, episodecolumns)
    self._cursor.executemany(chanProgsqlString, chanProgcolumns)
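One alternative I've been considering, to cut down the SELECTs without touching the schema, is to pull the existing keys out once per file and test membership in memory. A rough sketch (assuming the same self._cursor, and that the key sets fit comfortably in RAM):
Python:
# Cache the existing keys once per file, then test membership in Python
# instead of issuing a SELECT per row.
self._cursor.execute("SELECT GENRENAME FROM GENRE")
existing_genres = set(r[0] for r in self._cursor.fetchall())
self._cursor.execute("SELECT PROGTITLE FROM PROGRAMME")
existing_progs = set(r[0] for r in self._cursor.fetchall())
self._cursor.execute("SELECT SUBTITLE, EPISODE FROM EPISODE")
existing_episodes = set((r[0], r[1]) for r in self._cursor.fetchall())

for row in listDict:
    if row["GENRE"] is not None and row["GENRE"] not in existing_genres:
        genrecolumns.append((row["GENRE"],))
        existing_genres.add(row["GENRE"])  # avoid queueing the same genre twice in one file
    # ...the same membership test replaces the PROGRAMME and EPISODE SELECTs...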
Is there something database-wise I can do to the tables so I don't have to run quite so many queries?
"Uniqueness" is defined in the tables like so
GENRE: GENRENAME
PROGRAMME: PROGTITLE
EPISODE: SUBTITLE + EPISODE
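Would adding UNIQUE indexes that match those rules let MySQL do the duplicate check for me? Something like this is what I have in mind (just a sketch; the index names are made up, and long text columns might need an index prefix length):
Python:
# One-off schema change (not run per file): let MySQL enforce the
# uniqueness rules above, so a duplicate insert fails instead of
# needing a SELECT first.
self._cursor.execute("ALTER TABLE GENRE ADD UNIQUE INDEX UQ_GENRENAME (GENRENAME)")
self._cursor.execute("ALTER TABLE PROGRAMME ADD UNIQUE INDEX UQ_PROGTITLE (PROGTITLE)")
self._cursor.execute("ALTER TABLE EPISODE ADD UNIQUE INDEX UQ_EPISODE (SUBTITLE, EPISODE)")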
Thanks,
Paul
Edit: The tables I have defined are:
CHANNELS
CHANNELID | CHANNELNAME
GENRE
GENREID | GENRENAME
EPISODE
EPISODEID | SUBTITLE | EPISODE
PROGRAMME
PROGRAMMEID | GENREID | PROGTITLE | YEAR | DIRECTOR | PERFORMERS | PREMIERE | FILM | REPEAT | SUBTITLES | WIDESCREEN | NEWSERIES | DEAFSIGNED | BNW | STARRATING | CERTIFICATE | DESCRIPTION | CHOICE
CHANNELPROGRAMMES
CHANNELPROGRAMMEID | CHANNELID | PROGRAMMEID | EPISODEID | DATE | STARTTIME | ENDTIME | DURATION
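If I understand INSERT IGNORE correctly, once those unique indexes exist the batched inserts could skip already-present rows without any SELECTs at all. A minimal sketch for the GENRE table (using the same hypothetical index as above):
Python:
# With a UNIQUE index on GENRENAME, INSERT IGNORE silently skips any row
# that would violate it, so the per-row existence checks can go away.
genresqlString = "INSERT IGNORE INTO GENRE (GENRENAME) VALUES (%s)"
genrecolumns = [(row["GENRE"],) for row in listDict if row["GENRE"] is not None]
if genrecolumns:
    self._cursor.executemany(genresqlString, genrecolumns)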