Slow SQL Query / help

whitecrook · 18 Jun 2009 at 15:46

Hi all

Wonder if anyone knows how I can get around this problem:

2 DB tables:

main_table_2:
id, app_no, date,

csv_table:
id, app_no, book_date, pub_date

(book_date should match date, and app_no should match app_no)

What I'm doing is importing a CSV into the CSV table and then comparing against the main_table_2 and trying to produce a table that shows me records in the csv_table that AREN'T in the main_table_2.

This is my query:

Code:

SELECT `csv_table`.`app_no` AS `CSV_app`,
       `pub_date`, `book_date`
       FROM `csv_table`
       LEFT JOIN `main_table_2` 
       ON `main_table_2`.`app_no` = `csv_table`.`app_no` 
       WHERE `book_date` = '2009-06-05' 
       AND `main_table_2`.`app_no` is NULL

But it takes 26 seconds to compute.

Explain SQL

Code:

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra 
1	SIMPLE	csv_table	ALL	NULL	NULL	NULL	NULL	242	Using where
1	SIMPLE	main_table_2	index	NULL	app_no	47	NULL	227210	Using where; Using index; Not exists

Can anyone tell me what's wrong or improve on the query?

Cheers

GravyMonster · 18 Jun 2009 at 16:36

The SQL looks ok, i take it there's just a lot of records in these tables?

You could try adding some indexes to the columns which the join occurs on.

whitecrook · 18 Jun 2009 at 17:04

INDEX ON csv_table

Code:

Keyname	Type	Cardinality	Action	Field
PRIMARY 	PRIMARY	242  	 	 id 
app_no_2 	UNIQUE		 242  	 	 app_no 
	 	 	 	 	 	 book_date 
app_no 	 	INDEX		 242  	 	 app_no 
	 	 	 	 	 	 book_date 
book_date 	INDEX		 2  	 	 book_date 
app_no_3 	INDEX		 242  	 	 app_no

and main_table_2

Code:

Keyname	Type	Cardinality	Action	Field
PRIMARY 	PRIMARY	225678  	id 
app_no_2 	UNIQUE		225678 	 	app_no 
	 	 	 	 	 	date 
app_no 		INDEX		225678  	app_no 
date 		INDEX		2375  	 	date

I confess I don't really understand indexes even after reading hours and hours worth of info about them....

Goksly · 18 Jun 2009 at 20:02

Right, as you are not actually taking any of the data from main_table_2 I would use EXISTS instead of a join...

Code:

SELECT `csv_table`.`app_no` AS `CSV_app`,
       `pub_date`, `book_date`
       FROM `csv_table`
       WHERE NOT EXISTS (SELECT 'x' FROM `main_table_2` AS table_2 
                                   WHERE table_2.app_no = 'csv_table'.app_no 
                                   AND table_2.date = 'csv_table'.book_date)
       AND book_date` = '2009-06-05'

Can't try it (obviously) but I think that should be pretty fast. If you are just using the join to find missing values, I *think* exists should be faster.

GravyMonster · 19 Jun 2009 at 08:39

I'd be interested to see a speed test between the two. I did consider using exists, but dismissed it as it would effectively mean running two queries, rather than one with an indexed join.

whitecrook · 19 Jun 2009 at 11:49

Hmmmmmmmm

Yes. Couple of things here

The query executes much quicker around 0.08 seconds..

But it gives me different results ... However I think the results it's giving me are more correct.

The app_no in main_table_2 is not unique. - it's only unique per date. So My first query gives me 2 hits back for that date .

apps in the csv for that date but not in the db at all.

You query gives me:

apps in the db anytime but not in the csv for that date and apps not in the db but in the csv for that date..

I'll look into how to prcoeed from here.

Cheers...

whitecrook · 19 Jun 2009 at 14:16

If I run the query without the date = date and date = 2009-6-05 I get the same result as my previous query (i.e. only apps which do not appear in the db on any date) and the execution time is similar, about 30 secs or so....

I think I may be better to do the your query on the date range and then for each app returned, perform a select on the app to find out if it is or is not in the db? That may be quicker?

Goksly · 19 Jun 2009 at 15:47

Depends on various points like how often this query is going to be run and how many apps are you likely to have to go back to the database again to get further information. Don't think I fully understand why the query i provided doesnt do all that you want, considering the constraints you put in your first post?

whitecrook · 22 Jun 2009 at 10:34

Code:

Apps in csv but not in DB for expected book in date
Position	app_no	pub_date	book_date	First Seen	Last Seen	Times Seen
1	app1	05/06/2009	05/06/2009	20/10/2008	14/11/2008	4
2	APP2	05/06/2009	05/06/2009	08/05/2009	08/05/2009	1
3	app3	05/06/2009	05/06/2009	26/05/2009	15/06/2009	5
4	APp56	05/06/2009	05/06/2009	27/03/2009	22/05/2009	3
5	app453	05/06/2009	05/06/2009			
6	app12	05/06/2009	05/06/2009	26/03/2009	26/03/2009	1
7	app150	05/06/2009	05/06/2009	16/04/2009	16/04/2009	1
8	app160	05/06/2009	05/06/2009			
9	app159	05/06/2009	05/06/2009	07/05/2009	13/05/2009	2
10	app200	05/06/2009	05/06/2009	14/11/2007	14/11/2007	1

Maybe this will help.

This is the result of your query.

My query produces only the rows that have 'never been seen before' I.e something in the csv that is never been in the db. (row 5 and 8)

Your query is more useful, as it shows when stuff is in the db, but not on the expected 'book-in' date, as well as the stuff that just isn't in the db.

What I did was use your query and then for each record pull the data. This is much faster.

here's my (rough) php code

PHP:

$title4 = "<b>Apps in csv but not in DB for expected book in date</b>";
$sql4 = "SELECT `csv_table`.`app_no`, `pub_date`, " .
      "`book_date` " .
      "FROM `csv_table` " .
      "WHERE NOT EXISTS (SELECT 'x' FROM `main_table_2` AS table_2 " .
                       "WHERE table_2.app_no = `csv_table`.`app_no` " .
                       "AND `table_2`.`date` = `csv_table`.`book_date`) " .
      "AND `book_date` = '$dateToSearchOn'";


printResults2($sql4, $title4);
echo "<br />";



function printResults2($sql, $title) {
  $result = "";
  $fields = "";
  $result = mysql_query($sql) or trigger_error(mysql_error());
  $number_of_rows = mysql_numrows($result);
  echo $title;
  echo "<table border=1 >";
  echo "<tr>";
  echo "<td><b>Position</b></td>";

  $i = 0;
  while ($i < mysql_num_fields($result)) {
    $meta = mysql_fetch_field($result, $i);
    $temp = rawurlencode($meta->name);

    echo "<td><b>$meta->name</b></a></td>";

    $fields[] = $meta->name;
    $i++;
  }
  echo "<td>First Seen</td><td>Last Seen</td><td>Times Seen</td>";
  echo "</tr>";




  // print $querystring;
  // $result = mysql_query($querystring) or trigger_error(mysql_error());
  // $position = $offset+1;

  $position = 1;

  while($row = mysql_fetch_array($result)){
    foreach($row AS $key => $value) { $row[$key] = stripslashes($value); }

    echo "<tr>";
    echo "<td valign='top'>" . $position . "</td>";

    foreach($fields as $field_name) {
      echo "<td valign='top'>" . nl2br( $row[$field_name]) . "</td>";

    }
    $app_to_search = $row['app_no'];
    //    echo $app_to_search;
    $sql_lookup_app = "SELECT MIN(date) AS `first_seen`, MAX(date) AS `last_seen`, COUNT(app_no) AS `times_seen` FROM `main_table_2` WHERE `app_no` = '$app_to_search' GROUP BY `app_no`";

    $result2 = mysql_query($sql_lookup_app) or trigger_error(mysql_error());
    $number_of_rows2 = mysql_numrows($result2);
    while($row2 = mysql_fetch_array($result2)){
      foreach($row2 AS $key2 => $value2) { $row2[$key2] = stripslashes($value2); }
      echo "<td valign='top'>" . nl2br( $row2['first_seen']) . "</td>";
      echo "<td valign='top'>" . nl2br( $row2['last_seen']) . "</td>";
      echo "<td valign='top'>" . nl2br( $row2['times_seen']) . "</td>";
    }
    echo "</tr>";

    $position ++;
  }
  echo "</table>";
}