SQL performance

I don't know much about SQL or what to expect performance-wise. We have a PSQL instance running on AWS with plenty of resources (CPU and memory never top 5%).

Some of the queries are incredibly slow, like taking several minutes, but the amount of data we get is not that large. We are thinking of dumping the DB altogether and just using flat files (something like Google protobufs).

There is a table:
id | mid | type | timestamp | int1 | int2 | float1 | float2 | str | add_time

id, mid, type and timestamp are all ints. str is a string but is only 3 characters.
add_time is an SQL time type.

The query is like this:
SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

There are about 500 entries for a particular value of "mid" and maybe 5-10 different "mid" values.


Not sure of the size of SQL data types but I am estimating between 20-100KB of data, which shouldn't be that much. Does the querying and sorting really take that long? Is waiting 3 minutes to get this data realistic? If so we will drop SQL at once, because we can do similar in a few milliseconds using flat files.
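
(For reference, rather than estimating, Postgres can report the actual sizes itself; this sketch just assumes the table really is called data:)

Code:
 -- total on-disk size of the table, indexes included
 SELECT pg_size_pretty(pg_total_relation_size('data'));
 -- average stored size of a row
 SELECT pg_size_pretty(avg(pg_column_size(data.*))::bigint) FROM data;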
 
That's way too long for 500 records. I can grab a few thousand records from our data and have them pulled into a nicely formatted report in seconds, so minutes to pull 500 records of raw data is definitely not right.
 
You probably don't have any indexes, so it is having to do a full table scan to satisfy the WHERE clause, which is very slow. I'm a Microsoft SQL man myself, but I presume PSQL behaves the same way.
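
For reference, in Postgres you can list what indexes a table has from the pg_indexes system view (the table name here is just taken from the post above):

Code:
 -- show all indexes defined on the data table
 SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'data';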
 
How many rows are in the table in total?

Currently 6,990,058, although we tend to purge old data as it isn't needed.


The data is pulled from the database and displayed on a webserver as some nice graphs. It's things like the number of users connected, or a simple heartbeat every 5 minutes that server X is alive, and some performance metrics. When the webserver is first loaded it needs to grab 24 hours' worth of data, and this is where the 3-5 minute wait is happening. After that the webserver updates every 5 minutes and we request the data since the last update, so this is just a few dozen rows typically, but even that takes several seconds, which just seems absurd.


We are well versed in handling big data; we crunch through hundreds of GB without issue. The DB was meant to make life easier for creating a webserver, but it is proving to be a pain. What is strange is that sometimes it is much faster (e.g. 10x, which we still consider very slow), but the resources of the AWS server the DB is running on are not pushed at all.
 
You probably don't have any indexes, so it is having to do a full table scan to satisfy the WHERE clause, which is very slow. I'm a Microsoft SQL man myself, but I presume PSQL behaves the same way.

Yeah, I was thinking it was this kind of issue, but I know next to nothing about databases so wanted to get some feedback, as I have no idea what to expect.
 
It sounds like you're doing a full table scan on all 7 million rows.
3 minutes still sounds like a pretty long time to do that query though.

What indexes are on the table at the moment?
Can you get a query plan to see exactly what it's doing?

I'm not sure what profiling tools you have on Postgres, but you should be able to get some info about where the time is being spent.
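
In Postgres the tool for this is EXPLAIN (or EXPLAIN ANALYZE, which actually runs the query and times each step); a sketch using the query from the first post:

Code:
 -- run the query and report where the time is being spent
 EXPLAIN ANALYZE
 SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;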
 
These are the current indexes:
table_name | index_name | column_name
------------+--------------------+-------------
data | data_pkey | id
data | idx_data_timestamp | timestamp


Thinking about the typical queries, the id column is not important; the timestamp is though, and the mid and type columns are also important.
Queries are like this typically:

Code:
 SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

or
Code:
 SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 AND type = 2 ORDER BY timestamp ASC;
 
It does kind of seem that what you're doing may not be suited to using an RDBMS.
From what you've described you're just querying a single table over and over again, presumably with increasing timestamps to get the latest data?

What exactly is the data for and how does it end up in the db?
When you mention getting the previous 24 hours of data on startup and then getting regular updates, could you do something with a message queue?

Whatever inserts to the db now could push a message onto the queue with a particular time to live. Then on startup you could fetch the current state of the world, which would give you all live messages.
Then, once subscribed, you would receive all further updates, which the consumer could buffer/window as appropriate if you want to process things at set intervals.
 
It does kind of seem that what you're doing may not be suited to using an RDBMS.

This is what I thought. If you already use AWS, would a NoSQL offering such as DynamoDB not be a better option if you just want very fast reads?

Then, as said above, a queue such as Amazon SQS could be used if required.
 
These are the current indexes:
table_name | index_name | column_name
------------+--------------------+-------------
data | data_pkey | id
data | idx_data_timestamp | timestamp


Thinking about the typical queries, the id column is not important; the timestamp is though, and the mid and type columns are also important.
Queries are like this typically:

Code:
 SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

or
Code:
 SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 AND type = 2 ORDER BY timestamp ASC;

Looks like you can add a combined index over mid and timestamp to help speed it up, something like the sketch below. It should add a fair amount of performance if the values in mid are fairly limited and repeating.
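
A minimal sketch, assuming the table and columns are exactly as posted (the index name is my own invention):

Code:
 -- one combined index covering both WHERE columns;
 -- "timestamp" quoted since it's also a type name
 CREATE INDEX idx_data_mid_timestamp ON data (mid, "timestamp");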

The initial day's queries will be sending a fair amount of data over the pipe to your app; you'd need to find out the query speed and the send speed to see if there's a bottleneck somewhere. How many rows are being sent for the day?
 
It does kind of seem that what you're doing may not be suited to using an RDBMS.
From what you've described you're just querying a single table over and over again, presumably with increasing timestamps to get the latest data?

What exactly is the data for and how does it end up in the db?
When you mention getting the previous 24 hours of data on startup and then getting regular updates, could you do something with a message queue?

Whatever inserts to the db now could push a message onto the queue with a particular time to live. Then on startup you could fetch the current state of the world, which would give you all live messages.
Then, once subscribed, you would receive all further updates, which the consumer could buffer/window as appropriate if you want to process things at set intervals.

That is kind of what we are thinking.

For our main number-crunching software we don't bother with a database, because we need to load the previous X days' worth of historic data, which is saved in files for each day. We can load GBs in very little time. The DB was just used to monitor some things like the number of connected users, heartbeats from servers, and CPU and memory utilization. The idea of a DB was just to make integration with a webserver easy, but it hasn't really achieved that for us.


I will see if I can get our sysadmin to look at making an index using mid and timestamp as a temp solution.
 
Create a single index on fields (timestamp, mid)

Make sure stats are being collected just after you load a large number of rows into the table.
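
A sketch of both steps in Postgres terms (the index name is invented; ANALYZE is what refreshes the planner's statistics):

Code:
 CREATE INDEX idx_data_timestamp_mid ON data ("timestamp", mid);
 ANALYZE data;  -- collect fresh stats after a big load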

Will see if I can get the sysadmin to do this. He was very much against the idea of making changes to the indexing, but I'm not sure he likes getting criticism from someone who knows next to nothing about DBs!
 
Just ask him to look at the IO counters for the database drives when you run your query :)

Alternatively, put it in a way where he/she can be a hero.
"We're looking for help with making this query quicker. Can you check if there are any indexes we could add as I have a feeling we're missing a trick here".

Imagine using a dictionary to look for a bunch of words beginning with M. If you don't have an index (which would at least tell you to start at page 110 and stop at 115), then you have to start at the beginning and check every page. That's what your server is doing for 7 million rows.
 
I don't use the database you are using, but I suspect indexes and how they are interacting with the WHERE and ORDER BY clauses are at the bottom of the problem.

Have you looked at the query execution plan?

http://www.postgresql.org/docs/8.1/static/sql-explain.html

WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC

I would have thought it should run through in timestamp order from the given value, discarding rows where mid <= 400. That implies an index on (timestamp, mid).

Note: I see someone else has suggested this index!
 