SQL performance

Caporegime | Joined: 18 Oct 2002 | Posts: 32,665
I don't know much about SQL or what to expect performance-wise. We have a PSQL instance running on AWS with plenty of resources (CPU and memory never top 5%).

Some of the queries are incredibly slow, like taking several minutes, but the amount of data we get back is not that large. We are thinking of dumping the DB altogether and just using flat files (something like Google protobufs).

There is a table:
id | mid | type | timestamp | int1 | int2 | float1 | float2 | str | add_time

id, mid, type, timestamp are all ints. str is a string but is only 3 characters.
add_time is an SQL time type.

The query is like this:
Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

There are about 500 entries for a particular value of "mid" and maybe 5-10 different "mid" values.


Not sure of the sizes of the SQL data types, but I am estimating between 20-100KB of data, which shouldn't be that much. Does the querying and sorting really take that long? Is waiting 3 minutes to get this data realistic? If so we will drop SQL at once, because we can do similar in a few milliseconds using flat files.
 
How many rows are in the table in total?
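It might also be worth checking what the table actually takes up on disk rather than estimating it. I believe Postgres has built-in functions for this, something along these lines (table name taken from your post):

Code:
 -- Total on-disk size of the table, including its indexes
 SELECT pg_size_pretty(pg_total_relation_size('data'));

 -- Approximate stored size, in bytes, of a few individual rows
 SELECT pg_column_size(d.*) AS row_bytes FROM data d LIMIT 5;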

Currently 6,990,058 although we tend to purge old data as it isn't needed.


The data is pulled from the database and displayed on a webserver as some nice graphs. It's things like the number of users connected, or a simple heartbeat every 5 minutes that server X is alive, plus some performance metrics. When the webserver is first loaded it needs to grab 24 hours' worth of data, and this is where the 3-5 minute wait is happening. After that the webserver updates every 5 minutes and we only request the data since the last update, so this is typically just a few dozen rows, but even that takes several seconds, which just seems absurd.


We are well versed in handling big data; we crunch through hundreds of GB without issue. The DB was meant to make life easier for creating a webserver, but it is proving to be a pain. What is strange is that sometimes it is much faster (e.g. 10x, which we still consider very slow), yet the resources of the AWS server the DB is running on are not pushed at all.
 
You probably don't have any indexes, so it is having to do a full table scan to satisfy the WHERE clause, which is very slow. I'm a Microsoft SQL Server man myself, but I presume PSQL behaves the same way.
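You can check whether that is what's happening by asking the database for the query plan. I believe in Postgres that's EXPLAIN ANALYZE; if the output shows a "Seq Scan" over the whole 7-million-row table, that's your answer (query copied from your post):

Code:
 EXPLAIN ANALYZE
 SELECT * FROM data WHERE mid > 400 AND timestamp > 1404913361 ORDER BY timestamp ASC;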

Yeah, I was thinking it was this kind of issue, but I know next to nothing about databases so wanted to get some feedback, as I have no idea what to expect.
 
These are the current indexes:
 table_name |     index_name     | column_name
------------+--------------------+-------------
 data       | data_pkey          | id
 data       | idx_data_timestamp | timestamp


Thinking about the typical queries, the id column is not important; the timestamp is, though, and the mid and type columns are also important.
Queries are typically like this:

Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 ORDER BY timestamp ASC;

or
Code:
 SELECT * FROM data WHERE mid >400 AND timestamp > 1404913361 AND type = 2 ORDER BY timestamp ASC;
 
It does kind of seem that what you're doing may not be suited to an RDBMS.
From what you've described, you're just querying a single table over and over again, presumably with increasing timestamps to get the latest data?

What exactly is the data for and how does it end up in the db?
When you mention getting the previous 24 hours of data on startup and then getting regular updates, could you do something with a message queue?

Whatever inserts to the db now could push a message onto the queue with a particular time to live. Then on startup you could fetch the current state of the world, which would give you all live messages.
Then, once subscribed, you would receive all further updates which you could buffer/window as appropriate if you want to process things at set intervals by the consumer.
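If you wanted to avoid adding a separate broker, I believe Postgres's own LISTEN/NOTIFY could cover the "push updates to subscribers" part. Very rough sketch, with a made-up channel name and payload:

Code:
 -- In whatever inserts the rows, fire a notification after each insert:
 NOTIFY metrics, '{"mid": 401, "type": 2, "timestamp": 1404913361}';

 -- In the webserver's DB connection, subscribe once:
 LISTEN metrics;
 -- ...the driver then delivers each payload as it arrives, no polling query needed.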

That is kind of what we are thinking.

For our main number-crunching software we don't bother with a database, because we need to load the previous X days' worth of historic data, which is saved in files for each day. We can load GBs in very little time. The DB was just used to monitor things like the number of connected users, heartbeats from servers, and CPU and memory utilisation. The idea of a DB was just to make integration with a webserver easy, but it hasn't really achieved that for us.


I will see if I can get our sysadmin to look at making an index using mid and timestamp as a temporary solution.
 
Create a single index on fields (timestamp, mid)

Make sure stats are being collected just after you load a large number of rows into the table.
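Something along these lines should do it; standard Postgres syntax, the index name is just an example:

Code:
 CREATE INDEX idx_data_timestamp_mid ON data (timestamp, mid);
 ANALYZE data;  -- refresh the planner's statistics after a bulk load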

Will see if I can get the sysadmin to do this. He was very against the idea of making changes to the indexing, but I'm not sure he likes getting criticism from someone who knows next to nothing about DBs!
 