Bit of help with SQL

Mrwong · 5 Feb 2022 at 00:33

NOT EXISTS with a Correlated Subquery, what is it?

OspreyO · 5 Feb 2022 at 10:57

I'm not sure what you are asking. I'll attempt ask answer though I might have misunderstood you and my own SQL isn't brilliant.

It's basically looking for values that are not in the where sub clause. It's slower than using a join to the same thing but perhaps easier to write.

https://stackoverflow.com/questions/12500646/which-is-faster-not-in-or-not-exists

Miganto · 5 Feb 2022 at 23:13

double post

please delete

Miganto · 5 Feb 2022 at 23:29

Personally I stay clear of exists/not exists and write the joins, i've seen many large performance issues using bad 'exists'.

It's checking for rows that your subclause brings back, it doesn't matter what the values are in there, it can bring back everything or you can make it a correlated sub query where you join the outer query table to the sub query table - if the row exists, it'll bring back results.

I don't like exists, i find them harder to visualise what it's actually matching on rather than 'in' (which can only be used for single columns) or joins and is another reason i generally stay away from them.

This explains it all quite well - https://www.mssqltips.com/sqlservertip/6013/sql-server-in-vs-exists/

OspreyO · 5 Feb 2022 at 23:41

I prefer "in" myself. Exists can be unpredictable with nulls.

I also think a sub clause as "bad". I only use it when checking something. If it was going to run frequently as in production I'd rewrite it as join.

Dolph · 6 Feb 2022 at 22:19

OspreyO said:
I prefer "in" myself. Exists can be unpredictable with nulls.

I also think a sub clause as "bad". I only use it when checking something. If it was going to run frequently as in production I'd rewrite it as join.

It's not unpredictable, it's explicitly different by design. Whether In/Not In or Exists/Not Exists will work better depends quite a lot on the specifics of the table design and constraints around nulls on the columns you want to use in the queries.

I'd also add (in response to Miganto's post) that using joins instead of exists because some people write bad exists clauses isn't necessarily good advice. Exists is powerful and, when used correctly, will generally outperform joins for filtering data when the queries and datasets are appropriately optimised.

Basically exists (and not exists) creates a simple true or false test against each row of data, that is then used to filter the query.

For example:

Code:

Select
          id,
          PersonName
from
      Person a
where exists
    (select 1 from Sales s where a.id = s.PersonID)

would return person details from the person table where there was at least 1 sale in the sales table (by using 1 rather than *, the query doesn't even bother looking at the rest of the table structure in sales, or gathering the information).

So basically this creates a list of personIds with a 1 against them from sales, then uses that to filter the Person table output.

For Not Exists:

Code:

Select
          id,
          PersonName
from
      Person a
where not exists
    (select 1 from Sales s where a.id = s.PersonID)

This does exactly the same, but filters the person table output based on the rows that don't match an entry in the sub query.

Exists is best used when you need to manipulate or section out data within a table, based on relationships to another table, but when there is no need to actually include any data from the other table.

OspreyO · 6 Feb 2022 at 22:53

I find them unpredictable because I'm dealing with a lot of legacy data and less common (to me at least) old data types.

I'm generally looking for specific data to test a query before I run it against a much larger set of data. Or I'm filtering and/or transforming all the weird old data.

..and I generally don't know what I'm doing or have a mentor to ask.

OspreyO · 6 Feb 2022 at 23:02

What does appropriately optimised mean in this case...

gpuerrilla · 6 Feb 2022 at 23:19

I am guessing either reducing response time or if its a frequent request helping not slow down the system. Utilising indexes and only bringing back info that is needed would be common techniques.

Dolph · 6 Feb 2022 at 23:54

OspreyO said:
I find them unpredictable because I'm dealing with a lot of legacy data and less common (to me at least) old data types.

I'm generally looking for specific data to test a query before I run it against a much larger set of data. Or I'm filtering and/or transforming all the weird old data.

..and I generally don't know what I'm doing or have a mentor to ask.

OspreyO said:
What does appropriately optimised mean in this case...

ah, that'll do it. Let me guess, most of the data design and rationale behind it is undocumented as well

Apologies if my previous came across a bit harsh, it wasn't intended, but re-reading it could probably have been better phrased.

Optimisation is about ensuring both the query and the data structures are optimised (and working together) to ensure fast processing, not taking too many system resources etc. Having the correct indexing, and ensuring you write the queries in such a way that they use them (which can be especially tricky in some setups, especially where things like partitions are involved) and so on. I've seen queries producing the same output go from 4 hours to 5 seconds when re-written. This was in a production data warehouse where tables could frequently contain millions or hundreds of millions of rows.

If you ever want help or advice, feel free to ask, either here or in PM.

OspreyO · 7 Feb 2022 at 00:01

We were discussing partition's this week in work we felt unless you had a very specific use case they were more hassle than they were worth. One of min devs wanted to try them for performance.

What's your opinion on partitions.

I would suggest on this system which is excessively normalised, and where only the last 2yrs is relevant for most part. That should be their focus rather than partitions. Also I know from experience they change their tables and views often.

Miganto · 7 Feb 2022 at 00:40

"and, when used correctly, will generally outperform joins for filtering data when the queries and datasets are appropriately optimised."

Indeed but to me someone who is asking what 'exists' does, isn't someone who is building their own database or has the control over the performance, indexing, etc.

For that situation, my point has come out to be correct most of the time because without that control, exists have had terrible performance compared to going through the joining route.

If you are able to DBA the situation, then use exists ofc.

Mrwong · 7 Feb 2022 at 01:31

How would you explain it to a none techy?

Dolph · 7 Feb 2022 at 08:36

OspreyO said:
We were discussing partition's this week in work we felt unless you had a very specific use case they were more hassle than they were worth. One of min devs wanted to try them for performance.

What's your opinion on partitions.

I would suggest on this system which is excessively normalised, and where only the last 2yrs is relevant for most part. That should be their focus rather than partitions. Also I know from experience they change their tables and views often.

Partitions are far from an instant performance boost, they can significantly speed up common queries on large datasets where the common searches are always for a similar data duration (say a day or a week), but at the expense of increased overhead on data modification, and placing additional requirements in query creation onto the users so that they write queries where the optimiser uses the partitions properly.

Are you primarily working with a OLTP (online transaction processing) or OLAP (online analytics processing) focused DB? I would agree that generally, optimising the wider structures, relationships, indexes and things like retention periods will almost certainly give better results than partitioning, partitioning is best used when those other aspects have been fully exhausted and your datasets are still creating performance challenges purely through sheer size, but be prepared to have to train those using the dB in how to write queries that effectively use those partitions and how to read an explain plan to check it.

Dolph · 7 Feb 2022 at 08:58

Miganto said:
"and, when used correctly, will generally outperform joins for filtering data when the queries and datasets are appropriately optimised."

Indeed but to me someone who is asking what 'exists' does, isn't someone who is building their own database or has the control over the performance, indexing, etc.

For that situation, my point has come out to be correct most of the time because without that control, exists have had terrible performance compared to going through the joining route.

If you are able to DBA the situation, then use exists ofc.

It's not really a DBA role either, it's the data architect (formally appointed or not) who defines the structure of the data, key relationships, indexes etc. It's also the architects responsibility to ensure that is appropriately documented and available to those who need it (whether they are dbas, analysts or whatever).

Generally most SQL methods are highly design dependent to ensure they are giving you the information you need, so understanding the design and the implications it has on the query processing is vital, hence why I'm not a fan of 'use X, it's easier' type answers. Joins can still throw plenty of unexpected outputs if the underlying data isn't understood. One of the big benefits of exists is that you can never accidentally create a Cartesian multiplication effect through not understanding the data. (The example I used earlier around people and sales wouldn't work with joins without a lot more complexity, as there's a one to many relationship between people and sales that would have to be factored in).

Dolph · 7 Feb 2022 at 09:00

Mrwong said:
How would you explain it to a none techy?

https://stackoverflow.com/questions/5846882/how-do-sql-exists-statements-work

The first and second post in this explains it quite well.

OspreyO · 7 Feb 2022 at 09:37

Dolph said:
Partitions are far from an instant performance boost, they can significantly speed up common queries on large datasets where the common searches are always for a similar data duration (say a day or a week), but at the expense of increased overhead on data modification, and placing additional requirements in query creation onto the users so that they write queries where the optimiser uses the partitions properly.

Are you primarily working with a OLTP (online transaction processing) or OLAP (online analytics processing) focused DB? I would agree that generally, optimising the wider structures, relationships, indexes and things like retention periods will almost certainly give better results than partitioning, partitioning is best used when those other aspects have been fully exhausted and your datasets are still creating performance challenges purely through sheer size, but be prepared to have to train those using the dB in how to write queries that effectively use those partitions and how to read an explain plan to check it.

It does both OLTP and OLAP. Thanks for explanation.

Miganto · 7 Feb 2022 at 10:33

Dolph said:
It's not really a DBA role either, it's the data architect (formally appointed or not) who defines the structure of the data, key relationships, indexes etc. It's also the architects responsibility to ensure that is appropriately documented and available to those who need it (whether they are dbas, analysts or whatever).

Generally most SQL methods are highly design dependent to ensure they are giving you the information you need, so understanding the design and the implications it has on the query processing is vital, hence why I'm not a fan of 'use X, it's easier' type answers. Joins can still throw plenty of unexpected outputs if the underlying data isn't understood. One of the big benefits of exists is that you can never accidentally create a Cartesian multiplication effect through not understanding the data. (The example I used earlier around people and sales wouldn't work with joins without a lot more complexity, as there's a one to many relationship between people and sales that would have to be factored in).

And that's why i gave the guy the answer as well as alternatives due to potential real world situations. Most companies i've been in will have data extractors (for whatever reason, analysts, report writers, etc) that have no control over the database, so having the alternative is a better way of teaching than saying 'this is how it should be and that's that', in case they do happen to have a poorly optimised database and end up having poorly written code because of it. Give people the tools to make decisions, don't just tell them X is the way and that's that.

Secondly, you're again speaking as though you teach in a perfect world - in my world, it's the DBA that designs and keeps the database in check, we had no data architect for decades (only in the last 2 years has that role started to be a thing in this company) and we are using systems that have 10's of thousands of users, hundreds of tables and millions of rows in multiple tables. I've also been in companies that bring their database in from an external supplier (care system, ERP system, health system etc) that start off pre built that you can then alter slightly - again the literal design side already happens and that isn't available. So again these options you supply wouldn't be possible and someone trying to learn from what you are saying would be stuck.

My point is, you seem to be putting your ideas across from an idealistic world, rather than undertanding different people come from different companies with different set ups and situations.

It's a tad annoying that we've both given near identical replies, with external links to help out yet you keep calling me out for things i'm saying that are 100% accurate in some instances of 'the real world' after i've already agreed with you that if you had full control of the database set up, then what you are saying is correct. We're trying to help someone learn, work with me here dude, it's not a contest.

OspreyO · 7 Feb 2022 at 12:12

In our place its a power struggle between a BOFH DBA and arrogant developers who want free reign to do what ever latest fab floats their boat. Getting access to anything is like getting blood out of a stone.

Mrwong · 10 Feb 2022 at 11:52

Hi, I just wanted a simple explanation of A SQL correlated subquery NOT EXISTS. In none techy terms.