Not In for CockroachDB
Introduction
In this tutorial document, we will investigate how to use Not In for CockroachDB. In this instructional article, we will investigate the following:
- What? What does “Not In” in CockroachDB do and what is the syntax?
- How? How do we use this SQL construct in our Cockroach SQL for best effect?
- Why? When do we make use of this clause? We will learn about using a realistic situation, including use of the Cockroach “WHERE” clause. We will also talk about efficiency and why it is usually better to use “NOT EXISTS” in our Cockroach SQL.
Prerequisites
- Novice understanding of how to write SQL for CockroachDB (or similar database systems like Oracle, MS SQL Server, and MySQL) using one of the many relational database management tools out there, such as dBeaver, or with languages like Python, Java, PHP, C#, ASP.Net, VB.Net, Ruby, Node.js, B4X, Classic ASP) that provide database connections, as well as a method for sending SQL commands to database tables, to retrieve, insert, or update data.
- Solid understanding of the use of common SQL statements, including SELECT, FROM, and WHERE statements.
- Understanding of what table, integer, and text/string are and how they work.
What is Cockroach “NOT IN”?
At times in our Cockroach-reliant applications we may have to distinguish some data from data that is in another dataset. “Show me data that is in this table, EXCEPT any matches that exist in this other data set.”
To start out, we can study the syntax of this SQL tool:
Syntax: “NOT IN” in WHERE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | SELECT intColumn_1 , intColumn_2 FROM tblDatasets WHERE intColumn_1 NOT IN -- match THIS column with... ( SELECT intColumn_1 -- THIS column FROM tblDatasets WHERE [condition] ) ORDER BY intColumn_1 |
Analysis
In the syntax example above, we are focusing on two data sets that are both pulled from the same table, “tblDatasets”. The high level view is that the “nested” inner data set, the one in parens in the WHERE clause, runs first and the “outer” SQL code is filtered based on the results of that inner, nested one returns. The NOT IN
you see here is important. It’s telling CockroachDB, “Only pull rows from tblDatasets where the value in intColumn_1 does not exist in the rowset returned by the nested inner SQL code. Which brings us to how to use the “NOT IN” operator.
Syntax: “NOT EXISTS”
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | SELECT intColumn_1 , intColumn_2 FROM tblDatasets WHERE NOT EXISTS ( SELECT intColumn_1 FROM tblDatasets WHERE [INNER TABLE COLUMN] = [OUTER TABLE COLUMN] ) ORDER BY intColumn_1 |
Analysis
In this example, we can see CockroachDB NOT EXISTS working very similar to the way NOT IN does. Why “similar”? Because there are big differences between the two methods. Let’s do a comparison of these two methods in order to investigate the differences so you know which method to use in which situation.
Deeper examination of NOT IN
1 2 3 4 5 6 7 8 9 10 11 12 13 | SELECT t1.intColumn_1 , t1.intColumn_2 FROM tblDatasets_01 AS t1 WHERE t1.intColumn_2 NOT IN ( SELECT t2.intColumn_5 FROM tblDatasets_02 AS t2 ) |
Analysis
CockroachDB’s optimizer can’t make use of when t2.intColumn_5 has a NOT NULL definition, so the data can’t return any NULLs. That’s why for NOT IN SQL Cockroach uses a method called “hashed sub plan”. We will call this a “temp table”.
Note the following:
- (1) CockroachDB runs the t2 subSQL code sequentially and hashes the resulting dataset. The duplicate results are removed. Creating our “temp table”.
- (2) The system then takes each record from t1, sequentially again, and limits via comparing every intColumn_1. Here’s how that limiting works: If t1.intColumn_2 = NULL, CockroachDB returns NULL. If t1.intColumn_2 != NULL, the value is looked for in the “temp table” created in (1) above and (a) If the value is found in the “temp table”, the result is TRUE. (b) If the value is not found, the temp table is searched once more, seeking a NULL. If a NULL is found, FALSE is returned.
Another negative aspect of this “hashed temp table” methodology is the only method CockroachDB’s compiler can use for NOT IN.
Since Cockroach doesn’t move a temp table of this type to disk, it guesses the temp table’s size, and if it thinks it won’t fit in working memory, the data will be searched with a loop.
That can cause large inefficiencies because Cockroach’s optimizer will change the strategy when t2 overgrecords a specific threshold, so at some point the SQL code will slow down.
OK so what is the solution? Use “NOT EXISTS” instead of “NOT IN”:
Increase efficiency by using NOT EXISTS
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | SELECT t1.intColumn_1 , t1.intColumn_2 FROM tblDatasets_01 AS t1 WHERE NOT EXISTS ( SELECT t2.intColumn_5 FROM tblDatasets_02 AS t2 WHERE t2.intColumn_5 = t1.intColumn_2 ) |
Analysis
Like NOT IN, Cockroach runs the t2 subSQL code sequentially and hashes the resulting data. Duplicate results are left out. This again we will call “temp table”. Two differences:
- Missing values will be looked up only one time. Yay, increased efficiency.
- This method supports a greater range of JOIN types, including “merge” and loops that are nested.
How and why we use NOT IN or NOT EXISTS
Here’s a “business reality” example. Let’s say you are looking for a more efficient, SQL-based, way to do the following: Pull data from a CockroachDB table into an array, manipulate that array in some way (maybe by sorting), and then use that array to add records back into your initial table. Fortunately, NOT IN provides a far more efficient way of attaining the goal. We will dive in by writing the SQL.
First, we create two tables and add some test data:
tblTechnologies_used
txtNameTech | txtCategoryTech | intRated |
---|---|---|
CockroachDB | SQL database | 90 |
Python | Language | 90 |
MS SQL Server | SQL database | 90 |
C# | Language | 92 |
The above Cockroach table is the table used by the company to track which technologies are being used by the company. The table below is filled with potential technologies. Notice the bottom table includes tech that is already in the top table. So when we are building our INSERT INTO SQL code using CockroachDB’s “NOT EXISTS”, we want to be sure to leave out duplicates.
tblTechnologies_proposed
txtNameTech | txtCategoryTech | intRated |
---|---|---|
Mongo | NoSQL database | 85 |
MySQL | SQL database | 50 |
Python | Language | 90 |
CockroachDB | SQL database | 90 |
PHP | Language | 70 |
Java | Language | 75 |
MS SQL Server | SQL database | 90 |
C# | Language | 92 |
C++ | Language | 88 |
dBase | Flat database | 25 |
Building our Cockroach SQL code
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | INSERT INTO tblTechnologies_used AS txtOuter ( txtNameTech , txtCategoryTech , intRated ) VALUES ( -- OUTER SELECT txtOuter.txtNameTech , txtOuter.txtCategoryTech , txtOuter.intRated FROM tblTechnologies_proposed WHERE intRated > 74 AND NOT EXISTS ( -- INNER SELECT txtInner.txtNameTech FROM tblTechnologies_used WHERE (txtOuter.txtNameTech = txtInner.txtNameTech) AS txtInner ) AS txtOuter ) |
Analysis
You may have noticed that in the CockroachDB example above that we have a nested SELECT because we wanted to be sure we do not add a record to tblTechnologies_used that is already in that table. This is a perfect use of NOT EXISTS.
- INSERT INTO: This is where we determine which columns in “tblTechnologies_used” get filled from the first (outer) of our SELECT statements below.
- SELECT (OUTER): Determines which columns are pulled from the “tblTechnologies_proposed” table.
- WHERE: Filter results of our subSQL code with two requirements. (1) “intRated” must be larger than 74; and (2) The string/text value in “txtNameTech” does not already exist in “tblTechnologies_used”.
- NOT EXISTS: Make sure we don’t copy dupe records from “tblTechnologies_proposed” into the “tblTechnologies_used” table.
After we execute that SQL, we get the following data returned from Cockroach:
tblTechnologies_used
txtNameTech | txtCategoryTech | intRated |
---|---|---|
Python | Language | 90 |
CockroachDB | SQL database | 90 |
Java | Language | 75 |
MS SQL Server | SQL database | 90 |
C# | Language | 92 |
C++ | Language | 88 |
Conclusion
In this instructional article we learned how to use “NOT IN” SQL in CockroachDB in order to make sure duplicates don’t already exist or get added to our tables. We then learned of an even more efficient way of attaining this goal via the “NOT EXISTS” methodology. We also built a “business reality” example to assist in easy understanding of how this all works and how to use it in your projects.
Pilot the ObjectRocket Platform Free!
Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.
Get Started