Not In for CockroachDB

Have a Database Problem? Speak with an Expert for Free
Get Started >>

Introduction

In this tutorial document, we will investigate how to use Not In for CockroachDB. In this instructional article, we will investigate the following:

  • What? What does “Not In” in CockroachDB do and what is the syntax?
  • How? How do we use this SQL construct in our Cockroach SQL for best effect?
  • Why? When do we make use of this clause? We will learn about using a realistic situation, including use of the Cockroach “WHERE” clause. We will also talk about efficiency and why it is usually better to use “NOT EXISTS” in our Cockroach SQL.

Prerequisites

  • Novice understanding of how to write SQL for CockroachDB (or similar database systems like Oracle, MS SQL Server, and MySQL) using one of the many relational database management tools out there, such as dBeaver, or with languages like Python, Java, PHP, C#, ASP.Net, VB.Net, Ruby, Node.js, B4X, Classic ASP) that provide database connections, as well as a method for sending SQL commands to database tables, to retrieve, insert, or update data.
  • Solid understanding of the use of common SQL statements, including SELECT, FROM, and WHERE statements.
  • Understanding of what table, integer, and text/string are and how they work.

What is Cockroach “NOT IN”?

At times in our Cockroach-reliant applications we may have to distinguish some data from data that is in another dataset. “Show me data that is in this table, EXCEPT any matches that exist in this other data set.”

To start out, we can study the syntax of this SQL tool:

Syntax: “NOT IN” in WHERE

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
SELECT
    intColumn_1
    , intColumn_2
FROM
    tblDatasets
WHERE
    intColumn_1 NOT IN -- match THIS column with...
    (
    SELECT
        intColumn_1 -- THIS column
    FROM
        tblDatasets
    WHERE
        [condition]
    )
ORDER BY intColumn_1

Analysis

In the syntax example above, we are focusing on two data sets that are both pulled from the same table, “tblDatasets”. The high level view is that the “nested” inner data set, the one in parens in the WHERE clause, runs first and the “outer” SQL code is filtered based on the results of that inner, nested one returns. The NOT IN you see here is important. It’s telling CockroachDB, “Only pull rows from tblDatasets where the value in intColumn_1 does not exist in the rowset returned by the nested inner SQL code. Which brings us to how to use the “NOT IN” operator.

Syntax: “NOT EXISTS”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
SELECT
    intColumn_1
    , intColumn_2
FROM
    tblDatasets
WHERE
    NOT EXISTS
    (
    SELECT
        intColumn_1
    FROM
        tblDatasets
    WHERE
        [INNER TABLE COLUMN] = [OUTER TABLE COLUMN]
    )
ORDER BY intColumn_1

Analysis

In this example, we can see CockroachDB NOT EXISTS working very similar to the way NOT IN does. Why “similar”? Because there are big differences between the two methods. Let’s do a comparison of these two methods in order to investigate the differences so you know which method to use in which situation.

Deeper examination of NOT IN

1
2
3
4
5
6
7
8
9
10
11
12
13
SELECT
    t1.intColumn_1
    , t1.intColumn_2
FROM
    tblDatasets_01 AS t1
WHERE
    t1.intColumn_2 NOT IN
    (
    SELECT
        t2.intColumn_5
    FROM
        tblDatasets_02 AS t2
    )

Analysis

CockroachDB’s optimizer can’t make use of when t2.intColumn_5 has a NOT NULL definition, so the data can’t return any NULLs. That’s why for NOT IN SQL Cockroach uses a method called “hashed sub plan”. We will call this a “temp table”.

Note the following:

  • (1) CockroachDB runs the t2 subSQL code sequentially and hashes the resulting dataset. The duplicate results are removed. Creating our “temp table”.
  • (2) The system then takes each record from t1, sequentially again, and limits via comparing every intColumn_1. Here’s how that limiting works: If t1.intColumn_2 = NULL, CockroachDB returns NULL. If t1.intColumn_2 != NULL, the value is looked for in the “temp table” created in (1) above and (a) If the value is found in the “temp table”, the result is TRUE. (b) If the value is not found, the temp table is searched once more, seeking a NULL. If a NULL is found, FALSE is returned.

Another negative aspect of this “hashed temp table” methodology is the only method CockroachDB’s compiler can use for NOT IN.

Since Cockroach doesn’t move a temp table of this type to disk, it guesses the temp table’s size, and if it thinks it won’t fit in working memory, the data will be searched with a loop.

That can cause large inefficiencies because Cockroach’s optimizer will change the strategy when t2 overgrecords a specific threshold, so at some point the SQL code will slow down.

OK so what is the solution? Use “NOT EXISTS” instead of “NOT IN”:

Increase efficiency by using NOT EXISTS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
SELECT
    t1.intColumn_1
    , t1.intColumn_2
FROM
    tblDatasets_01 AS t1
WHERE
    NOT EXISTS
    (
    SELECT
        t2.intColumn_5
    FROM
        tblDatasets_02 AS t2
    WHERE
        t2.intColumn_5 = t1.intColumn_2
    )

Analysis

Like NOT IN, Cockroach runs the t2 subSQL code sequentially and hashes the resulting data. Duplicate results are left out. This again we will call “temp table”. Two differences:

  • Missing values will be looked up only one time. Yay, increased efficiency.
  • This method supports a greater range of JOIN types, including “merge” and loops that are nested.

How and why we use NOT IN or NOT EXISTS

Here’s a “business reality” example. Let’s say you are looking for a more efficient, SQL-based, way to do the following: Pull data from a CockroachDB table into an array, manipulate that array in some way (maybe by sorting), and then use that array to add records back into your initial table. Fortunately, NOT IN provides a far more efficient way of attaining the goal. We will dive in by writing the SQL.

First, we create two tables and add some test data:

tblTechnologies_used

txtNameTechtxtCategoryTechintRated
CockroachDBSQL database90
PythonLanguage90
MS SQL ServerSQL database90
C#Language92

The above Cockroach table is the table used by the company to track which technologies are being used by the company. The table below is filled with potential technologies. Notice the bottom table includes tech that is already in the top table. So when we are building our INSERT INTO SQL code using CockroachDB’s “NOT EXISTS”, we want to be sure to leave out duplicates.

tblTechnologies_proposed

txtNameTechtxtCategoryTechintRated
MongoNoSQL database85
MySQLSQL database50
PythonLanguage90
CockroachDBSQL database90
PHPLanguage70
JavaLanguage75
MS SQL ServerSQL database90
C#Language92
C++Language88
dBaseFlat database25

Building our Cockroach SQL code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
INSERT INTO
    tblTechnologies_used AS txtOuter
    (
    txtNameTech
    , txtCategoryTech
    , intRated
    )
    VALUES
    (
    -- OUTER
    SELECT
       txtOuter.txtNameTech
        , txtOuter.txtCategoryTech
        , txtOuter.intRated
    FROM
        tblTechnologies_proposed
    WHERE
        intRated > 74
        AND
        NOT EXISTS
        (
            -- INNER
            SELECT
                txtInner.txtNameTech
            FROM
                tblTechnologies_used
            WHERE
                (txtOuter.txtNameTech = txtInner.txtNameTech)
            AS txtInner
        )
    AS txtOuter
    )

Analysis

You may have noticed that in the CockroachDB example above that we have a nested SELECT because we wanted to be sure we do not add a record to tblTechnologies_used that is already in that table. This is a perfect use of NOT EXISTS.

  • INSERT INTO: This is where we determine which columns in “tblTechnologies_used” get filled from the first (outer) of our SELECT statements below.
  • SELECT (OUTER): Determines which columns are pulled from the “tblTechnologies_proposed” table.
  • WHERE: Filter results of our subSQL code with two requirements. (1) “intRated” must be larger than 74; and (2) The string/text value in “txtNameTech” does not already exist in “tblTechnologies_used”.
  • NOT EXISTS: Make sure we don’t copy dupe records from “tblTechnologies_proposed” into the “tblTechnologies_used” table.

After we execute that SQL, we get the following data returned from Cockroach:

tblTechnologies_used

txtNameTechtxtCategoryTechintRated
PythonLanguage90
CockroachDBSQL database90
JavaLanguage75
MS SQL ServerSQL database90
C#Language92
C++Language88

Conclusion

In this instructional article we learned how to use “NOT IN” SQL in CockroachDB in order to make sure duplicates don’t already exist or get added to our tables. We then learned of an even more efficient way of attaining this goal via the “NOT EXISTS” methodology. We also built a “business reality” example to assist in easy understanding of how this all works and how to use it in your projects.

Pilot the ObjectRocket Platform Free!

Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.

Get Started

Keep in the know!

Subscribe to our emails and we’ll let you know what’s going on at ObjectRocket. We hate spam and make it easy to unsubscribe.