Use of "Not In" SQL in Postgres

Introduction

In this article, we’ll explore how to use “Not In” SQL in Postgres. In this tutorial, we’ll explore the following:

  • What? What does “Not In” in PostgreSQL do and what’s the syntax?
  • How? How do we use this SQL construct in our Postgres queries for best effect?
  • Why? When do we make use of this clause? We’ll study using a realistic situation, including use of the Postgres “WHERE” clause. We’ll also talk about efficiency and why it is usually better to use “NOT EXISTS” in our Postgres queries.

Prerequisites

  • Basic knowledge of how to write SQL for PostgreSQL (or similar database systems like Oracle, MS SQL Server, and MySQL) using one of the many relational database management tools out there, such as PGAdmin, dBeaver, or with languages like Javascript, Python, Java, PHP, C#, ASP.Net, VB.Net, Ruby, Node.js, B4X, Classic ASP) that provide database connections, as well as a method for sending SQL queries to database tables, to retrieve, insert, or update data.
  • Understanding of the use of common SQL statements, including SELECT, FROM, and WHERE statements.
  • Knowledge of what table, integer, and text/string are and how they work.

What is “NOT IN”?

At times in our Postgres-reliant applications we may need to distinguish some data from data that is in another dataset. “Show me data that is in this table, EXCEPT any matches that exist in this other data set.”

To begin, we can examine the syntax of this SQL tool:

Syntax 1: "NOT IN" in WHERE

SELECT
    i_column_1
    , i_column_2
FROM
    tbl_dataset
WHERE
    i_column_1 NOT IN -- match THIS column with...
    (
    SELECT
        i_column_1 -- THIS column
    FROM
        tbl_dataset
    WHERE
        [condition]
    )
ORDER BY i_column_1

Analysis

In the syntax example above, we are looking at two data sets that are both pulled from the same table, “tbldataset”. The high level view is that the “nested” inner data set, the one in parens in the WHERE clause, runs first and the “outer” query is filtered based on the results of that inner, nested one returns. The NOT IN you see here is important. It’s telling PostgreSQL, “Only pull records from tbl_dataset where the value in i_column_1 does _not exist in the recordset returned by the nested inner query. Which brings us to how to use the “NOT IN” operator.

Syntax 2: "NOT IN" alternative: NOT EXISTS

SELECT
    i_column_1
    , i_column_2
FROM
    tbl_dataset
WHERE
    NOT EXISTS
    (
    SELECT
        i_column_1
    FROM
        tbl_dataset
    WHERE
        [INNER TABLE COLUMN] = [OUTER TABLE COLUMN]
    )
ORDER BY i_column_1

Analysis

In this example, we can see NOT EXISTS working almost exactly like NOT IN. Why “almost”? Because there are fundamental differences between the two methods. Let’s do a comparison of the two methods in order to explore the differences so you know which to use in which situation.

NOT IN deeper examination

SELECT
    t1.i_column_1
    , t1.i_column_2
FROM
    tbl_dataset_01 AS t1
WHERE
    t1.i_column_2 NOT IN
    (
    SELECT
        t2.i_column_5
    FROM
        tbl_dataset_02 AS t2
    )

Analysis

PostgreSQL’s optimizer can’t make use of when t2.i_column_5 has a NOT NULL definition, so the data can’t return any NULLs.

That’s why for NOT IN queries Postgres uses a method called “hashed sub plan”. We will call this “temp table”.

  • (1) PG runs the t2 subquery sequentially and hashes the resulting dataset. The duplicate results are removed. Creating our “temp table”.
  • (2) The system then takes each row from t1, sequentially again, and limits via comparing every i_column_1. Here’s how that limiting works: If t1.i_column_2 = NULL, PG returns NULL. If t1.i_column_2 != NULL, the value is looked for in the “temp table” created in (1) above and (a) If the value is found in the “temp table”, the result is TRUE. (b) If the value is not found, the temp table is searched once more, seeking a NULL. If a NULL is found, FALSE is returned.

Another negative aspect of this “hashed temp table” methodology is the only method PostgreSQL’s compiler can use for NOT IN.

Since Postgres doesn’t move a temp table of this type to disk, it guesses the temp table’s size, and if it thinks it won’t fit in working memory, the data will be searched with a loop.

That can cause large inefficiencies because Postgres’ optimizer will change the strategy when t2 overgrows a specific threshold, so at some point the query will slow down.

OK so what is the solution? Use “NOT EXISTS” instead of “NOT IN”:

Increase efficiency by using NOT EXISTS

SELECT
    t1.i_column_1
    , t1.i_column_2
FROM
    tbl_dataset_01 AS t1
WHERE
    NOT EXISTS
    (
    SELECT
        t2.i_column_5
    FROM
        tbl_dataset_02 AS t2
    WHERE
        t2.i_column_5 = t1.i_column_2
    )

Analysis

Like NOT IN, Postgres runs the t2 subquery sequentially and hashes the resulting data. Duplicate results are left out. This again we will call “temp table”. Two differences:

  • Missing values will be looked up only one time. Yay, increased efficiency.
  • This method supports a greater range of JOIN types, including “merge” and loops that are nested.

How and why we use NOT IN or NOT EXISTS

Here’s a real world example. Let’s say you are looking for a more efficient, SQL-based, way to do the following: Pull data from a table into an array, manipulate that array in some way (perhaps sorting), and then use that array to add rows back into your initial table. Fortunately, NOT IN provides a far more efficient way of accomplishing the task. We’ll dive in by writing the SQL.

First, we create two tables and add some test data:

tbl_technologies_used | t_name_tech | t_category_tech | i_rating | |—————|—————–|———:| | Javascript | Language | 95 | | PostgreSQL | SQL database | 90 | | Python | Language | 90 | | MS SQL Server | SQL database | 90 | | C# | Language | 92 |

The above table is the table used by the company to track which technologies are being used by the company. The table below is filled with potential technologies. Notice the bottom table includes tech that is already in the top table. So when we are building our INSERT INTO query using PostgreSQL’s “NOT EXISTS”, we want to be sure to leave out duplicates.

tbl_technologies_proposed | t_name_tech | t_category_tech | i_rating | |—————|—————–|———:| | Javascript | Language | 95 | | Mongo | NoSQL database | 85 | | MySQL | SQL database | 50 | | PostgreSQL | SQL database | 90 | | Python | Language | 90 | | PHP | Language | 70 | | Java | Language | 75 | | MS SQL Server | SQL database | 90 | | C# | Language | 92 | | C++ | Language | 88 | | dBase | Flat database | 25 |

Building our Postgres query

INSERT INTO
    tbl_technologies_used AS t_outer
    (
    t_name_tech
    , t_category_tech
    , i_rating
    )
    VALUES
    (
    -- OUTER
    SELECT
       t_outer.t_name_tech
        , t_outer.t_category_tech
        , t_outer.i_rating
    FROM
        tbl_technologies_proposed
    WHERE
        i_rating > 74
        AND
        NOT EXISTS
        (
            -- INNER
            SELECT
                t_inner.t_name_tech
            FROM
                tbl_technologies_used
            WHERE
                (t_outer.t_name_tech = t_inner.t_name_tech)
            AS t_inner
        )
    AS t_outer
    )

Analysis

You may have noticed in the above example that we have a nested SELECT because we want to be sure we do not add a row to tbl_technologies_used that is already in that table. This is a classicly perfect use of NOT EXISTS.

  • INSERT INTO: This is where we determine which columns in “tbl_technologies_used” get filled from the first (outer) of our SELECT statements below.
  • SELECT (OUTER): Determines which fields (columns) are pulled from “tbl_technologies_proposed”.
  • WHERE: Limit results of our subquery by two requirements. (1) “i_rating” must be larger than 74; and (2) The string/text value in “t_name_tech” does not already exist in “tbl_technologies_used”.
  • NOT EXISTS: Make sure we don’t copy dupe rows from “tbl_technologies_proposed” into “tbl_technologies_used”.

After we execute that SQL, we get the following data:

tbl_technologies_used | t_name_tech | t_category_tech | i_rating | |—————|—————–|———:| | Javascript | Language | 95 | | PostgreSQL | SQL database | 90 | | Python | Language | 90 | | Java | Language | 75 | | MS SQL Server | SQL database | 90 | | C# | Language | 92 | | C++ | Language | 88 |

Miscellaneous

  • If you noticed we prefixed some of our fields, variables, and columns with “tbl“, “i“, and “t“: We used “tbl” to mean table, “i” to designate integer, and “t” to mean string/text. Here is an article on that.

Conclusion

In this tutorial we learned here how to use “NOT IN” SQL in Postgres in order to make sure duplicates don’t already exist or get added to our data set. We then learned of an even more efficient way of accomplishing this task via the “NOT EXISTS” methodology. We also built a real world example to assist in easy understanding of how this all works and how to use it in your development projects.

Pilot the ObjectRocket Platform Free!

Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.

Get Started

Keep in the know!

Subscribe to our emails and we’ll let you know what’s going on at ObjectRocket. We hate spam and make it easy to unsubscribe.