Generate Postgres Test Data with Python (Part 1)

Introduction

Sometimes creating test data for an SQL database, like PostgreSQL, can be time-consuming and a pain. This article will show you how to write a simple Python script that will automatically generate an SQL string, of randomized Postgres data, that can be copied and pasted into psql, saved as a .json file, or executed within the Python script itself using the psycopg2 adapter library for PostgreSQL.

This article is part one of a two-part series that will demonstrate how to create, and subsequently insert, the randomized data into a Postgres table.

Prerequisites

  • Python 3 needs to be installed and working. Use the python3 -V command in a terminal or command prompt to have it return its version number, or use which python3 to check the installation location.

  • Use the PIP package manager to install the psycopg2 Python library for Postgres if you intend to insert the created records inside of the Python script itself. Use pip3 -V to verify that PIP is installed, and then use pip3 install psycopg2 to install the Postgres adapter.

  • Make sure Postgres and psql are installed and working. In systemd distros of Linux (like Ubuntu) you can use the systemctl status postgresql.service command to check if the Postgres service is running. In macOS you can install Postgres using Homebrew and then use the brew services start postgresql command to start the service.

Connect to Postgres

You’ll have to connect to Postgres in order to create a database and table for the records generated by the Python code in this article. Either create them in your Python script, using the psycopg2 adapter, or use the psql command line interface to execute the SQL statements.

Create a table in Postgres

In either case, the following CREATE TABLE SQL statement needs to be executed in order to insert the test data generated by the Python script:

1
2
3
4
5
6
7
CREATE TABLE python_test (
    id UUID PRIMARY KEY,
    str_col VARCHAR(500),
    int_col SMALLINT,
    bool_col BOOLEAN,
    float_col DECIMAL
);

NOTE: If you change the python_test table name you’ll have to make sure to specify the new name in your code. The code in this article uses a database called python_data, so you’ll have to create that as well (using a CREATE DATABASE SQL statement) if you plan on using the code as-is.

Use psql to connect to a database

You can just type psql, but it may return an error saying “role does not exist”. If you already have a user configured just use the -U option for the Postgres username, and the -d option for database name, like in the following example, to connect to your preferred database:

1
psql -U orkb -d python_data

You can also just connect to psql using the postgres superuser by executing the following command in your terminal:

1
sudo -u postgres psql

Once you’re inside of psql be sure to execute the above CREATE TABLE SQL statement. You can use the \q command to quit and return to the terminal prompt.

Use Python to connect to PostgreSQL

You can also connect to Postgres, and execute SQL statements to modify your database with Python code by importing the psycopg2 library and passing a string to a cursor object’s execute() method.

The following line of Python will import the necessary modules from the library:

1
from psycopg2 import sql, Error, connect

Once you’ve imported the library you can use the connect() method library to connect to your Postgres server like in the following example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
try:
    # declare a new PostgreSQL connection object
    conn = connect(
        dbname = "python_data",
        user = "orkb",
        host = "localhost",
        password = "1234",
        # attempt to connect for 3 seconds then raise exception
        connect_timeout = 3
    )

    cur = conn.cursor()
    print ("\ncreated cursor object:", cur)

except Error as err:
    print ("\npsycopg2 connect error:", err)
    conn = None
    cur = None

NOTE: Make sure to change the values above to reflect your username, database, password, and domain name (or IP address).

The above code will set the cur cursor object to None if there were any problems connecting.

Once the script has connected you can execute an SQL string using the following code:

1
2
3
4
5
6
7
8
9
10
11
# only attempt to execute SQL if cursor is valid
if cur != None:

    try:
        sql_resp = cur.execute( sql_str )
        conn.commit()
        print ('finished INSERT INTO execution')

    except (Exception, Error) as error:
        print("\nexecute_sql() error:", error)
        conn.rollback()

Create a Python script

Make sure to use an IDE like VS Code or Atom that supports Python syntax and indentation and create a new Python script with the .py file extension. You can also use the touch command in a terminal window to do this like in the following example:

1
touch my_script.py

Python code to generate PostgreSQL test data

You’ll need to import the following built-in Python libraries at the top of your script before you can create the function to randomly generate data:

1
import random, uuid, time, json, sys

Now define a function that will accept a list of custom strings and an integer as parameters that will ultimately return a joined string made up of random words:

1
2
3
4
5
6
7
8
9
def get_random_words(words_list, total=1):
    ran_words = []

    # enumerate over specified number of words
    while len(ran_words) < total:
        ran_words += [words_list[random.randint(0, len(words_list)-1)]]

    # return a string by joining the list of words
    return ' '.join(ran_words)

Now we’re going to nest the above get_random_words() function into another function that’s designed to randomly generate data, and put it into a dictionary objects that will represent multiple PostgreSQL records.

Python function that creates Postgres records

Define another function that will return a list called records that contains Python dictionary objects of randomized data:

1
2
3
4
5
6
7
8
def create_postgres_json(size):

    # list to store JSON records
    records = []

    # random words to inject into records
    words = ["ObjectRocket's tutorials", 'orkb', "orkb's tutorials",
    "Postgres' psycopg2 adapter", "data's positronic brain", 'type', 'examples', 'foo', 'bar']

The above function declaration accepts an integer as an argument that will be used to specify the number of Postgres records being created.

In the next part of the function you’ll need to iterate over a range of the integer passed to the function call in order to create a dict object for each record:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
    # iterate over the number of records being created
    for rec_id in range(size):

        # create a new record dict
        new_record = {}

        # input a value for each table column
        new_record[ 'id' ] = uuid.uuid4().hex
        new_record[ 'str_col' ] = get_random_words( words, random.randint(1, len(words)) )
        new_record[ 'int_col' ] = random.randint(1, 6000)
        new_record[ 'bool_col' ] = [True, False][random.randint(0, 1)]
        new_record[ 'float_col' ] = random.uniform(1.5, 99.9)

        # append the new record dict to the list
        records += [ new_record ]

    # return the list of JSON records
    return records

NOTE: The id namespace is reserved in Python for a built-in function, so, if you’re planning to declare a variable called id, use an underscore or otherwise distinguish it somehow from the built-in Python function.

The above code will generate a random UUID for each record’s ID, using the uuid.uuid4() method call, and then it casts it as a string by calling the method’s hex attribute.

You can now call the function by passing an integer to it to specify the desired number of records you’d like to generate. The following code is an example of how you can call the function and have it return 3 Postgres records in a Python list that represents a JSON array:

1
json_records = create_postgres_json(3)

Python dict to JSON

You can also pass the list of dictionary objects to the json.dumps() method to convert the Python object into a JSON string:

1
2
3
4
5
# use the JSON library to convert JSON array into a Python string
json_records_str = json.dumps(json_records, indent=4)

print ("\nPostgres records JSON string:")
print (json_records_str)

Screenshot of a function to generate Postgres test data in Python

Use Python to write JSON to a file

One great use case for the json.dumps() method is being able to create a JSON string of your data in order to write it to a .json file. The following code is an example of how you can write the test data as a language agnostic JSON string to a file:

1
2
3
# save the generated Postgres records in a JSON file
with open('postgres-records.json', 'w') as output_file:
    output_file.write(json_records_str)

NOTE: The above write() method will raise a TypeError if you attempt to pass an object to it that’s not a string.

Conclusion

At this point in the series we’ve created some Python code that will randomly generate a JSON data array, but it still needs to be properly formatted and modified, to conform to SQL standards, so that it can be pasted into psql, or executed via the psycopg2 adapter. In the last installment of this series we’ll write another function to iterate and parse the data in order to accomplish this. Here is the link to part 2 of this tutorial:

Part 2

Just the Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
#!/usr/bin/python3
# -*- coding: utf-8 -*-

# import libraries to generate data
import random, uuid, time, json, sys
from psycopg2 import sql, Error, connect

# requires 'random' library
def get_random_words(words_list, total=1):
    ran_words = []

    # enumerate over specified number of words
    while len(ran_words) < total:
        ran_words += [words_list[random.randint(0, len(words_list)-1)]]

    # return a string by joining the list of words
    return ' '.join(ran_words)

def create_postgres_json(size):

    # list to store JSON records
    records = []

    # random words to inject into records
    words = ["ObjectRocket's tutorials", 'orkb', "orkb's tutorials",
    "Postgres' psycopg2 adapter", "data's positronic brain", 'type', 'examples', 'foo', 'bar']

    # iterate over the number of records being created
    for rec_id in range(size):

        # create a new record dict
        new_record = {}

        # input a value for each table column
        new_record[ 'id' ] = uuid.uuid4().hex
        new_record[ 'str_col' ] = get_random_words( words, random.randint(1, len(words)) )
        new_record[ 'int_col' ] = random.randint(1, 6000)
        new_record[ 'bool_col' ] = [True, False][random.randint(0, 1)]
        new_record[ 'float_col' ] = random.uniform(1.5, 99.9)

        # append the new record dict to the list
        records += [ new_record ]

    # return the list of JSON records
    return records


print ("\n")

# convert the string back to a dict (JSON) object
json_records = json.loads(json_records_str)


# allow the table name to be passed to the script
if len(sys.argv) > 1:
    table_name = '_'.join(sys.argv[1:])
else:
    # otherwise use a default table name
    table_name = 'test_table'

print ('\nsql_str:')
print (sql_str)

# generate records for Postgres
json_records = create_postgres_json(3)

# use the JSON library to convert JSON array into a Python string
json_records_str = json.dumps(json_records, indent=4)

print ("\nPostgres records JSON string:")
print (json_records_str)

# save the generated Postgres records in a JSON file
with open('postgres-records.json', 'w') as output_file:
    output_file.write(json_records)

Pilot the ObjectRocket Platform Free!

Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.

Get Started

Keep in the know!

Subscribe to our emails and we’ll let you know what’s going on at ObjectRocket. We hate spam and make it easy to unsubscribe.