Use psql to Insert a Record with Unicode Characters

Introduction to inserting PostgreSQL records with Unicode characters

If you try to insert a record into a PostgreSQL table using the psql command-line interface, and the record data contains a Unicode character (outside of the 128-character ASCII range) that your terminal or client encoding doesn't deliver as valid UTF-8 bytes, then psql will return an error like the following:

ERROR:  invalid byte sequence for encoding "UTF8":

This article will show you how to properly concatenate Unicode strings using SQL, while connected to a PostgreSQL database, in order to insert the Unicode data into a table’s string or text column.

Supported character sets in PostgreSQL

The following is a list of all of the supported character sets for PostgreSQL:

BIG5, EUC_CN, EUC_JP, EUC_JIS_2004, EUC_KR, EUC_TW, GB18030, GBK, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, JOHAB, KOI8R, KOI8U, LATIN1, LATIN2, LATIN3, LATIN4, LATIN5, LATIN6, LATIN7, LATIN8, LATIN9, LATIN10, MULE_INTERNAL, SJIS, SHIFT_JIS_2004, SQL_ASCII, UHC, UTF8, WIN866, WIN874, WIN1250, WIN1251, WIN1252, WIN1253, WIN1254, WIN1255, WIN1256, WIN1257, WIN1258.

NOTE: Use the SHOW client_encoding; SQL command, while connected with psql, to have Postgres return the client encoding currently in use for your session.
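For example, you can check the encoding psql is using for the current session, and change it if it doesn't match your terminal; the values shown in the comments are typical defaults and may differ on your system:

SHOW client_encoding;            -- encoding psql sends and expects, e.g. UTF8
SHOW server_encoding;            -- encoding the connected database was created with
SET client_encoding = 'UTF8';    -- change the client encoding for this session only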

The default character set for a new database is typically UTF8, the Unicode encoding that covers the entire Unicode character repertoire. If you’d like to create a database using another, more specific, character set, then make sure to use the WITH ENCODING clause.

The following example SQL statement will create a database that uses the KOI8R Russian Cyrillic character set:

CREATE DATABASE "some_database" WITH ENCODING 'KOI8R';

Set the default character encoding set for the PostgreSQL cluster

The following example shows how you can initialize an entire PostgreSQL cluster with UTF-8 encoding and a Korean (Hangul) locale as the defaults:

initdb --pgdata=/usr/local/var/postgres -E 'UTF-8' --lc-collate='ko_KR.UTF-8' --lc-ctype='ko_KR.UTF-8'

NOTE: The --pgdata option (short form: -D) specifies the data directory for the database cluster. If you don’t know where your Postgres data is stored you can use the SHOW data_directory; command while connected with a Postgres SUPERUSER role to have it return the directory location for you.
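For example, a superuser can look up the data directory like this (the path in the comment is just a placeholder; yours will differ):

SHOW data_directory;   -- e.g. /usr/local/var/postgres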

Prerequisites to inserting record data into a PostgreSQL table

Make sure that the Postgres service is installed and running. The example Postgres SQL statements found in this article rely on the psql command-line interface.

Get the version numbers for Postgres and psql

You can use psql -V and postgres -V to get the version numbers for psql and Postgres respectively. While connected to PostgreSQL with psql you can use the \conninfo command to have Postgres return information about the current connection, including the server’s port (default is 5432) and database.

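For example, while connected with psql you can run the following; the exact output depends on your installation, so treat the comments as placeholders:

\conninfo           -- prints the connected database, user, socket or host, and port
SELECT version();   -- returns the full PostgreSQL server version string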

Create a user role for the Postgres database with Unicode records

Use the psql postgres command to connect to the default postgres database (as a superuser, such as the postgres user), and then create a user role for the database that will store the Unicode characters. The following example creates a new role with SUPERUSER privileges and an encrypted password:

CREATE ROLE objectrocket WITH SUPERUSER CREATEDB CREATEROLE LOGIN ENCRYPTED PASSWORD '1234';
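You can confirm that the role exists, and check its attributes, by querying the pg_roles system catalog:

SELECT rolname, rolsuper, rolcreatedb, rolcanlogin
FROM pg_roles
WHERE rolname = 'objectrocket';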

Grant privileges to the user for all the tables in the public schema

This next command will grant all privileges to the user for all tables in the public schema:

GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO objectrocket;

Also grant the user privileges to modify the records in the target database:

GRANT ALL PRIVILEGES ON DATABASE some_database TO objectrocket;
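Keep in mind that GRANT ... ON ALL TABLES IN SCHEMA only covers tables that already exist. If you also want the role to receive the same privileges on tables that the current role creates later in the public schema, you can optionally set default privileges as well:

ALTER DEFAULT PRIVILEGES IN SCHEMA public
GRANT ALL PRIVILEGES ON TABLES TO objectrocket;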

Create a new PostgreSQL table for the Unicode record data

Let’s create a table for the Unicode string data. The concatenated string data will be inserted into the table’s str VARCHAR(64) column. The following SQL statement creates a table with four columns:

CREATE TABLE some_table (
    id UUID PRIMARY KEY,
    str VARCHAR(64),
    num INTEGER,
    bool BOOLEAN
);
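Note that PostgreSQL’s UUID type accepts input with or without hyphens, which is why the INSERT statement later in this article can use plain 32-character hex strings for the id column. Both of the following literals are valid and represent the same value:

SELECT '73467defd4fe46bbbaaedd5856d1b131'::uuid;
SELECT '73467def-d4fe-46bb-baae-dd5856d1b131'::uuid;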

Get the four-digit Unicode escape sequence for the character

You’ll need to convert the special Unicode character into a Unicode escape sequence made up of four hexadecimal digits (e.g. the escape '\u4f60' for the character '你').

The following Python code, run in an interactive Python 3 session, converts some strings into their respective Unicode escape sequences:

'你好,世界'.encode("unicode_escape")
'Jag kan också köpa ost'.encode("unicode_escape")
'Grimms Märchen sind großartig'.encode("unicode_escape")
'The EU uses the €, and Britain uses the £'.encode("unicode_escape")

NOTE: A code point written in the U+NNNN notation (U+0000, for example, represents the null character) is written in Postgres escape form as \uNNNN (e.g. \u0000). Be aware that the null character itself cannot actually be stored in a PostgreSQL text value.

Screenshot of IDLE for Python 3 returning the Unicode escape sequence of string data

The above Python code should return the following respective Unicode escape sequences as bytes strings:

b'\\u4f60\\u597d\\uff0c\\u4e16\\u754c'
b'Jag kan ocks\\xe5 k\\xf6pa ost'
b'Grimms M\\xe4rchen sind gro\\xdfartig'
b'The EU uses the \\u20ac, and Britain uses the \\xa3'

Convert a two-digit escape into a four-digit Unicode code point

The E'' escape syntax in psql only accepts Unicode escapes written with four hexadecimal digits (\uNNNN), or the eight-digit \UNNNNNNNN form for characters outside the Basic Multilingual Plane. If you only have the two-digit form (\xNN) you’ll have to convert it to a four-digit Unicode code point by replacing the \x with \u and padding the value with two leading zeros (e.g. \u00NN).

The Swedish letter "å", for example, is represented by the two-digit escape '\xe5', so its four-digit Unicode escape form would be '\u00E5'.
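If you’d rather stay inside psql, you can also look up a character’s code point with PostgreSQL’s ascii(), to_hex(), and chr() functions (in a UTF8 database, ascii() returns the Unicode code point), assuming your terminal can paste the character correctly in the first place:

SELECT lpad(to_hex(ascii('å')), 4, '0');   -- returns 00e5
SELECT chr(229) = E'\u00e5';               -- returns t, confirming the escape maps back to 'å'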

Using psql to insert a record with Unicode characters

Now that you have the proper four-digit Unicode escape sequences for your Unicode characters, you can insert them into the table created earlier.

Concatenate a string using the CONCAT SQL function

We’re going to call the CONCAT() function in SQL to concatenate a string from the Unicode escape sequences while we insert the record data into the table.

The following SQL expression uses the E'' escape syntax to build the Unicode string '你好，世界', which means “Hello, World” in Mandarin Chinese:

CONCAT(E'\u4f60', E'\u597d', E'\uff0c', E'\u4e16', E'\u754c')
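CONCAT() isn’t strictly required here, because psql also accepts several \uNNNN escapes inside a single E'' literal. Either form can be tested with a quick SELECT before running the INSERT:

SELECT CONCAT(E'\u4f60', E'\u597d', E'\uff0c', E'\u4e16', E'\u754c');
SELECT E'\u4f60\u597d\uff0c\u4e16\u754c';   -- same string as a single escaped literal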

Insert the concatenated Unicode string into one of the PostgreSQL table columns

Call the CONCAT() function inside the VALUES list of the INSERT INTO statement in order to insert the Unicode string data into the PostgreSQL table:

INSERT INTO
  some_table (id, str, num, bool)
  VALUES
    ('73467defd4fe46bbbaaedd5856d1b131',
    CONCAT(E'\u4f60', E'\u597d', E'\uff0c', E'\u4e16', E'\u754c'),
    10,
    TRUE),
    ('f63bfdd9a7874017a4823457835cea4d',
    CONCAT('Jag kan ocks', E'\u00E5', ' k', E'\u00f6', 'pa ost'),
    42,
    FALSE),
    ('17d941de79314eb7a83bda27b1f2321c',
    CONCAT('Grimms M', E'\u00e4', 'rchen sind gro', E'\u00df', 'artig'),
    12,
    TRUE),
    ('4f079806297e45c5bb66ff1e611e94b6',
    CONCAT('The EU uses the ', E'\u20ac', ', and Britain uses the ', E'\u00a3'),
    123,
    FALSE);

The above SQL statement should return the following result:

INSERT 0 4
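You can verify that the Unicode strings were stored intact by selecting them back out of the table; given the rows inserted above, you should see output like the following:

SELECT str FROM some_table;
--  你好，世界
--  Jag kan också köpa ost
--  Grimms Märchen sind großartig
--  The EU uses the €, and Britain uses the £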


Conclusion to inserting records with Unicode characters into PostgreSQL

The psql interface can sometimes be a little tricky when trying to use Unicode. Many of the PostgreSQL client libraries have built-in Unicode support to help you avoid much of the hassle of concatenating UTF-8 code points into strings, but at least now you have a better idea of how Unicode and the different encodings (character sets) work in the psql interface for PostgreSQL.
