Use Psql to Insert a Record with Unicode Characters
Introduction to inserting PostgreSQL records with Unicode characters
If you try to insert a record into a PostgreSQL table using the psql command-line interface, and the record data contains a Unicode character (outside of the 128-character ASCII range), psql may return the following error:
ERROR: invalid byte sequence for encoding "UTF8":
This article will show you how to properly concatenate Unicode strings using SQL, while connected to a PostgreSQL database, in order to insert the Unicode data into a table’s string or text column.
Supported character sets in PostgreSQL
The following is a list of all of the character sets supported by PostgreSQL:
BIG5, EUC_CN, EUC_JP, EUC_JIS_2004, EUC_KR, EUC_TW, GB18030, GBK, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, JOHAB, KOI8R, KOI8U, LATIN1, LATIN2, LATIN3, LATIN4, LATIN5, LATIN6, LATIN7, LATIN8, LATIN9, LATIN10, MULE_INTERNAL, SJIS, SHIFT_JIS_2004, SQL_ASCII, UHC, UTF8, WIN866, WIN874, WIN1250, WIN1251, WIN1252, WIN1253, WIN1254, WIN1255, WIN1256, WIN1257, WIN1258.
NOTE: Use the SHOW client_encoding; SQL command, while connected with psql, to have Postgres return the client encoding currently in use for your connection to the database.
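If you connect from Python rather than psql, a client library can run the same check. Below is a brief sketch using the psycopg2 driver; the connection details are placeholders for your own setup:

import psycopg2  # assumes the psycopg2 driver is installed

# Placeholder connection details; replace them with your own.
conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost", port=5432)

with conn.cursor() as cur:
    # Ask the server the same question psql would.
    cur.execute("SHOW client_encoding;")
    print(cur.fetchone()[0])   # e.g. UTF8

conn.close()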
The default character set for a new database is typically UTF-8 (UTF8), a Unicode encoding that covers the entire Unicode character repertoire. If you’d like to create a database using another, more specific, character set, then make sure to use the WITH ENCODING clause.
The following example SQL statement will create a database with the KOI8R Russian Cyrillic character set:
CREATE DATABASE "some_database" WITH ENCODING 'KOI8R';
Set the default character encoding set for the PostgreSQL cluster
The following example shows how you can set UTF-8 encoding with a Korean (ko_KR) locale, which provides Korean Hangul support, as the default for the entire PostgreSQL cluster:
initdb --pgdata=/usr/local/var/postgres -E 'UTF-8' --lc-collate='ko_KR.UTF-8' --lc-ctype='ko_KR.UTF-8'
NOTE: The --pgdata option (or -D, its short form) specifies the data directory for the database cluster. If you don’t know where your Postgres data is stored, you can use the SHOW data_directory; command while connected with a Postgres SUPERUSER role to have it return the directory location for you.
Prerequisites to inserting record data into a PostgreSQL table
Make sure that the Postgres service is installed and running. The example SQL statements in this article rely on the psql command-line interface.
Get the version numbers for Postgres and psql
You can use psql -V
and postgres -V
to get the version numbers for psql and Postgres, respectively. While connected to PostgreSQL with psql, you can use the \conninfo command to have Postgres return information about the server’s port (the default is 5432) and the current database.
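A client library can report similar details. As a brief sketch with psycopg2 (the connection parameters are placeholders), the connection object exposes the server version and the connection settings, much like \conninfo does in psql:

import psycopg2  # assumes the psycopg2 driver is installed

# Placeholder connection details; replace them with your own.
conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost", port=5432)

print(conn.server_version)        # an integer such as 120005 for version 12.5
print(conn.get_dsn_parameters())  # dict of connection settings (dbname, user, port, ...)

conn.close()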
Create a user role for the Postgres database with Unicode records
Use the psql postgres
command to connect to PostgreSQL (using the postgres
user), and then create a user role for the database that will store the Unicode characters. The following example creates a new user with a SUPERUSER
role and an encrypted password:
CREATE ROLE objectrocket WITH SUPERUSER CREATEDB CREATEROLE LOGIN ENCRYPTED PASSWORD '1234';
Grant privileges to the user for all the tables in the public schema
This next command will grant all privileges to the user for all tables in the public schema:
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO objectrocket;
Also grant the user privileges to modify the records in the target database:
GRANT ALL PRIVILEGES ON DATABASE some_database TO objectrocket;
Create a new PostgreSQL table for the Unicode record data
Let’s create a table for the Unicode string data. The concatenated string data will be inserted into the table’s str VARCHAR(64) column. The following SQL statement creates a table with four columns:
CREATE TABLE some_table (
    id UUID PRIMARY KEY,
    str VARCHAR(64),
    INT INTEGER,
    bool BOOLEAN
);
Get the four-digit Unicode escape sequence for each character
You’ll need to convert each special Unicode character into an escaped string containing its four-digit code point (e.g. the escape '\u4f60' for the character '你').
The following Python code converts some UTF-8 string data into their respective Unicode escape sequences:
'你好，世界'.encode("unicode_escape")
'Jag kan också köpa ost'.encode("unicode_escape")
'Grimms Märchen sind großartig'.encode("unicode_escape")
'The EU uses the €, and Britain uses the £'.encode("unicode_escape")
NOTE: Code points written in the Unicode standard’s U+NNNN notation (U+0000, for example, represents the null character) take the \uNNNN form in Postgres escape strings.
The above Python code should return the following respective Unicode escape sequences as byte strings:
b'\\u4f60\\u597d\\uff0c\\u4e16\\u754c'
b'Jag kan ocks\\xe5 k\\xf6pa ost'
b'Grimms M\\xe4rchen sind gro\\xdfartig'
b'The EU uses the \\u20ac, and Britain uses the \\xa3'
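If you would rather not translate these escape sequences by hand, a small helper can build the psql E'' literal for you. The following is a minimal Python sketch; the function name to_psql_escape_string is just for illustration:

def to_psql_escape_string(text):
    """Build a psql E'...' escape-string literal for an arbitrary string."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if ch == "'":
            parts.append("''")               # double single quotes inside the literal
        elif ch == "\\":
            parts.append("\\\\")             # escape backslashes for the E'' syntax
        elif cp < 128:
            parts.append(ch)                 # plain ASCII passes through unchanged
        elif cp <= 0xFFFF:
            parts.append("\\u%04x" % cp)     # four-digit escape, e.g. \u00e5
        else:
            parts.append("\\U%08x" % cp)     # eight-digit escape for code points above U+FFFF
    return "E'" + "".join(parts) + "'"

print(to_psql_escape_string('你好，世界'))
# E'\u4f60\u597d\uff0c\u4e16\u754c'
print(to_psql_escape_string('Jag kan också köpa ost'))
# E'Jag kan ocks\u00e5 k\u00f6pa ost'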
Convert a raw byte escape into a Unicode code point
In a UTF-8 database, psql needs the escaped, four-digit Unicode code point form (\uNNNN); a two-digit raw byte escape (\xNN) by itself is not a valid UTF-8 sequence. Convert it to a code point by replacing the \x prefix with \u and padding the value with two leading zeros (e.g. \xe5 becomes \u00e5).
The Swedish letter "å", for example, can be represented by the raw byte escape '\xe5', so its four-digit Unicode escape would be '\u00E5'.
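If you would rather compute the padded escape than write it by hand, Python’s ord() built-in returns the character’s code point, which can then be formatted as four hex digits; a quick sketch:

# ord() returns the code point of the character; %04x pads it to four hex digits.
print('\\u%04x' % ord('å'))   # prints \u00e5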
Using psql to insert a record with Unicode characters
Now that you have the proper four-digit Unicode escapes for your characters, you can insert them into the table created earlier.
Concatenate a string using the CONCAT SQL function
We’re going to call the CONCAT() function in SQL to concatenate a string from the Unicode escape sequences while we insert the record data into the table.
The following SQL statement uses the E'' escape-string syntax to build the Unicode string '你好，世界', which means “Hello, World” in Mandarin Chinese:
CONCAT(E'\u4f60', E'\u597d', E'\uff0c', E'\u4e16', E'\u754c')
Insert the concatenated Unicode string into one of the PostgreSQL table columns
Call the CONCAT() function inside the VALUES list of the INSERT INTO statement in order to insert the Unicode string data into the PostgreSQL table:
INSERT INTO some_table (id, str, INT, bool)
VALUES
    ('73467defd4fe46bbbaaedd5856d1b131', CONCAT(E'\u4f60', E'\u597d', E'\uff0c', E'\u4e16', E'\u754c'), 10, TRUE),
    ('f63bfdd9a7874017a4823457835cea4d', CONCAT('Jag kan ocks', E'\u00E5', ' k', E'\u00f6', 'pa ost'), 42, FALSE),
    ('17d941de79314eb7a83bda27b1f2321c', CONCAT('Grimms M', E'\u00e4', 'rchen sind gro', E'\u00df', 'artig'), 12, TRUE),
    ('4f079806297e45c5bb66ff1e611e94b6', CONCAT('The EU uses the ', E'\u20ac', ', and Britain uses the ', E'\u00a3'), 123, FALSE);
The above SQL statement should return the following result:
INSERT 0 4
Conclusion to inserting records with Unicode characters into PostgreSQL
The psql interface can sometimes be a little tricky when trying to use Unicode. Many of the PostgreSQL client libraries have built-in Unicode support to help you avoid the hassle of concatenating UTF-8 code points into strings, but at least now you have a better idea of how Unicode, and the different encodings (character sets), work in the psql interface for PostgreSQL.
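As an example of that client-library approach, here is a minimal sketch using the psycopg2 driver, assuming it is installed, that the table and role from the examples above exist, and that the placeholder connection details are adjusted to your own setup. The driver encodes Python strings to UTF-8 for you, so there is no need to build \uNNNN escapes or call CONCAT() by hand:

import uuid
import psycopg2  # assumes the psycopg2 driver is installed

# Placeholder connection details; replace them with your own.
conn = psycopg2.connect(dbname="some_database", user="objectrocket",
                        password="1234", host="localhost", port=5432)

with conn, conn.cursor() as cur:
    # Parameterized queries let the driver handle the UTF-8 encoding.
    cur.execute(
        "INSERT INTO some_table (id, str, INT, bool) VALUES (%s, %s, %s, %s)",
        (str(uuid.uuid4()), "你好，世界", 10, True),
    )

conn.close()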