How to Build an Optical Character Recognition OCR App for an Elasticsearch Index using Python and Tesseract

Written by Data Pilot

August 28, 2019

Elasticsearch
PyTesseract
Python
Tesseract

Have a Database Problem? Speak with an Expert for Free
Get Started >>

Introduction to using Google’s Tesseract with Elasticsearch

Python-Tesseract is an optical character recognition, or OCR, tool for Python designed to read text embedded in any image supported by the Leptonica and Pillow imaging libraries. This tutorial will explain how build an optical character recognition OCR Elasticsearch app with Python Tesseract software in Elasticsearch using the PyTesseract library. This will allow for creating string data from an image file that can then be indexed as a document in Elasticsearch.

A screenshot of the ObjectRocket example image used in this tutorial

Screenshot of the image used for the PyTesseract Python Elasticsearch article

The image intentionally includes some Chinese and Japanese characters to demonstrate that Python 3, Elasticsearch and Tesseract all have multi-language Unicode support.

Prerequisites to Build an Optical Character Recognition, or OCR, Elasticsearch App using the Python Tesseract Library with Elasticsearch

Have an Elasticsearch cluster running on the same machine or server with the image and Tesseract library installed. Execute the following command to install the Elasticsearch low-level client for Python 3 using the PIP3 package manager:

1	pip3 install elasticsearch

PyTesseract relies on the Python Pillow Imaging Library package. Install both Pillow and the PyTesseract module for Python 3 using the following pip3 command:

1 2	pip3 install Pillow pip3 install pytesseract

NOTE: The examples in this tutorial are executed using Python 3. The code examples used here have not been tested with Python 2. The next section supplies instructions for installing the Tesseract software.

How to specify the OCR installation location for PyTesseract on Windows

The PyTesseract OCR installation location must be specified when running this script on a Windows machine. Specify the installation location with the following command:

1	pytesseract.pytesseract.tesseract_cmd = 'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'

How to Install the Tesseract OCR Library for the Elasticsearch Cluster’s Server

If using Windows to run the example Python code in this article, then download the executable installer for Windows.

How to install the Tesseract OCR library on a Debian-based distro of Linux

If using a Debian-based distro of Linux, such as Debian, Linux Mint, or Ubuntu, execute the following command to install the Tesseract dependencies for using the APT-GET repository:

1	sudo apt-get install libicu-dev libpango1.0-dev libcairo2-dev

Then use the APT repository to install the Tesseract OCR library:

1	sudo apt install tesseract-ocr

How to install the Tesseract library on macOS using Homebrew

To run this Python script in macOS, use the following Homebrew brew command to install the Tesseract library and language support:

1	brew install tesseract --all-languages

How to install language support for the Tesseract OCR library in Linux

Execute the following command to install traditional Chinese language support for Tesseract:

1	sudo apt-get install tesseract-ocr-chi-tra

Execute the following command to install Japanese language support for Tesseract:

1	sudo apt-get install tesseract-ocr-jpn

Following is a screenshot showing the before and after configurations:

Screenshot of PyTesseract Python OCR data with and without Japanese and Chinese language support

NOTE: Calling the image_to_string() method, with Japanese language support specified in the method call, will likely produce accurate OCR results. This is true even if some of the source text is in both Chinese and Japanese. This is because the Japanese “Kanji” writing system borrowed many of its characters from Chinese. However, notice that the OCR reading of the image did misinterpret “will” as “wl” when using the Japanese language support in the method call.

How to Import the Elasticsearch and PyTesseract Libraries into a Python Script

Once inside the Python script, make sure to import the PyTesseract library and the PIL (Pillow) library for loading and reading image data. Use Python’s JSON library to parse the dictionary response returned by Elasticsearch so both are more readable when printed. Execute the follow script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17

# import the Pytesseract library
import pytesseract

# import the Image method library from PIL
from PIL import Image

# import the JSON library for Python for pretty print
import json

# import the Base64 library for Python for encoding OCR data
import base64

# import the datetime() method for timestamps
from datetime import datetime

# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch

Additionally, import the Base64 library for Python so the string data parsed from the image can be encoded and indexed as an Elasticsearch document. Note that the datetime.now() method call is used in this tutorial for the Elasticsearch document’s timestamp field.

How to have Python return a string of the Tesseract version installed

Call the PyTesseract library’s get_tesseract_version() method to verify the Tesseract package is installed on the machine running the Elasticsearch cluster:

1
2
3

# get the version of PyTesseract installed
tesseract_ver = pytesseract.get_tesseract_version()
print ("Tesseract version:", tesseract_ver)

How to Declare a Function that will Index Tesseract OCR Data as an Elasticsearch Document

Use the def keyword in Python used to define a new function that will take the image data and file name and then index both into an Elasticsearch document:

1 2	# Elasticsearch will dynamically create _id by default def index_ocr_data(file, data, doc_id=""):

The default parameter for the function’s Elasticsearch document _id is an empty string (""). This will cause Elasticsearch to automatically create a dynamically generated alpha-numeric document ID for the user, provided none were passed to the function call.

How to convert the OCR data to a bytes string and encode it with Base64

to pass the data to the bytes() function and use UTF-8 character encoding:

1
2
3
4

try:
# convert PyTesser data to bytes object
bytes_string = bytes(str(data), "utf-8")
print ("\nPyTesser data bytes_string:", bytes_string)

Execute the folloing script to acquire the bytes string and encode it using the Base64 library’s b64encode() method:

1 2	# convert bytes to base64 encoded string encoded_data = base64.b64encode(bytes_string)

How the Python client returns a mapper_parsing_exception when making an index() API call

To avoid the mapper_parsing_exception error when indexing to Elasticsearch, convert the string of data returned by PyTesseract to a Base64 encoded string before passing it to the Elasticsearch client’s index() method nested inside the document (Python dict) body.

How to Build an Elasticsearch Document from the Encoded OCR Data

Build an Elasticsearch document from the encoded OCR data by passing the encoded data to an Elasticsearch field of the body that will represent the document’s _source data.

How to declare a Python dict object for the Elasticsearch document body

Use the "file name” to record the image’s accompanying file name for the OCR data:

1
2
3
4
5
6
7
8
9
10
11
12

# create a document body for the new Elasticsearch document
doc_body = {
"file name": file,
"data": encoded_data,
"timestamp": datetime.now()
}

# print the doc body and use json.dumps() for indentation
try:
print ("\nindex_ocr_data() doc body:", json.dumps(doc_body, indent=4))
except:
print ("\nindex_ocr_data() doc body:", doc_body)

How to Pass the Elasticsearch Dictionary Object to the Python Client’s Index() Method Call

It is critical to pass the entire dict body to the index() method’s body parameter. The following example script uses the request_timeout parameter to instruct it to attempt the API call for two seconds before returning a timeout error:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15

# return a result object from Elasticsearch
result = client.index(
index = INDEX,
doc_type = "_doc",
id = doc_id,
body = doc_body,
request_timeout = 2
)

except Exception as error:
# returns a set() object of the Elasticsearch error
result = {"index_ocr_data ERROR", str(error)}

# return the JSON result
return result

The function will return the JSON response returned by Elasticsearch as a Python dictionary object. If an exception is raised, then the function is designed to instead return information about the error.

How to Setup the Global Variables for the Image File and Elasticsearch Index Name

Global variables for the image’s location and file name can be declared outside the function call as can the Elasticsearch index name and client instance of the Elasticsearch Python library. The location of the file need not be specified if it’s located in the same directory as the Python script. to set up the global variables:

1
2
3
4
5
6
7
8

# file name of the image being indexed into Elasticsearch
file_name = "objectrocket_ocr.jpg"

# declare global client instance for the index_ocr_data() function
client = Elasticsearch(hosts=["localhost:9200"])

# declare global for index name
INDEX = "pytesseract_image_data"

How to Declare a PIL Object Instance of the Image and Retrieve its OCR Data

Use the following script to have the PIL library to return a PIL.JpegImagePlugin object of the image’s data that can then be passed to PyTesseract’s image_to_string() method call to return a string that will, hopefully, contain accurately transcribed text data from the image:

1
2
3
4
5
6

# return a PIL.JpegImagePlugin Image object of the local image
img = Image.open(file_name)
print ("img TYPE:", type(img))

# return a string of the image's data by passing the PIL object to the image_to_string() method
data_from_image = pytesseract.image_to_string(img, lang="jpn")

The above example passes the string "jpn" to the method’s lang parameter so the OCR software knows to look for Japanese writing in the image.

How to Call the Index_OCR_Data( ) Function and Print the API Response from Elasticsearch

Call the index_ocr_data() function declared earlier and pass the image’s file name and OCR data to its function calls:

1 2	# call the index_ocr_data() function and return the response resp = index_ocr_data(file_name, data_from_image)

NOTE: The option exists to pass a document _id parameter to the function call, such as doc_id=42, to explicitly choose the document’s ID.

How to print the OCR function’s JSON response from Elasticsearch

Print the response returned by the function call:

1
2
3
4
5
6

# print the results of the API call made to Elasticsearch
print ("\nRESULT:")
if type(resp) == dict:
print ("index_ocr_data() resp:", json.dumps(resp, indent=4))
else:
print ("index_ocr_data() resp:", resp)

At this point in the code the Base64 encoded OCR data, retrieved from the image, will have been indexed in Elasticsearch as a document’s _source data when the script is executed.

How to run the Python script to get the OCR data from the image and index it in Elasticsearch

Navigate to the directory containing the Python script and execute the following code using the python3 command:

1	python3 pytesser_elastic.py

How to use Kibana to verify the encoded Tesseract OCR data was actually indexed in Elasticsearch

If the Kibana Console UI is installed and open in a browser tab, a GET HTTP request can be made to retrieve the Base64 encoded OCR data stored on the pytesseract_image_data Elasticsearch index. Execute the following GET command:

1	GET pytesseract_image_data/_search

The result should resemble the following:

Screenshot of the Kibana Console UI getting Tesseract OCR data indexed inn Elasticsearch As shown in the above screenshot, the OCR data is an unreadable, encoded Base64 string. Another call must be made to Elasticsearch and the string decoded in order to read the indexed text from the image file. Alternatively, copy and paste the encoded string into another application to have it decoded.

How to use the Elasticsearch Python client’s get( ) method to retrieve the PyTesseract document data

After the OCR data has been indexed as an Elasticsearch document, the Python client’s get() method can be used to retrieve the Base64 encoded data. It can then be decode back into a string to verify the OCR image data was indexed and stored successfully. Execute the following script:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

# get the data indexed and convert back into a string
try:
# get the indexed doc _id from response dict
doc_id = resp["_id"]

# pass the doc _id to the client's get() method
doc_info = client.get(
index = INDEX,
id = doc_id
)

# print the entire JSON document response from get()
print ("doc_info reponse:", json.dumps(doc_info, indent=4))

# get the _source field's data stored from image
source_data = doc_info["_source"]["data"]

# decode the base64 data returned from Elasticsearch
decoded_data = base64.b64decode(source_data).decode("utf-8")
print ("\ndecoded_data:", decoded_data)

except Exception as error:
print ("Elasticsearch client get() ERROR:", error)

The results should look like the following screenshot:

Screenshot of Python printing Base64 UTF8 decoded document data returned from an Elasticsearch index

Conclusion

This tutorial covered how build an optical character recognition, or OCR, Elasticsearch app with Python Tesseract software in Elasticsearch using the PyTesseract library. The article specifically explained how to specify the OCR installation location for PyTesseract on Windows, Linux and macOS. The tutorial also covered how to import the Elasticsearch and PyTesseract libraries into a Python Script, convert the OCR data to a bytes string and encode it with Base64, pass the Elasticsearch dictionary object to the Python client’s index( ) method call and setup the global variables for the image file and Elasticsearch index name. Finally, the tutorial covered how to print the OCR function’s JSON response from Elasticsearch, how to use Kibana to verify the encoded Tesseract OCR data was indexed in Elasticsearch and how to use the Elasticsearch Python client’s get( ) method to retrieve the PyTesseract document data. Remember that all the examples in this tutorial were executed with Python 3 and Python 2 has not been tested with these scripts.

Just the Code

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

# import the Pytesseract library
import pytesseract

# import the Image method library from PIL
from PIL import Image

# import the JSON library for Python for pretty print
import json

# import the Base64 library for Python for encoding OCR data
import base64

# import the datetime() method for timestamps
from datetime import datetime

# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch

# get the version of Tesseract installed
tesseract_ver = pytesseract.get_tesseract_version()
print ("Tesseract version:", tesseract_ver)

# Elasticsearch will dynamically create _id by default
def index_ocr_data(file, data, doc_id=""):

try:
# convert PyTesser data to bytes object
bytes_string = bytes(str(data), "utf-8")
print ("\nPyTesser data bytes_string:", bytes_string)

# convert bytes to base64 encoded string
encoded_data = base64.b64encode(bytes_string)

# create a document body for the new Elasticsearch document
doc_body = {
"file name": file,
"data": encoded_data,
"timestamp": datetime.now()
}

# print the doc body and use json.dumps() for indentation
try:
print ("\nindex_ocr_data() doc body:", json.dumps(doc_body, indent=4))
except:
print ("\nindex_ocr_data() doc body:", doc_body)

# return a result object from Elasticsearch
result = client.index(
index = INDEX,
doc_type = "_doc",
id = doc_id,
body = doc_body,
request_timeout = 2
)

except Exception as error:
# returns a set() object of the Elasticsearch error
result = {"index_ocr_data ERROR", str(error)}

# return the JSON result
return result

# file name of the image being indexed into Elasticsearch
file_name = "objectrocket_ocr.jpg"

# declare global client instance for the index_ocr_data() function
client = Elasticsearch(hosts=["localhost:9200"])

# declare global for index name
INDEX = "pytesseract_image_data"

# return a PIL.JpegImagePlugin Image object of the local image
img = Image.open(file_name)
print ("img TYPE:", type(img))

# return a string of the image's data by passing the PIL object to the image_to_string() method
data_from_image = pytesseract.image_to_string(img, lang="jpn")

# call the index_ocr_data() function and return the response
resp = index_ocr_data(file_name, data_from_image)

# print the results of the API call made to Elasticsearch
print ("\nRESULT:")
if type(resp) == dict:
print ("index_ocr_data() resp:", json.dumps(resp, indent=4))
else:
print ("index_ocr_data() resp:", resp)

# get() the OCR data indexed in Elasticsearch and decode it
try:
# get the indexed doc _id from response dict
doc_id = resp["_id"]

# pass the doc _id to the client's get() method
doc_info = client.get(
index = INDEX,
id = doc_id
)

# print the entire JSON document response from get()
print ("doc_info reponse:", json.dumps(doc_info, indent=4))

# get the _source field's data stored from image
source_data = doc_info["_source"]["data"]

# decode the base64 data returned from Elasticsearch
decoded_data = base64.b64decode(source_data).decode("utf-8")
print ("\ndecoded_data:", decoded_data)

except Exception as error:
print ("Elasticsearch client get() ERROR:", error)

Pilot the ObjectRocket Platform Free!

Try Fully-Managed CockroachDB, Elasticsearch, MongoDB, PostgreSQL (Beta) or Redis.

Get Started