Use Tesseract OCR to Insert MongoDB Documents

Introduction to using Tesseract OCR to insert MongoDB documents

Google’s Tesseract OCR (Optical Character Recognition) engine analyzes the text in an image and returns it as a string of characters. This article will demonstrate how you can use Python’s pytesseract and pymongo modules to read an image and then insert its text data as a MongoDB document.

Check out the first part of this article series for instructions on setting up your project. This article assumes that you already have an image to extract text data from, and that you’ve already installed the necessary Python modules using pip3. Please make sure the MongoDB service is installed and running before attempting to run the Python code.

Screenshot of Python PyMongo MongoDB PyTesseract image in GIMP

Fixing the ‘TesseractNotFoundError’ in Python

Installing the pytesseract library with the pip package manager isn’t enough on its own, because pytesseract is only a wrapper around the Tesseract engine. Python may raise a TesseractNotFoundError when you call the library’s methods, which typically means that the Tesseract binary, or its language data files, are not installed or are not on your PATH.

Install Tesseract OCR for Debian-based Linux

Use the following bash command to install the libraries on a Debian-based distro of Linux like Ubuntu or Linux Mint:

sudo apt install tesseract-ocr && sudo apt install libtesseract-dev

Install Tesseract OCR for Red Hat (RHEL) Linux

If you’re using a Red Hat-based distro of Linux (like Fedora or CentOS) then use the YUM package manager to install the Tesseract libraries:

yum install tesseract && yum install tesseract-langpack-eng

Install Tesseract OCR for macOS with Homebrew

Use the brew install command to install the necessary packages on macOS:

brew install tesseract && brew install tesseract-lang

Install Tesseract OCR on Windows

If you’re using Windows, download the Tesseract installer that matches your version of Windows, and then either add the install directory to your PATH environment variable or set pytesseract.pytesseract.tesseract_cmd to the full path of the tesseract executable.
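
As a rough sketch of that last step, the snippet below looks for the Tesseract executable and, if it finds one, points pytesseract at it. The helper name find_tesseract_binary and the fallback install path are assumptions made for illustration; the path shown is only a common default for Windows installers, so adjust it to wherever the binary actually lives on your machine.

```python
import os
import shutil

# ASSUMPTION: a common default install location on Windows; change as needed
ASSUMED_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

def find_tesseract_binary(name="tesseract", fallback=ASSUMED_WINDOWS_PATH):
    """Return the path to the Tesseract executable, or None if not found."""
    # first check whether the executable is already on the system PATH
    found = shutil.which(name)
    if found:
        return found
    # otherwise fall back to the assumed install location, if it exists
    if fallback and os.path.isfile(fallback):
        return fallback
    return None

tesseract_path = find_tesseract_binary()

try:
    import pytesseract
    if tesseract_path:
        # tell pytesseract exactly which executable to call
        pytesseract.pytesseract.tesseract_cmd = tesseract_path
except ImportError:
    print("pytesseract is not installed; run 'pip3 install pytesseract'")

print("Tesseract binary:", tesseract_path)
```

If the printed path is None, the binary is neither on the PATH nor at the assumed location, and pytesseract will most likely still raise a TesseractNotFoundError.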

Screenshot of Linux terminal

Import the Python modules for your Tesseract-MongoDB app

The following Python code imports the pytesseract and PyMongo (MongoClient) libraries, the built-in os and platform modules, and the Image module from Pillow, or PIL (Python Imaging Library), so that Python can read the image’s data into memory:

# import pytesseract and the built-in platform and os modules
import pytesseract, platform, os

# import the Image module from PIL (Pillow)
from PIL import Image

# import PyMongo's MongoClient class
from pymongo import MongoClient

# import the datetime class and time() function for timestamps
from datetime import datetime
from time import time

We’ll use the datetime and time libraries to format timestamps and create a creation date field for the MongoDB document.

Use Pillow (PIL) to load the image file into memory

Now let’s declare a string object for the image’s filename, and then use Pillow’s Image.open() method to load the image’s pixel data into the script’s memory.

Run the following code inside a try-except block to catch any errors that may arise:

# return a PIL.JpegImagePlugin Image object of the local image
try:
    # target image filename for OCR
    filename = "objectrocket-mongo.jpg"

    img = Image.open(filename)
    print ("PIL img TYPE:", type(img))
except Exception as err:

    # set the PIL image object to 'None' if exception
    print ("Image.open() error:", err)
    img = None

NOTE: Make sure to pass the absolute path for the image file, along with the filename, if it’s not located in the same directory as the Python script. If Python cannot find the file, or otherwise raises an exception, then the above code will set our img object to None.
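
One way to build such an absolute path is sketched below. The helper name resolve_image_path is hypothetical, and the example assumes the image sits in the current working directory unless you pass a different base directory (for instance, the directory that holds your script).

```python
import os

def resolve_image_path(filename, base_dir=None):
    """Return an absolute path for filename, relative to base_dir."""
    if base_dir is None:
        # default to the current working directory
        base_dir = os.getcwd()
    # os.path.join() ignores base_dir when filename is already absolute
    return os.path.abspath(os.path.join(base_dir, filename))

path = resolve_image_path("objectrocket-mongo.jpg")
print("absolute path:", path)
print("file exists:", os.path.isfile(path))
```

Checking os.path.isfile() before calling Image.open() gives a clearer message than waiting for Pillow to raise an exception.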

Connect to MongoDB using the PyMongo client library

If no errors were raised, Pillow should return the example JPG image as a PIL.JpegImagePlugin.JpegImageFile object. Use the following code to check the img object and, if it is not set to None, connect to MongoDB with PyMongo’s MongoClient class:

# insert image data if Image.open() was successful
if img is not None:

    # create a client instance of the MongoClient library
    client = MongoClient('localhost', 27017)

    # declare MongoDB database and collection instances
    db = client.image_db
    col = db["Images Collection"]

NOTE: MongoDB collection names are allowed to contain spaces, but database names are not.
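
If you want to catch a bad database name before it reaches the driver, a small check like the following can help. This is only a sketch: the function name is hypothetical, and the list of rejected characters is an assumption based on commonly documented MongoDB naming restrictions, so consult the MongoDB documentation for the authoritative rules for your version and platform.

```python
# ASSUMPTION: characters commonly rejected in MongoDB database names
INVALID_DB_CHARS = [" ", ".", "$", "/", "\\", "\x00", '"']

def is_valid_db_name(name):
    """Return True if name is non-empty and has none of the rejected characters."""
    if not name:
        return False
    return not any(char in name for char in INVALID_DB_CHARS)

print(is_valid_db_name("image_db"))  # prints: True
print(is_valid_db_name("image db"))  # prints: False
```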

Use PyTesseract to extract the image’s text

Now that we’ve connected to MongoDB, we can pass the img PIL object to pytesseract’s image_to_string() method, which calls the Tesseract engine and returns a string of the text found in the image.

Use the following code to get the text data within another try-except indentation block:

    try:
        # get a string of the PIL image object's text
        data_from_image = pytesseract.image_to_string(img, lang="eng")

Use the following line of code if you’d like to replace the newline characters in the image’s text string with spaces:

        # replace the newline chars with a space
        data_from_image = data_from_image.replace("\n", " ").strip()
        print ("\ndata_from_image:", data_from_image)
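
OCR output often contains tabs and runs of blank lines as well as single newlines. As an alternative sketch (the helper name normalize_ocr_text is made up for this example), you could collapse every run of whitespace into one space:

```python
def normalize_ocr_text(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())

sample = "ObjectRocket\nPython &\n\nMongoDB\n"
print(normalize_ocr_text(sample))  # prints: ObjectRocket Python & MongoDB
```

Unlike replace("\n", " "), this approach does not leave double spaces behind where the image text had blank lines.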

Get the image’s file stats before inserting into MongoDB

The following code uses Python’s os.stat() function to get more detailed information about the image file so that it can be inserted into the MongoDB document along with the text data:

        # get the image file's stats
        file_stats = os.stat( filename )
        print ("\nfile_stats:", file_stats, "\n")

Create a timestamp for the image’s MongoDB document

This next code will attempt to get the creation date for the file so that it can be inserted as a field value for the MongoDB document:

        # get the creation date for the image file
        if hasattr(file_stats, 'st_birthtime'):
            date_stamp = file_stats.st_birthtime
        elif hasattr(file_stats, 'st_mtime'):
            date_stamp = file_stats.st_mtime
        else:
            date_stamp = time()

        # create a datetime object from UNIX time stamp
        creation_date = datetime.utcfromtimestamp( date_stamp )

NOTE: If the file’s creation date (st_birthtime) is not available, as is often the case on Linux, the above code falls back to the last-modified time (st_mtime), and it uses the current time only if neither value can be found.
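
Note that datetime.utcfromtimestamp() is deprecated as of Python 3.12. If you are on a recent Python version, a timezone-aware variant along these lines may be preferable. This is a sketch only; the function name file_creation_date is hypothetical, and the temporary file simply stands in for your image.

```python
import os
import tempfile
from datetime import datetime, timezone

def file_creation_date(path):
    """Return a timezone-aware UTC datetime for the file's creation time."""
    stats = os.stat(path)
    # use st_birthtime where the platform provides it, else fall back to st_mtime
    stamp = getattr(stats, "st_birthtime", stats.st_mtime)
    return datetime.fromtimestamp(stamp, tz=timezone.utc)

# demonstrate with a throwaway temporary file
with tempfile.NamedTemporaryFile(suffix=".jpg") as tmp:
    created = file_creation_date(tmp.name)
    print("creation date (UTC):", created.isoformat())
```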

Create a Python dict for the MongoDB document

The code in this section will create a Python dict object for the document data that will be passed to PyMongo’s insert_one() method call.

Use this next bit of code to format the filename so that the absolute path is not included (if applicable) when the data gets included with the other document data:

        # fix the file name if needed
        img_filename = filename

        # Platform is Windows
        if platform.system() == "Windows":
            if "\\" in img_filename:
                # slice after the last backslash to drop the path
                start = img_filename.rfind("\\") + 1
                img_filename = img_filename[start:]

        # UNIX-based OS like macOS or Linux
        else:
            if "/" in img_filename:
                # slice after the last forward slash to drop the path
                start = img_filename.rfind("/") + 1
                img_filename = img_filename[start:]
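
The standard library can do the same job without a platform check. The sketch below (the helper name strip_path is hypothetical) uses ntpath.basename(), which treats both backslashes and forward slashes as separators. One caveat: a backslash is a legal character in file names on UNIX-like systems, so this shortcut assumes your file names don’t contain one.

```python
import ntpath

def strip_path(filename):
    """Return only the file name, dropping any directory portion."""
    # ntpath splits on both "\\" and "/", so it covers either path style
    return ntpath.basename(filename)

print(strip_path("C:\\images\\objectrocket-mongo.jpg"))
print(strip_path("/home/user/images/objectrocket-mongo.jpg"))
print(strip_path("objectrocket-mongo.jpg"))
```

All three calls print objectrocket-mongo.jpg.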

Declare a Python dictionary for the image’s text data

The following code declares a dictionary ({}) that will store the image data as key-value pairs:

        # Declare a Python dict for the MongoDB doc
        img_doc = {
            "filename":img_filename,
            "creation_date": creation_date,
            "text": data_from_image,
            "timestamp": datetime.utcnow()
        }

        # print the dict object for the doc
        print (img_doc)

    except Exception as err:
        # set the image dict object to 'None' if exception
        print ("pytesseract.image_to_string() ERROR:", err)
        img_doc = None

NOTE: PyMongo supports Python’s datetime objects natively and stores them as BSON dates, so it will have no problem interpreting datetime.datetime(2019, 11, 11, 6, 16, 23) as 2019-11-11T06:16:23.000+00:00 when it inserts the data.
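
To preview how a datetime value will read once stored, you can format it yourself. The following sketch is not part of PyMongo; the function name as_stored_date_string is made up, and it assumes that any naive datetime in your script already represents UTC.

```python
from datetime import datetime, timezone

def as_stored_date_string(dt):
    """Format a datetime roughly the way a stored BSON date is displayed."""
    # ASSUMPTION: naive datetimes created by the script represent UTC
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    # BSON dates keep millisecond precision, so show three decimal places
    return dt.astimezone(timezone.utc).isoformat(timespec="milliseconds")

print(as_stored_date_string(datetime(2019, 11, 11, 6, 16, 23)))
# prints: 2019-11-11T06:16:23.000+00:00
```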

Insert the image’s text data into MongoDB as a document

The last part of our code will evaluate the img_doc to assert that its value is not set to None, and then it will pass the dict object to the MongoDB collection object’s insert_one() method:

    try:
        # attempt to insert the image text as a MongoDB doc
        if img_doc is not None:
            insert_id = col.insert_one(img_doc).inserted_id
            print ("MongoDB doc ID for image insertion:", insert_id)

    except Exception as err:
        print ("MongoDB client insert_one() ERROR:", err)

NOTE: insert_one() returns a pymongo.results.InsertOneResult object, and its inserted_id attribute holds the _id of the new document. Accessing .inserted_id directly, as above, gives you just the document ID instead of the complete result object.

The document data being inserted should look something like the following JSON object:

{
    "_id":"5dcaa4c7e979d56e233b5c49",
    "filename":"objectrocket-mongo.jpg",
    "creation_date":"2019-11-11T06:16:23.000Z",
    "text":"ObjectRocket Python & MongoDB",
    "timestamp":"2019-11-12T12:25:43.705Z"
}
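
If you’d like to print the document in a similar JSON form from Python before inserting it, keep in mind that the json module cannot serialize datetime objects on its own. The sketch below works around that with a custom encoder function; the name to_pretty_json is hypothetical, and the sample values are copied from the example document above.

```python
import json
from datetime import datetime

def to_pretty_json(doc):
    """Serialize a document dict to indented JSON text."""
    def encode(value):
        # datetime objects become ISO 8601 strings; anything else uses str()
        if isinstance(value, datetime):
            return value.isoformat()
        return str(value)
    return json.dumps(doc, indent=4, default=encode)

img_doc = {
    "filename": "objectrocket-mongo.jpg",
    "creation_date": datetime(2019, 11, 11, 6, 16, 23),
    "text": "ObjectRocket Python & MongoDB",
}
print(to_pretty_json(img_doc))
```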

Conclusion to using Tesseract OCR to insert MongoDB documents

Make sure to save the above code in a Python script (using the .py file extension), and then use the python3 bash command to execute the file:

python3 tesseract-app.py

The Python code should return something like the following (at the end of the script) if the MongoDB document insertion was successful:

MongoDB doc ID for image insertion: 5dcaa4c7e979d56e233b5c49

Screenshot of Python3 inserting text extracted from an image using Tesseract into a MongoDB collection

Use MongoDB Compass to verify that the image data was inserted

If you have it installed already, use the MongoDB Compass GUI application to verify that the image data was successfully inserted using Python.

Screenshot of MongoDB Compass returning a document containing the image's text

NOTE: The above screenshot shows how your document might look with and without the newline characters (\n) being replaced with spaces.

The document for the image’s text data should look something like the following in the MongoDB Compass application:

_id: 5dcaa4c7e979d56e233b5c49
filename: “objectrocket-mongo.jpg”
creation_date: 2019-11-11T06:16:23.000+00:00
text: “ObjectRocket Python & MongoDB”
timestamp: 2019-11-12T12:25:43.705+00:00

This concludes the two-part article series demonstrating how you can use Python, PyTesseract, and PyMongo to extract the text data from an image and insert it into a MongoDB collection as a document.

Just the Code

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

# import the Pytesseract library
import pytesseract, platform

# import the Image method library from PIL
from PIL import Image

# import PyMongo into the Python script
from pymongo import MongoClient

# import the built-in os module for file stats
import os

# import the datetime class and time() function for timestamps
from datetime import datetime
from time import time

# Check if pytesseract binary and language dep are installed
try:
    # get the version of Tesseract installed
    tesseract_ver = pytesseract.get_tesseract_version()
    print ("Tesseract version:", tesseract_ver)
except Exception as err:
    print ("TesseractNotFoundError: You'll need to install the binary for PyTesseract:")

    # check if platform is 'Linux', 'Darwin' (macOS), or Windows
    if platform.system() == "Linux":
        print ("Install with:")
        print ("'sudo apt install tesseract-ocr && sudo apt install libtesseract-dev'")
        # For Red Hat/Fedora: 'yum install tesseract && yum install tesseract-langpack-eng'

    elif platform.system() == "Darwin":
        print ("Install with:")
        print ("'brew install tesseract && brew install tesseract-lang'")

    elif platform.system() == "Windows":
        print ("If you're using Windows then download the binary library and")
        print ("set pytesseract.pytesseract.tesseract_cmd to the path of the executable")


# return a PIL.JpegImagePlugin Image object of the local image
try:
    # target image filename for OCR
    filename = "objectrocket-mongo.jpg"

    img = Image.open(filename)
    print ("PIL img TYPE:", type(img))
except Exception as err:

    # set the PIL image object to 'None' if exception
    print ("Image.open() error:", err)
    img = None

# insert image data if Image.open() was successful
if img is not None:

    # create a client instance of the MongoClient library
    client = MongoClient('localhost', 27017)

    # declare MongoDB database and collection instances
    db = client.image_db
    col = db["Images Collection"]

    try:

        # get a string of the PIL image object's text
        data_from_image = pytesseract.image_to_string(img, lang="eng")

        # replace the newline chars with a space
        data_from_image = data_from_image.replace("\n", " ").strip()
        print ("\ndata_from_image:", data_from_image)

        # get the image file's stats
        file_stats = os.stat( filename )
        print ("\nfile_stats:", file_stats, "\n")

        # get the creation date for the image file
        if hasattr(file_stats, 'st_birthtime'):
            date_stamp = file_stats.st_birthtime
        elif hasattr(file_stats, 'st_mtime'):
            date_stamp = file_stats.st_mtime
        else:
            date_stamp = time()

        # create a datetime object from UNIX time stamp
        creation_date = datetime.utcfromtimestamp( date_stamp )

        # fix the file name if needed
        img_filename = filename

        # Platform is Windows
        if platform.system() == "Windows":
            if "\\" in img_filename:
                # slice after the last backslash to drop the path
                start = img_filename.rfind("\\") + 1
                img_filename = img_filename[start:]

        # UNIX-based OS like macOS or Linux
        else:
            if "/" in img_filename:
                # slice after the last forward slash to drop the path
                start = img_filename.rfind("/") + 1
                img_filename = img_filename[start:]

        # Declare a Python dict for the MongoDB doc
        img_doc = {
            "filename":img_filename,
            "creation_date": creation_date,
            "text": data_from_image,
            "timestamp": datetime.utcnow()
        }

        # print the dict object for the doc
        print (img_doc)

    except Exception as err:
        # set the image dict object to 'None' if exception
        print ("pytesseract.image_to_string() ERROR:", err)
        img_doc = None

    try:
        # attempt to insert the image text as a MongoDB doc
        if img_doc is not None:
            insert_id = col.insert_one(img_doc).inserted_id
            print ("MongoDB doc ID for image insertion:", insert_id)
    except Exception as err:
        print ("MongoDB client insert_one() ERROR:", err)
