How to Upload Images to an Elasticsearch Index
Introduction
Although Elasticsearch is known for its fast and powerful text search, it’s possible to index more than just text. For example, it’s easy to upload images to Elasticsearch. You can use Python’s PIL library to extract an image’s EXIF data, and then index the image’s data to an Elasticsearch index. In this article, we’ll provide step-by-step instructions for storing images with Python in Elasticsearch.
Prerequisites for indexing photos to Elasticsearch
Before we look at how to upload photos to Elasticsearch, let’s go over the key system requirements for the task:
- Elasticsearch must be installed and running, along with Java and its JVM. You can check if Elasticsearch is running by making a GET request to the server on the default port of 9200. It should return a JSON response with cluster information.
- When you index photos with Python in Elasticsearch, you may want to use the Kibana Console UI to verify that the photos or images uploaded correctly. If so, you need to make sure that the Kibana service is running as well. Simply load the interface in a browser by navigating to port 5601 on your server (for example, localhost:5601).
- You’ll need to create an Elasticsearch index for the image data. If you haven’t created one yet, we’ll go over how to do it in this article. We’ll be indexing EXIF data in Elasticsearch for this tutorial, so it’s best not to have an index with a rigid _mapping schema, since not all images have the same EXIF data fields, and some images don’t have any at all.
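The first prerequisite can also be checked from Python. The following is a minimal sketch that uses only the standard library; the function name get_cluster_info and the http://localhost:9200 URL are assumptions for illustration, and a cluster with security enabled would also need credentials.

```python
import json
import urllib.error
import urllib.request


def get_cluster_info(url="http://localhost:9200", timeout=5):
    """Return the cluster information as a dict, or None if the
    server cannot be reached or does not answer with JSON."""
    try:
        # a plain GET request to the root endpoint of the server
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # connection refused, timeout, HTTP error, or invalid JSON
        return None
```

Calling get_cluster_info() returns a dict of cluster details when the server responds, and None otherwise, so a script can stop early if Elasticsearch is not running.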
Install all the Python libraries and dependencies using pip3
The Python code used in this article is tested and written with Python 3 in mind, since Python 2.7 is now deprecated. Therefore, we’ll be using the pip3 command for the PIP package manager for Python 3 to install the necessary libraries and modules.
Install Pillow (PIL) for Python 3
Throughout this tutorial, we’ll be using the Python Imaging Library, also known as PIL or Pillow. We’ll be using Pillow with Python and Elasticsearch to index images. Let’s begin by installing PIL using the following command:
pip3 install Pillow
If you’d like to upgrade PIL to the latest version, use this command:
pip3 install --upgrade Pillow
>NOTE: Older versions of PIL allowed its modules to be imported directly (e.g. import Image); Pillow 1.0 and later only supports importing them from the PIL package (e.g. from PIL import Image).
Install pybase64 for the Base64 Python encoding library
Next, we’ll install and upgrade pybase64, a Python wrapper for the libbase64 encoding and decoding library:
pip3 install --upgrade pybase64
This library is necessary for converting documents to base64 in order to index them to an Elasticsearch index.
Select a personal photo or public domain image to index to Elasticsearch
The image we’ll be using in this article is cute-kittens-in-basket.jpg from publicdomainpictures.net.
The actual image we’re using is a scaled down (1000×669) version of the original “cute-kittens” image.
If you’d like to follow along with the instructions in this tutorial, use a public domain image like this one, or take a photo with your phone and put it into the same directory as your Python script.
Create a new project directory and Python script
Once you’ve confirmed all the system requirements and selected an image, you can get started on the code. The first step will be to create a directory and a Python script for the code. Then, move the image you’d like to index into the project directory. You can create a directory using the mkdir command in a terminal window.
After you create your directory, use the touch command to create a new Python script, or use nano to edit a new file and save it as a Python script.
>NOTE: Keep in mind that the names of Python scripts use underscores (_) instead of hyphens (-) (e.g. my_script.py).
Open a new Finder window (or whichever GUI-based file manager is available on the OS where the Elasticsearch cluster is running), and move the image to the new project folder.
If you’re in a terminal window, this can be done using the mv command by specifying the target directory in the second parameter.
Import the necessary Python libraries and modules to index an image to Elasticsearch
Next, we’ll need to import libraries and modules such as Elasticsearch and PIL into Python. Let’s edit the project’s Python script, making sure to import all of the following modules, classes, and libraries at the top:
from elasticsearch import Elasticsearch
# import the Image and TAGS classes from Pillow (PIL)
from PIL import Image
from PIL.ExifTags import TAGS
import numpy as np # convert the image's pixel data to a Python list
import uuid # for image meta data ID
import base64 # convert image to b64 for indexing
import datetime # for image EXIF data timestamp
After making these edits, save your script and run it in a terminal or command prompt window (using the python3 command) to make sure that all modules were properly installed using pip3.
If you get an ImportError, then return to the Prerequisites section of this article and make sure all of the libraries are installed and updated. Pillow (commonly known as PIL), elasticsearch, NumPy, and pybase64 are the only libraries we’ll be using in this article that don’t come packaged with a Python installation by default; the base64 module is part of Python’s standard library.
Create a new client instance of the Elasticsearch low-level client
Before we proceed, let’s make sure that the Elasticsearch cluster is up and running. Once you’ve confirmed this, we’ll add the following code to create a new client instance of Elasticsearch in the Python script:
elastic_client = Elasticsearch([{'host': 'localhost', 'port': 9200}])
Start the Kibana service
You may want to use the Kibana Console UI to make HTTP requests and verify that an image has been indexed. If so, then you’ll need to make sure that Kibana is running.
Use Python’s PIL library to grab the image’s EXIF meta data
The PIL library (Python Imaging Library) is instrumental in making it possible to index photos in Elasticsearch with Python. PIL’s ExifTags module has a TAGS dictionary that maps numeric EXIF tag IDs to readable names, which allows one to read an image’s EXIF meta data. You can use this data to create custom fields for the Elasticsearch document’s _source data. This allows the document to be indexed, organized and searched using the image’s EXIF tags.
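To make the tag lookup concrete, here is a small sketch of how numeric EXIF tag IDs can be turned into named fields. It deliberately avoids importing Pillow so it runs anywhere: SAMPLE_TAGS is a hypothetical four-entry stand-in for the much larger PIL.ExifTags.TAGS dictionary, and name_exif_tags is an illustrative helper name, not part of any library.

```python
# Hypothetical subset of EXIF tag IDs; PIL.ExifTags.TAGS holds the full mapping
SAMPLE_TAGS = {
    271: "Make",
    272: "Model",
    306: "DateTime",
    36867: "DateTimeOriginal",
}


def name_exif_tags(raw_exif, tag_names=SAMPLE_TAGS):
    """Translate {tag_id: value} into {tag_name: value}, skipping unknown IDs."""
    named = {}
    # treat a missing EXIF block (None) like an empty dict
    for tag_id, value in (raw_exif or {}).items():
        tag_name = tag_names.get(tag_id)
        if tag_name is not None:
            named[tag_name] = value
    return named


raw = {271: "Canon", 306: "2019:06:14 21:18:04", 99999: "ignored"}
print(name_exif_tags(raw))
# → {'Make': 'Canon', 'DateTime': '2019:06:14 21:18:04'}
```

With Pillow installed, the same helper could be called with tag_names=TAGS after from PIL.ExifTags import TAGS.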
Set up the image document’s index data for Elasticsearch
If you have a specific _id or _index in mind for your selected image document, it’s best to declare that information before doing anything else. Let’s also declare a string containing the file name (and its path if the image is not in the same directory as the script):
_file = "cute-kittens-in-basket.jpg"
_index = "images"
_id = 1
Create a PIL Image instance of the image in the directory
Next, we’ll open the _file string in binary read mode ('rb') and pass the resulting file object to PIL’s Image.open method. We’ll have it return a PIL Image object of the target image:
img = Image.open(open(_file, 'rb'))
Use PIL’s PIL.ExifTags Python library to read an image’s EXIF meta data
Make sure to import TAGS at the beginning of your script. This makes it possible to index an image’s EXIF data with PIL:
from PIL.ExifTags import TAGS
Create a Python function to get the PIL Image’s EXIF data
We’ll need to create a function that first verifies the PIL Image object and then uses the _getexif() method to return all of the image’s EXIF data:
def get_image_exif(img):
    # use PIL to verify image is not corrupted
    img.verify()
    try:
        # call the img's _getexif() method and return EXIF data
        exif = img._getexif()
        exif_data = {}
        # iterate over the exif items
        for (meta, value) in exif.items():
            try:
                # put the exif data into the dict obj
                exif_data[TAGS.get(meta)] = value
            except AttributeError as error:
                print ('get_image_exif AttributeError for tag:', meta, '--', error)
    except AttributeError:
        # if img has no EXIF data or no _getexif, then give empty dict
        exif_data = {}
    return exif_data
Now we can use the exif_data dictionary object for the Elasticsearch document’s _source data. Here’s the code we need to do it:
_source = create_exif_data(img)
# store the file name in the Elasticsearch index
_source['name'] = _file
The function will place all of the EXIF data into a Python dict object and return that object to be indexed later on. For our sample image, the object’s key-value pairs would look like the following:
{
    'mime_type': 'image/jpeg',
    'name': 'cute-kittens-in-basket.jpg',
    'datetime': '2019:06:14 21:18:04',
    'make': 'Camera Unknown',
    'model': 'Camera Unknown',
    'uuid': '1a994cbe-8d03-4c07-9f26-6d13930dcbcd'
}
Create another function to generate missing EXIF data if needed
At this point, we’ll need to declare another function that will parse the EXIF dict data returned by the get_image_exif() function. This function will ensure that all of the documents in our Elasticsearch index have the same fields by generating EXIF data for them, even if some images don’t have any.
This function will parse through the EXIF keys using if and elif conditional statements to check if the EXIF data is present:
def create_exif_data(img):
    # create a new dict obj for the Elasticsearch doc
    es_doc = {}
    es_doc["size"] = img.size
    # put PIL Image conversion in a try-except indent block
    try:
        # create PIL Image from path and file name
        img = Image.open(_file)
    except Exception as error:
        print ('create_exif_data PIL ERROR:', error, '-- for file:', _file)
    # call the function to have PIL return exif data
    exif_data = get_image_exif(img)
    # get the PIL img's format and MIME
    es_doc["image_format"] = img.format
    es_doc["image_mime"] = Image.MIME[img.format]
    # get datetime meta data from one of these keys if possible
    if 'DateTimeOriginal' in exif_data:
        es_doc['datetime'] = exif_data['DateTimeOriginal']
    elif 'DateTime' in exif_data:
        es_doc['datetime'] = exif_data['DateTime']
    elif 'DateTimeDigitized' in exif_data:
        es_doc['datetime'] = exif_data['DateTimeDigitized']
    # if none of these exist, then use current timestamp
    else:
        es_doc['datetime'] = str( datetime.datetime.now() )
    # create a UUID for the image if none exists
    if 'ImageUniqueID' in exif_data:
        es_doc['uuid'] = exif_data['ImageUniqueID']
    else:
        # create a UUID converted to string
        es_doc['uuid'] = str( uuid.uuid4() )
    # make and model of the camera that took the image
    if 'Make' in exif_data:
        es_doc['make'] = exif_data['Make']
    else:
        es_doc['make'] = "Camera Unknown"
    # camera unknown if none exists
    if 'Model' in exif_data:
        es_doc['model'] = exif_data['Model']
    else:
        es_doc['model'] = "Camera Unknown"
    # software used to create or edit the image
    if 'Software' in exif_data:
        es_doc['software'] = exif_data['Software']
    else:
        es_doc['software'] = 'Unknown Software'
    # get the X and Y res of image
    if 'XResolution' in exif_data:
        es_doc['x_res'] = exif_data['XResolution']
    else:
        es_doc['x_res'] = None
    if 'YResolution' in exif_data:
        es_doc['y_res'] = exif_data['YResolution']
    else:
        es_doc['y_res'] = None
    # return the dict
    return es_doc
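The same "fill in the gaps" idea can be written more compactly with dict.get() and a table of defaults. This is only a sketch of an alternative, not the function used in this article: fill_missing_exif and EXIF_DEFAULTS are hypothetical names, and the sketch works on a plain EXIF dict so it can be tried without an image.

```python
import datetime
import uuid

# document field -> (EXIF key, default used when the key is missing)
EXIF_DEFAULTS = {
    "make": ("Make", "Camera Unknown"),
    "model": ("Model", "Camera Unknown"),
    "software": ("Software", "Unknown Software"),
    "x_res": ("XResolution", None),
    "y_res": ("YResolution", None),
}


def fill_missing_exif(exif_data):
    """Build a document dict that always has the same fields."""
    es_doc = {}
    for field, (exif_key, default) in EXIF_DEFAULTS.items():
        es_doc[field] = exif_data.get(exif_key, default)
    # use the first timestamp key that exists, else the current time
    for key in ("DateTimeOriginal", "DateTime", "DateTimeDigitized"):
        if key in exif_data:
            es_doc["datetime"] = exif_data[key]
            break
    else:
        es_doc["datetime"] = str(datetime.datetime.now())
    # keep the image's own unique ID, or generate a UUID string
    if "ImageUniqueID" in exif_data:
        es_doc["uuid"] = exif_data["ImageUniqueID"]
    else:
        es_doc["uuid"] = str(uuid.uuid4())
    return es_doc
```

Keeping the defaults in one table means a new field only needs one new entry, at the cost of being less explicit than the if and else chain above.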
Create a NumPy array
In our next step, we’ll be using the file string (_file) declared earlier. We’ll open the image using PIL’s Image class, and pass that Image object instance to NumPy’s asarray() method (NumPy must be imported as np at the top of the script). Be sure to cast the NumPy array as a normal Python list object using the tolist() method; this ensures that the image’s pixel data can be put in lists and then stored in Elasticsearch:
img_array = np.asarray( Image.open( _file ) ).tolist()
Convert the Python list object to a str
Keep in mind that the pixel data should be passed to Elasticsearch as a string or encoded string rather than as a nested list, since Elasticsearch flattens nested arrays. We’ll need to convert the list to a string using the str() function:
img_str = str( img_array )
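To illustrate what this conversion produces, the sketch below uses a tiny hypothetical 2×2 RGB pixel list in place of a real image. It shows that the string made by str() can be turned back into a list, and that json.dumps() is an alternative way to build such a string; neither step is part of the article’s script.

```python
import ast
import json

# hypothetical 2x2 RGB image as nested Python lists
pixels = [[[255, 0, 0], [0, 255, 0]], [[0, 0, 255], [255, 255, 255]]]

# the conversion used in this tutorial
img_str = str(pixels)

# the string can be parsed back into the original nested list
restored = ast.literal_eval(img_str)

# json.dumps() gives an equivalent string that json.loads() can read
json_str = json.dumps(pixels)

print(restored == pixels, json.loads(json_str) == pixels)
# → True True
```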
Use Python’s Base64 library to encode images
You can also encode an image using Python’s Base64 library:
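A minimal sketch of that encoding step follows. It reads the image file’s raw bytes and returns them as a Base64 string using the standard library’s base64 module; the helper name encode_file_base64 is an assumption made for this example.

```python
import base64


def encode_file_base64(file_path):
    """Read a file's raw bytes and return them as a Base64 ASCII string."""
    with open(file_path, "rb") as image_file:
        raw_bytes = image_file.read()
    # b64encode() returns bytes, so decode them to a normal str
    return base64.b64encode(raw_bytes).decode("ascii")


# example usage with the file name declared earlier:
# img_base64 = encode_file_base64(_file)
```

The resulting string can be decoded back to the original bytes with base64.b64decode().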
Put the raw image data into the _source object
At this point, we’re ready to put the raw image data into a field key of the _source object:
_source["raw_data"] = img_base64
Create an Elasticsearch index and put the image data into it
The final step is to create an index in Elasticsearch if you have not done so already. Then, we can index the _source dictionary object with all of the EXIF and raw image data.
Create an Elasticsearch index and ignore any 400 HTTP errors
If the _index specified earlier already exists, be sure to use the ignore=400 option. This instructs the client to attempt to create the index, but also to ignore any HTTP 400 error codes so that the script will continue on without interruption:
resp = elastic_client.indices.create(
    index = _index,
    body = "{}",
    ignore = 400 # ignore 400 already exists code
)
print ("\nElasticsearch create() index response:", resp)
Index the image dict object to the Elasticsearch index
In this step, we pass all the document data declared earlier to the Elasticsearch client’s index() method. We have it return a response to confirm that the API call was successful and that there were no errors:
resp = elastic_client.index(
    index = _index,
    doc_type = '_doc',
    id = _id,
    body = _source
)
print ("\nElasticsearch index() response:", resp)
Getting the raise ConnectionTimeout("TIMEOUT", str(e), e) error while calling the Elasticsearch index() method
If you encounter a ConnectionTimeout error while indexing the image to Elasticsearch, simply use the request_timeout option while calling the index method:
resp = elastic_client.index(
    index = _index,
    doc_type = '_doc',
    id = _id,
    body = _source,
    request_timeout=60
)
The example shown above attempts to index the image data but will wait up to 60 seconds before it raises a timeout error. This can be useful for larger images that may take longer than expected to index.
Conclusion
While Elasticsearch is known for its powerful text search capabilities, it’s also possible to upload photos to Elasticsearch with Python. Although the example we reviewed in this tutorial shows the indexing of a single photo, you can use the same basic process to bulk index in Elasticsearch in Python. If you needed to bulk index documents in Elasticsearch, you would still read and generate EXIF data for each image in Python and then index that data to an Elasticsearch index. Using the step-by-step instructions provided in this tutorial, you’ll have no trouble uploading photos to Elasticsearch.
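As a rough sketch of the bulk approach mentioned above, the generator below wraps each image’s document dict in the action format used by the Python client’s bulk helper. The function name generate_image_actions and the choice of sequential integer IDs are assumptions for illustration.

```python
def generate_image_actions(image_docs, index_name="images"):
    """Yield one bulk action per image document dict."""
    for doc_id, source in enumerate(image_docs, start=1):
        yield {
            "_index": index_name,  # target Elasticsearch index
            "_id": doc_id,         # sequential ID (an assumption here)
            "_source": source,     # EXIF fields and raw image data
        }


docs = [{"name": "kittens.jpg"}, {"name": "puppies.jpg"}]
for action in generate_image_actions(docs):
    print(action["_id"], action["_source"]["name"])
```

With a running cluster, the generator could be handed to the client library’s bulk helper, for example helpers.bulk(elastic_client, generate_image_actions(docs)) after from elasticsearch import helpers.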
Open up the Kibana Console UI to verify that the image was indexed
If you’d like to verify that your image was indexed properly, you can use the Kibana Console UI. Simply navigate to port 5601 on your server’s domain (or to localhost:5601) in a web browser, and click on Dev Tools (represented by a small “wrench” icon in Kibana v7.x). Then, make a GET request in the left console pane to verify that the image indexed properly, for example:
GET images/_search
You can also pass the document’s _id to an HTTP request directly using _doc as the document type (as specified in the index() method):
GET images/_doc/1
Once you click on the green arrow icon, the right pane of the Kibana Console will display information about the indexed image.
Just the Code
We’ve looked at our example code one section at a time throughout this tutorial. The following code represents the complete script needed to index images in Elasticsearch:
#-*- coding: utf-8 -*-
# import the Elasticsearch low-level client
from elasticsearch import Elasticsearch
# import the Image class and TAGS dict from Pillow (PIL)
from PIL import Image
from PIL.ExifTags import TAGS
import numpy as np # convert the image's pixel data to a Python list
import uuid # for image meta data ID
import base64 # convert image to b64 for indexing
import datetime # for image meta data timestamp
# create a client instance of Elasticsearch
elastic_client = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def get_image_exif(img):
    """
    Function that uses PIL's TAGS dict to get an image's EXIF
    meta data and returns it all in a dict
    """
    # use PIL to verify image is not corrupted
    img.verify()
    try:
        # call the img's _getexif() method and return EXIF data
        exif = img._getexif()
        exif_data = {}
        # iterate over the exif items
        for (meta, value) in exif.items():
            try:
                # put the exif data into the dict obj
                exif_data[TAGS.get(meta)] = value
            except AttributeError as error:
                print ('get_image_exif AttributeError for tag:', meta, '--', error)
    except AttributeError:
        # if img has no EXIF data or no _getexif, then give empty dict
        exif_data = {}
    return exif_data

def create_exif_data(img):
    """
    Function to create new meta data for the Elasticsearch
    document. If certain meta data is missing from the original,
    then this script will "fill in the gaps" for the new documents
    to be indexed.
    """
    # create a new dict obj for the Elasticsearch doc
    es_doc = {}
    es_doc["size"] = img.size
    # put PIL Image conversion in a try-except indent block
    try:
        # create PIL Image from path and file name
        img = Image.open(_file)
    except Exception as error:
        print ('create_exif_data PIL ERROR:', error, '-- for file:', _file)
    # call the function to have PIL return exif data
    exif_data = get_image_exif(img)
    # get the PIL img's format and MIME
    es_doc["image_format"] = img.format
    es_doc["image_mime"] = Image.MIME[img.format]
    # get datetime meta data from one of these keys if possible
    if 'DateTimeOriginal' in exif_data:
        es_doc['datetime'] = exif_data['DateTimeOriginal']
    elif 'DateTime' in exif_data:
        es_doc['datetime'] = exif_data['DateTime']
    elif 'DateTimeDigitized' in exif_data:
        es_doc['datetime'] = exif_data['DateTimeDigitized']
    # if none of these exist, then use current timestamp
    else:
        es_doc['datetime'] = str( datetime.datetime.now() )
    # create a UUID for the image if none exists
    if 'ImageUniqueID' in exif_data:
        es_doc['uuid'] = exif_data['ImageUniqueID']
    else:
        # create a UUID converted to string
        es_doc['uuid'] = str( uuid.uuid4() )
    # make and model of the camera that took the image
    if 'Make' in exif_data:
        es_doc['make'] = exif_data['Make']
    else:
        es_doc['make'] = "Camera Unknown"
    # camera unknown if none exists
    if 'Model' in exif_data:
        es_doc['model'] = exif_data['Model']
    else:
        es_doc['model'] = "Camera Unknown"
    # software used to create or edit the image
    if 'Software' in exif_data:
        es_doc['software'] = exif_data['Software']
    else:
        es_doc['software'] = 'Unknown Software'
    # get the X and Y res of image
    if 'XResolution' in exif_data:
        es_doc['x_res'] = exif_data['XResolution']
    else:
        es_doc['x_res'] = None
    if 'YResolution' in exif_data:
        es_doc['y_res'] = exif_data['YResolution']
    else:
        es_doc['y_res'] = None
    # return the dict
    return es_doc
# create an Image instance of photo
_file = "cute-kittens-in-basket.jpg"
_index = "images"
_id = 1
img = Image.open(open(_file, 'rb'))
# get the _source dict for Elasticsearch doc
_source = create_exif_data(img)
# store the file name in the Elasticsearch index
_source['name'] = _file
# convert NumPy array of PIL image to simple Python list obj
img_array = np.asarray( Image.open( _file ) ).tolist()
# convert the nested Python array to a str
img_str = str( img_array )
# put the encoded string into the _source dict
_source["raw_data"] = img_str
# create the "images" index for Elasticsearch if necessary
resp = elastic_client.indices.create(
    index = _index,
    body = "{}",
    ignore = 400 # ignore 400 already exists code
)
print ("\nElasticsearch create() index response -->", resp)
# call the Elasticsearch client's index() method
resp = elastic_client.index(
    index = _index,
    doc_type = '_doc',
    id = _id,
    body = _source,
    request_timeout=60
)
print ("\nElasticsearch index() response -->", resp)