Parse Lines In A Text File And Insert Them As MongoDB Documents Using Python

Introduction

MongoDB and Python are known to work particularly well together, making it possible to accomplish arduous tasks with quick and simple scripts. In this article, we’ll show you a perfect example of this harmony between Python and MongoDB. We’ll demonstrate how you can iterate over the contents of a dictionary text file to parse out terms and definitions, inserting those entries as MongoDB documents. If this sounds like a task requiring reams of complex code, fear not: everything we’ve described can be done with fewer than 200 lines of Python code.

Before you try parsing a line-by-line text file and inserting MongoDB documents with Python, make sure you own the rights to the content being inserted, or use an open source or public domain text document. For the purposes of this article, we’ll be using a freely available eBook titled “Webster’s Unabridged Dictionary by Various”, which can be found on the Project Gutenberg website.

Please continue reading as we parse lines of a text file and insert them as MongoDB Documents with Python.

Prerequisites

Let’s go over a couple of basic prerequisites that need to be taken care of before we can proceed with our tutorial:

  • Make sure that the text file you’re planning on inserting is on the same machine that’s running your MongoDB server. Start the server with the mongod command if needed, then connect with the mongo shell (mongosh in newer MongoDB releases) to verify that it’s running.

  • Python 3 should be installed on the same machine, and the PyMongo Python driver needs to be installed using the pip3 package manager:

pip3 install pymongo

  • Make sure you have some free space on your machine or server to download the copy of Webster’s dictionary before running the code shown in this article.

Download and unzip the text file for Webster’s Dictionary

Now that we’ve reviewed the prerequisites, let’s open another browser tab and navigate to the download page for the Webster’s dictionary files. Download either the zip archive (29765-8.zip) or the text file for the dictionary. Unzip the archive and move the text file (which will be about 29MB in size) to the same directory as the Python script.

Create a Python script that will parse the text file and insert the data as MongoDB documents

If you’re running macOS or Linux, you can use the touch command to create a new Python script:

touch insert_dictionary.py

Rename and move the dictionary file to the location of the Python script

If your terminal’s working directory is the ~/Downloads folder, use the bash command shown below to rename the 29765-8.txt file and move it into the same directory as the Python script:

mv 29765-8.txt /var/www/html/python/websters_dictionary.txt

Be sure to replace the file path with a value that matches the path of your Python script. If you’re not sure what the path is, use the pwd command to find out.

Screenshot of the Websters Dictionary zip file being extracted and moved into another directory
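If you’d rather handle the unzip and rename steps from Python, the sketch below shows one way to do it with the standard library’s zipfile module. The extract_dictionary() helper and its target_name parameter are hypothetical names used for illustration, and the code assumes the archive contains a single .txt member, as the download described above does:

```python
import zipfile
from pathlib import Path


def extract_dictionary(zip_path, target_name="websters_dictionary.txt"):
    """Extract the first .txt member next to the archive and rename it."""
    zip_path = Path(zip_path)

    with zipfile.ZipFile(zip_path) as archive:
        # assumption: the archive holds at least one .txt member
        member = next(
            name for name in archive.namelist()
            if name.lower().endswith(".txt")
        )
        # extract() returns the path of the file it wrote
        extracted = Path(archive.extract(member, path=zip_path.parent))

    # rename the extracted file, overwriting any previous copy
    target = zip_path.parent / target_name
    extracted.replace(target)
    return target
```

Calling extract_dictionary("29765-8.zip") from the script’s directory should leave a websters_dictionary.txt file beside the archive.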

Import the necessary Python packages for inserting MongoDB documents

Next, we’ll import the MongoClient class and the errors module from the PyMongo driver library:

# import the MongoClient class
from pymongo import MongoClient, errors

Import Python’s Pickle library to serialize the dictionary entries

# import Python's pickle library to serialize the dictionary
import pickle

Import Python’s time and JSON libraries

We’ll use the json.dumps() method to indent the server information returned by MongoDB so that it’s easier to read when it’s printed:

# import Python's time and JSON libraries
import time, json

Instantiate a float variable for the Python script’s start time

We’ll use the time library to track how many seconds it takes to iterate, parse, and insert the text data. The time.time() method returns a float of the epoch time, which we record at the start of the script:

# record the start time for the script
start_time = time.time()

Declare a function that will iterate and parse the dictionary text file

In this section, we’ll declare a function that opens the Webster’s dictionary text file, iterates over and parses its data, and then puts each dictionary entry into a Python dict.

Declare the get_webster_entries() function for parsing text

Our get_webster_entries() function will require a string passed to it representing the text file’s name and directory path:

# declare a function that parses the Webster's text file
def get_webster_entries(filename):

Using Python’s open() function and managing system resources

Next, we’ll use Python’s open() function to create a _io.BufferedReader object of the text file’s data. We’ll then use the object’s read() method to have it convert the data into a bytes object:

    # use the open() function to open the text file
    with open(filename, 'rb') as raw:
        data = raw.read()

NOTE: It’s a good idea to open the file using the with keyword. The file is closed automatically as soon as the indented block finishes, even if an error occurs inside it. Otherwise, you’ll have to call the buffered reader’s close() method yourself to release the file handle.
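To see what the with keyword does for us, here’s a minimal sketch that reads a small throwaway file and checks the reader’s closed attribute inside and outside the block. The temporary file and its sample contents are assumptions made only so the example can run on its own:

```python
import os
import tempfile

# write a small throwaway file so the example is self-contained
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"MODALITY\r\nDefn: The quality of being modal.\r\n")
    sample_path = tmp.name

with open(sample_path, "rb") as raw:
    data = raw.read()
    # the file is still open while we are inside the block
    open_inside = not raw.closed

# the with statement has closed the file for us by this point
closed_after = raw.closed

print(type(data), open_inside, closed_after)

# clean up the throwaway file
os.remove(sample_path)
```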

Screenshot of Python IDLE getting the data type for the open function and BufferedReader read method

Decode the bytes string for the dictionary data as UTF-8

The bytes data from the text file needs to be converted to a string. We can use the decode() method to decode the bytes as UTF-8 and get back a Python str:

    # decode the file as UTF-8 and ignore errors
    data = data.decode("utf-8", errors='ignore')

Note that we opted to ignore errors as we convert to UTF-8. This may leave some minor “holes” in the data, but it’s necessary to avoid Python returning a UnicodeDecodeError.
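The short sketch below illustrates where those “holes” come from. It assumes, purely for illustration, a few bytes written in Latin-1 rather than UTF-8; whether your copy of the dictionary file uses that encoding is something to verify for yourself:

```python
# sample bytes: "café" encoded as Latin-1, which is not valid UTF-8
raw_bytes = "café".encode("latin-1")

# errors="ignore" silently drops the byte that can't be decoded
ignored = raw_bytes.decode("utf-8", errors="ignore")

# errors="replace" leaves a visible U+FFFD placeholder instead
replaced = raw_bytes.decode("utf-8", errors="replace")

# decoding with the encoding the bytes were written in loses nothing
latin = raw_bytes.decode("latin-1")

print(repr(ignored), repr(replaced), repr(latin))
```

If accented characters matter for your application, decoding with the file’s actual encoding is the safer choice.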

Split the string into a list of strings using the "\r\n" newline characters

    # split the dictionary file into list by: "\r\n"
    data = data.split("\r\n")
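Splitting on "\r\n" works when the file uses Windows-style line endings. As a hedged aside, the sketch below compares it with the str.splitlines() method on two made-up sample strings, one with "\r\n" endings and one with plain "\n" endings:

```python
crlf_text = "MODALITY\r\nDefn: The quality of being modal.\r\n"
lf_text = crlf_text.replace("\r\n", "\n")

# split("\r\n") leaves a trailing empty string after the last newline
by_crlf = crlf_text.split("\r\n")

# the same call finds nothing to split on in a "\n"-only string
lf_by_crlf = lf_text.split("\r\n")

# splitlines() handles both styles and drops the trailing empty string
crlf_lines = crlf_text.splitlines()
lf_lines = lf_text.splitlines()

print(len(by_crlf), len(lf_by_crlf), len(crlf_lines), len(lf_lines))
```

Keep in mind that splitlines() also breaks on a few less common separators, such as form feeds, so it isn’t a drop-in replacement in every case.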

Declare an empty Python dictionary for the dictionary entries and iterate over the list of strings

The dictionary entries can be stored as dictionary keys, and the list of strings can be iterated over using the enumerate() function:

    # create an empty Python dict for the entries
    dict_data = {}

    # iterate over the list of dictionary terms
    for num, line in enumerate(data):

Check for new dictionary entries in the string data

You may notice that each entry in this version of Webster’s dictionary starts with one or more uppercase words (e.g. MODALITY). This makes it easy for Python to determine where each dictionary entry and its definitions begin:

        try:
            # entry titles in Webster's dict are uppercase
            if len(line) < 40 and line.isupper():

                # new entry for the dictionary
                current = line.title()
                current = current.replace("'S", "'s")

The code shown above converts the term to title case and fixes the 's possessive. Be sure to remove the periods (.) from the entry title as well, since MongoDB treats periods in field names as path separators and may return an error:

                # MongoDB field names should not contain "."
                current = current.replace(".", "")
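To make the heading logic easier to experiment with, here’s a small sketch that pulls it into two helper functions. The names is_headword() and normalize_headword() are hypothetical; the checks and replacements are the same ones used in the code above, and the sample strings are only illustrative:

```python
def is_headword(line):
    """Return True for short, fully uppercase lines (the heuristic above)."""
    return len(line) < 40 and line.isupper()


def normalize_headword(line):
    """Convert an uppercase heading to title case and strip periods."""
    # title() capitalizes the letter after an apostrophe, hence the fix
    title = line.title().replace("'S", "'s")
    return title.replace(".", "")


print(is_headword("MODALITY"))
print(normalize_headword("AARON'S ROD"))
```

Keeping the check and the cleanup in separate functions makes it simple to test them against a handful of lines before running the whole file.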

Add the dictionary entry to the Python dict object

Before adding an entry to the dict object, check to see if the dictionary entry has already been added, and count the definitions for each entry:

                # add an empty dict object for a new entry
                if current not in dict_data:
                    # reset the definition count
                    def_count = 1

                    # new dictionary entry
                    dict_data[current] = {"definitions": 1}
                else:
                    # append to dict entry if needed
                    def_count = dict_data[current]["definitions"] + 1
                    dict_data[current]["definitions"] = def_count

Look for the dictionary entry’s definitions

The version of Webster’s dictionary that we’re using denotes definitions in two different ways. For the first definition of a word, the line starts with "Defn:"; for all subsequent definitions, it starts with a number and period like this: "2.".

In our code, we’ll look for the "Defn:" substring inside each line as we iterate through the data. The presence of the substring signals that a new definition starts on the line being evaluated:

            # add a new definition by looking for "Defn"
            if "Defn:" in line:

                # concatenate strings for the definitions
                def_title = "Defn " + str(def_count)
                def_content = line.replace("Defn: ", "")

                # add the definition to the defn title key
                dict_data[current][def_title] = def_content

                # add to the definition count
                def_count += 1

            # add definition by number and period
            elif "." in line[:2] and line[0].isdigit():

                # concatenate strings for the definitions
                def_title = "Defn " + str(def_count)
                def_content = line[line.find(".")+2:]

                # make sure content for definition has some length
                if len(def_content) >= 10:

                    # add the definition to the defn title key
                    dict_data[current][def_title] = def_content

                    # add to the definition count
                    def_count += 1

Make sure to increment the entry’s def_count variable each time a definition is added to keep accurate track of all the definitions.
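One thing to be aware of: the check "." in line[:2] only matches single-digit numbers such as "2.", so a line starting with "12." would be skipped. The sketch below shows a regular-expression variation that accepts any number of digits. The extract_definition() helper and the NUMBERED pattern are hypothetical names, and this is an alternative to, not a copy of, the logic used in the article’s script:

```python
import re

# one or more digits, a period, whitespace, then the definition text
NUMBERED = re.compile(r"^(\d+)\.\s+(.*)$")


def extract_definition(line):
    """Return the definition text found on this line, or None."""
    if "Defn:" in line:
        # keep only what follows the "Defn:" marker
        return line.split("Defn:", 1)[1].strip()

    match = NUMBERED.match(line)
    if match:
        # group 2 is the text after the number and period
        return match.group(2).strip()

    return None


print(extract_definition("Defn: The quality of being modal."))
print(extract_definition("12. A twelfth sense of the word."))
```

Like the original code, this only captures the text on the line where the marker appears; definitions that wrap onto following lines would need extra handling.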

Update the number of definitions and return the dict data

We need to close the try-except block and update the current entry’s "definitions" key, then return the dict_data object at the end of the function:

        except Exception as error:
            # errors while iterating with enumerate()
            print ("\nenumerate() text file ERROR:", error)
            print ("line number:", num)

        try:
            # update the number of definitions
            dict_data[current]["definitions"] = def_count-1
        except UnboundLocalError:
            # no entry has been found yet
            pass

    return dict_data

Have the get_webster_entries() function return the dictionary entries

At this point, we’re ready to call the function by passing the text file’s name as a string:

# call the function and return Webster's dict as a Python dict
dictionary = get_webster_entries("websters_dictionary.txt")

NOTE: Make sure to specify the exact location of the text file in the filename string if it is not located in the same directory as this Python script.

Put the dictionary entries into a list for the MongoDB insert_many() method call

The PyMongo client driver’s insert_many() method expects a Python list ([]) of dict objects, so we’ll need to put our entries into a list.

Instantiate an empty Python list for the MongoDB documents

Let’s declare the empty Python list that will contain the dictionary entries as Python dict objects. This list will ultimately be inserted into a MongoDB collection:

# declare an empty list for the final MongoDB docs to be inserted
final_list = []

Keep track of the entries that get removed because they have no content, and iterate over the dictionary object’s key-value pairs to construct a new MongoDB document with each iteration:

# tally the # of entries that won't be inserted
rem = 0

# iterate over the dictionary entries and definitions
for entry, val in dictionary.items():

    # only add to the list if it has more than the "definitions" key
    if len(val) > 1:
        # put the final dictionary entry into the list
        obj = {'entry': entry}

        # update the entry with definitions for the document
        obj.update(val)

Append the new MongoDB document object containing the dictionary entry to the list:

        # add the dictionary object to the MongoDB list
        final_list.append(obj)
    else:
        # tally the removed entries
        rem += 1
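The same filtering step can also be written as a list comprehension. The sketch below is a variation rather than the article’s exact code: it runs on a tiny made-up parsed dict and keeps an entry only if it has at least one "Defn N" key:

```python
# hypothetical sample of what the parsing function might return
parsed = {
    "Modality": {"definitions": 1, "Defn 1": "The quality of being modal."},
    "Modal": {"definitions": 0},  # no definition text was captured
}

# build one document per entry that has at least one "Defn N" key
documents = [
    {"entry": entry, **fields}
    for entry, fields in parsed.items()
    if any(key.startswith("Defn ") for key in fields)
]

# everything that was filtered out counts as removed
removed = len(parsed) - len(documents)

print(documents)
print("# of entries removed:", removed)
```

Checking for the presence of a definition key, instead of comparing against one exact dict value, keeps the filter working even if the "definitions" counter ends up holding a different number.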

Print information about the dictionary entries queued for MongoDB insertion

Print the list of dictionary entries and other information about the data about to be inserted into a MongoDB collection:

# uncomment the following to print the complete list of entries
#print (final_list)

# print the num of entries with empty definitions
print ("# of entries removed:", rem)

Declare a client instance of MongoDB and insert the dictionary entries

Now, let’s declare a new client instance using the MongoClient class. Be sure to pass the correct host and port parameters for your server:

# declare a client instance of the MongoDB PyMongo driver
client = MongoClient('localhost', 27017)

Use the MongoDB client’s server_info() method call to check if the host settings are correct

We’ll need to call the client’s server_info() method inside of a try-except block to make sure that the MongoDB server is running:

try:
    # server_info() should raise exception if host settings are invalid
    # (default=str guards against values that aren't JSON serializable)
    print ("\nserver_info():", json.dumps(client.server_info(), indent=4, default=str))

Instantiate MongoDB database and collection objects and pass the dictionary data to insert_many()

Declare the database and collection names for the data parsed from the Webster’s dictionary text file, and pass the list of MongoDB documents to the collection object’s insert_many() method:

    # declare a database and collection instance from the client
    db = client["WebstersDictionary"]
    col = db["DictionaryEntries"]

    # make an API request to MongoDB to insert_many() dictionary entries
    result = col.insert_many(final_list)

Parse and print the result object returned by the MongoDB server

    # print the API response from the MongoDB server
    print ("\ninsert_many() result:", result)

    # get the total numbers of docs inserted
    total_docs = len(result.inserted_ids)

    # print the number of dictionary entries inserted
    print ("total entries inserted:", total_docs)

except errors.ServerSelectionTimeoutError as err:
    # catch pymongo.errors.ServerSelectionTimeoutError
    print ("PyMongo ERROR:", err)
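If you’d like progress reporting or finer-grained error handling, one option is to insert the documents in smaller batches. The sketch below is a variation under a few assumptions: chunked() and insert_in_batches() are hypothetical helper names, the batch size of 1000 is arbitrary, and the 5-second serverSelectionTimeoutMS value is only an example (the driver’s default is 30 seconds). PyMongo already splits a large insert_many() call into server-sized batches on its own, so this isn’t required for correctness:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def insert_in_batches(documents, batch_size=1000):
    """Insert documents batch by batch and return how many were inserted."""
    # imported here so chunked() can be used without PyMongo installed
    from pymongo import MongoClient, errors

    # give up after 5 seconds if no server can be reached
    client = MongoClient("localhost", 27017, serverSelectionTimeoutMS=5000)
    col = client["WebstersDictionary"]["DictionaryEntries"]

    inserted = 0
    try:
        for batch in chunked(documents, batch_size):
            try:
                # ordered=False lets the server continue past a bad document
                result = col.insert_many(batch, ordered=False)
                inserted += len(result.inserted_ids)
            except errors.BulkWriteError as err:
                # count the documents that did make it in, then report
                inserted += err.details.get("nInserted", 0)
                print("bulk write errors:", err.details.get("writeErrors"))
    except errors.ServerSelectionTimeoutError as err:
        print("PyMongo ERROR:", err)
    finally:
        client.close()

    return inserted
```

With a MongoDB server running, calling insert_in_batches(final_list) would take the place of the single insert_many() call shown above.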

Print the number of seconds that have elapsed since the script started

# print the time that elapsed
print ("Elapsed time (in seconds):", time.time() - start_time)
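As a small aside, Python’s time.perf_counter() function is designed for measuring short intervals and isn’t affected by system clock adjustments. Here’s a minimal sketch of the same timing pattern using it, with a throwaway calculation standing in for the parsing and inserting work:

```python
import time

# perf_counter() returns a value that is only meaningful as a difference
start = time.perf_counter()

# stand-in for the parsing and inserting work
total = sum(len(str(n)) for n in range(100_000))

elapsed = time.perf_counter() - start
print(f"Elapsed time (in seconds): {elapsed:.3f}")
```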

Serialize the dictionary entries as a Python pickle object

If you’d like to serialize and store the data locally, it’s not difficult to make that happen. You can do this by passing the dictionary object, containing all of the dictionary entries, to the Pickle library’s dump() method:

# serialize the Python dictionary as a local pickle file
with open("websters_mongodb.pickle","wb") as pickle_dict:
pickle.dump(dictionary, pickle_dict)
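To read the data back later, you can pass an open file object to the pickle.load() method. The sketch below does a round trip with a small made-up dict and a temporary file path, both of which are assumptions used only to keep the example self-contained:

```python
import os
import pickle
import tempfile

# hypothetical sample standing in for the parsed dictionary
entries = {"Modality": {"definitions": 1, "Defn 1": "The quality of being modal."}}

# write the pickle into a temporary directory
pickle_path = os.path.join(tempfile.mkdtemp(), "websters_sample.pickle")
with open(pickle_path, "wb") as handle:
    pickle.dump(entries, handle)

# load it back and confirm nothing was lost
with open(pickle_path, "rb") as handle:
    restored = pickle.load(handle)

print(restored == entries)
```

Only unpickle files that you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.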

Conclusion

Once you’ve finished creating your script based on the example we discussed in this article, it’s time to test it out. Run the Python script to open the dictionary text file and parse its contents to be inserted as MongoDB documents:

python3 insert_dictionary.py

Even if you’re running this code on an older machine, the complete run-time for the script shouldn’t be more than a few seconds:

total entries inserted: 90597
# of entries removed: 8372
Elapsed time (in seconds): 2.8063137531280518

Screenshot of the Python script printing the number of MongoDB documents inserted

NOTE: Python’s print() function is comparatively slow because it writes to the console. Avoid calling print() while iterating over the dictionary entries to keep your code running efficiently.

Use the MongoDB Compass GUI to filter the Webster dictionary entries

If you have the MongoDB Compass application installed, you can use it to verify that the Webster’s dictionary entries have been inserted successfully. Just navigate to the WebstersDictionary database and then to the DictionaryEntries collection on the left-hand side, and there should be over 90k documents in it:

Screenshot of MongoDB Compass UI

Create a MongoDB filter to find a dictionary entry

You can also use the Compass GUI to filter a dictionary entry. Here’s a filter example that gets the definition for “Aaron’s Rod”:

{"entry":"Aaron's Rod"}

Screenshot of MongoDB Compass GUI filtering documents parsed from entries in Webster's dictionary
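The same filter can be used from Python with the collection’s find_one() method. In the sketch below, find_entry() and main() are hypothetical helper names; main() needs PyMongo and a running server, so it’s defined but not called automatically. The optional create_index() call is a suggestion for speeding up lookups by entry and isn’t part of the original script:

```python
def find_entry(collection, term):
    """Return the document whose "entry" field equals `term`, or None."""
    # the second argument is a projection that hides MongoDB's _id field
    return collection.find_one({"entry": term}, {"_id": 0})


def main():
    # requires PyMongo and a running MongoDB server
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    col = client["WebstersDictionary"]["DictionaryEntries"]

    # optional: an index on "entry" avoids scanning every document
    col.create_index("entry")

    print(find_entry(col, "Aaron's Rod"))
    client.close()
```

Call main() yourself once the dictionary entries have been inserted.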

Inserting or indexing dictionary entries into a NoSQL database as documents is ideal for creating a mobile, web-based, or desktop dictionary application. With the instructions provided in this tutorial, you’ll be able to parse a line-by-line text file and insert MongoDB documents with Python to create your own dictionary application.

Just the Code

Throughout our tutorial, we looked at the example code one section at a time. Shown below is the complete Python script to parse a line-by-line text file and insert MongoDB documents based on the contents of the file:

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

# import the MongoClient class
from pymongo import MongoClient, errors

# import Python's pickle library to serialize the dictionary
import pickle

# import Python's time and JSON libraries
import time, json

# record the start time for the script
start_time = time.time()

# declare a function that parses the Webster's text file
def get_webster_entries(filename):

    # use the open() function to open the text file
    with open(filename, 'rb') as raw:
        data = raw.read()

    # decode the file as UTF-8 and ignore errors
    data = data.decode("utf-8", errors='ignore')

    # split the dictionary file into list by: "\r\n"
    data = data.split("\r\n")

    # create an empty Python dict for the entries
    dict_data = {}

    # iterate over the list of dictionary terms
    for num, line in enumerate(data):

        try:
            # entry titles in Webster's dict are uppercase
            if len(line) < 40 and line.isupper():

                # new entry for the dictionary
                current = line.title()
                current = current.replace("'S", "'s")

                # MongoDB field names should not contain "."
                current = current.replace(".", "")

                # add an empty dict object for a new entry
                if current not in dict_data:
                    # reset the definition count
                    def_count = 1

                    # new dictionary entry
                    dict_data[current] = {"definitions": 1}
                else:
                    # append to dict entry if needed
                    def_count = dict_data[current]["definitions"] + 1
                    dict_data[current]["definitions"] = def_count

            # add a new definition by looking for "Defn"
            if "Defn:" in line:

                # concatenate strings for the definitions
                def_title = "Defn " + str(def_count)
                def_content = line.replace("Defn: ", "")

                # add the definition to the defn title key
                dict_data[current][def_title] = def_content

                # add to the definition count
                def_count += 1

            # add definition by number and period
            elif "." in line[:2] and line[0].isdigit():

                # concatenate strings for the definitions
                def_title = "Defn " + str(def_count)
                def_content = line[line.find(".")+2:]

                # make sure content for definition has some length
                if len(def_content) >= 10:

                    # add the definition to the defn title key
                    dict_data[current][def_title] = def_content

                    # add to the definition count
                    def_count += 1

        except Exception as error:
            # errors while iterating with enumerate()
            print ("\nenumerate() text file ERROR:", error)
            print ("line number:", num)

        try:
            # update the number of definitions
            dict_data[current]["definitions"] = def_count-1
        except UnboundLocalError:
            # no entry has been found yet
            pass

    return dict_data

# call the function and return Webster's dict as a Python dict
dictionary = get_webster_entries("websters_dictionary.txt")

# declare an empty list for the final MongoDB docs to be inserted
final_list = []

# tally the # of entries that won't be inserted
rem = 0

# iterate over the dictionary entries and definitions
for entry, val in dictionary.items():

    # only add to the list if it has more than the "definitions" key
    if len(val) > 1:
        # put the final dictionary entry into the list
        obj = {'entry': entry}

        # update the entry with definitions for the document
        obj.update(val)

        # add the dictionary object to the MongoDB list
        final_list.append(obj)
    else:
        # tally the removed entries
        rem += 1

# uncomment the following to print the complete list of entries
#print (final_list)

# print the num of entries with empty definitions
print ("# of entries removed:", rem)

# declare a client instance of the MongoDB PyMongo driver
client = MongoClient('localhost', 27017)

try:
    # server_info() should raise exception if host settings are invalid
    # (default=str guards against values that aren't JSON serializable)
    print ("\nserver_info():", json.dumps(client.server_info(), indent=4, default=str))

    # declare a database and collection instance from the client
    db = client["WebstersDictionary"]
    col = db["DictionaryEntries"]

    # make an API request to MongoDB to insert_many() dictionary entries
    result = col.insert_many(final_list)

    # print the API response from the MongoDB server
    print ("\ninsert_many() result:", result)

    # get the total numbers of docs inserted
    total_docs = len(result.inserted_ids)

    # print the number of dictionary entries inserted
    print ("total entries inserted:", total_docs)

except errors.ServerSelectionTimeoutError as err:
    # catch pymongo.errors.ServerSelectionTimeoutError
    print ("PyMongo ERROR:", err)

# print the time that elapsed
print ("Elapsed time (in seconds):", time.time() - start_time)

# serialize the Python dictionary as a local pickle file
with open("websters_mongodb.pickle","wb") as pickle_dict:
    pickle.dump(dictionary, pickle_dict)
