How to Fetch Data from the Internet and Index it into Elasticsearch Documents using Python

Introduction

Elasticsearch can index data fetched from external web sources, such as Wikipedia. This tutorial demonstrates how to use the urllib3 HTTP library for Python 3 to fetch Wikipedia article titles and URLs from the internet and how to index that data as Elasticsearch documents.

Prerequisites for Using urllib3 and the Elasticsearch Client for Python 3

  • The urllib3 library and the Elasticsearch low-level client are both third-party packages; urllib3 is not part of the Python standard library, although it is often already present as a dependency of popular packages such as requests. The PIP3 package manager can install both with the following command:
pip3 install urllib3 elasticsearch
  • The Elasticsearch cluster must be running. Execute the following cURL request to verify the cluster is running and to obtain its version number:
curl -XGET localhost:9200

The result should resemble the following:

Screenshot of the PIP3 installer for Python 3 and Elasticsearch in a UNIX terminal

How to Import the urllib3 and Elasticsearch Packages

Make sure to import the elasticsearch and urllib3 packages at the beginning of the Python script:

# import the urllib package for Python 3
import urllib3

# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch

How to instantiate a client instance of the Python low-level Elasticsearch library

Instantiate a client instance with the following code, making sure to pass a hosts string that matches your Elasticsearch cluster:

# domain name, or server's IP address, goes in the 'hosts' list
client = Elasticsearch(hosts=["localhost:9200"])

How to Define a Python Function to Fetch HTML Data using urllib3

Declare a function that will fetch HTML data and return it as a decoded UTF-8 string using the urllib3 library:

# declare a function that will return a webpage's HTML as a string
def fetch_data(url):

    # catch any errors and print/return them
    try:
        # get the HTTP urllib3.connectionpool.HTTPSConnectionPool object
        http_pool = urllib3.connection_from_url(url)

        # returns a urllib3.response.HTTPResponse object
        resp = http_pool.urlopen('GET', url)

        # decode the HTML data using UTF-8
        decoded = resp.data.decode('utf-8')

    except Exception as err:
        print ("fetch_data() ERROR", err)
        decoded = "fetch_data() ERROR" + str(err)

    # return the urllib response or any errors
    return decoded
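The fetch_data() function above uses connection_from_url(). The more common urllib3 pattern is a PoolManager, which handles connection pooling across hosts. The following variant is an equivalent sketch; the function name fetch_page is my own, not part of the original script:

```python
import urllib3

# declare an alternative fetcher built on urllib3's PoolManager
def fetch_page(url):
    try:
        # PoolManager handles connection pooling and thread safety
        http = urllib3.PoolManager()

        # make the GET request with a timeout; returns an HTTPResponse
        resp = http.request("GET", url, timeout=10.0)

        # decode the response body using UTF-8
        return resp.data.decode("utf-8")
    except Exception as err:
        # mirror fetch_data()'s behavior: print and return the error
        print("fetch_page() ERROR", err)
        return "fetch_page() ERROR" + str(err)
```

Either version works for this tutorial; PoolManager is simply the interface the urllib3 documentation recommends for general use.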

How to Parse and Slice the HTML String Data Returned by urllib3

The code below defines a function in the Python script that slices the HTML string using certain “markers.” It iterates over a list of HTML lines, creating sub-strings in the process, and removes the HTML tags in order to parse out the article title and URL string data.

Declare another function that will parse and slice the HTML string data:

# iterate, slice, and parse HTML string data
def parse_html(data, marker, start, end):
    # iterate the list of HTML data
    for line in data:

How to verify that the HTML marker is in the line iteration

Verify that the target marker sub-string is in the line, then look for the start and end points of the data that needs to be parsed out of it:

        # make sure the HTML string contains marker
        if marker in line:
            line = line[line.find(marker):]

            # check if the start and end data is in line
            if (start in line) and (end in line):

How to use Python’s find() function to get the start and end location of the target string slice

Use the following code to get a target slice from the data line using Python’s built-in find() function and then return the string slice:

                # find where the target data starts and ends
                s = line.find(start) + len(start)
                e = line.find(end)

                # return the sliced string
                return line[s : e]

    # return empty string by default
    return ""

NOTE: The function will return an empty string if the target string slice cannot be found in the web page.
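For example, given a single line of list-item markup similar to Wikipedia’s, the function extracts just the visible title text. The function body is repeated here so the snippet runs standalone, and the sample HTML line is illustrative:

```python
# iterate, slice, and parse HTML string data (same function as above)
def parse_html(data, marker, start, end):
    for line in data:
        # make sure the HTML string contains the marker
        if marker in line:
            line = line[line.find(marker):]

            # check if the start and end markers are in the line
            if (start in line) and (end in line):
                # slice out the text between the start and end markers
                s = line.find(start) + len(start)
                e = line.find(end)
                return line[s:e]

    # return an empty string by default
    return ""

# a sample line of list-item HTML, similar to Wikipedia's markup
sample = '<li><a href="/wiki/Earth" title="Earth">Earth</a></li>'

title = parse_html([sample], marker='title="', start='">', end='</a></li>')
print(title)  # Earth
```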

How to Declare a URL String for Wikipedia Articles and Fetch the Data

Execute the following code to create a variable for the Wikipedia URL string outside of the declared functions and pass it to the fetch_data() function declared earlier:

# declare a URL string for the urllib to parse
url = "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/1"

# pass the URL string to the fetch_data() function
source = fetch_data(url)

How to use Python’s split() function to return the HTML data as a list

Use the following code to break the web page’s HTML into a list with split(), delimiting each line by its newline character (\n):

# use Python's split() method to put lines of HTML into a list []
source = source.split("\n") # "\n" == linebreak

# print the number of HTML lines returned by urllib
print ("Lines to iterate:", len(source))

The parse_html() function declared above requires that a Python list object ([]) be passed to it. Here use the string’s split() method to return a list of all the lines of HTML in the web page.

How to Call the parse_html() Function and get the Macro Topic for the Wikipedia Articles

This section will cover how to get the main topic that all of the Wikipedia articles belong to by passing a list of the HTML data from the Wikipedia article page to the parse_html() function.

How to view the web page’s source data

Right-click anywhere on the Wikipedia website’s Level 1 vital articles page shown in the following screenshot:

Screenshot of Wikipedia in a browser with View Page Source option

This will bring up the “view page source” option for the HTML, as shown above underlined in red. Click on that option.

The following is a screenshot of where the Wikipedia article titles begin in the HTML page source

Screenshot of the Wikipedia page's HTML page source data

The macro topic for the Wikipedia articles starts at '<h2><span id="Level_1', so pass this string to the parse_html() function’s marker parameter.

Pass the markers for the HTML to the parse_html() method

The actual topic name starts at 'title="Edit section: ' and ends at '">edit</a>':

# topic of articles
topic = parse_html(
    data = source,
    marker = '<h2><span id="Level_1',
    start = 'title="Edit section: ',
    end = '">edit</a>'
)

In case the function couldn’t find the title, the following code falls back to deriving an index name from the URL string:

# make sure topic/index name has length
if len(topic) < 1:
    # get index name from URL if needed
    topic = url[url.find("Wikipedia:")+10:]
    topic = topic[:topic.find("/")]

# make lowercase and replace spaces with underscores (_)
topic = topic.lower()
topic = topic.replace(" ", "_")

Fix the title to conform to Elasticsearch index naming conventions

Elasticsearch imposes requirements on index names. Remove any special characters from the Wikipedia title and make sure the index name is no longer than 255 bytes:

# remove all special characters
topic = ''.join(char for char in topic if char.isalnum() or char == "_")

# Elasticsearch index names must be no longer than 255 bytes
if len(topic) > 255:
    topic = topic[:255]

# print the topic/Elasticsearch index name
print ("Topic:", topic)
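The normalization and sanitization steps above can be collected into one small helper. The function name sanitize_index_name is an assumption for illustration, not part of the original script:

```python
# collect the index-name cleanup steps into one reusable function
def sanitize_index_name(name):
    # make lowercase and replace spaces with underscores
    name = name.lower().replace(" ", "_")

    # keep only alphanumeric characters and underscores
    name = ''.join(char for char in name if char.isalnum() or char == "_")

    # truncate to Elasticsearch's index-name length limit
    return name[:255]

print(sanitize_index_name("Vital Articles!"))  # vital_articles
```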

Iterate over the list of source data to create Python dict objects

Parse and create Python dict objects from the HTML string data returned by the urllib3 Python library and then index the parsed data as Elasticsearch documents.

Execute the following code to declare an empty Python list for the Elasticsearch documents and iterate the HTML source data:

# declare an empty list for the article links
articles = []

# iterate the list of HTML lines of data
for line in source:
    # make sure that the HTML line contains correct data
    if 'title="' in line and '<a href="/wiki/' in line:

Pass the line of HTML to the parse_html() function

Pass the line string as a list object to the data parameter of the parse_html() function call in every iteration in order to get all of the Wikipedia article names and links:

        # title for the article
        title = parse_html(
            data = [line],
            marker = 'title="',
            start = '">',
            end = '</a></li>'
        )

        # link for the article
        link = parse_html(
            data = [line],
            marker = '/> <a href="',
            start = '<a href="',
            end = '" title="'
        )

If the function found the data it needed, append the data to the articles list as a Python dictionary object:

        # make sure the link and title have length
        if len(title) >= 3 and len(link) > 3:

            # ignore links like "Featured Articles"
            if 'Wikipedia:' not in link:
                articles += [
                    {
                        "title" : title,
                        "url" : link
                    }
                ]

# print the articles dict object
print ("Articles:", articles, "\n")

Create Elasticsearch documents from the Python dictionary objects

Iterate over the article list and index the Python dict objects as Elasticsearch documents:

# iterate the list of articles with enumerate()
for id, article in enumerate(articles):

Grab the title and URL attributes of the Python dict object

    # parse the key-value pairs for the article
    article_title = article["title"]
    article_url = "https://en.wikipedia.org" + str(article["url"])
    id_num = id + 1

Create a Python dict object for the Elasticsearch document being indexed

Put the title and URL string values into a new dict object that will be used for the Elasticsearch document body:

    # create a document body for the new Elasticsearch document
    doc_body = {
        "title": article_title,
        "url": article_url
    }

Call the index() method to index the Wikipedia data as Elasticsearch documents

Pass the new dict object with the Wikipedia data to the body parameter of the client instance’s index() method. Note that the doc_type parameter shown here is deprecated in Elasticsearch 7.x and removed in 8.x, so omit it on newer clusters:

    # use a try-catch block for any Elasticsearch index() exceptions
    try:
        # call the Elasticsearch client's index() method
        response = client.index(
            index = topic,
            doc_type = '_doc',
            id = id_num,
            body = doc_body,
            request_timeout = 5
        )

        # print the API response returned by the Elasticsearch client
        print ("response:", response)

Print the Elasticsearch API response object returned by the cluster

        # evaluate the response dict object to check that the API was successful
        if response["_shards"]["successful"] == 1:
            print ("INDEX CALL WAS A SUCCESS:", response["_shards"]["successful"])
        else:
            print ("INDEX CALL FAILED")

    except Exception as err:
        print ("Elasticsearch index() ERROR:", err)
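The loop above makes one HTTP round trip per document. As a sketch of an alternative for larger article lists, the bulk helper in the elasticsearch package can index every document in a single request; the make_actions() generator below is my own helper, not part of the original script:

```python
# build bulk-indexing actions from the parsed article list
def make_actions(articles, index_name):
    for id_num, article in enumerate(articles, start=1):
        # each action names the target index, document ID, and body
        yield {
            "_index": index_name,
            "_id": id_num,
            "_source": {
                "title": article["title"],
                "url": "https://en.wikipedia.org" + str(article["url"])
            }
        }

# with a running cluster, one call indexes everything:
# from elasticsearch import helpers
# helpers.bulk(client, make_actions(articles, topic))
```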

Run the Python script to index the Elasticsearch documents

When you run the Python script, the urllib3 library will make an HTTP request to the Wikipedia server to get the HTML data, and then the script will create the index (if needed) and index the Wikipedia data as new Elasticsearch documents:

python3 index_wikipedia.py

Screenshot of the Wikipedia titles and URLs indexed to Elasticsearch as documents in a terminal window

Conclusion

This tutorial explained how to use the urllib3 library for Python 3 to fetch Wikipedia titles and URLs from the internet and how to index that data as Elasticsearch documents. It specifically covered how to define a Python function to fetch HTML data, how to parse and slice the HTML string data, how to declare a URL string for the Wikipedia articles, how to check whether a string of HTML contains a Wikipedia article title and link, and how to index the resulting Elasticsearch documents.

NOTE: Exercise care when using Python’s urllib3 library to get HTML data from a web page. Contents may be copyrighted, and many sites will block an IP address that makes excessive automated requests. The Wikipedia HTML links fetched in this tutorial are not copyrighted material; however, the content of the Wikipedia articles themselves is subject to some copyright restrictions.

Here’s how to make a GET HTTP request to the Elasticsearch cluster to verify that the data was indexed successfully:

GET level_1_vital_articles_10_articles/_search

Use the Search API in Kibana to return all of the Wikipedia data as Elasticsearch documents:

Screenshot of the Wikipedia URL and article titles data as Elasticsearch documents in Kibana

Just the Code

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

# import the urllib package for Python 3
import urllib3

# import the Elasticsearch low-level client library
from elasticsearch import Elasticsearch

# domain name, or server's IP address, goes in the 'hosts' list
client = Elasticsearch(hosts=["localhost:9200"])

# declare a function that will return a webpage's HTML as a string
def fetch_data(url):

    # catch any errors and print/return them
    try:
        # get the HTTP urllib3.connectionpool.HTTPSConnectionPool object
        http_pool = urllib3.connection_from_url(url)

        # returns a urllib3.response.HTTPResponse object
        resp = http_pool.urlopen('GET', url)

        # decode the HTML data using UTF-8
        decoded = resp.data.decode('utf-8')

    except Exception as err:
        print ("fetch_data() ERROR", err)
        decoded = "fetch_data() ERROR" + str(err)

    # return the urllib response or any errors
    return decoded

# iterate, slice, and parse HTML string data
def parse_html(data, marker, start, end):
    # iterate the list of HTML data
    for line in data:

        # make sure the HTML string contains marker
        if marker in line:
            line = line[line.find(marker):]

            # check if the start and end data is in line
            if (start in line) and (end in line):

                # find where the target data starts and ends
                s = line.find(start) + len(start)
                e = line.find(end)

                # return the sliced string
                return line[s : e]

    # return empty string by default
    return ""

# declare a URL string for the urllib to parse
url = "https://en.wikipedia.org/wiki/Wikipedia:Vital_articles/Level/1"

# pass the URL string to the fetch_data() function
source = fetch_data(url)

# use Python's split() method to put lines of HTML into a list []
source = source.split("\n") # "\n" == linebreak

# print the number of HTML lines returned by urllib
print ("Lines to iterate:", len(source))

# topic of articles
topic = parse_html(
    data = source,
    marker = '<h2><span id="Level_1',
    start = 'title="Edit section: ',
    end = '">edit</a>'
)

# make sure topic/index name has length
if len(topic) < 1:
    # get index name from URL if needed
    topic = url[url.find("Wikipedia:")+10:]
    topic = topic[:topic.find("/")]

# make lowercase and replace spaces with underscores (_)
topic = topic.lower()
topic = topic.replace(" ", "_")

# remove all special characters
topic = ''.join(char for char in topic if char.isalnum() or char == "_")

# Elasticsearch index names must be no longer than 255 bytes
if len(topic) > 255:
    topic = topic[:255]

# print the topic/Elasticsearch index name
print ("Topic:", topic)

# declare an empty list for the article links
articles = []

# iterate the list of HTML lines of data
for line in source:

    # make sure that the HTML line contains correct data
    if 'title="' in line and '<a href="/wiki/' in line:

        # title for the article
        title = parse_html(
            data = [line],
            marker = 'title="',
            start = '">',
            end = '</a></li>'
        )

        # link for the article
        link = parse_html(
            data = [line],
            marker = '/> <a href="',
            start = '<a href="',
            end = '" title="'
        )

        # make sure the link and title have length
        if len(title) >= 3 and len(link) > 3:

            # ignore links like "Featured Articles"
            if 'Wikipedia:' not in link:
                articles += [
                    {
                        "title" : title,
                        "url" : link
                    }
                ]

# print the articles dict object
print ("Articles:", articles, "\n")

# iterate the list of articles with enumerate()
for id, article in enumerate(articles):

    # parse the key-value pairs for the article
    article_title = article["title"]
    article_url = "https://en.wikipedia.org" + str(article["url"])
    id_num = id + 1

    # create a document body for the new Elasticsearch document
    doc_body = {
        "title": article_title,
        "url": article_url
    }

    # use a try-catch block for any Elasticsearch index() exceptions
    try:
        # call the Elasticsearch client's index() method
        response = client.index(
            index = topic,
            doc_type = '_doc',
            id = id_num,
            body = doc_body,
            request_timeout = 5
        )

        # print the API response returned by the Elasticsearch client
        print ("response:", response)

        # evaluate the response dict object to check that the API was successful
        if response["_shards"]["successful"] == 1:
            print ("INDEX CALL WAS A SUCCESS:", response["_shards"]["successful"])
        else:
            print ("INDEX CALL FAILED")

    except Exception as err:
        print ("Elasticsearch index() ERROR:", err)
