Scrape a Website to CockroachDB


Introduction

In this tutorial we learn how to scrape a website into CockroachDB with Python scripts. Here’s why and how:

  • Why? Why scrape data from a website using Python and store it in CockroachDB? Most reasons come down to automation. For example, you may wish to automate: (1) getting a stock or cryptocurrency price from a site that lists prices; (2) scouring dictionary.com and/or thesaurus.com to feed your AI; (3) gathering data for research and statistics; (4) averaging or otherwise analyzing weather data; (5) looking for a job; and other tasks.
  • What and how? We will use functions and methods for creating a “stream” to pull data from a website, isolating the parts we want using the “BeautifulSoup” library, and saving that data to CockroachDB with Python. In addition, we will use Flask’s render_template function so we can alert the user to status, including trapped errors. Finally, we will build a Python application for scraping website temperature data into our CockroachDB table.

Prerequisites

  • Cockroach: Experience with the basics of CockroachDB use. You may also want to use the free “DBeaver” tool.
  • Python: Beginner understanding of Python. We are using the free “Visual Studio Code” editor to write and debug Python applications.

The plan for website scrape

  • 1: Install the Python frameworks we need.
  • 2: Use a web browser to view source of the site you want to scrape with Python.
  • 3: Open a “stream” to the web site URL and bring all the data into a variable we can work with locally.
  • 4: Isolate the data within that text by looking for specific tags that the data we want is nested in.
  • 5: Insert the data into a Cockroach table we set up to receive and store this data.
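The five steps above can be sketched as a skeleton before filling in any libraries. Every function name below is an illustrative placeholder (not part of Selenium, BeautifulSoup, or psycopg2), and the fetch and save steps are stubbed out:

```python
# Illustrative skeleton of the plan; every name here is a placeholder.

def fetch_page(url):
    # Step 3: open a "stream" to the URL and return the raw HTML.
    # (In the article this is done with Selenium; stubbed here.)
    return '<span class="deg-feels">72<sup>°</sup></span>'

def isolate_temperature(html):
    # Step 4: isolate the data by its tag and class.
    # (In the article this is done with BeautifulSoup; a naive split here.)
    return html.split(">")[1].split("<")[0]

def save_temperature(value, rows):
    # Step 5: insert into a CockroachDB table (stubbed as a list).
    rows.append(value)

rows = []
html = fetch_page("https://weather.com/...")
save_temperature(isolate_temperature(html), rows)
print(rows)  # ['72']
```

The real versions of these steps are developed one at a time in the sections that follow.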

Install Python frameworks

Once you have used pip to install flask, selenium, beautifulsoup4, and psycopg2, you can reference them via the “from … import” construct:

from flask import Flask
from flask import render_template
from selenium import webdriver
from bs4 import BeautifulSoup
import psycopg2

Selenium is a tool for automating tests of web applications – but, importantly, not desktop or mobile applications. Here we use its browser automation to fetch a fully rendered page.

View source

  • First, navigate to the site you want to scrape, https://weather.com, and choose your location.
  • Then click “Today” at the top, so we get a page that shows the temperature.
  • Copy the new URL.
  • Notice the “feels like 72” temperature, then open “view source”.
  • Now search through the source HTML for “feels like” to find the spot in the HTML referring to the data we want to scrape.
  • Copy that markup for examination:
<span class="deg-feels" className="deg-feels">72<sup>°</sup></span>

From the above markup, the important parts to remember for later are:

  • span
  • deg-feels
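Before writing any scraper code, you can verify that these two markers are enough to pull the number out. Here is a minimal sketch using Python’s standard-library re module against the snippet copied above (a static sample string, not a live page):

```python
import re

# Sample HTML mimicking the snippet copied from "view source".
sample_html = '<span class="deg-feels" className="deg-feels">72<sup>°</sup></span>'

# Look for a span whose class attribute is "deg-feels" and grab the
# digits that immediately follow the opening tag.
match = re.search(r'<span class="deg-feels"[^>]*>(\d+)', sample_html)
if match:
    print(match.group(1))  # the temperature digits, here "72"
```

If the search comes back empty or matches more than one spot, the class name is not unique enough and you need a more specific marker.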

Scrape data

myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
txtURL = "https://weather.com/weather/today/l/7ebb344012f0c5ff88820d763da89ed94306a86c770fda50c983bf01a0f55c0d"
myWebDriver.get(txtURL)
txtContent = myWebDriver.page_source
txtSoupInBowl = BeautifulSoup(txtContent, "html.parser")

Analysis

  • Set “webdriver” object to use Google’s Chrome web browser.
  • Designate a URL for the website scraping.
  • Initiate scrape of the website using Python.
  • Return all of the page contents into “txtContent” variable.
  • Use BeautifulSoup to copy content into a format that is easy to search.

Search for exact data wanted

objTemp = txtSoupInBowl.find('span', attrs={'class':'deg-feels'})
txtTemp = objTemp.text

Analysis

  • Find the exact span and class for the web site data we are looking for.
  • Get the value in the text parameter from “objTemp” and place it in a variable called “txtTemp”. txtTemp now contains our temperature!
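Note that objTemp.text also includes the text of the nested sup element, so the scraped value typically arrives as “72°” rather than a bare number. If you want to store it numerically, a small cleanup step (assuming the degree sign is the only non-numeric character) looks like this:

```python
# "txtTemp" as returned by objTemp.text for the span we copied earlier.
txtTemp = "72°"  # sample value for illustration

# Keep only the characters that can form a number, then convert.
cleaned = "".join(ch for ch in txtTemp if ch.isdigit() or ch in "-.")
temperature = float(cleaned)
print(temperature)  # 72.0
```

This also guards against stray whitespace or markup text sneaking into the value.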

Insert scraped data into Cockroach

s = ""
s += "INSERT INTO tblTemps "
s += "("
s += "txtTemp"
s += ") VALUES ("
s += "%s"
s += ")"
cursorCRDB.execute(s, [txtTemp])

Analysis

This is parameterized SQL telling CockroachDB, via Python, to INSERT a new row into the table named “tblTemps”, placing the value of the local “txtTemp” variable into the column we have also named “txtTemp”. The “%s” placeholder lets psycopg2 bind the value safely, rather than our concatenating it directly into the SQL string.
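You can exercise the same parameterized pattern without a running CockroachDB cluster. The sketch below uses Python’s standard-library sqlite3 purely as a stand-in (note that sqlite3 uses “?” placeholders where psycopg2 uses “%s”; the table and column names mirror the article’s):

```python
import sqlite3

# sqlite3 stands in for psycopg2/CockroachDB so the pattern runs anywhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tblTemps (txtTemp REAL)")

txtTemp = "72"
# Parameterized INSERT: the driver binds the value safely instead of
# our concatenating it into the SQL string.
cur.execute("INSERT INTO tblTemps (txtTemp) VALUES (?)", [txtTemp])
conn.commit()

cur.execute("SELECT txtTemp FROM tblTemps")
row = cur.fetchone()
print(row[0])  # 72.0
```

The shape of the call is identical with psycopg2: the SQL string with placeholders goes first, and the values go in a separate list.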

Source Code in Python

from flask import Flask # needed for flask-dependent libraries below
from flask import render_template # to render the status and error pages
from selenium import webdriver # to grab page source from a URL
from bs4 import BeautifulSoup # for searching through HTML
import psycopg2 # for database access

app = Flask(__name__)

# Set up Cockroach database connection and cursor.
txtDBname = "db name"
txtNameUser = "db user name"
txtSSLmode = "verify-full"
txtSSLrootCert = 'certifs/ca.crt'
t_sslkey = 'certifs/client.maxroach.key'
t_sslcert = 'certifs/client.maxroach.crt'
txtHostURL = "localhost"
txtPortNum = "32853"
connCRDB = psycopg2.connect(database=txtDBname, user=txtNameUser, sslmode=txtSSLmode, sslrootcert=txtSSLrootCert, sslkey=t_sslkey, sslcert=t_sslcert, host=txtHostURL, port=txtPortNum)
cursorCRDB = connCRDB.cursor()

@app.route("/import_temp")
def import_temp():
    myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
    # Set the URL of the site we want to scrape.
    #   NOTE: the long string of characters at the end of this URL below is a clue that
    #   maybe this page is so dynamic, maybe refers to a specific web session and/or day/time,
    #   that we can't necessarily count on it to be the same more than this time.
    #   Which means... we may want to find another source for our weather data; one that is
    #   more dependable. That said, whatever URL you use, the methodology in this
    #   lesson stands.
    txtURL = "https://weather.com/weather/today/l/7ebb344012f0c5ff88820d763da89ed94306a86c770fda50c983bf01a0f55c0d"
    # Begin scrape of the website's page data.
    myWebDriver.get(txtURL)
    # Return page contents into "txtContent" variable.
    txtContent = myWebDriver.page_source
    # Use BeautifulSoup to make page content easily searchable.
    txtSoupInBowl = BeautifulSoup(txtContent, "html.parser")
    # Search for a UNIQUE span and class for the data we are looking for:
    objTemp = txtSoupInBowl.find('span', attrs={'class':'deg-feels'})
    # From the resulting object, "objTemp", get the text param and assign it to "txtTemp".
    txtTemp = objTemp.text
    # Done with the browser; release it.
    myWebDriver.quit()

    # Build parameterized SQL for saving the temperature data to a new record.
    s = ""
    s += "INSERT INTO tblTemps "
    s += "("
    s += "txtTemp"
    s += ") VALUES ("
    s += "%s"
    s += ")"

    # Trap database errors for the INSERT.
    try:
        cursorCRDB.execute(s, [txtTemp])
        connCRDB.commit()
    except psycopg2.Error as e:
        txtMsg = "Database error: " + str(e) + "\n INSERT SQL: " + s
        return render_template("error_web-page.html", txtMsg = txtMsg)

    # Send a message to the user.
    txtMsg = "Successful scrape!"
    return render_template("progression.html", txtMsg = txtMsg)

Conclusion

We learned here how to scrape a website into CockroachDB with Python. First, we installed the frameworks needed. Next, we navigated to weather.com with our web browser and searched the “view source” of that page to find a unique identifier, such as a class name. Then we scraped all the contents of that page into a local object. Using that local object, we used the “find” method from “BeautifulSoup” to isolate the exact data we were looking for. Next, we inserted that data into a CockroachDB table. Finally, we made the source code for the entire application, heavily commented, available at the end of this instructional article.
