Scrape a Website to Postgres with Python

Introduction

In this article, we learn how to scrape data from a website into Postgres using Python. Here’s why and how:

  • Why? Why scrape data from a website using Python and store it in Postgres? The most common reason is automation. For example, you may wish to automate: (a) getting a stock or cryptocurrency price from a site that lists prices; (b) scouring dictionary.com and/or thesaurus.com to feed your AI; (c) gathering data for research and statistics; (d) averaging or otherwise analyzing weather data; or (e) searching for job listings.
  • What and how? We’ll create a “stream” to pull data from a website, isolate the parts we want using the “BeautifulSoup” library, and save that data to Postgres with Python. In addition, we will use Flask’s “render_template” function to alert the user to status, including trapped errors. Finally, we will build a Python application that scrapes website temperature data into our PostgreSQL table.

Prerequisites

  • Postgres: Experience with the basics of Postgres use. You may also want to use the free “pgAdmin” tool or another database admin tool like DBeaver.
  • Python: Beginner understanding of Python. We are using Visual Studio Code to write Python.
  • Optional but helpful: a tutorial on naming conventions explaining why we prefix our variables, column names, table names, etc. as you see in this article. For example, we name a string “t_variablename”, with the leading “t” marking it as a text (string) object, and we prefix table names with “tbl_” to clearly mark those objects as tables. See the short sketch after this list.
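
To make the convention concrete, here is a tiny, purely illustrative sketch (these variable names are hypothetical, not from the application we build below):

t_city = "Austin" # "t_" marks a text (string) variable
n_temp = 39.0 # "n_" marks a numeric (real number) variable
t_sql = "SELECT n_temp FROM tbl_temperatures" # "tbl_" marks a table name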

The plan

  • Step 1: Install the libraries we need.
  • Step 2: Use a web browser to view source of the site you want to scrape with Python.
  • Step 3: Open a “stream” to a website URL and get all the data into a variable we can work with locally.
  • Step 4: Isolate the data within that text by looking for specific tags that the data we want is nested in.
  • Step 5: Insert the data into a Postgres table we set up to receive and store this data.

Step 1: Install necessary Python libraries

# Install these first if needed, e.g.: pip install flask selenium beautifulsoup4 psycopg2-binary
from flask import Flask
from flask import render_template
from selenium import webdriver
from bs4 import BeautifulSoup
import psycopg2

Selenium is a tool for automating tests of web applications, though, important to note, not desktop or mobile applications. Here we use its webdriver to load the page in a real browser, so that dynamically rendered content is included in the source we scrape.

Step 2: View source to find data you want

  • First, we navigate to https://weather.com and choose our location.
  • Then we click “today” at the top, so we get a page showing temperature.
  • We copy the URL from the browser’s address bar (it is used in Step 3 below).
  • Then we notice the “feels like 39°” temperature and open “view source”.
  • Now we search through the source HTML for “feels like” to find the spot in the HTML referring to the data we want to scrape.
  • Copy that code for examination:
<span class="deg-feels" className="deg-feels">39<sup>°</sup></span>

From the above code, the important parts to remember for later (exercised in the short sketch below) are:

  • span
  • deg-feels
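
To see these two identifiers in action before touching the live site, here is a minimal sketch that runs BeautifulSoup’s “find” against just the snippet we copied above:

from bs4 import BeautifulSoup

# the snippet copied from weather.com's "view source" above
t_html = '<span class="deg-feels" className="deg-feels">39<sup>°</sup></span>'
soup = BeautifulSoup(t_html, "html.parser")
o_temp = soup.find('span', attrs={'class': 'deg-feels'})
print(o_temp.text) # prints "39°"; note the degree sign comes along with the text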

Step 3: Scrape data from URL

# NOTE: Selenium 4 and later pass the driver path via a Service object instead of a plain string.
myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
t_url = "https://weather.com/weather/today/l/7ebb344012f0c5ff88820d763da89ed94306a86c770fda50c983bf01a0f55c0d"
myWebDriver.get(t_url) # get() takes the URL itself, not an HTML anchor tag
t_content = myWebDriver.page_source
soup_in_bowl = BeautifulSoup(t_content, "html.parser") # name the parser explicitly

Analysis

  • Set the “webdriver” object to use Google’s Chrome browser.
  • Designate a URL for the website scrape.
  • Initiate the scrape of the website with Python.
  • Return all of the page contents into the “t_content” variable.
  • Use soup to put the content into a format that is easy to search.
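
If you are running this on a server without a display, Chrome can run headless. A minimal sketch, using the Selenium 3-style call shown above (Selenium 4 passes the driver path via a Service object instead):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

o_options = Options()
o_options.add_argument("--headless") # run Chrome without opening a window
o_options.add_argument("--no-sandbox") # often needed inside containers
myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", options=o_options)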

Step 4: Search for exact data wanted

o_temp = soup_in_bowl.find('span', attrs={'class':'deg-feels'})
n_temp = o_temp.text.rstrip('°') # .text returns "39°"; strip the degree sign so the value can be stored as a number

Analysis

  • Find the exact span and class for the website data we are seeking.
  • Get the value of the “text” attribute from “o_temp”, strip the trailing degree sign, and place the result in a variable called “n_temp” (see the caveat sketch below).
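
One caveat worth guarding against: “find” returns None when no matching tag exists, and site markup changes often. A short sketch, reusing “soup_in_bowl” from Step 3:

o_temp = soup_in_bowl.find('span', attrs={'class': 'deg-feels'})
if o_temp is None:
    # the class name has probably changed on the site; stop before .text raises an error
    raise ValueError("could not find the 'deg-feels' span; has the page layout changed?")
n_temp = float(o_temp.text.rstrip('°')) # "39°" becomes 39.0, a real number as the "n_" prefix implies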

Step 5: Insert data into Postgres

s = ""
s += "INSERT INTO tbl_temperatures"
s += "("
s += "n_temp"
s += ") VALUES ("
s += "(%n_temp)"
s += ")"
db_cursor.execute(s, [n_temp, n_temp])

Analysis

This is a parameterized SQL query telling Postgres, via Python, to INSERT a new record into the table named “tbl_temperatures”, placing the value held in our local “n_temp” variable into the column we have also named “n_temp”. Note that psycopg2 uses “%s” as its placeholder and substitutes the values passed to “execute”, which keeps the query safe from SQL injection. We named the column “n_temp” because it holds a real number data type, as opposed to an integer.
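
The code above assumes “tbl_temperatures” already exists. Here is a minimal, assumed schema that would satisfy this INSERT (the “id” column and the REAL column type are our guesses, not from the original application), built in the same string style used throughout:

s_ddl = ""
s_ddl += "CREATE TABLE IF NOT EXISTS tbl_temperatures"
s_ddl += "("
s_ddl += "id SERIAL PRIMARY KEY"
s_ddl += ", n_temp REAL NOT NULL"
s_ddl += ")"
db_cursor.execute(s_ddl)
db_conn.commit()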

Full Source Code in Python

from flask import Flask # needed for flask-dependent libraries below
from flask import render_template # to render the error and progress pages
from selenium import webdriver # to grab source from URL
from bs4 import BeautifulSoup # for searching through HTML
import psycopg2 # for database access

app = Flask(__name__)

# Set up Postgres database connection and cursor.
t_host = "PostgreSQL database host address" # either "localhost", a domain name, or an IP address.
t_port = "5432" # default postgres port
t_dbname = "database name"
t_user = "database user name"
t_pw = "password"
db_conn = psycopg2.connect(host=t_host, port=t_port, dbname=t_dbname, user=t_user, password=t_pw)
db_cursor = db_conn.cursor()

@app.route("/import_temp")

def import_temp():
    # set up your webdriver to use Chrome web browser
    myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
    # designate the URL we want to scrape
    #   NOTE: the long string of characters at the end of the URL below is a clue that
    #   this page may be dynamic; it may refer to a specific web session and/or day/time,
    #   so we can't necessarily count on it to be the same more than once.
    #   That means we may want to find another, more dependable source for our data.
    #   That said, whatever URL you use, the methodology in this lesson stands.
    t_url = "https://weather.com/weather/today/l/7ebb344012f0c5ff88820d763da89ed94306a86c770fda50c983bf01a0f55c0d"
    # initiate scrape of website page data
    myWebDriver.get(t_url) # get() takes the URL itself, not an HTML anchor tag
    # return entire page into "t_content"
    t_content = myWebDriver.page_source
    # use soup to make page content easily searchable
    soup_in_bowl = BeautifulSoup(t_content, "html.parser")
    # search for the UNIQUE span and class for the data we are looking for:
    o_temp = soup_in_bowl.find('span', attrs={'class':'deg-feels'})
    # from the resulting object, "o_temp", get the text and strip the degree sign so it is a plain number
    n_temp = o_temp.text.rstrip('°')

    # Build SQL for purpose of:
    #    saving the temperature data to a new row
    s = ""
    s += "INSERT INTO tbl_temperatures"
    s += "("
    s += "n_temp"
    s += ") VALUES ("
    s += "(%n_temp)"
    s += ")"

    # Trap errors from the database insert
    try:
        db_cursor.execute(s, [n_temp])
        db_conn.commit()
    except psycopg2.Error as e:
        t_msg = "Database error: " + str(e) + "\n SQL: " + s
        return render_template("error_page.html", t_msg = t_msg)

    # Clean up the cursor and connection objects
    # (a larger application would typically open and close the connection per request)
    db_cursor.close()
    db_conn.close()

    # Success!
    # Show a message to user.
    t_msg = "Successful scrape!"
    return render_template("progress.html", t_msg = t_msg)

# Run the Flask application
if __name__ == "__main__":
    app.run()
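
With the application running, browsing to the “/import_temp” route (for example, http://localhost:5000/import_temp on Flask’s default port) triggers the scrape and the insert.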

Conclusion

We learned here how to scrape a website into Postgres using Python. First, we installed the libraries needed; next, we navigated to weather.com in a web browser and searched that page’s “view source” to find a unique identifier, in this case a class name. Then we scraped all the content of that page into a local object. Using that local object, we used the “find” method from “BeautifulSoup” to isolate the exact data we wanted. Next, we inserted that data into a PostgreSQL table. Finally, we made the source script for the entire application available, heavily commented, at the end of this tutorial.
