Scrape a Website to Postgres with Python
Introduction
In this article we learn how to scrape data from a website into Postgres using Python code. Here’s why and how:
- Why? Why scrape data from a website using Python, with Postgres to store the data? Most reasons come down to automation. For example, you may wish to automate: (a) grabbing a stock or cryptocurrency price from a site that lists prices; (b) scouring dictionary.com and/or thesaurus.com to feed your AI; (c) gathering data for R&D and statistics; (d) averaging or otherwise analyzing weather data; (e) monitoring job listings; and perhaps other tasks.
- What and how? We’ll use functions and methods to open a “stream” that pulls data from a website, isolate the parts we want using the “BeautifulSoup” library, and save that data to Postgres with Python. In addition, we will use Flask’s “render_template” function to alert the user to status, including trapped errors. Finally, we will assemble these pieces into a Python application that scrapes website temperature data into our PostgreSQL table.
Prerequisites
- Postgres: Experience with the basics of Postgres use. You may also want to use the free “pgAdmin” tool or another database administration tool like DBeaver.
- Python: Beginner understanding of Python. We are using Visual Studio Code to write Python.
- Optional but helpful: a tutorial on naming conventions explaining why we prefix our variables, column names, table names, etc. as you see throughout this article. For example, the “t” at the beginning of “t_variablename” marks the object as text (a string), and the “tbl_” before table names clearly marks those objects as tables.
The plan
- Step 1: Install the libraries we need.
- Step 2: Use a web browser to view source of the site you want to scrape with Python.
- Step 3: Open a “stream” to a website URL and get all the data into a variable we can work with locally.
- Step 4: Isolate the data within that text by looking for specific tags that the data we want is nested in.
- Step 5: Insert the data into a Postgres table we set up to receive and store this data.
Step 1: Install necessary Python libraries
from flask import Flask
from flask import render_template
from selenium import webdriver
from bs4 import BeautifulSoup  # Beautiful Soup 4; "from BeautifulSoup import BeautifulSoup" is the obsolete version 3 import
import psycopg2
Selenium is a tool for automating web browsers, most commonly used for testing web applications; note that it does not automate desktop or mobile applications.
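These imports assume the libraries are already installed. If they aren’t, the usual pip installs would look something like this (psycopg2-binary is the simplest way to get psycopg2; the exact package set is our assumption, not spelled out in the original):

pip install Flask selenium beautifulsoup4 psycopg2-binary

You will also need a Chrome or Chromium browser plus a matching chromedriver binary for Selenium to drive.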
Step 2: View source to find data you want
- First, we navigate to https://weather.com and choose our location.
- Then we click “Today” at the top, so we get a page showing temperature.
- We copy the URL from the browser’s address bar (it is the long weather.com URL used in Step 3 below).
- Then we notice the “feels like 39°” temperature on the page and choose “view source”.
- Now we search through the source HTML for “feels like” to find the spot in the HTML containing the data we want to scrape.
- Copy that code for examination:
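The original snippet is not reproduced here; based on the tag and class we search for in Step 4, the relevant markup would look roughly like this (an illustrative reconstruction, not copied verbatim from weather.com):

<span class="deg-feels">39°</span>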
From that markup, the important parts to remember for later are:
- the tag: span
- the class: deg-feels
Step 3: Scrape data from URL
myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
t_url = "https://weather.com/weather/today/l/7ebb344012f0c5ff88820d763da89ed94306a86c770fda50c983bf01a0f55c0d"
myWebDriver.get(t_url)
t_content = myWebDriver.page_source
soup_in_bowl = BeautifulSoup(t_content, "html.parser")
Analysis
- Set the “webdriver” object to use Google’s Chrome browser.
- Designate a URL for the website scrape.
- Initiate the scrape of the website with Python by navigating to that URL.
- Store the full page source in the “t_content” variable.
- Use BeautifulSoup to parse that content into an object that is easy to search.
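One version note: passing the chromedriver path directly to webdriver.Chrome() matches older Selenium releases. In Selenium 4 the path moves into a Service object, and you can also run headless so no browser window opens. A minimal sketch, assuming Selenium 4 and the same chromedriver location:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

# run Chrome without opening a visible browser window
options = Options()
options.add_argument("--headless")

# point Selenium at the chromedriver binary
service = Service("/usr/lib/chromium-browser/chromedriver")
myWebDriver = webdriver.Chrome(service=service, options=options)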
Step 4: Search for exact data wanted
o_temp = soup_in_bowl.find('span', attrs={'class': 'deg-feels'})
n_temp = o_temp.text
Analysis
- Find the exact span and class for the website data we are seeking.
- Read the “text” attribute of “o_temp” and place its value in a variable called “n_temp”.
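Two caveats worth guarding against: find() returns None when nothing matches, and .text is a string that likely includes a degree symbol rather than a bare number. A defensive version might look like this (a sketch; the exact text format on the page is an assumption):

o_temp = soup_in_bowl.find('span', attrs={'class': 'deg-feels'})
if o_temp is None:
    # the page layout changed, or the page did not finish loading
    raise ValueError("could not find the 'deg-feels' span")

# strip whitespace and a trailing degree sign before converting to a number
n_temp = float(o_temp.text.strip().rstrip('°'))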
Step 5: Insert data into Postgres
1 2 3 4 5 6 7 8 | s = "" s += "INSERT INTO tbl_temperatures" s += "(" s += "n_temp" s += ") VALUES (" s += "(%n_temp)" s += ")" db_cursor.execute(s, [n_temp, n_temp]) |
Analysis
This is a parameterized SQL query telling Postgres to INSERT a new row into the table named “tbl_temperatures”, placing the value of our local “n_temp” variable into the table column we have also named “n_temp”. Note that psycopg2 uses “%s” as its placeholder and substitutes the values passed to execute(), so the data never has to be concatenated into the SQL string. We named the column “n_temp” because it holds a real number data type, as opposed to an integer.
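For the INSERT to succeed, the target table has to exist. The schema isn’t shown in this article, but given the column name and the real-number note above, a minimal one-time setup might be (the exact schema is an assumption):

# one-time table setup, using the same psycopg2 cursor and connection
db_cursor.execute("CREATE TABLE IF NOT EXISTS tbl_temperatures (n_temp real)")
db_conn.commit()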
Full Source Code in Python
from flask import Flask                # needed for the Flask-dependent pieces below
from flask import render_template      # to render the error and progress pages
from selenium import webdriver         # to grab page source from a URL
from bs4 import BeautifulSoup          # for searching through HTML
import psycopg2                        # for database access

app = Flask(__name__)

# Set up the Postgres database connection and cursor.
t_host = "PostgreSQL database host address"  # either "localhost", a domain name, or an IP address
t_port = "5432"  # default Postgres port
t_dbname = "database name"
t_user = "database user name"
t_pw = "password"
db_conn = psycopg2.connect(host=t_host, port=t_port, dbname=t_dbname, user=t_user, password=t_pw)
db_cursor = db_conn.cursor()

@app.route("/import_temp")
def import_temp():
    # set up the webdriver to use the Chrome web browser
    myWebDriver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")

    # designate the URL we want to scrape
    # NOTE: the long string of characters at the end of this URL is a clue that
    # the page may be highly dynamic -- perhaps tied to a specific web session
    # and/or day/time -- so we can't necessarily count on it being the same more
    # than once. That means we may want to find a more dependable source for our
    # data. Whatever URL you use, though, the methodology in this lesson stands.
    t_url = "https://weather.com/weather/today/l/7ebb344012f0c5ff88820d763da89ed94306a86c770fda50c983bf01a0f55c0d"

    # initiate the scrape of the website page
    myWebDriver.get(t_url)

    # return the entire page into "t_content"
    t_content = myWebDriver.page_source

    # use soup to make the page content easily searchable
    soup_in_bowl = BeautifulSoup(t_content, "html.parser")

    # search for the UNIQUE span and class of the data we are looking for
    o_temp = soup_in_bowl.find('span', attrs={'class': 'deg-feels'})

    # from the resulting object, "o_temp", get the text attribute and assign it to "n_temp"
    n_temp = o_temp.text

    # Build the SQL that saves the temperature data to a new row
    s = ""
    s += "INSERT INTO tbl_temperatures"
    s += " ("
    s += "n_temp"
    s += ") VALUES ("
    s += "%s"
    s += ")"

    # Trap errors from the INSERT
    try:
        db_cursor.execute(s, [n_temp])
        db_conn.commit()
    except psycopg2.Error as e:
        t_msg = "Database error: " + str(e) + "\n SQL: " + s
        return render_template("error_page.html", t_msg=t_msg)

    # Success! Show a message to the user.
    t_msg = "Successful scrape!"
    return render_template("progress.html", t_msg=t_msg)

# The cursor and connection stay open for the life of the app;
# close them with db_cursor.close() and db_conn.close() when shutting down.
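The listing above defines the Flask app but never starts it. A standard way to run it is the guard below (our addition; it is not in the original listing):

if __name__ == "__main__":
    app.run(debug=True)  # then visit http://localhost:5000/import_temp to trigger the scrape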
Conclusion
We learned here how to scrape a website into Postgres using Python. First, we installed the libraries we needed; next, we navigated to weather.com in a web browser and searched that page’s “view source” to find a unique identifier, in this case a class name. Then we scraped all of that page’s content into a local object. Using that local object, we used the “find” method from “BeautifulSoup” to isolate the exact data we wanted. Next, we copied that data into a PostgreSQL table. Finally, we made the source script for the entire application, heavily commented, available at the end of this tutorial.