How to Create a Web Scraper with Mongoose, NodeJS, Axios, and Cheerio - Part 2

Introduction

In this second part of the tutorial on how to build a web scraper from scratch, we will concentrate on doing the actual web scraping. We will use the npm packages Axios and Cheerio to make the job easier on us.

If you’re just joining us, this is a multi-part series in which we create a web scraper from scratch. It will store the data it has scraped using MongoDB. The backend will be written in JavaScript running on Node.js. We will use several libraries on the backend to make our lives easier, including Express, Axios, and Cheerio. On the front end we will use Bootstrap for styling and jQuery to send requests to the backend. Finally, we will deploy the site using Heroku.

In the first part of the series we got a basic app running that served our homepage using Node.js, Express, and Bootstrap, and now we’re jumping in to do the web scraping. Let’s get started.

Start Scraping with Axios and Cheerio

To start scraping, we first need the HTML of the page, which is exactly what Axios fetches for us. The page we’ll scrape provides idioms of the English language. If you look up the word “light”, it shows all the idioms that contain that word, like:

URL: https://idioms.thefreedictionary.com/light

  • a light touch
  • a light-bulb moment
  • all sweetness and light
  • at first light
  • as light as a feather

First we’ll start off by requiring the axios and cheerio library we need in our app.js:

var axios = require("axios");
var cheerio = require("cheerio");

Now here’s a test function we wrote in app.js that retrieves the idioms for ‘light’ and parses the HTML using Cheerio to pull out each idiom (in plain text) and its associated link.

/* Test Axios with Cheerio */
axios.get("https://idioms.thefreedictionary.com/light").then(function(response) {
   
    var $ = cheerio.load(response.data);

    var idioms = [];
    var links = [];
    var listItems = $("ul.idiKw li a").each(function(i, elem) {
        idioms.push($(elem).text());
        links.push("https://thefreedictionary.com/" + $(elem).attr("href"));
    });

    console.log(idioms);
    console.log(links);
});

You can see from the console output that it is working:

App listening on port 3000
[ '(all) sweetness and light',
  '(as) light as a feather',
  '(as) light as air',
  'a heavy purse makes a light heart',
  'a leading light',
  ... ( more )

You can see the axios.get() call is pretty straightforward: you provide a URL and a callback function to execute on success. Then we hand the HTML data off to Cheerio so we can parse it:

    var $ = cheerio.load(response.data);

Then we use Cheerio, much like jQuery, to find the idioms by element type, push them onto arrays, and output them to the console. If you aren’t familiar with jQuery, the Cheerio code might be a little hard to follow; a full explanation is beyond the scope of this article, but there are plenty of jQuery tutorials you can use to get the gist of what’s happening. When we inspect the page for ‘light’, we find that the idioms are all in an unordered list with the class idiKw, and each idiom is a link inside an <li>. With this knowledge we can use Cheerio to pull out all these elements, exactly as we would with jQuery.

So now we’ve proven we can scrape a web page, but we’d like to generalize this functionality so it works for any term, not just ‘light’. We restructured the code into a function called scrape:

var scrape = function(searchTerm) {
    var idioms = [];
    return axios.get("https://idioms.thefreedictionary.com/" + searchTerm).then(function(response) {

        var $ = cheerio.load(response.data);

        $("ul.idiKw li a").each(function(i, elem) {
            idioms.push({
                idiom: $(elem).text(),
                link: "https://thefreedictionary.com/" + $(elem).attr("href")
            });
        });

        return idioms;
    });
};

This is much more versatile and will allow us to handle any search term the user gives us.
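Note that scrape() doesn’t log the idioms itself: it returns the promise produced by axios.get(), which eventually resolves with the idioms array. A sketch of the calling pattern, using a stand-in that resolves immediately with made-up data so no network request is needed (scrapeStub and its contents are assumptions for illustration; the real scrape() resolves the same way):

```javascript
// Stand-in for scrape(): resolves with canned data instead of
// fetching and parsing a live page.
function scrapeStub(searchTerm) {
    return Promise.resolve([
        { idiom: "at first " + searchTerm,
          link: "https://thefreedictionary.com/at+first+" + searchTerm }
    ]);
}

var results = null;
var failure = null;

// The same .then()/.catch() pattern works with the real scrape():
scrapeStub("light")
    .then(function(idioms) {
        results = idioms;
        console.log(idioms[0].idiom); // "at first light"
    })
    .catch(function(error) {
        failure = error;
        console.log("Scrape failed: " + error.message);
    });
```

With the real scrape(), the .catch() branch fires if the request fails (bad term, network error, and so on), which is one reason returning the promise, rather than logging inside the function, keeps the scraper reusable.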

Conclusion of Part 2

In this second part of the tutorial we built a function using Axios and Cheerio that handles our web scraping in a reusable way. Axios and Cheerio did most of the heavy lifting; all we needed was an examination of the HTML structure of our page to know how to pull out the elements we wanted. In the next part of this tutorial (Part 3) we’ll hook up a MongoDB database so we can store the results we’ve scraped.

Check out Part 1

Check out Part 2

Check out Part 3

Check out Part 4

Check out Part 5
