How to Create a Web Scraper with Mongoose, NodeJS, Axios, and Cheerio - Part 1

Written by Data Pilot

May 06, 2019

MongoDB
Mongoose

Introduction

In this multi-part series we will create a web scraper from scratch. It will store the data it has scraped using MongoDB. The backend will be written in Javascript running on NodeJS. There are several libraries we will use on the backend to make our lives easier including Express, Axios, and Cheerio. On the front-end we will use Bootstrap for styling and JQuery to send requests to the backend. Finally we will deploy the site using Heroku. In this first part of the series we will get a basic version of our backend running.

Prerequisites

You should have MongoDB installed.
Some command line, Javascript, and npm experience is recommended.

Step 1 Create a starter NodeJS app

We are starting from scratch so we will begin with creating a simple NodeJS app and confirm that it’s running. We will add functionality and a front-end as we progress but it’s better to approach a bigger project like this step by step.

First create a directory using the terminal or through your Operating System’s user interface. We’ll show how to do it with the command line:

1	mkdir webscraper

Now we will initialize this project folder using npm so we can keep track of it’s depenedencies because we will be using a fair amount of libraries. To do this run the npm init command:

1

npm init

You can accept all the defaults except we’ve chosen to use app.js as our entry point:

1	entry point: (index.js) app.js

Next we will create the javascript file that will receive and respond to requests. We’ll call it app.js. Create this file using your OS or through the terminal like this:

1	touch app.js

Now we’ll create a basic app and make sure that we can get it running. We start with this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

var express = require("express");

var PORT = process.env.PORT || 3000;

// Initialize Express
var app = express();

// Configure middleware
// Parse request body as JSON
app.use(express.urlencoded({ extended: true }));
app.use(express.json());

// Make public a static folder
app.use(express.static("public"));

// Start the server
app.listen(PORT, function() {
console.log("App listening on port " + PORT);
});

This is a very basic express app and since we require the express library we’ll need to install it with npm:

1	npm install express

Let’s verify that we can run our NodeJS app and that we see our console message that it is listening on port 3000:

1 2	$node app.js App listening on port 3000

Step 2 Add Axios and Cheerio for the web-scraping functionality

Next we’ll be installing two libraries from npm that will help us with the web-scraping functionality. Axios allows you to make requests to a webpage and return the html and Cheerio is a subset of jquery that allow us to parse through the scraped html page for the components we are looking for.

1	npm install axios cheerio

Step 3 Add our first route

Now we will use express for what it does best, creating routes. Our first route will be to the homepage so that when a browser makes a request to our homepage it will return an index.html page that we will create shortly. Let’s create the route first:

This is how we add a new route to a index.html file in our public folder:

1
2
3
4

// Simple index route
app.get("/", function(req, res) {
res.sendFile(path.join(__dirname + "./public/index.html"));
});

So our app.js file is now:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

Now let’s create the index.html file for our homepage. First we’ll need the public directory:

1	mkdir public

Then we’ll need to change directories (cd) into it:

1

cd public

Then we’ll need to create the index.html file:

1	touch index.html

Finally use your favorite IDE (integrated development environment) to put this html into that file:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77

<!doctype html>
<html lang="en">
<head>

<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">


<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">

<title>Idiom Scraper</title>
</head>
<body>

<div class="container">

<h1>Idiom Scraper</h1>

<div class="row mt-5">
<div class="col">
<form>
<div class="form-group">
<label for="scrapeTerm">Scrape Idioms by Term</label>
<input type="text" class="form-control" id="scrapeTerm" aria-describedby="scrapeHelp" placeholder="Enter single word search term">
<small id="scrapeHelp" class="form-text text-muted">Enter a single word search term and we'll scrape an idiom dictionary for idioms that contain your search term.</small>
</div>
<button type="submit" class="btn btn-primary" id="scrapeButton">Scrape</button>
</form>
</div>
</div>

<div class="row mt-5">
<div class="col">
<div class="form-group">
<label for="searchTerm">Search scraped idioms stored with MongoDB.</label>
<input type="text" class="form-control" id="searchTerm" aria-describedby="searchHelp" placeholder="Enter single word search term">
<small id="searchHelp" class="form-text text-muted">Enter a single word search term and we'll search our MongoDB for idioms we've already scraped.</small>
</div>
<button type="submit" class="btn btn-primary" id="searchButton">Search Saved Idioms</button>
</div>
</div>

<div class="row mt-5">
<div class="col">
<label for="searchTerm">Search scraped idioms stored with MongoDB.</label>
</br>
<button type="submit" class="btn btn-primary" id="getAllButton">Get All Saved Idioms</button>
</div>
</div>

<h4 class="mt-5">Results:</h4>
<div class="mt-5" id="tableDiv">
Insert content here.
</div>

<div class="mt-5" id="log">
Logs go here.
</div>

</div>


<script
src="https://code.jquery.com/jquery-3.4.0.min.js"
integrity="sha256-BJeo0qm959uMBGb65z40ejJYGSgR7REI4+CW1fNKwOg="
crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/js/bootstrap.min.js" integrity="sha384-JjSmVgyd0p3pXB1rRibZUAYoIIy6OrQ6VrjIEaFf/nJGzIxFDsf4x0xIM+B07jRM" crossorigin="anonymous"></script>
<script src="index.js"></script>

</body>
</html>

As you can see there is a lot of functionality we are not using in our html but at least you can see where we are headed.

Now that we have a get route for home that serves index.html we should be able to hit http://localhost:3000/ and see our page that was built with boostrap.

Conclusion of Part 1

In this first part of the tutorial we built the basis for our webscraper app. We have an application that serves a homepage and we have all the libraries we need to finish this project. So please proceed to Part 2 of this tutorial where we build the backend functionality to scrape a web page. In Part 3 we’ll tie all the front-end and back-end together with express routes.