How to Create a Web Scraper with Mongoose, NodeJS, Axios, and Cheerio - Part 3


Introduction

In this third part of the tutorial on building a web scraper from scratch, we will concentrate on saving the data we’ve scraped into a MongoDB database. We will use the npm package Mongoose, a library that takes a model-based approach to creating data.

If you’re just joining us, this is a multi-part series in which we’re building a web scraper from scratch. It stores the data it has scraped in MongoDB. The backend is written in JavaScript running on NodeJS. We use several libraries on the backend to make our lives easier, including Express, Axios, and Cheerio. On the front end we use Bootstrap for styling and jQuery to send requests to the backend. Finally, we will deploy the site using Heroku.

In the first part of the series we got a basic app running that served our homepage using NodeJS, Express, and Bootstrap. In the second part, we built a function that does the web scraping using Axios and Cheerio. Now we’ll show you how to store the data you’ve scraped in a MongoDB database.

Install and require Mongoose

You’ll need a library to interact with Mongo from NodeJS. There are a few out there, including mongodb and mongojs, but we’ll be using one called Mongoose, which takes a schema/model-based approach that we feel is pretty intuitive. It takes a little more work to get Mongoose working, but we feel it pays off in the end.

First let’s install the library with npm:

npm install mongoose

Next we’ll require it in our app.js:

var mongoose = require("mongoose"); // Require Mongoose to store idioms in database

Connect to Mongo

Then we’ll use this code to connect to Mongo with a new database called idioms_db:

var MONGODB_URI = process.env.MONGODB_URI || "mongodb://localhost/idioms_db";
mongoose.connect(MONGODB_URI);

Note: Don’t get thrown by the process.env code; it will only come into play when we deploy our app.
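
To make the fallback behavior concrete, here is a minimal sketch of the same `||` logic pulled into a function. The helper name `resolveMongoUri` and the placeholder URI are our own assumptions for illustration; they are not part of the app.

```javascript
// Hypothetical helper (not in app.js): pick the connection string from an
// environment object, falling back to the local idioms_db database.
function resolveMongoUri(env) {
    return env.MONGODB_URI || "mongodb://localhost/idioms_db";
}

// Locally MONGODB_URI is normally unset, so the fallback is used.
console.log(resolveMongoUri({}));

// When a host sets MONGODB_URI, that value takes precedence.
// The URI below is a made-up placeholder, not a real server.
console.log(resolveMongoUri({ MONGODB_URI: "mongodb://example.invalid/idioms_db" }));
```

In app.js this is equivalent to passing process.env to the helper, which is why the local database is used until the variable is set at deploy time.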

Run your app again and make sure you don’t get any errors.

Create a Schema

Here comes the hard part about Mongoose. When we want to save data through Mongoose we’ll have to do so through a Model which is created from a Schema. A schema just describes what kind of data we are storing. Let’s create a models folder to put our schema in:

mkdir models

Now let’s create a file called idioms.js inside that folder to hold the schema for the idioms:

touch models/idioms.js

In idioms.js we’ll first need to require Mongoose. Then we need to think about what data we want to store as an idiom. We’ve decided that we just want a string of the idiom and a link to its page. The string should be required but the link should not be. We also mark the idiom as unique so the same idiom isn’t stored twice. Here’s how we create that schema in Mongoose:

File: webscraperDemo/models/idioms.js

// Require mongoose library
var mongoose = require("mongoose");

// Get the schema constructor
var Schema = mongoose.Schema;

// Use the Schema constructor to create a new IdiomSchema object
var IdiomSchema = new Schema({
  idiom: {
    type: String,
    required: true,
    unique: true
  },
  link: {
    type: String,
    required: false
  }
});

// Create model from schema using model method
var Idiom = mongoose.model("Idiom", IdiomSchema);

// Export the Idiom model
module.exports = Idiom;

Notice at the end of the code how we create a Model from our schema and then export that Model so we can use it in app.js.
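
If the schema rules feel abstract, the sketch below restates them in plain JavaScript. This is not Mongoose code and not how Mongoose implements validation; `checkIdiom` is a hypothetical helper and the sample link is a made-up placeholder. It only mirrors the two rules we wrote down: idiom must be present, and link may be left out.

```javascript
// Plain-JavaScript illustration of the schema rules. NOT Mongoose code.
// checkIdiom is a hypothetical helper used only for this explanation.
function checkIdiom(candidate) {
    var errors = [];

    // Mirrors `required: true` on idiom: a missing or empty string is rejected
    if (candidate.idiom == null || candidate.idiom === "") {
        errors.push("idiom is required");
    }

    // Mirrors `required: false` on link: nothing to check when it is absent

    return errors;
}

// An idiom without a link passes
console.log(checkIdiom({ idiom: "see the light" }));

// A link without an idiom does not
console.log(checkIdiom({ link: "https://example.invalid/see-the-light" }));
```

The unique rule is left out of the sketch on purpose: it is enforced by an index in the database rather than by a check on a single object.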

Create a route to save idioms

Now to prove that we are actually interacting with the database we will create a route that will scrape for idioms by a term and save them to our database. This is a long chunk of code so we’ll post it and then dissect it afterwards:

File: /webscraperDemo/app.js

// Route to go out and scrape for idioms that have an entered string
app.post("/idioms/scrape/:searchTerm", function(req, res) {

    scrape(req.params.searchTerm)
    .then(function(foundIdioms) {
        console.log("scraped:");
        console.log(foundIdioms);
        // Save scraped Idioms
        foundIdioms.forEach(function(eachIdiom) {
            Idiom.create(eachIdiom)
            .then(function(savedIdiom) {
                // If saved successfully, print the new Idiom document to the console
                console.log(savedIdiom);
            })
            .catch(function(err) {
                // If an error occurs, log the error message
                console.log(err.message);
            });
        });

        // Respond with the scraped idioms without waiting for the saves to finish
        res.json(foundIdioms);
    })
    .catch(function(err) {
        // An Error object serializes to {} in JSON, so send its message with a 500 status
        res.status(500).json({ error: err.message });
    });

});

This code creates a route so that your front end can send a POST request to http://localhost:3000/idioms/scrape/light. The route uses our scrape() function from earlier to scrape the idioms and then saves them in the database.
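
Two things about this route are worth knowing. It responds before the saves have finished, and because idiom is unique, scraping the same term twice will log duplicate key errors. The sketch below shows one possible way to wait for every save and count the outcomes using Promise.allSettled. `saveAll` and `makeFakeCreate` are hypothetical names of ours; the fake stands in for the database so the example runs on its own, and we are assuming the duplicate error carries MongoDB’s duplicate key code 11000.

```javascript
// Hypothetical helper: try to save every scraped idiom and report what happened.
// `createFn` stands in for a function such as Idiom.create.
function saveAll(foundIdioms, createFn) {
    var attempts = foundIdioms.map(function (eachIdiom) {
        // Start from a resolved promise so a synchronous throw is captured too
        return Promise.resolve().then(function () {
            return createFn(eachIdiom);
        });
    });

    // allSettled waits for every attempt, whether it succeeded or failed
    return Promise.allSettled(attempts).then(function (results) {
        var summary = { saved: 0, duplicates: 0, failed: 0 };
        results.forEach(function (result) {
            if (result.status === "fulfilled") {
                summary.saved += 1;
            } else if (result.reason && result.reason.code === 11000) {
                // 11000 is MongoDB's duplicate key error code
                summary.duplicates += 1;
            } else {
                summary.failed += 1;
            }
        });
        return summary;
    });
}

// Stand-in for the database: rejects a repeated idiom like a unique index would
function makeFakeCreate() {
    var seen = Object.create(null);
    return function (doc) {
        if (seen[doc.idiom]) {
            var err = new Error("duplicate key");
            err.code = 11000;
            return Promise.reject(err);
        }
        seen[doc.idiom] = true;
        return Promise.resolve(doc);
    };
}

saveAll(
    [{ idiom: "see the light" }, { idiom: "light a fire" }, { idiom: "see the light" }],
    makeFakeCreate()
).then(function (summary) {
    console.log(summary); // { saved: 2, duplicates: 1, failed: 0 }
});
```

In the real route you could pass a function that calls Idiom.create as `createFn` and send the response once the summary arrives. We keep the simpler fire-and-forget version in this tutorial.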

In order for this to work, a couple more things need to happen. You’ll notice that the index.html from earlier links to an index.js that we never mentioned. We’re going to use it here to make requests to our backend. We won’t cover the details now; just know that this is its basic purpose. Put this code in index.js in your public folder.

$(document).ready(function () {

    /* Create variables for the needed DOM elements */
    var $scrapeTerm = $("#scrapeTerm");
    var $scrapeButton = $("#scrapeButton");
    var $searchTerm = $("#searchTerm");
    var $searchButton = $("#searchButton");
    var $getAllButton = $("#getAllButton");
    var $tableDiv = $("#tableDiv");

    /* Create API object to make AJAX calls */
    var searchAPI = {

        getAll: function () {
            return $.ajax({
                url: "/idioms",
                type: "GET"
            });
        },

        searchTerm: function (term) {
            return $.ajax({
                url: "/idioms/search/" + term,
                type: "GET"
            });
        },

        scrapeTerm: function (term) {
            return $.ajax({
                url: "/idioms/scrape/" + term,
                type: "POST"
            });
        }

    };


    /* Functions called by Event Listeners */
    var handleScrapeSubmit = function (event) {
        event.preventDefault();

        var searchTerm = $scrapeTerm.val().trim();

        searchAPI.scrapeTerm(searchTerm).then(function (resp) {

            var data = prepareResponseForTable(resp);
            makeTable($tableDiv, data);
        });

        // Clear out scrape field
        $scrapeTerm.val("");
    };


    var handleSearchSubmit = function (event) {

        var searchTerm = $searchTerm.val().trim();

        searchAPI.searchTerm(searchTerm).then(function (resp) {

            var data = prepareResponseForTable(resp);
            makeTable($tableDiv, data);
        });

        // Clear out search field
        $searchTerm.val("");
    };

    var handleGetAll = function (event) {

        searchAPI.getAll()
        .then(function(resp) {
            var data = prepareResponseForTable(resp);
            makeTable($tableDiv, data);

        })
        .catch(function(err) {
            console.log(err);
        });

    };

   
    /* Utilities */
    //  Utility to make a table from a set of data (an array of arrays)
    //  https://www.htmlgoodies.com/beyond/css/working_w_tables_using_jquery.html
    function makeTable(container, data) {
        var table = $("<table/>").addClass('table table-striped');
        $.each(data, function (rowIndex, r) {

            var row = $("<tr/>");
            $.each(r, function (colIndex, c) {
                row.append($("<t" + (rowIndex == 0 ? "h" : "d") + "/>").text(c));
            });
            table.append(row);
        });
        return container.html(table);
    }

    //  Utility to take a response filled with idioms and make it into an array of arrays that is in a format ready for our "makeTable" utility.
    function prepareResponseForTable(response) {
        var data = [];
        data[0] = ["idiom"]; // Row header ( Add more columns if needed )

        response.forEach(function (eachIdiom) {
            //   data.push([eachIdiom._id, eachIdiom.idiom, eachIdiom.link, eachIdiom.__v]);
            data.push([eachIdiom.idiom]);
        });

        return data; // Returns an array of arrays for "makeTable"
    }



    /* Event Listeners */
    $scrapeButton.on("click", handleScrapeSubmit);
    $searchButton.on("click", handleSearchSubmit);
    $getAllButton.on("click", handleGetAll);
});
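
One thing to watch in the code above is that the search term is concatenated straight into the URL, so a term containing a space or a slash could produce a path the route does not match as intended. A possible refinement, sketched here with a hypothetical helper named `buildScrapeUrl`, is to escape the term with encodeURIComponent first.

```javascript
// Hypothetical helper for index.js: build the scrape URL with the term escaped
function buildScrapeUrl(term) {
    return "/idioms/scrape/" + encodeURIComponent(term.trim());
}

console.log(buildScrapeUrl("light"));           // /idioms/scrape/light
console.log(buildScrapeUrl(" break the ice ")); // /idioms/scrape/break%20the%20ice
console.log(buildScrapeUrl("and/or"));          // /idioms/scrape/and%2For
```

Express decodes route parameters, so req.params.searchTerm should still arrive on the backend as the original text.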

In our app.js file we’ll also need to require Mongoose and import the Model:

var mongoose = require("mongoose"); // Require Mongoose to store idioms in database


// Requiring the `Idiom` model for accessing the `idioms` collection
var Idiom = require("./models/idioms.js");

If you’re lost on all the changes, we’ll share the finished repository at the end of this tutorial, so just try to follow along.

Now let’s try to hit that route from the front end by entering “light” into the “Scrape Idioms by Term” field and pressing “Scrape”. You should see a table of idioms populate at the bottom. That means the scraping is working! Creating that table is handled by the index.js. In the next section when we flesh out the rest of the routes we’ll verify data is being stored by creating a route that fetches all the idioms in the database.
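
To see what index.js hands to the table builder, here is a small sketch of the array-of-arrays shape it works with. `toTableRows` is a hypothetical variant of prepareResponseForTable that adds a link column, and the sample response is made up for illustration.

```javascript
// Hypothetical variant of prepareResponseForTable with a second column
function toTableRows(response) {
    var rows = [["idiom", "link"]]; // first row becomes the table header

    response.forEach(function (eachIdiom) {
        // link is optional in our schema, so fall back to an empty cell
        rows.push([eachIdiom.idiom, eachIdiom.link || ""]);
    });

    return rows;
}

// Made-up sample response, shaped like the objects our route returns
var sample = [
    { idiom: "see the light", link: "https://example.invalid/see-the-light" },
    { idiom: "light a fire" }
];

console.log(toTableRows(sample));
```

Each inner array becomes one table row, with the first one rendered as header cells.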


Conclusion of Part 3

In this third part of the tutorial we connected with Mongo and created a schema and a model that allowed us to create data in our database. In Part 4 we’ll create more routes that add functionality like retrieving all the idioms in our database and deleting items.

Check out Part 1

Check out Part 2

Check out Part 3

Check out Part 4

Check out Part 5
