Data Processing using MongoDB MapReduce

Introduction

MapReduce gets a lot of attention as a way for developers to process large data sets into useful aggregated results. In this tutorial, we will walk through an example using the mapReduce() function on a MongoDB collection. MapReduce is broken down into two phases:

  • The map phase filters the data and transforms it so that it can be processed.
  • The reduce phase performs the aggregation, or analysis, over that data.

Every mapReduce() call in MongoDB takes the documents within a single collection as its input, optionally narrowed by a query. The map phase always runs before the reduce phase. In the map phase, the input documents are transformed into key/value pairs. In the reduce phase, the values emitted for each key are combined (reduced) into a single aggregated value, and the results are stored in a collection or returned inline. Let’s see how it works!
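The two phases can be mimicked in plain JavaScript, outside MongoDB, to make the data flow concrete. This is only a sketch of the idea, not how the server implements mapReduce(): runMapReduce and the sample orders data are hypothetical names made up for illustration, and emit is passed in as an argument here, whereas MongoDB provides it to the map function directly.

```javascript
// Plain-JavaScript sketch of the map -> group -> reduce flow.
// runMapReduce and the sample data are illustrative only; they are not MongoDB APIs.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();

  // Map phase: call map once per document with `this` bound to the document.
  // Each emit(key, value) call appends the value to that key's list.
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) {
    map.call(doc, emit);
  }

  // Reduce phase: collapse each key's list of values into a single value.
  // A key with only one value is passed through without calling reduce.
  const results = [];
  for (const [key, values] of groups) {
    const value = values.length === 1 ? values[0] : reduce(key, values);
    results.push({ _id: key, value: value });
  }
  return results;
}

// Hypothetical input: three orders from two customers.
const orders = [
  { customer: "a", amount: 10 },
  { customer: "b", amount: 5 },
  { customer: "a", amount: 7 },
];

const totals = runMapReduce(
  orders,
  function (emit) { emit(this.customer, this.amount); },
  function (key, values) { return values.reduce((sum, v) => sum + v, 0); }
);

console.log(totals); // customer "a" totals 17, customer "b" totals 5
```

Each result has the same { _id, value } shape that MongoDB writes to its output collection.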

Prerequisites

For the following examples, we will be using the MongoDB shell to execute commands.

  • You should have MongoDB installed.
  • The Mongo daemon should be running (execute mongod in a separate terminal window).
  • You should have access to the MongoDB shell (execute mongo).
  • Previous shell experience with MongoDB is recommended but not required.

MapReduce Syntax

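   > db.collection.mapReduce(
       // map function: called once per document, with `this` bound to the document
       function() { emit(key, value); },

       // reduce function: combines all the values emitted for one key
       function(key, values) { return <reduced value>; },

       // options
       {
            out: <collection>,
            query: <document>, // optional
            sort: <document>,  // optional
            limit: <number>    // optional
       }
     )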

This call shape is the same across MongoDB mapReduce examples; only the optional arguments vary. Let’s look at what’s available and what they mean:

  1. out specifies where the results go: a named collection, or inline in the command result.
  2. query specifies the selection criteria, written with the usual query operators, that determine which documents are passed to the map function.
  3. sort specifies the order of the input documents.
  4. limit sets the maximum number of documents passed to the map function.

While there are more options, these are the most commonly used.
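To see how the three input options narrow what the map function receives, here is a plain-JavaScript sketch. It assumes the options apply in the order query, sort, limit, and it handles only simple equality queries and single-field sorts. selectInput and the sample documents are hypothetical and stand in for work MongoDB does internally.

```javascript
// Plain-JavaScript sketch of how query, sort, and limit narrow the input.
// selectInput is a hypothetical helper; it supports only equality queries
// and single-field sorts, far less than MongoDB itself.
function selectInput(docs, options) {
  // query: keep documents whose fields equal every value in the query.
  const query = options.query || {};
  let input = docs.filter((doc) =>
    Object.keys(query).every((field) => doc[field] === query[field])
  );

  // sort: 1 means ascending, -1 means descending.
  if (options.sort) {
    const [field, direction] = Object.entries(options.sort)[0];
    input = [...input].sort((a, b) => {
      if (a[field] < b[field]) return -direction;
      if (a[field] > b[field]) return direction;
      return 0;
    });
  }

  // limit: cap the number of documents handed to the map function.
  if (options.limit) input = input.slice(0, options.limit);
  return input;
}

const sampleDocs = [
  { item: "pen", status: "open", qty: 3 },
  { item: "ink", status: "closed", qty: 9 },
  { item: "pad", status: "open", qty: 1 },
  { item: "cap", status: "open", qty: 7 },
];

// Keep only open items, order them by qty ascending, and pass the first two on.
const mapInput = selectInput(sampleDocs, {
  query: { status: "open" },
  sort: { qty: 1 },
  limit: 2,
});

console.log(mapInput.map((doc) => doc.item)); // [ 'pad', 'pen' ]
```

Only the two documents that survive all three steps would reach the map function.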

Executing the MapReduce Function

Now that we understand how mapReduce works and have some basic knowledge of the syntax, let’s see it in action! Make sure that the Mongo daemon is running in another terminal, then set up your database. Here we are using a small students database:

> use school
switched to db school

> db.students.insert(
    [
    {student_id:"s1",name:"John", class: "science", grade: 98},
    {student_id:"s2",name:"John", class: "math", grade: 87} ,
    {student_id:"s3",name:"Chris",class: "art", grade: 92} ,
    {student_id:"s4",name:"Chris",class: "math", grade: 87} ,
    {student_id:"s5",name:"Chris",class: "science", grade: 91} ,
    {student_id:"s6",name:"Brian",class: "science", grade: 95}
    ] )
BulkWriteResult({
    "writeErrors" : [ ],
    "writeConcernErrors" : [ ],
    "nInserted" : 6,
    "nUpserted" : 0,
    "nMatched" : 0,
    "nModified" : 0,
    "nRemoved" : 0,
    "upserted" : [ ]
})

> db.students.find()
{ "_id" : ObjectId("5d780f31e8afa59914279be3"), "student_id" : "s1", "name" : "John", "class" : "science", "grade" : 98 }
{ "_id" : ObjectId("5d780f31e8afa59914279be4"), "student_id" : "s2", "name" : "John", "class" : "math", "grade" : 87 }
{ "_id" : ObjectId("5d780f31e8afa59914279be5"), "student_id" : "s3", "name" : "Chris", "class" : "art", "grade" : 92 }
{ "_id" : ObjectId("5d780f31e8afa59914279be6"), "student_id" : "s4", "name" : "Chris", "class" : "math", "grade" : 87 }
{ "_id" : ObjectId("5d780f31e8afa59914279be7"), "student_id" : "s5", "name" : "Chris", "class" : "science", "grade" : 91 }
{ "_id" : ObjectId("5d780f31e8afa59914279be8"), "student_id" : "s6", "name" : "Brian", "class" : "science", "grade" : 95 }

Define the map() function

In our example, the map function produces key/value pairs where the key is the student’s name and the value is the grade. MongoDB calls the function once per document, with this bound to the current document, and each call emits one key/value pair. Enter this in the Mongo shell:

> var map = function() {emit(this.name,this.grade);};

Define the reduce() function

MongoDB groups the emitted values by key and passes each key, together with an array of its values, to the reduce function, which returns the sum of those values. In this specific example, it receives each student’s name and sums the grades from all of that student’s classes. Enter this in the Mongo shell:

> var reduce = function(name,grades) {return Array.sum(grades);};

Note: You won’t see any validation responses after entering these two functions in the shell.
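One property of the reduce function is worth checking before running it: MongoDB may call reduce more than once for the same key, feeding an earlier result back in as one of the values. A sum satisfies this, as the plain-JavaScript sketch below shows. Array.sum is a Mongo shell helper, so the stand-in sumGrades (a name introduced here for illustration) uses Array.prototype.reduce instead.

```javascript
// Plain-JavaScript stand-in for the shell's reduce function.
// Array.sum exists only in the mongo shell, so this sketch sums manually.
const sumGrades = function (name, grades) {
  return grades.reduce((total, grade) => total + grade, 0);
};

// Chris's three grades from the sample data.
const chrisGrades = [92, 87, 91];

// Reducing everything at once...
const allAtOnce = sumGrades("Chris", chrisGrades);

// ...must match reducing a partial result together with the remaining value,
// because reduce can be invoked repeatedly for the same key.
const partial = sumGrades("Chris", [92, 87]);
const inTwoSteps = sumGrades("Chris", [partial, 91]);

console.log(allAtOnce, inTwoSteps); // 270 270
```

A reduce function that returned, say, an average would fail this check, because the average of a partial average and one more grade is not the overall average.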

Execute the mapReduce() function

Once we have defined both our map and reduce functions, we can run mapReduce() and name the collection where it will store the aggregated data.

> db.students.mapReduce(map, reduce, { out: "total_grades" });
{
    "result" : "total_grades",
    "timeMillis" : 228,
    "counts" : {
        "input" : 6,
        "emit" : 6,
        "reduce" : 2,
        "output" : 3
    },
    "ok" : 1
}
> db.total_grades.find()
{ "_id" : "Brian", "value" : 95 } // Brian only had one class with a grade of 95
{ "_id" : "Chris", "value" : 270 } // Chris had three classes with these grades: 92, 87, 91
{ "_id" : "John", "value" : 185 } // John had two classes with these grades: 98, 87

The mapReduce call succeeded, and the result field shows that the total_grades collection was created. When we view the contents of the collection, we see that the reduce() function calculated the sum of all the grades for each student.
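The totals and the counts block in the output can be cross-checked by hand. The plain-JavaScript sketch below redoes the arithmetic over the six sample documents. It assumes the reduce count records one call per key that has more than one value, which is consistent with the output shown; the variable names are illustrative only.

```javascript
// The six sample documents, trimmed to the two fields the map function reads.
const students = [
  { name: "John", grade: 98 },
  { name: "John", grade: 87 },
  { name: "Chris", grade: 92 },
  { name: "Chris", grade: 87 },
  { name: "Chris", grade: 91 },
  { name: "Brian", grade: 95 },
];

// Group grades by name, as the map phase does with emit(this.name, this.grade).
const gradesByName = {};
let emitCount = 0;
for (const student of students) {
  if (!gradesByName[student.name]) gradesByName[student.name] = [];
  gradesByName[student.name].push(student.grade);
  emitCount += 1;
}

// Sum each group; count a reduce call only for keys with more than one value.
const totalGrades = {};
let reduceCount = 0;
for (const [name, grades] of Object.entries(gradesByName)) {
  if (grades.length > 1) reduceCount += 1;
  totalGrades[name] = grades.reduce((total, grade) => total + grade, 0);
}

console.log(totalGrades); // { John: 185, Chris: 270, Brian: 95 }

// input, emit, reduce, output
console.log(students.length, emitCount, reduceCount, Object.keys(totalGrades).length); // 6 6 2 3
```

Brian has a single grade, so his value passes straight through, which is why the reduce count is 2 while the output count is 3.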

Conclusion

The mapReduce() function is commonly used for aggregating large sets of data. This example was a brief introduction to setting up the map and reduce functions in order to run a simple mapReduce() call. The map function generated key/value pairs from the original data, then the reduce function combined the values for each key. Once mapReduce() was executed, it stored the calculated data in a collection. Now that you are familiar with how it works, it’s time to try it on a larger data set!
