MongoDB Workbook
Exercises to accompany the course
By Nicholas Johnson
version 1.0.0

Image credit: Robert McCall, NASA
This little book accompanies the Mongo course taught over 2 days. Each exercise builds on the last, so it's best to work through these exercises in order.
We'll start off with the helper functions find, count and distinct, then move into the aggregate pipeline and on to map reduce, finally integrating with Node to produce a service oriented architecture.
MongoDB is a highly scalable, NoSQL, schema-free data store.
Mongo represents data as a tree. If your data is tree shaped, or can be made tree shaped, Mongo is great. If your data is a web or a network which can't be flattened out, you likely have relational data, and Mongo is perhaps not for you this time.
If you have unstructured data to store which can be represented as a series of nested lists, Mongo will make your life more enjoyable.
Say you have a webpage full of widgets, and each of those widgets can contain arbitrary information. This is a semi-structured tree and Mongo might be a very good choice.
If you have big customer data to store, and each customer record contains lists of communications, subscriptions, etc, the data is tree shaped, and Mongo would again be a good choice.
If you have big data and you want to query it in interesting and complex ways, pulling useful aggregated data out the other side in surprisingly short timeframes, Mongo is perfect.
On the other hand, if your data looks like a web: comments, purchases, kittens, customers, sharks, exploding hats, etc, all linking to each other in a web, then you have relational data, and you may wish to stick with a relational database like Postgres, MySQL or MS SQL Server.
Mongo manages to be so fast because it does less. There's no magical difference in the architecture that makes it fast; it simply has a streamlined query language that is easier to optimise.
We connect to the Mongo terminal using the mongo command:
mongo
By default Mongo will connect to localhost.
We can connect to a remote server by passing arguments, like so:
mongo connection.mongolab.com:45352 -u username -p passw0rd
Once we connect to a Mongo instance we can type JavaScript directly into the console. We can create variables, do maths, write JSON.
Connect to the console at localhost. Try typing some JavaScript expressions.
We can switch to a database in Mongo with the use command.
use petshop
This will switch to writing to the petshop database. It doesn't matter if the database doesn't exist yet. It will be brought into existence when you first write a document to it.
You can find which database you are using simply by typing db. You can drop the current database and everything in it using db.dropDatabase().
db
> petshop
db.dropDatabase()
That was easy, wasn't it? Don't worry, it gets a bit harder.
Collections are sets of (usually) related documents. Your database can have as many collections as you like.
Because Mongo has no joins, a Mongo query can pull data from only one collection at a time. You will need to take this into account when deciding how to arrange your data.
You can create a collection using the createCollection command.
use petshop
db.createCollection('mammals')
Collections will also be created automatically. If you write a document to a collection that doesn't exist that collection will be brought into being for you.
View your databases and collections using the show command, like this:
show dbs
show collections
Documents are JSON objects that live inside a collection. They can be any valid JSON format, with the caveat that they can't contain functions.
The size limit for a document is 16MB, which is more than ample for most use cases.
You can create a document by inserting it into a collection:
db.mammals.insert({name: "Polar Bear"})
db.mammals.insert({name: "Star Nosed Mole"})
You can find a document or documents matching a particular pattern using the find function.
If you want to find all the mammals in the mammals collection, you can do this easily.
db.mammals.find()
Mongo comes with a set of convenience functions for performing common operations. Find is one such function. It allows us to find documents by providing a partial match, or an expression.
You can use find to fetch whole collections, to fetch documents by ID, and to match against partial documents or expressions, as we'll see below.
You can't use find to chain complex operators, though. You can do a few simple things like counting, but if you want real power you need the aggregate pipeline, which is actually not at all scary and is quite easy to use.
The Aggregate pipeline allows us to chain operations together and pipe a set of documents from one operation to the next.
You can use find with no arguments to list documents in a collection.
db.entrycodes.find()
This will list all of the codes, 20 at a time.
You can get the same result by passing an empty object, like so:
db.entrycodes.find({})
Assuming you know the object ID of a document, you can pull that document by ID like so:
db.entrycodes.find(ObjectId("557afc91c0b20703009f7edf"))
The _id field of any collection is automatically indexed.
IDs are 12-byte BSON objects, not Strings, which is why we need the ObjectId function. If you want to read more on ObjectId, you can do so here.
Say you have a list of users and you want to find by name, you might do:
db.people.find({name: "dave"})
You can match on more than one field:
db.people.find({
  name: "dave",
  email: "davey@aol.com"
})
You can match on numbers:
db.people.find({
  name: "dave",
  age: 69,
  email: "davey@aol.com"
})
You can also match using a regex (although be aware this is slow on large data sets):
db.people.find({
  name: /dave/
})
We need to start out by inserting some data which we can work with.
use petshop
db.pets.insert({name: "Mikey", species: "Gerbil"})
db.pets.insert({name: "Davey Bungooligan", species: "Piranha"})
db.pets.insert({name: "Suzy B", species: "Cat"})
db.pets.insert({name: "Mikey", species: "Hotdog"})
db.pets.insert({name: "Terrence", species: "Sausagedog"})
db.pets.insert({name: "Philomena Jones", species: "Cat"})
We have seen how we can find elements by passing Mongo a partial match, like so:
db.people.find({name: 'Yolanda Sasquatch'})
We can also find using expressions. We define these using JSON, like so:
db.people.find({
  age: {
    $gt: 65
  }
})
We can use comparison operators such as $gt (greater than), $gte, $lt, $lte, $ne, $in and $exists. See the full list here:
http://docs.mongodb.org/manual/reference/operator/query/
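For example, a quick sketch combining two operators to find people aged 18 or over but under 65:
db.people.find({
  age: {
    $gte: 18,
    $lt: 65
  }
})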
Copy the following code into a Mongo terminal. It will create a collection of people, some of whom will have cats.
Optionally modify the code so that some people have piranhas, and some have dachshunds.
use people
(function() {
  var names = [
    'Yolanda',
    'Iska',
    'Malone',
    'Frank',
    'Foxton',
    'Pirate',
    'Poppelhoffen',
    'Elbow',
    'Fluffy',
    'Paphat'
  ];
  var randName = function() {
    var n = names.length;
    return [
      names[Math.floor(Math.random() * n)],
      names[Math.floor(Math.random() * n)]
    ].join(' ');
  };
  var randAge = function(n) {
    return Math.floor(Math.random() * n);
  };
  for (var i = 0; i < 1000; ++i) {
    var person = {
      name: randName(),
      age: randAge(100)
    };
    if (Math.random() > 0.8) {
      person.cat = {
        name: randName(),
        age: randAge(18)
      };
    }
    db.people.insert(person);
  }
})();
We can use $exists to filter on the existence or non-existence of a field. We might find all the breakfasts with eggs:
db.breakfast.find({
  eggs: {
    $exists: true
  }
})
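Conversely, passing $exists: false finds the breakfasts without eggs:
db.breakfast.find({
  eggs: {
    $exists: false
  }
})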
We can use $gt and $lt to find documents that have fields which are greater than or less than a value:
db.breakfast.find({
  starRating: {
    $gt: 5
  }
})
We are going to use some real data now. The stocks.json file is a list of all stocks traded in the US in 2015.
Download the stocks.json file from here:
http://nicholasjohnson.com/mongo/datasets/stocks.json
We can import a JSON file from the command line using the mongoimport shell command.
Enter the following into a terminal. Don't enter this into the Mongo console or it won't work.
mongoimport --db stocks --collection stocks --file stocks.json
We can even filter using an arbitrary JavaScript expression using $where. This will allow us to compare two fields in a single document.
db.sandwiches.find({
  $where: "this.jam && this.peanutButter && this.jam > this.peanutButter"
})
Here we find all the sandwiches with jam and peanut butter where the jam quotient outweighs the peanut butter.
Warning: It's easy to overuse $where since it appears to do everything with plain old JavaScript. $where is eval-ing a JavaScript expression and as such is slow. Mongo can make no optimisations here, and must execute the JavaScript on every single document in the collection. Prefer the native operators where possible.
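If you are on MongoDB 3.6 or later, the $expr operator lets you compare two fields using native aggregation expressions instead of JavaScript. A sketch of the same query (the existence checks are omitted for brevity):
db.sandwiches.find({
  $expr: {$gt: ['$jam', '$peanutButter']}
})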
Find takes a second parameter which allows you to whitelist fields to pass into the output document. We call this projection.
Say the breakfast collection contains these documents:
{
  ham: 4,
  eggs: 2
}
{
  cheese: 6,
  lime: 0.5
}
You can choose fields to pass through, like so:
db.breakfast.find({}, {
  eggs: true,
  lime: true
})
This will yield:
{
  eggs: 2
},
{
  lime: 0.5
}
{ "cat" : { "name" : "Fluffy Frank", "age" : 13 } }
When you output the cats, you will need to find only people who have cats, where cats $exists, or you will have gaps in your data.
You will notice that the _id field is always passed through projection by default. This is often desirable, but you may wish to hide it, perhaps to conceal your implementation, or to keep your communication over the wire tight.
You can do this easily by passing _id: false:
db.breakfast.find({}, {
  eggs: true,
  lime: true,
  _id: false
})
We can chain some additional functions onto our find query to modify the output.
Count will convert our result set into a number. We can use it in two ways. We can either chain it:
db.people.find({sharks: 3}).count()
or we can use it in place of find:
db.people.count({sharks: 3})
Either way counts the people who have exactly three sharks.
Don't confuse it with length(). Length will convert to an array, then count the length of that array. This is inefficient.
Count can be a slow operation on large data sets. For more on optimising count, you might like to read the following: https://www.compose.io/articles/counting-and-not-counting-with-mongodb/
Limit will allow us to limit the results in the output set. Skip will allow us to offset the start. Between them they give us pagination.
For example:
db.biscuits.find().limit(5)
will give us the first 5 biscuits. If we want the next 5 we can skip the first 5.
db.biscuits.find().limit(5).skip(5)
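In general, page n (counting from zero), with pageSize results per page, looks like this (n and pageSize are placeholders for your own values):
db.biscuits.find().skip(n * pageSize).limit(pageSize)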
We can sort the results using the sort operator, like so:
db.spiders.find().sort({hairiness: 1})
This will sort the spiders in ascending order of hairiness. You can reverse the sort by passing -1.
db.spiders.find().sort({hairiness: -1})
This will get the most hairy spiders first.
We can sort by more than one field:
db.spiders.find().sort({
  hairiness: -1,
  scariness: -1
})
We might also sort by nested fields:
db.spiders.find().sort({
  'web.size': -1
})
will give the spiders with the largest webs.
Use the stocks data from here:
http://nicholasjohnson.com/mongo/datasets/stocks.json
mongoimport --db stocks --collection stocks --file stocks.json
When we compose a query, Mongo gives us back a cursor object from which we can get the values.
When we called limit and sort in the last section, we were actually calling methods on the cursor that was returned by find.
If we save the cursor in a variable we can call more methods on it.
var people = db.people.find();
We can iterate over the cursor using a simple while loop. We can check if the cursor has a next value, and call cursor.next() to get the value out.
var people = db.people.find();
while (people.hasNext()) {
  print(tojson(people.next()));
}
We can simplify the code above using functional style loops.
db.people.find().forEach(function(person) {
  print(person.name);
});
We also have access to map, which will return an array of values.
var array = db.people.find().map(function(person) {
  return person.name;
});
You can read all the cursor methods here:
http://docs.mongodb.org/manual/reference/method/js-cursor/
CRUD is a basic function of any database. CRUD stands for Create, Read, Update and Delete: the four basic things that any data store needs to give us.
We create using the insert command, like this:
db.people.insert({name: "Tony Stark", occupation: "Billionaire, playboy, philantropist..."})
The JSON object will be created and saved.
Refresh your muscle memory. Create a new person now. Ensure that person has a shark.
We have many options for finding. We have already seen db.collection.find(). We can also use db.collection.findOne() which will return at most one result.
As we shall see soon, we also have the aggregate framework, and if we need maximum flexibility at the expense of a good deal of speed, we can also use map-reduce.
We save using the db.collection.save function. We pass the function a JSON object that contains the modified object to save, including the _id. The item will be found and updated.
var p = db.people.findOne()
p.age = 999
db.people.save(p)
We can also find and update in a single step using the update function:
db.people.update({name: 'dave'}, {name:'brunhilde'})
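Beware: passing a plain document as the second argument replaces the entire matched document, so dave loses every field except name. To modify just one field and leave the rest alone, use the $set operator:
db.people.update({name: 'dave'}, {$set: {name: 'brunhilde'}})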
We can remove people en masse:
db.people.remove({name: 'Dave'})
In 2001, the American energy firm Enron was taken down by accounting fraud. During the investigation, the Federal Energy Regulatory Commission made their entire email database public. This is now the largest public domain email corpus available.
We'll use this dataset more as we go on, but for now we want to get it into shape.
Download the email corpus here:
http://nicholasjohnson.com/mongo/datasets/enron.json
Import it into Mongo using mongoimport, something like this:
mongoimport --db enron --collection emails --file enron.json
Have a look at the data. You'll notice the date field on each email is a string. We could really do with changing this field into a Date object.
We can create a Date from a string like this:
new Date("2000-08-23 02:16:00-07:00")
Iterate over the dataset, converting all the strings into dates and saving them back into the database. You will need to use the save command for this.
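A minimal sketch of the conversion loop, assuming the field is called date (check your imported documents for the real field name):
db.emails.find().forEach(function(email) {
  email.date = new Date(email.date);
  db.emails.save(email);
});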
Just for fun, find every email that contains the word 'fraud'.
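A case-insensitive regex is one way to do it. The body field name here is an assumption, so inspect a document first:
db.emails.find({body: /fraud/i})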
The Mongo Aggregation framework gives you a document query pipeline. You can pipe a collection into the top and transform it through a series of operations, eventually popping a result out the bottom (snigger).
For example, you might take a result set, filter it, group by a particular field, then sum values in a particular group. You could find the total population of Iowa given an array of postcodes. You could find all the coupons that were used on Monday, and then count them.
We can compose a pipeline as a set of JSON objects, then run the pipeline on a collection.
If you provide an empty pipeline, Mongo will return all the results in the collection:
db.entrycodes.aggregate()
Try out the aggregate pipeline now. Call aggregate on your people collection. You'll see the result is the same as if you called find.
We can use the aggregation pipeline to filter a result set. This is more or less analogous to find, and is probably the most common thing we want to do.
Say we want to list only people who have cats (where cat is a sub-document), we would probably do something like this:
db.people.find({
  cat: {
    $exists: true
  }
})
We can get the same result in the aggregation framework using $match, like so:
db.people.aggregate({
  '$match': {
    cat: {
      $exists: true
    }
  }
})
So why use aggregation over find? In this example they are the same, but the power comes when we start to chain additional functions as we shall soon see.
Matching is quick but not smart. It's designed to limit the result set so that the rest of the pipeline can run more quickly. When combined with $project, we can even match against computed fields that don't exist in the source documents. This is a powerful and useful feature.
Download the zips file here. This contains a list of all the zip codes in the US:
http://nicholasjohnson.com/mongo/datasets/zips.json
Import it into Mongo using mongoimport, something like this:
mongoimport --db zips --collection zips --file zips.json
The find function allowed us to do simple whitelist projection. The aggregate pipeline gives us many more options.
We can use $project to modify all the documents in the pipeline. We can remove elements, allow elements through, rename elements, and even create brand new elements based on expressions.
Say we have a set of voucher codes, like this:
{
  "firstName" : "Dave",
  "lastName" : "Jones",
  "code" : "74wy zphz",
  "usedAt" : ISODate("2015-06-13T17:47:20.423Z"),
  "email" : "123@gmail.com"
},
{
  "firstName" : "Stuart",
  "lastName" : "Hat",
  "code" : "7uwz e3cw",
  "usedAt" : ISODate("2015-06-13T17:47:50.489Z"),
  "email" : "456@gmail.com"
}
...
We can use project to restrict the fields that are passed through. We pick the fields we like and set them to true to pass them through unchanged.
db.entrycodes.aggregate(
  {
    '$project' : {
      email: true,
      code: true
    }
  }
)
Remove the _id field by passing _id: false.
This will yield a set something like the following:
[
  {
    "code" : "74wy zphz",
    "email" : "123@gmail.com"
  },
  {
    "code" : "7uwz e3cw",
    "email" : "456@gmail.com"
  }
]
We can use project to rename fields if we want to. We use an expression: $lastName to pull out the value from the lastName field, and project it forward into the surname field.
db.entrycodes.aggregate(
  {
    '$project' : {
      surname: "$lastName"
    }
  }
)
This will yield something like the following:
[
  {
    surname : "Jones"
  },
  {
    surname : "Hat"
  }
  ...
]
We can chain $match and $project together. Say we have a list of codes, and some have not yet been used. We want to pull out the names and emails, but only from the codes which have been used.
[
  {
    "code" : "7uwz eccw"
  },
  {
    "firstName" : "Dave",
    "lastName" : "Jones",
    "code" : "7uwy zphz",
    "usedAt" : ISODate("2015-06-13T17:47:20.423Z"),
    "email" : "123@gmail.com",
    "winner": true
  },
  {
    "firstName" : "Stuart",
    "lastName" : "Hat",
    "code" : "7uwz eccw",
    "usedAt" : ISODate("2015-06-13T17:47:50.489Z"),
    "email" : "123@gmail.com"
  },
  {
    "code" : "7uwz eccw"
  }
  ...
]
We might first $match the codes which have a usedAt field, and then use $project to pull out the names and emails from the remainder.
db.entrycodes.aggregate(
  {
    $match: {
      usedAt: {
        $exists: true
      }
    }
  },
  {
    $project: {
      firstName: true,
      lastName: true,
      email: true,
      _id: false
    }
  }
)
We can use project to add new fields to our documents based on expressions.
Say we had a list of people, like so:
{firstName: "Chinstrap", surname: "McGee"}
{firstName: "Bootle", surname: "Cheeserafter"}
{firstName: "Mangstrang", surname: "Fringlehoffen"}
We can use project to compose a name field, like so:
db.entrycodes.aggregate(
  {
    $project: {
      name: {$concat: ['$firstName', ' ', '$surname']}
    }
  }
)
This will give us results like this:
{ "name" : "Dave Smith" },
{ "name" : "Mike Moo" }
We saw here how to use $concat to make a new attribute containing a concatenated string. Have a look at the other String aggregation operators here:
http://docs.mongodb.org/manual/reference/operator/aggregation-string/
Attempt to capitalize all the names. This is useful because Mongo grouping and matching is case sensitive.
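As a starting point, here is a sketch that uppercases the whole name using $toUpper (true title-casing takes more work, combining $substr, $toUpper and $concat):
db.people.aggregate(
  {
    $project: {
      name: {$toUpper: '$name'}
    }
  }
)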
We can set the value of a field using a boolean expression using $cond. There are a couple of ways to use $cond. You may wish to review the documentation here: http://docs.mongodb.org/manual/reference/operator/aggregation/cond/
Say we have a set of customers, and some of them have complaints. We might set a flag on all of the unhappy customers like so:
db.customers.aggregate(
  {
    $project: {
      unhappy: {
        $cond: { if: '$complaints', then: true, else: false }
      },
      complaints: '$complaints'
    }
  }
)
Unhappy will either be true or false.
$group allows us to group a collection according to criteria. We can group by fields that exist in the data, or we can group by expressions that create new fields.
Group will operate on a set of documents in the pipeline, and output a new set of documents out the bottom.
We group using the _id field. This will create a new _id for each group that will be an object containing the grouping criteria.
The simplest group would look like this:
db.people.aggregate(
  {
    $group: {
      _id: 1
    }
  }
)
The _id here is a constant, so the whole collection falls into a single group; but since we compute no other fields, the single output document contains nothing but the _id.
If we just want to group by a single field we can do this easily. The id of each output document will be the value of the expression, in this case '$name'.
db.people.aggregate(
  {
    $group: {
      _id: '$name'
    }
  }
)
You just wrote a query for getting distinct names. Swap '$name' for '$email' and you have distinct emails.
You can group on more than one field by passing an object to _id:
db.people.aggregate(
  {
    $group: {
      _id: {
        name: '$name',
        age: '$age'
      }
    }
  }
)
Try out the above. Notice that the _id field is now an object. Use $project to reformat the data. You now have distinct names and ages.
When grouping we use the _id field to hold the common values that we are grouping our documents on. This means that the output of a group aggregation only holds the data that is common to all documents in that group.
What happens, though, if we want to preserve a value that we are not grouping on? For this we use $push.
$push will create an array and store part or all of the grouped source documents in it.
db.entrycodes.aggregate([
  {
    $group: {
      _id: "$email",
      count: {$sum: 1},
      entry: {
        $push: {
          firstName: "$firstName",
          lastName: "$lastName"
        }
      }
    }
  }
])
The $$ROOT variable contains the source documents for the group. If you'd like to just pass them through unmodified, you can do this by $pushing $$ROOT into the output from the group.
db.entrycodes.aggregate([
  {
    $group: {
      _id: "$email",
      count: {$sum: 1},
      entries: { $push: "$$ROOT" }
    }
  }
])
We can now see every entry that was submitted from each email address, along with a count.
Group has the ability to count. You can count the entries in a group. By chaining $group commands together you could count the number of groups.
For example, say you have a set of customer records which may contain duplicate emails. You could group by email and find out who is using your service most often. You might count the groups, to get the number of distinct emails, you might group by count, to find how many people used your site once, twice, five times, etc.
We use $group to count, because generally we want to count groups.
We could count the entire collection by grouping everything, then adding a count field. This is the same as db.collection.find().count()
db.hamsters.aggregate(
  {
    $group: {
      _id: 1,
      count: {
        $sum: 1
      }
    }
  }
)
Let's group on name, then count how many people have each name:
db.people.aggregate(
  {
    $group: {
      _id: {
        name: '$name'
      },
      count: {
        $sum: 1
      }
    }
  }
)
We can sort records in the aggregation pipeline just as we can with find. Choose the fields to sort on and pass 1 or -1 to sort or reverse the sort.
db.people.aggregate(
  {
    $sort: {
      age: 1,
      name: 1
    }
  }
)
Another challenge is to count the number of groups. For example, say you have a dataset containing duplicate emails: you might want to generate a list of distinct emails and then count that list.
You have two ways to do this: pull the distinct values into memory and take the length of the resulting array, or chain two $group stages in the aggregation pipeline.
Say we want to pick out all the unique emails. We might use distinct, like so:
db.entrycodes.distinct('email')
This will pop a list out into memory, like this:
[
  "123@gmail.com",
  "456@gmail.com",
  "567@gmail.com",
  "890@aol.com"
]
We could get the count just by taking the length of the array, like so:
db.entrycodes.distinct('email').length
However, this is bad. Imagine now that we have 15,000,000 records. We now have to create a massive array just for the purpose of getting a single number.
Instead we can do this entirely in the aggregation framework using two group commands. First we group by emails and throw away the rest of the data. We now have a list of all the unique emails.
We now want to find out how big this set is, so we create a big group that holds everything (using _id: 1) and count that.
db.entrycodes.aggregate(
  {
    $group: {
      _id: '$email'
    }
  },
  {
    $group: {
      _id: 1,
      count: {
        $sum: 1
      }
    }
  }
)
Use the Enron email data from here:
http://nicholasjohnson.com/mongo/datasets/enron.json
mongoimport --db enron --collection emails --file enron.json
Stats are cool. People love stats. We can use Mongo to generate a timeline showing day by day activity.
To do this we will need to group by date. First we add fields to our documents representing year, month and dayOfMonth. Then we group by these fields. We can count to get an aggregate, or filter based on some parameter.
For clarity, I have separated this into project and group stages so you can see how the pipeline changes. You could roll these two steps into a single $group stage:
db.competitionentries.aggregate(
  {
    $project: {
      year: {$year: '$date'},
      month: {$month: '$date'},
      dayOfMonth: {$dayOfMonth: '$date'}
    }
  },
  {
    $group: {
      _id: {
        year: '$year',
        month: '$month',
        dayOfMonth: '$dayOfMonth'
      },
      count: {
        $sum: 1
      }
    }
  }
)
Download the holidays dataset here:
http://nicholasjohnson.com/mongo/datasets/holidays.json
mongoimport --db holidays --collection holidays --file holidays.json
You now have a calendar of events for the year.
Why might this be useful? Imagine instead of holidays, we have a list of customer complaints. We can now find how customer complaints are distributed, which might be useful.
We use unwind when we have data that contains arrays, for example
{name: 'dave', pets: ['cat', 'dog']}
{name: 'mike', pets: ['naked mole rat', 'dog', 'hat']}
We can unwind this data to get a unique entry for each element in the array:
db.people.aggregate(
{ $unwind : "$pets" }
)
This will yield something like this:
{name: 'dave', pets: 'cat'}
{name: 'dave', pets: 'dog'}
{name: 'mike', pets: 'naked mole rat'}
{name: 'mike', pets: 'dog'}
{name: 'mike', pets: 'hat'}
Say you needed to get hold of the number of dogs: you can't easily group and count on an entry in an array, but if you unwind first you can easily group on the pets field, as in the sketch below.
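For example, a sketch that counts how many people own each kind of pet:
db.people.aggregate(
  { $unwind: '$pets' },
  {
    $group: {
      _id: '$pets',
      count: {$sum: 1}
    }
  }
)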
Download the company startup dataset here:
http://nicholasjohnson.com/mongo/datasets/startups.json.zip
Now load it into a database collection.
First up we have to prep the data. We are going to be filtering by tags. Notice that tag_list is a comma-separated string. Ideally this would be an array.
Use a cursor to iterate over the collection and convert the comma-separated string into an array, then save the array back into your collection. Use split() to convert a string into an array, like so:
"a,b".split(",")
Tell me the total number of tags.
Say we have previously $group-ed our data and $push-ed to generate an array. We can use $unwind to restore the values with the count attached.
Say we have a list of customer interactions. For each interaction we want to add an 'engagement' field which lists out the number of times the customer appears in the list.
We might first $group by customer, using $sum to count the interactions and $push with $$ROOT to keep hold of the original documents, then $unwind the pushed array to expand the documents back out.
We now have the original time series data, but with the addition of an engagement field. A sketch follows below.
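A sketch of the pipeline, assuming each interaction document carries an email field identifying the customer (adjust the names to match your data):
db.interactions.aggregate(
  {
    $group: {
      _id: '$email',
      engagement: {$sum: 1},
      interactions: {$push: '$$ROOT'}
    }
  },
  { $unwind: '$interactions' }
)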
Use the Enron email data from here:
http://nicholasjohnson.com/mongo/datasets/enron.json
First we have to prep the data. Unless you've already converted it, the date will be stored as a string. We would ideally like to convert it to an actual date.
We can split the tag_list using a script like this one. See if you can work out what it is doing:
db.startups.find({}).forEach(function(startup) {
  if (startup.tag_list && startup.tag_list.split) {
    startup.tag_list = startup.tag_list.split(',').map(function(a) { return a.trim(); });
  } else {
    startup.tag_list = [];
  }
  print(startup.tag_list);
  db.startups.save(startup);
})
Before the aggregation framework matured in Mongo 2.6, map reduce was the only practical way to run complex aggregation queries on Mongo. The aggregate framework has largely replaced map reduce, but we can still use it for complex queries.
Because Map reduce runs JavaScript functions, it can be slower than the equivalent aggregate pipeline. Favour aggregate operations where possible.
The Map stage accepts a single document and converts it to a usable form. Say you are summing all the fleas on an elephant: the map stage might output just the flea count for a single elephant. It emits the value, and it can emit many values.
var map = function() {
  emit('flea_count', this.flea_count);
  emit('tick_count', this.tick_count);
}
The reduce stage accepts multiple mapped inputs and aggregates them in some way, perhaps adding them to an array, or summing them.
var reduce = function(id, counts) {
  return Array.sum(counts);
}
db.people.mapReduce(
  map,
  reduce,
  {
    out: { inline: 1 }
  }
)
Read more about map reduce here: http://docs.mongodb.org/manual/reference/method/db.collection.mapReduce/
If you have data distributed across a cluster, each machine in the cluster can map a few documents feeding the result forward. Furthermore, each machine can handle a few reduce operations, potentially distributing the workload very widely.
This is the algorithm Google famously used to build its search index, so as you can imagine it's quite quick.