MongoDB: A first look
The subject of two entire talks and mentioned in several others, MongoDB was definitely the buzz at TekX this year. It's long been in favor in the tech community in Lawrence and has been used for some data crunching on a few projects at the local paper. Even with all of this exposure, I'd yet to sit down and actually explore it.
That changed Friday afternoon while I sat at O'Hare waiting on my flight back to Lawrence (which subsequently got canceled). I had installed Mongo earlier in the week and opened up a bunch of tabs on the various intros and tutorials available on the Mongo wiki. The rest of this article is a mix of stream-of-consciousness notes from playing around with Mongo for the first time and some of my reflections from this past week.
Note on typefaces
I use both Mongo and mongo throughout this article. The first, the title-case
Mongo, refers to the software as a whole. Whenever you see mongo in lowercase
and in monospace, it's referring to the Mongo client program you run from the
command line.
Installation
On a Mac, it's a breeze. I use Homebrew to manage software on my Mac, so a
quick brew install mongodb
was all I needed and a minute later I was ready to
go.
Starting Up the Server
Mongo is run by the mongod
process. I don't know if it's pronounced
mongo-d or mon-god though. It's a fun play on words if the latter is the
case.
Brew includes a basic configuration to get up and running, so I use that inside
a screen
instance so I can leave it running in the background while I use the
mongo
tool to interact with it.
Interacting with Mongo
I started out with the basic tutorial to get going. It looks like that needs some love though. It shows the version in the startup output as 0.9.8, while Homebrew ships 1.4.2, and I did find a few things that were out of date. No, I haven't been a good open source community member and submitted fixes yet.
The first thing that's different from a traditional RDBMS with Mongo is that
you don't have to explicitly create a database. It's pretty straightforward:
from within mongo, type use <database>. That selects the database, and Mongo
creates it for you the first time you store something in it. For the examples
below, I'm using use mydb to select mydb as my database.
It's kind of nice to just be able to connect and go, but it feels odd. Not
good or bad, just odd. Sort of like the first time you run git checkout
inside a repository to switch branches when you're used to Subversion.
The shell feels like a JavaScript console. I don't have access to the source code in my off-line mode, so for all I know it is one. The syntax seems remarkably similar, so it's at least JavaScript-inspired.
Adding Records
Mongo stores documents, not rows of columns. This distinction allows Mongo to ignore schema, continuing the theme of leaving it up to the developer. Those documents can be made up of any number of key-value pairs that look remarkably like JSON. Need to store a new data point? Just add it as a field to a document and you're set.
Here's an example inspired by Mongo's tutorial for adding a few records:
> person = {name: "Travis Swicegood"}
> city = {city: "Lawrence", state: "KS"}
> db.things.save(person)
> db.things.save(city)
Here I created two new objects with various data attached to them, then saved
them both inside the things collection. A collection in Mongo is like a table
in the SQL world. You don't have to create a collection; you just reference it
on the db object and you're set.
Comparing this to the equivalent code for a relational database, I've got to say I love this. There's no boilerplate code to get going. I didn't have to create a database or any tables. I just started using them. This appeals to my laziness (err, I mean desire for efficiency), and it also looks very promising for teaching someone new. Every abstract idea you can remove is one less potential stumbling block for someone starting out.
Back to the data I entered. Notice that the two documents don't share any
fields. Documents inside a Mongo collection are made up of a series of keys
and values, and they can be whatever you want them to be. This is perfect for
lazy migrations: migrating the data as it's requested instead of doing it all
at once. Ming, a Python wrapper around Mongo, already provides this. This is
especially useful for large sites with lots of data that may or may not ever
be requested again.
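To make the lazy-migration idea a bit more concrete, here is a minimal sketch in plain JavaScript. It doesn't talk to Mongo and it isn't how Ming does it; the schemaVersion field, the migrations table, and the loadThing helper are all hypothetical names invented for illustration, and an in-memory array stands in for the collection.

```javascript
// Hypothetical sketch of lazy migration: a document is upgraded only when it
// is read, not in one big batch. None of these names come from Mongo or Ming.

const CURRENT_VERSION = 2;

// Each entry upgrades a document from version N to version N + 1.
const migrations = {
  // v0 -> v1: split the single "name" field into first and last names.
  0: function (doc) {
    const parts = (doc.name || "").split(" ");
    return Object.assign({}, doc, {
      first_name: parts[0],
      last_name: parts.slice(1).join(" "),
      schemaVersion: 1,
    });
  },
  // v1 -> v2: add a keyword array built from the name parts.
  1: function (doc) {
    return Object.assign({}, doc, {
      name_field: [doc.first_name, doc.last_name],
      schemaVersion: 2,
    });
  },
};

// Apply every outstanding migration, oldest first.
function migrate(doc) {
  let current = doc;
  let version = current.schemaVersion || 0;
  while (version < CURRENT_VERSION) {
    current = migrations[version](current);
    version = current.schemaVersion;
  }
  return current;
}

// In-memory stand-in for a collection of old-style documents.
const things = [{ name: "Travis Swicegood" }, { name: "Ada Lovelace" }];

// Reading a document upgrades it and writes the new shape back.
function loadThing(index) {
  const upgraded = migrate(things[index]);
  things[index] = upgraded;
  return upgraded;
}

console.log(loadThing(0)); // upgraded on first read
console.log(things[1]);    // never requested, so still in the old shape
```

The trade-off is that old and new shapes live side by side until every document has been read at least once, so the reading code has to keep every migration step around.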
Finding Records
Now that the records are there, it's time to find them. The db.things object
comes back into play:
> db.things.find()
{ "_id" : ObjectId("4bf9a96b7d04f51b48499011"), "name" : "Travis Swicegood" }
{ "_id" : ObjectId("4bf9a96f7d04f51b48499012"), "city" : "Lawrence", "state" : "KS" }
That gives me everything. The find method takes optional parameters to filter
the results. This is actually a good time to bring up the built-in help in
mongo. Entering only the name of any function (i.e., without calling it)
displays the implementation of the function:
> db.things.find
function (query, fields, limit, skip) {
    return new DBQuery(
        this._mongo, this._db, this, this._fullName,
        this._massageObject(query), fields, limit, skip);
}
Note: I changed the formatting so it's more easily viewable online.
The parameters are optional (as with any JavaScript function), so you can pass
in as many or as few as you want. Filtering the results is done by providing a
hash for the query parameter (the first one). For example, to find my record:
> db.things.find({name: "Travis Swicegood"})
{ "_id" : ObjectId("4bf9a96b7d04f51b48499011"),
"name" : "Travis Swicegood" }
One thing you can't do is full-text searching. I can't ask for all of the
records that begin with Travis or have a portion of my name in them. The
current recommendation (at least via the wiki) is to build your own list of
keywords as an array, then search that array. For example:
> var person2 = {name: "Travis Swicegood",
> name_field: ["Travis", "Swicegood"]};
> db.things.save(person2)
> db.things.find({name_field: "Travis"})
{ "_id" : ObjectId("4bf9afa17d04f51b48499014"),
"name" : "Travis Swicegood",
"name_field" : [ "Travis", "Swicegood" ] }
For something like a name, this can be useful. For full-text searching of an article, it's probably best to delegate searching off to something like Solr and let Mongo focus on storage and retrieval.
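Building that keyword array by hand gets old quickly, so here is a rough sketch of how it might be automated. This is plain JavaScript with an in-memory array standing in for the collection; keywordsFor, withKeywords, and findByKeyword are hypothetical helpers, not part of Mongo, and the lookup only mimics the exact-match-inside-an-array behavior seen in the query above.

```javascript
// Hypothetical helpers for the "array of keywords" approach.

// Split a string on whitespace and drop any empty pieces.
function keywordsFor(text) {
  return text.split(/\s+/).filter(function (word) {
    return word.length > 0;
  });
}

// Return a copy of the document with a name_field keyword array added.
function withKeywords(doc) {
  return Object.assign({}, doc, { name_field: keywordsFor(doc.name) });
}

// Mimic the query above: a document matches when the keyword is one of the
// elements of its name_field array. Partial words do not match.
function findByKeyword(docs, keyword) {
  return docs.filter(function (doc) {
    return Array.isArray(doc.name_field) &&
      doc.name_field.indexOf(keyword) !== -1;
  });
}

const people = [
  withKeywords({ name: "Travis Swicegood" }),
  withKeywords({ name: "Ada Lovelace" }),
];

const hits = findByKeyword(people, "Travis");
console.log(hits.map(function (person) { return person.name; }));
```

Note that this is still whole-word matching: a search for "Trav" finds nothing, which is part of why a dedicated search engine is the better fit for anything article-sized.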
Querying for sub-objects
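Of course, I had to try sub-objects to see if they would work. With a document saved that nests person2 and city under the person and city keys, I can query on the whole sub-object: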
> db.things.find({person: person2})
{ "_id" : ObjectId("4bf9b02b7d04f51b48499015"),
"person" : { "name" : "Travis Swicegood",
"name_field" : [ "Travis", "Swicegood" ],
"_id" : ObjectId("4bf9afa17d04f51b48499014") },
"city" : { "city" : "Lawrence",
"state" : "KS",
"_id" : ObjectId("4bf9a96f7d04f51b48499012") } }
You can also query using dot notation to "reach through" an object and look at its children. This returns the same result as the previous query:
> db.things.find({"person.name_field": "Travis"})
Limiting returned columns
This ability to dynamically add columns to a record definitely provides a
breeding ground for massive documents with lots of keys. Most of the time a
small subset of those keys is all that's needed. The second parameter to find
provides us with that functionality:
> db.things.find({person: person2}, {city:1})
{ "_id" : ObjectId("4bf9b02b7d04f51b48499015"),
"city" : { "city" : "Lawrence",
"state" : "KS",
"_id" : ObjectId("4bf9a96f7d04f51b48499012") } }
Likewise, you can reach through the object and pull out a subfield:
> db.things.find({person: person2}, {"city.state":1})
{ "_id" : ObjectId("4bf9b02b7d04f51b48499015"),
"city" : { "state" : "KS" } }
These examples bring up a syntax thing with Mongo that I'm not crazy about: the use of the number one. It's the standard C style: 1 is true, 0 is false. I'd love to see the client and the libraries adopt an intent-revealing name. Granted, this is a minor niggle, but the little things are what make a good system an amazing one.
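Nothing stops you from papering over the bare 1 yourself, since the field list is just an object. Here is a small sketch of what an intent-revealing wrapper could look like. INCLUDE, fields, and project are hypothetical names of my own, not anything Mongo ships, and project is only an in-memory imitation of the field limiting shown above (including keeping _id), not Mongo's actual implementation.

```javascript
// Hypothetical: give the magic number a name.
const INCLUDE = 1;

// Build a field spec from a list of names:
// fields("city.state") gives {"city.state": 1}.
function fields() {
  const spec = {};
  for (let i = 0; i < arguments.length; i++) {
    spec[arguments[i]] = INCLUDE;
  }
  return spec;
}

// In-memory imitation of limiting the returned fields. Dotted names reach
// through sub-objects; paths that do not exist are skipped.
function project(doc, spec) {
  const result = {};
  if ("_id" in doc) result._id = doc._id; // the shell output above kept _id
  Object.keys(spec).forEach(function (path) {
    if (spec[path] !== INCLUDE) return;
    const keys = path.split(".");
    // Walk down the source document to find the requested value.
    let value = doc;
    for (let i = 0; i < keys.length; i++) {
      if (value === null || typeof value !== "object" || !(keys[i] in value)) {
        return;
      }
      value = value[keys[i]];
    }
    // Rebuild the same nesting in the result and copy the value over.
    let target = result;
    for (let i = 0; i < keys.length - 1; i++) {
      if (!(keys[i] in target)) target[keys[i]] = {};
      target = target[keys[i]];
    }
    target[keys[keys.length - 1]] = value;
  });
  return result;
}

const thing = {
  _id: "example-id",
  person: { name: "Travis Swicegood" },
  city: { city: "Lawrence", state: "KS" },
};

console.log(JSON.stringify(project(thing, fields("city.state"))));
```

In the shell, the same fields helper could be passed as the second argument to find in place of the literal object, which reads a little closer to what the query means.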
A Few Issues
The docs, being community-run and with Mongo still relatively new, are a little loose. Looking through them, I've found a bunch of examples that don't work the way they're documented.
Another potential issue (or at least something you need to be aware of) is that Mongo's geospatial support isn't 100% there yet. It only provides 2D, and the math it uses assumes that 1° of longitude is the same distance at the poles as it is at the equator. For many applications, this isn't a huge issue, but if precision is important, Mongo's not ready for this type of use.
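To get a feel for how much that assumption matters, here is a back-of-the-envelope sketch. It assumes a perfectly spherical Earth with a mean radius of 6371 km (an approximation of mine, not something from the Mongo docs), and the function names are made up for this example. On a sphere, the east-west distance covered by one degree of longitude shrinks with the cosine of the latitude.

```javascript
// Assumption: spherical Earth with this mean radius, in kilometers.
const EARTH_RADIUS_KM = 6371;

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// East-west distance covered by one degree of longitude at a given latitude.
function kmPerDegreeOfLongitude(latitudeDegrees) {
  return toRadians(1) * EARTH_RADIUS_KM *
    Math.cos(toRadians(latitudeDegrees));
}

// How far off a flat model is if it treats every degree of longitude like
// one at the equator: 1 means no error, 2 means distances are doubled.
function flatModelOverestimate(latitudeDegrees) {
  return kmPerDegreeOfLongitude(0) / kmPerDegreeOfLongitude(latitudeDegrees);
}

// Latitudes to compare; 39 is roughly where Lawrence, KS sits.
[0, 39, 60].forEach(function (latitude) {
  console.log(
    latitude + " degrees: " +
    kmPerDegreeOfLongitude(latitude).toFixed(1) + " km per degree, " +
    "flat model off by a factor of " +
    flatModelOverestimate(latitude).toFixed(2)
  );
});
```

A degree of longitude is roughly 111 km at the equator and exactly half that at 60 degrees of latitude, so the error is small near the equator and grows quickly toward the poles.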
One thing that I'm looking forward to is Mongo's sharding. That is going to allow Mongo to scale horizontally really well. Some of the initial test results look amazing. What will be really interesting is to see how well it scales down. It's one thing to have over 300,000 ops/sec on a bigger box, another thing to be able to manage it on something like a 1 GB instance on Rackspace Cloudservers.
Two Biggest Issues
First, Mongo's a master-slave system. It appears really robust, but whenever a box takes on a special role I start to get nervous. One of the promises of "NoSQL" is that it provides a tremendous amount of resilience. Any time you start to add special nodes you're taking away from that.
For example, if you're running 5 homogeneous servers and one goes down, the other 4 can pick up the slack, assuming you're not running all 5 at peak capacity. This makes failure planning easy: figure out how many servers you need to handle your load, then add enough spares to be comfortable when they start failing. Need 3 servers? Provision 5 and you can have two failures before you peg your machines.
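The arithmetic behind that rule of thumb is simple enough to sketch. This assumes fully interchangeable servers with load spread evenly across whichever ones are still up, which is exactly the homogeneous case described above; the function names are made up for this example.

```javascript
// How many servers to buy: enough for peak load plus the spares you want.
function serversToProvision(neededAtPeak, failuresToSurvive) {
  return neededAtPeak + failuresToSurvive;
}

// How many servers can die before the survivors are over capacity.
function tolerableFailures(provisioned, neededAtPeak) {
  return Math.max(0, provisioned - neededAtPeak);
}

// Fraction of capacity each surviving server runs at during peak load.
// 1 means pegged; anything above 1 means the cluster can't keep up.
function utilization(neededAtPeak, provisioned, failed) {
  const alive = provisioned - failed;
  if (alive <= 0) return Infinity;
  return neededAtPeak / alive;
}

const needed = 3;
const provisioned = serversToProvision(needed, 2);

for (let failed = 0; failed <= 3; failed++) {
  const percent = Math.round(utilization(needed, provisioned, failed) * 100);
  console.log(failed + " failed: each survivor at " + percent + "% of capacity");
}
```

With 3 needed and 5 provisioned, each box idles along at 60% on a good day, hits 100% after the second failure, and is over capacity after the third. Once one node has a special role, this simple counting no longer tells the whole story, because losing that particular node isn't the same as losing any other.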
This isn't to say Mongo can't handle failures. Its current model is to
rebalance the load when one of the servers goes out. mongos is the tool to
read up on for handling this. Unfortunately, I haven't been able to dive into
it yet. The only way to know for sure is to build up a cluster and then start
killing servers. Of course, this type of testing is worthwhile for any data
storage system.
Second, the license. I'm not anti-AGPL, but there's some ambiguity. The Mongo
team has addressed this both on the
wiki and through an in-depth
blog post. According to
that, I can write up a service such as MongoHQ and as long as I don't
actually change the mongod
or mongos
code I'm fine.
On the other hand, most of the readings of the AGPL I've come across suggest that code that talks to AGPL software is itself subject to the AGPL. I don't have any doubts about 10gen, but if they don't always own the copyright
Of course, those last two paragraphs come with the caveat that I am not a lawyer.
I think Mongo is an amazingly compelling piece of software in the non-standard database realm. With the upcoming sharding and what I would have to imagine is an imminent fix to the geospatial queries, Mongo's definitely worth a look.