MongoDB Cookbook: Chapter No. 1 "Installing and Starting the MongoDB Server"
Amol Nayak
I would like to thank everyone at Packt Publishing who has been involved
with this book. It started when Luke Presland from Packt Publishing
approached me to author a book on MongoDB. I was hesitant to take up the
opportunity due to other commitments and tight deadlines, and if it weren't for
my mom, friends, and office colleagues, who convinced me to take it up,
I would not have written this book. There were a lot of chapters and content to
cover, and I had a tough time keeping up with the
timelines. Special thanks to Priyanka Shah, Rebecca Pedley, Mary Alex,
and Joel Goveya, with whom I interacted the most; they were very accommodating
of my changes to the delivery timelines. A big thanks to Doug Duncan and the other
reviewers of the book for reviewing it closely and helping to improve
the quality of the content drastically. Finally, I would like to thank the other
staff at Packt Publishing who were involved in the book's publishing process
but with whom I did not interact.
MongoDB Cookbook
MongoDB is a document-oriented, leading NoSQL database, which offers linear
scalability, thus making it a good contender for high-volume, high-performance systems
across all business domains. It has an edge over the majority of NoSQL solutions for its
ease of use, high performance, and rich features.
This book provides detailed recipes that describe how to use the different features of
MongoDB. The recipes cover topics ranging from setting up MongoDB, knowing its
programming-language API, monitoring and administration, to some advanced topics
such as cloud deployment, integration with Hadoop, and some open source and
proprietary tools for MongoDB. The recipe format presents the information in a concise,
actionable form; this lets you refer to the recipe to address and know the details of just
the use case in hand, without going through the entire book.
Chapter 7, Cloud Deployment on MongoDB, covers recipes that use MongoDB service
providers for cloud deployment, and we will set up our own MongoDB server on the
AWS cloud.
Chapter 8, Integration with Hadoop, covers recipes to integrate MongoDB with Hadoop
to use the Hadoop MapReduce API to run MapReduce jobs on data residing in
MongoDB/MongoDB data files and write the results back to them. We will also see how
to use AWS EMR to run our MapReduce jobs on the cloud using Amazon's managed
Hadoop cluster, EMR with the mongo-hadoop connector.
Chapter 9, Open Source and Proprietary Tools, is about using frameworks and products
built around MongoDB to improve a developer's productivity or about making some of
the day-to-day jobs in using Mongo easy. Unless explicitly mentioned, the
products/frameworks we will be looking at in this chapter are open source.
Appendix, Concepts for Reference, gives you a bit of additional information on write
concern and read preference for reference.
Single node installation of MongoDB with options from the config file
Connecting to a single node from the Mongo shell with a preloaded JavaScript
Connecting to the replica set from the shell to query and insert data
Connecting to the replica set to query and insert data from a Java client
Introduction
In this chapter, we will look at starting up the MongoDB server. Though it is a cakewalk to
start the server for development purposes and with the default settings, there are numerous
options that let us tune the startup behavior. We will start the server as a single node; then,
we'll introduce various configurations before we conclude by starting up a simple replica set
and a sharded setup. So, let's get started by installing and setting up the MongoDB server in
the easiest way possible, for simple development purposes.
Getting ready
This recipe assumes that we have downloaded the MongoDB binaries from the download site,
extracted them, and added the bin directory of MongoDB to the operating system's path
variable (this is not mandatory, but it is convenient). The binaries can be
downloaded from http://www.mongodb.org/downloads after selecting your host
operating system.
How to do it
Perform the following steps to start with the single node installation of MongoDB:
1. Create the /data/mongo/db directory (or any of your choice). This will be our
database directory, and it needs to have permission to let the mongod process
(the mongo server process) write to it.
2. We will start the server from the console with the /data/mongo/db data directory
as follows:
$ mongod --dbpath /data/mongo/db
There's more...
If you see the following message on the console, you have successfully started the server:
[initandlisten] waiting for connections on port 27017
Starting a server can't get easier than this. Despite the simplicity in starting the server, there
are a lot of configuration options that can be used to tune the behavior of the server on
startup. Most of the default options are sensible and need not be changed. With the default
values, the server should be listening on port 27017 for new connections, and the logs will be
printed to the standard output.
See also
• The Starting a single node instance using command-line options recipe for more
startup options
Since the server is started for development purposes, we don't want to preallocate full size
database files (we will soon see what this means).
Getting ready
If you have already seen and executed the steps mentioned in the Single node installation of
MongoDB recipe, you need not do anything different. If all the prerequisites are met, we are
good for this recipe too.
How to do it
You can start a single node instance using command-line options with the following steps:
1. The /data/mongo/db directory for the database and /logs/ for the logs should be
created and present on your filesystem with appropriate write permissions.
2. Execute the following command:
> mongod --port 27000 --dbpath /data/mongo/db --logpath /logs/mongo.log --smallfiles
How it works
OK, this wasn't too difficult and is similar to the previous recipe, but we have some additional
command-line options this time around. MongoDB actually supports quite a few options at
startup, and we will see a list of the ones that are most common and important in my opinion:
Option
Description
--help or -h
--config or -f
This specifies the location of the configuration file that contains all the
configuration options. We will learn more about this option in the Single
node installation of MongoDB with options from the config file recipe. It is
just a convenient way of specifying the configurations in a file rather than
on the command prompt, especially when a large number of options is
specified. Using a separate configuration file shared across different
mongod instances will also ensure that all the instances are running
with identical configurations.
--verbose or -v
This makes the logs more verbose. We can put more v's to make the output
even more verbose, for example, -vvvvv.
--quiet
This is the quieter output. This is the opposite of verbose or the -v option.
It will keep the logs less chatty and clean.
--port
This option is used if you are looking to start the server that listens to
a port other than the default 27017. We will frequently use this option
whenever we are looking to start multiple Mongo servers on the same
machine; for example, --port 27018 will start the server that listens to
port 27018 for new connections.
--logpath
This provides a path to a logfile where the logs will be written. The value
defaults to STDOUT. For example, --logpath /logs/server.out
will use /logs/server.out as the logfile for the server. Remember
that the value provided should be a file and not a directory where the
logs will be written.
--logappend
This option will append to the existing logfile if any. The default behavior
is to rename the existing logfile and then create a new file for the logs
of the currently started Mongo instance. Let's assume that we used the
name of the logfile as server.out and on startup the file exists. Then,
by default, this file will be renamed as server.out.<timestamp>,
where <timestamp> is the current time. The time is in GMT rather than
local time. Suppose the current date is October 28, 2013 and the time
is 12:02:15; the file generated will then have 2013-10-28T12-02-15
as the timestamp.
--dbpath
--smallfiles
--replSet
This option is used to start the server as a member of the replica set.
The value of this argument is the name of the replica set, for example,
--replSet repl1. More information on this option is covered in the
Starting multiple instances as part of a replica set recipe, where we will
start a simple Mongo replica set.
--configsvr
This option is used to start the server as a config server. The role of the
config server will be made clearer when we set up a simple sharded
environment in the Starting a simple sharded environment of two shards
recipe in this chapter. By default, a config server listens on port
27019 and uses the /data/configdb data directory. These can,
of course, be overridden using the --port and --dbpath options.
--shardsvr
This informs the started mongod process that this server is being started
as a shard server. By giving this option, the server also listens to port
27018 instead of the default 27017. We will learn more about this option
when we start a simple sharded server.
--oplogSize
There's more
For an exhaustive list of the options available, use the --help or -h option. The preceding list
of options is not exhaustive, and we will see some more coming up in the upcoming recipes as
and when we need them. In the next recipe, we will see how to use a config file instead of the
command-line arguments.
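The default rename behavior described for the --logappend option in the preceding table can be illustrated with a short JavaScript sketch. The rotatedLogName function is a hypothetical helper, not part of MongoDB; it only mimics the naming pattern described above (the logfile name followed by a GMT timestamp) and assumes a JavaScript runtime such as Node.js that provides console.log:

```javascript
// Hypothetical helper (not a MongoDB API): builds the name an existing
// logfile would be renamed to, that is, the logfile name plus a GMT timestamp.
function rotatedLogName(logfile, when) {
  // toISOString() always reports UTC (GMT), for example 2013-10-28T12:02:15.000Z
  var stamp = when.toISOString().slice(0, 19).replace(/:/g, '-');
  return logfile + '.' + stamp;
}

// October 28, 2013, 12:02:15 GMT (months are zero-based in Date.UTC)
var example = rotatedLogName('server.out', new Date(Date.UTC(2013, 9, 28, 12, 2, 15)));
console.log(example); // server.out.2013-10-28T12-02-15
```

Passing the date explicitly keeps the helper deterministic, which makes it easy to check against the example from the table.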
See also
• The Single node installation of MongoDB with options from the config file recipe to
use config files to provide startup options
• To start a replica set, refer to the Starting multiple instances as part of a replica
set recipe
Getting ready
If you have already seen and executed the steps mentioned in the Single node installation of
MongoDB recipe, you need not do anything different, and all the prerequisites of this recipe
are the same.
How to do it
The /data/mongo/db directory for the database and /logs/ for the logs should be created
and present on your filesystem, with the appropriate write permissions. Let's take a look at the
steps in detail:
1. Create a config file that can have any arbitrary name. In our case, let's say we create
the file at /conf/mongo.conf. We will then edit the file and add the following lines of
code to it:
port = 27000
dbpath = /data/mongo/db
logpath = /logs/mongo.log
smallfiles = true
2. Start the Mongo server using the following command:
$ mongod --config /conf/mongo.conf
How it works
All the command-line options we discussed in the previous recipe, Starting a single node
instance using command-line options, hold true. We are just providing these options in a
configuration file instead. If you have not visited the previous recipe, I recommend that you
do so, as this is where we have discussed some of the common command-line options. The
properties are specified as <property name> = <value>. For all those properties that
don't have values, for example, the smallfiles option, the value given is a Boolean value,
true. If you need verbose output, add v=true (or multiple v's to make it
more verbose) to the config file. If you already know what the command-line option is, it is
pretty easy to guess the name of the property in the file. It is the same as the command-line
option, with just the leading hyphens removed.
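The mapping from command-line options to config file properties described above can be sketched in a few lines of JavaScript. The toConfigLines function is a hypothetical helper written for illustration only; it assumes that every option either takes exactly one value or none, and that values never start with a hyphen:

```javascript
// Hypothetical helper: converts mongod command-line arguments into the
// "property = value" lines used by the config file.
function toConfigLines(args) {
  var lines = [];
  for (var i = 0; i < args.length; i++) {
    var name = args[i].replace(/^-+/, ''); // drop the leading hyphens
    var next = args[i + 1];
    if (next !== undefined && next.charAt(0) !== '-') {
      lines.push(name + ' = ' + next); // option that takes a value
      i++; // skip the value we just consumed
    } else {
      lines.push(name + ' = true'); // valueless option becomes a Boolean
    }
  }
  return lines;
}

// The options used in the previous recipe
var configLines = toConfigLines(['--port', '27000', '--dbpath', '/data/mongo/db',
  '--logpath', '/logs/mongo.log', '--smallfiles']);
console.log(configLines.join('\n'));
```

For the options of the previous recipe, this produces the same four lines that we added to /conf/mongo.conf.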
Getting ready
It is not necessary for the MongoDB server to run to start a shell. We will rarely start a shell
without connecting it to a running MongoDB server. To start a server on the localhost without
much of a hassle, take a look at the first recipe, Single node installation of MongoDB, and
start the server.
How to do it
Let's take a look at the steps in detail:
1. First, we will start by creating a simple JavaScript file; let's call it hello.js. Type in
the following lines in the hello.js file:
function sayHello(name) {
2. Save this file at /mongo/scripts (it can be saved at any other location too).
3. In the command prompt, execute the following command:
> mongo --shell /mongo/scripts/hello.js
5. Test the database that the shell is connected to by typing the following command:
> db
How it works
The JavaScript function we executed here is of no practical use, but it's just used to demonstrate
how a function can be preloaded upon the startup of the shell. There can be multiple functions
in the .js file that contain valid JavaScript code, possibly some complex business logic.
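As a minimal sketch of a file with more than one preloaded function, consider the following. The file name greetings.js and the functions greetingFor and greet are hypothetical and are not taken from the recipe's hello.js; the sketch assumes that the shell provides a print function and falls back to console.log in other JavaScript runtimes:

```javascript
// greetings.js: a hypothetical file that could be preloaded with
//   mongo --shell /mongo/scripts/greetings.js

// Pure helper: builds the greeting text.
function greetingFor(name) {
  return 'Hello ' + name;
}

// Prints the greeting. The mongo shell provides print(); other JavaScript
// runtimes such as Node.js provide console.log() instead.
function greet(name) {
  var text = greetingFor(name);
  if (typeof print === 'function') {
    print(text);
  } else {
    console.log(text);
  }
}
```

Once the shell has started, both functions should be available at the prompt, so greet('Fred') can be invoked directly.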
When we executed the mongo command without any arguments, we connected to the
MongoDB server that runs on the localhost and listens for new connections on the default
port 27017. The format of the command is as follows:
mongo <options> <db address> <.js files>
There's more...
Let's look at some example values of the db address command-line option and
its interpretation:
• mydb: This will connect to the server that runs on the localhost and listens for
connections on port 27017. The database connected will be mydb.
• mongo.server.host/mydb: This will connect to the server that runs on mongo.server.host
and the default port 27017. The database connected will be mydb.
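The interpretation of these db address values can be summarized with a simplified JavaScript sketch. The interpretDbAddress function is hypothetical and is not the shell's actual parser; it handles only the two forms listed above plus a host:port/database form, which is an assumption based on the db address being allowed to carry a hostname, port, and database:

```javascript
// Simplified sketch, not the shell's real parser: interprets a db address
// of the form [host[:port]/]database.
function interpretDbAddress(address) {
  // Defaults used when only a database name is given
  var result = { host: 'localhost', port: 27017, db: address };
  var slash = address.indexOf('/');
  if (slash !== -1) {
    var server = address.substring(0, slash);
    result.db = address.substring(slash + 1);
    var colon = server.indexOf(':');
    if (colon !== -1) {
      // host:port/database (assumed form, not listed in the examples above)
      result.host = server.substring(0, colon);
      result.port = parseInt(server.substring(colon + 1), 10);
    } else {
      result.host = server; // host/database, default port
    }
  }
  return result;
}

console.log(JSON.stringify(interpretDbAddress('mydb')));
console.log(JSON.stringify(interpretDbAddress('mongo.server.host/mydb')));
```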
Now, there are quite a few options available on the Mongo client too. We will see a few of
them in the following table:
Option
Description
--help or -h
--shell
When .js files are given as arguments, these scripts get executed, and the
Mongo client will exit. Providing this option ensures that the shell remains
running after the JavaScript files execute. All the functions and variables
defined in these .js files are available in the shell upon startup. As in the
preceding case, the sayHello function defined in the JavaScript file is
available in the shell for invocation.
--port
This specifies the port of the Mongo server where the client needs to connect.
--host
This specifies the hostname of the Mongo server where the client needs to
connect. If the db address is provided with the hostname, port, and database,
both the --host and --port options need not be specified.
--username or -u
This is relevant when security is enabled for Mongo. It is used to provide the
username of the user to be logged in.
--password or -p
This is relevant when security is enabled for Mongo. It is used to provide the
password of the user to be logged in.
Getting ready
The following are the prerequisites for this recipe:
• Use the latest available version of Maven. Version 3.1.1 was the latest at the time of
writing this book.
• Use the MongoDB Java driver. Version 2.11.3 was the latest at the time of writing
this book.
• Connectivity to the Internet is needed to access the online Maven repository.
Alternatively, you might choose an appropriate local repository accessible
to you from your computer.
• The Mongo server is up and running on the localhost and on port 27017. Take a look
at the first recipe, Single node installation of MongoDB, and start the server.
How to do it
Let's take a look at the steps in detail:
1. Install the latest version of the JDK if you don't already have it on your machine. We will
not be going through the steps to install the JDK in this recipe but, before moving on with
the next step, the JDK should be present. Type javac -version on the shell to check
for the version installed.
2. Once the JDK is set up, the next step is to set up Maven. Skip the next three steps if
Maven is already installed on your machine.
3. Maven needs to be downloaded from http://maven.apache.org/download.cgi.
Choose the binaries in the .tar.gz or .zip format and download them. This recipe is
executed on a machine that runs on the Windows platform; thus, these steps are for
installation on Windows. The following screenshot shows the download page of Maven:
4. Once the archive is downloaded, we need to extract it and put the absolute path
of the bin folder in the extracted archive in the operating system's path variable.
Maven also needs the path of the JDK to be set as the JAVA_HOME environment
variable. Remember to set the root of your JDK as the value of this variable.
5. All we need to do now is type mvn -version in the command prompt. If you see
the version of Maven on the command prompt, we have successfully set up Maven:
> mvn -version
6. At this stage, we have Maven installed, and we are now ready to create our simple
project to write our first Mongo client in Java. We will start by creating a project
folder. Let's assume that we create a folder called Mongo Java. Then, we will create
a folder structure src/main/java in this project folder. The root of the project
folder then contains a file called pom.xml. Once this folder creation is done, the
folder structure should look as follows:
Mongo Java
+--src
|  +--main
|     +--java
|--pom.xml
7. We just have the project skeleton with us now. We will now add some content to the
pom.xml file. Not much is needed for this. Add the following code snippet in the
pom.xml file and save it:
<project>
  <modelVersion>4.0.0</modelVersion>
  <name>Mongo Java</name>
  <groupId>com.packtpub</groupId>
  <artifactId>mongo-cookbook-java</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <dependencies>
    <dependency>
      <groupId>org.mongodb</groupId>
      <artifactId>mongo-java-driver</artifactId>
      <version>2.11.3</version>
    </dependency>
  </dependencies>
</project>
8. Finally, we will write our Java client that will be used to connect to the Mongo server
and execute some very basic operations. The following is the Java class located at
src/main/java in the com.packtpub.mongo.cookbook package, and the name
of the class is FirstMongoClient:
package com.packtpub.mongo.cookbook;
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
System.out.println("Closing client");
client.close();
}
}
9. It's now time to execute the preceding Java code. We will execute it using Maven from
the shell. You should be in the same directory as the pom.xml file of the project:
mvn compile exec:java -Dexec.mainClass=com.packtpub.mongo.cookbook.FirstMongoClient
How it works
Those were quite a lot of steps to follow! Let's look at some of them in more detail. Everything
up to step 6 is straightforward and doesn't need any explanation. Let's look at the other steps.
The pom.xml file we have here is pretty simple. We defined a dependency on Mongo's
Java driver. It relies on the online repository (http://search.maven.org) for resolving
the artifacts. For a local repository, all we need to do is define the repositories and
pluginRepositories tags in pom.xml. For more information on Maven, refer to the
Maven documentation at http://maven.apache.org/guides/index.html.
Now, for the Java class, the com.mongodb.MongoClient class is the backbone. We will first
instantiate it using one of its overloaded constructors that takes the server's host and port. In
this case, the hostname and port were not really needed as the values provided are the default
values anyway, and the no-argument constructor would have worked well too. The following line
of code instantiates this client:
MongoClient client = new MongoClient("localhost", 27017);
The next step is to get the database (in this case, test) using the getDB method. This is
returned as an object of type com.mongodb.DB. Note that this database might not exist, yet
getDB will not throw any exception. Instead, the database will get created whenever we add a
new document to the collection in this database. Similarly, getCollection on the DB object
will return an object of type com.mongodb.DBCollection, representing the collection in
the database. This too might not exist in the database and will get created automatically
upon the insertion of the first document.
The following lines of code from our class show how to get an instance of DB
and DBCollection:
DB testDB = client.getDB("test");
DBCollection collection = testDB.getCollection("person");
The findOne method on DBCollection is straightforward and returns one document from
the collection. This version of findOne doesn't accept DBObject (which, otherwise, acts
as a query executed before a document is selected and returned) as a parameter. This is
equivalent to executing db.person.findOne() from the Mongo shell.
Finally, we will simply invoke getDatabaseNames to get a list of database names in the
server. At this point, we should have at least the test and local databases
in the returned result. Once all the operations are completed, we will close the client. The
MongoClient class is thread-safe; generally, one instance is used per application. To execute
the program, we will use Maven's exec plugin. On executing step 9, we will see the following
output on the console:
[INFO] --- exec-maven-plugin:1.2.1:java (default-cli) @ mongo-cookbook-java ---
Dropping person collection in test database
Adding a person document in the person collection of test database
Now finding a person using findOne
Person found, name is Fred and age is 30
Database names are
1: local
2: test
Closing client
[INFO] -----------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] -----------------------------------------------------------------------
[INFO] Total time: 5.183s
[INFO] Finished at: Wed Oct 30 00:42:29 IST 2013
[INFO] Final Memory: 7M/19M
[INFO] -----------------------------------------------------------------------
Getting ready
Though not a prerequisite, taking a look at the Starting a single node instance using
command-line options recipe will definitely make things easier, just in case you are not
aware of the various command-line options and their significance while starting a Mongo
server. Also, the necessary binaries and setup mentioned in the Single node installation
of MongoDB recipe must be in place before we continue with this recipe. Let's sum up
what we need to do.
[Figure: a replica set of three nodes. Node n1 (port 27000, data in /data/n1, logs in /logs/n1.log) is the primary. Nodes n2 (port 27001, data in /data/n2, logs in /logs/n2.log) and n3 (port 27002, data in /data/n3, logs in /logs/n3.log) are secondaries, shown with read-only clients.]
n1 is the primary node on port 27000
n2 and n3 are secondary nodes on ports 27001 and 27002
How to do it
Let's take a look at the steps in detail:
1. Create the /data/n1, /data/n2, and /data/n3 directories for the data and the /logs
directory for the logs of the three nodes. On the Windows platform, you can choose the c:\data\n1,
c:\data\n2, and c:\data\n3 directories for data and c:\logs\ for logs (or any other directories of your
choice). Ensure that these directories have appropriate
write permissions for the Mongo server to write the data and logs.
2. Start the three servers as follows (note that users on the Windows platform need to
skip the --fork option, as it is not supported):
$ mongod --replSet repSetTest --dbpath /data/n1 --logpath /logs/n1.log --port 27000 --smallfiles --oplogSize 128 --fork
$ mongod --replSet repSetTest --dbpath /data/n2 --logpath /logs/n2.log --port 27001 --smallfiles --oplogSize 128 --fork
$ mongod --replSet repSetTest --dbpath /data/n3 --logpath /logs/n3.log --port 27002 --smallfiles --oplogSize 128 --fork
3. Start the Mongo shell and connect to any of the Mongo servers that are running. In
this case, we will connect to the first one (the one listening to port 27000). Execute
the following command:
$ mongo localhost:27000
4. Try to execute an insert operation from the Mongo shell after connecting to it
as follows:
> db.person.insert({name:'Fred', age:35})
This operation should fail as the replica set is not initialized yet. More information can
be found in the How it works section of this recipe.
5. The next step is to start configuring the replica set. We will start by preparing a JSON
configuration in the shell:
cfg = {
'_id':'repSetTest',
'members':[
{'_id':0, 'host': 'localhost:27000'},
{'_id':1, 'host': 'localhost:27001'},
{'_id':2, 'host': 'localhost:27002'}
]
}
6. The last step is to initiate the replica set with the preceding configuration as follows:
> rs.initiate(cfg)
Execute rs.status() after a few seconds on the shell to see the status. In a few seconds,
one of them should become primary, and the remaining two should become secondary.
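The configuration document used in step 5 follows a regular pattern, so it can also be generated rather than typed by hand. The following is a minimal sketch in plain JavaScript; buildReplSetConfig is a hypothetical helper, not a shell function, and the console.log call assumes a runtime such as Node.js:

```javascript
// Hypothetical helper: builds the configuration document for rs.initiate()
// from a replica set name and a list of host:port strings.
function buildReplSetConfig(setName, hosts) {
  var members = [];
  for (var i = 0; i < hosts.length; i++) {
    members.push({ _id: i, host: hosts[i] }); // _id is the member's position in the list
  }
  return { _id: setName, members: members };
}

// The same three members that we configured in step 5
var cfg = buildReplSetConfig('repSetTest',
  ['localhost:27000', 'localhost:27001', 'localhost:27002']);
console.log(JSON.stringify(cfg, null, 2));
```

If the function is defined in the Mongo shell instead, the resulting object can be passed to rs.initiate in the same way as the handwritten cfg.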
How it works
We described the common command-line options, including those used here, in detail in the
Starting a single node instance using command-line options recipe.
As we are starting three independent mongod services, we have three dedicated database
paths on the filesystem. Similarly, we have three separate logfile locations for each of the
processes. We then started three mongod processes with the database and logfile path
specified. As this setup is for test purposes and started on the same machine, we used the
--smallfiles and --oplogSize options. Avoid using these options in the production
environment. As these are running on the same host, we also chose the ports explicitly to
avoid port conflicts. The ports we chose here are 27000, 27001, and 27002. When we start
the servers on different hosts, we might or might not choose a separate port. We can very well
choose to use the default one whenever possible.
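Since the three commands differ only in the node name and port, they can be assembled programmatically. The following sketch uses a hypothetical mongodCommand helper; it only builds the command strings shown in step 2 and does not start any process:

```javascript
// Hypothetical helper: assembles the mongod command line for one test node.
// The paths follow the /data/<node> and /logs/<node>.log layout used in this recipe.
function mongodCommand(setName, nodeName, port) {
  return ['mongod',
    '--replSet', setName,
    '--dbpath', '/data/' + nodeName,
    '--logpath', '/logs/' + nodeName + '.log',
    '--port', String(port),
    '--smallfiles', '--oplogSize', '128', '--fork'].join(' ');
}

// n1, n2, and n3 on ports 27000, 27001, and 27002
for (var i = 0; i < 3; i++) {
  console.log(mongodCommand('repSetTest', 'n' + (i + 1), 27000 + i));
}
```

On the Windows platform, the --fork entry would have to be dropped, as noted in step 2.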
The preceding command allows us to have a batch file (a .bat file) that contains all the logic
to create the relevant directories and then spawn three mongod processes in three shells.
Let's get back to the replica set creation; we are not yet done with setting up a replica set. If
we take a look at the logs generated in the log directory, we will see the following lines in it:
[rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
[rsStart] replSet info you may need to run replSetInitiate -- rs.initiate() in the shell -- if that is not already done
Though we started three mongod processes with the --replSet option, we still haven't
configured them to work with each other as a replica set. This command-line option is just used
to tell the server on startup that this process will be running as part of a replica set. The name
of the replica set is the same as the value of this option passed on the command prompt. This
also explains why the insert operation executed on one of the nodes failed before the replica
set was initialized. In mongo replica sets, only one node is the primary node where all the inserts
and querying happen. In the preceding diagram, node n1 is shown as the primary node and
listens to port 27000 for client connections. All the other nodes are slave/secondary instances
that sync themselves up with the primary node; hence, querying too is disabled on them by
default. It is only when the primary node goes down that one of the secondaries takes over and
becomes a primary node. It is, however, possible to query the secondary instances for data,
as we showed in the preceding diagram. We will see how to query from a secondary instance
in the next recipe.
Well, all that is left now is to configure the replica set by grouping the three processes we
started. This is done by first defining a JSON object as follows:
cfg = {
'_id':'repSetTest',
'members':[
{'_id':0, 'host': 'localhost:27000'},
{'_id':1, 'host': 'localhost:27001'},
{'_id':2, 'host': 'localhost:27002'}
]
}
There are two fields, _id and members, for the unique ID of the replica set and an array of
the hostnames and port numbers of the mongod server processes as part of this replica set,
respectively. Using the localhost to refer to the host is not a very good idea and is usually
discouraged. However, in this case, we started all the processes on the same machine; thus, we
are OK with it. It is, however, preferred to refer to the hosts by their hostnames even if they are
running on the localhost. Note that you cannot refer to the instances using both the localhost
and hostnames in the same config; use one or the other. To
configure the replica set, we then connect to any one of three running mongod processes; in this
case, we will connect to the first one and then execute the following command from the shell:
> rs.initiate(cfg)
The _id in the cfg object passed has the same value as the value we gave to the --replSet
option in the command prompt when we started the server processes. Not giving the same
value will throw the following error:
{
"ok" : 0,
"errmsg" : "couldn't initiate : set name does not match the set name host Amol-PC:27000 expects"
}
If all goes well and the initiate call is successful, you will see something like the following JSON
response on the shell:
{
"info" : "Config now saved locally.  Should come online in about a minute.",
"ok" : 1
}
In a few seconds, you should see a different prompt for the shell from which we executed this
command. It should now show a primary or secondary node. The following prompt is an
example of the shell connected to a primary member of the replica set:
repSetTest:PRIMARY>
Executing rs.status() should give us some stats on the replica set status. The stateStr
field here is important, and it contains the text PRIMARY, SECONDARY, and so on.
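To illustrate how the stateStr field can be used, the following sketch picks the primary member out of a document shaped like the output of rs.status(). The findPrimary function is a hypothetical helper, and sampleStatus is a trimmed-down, made-up sample rather than real server output; an actual status document contains many more fields:

```javascript
// Hypothetical helper: returns the name (host:port) of the PRIMARY member
// from a document shaped like the output of rs.status(), or null if none.
function findPrimary(status) {
  var members = status.members || [];
  for (var i = 0; i < members.length; i++) {
    if (members[i].stateStr === 'PRIMARY') {
      return members[i].name;
    }
  }
  return null; // no primary elected (yet)
}

// Trimmed-down, made-up sample for illustration only
var sampleStatus = {
  set: 'repSetTest',
  members: [
    { _id: 0, name: 'localhost:27000', stateStr: 'PRIMARY' },
    { _id: 1, name: 'localhost:27001', stateStr: 'SECONDARY' },
    { _id: 2, name: 'localhost:27002', stateStr: 'SECONDARY' }
  ]
};
console.log(findPrimary(sampleStatus)); // localhost:27000
```

Returning null when no member reports PRIMARY covers the first few seconds after initiation, when the nodes have not yet settled on a primary.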
There's more
If you are looking to convert a standalone instance to a replica set, the instance with data
needs to become a primary instance first, and then empty secondary instances will be added,
to which the data will be synchronized. For more information on how to perform this operation,
visit http://docs.mongodb.org/manual/tutorial/convert-standalone-to-replica-set/.
See also
• The Connecting to the replica set from the shell to query and insert data recipe to
perform more operations from the shell after connecting to a replica set
Getting ready
The prerequisite for this recipe is that the replica set should be set up, and it should be
up and running. For details on how to start the replica set, refer to the Starting multiple
instances as part of a replica set recipe.
How to do it
Let's take a look at the steps in detail:
1. Create the /data/n1, /data/n2, /data/n3, and /logs directories for data and
logs of the three nodes, respectively.
2. We will start two shells here: one for primary and one for secondary. Execute the
following command in the command prompt:
mongo localhost:27000
3. The prompt of the shell tells whether the server to which we connected is primary or
secondary. It should show the replica set's name followed by : and then followed by
the server's state. In this case, if the replica set is initialized and is up and running,
we will see either repSetTest:PRIMARY> or repSetTest:SECONDARY>.
4. Suppose the first server we connected to is a secondary server, then we need to find
the primary server as follows:
1. Execute the rs.status() command in the shell and look out for the
stateStr field. This should give us the primary server. Use the Mongo
shell to connect to this server. At this point, we should have two shells
running: one connected to a primary node and the other connected
to a secondary node.
5. In the shell connected to the primary node, execute the following insert command:
repSetTest:PRIMARY> db.replTest.insert({_id:1, value:'abc'})
There is nothing special about it. We have just inserted a small document in a
collection that we use for the replication test.
6. By executing the following query on the primary node, we should get one result:
repSetTest:PRIMARY> db.replTest.findOne()
{ "_id" : 1, "value" : "abc" }
7. So far so good. Now, we will go to the shell that is connected to the secondary node
and execute the following command:
repSetTest:SECONDARY> db.replTest.findOne()
rs.slaveOk(true)
10. Execute the query we executed in step 7 again on the shell. This will now get the
following results:
repSetTest:SECONDARY> db.replTest.findOne()
{ "_id" : 1, "value" : "abc" }
11. Execute the following insert command on the secondary node; it should fail with
the following message:
repSetTest:SECONDARY> db.replTest.insert({_id:1, value:'abc'})
not master
How it works
We have done a lot of things in this recipe, and we will try to throw some light on some of the
important concepts to remember.
We connected to a primary and a secondary node from the shell and performed
(or rather, tried to perform) query and insert operations. A Mongo replica set is made up of
exactly one primary node and multiple secondary nodes. All writes happen on the primary
node only. Note that replication is not primarily a mechanism for distributing read requests
in order to scale the system; its main intent is to ensure high availability of data. By default,
we are not permitted to read data from the secondary nodes. In steps 5 and 6, we simply
inserted a document on the primary node and then executed a query to get it back. This is
straightforward, and there is nothing related to clustering here.
In the next step, we executed the same query but, this time, from the secondary node's shell.
By default, querying is not enabled on the secondary node. There might be a small lag in
replicating the data, possibly due to heavy data volumes to be replicated, network latency, and
hardware capacity to name a few of the causes; thus, querying on the secondary node might
not reflect the latest inserts or updates made on the primary node. If, however, we are OK
with it and can live with the slight lag in the data being replicated, all we need to do is enable
querying on the secondary node explicitly by just executing one command, rs.slaveOk() or
rs.slaveOk(true). Once this is done, we are free to execute queries
on the secondary nodes too.
Finally, we tried to insert data in a collection on the secondary node. Under no circumstances
is this permitted, regardless of whether we have executed rs.slaveOk(). When rs.slaveOk() is
invoked, it just permits the data to be queried from the secondary node. All the write operations
still have to go to the primary node and then flow down to the secondary node. The internals of
replication are covered in the Understanding and analyzing oplogs recipe
in Chapter 4, Administration.
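The read and write rules described above can be summarized in a small toy model. This is only a sketch in plain Java, not MongoDB or driver code: the class ReplicaNodeSketch, its methods, and the replicateFrom stand-in for replication are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the replica set read/write rules; not MongoDB code. */
final class ReplicaNodeSketch {
    enum State { PRIMARY, SECONDARY }

    private final State state;
    private final Map<Integer, String> data = new HashMap<>();
    // Assumption: models the effect of rs.slaveOk() on the current connection.
    private boolean slaveOk = false;

    ReplicaNodeSketch(State state) {
        this.state = state;
    }

    void setSlaveOk(boolean value) {
        slaveOk = value;
    }

    /** Writes are accepted on the primary only, whatever slaveOk is set to. */
    void insert(int id, String value) {
        if (state != State.PRIMARY) {
            throw new IllegalStateException("not master");
        }
        data.put(id, value);
    }

    /** Stand-in for replication: copies the primary's documents to this node. */
    void replicateFrom(ReplicaNodeSketch primary) {
        data.putAll(primary.data);
    }

    /** Reads work on the primary, and on a secondary once slaveOk is set. */
    String findOne(int id) {
        if (state == State.SECONDARY && !slaveOk) {
            throw new IllegalStateException("reads from a secondary are not enabled");
        }
        return data.get(id);
    }

    public static void main(String[] args) {
        ReplicaNodeSketch primary = new ReplicaNodeSketch(State.PRIMARY);
        ReplicaNodeSketch secondary = new ReplicaNodeSketch(State.SECONDARY);
        primary.insert(1, "abc");
        secondary.replicateFrom(primary);
        secondary.setSlaveOk(true);
        System.out.println(secondary.findOne(1));
    }
}
```

In this model, setting slaveOk only unlocks findOne on the secondary; insert on a secondary keeps failing, which mirrors what we saw in the shell.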
See also
The Connecting to the replica set to query and insert data from a Java client recipe
gives details on how to connect to a replica set from a Java client
Getting ready
We first need to take a look at the Connecting to a single node from a Java client recipe, as
it contains all the prerequisites and steps to set up Maven and other dependencies. As we
are dealing with a Java client for replica sets, a replica set must be up and running. Refer to
the Starting multiple instances as part of a replica set recipe for details on how to start the
replica set.
How to do it
Let's take a look at the steps in detail:
1. First, we need to write/copy the following piece of code (this Java class is also
available for download from the book's site):
package com.packtpub.mongo.cookbook;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import com.mongodb.ServerAddress;
import java.util.Arrays;

/**
 *
 */
public class ReplicaSetMongoClient {

  /**
   * Main method for the test client connecting to the replica set.
   * @param args
2. Connect to any of the nodes in the replica set, say to localhost:27000, and, from
the shell, execute rs.status(). Take a note of the primary instance in the replica
set and connect to it from the shell if localhost:27000 is not a primary node. Now,
switch to the admin database as follows:
repSetTest:PRIMARY> use admin
3. Now, execute the preceding program from the operating system shell as follows:
$ mvn compile exec:java -Dexec.mainClass=com.packtpub.mongo.cookbook.ReplicaSetMongoClient
4. Shut down the primary instance by executing the following command on the Mongo
shell connected to the primary node:
repSetTest:PRIMARY> db.shutdownServer()
How it works
An interesting thing to observe is how we instantiate a MongoClient instance. It is done
as follows:
MongoClient client = new MongoClient(Arrays.asList(
new ServerAddress("localhost", 27000),
new ServerAddress("localhost", 27001),
new ServerAddress("localhost", 27002)));
As we can see, the query in the loop was interrupted when the primary node went down. The
client, however, switched to the new primary node seamlessly, well, nearly seamlessly, as the
client might have to catch an exception and retry the operation after a predetermined interval
has elapsed.
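The catch-and-retry behavior mentioned above can be sketched as a small helper. This is a minimal illustration in plain Java, not the driver's API: RetrySketch and withRetry are hypothetical names, and the helper catches RuntimeException as a stand-in, whereas a real client would catch the exception type raised by its driver.

```java
import java.util.function.Supplier;

/** Sketch of "catch the exception, wait, retry"; names are hypothetical. */
final class RetrySketch {

    /** Runs the operation, retrying after waitMillis whenever it throws. */
    static <T> T withRetry(Supplier<T> operation, int maxAttempts, long waitMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                // Assumption: any runtime failure is treated as "primary unavailable".
                last = e;
                if (attempt < maxAttempts) {
                    sleep(waitMillis);
                }
            }
        }
        throw last;
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting to retry", e);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated query that fails twice, as if a new primary were being elected.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("no primary available yet");
            }
            return "query succeeded on attempt " + calls[0];
        }, 5, 10);
        System.out.println(result);
    }
}
```

The wait interval and the number of attempts are choices for the application; they should be long enough to cover the time the replica set needs to elect a new primary.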
Now, consider the days where the system load becomes twice or three times an average day's
load (or even more), for example, say on Thanksgiving Day, Christmas, and so on. If the platform
is able to deliver similar levels of service on these high-load days compared with any other day,
the system is said to have scaled up well to the sudden increase in the number of requests.
Now, consider an archiving application that needs to store the details of all the requests that
hit a particular website over the past decade. For each request that hits the website, we will
create a new record in the underlying data store. Suppose each record is 250 bytes and the
average load is 3 million requests per day; we will then cross the 1 TB data mark in about
4 years. This data will be used for various analytic purposes and might be frequently queried.
The query performance should not be drastically affected when the data size increases. If the
system is able to cope with this increasing data volume and still gives a decent performance
comparable to that on low data volumes, the system is said to have scaled up well against the
increasing data volumes.
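As a quick back-of-the-envelope check of the archiving example, the following sketch computes how long it takes to accumulate 1 TB at 250 bytes per record and 3 million records per day. The class and method names are hypothetical, and the calculation ignores indexes and storage overhead. It comes out at roughly 3.7 years for a decimal terabyte and roughly 4 years for a binary one.

```java
/** Back-of-the-envelope growth estimate; ignores indexes and storage overhead. */
final class ArchiveGrowthSketch {

    /** Number of days needed to accumulate targetBytes of raw record data. */
    static double daysToReach(long targetBytes, long bytesPerRecord, long recordsPerDay) {
        return (double) targetBytes / (bytesPerRecord * recordsPerDay);
    }

    public static void main(String[] args) {
        long bytesPerRecord = 250L;
        long recordsPerDay = 3_000_000L;

        double daysDecimalTb = daysToReach(1_000_000_000_000L, bytesPerRecord, recordsPerDay);
        double daysBinaryTb = daysToReach(1L << 40, bytesPerRecord, recordsPerDay);

        System.out.printf("1 TB (10^12 bytes): %.0f days, %.2f years%n",
                daysDecimalTb, daysDecimalTb / 365);
        System.out.printf("1 TiB (2^40 bytes): %.0f days, %.2f years%n",
                daysBinaryTb, daysBinaryTb / 365);
    }
}
```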
Now that we have seen in brief what scalability is, let me tell you that sharding is a mechanism
that lets a system scale to increasing demands. The crux lies in the fact that the entire data
is partitioned into smaller segments and distributed across various nodes called shards. Let's
assume that we have a total of 10 million documents in a Mongo collection. If we shard this
collection across 10 shards, we will ideally have 10,000,000/10 = 1,000,000 documents on
each shard. At a given point of time, one document will only reside on one shard (which, by itself,
will be a replica set in a production system). There is, however, some magic involved that keeps
this concept hidden from the developer querying the collection, who gets one unified view of the
collection irrespective of the number of shards. Based on the query, it is Mongo that decides
which shard to query for the data and return the entire result set. With this background, let's set
up a simple shard and take a closer look at it.
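The idea that every document lives on exactly one shard, and that an ideal split gives each shard an equal share, can be illustrated with a toy partitioner. This is a simplified sketch and not how MongoDB assigns documents (MongoDB works with ranges of shard key values); the modulo rule, the class ShardSplitSketch, and its methods are assumptions made purely for illustration.

```java
/** Toy partitioner: assigns each document ID to exactly one shard. */
final class ShardSplitSketch {

    /** Assumption: a simple modulo rule stands in for the real placement logic. */
    static int shardFor(long documentId, int shardCount) {
        return (int) Math.floorMod(documentId, (long) shardCount);
    }

    /** Counts how many of the IDs 0..documentCount-1 land on each shard. */
    static long[] countPerShard(long documentCount, int shardCount) {
        long[] counts = new long[shardCount];
        for (long id = 0; id < documentCount; id++) {
            counts[shardFor(id, shardCount)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = countPerShard(10_000_000L, 10);
        for (int shard = 0; shard < counts.length; shard++) {
            System.out.println("shard " + shard + ": " + counts[shard] + " documents");
        }
    }
}
```

Because shardFor returns a single shard for each ID, no document can end up on two shards, and with sequential IDs the ten counts are equal.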
Getting ready
Apart from the MongoDB server already installed, there are no prerequisites from a software
perspective. We will create two data directories, one for each shard. There will be one
directory for data and one for logs.
How to do it
Let's take a look at the steps in detail:
1. We will start by creating directories for logs and data. Create the /data/s1/db,
/data/s2/db, and /logs directories. On Windows, we can have c:\data\s1\db,
and so on for the data and log directories. There is also a config server that is used
in a sharded environment to store some metadata. We will use /data/con1/db as
the data directory for the config server.
/logs/mongos.log
3. In the command prompt, execute the following command. This will show a
mongos prompt:
$ mongo
MongoDB shell version: 2.4.6
connecting to: test
mongos>
4. Finally, we set up the shard. From the mongos shell, execute the following
two commands:
mongos> sh.addShard("localhost:27000")
mongos> sh.addShard("localhost:27001")
5. On the addition of each shard, we will get an ok reply. Something like the following
JSON message will be seen giving the unique ID for each shard that is added:
{ "shardAdded" : "shard0000", "ok" : 1 }
How it works
Let's see what we did in the process. We created three directories for data (two for the shards
and one for the config database) and one directory for logs. We can have a shell script or a
batch file to create the directories as well. In fact, in large production deployments, setting up
shards manually is not only time-consuming but also error-prone.
Let's try to get a picture of what exactly we have done and what we are trying to achieve.
The following diagram shows the shard setup we just built:
[Diagram: client 1 through client n, the mongos process, the config server, shard 1, and shard 2]
If we look at the preceding diagram and the servers started in step 2, we will see that we
have shard servers that will store the actual data in the collections. These were the first two of
the four processes we started, listening on ports 27000 and 27001. Next, we started a config
server, which is seen on the left-hand side in the preceding diagram. It is the third server of the
four servers started in step 2, and it listens on port 25000 for incoming connections. The sole
purpose of this database is to maintain the metadata of the shard servers. Ideally, only the
mongos process or drivers connect to this server for the shard details/metadata and the shard
key information. We will see what a shard key is in the next recipe, where we will play around
with a sharded collection and see the shards we created in action.
Finally, we have a mongos process. This is a lightweight process that doesn't do any
persistence of data and just accepts connections from clients. This is the layer that acts as a
gatekeeper and abstracts the client from the concept of shards. For now, we can view it as a
router that consults the config server and takes the decision to route the client's query to the
appropriate shard server for execution. It then aggregates the result from various shards if
applicable and returns the result to the client. It is safe to say that no client directly connects
to the config or the shard servers; in fact, ideally, no one should connect to these processes
directly, except for some administration operations. Clients simply connect to the mongos
process and execute their queries, or insert or update operations.
[Diagram: several clients, several mongos processes, and shard 1, shard 2, through shard n]
In a production deployment, the number of shards will not be two but many more. Also, each
shard will be a replica set to ensure high availability. There will be three config servers to
ensure the availability of the config servers too. Similarly, there can be any number of mongos
processes listening for client connections to the sharded cluster. In some cases, a mongos
process might even be started on a client application's server.
There's more
What good is a shard unless we put it to action and see what happens from the shell on
inserting and querying the data? In the next recipe, we will make use of the shard setup,
add some data, and see it in action.
Getting ready
Obviously, we need a sharded mongo server setup that is up and running. See the previous
recipe for more details on how to set up a simple shard. The mongos process, as in the
previous recipe, should be listening to port number 27017. We have got some names in
a JavaScript file called names.js. This file needs to be downloaded from this book's site
and kept on the local filesystem. The file contains a variable called names, and the value is
an array with some JSON documents as the values, each one representing a person. The
contents look as follows:
names = [
{name:'James Smith', age:30},
{name:'Robert Johnson', age:22},
How to do it
Let's take a look at the steps in detail:
1. Start the Mongo shell and connect to the default port on the localhost as follows
(this will ensure that the names will be available in the current shell):
mongo --shell names.js
MongoDB shell version: 2.4.6
connecting to: test
mongos>
2. Switch to the database that will be used to test sharding as follows (we call it
shardDB):
mongos> use shardDB
6. Execute the following command to get a query plan and the number of documents
on each shard:
mongos> db.person.find().explain()
How it works
This recipe demands some explanation. We have downloaded a JavaScript file that defines an
array of 20 people. Each element of the array is a JSON object with a name and age attribute.
We started the shell that connects to the mongos process loaded with this JavaScript. We then
switched to shardDB, which we will use for the purpose of sharding.
For a collection to be sharded, the database in which it will be created needs to be enabled
for sharding first. We do this using sh.enableSharding().
The next step is to enable the collection to be sharded. By default, all the data will be kept on
one shard and will not be split across different shards. Think about how Mongo will be able
to meaningfully split the data. The whole intention is to split it meaningfully and as evenly
as possible so that whenever we query based on a shard key, Mongo will easily be able to
determine which shard(s) to query. If a query doesn't contain a shard key, the execution of the
query will happen on all the shards, and the data will then be collated by the mongos process
before returning it to the client. Thus, choosing the right shard key is crucial.
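The difference between a query that contains the shard key and one that does not can be sketched with a toy router. This is an illustration only and not how mongos is implemented: QueryRoutingSketch, route, and the hash-modulo placement rule are hypothetical, and shards are represented simply by their index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy router: targets one shard when the shard key is in the query, else all. */
final class QueryRoutingSketch {

    /** Returns the indexes of the shards that would have to execute the query. */
    static List<Integer> route(Map<String, ?> query, String shardKeyField, int shardCount) {
        Object keyValue = query.get(shardKeyField);
        if (keyValue != null) {
            // Assumption: hash modulo shard count stands in for the real lookup.
            return Collections.singletonList(Math.floorMod(keyValue.hashCode(), shardCount));
        }
        // No shard key in the query: every shard has to be asked.
        List<Integer> allShards = new ArrayList<>();
        for (int shard = 0; shard < shardCount; shard++) {
            allShards.add(shard);
        }
        return allShards;
    }

    public static void main(String[] args) {
        Map<String, Object> byName = new HashMap<>();
        byName.put("name", "James Smith");
        Map<String, Object> byAge = new HashMap<>();
        byAge.put("age", 30);

        System.out.println("query on name goes to shards " + route(byName, "name", 2));
        System.out.println("query on age goes to shards " + route(byAge, "name", 2));
    }
}
```

The query on name is sent to a single shard, whereas the query on age has to be sent to both shards and its results merged.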
Let's now see how to shard the collection. We will do this by invoking
sh.shardCollection("shardDB.person", {name: "hashed"}, false).
The first parameter specifies the fully qualified name of the collection in the
<db name>.<collection name> format.
The second parameter specifies the field name to shard upon in the collection.
This is the field that will be used to split the documents on the shards. One of the
requirements of a good shard key is that it should have high cardinality (the number
of possible values should be high). In our test data, the name value has a very low
cardinality and thus is not a good choice as a shard key. We therefore hash this key when
using it as a shard key, which spreads the few distinct values across the range of hash
values. We do so by specifying the key as {name: "hashed"}.
The last parameter specifies whether the value used as a shard key is unique or not.
The name field is definitely not unique; thus, it will be false. If the field were, say, the
person's social security number, it could have been set to true. Also, SSN is a good
choice for a shard key due to its high cardinality. Remember though, for the query to
be efficient, the shard key has to be present in it.
The last step is to see the execution plan to find all the data. The intent of this operation is to see
how the data is split across the two shards. With 300,000 documents, we expect something
around 150,000 documents on each shard. In the explain plan's output, the shards attribute
has an entry for each shard in the cluster, holding the query plan for that shard. In our case, we
have two shards; thus, we see two entries. In each of them, the value of n is
something to look at. It gives the number of documents that reside on that shard. The
following code snippet is the relevant JSON document we see from the console. The number of
documents on shards one and two is 164938 and 135062, respectively:
"shards" : {
  "localhost:27000" : [
    {
      "cursor" : "BasicCursor",
      "isMultiKey" : false,
      "n" : 164938,
      "nscannedObjects" : 164938,
      "nscanned" : 164938,
      "nscannedObjectsAllPlans" : 164938,
      "nscannedAllPlans" : 164938,
      "scanAndOrder" : false,
      "indexOnly" : false,
      "nYields" : 1,
      "nChunkSkips" : 0,
      "millis" : 974,
      "indexBounds" : {
      },
      "server" : "Amol-PC:27000"
    }
  ],
  "localhost:27001" : [
    {
      "cursor" : "BasicCursor",
      "isMultiKey" : false,
      "n" : 135062,
      "nscannedObjects" : 135062,
      "nscanned" : 135062,
      "nscannedObjectsAllPlans" : 135062,
      "nscannedAllPlans" : 135062,
      "scanAndOrder" : false,
There are a couple of additional things that I recommend you do.
Connect to the individual shard from the Mongo shell and execute queries on the person
collection. See that the counts in these collections are similar to what we see in the preceding
plan. Also, one can find out that no document exists on both the shards at the same time.
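Once the per-shard counts are known, a quick sanity check is to confirm that they add up to the number of documents loaded and to see how far the split is from an even one. The following sketch does this arithmetic for the two values of n reported by the plan, 164938 and 135062; the class and method names are hypothetical.

```java
/** Sanity check of per-shard document counts; names are hypothetical. */
final class ShardCountCheck {

    /** Sum of the document counts reported by the individual shards. */
    static long total(long... perShardCounts) {
        long sum = 0;
        for (long count : perShardCounts) {
            sum += count;
        }
        return sum;
    }

    /** Share of the collection held by one shard, as a percentage. */
    static double sharePercent(long shardCount, long totalCount) {
        return 100.0 * shardCount / totalCount;
    }

    public static void main(String[] args) {
        long shardOne = 164938L;
        long shardTwo = 135062L;
        long all = total(shardOne, shardTwo);

        System.out.println("total documents: " + all);
        System.out.printf("shard one: %.1f%%, shard two: %.1f%%%n",
                sharePercent(shardOne, all), sharePercent(shardTwo, all));
    }
}
```

The two counts add up to 300,000, so every document is accounted for exactly once, with roughly 55 percent on the first shard and 45 percent on the second.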
We discussed in brief how cardinality affects the way the data is split across shards. Let's do a
simple exercise. We will first drop the person collection and execute the shardCollection
operation again but, this time, with the {name: 1} shard key instead of {name: "hashed"}.
This ensures that the shard key is not hashed and stored as is. Now, load the data using the
JavaScript function we used earlier in step 5 and then execute explain on the collection once
the data is loaded. Observe how the data is now split (or not) across the shards.
There's more
A lot of questions might now come up, such as what are the best practices, what are
some tips and tricks, how is the sharding thing pulled off by MongoDB behind the scenes
in a way transparent to the end user, and so on.
This recipe only explained the basics. All these questions will be answered in
Chapter 4, Administration.