MongoDB Architecture
In this chapter, you will learn about the MongoDB architecture, especially core processes and tools,
standalone deployment, sharding concepts, replication concepts, and production deployment.
Core Processes
The core components in the MongoDB package are
• mongod, which is the core database process
• mongos, which is the controller and query router for sharded clusters
• mongo, which is the interactive MongoDB shell
These components are available as applications under the bin folder. Let’s discuss these components
in detail.
mongod
The primary daemon in a MongoDB system is known as mongod. This daemon handles all the data
requests, manages the data format, and performs operations for background management.
When mongod is run without any arguments, it uses the default data directory, C:\data\db or /data/db, and the default port 27017, where it listens for socket connections.
It’s important to ensure that the data directory exists and you have write permissions to the directory
before the mongod process is started.
If the directory doesn’t exist or you don’t have write permissions on the directory, the start of this
process will fail. If the default port 27017 is not available, the server will fail to start.
mongod also has an HTTP server that listens on a port 1000 higher than the default port, so if you start mongod on the default port 27017, the HTTP server will be on port 28017 and will be accessible using the URL http://localhost:28017. This basic HTTP server provides administrative information about the database.
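For example, a sketch of starting mongod with a non-default data directory and port (the directory and port here are illustrative) looks like this:
C:\>mkdir c:\db1\standalone\data
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --dbpath c:\db1\standalone\data --port 27020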
mongo
mongo provides an interactive JavaScript interface for the developer to test queries and operations directly
on the database and for the system administrators to manage the database. This is all done via the command
line. When the mongo shell is started, it will connect to the default database called test. This database
connection value is assigned to global variable db.
As a developer or administrator, you need to change the database from test to your database after the first connection is made. You can do this with the use <databasename> command.
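For example, assuming a database named mydb (a hypothetical name):
> use mydb
switched to db mydb
> db
mydb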
mongos
mongos is used in MongoDB sharding. It acts as a routing service that processes queries from the application
layer and determines where in the sharded cluster the requested data is located.
We will discuss mongos in more detail in the sharding section. Right now you can think of mongos as
the process that routes the queries to the correct server holding the data.
MongoDB Tools
Apart from the core services, there are various tools that are available as part of the MongoDB installation:
• mongodump: This utility is used as part of an effective backup strategy. It creates a
binary export of the database contents.
• mongorestore: The binary database dump created by the mongodump utility is
imported to a new or an existing database using the mongorestore utility.
• bsondump: This utility converts the BSON files into a human-readable format such as JSON. For example, this utility can be used to read the output files generated by mongodump.
• mongoimport, mongoexport: mongoimport provides a method for taking data in
JSON, CSV, or TSV formats and importing it into a mongod instance. mongoexport
provides a method to export data from a mongod instance into JSON, CSV, or TSV
formats.
• mongostat, mongotop, mongosniff: These utilities provide diagnostic information
related to the current operation of a mongod instance.
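As a rough sketch of how a backup-and-restore cycle with these tools might look (the dump directory and database name here are illustrative):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongodump --db testdb --out c:\backup
c:\practicalmongodb\bin>mongorestore --db testdb c:\backup\testdb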
Standalone Deployment
Standalone deployment is used for development purposes; it doesn't ensure any redundancy of data and it doesn't ensure recovery in case of failures, so it's not recommended for use in a production environment.
Standalone deployment has the following components: a single mongod and a client connecting to the
mongod, as shown in Figure 7-1.
MongoDB uses sharding and replication to provide a highly available system by distributing and duplicating
the data. In the coming sections, you will look at sharding and replication. Following that you’ll look at the
recommended production deployment architecture.
Replication
In a standalone deployment, if the mongod is not available, you risk losing all the data, which is not
acceptable in a production environment. Replication is used to offer safety against such kind of data loss.
Replication provides for data redundancy by replicating data on different nodes, thereby providing
protection of data in case of node failure. Replication provides high availability in a MongoDB deployment.
Replication also simplifies certain administrative tasks: routine tasks such as backups can be offloaded to the replica copies, freeing the main copy to handle important application requests.
In some scenarios, it can also help in scaling the reads by enabling the client to read from the different
copies of data.
In this section, you will learn how replication works in MongoDB and its various components. There are
two types of replication supported in MongoDB: traditional master/slave replication and replica set.
Master/Slave Replication
In MongoDB, the traditional master/slave replication is available, but it is recommended only for deployments that need more than 50 nodes. The preferred replication approach is replica sets, which we will explain later. In this type of replication, there is one master and a number of slaves that replicate the data from the master. The only advantage of this type of replication is that there's no restriction on the number of slaves within a cluster. However, thousands of slaves will overburden the master node, so in practical scenarios it's better to have fewer than a dozen slaves. In addition, this type of replication doesn't automate failover and provides less redundancy.
In a basic master/slave setup, you have two types of mongod instances: one instance is in the master
mode and the remaining are in the slave mode, as shown in Figure 7-2. Since the slaves are replicating from
the master, all slaves need to be aware of the master’s address.
The master node maintains a capped collection (oplog) that stores an ordered history of logical writes
to the database.
The slaves replicate the data using this oplog collection. Since the oplog is a capped collection, if
the slave’s state is far behind the master’s state, the slave may become out of sync. In that scenario, the
replication will stop and manual intervention will be needed to re-establish the replication.
There are two main reasons behind a slave becoming out of sync:
• The slave shuts down or stops and restarts later. During this time, the oplog may have
deleted the log of operations required to be applied on the slave.
• The slave is slow in executing the updates that are available from the master.
Replica Set
The replica set is a sophisticated form of the traditional master-slave replication and is the recommended replication method in MongoDB deployments.
Replica sets are basically a type of master-slave replication but they provide automatic failover. A replica
set has one master, which is termed as primary, and multiple slaves, which are termed as secondary in the
replica set context; however, unlike master-slave replication, there’s no one node that is fixed to be primary
in the replica set.
If a master goes down in a replica set, one of the slave nodes is automatically promoted to master. The clients start connecting to the new master, and both data and application will remain available. In a replica set, this failover happens in an automated fashion. We will explain the details of how this process happens later.
The primary node is selected through an election mechanism. If the primary goes down, an election is held and one of the eligible secondaries is chosen as the new primary.
Figure 7-3 shows how a two-member replica set failover happens. Let’s discuss the various steps that
happen for a two-member replica set in failover.
Replica set replication has a limitation on the number of members. Prior to version 3.0, the limit was 12, but this has been changed to 50 in version 3.0. So a replica set can now have a maximum of 50 members, and at any given point of time in a 50-member replica set, only 7 can participate in a vote. We will explain the voting concept in a replica set in detail later.
Starting from Version 3.0, replica set members can use different storage engines. For example, the WiredTiger
storage engine might be used by the secondary members whereas the MMAPv1 engine could be used by the
primary. In the coming sections, you will look at the different storage engines available with MongoDB.
When deciding on the delay time, consider your maintenance period and the size of the oplog. The delay time should be equal to or greater than the maintenance window, and the oplog size should be set in a manner that ensures no operations are lost while replicating.
Note that since the delayed members will not have data as up-to-date as the primary node, their priority should be set to 0 so that they can never become primary. Also, the hidden property should be true in order to avoid any read requests.
Arbiters are members that do not hold a copy of the primary's data, so they can never become the primary. They are solely used for participating in voting. This enables the replica set to have an uneven number of nodes without incurring the cost that comes with data replication.
Non-voting members hold the primary’s data copy, they can accept client read operations, and they
can also become the primary, but they cannot vote in an election.
The voting ability of a member can be disabled by setting its votes to 0. By default, every member has one vote. Say you have a replica set with seven members. Using the following commands in the mongo shell, the votes for the fourth, fifth, and sixth members are set to 0:
cfg_1 = rs.conf()
cfg_1.members[3].votes = 0
cfg_1.members[4].votes = 0
cfg_1.members[5].votes = 0
rs.reconfig(cfg_1)
Although this setting still allows the fourth, fifth, and sixth members to be elected as primary, their votes will not be counted when voting takes place. They become non-voting members, which means they can stand for election but cannot vote themselves.
You will see how the members can be configured later in this chapter.
Elections
In this section, you will look at the election process for selecting a primary member. In order to get elected, a server needs not just a majority of the votes cast but a majority of the total votes in the set.
If there are X servers, with each server having 1 vote, then a server can become primary only when it has at least [(X/2) + 1] votes. For example, in a five-member set a server needs at least 3 votes to become primary.
If a server gets the required number of votes or more, then it will become primary.
The primary that went down still remains part of the set; when it is up, it will act as a secondary server
until the time it gets a majority of votes again.
The complication with this type of voting system is that you cannot have just two nodes acting as master and slave. In this scenario, you will have a total of two votes, and to become a master, a node will need a majority of the votes, which in this case is both of them. If one of the servers goes down, the other server will end up having one vote out of two, so it will never be promoted to master and will remain a slave.
In case of network partitioning, the master will lose the majority of votes since it will have only its own one vote, so it'll be demoted to slave; the node that is acting as slave will also remain a slave in the absence of a majority of the votes. You will end up having two slaves until both servers reach each other again.
A replica set has a number of ways to avoid such situations. The simplest way is to use an arbiter to help resolve such conflicts. An arbiter is very lightweight and is just a voter, so it can run on either of the servers itself.
Let’s now see how the above scenario will change with the use of an arbiter. Let’s first consider the
network partitioning scenario. If you have a master, a slave, and an arbiter, each has one vote, totalling three
votes. If a network partition occurs with the master and arbiter in one data center and the slave in another
data center, the master will remain master since it will still have the majority of votes.
If the master fails with no network partitioning, the slave can be promoted to master because it will have
two votes (slave + arbiter).
This three-server setup provides a robust failover deployment.
The election is affected by the priority settings. A 0 priority member can never become a primary.
Oplog
Oplog stands for the operation log. An oplog is a capped collection where all the operations that modify the
data are recorded.
The oplog is maintained in a special database named local, in the collection oplog.$main (for master/slave replication) or oplog.rs (for replica sets). Every operation is maintained as a document, where each document corresponds to one operation that is performed on the master server. The document contains various keys, including the following:
• ts: This stores the timestamp when the operations are performed. It’s an internal
type and is composed of a 4-byte timestamp and a 4-byte incrementing counter.
• op: This stores information about the type of operation performed. The value is stored as a 1-byte code (e.g. it will store an "i" for an insert operation).
• ns: This key stores the collection namespace on which the operation was performed.
• o: This key specifies the operation that is performed. In case of an insert, this will
store the document to insert.
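An illustrative (not actual) oplog document for an insert might therefore look roughly like this:
{
  "ts" : Timestamp(1436761446, 1),
  "op" : "i",
  "ns" : "test.users",
  "o" : { "_id" : 1, "name" : "joe" }
}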
Only operations that change the data are maintained in the oplog because it’s a mechanism for
ensuring that the secondary node data is in sync with the primary node data.
The operations that are stored in the oplog are transformed so that they remain idempotent, which
means that even if it’s applied multiple times on the secondary, the secondary node data will remain
consistent. Since the oplog is a capped collection, with every new addition of an operation, the oldest
operations are automatically moved out. This is done to ensure that it does not grow beyond a pre-set
bound, which is the oplog size.
Whenever a replica set member first starts up, MongoDB creates an oplog of a default size, which depends on the OS.
By default, MongoDB allocates 5% of the available free disk space for the oplog on Windows and 64-bit Linux instances. If this amounts to less than 1GB, then 1GB of space is allocated.
Although the default size is sufficient in most cases, you can use the --oplogSize option to specify the oplog size in MB when starting the server.
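For instance, a sketch of starting a member with a 2GB oplog (the size and the other options here reuse values from elsewhere in this chapter and are illustrative):
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset --oplogSize 2048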
If you have any of the following workloads, you might need to reconsider the oplog size:
• Updates to multiple documents simultaneously: Since the operations need to be translated into operations that are idempotent, this scenario might end up requiring a great deal of oplog space.
• Deletes and insertions happening at the same rate and involving the same amount of data: In this scenario, although the database size will not increase, the translation of the operations into idempotent operations can lead to a bigger oplog.
• Large number of in-place updates: Although these updates will not change the
database size, the recording of updates as idempotent operations in the oplog can
lead to a bigger oplog.
Since oplog ops are idempotent, the same operation can be applied any number of times, and every time the resulting document will be the same.
Say you have the following document:
{I:11}
and an increment operation such as
{$inc:{I:1}}
is performed on it on the primary. In this case, the oplog stores the resulting value rather than the increment itself, effectively recording
{I:12}
This is what is replicated by the secondaries, so the value remains the same even if the log entry is applied multiple times.
Starting Up
When a node is started, it checks its local collection to find out the lastOpTimeWritten. This is the time of
the latest op that was applied on the secondary.
The following shell helper can be used to find the latest op in the shell:
> rs.debug.getLastOpWritten()
The output returns a field named ts, which depicts the last op time.
If a member starts up and finds the ts entry, it starts by choosing a target to sync from and it will start
syncing as in a normal operation. However, if no entry is found, the node will begin the initial sync process.
In version 2.0, the slave’s delayed nodes were debatably included in “healthy” nodes. Starting from
version 2.2, delayed nodes and hidden nodes are excluded from the “healthy” nodes.
Running the following command will show the server that is chosen as the source for syncing:
db.adminCommand({replSetGetStatus:1})
The output field of syncingTo is present only on secondary nodes and provides information on the
node from which it is syncing.
However, this will never lead to a loop, because a node will sync only from a secondary whose lastOpTimeWritten is greater than its own. You will never end up in a scenario where N1 is syncing from N2 and N2 is syncing from N1; it will always be either N1 syncing from N2 or N2 syncing from N1.
In this section, you will see how w (the write concern) works with slave chaining. If N1 is syncing from N2, which is in turn syncing from N3, how will N3 know up to which point N1 has synced?
When N1 starts its sync from N2, a special "handshake" message is sent, which informs N2 that N1 will be syncing from its oplog. Since N2 is not the primary, it forwards the message to the node it is syncing from (i.e. it opens a connection to N3 pretending to be N1). By the end of this step, N2 has two connections open with N3: one connection for itself and the other for N1.
Whenever an op request is made by N1 to N2, the op is sent by N2 from its oplog and a dummy request
is forwarded on the link of N1 to N3, as shown in Figure 7-4.
Although this minimizes network traffic, it increases the absolute time for the write to reach all of the members.
Failover
In this section, you will look at how primary and secondary member failovers are handled in replica sets.
All members of a replica set are connected to each other. As shown in Figure 7-5, they exchange a heartbeat
message amongst each other.
Rollbacks
In the scenario of a primary node change, the data on the new primary is assumed to be the latest data in the system. When the former primary rejoins, any operations that were applied on it but not replicated to the new primary are rolled back. Then it is synced with the new primary.
The rollback operation reverts all the write operations that were not replicated across the replica set.
This is done in order to maintain database consistency across the replica set.
When connecting to the new primary, the affected nodes go through a resync process to ensure the rollback is accomplished. They look through the operations that are not present on the new primary and then query the new primary to return an updated copy of the documents that were affected by those operations. While the nodes are resyncing, they are said to be recovering; until the process is complete, they will not be eligible for primary election.
This happens very rarely, and if it happens, it is often due to network partition with replication lag
where the secondaries cannot keep up with the operation’s throughput on the former primary.
It needs to be noted that if the write operations replicate to other members before the primary steps down, and those members are accessible to a majority of the nodes of the replica set, the rollback does not occur.
The rollback data is written to BSON files with names of the form <database>.<collection>.<timestamp>.bson in the database's dbpath directory.
The administrator can decide to either ignore or apply the rollback data. Applying the rollback data can
only begin when all the nodes are in sync with the new primary and have rolled back to a consistent state.
The content of the rollback files can be read using bsondump, and the relevant data then needs to be manually applied to the new primary using mongorestore.
There is no mechanism in MongoDB to handle rollback situations automatically, so manual intervention is required to apply the rollback data. While applying the rollback, it's vital to ensure that the changes are replicated to either all or at least some of the members in the set so that rollbacks can be avoided in case of any future failover.
Consistency
You have seen that the replica set members keep on replicating data among each other by reading the oplog.
How is the consistency of data maintained? In this section, you will look at how MongoDB ensures that you
always access consistent data.
In MongoDB, although the reads can be routed to the secondaries, the writes are always routed to the
primary, eradicating the scenario where two nodes are simultaneously trying to update the same data set.
The data set on the primary node is always consistent.
If the read requests are routed to the primary node, it will always see the up-to-date changes, which
means the read operations are always consistent with the last write operations.
However, if the application has changed the read preference to read from secondaries, the user might not see the latest changes or might see previous states. This is because the writes are replicated asynchronously on the secondaries.
This behavior is characterized as eventual consistency, which means that although the secondary's state is not consistent with the primary node's state, it will eventually become consistent over time.
There is no way that reads from the secondary can be guaranteed to be consistent, except by issuing
write concerns to ensure that writes succeed on all members before the operation is actually marked
successful. We will be discussing write concerns in a while.
Figure 7-6. Members replica set with primary, secondary, and arbiter
2. Replica set fault tolerance is the number of members that can go down while the replica set still has enough members to elect a primary in case of a failure. Table 7-1 indicates the relationship between the member count in the replica set and its fault tolerance. Fault tolerance should be considered when deciding on the number of members.
Figure 7-7. Members replica set with primary, secondary, and hidden members
Figure 7-8. Members replica set with primary, secondary, and a priority 0 member distributed across the
data center
6. When replica set members are distributed across data centers, network partitioning can prevent data centers from communicating with each other. In order to ensure a majority in the case of network partitioning, keep a majority of the members in one location.
Scaling Reads
Although the primary purpose of the secondaries is to ensure data availability in case of downtime of the primary node, there are other valid use cases for secondaries. They can be dedicated to performing backup operations or data processing jobs, or used to scale out reads. One way to scale reads is to issue the read queries against the secondary nodes; by doing so the workload on the master is reduced.
One important point that you need to consider when using secondaries for scaling read operations
is that in MongoDB the replication is asynchronous, which means if any write or update operation is
performed on the master’s data, the secondary data will be momentarily out-of-date. If the application in
question is read-heavy and is accessed over a network and does not need up-to-date data, the secondaries
can be used to scale out the read in order to provide a good read throughput. Although by default the read
requests are routed to the primary node, the requests can be distributed over secondary nodes by specifying
the read preferences. Figure 7-9 depicts the default read preference.
The following are ideal use cases whereby routing the reads on secondary node can help gain a
significant improvement in the read throughput and can also help reduce the latency:
1. Applications that are geographically distributed: In such cases, you can have
a replica set that is distributed across geographies. The read preferences should
be set to read from the nearest secondary node. This helps in reducing the
latency that is caused when reading over network and this improves the read
performance. See Figure 7-10.
3. Applications that support two types of operations: a main workload that involves reading and processing the data, and a second operation that generates reports from the data. In such a scenario, you can have the reporting reads directed to the secondaries.
MongoDB supports the following read preference modes:
• primary: This is the default mode. All the read requests are routed to the
primary node.
• primaryPreferred: In normal circumstances the reads will be from the primary, but in an emergency, such as the primary being unavailable, reads will be from the secondary nodes.
• secondary: Reads from the secondary members.
• secondaryPreferred: Reads from secondary members. If secondaries are
unavailable, then read from the primary.
• nearest: Reads from the nearest replica set member.
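A sketch of setting these from the mongo shell (the collection and filter here are illustrative):
> db.getMongo().setReadPref("secondaryPreferred")        // connection-wide read preference
> db.users.find({status: "active"}).readPref("nearest")  // per-query read preference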
In addition to scaling reads, the second ideal use case for using secondaries is to offload intensive
processing, aggregating, and administration tasks in order to avoid degrading the primary’s performance.
Blocking operations can be performed on the secondary without ever affecting the primary node’s
performance.
If the number specified for w is greater than the number of nodes that actually hold the data, the command will keep waiting until the members are available. In order to avoid this indefinite wait, wtimeout should also be used along with w; this ensures that the command waits only for the specified time period, and if the write has not succeeded by then, it times out.
In order to understand how this command is executed, say you have two members, a primary and a secondary that is syncing its data from the primary.
But how will the primary know the point up to which the secondary is synced? Since the secondary queries the primary's oplog for ops to apply, if the secondary requests an op written at, say, time t, it implies to the primary that the secondary has replicated all ops written before t.
The following are the steps that a write concern takes.
1. The write operation is directed to the primary.
2. The operation is written to the oplog of primary with ts depicting the time of
operation.
3. A w: 2 is issued, so the write operation needs to be written to one more server
before it’s marked successful.
4. The secondary queries the primary’s oplog for the op, and it applies the op.
5. Next, the secondary sends a request to the primary for ops with ts greater than t.
6. At this point, the primary knows that the operations up to t have been applied by the secondary, since it is requesting ops with {ts: {$gt: t}}.
7. The writeConcern finds that a write has occurred on both the primary and
secondary, satisfying the w: 2 criteria, and the command returns success.
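A minimal sketch of issuing such a write from the shell (the collection and document here are illustrative):
> db.users.insert(
   {_id: 1, name: "joe"},
   {writeConcern: {w: 2, wtimeout: 5000}}   // wait for 2 members, or time out after 5 seconds
)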
The following examples assume a replica set named testset that has the configuration shown in Table 7-2.
The hostname used in the above table can be found out using the following command:
C:\>hostname
ANOC9
C:\>
In the following examples, [hostname] needs to be substituted with the value that the hostname command returns on your system. In our case, the value returned is ANOC9, which is used in the following examples.
C:\>mkdir C:\db1\active1\data
C:\>
As you can see, the --replSet option specifies the name of the replica set the instance is joining and the name of one more member of the set, which in the above example is Active_Member_2.
Although you have only specified one member in the above example, multiple members can be provided by specifying comma-separated addresses, like so:
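A startup command along the following lines (a sketch reusing the hostname, ports, and set name from this section) illustrates the comma-separated form:
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active1\data --port 27021 --replSet testset/ANOC9:27022,ANOC9:27023 --rest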
In the next step, you get the second active member up and running. Create the data directory for the
second active member in a new terminal window.
C:\>mkdir C:\db1\active2\data
C:\>
Connect to mongod:
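A startup command along the following lines (a sketch reusing the directories and ports from this section) brings up the second active member:
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active2\data --port 27022 --replSet testset/ANOC9:27021 --rest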
Finally, you need to start the passive member. Open a separate window and create the data directory for
the passive member.
C:\>mkdir C:\db1\passive1\data
C:\>
Connect to mongod:
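A startup command along the following lines (again a sketch based on the directories and ports used in this section) brings up the passive member:
c:\practicalmongodb\bin>mongod --dbpath C:\db1\passive1\data --port 27023 --replSet testset/ANOC9:27021,ANOC9:27022 --rest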
In the preceding examples, the --rest option is used to activate a REST interface on a port 1000 higher than the mongod port. Activating REST enables you to inspect the replica set status using a web interface.
By the end of the above steps, you have three servers that are up and running and are communicating
with each other; however the replica set is still not initialized. In the next step, you initialize the replica set
and instruct each member about their responsibilities and roles.
In order to initialize the replica set, you connect to one of the servers. In this example, it is the first
server, which is running on port 27021.
Open a new command prompt and connect to the mongo interface for the first server:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo ANOC9 --port 27021
MongoDB shell version: 3.0.4
connecting to: ANOC9:27021/test
>
Next, a configuration data structure is set up, which specifies each server's role:
>cfg = {
... _id: 'testset',
... members: [
... {_id:0, host: 'ANOC9:27021'},
... {_id:1, host: 'ANOC9:27022'},
... {_id:2, host: 'ANOC9:27023', priority:0}
... ]
... }
{ "_id" : "testset",
"members" : [
{
"_id" : 0,
"host" : "ANOC9:27021"
},
..........
{
"_id" : 2,
"host" : "ANOC9:27023",
"priority" : 0
} ]}>
> rs.initiate(cfg)
{ "ok" : 1}
Let’s now view the replica set status in order to vet that it’s set up correctly:
testset:PRIMARY> rs.status()
{
"set" : "testset",
"date" : ISODate("2015-07-13T04:32:46.222Z")
"myState" : 1,
"members" : [
{
"_id" : 0,
...........................
testset:PRIMARY>
The output indicates that all is OK. The replica set is now successfully configured and initialized.
Let’s see how you can determine the primary node. In order to do so, connect to any of the members
and issue the following and verify the primary:
testset:PRIMARY> db.isMaster()
{
"setName" : "testset",
"setVersion" : 1,
"ismaster" : true,
"primary" : " ANOC9:27021",
"me" : "ANOC9:27021",
...........................................
"localTime" : ISODate("2015-07-13T04:36:52.365Z"),
.........................................................
"ok" : 1
}testset:PRIMARY>
Removing a Server
In this example, you will remove the second active member from the set. Let's connect to that member's mongo instance. Open a new command prompt, like so:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo ANOC9 --port 27022
MongoDB shell version: 3.0.4
connecting to: 127.0.0.1:27022/ANOC9
testset:SECONDARY>
Next, you need to connect to the primary member mongo console and execute the following to remove
the member:
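The standard shell helper for this is rs.remove(); assuming the member running on port 27022 is the one being removed, the call is:
testset:PRIMARY> rs.remove("ANOC9:27022")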
In order to vet whether the member is removed or not you can issue the rs.status() command.
Adding a Server
You will next add a new active member to the replica set. As with other members, you begin by opening a
new command prompt and creating the data directory first:
C:\>mkdir C:\db1\active3\data
C:\>
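A startup command along the following lines (a sketch consistent with the rs.add() call below, which uses port 27024) brings up the new member:
c:\practicalmongodb\bin>mongod --dbpath C:\db1\active3\data --port 27024 --replSet testset/ANOC9:27021 --rest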
You have the new mongod running, so now you need to add this to the replica set. For this you connect
to the primary’s mongo console:
Finally, the following command needs to be issued to add the new mongod to the replica set:
testset:PRIMARY> rs.add("ANOC9:27024")
{ "ok" : 1 }
The replica set status can be checked to vet whether the new active member is added or not using
rs.status().
C:\>mkdir c:\db1\arbiter\data
C:\>
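A startup command along the following lines (a sketch consistent with the rs.addArb() call below, which uses port 30000) brings up the arbiter:
c:\practicalmongodb\bin>mongod --dbpath c:\db1\arbiter\data --port 30000 --replSet testset/ANOC9:27021 --rest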
Connect to the primary’s mongo console, switch to the admin db, and add the newly created mongod as
an arbiter to the replica set:
testset:PRIMARY> rs.addArb("ANOC9:30000")
{ "ok" : 1 }
testset:PRIMARY>
testset:PRIMARY> rs.status()
{
"set" : "testset",
"date" : ISODate("2015-07-13T22:15:46.222Z")
"myState" : 1,
"members" : [
{
"_id" : 0,
...........................
"ok" : 1
testset:PRIMARY>
The myState field’s value indicates the status of the member and it can have the values shown in Table 7-3.
myState Description
0 Phase 1, starting up
1 Primary member
2 Secondary member
3 Recovering state
4 Fatal error state
5 Phase 2, starting up
6 Unknown state
7 Arbiter member
8 Down or unreachable
9 This state is reached when a write operation is rolled back by the secondary after
transitioning from primary.
10 Members enter this state when removed from the replica set.
Hence the above command returns a myState value of 1, which indicates that this is the primary member.
testset:PRIMARY> rs.stepDown()
2015-07-13T22:52:32.000-0700 I NETWORK DBClientCursor::init call() failed
2015-07-13T22:52:32.005-0700 E QUERY Error: error doing query: failed
2015-07-13T22:52:32.009-0700 I NETWORK trying reconnect to 127.0.0.1:27021 (127.0.0.1) failed
2015-07-13T22:52:32.011-0700 I NETWORK reconnect 127.0.0.1:27021 (127.0.0.1) ok
testset:SECONDARY>
After execution of the command the prompt changed from testset:PRIMARY to testset:SECONDARY.
rs.status() can be used to check whether the stepDown() was successful or not.
Please note that the myState value it returns is now 2, which means the member is operating as a secondary.
Sharding
You saw in the previous section how replica sets in MongoDB are used to duplicate the data in order to
protect against any adversity and to distribute the read load in order to increase the read efficiency.
MongoDB uses memory extensively for low latency database operations. When you compare the speed of
reading data from memory to reading data from disk, reading from memory is approximately 100,000 times
faster than reading from the disk.
In MongoDB, ideally the working set should fit in memory. The working set consists of the most
frequently accessed data and indexes.
A page fault happens when MongoDB accesses data that is not in memory. If there's free memory available, the OS will directly load the requested page into memory; however, in the absence of free memory, a page in memory is written to disk and then the requested page is loaded into memory, slowing down the process. A few operations can accidentally purge a large portion of the working set from memory, leading to an adverse effect on performance. One example is a query scanning through all documents of a database whose size exceeds the server memory. This leads to loading the documents into memory and moving the working set out to disk.
Ensuring you have defined the appropriate index coverage for your queries during the schema design phase
of the project will minimize the risk of this happening. The MongoDB explain operation can be used to provide
information on your query plan and the indexes used.
MongoDB’s serverStatus command returns a workingSet document that provides an estimate of the
instance’s working set size. The Operations team can track how many pages the instance accessed over a
given period of time and the elapsed time between the working set’s oldest and newest document. Tracking all
these metrics, it’s possible to detect when the working set will be hitting the current memory limit, so proactive
actions can be taken to ensure the system is scaled well enough to handle that.
In MongoDB, the scaling is handled by scaling out the data horizontally (i.e. partitioning the data across
multiple commodity servers), which is also called sharding (horizontal scaling).
Sharding addresses the challenges of scaling to support large data sets and high throughput by
horizontally dividing the datasets across servers where each server is responsible for handling its part of data
and no one server is burdened. These servers are also called shards.
Every shard is an independent database. All the shards collectively make up a single logical database.
Sharding reduces the number of operations each shard handles. For example, when data is inserted, only the shard responsible for storing those records needs to be accessed.
The amount of work each shard has to handle decreases as the cluster grows, because each shard holds a smaller subset of the data. This leads to an increase in throughput and capacity horizontally.
Let’s assume you have a database that is 1TB in size. If the number of shards is 4, you will have
approximately 265GB of data handled by each shard, whereas if the number of shards is increased to 40, only
25GB of data will be held on each shard.
Figure 7-15 depicts how a collection that is sharded will appear when distributed across three shards.
Although sharding is a compelling and powerful feature, it has significant infrastructure requirements
and it increases the complexity of the overall deployment. So you need to understand the scenarios where
you might consider using sharding.
Use sharding in the following instances:
• The size of the dataset is huge and it has started challenging the capacity of a
single system.
• Since memory is used by MongoDB for quickly fetching data, it becomes important to scale out when the active working set is about to reach the limits of the available memory.
• If the application is write-intensive, sharding can be used to spread the writes
across multiple servers.
Sharding Components
You will next look at the components that enable sharding in MongoDB. Sharding is enabled in MongoDB
via sharded clusters.
The following are the components of a sharded cluster:
• Shards
• mongos
• Config servers
The shard is the component where the actual data is stored. For the sharded cluster, it holds a subset of the data and can be either a mongod or a replica set. All the shards' data combined forms the complete dataset for the sharded cluster.
Sharding is enabled on a per-collection basis, so there might be collections that are not sharded. In every sharded cluster there's a primary shard that holds all the unsharded collections in addition to its share of the sharded collection data.
When deploying a sharded cluster, by default the first shard becomes the primary shard, although this is configurable. See Figure 7-16.
Config servers are special mongods that hold the sharded cluster’s metadata. This metadata depicts the
sharded system state and organization.
The config server stores data for a single sharded cluster. The config servers should be available for the
proper functioning of the cluster.
Running only one config server makes it a single point of failure for the cluster. For production deployments it's recommended to have at least three config servers, so that the cluster keeps functioning even if one config server is not accessible.
A config server stores the data in the config database, which enables routing of the client requests to the
respective data. This database should not be updated.
MongoDB writes data to the config server only when the data distribution has changed for balancing
the cluster.
The mongos act as the routers. They are responsible for routing the read and write request from the
application to the shards.
An application interacting with a mongo database need not worry about how the data is stored
internally on the shards. For them, it’s transparent because it’s only the mongos they interact with. The
mongos, in turn, route the reads and writes to the shards.
The mongos cache the metadata from the config server so that they don't overburden the config server with every read and write request.
However, in the following cases, the data is read from the config server :
• Either an existing mongos has restarted or a new mongos has started for the first time.
• Migration of chunks. We will explain chunk migration in detail later.
Shard Key
Any indexed single or compound field that exists within all documents of the collection can be a shard key. You specify the field on the basis of which the documents of the collection are to be distributed. Internally, MongoDB divides the documents into chunks based on the value of that field and distributes them across the shards.
There are two ways MongoDB enables distribution of the data: range-based partitioning and hash-
based partitioning.
Range-Based Partitioning
In range-based partitioning, the shard key values are divided into ranges. Say you consider a timestamp
field as the shard key. In this way of partitioning, the values are considered as a straight line starting from a
Min value to Max value where Min is the starting period (say, 01/01/1970) and Max is the end period (say,
12/31/9999). Every document in the collection will have timestamp value within this range only, and it will
represent some point on the line.
Based on the number of shards available, the line will be divided into ranges, and documents will be
distributed based on them.
In this scheme of partitioning, shown in Figure 7-17, the documents where the values of the shard
key are nearby are likely to fall on the same shard. This can significantly improve the performance of the
range queries.
However, the disadvantage is that it can lead to uneven distribution of data, overloading one of the
shards, which may end up receiving majority of the requests, whereas the other shards remain underloaded,
so the system will not scale properly.
Hash-Based Partitioning
In hash-based partitioning, the data is distributed on the basis of the hash value of the shard field. If
selected, this will lead to a more random distribution compared to range-based partitioning.
It’s unlikely that the documents with close shard key will be part of the same chunk. For example, for
ranges based on the hash of the _id field, there will be a straight line of hash values, which will again be
partitioned on basis of the number of shards. On the basis of the hash values, the documents will lie in either
of the shards. See Figure 7-18.
In contrast to range-based partitioning, this ensures that the data is evenly distributed, but it happens at
the cost of efficient range queries.
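As an illustrative sketch (the namespace testdb.logs here is hypothetical), hash-based sharding on the _id field can be enabled from the mongos shell like this, assuming sharding has already been enabled on the database:
mongos> sh.shardCollection("testdb.logs", {_id: "hashed"})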
Chunks
The data is moved between the shards in form of chunks. The shard key range is further partitioned into sub-
ranges, which are also termed as chunks. See Figure 7-19.
For a sharded cluster, 64MB is the default chunk size. In most situations, this is an apt size for chunk splitting and migration.
Let’s discuss the execution of sharding and chunks with an example. Say you have a blog posts
collection which is sharded on the field date. This implies that the collection will be split up on the basis
of the date field values. Let’s assume further that you have three shards. In this scenario the data might be
distributed across shards as follows:
Shard #1: Beginning of time up to July 2009
Shard #2: August 2009 to December 2009
Shard #3: January 2010 through the end of time
In order to retrieve documents from January 1, 2010 until today, the query is sent to mongos.
In this scenario,
1. The client queries mongos.
2. The mongos knows which shards have the data, so it sends the query to Shard #3.
3. Shard #3 executes the query and returns the results to mongos.
4. Mongos combines the data received from various shards, which in this case is
Shard #3 only, and returns the final result back to the client.
The application doesn’t need to be sharding-aware. It can query the mongos as though it’s a normal mongod.
Let’s consider another scenario where you insert a new document. The new document has today’s date.
The sequences of events are as follows:
1. The document is sent to the mongos.
2. Mongos checks the date and on basis of that, sends the document to Shard #3.
3. Shard #3 inserts the document.
From a client’s point of view, this is again identical to a single server setup.
Chunk Splitting
Chunk splitting is one of the processes that ensures the chunks are of the specified size. As you have seen,
a shard key is chosen and it is used to identify how the documents will be distributed across the shards.
The documents are further grouped into chunks of 64MB (the default, which is configurable) and are stored in the shards based on the ranges they are hosting.
If the size of the chunk changes due to an insert or update operation, and exceeds the default chunk
size, then the chunk is split into two smaller chunks by the mongos. See Figure 7-20.
This process keeps the chunks within a shard at or below the configured size (i.e. it ensures that the chunks do not exceed the configured size).
Insert and update operations trigger splits. The split operation leads to modification of the data in the config server, as the metadata is modified. Although splits don't lead to migration of data, this operation can lead to an imbalance of the cluster, with one shard having more chunks compared to another.
Balancer
The balancer is the background process that is used to ensure that all of the shards are equally loaded, or are in a balanced state. This process manages chunk migrations.
Splitting of a chunk can cause imbalance. The addition or removal of documents can also lead to a cluster imbalance. When the cluster is imbalanced, the balancer is used to distribute the data evenly.
When you have a shard with more chunks compared to the other shards, the chunk balancing is done automatically by MongoDB across the shards. This process is transparent to the application and to you.
Any of the mongos within the cluster can initiate the balancer process. They do so by acquiring a lock on the config database of the config server, since balancing involves migration of chunks from one shard to another, which changes the metadata, which in turn changes the config server database.
The balancer process can have a huge impact on the database performance, so it can either
1. Be configured to start the migration only when the migration threshold has been reached. The migration threshold is the difference between the maximum and minimum number of chunks on the shards. The thresholds are shown in Table 7-4.
2. Or be scheduled to run in a time period that will not impact the production traffic, as sketched below.
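A sketch of scheduling such a balancing window from the mongos shell (the window times here are illustrative):
mongos> use config
switched to db config
mongos> db.settings.update(
   { _id: "balancer" },
   { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
   { upsert: true }
)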
The balancer migrates one chunk at a time (see Figure 7-21) and follows these steps:
1. The moveChunk command is sent to the source shard.
2. An internal moveChunk command is started on the source where it creates the
copy of the documents within the chunk and queues it. In the meantime, any
operations for that chunk are routed to the source by the mongos because the
config database is not yet changed and the source will be responsible for serving
any read/write request on that chunk.
3. The destination shard starts receiving the copy of the data from the source.
4. Once all of the documents in the chunks have been received by the destination
shard, the synchronization process is initiated to ensure that all changes that
have happened to the data while migration are updated at the destination shard.
5. Once the synchronization is completed, the next step is to update the metadata
with the chunk’s new location in the config database. This activity is done by
the destination shard that connects to the config database and carries out the
necessary updates.
6. Post successful completion of all the above, the document copy that is
maintained at the source shard is deleted.
If in the meantime the balancer needs to perform additional chunk migrations from the source shard, it can start the new migration without waiting for the deletion step of the current migration to finish.
In case of any error during the migration process, the process is aborted by the balancer, leaving the chunks on the original shard. On successful completion of the process, the chunk data is removed from the original shard by MongoDB.
Addition or removal of shards can also lead to cluster imbalance. When a new shard is added, data
migration to the shard is started immediately. However, it takes time for the cluster to be balanced.
When a shard is removed, the balancer ensures that the data is migrated to the other shards and the
metadata information is updated. Post completion of the two activities, the shard is removed safely.
Operations
You will next look at how the read and write operations are performed on the sharded cluster. As mentioned, the config servers maintain the cluster metadata. This data is stored in the config database and is used by the mongos to service the application's read and write requests.
The data is cached by the mongos instances and then used for routing write and read operations to the shards. This way the config servers are not overburdened.
The mongos will only read from the config servers in the following scenarios:
• The mongos has started for first time or
• An existing mongos has restarted or
• After chunk migration when the mongos needs to update its cached metadata with
the new cluster metadata.
Whenever an operation is issued, the first thing the mongos needs to do is identify the shards that will serve the request. Since the shard key is used to distribute data across the sharded cluster, if the operation uses the shard key field, then specific shards can be targeted based on it.
If the shard key is employeeid, the following things can happen:
1. If the find query contains the employeeid field, then to satisfy the query, only specific shards will be targeted by the mongos.
2. If a single update operation uses employeeid for updating the document, the
request will be routed to the shard holding that employee data.
However, if the operation is not using the shard key, then the request is broadcast to all the shards.
Generally a multi-update or remove operation is targeted across the cluster.
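For illustration (the employees collection and values here are hypothetical), compare a targeted query with a broadcast query:
mongos> db.employees.find({employeeid: 1001})   // contains the shard key, so routed only to the owning shard(s)
mongos> db.employees.find({name: "John"})       // no shard key, so broadcast to all shards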
While querying the data, there might be scenarios where in addition to identifying the shards and
getting the data from them, the mongos might need to work on the data returned from various shards before
sending the final output to the client.
Say an application has issued a find() request with sort(). In this scenario, the mongos will pass the
$orderby option to the shards. The shards will fetch the data from their data set and will send the result in an
ordered manner. Once the mongos has all the shard’s sorted data, it will perform an incremental merge sort
on the entire data and then return the final output to the client.
Similar to sort are operations such as limit() and skip(), which require the mongos to perform work on the data received from the shards before returning the final result set to the client.
The mongos consumes minimal system resources and has no persistent state. So if the application's requirement is simple find() queries that can be met solely by the shards and need no manipulation at the mongos level, you can run the mongos on the same systems where your application servers are running.
Implementing Sharding
In this section, you will learn to configure sharding on one machine on a Windows platform.
You will keep the example simple by using only two shards. In this configuration, you will be using the services
listed in Table 7-5.
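The config server must be started first. A startup command along the following lines (a sketch; the data directory name is illustrative, and the port matches the --configdb value passed to the mongos below) brings it up:
C:\>mkdir C:\db1\config\data
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --configsvr --dbpath C:\db1\config\data --port 27022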
Next, start the mongos. Type the following in a new terminal window:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --configdb localhost:27022 --port 27021 --chunkSize 1
2015-07-13T23:06:07.246-0700 W SHARDING running with 1 config server should be done only for
testing purposes and is not recommended for production
...............................................................
2015-07-13T23:09:07.464-0700 I SHARDING [Balancer] distributed lock 'balancer/
ANOC9:27021:1429783567:41' unlocked
You now have the shard controller (i.e. the mongos) up and running.
If you switch to the window where the config server has been started, you will find a registration of the
shard server to the config server.
In this example you have used a chunk size of 1MB. Note that this is not ideal in a real-life scenario, since it is smaller than the 16MB maximum document size. However, this is just for demonstration purposes, since it creates the necessary number of chunks without loading a large amount of data. The chunk size is 64MB by default unless otherwise specified.
Next, bring up the shard servers, Shard0 and Shard1.
Open a fresh terminal window. Create the data directory for the first shard and start the mongod:
C:\>mkdir C:\db1\shard0\data
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --port 27023 --dbpath c:\db1\shard0\data --shardsvr
2015-07-13T23:14:58.076-0700 I CONTROL [initandlisten] MongoDB starting : pid=1996
port=27023 dbpath=c:\db1\shard0\data 64-bit host=ANOC9
.................................................................
2015-07-13T23:14:58.158-0700 I NETWORK [initandlisten] waiting for connections on port 27023
Open fresh terminal window. Create the data directory for the second shard and start the mongod:
C:\>mkdir c:\db1\shard1\data
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --port 27024 --dbpath C:\db1\shard1\data --shardsvr
2015-07-13T23:17:01.704-0700 I CONTROL [initandlisten] MongoDB starting : pid=3672
port=27024 dbpath=C:\db1\shard1\data 64-bit host=ANOC9
2015-07-13T23:17:01.704-0700 I NETWORK [initandlisten] waiting for connections on port 27024
All the servers relevant for the setup are up and running by the end of the above step. The next step is to
add the shards information to the shard controller.
The mongos appears as a complete MongoDB instance to the application in spite of actually not being a full
instance. The mongo shell can be used to connect to the mongos to perform any operation on it.
C:\>cd c:\practicalmongodb\bin
c:\ practicalmongodb\bin>mongo localhost:27021
MongoDB shell version: 3.0.4
connecting to: localhost:27021/test
mongos>
mongos> db.runCommand({addshard:"localhost:27023",allowLocal:true})
{ "shardAdded" : "shard0000", "ok" : 1 }
mongos> db.runCommand({addshard:"localhost:27024",allowLocal:true})
{ "shardAdded" : "shard0001", "ok" : 1 }
mongos>
mongos> db.runCommand({listshards:1})
{
"shards" : [
{
"_id" : "shard0000",
"host" : "localhost:27023"
}, {
"_id" : "shard0001",
"host" : "localhost:27024"
}
], "ok" : 1}
mongos> testdb=db.getSisterDB("testdb")
testdb
Next, specify the collection that needs to be sharded and the key on which the collection will be sharded:
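A sketch of these commands, consistent with the database (testdb), collection (testcollection), and shard key (testkey) used in the rest of this section, is:
mongos> db.adminCommand({enablesharding: "testdb"})
mongos> db.adminCommand({shardcollection: "testdb.testcollection", key: {testkey: 1}})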
With the completion of the above steps you now have a sharded cluster set up with all components up
and running. You have also created a database and enabled sharding on the collection.
Next, import data into the collection so that you can check the data distribution on the shards.
You will be using the import command to load data in the testcollection. Connect to a new terminal
window and execute the following:
C:\>cd C:\practicalmongodb\bin
C:\practicalmongodb\bin>mongoimport --host ANOC9 --port 27021 --db testdb --collection
testcollection --type csv --file c:\mongoimport.csv --headerline
2015-07-13T23:17:39.101-0700 connected to: ANOC9:27021
2015-07-13T23:17:42.298-0700 [##############..........] testdb.testcollection 1.1 MB/1.9 MB (59.6%)
2015-07-13T23:17:44.781-0700 imported 100000 documents
The mongoimport.csv file consists of two fields. The first is testkey, a randomly generated number.
The second is a text field; it is used to ensure that the documents occupy a sufficient number of chunks,
making it feasible to demonstrate the sharding mechanism.
The import inserts 100,000 documents into the collection.
In order to vet whether the records are inserted or not, connect to the mongo console of the mongos
and issue the following command:
C:\Windows\system32>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27021
MongoDB shell version: 3.0.4
connecting to: localhost:27021/test
mongos> use testdb
switched to db testdb
mongos> db.testcollection.count()
100000
mongos>
Next, connect to the consoles of the two shards (Shard0 and Shard1) and look at how the data is
distributed. Open a new terminal window and connect to Shard0’s console:
C:\>cd C:\practicalmongodb\bin
C:\practicalmongodb\bin>mongo localhost:27023
MongoDB shell version: 3.0.4
connecting to: localhost:27023/test
Switch to testdb and issue the count() command to check the number of documents on the shard:
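A minimal sketch of those commands on Shard0's console (the count you see depends on how far the balancer has progressed; the counts on Shard0 and Shard1 together should add up to 100,000):
> use testdb
switched to db testdb
> db.testcollection.count()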
Next, open a new terminal window, connect to Shard1’s console, and follow the steps as above
(i.e. switch to testdb and check the count of testcollection collection):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27024
MongoDB shell version: 3.0.4
connecting to: localhost:27024/test
> use testdb
switched to db testdb
> db.testcollection.count()
42002
>
If you run the above command repeatedly over a period of time, you might see the document count on each
shard change. When the documents are first loaded, the mongos places all of the chunks on one shard; in time
the shard set is rebalanced by distributing the chunks evenly across all the shards.
Next, add a new shard server (Shard2) to the cluster. Open a fresh terminal window, create the data
directory for the third shard, and start the mongod:
c:\>mkdir c:\db1\shard2\data
c:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --port 27025 --dbpath C:\db1\shard2\data --shardsvr
2015-07-13T23:25:49.103-0700 I CONTROL [initandlisten] MongoDB starting : pid=3744
port=27025 dbpath=C:\db1\shard2\data 64-bit host=ANOC9
................................
2015-07-13T23:25:49.183-0700 I NETWORK [initandlisten] waiting for connections on port 27025
Next, the new shard server will be added to the shard cluster. In order to configure it, open the mongos
mongo console in a new terminal window:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27021
MongoDB shell version: 3.0.4
connecting to: localhost:27021/test
mongos>
Switch to the admin database and run the addshard command. This command adds the shard server to
the sharded cluster.
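A sketch of these commands, consistent with the addshard calls used earlier (the response shown is the typical form):
mongos> use admin
switched to db admin
mongos> db.runCommand({addshard: "localhost:27025", allowLocal: true})
{ "shardAdded" : "shard0002", "ok" : 1 }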
In order to vet whether the addition is successful or not, run the listshards command:
mongos> db.runCommand({listshards:1})
{
"shards" : [
{
"_id" : "shard0000",
"host" : "localhost:27023"
},
{
"_id" : "shard0001",
"host" : "localhost:27024"
},
{
"_id" : "shard0002",
"host" : "localhost:27025"
}
],
"ok" : 1
}
Next, check how the testcollection data is distributed. Connect to the new shard’s console in a new
terminal window:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27025
MongoDB shell version: 3.0.4
connecting to: localhost:27025/test
> db.testcollection.count()
6928
> db.testcollection.count()
12928
> db.testcollection.count()
16928
Interestingly, the number of documents in the collection is slowly going up: the cluster is being rebalanced.
Over time, chunks will be migrated from the shard servers Shard0 and Shard1 to the newly added
shard server, Shard2, so that the data is evenly distributed across all the servers. Once this process
completes, the config server metadata is updated. This is an automatic process, and it happens even if no
new data is added to testcollection. This is one of the important factors to consider when deciding on
the chunk size.
If the chunkSize value is very large, the data distribution will be less even; the data is distributed more
evenly when the chunkSize is smaller.
Removing a Shard
In the following example, you will see how to remove a shard server; you will remove the server (Shard2)
that you added above.
In order to initiate the process, log on to the mongos console, switch to the admin db, and execute the
following command to remove the shard from the shard cluster:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27021
MongoDB shell version: 3.0.4
connecting to: localhost:27021/test
mongos> use admin
switched to db admin
mongos> db.runCommand({removeShard: "localhost:27025"})
{
"msg" : "draining started successfully",
"state" : "started",
"shard" : "shard0002",
"ok" : 1
}
mongos>
As you can see, the removeShard command returns a response. The state field indicates the state of the
process; here it shows that the draining process has started, as the msg field also states.
You can reissue the removeShard command to check the progress:
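The output will look something like the following (the remaining counts here are illustrative):
mongos> db.runCommand({removeShard: "localhost:27025"})
{
"msg" : "draining ongoing",
"state" : "ongoing",
"remaining" : {
"chunks" : NumberLong(12),
"dbs" : NumberLong(0)
},
"ok" : 1
}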
The response tells you the number of chunks and databases that still need to be drained from the server.
Once the process has completed, reissuing the command will report that the removal is complete.
You can use the listshards command to vet whether the removeShard operation was successful.
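For example, once the draining has completed and the shard has been removed, the output should list only the two remaining shards (a sketch based on the earlier listshards output):
mongos> db.runCommand({listshards:1})
{
"shards" : [
{
"_id" : "shard0000",
"host" : "localhost:27023"
}, {
"_id" : "shard0001",
"host" : "localhost:27024"
}
], "ok" : 1}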
As you can see, the data has been successfully migrated to the other shards, so you can delete the storage
files and terminate the Shard2 mongod process.
This ability to modify the shard cluster without going offline is one of MongoDB's critical capabilities,
enabling it to support highly available, highly scalable, large-capacity data stores.
mongos> db.printShardingStatus()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"version" : 3,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("52fb7a8647e47c5884749a1a")
}
shards:
{ "_id" : "shard0000", "host" : "localhost:27023" }
{ "_id" : "shard0001", "host" : "localhost:27024" }
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
17 : Success
databases:
{ "_id" : "admin", "partitioned" : false, "primary" : "config" }
{ "_id" : "testdb", "partitioned" : true, "primary" : "shard0000" }
...............
Important information that can be obtained from the above command includes the shard key range
associated with each chunk and the shard server on which each chunk is stored.
The output can be used to analyse the distribution of keys and chunks across the shard servers.
Prerequisite
You will start the cluster first. Just to reiterate, follow these steps.
1. Start the config server. Enter the following command in a new terminal window
(if it’s not already running):
2. Start the mongos. Enter the following command in a new terminal window (if it's
not already running):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --configdb localhost:27022 --port 27021
Start Shard0. Enter the following command in a new terminal window (if it's not already running):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --port 27023 --dbpath c:\db1\shard0\data --shardsvr
Start Shard1. Enter the following command in a new terminal window (if it's not already running):
C:\>cd c:\practicalmongodb\bin
C:\practicalmongodb\bin>mongod --port 27024 --dbpath c:\db1\shard1\data --shardsvr
Start Shard2. Enter the following command in a new terminal window (if it's not already running):
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongod --port 27025 --dbpath c:\db1\shard2\data --shardsvr
Since you removed Shard2 from the sharded cluster in the earlier example, you must add it back because
this example needs three shards.
In order to do so, connect to the mongos. Enter the following commands:
C:\Windows\system32>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:27021
MongoDB shell version: 3.0.4
connecting to: localhost:27021/test
mongos>
Before the shard is added back to the cluster, you need to delete the testdb database:
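A minimal sketch of dropping the database from the mongos console (the response shown is the typical form):
mongos> use testdb
switched to db testdb
mongos> db.dropDatabase()
{ "dropped" : "testdb", "ok" : 1 }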
If you try adding the removed shard without first dropping the testdb database, you will get the following error:
mongos>db.runCommand({addshard: "localhost:27025", allowlocal: true})
{
"ok" : 0,
"errmsg" : "can't add shard localhost:27025 because a local database 'testdb' exists
in another shard0000:localhost:27023"}
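Once testdb has been dropped, the shard can be added back; a sketch consistent with the earlier addshard calls:
mongos> use admin
switched to db admin
mongos> db.runCommand({addshard: "localhost:27025", allowLocal: true})
{ "shardAdded" : "shard0002", "ok" : 1 }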
In order to ensure that all three shards are present in the cluster, run the following command:
mongos> db.runCommand({listshards:1})
{
"shards" : [
{
"_id" : "shard0000",
"host" : "localhost:27023"
}, {
"_id" : "shard0001",
"host" : "localhost:27024"
}, {
"_id" : "shard0002",
"host" : "localhost:27025"
}
], "ok" : 1}
Tagging
By the end of the above steps you have your sharded cluster with a config server, three shards, and a mongos
up and running. Next, in a new terminal window, start a mongos for this example listening on port 30999 and
pointing at the config database at 27022:
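The start command is not reproduced here; a minimal sketch, assuming the same config server at localhost:27022 and the small demonstration chunk size used earlier:
C:\>REM --chunkSize 1 is an assumption carried over from the earlier demonstration setting
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongos --port 30999 --configdb localhost:27022 --chunkSize 1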
Next, start a new terminal window, connect to the mongos, and enable sharding on the collections:
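The commands are not reproduced here; a minimal sketch, assuming a movies database with drama, action, and comedy collections (consistent with the namespaces used below) and a hypothetical numeric shard key field named shardfield:
C:\>cd c:\practicalmongodb\bin
c:\practicalmongodb\bin>mongo localhost:30999
mongos> use admin
switched to db admin
mongos> // "shardfield" is a hypothetical shard key field; substitute the field your data actually uses
mongos> db.runCommand({enablesharding: "movies"})
mongos> db.runCommand({shardcollection: "movies.drama", key: {shardfield: 1}})
mongos> db.runCommand({shardcollection: "movies.action", key: {shardfield: 1}})
mongos> db.runCommand({shardcollection: "movies.comedy", key: {shardfield: 1}})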
Next, insert some data in the collections, using the following sequence of commands:
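A sketch of one way to do this, assuming the hypothetical shardfield key from above; the pad field simply makes the documents large enough to spread across several chunks:
mongos> use movies
switched to db movies
mongos> // illustrative data only
mongos> for (var i = 0; i < 100000; i++) { db.drama.insert({shardfield: Math.random(), pad: new Array(200).join("x")}) }
mongos> for (var i = 0; i < 100000; i++) { db.action.insert({shardfield: Math.random(), pad: new Array(200).join("x")}) }
mongos> for (var i = 0; i < 100000; i++) { db.comedy.insert({shardfield: Math.random(), pad: new Array(200).join("x")}) }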
By the end of the above step you have three shards and three collections with sharding enabled on the
collections. Next you will see how data is distributed across the shards.
Switch to configdb:
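For example, from the mongos console:
mongos> use config
switched to db config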
You can use chunks.find to look at how the chunks are distributed:
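The query pattern is the same one used later for movies.comedy; for example, to check the movies.drama chunks you would run:
mongos> db.chunks.find({ns:"movies.drama"}, {shard:1, _id:0}).sort({shard:1})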
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
mongos>
As you can see, the chunks are pretty evenly spread out amongst the shards. See Figure 7-22.
Next, you will use tags to separate the collections. The intent of this is to have one collection per shard
(i.e. your goal is to have the chunk distribution shown in Table 7-6).
A tag describes a property of a shard, which can be anything; hence you might tag a shard as "slow" or "fast"
or "rack space" or "west coast."
In the following example, you will tag each shard as belonging to one of the collections:
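The tagging commands are not reproduced here; a sketch consistent with the distribution shown below, again assuming the hypothetical shardfield shard key (which shard receives the "dramas" and "actions" tags is an assumption):
mongos> // the tag-to-shard assignment for dramas and actions is assumed for illustration
mongos> sh.addShardTag("shard0000", "dramas")
mongos> sh.addShardTag("shard0001", "actions")
mongos> sh.addShardTag("shard0002", "comedies")
mongos> sh.addTagRange("movies.drama", {shardfield: MinKey}, {shardfield: MaxKey}, "dramas")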
The rule uses MinKey, which means negative infinity, and MaxKey, which means positive infinity.
Hence the above rule marks all of the chunks of the movies.drama collection with the tag "dramas."
Similarly, you will create rules for the other two collections.
Rule 2: All chunks created in the movies.action collection will be tagged as “actions.”
Rule 3: All chunks created in the movies.comedy collection will be tagged as “comedies.”
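Sketches of these two rules, following the same pattern as the first (shardfield remains the hypothetical shard key):
mongos> sh.addTagRange("movies.action", {shardfield: MinKey}, {shardfield: MaxKey}, "actions")
mongos> sh.addTagRange("movies.comedy", {shardfield: MinKey}, {shardfield: MaxKey}, "comedies")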
You need to wait for the cluster to rebalance so that the chunks are distributed based on the tags and
rules defined above. As mentioned, the chunk distribution is an automatic process, so after some time the
chunks will automatically be redistributed to implement the changes you have made.
Next, issue chunks.find to vet the chunks organization:
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
{ "shard" : "shard0001" }
mongos> db.chunks.find({ns:"movies.comedy"}, {shard:1, _id:0}).sort({shard:1})
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
{ "shard" : "shard0002" }
mongos>
Thus the collection chunks have been redistributed based on the tags and rules defined (Figure 7-23).
Next, say you want the comedy chunks to reside on Shard0 instead of Shard2. Add the tag "comedies" to
Shard0 and remove it from Shard2:
mongos> sh.addShardTag("shard0000","comedies")
mongos> sh.removeShardTag("shard0002","comedies")
Next, you add the tag “actions” to Shard2, so that movies.action chunks are spread across Shard2 also:
mongos> sh.addShardTag("shard0002","actions")
Re-issuing the find command after some time will show the following results:
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
{ "shard" : "shard0000" }
mongos>
The chunks have been redistributed reflecting the changes made (Figure 7-24).
Multiple Tags
You can have multiple tags associated with a shard. Let's add two different tags to the shards.
Say you want to distribute the writes based on disk type: one shard has a spinning disk and the other has
an SSD (solid state drive), and you want to redirect 50% of the writes to the shard with the SSD and the
remainder to the one with the spinning disk.
First, tag the shards based on these properties:
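A sketch of the tagging (which physical shard has which disk type is an assumption for illustration):
mongos> // the shard-to-disk mapping is assumed for illustration
mongos> sh.addShardTag("shard0000", "spinning")
mongos> sh.addShardTag("shard0001", "ssd")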
Let’s further assume you have a distribution field of the movies.action collection that you will be using
as the shard key. The distribution field value is between 0 and 1. Next, you want to say, “If distribution < .5,
send this to the spinning disk. If distribution >= .5, send to the SSD.” So you define the rules as follows:
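A sketch of the two tag ranges, assuming distribution is the shard key as stated above (the lower bound of a tag range is inclusive and the upper bound is exclusive, so the split at .5 behaves as described):
mongos> sh.addTagRange("movies.action", {distribution: MinKey}, {distribution: 0.5}, "spinning")
mongos> sh.addTagRange("movies.action", {distribution: 0.5}, {distribution: MaxKey}, "ssd")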
Now documents with distribution < .5 will be written to the spinning-disk shard and the others will be
written to the SSD shard.
With tagging you can control the type of load that each newly added server will get, and you can also tell
MongoDB which chunks go to which node.
For all of this you need knowledge of the data you will be importing into the database; it also depends
on the use case you are aiming to solve and on how your application reads the data. When deciding where
to place a chunk, keep factors such as data locality in mind.
The config database's locks collection shows the distributed locks in the cluster, such as the balancer lock.
To view it, connect to the mongos, switch to the config database, and query the collection:
use config
db.locks.find()
Now let’s look at the possible failure scenarios in MongoDB production deployment and its impact on
the environment.
Scenario 1
A mongos becomes unavailable: The application server whose mongos has gone down will not be able to
communicate with the cluster, but this will not lead to any data loss, since the mongos doesn't maintain any
data of its own. The mongos can be restarted; while restarting, it syncs up with the config servers to cache the
cluster metadata, and the application can then resume its operations normally (Figure 7-26).
Scenario 2
One of the mongod processes of a shard's replica set becomes unavailable: Since you used replica sets to
provide high availability, there is no data loss. If the primary node is down, a new primary is elected, whereas if
it's a secondary node, it is simply disconnected and functioning continues normally (Figure 7-27).
The only difference is that the redundancy of the data is reduced, making the system a little weaker, so you
should check in parallel whether the mongod is recoverable. If it is, it should be recovered and restarted; if
it's non-recoverable, you need to create a new replica set member to replace it as soon as possible.
Scenario 3
One of the shards becomes unavailable: In this scenario, the data on that shard will be unavailable, but the
other shards remain available, so the application is not stopped. The application can continue with its read/
write operations; however, partial results must be dealt with within the application. In parallel, you should
attempt to recover the shard as soon as possible (Figure 7-28).
Scenario 4
Only one config server is available out of three: In this scenario, the cluster metadata becomes read-only.
The cluster continues to serve reads and writes, but it will not perform any operation that changes the
cluster structure, and hence the metadata, such as chunk migration or chunk splitting. The config servers should
be replaced ASAP, because if all config servers become unavailable the cluster becomes inoperable (Figure 7-29).
Summary
In this chapter you covered the core processes and tools, standalone deployment, sharding concepts,
replication concepts, and production deployment. You also looked at how high availability can be achieved.
In the following chapter, you will see how data is stored under the hood, how writes happen using
journaling, what GridFS is used for, and the different types of indexes available in MongoDB.