GCRU 2 NoSQL
Topics:
2.1 Replication and Sharding- MapReduce on Databases
2.2 Distribution Models- Single Server- Sharding- Master-Slave Replication- Peer-to-Peer Replication- Combining Sharding and Replication
2.3 NoSQL Key/Value Databases using MongoDB
2.4 Document Databases
2.5 Features- Consistency
2.6 Transactions- Availability- Query Features- Scaling- Suitable Use Cases
2.7 Event Logging- Content Management Systems- Blogging Platforms
2.8 Web Analytics or Real-Time Analytics- E-Commerce Applications
2.9 When Not to Use- Complex Transactions Spanning Different Operations
2.10 Queries against Varying Aggregate Structure
2.1 Replication and Sharding- MapReduce on Databases
Replication and sharding are two fundamental concepts in distributed databases that help to
improve performance, availability, and scalability.
Replication:
Replication involves creating copies of the same data on multiple servers or nodes. This provides
redundancy and increases data availability and fault tolerance. If one server fails, other replicas
can still serve requests.
● Benefits:
○ Load Balancing: Read requests can be distributed across multiple replicas to
reduce load on the master node.
● Challenges:
○ Consistency: Ensuring that all replicas are synchronized can be difficult.
Replication lag can occur, leading to outdated data on replicas.
○ Write Conflicts: In multi-master setups, write conflicts can arise if two nodes try
to modify the same data at the same time.
Sharding:
Sharding is the process of distributing data across multiple machines or nodes. Each shard holds
a subset of the total data, which helps to balance the load and makes scaling easier as the
database grows.
In practice, systems often combine both replication and sharding to achieve high availability and
scalability. For example:
● Each shard could be replicated, meaning there are multiple copies of each shard on
different servers.
● Sharding helps with distributing the data, while replication helps with redundancy and
load balancing.
Together, these techniques ensure that the database can scale horizontally and remain available,
even in the event of hardware failures.
MapReduce:
MapReduce is a programming model used for processing large data sets in a distributed manner.
It divides the task into two main steps:
1. Map: In this step, the input data is divided into smaller chunks (key-value pairs) and
processed in parallel across multiple nodes. Each mapper processes a portion of the data
and outputs intermediate key-value pairs.
2. Reduce: In this step, the intermediate data from the map phase is grouped by key, and a
reducer performs operations (e.g., aggregation, counting, summing) on these groups to
produce the final result.
While MapReduce is often associated with systems like Hadoop, the principles can also be
applied to database queries, especially for tasks that involve aggregation, filtering, or
summarizing large datasets.
Let's say you have a sales database table with the following structure (unspecified cells are
shown as …):

OrderID ProductID SaleAmount Date Region
1 101 100 … …
2 102 200 … …
3 101 150 … …
4 103 250 2025-01-04 East
5 102 300 … …

Goal:
We want to find out the total sales amount for each ProductID across different regions.
Map Phase:
● The "map" step emits an intermediate (ProductID, SaleAmount) key-value pair for each
row:
● (101, 100)
● (102, 200)
● (101, 150)
● (103, 250)
● (102, 300)
● The system groups all values by the ProductID key, so we can apply a reduce operation
on the same keys.
Reduce Phase:
● The "reduce" step aggregates the values for each key (in this case, sums up the sales
amounts).
● For each key (ProductID), the reducer will sum the associated sales amounts.
Final Result:
ProductID TotalSales
101 250
102 500
103 250
In many distributed databases, the MapReduce model can be simulated using SQL queries for
simpler aggregations. For example, in a SQL database, the equivalent query would be:
SELECT ProductID, SUM(SaleAmount) AS TotalSales
FROM sales
GROUP BY ProductID;
However, MapReduce can be more beneficial when dealing with much larger datasets that
cannot be easily handled by traditional SQL queries, especially on distributed platforms like
Hadoop or in NoSQL databases like MongoDB.
1. Hadoop/Hive: These platforms are designed to use the MapReduce model to process
large-scale data stored in a distributed file system (like HDFS). Hive can use MapReduce
to execute queries on large datasets.
Example in MongoDB (a minimal sketch using the classic mapReduce helper; the collection and
field names are assumptions, and recent MongoDB versions favor the aggregation pipeline
instead):
var mapFn = function () {
  emit(this.ProductID, this.SaleAmount); // key: product, value: sale amount
};
var reduceFn = function (key, values) {
  return Array.sum(values); // total sales for this product
};
db.sales.mapReduce(mapFn, reduceFn, { out: "total_sales_per_product" });
This MapReduce job would output the total sales per product, similar to the earlier example, but
done in a distributed manner, leveraging MongoDB's distributed nature.
Conclusion:
MapReduce provides a powerful way to process large datasets in a distributed fashion. While it
can be simulated using SQL in some databases, the real benefit comes when dealing with
massive amounts of data that require parallel processing, such as in big data environments like
Hadoop or NoSQL databases like MongoDB.
2.2 Distribution Models
Distribution Models refer to the strategies and approaches used to distribute data or workloads
across multiple systems or nodes in distributed computing environments. These models are
essential for achieving scalability, high availability, fault tolerance, and performance in
large-scale systems.
In the context of distributed databases and systems, distribution models determine how data is
stored, accessed, and managed across multiple machines. Below are the key distribution models
commonly used in distributed systems:
1. Data Replication
Data replication involves creating multiple copies of the same data across different nodes or
servers. This approach enhances fault tolerance, load balancing, and availability. Replication is
useful when it’s essential for systems to remain operational even if some parts of the system fail.
● Master-Slave Replication: One node is designated as the master (or primary), and the
other nodes are slaves (or replicas). The master handles write operations, while the slaves
handle read operations. Changes in the master are propagated to the slaves.
● Multi-Master Replication: Multiple nodes can accept both reads and writes. Changes
are synchronized across all masters. This model is more complex but allows greater
flexibility.
● Eventual Consistency: In a distributed system, replication often leads to eventual
consistency, meaning that not all replicas will immediately have the same data. Updates
will propagate over time.
Example Use Case:
● MySQL supports master-slave replication, where the master handles writes and
propagates changes to slave nodes for reads, providing higher scalability for read-heavy
workloads.
2. Sharding
Sharding divides the dataset into smaller, more manageable pieces called shards, each of which
is stored on a different server. This distribution model is particularly useful for handling large
amounts of data and improving system performance by spreading the load across multiple
servers.
Types of Sharding:
● Horizontal Sharding: Data is partitioned by rows. Each shard contains a subset of the total
dataset. For example, a database might store user records on different servers based on
user ID ranges.
● Vertical Sharding: Data is partitioned by columns. This means each shard holds a subset
of columns from the database schema, with different shards containing different attributes
of the data.
● Range-based Sharding: Data is divided into ranges based on some key attribute (e.g.,
user ID or date). Each range is stored in a separate shard.
● Hash-based Sharding: A hash function is applied to a specific key (e.g., user ID) to
determine which shard will hold the data.
Example Use Case:
● MongoDB uses sharding to distribute data across multiple nodes. Each shard holds a
subset of the data based on a chosen sharding key, enabling MongoDB to scale
horizontally as data grows (see the sketch below).
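As a hedged sketch, enabling hash-based sharding on a users collection might look like this in
the mongosh shell (the database, collection, and key names are assumptions):
sh.enableSharding("mydb")                               // allow sharding for the database
sh.shardCollection("mydb.users", { user_id: "hashed" }) // hash-based shard key on user_id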
3. Data Federation
Data federation refers to a model where data remains stored in multiple, possibly heterogeneous,
systems, but queries can be run across those systems as if the data were in a single database. This
approach is commonly used in scenarios where data resides in multiple databases or locations but
needs to be accessed together.
Key Features:
● Single Query Interface: A federation layer allows users or applications to query multiple
data sources using a unified interface.
● Data Virtualization: The federation layer abstracts the underlying data sources, making
them appear as one logical entity to users and applications.
● Heterogeneous Sources: Data sources may use different technologies (e.g., relational
databases, NoSQL, flat files), but federation allows them to interact seamlessly.
Example Use Case:
● Apache Drill or Presto: These tools allow federated queries across a range of databases,
including relational, NoSQL, and file systems, enabling a unified query interface.
4. Peer-to-Peer (P2P) Distribution
In the P2P distribution model, every node in the system has equal responsibility and capabilities.
Nodes are both data producers and consumers, and there is no central server or coordinator. Each
node communicates with other peers directly.
Key Features:
● Decentralization: There is no central server; every node can both store and serve data,
communicating directly with its peers.
Example Use Case:
● BitTorrent is an example of a P2P system used for file sharing, where users directly
exchange data chunks with each other.
5. Cloud-Based Distribution
Cloud distribution involves the use of cloud services to store and distribute data. This model
takes advantage of cloud providers' infrastructure to handle the scaling of storage and compute
resources.
Key Features:
● Elastic Scalability: Cloud systems can scale resources up or down based on demand,
making them highly flexible.
● Global Availability: Cloud-based distribution often provides global coverage by
replicating data across different geographic regions.
● Managed Services: Many cloud providers offer managed distributed database services,
making it easier to handle the complexities of distribution, replication, and scaling.
Example Use Case:
● Amazon DynamoDB offers a cloud-based NoSQL database that automatically scales and
replicates data across multiple regions for high availability and low-latency access.
6. Hybrid Distribution
Hybrid distribution models combine multiple distribution strategies to take advantage of the
strengths of each. For example, a system might use sharding for data partitioning and
replication for fault tolerance and high availability.
Key Features:
● Flexibility: Hybrid models allow organizations to fine-tune their systems to meet both
performance and availability requirements.
● Optimized Data Access: By combining distribution models, organizations can optimize
access patterns, such as ensuring fast access to frequently used data while still
maintaining fault tolerance.
Example Use Case:
● Google Spanner uses a hybrid approach, combining sharding (horizontal partitioning) and
replication across geographically distributed servers for strong consistency and high
availability.
Summary of Distribution Models:
● Data Replication: multiple copies of the same data across nodes (e.g., MySQL master-slave replication).
● Sharding: data partitioned into shards across multiple servers (e.g., MongoDB sharded clusters).
● Data Federation: unified queries across heterogeneous sources (e.g., Apache Drill, Presto).
● Peer-to-Peer (P2P): decentralized, equal responsibility among nodes (e.g., BitTorrent).
● Cloud-Based: managed, elastic distribution on cloud infrastructure (e.g., Amazon DynamoDB).
● Hybrid: combines models such as sharding plus replication (e.g., Google Spanner).
Each of these distribution models addresses different needs, depending on the size, complexity,
and requirements of the system or application being developed. By choosing the right model, you
can optimize performance, reliability, and scalability in your distributed systems.
Combining Sharding and Replication
● Sharding divides a large dataset into smaller, more manageable chunks (shards) based on
a defined sharding key (like user ID, geographic region, etc.). Each shard is stored on a
different server or node.
● Replication ensures that each shard has multiple copies (replicas) across different servers
to improve data availability, fault tolerance, and read performance.
In this combined model, each shard has several replicas, and the data is partitioned across
different machines. This combination helps scale the system horizontally while maintaining high
availability and fault tolerance.
How It Works:
1. Sharding: The database is divided into smaller pieces (shards) by distributing the data
across multiple nodes. Each shard holds a subset of the data. For example, in a user
database, users with IDs from 1 to 100,000 might go to one shard, while IDs from
100,001 to 200,000 go to another shard.
2. Replication: Each shard is replicated to multiple servers (often in master-slave or
master-master configurations). If a primary node for a shard goes down, one of the
replicas can be promoted to handle the traffic. This ensures availability even during
failures.
Let’s assume you have a user database and need to implement both sharding and replication:
● Sharding: The user data is partitioned based on the user_id key. Users with IDs from
1 to 1,000,000 are distributed across 10 shards. Each shard stores a subset of the total
user data.
● Replication: Each of these 10 shards is replicated to 2 or more nodes. For instance, Shard
1 might have:
○ Replica 1 on one server and Replica 2 on a different server, so no single machine
holds the only copy of that shard's data.
This approach ensures that data from the same shard is spread across different servers, allowing
both high availability and load balancing.
Detailed Architecture:
1. Shards and Replicas:
○ If Replica 1 for Shard 1 fails, the system can automatically switch to Replica 2
without downtime, ensuring continuous operation.
○ Similarly, if the primary node for Shard 1 fails, the replicas of Shard 1 can still
serve the data.
Benefits:
1. Performance:
○ Sharding allows the database to scale horizontally, meaning the load is spread
across multiple servers, improving the performance of read and write operations.
○ Replication increases the read capacity by allowing multiple replicas to handle
read queries, which reduces the load on the primary node.
2. High Availability:
○ With replication, even if one node goes down (whether a primary node or replica),
the system can continue to function by redirecting traffic to the remaining
available replicas.
3. Fault Tolerance:
○ The system can tolerate failures without downtime, as replication ensures that
there are always copies of the data available.
4. Scalability:
○ Sharding allows you to distribute data across multiple machines, and you can
keep adding new shards as your dataset grows.
○ Replication ensures that these new shards are also backed up with replicas to
maintain availability and reliability.
Challenges:
1. Choosing a Sharding Key:
○ Deciding how to partition data across shards (e.g., which field to use as a sharding
key) can be difficult. An inefficient sharding key can lead to uneven distribution,
causing some nodes to handle more traffic than others (data hotspots).
2. Consistency:
○ Keeping every replica of a shard synchronized is harder at scale; replication lag
can mean that secondaries briefly serve stale data.
3. Management Complexity:
○ Managing a distributed system with both sharding and replication can become
complex. You need to handle failure recovery, balancing load across replicas, and
ensuring that each shard is evenly distributed.
4. Cross-Shard Queries:
○ Performing operations that involve multiple shards can be complex. For example,
a query that needs to join data across shards (e.g., joining user data with order
data across different shards) may not be as efficient as querying a single machine.
5. Network Latency:
○ With data spread across multiple servers and potentially data centers, network
latency can become an issue, especially if replicas are distributed across
geographically distant locations.
● MongoDB: MongoDB uses both sharding and replication. In MongoDB, data is sharded
across multiple nodes, and each shard has its own replica set. MongoDB’s sharded
clusters provide horizontal scaling, and replica sets ensure high availability.
Summary:
● Sharding divides data into smaller pieces (shards) to distribute the load and improve
scalability.
● Replication creates copies of the data to ensure availability, fault tolerance, and
improved read performance.
● Combining both allows systems to scale out horizontally while maintaining high
availability, fault tolerance, and performance under heavy load.
By combining these models, systems can handle massive datasets and maintain high
performance, even during failures or periods of high demand. However, managing the
complexity of both models requires careful design and configuration to ensure efficiency and
reliability.
2.3 NoSQL Key/Value Databases using MongoDB
NoSQL Key/Value Databases are a type of NoSQL database that store data as key-value pairs,
where each key is unique, and the value can be any type of data, such as strings, numbers, lists,
or even complex objects. They are often used for caching, session management, and storing user
preferences or configurations because of their speed and simplicity in accessing data.
In MongoDB, each document (record) in a collection is stored as a BSON (Binary JSON) object,
which inherently maps well to key-value concepts. The key is the document’s field name (e.g.,
name, age, email), and the value is the data associated with that field (e.g., "John", 30,
"john@example.com").
1. BSON Format: MongoDB stores documents in BSON format, which is a binary
encoding of JSON. This format allows MongoDB to store more complex data types and
makes it easy to treat documents as key-value pairs.
2. Flexibility: MongoDB allows you to store values in any data type (strings, arrays,
embedded documents), making it suitable for a wide range of use cases.
3. Indexed Keys: MongoDB indexes fields in a collection, enabling fast lookups of
key-value pairs, especially when using specific fields as keys.
4. Fast Lookups: MongoDB can quickly access data using the value of a field (acting as the
key) with indexing, much like traditional key-value stores.
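For example, a minimal sketch of creating an index to speed up key lookups (the field name is
an assumption; the _id field is always indexed automatically):
db.user_preferences.createIndex({ userId: 1 }, { unique: true }) // unique index for fast key lookups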
Using MongoDB as a key-value store, the mapping is:
1. Key: The _id field (or any other field you define as unique)
2. Value: The document itself, which can be a simple or complex value.
Example Data:
{
"_id": "user123", // This is the key
"theme": "dark", // This is a key-value pair
"notifications": true,
"language": "en"
}
● The key is the _id field (user123), which is unique to each document.
● The value is the entire document itself: a combination of preferences for that user.
To fetch a key-value pair in MongoDB, you can query the document using the key (e.g., _id):
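A minimal sketch (the collection name is assumed from the Node.js example later in this
section):
db.user_preferences.findOne({ _id: "user123" })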
This query would return the document with _id equal to "user123", providing you the value
associated with that key, which includes the user's preferences.
Output:
{
"_id": "user123",
"theme": "dark",
"notifications": true,
"language": "en"
}
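To fetch only part of the value, you can add a projection (a sketch; the field list follows from
the next sentence):
db.user_preferences.find(
  { _id: "user123" },
  { theme: 1, notifications: 1 }
)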
This would return only the theme and notifications fields for the document with _id
"user123".
In caching scenarios, MongoDB can be used to store values that are fetched with a unique key.
For example, caching user data for quick retrieval:
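A minimal sketch of such a cache entry (the collection and field names are assumptions):
db.cache.insertOne({
  _id: "session123",            // the key
  userId: "user123",            // the value: the user's session data...
  cart: ["item1", "item2"],
  lastAccess: new Date()
})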
Here, "session123" is the key, and the value is the user's session data.
Although MongoDB is primarily a document-oriented database, its functionality and flexibility
allow it to be used like a key-value store in many cases:
1. Schema Flexibility: Unlike traditional key-value stores (e.g., Redis), MongoDB allows
you to store documents of different structures within the same collection. This is useful
when different "values" for different keys need to be stored in different formats.
2. Rich Querying: MongoDB provides powerful querying and indexing features, which
enable more sophisticated lookups compared to traditional key-value stores, such as
querying on nested fields or performing aggregation.
3. Data Persistence: MongoDB is persistent, meaning the data stored in it won’t be lost
when the server is restarted, unlike some in-memory key-value stores like Redis.
4. Horizontal Scalability: MongoDB supports sharding and replication, meaning it can
scale horizontally and handle large datasets while maintaining high availability.
Limitations:
1. Performance: MongoDB might not be as fast as in-memory key-value databases like
Redis when it comes to simple, rapid key-value lookups. MongoDB is disk-based and
optimized for more complex queries and document management, which can result in
slightly slower access times compared to an in-memory solution.
2. Complexity: MongoDB’s advanced querying and document structure may be overkill for
simple use cases where a basic key-value store would suffice (e.g., caching user sessions
or simple configurations).
In Node.js, you can interact with MongoDB using the MongoDB Node.js driver or Mongoose
for object modeling.
A minimal runnable sketch using the official driver (the connection URI and database name are
assumptions):
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017'); // assumed local instance
  await client.connect();
  const db = client.db('mydb'); // assumed database name
  const collection = db.collection('user_preferences');
  // Fetch the value (the whole document) for the key "user123"
  console.log(await collection.findOne({ _id: 'user123' }));
  await client.close();
}
main().catch(console.error);
Output:
{
"_id": "user123",
"theme": "dark",
"notifications": true,
"language": "en"
}
Conclusion:
Although MongoDB is first and foremost a document database, its unique _id keys, secondary
indexes, persistence, and horizontal scalability let it serve many key-value workloads well. For
ultra-low-latency lookups, a dedicated in-memory store such as Redis may still be the better fit.
2.4 Document Databases
A Document Database is a type of NoSQL database designed to store, retrieve, and manage
documents. Unlike relational databases that store data in tables with fixed schemas, document
databases store data in a more flexible, hierarchical format, typically using JSON (JavaScript
Object Notation), BSON (Binary JSON), or similar document-based formats.
Each document is a self-contained unit of data that can have its own structure, and the database
does not enforce a schema across all documents. This flexibility allows document databases to
efficiently handle complex and dynamic data structures.
● Consistency Model: Many document databases relax strict ACID guarantees (Atomicity,
Consistency, Isolation, Durability). This provides better scalability and performance but may
result in eventual consistency rather than immediate consistency in some cases.
1. MongoDB:
○ MongoDB is one of the most popular and widely used document databases. It
stores data in BSON format (a binary version of JSON), which allows for rich
data structures, including arrays and embedded documents. MongoDB supports
powerful querying, indexing, and aggregation features.
○ Use Case: Suitable for applications that need to handle large volumes of
semi-structured data, such as web apps, content management systems (CMS), and
real-time analytics.
2. CouchDB:
○ CouchDB stores data in JSON format. It allows documents to be stored with
different structures within the same database, providing flexibility. CouchDB also
emphasizes eventual consistency and high availability through its MapReduce
indexing and replication mechanisms.
○ Use Case: Used in systems requiring high availability and scalability with the
need for efficient document storage and replication, such as IoT systems or
mobile applications.
3. Couchbase:
○ Couchbase is a distributed NoSQL database that supports both document storage
and key-value store functionality. It uses JSON documents to represent data and
provides advanced features like indexing, full-text search, and analytics.
○ Use Case: Suitable for applications with large-scale data storage needs, such as
customer 360-degree profiles, content management systems, and mobile apps.
4. RavenDB:
○ RavenDB is a document database designed for .NET environments. It stores data
in JSON format and supports rich querying and indexing. RavenDB provides
features like ACID transactions, full-text search, and dynamic queries.
○ Use Case: Typically used in enterprise environments and applications written in
.NET, such as business applications and customer relationship management
(CRM) systems.
5. Amazon DocumentDB:
○ Amazon DocumentDB is a fully managed document database service designed to
be compatible with MongoDB. It is optimized for use on Amazon Web Services
(AWS) and provides the scalability, availability, and security benefits of AWS.
○ Use Case: Useful for users already working within the AWS ecosystem who need
a scalable, managed document database compatible with MongoDB applications.
2.5 Features- Consistency
For example, a user document in MongoDB might look like this:
{
"_id": "user123",
"name": "John Doe",
"email": "johndoe@example.com",
"address": {
"street": "123 Main St",
"city": "Somewhere",
"state": "CA",
"zipcode": "90210"
},
"roles": ["admin", "editor"],
"last_login": "2025-02-06T10:00:00Z",
"is_active": true
}
● The _id field serves as the primary key, and it is unique for every document.
● The address field is a nested object (document inside a document), which contains
multiple properties.
● The roles field is an array of values.
● MongoDB allows these nested structures to be stored within the same document, offering
a flexible, hierarchical data model.
If you want to retrieve all users with the role "admin", you can perform a query:
db.users.find({ roles: "admin" })
This would return documents where the roles array contains the value "admin". MongoDB’s
query language allows querying nested fields as well, such as address.city or
last_login.
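For instance, a sketch of querying a nested field from the document above:
db.users.find({ "address.city": "Somewhere" })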
4. Horizontal Scalability:
○ Document databases are designed to scale out, so the system stays highly
available and capable of handling large datasets distributed across multiple
machines.
5. Built-in Indexing:
○ Document databases typically allow indexing on various fields, including nested
fields. This makes queries faster and more efficient, particularly when retrieving
documents based on specific attributes.
Conclusion:
Document databases are a powerful and flexible tool for modern applications, especially when
dealing with semi-structured or evolving data. They excel in use cases where schema flexibility
and scalability are important, such as web applications, content management systems, and
real-time analytics. However, they may not be the best choice for applications that require
complex relational data or strong consistency guarantees across multiple operations.
Consistency in Document Databases
Consistency in a document database refers to how the database ensures that all nodes or
replicas in a distributed system reflect the same data after any update or write operation. In
NoSQL systems like document databases, consistency is typically balanced with availability and
partition tolerance (as per the CAP theorem), which means that you may have to trade-off
strict consistency for better availability or fault tolerance.
Let’s break down the concept of consistency in document databases and how it can vary
depending on the system.
Document databases offer different consistency models, often depending on the configuration
(e.g., single-node vs. multi-node, replica sets, or sharding). These models generally align with
ACID (Atomicity, Consistency, Isolation, Durability) or BASE (Basically Available, Soft state,
Eventually consistent) principles.
1. Strong Consistency:
● Strong consistency ensures that every read reflects the most recent write. After an update
operation, the system guarantees that all clients will see the same data at any given time,
no matter which replica or shard they access.
● Example: If you update a document (e.g., a user’s profile), subsequent reads (even from
different replicas) will immediately reflect the updated profile.
2. Eventual Consistency:
● Eventual consistency means that, given enough time, all replicas will converge to the
same data, but during that time, some replicas may be out of sync.
● Example: If you update a document, some users may still see the old version of the
document for a brief period until the replicas sync.
● Eventual consistency is generally used in systems that prioritize availability and
partition tolerance over strong consistency, such as large distributed databases.
3. Tunable Consistency:
● Tunable consistency allows you to configure the trade-off between consistency and
availability. For example, you can decide how many replicas need to acknowledge a write
before it is considered successful. MongoDB provides options for this.
○ Write Concern: Determines how many replica sets must acknowledge a write
operation before it is considered successful.
○ Read Concern: Determines the consistency level for read operations, such as
whether you want to read from the most recent write or accept data that may still
be inconsistent.
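For example, a hedged sketch of tuning a write (the collection and document are assumptions):
db.orders.insertOne(
  { orderId: 1, status: "new" },
  { writeConcern: { w: "majority" } } // wait for a majority of replica set members
)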
MongoDB, as one of the most popular document databases, offers flexibility in terms of
consistency, and it can be configured to provide different levels of consistency depending on the
use case. Here are a few important concepts in MongoDB:
Replica Sets:
In MongoDB, replica sets are used to provide high availability and data redundancy by
maintaining copies of the same data across multiple nodes. A replica set consists of a primary
node and one or more secondary nodes. The primary node handles all write operations, while the
secondary nodes replicate the primary's data.
● Strong Consistency (Single-Document Operations):
○ In MongoDB, write operations on a single document (e.g., updating a user profile)
are strongly consistent by default. Once the write operation is acknowledged by
the primary node, it is immediately available for subsequent reads.
○ MongoDB's default consistency ensures that if you write to the primary node, the
data is immediately visible to any subsequent reads from that primary.
● Write Concern:
○ In MongoDB, write concern defines how many replica nodes must acknowledge
a write operation before it is considered successful. The default write concern is
w:1, meaning only the primary node must acknowledge the write.
○ You can configure a higher write concern, such as w:majority, to ensure that
the majority of replica nodes acknowledge the write, improving data consistency
across the system.
● Read Concern:
○ Read concern controls the consistency of the data returned by a read. For
example, "local" (the default) returns the node's latest data, which may later
be rolled back, while "majority" returns only data acknowledged by a majority
of replica set members.
Example of read concern (using the cursor's readConcern helper):
db.collection.find({ name: "John Doe" }).readConcern("majority");
MongoDB uses a feature known as oplog (operation log) in replica sets, where the primary node
records all write operations, and secondary nodes apply those operations to maintain consistency.
However, in a distributed system, there might be replication lag between the primary and
secondary nodes, leading to the possibility of reading stale data from secondaries.
● To ensure stronger consistency, MongoDB can configure read preferences to read from
the primary node only (strong consistency), or to allow reads from secondaries, which
may return stale data but improve availability.
Consistency in Sharded Clusters:
In a sharded MongoDB cluster, data is distributed across multiple nodes. Sharding introduces the
challenge of maintaining consistency across distributed shards. MongoDB ensures consistency
within each shard, but the consistency across multiple shards depends on the configuration of the
cluster and consistency level chosen.
MongoDB allows multi-document transactions in sharded clusters (since version 4.0), which
provide ACID guarantees across multiple documents in different shards.
Choosing a Consistency Model:
1. Strong Consistency:
○ Best suited for applications where it’s critical that users always see the latest data
after updates. Examples include financial systems, inventory management, and
accounting systems.
2. Eventual Consistency:
○ Best suited for applications where high availability and fault tolerance are
prioritized over strict consistency. Examples include social media platforms,
content delivery networks, and IoT systems.
3. Tunable Consistency:
○ Some applications may need tunable consistency, where you can adjust the
balance between consistency and performance based on specific requirements.
For example, in e-commerce platforms, product inventory could benefit from
eventual consistency, but payment transactions must be strongly consistent.
Let’s say we have an e-commerce platform where users can purchase products. We might want to
ensure strong consistency when updating inventory quantities but can tolerate eventual
consistency in other cases, such as viewing product reviews.
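For instance, a sketch of the strongly consistent inventory update (the collection, filter, and
field names are assumptions):
db.inventory.updateOne(
  { productId: "p123" },
  { $inc: { quantity: -1 } },
  { writeConcern: { w: "majority" } } // acknowledged by a majority before success
)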
● The inventory update will be acknowledged by the majority of the replica set before it's
considered successful, ensuring that all users accessing the system will see the same
inventory level.
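And a sketch of the relaxed read for product reviews (the collection name and read preference
are assumptions):
db.reviews.find({ productId: "p123" }).readPref("secondary") // may return slightly stale data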
● This query can return stale data (e.g., product reviews from a secondary node), which is
acceptable in some cases like reading user reviews, where absolute consistency isn't
critical.
Conclusion
In document databases, consistency models are configurable and vary based on your
requirements for data availability, fault tolerance, and how up-to-date the data must be.
● Strong consistency ensures that all clients always see the most recent write.
● Eventual consistency trades off immediate consistency for better availability and
partition tolerance.
● Tunable consistency, as seen in MongoDB, allows you to adjust the consistency level
depending on the use case.
Choosing the right consistency model for your application is critical, and it depends on factors
such as how often data is updated, how critical it is for users to see the latest data, and the overall
design of your distributed system.
2.6 Transactions- Availability- Query Features- Scaling- Suitable Use Cases
In MongoDB, transactions allow you to execute multiple operations in a way that ensures all
operations are completed successfully or none at all. This is part of MongoDB’s support for
ACID transactions (Atomicity, Consistency, Isolation, Durability) introduced in version 4.0 for
replica sets and version 4.2 for sharded clusters.
Let’s say we have two collections: accounts and transactions. We want to transfer
money from one account to another. The operation involves two steps:
The goal is to ensure both operations are atomic, meaning that if one operation fails, neither
operation should be committed.
Here’s how to implement this in MongoDB using the Node.js MongoDB Driver:
1. Setup
You need to have MongoDB running and a Node.js application. Install MongoDB’s Node.js
driver:
npm install mongodb
2. Code Example
Here is the code that demonstrates how to implement a money transfer transaction between two
accounts (the connection URI, database name, and audit-record fields are assumptions):

const { MongoClient } = require("mongodb");

async function main() {
  const client = new MongoClient("mongodb://localhost:27017"); // assumed URI; requires a replica set
  try {
    await client.connect();
    const session = client.startSession(); // Start a session for the transaction
    const db = client.db("bank"); // assumed database name
    const accounts = db.collection("accounts");
    const transactions = db.collection("transactions");

    // Start a transaction
    session.startTransaction();

    try {
      // Perform the money transfer
      const senderId = "user1";
      const receiverId = "user2";
      const amount = 100;

      await accounts.updateOne(
        { accountId: senderId },
        { $inc: { balance: -amount } },
        { session }
      );

      await accounts.updateOne(
        { accountId: receiverId },
        { $inc: { balance: amount } },
        { session }
      );

      // Log the transfer for audit purposes (field names assumed)
      await transactions.insertOne({
        from: senderId,
        to: receiverId,
        amount,
        date: new Date()
      }, { session });

      // Commit the transaction to make all changes permanent
      await session.commitTransaction();
      console.log("Transaction successful!");
    } catch (error) {
      // If any operation fails, abort the transaction
      console.error("Transaction failed: ", error);
      await session.abortTransaction();
    } finally {
      session.endSession();
    }
  } finally {
    await client.close();
  }
}

main().catch(console.error);
Explanation:
1. Connect to MongoDB:
○ We create a MongoClient and connect to the server.
2. Start a Session and Transaction:
○ client.startSession() creates the session that carries the transaction’s
context, and session.startTransaction() begins the transaction.
3. Perform the Operations:
○ We deduct from the sender’s account and add to the receiver’s account using the
updateOne method (a real implementation would first check that the sender
has sufficient funds).
○ We also log the transaction into a separate transactions collection for audit
purposes.
4. Commit the Transaction:
○ If every operation succeeds, session.commitTransaction() makes all
the changes permanent.
5. Abort on Failure:
○ If any operation fails (e.g., insufficient funds, receiver not found), we catch the
error and call session.abortTransaction() to roll back all changes
made during the transaction.
6. End Session:
○ session.endSession() releases the session whether the transaction
committed or aborted.
Key Concepts:
● Atomicity: If an error occurs at any point in the transaction (e.g., insufficient funds,
receiver not found), MongoDB ensures that all changes are rolled back, and no partial
updates are made.
● Session: MongoDB transactions use sessions to maintain context for the transaction.
● Commit and Abort: After performing the operations, you either commit to make the
changes permanent or abort if something goes wrong.
Important Notes:
● Replica Sets: Transactions are only supported in replica sets (not standalone MongoDB
instances). If you're working with a sharded cluster, make sure you have MongoDB
version 4.2 or later, as this version added support for distributed transactions across
shards.
● Performance Consideration: While transactions are useful for maintaining data
consistency, they can introduce overhead. Use transactions only when necessary for
ensuring data integrity.
Example Output:
Transaction successful!
Conclusion:
This simple example demonstrates how to use MongoDB transactions to ensure atomicity and
consistency when performing multiple operations across different collections. Transactions are
essential for maintaining data integrity in scenarios such as money transfers, inventory updates,
or order processing where multiple operations must succeed or fail together.
MongoDB feature summary:

Availability:
● High availability through replica sets that provide failover if the primary node goes down.
● Can achieve automatic failover with replica sets.
● Tunable consistency options, with trade-offs between consistency and availability.

Query Features:
● Supports rich query capabilities like filtering, sorting, and aggregation.
● Offers indexing for efficient searches.
● Full-text search is available (using MongoDB Atlas or text indexes).
● Join operations through $lookup for handling relationships between documents (like a "left join" in SQL).

Scaling:
● Horizontal scaling through sharding for distributing data across multiple servers.
● Auto-sharding helps in managing large datasets by partitioning them across different machines.
● Vertical scaling (increasing server capacity) is also supported for better performance.

Suitable Use Cases:
● Content Management Systems (CMS): flexibility with schema changes for evolving content.
● Real-time Analytics: with powerful aggregation pipelines, MongoDB is suitable for real-time data processing.
● IoT Applications: MongoDB can efficiently handle large-scale, unstructured, and semi-structured data.
● E-commerce Platforms: can handle user profiles, product catalogs, and inventory systems.
● Social Media Platforms: scalable architecture for storing large volumes of user-generated content and interactions.
Summary:
● Availability: MongoDB ensures high availability through replica sets, making it resilient
to server failures. The trade-off is flexibility in consistency (eventual vs. strong
consistency).
● Query Features: MongoDB offers a powerful query language with support for
aggregation, filtering, and even joining collections (using $lookup), although it’s not as
feature-rich in joins as relational databases.
● Scaling: MongoDB excels in horizontal scaling with sharding, which allows it to
handle very large datasets across many servers.
● Use Cases: Ideal for applications that require flexibility in schema design (e.g., CMS,
IoT, social platforms) and need to scale as the data grows.
2.7 Event Logging- Content Management Systems- Blogging Platforms
Here's a table summarizing how MongoDB applies to different use cases like Event Logging,
Content Management Systems, and Blogging Platforms:
Event Data:
● Event Logging: Event logs can be stored as documents, each representing an event with properties like timestamp, event type, and metadata. Logs can be stored at scale with high availability, allowing for quick retrieval and analysis.
● CMS: MongoDB stores dynamic content (articles, posts, media) in documents, and allows embedding content like images, videos, and metadata in a single document for ease of retrieval.
● Blogging Platforms: Blog posts and metadata (e.g., titles, tags, publish dates) are stored as documents. MongoDB allows embedding or linking of content like comments, media, and user activity.

Scaling:
● Event Logging: Sharding enables horizontal scaling of event log data, which is essential for high-volume logging scenarios. MongoDB's ability to handle large amounts of incoming event data and scale horizontally is critical.
● CMS: CMS platforms benefit from MongoDB's ability to scale horizontally using sharding as the number of articles or media grows.
● Blogging Platforms: MongoDB scales well for growing platforms, supporting thousands of blog posts and user interactions (comments, likes).

Suitable Use Cases:
● Event Logging: Application logs (storing logs for apps or systems, including user actions, errors, and system events) and real-time analytics (tracking and analyzing user behavior, system status, or security events).
● CMS: Dynamic content management (managing and organizing content, including articles, media, users, and permissions); ideal for multi-channel content publishing (web, mobile, etc.).
● Blogging Platforms: Personal blogs (handling blog posts and user interactions), multi-author blogs (managing multiple writers, editors, and categories), and content-based platforms (storing posts, media, and metadata for large-scale platforms).
Summary:
Event Logging:
● MongoDB is a great fit for event logging systems due to its high write throughput,
ability to store time-series data, and flexible schema-less design. It can easily scale
horizontally with sharding, ensuring reliable storage of massive amounts of event data.
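For example, a sketch of a single event-log document (the collection and field names are
assumptions):
db.events.insertOne({
  type: "user_login",
  userId: "user123",
  timestamp: new Date(),
  metadata: { ip: "203.0.113.5", device: "mobile" }
})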
Content Management Systems (CMS):
● MongoDB’s flexibility and schema-less design are well-suited for managing dynamic
and diverse content, allowing for easy evolution of content models. It can scale efficiently
as content volume grows, with replica sets ensuring high availability for CMS platforms.
Blogging Platforms:
● MongoDB is perfect for blogging platforms due to its ability to handle content, user
interactions (comments, likes), and metadata efficiently. Its flexibility supports evolving
content formats, and the scalability makes it ideal for growing platforms with high traffic
and interaction volumes.
MongoDB's strengths in scalability, flexibility, and performance make it a top choice for these
use cases. Whether you're managing logs, content, or blog posts, MongoDB allows you to store,
query, and scale data with ease.
2.8 Web Analytics or Real-Time Analytics- E-Commerce Applications
Here’s a table summarizing the suitability of MongoDB for Web Analytics or Real-Time
Analytics and E-Commerce Applications:
Scalability:
● Web/Real-Time Analytics: Horizontal scalability via sharding allows MongoDB to handle large amounts of real-time event data at high speeds, enabling large-scale web analytics platforms. Auto-sharding helps distribute traffic across multiple servers, which is critical for managing data across global users.
● E-Commerce: MongoDB's sharding capability helps scale e-commerce platforms as they grow, handling millions of product listings and user transactions. Horizontal scaling allows the system to support high traffic, especially during peak times (e.g., Black Friday sales).

Flexibility:
● Web/Real-Time Analytics: Web analytics systems often evolve in terms of the data collected (e.g., adding new user interaction events). MongoDB's flexibility allows you to store new types of events without schema changes.
● E-Commerce: E-commerce data models often change as new products, promotions, and categories are added. MongoDB's flexible schema allows rapid adaptation to new data types and formats.

Reliability & Availability:
● Web/Real-Time Analytics: MongoDB's replica sets ensure high availability of real-time analytics data, so analytics platforms remain up even during node failures, with built-in data redundancy.
● E-Commerce: Replica sets ensure that e-commerce platforms maintain high availability, even during high traffic times, and provide automatic failover to maintain platform uptime.
Summary:
Web Analytics / Real-Time Analytics:
● MongoDB is a great fit for web analytics due to its ability to handle high volumes of
event-driven data in real-time. Its flexible schema allows it to store diverse event data
(e.g., user clicks, page views, sessions) and adapt to new data types. Aggregation
pipelines enable powerful data analysis, while sharding and replica sets ensure
scalability and availability for global traffic.
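For instance, a hedged sketch of an aggregation pipeline counting page views per page (the
collection and field names are assumptions):
db.events.aggregate([
  { $match: { type: "page_view" } },                 // keep only page-view events
  { $group: { _id: "$page", views: { $sum: 1 } } },  // count views per page
  { $sort: { views: -1 } }                           // most-viewed pages first
])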
E-Commerce Applications:
● MongoDB is also ideal for e-commerce applications, where it can handle dynamic data
models such as product catalogs, user profiles, and orders. The ability to store and
query complex, nested data (like shopping cart contents) in a flexible,
document-oriented format is key. Scalability through sharding and high availability
through replica sets make it well-suited for handling high traffic during sales events or
peak times.
MongoDB's real-time analytics and scalable data storage features make it a suitable choice for
both analytics platforms and e-commerce systems that need to handle high traffic, dynamic
data, and rapid scaling.
2.9 When Not to Use- Complex Transactions Spanning Different Operations
In MongoDB, there are cases where its design and capabilities may not be suitable for certain use
cases, particularly when it comes to complex transactions spanning different operations.
Below are some key scenarios where MongoDB may not be the best choice for complex
transactions:
1. Transactions Spanning Multiple Collections or Databases
MongoDB supports transactions in replica sets and sharded clusters (from version 4.0+), but
they are not as robust as traditional relational database transactions. Complex transactions
that span multiple collections or multiple databases may become inefficient or challenging to
implement.
● Issue: If you need to perform multiple operations across different collections or databases
with full ACID (Atomicity, Consistency, Isolation, Durability) guarantees, MongoDB’s
transaction management may not be as performant as traditional relational databases.
● Why: Transactions involving multiple collections or databases can lead to higher
latency, increased complexity, and potential performance bottlenecks in a sharded
cluster.
2. When You Need Strict Strong Consistency Across Operations
MongoDB is designed with a CAP Theorem trade-off, where availability and partition
tolerance are prioritized over strict consistency (depending on the setup). This means MongoDB
supports eventual consistency in many scenarios, and strong consistency is only possible under
certain configurations (e.g., using replica sets with read concern settings).
● Issue: If your application requires strict ACID compliance with strong consistency
across operations, especially when transactions span across multiple operations or
databases, MongoDB may not guarantee the level of consistency you need in certain
scenarios.
● Why: MongoDB’s default consistency model may lead to replication lag or delayed
consistency in some sharded environments.
3. When You Need Complex Joins or Foreign Key Constraints
MongoDB is a document-based NoSQL database, meaning it does not have the same features
as relational databases, such as complex joins or foreign key constraints. While MongoDB
does support $lookup (similar to SQL joins) for aggregations, it is not optimized for complex
join operations spanning multiple collections or databases.
● Issue: If your application requires complex joins between multiple collections, especially
where there are strong relationships between data, the lack of built-in foreign key
constraints and complex joins in MongoDB could make the design and maintenance more
complex and inefficient.
● Why: Complex operations like cascading updates, foreign key constraints, and data
integrity checks are often better handled in relational databases, and MongoDB’s lack of
strong relational integrity mechanisms can lead to data inconsistency and complicated
logic in the application layer.
4. When You Need to Ensure Full ACID Compliance Across Multiple Steps
While MongoDB supports multi-document transactions since version 4.0, the system was
originally designed for high-performance, distributed, and eventually consistent workloads,
not for multi-step ACID transactions involving various operations on different data sets.
● Issue: If your application involves multiple steps that require strong ACID guarantees
across a series of operations (like transferring money between different accounts, with
updates to multiple documents across multiple collections), using MongoDB for such use
cases could result in unexpected behavior or performance degradation.
● Why: MongoDB’s transaction support has overhead compared to traditional RDBMSs
that were built specifically to handle complex ACID transactions. This overhead could
lead to slowdowns or scaling issues.
5. When You Need Advanced Concurrency Control
Relational databases often offer advanced concurrency control mechanisms like serializable
isolation levels (to handle conflicts and data consistency in high-concurrency situations). While
MongoDB supports isolation levels for transactions, its concurrency control is not as fine-tuned
as that of relational databases.
● Issue: If your application requires fine-grained control over data consistency, deadlock
management, or complex conflict resolution in highly concurrent environments,
MongoDB may not provide the same level of reliability and control.
● Why: MongoDB’s default isolation levels (e.g., read committed) may not fully prevent
conflicting writes and dirty reads in complex multi-operation transactions. For example,
if you need to ensure serializability across complex transactions, MongoDB may not be
as effective as relational databases with advanced isolation.
6. When You Need Rigid Schema Enforcement and Integrity Constraints
Relational databases offer strong schema enforcement and the ability to define integrity
constraints like primary keys, foreign keys, and unique constraints. These are essential for
enforcing data integrity in complex applications where data consistency must be guaranteed at
all times.
● Issue: MongoDB’s schema-less design allows for flexibility, but this flexibility comes at
the cost of data integrity enforcement and schema validation.
● Why: If your application requires rigid data validation (e.g., through foreign key
relationships, cascading updates, and enforcement of complex business rules), MongoDB
may require more manual work or application logic to enforce these constraints.
Conclusion:
When Not to Use MongoDB for Complex Transactions:
● When transactions span across multiple collections or databases and require tight
ACID guarantees.
● When you need strong consistency or complex joins across related datasets.
● When your application is heavily dependent on complex, multi-step ACID transactions
or advanced concurrency control.
● When you require rigid schema enforcement and integrity constraints like foreign
keys or cascading operations.
In such cases, consider a traditional RDBMS or a hybrid approach that combines MongoDB
with relational systems where transactional integrity is critical.
2.10 Queries against Varying Aggregate Structure
Documents in a schema-flexible collection do not all share the same structure, which can make
queries and aggregations harder to write. However, MongoDB provides several powerful query
and aggregation capabilities to deal with this issue. Below is an explanation of the problem and
how MongoDB's features can address it.
The Problem:
Traditional relational databases, with their rigid schemas and foreign key constraints, don't face
structural variability. In MongoDB, by contrast, each document in a collection can have a
different structure, and that structure may change over time. This flexibility can lead to
challenges in querying and aggregating data, especially if you need to run queries on documents
that contain different fields or nested structures.
For example:
● One document might have a field called address, and another might have a nested object
location instead.
● Some documents might have a price field, while others have cost.
● Aggregation operations across documents with varying fields can become tricky,
especially if some documents have nested fields while others do not.
MongoDB provides several tools and techniques to deal with varying aggregate structures in
its documents:
1. Aggregation Framework
MongoDB’s aggregation framework allows for complex data transformation and querying even
when documents have varying structures. Key features include:
● $project: Used to reshape or modify fields, which helps normalize documents with
varying structures for querying.
● $addFields and $set: These can add new fields or modify existing ones during
aggregation.
● $unwind: Used to handle arrays or nested objects by flattening them into individual
documents, making them easier to aggregate.
● $ifNull: Can provide fallback values when certain fields don’t exist, allowing you to
handle missing fields during aggregation.
2. Normalizing Fields with $project
When documents have varying fields, the $project stage can be used to ensure that all documents
in a query return the same set of fields, even if some fields are missing or have different names.
Example:
db.orders.aggregate([
{
$project: {
orderId: 1,
totalAmount: {
$ifNull: ["$price", "$cost"] // If 'price' is missing, fall back to 'cost'
},
status: 1,
}
}
])
In this example:
● $ifNull returns the price field when it exists and falls back to cost otherwise, so every
document exposes a uniform totalAmount field.
3. Conditional Logic with $cond
For documents that may have different aggregate structures (i.e., different fields or nested data),
the $cond operator allows you to apply conditional logic based on whether a field exists or meets
a condition.
Example:
db.products.aggregate([
{
$project: {
itemName: 1,
itemPrice: {
$cond: {
if: { $ne: [{ $type: "$price" }, "missing"] }, // true when 'price' exists
then: "$price",
else: "$cost" // Fallback to 'cost' if 'price' is missing
}
}
}
}
])
Here:
● $cond checks whether price exists (its $type is not "missing"); if so it uses price,
otherwise it falls back to cost.
4. Flattening Arrays with $unwind
When you have arrays or nested data in your documents, you can use $unwind to flatten those
structures and make them easier to aggregate.
Example:
db.sales.aggregate([
{ $unwind: "$products" }, // Flatten the array of products
{
$group: {
_id: "$products.category",
totalSales: { $sum: "$products.sales" }
}
}
])
In this example, the products field is an array. The $unwind stage flattens each item in the array
into its own document so that we can aggregate sales based on product category.
5. Joining Collections with $lookup
When documents with varying structures are scattered across multiple collections, the $lookup
operator can help join data from different collections even if their structures are different.
Example:
db.orders.aggregate([
{
$lookup: {
from: "customers", // Join the 'customers' collection
localField: "customerId",
foreignField: "_id",
as: "customerInfo"
}
},
{
$unwind: "$customerInfo" // Flatten the customer data
},
{
$project: {
orderId: 1,
customerName: "$customerInfo.name",
totalAmount: {
$ifNull: ["$price", "$cost"] // Handle varying fields
}
}
}
])
In this case:
● We are performing a join ($lookup) between the orders and customers collections.
● The customerInfo is an array, so we use $unwind to flatten it.
● We also handle the varying fields (price, cost) using $ifNull.
6. Handling Missing Fields with $ifNull and $exists
You can check if a field exists before using it, or provide a fallback value for missing fields using
$ifNull. If a field might be missing in some documents, you can check for its existence and
return a default value or handle the condition accordingly.
Example:
db.transactions.aggregate([
{
$project: {
transactionId: 1,
totalAmount: {
$ifNull: ["$paymentAmount", 0] // Use 0 if 'paymentAmount' is missing
},
description: 1,
}
}
])
In this case:
● Documents that are missing paymentAmount return a totalAmount of 0 instead of a
missing field, keeping the output uniform across documents.
Key Takeaways:
1. Flexibility: MongoDB allows you to structure your data however you want, but this
flexibility can lead to varying document structures. You can handle this variability by
using MongoDB’s aggregation framework with operators like $ifNull, $cond, and
$project to deal with missing or differently named fields.
2. Aggregation Pipeline: The aggregation framework is particularly helpful when working
with varying document structures. It enables you to shape the data, filter out unwanted
fields, and apply logic to handle missing or mismatched data.
3. Conditional Logic: MongoDB provides powerful conditional operators like $cond,
$ifNull, and $exists to handle missing fields and adjust how documents are aggregated or
queried based on their content.
4. Flattening Arrays: The $unwind operator is useful for flattening nested arrays or
objects, which helps in aggregating data from varying structures.
5. Real-Time Adaptability: With MongoDB’s schema-less design, your application can
handle evolving data structures over time. However, it’s important to use proper
aggregation techniques to manage varying data effectively in queries.
By leveraging MongoDB’s aggregation features and conditional operators, you can efficiently
query and aggregate data, even when the document structures vary.