GCRU 2 NoSQL
Topics:
2.1 Replication and Sharding- MapReduce on Databases
2.2 Distribution Models- Single Server- Sharding- Master-Slave Replication- Peer-to-Peer Replication- Combining Sharding and Replication
2.3 NoSQL Key/Value Databases using MongoDB
2.4 Document Databases
2.5 Features- Consistency
2.6 Transactions- Availability- Query Features- Scaling- Suitable Use Cases
2.7 Event Logging- Content Management Systems- Blogging Platforms
2.8 Web Analytics or Real-Time Analytics- E-Commerce Applications
2.9 When Not to Use- Complex Transactions Spanning Different Operations
2.10 Queries against Varying Aggregate Structure
2.1 Replication and Sharding- MapReduce on Databases
Replication and sharding are two fundamental concepts in distributed databases that help to
improve performance, availability, and scalability.
Replication:
Replication involves creating copies of the same data on multiple servers or nodes. This provides
redundancy and increases data availability and fault tolerance. If one server fails, other replicas
can still serve requests.
● Benefits:
○ Load Balancing: Read requests can be distributed across multiple replicas to
reduce load on the master node.
● Challenges:
○ Consistency: Ensuring that all replicas are synchronized can be difficult.
Replication lag can occur, leading to outdated data on replicas.
○ Write Conflicts: In multi-master setups, write conflicts can arise if two nodes try
to modify the same data at the same time.
Sharding:
Sharding is the process of distributing data across multiple machines or nodes. Each shard holds
a subset of the total data, which helps to balance the load and makes scaling easier as the
database grows.
In practice, systems often combine both replication and sharding to achieve high availability and
scalability. For example:
● Each shard could be replicated, meaning there are multiple copies of each shard on
different servers.
● Sharding helps with distributing the data, while replication helps with redundancy and
load balancing.
Together, these techniques ensure that the database can scale horizontally and remain available,
even in the event of hardware failures.
MapReduce:
MapReduce is a programming model used for processing large data sets in a distributed manner.
It divides the task into two main steps:
1. Map: In this step, the input data is divided into smaller chunks (key-value pairs) and
processed in parallel across multiple nodes. Each mapper processes a portion of the data
and outputs intermediate key-value pairs.
2. Reduce: In this step, the intermediate data from the map phase is grouped by key, and a
reducer performs operations (e.g., aggregation, counting, summing) on these groups to
produce the final result.
While MapReduce is often associated with systems like Hadoop, the principles can also be
applied to database queries, especially for tasks that involve aggregation, filtering, or
summarizing large datasets.
Let's say you have a sales database table with the following structure (unspecified cells are
shown as …):

OrderID ProductID SaleAmount Date Region
1 101 100 … …
2 102 200 … …
3 101 150 … …
4 103 250 2025-01-04 East
5 102 300 … …

Goal:
We want to find out the total sales amount for each ProductID across different regions.
Map Phase:
● The "map" step emits an intermediate (ProductID, SaleAmount) key-value pair for each
row:
● (101, 100)
● (102, 200)
● (101, 150)
● (103, 250)
● (102, 300)
● The system groups all values by the ProductID key, so we can apply a reduce operation
on the same keys.
Reduce Phase:
● The "reduce" step aggregates the values for each key (in this case, sums up the sales
amounts).
● For each key (ProductID), the reducer will sum the associated sales amounts.
Final Result:
ProductID TotalSales
101 250
102 500
103 250
In many distributed databases, the MapReduce model can be simulated using SQL queries for
simpler aggregations. For example, in a SQL database, the equivalent query would be:
SELECT ProductID, SUM(SaleAmount) AS TotalSales
FROM sales
GROUP BY ProductID;
However, MapReduce can be more beneficial when dealing with much larger datasets that
cannot be easily handled by traditional SQL queries, especially on distributed platforms like
Hadoop or in NoSQL databases like MongoDB.
1. Hadoop/Hive: These platforms are designed to use the MapReduce model to process
large-scale data stored in a distributed file system (like HDFS). Hive can use MapReduce
to execute queries on large datasets.
Example in MongoDB (a minimal sketch using the classic mapReduce helper; the collection and
field names are assumptions, and recent MongoDB versions favor the aggregation pipeline
instead):
var mapFn = function () {
  emit(this.ProductID, this.SaleAmount); // key: product, value: sale amount
};
var reduceFn = function (key, values) {
  return Array.sum(values); // total sales for this product
};
db.sales.mapReduce(mapFn, reduceFn, { out: "total_sales_per_product" });
This MapReduce job would output the total sales per product, similar to the earlier example, but
done in a distributed manner, leveraging MongoDB's distributed nature.
Conclusion:
MapReduce provides a powerful way to process large datasets in a distributed fashion. While it
can be simulated using SQL in some databases, the real benefit comes when dealing with
massive amounts of data that require parallel processing, such as in big data environments like
Hadoop or NoSQL databases like MongoDB.
2.2 Distribution Models
Distribution Models refer to the strategies and approaches used to distribute data or workloads
across multiple systems or nodes in distributed computing environments. These models are
essential for achieving scalability, high availability, fault tolerance, and performance in
large-scale systems.
In the context of distributed databases and systems, distribution models determine how data is
stored, accessed, and managed across multiple machines. Below are the key distribution models
commonly used in distributed systems:
1. Data Replication
Data replication involves creating multiple copies of the same data across different nodes or
servers. This approach enhances fault tolerance, load balancing, and availability. Replication is
useful when it’s essential for systems to remain operational even if some parts of the system fail.
● Master-Slave Replication: One node is designated as the master (or primary), and the
other nodes are slaves (or replicas). The master handles write operations, while the slaves
handle read operations. Changes in the master are propagated to the slaves.
● Multi-Master Replication: Multiple nodes can accept both reads and writes. Changes
are synchronized across all masters. This model is more complex but allows greater
flexibility.
● Eventual Consistency: In a distributed system, replication often leads to eventual
consistency, meaning that not all replicas will immediately have the same data. Updates
will propagate over time.
Example Use Case:
● MySQL supports master-slave replication, where the master handles writes and
propagates changes to slave nodes for reads, providing higher scalability for read-heavy
workloads.
2. Sharding
Sharding divides the dataset into smaller, more manageable pieces called shards, each of which
is stored on a different server. This distribution model is particularly useful for handling large
amounts of data and improving system performance by spreading the load across multiple
servers.
Types of Sharding:
● Horizontal Sharding: Data is partitioned by rows. Each shard contains a subset of the total
dataset. For example, a database might store user records on different servers based on
user ID ranges.
● Vertical Sharding: Data is partitioned by columns. This means each shard holds a subset
of columns from the database schema, with different shards containing different attributes
of the data.
● Range-based Sharding: Data is divided into ranges based on some key attribute (e.g.,
user ID or date). Each range is stored in a separate shard.
● Hash-based Sharding: A hash function is applied to a specific key (e.g., user ID) to
determine which shard will hold the data.
Example Use Case:
● MongoDB uses sharding to distribute data across multiple nodes. Each shard holds a
subset of the data based on a chosen sharding key, enabling MongoDB to scale
horizontally as data grows (see the sketch below).
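As a hedged sketch, enabling hash-based sharding on a users collection might look like this in
the mongosh shell (the database, collection, and key names are assumptions):
sh.enableSharding("mydb")                               // allow sharding for the database
sh.shardCollection("mydb.users", { user_id: "hashed" }) // hash-based shard key on user_id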
3. Data Federation
Data federation refers to a model where data remains stored in multiple, possibly heterogeneous,
systems, but queries can be run across those systems as if the data were in a single database. This
approach is commonly used in scenarios where data resides in multiple databases or locations but
needs to be accessed together.
Key Features:
● Single Query Interface: A federation layer allows users or applications to query multiple
data sources using a unified interface.
● Data Virtualization: The federation layer abstracts the underlying data sources, making
them appear as one logical entity to users and applications.
● Heterogeneous Sources: Data sources may use different technologies (e.g., relational
databases, NoSQL, flat files), but federation allows them to interact seamlessly.
Example Use Case:
● Apache Drill or Presto: These tools allow federated queries across a range of databases,
including relational, NoSQL, and file systems, enabling a unified query interface.
4. Peer-to-Peer (P2P) Distribution
In the P2P distribution model, every node in the system has equal responsibility and capabilities.
Nodes are both data producers and consumers, and there is no central server or coordinator. Each
node communicates with other peers directly.
Key Features:
● Decentralization: There is no central server; every node can both store and serve data,
communicating directly with its peers.
Example Use Case:
● BitTorrent is an example of a P2P system used for file sharing, where users directly
exchange data chunks with each other.
5. Cloud-Based Distribution
Cloud distribution involves the use of cloud services to store and distribute data. This model
takes advantage of cloud providers' infrastructure to handle the scaling of storage and compute
resources.
Key Features:
● Elastic Scalability: Cloud systems can scale resources up or down based on demand,
making them highly flexible.
● Global Availability: Cloud-based distribution often provides global coverage by
replicating data across different geographic regions.
● Managed Services: Many cloud providers offer managed distributed database services,
making it easier to handle the complexities of distribution, replication, and scaling.
Example Use Case:
● Amazon DynamoDB offers a cloud-based NoSQL database that automatically scales and
replicates data across multiple regions for high availability and low-latency access.
6. Hybrid Distribution
Hybrid distribution models combine multiple distribution strategies to take advantage of the
strengths of each. For example, a system might use sharding for data partitioning and
replication for fault tolerance and high availability.
Key Features:
● Flexibility: Hybrid models allow organizations to fine-tune their systems to meet both
performance and availability requirements.
● Optimized Data Access: By combining distribution models, organizations can optimize
access patterns, such as ensuring fast access to frequently used data while still
maintaining fault tolerance.
Example Use Case:
● Google Spanner uses a hybrid approach, combining sharding (horizontal partitioning) and
replication across geographically distributed servers for strong consistency and high
availability.
Summary of Distribution Models:
● Data Replication: multiple copies of the same data across nodes (e.g., MySQL master-slave replication).
● Sharding: data partitioned into shards across multiple servers (e.g., MongoDB sharded clusters).
● Data Federation: unified queries across heterogeneous sources (e.g., Apache Drill, Presto).
● Peer-to-Peer (P2P): decentralized, equal responsibility among nodes (e.g., BitTorrent).
● Cloud-Based: managed, elastic distribution on cloud infrastructure (e.g., Amazon DynamoDB).
● Hybrid: combines models such as sharding plus replication (e.g., Google Spanner).
Each of these distribution models addresses different needs, depending on the size, complexity,
and requirements of the system or application being developed. By choosing the right model, you
can optimize performance, reliability, and scalability in your distributed systems.
Combining Sharding and Replication
● Sharding divides a large dataset into smaller, more manageable chunks (shards) based on
a defined sharding key (like user ID, geographic region, etc.). Each shard is stored on a
different server or node.
● Replication ensures that each shard has multiple copies (replicas) across different servers
to improve data availability, fault tolerance, and read performance.
In this combined model, each shard has several replicas, and the data is partitioned across
different machines. This combination helps scale the system horizontally while maintaining high
availability and fault tolerance.
How It Works:
1. Sharding: The database is divided into smaller pieces (shards) by distributing the data
across multiple nodes. Each shard holds a subset of the data. For example, in a user
database, users with IDs from 1 to 100,000 might go to one shard, while IDs from
100,001 to 200,000 go to another shard.
2. Replication: Each shard is replicated to multiple servers (often in master-slave or
master-master configurations). If a primary node for a shard goes down, one of the
replicas can be promoted to handle the traffic. This ensures availability even during
failures.
Let’s assume you have a user database and need to implement both sharding and replication:
● Sharding: The user data is partitioned based on the user_id key. Users with IDs from
1 to 1,000,000 are distributed across 10 shards. Each shard stores a subset of the total
user data.
● Replication: Each of these 10 shards is replicated to 2 or more nodes. For instance, Shard
1 might have:
○ Replica 1 on one server and Replica 2 on a different server, so no single machine
holds the only copy of that shard's data.
This approach ensures that data from the same shard is spread across different servers, allowing
both high availability and load balancing.
Detailed Architecture:
1. Shards and Replicas:
○ If Replica 1 for Shard 1 fails, the system can automatically switch to Replica 2
without downtime, ensuring continuous operation.
○ Similarly, if the primary node for Shard 1 fails, the replicas of Shard 1 can still
serve the data.
Benefits:
1. Performance:
○ Sharding allows the database to scale horizontally, meaning the load is spread
across multiple servers, improving the performance of read and write operations.
○ Replication increases the read capacity by allowing multiple replicas to handle
read queries, which reduces the load on the primary node.
2. High Availability:
○ With replication, even if one node goes down (whether a primary node or replica),
the system can continue to function by redirecting traffic to the remaining
available replicas.
3. Fault Tolerance:
○ The system can tolerate failures without downtime, as replication ensures that
there are always copies of the data available.
4. Scalability:
○ Sharding allows you to distribute data across multiple machines, and you can
keep adding new shards as your dataset grows.
○ Replication ensures that these new shards are also backed up with replicas to
maintain availability and reliability.
Challenges:
1. Choosing a Sharding Key:
○ Deciding how to partition data across shards (e.g., which field to use as a sharding
key) can be difficult. An inefficient sharding key can lead to uneven distribution,
causing some nodes to handle more traffic than others (data hotspots).
2. Consistency:
○ Keeping every replica of a shard synchronized is harder at scale; replication lag
can mean that secondaries briefly serve stale data.
3. Management Complexity:
○ Managing a distributed system with both sharding and replication can become
complex. You need to handle failure recovery, balancing load across replicas, and
ensuring that each shard is evenly distributed.
4. Cross-Shard Queries:
○ Performing operations that involve multiple shards can be complex. For example,
a query that needs to join data across shards (e.g., joining user data with order
data across different shards) may not be as efficient as querying a single machine.
5. Network Latency:
○ With data spread across multiple servers and potentially data centers, network
latency can become an issue, especially if replicas are distributed across
geographically distant locations.
● MongoDB: MongoDB uses both sharding and replication. In MongoDB, data is sharded
across multiple nodes, and each shard has its own replica set. MongoDB’s sharded
clusters provide horizontal scaling, and replica sets ensure high availability.
Summary:
● Sharding divides data into smaller pieces (shards) to distribute the load and improve
scalability.
● Replication creates copies of the data to ensure availability, fault tolerance, and
improved read performance.
● Combining both allows systems to scale out horizontally while maintaining high
availability, fault tolerance, and performance under heavy load.
By combining these models, systems can handle massive datasets and maintain high
performance, even during failures or periods of high demand. However, managing the
complexity of both models requires careful design and configuration to ensure efficiency and
reliability.
2.3 NoSQL Key/Value Databases using MongoDB
NoSQL Key/Value Databases are a type of NoSQL database that store data as key-value pairs,
where each key is unique, and the value can be any type of data, such as strings, numbers, lists,
or even complex objects. They are often used for caching, session management, and storing user
preferences or configurations because of their speed and simplicity in accessing data.
In MongoDB, each document (record) in a collection is stored as a BSON (Binary JSON) object,
which inherently maps well to key-value concepts. The key is the document’s field name (e.g.,
name, age, email), and the value is the data associated with that field (e.g., "John", 30,
"john@example.com").
1. BSON Format: MongoDB stores documents in BSON format, which is a binary
encoding of JSON. This format allows MongoDB to store more complex data types and
makes it easy to treat documents as key-value pairs.
2. Flexibility: MongoDB allows you to store values in any data type (strings, arrays,
embedded documents), making it suitable for a wide range of use cases.
3. Indexed Keys: MongoDB indexes fields in a collection, enabling fast lookups of
key-value pairs, especially when using specific fields as keys.
4. Fast Lookups: MongoDB can quickly access data using the value of a field (acting as the
key) with indexing, much like traditional key-value stores.
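For example, a minimal sketch of creating an index to speed up key lookups (the field name is
an assumption; the _id field is always indexed automatically):
db.user_preferences.createIndex({ userId: 1 }, { unique: true }) // unique index for fast key lookups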
Using MongoDB as a key-value store, the mapping is:
1. Key: The _id field (or any other field you define as unique)
2. Value: The document itself, which can be a simple or complex value.
Example Data:
{
"_id": "user123", // This is the key
"theme": "dark", // This is a key-value pair
"notifications": true,
"language": "en"
}
● The key is the _id field (user123), which is unique to each document.
● The value is the entire document itself: a combination of preferences for that user.
To fetch a key-value pair in MongoDB, you can query the document using the key (e.g., _id):
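A minimal sketch (the collection name is assumed from the Node.js example later in this
section):
db.user_preferences.findOne({ _id: "user123" })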
This query would return the document with _id equal to "user123", providing you the value
associated with that key, which includes the user's preferences.
Output:
{
"_id": "user123",
"theme": "dark",
"notifications": true,
"language": "en"
}
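To fetch only part of the value, you can add a projection (a sketch; the field list follows from
the next sentence):
db.user_preferences.find(
  { _id: "user123" },
  { theme: 1, notifications: 1 }
)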
This would return only the theme and notifications fields for the document with _id
"user123".
In caching scenarios, MongoDB can be used to store values that are fetched with a unique key.
For example, caching user data for quick retrieval:
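A minimal sketch of such a cache entry (the collection and field names are assumptions):
db.cache.insertOne({
  _id: "session123",            // the key
  userId: "user123",            // the value: the user's session data...
  cart: ["item1", "item2"],
  lastAccess: new Date()
})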
Here, "session123" is the key, and the value is the user's session data.
Although MongoDB is primarily a document-oriented database, its functionality and flexibility
allow it to be used like a key-value store in many cases:
1. Schema Flexibility: Unlike traditional key-value stores (e.g., Redis), MongoDB allows
you to store documents of different structures within the same collection. This is useful
when different "values" for different keys need to be stored in different formats.
2. Rich Querying: MongoDB provides powerful querying and indexing features, which
enable more sophisticated lookups compared to traditional key-value stores, such as
querying on nested fields or performing aggregation.
3. Data Persistence: MongoDB is persistent, meaning the data stored in it won’t be lost
when the server is restarted, unlike some in-memory key-value stores like Redis.
4. Horizontal Scalability: MongoDB supports sharding and replication, meaning it can
scale horizontally and handle large datasets while maintaining high availability.
Limitations:
1. Performance: MongoDB might not be as fast as in-memory key-value databases like
Redis when it comes to simple, rapid key-value lookups. MongoDB is disk-based and
optimized for more complex queries and document management, which can result in
slightly slower access times compared to an in-memory solution.
2. Complexity: MongoDB’s advanced querying and document structure may be overkill for
simple use cases where a basic key-value store would suffice (e.g., caching user sessions
or simple configurations).
In Node.js, you can interact with MongoDB using the MongoDB Node.js driver or Mongoose
for object modeling.
A minimal runnable sketch using the official driver (the connection URI and database name are
assumptions):
const { MongoClient } = require('mongodb');

async function main() {
  const client = new MongoClient('mongodb://localhost:27017'); // assumed local instance
  await client.connect();
  const db = client.db('mydb'); // assumed database name
  const collection = db.collection('user_preferences');
  // Fetch the value (the whole document) for the key "user123"
  console.log(await collection.findOne({ _id: 'user123' }));
  await client.close();
}
main().catch(console.error);
Output:
{
"_id": "user123",
"theme": "dark",
"notifications": true,
"language": "en"
}
Conclusion:
Although MongoDB is first and foremost a document database, its unique _id keys, secondary
indexes, persistence, and horizontal scalability let it serve many key-value workloads well. For
ultra-low-latency lookups, a dedicated in-memory store such as Redis may still be the better fit.
2.4 Document Databases
A Document Database is a type of NoSQL database designed to store, retrieve, and manage
documents. Unlike relational databases that store data in tables with fixed schemas, document
databases store data in a more flexible, hierarchical format, typically using JSON (JavaScript
Object Notation), BSON (Binary JSON), or similar document-based formats.
Each document is a self-contained unit of data that can have its own structure, and the database
does not enforce a schema across all documents. This flexibility allows document databases to
efficiently handle complex and dynamic data structures.
● Consistency Model: Many document databases relax strict ACID guarantees (Atomicity,
Consistency, Isolation, Durability). This provides better scalability and performance but may
result in eventual consistency rather than immediate consistency in some cases.
1. MongoDB:
○ MongoDB is one of the most popular and widely used document databases. It
stores data in BSON format (a binary version of JSON), which allows for rich
data structures, including arrays and embedded documents. MongoDB supports
powerful querying, indexing, and aggregation features.
○ Use Case: Suitable for applications that need to handle large volumes of
semi-structured data, such as web apps, content management systems (CMS), and
real-time analytics.
2. CouchDB:
○ CouchDB stores data in JSON format. It allows documents to be stored with
different structures within the same database, providing flexibility. CouchDB also
emphasizes eventual consistency and high availability through its MapReduce
indexing and replication mechanisms.
○ Use Case: Used in systems requiring high availability and scalability with the
need for efficient document storage and replication, such as IoT systems or
mobile applications.
3. Couchbase:
○ Couchbase is a distributed NoSQL database that supports both document storage
and key-value store functionality. It uses JSON documents to represent data and
provides advanced features like indexing, full-text search, and analytics.
○ Use Case: Suitable for applications with large-scale data storage needs, such as
customer 360-degree profiles, content management systems, and mobile apps.
4. RavenDB:
○ RavenDB is a document database designed for .NET environments. It stores data
in JSON format and supports rich querying and indexing. RavenDB provides
features like ACID transactions, full-text search, and dynamic queries.
○ Use Case: Typically used in enterprise environments and applications written in
.NET, such as business applications and customer relationship management
(CRM) systems.
5. Amazon DocumentDB:
○ Amazon DocumentDB is a fully managed document database service designed to
be compatible with MongoDB. It is optimized for use on Amazon Web Services
(AWS) and provides the scalability, availability, and security benefits of AWS.
○ Use Case: Useful for users already working within the AWS ecosystem who need
a scalable, managed document database compatible with MongoDB applications.
2.5 Features- Consistency
For example, a user document in MongoDB might look like this:
{
"_id": "user123",
"name": "John Doe",
"email": "johndoe@example.com",
"address": {
"street": "123 Main St",
"city": "Somewhere",
"state": "CA",
"zipcode": "90210"
},
"roles": ["admin", "editor"],
"last_login": "2025-02-06T10:00:00Z",
"is_active": true
}
● The _id field serves as the primary key, and it is unique for every document.
● The address field is a nested object (document inside a document), which contains
multiple properties.
● The roles field is an array of values.
● MongoDB allows these nested structures to be stored within the same document, offering
a flexible, hierarchical data model.
If you want to retrieve all users with the role "admin", you can perform a query:
db.users.find({ roles: "admin" })
This would return documents where the roles array contains the value "admin". MongoDB’s
query language allows querying nested fields as well, such as address.city or
last_login.
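For instance, a sketch of querying a nested field from the document above:
db.users.find({ "address.city": "Somewhere" })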
4. Horizontal Scalability:
○ Document databases are designed to scale out, so the system stays highly
available and capable of handling large datasets distributed across multiple
machines.
5. Built-in Indexing:
○ Document databases typically allow indexing on various fields, including nested
fields. This makes queries faster and more efficient, particularly when retrieving
documents based on specific attributes.
Conclusion:
Document databases are a powerful and flexible tool for modern applications, especially when
dealing with semi-structured or evolving data. They excel in use cases where schema flexibility
and scalability are important, such as web applications, content management systems, and
real-time analytics. However, they may not be the best choice for applications that require
complex relational data or strong consistency guarantees across multiple operations.
Consistency in Document Databases
Consistency in a document database refers to how the database ensures that all nodes or
replicas in a distributed system reflect the same data after any update or write operation. In
NoSQL systems like document databases, consistency is typically balanced with availability and
partition tolerance (as per the CAP theorem), which means that you may have to trade-off
strict consistency for better availability or fault tolerance.
Let’s break down the concept of consistency in document databases and how it can vary
depending on the system.
Document databases offer different consistency models, often depending on the configuration
(e.g., single-node vs. multi-node, replica sets, or sharding). These models generally align with
ACID (Atomicity, Consistency, Isolation, Durability) or BASE (Basically Available, Soft state,
Eventually consistent) principles.
1. Strong Consistency:
● Strong consistency ensures that every read reflects the most recent write. After an update
operation, the system guarantees that all clients will see the same data at any given time,
no matter which replica or shard they access.
● Example: If you update a document (e.g., a user’s profile), subsequent reads (even from
different replicas) will immediately reflect the updated profile.
2. Eventual Consistency:
● Eventual consistency means that, given enough time, all replicas will converge to the
same data, but during that time, some replicas may be out of sync.
● Example: If you update a document, some users may still see the old version of the
document for a brief period until the replicas sync.
● Eventual consistency is generally used in systems that prioritize availability and
partition tolerance over strong consistency, such as large distributed databases.
3. Tunable Consistency:
● Tunable consistency allows you to configure the trade-off between consistency and
availability. For example, you can decide how many replicas need to acknowledge a write
before it is considered successful. MongoDB provides options for this.
○ Write Concern: Determines how many replica sets must acknowledge a write
operation before it is considered successful.
○ Read Concern: Determines the consistency level for read operations, such as
whether you want to read from the most recent write or accept data that may still
be inconsistent.
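For example, a hedged sketch of tuning a write (the collection and document are assumptions):
db.orders.insertOne(
  { orderId: 1, status: "new" },
  { writeConcern: { w: "majority" } } // wait for a majority of replica set members
)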
MongoDB, as one of the most popular document databases, offers flexibility in terms of
consistency, and it can be configured to provide different levels of consistency depending on the
use case. Here are a few important concepts in MongoDB:
Replica Sets:
In MongoDB, replica sets are used to provide high availability and data redundancy by
maintaining copies of the same data across multiple nodes. A replica set consists of a primary
node and one or more secondary nodes. The primary node handles all write operations, while the
secondary nodes replicate the primary's data.
● Strong Consistency (Single-Document Operations):
○ In MongoDB, write operations on a single document (e.g., updating a user profile)
are strongly consistent by default. Once the write operation is acknowledged by
the primary node, it is immediately available for subsequent reads.
○ MongoDB's default consistency ensures that if you write to the primary node, the
data is immediately visible to any subsequent reads from that primary.
● Write Concern:
○ In MongoDB, write concern defines how many replica nodes must acknowledge
a write operation before it is considered successful. The default write concern is
w:1, meaning only the primary node must acknowledge the write.
○ You can configure a higher write concern, such as w:majority, to ensure that
the majority of replica nodes acknowledge the write, improving data consistency
across the system.
● Read Concern:
○ Read concern controls the consistency of the data returned by a read. For
example, "local" (the default) returns the node's latest data, which may later
be rolled back, while "majority" returns only data acknowledged by a majority
of replica set members.
Example of read concern (using the cursor's readConcern helper):
db.collection.find({ name: "John Doe" }).readConcern("majority");
MongoDB uses a feature known as oplog (operation log) in replica sets, where the primary node
records all write operations, and secondary nodes apply those operations to maintain consistency.
However, in a distributed system, there might be replication lag between the primary and
secondary nodes, leading to the possibility of reading stale data from secondaries.
● To ensure stronger consistency, MongoDB can configure read preferences to read from
the primary node only (strong consistency), or to allow reads from secondaries, which
may return stale data but improve availability.
Consistency in Sharded Clusters:
In a sharded MongoDB cluster, data is distributed across multiple nodes. Sharding introduces the
challenge of maintaining consistency across distributed shards. MongoDB ensures consistency
within each shard, but the consistency across multiple shards depends on the configuration of the
cluster and consistency level chosen.
MongoDB allows multi-document transactions in sharded clusters (since version 4.0), which
provide ACID guarantees across multiple documents in different shards.
Choosing a Consistency Model:
1. Strong Consistency:
○ Best suited for applications where it’s critical that users always see the latest data
after updates. Examples include financial systems, inventory management, and
accounting systems.
2. Eventual Consistency:
○ Best suited for applications where high availability and fault tolerance are
prioritized over strict consistency. Examples include social media platforms,
content delivery networks, and IoT systems.
3. Tunable Consistency:
○ Some applications may need tunable consistency, where you can adjust the
balance between consistency and performance based on specific requirements.
For example, in e-commerce platforms, product inventory could benefit from
eventual consistency, but payment transactions must be strongly consistent.
Let’s say we have an e-commerce platform where users can purchase products. We might want to
ensure strong consistency when updating inventory quantities but can tolerate eventual
consistency in other cases, such as viewing product reviews.
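For instance, a sketch of the strongly consistent inventory update (the collection, filter, and
field names are assumptions):
db.inventory.updateOne(
  { productId: "p123" },
  { $inc: { quantity: -1 } },
  { writeConcern: { w: "majority" } } // acknowledged by a majority before success
)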
● The inventory update will be acknowledged by the majority of the replica set before it's
considered successful, ensuring that all users accessing the system will see the same
inventory level.
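And a sketch of the relaxed read for product reviews (the collection name and read preference
are assumptions):
db.reviews.find({ productId: "p123" }).readPref("secondary") // may return slightly stale data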
● This query can return stale data (e.g., product reviews from a secondary node), which is
acceptable in some cases like reading user reviews, where absolute consistency isn't
critical.
Conclusion
In document databases, consistency models are configurable and vary based on your
requirements for data availability, fault tolerance, and how up-to-date the data must be.
● Strong consistency ensures that all clients always see the most recent write.
● Eventual consistency trades off immediate consistency for better availability and
partition tolerance.
● Tunable consistency, as seen in MongoDB, allows you to adjust the consistency level
depending on the use case.
Choosing the right consistency model for your application is critical, and it depends on factors
such as how often data is updated, how critical it is for users to see the latest data, and the overall
design of your distributed system.
2.6 Transactions- Availability- Query Features- Scaling- Suitable Use Cases
In MongoDB, transactions allow you to execute multiple operations in a way that ensures all
operations are completed successfully or none at all. This is part of MongoDB’s support for
ACID transactions (Atomicity, Consistency, Isolation, Durability) introduced in version 4.0 for
replica sets and version 4.2 for sharded clusters.
Let’s say we have two collections: accounts and transactions. We want to transfer
money from one account to another. The operation involves two steps:
The goal is to ensure both operations are atomic, meaning that if one operation fails, neither
operation should be committed.
Here’s how to implement this in MongoDB using the Node.js MongoDB Driver:
1. Setup
You need to have MongoDB running and a Node.js application. Install MongoDB’s Node.js
driver:
npm install mongodb
2. Code Example
Here is the code that demonstrates how to implement a money transfer transaction between two
accounts (the connection URI, database name, and audit-record fields are assumptions):

const { MongoClient } = require("mongodb");

async function main() {
  const client = new MongoClient("mongodb://localhost:27017"); // assumed URI; requires a replica set
  try {
    await client.connect();
    const session = client.startSession(); // Start a session for the transaction
    const db = client.db("bank"); // assumed database name
    const accounts = db.collection("accounts");
    const transactions = db.collection("transactions");

    // Start a transaction
    session.startTransaction();

    try {
      // Perform the money transfer
      const senderId = "user1";
      const receiverId = "user2";
      const amount = 100;

      await accounts.updateOne(
        { accountId: senderId },
        { $inc: { balance: -amount } },
        { session }
      );

      await accounts.updateOne(
        { accountId: receiverId },
        { $inc: { balance: amount } },
        { session }
      );

      // Log the transfer for audit purposes (field names assumed)
      await transactions.insertOne({
        from: senderId,
        to: receiverId,
        amount,
        date: new Date()
      }, { session });

      // Commit the transaction to make all changes permanent
      await session.commitTransaction();
      console.log("Transaction successful!");
    } catch (error) {
      // If any operation fails, abort the transaction
      console.error("Transaction failed: ", error);
      await session.abortTransaction();
    } finally {
      session.endSession();
    }
  } finally {
    await client.close();
  }
}

main().catch(console.error);
Explanation:
1. Connect to MongoDB:
○ We create a MongoClient and connect to the server.
2. Start a Session and Transaction:
○ client.startSession() creates the session that carries the transaction’s
context, and session.startTransaction() begins the transaction.
3. Perform the Operations:
○ We deduct from the sender’s account and add to the receiver’s account using the
updateOne method (a real implementation would first check that the sender
has sufficient funds).
○ We also log the transaction into a separate transactions collection for audit
purposes.
4. Commit the Transaction:
○ If every operation succeeds, session.commitTransaction() makes all
the changes permanent.
5. Abort on Failure:
○ If any operation fails (e.g., insufficient funds, receiver not found), we catch the
error and call session.abortTransaction() to roll back all changes
made during the transaction.
6. End Session:
○ session.endSession() releases the session whether the transaction
committed or aborted.
Key Concepts:
● Atomicity: If an error occurs at any point in the transaction (e.g., insufficient funds,
receiver not found), MongoDB ensures that all changes are rolled back, and no partial
updates are made.
● Session: MongoDB transactions use sessions to maintain context for the transaction.
● Commit and Abort: After performing the operations, you either commit to make the
changes permanent or abort if something goes wrong.
Important Notes:
● Replica Sets: Transactions are only supported in replica sets (not standalone MongoDB
instances). If you're working with a sharded cluster, make sure you have MongoDB
version 4.2 or later, as this version added support for distributed transactions across
shards.
● Performance Consideration: While transactions are useful for maintaining data
consistency, they can introduce overhead. Use transactions only when necessary for
ensuring data integrity.
Example Output:
Transaction successful!
Conclusion:
This simple example demonstrates how to use MongoDB transactions to ensure atomicity and
consistency when performing multiple operations across different collections. Transactions are
essential for maintaining data integrity in scenarios such as money transfers, inventory updates,
or order processing where multiple operations must succeed or fail together.
MongoDB feature summary:

Availability:
● High availability through replica sets that provide failover if the primary node goes down.
● Can achieve automatic failover with replica sets.
● Tunable consistency options, with trade-offs between consistency and availability.

Query Features:
● Supports rich query capabilities like filtering, sorting, and aggregation.
● Offers indexing for efficient searches.
● Full-text search is available (using MongoDB Atlas or text indexes).
● Join operations through $lookup for handling relationships between documents (like a "left join" in SQL).

Scaling:
● Horizontal scaling through sharding for distributing data across multiple servers.
● Auto-sharding helps in managing large datasets by partitioning them across different machines.
● Vertical scaling (increasing server capacity) is also supported for better performance.

Suitable Use Cases:
● Content Management Systems (CMS): flexibility with schema changes for evolving content.
● Real-time Analytics: with powerful aggregation pipelines, MongoDB is suitable for real-time data processing.
● IoT Applications: MongoDB can efficiently handle large-scale, unstructured, and semi-structured data.
● E-commerce Platforms: can handle user profiles, product catalogs, and inventory systems.
● Social Media Platforms: scalable architecture for storing large volumes of user-generated content and interactions.
Summary:
● Availability: MongoDB ensures high availability through replica sets, making it resilient
to server failures. The trade-off is flexibility in consistency (eventual vs. strong
consistency).
● Query Features: MongoDB offers a powerful query language with support for
aggregation, filtering, and even joining collections (using $lookup), although it’s not as
feature-rich in joins as relational databases.
● Scaling: MongoDB excels in horizontal scaling with sharding, which allows it to
handle very large datasets across many servers.
● Use Cases: Ideal for applications that require flexibility in schema design (e.g., CMS,
IoT, social platforms) and need to scale as the data grows.
2.7 Event Logging- Content Management Systems- Blogging Platforms
Here's a table summarizing how MongoDB applies to different use cases like Event Logging,
Content Management Systems, and Blogging Platforms:
Event Data:
● Event Logging: Event logs can be stored as documents, each representing an event with properties like timestamp, event type, and metadata. Logs can be stored at scale with high availability, allowing for quick retrieval and analysis.
● CMS: MongoDB stores dynamic content (articles, posts, media) in documents, and allows embedding content like images, videos, and metadata in a single document for ease of retrieval.
● Blogging Platforms: Blog posts and metadata (e.g., titles, tags, publish dates) are stored as documents. MongoDB allows embedding or linking of content like comments, media, and user activity.

Scaling:
● Event Logging: Sharding enables horizontal scaling of event log data, which is essential for high-volume logging scenarios. MongoDB's ability to handle large amounts of incoming event data and scale horizontally is critical.
● CMS: CMS platforms benefit from MongoDB's ability to scale horizontally using sharding as the number of articles or media grows.
● Blogging Platforms: MongoDB scales well for growing platforms, supporting thousands of blog posts and user interactions (comments, likes).

Suitable Use Cases:
● Event Logging: Application logs (storing logs for apps or systems, including user actions, errors, and system events) and real-time analytics (tracking and analyzing user behavior, system status, or security events).
● CMS: Dynamic content management (managing and organizing content, including articles, media, users, and permissions); ideal for multi-channel content publishing (web, mobile, etc.).
● Blogging Platforms: Personal blogs (handling blog posts and user interactions), multi-author blogs (managing multiple writers, editors, and categories), and content-based platforms (storing posts, media, and metadata for large-scale platforms).
Summary:
Event Logging:
● MongoDB is a great fit for event logging systems due to its high write throughput,
ability to store time-series data, and flexible schema-less design. It can easily scale
horizontally with sharding, ensuring reliable storage of massive amounts of event data.
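For example, a sketch of a single event-log document (the collection and field names are
assumptions):
db.events.insertOne({
  type: "user_login",
  userId: "user123",
  timestamp: new Date(),
  metadata: { ip: "203.0.113.5", device: "mobile" }
})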
Content Management Systems (CMS):
● MongoDB’s flexibility and schema-less design are well-suited for managing dynamic
and diverse content, allowing for easy evolution of content models. It can scale efficiently
as content volume grows, with replica sets ensuring high availability for CMS platforms.
Blogging Platforms:
● MongoDB is perfect for blogging platforms due to its ability to handle content, user
interactions (comments, likes), and metadata efficiently. Its flexibility supports evolving
content formats, and the scalability makes it ideal for growing platforms with high traffic
and interaction volumes.
MongoDB's strengths in scalability, flexibility, and performance make it a top choice for these
use cases. Whether you're managing logs, content, or blog posts, MongoDB allows you to store,
query, and scale data with ease.
2.8 Web Analytics or Real-Time Analytics- E-Commerce Applications
Here’s a table summarizing the suitability of MongoDB for Web Analytics or Real-Time
Analytics and E-Commerce Applications:
Scalability:
● Web/Real-Time Analytics: Horizontal scalability via sharding allows MongoDB to handle large amounts of real-time event data at high speeds, enabling large-scale web analytics platforms. Auto-sharding helps distribute traffic across multiple servers, which is critical for managing data across global users.
● E-Commerce: MongoDB's sharding capability helps scale e-commerce platforms as they grow, handling millions of product listings and user transactions. Horizontal scaling allows the system to support high traffic, especially during peak times (e.g., Black Friday sales).

Flexibility:
● Web/Real-Time Analytics: Web analytics systems often evolve in terms of the data collected (e.g., adding new user interaction events). MongoDB's flexibility allows you to store new types of events without schema changes.
● E-Commerce: E-commerce data models often change as new products, promotions, and categories are added. MongoDB's flexible schema allows rapid adaptation to new data types and formats.

Reliability & Availability:
● Web/Real-Time Analytics: MongoDB's replica sets ensure high availability of real-time analytics data, so analytics platforms remain up even during node failures, with built-in data redundancy.
● E-Commerce: Replica sets ensure that e-commerce platforms maintain high availability, even during high traffic times, and provide automatic failover to maintain platform uptime.
Summary:
Web Analytics / Real-Time Analytics:
● MongoDB is a great fit for web analytics due to its ability to handle high volumes of
event-driven data in real-time. Its flexible schema allows it to store diverse event data
(e.g., user clicks, page views, sessions) and adapt to new data types. Aggregation
pipelines enable powerful data analysis, while sharding and replica sets ensure
scalability and availability for global traffic.
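For instance, a hedged sketch of an aggregation pipeline counting page views per page (the
collection and field names are assumptions):
db.events.aggregate([
  { $match: { type: "page_view" } },                 // keep only page-view events
  { $group: { _id: "$page", views: { $sum: 1 } } },  // count views per page
  { $sort: { views: -1 } }                           // most-viewed pages first
])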
E-Commerce Applications:
● MongoDB is also ideal for e-commerce applications, where it can handle dynamic data
models such as product catalogs, user profiles, and orders. The ability to store and
query complex, nested data (like shopping cart contents) in a flexible,
document-oriented format is key. Scalability through sharding and high availability
through replica sets make it well-suited for handling high traffic during sales events or
peak times.
MongoDB's real-time analytics and scalable data storage features make it a suitable choice for
both analytics platforms and e-commerce systems that need to handle high traffic, dynamic
data, and rapid scaling.
2.9 When Not to Use- Complex Transactions Spanning Different Operations
In MongoDB, there are cases where its design and capabilities may not be suitable for certain use
cases, particularly when it comes to complex transactions spanning different operations.
Below are some key scenarios where MongoDB may not be the best choice for complex
transactions:
1. Transactions Spanning Multiple Collections or Databases
MongoDB supports transactions in replica sets and sharded clusters (from version 4.0+), but
they are not as robust as traditional relational database transactions. Complex transactions
that span multiple collections or multiple databases may become inefficient or challenging to
implement.
● Issue: If you need to perform multiple operations across different collections or databases
with full ACID (Atomicity, Consistency, Isolation, Durability) guarantees, MongoDB’s
transaction management may not be as performant as traditional relational databases.
● Why: Transactions involving multiple collections or databases can lead to higher
latency, increased complexity, and potential performance bottlenecks in a sharded
cluster.
2. When You Need Strict Strong Consistency Across Operations
MongoDB is designed with a CAP Theorem trade-off, where availability and partition
tolerance are prioritized over strict consistency (depending on the setup). This means MongoDB
supports eventual consistency in many scenarios, and strong consistency is only possible under
certain configurations (e.g., using replica sets with read concern settings).
● Issue: If your application requires strict ACID compliance with strong consistency
across operations, especially when transactions span across multiple operations or
databases, MongoDB may not guarantee the level of consistency you need in certain
scenarios.
● Why: MongoDB’s default consistency model may lead to replication lag or delayed
consistency in some sharded environments.
3. When You Need Complex Joins or Foreign Key Constraints
MongoDB is a document-based NoSQL database, meaning it does not have the same features
as relational databases, such as complex joins or foreign key constraints. While MongoDB
does support $lookup (similar to SQL joins) for aggregations, it is not optimized for complex
join operations spanning multiple collections or databases.
● Issue: If your application requires complex joins between multiple collections, especially
where there are strong relationships between data, the lack of built-in foreign key
constraints and complex joins in MongoDB could make the design and maintenance more
complex and inefficient.
● Why: Complex operations like cascading updates, foreign key constraints, and data
integrity checks are often better handled in relational databases, and MongoDB’s lack of
strong relational integrity mechanisms can lead to data inconsistency and complicated
logic in the application layer.
4. When You Need to Ensure Full ACID Compliance Across Multiple Steps
While MongoDB supports multi-document transactions since version 4.0, the system was
originally designed for high-performance, distributed, and eventually consistent workloads,
not for multi-step ACID transactions involving various operations on different data sets.
● Issue: If your application involves multiple steps that require strong ACID guarantees
across a series of operations (like transferring money between different accounts, with
updates to multiple documents across multiple collections), using MongoDB for such use
cases could result in unexpected behavior or performance degradation.
● Why: MongoDB’s transaction support has overhead compared to traditional RDBMSs
that were built specifically to handle complex ACID transactions. This overhead could
lead to slowdowns or scaling issues.
5. When You Need Advanced Concurrency Control
Relational databases often offer advanced concurrency control mechanisms like serializable
isolation levels (to handle conflicts and data consistency in high-concurrency situations). While
MongoDB supports isolation levels for transactions, its concurrency control is not as fine-tuned
as that of relational databases.
● Issue: If your application requires fine-grained control over data consistency, deadlock
management, or complex conflict resolution in highly concurrent environments,
MongoDB may not provide the same level of reliability and control.
● Why: MongoDB’s default isolation levels (e.g., read committed) may not fully prevent
conflicting writes and dirty reads in complex multi-operation transactions. For example,
if you need to ensure serializability across complex transactions, MongoDB may not be
as effective as relational databases with advanced isolation.
6. When You Need Rigid Schema Enforcement and Integrity Constraints
Relational databases offer strong schema enforcement and the ability to define integrity
constraints like primary keys, foreign keys, and unique constraints. These are essential for
enforcing data integrity in complex applications where data consistency must be guaranteed at
all times.
● Issue: MongoDB’s schema-less design allows for flexibility, but this flexibility comes at
the cost of data integrity enforcement and schema validation.
● Why: If your application requires rigid data validation (e.g., through foreign key
relationships, cascading updates, and enforcement of complex business rules), MongoDB
may require more manual work or application logic to enforce these constraints.
Conclusion:
When Not to Use MongoDB for Complex Transactions:
● When transactions span across multiple collections or databases and require tight
ACID guarantees.
● When you need strong consistency or complex joins across related datasets.
● When your application is heavily dependent on complex, multi-step ACID transactions
or advanced concurrency control.
● When you require rigid schema enforcement and integrity constraints like foreign
keys or cascading operations.
In such cases, consider a traditional RDBMS or a hybrid approach that combines MongoDB
with relational systems where transactional integrity is critical.
2.10 Queries against Varying Aggregate Structure
Documents in a schema-flexible collection do not all share the same structure, which can make
queries and aggregations harder to write. However, MongoDB provides several powerful query
and aggregation capabilities to deal with this issue. Below is an explanation of the problem and
how MongoDB's features can address it.
The Problem:
Traditional relational databases, with their rigid schemas and foreign key constraints, don't face
structural variability. In MongoDB, by contrast, each document in a collection can have a
different structure, and that structure may change over time. This flexibility can lead to
challenges in querying and aggregating data, especially if you need to run queries on documents
that contain different fields or nested structures.
For example:
● One document might have a field called address, and another might have a nested object
location instead.
● Some documents might have a price field, while others have cost.
● Aggregation operations across documents with varying fields can become tricky,
especially if some documents have nested fields while others do not.
MongoDB provides several tools and techniques to deal with varying aggregate structures in
its documents:
1. Aggregation Framework
MongoDB’s aggregation framework allows for complex data transformation and querying even
when documents have varying structures. Key features include:
● $project: Used to reshape or modify fields, which helps normalize documents with
varying structures for querying.
● $addFields and $set: These can add new fields or modify existing ones during
aggregation.
● $unwind: Used to handle arrays or nested objects by flattening them into individual
documents, making them easier to aggregate.
● $ifNull: Can provide fallback values when certain fields don’t exist, allowing you to
handle missing fields during aggregation.
2. Normalizing Fields with $project
When documents have varying fields, the $project stage can be used to ensure that all documents
in a query return the same set of fields, even if some fields are missing or have different names.
Example:
db.orders.aggregate([
{
$project: {
orderId: 1,
totalAmount: {
$ifNull: ["$price", "$cost"] // If 'price' is missing, fall back to 'cost'
},
status: 1,
}
}
])
In this example:
● $ifNull returns the price field when it exists and falls back to cost otherwise, so every
document exposes a uniform totalAmount field.
3. Conditional Logic with $cond
For documents that may have different aggregate structures (i.e., different fields or nested data),
the $cond operator allows you to apply conditional logic based on whether a field exists or meets
a condition.
Example:
db.products.aggregate([
{
$project: {
itemName: 1,
itemPrice: {
$cond: {
if: { $ne: [{ $type: "$price" }, "missing"] }, // true when 'price' exists
then: "$price",
else: "$cost" // Fallback to 'cost' if 'price' is missing
}
}
}
}
])
Here:
● $cond checks whether price exists (its $type is not "missing"); if so it uses price,
otherwise it falls back to cost.
4. Flattening Arrays with $unwind
When you have arrays or nested data in your documents, you can use $unwind to flatten those
structures and make them easier to aggregate.
Example:
db.sales.aggregate([
{ $unwind: "$products" }, // Flatten the array of products
{
$group: {
_id: "$products.category",
totalSales: { $sum: "$products.sales" }
}
}
])
In this example, the products field is an array. The $unwind stage flattens each item in the array
into its own document so that we can aggregate sales based on product category.
5. Joining Collections with $lookup
When documents with varying structures are scattered across multiple collections, the $lookup
operator can help join data from different collections even if their structures are different.
Example:
db.orders.aggregate([
{
$lookup: {
from: "customers", // Join the 'customers' collection
localField: "customerId",
foreignField: "_id",
as: "customerInfo"
}
},
{
$unwind: "$customerInfo" // Flatten the customer data
},
{
$project: {
orderId: 1,
customerName: "$customerInfo.name",
totalAmount: {
$ifNull: ["$price", "$cost"] // Handle varying fields
}
}
}
])
In this case:
● We are performing a join ($lookup) between the orders and customers collections.
● The customerInfo is an array, so we use $unwind to flatten it.
● We also handle the varying fields (price, cost) using $ifNull.
6. Handling Missing Fields with $ifNull and $exists
You can check if a field exists before using it, or provide a fallback value for missing fields using
$ifNull. If a field might be missing in some documents, you can check for its existence and
return a default value or handle the condition accordingly.
Example:
db.transactions.aggregate([
{
$project: {
transactionId: 1,
totalAmount: {
$ifNull: ["$paymentAmount", 0] // Use 0 if 'paymentAmount' is missing
},
description: 1,
}
}
])
In this case:
● Documents that are missing paymentAmount return a totalAmount of 0 instead of a
missing field, keeping the output uniform across documents.
Key Takeaways:
1. Flexibility: MongoDB allows you to structure your data however you want, but this
flexibility can lead to varying document structures. You can handle this variability by
using MongoDB’s aggregation framework with operators like $ifNull, $cond, and
$project to deal with missing or differently named fields.
2. Aggregation Pipeline: The aggregation framework is particularly helpful when working
with varying document structures. It enables you to shape the data, filter out unwanted
fields, and apply logic to handle missing or mismatched data.
3. Conditional Logic: MongoDB provides powerful conditional operators like $cond,
$ifNull, and $exists to handle missing fields and adjust how documents are aggregated or
queried based on their content.
4. Flattening Arrays: The $unwind operator is useful for flattening nested arrays or
objects, which helps in aggregating data from varying structures.
5. Real-Time Adaptability: With MongoDB’s schema-less design, your application can
handle evolving data structures over time. However, it’s important to use proper
aggregation techniques to manage varying data effectively in queries.
By leveraging MongoDB’s aggregation features and conditional operators, you can efficiently
query and aggregate data, even when the document structures vary.