NoSQL - Unit 2
Distribution Models
Distribution of data across nodes is achieved through:
Sharding: Data partitioned into smaller chunks across
multiple nodes.
Replication: Copying data across nodes for fault
tolerance and high availability. It comes in two forms:
Master-Slave
Peer-to-peer
Sharding: Horizontal partitioning of data across multiple
servers to scale and distribute the load efficiently.
How:
Split large datasets into shards, each stored on a
different node.
Sharding criteria can be based on user ID, region,
date, or other attributes.
Data locality: Helps distribute data closer to where it's
needed (e.g., geographical sharding).
Key Aspects:
Shard Key: A specific field used to divide the data
(e.g., customer ID, product ID); a routing sketch
follows this list.
Uniform Distribution: Ideally, shards should be evenly
distributed to avoid hot spots.
Global Indexing: Maintains a global index that helps to
locate data across shards.
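A minimal sketch of shard-key routing, assuming a simple hash-plus-modulo scheme (the shard_for helper is hypothetical, not tied to any particular database); production systems often prefer range partitioning or consistent hashing so that adding shards moves less data:

```python
import hashlib

def shard_for(shard_key: str, num_shards: int) -> int:
    """Map a shard key (e.g., a customer ID) to a shard index
    using a hash of the key modulo the number of shards."""
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Route a few customer records across 4 shards.
for customer_id in ["cust-101", "cust-202", "cust-303"]:
    print(customer_id, "-> shard", shard_for(customer_id, 4))
```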
Advantages:
Scalability
Performance: Parallel access to different data parts
improves throughput.
Fault Isolation: Failure of one shard doesn’t affect
others.
Challenges:
Data Rebalancing: Adding/removing shards requires
migrating data.
Consistency
Query Complexity: Queries spanning multiple shards
may impact performance.
Master-Slave Replication
One master node holds authoritative data, while slave
nodes replicate the data for read scalability and fault
tolerance.
How:
Master: Handles write operations (INSERT, UPDATE).
Slaves: Replicas that handle read queries; a
read/write-splitting sketch follows this list.
Replication can be synchronous or asynchronous,
depending on latency and consistency needs.
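A minimal read/write-splitting sketch, assuming hypothetical master and replica connection objects that expose an execute() method; it illustrates only the routing, not the replication mechanism itself:

```python
import random

class ReadWriteRouter:
    """Route writes to the master and reads to a replica.
    The connection objects are hypothetical stand-ins for
    real database client handles."""

    def __init__(self, master, replicas):
        self.master = master        # handles writes
        self.replicas = replicas    # handle read queries

    def write(self, statement, params=()):
        # All writes go to the single authoritative master.
        return self.master.execute(statement, params)

    def read(self, query, params=()):
        # Reads are spread across replicas; with asynchronous
        # replication a replica may lag behind the master.
        replica = random.choice(self.replicas)
        return replica.execute(query, params)
```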
Key Aspects:
Replication Lag: Asynchronous replication can
introduce a delay in data synchronization.
Failover Mechanism: In case of master failure, a slave
can be promoted to master.
Eventual Consistency: Slaves may not have the most
up-to-date data immediately after write operations.
Advantages:
Read Scalability
Fault Tolerance: Automatic failover in case of master
failure (with proper configuration).
Simple Architecture
Challenges:
Write Bottleneck: The master can become a bottleneck
for write-heavy workloads.
Replication Lag: May cause temporary data
inconsistency between master and slaves.
Single Point of Failure: If the master fails and no failover
mechanism is in place, writes are blocked.
Peer-to-Peer Replication
Every node (peer) is equal, with no central master. All
nodes handle both reads and writes.
How:
Distributed Sync: Data is synchronized across all
nodes, and each peer can handle both read and write
operations.
Conflict Resolution: Conflicts (e.g., the same data
modified on two nodes) are resolved using
mechanisms such as versioning or last-write-wins.
Key Aspects:
Decentralized: No central master, and all peers
communicate directly.
Conflict Handling: Must ensure data consistency
through advanced techniques (e.g., vector clocks); a
vector-clock sketch follows this list.
Eventual Consistency: Peers may not have the same
data immediately after updates.
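A minimal vector-clock comparison sketch (illustrative only, not the API of any particular store) showing how to decide whether two updates are causally ordered or in conflict:

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node_id -> counter).
    Returns 'a_before_b', 'b_before_a', 'equal', or 'conflict'
    when neither update causally precedes the other."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a_before_b"
    if b_le_a:
        return "b_before_a"
    return "conflict"  # concurrent updates: reconciliation needed

# Two peers updated the same record concurrently -> "conflict".
print(compare({"node1": 2, "node2": 1}, {"node1": 1, "node2": 2}))
```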
Advantages:
Fault Tolerance: No single point of failure.
High Availability
Load Distribution: Equal distribution of read and write
operations across peers.
Challenges:
Data Consistency
Conflict Management: Requires efficient mechanisms
to resolve conflicts when peers have conflicting
updates.
Network Overhead
Sharding + Replication
Combining sharding (data partitioning) and replication
(data duplication) for scalable, fault-tolerant distributed
systems.
How:
Sharding: Splits data across multiple nodes based on
a shard key (e.g., customer ID).
Replication: Each shard is replicated on multiple
nodes to provide fault tolerance and high availability.
Partitioning & Replication: Partitioning data (sharding)
ensures scalability, while replication ensures data
availability and fault tolerance; a combined sketch
follows this list.
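A minimal sketch combining the two ideas, again assuming a hypothetical hash-based shard_for() helper and hypothetical node connection objects: writes go to the primary of the owning shard, reads go to any replica of that shard.

```python
import hashlib
import random

def shard_for(shard_key: str, num_shards: int) -> int:
    # Same hypothetical hash-plus-modulo routing as in the sharding sketch.
    return int(hashlib.md5(shard_key.encode()).hexdigest(), 16) % num_shards

class ShardedReplicatedStore:
    """Each shard is a dict with a 'primary' node and a list of
    'replicas' (all hypothetical connection objects)."""

    def __init__(self, shards):
        self.shards = shards  # list of {'primary': ..., 'replicas': [...]}

    def write(self, shard_key, statement, params=()):
        shard = self.shards[shard_for(shard_key, len(self.shards))]
        return shard["primary"].execute(statement, params)

    def read(self, shard_key, query, params=()):
        shard = self.shards[shard_for(shard_key, len(self.shards))]
        node = random.choice(shard["replicas"] or [shard["primary"]])
        return node.execute(query, params)
```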
Key Aspects:
Data Redundancy
Consistency Model: Ensures consistency (often
eventual consistency) across shards and replicas.
Fault Isolation
Advantages:
Scalability
Availability
Improved Performance: Load balancing between
shards and replicas for both read and write
operations.
Challenges:
Complexity in Management: Managing both sharding
and replication requires careful planning.
Data Synchronization
Rebalancing: Data may need to be redistributed
across shards as the system scales.
Consistency & Relaxing Consistency
CAP Theorem: A distributed system can guarantee at
most two of Consistency, Availability, and Partition
tolerance at the same time.
Scenarios:
CA Systems: No partition tolerance; suited for single-
node systems (e.g., RDBMS).
AP Systems: No strong consistency; common in geo-
distributed systems (e.g., DNS).
CP Systems: Sacrifice availability during a partition
(e.g., MongoDB).
Version Stamps
Purpose:
Detect conflicts in updates.
Enable reconciliation in peer-to-peer or eventually
consistent systems.
Types:
Logical Timestamps: Track the order of events (e.g.,
Lamport timestamps); a sketch follows this list.
Vector Clocks: Track causal relationships between
updates.
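A minimal Lamport-clock sketch (illustrative only; the node scenario is made up) showing how logical timestamps preserve the order of events across nodes:

```python
class LamportClock:
    """Logical clock: increments on local events and takes the
    max of local and received timestamps on message receipt."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        # Local event (e.g., a write on this node).
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # Merge with the sender's timestamp, then advance.
        self.time = max(self.time, sender_time) + 1
        return self.time

# Node A performs two writes, then node B receives A's update.
a, b = LamportClock(), LamportClock()
a.tick(); a.tick()           # A is now at 2
print(b.receive(a.time))     # B jumps to 3, preserving event order
```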
Applications:
Conflict detection in distributed databases.
Version control systems (e.g., Git).
Challenges:
Increased overhead in managing clocks.
Resolving conflicting timestamps.
Map-Reduce
Steps:
Map: Input data → key-value pairs.
Shuffle: Group data by keys across nodes.
Reduce: Combine values with the same key to
produce results; a word-count sketch follows this list.
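A minimal word-count sketch of the three steps in plain, single-process Python (illustrative only; real Map-Reduce frameworks distribute the mappers and reducers across a cluster):

```python
from collections import defaultdict

def map_phase(documents):
    # Map: each word becomes a (word, 1) key-value pair.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine values with the same key (here, sum counts).
    return {key: sum(values) for key, values in groups.items()}

docs = ["to be or not to be", "to do or not to do"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'do': 2}
```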
Key Features:
Parallel processing using mappers/reducers.
Fault Tolerance: Re-runs failed tasks.
Data locality minimizes network transfer.
Applications:
Scaling computations across clusters.
Efficient large-scale data processing.
Challenges:
Non-combinable reducers (e.g., counting unique
elements); see the contrast sketched after this list.
Balancing partitions for equal load distribution.
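A small contrast (hypothetical example, not tied to any framework) showing why summing is combinable across partitions while counting unique elements is not:

```python
# Combinable: partial sums can be summed again to get the final sum.
partial_sums = [sum([1, 1, 1]), sum([1, 1])]
total = sum(partial_sums)            # 5 -- same as summing everything at once

# Non-combinable: unique-element counts from each partition cannot
# simply be added, because partitions may share elements.
part_a, part_b = {"x", "y"}, {"y", "z"}
wrong = len(part_a) + len(part_b)    # 4 -- overcounts the shared "y"
right = len(part_a | part_b)         # 3 -- the sets themselves must be merged
print(total, wrong, right)
```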
Composing Map-Reduce Calculations