NoSQL - Unit 2

NoSQL DB notes


Distribution Models

Definition: Techniques to optimize data distribution across multiple servers or nodes for better performance, availability, and scalability.

Achieved through:
Sharding: Data is partitioned into smaller chunks spread across multiple nodes.
Replication: Data is copied across nodes for fault tolerance and high availability, in either a master-slave or a peer-to-peer arrangement.
Sharding: Horizontal partitioning of data across multiple
servers to scale and distribute the load efficiently.

How:
Split large datasets into shards, each stored on a
different node.
Sharding criteria can be based on user ID, region,
date, or other attributes.
Data locality: Helps distribute data closer to where it's
needed (e.g., geographical sharding).

Key Aspects:
Shard Key: A specific field used to divide the data
(e.g., customer ID, product ID).
Uniform Distribution: Ideally, shards should be evenly
distributed to avoid hot spots.
Global Indexing: Maintains a global index that helps to
locate data across shards.

Advantages:
Scalability
Performance: Parallel access to different data parts
improves throughput.
Fault Isolation: Failure of one shard doesn’t affect
others.

Challenges:
Data Rebalancing: Adding or removing shards requires migrating data.
Consistency: Keeping related data consistent across shards is harder.
Query Complexity: Queries that span multiple shards are more complex and can hurt performance.
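A minimal sketch of the shard-key idea in Python, assuming a hash-based routing scheme; the shard count and the customer-ID keys are illustrative, and real stores also use range sharding or consistent hashing.

import hashlib

NUM_SHARDS = 4  # assumed cluster size

def shard_for(shard_key: str) -> int:
    # Hash the shard key (e.g., a customer ID) and map it to a shard number.
    digest = hashlib.md5(shard_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Records with the same shard key always land on the same shard.
print(shard_for("customer-42"))
print(shard_for("customer-1001"))

Note that with simple modulo hashing, changing NUM_SHARDS reassigns most keys, which is exactly the rebalancing challenge listed above; consistent hashing limits how much data has to move.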
Master-Slave Replication
One master node holds authoritative data, while slave
nodes replicate the data for read scalability and fault
tolerance.

How:
Master: Handles write operations (INSERT, UPDATE).
Slaves: Replicas that handle read queries.
Replication can be synchronous or asynchronous, chosen based on latency and consistency needs.

Key Aspects:
Replication Lag: Asynchronous replication can
introduce a delay in data synchronization.
Failover Mechanism: In case of master failure, a slave
can be promoted to master.
Eventual Consistency: Slaves may not have the most
up-to-date data immediately after write operations.

Advantages:
Read Scalability
Fault Tolerance: Automatic failover in case of master
failure (with proper configuration).
Simple Architecture

Challenges:
Write Bottleneck: Master can become a bottleneck for
write-heavy workloads.
Replication Lag: May cause temporary data
inconsistency between master and slaves.
Single Point of Failure: If master fails and no failover
mechanism is in place, writes are blocked.
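A minimal sketch of master-slave routing (writes to the master, reads spread across slaves), assuming hypothetical node objects with execute() and query() methods rather than any particular database driver.

import itertools

class MasterSlaveRouter:
    def __init__(self, master, slaves):
        self.master = master
        self._reads = itertools.cycle(slaves)  # round-robin over slave nodes

    def write(self, statement):
        # All writes go to the single master (the potential write bottleneck).
        return self.master.execute(statement)

    def read(self, query):
        # Reads go to slaves; with asynchronous replication the result may lag
        # slightly behind the master (eventual consistency).
        return next(self._reads).query(query)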
Peer-to-Peer Replication
Every node (peer) is equal, with no central master. All
nodes handle both reads and writes.

How:
Distributed Sync: Data is synchronized across all
nodes, and each peer can handle both read and write
operations.
Conflict Resolution: Conflicts (e.g., same data
modified in two nodes) are resolved using
mechanisms like versioning or last-write-wins.

Key Aspects:
Decentralized: No central master, and all peers
communicate directly.
Conflict Handling: Must ensure data consistency
through advanced techniques (e.g., vector clocks).
Eventual Consistency: Peers may not have the same
data immediately after updates.

Advantages:
Fault Tolerance: No single point of failure (SPOF).
High Availability
Load Distribution: Equal distribution of read and write
operations across peers.

Challenges:
Data Consistency: Harder to guarantee without a single authoritative copy.
Conflict Management: Requires efficient mechanisms to resolve conflicts when peers have conflicting updates.
Network Overhead: Peers must continually exchange updates with one another.
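A small sketch of last-write-wins conflict resolution between two peers; the (timestamp, data) record format is an assumption made for illustration.

def resolve_lww(local, remote):
    # Each value is a (timestamp, data) pair; keep whichever was written last.
    return local if local[0] >= remote[0] else remote

peer_a = (1700000005, {"stock": 9})
peer_b = (1700000010, {"stock": 7})
print(resolve_lww(peer_a, peer_b))  # peer_b wins because it wrote later

Last-write-wins is simple but silently discards one of two concurrent updates; vector clocks (see Version Stamps below) detect such conflicts instead of hiding them.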
Sharding + Replication
Combining sharding (data partitioning) and replication
(data duplication) for scalable, fault-tolerant distributed
systems.

How:
Sharding: Splits data across multiple nodes based on
a shard key (e.g., customer ID).
Replication: Each shard is replicated on multiple
nodes to provide fault tolerance and high availability.
Partitioning & Replication: Partitioning data (sharding)
ensures scalability, while replication ensures data
availability and fault tolerance.

Key Aspects:
Data Redundancy: Each shard exists as multiple copies on different nodes.
Consistency Model: Ensures consistency (often eventual consistency) across shards and replicas.
Fault Isolation: Losing a node affects only the shards it hosted; their replicas take over.

Advantages:
Scalability
Availability
Improved Performance: Load balancing between
shards and replicas for both read and write
operations.

Challenges:
Complexity in Management: Managing both sharding and replication requires careful planning.
Data Synchronization: Keeping each shard's replicas in sync adds overhead.
Rebalancing: Data may need to be redistributed across shards as the system scales.
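A toy placement sketch showing how each shard can be kept on several nodes at once; the node names, shard count, and replication factor are assumptions.

NODES = ["node1", "node2", "node3", "node4", "node5", "node6"]
NUM_SHARDS = 3
REPLICATION_FACTOR = 2

def placement():
    # Assign each shard's replicas to consecutive nodes, wrapping around.
    plan = {}
    for shard in range(NUM_SHARDS):
        plan[shard] = [NODES[(shard * REPLICATION_FACTOR + r) % len(NODES)]
                       for r in range(REPLICATION_FACTOR)]
    return plan

print(placement())
# {0: ['node1', 'node2'], 1: ['node3', 'node4'], 2: ['node5', 'node6']}

If node3 fails, only shard 1 is affected, and its copy on node4 keeps serving it.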
Consistency & Relaxing Consistency

Consistency: Ensures all nodes have the same data after an update operation.
Types:
i. Strong Consistency: Immediate synchronization;
all clients see updated data.
ii. Eventual Consistency: Updates propagate
asynchronously; all replicas converge eventually.
iii. Causal Consistency: Updates maintain a cause-
effect order.

Relaxing Consistency: Sacrificing immediate synchronization for scalability/performance.
Techniques:
Eventual Consistency: Used in distributed systems
(e.g., DynamoDB, Cassandra).
Quorum Reads/Writes: Require acknowledgements from only a subset (quorum) of replicas; choosing quorums so that reads and writes overlap (R + W > N) still returns up-to-date data.
Trade-offs:
↑ Availability & performance.
↓ Consistency guarantees.
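A short sketch of the quorum arithmetic: with N replicas, W write acknowledgements, and R read acknowledgements, a read is guaranteed to overlap the latest write whenever R + W > N.

def read_sees_latest_write(n, w, r):
    # The read and write sets must share at least one replica.
    return r + w > n

print(read_sees_latest_write(n=3, w=2, r=2))  # True: sets must overlap
print(read_sees_latest_write(n=3, w=1, r=1))  # False: a read can miss the write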
CAP Theorem

A distributed system can only achieve two out of three:
Consistency (C): All nodes see the same data simultaneously.
Availability (A): System continues to operate despite failures.
Partition Tolerance (P): System functions despite network partition.

Scenarios:
CA Systems: No partition tolerance; suited for single-
node systems (e.g., RDBMS).
AP Systems: No strong consistency; common in geo-
distributed systems (e.g., DNS).
CP Systems: Sacrifice availability during partition
(e.g., MongoDB).

Applications: Guides system design by making the trade-off between consistency and availability explicit.
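A toy sketch contrasting CP and AP behaviour when a write arrives during a network partition; it is purely illustrative and not modelled on any particular database.

def handle_write(mode, partitioned, value, reachable_replicas):
    if partitioned and mode == "CP":
        # CP choice: refuse the write (lose availability, keep consistency).
        raise RuntimeError("unavailable during partition")
    # AP choice: accept the write on reachable replicas (stay available,
    # let unreachable replicas diverge until the partition heals).
    for replica in reachable_replicas:
        replica["value"] = value
    return "accepted"

replicas = [{}, {}]
print(handle_write("AP", True, 42, replicas))  # 'accepted'; replicas reconcile later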


Version Stamps

Concept: Identifiers for tracking changes/versioning in distributed systems.

Purpose:
Detect conflicts in updates.
Enable reconciliation in peer-to-peer or eventually
consistent systems.

Types:
Logical Timestamps: Track the order of events (e.g., Lamport timestamps).
Vector Clocks: Track causal relationships between updates.

Applications:
Conflict detection in distributed databases.
Version control systems (e.g., Git).

Challenges:
↑ Overhead in managing clocks.
Resolving conflicting timestamps.
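A minimal vector-clock sketch showing how causal order and conflicting updates can be detected; the node names are illustrative.

def increment(clock, node):
    # Return a copy of the clock with this node's counter advanced by one.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def happened_before(a, b):
    # a precedes b if every counter in a is <= the matching counter in b.
    return a != b and all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

def in_conflict(a, b):
    # Neither update precedes the other: they were made concurrently.
    return a != b and not happened_before(a, b) and not happened_before(b, a)

v1 = increment({}, "nodeA")     # {'nodeA': 1}
v2 = increment(v1, "nodeB")     # builds on v1
v3 = increment(v1, "nodeC")     # also builds on v1, concurrently with v2
print(happened_before(v1, v2))  # True
print(in_conflict(v2, v3))      # True: concurrent updates need reconciliation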
Map-Reduce

Concept: Framework for parallel processing of large datasets using clusters.

Steps:
Map: Input data → key-value pairs.
Shuffle: Group data by keys across nodes.
Reduce: Combine values with the same key to
produce results.

Key Features:
Parallel processing using mappers/reducers.
Fault Tolerance: Re-runs failed tasks.
Data locality minimizes network transfer.

Use Cases: Analytics, indexing, log processing (e.g., Hadoop, Spark).

Challenges: Not ideal for iterative algorithms or real-time processing.
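A single-machine word-count sketch of the three steps; real frameworks such as Hadoop or Spark run the same stages in parallel across a cluster.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)            # Map: input -> key-value pairs

def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)      # Shuffle: group values by key
    return groups

def reduce_phase(groups):
    # Reduce: combine values that share a key into one result per key.
    return {key: sum(values) for key, values in groups.items()}

docs = ["nosql scales out", "nosql relaxes consistency"]
print(reduce_phase(shuffle_phase(map_phase(docs))))
# {'nosql': 2, 'scales': 1, 'out': 1, 'relaxes': 1, 'consistency': 1}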
Partitioning & Combining

Partitioning: Splitting data into logical chunks for parallel processing.
Based on key-value mapping (e.g., hash partitioning).
Enables distributed reduce operations.

Combining: Reduces data at the mapper stage before shuffling.
Minimizes data transfer across nodes.
Example: Pre-aggregating sales data per product on
mapper nodes.

Applications:
Scaling computations across clusters.
Efficient large-scale data processing.

Challenges:
Non-combinable reducers (e.g., counting unique
elements).
Balancing partitions for equal load distribution.
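A small sketch of mapper-side combining for the sales example above; the product names and amounts are made up.

from collections import Counter

mapper_node_1 = [("apple", 10), ("apple", 5), ("pear", 2)]
mapper_node_2 = [("apple", 3), ("pear", 7)]

def combine(pairs):
    # Local (mapper-side) aggregation of values by key before shuffling.
    totals = Counter()
    for product, amount in pairs:
        totals[product] += amount
    return list(totals.items())

# Only one pair per product leaves each mapper; the reducer finishes the sum.
combined = combine(mapper_node_1) + combine(mapper_node_2)
print(combine(combined))  # [('apple', 18), ('pear', 9)]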
Composing Map-Reduce Calculations

Concept: Combining multiple Map-Reduce stages for complex computations.
Process:
Output of one stage → Input for the next (e.g., pipes-
and-filters model).
Example:
Task: Compare monthly sales year-over-year.
Stage 1: Aggregate sales per product per month.
Stage 2: Compare results of each month for
consecutive years.
Advantages:
Reusability: Intermediate outputs useful for other
tasks.
Scalability: Breaks down complex logic into smaller,
manageable stages.
Materialized Views: Storing intermediate outputs as materialized views improves efficiency.
Tools: Apache Pig (simplified Map-Reduce), Hive (SQL-like interface).
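A compact sketch of composing the two stages from the example above, where the output of stage 1 feeds stage 2; the sales records and field layout are assumptions.

from collections import defaultdict

sales = [  # (product, year, month, amount)
    ("widget", 2023, 1, 100), ("widget", 2023, 1, 50),
    ("widget", 2024, 1, 180),
]

def stage1(records):
    # Stage 1: total sales keyed by (product, year, month).
    totals = defaultdict(int)
    for product, year, month, amount in records:
        totals[(product, year, month)] += amount
    return totals

def stage2(totals):
    # Stage 2: compare each month's total with the same month a year earlier.
    comparison = {}
    for (product, year, month), amount in totals.items():
        prev = totals.get((product, year - 1, month))
        if prev is not None:
            comparison[(product, year, month)] = amount - prev
    return comparison

print(stage2(stage1(sales)))  # {('widget', 2024, 1): 30}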
