Unit II

The document provides an overview of NoSQL data management, detailing various types of NoSQL databases including document, key-value, column-family, and graph databases, along with their characteristics and use cases. It highlights the differences between NoSQL and traditional relational databases, particularly in terms of scalability, flexibility, and data handling through aggregates. Additionally, it discusses the implications of aggregate orientation on data management and consistency in NoSQL systems.


UNIT II

NOSQL DATA MANAGEMENT


Introduction to NoSQL – aggregate data models – key-
value and document data models – relationships – graph
databases – schemaless databases – materialized views
– distribution models – master-slave replication –
consistency - Cassandra – Cassandra data model –
Cassandra examples – Cassandra clients
Introduction to NoSQL
 NoSQL stands for "Not Only SQL".
 It is a database management approach designed to handle large volumes of
unstructured and semi-structured data. Unlike traditional relational
databases, which use tables and predefined schemas, NoSQL databases provide
flexible data models and support horizontal scalability, making them well
suited to modern applications that require real-time data processing.
 Types of NoSQL Databases
 Document databases
 Store data in JSON, BSON, or XML format.
 Data is stored as documents that can contain varying attributes.
 Examples: MongoDB, CouchDB, Cloudant
 Ideal for content management systems, user profiles, and catalogs where flexible
schemas are needed.
Introduction to NoSQL
 Types of NoSQL Databases
 Key-value stores
 Data is stored as key-value pairs, making retrieval extremely fast.
 Optimized for caching and session storage.
 Examples: Redis, Memcached, Amazon DynamoDB
 Perfect for applications requiring session management, real-time data caching,
and leaderboards.
 Column-family stores/ Tabular
 Data is stored in columns rather than rows, enabling high-speed analytics and
distributed computing.
 Efficient for handling large-scale data with high write/read demands.
 Examples: Apache Cassandra, HBase, Google Bigtable
 Great for time-series data, IoT applications, and big data analytics.
Introduction to NoSQL
 Types of NoSQL Databases
 Graph databases
 Data is stored as nodes and edges, enabling complex relationship management.
 Best suited for social networks, fraud detection, and recommendation
engines.
 Examples: Neo4j, Amazon Neptune, ArangoDB
 Useful for applications requiring relationship-based queries such as fraud
detection and social network analysis.
Introduction to NoSQL
 Traditional relational databases
 ACID (Atomicity, Consistency, Isolation, Durability) principles
 ensuring strong consistency and structured relationships between data
 NoSQL
 Scalability – Can scale horizontally by adding more nodes instead of
upgrading a single machine.
 Flexibility – Supports unstructured or semi-structured data without a rigid
schema.
 High Performance – Optimized for fast read/write operations with large
datasets.
 Distributed Architecture – Designed for high availability and partition
tolerance in distributed systems.
Aggregate data models
 An aggregate is a collection of related objects that we wish to treat as a
unit; in particular, it is a unit for data manipulation and for managing
consistency. Aggregates are typically represented in JSON format.
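As an illustration (the entity and field names here are hypothetical), a customer-order aggregate might be stored as a single JSON document, keeping the customer, line items, address, and payment together as one unit:

```json
{
  "id": "order_1001",
  "customer": { "name": "Alice", "email": "alice@example.com" },
  "items": [
    { "product": "Widget", "qty": 2, "price": 9.99 },
    { "product": "Gadget", "qty": 1, "price": 24.50 }
  ],
  "shippingAddress": { "city": "Springfield", "zipcode": "12345" },
  "payment": { "method": "card" }
}
```

Everything an application needs to display or update this order lives in one aggregate, so it can be stored and retrieved as a unit.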
Consequences of Aggregate Orientation
 Aggregate Orientation: This refers to organizing data based on how
it’s used by applications, focusing on units like orders containing items,
addresses, and payments. Relational databases don’t recognize
aggregates, making them aggregate-ignorant.
 Relational Databases: They model data using foreign keys but don't
differentiate between aggregate relationships and other relationships,
which can complicate data handling in certain scenarios.
 NoSQL Databases: Aggregate-oriented databases (like many NoSQL
systems) explicitly use aggregates, helping optimize data storage and
distribution in distributed systems by keeping related data together on
the same node.
Consequences of Aggregate Orientation
 Challenges with Aggregates: Aggregates simplify some interactions
(e.g., customer orders), but complicate others (e.g., product sales
analysis), as they require digging through multiple aggregates to gather
the data.
 ACID Transactions: Relational databases support ACID transactions
across multiple tables. NoSQL databases typically support atomic
transactions only within a single aggregate, leaving developers to
manage atomicity across aggregates in application code.
 Consistency in NoSQL: NoSQL’s lack of ACID transactions across
aggregates doesn’t necessarily mean a lack of consistency; the concept
of consistency is more complex than just supporting or not supporting
ACID transactions.
ACID
 Atomicity: Atomicity ensures that a transaction is treated as a single,
indivisible unit. Either the entire transaction is executed, or none of it is.
If any part of the transaction fails, the whole transaction is rolled back.
 Consistency: Consistency ensures that a transaction brings the
database from one valid state to another. If a transaction violates any
database rules or constraints (e.g., an invalid amount), it will be rolled
back to maintain data integrity.
 Isolation : Isolation ensures that transactions are executed
independently of each other. Even if multiple transactions are happening
at the same time, each transaction is isolated from others to prevent
data inconsistencies.
 Durability: Durability ensures that once a transaction is committed, it
is permanently saved in the database, even in the case of a power
failure or system crash.
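The rollback behavior behind atomicity can be sketched in a few lines of Python. This is a toy in-memory simulation (not a real database engine): a snapshot of the data is taken before the transaction, and restored if any step fails.

```python
# Toy simulation of an atomic transaction: either every step applies,
# or the snapshot taken at the start is restored (rollback).
def transfer(accounts, src, dst, amount):
    snapshot = dict(accounts)          # copy taken before the transaction
    try:
        accounts[src] -= amount
        if accounts[src] < 0:          # consistency rule: no negative balance
            raise ValueError("insufficient funds")
        accounts[dst] += amount
    except Exception:
        accounts.clear()
        accounts.update(snapshot)      # roll back: restore the snapshot
        raise

accounts = {"A": 100, "B": 50}
transfer(accounts, "A", "B", 30)       # succeeds: A=70, B=80
try:
    transfer(accounts, "A", "B", 500)  # violates the rule, rolls back
except ValueError:
    pass
print(accounts)                        # {'A': 70, 'B': 80}
```

The failed transfer leaves no partial effect behind, which is exactly the "all or nothing" guarantee atomicity describes.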
key-value data model
 In a key-value store, the data is stored as a collection of key-value pairs.
Each key is unique, and it’s associated with a value that can be a simple
or complex data structure.
 Key: A unique identifier for the value. It could be a string, number, or
any other primitive data type.
 Value: The data associated with the key. It could be a string, number,
JSON object, binary data, etc.
{
"user123": {"name": "Alice", "age": 30},
"user456": {"name": "Bob", "age": 25}
}
key-value data model
 Simple and Fast: Key-value databases are highly optimized for retrieving
values associated with a specific key, offering very fast lookups.
 Scalable: They are generally highly scalable and can handle huge amounts
of data across many servers.
 Flexibility: Values can be of any type, allowing for a flexible schema. There
is no rigid structure, making it easy to evolve the database over time.
 Limited Query Capability: Key-value stores don’t typically support
complex queries or join operations like relational databases.
 Use cases:
 Session storage: Websites use key-value pairs to store session data.
 Caching: Frequently accessed data is cached as key-value pairs to speed up
retrieval.
 Configuration management: Storing configuration settings and preferences.
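The key-value operations above can be sketched as a toy in-memory store in Python; real systems such as Redis add persistence, expiry, and networking on top of the same basic interface:

```python
# A toy in-memory key-value store illustrating the model's core operations.
class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value        # overwrite on duplicate key

    def get(self, key, default=None):
        return self._data.get(key, default)   # O(1) lookup by key

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user123", {"name": "Alice", "age": 30})
store.put("session:abc", {"cart": ["book"]})   # values can be any shape
print(store.get("user123")["name"])            # Alice
```

Note that the store never inspects the value: lookups are by key only, which is why complex queries and joins are outside the model.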
document data model
 In a document data model, each "document" is an individual unit of data. A document
is usually in a format like JSON, BSON, or XML and can contain multiple fields with
different data types, including nested structures
{
  "_id": "12345",
  "name": "Alice",
  "age": 30,
  "address": {
    "street": "123 Main St",
    "city": "Springfield",
    "zipcode": "12345"
  },
  "interests": ["reading", "swimming", "cycling"]
}
document data model
 Schema-less: each document can have a different structure. This
flexibility allows you to store varying types of data within the same
collection.
 Hierarchy: A document can contain other documents (nested
documents), arrays, or key-value pairs. This makes it suitable for
complex data with deep relationships and nested structures.
 Unique Identifiers: Each document typically has a unique identifier
(often referred to as a primary key), such as a document ID, which
allows efficient access and retrieval.
 Scalable: Document data models are often used in databases designed
to scale horizontally (across many servers), which makes them suitable
for large volumes of unstructured or semi-structured data.
document data model
 Usecases
 data has a flexible or evolving structure.
 large amounts of unstructured or semi-structured data.
 scalability and performance are key concerns.
 to store hierarchical or nested data in a way that is easy to retrieve.
 Document-Based Databases:
 MongoDB: stores data in BSON format (Binary JSON), where each document
is similar to a JSON object with key-value pairs.
 CouchDB: stores data in JSON format and allows querying the data using HTTP
requests.
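As a rough sketch of the document model's flexible querying (pure Python, not MongoDB's actual query engine), a "collection" is just a list of dicts, and documents in it may have different shapes:

```python
# Sketch: find() returns documents whose fields equal the query's values,
# tolerating documents that lack some fields entirely.
collection = [
    {"_id": "1", "name": "Alice", "address": {"city": "Springfield"}},
    {"_id": "2", "name": "Bob", "interests": ["cycling"]},  # different shape
]

def find(collection, query):
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

print(find(collection, {"name": "Bob"}))
```

Unlike a key-value store, the query engine can look inside documents, which is what makes the document model more than opaque values keyed by ID.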
Column Family Stores
 In a column-family data model, data is
stored in a structure resembling a table
with rows and columns, but the key
difference is that:
 Columns in a column family can vary
from row to row.
 Columns are grouped together into
families —
 A column family is essentially a
container for rows of data, and each row
has a unique key. Within each row, data is
stored in columns, but not all rows have
the same columns, making the model
flexible. Columns within a family are
stored together physically on disk for more
efficient access.
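A minimal sketch of the column-family layout in Python (the family, row, and column names are made up for illustration): a family maps row keys to a dict of columns, and rows in the same family may carry different columns.

```python
# Sketch: the "user_profile" column family, keyed by row key.
# Rows need not share the same columns ("dynamic schema").
user_profile = {
    "row:alice": {"name": "Alice", "email": "a@example.com"},
    "row:bob":   {"name": "Bob", "phone": "555-0100"},  # different columns
}

def get_column(family, row_key, column):
    # Look up one column in one row; absent columns simply return None.
    return family.get(row_key, {}).get(column)

print(get_column(user_profile, "row:bob", "phone"))    # 555-0100
print(get_column(user_profile, "row:alice", "phone"))  # None (column absent)
```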
Column Family Stores
 Cassandra uses the terms "wide" and "skinny." Skinny rows have few columns,
with the same columns used across many different rows.
 A wide row has many columns (perhaps thousands), with different rows having
very different columns. A wide column family models a list, with each
column being one element in that list.
 Features :
 Dynamic Schema: There’s no strict schema enforcement as in relational
databases. Each row can have a different number of columns.
 Efficient for Queries on Specific Columns: Column-family stores are
optimized for reading and writing specific columns, which makes them useful
for queries that only need a subset of the data, such as reading or writing a
few columns in a row. Data within the column family is stored together on
disk, which enhances the speed of reads when accessing related columns.
 Compression and Data Access: Because related columns are stored
together in a column family, they can often be compressed more effectively,
leading to better storage efficiency. Data from different column families might
be stored in separate disk blocks, so reading a column family may be faster
than reading an entire table (depending on access patterns)
Column Family Stores
 How Column Family Databases Work:
 Data Distribution: Column-family stores use a distributed architecture to
manage data across multiple machines. Data is distributed based on the row
key.
 Column Families on Disk: Columns that are related are stored together
physically on disk, which helps to optimize both read and write operations.
 Flexibility in Schema: There’s flexibility in column design, allowing the
database to handle evolving data structures without downtime or complex
schema migrations.
 Column Family Databases : Apache Cassandra, Hbase
 Limitations
 Limited Query Capabilities: Unlike relational databases, column-family
stores don’t support complex joins, and querying can be limited to specific
access patterns.
 Eventual Consistency: Many column-family databases like Cassandra use
eventual consistency (as part of the CAP theorem), meaning that data
might not be immediately consistent across all replicas.
Note
 An aggregate is a collection of data that we interact with as a unit.
Aggregates form the boundaries for ACID operations with the database.
 Key-value, document, and column-family databases can all be seen as forms
of aggregate oriented database.
 Aggregates make it easier for the database to manage data storage over
clusters.
 Aggregate-oriented databases work best when most data interaction is done
with the same aggregate;
 aggregate-ignorant databases are better when interactions use data
organized in many different formations.
 Aggregate-oriented databases : key-value, column Family, Document based
Data models
 aggregate-ignorant databases : Relational DB, Graph DB
Relationships
 Relationships refer to the ways in which different entities are linked or
associated with one another.
 Relational Databases (SQL):
 relationships are explicitly defined using foreign keys, joins, and
constraints. Relationships are typically normalized, meaning data is divided
into separate tables to reduce redundancy and improve data integrity.
 NoSQL Databases:
 relationships are handled differently because these databases often prioritize
scalability and flexibility over the rigid structure of relational models.
 In Document Databases (e.g., MongoDB):
 Embedded Documents
 Referencing Documents
Relationships
 In Document Databases (e.g., MongoDB):
 Embedded Documents: In a document store, relationships can be represented by
embedding documents inside one another. This is especially useful for one-to-many
relationships.
{
  "_id": "user_001",
  "name": "Alice",
  "orders": [
    { "order_id": "order_001", "order_date": "2025-03-01" },
    { "order_id": "order_002", "order_date": "2025-03-10" }
  ]
}
 Referencing Documents: In some cases, a document might reference another
document using an identifier, like a foreign key in relational databases.
{
  "_id": "order_001",
  "user_id": "user_001",
  "order_date": "2025-03-01"
}
Relationships
 In Column-Family Databases (e.g., Cassandra):
 Column-family databases don't have built-in support for joins or foreign keys.
Relationships are often handled by storing related data in the same row or by
modeling data to suit common query patterns.
 store related data like a list of orders in a single column-family and organize
the data for efficient retrieval.
CREATE TABLE users (
user_id UUID PRIMARY KEY,
name TEXT,
orders LIST<TEXT>
);
 Here, the users table stores each user with a list of orders, modeling a
simple one-to-many relationship without a join.
Relationships
 Strategies for Managing Relationships in NoSQL Databases
 Denormalization: NoSQL databases often duplicate data (denormalization)
to optimize for read-heavy workloads. This avoids the need for complex
joins or queries, but it may require extra effort to keep the data
consistent across the multiple places it is stored.
 Eventual Consistency: Many NoSQL databases, like Cassandra and
MongoDB, operate under eventual consistency rather than strict
consistency. This means that changes to data may not be immediately
reflected in all places, especially in distributed systems.
 Application-Level Relationships: handling relationships is delegated to the
application layer. For example, the application logic may be responsible for
managing associations and linking data across different collections or column
families.
Graph Databases
 A graph database is a type of NoSQL database designed to represent
and store data in the form of a graph. In a graph, entities (also called
nodes) are connected by relationships (also called edges), forming a
network of interconnected data. This data structure is particularly well-
suited for modeling relationships between entities, such as social
networks, recommendations, and various other complex, interconnected
systems.
Graph Databases
 Key Components:
 Nodes: These are the entities or objects in the graph. Each node represents a
real-world object, such as a person, place, product, or event.
 Edges: These are the relationships or connections between nodes. Each edge has
a direction (from one node to another), and it can also carry metadata (properties).
 Properties: Both nodes and edges can have properties. Properties are key-value
pairs that provide more information about the node or the relationship. Ex.An
edge representing a "FRIEND_OF" relationship might have a property such as the
date when the friendship was established.
 Labels: In some graph databases, nodes can have labels that categorize them
into different types. For example, a node could have a label "Person" or "Product".
Graph Databases
 Why Graph DB?
 Graph databases excel in situations where relationships between data are as
important as the data itself. Traditional relational databases are not
optimized for handling complex relationships, especially when those
relationships span multiple tables and require frequent JOIN operations.
Graph databases allow for fast traversal of relationships, which makes
them ideal for:
 Social Networks: Representing connections between users, such as friends,
followers, and interactions.
 Recommendation Engines: Suggesting products or services based on a user’s
preferences and interactions.
 Fraud Detection: Identifying patterns of fraudulent activity through the
relationships between entities like accounts, transactions, and customers.
 Knowledge Graphs: Organizing and linking pieces of information to make it more
accessible for queries and discovery.
Graph Databases
 Graph Databases: Neo4j, ArangoDB, Amazon Neptune, GraphDB
 Graph Database Query Languages: Cypher (used by Neo4j), Gremlin
(used by Apache TinkerPop), SPARQL (used by RDF stores, e.g.,
GraphDB)
 Example:
 Nodes:
 Person (represents a user in the social network)
 Post (represents a post made by a person)
 Tag (represents a tag associated with a post)
 Edges:
 FRIEND_OF (relates two people who are friends)
 FOLLOWS (relates one person following another)
 LIKES (relates a person to a post they liked)
 TAGGED_WITH (relates a post to a tag)
Graph Databases
 Example(cont.):
 Alice is friends with Bob, likes Post 1, and Post 1 is tagged with "Technology".
 Alice FRIEND_OF Bob
 Alice LIKES Post 1
 Post 1 TAGGED_WITH Technology
 Find All Friends of Alice:
MATCH (a:Person)-[:FRIEND_OF]->(b:Person)
WHERE a.name = 'Alice'
RETURN b.name
 Find All Posts Liked by Alice:
MATCH (a:Person)-[:LIKES]->(p:Post)
WHERE a.name = 'Alice'
RETURN p
Schemaless Databases
 A schemaless database is a type of database that does not require a
predefined structure or schema to store data. This is a characteristic of
NoSQL databases, which are designed to be flexible and scalable by
allowing data to be stored without enforcing a rigid schema.
 In a traditional relational database (RDBMS)
 data is stored in tables with predefined columns and data types. The schema
defines how data should be structured.
 Changes to the schema (e.g., adding or removing columns) typically require
migrating the database, which can be complex and time-consuming.
 Schema less databases
 allow data to be stored without a fixed schema, offering flexibility in terms of
how data is structured. different types of data in the same collection, and
each record can have a different structure if needed.
Schemaless Databases
 Features:
 No Fixed Schema: each document or record can have a different structure.
New fields can be added or removed on the fly without affecting other records.
 Flexible Data Models: Data models can evolve as the application changes.
 Dynamic Field Types: store values of any type within a single record. For
instance, one record might store a string in a certain field, while another record
in the same collection might store an integer or a date in the same field.
 No Schema-Based Validation: data validation (checking types, ensuring
certain fields exist, etc.) is performed at the application level; the
database itself doesn’t enforce validation rules based on a schema. This
allows the database to store data more flexibly.
 Benefits
 Flexibility
 Rapid Development: In agile environments where data models evolve quickly,
a schemaless database allows developers to prototype and iterate faster. This
makes it easier to adapt the database structure to changing business needs.
Schemaless Databases
 Benefits
 Scalability: Many schemaless databases, especially NoSQL types, are
designed for horizontal scalability. This makes them well-suited for handling
large volumes of data across distributed systems.
 Handling Complex Data Structures: Schemaless databases can easily
handle complex, hierarchical, or nested data structures (like JSON, XML, or
arrays). This is particularly useful for applications that need to store data with
varying structures, such as user profiles, social media posts, or product
catalogs.
 Adaptability: If the application needs to add new attributes to the data, it can
do so without affecting existing records or requiring downtime for schema
changes.
 Limitations
 Data Integrity: it is easy to accidentally insert data with inconsistent
field types (a field that is usually an integer could be stored as a
string). To mitigate this, validation is often handled at the application
level.
 Complex Queries: querying data from schemaless databases can be more
complex than in relational databases; for example, finding records with
missing or null fields may require special handling.
Materialized views
 Need?
 When we talked about aggregate-oriented data models, we stressed
their advantages. If you want to access orders, it’s useful to have all
the data for an order contained in a single aggregate that can be stored
and accessed as a unit. But aggregate-orientation has a corresponding
disadvantage: What happens if a product manager wants to know
how much a particular item has sold over the last couple of
weeks? Now the aggregate-orientation works against you, forcing you
to potentially read every order in the database to answer the question.
 You can reduce this burden by building an index on the product, but
you’re still working against the aggregate structure.
Materialized views
 A materialized view is a database object that stores the results of a
query as a physical table. Unlike a regular view, which is essentially a
saved SQL query that is re-executed every time it is accessed, a
materialized view stores the query result as actual data. This allows
for faster querying because the results are precomputed and stored on
disk, rather than recalculating them each time a query is run.
 precomputed results of a query, which can include complex joins,
aggregations, or filtering operations. This avoids the need to recompute
these results each time the query is accessed.
 materialized views are stored physically as tables in the database.
 querying from pre-stored data instead of recalculating the results from
base tables.
 Materialized views improve query performance, especially for expensive
or resource-intensive operations (like aggregations, joins, or large table
scans) because they avoid recomputing the results each time.
Materialized views
 Since materialized views store static data, they can become outdated as the
underlying data changes. Refreshing the materialized view ensures that it
reflects the most current data.
 Some databases allow automatic refreshing at specified intervals or when
changes are detected in the underlying tables.
 Incremental refresh updates only the part of the materialized view that is
affected by changes in the base tables.
 Consistency: Depending on how the materialized view is refreshed, it
may be inconsistent with the base tables between refreshes.
 Benefits : Improved Query Performance, Efficiency for Aggregated
Data, Offloading Compute Resources, Simplifies Complex Queries
 Drawbacks: Storage Overhead, stale data, refresh overhead,
Complexity in Maintenance
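A small Python sketch of the idea: precomputing the product-sales question from order aggregates into a view that is refreshed on demand (the data and names are illustrative):

```python
# Sketch: order "aggregates" keep items inside each order, which makes
# per-product totals expensive to answer directly. The materialized view
# precomputes them; refresh_sales_view() is a full (non-incremental) refresh.
orders = [
    {"id": 1, "items": [{"product": "Widget", "qty": 2}]},
    {"id": 2, "items": [{"product": "Widget", "qty": 1},
                        {"product": "Gadget", "qty": 4}]},
]

def refresh_sales_view(orders):
    view = {}
    for order in orders:               # one pass over all aggregates
        for item in order["items"]:
            view[item["product"]] = view.get(item["product"], 0) + item["qty"]
    return view

sales_view = refresh_sales_view(orders)
print(sales_view)                      # {'Widget': 3, 'Gadget': 4}
```

Queries against `sales_view` are then cheap lookups; the cost has moved to refresh time, and the view is stale between refreshes.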
Distribution Models
 Distribution models refer to how data is distributed across multiple
nodes (or servers) in a distributed database system.
 There are two main paths to data distribution: replication and sharding.
 Replication takes the same data and copies it over multiple nodes;
sharding puts different data on different nodes.
 Replication
 master-slave
 peer-to-peer
Distribution Models - Sharding
 Sharding is the process of dividing a large database or dataset into
smaller, more manageable pieces called shards. Each shard is stored
on a separate server or node in a distributed system, allowing for
horizontal scaling to manage large amounts of data efficiently.
Sharding is especially important when a system needs to handle high
volumes of read and write operations that cannot be supported by a
single server.
Distribution Models - Sharding
 A shard is a horizontal partition of data, stored across multiple
machines or nodes. Each shard holds a subset of the data, ensuring that
no single server is overwhelmed with too much load.
 shard key is the attribute or field that determines how data will be
distributed across shards. The data is divided based on the values of the
shard key. For example, in a user database, the user ID might be used
as the shard key.
 shard map is a mapping or directory that helps determine which shard
holds a particular piece of data. It is essential for routing queries to the
correct shard.
 replication is often used to increase availability and fault tolerance
by storing copies of shards on multiple nodes.
 Distributed Querying: Sharding often involves distributing queries
across multiple shards. In cases where the query needs data from
multiple shards, the system may need to aggregate the results from
different shards.
Distribution Models - Sharding
 Types of Sharding
 Range-based Sharding : Data is partitioned based on ranges of values. For
example, a customer database might shard data by customer ID range,
where IDs 1–100,000 are stored on one shard, IDs 100,001–200,000 on
another, and so on.
 Hash-based Sharding: A hash function is applied to the shard key, and the
resulting hash value is used to determine which shard the data belongs to.
This ensures a more even distribution of data.
 List-based Sharding: Data is partitioned according to predefined lists.
 Composite Sharding: A combination of multiple sharding strategies, often
using multiple fields as a shard key.
 Benefits : Scalability, Improved Performance, Fault Tolerance, Better
Resource Utilization
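Hash-based sharding can be sketched as follows in Python; MD5 stands in here for whatever stable hash function a real system uses, and the shards are plain dicts:

```python
import hashlib

# Sketch: a stable hash of the shard key picks one of N shards,
# giving an even spread of keys without maintaining a range map.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS    # deterministic index in [0, 3]

def put(key, value):
    shards[shard_for(key)][key] = value    # route the write to one shard

def get(key):
    return shards[shard_for(key)].get(key) # route the read the same way

put("user123", {"name": "Alice"})
print(get("user123"))                      # {'name': 'Alice'}
```

Because the same key always hashes to the same shard, reads and writes for a key touch exactly one node; the trade-off is that changing `NUM_SHARDS` remaps most keys, which is why real systems use techniques such as consistent hashing.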
Distribution Models - Sharding
 Limitations:
 Data Rebalancing: As data grows, it may need to be redistributed across
shards. if one shard becomes too large, its data may need to be split into
additional shards, a process known as resharding. This can be complex and
time-consuming.
 Complex Querying: Queries that need to access data across multiple
shards can be more complex and slower. Aggregating results from different
shards can introduce additional overhead.
 Shard Key Selection: Choosing an appropriate shard key is critical
 Ensuring consistency across shards, especially when data is replicated, can
be challenging
Distribution Models - Replication
Master-Slave Replication
 Master-Slave Replication is a database replication strategy where
one database server (the master) handles all the write operations,
while one or more slave servers maintain copies of the master’s data.
The slaves replicate changes made to the master, allowing for read
scalability, fault tolerance, and data redundancy.
Distribution Models - Replication
 Master: the primary node that handles write operations (e.g., insert,
update, delete). It is the source of truth for the data, and all changes
to the data are written to this server.
 Slaves: secondary nodes that maintain read-only copies of the data from
the master. Slaves replicate the changes made on the master, typically
through a process that copies the master's transaction logs.
 Replication Process: Changes made to the master database (writes)
are asynchronously or synchronously propagated to the slave(s).
 Synchronous Replication: The slave must acknowledge receipt of the data
before the master confirms the write operation. This ensures that data is
immediately consistent across the master and slaves, but can introduce
latency.
 Asynchronous Replication: The master does not wait for the slave to
confirm receipt of the data before completing the write operation. This
improves performance but can lead to eventual consistency, where data
on the slave may lag behind the master
Distribution Models - Replication
 Replication Lag refers to the time delay between when data is written
to the master and when it appears on the slave(s). Lag is more common
in asynchronous replication and can impact the freshness of the data on
slave servers.
 In the case of failover:
 If the Master goes down, one of the Slaves can be promoted to Master,
ensuring that the website remains functional, although writes may be paused
until failover is complete.
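A toy Python sketch of asynchronous master-slave replication with a manual failover (a simulation of the idea, not a real replication protocol): writes go to the master and are appended to a log, and the slave replays the log to catch up.

```python
# Sketch: the master appends every write to a log; a slave replays log
# entries it has not yet applied. Replication lag is simply the gap
# between len(log) and the slave's `applied` counter.
class Node:
    def __init__(self):
        self.data, self.applied = {}, 0

    def replay(self, log):
        for key, value in log[self.applied:]:  # only unseen entries
            self.data[key] = value
        self.applied = len(log)

log = []                               # the master's change log
master, slave = Node(), Node()

def write(key, value):
    master.data[key] = value           # writes go to the master only
    log.append((key, value))

write("x", 1)
write("y", 2)
slave.replay(log)                      # asynchronous replication step
master = slave                         # failover: promote the slave
print(master.data)                     # {'x': 1, 'y': 2}
```

If the failover happened before `replay()`, the promoted node would be missing the unreplayed writes, which is exactly the data-loss risk of asynchronous replication.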
Distribution Models - Replication
 Peer-to-Peer (P2P) Replication is a replication strategy where all
nodes in the system are treated as equals (peers), with no distinct
"master" or "slave" roles. Every node in the system can handle both
read and write operations, and each node can replicate data to other
nodes in the system. This model enables full redundancy, fault
tolerance, and high availability, as all nodes maintain copies of the
data and can accept both reads and writes.
 Peer-to-peer replication involves the synchronization of data across all
peers in the system. Each node keeps its own copy of the database and
can update other nodes with changes that have been made locally.
 As multiple nodes can accept writes, one of the challenges is ensuring
that the system remains consistent across all peers. Different peers
might receive updates at different times, leading to potential conflicts,
which need to be resolved, often through conflict resolution
strategies.
Peer to Peer Replication
Distribution Models - Replication
 The system must have a strategy to resolve these conflicts, either
automatically or through user intervention.
 Last Write Wins (LWW): The last update to the data wins, and earlier
changes are discarded.
 Version Vectors: Each update gets a unique version number, and the
system uses the version vector to determine the order of updates and resolve
conflicts.
 Since peer-to-peer systems allow for asynchronous replication, there
may be temporary periods of inconsistency across peers (due to
network delays, lag, etc.). However, the system ensures that,
eventually, all nodes will converge to the same state. This model is
known as eventual consistency.
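Last-write-wins merging can be sketched like this in Python (timestamps are plain integers for illustration): each replica stores a `(timestamp, value)` pair per key, and on synchronization the higher timestamp wins.

```python
# Sketch: LWW merge of two replicas. For each key, keep whichever
# (timestamp, value) pair has the newer timestamp; older writes are lost.
def lww_merge(a, b):
    merged = dict(a)
    for key, (ts, value) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)  # newer write replaces the older one
    return merged

replica1 = {"cart": (100, ["book"])}
replica2 = {"cart": (105, ["book", "pen"]), "theme": (50, "dark")}

merged = lww_merge(replica1, replica2)
print(merged["cart"])                  # (105, ['book', 'pen'])
```

LWW is simple but silently discards the losing write; version vectors exist precisely to detect when two updates are concurrent rather than ordered.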
Combining Sharding and Replication
Consistency
 Write-write conflicts occur when two clients try to write the same
data at the same time.
 Read-write conflicts occur when one client reads inconsistent data in
the middle of another client’s write.
 Pessimistic approaches lock data records to prevent conflicts.
Optimistic approaches detect conflicts and fix them.
 Eventual consistency means that at some point the system will
become consistent once all the writes have propagated to all the nodes.
 Increasing consistency often leads to higher latency.
 Consistency: Ensures all nodes in a distributed system reflect the same data
at the same time. A read operation will always return the most recent write.
 Latency: The time taken to complete an operation (e.g., read or write).
 CAP Theorem
 The CAP Theorem, also known as Brewer’s Theorem, is a
fundamental principle in distributed systems that states it is
impossible for a distributed database system to simultaneously
guarantee all three of the following properties in the presence of a
network partition.
 Properties
Consistency
 The CAP theorem asserts that in a distributed system with network
partitions (P), you can choose at most two of the three properties
(C, A, P) but never all three simultaneously:
 CP (Consistency + Partition Tolerance):
 System ensures consistency even during network partitions.
 Availability is sacrificed—some requests may be rejected until the partition resolves.
 Example: Traditional relational databases using distributed transactions (e.g., some
SQL databases in strict mode).
 AP (Availability + Partition Tolerance):
 System remains available even during network partitions.
 Consistency is sacrificed—data may be stale or inconsistent temporarily.
 Example: DNS systems, Cassandra, DynamoDB (eventual consistency models).
 CA (Consistency + Availability):
 Provides consistency and availability as long as the network is reliable (no
partitions).
 If a network partition occurs, the system fails or becomes unavailable.
 Example: Typically not achievable in real-world distributed systems, since network
partitions cannot be ruled out.
Consistency
 CA Systems (Rare in practice): Only feasible when the network is
guaranteed to be stable.
 CP Systems (Banking/Finance): Prioritize consistency over
availability (e.g., database locks to ensure accurate transactions).
 AP Systems (Social Media, E-Commerce): Prioritize availability; data
may not always be up-to-date, but the system remains operational.
In practice, most distributed systems opt for AP or CP models.
Relaxing durability
 Durability ensures that once a transaction is committed, the changes
are permanently stored in the database, even in the event of a crash,
power failure, or system error.
 Relaxing durability is often a conscious trade-off made in distributed
systems or high-performance databases to achieve:
 Lower Latency: Avoiding the overhead of persisting data instantly reduces
response time.
 Higher Throughput: More transactions can be processed per second
 Improved Availability: Systems can continue to function smoothly during
network partitions
 Scalability: Less stringent durability allows easier scaling of distributed
databases.
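The latency/durability trade-off above can be illustrated with a toy write-behind buffer. This is a sketch, not real storage code: writes are acknowledged from memory and only persisted in batches, so a crash before a flush would lose the buffered writes.

```python
# Toy sketch of relaxed durability: instead of flushing every write to
# "disk" (here, a list standing in for stable storage), writes are
# buffered and flushed in batches. Acknowledged-but-unflushed writes
# are exactly what a crash would lose.

class RelaxedLog:
    def __init__(self, batch_size=3):
        self.buffer, self.disk = [], []
        self.batch_size = batch_size

    def write(self, record):
        self.buffer.append(record)          # fast: memory only
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        self.disk.extend(self.buffer)       # slow: persisted in one batch
        self.buffer.clear()

log = RelaxedLog()
log.write("tx1"); log.write("tx2")
print(log.disk)   # []  -- both writes acknowledged, neither durable yet
log.write("tx3")
print(log.disk)   # ['tx1', 'tx2', 'tx3']  -- batch flushed
```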
Quorum
 A quorum is the minimum number of nodes or replicas in a
distributed system that must agree on a read or write operation for it to
be considered successful. Quorums ensure that even in the presence of
network partitions or node failures, the system can maintain consistency
and availability.
 In a distributed database with N replicas, quorums are used for:
 Write Quorum (W): Minimum replicas that must acknowledge a write
operation before it’s considered successful.
 Read Quorum (R): Minimum replicas that must respond to a read operation.
 For strong consistency, the following rule must be satisfied:
 W+R>N
Quorum
 Example
 Imagine a distributed database with 5 replicas (N = 5):
 Strong Consistency (High W, Low R):
 W = 4, R = 2: A write needs confirmation from 4 replicas, and a read
requires responses from 2 replicas.
 Pros: Ensures strong consistency.
 Cons: Higher latency and reduced availability.
 Eventual Consistency (Low W, Low R):
 W = 2, R = 2: A write is successful once 2 replicas confirm it, and reads
require responses from only 2 replicas. Since W + R = 4 ≤ N, a read set may
miss the most recent write.
 Pros: Faster reads and writes, high availability.
 Cons: Potentially stale reads.
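The W + R > N rule from this example can be checked with a trivial helper (function and parameter names are illustrative, not part of any Cassandra API): strong consistency holds only when every read quorum is forced to overlap every write quorum in at least one replica.

```python
# Quorum arithmetic sketch: with N replicas, write quorum W and read
# quorum R guarantee that a read always sees the latest write only when
# W + R > N (the read set and write set must share at least one node).

def is_strongly_consistent(n, w, r):
    return w + r > n

N = 5
print(is_strongly_consistent(N, w=4, r=2))  # True  -- quorums must overlap
print(is_strongly_consistent(N, w=2, r=2))  # False -- stale reads possible
```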
Cassandra
 Apache Cassandra is a highly scalable, distributed NoSQL
database designed for handling large amounts of data across many
commodity servers.
 providing high availability,
 fault tolerance, and
 zero downtime.
 It is ideal for applications requiring fast writes, distributed data, and
linear scalability.
Cassandra data model
 Core concepts:
Cassandra data model
 Apache Cassandra data model components include keyspaces, tables, and
columns:
 Cassandra stores data as a set of rows organized into tables or column families
 A primary key value identifies each row
 The primary key partitions data
 You can fetch data in part or in its entirety based on the primary key
 Keyspaces. At a high level, the Cassandra NoSQL data model consists of
data containers called keyspaces. Keyspaces are similar to the schema in a
relational database. Typically, there are many tables in a keyspace.
 Tables. Tables, also called column families in earlier iterations of Cassandra,
are defined within the keyspaces. Tables store data in a set of rows and
contain a primary key and a set of columns.
 Columns. Columns define data structure within a table. There are various
types of columns, such as Boolean, double, integer, and text.
Cassandra data model
 Cassandra Keyspaces
 In Cassandra, a Keyspace has several basic attributes:
 Column families: Containers of rows collected and organized that represent
the data’s structure. There is at least one column family in each keyspace and
there may be many.
 Replication factor: The number of cluster machines that receive identical
copies of data.
 Replica placement strategy: Analogous to a load-balancing algorithm, this is
simply the strategy for placing replicas on the ring cluster. There are
rack-aware and datacenter-aware strategies.
 Cassandra Primary Keys
 Partition key: The primary key is the required first column or set of columns.
The hashed partition key value determines where in the cluster the partition
will reside.
 Clustering key: Also called clustering columns, clustering keys are optional
columns after the partition key. The clustering key determines the order in
which rows are sorted within a partition by default.
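How the hashed partition key places a partition can be sketched roughly as follows. This is only an illustration: Cassandra actually uses a Murmur3 partitioner and token ranges on a ring, while here md5 and a modulo stand in to keep the example self-contained.

```python
# Rough sketch of partition-key routing: the key is hashed and the hash
# picks a node. (Cassandra uses Murmur3 tokens and range ownership; md5
# plus modulo is only a stand-in for illustration.)
import hashlib

NODES = ["node-1", "node-2", "node-3"]

def node_for(partition_key: str) -> str:
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return NODES[token % len(NODES)]  # real rings use token ranges, not modulo

# Rows with the same partition key always land on the same node,
# so a whole partition can be served by one replica set:
print(node_for("EMEA") == node_for("EMEA"))  # True
```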
Cassandra examples
Example
CREATE TABLE sales (
region TEXT,
product_name TEXT,
sales_amount DOUBLE,
sales_date TIMESTAMP,
PRIMARY KEY (region, sales_date)
);
-- region is the partition key; sales_date is the clustering key
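The effect of this key layout can be mimicked with a toy in-memory structure (purely illustrative, not how Cassandra stores data): rows are grouped by the partition key and kept sorted by the clustering key, which is what makes ordered range scans within a partition cheap.

```python
# Toy in-memory model of the sales table above: rows are grouped by the
# partition key (region) and kept sorted by the clustering key
# (sales_date), mirroring how a partition serves ordered range scans.
from bisect import insort

table = {}  # region -> sorted list of (sales_date, product_name, amount)

def insert(region, sales_date, product, amount):
    insort(table.setdefault(region, []), (sales_date, product, amount))

insert("EMEA", "2024-03-01", "widget", 10.0)
insert("EMEA", "2024-01-15", "gadget", 25.0)
insert("APAC", "2024-02-10", "widget", 7.5)

# All EMEA rows come back ordered by sales_date, oldest first:
print([d for d, _, _ in table["EMEA"]])  # ['2024-01-15', '2024-03-01']
```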
Cassandra clients
 Cassandra clients are tools, drivers, and interfaces used to interact with
Apache Cassandra databases. They provide developers and
administrators with the ability to perform database operations, including
creating keyspaces, inserting data, and executing queries.
 CQLSH (Cassandra Query Language Shell)
 default command-line tool bundled with Cassandra. It allows users to
interact directly with the database using CQL (Cassandra Query
Language).
 Features of CQLSH:
 Execute CQL commands (create keyspace, tables, insert, update, delete, select)
 Supports interactive mode for manual queries and scripting mode for automation
 Provides commands for schema management, querying, and data manipulation
 Cassandra provides official drivers for various programming languages
to allow seamless interaction with databases, e.g. the DataStax Java Driver,
CassandraCSharpDriver, and cassandra-driver (from DataStax).
Cassandra clients
 Example for Java Driver
 Third-Party Clients and GUIs
 Several third-party tools provide a graphical interface for working with
Cassandra, simplifying database management:
 Example
 DataStax Studio- A web-based tool for visualizing and querying Cassandra data.
Offers graph and CQL support.
 DBeaver, DbVisualizer, Cassandra Admin GUI