Unit II
Edges:
FRIEND_OF (relates two people who are friends)
FOLLOWS (relates one person following another)
LIKES (relates a person to a post they liked)
TAGGED_WITH (relates a post to a tag)
Graph Databases
Example (cont.):
Alice is friends with Bob, likes Post 1, and Post 1 is tagged with "Technology".
Alice FRIEND_OF Bob
Alice LIKES Post 1
Post 1 TAGGED_WITH Technology
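The example above can be modeled as a list of (source, edge type, target) triples. The following is a minimal Python sketch, not a real graph database API; the names `EDGES`, `neighbors`, and `tags_liked_by` are hypothetical and chosen only for illustration.

```python
# Each edge is a (source, edge_type, target) triple taken from the example.
EDGES = [
    ("Alice", "FRIEND_OF", "Bob"),
    ("Alice", "LIKES", "Post 1"),
    ("Post 1", "TAGGED_WITH", "Technology"),
]


def neighbors(node, edge_type):
    """Return the targets reachable from `node` over edges of `edge_type`."""
    return [dst for src, kind, dst in EDGES if src == node and kind == edge_type]


def tags_liked_by(person):
    """Two-hop traversal: person -LIKES-> post -TAGGED_WITH-> tag."""
    tags = []
    for post in neighbors(person, "LIKES"):
        tags.extend(neighbors(post, "TAGGED_WITH"))
    return tags


print(neighbors("Alice", "FRIEND_OF"))  # ['Bob']
print(tags_liked_by("Alice"))           # ['Technology']
```

A graph database answers this kind of multi-hop question by following edges directly instead of joining tables.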
CA (Consistency + Availability):
Provides consistency and availability as long as the network is reliable (no
partitions).
If a network partition occurs, the system fails or becomes unavailable.
Note: Typically not achievable in real-world distributed systems, since network
partitions cannot be ruled out.
CA Systems (Rare in practice): Only feasible when the network is
guaranteed to be stable.
CP Systems (Banking/Finance): Prioritize consistency over
availability (e.g., database locks to ensure accurate transactions).
AP Systems (Social Media, E-Commerce): Prioritize availability; data
may not always be up-to-date, but the system remains operational.
In practice, most distributed systems opt for AP or CP models.
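The difference between CP and AP behavior during a partition can be sketched in Python. This is a simplified illustration, assuming a CP node only accepts writes while it can reach a majority of replicas; the function name `handle_write` and the mode labels are hypothetical.

```python
def handle_write(mode, reachable_replicas, total_replicas):
    """Decide how a node reacts to a write while some replicas are unreachable.

    mode: "CP" or "AP" (labels used only in this sketch).
    reachable_replicas: replicas this node can contact, itself included.
    """
    has_majority = reachable_replicas > total_replicas // 2
    if mode == "CP":
        # Consistency first: refuse the write rather than risk divergent data.
        return "accepted" if has_majority else "rejected"
    if mode == "AP":
        # Availability first: accept now, reconcile once the partition heals.
        return "accepted"
    raise ValueError(f"unknown mode: {mode}")


# A node cut off with only 2 of 5 replicas reachable:
print(handle_write("CP", 2, 5))  # rejected
print(handle_write("AP", 2, 5))  # accepted
```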
Relaxing durability
Durability ensures that once a transaction is committed, the changes
are permanently stored in the database, even in the event of a crash,
power failure, or system error.
Relaxing durability is often a conscious trade-off made in distributed
systems or high-performance databases to achieve:
Lower Latency: Avoiding the overhead of persisting data instantly reduces
response time.
Higher Throughput: More transactions can be processed per second.
Improved Availability: Systems can continue to function smoothly during
network partitions.
Scalability: Less stringent durability allows easier scaling of distributed
databases.
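The trade-off can be illustrated with a toy write-behind store that acknowledges writes before persisting them. This is a sketch under simplifying assumptions (a dictionary stands in for disk); the class `WriteBehindStore` and its methods are hypothetical.

```python
class WriteBehindStore:
    """Toy key-value store that acknowledges writes before persisting them."""

    def __init__(self):
        self.disk = {}    # stands in for durable storage
        self.buffer = {}  # in-memory writes not yet persisted

    def put(self, key, value):
        # Acknowledge immediately; nothing is written to "disk" yet.
        self.buffer[key] = value
        return "ack"

    def get(self, key):
        # Buffered writes are visible to readers before they are durable.
        if key in self.buffer:
            return self.buffer[key]
        return self.disk.get(key)

    def flush(self):
        # Persist all buffered writes in one batch.
        self.disk.update(self.buffer)
        self.buffer.clear()

    def crash(self):
        # A crash loses whatever was acknowledged but not yet flushed.
        self.buffer.clear()


store = WriteBehindStore()
store.put("a", 1)
store.flush()          # "a" is now durable
store.put("b", 2)      # acknowledged, but only in memory
store.crash()
print(store.get("a"))  # 1
print(store.get("b"))  # None
```

The write to "b" was fast because it skipped persistence, and that is exactly why it was lost in the crash.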
Quorum
A quorum is the minimum number of nodes or replicas in a
distributed system that must agree on a read or write operation for it to
be considered successful. Quorums let the system stay consistent and keep
serving requests despite node failures or network partitions, as long as
enough replicas remain reachable to form a quorum.
In a distributed database with N replicas, quorums are used for:
Write Quorum (W): Minimum replicas that must acknowledge a write
operation before it’s considered successful.
Read Quorum (R): Minimum replicas that must respond to a read operation.
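For strong consistency, the following rule must be satisfied:
W + R > N
This guarantees that every read quorum shares at least one replica with the
most recent write quorum.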
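The rule can be checked by brute force: enumerate every possible write set and read set and test whether they always share a replica. The sketch below is illustrative only; the function name `quorums_always_overlap` is hypothetical.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """Brute-force check that every write set of size w shares at least one
    replica with every read set of size r, for replicas numbered 0..n-1."""
    replicas = range(n)
    for write_set in combinations(replicas, w):
        for read_set in combinations(replicas, r):
            if not set(write_set) & set(read_set):
                return False
    return True


# The brute-force answer matches the rule W + R > N for every setting with N = 5.
N = 5
for w in range(1, N + 1):
    for r in range(1, N + 1):
        assert quorums_always_overlap(N, w, r) == (w + r > N)
```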
Quorum example
Imagine a distributed database with 5 replicas (N = 5):
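Strong Consistency, write-heavy quorum (High W, Low R):
W = 4, R = 2: A write needs confirmation from 4 replicas, and a read
requires responses from 2 replicas. W + R = 6 > 5.
Pros: Ensures strong consistency with fast reads.
Cons: Higher write latency; writes fail if more than 1 replica is unavailable.
Strong Consistency, read-heavy quorum (Low W, High R):
W = 2, R = 4: A write is successful if 2 replicas confirm it, and reads require
responses from 4 replicas. W + R = 6 > 5, so this setting is also strongly consistent.
Pros: Faster writes, high write availability.
Cons: Slower reads; reads fail if more than 1 replica is unavailable.
Eventual Consistency (W + R <= N):
W = 2, R = 2: W + R = 4, so a read may contact none of the replicas that hold
the latest write.
Pros: Fast reads and writes, high availability.
Cons: Potentially stale reads.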
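The trade-offs in this example can be quantified with a small helper. It reports whether W + R > N holds and how many unavailable replicas each operation can tolerate. This is a sketch; `quorum_profile` is a hypothetical name.

```python
def quorum_profile(n, w, r):
    """Summarize a quorum setting for n replicas."""
    return {
        "strong": w + r > n,                # read and write sets must overlap
        "write_failures_tolerated": n - w,  # replicas that may be down for writes
        "read_failures_tolerated": n - r,   # replicas that may be down for reads
    }


for w, r in [(4, 2), (2, 4)]:
    print(f"W={w}, R={r}:", quorum_profile(5, w, r))
```

With N = 5, the setting W = 4, R = 2 tolerates one failed replica for writes and three for reads; W = 2, R = 4 reverses those numbers.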
Cassandra
Apache Cassandra is a highly scalable, distributed NoSQL
database designed for handling large amounts of data across many
commodity servers, providing:
high availability,
fault tolerance, and
zero downtime.
It is ideal for applications requiring fast writes, distributed data, and
linear scalability.
Cassandra data model
Core concepts:
Apache Cassandra data model components include keyspaces, tables, and
columns:
Cassandra stores data as a set of rows organized into tables or column families.
A primary key value identifies each row.
The partition key, the first part of the primary key, determines how rows are
distributed across the cluster.
You can fetch data in part or in its entirety based on the primary key.
Keyspaces. At a high level, the Cassandra NoSQL data model consists of
data containers called keyspaces. Keyspaces are similar to the schema in a
relational database. Typically, there are many tables in a keyspace.
Tables. Tables, also called column families in earlier iterations of Cassandra,
are defined within the keyspaces. Tables store data in a set of rows and
contain a primary key and a set of columns.
Columns. Columns define data structure within a table. There are various
types of columns, such as Boolean, double, integer, and text.
Cassandra Keyspaces
In Cassandra, a Keyspace has several basic attributes:
Column families: Containers of rows that represent the data’s structure.
A keyspace typically contains one or more column families.
Replication factor: The number of cluster machines that receive identical
copies of data.
Replica placement strategy: The strategy that decides which nodes in the ring
hold each replica. SimpleStrategy places replicas on consecutive nodes around
the ring, while NetworkTopologyStrategy is rack-aware and datacenter-aware.
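Replication factor and replica placement can be sketched together. The code below walks a token ring clockwise and picks as many nodes as the replication factor asks for. It mimics a simple, rack-unaware placement and is not Cassandra's implementation; the ring contents, token range, and the name `replicas_for` are assumptions made for illustration.

```python
from bisect import bisect_left


def replicas_for(token, ring, replication_factor):
    """Pick replicas by walking the ring clockwise from the token's position.

    ring: list of (node_token, node_name) pairs sorted by node_token.
    """
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping around the ring.
    start = bisect_left(tokens, token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node ring with tokens in the range 0..99.
RING = [(0, "node-a"), (25, "node-b"), (50, "node-c"), (75, "node-d")]

print(replicas_for(30, RING, 3))  # ['node-c', 'node-d', 'node-a']
print(replicas_for(90, RING, 2))  # ['node-a', 'node-b']
```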
Cassandra Primary Keys
Partition key: The partition key is the required first column or set of columns
of the primary key. The hashed partition key value determines where in the
cluster the partition will reside.
Clustering key: Also called clustering columns, clustering keys are optional
columns after the partition key. The clustering key determines the default
order in which rows are sorted within a partition.
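How a hashed partition key selects a node can be sketched as follows. SHA-256 and a modulo over a fixed node list are stand-ins chosen for simplicity; Cassandra's default partitioner uses Murmur3 tokens on a ring. The node names and the function `node_for` are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(partition_key):
    """Map a partition key to a node by hashing it (simplified scheme)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")  # use 8 bytes as the token
    return NODES[token % len(NODES)]


# The same partition key always lands on the same node.
assert node_for("EMEA") == node_for("EMEA")
print(node_for("EMEA"), node_for("APAC"))
```

Because placement depends only on the hash, all rows that share a partition key are stored together, which is what makes single-partition reads efficient.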
Cassandra examples
Example
CREATE TABLE sales (
    region TEXT,           -- partition key
    product_name TEXT,
    sales_amount DOUBLE,
    sales_date TIMESTAMP,  -- clustering key
    PRIMARY KEY (region, sales_date)
);
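The behavior of this table can be approximated in memory to show what the primary key implies. The sketch below groups rows by region (the partition key) and returns them ordered by sales_date (the clustering key). It is a Python model, not driver code; the sample rows and function names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# partition key (region) -> {clustering key (sales_date): remaining columns}
partitions = defaultdict(dict)


def insert_sale(region, sales_date, product_name, sales_amount):
    """Store a row; the same (region, sales_date) overwrites the earlier row."""
    partitions[region][sales_date] = (product_name, sales_amount)


def sales_in_region(region, start, end):
    """Rows for one region with start <= sales_date < end, in date order."""
    rows = partitions.get(region, {})
    return [(d, *rows[d]) for d in sorted(rows) if start <= d < end]


# Made-up sample rows for illustration only.
insert_sale("EMEA", datetime(2024, 3, 5), "Laptop", 1200.0)
insert_sale("EMEA", datetime(2024, 1, 20), "Phone", 650.0)
insert_sale("APAC", datetime(2024, 2, 11), "Tablet", 400.0)

for row in sales_in_region("EMEA", datetime(2024, 1, 1), datetime(2024, 4, 1)):
    print(row)
```

Since the primary key is (region, sales_date), two sales in the same region with an identical timestamp would occupy the same row, so a real schema might add another clustering column to keep them apart.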
https://www.scylladb.com/glossary/cassandra-data-model/