Chapter 4
1. Structured Data:
Characteristics:
2. Semi-Structured Data:
Characteristics:
Examples: JSON files, XML files, log files, NoSQL databases, etc.
3. Unstructured Data:
Characteristics:
Use Cases: Sentiment analysis, image recognition, speech recognition,
content categorization, etc.
CAP Theorem
The CAP theorem, also known as Brewer's theorem, is a fundamental principle
in distributed systems theory that states that it is impossible for a distributed
data system to simultaneously provide all three of the following guarantees:
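Consistency: Every read returns the most recent successful write, so all nodes
appear to hold the same data at the same time.
Availability: Every request received by a non-failing node gets a response,
although that response may not reflect the most recent write.
Partition Tolerance: The system keeps operating even when network failures
drop or delay messages between nodes.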
According to the CAP theorem, a distributed system can prioritize any two out
of these three properties, but it cannot simultaneously achieve all three. This
means that in the presence of a network partition (P), a distributed system must
choose between consistency (C) and availability (A).
case, you would prioritize consistency over availability. Even if some data
centers become temporarily unreachable due to network partitions, you want to
ensure that all users see the same account balances to avoid discrepancies or
potential financial losses.
On the other hand, consider a social media platform where users are posting
updates and sharing content in real-time. In this scenario, you might prioritize
availability over consistency. Even if network partitions occur, you want to
ensure that users can still access and interact with the platform without
experiencing downtime or disruptions. Replicas can be reconciled afterwards
through conflict resolution, so the data becomes consistent eventually.
So, while consistency is important, it's often sacrificed to ensure that the
system remains available and functional even in challenging network
conditions.
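The choice between consistency and availability during a partition can be sketched with a toy two-replica store. This is only an illustration: the class names, the "CP"/"AP" mode flag, and the `partitioned` switch are assumptions made for the example, not part of any real database.

```python
class Replica:
    """One copy of a single account balance."""

    def __init__(self, value):
        self.value = value


class TwoReplicaStore:
    """Toy store with two replicas and a switchable network partition."""

    def __init__(self, mode, initial=100):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(initial), Replica(initial)]
        self.partitioned = False

    def write(self, replica_index, value):
        """Write via one replica; returns True if the write was accepted."""
        if not self.partitioned:
            # Healthy network: update every replica.
            for replica in self.replicas:
                replica.value = value
            return True
        if self.mode == "CP":
            # Consistency first: refuse the write rather than let replicas diverge.
            return False
        # Availability first: accept locally; the replicas now disagree.
        self.replicas[replica_index].value = value
        return True

    def read(self, replica_index):
        return self.replicas[replica_index].value


bank = TwoReplicaStore("CP")
bank.partitioned = True
print(bank.write(0, 50), bank.read(0), bank.read(1))        # False 100 100

social = TwoReplicaStore("AP")
social.partitioned = True
print(social.write(0, 50), social.read(0), social.read(1))  # True 50 100
```

The CP store stays consistent by becoming unavailable for writes, while the AP store stays available but leaves the two replicas to be reconciled later.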
Eventual Consistency:
Eventual consistency is a consistency model employed in distributed systems
where data may not be immediately consistent across all nodes but will
eventually converge to a consistent state given enough time and assuming no
further updates. In other words, after a certain period of time with no new
updates and with the resolution of any network partitions, all replicas of the
data will eventually agree on the same value.
Eventual consistency allows for high availability and fault tolerance, as it
permits each replica to operate independently without needing constant
coordination with other replicas. However, it also means that there may be
temporary inconsistencies between replicas until convergence is achieved.
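A minimal sketch of this convergence, assuming each write carries a timestamp and the newest timestamp wins. The `LwwReplica` class and its `sync_from` method are hypothetical names for this example; real systems use more elaborate repair mechanisms.

```python
class LwwReplica:
    """Replica holding key -> (timestamp, value); the newest timestamp wins."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Equal timestamps keep the existing value; tie-breaking is out of scope here.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        """Anti-entropy step: pull every entry from another replica."""
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


a, b = LwwReplica(), LwwReplica()
a.write("status", "hello", timestamp=1)    # reaches only replica a
b.write("status", "world", timestamp=2)    # reaches only replica b
print(a.read("status"), b.read("status"))  # hello world (temporarily inconsistent)

a.sync_from(b)
b.sync_from(a)
print(a.read("status"), b.read("status"))  # world world (converged)
```

Each replica accepts writes on its own, which is what gives the availability; agreement is only restored once the replicas exchange state and no newer writes arrive.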
Tunable Consistency:
Tunable consistency refers to the ability to adjust the level of consistency
according to the specific requirements of an application. It allows developers to
choose, typically per operation, how many replicas must acknowledge a read or
write, trading stronger consistency against availability and latency.
time.
NoSQL
NoSQL, which stands for "Not Only SQL," is a term used to describe a broad
category of database management systems that diverge from the traditional
relational database management systems (RDBMS). NoSQL databases are
designed to handle large volumes of unstructured or semi-structured data,
which may not fit well into the rigid tabular format of relational databases. Here
are the main properties of NoSQL databases:
This allows them to handle large volumes of data and high traffic loads
more easily than traditional relational databases, which often scale
vertically by adding more powerful hardware.
5. Optimized for Specific Use Cases: NoSQL databases are often optimized
for specific use cases or data models, such as document-oriented, key-
value, column-family, or graph databases. This allows them to provide
better performance and scalability for certain types of applications and
workloads.
RDBMS vs. NoSQL
1. Data Model:
family (e.g., Cassandra), and graph (e.g., Neo4j). They offer more
flexibility in data modeling, allowing for schema-less or flexible
schemas.
2. Scalability:
3. ACID Compliance:
4. Query Language:
NoSQL: NoSQL databases may have their own query languages or APIs
tailored to their specific data models. For example, document-oriented
NoSQL databases often use JSON-based query languages, while key-
value stores may offer simple get/put operations.
5. Use Cases:
NoSQL: NoSQL databases are well-suited for handling unstructured or
semi-structured data, high-volume data ingestion, real-time analytics,
content management systems, and applications requiring flexible
schemas or horizontal scalability.
6. Examples:
NoSQL Taxonomy
NoSQL Taxonomy refers to the categorization or classification of NoSQL (Not
Only SQL) databases based on their data models, architecture, and
characteristics.
1. Document Store:
Document-oriented databases store data in a semi-structured format,
typically using JSON (JavaScript Object Notation) or BSON (Binary JSON)
documents. Each document can have its own unique structure, and data
within documents can be queried using keys or attributes. Document stores
are flexible and schema-less, allowing for easy modification and adaptation
to evolving data schemas. This flexibility makes them suitable for a wide
range of use cases, including content management systems, e-commerce
applications, and real-time analytics.
Examples:
2. Graph Database:
social networks, recommendation engines, fraud detection, and network
analysis.
Examples:
3. Key-Value Store:
4. Columnar Database:
Columnar databases are optimized for analytical queries that involve
aggregating data across large datasets. They organize data by column
rather than by row, which offers several advantages for certain use cases:
analytical queries against large datasets. Examples include business
intelligence applications, data analytics platforms, and reporting tools.
Examples:
HBase
HBase is a distributed, scalable, and column-oriented NoSQL database built on
top of the Hadoop Distributed File System (HDFS). Its architecture is designed
for handling large volumes of data in a distributed environment. The main
components are:
1. HMaster:
2. HRegionServer:
3. HRegions:
HRegion is a contiguous portion of a table's data, and each table is
divided into multiple regions. These regions are distributed across the
cluster.
HRegions contain the actual data stored in HBase tables. The data is
persisted in HDFS as files called HFiles, which are immutable once
written.
When a region grows too large, it is split into two smaller regions to
improve performance and scalability. This process is called region
splitting.
4. ZooKeeper:
HBase stores its data in HDFS in the form of HFiles, which are stored in
blocks across the HDFS cluster.
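The idea of a region as a contiguous range of row keys, and of splitting a region that has grown too large, can be sketched as a toy model. The `Region` and `Table` classes, the key-count threshold, and the choice of the middle key as split point are all assumptions for illustration; HBase's real split policies are size-based and configurable.

```python
import bisect


class Region:
    """Holds the sorted row keys in the half-open range [start, end)."""

    def __init__(self, start, end, keys=None):
        self.start, self.end = start, end   # end=None means "no upper bound"
        self.keys = sorted(keys or [])

    def split(self):
        """Split at the middle key into two daughter regions."""
        mid = self.keys[len(self.keys) // 2]
        left = Region(self.start, mid, [k for k in self.keys if k < mid])
        right = Region(mid, self.end, [k for k in self.keys if k >= mid])
        return left, right


class Table:
    def __init__(self, max_keys_per_region=4):
        self.max_keys = max_keys_per_region
        self.regions = [Region("", None)]   # one region covers every key at first

    def _find(self, row_key):
        # Regions stay sorted by start key; pick the last start <= row_key.
        starts = [region.start for region in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key):
        index = self._find(row_key)
        region = self.regions[index]
        if row_key not in region.keys:
            bisect.insort(region.keys, row_key)
        if len(region.keys) > self.max_keys:
            # Replace the oversized region with its two daughters.
            self.regions[index:index + 1] = region.split()


table = Table(max_keys_per_region=4)
for n in range(10):
    table.put(f"row{n:02d}")
for region in table.regions:
    print(repr(region.start), repr(region.end), region.keys)
```

Because every region covers a contiguous key range, finding the region for a row key is a binary search over the region start keys.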
ZooKeeper in HBase serves multiple purposes:
Now, here are some basic HBase commands to store and select data in the
HBase database:
1. Creating a Table:
2. Inserting Data:
put 'my_table', 'row1', 'cf1:col2', 'value2'
3. Retrieving Data:
4. Scanning Data:
scan 'my_table'
These commands allow you to create a table, insert data into it, retrieve
specific rows, and scan the entire table for data. You can further refine the
scans and queries using filters and qualifiers based on your requirements.
Here are some basic commands to perform CRUD operations (Create, Read,
Update, Delete) in HBase:
1. Create Table:
create 'tableName', 'columnFamily1', 'columnFamily2', ...
Example:
create 'employee', 'personal', 'professional'
2. Insert/Store Record:
put 'tableName', 'rowKey', 'columnFamily:columnQualifier', 'value'
Example:
put 'employee', '1001', 'personal:name', 'John'
put 'employee', '1001', 'personal:age', '30'
put 'employee', '1001', 'professional:title', 'Software Engineer'
3. Select/Read Record:
get 'tableName', 'rowKey'
Example:
get 'employee', '1001'
4. Modify Record:
put 'tableName', 'rowKey', 'columnFamily:columnQualifier', 'newValue'
Example:
put 'employee', '1001', 'professional:title', 'Senior Software Engineer'
5. Delete Record:
delete 'tableName', 'rowKey', 'columnFamily:columnQualifier'
Example:
delete 'employee', '1001', 'personal:age'
These are some of the basic commands in HBase for CRUD operations. There
are more advanced operations and configurations available depending on the
specific requirements and use cases.
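To make the semantics of these commands concrete, here is a toy in-memory model of a table keyed by row key and 'columnFamily:columnQualifier'. It is not the HBase client API; `ToyTable` and its methods are hypothetical names that only mirror the shell commands above, and versions/timestamps are left out.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table: row key -> 'family:qualifier' -> value."""

    def __init__(self, *column_families):
        self.families = set(column_families)
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column] = value   # insert and update are the same operation

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))

    def delete(self, row_key, column):
        self.rows.get(row_key, {}).pop(column, None)

    def scan(self):
        # Rows come back ordered by row key.
        return [(key, dict(self.rows[key])) for key in sorted(self.rows)]


employee = ToyTable("personal", "professional")
employee.put("1001", "personal:name", "John")
employee.put("1001", "personal:age", "30")
employee.put("1001", "professional:title", "Software Engineer")
employee.put("1001", "professional:title", "Senior Software Engineer")  # modify
employee.delete("1001", "personal:age")
print(employee.get("1001"))
```

The model shows why there is no separate update command: writing to an existing cell with put simply replaces the visible value.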
Cassandra
Cassandra is a distributed, decentralized, highly available, and fault-tolerant
NoSQL database system. Its architecture is designed to handle massive
amounts of data across multiple nodes while providing high performance and
scalability. Here's an overview of the architecture of Cassandra:
1. Node:
2. Data Distribution:
3. Replication:
Replication strategy and replication factor determine how many copies
of each piece of data are stored and on which nodes.
4. Gossip Protocol:
Gossip ensures that each node has an up-to-date view of the cluster
topology and can adjust its behavior accordingly.
5. Data Model:
Cassandra has a flexible schema that allows for dynamic addition and
modification of columns.
Each table can have multiple columns, and each column can have
multiple values (similar to a wide-row/column-family model).
Data is replicated across multiple nodes, so if one node fails, data can
be retrieved from replicas on other nodes.
Cassandra uses strategies like hinted handoff, read repair, and anti-
entropy repair to maintain consistency and recover from failures.
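How a replication factor decides which nodes hold a piece of data can be sketched with a simplified token ring. This is a sketch under stated assumptions: one token per node, MD5 as a stand-in hash, and replicas chosen as the next nodes clockwise on the ring. The `Ring` class and `token` function are hypothetical, and real clusters use a different partitioner and many virtual nodes per machine.

```python
import bisect
import hashlib


def token(value):
    """Map a string onto the ring as an integer (MD5 is only a stand-in hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, node_names, replication_factor):
        if replication_factor > len(node_names):
            raise ValueError("replication factor cannot exceed node count")
        self.rf = replication_factor
        # Each node owns one token; sort nodes by their position on the ring.
        self.ring = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.ring]

    def replicas_for(self, partition_key):
        """Return the rf nodes found walking clockwise from the key's token."""
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
print(ring.replicas_for("employee:1001"))
```

With a replication factor of 3, every key lives on three distinct nodes, so losing any single node still leaves two replicas to serve reads.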
Eventual Consistency:
Eventual consistency is a consistency model where, after a certain period
of time with no updates, all replicas of the data will converge to the same
value. In Cassandra, eventual consistency is achieved through asynchronous
replication, with conflicting writes typically resolved by a timestamp-based
"last write wins" rule.
When a write operation occurs, it is asynchronously propagated to multiple
replicas across different nodes in the cluster. However, due to network
latency and other factors, these replicas may not immediately receive the
update. As a result, during this period of inconsistency, clients may read
different versions of the data from different replicas. Eventually, all replicas
will converge to the same value, ensuring eventual consistency.
Tunable Consistency:
Tunable consistency in Cassandra refers to the ability to adjust the level of
consistency for read and write operations based on specific requirements
of the application. Cassandra provides tunable consistency through its
consistency level settings, which allow developers to specify how many
replicas must respond to a read or write operation for it to be considered
successful. Consistency levels range from "ALL" (requiring all replicas to
respond) to "ONE" (requiring only one replica to respond), with intermediate
levels such as "QUORUM" (requiring a majority of replicas). By adjusting the
consistency level, developers can trade stronger consistency against
availability and latency according to the needs of their application. This flexibility allows
developers to achieve the desired level of consistency while optimizing
performance and fault tolerance.
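The effect of combining read and write consistency levels can be worked out with the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas read (R) plus the number written (W) exceeds the replication factor (RF). The helper names below are made up for this sketch, and only three of the available levels are modeled.

```python
def quorum(replication_factor):
    """Size of a majority of replicas, as used by the QUORUM level."""
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Replica responses needed for a consistency level to succeed."""
    table = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return table[level]


def reads_see_latest_write(write_level, read_level, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ALL", "ONE")]:
    print(write_level, read_level, reads_see_latest_write(write_level, read_level, 3))
```

With a replication factor of 3, ONE/ONE gives the lowest latency but only eventual consistency, QUORUM/QUORUM tolerates one unavailable replica while keeping reads up to date, and ALL/ONE makes reads cheap at the cost of writes failing whenever any replica is down.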
MongoDB
MongoDB is a popular NoSQL database that stores data in a flexible, JSON-
like format called BSON (Binary JSON). It follows a distributed architecture
that provides high scalability, availability, and performance. Here's an
overview of the architecture of MongoDB:
1. Shards: Shards are individual MongoDB instances that store a portion
of the data in the database. Each shard contains a subset of the total
data set. When data is distributed across multiple shards, it allows for
horizontal scalability, meaning you can increase the capacity of your
database by adding more shards.
4. Client: The client could be any application or process that interacts with
the MongoDB database. Clients send read and write operations to the
mongos instances, which then route those operations to the appropriate
shards.
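The routing role described above can be sketched with a toy router that picks a shard from a hash of the shard key. This is an assumption-laden illustration: the `Router` class is hypothetical, and the modulo-of-hash placement is a simplification, since a real cluster assigns ranges of shard key values to shards rather than taking a modulo.

```python
import hashlib


class Router:
    """Toy stand-in for a query router: picks one shard per document."""

    def __init__(self, shard_names, shard_key):
        self.shard_key = shard_key
        self.names = list(shard_names)
        self.shards = {name: [] for name in self.names}

    def _shard_for(self, key_value):
        digest = hashlib.sha256(str(key_value).encode("utf-8")).hexdigest()
        return self.names[int(digest, 16) % len(self.names)]

    def insert(self, document):
        self.shards[self._shard_for(document[self.shard_key])].append(document)

    def find(self, key_value):
        # A query that includes the shard key only has to visit one shard.
        shard = self.shards[self._shard_for(key_value)]
        return [doc for doc in shard if doc[self.shard_key] == key_value]


router = Router(["shard-a", "shard-b", "shard-c"], shard_key="_id")
for i in range(1, 7):
    router.insert({"_id": i, "name": f"employee-{i}"})
print({name: len(docs) for name, docs in router.shards.items()})
print(router.find(3))
```

The client never needs to know where a document lives; it talks to the router, and adding capacity means adding shards behind it.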
💡 How can you model an RDBMS table in MongoDB? Give an example.
(4+6)
1. id (Primary Key)
2. name
3. age
4. department
You can model this table in MongoDB as a collection named employees with
documents representing individual employees. Each document would contain
fields corresponding to the columns of the RDBMS table.
Here's how you might structure the data in MongoDB:
[
{
"_id": 1,
"name": "John Doe",
"age": 30,
"department": "Engineering"
},
{
"_id": 2,
"name": "Jane Smith",
"age": 35,
"department": "Marketing"
},
{
"_id": 3,
"name": "Bob Johnson",
"age": 40,
"department": "Sales"
}
]
The _id field serves as the unique identifier for each document (similar to
the primary key in RDBMS).
The name, age, and department fields correspond to the columns in the
RDBMS table.
This one-document-per-row structure allows you to efficiently query and manipulate the
data in MongoDB. However, it's important to note that the specific structure of
your MongoDB documents will depend on your application's requirements and
use cases. You may need to further denormalize or nest data within documents
based on how you plan to query and update the data.
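The row-to-document mapping above can be expressed as a small conversion step. This is a minimal sketch using plain Python dictionaries and no database driver; the `row_to_document` helper is a hypothetical name, and the rows are the same three employees used in the example.

```python
import json

columns = ("id", "name", "age", "department")
rows = [
    (1, "John Doe", 30, "Engineering"),
    (2, "Jane Smith", 35, "Marketing"),
    (3, "Bob Johnson", 40, "Sales"),
]


def row_to_document(columns, row, primary_key="id"):
    """Turn one table row into a document, renaming the primary key to _id."""
    record = dict(zip(columns, row))
    return {"_id": record.pop(primary_key), **record}


employees = [row_to_document(columns, row) for row in rows]
print(json.dumps(employees[0], indent=2))
```

Each resulting dictionary has the shape of one document in the employees collection, with the table's primary key carried over as _id.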
1. Data Distribution: When data is distributed across multiple nodes,
enforcing strict normalization rules can lead to frequent cross-node joins,
which can be resource-intensive and impact performance. In a distributed
system, joins involving data from different nodes can incur network latency
and overhead.
Not all of HBase, Cassandra, and MongoDB are column-oriented databases.
HBase and Cassandra are column-family stores, which are
a type of NoSQL database that organizes data into columns grouped together
within column families. MongoDB, on the other hand, is a document-oriented
database, which stores data in flexible, JSON-like documents.
Row-oriented and column-oriented databases are two different ways of
organizing and storing data, each with its own advantages and disadvantages.
1. Row-Oriented Database:
2. Column-Oriented Database:
2. Access Patterns:
Row-oriented databases: These are optimized for transactional
workloads and operations that involve retrieving entire records. In OLTP
(Online Transaction Processing) scenarios, where individual records are
frequently accessed and modified, row-oriented databases tend to
perform well.
3. Performance:
4. Compression:
6. Indexing:
row retrieval.
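The difference between the two layouts can be sketched with the same small data set stored both ways. This is only an in-memory illustration with made-up helper names; real engines add compression, indexing, and on-disk formats on top of the basic idea.

```python
rows = [
    {"id": 1, "name": "John Doe", "age": 30, "department": "Engineering"},
    {"id": 2, "name": "Jane Smith", "age": 35, "department": "Marketing"},
    {"id": 3, "name": "Bob Johnson", "age": 40, "department": "Sales"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


columns = to_columns(rows)

# Row layout: fetching one whole record is a single lookup.
record = rows[1]

# Column layout: an aggregate touches only the one column it needs.
average_age = sum(columns["age"]) / len(columns["age"])

print(record["name"], average_age)   # Jane Smith 35.0
```

Reading a full record from the column layout would mean visiting every column list, which is why row storage suits transactional access and column storage suits aggregation over a few columns.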