BDA.Unit-2
UNIT-II
NOSQL DATA MANAGEMENT
Introduction to NoSQL – aggregate data models – key-value and document data models –
relationships – graph databases – schemaless databases – materialized views – distribution models –
master-slave replication – consistency – Cassandra – Cassandra data model – Cassandra examples –
Cassandra clients
INTRODUCTION TO NOSQL
What is the need for NoSQL databases? Explain the types of NoSQL databases with
examples. (APR/MAY 2024)
NoSQL means Not Only SQL; it addresses the problem of handling huge volumes of data that
relational databases cannot handle. NoSQL databases are schema-free, non-relational databases,
and most of them are open source.
A NoSQL database is also a type of distributed database, which means that information is copied and
stored on various servers, which can be remote or local. This ensures availability and reliability of
data: if some of the data goes offline, the rest of the database can continue to run.
NoSQL encompasses structured data, semi-structured data, unstructured data and polymorphic data.
CAP Theorem
The CAP theorem states that a distributed data store can provide at most two of the following three
guarantees: Consistency (every read sees the most recent write), Availability (every request receives
a response), and Partition tolerance (the system keeps operating despite network partitions).
Comparison of SQL and NoSQL Databases
Sr.No.  SQL                                            NoSQL
1.      SQL databases are relational.                  NoSQL databases are non-relational.
2.      SQL databases are vertically scalable.         NoSQL databases are horizontally scalable.
3.      SQL databases use structured query language    NoSQL databases have dynamic schemas
        and have a predefined schema.                  for unstructured data.
4.      SQL databases are table-based.                 NoSQL databases are document, key-value,
                                                       graph, or wide-column stores.
5.      SQL databases are better for multi-row         NoSQL databases are better for unstructured
        transactions.                                  data like documents or JSON.
Prepared by Mrs.C.Leena AP/CSE VRSCET CCS334-Big Data Analytics Page 4
KEY-VALUE
A key-value store saves data as a group of key-value pairs, each made up of two linked data items:
the "key", which acts as an identifier for an item within the data, and the "value", which is the
data that has been identified.
The data itself is usually some primitive data type (string, integer, and array) or a more complex
object that an application needs to persist and access directly.
This replaces the rigidity of relational schemas with a more flexible data model that allows
developers to easily modify fields and object structures as their applications evolve.
Key value systems treat the data as a single opaque collection which may have different fields for
every record.
In each key value pair,
a) The key is represented by an arbitrary string
b) The value can be any kind of data like an image, file, text or document.
In general, key-value stores have no query language; they simply provide a way to store, retrieve,
and update data using GET, PUT and DELETE commands.
The simplicity of this model makes a key-value store fast, easy to use, scalable, portable and
flexible.
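The GET/PUT/DELETE interface described above can be sketched as a minimal in-memory store. This is only an illustration (the class and key names are made up); real key-value stores such as Redis or Riak add persistence, networking and replication.

```python
# A minimal in-memory key-value store sketch.
# The store treats every value as opaque: it never inspects its contents.

class KeyValueStore:
    def __init__(self):
        self._data = {}          # keys are arbitrary strings, values are any object

    def put(self, key, value):
        """Store (or overwrite) a value under the given key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Retrieve the value for a key by direct lookup -- no query language."""
        return self._data.get(key, default)

    def delete(self, key):
        """Remove a key-value pair if it exists."""
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:101", {"name": "Asha", "dept": "CSE"})
print(store.get("user:101"))   # {'name': 'Asha', 'dept': 'CSE'}
store.delete("user:101")
print(store.get("user:101"))   # None
```

The simplicity of this interface is exactly why key-value stores are fast: every operation is a single direct lookup.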
What are the advantages and disadvantages of key value store?
Advantages of key value stores:
a) The secret to its speed lies in its simplicity. The path to retrieve data is a direct request to the
object in memory or on disk.
b) Relationships between data do not have to be calculated by a query language; no query
optimization is performed.
c) They can exist on distributed systems and do not need to worry about where to store indexes.
Disadvantages of key value stores:
a) No complex query filters
b) All joins must be done in code
c) No foreign key constraints
d) No triggers.
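Disadvantage (b) above means that combining related records is the application's job. The sketch below shows such a client-side "join" over two sets of key-value records; the keys and record layout are hypothetical.

```python
# With no join support in the store, the application combines related
# records itself by following keys from one record to another.

orders = {
    "order:1": {"customer_key": "cust:7", "item": "laptop"},
    "order:2": {"customer_key": "cust:7", "item": "mouse"},
}
customers = {"cust:7": {"name": "Ravi"}}

def orders_with_customer_names():
    """Join each order to its customer in application code."""
    result = []
    for order in orders.values():
        customer = customers.get(order["customer_key"], {})  # the "join" is a key lookup
        result.append({"item": order["item"], "customer": customer.get("name")})
    return result

print(orders_with_customer_names())
# [{'item': 'laptop', 'customer': 'Ravi'}, {'item': 'mouse', 'customer': 'Ravi'}]
```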
DOCUMENT-BASED
Document databases store data as documents (typically JSON, BSON or XML). Each document is
self-describing, and documents in the same collection may have different structures.
Sharding
Sharding is the process of splitting a large dataset into many small partitions which are placed on
different machines. Each partition is known as a "shard".
Each shard has the same database schema as the original database. Most data is distributed such that
each row appears in exactly one shard. The combined data from all shards is the same as the data
from the original database. The load is balanced nicely between servers; for example, with
five servers, each one only has to handle 20% of the load.
The NoSQL framework is natively designed to support automatic distribution of the data, along with
the query load, across multiple servers. Both data and queries are automatically distributed across
multiple servers located in different geographic regions, and this facilitates rapid, automatic and
transparent rebalancing of data or query instances without any disruption.
Sharding is particularly valuable for performance because it can improve both read and write
performance. Using replication, particularly with caching, can greatly improve read performance
but does little for applications that have a lot of writes.
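The splitting described above is commonly done by hashing each key to pick a shard. The sketch below (with five in-process dictionaries standing in for five servers) is an assumption-laden illustration, not any particular database's implementation; real systems typically use consistent hashing so that adding a server does not remap every key.

```python
# Hash-based sharding sketch: each key is deterministically mapped to
# one of N shards, so reads and writes for a key always reach the same
# shard, and keys (hence load) spread roughly evenly -- with five
# shards, about 20% each.

import hashlib

NUM_SHARDS = 5
shards = [{} for _ in range(NUM_SHARDS)]   # each dict stands in for one server

def shard_for(key):
    """Pick a shard index from a hash of the key."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", "Asha")
print(get("user:42"))   # Asha  -- the same hash routes the read back
```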
Advantages of Sharding
a) Faster performance: There are more servers available to handle input/output.
b) Horizontal scaling: We can quickly add additional servers to a cluster.
c) Costs: Horizontal scaling can often be less expensive than vertical scaling.
d) Distribution/uptime: A horizontally scaled distributed database can achieve better uptime than a
traditional single server.
Disadvantages of Sharding
a) Complexity: Depending on the database system, sharding complexity can vary.
b) Rebalancing: When adding additional machines to a cluster, the shards will likely need to be
rebalanced to distribute data evenly.
c) Increased infrastructure costs.
MASTER-SLAVE REPLICATION
Explain Master-slave Replication in big data distributed systems. Nov/Dec-2023
We replicate data across multiple nodes. One node is designed as primary (master), others as
secondary (slaves). Master is responsible for processing any updates to that data. A replication
process synchronizes the slaves with the master.
Master is the authoritative source for the data. It is responsible for processing any updates to that
data. Masters can be appointed manually or automatically.
Slaves receive updates from the master through a replication process that synchronizes them with
the master. After failure of the master, a slave can be appointed as the new master very quickly.
Master-slave replication is most helpful for scaling when we have a read-intensive dataset. It will
scale horizontally to handle more reads.
This design offers read resilience. Even if one or more of the servers fails, the remaining servers can
keep offering read access. This can help a lot with read-heavy applications, but will offer little
benefit to write-intensive applications.
Figure: Master-slave replication
As the slaves are exact replicas of the master server, one of them can assume the role of the master
in case the master fails. In fact, most of the time we can simply create a set of nodes and have
them automatically decide which one will be the master. Some consistency issues occur due to the
delay in updating between master and slaves.
Masters can be appointed manually or automatically. Manual appointment is performed when we
configure our cluster and designate one node as the master. With automatic appointment, we
create a cluster of nodes and they elect one of themselves to be the master.
Problems of master-slave replication:
1. Does not help with scalability of writes
2. Provides resilience against failure of a slave, but not of a master
3. The master is still a bottleneck.
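The read-scaling and staleness behaviour described above can be sketched as a toy cluster. The class and method names here are illustrative, not any real database's API: writes go only to the master, a replication step copies the master's data to each slave, and reads are served from a slave, so a read between a write and the next replication returns stale data.

```python
# Toy master-slave replication: one master accepts writes; a replication
# step synchronizes the slaves; reads are served by slaves.

class Node:
    def __init__(self):
        self.data = {}

class MasterSlaveCluster:
    def __init__(self, num_slaves=2):
        self.master = Node()
        self.slaves = [Node() for _ in range(num_slaves)]

    def write(self, key, value):
        self.master.data[key] = value        # only the master accepts updates

    def replicate(self):
        for slave in self.slaves:            # synchronize slaves with the master
            slave.data = dict(self.master.data)

    def read(self, key):
        return self.slaves[0].data.get(key)  # reads come from a slave

cluster = MasterSlaveCluster()
cluster.write("x", 1)
cluster.replicate()
print(cluster.read("x"))   # 1
cluster.write("x", 2)
print(cluster.read("x"))   # 1 -- stale until the next replication runs
```

Note how every write still passes through the single master, which is exactly why master-slave replication scales reads but not writes.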
Peer-to-Peer Replication
Figure: Peer-to-peer replication
In peer-to-peer replication there is no master: all replicas have equal weight, and any of them can
accept writes. The loss of any node does not prevent access to the data store, but concurrent writes
to different replicas can conflict. There are various ways to resolve this problem. The most
standard approach is to have the replicas communicate their writes to each other before they
"accept" them. Once a majority of the replicas has confirmed a write, it can be considered
successfully performed and a response sent to the client. This requires a certain amount of network
traffic to coordinate these writes.
There is also the problem of a write-write conflict: two users updating different copies of the same
record, stored on different nodes, at the same time.
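One simple way to detect such a write-write conflict is to tag each copy with a version number and compare versions when the replicas synchronize. The sketch below is a deliberately simplified illustration (real systems use vector clocks or similar mechanisms, and must also decide how to resolve the conflict, e.g. last-write-wins or application-level merging).

```python
# Detecting a write-write conflict with simple version numbers:
# each replica holds (version, value). If two replicas carry the same
# version but different values, they were updated concurrently.

replica_a = {"version": 1, "value": "phone: 111"}
replica_b = {"version": 1, "value": "phone: 222"}   # concurrent update elsewhere

def merge(a, b):
    """Return the newer copy, or 'conflict' for concurrent divergent writes."""
    if a["version"] == b["version"] and a["value"] != b["value"]:
        return "conflict"        # both replicas updated the same base version
    return a if a["version"] >= b["version"] else b

print(merge(replica_a, replica_b))  # conflict
```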
CONSISTENCY
Explain Consistency in big data distributed systems. Nov/Dec-2023.
Briefly explain about consistency and its types.
The CAP theorem is important when considering a distributed database, since we must decide what
we are willing to give up: the database we choose will sacrifice either availability or
consistency. When reading about NoSQL databases we often meet the concept of a quorum. A quorum is
the minimal number of nodes that must respond to a read or write operation for it to be considered
complete. Having a maximum quorum, i.e. querying all servers, is one way to be certain of the
correct result.
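The quorum idea above reduces to simple arithmetic. A common rule (used, for example, in Dynamo-style stores) is that with N replicas, a read quorum R and a write quorum W give consistent reads whenever R + W > N, because every read set must then overlap every write set in at least one node. A small sketch of that check:

```python
# Quorum overlap rule: with n replicas, a read of r nodes and a write of
# w nodes must share at least one node whenever r + w > n, so every read
# sees the latest successful write.

def quorums_overlap(n, r, w):
    """True when any r-node read set must intersect any w-node write set."""
    return r + w > n

print(quorums_overlap(n=3, r=2, w=2))   # True  -> reads see the latest write
print(quorums_overlap(n=3, r=1, w=1))   # False -> a read may miss the last write
```

The trade-off is latency: larger quorums give stronger consistency but require more nodes to respond before an operation completes.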
Definition:
Consistency can be simply defined by how much the copies of the same data may vary within the same
replicated database system.
Nowadays systems need to scale. The "traditional" monolithic database architecture, based on a
single powerful server, does not provide the high availability and partition tolerance required by
today's web-scale systems, as the CAP theorem demonstrates. To achieve such requirements, systems
cannot impose strong consistency.
Ans. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in
means other than the tabular relations used in relational databases. NoSQL is often interpreted as
Not-only-SQL to emphasize that such databases may also support SQL-like query languages. Most NoSQL
databases are designed to store large quantities of data in a fault-tolerant way.
Examples: Amazon S3, CouchDB, HBase.
UNIT-II
Question bank