
V.R.S. College of Engineering and Technology
(Reaccredited by NAAC and an ISO 9001:2008 Recertified Institution)

SUBJECT NAME : BIG DATA ANALYTICS


SUBJECT CODE : CCS334
REGULATION : 2021
YEAR/SEMESTER : III/V
BRANCH : CSE

UNIT-II
NOSQL DATA MANAGEMENT
Introduction to NoSQL – aggregate data models – key-value and document data models –
relationships – graph databases – schemaless databases – materialized views – distribution models –
master-slave replication – consistency – Cassandra – Cassandra data model – Cassandra examples –
Cassandra clients

INTRODUCTION TO NOSQL
What is the need for NoSQL databases? Explain the types of NoSQL databases with
example. APR/MAY 2024
 NoSQL means Not Only SQL; it solves the problem of handling huge volumes of data that relational
databases cannot handle. NoSQL databases are schema-free, non-relational databases. Most
of the NoSQL databases are open source.
 NoSQL is also type of distributed database, which means that information is copied and stored on
various servers, which can be remote or local. This ensures availability and reliability of data. If
some of the data goes offline, the rest of the database can continue to run.
 NoSQL encompasses structured data, semi-structured data, unstructured data and polymorphic data.

 A NoSQL database provides a mechanism for storage and retrieval of data that employs less
constrained consistency models than traditional relational databases.
 NoSQL is a response to present-day business data factors:
1. Volume and velocity, referring to the ability to handle large datasets that arrive quickly;
2. Variability, referring to how diverse data types do not fit into structured tables;
3. Agility, referring to how fast an organization responds to business changes.
 NoSQL databases are very often referred to as data stores rather than databases.
 NoSQL systems work on multiple processors and can run on low-cost, separate computer systems,
with no need for expensive nodes to get high-speed performance. They support linear scalability: every
time we add more processors, we get a consistent increase in performance.
History of NoSQL:
 The acronym NoSQL was first used in 1998 by Carlo Strozzi while naming his lightweight, open-
source "relational" database that did not use SQL. The name came up again in 2009 when Eric
Evans and Johan Oskarsson used it to describe non-relational databases.
 Relational databases are often referred to as SQL systems. The term NoSQL can mean either "no
SQL systems" or, more commonly, "not only SQL", to emphasize the fact that
some systems might support SQL-like query languages.
 NoSQL developed, at least in the beginning, as a response to web data, the need for processing
unstructured data and the need for faster processing. The NoSQL model uses a distributed database
system, meaning a system with multiple computers.
 Not only can NoSQL systems handle both structured and unstructured data, but they can also
process unstructured big data quickly. This led to organizations such as Facebook, Twitter,
LinkedIn, and Google adopting NoSQL systems. These organizations process tremendous amounts
of unstructured data, coordinating it to find patterns and gain business insights. Big data became an
official term in 2005.
Why NoSQL?
 It can handle large volumes of structured, semi-structured and unstructured data.
 Agile sprints, quick iteration and frequent code pushes.
 Object-oriented programming that is easy to use and flexible.
 Scale-out architecture.
Types of NoSQL Stores:
1. Column Oriented (Accumulo, Cassandra, Hbase)
2. Document Oriented (MongoDB, Couchbase, Clusterpoint)

3. Key-value (Dynamo, MemcacheDB, Riak)
4. Graph (Allegro, Neo4j, OrientDB)
What are the several types of NoSQL databases?
1. Key-value store: Stores in the form of a hash table
(Example - Riak, Amazon S3 (Dynamo), Redis)
2. Document-based store: Stores objects, mostly JSON, which is web friendly, or supports ODM
(Object Document Mapping).
(Example - CouchDB, MongoDB)
3. Column-based store: Each storage block contains data from only one column
(Example-HBase, Cassandra)
4. Graph-based store: Graph representation of relationships, mostly used by social networks.
(Example - Neo4J)
 NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means
other than the tabular relations used in relational databases. NoSQL is often interpreted as Not-only-
SQL to emphasize that they may also support SQL-like query languages. Most NoSQL databases
are designed to store large quantities of data in a fault-tolerant way.
 NoSQL is simply the term that is used to describe a family of databases that are all non-relational.
While the technologies, data types and use cases vary widely among them, it is generally agreed
that there are four types of NoSQL databases:
 NoSQL databases can manage information using any of four primary data models Key-value store,
document-based, column based and graph based.
Example and Advantages
Examples of NoSQL databases
a) Apache CouchDB, an open source, JSON document-based database that uses JavaScript as its
query language.
b) Elasticsearch, a document-based database that includes a full-text search engine.
c) Couchbase, a key-value and document database that empowers developers to build responsive and
flexible applications for cloud, mobile and edge computing.
Advantages
a) NoSQL databases have a simple and flexible structure. They are schema-free.
b) NoSQL databases are based on key-value pairs.
c) Some store types of NoSQL databases include column store, document store, key value store,
graph store, object store, XML store and other data store modes.

d) Each value in the database has a key. Some NoSQL database stores also allow developers to store
serialized objects into the database, not just simple string values.
e) Open-source NoSQL databases do not require expensive licensing fees and can run on
inexpensive hardware, rendering their deployment cost-effective.
 Disadvantages:
a) Most NoSQL databases do not support reliability features that are natively supported by
relational database systems.
b) In order to support reliability and consistency features, developers must implement their own
proprietary code, which adds more complexity to the system.
CAP Theorem
What is CAP theorem? Explain.
 The theorem states that distributed data systems will offer a trade-off between consistency,
availability and partition tolerance. And, that any database can only guarantee two of the three
properties:
 Consistency: Every node in the cluster responds with the most recent data, even if the system must
block the request until all replicas update. If you query a "consistent system" for an item that is
currently updating, you will wait for that response until all replicas successfully update. However,
you will receive the most current data.
 Availability: Every node returns an immediate response, even if that response is not the most recent
data. If you query an "available system" for an item that is updating, you will get the best possible
answer the service can provide at that moment.
 Partition tolerance: Guarantees the system continues to operate even if a replicated data node fails or loses
connectivity with other replicated data nodes.

CAP Theorem
Comparison of SQL and NoSQL Databases
1. SQL databases are relational; NoSQL databases are non-relational.
2. SQL databases are vertically scalable; NoSQL databases are horizontally scalable.
3. SQL databases use structured query language and have a predefined schema; NoSQL databases have
dynamic schemas for unstructured data.
4. SQL databases are table-based; NoSQL databases are document, key-value, graph, or wide-column stores.
5. SQL databases are better for multi-row transactions; NoSQL is better for unstructured data like
documents or JSON.

AGGREGATE DATA MODELS


Explain in detail about the aggregate data model of NoSQL with a neat diagram.
 Aggregate means a collection of objects that are treated as a unit. In NoSQL Databases, an
aggregate is a collection of data that interact as a unit. Moreover, these units of data or aggregates of
data form the boundaries for the ACID operations.
 Aggregate data models in NoSQL make it easier for the databases to manage data storage over the
clusters as the aggregate data or unit can now reside on any of the machines. Whenever data is
retrieved from the database all the data comes along with the aggregate data models in NoSQL.
 Aggregate data models in NoSQL do not support ACID transactions across aggregates and sacrifice one of the ACID
properties. With the help of aggregate data models in NoSQL, we can easily perform OLAP
operations on the database. We can achieve high efficiency with the aggregate data models in a
NoSQL database if the data transactions and interactions take place within the same aggregate.
KEY-VALUE STORE
 In the key-value structure, the key is usually a simple string of characters and the value is a series of
uninterpreted bytes that are opaque to the database. A key-value store is like a relational database
with only two columns: the key or attribute name and the value.

 A key-value store saves data as a group of key-value pairs, each made up of two linked data items: the
"key", which acts as an identifier for an item within the data, and the
"value", which is the data that has been identified.
 The data itself is usually some primitive data type (string, integer, and array) or a more complex
object that an application needs to persist and access directly.
 This replaces the rigidity of relational schemas with a more flexible data model that allows
developers to easily modify fields and object structures as their applications evolve.
 Key value systems treat the data as a single opaque collection which may have different fields for
every record.
 In each key value pair,
a) The key is represented by an arbitrary string
b) The value can be any kind of data like an image, file, text or document.
 In general, key-value stores have no query language. They simply provide a way to store, retrieve
and update data using GET, PUT and DELETE commands.
 The simplicity of this model makes a key-value store fast, easy to use, scalable, portable and
flexible.
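 To make the GET/PUT/DELETE idea concrete, here is a minimal sketch (in Python) that uses a plain dictionary as a stand-in for a key-value store; the keys and values shown are purely illustrative and not tied to any particular product's API.

# A minimal key-value store sketch: keys are strings, values are opaque to the store.
store = {}

def put(key, value):
    store[key] = value            # insert or overwrite the value for this key

def get(key):
    return store.get(key)         # direct lookup by key; no query language involved

def delete(key):
    store.pop(key, None)          # remove the key if present

put("user:1001", {"name": "Joe", "cart": ["book", "pen"]})   # the value can be any object
print(get("user:1001"))
delete("user:1001")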
What are the advantages and disadvantages of key value store?
 Advantages of key value stores:
a) The secret to its speed lies in its simplicity. The path to retrieve data is a direct request to the
object in memory or on disk.
b) The relationship between data does not have to be calculated by a query language, there is no
optimization performed.
c) They can exist on distributed systems and do not need to worry about where to store indexes.
 Disadvantages of key value stores:
a) No complex query filters
b) All joins must be done in code
c) No foreign key constraints
d) No trigger.

DOCUMENT-BASED

Document based data model
 A document is an object with keys (strings) that have values of recognizable types, including
numbers, Booleans and strings, as well as nested arrays and dictionaries. All related data is stored
together, so there is no need for cross-referencing; instead of storing information in a table, it is
stored in a document.
 Document databases are designed for flexibility. They are not typically forced to have a schema and
are therefore easy to modify.
 If an application requires the ability to store varying attributes along with large amounts of data,
document databases are a good option.
 Document stores work with multiple formats including XML and JSON. This allows for storage
and retrieval of data without an impedance mismatch.
 Terminologies in document data store are as follows:
a) A table is called a collection
b) A row is called a document
c) A column is called a field.
 Typical use cases for document stores include the storage and retrieval of catalogs, blog posts and
news articles, and data analysis.
 MongoDB and Apache CouchDB are examples of popular document-based databases.
 Do not use document databases for transactions across multiple documents (records) or for ad hoc
cross-document queries.
What are the advantages and disadvantages of document-based data model?
 Advantages of document-based model:
a) Faster retrieval of data.
b) Dynamic architecture for unstructured data and storage options
c) Sharding for horizontal scalability
d) Replication is managed internally, so chances of accidental loss of data are negligible.

 Disadvantages of document data model:
a) No views, triggers, scripts or stored procedure.
b) Relationship not well defined.
c) No support for transactions, which could lead to data corruption.
COLUMN-BASED
 Column-based models, also called 'wide column' models, enable very quick data access using a row key,
column name and cell timestamp.
 The flexible schema of these types of databases means that the columns do not have to be consistent
across records and you can add a column to specific rows without having to add them to every
single record.
 It is also called a two-level map as it offers a two-level aggregate structure.
 As data is organized into columns, we have better indexing compared to other key-value stores.
Also, when it comes to updates, multiple column block updates can be aggregated.
 Column store databases gained attention when Google published the design of its column store
NoSQL database called Bigtable. Apparently, the data for the well-known Google e-mail service,
Gmail, is stored in the Google Bigtable NoSQL database.
 The wide-column store data model, like that found in Apache Cassandra, is derived from
Google's Bigtable paper.
 Organizations mostly use Column data stores for data warehousing and data processing, which is
evident in services such as Amazon Redshift.
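 To illustrate the difference between the row-oriented and column-oriented layouts described above, the following Python sketch lays the same two records out both ways; the table and column names are invented for illustration only.

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "name": "Joe", "age": 30},
    {"id": 2, "name": "Kate", "age": 25},
]

# Column-oriented layout: the values of each column are stored together,
# which makes per-column scans, compression and aggregation cheaper.
columns = {
    "id": [1, 2],
    "name": ["Joe", "Kate"],
    "age": [30, 25],
}

# An aggregation query only needs to touch the single column block it uses.
average_age = sum(columns["age"]) / len(columns["age"])
print(average_age)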
What are the advantages and disadvantages of column-based data model?
 Advantages of column data stores:
a) Column stores are very efficient at data compression and/or partitioning.
b) Columnar stores can be loaded extremely fast.
c) Columnar databases are very scalable.
d) Due to their structure, columnar databases perform particularly well with aggregation queries.
 Disadvantages of column data store:
a) Updates can be inefficient. The fact that a column family groups attributes, as opposed to rows
of tuples, works against it.
b) If multiple attributes are touched by a join or query, this may also lead to column storage
experiencing slower performance.
c) It is also slower when deleting rows from columnar systems, as the record has to be deleted from each
of the column files.

Explore how graph-based databases handle huge data and their unique capabilities in data
management and analytics. Nov/Dec-2023 (13 Marks)
 The modern graph database is a data storage and processing engine that makes the persistence and
exploration of data and relationships more efficient.
 Graph-based data models store data in nodes that are connected by edges. These Aggregate Data
Models in NoSQL are widely used for storing the huge volumes of complex aggregates and
multidimensional data having many interconnections between them.
 In graph theory, structures are composed of vertices and edges, or what would later be called "data
relationships".
 Graphs behave similarly to how people think, specifically in terms of relationships between discrete units of
data. This database type is particularly useful for visualizing, analyzing, or helping to find
connections between different pieces of data.
 As a result, businesses leverage graph technologies for recommendation engines, fraud analytics
and network analysis. Examples of graph-based NoSQL databases include Neo4j and JanusGraph.
 Graph databases can be used to analyze customer interactions, social media and scientific
applications where it is crucial to traverse long relationship graphs to better understand data.
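 As a rough illustration of the node-and-edge model (not tied to Neo4j or any specific product), the Python sketch below stores a tiny social graph as an adjacency list and traverses relationships between people; all names are invented.

# Nodes are people; edges are "follows" relationships.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}

def reachable(start, depth):
    # Traverse up to `depth` hops, e.g. for friend-of-friend recommendations.
    seen, frontier = {start}, [start]
    for _ in range(depth):
        frontier = [n for node in frontier for n in follows[node] if n not in seen]
        seen.update(frontier)
    return seen - {start}

print(reachable("alice", 2))    # people within two hops of alice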
What are the advantages and disadvantages of the graph-based data model?
 Advantages of graph data:
a) More descriptive queries
b) Greater flexibility in adapting your model
c) Greater performance when traversing data relationships.
 Disadvantages of graph data stores:
a) Difficult to scale
b) No standard language
NoSQL Document Database: MongoDB
What is MongoDB?
MongoDB is an open-source document database that provides high performance, high availability
and automatic scaling. MongoDB is one of the most popular open-source NoSQL databases, written
in C++. As of February 2015, MongoDB was the fourth most popular database management system.
It was developed by a company called 10gen, which is now known as MongoDB Inc.
Why use MongoDB?
a) Simple queries
b) Functionality provided applicable to most web applications

c) Easy and fast integration of data
d) No ERD diagrams
e) Not well suited for heavy and complex transactions systems.
 MongoDB does not provide any command to create a "database". Actually, you do not need to create
it manually because MongoDB will create it on the fly, the first time you save a value
into the defined collection (or table in SQL) and database.
 MongoDB is a document-oriented database which stores data in JSON-like documents with
dynamic schema. It means you can store your records without worrying about the data structure
such as the number of fields or types of fields to store values. MongoDB documents are similar to
JSON objects.
 MongoDB stores data records as BSON documents. BSON is a binary representation of JSON
documents, though it contains more data types than JSON.
 MongoDB stores data in documents instead of tables. You can change the structure of records
simply by adding new fields or deleting existing ones. This ability of MongoDB helps you to
represent hierarchical relationships and to store arrays and other more complex structures easily.
 MongoDB uses Mongo server and Mongo shell commands to fetch records or the information from
the database (i.e. collections). Few areas where MongoDB is ideal are big data, user data
management, mobile and social infrastructure, content management and delivery, data hub.
 A MongoDB instance may have zero or more databases. A database may have zero or more
'collections'. A collection may have zero or more 'documents'. A document may have one or more
'fields'. MongoDB 'Indexes' function much like their RDBMS counterparts.
 Database is a physical container for collections. Collection is a group of documents and is similar to
an RDBMS table. A document is a set of key-value pairs. Documents have dynamic schema.
 MongoDB documents are composed of field-and-value pairs and have the following structure:
{
field1: value1,
field2: value2,
field3: value3,
…..
fieldN: valueN
}

 The value of a field can be any of the BSON data types, including other documents, arrays and
arrays of documents. MongoDB supports many data types such as String, integer, boolean, double,
arrays, timestamp, object, null, symbol, date, code and binary data.
 MongoDB uses the MongoDB query language and supports ad hoc queries, replication and sharding.
Sharding is a feature of MongoDB that helps it to operate as a distributed data system.
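 A minimal pymongo sketch of the ideas above (collections created on the fly, dynamic schema, ad hoc queries), assuming a MongoDB server is running locally on the default port; the database and collection names are examples only.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # connect to a local MongoDB instance
db = client["shop"]                                  # the database is created lazily on first write
products = db["products"]                            # likewise for the collection

# Documents in the same collection may have different fields (dynamic schema).
products.insert_one({"name": "pen", "price": 10})
products.insert_one({"name": "book", "price": 150, "tags": ["paper", "stationery"]})

# Ad hoc query on any field, with no predefined schema required.
for doc in products.find({"price": {"$lt": 100}}):
    print(doc)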

Relation between SQL terms and MongoDB terms


 Sharding is used by MongoDB to store data across multiple machines. It uses horizontal scaling to
add more machines to distribute data and operation with respect to the growth of load and demand.
 Sharding arrangement in MongoDB has mainly three components:
 Shards or replica sets: Each shard serves as a separate replica set. They store all the data. They
target to increase the consistency and availability of the data.
 Configuration servers: They are like the managers of the clusters. These servers contain the cluster's
metadata. They actually have the mapping of the cluster's data to the shards. When a query comes,
query routers use these mappings from the config servers to target the required shard.

Sharding and MongoDB


 Query router: Query routers are mongos instances that serve as interfaces for user
applications. They take in the user queries from the applications and serve the applications with
the required results.

What are the advantages of MongoDB?
 MongoDB is a schemaless, document-type database. MongoDB supports field queries, range-based
queries and regular expressions for searching the stored data.
 MongoDB is very easy to scale up or down.
 It uses internal memory for storing working temporary datasets, which makes it much faster.
 MongoDB supports primary and secondary indexes on any field.
 MongoDB supports replication of databases.
 MongoDB can be used as a file storage system which is known as a GridFS.
SCHEMALESS DATABASES
What is Schemaless database? Explain.
 Since NoSQL does not require a schema, there is no blueprint on how data should be stored and
therefore varies between databases. Generally, there are two ways that NoSQL data storage
functions:
1. On disk using B-trees, with the top of the tree permanently in RAM.
2. In memory, where everything is in RAM using RB-trees, and anything stored on disk is just an
append.
 Schemaless databases are a type of NoSQL databases that do not have a predefined schema or
structure for data. This means that data can be inserted and retrieved without adhering to a specific
structure and the database can adapt to changes in data over time without requiring schema
migrations or changes.
 Schemaless database manages information without the need for a blueprint. The onset of building a
schemaless database does not rely on conforming to certain fields, tables, or data model structures.
 There is no Relational Database Management System (RDBMS) to enforce any specific kind of
structure. In other words, it is a non-relational database that can handle any database type, whether
that is a key-value store, in-memory, document store, column-oriented, or graph data model.
 In actuality, there is no such thing as schema-less dataset:
1. In a relational database, the schema is explicit and created separately in advance.
2. In column-based databases, we create a fresh schema for each row and in fact, we often reuse
schema fragments from rows that are grouped together. The same is true for document databases.
3. In column-based and also in document databases, users directly query data based on the schema.
4. In graph-based databases, we are in essence building the schema as we build the data.
 In schemaless databases, information is stored in JSON-style documents which can have varying
sets of fields with different data types for each field. So, a collection could look like this:

{ name: "Joe", age: 30, interests: "football" }
{ name: "Kate", age: 25 }
 In the above condition, the data itself normally has a fairly consistent structure. With the schemaless
MongoDB database, there is some additional structure: the system namespace contains an explicit
list of collections and indexes. Collections may be implicitly or explicitly created; indexes must be
explicitly declared.
 Benefits of using schemaless databases:
1. Flexibility: Schemaless databases allow for greater flexibility in data modeling.
2. Scalability: Schemaless databases are designed for scalability, as they can handle large amounts
of unstructured data with ease.
3. Reduced complexity: Schemaless databases can reduce the complexity of data modeling and
development.
4. Good support for non-uniform data.
 Disadvantages:
1. Potentially inconsistent names and data types for a single value.
2. Management of the implicit schema migrates into the application layer.
MATERIALIZED VIEWS
Give short note about materialized views.
 Materialized views solve a problem with ordinary views. Views provide a mechanism to hide from the
client whether data is derived data or base data, but they are recomputed on each access; ordinary views are
therefore used when data is accessed infrequently while the data in a table gets updated on a frequent basis.
 A materialized view is a replica of a target master from a single point in time. The master can be
either a master table at a master site or a master materialized view at a materialized view site. A
materialized view is like a cache, a copy of the data that can be accessed quickly.
 If a regular view is a saved query, a materialized view is a saved query plus its results stored as a
table.
 NoSQL databases do not have views; they may have precomputed and cached queries, and they
reuse the term "materialized view" to describe them.
 We can use materialized views to achieve one or more of the following goals:
1. Ease network loads

2. Create a mass deployment environment
3. Enable data sub setting
4. Enable disconnected computing.
 Two methods are used for building a materialized view:
1. Eager approach: The materialized view is updated at the same time as the base data.
In this case, adding an order would also update the purchase-history aggregates for each
product. This method is used when there are more frequent reads of the materialized view than writes.
2. The application database approach is valuable here, as it makes it easier to ensure that any
updates to base data also update materialized views.
 Materialized views can be built outside of the database by reading the data, computing the view and
saving it back to the database.
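 A minimal Python sketch of the eager approach described above: every write to the base data also updates a precomputed aggregate, so reading the "view" is just a lookup. The order and product structure is invented for illustration.

orders = []                       # base data
sales_by_product = {}             # materialized view: precomputed aggregate

def add_order(product, quantity):
    # Eager approach: update the base data and the materialized view together.
    orders.append({"product": product, "quantity": quantity})
    sales_by_product[product] = sales_by_product.get(product, 0) + quantity

add_order("pen", 3)
add_order("pen", 2)
add_order("book", 1)

# Reading the view is a cheap lookup instead of recomputing over all orders.
print(sales_by_product["pen"])    # 5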
DISTRIBUTION MODELS
Briefly explain about different types of distribution models
A key ability of NoSQL is to run a database on a large cluster. As data volumes increase, it becomes more
difficult and expensive to scale up, i.e., to buy a bigger server to run the database on, so scaling out across a
cluster becomes attractive.
Single Server
Database is run on a single machine which handles all the reads and writes to the data store.
Organizations prefer a single server because it eliminates all the complexities that the other options
introduce.
 A single server is easy to manage for application developers. Although a lot of NoSQL databases are designed
around the idea of running on a cluster, it can make sense to use NoSQL with a single-server
distribution model if the data model of the NoSQL store is more suited to the application.
 Single-server configuration is suitable for graph databases.
 If data usage is mostly about processing aggregates, then a single-server document or key-value
store may be useful.
Sharding
What is sharding?
 Sharding is a method for distributing a single dataset across multiple databases, which can then be
stored on multiple machines. This allows larger datasets to be split into smaller chunks and
stored in multiple data nodes, increasing the total storage capacity of the system.
 Sharding is a form of scaling known as horizontal scaling or scale-out, as additional nodes are
brought on to share the load, horizontal scaling allows for near-limitless scalability to handle big
data and intense workloads.

 Sharding is also known as data partitioning. Many NoSQL databases offer auto-sharding.

Sharding
 Sharding is the process of splitting a large dataset into many small partitions which are placed on
different machines. Each partition is known as a "shard".
 Each shard has the same database schema as the original database. Most data is distributed such that
each row appears in exactly one shard. The combined data from all shards is the same as the data
from the original database. The load is balanced out nicely between servers; for example, if we have
five servers, each one only has to handle 20% of the load.
 The NoSQL framework is natively designed to support automatic distribution of the data across
multiple servers, including the query load. Both data and queries are automatically
distributed across multiple servers located in different geographic regions, and this facilitates
rapid, automatic and transparent replacement of the data or query instances without any disruption.
 Sharding is particularly valuable for performance because it can improve both read and write
performance. Using replication, particularly with caching, can greatly improve read performance
but does little for applications that have a lot of writes.
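 A minimal sketch of hash-based sharding in Python, assuming three nodes and a user id as the shard key; real systems such as MongoDB or Cassandra use more sophisticated partitioners, so this only shows how a shard key maps each record to one node.

import hashlib

nodes = ["node0", "node1", "node2"]            # three shards in the cluster
storage = {node: {} for node in nodes}         # each node holds only its own subset

def shard_for(key):
    # Hash the shard key and map it onto one of the nodes.
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

def write(key, value):
    storage[shard_for(key)][key] = value

for user_id in range(10):
    write(user_id, {"user": user_id})

for node in nodes:
    print(node, sorted(storage[node].keys()))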
 Advantages of Sharding
a) Faster performance: There are more servers available to handle input/output.
b) Horizontal scaling: We can quickly add additional servers to a cluster.
c) Costs: Horizontal scaling can often be less expensive than vertical scaling.
d) Distribution/uptime: A horizontally scaled distributed database can achieve better uptime than a
traditional single server.

 Disadvantages of Sharding
a) Complexity: Depending on the database system, sharding complexity can vary.
b) Rebalancing: When adding additional machines to a cluster, the shards will likely need to be
rebalanced to distribute data evenly.
c) Increased infrastructure costs.
MASTER-SLAVE REPLICATION
Explain Master-slave Replication in big data distributed systems. Nov/Dec-2023
 We replicate data across multiple nodes. One node is designated as the primary (master), the others as
secondary (slaves). The master is responsible for processing any updates to that data. A replication
process synchronizes the slaves with the master.
 Master is the authoritative source for the data. It is responsible for processing any updates to that
data. Masters can be appointed manually or automatically.
 Slaves are kept in sync with the master by the replication process. After a failure of the
master, a slave can be appointed as the new master very quickly.
 Master-slave replication is most helpful for scaling when we have a read-intensive dataset. It will
scale horizontally to handle more reads.
 This design offers read resilience. Even if one or more of the servers fails, the remaining servers can
keep offering read access. This can help a lot with read-heavy applications, but will offer little
benefit to write-intensive applications.

Master-slave replication
 As the slaves are exact replicas of the master server, one of them can assume the role of the master
in case the master fails. In fact, most of the times you can simply create a set of nodes and have
them automatically decide who would be the master. There are some consistency issues that occur
due to the delay in updating between master and slaves.
 Masters can be appointed manually or automatically. Manual appointment is performed when we
configure our cluster and designate one node as the master. With automatic appointment, we
create a cluster of nodes and they elect one of themselves to be the master.
 Problems of master-slave replication:
1. Does not help with scalability of writes
2. Provides resilience against failure of a slave, but not of a master
3. The master is still a bottleneck.
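 The read/write routing implied by master-slave replication can be sketched as below (a simplified Python model that ignores replication lag and failover): all writes go to the master, while reads are spread across the slaves.

import itertools

class MasterSlaveStore:
    def __init__(self, slave_count):
        self.master = {}                                  # authoritative copy
        self.slaves = [{} for _ in range(slave_count)]    # read replicas
        self._next = itertools.cycle(range(slave_count))  # round-robin over slaves

    def write(self, key, value):
        self.master[key] = value
        for slave in self.slaves:                         # replication (simplified as synchronous)
            slave[key] = value

    def read(self, key):
        return self.slaves[next(self._next)].get(key)     # reads scale out across slaves

store = MasterSlaveStore(slave_count=2)
store.write("user:1", "Joe")
print(store.read("user:1"))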
Peer-to-Peer Replication

 In a peer-to-peer replication setup the various nodes are all "equals". Any node can accept reads as
well as writes and they communicate these writes to each other. In peer-to-peer replication updates
on any one server are replicated to all other associated servers.
 The advantages of this setup are its read and write resilience. One node's failure does not cause
problems, as the remaining nodes can continue their work without missing a beat.
 The problem that arises is that of consistency. For example, we may have conflicting write requests
that come to different nodes and then those nodes attempt to communicate those requests to the rest
of the nodes. This could lead to considerable inconsistencies.

Peer-to-peer replication
 There are various ways to resolve this problem. The most standard approach would be to have the
replicas communicate their writes first before they "accept" them. Once a majority of the replicas
has confirmed a write, it can be considered as having been successfully performed and a
response sent to the client. This requires a certain amount of network traffic in coordinating these
writes.
 There is a problem of write-write conflict. When two users update different copies of the same record
stored on different nodes at the same time, this is called a write-write conflict.

Compare between Sharding and Replication


Sharding and replication can be combined to get a better response: we can use master-slave
replication with sharding or peer-to-peer replication with sharding.
1. Master-slave replication and Sharding:
 We have multiple masters, but each data item only has a single master.
 A node can be a master for some data and a slave for others.
2. Peer-to-peer replication and Sharding:
 A common strategy for column-family databases.

 A good starting point for peer-to-peer replication is to have a replication factor of 3, so each shard is
present on three nodes.
Difference between Replication and Sharding
1. In replication, the primary server node copies data onto secondary server nodes; this can help increase
data availability and act as a backup in case the primary server fails. Sharding handles horizontal
scaling across servers using a shard key.
2. Replication copies data across multiple servers; sharding distributes different data across multiple servers.
3. With replication, each bit of data can be found in multiple places; with sharding, each server acts as the
single source for a subset of data.
4. Replicated servers contain identical copies of the entire database; sharded database servers each contain a
part of the overall data, i.e. they store different data on separate nodes.
5. Replication serves more read requests; sharding can improve both reads and writes.

CONSISTENCY
Explain Consistency in big data distributed systems. Nov/Dec-2023.
Briefly explain about consistency and its types.
 The CAP theorem is important when considering a distributed database, since we must make a
decision about what we are willing to give up. The database we choose will lose either availability
or consistency. When reading about NoSQL databases we often face the concept of a quorum. A quorum is
the minimal number of nodes that must respond to a read or write operation for it to be considered
complete. Of course, having a maximum quorum and querying all servers is the way we can
determine the correct result.

Definition:
Consistency can be simply defined by how the copies of the same data may vary within the same
replicated database system.
 Nowadays systems need to scale. The "traditional" monolithic database architecture, based on a
powerful server, does not guarantee the high availability and partition tolerance required by today's
web-scale systems, as demonstrated by the CAP theorem. To achieve such requirements, systems
cannot impose strong consistency.

 In the past, almost all architectures used in database systems were strongly consistent. In these
cases, most architectures would have a single database instance responding to only a few hundred
clients. Nowadays, many systems are accessed by hundreds of thousands of clients, so there is a
mandatory requirement for system architectures that scale. However, considering the CAP theorem,
high availability and consistency do conflict on distributed systems when subject to a network
partition event.
Update Consistency
 Two users updating the same data item at the same time is called write-write conflict.
 When the writes from the two users reach the server, the server will serialize them and decide to apply
one, then the other. The first user's update would be applied and immediately overwritten by the second
user's.
 In this case the first user's update is a lost update. Here the lost update is not a big problem. We see this as a
failure of consistency because the second user's update was based on the state before the first user's update,
yet was applied after it.
 Approaches for maintaining consistency in the case of concurrency are often described as
pessimistic or optimistic.
 A pessimistic approach works by preventing conflicts from occurring; an optimistic approach lets
conflicts occur, but detects them and takes action to sort them out.
 For update conflicts, the most common pessimistic approach is to have write locks, so that in order
to change a value we need to acquire a lock and the system ensures that only one client can get a
lock at a time.
 So, both users would attempt to acquire the write lock, but only the first user would succeed. The
second user would then see the result of the first user's write before deciding whether to make his
own update.
 A common optimistic approach is a conditional update, where any client that does an update tests the
value just before updating it to see whether it has changed since the last read (a small sketch follows the list below).
 Both the pessimistic and optimistic approaches rely on a consistent serialization of the updates and
it is possible for a single server.
 Two general solutions for write-write conflict are as follows:
1. Pessimistic approach: Preventing conflicts from occurring. Also acquire write locks
before update.
2. Optimistic approach: Lets conflicts occur, but detect them and take actions to resolve
them.
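 A minimal sketch of the optimistic (conditional update) approach in Python: the client re-checks that the value has not changed since its last read before applying the write; otherwise the write is rejected and can be retried. The version number used here is purely illustrative.

record = {"value": "phone: 1234", "version": 1}

def conditional_update(expected_version, new_value):
    # Optimistic approach: apply the write only if nobody else updated the record in between.
    if record["version"] != expected_version:
        return False                   # conflict detected; the caller must re-read and retry
    record["value"] = new_value
    record["version"] += 1
    return True

v = record["version"]                              # client reads the record (and its version)
print(conditional_update(v, "phone: 5678"))        # True: no concurrent update
print(conditional_update(v, "phone: 9999"))        # False: the version has moved on, lost update avoided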

 If there is more than one server, i.e., peer-to-peer replication, then two nodes might apply the
updates in a different order, resulting in a different value for the telephone number on each peer.
Sequential consistency is used in distributed systems to avoid this.
 An optimistic way to handle a write-write conflict is to save both updates and record that they are in
conflict. Replication makes it much more likely to run into write-write conflicts. If different nodes
have different copies of some data which can be independently updated, then we will get conflicts
unless we take specific measures to avoid them. Using a single node as the target for all writes for
some data makes it much easier to maintain update consistency.
Read Consistency
 Problem: One user reads in the middle of another user's write.
 This is called a read-write conflict, or an inconsistent read. This leads to logical inconsistency.
 In NoSQL databases, read consistency refers to the level of consistency between multiple read
operations on the same data. In a distributed database, where data can be replicated across multiple
nodes, ensuring read consistency can be challenging.
 Aggregate-oriented databases do support atomic updates, but only within a single aggregate. This
means that we will have logical consistency within an aggregate but not between aggregates.
 The length of time an inconsistency is present is called the inconsistency window. A NoSQL system
may have a quite short inconsistency window.
 There are different levels of read consistency available in NoSQL databases, ranging from eventual
consistency to strong consistency.
 Eventual consistency allows for a certain degree of inconsistency to occur between different
replicas of data. In this model, the database guarantees that all updates will eventually propagate to
all nodes, but it makes no guarantees about how long this will take or about the order in which
updates will be applied.
 Read-your-writes consistency means that once we have updated a record, all of our subsequent
reads of that record will return the updated value.
 Session consistency means read-your-writes consistency but at session level. Session can be
identified with a conversation between a client and a server. As long as the conversation continues,
we will read everything we have written during this conversation. If the session ends and we start
another session with the same server, there is no guarantee that we can read values we have written
during previous conversation.
 Session consistency can be provided in two ways: sticky sessions and version stamps.
Quorums

Q. Discuss read and write Quorums.
 Quorum consistency is used in systems where consistency is more important than availability (CAP
theorem) for write and read.
 In systems with multiple replicas there is a possibility that the user reads inconsistent data. This
happens, say, when there are two replicas, N1 and N2, in a cluster and a user writes value v1 to node N1
and then another user reads from node N2, which is still behind N1 and thus will not have the value
v1, so the second user will not get the consistent state of data.
 In order to achieve a state where at least one node has consistent data we use quorum consistency.

Write and read quorums.


 Quorum is achieved when nodes follow the protocol: w + r > n
where n = number of nodes in the quorum group,
w = minimum number of write nodes,
r = minimum number of read nodes.
 Here w is our write quorum and r is our read quorum.
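 A small worked check of the w + r > n rule: with n = 3 replicas, choosing w = 2 and r = 2 gives 2 + 2 > 3, so every read quorum overlaps every write quorum and a read is guaranteed to reach at least one up-to-date replica. The Python helper below simply evaluates that inequality.

def quorum_ok(n, w, r):
    # Strong read/write quorum condition: read and write sets must overlap.
    return w + r > n

print(quorum_ok(n=3, w=2, r=2))   # True  - a typical setting with replication factor 3
print(quorum_ok(n=3, w=1, r=1))   # False - a read may miss the latest write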
Relaxing Durability
 When write is committed, the change is permanent.
 In some cases, strict durability is not essential and it can be traded for scalability (write
performance).
 A simple way to relax durability is to store data in memory and flush to disk regularly. If the system
shuts down, we lose updates in memory.
CASSANDRA
Define Cassandra. Explain in detail about the Cassandra architecture with a neat diagram.
Explain the features of Cassandra. List out the database components of Cassandra. What is CQLSH
and why is it used? APR/MAY 2024

 Cassandra is a columnar NoSQL database. It was initially developed by Facebook to fulfill the needs
of the company's Inbox Search service. In 2009, it became an Apache project.
 Apache Cassandra was initially designed at Facebook using a Staged Event-Driven Architecture
(SEDA) to implement a combination of Amazon's Dynamo distributed storage and replication
techniques and Google's Bigtable data and storage engine model.
 Apache Cassandra is an open source, distributed, NoSQL database. Apache Cassandra is a
distributed database system using shared nothing architecture.
 A columnar database, also called a column-oriented database or a wide-column store, is a database
that stores the values of each column together, rather than storing the values of each row together.
 Columnar databases are well suited for big data processing, Business Intelligence (BI) and analytics.
 Cassandra provides tunable consistency, i.e., users can determine the consistency level by tuning it
via read and write operations. Cassandra enables users to configure the number of replicas in a
cluster that must acknowledge a read or write operation before considering the operation successful.
 Cassandra uses a gossip protocol to discover node state for all nodes in a cluster. Cassandra is
designed to handle "big data" workloads by distributing data, reads and writes (eventually) across
multiple nodes with no single point of failure.
 Features of Cassandra:
1. Elastic scalability: Cassandra is highly scalable; it allows adding more hardware to
accommodate more customers and more data as per requirement.
2. Always on architecture: Cassandra has no single point of failure.
3. Fast linear-scale performance: Cassandra is linearly scalable, i.e., it increases throughput as
we increase the number of nodes in the cluster.
4. Flexible data storage: Cassandra accommodates all possible data formats including
structured, semi-structured and unstructured.
5. Transaction support: Cassandra supports properties like ACID.
6. Easy data distribution: Cassandra provides the flexibility to distribute data where you need
by replicating data across multiple datacenters.
Cassandra Architecture
Components of Cassandra architecture are node, data center, cluster, commit log, memtable, SSTable,
Bloom Filters and Cassandra query language.

Cassandra Architecture
 Node: A Cassandra node is a place where data is stored.
 Data center: Data center is a collection of related nodes.
 Cluster: A cluster is a component which contains one or more data centers.
 Commit log: In Cassandra, the commit log is a crash-recovery mechanism. Every write operation is
written to the commit log.
 Mem-table: A mem-table is a memory-resident data structure. After commit log the data will be
written to the mem-table. Sometimes, for a single-column family, there will be multiple mem-
tables.
 SSTable: It is a disk file to which the data is flushed from the mem-table when its contents reach a
threshold value.
 Bloom filter: Bloom filters are very fast, nondeterministic algorithms for testing whether an element
is a member of a set. It is a special kind of cache. Bloom filters are accessed after every query.
 Each node in a Cassandra cluster also maintains a sequential commit log of write activity on disk to
ensure data integrity. These writes are indexed and written to an in-memory structure called a mem-
table.
 A mem-table can be thought of as a write-back cache where write I/O is directed to cache with its
completion immediately confirmed by the host. This has the advantage of low latency and high
throughput. The mem-table structure is kept in Java heap memory by default.
 SSTables: When the commit log gets full, a flush is triggered and the contents of the mem-table are
written to disk into an SSTable data file. At the completion of this process the mem-table is cleared
and the commit log is recycled. Cassandra automatically partitions these writes and replicates them
throughout the cluster.
CASSANDRA DATA MODEL
Briefly explain about Cassandra data model.

 Some of the features of Cassandra data model are as follows:
1. Data in Cassandra is stored as a set of rows that are organized into tables.
2. Tables are also called column families.
3. Each row is identified by a primary key value.
4. Data is partitioned by the primary key.
 Data modeling in Cassandra uses a query-driven approach, in which specific queries are the key to
organizing the data. The main goal of Cassandra data modeling is to develop and design a high-
performance and well-organized cluster.
 Apache Cassandra data model components include keyspaces, tables and columns:
a. Cassandra stores data as a set of rows organized into tables or column families.
b. A primary key value identifies each row.
c. The primary key partitions data.
d. We can fetch data in part or in its entirety based on the primary key.
 Cassandra data model provides a mechanism for data storage. The components of Cassandra data
model are keyspaces, tables and columns.
1. Keyspaces:
 At a high level, the Cassandra NoSQL data model consists of data containers called keyspaces.
Keyspaces are similar to the schema in a relational database. Typically, there are many tables in a
keyspace.
 Features of keyspaces are:
a. A keyspace needs to be defined before creating tables, as there is no default keyspace.
b. A keyspace can contain any number of tables and a table belongs only to one keyspace. This
represents a one-to-many relationship.
c. Replication is specified at the keyspace level. For example, replication of three implies that each
data row in the keyspace will have three copies.

Cassandra Data Model
2. Tables:
 Tables, also called column families in earlier iterations of Cassandra, are defined within the
keyspaces. Tables store data in a set of rows and contain a primary key and a set of columns.
 Cassandra tables are used to hold the actual data in the form of rows and columns. A table in
Cassandra must be created with its primary key at table creation time; after that, the primary key cannot be
altered.
 To change the primary key, a new table should be created and the existing data copied into it. The primary
key is used to locate and order the data.
 Some of the features of tables are:
a. Tables have multiple rows and columns. As mentioned earlier, a table is also called column
family in the earlier versions of Cassandra.
b. It is still referred to as column family in some of the error messages and documents of
Cassandra.
c. It is important to define a primary key for a table.
3. Columns:
 Columns define data structure within a table. There are various types of columns, such as Boolean,
double, integer and text.
 Cassandra column is used to store a single piece of data. The column can consist of various types of
data such as big integer, double, text, float and Boolean.
 Each column value has a timestamp associated with it that shows the time of update. Cassandra
provides the collection type of columns such as list, set and map.
 Some of its features are:
a. Columns consist of various types, such as integer, big integer, text, float, double and Boolean.
b. Cassandra also provides collection types such as set, list and map.
c. Further, column values have an associated time stamp representing the time of update.
d. This timestamp can be retrieved using the writetime() function.
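 As a small Cassandra example (the syllabus also lists "Cassandra examples"), the sketch below uses the Python cassandra-driver against a locally running node to create a keyspace, a table and a few rows; the keyspace, table and column names are illustrative, and the replication settings assume a single-node development cluster.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # connect to a local Cassandra node
session = cluster.connect()

# Keyspace: the outer container; replication is specified at this level.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS shop
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# Table (column family): rows are identified and partitioned by the primary key.
session.execute("""
    CREATE TABLE IF NOT EXISTS shop.products (
        product_id int PRIMARY KEY,
        name text,
        price double
    )
""")

session.execute("INSERT INTO shop.products (product_id, name, price) VALUES (1, 'pen', 10)")

for row in session.execute("SELECT product_id, name, price FROM shop.products"):
    print(row.product_id, row.name, row.price)

cluster.shutdown()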
CASSANDRA CLIENTS
Write short note on Cassandra clients.
 Thrift is the driver-level interface; it provides the API for client implementations in a wide variety of
languages. Thrift was developed at Facebook.
 A Client holds connections to a Cassandra cluster, allowing it to be queried. Each Client instance
maintains multiple connections to the cluster nodes, provides policies to choose which node to use
for each query and handles retries for failed queries, etc.
 Client instances are designed to be long-lived and usually a single instance is enough per
application. As a given Client can only be "logged" into one keyspace at a time, it can make sense
to create one client per keyspace used. This is, however, not necessary to query multiple keyspaces,
since it is always possible to use a single session with fully qualified table names in queries.
 The Cassandra cluster is denoted as a ring. The idea behind this representation is to show token
distribution.
Write in action:
 To write, clients need to connect to any of the Cassandra nodes and send a write request. This node
is called the coordinator node. When a node in Cassandra cluster receives a write request, it
delegates it to a service called Storage Proxy.
 This node may or may not be the right place to write the data to. The task of Storage Proxy is to get
the nodes (all the replicas) that are responsible to hold the data that is going to be written. It utilizes
a replication strategy to do that.
 Once the replica nodes are identified, it sends a RowMutation message to them. The node waits
for replies from these nodes, but it does not wait for all the replies to come.
 It only waits for as many responses as are enough to satisfy the client's minimum number of
successful writes, defined by the consistency level.
 Write operations at a node level:
 Each node processes the request individually. Every node first writes the mutation to the
commit log and then writes the mutation to the mem-table. Writing to the commit log
ensures durability of the write, as the mem-table is an in-memory structure and is only
written to disk when the mem-table is flushed to disk.
 A mem-table is flushed to disk when:
1. It reaches its maximum allocated size in memory

2. The number of minutes a mem-table can stay in memory elapses.
3. Manually flushed by a user.
 A mem-table is flushed to an immutable structure called an SSTable (Sorted String Table).
The commit log is used for playback purposes in case data from the mem-table is lost due to
node failure.
Two Marks Questions with Answers
Q.1 What is consistency in a distributed system?
Ans. In a distributed system, consistency will be defined as one that responds with the same output for the
same request at the same time across all the replicas.
Q.2 What is database Sharding?
Ans. Sharding is a method for distributing a single dataset across multiple databases, which can then be
stored on multiple machines. This allows for larger datasets to be split into smaller chunks and stored in
multiple data nodes, increasing the total storage capacity of the system.
Q.3 Why are NoSQL databases known as schemaless databases?
Ans. Because NoSQL databases are designed to store and query unstructured data, they do not require the
same rigid schemas used by relational databases. Although a schema can be applied at the application level,
NoSQL databases retain all of your unstructured data in its original raw format. This means that complete
granularity is retained, even if you later change your application schema - something that is simply not
possible with a traditional SQL database.
Q.4 What is the difference between Sharding and replication?
Ans. Sharded database servers each contain a part of the overall data, i.e. they store different data on
separate nodes. Replicated servers contain identical copies of the entire database.
Q.5 How is Sharding different from partitioning?
Ans. All partitions of a table reside on the same server whereas Sharding involves multiple servers.
Therefore, Sharding implies a distributed architecture whereas partitioning does not. Partitions can be
horizontal (split by rows) or vertical (by columns). Shards are usually only horizontal. In other words, all
shards share the same schema but contain different records of the original table.
Q.6 What are write-write and read-write conflicts?
Ans. Write-write conflicts occur when two clients try to write the same data at the same time. Read-write
conflicts occur when one client reads inconsistent data in the middle of another client's write.
Q.7 Define Cassandra.

Ans. Cassandra is defined as distributed, fault tolerant, scalable, column oriented data store. Cassandra is a
peer-to-peer distributed system made up of a cluster of nodes in which any node can accept a read or write
request.
Q.8 What is the use of Bloom filters in Cassandra?
Ans. Bloom filters are used as a performance booster. Bloom filters are very fast, nondeterministic
algorithms for testing whether an element is a member of a set. They are nondeterministic because it is
possible to get a false-positive read from a Bloom filter, but not a false-negative. Bloom filters work by
mapping the values in a data set into a bit array and condensing a larger data set into a digest string. Bloom
filter is a special kind of cache.
Q.9 Explain sorted strings table.
Ans. A sorted strings table (SSTable) is a file format used by Cassandra to store the statistics and the data from the
mem-tables. The Cassandra SSTables are immutable, hence any update on the table creates a new SSTable
file. The data structure used by SSTables is the log-structured merge tree, which is well suited for write-
intensive data sets compared to the traditional B-tree structure.
Q.10 Explain Cassandra data center.
Ans. A Cassandra data center is the collection of related nodes which are configured in the cluster to perform
the replication. Data centers can be physical or logical, and depending upon the
workload a separate data center can be used.
Q.11 Explain advantages and disadvantages of graph data.
Ans.
Advantages of graph data:
a) More descriptive queries
b) Greater flexibility in adapting your model
c) Greater performance when traversing data relationships.
• Disadvantages of graph data stores:
a) Difficult to scale,
b) No standard language.

Q.12 Describe session consistency.


Ans. Session consistency means read-your-writes consistency but at session level. Session can be identified
with a conversation between a client and a server. As long as the conversation continues, we will read
everything we have written during this conversation. If the session ends and we start another session with

the same server, there is no guarantee that we can read values we have written during previous
conversation.
Q.13 What are schemaless databases? What is the main advantage of using schemaless databases?
NOV/DEC 2023
Ans. Schemaless databases are a type of NoSQL databases that do not have a predefined schema or
structure for data. This means that data can be inserted and retrieved without adhering to a specific
structure and the database can adapt to changes in data over time without requiring schema migrations or
changes.
Q.14. Summarize the key characteristics of the data model in Cassandra. NOV/DEC 2023
Why Cassandra model is popular among developers? APR/MAY 2024
Ans. Cassandra is highly scalable and has no single point of failure. It accommodates all possible data
formats including structured, semi-structured and unstructured. Cassandra provides the flexibility to
distribute data where you need by replicating data across multiple datacenters.

Q.15. What are NoSql databases? Give example. APR/MAY 2024

Ans. A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means
other than the tabular relations used in relational databases. NoSQL is often interpreted as Not-only-SQL to
emphasize that they may also support SQL-like query languages. Most NoSQL databases are designed to
store large quantities of data in a fault-tolerant way. Example: Amazon S3, CouchDB, HBase.

UNIT-II
Question bank

1. What is the need for NoSQL databases? Explain the types of NoSQL databases with
example. APR/MAY 2024
2. What are the several types of NoSQL databases? Explain in detail about the aggregate data model of
NoSQL with a neat diagram.
3. Explore how graph-based databases handle huge data and their unique capabilities in data
management and analytics. Nov/Dec-2023 (13 Marks)
4. What is CAP theorem? Explain.
5. What is Schemaless database? Explain.
6. Briefly explain about different types of distribution models
7. Explain Master-slave Replication in big data distributed systems. Nov/Dec-2023
8. Explain Consistency in big data distributed systems. Nov/Dec-2023.
9. Explain the features of Cassandra. List out the database components of Cassandra. What is
CQLSH and why it is used? APR/MAY 2024

