
BIG DATA MANAGEMENT THEORY

STORING BIG DATA

References:
Big Data: A Very Short Introduction
Big Data Fundamentals: Concepts, Drivers & Techniques
Storage Capacity

The first hard drive (by IBM, 1956) was part of a mainframe computer and stored 5 MB.
Storage Capacity

The hard drive of a PC today: 8 TB or even bigger.

Computer storage capacity increased very quickly, although PC storage has not kept pace with big data storage.

The PC boom came in the 1980s, when the average hard drive on a PC held about 5 MB.
Storage Capacity
Computers have indeed become faster, cheaper, and more powerful since the change from
valves to transistors took place in the 1960s.

(Images: a valve and a transistor)
Moore’s Law
In 1965, Gordon Moore, who became the co-founder of Intel, famously
predicted that over the next ten years the number of transistors
incorporated in a chip would approximately double every twenty-four
months.

A microprocessor is the integrated circuit responsible for performing the instructions provided by a computer
program. It usually consists of billions of transistors, each a few nanometers in length, embedded in a tiny
space on a silicon microchip.

A nanometer is 10⁻⁹ meter, or one-millionth of a millimeter. A human hair is about 75,000 nanometers in
diameter, and the diameter of an atom is between 0.1 and 0.5 nanometers.
Storing Structured Data
Structured Data

Structured data, of the kind once written by hand and kept in notebooks or in filing cabinets, is now
stored electronically on spreadsheets or in databases, and consists of spreadsheet-style tables
with rows and columns, each row being a record and each column a well-defined field (e.g.
name, address, and age). Carefully structured and tabulated data is relatively easy to manage
and is amenable to statistical analysis; indeed, until recently statistical analysis methods could be
applied only to structured data.
RDBMS
In order to manage structured data, a relational database management system (RDBMS) is used to create, maintain,
access, and manipulate the data.

What is RDBMS?

▪ RDBMS is a type of DBMS that organizes data in a tabular format with relationships between tables, while
DBMS can use different data organization models and may not enforce strict relationships between data
entities.

▪ In an RDBMS, data is stored in tables, where each table represents a specific entity or concept. The tables
consist of rows (also known as tuples or records) that represent individual instances of the entity,
and columns that define the attributes or properties of the entity.

▪ The relational model facilitates the establishment of relationships between tables through keys. A primary
key uniquely identifies each row in a table, while foreign keys establish relationships between tables by
referencing the primary key of another table.
RDBMS
▪ The first step is to design the database schema (i.e. the structure of the database):
define the data fields and arrange them in tables.

▪ Identify the relationships between the tables.

▪ Populate the database with data and query it using Structured Query Language (SQL).
RDBMS
Advantages of RDBMS

▪ Structured data storage: RDBMS stores data in the form of tables, which gives the data structured and
organized characteristics, facilitating effective querying and analysis.
▪ ACID properties: Atomicity ensures that incomplete transactions cannot update the database; consistency
excludes invalid data; isolation ensures one transaction does not interfere with another transaction; and
durability means that once a transaction is committed, its changes are permanent. These properties ensure
the integrity, consistency, and reliability of data.
▪ Flexible query language: SQL (Structured Query
Language) is the standard query language of RDBMS,
which has powerful querying capabilities, enabling complex
data queries and operations.
▪ Data security: RDBMS offers access control and
permission management mechanisms, which can restrict
users' access to data and operations, ensuring data security.
RDBMS
Features of RDBMS (ACID)
▪ Atomicity: It ensures that a database transaction is treated as a single indivisible unit of work. Atomicity
guarantees that either all the changes made within a transaction are committed to the database, or none of
them are.
▪ Consistency: It ensures the database remains in a valid and expected state before and after execution of a
transaction. It guarantees that data meets all predefined rules, constraints, and relationships defined in the
database schema. Consistency ensures that changes made to the database during a transaction preserve these
integrity constraints and do not violate any defined rules.
▪ Isolation: It ensures transactions are executed in isolation from each other. It guarantees that concurrent
transactions do not interfere with each other. Isolation ensures that the intermediate states of a transaction
are not visible to other concurrent transactions until the transaction is committed. This prevents concurrent
transactions from accessing or modifying the same data simultaneously in a way that could lead to data
inconsistencies or conflicts.
▪ Durability: It ensures that once a transaction is committed, its effects are permanently stored and will
survive any subsequent failures, such as power outages, system crashes, or other catastrophic events.
RDBMS
How does an RDBMS work?

Example schema:
Books (BookID, Title, Author, Price)
Customers (CustomerID, Name, Email, Address)
Orders (OrdersID, CustomerID, BookID)

RDBMS follows a relational model for data organization.
▪ Data Organization
the columns of the “Books” and “Customers” tables
▪ Data Relationships
the primary keys of the “Books” and “Customers” tables, and a relationship between these two tables
▪ Data Definition
Using the RDBMS Data Definition Language (DDL) to define the structure of the database by creating the
“Books” and “Customers” tables, specifying the columns
▪ Data Manipulation
Using the Data Manipulation Language (DML) to perform operations on the data, e.g., inserting new book
records into the tables (see the JDBC sketch below).
Source: https://www.knowledgehut.com/blog/database/what-is-rdbms
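As a concrete (and hedged) illustration of the DDL and DML steps above, here is a minimal JDBC sketch.
The in-memory H2 connection string and the sample data are assumptions; any JDBC-compliant RDBMS
would work the same way:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class BookstoreSchema {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string (assumes an H2 driver on the classpath).
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:bookstore")) {
            try (Statement ddl = conn.createStatement()) {
                // Data Definition: create the Books table with a primary key.
                ddl.executeUpdate("CREATE TABLE Books ("
                        + "BookID INT PRIMARY KEY, "
                        + "Title VARCHAR(200), "
                        + "Author VARCHAR(100), "
                        + "Price DECIMAL(8,2))");
            }
            // Data Manipulation: insert a new book record.
            try (PreparedStatement dml =
                         conn.prepareStatement("INSERT INTO Books VALUES (?, ?, ?, ?)")) {
                dml.setInt(1, 1);
                dml.setString(2, "Big Data: A Very Short Introduction");
                dml.setString(3, "Dawn E. Holmes");
                dml.setBigDecimal(4, new java.math.BigDecimal("14.99"));
                dml.executeUpdate();
            }
        }
    }
}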
RDBMS
How does an RDBMS work?

▪ Query Processing
When a user submits a query, such as “SELECT * FROM Books WHERE Price < 20”, the RDBMS
processes the query; the query optimizer determines the most efficient way to execute the query; the
RDBMS then executes the query, retrieves the matching book records, and presents the results.
▪ Data Integrity
RDBMS enforces data integrity through constraints:
entity integrity, referential integrity, domain integrity, and business rule integrity.
▪ Transaction Management
RDBMS supports transaction management to ensure data consistency and integrity. E.g., when a customer
places an order, the RDBMS groups the order-related operations into a transaction. If all operations within
the transaction succeed, the changes (e.g., updating the order history) are committed to the database.
Otherwise, if any operation fails, the changes are rolled back.
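A minimal JDBC sketch of the commit/rollback behavior described above, assuming the Orders table from
the example schema; the connection URL is left as a parameter and the order-placement logic is illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PlaceOrder {
    public static void placeOrder(String url, int orderId, int customerId, int bookId)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            conn.setAutoCommit(false); // start a transaction
            try (PreparedStatement insertOrder = conn.prepareStatement(
                         "INSERT INTO Orders (OrdersID, CustomerID, BookID) VALUES (?, ?, ?)")) {
                insertOrder.setInt(1, orderId);
                insertOrder.setInt(2, customerId);
                insertOrder.setInt(3, bookId);
                insertOrder.executeUpdate();
                conn.commit();   // all operations succeeded: commit the changes
            } catch (SQLException e) {
                conn.rollback(); // any operation failed: roll everything back
                throw e;
            }
        }
    }
}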
RDBMS
How does an RDBMS work?

▪ Concurrency Control
• RDBMS handles concurrent access to the data by multiple users or applications.
• Concurrency control mechanisms like locking or multi-versioning ensure that concurrent transactions do
not interfere with each other, maintaining data consistency.
▪ Indexing and Performance Optimization
• RDBMS uses indexes to speed up data retrieval.
• E.g., an index can be created on the “Title” column of the “Books” table, allowing for efficient searching
and retrieval of books by their titles.
▪ Security and Access Control
• RDBMS provides security mechanisms to control access to the database.
• User authentication and authorization ensure that only authorized users can access and modify the data.
RDBMS
Types of RDBMS

▪ Oracle Database: a widely used RDBMS that offers robust support for data management and advanced
analytics capabilities.
▪ MySQL: a popular open-source RDBMS that is known for its
simplicity, speed, and ease of use. It is widely used in web
applications.
▪ SQL Server: a powerful RDBMS developed by Microsoft.
▪ PostgreSQL: a feature-rich open-source RDBMS known for its
extensibility and compliance with industry standards.
▪ IBM DB2: an enterprise-level RDBMS designed for high-
performance and scalability.
▪ Microsoft Access: It is primarily designed for small-scale database
applications, allowing users to create and manage databases without
requiring extensive programming knowledge.
▪ Azure SQL: a cloud-based relational database service provided by
Microsoft Azure. It is based on the popular Microsoft SQL Server
database engine and offers managed relational databases in the
cloud.
RDBMS
Uses of RDBMS

▪ Business Applications: store, manage, and process large volumes of structured data. serves as the backbone
for ERP (enterprise resource planning), CRM (customer relationship management), HRM (human resource
management), and other business systems.
▪ E-commerce: handle product catalogs, inventory management, customer data, orders, and transactions
▪ Financial Systems: manage banking transactions, accounting records, financial reporting, and risk analysis
▪ Healthcare Information Systems: store and manage patient data, medical records, laboratory results, and
healthcare information in EHR (electronic health record) systems
▪ Online Ticketing and Reservation Systems: ticketing and reservation systems for airlines, railways, hotels, and
event management
▪ Education Systems: manage student information, course registration, grades, and academic records
▪ Supply Chain Management: manage inventory, track shipments, and monitor logistics data
▪ Social Media Content Management: handles user profiles, content storage, retrieval, and supports efficient
searching and indexing of vast amounts of data
RDBMS
Disadvantages of RDBMS

▪ Scalability limitations: Since RDBMS typically runs on a single server, performance may be limited as data
volume increases, and horizontal scaling can be challenging.

▪ Fixed schema: RDBMS uses fixed tables and columns to store data, resulting in lower efficiency for storing
and querying unstructured or semi-structured data.

▪ High costs: RDBMS often require expensive licensing fees, and deployment and maintenance costs are high,
especially for large enterprise-grade database systems.

▪ Complexity: Managing and maintaining large RDBMS systems may require specialized skills and experience,
including knowledge of database design, performance optimization, backup, and recovery.

(Figure: cost, managing complex data, limitations in field length, storage constraints)
RDBMS
Limitation of RDBMS for Big Data storage

▪ Some relational databases, e.g., IBM DB2 pureScale, Sybase ASE Cluster Edition, Oracle
Real Application Cluster (RAC) and Microsoft Parallel Data Warehouse (PDW), are capable of
being run on clusters. However, these database clusters still share storage, which can act as a
single point of failure.
RDBMS
Limitation of RDBMS for Big Data storage

Relational databases generally require data to adhere to a schema. As a result, storage of
semi-structured and unstructured data, whose schemas are non-relational, is not directly
supported. Furthermore, with a relational database, schema conformance is validated at the
time of data insert or update by checking the data against the constraints of the schema. This
introduces overhead that creates latency.
This latency makes relational databases a less than ideal choice for storing high velocity
data that needs a highly available database storage device with fast data write capability.
As a result of its shortcomings, a traditional RDBMS is generally not useful as the primary
storage device in a Big Data solution environment.
RDBMS
Limitation of RDBMS for Big Data storage

▪ Relational databases need to be manually
sharded, mostly using application logic. This
means that the application logic needs to know
which shard to query in order to get the required
data. This further complicates data processing
when data from multiple shards is required.
Big Data Storage

RDBMSs are no longer appropriate for storing unstructured data.

▪ Once the relational database schema has been constructed, it is difficult to change it.
▪ Unstructured data cannot be organized conveniently into rows and columns.
▪ Big data is often high-velocity and generated in real time, with real-time processing requirements.

Storage is required when:

• External datasets are acquired, or internal data will be used in a big data environment
• Data is manipulated to be made amenable for data analysis
• Data is processed via ETL (Extract-Transform-Load) activity, or output is generated as a result of an
analytical operation
Big Data Storage

Cluster

In computing, a cluster is a tightly coupled collection of servers, or nodes. These servers
usually have the same hardware specifications and are connected together via a network to
work as a single unit.
Each node in the cluster has its own dedicated resources, such as memory, a processor, and a
hard drive. A cluster can execute a task by splitting it into small pieces and distributing
their execution onto different computers that belong to the cluster.
Big Data Storage

File System

A file system is the method of storing and organizing data on a storage device, such as
flash drives, DVDs and hard drives. A file is an atomic unit of storage used by the file system
to store data. A file system provides a logical view of the data stored on the storage device
and presents it as a tree structure of directories and files.
Operating systems employ file systems to store and retrieve data on behalf of applications.
Each operating system provides support for one or more file systems, for example NTFS on
Microsoft Windows and ext on Linux.
Big Data Storage

Distributed File Systems

A distributed file system is a file system that can store large files spread across the nodes of
a cluster.
To the client, files appear to be local; however, this is only a logical view, as physically the
files are distributed throughout the cluster. This local view is presented via the distributed file
system, and it enables the files to be accessed from multiple locations. Examples include the
Google File System (GFS) and the Hadoop Distributed File System (HDFS).
A distributed file system (DFS) provides effective and reliable storage for big data
across many computers.
Hadoop

Distributed File System and What is Hadoop

What is Hadoop?
Hadoop is a set of open-source software utilities. They facilitate the usage of a network of many
computers to solve problems involving massive amounts of data. It provides a software framework for
distributed storage and distributed computing. It divides a file into a number of blocks and stores them
across a cluster of machines. Hadoop also achieves fault tolerance by replicating the blocks on the
cluster. It does distributed processing by dividing a job into a number of independent tasks. These tasks
run in parallel over the computer cluster.

▪ Influenced by the ideas published in October 2003 by Google in a research paper launching the Google
File System, Doug Cutting, who was then working at Yahoo, and his colleague Mike Cafarella went to
work on developing the Hadoop DFS.
▪ Hadoop, one of the most popular DFSs, is part of a bigger, open-source software project called the
Hadoop Ecosystem. It is named after a yellow soft toy elephant owned by Cutting’s son.
▪ Hadoop is written in the popular programming language Java. It enables the storage of both semi-
structured and unstructured data, and provides a platform for data analysis.
▪ When you use Facebook, Twitter, or eBay, for example, Hadoop will have been working in the
background while you do so.
▪ Hadoop provides a software framework for distributed storage and distributed computing. It does
distributed processing by dividing a job into a number of independent tasks. These tasks run in parallel
over the computer cluster.
Hadoop
Hadoop Components and Domains

▪ Hadoop consists of three layers (core components): HDFS, MapReduce, and YARN.

▪ HDFS (Hadoop Distributed File System) provides the storage layer of Hadoop. As the
name suggests, it stores the data in a distributed manner. A file gets divided into a
number of blocks which spread across the cluster of commodity hardware.

▪ MapReduce: This is the processing engine of Hadoop. MapReduce works on the principle
of distributed processing. It divides the task submitted by the user into a number of
independent subtasks. These subtasks execute in parallel, thereby increasing the
throughput.

▪ YARN (Yet Another Resource Negotiator) provides resource management for Hadoop.
There are two daemons running for YARN: the NodeManager on the slave machines and
the ResourceManager on the master node. YARN looks after the allocation of the
resources among the various slaves competing for them.
Hadoop
Hadoop is an open source software framework that stores data in a distributed manner and process that data
in parallel. Hadoop provides the world’s most reliable storage layer – HDFS, a batch processing engine –
MapReduce and a resource management layer – YARN.

https://data-flair.training/blogs/how-hadoop-works-internally/
Hadoop
Daemons

Daemons are processes that run in the background. The Hadoop daemons are:

◆ NameNode – It runs on the master node for HDFS.

◆ DataNode – It runs on slave nodes for HDFS.

◆ ResourceManager – It runs on the YARN master node for MapReduce.

◆ NodeManager – It runs on YARN slave nodes for MapReduce.

These 4 daemons run for Hadoop to be functional.


Hadoop
How Hadoop Works? - HDFS
HDFS has a master-slave topology. It has two daemons running: the NameNode and the DataNode.

◆ NameNode – It runs on the master node for HDFS.

The NameNode is the centerpiece of an HDFS file system. The NameNode stores the directory tree of all files in the
file system. It tracks where across the cluster the file data resides. It does not store the data contained in these files.
When client applications want to add/copy/move/delete a file, they interact with the NameNode. The
NameNode responds to the request from the client by returning a list of relevant DataNode servers where the data
lives.

◆ DataNode – It runs on slave nodes for HDFS.

The DataNode stores data in the Hadoop file system. In a functional file system, data replicates across many
DataNodes.
On startup, a DataNode connects to the NameNode. It keeps looking for requests from the NameNode to
access data. Once the NameNode provides the location of the data, client applications can talk directly to a
DataNode; while replicating the data, DataNode instances can talk to each other.
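A small client sketch using the standard org.apache.hadoop.fs API, illustrating that clients obtain metadata
from the NameNode and then stream data to and from DataNodes; the NameNode address and file path are
assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; the client contacts the NameNode
        // for metadata, then talks to DataNodes directly for the bytes.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/sample.txt");
            // Write: the NameNode chooses DataNodes; blocks are replicated.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("hello HDFS");
            }
            // Read: the NameNode returns block locations; data streams from DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}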
Hadoop

▪ The NameNode deals with all requests


coming in from a client computer; it
distributes storage space, and keeps
track of storage availability and data
location. It also manages all the basic
file operations (e.g. opening and closing
files) and controls data access by client
computers.
▪ The DataNodes are responsible for
actually storing the data and in order to
do so, create, delete, and replicate
blocks as necessary.
Hadoop
How Hadoop Works? - MapReduce

The general idea of the MapReduce algorithm is to process the data in parallel on your distributed cluster and
subsequently combine it into the desired result or output. Being the heart of the Hadoop system, MapReduce
processes the data in a highly resilient, fault-tolerant manner.

Hadoop MapReduce includes several stages:

• The program locates and reads the input file containing the raw data.
• As the file format is arbitrary, the data needs to be converted into something the program can process. The
InputFormat and RecordReader (RR) do this job.
• The InputFormat uses the InputSplit function to split the file into smaller pieces.
• The RecordReader transforms the raw data into key-value pairs for processing by the map. It outputs a list of
key-value pairs.
• Once the mapper processes these key-value pairs, the result goes to the OutputCollector. There is another
function called the Reporter, which notifies the user when the mapping task finishes.
• The Reduce function performs its task on each key-value pair from the mapper.
• Finally, the OutputFormat organizes the key-value pairs from the Reducer for writing to HDFS.
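To make the map and reduce stages concrete, here is the classic word-count example sketched with the
Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the driver (job setup) is omitted for brevity:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map stage: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    // Reduce stage: sum the counts for each word key.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}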
Hadoop
How Hadoop Works? – MapReduce
Key-Value

Storage devices store data as key-value pairs and act like hash tables. The table is a list of values where each
value is identified by a key. The value stored can be any aggregate, ranging from sensor data to videos.
Value look-up can only be performed via the keys as the database is oblivious to the details of the stored
aggregate. Partial updates are not possible. An update is either a delete or an insert operation.
Key-value storage devices generally do not maintain any indexes, therefore writes are quite fast. Based on a
simple storage model, key-value storage devices are highly scalable.
In Hadoop, the concept of key-value is used to describe the data format and processing method in the
MapReduce programming model, which serves as the foundation for distributed data processing.
Hadoop
How Hadoop Works? - YARN
YARN divides the tasks of resource management and job scheduling/monitoring into separate daemons. There is
one ResourceManager and a per-application ApplicationMaster. An application can be either a job (typically
referring to a MapReduce job) or a DAG (Directed Acyclic Graph) of jobs.
The ResourceManager has two components – the Scheduler and the ApplicationsManager.

The Scheduler is a pure scheduler. It only allocates resources to the various competing applications. It allocates
the resources based on an abstract notion of a container. A container is nothing but a fraction of resources like
CPU, memory, disk, network, etc.

The ApplicationsManager is responsible for:

- Accepting submission of jobs by clients
- Negotiating the first container for a specific ApplicationMaster
- Restarting the container after application failure

The responsibilities of the ApplicationMaster are:

- Negotiating containers from the Scheduler
- Tracking container status and monitoring progress
Hadoop
How Hadoop Works?

Hadoop is an open-source framework for distributed storage and processing of large-scale data. Its 3-layer
architecture includes the Hadoop Distributed File System (HDFS), MapReduce, and YARN.
⚫ HDFS:
HDFS is the storage layer of Hadoop. HDFS divides large files into multiple blocks (the default size is 128 MB)
and stores them across multiple nodes in the cluster. Each block is replicated to multiple nodes to enhance data
reliability and fault tolerance. HDFS employs a master/slave architecture, consisting of a master node (NameNode)
and multiple slave nodes (DataNodes). The NameNode manages the file system namespace and metadata
operations for clients, while DataNodes handle the actual storage and retrieval of data.
⚫ MapReduce:
MapReduce is the computational layer of Hadoop, used for processing data stored in HDFS. MapReduce divides
data into smaller chunks and processes them in parallel on multiple nodes in the cluster. It primarily consists of two
stages: the Map stage and the Reduce stage. In the Map stage, data is split into smaller data blocks and processed
by Mappers on multiple nodes, generating intermediate key/value pairs. Then, in the Reduce stage, intermediate
results are aggregated and combined by Reducers on multiple nodes for final processing and computation.
MapReduce provides a simple and scalable programming model, enabling users to easily write parallel processing
tasks.
Hadoop
How Hadoop Works?

⚫ YARN: YARN is the resource management layer of Hadoop, used for the allocation and scheduling of cluster
resources. YARN divides cluster resources into multiple containers, each with a certain amount of CPU and
memory resources. Applications can request containers from YARN and run their tasks within them. YARN is
managed by the ResourceManager and NodeManagers.
The ResourceManager handles global management and allocation of cluster resources, while the NodeManager
monitors the running status and resource usage of containers on each node. The introduction of YARN enables
Hadoop to run not only MapReduce tasks but also other types of computing frameworks such as Spark, Flink, etc.,
thereby enhancing the flexibility and diversity of Hadoop.

These three layers collectively form the core of Hadoop, enabling it to handle storage, computation, and resource
management of large-scale data.
Hadoop
NoSQL databases for big data storage
NoSQL?

A Not-only SQL (NoSQL) database is a


non-relational database that is highly
scalable, fault-tolerant and specifically
designed to house semi-structured and
unstructured data.
A NoSQL database often provides an
API-based query interface that can be
called from within an application.
NoSQL databases also support query
languages other than Structured Query
Language (SQL) because SQL was
designed to query structured data stored
within a relational database.
NoSQL databases for big data storage

Why NoSQL database can be used for big data?


The non-relational model has some features that are necessary in the management of big data, namely
scalability, availability, and performance.
Some techniques of NoSQL databases

Sharding
Sharding is the process of horizontally partitioning a large dataset into a collection of smaller, more
manageable datasets called shards. The shards are distributed across multiple nodes, where a node is a
server or a machine. Each shard is stored on a separate node, and each node is responsible for only the
data stored on it. Each shard shares the same schema, and all shards collectively represent the complete
dataset.
Some techniques of NoSQL databases
Advantages of sharding
➢ Scaling
Helps to facilitate horizontal scaling. (This is often contrasted with vertical scaling, which involves upgrading the
hardware of an existing server.)

➢ Improved Performance
Speeds up query response times, because queries go over fewer rows and their result sets are returned much more
quickly.

➢ Reliability
Helps to make an application more reliable by mitigating the impact of outages. With a sharded database, an outage is
likely to affect only a single shard.

➢ Easier to Manage
Maintenance tasks including regular backups, database optimization, and other common tasks can be performed
independently in parallel by using the sharding approach.

➢ Reduced Costs
Reduces costs by saving on license fees, software maintenance, and hardware investment when compared to
traditional solutions.
source: https://nadermedhatthoughts.medium.com/understanddatabase-sharding-the-good-and-ugly-868aa1cbc94c
Some techniques of NoSQL databases

How Sharding works?

Sharding by Key Range
One way of partitioning is to assign a continuous range of keys (from some minimum to some maximum)
to each partition.
Within each partition, we can keep keys in sorted order, which has the advantage that range scans are
easy. E.g., data from a network of sensors, where the key is the timestamp of the measurement.

Key-range sharding can lead to performance bottlenecks and wasted resources caused by skewed storage
and hot spots.
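A minimal sketch of key-range shard lookup using a sorted map; the range boundaries and shard names are
invented for illustration:

import java.util.NavigableMap;
import java.util.TreeMap;

// Key-range sharding: each shard owns a contiguous range of keys.
// floorEntry() finds the greatest range start <= the key, i.e. its shard.
public class RangeSharder {
    private final NavigableMap<String, String> rangeStartToShard = new TreeMap<>();

    public RangeSharder() {
        // Illustrative boundaries, e.g. timestamps of sensor readings.
        // (Keys below the first boundary would need an extra guard.)
        rangeStartToShard.put("2024-01", "shard-A");
        rangeStartToShard.put("2024-05", "shard-B");
        rangeStartToShard.put("2024-09", "shard-C");
    }

    public String shardFor(String key) {
        return rangeStartToShard.floorEntry(key).getValue();
    }

    public static void main(String[] args) {
        RangeSharder sharder = new RangeSharder();
        System.out.println(sharder.shardFor("2024-03-17T10:15")); // shard-A
        System.out.println(sharder.shardFor("2024-11-02T08:00")); // shard-C
    }
}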
Some techniques of NoSQL databases

How Sharding works?


Sharding by Hash of Key
Because of the risk of skew and hot spots,
many distributed datastores use a hash function
to determine the partition for a given key.
A good hash function takes skewed data and
makes it uniformly distributed.
Once you have a suitable hash function for
keys, you can assign each partition a range of
hashes, and every key whose hash falls within
a partition’s range will be stored in that
partition.
This technique is good at distributing keys
fairly among the partitions. The partition
boundaries can be evenly spaced, or they can
be chosen pseudorandomly.
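A minimal sketch of hash-based shard selection: hash the key, then map the hash onto one of N partitions.
The key names and shard count are illustrative, not from any particular datastore:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public final class HashSharder {
    private final int numShards;

    public HashSharder(int numShards) { this.numShards = numShards; }

    // Returns the shard index responsible for the given key. A good hash
    // spreads even skewed keys uniformly across the partitions.
    public int shardFor(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Use the first 4 bytes of the digest as an int, then map into [0, numShards).
            int h = ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
                  | ((digest[2] & 0xFF) << 8) | (digest[3] & 0xFF);
            return Math.floorMod(h, numShards);
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashSharder sharder = new HashSharder(4);
        for (String k : new String[] {"sensor-001", "sensor-002", "sensor-003"}) {
            System.out.println(k + " -> shard " + sharder.shardFor(k));
        }
    }
}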
Some techniques of NoSQL databases

Replication

Replication stores multiple copies of a dataset, known as replicas, on multiple nodes. Replication
provides scalability and availability due to the fact that the same data is replicated on various nodes.
Fault tolerance is also achieved, since data redundancy ensures that data is not lost when an individual
node fails. There are two different methods that are used to implement replication:
• master-slave
• peer-to-peer
Some techniques of NoSQL databases

Master-slave Replication
Nodes are arranged in a master-slave configuration, and all data is written to the master node. Once
saved, the data is replicated over to multiple slave nodes.
All external write requests, including insert, update and delete, occur on the master node, whereas read
requests can be fulfilled by any slave node.

Master-slave replication is ideal for read


intensive loads rather than write intensive loads.
Write performance will suffer as the amount of
writes increases. If the master node fails, reads
are still possible via any of the slave nodes.
Some techniques of NoSQL databases

Master-slave Replication

Issue: read inconsistency

Read inconsistency can occur if a slave node is read before an update to the master has been copied to it.
A voting system (in which the majority of the slaves must contain the same version of the record) can
ensure read consistency.

A scenario where read inconsistency occurs


1. User A updates data.
2. The data is copied over to Slave A by the
Master.
3. Before the data is copied over to Slave B,
User B tries to read the data from Slave B,
which results in an inconsistent read.
4. The data will eventually become consistent
when Slave B is updated by the Master.
Some techniques of NoSQL databases

Peer-to-Peer Replication
With peer-to-peer replication, all nodes operate at the same level. Each node, known as a peer, is equally capable of
handling reads and writes. Each write is copied to all peers.

Write inconsistency can occur as a result of a simultaneous update of the same data across multiple
peers. This can be addressed by implementing either a pessimistic or an optimistic concurrency strategy.
Some techniques of NoSQL databases

Peer-to-Peer Replication
Pessimistic concurrency: prevents inconsistency, uses locking
to ensure that only one update to a record can occur at a time.
Optimistic concurrency: allows inconsistency to occur,
consistency will be achieved after all updates have propagated.

A scenario where an inconsistent read
occurs:
1. User A updates data.
2. a. The data is copied over to Peer A. b.
The data is copied over to Peer B.
3. Before the data is copied over to Peer C,
User B tries to read the data from Peer
C, resulting in an inconsistent read.
4. The data will eventually be updated on
Peer C, and the database will once again
become consistent.
Some techniques of NoSQL databases

Sharding and Replication

To improve on the limited fault


tolerance offered by sharding,
while additionally benefiting
from the increased availability
and scalability of replication,
both sharding and replication
can be combined.
Some techniques of NoSQL databases

Combining Sharding and Master-Slave Replication

Multiple shards become slaves of a single master, and the master itself is a shard. Write consistency is
maintained by the master-shard.
Write operations are impacted if the master-shard becomes non-operational; multiple slave nodes provide
scalability and fault tolerance for read operations.
• Each node acts both as a master and a slave for different shards.
• Writes (id = 2) to Shard A are regulated by Node A.
• Node A replicates data (id = 2) to Node B, which is a slave for Shard A.
• Reads (id = 4) can be served directly by either Node B or Node C, as they each contain Shard B.
Some techniques of NoSQL databases

Combining Sharding and Peer-to-Peer Replication

Each shard is replicated to multiple peers, and each peer is only responsible for a subset of the overall
dataset. This helps achieve increased scalability and fault tolerance. As there is no master involved, there
is no single point of failure, and fault tolerance for both read and write operations is supported.

• Each node contains replicas of two different shards.
• Writes (id = 3) are replicated to both Node A and Node C (peers), as they are responsible for Shard C.
• Reads (id = 6) can be served by either Node B or Node C, as they each contain Shard B.
NoSQL databases

CAP Theorem

• In 2000, Eric Brewer, a professor of computer


science at the University of California Berkeley,
presented the CAP Theorem, also known as
Brewer’s theorem.
C: consistency refers to the requirement that all
copies of data should be the same across nodes.
A: availability requires that if a node fails,
other nodes still function.
P: partition tolerance requires that the system
continues to operate even if network partition
(DataNodes are distributed across physically
separate servers and communication between
these machines will sometimes fail.) occurs.
NoSQL databases

CAP Theorem

• The CAP Theorem expresses a triple constraint related to
distributed database systems. It states that for a distributed
database system running on a cluster, only two of the three
criteria above can be met at the same time.
• If availability (A) and partition tolerance (P) are required,
then consistency (C) is not possible because of the data
communication requirement between the nodes. So, the
database can remain available (A) but with inconsistent
results.
• If consistency (C) and partition tolerance (P) are required,
nodes cannot remain available (A) as the nodes will become
unavailable while achieving a state of consistency (C).
• If consistency (C) and availability (A) are required, available
nodes need to communicate to ensure consistency (C).
Therefore, partition tolerance (P) is not possible.
NoSQL databases

BASE

BASE is a database design principle based on the CAP theorem and leveraged by database systems that use
distributed technology. BASE stands for:
• basically available
• soft state
• eventual consistency
- The system ensures basic availability, meaning it is able to respond to user requests within a finite time, even if
the response is not the most recent or completely accurate; the response may come either in the form of the
requested data or a success/failure notification.
- Soft state means that a database may be in an inconsistent state when data is read but will eventually reach a
consistent state. This implies that the system allows for data inconsistency and can tolerate it within a certain
timeframe.
- Eventual consistency means the system eventually reaches a consistent state, even in the face of network
partitions or replication delays, where data in the system may temporarily be inconsistent, but will eventually
converge to a consistent state through some mechanism. While the database is in the process of attaining the
state of eventual consistency, it will be in a soft state.
NoSQL databases

BASE

Basically available example:

The database is basically available,


even though it has been partitioned as
a result of a network failure.
NoSQL databases

BASE

Soft state example:

• User A updates a record on Peer A.

• Before the other peers are updated,


User B requests the same record
from Peer C.

• The database is now in a soft state,


and stale data is returned to User B.
NoSQL databases

BASE

Eventual consistency example:

• User A updates a record.

• The record only gets updated at


Peer A, but before the other peers
can be updated, User B requests
the same record.

• The database is now in a soft state.


Stale data is returned to User B
from Peer C.

• However, the consistency is


eventually attained, and User C
gets the correct value.
ACID VS BASE
NoSQL databases
Characteristics

Below is a list of the principal characteristics of NoSQL storage devices that differentiate them from
traditional RDBMSs.
• Schema-less data model – Data can exist in its raw form.
• Scale out rather than scale up – More nodes can be added to obtain additional storage with a
NoSQL database, in contrast to having to replace the existing node with a better, higher
performance/capacity one.
• Highly available – This is built on cluster-based technologies that provide fault tolerance out of
the box.
• Lower operational costs – Many NoSQL databases are built on Open Source platforms with no
licensing costs. They can often be deployed on commodity hardware.
• Eventual consistency – Data read across multiple nodes may not be consistent immediately
after a write. However, all nodes will eventually be in a consistent state.
• BASE, not ACID – BASE compliance requires a database to maintain high availability in the event
of network/node failure, while not requiring the database to be in a consistent state whenever an
update occurs. The database can be in a soft/inconsistent state until it eventually attains
consistency. As a result, in consideration of the CAP theorem, NoSQL storage devices are
generally AP or CP.
NoSQL databases
Characteristics
• API driven data access – Data access is generally supported via API based queries, including RESTful
APIs, whereas some implementations may also provide SQL-like query capability.
• Auto sharding and replication– To support horizontal scaling and provide high availability, a NoSQL
storage device automatically employs sharding and replication techniques where the dataset is partitioned
horizontally and then copied to multiple nodes.
• Integrated caching – This removes the need for a third-party distributed caching layer, such as
Memcached.
• Distributed query support – NoSQL storage devices maintain consistent query behavior across multiple
shards.
• Polyglot persistence – The use of NoSQL storage does not mandate retiring traditional RDBMSs. In fact,
both can be used at the same time, thereby supporting polyglot persistence, which is an approach of
persisting data using different types of storage technologies within the same solution architecture. This is
good for developing systems requiring structured as well as semi/unstructured data.
• Aggregate-focused – Unlike relational databases that are most effective with fully normalized data, NoSQL
storage devices store de-normalized aggregated data (an entity containing merged, often nested, data for
an object) thereby eliminating the need for joins and extensive mapping between application objects and
the data stored in the database. One exception, however, is that graph database storage devices are not
aggregate-focused.
NoSQL databases

• Rationale
The emergence of NoSQL storage devices can primarily be attributed to the volume, velocity and variety
characteristics of Big Data datasets.
• Volume
The storage requirement of ever increasing data volumes commands the use of databases that are highly
scalable while keeping costs down for the business to remain competitive. NoSQL storage devices fulfill this
requirement by providing scale out capability while using inexpensive commodity servers.
• Velocity
The fast influx of data requires databases with fast access data write capability. NoSQL storage devices enable
fast writes by using schema-on-read rather than schema-on-write principle. Being highly available, NoSQL
storage devices can ensure that write latency does not occur because of node or network failure.
• Variety
A storage device needs to handle different data formats including documents, emails, images and videos and
incomplete data. NoSQL storage devices can store these different forms of semi-structured and unstructured
data formats. At the same time, NoSQL storage devices are able to store schema-less data and incomplete
data with the added ability of making schema changes as the data model of the datasets evolve. In other words,
NoSQL databases support schema evolution.
NoSQL databases

• Schema-on-write
The schema or the structure of the data is defined and enforced before the data is written into a database or
data storage system. This means that data must conform to a predefined schema before it can be successfully
written. Schema-on-write is commonly associated with traditional relational databases like MySQL and Oracle.
- Schema Definition: data types, constraints, and relationships, is defined before any data is written.
- Data Validation: When data is submitted for storage, it is validated against the predefined schema.
- Structured Storage: Data is stored in a structured manner.

• Schema-on-read
The structure or schema is not predefined before it is stored. Instead, data is stored in its raw or native format
without enforcing a specific schema at the time of ingestion. The schema is applied and interpreted at the time
of reading or querying the data. Schema-on-read is commonly associated with NoSQL databases like
MongoDB, Apache Hadoop.
- Data Ingestion: Data is ingested into the storage system without any predefined structure or schema.
- Schema Application: When querying or reading the data, the schema is applied dynamically based on how
the data is interpreted at that moment. This allows for flexibility in data interpretation.
- Flexibility: It offers flexibility in handling diverse data types and evolving data structures.
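A small sketch contrasting schema-on-read with schema-on-write, assuming records stored as raw delimited
strings; the record format and field names are invented for illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Schema-on-read: records are stored as raw strings with no enforced
// structure; a schema is applied only at query time, so different
// readers may interpret the same bytes differently.
public class SchemaOnReadExample {
    public static void main(String[] args) {
        // Ingested as-is, with no validation (schema-on-read ingestion).
        List<String> rawRecords = List.of(
                "alice,30,London",
                "bob,25",            // an incomplete record is still accepted
                "carol,41,Paris");

        // Schema applied dynamically while reading.
        for (String raw : rawRecords) {
            String[] fields = raw.split(",");
            Map<String, String> record = new HashMap<>();
            record.put("name", fields[0]);
            record.put("age", fields.length > 1 ? fields[1] : null);
            record.put("city", fields.length > 2 ? fields[2] : null);
            System.out.println(record);
        }
    }
}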
NoSQL databases
Type

NoSQL storage devices can mainly be divided into four types based on the way they store data:

➢ key-value

➢ document

➢ column-family

➢ graph
Key-value NoSQL databases

▪ Key-value storage devices store data as key-value pairs and act like hash tables. The table is a list of
values where each value is identified by a key. The value is opaque to the database and is typically
stored as a BLOB (Binary Large Object). The value stored can be any aggregate, ranging from sensor
data to videos.
▪ Value look-up can only be performed via the keys as the database is oblivious to the details of the stored
aggregate.
▪ Key-value storage devices generally do not maintain any indexes, therefore writes are quite fast. Based
on a simple storage model, key-value storage devices are highly scalable.
▪ Most key-value storage devices provide collections or buckets (like tables) into which key-value pairs
can be organized. A single collection can hold multiple data formats. Some implementations support
compressing values for reducing the storage footprint. However, this introduces latency at read time.
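A toy in-process sketch of these key-value semantics (opaque BLOB-like values, key-only look-ups, and
whole-aggregate updates); the class is purely illustrative and not any particular product's API:

import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Values are opaque byte arrays: the store cannot query inside them,
// and an "update" simply replaces the whole aggregate.
public class ToyKeyValueStore {
    private final Map<String, byte[]> table = new ConcurrentHashMap<>();

    public void put(String key, byte[] value) { table.put(key, value); }
    public byte[] get(String key)             { return table.get(key); }
    public void delete(String key)            { table.remove(key); }

    public static void main(String[] args) {
        ToyKeyValueStore store = new ToyKeyValueStore();
        store.put("sensor:42", "{\"temp\": 21.5}".getBytes(StandardCharsets.UTF_8));
        // Look-up is possible only via the key, never via the value's content.
        System.out.println(new String(store.get("sensor:42"), StandardCharsets.UTF_8));
    }
}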
Key-value NoSQL databases

A key-value storage device is appropriate when:

▪ unstructured data storage is required

▪ high performance read/write are required

▪ the value is fully identifiable via the key alone

▪ value is a standalone entity that is not dependent on other values

▪ values have a comparatively simple structure or are binary

▪ query patterns are simple, involving insert, select and delete operations only

▪ stored values are manipulated at the application layer


Key-value NoSQL databases

A key-value storage device is inappropriate when:

▪ applications require searching or filtering data using attributes of the stored value

▪ relationships exist between different key-value entries

▪ a group of keys’ values need to be updated in a single transaction

▪ multiple keys require manipulation in a single operation

▪ schema consistency across different values is required

▪ update to individual attributes of the value is required


Document NoSQL databases

Document storage devices also store data as key-value pairs. However, unlike key-value storage devices,
the stored value is a document that can be queried by the database. These documents can have a
complex nested structure, such as an invoice. The documents can be encoded using either a text-based
encoding scheme, such as XML or JSON, or a binary encoding scheme, such as BSON (Binary
JSON).
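To make "queried by the database" concrete, here is a sketch representing documents as nested maps and
selecting on a field inside the value; the document shape is invented for illustration:

import java.util.List;
import java.util.Map;

// Unlike a key-value store, the database "sees" inside the value:
// we can select documents by a field and read part of the aggregate.
public class ToyDocumentStore {
    public static void main(String[] args) {
        List<Map<String, Object>> invoices = List.of(
                Map.of("id", "inv-1", "customer", Map.of("name", "alice"), "total", 120.0),
                Map.of("id", "inv-2", "customer", Map.of("name", "bob"),   "total", 80.0));

        // A select operation referencing a field inside the aggregate value.
        invoices.stream()
                .filter(doc -> (double) doc.get("total") > 100.0)
                // Retrieving only part of the aggregate (the customer name).
                .map(doc -> ((Map<?, ?>) doc.get("customer")).get("name"))
                .forEach(System.out::println); // prints: alice
    }
}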
Document NoSQL databases

The main differences between document storage devices and key-value storage devices
are as follows:

▪ document storage devices are value-aware

▪ the stored value is self-describing; the schema can be inferred from the structure of the value or a reference to the
schema for the document is included in the value

▪ a select operation can reference a field inside the aggregate value

▪ a select operation can retrieve a part of the aggregate value

▪ partial updates are supported; therefore a subset of the aggregate can be updated

▪ indexes that speed up searches are generally supported


Document NoSQL databases

A document storage device is appropriate when:

▪ storing semi-structured document-oriented data comprising flat or nested schema

▪ schema evolution is a requirement as the structure of the document is either unknown or is likely to change

▪ applications require a partial update of the aggregate stored as a document

▪ searches need to be performed on different fields of the documents

▪ storing domain objects, such as customers, in serialized object form

▪ query patterns involve insert, select, update and delete operations


Document NoSQL databases

A document storage device is inappropriate when:

▪ multiple documents need to be updated as part of a single transaction

▪ performing operations that need joins between multiple documents or storing data that is normalized

▪ schema enforcement for achieving consistent query design is required as the document structure may change
between successive query runs, which will require restructuring the query

▪ the stored value is not self-describing and does not have a reference to a schema

▪ binary data needs to be stored


Column-Family NoSQL databases
Column-family storage devices store data much like a traditional RDBMS but group related columns
together in a row, resulting in column-families. Each column can itself be a collection of related columns,
referred to as a super-column. Each super-column can contain an arbitrary number of related columns that
are generally retrieved or updated as a single unit. Each row consists of multiple column-families and can
have a different set of columns, thereby manifesting flexible schema support. Each row is identified by a
row key.

Column-family storage devices provide fast data access with random read/write capability. They store
different column-families in separate physical files, which improves query responsiveness as only the
required column-families are searched.
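A bare-bones sketch of the column-family data model (row key → column-family → column → value);
the family and column names are illustrative:

import java.util.HashMap;
import java.util.Map;

// Rows may carry different columns, and storage is only used for
// columns that actually exist in a given row.
public class ToyColumnFamilyStore {
    private final Map<String, Map<String, Map<String, String>>> rows = new HashMap<>();

    public void put(String rowKey, String family, String column, String value) {
        rows.computeIfAbsent(rowKey, r -> new HashMap<>())
            .computeIfAbsent(family, f -> new HashMap<>())
            .put(column, value);
    }

    // Reads one whole column-family for a row, since related columns
    // are generally retrieved together as a single unit.
    public Map<String, String> getFamily(String rowKey, String family) {
        return rows.getOrDefault(rowKey, Map.of()).getOrDefault(family, Map.of());
    }

    public static void main(String[] args) {
        ToyColumnFamilyStore store = new ToyColumnFamilyStore();
        store.put("user:1", "profile", "name", "alice");
        store.put("user:1", "profile", "email", "alice@example.com");
        store.put("user:1", "activity", "lastLogin", "2024-05-01");
        // Only the "profile" family is touched; "activity" is not read.
        System.out.println(store.getFamily("user:1", "profile"));
    }
}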
Column-Family NoSQL databases

A column-family storage device is appropriate when:

▪ real time random read/write capability is needed and data being stored has some defined structure

▪ data represents a tabular structure, each row consists of a large number of columns and nested groups of
interrelated data exist

▪ support for schema evolution is required as column families can be added or removed without any system
downtime

▪ certain fields are mostly accessed together, and searches need to be performed using field values

▪ efficient use of storage is required when the data consists of sparsely populated rows since column-family
databases only allocate storage space if a column exists for a row. If no column is present, no space is allocated.

▪ query patterns involve insert, select, update and delete operations


Column-Family NoSQL databases

A column-family storage device is inappropriate when:

▪ relational data access is required; for example, joins

▪ ACID transactional support is required

▪ binary data needs to be stored

▪ SQL-compliant queries need to be executed

▪ query patterns are likely to change frequently because that could initiate a corresponding restructuring of
how column-families are arranged
Graph NoSQL databases
Graph storage devices are used to persist interconnected entities. Unlike other NoSQL storage devices,
where the emphasis is on the structure of the entities, graph storage devices place emphasis on storing the
linkages between entities. Entities are stored as nodes (not to be confused with cluster nodes) and are also
called vertices, while the linkages between entities are stored as edges. In RDBMS parlance, each node
can be thought of as a single row, while an edge denotes a join.

Nodes can have more than one type of link between them through multiple edges. Each node can have
attribute data as key-value pairs, such as a customer node with ID, name and age attributes.

Each edge can have its own attribute data as key-value pairs, which can be used to further filter query
results. Queries generally involve finding interconnected nodes based on node attributes and/or edge
attributes.
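A bare-bones sketch of the node/edge model with attribute data on both nodes and edges; this in-memory
representation is a simplification for illustration, not how a real graph database stores data:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ToyGraph {
    // Nodes (vertices) and edges both carry attributes as key-value pairs.
    record Node(String id, Map<String, Object> attributes) {}
    record Edge(Node from, Node to, String label, Map<String, Object> attributes) {}

    public static void main(String[] args) {
        Node alice = new Node("c1", Map.of("name", "alice", "age", 30));
        Node bob   = new Node("c2", Map.of("name", "bob",   "age", 25));
        List<Edge> edges = new ArrayList<>();
        // The edge itself carries attributes that can filter traversals.
        edges.add(new Edge(alice, bob, "FRIEND_OF", Map.of("since", 2019)));

        // A query: who is alice connected to via FRIEND_OF edges?
        edges.stream()
             .filter(e -> e.from() == alice && e.label().equals("FRIEND_OF"))
             .forEach(e -> System.out.println(e.to().attributes().get("name"))); // bob
    }
}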
Graph NoSQL databases

A graph storage device is appropriate when:

▪ interconnected entities need to be stored

▪ querying entities based on the type of relationship with each other rather than the attributes of the entities

▪ finding groups of interconnected entities

▪ finding distances between entities in terms of the node traversal distance

▪ mining data with a view toward finding patterns


Graph NoSQL databases

A graph storage device is inappropriate when:

▪ updates are required to a large number of node attributes or edge attributes, as this involves searching for nodes
or edges, which is a costly operation compared to performing node traversals

▪ entities have a large number of attributes or nested data – it is best to store lightweight entities in a graph
storage device while storing the rest of the attribute data in a separate non-graph NoSQL storage device

▪ binary storage is required

▪ queries based on the selection of node/edge attributes dominate node traversal queries

Examples include Neo4J, Infinite Graph and OrientDB


NewSQL databases

Shortcomings of NoSQL

▪ NoSQL storage devices do not provide the same transaction and consistency support as exhibited by ACID
compliant RDBMSs.

▪ Following the BASE model, NoSQL storage devices provide eventual consistency rather than immediate
consistency. They therefore will be in a soft state while reaching the state of eventual consistency.

▪ NoSQL storage devices are not appropriate for use when implementing large scale transactional systems.
NewSQL databases

What is a NewSQL database?

▪ NewSQL storage devices combine the ACID properties of RDBMS with the scalability and fault tolerance
offered by NoSQL storage devices.

▪ NewSQL databases generally support SQL compliant syntax for data definition and data manipulation
operations, and they often use a logical relational data model for data storage.

▪ NewSQL databases can be used for developing OLTP (online transaction processing) systems with very high
volumes of transactions. They can be used for real time analytics.

▪ As compared to a NoSQL storage device, a NewSQL storage device provides an easier transition from a
traditional RDBMS to a highly scalable database due to its support for SQL.

Examples of NewSQL databases include VoltDB, NuoDB and InnoDB


Difference Between SQL (RDBMS), NoSQL, and NewSQL

(Table: comparison of SQL (RDBMS), NoSQL, and NewSQL, e.g., support for horizontal scaling)
On-Disk Storage Devices VS In-Memory Storage Devices

On-disk storage devices store the data on hard drives, whereas in-memory storage devices store the data in RAM.
In-Memory Storage Devices

▪ An in-memory storage device generally utilizes RAM,


the main memory of a computer.
▪ The growing capacity and decreasing cost of RAM
has made it possible to develop in-memory data
storage solutions.
▪ Storage of data in memory eliminates the latency of
disk I/O and the data transfer time between the main
memory and the hard drive. This overall reduction in
data read/write latency makes data processing much
faster.
▪ In-memory storage device capacity can be increased
massively by horizontally scaling (scaling out) the
cluster that is hosting the in-memory storage device.
▪ When compared with an on-disk storage device, an
in-memory storage device is more expensive because
of the higher cost of memory as compared to a disk.
In-Memory Storage Devices

▪ Cluster-based memory enables storage of large


amounts of data, including Big Data datasets, which
can be accessed considerably faster when compared
with an on-disk storage device. This enables real time
Big data analytics.
▪ In-memory analytics enable operational analytics and
operational BI (business intelligence) through fast
execution of queries and algorithms.
▪ In-memory storage enables making sense of the fast
influx of data in a Big Data environment. This
supports making quick business decisions for
mitigating a threat or taking advantage of an
opportunity.
▪ An in-memory storage device can support schema-
less or schema-aware storage. Schema-less storage
support is provided through key-value based data
persistence.
In-Memory Storage Devices

❖ Operational Business Intelligence (Operational BI) is


a subset of business intelligence focused on
monitoring, managing, and optimizing the daily
operations of an organization. It emphasizes real-time
data processing and analysis to support immediate
decision-making and operational efficiency.

❖ Operational Analytics (OA) is a branch of analytics


focused on improving and optimizing the everyday
operations of an organization. It involves the use of
data analysis and real-time data processing to
enhance operational efficiency, reduce costs, and
improve service quality.
In-Memory Storage Devices

An in-memory storage device is appropriate when:

▪ data arrives at a fast pace and requires real time analytics or event stream processing

▪ continuous or always-on analytics is required, such as operational BI and operational analytics

▪ interactive query processing and real time data visualization needs to be performed, including what-if analysis
and drill-down operations

▪ the same dataset is required by multiple data processing jobs

▪ performing exploratory data analysis, as the same dataset does not need to be reloaded from disk if the
algorithm changes

▪ developing low latency Big Data solutions with ACID transaction support
In-Memory Storage Devices

An in-memory storage device is inappropriate when:

▪ data processing consists of batch processing

▪ very large amounts of data need to be persisted in memory for a long time in order to perform in-depth data
analysis

▪ performing strategic BI or strategic analytics that involves access to very large amounts of data and involves
batch data processing

▪ datasets are extremely large and do not fit into the available memory

▪ making the transition from traditional data analysis toward Big Data analysis, as incorporating an in-memory
storage device may require additional skills and involves a complex setup

▪ an enterprise has a limited budget, as setting up an in-memory storage device may require upgrading nodes,
which could either be done by node replacement or by adding more RAM
In-Memory Storage Devices

In-memory storage devices can be implemented as:

▪ In-Memory Data Grid (IMDG)

▪ In-Memory Database (IMDB)

Although both of these technologies use memory as their


underlying data storage medium, what makes them distinct is
the way data is stored in the memory.
In-Memory Data Grids
IMDGs store data in memory as key-value pairs across multiple nodes where the keys and values can be
any business object or application data in serialized form. This supports schema-less data storage through
storage of semi/unstructured data. Data access is typically provided via APIs.
In-Memory Data Grids

1. An image (a), XML data (b), and a customer object are first serialized using a serialization engine.
2. They are then stored as key-value pairs in an IMDG.
3. A client requests the customer object via its key.
4. The value is then returned by the IMDG in serialized form.
5. The client then uses a serialization engine to deserialize the value to obtain the customer object...
6. ...in order to manipulate the customer object. A minimal sketch of this cycle follows.
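The serialize/store/retrieve/deserialize cycle above can be sketched in a few lines of Python; pickle stands in for the serialization engine and a dict for the grid itself, so this illustrates only the flow, not any particular IMDG's API:

# Sketch of the serialize -> store -> retrieve -> deserialize cycle.
import pickle
from dataclasses import dataclass

@dataclass
class Customer:
    id: int
    name: str

imdg = {}  # stand-in for the distributed key-value store

# Steps 1-2: serialize the object and store it as a key-value pair.
imdg["customer:2"] = pickle.dumps(Customer(2, "Bob"))

# Steps 3-4: a client requests the value by key; it comes back serialized.
raw = imdg["customer:2"]

# Steps 5-6: the client deserializes the value and manipulates the object.
customer = pickle.loads(raw)
customer.name = "Robert"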
In-Memory Data Grids

▪ Nodes in IMDGs keep themselves synchronized and collectively provide high availability, fault tolerance, and consistency.

▪ In comparison to NoSQL's eventual consistency approach, IMDGs support immediate consistency.

▪ Compared with relational IMDBs, IMDGs provide faster data access because they store non-relational data as objects.

▪ IMDGs scale horizontally by implementing data partitioning and data replication, and further support reliability by replicating data to at least one extra node.
In-Memory Data Grids

▪ IMDGs are heavily used for real-time analytics because they support Complex Event Processing (CEP) via the publish-subscribe messaging model. This is achieved through a feature called continuous querying, also known as active querying, where a filter for events of interest is registered with the IMDG. The IMDG then continuously evaluates the filter, and whenever the filter is satisfied as a result of insert/update/delete operations, subscribing clients are informed. A minimal sketch follows.
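A minimal Python sketch of continuous querying, assuming a hypothetical MiniGrid class whose names are invented for illustration (real IMDGs expose this through their own APIs):

# Sketch of continuous (active) querying: a client registers a filter,
# and the grid notifies subscribers whenever a write satisfies it.
# The MiniGrid class and its method names are hypothetical.

class MiniGrid:
    def __init__(self):
        self.data = {}
        self.subscriptions = []  # (predicate, callback) pairs

    def register_continuous_query(self, predicate, callback):
        self.subscriptions.append((predicate, callback))

    def put(self, key, value):
        self.data[key] = value
        # Re-evaluate registered filters on every insert/update.
        for predicate, callback in self.subscriptions:
            if predicate(key, value):
                callback(key, value)

grid = MiniGrid()
grid.register_continuous_query(
    lambda k, v: v.get("temp_f", 0) > 75,
    lambda k, v: print(f"alert: {k} -> {v}"),
)
grid.put("sensor:7", {"temp_f": 80.2})  # triggers the callback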
Applications of In-Memory Data Grids

▪ Real-time processing engines can make use of an IMDG to store high-velocity data.
▪ IMDGs may also support in-memory MapReduce, which helps to reduce the latency of disk-based MapReduce processing.
▪ An IMDG can also be deployed within a cloud-based environment, where it provides a flexible storage medium that can scale out or scale in automatically as storage demand increases or decreases.
▪ IMDGs can be added to existing Big Data solutions by introducing them between the existing on-disk storage device and the data processing application.
The Use of In-Memory Data Grids in a Big Data Environment

In a Big Data solution environment, IMDGs are often deployed together with on-disk storage devices that act as the backend storage. This is achieved via the following approaches, which can be combined as necessary to meet read/write performance, consistency, and simplicity requirements:

❖ read-through

❖ write-through

❖ write-behind

❖ refresh-ahead
The Use of In-Memory Data Grids in a Big Data Environment

Read-through
If a requested value for a key is not found in the IMDG, it is synchronously read from the backend on-disk storage device, such as a database. Upon a successful read from the backend, the key-value pair is inserted into the IMDG, and the requested value is returned to the client. Any subsequent requests for the same key are then served by the IMDG directly, instead of by the backend storage. A minimal sketch follows.
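A minimal Python sketch of read-through behavior, assuming a hypothetical ReadThroughCache class with a dict standing in for the backend database:

# Sketch of the read-through approach. The `backend` dict stands in
# for the on-disk store; a real IMDG would load it via a configured
# loader rather than this hypothetical class.

class ReadThroughCache:
    def __init__(self, backend):
        self.cache = {}
        self.backend = backend

    def get(self, key):
        if key not in self.cache:
            # Cache miss: synchronously read from the backend store
            # and insert the key-value pair into the cache.
            self.cache[key] = self.backend[key]
        # Subsequent requests for the same key are served from memory.
        return self.cache[key]

db = {"customer:2": {"name": "Bob"}}
cache = ReadThroughCache(db)
print(cache.get("customer:2"))  # first call reads the backend
print(cache.get("customer:2"))  # second call is served by the cache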
The Use of In-Memory Data Grids in a Big Data Environment

Write-through

Any write (insert/update/delete) to the IMDG is written synchronously, in a transactional manner, to the backend on-disk storage device, such as a database. If the write to the backend fails, the IMDG's update is rolled back. Due to this transactional nature, data consistency is achieved immediately between the two data stores. A minimal sketch follows.
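A minimal Python sketch of write-through behavior; the classes below are hypothetical, and a production IMDG would wrap both writes in a real transaction:

# Sketch of the write-through approach: each write goes to the cache
# and the backend together, and a backend failure rolls the in-memory
# update back so the two stores stay immediately consistent.

class DiskBackend:
    def __init__(self):
        self.rows = {}
    def write(self, key, value):
        self.rows[key] = value  # stand-in for a database write

class WriteThroughCache:
    def __init__(self, backend):
        self.cache = {}
        self.backend = backend

    def put(self, key, value):
        previous = self.cache.get(key)
        self.cache[key] = value
        try:
            # Synchronous write to the backend on-disk store.
            self.backend.write(key, value)
        except Exception:
            # Backend write failed: roll back the in-memory update.
            if previous is None:
                del self.cache[key]
            else:
                self.cache[key] = previous
            raise

cache = WriteThroughCache(DiskBackend())
cache.put("customer:2", {"name": "Bob"})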
The Use of In-Memory Data Grids in a Big Data Environment

Write-behind

▪ Any write to the IMDG is written asynchronously, in a batch manner, to the backend on-disk storage device, such as a database.
▪ A queue is generally placed between the IMDG and the backend storage to keep track of the required changes to the backend. This queue can be configured to write data to the backend storage at different intervals.
▪ The asynchronous nature improves write performance, read performance, and scalability and availability in general. However, it introduces inconsistency until the backend storage is updated at the specified interval. A minimal sketch follows this list.
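A minimal Python sketch of write-behind behavior; the names are hypothetical, and a real IMDG would flush the queue from a background thread at the configured interval:

# Sketch of the write-behind approach: writes land in the cache
# immediately and are queued, then flushed to the backend in batches.
from collections import deque

class WriteBehindCache:
    def __init__(self, backend, batch_size=100):
        self.cache = {}
        self.queue = deque()      # pending changes for the backend
        self.backend = backend
        self.batch_size = batch_size

    def put(self, key, value):
        self.cache[key] = value   # fast, memory-only write
        self.queue.append((key, value))

    def flush(self):
        # Called periodically; until then the backend lags the cache,
        # which is the source of the temporary inconsistency.
        while self.queue:
            batch = [self.queue.popleft()
                     for _ in range(min(self.batch_size, len(self.queue)))]
            for key, value in batch:
                self.backend[key] = value

db = {}
cache = WriteBehindCache(db)
cache.put("customer:2", {"name": "Bob"})  # db is still empty here
cache.flush()                             # now db is updated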
The Use of In-Memory Data Grids in a Big Data Environment

Refresh-ahead

▪ Refresh-ahead is a proactive approach where frequently accessed values are automatically and asynchronously refreshed in the IMDG, provided that the value is accessed before its expiry time as configured in the IMDG.
▪ If a value is accessed after its expiry time, the value, as in the read-through approach, is synchronously read from the backend storage and updated in the IMDG before being returned to the client.
▪ Compared with the read-through approach, where a value is served from the IMDG until its expiry, data inconsistency between the IMDG and the backend storage is minimized, as values are refreshed before they expire. A minimal sketch follows this list.
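A minimal Python sketch of refresh-ahead behavior; names are hypothetical, and the refresh is done inline here whereas a real IMDG performs it asynchronously:

# Sketch of the refresh-ahead approach: each entry carries an expiry;
# if an entry is read before it expires, it is proactively refreshed.
import time

class RefreshAheadCache:
    def __init__(self, backend, ttl_seconds=60):
        self.backend = backend
        self.ttl = ttl_seconds
        self.cache = {}  # key -> (value, expiry_timestamp)

    def get(self, key):
        now = time.time()
        entry = self.cache.get(key)
        if entry is None or now >= entry[1]:
            # Expired or missing: synchronous read-through.
            value = self.backend[key]
            self.cache[key] = (value, now + self.ttl)
            return value
        value, _expiry = entry
        # Accessed before expiry: serve from memory and proactively
        # refresh (a real IMDG would do this asynchronously).
        self.cache[key] = (self.backend[key], now + self.ttl)
        return value

db = {"customer:2": {"name": "Bob"}}
cache = RefreshAheadCache(db, ttl_seconds=60)
print(cache.get("customer:2"))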
In-Memory Data Grids

An IMDG storage device is appropriate when:

▪ data needs to be readily accessible in object form with minimal latency

▪ data being stored is non-relational in nature such as semi-structured and unstructured data

▪ adding real-time support to an existing Big Data solution currently using on-disk storage

▪ the existing storage device cannot be replaced but the data access layer can be modified

▪ scalability is more important than relational storage

Examples of IMDG storage devices include Hazelcast, Infinispan, Pivotal GemFire, and GigaSpaces XAP.
In-Memory Databases
IMDBs are in-memory storage devices that employ database technology and leverage the performance of RAM to overcome the runtime latency issues that plague on-disk storage devices.

1. A relational dataset is stored in an IMDB.
2. A client requests a customer record (id = 2) via SQL.
3. The relevant customer record is returned by the IMDB and can be manipulated directly by the client, without the need for any deserialization. A minimal sketch follows.
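This interaction can be sketched with Python's standard-library sqlite3 module running entirely in memory; sqlite3 is only a convenient stand-in here, not one of the product-grade IMDBs discussed in this section:

# Sketch of the IMDB interaction using sqlite3 running fully in RAM.
import sqlite3

conn = sqlite3.connect(":memory:")  # RAM-resident database

# 1. Store a relational dataset in the in-memory database.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])

# 2. A client requests a customer record via SQL.
row = conn.execute("SELECT id, name FROM customers WHERE id = ?",
                   (2,)).fetchone()

# 3. The record comes back as a ready-to-use tuple;
#    no deserialization step is needed.
print(row)  # (2, 'Bob')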
In-Memory Databases
▪ An IMDB can be relational in nature (relational IMDB) for the storage of structured data, or may leverage NoSQL technology (non-relational IMDB) for the storage of semi-structured and unstructured data.

▪ Unlike IMDGs, which generally provide data access via APIs, relational IMDBs make use of the more familiar SQL language. NoSQL-based IMDBs generally provide API-based access, which may be as simple as put, get, and delete operations.

▪ Depending on the underlying implementation, some IMDBs scale out, while others scale up, to achieve scalability.

▪ Not all IMDB implementations directly support durability; those that do not instead leverage various strategies for providing durability in the face of machine failures or memory corruption.

▪ Like an IMDG, an IMDB may also support the continuous query feature.

▪ IMDBs are heavily used in real-time analytics and can further be used for developing low-latency applications requiring full ACID transaction support (relational IMDBs).

▪ In comparison with IMDGs, IMDBs provide an easier-to-set-up in-memory data storage option, as IMDBs do not generally require backend on-disk storage devices.
In-Memory Databases
In Figure 7.27, an IMDB stores temperature values for various sensors. The following steps are shown:

1. A client issues a continuous query (select * from sensors where temperature > 75).

2. It is registered in the IMDB.

3. When the temperature for any sensor exceeds 75°F...

4. ...an update event is sent to the subscribing client that contains various details about the event. A minimal sketch follows.
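A minimal sketch of the same behavior, again using in-memory sqlite3 as a stand-in; the SensorStore wrapper and its continuous-query mechanism are hypothetical, since sqlite3 itself has no continuous query feature:

# Sketch of the continuous query from Figure 7.27: the registered
# query is re-evaluated after each insert and matching rows are
# pushed to the subscribing client. (A real IMDB would push only
# new matches rather than re-running the full query.)
import sqlite3

class SensorStore:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE sensors (id INTEGER, temperature REAL)")
        self.queries = []  # (sql, callback) pairs

    def register_continuous_query(self, sql, callback):
        self.queries.append((sql, callback))  # steps 1-2: register

    def insert(self, sensor_id, temperature):
        self.conn.execute("INSERT INTO sensors VALUES (?, ?)",
                          (sensor_id, temperature))
        # Steps 3-4: re-evaluate registered queries; notify on matches.
        for sql, callback in self.queries:
            for row in self.conn.execute(sql):
                callback(row)

store = SensorStore()
store.register_continuous_query(
    "SELECT * FROM sensors WHERE temperature > 75",
    lambda row: print("update event:", row),
)
store.insert(1, 80.5)  # exceeds 75 -> subscriber is notified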
The Use of In-Memory Databases in a Big Data Environment

▪ Introducing IMDBs into an existing Big Data solution generally requires replacing existing on-disk storage devices, including any RDBMSs in use. When replacing an RDBMS with a relational IMDB, little or no application code change is required due to the SQL support provided by the relational IMDB. However, when replacing an RDBMS with a NoSQL IMDB, code changes may be required in order to work against the IMDB's NoSQL APIs.

▪ When replacing an on-disk NoSQL database with a relational IMDB, code changes will often be required to establish SQL-based access. However, when replacing an on-disk NoSQL database with a NoSQL IMDB, code changes may still be required due to the implementation of new APIs.

▪ Relational IMDBs are generally less scalable than IMDGs, as relational IMDBs need to support distributed queries and transactions across the cluster. Some IMDB implementations may benefit from scaling up, which helps to address the latency that occurs when executing queries and transactions in a scale-out environment.

Examples include Aerospike, MemSQL, Altibase HDB, eXtremeDB, and Pivotal GemFire XD.
In-Memory Databases

An IMDB storage device is appropriate when:

▪ relational data needs to be stored in memory with ACID support

▪ adding real-time support to an existing Big Data solution currently using on-disk storage

▪ the existing on-disk storage device can be replaced with an in-memory equivalent
technology

▪ it is required to minimize changes to the data access layer of the application code, such as
when the application consists of an SQL-based data access layer

▪ relational storage is more important than scalability
