Big Data Lecture Notes – Storing Big Data
Big Data
A Very Short Introduction
Big Data Fundamentals
Concepts, Drivers & Techniques
Storage Capacity
[Figure: storage technology evolution, from valves (vacuum tubes) to transistors]
Moore’s Law
In 1965, Gordon Moore, who became the co-founder of Intel, famously
predicted that over the next ten years the number of transistors
incorporated in a chip would approximately double every twenty-four
months.
A microprocessor is the integrated circuit responsible for performing the instructions provided by a computer program. It usually consists of billions of transistors, each a few nanometers in length, embedded in a tiny space on a silicon microchip.
A nanometer is 10⁻⁹ meter, or one-millionth of a millimeter. A human hair is about 75,000 nanometers in diameter, and the diameter of an atom is between 0.1 and 0.5 nanometers.
Storing Structured Data
Structured Data
Structured data, of the kind written by hand and kept in notebooks or in filing cabinets, is now
stored electronically on spreadsheets or databases, and consists of spreadsheet-style tables
with rows and columns, each row being a record and each column a well-defined field (e.g.
name, address, and age). Carefully structured and tabulated data is relatively easy to manage
and is amenable to statistical analysis; indeed, until recently statistical analysis methods could be applied only to structured data.
RDBMS
RDBMS
In order to manage structured data, a relational database management system (RDBMS) is used to create, maintain, access, and manipulate the data.
What is RDBMS?
▪ RDBMS is a type of DBMS that organizes data in a tabular format with relationships between tables, while
DBMS can use different data organization models and may not enforce strict relationships between data
entities.
▪ In an RDBMS, data is stored in tables, where each table represents a specific entity or concept. The tables
consist of rows (also known as tuples or records in RDBMS) that represent individual instances of the entity,
and columns that define the attributes or properties of the entity.
▪ The relational model facilitates the establishment of relationships between tables through keys. A primary
key uniquely identifies each row in a table, while foreign keys establish relationships between tables by
referencing the primary key of another table.
RDBMS
▪ The first step is to design the database schema (i.e. the structure of the database): define the data fields and arrange them in tables.
▪ Business Applications: store, manage, and process large volumes of structured data; serve as the backbone
for ERP (enterprise resource planning), CRM (customer relationship management), HRM (human resource
management), and other business systems.
▪ E-commerce: handle product catalogs, inventory management, customer data, orders, and transactions
▪ Financial Systems: manage banking transactions, accounting records, financial reporting, and risk analysis
▪ Healthcare Information Systems: store and manage patient data, medical records, laboratory results, and
healthcare information in EHR (electronic health record) systems
▪ Online Ticketing and Reservation Systems: handle ticketing and reservations for airlines, railways, hotels, and event management
▪ Education Systems: manage student information, course registration, grades, and academic records
▪ Supply Chain Management: manage inventory, track shipments, and monitor logistics data
▪ Social Media Content Management: handle user profiles, content storage, and retrieval, and support efficient searching and indexing of vast amounts of data
RDBMS
Disadvantages of RDBMS
▪ Once the relational database schema has been constructed, it is difficult to change it.
▪ Unstructured data cannot be organized conveniently into rows and columns.
▪ Big data is often high-velocity, generated in real time, and subject to real-time processing requirements
• External datasets are acquired, or internal data is used, in a Big Data environment
• Data is manipulated to make it amenable to data analysis
• Data is processed via ETL (Extract-Transform-Load) activity, or output is generated as the result of an analytical operation
Big Data Storage
Cluster
File System
Hadoop
▪ Influenced by the ideas published in October 2003 by Google in a research paper launching the Google File System, Doug Cutting, who was then working at Yahoo, and his colleague Mike Cafarella went to work on developing the Hadoop DFS.
▪ Hadoop, one of the most popular DFSs, is part of a bigger, open-source software project called the Hadoop Ecosystem. It is named after a yellow soft toy elephant owned by Cutting’s son.
▪ Hadoop is written in the popular programming language Java. It enables the storage of both semi-structured and unstructured data, and provides a platform for data analysis.
▪ If you have used Facebook, Twitter, or eBay, for example, Hadoop will have been working in the background while you do so.
▪ Hadoop is a set of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data. It provides a software framework for distributed storage and distributed computing: it divides a file into a number of blocks and stores them across a cluster of machines.
▪ Hadoop achieves fault tolerance by replicating the blocks on the cluster. It performs distributed processing by dividing a job into a number of independent tasks that run in parallel over the computer cluster.
Hadoop
Hadoop Components and Domains
https://data-flair.training/blogs/how-hadoop-works-internally/
Hadoop
Daemons
Daemons are processes that run in the background. The Hadoop daemons are:
The general idea of the MapReduce algorithm is to process the data in parallel on your distributed cluster and subsequently combine it into the desired result or output. Being the heart of the Hadoop system, MapReduce processes the data in a highly resilient, fault-tolerant manner.
• The program locates and reads the <<input file>> containing the raw data.
• As the file format is arbitrary, there is a need to convert data into something the program can process. The
<<InputFormat>> and <<RecordReader>> (RR) does this job.
• InputFormat uses the InputSplit function to split the file into smaller pieces.
• The RecordReader transforms the raw data into a form suitable for processing by the map. It outputs a list of key-value pairs.
• Once the mapper processes these key-value pairs, the result goes to <<OutputCollector>>. There is another function called <<Reporter>> which notifies the user when the mapping task finishes.
• The Reduce function performs its task on each key-value pair from the mapper.
• OutputFormat organizes the key-value pairs from the Reducer for writing them to HDFS.
Hadoop
How Hadoop Works? – MapReduce
Key-Value
Storage devices store data as key-value pairs and act like hash tables. The table is a list of values where each
value is identified by a key. The value stored can be any aggregate, ranging from sensor data to videos.
Value look-up can only be performed via the keys as the database is oblivious to the details of the stored
aggregate. Partial updates are not possible. An update is either a delete or an insert operation.
Key-value storage devices generally do not maintain any indexes, therefore writes are quite fast. Based on a
simple storage model, key-value storage devices are highly scalable.
In Hadoop, the concept of key-value is used to describe the data format and processing method in the
MapReduce programming model, which serves as the foundation for distributed data processing.
Hadoop
How Hadoop Works? -YARN
YARN divides resource management and job scheduling/monitoring into separate daemons. There is one ResourceManager and a per-application ApplicationMaster. An application can be either a job (typically referring to a MapReduce job) or a DAG (Directed Acyclic Graph) of jobs.
The ResourceManager has two components – the Scheduler and the ApplicationsManager.
The Scheduler is a pure scheduler: it only allocates resources to the various competing applications. It allocates the resources based on an abstract notion of a container. A container is simply a fraction of resources such as CPU, memory, disk, and network.
Hadoop is an open-source framework for distributed storage and processing of large-scale data. Its 3-layer
architecture includes Hadoop Distributed File System (HDFS), MapReduce, and YARN.
⚫ HDFS:
HDFS is the storage layer of Hadoop. It divides large files into multiple blocks (default size 128 MB) and
stores them across multiple nodes in the cluster. Each block is replicated to multiple nodes to enhance data
reliability and fault tolerance. HDFS employs a master/slave architecture, consisting of a master node (NameNode)
and multiple slave nodes (DataNodes). The NameNode manages the file system namespace and metadata operations for clients, while DataNodes handle the actual storage and retrieval of data.
⚫ MapReduce:
MapReduce is the computational layer of Hadoop, used for processing data stored in HDFS. MapReduce divides
data into smaller chunks and processes them in parallel on multiple nodes in the cluster. It consists primarily of two stages:
the Map stage and Reduce stage. In the Map stage, data is split into smaller data blocks and processed by Mappers
on multiple nodes, generating intermediate key/value pairs. Then, in the Reduce stage, intermediate results are
aggregated and combined by Reducers on multiple nodes for final processing and computation. MapReduce
provides a simple and scalable programming model, enabling users to easily write parallel processing tasks.
Hadoop
How Hadoop Works?
⚫ YARN: YARN is the resource management layer of Hadoop, used for the allocation and scheduling of cluster resources.
YARN divides cluster resources into multiple containers, each with a certain amount of CPU and memory
resources. Applications can request containers from YARN and run their tasks within them. YARN is managed by
ResourceManager and NodeManager.
ResourceManager handles global management and allocation of cluster resources, while NodeManager monitors
the running status and resource usage of containers on each node. The introduction of YARN enables Hadoop to run not only MapReduce tasks but also other types of computing frameworks such as Spark and Flink, thereby enhancing the flexibility and diversity of Hadoop.
These three layers collectively form the core of Hadoop, enabling it to handle storage, computation, and resource
management of large-scale data.
Hadoop
NoSQL databases for big data storage
NoSQL?
sharding
Sharding is the process of horizontally
partitioning a large dataset into a
collection of smaller, more manageable
datasets called shards. The shards are
distributed across multiple nodes,
where a node is a server or a machine.
Each shard is stored on a separate node
and each node is responsible for only
the data stored on it. Each shard shares
the same schema, and all shards
collectively represent the complete
dataset.
Some techniques of NoSQL databases
Advantages of sharding
➢ Scaling
Help to facilitate horizontal scaling. (This is often contrasted with vertical scaling, which involves upgrading the hardware of an existing server.)
➢ Improve Performance
Speed up query response times because queries have to go over fewer rows and their result sets are returned much more
quickly.
➢ Reliability
Help to make an application more reliable by mitigating the impact of outages. With a sharded database, an outage is likely
to affect only a single shard.
➢ Easier to Manage
Maintenance tasks including regular backups, database optimization, and other common tasks can be performed
independently in parallel by using the sharding approach.
➢ Reduce Costs
Reduction of costs involved in license fees, software maintenance, and hardware investment when compared to traditional solutions.
source: https://nadermedhatthoughts.medium.com/understanddatabase-sharding-the-good-and-ugly-868aa1cbc94c
Some techniques of NoSQL databases
Replication
Master-slave Replication
Nodes are arranged in a master-slave configuration, and all data is written to the master node. Once saved, the data is replicated over to multiple slave nodes.
All external write requests, including insert,
update and delete, occur on the master node,
whereas read requests can be fulfilled by
any slave node.
Master-slave Replication
Peer-to-Peer Replication
With peer-to-peer replication, all nodes operate at the same level. Each node, known as a peer, is equally capable of
handling reads and writes. Each write is copied to all peers.
Peer-to-Peer Replication
Pessimistic concurrency: prevents inconsistency; it uses locking to ensure that only one update to a record can occur at a time.
Optimistic concurrency: allows inconsistency to occur temporarily; consistency is achieved after all updates have propagated.
CAP Theorem
CAP Theorem
CAP Theorem
BASE
BASE is a database design principle based on the CAP theorem and leveraged by database systems that use
distributed technology. BASE stands for:
• basically available
• soft state
• eventual consistency
- The system ensures basic availability, meaning it is able to respond to user requests within a finite time, even if the response may not be the most recent or completely accurate; the response comes either in the form of the requested data or a success/failure notification.
- Soft state means that a database may be in an inconsistent state when data is read, but will eventually reach a consistent state. This implies that the system allows for data inconsistency and can tolerate it within a certain timeframe.
- Eventual consistency means the system eventually reaches a consistent state, even in the face of network
partitions or replication delays, where data in the system may temporarily be inconsistent, but will eventually
converge to a consistent state through some mechanism. While the database is in the process of attaining the
state of eventual consistency, it will be in a soft state.
NoSQL databases
BASE
BASE
BASE
Below is a list of the principal characteristics of NoSQL storage devices that differentiate them from
traditional RDBMSs.
• Schema-less data model – Data can exist in its raw form.
• Scale out rather than scale up – More nodes can be added to obtain additional storage with a
NoSQL database, in contrast to having to replace the existing node with a better, higher
performance/capacity one.
• Highly available – This is built on cluster-based technologies that provide fault tolerance out of
the box.
• Lower operational costs – Many NoSQL databases are built on Open Source platforms with no
licensing costs. They can often be deployed on commodity hardware.
• Eventual consistency – Data read across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state.
• BASE, not ACID – BASE compliance requires a database to maintain high availability in the event
of network/node failure, while not requiring the database to be in a consistent state whenever an
update occurs. The database can be in a soft/inconsistent state until it eventually attains
consistency. As a result, in consideration of the CAP theorem, NoSQL storage devices are
generally AP or CP.
NoSQL databases
Characteristics
• API driven data access – Data access is generally supported via API based queries, including RESTful
APIs, whereas some implementations may also provide SQL-like query capability.
• Auto sharding and replication– To support horizontal scaling and provide high availability, a NoSQL
storage device automatically employs sharding and replication techniques where the dataset is partitioned
horizontally and then copied to multiple nodes.
• Integrated caching – This removes the need for a third-party distributed caching layer, such as
Memcached.
• Distributed query support – NoSQL storage devices maintain consistent query behavior across multiple
shards.
• Polyglot persistence – The use of NoSQL storage does not mandate retiring traditional RDBMSs. In fact,
both can be used at the same time, thereby supporting polyglot persistence, which is an approach of
persisting data using different types of storage technologies within the same solution architecture. This is
good for developing systems requiring structured as well as semi/unstructured data.
• Aggregate-focused – Unlike relational databases that are most effective with fully normalized data, NoSQL
storage devices store de-normalized aggregated data (an entity containing merged, often nested, data for
an object) thereby eliminating the need for joins and extensive mapping between application objects and
the data stored in the database. One exception, however, is that graph database storage devices are not
aggregate-focused.
NoSQL databases
• Rationale
The emergence of NoSQL storage devices can primarily be attributed to the volume, velocity and variety
characteristics of Big Data datasets.
• Volume
The storage requirement of ever-increasing data volumes demands the use of databases that are highly
scalable while keeping costs down for the business to remain competitive. NoSQL storage devices fulfill this
requirement by providing scale out capability while using inexpensive commodity servers.
• Velocity
The fast influx of data requires databases with fast data write capability. NoSQL storage devices enable
fast writes by using schema-on-read rather than schema-on-write principle. Being highly available, NoSQL
storage devices can ensure that write latency does not occur because of node or network failure.
• Variety
A storage device needs to handle different data formats, including documents, emails, images and videos, as well as incomplete data. NoSQL storage devices can store these different forms of semi-structured and unstructured
data formats. At the same time, NoSQL storage devices are able to store schema-less data and incomplete
data, with the added ability of making schema changes as the data model of the datasets evolves. In other words,
NoSQL databases support schema evolution.
NoSQL databases
• Schema-on-write
The schema or the structure of the data is defined and enforced before the data is written into a database or
data storage system. This means that data must conform to a predefined schema before it can be successfully written. Schema-on-write is commonly associated with traditional relational databases like MySQL and Oracle.
- Schema Definition: data types, constraints, and relationships, is defined before any data is written.
- Data Validation: When data is submitted for storage, it is validated against the predefined schema.
- Structured Storage: Data is stored in a structured manner.
• Schema-on-read
The structure or schema of the data is not predefined before the data is stored. Instead, data is stored in its raw or native format
without enforcing a specific schema at the time of ingestion. The schema is applied and interpreted at the time
of reading or querying the data. Schema-on-read is commonly associated with NoSQL databases like
MongoDB, Apache Hadoop.
- Data Ingestion: Data is ingested into the storage system without any predefined structure or schema.
- Schema Application: When querying or reading the data, the schema is applied dynamically based on how
the data is interpreted at that moment. This allows for flexibility in data interpretation.
- Flexibility: It offers flexibility in handling diverse data types and evolving data structures.
NoSQL databases
Type
NoSQL storage devices can mainly be divided into four types based on the way they store data:
➢ key-value
➢ document
➢ column-family
➢ graph
Key-value NoSQL databases
▪ Key-value storage devices store data as key-value pairs and act like hash tables. The table is a list of
values where each value is identified by a key. The value is opaque to the database and is typically
stored as a BLOB (Binary Large Object). The value stored can be any aggregate, ranging from sensor
data to videos.
▪ Value look-up can only be performed via the keys as the database is oblivious to the details of the stored
aggregate.
▪ Key-value storage devices generally do not maintain any indexes, therefore writes are quite fast. Based
on a simple storage model, key-value storage devices are highly scalable.
▪ Most key-value storage devices provide collections or buckets (like tables) into which key-value pairs
can be organized. A single collection can hold multiple data formats. Some implementations support
compressing values for reducing the storage footprint. However, this introduces latency at read time.
Key-value NoSQL databases
▪ query patterns are simple, involving insert, select and delete operations only
▪ applications require searching or filtering data using attributes of the stored value
Document storage devices also store data as key-value pairs. However, unlike key-value storage devices,
the store value is a document that can be queried by the database. These documents can have a
complex nested structure, such as an invoice. The documents can be encoded using either a text-based
encoding scheme, such as XML, or JSON, or using a binary encoding scheme, such as BSON (Binary
JSON).
Document NoSQL databases
The main differences between document storage devices and key-value storage devices
are as follows:
▪ the stored value is self-describing; the schema can be inferred from the structure of the value or a reference to the
schema for the document is included in the value
▪ partial updates are supported; therefore a subset of the aggregate can be updated
▪ schema evolution is a requirement as the structure of the document is either unknown or is likely to change
▪ performing operations that need joins between multiple documents or storing data that is normalized
▪ schema enforcement for achieving consistent query design is required as the document structure may change
between successive query runs, which will require restructuring the query
▪ the stored value is not self-describing and does not have a reference to a schema
Column-family NoSQL databases
▪ real time random read/write capability is needed and data being stored has some defined structure
▪ data represents a tabular structure, each row consists of a large number of columns and nested groups of
interrelated data exist
▪ support for schema evolution is required as column families can be added or removed without any system
downtime
▪ certain fields are mostly accessed together, and searches need to be performed using field values
▪ efficient use of storage is required when the data consists of sparsely populated rows since column-family
databases only allocate storage space if a column exists for a row. If no column is present, no space is allocated.
▪ query patterns are likely to change frequently because that could initiate a corresponding restructuring of
how column-families are arranged
Graph NoSQL databases
Graph storage devices are used to persist inter-
connected entities. Unlike other NoSQL storage
devices, where the emphasis is on the structure
of the entities, graph storage devices place
emphasis on storing the linkages between
entities. Entities are stored as nodes (not to be
confused with cluster nodes) and are also called
vertices, while the linkages between entities are
stored as edges. In RDBMS parlance, each node can be thought of as a single row, while an edge denotes a join.
Nodes can have more than one type of link
between them through multiple edges. Each
node can have attribute data as key-value pairs,
such as a customer node with ID, name and age
attributes.
Each edge can have its own attribute data as
key-value pairs, which can be used to further
filter query results. Queries generally involve
finding interconnected nodes based on node
attributes and/or edge attributes.
Graph NoSQL databases
▪ querying entities based on the type of relationship with each other rather than the attributes of the entities
▪ updates are required to a large number of node attributes or edge attributes, as this involves searching for nodes or edges, which is a costly operation compared to performing node traversals
▪ entities have a large number of attributes or nested data – it is best to store lightweight entities in a graph
storage device while storing the rest of the attribute data in a separate non-graph NoSQL storage device
▪ queries based on the selection of node/edge attributes dominate node traversal queries
Shortcomings of NoSQL
▪ NoSQL storage devices do not provide the same transaction and consistency support as exhibited by ACID
compliant RDBMSs.
▪ Following the BASE model, NoSQL storage devices provide eventual consistency rather than immediate
consistency. They therefore will be in a soft state while reaching the state of eventual consistency.
▪ NoSQL storage devices are not appropriate for use when implementing large scale transactional systems.
NewSQL databases
▪ NewSQL storage devices combine the ACID properties of RDBMS with the scalability and fault tolerance
offered by NoSQL storage devices.
▪ NewSQL databases generally support SQL compliant syntax for data definition and data manipulation
operations, and they often use a logical relational data model for data storage.
▪ NewSQL databases can be used for developing OLTP (online transaction processing) systems with very high
volumes of transactions. They can be used for real time analytics.
▪ As compared to a NoSQL storage device, a NewSQL storage device provides an easier transition from a
traditional RDBMS to a highly scalable database due to its support for SQL.
Horizontal scaling
In-Memory Storage Devices
▪ data arrives at a fast pace and requires real time analytics or event stream processing
▪ interactive query processing and real time data visualization needs to be performed, including what-if analysis
and drill-down operations
▪ performing exploratory data analysis, as the same dataset does not need to be reloaded from disk if the
algorithm changes
▪ developing low latency Big Data solutions with ACID transaction support
In-Memory Storage Devices
▪ very large amounts of data need to be persisted in-memory for a long time in order to perform in-depth data analysis
▪ performing strategic BI or strategic analytics that involves access to very large amounts of data and involves batch data processing
▪ datasets are extremely large and do not fit into the available memory
▪ making the transition from traditional data analysis toward Big Data analysis, as incorporating an in-memory storage device may require additional skills and involves a complex setup
▪ an enterprise has a limited budget, as setting up an in-memory storage device may require upgrading nodes, which could either be done by node replacement or by adding more RAM
In-Memory Storage Devices
▪ Nodes in IMDGs keep themselves synchronized and collectively provide high availability, fault tolerance and
consistency.
▪ As compared to relational IMDBs, IMDGs provide faster data access as IMDGs store non-relational data as
objects.
▪ IMDGs scale horizontally by implementing data partitioning and data replication and further support reliability
by replicating data to at least one extra node.
In-Memory Data Grids
In a Big Data solution environment, IMDGs are often deployed together with on-disk
storage devices that act as the backend storage. This is achieved via the following
approaches that can be combined as necessary to support read/write performance,
consistency and simplicity requirements:
❖ read-through
❖ write-through
❖ write-behind
❖ refresh-ahead
The use of In-Memory Data Grids in Big Data environment
Read-through
If a requested value for a key is not found in the IMDG, then it is synchronously read from the backend on-disk storage device, such as a database. Upon a successful read from the backend on-disk storage device, the key-value pair is inserted into the IMDG, and the requested value is returned to the client. Any subsequent requests for the same key are then served by the IMDG directly, instead of the backend storage.
The use of In-Memory Data Grids in Big Data environment
Write-through
Write-behind
Refresh-ahead
▪ Refresh-ahead is a proactive approach where any
frequently accessed values are automatically,
asynchronously refreshed in the IMDG, provided
that the value is accessed before its expiry time as
configured in the IMDG.
▪ If a value is accessed after its expiry time, the
value, like in the read-through approach, is
synchronously read from the backend storage and
updated in the IMDG before being returned to the
client.
▪ Compared to the read-through approach, where a
value is served from the IMDG until its expiry,
data inconsistency between the IMDG and the
backend storage is minimized as values are
refreshed before they expire.
In-Memory Data Grids
▪ data being stored is non-relational in nature such as semi-structured and unstructured data
▪ adding real-time support to an existing Big Data solution currently using on-disk storage
▪ the existing storage device cannot be replaced but the data access layer can be modified
Examples of IMDG storage devices include: Hazelcast, Infinispan, Pivotal GemFire and Gigaspaces XAP
In-Memory Databases
IMDBs are in-memory storage devices that employ database technology and
leverage the performance of RAM to overcome runtime latency issues that
plague on-disk storage devices.
▪ Unlike IMDGs, which generally provide data access via APIs, relational IMDBs make use of the more familiar SQL
language. NoSQL-based IMDBs generally provide API-based access, which may be as simple as put, get and delete
operations
▪ Depending on the underlying implementation, some IMDBs scale out, while others scale up, to achieve scalability.
▪ Not all IMDB implementations directly support durability, but instead leverage various strategies for providing
durability in the face of machine failures or memory corruption.
▪ Like an IMDG, an IMDB may also support the continuous query feature.
▪ IMDBs are heavily used in real-time analytics and can further be used for developing low latency applications requiring
full ACID transaction support (relational IMDB).
▪ In comparison with IMDGs, IMDBs provide an easy-to-set-up in-memory data storage option, as IMDBs do not
generally require on-disk backend storage devices.
In-Memory Databases
[Figure 7.27: an IMDB stores temperature values for various sensors]
▪ Introduction of IMDBs into an existing Big Data solution generally requires replacement of existing
on-disk storage devices, including any RDBMSs if used. In the case of replacing an RDBMS with a
relational IMDB, little or no application code change is required due to SQL support provided by
the relational IMDB. However, when replacing an RDBMS with a NoSQL IMDB, code change may
be required due to the need to implement the IMDB’s NoSQL APIs.
▪ In the case of replacing an on-disk NoSQL database with a relational IMDB, code change will often
be required to establish SQL-based access. However, when replacing an on-disk NoSQL database
with a NoSQL IMDB, code change may still be required due to the implementation of new APIs.
▪ Relational IMDBs are generally less scalable than IMDGs, as relational IMDBs need to support
distributed queries and transactions across the cluster. Some IMDB implementations may benefit
from scaling up, which helps to address the latency that occurs when executing queries and
transactions in a scale-out environment.
Examples include Aerospike, MemSQL, Altibase HDB, eXtreme DB and Pivotal GemFire
XD.
In-Memory Databases
▪ adding real-time support to an existing Big Data solution currently using on-disk storage
▪ the existing on-disk storage device can be replaced with an in-memory equivalent
technology
▪ it is required to minimize changes to the data access layer of the application code, such as
when the application consists of an SQL-based data access layer