Unit 4
Big Data solutions require a scalable distributed computing model with a shared-nothing
architecture. One solution is to store Big Data in HDFS files. NoSQL data stores also hold
Big Data and facilitate random read/write accesses, whereas accesses to HDFS data are
sequential. HBase is a NoSQL solution (Section 2.6.3). Examples of other solutions are
MongoDB and Cassandra. The MongoDB and Cassandra DBMSs create HDFS-compatible
distributed data stores and include their own query processing languages.
Schema-Less Models
NoSQL data stores do not necessarily have a fixed table schema. The systems do not use
the concept of a join (between distributed datasets). A cluster of highly distributed nodes
manages a single large data store with a NoSQL DB. Data written at one node replicates
to multiple nodes. The replicas are therefore identical, which makes the store
fault-tolerant, and the data is partitioned into shards.
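The idea of partitioning into shards and replicating each shard to several nodes can be
sketched as follows. This is only a minimal illustration: the hash-modulo placement, the
function names shard_for and replica_nodes, and the replication factor of 3 are assumptions
made for the example, not the scheme of any particular NoSQL product.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a key to a shard number using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def replica_nodes(shard, num_nodes, replication_factor):
    """Place one shard on `replication_factor` consecutive nodes of the cluster."""
    return [(shard + i) % num_nodes for i in range(replication_factor)]

NUM_NODES = 4  # hypothetical cluster size; one shard per node
for key in ["user:1", "user:2", "order:17"]:
    shard = shard_for(key, NUM_NODES)
    nodes = replica_nodes(shard, NUM_NODES, 3)
    print(key, "-> shard", shard, "stored on nodes", nodes)
```

Because every shard lives on three nodes in this sketch, the loss of any single node still
leaves two identical copies of its data available.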
Distributed databases can store and process a set of information on more than one
computing node. The NoSQL data model relaxes one or more of the ACID properties
(atomicity, consistency, isolation and durability) of the database. Distribution follows
the CAP theorem, which concerns three properties: consistency, availability and partition
tolerance. The theorem states that a distributed data store can guarantee at most two of
the three at the same time, so the designer selects the two that the
application/service/process needs most.
Key-Value Store
The simplest way to implement a schema-less data store is to use key-value pairs. The
data store characteristics are high performance, scalability and flexibility. Data retrieval
is fast in key-value pair data stores. A simple string, called a key, maps to a large data
string or BLOB (Binary Large Object). Key-value store accesses use a primary key for
accessing the values. Therefore, the store can be easily scaled up for very large data.
Typical uses of key-value store are: (i) Image store, (ii) Document or file store, (iii) Lookup
table, and (iv) Query-cache.
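A key-value store can be pictured as a dictionary in which a string key maps to an opaque
value. The sketch below is an in-memory illustration only; the class name KeyValueStore and
its put/get/delete methods are hypothetical and do not correspond to the API of Redis, Riak
or any other product.

```python
class KeyValueStore:
    """Minimal in-memory key-value store: a string key maps to an opaque value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not interpret the value; it may be text or a BLOB.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup uses only the primary key, which keeps retrieval fast.
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("image:logo", b"\x89PNG...")             # image store (BLOB value)
store.put("country:IN", "India")                   # lookup table
store.put("query:top10", '["a", "b", "c"]')        # query cache
print(store.get("country:IN"))
print(store.get("country:XX", "unknown"))
```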
Redis, Dynamo and Riak are some examples of NoSQL key-value store databases. Riak's
design is based on Amazon’s Dynamo paper.
Document Store
Characteristics of a document data store are high performance and flexibility. Scalability
varies and depends on the stored contents. Complexity is low compared to tabular, object
and graph data stores. Following are the features of a document store:
1. A document store holds unstructured data.
2. Storage has similarity with an object store.
3. Data is stored in nested hierarchies, for example, in the JSON data model
[Example 3.3(ii)], the XML document object model (DOM), or machine-readable data as one
BLOB. Hierarchical information is stored in a single unit called a document tree.
Logically related data is stored together in a unit.
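The nested hierarchy of a document can be illustrated with a small JSON document. The
sketch below assumes a made-up student record and a hypothetical helper get_path that walks
the document tree along a dotted path; real document stores provide their own query
languages for this.

```python
import json

# One document keeps logically related data together as a nested tree.
student_doc = {
    "_id": "s-101",
    "name": "Asha",
    "address": {"city": "Pune", "pin": "411001"},
    "courses": [
        {"code": "BDA", "grade": "A"},
        {"code": "DBMS", "grade": "B"},
    ],
}

def get_path(doc, path):
    """Follow a dotted path such as 'address.city' or 'courses.0.code'."""
    node = doc
    for part in path.split("."):
        # List positions are given as integers, dictionary fields as names.
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node

serialized = json.dumps(student_doc)   # the document stored as one JSON string
restored = json.loads(serialized)      # parsed back into a document tree
print(get_path(restored, "address.city"))
print(get_path(restored, "courses.1.grade"))
```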
CSV and JSON File Formats
CSV is a file format for storing records [Example 1.9 and Example 3.3(i)]. CSV does not
represent object-oriented databases or hierarchical data records. JSON and XML represent
semi-structured data, object-oriented records and hierarchical data records.
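The difference between the two formats can be seen by parsing a small sample of each. The
sample data below is invented for illustration; only the standard csv and json modules are
used.

```python
import csv
import io
import json

# CSV: every record is a flat row of fields; values are read as plain strings.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"], rows[0]["city"])

# JSON: a record can nest objects and arrays, so hierarchy is preserved.
json_text = '{"id": 1, "name": "Asha", "address": {"city": "Pune", "phones": ["111", "222"]}}'
record = json.loads(json_text)
print(record["address"]["phones"][1])
```

Representing the nested address in CSV would need extra columns or a second file, which is
why CSV suits flat records only.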
Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular document-oriented
DBMSs.
Tabular Data
Tabular data stores use rows and columns. The row-head field may be used as a key which
accesses and retrieves multiple values from the successive columns in that row. OLTP is
fast on in-memory row-format data.
Oracle DBs provide both options: columnar and row format storages.
Generally, relational DB stores hold in-memory row-based data, in which the key in the
first column of a row is at a memory address and the values in successive columns are at
successive memory addresses. That makes OLTP easier, because all fields of a row are
accessed together during OLTP. Different rows are stored at different addresses in memory
or on disk. An in-memory row-based DB stores a row as consecutive memory or disk entries.
This strategy makes data searching and accessing faster during transaction processing.
In-memory column-based data has the keys (row-head keys) in the first column of each
row at successive memory addresses. The values in the next column of each row, after the
key, are likewise at successive memory addresses.
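The two layouts can be compared with a small sketch in which a Python list stands in for
successive memory addresses. The three-row table and the names row_layout, column_layout
and fetch_row are assumptions made for the illustration.

```python
# A small table: (row-head key, name, marks)
rows = [
    ("k1", "Asha", 72),
    ("k2", "Ravi", 85),
    ("k3", "Meera", 91),
]
NUM_FIELDS = 3

# Row-based layout: all fields of one row sit next to each other.
row_layout = [field for row in rows for field in row]

# Column-based layout: all values of one column sit next to each other.
column_layout = [list(column) for column in zip(*rows)]

def fetch_row(index):
    """OLTP-style access: read every field of one row as one contiguous slice."""
    start = index * NUM_FIELDS
    return row_layout[start:start + NUM_FIELDS]

# Analytics-style access: scan a single column without touching the others.
marks = column_layout[2]
average_marks = sum(marks) / len(marks)

print(row_layout)
print(column_layout)
print(fetch_row(1), average_marks)
```

Fetching a whole row is one contiguous read in the row layout, while an aggregate over one
column is one contiguous read in the column layout.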
Column Family Store
Columnar Data Store
A way to implement a schema is to divide the data into columns. The successive values of
each column are stored at successive memory addresses. In-memory analytics processing
(AP) uses columnar storage in memory. A pair of row-head and column-head is a key-pair.
The pair accesses a field in the table.
A column-family data store has the characteristics of very high performance and
scalability, a moderate level of flexibility, and lower complexity when compared to the
object and graph databases.
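Access through a (row-head, column-head) key-pair can be sketched with a dictionary keyed
by that pair. The column names written as "family:qualifier", the sample customers and the
functions put, get and scan_column are hypothetical; they only illustrate the access
pattern, not the API of HBase, BigTable or Cassandra.

```python
table = {}  # maps the key-pair (row_head, column_head) to a field value

def put(row_head, column_head, value):
    table[(row_head, column_head)] = value

def get(row_head, column_head, default=None):
    """The key-pair addresses exactly one field of the table."""
    return table.get((row_head, column_head), default)

def scan_column(column_head):
    """Collect one column across all rows, as an analytics query would."""
    return {row: value for (row, column), value in table.items() if column == column_head}

put("cust-1", "profile:name", "Asha")
put("cust-1", "sales:2023", 1200)
put("cust-2", "profile:name", "Ravi")
put("cust-2", "sales:2023", 800)

print(get("cust-1", "profile:name"))
total_sales = sum(scan_column("sales:2023").values())
print(total_sales)
```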
Examples of widely used column-family data store are Google's BigTable, HBase and
Cassandra.
Graph Database
One way to implement a data store is to use a graph database. A characteristic of graphs
is high flexibility. Any number of nodes and any number of edges can be added to
expand a graph. The complexity is high and the performance is variable with scalability.
Graph databases enable fast network searches. Graph databases use linked datasets, such
as social media data. The data store uses graphs with nodes and edges connecting each
other through relations, associations and properties. Querying for data uses graph
traversal along the paths. Traversal may use single-step, path expressions or full
recursion. A relationship represents a key. A node possesses properties, including an ID.
An edge may have a label which may specify a role. Characteristics of graph databases are:
2. Create a database system which models the data in a completely different way than the
key-values, document, columnar and object data store models.
Typical uses of graph databases are: (i) link analysis, (ii) friend of friend queries, (iii)
rules and inference, (iv) rule induction and (v) pattern matching.
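A friend of friend query and a bounded traversal can be sketched on a small adjacency-set
graph. The names in the graph and the functions friends_of_friends and within_hops are
invented for the example; a graph DBMS would express the same traversal in its own query
language.

```python
from collections import deque

# Nodes are people; an undirected edge means "is a friend of".
friends = {
    "Asha": {"Ravi", "Meera"},
    "Ravi": {"Asha", "Kiran"},
    "Meera": {"Asha", "Kiran", "John"},
    "Kiran": {"Ravi", "Meera"},
    "John": {"Meera"},
}

def friends_of_friends(graph, person):
    """Two-step traversal: friends of friends who are not already direct friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {person}

def within_hops(graph, start, max_hops):
    """Breadth-first traversal that records the hop distance of each reached node."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance

print(sorted(friends_of_friends(friends, "Asha")))
print(sorted(within_hops(friends, "John", 2).items()))
```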
Examples of graph DBs are Neo4j, AllegroGraph, HyperGraphDB, InfiniteGraph, Titan and
FlockDB. The Neo4j graph database enables easy usage by Java developers. Neo4j can be
designed to be fully ACID compliant.
1. High and easy scalability: NoSQL data stores are designed to expand horizontally.
Horizontal scaling means scaling out by adding more machines as data nodes (servers) to
the pool of resources (processing, memory, network connections). The design scales out
using multi-utility cloud services.
2. Support for replication: Multiple copies of data are stored across multiple nodes of a
cluster. This ensures high availability, partition tolerance, reliability and fault tolerance.
4. Usage of less expensive servers: NoSQL data stores require less management effort.
They support many features, like automatic repair, easier data distribution and simpler
data models, that make database administrator (DBA) and tuning requirements less
stringent.
5. Usage of open-source tools: NoSQL data stores are cheap and open source. Database
implementation is easy and typically uses cheap servers to manage the exploding data and
transactions, while RDBMS databases are expensive and use big servers and storage
systems. So, the cost per gigabyte of data storage and processing can be many times less
than the cost of an RDBMS.
6. Support for schema-less data model: A NoSQL data store is schema-less, so data can be
inserted without any predefined schema. The format or data model can therefore be changed
at any time without disrupting the application. Managing such changes is a difficult
problem in SQL.
7. Support for integrated caching: NoSQL data stores support caching in system memory,
which increases output performance. An SQL database needs a separate infrastructure for
that.
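The schema-less feature in the list above can be sketched with a collection that accepts
records of different shapes. The in-memory list, the sample records and the functions
insert and find are assumptions made for illustration and are not the API of any NoSQL
product.

```python
collection = []  # a schema-less collection: no column definitions up front

def insert(document):
    """Store a record as-is; no predefined schema is checked."""
    collection.append(dict(document))

def find(**criteria):
    """Return the records whose fields match all the given criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# Records with different sets of fields live in the same collection.
insert({"name": "Asha", "city": "Pune"})
insert({"name": "Ravi", "city": "Delhi", "phones": ["111"]})
insert({"name": "Meera", "email": "meera@example.com"})  # new field, no migration step

print(find(city="Pune"))
print(find(email="meera@example.com"))
```

Adding the email field needed no change to the earlier records, whereas a fixed-schema
table would first need its definition altered.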
Features of MongoDB
Cassandra
Features of Cassandra
Cassandra has become so popular because of its outstanding technical features. Given
below are some of the features of Cassandra:
throughput as you increase the number of nodes in the cluster. Therefore it maintains a
quick response time.
Flexible data storage − Cassandra accommodates all possible data formats including:
blazingly fast writes and can store hundreds of terabytes of data, without sacrificing the
read efficiency.
GraphQL
1. It allows for more flexible and dynamic APIs, as clients can modify their
queries as needed without requiring changes to the server-side code.
2. GraphQL provides a structured way to define the schema of an API and the
types of data that can be queried which allows for easy documentation and better tooling
support for clients and servers.
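The first point, that a client can change what it asks for without a server-side change,
can be imitated in plain Python. This sketch does not use GraphQL syntax or any GraphQL
library; the function select and the nested-dictionary form of the selection are
assumptions that only mimic the idea of a client-chosen set of fields.

```python
def select(data, selection):
    """Return only the fields named in `selection`.

    `selection` maps a field name to None (take the value as-is) or to a
    nested selection that is applied to the field's value.
    """
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    result = {}
    for field, sub_selection in selection.items():
        value = data[field]
        result[field] = select(value, sub_selection) if sub_selection else value
    return result

# Server-side data (invented for the example).
user = {
    "id": 1,
    "name": "Asha",
    "email": "asha@example.com",
    "posts": [
        {"title": "Hello", "body": "First post"},
        {"title": "NoSQL", "body": "Notes on data stores"},
    ],
}

# Two clients ask for different fields; the server-side function is unchanged.
print(select(user, {"name": None, "posts": {"title": None}}))
print(select(user, {"id": None, "email": None}))
```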