Unit 4

Big Data solutions require distributed computing architectures like HDFS and NoSQL databases to store and access large, distributed data in a scalable way. HDFS allows for sequential access of data files while NoSQL databases like HBase allow for random read/write access and are more flexible in their data schemas. MongoDB and Cassandra are examples of NoSQL databases that can store data in HDFS-compatible distributed data stores and include query languages. MongoDB is an open-source, document-oriented NoSQL database that provides high performance, high availability, and automatic scaling for distributed data.

Uploaded by

manik

Unit-4

Big Data solutions require a scalable distributed computing model with a shared-nothing
architecture. One solution is to store Big Data in HDFS files, where accesses are
sequential. NoSQL data stores also hold Big Data and additionally support random
read/write accesses. HBase is one NoSQL solution (Section 2.6.3); MongoDB and Cassandra
are others. The MongoDB and Cassandra DBMSs create HDFS-compatible distributed data
stores and include their own query processing languages.

NoSQL is an altogether new way of thinking about databases. Its features include schema
flexibility, simple relationships, dynamic schemas, auto-sharding, replication, integrated
caching, horizontal scalability of shards, distributable tuples, semi-structured data and
flexibility in approach. Issues with NoSQL data stores are the lack of standardization in
approaches, difficulty in processing complex queries, and dependence on eventually
consistent results in place of consistency in all states.

Schema-Less Models
NoSQL data stores do not necessarily have a fixed table schema. The systems do not use
the concept of joins (between distributed datasets). A cluster of highly distributed nodes
manages a single large data store with a NoSQL DB. The data is partitioned into shards,
and data written at one node replicates to multiple nodes. The replicas are therefore
identical, which makes the store fault-tolerant.
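
The partitioning and replication described above can be sketched in a few lines of
Python. This is a minimal illustration, not the placement scheme of any particular NoSQL
product: the names ShardedStore, shard_for and replica_nodes are hypothetical, and
placing replicas on consecutive nodes is an assumption made for simplicity.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def replica_nodes(shard, num_nodes, replication_factor):
    """Assumed placement: a shard lives on consecutive nodes, wrapping around."""
    return [(shard + i) % num_nodes for i in range(replication_factor)]


class ShardedStore:
    """Hypothetical in-memory cluster: one dict per node."""

    def __init__(self, num_nodes=4, replication_factor=2):
        self.num_nodes = num_nodes
        self.replication_factor = replication_factor
        self.nodes = [dict() for _ in range(num_nodes)]

    def _replicas(self, key):
        shard = shard_for(key, self.num_nodes)
        return replica_nodes(shard, self.num_nodes, self.replication_factor)

    def put(self, key, value):
        # A write is copied to every replica of the key's shard.
        for node in self._replicas(key):
            self.nodes[node][key] = value

    def get(self, key, failed=frozenset()):
        # A read succeeds as long as one replica is on a live node.
        for node in self._replicas(key):
            if node not in failed and key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)


store = ShardedStore(num_nodes=4, replication_factor=2)
store.put("user:1", {"name": "Asha"})
first_replica = shard_for("user:1", 4)
# The value is still readable when the first replica's node is down.
value_after_failure = store.get("user:1", failed={first_replica})
```

Because each key has two identical copies on different nodes, losing one node does not
lose the data, which is the fault tolerance the paragraph refers to.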
Distributed databases can store and process a set of information on more than one
computing node. The NoSQL data model relaxes one or more of the ACID properties
(atomicity, consistency, isolation and durability) of the database. Distribution follows
the CAP theorem, which states that a distributed data store can guarantee at most two of
three properties (consistency, availability and partition tolerance) at the same time, so
an application/service/process must choose which two it needs.

NOSQL DATA ARCHITECTURE PATTERNS

Key-Value Store

The simplest way to implement a schema-less data store is to use key-value pairs. The
data store characteristics are high performance, scalability and flexibility. Data retrieval
is fast in key-value data. A simple string, called the key, maps to a large data string or
BLOB (Binary Large Object). Key-value store accesses use a primary key for accessing
the values. Therefore, the store can be easily scaled up for very large data.
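
As a minimal sketch of the idea, the class below keeps opaque byte strings under string
keys. The class name KeyValueStore and its methods are hypothetical and chosen only for
illustration; real key-value stores add persistence, distribution and expiry on top of
this basic interface.

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store: string key -> opaque bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store does not interpret the value; it is just a BLOB.
        self._data[key] = value

    def get(self, key, default=None):
        # Lookup goes straight through the primary key.
        return self._data.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("image:42", b"\x89PNG-bytes-of-an-image")
store.put("page:/home", b"<html>cached page</html>")
image = store.get("image:42")
missing = store.get("image:99")
```

Every access names exactly one key, so keys can be spread over many machines without any
operation needing data from two of them. That is why this model scales so easily.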

Typical uses of key-value store are: (i) Image store, (ii) Document or file store, (iii) Lookup
table, and (iv) Query-cache.
Redis, Dynamo and Riak are some NoSQL examples of key-value store databases. Riak is
modeled on Amazon's Dynamo paper.

Document Data Store

Characteristics of a document data store are high performance and flexibility. Scalability
varies and depends on the stored contents. Complexity is low compared to tabular, object
and graph data stores. Following are the features of a document store:

1. A document store holds unstructured data.

2. Storage is similar to an object store.

3. Data is stored in nested hierarchies, for example in the JSON data model [Example
3.3(ii)], the XML document object model (DOM), or machine-readable data as one BLOB.
Hierarchical information is stored in a single unit called a document tree, so logically
related data is stored together in one unit.
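
A document tree can be illustrated with Python's standard json module. The order document
and the helper get_path below are made up for this sketch; the dotted-path notation is an
assumption for illustration and not the syntax of a specific document database.

```python
import json

# One document holds the order, its customer and its items together.
order_json = """
{
  "order_id": 1001,
  "customer": {"name": "Asha", "address": {"city": "Pune"}},
  "items": [
    {"sku": "A1", "qty": 2},
    {"sku": "B7", "qty": 1}
  ]
}
"""


def get_path(document, path):
    """Walk the document tree along a dotted path such as 'items.1.sku'."""
    node = document
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]  # list levels are addressed by index
        else:
            node = node[part]       # object levels are addressed by field name
    return node


document = json.loads(order_json)
city = get_path(document, "customer.address.city")
second_sku = get_path(document, "items.1.sku")
```

Reading the city or an item needs no join, because the nested data travels with the
document itself.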

CSV and JSON File Formats: CSV data store is a format for records [Example 1.9 and
Example 3.3(i)]. CSV does not represent object-oriented databases or hierarchical data
records. JSON and XML represent semi-structured data, object-oriented records and
hierarchical data records.
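
The difference can be seen by parsing both formats with Python's standard library. The
sample records are invented for this sketch.

```python
import csv
import io
import json

# CSV: one flat record per line, every value arrives as a string.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: values keep their types and may nest lists and objects.
json_text = """
{"id": 1, "name": "Asha",
 "skills": ["HDFS", "HBase"],
 "address": {"city": "Pune", "pin": "411001"}}
"""
record = json.loads(json_text)

first_row = rows[0]
nested_city = record["address"]["city"]
```

A CSV row has no place for the list of skills or the nested address without inventing
extra columns or a second file, whereas the JSON record carries them directly.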

Amazon SimpleDB, CouchDB, MongoDB, Riak and Lotus Notes are popular
document-oriented DBMS systems.

Tabular Data

Tabular data stores use rows and columns. The row-head field may be used as a key which
accesses and retrieves multiple values from the successive columns in that row. OLTP is
fast on in-memory row-format data.

Oracle DBs provide both options: columnar and row format storages.

Generally, relational DB stores hold in-memory row-based data, in which the key in the
first column of a row is at a memory address and the values in successive columns are at
successive memory addresses. That makes OLTP easier, because all fields of a row are
accessed together during OLTP. Different rows are stored at different addresses in
memory or on disk. An in-memory row-based DB stores a row as consecutive memory or disk
entries. This strategy makes data searching and accessing faster during transaction
processing.

In-memory column-based data has the keys (row-head keys) in the first column of each
row at successive memory addresses. The next column of each row after the key has the
values at successive memory addresses.

Column Family Store

Columnar Data Store: A way to implement a schema is to divide the data into columns. The
successive values of each column are stored at successive memory addresses. In-memory
analytics processing (AP) uses columnar storage in memory. A pair of row-head and
column-head is a key-pair. The pair accesses a field in the table.
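
The two layouts can be compared with plain Python lists. This is only a sketch of the
logical arrangement (Python lists do not control physical memory addresses), and the
sample table is invented.

```python
# Row format: each record is stored together, which suits OLTP lookups.
rows = [
    (1, "Asha", 250.0),
    (2, "Ravi", 100.0),
    (3, "Meena", 75.5),
]

# Column format: each column's values are stored together.
ids, names, amounts = (list(column) for column in zip(*rows))
columns = {"id": ids, "name": names, "amount": amounts}


def fetch_row(row_store, row_id):
    """OLTP-style access: return all fields of one record."""
    for row in row_store:
        if row[0] == row_id:
            return row
    return None


def field(column_store, row_id, column_name):
    """The (row-head, column-head) key-pair addresses one field."""
    position = column_store["id"].index(row_id)
    return column_store[column_name][position]


one_record = fetch_row(rows, 2)
# Analytics-style access: an aggregate reads one column only.
total_amount = sum(columns["amount"])
ravi_amount = field(columns, 2, "amount")
```

The aggregate touches only the amount column, which is why analytics favors the columnar
layout, while fetching a whole record is a single step in the row layout.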

Column-Family Data Store: A column-family data store has a group of columns as a column
family. A combination of row-head, column-family head and table-column head can also
be a key to access a field in a column of the table during querying.
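
A nested dictionary gives a rough picture of this three-part key. The table contents and
the helper names get_field and get_family are hypothetical; this is not the storage
format of HBase or Cassandra.

```python
# row key -> column family -> column -> value
table = {
    "row1": {
        "personal": {"name": "Asha", "city": "Pune"},
        "orders": {"count": 3, "last_amount": 250.0},
    },
    "row2": {
        # Rows need not have the same columns in a family.
        "personal": {"name": "Ravi"},
        "orders": {"count": 1},
    },
}


def get_field(store, row_key, family, column, default=None):
    """Address one field with (row-head, column-family head, column head)."""
    return store.get(row_key, {}).get(family, {}).get(column, default)


def get_family(store, row_key, family):
    """Fetch every column of one family for a row in a single access."""
    return dict(store.get(row_key, {}).get(family, {}))


name = get_field(table, "row1", "personal", "name")
absent_city = get_field(table, "row2", "personal", "city")
row1_orders = get_family(table, "row1", "orders")
```

Grouping related columns into a family lets a query read the whole group at once, and a
missing column simply has no entry instead of a stored null.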

A column-family data store offers very high performance and scalability, a moderate
level of flexibility, and lower complexity when compared to object and graph databases.

Examples of widely used column-family data store are Google's BigTable, HBase and
Cassandra.

Graph Database

One way to implement a data store is to use a graph database. A characteristic of graphs
is high flexibility. Any number of nodes and any number of edges can be added to
expand a graph. The complexity is high and the performance is variable with scalability.

Graph databases enable fast network searches. A graph uses linked datasets, such as
social media data. The data store uses graphs with nodes and edges connecting each other
through relations, associations and properties. Querying for data uses graph traversal
along the paths. Traversal may use single-step, path expressions or full recursion. A
relationship represents a key. A node has properties, including an ID. An edge may
have a label which may specify a role. Characteristics of graph databases are:

1. Use specialized query languages, such as SPARQL for RDF data.

2. Create a database system which models the data in a completely different way than the
key-values, document, columnar and object data store models.

3. Can have hyper-edges. A hyper-edge is a set of vertices of a hypergraph. A hypergraph
is a generalization of a graph in which an edge can join any number of vertices (not only
the neighboring vertices).

Typical uses of graph databases are: (i) link analysis, (ii) friend-of-friend queries, (iii)
rules and inference, (iv) rule induction and (v) pattern matching.
Examples of graph DBs are Neo4J, AllegroGraph, HyperGraphDB, InfiniteGraph, Titan and
FlockDB. The Neo4J graph database is easy for Java developers to use. Neo4J supports
fully ACID-compliant transactions.
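
A friend-of-friend query and a path search can be sketched over an adjacency map. The
graph below is invented, and the functions are plain Python stand-ins for what a graph
query language would express; they do not use the API of any graph database.

```python
from collections import deque

# Nodes are people; each undirected "friend" edge appears in both sets.
friends = {
    "asha": {"ravi", "meena"},
    "ravi": {"asha", "kiran"},
    "meena": {"asha", "kiran", "john"},
    "kiran": {"ravi", "meena"},
    "john": {"meena"},
}


def friends_of_friends(graph, person):
    """Two-step traversal, excluding the person and direct friends."""
    direct = graph.get(person, set())
    reached = set()
    for friend in direct:
        reached |= graph.get(friend, set())
    return reached - direct - {person}


def hops(graph, start, goal):
    """Breadth-first traversal: number of edges on a shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


suggestions = friends_of_friends(friends, "asha")
distance_to_john = hops(friends, "asha", "john")
```

The first function is a fixed two-step traversal, while the second follows paths of any
length, which corresponds to the single-step and recursive traversal styles mentioned
above.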

Using NoSQL to manage big data

Characteristics of a Big Data NoSQL solution are:

1. High and easy scalability: NoSQL data stores are designed to expand horizontally.
Horizontal scaling means scaling out by adding more machines as data nodes (servers)
into the pool of resources (processing, memory, network connections). The design scales
out using multi-utility cloud services.

2. Support for replication: Multiple copies of data are stored across multiple nodes of a
cluster. This ensures high availability, partition tolerance, reliability and fault
tolerance.

3. Distributable: Big Data solutions permit sharding and distributing of shards on
multiple clusters, which enhances performance and throughput.

4. Use of less expensive servers: NoSQL data stores require less management effort.
They support features such as automatic repair, easier data distribution and simpler data
models, which make database administration (DBA) and tuning requirements less
stringent.

5. Use of open-source tools: NoSQL data stores are cheap and open source. Database
implementation is easy and typically uses cheap servers to manage the exploding data
and transactions, while RDBMS databases are expensive and use big servers and storage
systems. So, the cost per gigabyte of storing and processing data can be many times less
than the cost of an RDBMS.

6. Support for a schema-less data model: A NoSQL data store is schema-less, so data can
be inserted without any predefined schema. The format or data model can therefore be
changed at any time without disrupting the application. Managing such changes is a
difficult problem in SQL.

7. Support for integrated caching: NoSQL data stores support caching in system memory,
which increases performance. An SQL database needs a separate infrastructure for that.

8. No inflexibility unlike the SQL/RD


MongoDB
MongoDB is an open-source document database that provides high performance, high
availability, and automatic scaling.
In simple words, MongoDB is a document-oriented database. It was developed and is
supported by a company named 10gen, which was renamed MongoDB Inc. in 2013.
MongoDB is free to use under a public license (the GNU AGPL for older releases and the
Server Side Public License since late 2018), and it is also available under a commercial
license from the vendor.
10gen defined MongoDB as:
"MongoDB is a scalable, open source, high performance, document-oriented database." -
10gen
MongoDB was designed to work with commodity servers. It is now used by companies of
all sizes, across all industries.

Features of MongoDB

1. Support for ad hoc queries
In MongoDB, you can search by field or by range, and it also supports regular expression searches.
2. Indexing
You can index any field in a document.
3. Replication
MongoDB replicates data through replica sets (older releases also offered master-slave replication).
The primary (master) can perform reads and writes. A secondary (slave) copies data from the primary
and can only be used for reads or backup (not writes).
4. Duplication of data
MongoDB can run over multiple servers. The data is duplicated to keep the system up and running
in case of hardware failure.
5. Load balancing
It has an automatic load balancing configuration because the data is placed in shards.
6. Supports MapReduce and aggregation tools.
7. Uses JavaScript functions in place of stored procedures.
8. It is a schema-less database written in C++.
9. Provides high performance.
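
The three ad hoc query styles from feature 1 can be imitated in pure Python. The operator
names $gte, $lte and $regex follow MongoDB's query language, but the functions matches
and find below are hypothetical stand-ins written for this sketch. They do not call
MongoDB or its drivers, and they cover only these three operators.

```python
import re


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field_name, condition in query.items():
        value = document.get(field_name)
        if isinstance(condition, dict):
            for operator, operand in condition.items():
                if operator == "$gte":
                    ok = value is not None and value >= operand
                elif operator == "$lte":
                    ok = value is not None and value <= operand
                elif operator == "$regex":
                    ok = isinstance(value, str) and re.search(operand, value) is not None
                else:
                    raise ValueError("unsupported operator: " + operator)
                if not ok:
                    return False
        elif value != condition:
            return False  # plain value means an equality test on the field
    return True


def find(collection, query):
    """Scan a list of documents and keep those that match."""
    return [doc for doc in collection if matches(doc, query)]


students = [
    {"name": "Asha", "dept": "CSE", "marks": 82},
    {"name": "Ravi", "dept": "ECE", "marks": 67},
    {"name": "Ashok", "dept": "CSE", "marks": 58},
]

by_field = find(students, {"dept": "CSE"})
by_range = find(students, {"marks": {"$gte": 60, "$lte": 90}})
by_regex = find(students, {"name": {"$regex": "^Ash"}})
```

The sketch scans every document. An index on a field (feature 2) exists to avoid exactly
that full scan.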

Cassandra

Apache Cassandra is an open-source, distributed and decentralized storage system
(database) for managing very large amounts of structured data spread out across the
world. It provides a highly available service with no single point of failure.
Listed below are some of the notable points of Apache Cassandra:
It is scalable and fault-tolerant, and its consistency level is tunable.
It is a column-oriented database.
Its distribution design is based on Amazon’s Dynamo and its data model on Google’s
Bigtable.
Created at Facebook, it differs sharply from relational database management systems.
Cassandra implements a Dynamo-style replication model with no single point of failure,
but adds a more powerful “column family” data model.
Cassandra is being used by some of the biggest companies such as Facebook, Twitter,
Cisco, Rackspace, eBay, Netflix, and more.

Features of Cassandra
Cassandra has become popular because of its technical features. Given below are some of
the features of Cassandra:

Elastic scalability: Cassandra is highly scalable; it allows adding more hardware to
accommodate more customers and more data as per requirement.

Always-on architecture: Cassandra has no single point of failure and it is continuously
available for business-critical applications that cannot afford a failure.

Fast linear-scale performance: Cassandra is linearly scalable, i.e., throughput increases
as you increase the number of nodes in the cluster. Therefore it maintains a quick
response time.

Flexible data storage: Cassandra accommodates all possible data formats, including
structured, semi-structured, and unstructured. It can dynamically accommodate changes
to your data structures according to your need.

Easy data distribution: Cassandra provides the flexibility to distribute data where you
need by replicating data across multiple data centers.

Transaction support: Cassandra does not provide full ACID transactions in the relational
sense. It offers atomicity, isolation and durability for writes within a single partition,
together with tunable consistency.

Fast writes: Cassandra was designed to run on cheap commodity hardware. It performs
very fast writes and can store hundreds of terabytes of data without sacrificing read
efficiency.
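
Elastic scalability in Dynamo-style systems rests on consistent hashing, where nodes and
keys share one hash ring. The sketch below shows the technique in general and is not
Cassandra's actual partitioner: the class HashRing, the use of MD5 and the figure of 64
tokens per node are assumptions made for illustration.

```python
import bisect
import hashlib


def token(value):
    """Position of a string on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with several tokens per node."""

    def __init__(self, nodes=(), tokens_per_node=64):
        self.tokens_per_node = tokens_per_node
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        for i in range(self.tokens_per_node):
            position = token(node + "#" + str(i))
            bisect.insort(self._tokens, position)
            self._owner[position] = node

    def node_for(self, key):
        # A key belongs to the first token clockwise from its own position.
        index = bisect.bisect_right(self._tokens, token(key)) % len(self._tokens)
        return self._owner[self._tokens[index]]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = ["key-" + str(i) for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}

ring.add_node("node-d")  # scale out by one machine
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
```

Only the keys that the new node takes over change owner, and every other key stays
where it was. This is what lets a cluster grow without reshuffling all of its data.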
GraphQL

GraphQL is an open-source query language and runtime for APIs, developed by
Facebook in 2012 and publicly released in 2015. It is designed to provide a more
efficient, powerful, and flexible alternative to traditional REST APIs. It allows clients to
request only the data they need, reducing the amount of data transferred over the
network and improving performance.

Some major features of GraphQL are:

1. It allows for more flexible and dynamic APIs, as clients can modify their
queries as needed without requiring changes to the server-side code.

2. GraphQL provides a structured way to define the schema of an API and the
types of data that can be queried, which allows for easy documentation and better tooling
support for clients and servers.

3. It can handle nested and complex queries, which can be challenging with
traditional REST APIs. With GraphQL, clients can specify complex nested queries with
multiple levels of depth, and the server can respond with the requested data in a single
response.

4. GraphQL enables a consistent and predictable API interface, regardless of
changes in the underlying data model. This is made possible by using a version-less API,
allowing for easier evolution and iteration of APIs over time without breaking existing
client applications.
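
The idea of returning only the requested fields, including nested ones, can be sketched
without a GraphQL library. The function select and the user data below are hypothetical;
a real GraphQL server parses the query text and validates it against a typed schema,
which this sketch leaves out.

```python
def select(data, selection):
    """Keep only the fields named in `selection`.

    `selection` maps a field name to True (take the value as is)
    or to a nested selection (recurse into an object or a list).
    """
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    result = {}
    for field_name, sub_selection in selection.items():
        value = data[field_name]
        result[field_name] = value if sub_selection is True else select(value, sub_selection)
    return result


# Server-side data for one user (invented for this sketch).
user = {
    "id": 7,
    "name": "Asha",
    "email": "asha@example.com",
    "posts": [
        {
            "id": 1,
            "title": "HDFS basics",
            "body": "long text",
            "comments": [{"author": "Ravi", "text": "Nice"}],
        }
    ],
}

# Roughly what the GraphQL query
#   { user { name posts { title comments { author } } } }
# asks for, written as a nested dict.
selection = {"name": True, "posts": {"title": True, "comments": {"author": True}}}
response = select(user, selection)
```

The email, the post bodies and the comment text never leave the server, and three levels
of nesting are answered in one response.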
