
Lec 6 - Big Data Storage Technologies II - NoSQL

This document discusses big data storage concepts, focusing on the integration of Hadoop, Spark, and NoSQL databases for analytics. It covers key topics such as MapReduce, the characteristics and types of NoSQL databases, and the emergence of NewSQL databases that combine ACID properties with NoSQL scalability. The lecture highlights the advantages of these technologies in handling large datasets and real-time processing needs.

Uploaded by

amirosama21

Big Data Storage Concepts II

Lecture 6
A General View

▪ Hadoop, Spark, and NoSQL databases can work together to create a powerful big-data analytics platform.
▪ Hadoop can be used for data storage and batch processing.
▪ Spark can be used for real-time processing and analysis.
▪ NoSQL databases can be used for storing and querying large volumes of unstructured data.
Outline

▪ In order to understand the underlying mechanisms behind Big Data storage technology, the following topics are introduced in this lecture.
▪ MapReduce – Brief
▪ What is Spark
▪ NoSQL Databases
▪ Characteristics
▪ NoSQL and CAP
▪ Four types of NoSQL Datastores
▪ Who Uses NoSQL
▪ NewSQL Databases
▪ Distributed SQL
MapReduce – Brief

▪ MapReduce is a batch-oriented processing engine used to process large datasets using parallel processing deployed over clusters of commodity hardware.
▪ It is highly scalable and reliable, and is based on the principle of divide-and-conquer, which provides built-in fault tolerance.
▪ MapReduce does not require that the input data conform to any particular data model.

Figure: Processing in Batch Mode (Ref. No. [3] - Chapter 6)
MapReduce: Simple Programming for Big Data

▪ A dataset is broken down into multiple smaller parts.
▪ Operations are performed on each part independently and in parallel.
▪ The MapReduce system sends computation code (map and reduce functions) to where the data resides.
▪ This favours data locality and cluster rack affinity rather than bringing data to your application.
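
The map and reduce steps described above can be sketched in plain Python with a word count. This is only an illustrative single-process sketch, not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample splits are assumptions made for the example, and a real cluster would run the map calls in parallel on the nodes that hold each split.

```python
from collections import defaultdict
from itertools import chain


def map_words(chunk):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # Each split is mapped independently of the others
    mapped = chain.from_iterable(map_words(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical input: three splits of one larger dataset
splits = ["big data big clusters", "data locality matters", "big data"]
counts = word_count(splits)
```

Because each call to `map_words` only sees its own split, the splits can be processed in any order or on different machines before the shuffle brings equal keys together.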
MapReduce: Simple Programming for Big Data

Figure source: https://www.todaysoftmag.com/article/1358/hadoop-mapreduce-deep-diving-and-tuning
Shortcomings of MapReduce

▪ Forces your data processing into Map and Reduce steps.
▪ Based on an “acyclic data flow” from disk to disk (HDFS).
▪ Reads from and writes to disk before and after Map and Reduce.
▪ Not efficient for iterative tasks, e.g. machine learning.
▪ The implementation is primarily written in Java.
▪ Only for batch processing.
What Is Apache Spark?
▪ Apache Spark is a cluster-computing platform that provides an API for distributed programming like the MapReduce model, but is designed to be fast for interactive queries and iterative algorithms.
▪ Spark provides in-memory storage for intermediate computations, where programs can checkpoint data and refer back to it without reloading it from disk.
▪ It incorporates libraries for machine learning (MLlib), SQL for interactive queries (Spark SQL), stream processing (Structured Streaming) for interacting with real-time data, and graph processing (GraphX).

Figure: The Spark stack

Spark Uses Memory instead of Disk

▪ Hadoop: uses disk for data sharing. HDFS read → Iteration 1 → HDFS write → HDFS read → Iteration 2 → HDFS write.
▪ Spark: in-memory data sharing. HDFS read → Iteration 1 → Iteration 2.

Source: Afzal Godil, Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis, Information Access Division, ITL, NIST
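
The difference between disk-based and in-memory data sharing can be illustrated by counting storage accesses in an iterative job. The sketch below is a toy simulation in plain Python, not Spark or Hadoop code: `CountingStore` is a hypothetical stand-in for HDFS, and `step` stands in for one iteration of some algorithm.

```python
class CountingStore:
    """Hypothetical stand-in for HDFS that counts reads and writes."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def read(self):
        self.reads += 1
        return list(self.data)

    def write(self, data):
        self.writes += 1
        self.data = list(data)


def step(values):
    """One iteration of an iterative algorithm (here: halve every value)."""
    return [v / 2 for v in values]


def run_disk_based(store, iterations):
    # MapReduce style: each iteration reads its input from storage
    # and writes its output back before the next iteration starts
    for _ in range(iterations):
        store.write(step(store.read()))
    return store.read()


def run_in_memory(store, iterations):
    # Spark style: read once, then keep intermediate results in memory
    values = store.read()
    for _ in range(iterations):
        values = step(values)
    return values


disk_store = CountingStore([8.0, 16.0])
disk_result = run_disk_based(disk_store, 3)

mem_store = CountingStore([8.0, 16.0])
mem_result = run_in_memory(mem_store, 3)
```

Both runs produce the same values, but the disk-based run touches storage on every iteration while the in-memory run reads the input once, which is why iterative workloads benefit most.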
NoSQL Datastores

▪ The emergence of NoSQL datastores can primarily be attributed to the volume, velocity and variety characteristics of Big Data datasets.
▪ NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than relational tables.
▪ They offer better horizontal scaling, fault tolerance, and improved performance for big data, at the cost of less rigorous consistency models.
▪ These systems are optimized for fast retrieval and appending operations on records where real-time performance is more important than consistency.
▪ NoSQL databases come in a variety of types based on their data model.

Source: NoSQL vs SQL - 4 Reasons Why NoSQL is better for Big Data applications, Feb 2023
NoSQL Datastores Characteristics

▪ This list should only be considered a general guide, as not all NoSQL storage devices exhibit all of these features:
• Schema-less data model – Data can exist in its raw form.
• Scale out rather than scale up – More nodes can be added to efficiently meet the needs for varying workloads.
• Highly available – This is built on cluster-based technologies that provide fault tolerance out of the box.
• Lower operational costs – Many NoSQL databases are built on open-source platforms with no licensing costs. They can often be deployed on commodity hardware.
• BASE not ACID – Maintain high availability in the event of network/node failure, while not requiring the database to be in a consistent state whenever an update occurs. The database can be in a soft/inconsistent state until it eventually attains consistency.
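
The "soft state until it eventually attains consistency" behaviour in the BASE item can be sketched with a toy two-replica store. This is a simplified illustration under stated assumptions, not how any particular NoSQL product replicates data: the class names, the `pending` queue and the explicit `sync` call are all hypothetical.

```python
class Replica:
    """One copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy two-replica store: writes are acknowledged before they are propagated."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # updates not yet applied to the secondary

    def write(self, key, value):
        # Acknowledged immediately; replication happens later
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read_secondary(self, key):
        return self.secondary.data.get(key)

    def sync(self):
        # Stand-in for background replication: apply queued updates in order
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("stock:42", 10)
stale = store.read_secondary("stock:42")  # secondary has not seen the write yet
store.sync()
fresh = store.read_secondary("stock:42")  # replicas have converged
```

The store stays available for writes and reads throughout; the price is that a read may briefly return an older value.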
NoSQL Datastores Characteristics

• Auto sharding and replication – To support horizontal scaling and provide high availability, a NoSQL storage device automatically employs sharding and replication techniques where the dataset is partitioned horizontally and then copied to multiple nodes.
• Distributed query support – NoSQL storage devices maintain consistent query behaviour across multiple shards.
• Polyglot persistence – An approach of persisting data using different types of storage technologies within the same solution architecture.
• Aggregate-oriented – NoSQL storage devices store de-normalized aggregated data, thereby eliminating the need for joins. One exception, however, is that graph database storage devices are not aggregate-focused.
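
Sharding and replication, as described in the list above, can be sketched as a placement function: a key is hashed to a primary node and then copied to further nodes. The scheme below (MD5 hash modulo the node count, replicas on the following nodes) is one simple assumption chosen for illustration; real datastores use their own placement strategies, and the names `shard_for`, `nodes_for` and `place` are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard with a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a digest is used to get the same answer on every run.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def nodes_for(key, num_nodes, replicas):
    """Nodes holding a key: its primary shard plus the next replicas-1 nodes."""
    first = shard_for(key, num_nodes)
    return [(first + i) % num_nodes for i in range(replicas)]


def place(keys, num_nodes=4, replicas=2):
    """Partition the keys horizontally and copy each one to several nodes."""
    cluster = {node: set() for node in range(num_nodes)}
    for key in keys:
        for node in nodes_for(key, num_nodes, replicas):
            cluster[node].add(key)
    return cluster


keys = [f"user:{i}" for i in range(100)]
cluster = place(keys)
```

With two replicas per key, any single node can fail and every key is still readable from another node.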
Aggregate Data Models

▪ The relational model divides the information into tables of tuples. This simple structure for data is one of the key aspects of its success and dominance.
▪ Aggregate-oriented models take a different approach. They tend to operate on data in units that have a more complex structure.
▪ An aggregate is a collection of data that we manipulate and manage as a unit.
➢ A complex record with simple fields, arrays, and records nested inside.
▪ Aggregate-oriented databases work best when most data interaction is done with the same aggregate (intra).
▪ Aggregate-ignorant databases are better when interactions use data organized in many different formations (inter).
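
To make the contrast concrete, the sketch below stores the same order once as an aggregate (a nested record) and once as flat relational tuples. The order data, field names and helper functions are invented for the example.

```python
# Aggregate-oriented: one order is stored and fetched as a single nested unit
order_aggregate = {
    "order_id": 1001,
    "customer": {"id": 7, "name": "Ada"},
    "items": [
        {"sku": "A1", "qty": 2, "price": 5.0},
        {"sku": "B2", "qty": 1, "price": 12.5},
    ],
}

# Relational: the same information split into tables of flat tuples
orders = [(1001, 7)]                 # (order_id, customer_id)
customers = [(7, "Ada")]             # (customer_id, name)
order_items = [                      # (order_id, sku, qty, price)
    (1001, "A1", 2, 5.0),
    (1001, "B2", 1, 12.5),
]


def total_from_aggregate(order):
    """Intra-aggregate access: everything needed is inside one unit."""
    return sum(item["qty"] * item["price"] for item in order["items"])


def total_from_tables(order_id):
    """Relational access: filter the item table by the order key."""
    return sum(qty * price for oid, _sku, qty, price in order_items if oid == order_id)


def customer_name_from_tables(order_id):
    """Relational access: follow keys across two tables (a join in SQL)."""
    customer_id = next(cid for oid, cid in orders if oid == order_id)
    return next(name for cid, name in customers if cid == customer_id)


aggregate_total = total_from_aggregate(order_aggregate)
table_total = total_from_tables(1001)
customer_name = customer_name_from_tables(1001)
```

Reading one whole order is a single lookup in the aggregate form, whereas the tabular form needs key matching across tables but makes it easier to slice the data in other ways.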
Schemaless Databases

Freedom and flexibility: a double-edged sword

▪ Schemaless databases allow for more flexibility than schema-based databases.
▪ However, there is less opportunity to automatically enforce data integrity rules.
▪ There should be an implicit schema expected by users of the data: a set of assumptions about the structure of the data in the code that manipulates it.
▪ A schemaless database shifts the schema into the application code that accesses it.
▪ This becomes problematic if multiple applications, developed by different people, access the same database.
▪ In order to understand the structure of the data, you have to understand the application code.
▪ Being schemaless affects the efficiency of storing and retrieving the data.
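
The idea that the schema moves into application code can be shown with a small sketch. Here a plain Python list stands in for a schemaless collection that accepts any document, and the application carries its own implicit schema. The field names and the `validate_product` helper are hypothetical.

```python
# Hypothetical implicit schema that lives in application code, not in the datastore
REQUIRED_FIELDS = {"name": str, "price": float}


def validate_product(doc):
    """Return a list of problems; an empty list means the document fits the implicit schema."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems


collection = []  # stand-in for a schemaless collection


def insert(doc):
    # The datastore itself enforces nothing
    collection.append(doc)


insert({"name": "Kettle", "price": 24.99})
insert({"title": "Toaster", "price": "19.99"})  # different shape, still accepted

reports = [validate_product(doc) for doc in collection]
```

If a second application writes documents with a different shape, nothing in the store stops it, and each reader has to detect or tolerate the mismatch itself.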
Polyglot Persistence

▪ Using multiple specialized persistent stores rather than one single general-purpose database.
▪ “Monoglot” was (and still is) fine for simple applications (one type of workload).
▪ But applications become complex.
▪ A simple e-commerce platform must have:
• Session data (add to basket)
• Search engine (search for products)
• Recommendation engine
• Payment platform
• Geolocation service
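
As a rough sketch of polyglot persistence, the example below gives three of the e-commerce concerns listed above their own specialized store behind one application class. The in-memory classes are hypothetical stand-ins for a key-value store, a document store and a graph store; they are not real database clients.

```python
from collections import defaultdict


class KeyValueStore:
    """Stand-in for a key-value store: fast lookups by session id."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


class DocumentStore:
    """Stand-in for a document store: searchable product documents."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        self._docs.append(doc)

    def search(self, term):
        return [d for d in self._docs if term.lower() in d["name"].lower()]


class GraphStore:
    """Stand-in for a graph store: 'bought together' edges for recommendations."""

    def __init__(self):
        self._edges = defaultdict(set)

    def link(self, a, b):
        self._edges[a].add(b)
        self._edges[b].add(a)

    def neighbours(self, node):
        return sorted(self._edges[node])


class Shop:
    """One application, several specialized stores."""

    def __init__(self):
        self.sessions = KeyValueStore()   # session data (basket)
        self.catalogue = DocumentStore()  # product search
        self.related = GraphStore()       # recommendations

    def add_to_basket(self, session_id, sku):
        basket = self.sessions.get(session_id, [])
        self.sessions.put(session_id, basket + [sku])

    def recommend(self, session_id):
        basket = self.sessions.get(session_id, [])
        suggestions = {n for sku in basket for n in self.related.neighbours(sku)}
        return sorted(suggestions - set(basket))


shop = Shop()
shop.catalogue.insert({"sku": "K1", "name": "Kettle"})
shop.catalogue.insert({"sku": "T1", "name": "Teapot"})
shop.related.link("K1", "T1")
shop.add_to_basket("s-1", "K1")
recommended = shop.recommend("s-1")
```

Each store only has to be good at one access pattern; the cost is that the application now coordinates several storage technologies.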


NoSQL Systems and CAP

Four Types of NoSQL Databases
Who Uses NoSQL

▪ Over the last few years, NoSQL database technology has experienced explosive growth and accelerating use by large enterprises. For example:
• Tesco uses NoSQL to support its catalogue, pricing, inventory, and coupon applications.
• McGraw-Hill uses NoSQL to power its online learning platform.
• Sky uses NoSQL to manage user profiles for 20 million subscribers.

Source: The Top 10 Enterprise NoSQL Use Cases, 2015
NewSQL Databases

▪ NewSQL storage devices combine the ACID properties with the scalability and fault tolerance offered by NoSQL storage devices.
▪ They generally support SQL-compliant syntax for data definition and data manipulation operations, and they often use a logical relational data model for data storage.
▪ NewSQL databases can be used for developing OLTP systems with very high volumes of transactions, e.g. a banking system, as they leverage in-memory storage.
▪ They can also be used for real-time analytics.
▪ Compared to a NoSQL storage device, a NewSQL storage device provides an easier transition from a traditional RDBMS to a highly scalable database.
▪ Examples of NewSQL databases include VoltDB, NuoDB and InnoDB.
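
The ACID guarantee that NewSQL keeps can be illustrated with the banking example mentioned above. The sketch uses Python's built-in `sqlite3` module purely as a convenient stand-in for an SQL interface with transactions; SQLite is not a NewSQL database and says nothing about scalability. The table, account names and `transfer` helper are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit was rolled back too
        return False


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice, so nothing changes
final = balances(conn)
```

The second transfer credits one account before the debit fails, yet the rollback leaves both balances untouched, which is the atomicity a banking workload depends on.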
