Lec 6 - Big Data Storage Technologies II - NoSQL
Lecture 6
A General View
▪ What is Spark?
▪ NoSQL Databases
▪ Characteristics
▪ NewSQL Databases
▪ Distributed SQL
MapReduce – Brief
▪ MapReduce does not require that the input data conform to any
particular data model.
https://www.todaysoftmag.com/article/1358/hadoop-mapreduce-deep-diving-and-tuning
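The map, shuffle and reduce steps can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. Note that the input is just raw lines of text, with no schema or data model imposed on it.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one raw input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over unstructured lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


lines = ["big data storage", "big data processing"]
counts = word_count(lines)
print(counts)  # {'big': 2, 'data': 2, 'storage': 1, 'processing': 1}
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one process to keep the data flow visible.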
Shortcomings of MapReduce
▪ Forces your data processing into Map and Reduce
▪ Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
▪ Read and write to Disk before and after Map and Reduce
▪ Not efficient for iterative tasks, e.g. Machine Learning
▪ The implementation is primarily written in Java
▪ Only for batch processing
What Is Apache Spark?
▪ Apache Spark is a cluster-computing platform that provides an API for distributed
programming like the MapReduce model but is designed to be fast for interactive
queries and iterative algorithms.
▪ Spark provides in-memory storage for intermediate computations, where programs
can checkpoint data and refer back to it without reloading it from disk.
▪ It incorporates libraries for machine learning (MLlib), SQL for interactive queries
(Spark SQL), stream processing (Structured Streaming) for interacting with real-
time data, and graph processing (GraphX).
[Figure: Iteration 1, Iteration 2]
Afzal Godil, Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis, Information Access Division, ITL, NIST
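Why in-memory storage helps iterative algorithms can be shown with a toy simulation. This is plain Python, not the Spark API: the Dataset class, its disk_reads counter and the two iterative_mean_* functions are hypothetical stand-ins. The first function re-reads its input on every iteration (the disk-to-disk pattern), the second loads it once and reuses it from memory. In Spark itself, the comparable effect would typically come from caching or persisting a dataset.

```python
class Dataset:
    """Stand-in for data stored on disk; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0

    def load(self):
        self.disk_reads += 1  # simulates one expensive read from disk
        return list(self._values)


def iterative_mean_disk(dataset, iterations):
    """Disk-to-disk style: every iteration reloads the input."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.load()
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


def iterative_mean_cached(dataset, iterations):
    """In-memory style: load once, then reuse the data in every iteration."""
    data = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate += 0.5 * (sum(data) / len(data) - estimate)
    return estimate


on_disk = Dataset([1.0, 2.0, 3.0, 4.0])
cached = Dataset([1.0, 2.0, 3.0, 4.0])
result_disk = iterative_mean_disk(on_disk, 10)
result_cached = iterative_mean_cached(cached, 10)
print(on_disk.disk_reads, cached.disk_reads)  # 10 1
```

Both functions compute the same estimate; only the number of simulated disk reads differs, and that gap grows with the number of iterations.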
NoSQL Datastores
▪ Polyglot persistence: using multiple specialized persistent stores rather than
one single general-purpose database.
▪ “Monoglot” was (and still is) fine for simple applications (one type of workload)
• Recommendation engine.
• Payment platform.
▪ Over the last few years, NoSQL database technology has experienced
explosive growth and accelerating use by large enterprises.
NewSQL Databases
▪ NewSQL storage devices combine the ACID properties with the scalability
and fault tolerance offered by NoSQL storage devices.
▪ They generally support SQL-compliant syntax for data definition and data
manipulation operations, and they often use a logical relational data model
for data storage.
▪ NewSQL databases can be used for developing OLTP systems with very
high volumes of transactions as they leverage in-memory storage.
▪ For example, a banking system. They can also be used for real-time analytics,