NoSQL DBs
Lecture Plan
• Introductions
• What is NoSQL?
• Relational vs. NoSQL databases
• Aggregate data model
• Map-Reduce and Hadoop
Relational databases: strengths
• Persistence: large amounts of data can be safely and
securely kept on storage device(s)
– ability to get small bits of information quickly and easily
• Concurrency: many applications may look at the same
body of data at once, possibly modifying that data:
– RDBs handle concurrency by controlling the access to their
data through transactions
– if an error occurs during the processing of changes,
transactions can be rolled back
• Integration: several applications need to communicate
and collaborate to solve a complex task:
– the applications share one database, whose concurrency
control automatically handles their simultaneous access
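The rollback behaviour described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the accounts table, the names and the amounts are made up for the example and are not part of the lecture.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` between two accounts in one transaction.

    Returns True if the transfer was committed, False if it was rolled back.
    """
    try:
        # `with conn` commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Hypothetical example data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A failed transfer leaves both rows exactly as they were, because both UPDATE statements belong to the same transaction.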
Relational databases: weaknesses
• Impedance mismatch: difference between the
relational model and in-memory data structures
– RDBs organize data into a structure of relations and
tuples (tables and rows)
– values in a relational tuple have to be simple (i.e. no
structures, such as nested records or lists)
– in-memory data structures can be more complex than
simple relations
– as a result, in-memory data structures need to be
translated into a relational representation in order to
be stored on disk
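The translation step can be illustrated with a small sketch. The Order and LineItem classes and the two-table layout are assumptions made for this example; they show how one nested in-memory object has to be split into flat rows and reassembled on the way back.

```python
from dataclasses import dataclass, field

@dataclass
class LineItem:
    product: str
    quantity: int

@dataclass
class Order:
    order_id: int
    customer: str
    items: list[LineItem] = field(default_factory=list)  # nested list: not a simple value

def to_rows(order):
    """Flatten one nested Order into rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer)                       # table "orders"
    item_rows = [(order.order_id, i.product, i.quantity) for i in order.items]  # table "order_items"
    return order_row, item_rows

def from_rows(order_row, item_rows):
    """Rebuild the nested Order from its flat rows (the reverse translation)."""
    order_id, customer = order_row
    items = [LineItem(p, q) for (oid, p, q) in item_rows if oid == order_id]
    return Order(order_id, customer, items)
```

Object-relational mappers automate this kind of mapping, but the two representations still have to be kept in step.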
Relational data model
Relational databases: major weakness
• RDBs are designed to be run on a single machine
• Sharding: RDBs could be run as separate servers for
different sets of data
– sharding is controlled by an application, which keeps track
of which RDB server to talk to for each bit of data
– …but querying, referential integrity, transactions and
consistency control across shards still need to be
implemented by the application
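A minimal sketch of application-controlled sharding follows. The server names and the hash-based routing rule are assumptions for illustration; a real deployment might instead route by key range or by a lookup table.

```python
import hashlib

# Hypothetical RDB servers, each holding a different subset of the data.
SHARDS = ["db-server-0", "db-server-1", "db-server-2"]

def shard_for(key: str) -> str:
    """Pick the server responsible for `key` using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]

def group_by_shard(keys):
    """Group keys by server, so the application knows where to send each query."""
    groups = {}
    for key in keys:
        groups.setdefault(shard_for(key), []).append(key)
    return groups
```

A query that touches keys on several servers has to be issued once per group and merged by the application, which is exactly the work the slide says is still left to implement.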
Why NoSQL?
• Relational DBMSs have been a successful technology for more than
twenty years, since they provided reliable persistence, concurrency
control and integration mechanisms
• RDBs are designed to run on a single machine and do not scale out
horizontally
• The need to process large volumes of data led to a shift
from scaling vertically to scaling horizontally on clusters
• Cluster: large number of commodity machines connected with a
network
History of NoSQL
• Early efforts were focused on proprietary systems by
Amazon and Google in the 2000s:
– BigTable from Google
– Dynamo from Amazon
• The term “NoSQL” traces back to a meetup on June
11, 2009 in San Francisco, after which NoSQL DBMSs
became an open-source phenomenon
Relational DBs
Key-Value Stores
Document Stores
Graph Databases
Wide-Column Databases
Types of NoSQL databases
• Key-value: BerkeleyDB, LevelDB, Memcached, Project
Voldemort, Redis, Riak
• Document: OrientDB, RavenDB, Terrastore, CouchDB,
MongoDB
• Column-family: Amazon SimpleDB, Cassandra, Hypertable,
HBase
• Graph: FlockDB, HyperGraphDB, InfiniteGraph, Neo4j
DB-Engines Ranking
https://db-engines.com/en/ranking
NoSQL: aggregate data model
• Explicit storage of a rich structure of closely related
data that is accessed as a unit (called an aggregate)
• Aggregates provide a natural unit of interaction for
many applications
• Suitable for distributed environment
• Downside: difficulty in handling relationships
between entities in different aggregates
Aggregate
• Complex record allowing lists and other record
structures to be nested inside it
• Collection of related objects that are treated
as a unit
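As an illustration, an aggregate can be written as one nested record and stored under a single key. The order below, its field names and the in-memory dictionary standing in for a key-value or document store are all assumptions made for this sketch.

```python
import json

# One aggregate: an order with its line items and shipping address nested inside.
order_aggregate = {
    "id": 99,
    "customer_id": 1,
    "order_items": [
        {"product_id": 27, "product_name": "Example Book", "price": 32.45},
        {"product_id": 31, "product_name": "Example Pen", "price": 1.50},
    ],
    "shipping_address": {"city": "Chicago"},
}

store = {}  # stands in for a key-value or document store

def put(key, aggregate):
    """Write the whole aggregate as a single serialized unit."""
    store[key] = json.dumps(aggregate)

def get(key):
    """Read the whole aggregate back in one lookup."""
    return json.loads(store[key])
```

The whole order is read and written in one operation, which is why an aggregate works well as the unit of distribution across cluster nodes.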
Relational schema
Relational data model
Example of aggregates
Aggregate vs. relational data model
• No normalization:
– instead of being referenced by ID, some records may be
duplicated and copied into each aggregate that needs them
– this minimizes the number of aggregates accessed during data
interaction
– it also minimizes the number of nodes to query and the data
transfer overhead when gathering the data
• Relations between aggregates are still possible:
– e.g., between orders and customers
– aggregate boundaries are context-specific (i.e. depend on the
task and how the data is manipulated by the application)
• Relational databases are aggregate-ignorant:
– and so are NoSQL graph databases
Relational vs. NoSQL DBs: atomicity
• RDBs allow any combination of rows from any tables
to be manipulated in a single ACID (Atomic,
Consistent, Isolated and Durable) transaction:
– many rows spanning many tables are updated as a
single atomic operation
– atomic operations succeed or fail entirely
• Aggregate-oriented NoSQL databases support atomic
manipulation of a single aggregate at a time:
– cross-aggregate atomic operations need to be
implemented programmatically
• Aggregate-ignorant NoSQL DBs support ACID
transactions similar to relational DBs
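The difference can be sketched with a toy in-memory store. AggregateStore and move_credit are hypothetical names invented for this example, not the API of any real NoSQL product: a single-aggregate update is all-or-nothing, while a change spanning two aggregates needs hand-written compensation.

```python
import copy
import threading

class AggregateStore:
    """Toy store in which every update touches exactly one aggregate."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, key, aggregate):
        with self._lock:
            self._data[key] = copy.deepcopy(aggregate)

    def get(self, key):
        with self._lock:
            return copy.deepcopy(self._data[key])

    def update(self, key, fn):
        """Apply fn to a copy and install it only if fn succeeds."""
        with self._lock:
            draft = copy.deepcopy(self._data[key])
            fn(draft)                  # if this raises, the stored aggregate is untouched
            self._data[key] = draft

def move_credit(store, src, dst, amount):
    """Cross-aggregate change: two atomic updates plus manual compensation."""
    def debit(agg):
        if agg["credit"] < amount:
            raise ValueError("insufficient credit")
        agg["credit"] -= amount

    def credit(agg):
        agg["credit"] += amount

    store.update(src, debit)
    try:
        store.update(dst, credit)
    except Exception:
        store.update(src, credit)      # undo the debit by hand, then report the failure
        raise
```

Between the two updates another client could observe the debit without the credit, which is the kind of gap a single ACID transaction in an RDB would close.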
CAP theorem
The CAP theorem
• Many database systems forgo transactions
entirely, because the performance impact is
too high
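• MySQL became popular partly because it was lightweight:
its original default storage engine (MyISAM) did not
support transactions
• Consistency can, and often should, be relaxed in
exchange for availability and performance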
The CAP theorem
Choose DBs
https://www.dataversity.net/choose-right-nosql-database-application/#
Map-Reduce and Hadoop
What is Hadoop?
• A software framework that supports data-intensive distributed
applications.
• Yahoo! has been the largest contributor to the project, and uses Hadoop
extensively across its businesses.
Who uses Hadoop?
http://wiki.apache.org/hadoop/PoweredBy
• Yahoo!
– More than 100,000 CPUs in >36,000 computers.
• Facebook
– Used for reporting/analytics and machine learning, and also
as a storage engine for logs.
– A 1100-machine cluster with 8800 cores and about 12 PB
raw storage.
– A 300-machine cluster with 2400 cores and about 3 PB raw
storage.
– Each (commodity) node has 8 cores and 12 TB of storage.
Very Large Storage Requirements
• Facebook has Hadoop clusters with 15 PB of raw storage
(15,000,000 GB).
• No single storage device can hold this amount of data.
[Figure: reading data from the Data Nodes (step 3: Read data)]
Terabyte Sort Benchmark
• http://sortbenchmark.org/
• Task: sorting 100 TB of data and writing the results
to disk (10^12 records of 100 bytes each).
• This is the first time that a Java program has won this competition.
Example: word count
Counting Words by MapReduce
• Input, one split per node:
– Node 1: "Hello World", "Bye World"
– Node 2: "Hello Hadoop", "Goodbye Hadoop"
• Node 1, Mapper: Hello, <1>; World, <1>; Bye, <1>; World, <1>
• Node 1, Sort & Merge: Bye, <1>; Hello, <1>; World, <1, 1>
• Node 1, Combiner: Bye, <1>; Hello, <1>; World, <2>
• Node 2, after the same steps: Goodbye, <1>; Hadoop, <2>; Hello, <1>
• Sort & Merge across nodes, then Split between the reducers:
– to Node 1: Bye, <1>; Goodbye, <1>; Hadoop, <2>
– to Node 2: Hello, <1, 1>; World, <2>
• Reducer, then Write on Disk:
– Node 1, part-00000: Bye 1; Goodbye 1; Hadoop 2
– Node 2, part-00001: Hello 2; World 2
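The word-count example above can be simulated in a few lines of plain Python. This is a single-process sketch of the map, combine, shuffle and reduce steps, not Hadoop code. The range partitioner with boundary "Hello" is an assumption chosen only so that the two output parts match the example; Hadoop's default partitioner hashes the key instead.

```python
from collections import defaultdict

def mapper(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Sort and merge pairs by key, then sum the counts locally on one node."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return [(word, sum(counts)) for word, counts in sorted(grouped.items())]

def partition(word, boundary="Hello"):
    """Hypothetical range partitioner: words before `boundary` go to reducer 0."""
    return 0 if word < boundary else 1

def reducer(word, counts):
    """Sum all partial counts for one word."""
    return word, sum(counts)

def word_count(splits, num_reducers=2):
    # Map and combine independently per split (one split per node).
    combined = [combine(p for line in split for p in mapper(line)) for split in splits]
    # Shuffle: route every pair to its reducer and group the values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for node_output in combined:
        for word, count in node_output:
            buckets[partition(word)][word].append(count)
    # Reduce: each reducer produces one sorted output part.
    return [[reducer(w, c) for w, c in sorted(b.items())] for b in buckets]

splits = [["Hello World", "Bye World"], ["Hello Hadoop", "Goodbye Hadoop"]]
parts = word_count(splits)
# parts[0] corresponds to part-00000, parts[1] to part-00001
```

The combiner is what turns World, <1, 1> into World, <2> before any data leaves Node 1, which reduces the amount of data sent over the network during the shuffle.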
High Level Architecture of MapReduce
Master Node
Client
JobTracker
Computer