
NoSQL DBs

The document provides an introduction to NoSQL databases, comparing them with relational databases and discussing their strengths and weaknesses. It highlights the evolution of NoSQL, various types of NoSQL databases, and the importance of the aggregate data model. Additionally, it covers Hadoop and its role in processing large data volumes through distributed applications and the Map-Reduce framework.

Introduction to NoSQL

Lecture Plan
• Introductions
• What is NoSQL?
• Relational vs. NoSQL databases
• Aggregate data model
• Map-Reduce and Hadoop
Relational databases: strengths
• Persistence: large amounts of data can be safely and
securely kept on storage device(s)
– ability to get small bits of information quickly and easily
• Concurrency: many applications may look at the same
body of data at once, possibly modifying that data:
– RDBs handle concurrency by controlling the access to their
data through transactions
– if an error occurs during the processing of changes,
transactions can be rolled back
• Integration: several applications need to communicate
and collaborate to solve a complex task:
– concurrency control automatically handles multiple
applications
Relational databases: weaknesses
• Impedance mismatch: difference between the
relational model and in-memory data structures
– RDBs organize data into structure of relations and
tuples (tables and rows)
– values in a relational tuple have to be simple (i.e. no
structures, such as nested records or lists)
– in-memory data structures can be more complex than
simple relations
– as a result, in-memory data structures need to be
translated into a relational representation in order to
be stored on disk
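The translation step can be illustrated with a minimal sketch. It assumes a hypothetical order record with nested line items; the field names and the two-table layout are illustrative and do not come from the lecture.

```python
# Hypothetical in-memory order: a nested structure with a list of line items.
order = {
    "order_id": 99,
    "customer": "Ann",
    "items": [
        {"product": "book", "qty": 2},
        {"product": "pen", "qty": 5},
    ],
}


def to_relational(order):
    """Flatten one nested order into rows for two tables (orders, order_items)."""
    # Each tuple holds only simple values, as the relational model requires.
    order_row = (order["order_id"], order["customer"])
    item_rows = [
        (order["order_id"], item["product"], item["qty"])
        for item in order["items"]
    ]
    return order_row, item_rows


order_row, item_rows = to_relational(order)
print(order_row)
print(item_rows)
```

The nested list cannot be stored inside a single tuple, so it becomes separate rows that refer back to the order by its ID.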
Relational data model
Relational databases: major weakness
• RDBs are designed to be run on a single machine
• Sharding: RDBs could be run as separate servers for
different sets of data
– sharding is controlled by an application, which keeps track
of which RDB server to talk to for each bit of data
– …but querying, referential integrity, transactions and
consistency control across shards still need to be
implemented
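Application-controlled sharding might look like the sketch below. Routing by a hash of the key is only one possible policy and is an assumption here, as are the server names; the lecture only says that the application keeps track of which server holds which data.

```python
import hashlib

# Hypothetical shard servers; the names are placeholders for illustration.
SHARDS = ["rdb-server-0", "rdb-server-1", "rdb-server-2"]


def shard_for(key, shards=SHARDS):
    """Return the server responsible for a key, using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


for customer_id in ["alice", "bob", "carol"]:
    print(customer_id, "->", shard_for(customer_id))
```

The sketch covers routing only. A query or transaction that touches keys on different servers would still have to be coordinated by the application.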
Why NoSQL?
• Relational DBMSs have been a successful technology for more than
twenty years, because they provide reliable persistence, concurrency
control and integration mechanisms
• RDBs are designed to run on a single machine and do not scale
horizontally
• However, the need to process large volumes of data led to a shift
from scaling vertically to scaling horizontally on clusters
• Cluster: large number of commodity machines connected with a
network
History of NoSQL
• Early efforts were focused on proprietary systems by
Amazon and Google in 2000s:
– BigTable from Google
– Dynamo from Amazon
• The term “NoSQL” traces back to a meetup on June
11, 2009 in San Francisco, after which NoSQL DBMSs
became an open-source phenomenon
Relational DBs
Key-Value Stores
Document Stores
Graph Databases
Wide Column Database
Types of NoSQL databases
• Key-value: BerkeleyDB, LevelDB, Memcached, Project
Voldemort, Redis, Riak
• Document: OrientDB, RavenDB, Terrastore, CouchDB,
MongoDB
• Column-family: Amazon SimpleDB, Cassandra, Hypertable,
HBase
• Graph: FlockDB, HyperGraphDB, Infinite Graph, Neo4J
DB-Engines Ranking
https://db-engines.com/en/ranking
NoSQL: aggregate data model
• Explicit storage of a rich structure of closely related
data that is accessed as a unit (called an aggregate)
• Aggregates provide a natural unit of interaction for
many applications
• Suitable for distributed environment
• Downside: difficulty in handling relationships
between entities in different aggregates
Aggregate
• Complex record allowing lists and other record
structures to be nested inside it
• Collection of related objects that are treated
as a unit
Relational schema
Relational data model
Example of aggregates
Aggregate vs. relational data model
• No normalization:
– instead of using IDs, some records may be duplicated and
copied into an aggregate
– this minimizes the number of aggregates we access during data
interaction
– it also minimizes the number of nodes to query and the data
transfer overhead when gathering the data
• Relations between aggregates are still possible:
– e.g., between orders and customers
– aggregate boundaries are context-specific (i.e. depend on the
task and how the data is manipulated by the application)
• Relational databases are aggregate-ignorant:
– and so are NoSQL graph databases
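A minimal sketch of an aggregate, assuming a hypothetical order record; the field names are illustrative, and a plain dict stands in for an aggregate-oriented store.

```python
import json

# Hypothetical order aggregate: line items and the shipping address are nested
# inside the order, so the whole unit is read or written with one key.
order_aggregate = {
    "id": 99,
    "customer_id": 1,  # relation to a separate customer aggregate, by ID
    "shipping_address": {"city": "Chicago"},  # copied in, not normalized
    "order_items": [
        {"product_id": 27, "product_name": "notebook", "price": 32.45},
    ],
}

# A dict stands in for an aggregate-oriented store (key -> serialized aggregate).
store = {}
store["order:99"] = json.dumps(order_aggregate)

# One lookup returns the complete aggregate; no joins are needed.
loaded = json.loads(store["order:99"])
print(loaded["order_items"][0]["product_name"])
```

The customer is still referenced by ID, which shows that relations between aggregates remain possible. Following that reference takes a second lookup, though.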
Relational vs. NoSQL DBs: atomicity
• RDBs allow any combination of rows from any
tables to be manipulated in a single ACID (Atomic,
Consistent, Isolated and Durable) transaction:
– many rows spanning many tables are updated as a
single atomic operation
– atomic operations succeed or fail entirely
• Aggregate-oriented NoSQL databases support atomic
manipulation of a single aggregate at a time:
– cross-aggregate atomic operations need to be
implemented programmatically
• Aggregate-ignorant NoSQL DBs (such as graph databases)
support ACID transactions similar to relational DBs
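The "succeed or fail entirely" behaviour of a relational transaction can be sketched with Python's built-in `sqlite3` module. The tables and the `transfer` helper are hypothetical and exist only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE audit_log (entry TEXT NOT NULL)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()


def transfer(conn, amount, fail=False):
    """Update rows in two tables as one transaction; roll back on any error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'a'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'b'",
                (amount,),
            )
            if fail:
                raise RuntimeError("simulated failure mid-transaction")
            conn.execute("INSERT INTO audit_log VALUES (?)", (f"moved {amount}",))
        return True
    except RuntimeError:
        return False


transfer(conn, 30, fail=True)  # rolled back: no table changes
transfer(conn, 30)             # committed: both tables change together
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
log_count = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(balances, log_count)
```

In an aggregate-oriented store, the same guarantee would normally cover only a single aggregate. An update spanning two aggregates would need this kind of all-or-nothing logic written in the application.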
CAP theorem
The CAP theorem
• Many database systems forgo transactions
entirely, because the performance impact is
too high
• MySQL became popular partly because it was lightweight;
its original default storage engine (MyISAM) did not
support transactions
• Consistency can and should often be relaxed
The CAP theorem
Choose DBs
https://www.dataversity.net/choose-right-nosql-database-application/#
Map-Reduce and Hadoop
What is Hadoop?
• A software framework that supports data-intensive distributed
applications.
• It enables applications to work with thousands of nodes and petabytes of
data.
• Hadoop was inspired by Google's MapReduce and Google File System
(GFS).
• Hadoop is a top-level Apache project being built and used by a global
community of contributors, using the Java programming language.
• Yahoo! has been the largest contributor to the project, and uses Hadoop
extensively across its businesses.
Who uses Hadoop?

http://wiki.apache.org/hadoop/PoweredBy
Who uses Hadoop?
• Yahoo!
– More than 100,000 CPUs in >36,000 computers.

• Facebook
– Used in reporting/analytics and machine learning and also
as storage engine for logs.
– A 1100-machine cluster with 8800 cores and about 12 PB
raw storage.
– A 300-machine cluster with 2400 cores and about 3 PB raw
storage.
– Each (commodity) node has 8 cores and 12 TB of storage.
Very Large Storage Requirements
• Facebook has Hadoop clusters with 15 PB of raw storage
(15,000,000 GB).
• No single storage device can handle this amount of data.
• We need a large set of nodes each storing part of the data.
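As a rough back-of-the-envelope check, the figures above can be combined as follows. The sketch assumes decimal units (1 PB = 1000 TB, matching 15 PB = 15,000,000 GB) and reuses the 12 TB per-node capacity quoted for the Facebook clusters.

```python
import math

TB_PER_PB = 1000  # decimal units, matching 15 PB = 15,000,000 GB
total_pb = 15     # raw storage of the cluster
node_tb = 12      # per-node capacity from the Facebook cluster figures

# Smallest number of nodes whose combined raw capacity covers the total.
nodes_needed = math.ceil(total_pb * TB_PER_PB / node_tb)
print(nodes_needed)
```

This counts raw capacity only, so it is a lower bound on the number of nodes.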


HDFS: Hadoop Distributed File System

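• Reading a file involves the Client, the Namenode and the Data Nodes:
– 1. the Client sends the filename and block index to the Namenode
– 2. the Namenode returns the Datanodes and Blockid for that block
– 3. the Client reads the data from the Data Nodes
• In the figure, each of the blocks 1, 2 and 3 is stored on several
Data Nodes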
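The three numbered steps in the figure can be mimicked with a toy model. This is not the real HDFS API: the class names, the tiny block size and the replica placement rule are all assumptions made for illustration.

```python
# Toy model of the three-step read; not the real HDFS API.
BLOCK_SIZE = 4  # bytes per block, tiny for illustration only


class DataNode:
    def __init__(self):
        self.blocks = {}  # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: which blocks make up a file and where replicas live."""

    def __init__(self):
        self.files = {}  # filename -> list of (block id, [datanodes])

    def block_count(self, filename):
        return len(self.files[filename])

    def locate(self, filename, index):
        return self.files[filename][index]


def write_file(namenode, datanodes, filename, data, replicas=2):
    entries = []
    for offset in range(0, len(data), BLOCK_SIZE):
        number = offset // BLOCK_SIZE
        block_id = f"{filename}#{number}"
        # Place replicas on consecutive nodes (a simplification).
        holders = [datanodes[(number + r) % len(datanodes)] for r in range(replicas)]
        for node in holders:
            node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
        entries.append((block_id, holders))
    namenode.files[filename] = entries


def read_file(namenode, filename):
    chunks = []
    for index in range(namenode.block_count(filename)):
        block_id, holders = namenode.locate(filename, index)  # steps 1 and 2
        chunks.append(holders[0].read(block_id))              # step 3
    return b"".join(chunks)


nodes = [DataNode() for _ in range(3)]
nn = NameNode()
write_file(nn, nodes, "log.txt", b"hello hadoop")
print(read_file(nn, "log.txt"))
```

In this model the NameNode never handles file contents. It only answers location queries, and the data itself comes from the data nodes.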
Terabyte Sort Benchmark
• http://sortbenchmark.org/
• Task: Sorting 100 TB of data and writing the results
to disk (10^12 records of 100 bytes each).
• Yahoo’s Hadoop Cluster is the current winner:
– 173 minutes
– 3452 nodes x (2 Quadcore Xeons, 8 GB RAM)

This is the first time that a Java program has won this competition.
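The benchmark figures imply an end-to-end sorting rate that is easy to work out. The sketch below uses only the numbers quoted above and assumes decimal units (1 TB = 10^12 bytes).

```python
records = 10**12
record_bytes = 100
minutes = 173
nodes = 3452

total_bytes = records * record_bytes  # 10^14 bytes = 100 TB (decimal)
seconds = minutes * 60
cluster_rate = total_bytes / seconds  # bytes per second, whole cluster
node_rate = cluster_rate / nodes      # bytes per second, per node

print(f"{total_bytes / 1e12:.0f} TB in {seconds} s")
print(f"cluster: {cluster_rate / 1e9:.2f} GB/s, per node: {node_rate / 1e6:.2f} MB/s")
```

The cluster as a whole sorts several gigabytes per second, while each node contributes only a few megabytes per second. The result depends on the number of machines, not on the speed of any single one.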
Example: word count
Counting Words by MapReduce

Input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Split 1: Hello World Bye World
Split 2: Hello Hadoop Goodbye Hadoop
Counting Words by MapReduce

Node 1
Input: Hello World Bye World
Mapper: Hello, <1>; World, <1>; Bye, <1>; World, <1>
Sort & Merge: Bye, <1>; Hello, <1>; World, <1, 1>
Combiner: Bye, <1>; Hello, <1>; World, <2>
Counting Words by MapReduce

From Node 1: Bye, <1>; Hello, <1>; World, <2>
From Node 2: Goodbye, <1>; Hadoop, <2>; Hello, <1>
Sort & Merge: Bye, <1>; Goodbye, <1>; Hadoop, <2>; Hello, <1, 1>; World, <2>
Split:
– Bye, <1>; Goodbye, <1>; Hadoop, <2>
– Hello, <1, 1>; World, <2>
Counting Words by MapReduce
Node 1
Input: Bye, <1>; Goodbye, <1>; Hadoop, <2>
Reducer: Bye, <1>; Goodbye, <1>; Hadoop, <2>
Write on Disk (part-00000): Bye 1; Goodbye 1; Hadoop 2

Node 2
Input: Hello, <1, 1>; World, <2>
Reducer: Hello, <2>; World, <2>
Write on Disk (part-00001): Hello 2; World 2
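The word-count pipeline from these slides can be sketched in a single process. This is not Hadoop code: the function names are illustrative, and partitioning the keys across two reducer nodes is left out for brevity.

```python
from collections import defaultdict

splits = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]


def mapper(text):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in text.split()]


def group(pairs):
    """Sort & merge: collect all values for the same key, in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(grouped):
    """Sum the values of each key; the same function serves as the combiner."""
    return {key: sum(values) for key, values in grouped.items()}


# Map, local sort & merge, and combine, once per split (one split per node).
combined = [reducer(group(mapper(text))) for text in splits]

# Shuffle: merge the combiner outputs from all nodes, then reduce.
shuffled = group(pair for partial in combined for pair in partial.items())
counts = reducer(shuffled)

print(combined)
print(shuffled)
print(counts)
```

Summing is associative and commutative, so the reducer can double as the combiner. That is why "World" already arrives at the shuffle as a single pair with the value 2.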
High Level Architecture of MapReduce
• Client Computer: submits jobs to the JobTracker
• Master Node: runs the JobTracker
• Slave Nodes: each runs a TaskTracker, which runs one or more Tasks

High Level Architecture of Hadoop
• MapReduce layer:
– Master Node: JobTracker and a TaskTracker
– each Slave Node: a TaskTracker
• HDFS layer:
– Master Node: NameNode and a DataNode
– each Slave Node: a DataNode

Hadoop Job Scheduling
• FIFO queue matches incoming jobs to
available nodes
– No notion of fairness
– Never switches out running job
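The effect of strict FIFO scheduling can be sketched with a small simulation. It assumes, for simplicity, that all jobs arrive at time 0 and that each job occupies the whole cluster while it runs; the job names and durations are made up.

```python
from collections import deque


def fifo_schedule(jobs):
    """Run jobs strictly in arrival order; return (name, start, finish) tuples.

    jobs: list of (name, duration) in arrival order. All jobs are assumed to
    arrive at time 0 and to use the whole cluster (simplifications).
    """
    queue = deque(jobs)
    clock = 0
    timeline = []
    while queue:
        name, duration = queue.popleft()
        start = clock
        clock += duration  # the running job is never switched out
        timeline.append((name, start, clock))
    return timeline


timeline = fifo_schedule([("long-job", 100), ("short-job", 1)])
print(timeline)
```

The short job needs one time unit but finishes only after the long job has completed. This is the practical meaning of "no notion of fairness".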
Distributed File Cache
• The Distributed Cache facility allows you to
transfer files from the distributed file system
to the local file system (for reading only) of all
participating nodes before the beginning of a
job.
References
• Hadoop Project Page:
http://hadoop.apache.org/
