
NOSQL DATABASE

MODULE -3

Department of
Computer Science & Engineering

www.cambridge.edu.in
Pre-Requisite
Hadoop Distributed File System (HDFS)

• Concurrent processing
MapReduce splits large amounts of data into smaller chunks and processes them in parallel.

• Consolidated output
MapReduce aggregates the data from multiple servers to return a consolidated output.

MapReduce is often used in data warehouses to analyze large volumes of data and build specialized business logic.
What is MapReduce?

MapReduce is a data processing tool used to process data in parallel in a distributed environment. It was introduced in 2004 in the paper "MapReduce: Simplified Data Processing on Large Clusters," published by Google.

MapReduce is a paradigm with two phases: the mapper phase and the reducer phase.
1. In the mapper phase, the input is given in the form of key-value pairs.
2. The output of the mapper is fed to the reducer as input. The reducer runs only after the mapper is over. The reducer also takes input in key-value format, and the output of the reducer is the final output.
A Word Count Example of MapReduce

• Let us understand how MapReduce works with an example where we have a text file called example.txt whose contents are as follows:
Dear, Bear, River,
Car, Car, River,
Deer, Car and Bear
• Now, suppose we have to perform a word count on example.txt using MapReduce. We will be finding the unique words and the number of occurrences of each unique word.
• First, we divide the input into three splits, as shown in the figure. This distributes the work among all the map nodes.

• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.

• Now, a list of key-value pairs is created where the key is the individual word and the value is one. So, for the first line, the mapper emits (Dear, 1), (Bear, 1), (River, 1).
• After the mapper phase, a partition process takes place where sorting and shuffling happen, so that all the tuples with the same key are sent to the corresponding reducer.

• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values for that key.
• Now, each reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1, 1] for the key Bear. It then counts the number of ones in the list and gives the final output: Bear, 2.

• Finally, all the output key-value pairs are collected and written to the output file.
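The whole walkthrough can be sketched in plain Python (no Hadoop involved; the three splits are hardcoded from the example text above):

```python
from collections import defaultdict

# The three input splits from example.txt.
splits = ["Dear Bear River", "Car Car River", "Deer Car Bear"]

def mapper(line):
    # Tokenize the line and emit (word, 1) for every token.
    return [(word, 1) for word in line.split()]

# Map phase: run the mapper on each split.
mapped = [pair for line in splits for pair in mapper(line)]

# Shuffle/sort phase: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the list of ones for each word.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}
```

In a real cluster, each split would be mapped on a different node and the grouped lists would be shipped to the reducers, but the data flow is the same.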
Basic Map-Reduce

To explain the basic idea, let's assume we have chosen orders as our aggregate, with each order having line items. Each line item has a product ID, a quantity, and the price charged. This aggregate makes a lot of sense, as people usually want to see the whole order in one access.

This is exactly the kind of situation that calls for map-reduce.

 However, sales analysis people want to see a product and its total revenue for the last seven days. This report doesn't fit the aggregate structure that we have, which is the downside of using aggregates.

 In order to get the product revenue report, you'll have to visit every machine in the cluster and examine many records on each machine.



MAP FUNCTION

The first stage in a map-reduce job is the map. A map is a function whose input
is a single aggregate and whose output is a bunch of key-value pairs.
(Figure: an order's line items, such as black tea, brown rice tea, and dragonwell tea, mapped to key-value pairs.)
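As a sketch, assuming a hypothetical order aggregate whose line items carry productId, quantity, and price fields (these names are illustrative, not from any particular store), a map function could look like:

```python
# Hypothetical order aggregate: one order with its line items.
order = {
    "orderId": 99,
    "lineItems": [
        {"productId": "puerh", "quantity": 8, "price": 24.0},
        {"productId": "dragonwell", "quantity": 12, "price": 62.4},
    ],
}

def map_order(aggregate):
    # Input: a single order aggregate.
    # Output: one key-value pair per line item, keyed by product ID.
    for item in aggregate["lineItems"]:
        yield item["productId"], {"quantity": item["quantity"],
                                  "price": item["price"]}

pairs = list(map_order(order))
```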
REDUCE FUNCTION

A map operation only operates on a single record; the reduce function takes multiple map outputs with the same key and combines their values. So, a map function might yield 1000 line items from orders for "Database Refactoring"; the reduce function would reduce them down to one record, with the totals for the quantity and revenue. While the map function is limited to working only on data from a single aggregate, the reduce function can use all values emitted for a single key (see Figure 7.2).
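A minimal sketch of such a reduce function, assuming each map output for a product is a dict with quantity and price fields (hypothetical names, matching nothing in any specific product):

```python
def reduce_product(product_id, values):
    # Combine all map outputs emitted for one key into totals.
    total_quantity = sum(v["quantity"] for v in values)
    total_revenue = sum(v["price"] for v in values)
    return product_id, {"quantity": total_quantity, "revenue": total_revenue}

# Two map outputs for the same key, collapsed into one record.
key, totals = reduce_product("puerh", [{"quantity": 8, "price": 24.0},
                                       {"quantity": 4, "price": 12.0}])
```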
Partitioning and Combining

Partitioning
Combining
Composing Map-Reduce Calculations

One simple limitation is that you have to structure your calculations around operations that fit in well with the notion of a reduce operation.

Combine with reduce calculation: suppose we want to know the average ordered quantity of each product. An important property of averages is that they are not composable; that is, if I take two groups of orders, I can't combine their averages alone. Instead, I need to take the total amount and the count of orders from each group, combine those, and then calculate the average from the combined sum and count.
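A quick arithmetic illustration of why averages are not composable, and why carrying the sum and count through the reduce works:

```python
# Two groups of ordered quantities for the same product.
group_a = [5, 15]        # average 10.0
group_b = [10, 20, 30]   # average 20.0

# Averaging the two averages gives the wrong answer:
wrong = (sum(group_a) / len(group_a) + sum(group_b) / len(group_b)) / 2  # 15.0

# Instead, combine the totals and counts, then divide once at the end:
total = sum(group_a) + sum(group_b)  # 80
count = len(group_a) + len(group_b)  # 5
right = total / count                # 16.0
```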
Mapping with reduce calculation
7.3.1. A Two-Stage Map-Reduce Example

As map-reduce calculations get more complex, it's useful to break them down into stages using a pipes-and-filters approach, with the output of one stage serving as input to the next, rather like the pipelines in UNIX.

Consider an example where we want to compare the sales of products for each month in 2011 to the same month in the prior year. To do this, we'll break the calculation down into two stages.

The first stage produces records showing the aggregate figures for a single product in a single month of the year.

The second stage then uses these as inputs and produces the result for a single product by comparing one month's results with the same month in the prior year.

The first stage: creating records for the monthly sales of a product. This stage is similar to the map-reduce examples we've seen so far. The only new feature is using a composite key, so that we can reduce records based on the values of multiple fields.

The second-stage mappers: the second-stage mapper creates base records for year-on-year comparisons.
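A minimal sketch of the two stages wired together, with hypothetical field names and a tiny hardcoded stage-1 output (a real job would compute stage 1 from the order aggregates):

```python
from collections import defaultdict

# Hypothetical stage-1 output: aggregate sales per product, month, and year.
stage1 = [
    {"product": "puerh", "year": 2011, "month": 5, "quantity": 120},
    {"product": "puerh", "year": 2010, "month": 5, "quantity": 100},
]

def stage2_map(record):
    # Re-key by the composite (product, month) so both years meet in one reduce.
    yield (record["product"], record["month"]), (record["year"], record["quantity"])

grouped = defaultdict(list)
for rec in stage1:
    for key, value in stage2_map(rec):
        grouped[key].append(value)

def stage2_reduce(key, values):
    # Compare 2011 with the prior year for this (product, month).
    by_year = dict(values)
    return key, by_year.get(2011, 0) - by_year.get(2010, 0)

results = [stage2_reduce(k, v) for k, v in grouped.items()]
# results: [(('puerh', 5), 20)]
```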
Chapter 8.
Key-Value Databases
Implement
 A key-value store is a simple hash table,
 primarily used when all access to the database is via a primary key.
 Think of a table in a traditional RDBMS with two columns, such as ID and NAME, with the ID column being the key and the NAME column storing the value.
 In an RDBMS, the NAME column is restricted to storing data of type String.
 The application can provide an ID and VALUE and persist the pair; if the ID already exists, the current value is overwritten; otherwise, a new entry is created.

Let's look at how terminology compares in Oracle and Riak.

8.1. What Is a Key-Value Store?
 Key-value stores are the simplest NoSQL data stores to use from an API perspective.
 The client can either
• get the value for a key,
• put a value for a key, or
• delete a key from the data store.
 The value is a blob that the data store just stores, without caring or knowing what's inside; it's the responsibility of the application to understand what was stored.
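A minimal in-memory sketch of this get/put/delete API (a plain Python class as a stand-in, not a real Riak or Redis client):

```python
class KeyValueStore:
    """In-memory sketch of the key-value store API: get, put, delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Overwrites the value if the key already exists;
        # otherwise a new entry is created.
        self._data[key] = value

    def get(self, key):
        # The value is an opaque blob: the store never interprets it.
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:1", {"name": "Ann"})
store.put("user:1", {"name": "Ann", "lang": "en"})  # second put overwrites
```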
Some of the popular key-value databases:
1. Riak [Riak],
2. Redis (often referred to as Data Structure server) [Redis],
3. Memcached DB and its flavors [Memcached],
4. Berkeley DB [Berkeley DB],
5. HamsterDB (especially suited for embedded use) [HamsterDB],
6. Amazon DynamoDB [Amazon's Dynamo] (not open-source), and
7. Project Voldemort [Project Voldemort] (an open-source implementation of Amazon DynamoDB).
Riak Databases

Riak lets us store keys into buckets. If we wanted to store user session data, shopping cart information, and user preferences in Riak, we could just store all of them in the same bucket, with a single key and a single value for all of these objects.

In Riak, such buckets are known as domain buckets, allowing the serialization and deserialization to be handled by the client driver.
8.2. Key-Value Store Features

8.2.1. Consistency:
8.2.2. Transactions
8.2.3. Query Features
8.2.4. Structure of Data
8.2.5. Scaling
8.3. Suitable Use Cases

8.3.1. Storing Session Information
1. Generally, every web session is unique and is assigned a unique session-id value.
2. Applications that store the session id on disk or in an RDBMS will greatly benefit from moving to a key-value store, since everything about the session can be stored by a single PUT request and retrieved using a single GET.
3. This single-request operation makes it very fast, as everything about the session is stored in a single object.
4. Solutions such as Memcached are used by many web applications, and Riak can be used when availability is important.
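A sketch of the single-PUT/single-GET pattern, using a plain dict as a stand-in for the key-value store and a hypothetical session object (the field names are illustrative):

```python
import json

# Everything about the session lives in one value, so one PUT stores it
# and one GET retrieves it.
session = {"user": "vishnu", "cart": ["sku-1"], "theme": "dark"}
session_id = "sess-8f2a"

db = {}                                # stand-in for the key-value store
db[session_id] = json.dumps(session)   # single PUT: serialize and store
restored = json.loads(db[session_id])  # single GET: fetch and deserialize
```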
8.3.2. User Profiles, Preferences
1. Almost every user has a unique user id, username, or some other attribute, as well as preferences such as language, color, timezone, and which products the user has access to.
2. This can all be put into an object, so getting the preferences of a user takes a single GET operation. Similarly, product profiles can be stored.
8.3.3. Shopping Cart Data
E-commerce websites have shopping carts tied to the user. As we want the shopping carts to be available all the time, across browsers, machines, and sessions, all the shopping information can be put into the value, where the key is the user id. A Riak cluster would be best suited for these kinds of applications.
8.4. When Not to Use

8.4.1. Relationships among Data
If you need to have relationships between different sets of data, or correlate the data between different sets of keys, key-value stores are not the best solution to use, even though some key-value stores provide link-walking features.

8.4.2. Multioperation Transactions
If you're saving multiple keys and there is a failure to save any one of them, and you want to revert or roll back the rest of the operations, key-value stores are not the best solution to be used.
8.4.3. Query by Data
If you need to search the keys based on something found in the value part of the key-value pairs, then key-value stores are not going to perform well for you. There is no way to inspect the value on the database side, with the exception of some products like Riak Search or indexing engines like Lucene [Lucene] or Solr [Solr].

8.4.4. Operations by Sets
Since operations are limited to one key at a time, there is no way to operate upon multiple keys at the same time. If you need to operate upon multiple keys, you have to handle this from the client side.
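A sketch of handling multiple keys from the client side, with a plain dict standing in for the store; each key still costs one GET, and the client must deal with any key that fails:

```python
# Stand-in for the key-value store: it only supports one key per operation.
store = {"k1": "a", "k2": "b", "k3": "c"}

def multi_get(keys):
    # Client-side helper: issue one GET per key and collect the results.
    # A missing or failed key simply yields None here; a real client
    # would need its own retry or error-handling policy.
    return {k: store.get(k) for k in keys}

values = multi_get(["k1", "k3"])  # {'k1': 'a', 'k3': 'c'}
```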
