Bda Sem8 It BH Sample Notes

1. The document discusses different types of NoSQL databases including key-value, columnar, and document stores. 2. Key features of each type are described - key-value stores use unique keys to retrieve simple values, column stores group data by column for analytical queries, and document stores use documents which can include complex nested data structures. 3. Examples of popular databases for each type are provided, such as DynamoDB for key-value, HBase for columnar, and MongoDB for document stores.

Uploaded by

Syeda Farhat

A quality product by Brainheaters™

Bh.Notes: BDA
IT Semester 8

A series of Important Concepts/Questions highly recommended for MU Exam

‘C’ SCHEME - 2022-2023

Download Brainheaters App - https://bit.ly/Brainheaters_App


Whatsapp Community Link:- https://chat.whatsapp.com/Gp1v0PbkFoG5DkC8Cyqev4
Q1. Differentiate Traditional vs. Big Data business approach. (P4 - Appeared 1 Time) (5-10M)
Ans.

Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level.
Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Exabytes or Zettabytes.
Traditional database systems deal with structured data. | Big data systems deal with structured, semi-structured, and unstructured data.
Traditional data is generated per hour, per day, or less often. | Big data is generated far more frequently, mainly every second.
The traditional data source is centralized and is managed in centralized form. | The big data source is distributed and is managed in distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data.
The size of the data is very small. | The size is much larger than the traditional data size.
Traditional database tools are sufficient to perform any database operation. | Special kinds of database tools are required to perform any database schema-based operation.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
Its data model is strictly schema-based and static. | Its data model is based on a flat schema and is dynamic.
Traditional data is stable, with known interrelationships. | Big data is not stable, and its relationships are unknown.
Traditional data is in a manageable volume. | Big data is in a huge volume, which becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.

Q2. What is NoSQL? NoSQL data architecture patterns. (P4 - Appeared 1 Time) (5-10M)
Ans. NoSQL is a type of database management system (DBMS) that is designed to handle and store large volumes of unstructured and semi-structured data.
● Unlike traditional relational databases that use tables with predefined schemas to store data, NoSQL databases use flexible data models that can adapt to changes in data structures and are capable of scaling horizontally to handle growing amounts of data.
● The term NoSQL originally referred to “non-SQL” or “non-relational” databases, but the term has since evolved to mean “not only SQL,” as NoSQL databases have expanded to include a wide range of different database architectures and data models.
● An architecture pattern is a logical way of categorizing data that will be stored in the database. NoSQL databases help to perform operations on big data and store it in a valid format. They are widely used because of their flexibility and the wide variety of services they support.

Architecture Patterns of NoSQL:
Data in a NoSQL system is stored in one of the following four data architecture patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database

These are explained below.


1. Key-Value Store Database:
● This model is one of the most basic models of NoSQL databases. As the name suggests, the data is stored in the form of key-value pairs. The key is usually a string, an integer or a sequence of characters, but it can also be a more advanced data type. The value is typically linked or correlated to the key. Key-value databases generally store data as a hash table in which each key is unique. The value can be of any type (JSON, BLOB (Binary Large Object), strings, etc.). This pattern is commonly used in shopping websites and e-commerce applications.
Advantages:
● Can handle large amounts of data and heavy load.
● Easy retrieval of data by keys.
Limitations:
● Complex queries may involve multiple key-value pairs, which may delay performance.
● Data can involve many-to-many relationships, which may collide.
Examples:
● DynamoDB
● Berkeley DB
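
The hash-table idea behind a key-value store can be sketched in a few lines of Python. This is only an illustrative in-memory toy, not how DynamoDB or Berkeley DB are implemented; the class name `KeyValueStore`, its methods and the `cart:42` key are assumptions made up for this example.

```python
class KeyValueStore:
    """Toy in-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._table = {}  # each key is unique; the value can be any object

    def put(self, key, value):
        # Insert a new pair or overwrite the value of an existing key.
        self._table[key] = value

    def get(self, key, default=None):
        # Retrieval by key is a single hash-table lookup.
        return self._table.get(key, default)

    def delete(self, key):
        # Return True if the key existed and was removed, False otherwise.
        if key in self._table:
            del self._table[key]
            return True
        return False


# Hypothetical shopping-cart usage: the key identifies the cart,
# the value is an arbitrary JSON-like structure.
store = KeyValueStore()
store.put("cart:42", {"items": ["pen", "notebook"], "total": 120})
print(store.get("cart:42"))
```

Note that the store can only answer "what is the value for this key?"; a query such as "all carts containing a pen" would have to scan every pair, which is the limitation listed above.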
2. Column Store Database:
● Rather than storing data in relational tuples, the data is stored in individual cells which are further grouped into columns. Column-oriented databases work only on columns.
● They store large amounts of data in columns together. The format and titles of the columns can diverge from one row to another. Every column is treated separately.
● Still, each individual column may contain multiple other columns, like traditional databases. Basically, columns are the mode of storage in this type.
Advantages:
● Data is readily available.
● Queries like SUM, AVERAGE, COUNT can be easily performed on columns.
Examples:
● HBase
● Bigtable by Google
● Cassandra
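
To see why aggregates such as SUM, AVERAGE and COUNT are cheap on a column layout, the following sketch converts a few row records into per-column lists. It is a simplified illustration that assumes every row has the same fields; the sample `ROWS` data and the helper name `to_columns` are invented for this example and do not reflect the storage format of HBase, Bigtable or Cassandra.

```python
# Hypothetical sample data in row-oriented form (one dict per row).
ROWS = [
    {"order_id": 1, "city": "Mumbai", "amount": 250},
    {"order_id": 2, "city": "Pune", "amount": 100},
    {"order_id": 3, "city": "Mumbai", "amount": 400},
]


def to_columns(rows):
    """Regroup row records so that all values of one column sit together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(ROWS)

# An aggregate now touches only the one column it needs.
amounts = columns["amount"]
total = sum(amounts)            # SUM
count = len(amounts)            # COUNT
average = total / count         # AVERAGE
print(total, count, average)
```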

3. Document Database:
● The document database fetches and accumulates data in the form of key-value pairs, but here the values are called documents. A document can be described as a complex data structure.
● A document can take the form of text, arrays, strings, JSON, XML or any similar format. The use of nested documents is also very common. This model is very effective because most of the data created is usually in the form of JSON and is unstructured.
Advantages:
● This type of format is very useful and apt for semi-structured data.
● Storage, retrieval and management of documents is easy.
Limitations:
● Handling multiple documents is challenging.
● Aggregation operations may not work accurately.
Examples:
● MongoDB
● CouchDB

Figure – Document Store Model in form of JSON documents
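
A rough sketch of the document model, using plain Python dictionaries as stand-ins for JSON documents. The collection `PEOPLE`, the helpers `get_path` and `find`, and the dotted-path notation are assumptions chosen for illustration; they are not the API of MongoDB or CouchDB.

```python
# A "collection" of documents; note that the documents need not share a schema.
PEOPLE = [
    {"_id": 1, "name": "Asha", "address": {"city": "Mumbai", "pin": "400001"}},
    {"_id": 2, "name": "Ravi", "address": {"city": "Pune"}, "skills": ["java", "sql"]},
    {"_id": 3, "name": "Meena"},  # no address at all
]


def get_path(document, path):
    """Follow a dotted path such as 'address.city' into nested documents."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None  # the field is missing in this document
        current = current[part]
    return current


def find(collection, path, expected):
    """Return every document whose (possibly nested) field equals expected."""
    return [doc for doc in collection if get_path(doc, path) == expected]


print(find(PEOPLE, "address.city", "Pune"))
```

Because each document carries its own structure, a missing or extra field (as in the third document) does not break the query; it simply fails to match.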

4. Graph Databases:
● This architecture pattern deals with the storage and management of data in graphs. Graphs are structures that depict connections between two or more objects in some data.
● The objects or entities are called nodes and are joined together by relationships called edges. Each edge has a unique identifier. Each node serves as a point of contact for the graph.
● This pattern is very commonly used in social networks, where there are a large number of entities and each entity has one or many characteristics which are connected by edges.
● The relational database pattern has tables that are loosely connected, whereas graphs are often very strong and rigid in nature.

Figure – Graph model format of NoSQL Databases

Advantages:
● Fastest traversal because of connections.
● Spatial data can be easily handled.
Limitations:
● Wrong connections may lead to infinite loops.
Examples:
● Neo4J
● FlockDB (used by Twitter)
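
The node-and-edge idea, and the infinite-loop risk mentioned under Limitations, can be illustrated with a small adjacency-list graph and a breadth-first traversal. This is a minimal sketch, not how Neo4J or FlockDB store data; the `FRIENDS` graph and the function name `reachable_from` are made up for the example.

```python
from collections import deque

# Hypothetical social graph: each node maps to the nodes it is connected to.
# Asha <-> Ravi forms a cycle, so a naive traversal would loop forever.
FRIENDS = {
    "Asha": ["Ravi", "Meena"],
    "Ravi": ["Asha", "Kiran"],
    "Meena": ["Asha"],
    "Kiran": ["Ravi"],
}


def reachable_from(graph, start):
    """Breadth-first traversal that follows edges outward from start."""
    order = [start]
    seen = {start}          # remembering visited nodes prevents infinite loops
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
    return order


print(reachable_from(FRIENDS, "Asha"))
```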
Q3. Differentiate MongoDB vs other NoSQL systems. (P4 - Appeared 1 Time) (5-10M)
Ans.
MongoDB
● MongoDB is a document-oriented database. It is open source software. A relational database has tables, and the tables have rows and columns. Similarly, MongoDB has collections and documents.
● A document is a record in a MongoDB collection. A collection is a set of MongoDB documents. Normally, all documents in a collection have a similar purpose. A single MongoDB server can host multiple databases. ‘mongod.exe’ is the database server and ‘mongo.exe’ is the interactive shell.
● The programmer writes documents in JSON format. MongoDB internally converts JSON objects to BSON, a binary representation of JSON documents; in JSON, keys are written in quotation marks, as are string values. MongoDB is useful for agile software development because documents can be changed easily as requirements change.
● It is easy to change documents by adding new fields and deleting existing ones. MongoDB can store different types of data such as string, number, date, array, Boolean, etc. It also has a binary (buffer) data type for storing video, images, and audio.
● The mixed data type can combine different types of data. MongoDB has easy syntax, so it is easy to write queries. It also supports map-reduce programs in a distributed architecture.

Similarities Between NoSQL and MongoDB
● Both can handle Big Data.
● Both support horizontal scalability without expensive hardware.
● Both support distributed architecture.
● Neither is built around joins the way relational databases are (MongoDB offers a limited join through the $lookup aggregation stage).
● Neither is designed primarily for complex transactions (newer MongoDB versions do support multi-document transactions).
● The schema is dynamic.
● Both are flexible and easy to use.

NoSQL
● NoSQL is a new breed of database management systems that fundamentally differ from relational database systems. A NoSQL database is a highly scalable and flexible database management system.
● A NoSQL database allows the user to store and process unstructured and semi-structured data, which is difficult to do with RDBMS tools.
● NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL systems don’t generally provide the same level of data consistency as SQL databases. In fact, SQL databases have traditionally sacrificed scalability and performance for the ACID properties.
● NoSQL databases aim for high speed and scalability.
● NoSQL systems are architected to operate at high speed and to give the developer wider flexibility.

Q4. Explain in detail Matrix-Vector Multiplication by MapReduce. (P4 - Appeared 1 Time) (5-10M)
Ans.
● Suppose we have an n×n matrix M, whose element in row i and column j is denoted $m_{ij}$. Suppose we also have a vector v of length n, whose jth element is $v_j$.
● Then the matrix-vector product is the vector x of length n, whose ith element $x_i$ is given by $x_i = \sum_{j=1}^{n} m_{ij} v_j$. If n = 100, we do not want to use a DFS or MapReduce for this calculation.
● But this sort of calculation is at the heart of the ranking of Web pages that goes on at search engines, and there, n is in the tens of billions. Let us first assume that n is large, but not so large that vector v cannot fit in main memory and thus be available to every Map task.
● The matrix M and the vector v will each be stored in a file of the DFS. We assume that the row-column coordinates of each matrix element will be discoverable, either from its position in the file, or because it is stored with explicit coordinates, as a triple $(i, j, m_{ij})$.
● We also assume the position of element $v_j$ in the vector v will be discoverable in the analogous way.
● The Map Function: The Map function is written to apply to one element of M. However, if v is not already read into main memory at the compute node executing a Map task, then v is first read, in its entirety, and subsequently will be available to all applications of the Map function performed at this Map task.
● Each Map task will operate on a chunk of the matrix M. From each matrix element $m_{ij}$ it produces the key-value pair $(i, m_{ij} v_j)$. Thus, all terms of the sum that make up the component $x_i$ of the matrix-vector product will get the same key, i.
● The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair $(i, x_i)$.
● If the vector v is too large to fit in main memory, we can divide the matrix into vertical stripes of equal width and divide the vector into an equal number of horizontal stripes, of the same height.
● Our goal is to use enough stripes so that the portion of the vector in one stripe can fit conveniently into the main memory at a compute node. The figure suggests what the partition looks like if the matrix and vector are each divided into five stripes.

Map: for each input element $m_{ij}$, emit $(i, p)$ where $p = m_{ij} v_j$.
Reduce: for each key i, compute $x_i = \sum p$ over all values p received for that key.
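
The Map and Reduce steps above can be sketched as a small single-machine simulation in Python. This is an illustrative sketch, not Hadoop code: the shuffle is imitated with a dictionary, indices are 0-based (the notes use 1-based), and the names `map_fn`, `reduce_fn` and `matrix_vector` are assumptions introduced for this example. It covers the case where the whole vector v fits in memory.

```python
from collections import defaultdict


def map_fn(entry, v):
    """Map: one matrix element (i, j, m_ij) produces the pair (i, m_ij * v_j)."""
    i, j, m_ij = entry
    yield i, m_ij * v[j]


def reduce_fn(i, values):
    """Reduce: sum all partial products that share the row key i."""
    return i, sum(values)


def matrix_vector(entries, v):
    # Shuffle step (simulated): group the mapped values by their key i.
    grouped = defaultdict(list)
    for entry in entries:
        for key, value in map_fn(entry, v):
            grouped[key].append(value)
    # Reduce step: one call per distinct key.
    return dict(reduce_fn(i, values) for i, values in grouped.items())


# M = [[1, 2], [3, 4]] stored as (i, j, m_ij) triples, and v = [5, 6].
M_ENTRIES = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
V = [5, 6]
print(matrix_vector(M_ENTRIES, V))  # x_0 = 1*5 + 2*6, x_1 = 3*5 + 4*6
```

Storing the matrix as explicit triples also means zero entries can simply be left out, which suits the sparse matrices that arise in Web-page ranking.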
Q5. Explain, illustrating the use of MapReduce with real-life databases and applications. (P4 - Appeared 1 Time) (5-10M)
Ans. MapReduce is a programming model used to perform distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has two main tasks, which are divided phase-wise:
● Map Task
● Reduce Task
Let us understand it with a real-life example, which explains the MapReduce programming model in a story manner:
● Suppose the Indian government has assigned you the task of counting the population of India. You can demand all the resources you want, but you have to do this task in 4 months. Calculating the population of such a large country is not an easy task for a single person (you). So what will be your approach?
● One way to solve this problem is to divide the country by states and assign an individual in-charge to each state to count the population of that state.
Task of each individual: Each individual has to visit every home in the state and keep a record of the members of each house as:
State_Name Member_House1
State_Name Member_House2
State_Name Member_House3

State_Name Member_House n
● For simplicity, we have taken only three states.
● This is a simple divide-and-conquer approach and will be followed by each individual to count the people in his/her state. Once they have counted the house members in their respective states, they need to sum up their results and send them to the headquarters at New Delhi.

● We have a trained officer at the headquarters to receive the results from each state and aggregate them by state to get the population of that entire state. With this approach, you can easily count the population of India by summing up the results obtained at the headquarters.
● The Indian Govt. is happy with your work, and the next year they ask you to do the same job in 2 months instead of 4 months. Again you will be provided with all the resources you want.
Since the Govt. has provided you with all the resources, you simply double the number of individuals in charge of each state from one to two. For that, divide each state into 2 divisions and assign a different in-charge to each division:
State_Name_Incharge_division1
State_Name_Incharge_division2
● Similarly, each individual in charge of a division will gather the information about the members of each house and keep a record of it.
We can also do the same thing at the headquarters, so let’s also divide the headquarters into two divisions:
Head-quarter_Division1
Head-quarter_Division2
● Now with this approach, you can find the population of India in two months. But there is a small problem: we never want the divisions of the same state to send their results to different headquarters divisions. In that case, we would have a partial population of that state in Head-quarter_Division1 and another in Head-quarter_Division2, which is inconsistent because we want the consolidated population by state, not partial counts.
● One easy way to solve this problem is to instruct all individuals of a state to send their results to either Head-quarter_Division1 or Head-quarter_Division2, and similarly for all the states.
● Our problem has been solved, and you successfully did it in two months.
● Now, if they ask you to do this process in a month, you know how to approach the solution.
● Great, now we have a scalable model that works well. The model we have seen in this example is like the MapReduce programming model. So now you must be aware that MapReduce is a programming model, not a programming language.
Now let’s discuss the phases and the important things involved in our model.
1. Map Phase: The phase in which the individual in-charges collect the population of each house in their division is the Map Phase.
● Mapper: The individual in-charge who calculates the population.
● Input Splits: The state or the division of the state.
● Key-Value Pair: The output from each individual Mapper, for example the key is Rajasthan and the value is 2.
2. Reduce Phase: The phase in which you aggregate your results.
● Reducers: The individuals who aggregate the actual result; in our example, the trained officers. Each Reducer produces its output as a key-value pair.
3. Shuffle Phase: The phase in which the data is copied from the Mappers to the Reducers is the Shuffle Phase. It comes in between the Map and Reduce phases. The Map Phase, Shuffle Phase, and Reduce Phase are the three main phases of MapReduce.
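
The population-count story can be sketched as a tiny single-process simulation of the three phases. This is only an illustration under stated assumptions: it is not Hadoop code, the record format `(state, members_in_house)` and the function names are invented here, and the partitioner is a simple deterministic stand-in for the rule "all results of one state go to the same headquarters division".

```python
from collections import defaultdict


def mapper(record):
    """Map: one house record becomes a (state, member_count) pair."""
    state, members = record
    yield state, members


def partition(state, num_reducers):
    """Send every pair of the same state to the same reducer (HQ division)."""
    return sum(state.encode("utf-8")) % num_reducers


def reducer(state, counts):
    """Reduce: add up all house counts received for one state."""
    return state, sum(counts)


def run_job(records, num_reducers=2):
    # Shuffle: route mapper output to reducers and group it by key.
    divisions = [defaultdict(list) for _ in range(num_reducers)]
    for record in records:
        for state, members in mapper(record):
            divisions[partition(state, num_reducers)][state].append(members)
    # Each reducer aggregates only the states routed to it.
    result = {}
    for division in divisions:
        for state, counts in division.items():
            key, total = reducer(state, counts)
            result[key] = total
    return result


# Hypothetical survey records: (state, number of members in one house).
HOUSES = [("Rajasthan", 2), ("Kerala", 4), ("Rajasthan", 3), ("Goa", 1), ("Kerala", 5)]
print(run_job(HOUSES))
```

Because the partition depends only on the state name, no state ends up with partial totals in two different divisions, which is exactly the inconsistency the story warns about.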

Q6. Write a short note on the Counting Bloom Filter. (P4 - Appeared 1 Time) (5-10M)
Ans. In a classical Bloom filter, the deletion operation is not possible: an element can only be inserted, and the filter can only be checked for whether an element is present or not.
● If anyone tries to reset the bits in the bit array corresponding to the k positions of an element, it may lead to false negatives, because other elements may share those bits. Fortunately, missing deletion support is not a problem for many real-world applications.
Counting Bloom Filter and its Implementation
● The most popular extension of the classical Bloom filter that supports deletion is the Counting Bloom filter, proposed by Li Fan, Pei Cao, Jussara Almeida, and Andrei Z. Broder in 2000. The Counting Bloom filter introduces an array of m counters $\{C_j\}_{j=1}^{m}$, one corresponding to each bit in the filter’s array.
● The Counting Bloom filter allows approximating the number of times each element has been seen in the filter by incrementing the corresponding counters every time the element is added.
● The associated CountingBloomFilter data structure contains a bit array and an array of counters of length m, all initialized to zero.
● When a new element is inserted into the CountingBloomFilter, we first compute its corresponding bit positions, then for each position we increment the associated counter. These basic ideas can be coded in any preferred language.

Pseudo Code (insertion):
Input: Element x belonging to D
Input: Counting Bloom Filter with m counters $\{C_j\}_{j=1}^{m}$ and k hash functions $\{h_i\}_{i=1}^{k}$
Algo:
for i = 1 to k do
    j = h_i(x)
    C_j = C_j + 1

Next, to test whether an element is present or not, we check the counters corresponding to its k positions. If the values of all the counters are greater than 0, the element is probably present.

Pseudo Code (membership test):
Input: Element x belonging to D
Input: Counting Bloom Filter with m counters and k hash functions
Algo:
for i = 1 to k do
    j = h_i(x)
    if C_j < 1 then
        return False
return True

Finally, the important deletion part. Deletion is quite similar to insertion, but in reverse. To delete an element x, we compute all k hash values $\{h_i(x)\}_{i=1}^{k}$ and decrease the corresponding counters.
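
The three operations described above (insert, test, delete) can be put together in a short Python sketch. It is a minimal illustration, not a production implementation: deriving the k hash functions by salting SHA-256 with the index i is an assumption made here for convenience, the class keeps only the counter array (a counter greater than 0 plays the role of a set bit), and the sizes m = 64, k = 3 are arbitrary.

```python
import hashlib


class CountingBloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of counters
        self.k = k                  # number of hash functions
        self.counters = [0] * m     # all counters start at zero

    def _positions(self, item):
        # Assumed hashing scheme: salt SHA-256 with i to imitate h_1..h_k.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for j in self._positions(item):
            self.counters[j] += 1

    def __contains__(self, item):
        # "Probably present" only if every one of the k counters is positive.
        return all(self.counters[j] > 0 for j in self._positions(item))

    def remove(self, item):
        # Only decrement for items that test as present, so no counter
        # can drop below zero.
        if item not in self:
            return False
        for j in self._positions(item):
            self.counters[j] -= 1
        return True


cbf = CountingBloomFilter(m=64, k=3)
cbf.add("apple")
cbf.add("apple")
cbf.remove("apple")
print("apple" in cbf)   # one copy of "apple" is still counted
```

As with the classical filter, false positives remain possible; removing an element that was never inserted but happens to test positive would corrupt the counters, so callers should delete only what they know they added.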

Q7. Describe in detail the Datar-Gionis-Indyk-Motwani Algorithm. OR Explain in detail Query Answering in the DGIM Algorithm. (P4 - Appeared 1 Time) (5-10M)
Ans. Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N. For this purpose we use the DGIM algorithm.
The basic version of the algorithm uses $O(\log^2 N)$ bits to represent a window of N bits, and allows us to estimate the number of 1’s in the window with an error of no more than 50%.
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N, so they can be represented by $\log_2 N$ bits. If we also store the total number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window the bit with that timestamp is.
We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the size of the bucket.
To represent a bucket, we need $\log_2 N$ bits to represent the timestamp (modulo N) of its right end. To represent the number of 1’s we only need $\log_2 \log_2 N$ bits. The reason is that we know this number i is a power of 2, say $2^j$, so we can represent i by coding j in binary. Since j is at most $\log_2 N$, it requires $\log_2 \log_2 N$ bits. Thus, $O(\log N)$ bits suffice to represent a bucket.
There are six rules that must be followed when representing a stream by buckets.
● The right end of a bucket is always a position with a 1.
● Every position with a 1 is in some bucket.
● No position is in more than one bucket.
● There are one or two buckets of any given size, up to some maximum size.
● All sizes must be a power of 2.
● Buckets cannot decrease in size as we move to the left (back in time).
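
The bucket representation can be made concrete with a small Python sketch. Two points go beyond what these notes spell out and follow the standard textbook description of DGIM: when a third bucket of some size appears, the two oldest of that size are merged, and a query sums the sizes of the buckets inside the last k bits but counts only half of the oldest one. For readability the sketch stores absolute timestamps instead of timestamps modulo N; the class and method names are assumptions for this example.

```python
class DGIM:
    def __init__(self, window):
        self.window = window    # N, the window length
        self.time = 0           # timestamp of the most recent bit
        self.buckets = []       # newest first: [right_end_timestamp, size]

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its right end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.time, 1])
        # Restore "one or two buckets per size": while three buckets share
        # a size, merge the two oldest into one bucket of double the size.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            self.buckets[i + 1][1] *= 2   # keeps the more recent right end
            del self.buckets[i + 2]
            i += 1

    def query(self, k):
        """Estimate the number of 1's among the last k bits (k <= window)."""
        sizes = [size for ts, size in self.buckets if ts > self.time - k]
        if not sizes:
            return 0
        # Count every bucket fully except the oldest, which counts as half.
        return sum(sizes[:-1]) + sizes[-1] / 2


dgim = DGIM(window=16)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    dgim.add(bit)
print(dgim.buckets, dgim.query(10))
```

Only the oldest bucket may straddle the query boundary, which is why the estimate is off by at most half of that bucket's size.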
