Bda Sem8 It BH Sample Notes
BrainheatersTM
Bh.Notes: BDA
IT Semester 8
the enterprise level. enterprise level.
Traditional data: Its volume ranges from Gigabytes to Terabytes.
Big data: Its volume ranges from Petabytes to Exabytes or Zettabytes.

Traditional data: Traditional database system deals with structured data.
Big data: Big data systems deal with structured, semi-structured, and
unstructured data.

Traditional data: Traditional database tools are required to perform any
database operation.
Big data: Special kinds of database tools are required to perform any
database schema-based operation.

Traditional data: Normal functions can manipulate data.
Big data: Special kinds of functions can manipulate data.

Traditional data: Its data model is a strict schema.
Big data: Its data model is a flat schema.

Traditional data: Traditional data is stable, with inter-relationships
among the data.
Big data: Big data is not stable, and its relationships are unknown.
Time) (5-10M)
Ans. NoSQL is a type of database management system (DBMS) that is
designed to handle and store large volumes of unstructured and
semi-structured data.
● Unlike traditional relational databases that use tables with
predefined schemas to store data, NoSQL databases use flexible
data models that can adapt to changes in data structures and are
capable of scaling horizontally to handle growing amounts of data.
● The term NoSQL originally referred to “non-SQL” or “non-relational”
databases, but the term has since evolved to mean “not only SQL,”
Pairs. The key is usually a string, an integer, or a sequence of
characters, but can also be a more advanced data type. The value is
typically linked or correlated to the key. Key-value pair storage
databases generally store data as a hash table where each key is
unique. The value can be of any type (JSON, BLOB (Binary Large
Object), strings, etc.). This type of pattern is usually used in shopping
websites or e-commerce applications.
Advantages:
● Can handle large amounts of data and heavy load.
● Easy retrieval of data by keys.
Limitations:
●
collide.
Examples:
● DynamoDB
● Berkeley DB
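The key-value pattern described above can be sketched with a plain Python dict, which is itself a hash table with unique keys. This is only an in-memory illustration, not the API of DynamoDB or Berkeley DB; the helper names `put` and `get` and the sample keys are assumptions made for the example.

```python
import json

# Hypothetical in-memory store: a dict is a hash table, so every key is unique.
store = {}

def put(key, value):
    """Insert the value under the key, overwriting any earlier value."""
    store[key] = value

def get(key, default=None):
    """Retrieve a value directly by its key."""
    return store.get(key, default)

# Values can be of any type: a JSON string, raw bytes (a BLOB), or a string.
put("cart:1001", json.dumps({"items": ["pen", "notebook"], "total": 120}))
put("session:42", b"\x00\x01\x02")
put("user:7:name", "Asha")

cart = json.loads(get("cart:1001"))
print(cart["items"])        # ['pen', 'notebook']
print(get("missing-key"))   # None
```

Retrieval is a single lookup by key, which is why this pattern suits shopping-cart style workloads.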
individual cells which are further grouped into columns.
Column-oriented databases work only on columns.
● They store large amounts of data into columns together. Format
and titles of the columns can diverge from one row to another.
Every column is treated separately.
● But still, each individual column may contain multiple other
columns like traditional databases.
Basically, columns are the unit of storage in this type.
Advantages:
● Data is readily available.
● Queries like SUM, AVERAGE, COUNT can be easily performed on
columns.
Examples:
● HBase
● Bigtable by Google
● Cassandra
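A minimal sketch of why column storage makes SUM, AVERAGE and COUNT cheap: when each column is kept as its own list, an aggregate reads only that one column. The sample rows and column names below are made up for illustration and do not come from HBase, Bigtable or Cassandra.

```python
# Made-up sample data in the usual row-oriented form.
rows = [
    {"id": 1, "city": "Mumbai", "amount": 250},
    {"id": 2, "city": "Pune", "amount": 100},
    {"id": 3, "city": "Mumbai", "amount": 400},
]

# Column-oriented layout: one list per column, stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Aggregates touch a single column and never read "id" or "city".
amounts = columns["amount"]
total = sum(amounts)
count = len(amounts)
average = total / count
print(total, count, average)  # 750 3 250.0
```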
Advantages:
● This type of format is very useful and apt for semi-structured
data.
● Storage, retrieval, and management of documents are easy.
Limitations:
4. Graph Databases:
● This architecture pattern deals with the storage and
management of data in graphs. Graphs are basically structures
that depict connections between two or more objects in some data.
● The objects or entities are called nodes and are joined together by
relationships called edges. Each edge has a unique identifier. Each
are a large number of entities and each entity has one or many
characteristics which are connected by edges.
Advantages:
Examples:
● Neo4j
● FlockDB (used by Twitter)
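The node and edge model can be sketched in a few lines. This is an in-memory illustration only, not the Neo4j or FlockDB API; the node names, the `FOLLOWS` label and the `neighbours` helper are assumptions made for the example.

```python
from collections import defaultdict

# Nodes are the entities; each carries its own characteristics.
nodes = {
    "alice": {"type": "user"},
    "bob": {"type": "user"},
    "carol": {"type": "user"},
}

# Each edge has a unique identifier and joins two nodes with a relationship.
edges = {
    "e1": ("alice", "FOLLOWS", "bob"),
    "e2": ("bob", "FOLLOWS", "carol"),
    "e3": ("alice", "FOLLOWS", "carol"),
}

# Adjacency index so that traversals do not scan every edge.
adjacency = defaultdict(list)
for edge_id, (src, label, dst) in edges.items():
    adjacency[src].append((edge_id, label, dst))

def neighbours(node, label):
    """Return the nodes reachable from `node` over edges with this label."""
    return [dst for _, lbl, dst in adjacency[node] if lbl == label]

print(neighbours("alice", "FOLLOWS"))  # ['bob', 'carol']
```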
‘mongod.exe’ is the database server and ‘mongo.exe’ is the
interactive shell.
● The programmer writes documents in JSON format. MongoDB
internally converts JSON objects to BSON. BSON is a binary encoding
of JSON-like documents; in the JSON text, keys and string values are
written in quotation marks. MongoDB is useful for agile software
development because the structure of a large amount of data can be
changed easily.
● It is easy to change documents by adding new fields and deleting
existing ones. MongoDB can store different types of data such as
string, number, date, array, Booleans, etc. It also has a buffer data
type for storing video, images, and audio.
● The mixed data type can combine different types of data. MongoDB
has easy syntax, so it is easy to write queries. It also supports
map-reduce programs in a distributed architecture.
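The flexible document model can be illustrated without a running server. The sketch below uses only Python's standard `json` module and a list standing in for a collection; it does not call MongoDB, and the `find` helper and the sample documents are hypothetical.

```python
import json

# Two JSON documents in the same "collection" with different fields.
collection = [
    json.loads('{"name": "Asha", "age": 21, "skills": ["python", "sql"]}'),
    json.loads('{"name": "Ravi", "enrolled": true}'),
]

# Documents change by adding new fields or deleting existing ones.
collection[1]["city"] = "Mumbai"
del collection[0]["age"]

def find(docs, **criteria):
    """Hypothetical helper: return documents whose fields equal the criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

print(find(collection, city="Mumbai"))
# [{'name': 'Ravi', 'enrolled': True, 'city': 'Mumbai'}]
```

No schema migration is needed when one document gains a field that the other lacks.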
● NoSQL is an approach to database design that can accommodate
a wide variety of data models, including key-value, document,
columnar and graph formats. NoSQL systems don’t generally
provide the same level of data consistency as SQL databases. In
fact, SQL databases have traditionally sacrificed scalability and
performance for the ACID properties.
● NoSQL databases are designed for high speed and scalability.
● NoSQL systems are architected to operate at high speed and to
give developers greater flexibility.
is $v_j$.
● Then the matrix-vector product is the vector x of length n, whose ith
element $x_i$ is given by $x_i = \sum_{j=1}^{n} m_{ij} \times v_j$. If n = 100,
we do not want to use a DFS or MapReduce for this calculation.
element will be discoverable, either from its position in the file, or
because it is stored with explicit coordinates, as a triple $(i, j, m_{ij})$.
● We also assume the position of element $v_j$ in the vector v will be
discoverable in the analogous way.
● The Map Function: The Map function is written to apply to one
element of M. However, if v is not already read into main memory at
the compute node executing a Map task, then v is first read, in its
entirety, and subsequently will be available to all applications of the
Map function performed at this Map task.
● Each Map task will operate on a chunk of the matrix M. From each
matrix element $m_{ij}$ it produces the key-value pair $(i, m_{ij} \times v_j)$.
Thus, all terms of the sum that make up the component $x_i$ of the
matrix-vector product get the same key, i.
● The Reduce Function: The Reduce function sums all the values
associated with a given key i. The result will be a pair $(i, x_i)$.
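The Map and Reduce functions above can be simulated on one machine. This is a minimal sketch, not a Hadoop job: it assumes the matrix is stored as triples (i, j, m_ij) with 0-based indices, that the whole vector v fits in memory, and the names `map_fn`, `reduce_fn` and `matrix_vector` are chosen only for illustration.

```python
from collections import defaultdict

def map_fn(triple, v):
    """Map: one matrix element (i, j, m_ij) becomes the pair (i, m_ij * v_j)."""
    i, j, m_ij = triple
    return i, m_ij * v[j]

def reduce_fn(i, values):
    """Reduce: sum all values that share the key i, giving (i, x_i)."""
    return i, sum(values)

def matrix_vector(triples, v):
    # Group the map output by key, as the shuffle step would do.
    groups = defaultdict(list)
    for triple in triples:
        key, value = map_fn(triple, v)
        groups[key].append(value)
    return dict(reduce_fn(i, values) for i, values in groups.items())

# M = [[1, 2], [3, 4]] stored as triples, v = [5, 6].
M = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
v = [5, 6]
print(matrix_vector(M, v))  # {0: 17, 1: 39}
```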
● We can divide the matrix into vertical stripes of equal width and
● Our goal is to use enough stripes so that the portion of the vector in
one stripe can fit conveniently into the main memory at a compute
node. The figure suggests what the partition looks like if the matrix and
vector are each divided into five stripes.
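The striping idea can be sketched as follows. The code assumes 0-based (i, j, m_ij) triples and a hypothetical `width` parameter for the stripe width; each stripe of the matrix is processed with only the matching piece of the vector, which is the only part that would need to sit in main memory.

```python
from collections import defaultdict

def striped_matrix_vector(triples, v, width):
    """Multiply using vertical matrix stripes and matching vector stripes."""
    # Column j of the matrix belongs to stripe j // width.
    matrix_stripes = defaultdict(list)
    for i, j, m_ij in triples:
        matrix_stripes[j // width].append((i, j, m_ij))

    sums = defaultdict(int)
    for stripe, chunk in matrix_stripes.items():
        start = stripe * width
        v_part = v[start:start + width]  # only this piece of v is needed here
        for i, j, m_ij in chunk:
            sums[i] += m_ij * v_part[j - start]
    return dict(sums)

# M = [[1, 2, 3, 4], [5, 6, 7, 8]] as triples, split into stripes of width 2.
rows = [[1, 2, 3, 4], [5, 6, 7, 8]]
M = [(i, j, value) for i, row in enumerate(rows) for j, value in enumerate(row)]
v = [1, 0, 2, 1]
print(striped_matrix_vector(M, v, width=2))  # {0: 11, 1: 27}
```

The result is the same as without stripes, because each x_i is still the sum of the partial sums from every stripe.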
Reduce: Compute $x_i = \sum_{j} m_{ij} \times v_j$
Q5. Illustrate the use of MapReduce with real-life databases
and applications. (P4 - Appeared 1 Time) (5-10M)
Ans. MapReduce is a programming model used to perform distributed
● Map Task
● Reduce Task
Let us understand it with a real-life example that explains the
MapReduce programming model as a story:
State_Name Member_House3
State_Name Member_House n
● For simplicity, we have taken only three states.
Again you will be provided with all the resources you want.
Since the Govt. has provided you with all the resources, you will simply
double the number of individuals in charge of each state from
one to two. To do that, divide each state into two divisions and assign a
different in-charge to each division:
State_Name_Incharge_division1
State_Name_Incharge_division2
● Similarly, each individual in charge of a division will gather the
information about members from each house and keep a record
of it.
We can do the same thing at the headquarters, so let’s also divide
the headquarters into two divisions:
Head-quarter_Division1
Head-quarter_Division2
● Now with this approach, you can find the population of India in
two months. But there is a small problem with this, we never
to approach the solution.
● Great, now we have a good scalable model that works well.
The model we have seen in this example is like the MapReduce
programming model. So now you must be aware that
MapReduce is a programming model, not a programming
language.
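The census story maps onto code fairly directly. In this sketch the state names, division labels and member counts are made-up sample data; the in-charges play the role of the map phase and the headquarters plays the role of the reduce phase.

```python
from collections import defaultdict

# Made-up records: (state, division, members counted in one house).
house_records = [
    ("Maharashtra", "division1", 4),
    ("Maharashtra", "division2", 5),
    ("Gujarat", "division1", 3),
    ("Gujarat", "division2", 6),
    ("Kerala", "division1", 2),
]

def map_phase(records):
    """Each in-charge emits a (state, members) pair for every house visited."""
    for state, _division, members in records:
        yield state, members

def reduce_phase(pairs):
    """Headquarters adds up the counts that arrive for each state."""
    totals = defaultdict(int)
    for state, members in pairs:
        totals[state] += members
    return dict(totals)

per_state = reduce_phase(map_phase(house_records))
print(per_state)                # {'Maharashtra': 9, 'Gujarat': 9, 'Kerala': 2}
print(sum(per_state.values()))  # 20
```

Adding more in-charges or more headquarters divisions changes only how the records are split, not the two functions.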
Now let’s discuss the phases and important things involved in our model.
1. Map Phase: The Phase where the individual in-charges are collecting the
main phases of our MapReduce.
Q6. Write a short note on the Counting Bloom Filter. (P4 - Appeared 1
Time) (5-10M)
Ans. In a classical Bloom filter the deletion operation is not possible,
i.e., an element can only be inserted, and it can be checked whether the
element is present in the data or not.
● If anyone tries to delete an element by changing the bits in the bit
array corresponding to its k positions, it may lead to false negatives.
Fortunately, missing deletion support is not a problem for many
real-world applications,
Counting Bloom Filter and its Implementation
Input: Counting Bloom Filter with m counters $\{C_j\}_{j=1}^{m}$ and k hash
functions $\{h_i\}_{i=1}^{k}$
Algo (to insert an element x):
for i = 1 to k do
    j = h_i(x)
    C_j = C_j + 1
Next, to test whether an element is present or not, we check the counters
corresponding to its k positions. If the value of all the counters is greater
than 0, the element is reported as present (with a possible false positive).
Algo:
for i = 1 to k do
    j = h_i(x)
    if CountingBloomFilter[j] < 1 then
        return False
return True
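The two algorithms above can be turned into a small runnable sketch. Two things here are assumptions not stated in the notes: the k hash functions are derived by salting SHA-256 with the index i, and a `remove` method is included that decrements the counters, which is the usual way a counting filter supports deletion.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k):
        self.m = m                # number of counters
        self.k = k                # number of hash functions
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: the i-th hash is SHA-256 of "i:item", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for j in self._positions(item):
            self.counters[j] += 1

    def contains(self, item):
        # Present only if every one of the k counters is at least 1.
        return all(self.counters[j] >= 1 for j in self._positions(item))

    def remove(self, item):
        # Only safe for items that were actually inserted earlier.
        if not self.contains(item):
            return False
        for j in self._positions(item):
            self.counters[j] -= 1
        return True

cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("apple")
print(cbf.contains("apple"))  # True
cbf.remove("apple")
print(cbf.contains("apple"))  # False
```

Because counters are decremented instead of bits being cleared, removing one element does not erase the evidence of other elements that share a position.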
Explain in detail Query Answering in the DGIM Algorithm. (P4 - Appeared 1
Time) (5-10M)
Ans. Suppose we have a window of length N on a binary stream. We want
at all times to be able to answer queries of the form “how many 1’s are
there in the last k bits?” for any k ≤ N. For this purpose we use the DGIM
algorithm.
The basic version of the algorithm uses $O(\log^2 N)$ bits to represent a
window of N bits, and allows us to estimate the number of 1’s in the window
with an error of no more than 50%.
To begin, each bit of the stream has a timestamp, the position in which it
arrives. The first bit has timestamp 1, the second has timestamp 2, and so
on.
Since we only need to distinguish positions within the window of length N,
we represent timestamps modulo N, so each one needs only $\log_2 N$ bits.
If we also store the total number of bits ever seen in the stream
(i.e., the most recent timestamp) modulo N, then we can determine from a
timestamp modulo N where in the current window the bit with that
timestamp is.
We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2,
and we refer to the number of 1’s as the size of the bucket.
● The right end of a bucket is always a position with a 1.
● Every position with a 1 is in some bucket.
● No position is in more than one bucket.
● There are one or two buckets of any given size, up to some
maximum size.
● All sizes must be a power of 2.
● Buckets cannot decrease in size as we move to the left (back in
time).
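The bucket rules above, together with the usual way of answering a query, can be sketched as below. This is an illustrative single-machine version under some assumptions: timestamps are kept as plain integers instead of modulo N, the class and method names are hypothetical, and the query adds up every bucket whose right end lies in the last k bits while counting only half of the oldest such bucket.

```python
class DGIM:
    def __init__(self, window_size):
        self.N = window_size
        self.t = 0          # timestamp of the most recent bit
        self.buckets = []   # newest first; each is [right_end_timestamp, size]

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, [self.t, 1])
        i = 0
        # Three buckets of one size: merge the two oldest into one of double
        # size, keeping the more recent right end, then check the next size.
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            self.buckets[i + 1][1] *= 2
            del self.buckets[i + 2]
            i += 1

    def count_ones(self, k):
        """Estimate the number of 1's among the last k bits (k <= N)."""
        cutoff = self.t - k
        sizes = [size for ts, size in self.buckets if ts > cutoff]
        if not sizes:
            return 0
        # Count only half of the oldest bucket; a bucket of size 1 is exact.
        return sum(sizes) - sizes[-1] // 2

d = DGIM(window_size=8)
for bit in [1, 0, 1, 1, 0, 1, 1, 1]:
    d.add(bit)
print(d.count_ones(8))  # 5 (the exact count is 6)
print(d.count_ones(3))  # 3 (exact)
```

Only the oldest bucket in range is uncertain, which is where the error bound of at most 50% comes from.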