BDA Handy Notes
Volume: Social media platforms generate vast amounts of data every second. Users
post status updates, tweets, photos, videos, comments, and likes, resulting in a massive
volume of data. For example, Facebook, Twitter, Instagram, and other platforms
collectively produce billions of pieces of content daily.
Velocity: The speed at which data is created and disseminated on social media is
incredibly high. Real-time interactions and the constant stream of user-generated
content mean that data is produced and needs to be processed at a fast pace. This rapid
generation and flow of information are crucial elements of Big Data.
Variety: Social media data is highly diverse, encompassing text, images, videos, audio,
and various forms of metadata (like timestamps, geolocation tags, and user
interactions). This variety provides a rich and multifaceted dataset that can be analyzed
for numerous insights across different media formats.
Veracity: While social media data can be noisy and unstructured, it often reflects
genuine user opinions, behaviors, and trends. However, this also means dealing with
challenges related to data accuracy and trustworthiness, as the information can
sometimes be misleading or false.
Value: The data collected from social media holds significant value due to its potential
to reveal insights into consumer behavior, public sentiment, market trends, and more.
Businesses, researchers, and governments can leverage this data for decision-making,
marketing strategies, and policy-making.
i. Volume
Data storage has grown exponentially because data is now much more than text.
Data can be found in the form of videos, music, and large images on our social
media channels.
As the database grows, the applications and architecture built to support the data need
to be re-evaluated quite often.
Sometimes the same data is re-evaluated from multiple angles, and even though the
original data is the same, the newly found intelligence creates an explosion of data.
ii. Velocity
The data growth and social media explosion have changed how we look at data.
There was a time when we believed that yesterday's data was recent.
As a matter of fact, newspapers still follow that logic.
However, news channels and radio have changed how fast we receive the news.
Today, people rely on social media to keep them updated with the latest happenings. On social
media, a message that is only a few seconds old may no longer interest users.
They often discard old messages and pay attention to recent updates. Data
movement is now almost real time, and the update window has shrunk to fractions of
a second.
iii. Variety
Data can be stored in multiple formats: for example, in a database, Excel, CSV, or Access
file, or, for that matter, in a simple text file.
Sometimes the data is not even in the traditional formats we assume; it may be in the
form of video, SMS, PDF, or something we might not have thought about. It is the need
of the organization to arrange it and make it meaningful.
This would be easy if all the data were in the same format, but that is rarely the case.
The real world has data in many different formats, and that is the challenge we need to
overcome with Big Data.
3. CAP Theorem
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties (Consistency, Availability, and Partition tolerance) at the same time in a
distributed system with data replication.
Consistency: It means that the nodes will have the same
copies of a replicated data item visible for various transactions.
Availability: It means that each read or write request for a data
item will either be processed successfully or will receive a
message that the operation cannot be completed. Every non-
failing node returns a response for all the read and write
requests in a reasonable amount of time.
Partition tolerance: It means that the system can continue
operating even if the network connecting the nodes has a fault
that results in two or more partitions, where the nodes in each
partition can only communicate among each other. That means, the system continues to
function and upholds its consistency guarantees in spite of network partitions.
How NoSQL systems guarantee the BASE property
NoSQL relies upon a softer model known as the BASE model. BASE (Basically Available,
Soft state, Eventual consistency).
Basically Available: Guarantees the availability of the data. There will be a response to any
request (can be failure too).
Soft state: The state of the system could change over time.
Eventual consistency: The system will eventually become consistent once it stops receiving
input.
NoSQL databases give up the A, C and/or D requirements of ACID (atomicity, consistency,
durability), and in return they improve scalability.
4. Distributed Storage System of Hadoop
With growing data velocity the data size easily outgrows the storage limit of a machine. A
solution would be to store the data across a network of machines.
Such file systems are called distributed file systems. Since data is stored across a network
all the complications of a network come in.
HDFS is a unique design that provides storage for extremely large files with streaming data
access pattern and it runs on commodity hardware.
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-
times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market.
This is one of the features that especially distinguishes HDFS from other file systems.
i. HDFS: HDFS is the primary component of the Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes,
thereby maintaining the metadata in the form of log files. It consists of two core
components: the Name Node and the Data Node.
ii. YARN: Yet Another Resource Negotiator helps to manage the resources across the clusters.
It performs scheduling and resource allocation for the Hadoop system.
iii. MapReduce: By making use of distributed and parallel algorithms, MapReduce makes
it possible to carry over the processing logic and helps to write applications which
transform big data sets into manageable ones.
iv. PIG: Pig was originally developed by Yahoo. It works on the Pig Latin language, a
query-based language similar to SQL. It is a platform for structuring the data flow and
for processing and analysing huge data sets.
v. HIVE: With the help of SQL methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query Language).
vi. Mahout: Mahout brings machine-learning capability to a system or application. Machine
learning, as the name suggests, helps the system to develop itself based on patterns,
user/environmental interaction, or algorithms.
vii. Apache Spark: It is a platform that handles all the process-intensive tasks like batch
processing, interactive real-time processing, graph conversions, and visualization.
viii. Apache HBase: It is a NoSQL database which supports all kinds of data and is thus
capable of handling anything in a Hadoop database. It provides capabilities similar to
Google's BigTable and can therefore work on big data sets effectively.
ix. Solr, Lucene: These are two services that perform the tasks of searching and indexing
with the help of some Java libraries.
x. Zookeeper: Zookeeper provides coordination and synchronization among the resources or
components of Hadoop by performing synchronization, inter-component communication,
grouping, and maintenance.
xi. Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit.
6. Matrix Multiplication using MapReduce
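One common one-step MapReduce formulation of C = M x N: the mapper emits each element of M keyed by (i, k) for every column k of N, and each element of N keyed by (i, k) for every row i of M; the reducer for key (i, k) multiplies entries with matching j and sums them. A minimal Python sketch of this idea (the small 2x2 matrices and their layout are illustrative assumptions):

from collections import defaultdict

# Illustrative matrices for C = M x N, keyed by index pairs.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # element (i, j) of M
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}   # element (j, k) of N
I, J, K = 2, 2, 2

def mapper():
    # Emit ((i, k), ('M', j, value)) and ((i, k), ('N', j, value)) so that every
    # element reaches every output cell it contributes to.
    for (i, j), v in M.items():
        for k in range(K):
            yield (i, k), ('M', j, v)
    for (j, k), v in N.items():
        for i in range(I):
            yield (i, k), ('N', j, v)

def reducer(key, values):
    # For output cell (i, k), multiply M[i][j] with N[j][k] and sum over j.
    m_vals = {j: v for tag, j, v in values if tag == 'M'}
    n_vals = {j: v for tag, j, v in values if tag == 'N'}
    return key, sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)

# Simulate the shuffle phase: group mapper output by key.
groups = defaultdict(list)
for key, value in mapper():
    groups[key].append(value)

result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}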
7. Grouping and Aggregation Algorithm using MapReduce
Grouping: Grouping involves categorizing data based on certain criteria. In the context of
MapReduce, it involves gathering data with similar keys together. In MapReduce, grouping
is typically done during the shuffle phase, where the output of the map tasks is transferred
to the reduce tasks.
Aggregation: Aggregation involves performing calculations on grouped data to derive
summary statistics or other meaningful insights. In MapReduce, aggregation is performed
within the reduce tasks. Each reduce task receives a group of values associated with a
particular key.
Map Phase:
Input data is divided into smaller chunks and processed by multiple map tasks in
parallel.
Each map task processes a portion of the input data and emits key-value pairs, where
the key represents the grouping criterion (e.g., product ID) and the value is the data
associated with that key (e.g., sales amount).
Shuffle and Sort Phase:
Output from the map tasks is shuffled and sorted based on the keys. All values
corresponding to the same key are grouped together.
Reduce Phase:
Each reduce task receives a set of key-value pairs where the keys are sorted and
grouped.
The reducer processes each group of values (associated with the same key) and
performs aggregation operations (e.g., summing up sales amounts).
The results of the aggregation are typically written to an output file or another
storage system for further analysis or use.
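A minimal Python sketch of the three phases above, using the product-ID / sales-amount example (the concrete records are illustrative assumptions):

from collections import defaultdict

# Illustrative (product_id, sales_amount) records.
records = [("p1", 100), ("p2", 250), ("p1", 50), ("p3", 75), ("p2", 25)]

def mapper(record):
    product_id, amount = record
    yield product_id, amount          # key = grouping criterion, value = data

def shuffle_and_sort(mapped):
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)     # all values for the same key grouped together
    return sorted(groups.items())     # keys delivered to the reducer in sorted order

def reducer(key, values):
    return key, sum(values)           # aggregation, e.g. total sales per product

mapped = [pair for record in records for pair in mapper(record)]
for key, values in shuffle_and_sort(mapped):
    print(reducer(key, values))       # ('p1', 150), ('p2', 275), ('p3', 75)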
8. Explain different types of NoSQL data stores and their typical usage.
i. Document-Based Database:
The document-based database is a non-relational database. Instead of storing the data in
rows and columns (tables), it uses the documents to store the data in the database.
A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects
used in applications, i.e. less translation is required to use these data in the applications.
In a document database, particular elements can be accessed using an index value that is
assigned for faster querying.
Use cases:
a. User Profiles: Since documents have a flexible schema, they can store different
attributes and values. This enables users to store different types of information.
b. Management of Content: Since the schema is flexible, collecting and storing any
kind of data is possible.
ii. Key-Value Stores:
It is a non-relational database. The simplest form of a NoSQL database is a key-value store.
Every data element in the database is stored in key-value pairs.
The data can be retrieved by using a unique key allotted to each element in the database.
The values can be simple data types like strings and numbers or complex objects.
Use cases:
a. Storing session information: offers to save and restore sessions.
b. Shopping carts: easily handle the loss of storage nodes and quickly scale big data
during a holiday/sale on an e-commerce application.
iii. Column Oriented Databases:
It is a non-relational database that stores the data in columns instead of rows.
When we want to run analytics on a small number of columns, we can read those columns
directly without loading unwanted data into memory.
They are designed to read data more efficiently and retrieve it with greater speed.
A columnar database is used to store large amounts of data.
Use cases:
a. Business Intelligence
b. Managing data warehouses
c. Reporting Systems
iv. Graph-Based databases:
These focus on the relationship between the elements. It stores the data in the form of nodes
in the database.
The connections between the nodes are called links or relationships.
In this database, it is easy to identify the relationship between the data by using the links.
Queries return results in real time.
Use cases:
a. Social networking site
b. Recommendation engine
c. Risk assessment
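As a rough illustration of the first two models, the same hypothetical user profile can be stored as one JSON document in a document store, or broken into individually addressable key-value pairs in a key-value store (all names and keys below are assumptions):

import json

# Document store: the whole profile lives in one JSON document, flexible schema.
profile_document = json.dumps({
    "user_id": "u42",
    "name": "Asha",
    "interests": ["cricket", "music"],       # nested/varied attributes are fine
    "last_login": "2024-01-15T10:30:00Z",
})

# Key-value store: each piece of data is reached only through its unique key.
key_value_pairs = {
    "user:u42:name": "Asha",
    "user:u42:interests": "cricket,music",
    "session:u42": "token-abc123",           # e.g. session information
    "cart:u42": "item17,item42",             # e.g. a shopping cart
}

print(profile_document)
print(key_value_pairs["cart:u42"])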
9. Suppose a data stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Let the hash
function used be h(x) = (3x + 1) mod 5. Show how the Flajolet-Martin algorithm
will estimate the number of distinct elements in this stream.
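A minimal Python sketch of the Flajolet-Martin estimate for this stream: it computes h(x) for each element, takes R as the maximum number of trailing zeros among the hash values, and reports 2^R. How to treat h(x) = 0, which has no trailing 1, is a matter of convention; the sketch simply skips it, which is an assumption rather than part of the question:

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]

def h(x):
    return (3 * x + 1) % 5

def trailing_zeros(n):
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

max_r = 0
for x in stream:
    value = h(x)
    if value == 0:
        continue                      # convention-dependent; see note above
    max_r = max(max_r, trailing_zeros(value))

print([h(x) for x in stream])         # hash values of the stream elements
print(2 ** max_r)                     # FM estimate of distinct elements: 2^R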
10. Use of CURE algorithm to cluster big data
CURE (Clustering Using Representatives) assumes a
Euclidean space. However, it does not assume anything
about the shape of clusters; they need not be normally
distributed, and can even have strange bends, S-shapes, or
even rings. Instead of representing clusters by their
centroid, it uses a collection of representative points, as the
name implies.
Initialization in CURE
i. Take a small sample of the data and cluster it in main memory. In principle, any
clustering method could be used, but as CURE is designed to handle oddly shaped
clusters, it is often advisable to use a hierarchical method in which clusters are merged
when they have a close pair of points.
ii. Select a small set of points from each cluster to be representative points. These points
should be chosen to be as far from one another as possible.
iii. Move each of the representative points a fixed
fraction of the distance between its location and the
centroid of its cluster. Perhaps 20% is a good
fraction to choose. Note that this step requires a
Euclidean space, since otherwise, there might not be
any notion of a line between two points.
iv. For the second step, we pick the representative
points. If the sample from which the clusters are
constructed is large enough, we can count on a
cluster's sample points at greatest distance from one another lying on the boundary of
the cluster.
v. Finally, we move the representative points a fixed fraction of the distance from their
true location toward the centroid of the cluster. In Fig. 5.1.3 both clusters have their
centroid in the same place: the center of the inner circle. Thus, the representative points
from the circle move inside the cluster, as was intended. Points on the outer edge of the
ring also move into their cluster, but points on the ring's inner edge move outside the
cluster.
Completion of the CURE Algorithm
vi. The next phase of CURE is to merge two clusters
if they have a pair of representative points, one
from each cluster, that are sufficiently close. The
user may pick the distance that defines "close."
This merging step can repeat, until there are no
more sufficiently close clusters.
vii. The last step of CURE is point assignment. Each
point p is brought from secondary storage and
compared with the representative points. We
assign p to the cluster of the representative point that is closest to p.
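A minimal sketch in Python of the shrinking step described above: the representative points are moved a fixed fraction (20% here) of the way toward their cluster's centroid. The small square cluster and the choice of representatives are illustrative assumptions:

import numpy as np

# Illustrative cluster and two representative points chosen far apart.
cluster = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
representatives = np.array([[0.0, 0.0], [2.0, 2.0]])
alpha = 0.2                                  # fraction of the way to the centroid

centroid = cluster.mean(axis=0)              # (1.0, 1.0)
shrunk = representatives + alpha * (centroid - representatives)
print(shrunk)   # [[0.2 0.2] [1.8 1.8]]: each point moved 20% toward the centroid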
11. Shuffle and Sort phase and Reducer phase in MapReduce
In the map phase the task tracker performs the computation on local data and output is
generated.
The output is called intermediate results and is stored on temporary local storage.
After the map phase is over, all the intermediate values for a given intermediate key are
combined together into a list.
The list is given to a reducer.
There may be single or multiple reducers.
All values associated with a particular intermediate key are guaranteed to go to the same
reducer.
The intermediate keys, and their value lists, are passed to the reducer in sorted key order.
This step is known as 'shuffle and sort'.
The reducer outputs zero or more final key-value pairs.
These are written to HDFS.
The reducer usually emits a single key-value pair for each input key.
The job tracker starts a reduce task on any one of the nodes and instructs it to grab the
intermediate data from the completed map tasks.
The reducer performs the final computation and the output is written to HDFS.
The client reads the output from the file and the job completes.
Selection: Selection lets you apply a condition over the data you have and only get the
rows that satisfy the condition.
Projection: In order to select some columns only we use the projection operator.
Union: We concatenate two tables vertically. The duplicate rows are removed implicitly.
Intersection: It intersects two tables and selects only the common rows.
Difference: The rows that are in the first table but not in the second are selected for output.
Natural Join: Merge two tables based on a common column.
Grouping and Aggregation: Group rows based on some set of columns and apply some
aggregation (sum, count, max, min, etc.) on some column of the small groups that are
formed.
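A minimal Python sketch of a few of these operations on an in-memory relation (the table, column names, and values are illustrative assumptions):

from collections import defaultdict

# Illustrative relation as a list of rows.
employees = [
    {"name": "Ann", "dept": "HR", "salary": 50},
    {"name": "Bob", "dept": "IT", "salary": 70},
    {"name": "Cal", "dept": "IT", "salary": 60},
]

# Selection: keep only the rows satisfying a condition.
it_only = [r for r in employees if r["dept"] == "IT"]

# Projection: keep only some columns.
names = [{"name": r["name"]} for r in employees]

# Grouping and aggregation: group by dept, then apply an aggregate (sum).
totals = defaultdict(int)
for r in employees:
    totals[r["dept"]] += r["salary"]      # reduce-side aggregation per key

print(it_only, names, dict(totals))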
Grouping and Aggregation using MapReduce:
i. Input Files: The data for MapReduce job is stored in Input Files. Input files reside in
HDFS.
ii. InputFormat: InputFormat defines how to divide and read these input files. It selects
the files or other objects for input.
iii. InputSplits: It represents the data that will be processed by an individual Mapper. For
each split, one map task is created.
iv. RecordReader: It communicates with the inputSplit and then transforms the data into
key-value pairs suitable for reading by the Mapper.
v. Mapper: It processes input records produced by the RecordReader and generates
intermediate key-value pairs.
vi. Combiner: Combiner is Mini-reducer that performs local aggregation on the mapper’s
output. It minimizes the data transfer between mapper and reducer.
vii. Partitioner: Partitioner comes into existence if we are working with more than one
reducer. It grabs the output of the combiner and performs partitioning.
viii. Shuffling and Sorting: Once all the mappers complete, their output is shuffled to the
reducer nodes. The framework then merges and sorts this intermediate output.
ix. Reducer: The reducer takes the set of intermediate key-value pairs produced by the
mappers as input and runs a reducer function on each of them to generate the output.
The output of the reducer is the final output. The framework stores this output on
HDFS.
x. RecordWriter: It writes output key-value pairs from the Reducer phase to the output
files.
xi. OutputFormat: OutputFormat defines the way the RecordWriter writes these output
key-value pairs to the output files. Its instances, provided by Hadoop, write files in
HDFS. Thus OutputFormat instances write the final output of the reducer to HDFS.
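A minimal word-count sketch in Python that mirrors this flow, with a combiner doing local aggregation on each mapper's output and a partitioner deciding which reducer receives each key (the input splits and the number of reducers are illustrative assumptions):

from collections import Counter, defaultdict

# Two "input splits", each a list of lines, standing in for InputSplits.
splits = [["big data big"], ["data big analytics"]]
NUM_REDUCERS = 2

def mapper(line):
    for word in line.split():
        yield word, 1                  # intermediate key-value pairs

def combiner(pairs):
    # Mini-reducer: local aggregation on one mapper's output.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local.items()

def partition(word):
    return hash(word) % NUM_REDUCERS   # which reducer receives this key

# Map + combine per split, then shuffle to the reducers by partition.
reducer_input = defaultdict(lambda: defaultdict(list))
for split in splits:
    mapped = [pair for line in split for pair in mapper(line)]
    for word, count in combiner(mapped):
        reducer_input[partition(word)][word].append(count)

# Reduce: sum the value list for each key, in sorted key order.
for r, groups in reducer_input.items():
    for word, counts in sorted(groups.items()):
        print(r, word, sum(counts))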
17. Issues of stream processing
Scalability: Ensuring the system can process more data streams without a performance decline.
Fault tolerance: Guaranteeing data analysis can continue even after failures.
Data integrity: Maintaining data integrity by validating the data being processed.
Time: Streams are unbounded and continuously serve events, so time is an important aspect
of stream processing.
Processing delays: Delays can occur due to network traffic, slow processing speeds, or
pressure from subsequent operators.
18. PageRank
PageRank refers to an algorithm developed by Larry Page and Sergey Brin, the founders
of Google, to rank web pages in search engine results.
PageRank is a crucial component of Google's search algorithm, used to determine the
importance of a webpage based on the quantity and quality of links pointing to it.
The fundamental idea behind PageRank is that a webpage is considered more important if
it is linked to by other important pages. This concept is akin to academic citations: a paper
is considered more influential if it is cited by other reputable papers.
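A minimal power-iteration sketch in Python with a teleport (damping) factor: each page splits its current rank equally among its out-links, and every page additionally receives a (1 - beta)/N teleport share. The four-page graph below is an illustrative assumption, not the graph from the exercise that follows:

# Illustrative web graph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
beta = 0.8                      # teleport factor: follow a link with probability beta
pages = list(graph)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(2):              # e.g. two iterations
    new_rank = {p: (1 - beta) / len(pages) for p in pages}
    for page, out_links in graph.items():
        share = rank[page] / len(out_links)
        for target in out_links:
            new_rank[target] += beta * share   # each page splits its rank equally
    rank = new_rank

print(rank)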
Using the web graph shown below, compute the PageRank at every node at the end of the
second iteration. Use teleport factor = 0.8.
The simplest case of the DGIM (Datar-Gionis-Indyk-Motwani) algorithm uses O(log^2 N)
bits to represent a window of N bits and enables the estimation of the number of 1s in the
window with an error of no more than 50%.
Divide the window into buckets, each consisting of: the timestamp of its rightmost 1, and
the number of 1s in the bucket. This number must be a power of 2, and we refer to the
number of 1s as the size of the bucket.
The right end of a bucket always corresponds to a position with a 1.
The number of 1s in a bucket must be a power of 2.
Either one or two buckets with the same power-of-2 size exist.
Buckets do not overlap in timestamps.
Buckets are sorted by size.
Buckets disappear when their end-time is more than N time units in the past.
Example: Suppose the stream is that of Fig. 4.9.1 and k = 10. Then the query asks for the
number of 1s in the ten rightmost bits, which happen to be 0110010110. Let the current
timestamp (the time of the rightmost bit) be t. Then the two buckets with one 1, having
timestamps t - 1 and t - 2, are completely included in the answer.
The bucket of size 2, with timestamp t - 4, is also completely included. However, the
rightmost bucket of size 4, with timestamp t - 8, is only partly included. We know it is the
last bucket to contribute to the answer, because the next bucket to its left has a timestamp
less than t - 9 and is thus completely out of the window.
On the other hand, we know the buckets to its right are completely inside the range of the
query because of the existence of a bucket to their left with timestamp t - 9 or greater.
Our estimate of the number of 1s in the last ten positions is therefore 6. This number is the
sum of the two buckets of size 1, the bucket of size 2, and half the bucket of size 4 that is
partially within range. Of course, the correct answer is 5.
Suppose the above estimate of the answer to a query involves a bucket b of size 2^j that is
partially within the range of the query. Let us consider how far from the correct answer c
our estimate could be. There are two cases: the estimate could be larger or smaller than c.
Case 1: The estimate is less than c. In the worst case, all the 1s of b are actually within
the range of the query, so the estimate misses half of bucket b, that is, 2^(j-1) 1s. But in
this case c is at least 2^j; in fact it is at least 2^(j+1) - 1, since there is at least one
bucket of each of the sizes 2^(j-1), 2^(j-2), ..., 1. We conclude that our estimate is at
least 50% of c.
Case 2: The estimate is greater than c. In the worst case, only the rightmost bit of bucket
b is within range, and there is only one bucket of each of the sizes smaller than b. Then
c = 1 + 2^(j-1) + 2^(j-2) + ... + 1 = 2^j, and the estimate we give is
2^(j-1) + 2^(j-1) + 2^(j-2) + ... + 1 = 2^j + 2^(j-1) - 1. We see that the estimate is
no more than 50% greater than c.
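A minimal Python sketch of how the estimate in the example above is formed from the bucket list: buckets whose rightmost 1 falls inside the window are kept, the oldest of them is counted as half its size, and the rest are counted in full (the concrete value of t is an assumption made only so the code runs):

t, k = 100, 10
# Buckets from the example, oldest to newest: size 4 at t-8, size 2 at t-4,
# and two buckets of size 1 at t-2 and t-1; each bucket is (timestamp, size).
buckets = [(t - 8, 4), (t - 4, 2), (t - 2, 1), (t - 1, 1)]

# Keep only buckets whose rightmost 1 lies inside the window of the last k bits.
in_window = [(ts, size) for ts, size in buckets if ts > t - k]

# The oldest such bucket is only partially inside, so count half of it;
# the remaining buckets are counted in full.
estimate = in_window[0][1] // 2 + sum(size for _, size in in_window[1:])
print(estimate)   # 2 + (2 + 1 + 1) = 6; the true count in the example is 5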
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page
D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its
existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer
all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links,
it would transfer one-third of its existing value, or approximately 0.083, to A. At the completion
of this iteration, page A will have a PageRank of approximately 0.458.
22. PCY Algorithm
In pass 1 of A-Priori, most memory is idle. We store only individual item counts. We use
the idle memory to reduce memory required in pass 2.
Pass 1: In addition to item counts, maintain a hash table with as many buckets as fit in memory.
Keep a count for each bucket into which pairs of items are hashed. For each bucket just keep
the count, not the actual pairs that hash to the bucket.
FOR (each basket)
    FOR (each item in the basket)
        add 1 to the item's count;
    FOR (each pair of items)
        hash the pair to a bucket;
        add 1 to the count for that bucket;
Pairs of items need to be generated from the input file; they are not present in the file.
We are not just interested in the presence of a pair, but we need to see whether it is present
at least s (support) times.
If a bucket contains a frequent pair, then the bucket is surely frequent. However, even
without any frequent pair, a bucket can still be frequent. So, we cannot use the hash to
eliminate any member (pair) of a "frequent" bucket.
But, for a bucket with total count less than s, none of its pairs can be frequent. Pairs that
hash to this bucket can be eliminated as candidates (even if the pair consists of
2 frequent items). E.g., even though {A} and {B} are frequent, the count of the bucket
containing {A, B} might be less than s.
Pass 2: Only count pairs that hash to frequent buckets.
Replace the bucket counts by a bit-vector: 1 means the bucket count exceeded the support s
(call it a frequent bucket); 0 means it did not.
The 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory.
Also, decide which items are frequent and list them for the second pass.
Count all pairs (i, j) that meet the conditions for being a candidate pair:
i. Both i and j are frequent items.
ii. The pair (i, j) hashes to a bucket whose bit in the bit-vector is 1 (i.e. a frequent bucket).
Both conditions are necessary for the pair to have a chance of being frequent.
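A minimal Python sketch of both passes of PCY (the baskets, support threshold s, and number of hash buckets are illustrative assumptions):

from collections import Counter
from itertools import combinations

baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "D"]]
s = 3                 # support threshold
NUM_BUCKETS = 7

def bucket(pair):
    return hash(frozenset(pair)) % NUM_BUCKETS

# Pass 1: count items, and hash every pair in every basket to a bucket count.
item_counts = Counter()
bucket_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    for pair in combinations(sorted(basket), 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= s}
bit_vector = {b for b, c in bucket_counts.items() if c >= s}   # frequent buckets

# Pass 2: count only candidate pairs: both items frequent AND the pair hashes
# to a frequent bucket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        if set(pair) <= frequent_items and bucket(pair) in bit_vector:
            pair_counts[pair] += 1

print({p: c for p, c in pair_counts.items() if c >= s})   # frequent pairs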
23. Collaborative Filtering