BDA Handy Notes
Volume: Social media platforms generate vast amounts of data every second. Users
post status updates, tweets, photos, videos, comments, and likes, resulting in a massive
volume of data. For example, Facebook, Twitter, Instagram, and other platforms
collectively produce billions of pieces of content daily.
Velocity: The speed at which data is created and disseminated on social media is
incredibly high. Real-time interactions and the constant stream of user-generated
content mean that data is produced and needs to be processed at a fast pace. This rapid
generation and flow of information are crucial elements of Big Data.
Variety: Social media data is highly diverse, encompassing text, images, videos, audio,
and various forms of metadata (like timestamps, geolocation tags, and user
interactions). This variety provides a rich and multifaceted dataset that can be analyzed
for numerous insights across different media formats.
Veracity: While social media data can be noisy and unstructured, it often reflects
genuine user opinions, behaviors, and trends. However, this also means dealing with
challenges related to data accuracy and trustworthiness, as the information can
sometimes be misleading or false.
Value: The data collected from social media holds significant value due to its potential
to reveal insights into consumer behavior, public sentiment, market trends, and more.
Businesses, researchers, and governments can leverage this data for decision-making,
marketing strategies, and policy-making.
i. Volume
Data storage has grown exponentially because data is now much more than text.
Data can be found in the form of videos, music, and large images on our social
media channels.
As the database grows, the applications and architecture built to support the data need
to be re-evaluated quite often.
Sometimes the same data is re-evaluated from multiple angles, and even though the
original data is the same, the newly found intelligence creates an explosion of data.
ii. Velocity
The data growth and social media explosion have changed how we look at data.
There was a time when we believed that yesterday's data was recent.
As a matter of fact, newspapers still follow that logic.
However, news channels and radio have changed how fast we receive the news.
Today, people rely on social media to keep them updated with the latest happenings. On social
media, a message that is only a few seconds old may no longer interest users.
They often discard old messages and pay attention to recent updates. Data
movement is now almost real time, and the update window has shrunk to fractions of
a second.
iii. Variety
Data can be stored in multiple formats: for example, in a database, Excel, CSV, or Access
file, or, for that matter, in a simple text file.
Sometimes the data is not even in the traditional formats we assume; it may be in the
form of video, SMS, PDF, or something we might not have thought about. It is the need
of the organization to arrange it and make it meaningful.
This would be easy if all the data were in the same format, but that is rarely the case.
The real world has data in many different formats, and that is the challenge we need to
overcome with Big Data.
3. CAP Theorem
The CAP theorem states that it is not possible to guarantee all three of the desirable
properties (Consistency, Availability, and Partition tolerance) at the same time in a
distributed system with data replication.
Consistency: It means that the nodes will have the same
copies of a replicated data item visible for various transactions.
Availability: It means that each read or write request for a data
item will either be processed successfully or will receive a
message that the operation cannot be completed. Every non-
failing node returns a response for all the read and write
requests in a reasonable amount of time.
Partition tolerance: It means that the system can continue
operating even if the network connecting the nodes has a fault
that results in two or more partitions, where the nodes in each
partition can only communicate among each other. That means, the system continues to
function and upholds its consistency guarantees in spite of network partitions.
How NoSQL systems guarantee the BASE property
NoSQL relies upon a softer model known as the BASE model. BASE (Basically Available,
Soft state, Eventual consistency).
Basically Available: Guarantees the availability of the data. There will be a response to any
request (can be failure too).
Soft state: The state of the system could change over time.
Eventual consistency: The system will eventually become consistent once it stops receiving
input.
NoSQL databases give up the A, C and/or D requirements of ACID (atomicity, consistency,
durability), and in return they improve scalability.
4. Distributed Storage System of Hadoop
With growing data velocity the data size easily outgrows the storage limit of a machine. A
solution would be to store the data across a network of machines.
Such file systems are called distributed file systems. Since data is stored across a network
all the complications of a network come in.
HDFS is a unique design that provides storage for extremely large files with streaming data
access pattern and it runs on commodity hardware.
Extremely large files: Here we are talking about data in the range of petabytes (1000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-
times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market.
This is one of the features that especially distinguishes HDFS from other file systems.
i. HDFS: HDFS is the primary component of the Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes,
thereby maintaining the metadata in the form of log files. It consists of two core
components: the Name Node and the Data Node.
ii. YARN: Yet Another Resource Negotiator helps to manage the resources across the clusters.
It performs scheduling and resource allocation for the Hadoop system.
iii. MapReduce: By making use of distributed and parallel algorithms, MapReduce makes
it possible to carry over the processing logic and helps to write applications which
transform big data sets into manageable ones.
iv. PIG: Pig was originally developed by Yahoo. It works on the Pig Latin language, a
query-based language similar to SQL. It is a platform for structuring the data flow and
for processing and analysing huge data sets.
v. HIVE: With the help of SQL methodology and interface, HIVE performs reading and
writing of large data sets. Its query language is called HQL (Hive Query Language).
vi. Mahout: Mahout brings machine-learning capability to a system or application. Machine
learning, as the name suggests, helps the system to develop itself based on patterns,
user/environmental interaction, or algorithms.
vii. Apache Spark: It is a platform that handles all the process-intensive tasks like batch
processing, interactive real-time processing, graph conversions, and visualization.
viii. Apache HBase: It is a NoSQL database which supports all kinds of data and is thus
capable of handling anything in a Hadoop database. It provides capabilities similar to
Google's BigTable and can therefore work on big data sets effectively.
ix. Solr, Lucene: These are two services that perform the tasks of searching and indexing
with the help of some Java libraries.
x. Zookeeper: Zookeeper provides coordination and synchronization among the resources or
components of Hadoop by performing synchronization, inter-component communication,
grouping, and maintenance.
xi. Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding
them together as a single unit.
6. Matrix Multiplication using MapReduce
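One common one-step MapReduce formulation of C = M x N: the mapper emits each element of M keyed by (i, k) for every column k of N, and each element of N keyed by (i, k) for every row i of M; the reducer for key (i, k) multiplies entries with matching j and sums them. A minimal Python sketch of this idea (the small 2x2 matrices and their layout are illustrative assumptions):

from collections import defaultdict

# Illustrative matrices for C = M x N, keyed by index pairs.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}   # element (i, j) of M
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}   # element (j, k) of N
I, J, K = 2, 2, 2

def mapper():
    # Emit ((i, k), ('M', j, value)) and ((i, k), ('N', j, value)) so that every
    # element reaches every output cell it contributes to.
    for (i, j), v in M.items():
        for k in range(K):
            yield (i, k), ('M', j, v)
    for (j, k), v in N.items():
        for i in range(I):
            yield (i, k), ('N', j, v)

def reducer(key, values):
    # For output cell (i, k), multiply M[i][j] with N[j][k] and sum over j.
    m_vals = {j: v for tag, j, v in values if tag == 'M'}
    n_vals = {j: v for tag, j, v in values if tag == 'N'}
    return key, sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)

# Simulate the shuffle phase: group mapper output by key.
groups = defaultdict(list)
for key, value in mapper():
    groups[key].append(value)

result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}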
7. Grouping and Aggregation Algorithm using MapReduce
Grouping: Grouping involves categorizing data based on certain criteria. In the context of
MapReduce, it involves gathering data with similar keys together. In MapReduce, grouping
is typically done during the shuffle phase, where the output of the map tasks is transferred
to the reduce tasks.
Aggregation: Aggregation involves performing calculations on grouped data to derive
summary statistics or other meaningful insights. In MapReduce, aggregation is performed
within the reduce tasks. Each reduce task receives a group of values associated with a
particular key.
Map Phase:
Input data is divided into smaller chunks and processed by multiple map tasks in
parallel.
Each map task processes a portion of the input data and emits key-value pairs, where
the key represents the grouping criterion (e.g., product ID) and the value is the data
associated with that key (e.g., sales amount).
Shuffle and Sort Phase:
Output from the map tasks is shuffled and sorted based on the keys. All values
corresponding to the same key are grouped together.
Reduce Phase:
Each reduce task receives a set of key-value pairs where the keys are sorted and
grouped.
The reducer processes each group of values (associated with the same key) and
performs aggregation operations (e.g., summing up sales amounts).
The results of the aggregation are typically written to an output file or another
storage system for further analysis or use.
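A minimal Python sketch of the three phases above, using the product-ID / sales-amount example (the concrete records are illustrative assumptions):

from collections import defaultdict

# Illustrative (product_id, sales_amount) records.
records = [("p1", 100), ("p2", 250), ("p1", 50), ("p3", 75), ("p2", 25)]

def mapper(record):
    product_id, amount = record
    yield product_id, amount          # key = grouping criterion, value = data

def shuffle_and_sort(mapped):
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)     # all values for the same key grouped together
    return sorted(groups.items())     # keys delivered to the reducer in sorted order

def reducer(key, values):
    return key, sum(values)           # aggregation, e.g. total sales per product

mapped = [pair for record in records for pair in mapper(record)]
for key, values in shuffle_and_sort(mapped):
    print(reducer(key, values))       # ('p1', 150), ('p2', 275), ('p3', 75)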
8. Explain different types of NoSQL data stores and their typical usage.
i. Document-Based Database:
The document-based database is a non-relational database. Instead of storing the data in
rows and columns (tables), it uses the documents to store the data in the database.
A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data objects
used in applications, i.e. less translation is required to use these data in the applications.
In a document database, particular elements can be accessed using an index value that is
assigned for faster querying.
Use cases:
a. User Profiles: Since documents have a flexible schema, they can store different
attributes and values. This enables users to store different types of information.
b. Management of Content: Since the schema is flexible, collecting and storing any
kind of data is possible.
ii. Key-Value Stores:
It is a non-relational database. The simplest form of a NoSQL database is a key-value store.
Every data element in the database is stored in key-value pairs.
The data can be retrieved by using a unique key allotted to each element in the database.
The values can be simple data types like strings and numbers or complex objects.
Use cases:
a. Storing session information: offers to save and restore sessions.
b. Shopping carts: easily handle the loss of storage nodes and quickly scale big data
during a holiday/sale on an e-commerce application.
iii. Column Oriented Databases:
It is a non-relational database that stores the data in columns instead of rows.
When we want to run analytics on a small number of columns, we can read those columns
directly without loading unwanted data into memory.
They are designed to read data more efficiently and retrieve it with greater speed.
A columnar database is used to store large amounts of data.
Use cases:
a. Business Intelligence
b. Managing data warehouses
c. Reporting Systems
iv. Graph-Based databases:
These focus on the relationship between the elements. It stores the data in the form of nodes
in the database.
The connections between the nodes are called links or relationships.
In this database, it is easy to identify the relationship between the data by using the links.
Queries return results in real time.
Use cases:
a. Social networking site
b. Recommendation engine
c. Risk assessment
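As a rough illustration of the first two models, the same hypothetical user profile can be stored as one JSON document in a document store, or broken into individually addressable key-value pairs in a key-value store (all names and keys below are assumptions):

import json

# Document store: the whole profile lives in one JSON document, flexible schema.
profile_document = json.dumps({
    "user_id": "u42",
    "name": "Asha",
    "interests": ["cricket", "music"],       # nested/varied attributes are fine
    "last_login": "2024-01-15T10:30:00Z",
})

# Key-value store: each piece of data is reached only through its unique key.
key_value_pairs = {
    "user:u42:name": "Asha",
    "user:u42:interests": "cricket,music",
    "session:u42": "token-abc123",           # e.g. session information
    "cart:u42": "item17,item42",             # e.g. a shopping cart
}

print(profile_document)
print(key_value_pairs["cart:u42"])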
9. Suppose a data stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Let the hash
function used be h(x) = (3x + 1) mod 5. Show how the Flajolet-Martin algorithm
will estimate the number of distinct elements in this stream.
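A minimal Python sketch of the Flajolet-Martin estimate for this stream: it computes h(x) for each element, takes R as the maximum number of trailing zeros among the hash values, and reports 2^R. How to treat h(x) = 0, which has no trailing 1, is a matter of convention; the sketch simply skips it, which is an assumption rather than part of the question:

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]

def h(x):
    return (3 * x + 1) % 5

def trailing_zeros(n):
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

max_r = 0
for x in stream:
    value = h(x)
    if value == 0:
        continue                      # convention-dependent; see note above
    max_r = max(max_r, trailing_zeros(value))

print([h(x) for x in stream])         # hash values of the stream elements
print(2 ** max_r)                     # FM estimate of distinct elements: 2^R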
10. Use of CURE algorithm to cluster big data
CURE (Clustering Using Representatives) assumes a
Euclidean space. However, it does not assume anything
about the shape of clusters; they need not be normally
distributed, and can even have strange bends, S-shapes, or
even rings. Instead of representing clusters by their
centroid, it uses a collection of representative points, as the
name implies.
Initialization in CURE
i. Take a small sample of the data and cluster it in main memory. In principle, any
clustering method could be used, but as CURE is designed to handle oddly shaped
clusters, it is often advisable to use a hierarchical method in which clusters are merged
when they have a close pair of points.
ii. Select a small set of points from each cluster to be representative points. These points
should be chosen to be as far from one another as possible.
iii. Move each of the representative points a fixed
fraction of the distance between its location and the
centroid of its cluster. Perhaps 20% is a good
fraction to choose. Note that this step requires a
Euclidean space, since otherwise, there might not be
any notion of a line between two points.
iv. For the second step, we pick the representative
points. If the sample from which the clusters are
constructed is large enough, we can count on a
cluster's sample points at greatest distance from one another lying on the boundary of
the cluster.
v. Finally, we move the representative points a fixed fraction of the distance from their
true location toward the centroid of the cluster. In Fig. 5.1.3 both clusters have their
centroid in the same place: the center of the inner circle. Thus, the representative points
from the circle move inside the cluster, as was intended. Points on the outer edge of the
ring also move into their cluster, but points on the ring's inner edge move outside the
cluster.
Completion of the CURE Algorithm
vi. The next phase of CURE is to merge two clusters
if they have a pair of representative points, one
from each cluster, that are sufficiently close. The
user may pick the distance that defines "close."
This merging step can repeat, until there are no
more sufficiently close clusters.
vii. The last step of CURE is point assignment. Each
point p is brought from secondary storage and
compared with the representative points. We
assign p to the cluster of the representative point that is closest to p.
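A minimal sketch in Python of the shrinking step described above: the representative points are moved a fixed fraction (20% here) of the way toward their cluster's centroid. The small square cluster and the choice of representatives are illustrative assumptions:

import numpy as np

# Illustrative cluster and two representative points chosen far apart.
cluster = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])
representatives = np.array([[0.0, 0.0], [2.0, 2.0]])
alpha = 0.2                                  # fraction of the way to the centroid

centroid = cluster.mean(axis=0)              # (1.0, 1.0)
shrunk = representatives + alpha * (centroid - representatives)
print(shrunk)   # [[0.2 0.2] [1.8 1.8]]: each point moved 20% toward the centroid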
11. Shuffle and Sort phase and Reducer phase in MapReduce
In the map phase the task tracker performs the computation on local data and output is
generated.
The output is called intermediate results and is stored on temporary local storage.
After the map phase is over, all the intermediate values for a given intermediate key are
combined together into a list.
The list is given to a reducer.
There may be single or multiple reducers.
All values associated with a particular intermediate key are guaranteed to go to the same
reducer.
The intermediate keys, and their value lists, are passed to the reducer in sorted key order.
This step is known as 'shuffle and sort'.
The reducer outputs zero or more final key-value pairs.
These are written to HDFS.
The reducer usually emits a single key-value pair for each input key.
The job tracker starts a reduce task on any one of the nodes and instructs it to grab the
intermediate data from the completed map tasks.
The reducer performs the final computation and the output is written to HDFS.
The client reads the output from the file and the job completes.
Selection: Selection lets you apply a condition over the data you have and only get the
rows that satisfy the condition.
Projection: In order to select some columns only we use the projection operator.
Union: We concatenate two tables vertically. The duplicate rows are removed implicitly.
Intersection: It intersects two tables and selects only the common rows.
Difference: The rows that are in the first table but not in the second are selected for output.
Natural Join: Merge two tables based on a common column.
Grouping and Aggregation: Group rows based on some set of columns and apply some
aggregation (sum, count, max, min, etc.) on some column of the small groups that are
formed.
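A minimal Python sketch of a few of these operations on an in-memory relation (the table, column names, and values are illustrative assumptions):

from collections import defaultdict

# Illustrative relation as a list of rows.
employees = [
    {"name": "Ann", "dept": "HR", "salary": 50},
    {"name": "Bob", "dept": "IT", "salary": 70},
    {"name": "Cal", "dept": "IT", "salary": 60},
]

# Selection: keep only the rows satisfying a condition.
it_only = [r for r in employees if r["dept"] == "IT"]

# Projection: keep only some columns.
names = [{"name": r["name"]} for r in employees]

# Grouping and aggregation: group by dept, then apply an aggregate (sum).
totals = defaultdict(int)
for r in employees:
    totals[r["dept"]] += r["salary"]      # reduce-side aggregation per key

print(it_only, names, dict(totals))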
Grouping and Aggregation using MapReduce:
i. Input Files: The data for MapReduce job is stored in Input Files. Input files reside in
HDFS.
ii. InputFormat: InputFormat defines how to divide and read these input files. It selects
the files or other objects for input.
iii. InputSplits: It represents the data that will be processed by an individual Mapper. For
each split, one map task is created.
iv. RecordReader: It communicates with the inputSplit and then transforms the data into
key-value pairs suitable for reading by the Mapper.
v. Mapper: It processes input records produced by the RecordReader and generates
intermediate key-value pairs.
vi. Combiner: Combiner is Mini-reducer that performs local aggregation on the mapper’s
output. It minimizes the data transfer between mapper and reducer.
vii. Partitioner: Partitioner comes into existence if we are working with more than one
reducer. It grabs the output of the combiner and performs partitioning.
viii. Shuffling and Sorting: Once all the mappers complete, their output is shuffled to the
reducer nodes. The framework then merges and sorts this intermediate output.
ix. Reducer: The reducer takes the set of intermediate key-value pairs produced by the
mappers as input and runs a reducer function on each of them to generate the output.
The output of the reducer is the final output. The framework stores this output on
HDFS.
x. RecordWriter: It writes output key-value pairs from the Reducer phase to the output
files.
xi. OutputFormat: OutputFormat defines the way the RecordWriter writes these output
key-value pairs to the output files. Its instances, provided by Hadoop, write files in
HDFS. Thus OutputFormat instances write the final output of the reducer to HDFS.
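A minimal word-count sketch in Python that mirrors this flow, with a combiner doing local aggregation on each mapper's output and a partitioner deciding which reducer receives each key (the input splits and the number of reducers are illustrative assumptions):

from collections import Counter, defaultdict

# Two "input splits", each a list of lines, standing in for InputSplits.
splits = [["big data big"], ["data big analytics"]]
NUM_REDUCERS = 2

def mapper(line):
    for word in line.split():
        yield word, 1                  # intermediate key-value pairs

def combiner(pairs):
    # Mini-reducer: local aggregation on one mapper's output.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return local.items()

def partition(word):
    return hash(word) % NUM_REDUCERS   # which reducer receives this key

# Map + combine per split, then shuffle to the reducers by partition.
reducer_input = defaultdict(lambda: defaultdict(list))
for split in splits:
    mapped = [pair for line in split for pair in mapper(line)]
    for word, count in combiner(mapped):
        reducer_input[partition(word)][word].append(count)

# Reduce: sum the value list for each key, in sorted key order.
for r, groups in reducer_input.items():
    for word, counts in sorted(groups.items()):
        print(r, word, sum(counts))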
17. Issues of stream processing
Scalability: Ensuring the system can process more data streams without a performance decline.
Fault tolerance: Guaranteeing data analysis can continue even after failures.
Data integrity: Maintaining data integrity by validating the data being processed.
Time: Streams are unbounded and continuously serve events, so time is an important aspect
of stream processing.
Processing delays: Delays can occur due to network traffic, slow processing speeds, or
pressure from subsequent operators.
18. PageRank
PageRank refers to an algorithm developed by Larry Page and Sergey Brin, the founders
of Google, to rank web pages in search engine results.
PageRank is a crucial component of Google's search algorithm, used to determine the
importance of a webpage based on the quantity and quality of links pointing to it.
The fundamental idea behind PageRank is that a webpage is considered more important if
it is linked to by other important pages. This concept is akin to academic citations: a paper
is considered more influential if it is cited by other reputable papers.
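A minimal power-iteration sketch in Python with a teleport (damping) factor: each page splits its current rank equally among its out-links, and every page additionally receives a (1 - beta)/N teleport share. The four-page graph below is an illustrative assumption, not the graph from the exercise that follows:

# Illustrative web graph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
beta = 0.8                      # teleport factor: follow a link with probability beta
pages = list(graph)
rank = {p: 1 / len(pages) for p in pages}

for _ in range(2):              # e.g. two iterations
    new_rank = {p: (1 - beta) / len(pages) for p in pages}
    for page, out_links in graph.items():
        share = rank[page] / len(out_links)
        for target in out_links:
            new_rank[target] += beta * share   # each page splits its rank equally
    rank = new_rank

print(rank)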
Using the web graph shown below, compute the PageRank at every node at the end of the
second iteration. Use teleport factor = 0.8.
The simplest case of the DGIM (Datar-Gionis-Indyk-Motwani) algorithm uses O(log^2 N)
bits to represent a window of N bits and enables the estimation of the number of 1s in the
window with an error of no more than 50%.
Divide the window into buckets, each consisting of: the timestamp of its rightmost 1, and
the number of 1s in the bucket. This number must be a power of 2, and we refer to the
number of 1s as the size of the bucket.
The right end of a bucket always corresponds to a position with a 1.
The number of 1s in a bucket must be a power of 2.
Either one or two buckets with the same power-of-2 size exist.
Buckets do not overlap in timestamps.
Buckets are sorted by size.
Buckets disappear when their end-time is more than N time units in the past.
Example: Suppose the stream is that of Fig. 4.9.1 and k = 10. Then the query asks for the
number of 1s in the ten rightmost bits, which happen to be 0110010110. Let the current
timestamp (the time of the rightmost bit) be t. Then the two buckets with one 1, having
timestamps t - 1 and t - 2, are completely included in the answer.
The bucket of size 2, with timestamp t - 4, is also completely included. However, the
rightmost bucket of size 4, with timestamp t - 8, is only partly included. We know it is the
last bucket to contribute to the answer, because the next bucket to its left has a timestamp
less than t - 9 and is thus completely out of the window.
On the other hand, we know the buckets to its right are completely inside the range of the
query because of the existence of a bucket to their left with timestamp t - 9 or greater.
Our estimate of the number of 1s in the last ten positions is therefore 6. This number is the
sum of the two buckets of size 1, the bucket of size 2, and half the bucket of size 4 that is
partially within range. Of course, the correct answer is 5.
Suppose the above estimate of the answer to a query involves a bucket b of size 2^j that is
partially within the range of the query. Let us consider how far from the correct answer c
our estimate could be. There are two cases: the estimate could be larger or smaller than c.
Case 1: The estimate is less than c. In the worst case, all the 1s of b are actually within
the range of the query, so the estimate misses half of bucket b, that is, 2^(j-1) 1s. But in
this case c is at least 2^j; in fact it is at least 2^(j+1) - 1, since there is at least one
bucket of each of the sizes 2^(j-1), 2^(j-2), ..., 1. We conclude that our estimate is at
least 50% of c.
Case 2: The estimate is greater than c. In the worst case, only the rightmost bit of bucket
b is within range, and there is only one bucket of each of the sizes smaller than b. Then
c = 1 + 2^(j-1) + 2^(j-2) + ... + 1 = 2^j, and the estimate we give is
2^(j-1) + 2^(j-1) + 2^(j-2) + ... + 1 = 2^j + 2^(j-1) - 1. We see that the estimate is
no more than 50% greater than c.
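A minimal Python sketch of how the estimate in the example above is formed from the bucket list: buckets whose rightmost 1 falls inside the window are kept, the oldest of them is counted as half its size, and the rest are counted in full (the concrete value of t is an assumption made only so the code runs):

t, k = 100, 10
# Buckets from the example, oldest to newest: size 4 at t-8, size 2 at t-4,
# and two buckets of size 1 at t-2 and t-1; each bucket is (timestamp, size).
buckets = [(t - 8, 4), (t - 4, 2), (t - 2, 1), (t - 1, 1)]

# Keep only buckets whose rightmost 1 lies inside the window of the last k bits.
in_window = [(ts, size) for ts, size in buckets if ts > t - k]

# The oldest such bucket is only partially inside, so count half of it;
# the remaining buckets are counted in full.
estimate = in_window[0][1] // 2 + sum(size for _, size in in_window[1:])
print(estimate)   # 2 + (2 + 1 + 1) = 6; the true count in the example is 5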
Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page
D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its
existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer
all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links,
it would transfer one-third of its existing value, or approximately 0.083, to A. At the completion
of this iteration, page A will have a PageRank of approximately 0.458.
22. PCY Algorithm
In pass 1 of A-Priori, most memory is idle. We store only individual item counts. We use
the idle memory to reduce memory required in pass 2.
Pass 1: In addition to item counts, maintain a hash table with as many buckets as fit in memory.
Keep a count for each bucket into which pairs of items are hashed. For each bucket just keep
the count, not the actual pairs that hash to the bucket.
FOR (each basket)
    FOR (each item in the basket)
        add 1 to the item's count;
    FOR (each pair of items)
        hash the pair to a bucket;
        add 1 to the count for that bucket;
Pairs of items need to be generated from the input file; they are not present in the file.
We are not just interested in the presence of a pair, but we need to see whether it is present
at least s (support) times.
If a bucket contains a frequent pair, then the bucket is surely frequent. However, even
without any frequent pair, a bucket can still be frequent. So, we cannot use the hash to
eliminate any member (pair) of a "frequent" bucket.
But, for a bucket with total count less than s, none of its pairs can be frequent. Pairs that
hash to this bucket can be eliminated as candidates (even if the pair consists of
2 frequent items). E.g., even though {A} and {B} are frequent, the count of the bucket
containing {A, B} might be less than s.
Pass 2: Only count pairs that hash to frequent buckets.
Replace the bucket counts by a bit-vector: 1 means the bucket count exceeded the support s
(call it a frequent bucket); 0 means it did not.
The 4-byte integer counts are replaced by bits, so the bit-vector requires 1/32 of the memory.
Also, decide which items are frequent and list them for the second pass.
Count all pairs (i, j) that meet the conditions for being a candidate pair:
i. Both i and j are frequent items.
ii. The pair (i, j) hashes to a bucket whose bit in the bit-vector is 1 (i.e. a frequent bucket).
Both conditions are necessary for the pair to have a chance of being frequent.
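A minimal Python sketch of both passes of PCY (the baskets, support threshold s, and number of hash buckets are illustrative assumptions):

from collections import Counter
from itertools import combinations

baskets = [["A", "B", "C"], ["A", "B"], ["A", "C"], ["B", "C"], ["A", "B", "D"]]
s = 3                 # support threshold
NUM_BUCKETS = 7

def bucket(pair):
    return hash(frozenset(pair)) % NUM_BUCKETS

# Pass 1: count items, and hash every pair in every basket to a bucket count.
item_counts = Counter()
bucket_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    for pair in combinations(sorted(basket), 2):
        bucket_counts[bucket(pair)] += 1

frequent_items = {i for i, c in item_counts.items() if c >= s}
bit_vector = {b for b, c in bucket_counts.items() if c >= s}   # frequent buckets

# Pass 2: count only candidate pairs: both items frequent AND the pair hashes
# to a frequent bucket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        if set(pair) <= frequent_items and bucket(pair) in bit_vector:
            pair_counts[pair] += 1

print({p: c for p, c in pair_counts.items() if c >= s})   # frequent pairs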
23. Collaborative Filtering