BDA HANDY NOTES

1. Characteristics of social media that make it suitable for Big Data

 Volume: Social media platforms generate vast amounts of data every second. Users
post status updates, tweets, photos, videos, comments, and likes, resulting in a massive
volume of data. For example, Facebook, Twitter, Instagram, and other platforms
collectively produce billions of pieces of content daily.
 Velocity: The speed at which data is created and disseminated on social media is
incredibly high. Real-time interactions and the constant stream of user-generated
content mean that data is produced and needs to be processed at a fast pace. This rapid
generation and flow of information are crucial elements of Big Data.
 Variety: Social media data is highly diverse, encompassing text, images, videos, audio,
and various forms of metadata (like timestamps, geolocation tags, and user
interactions). This variety provides a rich and multifaceted dataset that can be analyzed
for numerous insights across different media formats.
 Veracity: While social media data can be noisy and unstructured, it often reflects
genuine user opinions, behaviors, and trends. However, this also means dealing with
challenges related to data accuracy and trustworthiness, as the information can
sometimes be misleading or false.
 Value: The data collected from social media holds significant value due to its potential
to reveal insights into consumer behavior, public sentiment, market trends, and more.
Businesses, researchers, and governments can leverage this data for decision-making,
marketing strategies, and policy-making.

2. The Vs of Big Data

i. Volume
 Data storage has grown exponentially, since data is now far more than plain text.
 Data appears in the form of videos, music, and large images on our social media channels.
 As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often.
 Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data.

ii. Velocity
 Data growth and the social media explosion have changed how we look at data.
 There was a time when we believed that yesterday's data was recent.
 As a matter of fact, newspapers still follow that logic.
 However, news channels and radio have changed how fast we receive the news.
 Today, people rely on social media to keep them updated with the latest happenings. On social media, a message that is only a few seconds old may no longer interest users.
 They often discard old messages and pay attention to recent updates. Data movement is now almost real time, and the update window has reduced to fractions of a second.
iii. Variety
 Data can be stored in multiple formats, for example a database, Excel, CSV, Access, or, for that matter, a simple text file.
 Sometimes the data is not even in a traditional format as we assume; it may be in the form of video, SMS, PDF, or something we might not have thought about. It is the organization's need to arrange it and make it meaningful.
 This would be easy if all data were in the same format, but that is rarely the case. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data.

3. CAP Theorem

 The CAP theorem states that it is not possible to guarantee all three of the desirable properties (Consistency, Availability, and Partition tolerance) at the same time in a distributed system with data replication.
 Consistency: It means that the nodes will have the same
copies of a replicated data item visible for various transactions.
 Availability: It means that each read or write request for a data
item will either be processed successfully or will receive a
message that the operation cannot be completed. Every non-
failing node returns a response for all the read and write
requests in a reasonable amount of time.
 Partition tolerance: It means that the system can continue
operating even if the network connecting the nodes has a fault
that results in two or more partitions, where the nodes in each
partition can only communicate among each other. That means, the system continues to
function and upholds its consistency guarantees in spite of network partitions.
How NoSQL systems guarantees BASE property
 NoSQL relies upon a softer model known as the BASE model: Basically Available, Soft state, Eventual consistency.
 Basically Available: Guarantees the availability of the data. There will be a response to any
request (can be failure too).
 Soft state: The state of the system could change over time.
 Eventual consistency: The system will eventually become consistent once it stops receiving
input.
 NoSQL databases give up some of the strict ACID requirements (such as strong consistency), and in return they improve scalability.
4. Distributed Storage System of Hadoop

 With growing data velocity the data size easily outgrows the storage limit of a machine. A
solution would be to store the data across a network of machines.
 Such file systems are called distributed file systems. Since data is stored across a network
all the complications of a network come in.
 HDFS is a unique design that provides storage for extremely large files with streaming data
access pattern and it runs on commodity hardware.
 Extremely large files: Here we are talking about the data in range of petabytes (1000 TB).
 Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
 Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.

Nodes: Master and slave nodes typically form the HDFS cluster.


i. NameNode(MasterNode):
 Manages all the slave nodes and assigns work to them.
 It executes filesystem namespace operations like opening, closing, renaming files and
directories.
 It should be deployed on reliable hardware with a high configuration, not on commodity hardware.
ii. DataNode(SlaveNode):
 Actual worker nodes, who do the actual work like reading, writing, processing etc.
 They also perform creation, deletion, and replication upon instruction from the master.
 They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in background.
i. NameNodes:
 Run on the master node.
 Store metadata (data about data) like file path, the number of blocks, block Ids. etc.
 Require high amount of RAM.
 Store meta-data in RAM for fast retrieval i.e. to reduce seek time. Though a persistent
copy of it is kept on disk.
ii. DataNodes:
 Run on slave nodes.
 Require a large amount of disk storage, as the actual data blocks are stored here.
5. Hadoop Ecosystem

i. HDFS: HDFS is the primary or major component of Hadoop ecosystem and is responsible
for storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files. It consists of two core
components: Name node and Data Node.
ii. YARN: Yet Another Resource Negotiator helps to manage the resources across the clusters. It performs scheduling and resource allocation for the Hadoop system.
iii. MapReduce: By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps in writing applications which transform big data sets into manageable ones.
iv. PIG: Pig was developed by Yahoo. It works on the Pig Latin language, a query-based language similar to SQL. It is a platform for structuring the data flow, and for processing and analysing huge data sets.
v. HIVE: With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
vi. Mahout: Mahout adds machine-learning capability to a system or application. Machine Learning, as the name suggests, helps the system to develop itself based on patterns, user/environmental interaction, or algorithms.
vii. Apache Spark: It is a platform that handles processing-intensive tasks like batch processing, interactive real-time processing, graph processing, and visualization.
 Apache HBase: It is a NoSQL database which supports all kinds of data and is thus capable of handling almost anything in a Hadoop ecosystem. It provides the capabilities of Google's BigTable, and is thus able to work on Big Data sets effectively.
 Solr, Lucene: These are the two services that perform the task of searching and indexing
with the help of some java libraries.
 Zookeeper: Zookeeper provides coordination among the resources and components of Hadoop by performing synchronization, inter-component communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, thus scheduling jobs and binding
them together as a single unit.
6. Matrix Multiplication using MapReduce
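A minimal sketch of the standard one-pass MapReduce matrix multiplication P = M x N, simulated in plain Python (the small matrices, values, and helper names below are illustrative assumptions, not part of the notes): the map step tags each matrix entry with its join index j, the shuffle groups values by output cell (i, k), and the reduce step multiplies matching entries and sums them.

```python
# Hedged sketch of one-pass MapReduce matrix multiplication P = M x N.
# Entries are (matrix_name, row, col, value) tuples; the 2x2 matrices are examples.
from collections import defaultdict

def map_entry(entry, rows_M, cols_N):
    name, r, c, v = entry
    if name == "M":                       # M[r][c] contributes to every P[r][k]
        for k in range(cols_N):
            yield (r, k), ("M", c, v)     # join key j = c
    else:                                 # N[r][c] contributes to every P[i][c]
        for i in range(rows_M):
            yield (i, c), ("N", r, v)     # join key j = r

def reduce_key(key, values):
    # Join on j, multiply matching pairs, and sum: P[i][k] = sum_j M[i][j] * N[j][k]
    m_vals = {j: v for tag, j, v in values if tag == "M"}
    n_vals = {j: v for tag, j, v in values if tag == "N"}
    return key, sum(m_vals[j] * n_vals[j] for j in m_vals if j in n_vals)

M = [("M", i, j, val) for i, row in enumerate([[1, 2], [3, 4]]) for j, val in enumerate(row)]
N = [("N", i, j, val) for i, row in enumerate([[5, 6], [7, 8]]) for j, val in enumerate(row)]

groups = defaultdict(list)                # the shuffle-and-sort step
for entry in M + N:
    for key, value in map_entry(entry, rows_M=2, cols_N=2):
        groups[key].append(value)

print(dict(reduce_key(k, vs) for k, vs in sorted(groups.items())))
# {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
```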
7. Grouping and Aggregation Algorithm using MapReduce

 Grouping: Grouping involves categorizing data based on certain criteria. In the context of
MapReduce, it involves gathering data with similar keys together. In MapReduce, grouping
is typically done during the shuffle phase, where the output of the map tasks is transferred
to the reduce tasks.
 Aggregation: Aggregation involves performing calculations on grouped data to derive
summary statistics or other meaningful insights. In MapReduce, aggregation is performed
within the reduce tasks. Each reduce task receives a group of values associated with a
particular key.
 Map Phase:
 Input data is divided into smaller chunks and processed by multiple map tasks in
parallel.
 Each map task processes a portion of the input data and emits key-value pairs, where
the key represents the grouping criterion (e.g., product ID) and the value is the data
associated with that key (e.g., sales amount).
 Shuffle and Sort Phase:
 Output from the map tasks is shuffled and sorted based on the keys. All values
corresponding to the same key are grouped together.
 Reduce Phase:
 Each reduce task receives a set of key-value pairs where the keys are sorted and
grouped.
 The reducer processes each group of values (associated with the same key) and
performs aggregation operations (e.g., summing up sales amounts).
 The results of the aggregation are typically written to an output file or another
storage system for further analysis or use.
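As a concrete illustration of the flow above, here is a small plain-Python simulation (the sales records and product IDs are assumed example data): the map step emits (product ID, amount) pairs, the shuffle groups values by key, and the reduce step sums each group.

```python
# Minimal simulation of grouping and aggregation in MapReduce style.
from collections import defaultdict

records = [("p1", 10.0), ("p2", 5.0), ("p1", 7.5), ("p3", 2.0), ("p2", 1.0)]

# Map phase: emit (key, value) pairs where the key is the grouping criterion.
mapped = [(product_id, amount) for product_id, amount in records]

# Shuffle-and-sort phase: group all values that share a key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each group (here, the sum of sales amounts per product).
result = {key: sum(values) for key, values in sorted(groups.items())}
print(result)   # {'p1': 17.5, 'p2': 6.0, 'p3': 2.0}
```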
8. Explain different types of NoSQL data stores and their typical usage.

i. Document-Based Database:
 The document-based database is a non-relational database. Instead of storing the data in
rows and columns (tables), it uses the documents to store the data in the database.
 A document database stores data in JSON, BSON, or XML documents.
 Documents can be stored and retrieved in a form that is much closer to the data objects
used in applications, i.e. less translation is required to use these data in the applications.
 In the Document database, the particular elements can be accessed by using the index value
that is assigned for faster querying.
 Use cases:
a. User Profiles: Since they have a flexible Schema, the document can store different
attributes and values. This enables the users to store different types of information.
b. Content management: Since it has a flexible schema, collecting and storing any kind of data is possible.
ii. Key-Value Stores:
 It is a non-relational database. The simplest form of a NoSQL database is a key-value store.
 Every data element in the database is stored in key-value pairs.
 The data can be retrieved by using a unique key allotted to each element in the database.
 The values can be simple data types like strings and numbers or complex objects.
 Use cases:
a. Storing session information: offers to save and restore sessions.
b. Shopping carts: easily handle the loss of storage nodes and quickly scale big data
during a holiday/sale on an e-commerce application.
iii. Column Oriented Databases:
 It is a non-relational database that stores the data in columns instead of rows.
 When we want to run analytics on a small number of columns, we can read those columns directly without consuming memory with unwanted data.
 They are designed to read data more efficiently and retrieve the data with greater speed.
 A columnar database is used to store a large amount of data.
 Use cases:
a. Business Intelligence
b. Managing data warehouses
c. Reporting Systems
iv. Graph-Based databases:
 These focus on the relationship between the elements. It stores the data in the form of nodes
in the database.
 The connections between the nodes are called links or relationships.
 In this database, it is easy to identify the relationship between the data by using the links.
 Query output is produced in real time.
 Use cases:
a. Social networking site
b. Recommendation engine
c. Risk assessment
9. Suppose a data stream consists of the integers 3, 1, 4, 1, 5, 9, 2, 6, 5. Let the hash function used be h(x) = (3x + 1) mod 5. Show how the Flajolet-Martin algorithm will estimate the number of distinct elements in this stream.
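A short sketch of the computation (assuming hash values are written in binary and a hash value of 0 is treated as having zero trailing zeros, since conventions differ): the estimate is 2^R, where R is the maximum tail length (number of trailing zeros) observed among the hash values.

```python
# Flajolet-Martin estimate for the stream in question 9, with h(x) = (3x + 1) mod 5.

def trailing_zeros(v: int) -> int:
    if v == 0:
        return 0          # convention choice; some texts treat a zero hash specially
    r = 0
    while v % 2 == 0:
        v //= 2
        r += 1
    return r

stream = [3, 1, 4, 1, 5, 9, 2, 6, 5]
hashes = [(3 * x + 1) % 5 for x in stream]   # [0, 4, 3, 4, 1, 3, 2, 4, 1]
R = max(trailing_zeros(h) for h in hashes)   # maximum tail length = 2 (from h = 4 = 100 in binary)
print(hashes, R, 2 ** R)                     # estimated number of distinct elements = 2^R = 4
```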
10. Use of CURE algorithm to cluster big data
CURE (Clustering Using Representatives) assumes a
Euclidean space. However, it does not assume anything
about the shape of clusters; they need not be normally
distributed, and can even have strange bends, S-shapes, or
even rings. Instead of representing clusters by their
centroid, it uses a collection of representative points, as the
name implies.
Initialization in CURE
i. Take a small sample of the data and cluster it in main memory. In principle, any
clustering method could be used, but as CURE is designed to handle oddly shaped
clusters, it is often advisable to use a hierarchical method in which clusters are merged
when they have a close pair of points.
ii. Select a small set of points from each cluster to be representative points. These points should be chosen to be as far from one another as possible.
iii. Move each of the representative points a fixed
fraction of the distance between its location and the
centroid of its cluster. Perhaps 20% is a good
fraction to choose. Note that this step requires a
Euclidean space, since otherwise, there might not be
any notion of a line between two points.
iv. For the second step, we pick the representative
points. If the sample from which the clusters are
constructed is large enough, we can count on a
cluster's sample points at greatest distance from one another lying on the boundary of
the cluster.
v. Finally, we move the representative points a fixed fraction of the distance from their
true location toward the centroid of the cluster. In Fig. 5.1.3 both clusters have their
centroid in the same place: the center of the inner circle. Thus, the representative points
from the circle move inside the cluster, as was intended. Points on the outer edge of the
ring also move into their cluster, but points on the ring's inner edge move outside the
cluster.
Completion of the CURE Algorithm
vi. The next phase of CURE is to merge two clusters
if they have a pair of representative points, one
from each cluster, that are sufficiently close. The
user may pick the distance that defines "close."
This merging step can repeat, until there are no
more sufficiently close clusters.
vii. The last step of CURE is point assignment. Each
point p is brought from secondary storage and
compared with the representative points. We
assign p to the cluster of the representative point that is closest to p.
11. Shuffle and Sort phase and Reducer phase in MapReduce

 In the map phase, the task tracker performs the computation on local data and generates output.
 This output is called the intermediate result and is stored in temporary local storage.
 After the map phase is over, all the intermediate values for a given intermediate key are
combined together into a list.
 The list is given to a reducer.
 There may be single or multiple reducers.
 All values associated with a particular intermediate key are guaranteed to go to the same
reducer.
 The intermediate keys, and their value lists, are passed to the reducer in sorted key order.
 This step is known as ' shuffle and sort'.
 The reducer outputs zero or more final key value pairs.
 These are written to HDFS.
 The reducer usually emits a single key/value pair for each input key.
 The job tracker starts a reduce task on any one of the nodes and instructs it to grab the intermediate data from the completed map tasks.
 The reducer performs the final computation and the output is written to HDFS.
 The client reads the output from the file and the job completes.

12. Relational algebra operations using MapReduce

 Selection: Selection lets you apply a condition over the data you have and only get the
rows that satisfy the condition.
 Projection: In order to select some columns only we use the projection operator.
 Union: We concatenate two tables vertically. The duplicate rows are removed implicitly.
 Intersection: It intersects two tables and selects only the common rows.
 Difference: The rows that are in the first table but not in the second are selected for output.
 Natural Join: Merge two tables based on some common column.
 Grouping and Aggregation: Group rows based on some set of columns and apply some
aggregation (sum, count, max, min, etc.) on some column of the small groups that are
formed.
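As one concrete case, here is a hedged plain-Python simulation of the natural join of R(a, b) and S(b, c) in MapReduce style (the relations R and S are assumed example data): each tuple is mapped to its join key b, tagged with the relation it came from, and the reducer pairs up R-tuples and S-tuples sharing the same b.

```python
# Natural join R(a, b) with S(b, c), simulated in MapReduce style.
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]        # relation R(a, b)
S = [("b1", "c1"), ("b2", "c2"), ("b3", "c3")]        # relation S(b, c)

groups = defaultdict(list)                            # shuffle-and-sort on join key b
for a, b in R:
    groups[b].append(("R", a))
for b, c in S:
    groups[b].append(("S", c))

joined = []
for b, values in groups.items():                      # reducer: combine R-side and S-side values
    r_side = [v for tag, v in values if tag == "R"]
    s_side = [v for tag, v in values if tag == "S"]
    joined.extend((a, b, c) for a in r_side for c in s_side)

print(sorted(joined))   # [('a1', 'b1', 'c1'), ('a2', 'b1', 'c1'), ('a3', 'b2', 'c2')]
```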
Grouping and Aggregation using MapReduce: see question 7 above.

13. Document Data Store vs Column Family Data Store


 Storage granularity: A document database stores and retrieves data one document (record) at a time, and hence may read unnecessary data if only part of a record is required. A column family database stores and retrieves data by column, and hence reads only the relevant data.
 Read/write pattern: Records in document data stores are easy to read and write as a whole. In column family stores, whole-record read and write operations are slower by comparison.
 Typical workload: Document data stores are best suited for online transaction systems. Column family stores are best suited for online analytical processing.
 Aggregation: Document stores are not efficient in performing operations applicable to the entire dataset, and hence aggregation is an expensive operation. Column family stores are efficient for dataset-wide operations and enable aggregation over many rows and columns.
 Compression: Typical compression mechanisms in document stores provide less efficient results than what column-oriented data stores achieve. Column family stores permit high compression rates due to the few distinct or unique values in each column.
 Examples: A well-known document data store is MongoDB, which stores JSON-like documents and provides a sophisticated query engine; another is CouchDB. Well-known column family data stores are HBase, Apache Cassandra, and ScyllaDB, which are designed from the ground up to provide scalability and partitioning.
14. Bloom’s Filter concept

 A Bloom filter is a simple, space-efficient data structure introduced by Burton Howard Bloom in 1970. The filter tests the membership of an element in a dataset.
 The filter is basically a bit vector of length m used to represent a set S = {x1, x2, ..., xn} of n elements.
 Initially all bits are 0. We then define k independent hash functions h1, h2, ..., hk, each of which maps (hashes) an element x of S to one of the m array positions with a uniform random distribution.
 The number k is a constant, much smaller than m. For each element x in S, the bits hi(x) are set to 1 for 1 <= i <= k.
 A counting Bloom filter maintains a counter for each bit of the filter. The counters corresponding to the k hash values are incremented or decremented whenever an element is added to or deleted from the filter, respectively.
 As soon as a counter changes from 0 to 1, the corresponding bit in the bit vector is set to 1. When a counter changes from 1 to 0, the corresponding bit in the bit vector is set to 0. Each counter basically maintains the number of elements that hashed to that position.
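A minimal sketch of the plain (non-counting) Bloom filter described above; building the k hash functions from salted SHA-256 digests is an illustrative assumption rather than a prescribed choice.

```python
# Sketch of a plain Bloom filter: a length-m bit vector and k hash functions.
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = [0] * m                     # bit vector of length m, all 0 initially

    def _positions(self, item: str):
        # k hash functions h_1 .. h_k mapping the item to positions in [0, m)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1                  # set bit h_i(item) to 1

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "possibly in the set" (false positives are possible);
        # any bit clear -> "definitely not in the set".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
bf.add("alice@example.com")
print(bf.might_contain("alice@example.com"))    # True
print(bf.might_contain("mallory@example.com"))  # almost certainly False
```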

15. Types of Big Data:


i. Structured Data: Structured data is organized and formatted in a specific way. It
includes data types like numbers, dates, and strings, and is easy to query and analyze.
ii. Unstructured Data: Unstructured data lacks a predefined structure and organization.
Examples include text documents, images, videos, audio recordings, and social media
posts. Analyzing unstructured data requires advanced techniques like NLP and ML.
iii. Semi-Structured Data: Semi-structured data falls somewhere between structured and
unstructured data. It has some organizational properties but doesn't adhere to a strict
schema. Examples include JSON, XML, and log files.

Challenges of Big Data:


i. Storage: Storing massive amounts of data requires scalable and cost-effective storage
solutions. Traditional storage systems may not be capable of handling big data
workloads efficiently.
ii. Processing: Processing large volumes of data in a timely manner is a significant
challenge. Distributed computing frameworks like Hadoop and Spark have emerged to
address this challenge by enabling parallel processing across clusters of computers.
iii. Analysis: Analyzing big data involves extracting meaningful insights from complex
and diverse datasets. Advanced analytics methods, including ML, AI, and predictive
analytics, are often employed to derive insights from big data.
iv. Privacy and Security: Big data often contains sensitive information, raising concerns
about privacy and security. Ensuring the confidentiality, integrity, and availability of
data is crucial to maintaining trust and compliance with regulations.
v. Data Quality: Big data sources may contain errors, inconsistencies, and inaccuracies.
Poor data quality can lead to incorrect analysis and decision-making.
vi. Scalability: Big data systems need to scale horizontally to accommodate growing data
volumes and user demands. Scalability ensures that the system can handle increased
workloads without sacrificing performance or reliability.

16. MapReduce Execution

i. Input Files: The data for MapReduce job is stored in Input Files. Input files reside in
HDFS.
ii. InputFormat: InputFormat defines how to divide and read these input files. It selects
the files or other objects for input.
iii. InputSplits: It represents the data that will be processed by an individual Mapper. For
each split, one map task is created.
iv. RecordReader: It communicates with the inputSplit and then transforms the data into
key-value pairs suitable for reading by the Mapper.
v. Mapper: It processes input records produced by the RecordReader and generates
intermediate key-value pairs.
vi. Combiner: Combiner is Mini-reducer that performs local aggregation on the mapper’s
output. It minimizes the data transfer between mapper and reducer.
vii. Partitioner: Partitioner comes into existence if we are working with more than one
reducer. It grabs the output of the combiner and performs partitioning.
viii. Shuffling and Sorting: After all the mappers complete, their output is shuffled to the reducer nodes. The framework then merges and sorts this intermediate output.
ix. Reducer: The reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reducer function on each group to generate the output. The output of the reducer is the final output. The framework stores this output on HDFS.
x. RecordWriter: It writes output key-value pairs from the Reducer phase to the output
files.
xi. OutputFormat: OutputFormat defines the way the RecordWriter writes the output key-value pairs to output files. Its instances, provided by Hadoop, write files to HDFS. Thus OutputFormat instances write the final output of the reducer to HDFS.
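For illustration, a minimal word-count job showing the Mapper, Combiner, and Reducer roles described above, written against the mrjob library (the use of mrjob, and the job itself, are assumptions for the example; the notes do not prescribe a framework).

```python
# Word-count sketch using mrjob: mapper emits (word, 1), combiner does local
# aggregation, and the reducer sums counts after shuffle-and-sort groups by key.
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # Mapper: emit an intermediate (word, 1) pair per word in the input record.
        for word in line.split():
            yield word.lower(), 1

    def combiner(self, word, counts):
        # Combiner: mini-reducer that aggregates locally on the mapper's output.
        yield word, sum(counts)

    def reducer(self, word, counts):
        # Reducer: final aggregation; the framework writes the output records out.
        yield word, sum(counts)

if __name__ == "__main__":
    WordCount.run()
```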
17. Issues of stream processing

 Scalability: Ensuring the system can process more data streams without a performance decline.
 Fault tolerance: Guaranteeing data analysis can continue even after failures.
 Data integrity: Maintaining data integrity by validating the data being processed.
 Time: Streams are unbounded and continuously serve events, so time is an important aspect
of stream processing.
 Processing delays: Delays can occur due to network traffic, slow processing speeds, or back pressure from downstream operators.

18. PageRank

 PageRank refers to an algorithm developed by Larry Page and Sergey Brin, the founders
of Google, to rank web pages in search engine results.
 PageRank is a crucial component of Google's search algorithm, used to determine the
importance of a webpage based on the quantity and quality of links pointing to it.
 The fundamental idea behind PageRank is that a webpage is considered more important if
it is linked to by other important pages. This concept is akin to academic citations: a paper
is considered more influential if it is cited by other reputable papers.
Using the web graph shown below, compute the PageRank at every node at the end of the second iteration. Use a teleport factor of 0.8.

19. DGIM algorithm

 This is the simplest case of an algorithm called DGIM. This version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and it enables the estimation of the number of 1s in the window with an error of no more than 50%.
 Divide the window into buckets, each consisting of: (a) the timestamp of its rightmost 1, and (b) the number of 1s in the bucket. This number must be a power of 2, and we refer to the number of 1s as the size of the bucket.
 The right end of a bucket always starts at a position with a 1.
 The number of 1s in a bucket must be a power of 2.
 There are either one or two buckets of any given power-of-2 size.
 Buckets do not overlap in timestamps.
 Buckets are sorted by size.
 Buckets disappear when their end-time is more than N time units in the past.

 Example: Suppose the stream is that of Fig. 4.9.1 and k = 10. Then the query asks for the number of 1s in the ten rightmost bits, which happen to be 0110010110. Let the current timestamp (time of the rightmost bit) be t. Then the two buckets with one 1, having timestamps t - 1 and t - 2, are completely included in the answer.
 The bucket of size 2, with timestamp t - 4, is also completely included. However, the bucket of size 4, with timestamp t - 8, is only partly included. We know it is the last bucket to contribute to the answer, because the next bucket to its left has a timestamp less than t - 9 and is thus completely out of the window.
 On the other hand, we know that the buckets to its right are completely inside the range of the query because of the existence of a bucket to their left with timestamp t - 9 or greater.
 Our estimate of the number of 1s in the last ten positions is therefore 1 + 1 + 2 + 2 = 6: the two buckets of size 1, the bucket of size 2, and half the bucket of size 4 that is partially within range. Of course, the correct answer is 5.
 Suppose the above estimate of the answer to a query involves a bucket b of size 2^j that is partially within the range of the query. Let us consider how far from the correct answer c our estimate could be. There are two cases: the estimate could be larger or smaller than c.
 Case 1: The estimate is less than c. In the worst case, all the 1s of b are actually within the range of the query, so the estimate misses half of bucket b, that is, 2^(j-1) 1s. But in this case c is at least 2^j; in fact it is at least 2^(j+1) - 1, since there is at least one bucket of each of the sizes 2^(j-1), 2^(j-2), ..., 1. We conclude that our estimate is at least 50% of c.
 Case 2: The estimate is greater than c. In the worst case, only the rightmost bit of bucket b is within range, and there is only one bucket of each of the sizes smaller than b. Then c = 1 + 2^(j-1) + 2^(j-2) + ... + 1 = 2^j, and the estimate we give is 2^(j-1) + 2^(j-1) + 2^(j-2) + ... + 1 = 2^j + 2^(j-1) - 1. We see that the estimate is no more than 50% greater than c.
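A small sketch of the estimation step just described, under the stated conventions: buckets are kept as (timestamp-of-rightmost-1, size) pairs, newest first, and the oldest bucket overlapping the window contributes half its size. The bucket list below reproduces the example above.

```python
# DGIM-style estimate of the number of 1s in the last k bits of the window.

def dgim_estimate(buckets, current_time, k):
    """buckets: list of (timestamp, size) pairs, newest first."""
    estimate = 0
    for i, (ts, size) in enumerate(buckets):
        if current_time - ts >= k:        # bucket entirely outside the window
            break
        # Is this the oldest bucket that still overlaps the window?
        is_last = (i == len(buckets) - 1) or (current_time - buckets[i + 1][0] >= k)
        estimate += size // 2 if is_last else size   # partial bucket counts half
    return estimate

# The example above: buckets of sizes 1, 1, 2, 4 with timestamps t-1, t-2, t-4, t-8.
t = 100
buckets = [(t - 1, 1), (t - 2, 1), (t - 4, 2), (t - 8, 4)]
print(dgim_estimate(buckets, t, 10))   # 1 + 1 + 2 + 4 // 2 = 6, as in the text
```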

20. Social Network Graph Clustering Algorithm

 Social networks are naturally modeled as graphs, which we sometimes refer to as a social graph.
 The entities are the nodes, and an edge connects
two nodes if the nodes are related by the
relationship that characterizes the network.
 If there is a degree associated with the
relationship, this degree is represented by labeling the edges.
 Often, social graphs are undirected, as for the Facebook friends graph. But they can be
directed graphs, as for example the graphs of followers on Twitter or Google+.
Clustering of Social-Network Graphs:
i. Distance Measures for Social-Network Graphs
 When the edges of the graph have labels, these labels might be usable as a distance
measure, depending on what they represented. But when the edges are unlabeled, as in
a “friends” graph, there is not much we can do to define a suitable distance.
 Our first instinct is to assume that nodes are close if they have an edge between them
and distant if not. Thus, we could say that the distance d(x, y) is 0 if there is an edge (x,
y) and 1 if there is no such edge. We could use any other two values, such as 1 and ∞,
as long as the distance is closer when there is an edge.
ii. Applying Standard Clustering Methods
 Hierarchical clustering of a social-network graph starts by combining some two nodes
that are connected by an edge.
 Successively, edges that are not between two nodes of the same cluster would be chosen
randomly to combine the clusters to which their two nodes belong.
 The choices would be random, because all distances represented by an edge are the
same.
iii. The Girvan-Newman Algorithm:
 The Girvan-Newman (GN) Algorithm visits each node X once and computes the
number of shortest paths from X to each of the other nodes that go through each of the
edges.
 The algorithm begins by performing a breadth-first search (BFS) of the graph, starting
at the node X. Note that the level of each node in the BFS presentation is the length of
the shortest path from X to that node.
 Thus, the edges that go between nodes at the same level can never be part of a shortest
path from X.
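As a quick illustration (an assumption, not part of the notes), the networkx library provides a Girvan-Newman implementation; on a tiny "friends" graph made of two triangles joined by a bridge edge, the first split removes the bridge, which carries the highest betweenness.

```python
# Girvan-Newman community detection on a small undirected social graph.
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "C"), ("A", "C"),   # one triangle
                  ("D", "E"), ("E", "F"), ("D", "F"),   # another triangle
                  ("C", "D")])                          # bridge edge with highest betweenness

communities = next(girvan_newman(G))   # first split: remove the highest-betweenness edge(s)
print(sorted(sorted(c) for c in communities))   # [['A', 'B', 'C'], ['D', 'E', 'F']]
```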

21. PageRank Algorithm


 The PageRank algorithm outputs a probability distribution used to represent the likelihood
that a person randomly clicking on links will arrive at any particular page.
 The PageRank computations require several passes, called “iterations”, through the
collection to adjust approximate PageRank values to more closely reflect the theoretical
true value.
Example: Assume a small universe of four web pages: A, B, C, and D. Links from a page to
itself, or multiple outbound links from one single page to another single page, are ignored.
PageRank is initialized to the same value for all pages. In the original form of PageRank, the
sum of PageRank over all pages was the total number of pages on the web at that time, so each
page in this example would have an initial value of 1. However, later versions of PageRank,
and the remainder of this section, assume a probability distribution between 0 and 1. Hence the
initial value for each page in this example is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon the next
iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would transfer 0.25
PageRank to A upon the next iteration, for a total of 0.75.

Suppose instead that page B had a link to pages C and A, page C had a link to page A, and page
D had links to all three pages. Thus, upon the first iteration, page B would transfer half of its
existing value, or 0.125, to page A and the other half, or 0.125, to page C. Page C would transfer
all of its existing value, 0.25, to the only page it links to, A. Since D had three outbound links,
it would transfer one-third of its existing value, or approximately 0.083, to A. At the completion
of this iteration, page A will have a PageRank of approximately 0.458.
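A few lines of Python reproducing the first-iteration numbers in this example (page A is assumed to have no outbound links, since the example does not list any, and no teleport/damping factor is applied):

```python
# One PageRank iteration for the four-page example: B -> {A, C}, C -> {A}, D -> {A, B, C}.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
rank = {page: 0.25 for page in links}            # initial PageRank of 0.25 per page

new_rank = {page: 0.0 for page in links}
for page, outlinks in links.items():
    for target in outlinks:
        new_rank[target] += rank[page] / len(outlinks)   # split rank equally over outlinks

print(round(new_rank["A"], 3))   # 0.458  (0.125 from B + 0.25 from C + ~0.083 from D)
```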
22. PCY Algorithm
 In pass 1 of A-Priori, most memory is idle. We store only individual item counts. We use
the idle memory to reduce memory required in pass 2.
Pass 1: In addition to item counts, maintain a hash table with as many buckets as fit in memory.
Keep a count for each bucket into which pairs of items are hashed. For each bucket just keep
the count, not the actual pairs that hash to the bucket.
FOR (each basket)
    FOR (each item in the basket)
        add 1 to the item's count;
    FOR (each pair of items)
        hash the pair to a bucket;
        add 1 to the count for that bucket;
 Pairs of items need to be generated from the input file;
they are not present in the file.
 We are not just interested in the presence of a pair, but we need to see whether it is present
at least s (support) times.
 If a bucket contains a frequent pair, then the bucket is surely frequent. However, even without any frequent pair, a bucket can still be frequent. So, we cannot use the hash to eliminate any member (pair) of a "frequent" bucket.
 But, for a bucket with total count less than s, none of its pairs can be frequent. Pairs that hash to this bucket can be eliminated as candidates, even if the pair consists of two frequent items. E.g., even though {A} and {B} are frequent, the count of the bucket containing (A, B) might be < s.
Pass 2: Only count pairs that hash to frequent buckets
 Replace the buckets by a bit-vector: 1 means the bucket count exceeded the support s (call it a frequent bucket); 0 means it did not.
 The 4-byte integer counts are replaced by bits, so the bit-vector requires only 1/32 of the memory. Also, decide which items are frequent and list them for the second pass.
 Count all pairs (i, j) that meet the conditions for being a candidate pair:
i. Both i and j are frequent items.
ii. The pair (i, j) hashes to a bucket whose bit in the bit vector is 1 (i.e. a frequent bucket).
 Both conditions are necessary for the pair to have a chance of being frequent.
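A hedged end-to-end sketch of both passes on a toy basket dataset (the baskets, the number of buckets, the support threshold, and the use of Python's built-in hash as the pair-hashing function are all assumptions for illustration):

```python
# PCY sketch: pass 1 counts items and hashed pair buckets; pass 2 keeps candidate pairs.
from itertools import combinations

baskets = [["milk", "bread"], ["milk", "beer"], ["milk", "bread", "beer"], ["bread", "beer"]]
NUM_BUCKETS, SUPPORT = 11, 2

item_counts, bucket_counts = {}, [0] * NUM_BUCKETS
for basket in baskets:
    for item in basket:
        item_counts[item] = item_counts.get(item, 0) + 1
    for pair in combinations(sorted(basket), 2):
        bucket = hash(pair) % NUM_BUCKETS            # hash the pair to a bucket
        bucket_counts[bucket] += 1

# Between passes: frequent items and the bucket bit-vector.
frequent_items = {i for i, c in item_counts.items() if c >= SUPPORT}
bitmap = [1 if c >= SUPPORT else 0 for c in bucket_counts]

# Pass 2: a pair is a candidate only if both items are frequent AND it hashes
# to a frequent bucket.
candidates = {pair for basket in baskets
              for pair in combinations(sorted(basket), 2)
              if set(pair) <= frequent_items and bitmap[hash(pair) % NUM_BUCKETS]}
print(sorted(candidates))   # [('beer', 'bread'), ('beer', 'milk'), ('bread', 'milk')]
```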
23. Collaborative Filtering

 Collaborative filtering is a technique used by recommendation systems to generate personalized suggestions or recommendations for users based on the preferences and behavior of similar users.
 The underlying idea is that if users have similar tastes or behaviors, they are likely to
appreciate similar items.
 User-item interaction matrix: The system first creates a matrix where rows represent
users and columns represent items. Entries in the matrix represent the interactions between
users and items, such as purchases, ratings, likes, or views.
 Finding similar users or items: Collaborative filtering then identifies users who have
interacted with similar items or items that have been interacted with by similar users. This
is done using similarity measures such as cosine similarity, Pearson correlation etc.
 Generating recommendations: Once similar users or items are identified, the system can
generate recommendations for a particular user. For example, if User A has interacted with
items X, Y, and Z, and User B has interacted with items X and Y, the system might
recommend item Z to User B since User A, who is similar to User B, has interacted with it.
i. Memory based approach:
 In the memory-based approach, the utility matrix is memorized and recommendations are made by comparing the given user against the rest of the utility matrix.
 Consider an example: if we have a set of movies and users, we want to find out how much user i likes movie k.
 Similarity between users a and i can be computed using methods like cosine similarity, Jaccard similarity, Pearson's correlation coefficient, etc. These results are very easy to create and interpret, but once the data becomes too sparse, performance becomes poor.
ii. Model based approach:
 One of the more prevalent implementations of model based approach is Matrix
Factorization. In this, we create representations of the users and items from the utility
matrix.
 Matrix factorization helps by reducing the dimensionality, hence making computation
faster. One disadvantage of this method is that we tend to lose interpretability as we do
not know what exactly elements of the vectors mean.
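A minimal sketch of the memory-based approach using cosine similarity (the small ratings matrix is assumed example data; 0 marks an unrated item): find the user most similar to the target, then suggest items that user rated but the target has not.

```python
# User-based collaborative filtering sketch with cosine similarity.
import numpy as np

# Rows = users, columns = items; 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1 (similar tastes to user 0)
    [1, 0, 5, 4],   # user 2
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = 0
sims = [cosine(ratings[target], ratings[other]) if other != target else -1.0
        for other in range(len(ratings))]
most_similar = int(np.argmax(sims))            # user 1

# Recommend items the similar user rated but the target user has not.
unseen = np.where(ratings[target] == 0)[0]
print(most_similar, [(int(i), ratings[most_similar, i]) for i in unseen])
# -> 1 [(2, 1.0)]  (item 2, taken from the most similar user's ratings)
```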

24. Define Hubs and authorities (Sums)


https://www.youtube.com/watch?v=au8HHqfi8rY&ab_channel=AnuradhaBhatia

25. Sums on agglomerative clustering


https://www.youtube.com/watch?v=oNYtYm0tFso&ab_channel=MaheshHuddar
