Bda PT 2
Unit V
1. Write a short note on the recommendation system.
2. Explain Collaborative and Content Based Filtering in Recommendation System
3. List and explain the different issues and challenges in data stream query processing. ( May 23)
4. What is graph store? Give an example where a graph store can be used to effectively solve a particular business problem. ( May 2024)
5. With a neat sketch, explain the architecture of the data-stream management system. ( May 23)
6. Describe collaborative filtering in a recommendation system. ( May 24)
7. Define collaborative filtering. Using an example of an e-commerce site like Flipkart or Amazon, describe how it can be used to provide recommendations to users. ( May 23)
8. How is recommendation done based on the properties of a product? Explain with the help of an example. ( Dec 2023)
9. Algorithm / Problem on Girvan-Newman
(and sums on the Girvan-Newman algorithm)
Unit VI
1. How to handle basic expressions in R? Give two examples.
2. How to create and use objects in R?
3. Explain any two functions available in “dplyr” packages. ( Ans : filter, select, mutate, arrange, count, etc.)
4. List and discuss basic features of R.
5. What are the advantages of using functions over scripts? ( Dec 23)
6. List and discuss various types of data structures in R.
7. Discuss the syntax of defining a function in R.
8. List and explain operators used to form data subsets in R.
9. Describe applications of data visualization. ( May 2024)
Flajolet-Martin (FM)
https://www.ques10.com/p/42208/suppose-a-stream-consists-of-the-integers-21615923/
https://www.youtube.com/watch?v=jgmUPdEq-U4
Girvan Newman
https://www.youtube.com/watch?v=dmMKJ1YUl-M
Solved problem : https://drive.google.com/file/d/1vDE3KDNr7m_FjTb2KypMnVGBYLvUaPAA/view?usp=sharing
Bloom Filter
https://drive.google.com/file/d/1ckQdObMnNzImA7B7r0wsnIhBmx0jaumv/view?usp=sharing
Unit IV
1. Algorithm and Problem on Bloom Filter
4. Explain the concept of a Bloom filter with an example.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a
set. For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered
usernames. The price we pay for this efficiency is that the structure is probabilistic in nature, which means there may be some false
positive results. A false positive means the filter might report that a given word is already taken when it actually is not.
1. Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of
elements.
2. Adding an element never fails. However, the false positive rate increases steadily as elements are added,
until all bits in the filter are set to 1, at which point all queries yield a positive result.
3. Deleting elements from the filter is not possible, because if we delete a single element by clearing the bits at the indices
generated by the k hash functions, we might also delete a few other elements that share those bits.
Algorithm :
1. Initialize a bit array of m bits, all set to zero.
2. Create k hash functions to calculate hashes for a given input. When we insert an element x into the
Bloom filter, the bits at indices h1(x), h2(x), ..., hk(x) are set to 1. Always choose good hashing algorithms to
avoid collisions, otherwise the rate of false positives will increase and the correctness of the Bloom
filter decreases. Fast, simple, non-cryptographic hashes include MurmurHash, FNV hashes and Jenkins hashes.
Let us say, for example, we choose the word “donkey”. First we hash it using our k hash functions:
h1(“donkey”) = x
h2(“donkey”) = y
h3(“donkey”) = z
Now we set the bits at positions x, y and z. Let us say x = 1, y = 4 and z = 7.
Now, to check whether “donkey” exists in the Bloom filter, we do the process in reverse:
we hash “donkey” with our k hash functions and check whether all of those bits are set.
A false positive occurs when the bits for a word were already set by another word or a combination of
other words. In that case the bits are set even though we never inserted the word. Hence choosing good hash
functions is crucial to the accuracy of the Bloom filter.
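A minimal sketch of this process in R, with k = 2 toy hash functions (h1 and h2 below are simple illustrative helpers, not production hash functions):

```r
m <- 16                       # number of bits in the filter
bits <- rep(0L, m)            # bit array, initially all zeros

# Two toy hash functions built from character codes (illustrative helpers only)
h1 <- function(word) sum(utf8ToInt(word)) %% m + 1
h2 <- function(word) sum(utf8ToInt(word) * seq_along(utf8ToInt(word))) %% m + 1

bloom_add <- function(word) {
  bits[c(h1(word), h2(word))] <<- 1L       # set the k = 2 hashed positions
}

bloom_check <- function(word) {
  all(bits[c(h1(word), h2(word))] == 1L)   # TRUE means "possibly in the set"
}

bloom_add("donkey")
bloom_check("donkey")   # TRUE: it was inserted
bloom_check("monkey")   # usually FALSE; a TRUE here would be a false positive
```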
2. Algorithm and Problem on Flajolet-Martin (FM) Algorithm
The Flajolet-Martin algorithm is a probabilistic algorithm that is mainly used to count the number
of distinct (unique) elements in a stream or database. The basic idea behind the Flajolet-Martin algorithm is to
use a hash function to map each element in the dataset to a binary string, and to use the length of the
longest run of trailing zeros (the "tail length") observed among these binary strings as an estimator of the number of
unique elements.
Algorithm
1. Choose a hash function that maps the elements in the dataset to fixed-length binary
strings. The length of the binary string can be chosen based on the accuracy desired.
2. Apply the hash function to each data item in the dataset to get its binary string representation.
3. For each binary string, count the number of trailing zeros, i.e., the position of the rightmost set bit (this is the
tail length of that element).
4. Keep track of R, the maximum tail length observed over the whole stream.
5. Estimate the number of distinct elements in the dataset as 2 to the power of R, the maximum tail length
calculated in the previous steps.
The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of
hash functions it uses. Generally, increasing the length of the binary strings or using more hash functions (and
combining their estimates) increases the algorithm's accuracy.
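A small sketch of the estimate in R on a toy stream, using a simple illustrative hash function (not one you would use in practice):

```r
# Toy hash: map an element to a small integer, then inspect its binary tail
toy_hash <- function(x, m = 32) (3 * x + 7) %% m      # illustrative hash only

# Tail length = number of trailing zeros in the binary representation
tail_length <- function(h) {
  if (h == 0) return(0)                 # treat 0 as tail length 0 for simplicity
  r <- 0
  while (h %% 2 == 0) { h <- h %/% 2; r <- r + 1 }
  r
}

stream <- c(4, 7, 9, 4, 7, 1, 13, 9, 4)   # example stream with 5 distinct values
R_max  <- max(sapply(stream, function(x) tail_length(toy_hash(x))))
estimate <- 2^R_max                        # FM estimate of distinct elements
estimate                                   # 4 here; the true number of distinct values is 5
```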
3. Algorithm and Problem on DGIM Algorithm
5. Explain the DGIM algorithm for counting ones in a stream with an example. (May 24)
7. List down all 6 constraints that must be satisfied for representing a stream by buckets using DGIM algorithm
with an example. ( Dec 23)
Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the
form “how many 1’s are there in the last k bits?” for any k≤ N.
The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the
number of 1’s in the window with an error of no more than 50%.
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the
second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N,
so they can be represented by log₂ N bits. If we also store the total number of bits ever seen in the stream (i.e., the
most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window
the bit with that timestamp is.
We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the
size of the bucket.
To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right end. To represent the
number of 1’s we only need log₂ log₂ N bits. The reason is that we know this number i is a power of 2, say 2^j, so we
can represent i by coding j in binary. Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to
represent a bucket. There are six rules that must be followed when representing a stream by buckets. (Q7)
● The right end of a bucket is always a position with a 1.
● Every position with a 1 is in some bucket.
● No position is in more than one bucket.
● There are one or two buckets of any given size, up to some maximum size.
● All sizes must be a power of 2.
● Buckets cannot decrease in size as we move to the left (back in time).
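A compact sketch in R of this bucket bookkeeping (a simplified illustration that keeps at most two buckets of each size; dgim_add and dgim_count are hypothetical helper names):

```r
# Each bucket is a (timestamp of its right end, size = number of 1s, a power of 2)
N <- 16                  # window length
buckets <- list()        # list of buckets, newest first
now <- 0                 # current timestamp

dgim_add <- function(bit) {
  now <<- now + 1
  # drop any bucket whose right end has fallen out of the window
  buckets <<- Filter(function(b) b$ts > now - N, buckets)
  if (bit == 1) {
    buckets <<- c(list(list(ts = now, size = 1)), buckets)
    # whenever three buckets share the same size, merge the two oldest of them
    repeat {
      sizes <- sapply(buckets, function(b) b$size)
      dup <- as.integer(names(which(table(sizes) >= 3))[1])
      if (is.na(dup)) break
      idx  <- which(sizes == dup)
      old2 <- tail(idx, 2)                        # the two oldest buckets of that size
      merged <- list(ts = buckets[[old2[1]]]$ts,  # keep the newer of the two timestamps
                     size = 2 * dup)
      buckets <<- append(buckets[-old2], list(merged), after = old2[1] - 1)
    }
  }
}

dgim_count <- function(k) {
  # estimate the number of 1s among the last k bits
  inside <- Filter(function(b) b$ts > now - k, buckets)
  if (length(inside) == 0) return(0)
  sizes <- sapply(inside, function(b) b$size)
  sum(sizes) - max(sizes) / 2     # count only half of the oldest qualifying bucket
}

for (b in c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1)) dgim_add(b)
dgim_count(6)    # estimate is 3 here; the true count of 1s in the last 6 bits is 4
```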
Unit V
1. Write a short note on the recommendation system.
A recommendation system (recommender system) predicts and filters the items a particular user is likely to prefer, based on the
user's own past behaviour and/or the behaviour of similar users; the two main approaches are collaborative filtering and
content-based filtering.
In Collaborative Filtering, we tend to find similar users and recommend what similar users like. In this type of
recommendation system, we don’t use the features of the item to recommend it; rather, we classify the users into
clusters of similar types and recommend items to each user according to the preferences of its cluster.
There are basically four types of collaborative filtering algorithms: memory-based, model-based, hybrid, and deep-learning-based.
One of the main advantages of these recommender systems is that they are highly efficient at providing
personalized content and are also able to adapt to changing user preferences.
Cosine Similarity : A larger cosine implies a smaller angle between two users, hence more similar
interests. We can compute the cosine distance between two users' rows in the utility matrix, giving the value zero
to all the unfilled entries to make the calculation easy. If the cosine is smaller, the distance
between the users is larger; if the cosine is larger, the angle between the users is small and we can recommend
them similar things.
Normalizing Ratings : In the process of normalizing, we take the average rating of a user and subtract it from each of that
user's ratings, so we get either positive or negative values as ratings, which can then be used to classify users further into similar
groups. By normalizing the data we can form clusters of users that give similar ratings to similar items, and then
we can use these clusters to recommend items to the users.
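A small sketch in R of both steps on a toy utility matrix (the ratings are hypothetical; unfilled entries are treated as zero in the cosine computation):

```r
# Toy utility matrix: rows = users, columns = items, NA = not rated
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 1, NA,
                    1, NA, 5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("U1", "U2", "U3"), c("I1", "I2", "I3", "I4")))

# Normalize: subtract each user's mean rating from that user's ratings
normalized <- ratings - rowMeans(ratings, na.rm = TRUE)

# Cosine similarity between two users, with unfilled entries set to 0
cosine_sim <- function(a, b) {
  a[is.na(a)] <- 0; b[is.na(b)] <- 0
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(normalized["U1", ], normalized["U2", ])  # larger cosine: similar tastes
cosine_sim(normalized["U1", ], normalized["U3", ])  # smaller (negative) cosine: dissimilar tastes
```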
A Content-Based Recommender works on the data that we take from the user, either explicitly (ratings) or implicitly
(clicking on a link). From this data we create a user profile, which is then used to make suggestions to the user; as the user provides
more input or takes more actions on the recommendations, the engine becomes more accurate.
User Profile : In the user profile, we create vectors that describe the user’s preferences. In the creation of a user profile, we
use the utility matrix, which describes the relationship between users and items. With this information, the best estimate we
can make regarding which item the user likes is some aggregation of the profiles of those items.
Item Profile : In Content-Based Recommender, we must build a profile for each item, which will represent the important
characteristics of that item.
Utility Matrix : The utility matrix signifies the user’s preference for certain items. In the data gathered from the user, we
have to find some relation between the items which are liked by the user and those which are disliked; for this purpose
we use the utility matrix. In it, we assign a particular value to each user-item pair; this value is known as the degree of
preference. Then we draw a matrix of users against the respective items to identify their preference relationships.
1. We can use the cosine distance between the vectors of the item and the user to determine its preference to the
user.
2. We can also use a classification approach in recommendation systems, for example a decision tree for
deciding whether a user wants to watch a movie or not, where at each level we apply a certain condition to
refine our recommendation.
7. Define collaborative filtering. Using an example of an e-commerce site like Flipkart or Amazon, describe how it can
be used to provide recommendations to users
Collaborative filtering is a popular recommendation technique used by e-commerce platforms like Flipkart or Amazon to
provide personalized product recommendations to users. It leverages the preferences, behaviors, or actions of users to
predict what a particular user might like, based on the assumption that users with similar preferences will like similar
items.
1. User Based : In this approach, the system looks for users whose shopping patterns are similar to those of a given
user, say Alice. If another user, Bob, has a purchasing history similar to Alice's and bought a product that Alice hasn’t seen yet,
the system can recommend that product to Alice. For example, Alice has bought a smartphone, a laptop, and headphones.
Bob has also bought a smartphone and a laptop, but he has additionally purchased a smartwatch. Since Alice and Bob share
similar tastes (smartphone, laptop), the system assumes Alice might also like the smartwatch, and it recommends a smartwatch
to Alice.
2. Item Based : In this approach, the system looks for items similar to the ones Alice has interacted with. It finds
products that are often bought together with the items Alice likes and recommends those to her. Alice has bought
a laptop. Many other users who bought the same laptop have also purchased a laptop stand or mouse. The system
will recommend these related items (laptop stand, mouse) to Alice, assuming she might be interested in them
since other buyers of the same laptop have purchased these items.
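A toy sketch of the item-based idea in R on a hypothetical binary purchase matrix: items most often co-purchased with the products Alice already owns are ranked first.

```r
# Rows = users, columns = products, 1 = purchased (hypothetical data)
purchases <- matrix(c(1, 1, 1, 0, 0,   # Alice
                      1, 1, 0, 1, 0,   # Bob
                      1, 0, 0, 1, 1,   # Carol
                      0, 1, 0, 1, 0),  # Dave
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("Alice", "Bob", "Carol", "Dave"),
                                    c("phone", "laptop", "headphones", "smartwatch", "stand")))

# Co-purchase counts between items (how often two items are bought together)
co <- t(purchases) %*% purchases
diag(co) <- 0

# Score unseen items for Alice by how often they co-occur with items she owns
owned  <- purchases["Alice", ] == 1
scores <- colSums(co[owned, , drop = FALSE])
sort(scores[!owned], decreasing = TRUE)   # smartwatch ranks first for Alice
```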
3. List and explain the different issues and challenges in data stream query processing.
1. High Throughput and Low Latency : Data streams typically involve high-speed, continuous data generation
(e.g., from IoT sensors, social media feeds). Queries on such streams require rapid processing to maintain
real-time or near-real-time responses. Designing systems capable of handling high volumes of data
(throughput) with minimal delay (latency) to ensure timely query results.
2. Unbounded and Continuous Nature of Data : Unlike traditional databases, data streams are unbounded and
continuously generate new data, making it impossible to store and query all the data at once. Queries must
operate on recent or relevant subsets of data, requiring windowing techniques (time-based, tuple-based, or
count-based windows) to handle infinite streams effectively.
3. Memory and Resource Constraints : Since data streams are continuous and potentially infinite, processing
systems must handle queries without relying on storing the entire dataset. Efficient use of memory and
computational resources is crucial. This often necessitates approximate algorithms, sliding windows, and
summarization techniques to manage resource consumption.
4. Imprecise and Incomplete Data : Streams often contain noisy, incomplete, or imprecise data, as seen in sensor
readings, social media, or financial transactions. Query systems need to handle uncertainty and missing values,
which may require techniques for noise reduction, filtering, or interpolation.
5. Approximate Query Processing : Exact answers are sometimes impractical or unnecessary in streaming
environments due to time and resource constraints. Designing approximate query processing (AQP) algorithms,
such as sampling, sketching, or hashing, to provide fast, near-accurate answers with controlled error margins.
6. Out-of-Order and Late Data : Data can arrive out of order or be delayed, especially in distributed environments
where network latency or failures affect data delivery. Systems must cope with such scenarios, using techniques like
watermarking and reordering to ensure consistent and correct query results.
7. Distributed and Parallel Processing : Large-scale data streams may be processed in distributed or cloud
environments to meet scalability and fault-tolerance needs. Ensuring efficient data distribution, synchronization, and
fault tolerance in distributed environments while minimizing communication overhead and ensuring consistency.
8. Fault Tolerance and Availability : Data stream systems need to operate continuously, even in the face of failures.
Maintaining high availability, recovering from failures without data loss, and ensuring reliable processing across
distributed systems are crucial.
9. Query Optimization : Continuous queries need to be optimized for efficient execution in environments where both
data streams and queries are dynamic. Optimizing query execution plans, choosing the best query processing
strategies, and adapting to changes in the data or query characteristics are key aspects of query optimization in stream
processing.
10. Security and Privacy : Sensitive data may flow through streams, raising privacy and security concerns, particularly
when data is processed across multiple domains or third parties. Ensuring data security and privacy, such as
encrypting data streams and enforcing access control, without introducing significant latency.
4. What is graph store? Give an example where a graph store can be used to effectively solve a particular
business problem.
A graph store is a type of database designed specifically to store, manage, and query data that is best represented as
a graph. In graph databases, entities (nodes) and the relationships (edges) between them are stored explicitly, making
it easy to represent and query complex networks of information.
Graph databases are highly effective when the relationships between entities are as important as the entities
themselves. These databases allow for fast and efficient traversals across nodes based on relationships, making them
ideal for applications like social networks, recommendation engines, fraud detection, and more.
1. It solves many-to-many relationship problems : relationships such as "friends of friends" are many-to-many, and a
graph store is the natural choice when the equivalent query in a relational database would be very complex.
2. When relationships between data elements are more important than the elements themselves : a graph database
may hold many user data elements, but it is the relationships between those elements that determine how the
data is stored and queried.
3. Low latency with large-scale data : when you add lots of relationships in a relational database, the data sets
become huge, the queries become more complex and they take much longer than usual. A graph database, however,
is designed specifically for this purpose, so relationships can be queried with ease.
Graph databases can easily handle frequent schema changes, large volumes of data and real-time query response
times, and more intelligent data-activation requirements are also served well by the graph model.
Example : In fraud detection for a bank or an e-commerce company, accounts, devices, phone numbers and addresses can be
stored as nodes, with edges connecting accounts that share these attributes; traversing those relationships quickly exposes
rings of accounts sharing the same device or address, something that is very hard to express efficiently with relational joins.
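A minimal sketch of the fraud-ring example in R, assuming the igraph package is installed (the accounts and shared attributes are hypothetical):

```r
library(igraph)

# Edges: account -- shared attribute (address / device)
edges <- data.frame(
  from = c("acct1", "acct2", "acct3", "acct4", "acct5"),
  to   = c("addr_9_Elm_St", "addr_9_Elm_St", "device_X", "device_X", "addr_4_Oak_Ave")
)

g <- graph_from_data_frame(edges, directed = FALSE)

# Accounts connected through shared attributes fall into the same component
comp <- components(g)
split(names(comp$membership), comp$membership)   # groups of nodes forming candidate fraud rings
```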
5. With a neat sketch, explain the architecture of the data-stream management system. ( May 23)
DSMS stands for Data Stream Management System. It is a software system, analogous to a DBMS (database
management system), but it involves the processing and management of continuously flowing data streams rather than
static data.
1. Data Source Layer : The first layer of a DSMS is the data source layer, which comprises all the data sources,
including sensors, social media feeds, financial/stock market feeds, etc. Capturing and parsing of the data
streams happens in this layer; it is basically the collection layer that collects the data.
2. Data Ingestion Layer : This layer is a bridge between the data source layer and the processing layer. Its main purpose
is to handle the flow of data, i.e., data flow control, data buffering and data routing.
3. Processing Layer : This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS applications
and processes the data streams in real time. To perform the processing it uses stream processing engines such as Apache Flink or
Apache Storm. The main functions of this layer are to filter, transform, aggregate and enrich the data streams,
in order to derive insights and detect patterns.
4. Storage Layer : Once the data is processed, the results need to be stored in some storage unit. The storage layer
consists of various stores such as NoSQL databases, distributed databases, etc. It helps to ensure data durability and
the availability of data in case of system failure.
5. Querying Layer : It supports two types of queries: ad hoc queries and standing (continuous) queries. This layer provides the tools
that can be used for querying and analyzing the stored data streams, including SQL-like query languages and
programming APIs.
6. Visualization and Reporting Layer : This layer provides tools for visualization such as charts, pie charts,
histograms, etc. On the basis of these visual representations it also helps to generate reports for analysis.
7. Integration Layer : This layer is responsible for integrating the DSMS application with traditional systems, business
intelligence tools, data warehouses, ML applications and NLP applications. It helps to enhance applications that are
already running.
6. Describe collaborative filtering in a recommendation system. ( May 24)
1. User-based recommendation: Here we calculate Pearson’s similarity measure, which is needed to determine
the most closely related users, i.e., those whose likes and dislikes follow the same pattern. The computational operations are
based on the formula of Pearson similarity: the ratings of two different users are mean-centred (each user's mean rating
is subtracted) and multiplied together in the numerator, while in the denominator the mean-centred ratings of each user are
squared and summed. After getting the summation values, the numerator is divided by the square root of the product of the
two sums to obtain the similarity measure.
Example : Alice likes movies X, Y, and Z. Bob likes movies Y and Z. Since Alice and Bob have similar preferences,
the system might recommend movie X to Bob, assuming he hasn't watched it yet.
2. Item-based recommendation: The initial aim is to obtain the mean-adjusted matrix. The mean-adjusted matrix is
used in predicting the rating a new user would give an item, and is based on reducing the bias introduced by the
users, as some tend to give very high ratings most of the time and some tend to give very low ratings most of
the time. To reduce this inconsistency, we subtract each user's mean rating from that user's ratings. The next step is
the calculation of the similarity measure between the items, for which we can use cosine
similarity. The computational operations are based on the formula of cosine similarity: the (mean-adjusted) ratings of
different users on two items are multiplied and summed in the numerator, while in the denominator the ratings of each item
are squared, summed and square-rooted. Dividing the numerator by the product of these values gives the similarity measure.
Example : Alice watches movie X and gives it a high rating. The system checks that other users who liked movie
X also liked movie Y. Based on this, the system recommends movie Y to Alice.
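A short sketch of both similarity measures in R on a toy rating matrix (hypothetical ratings, base R only):

```r
# Toy rating matrix: rows = users, columns = movies
ratings <- matrix(c(5, 4, 2, 1,
                    4, 5, 1, 2,
                    2, 1, 5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Alice", "Bob", "Carol"), c("X", "Y", "Z", "W")))

# User-based: Pearson similarity between two users' rating vectors
pearson <- function(u, v) {
  du <- u - mean(u); dv <- v - mean(v)
  sum(du * dv) / (sqrt(sum(du^2)) * sqrt(sum(dv^2)))
}
pearson(ratings["Alice", ], ratings["Bob", ])    # about 0.8: similar taste
pearson(ratings["Alice", ], ratings["Carol", ])  # about -0.8: opposite taste

# Item-based: mean-adjust by user, then cosine similarity between item columns
adj <- ratings - rowMeans(ratings)               # subtract each user's mean rating
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(adj[, "X"], adj[, "Y"])                   # positive: X and Y are rated similarly
```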
9. Algorithm / Problem on Girvan-Newman
The Girvan-Newman algorithm for the detection and analysis of community structure relies on the iterative elimination of
edges that have the highest number of shortest paths between nodes passing through them. By removing edges from
the graph one-by-one, the network breaks down into smaller pieces, so-called communities (more formally, components
of a graph).
The idea is to find which edges in a network lie most frequently on the shortest paths between other pairs of nodes, by computing
the edge betweenness centrality. The edges joining communities are expected to have a high edge betweenness. The
underlying community structure of the network becomes much more fine-grained once the edges with the highest
betweenness are eliminated, which means that the communities become much easier to spot.
Betweenness centrality measures the extent to which a vertex or edge lies on paths between vertices. Vertices and
edges with high betweenness may have considerable influence within a network by virtue of their control over
information passing between others. The calculation of betweenness centrality is not standardized and there are many
ways to solve it. It is defined as the number of shortest paths in the graph that pass through the node or edge divided by
the total number of shortest paths.
Algorithm:
1. Create a graph of N nodes and its edges, or take an inbuilt graph such as a barbell graph.
2. Calculate the betweenness of all existing edges in the graph.
3. Remove the edge(s) with the highest betweenness.
4. Recalculate the betweenness of all the edges affected by the removal.
5. Repeat steps 3 and 4 until no edges remain (or until the desired number of communities is obtained).
In a typical worked example with two communities joined by a single edge between nodes C and D, the
Girvan-Newman algorithm would first remove the edge between C and D because it has the highest edge betweenness;
intuitively, this is because that edge is located between the two communities. After removing an edge, the betweenness
centrality has to be recalculated for every remaining edge; the process continues until, for example, every remaining edge
has the same betweenness centrality.
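A brief sketch in R, assuming the igraph package is available: it runs the built-in edge-betweenness community detection on a small two-community graph like the one described above, and also shows one manual removal step.

```r
library(igraph)

# Two small triangles joined by a single bridge edge C-D (hypothetical graph)
g <- graph_from_literal(A - B, B - C, A - C,   # first community
                        D - E, E - F, D - F,   # second community
                        C - D)                 # bridge edge between the communities

# Built-in Girvan-Newman style community detection
communities <- cluster_edge_betweenness(g)
membership(communities)

# One manual iteration: remove the edge with the highest edge betweenness
eb <- edge_betweenness(g)
g2 <- delete_edges(g, which.max(eb))   # removes the bridge C-D
components(g2)$no                      # the graph now splits into 2 components (the two communities)
```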
Unit VI
1. How to handle basic expressions in R? Give two examples.
Expressions in R can be created using the `expression()` function, which generates objects of the class `expression`. These
expressions can be math expressions, basic formatting expressions, or font face expressions.
R provides functions to work with regular expressions to find patterns in text. Regular expressions allow searching and
modifying strings based on specific patterns.
Basic expressions in R allow formatting text or performing mathematical operations, while regular expressions enable pattern
matching and string replacement. These tools are widely used for visualization and data manipulation in R.
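Two small examples in R matching the description above (the values are illustrative):

```r
# 1. A math expression object, typically used for plot annotations
e <- expression(x^2 + 2 * x + 1)
eval(e, list(x = 3))                                 # 16
plot(1:10, (1:10)^2, main = expression(y == x^2))    # formatted math in the plot title

# 2. Regular expressions: find and modify strings based on a pattern
words <- c("data1", "stream22", "graph")
grepl("[0-9]+", words)      # TRUE TRUE FALSE - which strings contain digits
gsub("[0-9]+", "", words)   # "data" "stream" "graph" - strip the digits
```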
2. How to create and use objects in R?
► Instances of classes are known as objects
► Variables of other languages are referred to as objects in R.
► There are multiple objects in R:
i. Vectors,
ii. Lists,
iii. Arrays,
iv. Matrices,
v. Data frame, and
vi. Factors.
You can create an object using the "<-" or an equals "=" sign.
Both of these are used to assign values to objects. There are a few rules to
keep in mind while naming an object in R. They are:
▪ A name must start with a letter or a dot (and a leading dot must not be followed by a digit).
▪ A name may contain letters, digits, dots and underscores, but no spaces or other special characters.
▪ Reserved words (if, else, for, TRUE, NULL, etc.) cannot be used as names, and names are case-sensitive.
► Vectors:
▪ Vector is the most basic data object in R
▪ Def: Vector is a collection of elements which is most commonly of mode character, integer, logical or numeric
▪ To create a vector with more than 1 element we must use the c() function
► Lists:
▪ A list in R is a generic object consisting of an ordered collection of objects
▪ Lists are one-dimensional and heterogeneous in nature
▪ To create a list we must use the list() function
► Arrays:
▪ Arrays can be of any number of dimensions, it takes a dim attribute and creates the required number of dimensions
▪ Vectors are given as the input and the value in the dim parameter is used to create an array
▪ To create an array we must use the array() function
► Matrices:
▪ A matrix is an R object in which the elements are arranged in a two-dimensional rectangular layout
▪ The elements are of the same atomic type
▪ To create a matrix we must use the matrix() function
► Factors:
▪ These are data objects that are used to categorize the data and store it as levels
▪ It is very useful for statistical modelling
▪ To create a factor we must use the factor() function
► Data Frames:
▪ Data frames are tabular data objects
▪ It is an object in which the columns can contain different modes of data
▪ To create a data frame we must use the data.frame() function
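A quick illustration in R creating one object of each type listed above:

```r
v <- c(10, 20, 30)                       # vector via c()
l <- list(name = "Alice", scores = v)    # list via list()
a <- array(1:24, dim = c(2, 3, 4))       # 3-dimensional array via array()
m <- matrix(1:6, nrow = 2)               # 2 x 3 matrix via matrix()
f <- factor(c("low", "high", "low"))     # factor with levels "high", "low"
df <- data.frame(Name = c("A", "B"), Age = c(25, 30))   # data frame via data.frame()
```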
3. Explain any two functions available in “dplyr” packages. ( Ans : filter , select, mutate, arrange, count etc)
● Data needs to be manipulated in order to make it more readable and in an organized way.
● Manipulation can be done by filtering and ordering rows, renaming and adding columns and computing summary
statistics.
● There are different ways to perform data manipulation in R, such as using Base R functions like subset(), with(), within(),
etc.; packages like data.table, ggplot2, reshape2, readr, etc.; and different machine learning algorithms.
● The dplyr package consists of many functions specifically used for data manipulation. These functions process data faster
than Base R functions and are known to be among the best for data exploration and transformation as well.
Below are two commonly used functions from the `dplyr` package:
1. `filter()`: Filtering Rows Based on Conditions
The `filter()` function is used to select rows from a data frame based on specific conditions. It helps in extracting a
subset of data that meets certain criteria. Example:
```r
library(dplyr)

# Sample data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Score = c(85, 90, 75, 88)
)

# Filter rows where Age is greater than 30
filtered_data <- filter(data, Age > 30)
print(filtered_data)
```
Output:
     Name Age Score
1 Charlie  35    75
2   David  40    88
Key Point:
`filter()` helps you select rows that meet the condition, such as `Age > 30` in this case.
2. `mutate()`: Adding or Modifying Columns
The `mutate()` function is used to add new columns to a data frame or modify existing ones by performing
calculations on existing columns.
Example:
# Adding a new column "Score_Adjusted" by subtracting 5 from the "Score" column
mutated_data <- mutate(data, Score_Adjusted = Score - 5)
print(mutated_data)
Output:
Name Age Score Score_Adjusted
1 Alice 25 85 80
2 Bob 30 90 85
3 Charlie 35 75 70
4 David 40 88 83
`mutate()` allows you to transform the data by creating new variables or
updating existing ones, making it useful for data analysis.
Both `filter()` and `mutate()` are essential functions in the `dplyr` package:
- `filter()` helps extract rows based on conditions.
- `mutate()` helps create or modify columns within a data frame.
4. List and discuss basic features of R.
1. Comprehensive Statistical Analysis: R language provides a wide array of statistical techniques, including linear and nonlinear
modeling, classical statistical tests, time-series analysis, classification, and clustering.
2. Advanced Data Visualization: With packages like ggplot2, plotly, and lattice, R excels at creating complex and aesthetically
pleasing data visualizations, including plots, graphs, and charts.
3. Extensive Packages and Libraries: The Comprehensive R Archive Network (CRAN) hosts thousands of packages that extend
R’s capabilities in areas such as machine learning, data manipulation, bioinformatics, and more.
4. Open Source and Free: R is free to download and use, making it accessible to everyone. Its open-source nature encourages
community contributions and continuous improvement.
5. Platform Independence: R is platform-independent, running on various operating systems, including Windows, macOS, and
Linux, which ensures flexibility and ease of use across different environments.
6. Integration with Other Languages: R language can integrate with other programming languages such as C, C++, Python, Java,
and SQL, allowing for seamless interaction with various data sources and computational processes.
7. Powerful Data Handling and Storage: R efficiently handles and stores data, supporting various data types and structures,
including vectors, matrices, data frames, and lists.
8. Robust Community and Support: R has a vibrant and active community that provides extensive support through forums,
mailing lists, and online resources, contributing to its rich ecosystem of packages and documentation.
9. Interactive Development Environment (IDE): RStudio, the most popular IDE for R, offers a user-friendly interface with features
like syntax highlighting, code completion, and integrated tools for plotting, history, and debugging.
10. Reproducible Research: R supports reproducible research practices with tools like R Markdown and Knitr, enabling users to
create dynamic reports, presentations, and documents that combine code, text, and visualizations.
5. What are the advantages of using functions over scripts? ( Dec 23)
- Function: A function is a reusable block of code designed to perform a specific task, often taking inputs (parameters) and
returning an output.
- Script: A script is a sequential set of instructions executed from top to bottom, without encapsulating code into reusable
functions.
Using functions offers several advantages compared to scripts (code without function encapsulation). Functions help
organize code, make it reusable, and simplify maintenance. Key advantages include:
1. Code Reusability
Functions can be reused across different parts of the program, while scripts may require code repetition, making
maintenance harder.
```r
add <- function(a, b) {
return(a + b)
}
result1 <- add(3, 5)
result2 <- add(10, 20)
```
2. Modularity
Functions break code into smaller, logical units, making it easier to read and maintain. Scripts often grow large and become
harder to navigate.
3. Simplified Debugging and Testing
Functions can be tested independently, making it easier to debug specific parts of the code, unlike scripts which require running
the entire program.
4. Encapsulation
Functions contain local variables, reducing unintended side effects, whereas scripts use global variables that can lead to conflicts.
5. Parameterization
Functions can take parameters, allowing flexibility. Scripts need manual changes for different inputs, increasing error risk.
```r
multiply <- function(x, y) {
return(x * y)
}
result1 <- multiply(2, 3)
result2 <- multiply(5, 7)
```
6. Scalability
Functions help manage growing projects by organizing code into manageable parts, whereas scripts become unwieldy as they
scale.
Conclusion
Functions provide reusability, modularity, easier debugging, and scalability, making them a better choice over scripts for structured
programming.
6. List and discuss various types of data structures in R.
► Vectors:
▪ Vector is the most basic data object in R
▪ Def: Vector is a collection of elements which is most commonly of mode character, integer, logical or numeric
▪ To create a vector with more than 1 element we must use the c() function
► Lists:
▪ A list in R is a generic object consisting of an ordered collection of objects
▪ Lists are one-dimensional and heterogeneous in nature
▪ To create a list we must use the list() function
► Arrays:
▪ Arrays can be of any number of dimensions, it takes a dim attribute and creates the required number of dimensions
▪ Vectors are given as the input and the value in the dim parameter is used to create an array
▪ To create an array we must use the array() function
► Matrices:
▪ A matrix is an R object in which the elements are arranged in a two-dimensional rectangular layout
▪ The elements are of the same atomic type
▪ To create a matrix we must use the matrix() function
► Factors:
▪ These are data objects that are used to categorize the data and store it as levels
▪ It is very useful for statistical modelling
▪ To create a factor we must use the factor() function
► Data Frames:
▪ Data frames are tabular data objects
▪ It is an object in which the columns can contain different modes of data
▪ To create a data frame we must use the data.frame() function
7. Discuss the syntax of defining a function in R.
Syntax:
function_name <- function(argument1, argument2, ...) {
# Code to be executed
# Optional: return() to return the result
}
Components:
1. function_name:
This is the name you assign to the function. It follows the same naming rules as variables (e.g., no spaces, must start with a
letter).
2. function():
The keyword `function` is used to define a function in R. Inside the parentheses, you specify the arguments (also known as
parameters) that the function will take.
3. Arguments:
These are the inputs to the function. A function can take zero, one, or more arguments, separated by commas. Each argument
can also have a default value, as shown in the example below.
Example:
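A minimal sketch of such a function, with the second argument given a default value for illustration:

```r
add_numbers <- function(a, b = 10) {
  result <- a + b
  return(result)        # explicitly return the computed sum
}

add_numbers(3, 5)   # 8
add_numbers(3)      # 13, using the default value b = 10
```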
In this example, the function `add_numbers` takes two arguments (`a` and `b`), adds them, and returns the result using the
`return()` function.
8. List and explain operators used to form data subsets in R.
1. Bracket Operator [ ] : The most versatile and commonly used operator for subsetting.
x[i] returns the i-th element of vector x.
2. Dollar Sign $ Operator : Used for accessing elements by name in lists or data frames.
For example, data$name extracts the name column from a data frame data.
3. Logical Subsetting : You can use logical vectors to subset based on conditions. Logical expressions are commonly
used to subset data frames based on column conditions.
4. Subset Function (subset()) : A function for filtering data frames based on conditions.
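A short illustration in R of each of these operators on a small, hypothetical data frame:

```r
df <- data.frame(name = c("Ann", "Ben", "Cid"), age = c(22, 35, 41))

df$name                 # dollar sign: the name column
df[1, ]                 # brackets: the first row
df[df$age > 30, ]       # logical subsetting: rows where age > 30
subset(df, age > 30)    # subset(): the same filter via the subset function
```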
9. Describe applications of data visualization. ( May 2024)