Bda PT 2
Unit V
1. Write a short note on the recommendation system.
2. Explain Collaborative and Content Based Filtering in Recommendation System
3. List and explain the different issues and challenges in data stream query processing. ( May 23)
4. What is graph store? Give an example where a graph store can be used to effectively solve a particular business problem. ( May 2024)
5. With a neat sketch, explain the architecture of the data-stream management system. ( May 23)
6. Describe collaborative filtering in a recommendation system. ( May 24)
7. Define collaborative filtering. Using an example of an e-commerce site like Flipkart or Amazon, describe how it can be used to provide recommendations to users. ( May 23)
8. How is recommendation done based on the properties of a product? Explain with the help of an example. ( Dec 2023)
9. Algorithm / Problem on Girvan-Newman
(and sums on the Girvan-Newman algorithm)
Unit VI
1. How to handle basic expressions in R? Give two examples.
2. How to create and use objects in R?
3. Explain any two functions available in “dplyr” packages. ( Ans : filter, select, mutate, arrange, count, etc.)
4. List and discuss basic features of R.
5. What are the advantages of using functions over scripts? ( Dec 23)
6. List and discuss various types of data structures in R.
7. Discuss the syntax of defining a function in R.
8. List and explain operators used to form data subsets in R.
9. Describe applications of data visualization. ( May 2024)
Flajolet-Martin (FM)
https://www.ques10.com/p/42208/suppose-a-stream-consists-of-the-integers-21615923/
https://www.youtube.com/watch?v=jgmUPdEq-U4
Girvan Newman
https://www.youtube.com/watch?v=dmMKJ1YUl-M
Solved problem : https://drive.google.com/file/d/1vDE3KDNr7m_FjTb2KypMnVGBYLvUaPAA/view?usp=sharing
Bloom Filter
https://drive.google.com/file/d/1ckQdObMnNzImA7B7r0wsnIhBmx0jaumv/view?usp=sharing
Unit IV
1. Algorithm and Problem on Bloom Filter
4. Explain the concept of a Bloom filter with an example.
A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an element is a member of a
set. For example, checking the availability of a username is a set-membership problem, where the set is the list of all registered
usernames. The price we pay for this efficiency is that the structure is probabilistic in nature, which means there may be some false
positive results. A false positive means the filter might report that a given word is already taken when it actually is not.
1. Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of
elements.
2. Adding an element never fails. However, the false positive rate increases steadily as elements are added,
until all bits in the filter are set to 1, at which point all queries yield a positive result.
3. Deleting elements from the filter is not possible, because if we delete a single element by clearing the bits at the indices
generated by the k hash functions, we might also delete a few other elements that share those bits.
Algorithm :
1. Initialize a bit array of m bits, all set to zero.
2. Create k hash functions to calculate hashes for a given input. When we insert an element x into the
Bloom filter, the bits at indices h1(x), h2(x), ..., hk(x) are set to 1. Always choose good hashing algorithms to
avoid collisions, otherwise the rate of false positives will increase and the correctness of the Bloom
filter decreases. Fast, simple, non-cryptographic hashes include MurmurHash, FNV hashes and Jenkins hashes.
Let us say, for example, we choose the word “donkey”. First we hash it using our k hash functions:
h1(“donkey”) = x
h2(“donkey”) = y
h3(“donkey”) = z
Now we set the bits at positions x, y and z. Let us say x = 1, y = 4 and z = 7.
Now, to check whether “donkey” exists in the Bloom filter, we do the process in reverse:
we hash “donkey” with our k hash functions and check whether all of those bits are set.
A false positive occurs when the bits for a word were already set by another word or a combination of
other words. In that case the bits are set even though we never inserted the word. Hence choosing good hash
functions is crucial to the accuracy of the Bloom filter.
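A minimal sketch of this process in R, with k = 2 toy hash functions (h1 and h2 below are simple illustrative helpers, not production hash functions):

```r
m <- 16                       # number of bits in the filter
bits <- rep(0L, m)            # bit array, initially all zeros

# Two toy hash functions built from character codes (illustrative helpers only)
h1 <- function(word) sum(utf8ToInt(word)) %% m + 1
h2 <- function(word) sum(utf8ToInt(word) * seq_along(utf8ToInt(word))) %% m + 1

bloom_add <- function(word) {
  bits[c(h1(word), h2(word))] <<- 1L       # set the k = 2 hashed positions
}

bloom_check <- function(word) {
  all(bits[c(h1(word), h2(word))] == 1L)   # TRUE means "possibly in the set"
}

bloom_add("donkey")
bloom_check("donkey")   # TRUE: it was inserted
bloom_check("monkey")   # usually FALSE; a TRUE here would be a false positive
```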
2. Algorithm and Problem on Flajolet-Martin (FM) Algorithm
The Flajolet-Martin algorithm is a probabilistic algorithm that is mainly used to count the number
of distinct (unique) elements in a stream or database. The basic idea behind the Flajolet-Martin algorithm is to
use a hash function to map each element in the dataset to a binary string, and to use the length of the
longest run of trailing zeros (the "tail length") observed among these binary strings as an estimator of the number of
unique elements.
Algorithm
1. Choose a hash function that maps the elements in the dataset to fixed-length binary
strings. The length of the binary string can be chosen based on the accuracy desired.
2. Apply the hash function to each data item in the dataset to get its binary string representation.
3. For each binary string, count the number of trailing zeros, i.e., the position of the rightmost set bit (this is the
tail length of that element).
4. Keep track of R, the maximum tail length observed over the whole stream.
5. Estimate the number of distinct elements in the dataset as 2 to the power of R, the maximum tail length
calculated in the previous steps.
The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of
hash functions it uses. Generally, increasing the length of the binary strings or using more hash functions (and
combining their estimates) increases the algorithm's accuracy.
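A small sketch of the estimate in R on a toy stream, using a simple illustrative hash function (not one you would use in practice):

```r
# Toy hash: map an element to a small integer, then inspect its binary tail
toy_hash <- function(x, m = 32) (3 * x + 7) %% m      # illustrative hash only

# Tail length = number of trailing zeros in the binary representation
tail_length <- function(h) {
  if (h == 0) return(0)                 # treat 0 as tail length 0 for simplicity
  r <- 0
  while (h %% 2 == 0) { h <- h %/% 2; r <- r + 1 }
  r
}

stream <- c(4, 7, 9, 4, 7, 1, 13, 9, 4)   # example stream with 5 distinct values
R_max  <- max(sapply(stream, function(x) tail_length(toy_hash(x))))
estimate <- 2^R_max                        # FM estimate of distinct elements
estimate                                   # 4 here; the true number of distinct values is 5
```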
3. Algorithm and Problem on DGIM Algorithm
5. Explain the DGIM algorithm for counting ones in a stream with an example. (May 24)
7. List down all 6 constraints that must be satisfied for representing a stream by buckets using DGIM algorithm
with an example. ( Dec 23)
Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the
form “how many 1’s are there in the last k bits?” for any k≤ N.
The basic version of the algorithm uses O(log² N) bits to represent a window of N bits, and allows us to estimate the
number of 1’s in the window with an error of no more than 50%.
To begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has timestamp 1, the
second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent timestamps modulo N,
so they can be represented by log₂ N bits. If we also store the total number of bits ever seen in the stream (i.e., the
most recent timestamp) modulo N, then we can determine from a timestamp modulo N where in the current window
the bit with that timestamp is.
We divide the window into buckets, each consisting of:
1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the number of 1’s as the
size of the bucket.
To represent a bucket, we need log₂ N bits to represent the timestamp (modulo N) of its right end. To represent the
number of 1’s we only need log₂ log₂ N bits. The reason is that we know this number i is a power of 2, say 2^j, so we
can represent i by coding j in binary. Since j is at most log₂ N, it requires log₂ log₂ N bits. Thus, O(log N) bits suffice to
represent a bucket. There are six rules that must be followed when representing a stream by buckets. (Q7)
● The right end of a bucket is always a position with a 1.
● Every position with a 1 is in some bucket.
● No position is in more than one bucket.
● There are one or two buckets of any given size, up to some maximum size.
● All sizes must be a power of 2.
● Buckets cannot decrease in size as we move to the left (back in time).
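A compact sketch in R of this bucket bookkeeping (a simplified illustration that keeps at most two buckets of each size; dgim_add and dgim_count are hypothetical helper names):

```r
# Each bucket is a (timestamp of its right end, size = number of 1s, a power of 2)
N <- 16                  # window length
buckets <- list()        # list of buckets, newest first
now <- 0                 # current timestamp

dgim_add <- function(bit) {
  now <<- now + 1
  # drop any bucket whose right end has fallen out of the window
  buckets <<- Filter(function(b) b$ts > now - N, buckets)
  if (bit == 1) {
    buckets <<- c(list(list(ts = now, size = 1)), buckets)
    # whenever three buckets share the same size, merge the two oldest of them
    repeat {
      sizes <- sapply(buckets, function(b) b$size)
      dup <- as.integer(names(which(table(sizes) >= 3))[1])
      if (is.na(dup)) break
      idx  <- which(sizes == dup)
      old2 <- tail(idx, 2)                        # the two oldest buckets of that size
      merged <- list(ts = buckets[[old2[1]]]$ts,  # keep the newer of the two timestamps
                     size = 2 * dup)
      buckets <<- append(buckets[-old2], list(merged), after = old2[1] - 1)
    }
  }
}

dgim_count <- function(k) {
  # estimate the number of 1s among the last k bits
  inside <- Filter(function(b) b$ts > now - k, buckets)
  if (length(inside) == 0) return(0)
  sizes <- sapply(inside, function(b) b$size)
  sum(sizes) - max(sizes) / 2     # count only half of the oldest qualifying bucket
}

for (b in c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1)) dgim_add(b)
dgim_count(6)    # estimate is 3 here; the true count of 1s in the last 6 bits is 4
```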
Unit V
1. Write a short note on the recommendation system.
A recommendation system (recommender system) predicts and filters the items a particular user is likely to prefer, based on the
user's own past behaviour and/or the behaviour of similar users; the two main approaches are collaborative filtering and
content-based filtering.
In Collaborative Filtering, we tend to find similar users and recommend what similar users like. In this type of
recommendation system, we don’t use the features of the item to recommend it; rather, we classify the users into
clusters of similar types and recommend items to each user according to the preferences of its cluster.
There are basically four types of collaborative filtering algorithms: memory-based, model-based, hybrid, and deep-learning-based.
One of the main advantages of these recommender systems is that they are highly efficient at providing
personalized content and are also able to adapt to changing user preferences.
Cosine Similarity : A larger cosine implies a smaller angle between two users, hence more similar
interests. We can compute the cosine distance between two users' rows in the utility matrix, giving the value zero
to all the unfilled entries to make the calculation easy. If the cosine is smaller, the distance
between the users is larger; if the cosine is larger, the angle between the users is small and we can recommend
them similar things.
Normalizing Ratings : In the process of normalizing, we take the average rating of a user and subtract it from each of that
user's ratings, so we get either positive or negative values as ratings, which can then be used to classify users further into similar
groups. By normalizing the data we can form clusters of users that give similar ratings to similar items, and then
we can use these clusters to recommend items to the users.
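A small sketch in R of both steps on a toy utility matrix (the ratings are hypothetical; unfilled entries are treated as zero in the cosine computation):

```r
# Toy utility matrix: rows = users, columns = items, NA = not rated
ratings <- matrix(c(5, 4, NA, 1,
                    4, 5, 1, NA,
                    1, NA, 5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("U1", "U2", "U3"), c("I1", "I2", "I3", "I4")))

# Normalize: subtract each user's mean rating from that user's ratings
normalized <- ratings - rowMeans(ratings, na.rm = TRUE)

# Cosine similarity between two users, with unfilled entries set to 0
cosine_sim <- function(a, b) {
  a[is.na(a)] <- 0; b[is.na(b)] <- 0
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

cosine_sim(normalized["U1", ], normalized["U2", ])  # larger cosine: similar tastes
cosine_sim(normalized["U1", ], normalized["U3", ])  # smaller (negative) cosine: dissimilar tastes
```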
A Content-Based Recommender works on the data that we take from the user, either explicitly (ratings) or implicitly
(clicking on a link). From this data we create a user profile, which is then used to make suggestions to the user; as the user provides
more input or takes more actions on the recommendations, the engine becomes more accurate.
User Profile : In the user profile, we create vectors that describe the user’s preferences. In the creation of a user profile, we
use the utility matrix, which describes the relationship between users and items. With this information, the best estimate we
can make regarding which item the user likes is some aggregation of the profiles of those items.
Item Profile : In Content-Based Recommender, we must build a profile for each item, which will represent the important
characteristics of that item.
Utility Matrix : The utility matrix signifies the user’s preference for certain items. In the data gathered from the user, we
have to find some relation between the items which are liked by the user and those which are disliked; for this purpose
we use the utility matrix. In it, we assign a particular value to each user-item pair; this value is known as the degree of
preference. Then we draw a matrix of users against the respective items to identify their preference relationships.
1. We can use the cosine distance between the vectors of the item and the user to determine its preference to the
user.
2. We can also use a classification approach in recommendation systems, for example a decision tree for
deciding whether a user wants to watch a movie or not, where at each level we apply a certain condition to
refine our recommendation.
7. Define collaborative filtering. Using an example of an e-commerce site like Flipkart or Amazon, describe how it can
be used to provide recommendations to users
Collaborative filtering is a popular recommendation technique used by e-commerce platforms like Flipkart or Amazon to
provide personalized product recommendations to users. It leverages the preferences, behaviors, or actions of users to
predict what a particular user might like, based on the assumption that users with similar preferences will like similar
items.
1. User Based : In this approach, the system looks for users whose shopping patterns are similar to those of a given
user, say Alice. If another user, Bob, has a purchasing history similar to Alice's and bought a product that Alice hasn’t seen yet,
the system can recommend that product to Alice. For example, Alice has bought a smartphone, a laptop, and headphones.
Bob has also bought a smartphone and a laptop, but he has additionally purchased a smartwatch. Since Alice and Bob share
similar tastes (smartphone, laptop), the system assumes Alice might also like the smartwatch, and it recommends a smartwatch
to Alice.
2. Item Based : In this approach, the system looks for items similar to the ones Alice has interacted with. It finds
products that are often bought together with the items Alice likes and recommends those to her. Alice has bought
a laptop. Many other users who bought the same laptop have also purchased a laptop stand or mouse. The system
will recommend these related items (laptop stand, mouse) to Alice, assuming she might be interested in them
since other buyers of the same laptop have purchased these items.
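A toy sketch of the item-based idea in R on a hypothetical binary purchase matrix: items most often co-purchased with the products Alice already owns are ranked first.

```r
# Rows = users, columns = products, 1 = purchased (hypothetical data)
purchases <- matrix(c(1, 1, 1, 0, 0,   # Alice
                      1, 1, 0, 1, 0,   # Bob
                      1, 0, 0, 1, 1,   # Carol
                      0, 1, 0, 1, 0),  # Dave
                    nrow = 4, byrow = TRUE,
                    dimnames = list(c("Alice", "Bob", "Carol", "Dave"),
                                    c("phone", "laptop", "headphones", "smartwatch", "stand")))

# Co-purchase counts between items (how often two items are bought together)
co <- t(purchases) %*% purchases
diag(co) <- 0

# Score unseen items for Alice by how often they co-occur with items she owns
owned  <- purchases["Alice", ] == 1
scores <- colSums(co[owned, , drop = FALSE])
sort(scores[!owned], decreasing = TRUE)   # smartwatch ranks first for Alice
```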
3. List and explain the different issues and challenges in data stream query processing.
1. High Throughput and Low Latency : Data streams typically involve high-speed, continuous data generation
(e.g., from IoT sensors, social media feeds). Queries on such streams require rapid processing to maintain
real-time or near-real-time responses. Designing systems capable of handling high volumes of data
(throughput) with minimal delay (latency) to ensure timely query results.
2. Unbounded and Continuous Nature of Data : Unlike traditional databases, data streams are unbounded and
continuously generate new data, making it impossible to store and query all the data at once. Queries must
operate on recent or relevant subsets of data, requiring windowing techniques (time-based, tuple-based, or
count-based windows) to handle infinite streams effectively.
3. Memory and Resource Constraints : Since data streams are continuous and potentially infinite, processing
systems must handle queries without relying on storing the entire dataset. Efficient use of memory and
computational resources is crucial. This often necessitates approximate algorithms, sliding windows, and
summarization techniques to manage resource consumption.
4. Imprecise and Incomplete Data : Streams often contain noisy, incomplete, or imprecise data, as seen in sensor
readings, social media, or financial transactions. Query systems need to handle uncertainty and missing values,
which may require techniques for noise reduction, filtering, or interpolation.
5. Approximate Query Processing : Exact answers are sometimes impractical or unnecessary in streaming
environments due to time and resource constraints. Designing approximate query processing (AQP) algorithms,
such as sampling, sketching, or hashing, to provide fast, near-accurate answers with controlled error margins.
6. Out-of-Order and Late Data : Data can arrive out of order or be delayed, especially in distributed environments
where network latency or failures affect data delivery. Systems must cope with such scenarios, using techniques like
watermarking and reordering to ensure consistent and correct query results.
7. Distributed and Parallel Processing : Large-scale data streams may be processed in distributed or cloud
environments to meet scalability and fault-tolerance needs. Ensuring efficient data distribution, synchronization, and
fault tolerance in distributed environments while minimizing communication overhead and ensuring consistency.
8. Fault Tolerance and Availability : Data stream systems need to operate continuously, even in the face of failures.
Maintaining high availability, recovering from failures without data loss, and ensuring reliable processing across
distributed systems are crucial.
9. Query Optimization : Continuous queries need to be optimized for efficient execution in environments where both
data streams and queries are dynamic. Optimizing query execution plans, choosing the best query processing
strategies, and adapting to changes in the data or query characteristics are key aspects of query optimization in stream
processing.
10. Security and Privacy : Sensitive data may flow through streams, raising privacy and security concerns, particularly
when data is processed across multiple domains or third parties. Ensuring data security and privacy, such as
encrypting data streams and enforcing access control, without introducing significant latency.
4. What is graph store? Give an example where a graph store can be used to effectively solve a particular
business problem.
A graph store is a type of database designed specifically to store, manage, and query data that is best represented as
a graph. In graph databases, entities (nodes) and the relationships (edges) between them are stored explicitly, making
it easy to represent and query complex networks of information.
Graph databases are highly effective when the relationships between entities are as important as the entities
themselves. These databases allow for fast and efficient traversals across nodes based on relationships, making them
ideal for applications like social networks, recommendation engines, fraud detection, and more.
1. It solves many-to-many relationship problems : relationships such as "friends of friends" are many-to-many, and a
graph store is the natural choice when the equivalent query in a relational database would be very complex.
2. When relationships between data elements are more important than the elements themselves : a graph database
may hold many user data elements, but it is the relationships between those elements that determine how the
data is stored and queried.
3. Low latency with large-scale data : when you add lots of relationships in a relational database, the data sets
become huge, the queries become more complex and they take much longer than usual. A graph database, however,
is designed specifically for this purpose, so relationships can be queried with ease.
Graph databases can easily handle frequent schema changes, large volumes of data and real-time query response
times, and more intelligent data-activation requirements are also served well by the graph model.
Example : In fraud detection for a bank or an e-commerce company, accounts, devices, phone numbers and addresses can be
stored as nodes, with edges connecting accounts that share these attributes; traversing those relationships quickly exposes
rings of accounts sharing the same device or address, something that is very hard to express efficiently with relational joins.
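A minimal sketch of the fraud-ring example in R, assuming the igraph package is installed (the accounts and shared attributes are hypothetical):

```r
library(igraph)

# Edges: account -- shared attribute (address / device)
edges <- data.frame(
  from = c("acct1", "acct2", "acct3", "acct4", "acct5"),
  to   = c("addr_9_Elm_St", "addr_9_Elm_St", "device_X", "device_X", "addr_4_Oak_Ave")
)

g <- graph_from_data_frame(edges, directed = FALSE)

# Accounts connected through shared attributes fall into the same component
comp <- components(g)
split(names(comp$membership), comp$membership)   # groups of nodes forming candidate fraud rings
```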
5. With a neat sketch, explain the architecture of the data-stream management system. ( May 23)
DSMS stands for Data Stream Management System. It is a software system, analogous to a DBMS (database
management system), but it involves the processing and management of continuously flowing data streams rather than
static data.
1. Data Source Layer : The first layer of a DSMS is the data source layer, which comprises all the data sources,
including sensors, social media feeds, financial/stock market feeds, etc. Capturing and parsing of the data
streams happens in this layer; it is basically the collection layer that collects the data.
2. Data Ingestion Layer : This layer is a bridge between the data source layer and the processing layer. Its main purpose
is to handle the flow of data, i.e., data flow control, data buffering and data routing.
3. Processing Layer : This layer is considered the heart of the DSMS architecture; it is the functional layer of DSMS applications
and processes the data streams in real time. To perform the processing it uses stream processing engines such as Apache Flink or
Apache Storm. The main functions of this layer are to filter, transform, aggregate and enrich the data streams,
in order to derive insights and detect patterns.
4. Storage Layer : Once the data is processed, the results need to be stored in some storage unit. The storage layer
consists of various stores such as NoSQL databases, distributed databases, etc. It helps to ensure data durability and
the availability of data in case of system failure.
5. Querying Layer : It supports two types of queries: ad hoc queries and standing (continuous) queries. This layer provides the tools
that can be used for querying and analyzing the stored data streams, including SQL-like query languages and
programming APIs.
6. Visualization and Reporting Layer : This layer provides tools for visualization such as charts, pie charts,
histograms, etc. On the basis of these visual representations it also helps to generate reports for analysis.
7. Integration Layer : This layer is responsible for integrating the DSMS application with traditional systems, business
intelligence tools, data warehouses, ML applications and NLP applications. It helps to enhance applications that are
already running.
6. Describe collaborative filtering in a recommendation system. ( May 24)
1. User-based recommendation: Here we calculate Pearson’s similarity measure, which is needed to determine
the most closely related users, i.e., those whose likes and dislikes follow the same pattern. The computational operations are
based on the formula of Pearson similarity: the ratings of two different users are mean-centred (each user's mean rating
is subtracted) and multiplied together in the numerator, while in the denominator the mean-centred ratings of each user are
squared and summed. After getting the summation values, the numerator is divided by the square root of the product of the
two sums to obtain the similarity measure.
Example : Alice likes movies X, Y, and Z. Bob likes movies Y and Z. Since Alice and Bob have similar preferences,
the system might recommend movie X to Bob, assuming he hasn't watched it yet.
2. Item-based recommendation: The initial aim is to obtain the mean-adjusted matrix. The mean-adjusted matrix is
used in predicting the rating a new user would give an item, and is based on reducing the bias introduced by the
users, as some tend to give very high ratings most of the time and some tend to give very low ratings most of
the time. To reduce this inconsistency, we subtract each user's mean rating from that user's ratings. The next step is
the calculation of the similarity measure between the items, for which we can use cosine
similarity. The computational operations are based on the formula of cosine similarity: the (mean-adjusted) ratings of
different users on two items are multiplied and summed in the numerator, while in the denominator the ratings of each item
are squared, summed and square-rooted. Dividing the numerator by the product of these values gives the similarity measure.
Example : Alice watches movie X and gives it a high rating. The system checks that other users who liked movie
X also liked movie Y. Based on this, the system recommends movie Y to Alice.
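A short sketch of both similarity measures in R on a toy rating matrix (hypothetical ratings, base R only):

```r
# Toy rating matrix: rows = users, columns = movies
ratings <- matrix(c(5, 4, 2, 1,
                    4, 5, 1, 2,
                    2, 1, 5, 4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Alice", "Bob", "Carol"), c("X", "Y", "Z", "W")))

# User-based: Pearson similarity between two users' rating vectors
pearson <- function(u, v) {
  du <- u - mean(u); dv <- v - mean(v)
  sum(du * dv) / (sqrt(sum(du^2)) * sqrt(sum(dv^2)))
}
pearson(ratings["Alice", ], ratings["Bob", ])    # about 0.8: similar taste
pearson(ratings["Alice", ], ratings["Carol", ])  # about -0.8: opposite taste

# Item-based: mean-adjust by user, then cosine similarity between item columns
adj <- ratings - rowMeans(ratings)               # subtract each user's mean rating
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(adj[, "X"], adj[, "Y"])                   # positive: X and Y are rated similarly
```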
9. Algorithm / Problem on Girvan-Newman
The Girvan-Newman algorithm for the detection and analysis of community structure relies on the iterative elimination of
edges that have the highest number of shortest paths between nodes passing through them. By removing edges from
the graph one-by-one, the network breaks down into smaller pieces, so-called communities (more formally, components
of a graph).
The idea is to find which edges in a network lie most frequently on the shortest paths between other pairs of nodes, by computing
the edge betweenness centrality. The edges joining communities are expected to have a high edge betweenness. The
underlying community structure of the network becomes much more fine-grained once the edges with the highest
betweenness are eliminated, which means that the communities become much easier to spot.
Betweenness centrality measures the extent to which a vertex or edge lies on paths between vertices. Vertices and
edges with high betweenness may have considerable influence within a network by virtue of their control over
information passing between others. The calculation of betweenness centrality is not standardized and there are many
ways to solve it. It is defined as the number of shortest paths in the graph that pass through the node or edge divided by
the total number of shortest paths.
Algorithm:
1. Create a graph of N nodes and its edges, or take an inbuilt graph such as a barbell graph.
2. Calculate the betweenness of all existing edges in the graph.
3. Remove the edge(s) with the highest betweenness.
4. Recalculate the betweenness of all the edges affected by the removal.
5. Repeat steps 3 and 4 until no edges remain (or until the desired number of communities is obtained).
In a typical worked example with two communities joined by a single edge between nodes C and D, the
Girvan-Newman algorithm would first remove the edge between C and D because it has the highest edge betweenness;
intuitively, this is because that edge is located between the two communities. After removing an edge, the betweenness
centrality has to be recalculated for every remaining edge; the process continues until, for example, every remaining edge
has the same betweenness centrality.
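A brief sketch in R, assuming the igraph package is available: it runs the built-in edge-betweenness community detection on a small two-community graph like the one described above, and also shows one manual removal step.

```r
library(igraph)

# Two small triangles joined by a single bridge edge C-D (hypothetical graph)
g <- graph_from_literal(A - B, B - C, A - C,   # first community
                        D - E, E - F, D - F,   # second community
                        C - D)                 # bridge edge between the communities

# Built-in Girvan-Newman style community detection
communities <- cluster_edge_betweenness(g)
membership(communities)

# One manual iteration: remove the edge with the highest edge betweenness
eb <- edge_betweenness(g)
g2 <- delete_edges(g, which.max(eb))   # removes the bridge C-D
components(g2)$no                      # the graph now splits into 2 components (the two communities)
```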
Unit VI
1. How to handle basic expressions in R? Give two examples.
Expressions in R can be created using the `expression()` function, which generates objects of the class `expression`. These
expressions can be math expressions, basic formatting expressions, or font face expressions.
R provides functions to work with regular expressions to find patterns in text. Regular expressions allow searching and
modifying strings based on specific patterns.
Basic expressions in R allow formatting text or performing mathematical operations, while regular expressions enable pattern
matching and string replacement. These tools are widely used for visualization and data manipulation in R.
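Two small examples in R matching the description above (the values are illustrative):

```r
# 1. A math expression object, typically used for plot annotations
e <- expression(x^2 + 2 * x + 1)
eval(e, list(x = 3))                                 # 16
plot(1:10, (1:10)^2, main = expression(y == x^2))    # formatted math in the plot title

# 2. Regular expressions: find and modify strings based on a pattern
words <- c("data1", "stream22", "graph")
grepl("[0-9]+", words)      # TRUE TRUE FALSE - which strings contain digits
gsub("[0-9]+", "", words)   # "data" "stream" "graph" - strip the digits
```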
2. How to create and use objects in R?
► Instances of classes are known as objects
► Variables of other languages are referred to as objects in R.
► There are multiple objects in R:
i. Vectors,
ii. Lists,
iii. Arrays,
iv. Matrices,
v. Data frame, and
vi. Factors.
You can create an object using the "<-" or an equals "=" sign.
Both of these are used to assign values to objects. There are a few rules to
keep in mind while naming an object in R. They are:
▪ A name must start with a letter or a dot (and a leading dot must not be followed by a digit).
▪ A name may contain letters, digits, dots and underscores, but no spaces or other special characters.
▪ Reserved words (if, else, for, TRUE, NULL, etc.) cannot be used as names, and names are case-sensitive.
► Vectors:
▪ Vector is the most basic data object in R
▪ Def: Vector is a collection of elements which is most commonly of mode character, integer, logical or numeric
▪ To create a vector with more than 1 element we must use the c() function
► Lists:
▪ A list in R is a generic object consisting of an ordered collection of objects
▪ Lists are one-dimensional and heterogeneous in nature
▪ To create a list we must use the list() function
► Arrays:
▪ Arrays can be of any number of dimensions, it takes a dim attribute and creates the required number of dimensions
▪ Vectors are given as the input and the value in the dim parameter is used to create an array
▪ To create an array we must use the array() function
► Matrices:
▪ A matrix is an R object in which the elements are arranged in a two-dimensional rectangular layout
▪ The elements are of the same atomic type
▪ To create a matrix we must use the matrix() function
► Factors:
▪ These are data objects that are used to categorize the data and store it as levels
▪ It is very useful for statistical modelling
▪ To create a factor we must use the factor() function
► Data Frames:
▪ Data frames are tabular data objects
▪ It is an object in which the columns can contain different modes of data
▪ To create a data frame we must use the data.frame() function
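A quick illustration in R creating one object of each type listed above:

```r
v <- c(10, 20, 30)                       # vector via c()
l <- list(name = "Alice", scores = v)    # list via list()
a <- array(1:24, dim = c(2, 3, 4))       # 3-dimensional array via array()
m <- matrix(1:6, nrow = 2)               # 2 x 3 matrix via matrix()
f <- factor(c("low", "high", "low"))     # factor with levels "high", "low"
df <- data.frame(Name = c("A", "B"), Age = c(25, 30))   # data frame via data.frame()
```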
3. Explain any two functions available in “dplyr” packages. ( Ans : filter , select, mutate, arrange, count etc)
● Data needs to be manipulated in order to make it more readable and in an organized way.
● Manipulation can be done by filtering and ordering rows, renaming and adding columns and computing summary
statistics.
● There are different ways to perform data manipulation in R, such as using Base R functions like subset(), with(), within(),
etc.; packages like data.table, ggplot2, reshape2, readr, etc.; and different machine learning algorithms.
● The dplyr package consists of many functions specifically used for data manipulation. These functions process data faster
than Base R functions and are known to be among the best for data exploration and transformation as well.
Below are two commonly used functions from the `dplyr` package:
1. `filter()`: Filtering Rows Based on Conditions
The `filter()` function is used to select rows from a data frame based on specific conditions. It helps in extracting a
subset of data that meets certain criteria. Example:
```r
library(dplyr)

# Sample data frame
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(25, 30, 35, 40),
  Score = c(85, 90, 75, 88)
)

# Filter rows where Age is greater than 30
filtered_data <- filter(data, Age > 30)
print(filtered_data)
```
Output:
     Name Age Score
1 Charlie  35    75
2   David  40    88
Key Point:
`filter()` helps you select rows that meet the condition, such as `Age > 30` in this case.
2. `mutate()`: Adding or Modifying Columns
The `mutate()` function is used to add new columns to a data frame or modify existing ones by performing
calculations on existing columns.
Example:
# Adding a new column "Score_Adjusted" by subtracting 5 from the "Score" column
mutated_data <- mutate(data, Score_Adjusted = Score - 5)
print(mutated_data)
Output:
Name Age Score Score_Adjusted
1 Alice 25 85 80
2 Bob 30 90 85
3 Charlie 35 75 70
4 David 40 88 83
`mutate()` allows you to transform the data by creating new variables or
updating existing ones, making it useful for data analysis.
Both `filter()` and `mutate()` are essential functions in the `dplyr` package:
- `filter()` helps extract rows based on conditions.
- `mutate()` helps create or modify columns within a data frame.
4. List and discuss basic features of R.
1. Comprehensive Statistical Analysis: R language provides a wide array of statistical techniques, including linear and nonlinear
modeling, classical statistical tests, time-series analysis, classification, and clustering.
2. Advanced Data Visualization: With packages like ggplot2, plotly, and lattice, R excels at creating complex and aesthetically
pleasing data visualizations, including plots, graphs, and charts.
3. Extensive Packages and Libraries: The Comprehensive R Archive Network (CRAN) hosts thousands of packages that extend
R’s capabilities in areas such as machine learning, data manipulation, bioinformatics, and more.
4. Open Source and Free: R is free to download and use, making it accessible to everyone. Its open-source nature encourages
community contributions and continuous improvement.
5. Platform Independence: R is platform-independent, running on various operating systems, including Windows, macOS, and
Linux, which ensures flexibility and ease of use across different environments.
6. Integration with Other Languages: R language can integrate with other programming languages such as C, C++, Python, Java,
and SQL, allowing for seamless interaction with various data sources and computational processes.
7. Powerful Data Handling and Storage: R efficiently handles and stores data, supporting various data types and structures,
including vectors, matrices, data frames, and lists.
8. Robust Community and Support: R has a vibrant and active community that provides extensive support through forums,
mailing lists, and online resources, contributing to its rich ecosystem of packages and documentation.
9. Interactive Development Environment (IDE): RStudio, the most popular IDE for R, offers a user-friendly interface with features
like syntax highlighting, code completion, and integrated tools for plotting, history, and debugging.
10. Reproducible Research: R supports reproducible research practices with tools like R Markdown and Knitr, enabling users to
create dynamic reports, presentations, and documents that combine code, text, and visualizations.
5. What are the advantages of using functions over scripts? ( Dec 23)
- Function: A function is a reusable block of code designed to perform a specific task, often taking inputs (parameters) and
returning an output.
- Script: A script is a sequential set of instructions executed from top to bottom, without encapsulating code into reusable
functions.
Using functions offers several advantages compared to scripts (code without function encapsulation). Functions help
organize code, make it reusable, and simplify maintenance. Key advantages include:
1. Code Reusability
Functions can be reused across different parts of the program, while scripts may require code repetition, making
maintenance harder.
```r
add <- function(a, b) {
return(a + b)
}
result1 <- add(3, 5)
result2 <- add(10, 20)
```
2. Modularity
Functions break code into smaller, logical units, making it easier to read and maintain. Scripts often grow large and become
harder to navigate.
3. Simplified Debugging and Testing
Functions can be tested independently, making it easier to debug specific parts of the code, unlike scripts which require running
the entire program.
4. Encapsulation
Functions contain local variables, reducing unintended side effects, whereas scripts use global variables that can lead to conflicts.
5. Parameterization
Functions can take parameters, allowing flexibility. Scripts need manual changes for different inputs, increasing error risk.
```r
multiply <- function(x, y) {
return(x * y)
}
result1 <- multiply(2, 3)
result2 <- multiply(5, 7)
```
6. Scalability
Functions help manage growing projects by organizing code into manageable parts, whereas scripts become unwieldy as they
scale.
Conclusion
Functions provide reusability, modularity, easier debugging, and scalability, making them a better choice over scripts for structured
programming.
6. List and discuss various types of data structures in R.
► Vectors:
▪ Vector is the most basic data object in R
▪ Def: Vector is a collection of elements which is most commonly of mode character, integer, logical or numeric
▪ To create a vector with more than 1 element we must use the c() function
► Lists:
▪ A list in R is a generic object consisting of an ordered collection of objects
▪ Lists are one-dimensional and heterogeneous in nature
▪ To create a list we must use the list() function
► Arrays:
▪ Arrays can be of any number of dimensions, it takes a dim attribute and creates the required number of dimensions
▪ Vectors are given as the input and the value in the dim parameter is used to create an array
▪ To create an array we must use the array() function
► Matrices:
▪ A matrix is an R object in which the elements are arranged in a two-dimensional rectangular layout
▪ The elements are of the same atomic type
▪ To create a matrix we must use the matrix() function
► Factors:
▪ These are data objects that are used to categorize the data and store it as levels
▪ It is very useful for statistical modelling
▪ To create a factor we must use the factor() function
► Data Frames:
▪ Data frames are tabular data objects
▪ It is an object in which the columns can contain different modes of data
▪ To create a data frame we must use the data.frame() function
7. Discuss the syntax of defining a function in R.
Syntax:
function_name <- function(argument1, argument2, ...) {
# Code to be executed
# Optional: return() to return the result
}
Components:
1. function_name:
This is the name you assign to the function. It follows the same naming rules as variables (e.g., no spaces, must start with a
letter).
2. function():
The keyword `function` is used to define a function in R. Inside the parentheses, you specify the arguments (also known as
parameters) that the function will take.
3. Arguments:
These are the inputs to the function. A function can take zero, one, or more arguments, separated by commas. Each argument
can also have a default value, as shown in the example below.
Example:
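A minimal sketch of such a function, with the second argument given a default value for illustration:

```r
add_numbers <- function(a, b = 10) {
  result <- a + b
  return(result)        # explicitly return the computed sum
}

add_numbers(3, 5)   # 8
add_numbers(3)      # 13, using the default value b = 10
```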
In this example, the function `add_numbers` takes two arguments (`a` and `b`), adds them, and returns the result using the
`return()` function.
8. List and explain operators used to form data subsets in R.
1. Bracket Operator [ ] : The most versatile and commonly used operator for subsetting.
x[i] returns the i-th element of vector x.
2. Dollar Sign $ Operator : Used for accessing elements by name in lists or data frames.
For example, data$name extracts the name column from a data frame data.
3. Logical Subsetting : You can use logical vectors to subset based on conditions. Logical expressions are commonly
used to subset data frames based on column conditions.
4. Subset Function (subset()) : A function for filtering data frames based on conditions.
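A short illustration in R of each of these operators on a small, hypothetical data frame:

```r
df <- data.frame(name = c("Ann", "Ben", "Cid"), age = c(22, 35, 41))

df$name                 # dollar sign: the name column
df[1, ]                 # brackets: the first row
df[df$age > 30, ]       # logical subsetting: rows where age > 30
subset(df, age > 30)    # subset(): the same filter via the subset function
```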
9. Describe applications of data visualization. ( May 2024)