DM passing package
2. Define Prediction.
Ans: PREDICTION:
Prediction is used to find a numerical output. The training dataset contains the
inputs and their numerical output values. From the training dataset, the algorithm
builds a model or predictor. When fresh data is provided, the model should produce
a numerical output. Unlike classification, this approach does not use a class
label; the model predicts a continuous-valued function or ordered value.
Example: predicting the worth of a home based on facts such as the number of
rooms, total area, and so on.
3. Define Regression.
Ans: REGRESSION IN DATA MINING:
Regression refers to a data mining technique that is used to predict numeric
values in a given data set. Regression involves fitting a straight line or a
curve to numerous data points.
For example, regression might be used to predict the cost of a product or service,
or other variables. It is also used in various industries for business and marketing
behavior, trend analysis, and financial forecasting.
Regression is divided into five different types:
1. Linear Regression
2. Logistic Regression
3. Lasso Regression
4. Ridge Regression
5. Polynomial Regression
4. What do you mean by outliers?
Ans: Outliers are sample points with values much different from those of the
remaining set of data. Outliers may represent errors in the data, or they could be
correct data values that are simply very different from the remaining data. For
example, a person who is 2.5 meters tall is much taller than most people; in
analyzing the heights of individuals, this value would probably be viewed as an
outlier. Some clustering techniques do not perform well in the presence of
outliers.
5. What is Decision Tree?
Ans: A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where each
internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node
represents the final decision or prediction.
PART-B
II. Answer any Four questions, each carries Five marks. ( 4 x 5 = 20 )
7) What are the differences between Data Mining and Knowledge Discovery in Databases (KDD)?
Ans: DATA MINING VS KDD.
Basic Definition:
• Data Mining: the process of identifying patterns and extracting details about big data sets using intelligent methods.
• KDD: a complex and iterative approach to knowledge extraction from big data.
Example:
• Data Mining: clustering groups of data elements based on how similar they are.
• KDD: data analysis to find patterns and links.
8) What are the various issues associated with Data Mining?
Ans: FACTORS THAT CREATE SOME ISSUES.
PERFORMANCE ISSUES:
Applications
• Simple algorithm: it uses only the value of K (an odd number) and the distance function
(typically Euclidean).
• An efficient method for small datasets.
• Uses "lazy learning": the training dataset is stored and consulted only when a prediction is
made, which makes training faster than for Support Vector Machines (SVMs) and Linear
Regression.
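As a rough illustration of these points, here is a minimal k-nearest-neighbour sketch in plain Python (the training points, labels, and query are invented for illustration; only the standard library is used):

    import math
    from collections import Counter

    def knn_predict(train, query, k=3):
        """Classify `query` by majority vote among its k nearest training points."""
        # "Lazy learning": all stored examples are examined only at prediction time.
        by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
        top_k = [label for _, label in by_distance[:k]]
        return Counter(top_k).most_common(1)[0][0]

    # Hypothetical 2-D points labelled 'A' or 'B'
    train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'), ((4.0, 4.2), 'B'), ((3.8, 4.0), 'B')]
    print(knn_predict(train, (1.1, 1.0), k=3))   # expected: 'A'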
10) Describe in detail one of the Decision Tree Algorithms and give examples.
Ans: Decision tree algorithm:
1. Begin with the entire dataset as the root node of the decision tree.
2. Determine the best attribute on which to split the dataset, based on a given
criterion.
3. Create a new internal node that corresponds to the best attribute and
connect it to the root node.
4. Partition the dataset into subsets based on the values of the best
attribute.
5. Recursively repeat steps 1-4 for each subset until all instances in a
given subset belong to the same class or no further splitting is
possible.
6. Assign a leaf node to each subset that contains instances that belong to
the same class.
7. Make predictions based on the decision tree by traversing it from the
root node to a leaf node that corresponds to the instance being
classified.
The following decision tree is for the concept buys_computer; it indicates
whether a customer at a company is likely to buy a computer or not. Each internal
node represents a test on an attribute, and each leaf node represents a class.
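A minimal sketch of steps 1-7 using scikit-learn, assuming it is installed; the tiny buys_computer-style dataset below is invented for illustration:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical training tuples: [age_code, income_code, student, credit_ok]
    X = [[0, 2, 0, 0], [0, 2, 0, 1], [1, 2, 0, 0], [2, 1, 0, 0],
         [2, 0, 1, 0], [2, 0, 1, 1], [1, 0, 1, 1], [0, 1, 0, 0]]
    y = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no']   # buys_computer labels

    tree = DecisionTreeClassifier(criterion='entropy')  # best attribute chosen by information gain
    tree.fit(X, y)                                      # steps 1-6: recursive splitting
    print(export_text(tree, feature_names=['age', 'income', 'student', 'credit']))
    print(tree.predict([[1, 1, 1, 0]]))                 # step 7: traverse root -> leaf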
11) Explain Hierarchical clustering in detail.
Ans: HIERARCHICAL ALGORITHMS
As mentioned earlier, hierarchical clustering algorithms actually create sets of
clusters. Hierarchical algorithms differ in how the sets are created. A tree data
structure, called a dendrogram, can be used to illustrate the hierarchical clustering
technique and the sets of different clusters. The root in a dendrogram tree
contains one cluster where all elements are together. The leaves in the
dendrogram each consist of a single element cluster. Internal nodes in the
dendrogram represent new clusters formed by merging the clusters that appear as
its children in the tree. Each level in the tree is associated with the distance
measure that was used to merge the clusters. All clusters created at a particular
level were combined because the children clusters had a distance between them
less than the distance value associated with this level in the tree.
Fig: Dendrogram
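A minimal agglomerative-clustering sketch with SciPy, assuming scipy and matplotlib are available; the five sample points are hypothetical:

    from scipy.cluster.hierarchy import linkage, dendrogram
    import matplotlib.pyplot as plt

    # Five hypothetical 2-D elements; each starts as its own single-element cluster (the leaves).
    points = [[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 9]]

    # Single-link agglomerative clustering: repeatedly merge the two closest clusters.
    Z = linkage(points, method='single')

    # The dendrogram's root is the one cluster containing all elements; the height of each
    # internal node is the distance at which its two child clusters were merged.
    dendrogram(Z)
    plt.show()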
Data Parallelism
Data parallelism means concurrent execution of the same task on multiple computing cores.
For example, consider summing the contents of an array of size N. On a single-core system, one thread would
simply sum the elements [0] . . . [N − 1]. On a dual-core system, however, thread A, running on core 0, could
sum the elements [0] . . . [N/2 − 1] while thread B, running on core 1, sums the elements [N/2] . . .
[N − 1]. The two threads would run in parallel on separate computing cores.
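A rough Python sketch of the same idea (the array size and two-way split are illustrative; note that in CPython a real speed-up for CPU-bound work generally needs separate processes rather than threads):

    from concurrent.futures import ProcessPoolExecutor

    def partial_sum(chunk):
        # Each worker runs the *same* task (summing) on its own slice of the data.
        return sum(chunk)

    if __name__ == '__main__':
        data = list(range(1_000_000))          # the array of size N
        mid = len(data) // 2
        with ProcessPoolExecutor(max_workers=2) as pool:
            # Worker A sums elements [0 .. N/2 - 1], worker B sums [N/2 .. N - 1], in parallel.
            a, b = pool.map(partial_sum, [data[:mid], data[mid:]])
        print(a + b == sum(data))              # True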
Ans: Regression
Regression is a statistical tool that helps determine the cause and effect relationship
between the variables. It determines the relationship between a dependent and an
independent variable. It is generally used to predict future trends and events.
Ans: CART is a predictive algorithm used in Machine learning and it explains how the target
variable’s values can be predicted based on other variables. It is a decision tree in which each fork is
a split on a predictor variable and each leaf node holds a prediction for the target variable.
Basic Definition:
• Data Mining: the process of identifying patterns and extracting details about big data sets using intelligent methods.
• KDD: a complex and iterative approach to knowledge extraction from big data.
Scope:
• Data Mining: in the KDD method, the fourth phase is called "data mining."
• KDD: a broad method that includes data mining as one of its steps.
Example:
• Data Mining: clustering groups of data elements based on how similar they are.
• KDD: data analysis to find patterns and links.
Bayesian interpretation:
Although the naive Bayes approach is straightforward to use, it does not always yield satisfactory results.
First, the attributes usually are not independent; we could use a subset of the attributes by ignoring any
that are dependent on others. Second, the basic technique does not handle continuous data.
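A minimal sketch of the underlying naive Bayes computation on categorical data (the toy records are invented and no smoothing is applied):

    from collections import Counter, defaultdict

    # Hypothetical categorical records: (outlook, windy) -> play
    rows = [(('sunny', 'no'), 'yes'), (('sunny', 'yes'), 'no'),
            (('rain', 'no'), 'yes'), (('rain', 'yes'), 'no'),
            (('sunny', 'no'), 'yes')]

    class_counts = Counter(label for _, label in rows)
    attr_counts = defaultdict(Counter)          # (attribute index, class) -> value counts
    for attrs, label in rows:
        for i, v in enumerate(attrs):
            attr_counts[(i, label)][v] += 1

    def score(attrs, label):
        # Naive assumption: attributes are independent given the class, so
        # P(class | attrs) is proportional to P(class) * product of P(attr_i | class).
        p = class_counts[label] / len(rows)
        for i, v in enumerate(attrs):
            p *= attr_counts[(i, label)][v] / class_counts[label]
        return p

    query = ('sunny', 'no')
    print(max(class_counts, key=lambda c: score(query, c)))   # expected: 'yes'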
Ans: 1. Classification:
This technique is used to obtain important and relevant information about data and
metadata. It helps to classify data into different classes.
2. Clustering:
Clustering groups data elements so that elements within the same cluster are highly
similar to one another and dissimilar to elements in other clusters.
3. Regression:
Regression analysis is the data mining process used to identify and analyze the
relationship between variables in the presence of other factors. It is used
to predict the value of a specific variable. Regression is primarily a form of
planning and modeling. For example, we might use it to project certain costs,
depending on other factors such as availability, consumer demand, and
competition. Primarily, it gives the exact relationship between two or more
variables in the given data set.
4. Association Rules:
This data mining technique helps to discover a link between two or more items. It
finds hidden patterns in the data set.
Association rules are if-then statements that help to show the probability of
interactions between data items within large data sets in different types of
databases. For example, from a list of grocery items that you have been buying for
the last six months, it calculates the percentage of items being purchased together.
5. Outlier detection:
This type of data mining technique relates to the observation of data items in the
data set, which do not match an expected pattern or expected behavior. This
technique may be used in various domains like intrusion detection, fraud
detection, etc. It is also known as Outlier Analysis or Outlier Mining. An outlier is
a data point that diverges too much from the rest of the dataset. Most
real-world datasets contain outliers. Outlier detection plays a significant role in the
data mining field and is valuable in numerous areas such as network
intrusion identification, credit or debit card fraud detection, and detecting outliers in
wireless sensor network data.
6. Sequential Patterns:
This technique discovers sequential patterns, i.e., regularities or trends in events
occurring over time, such as customers tending to purchase certain items in a particular order.
PART C
III. Answer any Four questions, each carries Eight marks. ( 4 x 8 = 32 )
13) How can you describe Data mining from the perspective of a database?
Ans: Data Mining from a Database Perspective.
Data mining refers to extracting or mining knowledge from large amounts of data.
In other words, data mining is the science, art, and technology of exploring large
and complex bodies of data in order to discover useful patterns. Theoreticians and
practitioners are continually seeking improved techniques to make the process
more efficient, cost-effective, and accurate. Any situation can be analyzed in two
ways in data mining:
15) Explain how the K-Means Clustering algorithm works and give examples.
Ans: K-Means Clustering:
K-means is an iterative clustering algorithm in which items are moved among sets of
clusters until the desired set is reached. As such, it may be viewed as a type of squared-error
algorithm, although the convergence criteria need not be defined based on the squared
error. A high degree of similarity among elements within a cluster is obtained, while a high
degree of dissimilarity among elements in different clusters is achieved simultaneously.
The time complexity of K-means is O(tkn), where t is the number of iterations, k the number
of clusters, and n the number of items. K-means finds a local optimum and may actually miss
the global optimum. K-means does not work on categorical data because the mean must be
defined on the attribute type.
Since the clusters are not predefined, K-means belongs to unsupervised learning.
A classification of the different types of clustering algorithms distinguishes hierarchical
and partitional approaches. With hierarchical clustering, a nested set of clusters is created.
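A small pure-NumPy sketch of the iterative assign-and-update loop (the points and k = 2 are hypothetical; in practice a library implementation such as sklearn.cluster.KMeans would normally be used):

    import numpy as np

    def kmeans(X, k, iters=10):
        rng = np.random.default_rng(0)
        centers = X[rng.choice(len(X), k, replace=False)]   # pick k initial means
        for _ in range(iters):                               # t iterations -> O(t*k*n) overall
            # Assign every item to its closest center (high intra-cluster similarity).
            labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=2), axis=1)
            # Move each center to the mean of the items currently assigned to it.
            centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        return labels, centers

    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
    labels, centers = kmeans(X, k=2)
    print(labels)      # e.g. [0 0 0 1 1 1]: the two tight groups are separated

A library implementation additionally uses smarter initialization (e.g., k-means++) and handles empty clusters, which this sketch does not.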
Apriori says:
• If P(I) < the minimum support threshold, then item I is not frequent.
• If P(I + A) < the minimum support threshold, then the itemset I + A is not frequent either,
where A is any other item; in other words, every superset of an infrequent itemset is itself infrequent.
#1) In the first iteration of the algorithm, each item is taken as a 1-itemset
candidate. The algorithm counts the occurrences of each item.
#2) Let there be some minimum support, min_sup (e.g., 2). The set of 1-itemsets
whose occurrence satisfies min_sup is determined: only those candidates whose count
is greater than or equal to min_sup are taken ahead to the next iteration, and the
others are pruned.
#3) Next, frequent 2-itemsets with min_sup are discovered. For this, in the join
step, the 2-itemsets are generated by forming groups of two, combining the frequent
items with one another.
#4) The 2-itemset candidates are pruned using the min_sup threshold value. Now the
table contains only 2-itemsets with min_sup.
#5) The next iteration forms 3-itemsets using the join and prune steps. This
iteration follows the antimonotone property: the 2-itemset subsets of each
3-itemset must themselves satisfy min_sup. If all 2-itemset subsets are
frequent, the 3-itemset is kept as a candidate; otherwise it is pruned.
#6) The next step forms 4-itemsets by joining the 3-itemsets with themselves and
pruning any candidate whose subsets do not meet the min_sup criterion. The algorithm
stops when no further frequent itemsets can be generated.
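A compact sketch of the join-and-prune loop described above, in plain Python (the transactions and min_sup = 2 are hypothetical):

    transactions = [{'bread', 'milk'}, {'bread', 'beer'}, {'milk', 'beer'},
                    {'bread', 'milk', 'beer'}, {'bread', 'milk'}]
    min_sup = 2

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Iteration 1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}]

    k = 1
    while frequent[-1]:
        # Join step: build (k+1)-itemset candidates from frequent k-itemsets, then prune
        # candidates whose support falls below min_sup (by the antimonotone property,
        # any superset of an infrequent itemset is infrequent).
        candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
        frequent.append({c for c in candidates if support(c) >= min_sup})
        k += 1

    for level in frequent:
        for itemset in sorted(level, key=sorted):
            print(set(itemset), support(itemset))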
Advantages
1. The algorithm is easy to understand.
2. The join and prune steps are easy to implement on large itemsets in large databases.
Disadvantages
1. It requires heavy computation if the itemsets are very large and the minimum
support is kept very low.
2. The entire database needs to be scanned repeatedly.
Note: frequent pattern mining (FPM) has many applications in data analysis, software bug
detection, cross-marketing, sale campaign analysis, market basket analysis, etc.
Outliers affect the mean much more than they affect the median.
Detecting Outliers
If our dataset is small, we can detect the outlier by just looking at the dataset. But what if we have a huge
dataset, how do we identify the outliers then? We need to use visualization and mathematical techniques.
Below are some of the techniques for detecting outliers:
Boxplots
Z-score
Interquartile Range (IQR)
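A short sketch of the Z-score and IQR checks with NumPy (the sample values, including the obvious outlier 250, are invented):

    import numpy as np

    heights_cm = np.array([150, 160, 165, 170, 172, 175, 180, 250])   # 250 is the outlier

    # Z-score: how many standard deviations each value lies from the mean.
    z = (heights_cm - heights_cm.mean()) / heights_cm.std()
    print(heights_cm[np.abs(z) > 2])          # values more than 2 std devs from the mean

    # IQR: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    q1, q3 = np.percentile(heights_cm, [25, 75])
    iqr = q3 - q1
    print(heights_cm[(heights_cm < q1 - 1.5 * iqr) | (heights_cm > q3 + 1.5 * iqr)])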
12. Explain Apriori algorithm with examples.
Ans: Refer MQP 2 Q 17
PART-C
III. Answer any FOUR questions. Each question carries Eight marks.
13. Explain data mining from a database perspective.
Ans: Refer MQP 2 Q 13
14. Explain CART in detail.
Ans:
A Classification and Regression Tree (CART) is a decision tree algorithm used for both classification and
regression tasks. It is a popular and versatile machine learning algorithm that recursively splits the dataset
into subsets based on the most significant attribute, resulting in a tree-like structure.
Key Concepts:
1. Decision Tree:
• A tree-like model where each internal node represents a decision based on the value of a particular attribute.
• Each leaf node represents the outcome or predicted value.
2. Splitting Criteria:
• The algorithm selects the attribute and the split point (or threshold) that best separates the data into
homogeneous subsets.
• For classification, common criteria include Gini impurity and entropy.
• For regression, the mean squared error (MSE) is often used. (A small sketch of these criteria follows this list.)
3. Recursive Splitting:
• The dataset is split recursively until a stopping criterion is met (e.g., maximum depth, minimum samples per
leaf).
• Each split further refines the decision boundaries.
4. Classification:
• For classification tasks, the leaf nodes represent the predicted class based on majority voting.
5. Regression:
• For regression tasks, the leaf nodes represent the predicted value based on the average of the target values in
that node.
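A small sketch of the splitting criteria mentioned above (plain Python; the label and value lists are invented):

    import math
    from collections import Counter

    def gini(labels):
        # Gini impurity: 1 - sum(p_i^2) over the class proportions p_i.
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def entropy(labels):
        # Entropy: -sum(p_i * log2(p_i)).
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def mse(values):
        # Regression criterion: mean squared error around the node's mean prediction.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    print(gini(['yes', 'yes', 'no', 'no']))      # 0.5  (maximally mixed node)
    print(entropy(['yes', 'yes', 'yes', 'no']))  # ~0.81
    print(mse([10, 12, 14]))                     # ~2.67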
How CART Works:
1. Root Node:
• The algorithm selects the attribute and split point that best separates the entire dataset.
2. Splitting:
• The dataset is split into subsets based on the chosen attribute and split point.
• This process is repeated for each subset until a stopping criterion is met.
3. Leaf Nodes:
• The terminal nodes (leaves) contain the final predictions or classifications.
4. Prediction/Classification:
• For a new instance, it traverses the tree from the root to a leaf, making decisions based on attribute values.
• The predicted class or value is determined by the leaf node reached.
Example:
Let's consider a classification task where we want to predict whether a passenger survived or not based on
features like age, gender, and ticket class.
• The root node might split the data based on gender.
• The next level might split based on age.
• The leaf nodes might represent different survival outcomes.
For a regression task, the target might be the price of a house based on features like the number of bedrooms and
square footage.
• The tree would split the dataset based on features to create leaves that represent predicted house prices.
Applications:
• Classification: Predicting outcomes like spam or non-spam emails, customer churn, etc.
• Regression: Predicting numeric values like house prices, temperature, etc.
• Interpretability: Decision trees are human-readable and can help understand the decision-making process.
CART is a powerful algorithm with the ability to handle complex relationships in data.
However, it is prone to overfitting, and techniques like pruning are often applied to prevent this.
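A hedged scikit-learn sketch of both CART modes, assuming scikit-learn is installed; the survival-style and house-price data are invented:

    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    # Classification: [age, gender_code, ticket_class] -> survived?
    X_cls = [[22, 0, 3], [38, 1, 1], [26, 1, 3], [35, 1, 1], [54, 0, 1], [2, 0, 3]]
    y_cls = [0, 1, 1, 1, 0, 0]
    clf = DecisionTreeClassifier(criterion='gini', max_depth=3)   # depth limit curbs overfitting
    clf.fit(X_cls, y_cls)
    print(clf.predict([[30, 1, 2]]))

    # Regression: [bedrooms, square_footage] -> price
    X_reg = [[2, 800], [3, 1200], [3, 1500], [4, 2000], [5, 2600]]
    y_reg = [150_000, 220_000, 260_000, 330_000, 410_000]
    reg = DecisionTreeRegressor(max_depth=3)
    reg.fit(X_reg, y_reg)
    print(reg.predict([[4, 1800]]))   # leaf prediction = mean of training targets in that leaf

Here max_depth stands in for the pruning mentioned above; recent scikit-learn versions also offer cost-complexity pruning via the ccp_alpha parameter.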
15. Explain K-means clustering algorithm with examples.
Ans: Refer MQP 1 Q 9
16. What is hierarchical clustering? Explain in detail and give example.
Ans: Refer MQP 1 Q 16
17. What are large item-sets? Explain in detail.
Ans: Refer MQP 1 Q 17
18. Write a note on data parallelism.
Ans: Refer MQP 1 Q 18
ETL (Extract, Transform, Load) is defined as a data integration service that allows companies to combine data from various sources into a single,
consistent data store that is loaded into a Data Warehouse or any other target system.
• Extraction: In this, the structured or unstructured data is extracted from its source and consolidated into a single
repository. ETL tools automate the extraction process and create a more efficient and reliable workflow for handling
large volumes of data and multiple sources.
• Transformation: In order to improve data integrity, the data needs to be transformed: it is sorted and
standardized, and redundant data is removed. This step ensures that the raw data arriving at its new
destination is fully compatible and ready to use.
• Loading: This is the final step of the ETL process and involves loading the data into the final destination (a data lake or
data warehouse). The data can be loaded all at once (full load) or at scheduled intervals (incremental load).
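A toy end-to-end sketch of the three ETL steps in Python (the file name, table, and column names are hypothetical, and sqlite3 merely stands in for the target warehouse):

    import csv, sqlite3

    # Extraction: pull raw rows out of a hypothetical source file.
    with open('sales_raw.csv', newline='') as f:
        rows = list(csv.DictReader(f))

    # Transformation: standardize, sort, and drop redundant (duplicate) records.
    seen, clean = set(), []
    for r in rows:
        if r['order_id'] in seen:
            continue                            # remove redundant data
        seen.add(r['order_id'])
        clean.append((r['order_id'], r['product'].strip().title(), float(r['amount'])))
    clean.sort()

    # Loading: write the prepared rows into the target store (a full load here).
    con = sqlite3.connect('warehouse.db')
    con.execute('CREATE TABLE IF NOT EXISTS sales (order_id TEXT, product TEXT, amount REAL)')
    con.executemany('INSERT INTO sales VALUES (?, ?, ?)', clean)
    con.commit()
    con.close()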
6. What are ETL tools? What are the different types of ETL tools?
Ans:
ETL Tools are applications/platforms that enable users to execute ETL processes. In simple terms, these tools
help businesses move data from one or many disparate data sources to a destination. They help in making the
data both digestible and accessible (and in turn analysis-ready) in the desired location, often a data warehouse.
ETL tools are the first essential step in the data warehousing process and eventually help organizations make
more informed decisions in less time.
• Enterprise ETL Tools
These ETL tools are often bundled as part of a larger platform and appeal to enterprises with older, legacy systems
that they need to work with and build on. They can handle pipelines efficiently and are highly
scalable, since their vendors were among the first to offer ETL tools and have matured in the market. These tools
support most relational and non-relational databases.
• Custom ETL Tools
In this, the custom tools and pipelines are created using scripting languages like SQL or Python. While this gives
you an opportunity for customization and higher flexibility, it also requires more administration and
maintenance.
• Cloud-Based ETL Tools
These tools integrate with proprietary data sources and ingest data from different web apps or on-premises
sources. These tools move data between systems and copy, transform, and enrich data before writing it to data
warehouses or data lakes.
• Open-Source ETL Tools
Many ETL tools today are free and provide easy-to-use user interfaces for designing data exchange processes
and monitoring the flow of information. An advantage of open-source solutions is that organizations can access
the source code to study the tool infrastructure and extend the functionality.
Long answers
1. Explain the classification process with examples.
Ans:
THE DATA CLASSIFICATION PROCESS INCLUDES TWO STEPS
1. Building the Classifier or Model: This step is the learning step or the learning phase. In this step the
classification algorithms build the classifier. The classifier is built from the training set made up of database
tuples and their associated class labels. A model or classifier is constructed to predict the categorical labels.
These labels are "risky" or "safe" for loan application data.
2. Using the Classifier for Classification: In this step, the classifier is used for classification. Here the test data is
used to estimate the accuracy of the classification rules. The classification rules can be applied to new data
tuples if the accuracy is considered acceptable.
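A brief scikit-learn sketch of these two steps, assuming scikit-learn is installed; the loan-style tuples below are invented:

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    # Hypothetical loan tuples: [income_k, existing_debt_k] with class labels 'safe'/'risky'
    X = [[60, 5], [25, 20], [80, 2], [30, 25], [90, 1], [28, 18], [70, 4], [22, 30]]
    y = ['safe', 'risky', 'safe', 'risky', 'safe', 'risky', 'safe', 'risky']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Step 1 (learning phase): build the classifier from the training tuples and labels.
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2 (classification): estimate accuracy on test data, then label new tuples.
    print(accuracy_score(y_test, model.predict(X_test)))
    print(model.predict([[55, 10]]))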
Linear regression is the type of regression that forms a relationship between the target variable and one or more
independent variables utilizing a straight line. The given equation represents the equation of linear regression.
It is a statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc. The linear regression algorithm
shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the
name linear regression. The linear regression model provides a sloped straight line representing the relationship
between the variables.
The values of the x and y variables are the training data points for the linear regression model.
Y = a + b*X + e
where:
a represents the intercept of the line (the point where the line or curve crosses an axis of the graph is called an
intercept; if it crosses the x-axis it is the x-intercept, and if it crosses the y-axis it is the y-intercept),
b represents the slope of the regression line,
e represents the random error,
X represents the predictor (independent) variable, and
Y represents the target (dependent) variable.
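A short NumPy sketch that estimates a (intercept) and b (slope) from hypothetical training points:

    import numpy as np

    # Hypothetical training data: X = house area (sq. m), Y = price (lakhs)
    X = np.array([50, 70, 90, 110, 130])
    Y = np.array([35, 48, 60, 75, 88])

    b, a = np.polyfit(X, Y, 1)          # least-squares fit of Y = a + b*X
    print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
    print(a + b * 100)                  # predicted Y for a new X of 100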
LOGISTIC REGRESSION
When the dependent variable is binary in nature, i.e., 0 or 1, true or false, success or failure, the logistic
regression technique is used. Here, the target value (Y) ranges from 0 to 1, and it is primarily used
for classification-based problems. Unlike linear regression, it does not require the independent and dependent
variables to have a linear relationship. Example: acceptance into university based on student grades.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a
categorical or discrete value: Yes or No, 0 or 1, true or false, etc. In logistic regression, instead of fitting a
regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
Applications of linear regression:
1. Medical researchers can use this regression model to determine the relationship between independent
characteristics, such as age and body weight, and dependent ones, such as blood pressure. This can help
reveal the risk factors associated with diseases. They can use this information to identify high-risk
patients and promote healthy lifestyles.
2. Financial analysts use linear models to evaluate a company's operational performance and forecast
returns on investment. They also use it in the capital asset pricing model, which studies the relationship
between the expected investment returns and the associated market risks. It shows companies if an
investment has a fair price and contributes to decisions on whether or not to invest in the asset
Applications of logistic regression:
1. An e-commerce company that mails expensive promotional offers to customers would like to know
whether a particular customer is likely to respond to the offers or not, i.e., whether that consumer will be a
"responder" or a "non-responder."
2. Likewise, a credit card company can develop a model to help it predict whether a customer is going to default
on its credit card, based on characteristics such as annual income, monthly credit card payments, and the number
of past defaults.
3. A medical researcher may want to know the impact of a new drug on treatment outcomes across different age
groups. This involves a lot of nested multiplication and division to compare the outcomes of younger and older
people who never received the treatment, younger people who received it, older people who received it, and the
spontaneous healing rate of the entire group.
4. Logistic regression has become particularly popular in online advertising, enabling marketers to predict, as a
yes or no probability, whether specific website users will click on particular advertisements.
5. In healthcare, to identify risk factors for diseases and plan preventive measures.
6. In drug research, to learn the effectiveness of medicines on health outcomes across age, gender, etc.
7. In weather forecasting apps, to predict snowfall and weather conditions.
8. In political polls, to determine whether voters will vote for a particular candidate.
9. In banking, to predict the chances that a loan applicant will default on a loan, based on annual income,
past defaults, and past debts.
Standard deviation: This provides a standard way of knowing what is normal or extra large or extra small and
helps to understand the spread of the variable from the mean. It shows how close all the values are to the mean.
Variance: This is similar to standard deviation but it measures how tightly or loosely values are spread around
the average.
Range: The range indicates the difference between the largest and the smallest values thereby showing the
distance between the extremes.
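These three measures can be computed directly in Python, for example (sample values invented):

    import statistics

    values = [12, 15, 14, 10, 18, 55]          # 55 is unusually large

    print(statistics.pstdev(values))           # standard deviation: typical distance from the mean
    print(statistics.pvariance(values))        # variance: squared spread around the mean
    print(max(values) - min(values))           # range: distance between the extremes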
3) Distribution of a Sample of Data
The distribution of sample data values has to do with the shape, which refers to how the data values are spread
across the range of values in the sample. In simple terms, it describes whether the values are clustered
symmetrically around the average or whether there are more values on one side than the other.
Two ways to explore the distribution of the sample data are:
1. graphically, and
2. through shape statistics.
1. Scalability: Scalability in clustering implies that as we increase the number of data objects,
the time to perform clustering should grow roughly in proportion to the complexity order of the algorithm. If we
raise the number of data objects 10-fold, then the time taken to cluster them should also increase
approximately 10 times; that is, there should be a linear relationship. If that is not the case, then there is some
error in our implementation.
2. Interpretability: The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape: The clustering algorithm should be able to find arbitrarily shaped
clusters. It should not be limited to distance measures that tend to discover only small spherical
clusters.
4. Ability to deal with different types of attributes: Algorithms should be capable of being applied to any
data such as data based on intervals (numeric), binary data, and categorical data.
5. Ability to deal with noisy data: Databases contain data that is noisy, missing, or erroneous. Some algorithms
are sensitive to such data and may produce poor-quality clusters.
6. High dimensionality: The clustering tools should not only be able to handle high dimensional data space but
also the low-dimensional space.
Data Preparation:
Collect and preprocess the dataset to ensure it is clean and formatted correctly. This often involves removing
duplicates, handling missing values, and transforming data into a suitable format (e.g., transactional data).
Define Minimum Support and Confidence:
Support: This is the proportion of transactions in the dataset that contain a particular itemset. It helps to identify
how frequently an itemset appears in the dataset.
Confidence: This measures the likelihood that an item B is purchased when item A is purchased. It is calculated
as the ratio of the support of the itemset containing both A and B to the support of the itemset containing A.
Generate Frequent Itemsets:
Use an algorithm (like Apriori or FP-Growth) to identify all itemsets that meet the minimum support threshold.
Apriori Algorithm: This algorithm works by iteratively identifying frequent itemsets. It starts with single items
and combines them to form larger itemsets, pruning those that do not meet the support threshold.
Generate Association Rules:
From the frequent itemsets, generate rules of the form A → B, where A and B are itemsets.
For each frequent itemset, generate all possible rules by splitting the itemset into two parts (antecedent and
consequent).
Calculate Confidence for Each Rule:
For each generated rule, calculate the confidence to determine how strong the rule is. Only keep the rules
that meet the minimum confidence threshold.
Evaluate and Filter Rules:
Optionally, you can also calculate other metrics such as lift, which measures how much more likely the
consequent is purchased when the antecedent is purchased compared to when it is not.
Filter the rules based on additional criteria, such as lift or interestingness, to focus on the most relevant rules.
Example: Suppose you have a dataset of transactions in a grocery store:
Transaction 1: {Bread, Milk}
Transaction 2: {Bread, Rice, Beer, Eggs}
Transaction 3: {Milk, Rice, Beer, Cola}
Transaction 4: {Bread, Milk, Rice, Beer}
Transaction 5: {Bread, Milk, Cola}
Define Support and Confidence: Set minimum support to 40% and minimum confidence to 70%.
Generate Frequent Itemsets: Identify frequent itemsets like {Bread}, {Milk}, {Rice}, {Beer}, {Bread, Milk}, etc.
Generate Rules: From the frequent itemsets, generate rules like {Bread} → {Milk}, {Rice} → {Beer}, etc.
Calculate Confidence: For the rule {Bread} → {Milk}, calculate confidence and check if it meets the threshold.
Filter Rules: Keep only the rules that meet both support and confidence thresholds.
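The support and confidence figures for this example can be checked with a few lines of Python (using the five transactions and the 40% / 70% thresholds stated above):

    transactions = [
        {'Bread', 'Milk'},
        {'Bread', 'Rice', 'Beer', 'Eggs'},
        {'Milk', 'Rice', 'Beer', 'Cola'},
        {'Bread', 'Milk', 'Rice', 'Beer'},
        {'Bread', 'Milk', 'Cola'},
    ]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    # {Bread, Milk} appears in 3 of 5 transactions -> support 0.6 >= 0.4, so it is frequent.
    print(support({'Bread', 'Milk'}))                    # 0.6
    # Rule {Bread} -> {Milk}: 3 of the 4 Bread transactions also contain Milk.
    print(confidence({'Bread'}, {'Milk'}))               # 0.75 >= 0.7, so the rule is kept
    # Rule {Rice} -> {Beer}: all 3 Rice transactions contain Beer.
    print(confidence({'Rice'}, {'Beer'}))                # 1.0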