Data Mining
(3160714)
Institute logo
The main motto of any laboratory, practical, or field work is to enhance the required skills and to create in students the ability to solve real-time problems by developing the relevant competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-focused, outcome-based curriculum for engineering degree programs in which sufficient weightage is given to practical work. This underlines the importance of skill enhancement among students, and it pays attention to utilizing every second of the time allotted for practicals, so that students, instructors, and faculty members achieve the relevant outcomes by performing the experiments rather than conducting merely study-type experiments. For effective implementation of a competency-focused, outcome-based curriculum, it is essential that every practical be carefully designed to serve as a tool to develop and enhance, in every student, the relevant competencies required by industry. These psychomotor skills are very difficult to develop through the traditional chalk-and-board content delivery method in the classroom. Accordingly, this lab manual is designed to focus on industry-defined, relevant outcomes rather than the old practice of conducting practicals merely to prove concepts and theory.
By using this lab manual, students can go through the relevant theory and procedure in advance of the actual performance, which creates interest and gives them a basic idea before the session; this in turn strengthens the pre-determined outcomes. Each experiment in this manual begins with the competency, industry-relevant skills, course outcomes, and practical outcomes (objectives). Students will also learn the safety measures and necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members for facilitating student-centric lab activities in each experiment, by arranging and managing the necessary resources so that students follow the procedures with the required safety and necessary precautions to achieve the outcomes. It also indicates, through rubrics, how students will be assessed.
Data mining is key to sentiment analysis, price optimization, database marketing, credit risk management, training and support, fraud detection, healthcare and medical diagnosis, risk assessment, recommendation systems, and much more. It can be an effective tool in just about any industry, including retail, wholesale distribution, service industries, telecom and communications, insurance, education, manufacturing, healthcare, banking, science, engineering, and online marketing or social media.
Utmost care has been taken while preparing this lab manual; however, there is always a chance for improvement. We therefore welcome constructive suggestions for improvement and for the removal of any errors.
Vishwakarma Government Engineering College
Department of Computer Engineering
CERTIFICATE
This is to certify that Mr./Ms. ______________________________ Enrolment No. _______________ of B.E. Semester _____ of this Institute (GTU Code: 017) has satisfactorily completed the Practical / Tutorial work for the subject Data Mining (3160714) for the academic year ______________.
Place: ___________
Date: ___________
DTE’s Vision
Institute’s Vision
Institute’s Mission
Department’s Vision
Department’s Mission
Sr. No. | Objective(s) of Experiment | Mapped Course Outcomes (CO1-CO5, indicated by √)

1. Identify how data mining is an interdisciplinary field by an application. (√)
2. Write programs to perform the following preprocessing tasks (any language): (√)
   2.1 Noisy data handling: Equal Width Binning; Equal Frequency/Depth Binning
   2.2 Normalization techniques: min-max normalization, z-score normalization, decimal scaling
   2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries
3. Perform hands-on experiments of data preprocessing with sample data on the Orange tool. (√ √)
4. Implement the Apriori algorithm of the association rule mining technique in any programming language. (√)
5. Apply the association rule mining technique on sample data sets using WEKA. (√ √)
6. Apply a classification data mining technique on sample data sets in WEKA. (√ √)
7. 7.1 Implement a classification technique with quality measures in any programming language.
   7.2 Implement a regression technique in any programming language. (√)
8. Apply the K-means clustering algorithm in any programming language. (√ √)
9. Perform a hands-on experiment on any advanced mining technique using an appropriate tool. (√)
10. Solve a real-world problem using data mining techniques in the Python programming language. (√)
Guidelines for Faculty Members

1. Teachers should provide guidelines along with a demonstration of the practical to the students, covering all its features.
2. Teachers should explain the basic concepts and theory related to the experiment to the students before starting each practical.
3. Involve all students in the performance of each experiment.
4. Teachers are expected to share the skills and competencies to be developed in the students, and to ensure that these skills and competencies are developed after completion of the experimentation.
5. Teachers should give students the opportunity for hands-on experience after the demonstration.
6. Teachers may impart additional knowledge and skills to the students, even if not covered in the manual, that the concerned industry expects of them.
7. Give practical assignments and assess the performance of students based on the assigned tasks, checking whether the work is as per the instructions.
8. Teachers are expected to refer to the complete curriculum of the course and follow the guidelines for implementation.
Instructions for Students

1. Students are expected to listen carefully to all theory classes delivered by the faculty members and to understand the COs, course content, teaching and examination scheme, the skill set to be developed, etc.
2. Students must perform the experiments as per the given practical list.
3. Students must show the output of each program in their practical file.
4. Students are instructed to submit the practical list as per the sample list shown on the next page.
5. Students should develop a habit of submitting their experimentation work as per the schedule, and should be well prepared for it.
Index
(Progressive Assessment Sheet)
Experiment No - 1
Aim: Identify how data mining is an interdisciplinary field by an Application.
Data mining is an interdisciplinary field that involves computer science, statistics, mathematics, and domain-specific knowledge. One application that showcases the interdisciplinary nature of data mining is a movie recommendation system.
Theory:
Movie recommendation systems are a great example of how data mining is an interdisciplinary field
involving multiple expertise areas. The process of creating a movie recommendation system involves
the following steps:
Dataset: A movie recommendation system requires a dataset that contains information about movies and their attributes, as well as information about users and their preferences. Some example datasets are:
MovieLens: This is a dataset of movie ratings collected by the GroupLens research group at
the University of Minnesota. It contains over 27,000 movies and 21 million ratings from
around 250,000 users.
Netflix Prize: This is a dataset of movie ratings collected by the online movie rental company
Netflix. It contains over 100 million ratings from around 480,000 users on 17,000 movies.
IMDb: This is a dataset of movie metadata collected by the Internet Movie Database (IMDb).
It contains information about over 600,000 movies, including titles, cast and crew, plot
summaries, ratings, and reviews.
Preprocessing: This involves cleaning and transforming the data to make it suitable for analysis. Some preprocessing techniques commonly used in movie recommendation systems are:
Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and
removing duplicates. For example, if a movie has missing information such as the release date,
it may be removed from the dataset or the information may be imputed.
Data Normalization: This involves scaling the data to a common range or standard scale. For example, ratings from different users may be normalized to a common scale of 1-5.
Data Transformation: This involves transforming the data into a format suitable for analysis.
For example, movie genres may be encoded as binary variables to enable analysis using
machine learning algorithms.
Feature Generation: This involves creating new features from the existing data that may be
useful for analysis. For example, the average rating for each movie may be calculated based
on user ratings, or the number of times a movie has been watched may be counted.
Data Reduction: This involves reducing the dimensionality of the data to improve processing
efficiency and reduce noise. For example, principal component analysis (PCA) may be used
to identify the most important features in the dataset.
These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in
a way that enables accurate analysis and prediction of user preferences for movie recommendations.
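As an illustration, here is a minimal sketch of some of these steps in Python with pandas. The table and its column names (user_id, movie_id, rating, genre) are hypothetical stand-ins, not the schema of any particular dataset:

import pandas as pd

# Hypothetical ratings data; real datasets such as MovieLens have similar fields.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "movie_id": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
    "genre": ["Comedy", "Drama", "Comedy", "Action", "Drama"],
})

# Data cleaning: drop duplicate (user, movie) pairs and rows with missing ratings.
ratings = ratings.drop_duplicates(["user_id", "movie_id"]).dropna(subset=["rating"])

# Data transformation: encode genres as binary (one-hot) variables.
ratings = pd.concat([ratings, pd.get_dummies(ratings["genre"])], axis=1)

# Feature generation: average rating per movie.
avg_rating = ratings.groupby("movie_id")["rating"].mean().rename("avg_rating")
print(ratings)
print(avg_rating)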
Data Mining Techniques: Association rule mining, clustering, and classification are all data mining
techniques that can be applied to movie recommendation systems. Here is a brief overview of how
each of these techniques can be used:
Association Rule Mining: This technique can be used to identify patterns and relationships
between movies that can help with recommendations. For example, if users who watch
romantic comedies also tend to watch dramas, then a recommendation system can use this
association to suggest dramas to users who have watched romantic comedies. Association rule
mining can also be used to identify frequent item sets, which can help identify popular movies
or genres.
Clustering: This technique can be used to group movies and users based on their
characteristics. For example, movies can be clustered based on genre, rating, release year, or
other attributes. Users can also be clustered based on their movie preferences or ratings.
Clustering can help with making personalized recommendations by identifying groups of
similar users or movies.
Classification: This technique can be used to predict the preferences or ratings of users for a
particular movie. For example, a classification algorithm can be trained on past user ratings
to predict whether a new user will like a particular movie or not. Classification algorithms can
also be used to predict the genre or rating of a new movie based on its characteristics.
Procedure:
1. Select the domain.
2. Select a particular system (here, a movie recommendation system).
3. Identify the preprocessing used in the system.
4. Identify the mining techniques applied to the system.
Observation/Program:
In a movie recommendation system, data mining techniques are used to analyze large amounts of
data about movies and users, and generate accurate and personalized recommendations.
Conclusion: Movie recommendation systems provide a compelling example of how data mining is
an interdisciplinary field.
Quiz:
(1) What are the different preprocessing techniques that can be applied to a dataset?
(2) What is the use of data mining techniques in a particular system?
Suggested References:
1. Han, J., & Kamber, M. (2011). Data mining: concepts and techniques.
2. https://www.kaggle.com/code/rounakbanik/movie-recommender-systems
Rubric wise marks obtained:

Rubrics: Problem Recognition (2) | Knowledge (2) | Completeness and accuracy (2) | Team Work (2) | Ethics (2) | Total
Each rubric: Good (2) / Avg. (1)
Marks:
Experiment No - 2
Aim: Write programs to perform the following preprocessing tasks (any language).
2.1 Noisy data handling
    Equal Width Binning
    Equal Frequency/Depth Binning
2.2 Normalization techniques
    Min-max normalization
    Z-score normalization
    Decimal scaling
2.3 Implement the data dispersion measure Five Number Summary and generate a box plot using Python libraries
Theory:
Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing.
Example (data preprocessing). Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
Smoothing by median:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28
In equal-width binning, the bins have equal width; the bin boundaries are defined as [min + w], [min + 2w], ..., [min + Nw], where w = (max - min) / N and N is the number of bins.

Example:
5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215
With N = 3 bins, w = (215 - 5) / 3 = 70, giving boundaries at 75 and 145:
Bin 1 [5, 75): 5, 10, 11, 13, 15, 35, 50, 55, 72
Bin 2 [75, 145): 92
Bin 3 [145, 215]: 204, 215
Normalization techniques are used in data preprocessing to scale numerical data to a common range.
Here are three commonly used normalization techniques:
The measurement unit used can affect the data analysis. For example, changing measurement units
from meters to inches for height, or from kilograms to pounds for weight, may lead to very different
results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute,
and thus tend to give such an attribute greater effect or “weight.” To help avoid dependence on the
choice of measurement units, the data should be normalized or standardized. This involves
transforming the data to fall within a smaller or common range such as [−1,1] or [0.0, 1.0]. (The terms
standardize and normalize are used interchangeably in data preprocessing, although in statistics, the
latter term also has other connotations.) Normalizing the data attempts to give all attributes an equal
weight. Normalization is particularly useful for classification algorithms involving neural networks
or distance measurements such as nearest-neighbor classification and clustering. If the neural network backpropagation algorithm is used for classification mining, normalizing the input values of each attribute will help speed up the learning phase. There are many methods for data normalization; we focus on min-max normalization, z-score normalization, and normalization by decimal scaling.
Min-Max Normalization: This technique scales the data to a range of 0 to 1. The formula for min-
max normalization is:
X_norm = (X - X_min) / (X_max - X_min)
where X is the original data, X_min is the minimum value in the dataset, and X_max is the maximum
value in the dataset.
Z-Score Normalization: This technique scales the data to have a mean of 0 and a standard deviation
of 1. The formula for z-score normalization is:
X_norm = (X - X_mean) / X_std
where X is the original data, X_mean is the mean of the dataset, and X_std is the standard deviation
of the dataset.
Decimal Scaling: This technique scales the data by moving the decimal point a certain number of
places to the left or right. The formula for decimal scaling is:
X_norm = X / 10^j
where X is the original data and j is the smallest integer such that the maximum absolute value of X_norm is less than 1 (i.e., the number of positions the decimal point is moved).
2.3. Implement data dispersion measure Five Number Summary generate box plot using
python libraries
Let’s understand this with the help of an example. Suppose we have some data such as:
11,23,32,26,16,19,30,14,16,10
Here, for the above set of data points, the Five Number Summary is as follows. First, arrange the data points in ascending order: 10, 11, 14, 16, 16, 19, 23, 26, 30, 32

Minimum value: 10
25th percentile (Q1): 14
Calculation of 25th percentile: (25/100)*(n+1) = (25/100)*(11) = 2.75, i.e., the 3rd value of the data
50th percentile (Q2, median): 17.5
Calculation of 50th percentile: (16+19)/2 = 17.5
75th percentile (Q3): 26
Calculation of 75th percentile: (75/100)*(n+1) = (75/100)*(11) = 8.25, i.e., the 8th value of the data
Maximum value: 32
Box plots
Boxplots are the graphical representation of the distribution of the data using Five Number
summary values. It is one of the most efficient ways to detect outliers in our dataset.
In statistics, an outlier is a data point that differs significantly from other observations. An
outlier may be due to variability in the measurement or it may indicate experimental error;
the latter are sometimes excluded from the dataset. An outlier can cause serious problems in
statistical analyses.
Prior to preprocessing, thoroughly clean the data by handling missing values and outliers. Incorrect
handling of noisy data can lead to skewed results.
Procedure:
1. Collect raw data from various sources.
2. Handle missing values and outliers.
3. Equal Width Binning:
Divide data into bins of equal width.
4. Equal Frequency/Depth Binning:
Divide data into bins of equal frequency.
5. Min-Max Normalization:
Scale data to a specific range (e.g., [0, 1]).
6. Z-Score Normalization:
Standardize data to have a mean of 0 and standard deviation of 1.
7. Decimal Scaling:
Scale data by a power of 10.
Observation/Program:
Code:
# 2.1 Noisy Data Handling
import random
from statistics import mean, median

bins = 3  # number of bins

# Generate a sorted random data sample
data = sorted(random.sample(range(10, 100), 20))
print("Random data sample:", data)

# Equal-width binning: each bin spans an equal range of values
min_val, max_val = min(data), max(data)
width = (max_val - min_val) / bins
equal_width = [[] for _ in range(bins)]
for x in data:
    idx = min(int((x - min_val) / width), bins - 1)
    equal_width[idx].append(x)
print("Equal-width bins:", equal_width)

# Equal-frequency (depth) binning: each bin holds the same number of values
depth = len(data) // bins
equal_freq = [data[i * depth:(i + 1) * depth] for i in range(bins)]
print("Equal-frequency bins:", equal_freq)

def smooth_mean(binned):
    # Replace every value in a bin with the bin mean
    return [[mean(b)] * len(b) for b in binned]

def smooth_median(binned):
    # Replace every value in a bin with the bin median
    return [[median(b)] * len(b) for b in binned]

def smooth_bound(binned):
    # Replace each interior value with the closer bin boundary
    smooth_data = []
    for b in binned:
        low, high = min(b), max(b)
        smooth_data.append([low] + [low if x - low <= high - x else high for x in b[1:-1]] + [high])
    return smooth_data

print("Smoothing by mean:", smooth_mean(equal_freq))
print("Smoothing by median:", smooth_median(equal_freq))
print("Smoothing by boundaries:", smooth_bound(equal_freq))
Code:
# 2.2 Normalization Techniques
# Min-max normalization, Z-score normalization, Decimal scaling
from statistics import mean, stdev
import random

data = sorted(random.sample(range(10, 100), 20))
print("Data:", data)

# 1. Min-max normalization: rescale to [new_min, new_max]
min_data, max_data = min(data), max(data)
new_min, new_max = 0.0, 1.0
min_max = [(x - min_data) / (max_data - min_data) * (new_max - new_min) + new_min
           for x in data]
print("Min-max normalized:", min_max)

# 2. Z-score normalization: zero mean, unit standard deviation
data_mean, data_std = mean(data), stdev(data)
z_score = [(x - data_mean) / data_std for x in data]
print("Z-score normalized:", z_score)

# 3. Decimal scaling: divide by 10^j so that max(|x| / 10^j) < 1
j = len(str(max(abs(x) for x in data)))
decimal_scaled = [x / 10 ** j for x in data]
print("Decimal scaled:", decimal_scaled)
# 2.3 Implement the data dispersion measure Five Number Summary and generate
# a box plot using Python libraries
from statistics import median
import random
import seaborn as sns
import matplotlib.pyplot as plt

data = sorted(random.sample(range(10, 100), 20))

n = len(data)
half = n // 2
lower = data[:half]                                # lower half of the data
upper = data[half + 1:] if n % 2 else data[half:]  # upper half of the data

Q1 = median(lower)
Q2 = median(data)
Q3 = median(upper)
print("Five Number Summary:", min(data), Q1, Q2, Q3, max(data))

sns.boxplot(x=data)
plt.show()
Output:
Conclusion:
Binning, normalization techniques, and the five number summary are all important tools in data preprocessing that help prepare data for data mining tasks.
Quiz:
(1) What is the Five Number Summary? How do you generate a box plot using Python libraries?
(2) What are normalization techniques?
(3) What are the different smoothing techniques?
Suggested Reference:
Rubric wise marks obtained:

Rubrics: Problem Recognition (2) | Knowledge (2) | Completeness and accuracy (2) | Logic Building (2) | Ethics (2) | Total
Each rubric: Good (2) / Avg. (1)
Marks:
Experiment No - 3
Aim: To perform hands-on experiments of data preprocessing with sample data on the Orange tool.
Document the steps you take in Orange, including the specific preprocessing techniques, the parameters used, and the reasoning behind your choices.
Procedure:
1. Install and set up Orange.
2. Import your dataset.
3. Apply different preprocessing methods.
Demonstration of Tool:
The Preprocess widget preprocesses data with selected methods.
Inputs: Data (input dataset)
Outputs: Preprocessor (preprocessing method); Preprocessed Data (data preprocessed with the selected methods)

Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget offers several preprocessing methods that can be combined in a single preprocessing pipeline. Some methods are available as separate widgets, which offer advanced techniques and greater parameter tuning.
1. List of preprocessors. Double click the preprocessors you wish to use and shuffle their
order by dragging them up or down. You can also add preprocessors by dragging them from
the left menu to the right.
2. Preprocessing pipeline.
3. When the box is ticked (Send Automatically), the widget will communicate changes
automatically. Alternatively, click Send.
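Besides the visual workflow, the same kind of preprocessing can also be scripted. The sketch below uses Orange's Python API on one of its built-in sample datasets; it assumes an Orange3 installation, and the exact preprocessor parameters may differ between versions:

import Orange
from Orange.preprocess import Discretize, Normalize

# Load a sample dataset shipped with Orange.
data = Orange.data.Table("iris")

# Normalize the continuous features, then discretize them.
normalized = Normalize()(data)
discretized = Discretize()(normalized)
print(discretized.domain)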
⮚ Preprocessing techniques:
Conclusion: Orange is a powerful open-source data analysis and visualization tool for machine
learning and data mining tasks. It provides a wide variety of functionalities including data
visualization, data preprocessing, feature selection, classification, regression, clustering, and more.
Its user-friendly interface and drag-and-drop workflow make it easy for non-experts to work with and
understand machine learning concepts.
Quiz:
Suggested Reference:
1. J. Han, M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann
2. https://orangedatamining.com/docs/
Rubric wise marks obtained:
Marks:
Experiment No - 4
Aim: Implement Apriori algorithm of association rule data mining technique in any Programming
language.
Objectives: To implement basic logic for association rule mining algorithm with support and
confidence measures.
Equipment/Instruments: Personal Computer, open-source software for programming
Theory:
The Apriori algorithm is a classic and fundamental data mining algorithm used for discovering
association rules in transactional datasets.
Apriori is designed for finding associations or relationships between items in a dataset. It's
commonly used in market basket analysis and recommendation systems.
Apriori discovers frequent itemsets, which are sets of items that frequently co-occur in
transactions. A frequent itemset is a set of items that appears in a minimum number of
transactions, known as the "support threshold."
Support and Confidence: Support measures how often an itemset appears in the dataset, while
confidence measures how often a rule is true. High-confidence rules derived from frequent
itemsets are of interest.
Safety and necessary Precautions:
Ensure that your dataset is clean and free from missing values, outliers, and inconsistencies.
Procedure:
1. Import the dataset that you want to analyze for association rules.
2. Define the minimum support and confidence thresholds for the Apriori algorithm. These parameters control the minimum occurrence of itemsets and the minimum confidence level for rules.
3. Implement the Apriori algorithm to discover frequent itemsets.
4. Use the frequent itemsets obtained from the previous step to generate association rules.
Observation/Program:
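As a starting point, here is a minimal sketch of the level-wise Apriori search in Python. The toy transaction list and the support threshold are assumptions for illustration; a production implementation would also prune candidates using the Apriori property before counting:

# Toy transactions; in practice these come from the imported dataset.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
min_support = 0.4  # minimum fraction of transactions

def support(itemset):
    # Fraction of transactions that contain every item of the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Frequent 1-itemsets, then grow candidates one item at a time.
items = {frozenset([i]) for t in transactions for i in t}
frequent = [[s for s in items if support(s) >= min_support]]
k = 2
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k}
    frequent.append([s for s in candidates if support(s) >= min_support])
    k += 1

for level in frequent[:-1]:
    for s in level:
        print(set(s), "support =", round(support(s), 2))

# Confidence of a rule A -> B is support(A union B) / support(A).
conf = support(frozenset({"bread", "milk"})) / support(frozenset({"bread"}))
print("bread -> milk, confidence =", round(conf, 2))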
Observations:
Conclusion:
Quiz:
Suggested Reference:
Rubric wise marks obtained:

Rubrics: Problem Recognition (2) | Knowledge (2) | Completeness and accuracy (2) | Logic Building (2) | Ethics (2) | Total
Each rubric: Good (2) / Average (1)
Marks:
Experiment No - 5
Aim: Apply association rule data mining technique on sample data sets using WEKA.
Properly evaluate the generated association rules and avoid drawing incorrect conclusions.
Consider using various rule quality metrics to assess rule significance.
Procedure:
1. Install and set up WEKA.
2. Import your dataset.
3. Apply association rule data mining methods.
Demonstration of Tool:
Observations: NA
Conclusion:
Quiz:
(1) What is the WEKA tool?
(2) What is association analysis, and how can it be performed using the WEKA tool?
Suggested Reference:
Rubric wise marks obtained:
Marks
Experiment No - 6
Aim: Apply Classification data mining technique on sample data sets in WEKA.
Properly evaluate the classification rules and avoid drawing incorrect conclusions.
Procedure:
1. Install and set up WEKA.
2. Import your dataset.
3. Apply a classification data mining technique.
Demonstration of Tool:
//Demo
Conclusion:
// Write conclusion here
Quiz:
1) What is classification, and how can it be performed using the WEKA tool?
2) What are the key evaluation metrics used to assess the accuracy of a model in WEKA?
Suggested Reference:
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Tool Usage (2) | Demonstration (2) | Problem Recognition (2) | Ethics (2) | Total
Each rubric: Good (2) / Average (1)
Marks:
Experiment No - 7
Aim: 7.1 Implement a classification technique with quality measures in any programming language.
7.2 Implement a regression technique in any programming language.

Objectives:
(a) To evaluate the quality of the classification model using accuracy and the confusion matrix.
(b) To evaluate the regression model.
Theory:
Classification
The classification algorithm is a supervised learning technique used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets, labels, or categories.
Unlike regression, the output variable of classification is a category, not a value, such as "Green or Blue" or "fruit or animal". Since the classification algorithm is a supervised learning technique, it takes labeled input data, meaning the input comes with the corresponding output.
In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
y = f(x), where y = categorical output
The best-known example of an ML classification algorithm is an email spam detector. The main goal of a classification algorithm is to identify the category of a given dataset, and such algorithms are mainly used to predict the output for categorical data. Classification can be pictured as two classes, Class A and Class B, where the members of each class share features with one another and differ from the members of the other class.
Quality measures for classification:
1. Accuracy: the proportion of correctly classified instances out of all instances.
2. Confusion matrix: a table that summarizes the counts of correct and incorrect predictions for each class (true positives, true negatives, false positives, and false negatives).
3. AUC-ROC curve:
ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
It is a graph that shows the performance of the classification model at different thresholds.
To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis
and FPR(False Positive Rate) on X-axis.
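As a sketch of computing these quality measures in Python (scikit-learn and its built-in breast cancer dataset are illustrative choices, not prescribed by the manual):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Illustrative binary classification task.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
# AUC is computed from predicted probabilities of the positive class.
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))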
Use cases of Classification Algorithms
Classification algorithms can be used in many places. Some popular use cases are:
Email spam detection
Speech recognition
Identification of cancer tumor cells
Drug classification
Biometric identification, etc.
Regression
Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us understand how the value of the dependent variable changes with respect to one independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A that runs various advertisements every year and obtains sales accordingly. The list below shows the advertisements made by the company in the last 5 years and the corresponding sales. Now the company wants to spend $200 on advertising in 2019 and wants to predict the sales for that year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique that helps find the correlation between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining cause-and-effect relationships between variables.
In regression, we plot a graph between the variables that best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through all the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
o Dependent Variable: The main factor in regression analysis that we want to predict or understand is called the dependent variable. It is also called the target variable.
o Independent Variable: The factors that affect the dependent variable, or that are used to predict its values, are called independent variables, also called predictors.
o Outliers: An outlier is an observation that contains either a very low or a very high value in comparison to the other observed values. An outlier may hamper the result, so it should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other, this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most influential variables.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, the problem is called overfitting. If our algorithm does not perform well even with the training dataset, the problem is called underfitting.
o Regression analysis helps in the prediction of a continuous variable. There are various real-world scenarios where we need future predictions, such as weather conditions, sales prediction, and marketing trends; for such cases we need a technique that can make predictions accurately.
o Regression estimates the relationship between the target and the independent variables.
o It is used to find trends in data.
o It helps to predict real/continuous values.
o By performing regression, we can determine the most important factor, the least important factor, and how the factors affect one another.
Types of Regression
There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:
o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression
o Random Forest Regression
o Ridge Regression
o Lasso Regression
Choose a classification or regression algorithm that is suitable for your task, such as decision trees, logistic regression, support vector machines, or neural networks; a regression sketch follows below.
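For the regression part, a comparable sketch (scikit-learn's diabetes dataset, linear regression, and MSE/R^2 as quality measures are illustrative assumptions):

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative regression task: predict a continuous target.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))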
Procedure:
1. Load and preprocess the dataset (cleaning, encoding, and splitting).
2. Choose a classification or regression algorithm.
3. Train the model on the training data.
4. Test the model on the test data.
5. Evaluate the model using the quality measures described above.
Observations/Program:
// Write your code and output here
Conclusion:
Quiz:
Rubric wise marks obtained:

Rubrics: Problem Recognition (2) | Knowledge (2) | Completeness and accuracy (2) | Logic Building (2) | Ethics (2) | Total
Each rubric: Good (2) / Average (1)
Marks:
Experiment No - 8
Aim: Apply the K-means clustering algorithm in any programming language.

Theory:
K means Clustering
Unsupervised learning is the process of teaching a computer to use unlabeled, unclassified data, enabling the algorithm to operate on that data without supervision. Without any prior training on the data, the machine's job in this case is to organize unsorted data according to similarities, patterns, and differences.
The goal of clustering is to divide the population or set of data points into a number of groups
so that the data points within each group are more comparable to one another and different
from the data points within the other groups. It is essentially a grouping of things based on
how similar and different they are to one another.
We are given a data set of items, with certain features, and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the
K-means algorithm; an unsupervised learning algorithm. ‘K’ in the name of the algorithm
represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we
will use the Euclidean distance as a measurement.
The algorithm works as follows:
First, we randomly initialize k points, called means or cluster centroids.
We categorize each item to its closest mean and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
We repeat the process for a given number of iterations and at the end, we have our
clusters.
The “points” mentioned above are called means because they are the mean values of
the items categorized in them. To initialize these means, we have a lot of options. An
intuitive method is to initialize the means at random items in the data set.
Procedure:

Initialize k means with random values

For a given number of iterations:
    Iterate through the items:
        Find the mean closest to the item by calculating the Euclidean distance of the item to each of the means
        Assign the item to that mean
    Update each mean by shifting it to the average of the items assigned to its cluster
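A minimal sketch of this procedure in plain Python follows; the 2-D sample points and the fixed iteration count are assumptions for brevity:

import math
import random

def kmeans(points, k, iterations=100):
    # 1. Initialize k means at random items in the data set.
    means = random.sample(points, k)
    for _ in range(iterations):
        # 2. Assign each item to its closest mean (Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[idx].append(p)
        # 3. Update each mean to the average of the items in its cluster.
        for i, c in enumerate(clusters):
            if c:
                means[i] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return means, clusters

points = [(1, 2), (1, 4), (2, 3), (8, 8), (9, 10), (10, 9)]
means, clusters = kmeans(points, k=2)
print("Means:", means)
print("Clusters:", clusters)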
Program:
// Write your code here
Observations:
Conclusion:
Quiz:
Suggested Reference:
Rubric wise marks obtained:

Rubrics: Problem Recognition (2) | Knowledge (2) | Completeness and accuracy (2) | Logic Building (2) | Ethics (2) | Total
Each rubric: Good (2) / Average (1)
Marks:
Experiment No - 9
Aim: Perform a hands-on experiment on any advanced mining technique using an appropriate tool.
Objectives:
1) Improve users' understanding of advanced mining techniques such as text mining, stream mining, and web content mining using an appropriate tool
2) Familiarize with the tool
Equipment/Instruments: Write your tool name here
Demonstration of Tool:
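As one concrete possibility, here is a small text-mining sketch using scikit-learn's TF-IDF vectorizer; the three sample documents are assumptions, and any document collection would work:

from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus; real text mining would use a larger document collection.
docs = [
    "data mining finds patterns in large data sets",
    "text mining extracts information from documents",
    "web content mining analyzes pages on the web",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))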
Observations: NA
Conclusion:
Quiz:
Suggested Reference:
Rubric wise marks obtained:
Marks:
Experiment No - 10
Aim: Solve a real-world problem using data mining techniques in the Python programming language.
Date: // Write the date of the experiment here
Theory:
Program:
// Write your steps and their code here step by step
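One possible shape for such a solution is sketched below on a hypothetical customer-churn CSV; the file name and column names are assumptions, to be replaced by the chosen real-world dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Load the data (hypothetical file and columns).
df = pd.read_csv("customers.csv")

# 2. Preprocess: drop rows with missing values, one-hot encode categoricals.
df = df.dropna()
X = pd.get_dummies(df.drop(columns=["churned"]))
y = df["churned"]

# 3. Split, train, and evaluate.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))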
Observations:
Conclusion:
Quiz:
1) What other techniques can be used to solve your system's problem?
Rubric wise marks obtained:

Rubrics: Knowledge (2) | Teamwork (2) | Completeness and accuracy (2) | Logic Building (2) | Ethics (2) | Total
Each rubric: Good (2) / Average (1)
Marks: