0% found this document useful (0 votes)

44 views43 pages

Data Mining. VBV JJJ Ldce Vgec

This laboratory manual for Data Mining (3160714) is designed for B.E. Semester 6 Computer Engineering students, emphasizing competency-focused, outcome-based learning through practical experiments. It outlines the importance of data mining across various industries and provides guidelines for both faculty and students to enhance practical skills and safety. The manual includes course outcomes, program objectives, and a structured approach to conducting experiments related to data preprocessing, association rules, classification, and clustering.

Uploaded by

tanishvekariya1809

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

44 views43 pages

Data Mining. VBV JJJ Ldce Vgec

Uploaded by

tanishvekariya1809

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 43

Degree Engineering

A Laboratory Manual for

Data Mining
(3160714)

B.E. Semester 6(Computer

Engineering)

Institute logo

Directorate of Technical Education, Gandhinagar,

Gujarat
Preface

Main motto of any laboratory/practical/field work is for enhancing required skills as well as
creating ability amongst students to solve real time problem by developing relevant competencies
in psychomotor domain. By keeping in view, GTU has designed competency focused outcome-
based curriculum for engineering degree programs where sufficient weightage is given to
practical work. It shows importance of enhancement of skills amongst the students and it pays
attention to utilize every second of time allotted for practical amongst students, instructors and
faculty members to achieve relevant outcomes by performing the experiments rather than having
merely study type experiments. It is must for effective implementation of competency focused
outcome-based curriculum that every practical is keenly designed to serve as a tool to develop
and enhance relevant competency required by the various industry among every student. These
psychomotor skills are very difficult to develop through traditional chalk and board content
delivery method in the classroom. Accordingly, this lab manual is designed to focus on the
industry defined relevant outcomes, rather than old practice of conducting practical to prove
concept and theory.

By using this lab manual students can go through the relevant theory and procedure in advance
before the actual performance which creates an interest and students can have basic idea prior to
performance. This in turn enhances pre-determined outcomes amongst students. Each experiment
in this manual begins with competency, industry relevant skills, course outcomes as well as
practical outcomes (objectives). The students will also achieve safety and necessary precautions
to be taken while performing practical.

This manual also provides guidelines to faculty members to facilitate student centric lab activities
through each experiment by arranging and managing necessary resources in order that the
students follow the procedures with required safety and necessary precautions to achieve the
outcomes. It also gives an idea that how students will be assessed by providing rubrics.

Data mining is a key to sentiment analysis, price optimization, database marketing, credit risk
management, training and support, fraud detection, healthcare and medical diagnoses, risk
assessment, recommendation systems and much more. It can be an effective tool in just about any
industry, including retail, wholesale distribution, service industries, telecom, communications,
insurance, education, manufacturing, healthcare, banking, science, engineering, and online
marketing or social media.
Utmost care has been taken while preparing this lab manual however always there is chances of
improvement. Therefore, we welcome constructive suggestions for improvement and removal of
errors if any.
Vishwakarma Government Engineering College
Department of Computer Engineering

CERTIFICATE

This is to certify that Mr./Ms. _____________________________________________ Enrollment

No. _______ of B.E. Semester from Computer Engineering Department of

this Institute (GTU Code: 017) has satisfactorily completed the Practical / Tutorial work for the

subject Data Mining (3160714) for the academic year 2022-23.

Place: ___________

Date: ___________

Signature of Course Faculty Head of the Department

1
Data Mining (3160714) Enrolment No

DTE’s Vision

 To provide globally competitive technical education

 Remove geographical imbalances and inconsistencies
 Develop student friendly resources with a special focus on girls’ education and support to
weaker sections
 Develop programs relevant to industry and create a vibrant pool of technical professionals

Institute’s Vision

 To create an ecosystem for proliferation of socially responsible and technically sound

engineers, innovators and entrepreneurs.

Institute’s Mission

 To develop state-of-the-art laboratories and well-equipped academic infrastructure.

 To motivate faculty and staff for qualification up-gradation, and enhancement of subject
knowledge.
 To promote research, innovation and real-life problem-solving skills.
 To strengthen linkages with industries, academic and research organizations.
 To reinforce concern for sustainability, natural resource conservation and social
responsibility.

Department’s Vision

 To create an environment for providing value-based education in Computer Engineering

through innovation, team work and ethical practices.

Department’s Mission

 To produce computer engineering graduates according to the needs of industry,

government, society and scientific community.
 To develop state of the art computing facilities and academic infrastructure.
 To develop partnership with industries, government agencies and R & D organizations for
knowledge sharing and overall development of faculties and students.
 To solve industrial, governance and societal issues by applying computing techniques.
 To create environment for research and entrepreneurship.
Data Mining (3160714) Enrolment No

Programme Outcomes (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering

fundamentals, and an engineering specialization to the solution of complex engineering
problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
Data Mining (3160714) Enrolment No

Program Specific Outcomes (PSOs)

 Sound knowledge of fundamentals of computer science and engineering including software

and hardware.
 Develop the software using sound software engineering principles having web based/mobile
based interface.
 Use various tools and technology supporting modern software frameworks for solving
problems having large volume of data in the domain of data science and machine learning.

Program Educational Objectives (PEOs)

 Possess technical competence in solving real life problems related to Computing.

 Acquire good analysis, design, development, implementation and testing skills to formulate
simple computing solutions to the business and societal needs.
 Provide requisite skills to pursue entrepreneurship, higher studies, research, and
development and imbibe high degree of professionalism in the fields of computing.
 Embrace life-long learning and remain continuously employable.
 Work and excel in a highly competence supportive, multicultural and professional
environment which abiding to the legal and ethical responsibilities.
Data Mining (3160714) Enrolment No

Practical – Course Outcome matrix

Course Outcomes (COs):
CO_3160714.1 Perform the preprocessing of data and apply mining techniques on it.
CO_3160714.2 Identify the association rules, classification, and clusters in large data sets.
CO_3160714.3 Solve real-world problems in business and scientific information using data
mining.
CO_3160714.4 Use data analysis tools for scientific applications.
CO_3160714.5 Implement various supervised machine learning algorithms.

Sr. CO CO CO CO CO
Objective(s) of Experiment
No. 1 2 3 4 5
Identify how data mining is an interdisciplinary field by an
1. √
Application.
Write programs to perform the following tasks of
preprocessing (any language).
2.1 Noisy data handling
 Equal Width Binning
 Equal Frequency/Depth Binning
2. 2.2 Normalization Techniques √
 Min max normalization
 Z score normalization
 Decimal scaling
2.3. Implement data dispersion measure Five
Number Summary generate box plot using python
libraries
To perform hand on experiments of data preprocessing with
3. √ √
sample data on Orange tool.
Implement Apriori algorithm of association rule data
4. mining technique in any √
Programming language.
Apply association rule data mining technique on sample
5. √ √
data sets using WEKA.
Apply Classification data mining technique on sample data
6. √ √
sets in WEKA.
7.1. Implement Classification technique with quality
Measures in any Programming language.
7. √
7.2 Implement Regression technique in any
Programming language.
Apply K-means Clustering Algorithm any Programming
8. √ √
language.
Perform hands on experiment on any advance mining
9. √
Techniques Using Appropriate Tool.
Solve Real world problem using Data Mining Techniques
10. √
using Python Programming Language.
Data Mining (3160714) Enrolment No

Guidelines for Faculty members

1. Teacher should provide the guideline with demonstration of practical to the students
with all features.
2. Teacher shall explain basic concepts/theory related to the experiment to the students before
starting of each practical
3. Involve all the students in performance of each experiment.
4. Teacher is expected to share the skills and competencies to be developed in the
students and ensure that the respective skills and competencies are developed in the
students after the completion of the experimentation.
5. Teachers should give opportunity to students for hands-on experience after the
demonstration.
6. Teacher may provide additional knowledge and skills to the students even though not
covered in the manual but are expected from the students by concerned industry.
7. Give practical assignment and assess the performance of students based on task
assigned to check whether it is as per the instructions or not.
8. Teacher is expected to refer complete curriculum of the course and follow the
guidelines for implementation.

Instructions for Students

1. Students are expected to carefully listen to all the theory classes delivered by the faculty
members and understand the COs, content of the course, teaching and examination scheme,
skill set to be developed etc.
2. Students will have to perform experiments as per practical list given.
3. Students have to show output of each program in their practical file.
4. Students are instructed to submit practical list as per given sample list shown on next page.
5. Student should develop a habit of submitting the experimentation work as per the schedule
and s/he should be well prepared for the same.

Common Safety Instructions

Students are expected to

1) switch on the PC carefully (not to use wet hands)

2) shutdown the PC properly at the end of your Lab
3) carefully handle the peripherals (Mouse, Keyboard, Network cable etc)
4) use Laptop in lab after getting permission from Teacher
Data Mining (3160714) Enrolment No

Index
(Progressive Assessment Sheet)

Sr. Objective(s) of Experiment Page Date Date of Assessme Sign. of Rema

No. No. of submis nt Teacher rks
perfor sion Marks with
mance date
1 Identify how data mining is an interdisciplinary
field by an Application.
2 Write programs to perform the following tasks of
preprocessing (any language).
2.1 Noisy data handling
 Equal Width Binning
 Equal Frequency/Depth Binning
2.2 Normalization Techniques
 Min max normalization
 Z score normalization
 Decimal scaling
2.3. Implement data dispersion measure Five
Number Summary generate box plot using
python libraries
3 To perform hand on experiments of data
preprocessing with sample data on Orange tool.
4 Implement Apriori algorithm of association rule
data mining technique in any Programming
language.
5 Apply association rule data mining technique on
sample data sets using WEKA.
6 Apply Classification data mining technique on
sample data sets in WEKA .
7 7.1. Implement Classification technique with
quality Measures in any Programming language.
7.2. Implement Regression technique in any
Programming language.
8 Apply K-means Clustering Algorithm any
Programming language.
9 Perform hands on experiment on any advance
mining Techniques Using Appropriate Tool.
10 Solve Real world problem using Data Mining
Techniques using Python Programming
Language.
Total
Data Mining (3160714) Enrolment No

Experiment No - 1
Aim: Identify how data mining is an interdisciplinary field by an Application.

Data mining is an interdisciplinary field that involves computer science, statistics, mathematics, and
domain-specific knowledge. One application that showcases the interdisciplinary nature of data
mining

Date: // Write date of experiment here

Competency and Practical Skills: Understanding and Analyzing

Relevant CO: CO3

Objectives: (a) To understand the application of domain

(b) To understand Preprocessing Techniques.
(c) To understand the application's use of Data Mining functionalities.
.
Equipment/Instruments: Personal Computer

Theory:

System Name: Movie recommendation systems

Movie recommendation systems are a great example of how data mining is an interdisciplinary field
involving multiple expertise areas. The process of creating a movie recommendation system involves
the following steps:

Dataset: A movie recommendation system requires a dataset that contains information about movies
and their attributes, as well as information about users and their preferences. Here are some Examples
of dataset.

 MovieLens: This is a dataset of movie ratings collected by the GroupLens research group at
the University of Minnesota. It contains over 27,000 movies and 21 million ratings from
around 250,000 users.
 Netflix Prize: This is a dataset of movie ratings collected by the online movie rental company
Netflix. It contains over 100 million ratings from around 480,000 users on 17,000 movies.
 IMDb: This is a dataset of movie metadata collected by the Internet Movie Database (IMDb).
It contains information about over 600,000 movies, including titles, cast and crew, plot
summaries, ratings, and reviews.

Preprocessing: It involves cleaning and transforming the data to make it suitable for analysis. Here
are some preprocessing techniques commonly used in movie recommendation systems:

Page No
Data Mining (3160714) Enrolment No
 Data Cleaning: This involves removing missing or irrelevant data, correcting errors, and
removing duplicates. For example, if a movie has missing information such as the release date,
it may be removed from the dataset or the information may be imputed.
 Data Normalization: This involves scaling the data to a common range or standard deviation.
For example, ratings from different users may be normalized to a common scale of 1-5.
 Data Transformation: This involves transforming the data into a format suitable for analysis.
For example, movie genres may be encoded as binary variables to enable analysis using
machine learning algorithms.
 Feature Generation: This involves creating new features from the existing data that may be
useful for analysis. For example, the average rating for each movie may be calculated based
on user ratings, or the number of times a movie has been watched may be counted.
 Data Reduction: This involves reducing the dimensionality of the data to improve processing
efficiency and reduce noise. For example, principal component analysis (PCA) may be used
to identify the most important features in the dataset.

These preprocessing techniques help to ensure that the data is clean, normalized, and transformed in
a way that enables accurate analysis and prediction of user preferences for movie recommendations.

Data Mining Techniques: Association rule mining, clustering, and classification are all data mining
techniques that can be applied to movie recommendation systems. Here is a brief overview of how
each of these techniques can be used:

 Association Rule Mining: This technique can be used to identify patterns and relationships
between movies that can help with recommendations. For example, if users who watch
romantic comedies also tend to watch dramas, then a recommendation system can use this
association to suggest dramas to users who have watched romantic comedies. Association rule
mining can also be used to identify frequent item sets, which can help identify popular movies
or genres.
 Clustering: This technique can be used to group movies and users based on their
characteristics. For example, movies can be clustered based on genre, rating, release year, or
other attributes. Users can also be clustered based on their movie preferences or ratings.
Clustering can help with making personalized recommendations by identifying groups of
similar users or movies.
 Classification: This technique can be used to predict the preferences or ratings of users for a
particular movie. For example, a classification algorithm can be trained on past user ratings
to predict whether a new user will like a particular movie or not. Classification algorithms can
also be used to predict the genre or rating of a new movie based on its characteristics.

Safety and necessary Precautions:

Choose appropriate system to show how mining is an interdisiplinary field.

Procedure:
1. Select the domain.
2. Selection of particular system .
3. Preprocessing used on system.
4. Mining techniques applied on the system.

Observation/Program:

Page No
Data Mining (3160714) Enrolment No

In a movie recommendation system, data mining techniques are used to analyze large amounts of
data about movies and users, and generate accurate and personalized recommendations.

Conclusion: Movie recommendation systems provide a compelling example of how data mining is
an interdisciplinary field.

Quiz:

(1) What are the different preprocessing techniques can be applied on dataset?
(2) What is the use of data mining techniques on particular system?

Suggested References:

1. Han, J., & Kamber, M. (2011). Data mining: concepts and techniques.
2. https://www.kaggle.com/code/rounakbanik/movie-recommender-systems

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

//Rubric Example

Problem Completeness
Knowledge Team Work
Recognition and accuracy Ethics (2)
Rubrics (2) (2) Total
(2) (2)
Good Avg. Good Avg. Good Avg. Good Avg. Good Avg.
(2) (1) (2) (1) (2) (1) (2) (1) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 2
Aim: Write programs to perform the following tasks of preprocessing (any language).
2.1 Noisy data handling
 Equal Width Binning
 Equal Frequency/Depth Binning
2.2 Normalization Techniques
 Min max normalization
 Z score normalization
 Decimal scaling
2.3. Implement data dispersion measure Five Number Summary generate box plot using
python libraries

Date: // Write date of experiment here

Competency and Practical Skills: Programming and statistical methods

Relevant CO: CO1

Objectives: (a) To understand Basic Preprocessing Techniques and statistical Measures.

(b) To show how to implement Preprocessing Techniques.
(c) To show how to use different Python Libraries to implement Techniques.
.
Equipment/Instruments: Personal Computer, open-source software for programming

Theory:

2.1 Noisy data handling

Equal Width Binning
Equal Frequency/Depth Binning

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

• Faulty Data Collection Instruments

• Data Entry Problems
• Data Transmission Problems
• Technology Limitation
• Inconsistency in Naming Convention

Binning: Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they perform local smoothing.

Data Preprocessing Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Page No
Data Mining (3160714) Enrolment No
Partition into (equal-frequency)
bins:
Bin 1: 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34

Smoothing by bin means:

Bin 1: 9, 9, 9
Bin 2: 22, 22, 22
Bin 3: 29, 29, 29

Smoothing by median:
Bin 1: 8, 8, 8
Bin 2: 21, 21, 21
Bin 3: 28, 28, 28

Smoothing by bin boundaries:

Bin 1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34

Equal Width Binning :

bins have equal width with a range of each bin are defined as [min + w], [min + 2w] …. [min + nw]
where w = (max – min) / (N)
Example :
5, 10, 11, 13, 15, 35, 50 ,55, 72, 92, 204, 215

W=(215-5)/3=70

bin1: 5,10,11,13,15,35,50,55,72 I.e. all values between 5 and 75

bin2: 92 I.e. all values between 75 and 145
bin3: 204,215 I.e. all values between 145 and 215
.
2.2.Normalization Techniques
Min max normalization
Z score normalization
Decimal scaling

Normalization techniques are used in data preprocessing to scale numerical data to a common range.
Here are three commonly used normalization techniques:

The measurement unit used can affect the data analysis. For example, changing measurement units
from meters to inches for height, or from kilograms to pounds for weight, may lead to very different
results. In general, expressing an attribute in smaller units will lead to a larger range for that attribute,
and thus tend to give such an attribute greater effect or “weight.” To help avoid dependence on the
choice of measurement units, the data should be normalized or standardized. This involves

Page No
Data Mining (3160714) Enrolment No
transforming the data to fall within a smaller or common range such as [−1,1] or [0.0, 1.0]. (The terms
standardize and normalize are used interchangeably in data preprocessing, although in statistics, the
latter term also has other connotations.) Normalizing the data attempts to give all attributes an equal
weight. Normalization is particularly useful for classification algorithms involving neural networks
or distance measurements such as nearest-neighbor classification and clustering. If using the neural
network back propagation algorithm for classification mining. There are many methods for data
normalization. We Focus on min-max normalization, z-score normalization, and normalization by
decimal scaling.

Min-Max Normalization: This technique scales the data to a range of 0 to 1. The formula for min-
max normalization is:
X_norm = (X - X_min) / (X_max - X_min)
where X is the original data, X_min is the minimum value in the dataset, and X_max is the maximum
value in the dataset.

Z-Score Normalization: This technique scales the data to have a mean of 0 and a standard deviation
of 1. The formula for z-score normalization is:
X_norm = (X - X_mean) / X_std
where X is the original data, X_mean is the mean of the dataset, and X_std is the standard deviation
of the dataset.

Decimal Scaling: This technique scales the data by moving the decimal point a certain number of
places to the left or right. The formula for decimal scaling is:
X_norm = X / 10^j
where X is the original data and j is the number of decimal places to shift.

2.3. Implement data dispersion measure Five Number Summary generate box plot using
python libraries

Five Number Summary

Descriptive Statistics involves understanding the distribution and nature of the data. Five
number summary is a part of descriptive statistics and consists of five values and all these
values will help us to describe the data.
The minimum value (the lowest value)
25th Percentile or Q1
50th Percentile or Q2 or Median
75th Percentile or Q3
Maximum Value (the highest value)

Let’s understand this with the help of an example. Suppose we have some data such as:
11,23,32,26,16,19,30,14,16,10

Here, in the above set of data points our Five Number Summary are as follows:
First of all, we will arrange the data points in ascending order and then calculate the
summary: 10,11,14,16,16,19,23,26,30,32

Minimum value: 10
25th Percentile: 14
Calculation of 25th Percentile: (25/100)*(n+1) = (25/100)*(11) = 2.75 i.e 3rd value of the data
50th Percentile : 17.5
Page No
Data Mining (3160714) Enrolment No
Calculation of 50th Percentile: (16+19)/2 = 17.5
75th Percentile : 26
Calculation of 75th Percentile: (75/100)*(n+1) = (75/100)*(11) = 8.25 i.e 8th value of the
data

Box plots

Boxplots are the graphical representation of the distribution of the data using Five Number
summary values. It is one of the most efficient ways to detect outliers in our dataset.

Fig Box plot using Five Number Summary

In statistics, an outlier is a data point that differs significantly from other observations. An
outlier may be due to variability in the measurement or it may indicate experimental error;
the latter are sometimes excluded from the dataset. An outlier can cause serious problems in
statistical analyses.

Safety and necessary Precautions:

Prior to preprocessing, thoroughly clean the data by handling missing values and outliers. Incorrect
handling of noisy data can lead to skewed results.

Procedure:
1. Collect raw data from various sourcesSelection of particular system .
2. Handle missing values and outliers.
3. Equal Width Binning:
Divide data into bins of equal width.
4. Equal Frequency/Depth Binning:
Divide data into bins of equal frequency.

5. Min-Max Normalization:
Scale data to a specific range (e.g., [0, 1]).

6. Z-Score Normalization:
Standardize data to have a mean of 0 and standard deviation of 1.

7. Decimal Scaling:
Scale data by a power of 10.

8. Data Dispersion Measure:

Calculate the five-number summary (minimum, Q1, median, Q3, maximum).

9. Box Plot Generation:

Create a box plot to visualize data distribution.
Page No
Data Mining (3160714) Enrolment No

Observation/Program:
Code:
#2.1 Noisy Data Handling
# Generate random Numbers
import random
data = random.sample(range(10, 100), 20)
data = sorted(data)
print("Random data sample: ", data)

# Numbers of bins user wants

bins = int(input('Enter the number of bins: '))

#1. Equal width binning

equal_width = []
min_val = min(data)
max_val = max(data)
diff_val = (max_val - min_val) // bins

def range_val(j, limit):

d = []
while data[j] <= limit:
d.append(data[j])
j = j+1
return j, d

j=0
for i in range(1, bins+1):
j, val = range_val(j, min_val + (i * diff_val))
equal_width.append(val)

print("Equal Width : ")

print(equal_width)

#2. Equal Frequency/Depth Binning

equal_freq = []

for i in range(1, bins+1):

size_bin = len(data) // bins
start = (i-1) * size_bin
stop = start + size_bin # (i * (size_bin-1)) + i
equal_freq.append(data[start: stop])

print("Equal frequency : ")

print(equal_freq)

# apply smoothing techniques by means, by median and bin boundaries.

from statistics import mean, median

Page No
Data Mining (3160714) Enrolment No
def smooth_mean(data):
smooth_data = []
for i in range(bins):
mean_data = mean(data[i])
smooth_data.append([mean_data for j in range(len(data[i]))])
return smooth_data

def smooth_median(data):
smooth_data = []
for i in range(bins):
median_data = median(data[i])
smooth_data.append([median_data for j in range(len(data[i]))])
return smooth_data

def smooth_bound(data):
smooth_data = []
for i in range(bins):
d = []
d.append(data[i][0])
for j in range(1, len(data[i])-1):
min_d = min(data[i])
max_d = max(data[i])
if (data[i][j] - min_d) <= (max_d - data[i][j]):
d.append(min_d)
else:
d.append(max_d)
d.append(data[i][-1])
smooth_data.append(d)
return smooth_data

print("Smooth mean for Equal frequency : ")

print(smooth_mean(equal_freq))
print("Smooth mean for Equal Width : ")
print(smooth_mean(equal_width))
print("Smooth median for Equal frequency : ")
print(smooth_median(equal_freq))
print("Smooth median for Equal Width : ")
print(smooth_median(equal_width))
print("Smooth bound for Equal frequency : ")
print(smooth_bound(equal_freq))
print("Smooth bound for Equal Width : ")
print(smooth_bound(equal_width))

2.2 Normalization Techniques

1) Min max normalization

2) Z score normalization
3) Decimal scaling

Code:
Page No
Data Mining (3160714) Enrolment No
# 2.2 Normalization Techniques
# Min max normalization
# Z score normalization
# Decimal scaling
from statistics import mean, stdev
import random

data = random.sample(range(100, 1000), 20)

data = sorted(data)
print("Sample data: ")
print(data)

min_data = min(data)
max_data = max(data)

# 1. min-max normalization

def min_max(val, new_min, new_max):

return round(((val-min_data)/(max_data-min_data)) * (new_max-new_min) + new_min, 2)

new_min = 0.0
new_max = 1.0

min_max_norm = [min_max(i, new_min, new_max) for i in data]

print('Min-max norm: ')
print(min_max_norm)

# 2. Z score normalization

def z_score(val, mean, std):

return round((val-mean) / std, 2)

data_mean = mean(data)
data_std = stdev(data)

z_norm = [z_score(i, data_mean, data_std) for i in data]

print('Z norm: ')
print(z_norm)

# 3. Decimal Scaling

def dec_scale(n, j):

return round(n / (10**j), 2)

abs_max_data = max(list(map(abs, data)))

len_abs_max_data = len(str(abs_max_data))

dec_norm = [dec_scale(i, len_abs_max_data) for i in data]

print('Decimal Scaling norm: ')
print(dec_norm)

# 2.3 Implement data dispersion measure Five Number Summary generate box plot using python
libraries.

Page No
Data Mining (3160714) Enrolment No

import matplotlib.pyplot as plt

import seaborn as sns

if len(data)%2 == 0:
ind = len(data) // 2
else:
ind = (len(data)+1) // 2

Q1 = median(data[: ind+1])
Q2 = median(data)
Q3 = median(data[ind+1: ])

print('Five Number Summary')

print('Minimum value: ', min_data)
print('Q1 (25%): ', Q1)
print('Q2 (50%)(Median):', Q2)
print('Q3 (75%): ', Q3)
print('Maximum value: ', max_data)

sns.boxplot(data)
plt.show()

2.1 Noisy Data Handling

Output:

2.2 Normalization Techniques

Output:

Page No
Data Mining (3160714) Enrolment No
2.3 Implement data dispersion measure Five Number Summary
Output:

Conclusion:

Binning, Normalization techniques and the five number summary are both important tools in data
preprocessing that help prepare data for data mining tasks.

Quiz:
(1) What is Five Number summary? How to generate box plot using Python Libraries?
(2) What is Normalization techniques?
(3) What are the different smoothing techniques?

Suggested Reference:

J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Problem Completeness
Knowledge Logic
Recognition and accuracy Ethics (2)
Rubrics (2) Building (2) Total
(2) (2)
Good Avg. Good Avg. Good Avg. Good Avg. Good Avg.
(2) (1) (2) (1) (2) (1) (2) (1) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 3
Aim: To perform hand on experiments of data preprocessing with sample data on Orange
tool.

Date: // Write date of experiment here

Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO1 & CO4

Objectives: 1) improve users' understanding of data preprocessing techniques

2) Familiarize them with the tool
.
Equipment/Instruments: Orange tool

Safety and necessary Precautions:

Document the steps you take in Orange, including the specific preprocessing techniques,
parameters used, and the reasoning behind your choices.

Procedure:
1. Install and set up Orange.
2. Import your dataset.
3. Apply different pre processing methods .

Demonstration of Tool:

Data Preprocessing With Orange tool

Preprocesses data with selected methods. Inputs Data: input dataset Outputs Preprocessor:
preprocessing method Preprocessed Data: data preprocessed with selected methods
Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget
offers several preprocessing methods that can be combined in a single preprocessing
pipeline. Some methods are available as separate widgets, which offer advanced techniques
and greater parameter tuning.

Page No
Data Mining (3160714) Enrolment No

Fig Handling Missing Values

1. List of preprocessors. Double click the preprocessors you wish to use and shuffle their
order by dragging them up or down. You can also add preprocessors by dragging them from
the left menu to the right.
2. Preprocessing pipeline.
3. When the box is ticked (Send Automatically), the widget will communicate changes
automatically. Alternatively, click Send.

Page No
Data Mining (3160714) Enrolment No
⮚ Preprocessed Technique :

[ Fig: Descrete Continuous variables -> Most Frequent is base Used ]

⮚ Data Table of Preprocessed Data

Page No
Data Mining (3160714) Enrolment No

Conclusion: Orange is a powerful open-source data analysis and visualization tool for machine
learning and data mining tasks. It provides a wide variety of functionalities including data
visualization, data preprocessing, feature selection, classification, regression, clustering, and more.
Its user-friendly interface and drag-and-drop workflow make it easy for non-experts to work with and
understand machine learning concepts.
Quiz:

(1) What is the purpose of Orange's Preprocess method?

(2) What is the use of orange tool?

Suggested Reference:
1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufman
2. https://orangedatamining.com/docs/
Page No
Data Mining (3160714) Enrolment No

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Problem Tool Usage/

Knowledge Communicati
Recognition Demonstrati Ethics (2)
Rubrics (2) on Skill (2) Total
(2) on (2)
Good Average Good Average Good Average Good Average Good Average
(2) (1) (2) (1) (2) (1) (2) (1) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 4

Aim: Implement Apriori algorithm of association rule data mining technique in any Programming
language.

Date: // Write date of experiment here

Competency and Practical Skills: Logic building, Programming and Analyzing

Relevant CO: CO2

Objectives: To implement basic logic for association rule mining algorithm with support and
confidence measures.
.
Equipment/Instruments: Personal Computer, open-source software for programming

Theory:
The Apriori algorithm is a classic and fundamental data mining algorithm used for discovering
association rules in transactional datasets.

 Apriori is designed for finding associations or relationships between items in a dataset. It's
commonly used in market basket analysis and recommendation systems.
 Apriori discovers frequent itemsets, which are sets of items that frequently co-occur in
transactions. A frequent itemset is a set of items that appears in a minimum number of
transactions, known as the "support threshold."
 Support and Confidence: Support measures how often an itemset appears in the dataset, while
confidence measures how often a rule is true. High-confidence rules derived from frequent
itemsets are of interest.

 Apriori uses an iterative approach to progressively discover frequent itemsets of increasing

size. It starts with finding frequent 1-itemsets, then 2-itemsets, and so on.
 The algorithm employs pruning techniques to reduce the number of candidate itemsets that
need to be checked, making it more efficient.
 Apriori is widely used in retail for market basket analysis. It helps retailers understand which
products are often purchased together, allowing for optimized store layouts, targeted
marketing, and product recommendations.

Page No
Data Mining (3160714) Enrolment No
Safety and necessary Precautions:

Ensure that your dataset is clean and free from missing values, outliers, and inconsistencies.

1. Procedure:
2. Import the dataset that you want to analyze for association rules.
3. Define the minimum support and confidence thresholds for the Apriori algorithm. These
parameters control the minimum occurrence of itemsets and the minimum confidence level
for rules.
4. Implement the Apriori algorithm to discover frequent itemsets.
5. Use the frequent itemsets obtained from the previous step to generate association rules

Observation/Program:

// Write your code and output here

Observations:

// write your output here

Conclusion:

// Write conclusion here

Quiz:

(1) What Do you Mean by Association rule mining?

(2) What are the different measures are used in apriori algorithm?

Suggested Reference:

 J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Problem Completeness
Knowledge Logic
Recognition and accuracy Ethics (2)
Rubrics (2) Building (2) Total
(2) (2)
Good Average Good Average Good Average Good Average Good Average
(2) (1) (2) (1) (2) (1) (2) (1) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 5

Aim: Apply association rule data mining technique on sample data sets using WEKA.

Date: // Write date of experiment here

Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO2 & CO4

Objectives: 1) improve users' understanding of Association rule mining techniques

2) Familiarize with the tool
.
Equipment/Instruments: WEKA Tool

Safety and necessary Precautions:

Properly evaluate the generated association rules and avoid drawing incorrect conclusions.
Consider using various rule quality metrics to assess rule significance.

Procedure:
1. Install and set up WEKA.
2. Import your dataset.
3. Apply association rule data mining methods .

Demonstration of Tool:

Observations: NA

Conclusion:

// Write conclusion here

Quiz:
(1) What is WEKA tool?
(2) What is association analysis, and how can it be performed using WEKA Tool?

Suggested Reference:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

2. https://www.solver.com/xlminer-data-mining

References used by the students:

// Write references used by you here

Page No
Data Mining (3160714) Enrolment No
Rubric wise marks obtained:

Problem Tool Usage/

Knowledge Communicatio
Recognition Demonstrati Ethics (2)
Rubrics (2) n Skill (2) Total
(2) on (2)
Good Average Good Average Good Average Good Good Average
Average (1)
(2) (1) (2) (1) (2) (1) (2) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No
Experiment No - 6

Aim: Apply Classification data mining technique on sample data sets in WEKA.

Date: // Write date of experiment here

Competency and Practical Skills: Exploration and Understanding of Tool.

Relevant CO: CO4 & CO5

Objectives: 1) improve users' understanding of classification techniques

2) Familiarize with the tool
.
Equipment/Instruments: WEKA Tool

Safety and necessary Precautions:

Properly evaluate the classification rules and avoid drawing incorrect conclusions.

Procedure:
1. Install and set up WEKA.
2. Import your dataset.
3. Apply Classification data mining technique .
Demonstration of Tool:
//Demo
Conclusion:
// Write conclusion here
Quiz:
1) What is classification and how can it be performed using WEKA tool?
2) What are the key evaluation metrics used to assess the accuracy of a model in WEKA?

Suggested Reference:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

2. https://waikato.github.io/weka-wiki/documentation/

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Problem
Knowledge Tool Usage Demonstration
Recognition Ethics (2)
Rubrics (2) (2) (2) Total
(2)
Good Average Good Average Good Average Good Good Average
Average (1)
(2) (1) (2) (1) (2) (1) (2) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 7

Aim: 7.1.Implement Classification technique with quality Measures in any Programming

language.
7.2 Implement Regression technique in any Programming language.

Date: // Write date of experiment here

Competency and Practical Skills: Logic building, Programming and Analyzing

Relevant CO: CO5

Objectives:
(a) To evaluate the quality of the classification model using accuracy and confusion matrix.
(b) To evaluate regression model

Equipment/Instruments: open-source software for programming

Theory:

Classification
The Classification algorithm is a Supervised Learning technique that is used to identify the
category of new observations on the basis of training data. In Classification, a program learns
from the given dataset or observations and then classifies new observation into a number of
classes or groups. Such as, Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes
can be called as targets/labels or categories.
Unlike regression, the output variable of Classification is a category, not a value, such as
"Green or Blue", "fruit or animal", etc. Since the Classification algorithm is a Supervised
learning technique, hence it takes labeled input data, which means it contains input with the
corresponding output.
In classification algorithm, a discrete output function(y) is mapped to input variable(x).
y=f(x), where y = categorical output
The best example of an ML classification algorithm is Email Spam Detector.
The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for the categorical data.
Classification algorithms can be better understood using the below diagram. In the below
diagram, there are two classes, class A and Class B. These classes have features that are
similar to each other and dissimilar to other classes.

Page No
Data Mining (3160714) Enrolment No

The algorithm which implements the classification on a dataset is known as a classifier.

There are two types of Classifications:
Binary Classifier:
If the classification problem has only two possible outcomes, then it is called as Binary
Classifier.
Examples:
YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
Multi-class Classifier:
If a classification problem has more than two outcomes, then it is called as Multi-class
Classifier.
Example: Classifications of types of crops, Classification of types of music.
Learners in Classification Problems: In the classification problems, there are two types of
learners:
Lazy Learners: Lazy Learner firstly stores the training dataset and wait until it receives the
test dataset. In Lazy learner case, classification is done on the basis of the most related data
stored in the training dataset. It takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning
Eager Learners:Eager Learners develop a classification model based on a training dataset
before receiving a test dataset. Opposite to Lazy learners, Eager Learner takes more time in
learning, and less time in prediction.
Example: Decision Trees, Naïve Bayes, ANN.
Types of ML Classification Algorithms:
Classification Algorithms can be further divided into the Mainly two category:
Linear Models
Logistic Regression
Support Vector Machines
Non-linear Models
K-Nearest Neighbours
Kernel SVM
Naïve Bayes
Decision Tree Classification
Page No
Data Mining (3160714) Enrolment No
Random Forest Classification

Evaluating a Classification model:

Once our model is completed, it is necessary to evaluate its performance; either it is a
Classification or Regression model. So for evaluating a Classification model, we have the
following ways:
1. Log Loss or Cross-Entropy Loss:
It is used for evaluating the performance of a classifier, whose output is a probability value
between the 0 and 1.
For a good binary Classification model, the value of log loss should be near to 0.
The value of log loss increases if the predicted value deviates from the actual value.
The lower log loss represents the higher accuracy of the model.
2. Confusion Matrix:
The confusion matrix provides us a matrix/table as output and describes the performance of
the model.
It is also known as the error matrix.
The matrix consists of predictions result in a summarized form, which has a total number of
correct predictions and incorrect predictions. The matrix looks like as below table:

Actual Positive Actual Negative

Predicted Positive True Positive False Positive

Predicted Negative False Negative True Negative

3. AUC-ROC curve:
ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
It is a graph that shows the performance of the classification model at different thresholds.
To visualize the performance of the multi-class classification model, we use the AUC-ROC
Curve.
The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-axis
and FPR(False Positive Rate) on X-axis.
Use cases of Classification Algorithms
Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:
Email Spam Detection
Speech Recognition
Identifications of Cancer tumor cells.
Drugs Classification
Biometric Identification, etc.

Page No
Data Mining (3160714) Enrolment No
Regression
Regression analysis is a statistical method to model the relationship between a dependent
(target) and independent (predictor) variables with one or more independent variables. More
specifically, Regression analysis helps us to understand how the value of the dependent
variable is changing corresponding to an independent variable when other independent
variables are held fixed. It predicts continuous/real values such as temperature, age, salary,
price, etc.

We can understand the concept of regression analysis using the below example:

Example: Suppose there is a marketing company A, who does various advertisement every
year and get sales on that. The below list shows the advertisement made by the company in
the last 5 years and the corresponding sales:

Now, the company wants to do the advertisement of $200 in the year 2019 and wants to
know the prediction about the sales for this year. So to solve such type of prediction
problems in machine learning, we need regression analysis.Regression is a supervised
learning technique which helps in finding the correlation between variables and enables us
to predict the continuous output variable based on the one or more predictor variables. It is
mainly used for prediction, forecasting, time series modeling, and determining the causal-
effect relationship between variables.

In Regression, we plot a graph between the variables which best fits the given datapoints,
using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through all the datapoints on target-
predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between datapoints and line tells whether a model
has captured a strong relationship or not.

Some examples of regression can be as:

o Prediction of rain using temperature and other factors

o Determining Market trends
o Prediction of road accidents due to rash driving.

Terminologies Related to the Regression Analysis:

Page No
Data Mining (3160714) Enrolment No
o Dependent Variable: The main factor in Regression analysis which we want to predict
or understand is called the dependent variable. It is also called target variable.
o Independent Variable: The factors which affect the dependent variables or which are
used to predict the values of the dependent variables are called independent variable,
also called as a predictor.
o Outliers: Outlier is an observation which contains either very low value or very high
value in comparison to other observed values. An outlier may hamper the result, so it
should be avoided.
o Multicollinearity: If the independent variables are highly correlated with each other
than other variables, then such condition is called Multicollinearity. It should not be
present in the dataset, because it creates problem while ranking the most affecting
variable.
o Underfitting and Overfitting: If our algorithm works well with the training dataset but
not well with test dataset, then such problem is called Overfitting. And if our
algorithm does not perform well even with training dataset, then such problem is
called underfitting.

o Regression analysis helps in the prediction of a continuous variable. There are various
scenarios in the real world where we need some future predictions such as weather
condition, sales prediction, marketing trends, etc., for such case we need some
technology which can make predictions more accurately.
o Regression estimates the relationship between the target and the independent variable.
o It is used to find the trends in data.
o It helps to predict real/continuous values.
o By performing the regression, we can confidently determine the most important
factor, the least important factor, and how each factor is affecting the other factors.

Types of Regression

There are various types of regressions which are used in data science and machine learning.
Each type has its own importance on different scenarios, but at the core, all the regression
methods analyze the effect of the independent variable on dependent variables. Here we are
discussing some important types of regression which are given below:

o Linear Regression
o Logistic Regression
o Polynomial Regression
o Support Vector Regression
o Decision Tree Regression

Page No
Data Mining (3160714) Enrolment No
o Random Forest Regression
o Ridge Regression
o Lasso Regression:

Safety and necessary Precautions:

Choose a classification and regression algorithm that is suitable for your task, such as
decision trees, logistic regression, support vector machines, or neural networks.

Procedure:
 Load and preprocess the dataset (cleaning, encoding, and splitting).
 Choose a classification and regression algorithm
 Train the model on the training data.
 Testing
 Evaluation

Observations/Program:
// Write your code and output here

Conclusion:

// Write conclusion here

Quiz:

(1) What is the use of precision, recall, specificity, sensitivity etc.

Page No
Data Mining (3160714) Enrolment No
(2) What are the different Regression techniques?
(3) What is information gain, gini index and gain ratio in decision tree induction method.
Suggested Reference:
 J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann
References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 8

Aim: Apply K-means Clustering Algorithm any Programming language.

Date: // Write date of experiment here

Competency and Practical Skills: Logic building, Programming and Analyzing

Relevant CO: CO2 & CO4

Objectives: To implement Clustering Algorithm.

.
Equipment/Instruments: open-source software for programming

Theory:
K means Clustering
Unsupervised Learning is the process of teaching a computer to use unlabeled, unclassified
data and enabling the algorithm to operate on that data without supervision. Without any
previous data training, the machine’s job in this case is to organize unsorted data according
to parallels, patterns, and variations.
The goal of clustering is to divide the population or set of data points into a number of groups
so that the data points within each group are more comparable to one another and different
from the data points within the other groups. It is essentially a grouping of things based on
how similar and different they are to one another.
We are given a data set of items, with certain features, and values for these features (like a
vector). The task is to categorize those items into groups. To achieve this, we will use the
K-means algorithm; an unsupervised learning algorithm. ‘K’ in the name of the algorithm
represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we
will use the Euclidean distance as a measurement.
The algorithm works as follows:
 First, we randomly initialize k points, called means or cluster centroids.
 We categorize each item to its closest mean and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
 We repeat the process for a given number of iterations and at the end, we have our
clusters.
 The “points” mentioned above are called means because they are the mean values of
the items categorized in them. To initialize these means, we have a lot of options. An
intuitive method is to initialize the means at random items in the data set.

Safety and necessary Precautions:

Use validation sets to assess performance and prevent overfitting

Procedure:
 Initialize k means with random values

Page No
Data Mining (3160714) Enrolment No
 For a given number of iterations:
Iterate through items:
 Find the mean closest to the item by calculating
the euclidean distance of the item with each of the means
 Assign item to mean
 Update mean by shifting it to the average of the items in that cluster

Program:
// Write your code here

Observations:

// Write your output here

Conclusion:

// Write the conclusion here

Quiz:

(1) What are the different distance measures?

(2) What do you mean by centroid in K-means Algorithm?

Suggested Reference:

J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Marks

Page No
Data Mining (3160714) Enrolment No
Experiment No - 9

Aim: Perform hands on experiment on any advance mining Techniques Using Appropriate Tool.

Date: // Write date of experiment here

Competency and Practical Skills: Exploration and Understanding of Tool

Relevant CO: CO4

Objectives:

1) Improve users' understanding of advance mining Techniques like Text Mining, Stream Mining,
and Web Content Mining Using Appropriate Tool
2) Familiarize with the tool
.
Equipment/Instruments: Write your tool name here

Demonstration of Tool:

Observations: NA

Conclusion:

// Write conclusion here

Quiz:

1) What different data mining techniques are used in your tool?

Suggested Reference:

1. J. Han, M. Kamber, “Data Mining Concepts and Techniques”, Morgan Kaufmann

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Problem Tool Usage/

Marks

Page No
Data Mining (3160714) Enrolment No

Experiment No - 10

Aim: Solve Real world problem using Data Mining Techniques using Python Programming
Language.
Date: // Write the date of the experiment here

Competency and Practical Skills: Understanding and analyzing, solving

Relevant CO: CO3

Objectives: (a) To understand real-world problems.

(b) To analyze which data mining technique can be used to solve your problem.

Equipment/Instruments: Python Programming

Theory:

// Write about your system as selected in 1st practical

Program:
// Write your steps and their code here step by step

Observations:

// Write your output here

Conclusion:

// Write the conclusion here

Quiz:
1) What are other techniques that can be used to solve your system problem?

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Completeness
Knowledge Teamwork Logic
and accuracy Ethics (2)
Rubrics (2) (2) Building (2) Total
(2)
Good Average Good Average Good Average Good Average Good Average
(2) (1) (2) (1) (2) (1) (2) (1) (2) (1)

Marks

Page No
Data Mining (3160714) Enrolment No

Page No

2025 HESI RN Exit Exam V5 Actual Qs & Ans To Pass The Exam, (NGN & Case Studies), 100% Verified PDF
100% (1)
2025 HESI RN Exit Exam V5 Actual Qs & Ans To Pass The Exam, (NGN & Case Studies), 100% Verified PDF
10 pages
CCS334 Master Manual
No ratings yet
CCS334 Master Manual
65 pages
DDCO Lab Manual
No ratings yet
DDCO Lab Manual
56 pages
Degree Engineering: Data Mining
No ratings yet
Degree Engineering: Data Mining
44 pages
Ds Narjjis
No ratings yet
Ds Narjjis
98 pages
COA Lab Manual
No ratings yet
COA Lab Manual
60 pages
ML Labmanual R22
No ratings yet
ML Labmanual R22
52 pages
DS - Lab Manual
No ratings yet
DS - Lab Manual
75 pages
Nayan DS
No ratings yet
Nayan DS
65 pages
DBMS Manual2
No ratings yet
DBMS Manual2
108 pages
Siddh Ds
No ratings yet
Siddh Ds
121 pages
DM Lab Manual
No ratings yet
DM Lab Manual
26 pages
COA LabManual
No ratings yet
COA LabManual
56 pages
Data Structure Lab Manual
No ratings yet
Data Structure Lab Manual
56 pages
Data Structure Practical-1
No ratings yet
Data Structure Practical-1
27 pages
Data Warehousing Lab Manual Regulation 2015
No ratings yet
Data Warehousing Lab Manual Regulation 2015
51 pages
Dm-Lab - Nov 1
No ratings yet
Dm-Lab - Nov 1
86 pages
Ada Final
No ratings yet
Ada Final
37 pages
Lab File of Data Science FSD 405P
No ratings yet
Lab File of Data Science FSD 405P
7 pages
SHIVAM
No ratings yet
SHIVAM
103 pages
Data Structure Lab Manual
No ratings yet
Data Structure Lab Manual
104 pages
Experiment List. DSPYL
No ratings yet
Experiment List. DSPYL
10 pages
Data Engineering Lab
No ratings yet
Data Engineering Lab
55 pages
1 To 5 and 9
No ratings yet
1 To 5 and 9
38 pages
Kush Wah
No ratings yet
Kush Wah
103 pages
IV Year Syllabus (2024 - 25)
No ratings yet
IV Year Syllabus (2024 - 25)
51 pages
Ai 1131
No ratings yet
Ai 1131
65 pages
DSBA Manual 2025
No ratings yet
DSBA Manual 2025
77 pages
Team14 Mini Report FINAL
No ratings yet
Team14 Mini Report FINAL
61 pages
Is Sneha
No ratings yet
Is Sneha
78 pages
Rudra Iot Practical
No ratings yet
Rudra Iot Practical
79 pages
DMBI Lab Manual Final
No ratings yet
DMBI Lab Manual Final
56 pages
Final ASD
No ratings yet
Final ASD
46 pages
Experiment List. DSPYL
No ratings yet
Experiment List. DSPYL
10 pages
3161608-Ai Gecm
No ratings yet
3161608-Ai Gecm
45 pages
6 Big Data Analytics Lab Manual
No ratings yet
6 Big Data Analytics Lab Manual
73 pages
Laboratory Manual Data Warehousing and Mining Lab: Department of Computer Science and Engineering
No ratings yet
Laboratory Manual Data Warehousing and Mining Lab: Department of Computer Science and Engineering
234 pages
DL Intro
No ratings yet
DL Intro
7 pages
Bedp 2030 - Planning Frameworks - 4 Pillars and Enabling Environment
100% (3)
Bedp 2030 - Planning Frameworks - 4 Pillars and Enabling Environment
12 pages
BTech - CSE - Syllabus - 2023 - Onwards - 12th July-24
No ratings yet
BTech - CSE - Syllabus - 2023 - Onwards - 12th July-24
151 pages
Bda 20cs41001 Course File Ds
No ratings yet
Bda 20cs41001 Course File Ds
170 pages
Bda Lab Manual (R20a0592)
No ratings yet
Bda Lab Manual (R20a0592)
89 pages
BDA Final Lab Manual
100% (1)
BDA Final Lab Manual
56 pages
DWBI Venky Final Print
No ratings yet
DWBI Venky Final Print
39 pages
Web Technology Lab Manual Updated
No ratings yet
Web Technology Lab Manual Updated
82 pages
Big Data Analytics Lab Manual
No ratings yet
Big Data Analytics Lab Manual
90 pages
BDA Lab Manual AI&DS
No ratings yet
BDA Lab Manual AI&DS
60 pages
Bda Lab Manual
No ratings yet
Bda Lab Manual
62 pages
It - (20) - 2-2 - Database Management Systems Laboratory Manual (2022-23)
No ratings yet
It - (20) - 2-2 - Database Management Systems Laboratory Manual (2022-23)
70 pages
Dbms Lab Manual
No ratings yet
Dbms Lab Manual
68 pages
DS Lab Manual
No ratings yet
DS Lab Manual
70 pages
Software Record
No ratings yet
Software Record
71 pages
Oop Manual 220173107016
0% (1)
Oop Manual 220173107016
91 pages
It - III B.tech Sem-II - DWDM Lab Manual (20-21)
No ratings yet
It - III B.tech Sem-II - DWDM Lab Manual (20-21)
94 pages
Cse All
No ratings yet
Cse All
8 pages
DWDM Lab Manual - It - Iii-Ii - 2018-19 PDF
No ratings yet
DWDM Lab Manual - It - Iii-Ii - 2018-19 PDF
96 pages
Anti Bullying Policy Classroom Based
100% (1)
Anti Bullying Policy Classroom Based
3 pages
Bda Lab Manual (R20a0592)
No ratings yet
Bda Lab Manual (R20a0592)
89 pages
Test Mesurement in Sports
No ratings yet
Test Mesurement in Sports
5 pages
RMM Data Mining Lab Manual Iv-I Cse R16 2019-2020 PDF
No ratings yet
RMM Data Mining Lab Manual Iv-I Cse R16 2019-2020 PDF
136 pages
Geethanjali College of Engineering and Technology (Ugc Autonomous Institution)
No ratings yet
Geethanjali College of Engineering and Technology (Ugc Autonomous Institution)
34 pages
It Iii B.tech Sem-Ii Dwdm-R17a0590 Lab Manual 2019-20
No ratings yet
It Iii B.tech Sem-Ii Dwdm-R17a0590 Lab Manual 2019-20
107 pages
NCP - Risk For Infection
100% (1)
NCP - Risk For Infection
2 pages
RRB NTPC Exam Pattern 2022, Railway NTPC Stage I + II
No ratings yet
RRB NTPC Exam Pattern 2022, Railway NTPC Stage I + II
3 pages
Thoracic Aortic Aneurysm - Optimal Surveillance and Treatment
100% (1)
Thoracic Aortic Aneurysm - Optimal Surveillance and Treatment
12 pages
NSC 401 PDF
No ratings yet
NSC 401 PDF
327 pages
Clinics Consolidated Master Data
No ratings yet
Clinics Consolidated Master Data
414 pages
Working Memory and The Mind
No ratings yet
Working Memory and The Mind
8 pages
Handling and Retention of BA and BE Testing Samples 1711545083
No ratings yet
Handling and Retention of BA and BE Testing Samples 1711545083
22 pages
BF Policy 2024 Update
No ratings yet
BF Policy 2024 Update
34 pages
The Care of The Older Person 5th Edition PDF DOCX DOWNLOAD
No ratings yet
The Care of The Older Person 5th Edition PDF DOCX DOWNLOAD
16 pages
Adhd Rating Scale IV Preschool
No ratings yet
Adhd Rating Scale IV Preschool
2 pages
Reading Material-3 PDF
No ratings yet
Reading Material-3 PDF
112 pages
Swiss Ball
No ratings yet
Swiss Ball
26 pages
None of This Diseases
No ratings yet
None of This Diseases
10 pages
Sample Paper 11 (Arihant)
No ratings yet
Sample Paper 11 (Arihant)
8 pages
Veterinary Clinical Associate - National VMLC
No ratings yet
Veterinary Clinical Associate - National VMLC
2 pages
Family Case Study
No ratings yet
Family Case Study
44 pages
AGT Workplace Competence Assessment Guidance
No ratings yet
AGT Workplace Competence Assessment Guidance
10 pages
Sumit
No ratings yet
Sumit
13 pages
Guidance On The Biocidal Products Regulation: Volume II: Efficacy Part A: Information Requirements
No ratings yet
Guidance On The Biocidal Products Regulation: Volume II: Efficacy Part A: Information Requirements
46 pages
Low Country Cable and Communication Expansion Project Risk Management Plan
No ratings yet
Low Country Cable and Communication Expansion Project Risk Management Plan
15 pages
Ug Pros2021 22
No ratings yet
Ug Pros2021 22
28 pages
s11154 023 09871 9aasda
No ratings yet
s11154 023 09871 9aasda
11 pages
WB UG Medical/Dental/AYUSH 2020 ORCR Round-2
No ratings yet
WB UG Medical/Dental/AYUSH 2020 ORCR Round-2
6 pages
Test 11th Grade Consumer Society Tests - 83566
No ratings yet
Test 11th Grade Consumer Society Tests - 83566
3 pages
Price List All
No ratings yet
Price List All
9 pages
ICT Project Management: Framework for ICT-based Pedagogy System: Development, Operation, and Management
From Everand
ICT Project Management: Framework for ICT-based Pedagogy System: Development, Operation, and Management
Suman Ahmmed
No ratings yet
Different Approaches to Learning Science, Technology, Engineering, and Mathematics: Case Studies from Thailand, the Republic of Korea, Singapore, and Finland
From Everand
Different Approaches to Learning Science, Technology, Engineering, and Mathematics: Case Studies from Thailand, the Republic of Korea, Singapore, and Finland
Asian Development Bank
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Data Mining. VBV JJJ Ldce Vgec

Uploaded by

Data Mining. VBV JJJ Ldce Vgec

Uploaded by

Degree Engineering

A Laboratory Manual for

B.E. Semester 6(Computer

Directorate of Technical Education, Gandhinagar,

This is to certify that Mr./Ms. _____________________________________________ Enrollment

No. _______________ of B.E. Semester ________ from Computer Engineering Department of

subject Data Mining (3160714) for the academic year 2022-23.

Signature of Course Faculty Head of the Department

 To provide globally competitive technical education

 To create an ecosystem for proliferation of socially responsible and technically sound

 To develop state-of-the-art laboratories and well-equipped academic infrastructure.

 To create an environment for providing value-based education in Computer Engineering

 To produce computer engineering graduates according to the needs of industry,

Programme Outcomes (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering

Program Specific Outcomes (PSOs)

 Sound knowledge of fundamentals of computer science and engineering including software

Program Educational Objectives (PEOs)

 Possess technical competence in solving real life problems related to Computing.

Practical – Course Outcome matrix

Guidelines for Faculty members

Instructions for Students

Common Safety Instructions

Students are expected to

1) switch on the PC carefully (not to use wet hands)

Sr. Objective(s) of Experiment Page Date Date of Assessme Sign. of Rema

Date: // Write date of experiment here

Competency and Practical Skills: Understanding and Analyzing

Relevant CO: CO3

Objectives: (a) To understand the application of domain

System Name: Movie recommendation systems

Safety and necessary Precautions:

Choose appropriate system to show how mining is an interdisiplinary field.

References used by the students:

// Write references used by you here

Rubric wise marks obtained:

Date: // Write date of experiment here

Competency and Practical Skills: Programming and statistical methods

Relevant CO: CO1

Objectives: (a) To understand Basic Preprocessing Techniques and statistical Measures.

2.1 Noisy data handling

Noise: random error or variance in a measured variable

• Faulty Data Collection Instruments

Smoothing by bin means:

Smoothing by bin boundaries:

Equal Width Binning :

bin1: 5,10,11,13,15,35,50,55,72 I.e. all values between 5 and 75

Five Number Summary

Fig Box plot using Five Number Summary

Safety and necessary Precautions:

8. Data Dispersion Measure:

9. Box Plot Generation:

# Numbers of bins user wants

#1. Equal width binning

def range_val(j, limit):

print("Equal Width : ")

#2. Equal Frequency/Depth Binning

for i in range(1, bins+1):

print("Equal frequency : ")

# apply smoothing techniques by means, by median and bin boundaries.

from statistics import mean, median

print("Smooth mean for Equal frequency : ")

2.2 Normalization Techniques

1) Min max normalization

data = random.sample(range(100, 1000), 20)

def min_max(val, new_min, new_max):

min_max_norm = [min_max(i, new_min, new_max) for i in data]

def z_score(val, mean, std):

z_norm = [z_score(i, data_mean, data_std) for i in data]

def dec_scale(n, j):

abs_max_data = max(list(map(abs, data)))

dec_norm = [dec_scale(i, len_abs_max_data) for i in data]

import matplotlib.pyplot as plt

print('Five Number Summary')

2.1 Noisy Data Handling

2.2 Normalization Techniques

No. _______ of B.E. Semester from Computer Engineering Department of