
Introduction to Data Science

Data Preprocessing (Contd.)

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

• The slides presented here are obtained from the authors of the books and from various other contributors. I hereby acknowledge all the contributors for their material and inputs.
• We have added and modified slides to suit the requirements of the course.


Handling Numeric Data



Handling Numeric Data

Techniques:
• Discretization – convert numeric data into discrete categories.
• Binarization – convert numeric data into binary categories.
• Normalization – scale numeric data to a specific range.
• Smoothing – remove noisy variations from the data. Techniques include binning, regression, and clustering.

Example: the Pima Indians Diabetes dataset, whose numeric attributes come with diverse ranges.
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Discretization

Convert a continuous attribute into a discrete attribute.

Discretization involves converting the raw values of a numeric attribute (e.g., age) into
• interval labels (e.g., 0–10, 11–20, etc.)
• conceptual labels (e.g., youth, adult, senior)

Discretization Process
• The raw data are replaced by a smaller number of interval or concept labels.
• This simplifies the original data and makes the mining more efficient.
• Concept hierarchies are also useful for mining at multiple abstraction levels.


Concept Hierarchy

Divide the range of a continuous attribute into intervals. Interval labels can then be used to replace actual data values.
The labels, in turn, can be recursively organized into higher-level concepts. This results in a concept hierarchy for the numeric attribute.


Discretization Techniques

Discretization techniques can be categorized based on how the discretization is performed.

Supervised vs. unsupervised discretization
• If the discretization process uses class information, it is supervised discretization. Otherwise, it is unsupervised.

Top-down discretization (splitting)
• The process starts by finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats recursively on the resulting intervals.

Bottom-up discretization (merging)
• The process starts by considering all of the continuous values as potential split points.
• It removes some by merging neighboring values to form intervals, and then applies this process recursively to the resulting intervals.
Discretization by Binning Methods

1. Equal Width (distance) binning
• Each bin has equal width: width = (max − min) / #bins
• Highly sensitive to outliers.
• If outliers are present, the width of each bin is large, resulting in skewed data.

2. Equal Depth (frequency) binning
• Specify the number of values that have to be stored in each bin.
• The number of entries in each bin is equal.
• Occurrences of the same value may end up in different bins.
Binning Example

Discretize the following data into 3 discrete categories using the binning technique:
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70.
Binning Example

Original data (sorted): 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal width: width = (81 − 53)/3 = 28/3 ≈ 9.33
  Bin 1 [53, 62.33):    53, 56, 57
  Bin 2 [62.33, 71.67): 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70
  Bin 3 [71.67, 81]:    72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal depth: depth = 24/3 = 8 values per bin
  Bin 1: 53, 56, 57, 63, 66, 67, 67, 67
  Bin 2: 68, 69, 70, 70, 70, 70, 72, 73
  Bin 3: 75, 75, 76, 76, 78, 79, 80, 81
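A minimal sketch of both binning methods on this data, assuming Python with pandas:

import pandas as pd

values = pd.Series([53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                    70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81])

# Equal-width binning: 3 bins, each of width (max - min) / 3
equal_width = pd.cut(values, bins=3)

# Equal-depth (equal-frequency) binning: 3 bins with 8 values each
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())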


Discretization by Histogram Analysis

Histogram analysis is an unsupervised discretization technique because it does not use class information.
Histograms use binning to approximate data distributions and are a popular form of data reduction.
A histogram for an attribute, X, partitions the data distribution of X into disjoint subsets, referred to as buckets or bins.
If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Often, buckets represent continuous ranges for the given attribute.
The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy.


Discretization by Histogram Analysis

1. Equal Width Histogram
• The values are partitioned into equal-size partitions or ranges.

2. Equal Frequency Histogram
• The values are partitioned such that each partition contains the same number of data objects.


Discretization Without Supervision: Binning vs. Clustering

[Figure: the same data discretized by equal-width (distance) binning, equal-depth (frequency) binning, and K-means clustering; K-means clustering leads to better results.]
Variable Transformation

Variable transformation involves changing the values of an attribute.
For each object (tuple), a transformation is applied to the value of the variable for that object.
1. Simple functional transformations
2. Normalization

Simple Functional Transformation

https://aegis4048.github.io/transforming-non-normal-distribution-to-normal-distribution
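Common simple functional transformations include the logarithm, square root, and Box-Cox transforms. A minimal sketch, assuming Python with NumPy and SciPy:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 200.0])  # right-skewed, strictly positive values

log_x = np.log1p(x)              # log(1 + x) compresses large values
sqrt_x = np.sqrt(x)              # milder compression than the log transform
boxcox_x, lam = stats.boxcox(x)  # Box-Cox estimates the power transform that best normalizes x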
Normalization

Normalizing the data attempts to give all attributes an equal weight.
The goal of standardization or normalization is to make an entire set of values have a particular property.
Normalization is particularly useful for:
• classification algorithms involving neural networks: normalizing the input values for each attribute in the training tuples will help speed up the learning phase.
• distance measurements such as nearest-neighbor classification and clustering: normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).


Why Feature Scaling?

Features with bigger magnitudes dominate over features with smaller magnitudes, so it is good practice to have all variables on a similar scale.
Euclidean distances are sensitive to feature magnitude.
Gradient descent converges faster when all the variables are on a similar scale.
Feature scaling helps decrease the time needed to find support vectors.

Why Feature Scaling?

For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).


Algorithms Sensitive to Feature Magnitude

• Linear and Logistic Regression
• Neural Networks
• Support Vector Machines
• KNN
• K-Means Clustering
• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
Normalization

Scale the feature magnitude to a standard range like [0, 1] or [−1, +1] or any other.
Techniques
• Min-Max normalization
• z-score normalization
• Normalization by decimal scaling
What is the impact of outliers in the data?
Min-Max Scaling

Min-max scaling squeezes (or stretches) all feature values to lie within the range [0, 1].
Min-max normalization preserves the relationships among the original data values.
It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for X.


Min-Max Normalization

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. The new range is [0.0, 1.0]. Apply min-max normalization to the value $73,600.
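Worked solution, using the standard min-max formula x' = (x − min) / (max − min) × (new_max − new_min) + new_min:

  x' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 61,600 / 86,000 ≈ 0.716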


z-score Normalization

In z-score normalization (or zero-mean normalization), the values for an attribute, x, are normalized based on the mean µ(x) and standard deviation σ(x) of x.
The resulting scaled feature has a mean of 0 and a variance of 1. The z-score is

  z = (x − µ(x)) / σ(x)

z-score normalization is useful when the actual minimum and maximum of attribute X are unknown, or when there are outliers that dominate the min-max normalization.
z-score Normalization

Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. Apply z-score normalization to the value $73,600.
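Worked solution, substituting into the z-score formula above:

  z = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225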


Decimal Normalization

Normalizes by moving the decimal point of values of attribute x.
The number of decimal places moved depends on the maximum absolute value of x:

  x' = x / 10^j, where j is the smallest integer such that max(|x'|) < 1

The new range is (−1, +1).


Decimal Normalization

Example 1 (CGPA: maximum absolute value < 10, so divide by 10)
  CGPA     Formula          Normalized CGPA
  2        2/10             0.2
  3        3/10             0.3

Example 2 (Bonus: maximum absolute value < 1000, so divide by 1000)
  Bonus    Formula          Normalized Bonus
  450      450/1000         0.45
  310      310/1000         0.31

Example 3 (Salary: maximum absolute value < 100000, so divide by 100000)
  Salary   Formula          Normalized Salary
  48000    48000/100000     0.48
  67000    67000/100000     0.67
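A minimal sketch of decimal scaling, assuming Python with NumPy, using the salary values from Example 3:

import numpy as np

x = np.array([48000, 67000])

# Smallest j such that max(|x|) / 10**j < 1
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))   # j = 5 here
x_scaled = x / 10**j                              # [0.48, 0.67]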
CATEGORICAL ENCODING

We need to convert categorical columns to numeric columns so that a machine learning algorithm can understand them.

Categorical encoding is the process of converting categories to numbers.

• Binarization maps a continuous or categorical attribute into one or more binary attributes.
• The encoding must maintain any ordinal relationship.
• Algorithms that find association patterns require the data to be in the form of binary attributes, e.g., the Apriori algorithm and the Frequent Pattern (FP) Growth algorithm.
CATEGORICAL ENCODING: Techniques

• One-hot encoding
• Label encoding


One-Hot Encoding

Encode each categorical variable with a set of Boolean variables which take values 0 or 1, indicating whether a category is present for each observation.
One binary attribute is created for each categorical value.

Advantages
• Makes no assumption about the distribution or categories of the categorical variable.
• Keeps all the information of the categorical variable.
• Suitable for linear models.

Disadvantages
• Expands the feature space.
• Does not add extra information while encoding.
• Many dummy variables may be identical, introducing redundant information.
• The number of resulting attributes may become too large.
One-Hot Encoding Example

Assume an ordinal attribute representing the service quality of a restaurant: (Awful < Poor < OK < Good < Great). One-hot encoding requires 5 bits, one for each categorical value.

  Service Quality   X1  X2  X3  X4  X5
  Awful              0   0   0   0   1
  Poor               0   0   0   1   0
  OK                 0   0   1   0   0
  Good               0   1   0   0   0
  Great              1   0   0   0   0
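A minimal sketch of one-hot encoding for the same service-quality categories, assuming Python with pandas and scikit-learn (note that both libraries order the resulting columns alphabetically rather than in the X1–X5 layout above):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"service": ["Awful", "Poor", "OK", "Good", "Great", "OK"]})

# Option 1: pandas dummy variables, one binary column per category
dummies = pd.get_dummies(df["service"], prefix="service")

# Option 2: scikit-learn encoder, reusable on new data via transform()
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(df[["service"]]).toarray()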


Label Encoding

Replace the categories by digits from 1 to n (or 0 to n − 1, depending on the implementation), where n is the number of distinct categories of the variable.
The categories are arranged in ascending order and the numbers are assigned.

Advantages
• Straightforward to implement.
• Does not expand the feature space.
• Works well enough with tree-based algorithms.

Disadvantages
• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set automatically.

Used for features which have multiple values in their domain, e.g., colour, protocol type.
Label Encoding Example

Assume an ordinal attribute representing the service quality of a restaurant: (Awful, Poor, OK, Good, Great).

  Service Quality   Integer Value
  Awful             0
  Poor              1
  OK                2
  Good              3
  Great             4
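A minimal label-encoding sketch, assuming Python with pandas and scikit-learn (the "service" column is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"service": ["Awful", "Poor", "OK", "Good", "Great"]})

# LabelEncoder assigns integers in alphabetical order of the categories
le = LabelEncoder()
df["service_label"] = le.fit_transform(df["service"])

# To reproduce the table above (Awful = 0 ... Great = 4), fix the order explicitly
# with an ordered pandas Categorical:
order = ["Awful", "Poor", "OK", "Good", "Great"]
df["service_ordinal"] = pd.Categorical(df["service"], categories=order, ordered=True).codes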


Feature Engineering

Feature Engineering

Feature Engineering is the process of selecting and extracting useful, predictive features from data.
The goal is to create a set of features that best represent the information contained in the data, producing a simpler model that generalizes well to future observations.
Motivation for Feature Engineering
Hughes Phenomenon
Given a fixed number of data points, the performance of a regressor or a classifier first increases but later decreases as the number of dimensions of the data increases.

Reasons for this phenomenon
• Redundant features
• Correlation between features
• Irrelevant features


Handling Redundancy

Handling Redundancy in Data Integration
• Redundant data often occur when multiple databases are integrated.
  – Object identification: the same attribute or object may have different names in different databases.
  – Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue.
• Redundant attributes may be detected by correlation analysis and covariance analysis.
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product moment coefficient):

  r(A,B) = [ Σ_{i=1..n} (a_i − mean(A))(b_i − mean(B)) ] / (n σ_A σ_B) = [ Σ_{i=1..n} a_i b_i − n · mean(A) · mean(B) ] / (n σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
• −1 <= r(A,B) <= +1
• If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
• r(A,B) = 0: A and B are uncorrelated (no linear relationship; this does not by itself imply independence).
• r(A,B) < 0: A and B are negatively correlated.


Correlation (viewed as linear relationship)

• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize the data objects, A and B, and then take their dot product:

  a'_k = (a_k − mean(A)) / std(A)
  b'_k = (b_k − mean(B)) / std(B)

  correlation(A, B) = A' · B'
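To make this concrete, a minimal sketch assuming Python with NumPy, using the stock values from the covariance example later in this deck:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize, then take the dot product; dividing by n gives the Pearson correlation
A_std = (A - A.mean()) / A.std()
B_std = (B - B.mean()) / B.std()
r_manual = np.dot(A_std, B_std) / len(A)

# The same value from NumPy's built-in Pearson correlation
r_builtin = np.corrcoef(A, B)[0, 1]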
Covariance (Numeric Data)
• Covariance is similar to correlation:

  Cov(A, B) = E[(A − mean(A)) (B − mean(B))] = Σ_{i=1..n} (a_i − mean(A)) (b_i − mean(B)) / n

  Correlation coefficient:

  r(A,B) = Cov(A, B) / (σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A,B) = 0 (the converse does not hold in general).


Co-Variance: An Example

• The computation can be simplified as Cov(A, B) = E(A·B) − mean(A)·mean(B).

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?

• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
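A quick check of this result, assuming Python with NumPy:

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov_simplified = (A * B).mean() - A.mean() * B.mean()   # 4.0

# np.cov with bias=True uses the same 1/n (population) definition
cov_numpy = np.cov(A, B, bias=True)[0, 1]               # 4.0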


Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
• Correlation does not imply causality:
  • the number of hospitals and the number of car thefts in a city are correlated;
  • both are causally linked to a third variable: population.
Chi-Square Calculation: An Example
                             Play chess   Not play chess   Sum (row)
Like science fiction             250            200            450
Not like science fiction          50           1000           1050
Sum (col.)                       300           1200           1500

Do you think there is a correlation between 'Play chess' and 'Like science fiction'?
For the above cross-tabulation:
• There are two possible values in the rows: Like science fiction, Not like science fiction.
• There are two possible values in the columns: Play chess, Not play chess.

Degrees of freedom = (rows − 1) × (cols − 1) = (2 − 1)(2 − 1) = 1


Chi-Square Calculation: Computation
                             Play chess   Not play chess   Sum (row)
Like science fiction          250 (90)       200 (360)        450
Not like science fiction       50 (210)     1000 (840)       1050
Sum (col.)                       300            1200          1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• We can check the statistical significance of the chi-square value in a standard χ² table.
• It shows that like_science_fiction and play_chess are correlated in the dataset.
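A minimal cross-check of this computation, assuming Python with SciPy:

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (like sci-fi, not like sci-fi), columns = (play chess, not play chess)
observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False disables the Yates continuity correction so the result matches the slide
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
# chi2 ≈ 507.93, dof = 1, expected = [[90, 360], [210, 840]]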




Feature Construction

Feature Construction

Create derived features:
• Involves creating a new feature using data from existing features.
• Mostly relies on domain knowledge.
• E.g., calculating price per sq ft:

  Area (sq ft)   Price (Rs)     Price/Sq ft (Rs)
  1800           81,00,000      4500
  2000           78,00,000      3900
  1550           65,10,000      4200
  2400           1,15,20,000    4800
  3500           1,22,50,000    3500
  2800           1,45,60,000    5200
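A minimal sketch of this construction, assuming Python with pandas (column names are illustrative):

import pandas as pd

houses = pd.DataFrame({
    "area_sqft": [1800, 2000, 1550, 2400, 3500, 2800],
    "price_rs":  [8100000, 7800000, 6510000, 11520000, 12250000, 14560000],
})

# Derived feature: price per square foot
houses["price_per_sqft"] = houses["price_rs"] / houses["area_sqft"]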




Feature Selection

Feature Selection Methods

Forward selection
• Starts with one predictor and adds more iteratively.
• At each subsequent iteration, the best of the remaining original predictors is added based on a performance criterion.
• SequentialFeatureSelector class from mlxtend.

Backward elimination
• Starts with all predictors and eliminates them one by one iteratively.
• One of the most popular algorithms is Recursive Feature Elimination (RFE), which eliminates less important predictors based on a feature-importance ranking.
• RFE class from sklearn.

A sketch of both approaches on the wine data follows below.
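A minimal sketch, assuming scikit-learn's bundled wine dataset and the mlxtend package (the wine-data screenshots on the next two slides are not reproduced in this extraction):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_wine(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Forward selection (SFS): start small and add the best remaining feature at each step
sfs = SFS(estimator, k_features=5, forward=True, scoring="accuracy", cv=5)
sfs = sfs.fit(X, y)
print("Forward-selected feature indices:", sfs.k_feature_idx_)

# Backward elimination via RFE: start with all features, drop the least important each step
rfe = RFE(estimator, n_features_to_select=5)
rfe = rfe.fit(X, y)
print("RFE-selected feature mask:", rfe.support_)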


SFS Example – wine data
[Figure: sequential forward selection on the wine data; image not reproduced in this extraction.]

SBS Example – wine data
[Figure: sequential backward selection on the wine data; image not reproduced in this extraction.]


Text Books

T1: Introduction to Data Mining, by Tan, Steinbach, and Kumar.

T2: Introducing Data Science, by Cielen, Meysman, and Ali.

T3: Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley.

T4: Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber; Morgan Kaufmann Publishers.


Thank You

