IDS5
BITS Pilani
Pilani | Dubai | Goa | Hyderabad
• The slides presented here were obtained from the authors of the books and from various other contributors. I hereby acknowledge all the contributors for their material and inputs.
• We have added and modified slides to suit the requirements of the course.
Techniques
• Discretization – convert numeric data into discrete categories
• Binarization – convert numeric data into binary categories
• Normalization – scale numeric data to a specific range
• Smoothing – remove noisy variations from the data; techniques include binning, regression, and clustering
Pima Indians Diabetes dataset
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Discretization
Discretize the following data into 3 discrete categories using the binning technique.
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68,
69, 70, 70.
Binning Example
Original data (sorted): 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal width: width = (81 − 53)/3 = 28/3 ≈ 9.33 (bin boundaries below are rounded)
  Bin 1 = [53, 62): 53, 56, 57
  Bin 2 = [62, 72): 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70
  Bin 3 = [72, 81]: 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal depth: depth = 24/3 = 8 values per bin
  Bin 1: 53, 56, 57, 63, 66, 67, 67, 67
  Bin 2: 68, 69, 70, 70, 70, 70, 72, 73
  Bin 3: 75, 75, 76, 76, 78, 79, 80, 81
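A minimal sketch of both strategies using pandas (assuming pandas is available; pd.cut gives equal-width bins and pd.qcut gives equal-depth bins):

```python
import pandas as pd

values = pd.Series([53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                    70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81])

# Equal-width binning: 3 bins, each about (81 - 53) / 3 ≈ 9.33 wide
equal_width = pd.cut(values, bins=3)

# Equal-depth (equal-frequency) binning: 3 bins of 8 values each
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())
```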
https://aegis4048.github.io/transforming-non-normal-distribution-to-normal-distribution
Why Feature Scaling?
• Features with larger magnitudes dominate features with smaller magnitudes, so it is good practice to keep all variables on a similar scale.
• Euclidean distances are sensitive to feature magnitude.
• Gradient descent converges faster when all variables are on a similar scale.
• Feature scaling helps reduce the time needed to find support vectors.
Normalization
Scale feature magnitudes to a standard range such as [0, 1], [−1, +1], or any other range.
Techniques
• Min-Max normalization
• z-score normalization
• Normalization by decimal scaling
What is the impact of outliers in the data?
Min-Max Scaling
Min-max scaling squeezes (or stretches) all feature values into the range [0, 1].
Min-max normalization preserves the relationships among the original data values.
It will encounter an "out-of-bounds" error if a future input case for normalization falls outside the original data range of X.
Example: Suppose the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively, and the new range is [0.0, 1.0]. Apply min-max normalization to the value $73,600.
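Using the min-max formula v' = (v − min) / (max − min) × (new_max − new_min) + new_min:
v' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 61,600 / 86,000 ≈ 0.716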
z-score Normalization
z = (x − μ(x)) / σ(x)
z-score normalization is useful when the actual minimum and maximum of attribute X are unknown, or when there are outliers that dominate min-max normalization.
Example: Suppose the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. Apply z-score normalization to the value $73,600.
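Applying z = (x − μ) / σ:
z = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225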
Normalization by Decimal Scaling
Example 1
CGPA Formula Normalized CGPA
2 2/10 0.2
3 3/10 0.3
Example 2
Bonus Formula Normalized Bonus
450 450/1000 0.45
310 310/1000 0.31
Example 3
Salary Formula Normalized Salary
48000 48000/100000 0.48
67000 67000/100000 0.67
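A minimal sketch of min-max and z-score scaling with scikit-learn (the small income column below is made up for illustration; MinMaxScaler and StandardScaler compute their statistics from the data they are fit on):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical income column; 12,000 and 98,000 act as the min and max
income = np.array([[12000.0], [30000.0], [54000.0], [73600.0], [98000.0]])

# Min-max normalization to [0, 1]
minmax = MinMaxScaler(feature_range=(0, 1))
print(minmax.fit_transform(income).ravel())

# z-score normalization: (x - mean) / std of this column
zscore = StandardScaler()
print(zscore.fit_transform(income).ravel())
```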
CATEGORICAL ENCODING
One-hot encoding
Label Encoding
Encode each categorical variable with a set of Boolean variables which take values 0 or 1, indicating whether a category is present for each observation.
One binary attribute is created for each categorical value.
Advantages
• Makes no assumption about the distribution or categories of the categorical variable.
• Keeps all the information of the categorical variable.
• Suitable for linear models.
Disadvantages
• Expands the feature space.
• Does not add extra information while encoding.
• Many dummy variables may be identical, introducing redundant information.
One-hot Encoding Example
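A minimal sketch using pandas (the colour column and its categories are made up for illustration):

```python
import pandas as pd

# Hypothetical categorical feature
df = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# One binary (0/1) column per category
encoded = pd.get_dummies(df, columns=["colour"], dtype=int)
print(encoded)
#    colour_blue  colour_green  colour_red
# 0            0             0           1
# 1            0             1           0
# 2            1             0           0
# 3            0             1           0
```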
Feature Engineering
Motivation for Feature Engineering
Hughes Phenomenon
Given a fixed number of data points, the performance of a regressor or classifier first increases and then decreases as the number of dimensions of the data increases.
Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may have different
names in different databases
– Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product moment coefficient):
  r(A, B) = Σ (aᵢ − Ā)(bᵢ − B̄) / (n · σ_A · σ_B) = (Σ aᵢbᵢ − n·Ā·B̄) / (n · σ_A · σ_B)
  where n is the number of tuples, Ā and B̄ are the respective means, and σ_A and σ_B are the respective standard deviations of A and B.
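A small sketch computing r with NumPy (the two arrays are the stock-price series used in the covariance example below):

```python
import numpy as np

# Stock prices A and B from the covariance example in this section
a = np.array([2, 3, 5, 4, 6])
b = np.array([5, 8, 10, 11, 14])

# Pearson correlation via the Cov(A, B) / (sigma_A * sigma_B) definition
r = np.cov(a, b, bias=True)[0, 1] / (a.std() * b.std())
print(r)                        # ≈ 0.94, strong positive correlation

# Same value from NumPy's built-in helper
print(np.corrcoef(a, b)[0, 1])
```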
Covariance (Numeric Data)
• Covariance is similar to correlation:
  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (aᵢ − Ā)(bᵢ − B̄) / n = E(A·B) − Ā·B̄
• Correlation coefficient: r(A, B) = Cov(A, B) / (σ_A · σ_B)
• Independence: if A and B are independent, Cov(A, B) = 0 (the converse is not true in general)
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
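Completing the calculation:
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A, B) = E(A·B) − E(A)·E(B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 − 4 × 9.6 = 212/5 − 38.4 = 42.4 − 38.4 = 4
• Since Cov(A, B) = 4 > 0, the prices of stocks A and B tend to rise and fall together.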
• χ² (chi-square) test: χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction           250            200            450
Not like science fiction        50           1000           1050
Sum (col.)                     300           1200           1500
Do you think there is a correlation between 'Play chess' and 'Like science fiction'?
For the above cross-tabulation:
• There are two possible values in the rows: Like science fiction, Not like science fiction
• There are two possible values in the columns: Play chess, Not play chess
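Working through the χ² calculation for this table (each expected count is row total × column total / grand total):
• Expected(Like science fiction, Play chess) = 450 × 300 / 1500 = 90
• Expected(Like science fiction, Not play chess) = 450 × 1200 / 1500 = 360
• Expected(Not like science fiction, Play chess) = 1050 × 300 / 1500 = 210
• Expected(Not like science fiction, Not play chess) = 1050 × 1200 / 1500 = 840
• χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 ≈ 284.44 + 121.90 + 71.11 + 30.48 ≈ 507.93
• This is far above the critical value for 1 degree of freedom (10.83 at the 0.001 significance level), so 'Play chess' and 'Like science fiction' are strongly correlated.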
Feature Construction
Feature Selection Methods
Forward selection
• Starts with one predictor and adds more iteratively.
• At each subsequent iteration, the best of the remaining original predictors is added based on a performance criterion.
• SequentialFeatureSelector class from mlxtend.
Backward elimination
• Starts with all predictors and eliminates them one by one iteratively.
• One of the most popular algorithms is Recursive Feature Elimination (RFE), which eliminates less important predictors based on feature importance ranking.
• RFE class from sklearn (see the sketch below).
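A minimal sketch of backward elimination with scikit-learn's RFE, using a synthetic dataset and a logistic-regression estimator (both chosen here only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which only a few are informative
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Recursive Feature Elimination: repeatedly fit the estimator and
# drop the least important feature until 4 features remain
selector = RFE(estimator=LogisticRegression(max_iter=1000),
               n_features_to_select=4)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected; higher = eliminated earlier
```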