
Introduction to Data Science

Data Preprocessing (Contd.)

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

• The slides presented here are obtained from the authors of the books and from various other contributors. I hereby acknowledge all the contributors for their material and inputs.
• We have added and modified slides to suit the requirements of the course.


Handling Numeric Data



Handling Numeric Data

Techniques:
• Discretization – convert numeric data into discrete categories.
• Binarization – convert numeric data into binary categories.
• Normalization – scale numeric data to a specific range.
• Smoothing – remove noisy variations from the data. Techniques include binning, regression, and clustering.

Example: the Pima Indians Diabetes dataset, whose numeric attributes come with diverse ranges.
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Discretization

Convert a continuous attribute into a discrete attribute.

Discretization involves converting the raw values of a numeric attribute (e.g., age) into
• interval labels (e.g., 0–10, 11–20, etc.)
• conceptual labels (e.g., youth, adult, senior)

Discretization Process
• The raw data are replaced by a smaller number of interval or concept labels.
• This simplifies the original data and makes the mining more efficient.
• Concept hierarchies are also useful for mining at multiple abstraction levels.


Concept Hierarchy

Divide the range of a continuous attribute into intervals. Interval labels can then be used to replace actual data values.
The labels, in turn, can be recursively organized into higher-level concepts. This results in a concept hierarchy for the numeric attribute.


Discretization Techniques

Discretization techniques can be categorized based on how the discretization is performed.

Supervised vs. unsupervised discretization
• If the discretization process uses class information, it is supervised discretization. Otherwise, it is unsupervised.

Top-down discretization (splitting)
• The process starts by finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats recursively on the resulting intervals.

Bottom-up discretization (merging)
• The process starts by considering all of the continuous values as potential split points.
• It removes some by merging neighboring values to form intervals, and then applies this process recursively to the resulting intervals.
Discretization by Binning Methods

1. Equal Width (distance) binning
• Each bin has equal width: width = (max − min) / #bins
• Highly sensitive to outliers.
• If outliers are present, the width of each bin is large, resulting in skewed data.

2. Equal Depth (frequency) binning
• Specify the number of values that have to be stored in each bin.
• The number of entries in each bin is equal.
• Occurrences of the same value may end up in different bins.
Binning Example

Discretize the following data into 3 discrete categories using the binning technique:
70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81, 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70.
Binning Example

Original data (sorted): 53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal width: width = (81 − 53)/3 = 28/3 ≈ 9.33
  Bin 1 [53, 62.33):    53, 56, 57
  Bin 2 [62.33, 71.67): 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70
  Bin 3 [71.67, 81]:    72, 73, 75, 75, 76, 76, 78, 79, 80, 81

Equal depth: depth = 24/3 = 8 values per bin
  Bin 1: 53, 56, 57, 63, 66, 67, 67, 67
  Bin 2: 68, 69, 70, 70, 70, 70, 72, 73
  Bin 3: 75, 75, 76, 76, 78, 79, 80, 81
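A minimal sketch of both binning methods on this data, assuming Python with pandas:

import pandas as pd

values = pd.Series([53, 56, 57, 63, 66, 67, 67, 67, 68, 69, 70, 70,
                    70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 80, 81])

# Equal-width binning: 3 bins, each of width (max - min) / 3
equal_width = pd.cut(values, bins=3)

# Equal-depth (equal-frequency) binning: 3 bins with 8 values each
equal_depth = pd.qcut(values, q=3)

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())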


Discretization by Histogram Analysis

Histogram analysis is an unsupervised discretization technique because it does not use class information.
Histograms use binning to approximate data distributions and are a popular form of data reduction.
A histogram for an attribute, X, partitions the data distribution of X into disjoint subsets, referred to as buckets or bins.
If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Often, buckets represent continuous ranges for the given attribute.
The histogram analysis algorithm can be applied recursively to each partition in order to automatically generate a multilevel concept hierarchy.


Discretization by Histogram Analysis

1. Equal Width Histogram
• The values are partitioned into equal-size partitions or ranges.

2. Equal Frequency Histogram
• The values are partitioned such that each partition contains the same number of data objects.


Discretization Without Supervision: Binning vs. Clustering

[Figure: the same data discretized by equal-width (distance) binning, equal-depth (frequency) binning, and K-means clustering; K-means clustering leads to better results.]
Variable Transformation

Variable transformation involves changing the values of an attribute.
For each object (tuple), a transformation is applied to the value of the variable for that object.
1. Simple functional transformations
2. Normalization

Simple Functional Transformation

https://aegis4048.github.io/transforming-non-normal-distribution-to-normal-distribution
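Common simple functional transformations include the logarithm, square root, and Box-Cox transforms. A minimal sketch, assuming Python with NumPy and SciPy:

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 5.0, 10.0, 50.0, 200.0])  # right-skewed, strictly positive values

log_x = np.log1p(x)              # log(1 + x) compresses large values
sqrt_x = np.sqrt(x)              # milder compression than the log transform
boxcox_x, lam = stats.boxcox(x)  # Box-Cox estimates the power transform that best normalizes x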
Normalization

Normalizing the data attempts to give all attributes an equal weight.
The goal of standardization or normalization is to make an entire set of values have a particular property.
Normalization is particularly useful for:
• classification algorithms involving neural networks: normalizing the input values for each attribute in the training tuples will help speed up the learning phase.
• distance measurements such as nearest-neighbor classification and clustering: normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).


Why Feature Scaling?

Features with bigger magnitudes dominate over features with smaller magnitudes, so it is good practice to have all variables on a similar scale.
Euclidean distances are sensitive to feature magnitude.
Gradient descent converges faster when all the variables are on a similar scale.
Feature scaling helps decrease the time needed to find support vectors.

Why Feature Scaling?

For distance-based methods, normalization helps prevent attributes with initially large ranges (e.g., income) from outweighing attributes with initially smaller ranges (e.g., binary attributes).


Algorithms Sensitive to Feature Magnitude

• Linear and Logistic Regression
• Neural Networks
• Support Vector Machines
• KNN
• K-Means Clustering
• Linear Discriminant Analysis (LDA)
• Principal Component Analysis (PCA)
Normalization

Scale the feature magnitude to a standard range like [0, 1] or [−1, +1] or any other.
Techniques
• Min-Max normalization
• z-score normalization
• Normalization by decimal scaling
What is the impact of outliers in the data?
Min-Max Scaling

Min-max scaling squeezes (or stretches) all feature values to lie within the range [0, 1].
Min-max normalization preserves the relationships among the original data values.
It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for X.


Min-Max Normalization

Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. The new range is [0.0, 1.0]. Apply min-max normalization to the value $73,600.
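Worked solution, using the standard min-max formula x' = (x − min) / (max − min) × (new_max − new_min) + new_min:

  x' = (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0.0) + 0.0 = 61,600 / 86,000 ≈ 0.716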


z-score Normalization

In z-score normalization (or zero-mean normalization), the values for an attribute, x, are normalized based on the mean µ(x) and standard deviation σ(x) of x.
The resulting scaled feature has a mean of 0 and a variance of 1. The z-score is

  z = (x − µ(x)) / σ(x)

z-score normalization is useful when the actual minimum and maximum of attribute X are unknown, or when there are outliers that dominate the min-max normalization.
z-score Normalization

Suppose that the mean and standard deviation of the values for the attribute income are $54,000 and $16,000, respectively. Apply z-score normalization to the value $73,600.
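Worked solution, substituting into the z-score formula above:

  z = (73,600 − 54,000) / 16,000 = 19,600 / 16,000 = 1.225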


Decimal Normalization

Normalizes by moving the decimal point of values of attribute x.
The number of decimal places moved depends on the maximum absolute value of x:

  x' = x / 10^j, where j is the smallest integer such that max(|x'|) < 1

The new range is (−1, +1).


Decimal Normalization

Example 1 (CGPA: maximum absolute value < 10, so divide by 10)
  CGPA     Formula          Normalized CGPA
  2        2/10             0.2
  3        3/10             0.3

Example 2 (Bonus: maximum absolute value < 1000, so divide by 1000)
  Bonus    Formula          Normalized Bonus
  450      450/1000         0.45
  310      310/1000         0.31

Example 3 (Salary: maximum absolute value < 100000, so divide by 100000)
  Salary   Formula          Normalized Salary
  48000    48000/100000     0.48
  67000    67000/100000     0.67
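A minimal sketch of decimal scaling, assuming Python with NumPy, using the salary values from Example 3:

import numpy as np

x = np.array([48000, 67000])

# Smallest j such that max(|x|) / 10**j < 1
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))   # j = 5 here
x_scaled = x / 10**j                              # [0.48, 0.67]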
CATEGORICAL ENCODING

We need to convert categorical columns to numeric columns so that a machine learning algorithm can understand them.

Categorical encoding is the process of converting categories to numbers.

• Binarization maps a continuous or categorical attribute into one or more binary attributes.
• The encoding must maintain any ordinal relationship.
• Algorithms that find association patterns require the data to be in the form of binary attributes, e.g., the Apriori algorithm and the Frequent Pattern (FP) Growth algorithm.
CATEGORICAL ENCODING: Techniques

• One-hot encoding
• Label encoding


One-Hot Encoding

Encode each categorical variable with a set of Boolean variables which take values 0 or 1, indicating whether a category is present for each observation.
One binary attribute is created for each categorical value.

Advantages
• Makes no assumption about the distribution or categories of the categorical variable.
• Keeps all the information of the categorical variable.
• Suitable for linear models.

Disadvantages
• Expands the feature space.
• Does not add extra information while encoding.
• Many dummy variables may be identical, introducing redundant information.
• The number of resulting attributes may become too large.
One-Hot Encoding Example

Assume an ordinal attribute representing the service quality of a restaurant: (Awful < Poor < OK < Good < Great). One-hot encoding requires 5 bits, one for each categorical value.

  Service Quality   X1  X2  X3  X4  X5
  Awful              0   0   0   0   1
  Poor               0   0   0   1   0
  OK                 0   0   1   0   0
  Good               0   1   0   0   0
  Great              1   0   0   0   0
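A minimal sketch of one-hot encoding for the same service-quality categories, assuming Python with pandas and scikit-learn (note that both libraries order the resulting columns alphabetically rather than in the X1–X5 layout above):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"service": ["Awful", "Poor", "OK", "Good", "Great", "OK"]})

# Option 1: pandas dummy variables, one binary column per category
dummies = pd.get_dummies(df["service"], prefix="service")

# Option 2: scikit-learn encoder, reusable on new data via transform()
enc = OneHotEncoder(handle_unknown="ignore")
onehot = enc.fit_transform(df[["service"]]).toarray()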


Label Encoding

Replace the categories by digits from 1 to n (or 0 to n − 1, depending on the implementation), where n is the number of distinct categories of the variable.
The categories are arranged in ascending order and the numbers are assigned.

Advantages
• Straightforward to implement.
• Does not expand the feature space.
• Works well enough with tree-based algorithms.

Disadvantages
• Does not add extra information while encoding.
• Not suitable for linear models.
• Does not handle new categories in the test set automatically.

Used for features which have multiple values in their domain, e.g., colour, protocol type.
Label Encoding Example

Assume an ordinal attribute representing the service quality of a restaurant: (Awful, Poor, OK, Good, Great).

  Service Quality   Integer Value
  Awful             0
  Poor              1
  OK                2
  Good              3
  Great             4
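A minimal label-encoding sketch, assuming Python with pandas and scikit-learn (the "service" column is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"service": ["Awful", "Poor", "OK", "Good", "Great"]})

# LabelEncoder assigns integers in alphabetical order of the categories
le = LabelEncoder()
df["service_label"] = le.fit_transform(df["service"])

# To reproduce the table above (Awful = 0 ... Great = 4), fix the order explicitly
# with an ordered pandas Categorical:
order = ["Awful", "Poor", "OK", "Good", "Great"]
df["service_ordinal"] = pd.Categorical(df["service"], categories=order, ordered=True).codes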


Feature Engineering

Feature Engineering

Feature Engineering is the process of selecting and extracting useful, predictive features from data.
The goal is to create a set of features that best represent the information contained in the data, producing a simpler model that generalizes well to future observations.
Motivation for Feature Engineering
Hughes Phenomenon
Given a fixed number of data points, the performance of a regressor or a classifier first increases but later decreases as the number of dimensions of the data increases.

Reasons for this phenomenon
• Redundant features
• Correlation between features
• Irrelevant features


Handling Redundancy

Handling Redundancy in Data Integration
• Redundant data often occur when multiple databases are integrated.
  – Object identification: the same attribute or object may have different names in different databases.
  – Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue.
• Redundant attributes may be detected by correlation analysis and covariance analysis.
• Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson's product moment coefficient):

  r(A,B) = [ Σ_{i=1..n} (a_i − mean(A))(b_i − mean(B)) ] / (n σ_A σ_B) = [ Σ_{i=1..n} a_i b_i − n · mean(A) · mean(B) ] / (n σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective means of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
• −1 <= r(A,B) <= +1
• If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation.
• r(A,B) = 0: A and B are uncorrelated (no linear relationship; this does not by itself imply independence).
• r(A,B) < 0: A and B are negatively correlated.


Correlation (viewed as linear relationship)

• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize the data objects, A and B, and then take their dot product:

  a'_k = (a_k − mean(A)) / std(A)
  b'_k = (b_k − mean(B)) / std(B)

  correlation(A, B) = A' · B'
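To make this concrete, a minimal sketch assuming Python with NumPy, using the stock values from the covariance example later in this deck:

import numpy as np

A = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
B = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize, then take the dot product; dividing by n gives the Pearson correlation
A_std = (A - A.mean()) / A.std()
B_std = (B - B.mean()) / B.std()
r_manual = np.dot(A_std, B_std) / len(A)

# The same value from NumPy's built-in Pearson correlation
r_builtin = np.corrcoef(A, B)[0, 1]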
Covariance (Numeric Data)
• Covariance is similar to correlation:

  Cov(A, B) = E[(A − mean(A)) (B − mean(B))] = Σ_{i=1..n} (a_i − mean(A)) (b_i − mean(B)) / n

  Correlation coefficient:

  r(A,B) = Cov(A, B) / (σ_A σ_B)

  where n is the number of tuples, mean(A) and mean(B) are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov(A,B) = 0 (the converse does not hold in general).


Co-Variance: An Example

• The computation can be simplified as Cov(A, B) = E(A·B) − mean(A)·mean(B).

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?

• E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
• Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
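A quick check of this result, assuming Python with NumPy:

import numpy as np

A = np.array([2, 3, 5, 4, 6])
B = np.array([5, 8, 10, 11, 14])

# Cov(A, B) = E(A*B) - E(A)*E(B)
cov_simplified = (A * B).mean() - A.mean() * B.mean()   # 4.0

# np.cov with bias=True uses the same 1/n (population) definition
cov_numpy = np.cov(A, B, bias=True)[0, 1]               # 4.0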


Correlation Analysis (Nominal Data)
• χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related.
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.
• Correlation does not imply causality:
  • the number of hospitals and the number of car thefts in a city are correlated;
  • both are causally linked to a third variable: population.
Chi-Square Calculation: An Example
                             Play chess   Not play chess   Sum (row)
Like science fiction             250            200            450
Not like science fiction          50           1000           1050
Sum (col.)                       300           1200           1500

Do you think there is a correlation between 'Play chess' and 'Like science fiction'?
For the above cross-tabulation:
• There are two possible values in the rows: Like science fiction, Not like science fiction.
• There are two possible values in the columns: Play chess, Not play chess.

Degrees of freedom = (rows − 1) × (cols − 1) = (2 − 1)(2 − 1) = 1


Chi-Square Calculation: Computation
                             Play chess   Not play chess   Sum (row)
Like science fiction          250 (90)       200 (360)        450
Not like science fiction       50 (210)     1000 (840)       1050
Sum (col.)                       300            1200          1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• We can check the statistical significance of the chi-square value in a standard χ² table.
• It shows that like_science_fiction and play_chess are correlated in the dataset.
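A minimal cross-check of this computation, assuming Python with SciPy:

import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows = (like sci-fi, not like sci-fi), columns = (play chess, not play chess)
observed = np.array([[250, 200],
                     [50, 1000]])

# correction=False disables the Yates continuity correction so the result matches the slide
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
# chi2 ≈ 507.93, dof = 1, expected = [[90, 360], [210, 840]]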




Feature Construction

Feature Construction

Create derived features:
• Involves creating a new feature using data from existing features.
• Mostly relies on domain knowledge.
• E.g., calculating price per sq ft:

  Area (sq ft)   Price (Rs)     Price/Sq ft (Rs)
  1800           81,00,000      4500
  2000           78,00,000      3900
  1550           65,10,000      4200
  2400           1,15,20,000    4800
  3500           1,22,50,000    3500
  2800           1,45,60,000    5200
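A minimal sketch of this construction, assuming Python with pandas (column names are illustrative):

import pandas as pd

houses = pd.DataFrame({
    "area_sqft": [1800, 2000, 1550, 2400, 3500, 2800],
    "price_rs":  [8100000, 7800000, 6510000, 11520000, 12250000, 14560000],
})

# Derived feature: price per square foot
houses["price_per_sqft"] = houses["price_rs"] / houses["area_sqft"]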




Feature Selection

Feature Selection Methods

Forward selection
• Starts with one predictor and adds more iteratively.
• At each subsequent iteration, the best of the remaining original predictors is added based on a performance criterion.
• SequentialFeatureSelector class from mlxtend.

Backward elimination
• Starts with all predictors and eliminates them one by one iteratively.
• One of the most popular algorithms is Recursive Feature Elimination (RFE), which eliminates less important predictors based on a feature-importance ranking.
• RFE class from sklearn.

A sketch of both approaches on the wine data follows below.
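A minimal sketch, assuming scikit-learn's bundled wine dataset and the mlxtend package (the wine-data screenshots on the next two slides are not reproduced in this extraction):

from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

X, y = load_wine(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Forward selection (SFS): start small and add the best remaining feature at each step
sfs = SFS(estimator, k_features=5, forward=True, scoring="accuracy", cv=5)
sfs = sfs.fit(X, y)
print("Forward-selected feature indices:", sfs.k_feature_idx_)

# Backward elimination via RFE: start with all features, drop the least important each step
rfe = RFE(estimator, n_features_to_select=5)
rfe = rfe.fit(X, y)
print("RFE-selected feature mask:", rfe.support_)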


SFS Example – wine data
[Figure: sequential forward selection on the wine data; image not reproduced in this extraction.]

SBS Example – wine data
[Figure: sequential backward selection on the wine data; image not reproduced in this extraction.]


Text Books

T1: Introduction to Data Mining, by Tan, Steinbach, and Kumar.

T2: Introducing Data Science, by Cielen, Meysman, and Ali.

T3: Storytelling with Data: A Data Visualization Guide for Business Professionals, by Cole Nussbaumer Knaflic; Wiley.

T4: Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han and Micheline Kamber; Morgan Kaufmann Publishers.


Thank You

