Feature Scaling
In a given dataset, different columns can span very different ranges. For example, one column may be measured in units of distance and another in units of currency. These two columns will have starkly different ranges, making it difficult for many machine learning models to reach an optimal state.
Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed. Examples of such algorithm families include gradient descent-based algorithms (e.g., linear regression, logistic regression, neural networks) and distance-based algorithms (e.g., KNN, K-means, SVM), both of which are discussed later in this section.
In more technical terms, if gradient descent is used as the optimizer, it will take longer to converge, since it has to traverse feature dimensions whose ranges are far apart. This is demonstrated in the figure below.
The diagram on the left shows scaled features. The features are brought down to values that are comparable with one another, so the optimization routine does not have to take large leaps to reach the optimal point. Scaling is not necessary for algorithms that are not distance-based (such as decision trees). Distance-based models, however, must have scaled features without exception.
Algorithms that do not require normalization/scaling are the ones that rely on rules; they are not affected by monotonic transformations of the variables, and scaling is a monotonic transformation. Examples in this category are the tree-based algorithms: (1) CART, (2) Random Forests, and (3) Gradient Boosted Decision Trees. These algorithms use rules (series of inequalities) and do not require normalization. Algorithms such as (4) Linear Discriminant Analysis (LDA) and (5) Naive Bayes are by design equipped to handle differing ranges and weight the features accordingly, so performing feature scaling on them may not have much effect.
A machine learning algorithm just sees numbers. If there is a vast difference in range, say some features in the thousands and others in the tens, the algorithm implicitly assumes that the larger numbers have some kind of superiority, so the features with larger values start playing a more decisive role while training the model.
Example: if an algorithm does not use feature scaling, it can consider the value 3000 (meters) to be greater than 5 (km), which is not actually true, and in that case the algorithm will give wrong predictions. We therefore use feature scaling to bring all values to comparable magnitudes and tackle this issue.
The machine learning algorithm works on numbers and does not know what a number represents. A weight of 10 grams and a price of 10 dollars represent two completely different things, which is a no-brainer for humans, but to a model both features look the same.
Suppose we have two features, weight and price, where the weight values (say, in grams) are in the thousands and the price values (in dollars) are in the tens. "Weight" cannot be meaningfully compared with "Price", yet the algorithm assumes that because the "Weight" values are larger than the "Price" values, "Weight" is more important than "Price". The larger numbers then play a more decisive role while training the model. Feature scaling is needed to put every feature on the same footing, without giving any of them upfront importance. Interestingly, if we convert the weight to kilograms, "Price" becomes the dominant feature.
Another reason feature scaling is applied is that some algorithms, such as neural networks trained with gradient descent, converge much faster with feature scaling than without it.
a. Min-Max Scaler: the min-max scaler shrinks the feature values into any range of choice, for example between 0 and 5.
b. Standard Scaler: the standard scaler assumes that the variable is normally distributed and scales it so that the standard deviation is 1 and the distribution is centered at 0. Deep learning algorithms often call for zero mean and unit variance. Regression-type algorithms also benefit from normally distributed data when sample sizes are small.
c. Robust Scaler: robust scaler works best when there are outliers in the dataset. It scales the
data with respect to the inter-quartile range after removing the median. RobustScaler
transforms the feature vector by subtracting the median and then dividing by the interquartile
range (75% value — 25% value). Use RobustScaler if you want to reduce the effects of outliers,
relative to MinMaxScaler.
d. Max-Abs Scaler: similar to min-max scaler, but instead of a given range, the feature is scaled
to its maximum absolute value. The sparsity of the data is preserved since it does not center
the data.
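To make the differences concrete, here is a minimal sketch comparing the four scalers above with scikit-learn; the one-column dataset (including the outlier value 1000) is made up purely for illustration.

import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, RobustScaler, StandardScaler

# One feature with an outlier (1000) to show how each scaler reacts.
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

scalers = {
    "Min-Max (0 to 5)": MinMaxScaler(feature_range=(0, 5)),
    "Standard": StandardScaler(),
    "Robust": RobustScaler(),
    "Max-Abs": MaxAbsScaler(),
}

for name, scaler in scalers.items():
    # fit_transform learns the scaling statistics and applies them in one step
    print(name, scaler.fit_transform(X).ravel().round(3))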
Standardizing Data
Standardizing data helps us transform attributes with a Gaussian distribution of differing means
and of differing standard deviations into a standard Gaussian distribution with a mean of 0 and a
standard deviation of 1. Standardization of data is done using scikit-learn with the StandardScaler
class.
The Standard Scaler assumes the data is normally distributed within each feature and scales it so that the distribution is centered around 0 with a standard deviation of 1. Centering and scaling happen independently on each feature, by computing the relevant statistics on the samples in the training set. If the data is not normally distributed, this is not the best scaler to use.
Z-score standardization is one of the most popular ways to standardize data. We rescale the original variable to have a mean of zero and a standard deviation of one: the scaled value is obtained by subtracting the mean of the original variable from the raw value and then dividing by the standard deviation of the original variable, i.e. z = (x − mean) / standard deviation.
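As a minimal sketch of z-score standardization with scikit-learn's StandardScaler (the column of values is made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)      # (x - mean) / std, computed per column

print(X_scaled.ravel())                 # mean 0, standard deviation 1
print(scaler.mean_, scaler.scale_)      # the learned mean and std of the column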
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling: features are transformed by scaling each feature to a given range. The estimator scales and translates each feature individually so that it lies in the given range on the training set, e.g., between zero and one. The range can be set to [0, 1], [0, 5], [-1, 1], and so on; with a range such as [-1, 1], the scaled data can also take negative values. This scaler responds well when the standard deviation is small and the distribution is not Gaussian, but it is sensitive to outliers. The transformation is X' = (X − Xmin) / (Xmax − Xmin), so:
• When the value of X is the minimum value in the column, the numerator will be 0, and hence
X’ is 0
• On the other hand, when the value of X is the maximum value in the column, the numerator
is equal to the denominator and thus the value of X’ is 1
• If the value of X is between the minimum and the maximum value, then the value of X’ is
between 0 and 1
For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then
divides by the range. The range is the difference between the original maximum and original
minimum. MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully
change the information embedded in the original data. Note that MinMaxScaler doesn’t reduce the
importance of outliers. The default range for the feature returned by MinMaxScaler is 0 to 1.
MinMaxScaler isn’t a bad place to start, unless you know you want your feature to have a normal
distribution or you have outliers and you want them to have reduced influence.
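A minimal Min-Max scaling sketch with scikit-learn's MinMaxScaler; the numbers are illustrative only.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

scaler = MinMaxScaler()                    # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)         # (x - min) / (max - min)

print(X_scaled.ravel())                    # [0.   0.25 0.5  0.75 1.  ]
print(scaler.data_min_, scaler.data_max_)  # learned per-column min and max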
Binarizing Data
In this method, all the values that are above the threshold are transformed into 1 and those equal to
or below the threshold are transformed into 0. This method is useful when we deal with probabilities
and need to convert the data into crisp values. Binarizing is done using the Binarizer class.
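A minimal sketch of binarization with scikit-learn's Binarizer; the probability-like scores and the 0.5 threshold are chosen only for illustration.

import numpy as np
from sklearn.preprocessing import Binarizer

scores = np.array([[0.10], [0.40], [0.50], [0.80], [0.95]])

binarizer = Binarizer(threshold=0.5)            # values > 0.5 become 1, values <= 0.5 become 0
print(binarizer.fit_transform(scores).ravel())  # [0. 0. 0. 1. 1.]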
Robust Scaler:
As the name suggests, this Scaler is robust to outliers. If our data contains many outliers, scaling
using the mean and standard deviation of the data won’t work well.
This Scaler removes the median and scales the data according to the quantile range (defaults to
IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd
quartile (75th quantile). The centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers. Note that the outliers themselves are still present in the transformed data; if separate outlier clipping is desired, a non-linear transformation is required.
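A minimal sketch of RobustScaler on data containing an outlier; the values are made up for illustration.

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # 100 is an outlier

scaler = RobustScaler()                  # (x - median) / IQR by default
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())                  # the outlier is still present, but the bulk of
                                         # the data is scaled using the median and IQR
print(scaler.center_, scaler.scale_)     # learned median and IQR of the column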
The power transformer is a family of parametric, monotonic transformations that are applied to
make data more Gaussian-like. This is useful for modeling issues related to the variability of a
variable that is unequal across the range (heteroscedasticity) or situations where normality is
desired.
The power transform finds the optimal parameter for stabilizing variance and minimizing skewness through maximum likelihood estimation. Currently, the scikit-learn implementation, PowerTransformer, supports the Box-Cox transform and the Yeo-Johnson transform. Box-Cox requires the input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
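A minimal sketch of PowerTransformer on right-skewed, made-up data; the exponential sample and the choice of Box-Cox are assumptions for illustration.

import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 1))   # right-skewed and strictly positive

# Box-Cox needs strictly positive input; Yeo-Johnson (the default method) also
# accepts zero and negative values.
pt = PowerTransformer(method="box-cox")
X_gaussian = pt.fit_transform(X)

print(pt.lambdas_)                               # lambda estimated by maximum likelihood
print(X_gaussian.mean(), X_gaussian.std())       # roughly 0 and 1 (standardize=True by default)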
Scaling Transformation
Standardization and Normalization are both scaling techniques. Standardization rescales the data based on the Z-score, using the formula (x − mean) / standard deviation, which typically compresses most of the data into roughly the range -3 to 3. Normalization, performed with the Min-Max Scaler, rescales the data using the formula (x − min) / (max − min), which compresses the data into the range 0 to 1.
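A small worked example (with made-up values) contrasting the two formulas:

import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

standardized = (x - x.mean()) / x.std()            # z-score: mean 0, std 1
normalized = (x - x.min()) / (x.max() - x.min())   # min-max: range [0, 1]

print(standardized.round(3))   # [-1.414 -0.707  0.     0.707  1.414]
print(normalized)              # [0.   0.25 0.5  0.75 1.  ]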
Some machine learning algorithms are sensitive to feature scaling while others are virtually invariant
to it.
Gradient Descent Based Algorithms: Machine learning algorithms like linear regression, logistic
regression, neural network, etc. that use gradient descent as an optimization technique require data
to be scaled. Take a look at the formula for gradient descent below:
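For reference, a standard form of the update rule, assuming a linear model with hypothesis h_θ, learning rate α, and m training samples (a generic illustration rather than a specific figure), is:

θ_j := θ_j − α · (1/m) · Σ_i ( h_θ(x^(i)) − y^(i) ) · x_j^(i)

where x_j^(i) is the value of feature j for the i-th training sample.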
The presence of the feature value X in the formula means that the range of a feature affects the step size of gradient descent: features with different ranges lead to different step sizes for each feature. To ensure that gradient descent moves smoothly towards the minima and that the steps are updated at the same rate for all features, we scale the data before feeding it to the model. Having features on a similar scale helps gradient descent converge more quickly towards the minima.
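As a rough sketch of how this is usually wired up, the scaler and a gradient-descent-based model can be combined in a scikit-learn pipeline so the scaling statistics are learned on the training split only; the dataset and model choice here are assumptions made for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A linear classifier trained with stochastic gradient descent, preceded by scaling.
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))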
Distance-Based Algorithms: Distance algorithms like KNN, K-means, and SVM are most affected
by the range of features. This is because behind the scenes they are using distances between data
points to determine their similarity.
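A hedged sketch of the same point for a distance-based model: KNN fitted with and without scaling on a dataset whose features have very different ranges (the dataset and settings are illustrative).

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)     # feature ranges differ by orders of magnitude
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_train, y_train)

# Without scaling, the wide-range features dominate the Euclidean distance.
print("KNN without scaling:", raw.score(X_test, y_test))
print("KNN with scaling:   ", scaled.score(X_test, y_test))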
Tree-Based Algorithms: tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. A decision tree only splits a node on a single feature, choosing the feature that increases the homogeneity of the node, and that split is not influenced by the other features. So the remaining features have virtually no effect on the split, which is what makes tree-based models invariant to the scale of the features!
Normalizer is also a normalization technique; the difference is the way it computes the normalized values. By default it uses the L2 norm of the row values, i.e., each element of a row is divided by the square root of the sum of squared values of all elements in that row. It is useful in text classification, where the dot product of two TF-IDF vectors gives the cosine similarity between the different sentences/documents in the dataset.
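A minimal sketch of Normalizer, which scales each row to unit L2 norm (the two-row matrix is made up for illustration):

import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

normalizer = Normalizer(norm="l2")        # per-row scaling, not per-column
X_unit = normalizer.fit_transform(X)

print(X_unit)                             # [[0.6 0.8]
                                          #  [1.  0. ]]
print(np.linalg.norm(X_unit, axis=1))     # every row now has L2 norm 1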
For distance-based models, standardization is performed to prevent features with wider ranges from dominating the distance metric. Beyond that, standardization is commonly applied in the following situations:
• BEFORE PCA:
In Principal Component Analysis, features with high variances/wide ranges get more weight than those with low variance and consequently end up illegitimately dominating the first principal components (the components with maximum variance).
• BEFORE CLUSTERING:
Clustering models are distance-based algorithms: in order to measure similarities between observations and form clusters, they use a distance metric. Features with high ranges will therefore have a bigger influence on the clustering, and standardization is required before building a clustering model.
• BEFORE KNN:
k-nearest neighbor is a distance-based classifier that classifies new observations based on
similarity measures (e.g., distance metrics) with labeled observations of the training set.
Standardization makes all variables contribute equally to the similarity measures.
• BEFORE SVM
Support Vector Machine tries to maximize the distance between the separating plane and the
support vectors. If one feature has very large values, it will dominate over other features when
calculating the distance. So Standardization gives all features the same influence on the
distance metric.
• BEFORE MEASURING VARIABLE IMPORTANCE IN REGRESSION MODELS
You can measure variable importance in regression analysis by fitting a regression model on the standardized independent variables and comparing the absolute values of their standardized coefficients. If the independent variables are not standardized, comparing their coefficients becomes meaningless.
• BEFORE LASSO AND RIDGE REGRESSION
LASSO and Ridge regression place a penalty on the magnitude of the coefficient associated with each variable, and the scale of a variable affects how much its coefficient is penalized: variables with large variance have small coefficients and are therefore penalized less. Standardization is therefore required before fitting either regression.
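As a sketch of the first two cases in the list above (standardizing before PCA and before clustering), the usual pattern is again a pipeline; the dataset and parameters are assumptions chosen only for demonstration.

from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

# Without scaling, the features with the widest ranges dominate the principal components.
pca_raw = PCA(n_components=2).fit(X)
pca_scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
print("explained variance ratio (raw):   ", pca_raw.explained_variance_ratio_)
print("explained variance ratio (scaled):", pca_scaled.named_steps["pca"].explained_variance_ratio_)

# The same idea applies before clustering: distances are computed on scaled features.
kmeans = make_pipeline(StandardScaler(), KMeans(n_clusters=3, n_init=10, random_state=0))
print(kmeans.fit_predict(X)[:10])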
Tree-based algorithms such as decision trees, random forests, and gradient boosting, in contrast, remain fairly insensitive to the scale of the features, for the reasons discussed above: each split considers only a single feature and is not influenced by the others.