AIML | YBI Foundation

Feature Scaling

In a given data set, different columns can be present in very different ranges. For example, one column may be measured in units of distance and another in units of a currency. These two columns will have starkly different ranges, which makes it harder for many machine learning models to converge to an optimal solution.

Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed. In more technical terms, if one considers gradient descent, it takes longer for the algorithm to converge, since it has to work across ranges that are far apart. Examples of algorithm families that benefit from scaling include:

1. linear and logistic regression
2. neural networks
3. support vector machines with radial basis function kernels
4. linear discriminant analysis
5. k-nearest neighbours (KNN) with a Euclidean distance measure, which is sensitive to magnitudes, so all features should be scaled to weigh in equally
6. K-Means, which also uses the Euclidean distance measure, so feature scaling matters
7. Principal Component Analysis (PCA): PCA tries to find the features with maximum variance, and variance is higher for high-magnitude features, which skews PCA towards them
8. gradient descent itself: scaling speeds it up, because θ descends quickly on small ranges and slowly on large ranges, and oscillates inefficiently down to the optimum when the variables are very uneven

With scaled features, values are brought down to magnitudes that are comparable with one another, so the optimization function does not have to take major leaps to reach the optimal point. Scaling is not necessary for algorithms that are not distance-based (like the decision tree). Distance-based models, however, must have scaled features without exception.

Algorithms that do not require normalization/scaling are the ones that rely on rules. They are not affected by monotonic transformations of the variables, and scaling is a monotonic transformation. Examples in this category are the tree-based algorithms: (1) CART, (2) Random Forests, (3) Gradient Boosted Decision Trees. These algorithms use rules (series of inequalities) and do not require normalization. Algorithms like (4) Linear Discriminant Analysis (LDA) and (5) Naive Bayes are by design equipped to handle differing scales and weight the features accordingly, so performing feature scaling on them may not have much effect.

Why do we need scaling?

A machine learning algorithm just sees numbers. If there is a vast difference in range, say a few features ranging in the thousands and a few ranging in the tens, it makes the underlying assumption that higher-ranging numbers have superiority of some sort. So these larger numbers start playing a more decisive role while training the model.

Example: if an algorithm does not use feature scaling, it can consider the value 3000 metres to be greater than 5 km, which is not true, and in this case the algorithm will give wrong predictions. So we use feature scaling to bring all values to comparable magnitudes and tackle this issue.

A machine learning algorithm works on numbers and does not know what those numbers represent. A weight of 10 grams and a price of 10 dollars represent two completely different things, which is a no-brainer for humans, but to a model both features look the same.

Suppose we have two features, weight and price. "Weight" cannot have a meaningful comparison with "Price", yet the algorithm assumes that since the "Weight" values are larger than the "Price" values, "Weight" is more important than "Price". These larger numbers then start playing a more decisive role while training the model. Feature scaling is therefore needed to put every feature on the same footing, without any upfront importance. Interestingly, if we convert the weight to kilograms, "Price" becomes dominant.

Another reason why feature scaling is applied is that some algorithms, such as neural networks trained with gradient descent, converge much faster with feature scaling than without it.

Some popular scaling techniques are listed below (a short scikit-learn sketch comparing them follows the list):

a. Min-Max Scaler: the min-max scaler rescales the feature values into any range of choice, for example between 0 and 5.
b. Standard Scaler: the standard scaler assumes that the variable is normally distributed and scales it so that the standard deviation is 1 and the distribution is centred at 0. Deep learning algorithms often call for zero mean and unit variance. Regression-type algorithms also benefit from normally distributed data when sample sizes are small.
c. Robust Scaler: the robust scaler works best when there are outliers in the dataset. It scales the data with respect to the inter-quartile range after removing the median: RobustScaler transforms the feature vector by subtracting the median and then dividing by the interquartile range (75% value - 25% value). Use RobustScaler if you want to reduce the effect of outliers relative to MinMaxScaler.
d. Max-Abs Scaler: similar to the min-max scaler, but instead of a given range, each feature is scaled by its maximum absolute value. The sparsity of the data is preserved since it does not centre the data.
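
As a rough sketch of how these look in practice, all four scalers are available in scikit-learn's preprocessing module (the tiny data set below is purely illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # the last value is an outlier

print(MinMaxScaler(feature_range=(0, 5)).fit_transform(X))  # squeezed into [0, 5]
print(StandardScaler().fit_transform(X))                    # zero mean, unit variance
print(RobustScaler().fit_transform(X))                      # median/IQR based, resistant to the outlier
print(MaxAbsScaler().fit_transform(X))                      # divides by max |value|, preserves sparsity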

Standardizing Data

Standardizing data helps us transform attributes with a Gaussian distribution of differing means
and of differing standard deviations into a standard Gaussian distribution with a mean of 0 and a
standard deviation of 1. Standardization of data is done using scikit-learn with the StandardScaler
class.

The Standard Scaler assumes data is normally distributed within each feature and scales it so that the distribution is centred around 0 with a standard deviation of 1. Centering and scaling happen independently on each feature, by computing the relevant statistics on the samples in the training set. If the data is not normally distributed, this is not the best scaler to use.

Z-score standardization is one of the most popular methods to normalize data. In this case, we rescale an original variable to have a mean of zero and a standard deviation of one. Mathematically, the scaled value is calculated by subtracting the mean of the original variable from the raw value and then dividing by the standard deviation of the original variable.
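
A minimal sketch of this calculation, assuming numpy and scikit-learn are available (the values are illustrative; both x.std() and StandardScaler use the population standard deviation, so the results match exactly):

import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[18.0], [20.0], [25.0], [32.0], [45.0]])

z_manual = (x - x.mean()) / x.std()           # (raw value - mean) / standard deviation
z_scaler = StandardScaler().fit_transform(x)  # the same z-scores, computed per column

print(np.allclose(z_manual, z_scaler))        # -> True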

Normalization or Min Max Scaler

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. It transforms features by scaling each feature to a given range: the estimator scales and translates each feature individually so that it lies in the given range on the training set, e.g. between zero and one. This scaler shrinks the data into the range of -1 to 1 if there are negative values. We can set the range to [0,1], [0,5], [-1,1], and so on.

This scaler responds well when the standard deviation is small and when the distribution is not Gaussian. It is, however, sensitive to outliers.

Min-Max scaling uses the formula X' = (X - min) / (max - min), so:

• When the value of X is the minimum value in the column, the numerator is 0, and hence X' is 0
• When the value of X is the maximum value in the column, the numerator equals the denominator, and thus X' is 1
• If the value of X is between the minimum and the maximum, the value of X' is between 0 and 1

For each value in a feature, MinMaxScaler subtracts the minimum value in the feature and then
divides by the range. The range is the difference between the original maximum and original
minimum. MinMaxScaler preserves the shape of the original distribution. It doesn’t meaningfully
change the information embedded in the original data. Note that MinMaxScaler doesn’t reduce the
importance of outliers. The default range for the feature returned by MinMaxScaler is 0 to 1.
MinMaxScaler isn’t a bad place to start, unless you know you want your feature to have a normal
distribution or you have outliers and you want them to have reduced influence.
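
A minimal sketch of MinMaxScaler illustrating the points above (assumes scikit-learn; the three values are illustrative):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[18.0], [28.0], [45.0]])

print(MinMaxScaler().fit_transform(x))                      # default range (0, 1): 18 -> 0.0, 45 -> 1.0,
                                                            # 28 -> (28 - 18) / (45 - 18) ~= 0.37
print(MinMaxScaler(feature_range=(0, 5)).fit_transform(x))  # same shape of distribution, range [0, 5]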


Binarizing Data

In this method, all the values that are above the threshold are transformed into 1 and those equal to
or below the threshold are transformed into 0. This method is useful when we deal with probabilities
and need to convert the data into crisp values. Binarizing is done using the Binarizer class.
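
A minimal sketch of the Binarizer class (assumes scikit-learn; the threshold of 0.5 and the probabilities are illustrative):

import numpy as np
from sklearn.preprocessing import Binarizer

probs = np.array([[0.10, 0.40, 0.50, 0.80, 0.95]])

# Values above the threshold become 1; values equal to or below it become 0.
print(Binarizer(threshold=0.5).fit_transform(probs))  # -> [[0. 0. 0. 1. 1.]]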

Robust Scaler:

As the name suggests, this Scaler is robust to outliers. If our data contains many outliers, scaling
using the mean and standard deviation of the data won’t work well.

This Scaler removes the median and scales the data according to the quantile range (defaults to
IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd
quartile (75th quantile). The centering and scaling statistics of this Scaler are based on percentiles
and are therefore not influenced by a small number of large marginal outliers. Note that the outliers themselves are still present in the transformed data. If separate outlier clipping is desirable, a non-linear transformation is required.
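
A minimal sketch contrasting RobustScaler with StandardScaler on data containing an outlier (assumes scikit-learn; the values are illustrative):

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

# RobustScaler: (x - median) / IQR, so the bulk of the data keeps a sensible spread.
print(RobustScaler().fit_transform(x))

# StandardScaler: the outlier inflates the standard deviation and squashes the
# non-outlier values close together.
print(StandardScaler().fit_transform(x))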

Power Transformer Scaler:

The power transformer is a family of parametric, monotonic transformations that are applied to
make data more Gaussian-like. This is useful for modeling issues related to the variability of a
variable that is unequal across the range (heteroscedasticity) or situations where normality is
desired.

The power transform finds the optimal parameter for stabilizing variance and minimizing skewness through maximum likelihood estimation. Currently, the scikit-learn implementation of PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. Box-Cox requires the input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
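
A minimal sketch of PowerTransformer with both supported methods (assumes scikit-learn; the skewed data is illustrative):

import numpy as np
from sklearn.preprocessing import PowerTransformer

x_pos = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [50.0]])    # strictly positive
x_any = np.array([[-3.0], [-1.0], [0.0], [2.0], [7.0], [40.0]])  # mixed sign

# Box-Cox requires strictly positive input.
print(PowerTransformer(method="box-cox").fit_transform(x_pos))

# Yeo-Johnson (the default) also handles zero and negative values.
print(PowerTransformer(method="yeo-johnson").fit_transform(x_any))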

Q. Calculate Min-Max Scaler and Standard Scaler

SN     Age    Salary    Purchase
1      28     27000     1
2      32     39000     1
3      18     20000     0
4      45     50000     1
5      35     38000     0
6      20     22000     1
7      25     27000     1
8      38     41000     0
9      39     40000     0
10     40     44000     1
Mean   32     34800     0.60
SD     9      10075     0.52
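
A minimal sketch of one way to answer this with scikit-learn, using the Age and Salary columns from the table (note that StandardScaler divides by the population standard deviation, which is slightly smaller than the sample SD shown in the table):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and Salary columns from the table above.
X = np.array([
    [28, 27000], [32, 39000], [18, 20000], [45, 50000], [35, 38000],
    [20, 22000], [25, 27000], [38, 41000], [39, 40000], [40, 44000],
], dtype=float)

# Min-Max: X' = (X - min) / (max - min), e.g. Age 28 -> (28 - 18) / (45 - 18) ~= 0.37
print(MinMaxScaler().fit_transform(X))

# Standard (z-score): X' = (X - mean) / std, e.g. Age 28 -> (28 - 32) / 8.56 ~= -0.47
# (the population std of Age is ~8.56, vs. the sample SD of 9 in the table)
print(StandardScaler().fit_transform(X))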

Q. What is the difference between Scaling and Transformation?

• Purpose
  Scaling: the goal is to bring the variables onto the same band so that they can be compared, and to improve computational efficiency.
  Transformation: helps in the case of skewed variables to reduce the skewness. In regression, if the assumptions of regression aren't met or the relationship between the target and the independent variables is non-linear, a transformation can be used to linearize it.

• Impact on data
  Scaling: has no impact on the data itself; all the properties of the data remain the same, only the range of the independent variables changes.
  Transformation: changes the data, and with it the distribution of the data.

• Impact on skewness, kurtosis and outliers
  Scaling: the distribution remains the same, so there is no change in skewness or kurtosis, and scaling doesn't remove outliers.
  Transformation: can decrease the skewness; it brings values closer together, which can remove outliers.

Q. What is the difference between Standardization and Normalization?

Standardization and Normalization are both scaling techniques. Standardization rescales the data based on the Z-score, using the formula (x - mean) / standard deviation; most of the standardized values then fall roughly between -3 and 3. Normalization scales the data using the formula (x - min) / (max - min), as in the Min-Max Scaler, and brings the data into the range 0 to 1.

Q. List the different feature transformation techniques.


The common feature transformation techniques are (a short sketch applying them follows below):

• Logarithmic transformation
• Exponential transformation
• Square root transformation
• Reciprocal transformation
• Box-Cox transformation
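
A minimal sketch applying these transformations with numpy and scipy (the skewed, strictly positive values are illustrative):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 50.0, 200.0])  # right-skewed values

log_x        = np.log(x)            # logarithmic transformation
exp_x        = np.exp(x / x.max())  # exponential transformation (scaled to avoid overflow)
sqrt_x       = np.sqrt(x)           # square root transformation
reciprocal_x = 1.0 / x              # reciprocal transformation
boxcox_x, lam = stats.boxcox(x)     # Box-Cox transformation; lam is the fitted lambda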

Q. Why Should we Use Feature Scaling?

Some machine learning algorithms are sensitive to feature scaling while others are virtually invariant
to it.

Gradient Descent Based Algorithms: Machine learning algorithms like linear regression, logistic
regression, neural network, etc. that use gradient descent as an optimization technique require data
to be scaled. Take a look at the formula for gradient descent below:
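
For reference, taking linear regression as the example, the standard update rule for a single parameter θj is: θj := θj - α * (1/m) * Σi (hθ(x(i)) - y(i)) * xj(i), where α is the learning rate, m is the number of training examples, and xj(i) is the value of feature j for the i-th example.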

The presence of feature value X in the formula will affect the step size of the gradient descent. The
difference in ranges of features will cause different step sizes for each feature. To ensure that the
gradient descent moves smoothly towards the minima and that the steps for gradient descent are
updated at the same rate for all the features, we scale the data before feeding it to the model. Having
features on a similar scale can help the gradient descent converge more quickly towards the
minima.

Distance-Based Algorithms: Distance-based algorithms like KNN, K-means, and SVM are the most affected by the range of features. This is because, behind the scenes, they use distances between data points to determine their similarity.
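
A minimal sketch of this effect, assuming numpy and scikit-learn (the age/salary values are illustrative): without scaling, the salary column alone decides which points look similar.

import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [age in years, salary in dollars]
X = np.array([[20.0, 30000.0],    # person A
              [60.0, 31000.0],    # person B: very different age, similar salary
              [21.0, 38000.0]])   # person C: similar age, different salary

d = lambda p, q: np.linalg.norm(p - q)
print(d(X[0], X[1]), d(X[0], X[2]))      # ~1001 vs ~8000: salary alone decides similarity

Xs = StandardScaler().fit_transform(X)
print(d(Xs[0], Xs[1]), d(Xs[0], Xs[2]))  # ~2.17 vs ~2.25: both features now contribute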

Tree-Based Algorithms: Tree-based algorithms, on the other hand, are fairly insensitive to the scale
of the features. Think about it, a decision tree is only splitting a node based on a single feature. The
decision tree splits a node on a feature that increases the homogeneity of the node. This split on a
feature is not influenced by other features.

So, there is virtually no effect of the remaining features on the split. This is what makes them
invariant to the scale of the features!

Q. What is the difference between normalization and standardization?


• Normalization is good to use when you know that the distribution of your data does not follow a
Gaussian distribution. This can be useful in algorithms that do not assume any distribution of
the data like K-Nearest Neighbours and Neural Networks.
• Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian
distribution. However, this does not have to be necessarily true. Also, unlike normalization,
standardization does not have a bounding range. So, even if you have outliers in your data, they
will not be affected by standardization.
• However, at the end of the day, the choice of using normalization or standardization will depend
on your problem and the machine learning algorithm you are using.
• There is no hard and fast rule to tell you when to normalize or standardize your data. You can
always start by fitting your model to raw, normalized, and standardized data and compare the
performance for the best results.
It is a good practice to fit the scaler on the training data and then use it to transform the testing
data. This would avoid any data leakage during the model testing process. Also, the scaling of
target values is generally not required.

Q. What is the difference between sklearn.preprocessing.MinMaxScaler and sklearn.preprocessing.Normalizer? When should you use MinMaxScaler and when Normalizer?

Normalizer is also a normalization technique; the only difference is the way it computes the normalized values. By default it calculates the l2 norm of the row values, i.e. each element of a row is normalized by the square root of the sum of the squared values of all elements in that row. It is useful in text classification, where the dot product of two TF-IDF vectors gives the cosine similarity between the different sentences/documents in the dataset.
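
A minimal sketch contrasting the two (assumes scikit-learn; the matrix is illustrative): MinMaxScaler works column-wise on features, while Normalizer works row-wise on samples.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

X = np.array([[4.0, 1.0, 2.0, 2.0],
              [1.0, 3.0, 9.0, 3.0],
              [5.0, 7.0, 5.0, 1.0]])

# MinMaxScaler: each column (feature) is mapped to [0, 1] independently.
print(MinMaxScaler().fit_transform(X))

# Normalizer: each row (sample) is divided by its l2 norm, so every row has unit length.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(np.linalg.norm(X_norm, axis=1))  # -> [1. 1. 1.]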

Q. WHEN TO STANDARDIZE DATA AND WHY?

For distance-based models, standardization is performed to prevent features with wider ranges from dominating the distance metric. But there are more situations in which we standardize data:

• BEFORE PCA:
In Principal Component Analysis, features with high variances/wide ranges get more weight than those with low variance, and consequently they end up illegitimately dominating the first principal components (the components with maximum variance).
• BEFORE CLUSTERING:
Clustering models are distance-based algorithms, in order to measure similarities between
observations and form clusters they use a distance metric. So, features with high ranges will
have a bigger influence on the clustering. Therefore, standardization is required before building
a clustering model.
• BEFORE KNN:
k-nearest neighbor is a distance-based classifier that classifies new observations based on
similarity measures (e.g., distance metrics) with labeled observations of the training set.
Standardization makes all variables contribute equally to the similarity measures.
• BEFORE SVM
Support Vector Machine tries to maximize the distance between the separating plane and the
support vectors. If one feature has very large values, it will dominate over other features when
calculating the distance. So Standardization gives all features the same influence on the
distance metric.
• BEFORE MEASURING VARIABLE IMPORTANCE IN REGRESSION MODELS
You can measure variable importance in regression analysis, by fitting a regression model using
the standardized independent variables and comparing the absolute value of their standardized
coefficients. But, if the independent variables are not standardized, comparing their coefficients
becomes meaningless.
• BEFORE LASSO AND RIDGE REGRESSION
LASSO and Ridge regression place a penalty on the magnitude of the coefficients associated with each variable, and the scale of the variables affects how much penalty is applied to their coefficients: coefficients of variables with large variance are small and thus less penalized. Therefore, standardization is required before fitting either regression.

Q. What are the cases when we can’t apply standardization?


• Tree-based algorithms

Tree-based algorithms such as Decision Tree, Random Forest, and Gradient Boosting are fairly
insensitive to the scale of the features. Think about it, a decision tree is only splitting a node based
on a single feature. The decision tree splits a node on a feature that increases the homogeneity of
the node. This split on a feature is not influenced by other features.
