
12/14/21, 3:39 PM How and why to Standardize your data: A python tutorial | Towards Data Science


How and why to Standardize your data: A python tutorial
In this post I explain why and how to apply Standardization using scikit-learn in Python

Serafeim Loukas May 26, 2020 · 4 min read

Figure taken from: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html.
Left subplot: the unscaled data. Right subplot: the transformed data.

https://towardsdatascience.com/how-and-why-to-standardize-your-data-996926c2c832 1/7

Hi there.

This is my first Medium post. I am an electrical & computer engineer currently finishing
my PhD studies in the biomedical engineering and computational neuroscience field. I
have been working on machine learning problems for the past 4 years. A very common
question that I see all around the web is how, and why, to standardize the data
before fitting a machine learning model.

How does scikit-learn’s StandardScaler work?

The first question that comes to one’s mind is:

Why standardize in the first place?


Why standardize before fitting an ML model?
Well, the idea is simple. Variables that are measured at different scales do not contribute
equally to the model fitting and the learned function, and might end up creating a bias.
Thus, to deal with this potential problem, feature-wise standardization (μ=0, σ=1) is
usually applied prior to model fitting.
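To make the scale problem concrete, here is a small sketch. The data (heights in metres and incomes in dollars) and the helper dist are made up for illustration; they are not from any real dataset. It compares Euclidean distances between samples before and after standardization.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is height in metres, column 1 is income in dollars
X = np.array([
    [1.60, 30_000.0],
    [1.90, 30_500.0],
    [1.62, 60_000.0],
])


def dist(a, b):
    """Euclidean distance between two samples."""
    return np.linalg.norm(a - b)


# Raw data: the income column dominates the distance, so the large height
# difference between samples 0 and 1 is practically ignored
raw_01 = dist(X[0], X[1])
raw_02 = dist(X[0], X[2])
print(raw_01, raw_02)

# After standardization both features contribute on a comparable scale
X_std = StandardScaler().fit_transform(X)
std_01 = dist(X_std[0], X_std[1])
std_02 = dist(X_std[0], X_std[2])
print(std_01, std_02)
```

On the raw data, sample 0 looks far closer to sample 1 than to sample 2 purely because income is measured in much larger numbers. After scaling, the two distances are of similar size.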

To do that using scikit-learn, we first need to construct an input array X containing
the features and samples, with X.shape being [number_of_samples, number_of_features].

Keep in mind that scikit-learn machine learning (ML) estimators expect as input a 2D
array X (for example a NumPy array) with that shape, i.e. the rows are the samples and the
columns are the features/variables. Having said that, let’s assume that we have a matrix X
where each row/line is a sample/observation and each column is a variable/feature.
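As a quick illustration of the expected shape, the sketch below starts from a made-up 1D array of measurements for a single feature and reshapes it to [number_of_samples, number_of_features] before scaling. Passing the 1D array directly would normally be rejected by StandardScaler with a ValueError.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements of a single feature for 5 samples (1D array)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# reshape(-1, 1) turns it into 5 rows (samples) and 1 column (feature)
X = x.reshape(-1, 1)
print(X.shape)  # (5, 1)

scaled = StandardScaler().fit_transform(X)
print(scaled.shape)  # (5, 1)
```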

Note: Tree-based models are usually not dependent on scaling, but non-tree models
such as SVM, LDA etc. are often hugely dependent on it.
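For a scale-sensitive model such as an SVM, one common pattern is to chain the scaler and the model in a single pipeline. The sketch below is only an illustration: the toy data, the train/test split and the default SVC settings are assumptions, not part of the original example.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical toy data: two features on very different scales
n = 200
small = rng.normal(0.0, 1.0, size=n)      # informative feature, scale ~1
large = rng.normal(0.0, 1000.0, size=n)   # noise feature, scale ~1000
X = np.column_stack([small, large])
y = (small > 0).astype(int)               # label depends only on the small feature

# Scaler and model chained together: the scaler is fitted on the data passed
# to fit() and then reused unchanged when scoring or predicting
model = make_pipeline(StandardScaler(), SVC())
model.fit(X[:150], y[:150])

accuracy = model.score(X[150:], y[150:])
print(accuracy)
```

With this setup the scaling statistics come from the training rows only, and the same transform is applied automatically to the held-out rows.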

Core of method
The main idea is to standardize (i.e. μ = 0 and σ = 1) the features/variables/columns
of X, individually, before applying any machine learning model. Thus, StandardScaler()
will standardize the features, i.e. each column of X, INDIVIDUALLY, so that each
column/feature/variable will have μ = 0 and σ = 1.

The mathematical formulation of the standardization procedure. Image generated by the author.
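The image itself is not reproduced in this text. As a stand-in, the standard score computed for each feature can be written as follows. The indices i (sample), j (feature) and the sample count N are notation chosen here and may differ from the original figure.

```latex
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(x_{ij} - \mu_j\right)^2}
```

Here μ_j and σ_j are the mean and standard deviation of column j of X, so each column is shifted and rescaled using only its own statistics.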

Working Python code example:

from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

# the scaler object (model)
scaler = StandardScaler()

# fit and transform the data
scaled_data = scaler.fit_transform(X)

print(X)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])
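To see what the scaler does under the hood, the following sketch reproduces the transform by hand with NumPy and inspects the statistics stored on the fitted object. The new sample X_new is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

scaler = StandardScaler().fit(X)

# Per-column statistics learned during fit
print(scaler.mean_)   # per-feature mean
print(scaler.scale_)  # per-feature standard deviation

# Manual standardization; np.std defaults to ddof=0 (population std),
# which matches what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(manual, scaler.transform(X)))  # True

# A new, hypothetical sample is scaled with the statistics from fit
X_new = np.array([[2.0, 0.5]])
print(scaler.transform(X_new))
```

Note that transform reuses the mean and standard deviation estimated in fit, so new data is scaled consistently with the data the scaler was fitted on.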

The effect of the transform in a visual example

Figure taken from: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html.
Left subplot: the unscaled data. Right subplot: the transformed data.

Summary
StandardScaler removes the mean and scales each feature/variable to unit variance.
This operation is performed feature-wise in an independent way.

StandardScaler can be influenced by outliers (if they exist in the dataset) since it
involves the estimation of the empirical mean and standard deviation of each
feature.
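The sensitivity to outliers can be seen on a tiny made-up feature. In this sketch (the values are hypothetical), a single extreme value shifts the estimated mean and inflates the standard deviation, which squeezes the typical values into a very narrow range after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature: five typical values plus one extreme outlier
x_clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0]).reshape(-1, 1)
x_outlier = np.vstack([x_clean, [[1000.0]]])

scaler_clean = StandardScaler().fit(x_clean)
scaler_outlier = StandardScaler().fit(x_outlier)

print(scaler_clean.mean_, scaler_clean.scale_)
print(scaler_outlier.mean_, scaler_outlier.scale_)

# Spread (max - min) of the five typical values after scaling,
# using statistics estimated without and with the outlier
z_clean = scaler_clean.transform(x_clean)
z_outlier = scaler_outlier.transform(x_clean)
spread_clean = z_clean.max() - z_clean.min()
spread_outlier = z_outlier.max() - z_outlier.min()
print(spread_clean, spread_outlier)
```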

How to deal with outliers


Manual way (not recommended): Visually inspect the data and remove outliers
using statistical outlier removal methods such as the Interquartile Range (IQR)
threshold method.

Recommended way: Use the RobustScaler, which also scales the features but uses
statistics that are robust to outliers. This scaler removes the median
and scales the data according to the quantile range (defaults to IQR: Interquartile
Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd
quartile (75th percentile).
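A minimal sketch of the recommended way, on a made-up feature with one outlier, is shown below. It uses RobustScaler with its default settings and checks the result against a manual median/IQR computation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical single feature with one extreme outlier
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]).reshape(-1, 1)

# Defaults: centre on the median, scale by the 25th-75th percentile range
robust = RobustScaler()
X_robust = robust.fit_transform(X)

# Manual version of the same computation
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

print(robust.center_, robust.scale_)  # median and IQR per feature
print(np.allclose(X_robust, manual))  # True
```

The outlier is still present after scaling, but it no longer distorts the centre and scale used for the other values.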

That’s all for today! Hope you liked this first post! Next story coming next week. Stay
tuned & safe.

Stay tuned & support me


If you liked and found this article useful, follow me and applaud my story to support
me!

- My mailing list in just 5 seconds: https://seralouk.medium.com/subscribe

- Become a member and support me: https://seralouk.medium.com/membership

References
[1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

[2] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html

Get in touch with me


LinkedIn: https://www.linkedin.com/in/serafeim-loukas/

ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas

EPFL profile: https://people.epfl.ch/serafeim.loukas

Stack Overflow: https://stackoverflow.com/users/5025009/seralouk
