Standardization
https://towardsdatascience.com/how-and-why-to-standardize-your-data-996926c2c832 1/7
12/14/21, 3:39 PM How and why to Standardize your data: A python tutorial | Towards Data Science
Hi there.
This is my first Medium post. I am an electrical & computer engineer currently finishing
my PhD studies in the biomedical engineering and computational neuroscience field. I
have been working on machine learning problems for the past 4 years. A very common
question that I see all around the web is how to standardize the data before fitting a
machine learning model, and why to do so.
Keep in mind that scikit-learn machine learning (ML) estimators expect as input a
2D NumPy array X of shape (n_samples, n_features), i.e. the rows are the samples and the
columns are the features/variables. Having said that, let’s assume that we have a matrix X
where each row is a sample/observation and each column is a variable/feature.
Note: Tree-based models are usually not dependent on scaling, but non-tree models
such as SVM, LDA etc. are often hugely dependent on it.
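One way to see why scale matters for models that rely on distances between samples is the small sketch below. The three data points (age in years, income in dollars) and the helper `nearest_to_first` are made up for illustration and are not from any real dataset; the standardization is done by hand with NumPy.

```python
import numpy as np

# Hypothetical data: columns are age (years) and income (dollars)
X = np.array([
    [25.0, 50000.0],   # A
    [60.0, 50500.0],   # B: much older than A, similar income
    [26.0, 50800.0],   # C: almost the same age as A, somewhat higher income
])

def nearest_to_first(data):
    """Return the row index (1 or 2) of the point closest to row 0 (Euclidean)."""
    dists = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(dists)) + 1

# Standardize each column: subtract its mean, divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_nearest = nearest_to_first(X)            # income dominates the raw distance
scaled_nearest = nearest_to_first(X_scaled)  # both features now contribute comparably
print(raw_nearest, scaled_nearest)
```

On the raw data the nearest neighbour of A is B, because the income column has by far the largest numeric range; after standardization it is C. Any model whose decisions depend on such distances can therefore behave differently before and after scaling.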
Core of method
The main idea is to standardize the features/variables (the columns of X) individually,
so that each has μ = 0 and σ = 1, before applying any machine learning
model. Thus, StandardScaler() standardizes the features, i.e. each column of X,
individually, so that each column/feature/variable has μ = 0 and σ = 1.
The mathematical formulation of the standardization procedure. Image generated by the author.
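For reference, the usual z-score form of standardization, applied to each column separately, can be written as below. This is the textbook definition (with the population standard deviation, which is what StandardScaler uses), given here as a sketch rather than a copy of the author's figure. The symbols x_i (the i-th value of a feature), N (the number of samples) and z_i (the standardized value) are notation assumed for this block.

```latex
z_i = \frac{x_i - \mu}{\sigma},
\qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( x_i - \mu \right)^2}
```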
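import numpy as np
from sklearn.preprocessing import StandardScaler

# 4 samples/observations and 2 variables/features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

# Fit the scaler on X and transform X in one step
scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)

print(X)
# [[0 0]
#  [1 0]
#  [0 1]
#  [1 1]]

print(scaled_data)
# [[-1. -1.]
#  [ 1. -1.]
#  [-1.  1.]
#  [ 1.  1.]]

print(scaled_data.mean(axis=0))  # each column now has mean 0
# [0. 0.]

print(scaled_data.std(axis=0))   # each column now has standard deviation 1
# [1. 1.]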
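The output above can also be reproduced by hand, which may help to see what the scaler does internally. The following is a minimal sketch that rebuilds the same 4x2 matrix and compares a manual NumPy computation with StandardScaler; the variable names `mu`, `sigma`, `manual` and `sk` are chosen for this example only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same toy matrix as in the printed output: 4 samples, 2 features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)

# Column-wise mean and population standard deviation (NumPy's default, ddof=0)
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Manual standardization, one column at a time via broadcasting
manual = (X - mu) / sigma

# The same thing with scikit-learn
scaler = StandardScaler().fit(X)
sk = scaler.transform(X)

print(scaler.mean_)             # column means learned during fit
print(scaler.scale_)            # column standard deviations learned during fit
print(np.allclose(manual, sk))  # do the two results agree?
```

After fitting, the learned statistics are stored in the `mean_` and `scale_` attributes, so the same transformation can later be applied to new data with `scaler.transform`.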
Summary
StandardScaler removes the mean and scales each feature/variable to unit variance.
This operation is performed feature-wise in an independent way.
StandardScaler can be influenced by outliers (if they exist in the dataset) since it
involves the estimation of the empirical mean and standard deviation of each
feature.
Recommended way when outliers are present: use RobustScaler, which also centers and
scales the features, but with statistics that are robust to outliers. This scaler removes
the median and scales the data according to the quantile range (defaults to the IQR:
interquartile range). The IQR is the range between the 1st quartile (25th percentile)
and the 3rd quartile (75th percentile).
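The effect of an outlier on the two scalers can be illustrated with a short sketch. The single feature below, with one extreme value at the end, is made up for this example; both scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Hypothetical single feature with one extreme outlier (the last value)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

# Mean / standard deviation based scaling
standard = StandardScaler().fit_transform(x)

# Median / IQR based scaling (25th to 75th percentile by default)
robust = RobustScaler().fit_transform(x)

# How far apart the five "normal" values end up after each transformation
standard_spread = standard[:5].max() - standard[:5].min()
robust_spread = robust[:5].max() - robust[:5].min()

print(standard.ravel().round(3))
print(robust.ravel().round(3))
print(round(standard_spread, 3), round(robust_spread, 3))
```

With StandardScaler the outlier inflates the mean and standard deviation, so the five regular values are squeezed into a range of roughly 0.11. With RobustScaler they keep a spread of 1.6, because the median and IQR are barely affected by the single extreme value. Note that the outlier itself is not removed by either scaler.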
That’s all for today! Hope you liked this first post! Next story coming next week. Stay
tuned & safe.
References
[1] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
[2] https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas