12/14/21, 3:39 PM How and why to Standardize your data: A python tutorial | Towards Data Science


How and why to Standardize your data: A python tutorial

In this post I explain why and how to apply Standardization using scikit-learn in Python

Serafeim Loukas May 26, 2020 · 4 min read

Figure taken from: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html. Left subplot: the unscaled data. Right subplot: the transformed data.

https://towardsdatascience.com/how-and-why-to-standardize-your-data-996926c2c832 1/7

Hi there.

This is my first Medium post. I am an electrical & computer engineer currently finishing
my PhD studies in the biomedical engineering and computational neuroscience field. I
have been working on machine learning problems for the past 4 years. A very common
question that I see all around the web is how to standardize the data before fitting a
machine learning model, and why to do so.

How does scikit-learn’s StandardScaler work?

The first question that comes to one’s mind is:

Why standardize in the first place?

Why standardize before fitting an ML model?
Well, the idea is simple. Variables that are measured on different scales do not contribute
equally to the model fitting and the learned function, and might end up creating a bias.
To deal with this potential problem, feature-wise standardization (μ = 0, σ = 1) is
usually applied prior to model fitting.
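To make the scale problem concrete, here is a minimal sketch using plain numpy. The data (age in years, income in dollars) is hypothetical and only there for illustration. In the raw data the Euclidean distance is driven almost entirely by the income column; after standardizing each column, both features contribute on a comparable footing.

```python
import numpy as np

# Hypothetical data: column 0 is age (years), column 1 is income (dollars).
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],   # very different age, similar income
    [26.0, 58_000.0],   # similar age, different income
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances from sample 0: income dominates because of its larger scale.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Standardize each column (mean 0, standard deviation 1).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print("raw:         ", d_raw_01, d_raw_02)
print("standardized:", d_std_01, d_std_02)
```

In the raw data, sample 1 looks much closer to sample 0 than sample 2 does, purely because a 35-year age gap is numerically tiny next to an 8000-dollar income gap. After standardization the two distances are of the same order.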

To do that using scikit-learn, we first need to construct an input array X containing
the features and samples, with X.shape being [number_of_samples, number_of_features].

Keep in mind that scikit-learn machine learning (ML) estimators expect as input a
numpy array X with that shape, i.e. the rows are the samples and the columns are the
features/variables. Having said that, let’s assume that we have a matrix X where each
row is a sample/observation and each column is a variable/feature.
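As a small sketch of what that shape means in practice, the snippet below builds X from two hypothetical measurement vectors (the names heights_cm and weights_kg are made up for this example). It also shows the reshape that is needed when there is only one feature, since a 1-D array does not have the [number_of_samples, number_of_features] layout.

```python
import numpy as np

# Hypothetical measurements for 4 samples.
heights_cm = np.array([170.0, 182.0, 165.0, 158.0])
weights_kg = np.array([65.0, 90.0, 58.0, 52.0])

# Stack the two feature vectors as columns: rows are samples, columns are features.
X = np.column_stack([heights_cm, weights_kg])
print(X.shape)         # (4, 2)

# A single feature must still be 2-D: one column, one row per sample.
X_single = heights_cm.reshape(-1, 1)
print(X_single.shape)  # (4, 1)
```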

Note: Tree-based models are usually not dependent on scaling, but non-tree models
such as SVM, LDA etc. are often hugely dependent on it.
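One way to see this for yourself is the sketch below. It assumes a synthetic dataset from make_classification, with one feature deliberately multiplied by 1000 (my choice, purely for illustration), and fits an SVC with and without a StandardScaler in front of it. The scaler is wrapped in a pipeline so that it is fit on the training data only. The exact accuracies depend on the data, so treat the printed numbers as an experiment rather than a guarantee.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration), one feature on a much larger scale.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVC on the raw features.
svm_raw = SVC().fit(X_train, y_train)

# SVC with standardization; the scaler statistics come from X_train only.
svm_scaled = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

acc_raw = svm_raw.score(X_test, y_test)
acc_scaled = svm_scaled.score(X_test, y_test)
print(f"SVC without scaling: {acc_raw:.3f}")
print(f"SVC with scaling:    {acc_scaled:.3f}")
```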

Core of method
The main idea is to normalize/standardize (i.e. μ = 0 and σ = 1) the
features/variables/columns of X, individually, before applying any machine learning
model. Thus, StandardScaler() will standardize the features, i.e. each column of X,
INDIVIDUALLY, so that each column/feature/variable will have μ = 0 and σ = 1.

The mathematical formulation of the standardization procedure. Image generated by the author.
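In plain text, the standardization of a value x of a given feature is z = (x - μ) / σ, where μ and σ are the mean and the standard deviation of that feature (column). The sketch below applies this by hand with numpy on a small made-up matrix and checks that it agrees with StandardScaler. Note that StandardScaler uses the population standard deviation (ddof=0), which is also the numpy default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up matrix: 4 samples, 2 features on different scales.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0],
              [4.0, 30.0]])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column standard deviation (ddof=0)

# z = (x - mu) / sigma, applied column by column via broadcasting.
Z_manual = (X - mu) / sigma

Z_sklearn = StandardScaler().fit_transform(X)

print(np.allclose(Z_manual, Z_sklearn))  # True
```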

Working Python code example:


from sklearn.preprocessing import StandardScaler
import numpy as np

# 4 samples/observations and 2 variables/features
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])

# the scaler object (model)
scaler = StandardScaler()

# fit and transform the data
scaled_data = scaler.fit_transform(X)

print(X)
[[0 0]
 [1 0]
 [0 1]
 [1 1]]

print(scaled_data)
[[-1. -1.]
 [ 1. -1.]
 [-1.  1.]
 [ 1.  1.]]

Verify that the mean of each feature (column) is 0:

scaled_data.mean(axis = 0)
array([0., 0.])

Verify that the std of each feature (column) is 1:

scaled_data.std(axis = 0)
array([1., 1.])
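A fitted scaler also stores the statistics it learned, so the same transformation can be applied to new data later. The sketch below reuses the same 4-sample matrix as training data and a hypothetical unseen sample X_new (my addition). It uses the mean_ and scale_ attributes and the transform / inverse_transform methods of StandardScaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = np.array([[2.0, 0.5]])  # hypothetical unseen sample

# Learn the per-column mean and standard deviation from the training data only.
scaler = StandardScaler().fit(X_train)
print(scaler.mean_)    # [0.5 0.5]
print(scaler.scale_)   # [0.5 0.5]

# Apply the training statistics to the new sample: (x - mean_) / scale_.
X_new_scaled = scaler.transform(X_new)
print(X_new_scaled)    # [[3. 0.]]

# Map the scaled values back to the original units.
X_back = scaler.inverse_transform(X_new_scaled)
```

The important design point is that the new sample is scaled with the training mean and standard deviation, not with its own statistics.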

The effect of the transform in a visual example


Figure taken from: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html. Left subplot: the unscaled data. Right subplot: the transformed data.

Summary
StandardScaler removes the mean and scales each feature/variable to unit variance.
This operation is performed feature-wise in an independent way.

StandardScaler can be influenced by outliers (if they exist in the dataset) since it
involves the estimation of the empirical mean and standard deviation of each
feature.

How to deal with outliers

Manual way (not recommended): Visually inspect the data and remove outliers
using statistical outlier removal methods such as the Interquartile Range (IQR)
threshold method.

Recommended way: Use the RobustScaler, which also scales the features but in this
case uses statistics that are robust to outliers. This scaler removes the median
and scales the data according to the quantile range (defaults to IQR: Interquartile
Range). The IQR is the range between the 1st quartile (25th percentile) and the 3rd
quartile (75th percentile).
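Here is a minimal sketch of the difference, using a made-up single feature with one obvious outlier (the value 100). With StandardScaler the outlier inflates the mean and the standard deviation, so the five ordinary values get squeezed into a very narrow range. With RobustScaler (default settings: median and IQR) the ordinary values keep a sensible spread.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature, six samples; the last value is a deliberate outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0]).reshape(-1, 1)

std_scaled = StandardScaler().fit_transform(x)   # uses mean and standard deviation
rob_scaled = RobustScaler().fit_transform(x)     # uses median and IQR

# Spread (max - min) of the five non-outlier samples under each scaler.
inlier_spread_std = float(std_scaled[:5].max() - std_scaled[:5].min())
inlier_spread_rob = float(rob_scaled[:5].max() - rob_scaled[:5].min())

print("StandardScaler:", std_scaled.ravel())
print("RobustScaler:  ", rob_scaled.ravel())
print("inlier spread:", inlier_spread_std, "vs", inlier_spread_rob)
```

Note that RobustScaler does not remove the outlier. It only prevents the outlier from distorting the scaling of the remaining samples.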

That’s all for today! Hope you liked this first post! Next story coming next week. Stay
tuned & safe.

Stay tuned & support me

If you liked and found this article useful, follow me and applaud my story to support
me!

- My mailing list in just 5 seconds: https://seralouk.medium.com/subscribe


- Become a member and support me: https://seralouk.medium.com/membership


Get in touch with me

LinkedIn: https://www.linkedin.com/in/serafeim-loukas/

ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas

EPFL profile: https://people.epfl.ch/serafeim.loukas

Stack Overflow: https://stackoverflow.com/users/5025009/seralouk
