
Supervised Learning With Scikit-Learn: Preprocessing Data

This document discusses preprocessing data for supervised machine learning models in scikit-learn. It covers encoding categorical features, handling missing data through dropping rows or imputation, and scaling numeric features for models that are sensitive to feature scale such as k-nearest neighbors. Pipelines can be used to combine preprocessing steps like imputation and scaling with models. Examples use real-world datasets like automobile and Pima Indians diabetes data.


SUPERVISED LEARNING WITH SCIKIT-LEARN

Preprocessing data

Dealing with categorical features


● Scikit-learn will not accept categorical features by default
● Need to encode categorical features numerically
● Convert to ‘dummy variables’
● 0: Observation was NOT that category
● 1: Observation was that category

Dummy variables

[table: the origin column encoded as origin_Asia / origin_Europe / origin_US dummy columns]

Dealing with categorical features in Python


● scikit-learn: OneHotEncoder()
● pandas: get_dummies() (used in the examples below; a OneHotEncoder sketch follows for comparison)
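get_dummies() works directly on a DataFrame; OneHotEncoder() fits the scikit-learn transformer API. A minimal sketch on the same auto data used below (the sparse_output argument is for scikit-learn >= 1.2; older versions use sparse=False):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv('auto.csv')
ohe = OneHotEncoder(sparse_output=False)      # return a dense array instead of a sparse matrix
dummies = ohe.fit_transform(df[['origin']])   # one 0/1 column per category
print(ohe.categories_)                        # the category labels found during fit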

Automobile dataset
● mpg: target variable
● origin: categorical feature

EDA w/ categorical feature

[figure: box plot of mpg for each origin category]

Encoding dummy variables


In [1]: import pandas as pd

In [2]: df = pd.read_csv('auto.csv')

In [3]: df_origin = pd.get_dummies(df)

In [4]: print(df_origin.head())
mpg displ hp weight accel size origin_Asia origin_Europe \
0 18.0 250.0 88 3139 14.5 15.0 0 0
1 9.0 304.0 193 4732 18.5 20.0 0 0
2 36.1 91.0 60 1800 16.4 10.0 1 0
3 18.5 250.0 98 3525 19.0 15.0 0 0
4 34.3 97.0 78 2188 15.8 10.0 0 1

origin_US
0 1
1 1
2 0
3 1
4 0

Encoding dummy variables


In [5]: df_origin = df_origin.drop('origin_Asia', axis=1)

In [6]: print(df_origin.head())
mpg displ hp weight accel size origin_Europe origin_US
0 18.0 250.0 88 3139 14.5 15.0 0 1
1 9.0 304.0 193 4732 18.5 20.0 0 1
2 36.1 91.0 60 1800 16.4 10.0 0 0
3 18.5 250.0 98 3525 19.0 15.0 0 1
4 34.3 97.0 78 2188 15.8 10.0 1 0
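Dropping one dummy column removes redundant information: when origin_Europe and origin_US are both 0, the observation must be from Asia. pandas can combine the two steps above in one call; drop_first=True drops the first category of each encoded column (origin_Asia here):

df_origin = pd.get_dummies(df, drop_first=True)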

Linear regression with dummy variables


Here X holds all feature columns of df_origin and y is the 'mpg' column.

In [7]: from sklearn.model_selection import train_test_split

In [8]: from sklearn.linear_model import Ridge

In [9]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.3, random_state=42)

In [10]: ridge = Ridge(alpha=0.5, normalize=True).fit(X_train, y_train)

In [11]: ridge.score(X_test, y_test)
Out[11]: 0.719064519022
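Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2. In current versions the closest equivalent (not numerically identical, since normalize used the L2 norm rather than the standard deviation) is to scale inside a pipeline, e.g.:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# standardize the features, then fit ridge regression
ridge = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
ridge.fit(X_train, y_train)
ridge.score(X_test, y_test)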
SUPERVISED LEARNING WITH SCIKIT-LEARN

Let’s practice!
SUPERVISED LEARNING WITH SCIKIT-LEARN

Handling missing data

Pima Indians dataset


In [1]: df = pd.read_csv('diabetes.csv')

In [2]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnancies 768 non-null int64
glucose 768 non-null int64
diastolic 768 non-null int64
triceps 768 non-null int64
insulin 768 non-null int64
bmi 768 non-null float64
dpf 768 non-null float64
age 768 non-null int64
diabetes 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Pima Indians dataset


In [3]: print(df.head())
pregnancies glucose diastolic triceps insulin bmi dpf age \
0 6 148 72 35 0 33.6 0.627 50
1 1 85 66 29 0 26.6 0.351 31
2 8 183 64 0 0 23.3 0.672 32
3 1 89 66 23 94 28.1 0.167 21
4 0 137 40 35 168 43.1 2.288 33

diabetes
0 1
1 0
2 1
3 0
4 1

Dropping missing data


Zeros in insulin, triceps, and bmi are physically impossible, so here they mark missing values.

In [7]: import numpy as np

In [8]: df.insulin.replace(0, np.nan, inplace=True)

In [9]: df.triceps.replace(0, np.nan, inplace=True)

In [10]: df.bmi.replace(0, np.nan, inplace=True)

In [11]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
pregnancies 768 non-null int64
glucose 768 non-null int64
diastolic 768 non-null int64
triceps 541 non-null float64
insulin 394 non-null float64
bmi 757 non-null float64
dpf 768 non-null float64
age 768 non-null int64
diabetes 768 non-null int64
dtypes: float64(4), int64(5)
memory usage: 54.1 KB

Dropping missing data


In [12]: df = df.dropna()

In [13]: df.shape
Out[13]: (393, 9)

Dropping rows with any missing value keeps only 393 of the 768 observations, which motivates imputation instead.

Imputing missing data


● Making an educated guess about the missing values
● Example: Using the mean of the non-missing entries

In [1]: from sklearn.preprocessing import Imputer

In [2]: imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

In [3]: imp.fit(X)

In [4]: X = imp.transform(X)
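Note: Imputer was removed in scikit-learn 0.22. The modern equivalent is SimpleImputer, which always imputes column-wise (matching axis=0 above):

from sklearn.impute import SimpleImputer
import numpy as np

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imp.fit_transform(X)    # fill missing entries with column means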

Imputing within a pipeline


In [1]: from sklearn.pipeline import Pipeline

In [2]: from sklearn.preprocessing import Imputer

In [3]: imp = Imputer(missing_values='NaN', strategy='mean', axis=0)

In [4]: logreg = LogisticRegression()

In [5]: steps = [('imputation', imp),
   ...:          ('logistic_regression', logreg)]

In [6]: pipeline = Pipeline(steps)

In [7]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.3, random_state=42)

(LogisticRegression and train_test_split are assumed to have been imported earlier.)

Imputing within a pipeline


In [8]: pipeline.fit(X_train, y_train)

In [9]: y_pred = pipeline.predict(X_test)

In [10]: pipeline.score(X_test, y_test)
Out[10]: 0.75324675324675328

In a Pipeline, every step except the last must be a transformer (implementing fit and transform); the final step can be any estimator.
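For current scikit-learn versions, the same pipeline can be sketched with SimpleImputer and make_pipeline (step names are then generated automatically):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(SimpleImputer(strategy='mean'),
                         LogisticRegression())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)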
SUPERVISED LEARNING WITH SCIKIT-LEARN

Let’s practice!
SUPERVISED LEARNING WITH SCIKIT-LEARN

Centering and scaling

Why scale your data?


In [1]: print(df.describe())
       fixed acidity  free sulfur dioxide  total sulfur dioxide      density  \
count    1599.000000          1599.000000           1599.000000  1599.000000
mean        8.319637            15.874922             46.467792     0.996747
std         1.741096            10.460157             32.895324     0.001887
min         4.600000             1.000000              6.000000     0.990070
25%         7.100000             7.000000             22.000000     0.995600
50%         7.900000            14.000000             38.000000     0.996750
75%         9.200000            21.000000             62.000000     0.997835
max        15.900000            72.000000            289.000000     1.003690

                pH    sulphates      alcohol      quality
count  1599.000000  1599.000000  1599.000000  1599.000000
mean      3.311113     0.658149    10.422983     0.465291
std       0.154386     0.169507     1.065668     0.498950
min       2.740000     0.330000     8.400000     0.000000
25%       3.210000     0.550000     9.500000     0.000000
50%       3.310000     0.620000    10.200000     0.000000
75%       3.400000     0.730000    11.100000     1.000000
max       4.010000     2.000000    14.900000     1.000000

(A wine quality dataset. The quality column is a binary 0/1 target here, and the feature scales differ by orders of magnitude, e.g. density vs. total sulfur dioxide.)

Why scale your data?


● Many models use some measure of distance between observations to inform them
● Features on larger scales can unduly influence the model
● Example: k-NN uses distance explicitly when making predictions
● We want features to be on a similar scale
● Solution: normalizing (centering and scaling)

Ways to normalize your data


● Standardization: subtract the mean and divide by the standard deviation
● All features are then centered around zero and have variance one
● Can also subtract the minimum and divide by the range: minimum zero, maximum one
● Can also normalize so the data ranges from -1 to +1
● See the scikit-learn docs for further details; a sketch of the matching transformers follows below
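Each option above corresponds to a scikit-learn transformer. A minimal sketch, assuming X is any numeric feature array:

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_std = StandardScaler().fit_transform(X)                       # zero mean, unit variance per column
X_01  = MinMaxScaler().fit_transform(X)                         # each column scaled to [0, 1]
X_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)    # each column scaled to [-1, 1]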

Scaling in scikit-learn
In [2]: from sklearn.preprocessing import scale

In [3]: X_scaled = scale(X)

In [4]: np.mean(X), np.std(X)
Out[4]: (8.13421922452, 16.7265339794)

In [5]: np.mean(X_scaled), np.std(X_scaled)
Out[5]: (2.54662653149e-15, 1.0)

Note that scale() standardizes each column; the mean and standard deviation printed here are computed over all entries of X at once.

Scaling in a pipeline
In [6]: from sklearn.preprocessing import StandardScaler

In [7]: steps = [('scaler', StandardScaler()),
   ...:          ('knn', KNeighborsClassifier())]

In [8]: pipeline = Pipeline(steps)

In [9]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.2, random_state=21)

In [10]: knn_scaled = pipeline.fit(X_train, y_train)

In [11]: y_pred = pipeline.predict(X_test)

In [12]: accuracy_score(y_test, y_pred)
Out[12]: 0.956

In [13]: knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)

In [14]: knn_unscaled.score(X_test, y_test)
Out[14]: 0.928

Scaling improves k-NN accuracy here from 0.928 to 0.956. (KNeighborsClassifier and accuracy_score are assumed to have been imported.)

CV and scaling in a pipeline


In [14]: steps = [('scaler', StandardScaler()),
   ...:          ('knn', KNeighborsClassifier())]

In [15]: pipeline = Pipeline(steps)

In [16]: parameters = {'knn__n_neighbors': np.arange(1, 50)}

In [17]: X_train, X_test, y_train, y_test = train_test_split(X, y,
   ...:     test_size=0.2, random_state=21)

In [18]: cv = GridSearchCV(pipeline, param_grid=parameters)

In [19]: cv.fit(X_train, y_train)

In [20]: y_pred = cv.predict(X_test)
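In a pipeline grid search, each hyperparameter is addressed as <step name>__<parameter name>, which is why the key above is 'knn__n_neighbors'. A sketch of a grid that touches two steps (the scaler entry is purely illustrative):

parameters = {'knn__n_neighbors': np.arange(1, 50),
              'scaler__with_mean': [True, False]}   # with_mean is a StandardScaler option, shown only for illustration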

Scaling and CV in a pipeline


In [21]: print(cv.best_params_)
{'knn__n_neighbors': 41}

In [22]: print(cv.score(X_test, y_test))
0.956

In [23]: print(classification_report(y_test, y_pred))
             precision    recall  f1-score   support

          0       0.97      0.90      0.93        39
          1       0.95      0.99      0.97        75

avg / total       0.96      0.96      0.96       114


SUPERVISED LEARNING WITH SCIKIT-LEARN

Let’s practice!
SUPERVISED LEARNING WITH SCIKIT-LEARN

Final thoughts

What you’ve learned


● Using machine learning techniques to build predictive models
● For both regression and classification problems
● With real-world data
● Underfitting and overfitting
● Train-test split
● Cross-validation
● Grid search

What you’ve learned


● Regularization, lasso and ridge regression
● Data preprocessing
● For more: Check out the scikit-learn documentation
SUPERVISED LEARNING WITH SCIKIT-LEARN

Congratulations!
