MLP Slides Merged

The document outlines the steps involved in an end-to-end machine learning project, specifically focusing on predicting wine quality based on physiochemical characteristics. It emphasizes the importance of data collection, preparation, model selection, and collaboration with domain experts throughout the process. Additionally, it discusses techniques for data visualization, feature significance, and the necessity of proper sampling methods to ensure unbiased model evaluation.

Uploaded by

the2003diego

End-to-End Machine Learning Project
Machine Learning Practice Course

2
Outline
1. Steps in ML projects
2. Illustration through practical set up

3
ML Project
The Excellent Wine Company wants to develop an ML model that predicts wine
quality from certain physicochemical characteristics, in order to replace an
expensive quality sensor.
Let's understand the steps involved in addressing this problem.

4
Steps in ML projects
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor and maintain your system.

5
A few words of wisdom

ML is usually a small piece in a big project, e.g. wine quality prediction is
a small piece in setting up the manufacturing process.
Typically 10-15% of the time is spent on ML.
A lot more time is spent on capturing and processing the data needed for ML
and on taking decisions based on the output of the ML module.
It needs strong collaboration with domain experts, product managers and
engineering teams for successful execution.

6
Step 1: Look at the big picture
1. Frame the problem
2. Select a performance measure
3. List and check the assumptions
1.1 Frame the problem

What are the input and the output?

What is the business objective? How does the company expect to use and
benefit from the model? The answer is useful for:
Problem framing
Algorithm and performance measure selection
Overall effort estimation
What is the current solution (if any)?
It provides a useful baseline.
7
Design consideration in problem framing

Is this a supervised, unsupervised or reinforcement learning (RL) problem?

Is this a classification, regression or some other task?

What is the nature of the output: single or multiple outputs?

Does the system need continuous learning or periodic updates?

What would be the learning style: batch or online?

8
1.2 Selection of performance measure
Regression
Mean Squared Error (MSE) or
Mean Absolute Error (MAE)
Classification
Precision
Recall
F1-score
Accuracy
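
As a quick illustration of these measures, the sketch below computes them with scikit-learn's metrics module. The label vectors are made up for the example and are not taken from the wine dataset.

```python
from sklearn.metrics import (accuracy_score, f1_score, mean_absolute_error,
                             mean_squared_error, precision_score, recall_score)

# Regression: made-up true and predicted quality scores.
y_true_reg = [5, 6, 5, 7]
y_pred_reg = [5, 5, 6, 7]
mse = mean_squared_error(y_true_reg, y_pred_reg)   # mean of squared errors
mae = mean_absolute_error(y_true_reg, y_pred_reg)  # mean of absolute errors

# Classification: made-up binary labels (1 = good wine, 0 = not good).
y_true_clf = [1, 0, 1, 1, 0, 1]
y_pred_clf = [1, 0, 0, 1, 1, 1]
precision = precision_score(y_true_clf, y_pred_clf)  # TP / (TP + FP)
recall = recall_score(y_true_clf, y_pred_clf)        # TP / (TP + FN)
f1 = f1_score(y_true_clf, y_pred_clf)                # harmonic mean of the two
accuracy = accuracy_score(y_true_clf, y_pred_clf)    # share of correct predictions

print(mse, mae, precision, recall, f1, accuracy)
```

For the wine problem, where the label is a numeric quality score, the regression measures are the natural starting point.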

9
1.3 Check the assumptions
List down various assumptions about the task.
Review with domain experts and other teams that plan to consume ML output.
Make sure all assumptions are reviewed and approved before coding!

10
Step 2: Get the data

Data may be spread across multiple tables, files or documents with access control.
Obtain appropriate access controls and authorizations.
Get familiar with the data by looking at the schema and a few rows. (Familiarity with
SQL would be useful here.)

Load basic libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

11
Let's first access our data - in this case, we need to download it from the web.
It's a good practice to create a function for downloading and extracting the
data.

data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(data_url, sep=";")
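
The advice about wrapping the download in a function could look like the sketch below. The function name load_wine_data and the local cache file name are assumptions made for this illustration; only the URL and the ";" separator come from the slide.

```python
from pathlib import Path

import pandas as pd

DATA_URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
            "wine-quality/winequality-red.csv")


def load_wine_data(source=DATA_URL, cache_path="winequality-red.csv"):
    """Return the wine data, downloading it only if no local copy exists.

    `cache_path` is a hypothetical local file name chosen for this sketch.
    """
    cache = Path(cache_path)
    if cache.exists():
        # Reuse the local copy instead of hitting the network again.
        return pd.read_csv(cache, sep=";")
    data = pd.read_csv(source, sep=";")
    data.to_csv(cache, sep=";", index=False)  # keep a copy for the next run
    return data

# Usage (requires network access on the first call):
# data = load_wine_data()
```

Caching the file locally keeps later runs reproducible even if the remote copy moves or changes.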

Now that the data is loaded, let's examine it.

12
2.1 Check data samples

Let's look at a few data samples with the head() method.

data.head()

13
2.2 Features
It's a good idea to understand the significance of each feature by consulting the experts.

Feature | Significance
Fixed acidity | Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
Volatile acidity | The amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
Citric acid | Found in small quantities, citric acid can add 'freshness' and flavor to wines.
Residual sugar | It's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
Chlorides | The amount of salt in the wine.
⋮ | ⋮
Alcohol | The percentage of alcohol content in the wine.

(Credits: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009)
feature_list = data.columns[:-1].values
label = [data.columns[-1]]

print("Feature list:", feature_list)
print("Label:", label)

Feature list: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar' 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH' 'sulphates' 'alcohol']
Label: ['quality']

15
2.3 Data statistics

Let's use the info() method to get a quick description of the data.

data.info()

16
2.3 Data statistics
Total entries: 1599 (a tiny dataset by ML standards)
There are 12 columns in total: 11 features + 1 label
Label column: quality
Features: [fixed acidity, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, pH,
sulphates, alcohol]
All feature columns are numeric (float64) and the label is an integer.

17
In order to understand the nature of the numeric attributes, we use the describe() method.

data.describe()

This prints the count and statistical properties: mean, standard deviation and
quartiles.

18
The wine quality can be between 0 and 10, but in this dataset, the quality values
are between 3 and 8. Let's look at the distribution of examples by the wine quality.

data['quality'].value_counts()

Higher quality value → better quality of wine

You can see that there are far more samples of average wines than of good or poor
quality ones.
Many examples have quality = 5 or 6.

19
The same information can be viewed through a histogram plot.

A histogram gives the count of how many samples occur within a specific range (bin).
The x-axis denotes the range of values in a feature and
the y-axis denotes the frequency of samples with those specific values.

sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')

Note the taller bars for quality 5 and 6 compared to the other qualities.
20
In a similar manner, we can plot all numerical attributes with histogram plot for quick
examination.
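
One possible way to produce such a grid of histograms is pandas' DataFrame.hist(). The sketch below runs on a small synthetic stand-in for the wine dataframe (the values are randomly generated, only the column names mirror the real ones); with the real data the same call would be made on data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine dataframe (random values, real column names).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.4, 1.0, 200),
    "residual sugar": rng.lognormal(0.8, 0.5, 200),  # tail-heavy on purpose
    "pH": rng.normal(3.3, 0.15, 200),
})

# One histogram per numeric column, arranged in a grid of subplots.
axes = demo.hist(bins=30, figsize=(10, 6))
plt.tight_layout()
```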

21
A few observations based on these plots:

1. Features are at different scales.

2. Features have different distributions -
A few are tail heavy, e.g. residual sugar, free sulfur dioxide
A few have multiple modes, e.g. volatile acidity, citric acid

Before any further exploration, it's a good idea to set aside a test set and not look at it,
so that we have a clean evaluation set.

22
2.4 Create test set
When we look at the test set, we are likely to notice patterns in it, and based
on those patterns we may select certain models.
This leads to a biased (overly optimistic) performance estimate on the test set, and
the model may not generalize well in practice. This is called data snooping bias.

23
Let's write a function to split the data into training and test sets. Make sure to set the seed so
that we get the same test set in the next run.

def split_train_test(data, test_ratio):
    # set the random seed.
    np.random.seed(42)

    # shuffle the dataset.
    shuffled_indices = np.random.permutation(len(data))

    # calculate the size of the test set.
    test_set_size = int(len(data) * test_ratio)

    # split dataset to get training and test sets.
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(data, 0.2)

24
Scikit-Learn provides a few functions for creating test sets based on

1. Random sampling, which randomly selects k% of the points for the test set.

2. Stratified sampling, which samples test examples such that they are
representative of the overall distribution.

25
Random sampling

The train_test_split() function performs random sampling with:

random_state parameter to set the random seed, which ensures that the
same examples are selected for test sets across runs.
test_size parameter for specifying the size of the test set.
shuffle flag to specify whether the data needs to be shuffled before splitting.
Provision for processing multiple datasets with an identical number of rows and
selecting the same indices from these datasets.
This is useful when labels are in a different dataframe.

from sklearn.model_selection import train_test_split

26
We can read the documentation for this function with the following line of code
(IPython/Jupyter syntax; help(train_test_split) works in plain Python):

?train_test_split

27
Let's perform random sampling on our dataset:

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

28
Stratiﬁed sampling
Data distribution may not be uniform in real world data.
Random sampling can, by chance, produce a test set whose distribution differs from
the overall one, especially on small datasets.
Recall the label distribution in our dataset: It's not uniform!

sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')

There are many more examples of classes 5 and 6 than of the other classes.

This causes a problem with random sampling: the test distribution may not match
the overall distribution.
How do we sample in such cases?
We divide the population into homogeneous groups called strata.
Data is sampled from each stratum so as to match the overall data distribution.
Scikit-Learn provides a class StratifiedShuffleSplit that helps us with stratified sampling.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(data, data["quality"]):
    # split() yields positional indices, so select rows with iloc.
    strat_train_set = data.iloc[train_index]
    strat_test_set = data.iloc[test_index]

Let's examine the test set distribution by the wine quality that was used for stratified
sampling.

strat_dist = strat_test_set["quality"].value_counts() / len(strat_test_set)

30
Now compare this with the overall distribution:

overall_dist = data["quality"].value_counts() / len(data)

Let's look at them side-by-side:

dist_comparison = pd.DataFrame({'overall': overall_dist, 'stratified': strat_dist})
dist_comparison['diff(s-o)'] = dist_comparison['stratified'] - dist_comparison['ove
dist_comparison['diff(s-o)_pct'] = 100*(dist_comparison['diff(s-o)']/dist_compariso

31
You can notice that there is only a small difference in most strata.

dist_comparison

32
Let's contrast this with random sampling:

random_dist = test_set["quality"].value_counts() / len(test_set)
random_dist

33
Sampling bias comparison
Compare the difference in distribution between stratified and random sampling:
Stratified sampling gives us a test distribution closer to the overall distribution than
random sampling does.

dist_comparison.loc[:, ['diff(s-o)_pct', 'diff(r-o)_pct']]
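
The full comparison, including the diff(r-o)_pct column, can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the labels are synthetic and imbalanced (not the wine data), and train_test_split with its stratify argument is used for the stratified split in place of StratifiedShuffleSplit.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced "quality" labels standing in for the wine data.
counts = {3: 10, 4: 50, 5: 420, 6: 400, 7: 100, 8: 20}
labels = np.repeat(list(counts.keys()), list(counts.values()))
demo = pd.DataFrame({"quality": labels})

# Random vs. stratified split with the same test size and seed.
_, random_test = train_test_split(demo, test_size=0.2, random_state=42)
_, strat_test = train_test_split(demo, test_size=0.2, random_state=42,
                                 stratify=demo["quality"])


def proportions(frame):
    """Share of each quality value in the given frame."""
    return frame["quality"].value_counts(normalize=True)


comparison = pd.DataFrame({
    "overall": proportions(demo),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
}).fillna(0.0).sort_index()

# Relative deviation from the overall distribution, in percent.
comparison["diff(s-o)_pct"] = (
    100 * (comparison["stratified"] - comparison["overall"]) / comparison["overall"])
comparison["diff(r-o)_pct"] = (
    100 * (comparison["random"] - comparison["overall"]) / comparison["overall"])
print(comparison)
```

The rare classes are where random sampling tends to deviate most, since a handful of examples more or less changes their share noticeably.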

34
Step 3: Data visualization
Performed on the training set.
In case of a large training set -
Sample examples to form an exploration set.
It enables us to understand the features and their relationships among themselves and with
the output label.

In our case, we have a small training set and we use all of it for data exploration. There is
no need to create a separate exploration set.

It's a good idea to create a copy of the training set so that we can freely manipulate it
without worrying about changing the original set.

exploration_set = strat_train_set.copy()
35
Scatter Visualization
With the seaborn library:

sns.scatterplot(x='fixed acidity', y='density', hue='quality',
                data=exploration_set)

36
With pandas plotting (built on matplotlib):

exploration_set.plot(kind='scatter', x='fixed acidity', y='density', alpha=0.5,
                     c="quality", cmap=plt.get_cmap("jet"))

37
Relationship between features
Standard correlation coefficient between features.
Ranges from -1 to +1.
Correlation close to +1: Strong positive correlation between features
Correlation close to -1: Strong negative correlation between features
Correlation = 0: No linear correlation between features
Visualization with heat map
Only captures linear relationships between features.
For non-linear (monotonic) relationships, use rank correlation.

Let's calculate correlations between our features.

corr_matrix = exploration_set.corr()
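
The rank correlation mentioned above is available through the method argument of DataFrame.corr(). The sketch below uses made-up data in which y grows monotonically but non-linearly with x, so the standard (Pearson) coefficient stays below 1 while the Spearman rank correlation reaches it.

```python
import numpy as np
import pandas as pd

# Made-up data: y = x**3 is monotonic in x but not linear.
x = np.arange(1, 11)
demo = pd.DataFrame({"x": x, "y": x ** 3})

pearson = demo.corr(method="pearson").loc["x", "y"]    # linear relationship only
spearman = demo.corr(method="spearman").loc["x", "y"]  # correlation of the ranks
print(pearson, spearman)
```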
38
Let's check features that are correlated with the label, which is quality in our case.

corr_matrix['quality']

Notice that quality has a relatively strong positive correlation with alcohol content [0.48] and
a relatively strong negative correlation with volatile acidity [-0.38].

39
Let's visualize the correlation matrix with a heatmap:

plt.figure(figsize=(14,7))
sns.heatmap(corr_matrix, annot=True)

40
41
You can notice:
The correlation coefficient on the diagonal is +1.
Darker colors represent negative correlations, while fainter colors denote positive
correlations. For example:
citric acid and fixed acidity have a strong positive correlation.
pH and fixed acidity have a strong negative correlation.

Another option to visualize the relationship between the features is a scatter matrix.

42
from pandas.plotting import scatter_matrix
attribute_list = ['citric acid', 'pH', 'alcohol', 'sulphates', 'quality']
scatter_matrix(exploration_set[attribute_list])

For convenience of visualization, we show it for a small number of attributes.

43
Similar analysis can be carried out with combined features - features that are
derived from the original features.

44
Note of wisdom
1. Visualization and data exploration do not have to be absolutely thorough.
2. The objective is to get quick insight into the features and their relationships with
other features and with the label.
3. Exploration is an iterative process: Once we build a model and obtain more insights,
we can come back to this step.

45
Step 4: Prepare data for ML algorithm
We often need to preprocess the data before using it for model building, for a variety
of reasons:
Due to errors in data capture, data may contain outliers or missing values.
Different features may be at different scales.
The current data distribution is not exactly amenable to learning.
Typical steps in data preprocessing are as follows:
1. Separate features and labels.
2. Handle missing values and outliers.
3. Scale features to bring all of them onto the same scale.
4. Apply certain transformations like log or square root to the features.

It's a good practice to make a copy of the data and apply preprocessing on that copy.
This ensures that in case something goes wrong, we will at least have the original copy of
the data intact.
4.1 Separate features and labels from the training set.

# Copy all features leaving aside the label.
wine_features = strat_train_set.drop("quality", axis=1)

# Copy the label list
wine_labels = strat_train_set['quality'].copy()

47
4.2 Data cleaning
Let's first check if there are any missing values in the feature set. One way to find that out is
to count them column-wise.

1 wine_features.isna().sum() # counts the number of NaN in each column of wine_fe

In this dataset, we do not have any missing values.

48
In case we have non-zero counts in any column, we have a problem of missing
values.
These values are missing either due to errors in recording or because they do not exist.
If they were not recorded:
Use an imputation technique to fill in the missing values.
Drop the rows containing missing values.
If they do not exist, it is better to keep them as NaN.

Pandas provides the following methods to drop rows containing missing values:
dropna()
drop()

Scikit-Learn provides the SimpleImputer class for filling in missing values with, say, the median value.
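
Both options can be sketched on a toy dataframe. The numbers below are made up for the illustration; the wine dataset itself has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column (made-up numbers).
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 11.2],
    "pH": [3.51, np.nan, 3.26, 3.16],
})

# Option 1: drop every row that contains at least one NaN (pandas).
dropped = demo.dropna()

# Option 2: fill each NaN with the median of its column (scikit-learn).
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(demo),
                       columns=demo.columns, index=demo.index)

print(dropped.shape)
print(imputed)
```

Dropping rows is simple but discards half of this toy frame, which is why imputation is often preferred on small datasets.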

49
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

The strategy specifies how to replace the missing values. In this case,
we specify that a missing value should be replaced by the median value.

imputer.fit(wine_features)

SimpleImputer(add_indicator=False, copy=True, fill_value=None, missing_values=nan,
              strategy='median', verbose=0)

In case the features contain non-numeric attributes, they need to be dropped before
calling the fit method on the imputer object.

50
Let's check the statistics learnt by the imputer on the training set:

imputer.statistics_

array([ 7.9 , 0.52 , 0.26 , 2.2 , 0.08 , 14. , 39. , 0.99675, 3.31 ,
0.62 , 10.2 ])

Note that these are the median values for each feature. We can cross-check this by calculating
the median on the feature set:

wine_features.median()

51
Finally we use the trained imputer to transform the training set such that the missing
values are replaced by the medians:

tr_features = imputer.transform(wine_features)

This returns a NumPy array, which we can convert to a dataframe if needed:

tr_features.shape

(1279, 11)

wine_features_tr = pd.DataFrame(tr_features, columns=wine_features.columns)

52
4.3 Handling text and categorical attributes
4.3.1 Converting categories to numbers:

from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

Call the fit_transform() method on the ordinal_encoder object to convert text to numbers.

The list of categories can be obtained via the categories_ instance variable.

One issue with this representation is that the ML algorithm would assume that
two nearby values are more similar than two distant ones.
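
A minimal sketch of the ordinal encoding step is shown below. The place_of_manufacturing column and its values are hypothetical, since the wine dataset has no categorical feature.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column; the wine dataset has no such feature.
demo = pd.DataFrame(
    {"place_of_manufacturing": ["Italy", "France", "Spain", "France"]})

ordinal_encoder = OrdinalEncoder()
encoded = ordinal_encoder.fit_transform(demo)  # 2-D float array, one column per feature

print(ordinal_encoder.categories_)  # categories learnt for each column (sorted)
print(encoded.ravel())
```

Here France, Italy and Spain become 0, 1 and 2, which illustrates the issue noted above: the encoding implies that Italy is "between" France and Spain, although no such order exists.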

53
4.3.2 Using one hot encoding
Here we create one binary feature per category - the feature value is 1 when the category
is present else it is 0.
Only one feature is 1 (hot) and the rest are 0 (cold).
The new features are referred to as dummy features.
Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot
vectors.

from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()

We need to call the fit_transform() method on the OneHotEncoder object.

The output is a SciPy sparse matrix rather than a NumPy array. This enables us to save
space when we have a huge number of categories.
In case we want to convert it to a dense representation, we can do so with the toarray()
method.
The list of categories can be obtained via the categories_ instance variable.
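
The same hypothetical column can be one-hot encoded as follows. This is a sketch on made-up data, not part of the wine pipeline.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; the wine dataset has no such feature.
demo = pd.DataFrame(
    {"place_of_manufacturing": ["Italy", "France", "Spain", "France"]})

cat_encoder = OneHotEncoder()
one_hot = cat_encoder.fit_transform(demo)  # SciPy sparse matrix by default
dense = one_hot.toarray()                  # dense NumPy array, one column per category

print(cat_encoder.categories_)
print(dense)
```

Each row contains exactly one 1, in the column of its category.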
54
As we observed, when the number of categories is very large, one-hot encoding
results in a very large number of features.
This can be addressed with one of the following approaches:
Replace the categorical feature with related numerical features
Convert the categories into low-dimensional learnable vectors called embeddings

55
4.4 Feature Scaling
Most ML algorithms do not perform well when input features are on very different
scales.
Scaling of the target label is generally not required.

4.4.1 Min-max scaling or Normalization

We subtract the minimum value of a feature from the current value and divide it by the
difference between the maximum and the minimum value of that feature.
Values are shifted and scaled so that they range between 0 and 1.
Scikit-Learn provides the MinMaxScaler transformer for this.
One can use the hyperparameter feature_range to specify a different target range for the feature.
56
4.4.2 Standardization
We subtract the mean value of each feature from the current value and divide it by the
standard deviation so that the resulting feature has zero mean and unit variance.
While normalization bounds values between 0 and 1, standardization does not bound
values to a specific range.
Standardization is less affected by outliers than normalization.
Scikit-Learn provides the StandardScaler transformer for feature standardization.
Note that all these transformers are learnt on the training data and then applied to the
training and test data to transform them.
Never learn these transformers on the full dataset.
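
Both scalers, and the rule of fitting on the training data only, can be sketched on a tiny made-up feature. The single test value lies outside the training range on purpose.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up single feature: three training values and one test value.
train = np.array([[0.0], [10.0], [20.0]])
test = np.array([[30.0]])

# Min-max scaling: (x - min) / (max - min), with min and max learnt on train.
min_max = MinMaxScaler()                 # default feature_range=(0, 1)
train_mm = min_max.fit_transform(train)  # learn the statistics on the training data
test_mm = min_max.transform(test)        # reuse them, never refit on the test data

# Standardization: (x - mean) / std, with mean and std learnt on train.
std_scaler = StandardScaler()
train_std = std_scaler.fit_transform(train)
test_std = std_scaler.transform(test)

print(train_mm.ravel(), test_mm.ravel())
print(train_std.ravel(), test_std.ravel())
```

The test value maps to 1.5 under min-max scaling: the 0 to 1 bound holds only for the data the scaler was fitted on.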

57
Transformation Pipeline
Scikit-Learn provides a Pipeline class to line up transformations in an intended order.
Here is an example pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
transform_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),])
wine_features_tr = transform_pipeline.fit_transform(wine_features)

Let's understand what is happening here:

The pipeline has a sequence of transformations - missing value imputation followed by
standardization.
Each step in the sequence is defined by a (name, estimator) pair.
Each name should be unique and should not contain __ (double underscore).

58

The output of one step is passed on to the next one in sequence until it reaches the last
step.
Here the pipeline first performs imputation of missing values and its result is passed on for
standardization.
The pipeline exposes the same methods as the final estimator.
Here StandardScaler is the last estimator and since it is a transformer, we call the
fit_transform() method on the Pipeline object.
59
How to transform mixed features?
Real world data often has both categorical and numerical features, and we need to
apply different transformations to them.
Scikit-Learn provides ColumnTransformer for this purpose.

from sklearn.compose import ColumnTransformer

In our dataset, we do not have features of mixed types. All our features are numeric.

60
For illustration purposes, here is an example code snippet. The place_of_manufacturing
column is hypothetical, since our dataset has no categorical feature.

# reuse the numeric pipeline (imputation + standardization) defined earlier.
num_pipeline = transform_pipeline
num_attribs = list(wine_features)
cat_attribs = ["place_of_manufacturing"]  # hypothetical column
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
wine_features_tr = full_pipeline.fit_transform(wine_features)

Here we apply num_pipeline to the numerical features and the OneHotEncoder transformation to
the categorical features.
The ColumnTransformer applies each transformation to the appropriate columns and then
concatenates the outputs along the columns.
Note that all transformers must return the same number of rows.
The numeric transformers return a dense matrix while the categorical ones return a sparse
matrix. The ColumnTransformer automatically determines the type of the output based on
the density of the resulting matrix.
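
Since the wine data has no categorical column, the snippet above cannot be run on it directly. A self-contained sketch on a toy dataframe is given below; the numbers and the place_of_manufacturing column are made up for the illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric features and one hypothetical categorical feature.
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 11.2],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "place_of_manufacturing": ["Italy", "France", "Spain", "France"],
})
num_attribs = ["alcohol", "pH"]
cat_attribs = ["place_of_manufacturing"]

# Numeric columns: median imputation followed by standardization.
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

transformed = full_pipeline.fit_transform(demo)
if hasattr(transformed, "toarray"):  # the output may be sparse or dense
    transformed = transformed.toarray()
print(transformed.shape)  # 2 scaled numeric columns + 3 one-hot columns
```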
Step 5: Select and train ML model
It's a good practice to build a quick baseline model on the preprocessed data and get
an idea about model performance.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(wine_features_tr, wine_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

62
Now that we have a working regression model, let's evaluate the performance of the
model on the training as well as the test set.
For regression models, we use mean squared error as an evaluation measure.

from sklearn.metrics import mean_squared_error

quality_predictions = lin_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)

0.4206571060060278

63
Let's evaluate performance on the test set.
We first need to apply the transformation to the test set and then apply the model prediction
function.

# copy all features leaving aside the label.
wine_features_test = strat_test_set.drop("quality", axis=1)

# copy the label list
wine_labels_test = strat_test_set['quality'].copy()

# apply transformations: use transform() (not fit_transform()) so that the
# statistics learnt on the training set are reused on the test set.
wine_features_test_tr = transform_pipeline.transform(wine_features_test)

# call predict function and calculate MSE.
quality_test_predictions = lin_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

0.39759130875015164

64
Let's visualize the error between the actual and predicted values.

plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

The model seems to be making errors on the best and the poorest quality wines.

65
Let's try another model: DecisionTreeRegressor.

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(wine_features_tr, wine_labels)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort='deprecated', random_state=None, splitter='best')

Notice the similarity between the two code snippets.

Linear regression | Decision tree
lin_reg.fit(wine_features_tr, wine_labels) | tree_reg.fit(wine_features_tr, wine_labels)

66
quality_predictions = tree_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)

0.0

quality_test_predictions = tree_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

0.58125

Note that the training error is 0, while the test error is 0.58. This is an example of an
overfitted model.

67
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

68
We can use cross-validation (CV) for robust evaluation of model performance.

from sklearn.model_selection import cross_val_score

Cross-validation provides a separate MSE for each validation set, which we can
use to get a mean estimate of the MSE as well as its standard deviation, which
helps us determine how precise the estimate is.
The additional cost we pay in cross-validation is additional training runs, which
may be too expensive in certain cases.

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

69
Linear Regression CV

scores = cross_val_score(lin_reg, wine_features_tr, wine_labels,
                         scoring="neg_mean_squared_error", cv=10)
lin_reg_mse_scores = -scores
display_scores(lin_reg_mse_scores)

Scores: [0.56364537 0.4429824 0.38302744 0.40166681 0.29687635 0.37322622
0.33184855 0.50182048 0.51661311 0.50468542]
Mean: 0.431639217212196
Standard deviation: 0.08356359730413976

70
Decision tree CV

scores = cross_val_score(tree_reg, wine_features_tr, wine_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_mse_scores = -scores
display_scores(tree_mse_scores)

Scores: [0.6171875 0.6875 0.6328125 0.5078125 0.4609375 0.640625 0.65625 0.7109375
0.859375 1.07874016]
Mean: 0.6852177657480315
Standard deviation: 0.16668343331737054

Let's compare the scores of linear regression (LinReg) and decision tree (DT) regression:
LinReg has a lower mean MSE and a smaller standard deviation, i.e. a more precise estimate, than DT.

71
Random forest CV
A random forest model builds multiple decision trees on randomly selected features
and then averages their predictions.
Building a model on top of other models is called ensemble learning, which is often
used to improve the performance of ML models.

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(wine_features_tr, wine_labels)

scores = cross_val_score(forest_reg, wine_features_tr, wine_labels,
                         scoring="neg_mean_squared_error", cv=10)
forest_mse_scores = -scores
display_scores(forest_mse_scores)

Scores: [0.36989922 0.41363672 0.29063438 0.31722344 0.21798125 0.30233828 0.27124922
0.38747344 0.42379219 0.46229449]
Mean: 0.34565226131889765
Standard deviation: 0.0736322184302973
Standard deviation: 0.0736322184302973
72
quality_test_predictions = forest_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)

0.34449875

plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')

Random forest looks more promising than the other two models.
It's a good practice to build a few such models quickly without tuning their
hyperparameters and shortlist a few promising models among them.
Also save the models to disk in Python pickle format.
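
Saving and restoring a model with pickle can be sketched as follows. The toy data and the file name lin_reg_demo.pkl are assumptions made for this illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (y = 2x) standing in for the preprocessed wine features.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Hypothetical file name, placed in the system's temporary directory.
model_path = os.path.join(tempfile.gettempdir(), "lin_reg_demo.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)  # serialize the fitted model to disk

with open(model_path, "rb") as f:
    restored = pickle.load(f)  # load it back, e.g. in a later session

print(restored.predict(X))
```

A pickled model should be loaded with the same scikit-learn version that saved it, and pickle files from untrusted sources should never be loaded.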
73
What to do next?

Model diagnosis | Remedy
Underfitting | Models with more capacity; fewer constraints / less regularization
Overfitting | More data; simpler model; more constraints / more regularization

74
Step 6: Finetune your model
Usually there are a number of hyperparameters in the model, which are set
manually.
Tuning these hyperparameters leads to better accuracy of ML models.
Finding the best combination of hyperparameters is a search problem in the
space of hyperparameters, which is huge.

Grid search
Scikit-Learn provides a class GridSearchCV that helps us in this pursuit.

from sklearn.model_selection import GridSearchCV

We need to specify a list of hyperparameters along with the range of values to try.
It automatically evaluates all possible combinations of hyperparameter values using
cross-validation.
75
For example, there are a number of hyperparameters in RandomForest regression
such as:
Number of estimators
Maximum number of features

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

Here the parameter grid contains two sub-grids:

1. The first sub-grid contains n_estimators with 3 values and max_features with
4 values.
2. The second sub-grid has an additional bootstrap parameter, which is set to
False. Note that it was left at its default value, which is True, in the first sub-grid.

76
Let's compute the total number of combinations evaluated here:
1. The first sub-grid results in 3 × 4 = 12 combinations.
2. The second one has 2 values of n_estimators and 3 values of max_features, thus
resulting in 2 × 3 = 6 combinations.

The total number of combinations evaluated by the parameter grid is 12 + 6 = 18.
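
This count can be cross-checked with scikit-learn's ParameterGrid helper, which enumerates the combinations of a grid. The sketch also multiplies by 5 folds, the cv value used for the grid search in this deck.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

n_combinations = len(ParameterGrid(param_grid))  # 3*4 + 1*2*3 combinations
cv_folds = 5                                     # number of cross-validation folds
n_training_runs = n_combinations * cv_folds      # one fit per combination and fold

print(n_combinations, n_training_runs)
```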

Let's create an object of GridSearchCV:

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

77

In this case, we set cv=5, i.e. we use 5-fold cross-validation for evaluating each combination.
We need to train the model for 18 parameter combinations and each combination
would be trained 5 times as we are using cross-validation here.
The total number of model training runs = 18 × 5 = 90

78
Let's launch the hyperparameter search:

grid_search.fit(wine_features_tr, wine_labels)

79
The best parameter combination can be obtained as follows:

grid_search.best_params_

{'max_features': 6, 'n_estimators': 30}

Let's find out the error at different parameter settings:

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(-mean_score, params)

80

As you can notice the lowest MSE is obtained for the best parameter combination.
81
Let's obtain the best estimator as follows:

grid_search.best_estimator_

Note: GridSearchCV is initialized with the refit=True option, which retrains the best
estimator on the full training set. This is likely to lead us to a better model as it is
trained on a larger dataset.

82
Randomized Search
When we have a large hyperparameter space, it is desirable to try
RandomizedSearchCV.
It selects a random value for each hyperparameter in each iteration and
repeats the process for the given number of random combinations.
It enables us to search the hyperparameter space with appropriate budget control.

from sklearn.model_selection import RandomizedSearchCV
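
A minimal sketch of a randomized search is given below. It runs on synthetic regression data from make_regression rather than the wine features, and the sampling ranges and the budget of 5 iterations are arbitrary choices made for the illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the preprocessed wine features.
X, y = make_regression(n_samples=80, n_features=8, noise=0.5, random_state=42)

# Distributions to sample from (the upper bound of randint is exclusive).
param_distributions = {
    "n_estimators": randint(5, 30),
    "max_features": randint(2, 9),
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,   # budget: number of random combinations to try
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

print(rnd_search.best_params_)
```

The n_iter argument is the budget control: the number of training runs is n_iter times the number of folds, independent of how large the search space is.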

83
Analysis of best model and its errors
Analysis of the model provides useful insights about features. Let's obtain the feature
importance as learnt by the model:

feature_importances = grid_search.best_estimator_.feature_importances_

sorted(zip(feature_importances, feature_list), reverse=True)

Based on this information, we may drop features that are not so important.
It is also useful to analyze the errors in prediction, understand their causes and fix
them.
84
Evaluation on test set
Now that we have a reasonable model, we evaluate its performance on the test set. The
following steps are involved in the process:

1. Transform the test features.

# copy all features leaving aside the label.

85
2. Use the predict method with the trained model and the test set.

quality_test_predictions = grid_search.best_estimator_.predict(
    wine_features_test_tr)

3. Compare the predicted labels with the actual ones and report the evaluation metrics.

mean_squared_error(wine_labels_test, quality_test_predictions)

0.35345138888888883

86
4. It's a good idea to get a 95% confidence interval of the evaluation metric. It can be
obtained by the following code:

from scipy import stats

confidence = 0.95
squared_errors = (quality_test_predictions - wine_labels_test) ** 2
stats.t.interval(confidence, len(squared_errors) - 1,
                 loc=squared_errors.mean(),
                 scale=stats.sem(squared_errors))

(0.29159276569581916, 0.4153100120819586)

87
Step 7: Present your solution

Once we have a satisfactory model based on its performance on the test set, we reach
the prelaunch phase.

Before launch,

1. We need to present our solution, highlighting learnings, assumptions and the system's limitations.
2. Document everything, create clear visualizations and present the model.
3. Even if the model does not work better than the experts, it may still be a
good idea to launch it and free up the bandwidth of human experts.

88
Step 8: Launch, monitor and maintain your system
Launch
Plug in input sources
Write test cases

Monitoring
System outages
Degradation of model performance
Sampling predictions for human evaluation
Regular assessment of data quality, which is critical for model performance

Maintenance
Retrain the model at regular intervals with fresh data.
Roll out the retrained model to production.

89
Summary
In this module, we studied the steps involved in an end-to-end machine learning project with
an example of predicting wine quality.

90
Introduction to Scikit-Learn (sklearn)

2
sklearn APIs are organized on the lines of our ML framework.

Scikit-learn                                   ML Framework
Training data and preprocessing                Training data
Model (subsumes loss function and              Model, Loss function, Optimization
optimization procedure)
Model selection and evaluation                 Evaluation
Model inspection
3
API design principles

4
sklearn APIs are well designed with the following principles:

Consistency: All APIs share a simple and consistent interface.
Inspection: The learnable parameters as well as hyperparameters of all estimators are
accessible directly via public instance variables.
Nonproliferation of classes: Datasets are represented as NumPy arrays or SciPy sparse
matrices instead of custom designed classes.
Composition: Existing building blocks are reused as much as possible.
Sensible defaults: Sensible default values are used for parameters, which enables
quick baseline building.

5
Types of sklearn objects

Transformers (data preprocessing)
Transform a dataset.
transform() transforms the dataset.
fit() learns the parameters.
fit_transform() fits the parameters and transforms the dataset.

Estimators (training)
Estimate model parameters based on training data and hyperparameters.
fit() method.

Predictors (inference)
Make predictions on a dataset.
predict() method takes a dataset as input and returns predictions.
score() method measures the quality of the predictions.
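The three object types can be seen side by side in a short sketch. The choice of the iris dataset, StandardScaler and LogisticRegression is an assumption made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit() learns per-feature mean and scale, transform() applies them.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Estimator: fit() learns the model parameters from the training data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, y)

# Predictor: predict() returns labels, score() returns the mean accuracy.
predictions = clf.predict(X_scaled)
accuracy = clf.score(X_scaled, y)
print(predictions[:5], accuracy)
```

Note that one object can play more than one role: LogisticRegression is both an estimator and a predictor.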

6
sklearn API

7
Data API
Provides functionality for loading, generating and
preprocessing the training and test data.

Module                        Functionality
sklearn.datasets              Loading datasets: custom as well as popular reference datasets.
sklearn.preprocessing         Scaling, centering, normalization and binarization methods.
sklearn.impute                Filling missing values.
sklearn.feature_selection     Implements feature selection algorithms.
sklearn.feature_extraction    Implements feature extraction from raw data.

8
Model API
Implements supervised and unsupervised models.

Regression
sklearn.linear_model (linear, ridge, lasso models)
sklearn.tree

Classification
sklearn.linear_model
sklearn.svm
sklearn.tree
sklearn.neighbors
sklearn.naive_bayes
sklearn.multiclass

sklearn.multioutput implements multi-output classification and regression.
sklearn.cluster implements many popular clustering algorithms.
9
Model evaluation API
sklearn.metrics implements different metrics for
model evaluation.
Classiﬁcation metrics
Regression metrics
Clustering metrics

10
Model selection API

sklearn.model_selection implements various model selection strategies like
cross-validation, tuning hyper-parameters and plotting learning curves.

11
Model inspection API

sklearn.inspection includes tools for model inspection.

12
Practical advice
It is not possible to remember each and every sklearn API.

Remember the high-level modules and the API design principles.

Use documentation for more information as follows:

# In IPython/Jupyter; in plain Python use help(LogisticRegression) instead.
from sklearn.linear_model import LogisticRegression
?LogisticRegression

Keep the following links handy:

API reference
sklearn user guide
Worked examples for reference implementations

13
Data loading

2
The general dataset API has three main kinds of interfaces:

The dataset loaders are used to load toy datasets bundled with sklearn.

The dataset fetchers are used to download and load datasets from the internet.

The dataset generators are used to generate controlled synthetic datasets.

3
Dataset API

Loaders (load_*): load small standard datasets.
Fetchers (fetch_*): fetch and load larger datasets.
Generators (make_*): generate controlled synthetic datasets.

Both loaders and fetchers return a Bunch object, which is a dictionary with two keys
of interest:

Key       Values
data      Array of shape (n, m)
target    Array of shape (n,)

Generators return a tuple (X, y) of numpy arrays, where X has shape (n, m) and
y has shape (n,). Loaders and fetchers return the same kind of tuple when called
with return_X_y=True.
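As a small sketch of the two return styles, using load_iris as an example loader:

```python
from sklearn.datasets import load_iris

# Default: a Bunch object, accessed like a dictionary or through attributes.
iris = load_iris()
print(iris["data"].shape, iris.target.shape)

# return_X_y=True: a plain (X, y) tuple of numpy arrays.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```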

4
Dataset Loaders
Dataset loader        # samples (n)   # features (m)   # labels   Type
load_iris             150             4                1          Classification
load_diabetes         442             10               1          Regression
load_digits           1797            64               1          Classification
load_linnerud         20              3                3          Regression (multi output)
load_wine             178             13               1          Classification
load_breast_cancer    569             30               1          Classification

Note: These datasets are bundled with sklearn, so we do not need to download them
from external sources.
5
Dataset Fetchers
Dataset fetcher            # samples (n)   # features (m)   # labels (# classes)   Type
fetch_olivetti_faces       400             4096             1 (40)                 multi-class image classification
fetch_20newsgroups         18846           1                1 (20)                 multi-class text classification
fetch_lfw_people           13233           5828             1 (5749)               multi-class image classification
fetch_covtype              581012          54               1 (7)                  multi-class classification
fetch_rcv1                 804414          47236            103                    multi-label classification
fetch_kddcup99             4898431         41               1                      multi-class classification
fetch_california_housing   20640           8                1                      regression
6
Dataset generators
Regression
make_regression() produces regression targets as a sparse random linear combination
of random features with noise. The informative features are either uncorrelated or
low rank.

Classification (single label)
make_blobs() and make_classification() first create a bunch of normally-distributed
clusters of points and then assign one or more clusters to each class, thereby
creating multi-class datasets.

Classification (multilabel)
make_multilabel_classification() generates random samples with multiple labels with
a specific generative process and rejection sampling.
7
Dataset generators
Clustering
make_blobs() generates a bunch of normally-distributed clusters of points with
specific mean and standard deviations for each cluster.
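The generators can be tried out as follows. This is a sketch only; the sample counts, feature counts and random seeds are arbitrary assumptions.

```python
from sklearn.datasets import make_blobs, make_classification, make_regression

# Regression: 3 of the 5 features actually drive the target.
X_reg, y_reg = make_regression(n_samples=100, n_features=5, n_informative=3,
                               noise=0.1, random_state=0)

# Single-label classification with 3 classes.
X_clf, y_clf = make_classification(n_samples=100, n_features=10, n_informative=4,
                                   n_classes=3, random_state=0)

# Clustering: 3 normally-distributed blobs in 2 dimensions.
X_blob, y_blob = make_blobs(n_samples=90, centers=3, n_features=2,
                            cluster_std=0.5, random_state=0)

print(X_reg.shape, X_clf.shape, X_blob.shape)
```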

8
Loading external datasets
fetch_openml() fetches datasets from openml.org, which is a public repository for
machine learning data and experiments.
pandas.io provides tools to read from common formats like CSV, Excel, JSON and SQL.
scipy.io specializes in binary formats used in scientific computing like .mat and .arff.
numpy/routines.io specializes in loading columnar data into numpy arrays.
datasets.load_files loads directories of text files where the directory name is a
label and each file is a sample.

9
Loading external datasets
datasets.load_svmlight_files() loads data in svmlight and libSVM sparse format.
skimage.io provides tools to load images and videos in numpy arrays.
scipy.io.wavfile.read specializes in reading a WAV file into a numpy array.

10
For managing numerical data, sklearn recommends using an optimized file format such
as HDF5 (Hierarchical Data Format version 5) to reduce data load times.

pandas, PyTables and h5py provide an interface to read and write data in that format.

11
Data transformation

12
Types of transformers

sklearn provides a library of transformers for

Data cleaning (sklearn.preprocessing)
Feature extraction (sklearn.feature_extraction)
Feature reduction
Feature expansion (sklearn.kernel_approximation)

13
Transformer methods

Each transformer has the following methods:

fit() method learns model parameters from a training set.

transform() method applies the learnt transformation to the new data.

fit_transform() performs the function of both fit() and transform() methods and is
more convenient and efficient to use.

14
Transformers are combined with one another or with other estimators such as
classifiers or regressors to build composite estimators.

Tool                 Usage
Pipeline             Chains multiple estimators to execute a fixed sequence of steps in data preprocessing and modelling.
FeatureUnion         Combines output from several transformer objects by creating a new transformer from them.
ColumnTransformer    Enables different transformations on different columns of data based on their types.

15
Data Preprocessing
Machine Learning Practice

Dr. Ashish Tendulkar

IIT Madras

2
Real-world training data is usually not clean and has many issues such as missing
values for certain features, features on different scales, non-numeric attributes etc.

Often there is a need to pre-process the data to make it amenable for training the model.
sklearn provides a rich set of transformers for this job.

The same pre-processing should be applied to both the training and the test set.
sklearn provides Pipeline to make it easier to chain multiple transforms together
and apply them uniformly across train, eval and test sets.

3
Once you get the training data, the first job is to explore the data and list down
the preprocessing needed.

Typical problems include:

Missing values in features.
Numerical features are not on the same scale.
Categorical attributes need to be represented with a sensible numerical representation.
Too many features, which need to be reduced.
Features need to be extracted from non-numeric data.

4
sklearn provides a library of transformers for data preprocessing.

Data cleaning (sklearn.preprocessing, sklearn.impute) such as standardization,
missing value imputation, etc.
Feature extraction (sklearn.feature_extraction)
Feature reduction (sklearn.decomposition.PCA)
Feature expansion (sklearn.kernel_approximation)

5
Transformer methods

Each transformer has the following methods:

fit() method learns model parameters from a training set.

transform() method applies the learnt transformation to the new data.

fit_transform() performs the function of both fit() and transform() methods and is
more convenient and efficient to use.

6
Part 1. Feature extraction

7
sklearn.feature_extraction has useful APIs to extract
features from data:

DictVectorizer FeatureHasher

Let's study these APIs one by one.

8
DictVectorizer

Converts lists of mappings of feature name and feature value into a matrix.

Original data:

data = [{'age': 4, 'height': 96.0},
        {'age': 1, 'height': 73.9},
        {'age': 3, 'height': 88.9},
        {'age': 2, 'height': 81.6}]

dv = DictVectorizer(sparse=False)
dv.fit_transform(data)

Transformed feature matrix X′:

$$X'_{4\times 2} = \begin{bmatrix} 4 & 96.0 \\ 1 & 73.9 \\ 3 & 88.9 \\ 2 & 81.6 \end{bmatrix}$$

9
FeatureHasher

High-speed, low-memory vectorizer that uses feature

hashing technique.
Instead of building a hash table of the features, as the
vectorizers do, it applies a hash function to the features
to determine their column index in sample matrices
directly.
This results in increased speed and reduced memory
usage, at the expense of inspectability; the hasher does
not remember what the input features looked like and
has no inverse_transform method.
Output of this transformer is scipy.sparse matrix.

10
Feature Extraction from images and text

sklearn.feature_extraction.image.* has useful APIs to extract features from image
data. Find out more about them in the sklearn user guide section "Feature Extraction
from Images".

sklearn.feature_extraction.text.* has useful APIs to extract features from text
data. Find out more about them in the sklearn user guide section "Feature Extraction
from Text".

11
Part 2: Data Cleaning

12
Handling missing values

13
Missing values occur due to errors in data capture such as sensor malfunctioning,
measurement errors etc.

Many ML algorithms do not work with missing data and need all features to be present.

Discarding records containing missing values would result in loss of valuable
training samples.

sklearn.impute API provides functionality to fill missing values in a dataset:

SimpleImputer
KNNImputer

MissingIndicator provides indicators for missing values.

14
SimpleImputer

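Fills missing values with one of the following strategies:
'mean', 'median', 'most_frequent' and 'constant'.

si = SimpleImputer(strategy='mean')
si.fit_transform(X)

Original feature matrix X and transformed feature matrix X′:

$$X_{4\times 2} = \begin{bmatrix} 7 & 1 \\ \text{nan} & 8 \\ 2 & \text{nan} \\ 9 & 6 \end{bmatrix} \quad\rightarrow\quad X'_{4\times 2} = \begin{bmatrix} 7 & 1 \\ 6 & 8 \\ 2 & 5 \\ 9 & 6 \end{bmatrix}$$

The imputed values are the column means over the observed entries:

$$\frac{7 + 2 + 9}{3} = 6, \qquad \frac{1 + 8 + 6}{3} = 5$$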
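The SimpleImputer example can be reproduced with the following sketch, which builds the same 4×2 matrix with numpy:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[7, 1],
              [np.nan, 8],
              [2, np.nan],
              [9, 6]])

si = SimpleImputer(strategy="mean")
X_new = si.fit_transform(X)

print(si.statistics_)  # per-column fill values learnt by fit()
print(X_new)
```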
15
KNNImputer

Uses the k-nearest neighbours approach to fill missing values in a dataset.
The missing value of an attribute in a specific example is filled with the mean
value of the same attribute of the n_neighbors closest neighbours.
The nearest neighbours are decided based on a Euclidean distance computed over the
features that are present in both samples (the default 'nan_euclidean' metric).

16
Example: KNNImputer

Consider the following feature matrix.

$$X_{4\times 3} = \begin{bmatrix} 1. & 2. & \text{nan} \\ 3. & 4. & 3. \\ \text{nan} & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix}$$

It has 4 samples and 2 missing values.

Let's fill in the missing values with KNNImputer.

17
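Let's fill the missing value in the first sample/row.

$$X_{4\times 3} = \begin{bmatrix} 1. & 2. & \text{nan} \\ 3. & 4. & 3. \\ \text{nan} & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix}$$

Distance with [1. 2. nan], computed over the features present in both rows:

[3. 4. 3.]     $\sqrt{(1-3)^2 + (2-4)^2} \approx 2.83$     (among the 2 nearest neighbours)
[nan 6. 5.]    $\sqrt{(2-6)^2} = 4$                        (among the 2 nearest neighbours)
[8. 8. 7.]     $\sqrt{(1-8)^2 + (2-8)^2} \approx 9.22$

The missing value is the mean of the feature over the 2 nearest neighbours:

$$\frac{\text{values of the feature from the 2 nearest neighbours}}{\text{\# of neighbours}} = \frac{3 + 5}{2} = 4$$

The first row becomes [1. 2. 4.].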

18
In this way, we can fill up the missing values with KNNImputer.

knni = KNNImputer(n_neighbors=2, weights="uniform")
knni.fit_transform(X)

Original feature matrix X and transformed feature matrix X′:

$$X_{4\times 3} = \begin{bmatrix} 1. & 2. & \text{nan} \\ 3. & 4. & 3. \\ \text{nan} & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix} \quad\rightarrow\quad X'_{4\times 3} = \begin{bmatrix} 1. & 2. & 4. \\ 3. & 4. & 3. \\ 5.5 & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix}$$
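A runnable sketch of the KNNImputer example, with the feature matrix written out as a numpy array:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

# Each missing entry is replaced by the mean of that feature
# over the 2 nearest rows that have the feature observed.
knni = KNNImputer(n_neighbors=2, weights="uniform")
X_new = knni.fit_transform(X)
print(X_new)
```

sklearn rescales the distance by the fraction of features that are present in both rows, so the raw distances differ from the hand computation, but the ranking of neighbours in this example is the same.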

19
Marking imputed values

It is useful to indicate the presence of missing values in the dataset.
MissingIndicator helps us get those indications.
It returns a binary matrix, in which True values correspond to missing entries in
the original dataset.
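A short sketch of MissingIndicator on the same matrix used in the KNNImputer example. With the default features="missing-only", only the columns that contained missing values during fit are reported.

```python
import numpy as np
from sklearn.impute import MissingIndicator

X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 3.0],
              [np.nan, 6.0, 5.0],
              [8.0, 8.0, 7.0]])

mi = MissingIndicator()  # default: features="missing-only"
mask = mi.fit_transform(X)

print(mi.features_)  # indices of the columns that had missing values
print(mask)          # True where the original entry was missing
```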

20
1.2 Numeric transformers

1. Feature scaling
2. Polynomial transformation
3. Discretization

21
Feature scaling

Numerical features on different scales lead to slower convergence of iterative
optimization procedures.

It is a good practice to scale numerical features so that all of them are on the
same scale.

Let's learn how to scale numerical features with the sklearn API.

Three feature scaling APIs are available in sklearn:

StandardScaler
MaxAbsScaler
MinMaxScaler

22
StandardScaler

Transforms the original feature vector x into a new feature vector x′ using the
following formula:

$$x' = \frac{x - \mu}{\sigma}$$

It learns the parameters μ and σ from the data.

ss = StandardScaler()
ss.fit_transform(x)

$$x_{5\times 1} = \begin{bmatrix} 4 \\ 3 \\ 2 \\ 5 \\ 6 \end{bmatrix} \quad\rightarrow\quad x'_{5\times 1} = \begin{bmatrix} 0 \\ -1/\sqrt{2} \\ -2/\sqrt{2} \\ 1/\sqrt{2} \\ 2/\sqrt{2} \end{bmatrix}$$

Before: μ = 4, σ = √2. After: μ′ = 0, σ′ = 1.

Note that the transformed feature vector x′ has
mean (μ) = 0 and standard deviation (σ) = 1.
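The StandardScaler example can be checked with the following sketch. sklearn expects a 2D array, so the feature vector is written as a 5×1 column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[4.0], [3.0], [2.0], [5.0], [6.0]])

ss = StandardScaler()
x_new = ss.fit_transform(x)

print(ss.mean_, ss.scale_)          # learnt mean and standard deviation
print(x_new.ravel())
print(x_new.mean(), x_new.std())    # mean and std after scaling
```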
23
MinMaxScaler
It transforms the original feature vector x into a new feature vector x′ so that
all values fall within the range [0, 1]:

$$x' = \frac{x - x.\mathrm{min}}{x.\mathrm{max} - x.\mathrm{min}}$$

where x.max and x.min are the largest and smallest values of that feature,
respectively, in the original feature vector x.

mms = MinMaxScaler()
mms.fit_transform(x)

$$x_{5\times 1} = \begin{bmatrix} 15 \\ 2 \\ 5 \\ -2 \\ -5 \end{bmatrix} \quad\rightarrow\quad x'_{5\times 1} = \begin{bmatrix} 1 \\ 0.35 \\ 0.5 \\ 0.15 \\ 0 \end{bmatrix}$$

Here x.max = 15 and x.min = -5.
The largest number is transformed to 1 and the smallest number is transformed to 0.
24
MaxAbsScaler

It transforms the original feature vector x into a new feature vector x′ so that
all values fall within the range [−1, 1]:

$$x' = \frac{x}{\text{MaxAbsoluteValue}}$$

where MaxAbsoluteValue = max(x.max, |x.min|)

mas = MaxAbsScaler()
mas.fit_transform(x)

$$x_{5\times 1} = \begin{bmatrix} 4 \\ 2 \\ 5 \\ -2 \\ -100 \end{bmatrix} \quad\rightarrow\quad x'_{5\times 1} = \begin{bmatrix} 0.04 \\ 0.02 \\ 0.05 \\ -0.02 \\ -1 \end{bmatrix}$$

MaxAbsoluteValue = max(5, |−100|) = 100
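A sketch that runs MinMaxScaler and MaxAbsScaler on the two example vectors, each written as a 5×1 column:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# MinMaxScaler: (x - min) / (max - min), here min = -5 and max = 15.
x1 = np.array([[15.0], [2.0], [5.0], [-2.0], [-5.0]])
x1_new = MinMaxScaler().fit_transform(x1)

# MaxAbsScaler: x / max(|x|), here max(|x|) = 100.
x2 = np.array([[4.0], [2.0], [5.0], [-2.0], [-100.0]])
x2_new = MaxAbsScaler().fit_transform(x2)

print(x1_new.ravel())
print(x2_new.ravel())
```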

25
FunctionTransformer

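Constructs transformed features by applying a user defined function.

ft = FunctionTransformer(numpy.log2)
ft.fit_transform(X)

$$X_{4\times 2} = \begin{bmatrix} 128 & 2 \\ 2 & 256 \\ 4 & 1 \\ 512 & 64 \end{bmatrix} \quad\rightarrow\quad X'_{4\times 2} = \begin{bmatrix} 7 & 1 \\ 1 & 8 \\ 2 & 0 \\ 9 & 6 \end{bmatrix}$$

Applies the log2 function to the features.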

26
Polynomial transformation

Generates a new feature matrix consisting of all polynomial combinations of the
features with degree less than or equal to the specified degree. By default a
constant (bias) column of ones is also included.

For X = [x_1, x_2]:

pf = PolynomialFeatures(degree=2)
pf.fit_transform(X)

$$X' = [1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2]$$

pf = PolynomialFeatures(degree=3)
pf.fit_transform(X)

$$X' = [1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3]$$
27
KBinsDiscretizer

Divides a continuous variable into bins.

One hot encoding or ordinal encoding is further applied to the bin labels.

KBinsDiscretizer(n_bins=5,
                 strategy='uniform',
                 encode='ordinal')

$$x_{9\times 1} = \begin{bmatrix} 0 \\ 0.125 \\ 0.25 \\ 0.375 \\ 0.5 \\ 0.675 \\ 0.75 \\ 0.875 \\ 1.0 \end{bmatrix} \quad\rightarrow\quad x'_{9\times 1} = \begin{bmatrix} 0. \\ 0. \\ 1. \\ 1. \\ 2. \\ 3. \\ 3. \\ 4. \\ 4. \end{bmatrix}$$
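A sketch of the discretizer on the same nine values. With strategy='uniform' and n_bins=5 on the range [0, 1], the bin edges are 0, 0.2, 0.4, 0.6, 0.8 and 1.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([0, 0.125, 0.25, 0.375, 0.5, 0.675, 0.75, 0.875, 1.0]).reshape(-1, 1)

kbd = KBinsDiscretizer(n_bins=5, strategy="uniform", encode="ordinal")
x_new = kbd.fit_transform(x)

print(kbd.bin_edges_[0])  # learnt bin edges for the single feature
print(x_new.ravel())
```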

28
1.2 Categorical transformers

1. Feature encoding
2. Label encoding

29
OneHotEncoder

Encodes a categorical feature or label as a one-hot numeric array.
Creates one binary column for each of K unique values.
Exactly one column has 1 in it and the rest have 0.

ohe = OneHotEncoder()
ohe.fit_transform(x)

$$x_{4\times 1} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 1 \end{bmatrix} \quad\rightarrow\quad X'_{4\times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$

# unique values: K = 3
# columns in transformed matrix = 3
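A runnable sketch of the OneHotEncoder example. The encoder returns a scipy sparse matrix by default, so it is converted to a dense array for printing.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([[1], [2], [3], [1]])

ohe = OneHotEncoder()
X_sparse = ohe.fit_transform(x)   # scipy sparse matrix by default
X_dense = X_sparse.toarray()

print(ohe.categories_)  # the K unique values found per feature
print(X_dense)
```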

30
LabelEncoder

Encodes target labels with values between 0 and K − 1, where K is the number of
distinct values.

le = LabelEncoder()
le.fit_transform(y)

$$y_{6\times 1} = \begin{bmatrix} 1 \\ 2 \\ 6 \\ 1 \\ 8 \\ 6 \end{bmatrix} \quad\rightarrow\quad y'_{6\times 1} = \begin{bmatrix} 0 \\ 1 \\ 2 \\ 0 \\ 3 \\ 2 \end{bmatrix}$$

Here K = 4: {1, 2, 6, 8}

1 is encoded as 0, 2 as 1, 6 as 2, and 8 as 3.

31
OrdinalEncoder

Encodes categorical features with values between 0 and K − 1, where K is the number
of distinct values.

oe = OrdinalEncoder()
oe.fit_transform(X)

$$X_{6\times 2} = \begin{bmatrix} 1 & \text{'male'} \\ 2 & \text{'female'} \\ 6 & \text{'female'} \\ 1 & \text{'male'} \\ 8 & \text{'male'} \\ 6 & \text{'female'} \end{bmatrix} \quad\rightarrow\quad X'_{6\times 2} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 2 & 0 \\ 0 & 1 \\ 3 & 1 \\ 2 & 0 \end{bmatrix}$$

OrdinalEncoder can operate on multi-dimensional data, while LabelEncoder can
transform only 1D data.
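The difference between the two encoders can be seen in the following sketch. The mixed numeric/string matrix is stored as a numpy object array, which is an assumption about how the data would be held in practice.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# LabelEncoder works on a 1D array of target labels.
y = [1, 2, 6, 1, 8, 6]
le = LabelEncoder()
y_new = le.fit_transform(y)

# OrdinalEncoder works on a 2D feature matrix, one mapping per column.
X = np.array([[1, "male"], [2, "female"], [6, "female"],
              [1, "male"], [8, "male"], [6, "female"]], dtype=object)
oe = OrdinalEncoder()
X_new = oe.fit_transform(X)

print(le.classes_, y_new)
print(oe.categories_)
print(X_new)
```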

32
LabelBinarizer

Several regression and binary classification algorithms can be extended to a
multi-class setup in one-vs-all fashion.

This involves training a single regressor or classifier per class.

For this, we need to convert multi-class labels to binary labels, and
LabelBinarizer performs this task.

lb = LabelBinarizer()
lb.fit_transform(y)

$$y_{6\times 1} = \begin{bmatrix} 1 \\ 2 \\ 6 \\ 1 \\ 8 \\ 6 \end{bmatrix} \quad\rightarrow\quad Y'_{6\times 4} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

If the estimator supports multiclass data, LabelBinarizer is not needed.

33
MultiLabelBinarizer
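Encodes a collection of label sets as a binary indicator matrix with one column per
class: entry (i, j) is 1 if sample i carries label j, and 0 otherwise.

In this example K = 4, since there are only 4 genres of movies.

movie_genres = [{'action', 'comedy'},
                {'comedy'},
                {'action', 'thriller'},
                {'science-fiction', 'action', 'thriller'}]

mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)

$$X'_{4\times 4} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \end{bmatrix}$$

The columns correspond to action, comedy, science-fiction and thriller.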
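A runnable sketch of the movie genre example. The classes_ attribute shows which genre each column of the output stands for.

```python
from sklearn.preprocessing import MultiLabelBinarizer

movie_genres = [{"action", "comedy"},
                {"comedy"},
                {"action", "thriller"},
                {"science-fiction", "action", "thriller"}]

mlb = MultiLabelBinarizer()
G = mlb.fit_transform(movie_genres)

print(mlb.classes_)  # column order of the indicator matrix
print(G)
```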

34
add_dummy_feature

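Augments the dataset with a column vector in which each value is 1.

add_dummy_feature(X)

$$X_{4\times 2} = \begin{bmatrix} 7 & 1 \\ 1 & 8 \\ 2 & 0 \\ 9 & 6 \end{bmatrix} \quad\rightarrow\quad X'_{4\times 3} = \begin{bmatrix} 1 & 7 & 1 \\ 1 & 1 & 8 \\ 1 & 2 & 0 \\ 1 & 9 & 6 \end{bmatrix}$$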

35
Part 2: Feature selection
Filter based
Wrapper based

36
Sometimes in a real-world dataset, not all features contribute enough towards
fitting a model.
Features that do not contribute significantly can be removed. This reduces the size
of the dataset and hence the computation cost of fitting a model.
sklearn.feature_selection provides many APIs to
accomplish this task.

Filter Wrapper
VarianceThreshold RFE
SelectKBest RFECV
SelectPercentile SelectFromModel
GenericUnivariateSelect SequentialFeatureSelector

Note: Tree based and kernel based feature selection algorithms

will be covered in later weeks.
37
Filter based feature selection
methods

38
Removing features with low variance

VarianceThreshold

Removes all features with variance below a certain threshold, as specified by the
user, from the input feature matrix.

By default, it removes any feature that has the same value in all samples, i.e.
zero variance.
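A small sketch of VarianceThreshold with its default threshold of 0. The toy matrix is made up for illustration; its first and last columns are constant.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

vt = VarianceThreshold()  # default threshold=0.0 drops constant features
X_new = vt.fit_transform(X)

print(vt.get_support())  # boolean mask of the features that are kept
print(X_new)
```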

39
Univariate feature selection
Univariate feature selection selects features based on univariate statistical tests.

There are three APIs for univariate feature selection:

SelectKBest: removes all but the k highest scoring features.
SelectPercentile: removes all but a user-specified highest scoring percentage of features.
GenericUnivariateSelect: performs univariate feature selection with a configurable
strategy, which can be found via hyper-parameter search.
40
sklearn provides one more class of univariate feature selection methods that work on
common univariate statistical tests for each feature:

SelectFpr selects features based on a false positive rate test.
SelectFdr selects features based on an estimated false discovery rate.
SelectFwe selects features based on family-wise error rate.

41
Univariate scoring function
Each API needs a scoring function to score each feature.

Three classes of scoring functions are proposed:

Mutual information (MI)
Chi-square
F-statistics

MI and F-statistics can be used in both classification and regression problems:

Regression: mutual_info_regression, f_regression
Classification: mutual_info_classif, f_classif

Chi-square can be used only in classification problems:

chi2
42
Mutual information (MI)
Measures dependency between two variables.
It returns a non-negative value.
MI = 0 for independent variables.
Higher MI indicates higher dependency.

Chi-square
Measures dependence between two variables.
Computes chi-square stats between a non-negative feature (boolean or frequencies)
and the class label.
Higher chi-square values indicate that the features and labels are likely to be
correlated.

MI and chi-squared feature selection is recommended for sparse data.
43
SelectKBest

skb = SelectKBest(chi2, k=20)
X_new = skb.fit_transform(X, y)

Selects the 20 best features based on the chi-square scoring function.

SelectPercentile

sp = SelectPercentile(chi2, percentile=20)
X_new = sp.fit_transform(X, y)

Selects the top 20 percent of features based on the chi-square scoring function.

GenericUnivariateSelect (mode: 'percentile' (default), 'k_best', 'fpr', 'fdr', 'fwe')

transformer = GenericUnivariateSelect(chi2, mode='k_best', param=20)
X_new = transformer.fit_transform(X, y)

Selects the 20 best features based on the chi-square scoring function.

44
GenericUnivariateSelect

transformer = GenericUnivariateSelect(chi2, mode='k_best', param=20)

X_new = transformer.fit_transform(X, y)

Selects set of features based on a feature selection mode

and a scoring function.
The mode could be 'percentile' (default), 'k_best', 'fpr',
'fdr', 'fwe'.
The param argument takes value corresponding to the mode.
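The three selectors can be run end to end as sketched here. The digits dataset is an assumption chosen because its pixel features are non-negative, as chi2 requires.

```python
from sklearn.datasets import load_digits
from sklearn.feature_selection import (GenericUnivariateSelect, SelectKBest,
                                       SelectPercentile, chi2)

X, y = load_digits(return_X_y=True)  # 64 non-negative pixel features

X_k = SelectKBest(chi2, k=20).fit_transform(X, y)
X_p = SelectPercentile(chi2, percentile=20).fit_transform(X, y)
X_g = GenericUnivariateSelect(chi2, mode="k_best", param=20).fit_transform(X, y)

print(X.shape, X_k.shape, X_p.shape, X_g.shape)
```

GenericUnivariateSelect with mode='k_best' and param=20 is configured to do the same job as SelectKBest with k=20; the mode and param can then be tuned like any other hyperparameter.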

45
Do not use regression feature scoring
function with a classiﬁcation problem. It will
lead to useless results.

46
Wrapper based feature selection

Unlike filter based methods, wrapper based methods use an estimator class rather
than a scoring function.

47
Recursive Feature Elimination (RFE)

Uses an estimator to recursively remove features.

Initially ﬁts an estimator on all features.
Obtains feature importance from the estimator and
removes the least important feature.
Repeats the process by removing features one by
one, until desired number of features are obtained.

Use RFECV if we do not want to specify the desired number of features in RFE.
It performs RFE in a cross-validation loop to find the optimal number of features.
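A sketch of RFE and RFECV follows. The synthetic regression data and the choice of LinearRegression as the underlying estimator are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, RFECV
from sklearn.linear_model import LinearRegression

# 10 features, of which only 3 are informative.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       random_state=0)

# RFE: we specify how many features to keep.
rfe = RFE(LinearRegression(), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)   # mask of the selected features
print(rfe.ranking_)   # rank 1 marks a selected feature

# RFECV: the number of features is chosen by cross-validation.
rfecv = RFECV(LinearRegression(), cv=5)
rfecv.fit(X, y)
print(rfecv.n_features_)
```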

48
SelectFromModel

Selects the desired number of important features (as specified with the max_features
parameter) above a certain threshold of feature importance, as obtained from the
trained estimator.

The feature importance is obtained via coef_, feature_importances_ or an
importance_getter callable from the trained estimator.

The feature importance threshold can be specified either numerically or through a
string argument based on built-in heuristics such as 'mean', 'median' and float
multiples of these like '0.1*mean'.

Let's look at a concrete example of SelectFromModel.

49
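from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

# The L1 penalty drives some coefficients to exactly zero.
clf = LinearSVC(C=0.01, penalty="l1", dual=False)
clf = clf.fit(X, y)
clf.coef_

# prefit=True reuses the already fitted classifier.
model = SelectFromModel(clf, prefit=True)

X_new = model.transform(X)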

Here we use a linear support vector classiﬁer to get

coefﬁcients of features for SelectFromModel transformer.
It ends up selecting features with non-zero weights or
coefﬁcients.

50
Sequential feature selection
Performs feature selection by selecting or deselecting
features one by one in a greedy manner.

Uses one of the two approaches:

Forward selection
Starting with zero features, it finds the one feature that obtains the best
cross-validation score for an estimator when trained on that feature.
Repeats the process by adding a new feature to the set of selected features.

Backward selection
Starting with all features, it removes the least important features one by one,
following the idea of forward selection.

Stops on reaching the desired number of features.

51
The direction parameter controls whether forward or backward
SFS is used.
In general, forward and backward selection do not yield
equivalent results.
Select the direction that is efﬁcient for the required number of
selected features:
When we want to select 7 out of 10 features,
Forward selection would need to perform 7 iterations.
Backward selection would only need to perform 3.
Backward selection seems to be a reasonable choice here.

52
SFS does not require the underlying model to expose coef_ or feature_importances_
attributes, unlike RFE and SelectFromModel.
SFS may be slower than RFE and SelectFromModel, as it needs to evaluate more models
than the other two approaches.

For example, in backward selection, the iteration going from m features to m − 1
features using k-fold cross-validation requires fitting m × k models, while
RFE would require only a single fit, and
SelectFromModel performs a single fit and requires no iterations.
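A sketch of forward sequential feature selection. The iris data and the KNeighborsClassifier are assumptions; a k-NN model is used deliberately because it exposes neither coef_ nor feature_importances_.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",  # use "backward" to start from all features
    cv=5,
)
X_new = sfs.fit_transform(X, y)

print(sfs.get_support())  # mask of the selected features
print(X_new.shape)
```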

53
Applying transformations to
diverse features

54
Generally training data contains diverse features such
as numeric and categorical.

Different feature types are processed with different

transformers.

Need a way to combine different feature transformers

seamlessly.

55
Composite Transformer

sklearn.compose has useful classes and methods to apply transformations on subsets
of features and combine them:

ColumnTransformer
TransformedTargetRegressor

56
ColumnTransformer
It applies a set of transformers to columns of an array or
pandas.DataFrame, concatenates the transformed outputs from
different transformers into a single matrix.
It is useful for transforming heterogenous data by applying
different transformers to separate subsets of features.
It combines different feature selection mechanisms and
transformation into a single transformer object.

57
ColumnTransformer()
Each tuple has the format
(estimatorName, estimator(...), columnIndices)

column_trans = ColumnTransformer(
    [('ageScaler', MaxAbsScaler(), [0]),
     ('genderEncoder', OneHotEncoder(dtype='int'), [1])],
    remainder='drop', verbose_feature_names_out=False)

58
Illustration of Column Transformer

Consider the following feature matrix, which represents the weight and gender of a
class of students.

$$X_{6\times 2} = \begin{bmatrix} 20.0 & \text{'male'} \\ 11.2 & \text{'female'} \\ 15.6 & \text{'female'} \\ 13.0 & \text{'male'} \\ 18.6 & \text{'male'} \\ 16.4 & \text{'female'} \end{bmatrix}$$

Here, the first column is numeric, whereas the second column is categorical;
therefore different transformers have to be applied to them.
59
In this example, let's apply MaxAbsScaler on the numeric column and OneHotEncoder
on the categorical column.

column_trans = ColumnTransformer(
    [('ageScaler', MaxAbsScaler(), [0]),
     ('genderEncoder', OneHotEncoder(dtype='int'), [1])],
    remainder='drop', verbose_feature_names_out=False)
column_trans.fit_transform(X)

$$X_{6\times 2} = \begin{bmatrix} 20.0 & \text{'male'} \\ 11.2 & \text{'female'} \\ 15.6 & \text{'female'} \\ 13.0 & \text{'male'} \\ 18.6 & \text{'male'} \\ 16.4 & \text{'female'} \end{bmatrix} \quad\rightarrow\quad X'_{6\times 3} = \begin{bmatrix} 1. & 0. & 1. \\ 0.56 & 1. & 0. \\ 0.78 & 1. & 0. \\ 0.65 & 0. & 1. \\ 0.93 & 0. & 1. \\ 0.82 & 1. & 0. \end{bmatrix}$$
60
Transforming Target for Regression
TransformedTargetRegressor
Transforms the target variable y before ﬁtting a
regression model.
The predicted values are mapped back to the original
space via an inverse transform.
TransformedTargetRegressor takes regressor and
transformer to be applied to the target variable as
arguments.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
tt = TransformedTargetRegressor(regressor=LinearRegression(),
func=np.log, inverse_func=np.exp)
X = np.arange(4).reshape(-1, 1)
y = np.exp(2 * X).ravel()
tt.fit(X, y)
61
Part 3: Dimensionality reduction

62
Another way to reduce the number of features is through unsupervised dimensionality
reduction techniques.

sklearn.decomposition module has a number of APIs for this task.

We will focus on how to perform feature reduction with principal component analysis
(PCA) in sklearn.

63
PCA 101
PCA is a linear dimensionality reduction technique.
It uses singular value decomposition (SVD) to project the feature matrix or data to
a lower dimensional space.
The first principal component (PC) is in the direction of maximum variance in the data.
It captures the bulk of the variance in the data.
The subsequent PCs are orthogonal to the preceding PCs and capture progressively
less variance in the data.
We can select the first k PCs such that we are able to capture the desired variance
in the data.

sklearn.decomposition.PCA API is used for performing PCA based dimensionality reduction.
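A sketch of PCA on made-up 2D data in which the second feature is almost a multiple of the first, so a single PC captures nearly all the variance. The data generation and seed are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # strongly correlated with x1
X = np.column_stack([x1, x2])

pca = PCA(n_components=1)
Z = pca.fit_transform(X)  # projection of the data on the first PC

print(pca.components_)                # direction of the first PC
print(pca.explained_variance_ratio_)  # fraction of variance captured
print(Z.shape)
```

Passing a float between 0 and 1 as n_components, for example PCA(n_components=0.95), asks sklearn to keep as many PCs as needed to reach that fraction of explained variance.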
64
PCA illustration

[Figure, left panel: Blue dots are data points. x1 and x2 are features. C1, C2 and C3 are candidate PCs.]
[Figure, right panel: z represents the projection of data points on a candidate vector.]
65
Out of the 3 candidate vectors to project data on, vector C1 captures most of the
variance, hence it is the first PC, and C2, which is orthogonal to it, is the
second PC.

66
[Figure: the data after applying PCA and keeping only the first PC to reduce its dimension.]

67
Part 4: Chaining transformers

68
The preprocessing transformations are applied one after another on the input
feature matrix.

si = SimpleImputer()
X_imputed = si.fit_transform(X)
ss = StandardScaler()
X_scaled = ss.fit_transform(X_imputed)

It is important to apply exactly the same transformations to the training,
evaluation and test sets, in the same order.

Failing to do so would lead to incorrect predictions from the model due to
distribution shift, and hence incorrect performance evaluation.

69
The sklearn.pipeline module provides utilities to build a composite estimator, as a
chain of transformers and estimators.

There are two classes: (i) Pipeline and (ii) FeatureUnion.

Class           Usage
Pipeline        Constructs a chain of multiple transformers to execute a fixed sequence of steps in data preprocessing and modelling.
FeatureUnion    Combines output from several transformer objects by creating a new transformer from them.

70
sklearn.pipeline.Pipeline

Sequentially apply a list of transformers and estimators.

Intermediate steps of the pipeline must be
'transformers', that is, they must implement fit and
transform methods.

The final estimator only needs to implement fit.

The purpose of the pipeline is to assemble several

steps that can be cross-validated together while
setting different parameters.

71
Creating Pipelines
Two ways to create a pipeline object.

Pipeline()
It takes a list of ('estimatorName', estimator(...)) tuples.
The pipeline object exposes the interface of the last step.

estimators = [
    ('simpleImputer', SimpleImputer()),
    ('standardScaler', StandardScaler()),
]
pipe = Pipeline(steps=estimators)

make_pipeline
It takes a number of estimator objects only.

pipe = make_pipeline(SimpleImputer(),
                     StandardScaler())

72
Without pipeline:
si = SimpleImputer()
X_imputed = si.fit_transform(X)
ss =StandardScaler()
X_scaled = ss.fit_transform(X_imputed)

With pipeline:
estimators = [
('simpleImputer', SimpleImputer()),
('standardScaler', StandardScaler()),
]
pipe = Pipeline(steps=estimators)
pipe.fit_transform(X)

73
Accessing individual steps in Pipeline

estimators = [
('simpleImputer', SimpleImputer()),
('pca', PCA()),
('regressor', LinearRegression())
]
pipe = Pipeline(steps=estimators)

Total # steps: 3
1. SimpleImputer
2. PCA
3. LinearRegression

The second estimator can be accessed in the following 4 ways:
pipe.named_steps.pca
pipe.steps[1][1]    # pipe.steps[1] is the ('pca', PCA()) tuple
pipe[1]
pipe['pca']

74
Accessing parameters of each step in Pipeline

Parameters of the estimators in the pipeline can be

accessed using the <estimator>__<parameterName> syntax,
note there are two underscores between <estimator> and
<parameterName>

estimators = [
('simpleImputer', SimpleImputer()),
('pca', PCA()),
('regressor', LinearRegression())
]
pipe = Pipeline(steps=estimators)

pipe.set_params(pca__n_components = 2)

In the above example, n_components of the PCA() step is set after the
pipeline is created.

75
Performing grid search with pipeline

By using the naming convention of nested parameters, grid
search can be implemented.

# Assumes pipe was built with steps named 'imputer' and 'clf'.
param_grid = dict(imputer=['passthrough',
                           SimpleImputer(),
                           KNNImputer()],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)

C is the inverse of the regularization strength: the lower its value,
the stronger the regularization.
In the example above, clf__C provides a set of values for
grid search.
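
A minimal end-to-end sketch of the nested-parameter convention follows. The synthetic classification data, the step names ('imputer', 'scaler', 'clf') and the grid values are assumptions chosen for illustration, not the course's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# <step name>__<parameter name> addresses a parameter of one step.
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'clf__C': [0.1, 10, 100],
}

grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

print(grid_search.best_params_)
```

Because the whole pipeline is refit inside each cross-validation split, the imputer and scaler statistics are learnt from the training folds only.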
76
Caching transformers
Transforming data can be a computationally expensive step.
For grid search, transformers need not be applied for
every parameter configuration. They can be applied only
once, and the transformed data can be reused.
This can be achieved by setting the memory parameter of a
pipeline object.
memory can take either the location of a directory in string
format or a joblib.Memory object.
estimators = [
('simpleImputer', SimpleImputer()),
('pca', PCA(2)),
('regressor', LinearRegression())
]
pipe = Pipeline(steps=estimators, memory = '/path/to/cache/dir')

77
Advantages of pipeline
Combines multiple steps of end to end ML into single object
such as missing value imputation, feature scaling and
encoding, model training and cross validation.
Enables joint grid search over parameters of all the
estimators in the pipeline.
Makes conﬁguring and tuning end to end ML quick and easy.
Offers convenience, as a developer has to call fit() and
predict() methods only on a Pipeline object (assuming last
step in the pipeline is an estimator).
Reduces code duplication: With a Pipeline object, one
doesn't have to repeat code for preprocessing and
transforming the test data.

78
sklearn.pipeline.FeatureUnion

Concatenates results of multiple transformer objects.

Applies a list of transformer objects in parallel, and their
outputs are concatenated side-by-side into a larger matrix.
FeatureUnion and Pipeline can be used to create complex
transformers.

79
Combining Transformers and Pipelines
FeatureUnion() accepts a list of tuples.
Each tuple is of the format:
('estimatorName',estimator(...))

num_pipeline = Pipeline([
    ('selector', ColumnTransformer([('select_first_4',
                                     'passthrough',
                                     slice(0, 4))])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
# OneHotEncoder is used here because LabelBinarizer is meant for labels (y)
# and does not accept the (X, y) call that ColumnTransformer makes.
cat_pipeline = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [4]),
                                  ])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

80
Visualizing Composite Transformers

from sklearn import set_config

set_config(display='diagram')
# displays HTML representation in a jupyter context
full_pipeline

It creates the following visualization

81
That's it for data preprocessing.

The only way to master it is to practice with
examples.

Dr. Ashish Tendulkar

IIT Madras

1
How to build baseline regression model?
DummyRegressor helps in creating a baseline for regression.
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")
dummy_regr.fit(X_train, y_train)
dummy_regr.predict(X_test)
dummy_regr.score(X_test, y_test)

It makes a prediction as specified by the strategy.

The strategy is based on some statistical property of the
training set or a user-specified value.
Strategy: mean, median, quantile, constant
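
To make the strategies concrete, the sketch below fits two baselines on a tiny hand-made target vector (an assumption for illustration). The features are ignored by DummyRegressor, so a column of zeros is enough.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy data (assumption): DummyRegressor never looks at X.
X = np.zeros((4, 1))
y = np.array([1.0, 2.0, 3.0, 6.0])

mean_model = DummyRegressor(strategy="mean").fit(X, y)
median_model = DummyRegressor(strategy="median").fit(X, y)

mean_pred = mean_model.predict(X)       # every prediction is mean(y) = 3.0
median_pred = median_model.predict(X)   # every prediction is median(y) = 2.5
baseline_r2 = mean_model.score(X, y)    # R^2 of the mean baseline on its own data

print(mean_pred, median_pred, baseline_r2)
```

Any real model should beat these numbers on the same evaluation set; if it does not, the features are carrying little usable signal.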

2
How is Linear Regression model trained?
Step 1: Instantiate an object of a suitable linear regression estimator from
one of the following two options.

Normal equation:
from sklearn.linear_model import LinearRegression
linear_regressor = LinearRegression()

Iterative optimization:
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor()

Step 2: Call the fit method on the linear regression object with the training
feature matrix and label vector as arguments.
# Model training with feature matrix X_train and
# label vector or matrix y_train
linear_regressor.fit(X_train, y_train)

LinearRegression works for both single and multi-output regression;
SGDRegressor expects a single-output label vector.

3
SGDRegressor Estimator

4
SGDRegressor Estimator
Implements stochastic gradient descent.
Use for large training sets (> 10k samples).
Provides greater control over the optimization process through
hyperparameter settings:

loss: 'squared_error', 'huber'
penalty: 'l1', 'l2', 'elasticnet'
learning_rate: 'constant', 'optimal', 'invscaling', 'adaptive'
early_stopping: True, False

5
It's a good idea to use a random seed of your choice while
instantiating SGDRegressor object. It helps us get
reproducible results.

Set random_state to seed of your choice.

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(random_state=42)

Note: In the rest of the presentation, we won't set the random

seed for sake of brevity. However while coding, always set the
random seed in the constructor.

6
How to perform feature scaling for SGDRegressor?
SGD is sensitive to feature scaling, so it is highly recommended
to scale input feature matrix.
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

sgd = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('sgd_regressor', SGDRegressor())])

sgd.fit(X_train, y_train)

Note: Feature scaling is not needed for word frequencies and
indicator features, as they have an intrinsic scale.
Features extracted using PCA should be scaled by some
constant c such that the average L2 norm of the training
data equals one.
7
How to shuffle training data after each epoch
in SGDRegressor?

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(shuffle=True)

8
How to set the learning rate in SGDRegressor?
learning_rate = 'constant'
learning_rate = 'invscaling'
learning_rate = 'adaptive'

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(random_state=42)

What is the default setting?

learning_rate = 'invscaling' eta0 = 1e-2 power_t = 0.25

Learning rate reduces after every iteration:

eta = eta0 / pow(t, power_t)
Note: You can make changes to these parameters to speed
up or slow down the training process.
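
The decay formula above can be evaluated by hand. The sketch below uses the default values quoted on this slide; the helper name invscaling_eta is hypothetical and is not part of sklearn.

```python
# Defaults quoted above for learning_rate='invscaling'.
eta0 = 1e-2
power_t = 0.25


def invscaling_eta(t: int) -> float:
    """Learning rate at update step t (t >= 1): eta0 / t**power_t."""
    return eta0 / t ** power_t


eta_1 = invscaling_eta(1)          # 0.01, the initial rate
eta_16 = invscaling_eta(16)        # 16**0.25 = 2, so the rate is halved
eta_10000 = invscaling_eta(10000)  # 10000**0.25 = 10, so the rate is 0.001

print(eta_1, eta_16, eta_10000)
```

A larger power_t makes the rate fall faster; power_t = 0 would reduce the schedule to a constant rate eta0.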
9
How to set a constant learning rate?

learning_rate = 'constant'

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(learning_rate='constant',
                                eta0=1e-2)

A constant learning rate eta0 = 1e-2 is used throughout the training.

10
How to set adaptive learning rate?

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(learning_rate='adaptive',
                                eta0=1e-2)

The learning rate is kept at its initial value as long as the training loss
decreases.
When the stopping criterion is reached, the learning rate is divided
by 5, and the training loop continues.
The algorithm stops when the learning rate goes below 1e-6.

11
How to set #epochs in SGDRegressor?

Set max_iter to the desired #epochs. The default value is 1000.

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(max_iter=100)

Remember, one epoch is one full pass over the training data.

Practical tip
SGD converges after observing approximately 10^6 training samples.
Thus, a reasonable first guess for the number of iterations for a training
set of n samples is
max_iter = np.ceil(10**6 / n)
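
As a quick worked example of this rule of thumb, the helper below (a hypothetical name, not an sklearn function) turns a training-set size into a first guess for max_iter. The sample sizes are assumptions for illustration.

```python
import math


def first_guess_max_iter(n_samples: int) -> int:
    """Epochs needed to show SGD roughly 10**6 samples in total."""
    return math.ceil(10**6 / n_samples)


epochs_20k = first_guess_max_iter(20_000)  # 10**6 / 20000 = 50
epochs_3k = first_guess_max_iter(3_000)    # 333.33... rounded up to 334

print(epochs_20k, epochs_3k)
```

The result is an integer, so it can be passed directly as SGDRegressor(max_iter=...); treat it as a starting point to refine with a stopping criterion.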

12
How to set stopping criteria in SGDRegressor?

Option #1: tol, n_iter_no_change, max_iter.

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(loss='squared_error',
                                max_iter=500,
                                tol=1e-3,
                                n_iter_no_change=5)

The SGDRegressor stops

when the training loss does not improve (loss >
best_loss - tol) for n_iter_no_change consecutive
epochs,
or else after a maximum number of iterations max_iter.

13
How to set stopping criteria in SGDRegressor?
Option #2: early_stopping, validation_fraction
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss='squared_error',
                                early_stopping=True,
                                max_iter=500,
                                tol=1e-3,
                                validation_fraction=0.2,
                                n_iter_no_change=5)

Set aside a validation_fraction share of the records from the training
set as a validation set. The score method is used to obtain the validation score.

The SGDRegressor stops when

the validation score does not improve by at least tol for
n_iter_no_change consecutive epochs,
or else after a maximum number of iterations max_iter.

14
How to use different loss functions in SGDRegressor?

Set the loss parameter to one of the supported values.

'squared_error' {studied in this course}
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss='squared_error')

It also supports other losses, as documented in the sklearn API.

15
How to use averaged SGD?
Averaged SGD sets the weight vector to the average of the
weights from previous updates.

Option #1: Averaging across all updates, average=True

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(average=True)

16
Option #2: Set average to int value.

Averaging begins once the total number of samples seen

reaches average

Setting average=10 starts averaging after seeing 10 samples

1 from sklearn.linear_model import SGDRegressor

2 linear_regressor = SGDRegressor(average=10)

Averaged SGD works best with a larger number of features and

a higher eta0

17
How do we initialize SGD with the weight vector
of the previous run?

Set warm_start = True

while instantiating the SGDRegressor object.

from sklearn.linear_model import SGDRegressor

linear_regressor = SGDRegressor(warm_start=True)

By default, warm_start = False.

18
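How to monitor SGD loss iteration after iteration?
Make use of warm_start = True
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# max_iter=1 with warm_start=True: each fit() call runs one more epoch.
# tol=None disables the tolerance-based stopping check.
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_predict)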

19
Model inspection

20
How to access the weights of a trained Linear
Regression model?
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m = \mathbf{w}^T \mathbf{x}

The weights w_1, w_2, …, w_m are stored in the coef_ attribute.

linear_regressor.coef_

The intercept w_0 is stored in the intercept_ attribute.

linear_regressor.intercept_

Note: These code snippets work for both LinearRegression and
SGDRegressor, and for that matter for all regression estimators
that we will study in this module. Why?

All of them are estimators.
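
A small sketch of reading the learnt weights back: the data below is generated from known weights (an assumption for illustration), so the recovered coef_ and intercept_ can be compared against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Noiseless synthetic data (assumption): y = 1 + 2*x1 + 3*x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

lin = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1e-3).fit(X, y)

print(lin.coef_, lin.intercept_)      # close to [2, 3] and 1
print(ridge.coef_, ridge.intercept_)  # same attribute names on another estimator
```

The trailing underscore marks attributes that only exist after fit has been called.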

21
Model inference

22
How to make predictions on new data in Linear
Regression model?
Step 1: Arrange data for prediction in a feature matrix of
shape (#samples, #features) or in sparse matrix format.

Step 2: Call predict method on linear regression object

with feature matrix as an argument.

1 # Predict labels for feature matrix X_test

2 linear_regressor.predict(X_test)

Same code works for all regression estimators.

23
Model evaluation

24
General steps in model evaluation

STEP 1: Split data into train and test

1 from sklearn.model_selection import train_test_split
2 X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

STEP 2: Fit linear regression estimator on training set.

STEP 3: Calculate training error (a.k.a. empirical error)

STEP 4: Calculate test error (a.k.a. generalization error)

Compare training and test errors

25
How to evaluate a trained Linear Regression model?
Using the score method on the linear regression object:
# Evaluation on the eval set with
# 1. feature matrix
# 2. label vector or matrix (single/multi-output)
linear_regressor.score(X_test, y_test)

The score returns R^2, the coefficient of determination:

R^2 = 1 - \frac{u}{v}

u is the residual sum of squares, the sum of squared errors between the
actual and predicted labels:
u = (Xw - y)^T (Xw - y)

v is the total sum of squares, the sum of squared errors between the
actual labels and their mean \hat{y}_{mean}:
v = (y - \hat{y}_{mean})^T (y - \hat{y}_{mean})

26
The score returns R^2, the coefficient of determination:

R^2 = 1 - \frac{u}{v}

When?

The best possible score is 1.0, when u, the sum of squared errors, is 0.

A constant model that always predicts the expected value of
y would get a score of 0.0, since then u = v.

The score can be negative (because the model can be
arbitrarily worse), which happens when u > v.
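
The three cases can be checked numerically. The sketch below computes u and v by hand on a tiny made-up label vector (an assumption for illustration) and compares the result with sklearn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

u = np.sum((y_true - y_pred) ** 2)          # residual sum of squares = 0.5
v = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 5.0
r2_manual = 1 - u / v                       # 1 - 0.5 / 5.0 = 0.9

# Constant model predicting the mean of y: u equals v, so R^2 is 0.
r2_constant = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# A model worse than the constant one: reversed labels give u = 20, R^2 = -3.
r2_bad = r2_score(y_true, y_true[::-1])

print(r2_manual, r2_constant, r2_bad)
```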

27
Evaluation metrics
sklearn provides a bunch of regression metrics to evaluate
performance of the trained estimator on the evaluation set.

mean_absolute_error
from sklearn.metrics import mean_absolute_error
eval_score = mean_absolute_error(y_test, y_predicted)

mean_squared_error
from sklearn.metrics import mean_squared_error
eval_score = mean_squared_error(y_test, y_predicted)

r2_score (same as the output of score)

from sklearn.metrics import r2_score
eval_score = r2_score(y_test, y_predicted)

These metrics can also be used in multi-output regression setup.

28
mean_squared_log_error
from sklearn.metrics import mean_squared_log_error
eval_score = mean_squared_log_error(y_test, y_predicted)

Useful for targets with exponential growth, like population,
sales growth etc.
Penalizes under-estimation more heavily than over-estimation.

mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_percentage_error
eval_score = mean_absolute_percentage_error(y_test, y_predicted)

Sensitive to relative error.

median_absolute_error
from sklearn.metrics import median_absolute_error
eval_score = median_absolute_error(y_test, y_predicted)

Robust to outliers.

29
How to evaluate regression model on
worst case error?

Use the max_error metric.

Worst case error on the train set can be calculated as follows:

from sklearn.metrics import max_error
# y_train_predicted: predictions on X_train
train_error = max_error(y_train, y_train_predicted)

Worst case error on the test set can be calculated as follows:

from sklearn.metrics import max_error
# y_test_predicted: predictions on X_test
test_error = max_error(y_test, y_test_predicted)

This metric can, however, be used only for single output
regression. It does not support multi-output regression.

30
Scores and Errors
Score is a metric for which higher value is better.
Error is a metric for which lower value is better.

Convert an error metric to a score metric by adding the neg_ prefix.

Function                          Scoring
metrics.mean_absolute_error       neg_mean_absolute_error
metrics.mean_squared_error        neg_mean_squared_error
metrics.mean_squared_error        neg_root_mean_squared_error
metrics.mean_squared_log_error    neg_mean_squared_log_error
metrics.median_absolute_error     neg_median_absolute_error

31
If we get comparable performance on train and test with
this split, is this performance guaranteed on other splits too?

Is the test set sufficiently large?

If it is small, the test error obtained may be
unstable and would not reflect the true test error on a
large test set.
What is the chance that the easiest examples were
set aside as the test set by chance?
If this happens, it leads to an optimistic estimate
of the true test error.

We use cross validation for robust performance evaluation.

32
Cross-validation performs robust evaluation of model performance
by repeated splitting, providing many training and test errors.
This enables us to estimate the variability in the generalization
performance of the model.
sklearn implements the following cross validation iterators:
KFold
RepeatedKFold
LeaveOneOut
ShuffleSplit

33
How to obtain cross validated performance
measure using KFold?
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
score = cross_val_score(lin_reg, X, y, cv=5)

Uses the KFold cross validation iterator, which divides the training data
into 5 folds.
In each run, it uses 4 folds for training and 1 for evaluation.

An alternate way of writing the same thing:

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
# random_state is omitted: KFold rejects it unless shuffle=True
kfold_cv = KFold(n_splits=5)
score = cross_val_score(lin_reg, X, y, cv=kfold_cv)
34
How to obtain cross validated performance
measure using LeaveOneOut?
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
loocv = LeaveOneOut()
score = cross_val_score(lin_reg, X, y, cv=loocv)

which is the same as

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
n = X.shape[0]
kfold_cv = KFold(n_splits=n)
score = cross_val_score(lin_reg, X, y, cv=kfold_cv)
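
The equivalence can be verified on a small synthetic dataset (an assumption for illustration). An explicit error-based scoring is passed here because each leave-one-out test fold holds a single sample, for which the default R^2 score is not well defined.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic dataset (assumption).
X, y = make_regression(n_samples=20, n_features=2, noise=5.0, random_state=0)

scoring = "neg_mean_absolute_error"
loo_scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(), scoring=scoring)
kfold_scores = cross_val_score(LinearRegression(), X, y,
                               cv=KFold(n_splits=X.shape[0]), scoring=scoring)

# One score per held-out sample, identical for both iterators.
print(len(loo_scores), np.allclose(loo_scores, kfold_scores))
print(-loo_scores.mean())   # mean absolute error over all held-out samples
```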

35
How to obtain cross validated performance
measure using ShufﬂeSplit?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
score = cross_val_score(lin_reg, X, y, cv=shuffle_split)

It is also called random permutation based cross validation strategy.

Generates user deﬁned number of train/test splits.
It is robust to class distribution.

In each iteration, it shufﬂes order of data samples and then splits it

into train and test.

36
How to specify a performance measure in
cross_val_score?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
score = cross_val_score(lin_reg, X, y, cv=shuffle_split,
                        scoring='neg_mean_absolute_error')
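
Since neg_ scorers return the negated error, the values need to be flipped back before being reported as errors. A minimal sketch on synthetic data (an assumption for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic data (assumption).
X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)

shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=shuffle_split,
                         scoring='neg_mean_absolute_error')

mae_per_split = -scores   # scores are <= 0; negate to recover the MAE

print(mae_per_split.mean(), mae_per_split.std())
```

Reporting the mean together with the standard deviation across splits gives a sense of how stable the estimate is.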

The scoring parameter can be set to one of the scoring schemes
implemented in sklearn, as follows:
max_error
r2
neg_mean_absolute_error
neg_mean_squared_error
neg_mean_squared_log_error
neg_median_absolute_error
neg_root_mean_squared_error

37
How to obtain test scores from different folds?

from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)
cv_results = cross_validate(
    regressor, data, target, cv=cv, scoring="neg_mean_absolute_error")

The results are stored in a Python dictionary with
the following keys:
fit_time
score_time
test_score
estimator (optional)
train_score (optional)

38
How to obtain trained estimators and scores on
training data during cross validation?

For trained estimator, set return_estimator = True

For scores on training set, set return_train_score = True

from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=40, test_size=0.3,
                  random_state=0)
cv_results = cross_validate(
    regressor, data, target,
    cv=cv, scoring="neg_mean_absolute_error",
    return_train_score=True,
    return_estimator=True)

The estimators can be accessed through estimator

key of the dictionary returned by cross_validate
39
How to evaluate multiple metrics of regression in
cross validation set up?
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=40, test_size=0.3,
                  random_state=0)
cv_results = cross_validate(
    regressor, data, target,
    cv=cv,
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
    return_train_score=True,
    return_estimator=True)

cross_validate allows us to specify multiple scoring metrics

unlike cross_val_score

40
How to study effect of #samples on training and test
errors?

STEP 1: Call the learning_curve function with the
estimator, training data, training set sizes, cross validation strategy and
scoring scheme as arguments.
from sklearn.model_selection import learning_curve

results = learning_curve(
    lin_reg, X_train, y_train, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error")
train_size, train_scores, test_scores = results[:3]
# Convert the scores into errors
train_errors, test_errors = -train_scores, -test_scores

STEP 2: Plot training and test errors as a function of the size
of the training set, and assess the model
fit: underfitting, overfitting or right fit.
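
A self-contained sketch of the two steps follows, printing the averaged errors instead of plotting them. The synthetic data, the five training-set fractions and the ShuffleSplit settings are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, learning_curve

# Synthetic data (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
train_sizes = np.linspace(0.2, 1.0, 5)   # fractions of the available training data

sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=train_sizes, cv=cv,
    scoring="neg_mean_absolute_error")

# Rows correspond to training-set sizes, columns to CV splits.
train_errors, test_errors = -train_scores, -test_scores

for n, tr, te in zip(sizes, train_errors.mean(axis=1), test_errors.mean(axis=1)):
    print(f"n={n:4d}  train MAE={tr:7.3f}  test MAE={te:7.3f}")
```

A large, persistent gap between the two columns suggests overfitting, while two high and similar values suggest underfitting.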
41
Underfitting/Overfitting diagnosis

STEP 1: Fit linear models with different number of features.

STEP 2: For each model, obtain training and test errors.

STEP 3: Plot #features vs error graph - one each for training

and test errors.

STEP 4: Examine the graphs to detect under/overﬁtting.

We can replace #features with any other tunable

hyperparameter to do this diagnosis for setting that
hyperparameter to the appropriate value.

42
Polynomial regression

1
How is polynomial regression model
trained?
Step 1: Apply polynomial transformation on the feature matrix.

Step 2: Learn linear regression model (via normal equation or

SGD) on the transformed feature matrix.

Implementation tips: Make use of pipeline construct for

polynomial transformation followed by linear regression
estimator.

2
Set up polynomial regression model with normal equation
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two steps:
# 1. Polynomial features of desired degree (here degree=2)
# 2. Linear regression
poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())])

# Train with feature matrix X_train and label vector y_train
poly_model.fit(X_train, y_train)

Set up polynomial regression model with SGD

from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('sgd_regression', SGDRegressor())])
poly_model.fit(X_train, y_train)

Notice that only the estimator step differs between the two code snippets.

3
How to use only interaction features for
polynomial regression?

from sklearn.preprocessing import PolynomialFeatures

poly_transform = PolynomialFeatures(degree=2, interaction_only=True)

[x_1, x_2] is transformed to [1, x_1, x_2, x_1 x_2].

Note that [x_1^2, x_2^2] are excluded.
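
The effect of interaction_only is easiest to see on a single sample. The input values 2 and 3 below are an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])   # x1 = 2, x2 = 3

# Full degree-2 expansion: [1, x1, x2, x1^2, x1*x2, x2^2]
full = PolynomialFeatures(degree=2).fit_transform(X)

# Interaction terms only: [1, x1, x2, x1*x2]
inter = PolynomialFeatures(degree=2, interaction_only=True).fit_transform(X)

print(full)    # [[1. 2. 3. 4. 6. 9.]]
print(inter)   # [[1. 2. 3. 6.]]
```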

4
Hyperparameter tuning (HPT)

5
How to recognize hyperparameters in
any sklearn estimator?

Hyper-parameters are parameters that are not directly learnt

within estimators.
In sklearn, they are passed as arguments to the constructor of
the estimator classes.

For example,
degree in PolynomialFeatures
learning rate in SGDRegressor

6
How to set these hyperparameters?

Select hyperparameters that results in the best cross validation

scores.

Hyper parameter search consists of

an estimator (regressor or classiﬁer);
a parameter space;
a method for searching or sampling candidates;
a cross-validation scheme; and
a score function.

7
Two generic HPT approaches implemented in sklearn are:

GridSearchCV exhaustively considers all parameter
combinations for the specified values.

RandomizedSearchCV samples a given number of candidate
values from a parameter space with a specified distribution.
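
A minimal sketch of randomized search, sampling a Ridge regularization rate from a log-uniform distribution. The synthetic data, the distribution bounds and n_iter=10 are assumptions for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (assumption).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# A distribution instead of a list of values: alpha is drawn from [1e-4, 1e2].
param_distributions = {'alpha': loguniform(1e-4, 1e2)}

search = RandomizedSearchCV(Ridge(), param_distributions,
                            n_iter=10,          # budget: 10 sampled candidates
                            cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_)
```

The budget n_iter stays at 10 candidates no matter how many hyperparameters are added to param_distributions, whereas a grid grows multiplicatively with each new parameter.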

8
What are the differences between grid and
randomized search?

Grid search
Specifies exact values of parameters in a grid.

Randomized search
Specifies distributions of parameter values, and values are
sampled from those distributions.
The computational budget can be chosen independently of the number of
parameters and their possible values.
The budget is chosen in terms of the number of sampled
candidates or the number of training iterations, specified in the
n_iter argument.

9
What data split is recommended for HPT?

[Figure: the data and labels are split into training, validation and test sets.]

10
What are the steps in HPT?

STEP 1: Divide the data into training, validation and test sets.

[Figure: the data and labels are split into training, validation and test sets.]

11
STEP 2: For each combination of hyper-parameter values learn a
model with training set.
[Figure: the learning algorithm is run on the training data and labels once for each set of hyperparameter values, producing one model per set.]

This step creates multiple models.

Tips: This step can be run in parallel by setting n_jobs = -1.

Some parameter combinations may cause a failure in fitting
one or more folds of data. This may cause the search to
fail. Set error_score = 0 (or np.nan) to set the score for the
problematic fold to 0 and complete the search.

12
STEP 3: Evaluate performance of each model with validation set and
select a model with the best evaluation score.
[Figure: each model predicts on the validation fold data; its performance is measured against the validation fold labels, and the best-performing model gives the best hyperparameter values.]

13
STEP 4: Retrain model with the best hyper-parameter settings on
training and validation set combined.

[Figure: the learning algorithm is run on the combined training and validation data and labels with the best hyperparameter values to produce the model.]

14
STEP 5: Evaluate the model performance on the test set.

[Figure: the model predicts on the test data and its performance is measured against the test labels.]

Note that the test set was not used in the hyper-parameter search
or in model retraining.
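
The five steps can be sketched with GridSearchCV, under the assumption that k-fold cross validation on the training portion plays the role of the separate validation set. The synthetic data, the Ridge estimator and the alpha grid are illustrative choices, not the course's own setup.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# STEP 1: hold out a test set; validation folds are carved out by cv below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# STEPS 2 and 3: fit one model per candidate and fold, score on the held-out fold.
search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                      cv=5, scoring='neg_mean_squared_error')

# STEP 4: with the default refit=True, the best setting is retrained
# on all of X_train once the search finishes.
search.fit(X_train, y_train)

# STEP 5: the test set is touched only once, at the very end.
test_r2 = search.best_estimator_.score(X_test, y_test)

print(search.best_params_, test_r2)
```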
15
What are some of the model-specific HPT approaches available for
regression tasks?

Some models can fit data for a range of values of some
parameter almost as efficiently as fitting the estimator for a
single value of the parameter.
This feature can be leveraged to perform more efficient
cross-validation when selecting this parameter.

linear_model.LassoCV
linear_model.RidgeCV
linear_model.ElasticNetCV
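
As a minimal sketch of one of these estimators, RidgeCV below selects its own regularization rate from a list of candidates during fit. The synthetic data and the candidate list are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic data (assumption).
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

alphas = [1e-3, 1e-2, 1e-1, 1.0, 10.0]

# Cross validation over the candidate alphas happens inside fit().
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)

print(ridge_cv.alpha_)   # the selected regularization rate
```

After fitting, ridge_cv behaves like a Ridge model trained with the selected alpha_, so predict and score can be called on it directly.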

16
How to determine degree of polynomial regression
with grid search?
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import SGDRegressor

param_grid = [
    {'poly__degree': [2, 3, 4, 5, 6, 7, 8, 9]}
]

pipeline = Pipeline(steps=[('poly', PolynomialFeatures()),
                           ('sgd', SGDRegressor())])

grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

# x_train is a 1-D array here; reshape it into a single-feature matrix
grid_search.fit(x_train.reshape(-1, 1), y_train)

17
Regularization

18
How to perform ridge regularization with speciﬁc
regularization rate?
[Option #1]
Step 1: Instantiate object of Ridge estimator
Step 2: Set parameter alpha to the required regularization rate.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1e-3)

fit, score, predict work exactly like other linear regression estimators

[Option #2]
Step 1: Instantiate object of SGDRegressor estimator
Step 2: Set parameter alpha to the required regularization rate
and penalty = 'l2'.
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(alpha=1e-3, penalty='l2')
19
How to search the best regularization parameter
for ridge?
[Option #1]
Search for the best regularization rate with built-in cross
validation in RidgeCV estimator.

[Option #2]
Use cross validation with Ridge or SGDRegressor to search
for the best regularization rate.
Grid search
Randomized search

20
How to perform ridge regularization in polynomial
regression?
Set up a pipeline of polynomial transformation followed by the
ridge regressor.
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1e-3))])
poly_model.fit(X_train, y_train)

Instead of Ridge, we can use SGDRegressor with penalty='l2', as shown
earlier, to get an equivalent formulation.

21
How to perform lasso regularization with speciﬁc
regularization rate?
[Option #1]
Step 1: Instantiate object of Lasso estimator
Step 2: Set parameter alpha to the required regularization rate.
1 from sklearn.linear_model import Lasso
2 lasso = Lasso(alpha=1e-3)

ﬁt, score, predict work exactly like other linear regression estimators

[Option #2]
Step 1: Instantiate object of SGDRegressor estimator
Step 2: Set parameter alpha to the required regularization rate
and penalty = l1.
1 from sklearn.linear_model import SGDRegressor
2 sgd = SGDRegressor(alpha=1e-3, penalty='l1')
22
How to search the best regularization
parameter for lasso regularization?

[Option #1]
Search for the best regularization rate with built-in cross
validation in LassoCV estimator.

[Option #2]
Use cross validation with Lasso or SGDRegressor to search
for the best regularization rate.
Grid search
Randomized search

23
How to perform lasso regularization in polynomial
regression?
Set up a pipeline of polynomial transformation followed by the
lasso regressor.
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('lasso', Lasso(alpha=1e-3))])
poly_model.fit(X_train, y_train)

Instead of Lasso, we can use SGDRegressor with penalty='l1' to get an
equivalent formulation.

24
How to perform both lasso and ridge regularization
in polynomial regression?
Set up a pipeline of polynomial transformation followed by the
SGDRegressor with penalty = 'elasticnet'
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('elasticnet', SGDRegressor(penalty='elasticnet',
                                l1_ratio=0.3))])
poly_model.fit(X_train, y_train)

Remember, elasticnet is a convex combination of L1 (Lasso) and L2
(Ridge) regularization.
In this example, we have set l1_ratio to 0.3, which means the L2 weight
is 1 - l1_ratio = 0.7. L2 takes higher weightage in this formulation.
25
Summary
How to implement
Different regression models: standard linear regression and
polynomial regression.
Regularization.
Model evaluation through different error metrics and scores derived
from them.
Cross validation - different iterators
Hyperparameter tuning via grid search and randomized search.

26
Appendix - More
Details

27
Introduction
In this module, we will be covering the implementation aspects of
models of linear regression.
First we will learn linear regression models with:
Normal equation, which estimates model parameter with a
closed-form solution.

Iterative optimization approach of gradient descent and its

variants namely batch, mini-batch and stochastic gradient
descent.

28
Further, we will study the implementation of the polynomial
regression model, that is capable of modelling non-linear
relationships between features and labels.
Since the polynomial regression uses more
parameters (due to polynomial representation of the
input), it is more prone to overfitting.
We will study how to detect overfitting with learning
curves and use of regularization to mitigate the risk of
overfitting.

29
Recap
Let's recall components of Linear regression

30
Training Data

(features, label) or (X, y), where label y is a real number.

31
Model
The label is obtained by a linear combination (or weighted sum) of the
input features and a bias (or intercept) term. The model $h_w$ is given by
$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_m x_m = w^T x$$
where,
$\hat{y}$ is the predicted value.
$x$ is the feature vector $\{x_1, x_2, \dots, x_m\}$ for a given example with m
features in total (with a dummy feature $x_0 = 1$ prepended so that $w^T x$ includes the bias).
i-th feature: $x_i$
Weight or parameter vector $w$ includes the bias term too:
$\{w_0, w_1, w_2, \dots, w_m\}$
$w_i$ is the i-th model parameter, associated with the i-th feature.
$h_w$ is a model with parameter vector $w$.

32
Loss function
The model parameters w are learnt such that the square of difference
between the actual and the predicted values is minimized.

$$J(w) = \frac{1}{2} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$

$$J(w) = \frac{1}{2} \sum_{i=1}^{n} \left(w^T x^{(i)} - y^{(i)}\right)^2$$
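As a small worked example, the loss can be evaluated directly with NumPy. The function name `squared_error_loss` and the toy data are assumptions for illustration; the feature matrix carries a leading column of ones so that the first weight acts as the bias.

```python
import numpy as np

def squared_error_loss(X, y, w):
    """J(w) = 1/2 * sum_i (w^T x_i - y_i)^2 for a matrix X with a bias column."""
    residuals = X @ w - y
    return 0.5 * residuals @ residuals

# Toy data generated from y = 1 + 2x; the first column is the dummy feature x_0 = 1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

loss_exact = squared_error_loss(X, y, np.array([1.0, 2.0]))  # true weights
loss_off = squared_error_loss(X, y, np.array([0.0, 2.0]))    # bias off by 1
print(loss_exact, loss_off)  # 0.0 1.5
```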

33
Optimization

1. Normal equation
2. Iterative optimization with gradient descent: full batch, mini-batch or
stochastic.

34
Evaluation measure
1. Mean squared error
2. Root mean squared error

35
Implementing with sklearn

36
Normal equation
sklearn provides LinearRegression estimator for weight
vector estimation via normal equation

from sklearn.linear_model import LinearRegression

Like other estimator objects, it implements a fit method that
takes the training data as input, stores the estimated weight vector
in the estimator and returns the fitted estimator.
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

It's a good practice to scale or normalize features. Older scikit-learn
releases offered a normalize flag on LinearRegression for this (False by
default); it was deprecated in version 1.0 and removed in 1.2, so scale the
features in a preprocessing step such as StandardScaler instead.
37
It also implements a couple of other methods:
predict: Predicts labels for new examples based on the
learnt model.
score: Returns the $R^2$ of the linear regression model.

38
Coefficient of determination $R^2$
$$R^2 = 1 - \frac{u}{v}$$
where
u is the residual sum of squares and is calculated as
$$u = (Xw - y)^T (Xw - y)$$
and v is the total sum of squares. Let $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}$ be the mean of the actual labels, then v is
calculated as
$$v = (y - \bar{y})^T (y - \bar{y})$$

39
$$R^2 = 1 - \frac{u}{v}$$
The best possible score is 1.0.
The score can be negative (because the model can be arbitrarily
worse).
A constant model that always predicts the expected value of y ,
disregarding the input features, would get a score of 0.0.
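The following sketch checks these statements numerically on synthetic data (an assumption for illustration). It computes u and v by hand, taking v around the mean of the actual labels, and compares the result with the score method; DummyRegressor plays the role of the constant model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
lin_reg = LinearRegression().fit(X, y)

# Residual sum of squares u and total sum of squares v.
residual = y - lin_reg.predict(X)
u = residual @ residual
v = (y - y.mean()) @ (y - y.mean())
r2_manual = 1 - u / v
print(r2_manual, lin_reg.score(X, y))

# A constant model that always predicts the mean of y scores 0.0.
constant = DummyRegressor(strategy="mean").fit(X, y)
print(constant.score(X, y))
```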

40
41
Model inspection

The learnt weights can be obtained by accessing the following attributes
of the fitted LinearRegression estimator:
The intercept weight $w_0$ can be obtained via the intercept_ attribute.
The remaining weights can be obtained via the coef_ attribute.

lin_reg.intercept_, lin_reg.coef_

42
Computational Complexity
The normal equation obtains the weight vector as
$$w = (X^T X)^{-1} X^T y$$

This involves inverting the matrix $X^T X$, which is built from the feature matrix X.

The LinearRegression estimator uses SVD for this task and has a
complexity of $O(m^2)$, where m is the number of features.
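A minimal sketch of the normal equation in NumPy, compared against LinearRegression. The synthetic data and true weights are assumptions for illustration; the linear system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is a common numerical choice and not necessarily what sklearn does internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so that w[0] is the intercept.
X_b = np.c_[np.ones(len(X)), X]

# Solve (X^T X) w = X^T y.
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

lin_reg = LinearRegression().fit(X, y)
print(w)
print(lin_reg.intercept_, lin_reg.coef_)
```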

43
This implies that if we double the number of features, the training
computation grows roughly 4 times.
As the number of features grows large, the approach of normal
equation slows down signiﬁcantly.
These approaches are linear w.r.t. the number of training
examples as long as the training set ﬁts in the memory.
The inference process is linear w.r.t. both the number of examples
and number of features.

44
Weight vector estimation via SGD
SGD is a simple yet very efﬁcient approach of learning weight
vectors in linear regression problems especially in large scale
problem settings.
SGD offers provisions for tuning the optimization process. However
as a downside of this, we need to set a few hyperparameters.
SGD is sensitive to feature scaling.

45
In sklearn, an estimator SGDRegressor implements a plain stochastic
gradient descent learning routine which supports different loss functions
and penalties to ﬁt linear regression models.

SGDRegressor is well suited for regression problems with a large

number of training samples (> 10,000). For learning problems with
small number of training examples, sklearn user guide recommends
Ridge or Lasso.

46
Key functionalities of SGDRegressor
Loss function:
Can be set via the loss parameter.
SGDRegressor supports the following loss functions.
loss = 'squared_error': Ordinary least squares,
loss = 'huber': Huber loss for robust regression

47
Regularization
SGD supports the following penalties:
penalty = 'l2': L2 norm penalty on coef_. This is the default setting.
penalty = 'l1': L1 norm penalty on coef_. This leads to sparse
solutions.
penalty = 'elasticnet': Convex combination of L2 and L1;
(1 - l1_ratio) * L2 + l1_ratio * L1

48
Learning rate
The learning rate η can be either constant or gradually decaying. The
following options are available for specifying the learning rate schedule in SGD:

1. invscaling

For regression, the default learning rate schedule is inverse scaling,
learning_rate = 'invscaling'. The learning rate in the t-th iteration or time
step is calculated as

$$\eta^{(t)} = \frac{\eta_0}{t^{\text{power\_t}}}$$

where η0 and power_t are hyperparameters chosen by the user.
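To get a feel for how fast this schedule decays, the formula can be evaluated for a few time steps. The helper `invscaling` and the values eta0 = 0.01 and power_t = 0.25 are illustrative assumptions; the last lines show how the same settings would be passed to SGDRegressor.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def invscaling(eta0, power_t, t):
    """Inverse scaling schedule: eta0 / t ** power_t."""
    return eta0 / np.power(t, power_t)

steps = np.array([1, 10, 100, 1000])
rates = invscaling(eta0=0.01, power_t=0.25, t=steps)
print(rates)  # decreasing sequence that starts at eta0

# The same schedule requested from the estimator.
sgd = SGDRegressor(learning_rate="invscaling", eta0=0.01, power_t=0.25)
```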

49
2. Constant

For a constant learning rate, use learning_rate = 'constant' and use
η0 to specify the learning rate.

50
3. Adaptive
For an adaptively decreasing learning rate, use learning_rate =
'adaptive' and use η0 to specify the starting learning rate.

When the stopping criterion is reached, the learning rate is divided
by 5, and the training loop continues.
The algorithm stops when the learning rate goes below $10^{-6}$.

51
4. Optimal
Used as a default setting for classiﬁcation problems. The learning rate η
for t-th iteration is calculated as follows:

$$\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}$$
Here
α is the regularization rate.
t is the time step (there are a total of n_samples * n_iter time steps).
$t_0$ is determined based on a heuristic proposed by Léon Bottou such
that the expected initial updates are comparable with the expected
size of the weights (this assumes that the norm of the training
samples is approximately 1).
52
Stopping criteria
SGDRegressor provides two stopping criteria to stop the algorithm
when a given level of convergence is reached:
1. With early_stopping = True
The input data is split into a training set and a validation
set based on the validation_fraction parameter.

The model is fitted on the training set, and the stopping
criterion is based on the prediction score (using the
score method) computed on the validation set.

53
2. With early_stopping = False
The model is ﬁtted on the entire input data and
The stopping criterion is based on the objective function
computed on the training data.

54
In both cases, the criterion is evaluated once per epoch, and the
algorithm stops when the criterion does not improve
n_iter_no_change times in a row.

The improvement is evaluated with absolute tolerance tol.

The algorithm stops in any case after a maximum number of iterations,
max_iter.
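The stopping parameters described above can be combined as in the sketch below. The synthetic data and the particular parameter values are assumptions for illustration; the features are standardized first because SGD is sensitive to feature scaling.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=500, n_features=5, noise=1.0, random_state=0)

# Hold out 20% of the training data and stop once the validation score
# has not improved by at least tol for 5 consecutive epochs.
sgd = SGDRegressor(early_stopping=True, validation_fraction=0.2,
                   n_iter_no_change=5, tol=1e-3, max_iter=1000,
                   random_state=0)
model = make_pipeline(StandardScaler(), sgd).fit(X, y)

# n_iter_ reports how many epochs were actually run.
print("epochs run:", sgd.n_iter_)
```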

55
SGD variation: Average SGD
SGDRegressor supports averaged SGD (ASGD). Averaging can be
enabled by setting average = True

ASGD performs the same updates as the regular SGD, and sets
coef_ attribute to the average value of the coefﬁcients across all
updates.
SGD sets coef_ attribute to the last value of the coefﬁcients
(i.e. the values of the last update)
The same process is followed for the intercept_ attribute.
When using ASGD the learning rate can be larger and even constant,
leading to a speed up in training.
56
Model inspection

We obtain the weight vector from the trained model as follows:

coef_ variable stores weights assigned to the features.

intercept_, as name suggests, stores the intercept term.

57
Complexity
The major advantage of SGD is its efﬁciency, which is basically linear in
the number of training examples.

If X is a matrix of size (n, m), training has a cost of $O(knp)$,
where k is the number of iterations (epochs) and p is the average
number of non-zero attributes per sample.

Recent theoretical results, however, show that the runtime to get some
desired optimization accuracy does not increase as the training set size
increases.

58
Polynomial regression

59
Polynomial regression

Polynomial regression = Polynomial transformation + Linear Regression

PolynomialFeatures transformer transforms an input data matrix into a

new data matrix of a given degree.

60
Example:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.arange(6).reshape(3, 2)
print("Data matrix: \n", X)
poly = PolynomialFeatures(degree=2)
print("\n\nAfter transformation: \n", poly.fit_transform(X))
Output:

Data matrix:
 [[0 1]
 [2 3]
 [4 5]]


After transformation:
 [[ 1.  0.  1.  0.  0.  1.]
 [ 1.  2.  3.  4.  6.  9.]
 [ 1.  4.  5. 16. 20. 25.]]
61
In the above example, the features of X have been transformed from
$[x_1, x_2]$ to $[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$, and can now be used within any linear
model.

In some cases it's not necessary to include higher powers of any single
feature, but only the so-called interaction features that multiply together
at most d distinct features, where d is the degree. These can be obtained from
PolynomialFeatures with the setting interaction_only = True.
In this case, the features of X would be transformed from $[x_1, x_2]$ to
$[1, x_1, x_2, x_1 x_2]$.
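Applying interaction_only = True to the same 3 x 2 data matrix used in the earlier example gives a short check of this behaviour.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)

# Keep the bias column, the original features and the product x1 * x2,
# but drop the pure powers x1^2 and x2^2.
interactions = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = interactions.fit_transform(X)
print(X_inter)
# [[ 1.  0.  1.  0.]
#  [ 1.  2.  3.  6.]
#  [ 1.  4.  5. 20.]]
```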

62
Ridge regression

63
Ridge regression
Ridge regression minimizes L2 penalized sum of squared error.

Ridge Loss = Sum of squared error + regularization_rate * penalty

We use 'Ridge' estimator for implementing ridge regression. It takes
parameter 'alpha' which is the regularization rate.
'RidgeCV' estimator implements ridge regression with cross
validation for regularization rate.

64
RidgeCV parameters:
1. alphas is the list of regularization rates to try.
The regularization rate must be positive.
Larger values indicate stronger regularization.
2. cv determines the cross-validation splitting strategy.
None, to use the efﬁcient Leave-One-Out cross-validation
integer, to specify the number of folds.
CV splitter speciﬁes how to generate cross validation sets.
An iterable yielding (train, test) splits as arrays of indices.

65
For an integer cv (e.g. 'cv=5'), the 'StratifiedKFold' cross validation
strategy is used when the target is binary or multiclass. In other
cases, the 'KFold' cross validation strategy is used.

66
Model inspection
'RidgeCV' provides an additional output apart from the usual outputs like
coef_ and intercept_:

alpha_ provides the estimated regularization parameter.
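A short sketch of RidgeCV in use, on synthetic data with a hypothetical list of candidate rates. After fitting, the selected regularization rate is available as the alpha_ attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Hypothetical candidate regularization rates; all must be positive.
alphas = [0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, y)

print("selected alpha:", ridge_cv.alpha_)
print("weights:", ridge_cv.coef_, "intercept:", ridge_cv.intercept_)
```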

67
Lasso regression

68
Lasso regression

Lasso uses the L1 norm of the weight vector as a penalty in the linear regression
loss function.

'sklearn' provides two implementations for learning the weight vector in
Lasso.
'Lasso' uses the coordinate descent algorithm.
'LassoLars' uses the least angle regression algorithm. It is an
adaptation of forward stepwise feature selection for solving
Lasso regression.

69
Classification functions in scikit-learn

Dr. Ashish Tendulkar

IIT Madras
Machine Learning Practice

2
This week, we will study sklearn functionality for
implementing classification algorithms.
We will cover sklearn APIs for
Specific classification algorithms: least square
classification, perceptron, and logistic regression
classifier,
with regularization,
in multiclass, multilabel and multi-output settings.
Various classification metrics.

Cross validation and hyperparameter search for
classification work exactly as they do in the
regression setting.
However, there are a couple of CV strategies that are
specific to classification.
3
Part I: sklearn API for
classification

4
There are broadly two types of APIs based
on their functionality:

Generic                           | Specific
SGD classifier                    | Logistic regression
                                  | Perceptron
                                  | Ridge classifier (for LSC)
                                  | K-nearest neighbours (KNNs)
                                  | Support vector machines (SVMs)
                                  | Naive Bayes
Uses gradient descent for         | Specialized solvers for
optimization                      | optimization
Need to specify loss function     |
5
All sklearn estimators for classification implement a few
common methods for model training, prediction and
evaluation.

6
Model training
fit(X, y[, coef_init, intercept_init, …])

Prediction
predict(X) predicts class label for samples

decision_function(X) predicts conﬁdence score for

samples.
Evaluation
score(X, y[, sample_weight])
Return the mean accuracy on the given test data and
labels.

7
There are a few common miscellaneous methods as follows:

get_params([deep]) gets parameter for this estimator.

set_params(**params) sets the parameters of this estimator.

densify() converts coefﬁcient matrix to dense array format.

sparsify() converts coefﬁcient matrix to sparse format.

8
Now let's study how to implement different classiﬁers
with sklearn APIs.

9
Let's start with implementation of least square
classiﬁcation (LSC) with RidgeClassiﬁer API.

10
Ridge classifier
RidgeClassifier is a classifier variant of the Ridge regressor.
Binary classification:
The classifier first converts binary targets to {-1, 1} and then treats the
problem as a regression task, optimizing the objective of the regressor:
minimize a penalized residual sum of squares
$$\min_{w} \; \|Xw - y\|_2^2 + \alpha \|w\|_2^2$$
sklearn provides different solvers for this optimization
sklearn uses α to denote regularization rate
predicted class corresponds to the sign of the regressor’s
prediction

Multiclass classification:
treated as multi-output regression
predicted class corresponds to the output with the highest value
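For the binary case, the link between the regression output and the predicted class can be checked directly. This is a sketch on synthetic data (an assumption for illustration): decision_function returns the regressor's output, and its sign selects the class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RidgeClassifier(alpha=1.0).fit(X, y)

# Positive scores map to the second class, the rest to the first class.
scores = clf.decision_function(X)
labels_from_sign = clf.classes_[(scores > 0).astype(int)]

print(np.array_equal(labels_from_sign, clf.predict(X)))  # True
print("training accuracy:", clf.score(X, y))
```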
11
How to train a least square classifier?
Step 1: Instantiate a classification estimator without passing any
arguments to it. This creates a ridge classifier object.

from sklearn.linear_model import RidgeClassifier
ridge_classifier = RidgeClassifier()

Step 2: Call the fit method on the ridge classifier object with the training feature
matrix and label vector as arguments.
Note: The model is fitted using X_train and y_train.

# Model training with feature matrix X_train and
# label vector or matrix y_train
ridge_classifier.fit(X_train, y_train)

12
How to set regularization rate in RidgeClassiﬁer?

Set alpha to a float value. The default value is 1.0.

from sklearn.linear_model import RidgeClassifier
ridge_classifier = RidgeClassifier(alpha=0.001)

alpha should be positive.

Larger alpha values specify stronger regularization.

13
How to solve optimization problem in RidgeClassiﬁer?
Using one of the following solvers:

svd        uses a Singular Value Decomposition of the feature matrix to
           compute the Ridge coefficients.
cholesky   uses the scipy.linalg.solve function to obtain the closed-form
           solution.
sparse_cg  uses the conjugate gradient solver of scipy.sparse.linalg.cg.
lsqr       uses the dedicated regularized least-squares routine
           scipy.sparse.linalg.lsqr and it is the fastest.
sag, saga  use a Stochastic Average Gradient descent iterative procedure;
           'saga' is an unbiased and more flexible version of 'sag'.
lbfgs      uses the L-BFGS-B algorithm implemented in
           scipy.optimize.minimize; it can be used only when coefficients
           are forced to be positive (positive=True).
14
Uses of solver in RidgeClassiﬁer

For large scale data, use the 'sparse_cg' solver.

When both n_samples and n_features are large, use the 'sag' or
'saga' solvers.
Note that fast convergence is only guaranteed on features
with approximately the same scale.

15
How to make RidgeClassiﬁer select the solver
automatically?
ridge_classifier = RidgeClassifier(solver='auto')

auto chooses the solver automatically based on
the type of data:

if solver == 'auto':
    if return_intercept:
        # only sag supports fitting intercept directly
        solver = "sag"
    elif not sparse.issparse(X):
        solver = "cholesky"
    else:
        solver = "sparse_cg"

The default choice for solver is auto.

16
Is intercept estimation necessary for
RidgeClassiﬁer?
If the data is already centered, set fit_intercept to False, so that no
intercept will be used in calculations.

Default:
ridge_classifier = RidgeClassifier(fit_intercept=True)

17
How to make predictions on new data
samples?
Use predict method to predict class labels for samples

Step 1: Arrange the data for prediction in a feature matrix of shape
(#samples, #features) or in sparse matrix format.
Step 2: Call the predict method on the classifier object with the feature matrix
as an argument.

# Predict labels for feature matrix X_test
y_pred = ridge_classifier.predict(X_test)

Other classiﬁers also use the same predict method.

18
RidgeClassiﬁerCV implements
RidgeClassiﬁer with built-in cross validation.

19
Let's implement perceptron classiﬁer with
Perceptron API.

20
Perceptron classiﬁcation
It is a simple classiﬁcation algorithm suitable for large-scale
learning.

Shares the same underlying implementation with SGDClassifier:

Perceptron()

is equivalent to

SGDClassifier(loss="perceptron", eta0=1,
              learning_rate="constant", penalty=None)

Perceptron uses SGD for training.
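The equivalence can be verified empirically. In this sketch both estimators are given the same random_state (an assumption needed so that they shuffle the synthetic data identically), after which they should learn the same weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron, SGDClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

perceptron = Perceptron(random_state=0).fit(X, y)
sgd = SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant",
                    penalty=None, random_state=0).fit(X, y)

# Same loss, same learning rate schedule, same shuffling: same model.
print(np.allclose(perceptron.coef_, sgd.coef_))
print(np.allclose(perceptron.intercept_, sgd.intercept_))
```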

21
How to implement perceptron classiﬁer?
Step 1: Instantiate a Perceptron estimator without passing any
arguments to it to create a classiﬁer object.

from sklearn.linear_model import Perceptron
perceptron_classifier = Perceptron()

Step 2: Call the fit method on the perceptron estimator object with the training
feature matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
perceptron_classifier.fit(X_train, y_train)

22
Perceptron can be further customized with the following
parameters:
Parameter            Default
penalty              None
alpha                0.0001
l1_ratio             0.15
fit_intercept        True
max_iter             1000
tol                  1e-3
eta0                 1
early_stopping       False
validation_fraction  0.1
n_iter_no_change     5

23
Perceptron classiﬁer can be trained in an iterative
manner with partial_ﬁt method

Perceptron classiﬁer can be initialized to the weights of

the previous run by specifying warm_start = True in the
constructor.
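A minimal sketch of iterative training with partial_fit. The synthetic data and the batch size of 100 are assumptions for illustration. The full list of classes is passed on every call, because a single batch may not contain all of them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron

X, y = make_classification(n_samples=400, n_features=6, random_state=0)
classes = np.unique(y)

clf = Perceptron(random_state=0)

# Feed the data in four batches; each call updates the current weights.
for start in range(0, len(X), 100):
    batch = slice(start, start + 100)
    clf.partial_fit(X[batch], y[batch], classes=classes)

print(clf.coef_.shape, clf.score(X, y))
```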

24
Let's implement logistic regression classiﬁer
with LogisticRegression API.

25
LogisticRegression API

Implements the logistic regression classifier, which is also known
by a few different names like logit regression, maximum entropy
classifier (maxent) and log-linear classifier.

$$\arg\min_{w} \; \text{regularization penalty} + C \cdot \text{cross entropy loss}$$

This implementation can fit
binary classification
one-vs-rest (OVR)
multinomial logistic regression

Provision for ℓ1, ℓ2 or elastic-net regularization

26
How to train a LogisticRegression classiﬁer?
Step 1: Instantiate a classiﬁer estimator without passing any
arguments to it. This creates a logistic regression object.

from sklearn.linear_model import LogisticRegression
logit_classifier = LogisticRegression()

Step 2: Call the fit method on the logistic regression classifier object with the
training feature matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
logit_classifier.fit(X_train, y_train)

27
Logistic regression uses speciﬁc algorithms for solving the
optimization problem in training. These algorithms are
known as solvers.

The choice of the solver depends on the classiﬁcation

problem set up such as size of the dataset, number of
features and labels.

28
How to select solvers for Logistic Regression
classiﬁer?
solver options: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'

For small datasets, 'liblinear' is a good choice, whereas 'sag' and
'saga' are faster for large ones.
For unscaled datasets, 'liblinear', 'lbfgs' and 'newton-cg' are robust.
For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs'
handle multinomial loss.
'liblinear' is limited to one-versus-rest schemes.

By default, logistic regression uses the lbfgs solver.

logit_classifier = LogisticRegression(solver='lbfgs')
29
How to add regularization in Logistic Regression
classiﬁer?

penalty options:
l2 - adds an L2 penalty term
l1 - adds an L1 penalty term
elasticnet - both L1 and L2 penalty terms are added
none - no penalty is added (written penalty=None from scikit-learn 1.2 onwards)

Regularization is applied by default because it improves numerical
stability.

By default, it uses the L2 penalty.

logit_classifier = LogisticRegression(penalty='l2')
30
Not all solvers support all penalties.
Select an appropriate solver for the desired penalty.
L2 penalty is supported by all solvers.
L1 penalty is supported only by a few solvers.

Solver        Penalty
'newton-cg'   ['l2', 'none']
'lbfgs'       ['l2', 'none']
'liblinear'   ['l1', 'l2']
'sag'         ['l2', 'none']
'saga'        ['elasticnet', 'l1', 'l2', 'none']

31
How to control amount of regularization in
logistic regression?
sklearn implementation uses parameter C, which is
inverse of regularization rate to control regularization.

Recall
$$\arg\min_{w} \; \text{regularization penalty} + C \cdot \text{cross entropy loss}$$

C is speciﬁed in the constructor and must be positive

Smaller value leads to stronger regularization.
Larger value leads to weaker regularization.
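The effect of C can be seen by fitting the same data with several values and comparing the size of the learnt weights. The synthetic data and the three values of C are assumptions for illustration; stronger regularization (smaller C) shrinks the weight vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

norms = []
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms.append(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||w|| = {norms[-1]:.3f}")
```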

32
LogisticRegression classiﬁer has a class_weight parameter
in its constructor.

What purpose does it serve?

Handles class imbalance with differential class
weights.
Mistakes in a class are penalized by the class weight.
A higher value here means a higher emphasis on
the class.

This parameter is available in many classifier estimators in sklearn.
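A small sketch of the effect on an imbalanced problem. The synthetic data with roughly 5% positives is an assumption for illustration; class_weight="balanced" weights each class inversely to its frequency, so the minority class is predicted more often.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Roughly 95% of the samples belong to class 0 and 5% to class 1.
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.95, 0.05], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

n_pos_plain = int(plain.predict(X).sum())
n_pos_balanced = int(balanced.predict(X).sum())
print("predicted positives:", n_pos_plain, "vs", n_pos_balanced)
```

An explicit dictionary such as class_weight={0: 1, 1: 10} can be used instead when specific weights are wanted.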

Exercise: Read stack overﬂow discussion on this parameter.

33
LogisticRegressionCV implements logistic regression with built-in
cross validation support to find the best values of the C and l1_ratio
parameters according to the specified scoring attribute.

34
These classiﬁers can also be implemented with a
generic SGDClassiﬁer API by setting the loss
parameter appropriately.

35
Let's study SGDClassiﬁer API.

36
SGDClassifier
SGD is a simple yet very efficient approach to fitting linear
classifiers under convex loss functions

This API uses SGD as an optimization technique and can be

applied to build a variety of linear classiﬁers by adjusting the
loss parameter.

It supports multi-class classification by combining multiple
binary classifiers in a "one versus all" (OVA) scheme.
Easily scales up to large scale problems with more than $10^5$
training examples and $10^5$ features. It also works with sparse
machine learning problems, such as
text classification and natural language processing.

37
We need to set the loss parameter appropriately to train the
classifier of our interest with SGDClassifier.

loss parameter:
'hinge' - (soft-margin) linear Support Vector Machine
'log' - logistic regression (named 'log_loss' in scikit-learn 1.1 and later)
'modified_huber' - smoothed hinge loss that brings
tolerance to outliers as well as probability estimates
'squared_hinge' - like hinge but is quadratically
penalized
'perceptron' - linear loss used by the perceptron
algorithm
'squared_error', 'huber', 'epsilon_insensitive', or
'squared_epsilon_insensitive' - regression losses
38
By default SGDClassiﬁer uses hinge loss and hence
trains linear support vector machine classiﬁer.

39
An instance of SGDClassifier might have an equivalent estimator
in the scikit-learn API.

SGDClassifier(loss='log')
is equivalent to LogisticRegression, fitted via SGD instead of
one of the LogisticRegression solvers.

SGDClassifier(loss='hinge')
is equivalent to a linear support vector machine.

40
How does SGDClassiﬁer work?
SGDClassifier implements a plain stochastic gradient descent
learning routine:
the gradient of the loss is estimated one sample at a time
and the model is updated along the way with a decreasing
learning rate (or strength) schedule.

Advantages:
Efficiency.
Ease of implementation.

Disadvantages:
Requires a number of hyperparameters.
Sensitive to feature scaling.

It is important
to permute (shuffle) the training data before fitting the model.
to standardize the features.
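Both recommendations can be followed with a pipeline, as in this sketch. The synthetic data and the deliberately rescaled first feature are assumptions for illustration. StandardScaler standardizes the features, and shuffle=True makes SGDClassifier reshuffle the training data after each epoch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # one feature on a very different scale

model = make_pipeline(StandardScaler(),
                      SGDClassifier(shuffle=True, random_state=0))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```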
41
How to use SGDClassifier for training a classifier?
Step 1: Instantiate an SGDClassifier estimator by setting the appropriate
loss parameter to define the classifier of interest. By default it uses hinge
loss, which is used for training a linear support vector machine.

from sklearn.linear_model import SGDClassifier
SGD_classifier = SGDClassifier(loss='log')

Here we have used the `log` loss, which defines a
logistic regression classifier (the same loss is named 'log_loss'
in scikit-learn 1.1 and later).

Step 2: Call the fit method on the SGD classifier object with the training feature
matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
SGD_classifier.fit(X_train, y_train)

42
How to perform regularization in SGD classiﬁer?

penalty options:
l2 - adds an L2 penalty term
l1 - adds an L1 penalty term
elasticnet - convex combination of L2 and L1:
(1 - l1_ratio) * L2 + l1_ratio * L1
(l1_ratio controls the convex combination of L1 and L2 penalty; default = 0.15)

Default:
SGD_classifier = SGDClassifier(penalty='l2')

alpha
Constant that multiplies the regularization term.
Takes float values; default = 0.0001
43
How to set maximum number of epochs for SGD
Classiﬁer?

The maximum number of passes over the training data (aka epochs)
is an integer that can be set by the max_iter parameter.

SGD_classifier = SGDClassifier(max_iter=100)

Default:
max_iter = 1000

44
Some common parameters between SGDClassiﬁer
and SGDRegressor

learning_rate: 'constant', 'optimal', 'invscaling', 'adaptive'
average: SGDClassifier also supports averaged SGD (ASGD)
Stopping criteria: tol, n_iter_no_change, max_iter
early_stopping: True or False
validation_fraction
warm_start

45
Summary
We learnt how to implement the following classifiers with
sklearn APIs:
Least square classification (RidgeClassifier)
Perceptron (Perceptron)
Logistic regression (LogisticRegression)

Alternatively, we can use SGDClassifier with the appropriate
loss setting for implementing these classifiers:
loss = 'log' for logistic regression
loss = 'perceptron' for perceptron
loss = 'squared_error' for least square classification

Classification estimators implement a few common
methods like fit, score, decision_function, and predict.
46
These estimators can be readily used in the multiclass setting.

They support regularized loss function optimization.

These classification estimators can deal with
class imbalance through the class_weight parameter in
the constructor.

47
Part II: Multi-learning classiﬁcation
set up

48
Let's extend these classiﬁers to multi-
learning (multi-class, multi-label & multi-
output) settings.

49
Basics of multiclass, multilabel and
multioutput classiﬁcation
Multiclass classification has exactly one output label and
the total number of labels > 2.

For more than one output, there are two types of classification
models:
Multilabel (total #labels = 2 for each output)
Multiclass multioutput (total #labels > 2 for each output)

We will refer to both these models as multi-label classification
models, where # of output labels > 1.

Multiclass, multilabel and multioutput problems are referred to as
multi-learning problems.
50
sklearn provides a bunch of meta-estimators, which extend the
functionality of base estimators to support multi-learning
problems.
The meta-estimators transform the multi-learning problem into a
set of simpler problems and ﬁt one estimator per problem.

Problem type                   | Meta-estimators
Multiclass classification      | OneVsRestClassifier
(sklearn.multiclass)           | OneVsOneClassifier
                               | OutputCodeClassifier
Multilabel classification      | MultiOutputClassifier
(sklearn.multioutput)          | ClassifierChain

51
Many sklearn estimators have built-in support for multi-
learning problems.
Meta-estimators are not needed for such estimators,
however meta-estimators can be used in case we
want to use these base estimators with strategies
beyond the built-in ones.

Built-in support falls into four groups: inherently multiclass,
multiclass as OVO, multiclass as OVR, and multilabel.

52
Inherently multiclass:
LogisticRegression (multi_class = 'multinomial')
LogisticRegressionCV (multi_class = 'multinomial')
RidgeClassifier
RidgeClassifierCV

Multiclass as OVR:
LogisticRegression (multi_class = 'ovr')
LogisticRegressionCV (multi_class = 'ovr')
SGDClassifier
Perceptron

Multilabel:
RidgeClassifier
RidgeClassifierCV

53
First we will study multiclass APIs in sklearn.

54
Multi-class classiﬁcation
Classiﬁcation task with more than two classes.
Each example is labeled with exactly one class

In Iris dataset,
There are three class labels: setosa, versicolor and virginica.
Each example has exactly one label of the three available
class labels.
Thus, this is an instance of a multi-class classiﬁcation.

In MNIST digit recognition dataset,

There are 10 class labels: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
Each example has exactly one label of the 10 available class
labels.
Thus, this is an instance of a multi-class classiﬁcation.
55
How to represent class labels in multi-class
setup?
Each example is marked with a single label out of k
labels. The shape of the label vector is (n, 1).

Use the LabelBinarizer transformation to convert the class
labels to multi-class format.
import numpy as np
from sklearn.preprocessing import LabelBinarizer
y = np.array(['apple', 'pear', 'apple', 'orange'])
y_dense = LabelBinarizer().fit_transform(y)

The resulting label matrix has shape (n, k).

[[1 0 0]
 [0 0 1]
 [1 0 0]
 [0 1 0]]
56
Let's say you are given labels as part of the training set. How
do we check whether they are suitable for multi-class
classification?

Use type_of_target to determine the type of the label.

from sklearn.utils.multiclass import type_of_target
type_of_target(y)

In case y is a vector with more than two discrete values,
type_of_target returns multiclass.

57
type_of_target can determine different types
of multi-learning targets.

58
target_type               y
'multiclass'              contains more than two discrete values
                          not a sequence of sequences
                          1d or a column vector
'multiclass-multioutput'  2d array that contains more than two discrete values
                          not a sequence of sequences
                          both dimensions are of size > 1
'multilabel-indicator'    label indicator matrix:
                          an array of two dimensions with at least two
                          columns, and at most 2 unique values
'unknown'                 array-like but none of the above, such as a 3d array,
                          sequence of sequences, or an array of non-sequence objects
59
Examples

multiclass:
>>> type_of_target([1, 0, 2])
'multiclass'
>>> type_of_target([1.0, 0.0, 3.0])
'multiclass'
>>> type_of_target(['a', 'b', 'c'])
'multiclass'

multiclass-multioutput:
>>> type_of_target(np.array([[1, 2], [3, 1]]))
'multiclass-multioutput'

multilabel-indicator:
>>> type_of_target(np.array([[0, 1], [1, 1]]))
'multilabel-indicator'
>>> type_of_target([[1, 2]])
'multilabel-indicator'

60
Apart from these, type_of_target can determine three more types
of targets, corresponding to regression and
binary classification.

continuous - regression target
continuous-multioutput - multi-output regression target
binary - binary classification target

61
All classiﬁers in scikit-learn perform multiclass
classiﬁcation out-of-the-box.

Use sklearn.multiclass module only when you want to

experiment with different multiclass strategies.

Using different multi-class strategy than the one

implemented by default may affect performance of
classiﬁer in terms of either generalization error or
computational resource requirement.

62
What are different multi-class classiﬁcation
strategies implemented in sklearn?

One-vs-all or one-vs-rest (OVR)

One-vs-one (OVO)

OVR is implemented by the OneVsRestClassifier API.

OVO is implemented by the OneVsOneClassifier API.

63
OVR - OneVsRestClassifier
Fits one classifier per class c - c vs not c.
This approach is computationally efficient and requires
only k classifiers.
The resulting model is interpretable.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)

We need to supply the estimator as an argument in the constructor.
Supports methods like other classifiers - fit, predict,
predict_proba, partial_fit (the last two only when the supplied estimator provides them).
OneVsRestClassifier also supports multilabel classification.
We need to supply labels as an indicator matrix of shape (n, k).
64
OVO - OneVsOneClassifier
Fits one classifier per pair of classes. Total classifiers = k(k-1)/2.
Predicts the class that receives the maximum votes.
A tie among classes is broken by selecting the class
with the highest aggregate classification confidence.

from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y)

We need to supply the estimator as an argument in the constructor.
Supports methods like other classifiers - fit, predict,
decision_function, partial_fit.
OneVsOneClassifier trains each classifier on a subset of the data
(the samples of two classes) and is useful in cases where the
classifier does not scale well with the number of samples.
65
What is the difference between OVR and OVO?

OneVsRestClassifier:
Fits one classifier per class.
For each classifier, the class is fitted against all the other classes.

OneVsOneClassifier:
Fits one classifier per pair of classes.
At prediction time, the class which received the most votes is selected.
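The difference in the number of fitted classifiers can be seen on a small example. The sketch below assumes a synthetic dataset with k = 4 classes (generated with make_blobs purely for illustration) and counts the underlying estimators of each meta-estimator.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic data with k = 4 classes (an assumption made for illustration)
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y)

n_ovr = len(ovr.estimators_)  # one classifier per class: k
n_ovo = len(ovo.estimators_)  # one classifier per pair of classes: k(k-1)/2
print(n_ovr, n_ovo)
```

With k = 4 this gives 4 classifiers for one-vs-rest and 6 for one-vs-one.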

66
Now we will learn how to perform multilabel
and multi-output classiﬁcation.

67
How does MultiOutputClassifier work?
The strategy consists of fitting one classifier per target.

[Diagram: the input feature matrix (X) is fed to Classifier #1, Classifier #2, ..., Classifier #k, which predict Class #1, Class #2, ..., Class #k respectively.]

Allows multiple target variable classifications.

68
How does ClassifierChain work?
A multi-label model that arranges binary classifiers into a chain.
It is a way of combining a number of binary classifiers into a single
multi-label model.

69
Comparison of MultiOutputClassifier and ClassifierChain

MultiOutputClassifier:
Able to estimate a series of target functions that are trained
on a single predictor matrix to predict a series of responses.
Allows multiple target variable classifications.

ClassifierChain:
Capable of exploiting correlations among targets.
For a multi-label classification problem with k classes, k binary
classifiers are assigned an integer between 0 and k − 1.
These integers define the order of models in the chain.
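Both meta-estimators can be tried on the same multi-label data. This is a minimal sketch, assuming a synthetic dataset from make_multilabel_classification and LogisticRegression as the base classifier; neither choice comes from the slides.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

X, Y = make_multilabel_classification(n_samples=200, n_classes=3, random_state=0)

# Independent classifiers, one per label column
multi = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
# Chained classifiers: each model also sees the labels that come earlier in the chain
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2]).fit(X, Y)

multi_pred = multi.predict(X)
chain_pred = chain.predict(X)
print(multi_pred.shape, chain_pred.shape)
```

Both predictions are label indicator matrices with one column per label.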

70
Summary

Different types of multi-learning setups: multi-class,

multi-label, multi-output.
type_of_target to determine the nature of supplied
labels.
Meta-estimators:
multi-class: One-vs-rest, one-vs-one
multi-label: Classiﬁer chain and multi-output classiﬁer

71
Evaluating Classiﬁers

72
So far we learnt how to train classiﬁers for binary, multi-class
and multi-label/output cases.

We will learn how to evaluate these classiﬁers with different

scoring functions and with cross-validation.

We will also study how to set hyper-parameters for classiﬁers.

Many cross-validation and HPT methods discussed in the

regression context are also applicable in classifiers.
We will not repeat that discussion in this topic.
Instead we will focus on only additional methods that are
specific to classifiers.

73
Stratiﬁed cross validation iterators

There may be issues like class imbalance in classiﬁcation,

which tend to impact the cross validation folds.

The overall class distribution and the ones in folds may be

different and this has implications in effective model training.

sklearn.model_selection module provides three stratiﬁed

APIs to create folds such that the overall class distribution is
replicated in individual folds.

74
sklearn.model_selection module provides the following
three stratiﬁed APIs to create folds such that the overall
class distribution is replicated in individual folds.

StratifiedKFold
RepeatedStratifiedKFold
StratifiedShuffleSplit

Note: Folds obtained via StratifiedShuffleSplit may not be
completely different, i.e. the same sample can appear in more than one test fold.
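The effect of stratification can be checked directly on the folds. The sketch below assumes a made-up imbalanced label vector (90 samples of class 0, 10 of class 1) and verifies that every test fold keeps the overall 10% share of class 1.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy labels: 90 samples of class 0 and 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Fraction of class 1 in every test fold
fold_ratios = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]
print(fold_ratios)
```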

75
LogisticRegressionCV
Supports built-in cross validation for optimizing hyperparameters.

The following are key parameters for HPT and cross validation:
cv specifies the cross validation iterator.
scoring specifies the scoring function to use for HPT.
Cs specifies the values of C (inverse regularization strengths) to experiment with.

Choosing the best hyper-parameters

refit = True: scores are averaged across folds, the values
corresponding to the best score are selected,
and a final refit is done with these parameters.
refit = False: the coefs, intercepts and C that correspond
to the best scores across folds are averaged.
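The parameters above can be combined as follows. This is a sketch under assumptions: the synthetic dataset, the candidate list Cs and the "f1" scoring choice are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

Cs = [0.01, 0.1, 1.0, 10.0]  # candidate values of C (illustrative choices)
clf = LogisticRegressionCV(
    Cs=Cs,
    cv=StratifiedKFold(n_splits=5),
    scoring="f1",
    refit=True,
    max_iter=1000,
).fit(X, y)

best_C = float(np.ravel(clf.C_)[0])  # value of C selected by cross validation
print(best_C)
```

With refit=True the selected C is one of the candidates in Cs.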
76
Now let's look at classification metrics
implemented in sklearn.

77
Classiﬁcation metrics

sklearn.metrics implements a bunch of classification scoring
metrics based on true labels and predicted labels as inputs.

accuracy_score, balanced_accuracy_score
top_k_accuracy_score, roc_auc_score
precision_score, recall_score, f1_score

Usage: score(actual_labels, predicted_labels)
(top_k_accuracy_score and roc_auc_score take predicted scores instead of predicted labels.)

78
Confusion matrix
confusion_matrix evaluates classification accuracy by computing
the confusion matrix with each row corresponding to the true class.

from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_predicted)

Entry (i, j) in a confusion matrix is the
number of observations actually in group i,
but predicted to be in group j.
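A small worked example makes the row/column convention concrete. The label lists below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = [2, 0, 2, 2, 0, 1]
y_pred = [0, 0, 2, 2, 0, 2]

# Row i = true class i, column j = predicted class j
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

Here the entry in row 2, column 0 is 1: one sample of true class 2 was predicted as class 0.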

79
Confusion matrix can be displayed with the ConfusionMatrixDisplay
API in sklearn.metrics.

From a confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)

From estimators
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)

From predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

80
The classification_report function builds a text report showing
the main classiﬁcation metrics.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_predicted))

81
Classifier performance across probability thresholds

from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores, pos_label=2)
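Both curves need scores (probabilities or decision values) for the positive class, not hard predictions. A minimal sketch with made-up labels and scores, where label 2 is treated as the positive class:

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_curve

y_true = np.array([1, 1, 2, 2])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])  # scores for the positive class (label 2)

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores, pos_label=2)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_scores, pos_label=2)
roc_auc = auc(fpr, tpr)  # area under the ROC curve
print(fpr, tpr, roc_auc)
```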
82
How to extend binary metric to multiclass or
multilabel problems?
Treat data as a collection of binary problems, one for each class.
Then, average binary metric calculations across the set of classes.
Can be done using average parameter.

macro: calculates the mean of the binary metrics.
weighted: computes the average of binary metrics in which each class's
score is weighted by its presence in the true data sample.
micro: gives each sample-class pair an equal contribution to the
overall metric.
samples: calculates the metric over the true and predicted classes for
each sample in the evaluation data, and returns their average.
None: returns an array with the score for each class.
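The averaging options can be compared on one small multiclass example. The labels below are made up; precision_score is used, but the same average parameter applies to recall_score and f1_score.

```python
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

per_class = precision_score(y_true, y_pred, average=None)     # one score per class
macro = precision_score(y_true, y_pred, average="macro")       # unweighted mean of per-class scores
micro = precision_score(y_true, y_pred, average="micro")       # computed from global counts
weighted = precision_score(y_true, y_pred, average="weighted")  # mean weighted by class support
print(per_class, macro, micro, weighted)
```

Per-class precision is [2/3, 0, 0], so the macro average is 2/9, while the micro average is 2 correct out of 6 predictions, i.e. 1/3.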

83
Summary

Classiﬁcation speciﬁc cross validation iterator based on

stratiﬁcation.
Classiﬁcation metrics
Extending binary metrics to multi-learning set up.

84
Naive Bayes in scikit-learn

Dr. Ashish Tendulkar

IIT Madras
Machine Learning Practice

1
Naive Bayes Classiﬁer

2
Naive Bayes classiﬁer
Naive Bayes classiﬁer applies Bayes’ theorem with the “naive”
assumption of conditional independence between every pair of
features given the value of the class variable.

For a given class variable y and dependent feature vector
x_1 through x_m,
the naive conditional independence assumption is given by:

P(x_i | y, x_1, ..., x_{i-1}, x_{i+1}, ..., x_m) = P(x_i | y)

Naive Bayes learners and classiﬁers can be extremely fast compared

to more sophisticated methods.

3
List of NB Classiﬁers

Implemented in the sklearn.naive_bayes module:

GaussianNB
BernoulliNB
CategoricalNB
MultinomialNB
ComplementNB

Each implements the fit method to estimate the parameters of the NB
classifier with feature matrix and labels as inputs.

The prediction is performed using the predict method.

4
Which NB to use if data is only numerical?

GaussianNB implements the Gaussian Naive Bayes
algorithm for classification.

Instantiate a GaussianNB estimator and then call the fit method
using X_train and y_train.

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
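The snippet above assumes X_train and y_train already exist. A self-contained sketch, assuming the iris dataset (numerical features) and a stratified train/test split as stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
y_pred = gnb.predict(X_test)
test_accuracy = gnb.score(X_test, y_test)  # mean accuracy on the held-out data
print(test_accuracy)
```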

5
Which NB to use if data is multinomially distributed?

MultinomialNB implements the naive Bayes algorithm for
multinomially distributed data
(e.g. text classification).

Instantiate a MultinomialNB estimator and then call the fit method
using X_train and y_train.

from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB()
mnb.fit(X_train, y_train)

6
What to do if data is imbalanced ?

ComplementNB implements the complement naive Bayes
(CNB) algorithm.

Instantiate a ComplementNB estimator and then call the fit
method using X_train and y_train.

from sklearn.naive_bayes import ComplementNB

cnb = ComplementNB()
cnb.fit(X_train, y_train)

CNB regularly outperforms MNB (often by a considerable margin) on
text classification tasks.
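MultinomialNB and ComplementNB share the same interface, so both can be tried on the same word-count features. The tiny corpus below is made up and deliberately imbalanced; it only illustrates the API and says nothing about which model is better in general.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Tiny made-up corpus, imbalanced on purpose (4 "sport" vs 2 "tech" documents)
docs = [
    "the team won the match", "a great goal in the final match",
    "the coach praised the team", "fans cheered the winning goal",
    "new phone with a fast processor", "the laptop has a fast processor",
]
labels = ["sport", "sport", "sport", "sport", "tech", "tech"]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(docs)  # word-count features

mnb = MultinomialNB().fit(X_counts, labels)
cnb = ComplementNB().fit(X_counts, labels)

query = vectorizer.transform(["a phone with a fast processor"])
mnb_pred, cnb_pred = mnb.predict(query)[0], cnb.predict(query)[0]
print(mnb_pred, cnb_pred)
```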

7
What to do if data has multivariate Bernoulli
distributions?
BernoulliNB implements the naive Bayes algorithm for
data that is distributed according to
multivariate Bernoulli distributions.

Each feature is assumed to be a binary-valued
(Bernoulli, boolean) variable.

Instantiate a BernoulliNB estimator and then call the fit method
using X_train and y_train.

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB()
bnb.fit(X_train, y_train)

8
What to do if data is categorical ?
CategoricalNB implements the categorical naive Bayes
algorithm suitable for classification with
discrete features that are categorically
distributed.

It assumes that each feature, which is
described by the index i, has its own
categorical distribution.

Instantiate a CategoricalNB estimator and then call the fit method
using X_train and y_train.

from sklearn.naive_bayes import CategoricalNB

canb = CategoricalNB()
canb.fit(X_train, y_train)
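CategoricalNB expects every feature to be encoded as integers 0, ..., n_categories - 1, so string categories need an encoding step first. A minimal sketch, assuming a made-up weather-style dataset and OrdinalEncoder for the encoding:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data: (outlook, windy) -> play (1) or not (0)
X_raw = np.array([
    ["sunny", "no"], ["sunny", "yes"], ["rainy", "yes"],
    ["rainy", "no"], ["overcast", "no"], ["overcast", "yes"],
])
y = np.array([1, 1, 0, 0, 1, 1])

# Encode each category as an integer code
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_raw).astype(int)

canb = CategoricalNB().fit(X_encoded, y)
pred = canb.predict(encoder.transform([["rainy", "yes"]]).astype(int))
print(pred)
```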
9
K Nearest Neighbours

Dr. Ashish Tendulkar

IIT Madras
Machine Learning Practice

2
Nearest neighbor classiﬁer

It is a type of instance-based learning or non-generalizing learning:

it does not attempt to construct a model, but
simply stores instances of the training data.

Classiﬁcation is computed from a simple majority vote of the

nearest neighbors of each point.

Two different implementations of nearest neighbors classiﬁers are

available.

1. KNeighborsClassiﬁer
2. RadiusNeighborsClassiﬁer

3
How are KNeighborsClassiﬁer and
RadiusNeighborsClassiﬁer different?

KNeighborsClassifier:
learning based on the k nearest neighbors
most commonly used technique
choice of the value k is highly data-dependent

RadiusNeighborsClassifier:
learning based on the number of neighbors within a fixed radius r of each training point
used in cases where the data is not uniformly sampled
fixed value of r is specified, such that points in sparser
neighborhoods use fewer nearest neighbors for the classification

4
How do you apply KNeighborsClassiﬁer?
Step 1: Instantiate a KNeighborsClassifier estimator without passing
any arguments to it to create a classifier object.

from sklearn.neighbors import KNeighborsClassifier

kneighbor_classifier = KNeighborsClassifier()

Step 2: Call the fit method on the KNeighborsClassifier object with the training
feature matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
kneighbor_classifier.fit(X_train, y_train)

5
How do you specify the number of nearest
neighbors in KNeighborsClassiﬁer?

Specify the number of nearest neighbors K from the training

dataset using n_neighbors parameter.
value should be int.

kneighbor_classifier = KNeighborsClassifier(n_neighbors=3)

What is the default value of K?

n_neighbors = 5

6
How do you assign weights to neighborhood in
KNeighborsClassiﬁer?

It is often better to weight the neighbors such that nearer neighbors
contribute more to the fit.

weights:
'uniform': all points in each neighborhood are weighted equally.
'distance': weight points by the inverse of their distance;
closer neighbors of a query point will
have a greater influence than
neighbors which are further away.

Default:
kneighbor_classifier = KNeighborsClassifier(weights='uniform')
7
Can we deﬁne our own weight values for
KNeighborsClassiﬁer?

Yes. The weights parameter also accepts a user-defined function which
takes an array of distances as input, and returns an array of the
same shape containing the weights.

Example:

def user_weights(distances):
    # Illustration only: the distances are returned unchanged and used as weights
    return distances

kneighbor_classifier = KNeighborsClassifier(weights=user_weights)
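A user-defined weighting usually decreases with distance. The sketch below uses a hypothetical inverse-square weighting function (inverse_square_weights is not part of scikit-learn) on made-up one-dimensional data and compares it with uniform weights.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def inverse_square_weights(distances):
    # Hypothetical weighting: closer neighbors get much larger weights.
    # The small constant avoids division by zero for exact matches.
    return 1.0 / (distances ** 2 + 1e-9)

# One-dimensional toy data: class 0 near 0, class 1 near 10
X_train = np.array([[0.0], [1.0], [2.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1])

uniform_knn = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X_train, y_train)
custom_knn = KNeighborsClassifier(n_neighbors=5, weights=inverse_square_weights).fit(X_train, y_train)

query = np.array([[8.0]])
uniform_pred = uniform_knn.predict(query)[0]  # majority vote over all 5 points
custom_pred = custom_knn.predict(query)[0]    # vote dominated by the closest points
print(uniform_pred, custom_pred)
```

With uniform weights the three class-0 points win the vote; with the distance-based weights the two nearby class-1 points dominate.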

8
Which algorithm is used to compute the nearest
neighbors in KNeighborsClassiﬁer?

algorithm:
'ball_tree' will use BallTree
'kd_tree' will use KDTree
'brute' will use a brute-force search
'auto' will attempt to decide the most
appropriate algorithm based on the values
passed to the fit method.

Default:
kneighbor_classifier = KNeighborsClassifier(algorithm='auto')

9
Some additional parameters for tree algorithms in
KNeighborsClassifier
For 'ball_tree' and 'kd_tree' algorithms, there are some other
parameters to be set.

leaf_size:
can affect the speed of the construction and query, as well as the
memory required to store the tree
default = 30

metric:
distance metric to use for the tree
it is either a string or a callable function
some metrics are listed below:
"euclidean", "manhattan", "chebyshev",
"minkowski", "wminkowski",
"seuclidean", "mahalanobis"
default = 'minkowski'

p:
power parameter for the Minkowski metric
default = 2
10
How do you apply RadiusNeighborsClassiﬁer?
Step 1: Instantiate a RadiusNeighborsClassifier estimator without
passing any arguments to it to create a classifier object.

from sklearn.neighbors import RadiusNeighborsClassifier

radius_classifier = RadiusNeighborsClassifier()

Step 2: Call the fit method on the RadiusNeighborsClassifier object with the
training feature matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
radius_classifier.fit(X_train, y_train)

11
How do you specify the number of neighbors in
RadiusNeighborsClassiﬁer?

Instead of a fixed number of neighbors, the classifier uses all neighbors
within a fixed radius r, which is set using the radius parameter.
r is a float value.

radius_classifier = RadiusNeighborsClassifier(radius=1.0)

What is the default value of r ?

r = 1.0
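With a fixed radius, a query point can end up with no neighbors at all. The sketch below uses made-up one-dimensional data and sets outlier_label so that such points receive a placeholder label (-1 is an arbitrary choice here).

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X_train = np.array([[0.0], [0.5], [1.0], [5.0], [5.5]])
y_train = np.array([0, 0, 0, 1, 1])

# outlier_label is used for query points that have no neighbor within the radius
radius_classifier = RadiusNeighborsClassifier(radius=1.0, outlier_label=-1)
radius_classifier.fit(X_train, y_train)

preds = radius_classifier.predict(np.array([[0.4], [5.2], [20.0]]))
print(preds)
```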

12
Parameters for RadiusNeighborsClassiﬁer

weights: 'uniform', 'distance', or a [callable] function; default = 'uniform'
algorithm: 'ball_tree', 'kd_tree', 'brute', 'auto'; default = 'auto'
leaf_size: default = 30
metric: default = 'minkowski'
p: default = 2
13
Support Vector Machines

Dr. Ashish Tendulkar

IIT Madras
Machine Learning Practice

2
In this week, we will study how to implement support
vector machines for classiﬁcation tasks with sklearn.

3
Support Vector Machines
Support Vector Machines (SVM) are a set of supervised
learning methods used for classification, regression and outlier
detection.
SVM constructs a hyper-plane or set of hyper-planes in a high
or infinite dimensional space, which can be used for
classification, regression or other tasks.

In sklearn, we have three classes to implement SVM classification.

SVC and NuSVC: These are similar methods, but accept
slightly different sets of parameters.
Implementation is based on libsvm.
LinearSVC: Faster implementation of linear SVM
classification with only the linear kernel.
Implementation is based on liblinear.
4
Training data

Array X: holding the training samples
shape → (n_samples, n_features)
X = [[0, 0], [1, 1]]

Array y: holding the class labels (strings or integers)
shape → (n_samples,)
y = [0, 1]

5
How to implement SVC (C-Support Vector
Classiﬁcation)?

Step 1: Instantiate a SVC classiﬁer estimator.

from sklearn.svm import SVC

SVC_classifier = SVC()

Step 2: Call the fit method on the SVC classifier object with training feature
matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
SVC_classifier.fit(X_train, y_train)

6
How to perform regularization in SVC classiﬁer?

C: regularization parameter (float value)

Default:
SVC_classifier = SVC(C=1.0)

Note:
the strength of the regularization is inversely proportional to C
C must be strictly positive
the penalty is a squared l2 penalty

7
How to specify kernel type to be used in the
algorithm ?
kernel: 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'

Default:
SVC_classifier = SVC(kernel='rbf')

If kernel = 'poly', set degree (a non-negative integer; default 3).

If a callable is given as kernel, it is used to pre-compute the
kernel matrix from data matrices.
8
How to set kernel coefﬁcient for 'rbf', 'poly' and
'sigmoid' kernels?
gamma:
'scale': value of gamma = 1 / (number of features * X.var())
'auto': value of gamma = 1 / number of features
float value

Default:
SVC_classifier = SVC(gamma='scale')

If kernel = 'poly' or 'sigmoid', set coef0, which is the
independent term in the kernel function (a float value; default 0.0).
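The 'scale' formula can be checked by computing gamma by hand and passing it as a float. A sketch on a synthetic dataset (an assumption for illustration), comparing the decision functions of the two fitted models:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# gamma='scale' computed by hand: 1 / (n_features * X.var())
manual_gamma = 1.0 / (X.shape[1] * X.var())

clf_scale = SVC(kernel="rbf", gamma="scale").fit(X, y)
clf_manual = SVC(kernel="rbf", gamma=manual_gamma).fit(X, y)

# Both models should produce the same decision values
same = np.allclose(clf_scale.decision_function(X), clf_manual.decision_function(X))
print(manual_gamma, same)
```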

9
How to view support vectors?

After the classiﬁer is ﬁt on the training data, there are few attributes
which reveal the details of support vectors.

from sklearn.svm import SVC

SVC_classifier = SVC()
clf = SVC_classifier.fit(X_train, y_train)

# to view indices of the support vectors
clf.support_

# to view the support vectors
clf.support_vectors_

# to view the number of support vectors for each class
clf.n_support_

10
How to implement NuSVC (ν -Support Vector
Classiﬁcation)?

Step 1: Instantiate a NuSVC classiﬁer estimator.

from sklearn.svm import NuSVC

NuSVC_classifier = NuSVC()

Step 2: Call the fit method on the NuSVC classifier object with training feature
matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
NuSVC_classifier.fit(X_train, y_train)

11
What is the signiﬁcance of ν in NuSVC?

Instead of C in SVC, ν is introduced in NuSVC to control the

number of support vectors and margin errors.

ν is an upper bound on the fraction of margin errors and
a lower bound on the fraction of support vectors.

The value of ν should lie in the interval (0, 1].

Default: ν = 0.5

Other parameters for NuSVC are same as that of SVC.
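The lower bound on the fraction of support vectors can be observed empirically. This is a sketch on a synthetic balanced dataset (an assumption for illustration); the two ν values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

sv_fractions = {}
for nu in (0.1, 0.5):
    clf = NuSVC(nu=nu).fit(X, y)
    # Fraction of training points that ended up as support vectors
    sv_fractions[nu] = len(clf.support_) / len(X)
print(sv_fractions)
```

For each ν the fraction of support vectors should be roughly ν or larger.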

12
How to implement LinearSVC (Linear Support
Vector Classiﬁcation)?

Step 1: Instantiate a LinearSVC classiﬁer estimator.

from sklearn.svm import LinearSVC

LinearSVC_classifier = LinearSVC()

Step 2: Call the fit method on the LinearSVC classifier object with training feature
matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
LinearSVC_classifier.fit(X_train, y_train)

13
Advantages of LinearSVC

It has more ﬂexibility in the choice of penalties and loss

functions since it is implemented in terms of liblinear.
Scales better to large numbers of samples.
Supports both dense and sparse input.

14
How to provide penalty in LinearSVC classiﬁer?

penalty:
'l1' - adds an L1 penalty term; leads to coef_ vectors that are sparse.
'l2' - adds an L2 penalty term.

Default:
LinearSVC_classifier = LinearSVC(penalty='l2')

15
How to choose loss functions in LinearSVC
classiﬁer?

loss parameter:
'hinge' - standard SVM loss
'squared_hinge' - square of the hinge loss

Default:
LinearSVC_classifier = LinearSVC(loss='squared_hinge')

Combination not supported:
penalty='l1' and loss='hinge'
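The sparsity effect of the L1 penalty can be seen by counting zero coefficients. A sketch on synthetic data with only a few informative features; the value C=0.05 is an arbitrary choice that makes the regularization fairly strong.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=3, random_state=0)

# penalty='l1' requires loss='squared_hinge' and dual=False
l1_svc = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.05).fit(X, y)
l2_svc = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, C=0.05).fit(X, y)

n_zero_l1 = int(np.sum(l1_svc.coef_ == 0))  # coefficients driven exactly to zero
n_zero_l2 = int(np.sum(l2_svc.coef_ == 0))
print(n_zero_l1, n_zero_l2)
```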
16
Some parameters in LinearSVC classiﬁer
C: regularization parameter.

dual: selects the algorithm to solve either the dual or the primal
optimization problem.
When n_samples > n_features, prefer dual=False.

fit_intercept: whether to calculate the intercept for the model.

17
How to perform multi-class classiﬁcation using SVM?

SVC and NuSVC implement the "one-versus-one" approach
for multi-class classification.
decision_function_shape: 'ovo' or 'ovr'

LinearSVC implements the "one-vs-the-rest" approach for multi-class
classification.
multi_class: 'ovr' or 'crammer_singer'
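The multi-class strategy shows up in the shape of the decision function. A sketch assuming a synthetic dataset with k = 4 classes:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

X, y = make_blobs(n_samples=120, centers=4, random_state=0)  # k = 4 classes

# 'ovo' exposes one column per pair of classes, 'ovr' one column per class
ovo_shape = SVC(decision_function_shape="ovo").fit(X, y).decision_function(X).shape
ovr_shape = SVC(decision_function_shape="ovr").fit(X, y).decision_function(X).shape
linear_shape = LinearSVC().fit(X, y).decision_function(X).shape
print(ovo_shape, ovr_shape, linear_shape)
```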
18
Advantages of SVM
Effective in high dimensional spaces.
Effective in cases where number of dimensions is greater than
the number of samples.
Uses a subset of training points in the decision function
(called support vectors), so it is also memory efﬁcient.
Versatile: different Kernel Functions can be speciﬁed for the
decision function.

Disadvantages of SVM
SVMs do not directly provide probability estimates; these are
calculated using an expensive five-fold cross-validation.
If the number of features is much greater than the number of samples,
avoiding over-fitting when choosing kernel functions and the regularization term is crucial.
19
Decision trees
Machine Learning Practice

Dr. Ashish Tendulkar

IIT Madras

2
Quick recap
Non-parametric supervised learning methods.
Can learn classiﬁcation and regression models.
Predicts label based on rules inferred from the
features in the training set.

3
Tree algorithms

4
sklearn implementation of trees
scikit-learn uses an optimized version of the CART
algorithm; however, it does not support categorical
variables for now

Classiﬁcation sklearn.tree.DecisionTreeClassifier

Regression sklearn.tree.DecisionTreeRegressor

Both these estimators have the same set of parameters

except for criterion used for tree splitting.

splitter max_depth min_samples_split min_samples_leaf

5
sklearn tree parameters
splitter: strategy for splitting at each node. Values: best, random.

max_depth (int): maximum depth of the tree.
When None, the tree is expanded until all leaves are
pure or they contain fewer than min_samples_split
samples.

min_samples_split (int or float, default 2): the minimum number of samples required
to split an internal node.

min_samples_leaf (int or float, default 1): the minimum number of samples required
to be at a leaf node.
6
sklearn tree parameters

criterion: specifies the function to measure the quality
of a split.

Classification: gini, entropy
Regression: squared_error, friedman_mse, absolute_error, poisson
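The tree parameters above can be combined in one estimator. A sketch assuming the iris dataset; export_text is used here as a plain-text way to inspect the learned rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

tree_clf = DecisionTreeClassifier(
    criterion="entropy", max_depth=2, min_samples_leaf=5, random_state=0
).fit(X, y)

depth = tree_clf.get_depth()        # never exceeds max_depth
n_leaves = tree_clf.get_n_leaves()  # at most 2 ** max_depth
print(export_text(tree_clf))        # plain-text view of the learned rules
```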

7
Tree visualization
sklearn.tree.plot_tree

decision_tree: the decision tree to be plotted.
max_depth: the maximum depth of the representation. If
None, the tree is fully generated.
feature_names: names of each of the features. Default: None.
class_names: names of each of the target classes in
ascending numerical order. Default: None.
label: whether to show informative labels for impurity, etc.
Options are 'all' (default), 'root' and 'none'.
8
Avoiding overﬁtting of trees

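Pre-pruning
Limits tree growth during training with hyper-parameters such as max_depth and
min_samples_split, using a hyper-parameter search like GridSearchCV for
finding the best set of parameters.

Post-pruning
First grows the tree without any constraints and then prunes it with
minimal cost complexity pruning (cost_complexity_pruning_path
and the ccp_alpha parameter).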
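Post-pruning with minimal cost complexity pruning can be sketched as follows. The dataset (breast cancer) and the way alpha is picked from the pruning path are assumptions for illustration; in practice alpha would be chosen by cross validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow an unconstrained tree and compute the effective alphas of its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)

# Pick one alpha from the path (an arbitrary choice, normally selected by cross validation)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

n_full, n_pruned = full_tree.get_n_leaves(), pruned_tree.get_n_leaves()
print(n_full, n_pruned)
```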

9
Tips for practical usage
Decision trees tend to overﬁt data with a large number of
features. Make sure that we have the right ratio of samples
to number of features.

Perform dimensionality reduction (PCA, or Feature

Selection) on a data before using it for training the trees.
It gives a better chance of ﬁnding discriminative features.

Visualize the trained tree by using max_depth=3 as an

initial tree depth to get a feel for the ﬁtment and then
increase the depth.

Balance the dataset before training to prevent the tree

from being biased toward the classes that are dominant.
10
Use min_samples_split or min_samples_leaf to ensure
that multiple samples inﬂuence every decision in the tree,
by controlling which splits will be considered.
A very small number will usually mean the tree will
overﬁt.
A large number will prevent the tree from learning the
data.

11
Bagging and Boosting

Machine Learning Practice

Dr. Ashish Tendulkar

IIT Madras

2
Part 2: Boosting

17
There are two boosting estimators:
AdaBoost estimator
Gradient boosting estimator

18
AdaBoost estimator

Class: sklearn.ensemble.AdaBoostClassifier
Class: sklearn.ensemble.AdaBoostRegressor

19
Class: sklearn.ensemble.AdaBoostClassifier

base_estimator: default estimator is
DecisionTreeClassifier with depth = 1.

n_estimators: maximum number of estimators at which
boosting is terminated. The default value
is 50.

learning_rate: weight applied to each classifier during
boosting.
A higher value increases the
contribution of each individual classifier.
There is a trade-off between
n_estimators and learning_rate.
20
Class: sklearn.ensemble.AdaBoostRegressor

base_estimator: default estimator is
DecisionTreeRegressor with depth = 3.

n_estimators: maximum number of estimators at which
boosting is terminated. The default value
is 50.

learning_rate: weight applied to each regressor at
each boosting iteration.
A higher value increases the
contribution of each individual regressor.
There is a trade-off between
n_estimators and learning_rate.
21
The main parameters to tune to obtain good results are
n_estimators and
the complexity of the base estimators (e.g. its depth
max_depth or min_samples_split).

22
Attributes of AdaBoost estimators

base_estimator_: base estimator of the ensemble.

estimators_: collection of fitted sub-estimators.

estimator_weights_: weights for each estimator in the ensemble.

estimator_errors_: errors for each estimator in the ensemble.
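The attributes listed above can be inspected after fitting. A sketch on a synthetic dataset, relying on the default base estimator so that no base estimator argument has to be passed; the learning_rate value is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Default base estimator: a depth-1 decision tree (a "stump")
ada = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0).fit(X, y)

n_fitted = len(ada.estimators_)              # may be fewer than n_estimators on early termination
stump_depth = ada.estimators_[0].get_depth()
weights = ada.estimator_weights_[:n_fitted]  # weight of each fitted estimator
errors = ada.estimator_errors_[:n_fitted]    # error of each fitted estimator
print(n_fitted, stump_depth, weights[:3], errors[:3])
```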

23
Gradient boosting estimators

Class: sklearn.ensemble.GradientBoostingClassifier
Class: sklearn.ensemble.GradientBoostingRegressor

The two most important parameters of these estimators are:

n_estimators
learning_rate

sklearn.ensemble.GradientBoostingClassifier supports both binary and
multiclass classification.
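A minimal multiclass sketch with the two parameters named above, assuming the iris dataset and a stratified train/test split:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three classes
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbc.fit(X_train, y_train)

# estimators_ has one row per boosting stage and one column per class
stage_shape = gbc.estimators_.shape
test_accuracy = gbc.score(X_test, y_test)
print(stage_shape, test_accuracy)
```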

24
We will directly demonstrate XGBoost
through colab demonstration.

25
Bagging and Boosting

Machine Learning Practice

Dr. Ashish Tendulkar

IIT Madras

2
Contents

Part 1: Voting, bagging and random forest

Part 2: Boosting and gradient boosting

Part 3: XGBoost

3
Voting estimators
Class: sklearn.ensemble.VotingClassifier
Class: sklearn.ensemble.VotingRegressor
Both these estimators take the following common parameters:
estimators, weights

Both these estimators implement the following functions:

fit, predict, fit_transform, score

VotingClassifier takes an additional argument:

voting: hard or soft
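Hard and soft voting can be compared with the same list of estimators. The sketch below assumes the iris dataset and three arbitrary base models; the weights are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# (name, estimator) pairs, as expected by the estimators parameter
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("nb", GaussianNB()),
    ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
]
# Hard voting: majority vote over predicted labels
hard_vote = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)
# Soft voting: weighted average of predicted probabilities
soft_vote = VotingClassifier(estimators=estimators, voting="soft", weights=[2, 1, 1]).fit(X, y)

hard_acc, soft_acc = hard_vote.score(X, y), soft_vote.score(X, y)
print(hard_acc, soft_acc)
```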
4
Bagging estimators

Class: sklearn.ensemble.BaggingClassifier
Class: sklearn.ensemble.BaggingRegressor

5
Common parameters
base_estimator (default=None): base estimator to fit on
random subsets of the dataset
n_estimators (default=10): number of base estimators
in the ensemble
max_samples (default=1.0): number of samples to
draw from X to train each base estimator (with
replacement by default)
max_features (default=1.0): number of features to
draw from X to train each
base estimator (without
replacement by default)
bootstrap (default=True): whether samples are
drawn with replacement
6
Common parameters

bootstrap_features (default=False): whether features are
drawn with replacement

oob_score (default=False): whether to use out-of-bag
samples to estimate the
generalization error
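The parameters above can be put together as follows. This is a sketch on synthetic data; the base estimator is passed positionally, and the sampling fractions are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=50,
    max_samples=0.8,           # each tree is trained on a sample of 80% of the data size
    max_features=0.5,          # and on 50% of the features
    bootstrap=True,
    oob_score=True,
    random_state=0,
).fit(X, y)

oob_accuracy = bag.oob_score_  # accuracy estimated on out-of-bag samples
print(oob_accuracy)
```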

7
Random forest estimators
Class: sklearn.ensemble.RandomForestClassifier
Class: sklearn.ensemble.RandomForestRegressor

The parameters can be classiﬁed as

Bagging parameters
Decision tree parameters

8
Bagging parameters
The number of trees is specified by n_estimators.
The default number of trees is 100 for both classification and regression
(it was 10 before scikit-learn 0.22).

bootstrap speciﬁes whether to use bootstrap samples for

training.
True : bootstrapped samples are used.
False : whole dataset is used.

oob_score speciﬁes whether to use out-of-bag samples for

estimating generalization error. It is only available when
bootstrap = True .

9
Bagging parameters
max_samples specifies the number of samples to be drawn
while bootstrapping.
None: use all samples in the training data.
int: use max_samples samples from the training data.
float: use max_samples * total number of samples from the training data.
The value should be between 0 and 1.

random_state controls the randomness of the features and
samples selected during bootstrap.

10
The number of features to be considered while splitting is
specified by max_features.
Accepted values: auto, sqrt, log2, int, float, None

Value: max_features
int: value specified
float: value * #features
auto: sqrt(#features)
sqrt: sqrt(#features)
log2: log2(#features)
None: #features

11
Decision tree parameters

12
The criterion for splitting the node is specified through criterion.
Default for classification: gini
Default for regression: squared_error

The depth of the tree is controlled by max_depth. The default value
is None, which means the tree will be grown until all leaf nodes are
pure or until leaves contain fewer than min_samples_split samples.

An internal node is split only if it contains at least
min_samples_split samples.
When it is specified as an integer, it is taken as the minimum
number of samples.
When it is specified as a float, the minimum number of samples
is calculated as ceil(min_samples_split × n), where n is the number of training samples.
13
The tree growth can also be controlled by min_impurity_decrease
parameter.
A node will be split if it reduces impurity at least by the value
specified in this parameter.

The complexity of tree can also be controlled by ccp_alpha

parameter through minimal cost complexity pruning procedure.

14
Trained random forest estimators

The estimators_ member variable contains the collection of
fitted estimators.

The feature_importances_ member variable contains the
impurity-based importance of each feature.

15
Training and inference for random forest

fit builds a forest of trees from the training dataset with
the specified parameters.

decision_path returns the decision path in the
forest.
predict returns the class label in classification and the output
value in regression.

predict_proba and predict_log_proba return
probabilities and their logs in the classification setup.
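The training and inference methods, together with the fitted attributes, can be sketched on the iris dataset (an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", oob_score=True, random_state=0
).fit(X, y)

n_trees = len(forest.estimators_)          # fitted decision trees
importances = forest.feature_importances_  # one score per feature
proba = forest.predict_proba(X[:2])        # class probabilities for two samples
print(n_trees, importances, forest.oob_score_)
```

The importances sum to 1, and each row of predict_proba sums to 1 across the classes.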

16
Neural Networks

Dr. Ashish Tendulkar

IIT Madras
Machine Learning Practice

2
In this week, we will study how to implement Multilayer
Perceptron neural network models for classiﬁcation and
regression tasks with sklearn.

3
Multilayer Perceptron (MLP)
It is a supervised learning algorithm.

MLP learns a non-linear function approximator for either

classiﬁcation or regression depending on the given dataset.

In sklearn, we implement MLP using:

1. MLPClassiﬁer for classiﬁcation
2. MLPRegressor for regression

MLPClassiﬁer supports multi-class classiﬁcation by applying

Softmax as the output function.
It also supports multi-label classification in which a
sample can belong to more than one class.

MLPRegressor also supports multi-output regression, in which a
sample can have more than one target.
4
Training data

Feature matrix X: shape → (n_samples, n_features)
Label vector y: shape → (n_samples,)

5
MLPClassiﬁer
How to implement MLPClassiﬁer?

Step 1: Instantiate a MLP classiﬁer estimator.

from sklearn.neural_network import MLPClassifier

MLP_clf = MLPClassifier()

Step 2: Call the fit method on the MLP classifier object with training feature
matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
MLP_clf.fit(X_train, y_train)

6
MLPClassiﬁer

Step 3: After fitting (training), the model can make predictions for new
samples (X_test) using two methods:

MLP_clf.predict(X_test)
gives labels for new samples
for example:
array([1, 0])

MLP_clf.predict_proba(X_test)
gives a vector of probability estimates per
sample
for example:
array([1.967...e-04, 9.998...-01])

MLPClassifier supports only the Cross-
Entropy loss function.
7
How to set the number of hidden layers?

The hidden_layer_sizes parameter sets the number of layers and the number of
neurons in each layer.

It is a tuple where the i th element represents the
number of neurons in the i th hidden layer.
The length of the tuple denotes the total number of hidden layers in
the network.

To create a 3 hidden layer neural network with 15 neurons in the first
layer, 10 neurons in the second layer and 5 neurons in the third layer:

MLPClassifier(hidden_layer_sizes=(15,10,5))
8
How to perform regularization in MLPClassiﬁer?

The alpha parameter sets the strength of the L2 regularization penalty
(float value).

Default:
alpha = 0.0001

9
How to set the activation function for the hidden
layers?
'identity': no-op activation, returns f(x) = x
'logistic': logistic sigmoid function, returns f(x) = 1 / (1 + exp(−x))
'tanh': hyperbolic tan function, returns f(x) = tanh(x)
'relu' (default): rectified linear unit function, returns f(x) = max(0, x)

10
How to perform weight optimization in MLPClassifier?

MLPClassifier optimizes the log-loss function using LBFGS or stochastic gradient descent. The solver parameter takes one of three values:

'lbfgs'
'sgd'
'adam' (default)

If the solver is 'lbfgs', the classifier will not use minibatches.
For the stochastic optimizers ('sgd' and 'adam'), the minibatch size can be set with batch_size (int).
The default batch_size is 'auto', which means:

batch_size=min(200, n_samples)
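A small comparison of the three solvers is sketched below. The synthetic data, the choice batch_size=32 for the stochastic solvers, and the scores dictionary are assumptions for illustration; training accuracy is reported only to show that each configuration fits, not as a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data (assumed for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = {}
for solver in ("lbfgs", "sgd", "adam"):
    # batch_size only matters for the stochastic solvers; lbfgs uses the full batch
    kwargs = {} if solver == "lbfgs" else {"batch_size": 32}
    clf = MLPClassifier(solver=solver, max_iter=500, random_state=0, **kwargs)
    clf.fit(X, y)
    scores[solver] = clf.score(X, y)   # training accuracy

print(scores)
```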
How to view weight matrix coefficients of trained MLPClassifier?

The coefs_ attribute is a list of shape (n_layers - 1,).

The i-th element in the list represents the weight matrix corresponding to layer i.

Example:

# weights between input and first hidden layer
print(MLP_clf.coefs_[0])

# weights between first hidden and second hidden layer
print(MLP_clf.coefs_[1])
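Printing the shapes rather than the values makes the layer structure easier to read. The sketch below assumes synthetic binary data with 8 features and two hidden layers of 15 and 10 neurons; those sizes are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary data with 8 features (assumed for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

MLP_clf = MLPClassifier(
    hidden_layer_sizes=(15, 10), max_iter=1000, random_state=0
).fit(X, y)

# One weight matrix per pair of consecutive layers
weight_shapes = [W.shape for W in MLP_clf.coefs_]
print(len(MLP_clf.coefs_), MLP_clf.n_layers_ - 1)
print(weight_shapes)   # (n_inputs, 15), (15, 10), (10, n_outputs)
```

For a binary problem the output layer has a single unit, so the last matrix has one column.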
How to view bias vector of trained MLPClassifier?

The intercepts_ attribute is a list of shape (n_layers - 1,).

The i-th element in the list represents the bias vector corresponding to layer i + 1.

Example:

# bias values for first hidden layer
print(MLP_clf.intercepts_[0])

# bias values for second hidden layer
print(MLP_clf.intercepts_[1])
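To see how the weight matrices and bias vectors fit together, the forward pass can be reproduced by hand and compared with predict_proba. This is a sketch for the binary case with 'relu' hidden activations, where the single output unit uses the logistic function; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary data (assumed for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
MLP_clf = MLPClassifier(
    hidden_layer_sizes=(15, 10), activation="relu",
    max_iter=1000, random_state=0,
).fit(X, y)

bias_shapes = [b.shape for b in MLP_clf.intercepts_]
print(bias_shapes)

# Manual forward pass: hidden layers use relu
a = X[:5]
for W, b in zip(MLP_clf.coefs_[:-1], MLP_clf.intercepts_[:-1]):
    a = np.maximum(0.0, a @ W + b)

# Output layer: logistic function for a binary target
z = a @ MLP_clf.coefs_[-1] + MLP_clf.intercepts_[-1]
p_manual = (1.0 / (1.0 + np.exp(-z))).ravel()

# Probability of the positive class as reported by scikit-learn
p_sklearn = MLP_clf.predict_proba(X[:5])[:, 1]
print(np.allclose(p_manual, p_sklearn))
```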
Some parameters in MLPClassifier

learning_rate_init: float value, default: 0.001
learning_rate: 'constant', 'invscaling' or 'adaptive', default: 'constant'
power_t: float value, default: 0.5
int value, default: 500

learning_rate and power_t are used only for solver='sgd'.
learning_rate_init is used when solver='sgd' or 'adam'.
shuffle is used to shuffle samples in each iteration when solver='sgd' or 'adam'.
momentum is used for the gradient descent update when solver='sgd'.
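The SGD-specific parameters can be set together as in the sketch below. The particular values (learning_rate_init=0.01, momentum=0.9, the 'adaptive' schedule) and the synthetic data are assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data (assumed for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

MLP_clf = MLPClassifier(
    solver="sgd",
    learning_rate="adaptive",   # schedule; only used with solver='sgd'
    learning_rate_init=0.01,    # initial step size; used by 'sgd' and 'adam'
    power_t=0.5,                # only takes effect with learning_rate='invscaling'
    momentum=0.9,               # momentum for the SGD update
    shuffle=True,               # reshuffle samples in each iteration
    max_iter=500,
    random_state=0,
)
MLP_clf.fit(X, y)

train_acc = MLP_clf.score(X, y)
print(MLP_clf.n_iter_, train_acc)
```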
MLPRegressor

MLPRegressor trains using backpropagation with no activation function in the output layer.
Therefore, it uses the squared error as the loss function, and the output is a set of continuous values.

The parameters of MLPRegressor are the same as those of MLPClassifier.
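The absence of an output activation can be checked on a fitted model through the out_activation_ attribute. The sketch below contrasts a regressor with a binary classifier on synthetic data; the data and variable names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y_reg = X[:, 0] + X[:, 1]               # continuous target
y_class = (X[:, 0] > 0).astype(int)     # binary target

MLP_reg = MLPRegressor(max_iter=1000, random_state=0).fit(X, y_reg)
MLP_clf = MLPClassifier(max_iter=1000, random_state=0).fit(X, y_class)

print(MLP_reg.out_activation_)   # 'identity': no activation on the output
print(MLP_clf.out_activation_)   # 'logistic' for a binary target
```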
How to implement MLPRegressor?

Step 1: Instantiate an MLP regressor estimator.

from sklearn.neural_network import MLPRegressor
MLP_reg = MLPRegressor()

Step 2: Call the fit method on the MLP regressor object with the training feature matrix and label vector as arguments.

# Model training with feature matrix X_train and
# label vector or matrix y_train
MLP_reg.fit(X_train, y_train)
Step 3: After fitting (training), the model can make predictions for new samples (X_test):

MLP_reg.predict(X_test)
returns predicted values for new samples, for example:
array([-0.9..., -7.1...])

MLP_reg.score(X_test, y_test)
returns the R2 score, for example:
0.45678889
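The regressor workflow can be run end to end as a minimal sketch. The synthetic linear data with noise, the split, and the hyperparameter values are assumptions for illustration; r2_score is used only to confirm that score reports the same quantity.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic regression data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Step 1: instantiate, Step 2: fit
MLP_reg = MLPRegressor(max_iter=2000, random_state=0)
MLP_reg.fit(X_train, y_train)

# Step 3: predict and evaluate
pred = MLP_reg.predict(X_test)
r2 = MLP_reg.score(X_test, y_test)

print(pred[:2])
print(r2, r2_score(y_test, pred))   # the two values should agree
```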
