MLP Slides Merged
Project
Machine Learning Practice Course
Outline
1. Steps in ML projects
2. Illustration through practical set up
ML Project
The Excellent Wine Company wants to develop an ML model that predicts wine
quality from certain physicochemical characteristics, in order to replace an
expensive quality sensor.
Let's understand the steps involved in addressing this problem.
Steps in ML projects
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor and maintain your system.
A few words of wisdom
Step 1: Look at the big picture
1. Frame the problem
2. Select a performance measure
3. List and check the assumptions
1.1 Frame the problem
1.2 Selection of performance measure
Regression
Mean Squared Error (MSE) or
Mean Absolute Error (MAE)
Classification
Precision
Recall
F1-score
Accuracy
1.3 Check the assumptions
List down various assumptions about the task.
Review with domain experts and other teams that plan to consume ML output.
Make sure all assumptions are reviewed and approved before coding!
Step 2: Get the data
Data is often spread across multiple tables, files or documents with access control.
Obtain appropriate access controls and authorizations.
Get familiar with the data by looking at the schema and a few rows. (Familiarity with
SQL is useful here.)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Let's first access our data - in this case, we need to download it from the web.
It's a good practice to create a function for downloading and extracting the
data.
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(data_url, sep=";")
2.1 Check data samples
data.head()
2.2 Features
It's a good idea to understand the significance of each feature by consulting the experts.
Feature: Significance
Fixed acidity: Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
Volatile acidity: The amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste.
Citric acid: Found in small quantities, citric acid can add 'freshness' and flavor to wines.
Residual sugar: It's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet.
Chlorides: The amount of salt in the wine.
⋮
Alcohol: The percentage of alcohol content in the wine.
(Credits: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009)
feature_list = data.columns[:-1].values
label = [data.columns[-1]]

print("Feature list:", feature_list)
print("Label:", label)
2.3 Data statistics
data.info()
Total entries: 1599 (a tiny dataset by ML standards).
There are 12 columns in total: 11 features + 1 label.
Label column: quality
Features: [fixed acidity, volatile acidity, citric acid, residual sugar,
chlorides, free sulfur dioxide, total sulfur dioxide, density, pH,
sulphates, alcohol]
All feature columns are numeric (float64) and the label is an integer.
In order to understand the nature of the numeric attributes, we use the describe() method.
data.describe()
It prints the count and statistical properties of each column: mean, standard deviation,
minimum, maximum and quartiles.
The wine quality can be between 0 and 10, but in this dataset, the quality values
are between 3 and 8. Let's look at the distribution of examples by wine quality.
data['quality'].value_counts()
This information can also be viewed as a histogram.
A histogram gives the count of samples that fall within each specific range (bin).
The x-axis denotes the range of values of a feature, and
the y-axis denotes the frequency of samples with those values.
sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')
Note the taller bars for qualities 5 and 6 compared to the other qualities.
In a similar manner, we can plot all numerical attributes with histogram plot for quick
examination.
A few observations based on these plots:
Before any further exploration, it's a good idea to set aside the test set and not look at it,
so that we have a clean evaluation set.
2.4 Create test set
If we look at the test set, we are likely to notice patterns in it, and based
on those patterns we may select certain models.
This leads to a biased (overly optimistic) performance estimate on the test set,
which may not hold up in practice. This is called data snooping bias.
Let's write a function to split the data into training and test. Make sure to set the seed so
that we get the same test set in the next run.
Random sampling
from sklearn.model_selection import train_test_split
We can read the documentation for this function with the following line (IPython/Jupyter syntax):
?train_test_split
Let's perform random sampling on our dataset:
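A minimal sketch of the random split, assuming a small synthetic DataFrame in place of the wine data (the column values below are made up for illustration). Fixing random_state is what makes the same test set come back on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the wine data: 100 rows, two features and a label.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=100),
    "pH": rng.normal(3.3, 0.15, size=100),
    "quality": rng.integers(3, 9, size=100),
})

# Hold out 20% of the rows; random_state fixes the shuffle so that
# every run produces the same split.
train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)
print(len(train_set), len(test_set))

# Running the split again with the same seed selects the same test rows.
_, test_set_again = train_test_split(data, test_size=0.2, random_state=42)
same_rows = test_set.index.equals(test_set_again.index)
print(same_rows)
```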
Stratified sampling
Data distribution may not be uniform in real world data.
Random sampling - by its nature - introduces biases in such data sets.
Recall the label distribution in our dataset: It's not uniform!
sns.set()
data.quality.hist()
plt.xlabel('Wine Quality')
plt.ylabel('Count')
Let's examine the test set distribution by the wine quality that was used for stratified
sampling.
Now compare this with the overall distribution:
You can notice that there is only a small difference in most strata.
dist_comparison
Let's contrast this with random sampling:
Sampling bias comparison
Compare the difference in distribution of stratified and uniform sampling:
Stratified sampling gives us test distribution closer to the overall distribution than the
random sampling.
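The comparison can be sketched as follows. This is an illustration on synthetic, deliberately imbalanced labels, not the wine data; the column names and class proportions are assumptions. StratifiedShuffleSplit is one scikit-learn way to stratify on the label.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Hypothetical imbalanced labels standing in for the 'quality' column.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "alcohol": rng.normal(10.0, 1.0, size=1000),
    "quality": rng.choice([4, 5, 6, 7], size=1000, p=[0.05, 0.45, 0.40, 0.10]),
})

# Stratified split: each quality value keeps its share in the test set.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, data["quality"]))
strat_test_set = data.iloc[test_idx]

# Plain random split for comparison.
_, random_test_set = train_test_split(data, test_size=0.2, random_state=42)

def proportions(frame):
    """Share of rows per quality value."""
    return frame["quality"].value_counts(normalize=True).sort_index()

dist_comparison = pd.DataFrame({
    "overall": proportions(data),
    "stratified": proportions(strat_test_set),
    "random": proportions(random_test_set),
}).fillna(0.0)
dist_comparison["strat_error"] = (dist_comparison["stratified"] - dist_comparison["overall"]).abs()
dist_comparison["random_error"] = (dist_comparison["random"] - dist_comparison["overall"]).abs()
print(dist_comparison)
```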
Step 3: Data visualization
Performed on the training set.
In case of a large training set, sample examples to form an exploration set.
Helps us understand the features and their relationships among themselves and with
the output label.
In our case, we have a small training set and we use all of it for data exploration. There is
no need to create a separate exploration set.
It's a good idea to create a copy of the training set so that we can freely manipulate it
without worrying about changing the original set.
exploration_set = strat_train_set.copy()
Scatter Visualization
With seaborn library:
With matplotlib:
Relationship between features
Standard correlation coefficient between features.
Ranges from -1 to +1.
Correlation = +1: Perfect positive linear correlation between features
Correlation = -1: Perfect negative linear correlation between features
Correlation = 0: No linear correlation between features
Visualization with heat map
Only captures linear relationships between features.
For non-linear (monotonic) relationships, use rank correlation.
corr_matrix = exploration_set.corr()
Let's check features that are correlated with the label, which is quality in our case.
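One way to do this, sketched on synthetic data (the columns and coefficients below are assumptions, not the wine data): take the label column of the correlation matrix and sort it. The same call with method="spearman" gives a rank correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical exploration set: quality rises with alcohol and falls with acidity.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.0, 1.0, size=200)
acidity = rng.normal(0.5, 0.1, size=200)
exploration_set = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": acidity,
    "quality": 0.8 * alcohol - 3.0 * acidity + rng.normal(0.0, 0.5, size=200),
})

# Pearson (linear) correlation of every column with the label, strongest first.
corr_matrix = exploration_set.corr()
label_corr = corr_matrix["quality"].sort_values(ascending=False)
print(label_corr)

# Rank (Spearman) correlation captures monotonic, not only linear, relationships.
rank_corr = exploration_set.corr(method="spearman")
print(rank_corr["quality"])
```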
Let's visualize correlation matrix with heatmap:
plt.figure(figsize=(14,7))
sns.heatmap(corr_matrix, annot=True)
You can notice:
The correlation coefficient on the diagonal is +1.
Darker colors represent negative correlations, while fainter colors denote positive correlations.
For example, citric acid and fixed acidity have a strong positive correlation, while
pH and fixed acidity have a strong negative correlation.
from pandas.plotting import scatter_matrix
attribute_list = ['citric acid', 'pH', 'alcohol', 'sulphates', 'quality']
scatter_matrix(exploration_set[attribute_list])
For convenience of visualization, we show it for a small number of attributes.
Similar analysis can be carried out with combined features - features that are
derived from the original features.
Note of wisdom
1. Visualization and data exploration do not have to be absolutely thorough.
2. The objective is to get quick insight into the features and their relationships with
other features and labels.
3. Exploration is an iterative process: once we build a model and obtain more insights,
we can come back to this step.
Step 4: Prepare data for ML algorithm
We often need to preprocess the data before using it for model building, for a variety
of reasons:
Due to errors in data capture, data may contain outliers or missing values.
Different features may be at different scales.
The current data distribution is not exactly amenable to learning.
Typical steps in data preprocessing are as follows:
1. Separate features and labels.
2. Handle missing values and outliers.
3. Scale features to bring all of them to the same scale.
4. Apply certain transformations, such as log or square root, to the features.
It's a good practice to make a copy of the data and apply preprocessing to that copy.
This ensures that in case something goes wrong, we will at least have the original copy of
the data intact.
4.1 Separate features and labels from the training set.
4.2 Data cleaning
Let's first check if there are any missing values in feature set: One way to find that out is
column-wise.
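A minimal sketch of both steps on a tiny made-up training set (the values are hypothetical; the variable names follow the ones used elsewhere in these slides): first separate features and labels, then count missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical training set with two features and the 'quality' label.
strat_train_set = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.5],
    "pH": [3.51, 3.20, 3.26, np.nan],
    "quality": [5, 5, 6, 7],
})

# Separate features and labels; drop() returns a copy and leaves the original intact.
wine_features = strat_train_set.drop("quality", axis=1)
wine_labels = strat_train_set["quality"].copy()

# Column-wise count of missing values in the feature set.
missing_per_column = wine_features.isna().sum()
print(missing_per_column)
```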
Pandas provides the following methods to drop rows containing missing values:
dropna()
drop()
Scikit-Learn provides the SimpleImputer class for filling in missing values with, say, the median value.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
The strategy specifies how to replace the missing values. In this case,
we specify that each missing value should be replaced by the median value of its feature.
imputer.fit(wine_features)
In case the features contain non-numeric attributes, they need to be dropped before
calling the fit method on the imputer object.
Let's check the statistics learnt by the imputer on the training set:
imputer.statistics_
array([ 7.9 , 0.52 , 0.26 , 2.2 , 0.08 , 14. , 39. , 0.99675, 3.31 , 0.62 , 10.2 ])
Note that these are the median values for each feature. We can cross-check this by calculating
the median on the feature set:
wine_features.median()
Finally, we use the trained imputer to transform the training set such that the missing
values are replaced by the medians:
tr_features = imputer.transform(wine_features)
This returns a NumPy array, which we can convert to a dataframe if needed:
tr_features.shape
(1279, 11)
4.3 Handling text and categorical attributes
4.3.1 Converting categories to numbers:
One issue with this representation is that the ML algorithm would assume that
two nearby values are more similar than two distant ones.
4.3.2 Using one hot encoding
Here we create one binary feature per category - the feature value is 1 when the category
is present else it is 0.
Only one feature is 1 (hot) and the rest are 0 (cold).
The new features are referred to as dummy features.
Scikit-Learn provides a OneHotEncoder class to convert categorical values into one-hot
vectors.
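The two encodings can be compared side by side. This sketch uses a hypothetical categorical column of place names (the wine data has no such column) and assumes OrdinalEncoder for the category-to-number conversion.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature, one column with three distinct categories.
places = np.array([["Porto"], ["Douro"], ["Minho"], ["Douro"]])

# Categories to numbers: one integer per category, in sorted category order.
ordinal_encoder = OrdinalEncoder()
ordinal = ordinal_encoder.fit_transform(places)
print(ordinal_encoder.categories_)
print(ordinal.ravel())

# One-hot encoding: one binary column per category, exactly one 1 per row.
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(places).toarray()  # sparse by default
print(one_hot)
```

With the ordinal codes, "Douro" (0) looks closer to "Minho" (1) than to "Porto" (2), which is an artifact of the encoding; the one-hot vectors are all equally far apart.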
4.4 Feature Scaling
Most ML algorithms do not perform well when input features are on very different
scales.
Scaling of target label is generally not required.
4.5.2 Standardization
We subtract the mean value of each feature from the current value and divide by the
standard deviation, so that the resulting feature has zero mean and unit variance.
While normalization bounds values between 0 and 1, standardization does not bound
values to a specific range.
Standardization is less affected by outliers than normalization.
Scikit-Learn provides the StandardScaler transformer for feature standardization.
Note that all these transformers are learnt on the training data and then applied to both the
training and test data to transform them.
Never learn these transformers on the full dataset.
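A small sketch of this rule with made-up numbers: the scaler learns the mean and standard deviation from the training rows only, and the same statistics are reused for the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std on training data
X_test_scaled = scaler.transform(X_test)        # reuse them, no refitting

print(scaler.mean_, scaler.scale_)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Note that the test value 400 lands outside the range seen in training, which shows that standardized values are not bounded.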
Transformation Pipeline
Scikit-Learn provides a Pipeline class to line up transformations in an intended order.
Here is an example pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
transform_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
wine_features_tr = transform_pipeline.fit_transform(wine_features)
The output of one step is passed on to the next one in sequence until it reaches the last
step.
Here the pipeline first performs imputation of missing values, and its result is passed on for
standardization.
The pipeline exposes the same methods as the final estimator.
Here StandardScaler is the last estimator and, since it is a transformer, we call the
fit_transform() method on the Pipeline object.
How to transform mixed features?
The real world data has both categorical as well as numerical features and we need to
apply different transformations to them.
Scikit-Learn introduced ColumnTransformer for this purpose.
In our dataset, we do not have features of mixed types. All our features are numeric.
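For illustration purposes, here is an example code snippet:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
num_attribs = list(wine_features)
# "place_of_manufacturing" is a hypothetical categorical column; the wine data has none.
cat_attribs = ["place_of_manufacturing"]
full_pipeline = ColumnTransformer([
    ("num", transform_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
wine_features_tr = full_pipeline.fit_transform(wine_features)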
Now that we have a working model of a regression, let's evaluate performance of the
model on training as well as test sets.
For regression models, we use mean squared error as an evaluation measure.
0.4206571060060278
Let's evaluate performance on the test set.
We need to first apply transformation on the test set and then apply the model prediction
function.
0.39759130875015164
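The train-then-evaluate flow described above can be sketched as follows. The data is synthetic (a noisy linear relationship chosen for illustration), so the printed errors will not match the wine results shown in these slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical regression data: three features, linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + 5.0 + rng.normal(0.0, 0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the transformation and the model on the training set only.
scaler = StandardScaler()
X_train_tr = scaler.fit_transform(X_train)
lin_reg = LinearRegression().fit(X_train_tr, y_train)
train_mse = mean_squared_error(y_train, lin_reg.predict(X_train_tr))

# Apply the already-fitted transformation to the test set, then predict.
X_test_tr = scaler.transform(X_test)
test_mse = mean_squared_error(y_test, lin_reg.predict(X_test_tr))
print("train MSE:", train_mse, "test MSE:", test_mse)
```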
Let's visualize the error between the actual and predicted values.
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
Let's try another model: DecisionTreeRegressor.
quality_predictions = tree_reg.predict(wine_features_tr)
mean_squared_error(wine_labels, quality_predictions)
0.0
quality_test_predictions = tree_reg.predict(wine_features_test_tr)
mean_squared_error(wine_labels_test, quality_test_predictions)
0.58125
Note that the training error is 0, while the test error is 0.58. This is an example of an
overfitted model.
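The same overfitting pattern can be reproduced on synthetic data. This is a sketch under assumed data, not the wine experiment: an unconstrained decision tree memorizes the training set, so its training error is zero while its test error is not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + 5.0 + rng.normal(0.0, 0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# With no depth limit, the tree grows until every training sample has its own leaf.
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, tree_reg.predict(X_train))
test_mse = mean_squared_error(y_test, tree_reg.predict(X_test))
print("train MSE:", train_mse, "test MSE:", test_mse)
```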
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
We can use cross-validation (CV) for robust evaluation of model performance.
Cross-validation provides a separate MSE for each validation set, which we can
use to get a mean estimate of the MSE as well as its standard deviation, which
helps us determine how precise the estimate is.
The additional cost we pay in cross-validation is additional training runs, which
may be too expensive in certain cases.
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
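A sketch of how such scores might be produced for the three models discussed in this step, assuming cross_val_score with the neg_mean_squared_error scorer and synthetic data. The helper cv_mse is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.3, 0.2]) + 5.0 + rng.normal(0.0, 0.5, size=200)

def cv_mse(model, X, y, cv=10):
    """Return one MSE per fold; sklearn scorers are 'higher is better', so negate."""
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=cv)
    return -scores

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=42),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}
results = {name: cv_mse(model, X, y) for name, model in models.items()}
for name, mse_scores in results.items():
    print(name, "mean:", mse_scores.mean(), "std:", mse_scores.std())
```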
Linear Regression CV
Decision tree CV
Random forest CV
A random forest builds multiple decision trees, each on randomly selected features
(and, by default, a bootstrap sample of the training data), and then averages their predictions.
Building a model on top of other models is called ensemble learning, which is often
used to improve the performance of ML models.
0.34449875
plt.scatter(wine_labels_test, quality_test_predictions)
plt.plot(wine_labels_test, wine_labels_test, 'r-')
plt.xlabel('Actual quality')
plt.ylabel('Predicted quality')
Step 6: Fine-tune your model
Usually there are a number of hyperparameters in the model, which are set
manually.
Tuning these hyperparameters often leads to better accuracy of ML models.
Finding the best combination of hyperparameters is a search problem in the
space of hyperparameters, which is huge.
Grid search
Scikit-Learn provides a class GridSearchCV that helps us in this pursuit.
We need to specify a list of hyperparameters along with the range of values to try.
It automatically evaluates all possible combinations of hyperparameter values using
cross-validation.
For example, there are a number of hyperparameters in RandomForest regression,
such as:
Number of estimators
Maximum number of features
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
Let's compute the total combinations evaluated here:
1. The first one results in 3 × 4 = 12 combinations.
2. The second one has 2 values of n_estimators and 3 values of max_features, thus
resulting in 2 × 3 = 6 combinations.
In total, 12 + 6 = 18 combinations are evaluated.
Let's create an object of GridSearchCV:
In this case, we set cv=5 i.e. using 5 fold cross validation for training the model.
We need to train the model for 18 parameter combinations and each combination
would be trained 5 times as we are using cross-validation here.
The total model training runs = 18 × 5 = 90
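A sketch of how the GridSearchCV object might be created and run, assuming the neg_mean_squared_error scorer and a small synthetic dataset with 11 features in place of the wine data. It also confirms the 18 × 5 = 90 count.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical data with 11 features, like the wine feature set.
X, y = make_regression(n_samples=120, n_features=11, noise=10.0, random_state=42)

param_grid = [
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

# cv=5 gives 5-fold cross-validation; scores are negated MSE values.
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)
grid_search.fit(X, y)

n_combinations = len(grid_search.cv_results_["params"])
n_training_runs = n_combinations * grid_search.n_splits_
print(n_combinations, n_training_runs)
print(grid_search.best_params_)
```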
Let's launch the hyperparameter search:
grid_search.fit(wine_features_tr, wine_labels)
The best parameter combination can be obtained as follows:
grid_search.best_params_
The mean cross-validated error of every parameter combination can be listed as follows
(the minus sign converts the negated-MSE score back to an MSE):
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(-mean_score, params)
As you can notice, the lowest MSE is obtained for the best parameter combination.
Let's obtain the best estimator as follows:
grid_search.best_estimator_
Note: GridSearchCV is initialized with the refit=True option, which retrains the best
estimator on the full training set. This is likely to lead to a better model, as it is
trained on a larger dataset.
Randomized Search
When we have a large hyperparameter space, it is desirable to try
RandomizedSearchCV.
It selects a random value for each hyperparameter at the start of each iteration and
repeats the process for the given number of random combinations.
It enables us to search hyperparameter space with appropriate budget control.
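A minimal sketch of a randomized search, assuming integer ranges drawn with scipy.stats.randint and synthetic data. The n_iter argument is the budget: it fixes how many random combinations are tried, regardless of how large the search space is.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical data with 11 features.
X, y = make_regression(n_samples=120, n_features=11, noise=10.0, random_state=42)

# Distributions to sample from (randint's upper bound is exclusive).
param_distributions = {
    "n_estimators": randint(5, 50),
    "max_features": randint(2, 9),
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions,
    n_iter=8,                 # budget: 8 random combinations
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

n_tried = len(rnd_search.cv_results_["params"])
print(n_tried, rnd_search.best_params_)
```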
Analysis of best model and its errors
Analysis of the model provides useful insights about features. Let's obtain the feature
importances as learnt by the model:
feature_importances = grid_search.best_estimator_.feature_importances_
Based on this information, we may drop features that are not so important.
It is also useful to analyze the errors in prediction, understand their causes and fix
them.
Evaluation on test set
Now that we have a reasonable model, we evaluate its performance on the test set. The
following steps are involved in the process:
1. Transform the test features with the transformation pipeline fitted on the training set.
2. Use the predict method with the trained model and the test set.
quality_test_predictions = grid_search.best_estimator_.predict(
    wine_features_test_tr)
3. Compare the predicted labels with the actual ones and report the evaluation metrics.
mean_squared_error(wine_labels_test, quality_test_predictions)
0.35345138888888883
4. It's a good idea to get a 95% confidence interval of the evaluation metric. It can be
obtained by the following code:
(0.29159276569581916, 0.4153100120819586)
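One common way to get such an interval is a t-interval over the per-sample squared errors. The sketch below assumes that approach and uses synthetic predictions, so its numbers differ from the interval shown above.

```python
import numpy as np
from scipy import stats

# Hypothetical test labels and predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(3, 9, size=320).astype(float)
y_pred = y_true + rng.normal(0.0, 0.6, size=320)

# The MSE is the mean of the squared errors, so a confidence interval
# for that mean is a confidence interval for the MSE.
squared_errors = (y_pred - y_true) ** 2
confidence = 0.95
ci_low, ci_high = stats.t.interval(
    confidence,
    len(squared_errors) - 1,          # degrees of freedom
    loc=squared_errors.mean(),        # point estimate (the MSE)
    scale=stats.sem(squared_errors),  # standard error of the mean
)
print(squared_errors.mean(), (ci_low, ci_high))
```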
Step 7: Present your solution
Once we have satisfactory model based on its performance on the test set, we reach
the prelaunch phase.
Before launch,
Step 8: Launch, monitor and maintain your system
Launch
Plug in input sources and
Write test cases
Monitoring
System outages
Degradation of model performance
Sampling predictions for human evaluation
Regular assessment of data quality, which is critical for model performance
Maintenance
Train model regularly every fixed interval with fresh data.
Production roll out of the model.
Summary
In this module, we studied the steps involved in an end-to-end machine learning project, with
the example of predicting wine quality.
Introduction to Scikit-Learn (sklearn)
sklearn APIs are organized along the lines of our ML framework.
Scikit-learn | ML Framework
Training data and preprocessing | Training data
Model (subsumes loss function and optimization procedure) | Model, Loss function, Optimization
Model selection and evaluation | Evaluation
Model inspection |
API design principles
sklearn APIs are well designed with the following principles:
Types of sklearn objects
Transformers: transform a dataset. transform() transforms the dataset; fit() learns parameters; fit_transform() fits parameters and transforms the dataset.
Estimators: estimate model parameters based on training data and hyperparameters. fit() method.
Predictors: make predictions on a dataset. predict() method takes a dataset as input and returns predictions; score() method measures the quality of the predictions.
Data API
Provides functionality for loading, generating and preprocessing the training and test data.
Module: Functionality
sklearn.datasets: Loading datasets, custom as well as popular reference datasets.
sklearn.preprocessing: Scaling, centering, normalization and binarization methods.
sklearn.impute: Filling missing values.
sklearn.feature_selection: Implements feature selection algorithms.
sklearn.feature_extraction: Implements feature extraction from raw data.
Model API
Implements supervised and unsupervised models.
Regression: sklearn.linear_model (linear, ridge, lasso models), sklearn.tree
Classification: sklearn.linear_model, sklearn.svm, sklearn.tree, sklearn.neighbors, sklearn.naive_bayes, sklearn.multiclass
Model selection API
Model inspection API
Practical advice
It is not possible to remember each and every sklearn
API.
Data loading
General dataset API has three main kinds of interfaces:
Dataset API
Loaders: Load small standard datasets.
Fetchers: Fetch and load larger datasets.
Generators: Controlled synthetic datasets.
Both loaders and fetchers return a Bunch object, which is a dictionary with two keys of our interest:
Key: Values
data: Array of shape (n, m)
target: Array of shape (n,)
With return_X_y = True: returns a tuple (X, y) of numpy arrays, where X has shape (n, m) and y has shape (n,).
Multilabel: make_multilabel_classification() generates random samples with multiple labels with a
specific generative process and rejection sampling.
Dataset generators
Clustering: make_blobs() generates a bunch of normally-distributed clusters of points with
specific mean and standard deviations for each cluster.
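A short sketch of a loader and a generator in use. load_iris ships with scikit-learn, so no download is needed; the make_blobs arguments are arbitrary illustrative choices.

```python
from sklearn.datasets import load_iris, make_blobs

# Loader: returns a Bunch, a dictionary-like object with 'data' and 'target' keys.
iris = load_iris()
print(iris.data.shape, iris.target.shape)

# With return_X_y=True the loader returns the (X, y) tuple directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# Generator: 3 normally-distributed clusters in 2 dimensions.
X_blobs, y_blobs = make_blobs(
    n_samples=60, centers=3, n_features=2, cluster_std=0.5, random_state=42)
print(X_blobs.shape, sorted(set(y_blobs.tolist())))
```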
Loading external datasets
fetch_openml() fetches datasets from openml.org, which
is a public repository for machine learning data and
experiments.
Loading external datasets
datasets.load_svmlight_files() loads data in svmlight
and libSVM sparse format.
For managing numerical data, sklearn recommends using
an optimized file format such as HDF5 (Hierarchical Data
Format version 5) to reduce data load times.
Data transformation
Types of transformers
Transformer methods
Transformers are combined with one another or with other estimators such as classifiers or
regressors to build composite estimators.
Tool: Usage
Pipeline: Chaining multiple estimators to execute a fixed sequence of steps in data preprocessing and modelling.
FeatureUnion: Combines output from several transformer objects by creating a new transformer from them.
ColumnTransformer: Enables different transformations on different columns of data based on their types.
Data Preprocessing
Machine Learning Practice
IIT Madras
The real world training data is usually not clean and has many
issues such as missing values for certain features, features on
different scales, non-numeric attributes etc.
Once you get the training data, the first job is to explore
the data and list down preprocessing needed.
Sklearn provides a library of transformers for
data preprocessing.
Transformer methods
Part 1. Feature extraction
sklearn.feature_extraction has useful APIs to extract
features from data:
DictVectorizer FeatureHasher
DictVectorizer
from sklearn.feature_extraction import DictVectorizer
dv = DictVectorizer(sparse=False)
dv.fit_transform(data)
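A self-contained sketch of the call above. The list of dictionaries is made-up data; each dictionary is one sample and each key becomes one feature column.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical samples: one dict per sample, keys are feature names.
data = [
    {"age": 4, "height": 96.0},
    {"age": 1, "height": 73.9},
    {"age": 3, "height": 88.9},
    {"age": 2, "height": 81.6},
]

dv = DictVectorizer(sparse=False)  # return a dense NumPy array
X = dv.fit_transform(data)

print(dv.feature_names_)  # column order of the output matrix
print(X)
```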
FeatureHasher
Feature Extraction from images and text
Part 2: Data Cleaning
Handling missing values
Missing values occur due to errors in data capture such as
sensor malfunctioning, measurement errors etc.
SimpleImputer KNNImputer
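Example: each missing value is replaced by the mean of its column.
$$X_{4\times 2} = \begin{bmatrix} 7 & 1 \\ \text{nan} & 8 \\ 2 & \text{nan} \\ 9 & 6 \end{bmatrix} \quad\Rightarrow\quad X'_{4\times 2} = \begin{bmatrix} 7 & 1 \\ 6 & 8 \\ 2 & 5 \\ 9 & 6 \end{bmatrix}$$
$$\frac{7+2+9}{3} = 6, \qquad \frac{1+8+6}{3} = 5$$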
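The column-mean example above can be checked with SimpleImputer, using strategy="mean" (its default):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The 4x2 matrix from the example, with one missing value per column.
X = np.array([
    [7.0, 1.0],
    [np.nan, 8.0],
    [2.0, np.nan],
    [9.0, 6.0],
])

si = SimpleImputer(strategy="mean")
X_filled = si.fit_transform(X)

print(si.statistics_)  # per-column means learnt during fit
print(X_filled)
```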
KNNImputer
Example: KNNImputer
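$$X_{4\times 3} = \begin{bmatrix} 1. & 2. & \text{nan} \\ 3. & 4. & 3. \\ \text{nan} & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix}$$
Let's fill the missing value in the first sample/row, [1. 2. nan].
We compute the distance of [1. 2. nan] to the other rows and average the third feature over its nearest neighbours:
$$\frac{3+5}{2} = 4 \quad\Rightarrow\quad [1.\ 2.\ 4.]$$
Here the denominator 2 is the # of neighbours.
In this way, we can fill up the missing values with KNNImputer.
$$X_{4\times 3} = \begin{bmatrix} 1. & 2. & \text{nan} \\ 3. & 4. & 3. \\ \text{nan} & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix} \quad\Rightarrow\quad X'_{4\times 3} = \begin{bmatrix} 1. & 2. & 4. \\ 3. & 4. & 3. \\ 5.5 & 6. & 5. \\ 8. & 8. & 7. \end{bmatrix}$$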
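The worked example can be reproduced with KNNImputer and two neighbours:

```python
import numpy as np
from sklearn.impute import KNNImputer

# The 4x3 matrix from the example.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry becomes the mean of that feature over the 2 nearest rows,
# with distances computed on the features that are present in both rows.
knn_imputer = KNNImputer(n_neighbors=2)
X_filled = knn_imputer.fit_transform(X)
print(X_filled)
```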
Marking imputed values
1.2 Numeric transformers
1. Feature scaling
2. Polynomial transformation
3. Discretization
Feature scaling
StandardScaler
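$$x = \begin{bmatrix} \vdots \\ 6 \end{bmatrix}\ (\mu = 4,\ \sigma = 2) \quad\Rightarrow\quad x' = \begin{bmatrix} \vdots \\ 2/2 \end{bmatrix}\ (\mu' = 0,\ \sigma' = 1)$$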
MinMaxScaler
$$x_{5\times 1} = \begin{bmatrix} 15 \\ 2 \\ 5 \\ -2 \\ -5 \end{bmatrix} \quad\Rightarrow\quad x'_{5\times 1} = \begin{bmatrix} 1 \\ 0.35 \\ 0.5 \\ 0.15 \\ 0 \end{bmatrix}$$
mms = MinMaxScaler()
mms.fit_transform(x)
x.max = 15, x.min = -5
The largest number is transformed to 1 and the smallest number is transformed to 0.
MaxAbsScaler
$$x_{5\times 1} = \begin{bmatrix} 4 \\ 2 \\ 5 \\ -2 \\ -100 \end{bmatrix} \quad\Rightarrow\quad x'_{5\times 1} = \begin{bmatrix} 0.04 \\ 0.02 \\ 0.05 \\ -0.02 \\ -1 \end{bmatrix}$$
mas = MaxAbsScaler()
mas.fit_transform(x)
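Both scaler examples can be checked numerically. MinMaxScaler computes (x - min) / (max - min), and MaxAbsScaler divides by the largest absolute value.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler

# Column vector from the min-max example: min = -5, max = 15.
x_minmax = np.array([[15.0], [2.0], [5.0], [-2.0], [-5.0]])
mms = MinMaxScaler()
x_minmax_scaled = mms.fit_transform(x_minmax)
print(x_minmax_scaled.ravel())

# Column vector from the max-abs example: largest absolute value is 100.
x_maxabs = np.array([[4.0], [2.0], [5.0], [-2.0], [-100.0]])
mas = MaxAbsScaler()
x_maxabs_scaled = mas.fit_transform(x_maxabs)
print(x_maxabs_scaled.ravel())
```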
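FunctionTransformer
$$X_{4\times 2} = \begin{bmatrix} 128 & 2 \\ 2 & 256 \\ 4 & 1 \\ 512 & 64 \end{bmatrix} \quad\Rightarrow\quad X'_{4\times 2} = \begin{bmatrix} 7 & 1 \\ 1 & 8 \\ 2 & 0 \\ 9 & 6 \end{bmatrix}$$
ft = FunctionTransformer(numpy.log2)
ft.fit_transform(X)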
PolynomialFeatures
pf = PolynomialFeatures(degree=2)
pf.fit_transform(X)
$$X = [x_1, x_2],\ \text{degree} = 2: \quad X' = [1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2]$$
pf = PolynomialFeatures(degree=3)
pf.fit_transform(X)
$$\text{degree} = 3: \quad X' = [1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3]$$
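A numeric check of the two transformers above. The log2 input is the matrix from the example; the polynomial input [2, 3] is an arbitrary choice made here so that each output term is easy to identify. With default settings, PolynomialFeatures also emits a leading bias column of ones.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

# FunctionTransformer applies an arbitrary function element-wise, here log base 2.
X = np.array([[128.0, 2.0], [2.0, 256.0], [4.0, 1.0], [512.0, 64.0]])
ft = FunctionTransformer(np.log2)
X_log = ft.fit_transform(X)
print(X_log)

# Degree-2 polynomial features of a single sample [x1, x2] = [2, 3].
pf = PolynomialFeatures(degree=2)
X_poly = pf.fit_transform(np.array([[2.0, 3.0]]))
print(X_poly)  # columns: 1, x1, x2, x1^2, x1*x2, x2^2

# Degree 3 adds the four cubic terms, for 10 columns in total.
n_cubic_features = PolynomialFeatures(degree=3).fit_transform(
    np.array([[2.0, 3.0]])).shape[1]
print(n_cubic_features)
```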
KBinsDiscretizer
$$x = \begin{bmatrix} 0 \\ 0.125 \\ 0.25 \\ 0.375 \\ \vdots \\ 0.675 \\ 0.75 \\ 0.875 \\ 1.0 \end{bmatrix} \quad\Rightarrow\quad x' = \begin{bmatrix} 0. \\ 0. \\ 1. \\ 1. \\ \vdots \\ 3. \\ 3. \\ 4. \\ 4. \end{bmatrix}$$
KBinsDiscretizer(
    n_bins=5,
    strategy='uniform',
    …)
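A runnable sketch of this discretization. Two things are assumptions here: encode="ordinal" (suggested by the integer bin indices in the output) and the value 0.5, which is added only to make the input run evenly across the bins. With strategy="uniform" and 5 bins on [0, 1], the bin edges are 0, 0.2, 0.4, 0.6, 0.8 and 1.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Input column; 0.5 is an assumed value added for illustration.
x = np.array([[0.0], [0.125], [0.25], [0.375], [0.5],
              [0.675], [0.75], [0.875], [1.0]])

# encode="ordinal" (an assumption) returns the bin index instead of one-hot columns.
kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
x_binned = kbd.fit_transform(x)

print(kbd.bin_edges_[0])
print(x_binned.ravel())
```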
1.3 Categorical transformers
1. Feature encoding
2. Label encoding
OneHotEncoder
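$$x_{4\times 1} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 1 \end{bmatrix} \quad\Rightarrow\quad X'_{4\times 3} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}$$
ohe = OneHotEncoder()
ohe.fit_transform(x)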
LabelEncoder
$$y_{6\times 1} = \begin{bmatrix} 1 \\ 2 \\ 6 \\ 1 \\ 8 \\ 6 \end{bmatrix} \quad\Rightarrow\quad y'_{6\times 1} = \begin{bmatrix} 0 \\ 1 \\ 2 \\ 0 \\ 3 \\ 2 \end{bmatrix}$$
le = LabelEncoder()
le.fit_transform(y)
Here K = 4: {1, 2, 6, 8}
1 is encoded as 0, 2 as 1, 6 as 2, and 8 as 3.
OrdinalEncoder
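$$X_{6\times 2} = \begin{bmatrix} 1 & \text{'male'} \\ 2 & \text{'female'} \\ 6 & \text{'female'} \\ 1 & \text{'male'} \\ 8 & \text{'male'} \\ 6 & \text{'female'} \end{bmatrix} \quad\Rightarrow\quad X'_{6\times 2} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 2 & 0 \\ 0 & 1 \\ 3 & 1 \\ 2 & 0 \end{bmatrix}$$
oe = OrdinalEncoder()
oe.fit_transform(X)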
LabelBinarizer
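$$y_{6\times 1} = \begin{bmatrix} 1 \\ 2 \\ 6 \\ 1 \\ 8 \\ 6 \end{bmatrix} \quad\Rightarrow\quad Y'_{6\times 4} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$
lb = LabelBinarizer()
lb.fit_transform(y)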
MultiLabelBinarizer
movie_genres = [{'action', 'comedy'},
                {'comedy'},
                {'action', 'thriller'},
                {'science-fiction', 'action', 'thriller'}]
mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)
$$X'_{4\times 4} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 1 \end{bmatrix}$$
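The three label transformers shown above can be run on the same inputs to check the encodings. In each case, the learnt classes_ attribute gives the meaning of the output codes or columns.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, MultiLabelBinarizer

y = np.array([1, 2, 6, 1, 8, 6])

# LabelEncoder: each distinct label gets an integer code in sorted order.
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(le.classes_, y_encoded)

# LabelBinarizer: one binary column per distinct label.
lb = LabelBinarizer()
Y_binary = lb.fit_transform(y)
print(Y_binary)

# MultiLabelBinarizer: each sample is a set of labels; several 1s per row are possible.
movie_genres = [
    {"action", "comedy"},
    {"comedy"},
    {"action", "thriller"},
    {"science-fiction", "action", "thriller"},
]
mlb = MultiLabelBinarizer()
genres_binary = mlb.fit_transform(movie_genres)
print(mlb.classes_)
print(genres_binary)
```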
add_dummy_feature
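$$X_{4\times 2} = \begin{bmatrix} 7 & 1 \\ 1 & 8 \\ 2 & 0 \\ 9 & 6 \end{bmatrix} \quad\Rightarrow\quad X'_{4\times 3} = \begin{bmatrix} 1 & 7 & 1 \\ 1 & 1 & 8 \\ 1 & 2 & 0 \\ 1 & 9 & 6 \end{bmatrix}$$
add_dummy_feature(X)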
Part 3: Feature selection
Filter based
Wrapper based
Sometimes in a real world dataset, not all features contribute well enough towards fitting a model.
The features that do not contribute significantly can be removed. This decreases the size of
the dataset and hence the computation cost of fitting a model.
sklearn.feature_selection provides many APIs to accomplish this task.
Filter: VarianceThreshold, SelectKBest, SelectPercentile, GenericUnivariateSelect
Wrapper: RFE, RFECV, SelectFromModel, SequentialFeatureSelector
Removing features with low variance
VarianceThreshold
Univariate feature selection
Univariate feature selection selects features based on
univariate statistical tests.
GenericUnivariateSelect
Performs univariate feature selection with a
configurable strategy, which can be found
via hyper-parameter search.
sklearn provides one more class of univariate feature
selection methods that work on common univariate
statistical tests for each feature:
Univariate scoring function
Each API needs a scoring function to score each feature.
chi2
Mutual information (MI)
Measures dependency between two variables.
It returns a non-negative value.
MI = 0 for independent variables.
Higher MI indicates higher dependency.
Chi-square
Measures dependence between two variables.
Computes chi-square stats between a non-negative feature (boolean or frequencies) and the class label.
Higher chi-square values indicate that the features and labels are likely to be correlated.
SelectPercentile
sp = SelectPercentile(chi2, percentile=20)
X_new = sp.fit_transform(X, y)
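A self-contained run of this selector, assuming the bundled iris data (4 non-negative features) and percentile=50 so that half of the features survive. chi2 requires non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, chi2

X, y = load_iris(return_X_y=True)

# Keep the top 50% of features ranked by their chi-square score against the label.
sp = SelectPercentile(chi2, percentile=50)
X_new = sp.fit_transform(X, y)

print(sp.scores_)         # one chi-square score per original feature
print(sp.get_support())   # boolean mask of the selected features
print(X.shape, "->", X_new.shape)
```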
Do not use regression feature scoring
function with a classification problem. It will
lead to useless results.
Wrapper based feature selection
Recursive Feature Elimination (RFE)
SelectFromModel
Sequential feature selection
Performs feature selection by selecting or deselecting
features one by one in a greedy manner.
SFS does not require the underlying model to expose a coef_ or
feature_importances_ attributes unlike in RFE and SelectFromModel.
SFS may be slower than RFE and SelectFromModel as it needs to
evaluate more models compared to the other two approaches.
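To illustrate the first point, here is a sketch that runs SFS with a k-nearest-neighbours classifier, which exposes neither coef_ nor feature_importances_. The dataset and the target of 2 features are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# KNN has no coef_ or feature_importances_, so RFE and SelectFromModel cannot use it.
knn = KNeighborsClassifier(n_neighbors=3)

# Forward selection: start with no features and greedily add the one that
# improves the cross-validated score the most, until 2 features are selected.
sfs = SequentialFeatureSelector(
    knn, n_features_to_select=2, direction="forward", cv=5)
sfs.fit(X, y)

selected_mask = sfs.get_support()
X_selected = sfs.transform(X)
print(selected_mask, X_selected.shape)
```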
Applying transformations to
diverse features
Generally training data contains diverse features such
as numeric and categorical.
Composite Transformer
ColumnTransformer TransformedTargetRegressor
ColumnTransformer
It applies a set of transformers to columns of an array or
pandas.DataFrame, concatenates the transformed outputs from
different transformers into a single matrix.
It is useful for transforming heterogeneous data by applying
different transformers to separate subsets of features.
It combines different feature selection mechanisms and
transformation into a single transformer object.
ColumnTransformer()
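Each tuple has the format
(estimatorName, estimator(...), columnIndices)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Assumption: StandardScaler is used as the numeric scaler for column 0
# (a text vectorizer such as CountVectorizer cannot scale numbers).
column_trans = ColumnTransformer(
    [('ageScaler', StandardScaler(), [0]),
     ('genderEncoder', OneHotEncoder(dtype='int'), [1])],
    remainder='drop', verbose_feature_names_out=False)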
Illustration of Column Transformer
$$X_{6\times 2} = \begin{bmatrix} 20.0 & \text{'male'} \\ 11.2 & \text{'female'} \\ 15.6 & \text{'female'} \\ 13.0 & \text{'male'} \\ 18.6 & \text{'male'} \\ 16.4 & \text{'female'} \end{bmatrix}$$
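A runnable sketch of a column transformer applied to this matrix. The column names "age" and "gender" and the choice of StandardScaler for the numeric column are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The 6x2 matrix from the illustration; column names are hypothetical.
X = pd.DataFrame({
    "age": [20.0, 11.2, 15.6, 13.0, 18.6, 16.4],
    "gender": ["male", "female", "female", "male", "male", "female"],
})

column_trans = ColumnTransformer(
    [("ageScaler", StandardScaler(), [0]),                  # column 0: scale
     ("genderEncoder", OneHotEncoder(dtype="int"), [1])],   # column 1: one-hot
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)
X_tr = column_trans.fit_transform(X)

# Output columns: scaled age, then one column per gender category (female, male).
print(X_tr)
```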
Another way to reduce the number of features
is through unsupervised dimensionality
reduction techniques.
PCA 101
PCA is a linear dimensionality reduction technique.
It uses singular value decomposition (SVD) to project
the feature matrix or data to a lower dimensional space.
The first principal component (PC) is in the direction of
maximum variance in the data.
It captures the bulk of the variance in the data.
The subsequent PCs are orthogonal to the first PC and
gradually capture less and less variance in the data.
We can select the first k PCs such that we are able to
capture the desired variance in the data.
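These properties can be observed on a small synthetic dataset. The data below is made up so that nearly all variance lies along one direction; the 95% variance target is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 3-feature data where nearly all variance lies along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2.0 * t + rng.normal(0.0, 0.1, size=200),
    rng.normal(0.0, 0.1, size=200),
])

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_  # share of variance per PC, in decreasing order
print(ratios)

# The PCs are mutually orthogonal unit vectors.
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))

# Smallest k whose cumulative explained variance reaches 95%.
cumulative = np.cumsum(ratios)
k = int(np.searchsorted(cumulative, 0.95) + 1)

# A float n_components asks PCA to pick that k itself.
pca_k = PCA(n_components=0.95).fit(X)
print(k, pca_k.n_components_)
```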
After applying PCA and choosing only first PC to reduce
dimension of data.
Part 4: Chaining transformers
The preprocessing transformations are applied one after
another on the input feature matrix.
si = SimpleImputer()
X_imputed = si.fit_transform(X)
ss = StandardScaler()
X_scaled = ss.fit_transform(X_imputed)
The sklearn.pipeline module provides utilities to build a
composite estimator, as a chain of transformers and
estimators.
Class: Usage
Pipeline: Constructs a chain of multiple transformers to execute a fixed sequence of steps in data preprocessing and modelling.
FeatureUnion: Combines output from several transformer objects by creating a new transformer from them.
sklearn.pipeline.Pipeline
Creating Pipelines
Two ways to create a pipeline object.
Pipeline()
It takes a list of ('estimatorName', estimator(...)) tuples.
The pipeline object exposes the interface of the last step.
estimators = [
    ('simpleImputer', SimpleImputer()),
    ('standardScaler', StandardScaler()),
]
pipe = Pipeline(steps=estimators)
make_pipeline
It takes a number of estimator objects only.
pipe = make_pipeline(SimpleImputer(), StandardScaler())
Without pipeline:
si = SimpleImputer()
X_imputed = si.fit_transform(X)
ss = StandardScaler()
X_scaled = ss.fit_transform(X_imputed)
With pipeline:
estimators = [
    ('simpleImputer', SimpleImputer()),
    ('standardScaler', StandardScaler()),
]
pipe = Pipeline(steps=estimators)
pipe.fit_transform(X)
73
Accessing individual steps in Pipeline
estimators = [
('simpleImputer', SimpleImputer()),
('pca', PCA()),
('regressor', LinearRegression())
]
pipe = Pipeline(steps=estimators)
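The accessor calls from this slide did not survive extraction. A minimal sketch of the usual ways to reach a single step of the pipeline defined above, assuming a reasonably recent scikit-learn release (the variable names are illustrative only):

```python
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

estimators = [
    ('simpleImputer', SimpleImputer()),
    ('pca', PCA()),
    ('regressor', LinearRegression()),
]
pipe = Pipeline(steps=estimators)

# Four equivalent ways to reach the PCA step.
pca_by_name = pipe.named_steps['pca']   # dictionary-style access by step name
pca_by_key = pipe['pca']                # indexing with the step name
pca_by_index = pipe[1]                  # indexing with the step position
pca_from_list = pipe.steps[1][1]        # raw list of (name, estimator) tuples

# Slicing returns a new Pipeline that holds only the selected steps.
preprocessing_only = pipe[:-1]
```

All four accessors return the same estimator object, so changing it through one of them is visible through the others.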
74
Accessing parameters of each step in Pipeline
estimators = [
('simpleImputer', SimpleImputer()),
('pca', PCA()),
('regressor', LinearRegression())
]
pipe = Pipeline(steps=estimators)
pipe.set_params(pca__n_components = 2)
# The grid below assumes a pipeline whose steps are named 'imputer' and 'clf'.
param_grid = dict(imputer=['passthrough', SimpleImputer(), KNNImputer()],
                  clf=[SVC(), LogisticRegression()],
                  clf__C=[0.1, 10, 100])
grid_search = GridSearchCV(pipe, param_grid=param_grid)
77
Advantages of pipeline
Combines multiple steps of end to end ML into single object
such as missing value imputation, feature scaling and
encoding, model training and cross validation.
Enables joint grid search over parameters of all the
estimators in the pipeline.
Makes configuring and tuning end to end ML quick and easy.
Offers convenience, as a developer has to call fit() and
predict() methods only on a Pipeline object (assuming last
step in the pipeline is an estimator).
Reduces code duplication: With a Pipeline object, one
doesn't have to repeat code for preprocessing and
transforming the test data.
78
sklearn.pipeline.FeatureUnion
79
Combining Transformers and Pipelines
FeatureUnion() accepts a list of tuples.
Each tuple is of the format:
('estimatorName',estimator(...))
num_pipeline = Pipeline([
    ('selector', ColumnTransformer([('select_first_4', 'passthrough', slice(0, 4))])),
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
# OneHotEncoder replaces LabelBinarizer here: LabelBinarizer is meant for
# labels, and its fit_transform(y) signature fails inside a ColumnTransformer.
cat_pipeline = ColumnTransformer([('one_hot_encoder', OneHotEncoder(), [4])])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
80
Visualizing Composite Transformers
81
That's it from data preprocessing.
82
Linear Regression
Machine Learning Practice
1
How to build baseline regression model?
DummyRegressor helps in creating a baseline for regression.
from sklearn.dummy import DummyRegressor

dummy_regr = DummyRegressor(strategy="mean")  # always predicts the training-set mean
dummy_regr.fit(X_train, y_train)
dummy_regr.predict(X_test)
dummy_regr.score(X_test, y_test)
Other strategies shown on the slide: quantile, constant.
2
How is Linear Regression model trained?
Step 1: Instantiate object of a suitable linear regression estimator from
one of the following two options
4
SGDRegressor Estimator
Implements stochastic gradient descent
Use for large training sets (> 10k samples)
Provides greater control on optimization process through
provision for hyperparameter settings.
SGDRegressor hyperparameters:
loss: 'squared_error', 'huber'
penalty: 'l1', 'l2', 'elasticnet'
learning_rate: 'constant', 'optimal', 'adaptive'
early_stopping: True
5
It's a good idea to use a random seed of your choice while
instantiating SGDRegressor object. It helps us get
reproducible results.
6
How to perform feature scaling for SGDRegressor?
SGD is sensitive to feature scaling, so it is highly recommended
to scale input feature matrix.
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale the features first, then fit the SGD regressor on the scaled matrix.
sgd = Pipeline([
    ('feature_scaling', StandardScaler()),
    ('sgd_regressor', SGDRegressor())])

sgd.fit(X_train, y_train)
8
How to set the learning rate in SGDRegressor?
Options for the learning_rate parameter:
learning_rate = 'constant'
learning_rate = 'invscaling'
learning_rate = 'adaptive'
10
How to set adaptive learning rate?
The learning rate is kept to initial value as long as the training loss
decreases.
When the stopping criterion is reached, the learning rate is divided
by 5, and the training loop continues.
The algorithm stops when the learning rate goes below $10^{-6}$.
11
How to set #epochs in SGDRegressor?
Remember one epoch is one full pass over the training data.
Practical tip
SGD converges after observing approximately $10^6$ training samples.
Thus, a reasonable first guess for the number of iterations for a training set of n
samples is
max_iter = np.ceil(10**6 / n)
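As a quick worked instance of this rule of thumb, the sketch below assumes a training set of 1,599 samples; that size is an illustrative assumption, not a figure taken from the slides.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

n_samples = 1599  # assumed training-set size, chosen only for illustration

# About 10^6 sample visits in total means 10^6 / n passes over the data.
max_iter = int(np.ceil(10**6 / n_samples))

# random_state is fixed so that repeated runs give the same result.
sgd = SGDRegressor(max_iter=max_iter, random_state=42)
```

The result is only a first guess; the stopping criteria of SGDRegressor can still end training earlier.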
12
How to set stopping criteria in SGDRegressor?
13
How to set stopping criteria in SGDRegressor?
Option #2 early_stopping, validation_fraction
from sklearn.linear_model import SGDRegressor
linear_regressor = SGDRegressor(loss='squared_error',
                                early_stopping=True,
                                max_iter=500,
                                tol=1e-3,
                                validation_fraction=0.2,
                                n_iter_no_change=5)
15
How to use averaged SGD?
Averaged SGD updates the weight vector to average of
weights from previous updates.
16
Option #2: Set average to int value.
17
How do we initialize SGD with weight vector
of the previous run?
18
How to monitor SGD loss iteration after iteration?
Make use of warm_start = True
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# tol=None disables the tolerance-based stopping check; each fit() call runs one epoch.
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val)
    val_error = mean_squared_error(y_val, y_val_predict)
19
Model inspection
20
How to access the weights of trained Linear
Regression model?
$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_m x_m = w^T x$
22
How to make predictions on new data in Linear
Regression model?
Step 1: Arrange data for prediction in a feature matrix of
shape (#samples, #features) or in sparse matrix format.
23
Model evaluation
24
General steps in model evaluation
25
How to evaluate trained Linear Regression model?
Using score method on linear regression object:
1 # Evaluation on the eval set with
2 # 1. feature matrix
3 # 2. label vector or matrix (single/multi-output)
4 linear_regressor.score(X_test, y_test)
$R^2 = 1 - \frac{u}{v}$
When?
27
Evaluation metrics
sklearn provides a bunch of regression metrics to evaluate
performance of the trained estimator on the evaluation set.
mean_absolute_error
from sklearn.metrics import mean_absolute_error
eval_score = mean_absolute_error(y_test, y_predicted)
mean_squared_error
from sklearn.metrics import mean_squared_error
eval_score = mean_squared_error(y_test, y_predicted)
mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_percentage_error
eval_score = mean_absolute_percentage_error(y_test, y_predicted)
median_absolute_error (robust to outliers)
from sklearn.metrics import median_absolute_error
eval_score = median_absolute_error(y_test, y_predicted)
29
How to evaluate regression model on
worst case error?
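The body of this slide is missing from the extracted text. One metric in sklearn.metrics that fits the description is max_error, which reports the largest absolute residual. A small sketch with made-up labels and predictions:

```python
from sklearn.metrics import max_error

y_test = [3.0, 2.0, 7.0, 1.0]        # illustrative labels
y_predicted = [4.0, 2.0, 7.0, 1.5]   # illustrative predictions

# Largest absolute difference between a label and its prediction.
worst_case = max_error(y_test, y_predicted)
```

Here the residuals are 1.0, 0.0, 0.0 and 0.5, so the worst-case error is 1.0.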
30
Scores and Errors
Score is a metric for which higher value is better.
Error is a metric for which lower value is better.
Function Scoring
metrics.mean_absolute_error neg_mean_absolute_error
metrics.mean_squared_error neg_mean_squared_error
metrics.mean_squared_error neg_root_mean_squared_error
metrics.mean_squared_log_error neg_mean_squared_log_error
metrics.median_absolute_error neg_median_absolute_error
31
In case, we get comparable performance on train and test with
this split, is this performance guaranteed on other splits too?
32
Cross-validation performs robust evaluation of model performance
by repeated splitting and
providing many training and test errors
This enables us to estimate variability in generalization
performance of the model.
sklearn implements the following cross validation iterators
KFold
RepeatedKFold
LeaveOneOut
ShuffleSplit
33
How to obtain cross validated performance
measure using KFold?
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
score = cross_val_score(lin_reg, X, y, cv=5)
which is the same as passing cv=KFold(n_splits=5) for a regressor.
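The equivalence between an integer cv and an explicit KFold iterator can be checked directly. The sketch below uses synthetic data from make_regression, which is an assumption made only so that the example runs on its own:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=100, n_features=3, noise=5.0, random_state=0)

lin_reg = LinearRegression()

# For a regressor, an integer cv is interpreted as KFold without shuffling.
score_int = cross_val_score(lin_reg, X, y, cv=5)
score_kfold = cross_val_score(lin_reg, X, y, cv=KFold(n_splits=5))
```

Both calls return one R2 score per fold, and the two arrays match.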
35
How to obtain cross validated performance
measure using ShuffleSplit?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
score = cross_val_score(lin_reg, X, y, cv=shuffle_split)
36
How to specify a performance measure in
cross_val_score?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import ShuffleSplit

lin_reg = LinearRegression()
shuffle_split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
score = cross_val_score(lin_reg, X, y, cv=shuffle_split,
                        scoring='neg_mean_absolute_error')
40
How to study effect of #samples on training and test
errors?
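The slide content is not preserved here. One way to study this with scikit-learn is the learning_curve utility, sketched below on synthetic data (the dataset and the choice of five training sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),   # fractions of the training folds
    cv=5,
    scoring='neg_mean_squared_error',
)

# Scores are negated errors, so flip the sign and average over the folds.
train_errors = -train_scores.mean(axis=1)
test_errors = -test_scores.mean(axis=1)
```

Plotting train_errors and test_errors against train_sizes gives the learning curve.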
42
Polynomial regression
1
How is polynomial regression model
trained?
Step 1: Apply polynomial transformation on the feature matrix.
2
Set up polynomial regression model with normal equation
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two steps:
# 1. Polynomial features of desired degree (here degree=2)
# 2. Linear regression
poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('linear_regression', LinearRegression())])

# Train with feature matrix X_train and label vector y_train
poly_model.fit(X_train, y_train)
Notice that there is a single line code change in two code snippets.
3
How to use only interaction features for
polynomial regression?
4
Hyperparameter tuning (HPT)
5
How to recognize hyperparameters in
any sklearn estimator?
For example,
degree in PolynomialFeatures
learning rate in SGDRegressor
6
How to set these hyperparameters?
7
Two generic HPT approaches implemented in sklearn are:
8
What are the differences between grid and
randomized search?
[Figure: training data and labels, test data and labels]
10
What are the steps in HPT?
STEP 1: Divide training data into training, validation and test sets.
[Figure: the data split into training data and labels, and test data and labels]
11
STEP 2: For each combination of hyper-parameter values learn a
model with training set.
[Figure: each set of hyperparameter values produces a model; each model's predictions on the validation fold data are compared with the validation fold labels to measure performance, and the best-performing model is selected.]
13
STEP 4: Retrain model with the best hyper-parameter settings on
training and validation set combined.
[Figure: the learning algorithm takes the training and validation data and labels, together with the best hyperparameter values, and produces the model.]
14
STEP 5: Evaluate the model performance on the test set.
[Figure: the model's predictions on the test data are compared with the test labels to measure performance.]
Note that the test set was not used in hyper-parameter search
and model retraining.
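The steps above can be sketched with GridSearchCV. The synthetic data, the Ridge estimator and the alpha grid are assumptions chosen for illustration; cross validation on the training portion plays the role of the validation set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: hold out a test set; GridSearchCV creates the validation folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Steps 2-3: one model per hyperparameter value, scored on validation folds.
param_grid = {'alpha': [1e-3, 1e-2, 1e-1, 1.0, 10.0]}
grid_search = GridSearchCV(Ridge(), param_grid, cv=5,
                           scoring='neg_mean_squared_error')

# Step 4: refit=True (the default) retrains the best setting on all of X_train.
grid_search.fit(X_train, y_train)

# Step 5: evaluate once on the untouched test set.
test_score = grid_search.score(X_test, y_test)
best_alpha = grid_search.best_params_['alpha']
```

The test score is a negated mean squared error, so values closer to zero are better.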
15
Which model-specific HPT estimators are available for
regression tasks?
linear_model.LassoCV
linear_model.RidgeCV
linear_model.ElasticNetCV
16
How to determine degree of polynomial regression
with grid search?
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import SGDRegressor

# Candidate polynomial degrees; 'poly__degree' targets the 'poly' step.
param_grid = [
    {'poly__degree': [2, 3, 4, 5, 6, 7, 8, 9]}
]

pipeline = Pipeline(steps=[('poly', PolynomialFeatures()),
                           ('sgd', SGDRegressor())])

grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

grid_search.fit(x_train.reshape(-1, 1), y_train)
17
Regularization
18
How to perform ridge regularization with specific
regularization rate?
[Option #1]
Step 1: Instantiate object of Ridge estimator
Step 2: Set parameter alpha to the required regularization rate.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1e-3)
fit, score, predict work exactly like other linear regression estimators
[Option #2]
Step 1: Instantiate object of SGDRegressor estimator
Step 2: Set parameter alpha to the required regularization rate
and penalty = l2.
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(alpha=1e-3, penalty='l2')
19
How to search the best regularization parameter
for ridge?
[Option #1]
Search for the best regularization rate with built-in cross
validation in RidgeCV estimator.
[Option #2]
Use cross validation with Ridge or SGDRegressor to search
for best regularization.
Grid search
Randomized search
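Option #1 can be sketched as follows. The synthetic data and the candidate alphas are assumptions for illustration; alpha_ is the attribute in which RidgeCV stores the selected regularization rate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

alphas = np.logspace(-3, 2, 6)          # candidate regularization rates
ridge_cv = RidgeCV(alphas=alphas, cv=5)
ridge_cv.fit(X, y)

best_alpha = ridge_cv.alpha_            # rate selected by cross validation
```

After fitting, ridge_cv behaves like a Ridge model trained with best_alpha, so predict and score can be called on it directly.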
20
How to perform ridge regularization in polynomial
regression?
Set up a pipeline of polynomial transformation followed by the
ridge regressor.
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('ridge', Ridge(alpha=1e-3))])
poly_model.fit(X_train, y_train)
21
How to perform lasso regularization with specific
regularization rate?
[Option #1]
Step 1: Instantiate object of Lasso estimator
Step 2: Set parameter alpha to the required regularization rate.
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1e-3)
fit, score, predict work exactly like other linear regression estimators
[Option #2]
Step 1: Instantiate object of SGDRegressor estimator
Step 2: Set parameter alpha to the required regularization rate
and penalty = l1.
from sklearn.linear_model import SGDRegressor
sgd = SGDRegressor(alpha=1e-3, penalty='l1')
22
How to search the best regularization
parameter for lasso regularization?
[Option #1]
Search for the best regularization rate with built-in cross
validation in LassoCV estimator.
[Option #2]
Use cross validation with Lasso or SGDRegressor to search
for best regularization.
Grid search
Randomized search
23
How to perform lasso regularization in polynomial
regression?
Set up a pipeline of polynomial transformation followed by the
lasso regressor.
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('lasso', Lasso(alpha=1e-3))])
poly_model.fit(X_train, y_train)
24
How to perform both lasso and ridge regularization
in polynomial regression?
Set up a pipeline of polynomial transformation followed by the
SGDRegressor with penalty = 'elasticnet'
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# l1_ratio sets the mix between the L1 and L2 penalties.
poly_model = Pipeline([
    ('polynomial_transform', PolynomialFeatures(degree=2)),
    ('elasticnet', SGDRegressor(penalty='elasticnet',
                                l1_ratio=0.3))])
poly_model.fit(X_train, y_train)
26
Appendix - More
Details
27
Introduction
In this module, we will be covering the implementation aspects of
models of linear regression.
First we will learn linear regression models with:
Normal equation, which estimates model parameter with a
closed-form solution.
28
Further, we will study the implementation of the polynomial
regression model, that is capable of modelling non-linear
relationships between features and labels.
Since the polynomial regression uses more
parameters (due to polynomial representation of the
input), it is more prone to overfitting.
We will study how to detect overfitting with learning
curves and use of regularization to mitigate the risk of
overfitting.
29
Recap
Let's recall components of Linear regression
30
Training Data
31
Model
The label is obtained by a linear combination (or weighted sum) of the
input features and a bias (or intercept) term. The model $h_w$ is given by
$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_m x_m = w^T x$
where,
$\hat{y}$ is the predicted value.
$x$ is a feature vector $\{x_1, x_2, \ldots, x_m\}$ for a given example with m
features in total.
i-th feature: $x_i$
Weight or parameter vector $w$ includes bias term too:
$\{w_0, w_1, w_2, \ldots, w_m\}$
$w_i$ is the i-th model parameter associated with the i-th feature.
$h_w$ is a model with parameter vector $w$.
32
Loss function
The model parameters w are learnt such that the square of difference
between the actual and the predicted values is minimized.
$$J(w) = \frac{1}{2} \sum_{i=1}^{n} \left(\hat{y}^{(i)} - y^{(i)}\right)^2$$
$$J(w) = \frac{1}{2} \sum_{i=1}^{n} \left(w^T x^{(i)} - y^{(i)}\right)^2$$
33
Optimization
1. Normal equation
2. Iterative optimization with gradient descent: full batch, mini-batch or
stochastic.
34
Evaluation measure
1. Mean squared error
2. Root mean squared error
35
Implementing with sklearn
36
Normal equation
sklearn provides LinearRegression estimator for weight
vector estimation via normal equation
38
Coefficient of determination R2
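$$R^2 = 1 - \frac{u}{v}$$
where
u is the residual sum of squares and is calculated as
$$u = (Xw - y)^T (Xw - y)$$
and v is the total sum of squares. Let $\hat{y}_{\text{mean}} = \frac{1}{n} \sum_{i=1}^{n} y^{(i)}$ be the mean of the actual labels, then v is
calculated as
$$v = (y - \hat{y}_{\text{mean}})^T (y - \hat{y}_{\text{mean}})$$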
39
$R^2 = 1 - \frac{u}{v}$
The best possible score is 1.0.
The score can be negative (because the model can be arbitrarily
worse).
A constant model that always predicts the expected value of y ,
disregarding the input features, would get a score of 0.0.
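These properties can be verified numerically. The sketch below computes u and v by hand on synthetic data (an assumption made for self-containment) and compares the result with the score method:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=50, n_features=3, noise=5.0, random_state=0)
lin_reg = LinearRegression().fit(X, y)
y_pred = lin_reg.predict(X)

u = np.sum((y - y_pred) ** 2)      # residual sum of squares
v = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2_manual = 1 - u / v

# A constant model that always predicts the mean of y scores exactly 0.
u_const = np.sum((y - np.full_like(y, y.mean())) ** 2)
r2_const = 1 - u_const / v
```

The hand-computed value agrees with lin_reg.score(X, y), and the constant model gets a score of 0.0.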
40
41
Model inspection
lin_reg.intercept_, lin_reg.coef_
42
Computational Complexity
The normal equation obtains the weight vector in closed form:
$$w = (X^T X)^{-1} X^T y$$
The LinearRegression estimator uses SVD for this task and has the
complexity of $O(m^2)$ where m is the number of features.
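As a sanity check, the closed-form solution can be computed with NumPy and compared with the weights found by LinearRegression. This is a sketch on synthetic data; it solves the linear system instead of forming the inverse explicitly, which is a common numerical choice rather than what the estimator does internally.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic regression data, used only to make the example self-contained.
X, y = make_regression(n_samples=100, n_features=3, noise=1.0, random_state=0)

# Prepend a column of ones so that w[0] plays the role of the bias w0.
X_b = np.c_[np.ones(len(X)), X]

# Normal equation, written with solve() rather than an explicit inverse.
w = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

lin_reg = LinearRegression().fit(X, y)
```

w[0] matches lin_reg.intercept_ and w[1:] matches lin_reg.coef_ up to floating-point error.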
43
This implies that if we double the number of features, the training
computation grows roughly 4 times.
As the number of features grows large, the approach of normal
equation slows down significantly.
These approaches are linear w.r.t. the number of training
examples as long as the training set fits in the memory.
The inference process is linear w.r.t. both the number of examples
and number of features.
44
Weight vector estimation via SGD
SGD is a simple yet very efficient approach of learning weight
vectors in linear regression problems especially in large scale
problem settings.
SGD offers provisions for tuning the optimization process. However
as a downside of this, we need to set a few hyperparameters.
SGD is sensitive to feature scaling.
45
In sklearn, an estimator SGDRegressor implements a plain stochastic
gradient descent learning routine which supports different loss functions
and penalties to fit linear regression models.
46
Key functionalities of SGDRegressor
Loss function:
Can be set via the loss parameter.
SGDRegressor supports the following loss functions.
loss = 'squared_error': Ordinary least squares,
loss = 'huber': Huber loss for robust regression
47
Regularization
SGD supports the following penalties:
penalty = 'l2': L2 norm penalty on coef_. This is the default setting.
penalty = 'l1': L1 norm penalty on coef_. This leads to sparse
solutions.
penalty = 'elasticnet': Convex combination of L2 and L1;
(1 - l1_ratio) * L2 + l1_ratio * L1
48
Learning rate
The learning rate η can be either constant or gradually decaying. There
are following options for learning rate schedule specification in SGD:
1. invscaling
$$\eta^{(t)} = \frac{\eta_0}{t^{\text{power\_t}}}$$
49
2. Constant
50
3. Adaptive
For an adaptively decreasing learning rate, use learning_rate =
'adaptive' and use η0 to specify the starting learning rate.
51
4. Optimal
Used as a default setting for classification problems. The learning rate η
for t-th iteration is calculated as follows:
$$\eta^{(t)} = \frac{1}{\alpha (t_0 + t)}$$
Here
α is a regularization rate.
53
2. With early_stopping = False
The model is fitted on the entire input data and
The stopping criterion is based on the objective function
computed on the training data.
54
In both cases, the criterion is evaluated once by epoch, and the
algorithm stops when the criterion does not improve
n_iter_no_change times in a row.
55
SGD variation: Average SGD
SGDRegressor supports averaged SGD (ASGD). Averaging can be
enabled by setting average = True
ASGD performs the same updates as the regular SGD, and sets
coef_ attribute to the average value of the coefficients across all
updates.
SGD sets coef_ attribute to the last value of the coefficients
(i.e. the values of the last update)
The same process is followed for the intercept_ attribute.
When using ASGD the learning rate can be larger and even constant,
leading to a speed up in training.
56
Model inspection
57
Complexity
The major advantage of SGD is its efficiency, which is basically linear in
the number of training examples.
Recent theoretical results, however, show that the runtime to get some
desired optimization accuracy does not increase as the training set size
increases.
58
Polynomial regression
59
Polynomial regression
60
Example:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.arange(6).reshape(3, 2)
print("Data matrix: \n", X)
poly = PolynomialFeatures(degree=2)
print("\n\nAfter transformation: \n", poly.fit_transform(X))
Output:
Data matrix:
 [[0 1]
 [2 3]
 [4 5]]


After transformation:
 [[ 1.  0.  1.  0.  0.  1.]
 [ 1.  2.  3.  4.  6.  9.]
 [ 1.  4.  5. 16. 20. 25.]]
61
In the above example, the features of X have been transformed from
$[x_1, x_2]$ to $[1, x_1, x_2, x_1^2, x_1 x_2, x_2^2]$, and can now be used within any linear
model.
In some cases it's not necessary to include higher powers of any single
feature, but only the so-called interaction features that multiply together
at most d distinct features. These can be obtained from
PolynomialFeatures with the setting interaction_only = True.
In this case, the features of X would be transformed from $[x_1, x_2]$ to
$[1, x_1, x_2, x_1 x_2]$.
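Reusing the same data matrix as in the earlier example, a short sketch of the interaction-only transformation:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(3, 2)   # same data matrix as in the earlier example

poly = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = poly.fit_transform(X)
# Columns are [1, x1, x2, x1*x2]; the squared terms are dropped.
```

For the row [2, 3] this yields [1, 2, 3, 6], compared with [1, 2, 3, 4, 6, 9] under the full degree-2 transformation.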
62
Ridge regression
63
Ridge regression
Ridge regression minimizes L2 penalized sum of squared error.
64
RidgeCV parameters:
1. alphas is the list of regularization rates to try.
The regularization rate must be positive.
Larger values indicate stronger regularization.
2. cv determines the cross-validation splitting strategy.
None, to use the efficient Leave-One-Out cross-validation
integer, to specify the number of folds.
CV splitter specifies how to generate cross validation sets.
An iterable yielding (train, test) splits as arrays of indices.
65
In case of a binary or multiclass problems, for 'cv=None' or 'cv=5' (i.e.
integer), 'StratifiedKFold' cross validation strategy is used. In other
cases, 'KFold' cross validation strategy is used.
66
Model inspection
'RidgeCV' provides an additional output apart from usual outputs like
coef_ and intercept_:
67
Lasso regression
68
Lasso regression
69
Classification functions in scikit-learn
2
In this week, we will study sklearn functionality for
implementing classification algorithms.
We will cover sklearn APIs for
Specific classification algorithms for least square
classification, perceptron, and logistic regression
classifier.
with regularization
multiclass, multilabel and multi-output setting
Various classification metrics.
4
There are broadly two types of APIs based
on their functionality:
Generic: SGD classifier. Uses gradient descent for optimization; the loss function needs to be specified.
Specific: Logistic regression, Perceptron, Ridge classifier (for LSC), K-nearest neighbours (KNNs), Support vector machines (SVMs), Naive Bayes. These use specialized solvers for optimization.
5
All sklearn estimators for classification implement a few
common methods for model training, prediction and
evaluation.
6
Model training
fit(X, y[, coef_init, intercept_init, …])
Prediction
predict(X) predicts class label for samples
7
There are a few common miscellaneous methods as follows:
8
Now let's study how to implement different classifiers
with sklearn APIs.
9
Let's start with implementation of least square
classification (LSC) with RidgeClassifier API.
10
Ridge classifier
RidgeClassifier is a classifier variant of the Ridge regressor.
Binary classification:
classifier first converts binary targets to {-1, 1} and then treats the
problem as a regression task, optimizing the objective of regressor:
minimize a penalized residual sum of squares
$$\min_{w} \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$
sklearn provides different solvers for this optimization
sklearn uses α to denote regularization rate
predicted class corresponds to the sign of the regressor’s
prediction
Multiclass classification:
treated as multi-output regression
predicted class corresponds to the output with the highest value
11
How to train a least square classifier?
Step 1: Instantiate a classification estimator without passing any
arguments to it. This creates a ridge classifier object.
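The remaining steps of this slide are not preserved in the text. A minimal end-to-end sketch, assuming synthetic data from make_classification and the usual fit and predict calls:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic binary classification data, used only for illustration.
X_train, y_train = make_classification(n_samples=100, n_features=5,
                                       random_state=0)

ridge_classifier = RidgeClassifier()      # default regularization rate alpha=1.0
ridge_classifier.fit(X_train, y_train)    # learn the weight vector

# Predict class labels for the first five samples.
predicted = ridge_classifier.predict(X_train[:5])
```

The learnt weights are then available through the coef_ and intercept_ attributes.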
12
How to set regularization rate in RidgeClassifier?
13
How to solve optimization problem in RidgeClassifier?
Using one of the following solvers
svd: uses a Singular Value Decomposition of the feature matrix to compute the Ridge coefficients.
cholesky: uses scipy.linalg.solve function to obtain the closed-form solution.
sparse_cg: uses the conjugate gradient solver of scipy.sparse.linalg.cg.
lsqr: uses the dedicated regularized least-squares routine scipy.sparse.linalg.lsqr and it is fastest.
15
How to make RidgeClassifier select the solver
automatically?
ridge_classifier = RidgeClassifier(solver='auto')
if solver == 'auto':
    if return_intercept:
        # only sag supports fitting intercept directly
        solver = "sag"
    elif not sparse.issparse(X):
        solver = "cholesky"
    else:
        solver = "sparse_cg"
16
Is intercept estimation necessary for
RidgeClassifier?
If data is already centered, set fit_intercept as false, so that no
intercept will be used in calculations.
Default:
ridge_classifier = RidgeClassifier(fit_intercept=True)
17
How to make predictions on new data
samples?
Use predict method to predict class labels for samples
18
RidgeClassifierCV implements
RidgeClassifier with built-in cross validation.
19
Let's implement perceptron classifier with
Perceptron API.
20
Perceptron classification
It is a simple classification algorithm suitable for large-scale
learning.
Perceptron()
is equivalent to
SGDClassifier(loss="perceptron", eta0=1,
learning_rate="constant", penalty=None)
21
How to implement perceptron classifier?
Step 1: Instantiate a Perceptron estimator without passing any
arguments to it to create a classifier object.
22
Perceptron can be further customized with the following
parameters:
penalty (default = None)
l1_ratio (default = 0.15)
alpha (default = 0.0001)
early_stopping (default = False)
fit_intercept (default = True)
max_iter (default = 1000)
n_iter_no_change (default = 5)
tol (default = 1e-3)
eta0
validation_fraction
24
Let's implement logistic regression classifier
with LogisticRegression API.
25
LogisticRegression API
26
How to train a LogisticRegression classifier?
Step 1: Instantiate a classifier estimator without passing any
arguments to it. This creates a logistic regression object.
27
Logistic regression uses specific algorithms for solving the
optimization problem in training. These algorithms are
known as solvers.
28
How to select solvers for Logistic Regression
classifier?
solver options: 'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'
For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
For unscaled datasets, 'liblinear', 'lbfgs' and 'newton-cg' are robust.
For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss.
'liblinear' is limited to one-versus-rest schemes.
Penalties supported by each solver:
Solver        Penalty
'newton-cg'   ['l2', 'none']
'lbfgs'       ['l2', 'none']
'liblinear'   ['l1', 'l2']
'sag'         ['l2', 'none']
'saga'        ['elasticnet', 'l1', 'l2', 'none']
31
How to control amount of regularization in
logistic regression?
sklearn implementation uses parameter C, which is
inverse of regularization rate to control regularization.
Recall
$$\arg\min_{w} \; \text{regularization penalty} + C \cdot \text{cross-entropy loss}$$
32
LogisticRegression classifier has a class_weight parameter
in its constructor.
34
These classifiers can also be implemented with a
generic SGDClassifier API by setting the loss
parameter appropriately.
35
Let's study SGDClassifier API.
36
SGDClassifier
SGD is a simple yet very efficient approach to fitting linear
classifiers under convex loss functions
37
We need to set the loss parameter appropriately to train the
classifier of our interest with SGDClassifier
39
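An instance of SGDClassifier might have an equivalent estimator
in the scikit-learn API.
SGDClassifier(loss='log_loss') fits the same model as LogisticRegression, trained with SGD instead of one of the LogisticRegression solvers (the loss was named 'log' in scikit-learn versions before 1.1).
SGDClassifier(loss='hinge') fits a linear support vector machine.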
40
How does SGDClassifier work?
SGDClassifier implements a plain stochastic gradient descent
learning routine.
the gradient of the loss is estimated with one sample at a time
and the model is updated along the way with a decreasing
learning rate (or strength) schedule.
Advantages:
Efficiency.
Ease of implementation.
Disadvantages:
Requires a number of hyperparameters.
Sensitive to feature scaling.
It is important
to permute (shuffle) the training data before fitting the model.
to standardize the features.
41
How to use SGDClassifier for training a classifier?
Step 1: Instantiate an SGDClassifier estimator by setting the appropriate
loss parameter to define the classifier of interest. By default it uses hinge
loss, which is used for training a linear support vector machine.
42
How to perform regularization in SGD classifier?
penalty
alpha
Constant that multiplies the regularization term.
Has float values and default = 0.0001
43
How to set maximum number of epochs for SGD
Classifier?
The maximum number of passes over the training data (aka epochs)
is an integer that can be set by the max_iter parameter.
SGD_classifier = SGDClassifier(max_iter=100)
Default:
max_iter = 1000
44
Some common parameters between SGDClassifier
and SGDRegressor
learning_rate: 'constant', 'optimal', 'invscaling', 'adaptive'
average: SGDClassifier also supports averaged SGD (ASGD)
warm_start: True, False
Stopping criteria: tol, n_iter_no_change, max_iter, early_stopping, validation_fraction
45
Summary
We learnt how to implement the following classifiers with
sklearn APIs:
Least square classification (RidgeClassifier)
Perceptron (Perceptron)
Logistic regression (LogisticRegression)
47
Part II: Multi-learning classification
set up
48
Let's extend these classifiers to multi-
learning (multi-class, multi-label & multi-
output) settings.
49
Basics of multiclass, multilabel and
multioutput classification
Multiclass classification has exactly one output label and
the total number of labels > 2.
For more than one output, there are two types of classification
models:
Multilabel: total #labels = 2
Multiclass multioutput: total #labels > 2
Problem types and meta-estimators:
Multiclass classification (sklearn.multiclass): OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
Multilabel classification (sklearn.multioutput): MultiOutputClassifier, ClassifierChain
51
Many sklearn estimators have built-in support for multi-
learning problems.
Meta-estimators are not needed for such estimators,
however meta-estimators can be used in case we
want to use these base estimators with strategies
beyond the built-in ones.
52
Inherently multiclass: LogisticRegression (multi_class = 'multinomial'), LogisticRegressionCV (multi_class = 'multinomial'), RidgeClassifier, RidgeClassifierCV
Multilabel: RidgeClassifier, RidgeClassifierCV
53
First we will study multiclass APIs in sklearn.
54
Multi-class classification
Classification task with more than two classes.
Each example is labeled with exactly one class
In Iris dataset,
There are three class labels: setosa, versicolor and virginica.
Each example has exactly one label of the three available
class labels.
Thus, this is an instance of a multi-class classification.
57
type_of_target can determine different types
of multi-learning targets.
58
target_type: y
'multiclass': contains more than two discrete values; not a sequence of sequences; 1d or a column vector.
'multiclass-multioutput': 2d array that contains more than two discrete values; not a sequence of sequences; dimensions are of size > 1.
'multilabel-indicator': label indicator matrix; an array of two dimensions with at least two columns, and at most 2 unique values.
'unknown': array-like but none of the above, such as a 3d array, sequence of sequences, or an array of non-sequence objects.
59
Examples
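The examples from this slide are not preserved in the text. The sketch below reproduces the kind of inputs used in the scikit-learn documentation for type_of_target; the dictionary layout is only an illustrative way to pair each input with the type it is expected to yield.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Keys are the expected target types; values are example targets.
targets = {
    'binary': [1, 0, 0, 1],
    'multiclass': [1, 0, 2],
    'continuous': [0.1, 0.6],
    'multilabel-indicator': np.array([[0, 1], [1, 1]]),
    'multiclass-multioutput': np.array([[1, 2], [3, 1]]),
    'continuous-multioutput': np.array([[1.5, 2.0], [3.0, 1.6]]),
}

# Ask scikit-learn to infer the type of every example target.
detected = {name: type_of_target(y) for name, y in targets.items()}
```

Each entry of detected maps an expected type to the type that scikit-learn infers, and the two agree for all six inputs.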
60
Apart from these, type_of_target can determine three more
types of targets, corresponding to regression and
binary classification.
61
All classifiers in scikit-learn perform multiclass
classification out-of-the-box.
What are different multi-class classification
strategies implemented in sklearn?
OVR - OneVsRestClassifier
Fits one classifier per class c - c vs not c.
This approach is computationally efficient and requires
only k classifiers.
The resulting model is interpretable.
OneVsRestClassifier: Fits one classifier per class. For each classifier, the class is fitted against all the other classes.
OneVsOneClassifier: Fits one classifier per pair of classes. At prediction time, the class which received the most votes is selected.
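The two strategies can be compared with a short sketch on the Iris dataset. The choice of LogisticRegression as the base estimator and the max_iter value are assumptions made for illustration; any binary classifier could be wrapped the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One-vs-rest: one binary classifier per class (k classifiers).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: one binary classifier per pair of classes (k*(k-1)/2 classifiers).
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("OVR estimators:", len(ovr.estimators_))
print("OVO estimators:", len(ovo.estimators_))
```

With three classes both strategies happen to fit three classifiers; the counts diverge as the number of classes grows.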
Now we will learn how to perform multilabel
and multi-output classification.
How does MultiOutputClassifier work?
Strategy consists of fitting one classifier per target.
Input Feature Matrix (X) → Classifier #1 → Class #1
Input Feature Matrix (X) → Classifier #2 → Class #2
Input Feature Matrix (X) → Classifier #k → Class #k
Comparison of MultiOutputClassifier and ClassifierChain
MultiOutputClassifier: Able to estimate a series of target functions that are trained on a single predictor matrix to predict a series of responses.
ClassifierChain: Capable of exploiting correlations among targets.
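A minimal sketch of both meta-estimators on a synthetic multilabel problem follows. The dataset generator settings, the LogisticRegression base estimator and the chain order are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Synthetic data: 200 samples, 10 features, 3 binary labels per sample.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)
base = LogisticRegression(max_iter=1000)

# One independent classifier per target column.
multi = MultiOutputClassifier(base).fit(X, Y)

# A chain: each classifier also sees the labels earlier in the chain as features.
chain = ClassifierChain(base, order=[0, 1, 2], random_state=0).fit(X, Y)

print(multi.predict(X[:3]))
print(chain.predict(X[:3]))
```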
Summary
Evaluating Classifiers
So far we learnt how to train classifiers for binary, multi-class
and multi-label/output cases.
Stratified cross validation iterators
sklearn.model_selection module provides the following
three stratified APIs to create folds such that the overall
class distribution is replicated in individual folds.
StratifiedKFold
RepeatedStratifiedKFold
StratifiedShuffleSplit
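To illustrate what "replicating the class distribution" means, here is a small sketch using StratifiedKFold on an imbalanced toy target (80 negatives, 20 positives). The toy data and the fold count are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: features are irrelevant here, only the class balance matters.
X = np.zeros((100, 2))
y = np.array([0] * 80 + [1] * 20)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fraction of positive samples in each test fold.
fold_ratios = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_ratios)  # every fold keeps the overall 20% positive rate
```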
LogisticRegressionCV
Supports built-in cross validation for optimizing hyperparameters.
The following are key parameters for HPT and cross validation:
cv: specifies the cross validation iterator.
scoring: specifies the scoring function to use for HPT.
Cs: specifies the regularization strengths to experiment with.
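The three parameters can be combined as in the following sketch on the Iris dataset. The candidate values in Cs, the fold count and max_iter are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

clf = LogisticRegressionCV(
    Cs=[0.01, 0.1, 1.0, 10.0],  # regularization strengths to try
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",         # metric used to pick the best C
    max_iter=1000,
)
clf.fit(X, y)

print("selected C:", clf.C_)
train_accuracy = clf.score(X, y)
print("training accuracy:", train_accuracy)
```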
Classification metrics
accuracy_score balanced_accuracy_score
top_k_accuracy_score roc_auc_score
score(actual_labels, predicted_labels)
Confusion matrix
confusion_matrix evaluates classification accuracy by computing
the confusion matrix with each row corresponding to the true class.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_predicted)
Example:
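A small worked case, using hand-made labels that are assumptions for illustration, shows how rows correspond to true classes and columns to predicted classes:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for a three-class problem.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_predicted = [0, 2, 2, 2, 1, 0, 1]

# Row i, column j counts samples of true class i predicted as class j.
cm = confusion_matrix(y_true, y_predicted)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 1 2]]
```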
Confusion matrix can be displayed with the ConfusionMatrixDisplay API in sklearn.metrics.
From a confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
From estimators
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
From predictions
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
The classification_report function builds a text report showing
the main classification metrics.
from sklearn.metrics import classification_report
print(classification_report(y_true, y_predicted))
Classifier performance across probability thresholds
from sklearn.metrics import precision_recall_curve
# y_scores holds predicted probabilities or decision scores, not class labels
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
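As a hedged illustration, the sketch below computes the curve for four hand-made samples. The labels and scores are assumptions; in practice the scores would come from predict_proba or decision_function of a fitted classifier.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical binary labels and classifier scores.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# One precision/recall pair per threshold, plus a final (1.0, 0.0) point.
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```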
'samples': calculates the metric over the true and predicted classes for each sample in the evaluation data, and returns their average.
Naive Bayes in scikit-learn
Naive Bayes classifier
Naive Bayes classifier applies Bayes’ theorem with the “naive”
assumption of conditional independence between every pair of
features given the value of the class variable.
List of NB Classifiers
GaussianNB
BernoulliNB
CategoricalNB
MultinomialNB
ComplementNB
Which NB to use if data is only numerical?
Which NB to use if data is multinomially distributed?
What to do if data is imbalanced ?
What to do if data has multivariate Bernoulli distributions?
BernoulliNB implements the naive Bayes algorithm for data that is distributed according to multivariate Bernoulli distributions.
What to do if data is categorical?
CategoricalNB implements the categorical naive Bayes algorithm suitable for classification with discrete features that are categorically distributed.
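All the NB classifiers listed above share the same fit/predict interface. The sketch below fits two of them on the Iris dataset: GaussianNB on the raw numerical features and BernoulliNB on a binarized copy. Pairing GaussianNB with continuous features and the mean-threshold binarization are assumptions made for illustration, not recommendations taken from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X, y = load_iris(return_X_y=True)

# GaussianNB models each continuous feature with a per-class Gaussian.
gnb = GaussianNB().fit(X, y)
gnb_accuracy = gnb.score(X, y)

# BernoulliNB expects binary features, so threshold each feature at its mean
# (a hypothetical preprocessing choice for this example).
X_binary = (X > X.mean(axis=0)).astype(int)
bnb = BernoulliNB().fit(X_binary, y)
bnb_accuracy = bnb.score(X_binary, y)

print("GaussianNB training accuracy:", gnb_accuracy)
print("BernoulliNB training accuracy:", bnb_accuracy)
```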
Nearest neighbor classifier
1. KNeighborsClassifier
2. RadiusNeighborsClassifier
How are KNeighborsClassifier and RadiusNeighborsClassifier different?
KNeighborsClassifier:
learning based on the k nearest neighbors
most commonly used technique
choice of the value k is highly data-dependent
RadiusNeighborsClassifier:
learning based on the number of neighbors within a fixed radius r of each training point
used in cases where the data is not uniformly sampled
fixed value of r is specified, such that points in sparser neighborhoods use fewer nearest neighbors for the classification
How do you apply KNeighborsClassifier?
Step 1: Instantiate a KNeighborsClassifier estimator without passing any arguments to it to create a classifier object.
How do you specify the number of nearest
neighbors in KNeighborsClassifier?
kneighbor_classifier = KNeighborsClassifier(n_neighbors=3)
How do you assign weights to neighborhood in
KNeighborsClassifier?
Default:
kneighbor_classifier = KNeighborsClassifier(weights='uniform')
Can we define our own weight values for
KNeighborsClassifier?
Example:
def user_weights(distances):
    # receives an array of neighbor distances and must return
    # an array of the same shape containing the weights
    return distances

kneighbor_classifier = KNeighborsClassifier(weights=user_weights)
Which algorithm is used to compute the nearest
neighbors in KNeighborsClassifier?
Default:
kneighbor_classifier = KNeighborsClassifier(algorithm='auto')
Some additional parameters for tree algorithms in KNeighborsClassifier
For 'ball_tree' and 'kd_tree' algorithms, there are some other parameters to be set.
leaf_size: can affect the speed of the construction and query, as well as the memory required to store the tree. default = 30
metric: distance metric to use for the tree. It is either a string or a callable function. Some metrics are listed below: “euclidean”, “manhattan”, “chebyshev”, “minkowski”, “wminkowski”, “seuclidean”, “mahalanobis”. default = 'minkowski'
p: power parameter for the Minkowski metric. default = 2
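Putting the parameters above together, a minimal end-to-end sketch on the Iris dataset could look as follows. The train/test split, the choice of 'kd_tree' and the distance weighting are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

kneighbor_classifier = KNeighborsClassifier(
    n_neighbors=3,       # number of nearest neighbors to vote
    weights="distance",  # closer neighbors get larger weights
    algorithm="kd_tree", # tree-based neighbor search
    leaf_size=30,
    metric="minkowski",
    p=2,                 # p=2 makes Minkowski equal to Euclidean distance
)
kneighbor_classifier.fit(X_train, y_train)

y_pred = kneighbor_classifier.predict(X_test)
test_accuracy = kneighbor_classifier.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```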
How do you apply RadiusNeighborsClassifier?
Step 1: Instantiate a RadiusNeighborsClassifier estimator without passing any arguments to it to create a classifier object.
How do you specify the radius of the neighborhood in RadiusNeighborsClassifier?
radius_classifier = RadiusNeighborsClassifier(radius=1.0)
Parameters for RadiusNeighborsClassifier
weights: ‘uniform’, ‘distance’ or [callable] function; default = 'uniform'
algorithm: ‘ball_tree’, ‘kd_tree’, ‘brute’ or ‘auto’; default = ‘auto’
Other parameters: leaf_size, metric, p
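A short sketch of these parameters in use is given below. The outlier_label setting is an addition not covered above: it tells the classifier what to predict for a query point that has no training neighbors within the radius, which would otherwise raise an error.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import RadiusNeighborsClassifier

X, y = load_iris(return_X_y=True)

radius_classifier = RadiusNeighborsClassifier(
    radius=1.0,
    weights="uniform",
    algorithm="auto",
    outlier_label="most_frequent",  # fallback for points with no neighbors in range
)
radius_classifier.fit(X, y)

y_pred = radius_classifier.predict(X[:5])
print(y_pred)
```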
In this week, we will study how to implement support
vector machines for classification tasks with sklearn.
Support Vector Machines
Support Vector Machines (SVM) are a set of supervised
learning methods used for classification, regression and outliers
detection.
SVM constructs a hyper-plane or set of hyper-planes in a high
or infinite dimensional space, which can be used for
classification, regression or other tasks.
shape → (n_samples)
How to implement SVC (C-Support Vector
Classification)?
How to perform regularization in SVC classifier?
C: Regularization parameter (float value)
Note:
strength of the regularization is inversely proportional to C
C must be strictly positive
penalty is a squared l2 penalty
How to specify kernel type to be used in the algorithm?
kernel: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’ or ‘precomputed’
gamma:
‘scale’: value of gamma = $\frac{1}{\text{number of features} \times \text{X.var()}}$
‘auto’: value of gamma = $\frac{1}{\text{number of features}}$
float value
How to view support vectors?
After the classifier is fit on the training data, there are a few attributes that reveal the details of the support vectors.
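As a sketch under assumed settings (Iris data, RBF kernel, C=1.0), the fitted attributes support_vectors_, support_ and n_support_ can be inspected like this:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

SVC_classifier = SVC(C=1.0, kernel="rbf", gamma="scale")
SVC_classifier.fit(X, y)

# The support vectors themselves (one row per support vector).
print(SVC_classifier.support_vectors_.shape)
# Indices of the support vectors in the training data.
print(SVC_classifier.support_[:5])
# Number of support vectors for each class.
print(SVC_classifier.n_support_)
```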
How to implement NuSVC (ν -Support Vector
Classification)?
What is the significance of ν in NuSVC?
Default: ν = 0.5
How to implement LinearSVC (Linear Support
Vector Classification)?
Advantages of LinearSVC
How to provide penalty in LinearSVC classifier?
penalty
How to choose loss functions in LinearSVC
classifier?
Default:
LinearSVC_classifier = LinearSVC(loss='squared_hinge')
Regularization parameter
dual
fit_intercept
How to perform multi-class classification using SVM?
multi_class
‘ovr’
‘crammer_singer’
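The LinearSVC options discussed above (penalty, loss, regularization parameter and multi_class) can be combined as in this sketch on the Iris dataset. The specific values and max_iter are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

LinearSVC_classifier = LinearSVC(
    penalty="l2",
    loss="squared_hinge",
    C=1.0,
    multi_class="ovr",  # one-vs-rest; 'crammer_singer' is the alternative
    max_iter=10000,
)
LinearSVC_classifier.fit(X, y)

# With 'ovr' there is one weight vector per class.
print(LinearSVC_classifier.coef_.shape)
```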
Advantages of SVM
Effective in high dimensional spaces.
Effective in cases where number of dimensions is greater than
the number of samples.
Uses a subset of training points in the decision function
(called support vectors), so it is also memory efficient.
Versatile: different Kernel Functions can be specified for the
decision function.
Disadvantages of SVM
SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.
If the number of features is much greater than the number of samples, the kernel function must be chosen carefully to avoid over-fitting.
Decision trees
Machine Learning Practice
IIT Madras
Quick recap
Non-parametric supervised learning methods.
Can learn classification and regression models.
Predicts label based on rules inferred from the
features in the training set.
Tree algorithms
sklearn implementation of trees
scikit-learn uses an optimized version of the CART
algorithm; however, it does not support categorical
variables for now
Classification sklearn.tree.DecisionTreeClassifier
Regression sklearn.tree.DecisionTreeRegressor
sklearn tree parameters
splitter: Strategy for splitting at each node: best or random.
criterion:
Classification: gini, entropy
Regression: squared_error, friedman_mse, absolute_error, poisson
Tree visualization
sklearn.tree.plot_tree
Pre-pruning
Uses hyper-parameter search like GridSearchCV for finding the best set of parameters, such as max_depth and min_samples_split.
Post-pruning
First grows trees without any constraints and then prunes them with cost-complexity pruning (controlled by the ccp_alpha parameter).
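Both pruning styles can be sketched on the Iris dataset as follows. The parameter grid and the choice of the middle alpha from the pruning path are assumptions for illustration; in practice alpha would itself be tuned by cross validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: search over growth constraints before the tree is built.
param_grid = {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best pre-pruning parameters:", search.best_params_)

# Post-pruning: grow the full tree, then prune with a cost-complexity alpha.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # hypothetical pick
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)

print("leaves before pruning:", full_tree.get_n_leaves())
print("leaves after pruning:", pruned_tree.get_n_leaves())
```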
Tips for practical usage
Decision trees tend to overfit data with a large number of
features. Make sure that we have the right ratio of samples
to number of features.
Bagging and Boosting
Part 2: Boosting
There are two boosting estimators:
AdaBoost estimator
Gradient boosting estimator
AdaBoost estimator
Class: sklearn.ensemble.AdaBoostClassifier
Class: sklearn.ensemble.AdaBoostRegressor
Class: sklearn.ensemble.AdaBoostClassifier
Attributes of AdaBoost estimators
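As a hedged sketch, a fitted AdaBoostClassifier exposes attributes such as estimators_, estimator_weights_, estimator_errors_ and feature_importances_. The dataset and hyperparameter values below are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)

ada = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
ada.fit(X, y)

# Fitted weak learners (boosting may stop early, so there can be fewer than 50).
print("number of fitted estimators:", len(ada.estimators_))
# Weight and classification error of the first few weak learners.
print("estimator weights:", ada.estimator_weights_[:3])
print("estimator errors:", ada.estimator_errors_[:3])
# Impurity-based importance of each input feature.
print("feature importances:", ada.feature_importances_)
```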
Gradient boosting estimators
Class: sklearn.ensemble.GradientBoostingClassifier
Class: sklearn.ensemble.GradientBoostingRegressor
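A minimal usage sketch for the classifier variant follows. The Iris data, the split and the hyperparameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

gbc = GradientBoostingClassifier(
    n_estimators=100,   # number of boosting stages
    learning_rate=0.1,  # shrinks the contribution of each tree
    max_depth=3,        # depth of the individual regression trees
    random_state=0,
)
gbc.fit(X_train, y_train)

gbc_accuracy = gbc.score(X_test, y_test)
print("test accuracy:", gbc_accuracy)
```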
We will demonstrate XGBoost directly in a Colab notebook.
Contents
Part 3: XGBoost
Voting estimators
Class: sklearn.ensemble.VotingClassifier
Class: sklearn.ensemble.VotingRegressor
Both these estimators take the following common parameters:
estimators: list of (name, estimator) pairs to combine
weights: optional weights applied to the individual estimators
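A small sketch of a VotingClassifier built from three different models is shown below. The choice of models, their names and the weights are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

voting = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    weights=[2, 1, 1],  # the logistic regression vote counts twice
    voting="hard",      # majority vote on predicted class labels
)
voting.fit(X, y)

voting_accuracy = voting.score(X, y)
print("training accuracy:", voting_accuracy)
```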
Bagging estimators
Class: sklearn.ensemble.BaggingClassifier
Class: sklearn.ensemble.BaggingRegressor
Common parameters
base_estimator (default=None; renamed to estimator in scikit-learn 1.2): base estimator to fit on random subsets of dataset
n_estimators (default=10): number of base estimators in the ensemble
max_samples (default=1.0): number of samples to draw from X to train each base estimator (with replacement by default)
max_features (default=1.0): number of features to draw from X to train each base estimator (without replacement by default)
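The parameters above can be tried out with the following sketch. The base estimator is passed positionally so that the code does not depend on whether the installed version names it base_estimator or estimator; the dataset and fractions are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

bagging = BaggingClassifier(
    DecisionTreeClassifier(),  # base estimator, passed positionally
    n_estimators=10,
    max_samples=0.8,   # each estimator sees 80% of the samples (with replacement)
    max_features=0.5,  # each estimator sees 50% of the features (without replacement)
    random_state=0,
)
bagging.fit(X, y)

# Feature indices used by each of the fitted base estimators.
for features in bagging.estimators_features_[:3]:
    print(features)
```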
Random forest estimators
Class: sklearn.ensemble.RandomForestClassifier
Class: sklearn.ensemble.RandomForestRegressor
Bagging parameters
The number of trees is specified by n_estimators.
Default #trees for classification = 100
Default #trees for regression = 100
Bagging parameters
max_samples specifies the number of samples to be drawn while bootstrapping.
None : Use all samples in the training data.
int : Use max_samples samples from the training data.
float : Use max_samples * total number of samples from the training data. The value should be between 0 and 1.
The number of features to be considered while splitting is specified by max_features, which accepts auto, sqrt, log2, int, float or None:
int: the value specified
float: value * #features
auto: sqrt(#features)
sqrt: sqrt(#features)
log2: log2(#features)
None: #features
Decision tree parameters
The criterion for splitting a node is specified through criterion.
Default for classification: gini
Default for regression: squared_error
Trained random forest estimators
Training and inference for random forest
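Training and inference follow the usual fit/predict pattern. The sketch below is an illustration on the Iris dataset; the split and the parameter values are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees
    criterion="gini",     # split quality measure
    max_features="sqrt",  # features considered at each split
    max_samples=0.8,      # fraction of samples drawn for each tree
    random_state=0,
)
rf.fit(X_train, y_train)          # training

y_pred = rf.predict(X_test)       # inference
rf_accuracy = rf.score(X_test, y_test)
print("test accuracy:", rf_accuracy)
print("feature importances:", rf.feature_importances_)
```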
Neural Networks
In this week, we will study how to implement Multilayer
Perceptron neural network models for classification and
regression tasks with sklearn.
Multilayer Perceptron (MLP)
It is a supervised learning algorithm.
MLPClassifier
How to implement MLPClassifier?
MLPClassifier
Step 3: After fitting (training), the model can make predictions for new samples (X_test) using two methods:
MLP_clf.predict(X_test)
MLP_clf.predict_proba(X_test)
MLPClassifier(hidden_layer_sizes=(15,10,5))
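The full workflow can be sketched as follows, assuming the Iris dataset, a standard-scaling step and a generous max_iter; none of these choices come from the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# MLPs are sensitive to feature scale, so standardize first.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Three hidden layers with 15, 10 and 5 units.
MLP_clf = MLPClassifier(hidden_layer_sizes=(15, 10, 5), max_iter=2000, random_state=0)
MLP_clf.fit(X_train_s, y_train)

labels = MLP_clf.predict(X_test_s)        # predicted class labels
proba = MLP_clf.predict_proba(X_test_s)   # per-class probabilities

# One weight matrix per layer transition: input->15, 15->10, 10->5, 5->output.
weight_shapes = [w.shape for w in MLP_clf.coefs_]
print(weight_shapes)
```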
How to perform regularization in MLPClassifier?
float value
How to set the activation function for the hidden layers?
'identity': no-op activation, returns $f(x) = x$
'logistic': logistic sigmoid function, returns $f(x) = \frac{1}{1 + \exp(-x)}$
'tanh': hyperbolic tan function, returns $f(x) = \tanh(x)$
'relu' (Default): rectified linear unit function, returns $f(x) = \max(0, x)$
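For reference, the four activation functions can be written directly in NumPy. This is an illustrative sketch of the formulas, not sklearn's internal implementation; the function names are hypothetical.

```python
import numpy as np

def identity(x):
    """No-op activation: f(x) = x."""
    return x

def logistic(x):
    """Logistic sigmoid: f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    """Hyperbolic tangent: f(x) = tanh(x)."""
    return np.tanh(x)

def relu(x):
    """Rectified linear unit: f(x) = max(0, x)."""
    return np.maximum(0, x)

x = np.array([-2.0, 0.0, 2.0])
for fn in (identity, logistic, tanh, relu):
    print(fn.__name__, fn(x))
```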
How to perform weight optimization in MLPClassifier?
MLPClassifier optimizes the log-loss function using LBFGS or stochastic gradient descent.
lbfgs: If the solver is ‘lbfgs’, the classifier will not use minibatch.
sgd, adam (Default): Size of minibatches can be set for these stochastic optimizers with batch_size (int). The default batch_size is 'auto':
batch_size=min(200, n_samples)
How to view weight matrix coefficients of trained MLPClassifier?
Example: weights between input and first hidden layer:
print(MLP_clf.coefs_[0])
Example: bias values for first hidden layer:
print(MLP_clf.intercepts_[0])
Some parameters in MLPClassifier
float value; default: 0.001
'constant', 'invscaling' or 'adaptive'; default: 'constant'
float value; default: 0.5
int value; default: 500
How to implement MLPRegressor?
Step 3: After fitting (training), the model can make predictions for new samples (X_test):
MLP_reg.score(X_test, y_test)
returns the R2 score, for example: 0.45678889
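A minimal end-to-end sketch for MLPRegressor is given below. The synthetic regression data, the scaling step and the network size are assumptions for illustration, and the resulting score will differ from the example value above.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (hypothetical data).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

MLP_reg = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
MLP_reg.fit(X_train_s, y_train)

predictions = MLP_reg.predict(X_test_s)
r2 = MLP_reg.score(X_test_s, y_test)  # coefficient of determination (R2)
print("R2 score:", r2)
```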