ANL252 SU5 Jul2022
3/60
Algorithms in Scikit-Learn
Here are some common algorithms available in scikit-learn:
Supervised learning
• Linear Models
• Discriminant Analysis
• Kernel ridge regression
• Support Vector Machines
• Nearest Neighbours
• Gaussian Processes
• Cross decomposition
• Decision Trees
• Isotonic regression
• Neural network models (supervised)
Unsupervised learning
• Gaussian mixture models
• Clustering
• Covariance estimation
• Novelty and Outlier Detection
• Density Estimation
• Neural network models (unsupervised)
4/60
Install and Import Scikit-Learn
• Use the import syntax to load scikit-learn into the program.
import sklearn
5/60
Import Algorithms from Scikit-Learn
• Since the library is extensive, programmers usually only load the required
algorithm or only its “estimator” object.
• For instance, if linear regression models are required, import the
estimator LinearRegression from the module linear_model by:
from sklearn.linear_model import LinearRegression
• It is not unusual to load multiple algorithms for one analytics task.
• Put sufficient comments in the program to explain the purpose and use of
each imported module.
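• For illustration, a minimal sketch (assuming a regression task) of how several components might be imported together, each with a short comment on its purpose:
from sklearn.model_selection import train_test_split  # split data into training and testing sets
from sklearn.linear_model import LinearRegression     # linear regression estimator
from sklearn import metrics                           # model evaluation measures
from sklearn import preprocessing                     # scaling and transformation utilities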
6/60
Activity
• Import the following algorithms of scikit-learn into your JupyterLab
notebook:
➢ train_test_split
➢ metrics
➢ preprocessing
➢ tree
➢ Module “KMeans” from “cluster”
➢ Module “” from “decomposition”
7/60
Discussion
• What is the difference between supervised and unsupervised machine
learning?
8/60
Data Preparation for
Analytics Algorithms
Specify and Remove Missing Values
• We can define specific strings (e.g., the empty string "" or strings with only white space) as missing values during the import of a DataFrame using read_csv().
DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string", na_filter = True/False)
• One way to deal with missing values is to remove those rows with any
missing value from the DataFrame completely.
DataFrame_name.dropna(axis = 0, how = "any"/"all")
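• A small sketch, assuming a hypothetical file sales.csv in which the string "na" marks missing entries:
import pandas as pd
# Treat the string "na" as a missing value while reading the file.
sales = pd.read_csv("sales.csv", na_values = "na", na_filter = True)
# Drop every row that contains at least one missing value.
sales_complete = sales.dropna(axis = 0, how = "any")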
11/60
Replace Missing Values
• Alternatively, we can choose to replace missing values of an entire
DataFrame or a particular column by specific values.
DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)
12/60
Reduce Number of Categories
• Categorical variables often need special treatment in data analytics.
• Usually, they are converted to dummy variables required by some
algorithms.
• Example: Weight of {Normal, Overweight, Obese} can become two dummy
variables of Overweight with either 0 or 1; and Obese with either 0 or 1.
Here, 0 means No, 1 means Yes.
• Sometimes we reduce the number of categories to simplify the problem.
• Example: Grade {Fail, Subpass, Pass} can become {Fail, Pass}
• In Python, we can replace category labels by the .replace() method:
DataFrame_Name["column_label"].replace(to_replace, value)
13/60
Discretisation
• If a categorical variable has ordered numeric values as categories, we can
discretise them into new bins by the cut() function.
DF_Name["column"] = pd.cut(x = array, bins, labels)
• The cut() function only includes the rightmost edge in each bin and not
the leftmost one.
• Put the highest value of each category in the list of bins. Start the list with -1
in case 0 also belongs to the original categories, or 0 otherwise.
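• A small sketch, assuming a hypothetical column of integer scores between 0 and 100 that should be grouped into three bins:
import pandas as pd
df = pd.DataFrame({"Score": [0, 12, 45, 67, 88, 95]})
# Start the bin edges at -1 so that a score of 0 falls into the first bin,
# because cut() excludes the leftmost edge of each bin.
df["Score_band"] = pd.cut(x = df["Score"], bins = [-1, 49, 74, 100], labels = ["Low", "Medium", "High"])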
14/60
Select Variables
• Some variables may not be relevant to the analysis.
• Example: Name is not useful for predicting Income
• These variables should be removed from the DataFrame before applying
the scikit-learn estimator.
• Use the indexers .iloc[] and .loc[] to select the necessary columns.
15/60
Using iloc vs loc for Selection (I)
• Suppose we only need Fruits and Prices from the Imports dataset
16/60
Using iloc vs loc for Selection (II)
• loc: we specify the variable labels 'Fruits' and 'Prices'; iloc: we specify the corresponding column positions, as in the sketch below.
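• A minimal sketch of both selection styles, assuming the Imports DataFrame exists and that Fruits and Prices are its first and third columns:
# loc selects columns by their labels.
subset_loc = Imports.loc[:, ["Fruits", "Prices"]]
# iloc selects columns by their integer positions (assumed to be 0 and 2 here).
subset_iloc = Imports.iloc[:, [0, 2]]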
17/60
Rename Variables
• Use the .rename() method to rename the variables in a DataFrame.
DF_Name["column"].rename(columns = {"oldvar": "newvar"})
• The column labels to be renamed must be put as keys of a dictionary.
• The values of the dictionary will be the new labels of the corresponding
columns.
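• For example (with the hypothetical labels "Prices" and "Price_SGD"):
# The old label is the dictionary key, the new label is the value.
DF_Name = DF_Name.rename(columns = {"Prices": "Price_SGD"})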
18/60
Create Dummy Variables
• Categorical variables must be converted to dummy variables before being
included in the scikit-learn algorithms.
• Dummy variables only have two values: 0 and 1.
• If an observation belongs to a certain category, the corresponding dummy
variable will be 1, otherwise 0.
• Use get_dummies() to convert categorical variables to dummy variables.
DF_name["column"] = pd.get_dummies(data, drop_first)
19/60
Data Transformation
• Variables with a wide range of values tend to have a higher impact on the
model than those with smaller value ranges.
• They need to be scaled down to match the other numeric variables.
• The most common methods here are normalisation and standardisation.
• For normalisation, use normalize() from the preprocessing module.
Object_name = sklearn.preprocessing.normalize(X)
• For standardisation, the estimator needs to be initiated first before the
transformation can take place.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
Object_name = scaler.transform(X)
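• A compact sketch of both approaches, assuming X is a prepared numeric array:
from sklearn import preprocessing
import numpy as np
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
# Normalisation: rescale each row (sample) to unit norm.
X_norm = preprocessing.normalize(X)
# Standardisation: fit the scaler first, then transform so that each
# column has mean 0 and standard deviation 1.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)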
20/60
Training and Testing Data
• The performance of a predictive model is measured by its accuracy of
predicting unseen data.
• Since such data are usually unavailable, analysts usually “hold back” a
subset of data, the testing dataset, for model evaluation purpose.
• The remaining data, the training dataset, are used to construct the model.
• The train_test_split() function from the model_selection module splits a DataFrame randomly into training and testing datasets.
Object_name = sklearn.model_selection.train_test_split(arrays, test_size, random_state)
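• For instance (a sketch assuming X and y have already been extracted), the function returns the training and testing parts of every array passed to it:
from sklearn.model_selection import train_test_split
# Hold back 30% of the data for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)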
21/60
Extract Dependent and Independent
Variables (I)
• In scikit-learn, the key step of many algorithms is fitting a model with the .fit() method, which takes two parameters: X and y.
• The parameter X is the design matrix containing all independent variables.
• The parameter y is the vector of the target variable.
• Both X and y can be NumPy arrays or pandas DataFrames.
• We need to extract X and y from the original DataFrame.
22/60
Extract Dependent and Independent
Variables (II)
Select the column representing y and save it as a new object.
y = DataFrame_name["target_var"]
Subset the matrix X from the DataFrame in a similar manner.
X = DataFrame_name[["X1", "X2", …]]
If the DataFrame only contains the independent and target variables, just
drop the target variable from the original DataFrame to obtain X.
X = DataFrame_name.drop(columns = "target_var")
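• A small sketch with a hypothetical DataFrame df whose target is "Income" and whose predictors are "Age" and "Education":
y = df["Income"]
X = df[["Age", "Education"]]
# Equivalent when the DataFrame contains only the predictors and the target:
X = df.drop(columns = "Income")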
23/60
Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Since there are many car models in the market (and their names are
complicated and not comparable with each other), it is more important to
include “category” than “model” in our analysis later.
➢ Remove “model” from the DataFrame.
➢ Reduce entries with more than one category (e.g.,
'Convertible,Sedan,Coupe') in the “Category” field by assigning each car
to the first listed category only (i.e., 'Convertible').
➢ Create a training and a testing dataset. The proportion here should be
0.7:0.3 and the random state should be set to 0.
24/60
Discussion
• Can we use the .replace() method to reduce the number of categories of
a numeric categorical variable instead of discretisation?
• If the target variable is categorical and binary, the dummy-variable
conversion produces two dummy variables. Which dummy variable shall we
choose for the scikit-learn algorithm?
25/60
Activity
SU5 Regression Example
Run through the SU5 linear regression notebook
27/60
The rest of the slides are
optional
Clustering
(This section is optional)
Introduction to K-Means Clustering
• One of the most popular clustering algorithms is the K-Means method.
• This technique is very efficient in clustering large data sets.
• The algorithm splits the data into K groups of equal variance by
minimising the within-cluster variation (inertia).
• The K-Means method requires the number of clusters to be specified
before the algorithm starts.
• The clusters are characterised by their centroids (means), i.e., the centre
of each cluster in the space of the clustering variables.
• The K means are adjusted iteratively during the clustering process.
30/60
Intuition of K-Means Clustering
• We want members in a cluster to be ‘close’ to each other, and clusters to be
far apart from each other.
31/60
K-Means Clustering Process
The process of K-Means clustering contains five main steps:
1. Select K observations randomly as initial cluster centroids.
2. Compute the distance of each object to each centroid.
3. Assign each object to the nearest centroid based on the distance
computed. Objects assigned to the same centroid form a cluster.
4. Recompute the centroid for each cluster using the objects in it.
Restart iteration from Step 2.
5. Iteration stops when the centroids remain unchanged, or a specified
number of iterations has been performed.
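• A minimal NumPy sketch of these five steps (for illustration only; it is not the scikit-learn implementation and ignores edge cases such as empty clusters):
import numpy as np

def kmeans(X, k, max_iter = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(max_iter):
        # Steps 2-3: assign each object to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # Step 4: recompute each centroid from the objects assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids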
32/60
K-Means Properties
• The Euclidean distance is generally preferred when computing the distance
between an object and the centroids.
• The (squared) Euclidean distance is the sum of the squared differences
between the values of the selected clustering criteria.
• After the clustering process, check that the resulting clusters yield
meaningful insights. A good clustering solution allows a clear description
of each cluster's profile.
• Evaluate the quality of clustering solutions by cohesion, separation and
parsimony.
33/60
Fitting K-Means Clustering by Scikit-
Learn
• In scikit-learn, all algorithms are controlled and executed by the estimator.
• Adjust the parameters for the modelling process when declaring the
estimator.
• In K-Means clustering, the estimator is called KMeans, and it can be
imported from the cluster module of the scikit-learn package.
from sklearn.cluster import KMeans
• Initiate the KMeans estimator first and adjust the estimation parameters
according to the individual needs.
km_Object = sklearn.cluster.KMeans(n_clusters = 8, init =
"k-means++", n_init = 10, max_iter = 300, tol = 0.0001,
precompute_distances = "auto", random_state = None)
34/60
Parameters of the K-Means Estimator
Below is a list of the parameters of the K-Means estimator.
35/60
Fit K-Means Clustering
• Perform K-Means clustering using the KMeans estimator.
km_fit_Object = km_Object.fit(X, sample_weight = None)
• X is a prepared DataFrame based on which the clusters are constructed.
• With sample_weight we can pre-specify the weights for each
observation in X. If it is set to None, all observations will be assigned equal
weight.
• The fitted estimator of the K-Means algorithm is saved in
km_fit_Object.
36/60
K-Means Clustering
• Find the clusters for each observation in the original DataFrame using the
fit_predict() function.
km_pred_Object = km_Object.fit_predict(X, sample_weight = None)
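• A short end-to-end sketch, assuming X is a prepared numeric DataFrame and three clusters are requested:
from sklearn.cluster import KMeans
# Declare the estimator with 3 clusters and a fixed random state for reproducibility.
km_Object = KMeans(n_clusters = 3, random_state = 0)
# Fit the model and obtain the cluster index of every observation in X.
km_pred_Object = km_Object.fit_predict(X)
# The cluster centroids are available after fitting.
print(km_Object.cluster_centers_)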
37/60
Explore K-Means Clustering Models (I)
• Explore the characteristics of the clusters by looking at some statistics
such as mean, min, max, etc. of the clustering criteria.
• Cross-tabulate the clustering criteria and the cluster index to understand
the features of the clusters.
• If a clustering criterion variable is categorical, interpret the clusters based
on the proportional distribution of the clusters in each category.
• Create cross-tables with the crosstab() function:
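• A small sketch, assuming the cluster indices km_pred_Object from fit_predict() and a hypothetical categorical criterion "Region" in the original DataFrame df:
import pandas as pd
# Proportional distribution of the categories within each cluster.
cluster_profile = pd.crosstab(index = km_pred_Object, columns = df["Region"], normalize = "index")
print(cluster_profile)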
38/60
Activity
Run through the SU5 Clustering Notebook
39/60
Decision Trees
(This section is optional)
Introduction to Decision Trees (I)
• Decision trees split a sample to reach certain decision points based on
some criteria. It is therefore a sample classification technique.
• Each decision point is called a node and represents a subset of the sample.
• Nodes split from a superordinate node are called child nodes, while the
originating node is called the parent node.
• A child node with no further subdivisions or splitting is called a leaf node.
• The decision tree algorithm predicts the classification of each observation
based on the input variables and the values of the target variable.
• These rules of decision form the resulting model which can be illustrated
by a tree-like structure graphically.
41/60
Introduction to Decision Trees (II)
• A decision tree model illustrates the complex relationship between the
target variable and the input variables rather well.
• The importance of the input variables in the decision rules is reflected in
the decision tree: variables higher up in the tree are more important.
• The value of a leaf node is the prediction of the target variable for those
observations classified in it.
• If the target variable is categorical, the predicted value will be the mode. If
it is numeric, the value will be the mean.
42/60
Example of Decision Tree
[Figure: an example decision tree; the root node represents the most important feature.]
(Source: https://scikit-learn.org/stable/modules/tree.html)
44/60
Algorithm of Constructing Decision
Trees (II)
• The sample subsets at each node become more homogeneous as the
splitting process advances.
• The homogeneity of each subset reflects the split quality from a parent
node to its child nodes.
• In classification trees, homogeneity is measured by the Gini index or entropy.
• In regression trees, impurity is measured by the sum of squared errors
(SSE); both measures are sketched below.
• In both options, the split with the highest reduction of impurity or highest
homogeneity in the child nodes will be chosen.
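• For reference, a sketch of the two measures in standard notation, where p_j is the proportion of class j in a node and \bar{y} is the node mean of the target:
\text{Gini} = 1 - \sum_{j} p_j^2, \qquad \text{SSE} = \sum_{i \in \text{node}} (y_i - \bar{y})^2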
45/60
Algorithm of Constructing Decision
Trees (III)
• Two possibilities to stop the CART algorithm:
➢ no or little improvement of impurity detected after a new split
• Gini/SSE drops below a threshold
• Low Gini or SSE indicates rather homogeneous nodes, another
split would not decrease the impurity significantly.
• High thresholds lead to oversimplified trees; low thresholds lead to
overcomplicated trees.
➢ a pre-specified depth (number of splits) of the tree is reached
• control tree size without oversimplifying or overcomplicating it.
• Set a lower bound on the number of observations in the nodes.
46/60
Evaluation of Decision Trees (I)
• Evaluate the performance of a decision tree by examining its fit on
available data.
➢ For classification trees, use confusion matrix
• Summarise correct and incorrect classifications.
• The larger the proportion of identical predicted and observed
classifications, the more accurate the decision tree model is.
➢ For regression trees, use Root-Mean-Square-Error (RMSE)
• The root of the mean squared deviation of the predicted values from the observed values.
• The lower the deviance, the closer are the predictions to the
actual values, and the better is the model.
47/60
Evaluation of Decision Trees (II)
• Evaluate also the performance of a decision tree by its prediction accuracy
on unseen data.
• Partition the original dataset randomly into a training and a testing
dataset.
• The decision tree as a predictive model is evaluated by its ability to apply
what it has “learned” from the training data to the testing data.
• If the prediction accuracy of the model on the training data is much higher
than on the testing data, the model tends to be overfitted.
48/60
Decision Tree Estimators (I)
• First, import the sklearn.tree module.
from sklearn import tree
• Initiate the DecisionTreeClassifier estimator for classification tree:
tree_Object = sklearn.tree.DecisionTreeClassifier(…)
• To fit a regression tree, declare the DecisionTreeRegressor estimator.
tree_Object = sklearn.tree.DecisionTreeRegressor(…)
• “…” indicates the parameters of the estimator that control the fitting of
the decision trees.
49/60
Decision Tree Estimators (II)
• Here is a list of the parameters of the DecisionTreeClassifier and
DecisionTreeRegressor estimators.
• The main differences between them are the parameter criterion and
the availability of the parameter class_weight.
• criterion (default: "gini" for classification, "mse" for regression); values for classification: "gini", "entropy"; values for regression: "mse", "friedman_mse", "mae", "poisson"
• splitter (default: "best"); values: "best", "random"
• max_depth (default: None); value type: integer
• min_samples_split (default: 2); value type: integer or float
• min_samples_leaf (default: 1); value type: integer or float
• (Source: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html,
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
50/60
Decision Tree Estimators (III)
51/60
Fit Decision Trees
• Fit decision trees on a prepared training DataFrame with .fit().
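• A minimal sketch, assuming X_train, y_train and X_test come from an earlier train_test_split() and using illustrative parameter values:
from sklearn.tree import DecisionTreeClassifier
# Declare a classification tree with a limited depth and fit it on the training data.
tree_Object = DecisionTreeClassifier(criterion = "gini", max_depth = 3, random_state = 0)
tree_fit_Object = tree_Object.fit(X_train, y_train)
# Predict the classes of the testing data.
tree_pred_Object = tree_Object.predict(X_test)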
52/60
Evaluate Decision Trees
• One way to evaluate the performance of a decision tree is the confusion
matrix.
• The confusion matrix can be computed by confusion_matrix().
metrics.confusion_matrix(target_var, tree_pred_Object)
• Accuracy, precision, and recall scores can also assess the predictive
performance of a decision tree.
metrics.accuracy_score(target_var, tree_pred_Object)
metrics.precision_score(target_var, tree_pred_Object)
metrics.recall_score(target_var, tree_pred_Object)
• These scores are useful in examining binary classification target variables.
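• Continuing the sketch above (assuming a binary target y_test for the testing data):
from sklearn import metrics
print(metrics.confusion_matrix(y_test, tree_pred_Object))
print(metrics.accuracy_score(y_test, tree_pred_Object))
print(metrics.precision_score(y_test, tree_pred_Object))
print(metrics.recall_score(y_test, tree_pred_Object))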
53/60
Example of a Confusion Matrix
• Suppose we fit a decision tree to predict if a consumer buys a coke
➢ target: ‘1’ for buy a coke; ‘0’ for did not buy a coke.
            Predicted: 0                      Predicted: 1
Actual: 0   True Negative (20 observations)   False Positive (30 observations)
Actual: 1   False Negative (10 observations)  True Positive (40 observations)
55/60
Plot Decision Trees (I)
• A decision tree plot is an important tool for understanding a tree and
evaluating its performance.
• The plot can be conveniently generated by the plot_tree() function of the
tree module.
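• A short sketch, assuming the fitted classifier tree_Object from the earlier example and hypothetical feature and class names:
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize = (10, 6))
tree.plot_tree(tree_Object,
               feature_names = ["Age", "Income"],  # hypothetical feature names
               class_names = ["No", "Yes"],        # hypothetical class labels
               filled = True)                      # colour nodes by majority class
plt.show()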
56/60
Plot Decision Trees (II)
Here is a list of the parameters of the plot_tree() function.
57/60
Plot Decision Trees (III)
(Source: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)
58/60
Activity
Run through the SU5 Decision Tree Notebook
59/60
Discussion
• Name all the main differences between classification trees and regression
trees.
• How do we judge a model with low prediction accuracy for testing data but
high accuracy for training data? Is it possible for a model to have low
prediction accuracy for training data, but high accuracy for testing data?
60/60