ANL252 SU5 Jul2022
3/60
Algorithms in Scikit-Learn
Here are some common algorithms available in scikit-learn:
Supervised learning
• Linear Models
• Discriminant Analysis
• Kernel ridge regression
• Support Vector Machines
• Nearest Neighbours
• Gaussian Processes
• Cross decomposition
• Decision Trees
• Isotonic regression
• Neural network models (supervised)
Unsupervised learning
• Gaussian mixture models
• Clustering
• Covariance estimation
• Novelty and Outlier Detection
• Density Estimation
• Neural network models (unsupervised)
4/60
Install and Import Scikit-Learn
• Use the import syntax to load scikit-learn into the program.
import sklearn
5/60
Import Algorithms from Scikit-Learn
• Since the library is extensive, programmers usually only load the required
algorithm or only its “estimator” object.
• For instance, if linear regression models are required, import the
estimator LinearRegression from the module linear_model by:
from sklearn.linear_model import LinearRegression
• It is not unusual to load multiple algorithms for one analytics task.
• Put sufficient comments in the program to explain the purpose and use of
each imported module.
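• For illustration, a minimal sketch (assuming a regression task) of how several components might be imported together, each with a short comment on its purpose:
from sklearn.model_selection import train_test_split  # split data into training and testing sets
from sklearn.linear_model import LinearRegression     # linear regression estimator
from sklearn import metrics                           # model evaluation measures
from sklearn import preprocessing                     # scaling and transformation utilities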
6/60
Activity
• Import the following algorithms of scikit-learn into your JupyterLab
notebook:
➢ train_test_split
➢ metrics
➢ preprocessing
➢ tree
➢ Module “KMeans” from “cluster”
➢ Module “” from “decomposition”
7/60
Discussion
• What is the difference between supervised and unsupervised machine
learning?
8/60
Data Preparation for
Analytics Algorithms
Specify and Remove Missing Values
• We can define specific strings (e.g., the empty string "" or strings with only white space) as missing values during the import of a DataFrame using read_csv().
DataFrame_name = pd.read_csv("csv_file_name.csv",
na_values = "na_string", na_filter = True/False)
• One way to deal with missing values is to remove those rows with any
missing value from the DataFrame completely.
DataFrame_name.dropna(axis = 0, how = "any"/"all")
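• A small sketch, assuming a hypothetical file sales.csv in which the string "na" marks missing entries:
import pandas as pd
# Treat the string "na" as a missing value while reading the file.
sales = pd.read_csv("sales.csv", na_values = "na", na_filter = True)
# Drop every row that contains at least one missing value.
sales_complete = sales.dropna(axis = 0, how = "any")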
11/60
Replace Missing Values
• Alternatively, we can choose to replace missing values of an entire
DataFrame or a particular column by specific values.
DataFrame_name.fillna(value = repl_value)
DataFrame_name["column_label"].fillna(value = repl_value)
12/60
Reduce Number of Categories
• Categorical variables often need special treatment in data analytics.
• Usually, they are converted to dummy variables required by some
algorithms.
• Example: Weight of {Normal, Overweight, Obese} can become two dummy
variables of Overweight with either 0 or 1; and Obese with either 0 or 1.
Here, 0 means No, 1 means Yes.
• Sometimes we reduce the number of categories to simplify the problem.
• Example: Grade {Fail, Subpass, Pass} can become {Fail, Pass}
• In Python, we can replace category labels by the .replace() method:
DataFrame_Name["column_label"].replace(to_replace, value)
13/60
Discretisation
• If a categorical variable has ordered numeric values as categories, we can
discretise them into new bins by the cut() function.
DF_Name["column"] = pd.cut(x = array, bins, labels)
• The cut() function only includes the rightmost edge in each bin and not
the leftmost one.
• Put the highest value of each category in the list of bins. Start the list with -1
in case 0 also belongs to the original categories, or 0 otherwise.
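• A small sketch, assuming a hypothetical column of integer scores between 0 and 100 that should be grouped into three bins:
import pandas as pd
df = pd.DataFrame({"Score": [0, 12, 45, 67, 88, 95]})
# Start the bin edges at -1 so that a score of 0 falls into the first bin,
# because cut() excludes the leftmost edge of each bin.
df["Score_band"] = pd.cut(x = df["Score"], bins = [-1, 49, 74, 100], labels = ["Low", "Medium", "High"])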
14/60
Select Variables
• Some variables may not be relevant to the analysis.
• Example: Name is not useful for predicting Income
• These variables should be removed from the DataFrame before applying
the scikit-learn estimator.
• Use the indexers .iloc[] and .loc[] to select the necessary columns.
15/60
Using iloc vs loc for Selection (I)
• Suppose we only need Fruits and Prices from the Imports dataset
16/60
Using iloc vs loc for Selection (II)
• loc: we specify the variable labels 'Fruits' and 'Prices'; iloc: we specify the corresponding column positions, as in the sketch below.
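• A minimal sketch of both selection styles, assuming the Imports DataFrame exists and that Fruits and Prices are its first and third columns:
# loc selects columns by their labels.
subset_loc = Imports.loc[:, ["Fruits", "Prices"]]
# iloc selects columns by their integer positions (assumed to be 0 and 2 here).
subset_iloc = Imports.iloc[:, [0, 2]]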
17/60
Rename Variables
• Use the .rename() method to rename the variables in a DataFrame.
DF_Name["column"].rename(columns = {"oldvar": "newvar"})
• The column labels to be renamed must be put as keys of a dictionary.
• The values of the dictionary will be the new labels of the corresponding
columns.
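• For example (with the hypothetical labels "Prices" and "Price_SGD"):
# The old label is the dictionary key, the new label is the value.
DF_Name = DF_Name.rename(columns = {"Prices": "Price_SGD"})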
18/60
Create Dummy Variables
• Categorical variables must be converted to dummy variables before being
included in the scikit-learn algorithms.
• Dummy variables only have two values: 0 and 1.
• If an observation belongs to a certain category, the corresponding dummy
variable will be 1, otherwise 0.
• Use get_dummies() to convert categorical variables to dummy variables.
DF_name["column"] = pd.get_dummies(data, drop_first)
19/60
Data Transformation
• Variables with a wide range of values tend to have a higher impact on the
model than those with smaller value ranges.
• They need to be scaled down to match the other numeric variables.
• The most common methods here are normalisation and standardisation.
• For normalisation, use normalize() from the preprocessing module.
Object_name = sklearn.preprocessing.normalize(X)
• For standardisation, the estimator needs to be initiated first before the
transformation can take place.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
Object_name = scaler.transform(X)
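• A compact sketch of both approaches, assuming X is a prepared numeric array:
from sklearn import preprocessing
import numpy as np
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
# Normalisation: rescale each row (sample) to unit norm.
X_norm = preprocessing.normalize(X)
# Standardisation: fit the scaler first, then transform so that each
# column has mean 0 and standard deviation 1.
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)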
20/60
Training and Testing Data
• The performance of a predictive model is measured by its accuracy of
predicting unseen data.
• Since such data are usually unavailable, analysts usually “hold back” a
subset of data, the testing dataset, for model evaluation purpose.
• The remaining data, the training dataset, are used to construct the model.
• The train_test_split() function from the model_selection module splits a DataFrame randomly into training and testing datasets.
Object_name = sklearn.model_selection.train_test_split(arrays, test_size, random_state)
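• For instance (a sketch assuming X and y have already been extracted), the function returns the training and testing parts of every array passed to it:
from sklearn.model_selection import train_test_split
# Hold back 30% of the data for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)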
21/60
Extract Dependent and Independent
Variables (I)
• In scikit-learn, the key step of many algorithms is fitting a model with the .fit() method, which takes two parameters: X and y.
• The parameter X is the design matrix containing all independent variables.
• The parameter y is the vector of the target variable.
• Both X and y can be NumPy arrays or pandas DataFrames.
• We need to extract X and y from the original DataFrame.
22/60
Extract Dependent and Independent
Variables (II)
Select the column representing y and save it as a new object.
y = DataFrame_name["target_var"]
Subset the matrix X from the DataFrame in a similar manner.
X = DataFrame_name[["X1", "X2", …]]
If the DataFrame only contains the independent and target variables, just
drop the target variable from the original DataFrame to obtain X.
X = DataFrame_name.drop(columns = "target_var")
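• A small sketch with a hypothetical DataFrame df whose target is "Income" and whose predictors are "Age" and "Education":
y = df["Income"]
X = df[["Age", "Education"]]
# Equivalent when the DataFrame contains only the predictors and the target:
X = df.drop(columns = "Income")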
23/60
Activity
Car sales program:
Carry out the following tasks in JupyterLab:
• Since there are many car models in the market (and their names are
complicated and not comparable with each other), it is more important to
include “category” than “model” in our analysis later.
➢ Remove “model” from the DataFrame.
➢ Reduce entries with more than one category (e.g.,
'Convertible,Sedan,Coupe') in the “Category” field by assigning each car
to the first listed category only (i.e., 'Convertible').
➢ Create a training and a testing dataset. The proportion here should be
0.7:0.3 and the random state should be set to 0.
24/60
Discussion
• Can we use the .replace() method to reduce the number of categories of
a numeric categorical variable instead of discretisation?
• If the target variable is categorical and binary, the dummy-variable
conversion produces two dummy variables. Which dummy variable shall we
choose for the scikit-learn algorithm?
25/60
Activity
SU5 Regression Example
Run through the SU5 linear regression notebook
27/60
The rest of the slides are
optional
Clustering
(This section is optional)
Introduction to K-Means Clustering
• One of the most popular clustering algorithms is the K-Means method.
• This technique is very efficient in clustering large data sets.
• The algorithm splits the data into K groups of equal variance by
minimising the within-cluster variation (inertia).
• The K-Means method requires the number of clusters to be specified
before the algorithm starts.
• The clusters are characterised by their centroids (means), i.e., the centre
of each cluster in the space of the clustering variables.
• The K means are adjusted iteratively during the clustering process.
30/60
Intuition of K-Means Clustering
• We want members in a cluster to be ‘close’ to each other, and clusters to be
far apart from each other.
31/60
K-Means Clustering Process
The process of K-Means clustering contains five main steps:
1. Select K observations randomly as initial cluster centroids.
2. Compute the distance of each object to each centroid.
3. Assign each object to the nearest centroid based on the distance
computed. Objects assigned to the same centroid form a cluster.
4. Recompute the centroid for each cluster using the objects in it.
Restart iteration from Step 2.
5. Iteration stops when the centroids remain unchanged, or a specified
number of iterations has been performed.
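• A minimal NumPy sketch of these five steps (for illustration only; it is not the scikit-learn implementation and ignores edge cases such as empty clusters):
import numpy as np

def kmeans(X, k, max_iter = 100, seed = 0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size = k, replace = False)]
    for _ in range(max_iter):
        # Steps 2-3: assign each object to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis = 2)
        labels = distances.argmin(axis = 1)
        # Step 4: recompute each centroid from the objects assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis = 0) for j in range(k)])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids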
32/60
K-Means Properties
• The Euclidean distance is generally preferred when computing the distance
between an object and the centroids.
• The (squared) Euclidean distance is the sum of the squared differences
between the values of the selected clustering criteria.
• After the clustering process, check that the resulting clusters yield
meaningful insights. A good clustering solution allows a clear description
of each cluster's profile.
• Evaluate the quality of clustering solutions by cohesion, separation and
parsimony.
33/60
Fitting K-Means Clustering by Scikit-
Learn
• In scikit-learn, all algorithms are controlled and executed by the estimator.
• Adjust the parameters for the modelling process when declaring the
estimator.
• In K-Means clustering, the estimator is called KMeans, and it can be
imported from the cluster module of the scikit-learn package.
from sklearn.cluster import KMeans
• Initiate the KMeans estimator first and adjust the estimation parameters
according to the individual needs.
km_Object = sklearn.cluster.KMeans(n_clusters = 8, init =
"k-means++", n_init = 10, max_iter = 300, tol = 0.0001,
precompute_distances = "auto", random_state = None)
34/60
Parameters of the K-Means Estimator
Below is a list of the parameters of the K-Means estimator.
35/60
Fit K-Means Clustering
• Perform K-Means clustering using the KMeans estimator.
km_fit_Object = km_Object.fit(X, sample_weight = None)
• X is a prepared DataFrame based on which the clusters are constructed.
• With sample_weight we can pre-specify the weights for each
observation in X. If it is set to None, all observations will be assigned equal
weight.
• The fitted estimator of the K-Means algorithm is saved in
km_fit_Object.
36/60
K-Means Clustering
• Find the clusters for each observation in the original DataFrame using the
fit_predict() function.
km_pred_Object = km_Object.fit_predict(X, sample_weight = None)
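• A short end-to-end sketch, assuming X is a prepared numeric DataFrame and three clusters are requested:
from sklearn.cluster import KMeans
# Declare the estimator with 3 clusters and a fixed random state for reproducibility.
km_Object = KMeans(n_clusters = 3, random_state = 0)
# Fit the model and obtain the cluster index of every observation in X.
km_pred_Object = km_Object.fit_predict(X)
# The cluster centroids are available after fitting.
print(km_Object.cluster_centers_)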
37/60
Explore K-Means Clustering Models (I)
• Explore the characteristics of the clusters by looking at some statistics
such as mean, min, max, etc. of the clustering criteria.
• Cross-tabulate the clustering criteria and the cluster index to understand
the features of the clusters.
• If a clustering criterion variable is categorical, interpret the clusters based
on the proportional distribution of the clusters in each category.
• Create cross-tables with the crosstab() function:
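• A small sketch, assuming the cluster indices km_pred_Object from fit_predict() and a hypothetical categorical criterion "Region" in the original DataFrame df:
import pandas as pd
# Proportional distribution of the categories within each cluster.
cluster_profile = pd.crosstab(index = km_pred_Object, columns = df["Region"], normalize = "index")
print(cluster_profile)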
38/60
Activity
Run through the SU5 Clustering Notebook
39/60
Decision Trees
(This section is optional)
Introduction to Decision Trees (I)
• Decision trees split a sample to reach certain decision points based on
some criteria. It is therefore a sample classification technique.
• Each decision point is called a node and represents a subset of the sample.
• Nodes split from a superordinate node are called child nodes, while the
originating node is called the parent node.
• A child node with no further subdivisions or splitting is called a leaf node.
• The decision tree algorithm predicts the classification of each observation
based on the input variables and the values of the target variable.
• These rules of decision form the resulting model which can be illustrated
by a tree-like structure graphically.
41/60
Introduction to Decision Trees (II)
• A decision tree model illustrates the complex relationship between the
target variable and the input variables rather well.
• The importance of the input variables in the decision rules is reflected in
the decision tree: variables higher up in the tree are more important.
• The value of a leaf node is the prediction of the target variable for those
observations classified in it.
• If the target variable is categorical, the predicted value will be the mode. If
it is numeric, the value will be the mean.
42/60
Example of Decision Tree
[Figure: an example decision tree; the root node represents the most important feature.]
(Source: https://scikit-learn.org/stable/modules/tree.html)
44/60
Algorithm of Constructing Decision
Trees (II)
• The sample subsets at each node become more homogeneous as the
splitting process advances.
• The homogeneity of each subset reflects the split quality from a parent
node to its child nodes.
• In classification trees, homogeneity is measured by the Gini index or entropy.
• In regression trees, impurity is measured by the sum of squared errors
(SSE); both measures are sketched below.
• In both options, the split with the highest reduction of impurity or highest
homogeneity in the child nodes will be chosen.
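• For reference, a sketch of the two measures in standard notation, where p_j is the proportion of class j in a node and \bar{y} is the node mean of the target:
\text{Gini} = 1 - \sum_{j} p_j^2, \qquad \text{SSE} = \sum_{i \in \text{node}} (y_i - \bar{y})^2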
45/60
Algorithm of Constructing Decision
Trees (III)
• Two possibilities to stop the CART algorithm:
➢ no or little improvement of impurity detected after a new split
• Gini/SSE drops below a threshold
• Low Gini or SSE indicates rather homogeneous nodes, another
split would not decrease the impurity significantly.
• High thresholds lead to oversimplified trees; low thresholds lead to
overcomplicated trees.
➢ a pre-specified depth (number of splits) of the tree is reached
• control tree size without oversimplifying or overcomplicating it.
• Set a lower bound on the number of observations in the nodes.
46/60
Evaluation of Decision Trees (I)
• Evaluate the performance of a decision tree by examining its fit on
available data.
➢ For classification trees, use confusion matrix
• Summarise correct and incorrect classifications.
• The larger the proportion of identical predicted and observed
classifications, the more accurate the decision tree model is.
➢ For regression trees, use Root-Mean-Square-Error (RMSE)
• The root of the mean squared deviation of the predicted values from the observed values.
• The lower the deviance, the closer are the predictions to the
actual values, and the better is the model.
47/60
Evaluation of Decision Trees (II)
• Evaluate also the performance of a decision tree by its prediction accuracy
on unseen data.
• Partition the original dataset randomly into a training and a testing
dataset.
• The decision tree as a predictive model is evaluated by its ability to apply
what it has “learned” from the training data to the testing data.
• If the prediction accuracy of the model on the training data is much higher
than on the testing data, the model tends to be overfitted.
48/60
Decision Tree Estimators (I)
• First, import the sklearn.tree module.
from sklearn import tree
• Initiate the DecisionTreeClassifier estimator for classification tree:
tree_Object = sklearn.tree.DecisionTreeClassifier(…)
• To fit a regression tree, declare the DecisionTreeRegressor estimator.
tree_Object = sklearn.tree.DecisionTreeRegressor(…)
• “…” indicates the parameters of the estimator that control the fitting of
the decision trees.
49/60
Decision Tree Estimators (II)
• Here is a list of the parameters of the DecisionTreeClassifier and
DecisionTreeRegressor estimators.
• The main differences between them are the parameter criterion and
the availability of the parameter class_weight.
• criterion (default: "gini" for classification, "mse" for regression); values for classification: "gini", "entropy"; values for regression: "mse", "friedman_mse", "mae", "poisson"
• splitter (default: "best"); values: "best", "random"
• max_depth (default: None); value type: integer
• min_samples_split (default: 2); value type: integer or float
• min_samples_leaf (default: 1); value type: integer or float
• (Source: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html,
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
50/60
Decision Tree Estimators (III)
51/60
Fit Decision Trees
• Fit decision trees on a prepared training DataFrame with .fit().
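• A minimal sketch, assuming X_train, y_train and X_test come from an earlier train_test_split() and using illustrative parameter values:
from sklearn.tree import DecisionTreeClassifier
# Declare a classification tree with a limited depth and fit it on the training data.
tree_Object = DecisionTreeClassifier(criterion = "gini", max_depth = 3, random_state = 0)
tree_fit_Object = tree_Object.fit(X_train, y_train)
# Predict the classes of the testing data.
tree_pred_Object = tree_Object.predict(X_test)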
52/60
Evaluate Decision Trees
• One way to evaluate the performance of a decision tree is the confusion
matrix.
• The confusion matrix can be computed by confusion_matrix().
metrics.confusion_matrix(target_var, tree_pred_Object)
• Accuracy, precision, and recall scores can also assess the predictive
performance of a decision tree.
metrics.accuracy_score(target_var, tree_pred_Object)
metrics.precision_score(target_var, tree_pred_Object)
metrics.recall_score(target_var, tree_pred_Object)
• These scores are useful in examining binary classification target variables.
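• Continuing the sketch above (assuming a binary target y_test for the testing data):
from sklearn import metrics
print(metrics.confusion_matrix(y_test, tree_pred_Object))
print(metrics.accuracy_score(y_test, tree_pred_Object))
print(metrics.precision_score(y_test, tree_pred_Object))
print(metrics.recall_score(y_test, tree_pred_Object))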
53/60
Example of a Confusion Matrix
• Suppose we fit a decision tree to predict if a consumer buys a coke
➢ target: ‘1’ for buy a coke; ‘0’ for did not buy a coke.
            Predicted: 0                      Predicted: 1
Actual: 0   True Negative (20 observations)   False Positive (30 observations)
Actual: 1   False Negative (10 observations)  True Positive (40 observations)
55/60
Plot Decision Trees (I)
• A decision tree plot is an important tool for understanding a tree and
evaluating its performance.
• The plot can be conveniently generated by the plot_tree() function of the
tree module.
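• A short sketch, assuming the fitted classifier tree_Object from the earlier example and hypothetical feature and class names:
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize = (10, 6))
tree.plot_tree(tree_Object,
               feature_names = ["Age", "Income"],  # hypothetical feature names
               class_names = ["No", "Yes"],        # hypothetical class labels
               filled = True)                      # colour nodes by majority class
plt.show()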
56/60
Plot Decision Trees (II)
Here is a list of the parameters of the plot_tree() function.
57/60
Plot Decision Trees (III)
(Source: https://scikit-learn.org/stable/modules/generated/sklearn.tree.plot_tree.html)
58/60
Activity
Run through the SU5 Decision Tree Notebook
59/60
Discussion
• Name all the main differences between classification trees and regression
trees.
• How do we judge a model with low prediction accuracy for testing data but
high accuracy for training data? Is it possible for a model to have low
prediction accuracy for training data, but high accuracy for testing data?
60/60