# [ Model Evaluation and Selection ] [ cheatsheet ]

Data Splitting

● Splitting dataset into training and test sets: from sklearn.model_selection import train_test_split
● Splitting dataset with stratification: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
● Creating a validation set: X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
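A minimal sketch combining the calls above, assuming the iris toy dataset stands in for your own data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as a test set, preserving class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Carve a validation set out of the remaining training data
# (0.25 of the 80% leaves a 60/20/20 train/val/test split)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)
```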

Model Evaluation Metrics

● Accuracy score: from sklearn.metrics import accuracy_score
● Precision score: from sklearn.metrics import precision_score
● Recall score: from sklearn.metrics import recall_score
● F1 score: from sklearn.metrics import f1_score
● Area under ROC curve: from sklearn.metrics import roc_auc_score
● Mean squared error: from sklearn.metrics import mean_squared_error
● Mean absolute error: from sklearn.metrics import mean_absolute_error
● R2 score: from sklearn.metrics import r2_score
● Confusion matrix: from sklearn.metrics import confusion_matrix
● Classification report: from sklearn.metrics import classification_report
● Log loss: from sklearn.metrics import log_loss
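A short sketch computing several of these metrics from one fitted model; the synthetic dataset and logistic regression are placeholders for your own pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, classification_report)

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))  # uses scores, not labels
print(classification_report(y_test, y_pred))
```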

Cross-validation

● K-fold cross-validation: from sklearn.model_selection import cross_val_score
● Stratified K-fold for classification: from sklearn.model_selection import StratifiedKFold
● Leave-One-Out cross-validation: from sklearn.model_selection import LeaveOneOut
● Cross-validation with multiple scoring metrics: from sklearn.model_selection import cross_validate
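A sketch of stratified cross-validation with multiple scoring metrics; the random forest is an arbitrary choice of estimator:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(random_state=0)

# Stratified 5-fold CV, scored with two metrics at once
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "f1_macro"])
print(scores["test_accuracy"].mean(), scores["test_f1_macro"].mean())
```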

Hyperparameter Tuning

● Grid search CV: from sklearn.model_selection import GridSearchCV
● Randomized search CV: from sklearn.model_selection import RandomizedSearchCV
● Specifying parameter grid for grid search: param_grid = {'param1': [1, 2, 3], 'param2': ['a', 'b', 'c']}
● Running grid search: grid_search = GridSearchCV(estimator, param_grid, cv=5)
● Accessing best parameters: grid_search.best_params_
● Accessing best model: grid_search.best_estimator_
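A minimal grid-search sketch; the SVC estimator and parameter ranges are illustrative, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid; adapt the ranges to your own problem
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X, y)

print(grid_search.best_params_)            # best hyperparameter combination
best_model = grid_search.best_estimator_   # refit on the full data by default
```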

Model Selection

● Comparing multiple classifiers: [cross_val_score(estimator, X, y, cv=5).mean() for estimator in [estimator1, estimator2]]
● Feature importance from models: model.feature_importances_
● Selecting features based on importance: from sklearn.feature_selection import SelectFromModel
● Pipeline creation: from sklearn.pipeline import make_pipeline
● Saving a model: from joblib import dump
● Loading a model: from joblib import load
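A sketch tying pipeline creation, importance-based feature selection, and model persistence together; the file name model.joblib is arbitrary:

```python
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Select features by importance, then fit the final classifier
pipe = make_pipeline(
    SelectFromModel(RandomForestClassifier(random_state=0)),
    RandomForestClassifier(random_state=0),
)
pipe.fit(X, y)

dump(pipe, "model.joblib")        # persist the whole pipeline
restored = load("model.joblib")   # reload later for prediction
```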

Ensemble Methods

● Random Forest: from sklearn.ensemble import RandomForestClassifier
● Gradient Boosting: from sklearn.ensemble import GradientBoostingClassifier
● AdaBoost: from sklearn.ensemble import AdaBoostClassifier
● Stacking classifiers: from sklearn.ensemble import StackingClassifier

Dimensionality Reduction

● PCA: from sklearn.decomposition import PCA
● t-SNE: from sklearn.manifold import TSNE
● Selecting features with high variance: from sklearn.feature_selection import VarianceThreshold

Data Preprocessing

● Scaling features: from sklearn.preprocessing import StandardScaler
● Encoding categorical variables: from sklearn.preprocessing import OneHotEncoder
● Imputing missing values: from sklearn.impute import SimpleImputer
● Generating polynomial features: from sklearn.preprocessing import PolynomialFeatures
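A small sketch chaining imputation and scaling; the toy array stands in for real data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with a missing value
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])

# Impute missing values with the column mean, then standardize
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
X_prepared = prep.fit_transform(X)
print(X_prepared)
```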

Model Evaluation Techniques

● Bootstrapping: from sklearn.utils import resample
● Calculating information criteria: AIC = 2*k - 2*np.log(L) (where k is the number of parameters and L is the maximized likelihood of the model)
● BIC for model selection: BIC = n*np.log(RSS/n) + k*np.log(n) (where RSS is the residual sum of squares and n the number of samples)
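One common use of resample is a bootstrap confidence interval for a metric; this sketch fabricates predictions purely for illustration:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# Fake labels and predictions, standing in for a real y_test / y_pred
rng = np.random.RandomState(0)
y_test = rng.randint(0, 2, size=200)
y_pred = np.where(rng.rand(200) < 0.8, y_test, 1 - y_test)

# Bootstrap a 95% confidence interval for accuracy
scores = []
for _ in range(1000):
    idx = resample(np.arange(len(y_test)))  # sample indices with replacement
    scores.append(accuracy_score(y_test[idx], y_pred[idx]))
print(np.percentile(scores, [2.5, 97.5]))
```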

Advanced Model Evaluation

● Plotting ROC curve: from sklearn.metrics import roc_curve
● Plotting precision-recall curve: from sklearn.metrics import precision_recall_curve
● Visualizing confusion matrix: from sklearn.metrics import ConfusionMatrixDisplay (plot_confusion_matrix was removed in scikit-learn 1.2)
● Calculating adjusted R2: adjusted_R2 = 1 - (1-R2)*(n-1)/(n-p-1) (where n is the number of samples and p the number of predictors)
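A sketch plotting the ROC curve and the confusion matrix display; the dataset and model are placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC curve from predicted probabilities
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")

# Confusion matrix as a plot (replaces the removed plot_confusion_matrix)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
plt.show()
```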

Text Data Processing

● Count Vectorization: from sklearn.feature_extraction.text import CountVectorizer
● TF-IDF Transformation: from sklearn.feature_extraction.text import TfidfTransformer
● HashingVectorizer: from sklearn.feature_extraction.text import HashingVectorizer
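A minimal count-then-TF-IDF sketch (TfidfVectorizer combines both steps, if preferred):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat on the mat", "the dog ate my homework"]

# Raw term counts, then reweight by TF-IDF
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (n_documents, vocabulary_size)
```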

Clustering and Unsupervised Learning

● K-Means clustering: from sklearn.cluster import KMeans
● DBSCAN for density-based clustering: from sklearn.cluster import DBSCAN
● Hierarchical clustering: from sklearn.cluster import AgglomerativeClustering

Neural Networks and Deep Learning

● Using Keras/TensorFlow for neural networks: from tensorflow.keras.models import Sequential
● Defining a simple neural network architecture: model = Sequential([Dense(10, activation='relu'), Dense(1)]) (also requires from tensorflow.keras.layers import Dense)

Handling Imbalanced Datasets

● Under-sampling the majority class: from imblearn.under_sampling import RandomUnderSampler
● Over-sampling the minority class: from imblearn.over_sampling import RandomOverSampler
● SMOTE for synthetic minority over-sampling: from imblearn.over_sampling import SMOTE
● Using class weights to handle imbalance: class_weight='balanced' (applicable in many sklearn classifiers)
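A short SMOTE sketch on a deliberately imbalanced toy dataset; in practice, resample only the training split to avoid leakage:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# 9:1 imbalanced toy dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating neighbors
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```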

Advanced Feature Selection

● Recursive feature elimination: from sklearn.feature_selection import RFE
● Feature selection using mutual information: from sklearn.feature_selection import mutual_info_classif
● SelectKBest with custom scoring function: from sklearn.feature_selection import SelectKBest

Model Interpretability and Explanation

● Permutation importance: from sklearn.inspection import permutation_importance
● SHAP values: import shap (requires SHAP library)
● LIME for local interpretability: import lime (requires LIME library)
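A minimal permutation-importance sketch; scoring on held-out data is preferable, the in-sample call here just keeps the example short:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)  # one value per feature
```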

Advanced Cross-validation Techniques

● TimeSeriesSplit for time-series data: from sklearn.model_selection import TimeSeriesSplit
● GroupKFold for grouped data: from sklearn.model_selection import GroupKFold
● PredefinedSplit to use custom splits: from sklearn.model_selection import PredefinedSplit

Hyperparameter Optimization Beyond Grid and Random Search

● Bayesian optimization with Hyperopt: from hyperopt import fmin, tpe, hp, Trials
● Optuna for optimization: import optuna
● Scikit-optimize for Bayesian optimization: from skopt import BayesSearchCV
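A hedged Optuna sketch tuning two random-forest hyperparameters by cross-validated accuracy; the search ranges are illustrative:

```python
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def objective(trial):
    # Each trial samples one hyperparameter combination
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    max_depth = trial.suggest_int("max_depth", 2, 10)
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 max_depth=max_depth, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```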

Ensemble and Meta-models Advanced Techniques

● VotingClassifier for combining models by voting: from sklearn.ensemble import VotingClassifier
● Bagging with base estimator: from sklearn.ensemble import BaggingClassifier
● Feature stacking for meta-modeling: from sklearn.ensemble import StackingClassifier

Performance Improvement and Efficiency

● Using joblib for parallel processing in GridSearchCV: GridSearchCV(estimator, param_grid, cv=5, n_jobs=-1)
● Incremental learning with partial_fit: estimator.partial_fit(X_batch, y_batch)
● Using the categorical dtype in pandas to reduce memory usage: df['feature'] = df['feature'].astype('category')

Working with Text Data Advanced Techniques

● N-grams with CountVectorizer: CountVectorizer(ngram_range=(1, 2))
● Custom tokenizer in TfidfVectorizer: TfidfVectorizer(tokenizer=custom_tokenizer)
● Word embeddings with Gensim or spaCy: from gensim.models import Word2Vec or import spacy

Advanced Clustering Techniques

● Spectral clustering for non-linearly separable data: from sklearn.cluster import SpectralClustering
● Affinity propagation for clustering without specifying the number of clusters: from sklearn.cluster import AffinityPropagation
● Mean shift clustering for arbitrary shaped clusters: from sklearn.cluster import MeanShift

Evaluation Metrics for Regression Advanced

● Explained variance score: from sklearn.metrics import explained_variance_score
● Mean squared logarithmic error: from sklearn.metrics import mean_squared_log_error
● Median absolute error: from sklearn.metrics import median_absolute_error

Evaluation Metrics for Classification Advanced

● Balanced accuracy score: from sklearn.metrics import balanced_accuracy_score
● Cohen's kappa: from sklearn.metrics import cohen_kappa_score
● Matthews correlation coefficient: from sklearn.metrics import matthews_corrcoef

Multioutput and Multiclass Strategies

● OneVsRest for multiclass classification: from sklearn.multiclass import OneVsRestClassifier
● MultiOutputClassifier for multi-output classification: from sklearn.multioutput import MultiOutputClassifier (use MultiOutputRegressor for multi-output regression)

Advanced Data Preprocessing Techniques

● Power transformation for normalizing data: from sklearn.preprocessing import PowerTransformer
● Binarizing features: from sklearn.preprocessing import Binarizer
● Custom transformers with FunctionTransformer: from sklearn.preprocessing import FunctionTransformer

Model Persistence Advanced

● Pickle for model saving: import pickle
● Using dill for more complex objects: import dill as pickle

Time Series Analysis

● Rolling window features: df['rolling_mean'] = df['feature'].rolling(window=5).mean()
● Expanding window features: df['expanding_mean'] = df['feature'].expanding(2).mean()
● Time series cross-validation: from sklearn.model_selection import TimeSeriesSplit
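A sketch of window features plus forward-chaining splits; the toy frame stands in for a real series:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

df = pd.DataFrame({"feature": np.arange(20, dtype=float)})

# Windowed statistics as features (these include the current row;
# shift() them when predicting the future to avoid leakage)
df["rolling_mean"] = df["feature"].rolling(window=5).mean()
df["expanding_mean"] = df["feature"].expanding(2).mean()

# Each split trains only on the past and validates on the future
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(df):
    print(train_idx[-1], "->", test_idx)
```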

Neural Networks and Deep Learning Advanced

● Early stopping in Keras: from tensorflow.keras.callbacks import EarlyStopping
● Custom loss functions in TensorFlow/Keras: def custom_loss(y_true, y_pred): return tf.reduce_mean(tf.abs(y_true - y_pred))
● Fine-tuning pre-trained models in TensorFlow/Keras: model.trainable = True and model.compile(...)
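A sketch combining the bullets above (a hand-written MAE loss, a small Sequential model, and early stopping on validation loss); the random data is a placeholder:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

# Toy regression data
X = np.random.rand(200, 4)
y = X.sum(axis=1)

def custom_loss(y_true, y_pred):
    # Mean absolute error, written by hand
    return tf.reduce_mean(tf.abs(y_true - y_pred))

model = Sequential([Dense(10, activation="relu"), Dense(1)])
model.compile(optimizer="adam", loss=custom_loss)

# Stop when validation loss hasn't improved for 5 epochs
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=100,
          callbacks=[early_stop], verbose=0)
```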

Working with Large Datasets

● Incremental PCA for large datasets: from sklearn.decomposition import IncrementalPCA
● Online learning algorithms (e.g., SGDClassifier): from sklearn.linear_model import SGDClassifier
● Dask for parallel computing: import dask.dataframe as dd
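A streaming sketch, assuming data arrives in chunks: IncrementalPCA and SGDClassifier are both updated batch by batch. Treat it as a pattern, not a drop-in implementation:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA
from sklearn.linear_model import SGDClassifier

# Stream batches instead of loading all data at once
ipca = IncrementalPCA(n_components=5)
clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared up front

for _ in range(10):  # stand-in for reading chunks from disk
    X_batch = np.random.rand(100, 20)
    y_batch = np.random.randint(0, 2, size=100)
    ipca.partial_fit(X_batch)  # update the projection incrementally
    clf.partial_fit(ipca.transform(X_batch), y_batch, classes=classes)
```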

By: Waleed Mousa
