# [ Model Evaluation and Selection ] [ cheatsheet ]

A quick reference to model evaluation and selection techniques in scikit-learn and related Python libraries, covering data splitting, evaluation metrics, cross-validation, hyperparameter tuning, ensemble methods, dimensionality reduction, data preprocessing, text processing, clustering, neural networks, imbalanced data handling, model interpretability, and time series analysis.
Data Splitting
● Splitting dataset into training and test sets: from sklearn.model_selection import train_test_split
● Splitting dataset with stratification: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
● Creating a validation set: X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25)
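A minimal sketch of the three splits above chained together; the synthetic dataset from make_classification is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Hold out 20% as a stratified test set, then carve a validation set
# out of the remaining 80% (0.25 * 0.8 = 20% of the full data).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
```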
Model Evaluation Metrics
● Accuracy score: from sklearn.metrics import accuracy_score
● Precision score: from sklearn.metrics import precision_score
● Recall score: from sklearn.metrics import recall_score
● F1 score: from sklearn.metrics import f1_score
● Area under ROC curve: from sklearn.metrics import roc_auc_score
● Mean squared error: from sklearn.metrics import mean_squared_error
● Mean absolute error: from sklearn.metrics import mean_absolute_error
● R2 score: from sklearn.metrics import r2_score
● Confusion matrix: from sklearn.metrics import confusion_matrix
● Classification report: from sklearn.metrics import classification_report
● Log loss: from sklearn.metrics import log_loss
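A short sketch computing several of these metrics on a held-out test set; the model and synthetic data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
# roc_auc_score expects scores or probabilities, not hard class labels
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, y_pred))
```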
Cross-validation
● K-fold cross-validation scores: from sklearn.model_selection import cross_val_score (k-fold splitting is the default; import KFold for the splitter itself)
● Stratified K-fold for classification: from sklearn.model_selection import StratifiedKFold
● Leave-One-Out cross-validation: from sklearn.model_selection import LeaveOneOut
● Cross-validation with multiple scoring metrics: from sklearn.model_selection import cross_validate
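A minimal sketch combining cross_val_score with an explicit stratified splitter; the estimator and data are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring='f1')
print(scores.mean(), scores.std())  # mean F1 across the five folds
```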
Hyperparameter Tuning and Model Selection
● Grid search CV: from sklearn.model_selection import GridSearchCV
● Comparing multiple classifiers: [cross_val_score(estimator, X, y, cv=5).mean() for estimator in [estimator1, estimator2]]
● Feature importance from models: model.feature_importances_
● Selecting features based on importance: from sklearn.feature_selection import SelectFromModel
● Pipeline creation: from sklearn.pipeline import make_pipeline
● Saving a model: from joblib import dump
● Loading a model: from joblib import load
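A sketch tying several of these together: a pipeline tuned with GridSearchCV and persisted with joblib. The parameter grid and data are illustrative assumptions:

```python
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
pipe = make_pipeline(StandardScaler(), SVC())
# Pipeline parameters are addressed as <step name>__<parameter>.
grid = GridSearchCV(pipe, {'svc__C': [0.1, 1, 10]}, cv=5, n_jobs=-1)
grid.fit(X, y)
dump(grid.best_estimator_, 'model.joblib')   # persist the winning pipeline
model = load('model.joblib')
```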
Ensemble Methods
● Random Forest: from sklearn.ensemble import RandomForestClassifier
● Gradient Boosting: from sklearn.ensemble import GradientBoostingClassifier
● AdaBoost: from sklearn.ensemble import AdaBoostClassifier
● Stacking classifiers: from sklearn.ensemble import StackingClassifier
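A minimal stacking sketch: base estimators feed their predictions to a meta-model. The base estimators chosen here are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('svc', SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression())  # meta-model trained on base predictions
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```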
Dimensionality Reduction
● PCA: from sklearn.decomposition import PCA
● t-SNE: from sklearn.manifold import TSNE
● Selecting features with high variance: from sklearn.feature_selection import VarianceThreshold
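A quick PCA sketch, projecting to two components and inspecting how much variance each captures; the iris dataset is an illustrative assumption:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance captured by each component
```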
Data Preprocessing
● Scaling features: from sklearn.preprocessing import StandardScaler
● Encoding categorical variables: from sklearn.preprocessing import OneHotEncoder
● Imputing missing values: from sklearn.impute import SimpleImputer
● Generating polynomial features: from sklearn.preprocessing import PolynomialFeatures
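A sketch combining imputation, scaling, and one-hot encoding. ColumnTransformer is not listed above but is the standard scikit-learn way to apply different preprocessing to different columns; the tiny DataFrame is an illustrative assumption:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'age': [25.0, None, 40.0], 'city': ['NY', 'LA', 'NY']})
prep = ColumnTransformer([
    # impute then scale the numeric column
    ('num', make_pipeline(SimpleImputer(strategy='median'), StandardScaler()), ['age']),
    # one-hot encode the categorical column
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])
X_prep = prep.fit_transform(df)
```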
Model Evaluation Techniques
● Bootstrapping: from sklearn.utils import resample
● Calculating information criteria: AIC = 2*k - 2*np.log(L) (where k is the number of parameters and L is the maximized likelihood of the model)
● BIC for model selection: BIC = n*np.log(RSS/n) + k*np.log(n) (the least-squares form, where RSS is the residual sum of squares and n the number of observations)
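A sketch of both criteria for an ordinary least-squares fit, using the Gaussian-likelihood forms with constant terms dropped, so only differences between candidate models are meaningful. The synthetic regression data is an illustrative assumption:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

model = LinearRegression().fit(X, y)
n, k = X.shape[0], X.shape[1] + 1            # +1 for the intercept
rss = np.sum((y - model.predict(X)) ** 2)
aic = n * np.log(rss / n) + 2 * k            # least-squares form of AIC
bic = n * np.log(rss / n) + k * np.log(n)    # matches the formula above
print(aic, bic)
```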
Advanced Model Evaluation
● Plotting ROC curve: from sklearn.metrics import roc_curve
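A sketch of plotting the ROC curve from predicted probabilities; the classifier and synthetic data are illustrative assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc_score(y_test, y_score):.3f}")
plt.plot([0, 1], [0, 1], linestyle='--')      # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```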
Working with Text Data
● Count Vectorization: from sklearn.feature_extraction.text import CountVectorizer
● TF-IDF Transformation: from sklearn.feature_extraction.text import TfidfTransformer
● HashingVectorizer: from sklearn.feature_extraction.text import HashingVectorizer
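A minimal sketch chaining CountVectorizer and TfidfTransformer in a pipeline; the toy documents are illustrative assumptions. (TfidfVectorizer performs both steps in one object.)

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

docs = ["the cat sat", "the dog barked", "the cat barked"]
text_pipe = make_pipeline(CountVectorizer(), TfidfTransformer())
X_tfidf = text_pipe.fit_transform(docs)   # sparse TF-IDF matrix
print(X_tfidf.shape)
```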
Clustering and Unsupervised Learning
● K-Means clustering: from sklearn.cluster import KMeans
● DBSCAN for density-based clustering: from sklearn.cluster import DBSCAN
● Hierarchical clustering: from sklearn.cluster import AgglomerativeClustering
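A quick K-Means sketch scored with the silhouette coefficient; the blob data is an illustrative assumption:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(silhouette_score(X, km.labels_))  # cluster quality in [-1, 1]
```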
Neural Networks and Deep Learning
● Using Keras/TensorFlow for neural networks: from tensorflow.keras.models import Sequential
● Defining a simple neural network architecture: model = Sequential([Dense(10, activation='relu'), Dense(1)]) (with from tensorflow.keras.layers import Dense)
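A sketch expanding the one-line architecture above into a trainable binary classifier; the synthetic data, layer sizes, and training settings are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

X_train = np.random.rand(200, 20).astype('float32')
y_train = (X_train.sum(axis=1) > 10).astype('float32')   # toy binary target

model = Sequential([Dense(10, activation='relu', input_shape=(20,)),
                    Dense(1, activation='sigmoid')])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, validation_split=0.2, epochs=5, batch_size=32)
```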
Handling Imbalanced Datasets
● Under-sampling the majority class: from imblearn.under_sampling import RandomUnderSampler
● Over-sampling the minority class: from imblearn.over_sampling import RandomOverSampler
● SMOTE for synthetic minority over-sampling: from imblearn.over_sampling import SMOTE
● Using class weights to handle imbalance: class_weight='balanced' (applicable in many sklearn classifiers)
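A minimal SMOTE sketch on a deliberately imbalanced synthetic dataset (an illustrative assumption):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
# In real use, resample only the training split to avoid leaking into the test set.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```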
Advanced Feature Selection
● Recursive feature elimination: from sklearn.feature_selection import RFE
● Feature selection using mutual information: from sklearn.feature_selection import mutual_info_classif
● SelectKBest with custom scoring function: from sklearn.feature_selection import SelectKBest
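A sketch pairing SelectKBest with mutual information as the scoring function; the dataset shape and k are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_new = selector.fit_transform(X, y)
print(selector.get_support(indices=True))  # indices of the kept features
```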
Model Interpretability and Explanation
● Permutation importance: from sklearn.inspection import permutation_importance
● SHAP values: import shap (requires SHAP library)
● LIME for local interpretability: import lime (requires LIME library)
Advanced Cross-validation Techniques
● TimeSeriesSplit for time-series data: from sklearn.model_selection import TimeSeriesSplit
● GroupKFold for grouped data: from sklearn.model_selection import GroupKFold
● PredefinedSplit to use custom splits: from sklearn.model_selection import PredefinedSplit
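A TimeSeriesSplit sketch; the regression model and random data are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X, y = rng.random((120, 4)), rng.random(120)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # Each training window ends before its test window begins.
    model = Ridge().fit(X[train_idx], y[train_idx])
    print(model.score(X[test_idx], y[test_idx]))
```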
Hyperparameter Optimization Beyond Grid and Random Search
● Bayesian optimization with Hyperopt: from hyperopt import fmin, tpe, hp, Trials
● Optuna for optimization: import optuna
● Scikit-optimize for Bayesian optimization: from skopt import BayesSearchCV
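A minimal Optuna sketch; the search space, estimator, and trial count are illustrative assumptions:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int('n_estimators', 50, 300),
        max_depth=trial.suggest_int('max_depth', 2, 16),
        random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

study = optuna.create_study(direction='maximize')  # objective returns a score to maximize
study.optimize(objective, n_trials=30)
print(study.best_params)
```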
Ensemble and Meta-models Advanced Techniques
● VotingClassifier for combining models by voting: from sklearn.ensemble import VotingClassifier
● Bagging with base estimator: from sklearn.ensemble import BaggingClassifier
● Feature stacking for meta-modeling: from sklearn.ensemble import StackingClassifier
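A VotingClassifier sketch; the three base models are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)
vote = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(random_state=0)),
                ('nb', GaussianNB())],
    voting='soft')   # 'soft' averages predicted probabilities; 'hard' counts class votes
vote.fit(X, y)
```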
Performance Improvement and Efficiency
● Using joblib for parallel processing in GridSearchCV: GridSearchCV(estimator, param_grid, cv=5, n_jobs=-1)
● Incremental learning with partial_fit: estimator.partial_fit(X_batch, y_batch)
● Using categorical dtype in pandas to reduce memory usage: df['feature'] = df['feature'].astype('category')
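A sketch of incremental learning with partial_fit, streaming the data in chunks; the chunk size and synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)
clf = SGDClassifier(loss='log_loss')   # 'log_loss' naming requires sklearn >= 1.1
classes = np.unique(y)                 # all classes must be declared on the first call

for i in range(0, len(X), 1000):       # stream the data in 1000-row chunks
    clf.partial_fit(X[i:i + 1000], y[i:i + 1000], classes=classes)
```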
Working with Text Data Advanced Techniques
● N-grams with CountVectorizer: CountVectorizer(ngram_range=(1, 2))
● Custom tokenizer in TfidfVectorizer: TfidfVectorizer(tokenizer=custom_tokenizer)
● Word embeddings with Gensim or spaCy: from gensim.models import Word2Vec or import spacy
Advanced Clustering Techniques
● Spectral clustering for non-linearly separable data: from sklearn.cluster import SpectralClustering
● Affinity propagation for clustering without specifying the number of clusters: from sklearn.cluster import AffinityPropagation
● Mean shift clustering for arbitrary shaped clusters: from sklearn.cluster import MeanShift
Evaluation Metrics for Regression Advanced
● Explained variance score: from sklearn.metrics import explained_variance_score
● Mean squared logarithmic error: from sklearn.metrics import mean_squared_log_error
● Median absolute error: from sklearn.metrics import median_absolute_error
Evaluation Metrics for Classification Advanced
● Balanced accuracy score: from sklearn.metrics import balanced_accuracy_score
● Cohen's kappa: from sklearn.metrics import cohen_kappa_score
● Matthews correlation coefficient: from sklearn.metrics import matthews_corrcoef
Multioutput and Multiclass Strategies
● OneVsRest for multiclass classification: from sklearn.multiclass import OneVsRestClassifier
● MultiOutputClassifier for multi-output classification: from sklearn.multioutput import MultiOutputClassifier (use MultiOutputRegressor for multi-output regression)
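A one-vs-rest sketch on a three-class problem; the base estimator is an illustrative assumption:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)      # three classes -> three binary models
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
print(ovr.predict(X[:5]))
```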
Advanced Data Preprocessing Techniques
● Power transformation for normalizing data: from sklearn.preprocessing import PowerTransformer
● Binarizing features: from sklearn.preprocessing import Binarizer
● Custom transformers with FunctionTransformer: from sklearn.preprocessing import FunctionTransformer
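A FunctionTransformer sketch wrapping a log transform; the toy array is an illustrative assumption:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p compresses right-skewed features; expm1 inverts it
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X = np.array([[1.0, 10.0], [100.0, 1000.0]])
X_log = log_tf.fit_transform(X)
X_back = log_tf.inverse_transform(X_log)
```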
Model Persistence Advanced
● Pickle for model saving: import pickle
● Using dill for more complex objects: import dill as pickle
Time Series Analysis
● Rolling window features: df['rolling_mean'] = df['feature'].rolling(window=5).mean()
● Expanding window features: df['expanding_mean'] = df['feature'].expanding(2).mean()
● Time series cross-validation: from sklearn.model_selection import TimeSeriesSplit
Neural Networks and Deep Learning Advanced
● Early stopping in Keras: from tensorflow.keras.callbacks import EarlyStopping
● Custom loss functions in TensorFlow/Keras: def custom_loss(y_true, y_pred): return tf.reduce_mean(tf.abs(y_true - y_pred))
● Fine-tuning pre-trained models in TensorFlow/Keras: model.trainable = True and model.compile(...)
Working with Large Datasets
● Incremental PCA for large datasets: from sklearn.decomposition import IncrementalPCA
● Online learning algorithms (e.g., SGDClassifier): from sklearn.linear_model import SGDClassifier
● Dask for parallel computing: import dask.dataframe as dd
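An IncrementalPCA sketch that fits chunk by chunk, as one would with out-of-core data; the array size, chunk size, and component count are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.rand(10_000, 50)
ipca = IncrementalPCA(n_components=10)
for i in range(0, X.shape[0], 1000):   # feed the data one chunk at a time
    ipca.partial_fit(X[i:i + 1000])
X_reduced = ipca.transform(X)
```

By: Waleed Mousa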