Northbay Summarizes Data Pre-Processing Algorithms

The document lists common model issues, including missing data, noisy data, computational issues, imbalanced datasets, non-stationarity, high variance/low bias, and poor generalization, with potential fixes for each (imputation techniques, data cleaning, feature selection, updating models with new data, simplifying models, increasing training data, and so on). It then catalogues scikit-learn data pre-processing, feature-selection, modelling, and model-selection utilities, giving an example call and typical use cases for each.


Model Issues

2. Overfitting

Description:

Model performs well on training data but poorly on unseen/test data.

Potential Fixes:

Increase regularization.
Reduce model complexity.
Use more training data.
Implement cross-validation (see the sketch at the end of this section).
Apply data augmentation.
Prune the model (if applicable).

3. Underfitting

Description:

Model performs poorly on both training and unseen/test data.

Potential Fixes:

Increase model complexity.
Feature engineering.
Reduce regularization.
Check for data quality issues.
Revisit data preprocessing.

4. Bias (Statistical)

Description:

Model shows prejudice towards certain groups or outcomes based on training data.

Potential Fixes:

Balance the dataset.
Use fairness metrics for evaluation.
Implement algorithmic fairness techniques.
Explore different feature sets.
Conduct thorough exploratory data analysis.

5. Variance

Description:

Model is too sensitive to small fluctuations in the training set.

Potential Fixes:

Simplify the model.
Increase training data.
Apply regularization.
Use ensemble methods.

6. Data Leakage

Description:

Model inadvertently gains information from outside its training dataset, often leading to overfitting.

Potential Fixes:

Careful feature selection.
Cross-validation.
Ensure separation of training and test datasets.
Scrutinize data preprocessing steps.

7. Poor Generalization

Description:

Model does not perform well on new, unseen data.

Potential Fixes:

Use more diverse training data.
Implement cross-validation.
Try different model architectures.
Data augmentation.

8. High Bias/Low Variance

Description:

Model is too simple, making it unable to capture complexities in the data (related to underfitting).

Potential Fixes:

Increase model complexity.
Feature engineering.
Reduce regularization.
Explore advanced models.

9. High Variance/Low Bias

Description:

Model is too complex, fitting too closely to the training data (related to overfitting).

Potential Fixes:

Simplify the model.
Increase training data.
Apply regularization.
Use bagging in ensemble methods.

10. Non-Stationarity

Description:

Model performs poorly due to changing underlying relationships in the data over time.

Potential Fixes:

Use models capable of adapting to change.
Regularly update the model with new data.
Apply techniques like windowing for time series data.

11. Imbalanced Dataset

Description:

The training dataset does not have a representative distribution of classes.

Potential Fixes:

Use resampling techniques.
Apply different performance metrics (e.g., F1-score instead of accuracy).
Use cost-sensitive learning methods.
Synthetic data generation (SMOTE).

12. Computational Issues

Description:

Issues related to the size of the data, speed of training, etc.

Potential Fixes:

Optimize algorithms.
Feature selection to reduce dimensionality.
Use more efficient hardware.
Apply distributed computing techniques.

13. Noisy Data

Description:

Data contains a lot of irrelevant information or errors.

Potential Fixes:

Data cleaning.
Robust preprocessing.
Outlier detection and removal.
Feature selection to focus on relevant features.

14. Missing Data

Description:

Missing values in the dataset leading to insufficient training of the model.

Potential Fixes:

Imputation techniques.
Use models that can handle missing data.
Removal of features with excessive missing values.
Data collection strategies to minimize missing data.
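Choosing among these fixes is easier once the issue is diagnosed. Below is a minimal sketch (an illustration of ours, using scikit-learn and a synthetic dataset that is not part of the original list) contrasting training accuracy with cross-validated accuracy to separate overfitting from underfitting:

# Sketch: diagnosing overfitting vs. underfitting.
# A large train/CV gap suggests overfitting (high variance);
# low scores on both suggest underfitting (high bias).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for depth in (2, None):  # a shallow tree vs. a fully grown tree
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_score = cross_val_score(model, X, y, cv=5).mean()
    train_score = model.fit(X, y).score(X, y)
    print(f"max_depth={depth}: train={train_score:.2f}, cv={cv_score:.2f}")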


Data Pre-Processing Algorithms

1. StandardScaler

How to Call:

StandardScaler()

Use Cases:

Scaling features to a mean of 0 and variance of 1. Useful when algorithms assume features to be on a similar scale, e.g., SVM, Neural Networks. (A usage sketch appears at the end of this section.)

2. MinMaxScaler

How to Call:

MinMaxScaler()

Use Cases:

Scaling features to a given range, often [0, 1]. Useful when you need a bounded interval.

3. RobustScaler

How to Call:

RobustScaler()

Use Cases:

Scaling features based on the median and IQR. Useful for data with outliers.

4. OneHotEncoder

How to Call:

OneHotEncoder()

Use Cases:

Encoding categorical variables as binary vectors. Used when the categorical data isn't ordinal.

5. OrdinalEncoder

How to Call:

OrdinalEncoder()

Use Cases:

Encoding categorical variables as integer values. Useful for ordinal data.

6. LabelEncoder

How to Call:

LabelEncoder()

Use Cases:

Converts categories to integers. Often used for target variable encoding.

7. LabelBinarizer

How to Call:

LabelBinarizer()

Use Cases:

Converts multi-class labels to binary labels (one-vs-all).

8. Binarizer

How to Call:

Binarizer(threshold=0.5)

Use Cases:

Converts continuous data into binary form based on a threshold.

9. SimpleImputer

How to Call:

SimpleImputer(strategy='mean')

Use Cases:

Imputation transformer for completing missing values.

10. PolynomialFeatures

How to Call:

PolynomialFeatures(degree=2)

Use Cases:

Generating polynomial features. Useful for linear regression when the relationship isn't purely linear.

11. FunctionTransformer

How to Call:

FunctionTransformer(func)

Use Cases:

Constructs a transformer from an arbitrary callable. Useful for applying a simple transformation function.

12. PowerTransformer

How to Call:

PowerTransformer(method='yeo-johnson')

Use Cases:

Applies a power transform featurewise to stabilize variance and make the data more Gaussian-like.

13. QuantileTransformer

How to Call:

QuantileTransformer()

Use Cases:

Transforms features using quantile information. Can spread out the most frequent values and reduce the impact of (marginal) outliers.

14. KBinsDiscretizer

How to Call:

KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')

Use Cases:

Binning continuous data into intervals.
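All of the transformers above share the same fit/transform API. A minimal sketch, using a tiny made-up array (the data and parameter choices are illustrative and not part of the original listing):

import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

X_num = np.array([[1.0], [2.0], [3.0]])          # a numeric column
X_cat = np.array([['red'], ['blue'], ['red']])   # a categorical column

# Every transformer above follows the same fit/transform pattern.
X_scaled = StandardScaler().fit_transform(X_num)           # mean 0, unit variance
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()  # one binary column per category
print(X_scaled.ravel())
print(X_onehot)
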
Data Pre-Processing Algorithms

1. MaxAbsScaler

How to Call:

MaxAbsScaler()

Use Cases:

Scales each feature by its maximum absolute value. This is meant for data that is already centered at zero, without outliers.

2. Normalizer

How to Call:

Normalizer(norm='l2')

Use Cases:

Normalizes samples individually to unit norm. This technique is useful when you want to consider the angle between feature vectors.

3. CategoricalImputer

How to Call:

CategoricalImputer()

Use Cases:

Fills in missing values within categorical features using the most frequent value or a placeholder. (Note: this is not a scikit-learn class; it is provided by third-party packages such as sklearn_pandas, and SimpleImputer(strategy='most_frequent') covers the same use case within scikit-learn.)

4. FeatureHasher

How to Call:

FeatureHasher(n_features=20)

Use Cases:

Applies a hash function to the features to determine their column index in feature matrices. Useful for high-dimensional data.

5. MultiLabelBinarizer

How to Call:

MultiLabelBinarizer()

Use Cases:

Transforms a list of multilabel tags to a binary matrix. Essential for multi-label classification problems.

6. ColumnTransformer

How to Call:

ColumnTransformer(transformers=[...])

Use Cases:

Applies transformers to columns of arrays or pandas DataFrames. Allows different columns to be transformed differently. (See the sketch at the end of this section.)

7. DictVectorizer

How to Call:

DictVectorizer()

Use Cases:

Transforms lists of feature-value mappings to vectors. Useful when feature extraction from text data results in a dictionary.

8. MissingIndicator

How to Call:

MissingIndicator()

Use Cases:

Marks features with missing values. Often used in conjunction with an imputation technique.

9. OutputCodeClassifier

How to Call:

OutputCodeClassifier(estimator=...)

Use Cases:

(Error-correcting) output-code multiclass strategy: each class is represented by a binary code and one binary classifier is fitted per code bit. Useful for multiclass learning with binary classifiers.
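A minimal sketch of ColumnTransformer routing different columns of a small, made-up DataFrame to different transformers (the column names and values are illustrative assumptions):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MaxAbsScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [25, 32, 47],         # numeric column
    "gender": ["f", "m", "f"],   # categorical column
})

# Scale the numeric column and one-hot encode the categorical column.
ct = ColumnTransformer(transformers=[
    ("num", MaxAbsScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["gender"]),
])
print(ct.fit_transform(df))
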
Data Pre-Processing Algorithms

1. VarianceThreshold

How to Call:

VarianceThreshold(threshold=0.0)

Use Cases:

Removes all features that have a variance below a certain threshold. It is useful for feature selection to remove non-informative features.

2. SelectKBest

How to Call:

SelectKBest(score_func=f_classif, k=10)

Use Cases:

Selects the top-k scoring features based on a chosen scoring function. It's commonly used to improve model performance by retaining only the most informative features.

3. SelectPercentile

How to Call:

SelectPercentile(score_func=f_classif, percentile=50)

Use Cases:

Selects features according to a percentile of the highest scores. Similar to SelectKBest but selects a percentage of features instead of a fixed number.

4. RFE (Recursive Feature Elimination)

How to Call:

RFE(estimator, n_features_to_select=10)

Use Cases:

Recursively removes the weakest features to improve model accuracy. Often used when the number of features is very high, and reducing complexity is necessary.

5. PCA (Principal Component Analysis)

How to Call:

PCA(n_components=2)

Use Cases:

Dimensionality reduction technique that transforms features into a set of orthogonal components that explain the most variance in the data.

6. TruncatedSVD

How to Call:

TruncatedSVD(n_components=2)

Use Cases:

Similar to PCA but suitable for sparse matrices, which are common in text data.

7. SelectFromModel

How to Call:

SelectFromModel(estimator)

Use Cases:

Selects features based on importance weights provided by a fitted model. Useful when using tree-based estimators like RandomForest that can compute feature importances.

8. SequentialFeatureSelector

How to Call:

SequentialFeatureSelector(estimator, n_features_to_select=10)

Use Cases:

Adds or removes features to form the best feature subset. It's a greedy procedure that adds or removes one feature at a time based on model performance.

9. Isomap

How to Call:

Isomap(n_neighbors=5, n_components=2)

Use Cases:

Non-linear dimensionality reduction through Isometric Mapping. It's particularly useful when the data lies on an embedded non-linear manifold.

10. LocallyLinearEmbedding

How to Call:

LocallyLinearEmbedding(n_neighbors=10, n_components=2)

Use Cases:

Seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It is useful for unwrapping twisted manifolds.
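A minimal sketch chaining two of the selectors above in a Pipeline on the iris dataset (the dataset and the choice of k and n_components are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# Keep the 3 most informative features, then project them onto 2 components.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("pca", PCA(n_components=2)),
])
X_reduced = pipe.fit_transform(X, y)
print(X_reduced.shape)  # (150, 2)
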
Data Pre-Processing Algorithms

1. TfidfVectorizer

How to Call:

TfidfVectorizer()

Use Cases:

Converts a collection of raw documents to a matrix of TF-IDF features. Ideal for text analysis. (See the sketch at the end of this section.)

2. CountVectorizer

How to Call:

CountVectorizer()

Use Cases:

Converts a collection of text documents to a matrix of token counts. This representation is used for text classification.

3. HashingVectorizer

How to Call:

HashingVectorizer(n_features=2**20)

Use Cases:

Converts a collection of text documents to a matrix of token occurrences using the hashing trick, with optional normalization.

4. NMF (Non-Negative Matrix Factorization)

How to Call:

NMF(n_components=2)

Use Cases:

Factorization method to discover hidden topics or concepts within the data, often used in text data.

5. LatentDirichletAllocation

How to Call:

LatentDirichletAllocation(n_components=10)

Use Cases:

Topic modeling technique that assigns topics to documents and words to topics.

6. AdditiveChi2Sampler

How to Call:

AdditiveChi2Sampler()

Use Cases:

Approximates the feature map of the additive chi-squared kernel, allowing linear models to be used for non-linear classification of histogram-like features.

7. KernelCenterer

How to Call:

KernelCenterer()

Use Cases:

Centers a kernel matrix, especially useful in Kernel Principal Component Analysis.

8. Normalizer

How to Call:

Normalizer(norm='l2')

Use Cases:

Normalizes individual samples to have unit norm. Useful in text classification when using cosine similarity.

9. LabelSpreading

How to Call:

LabelSpreading()

Use Cases:

Semi-supervised learning algorithm that spreads label information through a dataset.

10. LabelPropagation

How to Call:

LabelPropagation()

Use Cases:

Another semi-supervised learning technique that infers labels for unlabeled data points.

11. NearestNeighbors

How to Call:

NearestNeighbors(n_neighbors=3)

Use Cases:

Unsupervised learner for implementing neighbor searches, used in clustering, classification, and regression.

12. RadiusNeighborsTransformer

How to Call:

RadiusNeighborsTransformer(radius=1.0)

Use Cases:

Transforms data into a matrix of distances to all neighbors within a given radius.

13. KNeighborsTransformer

How to Call:

KNeighborsTransformer(n_neighbors=5)

Use Cases:

Transforms data into a matrix of distances to the nearest neighbors.

14. LocalOutlierFactor

How to Call:

LocalOutlierFactor()

Use Cases:

Unsupervised outlier detection using local density estimation.

15. TSNE

How to Call:

TSNE(n_components=2)

Use Cases:

t-distributed Stochastic Neighbor Embedding. Non-linear dimensionality reduction, ideal for visualization of high-dimensional datasets.

16. UMAP

How to Call:

UMAP(n_components=2)

Use Cases:

Uniform Manifold Approximation and Projection. A non-linear dimensionality reduction technique often used for visualization. (Provided by the umap-learn package rather than scikit-learn.)

17. Binarizer

How to Call:

Binarizer(threshold=0.0)

Use Cases:

Converts numerical features into boolean values based on a threshold.

18. MaxAbsScaler

How to Call:

MaxAbsScaler()

Use Cases:

Scales each feature by its maximum absolute value, useful for data that is already centered at zero.

19. KernelPCA

How to Call:

KernelPCA(n_components=2, kernel='linear')

Use Cases:

Non-linear dimensionality reduction through the use of kernels.

20. SparsePCA

How to Call:

SparsePCA(n_components=2)

Use Cases:

Variant of principal component analysis that aims to find a set of sparse components that explain the variance in the data.
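A minimal sketch pairing TfidfVectorizer with NMF to extract two rough "topics" from a toy corpus (the documents and the number of components are made up for illustration):

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)              # documents x terms, TF-IDF weighted

nmf = NMF(n_components=2, random_state=0)  # two latent topics
doc_topics = nmf.fit_transform(X)          # documents x topics
terms = tfidf.get_feature_names_out()
for topic in nmf.components_:              # print the top terms per topic
    print([terms[i] for i in topic.argsort()[-3:]])
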
Data Pre-Processing Algorithms

1. IncrementalPCA

How to Call:

IncrementalPCA(n_components=2)

Use Cases:

Incremental principal component analysis, useful for large datasets that cannot fit in memory.

2. FactorAnalysis

How to Call:

FactorAnalysis(n_components=2)

Use Cases:

A method to model observed variables and their underlying latent factors.

3. FastICA

How to Call:

FastICA(n_components=2)

Use Cases:

Fast Independent Component Analysis for signal separation and feature extraction.

4. MDS (Multidimensional Scaling)

How to Call:

MDS(n_components=2)

Use Cases:

A technique used for analyzing similarity or dissimilarity data; helps to visualize high-dimensional data.

5. IsolationForest

How to Call:

IsolationForest()

Use Cases:

An ensemble algorithm for anomaly detection that isolates outliers instead of profiling normal data points. (See the sketch at the end of this section.)

6. SelectFpr (False Positive Rate test)

How to Call:

SelectFpr(alpha=0.05)

Use Cases:

Filter method to select features based on a false positive rate test.

7. SelectFdr (False Discovery Rate test)

How to Call:

SelectFdr(alpha=0.05)

Use Cases:

Feature selection technique that controls the false discovery rate.

8. SelectFwe (Family-wise Error rate)

How to Call:

SelectFwe(alpha=0.05)

Use Cases:

Selects features based on the family-wise error rate, often used in hypothesis testing.

9. GenericUnivariateSelect

How to Call:

GenericUnivariateSelect(mode='fpr', param=0.05)

Use Cases:

Allows univariate feature selection to be performed with a configurable strategy.

10. FeatureAgglomeration

How to Call:

FeatureAgglomeration()

Use Cases:

Hierarchical clustering to group together similar features.

11. SpectralEmbedding

How to Call:

SpectralEmbedding(n_components=2)

Use Cases:

Uses spectral decomposition to reduce dimensionality, useful in clustering tasks.

12. DictionaryLearning

How to Call:

DictionaryLearning(n_components=2)

Use Cases:

An unsupervised method for dictionary learning and feature extraction.

13. MiniBatchDictionaryLearning

How to Call:

MiniBatchDictionaryLearning(n_components=2)

Use Cases:

A faster version of DictionaryLearning suitable for large datasets.

14. MiniBatchSparsePCA

How to Call:

MiniBatchSparsePCA(n_components=2)

Use Cases:

A scalable version of SparsePCA that uses a mini-batch approach.

15. Nystroem

How to Call:

Nystroem(kernel='rbf', n_components=2)

Use Cases:

An efficient method to approximate a kernel map for large-scale datasets.

16. RBFSampler

How to Call:

RBFSampler(gamma=1.0, n_components=100)

Use Cases:

Approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.

17. SkewedChi2Sampler

How to Call:

SkewedChi2Sampler(skewedness=0.5, n_components=100)

Use Cases:

Approximates the feature map of the "skewed chi-squared" kernel by Monte Carlo approximation of its Fourier transform.

18. QuantileTransformer (normal output)

How to Call:

QuantileTransformer(output_distribution='normal')

Use Cases:

Normalizes features using quantile information to follow a standard normal distribution.

19. KBinsDiscretizer

How to Call:

KBinsDiscretizer(n_bins=5, encode='ordinal')

Use Cases:

Discretizes continuous features into discrete bins.

20. KMeans

How to Call:

KMeans(n_clusters=8)

Use Cases:

Partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
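A minimal sketch of IsolationForest flagging an obvious outlier in a tiny, made-up sample (the contamination value is an illustrative assumption):

import numpy as np
from sklearn.ensemble import IsolationForest

# A tight cluster around the origin, plus one far-away point.
X = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.1], [0.1, 0.0], [8.0, 9.0]])

iso = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(iso.predict(X))  # 1 = inlier, -1 = outlier; the last point should be -1
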
Data Pre-Processing Algorithms

1. Birch

How to Call:

Birch(threshold=0.5, n_clusters=3)

Use Cases:

An online-learning algorithm for clustering that builds a tree called the Clustering Feature Tree (CFT).

2. DBSCAN

How to Call:

DBSCAN(eps=0.5, min_samples=5)

Use Cases:

A density-based clustering algorithm that groups together points that are closely packed together. (See the sketch at the end of this section.)

3. OPTICS

How to Call:

OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.1)

Use Cases:

Clustering algorithm similar to DBSCAN but with the ability to find clusters of varying densities.

4. AffinityPropagation

How to Call:

AffinityPropagation(damping=0.5, max_iter=200)

Use Cases:

A clustering algorithm that sends messages between pairs of samples until convergence.

5. AgglomerativeClustering

How to Call:

AgglomerativeClustering(n_clusters=2)

Use Cases:

A hierarchical clustering method using a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

6. FeatureUnion

How to Call:

FeatureUnion(transformer_list=[('transformer1', transformer1), ('transformer2', transformer2)])

Use Cases:

A pipeline utility to combine multiple feature extraction or transformation methods into a single transformer.

7. PolynomialCountSketch

How to Call:

PolynomialCountSketch(degree=2, n_components=100)

Use Cases:

Approximates the feature map of a polynomial kernel by a fast, sparse projection.

8. ExtraTreesClassifier

How to Call:

ExtraTreesClassifier(n_estimators=100, random_state=0)

Use Cases:

An ensemble learning method fundamentally similar to a random forest, but it selects tree splits in a more random manner.

9. GradientBoostingClassifier

How to Call:

GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

Use Cases:

A machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models.

10. AdaBoostClassifier

How to Call:

AdaBoostClassifier(n_estimators=100, random_state=0)

Use Cases:

A boosting ensemble meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.

11. BaggingClassifier

How to Call:

BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0)

Use Cases:

An ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.

12. HistGradientBoostingClassifier

How to Call:

HistGradientBoostingClassifier(max_iter=100)

Use Cases:

A histogram-based Gradient Boosting Classification Tree designed for speed, which can handle categorical data and naturally deals with missing values.

13. CalibratedClassifierCV

How to Call:

CalibratedClassifierCV(base_estimator=SVC(), method='sigmoid', cv=5)

Use Cases:

Probability calibration with isotonic regression or logistic (sigmoid) regression on classifier output.

14. CategoricalNB

How to Call:

CategoricalNB()

Use Cases:

Naive Bayes classifier for categorical features, particularly suited for features that are discretely distributed.

15. ComplementNB

How to Call:

ComplementNB()

Use Cases:

A modification of the standard Multinomial Naive Bayes algorithm that is particularly suited for imbalanced data sets.

16. CheckingClassifier

How to Call:

CheckingClassifier()

Use Cases:

A classifier for sanity checking or debugging purposes that does not learn from input data and only performs checks or returns fixed predictions.

17. ClassifierChain

How to Call:

ClassifierChain(base_estimator=SVC(), order='random', random_state=0)

Use Cases:

A multi-label model that arranges binary classifiers into a chain, where each classifier uses the predictions of the classifiers earlier in the chain as additional features.

18. DecisionTreeClassifier

How to Call:

DecisionTreeClassifier()

Use Cases:

A non-parametric supervised learning method used for classification and regression that models decisions and their possible consequences as a tree.

19. DummyClassifier

How to Call:

DummyClassifier(strategy='stratified')

Use Cases:

A classifier that makes predictions using simple rules, which can be useful as a baseline for comparison with actual classifiers.

20. EllipticEnvelope

How to Call:

EllipticEnvelope(contamination=0.1)

Use Cases:

An outlier detection algorithm that fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
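A minimal sketch of DBSCAN recovering two half-moon shaped clusters, using scikit-learn's make_moons purely as illustrative data (the eps and min_samples values are assumptions chosen for this toy example):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps controls the neighbourhood radius, min_samples the core-point density.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))  # expected: {0, 1}, possibly with -1 for noise points
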
Data Pre-Processing Algorithms

1. ExtraTreesRegressor

How to Call:

ExtraTreesRegressor(n_estimators=100, random_state=0)

Use Cases:

An ensemble learning method for regression that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.

2. GradientBoostingRegressor

How to Call:

GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)

Use Cases:

A machine learning technique for regression that builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

3. RandomForestClassifier

How to Call:

RandomForestClassifier(n_estimators=100, random_state=0)

Use Cases:

A meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

4. RandomForestRegressor

How to Call:

RandomForestRegressor(n_estimators=100, random_state=0)

Use Cases:

A meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting for regression tasks.

5. RidgeClassifier

How to Call:

RidgeClassifier(alpha=1.0)

Use Cases:

Classifier that uses ridge regression to classify multi-class data.

6. RidgeClassifierCV

How to Call:

RidgeClassifierCV(alphas=[0.1, 1.0, 10.0])

Use Cases:

Ridge classifier with built-in cross-validation of the alpha parameter.

7. SGDClassifier

How to Call:

SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001)

Use Cases:

Linear classifiers (SVM, logistic regression, a.o.) with stochastic gradient descent (SGD) training.

8. SGDRegressor

How to Call:

SGDRegressor(loss='squared_loss', penalty='l2', alpha=0.0001)

Use Cases:

Linear regression model trained with SGD.

9. SVR

How to Call:

SVR(kernel='rbf', C=1.0, epsilon=0.2)

Use Cases:

Epsilon-Support Vector Regression. The free parameters in the model are C and epsilon.

10. NuSVR

How to Call:

NuSVR(nu=0.5, C=1.0, kernel='rbf')

Use Cases:

Nu-Support Vector Regression. Similar to SVR but uses a parameter nu to control the number of support vectors.

11. LinearSVR

How to Call:

LinearSVR(epsilon=0.0, tol=1e-4)

Use Cases:

Scalable Linear Support Vector Machine for regression implemented using liblinear.

12. NearestCentroid

How to Call:

NearestCentroid()

Use Cases:

Nearest centroid classifier. Each class is represented by its centroid, with test samples classified to the class with the nearest centroid.

13. PassiveAggressiveClassifier

How to Call:

PassiveAggressiveClassifier(max_iter=1000, random_state=0)

Use Cases:

Passive Aggressive algorithms are a family of algorithms for large-scale learning that are similar to the Perceptron in that they do not require a learning rate.

14. PassiveAggressiveRegressor

How to Call:

PassiveAggressiveRegressor(max_iter=1000, random_state=0)

Use Cases:

Passive Aggressive algorithms for regression.

15. RANSACRegressor

How to Call:

RANSACRegressor(min_samples=2, max_trials=100, loss='absolute_loss')

Use Cases:

RANSAC (RANdom SAmple Consensus) algorithm. RANSAC is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set. (See the sketch at the end of this section.)

16. TheilSenRegressor

How to Call:

TheilSenRegressor(random_state=0)

Use Cases:

Theil-Sen Estimator: robust multivariate regression model.

17. HuberRegressor

How to Call:

HuberRegressor(max_iter=100, epsilon=1.35)

Use Cases:

Linear regression model that is robust to outliers.

18. QuantileRegressor

How to Call:

QuantileRegressor(quantile=0.5, alpha=0.01)

Use Cases:

Linear regression model that predicts a specified quantile of the target's distribution.

19. PoissonRegressor

How to Call:

PoissonRegressor(alpha=1e-2, max_iter=1000)

Use Cases:

Generalized Linear Model with a Poisson distribution.
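A minimal sketch comparing two of the robust regressors above on synthetic data with a few corrupted targets (the data generation and the expectation of a slope near 3 are illustrative assumptions):

import numpy as np
from sklearn.linear_model import HuberRegressor, RANSACRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=50)
y[:5] += 40  # corrupt a few targets to simulate outliers

huber = HuberRegressor().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)
# Both fits should stay close to the true slope of 3 despite the corrupted points.
print(huber.coef_, ransac.estimator_.coef_)
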
Data Pre-Processing Algorithms

1. VotingClassifier - How to Call:

VotingClassifier(estimators=[('lr', LogisticRegression()), ('rf', RandomForestClassifier())])

Use Cases: A classifier that fits multiple classifiers and takes the majority vote for prediction. Useful for combining conceptually different machine learning classifiers.

2. VotingRegressor - How to Call:

VotingRegressor(estimators=[('lr', LinearRegression()), ('rf', RandomForestRegressor())])

Use Cases: A regressor that fits multiple regressors and averages their predictions. Helpful for reducing variance.

3. StackingClassifier - How to Call:

StackingClassifier(estimators=[('rf', RandomForestClassifier()), ('svr', make_pipeline(StandardScaler(), LinearSVC()))])

Use Cases: A classifier that stacks the output of individual estimators and uses a classifier to compute the final prediction.

4. StackingRegressor - How to Call:

StackingRegressor(estimators=[('lr', LinearRegression()), ('ridge', Ridge())])

Use Cases: A regressor that stacks the output of individual estimators and uses a regressor to compute the final prediction.

5. SelectFromModel - How to Call:

SelectFromModel(estimator=LogisticRegression(penalty="l1", solver="liblinear"), threshold='mean')

Use Cases: Meta-transformer for selecting features based on importance weights from a model, such as Lasso.

6. SequentialFeatureSelector - How to Call:

SequentialFeatureSelector(estimator=RandomForestClassifier(), n_features_to_select=5)

Use Cases: A transformer that greedily adds (forward selection) or removes (backward selection) one feature at a time based on cross-validated model performance.

7. ColumnTransformer - How to Call:

ColumnTransformer(transformers=[('num', MinMaxScaler(), ['age']), ('cat', OneHotEncoder(), ['gender'])])

Use Cases: Applies transformers to columns of arrays or pandas DataFrames.

8. SimpleImputer - How to Call:

SimpleImputer(strategy='mean')

Use Cases: Imputation transformer for completing missing values in datasets.

9. IterativeImputer - How to Call:

IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)

Use Cases: Multivariate imputer that estimates each feature from all the others using the specified estimator.

10. KNNImputer - How to Call:

KNNImputer(n_neighbors=2, weights="uniform")

Use Cases: Imputation for completing missing values using k-Nearest Neighbors.

11. MissingIndicator - How to Call:

MissingIndicator()

Use Cases: Binary indicators for missing values.

12. TimeSeriesSplit - How to Call:

TimeSeriesSplit(n_splits=5)

Use Cases: Cross-validator for time series data.

13. StratifiedKFold - How to Call:

StratifiedKFold(n_splits=5)

Use Cases: Stratified K-Folds cross-validator providing train/test indices to split data.

14. GroupKFold - How to Call:

GroupKFold(n_splits=5)

Use Cases: K-fold iterator variant with non-overlapping groups.

15. GridSearchCV - How to Call:

GridSearchCV(estimator=SVC(), param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')}, cv=5)

Use Cases: Exhaustive search over specified parameter values for an estimator. (See the sketch at the end of this section.)

16. RandomizedSearchCV - How to Call:

RandomizedSearchCV(estimator=RandomForestClassifier(), param_distributions={'max_depth': [3, None], 'max_features': randint(1, 9)}, cv=5)

Use Cases: Randomized search on hyperparameters.

17. cross_val_score - How to Call:

cross_val_score(estimator=SVC(), X=data, y=labels, cv=5)

Use Cases: Evaluate a score by cross-validation.
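A minimal sketch of GridSearchCV tuning the C parameter of an SVC inside a pipeline on the iris dataset (the dataset and the parameter grid are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Scale inside the pipeline so each CV fold is preprocessed independently.
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
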
Data Pre-Processing Algorithms

1. cross_val_predict - How to Call:

cross_val_predict(estimator=SVC(), X=data, y=labels, cv=5)

Use Cases: Generates cross-validated estimates for each input data point.

2. permutation_importance - How to Call:

permutation_importance(estimator=model, X=val_data, y=val_labels, n_repeats=30)

Use Cases: Assessment of the importance of different features via permutation.

3. learning_curve - How to Call:

learning_curve(estimator=RandomForestClassifier(), X=data, y=labels, train_sizes=np.linspace(.1, 1.0, 5))

Use Cases: Determines cross-validated training and test scores for different training set sizes.

4. validation_curve - How to Call:

validation_curve(estimator=SVC(), X=data, y=labels, param_name='C', param_range=param_range, cv=5)

Use Cases: Determines training and test scores for varying parameter values. (See the sketch at the end of this section.)

5. ShuffleSplit - How to Call:

ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

Use Cases: Random permutation cross-validator.

6. GroupShuffleSplit - How to Call:

GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

Use Cases: Ensures that the same group is not represented in both testing and training sets.

7. StratifiedShuffleSplit - How to Call:

StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)

Use Cases: Provides train/test indices to split data in train/test sets while preserving the percentage of samples for each class.

8. LeaveOneOut - How to Call:

LeaveOneOut()

Use Cases: Provides train/test indices to split data in train/test sets where each sample is used once as a test set (singleton).

9. LeavePOut - How to Call:

LeavePOut(p=2)

Use Cases: Similar to LeaveOneOut, but leaves P samples out.

10. LeaveOneGroupOut - How to Call:

LeaveOneGroupOut()

Use Cases: Provides train/test indices to split data according to a third-party provided group.

11. LeavePGroupsOut - How to Call:

LeavePGroupsOut(n_groups=2)

Use Cases: Leaves P groups out, and the rest of the data is used as a training set.

12. GroupKFold - How to Call:

GroupKFold(n_splits=5)

Use Cases: Ensures that the same group is not in both testing and training sets.

13. TimeSeriesSplit - How to Call:

TimeSeriesSplit(n_splits=5)

Use Cases: Provides train/test indices to split time series data samples that are observed at fixed time intervals.

14. PredefinedSplit - How to Call:

PredefinedSplit(test_fold=array)

Use Cases: Generates train/test indices based on predefined splits.

15. train_test_split - How to Call:

train_test_split(*arrays, test_size=0.25, random_state=0)

Use Cases: Splits arrays or matrices into random train and test subsets.

16. chi2 - How to Call:

SelectKBest(score_func=chi2, k=2)

Use Cases: Select features according to the k highest scores of the chi-squared statistic.

17. f_classif - How to Call:

SelectKBest(score_func=f_classif, k=2)

Use Cases: Computes the ANOVA F-value for the provided sample.

18. mutual_info_classif - How to Call:

SelectKBest(score_func=mutual_info_classif, k=2)

Use Cases: Estimates mutual information for a discrete target variable.

19. f_regression - How to Call:

SelectKBest(score_func=f_regression, k=2)

Use Cases: Select features based on an F-test for regression tasks.

20. mutual_info_regression - How to Call:

SelectKBest(score_func=mutual_info_regression, k=2)

Use Cases: Estimates mutual information for a continuous target variable.

21. SelectPercentile - How to Call:

SelectPercentile(score_func=f_classif, percentile=10)

Use Cases: Select features according to a percentile of the highest scores.

22. SelectFpr - How to Call:

SelectFpr(score_func=f_classif, alpha=0.05)

Use Cases: Select features based on a false positive rate test.
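A minimal sketch of validation_curve scoring an SVC for several values of C with 5-fold cross-validation (the dataset and the parameter range are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Score SVC for several values of C, each evaluated with 5-fold CV.
param_range = np.logspace(-2, 2, 5)
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="C", param_range=param_range, cv=5
)
print(train_scores.mean(axis=1))  # mean training score per C value
print(test_scores.mean(axis=1))   # mean cross-validated score per C value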
