Northbay Summarizes Data Pre-Processing Algorithms
1. Missing Values
Description:
Missing values in the dataset lead to insufficient training of the model.
Potential Fixes:
Imputation techniques.
Use models that can handle missing data.
Removal of features with excessive missing values.
Data collection strategies to minimize missing data.

2. Overfitting
Description:
Model performs well on training data but poorly on unseen/test data.
Potential Fixes:
Increase regularization.
Reduce model complexity.
Prune the model (if applicable).

Noisy Data
Description:
Data contains a lot of irrelevant information or errors.
Potential Fixes:
Check for data quality issues.

Scalability
Description:
Issues related to the size of the data, speed of training, etc.
Potential Fixes:
Optimize algorithms.

6. Data Leakage
Description:
Model inadvertently gains information from outside its training dataset, often leading to overfitting.
Potential Fixes:
Careful feature selection.
Cross-validation.
Ensure separation of training and test datasets.
Scrutinize data preprocessing steps.

Poor Generalization
Description:
Model does not perform well on new, unseen data.
Potential Fixes:
Use more diverse training data.

8. High Bias/Low Variance
Description:
Model is too simple, making it unable to capture complexities in the data (related to underfitting).
Potential Fixes:
Reduce regularization.

9. High Variance/Low Bias
Description:
Model is too complex, fitting too closely to the training data (related to overfitting).
Potential Fixes:
Simplify the model.
Increase training data.
Apply regularization.

Concept Drift
Description:
Model performs poorly due to changing underlying relationships in the data over time.
Potential Fixes:
Apply techniques like windowing for time series data.

11. Imbalanced Dataset
Description:
The training dataset does not have a representative distribution of classes.
Potential Fixes:
Balance the dataset.
Implement algorithmic fairness techniques.
Increase training data.
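A minimal sketch of how two of the fixes above combine in practice, assuming scikit-learn and a made-up toy array; LogisticRegression merely stands in for any estimator. Fitting the imputer inside a Pipeline on the training split only is one concrete way to apply imputation while avoiding the data-leakage problem described above.

# Minimal sketch (toy data assumed): impute missing values inside a Pipeline so the
# imputer's statistics come from the training split only, avoiding data leakage.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan],
              [5.0, 5.0], [np.nan, 1.0], [6.0, 2.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # imputation technique for missing values
    ("model", LogisticRegression(C=0.5)),         # smaller C = stronger regularization
])
pipe.fit(X_train, y_train)                        # statistics learned from training data only
print(pipe.score(X_test, y_test))                 # evaluated on the held-out test set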
1. StandardScaler
How to Call:
StandardScaler()
Use Cases:
Scaling features to a mean of 0 and variance of 1. Useful when algorithms assume features to be on a similar scale, e.g., SVM, Neural Networks.

2. MinMaxScaler
How to Call:
MinMaxScaler()
Use Cases:
Scaling features to a given range, often [0, 1]. Useful when you need a bounded interval.

3. RobustScaler
How to Call:
RobustScaler()
Use Cases:
Scaling features based on the median and IQR. Useful for data with outliers.

4. OneHotEncoder
How to Call:
OneHotEncoder()
Use Cases:
Encoding categorical variables as binary vectors. Used when the categorical data isn't ordinal.

5. OrdinalEncoder
How to Call:
OrdinalEncoder()
Use Cases:
Encoding categorical variables as integer values. Useful for ordinal data.

6. LabelEncoder
How to Call:
LabelEncoder()
Use Cases:
Convert categories to integers. Often used for target variable encoding.

7. LabelBinarizer
How to Call:
LabelBinarizer()
Use Cases:
Converts multi-class labels to binary labels (one-vs-all).

8. Binarizer
How to Call:
Binarizer(threshold=0.5)
Use Cases:
Convert continuous data into binary form based on a threshold.

9. SimpleImputer
How to Call:
SimpleImputer(strategy='mean')
Use Cases:
Imputation transformer for completing missing values.

10. PolynomialFeatures
How to Call:
PolynomialFeatures(degree=2)
Use Cases:
Generating polynomial features. Useful for linear regression when the relationship isn't purely linear.

11. FunctionTransformer
How to Call:
FunctionTransformer(func)
Use Cases:
Constructs a transformer from an arbitrary callable. Useful for applying a simple transformation function.

12. PowerTransformer
How to Call:
PowerTransformer(method='yeo-johnson')
Use Cases:
Apply a power transform featurewise to stabilize variance and make the data more Gaussian-like.

13. QuantileTransformer
How to Call:
QuantileTransformer()
Use Cases:
Transform features using quantile information. Can spread out the most frequent values and reduce the impact of (marginal) outliers.

14. KBinsDiscretizer
How to Call:
KBinsDiscretizer(n_bins=5)
Use Cases:
Binning continuous data into intervals.
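A minimal sketch comparing the three most common scalers above on the same column; the toy array (with 100.0 acting as an outlier) is made up for illustration.

# Minimal sketch (toy data assumed): how the choice of scaler changes the result.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 acts as an outlier

print(StandardScaler().fit_transform(X).ravel())  # mean 0, variance 1; pulled by the outlier
print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by the IQR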
Data Pre-Processing Algorithms
1. MaxAbsScaler
MaxAbsScaler()
Scales each feature by its maximum absolute value. This is meant for data that is already centered at zero without outliers.

2. Normalizer
Normalizer(norm='l2')
Normalizes samples individually to unit norm. This technique is useful when you want to consider the angle between feature vectors.

3. CategoricalImputer
CategoricalImputer()
Fills in missing values within categorical features using the most frequent value or a placeholder.

4. FeatureHasher
FeatureHasher(n_features=20)
Applies a hash function to the features to determine their column index in feature matrices. Useful for high-dimensional data.

5. MultiLabelBinarizer
How to Call:
MultiLabelBinarizer()
Use Cases:
Encodes collections of labels per sample as a binary indicator matrix, for multi-label problems.

6. ColumnTransformer
ColumnTransformer(transformers=[...])
Applies transformers to columns of arrays or pandas DataFrames. Allows different columns to be transformed differently.

7. DictVectorizer
DictVectorizer()
Transforms lists of feature-value mappings to vectors. Useful when feature extraction from text data results in a dictionary.

8. MissingIndicator
MissingIndicator()
Marks features with missing values. Often used in conjunction with an imputation technique.

9. OutputCodeClassifier
OutputCodeClassifier(estimator=...)
Error-correcting output-code strategy that represents each class with a binary code and fits one binary classifier per code bit. Useful for multiclass learning with binary classifiers.
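A minimal sketch of the ColumnTransformer entry above, applying a different transformer to numeric and categorical columns; the pandas DataFrame is a made-up example.

# Minimal sketch (toy DataFrame assumed): preprocess numeric and categorical columns differently.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000, 52000, 61000, 58000],
    "city": ["NY", "SF", "NY", "LA"],
})

pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),               # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # one-hot encode the categorical column
])
print(pre.fit_transform(df).shape)  # 2 scaled columns + 3 one-hot columns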
1. VarianceThreshold
VarianceThreshold()
Removes all features that have a variance below a certain threshold. It is useful for feature selection to remove non-informative features.

2. SelectKBest
SelectKBest(k=10)
Selects the top-k scoring features based on a chosen scoring function. It's commonly used to improve model performance by retaining only the most informative features.

3. SelectPercentile
SelectPercentile(percentile=10)
Selects features according to a percentile of the highest scores. Similar to SelectKBest but selects a percentage of features instead of a fixed number.

4. RFE
RFE(estimator=...)
Recursively removes the weakest features to improve model accuracy. Often used when the number of features is very high and reducing complexity is necessary.

5. KernelPCA
KernelPCA(n_components=2, kernel='rbf')
Non-linear dimensionality reduction through the use of kernels.

6. SparsePCA
SparsePCA(n_components=2)
Principal component analysis for sparse data, aiming to find a set of sparse components that can explain the variance in the data.

7. SelectFromModel
SelectFromModel(estimator=...)
Selects features based on importance weights provided by a fitted model. Useful when using tree-based estimators like RandomForest that can compute feature importances.

8. SequentialFeatureSelector
SequentialFeatureSelector(estimator=...)
Adds or removes features to form the best feature subset. It's a greedy procedure that adds or removes one feature at a time based on model performance.

9. Isomap
Isomap(n_components=2)
Non-linear dimensionality reduction through Isometric Mapping. It's particularly useful when the data lies on an embedded non-linear manifold.

10. LocallyLinearEmbedding
LocallyLinearEmbedding(n_components=2)
Seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It is useful for unwrapping twisted manifolds.
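A minimal sketch of two selectors from the list above, univariate SelectKBest and model-based SelectFromModel, applied to the same data; the iris dataset is used only as a convenient stand-in.

# Minimal sketch (iris as a stand-in dataset): compare univariate and model-based selection.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

X_kbest = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)   # keep 2 best by ANOVA F-value
X_model = SelectFromModel(RandomForestClassifier(random_state=0)).fit_transform(X, y)  # keep features above mean importance
print(X.shape, X_kbest.shape, X_model.shape)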
1. TfidfVectorizer
How to Call:
TfidfVectorizer()
Use Cases:
Converts a collection of raw documents to a matrix of TF-IDF features. Ideal for text analysis.

2. CountVectorizer
How to Call:
CountVectorizer()
Use Cases:
Converts a collection of text documents to a matrix of token counts. This representation is used for text classification.

3. HashingVectorizer
How to Call:
HashingVectorizer(n_features=2**20)
Use Cases:
Converts a collection of text documents to a matrix of token occurrences using the hashing trick.

4. NMF (Non-Negative Matrix Factorization)
How to Call:
NMF(n_components=2)
Use Cases:
Factorization method to discover hidden topics or concepts within the data, often used in text data.

5. LatentDirichletAllocation
How to Call:
LatentDirichletAllocation(n_components=10)
Use Cases:
Topic modeling technique that assigns topics to documents and words to topics.

6. AdditiveChi2Sampler
How to Call:
AdditiveChi2Sampler()
Use Cases:
Approximates the feature map of the additive chi-squared kernel, useful for non-linear classification on histogram-like features.

7. KernelCenterer
How to Call:
KernelCenterer()
Use Cases:
Centers a kernel matrix, especially useful in Kernel Principal Component Analysis.

8. Normalizer
How to Call:
Normalizer(norm='l2')
Use Cases:
Normalizes individual samples to have unit norm. Useful in text classification when using cosine similarity.

9. LabelSpreading
How to Call:
LabelSpreading()
Use Cases:
Semi-supervised learning technique that spreads label information from labeled to unlabeled data points.

10. LabelPropagation
How to Call:
LabelPropagation()
Use Cases:
Another semi-supervised learning technique that infers labels for unlabeled data points.

11. NearestNeighbors
How to Call:
NearestNeighbors(n_neighbors=3)
Use Cases:
Unsupervised learner for implementing neighbor searches, used in clustering, classification, and regression.

12. RadiusNeighborsTransformer
How to Call:
RadiusNeighborsTransformer(radius=1.0)
Use Cases:
Transform data to a matrix of distances to all neighbors within a given radius.

13. KNeighborsTransformer
How to Call:
KNeighborsTransformer(n_neighbors=5)
Use Cases:
Transform data to a matrix of distances to the nearest neighbors.

14. LocalOutlierFactor
How to Call:
LocalOutlierFactor()
Use Cases:
Unsupervised outlier detection using local density estimation.

15. TSNE
How to Call:
TSNE(n_components=2)
Use Cases:
T-distributed Stochastic Neighbor Embedding. Non-linear dimensionality reduction, ideal for visualization of high-dimensional datasets.

16. UMAP
How to Call:
UMAP(n_components=2)
Use Cases:
Uniform Manifold Approximation and Projection. A non-linear dimensionality reduction technique often used for visualization.

17. Binarizer
How to Call:
Binarizer(threshold=0.0)
Use Cases:
Convert numerical features into boolean values based on a threshold.

18. MaxAbsScaler
How to Call:
MaxAbsScaler()
Use Cases:
Scale each feature by its maximum absolute value, useful for data that is already centered at zero.
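A minimal sketch of the TfidfVectorizer entry above; the tiny corpus is made up purely for illustration, and the same pattern applies to CountVectorizer and HashingVectorizer.

# Minimal sketch (toy corpus assumed): turn raw documents into TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "data preprocessing improves model quality",
    "scaling and encoding are preprocessing steps",
    "topic models work on text data",
]
vec = TfidfVectorizer()            # tokenize, count, then apply TF-IDF weighting
X = vec.fit_transform(corpus)      # sparse matrix: documents x vocabulary
print(X.shape, len(vec.vocabulary_))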
Data Pre-Processing Algorithms
1. IncrementalPCA
How to Call:
IncrementalPCA(n_components=2)
Use Cases:
Incremental principal component analysis is useful for large datasets that cannot fit in memory.

2. FactorAnalysis
How to Call:
FactorAnalysis(n_components=2)
Use Cases:
A method to model observed variables and their underlying latent factors.

3. FastICA
How to Call:
FastICA(n_components=2)
Use Cases:
Fast Independent Component Analysis for signal separation and feature extraction.

4. MDS
How to Call:
MDS(n_components=2)
Use Cases:
A technique used for analyzing similarity or dissimilarity data; helps to visualize high-dimensional data.

5. IsolationForest
How to Call:
IsolationForest()
Use Cases:
An ensemble algorithm for anomaly detection that isolates outliers instead of profiling normal data points.

6. SelectFpr (False Positive Rate test)
How to Call:
SelectFpr(alpha=0.05)
Use Cases:
Filter method to select features based on a false positive rate test.

7. SelectFdr (False Discovery Rate test)
How to Call:
SelectFdr(alpha=0.05)
Use Cases:
Feature selection technique that controls the false discovery rate.

8. SelectFwe (Family-wise Error rate)
How to Call:
SelectFwe(alpha=0.05)
Use Cases:
Selects features based on family-wise error rate, often used in hypothesis testing.

9. GenericUnivariateSelect
How to Call:
GenericUnivariateSelect(mode='fpr', param=0.05)
Use Cases:
Allows performing univariate feature selection with a configurable strategy.

10. FeatureAgglomeration
How to Call:
FeatureAgglomeration()
Use Cases:
Hierarchical clustering to group together similar features.

11. SpectralEmbedding
How to Call:
SpectralEmbedding(n_components=2)
Use Cases:
Uses spectral decomposition to reduce dimensionality, useful in clustering tasks.

12. DictionaryLearning
How to Call:
DictionaryLearning(n_components=2)
Use Cases:
An unsupervised method for dictionary learning and feature extraction.

13. MiniBatchDictionaryLearning
How to Call:
MiniBatchDictionaryLearning(n_components=2)
Use Cases:
A faster version of DictionaryLearning suitable for large datasets.

14. MiniBatchSparsePCA
How to Call:
MiniBatchSparsePCA(n_components=2)
Use Cases:
A scalable version of SparsePCA that uses a mini-batch approach.

15. Nystroem
How to Call:
Nystroem(kernel='rbf', n_components=2)
Use Cases:
An efficient method to approximate a kernel map for large-scale datasets.

16. RBFSampler
How to Call:
RBFSampler(gamma=1.0, n_components=100)
Use Cases:
Approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.

17. SkewedChi2Sampler
How to Call:
SkewedChi2Sampler()
Use Cases:
Reduces skewness in data by using the Chi-squared kernel.

18. QuantileNormalizer
How to Call:
QuantileNormalizer()
Use Cases:
Normalizes features using quantile information to follow a standard normal distribution.

19. KBinsDiscretizer
How to Call:
KBinsDiscretizer(n_bins=5)
Use Cases:
Discretizes continuous features into discrete bins.

20. KMeans
How to Call:
KMeans(n_clusters=8)
Use Cases:
Partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
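A minimal sketch of the IsolationForest entry above, flagging points that are easy to isolate as outliers; the synthetic blob data and the two planted extreme points are made up for illustration.

# Minimal sketch (synthetic data assumed): unsupervised anomaly detection with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),       # dense cluster of normal points
               np.array([[8.0, 8.0], [-9.0, 7.5]])])  # two obvious outliers

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)                                # 1 = inlier, -1 = outlier
print(int((labels == -1).sum()), "points flagged as outliers")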
Data Pre-Processing Algorithms
Birch
An online-learning algorithm for clustering that builds a tree called the Clustering Feature Tree (CFT).

DBSCAN
A density-based clustering algorithm that groups together points that are closely packed together.

OPTICS
Clustering algorithm similar to DBSCAN but with the ability to find clusters of varying densities.

AffinityPropagation
A clustering algorithm that sends messages between pairs of samples until convergence.

AgglomerativeClustering
AgglomerativeClustering(n_clusters=2)
A hierarchical clustering method using a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

7. PolynomialCountSketch
How to Call:
PolynomialCountSketch(degree=2, n_components=100)
Use Cases:
Approximates a feature map of an arbitrary polynomial kernel by a fast, sparse projection.

8. ExtraTreesClassifier
How to Call:
ExtraTreesClassifier(n_estimators=100, random_state=0)
Use Cases:
An ensemble learning method fundamentally similar to a random forest, but it selects tree splits in a more random manner.

9. GradientBoostingClassifier
How to Call:
GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
Use Cases:
A machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models.

10. AdaBoostClassifier
How to Call:
AdaBoostClassifier(n_estimators=100, random_state=0)
Use Cases:
A boosting ensemble meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

11. BaggingClassifier
How to Call:
BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0)
Use Cases:
An ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.

12. HistGradientBoostingClassifier
How to Call:
HistGradientBoostingClassifier(max_iter=100)
Use Cases:
A histogram-based Gradient Boosting Classification Tree designed for speed, which can handle categorical data and naturally deals with missing values.

13. CalibratedClassifierCV
How to Call:
CalibratedClassifierCV(base_estimator=SVC(), method='sigmoid', cv=5)
Use Cases:
Probability calibration with isotonic regression or logistic regression on classifier output.

14. CategoricalNB
How to Call:
CategoricalNB()
Use Cases:
Naive Bayes classifier for categorical features, particularly suited for features that are discretely distributed.

CheckingClassifier
CheckingClassifier()
A classifier for sanity checking or debugging purposes that does not learn from input data and only performs checks or returns fixed predictions.

ClassifierChain
A multi-label model that arranges binary classifiers into a chain where each classifier deals with the label predicted by its predecessor.

DecisionTreeClassifier
A non-parametric supervised learning method used for classification and regression that models decisions and their possible consequences as a tree.

DummyClassifier
A classifier that makes predictions using simple rules, which can be useful as a baseline for comparison with actual classifiers.

EllipticEnvelope
An outlier detection algorithm that fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
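A minimal sketch of the HistGradientBoostingClassifier entry above, which accepts NaN values directly so no separate imputation step is needed; the tiny arrays (and the lowered min_samples_leaf) are made up just to show the call pattern.

# Minimal sketch (toy data assumed): gradient boosting that handles missing values natively.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 4.0],
              [5.0, 5.0], [np.nan, 1.0], [6.0, 2.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

clf = HistGradientBoostingClassifier(max_iter=100, min_samples_leaf=2)  # leaf size lowered for the tiny sample
clf.fit(X, y)                                                           # NaNs handled natively
print(clf.predict([[np.nan, 2.5]]))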
Data Pre-Processing Algorithms
1. ExtraTreesRegressor
How to Call:
ExtraTreesRegressor(n_estimators=100, random_state=0)
Use Cases:
An ensemble learning method for regression that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.

2. GradientBoostingRegressor
How to Call:
GradientBoostingRegressor(n_estimators=100, random_state=0)
Use Cases:
A machine learning technique for regression that builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

3. RandomForestClassifier
How to Call:
RandomForestClassifier(n_estimators=100, random_state=0)
Use Cases:
A meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

4. RandomForestRegressor
How to Call:
RandomForestRegressor(n_estimators=100, random_state=0)
Use Cases:
A meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting for regression tasks.

5. RidgeClassifier
How to Call:
RidgeClassifier(alpha=1.0)
Use Cases:
Classifier that uses ridge regression to classify multi-class data.

6. RidgeClassifierCV
How to Call:
RidgeClassifierCV()
Use Cases:
Ridge classifier with built-in cross-validation over the regularization strength.

7. SGDClassifier
How to Call:
SGDClassifier()
Use Cases:
Linear classifiers (SVM, logistic regression, a.o.) with stochastic gradient descent (SGD) training.

8. SGDRegressor
How to Call:
SGDRegressor()
Use Cases:
Linear regression model fitted by minimizing a regularized loss with stochastic gradient descent.

NuSVR
How to Call:
NuSVR(nu=0.5, C=1.0, kernel='rbf')
Use Cases:
Nu-Support Vector Regression. Similar to SVR but uses a parameter nu to control the number of support vectors.

LinearSVR
How to Call:
LinearSVR(epsilon=0.0, tol=1e-4)
Use Cases:
Linear Support Vector Regression. The free parameters in the model are C and epsilon.

NearestCentroid
How to Call:
NearestCentroid()
Use Cases:
Nearest centroid classifier. Each class is represented by its centroid, with test samples classified to the class with the nearest centroid.

14. PassiveAggressiveRegressor
How to Call:
PassiveAggressiveRegressor(max_iter=1000, random_state=0)
Use Cases:
Passive Aggressive algorithms for regression. Passive Aggressive algorithms are a family of algorithms for large-scale learning that are similar to the Perceptron in that they do not require a learning rate.

15. RANSACRegressor
How to Call:
RANSACRegressor(min_samples=2, max_trials=100, loss='absolute_loss')
Use Cases:
RANSAC (RANdom SAmple Consensus) iteratively fits a model on random subsets of the data, making the regression robust to outliers.

16. TheilSenRegressor
How to Call:
TheilSenRegressor(random_state=0)
Use Cases:
Theil-Sen Estimator: robust multivariate regression model.

17. HuberRegressor
How to Call:
HuberRegressor(max_iter=100, epsilon=1.35)
Use Cases:
Linear regression model that is robust to outliers.

18. QuantileRegressor
How to Call:
QuantileRegressor(quantile=0.5, alpha=0.01)
Use Cases:
Linear regression model that predicts a conditional quantile (here the median) instead of the mean.

19. PoissonRegressor
How to Call:
PoissonRegressor()
Use Cases:
Generalized Linear Model with a Poisson distribution.
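A minimal sketch of the robust-regression idea behind the HuberRegressor entry above; the synthetic line with one corrupted target is made up for illustration, and LinearRegression is used only as a non-robust baseline for comparison.

# Minimal sketch (synthetic data assumed): robust vs. ordinary least squares with one outlier.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.RandomState(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=50)
y[-1] = 150.0                                         # one corrupted target value

print(LinearRegression().fit(X, y).coef_)             # slope dragged upward by the outlier
print(HuberRegressor(epsilon=1.35).fit(X, y).coef_)   # slope stays close to 3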
Data Pre-Processing Algorithms
VotingRegressor - How to Call:
VotingRegressor(estimators=[...])
Use Cases: A regressor that fits multiple regressors and averages their predictions. Helpful for reducing variance.

StackingClassifier - How to Call:
StackingClassifier(estimators=[...])
Use Cases: A classifier that stacks the output of individual estimators and uses a classifier to compute the final prediction.

SelectFromModel - How to Call:
SelectFromModel(estimator=...)
Use Cases: Meta-transformer for selecting features based on importance weights from a model, such as Lasso.

IterativeImputer - How to Call:
IterativeImputer()
Use Cases: Multivariate imputer that estimates each feature from all the others through a specified estimator.

SimpleImputer - How to Call:
SimpleImputer(strategy='mean')
Use Cases: Imputation transformer for completing missing values in datasets.

15. GridSearchCV - How to Call:
GridSearchCV(estimator=..., param_grid=..., cv=5)
Use Cases: Exhaustive search over a specified parameter grid to find the best hyperparameters for an estimator.

14. GroupKFold - How to Call:
GroupKFold(n_splits=5)
Use Cases: Ensures that the same group is not represented in both testing and training sets.

StratifiedKFold - How to Call:
StratifiedKFold(n_splits=5)
Use Cases: Stratified K-Folds cross-validator providing train/test indices to split data.

TimeSeriesSplit - How to Call:
TimeSeriesSplit(n_splits=5)
Use Cases: Cross-validator for time series data.
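A minimal sketch of the GridSearchCV entry above, wrapped around a preprocessing-plus-model Pipeline so that scaling is re-fit inside every cross-validation split; the iris dataset, Pipeline, and the small C grid are stand-ins chosen for illustration.

# Minimal sketch (iris as a stand-in dataset): hyperparameter search over a Pipeline.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.1, 1, 10]}, cv=5)  # 5-fold CV per candidate
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))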
Data Pre-Processing Algorithms
cross_val_predict - How to Call:
cross_val_predict(estimator, X, y, cv=5)
Use Cases: Generates cross-validated estimates for each input data point.

permutation_importance - How to Call:
permutation_importance(estimator, X, y)
Use Cases: Assessment of the importance of different features via permutation.

validation_curve - How to Call:
validation_curve(estimator, X, y, param_name=..., param_range=...)
Use Cases: Determines training and test scores for varying parameter values.

5. ShuffleSplit - How to Call:
ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
Use Cases: Random permutation cross-validator that yields a chosen number of independent train/test splits.

6. GroupShuffleSplit - How to Call:
GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
Use Cases: Ensures that the same group is not represented in both testing and training sets.

7. StratifiedShuffleSplit - How to Call:
StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
Use Cases: Provides train/test indices to split data in train/test sets while preserving the percentage of samples for each class.

8. LeaveOneOut - How to Call:
LeaveOneOut()
Use Cases: Provides train/test indices to split data in train/test sets where each sample is used once as a test set (singleton).

9. LeavePOut - How to Call:
LeavePOut(p=2)
Use Cases: Similar to LeaveOneOut, but leaves P samples out.

10. LeaveOneGroupOut - How to Call:
LeaveOneGroupOut()
Use Cases: Provides train/test indices to split data according to a third-party provided group.

11. LeavePGroupsOut - How to Call:
LeavePGroupsOut(n_groups=2)
Use Cases: Leaves P groups out, and the rest of the data is used as a training set.

12. GroupKFold - How to Call:
GroupKFold(n_splits=5)
Use Cases: Ensures that the same group is not in both testing and training sets.

13. TimeSeriesSplit - How to Call:
TimeSeriesSplit(n_splits=5)
Use Cases: Provides train/test indices to split time series data samples that are observed at fixed time intervals.

14. PredefinedSplit - How to Call:
PredefinedSplit(test_fold=array)
Use Cases: Provides train/test indices based on a predefined split supplied through the test_fold parameter.

15. train_test_split - How to Call:
train_test_split(X, y, test_size=0.25, random_state=0)
Use Cases: Splits arrays or matrices into random train and test subsets.

16. Chi2 - How to Call:
SelectKBest(score_func=chi2, k=2)
Use Cases: Select features according to the k highest scores of the chi-squared statistic.

17. f_classif - How to Call:
SelectKBest(score_func=f_classif, k=2)
Use Cases: Compute the ANOVA F-value for the provided sample.

18. mutual_info_classif - How to Call:
SelectKBest(score_func=mutual_info_classif, k=2)
Use Cases: Estimates mutual information for a discrete target variable.

SelectPercentile - How to Call:
SelectPercentile(percentile=10)
Use Cases: Select features according to a percentile of the highest scores.

SelectFpr - How to Call:
SelectFpr(alpha=0.05)
Use Cases: Select features based on a false positive rate test.
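A minimal sketch combining two entries above, chi2-based SelectKBest evaluated with the StratifiedShuffleSplit cross-validator; the iris dataset and MultinomialNB are stand-ins chosen because chi2 requires non-negative features.

# Minimal sketch (iris as a stand-in dataset): feature selection scored under cross-validation.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)   # non-negative features, as chi2 requires

pipe = Pipeline([("select", SelectKBest(score_func=chi2, k=2)), ("nb", MultinomialNB())])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv).mean())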