Northbay Summarizes Data Pre-Processing Algorithms
1. Missing Values
Description:
Missing values in the dataset lead to insufficient training of the model.
Potential Fixes:
Imputation techniques.
Use models that can handle missing data.
Removal of features with excessive missing values.
Data collection strategies to minimize missing data.

2. Overfitting
Description:
Model performs well on training data but poorly on unseen/test data.
Potential Fixes:
Increase regularization.
Reduce model complexity.
Prune the model (if applicable).

Noisy Data
Description:
Data contains a lot of irrelevant information or errors.
Potential Fixes:
Check for data quality issues.

Scalability
Description:
Issues related to the size of the data, speed of training, etc.
Potential Fixes:
Optimize algorithms.

6. Data Leakage
Description:
Model inadvertently gains information from outside its training dataset, often leading to overfitting.
Potential Fixes:
Careful feature selection.
Cross-validation.
Ensure separation of training and test datasets.
Scrutinize data preprocessing steps.

Poor Generalization
Description:
Model does not perform well on new, unseen data.
Potential Fixes:
Use more diverse training data.

8. High Bias/Low Variance
Description:
Model is too simple, making it unable to capture complexities in the data (related to underfitting).
Potential Fixes:
Reduce regularization.

9. High Variance/Low Bias
Description:
Model is too complex, fitting too closely to the training data (related to overfitting).
Potential Fixes:
Simplify the model.
Increase training data.
Apply regularization.

Concept Drift
Description:
Model performs poorly due to changing underlying relationships in the data over time.
Potential Fixes:
Apply techniques like windowing for time series data.

11. Imbalanced Dataset
Description:
The training dataset does not have a representative distribution of classes.
Potential Fixes:
Balance the dataset.
Implement algorithmic fairness techniques.
Increase training data.
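A minimal sketch of how two of the fixes above combine in practice, assuming scikit-learn and a made-up toy array; LogisticRegression merely stands in for any estimator. Fitting the imputer inside a Pipeline on the training split only is one concrete way to apply imputation while avoiding the data-leakage problem described above.

# Minimal sketch (toy data assumed): impute missing values inside a Pipeline so the
# imputer's statistics come from the training split only, avoiding data leakage.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan],
              [5.0, 5.0], [np.nan, 1.0], [6.0, 2.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),   # imputation technique for missing values
    ("model", LogisticRegression(C=0.5)),         # smaller C = stronger regularization
])
pipe.fit(X_train, y_train)                        # statistics learned from training data only
print(pipe.score(X_test, y_test))                 # evaluated on the held-out test set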
1. StandardScaler
How to Call:
StandardScaler()
Use Cases:
Scaling features to a mean of 0 and variance of 1. Useful when algorithms assume features to be on a similar scale, e.g., SVM, Neural Networks.

2. MinMaxScaler
How to Call:
MinMaxScaler()
Use Cases:
Scaling features to a given range, often [0, 1]. Useful when you need a bounded interval.

3. RobustScaler
How to Call:
RobustScaler()
Use Cases:
Scaling features based on the median and IQR. Useful for data with outliers.

4. OneHotEncoder
How to Call:
OneHotEncoder()
Use Cases:
Encoding categorical variables as binary vectors. Used when the categorical data isn't ordinal.

5. OrdinalEncoder
How to Call:
OrdinalEncoder()
Use Cases:
Encoding categorical variables as integer values. Useful for ordinal data.

6. LabelEncoder
How to Call:
LabelEncoder()
Use Cases:
Convert categories to integers. Often used for target variable encoding.

7. LabelBinarizer
How to Call:
LabelBinarizer()
Use Cases:
Converts multi-class labels to binary labels (one-vs-all).

8. Binarizer
How to Call:
Binarizer(threshold=0.5)
Use Cases:
Convert continuous data into binary form based on a threshold.

9. SimpleImputer
How to Call:
SimpleImputer(strategy='mean')
Use Cases:
Imputation transformer for completing missing values.

10. PolynomialFeatures
How to Call:
PolynomialFeatures(degree=2)
Use Cases:
Generating polynomial features. Useful for linear regression when the relationship isn't purely linear.

11. FunctionTransformer
How to Call:
FunctionTransformer(func)
Use Cases:
Constructs a transformer from an arbitrary callable. Useful for applying a simple transformation function.

12. PowerTransformer
How to Call:
PowerTransformer(method='yeo-johnson')
Use Cases:
Apply a power transform featurewise to stabilize variance and make the data more Gaussian-like.

13. QuantileTransformer
How to Call:
QuantileTransformer()
Use Cases:
Transform features using quantile information. Can spread out the most frequent values and reduce the impact of (marginal) outliers.

14. KBinsDiscretizer
How to Call:
KBinsDiscretizer(n_bins=5)
Use Cases:
Binning continuous data into intervals.
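A minimal sketch comparing the three most common scalers above on the same column; the toy array (with 100.0 acting as an outlier) is made up for illustration.

# Minimal sketch (toy data assumed): how the choice of scaler changes the result.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100.0 acts as an outlier

print(StandardScaler().fit_transform(X).ravel())  # mean 0, variance 1; pulled by the outlier
print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1]
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by the IQR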
Data Pre-Processing Algorithms
1. MaxAbsScaler
MaxAbsScaler()
Scales each feature by its maximum absolute value. This is meant for data that is already centered at zero without outliers.

2. Normalizer
Normalizer(norm='l2')
Normalizes samples individually to unit norm. This technique is useful when you want to consider the angle between feature vectors.

3. CategoricalImputer
CategoricalImputer()
Fills in missing values within categorical features using the most frequent value or a placeholder.

4. FeatureHasher
FeatureHasher(n_features=20)
Applies a hash function to the features to determine their column index in feature matrices. Useful for high-dimensional data.

5. MultiLabelBinarizer
How to Call:
MultiLabelBinarizer()
Use Cases:
Encodes collections of labels per sample as a binary indicator matrix, for multi-label problems.

6. ColumnTransformer
ColumnTransformer(transformers=[...])
Applies transformers to columns of arrays or pandas DataFrames. Allows different columns to be transformed differently.

7. DictVectorizer
DictVectorizer()
Transforms lists of feature-value mappings to vectors. Useful when feature extraction from text data results in a dictionary.

8. MissingIndicator
MissingIndicator()
Marks features with missing values. Often used in conjunction with an imputation technique.

9. OutputCodeClassifier
OutputCodeClassifier(estimator=...)
Error-correcting output-code strategy that represents each class with a binary code and fits one binary classifier per code bit. Useful for multiclass learning with binary classifiers.
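A minimal sketch of the ColumnTransformer entry above, applying a different transformer to numeric and categorical columns; the pandas DataFrame is a made-up example.

# Minimal sketch (toy DataFrame assumed): preprocess numeric and categorical columns differently.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000, 52000, 61000, 58000],
    "city": ["NY", "SF", "NY", "LA"],
})

pre = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),               # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # one-hot encode the categorical column
])
print(pre.fit_transform(df).shape)  # 2 scaled columns + 3 one-hot columns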
1. VarianceThreshold
VarianceThreshold()
Removes all features that have a variance below a certain threshold. It is useful for feature selection to remove non-informative features.

2. SelectKBest
SelectKBest(k=10)
Selects the top-k scoring features based on a chosen scoring function. It's commonly used to improve model performance by retaining only the most informative features.

3. SelectPercentile
SelectPercentile(percentile=10)
Selects features according to a percentile of the highest scores. Similar to SelectKBest but selects a percentage of features instead of a fixed number.

4. RFE
RFE(estimator=...)
Recursively removes the weakest features to improve model accuracy. Often used when the number of features is very high and reducing complexity is necessary.

5. KernelPCA
KernelPCA(n_components=2, kernel='rbf')
Non-linear dimensionality reduction through the use of kernels.

6. SparsePCA
SparsePCA(n_components=2)
Principal component analysis for sparse data, aiming to find a set of sparse components that can explain the variance in the data.

7. SelectFromModel
SelectFromModel(estimator=...)
Selects features based on importance weights provided by a fitted model. Useful when using tree-based estimators like RandomForest that can compute feature importances.

8. SequentialFeatureSelector
SequentialFeatureSelector(estimator=...)
Adds or removes features to form the best feature subset. It's a greedy procedure that adds or removes one feature at a time based on model performance.

9. Isomap
Isomap(n_components=2)
Non-linear dimensionality reduction through Isometric Mapping. It's particularly useful when the data lies on an embedded non-linear manifold.

10. LocallyLinearEmbedding
LocallyLinearEmbedding(n_components=2)
Seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It is useful for unwrapping twisted manifolds.
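A minimal sketch of two selectors from the list above, univariate SelectKBest and model-based SelectFromModel, applied to the same data; the iris dataset is used only as a convenient stand-in.

# Minimal sketch (iris as a stand-in dataset): compare univariate and model-based selection.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif, SelectFromModel
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

X_kbest = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)   # keep 2 best by ANOVA F-value
X_model = SelectFromModel(RandomForestClassifier(random_state=0)).fit_transform(X, y)  # keep features above mean importance
print(X.shape, X_kbest.shape, X_model.shape)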
1. TfidfVectorizer
How to Call:
TfidfVectorizer()
Use Cases:
Converts a collection of raw documents to a matrix of TF-IDF features. Ideal for text analysis.

2. CountVectorizer
How to Call:
CountVectorizer()
Use Cases:
Converts a collection of text documents to a matrix of token counts. This representation is used for text classification.

3. HashingVectorizer
How to Call:
HashingVectorizer(n_features=2**20)
Use Cases:
Converts a collection of text documents to a matrix of token occurrences using the hashing trick.

4. NMF (Non-Negative Matrix Factorization)
How to Call:
NMF(n_components=2)
Use Cases:
Factorization method to discover hidden topics or concepts within the data, often used in text data.

5. LatentDirichletAllocation
How to Call:
LatentDirichletAllocation(n_components=10)
Use Cases:
Topic modeling technique that assigns topics to documents and words to topics.

6. AdditiveChi2Sampler
How to Call:
AdditiveChi2Sampler()
Use Cases:
Approximates the feature map of the additive chi-squared kernel, useful for non-linear classification on histogram-like features.

7. KernelCenterer
How to Call:
KernelCenterer()
Use Cases:
Centers a kernel matrix, especially useful in Kernel Principal Component Analysis.

8. Normalizer
How to Call:
Normalizer(norm='l2')
Use Cases:
Normalizes individual samples to have unit norm. Useful in text classification when using cosine similarity.

9. LabelSpreading
How to Call:
LabelSpreading()
Use Cases:
Semi-supervised learning technique that spreads label information from labeled to unlabeled data points.

10. LabelPropagation
How to Call:
LabelPropagation()
Use Cases:
Another semi-supervised learning technique that infers labels for unlabeled data points.

11. NearestNeighbors
How to Call:
NearestNeighbors(n_neighbors=3)
Use Cases:
Unsupervised learner for implementing neighbor searches, used in clustering, classification, and regression.

12. RadiusNeighborsTransformer
How to Call:
RadiusNeighborsTransformer(radius=1.0)
Use Cases:
Transform data to a matrix of distances to all neighbors within a given radius.

13. KNeighborsTransformer
How to Call:
KNeighborsTransformer(n_neighbors=5)
Use Cases:
Transform data to a matrix of distances to the nearest neighbors.

14. LocalOutlierFactor
How to Call:
LocalOutlierFactor()
Use Cases:
Unsupervised outlier detection using local density estimation.

15. TSNE
How to Call:
TSNE(n_components=2)
Use Cases:
T-distributed Stochastic Neighbor Embedding. Non-linear dimensionality reduction, ideal for visualization of high-dimensional datasets.

16. UMAP
How to Call:
UMAP(n_components=2)
Use Cases:
Uniform Manifold Approximation and Projection. A non-linear dimensionality reduction technique often used for visualization.

17. Binarizer
How to Call:
Binarizer(threshold=0.0)
Use Cases:
Convert numerical features into boolean values based on a threshold.

18. MaxAbsScaler
How to Call:
MaxAbsScaler()
Use Cases:
Scale each feature by its maximum absolute value, useful for data that is already centered at zero.
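A minimal sketch of the TfidfVectorizer entry above; the tiny corpus is made up purely for illustration, and the same pattern applies to CountVectorizer and HashingVectorizer.

# Minimal sketch (toy corpus assumed): turn raw documents into TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "data preprocessing improves model quality",
    "scaling and encoding are preprocessing steps",
    "topic models work on text data",
]
vec = TfidfVectorizer()            # tokenize, count, then apply TF-IDF weighting
X = vec.fit_transform(corpus)      # sparse matrix: documents x vocabulary
print(X.shape, len(vec.vocabulary_))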
Data Pre-Processing Algorithms
1. IncrementalPCA
How to Call:
IncrementalPCA(n_components=2)
Use Cases:
Incremental principal component analysis is useful for large datasets that cannot fit in memory.

2. FactorAnalysis
How to Call:
FactorAnalysis(n_components=2)
Use Cases:
A method to model observed variables and their underlying latent factors.

3. FastICA
How to Call:
FastICA(n_components=2)
Use Cases:
Fast Independent Component Analysis for signal separation and feature extraction.

4. MDS
How to Call:
MDS(n_components=2)
Use Cases:
A technique used for analyzing similarity or dissimilarity data; helps to visualize high-dimensional data.

5. IsolationForest
How to Call:
IsolationForest()
Use Cases:
An ensemble algorithm for anomaly detection that isolates outliers instead of profiling normal data points.

6. SelectFpr (False Positive Rate test)
How to Call:
SelectFpr(alpha=0.05)
Use Cases:
Filter method to select features based on a false positive rate test.

7. SelectFdr (False Discovery Rate test)
How to Call:
SelectFdr(alpha=0.05)
Use Cases:
Feature selection technique that controls the false discovery rate.

8. SelectFwe (Family-wise Error rate)
How to Call:
SelectFwe(alpha=0.05)
Use Cases:
Selects features based on family-wise error rate, often used in hypothesis testing.

9. GenericUnivariateSelect
How to Call:
GenericUnivariateSelect(mode='fpr', param=0.05)
Use Cases:
Allows performing univariate feature selection with a configurable strategy.

10. FeatureAgglomeration
How to Call:
FeatureAgglomeration()
Use Cases:
Hierarchical clustering to group together similar features.

11. SpectralEmbedding
How to Call:
SpectralEmbedding(n_components=2)
Use Cases:
Uses spectral decomposition to reduce dimensionality, useful in clustering tasks.

12. DictionaryLearning
How to Call:
DictionaryLearning(n_components=2)
Use Cases:
An unsupervised method for dictionary learning and feature extraction.

13. MiniBatchDictionaryLearning
How to Call:
MiniBatchDictionaryLearning(n_components=2)
Use Cases:
A faster version of DictionaryLearning suitable for large datasets.

14. MiniBatchSparsePCA
How to Call:
MiniBatchSparsePCA(n_components=2)
Use Cases:
A scalable version of SparsePCA that uses a mini-batch approach.

15. Nystroem
How to Call:
Nystroem(kernel='rbf', n_components=2)
Use Cases:
An efficient method to approximate a kernel map for large-scale datasets.

16. RBFSampler
How to Call:
RBFSampler(gamma=1.0, n_components=100)
Use Cases:
Approximates the feature map of an RBF kernel by Monte Carlo approximation of its Fourier transform.

17. SkewedChi2Sampler
How to Call:
SkewedChi2Sampler()
Use Cases:
Reduces skewness in data by using the Chi-squared kernel.

18. QuantileNormalizer
How to Call:
QuantileNormalizer()
Use Cases:
Normalizes features using quantile information to follow a standard normal distribution.

19. KBinsDiscretizer
How to Call:
KBinsDiscretizer(n_bins=5)
Use Cases:
Discretizes continuous features into discrete bins.

20. KMeans
How to Call:
KMeans(n_clusters=8)
Use Cases:
Partitioning n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
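A minimal sketch of the IsolationForest entry above, flagging points that are easy to isolate as outliers; the synthetic blob data and the two planted extreme points are made up for illustration.

# Minimal sketch (synthetic data assumed): unsupervised anomaly detection with IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),       # dense cluster of normal points
               np.array([[8.0, 8.0], [-9.0, 7.5]])])  # two obvious outliers

iso = IsolationForest(random_state=0).fit(X)
labels = iso.predict(X)                                # 1 = inlier, -1 = outlier
print(int((labels == -1).sum()), "points flagged as outliers")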
Data Pre-Processing Algorithms
Birch
An online-learning algorithm for clustering that builds a tree called the Clustering Feature Tree (CFT).

DBSCAN
A density-based clustering algorithm that groups together points that are closely packed together.

OPTICS
Clustering algorithm similar to DBSCAN but with the ability to find clusters of varying densities.

AffinityPropagation
A clustering algorithm that sends messages between pairs of samples until convergence.

AgglomerativeClustering
AgglomerativeClustering(n_clusters=2)
A hierarchical clustering method using a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

7. PolynomialCountSketch
How to Call:
PolynomialCountSketch(degree=2, n_components=100)
Use Cases:
Approximates a feature map of an arbitrary polynomial kernel by a fast, sparse projection.

8. ExtraTreesClassifier
How to Call:
ExtraTreesClassifier(n_estimators=100, random_state=0)
Use Cases:
An ensemble learning method fundamentally similar to a random forest, but it selects tree splits in a more random manner.

9. GradientBoostingClassifier
How to Call:
GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
Use Cases:
A machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models.

10. AdaBoostClassifier
How to Call:
AdaBoostClassifier(n_estimators=100, random_state=0)
Use Cases:
A boosting ensemble meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

11. BaggingClassifier
How to Call:
BaggingClassifier(base_estimator=SVC(), n_estimators=10, random_state=0)
Use Cases:
An ensemble meta-estimator that fits base classifiers each on random subsets of the original dataset and then aggregates their individual predictions to form a final prediction.

12. HistGradientBoostingClassifier
How to Call:
HistGradientBoostingClassifier(max_iter=100)
Use Cases:
A histogram-based Gradient Boosting Classification Tree designed for speed, which can handle categorical data and naturally deals with missing values.

13. CalibratedClassifierCV
How to Call:
CalibratedClassifierCV(base_estimator=SVC(), method='sigmoid', cv=5)
Use Cases:
Probability calibration with isotonic regression or logistic regression on classifier output.

14. CategoricalNB
How to Call:
CategoricalNB()
Use Cases:
Naive Bayes classifier for categorical features, particularly suited for features that are discretely distributed.

CheckingClassifier
CheckingClassifier()
A classifier for sanity checking or debugging purposes that does not learn from input data and only performs checks or returns fixed predictions.

ClassifierChain
A multi-label model that arranges binary classifiers into a chain where each classifier deals with the label predicted by its predecessor.

DecisionTreeClassifier
A non-parametric supervised learning method used for classification and regression that models decisions and their possible consequences as a tree.

DummyClassifier
A classifier that makes predictions using simple rules, which can be useful as a baseline for comparison with actual classifiers.

EllipticEnvelope
An outlier detection algorithm that fits a robust covariance estimate to the data, and thus fits an ellipse to the central data points, ignoring points outside the central mode.
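A minimal sketch of the HistGradientBoostingClassifier entry above, which accepts NaN values directly so no separate imputation step is needed; the tiny arrays (and the lowered min_samples_leaf) are made up just to show the call pattern.

# Minimal sketch (toy data assumed): gradient boosting that handles missing values natively.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 4.0],
              [5.0, 5.0], [np.nan, 1.0], [6.0, 2.0], [3.0, 4.0]])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

clf = HistGradientBoostingClassifier(max_iter=100, min_samples_leaf=2)  # leaf size lowered for the tiny sample
clf.fit(X, y)                                                           # NaNs handled natively
print(clf.predict([[np.nan, 2.5]]))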
Data Pre-Processing Algorithms
1. ExtraTreesRegressor
How to Call:
ExtraTreesRegressor(n_estimators=100, random_state=0)
Use Cases:
An ensemble learning method for regression that fits a number of randomized decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting.

2. GradientBoostingRegressor
How to Call:
GradientBoostingRegressor(n_estimators=100, random_state=0)
Use Cases:
A machine learning technique for regression that builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions.

3. RandomForestClassifier
How to Call:
RandomForestClassifier(n_estimators=100, random_state=0)
Use Cases:
A meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

4. RandomForestRegressor
How to Call:
RandomForestRegressor(n_estimators=100, random_state=0)
Use Cases:
A meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting for regression tasks.

5. RidgeClassifier
How to Call:
RidgeClassifier(alpha=1.0)
Use Cases:
Classifier that uses ridge regression to classify multi-class data.

6. RidgeClassifierCV
How to Call:
RidgeClassifierCV()
Use Cases:
Ridge classifier with built-in cross-validation over the regularization strength.

7. SGDClassifier
How to Call:
SGDClassifier()
Use Cases:
Linear classifiers (SVM, logistic regression, a.o.) with stochastic gradient descent (SGD) training.

8. SGDRegressor
How to Call:
SGDRegressor()
Use Cases:
Linear regression model fitted by minimizing a regularized loss with stochastic gradient descent.

NuSVR
How to Call:
NuSVR(nu=0.5, C=1.0, kernel='rbf')
Use Cases:
Nu-Support Vector Regression. Similar to SVR but uses a parameter nu to control the number of support vectors.

LinearSVR
How to Call:
LinearSVR(epsilon=0.0, tol=1e-4)
Use Cases:
Linear Support Vector Regression. The free parameters in the model are C and epsilon.

NearestCentroid
How to Call:
NearestCentroid()
Use Cases:
Nearest centroid classifier. Each class is represented by its centroid, with test samples classified to the class with the nearest centroid.

14. PassiveAggressiveRegressor
How to Call:
PassiveAggressiveRegressor(max_iter=1000, random_state=0)
Use Cases:
Passive Aggressive algorithms for regression. Passive Aggressive algorithms are a family of algorithms for large-scale learning that are similar to the Perceptron in that they do not require a learning rate.

15. RANSACRegressor
How to Call:
RANSACRegressor(min_samples=2, max_trials=100, loss='absolute_loss')
Use Cases:
RANSAC (RANdom SAmple Consensus) iteratively fits a model on random subsets of the data, making the regression robust to outliers.

16. TheilSenRegressor
How to Call:
TheilSenRegressor(random_state=0)
Use Cases:
Theil-Sen Estimator: robust multivariate regression model.

17. HuberRegressor
How to Call:
HuberRegressor(max_iter=100, epsilon=1.35)
Use Cases:
Linear regression model that is robust to outliers.

18. QuantileRegressor
How to Call:
QuantileRegressor(quantile=0.5, alpha=0.01)
Use Cases:
Linear regression model that predicts a conditional quantile (here the median) instead of the mean.

19. PoissonRegressor
How to Call:
PoissonRegressor()
Use Cases:
Generalized Linear Model with a Poisson distribution.
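A minimal sketch of the robust-regression idea behind the HuberRegressor entry above; the synthetic line with one corrupted target is made up for illustration, and LinearRegression is used only as a non-robust baseline for comparison.

# Minimal sketch (synthetic data assumed): robust vs. ordinary least squares with one outlier.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.RandomState(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=50)
y[-1] = 150.0                                         # one corrupted target value

print(LinearRegression().fit(X, y).coef_)             # slope dragged upward by the outlier
print(HuberRegressor(epsilon=1.35).fit(X, y).coef_)   # slope stays close to 3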
Data Pre-Processing Algorithms
VotingRegressor - How to Call:
VotingRegressor(estimators=[...])
Use Cases: A regressor that fits multiple regressors and averages their predictions. Helpful for reducing variance.

StackingClassifier - How to Call:
StackingClassifier(estimators=[...])
Use Cases: A classifier that stacks the output of individual estimators and uses a classifier to compute the final prediction.

SelectFromModel - How to Call:
SelectFromModel(estimator=...)
Use Cases: Meta-transformer for selecting features based on importance weights from a model, such as Lasso.

IterativeImputer - How to Call:
IterativeImputer()
Use Cases: Multivariate imputer that estimates each feature from all the others through a specified estimator.

SimpleImputer - How to Call:
SimpleImputer(strategy='mean')
Use Cases: Imputation transformer for completing missing values in datasets.

15. GridSearchCV - How to Call:
GridSearchCV(estimator=..., param_grid=..., cv=5)
Use Cases: Exhaustive search over a specified parameter grid to find the best hyperparameters for an estimator.

14. GroupKFold - How to Call:
GroupKFold(n_splits=5)
Use Cases: Ensures that the same group is not represented in both testing and training sets.

StratifiedKFold - How to Call:
StratifiedKFold(n_splits=5)
Use Cases: Stratified K-Folds cross-validator providing train/test indices to split data.

TimeSeriesSplit - How to Call:
TimeSeriesSplit(n_splits=5)
Use Cases: Cross-validator for time series data.
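A minimal sketch of the GridSearchCV entry above, wrapped around a preprocessing-plus-model Pipeline so that scaling is re-fit inside every cross-validation split; the iris dataset, Pipeline, and the small C grid are stand-ins chosen for illustration.

# Minimal sketch (iris as a stand-in dataset): hyperparameter search over a Pipeline.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC())])
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.1, 1, 10]}, cv=5)  # 5-fold CV per candidate
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))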
Data Pre-Processing Algorithms
cross_val_predict - How to Call:
cross_val_predict(estimator, X, y, cv=5)
Use Cases: Generates cross-validated estimates for each input data point.

permutation_importance - How to Call:
permutation_importance(estimator, X, y)
Use Cases: Assessment of the importance of different features via permutation.

validation_curve - How to Call:
validation_curve(estimator, X, y, param_name=..., param_range=...)
Use Cases: Determines training and test scores for varying parameter values.

5. ShuffleSplit - How to Call:
ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
Use Cases: Random permutation cross-validator that yields a chosen number of independent train/test splits.

6. GroupShuffleSplit - How to Call:
GroupShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
Use Cases: Ensures that the same group is not represented in both testing and training sets.

7. StratifiedShuffleSplit - How to Call:
StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
Use Cases: Provides train/test indices to split data in train/test sets while preserving the percentage of samples for each class.

8. LeaveOneOut - How to Call:
LeaveOneOut()
Use Cases: Provides train/test indices to split data in train/test sets where each sample is used once as a test set (singleton).

9. LeavePOut - How to Call:
LeavePOut(p=2)
Use Cases: Similar to LeaveOneOut, but leaves P samples out.

10. LeaveOneGroupOut - How to Call:
LeaveOneGroupOut()
Use Cases: Provides train/test indices to split data according to a third-party provided group.

11. LeavePGroupsOut - How to Call:
LeavePGroupsOut(n_groups=2)
Use Cases: Leaves P groups out, and the rest of the data is used as a training set.

12. GroupKFold - How to Call:
GroupKFold(n_splits=5)
Use Cases: Ensures that the same group is not in both testing and training sets.

13. TimeSeriesSplit - How to Call:
TimeSeriesSplit(n_splits=5)
Use Cases: Provides train/test indices to split time series data samples that are observed at fixed time intervals.

14. PredefinedSplit - How to Call:
PredefinedSplit(test_fold=array)
Use Cases: Provides train/test indices based on a predefined split supplied through the test_fold parameter.

15. train_test_split - How to Call:
train_test_split(X, y, test_size=0.25, random_state=0)
Use Cases: Splits arrays or matrices into random train and test subsets.

16. Chi2 - How to Call:
SelectKBest(score_func=chi2, k=2)
Use Cases: Select features according to the k highest scores of the chi-squared statistic.

17. f_classif - How to Call:
SelectKBest(score_func=f_classif, k=2)
Use Cases: Compute the ANOVA F-value for the provided sample.

18. mutual_info_classif - How to Call:
SelectKBest(score_func=mutual_info_classif, k=2)
Use Cases: Estimates mutual information for a discrete target variable.

SelectPercentile - How to Call:
SelectPercentile(percentile=10)
Use Cases: Select features according to a percentile of the highest scores.

SelectFpr - How to Call:
SelectFpr(alpha=0.05)
Use Cases: Select features based on a false positive rate test.
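A minimal sketch combining two entries above, chi2-based SelectKBest evaluated with the StratifiedShuffleSplit cross-validator; the iris dataset and MultinomialNB are stand-ins chosen because chi2 requires non-negative features.

# Minimal sketch (iris as a stand-in dataset): feature selection scored under cross-validation.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

X, y = load_iris(return_X_y=True)   # non-negative features, as chi2 requires

pipe = Pipeline([("select", SelectKBest(score_func=chi2, k=2)), ("nb", MultinomialNB())])
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv).mean())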