House DZ RC 158 ML Patterns 2023
By using patterns, machine learning (ML) practitioners can save time and resources by leveraging tried-and-true techniques that have been shown to work well. Anti-patterns, on the other hand, refer to common mistakes or pitfalls that can hinder the performance of ML models. This Refcard, comprising patterns and anti-patterns in ML, provides a set of guidelines that can help practitioners design and develop more effective models by leveraging successful techniques and avoiding common mistakes.

OVERVIEW OF MACHINE LEARNING
Machine learning and predictive analytics are two closely related fields that involve using data and statistical algorithms to make predictions or decisions. Machine learning is focused on learning patterns from data, while predictive analytics is focused on using data to make predictions about future events or behaviors. Machine learning algorithms learn patterns in data and use those patterns to make predictions or decisions. There are several types of ML algorithms, including supervised learning, unsupervised learning, and reinforcement learning:

• Supervised learning – The algorithm is trained on labeled data to learn a function that maps input to output.

When developing ML models, key challenges include data quality, reproducibility, data scalability, and catering to multiple objectives.

Data quality is a measure of data's accuracy, completeness, consistency, and timeliness:

• Data accuracy issues can be mitigated by understanding the source of the data and the potential errors in the data collection process.
• Data completeness can be achieved by ensuring that the training data contains a varied representation of each label.
• Data consistency can be achieved when the bias of each data collector can be eliminated.
• Timeliness can be ascertained by keeping a timestamp of when the event occurs and when it is added to the database.
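The completeness check described above — making sure the training data contains a varied representation of each label — can be sketched with a simple label count. The function name and the cutoff fraction are illustrative assumptions, not from any particular library:

```python
from collections import Counter

def check_label_balance(labels, min_fraction=0.1):
    """Flag labels that are under-represented in the training data.

    Returns the set of labels whose share of the dataset falls below
    min_fraction (an arbitrary threshold chosen for illustration).
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {label for label, n in counts.items() if n / total < min_fraction}

# "cat" appears in only 1 of 20 records (5%), below the 10% cutoff.
labels = ["dog"] * 13 + ["bird"] * 6 + ["cat"]
print(check_label_balance(labels))  # {'cat'}
```

A real pipeline would run a check like this on every training snapshot rather than once, so gaps in coverage are caught before retraining.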
For data scalability, ML engineers need to ensure the right infrastructure, like processors, for seamless training. Data scientists need to be served with the right infrastructure support for continued scoring of the models.

Lastly, multiple teams in an organization might have different objectives and expectations from a model, despite using the same one.

DATA REPRESENTATION
Data representation design patterns refer to common techniques and strategies for representing data in a way that is suitable for learning:

One-hot encoding
• Involves representing each category as a binary vector, where only one element is "on," and the rest are "off."

Text representation
• Represents text in various formats like bag-of-words, which represents each document as a frequency vector of individual words.

Which data representation technique to use depends on the type of data being used and the specific requirements of the learning algorithm being applied.
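Both representations can be sketched in a few lines of plain Python. The function names here are illustrative, not from any particular library:

```python
def one_hot(category, categories):
    """Binary vector: only the element for `category` is "on"."""
    return [1 if c == category else 0 for c in categories]

def bag_of_words(document, vocabulary):
    """Frequency vector: count of each vocabulary word in the document."""
    tokens = document.lower().split()
    return [tokens.count(word) for word in vocabulary]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]

vocab = ["the", "cat", "sat"]
print(bag_of_words("The cat sat on the mat", vocab))  # [2, 1, 1]
```

Libraries such as scikit-learn provide production-grade versions of both (for example, `OneHotEncoder` and `CountVectorizer`), which also handle unseen categories and large vocabularies.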
PATTERNS

Ensemble learning
• Reduces variance and improves the accuracy of the model.

Transfer learning
• Uses pre-trained models to improve the performance of a new ML model.
• Reduces the amount of data required to train a new model and improves its performance.

Deep learning architectures
• Use multiple layers to learn hierarchical representations of the data.
• Improve the performance and interpretability of the model by learning more complex features of the data.

Table 5

Monitoring and logging
• Essential for identifying and diagnosing problems in the system.
• Common monitoring techniques include health checks, metrics collection, and log aggregation.
• Improves the reliability and resilience of the system by providing real-time feedback on the system's performance and health.

Failover and redundancy
• Ensure that the system remains available in the event of failure.
• Common techniques include standby instances, automatic failover, and data replication.

Table 6
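The variance-reduction claim for ensemble learning in Table 5 can be demonstrated directly: averaging several independently noisy models yields predictions with lower variance than any single model. This is a toy simulation with synthetic predictors, not a full bagging implementation:

```python
import random
import statistics

random.seed(42)

TRUE_VALUE = 10.0

def noisy_model():
    """One 'model': the true value plus independent noise (std dev 1.0)."""
    return TRUE_VALUE + random.gauss(0, 1.0)

def ensemble(n_models):
    """Average the predictions of n independent noisy models."""
    return statistics.mean(noisy_model() for _ in range(n_models))

single = [noisy_model() for _ in range(2000)]
bagged = [ensemble(10) for _ in range(2000)]

# Averaging 10 independent models shrinks the variance roughly 10x.
print(round(statistics.variance(single), 2))
print(round(statistics.variance(bagged), 2))
```

In practice the gain is smaller because real ensemble members are correlated, which is why techniques like bagging deliberately train each member on a different bootstrap sample.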
Version control
• Tracks changes to code, data, and experiment configurations over time.
• Ensures that the results can be reproduced by others by providing a history of changes and allowing others to track the same versions of code and data used in the original experiment.

Containerization
• Packages an experiment and its dependencies into a self-contained environment that can be run on any machine.
• Ensures that the results can be reproduced by others by providing a consistent environment for running the experiment.

Documentation
• Essential for ensuring that the experiment can be understood and reproduced by others.
• Common practices include documenting the experiment's purpose, methodology, data sources, and analysis techniques.

Hyperparameter tuning
• The process of searching for the best set of hyperparameters for an ML model.
• Ensures that the results can be reproduced by others by providing a systematic and repeatable process for finding the best hyperparameters.

Code readability
• Essential for ensuring that the code used in the experiment can be understood and modified by others.
• Common practices include using descriptive variable names, adding comments and documentation, and following coding standards.

AVOIDING MLOPS MISTAKES
Common mistakes and pitfalls that can occur during the design and implementation of MLOps are listed in the following table:

Model drift
• Occurs when: The performance of an ML model deteriorates over time due to changes in the input data distribution.
• To avoid: Regularly monitor the performance of the model and retrain it as needed.

Lack of automation
• Occurs when: MLOps processes are not fully automated, leading to errors, inconsistencies, and delays.
• To avoid: Automate as much of the MLOps process as possible, including data preprocessing and model training, evaluation, and deployment.

Data bias
• Occurs when: The training data is biased, leading to biased or inaccurate models.
• To avoid: Carefully curate the training data to ensure that it represents the target population and the data has no unintentional bias.

Lack of documentation
• Occurs when: MLOps processes are not well-documented, leading to confusion and errors.
• To avoid: Document all aspects of the MLOps process, including data sources; preprocessing steps; and model training, evaluation, and deployment.

Poor model selection
• Occurs when: The wrong ML algorithm is selected for a given problem, leading to suboptimal performance.
• To avoid: Carefully evaluate different ML algorithms and select the one best suited for the given problem.

Overfitting
• Occurs when: The ML model is too complex and fits the training data too closely, leading to poor generalization performance on new data.
• To avoid: Regularize the model and use techniques such as cross-validation to ensure that the model generalizes well to new data.
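Two of the remedies above — a systematic, repeatable hyperparameter search and cross-validation as a guard against overfitting — can be combined in a few lines. This is a toy one-parameter ridge model on synthetic data, not a production tuning pipeline; all names and the candidate grid are illustrative:

```python
import random

random.seed(0)

# Toy data: y = 3x plus noise.
xs = [random.uniform(-1, 1) for _ in range(100)]
ys = [3 * x + random.gauss(0, 0.5) for x in xs]

def fit_ridge(xs, ys, lam):
    """Closed-form 1-D ridge fit: w = sum(x*y) / (sum(x^2) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def cv_error(xs, ys, lam, k=5):
    """k-fold cross-validated mean squared error for a given lambda."""
    n = len(xs)
    fold = n // k
    total = 0.0
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        train_x = xs[:lo] + xs[hi:]
        train_y = ys[:lo] + ys[hi:]
        w = fit_ridge(train_x, train_y, lam)
        total += sum((y - w * x) ** 2 for x, y in zip(xs[lo:hi], ys[lo:hi]))
    return total / (k * fold)

# Grid search: score every candidate the same way, keep the best.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(xs, ys, lam))
print(best_lam, round(fit_ridge(xs, ys, best_lam), 2))
```

Because the grid, the fold layout, and the seed are all fixed, rerunning the script reproduces the same chosen hyperparameter — which is the reproducibility point the pattern is making.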
PHANTOM MENACE
The term "Phantom Menace" comes from instances when differences between training and test data may not be immediately apparent during the development and evaluation phase, but it can become a problem when the model is deployed in the real world.

The training/serving skew occurs when the statistical properties of the training data are different from the distribution of the data that the model is exposed to during inference. This difference can result in poor performance when the model is deployed, even if it performs well during training. For example, if the training data for an image classification model consists mostly of daytime photos, but the model is later deployed to classify nighttime photos, the model may not perform well due to this mismatch in data distributions.

To mitigate training/serving skew, it is important to ensure that the training data is representative of the data that the model will encounter during inference and to monitor the model's performance in production to detect any performance degradation caused by distributional shift. Techniques like data augmentation, transfer learning, and model calibration can also help improve the model's ability to generalize to new data.

THE SENTINEL
The "Sentinel" anti-pattern is a technique used to validate models or data in an online environment before deploying them to production. It is a separate model or set of rules that is used to evaluate the performance of the primary model or data in a production environment. The purpose is to act as a "safety net" and prevent any incorrect or undesirable outputs from being released into the real world. It can detect issues such as data drift, concept drift, or performance degradation and provide alerts to the development team to investigate and resolve the issue before it causes harm.

The use of a sentinel can help mitigate risks associated with model or data degradation, concept drift, and other issues that can occur when deploying ML models in production. However, it is important to design the sentinel model carefully to ensure that it provides adequate protection without unnecessarily delaying the deployment of the primary model.

THE HULK
The "Hulk" anti-pattern is a technique where the entire model training, validation, and evaluation process is performed offline, and only the final output or prediction is published for use in a production environment. This approach is also sometimes referred to as offline precompute.

"Hulk" comes from the idea that the model is developed and tested in isolation, like the character Bruce Banner who becomes the Hulk when isolated from others.

Figure 2: The Hulk
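The distributional-shift monitoring that underlies both the "Phantom Menace" mitigation and a sentinel's drift checks can be approximated with a very simple statistic: compare the mean of a feature between the training sample and a recent production batch. The threshold here is an arbitrary illustration; real systems typically use tests such as Kolmogorov-Smirnov or the population stability index:

```python
import statistics

def mean_shift_alert(train_values, live_values, threshold=3.0):
    """Alert when the live mean drifts more than `threshold` training
    standard errors away from the training mean."""
    mu = statistics.mean(train_values)
    sd = statistics.stdev(train_values)
    se = sd / (len(live_values) ** 0.5)
    z = abs(statistics.mean(live_values) - mu) / se
    return z > threshold

train = [0.1 * i for i in range(100)]            # mean ~ 4.95
ok_batch = [0.1 * i + 0.05 for i in range(100)]  # nearly identical
drifted = [0.1 * i + 2.0 for i in range(100)]    # shifted by +2

print(mean_shift_alert(train, ok_batch))  # False
print(mean_shift_alert(train, drifted))   # True
```

A check this crude only sees shifts in the mean; monitoring several summary statistics (or full histograms) per feature catches a much wider class of skew.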
THE LUMBERJACK
The "Lumberjack" (also known as feature logging) anti-pattern
refers to a technique where features are logged online from within an
application, and the resulting logs are used to train ML models. Similar
to how lumberjacks cut down trees, process them into logs, and then
use the logs to build structures, in feature logging, the input data is
"cut down" into individual features that are then processed and used to
build a model, as shown in Figure 3.
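A minimal sketch of feature logging in the "Lumberjack" style: at serving time, write the exact features the model saw as one JSON line per request, then rebuild training rows from those logs later. The schema and function names are illustrative assumptions:

```python
import io
import json

def log_features(stream, request_id, features, prediction):
    """At serving time, append the features the model actually saw
    (plus its prediction) as one JSON line."""
    stream.write(json.dumps({
        "request_id": request_id,
        "features": features,
        "prediction": prediction,
    }) + "\n")

def load_training_rows(stream):
    """Read the logs back as (features, prediction) training rows."""
    return [(r["features"], r["prediction"])
            for r in map(json.loads, stream)]

# In-memory stand-in for a log file or log-aggregation sink.
log = io.StringIO()
log_features(log, "req-1", {"age": 34, "clicks": 7}, 0.81)
log_features(log, "req-2", {"age": 21, "clicks": 2}, 0.12)

log.seek(0)
rows = load_training_rows(log)
print(len(rows), rows[0][0]["age"])  # 2 34
```

Logging the features as computed at serving time, rather than recomputing them later from raw data, is what protects against the training/serving skew described under "Phantom Menace."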
It is important to evaluate the model's performance in a production environment and continuously monitor the data and model performance to detect and address any issues that may arise.

THE TIME MACHINE
The "Time Machine" anti-pattern is a technique where historical data is used to train a model, and the resulting model is then used to make predictions about future data (hence the name). This approach is also known as time-based modeling or temporal modeling.

To mitigate the risks associated with the Time Machine anti-pattern, it is important to carefully design the modeling process to capture changes in the underlying data over time and to validate the model's performance on recent data. This can include techniques such as using sliding windows, incorporating time-dependent features, and monitoring the model's performance over time.

TECHNIQUES TO DETECT MACHINE LEARNING ANTI-PATTERNS
The following techniques help to identify and mitigate common mistakes and pitfalls that can arise in the development and deployment of ML models:

Performance metrics
• ML models can be evaluated using a variety of performance metrics, including accuracy, precision, recall, F1 score, and AUC-ROC.
• Monitoring these metrics over time can help detect changes in model performance and identify anti-patterns like model drift and overfitting.

CONCLUSION
This Refcard on ML patterns and anti-patterns began by walking through an overview of ML models, including common challenges like data quality, reproducibility, data scalability, and catering to multiple objectives of the organization. Subsequently, it covered five key patterns, ways to avoid MLOps mistakes, five key anti-patterns, and techniques to detect ML anti-patterns.

Thus, the Refcard provides substantial knowledge and direction to ML engineers and data scientists, helping them stay cognizant of the patterns and anti-patterns in machine learning and take the necessary measures to avoid mistakes.