House DZ RC 158 ML Patterns 2023


158

Machine Learning Patterns and Anti-Patterns

CONTENTS

•  Overview of Machine Learning
•  Patterns in Machine Learning
•  Anti-Patterns in Machine Learning
•  Conclusion

DR. TUHIN CHATTOPADHYAY


PROFESSOR OF AI & BLOCKCHAIN, JAGDISH SHETH SCHOOL OF MANAGEMENT

By using patterns, machine learning (ML) practitioners can save time
and resources by leveraging tried and true techniques that have been
shown to work well. Anti-patterns, on the other hand, refer to common
mistakes or pitfalls that can hinder the performance of ML models.
This Refcard, comprising patterns and anti-patterns in ML, provides a
set of guidelines that can help practitioners design and develop more
effective models by leveraging successful techniques and avoiding
common mistakes.

OVERVIEW OF MACHINE LEARNING
Machine learning and predictive analytics are two closely related fields
that involve using data and statistical algorithms to make predictions
or decisions. Machine learning algorithms learn patterns in data and
use those patterns to make predictions or decisions. There are several
types of ML algorithms, including supervised learning, unsupervised
learning, and reinforcement learning:

•  Supervised learning – The algorithm is trained on labeled data
to learn a function that maps input to output.

•  Unsupervised learning – The algorithm tries to find patterns in
unlabeled data without any predefined output variables.

•  Reinforcement learning – The algorithm learns by trial and error
in an environment to maximize a reward function.

Common ML algorithms include linear regression, logistic regression,
decision trees, random forests, and neural networks.

Predictive analytics can be used in a wide range of industries and use
cases, including tasks such as forecasting sales, predicting customer
behavior, detecting fraud, and identifying at-risk patients. Both ML and
predictive analytics rely heavily on data and statistical techniques to
make predictions or decisions. While there is some overlap between the
two fields, ML is focused on developing algorithms that can learn
from data, while predictive analytics is focused on using data to make
predictions about future events or behaviors.

When developing ML models, key challenges include data quality,
reproducibility, data scalability, and catering to multiple objectives.

Data quality is a measure of data's accuracy, completeness,
consistency, and timeliness:

•  Data accuracy can be improved by understanding the source of
the data and the potential errors in the data collection process.

•  Data completeness can be achieved by ensuring that the
training data contains a varied representation of each label.

•  Data consistency can be achieved when the bias of each data
collector is eliminated.

•  Timeliness can be ascertained by recording timestamps for
when the event occurs and when it is added to the database.
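
As a rough illustration of the data quality checks on completeness and timeliness, the sketch below uses only the Python standard library. The record layout, the field names (`label`, `length`, `event_time`, `ingest_time`), and the helper functions are assumptions made for this example; they are not part of the Refcard.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: a label, a feature that may be missing, the time the
# event occurred, and the time it was added to the database.
records = [
    {"label": "spam", "length": 120,
     "event_time": datetime(2023, 12, 1, 9, 0),
     "ingest_time": datetime(2023, 12, 1, 9, 5)},
    {"label": "ham", "length": None,
     "event_time": datetime(2023, 12, 1, 10, 0),
     "ingest_time": datetime(2023, 12, 1, 12, 0)},
    {"label": "ham", "length": 80,
     "event_time": datetime(2023, 12, 2, 8, 0),
     "ingest_time": datetime(2023, 12, 2, 8, 1)},
    {"label": "spam", "length": 45,
     "event_time": datetime(2023, 12, 2, 9, 30),
     "ingest_time": datetime(2023, 12, 2, 9, 40)},
]


def field_completeness(rows, field):
    """Fraction of rows in which `field` is present and not None."""
    return sum(1 for r in rows if r.get(field) is not None) / len(rows)


def label_counts(rows):
    """Number of rows per label, as a check on label representation."""
    return Counter(r["label"] for r in rows)


def max_ingest_lag(rows):
    """Longest delay between an event occurring and being stored."""
    return max(r["ingest_time"] - r["event_time"] for r in rows)


print(field_completeness(records, "length"))
print(label_counts(records))
print(max_ingest_lag(records))
```

Checks like these are cheap to run before every training job. Accuracy and consistency are harder to automate because they depend on knowing how the data was collected.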

© DZONE | REFCARD | DECEMBER 2023 1


REFCARD | MACHINE LEARNING PATTERNS AND ANTI-PATTERNS

Reproducibility is a common challenge in ML, as the ML model weights
are initialized with random values during training. Thus, the same model
code with the same training data may produce slightly different results
from one training run to the next. Since the models run in a dynamic
business environment, it's critical to keep the ML models relevant by
regularly updating the variables and guarding against data drift.

The challenge of data scalability needs to be addressed during
data collection and preprocessing, training, and serving. First, data
engineers need to build data pipelines that can scale to handle big data,
and then ML engineers need to ensure that the right infrastructure, such
as processors, is in place for seamless training. Finally, data scientists
need the right infrastructure support for continued scoring of the models.

Lastly, multiple teams in an organization might have different
objectives and expectations from a model, despite using the same one.

PATTERNS IN MACHINE LEARNING
Design patterns provide a set of proven solutions to common problems
that arise during the design and implementation of ML systems. They
provide a systematic approach to designing and building ML systems,
which can lead to more robust and scalable systems that are easier to
maintain and update. An ML pattern is a technique, process, or design
that has been observed to work well for a given problem or task. ML
patterns can help guide the development of new models, as well as
provide a framework for understanding how existing models work.
In this section, we'll cover patterns for data representation, problem
representation, model training, resilient serving, and reproducibility.

DATA REPRESENTATION
Data representation design patterns refer to common techniques and
strategies for representing data in a way that is suitable for learning
algorithms to process. The design patterns help transform raw input
data into a form that can be more easily analyzed and understood by
ML models.

See Table 1.

Table 1

PATTERNS

Feature scaling
•  Scales input features to a common range, such as between 0 and 1,
to avoid large discrepancies in feature magnitudes.
•  Helps to improve the convergence rate and accuracy of learning
algorithms.

One-hot encoding
•  Represents categorical variables in a numerical format.
•  Involves representing each category as a binary vector, where only
one element is "on," and the rest are "off."

Text representation
•  Represents text in various formats, such as bag-of-words, which
represents each document as a frequency vector of individual words.
•  Other techniques: term frequency-inverse document frequency
(TF-IDF) and word embeddings.

Time series representation
•  Represents time series using techniques like sliding windows, which
divide the series into overlapping windows and represent each window
as a feature vector.

Image representation
•  Represents images in various formats, such as pixel values, color
histograms, or convolutional neural network (CNN) features.

These data representation design patterns are used to transform raw
data into a form that is suitable for learning algorithms. The choice of
which data representation technique to use depends on the type of
data being used and the specific requirements of the learning algorithm
being applied.

PROBLEM REPRESENTATION
These patterns are common strategies and techniques used to represent
a problem effectively in a way that can be solved by an ML model.

Table 2

PATTERNS

Feature engineering
•  Selects and transforms raw data into features that can be used by
an ML model.

Dimensionality reduction
•  Reduces the number of features in the dataset.

Resampling
•  Balances the class distribution in the dataset.
•  Helps improve the performance of the model when there is an
imbalance in the class distribution.

The choice of problem representation design pattern depends on the
specific requirements of the problem, such as the type of data, the size
of the dataset, and the available computing resources.
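
Three of the patterns in Tables 1 and 2 (feature scaling, one-hot encoding, and resampling) can be sketched in a few lines of plain Python. The function names and the toy inputs below are assumptions for illustration. Random oversampling is only one of several ways to resample, and a production pipeline would more likely rely on an established library than on hand-written helpers.

```python
import random


def min_max_scale(values):
    """Feature scaling: map values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """One-hot encoding: one binary vector per value, with a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]


def random_oversample(rows, labels, seed=0):
    """Resampling: duplicate minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())
    out_rows, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


scaled = min_max_scale([18, 30, 60])
encoded = one_hot(["red", "green", "red"])
rows, labels = random_oversample([[1], [2], [3], [4]], ["a", "a", "a", "b"])
print(scaled, encoded, labels)
```

Scaling parameters and category lists should be computed on the training data only and then reused at serving time, so that both stages see the same representation.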


MODEL TRAINING
Model training patterns are common strategies and techniques used
to design and train ML models effectively. These design patterns are
intended to improve the performance, scalability, and interpretability
of ML models, as well as to reduce the risk of overfitting or underfitting.

Table 3

PATTERNS

Cross-validation
•  Assesses the performance of an ML model by partitioning the data
into training and validation sets.
•  Helps detect overfitting and check that the model can generalize
to new data.

Regularization
•  Reduces overfitting by adding a penalty term to the loss function
of the ML model.
•  Discourages the model from memorizing the training data so that it
generalizes better to new data.

Ensemble methods
•  Combine multiple ML models to improve their performance.
•  Reduce variance and improve the accuracy of the model.

Transfer learning
•  Uses pre-trained models to improve the performance of a new
ML model.
•  Reduces the amount of data required to train a new model and
improves its performance.

Deep learning architectures
•  Use multiple layers to learn hierarchical representations of the data.
•  Improve the performance of the model by learning more complex
features of the data.

RESILIENT SERVING
These patterns are common strategies and techniques for deploying
ML models in production and ensuring that they are reliable, scalable,
and resilient to failures. Resilient serving is essential for building
production-grade ML systems that can handle large volumes of traffic
and provide accurate predictions in real time.

See Table 4.

Table 4

PATTERNS

Model serving architecture
•  Overall design of the system that serves the ML model.
•  Common architectures include microservices, serverless, and
containerized deployments.
•  Choice of architecture often depends on the specific requirements
of the system, such as scalability, reliability, and cost.

Load balancing
•  Distributes incoming requests across multiple instances of the
ML model.
•  Improves the performance and reliability of the system by
distributing the workload evenly and avoiding overloading any
single instance.

Caching
•  Stores frequently accessed data in memory or on disk to reduce the
response time of the system.
•  Improves performance and scalability of the system by reducing the
number of requests that need to be processed by the ML model.

Monitoring and logging
•  Essential for identifying and diagnosing problems in the system.
•  Common monitoring techniques include health checks, metrics
collection, and log aggregation.
•  Improve the reliability and resilience of the system by providing
real-time feedback on the system's performance and health.

Failover and redundancy
•  Ensure that the system remains available in the event of failure.
•  Common techniques include standby instances, automatic failover,
and data replication.
•  Improve the resilience and reliability of the system by ensuring
that the system can continue to serve requests, even in the event of
a failure.

Design pattern choice often depends on the specific requirements of
the system, such as performance, reliability, scalability, and cost.

REPRODUCIBILITY
Reproducibility design patterns are a set of practices and techniques
used to ensure that the results of a machine learning experiment can be
reproduced by others. Reproducibility is essential for building trust in
ML research and ensuring that the results can be used in practice.

See Table 5.
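
One small, concrete piece of reproducibility is controlling the random initialization mentioned in the overview. The sketch below is a minimal illustration using Python's standard `random` module. The `init_weights` helper and its value range are hypothetical, and real frameworks expose their own seeding mechanisms, which would need to be set as well.

```python
import random


def init_weights(n, seed=None):
    """Draw n initial weights uniformly from [-0.1, 0.1].

    A private Random instance keeps the draw independent of global state.
    With seed=None the generator is seeded from the operating system or
    the clock, so repeated runs differ.
    """
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


run_a = init_weights(3, seed=42)
run_b = init_weights(3, seed=42)  # same seed: identical starting weights
run_c = init_weights(3, seed=7)   # different seed: different starting weights

print(run_a == run_b, run_a == run_c)
```

Recording the seed alongside the code version and data version lets someone else rerun the same experiment from the same starting point.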


Table 5

PATTERNS

Version control
•  Tracks changes to code, data, and experiment configurations over time.
•  Ensures that the results can be reproduced by others by providing a
history of changes and allowing others to track the same versions of
code and data used in the original experiment.

Containerization
•  Packages an experiment and its dependencies into a self-contained
environment that can be run on any machine.
•  Ensures that the results can be reproduced by others by providing a
consistent environment for running the experiment.

Documentation
•  Essential for ensuring that the experiment can be understood and
reproduced by others.
•  Common practices include documenting the experiment's purpose,
methodology, data sources, and analysis techniques.

Hyperparameter tuning
•  The process of searching for the best set of hyperparameters for
an ML model.
•  Ensures that the results can be reproduced by others by providing a
systematic and repeatable process for finding the best hyperparameters.

Code readability
•  Essential for ensuring that the code used in the experiment can be
understood and modified by others.
•  Common practices include using descriptive variable names, adding
comments and documentation, and following coding standards.

AVOIDING MLOPS MISTAKES
Common mistakes and pitfalls that can occur during the design and
implementation of MLOps are listed in the following table:

Table 6

MLOPS CHALLENGES

Model drift
•  Occurs when: The performance of an ML model deteriorates over time
due to changes in the input data distribution.
•  To avoid: Regularly monitor the performance of the model and
retrain it as needed.

Lack of automation
•  Occurs when: MLOps processes are not fully automated, leading to
errors, inconsistencies, and delays.
•  To avoid: Automate as much of the MLOps process as possible,
including data preprocessing and model training, evaluation, and
deployment.

Data bias
•  Occurs when: The training data is biased, leading to biased or
inaccurate models.
•  To avoid: Carefully curate the training data to ensure that it
represents the target population and the data has no unintentional bias.

Lack of documentation
•  Occurs when: MLOps processes are not well documented, leading to
confusion and errors.
•  To avoid: Document all aspects of the MLOps process, including data
sources; preprocessing steps; and model training, evaluation, and
deployment.

Poor model selection
•  Occurs when: The wrong ML algorithm is selected for a given problem,
leading to suboptimal performance.
•  To avoid: Carefully evaluate different ML algorithms and select the
one best suited for the given problem.

Overfitting
•  Occurs when: The ML model is too complex and fits the training data
too closely, leading to poor generalization performance on new data.
•  To avoid: Regularize the model and use techniques such as
cross-validation to ensure that the model generalizes well to new data.

By avoiding these MLOps mistakes and pitfalls, ML engineers can build
more robust, scalable, and accurate ML systems that deliver value to
the business.
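
Table 6 recommends cross-validation as a guard against overfitting. A minimal k-fold sketch in plain Python follows. The toy "model" (predicting the mean of the training targets), the data, and the function names are assumptions chosen to keep the example self-contained, and real data would normally be shuffled before it is split into folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def fit_mean(ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def mse(prediction, ys):
    """Mean squared error of a constant prediction against targets."""
    return sum((prediction - y) ** 2 for y in ys) / len(ys)


targets = [3.0, 5.0, 4.0, 6.0, 5.0, 7.0]
fold_scores = []
for train_idx, val_idx in k_fold_indices(len(targets), k=3):
    model = fit_mean([targets[i] for i in train_idx])
    fold_scores.append(mse(model, [targets[i] for i in val_idx]))

# Average validation error across folds estimates performance on new data.
cv_score = sum(fold_scores) / len(fold_scores)
print(fold_scores, cv_score)
```

A training error that is much lower than the cross-validated error is a typical sign that the model is fitting the training data too closely.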

ANTI-PATTERNS IN MACHINE LEARNING


Machine learning anti-patterns are commonly occurring solutions
to problems that appear to be the right thing to do, but ultimately
lead to bad outcomes or suboptimal results. They are the pitfalls or
mistakes that are commonly made in the development or application
of ML models. These mistakes can lead to poor performance, biases,
overfitting, or other problems.

PHANTOM MENACE
The term "Phantom Menace" comes from instances when differences
between training and test data may not be immediately apparent
during the development and evaluation phase but can become a
problem when the model is deployed in the real world.

Training/serving skew occurs when the statistical properties of
the training data are different from the distribution of the data that
the model is exposed to during inference. This difference can result
in poor performance when the model is deployed, even if it performs
well during training. For example, if the training data for an image
classification model consists mostly of daytime photos, but the model
is later deployed to classify nighttime photos, the model may not
perform well due to this mismatch in data distributions.

To mitigate training/serving skew, it is important to ensure that
the training data is representative of the data that the model will
encounter during inference and to monitor the model's performance
in production to detect any performance degradation caused by
distributional shift. Techniques like data augmentation, transfer
learning, and model calibration can also help improve the model's
ability to generalize to new data.

THE SENTINEL
The "Sentinel" anti-pattern is a technique used to validate
models or data in an online environment before deploying them
to production. It is a separate model or set of rules that is used to
evaluate the performance of the primary model or data in a production
environment. The purpose is to act as a "safety net" and prevent
any incorrect or undesirable outputs from being released into the
real world. It can detect issues such as data drift, concept drift, or
performance degradation and provide alerts to the development team
to investigate and resolve the issue before it causes harm.

For example, in the context of an online recommendation system, a
sentinel model can be used to evaluate the recommendations made
by the primary model before they are shown to the user. If the sentinel
model detects that the recommendations are significantly different
from what is expected, it can trigger an alert for the development team
to investigate and address any issues before the recommendations are
shown to the user.

Figure 1: The Sentinel

The use of a sentinel can help mitigate risks associated with model
or data degradation, concept drift, and other issues that can occur
when deploying ML models in production. However, it is important to
design the sentinel model carefully to ensure that it provides adequate
protection without unnecessarily delaying the deployment of the
primary model.

THE HULK
The "Hulk" anti-pattern is a technique where the entire model training,
validation, and evaluation process is performed offline, and only the final
output or prediction is published for use in a production environment.
This approach is also sometimes referred to as offline precompute.

"Hulk" comes from the idea that the model is developed and tested in
isolation, like the character Bruce Banner who becomes the Hulk when
isolated from others.

Figure 2: The Hulk

To mitigate risks associated with the Hulk anti-pattern, it is important
to validate the model's performance in a production environment and
continuously monitor the data and model performance to detect and
address any issues that may arise. This can include techniques such
as data logging, monitoring, and feedback mechanisms to enable the
model to adapt and improve over time.
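
The monitoring advice above can be made concrete with a simple check that compares a feature's values in production against its values at training time. The sketch below flags a shift in the mean, measured in training standard deviations. The function name, the threshold of 3.0, and the brightness values (echoing the daytime and nighttime photo example) are assumptions for illustration; production systems often use richer statistical tests.

```python
from statistics import mean, stdev


def mean_shift_alert(training_values, serving_values, threshold=3.0):
    """Return True when the serving mean drifts far from the training mean.

    The shift is measured in training standard deviations. The default
    threshold is an illustrative cutoff, not a recommended value.
    """
    baseline_mean = mean(training_values)
    baseline_std = stdev(training_values)
    if baseline_std == 0:
        return mean(serving_values) != baseline_mean
    shift = abs(mean(serving_values) - baseline_mean) / baseline_std
    return shift > threshold


# Hypothetical average image brightness per photo.
train_brightness = [0.70, 0.80, 0.75, 0.72, 0.78]  # daytime training photos
night_brightness = [0.10, 0.15, 0.12, 0.08, 0.11]  # nighttime serving photos
day_brightness = [0.74, 0.77, 0.73, 0.76, 0.75]    # daytime serving photos

print(mean_shift_alert(train_brightness, night_brightness))
print(mean_shift_alert(train_brightness, day_brightness))
```

A check like this could run on logged serving data at a fixed interval and raise an alert for the development team when it fires.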

THE LUMBERJACK
The "Lumberjack" (also known as feature logging) anti-pattern
refers to a technique where features are logged online from within an
application, and the resulting logs are used to train ML models. Similar
to how lumberjacks cut down trees, process them into logs, and then
use the logs to build structures, in feature logging, the input data is
"cut down" into individual features that are then processed and used to
build a model, as shown in Figure 3.
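
A minimal sketch of feature logging, assuming a JSON-lines log format and using an in-memory buffer in place of a real log file. The function names, the `request_id` field, and the example features are hypothetical.

```python
import io
import json


def log_features(log_file, request_id, features):
    """Write one JSON line with the features used for a prediction."""
    record = {"request_id": request_id, "features": features}
    log_file.write(json.dumps(record) + "\n")


def read_feature_log(log_file):
    """Parse the logged lines back into dictionaries for training."""
    log_file.seek(0)
    return [json.loads(line) for line in log_file if line.strip()]


feature_log = io.StringIO()  # stands in for a real log file or stream
log_features(feature_log, "req-1", {"age": 31, "country": "IN"})
log_features(feature_log, "req-2", {"age": 45, "country": "US"})

training_rows = read_feature_log(feature_log)
print(training_rows)
```

Because the logged values are exactly what the application computed at serving time, a model trained on them sees the same feature definitions in training and in production.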

SEE FIGURE 3 ON NEXT PAGE


Figure 3: The Lumberjack

To mitigate the risks associated with the Lumberjack anti-pattern, it
is important to carefully design the feature logging process to capture
relevant information and avoid biases or errors. This can include
techniques such as feature selection, feature engineering, and data
validation to ensure that the logged features accurately represent
the underlying data. It is also important to validate the model's
performance in a production environment and continuously monitor
the data and model performance to detect and address any issues that
may arise.

THE TIME MACHINE
The "Time Machine" anti-pattern is a technique where historical data
is used to train a model, and the resulting model is then used to make
predictions about future data (hence the name). This approach is also
known as time-based modeling or temporal modeling.

To mitigate the risks associated with the Time Machine anti-pattern,
it is important to carefully design the modeling process to capture
changes in the underlying data over time and to validate the model's
performance on recent data. This can include techniques such as
using sliding windows, incorporating time-dependent features, and
monitoring the model's performance over time.

TECHNIQUES TO DETECT MACHINE LEARNING ANTI-PATTERNS
The following techniques help to identify and mitigate common
mistakes and pitfalls that can arise in the development and deployment
of ML models.

See Table 7.

Table 7

TECHNIQUE AND DESCRIPTION

Cross-validation
•  Assess an ML model's performance by splitting the dataset into
training and testing sets.
•  Detect overfitting and underfitting, which are common
anti-patterns in ML.

Bias detection
•  Bias is a common anti-pattern in ML that can lead to unfair or
inaccurate predictions.
•  Fairness metrics, such as demographic parity and equalized odds,
can be used to detect and help mitigate bias in models.

Feature selection
•  Identify the most important features or variables in a dataset.
•  Detect and address anti-patterns like irrelevant features and
feature redundancy, which can lead to overfitting and reduced
model performance.

Model interpretability
•  ML techniques like decision trees, random forests, and LIME can be
used to provide interpretability and transparency to ML models.
•  Detect and address anti-patterns like black-box models, which are
difficult to interpret and can lead to reduced trust and performance.

Performance metrics
•  ML models can be evaluated using a variety of performance metrics,
including accuracy, precision, recall, F1 score, and AUC-ROC.
•  Monitoring these metrics over time can help detect changes in model
performance and identify anti-patterns like model drift and overfitting.

CONCLUSION
This Refcard on ML patterns and anti-patterns began with an overview
of ML, including common challenges such as data quality,
reproducibility, data scalability, and catering to multiple objectives
of the organization. It then covered five key patterns, ways to avoid
MLOps mistakes, five key anti-patterns, and techniques to detect ML
anti-patterns.

Together, these give ML engineers and data scientists the knowledge
and direction to recognize the patterns and anti-patterns in machine
learning and take the necessary measures to avoid mistakes.

References:

1. Alexander, C. (1977). A pattern language: towns, buildings,
construction. Oxford University Press.

2. Alexander, C. (1979). The timeless way of building (Vol. 1). New
York: Oxford University Press.

3. Brown, W. H., Malveau, R. C., McCormick, H. W. S., & Mowbray, T.
J. (1998). AntiPatterns: refactoring software, architectures, and
projects in crisis. John Wiley & Sons, Inc.


4. Barbez, A., Khomh, F., & Guéhéneuc, Y. G. (2020). "A
machine-learning based ensemble method for anti-patterns detection."
Journal of Systems and Software, 161, 110486.

5. Gamma, E., Helm, R., Johnson, R., Johnson, R. E., & Vlissides,
J. (1995). Design patterns: elements of reusable object-oriented
software. Pearson Deutschland GmbH.

6. Tuggener, L., Amirian, M., Benites, F., von Däniken, P., Gupta,
P., Schilling, F. P., & Stadelmann, T. (2020). "Design patterns for
resource-constrained automated deep-learning methods." AI,
1(4), 510-538.

7. Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine
learning design patterns. O'Reilly Media.

8. Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., & Stal,
M. (2008). Pattern-Oriented Software Architecture: A System of
Patterns, Volume 1 (Vol. 1). John Wiley & Sons.

9. Muralidhar, N., Muthiah, S., Butler, P., Jain, M., Yu, Y., Burne, K., ...
& Ramakrishnan, N. (2021). "Using antipatterns to avoid MLOps
mistakes." arXiv preprint arXiv:2107.00079.

WRITTEN BY DR. TUHIN CHATTOPADHYAY,
PROFESSOR OF AI & BLOCKCHAIN, JAGDISH SHETH
SCHOOL OF MANAGEMENT

Dr. Tuhin Chattopadhyay, Professor of Practice at JAGSoM, India, is a
celebrated Industry 4.0 thought leader among both the academic and
corporate fraternity. Recipient of numerous prestigious awards, Tuhin
is hailed as one of India's Top 10 Data Scientists by Analytics India
Magazine. Besides teaching, Dr. Tuhin also drives his AI & Analytics
consultancy globally that can be explored from tuhin.ai.

3343 Perimeter Hill Dr, Suite 100
Nashville, TN 37211
888.678.0399 | 919.678.0300

At DZone, we foster a collaborative environment that empowers
developers and tech professionals to share knowledge, build skills, and
solve problems through content, code, and community. We thoughtfully,
and with intention, challenge the status quo and value diverse
perspectives so that, as one, we can inspire positive change through
technology.

Copyright © 2023 DZone. All rights reserved. No part of this publication
may be reproduced, stored in a retrieval system, or transmitted, in any
form or by means of electronic, mechanical, photocopying, or otherwise,
without prior written permission of the publisher.
