Chapter 2
Machine Learning (ML) is a subset of artificial intelligence (AI) that enables computers to learn from data and
make decisions or predictions without being explicitly programmed. It involves developing algorithms that allow a
system to improve its performance on a task over time by recognizing patterns in data.
The key concepts behind ML are:
1. Data: ML algorithms learn from data, which can be in the form of numbers, text, images, or other formats.
2. Training: In ML, a model is trained using a dataset. The model adjusts its internal parameters to minimize
errors in predictions.
3. Models: The trained model is used to make predictions or decisions on new, unseen data.
4. Learning: ML systems learn from experience (data) and improve their accuracy without being
reprogrammed.
ML approaches are commonly grouped into three types:
1. Supervised Learning: The model is trained on labeled data, where both inputs and outputs are provided.
The goal is to learn a mapping from input to output (e.g., predicting house prices based on features like area
and location).
2. Unsupervised Learning: The model is given unlabeled data and tries to find hidden patterns or structures
(e.g., grouping customers based on purchasing behavior without predefined categories).
3. Reinforcement Learning: The model learns by interacting with an environment and receiving feedback in
the form of rewards or penalties, aiming to maximize long-term rewards.
2.2 ML Models
There are many commonly used models in ML, and we will abstain from giving an overview of all of them here. In
addition to common models, many model variations, novel architectures, and optimization strategies are published
on a weekly basis. In May 2019 alone, more than 13,000 papers were submitted to arXiv, a popular electronic archive of research where new models are frequently published. It is useful, however, to share an
overview of different categories of models and how they can be applied to different problems. To this end, I propose
here a simple taxonomy of models based on how they approach a problem. You can use it as a guide for selecting an
approach to tackle a particular ML problem. Because models and data are closely coupled in ML, you will notice
some overlap between this section and “Data types”. ML algorithms can be categorized based on whether they
require labels.
Here, a label refers to the presence in the data of an ideal output that a model should produce for a given example.
Supervised algorithms leverage datasets that contain labels for inputs, and they aim to learn a mapping from inputs
to labels. Unsupervised algorithms, on the other hand, do not require labels. Finally, weakly supervised algorithms
leverage labels that aren’t exactly the desired output but that resemble it in some way. Many product goals can be
tackled by both supervised and unsupervised algorithms.
Machine learning (ML) models can be broadly classified by the kind of learning they perform and the type of data they work with. The three main types are Supervised Learning, Unsupervised Learning, and Reinforcement Learning; semi-supervised and self-supervised learning, covered later in this section, combine or extend these paradigms. Each type can be further broken down into specific models or algorithms.
In supervised learning, the algorithm is trained on labeled data, meaning the input data is paired with the correct
output. The model learns a mapping from inputs to outputs and is then able to make predictions on unseen data.
Linear Regression: Used for predicting continuous values, such as predicting house prices based on
features like area, number of rooms, etc.
Logistic Regression: Used for classification tasks, especially for binary outcomes (e.g., spam vs. not
spam).
Support Vector Machines (SVM): A powerful classifier that works by finding a hyperplane that best
separates the data into classes.
Decision Trees: A tree-like model of decisions that splits data based on feature values. It’s interpretable but
can be prone to overfitting.
Random Forest: An ensemble method that uses multiple decision trees to improve accuracy and reduce
overfitting.
K-Nearest Neighbors (KNN): A classification algorithm that assigns a class to a data point based on the
majority class of its nearest neighbors.
Naive Bayes: A probabilistic classifier based on Bayes' theorem, suitable for text classification and other
probabilistic tasks.
Neural Networks: A network of nodes (neurons) inspired by biological neural networks, used for both
classification and regression, especially in deep learning.
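As a concrete illustration, the following minimal sketch (assuming Python with scikit-learn installed, and using the library's built-in breast-cancer dataset purely as a stand-in for real labeled data) trains two of the supervised models listed above and compares their accuracy on held-out data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Labeled data: X holds the input features, y holds the known class labels.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=5000), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)                          # learn a mapping from inputs to labels
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(type(model).__name__, round(accuracy, 3))      # evaluate on unseen data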
Unsupervised learning involves training a model on data that has no labels or explicit outputs. The goal is often to
find hidden patterns or groupings in the data.
K-Means Clustering: A clustering algorithm that partitions data into k distinct clusters based on similarity.
Hierarchical Clustering: Builds a hierarchy of clusters, represented as a tree-like structure (dendrogram).
Principal Component Analysis (PCA): A dimensionality reduction technique used to reduce the number
of features while retaining the essential information.
Gaussian Mixture Model (GMM): A probabilistic model that assumes all data points are generated from a
mixture of several Gaussian distributions.
Autoencoders: Neural networks designed for unsupervised learning, typically used for dimensionality
reduction or anomaly detection.
t-Distributed Stochastic Neighbor Embedding (t-SNE): A technique used to visualize high-dimensional
data in 2D or 3D space.
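A minimal sketch of two of these techniques (again assuming scikit-learn, with synthetic data standing in for real unlabeled data): K-Means groups the points into clusters, and PCA projects them down to two dimensions for visualization.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data: only X is available, there is no y.
X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)  # cluster assignments
X_2d = PCA(n_components=2).fit_transform(X)                                # 10 features reduced to 2 components
print(clusters[:10], X_2d.shape)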
In reinforcement learning, an agent learns by interacting with an environment and receiving rewards or penalties, with the goal of maximizing cumulative reward. Common algorithms include:
Q-Learning: A model-free RL algorithm where the agent learns a policy by estimating the value of taking
certain actions in certain states.
Deep Q Networks (DQN): A combination of Q-Learning and deep learning, where a deep neural network
approximates the Q-values.
Policy Gradient Methods: A family of RL algorithms that optimize the policy directly, rather than the
value function.
Actor-Critic Models: These models combine value-based and policy-based methods, where the "actor"
selects actions and the "critic" evaluates them.
Proximal Policy Optimization (PPO): A more stable RL algorithm, widely used in training deep
reinforcement learning models.
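The sketch below shows tabular Q-Learning on a deliberately tiny, hypothetical "corridor" environment (five states, two actions), since the update rule is easier to see without a full RL library; it is an illustration of the idea, not a production setup.

import numpy as np

n_states, n_actions = 5, 2                 # states 0..4; action 0 = left, action 1 = right
Q = np.zeros((n_states, n_actions))        # Q-table of state-action values
alpha, gamma, epsilon = 0.1, 0.9, 0.1      # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

def step(state, action):
    """Hypothetical environment: reaching state 4 gives reward +1 and ends the episode."""
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy: explore occasionally, otherwise take the best-known action
        action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
        next_state, reward, done = step(state, action)
        # Q-learning update: nudge Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print(Q)   # after training, action 1 (move right) should have the higher value in every state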
In Semi-Supervised Learning, the model is trained with a small amount of labeled data and a large amount of
unlabeled data. This approach leverages the vast amounts of unlabeled data, with the small labeled set guiding the
model’s learning.
Self-Training: The model is initially trained on the labeled data and then predicts labels for the unlabeled
data, which are added to the training set.
Co-Training: Two models are trained on different views of the data and help label the unlabeled data for
each other.
Graph-Based Methods: Use the relationships between data points (represented as a graph) to propagate
labels from labeled data to unlabeled data.
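A minimal self-training sketch (assuming scikit-learn; the synthetic data and the 95% confidence threshold are arbitrary choices for illustration): the model is fit on the few labeled examples, then repeatedly pseudo-labels the unlabeled examples it is most confident about.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy setup: 1000 examples, but only the first 50 keep their labels.
X, y_true = make_classification(n_samples=1000, n_informative=5, random_state=0)
y = y_true.copy()
y[50:] = -1                                   # -1 marks "unlabeled"

model = LogisticRegression(max_iter=1000)
for _ in range(5):                            # a few self-training rounds
    labeled = y != -1
    model.fit(X[labeled], y[labeled])                     # train on currently labeled data
    proba = model.predict_proba(X[~labeled])
    confident = proba.max(axis=1) > 0.95                  # only trust confident predictions
    new_idx = np.where(~labeled)[0][confident]
    y[new_idx] = model.classes_[proba[confident].argmax(axis=1)]   # assign pseudo-labels

print("pseudo-labeled examples:", (y != -1).sum() - 50)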
Self-supervised learning is a type of unsupervised learning where the system learns by creating labels from the input
data itself. It generates pseudo-labels based on inherent structures within the data and learns to predict these labels.
Examples:
Contrastive Learning: Used in deep learning, where a model learns to distinguish between similar and
dissimilar data points by maximizing the agreement between positive pairs and minimizing the agreement
between negative pairs.
SimCLR: A self-supervised learning framework that learns representations of images by maximizing the
similarity between augmented versions of the same image.
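To make the contrastive idea concrete, here is a NumPy sketch of the NT-Xent (normalized temperature-scaled cross-entropy) loss used by SimCLR, written for readability rather than efficiency; z1 and z2 are assumed to be embeddings of two augmented views of the same batch of images, produced by some encoder network not shown here.

import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """z1, z2: (N, d) embeddings of two augmented views of the same N examples."""
    z = np.concatenate([z1, z2], axis=0)                     # stack both views: (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)         # L2-normalize so dot product = cosine similarity
    sim = z @ z.T / temperature                              # (2N, 2N) pairwise similarities
    np.fill_diagonal(sim, -np.inf)                           # an example is never its own positive
    n = z1.shape[0]
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])   # index of each example's positive pair
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))      # log-softmax over all other examples
    return -log_prob[np.arange(2 * n), positives].mean()     # pull positives together, push negatives apart

# Example usage with random embeddings standing in for encoder outputs:
rng = np.random.default_rng(0)
print(nt_xent_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))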
2.3 Challenges in ML
1. Data Challenges
Data Cleaning and Preprocessing: Raw data is often noisy, incomplete, or inconsistent, which requires
significant cleaning, transformation, and normalization before it can be used for training. This step is
critical, as the model's performance heavily depends on the quality of the data it learns from.
Data Labeling: For supervised learning, acquiring accurately labeled data can be expensive, time-
consuming, and error-prone, especially when working with large datasets.
Imbalanced Data: In many real-world datasets, certain classes of data may be underrepresented, leading to
biased models that perform poorly on underrepresented classes.
Data Drift: Over time, the statistical properties of data can change (data drift), making previously trained
models less effective. Continuous monitoring and adaptation are needed to address this.
2. Model Development Challenges
Overfitting: A common problem where a model performs well on training data but poorly on unseen data
(test set or production data). This occurs when the model learns noise or patterns that do not generalize well
to new data.
Bias-Variance Tradeoff: Striking the right balance between a model's complexity (high variance) and its
simplicity (high bias) is often difficult. A model that is too complex may overfit, while a simpler model
may underperform.
3. Computational and Performance Challenges
Training Time: Some ML models, particularly deep learning models, can require significant
computational resources and time to train, especially on large datasets. This can make it difficult to iterate
quickly.
Inference Speed: Once a model is deployed, the inference (prediction) speed is crucial for real-time
applications. Ensuring low latency and high throughput during inference, especially in production
environments with large-scale data, can be a major challenge.
Resource Consumption: Many ML models, especially deep learning models, are resource-intensive in
terms of memory, CPU, and GPU usage. Optimizing these models to be resource-efficient is essential for
production environments.
4. Deployment Challenges
Model Versioning: Keeping track of different versions of models and ensuring the correct version is
deployed can become complex, especially when multiple models are in production simultaneously.
Deployment Pipelines: Building an automated and robust ML pipeline to handle the deployment process,
including testing, continuous integration, and monitoring, can be challenging.
Model Compatibility: Ensuring that the model works well across different platforms, environments, or
devices (e.g., on-premise servers, cloud infrastructure, edge devices) can introduce integration challenges.
5. Monitoring and Maintenance
Model Monitoring: Monitoring a model’s performance in production is essential for identifying any drop
in accuracy or problems that arise due to data drift or model degradation over time.
Detecting Concept Drift: Changes in the underlying distribution of the data (concept drift) can lead to
reduced model performance. Continuously retraining models with new data or adapting the model is
necessary to keep it relevant.
A/B Testing: When deploying a new model, it's important to perform A/B testing to compare it with the
previous version and evaluate improvements. This requires careful setup and analysis.
6. Interpretability and Explainability
Black-box Models: Many complex ML models, especially deep neural networks, function as “black
boxes” — meaning they are not easily interpretable. This is a challenge when stakeholders need to
understand how a model arrived at a particular decision, especially in regulated industries like healthcare,
finance, and law.
Accountability and Trust: As models make critical decisions, it is essential to ensure that they can be
trusted, and their outputs are understandable and explainable to end-users, especially in high-risk
applications.
7. Security and Privacy
Data Privacy: ML models can inadvertently memorize sensitive data, leading to privacy issues. Ensuring
that data used for training and predictions complies with privacy regulations like GDPR is crucial.
Adversarial Attacks: ML models are vulnerable to adversarial attacks, where small, intentionally crafted
changes to input data can cause the model to make incorrect predictions. Robustness to such attacks is
important for models in security-sensitive applications.
Model Theft and Reverse Engineering: In production, ML models can be reverse-engineered or stolen.
Securing models and their endpoints against unauthorized access or misuse is a key challenge.
8. Ethics, Fairness, and Compliance
Bias and Fairness: Ensuring that models do not perpetuate or amplify bias based on race, gender, age, etc.,
is a growing concern in ML. Bias can arise from biased data or unfair treatment of certain groups by the
model.
Regulations: In many industries (e.g., finance, healthcare), there are strict regulations about how data is
used and models are deployed. Ensuring compliance with these regulations can be challenging.
Ethical AI: Ensuring that AI systems operate in an ethical and responsible manner, without causing harm
or making discriminatory decisions, is an ongoing challenge.
9. Collaboration and Communication
Cross-functional Collaboration: Building ML models often requires collaboration between data scientists,
engineers, product managers, and domain experts. Miscommunication or lack of alignment between these
teams can hinder progress and lead to inefficiencies.
Communication of Results: Explaining technical results to non-technical stakeholders is often difficult,
yet essential for business decisions. Effective communication is required to translate model outputs into
actionable insights.
10. Cost Management
Computational Costs: Training large-scale models (e.g., deep learning models) can be expensive in terms
of computational resources, especially when using GPUs or cloud-based infrastructure. Optimizing these
costs while maintaining model performance is a key challenge.
Operational Costs: Running models in production at scale requires ongoing infrastructure and monitoring
costs. Efficiently managing these costs while ensuring high availability and low latency is critical.
2.4 ML Project Lifecycle and the Pitfalls of Focusing Only on ML Models
Define the Business Goal: This is the first and crucial step where the objective of the ML project is clearly defined.
The business problem that needs to be solved with ML is identified, and success criteria are established. It’s
important to understand what you want to achieve through the model, like improving sales, reducing costs, etc.
Collect and Prepare Data: Once the goal is defined, the next step is to gather relevant data from various sources.
This may involve extracting data from databases, APIs, or other sources. After collecting the data, it needs to be
cleaned and prepared (e.g., removing duplicates, handling missing values) so that it is ready for the model-building
process.
Build and Deploy Model: In this stage, different machine learning algorithms are applied to the data to build a
predictive model. Feature engineering, selection of the right algorithm, and training the model with the data are all
part of this step. Once the model is trained and evaluated, it is deployed into a production environment to be used by
the application.
Integrate with Application: After deploying the model, it needs to be integrated with a real-world application. This
involves using the model’s predictions in a practical system or service that can serve the end-users or business
processes. For instance, an ML model might be embedded into a recommendation engine or a customer service
chatbot.
Monitor Impact: The final stage is about continuously monitoring the model's performance and ensuring that it is
delivering the expected business outcomes. If the model’s accuracy or relevance declines over time (e.g., due to
changing data), it may need to be retrained, updated, or adjusted. Monitoring ensures the model remains useful and
effective in meeting the business goals.
Each stage is part of a cyclical process: after monitoring, the business goals might be refined, and the process begins again.
Common pitfalls of focusing only on the ML model include:
a) Using the wrong datasets, which can easily lead to inaccurate or biased results.
b) Finding out that historical features used to train the model are unavailable in the production or real-time environment.
c) Discovering there is no practical way to integrate the model's predictions into the current application.
d) Realizing the ML project costs more than the value it generates or, in a worst-case scenario, causes losses in revenue or customer satisfaction.
The machine learning process involves several key steps, starting from understanding the problem and preparing the
data, to building and deploying a model, and eventually monitoring its performance in a production environment.
This process is iterative and requires continuous refinement. Below is a detailed overview of the Machine Learning
Process:
1. Problem Definition
Before diving into the technical aspects, it’s crucial to understand the problem at hand. This involves:
Understanding Business Requirements: Determine what the goal of the project is (e.g., classification,
regression, recommendation). Understanding the business context will help in framing the problem
correctly.
Defining the Objective: This step involves specifying clear, measurable objectives, such as predicting
customer churn, classifying emails as spam or not, or forecasting sales.
Setting Evaluation Metrics: Metrics like accuracy, precision, recall, F1 score (for classification), or mean
squared error (MSE) for regression should be defined early on to measure the model's success.
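For reference, the standard definitions of these metrics are (TP, FP, FN denote true positives, false positives, and false negatives; y_i and ŷ_i the true and predicted values):

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2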
2. Data Collection
Data is the foundation of machine learning, so gathering high-quality data is critical. This step includes:
Collecting Data: Data can be collected from multiple sources like databases, APIs, web scraping, or third-
party datasets. The data could include structured data (tables), unstructured data (images, text), or semi-
structured data (XML, JSON).
Data Availability: Assess if the required data is available in sufficient quantities and whether it’s diverse
enough to train a robust model.
Understanding the Data: Data exploration helps in understanding its size, type, structure, and any
potential issues like missing values or outliers.
3. Data Preparation and Feature Engineering
Once the data is collected, it needs to be cleaned and transformed into a usable format. This stage involves:
Handling Missing Data: Missing values can be handled by techniques like imputation, removing rows
with missing values, or filling them with default values (e.g., mean, median).
Data Transformation: Data needs to be transformed into a format suitable for machine learning
algorithms. This may involve normalization (scaling numerical values) or encoding categorical data.
Outlier Detection: Identifying and handling outliers (extreme values) to prevent them from negatively
impacting the model.
Feature Engineering: Creating new features (derived variables) from existing ones can improve the
model’s performance. This might involve combining or transforming raw features, for example, generating
a “day of the week” feature from a timestamp.
Feature Selection: Selecting the most relevant features to reduce complexity and improve model
performance. Techniques like correlation analysis, decision trees, or L1 regularization can be used.
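The steps above can be sketched in a few lines of pandas and scikit-learn (the column names and values here are hypothetical, chosen only to show imputation, a derived day-of-week feature, one-hot encoding, and scaling):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with a timestamp, a categorical column, and a missing value.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-06", "2024-01-07"]),
    "city": ["Mumbai", "Pune", "Mumbai"],
    "amount": [120.0, None, 340.0],
})

df["amount"] = df["amount"].fillna(df["amount"].median())        # imputation of missing values
df["day_of_week"] = df["timestamp"].dt.dayofweek                 # derived feature from the timestamp
df = pd.get_dummies(df, columns=["city"])                        # one-hot encoding of categorical data
df[["amount"]] = StandardScaler().fit_transform(df[["amount"]])  # normalization of numeric values
print(df.head())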
4. Exploratory Data Analysis (EDA)
EDA is an essential step to better understand the data, its patterns, and any relationships between features. This
process typically involves:
Visualizing the Data: Using plots (e.g., histograms, scatter plots, box plots) to uncover patterns,
correlations, and anomalies in the data.
Statistical Analysis: Computing summary statistics (e.g., mean, median, standard deviation) to get a sense
of data distribution.
Identifying Patterns: Understanding the relationships between different features to decide which variables
are important for the model.
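A typical first pass at EDA, assuming recent pandas and that the data has been loaded from a hypothetical data.csv (the histogram call additionally requires matplotlib):

import pandas as pd

df = pd.read_csv("data.csv")              # hypothetical dataset
print(df.shape)                           # number of rows and columns
print(df.dtypes)                          # data type of each column
print(df.describe())                      # mean, std, quartiles of numeric columns
print(df.isna().sum())                    # missing values per column
print(df.corr(numeric_only=True))         # pairwise correlations between numeric features
df.hist(figsize=(10, 8))                  # histograms to inspect each column's distribution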
5. Model Selection
In this stage, the appropriate machine learning model or algorithm is selected based on the problem type
(classification, regression, clustering, etc.) and the nature of the data. The models might include:
Supervised Learning: If you have labeled data, algorithms like Logistic Regression, Decision Trees,
Random Forests, Support Vector Machines, or Neural Networks might be appropriate.
Unsupervised Learning: For unlabeled data, clustering techniques like K-Means, DBSCAN, or
dimensionality reduction techniques like PCA or t-SNE might be used.
Reinforcement Learning: If the problem involves learning from an environment through trial and error
(e.g., gaming, robotics), reinforcement learning models such as Q-Learning or Deep Q Networks (DQN)
may be suitable.
6. Model Training
Once the model is selected, the next step is training it on the dataset. This involves:
Splitting the Data: Typically, the data is split into training, validation, and test sets. The model is trained
on the training set, tuned on the validation set, and evaluated on the test set.
Model Training: The training process involves feeding the data into the chosen model to learn the
underlying patterns. During training, the model adjusts its internal parameters (like weights in neural
networks) to minimize a loss function.
Hyperparameter Tuning: ML models often have hyperparameters (like learning rate, regularization
strength, or number of trees in a random forest) that need to be optimized. This can be done using
techniques like Grid Search, Random Search, or Bayesian Optimization.
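A compact sketch of these steps with scikit-learn (synthetic data; the grid of hyperparameters is arbitrary and only for illustration), where GridSearchCV's internal cross-validation plays the role of the validation set:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)   # hold out a test set

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}        # hyperparameters to tune
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)                                                 # trains and validates each combination
print(search.best_params_, search.score(X_test, y_test))                     # final check on unseen test data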
7. Model Evaluation
After training, the model’s performance is assessed using appropriate evaluation metrics based on the type of
problem. Common metrics include:
Classification Metrics: For classification tasks, metrics like accuracy, precision, recall, F1 score, and
AUC-ROC curve are used.
Regression Metrics: For regression tasks, metrics like mean squared error (MSE), mean absolute error
(MAE), R² are used.
Cross-Validation: Techniques like K-Fold Cross-Validation are used to ensure that the model performs
well on unseen data and is not overfitting.
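For example, with scikit-learn the classification metrics above and a 5-fold cross-validation can be obtained as follows (reusing a synthetic dataset purely for illustration):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=600, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))        # precision, recall, F1 per class
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc"))       # AUC-ROC across 5 folds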
8. Model Optimization and Tuning
Once a model has been evaluated, the next step is to optimize and fine-tune it. This involves:
Adjusting Hyperparameters: Fine-tuning the model’s hyperparameters using techniques like Grid
Search or Random Search to improve performance.
Regularization: Applying regularization techniques like L1, L2, or Dropout to prevent overfitting and
improve generalization.
Ensemble Methods: Combining multiple models (e.g., Random Forests, Boosting algorithms like
XGBoost or AdaBoost) to improve predictive performance.
9. Model Deployment
Once the model has been trained and optimized, it is ready to be deployed into a production environment. This
involves:
Model Serving: Setting up a system to serve the model for real-time predictions (e.g., using REST APIs,
cloud services, or on-premise servers).
Containerization: Packaging the model in containers (e.g., using Docker) for easy deployment and
scalability.
CI/CD Pipelines: Implementing Continuous Integration (CI) and Continuous Deployment (CD) pipelines
for automating model updates, monitoring, and maintenance.
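As one common pattern, a trained model can be served behind a small REST API; the sketch below assumes FastAPI, uvicorn, and joblib are installed, and that a model has already been saved to a hypothetical model.joblib file (the endpoint name and feature layout are likewise illustrative):

from typing import List

import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")          # previously trained and serialized model (hypothetical path)

class Features(BaseModel):
    values: List[float]                      # a flat list of numeric feature values

@app.post("/predict")
def predict(features: Features):
    X = np.array(features.values).reshape(1, -1)          # single example as a 2-D array
    return {"prediction": model.predict(X).tolist()}

# Launch locally (assuming this file is named app.py): uvicorn app:app --port 8000
# The same service can then be packaged into a Docker image and wired into a CI/CD pipeline.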
10. Monitoring and Maintenance
Monitoring: Track performance metrics to detect any drop in accuracy or other performance issues due to
data drift or model decay.
Model Retraining: As new data becomes available, the model might need to be retrained to adapt to
changes. This can be done periodically or triggered by performance degradation.
Logging and Alerts: Set up logging and alerting mechanisms to detect failures, errors, or performance
issues in real-time.
Retraining is typically triggered by:
Data Drift: Changes in the underlying data distribution over time that can lead to reduced model performance.
Concept Drift: Changes in the relationship between input features and target variables over time.
New Data: Incorporating new data into the model to improve its predictions.
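Data drift can be checked with simple statistical tests; the sketch below (assuming SciPy, with synthetic numbers standing in for a real feature) compares a feature's training-time distribution with recent production values using a two-sample Kolmogorov-Smirnov test:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=5000)      # feature values seen at training time
live_feature = rng.normal(0.3, 1.0, size=5000)       # recent production values (mean has shifted)

stat, p_value = ks_2samp(train_feature, live_feature)    # two-sample Kolmogorov-Smirnov test
if p_value < 0.01:
    print(f"possible data drift: KS statistic={stat:.3f}, p-value={p_value:.1e}")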
Source code control (also known as version control) is essential in managing the complexity of software
development and particularly crucial in machine learning (ML) projects. As ML models evolve over time, tracking
changes to the code, data, and models becomes increasingly important. Proper version control enables the team to
work collaboratively, maintain reproducibility, and manage complex changes.
History: Source code control, or version control, has been used in software development for decades, with
early systems like RCS. It evolved into modern distributed version control systems like Git.
Role in Version Control: In ML projects, source code control tracks changes to code, data, and model
artifacts. It ensures versioning, traceability, and the ability to roll back to previous states. This is crucial for
reproducibility.
Role in Collaboration: Source code control enables collaboration among data scientists, engineers, and
researchers. It allows multiple team members to work on code simultaneously, merge changes, and resolve
conflicts. It fosters efficient teamwork and maintains code quality.
Significance: It ensures that machine learning projects can be reliably reproduced, tested, and scaled. It also
provides a history of changes, which is valuable for debugging and auditing. Source code control is an
essential part of the MLOps lifecycle.
1. Versioning of Code, Data, and Models: ML projects involve not just code (which implements models and
algorithms) but also datasets and model parameters. Keeping track of different versions of code, data, and
trained models allows for efficient model iteration, comparison, and deployment. This is particularly
important when:
o Different versions of models need to be tested against each other.
o The team is iterating over experiments that require consistent tracking of changes.
o Reproducing past results for debugging or further development is necessary.
2. Collaboration: Source control enables multiple team members (data scientists, software engineers, product
managers) to work on the same project simultaneously without conflicts. Each team member can work on
different parts of the project, with changes being merged smoothly.
o Branching and Merging: Team members can work on different features or improvements
independently and then merge them into the main codebase.
o Collaboration Tools: Modern version control platforms (e.g., GitHub, GitLab, Bitbucket) provide
tools for reviewing and approving code changes, thus improving collaboration and transparency.
3. Reproducibility: In ML, reproducing experiments is crucial for validating models and ensuring
consistency. Version control helps ensure that the exact version of code, dependencies, and data that
produced a certain result can be retrieved. This reproducibility is particularly important in research and
regulated industries like healthcare or finance.
o By tagging specific commits or versions, teams can capture the state of the model at any point and
recreate it in the future.
4. Experiment Tracking: Machine learning projects involve frequent experimentation with different models,
hyperparameters, and datasets. Source control systems like Git can be used alongside experiment tracking
tools (such as MLflow or Weights & Biases) to track these experiments, log parameters, and store results.
This allows teams to:
o Compare different experiments.
o Rollback to earlier versions of models or code when needed.
5. Code Integrity and Quality: By using version control systems, teams can maintain code integrity through
automated checks, continuous integration, and testing. Source control tools support:
o Automated tests to check the validity of models or code changes.
o Code reviews, where team members review and suggest improvements to each other’s code
before it gets merged.
6. Model Deployment and Rollback: ML models evolve over time, and new models are deployed in
production based on experiments. Source control can help ensure that each model version is correctly
deployed and rolled back when necessary. This is especially useful when a newly deployed model leads to
performance issues or errors, as the previous stable version can be restored quickly.
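The list above mentions pairing version control with experiment tracking tools. As a minimal sketch (assuming MLflow is installed and logging to its default local store; the parameter and metric names are arbitrary), each run records the hyperparameters and results so that experiments can later be compared alongside the git history of the code that produced them:

import mlflow
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

with mlflow.start_run():                                   # one tracked experiment run
    n_estimators = 200
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    cv_accuracy = cross_val_score(model, X, y, cv=5).mean()
    mlflow.log_param("n_estimators", n_estimators)         # hyperparameter used in this run
    mlflow.log_metric("cv_accuracy", float(cv_accuracy))   # result, comparable across runs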
Example: Your team is starting a new ML project and you have been asked to set up a code repository (on Azure or GitHub) for it. Perform the steps below and upload a screenshot of the commands as your answer.
4: Create three files in the root project directory: {your_name}.py, {your_surname}.py, and {your_sap_id}.py
3: Commit {your_name}.py with the message “{your name} committing this change for term test 1”
Several personas are involved in delivering ML projects, including the Data Scientist, the Data Engineer, the DevOps/IT Administrator, and the Compliance Officer/Legal, among others.
Each persona plays a vital role in the end-to-end ML platform, contributing to the success of machine learning
projects.
Model Retraining is a crucial aspect of maintaining the performance and accuracy of machine learning
models over time. As data evolves, models may become less effective due to various factors such as
changes in the underlying data distribution, new trends, or the introduction of new data. Retraining helps
ensure that models remain relevant and accurate in their predictions.
For example, COVID-19 abruptly changed human behavior across the globe. But the pandemic not only
significantly impacted human lives, it also disrupted ML models. Data engineers woke up to find that
their ML models, which were trained on pre-pandemic data sets, had suddenly drifted and were not
delivering reliable results.
The models’ performance degraded because the pre-pandemic data was not reflecting current behaviors
and therefore was no longer relevant or accurate. These models had to be retrained to ensure their validity and efficacy for the pandemic era. While COVID-19 is an extreme example, data keeps
changing because people change and the world changes. This means models trained on outdated data lose
relevance. Model retraining, also known as continuous training or continual training, is the act of training
models again and again on updated data and then redeploying them to production.
By retraining, data engineers can ensure the models are up-to-date, valid, and trustworthy. This ensures
the predictions and outputs of models are always accurate for the business use cases they were designed
to answer. If models aren’t retrained, they will become stale. Accurate models are essential for business
success. If an organization uses a model that provides inaccurate outputs, the result could be loss of
customers and profit. For example, if a fraud-detection model is inaccurate, either fraudsters get away with fraud, costing the company customers and perhaps millions in insurance claims, or there are too many false positives, frustrating end-users (who won’t be able to complete online purchases) and financially hurting the company’s customers (again, losing customers). Automating the process of model retraining makes it reliable and
optimized. Automation also reduces the chance of manual errors or data engineers forgetting to retrain
models. With automation, data engineers and data scientists can ensure their measurements are defensible
and quantitative and that explainability tests are set up.
Two related phenomena make retraining necessary:
Data drift
When the statistical distribution of production data is different from the baseline data used to train or
build the model. This happens when human behavior changes, training data was inaccurate, or there were
data quality issues.
Concept drift
When the statistical properties of the target variable change over time. In other words, the concept, or the
relations between the datasets, have drifted.
Retraining can be triggered in several ways:
Interval-based: According to a certain schedule or repeating interval; for example, retraining every
Sunday night or every end of the month. This ensures the models will always stay up-to-date since they
are constantly retrained. However, this method can be costly since resources are used even when
retraining is unnecessary.
Based on data changes: This type of retraining takes place when there are new data sets or when code
changes are made. Such retraining ensures adaptivity to engineering changes but might miss drift that
degrades the model performance.
Manually on-demand: This non-automated retraining method provides complete control for data scientists but is prone to errors and could mean retraining does not occur when needed.
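The three triggers above can be combined in a single retraining policy; the sketch below is deliberately schematic, with should_retrain and its inputs being hypothetical names rather than any particular framework's API:

import datetime

def should_retrain(last_trained, new_data_arrived, drift_detected,
                   max_age=datetime.timedelta(days=7)):
    """Decide whether to kick off retraining (hypothetical policy combining the three triggers)."""
    if datetime.datetime.now() - last_trained > max_age:   # interval-based schedule
        return True
    if new_data_arrived:                                   # new data sets or code changes
        return True
    if drift_detected:                                     # drift flagged by monitoring
        return True
    return False

# In a pipeline this check would typically run on a schedule, e.g.:
# if should_retrain(last_trained, new_data_arrived, drift_detected):
#     retrain_and_redeploy()        # hypothetical entry point into the training pipeline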
---------------------------------------------------------------------------------------------------------------------
Some Important Questions