
MACHINE LEARNING

(Study Material)
UNIT- I
Introduction to Machine Learning
What is Machine Learning?
Machine learning (ML) is a type of Artificial Intelligence (AI) that allows computers to
learn and make decisions without being explicitly programmed. It involves feeding data
into algorithms that can then identify patterns and make predictions on new data. Machine
learning is used in a wide variety of applications, including image and speech recognition,
natural language processing, and recommender systems.

Why do we need Machine Learning?

Machine learning can learn and train from data to solve problems and make predictions that cannot be handled with traditional programming. It enables better decision making and helps solve complex business problems in less time. Machine learning has applications in various fields such as healthcare, finance, education, sports, and more.
Let's explore some reasons why machine learning has become essential in every field:

1. Solving Complex Business Problems:

Problems such as image recognition, natural language processing, and disease diagnosis are too complex to tackle with traditional programming. Machine learning can handle such problems by learning from examples and making predictions, rather than following rigid, hand-written rules.
2. Handling Large Volumes of Data:
The expansion of the Internet and its users produces massive amounts of data. Machine learning can process this data effectively, analyze it, and extract useful insights from it.
 For example, ML can analyze millions of everyday transactions to detect fraudulent activity in real time.
 Social platforms like Facebook and Instagram use ML to analyze billions of posts, likes, and shares to predict the next recommendation in your feed.
3. Automating Repetitive Tasks:
With machine learning, we can automate time-consuming and repetitive tasks with better accuracy.
 Gmail uses ML to filter out spam emails and keep your inbox clean and spam-free. Handling this with traditional programming or manually would make the system error-prone.
 Customer-support chatbots can use ML to resolve frequently occurring requests such as checking order status or resetting a password.
 Large organizations can use ML to process large amounts of data (such as invoices) to extract historical and current key insights.
4. Personalized User Experience:
Social media, OTT, and e-commerce platforms use machine learning to recommend a better feed based on user preferences and interests.
 Netflix recommends movies and TV shows based on what you have watched.
 E-commerce platforms suggest products you are likely to buy.
5. Self-Improvement in Performance:
ML models can improve themselves as they receive more data, such as user behaviour and feedback. For example,
 Voice Assistants (Siri, Alexa, Google Assistant) – Voice assistants continuously
improve as they process millions of voice inputs. They adapt to user preferences,
understand regional accents better, and handle ambiguous queries more effectively.
 Search Engines (Google, Bing) – Search engines analyze user behavior to refine their
ranking algorithms.
 Self-driving Cars – Self-driving cars use data from millions of miles driven (both in
simulations and real-world scenarios) to enhance their decision-making.

Applications of Machine learning

1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, and other features in digital images. A popular use case of image recognition and face detection is the automatic friend-tagging suggestion.

Facebook provides a feature of automatic friend-tagging suggestions. Whenever we upload a photo with our Facebook friends, we automatically get a tagging suggestion with names, and the technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "DeepFace," which is responsible for face recognition and person identification in pictures.

2. Speech Recognition
While using Google, we get an option of "Search by voice." This comes under speech recognition and is a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as "speech to text" or "computer speech recognition." At present, machine learning algorithms are widely used in various speech recognition applications. Google Assistant, Siri, Cortana, and Alexa use speech recognition technology to follow voice instructions.

3. Traffic Prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the correct path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily congested, in two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time
Everyone who uses Google Maps helps make the app better. It takes information from the user and sends it back to its database to improve performance.

4. Product Recommendations:
Machine learning is widely used by e-commerce and entertainment companies such as Amazon and Netflix for product recommendation. Whenever we search for a product on Amazon, we start seeing advertisements for the same product while browsing the internet in the same browser, and this is because of machine learning.

Google understands user interest using various machine learning algorithms and suggests products according to customer interest.

Similarly, when we use Netflix, we find recommendations for series, movies, and other entertainment, and this is also done with the help of machine learning.

5. Self-Driving Cars:
One of the most exciting applications of machine learning is self-driving cars, where machine learning plays a significant role. Tesla, a well-known car manufacturer, is working on self-driving cars and uses machine learning methods to train its car models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam. Important mail arrives in our inbox marked with the important symbol, while spam emails land in our spam box, and the technology behind this is machine learning. Below are some spam filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
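To make this concrete, below is a minimal, hypothetical sketch of spam filtering with a Naive Bayes classifier using scikit-learn; the example messages and labels are invented for illustration and do not come from any real spam dataset.

# Hypothetical sketch: spam filtering with a Naive Bayes classifier (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set: 1 = spam, 0 = not spam.
emails = ["win a free prize now", "meeting at 10 am tomorrow",
          "free lottery ticket claim now", "project report attached"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()            # turn text into word-count features
X = vectorizer.fit_transform(emails)

model = MultinomialNB()                   # Naive Bayes works on these counts
model.fit(X, labels)

new_email = vectorizer.transform(["claim your free prize"])
print(model.predict(new_email))           # should print [1], i.e. spam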

7. Virtual Personal Assistants:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As the name suggests, they help us find information using voice instructions. These assistants can help us in various ways just through voice commands, such as playing music, calling someone, opening an email, or scheduling an appointment.

Machine learning algorithms are an important part of these virtual assistants.

8. Online Fraud Detection:

Machine learning makes our online transactions safe and secure by detecting fraudulent transactions. Whenever we perform an online transaction, fraud can take place in various ways, such as fake accounts, fake IDs, or money being stolen in the middle of a transaction. To detect this, a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values become the input for the next round. Genuine transactions follow a specific pattern that changes for fraudulent transactions; the system detects this change and makes our online transactions more secure.

9. Stock Market Trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk of ups and downs in share prices, so long short-term memory (LSTM) neural networks are used to predict stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical technology is growing rapidly and can build 3D models that predict the exact position of lesions in the brain.

This helps in finding brain tumours and other brain-related diseases more easily.


11. Automatic Language Translation:
Nowadays, if we visit a new place and do not know the language, it is not a problem at all, because machine learning helps us by converting text into a language we know. Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural machine translation system that translates text into our familiar language, and this is called automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm, which, used together with image recognition, can translate text from one language to another.

Machine Learning Life Cycle

Machine learning has given computer systems the ability to learn automatically without being explicitly programmed. But how does a machine learning system work? It can be described using the machine learning life cycle, a cyclic process for building an efficient machine learning project. The main purpose of the life cycle is to find a solution to the problem or project.

Machine learning life cycle involves seven major steps, which are given below:

o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
1. Gathering Data:

Data gathering is the first step of the machine learning life cycle. The goal of this step is to identify the data related to the problem and obtain it.

In this step, we need to identify the different data sources, as data can be collected from various sources such as files, databases, the internet, or mobile devices. It is one of the most important steps of the life cycle. The quantity and quality of the collected data determine the efficiency of the output: the more data we have, the more accurate the prediction will be.

This step includes the following tasks:

o Identify various data sources
o Collect data
o Integrate the data obtained from different sources
By performing the above tasks, we get a coherent set of data, also called a dataset, which will be used in further steps.

2. Data Preparation

After collecting the data, we need to prepare it for further steps. Data preparation is the step where we put our data into a suitable place and prepare it for use in machine learning training.
In this step, we first put all the data together and then randomize its ordering.
This step can be further divided into two processes:

o Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is pre-processing of data for its analysis.

3. Data Wrangling

Data wrangling is the process of cleaning and converting raw data into a usable format. It involves cleaning the data, selecting the variables to use, and transforming the data into a proper format to make it more suitable for analysis in the next step. It is one of the most important steps of the complete process, as cleaning the data is required to address quality issues.

The data we have collected is not always useful, as some of it may be irrelevant. In real-world applications, collected data may have various issues, including:

o Missing Values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data.

It is mandatory to detect and remove the above issues because they can negatively affect the quality of the outcome.

4. Data Analysis

Now the cleaned and prepared data is passed on to the analysis step. This step involves:

o Selection of analytical techniques


o Building models
o Review the result
The aim of this step is to build a machine learning model that analyses the data using various analytical techniques and to review the outcome. It starts with determining the type of problem, where we select machine learning techniques such as classification, regression, cluster analysis, or association; we then build the model using the prepared data and evaluate it.

5. Train Model

The next step is to train the model. In this step, we train our model to improve its performance and obtain a better outcome for the problem.

We use datasets to train the model using various machine learning algorithms. Training a model is required so that it can learn the various patterns, rules, and features.

6. Test Model

Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.

Testing the model determines the percentage accuracy of the model as per the requirement of
project or problem.

7. Deployment

The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.

If the above-prepared model is producing an accurate result as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the
project, we will check whether it is improving its performance using available data or not.
The deployment phase is similar to making the final report for a project.
Types of Machine Learning
 Supervised Machine Learning
 Unsupervised Machine Learning
 Semi-Supervised Machine Learning
 Reinforcement Machine Learning

Supervised Machine Learning


Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and on the basis of that data, machines predict the output. Labelled data means that the input data is already tagged with the correct output.

In supervised learning, the training data provided to the machine works as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).

In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.

How Does Supervised Learning Work?

In supervised learning, models are trained using a labelled dataset, where the model learns about each type of data. Once the training process is completed, the model is tested on the basis of test data (a held-out subset of the dataset), and it then predicts the output.

The working of supervised learning can be easily understood with the example below.

Suppose we have a dataset of different types of shapes, including squares, rectangles, triangles, and polygons. The first step is to train the model on each shape:

o If the given shape has four sides, and all the sides are equal, then it will be labelled as a Square.
o If the given shape has three sides, then it will be labelled as a Triangle.
o If the given shape has six equal sides, then it will be labelled as a Hexagon.
Now, after training, we test our model using the test set, and the task of the model is to identify the shape.

The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.

Steps Involved in Supervised Learning:

o First, determine the type of training dataset.
o Collect/gather the labelled training data.
o Split the dataset into a training set, a test set, and a validation set.
o Determine the input features of the training dataset, which should carry enough information so that the model can accurately predict the output.
o Determine a suitable algorithm for the model, such as a support vector machine or a decision tree.
o Execute the algorithm on the training dataset. Sometimes we need validation sets to tune the control parameters; these are subsets of the training data.
o Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, our model is accurate.

Types of supervised Machine learning Algorithms:

Supervised learning can be further divided into two types of problems:

1. Regression

Regression algorithms are used when there is a relationship between the input variable and the output variable and the output is continuous. They are used for the prediction of continuous quantities, such as weather forecasting and market trends. Below are some popular regression algorithms that come under supervised learning (a short example follows the list):

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
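As a brief illustration of regression, here is a minimal sketch of linear regression with scikit-learn on a small invented dataset (house size versus price); all numbers are made up for demonstration only.

# Hypothetical sketch: predicting a continuous value with linear regression.
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: house size (sq. ft.) -> price (in thousands).
X = np.array([[500], [750], [1000], [1250], [1500]])
y = np.array([150, 200, 260, 310, 360])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)      # learned slope and intercept
print(model.predict([[1100]]))            # predicted price for an 1100 sq. ft. house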
2. Classification

Classification algorithms are used when the output variable is categorical, meaning there are two or more classes such as Yes-No, Male-Female, or True-False. Spam filtering is a typical classification task. Popular classification algorithms include (a short example follows the list):

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
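Similarly, any classification algorithm from the list above can be trained in a few lines. The sketch below uses a decision tree on scikit-learn's built-in Iris dataset; the 70/30 train/test split is an arbitrary choice made for illustration.

# Hypothetical sketch: classification with a decision tree on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)                 # learn rules from labelled examples

predictions = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))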
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
 Image classification: Identify objects, faces, and other features in images.
 Natural language processing: Extract information from text, such as sentiment, entities,
and relationships.
 Speech recognition: Convert spoken language into text.
 Recommendation systems: Make personalized recommendations to users.
 Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
 Medical diagnosis: Detect diseases and other medical conditions.
 Fraud detection: Identify fraudulent transactions.
 Autonomous vehicles: Recognize and respond to objects in the environment.
 Email spam detection: Classify emails as spam or not spam.
 Quality control in manufacturing: Inspect products for defects.
 Credit scoring: Assess the risk of a borrower defaulting on a loan.
 Gaming: Recognize characters, analyse player behaviour, and create NPCs.
 Customer support: Automate customer support tasks.
 Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
 Sports analytics: Analyse player performance, make game predictions, and optimize
strategies.

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling very complex tasks.
o Supervised learning cannot predict the correct output if the test data is very different from the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.

Unsupervised Machine Learning


In the previous topic, we learned about supervised machine learning, in which models are trained using labelled data. But in many cases we do not have labelled data and need to find the hidden patterns in the given dataset. To handle such cases in machine learning, we need unsupervised learning techniques.

What is Unsupervised Learning?


As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained on an unlabelled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of
the unsupervised learning algorithm is to identify the image features on their own.
Unsupervised learning algorithm will perform this task by clustering the image dataset into
the groups according to similarities between images.

Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from data.
o Unsupervised learning is similar to how a human learns to think through their own experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabelled and uncategorized data, which makes it especially important.
o In the real world, we do not always have input data with corresponding outputs, so to solve such cases we need unsupervised learning.

Working of Unsupervised Learning

The working of unsupervised learning can be understood as follows:

Here, we take unlabelled input data, which means it is not categorized and corresponding outputs are not given. This unlabelled input data is fed to the machine learning model in order to train it. First, the model interprets the raw data to find hidden patterns in it and then applies a suitable algorithm, such as k-means clustering, hierarchical clustering, etc.

Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.

Types of Unsupervised Learning Algorithms:

Unsupervised learning algorithms can be further categorized into two types of problems:

o Clustering: Clustering is a method of grouping objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with the objects of another group. Cluster analysis finds the commonalities between data objects and categorizes them according to the presence or absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding relationships between variables in a large database. It determines the sets of items that occur together in the dataset. Association rules make marketing strategies more effective; for example, people who buy item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association rules is market basket analysis (a small sketch follows below).
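To illustrate the idea behind association rules, the following sketch counts how often pairs of items appear together in a few invented shopping baskets and estimates the confidence of the rule "bread → butter"; it is a simplified stand-in for full algorithms such as Apriori, and the baskets are made up.

# Hypothetical sketch: simple market basket analysis via pair counting.
from itertools import combinations
from collections import Counter

baskets = [                                  # invented transactions
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    for item in basket:
        item_counts[item] += 1
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Confidence of the rule bread -> butter, i.e. P(butter | bread).
confidence = pair_counts[("bread", "butter")] / item_counts["bread"]
print("confidence(bread -> butter) =", confidence)   # 2/3 for this toy data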
Unsupervised Learning Algorithms:

Below is a list of some popular unsupervised learning algorithms:

o K-means clustering
o KNN (k-nearest neighbours)
o Hierarchical clustering
o Anomaly detection
o Neural networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
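As an example of the first algorithm in the list, here is a minimal k-means clustering sketch with scikit-learn on a handful of invented 2-D points; the choice of two clusters is an assumption made for the demonstration.

# Hypothetical sketch: grouping unlabelled points with k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

# Invented, unlabelled 2-D data points forming two rough groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)                              # no labels are provided

print(kmeans.labels_)                      # cluster assignment for each point
print(kmeans.cluster_centers_)             # learned cluster centres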

Applications of Unsupervised Learning


Here are some common applications of unsupervised learning:
 Clustering: Group similar data points into clusters.
 Anomaly detection: Identify outliers or anomalies in data.
 Dimensionality reduction: Reduce the dimensionality of data while preserving its
essential information.
 Recommendation systems: Suggest products, movies, or content to users based on their
historical behaviour or preferences.
 Topic modelling: Discover latent topics within a collection of documents.
 Density estimation: Estimate the probability density function of data.
 Image and video compression: Reduce the amount of storage required for multimedia
content.
 Data pre-processing: Help with data pre-processing tasks such as data cleaning,
imputation of missing values, and data scaling.
 Market basket analysis: Discover associations between products.
 Genomic data analysis: Identify patterns or group genes with similar expression
profiles.
 Image segmentation: Segment images into meaningful regions.
 Community detection in social networks: Identify communities or groups of individuals
with similar interests or connections.
 Customer behaviour analysis: Uncover patterns and insights for better marketing and
product recommendations.
 Content recommendation: Classify and tag content to make it easier to recommend
similar items to users.
 Exploratory data analysis (EDA): Explore data and gain insights before defining
specific tasks.

Advantages of Unsupervised Learning

o Unsupervised learning can be used for more complex tasks compared to supervised learning because, in unsupervised learning, we do not need labelled input data.
o Unsupervised learning is preferable when unlabelled data is easier to obtain than labelled data.

Disadvantages of Unsupervised Learning


o Unsupervised learning is intrinsically more difficult than supervised learning as it does not have corresponding output labels.
o The result of an unsupervised learning algorithm may be less accurate because the input data is not labelled, and the algorithm does not know the exact output in advance.

Semi-Supervised Learning
Semi-supervised learning is a machine learning approach that sits between supervised and unsupervised learning, so it uses both labelled and unlabelled data. It is particularly useful when obtaining labelled data is costly, time-consuming, or resource-intensive, or when labelling requires special skills and resources. We use these techniques when a small portion of the data is labelled and the remaining large portion is unlabelled. We can use unsupervised techniques to predict labels and then feed these labels to supervised techniques. This technique is mostly applicable to image datasets, where usually not all images are labelled.

Let's understand it with the help of an example.

Example: Consider that we are building a language translation model; having labelled translations for every sentence pair can be resource-intensive. Semi-supervised learning allows the model to learn from both labelled and unlabelled sentence pairs, making it more accurate. This technique has led to significant improvements in the quality of machine translation services.

Types of Semi-Supervised Learning Methods


There are a number of different semi-supervised learning methods, each with its own characteristics. Some of the most common ones include:
 Graph-based semi-supervised learning: This approach uses a graph to represent the relationships between the data points. The graph is then used to propagate labels from the labelled data points to the unlabelled data points.
 Label propagation: This approach iteratively propagates labels from the labelled data points to the unlabelled data points, based on the similarities between the data points.
 Co-training: This approach trains two different machine learning models on different views (feature subsets) of the data. Each model is then used to label unlabelled examples for the other.
 Self-training: This approach trains a machine learning model on the labelled data and then uses the model to predict labels for the unlabelled data. The model is then retrained on the labelled data together with the confidently predicted labels for the unlabelled data (see the sketch after this list).
 Generative adversarial networks (GANs): GANs are a type of deep learning algorithm that can be used to generate synthetic data. GANs can be used to generate unlabelled data for semi-supervised learning by training two neural networks, a generator and a discriminator.
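To make the self-training idea concrete, the sketch below trains a classifier on a small labelled subset of scikit-learn's digits dataset, predicts pseudo-labels for the unlabelled remainder, keeps only confident predictions, and retrains; the 0.9 confidence threshold and the 50-example labelled set are arbitrary assumptions for illustration.

# Hypothetical sketch: one round of self-training (semi-supervised learning).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_labelled, y_labelled = X[:50], y[:50]     # small labelled portion
X_unlabelled = X[50:]                       # labels assumed unavailable

model = LogisticRegression(max_iter=2000)
model.fit(X_labelled, y_labelled)           # step 1: train on labelled data

probs = model.predict_proba(X_unlabelled)   # step 2: score the unlabelled data
confident = probs.max(axis=1) > 0.9         # keep only confident predictions
pseudo_labels = model.predict(X_unlabelled) # predicted (pseudo) labels

# Step 3: retrain on labelled data plus confidently pseudo-labelled data.
X_combined = np.vstack([X_labelled, X_unlabelled[confident]])
y_combined = np.concatenate([y_labelled, pseudo_labels[confident]])
model.fit(X_combined, y_combined)
print("Pseudo-labelled examples used:", int(confident.sum()))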

Advantages of Semi-Supervised Machine Learning

 It leads to better generalization compared to purely supervised learning, as it uses both labelled and unlabelled data.
 It can be applied to a wide range of data.

Disadvantages of Semi-Supervised Machine Learning

 Semi-supervised methods can be more complex to implement compared to other approaches.
 It still requires some labelled data, which might not always be available or easy to obtain.
 Noisy or unrepresentative unlabelled data can degrade the model's performance.

Applications of Semi-Supervised Learning


Here are some common applications of semi-supervised learning:
 Image Classification and Object Recognition: Improve the accuracy of models by
combining a small set of labelled images with a larger set of unlabelled images.
 Natural Language Processing (NLP): Enhance the performance of language models and
classifiers by combining a small set of labelled text data with a vast amount of
unlabelled text.
 Speech Recognition: Improve the accuracy of speech recognition by leveraging a
limited amount of transcribed speech data and a more extensive set of unlabelled audio.
 Recommendation Systems: Improve the accuracy of personalized recommendations by
supplementing a sparse set of user-item interactions (labelled data) with a wealth of
unlabelled user behaviour data.
 Healthcare and Medical Imaging: Enhance medical image analysis by utilizing a small
set of labelled medical images alongside a larger set of unlabelled images.

Reinforcement Machine Learning

Reinforcement learning is a learning method in which an agent interacts with the environment by producing actions and discovering errors. Trial, error, and delayed reward are the most relevant characteristics of reinforcement learning. In this technique, the model keeps improving its performance using reward feedback to learn the desired behaviour or pattern. These algorithms are often tailored to a particular problem, e.g. Google's self-driving car or AlphaGo, where a bot competes with humans and even itself to become a better and better Go player. Each time the agent gathers experience, it adds that data to its knowledge, which serves as training data. So, the more it learns, the better trained and more experienced it becomes.
Here are some of the most common reinforcement learning algorithms:
 Q-learning: Q-learning is a model-free RL algorithm that learns a Q-function over state-action pairs. The Q-function estimates the expected reward of taking a particular action in a given state (a small sketch of the update rule appears after the example below).
 SARSA (State-Action-Reward-State-Action): SARSA is another model-free RL algorithm that learns a Q-function. However, unlike Q-learning, SARSA updates the Q-function using the action that was actually taken, rather than the optimal action.
 Deep Q-learning: Deep Q-learning is a combination of Q-learning and deep learning. It uses a neural network to represent the Q-function, which allows it to learn complex relationships between states and actions.
Let's understand it with the help of an example.
Example: Consider that you are training an AI agent to play a game like chess. The agent explores different moves and receives positive or negative feedback based on the outcome. Reinforcement learning also finds applications in robotics and other settings where agents learn to perform tasks by interacting with their surroundings.
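A minimal sketch of the tabular Q-learning update mentioned above is shown below; the tiny "walk right to the goal" environment, the learning rate, the discount factor, and the exploration rate are all invented for illustration.

# Hypothetical sketch: tabular Q-learning on a tiny 5-state corridor.
import random

n_states, n_actions = 5, 2                 # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Move left/right; reaching the last state gives reward 1 and ends the episode."""
    next_state = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == n_states - 1 else 0.0
    done = next_state == n_states - 1
    return next_state, reward, done

for episode in range(500):
    state, done = 0, False
    while not done:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)   # "go right" (action 1) should end up with the higher value in every state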

Types of Reinforcement Machine Learning


There are two main types of reinforcement learning:
Positive reinforcement
 Rewards the agent for taking a desired action.
 Encourages the agent to repeat the behaviour.
 Examples: Giving a treat to a dog for sitting, providing a point in a game for a correct
answer.
Negative reinforcement
 Removes an undesirable stimulus to encourage a desired behavior.
 Discourages the agent from repeating the behavior.
 Examples: Turning off a loud buzzer when a lever is pressed, avoiding a penalty by
completing a task.

Advantages of Reinforcement Machine Learning


 It enables autonomous decision-making and is well-suited for tasks that require learning a sequence of decisions, such as robotics and game-playing.
 This technique is preferred for achieving long-term results that are otherwise very difficult to achieve.
 It can be used to solve complex problems that cannot be solved by conventional techniques.

Disadvantages of Reinforcement Machine Learning


 Training reinforcement learning agents can be computationally expensive and time-consuming.
 Reinforcement learning is not preferable for solving simple problems.
 It needs a lot of data and a lot of computation, which can make it impractical and costly.

Applications of Reinforcement Machine Learning


Here are some applications of reinforcement learning:
 Game Playing: RL can teach agents to play games, even complex ones.
 Robotics: RL can teach robots to perform tasks autonomously.
 Autonomous Vehicles: RL can help self-driving cars navigate and make decisions.
 Recommendation Systems: RL can enhance recommendation algorithms by learning
user preferences.
 Healthcare: RL can be used to optimize treatment plans and drug discovery.
 Natural Language Processing (NLP): RL can be used in dialogue systems and chatbots.
 Finance and Trading: RL can be used for algorithmic trading.
 Supply Chain and Inventory Management: RL can be used to optimize supply chain
operations.
 Energy Management: RL can be used to optimize energy consumption.
 Game AI: RL can be used to create more intelligent and adaptive NPCs in video games.
 Adaptive Personal Assistants: RL can be used to improve personal assistants.
 Virtual Reality (VR) and Augmented Reality (AR): RL can be used to create immersive
and interactive experiences.
 Industrial Control: RL can be used to optimize industrial processes.
 Education: RL can be used to create adaptive learning systems.
 Agriculture: RL can be used to optimize agricultural operations.

Designing a Learning System

According to Tom Mitchell, "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Example: In spam e-mail detection,
 Task, T: To classify mails as Spam or Not Spam.
 Performance measure, P: Total percentage of mails correctly classified as "Spam" or "Not Spam".
 Experience, E: A set of mails labelled "Spam" or "Not Spam".

The steps for designing a learning system are:

Step 1:
Choosing the Training Experience: The first and most important task is to choose the training data or training experience that will be fed to the machine learning algorithm. The data or experience we feed to the algorithm has a significant impact on the success or failure of the model, so the training data or experience should be chosen wisely.
Below are the attributes that impact the success or failure of the model:
 Whether the training experience provides direct or indirect feedback regarding choices. For example, while playing chess the training experience can provide feedback such as: if this move is chosen instead of that one, the chances of success increase.
 The second important attribute is the degree to which the learner controls the sequence of training examples. For example, when training data is first fed to the machine, accuracy is very low, but as it gains experience by playing again and again with itself or an opponent, the algorithm receives feedback and controls the chess game accordingly.
 The third important attribute is how well the training experience represents the distribution of examples over which performance will be measured. A machine learning algorithm gains experience by going through many different cases and examples; the more examples it passes through, the more experience it gains and the better its performance becomes.
Step 2:
Choosing the Target Function: The next important step is choosing the target function. This means that, based on the knowledge fed to the algorithm, the system will learn a NextMove function that describes which legal move should be taken. For example, while playing chess against an opponent, when the opponent plays, the machine learning algorithm decides which of the possible legal moves to take in order to succeed.

Step 3:
Choosing a Representation for the Target Function: Once the algorithm knows all the possible legal moves, the next step is to choose a representation for the target function, e.g. linear equations, a hierarchical graph representation, a tabular form, etc. The NextMove function then selects, out of these moves, the one that gives the highest success rate. For example, if the chess-playing machine has four possible moves, it will choose the optimal move that leads it towards success.

Step 4:
Choosing a Function Approximation Algorithm: An optimal move cannot be chosen from the training data alone. The system has to work through a set of examples, and from these examples it approximates which moves should be chosen; the machine then receives feedback on them. For example, when training data for playing chess is fed to the algorithm, the machine will at first fail or succeed, and from that failure or success it estimates, for the next move, which step should be chosen and what its success rate is.

Step 5:
Final Design: The final design is created once the system has gone through a number of examples, failures and successes, and correct and incorrect decisions, and has learned what the next step should be. Example: Deep Blue is an intelligent, ML-based computer that won a chess match against the chess grandmaster Garry Kasparov, becoming the first computer to beat a human world chess champion.

Mathematical foundations of machine learning:


Machine learning is an interdisciplinary field that involves computer science, statistics, and
mathematics. In particular, mathematics plays a critical role in developing and understanding
machine learning algorithms. In this chapter, we will discuss the mathematical concepts that
are essential for machine learning, including linear algebra, calculus, probability, and
statistics.
Linear Algebra

Linear algebra is the branch of mathematics that deals with linear equations and their
representation in vector spaces. In machine learning, linear algebra is used to represent and
manipulate data. In particular, vectors and matrices are used to represent and manipulate data
points, features, and weights in machine learning models.

A vector is an ordered list of numbers, while a matrix is a rectangular array of numbers. For
example, a vector can represent a single data point, and a matrix can represent a dataset.
Linear algebra operations, such as matrix multiplication and inversion, can be used to
transform and analyse data.

Following are some of the important linear algebra concepts and their roles in machine learning −

 Vectors and matrices − Vectors and matrices are used to represent datasets, features, target values, weights, etc.
 Matrix operations − Operations such as addition, multiplication, subtraction, and transpose are used in almost all ML algorithms.
 Eigenvalues and eigenvectors − These are very useful in dimensionality-reduction algorithms such as principal component analysis (PCA).
 Projection − The concepts of a hyperplane and projection onto a plane are essential to understand support vector machines (SVM).
 Factorization − Matrix factorization and singular value decomposition (SVD) are used to extract important information from a dataset.
 Tensors − Tensors are used in deep learning to represent multidimensional data. A tensor can represent a scalar, a vector, or a matrix.
 Gradients − Gradients are used to find optimal values of the model parameters.
 Jacobian matrix − The Jacobian matrix is used to analyse the relationship between input and output variables in an ML model.
 Orthogonality − This is a core concept used in algorithms like principal component analysis (PCA) and support vector machines (SVM).
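The sketch below shows, with NumPy, a few of the linear algebra operations listed above (vectors, matrix multiplication, transpose, and eigen-decomposition); the small matrices are arbitrary examples chosen only for illustration.

# Hypothetical sketch: basic linear algebra operations used in ML, with NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0])              # a feature vector
W = np.array([[0.2, 0.8, -0.5],
              [0.5, -0.9, 0.3]])           # a 2x3 weight matrix

y = W @ x                                  # matrix-vector multiplication, as in a linear model
print("W @ x =", y)

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
print("A transpose:\n", A.T)               # transpose

eigenvalues, eigenvectors = np.linalg.eig(A)   # eigen-decomposition, used in PCA-style analyses
print("eigenvalues:", eigenvalues)
print("eigenvectors:\n", eigenvectors)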

Calculus

Calculus is the branch of mathematics that deals with rates of change and accumulation. In
machine learning, calculus is used to optimize models by finding the minimum or maximum
of a function. In particular, gradient descent, a widely used optimization algorithm, is based
on calculus.

Gradient descent is an iterative optimization algorithm that updates the weights of a model
based on the gradient of the loss function. The gradient is the vector of partial derivatives of
the loss function with respect to each weight. By iteratively updating the weights in the
direction of the negative gradient, gradient descent tries to minimize the loss function.
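The following sketch runs plain gradient descent on a one-parameter quadratic loss, L(w) = (w − 3)², whose minimum is at w = 3; the learning rate and iteration count are arbitrary choices made for illustration.

# Hypothetical sketch: gradient descent minimising L(w) = (w - 3)^2.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)         # derivative of the loss with respect to w

w = 0.0                        # initial weight
learning_rate = 0.1

for step in range(100):
    w = w - learning_rate * gradient(w)   # move against the gradient

print("Learned w:", w)         # approaches 3, the minimiser of the loss
print("Final loss:", loss(w))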

Following are some of the important calculus concepts essential for machine learning −

 Functions − Functions are at the core of machine learning. In machine learning, a model learns a function between inputs and outputs during the training phase. You should learn the basics of functions, including continuous and discrete functions.
 Derivative, gradient, and slope − These are the core concepts needed to understand how optimization algorithms, like gradient descent, work.
 Partial derivatives − These are used to find the maxima or minima of a function and are generally used in optimization algorithms.
 Chain rule − The chain rule is used to calculate the derivatives of loss functions with multiple variables. You can see the application of the chain rule mainly in neural networks (backpropagation).
 Optimization methods − These methods are used to find the optimal values of parameters that minimize the cost function. Gradient descent is one of the most used optimization methods.

Probability Theory

Probability is the branch of mathematics that deals with uncertainty and randomness. In
machine learning, probability is used to model and analyse data that are uncertain or variable.
In particular, probability distributions, such as Gaussian and Poisson distributions, are used to
model the probability of data points or events.

Bayesian inference, a probabilistic modeling technique, is also widely used in machine learning. Bayesian inference is based on Bayes' theorem, which states that the probability of a hypothesis given the data is proportional to the probability of the data given the hypothesis multiplied by the prior probability of the hypothesis. By updating the prior probability based on the observed data, Bayesian inference can make probabilistic predictions or classifications.
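A small worked sketch of Bayes' theorem follows, using invented numbers for a disease-testing scenario (1% prevalence, 99% sensitivity, 5% false-positive rate); only the arithmetic matters here.

# Hypothetical sketch: Bayes' theorem P(H | D) = P(D | H) * P(H) / P(D).
p_disease = 0.01                 # prior probability of the hypothesis (disease)
p_pos_given_disease = 0.99       # P(data | hypothesis): test sensitivity
p_pos_given_healthy = 0.05       # false-positive rate

# Total probability of a positive test, i.e. the evidence P(D).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Posterior probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print("P(disease | positive test) =", round(p_disease_given_pos, 3))   # about 0.167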

Following are some of the important probability theory concepts essential for machine learning −

 Simple probability − This is a fundamental concept in machine learning. All classification problems use probability concepts; for example, the softmax function produces class probabilities in artificial neural networks.
 Conditional probability − Classification algorithms like the Naive Bayes classifier are based on conditional probability.
 Random variables − Random variables are used to assign initial values to the model parameters. Parameter initialization is considered the start of the training process.
 Probability distributions − These are used in defining loss functions for classification problems.
 Continuous and discrete distributions − These distributions are used to model different types of data in ML.
 Distribution functions − These functions are often used to model the distribution of error terms in linear regression and other statistical models.
 Maximum likelihood estimation − This is the basis of some machine learning and deep learning approaches used for classification problems.

Statistics

Statistics is the branch of mathematics that deals with the collection, analysis, interpretation,
and presentation of data. In machine learning, statistics is used to evaluate and compare
models, estimate model parameters, and test hypotheses.

For example, cross-validation is a statistical technique that is used to evaluate the performance of a model on new, unseen data. In cross-validation, the dataset is split into multiple subsets, and the model is trained and evaluated on each subset. This allows us to estimate the model's performance on new data and compare different models.
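The cross-validation idea described above can be expressed in a few lines with scikit-learn, as in this minimal sketch on the Iris dataset; the choice of 5 folds is arbitrary.

# Hypothetical sketch: 5-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # train and evaluate on 5 different splits
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())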
Following are some of the important statistics concepts essential for machine learning −

 Mean, Median, Mode: These measures are used to understand the distribution of data and
identify outliers.
 Standard deviation, Variance: These are used to understand the variability of a dataset and
to detect outliers.
 Percentiles: These are used to summarize the distribution of a dataset and identify outliers.
 Data Distribution: It is how data points are distributed or spread out across a dataset.
 Skewness and Kurtosis: These are two important measures of the shape of a probability
distribution in machine learning.
 Bias and Variance: They describe the sources of error in a model's predictions.
 Hypothesis Testing: A statistical procedure for testing whether a tentative assumption or idea about the data is supported by the evidence.
 Linear Regression: It is the most used regression algorithm in supervised machine learning.
 Logistic Regression: It's also an important supervised learning algorithm mostly used in
machine learning.

Random Variable
A random variable is a fundamental concept in statistics that bridges the gap between theoretical probability and real-world data. A random variable in statistics is a function that assigns a real value to each outcome in the sample space of a random experiment.

For example: if you roll a die, you can assign a number to each possible outcome.
There are two basic types of random variables:
 Discrete Random Variables (which take on specific values).
 Continuous Random Variables (assume any value within a given range).

We define a random variable as a function that maps from the sample space of an
experiment to the real numbers. Mathematically, Random Variable is expressed as,
X: S → R
Where,
 X is the random variable (usually denoted by a capital letter)
 S is the sample space
 R is the set of real numbers

Random Variable Examples


Example 1
If two unbiased coins are tossed, find the random variable associated with that event.
Solution:
Suppose two unbiased coins are tossed and let
X = number of heads. [X is a random variable, i.e. a function]
Here, the sample space S = {HH, HT, TH, TT}

Example 2
Suppose a random variable X takes m different values, X = {x1, x2, x3, …, xm},
with corresponding probabilities P(X = xi) = pi, where 1 ≤ i ≤ m.
The probabilities must satisfy the following conditions:
 0 ≤ pi ≤ 1, where 1 ≤ i ≤ m
 p1 + p2 + p3 + … + pm = 1, i.e. 0 ≤ pi ≤ 1 and ∑pi = 1
For example, Suppose a die is thrown (X = outcome of the dice).
Here, the sample space S = {1, 2, 3, 4, 5, 6}.
The output of the function will be:
 P(X = 1) = 1/6
 P(X = 2) = 1/6
 P(X = 3) = 1/6
 P(X = 4) = 1/6
 P(X = 5) = 1/6
 P(X = 6) = 1/6
This also satisfies the condition that the probabilities sum to 1, since:
P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
= 6 × 1/6 = 1
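As a quick check of the die example, this small sketch lists the PMF of X, verifies that the probabilities sum to 1, and also computes the expected value E[X] = ∑ xi · pi (the expected-value step is an extra, added here only for illustration).

# Hypothetical sketch: PMF of a fair die as a discrete random variable.
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # P(X = x) = 1/6 for x = 1..6

total = sum(pmf.values())
print("Sum of probabilities:", total)            # 1

expected_value = sum(x * p for x, p in pmf.items())
print("E[X] =", expected_value)                  # 7/2 = 3.5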

Variate
A variate is a general term often used interchangeably with a random variable,
particularly in contexts where the random variable is not yet fully specified by a particular
probabilistic experiment. The range of values that a random variable X can take is denoted
as RX, and individual values within this range are called quantiles. The probability of the
random variable X taking a specific value x is written as P(X = x).

Types of Random Variables


Random variables are of two types that are,
 Discrete Random Variable
 Continuous Random Variable

Discrete Random Variable


A discrete random variable takes on a finite or countably infinite number of values. The probability function associated with it is called the PMF.
PMF (Probability Mass Function)
If X is a discrete random variable and the PMF of X is P(xi) = pi, then
 0 ≤ pi ≤ 1
 ∑pi = 1, where the sum is taken over all possible values of x
Discrete Random Variable Example
Example: Let S = {0, 1, 2} with the following probability table:

xi          0     1     2
P(X = xi)   P1    0.3   0.5

Find the value of P(X = 0).

Solution:
We know that the sum of all probabilities is equal to 1, and P(X = 0) = P1.
P1 + 0.3 + 0.5 = 1
P1 = 0.2
Therefore, P(X = 0) = 0.2.

Continuous Random Variable


A continuous random variable takes on an infinite number of values. The probability function associated with it is called the PDF (Probability Density Function).
PDF (Probability Density Function)
If X is a continuous random variable with P(x < X < x + dx) = f(x)dx, then
 f(x) ≥ 0 for all x
 ∫ f(x) dx = 1 over all values of x
Then f(x) is said to be the PDF of the distribution.
Continuous Random Variable Example
Find the value of P(1 < X < 2), given that
 f(x) = kx³ for 0 ≤ x ≤ 3, and f(x) = 0 otherwise,
where f(x) is a density function.
Solution:
If a function f is a density function, then the total probability is equal to 1. Since X is a continuous random variable, the integral of f(x) over the whole sample space is 1:
∫ f(x) dx = 1
∫ kx³ dx over [0, 3] = 1
k[x⁴/4] evaluated from 0 to 3 = 1
k(3⁴ − 0⁴)/4 = 1
k(81/4) = 1
k = 4/81
Thus,
P(1 < X < 2) = ∫ kx³ dx over [1, 2] = k[x⁴/4] evaluated from 1 to 2
P = (4/81) × (16 − 1)/4
P = 15/81
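The integrals in the example above can also be checked numerically, for instance with scipy, as in this small sketch; it simply confirms that k = 4/81 gives a valid density and that P(1 < X < 2) = 15/81.

# Hypothetical sketch: numerically verifying the continuous example with scipy.
from scipy.integrate import quad

k = 4 / 81
pdf = lambda x: k * x**3                     # f(x) = kx^3 on [0, 3]

total, _ = quad(pdf, 0, 3)                   # should be 1 (a valid density)
prob, _ = quad(pdf, 1, 2)                    # P(1 < X < 2)

print("Total probability:", round(total, 6))     # ~1.0
print("P(1 < X < 2):", round(prob, 6))           # ~0.185185, i.e. 15/81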

Probability Distribution – Function, Formula, Table

A probability distribution describes how the probabilities of different outcomes are assigned to the possible values of a random variable. It provides a way of modeling the likelihood of each outcome in a random experiment.

While a frequency distribution shows how often outcomes occur in a sample or dataset, a
probability distribution assigns probabilities to outcomes in an abstract, theoretical manner,
regardless of any specific dataset. These probabilities represent the likelihood of each
outcome occurring.
In a discrete probability distribution, the random variable takes distinct values (like the
outcome of rolling a die). In a continuous probability distribution, the random variable
can take any value within a certain range (like the height of a person).
Key properties of a probability distribution include:
 The probability of each outcome is greater than or equal to zero.
 The sum of the probabilities of all possible outcomes equals 1.

Probability Theory
Probability theory is an advanced branch of mathematics that deals with measuring the
likelihood of events occurring. It provides tools to analyze situations
involving uncertainty and helps in determining how likely certain outcomes are. This
theory uses the concepts of random variables, sample space, probability distributions, and
more to determine the outcome of any situation.
For Example: Flipping a Coin
Flipping a coin is a random event with two possible outcomes: heads or tails. Each time
you flip a fair coin, there are exactly two possible outcomes, each with an equal chance of
occurring. Therefore, the probability of landing on heads is 1/2, and similarly, the
probability of landing on tails is also 1/2.

Different Approaches In Probability Theory


Probability theory studies random events and tells us about their occurrence. The three
main approaches for studying probability theory are:
 Theoretical Probability
 Experimental Probability
 Subjective Probability

Theoretical Probability
Theoretical probability relies on assumptions to avoid unfeasible or costly repetition of experiments. The theoretical probability of an event A can be calculated as follows:
P(A) = (Number of outcomes favourable to event A) / (Number of all possible outcomes)
Now, let's apply this formula to our coin-tossing case. In tossing a coin, there are two outcomes: head or tail. Hence, the probability of a head on tossing a coin is P(H) = 1/2.
Similarly, the probability of a tail on tossing a coin is P(T) = 1/2.

Experimental Probability
Experimental probability is found by performing a series of experiments and observing their outcomes. These random experiments are also known as trials. The experimental probability of event A can be calculated as follows:
P(A) = (Number of times event A happened) / (Total number of trials)
Now, let's apply this formula to our coin-tossing case. If we tossed a coin 10 times and recorded heads 4 times and tails 6 times, then the experimental probability of a head is P(H) = 4/10.
Similarly, the experimental probability of tails is P(T) = 6/10.
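The sketch below simulates repeated coin tosses to show how the experimental probability of heads approaches the theoretical value of 1/2 as the number of trials grows; the trial counts are arbitrary.

# Hypothetical sketch: experimental vs. theoretical probability of heads.
import random

for n_trials in (10, 100, 10_000):
    heads = sum(1 for _ in range(n_trials) if random.random() < 0.5)
    print(f"{n_trials} tosses -> experimental P(H) = {heads / n_trials:.3f}")
# The experimental probability fluctuates for small n and approaches 0.5 for large n.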

Subjective Probability
Subjective probability refers to the likelihood of an event occurring, as estimated by an
individual based on their personal beliefs, experiences, intuition, or knowledge, rather than
on objective statistical data or formal mathematical models.
Example: A cricket enthusiast might assign a 70% probability to a team’s victory based on
their understanding of the team’s recent form, the opponent’s strengths and weaknesses,
and other relevant factors.

Basics of Probability Theory


Random Experiment
In probability theory, any experiment that can be repeated multiple times without its outcome being affected by previous repetitions is called a random experiment. Tossing a coin, rolling dice, etc. are random experiments.
Sample Space
The set of all possible outcomes for any random experiment is called sample space. For
example, throwing dice results in six outcomes, which are 1, 2, 3, 4, 5, and 6. Thus, its
sample space is (1, 2, 3, 4, 5, 6)
Event
The outcome of any experiment is called an event. Various types of events used in
probability theory are,
 Independent Events: The events whose outcomes are not affected by the outcomes of
other future and/or past events are called independent events. For example, the output
of tossing a coin in repetition is not affected by its previous outcome.
 Dependent Events: The events whose outcomes are affected by the outcome of other
events are called dependent events. For example, picking oranges from a bag that
contains 100 oranges without replacement.
 Mutually Exclusive Events: The events that cannot occur simultaneously are
called mutually exclusive events. For example, obtaining a head or a tail in tossing a
coin, because both (head and tail) cannot be obtained together.
 Equally likely Events: The events that have an equal chance or probability of
happening are known as equally likely events. For example, observing any face in
rolling dice has an equal probability of 1/6.

Random Variable
A variable that can assume the value of all possible outcomes of an experiment is called a
random variable in Probability Theory. Random variables in probability theory are of two
types which are discussed below,
Discrete Random Variable
Variables that can take countable values such as 0, 1, 2,… are called discrete random
variables.
Continuous Random Variable
Variables that can take an infinite number of values in a given range are called continuous
random variables.
Probability Theory Formulas
Various formulas are used in probability theory and some of them are discussed below,
 Theoretical Probability Formula: (Number of Favourable Outcomes) / (Number of Total Outcomes)
 Empirical Probability Formula: (Number of times event A happened) / (Total number of trials)
 Addition Rule of Probability: P(A ∪ B) = P(A) + P(B) – P(A∩B)

 Independent Events: P(A∩B) = P(A) ⋅ P(B)


 Complementary Rule of Probability: P(A’) = 1 – P(A)

 Bayes’ Theorem: P(A | B) = P(B | A) ⋅ P(A) / P(B)


 Conditional Probability: P(A | B) = P(A∩B) / P(B)
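Several of these rules can be checked numerically. The sketch below is purely illustrative: events A ("the roll is even") and B ("the roll is greater than 3") are defined on a single fair die roll, and exact fractions are used so the identities hold exactly.

Code:
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}      # roll is even
B = {4, 5, 6}      # roll is greater than 3

def P(event):
    # Theoretical probability: favourable outcomes / total outcomes
    return Fraction(len(event), len(sample_space))

# Addition rule: P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
print(P(A | B), "=", P(A) + P(B) - P(A & B))      # 2/3 = 2/3

# Conditional probability: P(A | B) = P(A ∩ B) / P(B)
print(P(A & B) / P(B))                            # 2/3

# Complementary rule: P(A') = 1 – P(A)
print(1 - P(A))                                   # 1/2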

Solved Examples of Probability Theory


We can study the concept of probability with the help of the example discussed below,
Example 1: Let’s take two random dice and roll them randomly, now the probability
of getting a total of 10 is calculated.
Solution:
The sample space of all possible outcomes is {(1,1), (1,2), …, (1,6), …, (6,6)}. The total number of outcomes is 36.
The favourable outcomes, those which add up to 10, are {(4,6), (5,5), (6,4)}.
So the probability of getting a total of 10 is = 3/36 = 1/12

Example 2: A fair coin is tossed three times. What is the probability of getting exactly
two heads?
Solution:
Total possible outcomes when tossing a coin three times = 2³ = 8.
Possible outcomes: HHH, HHT, HTH, THH, HTT, THT, TTH, TTT.
Outcomes with exactly two heads: HHT, HTH, THH (3 outcomes).
Probability of getting exactly two heads:
P(exactly 2 heads) = Number of favorable outcomes / Total outcomes
P(exactly 2 heads) = 3/8
Decision Theory
Decision theory is a foundational concept in Artificial Intelligence (AI), enabling machines
to make rational and informed decisions based on available data. It combines principles
from mathematics, statistics, economics, and psychology to model and improve decision-
making processes. In AI, decision theory provides the framework to predict outcomes,
evaluate choices, and guide actions in uncertain environments.
There are two primary branches of decision theory:
 Normative decision theory focuses on identifying the optimal decision, assuming the
decision-maker is rational and has complete information.
 Descriptive decision theory examines how decisions are actually made in practice,
often dealing with cognitive limitations and psychological biases.
Understanding Decision Theory in Machine Learning
In artificial intelligence, decision theory uses mathematical models to assess possible
outcomes under uncertainty and assist systems in making decisions.
AI systems often use decision theory in two primary ways:
 Supervised learning
 Reinforcement learning

1. Supervised Learning
In supervised learning, AI systems are trained using labelled data to make predictions or
decisions. Decision theory helps optimize the classification or regression tasks by
evaluating the trade-offs between false positives, false negatives, and other outcomes based
on the utility of each result.
For instance, in medical diagnosis, the utility of correctly identifying a disease may be far
higher than the cost of a false alarm, leading the AI to favor sensitivity over specificity.

2. Reinforcement Learning
Reinforcement learning (RL) is one of the key areas where decision theory shines in AI. In
RL, agents learn to make decisions through trial and error, receiving feedback from their
environment in the form of rewards or penalties.
 Markov Decision Processes (MDPs) are a common formalism for decision-making in
reinforcement learning, where decision theory principles help in navigating uncertainty
and maximizing long-term rewards.
 In MDPs, the agent needs to choose actions that optimize future rewards, which aligns
with the decision-theoretic concept of maximizing expected utility.

Components of Decision Theory


1. Agents and Actions: In decision theory, an agent is an entity that makes decisions. The
agent has a set of possible actions or decisions to choose from.
2. States of the World: These represent the possible conditions or scenarios that may
affect the outcome of the agent’s decision. The agent often has incomplete knowledge
about the current or future states of the world.
3. Outcomes and Consequences: Every decision leads to an outcome. Outcomes can be
desirable, neutral, or undesirable, depending on the goals of the agent.
4. Probabilities: Since outcomes are often uncertain, decision theory involves assigning
probabilities to different states or outcomes based on available data.
5. Utility Function: This is a measure of the desirability of an outcome. A utility function
quantifies how much an agent values a specific result, helping in ranking outcomes to
guide decisions.
6. Decision Rules: These are the guidelines the agent follows to choose the best action.
Examples include the Maximization of Expected Utility (MEU), where an agent selects
the action that offers the highest expected utility.
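To make the Maximization of Expected Utility rule concrete, here is a minimal sketch. The decision problem, the state probabilities and the utility values are all hypothetical and chosen only for illustration:

Code:
import numpy as np

# Hypothetical decision: should an agent carry an umbrella?
p_states = np.array([0.3, 0.7])            # P(rain), P(no rain)

# Utility of each (action, state) pair; the numbers are made up.
#                      rain   no rain
utilities = np.array([[  8,      6],       # action 0: take umbrella
                      [ -5,     10]])      # action 1: leave umbrella

expected_utility = utilities @ p_states    # expected utility of each action
best_action = int(np.argmax(expected_utility))

print("Expected utilities:", expected_utility)    # [6.6  5.5]
print("MEU chooses action:", best_action)         # 0 -> take the umbrella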

Bayes Theorem in Machine learning


Bayes theorem is one of the most popular machine learning concepts that helps to calculate
the probability of occurring one event with uncertain knowledge while other one has already
occurred.

Bayes' theorem can be derived using product rule and conditional probability of event X with
known event Y:

o According to the product rule, we can express the probability of event X with
known event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y) {equation 1}

o Further, the probability of event Y with known event X:

P(X ∩ Y) = P(Y|X) P(X) {equation 2}
Mathematically, Bayes theorem can be obtained by equating the right-hand sides of both equations and rearranging. We get:

P(X|Y) = P(Y|X) P(X) / P(Y)

Note that Bayes' theorem holds for any two events X and Y with P(Y) > 0; the events do not need to be independent.

The above equation is called Bayes Rule or Bayes Theorem.

o P(X|Y) is called the posterior, which we need to calculate. It is the updated probability after considering the evidence.
o P(Y|X) is called the likelihood. It is the probability of the evidence when the hypothesis is true.
o P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
o P(Y) is called the marginal probability. It is the probability of the evidence under all possible hypotheses.
Hence, Bayes Theorem can be written as:

Posterior = likelihood * prior / evidence
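As a small numerical illustration of this relationship, consider a hypothetical diagnostic-test setting (the prevalence and test characteristics below are invented numbers used only to show the calculation):

Code:
# posterior = likelihood * prior / evidence
prior = 0.01                  # P(disease): assumed prevalence
likelihood = 0.95             # P(positive | disease): assumed sensitivity
false_positive_rate = 0.10    # P(positive | no disease): assumed

# Evidence: total probability of observing a positive test result.
evidence = likelihood * prior + false_positive_rate * (1 - prior)

posterior = likelihood * prior / evidence     # P(disease | positive)
print("P(disease | positive) =", round(posterior, 4))   # ≈ 0.0876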

Example:

A bag contains 4 balls. Two balls are drawn at random without replacement and are found to
be blue. What is the probability that all balls in the bag are blue?
Solution:

Let E1 = Bag contains two blue balls

E2 = Bag contains three blue balls

E3 = Bag contains four blue balls

A = event of getting two blue balls

P(E1) = P(E2) = P(E3) = ⅓

P(A|E1) = 2C2/4C2 = ⅙

P(A|E2) = 3C2/4C2 = ½

P(A|E3) = 4C2/4C2 = 1

P(E3|A) = P(A|E3)P(E3) / [P(A|E1)P(E1) + P(A|E2)P(E2) + P(A|E3)P(E3)]
= [⅓ × 1] / [⅓ × ⅙ + ⅓ × ½ + ⅓ × 1]

= ⅗.

Example 2:
An unbiased dice is rolled and for each number on the dice a bag is chosen:
Numbers on the Dice Bag chosen
1 Bag A
2 or 3 Bag B
4 or 5 or 6 Bag C
Bag A contains 3 white balls and 2 black balls, bag B contains 3 white balls and 4 black balls, and bag C contains 4 white balls and 5 black balls. The dice is rolled and a bag is chosen; if a white ball is drawn, find the probability that it was drawn from bag B.
Solution:
Let E1 = event of choosing bag A

E2 = event of choosing bag B

E3 = event of choosing bag C

A = event of choosing white ball

Then, P(E1) = ⅙, P(E2) = 2/6 = ⅓, P(E3) = 3/6 = ½

And P(A|E1) = ⅗, P(A|E2) = 3/7, P(A|E3) = 4/9

P(E2|A) = P(A|E2)P(E2) / [P(A|E1)P(E1) + P(A|E2)P(E2) + P(A|E3)P(E3)]
= (3/7 × 1/3) / (3/5 × 1/6 + 3/7 × 1/3 + 4/9 × 1/2) = (1/7) / (1/10 + 1/7 + 2/9)
⇒ P(E2|A) = 90/293.

Information Theory in Machine Learning


Information theory, introduced by Claude Shannon in 1948, is a mathematical framework
for quantifying information, data compression, and transmission. In machine learning,
information theory provides powerful tools for analyzing and improving algorithms.

Key Concepts of Information Theory


1. Entropy
Entropy measures the uncertainty or unpredictability of a random variable. In machine
learning, entropy quantifies the amount of information required to describe a dataset.
 Definition: For a discrete random variable X with possible values x1, x2, ..., xn and a probability mass function P(X), the entropy H(X) is defined as:
o H(X) = −∑i P(xi) log P(xi)
 Interpretation: Higher entropy indicates greater unpredictability, while lower entropy
indicates more predictability.
2. Mutual Information
Mutual information measures the amount of information obtained about one random
variable through another random variable. It quantifies the dependency between variables.
 Definition: For two random variables X and Y, the mutual information I(X;Y) is defined as:
o I(X;Y) = ∑x∈X ∑y∈Y P(x,y) log [ P(x,y) / (P(x) P(y)) ]
 Interpretation: Mutual information is zero if X and Y are independent, and higher
values indicate greater dependency.
3. Kullback-Leibler (KL) Divergence
KL divergence measures the difference between two probability distributions. It is often
used in machine learning to compare the predicted probability distribution with the true
distribution.
 Definition: For two probability distributions P and Q defined over the same variable X, the KL divergence DKL(P||Q) is:
o DKL(P||Q) = ∑x∈X P(x) log [ P(x) / Q(x) ]
 Interpretation: KL divergence is non-negative and asymmetric, meaning DKL(P||Q) ≠ DKL(Q||P).
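A direct way to see these properties is to compute the divergence in both directions. The sketch below uses a base-2 logarithm (matching the entropy code later in this unit) and assumes both distributions give non-zero probability to every outcome:

Code:
import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = sum over x of P(x) * log2( P(x) / Q(x) )
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

P = np.array([0.1, 0.4, 0.5])
Q = np.array([0.3, 0.4, 0.3])

print("D_KL(P || Q) =", kl_divergence(P, Q))
print("D_KL(Q || P) =", kl_divergence(Q, P))   # a different value: KL is asymmetric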
Applications of Information Theory in Machine Learning
1. Feature Selection
Feature selection aims to identify the most relevant features for building a predictive model.
Information-theoretic measures like mutual information can quantify the relevance of each
feature with respect to the target variable.
 Method: Calculate the mutual information between each feature and the target variable.
Select features with the highest mutual information values.
 Benefit: Helps in reducing dimensionality and improving model performance by
removing irrelevant or redundant features.
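A hedged sketch of this method using scikit-learn's mutual_info_classif; the toy data below is synthetic and constructed so that only the first feature carries information about the target:

Code:
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(100, 3))   # 100 samples, 3 binary features
y = X[:, 0]                            # the target depends only on feature 0

mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
print("Mutual information per feature:", mi)
# Feature 0 receives a high score; the irrelevant features score near zero.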

2. Decision Trees
Decision trees use entropy and information gain to split nodes and build a tree structure.
Information gain, based on entropy, measures the reduction in uncertainty after splitting a
node.
 Information Gain: The information gain IG(T, A) for a dataset T and attribute A is:
o IG(T, A) = H(T) − ∑v∈Values(A) (|Tv| / |T|) H(Tv)
o where Tv is the subset of T with attribute A having value v.

3. Regularization and Model Selection
KL divergence is used in regularization techniques like variational inference in Bayesian
neural networks. By minimizing KL divergence between the approximate and true posterior
distributions, we achieve better model regularization.
Example: Variational Autoencoders (VAEs) use KL divergence to regularize the latent
space distribution, ensuring it follows a standard normal distribution.
4. Information Bottleneck
The information bottleneck method aims to find a compressed representation of the input
data that retains maximal information about the output.
 Objective: Maximize mutual information between the compressed representation and
the output while minimizing mutual information between the input and the compressed
representation.
 Applications: Used in deep learning for learning efficient representations.

Practical Implementation of Information Theory in Python


Calculating Entropy in Python
The following code defines a function entropy that calculates the entropy of a given
probability distribution. It uses NumPy to perform the calculation. The entropy is computed
as the negative sum of the probabilities multiplied by their base-2 logarithms. The example
provided calculates the entropy of the probability distribution [0.2, 0.3, 0.5].

Code:
import numpy as np

def entropy(prob_dist):
return -np.sum(prob_dist * np.log2(prob_dist))

# Example
prob_dist = np.array([0.2, 0.3, 0.5])
print("Entropy:", entropy(prob_dist))

Output:
Entropy: 1.4854752972273344

UNIT- II
Linear Regression in Machine learning
Linear regression is also a type of machine-learning algorithm more specifically
a supervised machine-learning algorithm that learns from the labelled datasets and maps
the data points to the most optimized linear functions, which can be used for prediction on
new datasets.
First, we should know what a supervised machine learning algorithm is. It is a type of
machine learning where the algorithm learns from labelled data. Supervised learning has
two types:
 Classification: It predicts the class of the dataset based on the independent input
variable. Class is the categorical or discrete values. like the image of an animal is a cat
or dog?
 Regression: It predicts the continuous output variables based on the independent input
variable. like the prediction of house prices based on different parameters like house
age, distance from the main road, location, area, etc .

Types of Linear Regression


There are two main types of linear regression:
Simple Linear Regression
This is the simplest form of linear regression, and it involves only one independent variable
and one dependent variable. The equation for simple linear regression is:
y = β0 + β1X
where:
 Y is the dependent variable
 X is the independent variable
 β0 is the intercept
 β1 is the slope

Multiple Linear Regression


This involves more than one independent variable and one dependent variable. The
equation for multiple linear regression is:
y = β0 + β1X1 + β2X2 + … + βnXn
where:
 Y is the dependent variable
 X1, X2, …, Xn are the independent variables
 β0 is the intercept
 β1, β2, …, βn are the slopes
The goal of the algorithm is to find the best Fit Line equation that can predict the values
based on the independent variables.
In regression set of records are present with X and Y values and these values are used to
learn a function so if you want to predict Y from an unknown X this learned function can
be used. In regression we have to find the value of Y, So, a function is required that
predicts continuous Y in the case of regression given X as independent features.
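As an illustrative sketch of learning such a function, the closed-form least-squares estimates for simple linear regression can be computed directly with NumPy. The experience/salary numbers below are hypothetical:

Code:
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)     # years of work experience
Y = np.array([3.0, 4.1, 5.2, 5.9, 7.1])        # salary (hypothetical units)

# Least-squares estimates: beta1 = cov(X, Y) / var(X), beta0 = mean(Y) - beta1 * mean(X)
beta1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
beta0 = Y.mean() - beta1 * X.mean()

print("Intercept (beta0):", round(beta0, 2))                  # 2.06
print("Slope (beta1):", round(beta1, 2))                      # 1.0
print("Prediction for X = 6:", round(beta0 + beta1 * 6, 2))   # 8.06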

What is the best Fit Line?


Our primary objective while using linear regression is to locate the best-fit line, which
implies that the error between the predicted and actual values should be kept to a minimum.
There will be the least error in the best-fit line.
The best Fit Line equation provides a straight line that represents the relationship between
the dependent and independent variables. The slope of the line indicates how much the
dependent variable changes for a unit change in the independent variable(s).
Here Y is called a dependent or target variable and X is called an independent variable also
known as the predictor of Y. There are many types of functions or modules that can be used
for regression. A linear function is the simplest type of function. Here, X may be a single
feature or multiple features representing the problem.
Linear regression performs the task of predicting a dependent variable value (y) based on a given independent variable (x); hence the name Linear Regression. For example, X (input) could be the work experience and Y (output) the salary of a person. The regression line is the best-fit line for our model.
We utilize the cost function to compute the best values in order to get the best fit line since
different values for weights or the coefficient of lines result in different regression lines.

Why Linear Regression is Important?


The interpretability of linear regression is a notable strength. Linear regression is not
merely a predictive tool; it forms the basis for various advanced models. Techniques like
regularization and support vector machines draw inspiration from linear regression,
expanding its utility.

Linear Model
The Linear Model is one of the most straightforward models in machine learning. It is the
building block for many complex machine learning algorithms, including deep neural
networks. Linear models predict the target variable using a linear function of the input
features.
The linear model is one of the simplest models in machine learning. It assumes that the data
is linearly separable and tries to learn the weight of each feature. Mathematically, it can be
written as Y = WᵀX, where X is the feature matrix, Y is the target variable, and W is
the learned weight vector. We apply a transformation function or a threshold for the
classification problem to convert the continuous-valued variable Y into a discrete category.
Here we will briefly learn linear and logistic regression, which are
the regression and classification task models, respectively.

Linear models in machine learning are easy to implement and interpret and are helpful in
solving many real-life use cases.

Types of Linear Models

Among many linear models, this article will cover linear regression and logistic regression.
Linear Regression
Linear Regression is a statistical approach that predicts the result of a response variable by
combining numerous influencing factors. It attempts to represent the linear connection
between features (independent variables) and the target (dependent variables). The cost
function enables us to find the best possible values for the model parameters.

Example: An analyst would be interested in seeing how market movement influences the
price of ExxonMobil (XOM). The value of the S&P 500 index will be the independent
variable, or predictor, in this example, while the price of XOM will be the dependent
variable. In reality, various elements influence an event's result. Hence, we usually have
many independent features.

Logistic Regression
Logistic regression is an extension of linear regression. The sigmoid function first transforms
the linear regression output between 0 and 1. After that, a predefined threshold helps to
determine the probability of the output values. The values higher than the threshold value
tend towards having a probability of 1, whereas values lower than the threshold value tend
towards having a probability of 0.

Example: A bank wants to predict if a customer will default on their loan based on their
credit score and income. The independent variables would be credit score and income, while
the dependent variable would be whether the customer defaults (1) or not (0).
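A minimal sketch of this loan-default scenario using scikit-learn's LogisticRegression; the credit-score/income values and labels below are invented purely for illustration:

Code:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: [credit_score, income] -> default (1) or not (0)
X = np.array([[580, 25], [610, 30], [650, 45], [700, 60],
              [720, 65], [750, 80], [600, 28], [690, 55]])
y = np.array([1, 1, 0, 0, 0, 0, 1, 0])

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

new_customer = np.array([[640, 40]])
prob_default = model.predict_proba(new_customer)[0, 1]   # sigmoid output in [0, 1]
print("P(default) =", round(prob_default, 3))
print("Predicted class (threshold 0.5):", model.predict(new_customer)[0])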

Applications of Linear Models

Several real-life scenarios follow linear relations between dependent and independent
variables. Some of the examples are:

 The relationship between the boiling point of water and change in altitude.
 The relationship between spending on advertising and the revenue of an organization.
 The relationship between the amount of fertilizer used and crop yields.
 Performance of athletes and their training regimen.

Naive Bayes Classifiers


Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem. It is not a single algorithm but a family of algorithms where all of them share a
common principle, i.e. every pair of features being classified is independent of each
other. One of the most simple and effective classification algorithms, the Naïve Bayes
classifier aids in the rapid development of machine learning models with rapid prediction
capabilities.
Naïve Bayes algorithm is used for classification problems. It is highly used in text
classification.

Why it is Called Naive Bayes?


The “Naive” part of the name indicates the simplifying assumption made by the Naïve
Bayes classifier. The classifier assumes that the features used to describe an observation are
conditionally independent, given the class label. The “Bayes” part of the name refers to
Reverend Thomas Bayes, an 18th-century statistician and theologian who formulated
Bayes’ theorem.
Consider a fictional dataset that describes the weather conditions for playing a game of
golf. Given the weather conditions, each tuple classifies the conditions as fit(“Yes”) or
unfit (“No”) for playing golf. Here is a tabular representation of our dataset.

Outlook Temperature Humidity Windy Play Golf

0 Rainy Hot High False No

1 Rainy Hot High True No

2 Overcast Hot High False Yes

3 Sunny Mild High False Yes

4 Sunny Cool Normal False Yes

5 Sunny Cool Normal True No

6 Overcast Cool Normal True Yes

7 Rainy Mild High False No

8 Rainy Cool Normal False Yes

9 Sunny Mild Normal False Yes


10 Rainy Mild Normal True Yes

11 Overcast Mild High True Yes

12 Overcast Hot Normal False Yes

13 Sunny Mild High True No

The dataset is divided into two parts, namely, feature matrix and the response vector.
 Feature matrix contains all the vectors (rows) of dataset in which each vector consists of
the value of dependent features. In above dataset, features are ‘Outlook’,
‘Temperature’, ‘Humidity’ and ‘Windy’.
 Response vector contains the value of class variable (prediction or output) for each row
of feature matrix. In above dataset, the class variable name is ‘Play golf’.

Assumption of Naive Bayes


The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome. In particular:
 Feature independence: The features of the data are conditionally independent of each
other, given the class label.
 Continuous features are normally distributed: If a feature is continuous, then it is
assumed to be normally distributed within each class.
 Discrete features have multinomial distributions: If a feature is discrete, then it is
assumed to have a multinomial distribution within each class.
 Features are equally important: All features are assumed to contribute equally to the
prediction of the class label.
 No missing data: The data should not contain any missing values.

Bayes’ Theorem and Conditional Probability


Bayes theorem (also known as the Bayes Rule or Bayes Law) is used to determine the
conditional probability of event A when event B has already occurred.

Bayes Theorem Formula


For any two events A and B, the formula for the Bayes theorem is given by:
P(A|B) = P(B|A) ⋅ P(A) / P(B)

where,
 P(A) and P(B) are the probabilities of events A and B also P(B) is never equal
to zero.
 P(A|B) is the probability of event A when event B happens
 P(B|A) is the probability of event B when A happens

Solved Examples on Bayes Theorem


Example 1: A person has undertaken a job. The probabilities of completion of the
job on time with and without rain are 0.44 and 0.95 respectively. If the probability
that it will rain is 0.45, then determine the probability that the job will be completed
on time.

Solution:
Let A be the event that it rains and B the event that it does not rain. We have,
P(A) = 0.45,
P(no rain) = P(B) = 1 − P(A) = 1 − 0.45 = 0.55
Let E be the event that the job will be completed on time. The given conditional probabilities are
P(E|A) = 0.44 (completion on time when it rains) and P(E|B) = 0.95 (completion on time when it does not rain).
Since events A and B form a partition of the sample space S, by the total probability theorem we have

P(E) = P(A) P(E|A) + P(B) P(E|B)

⇒ P(E) = 0.45 × 0.44 + 0.55 × 0.95

⇒ P(E) = 0.198 + 0.5225 = 0.7205

So, the probability that the job will be completed on time is 0.7205

Probabilistic Models in Machine Learning


Probabilistic models are an essential component of machine learning, which aims to learn
patterns from data and make predictions on new, unseen data. Probabilistic models are used
in various applications such as image and speech recognition, natural language processing,
and recommendation systems.
Categories Of Probabilistic Models
These models can be classified into the following categories:
 Generative models
 Discriminative models.
 Graphical models

Generative models:
Generative models aim to model the joint distribution of the input and output variables.
These models generate new data based on the probability distribution of the original
dataset. Generative models are powerful because they can generate new data that resembles
the training data. They can be used for tasks such as image and speech synthesis, language
translation, and text generation.

Discriminative models
The discriminative model aims to model the conditional distribution of the output variable
given the input variable. They learn a decision boundary that separates the different classes
of the output variable. Discriminative models are useful when the focus is on making
accurate predictions rather than generating new data. They can be used for tasks such
as image recognition, speech recognition, and sentiment analysis.

Graphical models
These models use graphical representations to show the conditional dependence between
variables. They are commonly used for tasks such as image recognition, natural language
processing, and causal inference.
Naive Bayes Algorithm in Probabilistic Models
The Naive Bayes algorithm is a widely used approach in probabilistic models,
demonstrating remarkable efficiency and effectiveness in
solving classification problems. By leveraging the power of the Bayes theorem and
making simplifying assumptions about feature independence, the algorithm
calculates the probability of the target class given the feature set. This method has
found diverse applications across various industries, ranging from spam filtering to
medical diagnosis. Despite its simplicity, the Naive Bayes algorithm has proven
to be highly robust, providing rapid results in a multitude of real-world problems.
Naive Bayes is a probabilistic algorithm that is used for classification problems.
It is based on the Bayes theorem of probability and assumes that the features are
conditionally independent of each other given the class. The Naive Bayes
Algorithm is used to calculate the probability of a given sample belonging to a
particular class. This is done by calculating the posterior probability of each class
given the sample and then selecting the class with the highest posterior
probability as the predicted class.

The algorithm works as follows:


1. Collect a labelled dataset of samples, where each sample has a set of features and a class label.
2. For each feature in the dataset, calculate the conditional probability of the feature given the class. This is done by counting the number of times the feature occurs in samples of the class and dividing by the total number of samples in the class.
3. Calculate the prior probability of each class by counting the number of samples in each class and dividing by the total number of samples in the dataset.
4. Given a new sample with a set of features, calculate the posterior probability of each class using the Bayes theorem and the conditional probabilities and prior probabilities calculated in steps 2 and 3.
5. Select the class with the highest posterior probability as the predicted class for the new sample.
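These steps are essentially what scikit-learn's categorical Naive Bayes implementation performs internally. Below is a hedged sketch that applies CategoricalNB to the play-golf dataset shown earlier in this unit; the OrdinalEncoder simply maps category strings to integer codes:

Code:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Features: Outlook, Temperature, Humidity, Windy; target: Play Golf
X = [["Rainy", "Hot", "High", "False"], ["Rainy", "Hot", "High", "True"],
     ["Overcast", "Hot", "High", "False"], ["Sunny", "Mild", "High", "False"],
     ["Sunny", "Cool", "Normal", "False"], ["Sunny", "Cool", "Normal", "True"],
     ["Overcast", "Cool", "Normal", "True"], ["Rainy", "Mild", "High", "False"],
     ["Rainy", "Cool", "Normal", "False"], ["Sunny", "Mild", "Normal", "False"],
     ["Rainy", "Mild", "Normal", "True"], ["Overcast", "Mild", "High", "True"],
     ["Overcast", "Hot", "Normal", "False"], ["Sunny", "Mild", "High", "True"]]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
     "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

encoder = OrdinalEncoder()                 # map category strings to integer codes
X_encoded = encoder.fit_transform(X)

model = CategoricalNB()
model.fit(X_encoded, y)

# Predict for a new day: Sunny, Cool, High humidity, Windy
new_day = encoder.transform([["Sunny", "Cool", "High", "True"]])
print("Prediction:", model.predict(new_day)[0])
print("Class probabilities:", model.predict_proba(new_day))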

Importance of Probabilistic Models


 Probabilistic models play a crucial role in the field of machine learning, providing a
framework for understanding the underlying patterns and complexities in massive
datasets.
 Probabilistic models provide a natural way to reason about the likelihood of different
outcomes and can help us understand the underlying structure of the data.
 Probabilistic models help enable researchers and practitioners to make informed
decisions when faced with uncertainty.
 Probabilistic models allow us to perform Bayesian inference, which is a powerful
method for updating our beliefs about a hypothesis based on new data. This can be
particularly useful in situations where we need to make decisions under uncertainty.

Bayesian logistic regression


Bayesian logistic regression is the Bayesian counterpart to logistic regression, a common
machine learning tool. Logistic regression is a data analysis technique that uses mathematics
to predict the value of one data factor based on another. The prediction usually has a finite
number of outcomes, like yes or no.

Types of logistic regression:


 Binary logistic regression: The most common type of logistic regression, used when the
dependent variable has only two possible outcomes.
 Multinomial logistic regression: Used when the target variable has multiple classes.
 Ordinal logistic regression: Used when the target variable has ordered categories (for example, ratings such as low, medium and high).

Decision Tree
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where
each internal node tests on attribute, each branch corresponds to attribute value and each
leaf node represents the final decision or prediction. They can be used to solve both regression and classification tasks.

Decision Tree Terminologies


There are specialized terms associated with decision trees that denote various components
and facets of the tree structure and decision-making procedure. :
 Root Node: A decision tree’s root node, which represents the original choice or feature
from which the tree branches, is the highest node.
 Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by
the values of particular attributes. There are branches on these nodes that go to other
nodes.
 Leaf Nodes (Terminal Nodes): The branches’ termini, when choices or forecasts are
decided upon. There are no more branches on leaf nodes.
 Branches (Edges): Links between nodes that show how decisions are made in response
to particular circumstances.
 Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets of
data.
 Parent Node: A node that is split into child nodes. The original node from which a split
originates.
 Child Node: Nodes created as a result of a split from a parent node.
 Decision Criterion: The rule or condition used to determine how the data should be
split at a decision node. It involves comparing feature values against a threshold.
 Pruning: The process of removing branches or nodes from a decision tree to improve
its generalisation and prevent overfitting.

Decision Tree Approach

Decision tree uses the tree representation to solve the problem in which each leaf node
corresponds to a class label and attributes are represented on the internal node of the tree.
We can represent any Boolean function on discrete attributes using the decision tree.
Below are some assumptions that we made while using the decision tree:
 At the beginning, we consider the whole training set as the root.
 Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
 On the basis of attribute values, records are distributed recursively.
 We use statistical methods for ordering attributes as root or the internal node.

The Decision Tree works on the Sum of Product form, which is also known as Disjunctive Normal Form (for example, a tree that predicts whether people use a computer in their daily life). In a Decision Tree, the major challenge is the identification of the attribute for the root node at each level. This process is known as attribute selection. We have two popular attribute selection measures:
1. Information Gain
2. Gini Index

Information Gain:
When we use a node in a decision tree to partition the training instances into smaller
subsets the entropy changes. Information gain is a measure of this change in entropy.
 Suppose S is a set of instances,
 A is an attribute
 Sv is the subset of S
 v represents an individual value that the attribute A can take and Values (A) is the set of
all possible values of A, then
Gain(S, A) = Entropy(S) – ∑v∈Values(A) (|Sv| / |S|) ⋅ Entropy(Sv)
Entropy: Entropy is the measure of uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the more the information content. If a set S contains examples belonging to c classes with proportions p1, p2, …, pc, then
Entropy(S) = – ∑i pi log2 pi
Example:

For the set X = {a,a,a,b,b,b,b,b}


Total instances: 8
Instances of b: 5
Instances of a: 3

Entropy H(X) = − [(3/8) log2(3/8) + (5/8) log2(5/8)]

= − [0.375(−1.415) + 0.625(−0.678)]
= − (−0.53 − 0.424) = 0.954

Example: Now, let us draw a Decision Tree for the following data using Information
gain. Training set: 3 features and 2 classes

X Y Z C

1 1 1 I

1 1 0 I

0 0 1 II

1 0 0 II

Here, we have 3 features and 2 output classes. To build a decision tree using information gain, we take each of the features and calculate the information gain for each feature (a small computational sketch follows below).
Comparing the splits on X, Y and Z, the information gain is maximum when we split on feature Y. So, the best-suited feature for the root node is feature Y. While splitting the dataset by feature Y, each child contains a pure subset of the target variable, so we don’t need to split the dataset further. The final tree therefore tests feature Y at the root, with a leaf for class I when Y = 1 and a leaf for class II when Y = 0.
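The information-gain values referred to above can be computed explicitly. The following sketch defines small entropy and information_gain helpers (names chosen here just for illustration) and applies them to the toy training set:

Code:
import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the class proportions in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    # Gain(S, A) = H(S) - sum over v of (|Sv| / |S|) * H(Sv)
    gain = entropy(labels)
    for v in np.unique(feature):
        subset = labels[feature == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

data = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]])   # columns: X, Y, Z
C = np.array(["I", "I", "II", "II"])                            # class labels

for i, name in enumerate(["X", "Y", "Z"]):
    print("Information gain for", name, "=", round(information_gain(data[:, i], C), 3))
# X = 0.311, Y = 1.0, Z = 0.0 -> Y gives the maximum gain and becomes the root.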

Building Decision Tree using Information Gain

The essentials:
 Start with all training instances associated with the root node
 Use information gain to choose which attribute to label each node with
 Note: No root-to-leaf path should contain the same discrete attribute twice
 Recursively construct each subtree on the subset of training instances that would be classified down that path in the tree.
 If all positive or all negative training instances remain, label that node “yes” or “no” accordingly
 If no attributes remain, label with a majority vote of the training instances left at that node
 If no instances remain, label with a majority vote of the parent’s training instances.

2. Gini Index
 Gini Index is a metric to measure how often a randomly chosen
element would be incorrectly identified.
 It means an attribute with a lower Gini index should be
preferred.
 Sklearn supports “Gini” criteria for Gini Index and by default, it
takes “gini” value.
 The formula for the calculation of the Gini Index is given below:
Gini = 1 − ∑i (pi)²
where pi is the probability of an item being classified to a particular class.
The Gini Index is a measure of the inequality or impurity of a distribution, commonly used in decision trees and other machine learning algorithms. For a binary classification it ranges from 0 to 0.5, where 0 indicates a pure set and 0.5 indicates a maximally impure set.
Example of a Decision Tree Algorithm
Forecasting Activities Using Weather Information
 Root node: Whole dataset
 Attribute : “Outlook” (sunny, cloudy, rainy).
 Subsets: Overcast, Rainy, and Sunny.
 Recursive Splitting: Divide the sunny subset even more according to humidity, for
example.
 Leaf Nodes: Activities include “swimming,” “hiking,” and “staying inside.”

Advantages of Decision Tree


 Easy to understand and interpret, making them accessible to non-experts.
 Handle both numerical and categorical data without requiring extensive pre-processing.
 Provides insights into feature importance for decision-making.
 Handle missing values and outliers without significant impact.
 Applicable to both classification and regression tasks.

Disadvantages of Decision Tree


 The potential for overfitting
 Sensitivity to small changes in data, limited generalization if training data is not
representative
 Potential bias in the presence of imbalanced data.

Regression Tree
A regression tree is a type of decision tree that uses a data set to predict continuous response
variables. It's a simple and fast algorithm that's often used to predict categorical, discrete, or
nonlinear sample data.

CART (Classification And Regression Tree):


Classification and Regression Trees (CART) is a decision tree algorithm that is used for
both classification and regression tasks. It is a supervised learning algorithm that learns
from labelled data to predict unseen data.
 Tree structure: CART builds a tree-like structure consisting of nodes and branches.
The nodes represent different decision points, and the branches represent the possible
outcomes of those decisions. The leaf nodes in the tree contain a predicted class label or
value for the target variable.
 Splitting criteria: CART uses a greedy approach to split the data at each node. It
evaluates all possible splits and selects the one that best reduces the impurity of the
resulting subsets. For classification tasks, CART uses Gini impurity as the splitting
criterion. The lower the Gini impurity, the more pure the subset is. For regression tasks,
CART uses residual reduction as the splitting criterion. The lower the residual
reduction, the better the fit of the model to the data.
 Pruning: To prevent overfitting of the data, pruning is a technique used to remove the
nodes that contribute little to the model accuracy. Cost complexity pruning and
information gain pruning are two popular pruning techniques. Cost complexity pruning
involves calculating the cost of each node and removing nodes that have a negative
cost. Information gain pruning involves calculating the information gain of each node
and removing nodes that have a low information gain.

How does CART algorithm works?


The CART algorithm works via the following process:
 The best-split point of each input is obtained.
 Based on the best-split points of each input in Step 1, the new “best” split point is
identified.
 Split the chosen input according to the “best” split point.
 Continue splitting until a stopping rule is satisfied or no further desirable splitting is
available.
Gini index/Gini impurity
The Gini index is a metric for the classification tasks in CART. It works on categorical
variables, provides outcomes either “successful” or “failure” and hence conducts binary
splitting only.
The degree of the Gini index varies from 0 to 1,
 Where 0 depicts that all the elements belong to a certain class, or only one class exists there.
 Gini index close to 1 means a high level of impurity, where each class contains a very
small fraction of elements.
 A value of 1-1/n occurs when the elements are uniformly distributed into n classes and
each class has an equal probability of 1/n. For example, with two classes, the Gini
impurity is 1 – 1/2 = 0.5.
Mathematically, we can write Gini Impurity as follows:
Gini = 1 − ∑i=1..n (pi)²
where pi is the probability of an object being classified to a particular class.
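A small sketch of this formula in Python (the helper name gini_impurity is chosen here just for illustration):

Code:
import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_i^2) over the class proportions in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity(["A", "A", "A", "A"]))        # 0.0  -> pure node
print(gini_impurity(["A", "A", "B", "B"]))        # 0.5  -> maximally impure for 2 classes
print(gini_impurity(["A", "A", "A", "B", "B"]))   # ≈ 0.48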
Pseudo-code of the CART algorithm
d = 0, endtree = 0
Node(0) = 1, Node(1) = 0, Node(2) = 0
while endtree < 1
    if Node(2^d - 1) + Node(2^d) + .... + Node(2^(d+1) - 2) = 2 - 2^(d+1)
        endtree = 1
    else
        do i = 2^d - 1, 2^d, ...., 2^(d+1) - 2
            if Node(i) > -1
                Split tree
            else
                Node(2i+1) = -1
                Node(2i+2) = -1
            end if
        end do
    end if
    d = d + 1
end while

Pruning
In machine learning, pruning is a technique that reduces the size of a
decision tree by removing non-critical branches or nodes. The goal of
pruning is to improve the model's performance, generalization, and
efficiency.

Benefits of pruning:
 Reduces overfitting: Pruning prevents the model from memorizing the training data, which
can lead to poor performance on new data.
 Improves predictive accuracy: Pruning reduces the complexity of the model, which can
improve its predictive accuracy.
 Improves model simplicity: Pruning can make the model simpler, faster, and more robust.

Types Of Decision Tree Pruning


There are two main types of decision tree pruning: Pre-Pruning and Post-Pruning.

Pre-Pruning (Early Stopping)


Sometimes, the growth of the decision tree can be stopped before it gets too complex, this
is called pre-pruning. It is important to prevent the overfitting of the training data, which
results in a poor performance when exposed to new data.
Some common pre-pruning techniques include:
 Maximum Depth: It limits the maximum level of depth in a decision tree.
 Minimum Samples per Leaf: Set a minimum threshold for the number of samples in
each leaf node.
 Minimum Samples per Split: Specify the minimal number of samples needed to break
up a node.
 Maximum Features: Restrict the quantity of features considered for splitting.
By pruning early, we come to be with a simpler tree that is less likely to overfit the training
facts.

Post-Pruning (Reducing Nodes)


After the tree is fully grown, post-pruning involves removing branches or nodes to improve
the model's ability to generalize. Some common post-pruning techniques include:
 Cost-Complexity Pruning (CCP): This method assigns a cost to each subtree based on its accuracy and complexity, then selects the subtree with the lowest cost.
 Reduced Error Pruning: Removes branches that do not significantly affect the overall
accuracy.
 Minimum Impurity Decrease: Prunes nodes if the decrease in impurity (Gini impurity
or entropy) is beneath a certain threshold.
 Minimum Leaf Size: Removes leaf nodes with fewer samples than a specified
threshold.
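In scikit-learn, several of these ideas map directly onto DecisionTreeClassifier parameters: max_depth and min_samples_leaf act as pre-pruning controls, while ccp_alpha enables cost-complexity post-pruning. The following is a rough sketch on the built-in iris dataset; the exact accuracies depend on the data split and the chosen values:

Code:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning: limit depth and minimum samples per leaf while growing the tree.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then apply cost-complexity pruning via ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=42)
post_pruned.fit(X_train, y_train)

print("Pre-pruned accuracy :", pre_pruned.score(X_test, y_test))
print("Post-pruned accuracy:", post_pruned.score(X_test, y_test))
print("Leaves (pre, post)  :", pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())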
Neural Networks
Neural networks are machine learning models that mimic the complex functions of the
human brain. These models consist of interconnected nodes or neurons that process data,
learn patterns, and enable tasks such as pattern recognition and decision-making.
These networks are built from several key components:
1. Neurons: The basic units that receive inputs, each neuron is governed by a threshold
and an activation function.
2. Connections: Links between neurons that carry information, regulated by weights and
biases.
3. Weights and Biases: These parameters determine the strength and influence of
connections.
4. Propagation Functions: Mechanisms that help process and transfer data across layers
of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve
accuracy.

Learning in neural networks follows a structured, three-stage process:


1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases,
gradually improving its performance on diverse tasks.

Layers in Neural Network Architecture


1. Input Layer: This is where the network receives its input data. Each input neuron in
the layer corresponds to a feature in the input data.
2. Hidden Layers: These layers perform most of the computational heavy lifting. A
neural network can have one or multiple hidden layers. Each layer consists of units
(neurons) that transform the inputs into something that the output layer can use.
3. Output Layer: The final layer produces the output of the model. The format of these
outputs varies depending on the specific task (e.g., classification, regression).

Types of Neural Networks


There are several types of neural networks; the most commonly used include:
 Feedforward Networks: A feedforward neural network is a simple artificial neural
network architecture in which data moves from input to output in a single direction.
 Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with
three or more layers, including an input layer, one or more hidden layers, and an output
layer. It uses nonlinear activation functions.
 Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a
specialized artificial neural network designed for image processing. It employs
convolutional layers to automatically learn hierarchical features from input images,
enabling effective image recognition and classification.
 Recurrent Neural Network (RNN): An artificial neural network type intended for
sequential data processing is called a Recurrent Neural Network (RNN). It is
appropriate for applications where contextual dependencies are critical, such as time
series prediction and natural language processing, since it makes use of feedback loops,
which enable information to survive within the network.
 Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to
overcome the vanishing gradient problem in training RNNs. It uses memory cells and
gates to selectively read, write, and erase information.

Working of Neural Networks


Forward Propagation
When data is input into the network, it passes through the network in the forward direction,
from the input layer through the hidden layers to the output layer. This process is known as
forward propagation. Here’s what happens during this phase:
1. Linear Transformation: Each neuron in a layer receives inputs, which are multiplied by the weights associated with the connections. These products are summed together, and a bias is added to the sum. This can be represented mathematically as: z = w1x1 + w2x2 + … + wnxn + b, where w represents the weights, x represents the inputs, and b is the bias.
2. Activation: The result of the linear transformation (denoted as z) is then passed through an activation function. The activation function is crucial because it introduces non-linearity into the system, enabling the network to learn more complex patterns. Popular activation functions include ReLU, sigmoid, and tanh.

Backpropagation
After forward propagation, the network evaluates its performance using a loss function,
which measures the difference between the actual output and the predicted output. The goal
of training is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a measure of error
in the predictions. The loss function could vary; common choices are mean squared
error for regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are updated
using an optimization algorithm like stochastic gradient descent (SGD). The weights are
adjusted in the opposite direction of the gradient to minimize the loss. The size of the
step taken in each update is determined by the learning rate.

Iteration
This process of forward propagation, loss calculation, backpropagation, and weight update
is repeated for many iterations over the dataset. Over time, this iterative process reduces the
loss, and the network’s predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as
classification, regression, or any other predictive modeling.
Example of Email Classification
Let’s consider a record of an email dataset:
Email ID: 1
Email Content: “Get free gift cards now!”
Sender: spam@example.com
Subject Line: “Exclusive Offer”
Label: 1 (spam)

To classify this email, we will create a feature vector based on the analysis of keywords
such as “free,” “win,” and “offer.”
The feature vector of the record can be presented as:
 “free”: Present (1)
 “win”: Absent (0)
 “offer”: Present (1)

Now, let’s delve into the working:


1. Input Layer: The input layer contains 3 nodes that indicates the presence of each
keyword.
2. Hidden Layer
 The input data is passed through one or more hidden layers.
 Each neuron in the hidden layer performs the following operations:
1. Weighted Sum: Each input is multiplied by a corresponding weight assigned to the
connection. For example, if the weights from the input layer to the hidden layer
neurons are as follows:
o Weights for Neuron H1: [0.5, -0.2, 0.3]
o Weights for Neuron H2: [0.4, 0.1, -0.5]
2. Calculate Weighted Input:
o For Neuron H1: Calculation = (1×0.5) + (0×−0.2) + (1×0.3) = 0.5 + 0 + 0.3 = 0.8
o For Neuron H2: Calculation = (1×0.4) + (0×0.1) + (1×−0.5) = 0.4 + 0 − 0.5 = −0.1
3. Activation Function: The result is passed through an activation function (e.g., ReLU or sigmoid) to introduce non-linearity.
o For H1, applying ReLU: ReLU(0.8) = 0.8
o For H2, applying ReLU: ReLU(−0.1) = 0
3. Output Layer
 The activated outputs from the hidden layer are passed to the output neuron.
 The output neuron receives the values from the hidden layer neurons and computes the
final prediction using weights:
o Suppose the output weights from hidden layer to output neuron are [0.7, 0.2].
o Calculation: Input = (0.8×0.7) + (0×0.2) = 0.56 + 0 = 0.56
o Final Activation: The output is passed through a sigmoid activation function to obtain a probability:
o σ(0.56) ≈ 0.636
4. Final Classification
 The output value of approximately 0.636 indicates the probability of the email being
spam.
 Since this value is greater than 0.5, the neural network classifies the email as spam (1).
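The arithmetic of this worked example can be reproduced in a few lines of NumPy. The sketch below uses the same weights as above and assumes the biases are zero, since the example omits them:

Code:
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1, 0, 1])                   # ["free" present, "win" absent, "offer" present]

W_hidden = np.array([[0.5, -0.2, 0.3],    # weights into neuron H1
                     [0.4,  0.1, -0.5]])  # weights into neuron H2
w_output = np.array([0.7, 0.2])           # weights from hidden layer to output neuron

h = relu(W_hidden @ x)                    # hidden activations -> [0.8, 0.0]
output = sigmoid(w_output @ h)            # final probability  -> about 0.636

print("P(spam) =", round(output, 3))
print("Classified as spam:", output > 0.5)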

Advantages of Neural Networks


Neural networks are widely used in many different applications because of their many
benefits:
 Adaptability: Neural networks are useful for activities where the link between inputs
and outputs is complex or not well defined because they can adapt to new situations and
learn from data.
 Pattern Recognition: Their proficiency in pattern recognition renders them efficacious
in tasks like as audio and image identification, natural language processing, and other
intricate data patterns.
 Parallel Processing: Because neural networks are capable of parallel processing by
nature, they can process numerous jobs at once, which speeds up and improves the
efficiency of computations.
 Non-Linearity: Neural networks are able to model and comprehend complicated
relationships in data by virtue of the non-linear activation functions found in neurons,
which overcome the drawbacks of linear models.

Disadvantages of Neural Networks


Neural networks, while powerful, are not without drawbacks and difficulties:
 Computational Intensity: Large neural network training can be a laborious and
computationally demanding process that demands a lot of computing power.
 Black box Nature: As “black box” models, neural networks pose a problem in
important applications since it is difficult to understand how they make decisions.
 Overfitting: Overfitting is a phenomenon in which neural networks commit training
material to memory rather than identifying patterns in the data. Although regularization
approaches help to alleviate this, the problem still exists.
 Need for Large datasets: For efficient training, neural networks frequently need
sizable, labeled datasets; otherwise, their performance may suffer from incomplete or
skewed data.

Applications of Neural Networks


Neural networks have numerous applications across various fields:
1. Image and Video Recognition: CNNs are extensively used in applications such as
facial recognition, autonomous driving, and medical image analysis.
2. Natural Language Processing (NLP): RNNs and transformers power language
translation, chatbots, and sentiment analysis.
3. Finance: Predicting stock prices, fraud detection, and risk management.
4. Healthcare: Neural networks assist in diagnosing diseases, analysing medical images,
and personalizing treatment plans.
5. Gaming and Autonomous Systems: Neural networks enable real-time decision-
making, enhancing user experience in video games and enabling autonomous systems
like self-driving cars.

Feedforward neural network


A Feedforward Neural Network (FNN) is a type of artificial neural network where
connections between the nodes do not form cycles.
Structure of a Feedforward Neural Network
1. Input Layer: The input layer consists of neurons that receive the input data. Each
neuron in the input layer represents a feature of the input data.
2. Hidden Layers: One or more hidden layers are placed between the input and output
layers. These layers are responsible for learning the complex patterns in the data. Each
neuron in a hidden layer applies a weighted sum of inputs followed by a non-linear
activation function.
3. Output Layer: The output layer provides the final output of the network. The number
of neurons in this layer corresponds to the number of classes in a classification problem
or the number of outputs in a regression problem.

Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn and model complex data patterns. Common activation functions include ReLU, sigmoid, and tanh.
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to
minimize the error between the predicted output and the actual output. This process is
typically performed using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through the
network to update the weights. The gradient of the loss function with respect to each
weight is calculated, and the weights are adjusted using gradient descent.

Evaluation of Feedforward neural network


Evaluating the performance of the trained model involves several metrics:
 Accuracy: The proportion of correctly classified instances out of the total
instances.
 Precision: The ratio of true positive predictions to the total predicted
positives.
 Recall: The ratio of true positive predictions to the actual positives.
 F1 Score: The harmonic mean of precision and recall, providing a balance
between the two.
 Confusion Matrix: A table used to describe the performance of a
classification model, showing the true positives, true negatives, false
positives, and false negatives.
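These metrics are all available in scikit-learn. A minimal sketch with made-up labels and predictions for a binary task:

Code:
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))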
What is backpropagation?
Backpropagation is the algorithm used to train Feedforward Neural
Networks. It involves calculating the gradient of the loss function with
respect to each weight by the chain rule, then updating the weights to
minimize the loss using an optimization algorithm like Gradient Descent or
Adam.

Support Vector Machine (SVM)


A Support Vector Machine (SVM) is a supervised machine learning algorithm used for
both classification and regression tasks. While it can be applied to regression problems,
SVM is best suited for classification tasks. The primary objective of the SVM algorithm is
to identify the optimal hyperplane in an N-dimensional space that can effectively separate
data points into different classes in the feature space

The dimension of the hyperplane depends on the number of features. For instance, if there
are two input features, the hyperplane is simply a line, and if there are three input features,
the hyperplane becomes a 2-D plane. As the number of features increases beyond three, the
complexity of visualizing the hyperplane also increases.

Consider two independent variables, x1 and x2, and one dependent variable represented as
either a blue circle or a red circle.
 In this scenario, the hyperplane is a line because we are working with two features
(x1 and x2).
 There are multiple lines (or hyperplanes) that can separate the data points.
 The challenge is to determine the best hyperplane that maximizes the separation
margin between the red and blue circles.

Since multiple such lines can separate the data, the question becomes: how do we choose the best one?

How does Support Vector Machine Algorithm Work?


We choose the hyperplane whose distance to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, also referred to as the hard margin.
Now consider a scenario in which one blue ball lies within the region of the red balls. How does SVM classify the data? The blue ball inside the red region is an outlier of the blue class. The SVM algorithm ignores such outliers while finding the hyperplane that maximizes the margin, so SVM is robust to outliers.
For data of this kind, SVM still finds the maximum margin as before, but it also adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. With a soft margin, SVM tries to minimize (1/margin + λ·(∑penalty)). Hinge loss is a commonly used penalty: if there is no violation there is no hinge loss, and if there is a violation the hinge loss is proportional to the distance of the violation.
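A minimal sketch of the hinge loss just described, assuming labels y in {-1, +1} and a decision value f(x) = w·x + b (the numbers below are illustrative):

import numpy as np

def hinge_loss(y, decision_value):
    # No loss when the point is on the correct side with margin >= 1;
    # otherwise the loss grows linearly with the distance of the violation.
    return np.maximum(0.0, 1.0 - y * decision_value)

# A correctly classified point outside the margin (loss 0)
# and a misclassified point (loss > 1).
print(hinge_loss(np.array([+1, -1]), np.array([2.0, 0.5])))  # [0.  1.5]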

When the data points are not linearly separable in the original space, SVM handles this by creating a new variable using a kernel. For a point xi on the line, a new variable yi is defined as a function of its distance from the origin o; when xi is plotted against yi, the classes become linearly separable, as in the sketch below.
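A minimal sketch of this idea, using a hypothetical 1-D example where the mapping yi = xi² (the squared distance from the origin) makes the classes separable by a horizontal line:

import numpy as np

# 1-D points: the 0-class sits near the origin, the 1-class farther away.
x = np.array([-3.0, -2.5, -0.5, 0.0, 0.4, 2.8, 3.2])
labels = np.array([1, 1, 0, 0, 0, 1, 1])

# New variable as a function of distance from the origin.
y = x ** 2

# In the (x, y) space, a horizontal line such as y = 2 now separates the classes.
print(np.column_stack((x, y, labels)))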
Support Vector Machine Terminology
 Hyperplane: The hyperplane is the decision boundary used to separate data points of
different classes in a feature space. For linear classification, this is a linear equation
represented as wx+b=0.
 Support Vectors: Support vectors are the closest data points to the hyperplane. These
points are critical in determining the hyperplane and the margin in Support Vector
Machine (SVM).
 Margin: The margin refers to the distance between the support vector and the
hyperplane. The primary goal of the SVM algorithm is to maximize this margin, as a
wider margin typically results in better classification performance.
 Kernel: The kernel is a mathematical function used in SVM to map input data into a
higher-dimensional feature space. This allows the SVM to find a hyperplane in cases
where data points are not linearly separable in the original space. Common kernel
functions include linear, polynomial, radial basis function (RBF), and sigmoid.
 Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly
separates the data points of different classes without any misclassifications.
 Soft Margin: When data contains outliers or is not perfectly separable, SVM uses
the soft margin technique. This method introduces a slack variable for each data point
to allow some misclassifications while balancing between maximizing the margin and
minimizing violations.
 C: The C parameter in SVM is a regularization term that balances margin
maximization and the penalty for misclassifications. A higher C value imposes a stricter
penalty for margin violations, leading to a smaller margin but fewer misclassifications.
 Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes
misclassified points or margin violations and is often combined with a regularization
term in the objective function.
 Dual Problem: The dual problem in SVM involves solving for the Lagrange
multipliers associated with the support vectors. This formulation allows for the use of
the kernel trick and facilitates more efficient computation.
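As an illustration of these terms, here is a minimal sketch using scikit-learn's SVC; the synthetic dataset and the parameter values (RBF kernel, C = 1.0) are assumptions chosen only for demonstration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with two features.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RBF kernel maps the data to a higher-dimensional feature space;
# C controls the trade-off between a wide margin and margin violations.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

print("Support vectors per class:", clf.n_support_)
print("Test accuracy:", clf.score(X_test, y_test))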
Bagging and Boosting in Machine Learning
Bagging and Boosting are two types of Ensemble Learning.

Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-
algorithm designed to improve the stability and accuracy of machine learning algorithms
used in statistical classification and regression. It decreases the variance and helps to
avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of
the model averaging approach.
Implementation Steps of Bagging
 Step 1: Multiple subsets are created from the original data set with equal tuples,
selecting observations with replacement.
 Step 2: A base model is created on each of these subsets.
 Step 3: Each model is learned in parallel with each training set and independent of each
other.
 Step 4: The final predictions are determined by combining the predictions from all the
models.
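A minimal sketch of these steps using scikit-learn's BaggingClassifier; the dataset and the number of estimators are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 1-3: bootstrap subsets are drawn with replacement and a decision tree
# is trained on each subset independently of the others.
bagging = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators=50, bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)

# Step 4: predictions from all the trees are combined by majority vote.
print("Test accuracy:", bagging.score(X_test, y_test))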
Example of Bagging
The Random Forest model uses Bagging: decision trees, which individually have high variance, are trained on bootstrap samples, and random feature selection is used to grow each tree. Several such random trees together make a Random Forest.

Boosting
Boosting is an ensemble modeling technique designed to create a strong classifier by
combining multiple weak classifiers. The process involves building models sequentially,
where each new model aims to correct the errors made by the previous ones.
 Initially, a model is built using the training data.
 Subsequent models are then trained to address the mistakes of their predecessors.
 Boosting assigns weights to the data points in the original dataset.
 Higher weights: Instances that were misclassified by the previous model receive higher
weights.
 Lower weights: Instances that were correctly classified receive lower weights.
 Training on weighted data: The subsequent model learns from the weighted dataset,
focusing its attention on harder-to-learn examples (those with higher weights).
 This iterative process continues until:
o The entire training dataset is accurately predicted, or
o A predefined maximum number of models is reached.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage of the
weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting
algorithm that won the prestigious Gödel Prize. AdaBoost was the first really successful
boosting algorithm developed for the purpose of binary classification. AdaBoost is short for
Adaptive Boosting and is a very popular boosting technique that combines multiple “weak
classifiers” into a single “strong classifier”.
Algorithm:
1. Initialize the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points and decrease the weights of the correctly classified data points, then normalize the weights of all data points.
4. If the required results are obtained (or the maximum number of models is reached), go to step 5; otherwise, go to step 2.
5. End
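A minimal sketch of AdaBoost using scikit-learn; the dataset and hyperparameters are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each weak learner (a shallow decision tree by default) is trained sequentially;
# samples misclassified by earlier learners receive higher weights before the next fit.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)

print("Test accuracy:", ada.score(X_test, y_test))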
Differences between Bagging and Boosting

1. Combining predictions: Bagging is the simplest way of combining predictions that belong to the same type; Boosting combines predictions that belong to different types.
2. Aim: Bagging aims to decrease variance, not bias; Boosting aims to decrease bias, not variance.
3. Model weighting: In Bagging, each model receives equal weight; in Boosting, models are weighted according to their performance.
4. Dependence: In Bagging, each model is built independently; in Boosting, new models are influenced by the performance of previously built models.
5. Training data: Bagging selects different training subsets by row sampling with replacement from the entire training dataset; Boosting trains models iteratively, with each new model focusing on correcting the errors (misclassifications or high residuals) of the previous models.
6. Problem addressed: Bagging tries to solve the over-fitting problem; Boosting tries to reduce bias.
7. When to use: If the classifier is unstable (high variance), apply Bagging; if the classifier is stable and simple (high bias), apply Boosting.
8. Training order: In Bagging, base classifiers are trained in parallel; in Boosting, base classifiers are trained sequentially.
9. Example: The Random Forest model uses Bagging; AdaBoost uses Boosting.