MACHINE LEARNING UNIT I & II
(Study Material)
UNIT- I
Introduction to Machine Learning
What is Machine Learning?
Machine learning (ML) is a type of Artificial Intelligence (AI) that allows computers to
learn and make decisions without being explicitly programmed. It involves feeding data
into algorithms that can then identify patterns and make predictions on new data. Machine
learning is used in a wide variety of applications, including image and speech recognition,
natural language processing, and recommender systems.
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. A popular use case of image
recognition and face detection is the automatic friend tagging suggestion on Facebook.
It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in pictures.
2. Speech Recognition
While using Google, we get an option of "Search by voice." This comes under speech
recognition and is a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known
as "Speech to text", or "Computer speech recognition." At present, machine learning
algorithms are widely used by various applications of speech recognition. Google
assistant, Siri, Cortana, and Alexa are using speech recognition technology to follow the
voice instructions.
3. Traffic prediction:
If we want to visit a new place, we take help of Google Maps, which shows us the correct
path with the shortest route and predicts the traffic conditions.
It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, with the help of two sources:
o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time.
Everyone who uses Google Maps helps make the app better: it takes information from the
user and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for
a product on Amazon, we start getting advertisements for the same product while surfing
the internet in the same browser, and this is because of machine learning.
Google understands the user interest using various machine learning algorithms and suggests
the product as per customer interest.
Similarly, when we use Netflix, we find recommendations for entertainment series,
movies, etc., and this is also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine
learning plays a significant role in self-driving cars. Tesla, a popular car manufacturer, is
working on self-driving cars. It uses an unsupervised learning method to train the car
models to detect people and objects while driving.
6. Email Spam and Malware Filtering:
Incoming emails are automatically filtered as important, normal, or spam, and the technology
behind this is machine learning. Below are some common spam filters:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
7. Online Fraud Detection:
Machine learning also helps make online transactions safe and secure by detecting fraud.
For each genuine transaction, the output is converted into some hash values, and these values
become the input for the next round. For each genuine transaction there is a specific pattern,
which changes for a fraud transaction; hence the system detects it and makes our online
transactions more secure.
Machine Learning Life Cycle
Machine learning life cycle involves seven major steps, which are given below:
o Gathering Data
o Data preparation
o Data Wrangling
o Analyse Data
o Train the model
o Test the model
o Deployment
1. Gathering Data:
Data gathering is the first step of the machine learning life cycle. The goal of this step is to
identify the data requirements of the problem and obtain the relevant data.
In this step, we need to identify the different data sources, as data can be collected from
various sources such as files, databases, the internet, or mobile devices. It is one of the most
important steps of the life cycle. The quantity and quality of the collected data will determine
the efficiency of the output: the more data we have, the more accurate the prediction will be.
2. Data preparation
After collecting the data, we need to prepare it for further steps. Data preparation is a step
where we put our data into a suitable place and prepare it to use in our machine learning
training.
In this step, first, we put all data together, and then randomize the ordering of data.
This step can be further divided into two processes:
o Data exploration:
It is used to understand the nature of data that we have to work with. We need to
understand the characteristics, format, and quality of data.
A better understanding of data leads to an effective outcome. In this, we find
Correlations, general trends, and outliers.
o Data pre-processing:
Now the next step is pre-processing of data for its analysis.
3. Data Wrangling
Data wrangling is the process of cleaning and converting raw data into a useable format. It is
the process of cleaning the data, selecting the variable to use, and transforming the data in a
proper format to make it more suitable for analysis in the next step. It is one of the most
important steps of the complete process. Cleaning of data is required to address the quality
issues.
Not all of the data we collect is necessarily useful, as some of the data may
not be relevant. In real-world applications, collected data may have various issues, including:
o Missing Values
o Duplicate data
o Invalid data
o Noise
So, we use various filtering techniques to clean the data.
It is mandatory to detect and remove the above issues because they can negatively affect the
quality of the outcome.
4. Data Analysis
Now the cleaned and prepared data is passed on to the analysis step. This step involves
selecting analytical techniques, building models, and reviewing the results.
5. Train Model
The next step is to train the model. In this step, we train our model to improve its
performance and obtain a better outcome for the problem.
We use datasets to train the model using various machine learning algorithms. Training a
model is required so that it can understand the various patterns, rules, and features.
6. Test Model
Once our machine learning model has been trained on a given dataset, then we test the model.
In this step, we check for the accuracy of our model by providing a test dataset to it.
Testing the model determines the percentage accuracy of the model as per the requirement of
project or problem.
7. Deployment
The last step of machine learning life cycle is deployment, where we deploy the model in the
real-world system.
If the above-prepared model produces accurate results as per our requirement with
acceptable speed, then we deploy the model in the real system. But before deploying the
project, we check whether the model is improving its performance using the available data or not.
The deployment phase is similar to making the final report for a project.
Types of Machine Learning
Supervised Machine Learning
Unsupervised Machine Learning
Semi-Supervised Machine Learning
1. Supervised Machine Learning
In supervised learning, the training data provided to the machines works as the supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping
function to map the input variable(x) with the output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image classification,
Fraud Detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, where the model learns
about each type of data. Once the training process is completed, the model is tested on the
basis of test data (data held out from training), and then it predicts the output.
The working of supervised learning can be easily understood by the below example.
Suppose we have a dataset of different types of shapes which includes square, rectangle,
triangle, and Polygon. Now the first step is that we need to train the model for each shape.
o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as a hexagon.
Now, after training, we test our model using the test set, and the task of the model is to
identify the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it
classifies the shape on the basis of the number of sides and predicts the output.
1. Regression
Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather
forecasting, Market Trends, etc. Below are some popular Regression algorithms which come
under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
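As an illustration of regression, here is a minimal sketch (NumPy, with made-up data) that fits a straight line to noisy points using ordinary least squares:
Code:
import numpy as np

# Hypothetical data: y is roughly 3*x + 2 with some noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3 * x + 2 + rng.normal(0, 1, size=x.shape)

# Fit a line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y, deg=1)
print("Estimated slope:", slope)           # close to 3
print("Estimated intercept:", intercept)   # close to 2

# Predict a continuous value for a new input
x_new = 12.0
print("Prediction at x = 12:", slope * x_new + intercept)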
2. Classification
Classification algorithms are used when the output variable is categorical, which means the
output belongs to discrete classes such as Yes/No, Male/Female, True/False, etc. Spam
filtering is a common example. Below are some popular classification algorithms which
come under supervised learning:
o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
Applications of Supervised Learning
Supervised learning is used in a wide variety of applications, including:
Image classification: Identify objects, faces, and other features in images.
Natural language processing: Extract information from text, such as sentiment, entities,
and relationships.
Speech recognition: Convert spoken language into text.
Recommendation systems: Make personalized recommendations to users.
Predictive analytics: Predict outcomes, such as sales, customer churn, and stock prices.
Medical diagnosis: Detect diseases and other medical conditions.
Fraud detection: Identify fraudulent transactions.
Autonomous vehicles: Recognize and respond to objects in the environment.
Email spam detection: Classify emails as spam or not spam.
Quality control in manufacturing: Inspect products for defects.
Credit scoring: Assess the risk of a borrower defaulting on a loan.
Gaming: Recognize characters, analyse player behaviour, and create NPCs.
Customer support: Automate customer support tasks.
Weather forecasting: Make predictions for temperature, precipitation, and other
meteorological parameters.
Sports analytics: Analyse player performance, make game predictions, and optimize
strategies.
Advantages of Supervised Learning
o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such
as fraud detection, spam filtering, etc.
Disadvantages of Supervised Learning
o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from
the training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
2. Unsupervised Machine Learning
Unsupervised learning is a type of machine learning in which models are trained using an
unlabelled dataset and are allowed to act on that data without any supervision.
Unsupervised learning cannot be directly applied to a regression or classification problem
because unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of dataset, group that
data according to similarities, and represent that dataset in a compressed format.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of
the unsupervised learning algorithm is to identify the image features on their own.
Unsupervised learning algorithm will perform this task by clustering the image dataset into
the groups according to similarities between images.
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabelled and uncategorized data, which makes
unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding output so to
solve such cases, we need unsupervised learning.
Once a suitable algorithm is applied, the algorithm divides the data objects into groups
according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in a group and have little or no similarity with
the objects of another group. Cluster analysis finds the commonalities between the
data objects and categorizes them as per the presence and absence of those
commonalities.
o Association: An association rule is an unsupervised learning method which is used
for finding relationships between variables in a large database. It determines the
set of items that occur together in the dataset. Association rules make marketing
strategies more effective; for example, people who buy item X (say, bread) also
tend to purchase item Y (butter or jam). A typical example of association rules is Market
Basket Analysis.
Unsupervised Learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
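As an illustration of clustering, below is a minimal sketch (assuming scikit-learn is installed; the 2-D points are made up) that groups unlabelled points with K-means:
Code:
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabelled data: two loose groups of 2-D points
X = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],
              [8.0, 8.2], [7.8, 8.5], [8.3, 7.9]])

# Group the points into 2 clusters based on similarity
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("Cluster labels :", kmeans.labels_)          # e.g. [0 0 0 1 1 1]
print("Cluster centres:", kmeans.cluster_centers_)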
3. Semi-Supervised Learning
Semi-Supervised learning is a machine learning algorithm that works between
supervised and unsupervised learning, so it uses both labelled and unlabelled data. It's
particularly useful when obtaining labelled data is costly, time-consuming, or resource-
intensive. Semi-supervised learning is chosen when labelled data requires skills and relevant
resources in order to train or learn from it. We use these techniques when we are dealing with
data of which a small portion is labelled and the large remaining portion is unlabelled. We can
use unsupervised techniques to predict labels and then feed these labels to supervised
techniques. This technique is mostly applicable to image datasets, where usually
not all images are labelled.
Designing a Learning System
Step 1:
Choosing the Training Experience: The first and most important task is to choose the
training data or training experience which will be fed to the machine learning algorithm.
It is important to note that the data or experience that we feed to the algorithm has a
significant impact on the success or failure of the model, so the training data or experience
should be chosen wisely.
Below are the attributes which impact the success or failure of the model:
The training experience should be able to provide direct or indirect feedback regarding
choices. For example, while playing chess, the training experience provides feedback such
as: if this move is chosen instead of that one, the chances of success increase.
The second important attribute is the degree to which the learner controls the sequence of
training examples. For example, when training data is first fed to the machine its accuracy
is very low, but as it gains experience by playing again and again against itself or an
opponent, the algorithm receives feedback and controls the chess game accordingly.
The third important attribute is how well the training experience represents the distribution
of examples over which performance will be measured. A machine learning algorithm gains
experience by going through a number of different cases and examples; the more examples
it passes through, the more experience it gains, and hence its performance increases.
Step 2:
Choosing the Target Function: The next important step is choosing the target function. It
means that, according to the knowledge fed to the algorithm, the machine learning system
will choose a NextMove function which describes what type of legal moves should be taken.
For example, while playing chess against an opponent, when the opponent plays, the
machine learning algorithm decides which of the possible legal moves should be taken in
order to succeed.
Step 3:
Choosing a Representation for the Target Function: When the machine algorithm knows
all the possible legal moves, the next step is to choose an optimized move using some
representation, e.g. linear equations, a hierarchical graph representation, tabular form, etc.
The NextMove function then selects, out of these moves, the one that provides the highest
success rate. For example, if a chess-playing machine has 4 possible moves, it will choose
the optimized move that provides the most success.
Step 4:
Choosing a Function Approximation Algorithm: An optimized move cannot be chosen
just from the training data; the learner has to go through a set of examples, approximate
which steps should be chosen from them, and then receive feedback on the result. For
example, when training data for playing chess is fed to the algorithm, the algorithm does
not yet know whether a move will fail or succeed; from each failure or success it estimates,
for the next move, which step should be chosen and what its success rate is.
Step 5:
Final Design: The final design is created at last, when the system has gone through a number
of examples, failures and successes, and correct and incorrect decisions, and has learned what
the next step should be. Example: Deep Blue, an intelligent chess-playing computer, won a
chess match against the chess grandmaster Garry Kasparov and became the first computer to
beat a human world chess champion.
Linear Algebra
Linear algebra is the branch of mathematics that deals with linear equations and their
representation in vector spaces. In machine learning, linear algebra is used to represent and
manipulate data. In particular, vectors and matrices are used to represent and manipulate data
points, features, and weights in machine learning models.
A vector is an ordered list of numbers, while a matrix is a rectangular array of numbers. For
example, a vector can represent a single data point, and a matrix can represent a dataset.
Linear algebra operations, such as matrix multiplication and inversion, can be used to
transform and analyse data.
Following are some of the important linear algebra concepts, highlighting their importance
in machine learning (a short NumPy sketch of a few of these operations follows the list) –
Vectors and matrix − Vectors and matrices are used to represent datasets, features, target
values, weights, etc.
Matrix operations − operations such as addition, multiplication, subtraction, and transpose
are used in all ML algorithms.
Eigenvalues and eigenvectors − These are very useful in dimensionality-reduction
algorithms such as principal component analysis (PCA).
Projection − The concepts of a hyperplane and projection onto a plane are essential to
understanding support vector machines (SVM).
Factorization − Matrix factorization and singular value decomposition (SVD) are used to
extract important information in the dataset.
Tensors − Tensors are used in deep learning to represent multidimensional data. A tensor can
represent a scalar, vector or matrix.
Gradients − Gradients are used to find optimal values of the model parameters.
Jacobian Matrix − The Jacobian matrix is used to analyse the relationship between input and
output variables in an ML model.
Orthogonality − This is a core concept used in algorithms like principal component analysis
(PCA) and support vector machines (SVM).
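Below is a minimal NumPy sketch (the numbers are toy values chosen only for illustration) of a few of the operations listed above — matrix-vector product, transpose, eigen-decomposition, and SVD:
Code:
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 3.0]])
x = np.array([1.0, 2.0])      # a vector (e.g. one data point)

# Basic matrix/vector operations
print("A @ x       :", A @ x)        # matrix-vector product
print("A transpose :\n", A.T)

# Eigenvalues/eigenvectors (used in PCA-style dimensionality reduction)
eigvals, eigvecs = np.linalg.eig(A)
print("Eigenvalues :", eigvals)

# Singular value decomposition (used in matrix factorization)
U, S, Vt = np.linalg.svd(A)
print("Singular values:", S)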
Calculus
Calculus is the branch of mathematics that deals with rates of change and accumulation. In
machine learning, calculus is used to optimize models by finding the minimum or maximum
of a function. In particular, gradient descent, a widely used optimization algorithm, is based
on calculus.
Gradient descent is an iterative optimization algorithm that updates the weights of a model
based on the gradient of the loss function. The gradient is the vector of partial derivatives of
the loss function with respect to each weight. By iteratively updating the weights in the
direction of the negative gradient, gradient descent tries to minimize the loss function.
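Below is a minimal sketch of gradient descent (NumPy, using a made-up one-parameter quadratic loss) showing how a weight is repeatedly moved in the direction of the negative gradient:
Code:
import numpy as np

# Toy loss function: L(w) = (w - 4)^2, whose minimum is at w = 4
def loss(w):
    return (w - 4.0) ** 2

def gradient(w):
    return 2.0 * (w - 4.0)   # dL/dw

w = 0.0              # initial weight
learning_rate = 0.1

for step in range(50):
    w = w - learning_rate * gradient(w)   # move against the gradient

print("Weight after gradient descent:", w)   # close to 4
print("Final loss:", loss(w))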
Following are some of the important calculus concepts essential for machine learning −
Functions − Functions are the core of machine learning. In machine learning, a model learns
a function between inputs and outputs during the training phase. You should learn the basics
of functions, including continuous and discrete functions.
Derivative, Gradient and Slope − These are the core concepts to understand how
optimization algorithms, like gradient descent, work.
Partial Derivatives − These are used to find maxima or minima of a function. Generally
used in optimization algorithms.
Chain Rules − Chain rules are used to calculate the derivatives of loss functions with
multiple variables. You can see the application of chain rules mainly in neural networks.
Optimization Methods − These methods are used to find the optimal values of parameters
that minimize the cost function. Gradient descent is one of the most used optimization methods.
Probability Theory
Probability is the branch of mathematics that deals with uncertainty and randomness. In
machine learning, probability is used to model and analyse data that are uncertain or variable.
In particular, probability distributions, such as Gaussian and Poisson distributions, are used to
model the probability of data points or events.
Important probability theory concepts for machine learning include random variables,
probability distributions (discrete and continuous), conditional probability, and Bayes'
theorem; these are discussed in detail later in this material.
Statistics
Statistics is the branch of mathematics that deals with the collection, analysis, interpretation,
and presentation of data. In machine learning, statistics is used to evaluate and compare
models, estimate model parameters, and test hypotheses.
Mean, Median, Mode: These measures are used to understand the distribution of data and
identify outliers.
Standard deviation, Variance: These are used to understand the variability of a dataset and
to detect outliers.
Percentiles: These are used to summarize the distribution of a dataset and identify outliers.
Data Distribution: It is how data points are distributed or spread out across a dataset.
Skewness and Kurtosis: These are two important measures of the shape of a probability
distribution in machine learning.
Bias and Variance: They describe the sources of error in a model's predictions.
Hypothesis Testing: The process of testing a tentative assumption or idea (a hypothesis)
and validating it using data.
Linear Regression: It is the most used regression algorithm in supervised machine learning.
Logistic Regression: It's also an important supervised learning algorithm mostly used in
machine learning.
Random Variable
Random variable is a fundamental concept in statistics that bridges the gap between
theoretical probability and real-world data. A Random variable in statistics is a function
that assigns a real value to an outcome in the sample space of a random experiment.
For example: if you roll a die, you can assign a number to each possible outcome.
There are two basic types of random variables:
Discrete Random Variables (which take on specific values).
Continuous Random Variables (assume any value within a given range).
We define a random variable as a function that maps from the sample space of an
experiment to the real numbers. Mathematically, Random Variable is expressed as,
X: S → R
Where,
X is Random Variable (It is usually denoted using capital letter)
S is Sample Space
R is Set of Real Numbers
Example 2
Suppose a random variable X takes m different values, X = {x1, x2, x3, …, xm},
with corresponding probabilities P(X = xi) = pi, where 1 ≤ i ≤ m.
The probabilities must satisfy the following conditions:
0 ≤ pi ≤ 1, where 1 ≤ i ≤ m
p1 + p2 + p3 + … + pm = 1, i.e. 0 ≤ pi ≤ 1 and ∑ pi = 1
For example, Suppose a die is thrown (X = outcome of the dice).
Here, the sample space S = {1, 2, 3, 4, 5, 6}.
The output of the function will be:
P(X = 1) = 1/6
P(X = 2) = 1/6
P(X = 3) = 1/6
P(X = 4) = 1/6
P(X = 5) = 1/6
P(X = 6) = 1/6
This also satisfies the condition ∑ i=1..6 P(X = i) = 1, since:
P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6)
= 6 × 1/6 = 1
Variate
A variate is a general term often used interchangeably with a random variable,
particularly in contexts where the random variable is not yet fully specified by a particular
probabilistic experiment. The range of values that a random variable X can take is denoted
as RX, and individual values within this range are called quantiles. The probability of the
random variable X taking a specific value x is written as P(X = x).
While a frequency distribution shows how often outcomes occur in a sample or dataset, a
probability distribution assigns probabilities to outcomes in an abstract, theoretical manner,
regardless of any specific dataset. These probabilities represent the likelihood of each
outcome occurring.
In a discrete probability distribution, the random variable takes distinct values (like the
outcome of rolling a die). In a continuous probability distribution, the random variable
can take any value within a certain range (like the height of a person).
Key properties of a probability distribution include:
The probability of each outcome is greater than or equal to zero.
The sum of the probabilities of all possible outcomes equals 1.
Probability Theory
Probability theory is an advanced branch of mathematics that deals with measuring the
likelihood of events occurring. It provides tools to analyze situations
involving uncertainty and helps in determining how likely certain outcomes are. This
theory uses the concepts of random variables, sample space, probability distributions, and
more to determine the outcome of any situation.
For Example: Flipping a Coin
Flipping a coin is a random event with two possible outcomes: heads or tails. Each time
you flip a fair coin, there are exactly two possible outcomes, each with an equal chance of
occurring. Therefore, the probability of landing on heads is 1/2, and similarly, the
probability of landing on tails is also 1/2.
Theoretical Probability
Theoretical Probability deals with assumptions to avoid unfeasible or costly repetition of
experiments. The theoretical Probability for an event A can be calculated as follows:
P(A) = (Number of outcomes favorable to Event A) / (Number of all possible outcomes)
Now, as we have learned the formula, let's apply it to our coin-tossing case. In tossing a
coin, there are two outcomes: Head or Tail. Hence, the probability of occurrence of a Head
on tossing a coin is P(H) = 1/2.
Similarly, the probability of the occurrence of a Tail on tossing a coin is P(T) = 1/2.
Experimental Probability
Experimental probability is found by performing a series of experiments and observing
their outcomes. These random experiments are also known as trials. The experimental
probability for Event A can be calculated as follows:
P(A) = (Number of times event A happened) / (Total number of trials)
Now, as we have learned the formula, let's apply it to our coin-tossing case. If we tossed a
coin 10 times and recorded heads 4 times and tails 6 times, then the probability of
occurrence of heads on tossing the coin is P(H) = 4/10.
Similarly, the probability of occurrence of tails on tossing the coin is P(T) = 6/10.
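A small simulation (NumPy; the number of trials is an arbitrary choice) illustrates how the experimental probability approaches the theoretical value 1/2 as the number of trials grows:
Code:
import numpy as np

rng = np.random.default_rng(42)
trials = 10_000

# 1 represents heads, 0 represents tails
flips = rng.integers(0, 2, size=trials)

experimental_p_heads = flips.mean()         # (number of heads) / (total trials)
print("Experimental P(H):", experimental_p_heads)   # close to 0.5
print("Theoretical  P(H):", 0.5)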
Subjective Probability
Subjective probability refers to the likelihood of an event occurring, as estimated by an
individual based on their personal beliefs, experiences, intuition, or knowledge, rather than
on objective statistical data or formal mathematical models.
Example: A cricket enthusiast might assign a 70% probability to a team’s victory based on
their understanding of the team’s recent form, the opponent’s strengths and weaknesses,
and other relevant factors.
Random Variable
A variable that can assume the value of all possible outcomes of an experiment is called a
random variable in Probability Theory. Random variables in probability theory are of two
types which are discussed below,
Discrete Random Variable
Variables that can take countable values such as 0, 1, 2,… are called discrete random
variables.
Continuous Random Variable
Variables that can take an infinite number of values in a given range are called continuous
random variables.
Probability Theory Formulas
Various formulas are used in probability theory and some of them are discussed below,
Theoretical Probability Formula: (Number of Favourable Outcomes) / (Number of
Total Outcomes)
Empirical Probability Formula: (Number of times event A happened) / (Total number of trials)
Example 2: A fair coin is tossed three times. What is the probability of getting exactly
two heads?
Solution:
Total possible outcomes when tossing a coin three times = 2³ = 8.
Possible outcomes: HHH, HHT, HTH, THH, HTT, THT, TTH, TTT.
Outcomes with exactly two heads: HHT, HTH, THH (3 outcomes).
Probability of getting exactly two heads:
P(exactly 2 heads) = (Number of favourable outcomes) / (Total outcomes) = 3/8.
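The same answer can be checked in code by enumerating all 8 outcomes (a small sketch using Python's itertools):
Code:
from itertools import product

# All possible outcomes of tossing a coin three times
outcomes = list(product("HT", repeat=3))          # 2^3 = 8 outcomes

favourable = [o for o in outcomes if o.count("H") == 2]
print("Total outcomes     :", len(outcomes))      # 8
print("Exactly two heads  :", len(favourable))    # 3
print("P(exactly 2 heads) :", len(favourable) / len(outcomes))
Output:
Total outcomes     : 8
Exactly two heads  : 3
P(exactly 2 heads) : 0.375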
Decision Theory
Decision theory is a foundational concept in Artificial Intelligence (AI), enabling machines
to make rational and informed decisions based on available data. It combines principles
from mathematics, statistics, economics, and psychology to model and improve decision-
making processes. In AI, decision theory provides the framework to predict outcomes,
evaluate choices, and guide actions in uncertain environments.
There are two primary branches of decision theory:
Normative decision theory focuses on identifying the optimal decision, assuming the
decision-maker is rational and has complete information.
Descriptive decision theory examines how decisions are actually made in practice,
often dealing with cognitive limitations and psychological biases.
Understanding Decision Theory in Machine Learning
In artificial intelligence, decision theory uses mathematical models to assess possible
outcomes under uncertainty and assist systems in making decisions.
AI systems often use decision theory in two primary ways:
supervised learning
Reinforcement learning.
1. Supervised Learning
In supervised learning, AI systems are trained using labelled data to make predictions or
decisions. Decision theory helps optimize the classification or regression tasks by
evaluating the trade-offs between false positives, false negatives, and other outcomes based
on the utility of each result.
For instance, in medical diagnosis, the utility of correctly identifying a disease may be far
higher than the cost of a false alarm, leading the AI to favor sensitivity over specificity.
2. Reinforcement Learning
Reinforcement learning (RL) is one of the key areas where decision theory shines in AI. In
RL, agents learn to make decisions through trial and error, receiving feedback from their
environment in the form of rewards or penalties.
Markov Decision Processes (MDPs) are a common formalism for decision-making in
reinforcement learning, where decision theory principles help in navigating uncertainty
and maximizing long-term rewards.
In MDPs, the agent needs to choose actions that optimize future rewards, which aligns
with the decision-theoretic concept of maximizing expected utility.
Bayes' theorem can be derived using the product rule and the conditional probability of
event X with known event Y:
o According to the product rule, we can express the probability of event X with
known event Y as follows:
P(X ∩ Y) = P(X|Y) P(Y)        {equation 1}
Similarly, the probability of event Y with known event X is:
P(Y ∩ X) = P(Y|X) P(X)        {equation 2}
Since P(X ∩ Y) = P(Y ∩ X), equating the two expressions and rearranging gives Bayes'
theorem: P(X|Y) = P(Y|X) P(X) / P(Y).
Example:
A bag contains 4 balls. Two balls are drawn at random without replacement and are found to
be blue. What is the probability that all balls in the bag are blue?
Solution:
Let E1, E2, and E3 be the events that the bag contains 2, 3, and 4 blue balls respectively,
with equal priors P(E1) = P(E2) = P(E3) = 1/3, and let A be the event that the two balls
drawn are blue. Then:
P(A|E1) = 2C2/4C2 = 1/6
P(A|E2) = 3C2/4C2 = 1/2
P(A|E3) = 4C2/4C2 = 1
P(E3|A) = P(A|E3)P(E3) / [P(A|E1)P(E1) + P(A|E2)P(E2) + P(A|E3)P(E3)]
= [1/3 × 1] / [1/3 × 1/6 + 1/3 × 1/2 + 1/3 × 1]
= 3/5.
Example 2:
An unbiased dice is rolled and for each number on the dice a bag is chosen:
Number on the Dice    Bag chosen
1                     Bag A
2 or 3                Bag B
4, 5 or 6             Bag C
Bag A contains 3 white balls and 2 black balls, bag B contains 3 white balls and 4 black
balls, and bag C contains 4 white balls and 5 black balls. The die is rolled and a bag is
chosen; if a white ball is drawn, find the probability that it was chosen from bag B.
Solution:
Let E1, E2, and E3 be the events of choosing bag A, bag B, and bag C respectively, and let
A be the event of drawing a white ball. From the die roll, P(E1) = 1/6, P(E2) = 2/6 = 1/3,
and P(E3) = 3/6 = 1/2, while P(A|E1) = 3/5, P(A|E2) = 3/7, and P(A|E3) = 4/9. Then:
P(E2|A) = P(A|E2)P(E2) / [P(A|E1)P(E1) + P(A|E2)P(E2) + P(A|E3)P(E3)]
= (3/7 × 1/3) / (3/5 × 1/6 + 3/7 × 1/3 + 4/9 × 1/2)
= (1/7) / (1/10 + 1/7 + 2/9)
⇒ P(E2|A) = 90/293.
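The same calculation can be scripted as a sanity check (a small sketch using Python's fractions module; the priors and likelihoods are those of Example 2):
Code:
from fractions import Fraction as F

# Priors: probability of choosing each bag from the die roll
priors = {"A": F(1, 6), "B": F(2, 6), "C": F(3, 6)}

# Likelihoods: probability of drawing a white ball from each bag
likelihoods = {"A": F(3, 5), "B": F(3, 7), "C": F(4, 9)}

# Total probability of drawing a white ball
p_white = sum(priors[b] * likelihoods[b] for b in priors)

# Bayes' theorem: posterior probability that the ball came from bag B
posterior_B = priors["B"] * likelihoods["B"] / p_white
print("P(bag B | white) =", posterior_B)
Output:
P(bag B | white) = 90/293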
2. Decision Trees
Decision trees use entropy and information gain to split nodes and build a tree structure.
Information gain, based on entropy, measures the reduction in uncertainty after splitting a
node.
Information Gain: The information gain IG(T,A) for a dataset T and attribute A is:
o IG(T, A) = H(T) − ∑_{v ∈ Values(A)} (|T_v| / |T|) · H(T_v)
Code:
import numpy as np

def entropy(prob_dist):
    # Shannon entropy: H = -sum(p * log2(p)) over the probability distribution
    return -np.sum(prob_dist * np.log2(prob_dist))

# Example
prob_dist = np.array([0.2, 0.3, 0.5])
print("Entropy:", entropy(prob_dist))
Output:
Entropy: 1.4854752972273344
UNIT- II
Linear Regression in Machine learning
Linear regression is also a type of machine-learning algorithm more specifically
a supervised machine-learning algorithm that learns from the labelled datasets and maps
the data points to the most optimized linear functions, which can be used for prediction on
new datasets.
First, we should know what a supervised machine learning algorithm is. It is a type of
machine learning where the algorithm learns from labelled data. Supervised learning has
two types:
Classification: It predicts the class of the dataset based on the independent input
variables, where the class takes categorical or discrete values, e.g. whether the image of
an animal is a cat or a dog.
Regression: It predicts continuous output variables based on the independent input
variables, e.g. the prediction of house prices based on different parameters such as house
age, distance from the main road, location, area, etc.
Linear Model
The Linear Model is one of the most straightforward models in machine learning. It is the
building block for many complex machine learning algorithms, including deep neural
networks. Linear models predict the target variable using a linear function of the input
features.
The linear model is one of the simplest models in machine learning. It assumes that the data
is linearly separable and tries to learn the weight of each feature. Mathematically, it can be
written as Y = WᵀX, where X is the feature matrix, Y is the target variable, and W is
the learned weight vector. We apply a transformation function or a threshold for the
classification problem to convert the continuous-valued variable Y into a discrete category.
Here we will briefly learn linear and logistic regression, which are
the regression and classification task models, respectively.
Linear models in machine learning are easy to implement and interpret and are helpful in
solving many real-life use cases.
Among many linear models, this article will cover linear regression and logistic regression.
Linear Regression
Linear Regression is a statistical approach that predicts the result of a response variable by
combining numerous influencing factors. It attempts to represent the linear connection
between features (independent variables) and the target (dependent variables). The cost
function enables us to find the best possible values for the model parameters.
Example: An analyst would be interested in seeing how market movement influences the
price of ExxonMobil (XOM). The value of the S&P 500 index will be the independent
variable, or predictor, in this example, while the price of XOM will be the dependent
variable. In reality, various elements influence an event's result. Hence, we usually have
many independent features.
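A minimal scikit-learn sketch of this idea is shown below (the index levels and stock prices are invented purely for illustration, and scikit-learn is assumed to be installed):
Code:
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: S&P 500 level (predictor) vs. XOM price (response)
sp500 = np.array([[3800], [3900], [4000], [4100], [4200]])   # feature matrix
xom   = np.array([ 95.0,  98.0, 101.0, 105.0, 108.0])        # target vector

model = LinearRegression().fit(sp500, xom)
print("Slope (price change per index point):", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predicted XOM price at S&P 4300:", model.predict([[4300]])[0])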
Logistic Regression
Logistic regression is an extension of linear regression. The sigmoid function first transforms
the linear regression output between 0 and 1. After that, a predefined threshold helps to
determine the probability of the output values. The values higher than the threshold value
tend towards having a probability of 1, whereas values lower than the threshold value tend
towards having a probability of 0.
Example: A bank wants to predict if a customer will default on their loan based on their
credit score and income. The independent variables would be credit score and income, while
the dependent variable would be whether the customer defaults (1) or not (0).
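A hedged sketch of this loan-default example with scikit-learn (the credit scores, incomes, and labels below are invented for illustration only):
Code:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: [credit score, income in $1000s]; label 1 = default, 0 = no default
X = np.array([[580, 25], [600, 30], [620, 28], [700, 60], [720, 75], [750, 80]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LogisticRegression(max_iter=1000).fit(X, y)

new_customer = np.array([[640, 35]])
print("P(no default), P(default):", clf.predict_proba(new_customer)[0])
print("Predicted class:", clf.predict(new_customer)[0])   # 0 or 1 via the 0.5 threshold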
Several real-life scenarios follow linear relations between dependent and independent
variables. Some of the examples are:
The relationship between the boiling point of water and change in altitude.
The relationship between spending on advertising and the revenue of an organization.
The relationship between the amount of fertilizer used and crop yields.
Performance of athletes and their training regimen.
The dataset is divided into two parts, namely the feature matrix and the response vector.
The feature matrix contains all the vectors (rows) of the dataset, in which each vector
consists of the values of the features. In the standard weather example dataset (often used
to illustrate Naive Bayes), the features are 'Outlook', 'Temperature', 'Humidity' and 'Windy'.
The response vector contains the value of the class variable (prediction or output) for each
row of the feature matrix. In that dataset, the class variable name is 'Play golf'.
Bayes' theorem states:
P(A|B) = P(B|A) · P(A) / P(B)
where,
P(A) and P(B) are the probabilities of events A and B, and P(B) is never equal
to zero.
P(A|B) is the probability of event A when event B happens.
P(B|A) is the probability of event B when A happens.
Example: A person has undertaken a mining job. The probabilities of completing the job on
time with and without rain are 0.44 and 0.95 respectively. If the probability that it will rain
is 0.45, determine the probability that the mining job will be completed on time.
Solution:
Let A be the event that it rains and B the event that it does not rain, and let E1 and E2 be
the events that the job is completed on time given rain and given no rain, respectively.
We have,
P(A) = 0.45,
P(no rain) = P(B) = 1 − P(A) = 1 − 0.45 = 0.55,
P(E1) = 0.44, and P(E2) = 0.95.
Since events A and B form a partition of the sample space S, by the total probability
theorem the probability that the job is completed on time is
P(E) = P(A) P(E1) + P(B) P(E2) = 0.45 × 0.44 + 0.55 × 0.95 = 0.198 + 0.5225 = 0.7205.
Generative models:
Generative models aim to model the joint distribution of the input and output variables.
These models generate new data based on the probability distribution of the original
dataset. Generative models are powerful because they can generate new data that resembles
the training data. They can be used for tasks such as image and speech synthesis, language
translation, and text generation.
Discriminative models
The discriminative model aims to model the conditional distribution of the output variable
given the input variable. They learn a decision boundary that separates the different classes
of the output variable. Discriminative models are useful when the focus is on making
accurate predictions rather than generating new data. They can be used for tasks such
as image recognition, speech recognition, and sentiment analysis.
Graphical models
These models use graphical representations to show the conditional dependence between
variables. They are commonly used for tasks such as image recognition, natural language
processing, and causal inference.
Naive Bayes Algorithm in Probabilistic Models
The Naive Bayes algorithm is a widely used approach in probabilistic models,
demonstrating remarkable efficiency and effectiveness in
solving classification problems. By leveraging the power of the Bayes theorem and
making simplifying assumptions about feature independence, the algorithm
calculates the probability of the target class given the feature set. This method has
found diverse applications across various industries, ranging from spam filtering to
medical diagnosis. Despite its simplicity, the Naive Bayes algorithm has proven
to be highly robust, providing rapid results in a multitude of real-world problems.
Naive Bayes is a probabilistic algorithm that is used for classification problems.
It is based on the Bayes theorem of probability and assumes that the features are
conditionally independent of each other given the class. The Naive Bayes
Algorithm is used to calculate the probability of a given sample belonging to a
particular class. This is done by calculating the posterior probability of each class
given the sample and then selecting the class with the highest posterior
probability as the predicted class.
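Below is a minimal sketch of Naive Bayes spam classification with scikit-learn (the tiny word-count features and labels are fabricated for illustration):
Code:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features per email: counts of ["free", "win", "offer", "meeting"]
X = np.array([
    [3, 1, 2, 0],   # spam
    [2, 2, 1, 0],   # spam
    [0, 0, 0, 3],   # not spam
    [0, 1, 0, 2],   # not spam
])
y = np.array([1, 1, 0, 0])   # 1 = spam, 0 = not spam

clf = MultinomialNB().fit(X, y)

new_email = np.array([[1, 0, 1, 0]])
print("Posterior P(not spam), P(spam):", clf.predict_proba(new_email)[0])
print("Predicted class:", clf.predict(new_email)[0])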
Decision Tree
A decision tree is a type of supervised learning algorithm that is commonly used in machine
learning to model and predict outcomes based on input data. It is a tree-like structure where
each internal node tests on attribute, each branch corresponds to attribute value and each
leaf node represents the final decision or prediction. They can be used to solve
both regression and classification
Decision tree uses the tree representation to solve the problem in which each leaf node
corresponds to a class label and attributes are represented on the internal node of the tree.
We can represent any Boolean function on discrete attributes using the decision tree.
Below are some assumptions that we made while using the decision tree:
At the beginning, we consider the whole training set as the root.
Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
On the basis of attribute values, records are distributed recursively.
We use statistical methods for ordering attributes as root or the internal node.
The Decision Tree works on the Sum of Product (SOP) form, which is also known as
Disjunctive Normal Form. For example, a tree may predict whether a person uses a
computer in their daily life. In a Decision Tree, the major challenge is the identification of
the attribute for the root node at each level. This process is known as attribute selection.
We have two popular attribute selection measures:
1. Information Gain
2. Gini Index
Information Gain:
When we use a node in a decision tree to partition the training instances into smaller
subsets the entropy changes. Information gain is a measure of this change in entropy.
Suppose S is a set of instances, A is an attribute, Sv is the subset of S for which attribute A
has value v, and Values(A) is the set of all possible values of A. Then
Gain(S, A) = Entropy(S) − ∑_{v ∈ Values(A)} (|Sv| / |S|) · Entropy(Sv)
Entropy: Entropy is the measure of uncertainty of a random variable; it characterizes the
impurity of an arbitrary collection of examples. The higher the entropy, the higher the
impurity (uncertainty). For a set S, Entropy(S) = − ∑ᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion
of examples in S belonging to class i.
Example: Now, let us draw a Decision Tree for the following data using Information
gain. Training set: 3 features and 2 classes
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II
Here, we have 3 features and 2 output classes. To build a decision tree using information
gain, we take each of the features and calculate the information gain for each feature.
[Figures: information gain computed for splits on features X, Y, and Z]
From these calculations, we can see that the information gain is maximum when we make a
split on feature Y. So, the best-suited feature for the root node is feature Y. When we split
the dataset by feature Y, each child contains a pure subset of the target variable, so we don't
need to split the dataset further. The final tree for the above dataset would look like this:
[Figure: final decision tree with root node Y]
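The result above can be reproduced with a short NumPy sketch that computes the information gain of each feature on the four-row training set:
Code:
import numpy as np

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    total_entropy = entropy(labels)
    weighted = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        weighted += (len(subset) / len(labels)) * entropy(subset)
    return total_entropy - weighted

# Training set: features X, Y, Z and class C
X = np.array([1, 1, 0, 1])
Y = np.array([1, 1, 0, 0])
Z = np.array([1, 0, 1, 0])
C = np.array(["I", "I", "II", "II"])

for name, feat in [("X", X), ("Y", Y), ("Z", Z)]:
    print(f"IG({name}) =", round(information_gain(feat, C), 3))
# Expected: IG(X) = 0.311, IG(Y) = 1.0, IG(Z) = 0.0
# Y has the highest information gain, so it is chosen as the root split.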
2. Gini Index
Gini Index is a metric to measure how often a randomly chosen element would be
incorrectly identified. An attribute with a lower Gini index should therefore be preferred.
Sklearn supports the "gini" criterion for the Gini Index, and it is the default value of the
criterion parameter.
The formula for the calculation of the Gini Index is given below:
Gini = 1 − ∑ᵢ pᵢ²
where pᵢ is the proportion of examples belonging to class i.
The Gini Index is a measure of the inequality or impurity of a distribution, commonly used
in decision trees and other machine learning algorithms. For two classes it ranges from 0 to
0.5, where 0 indicates a pure set and 0.5 indicates a maximally impure set.
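For comparison, a short sketch of the Gini index computation, assuming the standard formula Gini = 1 − Σ pᵢ²:
Code:
import numpy as np

def gini_index(labels):
    # Gini = 1 - sum(p_i^2) over class proportions p_i
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(np.array(["I", "I", "I", "I"])))    # pure node
print(gini_index(np.array(["I", "I", "II", "II"])))  # maximally impure (2 classes)
Output:
0.0
0.5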
Example of a Decision Tree Algorithm
Forecasting Activities Using Weather Information
Root node: Whole dataset
Attribute: "Outlook" (Sunny, Overcast, Rainy).
Subsets: Sunny, Overcast, and Rainy.
Recursive Splitting: Divide the sunny subset even more according to humidity, for
example.
Leaf Nodes: Activities include “swimming,” “hiking,” and “staying inside.”
Regression Tree
A regression tree is a type of decision tree that uses a data set to predict continuous response
variables. It is a simple and fast algorithm that is often used when the relationship between
the features and the response is nonlinear.
Pruning
In machine learning, pruning is a technique that reduces the size of a
decision tree by removing non-critical branches or nodes. The goal of
pruning is to improve the model's performance, generalization, and
efficiency.
Benefits of pruning:
Reduces overfitting: Pruning prevents the model from memorizing the training data, which
can lead to poor performance on new data.
Improves predictive accuracy: Pruning reduces the complexity of the model, which can
improve its predictive accuracy.
Improves model simplicity: Pruning can make the model simpler, faster, and more robust.
Backpropagation
After forward propagation, the network evaluates its performance using a loss function,
which measures the difference between the actual output and the predicted output. The goal
of training is to minimize this loss. This is where backpropagation comes into play:
1. Loss Calculation: The network calculates the loss, which provides a measure of error
in the predictions. The loss function could vary; common choices are mean squared
error for regression tasks or cross-entropy loss for classification.
2. Gradient Calculation: The network computes the gradients of the loss function with
respect to each weight and bias in the network. This involves applying the chain rule of
calculus to find out how much each part of the output error can be attributed to each
weight and bias.
3. Weight Update: Once the gradients are calculated, the weights and biases are updated
using an optimization algorithm like stochastic gradient descent (SGD). The weights are
adjusted in the opposite direction of the gradient to minimize the loss. The size of the
step taken in each update is determined by the learning rate.
Iteration
This process of forward propagation, loss calculation, backpropagation, and weight update
is repeated for many iterations over the dataset. Over time, this iterative process reduces the
loss, and the network’s predictions become more accurate.
Through these steps, neural networks can adapt their parameters to better approximate the
relationships in the data, thereby improving their performance on tasks such as
classification, regression, or any other predictive modeling.
Example of Email Classification
Let’s consider a record of an email dataset:
[Table: an example email record with columns ID, Email Content, Sender, Subject Line, and Label]
To classify this email, we will create a feature vector based on the analysis of keywords
such as “free,” “win,” and “offer.”
The feature vector of the record can be presented as:
“free”: Present (1)
“win”: Absent (0)
“offer”: Present (1)
Activation Functions
Activation functions introduce non-linearity into the network, enabling it to learn and
model complex data patterns. Common activation functions include the sigmoid, tanh,
ReLU, and softmax functions.
Training a Feedforward Neural Network
Training a Feedforward Neural Network involves adjusting the weights of the neurons to
minimize the error between the predicted output and the actual output. This process is
typically performed using backpropagation and gradient descent.
1. Forward Propagation: During forward propagation, the input data passes through the
network, and the output is calculated.
2. Loss Calculation: The loss (or error) is calculated using a loss function such as Mean
Squared Error (MSE) for regression tasks or Cross-Entropy Loss for classification tasks.
3. Backpropagation: In backpropagation, the error is propagated back through the
network to update the weights. The gradient of the loss function with respect to each
weight is calculated, and the weights are adjusted using gradient descent.
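The sketch below ties the three steps together — forward propagation, loss calculation, and backpropagation with gradient descent updates — on a tiny made-up dataset with one hidden layer (NumPy only; layer sizes and learning rate are arbitrary choices):
Code:
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up dataset: 4 samples, 2 features, binary targets (XOR-like)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

# One hidden layer with 4 units, sigmoid activations
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for epoch in range(5000):
    # 1. Forward propagation
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # 2. Loss calculation (mean squared error)
    loss = np.mean((y_hat - y) ** 2)

    # 3. Backpropagation (chain rule) and gradient descent weight update
    d_yhat = 2 * (y_hat - y) / len(X)
    d_z2 = d_yhat * y_hat * (1 - y_hat)
    d_W2, d_b2 = h.T @ d_z2, d_z2.sum(axis=0, keepdims=True)
    d_h = d_z2 @ W2.T
    d_z1 = d_h * h * (1 - h)
    d_W1, d_b1 = X.T @ d_z1, d_z1.sum(axis=0, keepdims=True)

    W1 -= lr * d_W1; b1 -= lr * d_b1
    W2 -= lr * d_W2; b2 -= lr * d_b2

print("Final loss:", loss)
print("Predictions:", y_hat.round(2).ravel())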
Support Vector Machine (SVM)
A Support Vector Machine is a supervised learning algorithm that classifies data by finding
the best hyperplane separating the classes.
The dimension of the hyperplane depends on the number of features. For instance, if there
are two input features, the hyperplane is simply a line, and if there are three input features,
the hyperplane becomes a 2-D plane. As the number of features increases beyond three, the
complexity of visualizing the hyperplane also increases.
Consider two independent variables, x1 and x2, and one dependent variable represented as
either a blue circle or a red circle.
In this scenario, the hyperplane is a line because we are working with two features
(x1 and x2).
There are multiple lines (or hyperplanes) that can separate the data points.
The challenge is to determine the best hyperplane that maximizes the separation
margin between the red and blue circles.
As noted above, there are multiple lines (our hyperplane here is a line because we are
considering only two input features x1 and x2) that segregate our data points, i.e. classify
the red and blue circles.
Here we have one blue ball in the boundary of the red ball. So how does SVM classify the
data? It’s simple! The blue ball in the boundary of red ones is an outlier of blue balls. The
SVM algorithm has the characteristics to ignore the outlier and finds the best hyperplane
that maximizes the margin. SVM is robust to outliers.
For such data points, SVM finds the maximum margin as it did for the previous data sets,
but additionally adds a penalty each time a point crosses the margin; that is, it tries to
minimize (1/margin + λ · ∑ penalty). The margins in these cases are called soft margins.
Hinge loss is a commonly used penalty: if there are no violations there is no hinge loss, and
if there are violations the hinge loss is proportional to the distance of the violation.
For data that is not linearly separable in the original space, SVM solves the problem by
creating a new variable using a kernel. For a point xi on the line, we create a new variable
yi as a function of its distance from the origin o; plotting this new variable makes the
classes separable.
Support Vector Machine Terminology
Hyperplane: The hyperplane is the decision boundary used to separate data points of
different classes in a feature space. For linear classification, this is a linear equation
represented as wx+b=0.
Support Vectors: Support vectors are the closest data points to the hyperplane. These
points are critical in determining the hyperplane and the margin in Support Vector
Machine (SVM).
Margin: The margin refers to the distance between the support vector and the
hyperplane. The primary goal of the SVM algorithm is to maximize this margin, as a
wider margin typically results in better classification performance.
Kernel: The kernel is a mathematical function used in SVM to map input data into a
higher-dimensional feature space. This allows the SVM to find a hyperplane in cases
where data points are not linearly separable in the original space. Common kernel
functions include linear, polynomial, radial basis function (RBF), and sigmoid.
Hard Margin: A hard margin refers to the maximum-margin hyperplane that perfectly
separates the data points of different classes without any misclassifications.
Soft Margin: When data contains outliers or is not perfectly separable, SVM uses
the soft margin technique. This method introduces a slack variable for each data point
to allow some misclassifications while balancing between maximizing the margin and
minimizing violations.
C: The C parameter in SVM is a regularization term that balances margin
maximization and the penalty for misclassifications. A higher C value imposes a stricter
penalty for margin violations, leading to a smaller margin but fewer misclassifications.
Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes
misclassified points or margin violations and is often combined with a regularization
term in the objective function.
Dual Problem: The dual problem in SVM involves solving for the Lagrange
multipliers associated with the support vectors. This formulation allows for the use of
the kernel trick and facilitates more efficient computation.
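A brief scikit-learn sketch of these ideas (the 2-D points are invented for illustration; the RBF kernel and C value are arbitrary choices):
Code:
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D points for two classes ("blue" = 0, "red" = 1)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# C controls the softness of the margin; the kernel maps data to a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0).fit(X, y)

print("Support vectors:\n", clf.support_vectors_)
print("Prediction for [4, 4]:", clf.predict([[4, 4]])[0])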
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-
algorithm designed to improve the stability and accuracy of machine learning algorithms
used in statistical classification and regression. It decreases the variance and helps to
avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of
the model averaging approach.
Boosting
Boosting is an ensemble modeling technique designed to create a strong classifier by
combining multiple weak classifiers. The process involves building models sequentially,
where each new model aims to correct the errors made by the previous ones.
Initially, a model is built using the training data.
Subsequent models are then trained to address the mistakes of their predecessors.
Boosting assigns weights to the data points in the original dataset.
Higher weights: Instances that were misclassified by the previous model receive higher
weights.
Lower weights: Instances that were correctly classified receive lower weights.
Training on weighted data: The subsequent model learns from the weighted dataset,
focusing its attention on harder-to-learn examples (those with higher weights).
This iterative process continues until:
o The entire training dataset is accurately predicted, or
o A predefined maximum number of models is reached.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage of the
weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting
algorithm that won the prestigious Gödel Prize. AdaBoost was the first really successful
boosting algorithm developed for the purpose of binary classification. AdaBoost is short for
Adaptive Boosting and is a very popular boosting technique that combines multiple “weak
classifiers” into a single “strong classifier”.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data points.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weight of the wrongly classified data points and decrease the weights of
correctly classified data points. And then normalize the weights of all data points.
4. if (got required results)
Goto step 5
else
Goto step 2
5. End
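A short scikit-learn sketch of AdaBoost (toy data; the number of weak learners is an arbitrary choice, and a recent scikit-learn version where the base learner parameter is named estimator is assumed) combining decision stumps into a stronger classifier:
Code:
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical 1-D data with a noisy class boundary around x = 5
X = np.arange(10).reshape(-1, 1).astype(float)
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])   # one "hard" point at x = 5

# Weak learners: decision stumps (depth-1 trees), combined sequentially
boost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
)
boost.fit(X, y)

print("Training accuracy:", boost.score(X, y))
print("Prediction for x = 4.5:", boost.predict([[4.5]])[0])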
Bagging vs. Boosting
1. Bagging: The simplest way of combining predictions that belong to the same type.
   Boosting: A way of combining predictions that belong to different types.
2. Bagging: Aims to decrease variance, not bias.
   Boosting: Aims to decrease bias, not variance.
3. Bagging: Each model receives equal weight.
   Boosting: Models are weighted according to their performance.
4. Bagging: Each model is built independently.
   Boosting: New models are influenced by the performance of previously built models.
5.