R20 ML - Unit-1
Unit-1:
Unit I: Introduction- Artificial Intelligence, Machine Learning, Deep learning, Types of Machine Learning
Systems, Main Challenges of Machine Learning. Statistical Learning: Introduction, Supervised and
Unsupervised Learning, Training and Test Loss, Tradeoffs in Statistical Learning, Estimating Risk Statistics,
Sampling distribution of an estimator, Empirical Risk Minimization.
Introduction- Artificial Intelligence, Machine Learning, Deep learning
Limited Memory - These systems reference the past, and information is added over a period of time. The
referenced information is short-lived.
Theory of Mind - This covers systems that are able to understand human emotions and how they affect
decision making. They are trained to adjust their behavior accordingly.
Self-awareness - These systems are designed and created to be aware of themselves. They understand their
own internal states, predict other people’s feelings, and act appropriately.
Applications of Artificial Intelligence:
Machine Learning:
Tom M. Mitchell, Professor in the Machine Learning Department, School of Computer Science, Carnegie Mellon University, has defined machine learning as:
‘A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with experience
E.’
How Does Machine Learning Work?
Machine learning accesses vast amounts of data (both structured and unstructured) and learns from it to
predict the future. It learns from the data by using multiple algorithms and techniques. Below is a diagram
that shows how a machine learns from data.
TYPES OF MACHINE LEARNING
Machine learning can be classified into three broad categories:
1. Supervised learning – Also called predictive learning. A machine predicts the class of unknown objects
based on prior class-related information of similar objects.
2. Unsupervised learning – Also called descriptive learning. A machine finds patterns in unknown objects by
grouping similar objects together.
3. Reinforcement learning – A machine learns to act on its own to achieve the given goals.
Supervised learning:
Classification: the objective is to predict the categorical class of an object, for example classifying a set of emails as spam or non-spam.
Figure: Classification
Some typical classification problems include:
Image classification
Prediction of disease
Win–loss prediction of games
Prediction of natural calamity like earthquake, flood, etc.
Recognition of handwriting
Regression: In linear regression, the objective is to predict numerical values such as real estate or stock prices, temperature, marks in an examination, sales revenue, etc. The underlying predictor variables and the target variable are continuous in nature.
In case of simple linear regression, there is only one predictor variable whereas in case of multiple linear
regression, multiple predictor variables can be included in the model.
A typical simple linear regression model can be represented in the form y = β0 + β1x, where ‘x’ is the predictor variable and ‘y’ is the target variable.
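As a minimal sketch, assuming only NumPy is available and using made-up data purely for illustration, a model of this form can be fit with the least-squares formulas:

```python
import numpy as np

# Made-up data: x = years of experience, y = salary (in lakhs)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.5, 3.9, 5.1, 6.2, 7.4])

# Least-squares estimates of the coefficients in y = b0 + b1 * x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"Fitted model: y = {b0:.2f} + {b1:.2f} * x")
print("Prediction for x = 6:", b0 + b1 * 6)
```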
Typical applications of regression can be seen in
Demand forecasting in retail
Sales prediction for managers
Price prediction in real estate
Weather forecast
Skill demand forecast in job market.
Below is an example of a supervised learning method. The algorithm is trained using labeled data of dogs
and cats. The trained model predicts whether the new image is that of a cat or a dog.
Unsupervised learning: In unsupervised learning, there is no labeled training data to learn from and no
prediction to be made. In unsupervised learning, the objective is to take a dataset as input and try to find
natural groupings or patterns within the data elements or records. Therefore, unsupervised learning is often termed a descriptive model, and the process of unsupervised learning is referred to as pattern discovery or knowledge discovery.
Below is an example of an unsupervised learning method that trains a model using unlabeled data. In this case, the data consists of different vehicles. The purpose of the model is to group similar kinds of vehicles into clusters.
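A minimal sketch of this kind of grouping, assuming scikit-learn is available and using made-up numeric features (weight and number of wheels) in place of vehicle images:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up feature vectors: [weight in tonnes, number of wheels]
vehicles = np.array([
    [0.2, 2], [0.25, 2],      # two-wheelers
    [1.5, 4], [1.8, 4],       # cars
    [12.0, 6], [15.0, 10],    # trucks / buses
])

# Group the unlabeled data into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vehicles)
print(kmeans.labels_)  # cluster index assigned to each vehicle
```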
Reinforcement learning:
The goal of reinforcement learning is to train an agent to complete a task within an uncertain environment.
The agent receives observations and a reward from the environment and sends actions to the environment.
The reward measures how successful an action is with respect to completing the task goal.
Examples of reinforcement learning algorithms include Q-learning and Deep Q-learning Neural Networks.
Machine Learning Processes
Machine Learning involves seven steps: data collection, data preparation, choosing a model, training the model, evaluating the model, parameter tuning, and prediction.
The major difference between deep learning vs machine learning is the way data is presented to the machine.
Machine learning algorithms usually require structured data, whereas deep learning networks work on
multiple layers of artificial neural networks.
This is what a simple neural network looks like:
The network has an input layer that accepts inputs from the data. The hidden layer is used to find any hidden
features from the data. The output layer then provides the expected output.
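A minimal NumPy sketch of one forward pass through such a network (one input layer, one hidden layer, one output layer); the weights here are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = rng.random(4)              # input layer: 4 features
W1 = rng.random((3, 4))        # weights: input -> hidden (3 hidden units)
W2 = rng.random((1, 3))        # weights: hidden -> output (1 output unit)

hidden = sigmoid(W1 @ x)       # hidden layer extracts intermediate features
output = sigmoid(W2 @ hidden)  # output layer provides the expected output
print(output)
```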
Here is an example of a neural network that is trained on large sets of labeled retinal images. The network model is trained on this data to find out whether or not a person has diabetic retinopathy.
In the following example, deep learning and neural networks are used to identify the number on a license
plate. This technique is used by many countries to identify rules violators and speeding vehicles.
Recurrent Neural Network (RNN) - RNN uses sequential information to build a model. It often works better
for models that have to memorize past data.
Generative Adversarial Network (GAN) - GANs are algorithmic architectures that use two neural networks to
create new, synthetic instances of data that pass for real data. A GAN trained on photographs can generate
new photographs that look at least superficially authentic to human observers.
Deep Belief Network (DBN) - DBN is a generative graphical model that is composed of multiple layers of
latent variables called hidden units. The layers are connected to each other, but the units within a layer are not.
MAIN CHALLENGES OF MACHINE LEARNING
1. Inadequate training data: problems arise due to noisy data, incorrect data, and difficulty in generalizing the output data.
Data quality can be affected by some factors as follows:
Noisy data - It leads to inaccurate predictions, which affects decision making as well as accuracy in classification tasks.
Incorrect data - It is also responsible for faulty results obtained from machine learning models. Hence, incorrect data may affect the accuracy of the results as well.
Generalizing of output data - Sometimes generalizing the output data becomes complex, which results in comparatively poor future actions.
2. Poor quality of data: if your training data is full of errors, outliers, and noise (e.g., due to poor quality
measurements), it will make it harder for the system to detect the underlying patterns, so your system
is less likely to perform well.
For example:
• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors
manually.
• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you
must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing
values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
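A minimal pandas sketch of these options, using a made-up customer table in which one age value is missing:

```python
import pandas as pd

# Made-up data: one customer's age is missing
df = pd.DataFrame({"age": [25, 32, None, 41], "income": [30, 45, 52, 61]})

drop_rows = df.dropna()                             # option 1: ignore these instances
drop_col  = df.drop(columns=["age"])                # option 2: ignore the attribute altogether
filled    = df.fillna({"age": df["age"].median()})  # option 3: fill in with the median age
print(filled)
```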
3. Non-representative training data: it is crucial that your training data be representative of the new
cases you want to generalize to. This is true whether you use instance-based learning or model-based
learning.
If the sample is too small, you will have sampling noise (i.e., non-representative data as a result of chance), but even very large samples can be non-representative if the sampling method is flawed. This is called sampling bias.
Underfitting of training data: This occurs when we try to build a linear model with non-linear data. In such scenarios, the model is too simple to capture the underlying patterns of the data set, and it starts making wrong predictions.
Methods to reduce Underfitting:
o Increase model complexity
o Remove noise from the data
o Train on more and better features
o Reduce the constraints
o Increase the number of epochs to get better results.
Statistical Learning:
While structuring and visualizing data are important aspects of data science, the main challenge lies in the mathematical analysis of the data. When the goal is to interpret the model and quantify the uncertainty in the data, this analysis is usually referred to as statistical learning.
There are two major goals for modeling data:
1) to accurately predict some future quantity of interest, given some observed data, and
2) to discover unusual or interesting patterns in the data
To achieve these goals, one must rely on knowledge from three important pillars of the mathematical
sciences.
Function approximation: Building a mathematical model for data usually means understanding how one
data variable depends on another data variable. The most natural way to represent the relationship between
variables is via a mathematical function or map. Thus, data scientists have to understand how best to
approximate and represent functions using the least amount of computer processing and memory.
Optimization: Given a class of mathematical models, we wish to find the best possible model in that class.
This requires some kind of efficient search or optimization procedure. The optimization step can be viewed
as a process of fitting or calibrating a function to observed data. This step usually requires knowledge of
optimization algorithms and efficient computer coding or programming.
Probability and Statistics: In general, the data used to fit the model is viewed as a realization of a random
process or numerical vector, whose probability law determines the accuracy with which we can predict
future observations. Thus, in order to quantify the uncertainty inherent in making predictions about the
future, and the sources of error in the model, data scientists need a firm grasp of probability theory and
statistical inference.
Supervised and Unsupervised Learning:
Feature and response: Given an input or feature vector x, one of the main goals of machine learning is to
predict an output or response variable y.
For example,
x could be a digitized signature and y a binary variable that indicates whether the signature is genuine or
false.
x represents the weight and smoking habits of an expecting mother and y the birth weight of the baby.
Prediction function: a function g which takes as input x and outputs a guess g(x) for y (denoted by ŷ, for example).
Regression: the response variable y can take any real value.
Classification: when y can only lie in a finite set, say y ∈ {0, . . . , c − 1}, then predicting y is conceptually the same as classifying the input x into one of c categories, and so prediction becomes a classification problem.
Loss function: We can measure the accuracy of a prediction ŷ with respect to a given response y by using some loss function Loss(y, ŷ). In a regression setting the usual choice is the squared-error loss (y − ŷ)².
In the case of classification, the zero–one (also written 0–1) loss function Loss(y, ŷ) = 1{y ≠ ŷ} is often used, which incurs a loss of 1 whenever the predicted class ŷ is not equal to the class y.
we will encounter various other useful loss functions, such as the cross-entropy and hinge loss functions.
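A minimal sketch of the squared-error and zero–one loss functions written as plain Python functions (the function names are illustrative):

```python
def squared_error_loss(y, y_hat):
    # Regression: loss grows with the squared distance between y and its prediction
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    # Classification: loss is 1 when the predicted class differs from the true class
    return 1 if y != y_hat else 0

print(squared_error_loss(3.0, 2.5))  # 0.25
print(zero_one_loss("spam", "ham"))  # 1
```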
Error is often used as a measure of distance between a “true” object y and some approximation ŷ thereof. If y is real-valued, the absolute error |y − ŷ| and the squared error (y − ŷ)² are both well-established error concepts, as are the norm ‖y − ŷ‖ and squared norm ‖y − ŷ‖² for vectors. The squared error (y − ŷ)² is just one example.
Risk: assume that each pair (x, y) is the outcome of a random pair (X, Y) that has some joint probability density f(x, y). We then assess the predictive performance via the expected loss, usually called the risk, for g:

ℓ(g) = E Loss(Y, g(X)).

For example, in the classification case with the zero–one loss function the risk is equal to the probability of incorrect classification: ℓ(g) = P[Y ≠ g(X)]. In this case the prediction function g is called a classifier.

Given the distribution of (X, Y) and any loss function, we can in principle find the best possible prediction function

g* := argmin_g ℓ(g),

that is, the g that yields the smallest risk.
Learner: the function gτ is a learner that learns the unknown functional relationship g*: x → y from the training data τ. We use the learner gτ to predict the output of a new input X, for which the correct output Y is not known.
Supervised Learning: One tries to learn the functional relationship between the feature vector x and response y
in the presence of a teacher who provides n examples. It is common to speak of “explaining” or predicting y
on the basis of explanatory x, where x is a vector of explanatory variables.
An example of supervised learning is email spam detection.
Unsupervised learning: unsupervised learning makes no distinction between response and explanatory variables, and the objective is simply to learn the structure of the unknown distribution of the data. In other words, we need to learn f(x). In this case the guess g(x) is an approximation of f(x) and the risk is of the form

ℓ(g) = E Loss(f(X), g(X)).

Training and Test Loss:
Since the joint density f(x, y) of (X, Y) is in general unknown, the risk ℓ(g) cannot be computed directly. Instead, given training data τ = {(x1, y1), . . . , (xn, yn)}, we approximate the risk by the average loss over the training samples,

ℓτ(g) = (1/n) Σ_{i=1}^{n} Loss(yi, g(xi)),

which we call the training loss. The training loss is thus an unbiased estimator of the risk (the expected loss) for a prediction function g, based on the training data.
To approximate the optimal prediction function g* (the minimizer of the risk), we first select a suitable collection of approximating functions G and then take our learner to be the function in G that minimizes the training loss; that is,

g^G_τ = argmin_{g ∈ G} ℓτ(g).
The prediction accuracy for new pairs of data is measured by the generalization risk of the learner. For a fixed training set τ it is defined as

ℓ(gτ) = E Loss(Y, gτ(X)),

where the expectation is over a new pair (X, Y) that is independent of the training data.
Figure: The generalization risk for a fixed training set is the weighted-average loss over all possible pairs (x,
y).
Figure: The expected generalization risk is the weighted-average loss over all possible pairs (x, y) and over
all training sets.
For any outcome τ of the training data, we can estimate the generalization risk without bias by taking the sample average loss over an independent test set T = {(x′1, y′1), . . . , (x′n′, y′n′)}:

ℓT(gτ) = (1/n′) Σ_{i=1}^{n′} Loss(y′i, gτ(x′i)),

which is referred to as the test loss.
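A minimal sketch of this estimate, assuming scikit-learn is available and using made-up data: the learner is fitted on the training part only, and the squared-error loss is averaged over the held-out test pairs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = 2 * X.ravel() + rng.normal(scale=0.1, size=200)   # made-up data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
g = LinearRegression().fit(X_train, y_train)            # learner fitted on training data only

test_loss = np.mean((y_test - g.predict(X_test)) ** 2)  # sample-average loss on unseen pairs
print(test_loss)
```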
Tradeoffs in Statistical Learning:
To understand the relation between model complexity, computational simplicity, and estimation accuracy, it is useful to decompose the generalization risk into several parts, so that the tradeoffs between these parts can be studied.
We will consider two such decompositions: the approximation–estimation tradeoff and the bias–variance
tradeoff.
We can decompose the generalization risk into the following three components:

ℓ(g^G_τ) = ℓ(g*) + [ℓ(g^G) − ℓ(g*)] + [ℓ(g^G_τ) − ℓ(g^G)],

where ℓ(g*) is the irreducible risk, ℓ(g^G) − ℓ(g*) is the approximation error (g^G being the best prediction function within the class G), and ℓ(g^G_τ) − ℓ(g^G) is the statistical (estimation) error.
The decomposition can now be interpreted as follows. The irreducible risk does not depend on the class G or on the training data. The approximation error measures how far the best function in the class G is from the overall best prediction function g*; it can only be reduced by choosing a richer class of functions G. The statistical error measures how far the learner g^G_τ is from the best function in G, and it typically grows as the class G becomes more complex.
Thus, when using a squared-error loss, the generalization risk for a linear class can be decomposed as

ℓ(g^G_τ) = ℓ(g*) + E(g*(X) − g^G(X))² + E(g^G(X) − g^G_τ(X))².

Note that in this decomposition the statistical error is the only term that depends on the training set.
The errors in a machine learning model can be broken down into 2 parts:
1. Reducible Error
2. Irreducible Error
Irreducible errors are errors that cannot be reduced even if you use any other machine learning model. Reducible errors, on the other hand, are further broken down into the square of the bias and the variance. This bias and variance can cause the machine learning model to either overfit or underfit the given data.
What exactly is Bias?
Bias is the inability of a machine learning model to capture the true relationship between the data variables.
It is caused by the erroneous assumptions that are inherent to the learning algorithm. For example, in linear
regression, the relationship between the X and the Y variable is assumed to be linear, when in reality the
relationship may not be perfectly linear.
In general,
High Bias indicates more assumptions in the learning algorithm about the relationships between the
variables.
Less Bias indicates fewer assumptions in the learning algorithm.
Generally, nonlinear machine learning algorithms like decision trees have a high variance. It is even higher
if the branches are not pruned during training.
Let’s summarize:
If a model uses a simple machine learning algorithm, as in the case of a linear model, it will have high bias and low variance (underfitting the data).
If a model uses a complex machine learning algorithm, it will have high variance and low bias (overfitting the data).
You need to find a good balance between the bias and the variance of the model. This tradeoff in complexity is what is referred to as the bias–variance tradeoff. An optimal balance of bias and variance should neither overfit nor underfit the model.
This tradeoff applies to all forms of supervised learning: classification, regression, and structured
output learning.
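A minimal scikit-learn sketch of this tradeoff on made-up data: a degree-1 polynomial is too simple (high bias, underfitting), while a degree-15 polynomial fits the training noise (high variance, overfitting).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=30)  # non-linear data

for degree in (1, 15):  # simple model vs. very complex model
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_error = np.mean((y - model.predict(X)) ** 2)
    # degree 1: large training error (bias); degree 15: near-zero training error (variance)
    print(degree, round(train_error, 3))
```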
How to fix bias and variance problems?
Fixing High Bias
Adding more input features will help the model fit the data better.
Add more polynomial features to improve the complexity of the model.
Decrease the regularization term to have a balance between bias and variance.
Fixing High Variance
Reduce the number of input features; use only the features with higher feature importance to reduce overfitting of the data.
Getting more training data will also help, because a high-variance model trained on very little data will not work well on an independent dataset.
Estimating Risk:
1. In-sample risk: Due to the phenomenon of overfitting, the training loss of the learner, ℓτ(gτ), is not a good estimate of the generalization risk ℓ(gτ) of the learner.
2. Cross-Validation:
The idea is to make multiple identical copies of the data set, and to partition each copy into different training and test sets, as illustrated in the figure below. Here, there are four copies of the data set (consisting of response
and explanatory variables). Each copy is divided into a test set (colored blue) and training set (colored pink).
For each of these sets, we estimate the model parameters using only training data and then predict the
responses for the test set. The average loss between the predicted and observed responses is then a measure
for the predictive power of the model.
Figure: An illustration of four-fold cross-validation, representing four copies of the same data set. The data
in each copy is partitioned into a training set (pink) and a test set (blue). The darker columns represent the
response variable and the lighter ones the explanatory variables.
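A minimal sketch of four-fold cross-validation, assuming scikit-learn is available and using made-up data; the average of the four test losses estimates the predictive power of the model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=0.1, size=100)

losses = []
for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
    g = LinearRegression().fit(X[train_idx], y[train_idx])                    # fit on the training part
    losses.append(np.mean((y[test_idx] - g.predict(X[test_idx])) ** 2))       # loss on the test part

print(np.mean(losses))  # cross-validation estimate of the generalization risk
```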
Sampling distribution of an estimator:
In statistics, it is the probability distribution of the given statistic estimated on the basis of a random sample.
It provides a generalized approach to statistical inference. An estimator is the rule or formula used to calculate a sample statistic, and an estimate is the result of the estimation.
The sampling distribution of an estimator depends on the sample size, and the effect of changing the sample size has to be determined. An estimate has a single numerical value, and hence such estimates are called point estimates.
There are various estimators like sample mean, sample standard deviation, proportion, variance, range, etc.
Sampling distribution of the mean: It is the probability distribution of the means of samples drawn from the population. For all sample sizes, it is likely to be normal if the population distribution is normal. The population mean is equal to the mean of the sampling distribution of the mean. The sampling distribution of the mean has the following standard deviation:

σx̄ = σ/√n,

where σx̄ is the standard deviation of the sampling mean, σ is the population standard deviation, and n is the sample size.
As the size of the sample increases, the spread of the sampling distribution of the mean decreases. But the
mean of the distribution remains the same and it is not affected by the sample size.
The standard deviation of the sampling distribution of the standard deviation is the standard error of the standard deviation. It is defined as:

σs = σ/√(2n)
Here, σs is the standard deviation of the sampling distribution of the standard deviation. It is positively skewed for small n, but it becomes approximately normal for sample sizes greater than 30.
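A minimal NumPy simulation of these ideas (the population parameters are made up): the standard deviation of the sample means stays close to σ/√n and shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 50.0, 10.0  # made-up population mean and standard deviation

for n in (10, 100, 1000):
    # 5000 repeated samples of size n; compute the mean of each sample
    sample_means = rng.normal(mu, sigma, size=(5000, n)).mean(axis=1)
    # observed spread of the sample means vs. the theoretical value sigma / sqrt(n)
    print(n, round(sample_means.std(), 3), round(sigma / np.sqrt(n), 3))
```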
Empirical Risk Minimization:
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family
of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we
cannot know exactly how well an algorithm will work in practice (the true "risk") because we don't know the
true distribution of data that the algorithm will work on, but we can instead measure its performance on a
known set of training data (the "empirical" risk).
In general, the risk R(h) cannot be computed because the distribution P(x, y) is unknown to the learning algorithm (this situation is referred to as agnostic learning). However, we can compute an approximation, called the empirical risk, by averaging the loss function over the training set; more formally, by computing the expectation with respect to the empirical measure:

R_emp(h) = (1/n) Σ_{i=1}^{n} L(h(x_i), y_i).
The empirical risk minimization principle states that the learning algorithm should choose a hypothesis ĥ which minimizes the empirical risk over the hypothesis class H:

ĥ = argmin_{h ∈ H} R_emp(h).
Thus the learning algorithm defined by the ERM principle consists in solving the
above optimization problem.
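A minimal sketch of the ERM principle on made-up data, using a small hypothesis class of threshold classifiers chosen purely for illustration:

```python
import numpy as np

# Made-up training set: feature x and binary label y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0,   0,   0,   1,   1,   1  ])

def empirical_risk(threshold):
    # Average zero-one loss of the hypothesis h(x) = 1{x >= threshold}
    predictions = (x >= threshold).astype(int)
    return np.mean(predictions != y)

candidates = np.linspace(0, 7, 71)          # hypothesis class H (candidate thresholds)
best = min(candidates, key=empirical_risk)  # ERM: minimize the empirical risk over H
print(best, empirical_risk(best))
```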