FML Lecture Notes
Natural language processing (NLP). Most tasks in this field, including part-of-speech
tagging, named-entity recognition, context-free parsing, or dependency parsing, are cast
as learning problems. In these problems, predictions admit some structure. For example,
in part-of-speech tagging, the prediction for a sentence is a sequence of part-of-speech
tags labeling each word. In context-free parsing the prediction is a tree. These are
instances of richer learning problems known as structured prediction problems.
Many other problems such as fraud detection for credit card, telephone or insurance
companies, network intrusion, learning to play games such as chess, backgammon, or
Go, unassisted control of vehicles such as robots or cars, medical diagnosis, the design
of recommendation systems, search engines, or information extraction systems, are
tackled using machine learning techniques.
This list is by no means comprehensive. Most prediction problems found in practice can be cast
as learning problems and the practical application area of machine learning keeps expanding.
The algorithms and techniques discussed in these notes can be used to derive solutions for all of
these problems, though we will not discuss these applications in detail.
1.3 Components of Learning
Basic components of learning process
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the
various components and the steps involved in the learning process.
1. Data storage
Facilities for storing and retrieving huge amounts of data are an important component of the
learning process. Humans and computers alike utilize data storage as a foundation for advanced
reasoning.
• In a human being, the data is stored in the brain and data is retrieved using
electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar
devices to store data and use cables and other technology to retrieve data.
2. Abstraction
The second component of the learning process is known as abstraction.
Abstraction is the process of extracting knowledge about stored data. This involves creating
general concepts about the data as a whole. The creation of knowledge involves application of
known models and creation of new models. The process of fitting a model to a dataset is
known as training. When the model has been trained, the data is transformed into an abstract
form that summarizes the original information.
3. Generalization
The third component of the learning process is known as generalization.
The term generalization describes the process of turning the knowledge about stored data into a
form that can be utilized for future action. These actions are to be carried out on tasks that are
similar, but not identical, to those that have been seen before. In generalization, the goal is to
discover those properties of the data that will be most relevant to future tasks.
4. Evaluation
Evaluation is the last component of the learning process.
It is the process of giving feedback to the user to measure the utility of the learned knowledge.
This feedback is then utilized to effect improvements in the whole learning process.
Course Code: ACAC03
Course Name: Foundations of Machine Learning
Class / Semester: 3rd Year / VI Semester
Section: A
Name of the Department: CSE (Data Science)
Employee ID: IARE10805
Employee Name: Dr. G. Sucharitha
Topic Covered: Learning Problems and Scenarios
Course Outcome/s: Able to understand the learning problems involved in machine learning.
Handout Number: 2
Date:
Some common machine learning scenarios are given below. These scenarios differ in the types
of training data available to the learner, the order and method by which training data is received
and the test data used to evaluate the learning algorithm.
•Supervised learning: The learner receives a set of labeled examples as training data and makes
predictions for all unseen points. This is the most common scenario associated with
classification, regression, and ranking problems. The spam detection problem discussed in the
previous section is an instance of supervised learning.
• Unsupervised learning: The learner exclusively receives unlabeled training data, and makes
predictions for all unseen points. Since in general no labeled example is available in that
setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering
and dimensionality reduction are examples of unsupervised learning problems.
• Semi-supervised learning: The learner receives a training sample consisting of both labeled
and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is
common in settings where unlabeled data is easily accessible but labels are expensive to obtain.
Various types of problems arising in applications, including classification, regression, or
ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the
distribution of unlabeled data accessible to the learner can help it achieve better
performance than in the supervised setting. The analysis of the conditions under which this can
indeed be realized is the topic of much modern theoretical and applied machine learning
research.
•Transductive inference: As in the semi-supervised scenario, the learner receives a labeled
training sample along with a set of unlabeled test points. However, the objective of
transductive inference is to predict labels only for these particular test points. Transductive
inference appears to be an easier task and matches the scenario encountered in a variety of
modern applications. However, as in the semi-supervised setting, the assumptions under which
a better performance can be achieved in this setting are research questions that have not been
fully resolved.
• On-line learning: In contrast with the previous scenarios, the online scenario involves
multiple rounds where training and testing phases are intermixed. At each round, the learner
receives an unlabeled training point, makes a prediction, receives the true label, and incurs a
loss. The objective in the on-line setting is to minimize the cumulative loss over all rounds or
to minimize the regret, that is, the difference between the cumulative loss incurred and that of the
best expert in hindsight. Unlike the previous settings just discussed, no distributional assumption
is made in on-line learning. In fact, instances and their labels may be chosen adversarially within
this scenario.
•Reinforcement learning: The training and testing phases are also intermixed in reinforcement
learning. To collect information, the learner actively interacts with the environment and in
some cases affects the environment, and receives an immediate reward for each action. The
objective of the learner is to maximize its reward over a course of actions and interactions with the
environment. However, no long-term reward feedback is provided by the environment, and the
learner is faced with the exploration versus exploitation dilemma, since it must choose
between exploring unknown actions to gain more information versus exploiting the information
already collected.
•Active learning: The learner adaptively or interactively collects training examples, typically
by querying an oracle to request labels for new points. The goal in active learning is to achieve
a performance comparable to the standard supervised learning scenario (or passive learning
scenario), but with fewer labeled examples. Active learning is often used in applications where
labels are expensive to obtain, for example computational biology applications.
Course Code: ACAC03
Course Name: Foundations of Machine Learning
Class / Semester: 3rd Year / VI Semester
Section: A
Name of the Department: CSE (Data Science)
Employee ID: IARE10805
Employee Name: Dr. G. Sucharitha
Topic Covered: Need for Machine Learning
Course Outcome/s: Able to understand the necessity of machine learning.
Handout Number: 3
Date:
Machine Learning is a core form of Artificial Intelligence that enables machines to learn from
past data and make predictions. It involves data exploration and pattern matching with minimal
human intervention. Machine learning mainly works through the following four approaches.
1. Supervised Learning:
Supervised Learning is a machine learning method that needs supervision similar to the
student-teacher relationship. In supervised Learning, a machine is trained with well-labelled
data, which means some data is already tagged with correct outputs. So, whenever new data is
introduced into the system, supervised learning algorithms analyse this sample data and predict
correct outputs with the help of that labelled data.
o Classification: used when the output is a category, such as yellow or blue, right or wrong, etc.
o Regression: used when the output variable is a real value, such as age, height, etc.
This technology allows us to collect or produce data output from experience. It works the same
way as humans learn using some labelled data points of the training set. It helps in optimizing
the performance of models using experience and solving various complex computation
problems.
2. Unsupervised Learning:
Unlike supervised learning, unsupervised Learning does not require classified or well-labelled
data to train a machine. It aims to make groups of unsorted information based on some patterns
and differences even without any labelled training data. In unsupervised learning, no
supervision is provided, so no labelled sample data is given to the machine. Hence, the machine is
restricted to finding hidden structures in unlabelled data on its own.
Unsupervised learning is commonly divided into two categories of algorithms: clustering and association.
3. Semi-supervised learning:
Speech analysis, web content classification, protein sequence classification, and text document
classifiers are some of the most popular real-world applications of semi-supervised learning.
4. Reinforcement learning:
Reinforcement learning is defined as a feedback-based machine learning method that does not
require labeled data. In this learning method, an agent learns to behave in an environment by
performing actions and seeing the results of those actions. The agent receives positive feedback
for each good action and negative feedback for each bad action. Since there is no labelled training
data in reinforcement learning, the agent can learn only from its own experience.
Course Code: ACAC03
Course Name: Foundations of Machine Learning
Class / Semester: 3rd Year / VI Semester
Section: A
Name of the Department: CSE (Data Science)
Employee ID: IARE10805
Employee Name: Dr. G. Sucharitha
Topic Covered: Standard Learning Tasks
Course Outcome/s: Able to understand the learning process involved in machine learning.
Handout Number: 5
Date:
Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels Y. In
our example, these may be a set of functions mapping email features to Y = {spam, non-spam}.
More generally, hypotheses may be functions mapping features to a different set Y′. They could
be linear functions mapping email feature vectors to real numbers interpreted as scores (Y′ =
ℝ), with higher score values more indicative of spam than lower ones.
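For illustration, here is a minimal Python sketch of one such hypothesis: a linear function that maps a hypothetical email feature vector to a real-valued spam score. The features, weights, and the zero decision threshold are all made up for this example and are not part of the text above.

def spam_score(features, weights):
    """Hypothesis h(x) = w . x, mapping a feature vector to a real-valued score."""
    return sum(w * x for w, x in zip(weights, features))

# Hypothetical feature vector: [count of "free", count of "!", known sender (0/1)]
features = [3, 5, 0]
weights = [1.5, 0.4, -2.0]   # hypothetical learned weights
score = spam_score(features, weights)
print(score, "-> spam" if score > 0 else "-> non-spam")   # threshold at 0 (assumed)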
Course Code: ACAC03
Course Name: Foundations of Machine Learning
Class / Semester: 3rd Year / VI Semester
Section: A
Name of the Department: CSE (Data Science)
Employee ID: IARE10805
Employee Name: Dr. G. Sucharitha
Topic Covered: Statistical Learning Framework
Course Outcome/s: Able to understand the learning process involved in machine learning.
Handout Number: 6
Date:
Statistical learning theory is a framework for machine learning that draws from statistics and
functional analysis. It deals with finding a predictive function based on the data presented. The
main idea in statistical learning theory is to build a model that can draw conclusions from data
and make predictions.
Types of Data in Statistical Learning:
With statistical learning theory, there are two main types of data:
Dependent Variable — a variable (y) whose values depend on the values of other
variables (a dependent variable is sometimes also referred to as a target variable).
Independent Variable — a variable (x) whose value does not depend on the values of
other variables (independent variables are sometimes also referred to as predictor
variables, input variables, explanatory variables, or features).
In statistical learning, the independent variable(s) are the variables that affect the
dependent variable.
A common example of an Independent Variable is Age. There is nothing that one can do to
increase or decrease age. This variable is independent.
Weight — a person’s weight is dependent on his or her age, diet, and activity levels (as
well as other factors).
The price of a home is affected by the size of the home: the square footage (sq. ft) is the
independent variable, while the price of the home is the dependent variable.
Statistical Model
A statistical model defines the relationship between a dependent and an independent variable. In
the above graph, the relationship between the size of the home and the price of the home is
illustrated by the straight line. We can define this relationship by using y = mx + c, where m
represents the gradient and c is the intercept. Another way that this equation can be expressed is
with Greek-letter coefficients, which would look something like:
𝑃𝑟𝑖𝑐𝑒 𝑜𝑓 𝐻𝑜𝑚𝑒 = 𝛽0 + 𝛽1 ∗ 𝑆𝑞. 𝐹𝑡
If we suppose that the size of the home is not the only independent variable when determining
the price and that the number of bedrooms is also an independent variable, the equation would
look like
𝑃𝑟𝑖𝑐𝑒 𝑜𝑓 𝐻𝑜𝑚𝑒 = 𝛽0 + 𝛽1 ∗ 𝑆𝑞. 𝐹𝑡 + 𝛽2 ∗ 𝑁𝑜. 𝑜𝑓 𝐵𝑒𝑑𝑟𝑜𝑜𝑚𝑠
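As a small illustration, the Python sketch below simply evaluates the two-variable model above with made-up coefficient values; the β values are hypothetical and would normally be estimated from data.

def price_of_home(sq_ft, bedrooms, beta0=50.0, beta1=0.12, beta2=15.0):
    """Price (in $1000s) = beta0 + beta1 * Sq.Ft + beta2 * No. of Bedrooms (hypothetical betas)."""
    return beta0 + beta1 * sq_ft + beta2 * bedrooms

print(price_of_home(2000, 3))  # 50 + 0.12*2000 + 15*3 = 335.0, i.e. $335,000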
Model Generalization
In order to build an effective model, the available data needs to be used in a way that makes the
model generalizable to unseen situations. Common problems that occur when building models are
that the model under-fits or over-fits the data.
• Under-fitting — when a statistical model does not adequately capture the underlying structure of
the data and, therefore, does not include some parameters that would appear in a correctly
specified model.
• Over-fitting — when a statistical model contains more parameters than can be justified by the
data and includes the residual variation (“noise”) as if that variation represented underlying model
structure. A small sketch after this list illustrates both cases.
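The following Python sketch illustrates under-fitting and over-fitting on synthetic data: a degree-1 polynomial is too simple, while a high-degree polynomial fits the training noise and tends to do worse on held-out points. The data and polynomial degrees are chosen only for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy synthetic target

x_train, y_train = x[::2], y[::2]    # half of the points for training
x_test, y_test = x[1::2], y[1::2]    # the other half held out

for degree in (1, 3, 9):             # under-fit, reasonable fit, over-fit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")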
Course Code: ACAC03
Course Name: Foundations of Machine Learning
Class / Semester: 3rd Year / VI Semester
Section: A
Name of the Department: CSE (Data Science)
Employee ID: IARE10805
Employee Name: Dr. G. Sucharitha
Topic Covered: Probably Approximately Correct (PAC) Learning
Course Outcome/s: Able to understand the learning process involved in machine learning.
Handout Number: 7 & 8
Date:
Figure 1: Target concept R and possible hypothesis R′. Circles represent training instances. A
blue circle is a point labelled with 1, since it falls within the rectangle R. Others are red and
labelled with 0.
In many cases, in particular when the computational representation of the concepts is not
explicitly discussed or is straightforward, we may omit the polynomial dependency on n and
size(c) in the PAC definition and focus only on the sample complexity.
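As a concrete illustration for the axis-aligned rectangle example of Figure 1 (a standard result, stated here without proof): if the algorithm returns the tightest rectangle R′ consistent with the training sample, then for any ε, δ > 0, a sample size m ≥ (4/ε) log(4/δ) guarantees, with probability at least 1 − δ, that the error of R′ is at most ε. Since this sample complexity is polynomial in 1/ε and 1/δ, the class of axis-aligned rectangles is PAC-learnable.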
Module-II
SUPERVISED LEARNING ALGORITHMS
Let us assume that we want to learn the class, C, of a “family car.” We have a set of examples
of cars, and we have a group of people that we survey to whom we show these cars. The people
look at the cars and label them; the cars that they believe are family cars are positive examples,
and the other cars are negative examples. Class learning is finding a description that is shared
by all the positive examples and none of the negative examples. Doing this, we can make a
prediction: Given a car that we have not seen before, by checking with the description learned,
we will be able to say whether it is a family car or not. Or we can do knowledge extraction:
This study may be sponsored by a car company, and the aim may be to understand what people
expect from a family car.
Figure 1: Training set for the class of a “family car.” Each data point corresponds to one
example car, and the coordinates of the point indicate the price and engine power of that car.
‘+’ denotes a positive example of the class (a family car), and ‘−’ denotes a negative example
(not a family car); it is another type of car.
Let us denote price as the first input attribute x1 (e.g., in U.S. dollars) and engine power as the
second attribute x2 (e.g., engine volume in cubic centimetres). Thus, we represent each car
using two numeric values.
x = [x₁, x₂]ᵀ    (1)
Figure 2: Example of a hypothesis class. The class of family car is a rectangle in the price-
engine power space.
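A minimal Python sketch of such a rectangle hypothesis is given below; the price and engine-power bounds are hypothetical and would normally be chosen by the learning algorithm from the training set.

def is_family_car(price, engine_power,
                  price_range=(15000, 30000),   # hypothetical bounds p1 <= price <= p2 (dollars)
                  power_range=(1000, 2000)):    # hypothetical bounds e1 <= engine power <= e2 (cc)
    p1, p2 = price_range
    e1, e2 = power_range
    return int(p1 <= price <= p2 and e1 <= engine_power <= e2)

print(is_family_car(22000, 1600))  # 1: inside the rectangle, predicted positive (family car)
print(is_family_car(80000, 4500))  # 0: outside the rectangle, predicted negative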
Since we can separate one class from the other with a straight line, these classes are called
‘linearly separable.’
However, an infinite number of lines can be drawn to distinguish the two classes.
The exact location of this plane/hyperplane depends on the type of the linear classifier.
Non-Linear Classification
Non-Linear Classification refers to categorizing those instances that are not linearly separable.
It is not possible to separate such data with a single straight line; instead, the data is classified
with the help of a non-linear decision boundary (for example, a hyperplane in a transformed
feature space). We notice that even if we draw a straight line, there would be points of the
first class present between the data points of the second class.
Binary classification:
It is used when there are only two distinct classes and the data we want to classify belongs
exclusively to one of those classes, e.g., classifying whether a post about a given product is
positive or negative.
Multiclass classification: It is used when there are three or more classes and the data we
want to classify belongs exclusively to one of those classes, e.g., classifying whether a
semaphore in an image is red, yellow or green.
Multilabel classification:
It is used when there are two or more classes and the data we want to classify may
belong to none of the classes or all of them at the same time, e.g., classifying which traffic
signs are contained in an image.
Multi-label classification: It allows us to classify data sets with more than one target variable.
In multi-label classification, we have several labels that are the outputs for a given prediction.
When making predictions, a given input may belong to more than one label.
For example, when predicting a given movie category, it may belong to horror, romance,
adventure, action, or all simultaneously. In this example, we have multi-labels that can be
assigned to a given movie.
Multi-class classification: In multi-class classification, an input belongs to only a single label.
For example, when predicting if a given image belongs to a cat or a dog, the output can be
either a cat or dog but not both at the same time.
As an example of multi-label text classification, a model can classify a given text input into
different categories, where the text input may belong to multiple categories or labels at the
same time.
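The difference between the two settings is easiest to see in how the targets are encoded. The short Python sketch below, with a hypothetical movie-genre label set, contrasts a one-hot target (multi-class: exactly one label) with a multi-hot target (multi-label: any subset of labels).

classes = ["horror", "romance", "adventure", "action"]  # hypothetical label set

def one_hot(label):
    """Multi-class target: exactly one class per example."""
    return [1 if c == label else 0 for c in classes]

def multi_hot(labels):
    """Multi-label target: any subset of classes per example."""
    return [1 if c in labels else 0 for c in classes]

print(one_hot("action"))                   # [0, 0, 0, 1]
print(multi_hot({"horror", "adventure"}))  # [1, 0, 1, 0]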
A decision tree is a structure that contains nodes and edges and is built from a dataset. Each
node is either used to make a decision (known as a decision node) or to represent an outcome
(known as a leaf node).
The decision nodes here are questions like ‘Is the person less than 30 years of age?’, ‘Does the
person eat junk?’, etc. and the leaves are one of the two possible outcomes viz. Fit and Unfit.
Looking at the decision tree, we can make the following decisions:
if a person is less than 30 years of age and doesn’t eat junk food then he is Fit; if a person is less
than 30 years of age and eats junk food then he is Unfit; and so on.
The initial node is called the root node (colored in blue), the final nodes are called the leaf
nodes (colored in green) and the rest of the nodes are called intermediate or internal nodes.
The root and intermediate nodes represent the decisions while the leaf nodes represent the
outcomes.
ID3 Algorithm
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step.
Invented by Ross Quinlan, ID3 uses a top-down greedy approach to build a decision tree. In
simple words, the top-down approach means that we start building the tree from the top and
the greedy approach means that at each iteration we select the best feature at the present
moment to create a node.
Most generally, ID3 is used only for classification problems with nominal features.
Metrics in ID3
As mentioned previously, the ID3 algorithm selects the best feature at each step while building a
Decision tree. Before you ask, the answer to the question: ‘How does ID3 select the best
feature?’ is that ID3 uses Information Gain or just Gain to find the best feature.
Information Gain calculates the reduction in the entropy and measures how well a given feature
separates or classifies the target classes. The feature with the highest Information Gain is
selected as the best one. In simple words, Entropy is the measure of disorder and the Entropy of
a dataset is the measure of disorder in the target feature of the dataset.
In the case of binary classification (where the target column has only two types of classes),
entropy is 0 if all values in the target column are homogeneous (similar) and 1 if the target
column has an equal number of values for both classes.
We denote our dataset as S; its entropy is calculated as
Entropy(S) = − Σᵢ₌₁ⁿ pᵢ log₂(pᵢ)
where n is the total number of classes in the target column (in our case n = 2, i.e., YES and NO),
and pᵢ is the probability of class i, i.e., the ratio of the “number of rows with class i in the target
column” to the “total number of rows” in the dataset.
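The following Python sketch, on a made-up toy dataset, shows the entropy and information-gain computations described above; the feature names and labels are hypothetical.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels: -sum_i p_i * log2(p_i)."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Reduction in entropy of the labels after splitting the rows on the given feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

rows = [{"junk": "yes"}, {"junk": "yes"}, {"junk": "no"}, {"junk": "no"}]  # hypothetical data
labels = ["Unfit", "Unfit", "Fit", "Fit"]
print(entropy(labels))                         # 1.0: two perfectly balanced classes
print(information_gain(rows, labels, "junk"))  # 1.0: the feature separates the classes perfectly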
CART (Classification and Regression Tree) is a variation of the decision tree algorithm. It
can handle both classification and regression tasks.
CART Algorithm
CART is a predictive algorithm used in machine learning; it describes how the target
variable’s values can be predicted from other variables. It is a decision tree in which each
fork is a split on a predictor variable and each leaf node holds a prediction for the target
variable.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value of an
attribute. The root node is taken as the training set and is split into two by considering the
best attribute and threshold value. Further, the subsets are also split using the same logic.
This continues till the last pure sub-set is found in the tree or the maximum number of leaves
possible in that growing tree is reached.
The CART algorithm works via the following process:
1. The best split point of each input is obtained.
2. Based on the best split points of each input in Step 1, the new “best” split point is identified.
3. Split the chosen input according to the “best” split point.
4. Continue splitting until a stopping rule is satisfied or no further desirable splitting is available.
CART algorithm uses Gini Impurity to split the dataset into a decision tree. It does that by
searching for the best homogeneity for the sub nodes, with the help of the Gini index
criterion.
Gini index/Gini impurity
The Gini index is a metric for classification tasks in CART. It is based on the sum of squared
class probabilities: it measures the probability that a specific element would be wrongly
classified if it were labelled randomly according to the class distribution, and it is a variation
of the Gini coefficient. It works on categorical variables, provides outcomes as either “success”
or “failure”, and hence performs binary splitting only.
The degree of the Gini index varies from 0 to 1, where:
A value of 0 indicates that all the elements belong to a single class, or only one class exists.
A value of 1 signifies that the elements are randomly distributed across various classes.
A value of 0.5 denotes that the elements are uniformly distributed over some classes.
Gini = 1 − Σᵢ₌₁ⁿ (pᵢ)²
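A short Python sketch of the Gini impurity computation used to score a candidate binary split is given below; the example labels are made up.

from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_i (p_i)^2 over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left_labels, right_labels):
    """Impurity of a binary split, weighted by the size of each side."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

print(gini(["yes", "yes", "no", "no"]))          # 0.5: maximally mixed two-class node
print(gini(["yes", "yes", "yes"]))               # 0.0: pure node
print(split_gini(["yes", "yes"], ["no", "no"]))  # 0.0: a perfect binary split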
Limitations of CART
Overfitting.
High variance.
Low bias.
The tree structure may be unstable.
Applications of the CART algorithm
For quick Data insights.
In Blood Donors Classification.
For environmental and ecological data.
In the financial sectors.
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence it is called linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according to
the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship
between the variables. Consider the below image:
Mathematically, we can represent a linear regression as:
Y = a0 + a1X + ε
Here, Y is the dependent (target) variable, X is the independent (predictor) variable, a0 is the
intercept of the line, a1 is the linear regression coefficient (slope), and ε is the random error term.
The values of the x and y variables are the training data used to fit the linear regression model.
A linear line showing the relationship between the dependent and independent variables is
called a regression line. A regression line can show two types of relationship:
When working with linear regression, our main goal is to find the best fit line that means the
error between predicted values and actual values should be minimized. The best fit line will
have the least error.
Different values of the weights or coefficients of the line (a0, a1) give different regression lines,
so we need to calculate the best values of a0 and a1 to find the best fit line; to calculate these we
use a cost function.
Cost Function
o Different values of the weights or coefficients of the line (a0, a1) give different regression
lines, and the cost function is used to estimate the coefficient values for the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps
the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For N
observations it can be written as
MSE = (1/N) Σᵢ₌₁ᴺ (yᵢ − (a0 + a1xᵢ))²
where yᵢ is the actual value and a0 + a1xᵢ is the predicted value for the i-th observation.
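The sketch below (Python, with synthetic data) fits a0 and a1 by ordinary least squares and evaluates the MSE cost defined above; the data-generating values 3.0 and 2.0 are made up for illustration.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=x.shape)  # hypothetical noisy data

a1, a0 = np.polyfit(x, y, 1)       # best-fit slope (a1) and intercept (a0)
y_pred = a0 + a1 * x
mse = np.mean((y - y_pred) ** 2)   # Mean Squared Error cost

print(f"a0 = {a0:.2f}, a1 = {a1:.2f}, MSE = {mse:.3f}")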
In order to model the relationship between two or more features and a response, multiple linear
regression fits a linear equation to the observed data. The procedure for performing multiple
linear regression is almost identical to that for simple linear regression; the difference lies in the
evaluation. It can help us determine which variable has the most influence on the predicted result
and how the several factors interact.
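A minimal Python sketch of multiple linear regression with two independent variables (square footage and number of bedrooms) is shown below; the tiny dataset and its prices are hypothetical.

import numpy as np

sqft = np.array([1000, 1500, 1800, 2400, 3000], dtype=float)
bedrooms = np.array([2, 3, 3, 4, 5], dtype=float)
price = np.array([200, 280, 320, 410, 500], dtype=float)  # hypothetical prices in $1000s

X = np.column_stack([np.ones_like(sqft), sqft, bedrooms])  # design matrix with an intercept column
b0, b1, b2 = np.linalg.lstsq(X, price, rcond=None)[0]      # ordinary least-squares coefficients

print(f"Price = {b0:.1f} + {b1:.3f}*SqFt + {b2:.1f}*Bedrooms")
print("Predicted price for 2000 sq. ft, 3 bedrooms:", b0 + b1 * 2000 + b2 * 3)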
Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables. Logistic regression predicts the output of a
categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
It can be either Yes or No, 0 or 1, True or False, etc.; but instead of giving the exact values 0 and
1, it gives probabilistic values which lie between 0 and 1.
Logistic regression is very similar to linear regression except in how it is used.
Linear regression is used for solving regression problems, whereas logistic regression is used
for solving classification problems. In logistic regression, instead of fitting a regression
line, we fit an “S”-shaped logistic function, which predicts two maximum values (0 or 1). The
curve from the logistic function indicates the likelihood of something such as whether the cells
are cancerous or not, a mouse is obese or not based on its weight, etc. Logistic Regression is a
significant machine learning algorithm because it has the ability to provide probabilities and
classify new data using continuous and discrete datasets.
Logistic regression can be used to classify observations using different types of data and
can easily determine the most effective variables for the classification. The logistic function
is described below:
Logistic Function (Sigmoid Function):
• The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The output of logistic regression must be between 0 and 1 and cannot go beyond
this limit, so it forms an “S”-shaped curve. The S-shaped curve is called the sigmoid
function or the logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the
boundary between the probability of 0 or 1: values above the threshold tend to 1, and
values below the threshold tend to 0, as shown in the sketch after this list.
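To make the description above concrete, here is a small Python sketch of the sigmoid function and the threshold rule; the threshold of 0.5 is a common default assumed for this example.

import math

def sigmoid(z):
    """Maps any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, threshold=0.5):
    """Predict class 1 if the probability is at or above the threshold, else class 0."""
    return int(sigmoid(z) >= threshold)

print(sigmoid(0.0))    # 0.5
print(classify(2.0))   # 1 (probability ~0.88, above the threshold)
print(classify(-3.0))  # 0 (probability ~0.05, below the threshold)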