Al3451 - Machine Learning - Answer Key 13 Mark
13 MARK
11A) EXAMPLES OF ML:
Prominent examples of machine learning applications include image recognition, speech recognition,
medical diagnosis, fraud detection, product recommendations, spam filtering, self-driving cars,
natural language processing (NLP), stock market prediction, and personalized advertising.
Key points about these examples:
Image Recognition:
Identifying objects or people in images using algorithms trained on large datasets, commonly seen in
facial recognition software and photo tagging on social media platforms.
Speech Recognition:
Converting spoken language into text, used in voice assistants like Siri and Google Assistant.
Medical Diagnosis:
Analyzing medical images like X-rays or MRIs to identify potential diseases, assisting doctors in
diagnosis.
Fraud Detection:
Identifying suspicious financial transactions in real-time to prevent fraudulent activity.
Product Recommendations:
Suggesting products to users based on their past purchase history or browsing behavior on online
shopping platforms.
Spam Filtering:
Automatically identifying and filtering unwanted emails as spam.
Self-Driving Cars:
Using sensors and machine learning algorithms to navigate vehicles autonomously.
Natural Language Processing (NLP):
Analyzing and understanding human language, used in chatbots, sentiment analysis, and machine
translation.
Stock Market Prediction:
Forecasting future stock prices based on historical data analysis.
Personalized Advertising:
Tailoring advertisements to individual users based on their online behavior and demographics.
11 B) Probably approximately correct learning
In computer science, computational learning theory (or just learning theory) is a subfield of artificial
intelligence devoted to studying the design and analysis of machine learning algorithms. In computational
learning theory, probably approximately correct learning (PAC learning) is a framework for
mathematical analysis of machine learning algorithms. It was proposed in 1984 by Leslie Valiant.
In this framework, the learner (that is, the algorithm) receives samples and must select a hypothesis from
a certain class of hypotheses. The goal is that, with high probability (the “probably” part), the selected
hypothesis will have low generalization error (the “approximately correct” part). In this section we first
give an informal definition of PAC-learnability. After introducing a few more notions, we give a more
formal, mathematically oriented, definition of PAC-learnability. At the end, we mention one of the
applications of PAC-learnability.
PAC-learnability
To define PAC-learnability we require some specific terminology and related notations.
Let X be a set called the instance space which may be finite or infinite. For example, X
may be the set of all points in a plane.
A concept class C for X is a family of functions c : X → {0, 1}. A member of C is called a
concept. A concept can also be thought of as a subset of X. If c is a subset of X, it defines a
unique function µc : X → {0, 1} as follows:
µc(x) = 1 if x ∈ c, and µc(x) = 0 otherwise.
A hypothesis h is also a function h : X → {0, 1}. So, as in the case of concepts, a hypothesis
can also be thought of as a subset of X. H will denote a set of hypotheses.
We assume that F is an arbitrary, but fixed, probability distribution over X.
Training examples are obtained by taking random samples from X. We assume that the
samples are randomly generated from X according to the probability distribution F.
Definition (informal)
Let X be an instance space, C a concept class for X, h a hypothesis in C and F an arbitrary,
but fixed, probability distribution. The concept class C is said to be PAC-learnable if there
is an algorithm A which, for samples drawn with any probability distribution F and any
concept c ∈ C, will with high probability produce a hypothesis h ∈ C whose error is small.
Examples
Example
Let the instance space be the set X of all points in the Euclidean plane. Each point is
represented by its coordinates (x, y). So, the dimension or length of the instances is 2.
Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that is, the set
of all rectangles whose sides are parallel to the coordinate axes in the plane (see Figure).
An axis-aligned rectangle can be defined by a set of inequalities of the following
form, having four parameters a, b, c and d:
a ≤ x ≤ b, c ≤ y ≤ d
Given a set of sample points labeled positive or negative, let L be the algorithm which
outputs the hypothesis defined by the axis-aligned rectangle which gives the tightest fit to
the positive examples (that is, that rectangle with the smallest area that includes all of the
positive examples and none of the negative examples) (see Figure below).
Figure : Axis-aligned rectangle which gives the tightest fit to the positive examples
It can be shown that, in the notations introduced above, the concept class C is
PAC-learnable by the algorithm L using the hypothesis space H of all axis-aligned rectangles.
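The tightest-fit learner for axis-aligned rectangles can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the sample points, and the choice to predict negative everywhere when no positive example is seen are assumptions, not part of the original answer.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# (a, b, c, d) encodes the rectangle a <= x <= b, c <= y <= d
Rect = Tuple[float, float, float, float]

def tightest_rectangle(samples: List[Tuple[Point, int]]) -> Optional[Rect]:
    """Return the smallest axis-aligned rectangle containing all positive points."""
    positives = [p for p, label in samples if label == 1]
    if not positives:
        return None  # assumption: with no positives, predict negative everywhere
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect: Optional[Rect], point: Point) -> int:
    """Label a point 1 if it lies inside the learned rectangle, else 0."""
    if rect is None:
        return 0
    a, b, c, d = rect
    x, y = point
    return int(a <= x <= b and c <= y <= d)

# Hypothetical labeled sample: three positives and two negatives.
samples = [((1, 1), 1), ((2, 3), 1), ((3, 2), 1), ((0, 0), 0), ((5, 5), 0)]
h = tightest_rectangle(samples)
```

Here the hypothesis h is determined entirely by the extreme coordinates of the positive examples, which is why four parameters are enough to describe it.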
12 A) The bias-variance tradeoff in machine learning refers to the tension between a model that is too simple
to capture the underlying pattern (high bias) and a model so flexible that it captures noise in the training data
(high variance). As the complexity of a model increases, its bias tends to decrease while its variance increases,
and vice versa. The ideal model balances the two so that it generalizes well to unseen data, so finding the right
level of complexity is key to achieving good performance.
Diagram Explanation:
[Image: A graph with "Model Complexity" on the x-axis and "Error" on the y-axis. Three curves are
plotted: "Bias Error" (decreasing with increasing complexity), "Variance Error" (increasing with
increasing complexity), and "Total Error" (a U-shaped curve, representing the sum of bias and variance
error). The "sweet spot" is marked at the lowest point of the Total Error curve, where bias and variance
are relatively balanced.]
Key Points:
Bias:
Represents the error due to simplifying assumptions made by a model, leading to underfitting where
the model fails to capture important patterns in the data.
Example: Using a linear regression model to fit a highly non-linear relationship between variables will
result in high bias.
Variance:
Represents how much the model's predictions fluctuate based on different training data sets, leading to
overfitting where the model learns noise in the data instead of underlying patterns.
Example: A very complex neural network with many parameters can easily overfit to training data,
leading to high variance.
How to Navigate the Tradeoff:
Model Selection:
Choose a model complexity that balances bias and variance based on the problem and data size.
Regularization Techniques:
Methods like L1 and L2 regularization can help reduce model complexity and mitigate overfitting, thus
decreasing variance.
Cross-Validation:
Used to evaluate model performance on unseen data and select the best hyperparameters to optimize
the bias-variance tradeoff.
Data Augmentation:
Generating additional training data can help reduce variance by exposing the model to a wider range of
examples.
Visualizing the Tradeoff:
Low Bias, High Variance:
Imagine a model that perfectly fits every point in the training data, including noise, resulting in
excellent performance on the training set but poor generalization on new data.
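The tradeoff can be illustrated with a small experiment, sketched below under stated assumptions: the target function sin(2πx), the noise level, the sample sizes, and the polynomial degrees are all hypothetical choices made for the illustration. Polynomials of increasing degree are fitted to the same noisy training set, and training and test errors are compared.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Assumed ground-truth relationship (non-linear)
    return np.sin(2 * np.pi * x)

# Small noisy training set and a noise-free test grid
x_train = np.linspace(0.0, 1.0, 20)
y_train = true_fn(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_test = np.linspace(0.0, 1.0, 200)
y_test = true_fn(x_test)

def mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

# Degree 1 is too simple (high bias); degree 9 is flexible enough to chase noise.
errors = {d: mse(d) for d in (1, 3, 9)}
for d, (tr, te) in errors.items():
    print(f"degree {d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
```

The training error can only go down as the degree grows, while the test error is what reveals whether the added complexity helps or hurts.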
12 B) VC DIMENSION:
Shattering of a set
Let D be a dataset containing N examples for a binary classification problem with class labels 0 and 1.
Let H be a hypothesis space for the problem. Each hypothesis h in H partitions D into two disjoint
subsets as follows:
{x ∈ D | h(x) = 0} and {x ∈ D | h(x) = 1}
Such a partition of D is called a “dichotomy” in D. It can be shown that there are 2^N possible dichotomies
in D. To each dichotomy of D there is a unique assignment of the labels “1” and “0” to the elements of
D. Conversely, if S is any subset of D then S defines a unique hypothesis h as follows:
h(x) = 1 if x ∈ S, and h(x) = 0 otherwise.
Thus to specify a hypothesis h, we need only specify the set {x ∈ D | h(x) = 1}. Figure 3.1 shows all
possible dichotomies of D if D has three elements. In the figure, we have shown only one of the two sets
in a dichotomy, namely the set {x ∈ D | h(x) = 1}. The circles and ellipses represent such sets.
Definition
A set of examples D is said to be shattered by a hypothesis space H if and only if for every dichotomy of
D there exists some hypothesis in H consistent with the dichotomy of D.
Example
In figure, we see that an axis-aligned rectangle can shatter four points in two dimensions. Then VC(H),
when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four. In calculating the
VC dimension, it is enough that we find four points that can be shattered; it is not necessary that we be
able to shatter any four points in two dimensions.
Fig: An axis-aligned rectangle can shatter four points. Only rectangles covering two points are shown.
VC dimension may seem pessimistic. It tells us that using a rectangle as our hypothesis class, we can
learn only datasets containing four points and not more.
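Shattering by axis-aligned rectangles can be checked by brute force, as in the sketch below. It relies on one observation: a labeling is realizable by some rectangle exactly when the bounding box of the positive points contains no negative point. The point sets and function names are hypothetical, and the all-negative labeling is assumed to be realized by a rectangle placed away from every point.

```python
from itertools import product

def rectangle_realizes(points, labels):
    """True if some axis-aligned rectangle labels the points exactly as given."""
    positives = [p for p, lab in zip(points, labels) if lab == 1]
    if not positives:
        return True  # assumption: a rectangle away from all points labels them all 0
    a = min(x for x, _ in positives)
    b = max(x for x, _ in positives)
    c = min(y for _, y in positives)
    d = max(y for _, y in positives)
    # The bounding box of the positives must not contain any negative point.
    for (x, y), lab in zip(points, labels):
        if lab == 0 and a <= x <= b and c <= y <= d:
            return False
    return True

def is_shattered(points):
    """True if every one of the 2^N dichotomies is realized by a rectangle."""
    return all(
        rectangle_realizes(points, labels)
        for labels in product((0, 1), repeat=len(points))
    )

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # four points that can be shattered
diagonal = [(0, 0), (1, 1), (2, 2), (3, 3)]    # four points that cannot
five = diamond + [(0, 0)]                      # adding a fifth point breaks shattering
```

The diagonal set shows why it is enough to find one set of four points that can be shattered: not every four-point set has this property.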
13 A) PERCEPTRON
The perceptron is a type of neural network that performs binary classification: it maps input features to an
output decision, usually classifying data into one of two categories, such as 0 or 1.
A perceptron consists of a single layer of input nodes that are fully connected to a layer of output nodes. It
is particularly good at learning linearly separable patterns. It uses a variation of artificial neurons
called Threshold Logic Units (TLUs), which were first introduced by Warren McCulloch and Walter Pitts in the
1940s. This foundational model has played a crucial role in the development of more advanced neural
networks and machine learning algorithms.
Types of Perceptron
Single-Layer Perceptron: this type of perceptron is limited to learning linearly separable patterns. It is
effective for tasks where the data can be divided into distinct categories by a straight line. While
powerful in its simplicity, it struggles with more complex problems where the relationship between
inputs and outputs is non-linear.
Multi-Layer Perceptron: this type has enhanced processing capabilities because it consists of two or more
layers, making it adept at handling more complex patterns and relationships within the data.
Basic Components of Perceptron
A Perceptron is composed of key components that work together to process information and make
predictions.
Input Features: The perceptron takes multiple input features, each representing a characteristic of the input
data.
Weights: Each input feature is assigned a weight that determines its influence on the output. These weights
are adjusted during training to find the optimal values.
Summation Function: The perceptron calculates the weighted sum of its inputs, combining them with their
respective weights.
Activation Function : The weighted sum is passed through the Heaviside step function, comparing it to a
threshold to produce a binary output (0 or 1).
Output: The final output is determined by the activation function, often used for binary classification tasks.
Bias: The bias term helps the perceptron make adjustments independent of the input, improving its
flexibility in learning.
Learning Algorithm: The perceptron adjusts its weights and bias using a learning algorithm, such as
the Perceptron Learning Rule , to minimize prediction errors.
These components enable the perceptron to learn from data and make predictions. While a single
perceptron can handle simple binary classification, complex tasks require multiple perceptrons organized
into layers, forming a neural network.
How does Perceptron work?
In machine learning, the perceptron is considered a single-layer neural network with four main
components: input values (input nodes), weights and bias, the net sum, and an activation function. The
perceptron model begins by multiplying each input value by its weight, then adds these
products together to create the weighted sum. This weighted sum is then passed to the activation function
'f', also known as the step function, to obtain the output.
Perceptron in Machine Learning
This step function or Activation function plays a vital role in ensuring that output is mapped between
required values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength
of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or
down.
Perceptron model works in two important steps as follows:
Step-1
In the first step, multiply each input value by its corresponding weight and add the products to
determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = w1*x1 + w2*x2 + … + wn*xn
Then add a special term called bias 'b' to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum, which
gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
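The two steps, together with the perceptron learning rule, can be sketched as follows. This is a minimal illustration under stated assumptions: the AND-gate data, the learning rate of 1.0, the zero initialization, the number of epochs, and the convention that the step function outputs 1 when its input is at least 0 are all choices made for the example.

```python
import numpy as np

def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0 (threshold convention assumed)."""
    return 1 if z >= 0 else 0

def predict(w, b, x):
    """Step 1: weighted sum plus bias. Step 2: apply the activation function."""
    return step(np.dot(w, x) + b)

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Perceptron learning rule: nudge weights and bias by the prediction error."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            error = target - predict(w, b, xi)  # -1, 0 or +1
            w += lr * error * xi
            b += lr * error
    return w, b

# Hypothetical, linearly separable training set: the logical AND gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

Because AND is linearly separable, the rule settles on weights that classify all four inputs correctly; the same code would not converge on XOR, which is not linearly separable.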
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for
solving classification problems.
The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps
in building fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object
belonging to each class.
Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.
Problem: If the weather is sunny, should the player play or not?
No.   Weather    Play
1     Rainy      Yes
2     Sunny      Yes
3     Overcast   Yes
4     Overcast   Yes
5     Sunny      No
6     Rainy      Yes
7     Sunny      Yes
8     Overcast   Yes
9     Rainy      No
10    Sunny      No
11    Sunny      Yes
12    Rainy      No
13    Overcast   Yes
14    Overcast   Yes
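One way to work the problem is to count frequencies in the table and apply Bayes' theorem, P(Play | Weather) = P(Weather | Play) × P(Play) / P(Weather). The sketch below does this directly from the 14 rows above; the function and variable names are illustrative. With a single feature, the "naïve" independence assumption plays no role, so this is plain Bayes' theorem.

```python
# The 14 (weather, play) rows from the table, in order.
data = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]

def posterior(weather, play):
    """P(play | weather) computed with Bayes' theorem from table counts."""
    n = len(data)
    n_play = sum(1 for _, p in data if p == play)
    n_weather = sum(1 for w, _ in data if w == weather)
    n_both = sum(1 for w, p in data if w == weather and p == play)
    likelihood = n_both / n_play   # P(weather | play)
    prior = n_play / n             # P(play)
    evidence = n_weather / n       # P(weather)
    return likelihood * prior / evidence

p_yes = posterior("Sunny", "Yes")  # (3/10) * (10/14) / (5/14)
p_no = posterior("Sunny", "No")    # (2/4) * (4/14) / (5/14)
decision = "Yes" if p_yes > p_no else "No"
print(f"P(Yes | Sunny) = {p_yes:.2f}, P(No | Sunny) = {p_no:.2f}, play = {decision}")
```

For this table, P(Yes | Sunny) = 0.60 and P(No | Sunny) = 0.40, so the player should play on a sunny day.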
14 A) ADVANTAGE OF SVM
The main advantages of Support Vector Machines (SVMs) include their ability to handle high-dimensional
data effectively, strong resistance to overfitting, clear decision boundaries due to the optimal hyperplane,
flexibility with kernel functions to address non-linear data, and high accuracy in classification tasks,
especially when dealing with complex datasets. The optimal hyperplane, in contrast to other hyperplanes, is
the one that maximizes the margin between the data classes. It creates the largest possible
separation between the classes, which leads to better generalization and robustness on new data points.
Key advantages of SVMs:
High-dimensional data handling:
SVMs perform well even when the number of features is large, making them suitable for complex
datasets with many variables.
Robustness to overfitting:
By focusing on maximizing the margin between classes, SVMs are less prone to overfitting, especially in
high-dimensional spaces.
Clear decision boundaries:
The optimal hyperplane provides a well-defined separation between classes, aiding in better
interpretation of the model.
Kernel trick for non-linear data:
Using different kernel functions allows SVMs to handle non-linear data by transforming it into a higher
dimensional space.
Sparsity of support vectors:
Only a small subset of data points (support vectors) significantly influence the decision boundary,
leading to efficient computation.
OPTIMAL HYPERPLANE:
An "optimal hyperplane" in machine learning, particularly within the context of Support Vector
Machines (SVMs), is distinct from other hyperplanes because it is the one that maximizes the margin
between data points from different classes. It is positioned as far away as possible from the
closest data points (called "support vectors") on either side, which gives the best possible separation
between classes and minimizes the risk of misclassification. Other hyperplanes might also separate the
data, but they would do so with a smaller margin, making them less robust to new data points.
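The difference between a separating hyperplane and the optimal one can be made concrete by computing margins. In the sketch below, the four data points and the two candidate hyperplanes are hypothetical; the margin of a hyperplane w·x + b = 0 is taken as the smallest distance |w·x + b| / ||w|| over the training points.

```python
import numpy as np

# Hypothetical 2-D data: class -1 on the left, class +1 on the right.
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

def separates(w, b):
    """True if every point lies strictly on the correct side of w.x + b = 0."""
    return bool(np.all(y * (X @ w + b) > 0))

def margin(w, b):
    """Distance from the hyperplane to the closest training point."""
    return float(np.min(np.abs(X @ w + b)) / np.linalg.norm(w))

w = np.array([1.0, 0.0])
margin_a = margin(w, -1.0)   # vertical line x = 1, midway between the classes
margin_b = margin(w, -0.5)   # vertical line x = 0.5, closer to class -1
print(separates(w, -1.0), margin_a)
print(separates(w, -0.5), margin_b)
```

Both lines classify the training data correctly, but the line x = 1 has twice the margin of x = 0.5, so it is the one a maximum-margin classifier would prefer for this data.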
Discriminative models, on the other hand, aim to model the decision boundary between different data
classes. They try to learn the boundary between different data classes by modeling the conditional
probability of the outputs given the inputs. A common example of a discriminative model is a Support
Vector Machine (SVM) which models the data as a boundary that maximally separates different classes.
These models can be used for supervised tasks like classification and regression.
Generative models thus try to understand the underlying probability distribution of the data, while
discriminative models try to model the decision boundary between different data classes. Generative
models are computationally more expensive and more susceptible to outliers, but can be used for unsupervised
tasks. Discriminative models are computationally cheaper and more robust to outliers, but are mainly
used for supervised tasks.
Example:
Imagine trying to classify emails as spam or not spam. A logistic regression model would learn the features that
best distinguish spam emails from non-spam emails, like specific keywords, and directly predict the probability
of an email being spam based on those features.
DIFFERENCE:
15A) EM ALGORITHM
The Expectation-Maximization (EM) algorithm is a statistical method used to find maximum likelihood estimates
of parameters in a model when there are unobserved "latent" variables, that is, when part of the data is
missing or hidden. A common example is estimating the parameters of a mixture of Gaussian distributions where
you don't know which Gaussian component generated each data point in a dataset.
Key idea: The EM algorithm iteratively performs two steps:
Expectation (E) step:
Estimate the expected values of the latent variables based on the current estimates of the model
parameters.
Maximization (M) step:
Update the model parameters to maximize the likelihood function based on the expected values
calculated in the E step.
Example: Mixture of Gaussian Distributions:
Problem:
Imagine you have a dataset of points that seem to be drawn from two different Gaussian distributions
(with unknown means and variances) but you don't know which Gaussian generated each individual
data point (the latent variable).
EM Algorithm Steps:
Initialization: Randomly initialize the means and variances for the two Gaussian components.
E-step: For each data point, calculate the probability that it belongs to each Gaussian component
based on the current estimates of the means and variances.
M-step: Re-estimate the means and variances of each Gaussian component by maximizing the
likelihood function using the "soft assignments" calculated in the E-step (i.e., the probabilities of
each data point belonging to each component).
Repeat: Iterate between the E-step and M-step until convergence (parameters stop changing
significantly).
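The steps above can be sketched for a one-dimensional mixture of two Gaussians. This is an illustrative sketch with several assumptions: the synthetic data, the fixed iteration count in place of a convergence test, the deterministic initialization (means at the data minimum and maximum instead of random values), and the mixing weights `pi`, which the description above does not mention but which a mixture model normally includes.

```python
import numpy as np

def em_two_gaussians(x, n_iter=50):
    """EM for a 1-D mixture of two Gaussians; returns (means, variances, weights)."""
    # Initialization (deterministic here for reproducibility)
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: soft assignment of each point to each component
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

# Hypothetical data: 200 points from N(-5, 1) and 200 points from N(5, 1).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-5.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

mu, var, pi = em_two_gaussians(x)
print("means:", mu, "variances:", var, "weights:", pi)
```

The component labels are never given to the algorithm; the estimated means should nevertheless land close to the two values used to generate the data.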
Other Examples of EM Algorithm Applications:
Customer Segmentation:
Clustering customers into groups based on their purchasing behavior where the group membership is
considered a latent variable.
Hidden Markov Models (HMMs):
Inferring the hidden states of an HMM (e.g., weather states in a sequence of temperature readings)
given the observed data.
Text Mining:
Identifying latent topics within a collection of documents by treating each document as a mixture of
topics.
The red labels indicate the class 0 points and the green labels indicate class 1 points.
Consider the white point as the query point( the point whose class label has to be predicted)
If we give the above dataset to a kNN based classifier, then the classifier would declare the query point to
belong to the class 0. But in the plot, it is clear that the point is closer to the class 1 points compared to the
class 0 points. To overcome this disadvantage, weighted kNN is used. In weighted kNN, the nearest k
points are given a weight using a function called the kernel function. The intuition behind weighted
kNN is to give more weight to the points which are nearby and less weight to the points which are farther
away. Any function whose value decreases as the distance increases can be used as a kernel function for
the weighted kNN classifier. The simplest such function is the inverse distance function.
Algorithm:
Let L = {(xi, yi), i = 1, ..., n} be a training set of observations xi with given class yi, and let x be a new
observation (query point), whose class label y has to be predicted.
Compute d(xi, x) for i = 1, ..., n, the distance between the query point and every other point in the training
set.
Select D’ ⊆ L, the set of k nearest training data points to the query point.
Predict the class of the query point using distance-weighted voting, where v ranges over the class labels.
Use the following formula, with inverse-distance weights wi = 1 / d(xi, x) and I(·) equal to 1 when its
argument is true and 0 otherwise:
y = argmax over v of the sum, over (xi, yi) in D’, of wi × I(v = yi)
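A minimal sketch of distance-weighted voting with the inverse distance kernel is given below. The training points, the query point, and the small constant added to avoid division by zero are assumptions made for the illustration.

```python
import math
from collections import defaultdict

def weighted_knn(train, query, k, weighted=True):
    """Classify `query` by (optionally distance-weighted) voting among k neighbours."""
    # Keep the k training points closest to the query.
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbours:
        d = math.dist(x, query)
        # Inverse distance kernel; the 1e-12 guards against d == 0 (assumption).
        votes[label] += 1.0 / (d + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical data: the query sits right next to a class 1 point,
# but two of its three nearest neighbours belong to class 0.
train = [((2.0, 0.0), 1), ((0.0, 0.0), 0), ((-1.0, 0.0), 0), ((10.0, 0.0), 1)]
query = (1.8, 0.0)

plain = weighted_knn(train, query, k=3, weighted=False)
weighted = weighted_knn(train, query, k=3, weighted=True)
print("plain kNN:", plain, "weighted kNN:", weighted)
```

With k = 3, plain majority voting returns class 0, while inverse-distance weighting returns class 1 because the single very close neighbour outweighs the two distant ones.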
15 MARKS
For this example, one could consider the simplest hypothesis that “there are stars in the sky” rather than
complicating things beyond the observation. This may sound vague, but what does it have to do with
machine learning?
In most machine learning tasks, we deal with some subset of observations (samples) and our goal is to
create a generalization based on them. We also want our generalization to be valid for new unseen data. In
other words, we want to draw a general rule that works for the whole population of samples based on a
limited sample subset.
So we have some set of observations and a set of hypotheses that can be induced based on observations.
The set of observations is our data and the set of hypotheses are ML algorithms with all the possible
parameters that can be learned from this data. Each model can describe training data but provide
significantly different results on new unseen data.
As the above illustration shows, different models trained on the same fixed training data can
behave very differently on unseen data, which leads to widely varying predictions.
There is an infinite set of hypotheses for a finite set of samples. For example, consider observations of two
points of some single-variable function. It is possible to fit a single linear model and an infinite number of
periodic or polynomial functions that perfectly fit the observations. Given the data, all of those functions are
valid hypotheses that perfectly align with the observations, and with no additional assumptions, choosing one
over another is like making a random guess.
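The two-point example can be made concrete with a short sketch. The observations (0, 0) and (1, 1), the family of periodic hypotheses x + sin(2πkx), and the new input 0.25 are all hypothetical choices for the illustration.

```python
import math

# Two observations of some single-variable function (assumed values).
observations = [(0.0, 0.0), (1.0, 1.0)]

def linear(x):
    return x

def make_periodic(k):
    """A periodic hypothesis that also passes through both observations."""
    return lambda x: x + math.sin(2 * math.pi * k * x)

hypotheses = {
    "linear": linear,
    "periodic k=1": make_periodic(1),
    "periodic k=3": make_periodic(3),
}

# Every hypothesis fits the training observations (up to rounding error)...
fits = {
    name: all(abs(h(x) - y) < 1e-9 for x, y in observations)
    for name, h in hypotheses.items()
}
# ...yet they disagree on an unseen input.
preds = {name: h(0.25) for name, h in hypotheses.items()}
for name in hypotheses:
    print(f"{name}: fits data = {fits[name]}, prediction at 0.25 = {preds[name]:.2f}")
```

Any integer k gives another hypothesis that fits both points, so the data alone cannot single out one of them; some additional preference is needed.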
Now let’s evaluate our hypotheses on the new unseen data sample X2. It turns out that most of the complicated
functions are inaccurate, whereas the linear function appears to be quite accurate, which may already seem
familiar from a bias-variance tradeoff perspective.
The prioritization of some hypotheses (restriction of the hypothesis space) is an inductive bias. So the model is
biased toward some group of hypotheses (preference bias). For the previous example, one can choose a linear model
based on some prior knowledge about the data and thus prioritize linear generalization.
Why Is Inductive Bias Important?
As one can see from the previous example, choosing the right inductive bias for
the model leads to better generalization, especially in a low data setting. The less training data we have, the
stronger the inductive bias should be to help the model generalize well. But in a rich data setting, it may be
preferable to avoid any inductive bias, so that the model is less constrained and can search through the
hypothesis space freely.
In a low data setting, the right inductive bias may help to find a good optimum, but in a rich data setting, it
may lead to constraints that harm generalization.
16B)