
AL3451 - MACHINE LEARNING – ANSWER KEY

13 MARK
11A) EXAMPLES OF ML:
 Some prominent examples of machine learning applications include: image recognition, speech
recognition, medical diagnosis, fraud detection, product recommendations, spam filtering,
self-driving cars, natural language processing (NLP), stock market prediction, and personalized
advertising.
Key points about these examples:
Image Recognition:
 Identifying objects or people in images using algorithms trained on large datasets, commonly seen in
facial recognition software and photo tagging on social media platforms.
Speech Recognition:
 Converting spoken language into text, used in voice assistants like Siri and Google Assistant.
Medical Diagnosis:
 Analyzing medical images like X-rays or MRIs to identify potential diseases, assisting doctors in
diagnosis.
Fraud Detection:
 Identifying suspicious financial transactions in real-time to prevent fraudulent activity.
Product Recommendations:
 Suggesting products to users based on their past purchase history or browsing behavior on online
shopping platforms
Spam Filtering:
 Automatically identifying and filtering unwanted emails as spam
Self-Driving Cars:
 Using sensors and machine learning algorithms to navigate vehicles autonomously
Natural Language Processing (NLP):
 Analyzing and understanding human language, used in chatbots, sentiment analysis, and machine
translation
Stock Market Prediction:
 Forecasting future stock prices based on historical data analysis
Personalized Advertising:
 Tailoring advertisements to individual users based on their online behavior and demographics
11 B) Probably approximately correct learning
 In computer science, computational learning theory (or just learning theory) is a subfield of artificial
intelligence devoted to studying the design and analysis of machine learning algorithms. In computational
learning theory, probably approximately correct learning (PAC learning) is a framework for
mathematical analysis of machine learning algorithms. It was proposed in 1984 by Leslie Valiant.
 In this framework, the learner (that is, the algorithm) receives samples and must select a hypothesis from
a certain class of hypotheses. The goal is that, with high probability (the “probably” part), the selected
hypothesis will have low generalization error (the “approximately correct” part). In this section we first
give an informal definition of PAC-learnability. After introducing a few more notions, we give a more
formal, mathematically oriented, definition of PAC-learnability. At the end, we mention one of the
applications of PAC-learnability.
PAC-learnability
To define PAC-learnability we require some specific terminology and related notations.
 Let X be a set called the instance space which may be finite or infinite. For example, X
may be the set of all points in a plane.
 A concept class C for X is a family of functions c : X → {0, 1}. A member of C is called a
concept. A concept can also be thought of as a subset of X. If c is a subset of X, it defines a
unique function µc : X → {0, 1} as follows:

µc(x) = 1 if x ∈ c, and µc(x) = 0 otherwise
 A hypothesis h is also a function h : X → {0, 1}. So, as in the case of concepts, a hypothesis
can also be thought of as a subset of X. H will denote a set of hypotheses.
 We assume that F is an arbitrary, but fixed, probability distribution over X.
 Training examples are obtained by taking random samples from X. We assume that the
samples are randomly generated from X according to the probability distribution F.

Now, we give below an informal definition of PAC-learnability.

Definition (informal)
Let X be an instance space, C a concept class for X, h a hypothesis in C and F an arbitrary,
but fixed, probability distribution. The concept class C is said to be PAC-learnable if there
is an algorithm A which, for samples drawn with any probability distribution F and any
concept c Є C, will with high probability produce a hypothesis h Є C whose error is small.

Examples

To illustrate the definition of PAC-learnability, let us consider some concrete examples.


Figure : An axis-aligned rectangle in the Euclidean plane

Example
 Let the instance space be the set X of all points in the Euclidean plane. Each point is
represented by its coordinates (x; y). So, the dimension or length of the instances is 2.
 Let the concept class C be the set of all “axis-aligned rectangles” in the plane; that is, the set
of all rectangles whose sides are parallel to the coordinate axes in the plane (see Figure).
 Since an axis-aligned rectangle can be defined by a set of inequalities of the following
form having four parameters

a ≤ x ≤ b, c ≤ y ≤ d

the size of a concept is 4.


 We take the set H of all hypotheses to be equal to the set C of concepts, H = C.

Given a set of sample points labeled positive or negative, let L be the algorithm which
outputs the hypothesis defined by the axis-aligned rectangle which gives the tightest fit to
the positive examples (that is, that rectangle with the smallest area that includes all of the
positive examples and none of the negative examples) (see Figure below).

Figure : Axis-aligned rectangle which gives the tightest fit to the positive examples
It can be shown that, in the notations introduced above, the concept class C is PAC-
learnable by the algorithm L using the hypothesis space H of all axis-aligned rectangles.
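
As a small illustration of the algorithm L described above, the following Python sketch computes the
tightest-fit rectangle and uses it as a hypothesis (the function names and sample points are illustrative
assumptions, not part of the original answer):

# Tightest-fit axis-aligned rectangle: the hypothesis returned by algorithm L.
# A rectangle is represented as (a, b, c, d), meaning a <= x <= b and c <= y <= d.

def tightest_fit(positive_points):
    """Smallest axis-aligned rectangle containing all positive examples."""
    xs = [x for x, y in positive_points]
    ys = [y for x, y in positive_points]
    return (min(xs), max(xs), min(ys), max(ys))

def predict(rect, point):
    """Hypothesis h: label 1 if the point lies inside the rectangle, else 0."""
    a, b, c, d = rect
    x, y = point
    return 1 if (a <= x <= b and c <= y <= d) else 0

# Example usage with a few labelled sample points (illustrative):
positives = [(1, 1), (2, 3), (4, 2)]
rect = tightest_fit(positives)        # (1, 4, 1, 3)
print(predict(rect, (3, 2)))          # 1: inside the rectangle
print(predict(rect, (5, 5)))          # 0: outside the rectangle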

12 A) The bias-variance tradeoff in machine learning refers to the inherent tension between a model that is
too simple (high bias) and one that captures noise in the data (high variance); the ideal model strikes a
balance between the two so that it generalizes well to unseen data. As model complexity increases, bias tends
to decrease while variance increases, and vice versa, so finding the optimal level of complexity is key to
achieving good performance.
Diagram Explanation:
[Image: A graph with "Model Complexity" on the x-axis and "Error" on the y-axis. Three curves are
plotted: "Bias Error" (decreasing with increasing complexity), "Variance Error" (increasing with
increasing complexity), and "Total Error" (a U-shaped curve, representing the sum of bias and variance
error). The "sweet spot" is marked at the lowest point of the Total Error curve, where bias and variance
are relatively balanced.]
Key Points:
 Bias:
Represents the error due to simplifying assumptions made by a model, leading to underfitting where
the model fails to capture important patterns in the data.
 Example: Using a linear regression model to fit a highly non-linear relationship between variables will
result in high bias.
 Variance:
Represents how much the model's predictions fluctuate based on different training data sets, leading to
overfitting where the model learns noise in the data instead of underlying patterns.
 Example: A very complex neural network with many parameters can easily overfit to training data,
leading to high variance.
How to Navigate the Tradeoff:
 Model Selection:
Choose a model complexity that balances bias and variance based on the problem and data size.
 Regularization Techniques:
Methods like L1 and L2 regularization can help reduce model complexity and mitigate overfitting, thus
decreasing variance.
 Cross-Validation:
Used to evaluate model performance on unseen data and select the best hyperparameters to optimize
the bias-variance tradeoff.
 Data Augmentation:
Generating additional training data can help reduce variance by exposing the model to a wider range of
examples.
Visualizing the Tradeoff:
 Low Bias, High Variance:
Imagine a model that perfectly fits every point in the training data, including noise, resulting in
excellent performance on the training set but poor generalization on new data.
 High Bias, Low Variance:
Conversely, a very simple model that only captures the most basic trends may have low variance but
fail to capture important details, leading to poor performance on both training and new data.
Important Considerations:
 Data Size:
With limited data, simpler models with lower variance are often preferred to prevent overfitting.
 Problem Complexity:
More complex problems may require more complex models, but careful consideration of the bias-
variance tradeoff is crucial.
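
The tradeoff can also be explored numerically. The sketch below (an assumption of this answer key, using
scikit-learn and a made-up noisy dataset) scores polynomial models of increasing degree with cross-validation;
the degree with the lowest cross-validated error sits near the bottom of the U-shaped total-error curve
described above:

# Use k-fold cross-validation to pick a polynomial degree that balances
# bias (too-simple model) against variance (too-complex model).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)   # noisy non-linear data

for degree in (1, 3, 9, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # Mean squared error averaged over 5 folds; low degrees underfit (high bias),
    # very high degrees overfit (high variance).
    score = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(degree, -score)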

12 B) VC DIMENSION:

Vapnik-Chervonenkis (VC) dimension


 The concepts of Vapnik-Chervonenkis dimension (VC dimension) and probably approximately correct
(PAC) learning are two important concepts in the mathematical theory of learnability and hence are
mathematically oriented. The former is a measure of the capacity (complexity, expressive power, richness,
or flexibility) of a space of functions that can be learned by a classification algorithm.
 It was originally defined by Vladimir Vapnik and Alexey Chervonenkis in 1971. The latter is a framework
for the mathematical analysis of learning algorithms. The goal is to check whether the probability for a
selected hypothesis to be approximately correct is very high. The notion of PAC learning was proposed by
Leslie Valiant in 1984.
V-C dimension
Let H be the hypothesis space for some machine learning problem. The Vapnik-Chervonenkis dimension
of H, also called the VC dimension of H, and denoted by V C(H), is a measure of the complexity (or,
capacity, expressive power, richness, or flexibility) of the space H. To define the VC dimension we
require the notion of the shattering of a set of instances.

Shattering of a set
Let D be a dataset containing N examples for a binary classification problem with class labels 0 and 1.
Let H be a hypothesis space for the problem. Each hypothesis h in H partitions D into two disjoint
subsets as follows:

{x ∈ D | h(x) = 1} and {x ∈ D | h(x) = 0}

Such a partition of D is called a "dichotomy" of D. It can be shown that there are 2^N possible dichotomies
of D. To each dichotomy of D there is a unique assignment of the labels "1" and "0" to the elements of
D. Conversely, if S is any subset of D, then S defines a unique hypothesis h as follows:

h(x) = 1 if x ∈ S, and h(x) = 0 otherwise

Thus, to specify a hypothesis h, we need only specify the set {x ∈ D | h(x) = 1}. Figure 3.1 shows all
possible dichotomies of D if D has three elements. In the figure, we have shown only one of the two sets
in a dichotomy, namely the set {x ∈ D | h(x) = 1}. The circles and ellipses represent such sets.

Definition
A set of examples D is said to be shattered by a hypothesis space H if and only if for every dichotomy of
D there exists some hypothesis in H consistent with the dichotomy of D.

The following example illustrates the concept of Vapnik-Chervonenkis dimension.

Example

In the figure, we see that an axis-aligned rectangle can shatter four points in two dimensions. Then VC(H),
when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four. In calculating the
VC dimension, it is enough that we find four points that can be shattered; it is not necessary that we be
able to shatter any four points in two dimensions.
Fig: An axis-aligned rectangle can shatter four points. Only rectangles covering two points are shown.

VC dimension may seem pessimistic. It tells us that using a rectangle as our hypothesis class, we can
learn only datasets containing four points and not more.
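
The shattering argument can be checked by brute force. The following Python sketch (the function names and
point sets are illustrative assumptions) tests every dichotomy of a point set against the tightest-fit
rectangle around its positive part:

# Shattering check for axis-aligned rectangles: a point set D is shattered
# if, for every labelling (dichotomy) of D, some rectangle contains exactly
# the points labelled 1.
from itertools import product

def covers_exactly(points, labels):
    pos = [p for p, l in zip(points, labels) if l == 1]
    if not pos:                      # empty positive set: a rectangle away from all points works
        return True
    a, b = min(x for x, _ in pos), max(x for x, _ in pos)
    c, d = min(y for _, y in pos), max(y for _, y in pos)
    # The tightest-fit rectangle around the positives must exclude every negative point.
    return all(not (a <= x <= b and c <= y <= d)
               for (x, y), l in zip(points, labels) if l == 0)

def is_shattered(points):
    return all(covers_exactly(points, labels)
               for labels in product([0, 1], repeat=len(points)))

# Four points in a "diamond" position can be shattered (so the VC dimension is at least 4) ...
print(is_shattered([(0, 1), (1, 0), (2, 1), (1, 2)]))             # True
# ... but adding a point inside the diamond breaks shattering.
print(is_shattered([(0, 1), (1, 0), (2, 1), (1, 2), (1, 1)]))     # False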

13 A) PERCEPTRON
 Perceptron is a type of neural network that performs binary classification, mapping input features to an
output decision and classifying data into one of two categories, such as 0 or 1.
 Perceptron consists of a single layer of input nodes that are fully connected to a layer of output nodes. It
is particularly good at learning linearly separable patterns. It utilizes a variation of artificial neurons
called Threshold Logic Units (TLUs), which were first introduced by Warren McCulloch and Walter Pitts in the
1940s. This foundational model has played a crucial role in the development of more advanced neural
networks and machine learning algorithms.
Types of Perceptron
 Single-Layer Perceptron: this type of perceptron is limited to learning linearly separable patterns. It is
effective for tasks where the data can be divided into distinct categories through a straight line. While
powerful in its simplicity, it struggles with more complex problems where the relationship between
inputs and outputs is non-linear.
 Multi-Layer Perceptrons possess enhanced processing capabilities, as they consist of two or more layers,
adept at handling more complex patterns and relationships within the data.
Basic Components of Perceptron
 A Perceptron is composed of key components that work together to process information and make
predictions.
 Input Features: The perceptron takes multiple input features, each representing a characteristic of the input
data.
 Weights: Each input feature is assigned a weight that determines its influence on the output. These weights
are adjusted during training to find the optimal values.
 Summation Function: The perceptron calculates the weighted sum of its inputs, combining them with their
respective weights.
 Activation Function : The weighted sum is passed through the Heaviside step function, comparing it to a
threshold to produce a binary output (0 or 1).
 Output: The final output is determined by the activation function, often used for binary classification tasks.
 Bias: The bias term helps the perceptron make adjustments independent of the input, improving its
flexibility in learning.
 Learning Algorithm: The perceptron adjusts its weights and bias using a learning algorithm, such as
the Perceptron Learning Rule , to minimize prediction errors.
 These components enable the perceptron to learn from data and make predictions. While a single
perceptron can handle simple binary classification, complex tasks require multiple perceptrons organized
into layers, forming a neural network.
How does Perceptron work?
 In Machine Learning, Perceptron is considered as a single-layer neural network that consists of four main
parameters named input values (Input nodes), weights and Bias, net sum, and an activation function. The
perceptron model begins with the multiplication of all input values and their weights, then adds these
values together to create the weighted sum. Then this weighted sum is applied to the activation function
'f' to obtain the desired output. This activation function is also known as the step function and is
represented by 'f'.
[Figure: Perceptron model in machine learning]
 This step function or Activation function plays a vital role in ensuring that output is mapped between
required values (0,1) or (-1,1). It is important to note that the weight of input is indicative of the strength
of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or
down.
 Perceptron model works in two important steps as follows:

Step-1

In the first step, multiply all input values with their corresponding weight values and then add them to
determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:

∑wi*xi = w1*x1 + w2*x2 + … + wn*xn

Add a special term called bias 'b' to this weighted sum to improve the model's performance.

∑wi*xi + b

Step-2

In the second step, an activation function is applied with the above-mentioned weighted sum, which
gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
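
A minimal Python sketch of these two steps, together with the perceptron learning rule mentioned earlier,
might look as follows (the learning rate, epoch count, and AND-gate data are illustrative assumptions):

# Perceptron: weighted sum + bias, passed through a Heaviside step function,
# with weights updated by the perceptron learning rule.
import numpy as np

def step(z):
    return 1 if z >= 0 else 0        # activation (step) function f

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            y_hat = step(np.dot(w, xi) + b)    # Step 1 + Step 2: Y = f(∑wi*xi + b)
            error = target - y_hat
            w += lr * error * xi               # adjust weights
            b += lr * error                    # adjust bias
    return w, b

# Example: learning the linearly separable AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, xi) + b) for xi in X])   # expected [0, 0, 0, 1]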

13 B) NAÏVE BAYES CLASSIFIER

 Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for
solving classification problems.

 It is mainly used in text classification that includes a high-dimensional training dataset.

 Naïve Bayes Classifier is one of the simplest and most effective classification algorithms; it helps in
building fast machine learning models that can make quick predictions.

 It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
 Some popular examples of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and
classifying articles.


 Problem: If the weather is sunny, should the player play or not?

No    Weather     Play
1     Rainy       Yes
2     Sunny       Yes
3     Overcast    Yes
4     Overcast    Yes
5     Sunny       No
6     Rainy       Yes
7     Sunny       Yes
8     Overcast    Yes
9     Rainy       No
10    Sunny       No
11    Sunny       Yes
12    Rainy       No
13    Overcast    Yes
14    Overcast    Yes
Find whether the player will play if the weather is sunny.


By Bayes' theorem:

P(play = Yes | weather = Sunny) = P(Sunny | play = Yes) * P(play = Yes) / P(Sunny)

P(Sunny) = 5/14
P(play = Yes) = 9/14
P(Sunny | play = Yes) = P(Sunny ∧ Yes) / P(play = Yes) = 3/9

P(play = Yes | weather = Sunny) = (3/9 * 9/14) / (5/14) = 0.6
P(play = No | weather = Sunny) = 1 - 0.6 = 0.4

Comparing P(No | Sunny) = 0.4 and P(Yes | Sunny) = 0.6, we take the class with the maximum posterior,
so the result is Yes. Hence, on a sunny day the player can play the game.
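
The same Bayes-rule computation can be written as a short Python sketch using the probabilities derived
above (the variable names are illustrative):

# Bayes-rule computation for the worked example above.
p_sunny           = 5 / 14     # P(weather = Sunny)
p_yes             = 9 / 14     # P(play = Yes)
p_sunny_given_yes = 3 / 9      # P(Sunny | play = Yes)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny  = 1 - p_yes_given_sunny    # only two classes, so posteriors sum to 1

print(round(p_yes_given_sunny, 2))   # 0.6
print(round(p_no_given_sunny, 2))    # 0.4
prediction = "Yes" if p_yes_given_sunny > p_no_given_sunny else "No"
print(prediction)                    # Yes: the player can play on a sunny day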

14 A) ADVANTAGES OF SVM

 The main advantages of Support Vector Machines (SVMs) include their ability to handle high-dimensional
data effectively, strong resistance to overfitting, clear decision boundaries due to the optimal hyperplane,
flexibility through kernel functions for non-linear data, and high accuracy in classification tasks,
especially on complex datasets. The optimal hyperplane, in contrast to other separating hyperplanes, is
the one that maximizes the margin between the data classes; it creates the largest possible separation
between the classes, leading to better generalization and robustness to new data points.
Key advantages of SVMs:
 High-dimensional data handling:
SVMs perform well even when the number of features is large, making them suitable for complex
datasets with many variables.
 Robustness to overfitting:
By focusing on maximizing the margin between classes, SVMs are less prone to overfitting, especially in
high-dimensional spaces.
 Clear decision boundaries:
The optimal hyperplane provides a well-defined separation between classes, aiding in better
interpretation of the model.
 Kernel trick for non-linear data:
Using different kernel functions allows SVMs to handle non-linear data by transforming it into a higher
dimensional space.
 Sparsity of support vectors:
Only a small subset of data points (support vectors) significantly influence the decision boundary,
leading to efficient computation.

OPTIMAL HYPERPLANE:
An "optimal hyperplane" in machine learning, particularly within the context of Support Vector
Machines (SVMs), is distinct from other hyperplanes because it is the one that maximizes the margin
between data points from different classes, meaning it is positioned as far away as possible from the
closest data points (called "support vectors") on either side, leading to the best possible separation
between classes and minimizing the risk of misclassification; while other hyperplanes might separate the
data, they would do so with a smaller margin, making them less robust to new data points.

Key points about the optimal hyperplane:


 Maximized Margin:
The primary characteristic of an optimal hyperplane is that it maximizes the distance between itself and
the nearest data points from each class, known as the "margin".
 Support Vectors:
The data points that lie closest to the optimal hyperplane and define the margin are called "support
vectors".
 SVM Algorithm:
The process of finding the optimal hyperplane is central to the Support Vector Machine (SVM)
algorithm, which aims to achieve the best possible classification by maximizing this margin.
How it differs from other hyperplanes:
 Classification Accuracy:
While other hyperplanes may still separate the data into different classes, the optimal hyperplane is
considered the best because it provides the most robust separation, minimizing the likelihood of
misclassification when encountering new data points.
 Generalizability:
Due to the maximized margin, the optimal hyperplane is less prone to overfitting to the training data,
leading to better performance on unseen data.
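
As a hedged sketch of how the optimal hyperplane, its support vectors, and the margin can be inspected in
practice (assuming scikit-learn is available; the toy data is made up for illustration):

# Fit a linear SVM and inspect the optimal (maximum-margin) hyperplane.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters (illustrative toy data).
X = np.array([[1, 1], [2, 1], [1, 2], [5, 5], [6, 5], [5, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3)    # large C behaves like a hard margin on separable data
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane: %.2f*x1 + %.2f*x2 + %.2f = 0" % (w[0], w[1], b))
print("support vectors:", clf.support_vectors_)   # the points that define the margin
print("margin width:", 2 / np.linalg.norm(w))     # the quantity the optimal hyperplane maximizes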

14B) DISCRIMINATIVE VS GENERATIVE MODELS:


 A discriminative model focuses on learning the conditional probability distribution, i.e. the probability of
a label given the input features, while a generative model learns the joint probability distribution, allowing
it to generate new data points that resemble the training data distribution. Essentially, a discriminative
model learns the decision boundary between classes, whereas a generative model tries to understand the
underlying distribution of each class.
 Generative models are a class of machine learning models that aim to understand the underlying
probability distribution of the data. They try to learn the underlying structure of the data by modeling the
joint probability of the inputs and outputs. A common example of a generative model is a Gaussian
Mixture Model (GMM), which models the data as a collection of Gaussian distributions. These models
can be used for unsupervised tasks like anomaly detection, density estimation, and generative art.

 Discriminative models, on the other hand, aim to model the decision boundary between different data
classes. They try to learn the boundary between different data classes by modeling the conditional
probability of the outputs given the inputs. A common example of a discriminative model is a Support
Vector Machine (SVM), which models the data as a boundary that maximally separates different classes.
These models can be used for supervised tasks like classification and regression.

 Generative models thus try to understand the underlying probability distribution of the data, while
Discriminative models try to model the decision boundary between different data classes. Generative
models are computationally more expensive and susceptible to outliers but can be used for unsupervised
tasks. Discriminative models are computationally cheaper and are more robust to outliers but are mainly
used for supervised tasks.

Example:

Discriminative Model (Example: Logistic Regression):

Imagine trying to classify emails as spam or not spam. A logistic regression model would learn the features that
best distinguish spam emails from non-spam emails, like specific keywords, and directly predict the probability
of an email being spam based on those features.

Generative Model (Example: Generative Adversarial Network (GAN)):


If you wanted to generate new realistic-looking images of cats, a GAN would learn the general characteristics of
cat images from a dataset and then create entirely new cat images that appear visually similar to the training
data, even if they don't exist in the original dataset.

DIFFERENCE:

Aspect              Generative model                          Discriminative model
Models              Joint probability of inputs and outputs   Conditional probability of outputs given inputs
Learns              Underlying distribution of each class     Decision boundary between classes
Examples            Gaussian Mixture Model (GMM), GAN         Support Vector Machine (SVM), logistic regression
Typical tasks       Unsupervised (density estimation,         Supervised (classification, regression)
                    anomaly detection, generative art)
Cost and outliers   More expensive, sensitive to outliers     Cheaper, more robust to outliers
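
A small sketch contrasting the two views on the same toy data; here GaussianNB stands in for a simple
generative model and LogisticRegression for a discriminative one (the library, models, and data are
illustrative choices, assuming scikit-learn is available):

# Generative vs discriminative on the same data:
# GaussianNB models P(x | y) and P(y) (a joint model), while LogisticRegression
# models P(y | x) (the decision boundary) directly.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))   # class 0 cluster
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))   # class 1 cluster
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

generative = GaussianNB().fit(X, y)                # learns per-class Gaussians and priors
discriminative = LogisticRegression().fit(X, y)    # learns only the class boundary

x_new = np.array([[1.5, 1.5]])
print(generative.predict_proba(x_new))       # P(y | x) obtained via Bayes rule on the joint model
print(discriminative.predict_proba(x_new))   # P(y | x) modelled directly
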
15A) EM ALGORITHM

The Expectation-Maximization (EM) algorithm is a statistical method used to find maximum likelihood estimates
of parameters in a model when there are unobserved "latent" variables, i.e. when some of the data is missing
or hidden. A common example is estimating the parameters of a mixture of Gaussian distributions when you
don't know which Gaussian component generated each data point in the dataset.
Key idea: The EM algorithm iteratively performs two steps:
 Expectation (E) step:
Estimate the expected values of the latent variables based on the current estimates of the model
parameters.
 Maximization (M) step:
Update the model parameters to maximize the likelihood function based on the expected values
calculated in the E step.
Example: Mixture of Gaussian Distributions:
 Problem:
Imagine you have a dataset of points that seem to be drawn from two different Gaussian distributions
(with unknown means and variances) but you don't know which Gaussian generated each individual
data point (the latent variable).
 EM Algorithm Steps:
 Initialization: Randomly initialize the means and variances for the two Gaussian components.
 E-step: For each data point, calculate the probability that it belongs to each Gaussian component
based on the current estimates of the means and variances.
 M-step: Re-estimate the means and variances of each Gaussian component by maximizing the
likelihood function using the "soft assignments" calculated in the E-step (i.e., the probabilities of
each data point belonging to each component).
 Repeat: Iterate between the E-step and M-step until convergence (parameters stop changing
significantly).
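
A compact sketch of these E and M steps for a 1-D mixture of two Gaussians (the data, initial values, and
iteration count are illustrative assumptions; NumPy and SciPy are assumed available):

# EM for a mixture of two 1-D Gaussians, kept deliberately small.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# The latent variable (which component generated each point) is hidden from the algorithm.
data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

mu = np.array([-1.0, 1.0])      # initial means (chosen arbitrarily)
sigma = np.array([1.0, 1.0])    # initial standard deviations
pi = np.array([0.5, 0.5])       # initial mixing weights

for _ in range(50):
    # E-step: soft assignment (responsibility) of each point to each component.
    dens = np.vstack([pi[k] * norm.pdf(data, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)                 # shape (2, n)

    # M-step: re-estimate parameters from the soft assignments.
    Nk = resp.sum(axis=1)
    mu = (resp * data).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / Nk)
    pi = Nk / len(data)

print(mu, sigma, pi)    # the means should approach roughly 0 and 5
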
Other Examples of EM Algorithm Applications:
 Customer Segmentation:
Clustering customers into groups based on their purchasing behavior where the group membership is
considered a latent variable.
 Hidden Markov Models (HMMs):
Inferring the hidden states of an HMM (e.g., weather states in a sequence of temperature readings)
given the observed data.
 Text Mining:
Identifying latent topics within a collection of documents by treating each document as a mixture of
topics.

15B) WEIGHTED KNN


 Weighted kNN is a modified version of k nearest neighbors. One of the many issues that affect the
performance of the kNN algorithm is the choice of the hyperparameter k. If k is too small, the algorithm
would be more sensitive to outliers. If k is too large, then the neighborhood may include too many points
from other classes.
 Another issue is the approach to combining the class labels. The simplest method is to take the majority
vote, but this can be a problem if the nearest neighbors vary widely in their distance and the closest
neighbors more reliably indicate the class of the object.
 Intuition:
Consider the following training set

The red labels indicate the class 0 points and the green labels indicate class 1 points.
Consider the white point as the query point (the point whose class label has to be predicted).
If we give the above dataset to a kNN-based classifier, the classifier would declare the query point to
belong to class 0. But in the plot, it is clear that the point is closer to the class 1 points than to the
class 0 points. To overcome this disadvantage, weighted kNN is used. In weighted kNN, the nearest k
points are given a weight using a function called the kernel function. The intuition behind weighted
kNN is to give more weight to the points which are nearby and less weight to the points which are farther
away. Any function whose value decreases as the distance increases can be used as a kernel function for
the weighted kNN classifier. A simple choice is the inverse distance function.
Algorithm:
 Let L = { (xi, yi), i = 1, . . . , n } be a training set of observations xi with given class yi, and let x be a new
observation (query point), whose class label y has to be predicted.
 Compute d(xi, x) for i = 1, . . . , n, the distance between the query point and every other point in the training
set.
 Select D' ⊆ D, the set of the k training data points nearest to the query point.
 Predict the class of the query point using distance-weighted voting, where v ranges over the class labels:

y = argmax over v of  Σ over (xi, yi) in D' of  wi * 1(v = yi)

where 1(·) is 1 when its argument is true and 0 otherwise, and wi is the kernel weight, e.g. the inverse
distance wi = 1 / d(x, xi).
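
A minimal Python sketch of the algorithm above with the inverse-distance kernel (the tiny dataset and the
small constant added to avoid division by zero are illustrative assumptions):

# Weighted kNN: the k nearest neighbours vote with weight 1/d (inverse distance),
# so closer points count more towards the predicted class.
import math
from collections import defaultdict

def weighted_knn(train, query, k=3):
    # train: list of ((x, y), label) pairs; query: an (x, y) point.
    nearest = sorted(
        (math.dist(x, query), label) for x, label in train
    )[:k]                                    # the k nearest training points (D')
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / (d + 1e-9)     # kernel weight decreases with distance
    return max(votes, key=votes.get)         # distance-weighted vote

train = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((3, 3), 1), ((3, 4), 1)]
print(weighted_knn(train, (2.5, 3.0), k=3))  # 1: the nearer class-1 points dominate the vote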

15 MARKS

16A) INDUCTIVE BIAS :


 In the intricate realm of machine learning, the concept of inductive bias serves as a fundamental pillar,
shaping the very essence of how models interpret and generalize from data. At its core, inductive bias
refers to the set of assumptions, constraints, or prior knowledge encoded into a learning algorithm, guiding
it to favor certain hypotheses over others. This crucial aspect plays a pivotal role in the model’s ability to
make predictions, handle uncertainty, and adapt to diverse datasets.
 To comprehend the significance of inductive bias, one must first grasp the inherent challenges that machine
learning algorithms face when presented with vast and complex datasets. In the absence of any guiding
principles, models might struggle to discern patterns, leading to overfitting or underfitting issues.
Overfitting occurs when a model learns the training data too well, capturing noise and anomalies, but
failing to generalize effectively to new, unseen data. On the contrary, underfitting transpires when a model
is too simplistic, lacking the capacity to capture the underlying patterns in the data.
 Inductive bias acts as a guiding light in the darkness of uncertainty, steering machine learning models away
from the pitfalls of overfitting and underfitting. It serves as a set of preferences or predispositions that
allow the algorithm to make assumptions about the nature of the data and, in turn, make more informed
predictions. These biases can be explicit, such as pre-defined rules or constraints, or implicit, arising from
the architectural choices and hyper parameters of the learning algorithm.
 Before going into any detail, you might need a basic understanding of related concepts such as
restriction bias and preference bias.
 Now, to understand it more clearly: it is possible to make a dozen hypotheses based on a few observations.
This is an important property of inductive reasoning: valid observations may lead to different hypotheses,
and some of them can be false. For example, observing from the earth you could assume that all stars are
white just by looking at the sky; even though it is not true, it is perfectly reasonable to assume it
that way.

 For this example, one could consider the simplest hypothesis, that "there are stars in the sky", rather than
complicating things over the observation. All this may sound vague, but what does it have to do with
machine learning?
 In most machine learning tasks, we deal with some subset of observations (samples) and our goal is to
create a generalization based on them. We also want our generalization to be valid for new unseen data. In
other words, we want to draw a general rule that works for the whole population of samples based on a
limited sample subset.
 So we have some set of observations and a set of hypotheses that can be induced based on observations.
The set of observations is our data and the set of hypotheses are ML algorithms with all the possible
parameters that can be learned from this data. Each model can describe training data but provide
significantly different results on new unseen data.

 As you can clearly see in the above illustration, when different models are trained on a fixed training set,
they tend to vary when it comes to inferring the unseen data, which can lead to very different predictions.
 There is an infinite set of hypotheses for a finite set of samples. For example, consider observations of two
points of some single-variable function. It is possible to fit a single linear model and an infinite number of
periodic or polynomial functions that perfectly fit the observations. Given the data, all of these functions
are valid hypotheses that perfectly align with the observations, and with no additional assumptions, choosing
one over another is like making a random guess.

Now let's evaluate our hypotheses on a new unseen data sample X2: it turns out that most of the complicated
functions are inaccurate, while the linear function remains quite accurate. This may already seem familiar
from a bias-variance tradeoff perspective.

The prioritization of some hypotheses (restriction of the hypothesis space) is an inductive bias. So the model
is biased toward some group of hypotheses (preference bias). For the previous example, one can choose a linear
model based on some prior knowledge about the data and thus prioritize linear generalization.
Why Is Inductive Bias Important?
As one can see from the previous example, choosing the right inductive bias for the model leads to better
generalization, especially in a low-data setting. The less training data we have, the stronger the inductive
bias should be to help the model generalize well. But in a rich-data setting, it may be preferable to avoid
any inductive bias, letting the model be less constrained and search through the hypothesis space freely.

In a low-data setting, the right inductive bias may help to find a good optimum, but in a rich-data setting it
may impose constraints that harm generalization.
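
As a small illustration of this point (the data and the use of NumPy's polyfit are illustrative assumptions):
a straight line and a more flexible polynomial can fit the same few observations almost equally well, yet
behave very differently on unseen inputs.

# Two hypotheses that fit the same observations but generalize very differently.
import numpy as np

x_train = np.array([0.0, 1.0, 2.0])
y_train = x_train + np.array([0.02, -0.03, 0.01])   # roughly linear data with tiny noise

linear   = np.polyfit(x_train, y_train, deg=1)      # strong inductive bias: a straight line
flexible = np.polyfit(x_train, y_train, deg=2)      # weaker bias: interpolates the points exactly

x_unseen = 10.0
print(np.polyval(linear, x_unseen))    # close to 10: generalizes well
print(np.polyval(flexible, x_unseen))  # noticeably off: the quadratic bent to fit the noise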

16B)
