MLP IA1
• DISADVANTAGES OF MACHINE LEARNING:
- It is very difficult to identify and rectify the errors.
- Data acquisition: collecting enough good-quality data is a major challenge.
- Interpretation of results requires more time and space.
NAME                 | YEAR | INVENTION
Jay Wilson and AT&T  | 1992 | Automated speech recognition using hidden Markov models
AT&T                 | 1992 | Support vector machine
Patrick Haffner      | 1996 | First convolutional neural network, rebranded as deep learning
AT&T                 | 1997 | AdaBoost algorithm, which allowed unstructured data to be handled through decision trees
AT&T                 | 2001 | Natural language understanding in Interactive Voice Response (IVR) systems
Group of researchers | 2011 | Deep neural networks
3.DIFFERENCE BETWEEN DATA MINING AND MACHINE
LEARNING?
->
• Model validation and selection:
- In Data Science, many modeling techniques exist, and the challenge is
to find which one is the right one to use.
- A good model has two properties:
(a) good predictive power
(b) generalizes well to data it hasn't seen.
- To achieve good predictive power:
• error measure:
- Two common error measures in ML are:
- the classification error rate (classification)
- the mean squared error (regression)
- A lower classification error rate is better.
- The MSE measures how big the average error of the
prediction is.
- Squaring the errors before averaging has two consequences:
i. you can't cancel out a wrong prediction in one direction with
a faulty prediction in the other direction.
ii. bigger errors get even more weight than they
otherwise would.
- Small errors remain small or can even shrink, whereas big
errors are enlarged and will definitely draw your attention (see the sketch below).
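Both measures are simple to compute by hand; the following is a minimal NumPy sketch (the array values are made up for illustration):

```python
import numpy as np

# Classification error rate: fraction of wrongly predicted labels (lower is better).
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0])
classification_error = np.mean(y_true != y_pred)  # 2 wrong out of 5 -> 0.4

# Mean squared error: average of squared prediction errors (regression).
# Squaring keeps opposite-sign errors from cancelling and weights big errors more.
y_true_r = np.array([3.0, 5.0, 2.5])
y_pred_r = np.array([2.5, 5.0, 4.0])
mse = np.mean((y_true_r - y_pred_r) ** 2)  # (0.25 + 0 + 2.25) / 3 ~ 0.83
print(classification_error, mse)
```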
• validation strategy:
- Split up: divide the data into a training set and a test set.
- K-folds cross-validation: divide the data set into k parts
and use each part one time as a test data set while using
the others as a training data set.
- Leave-1 out: the same as k-folds, but with k equal to the number
of observations, so each time one observation is left out and the
model is trained on the rest of the data (all three strategies are
sketched in the code below).
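As an illustration only, the three strategies map directly onto scikit-learn helpers; the iris dataset and logistic regression model here are arbitrary stand-ins:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Split up: hold out part of the data as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# K-folds cross-validation: each of the k parts is used once as the test set.
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Leave-one-out: train on all observations but one, repeated for every observation.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(kfold_scores.mean(), loo_scores.mean())
```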
- To achieve Regularization:
• incur a penalty for every extra variable used to construct the
model.
• With L1 regularization, you ask for a model with as few important
predictors as possible, which helps the model's robustness.
• L2 regularization aims to keep the variance between the coefficients
of the predictors as small as possible.
- Overlapping variance between predictors makes it hard to make out
the actual impact of each predictor.
- To keep it simple, regularization is mainly used to stop a model from
using too many features and thus prevent over-fitting (a sketch
contrasting L1 and L2 follows after this list).
• Validation:
- It is extremely important because it determines whether the model
works in real-life conditions.
- Test the models on unseen data, and make sure this data is a true
representation of what it would encounter when applied on fresh
observations by other people.
- Once a good model is constructed, it can be used to predict the future.
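The following sketch contrasts L1 and L2 regularization using scikit-learn's Lasso and Ridge on synthetic data; the dataset sizes and alpha values are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.datasets import make_regression

# Toy data with 20 features, of which only a few are actually informative.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

# L1 (Lasso): penalizes |coefficient|, driving unimportant coefficients to exactly 0,
# so the model keeps as few predictors as possible.
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero L1 coefficients:", np.sum(lasso.coef_ != 0))

# L2 (Ridge): penalizes coefficient^2, shrinking coefficients so the variance
# between them stays small.
ridge = Ridge(alpha=1.0).fit(X, y)
print("L2 coefficient spread:", ridge.coef_.std())
```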
• Applying the trained model to unseen data:
- If the first three steps were implemented successfully, you now have a
performant model that generalizes to unseen data.
- The process of applying the model to new data is called model scoring,
which was implicitly done during validation.
- Trust the model enough to use it for real.
- Model scoring involves two steps (see the sketch below):
i. prepare a data set that has features exactly as defined by
the model.
ii. apply the model on this new data set, which results in a
prediction.
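A minimal sketch of the two scoring steps; `trained_model` and the feature column names are hypothetical placeholders, not from the notes:

```python
import pandas as pd

# Hypothetical: 'feature_cols' lists the exact features the model was trained on
# (the names here are illustrative only).
feature_cols = ["age", "income", "num_purchases"]

def score(trained_model, new_data: pd.DataFrame):
    # Step i: prepare a data set whose features match the model's definition exactly.
    X_new = new_data[feature_cols]
    # Step ii: apply the model to the new data set, which yields a prediction.
    return trained_model.predict(X_new)
```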
- CLASSIFICATION:
• A classification problem is when the output variable is a category.
• The goal of classification is to predict the class label, which is a
choice from a predefined list of possibilities.
• Binary-class / multi-class classification.
• KNN Algorithm (distance-based algorithm, used both for classification
and regression):
- KNN is a lazy-learning, non-parametric algorithm.
- It uses data with several classes to predict the classification of
the new sample point.
- KNN is non-parametric since it doesn't make any assumptions on
the data being studied.
- It's called a lazy learner because it doesn't perform any training
when you supply the training data.
- Instead, it just stores the data during the training time and
doesn't perform any calculations.
- Most of the computation is done during consultation time.
- It defers the processing of examples till it receives an explicit
request for information.
• When should KNN be used?
- Data should be labeled.
- Data should be noise-free.
- The data set should be small.
- Even if the data set is huge, a proper sample of the
data should be used.
(A minimal usage sketch follows below.)
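A minimal scikit-learn sketch of the lazy-learning behaviour described above (the dataset and k=3 are arbitrary choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'Training' only stores the data; the real work happens at prediction
# (consultation) time, when distances to the stored points are computed.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```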
• Naïve Bayes:
• It is called naïve because it assumes that the occurrence of
a certain feature is independent of the occurrence of other
features.
• Naïve Bayes assumes conditional independence over the training
dataset.
• The classifier separates data into different classes according to
Bayes' Theorem, but assumes that the relationship between
all input features in a class is independent. Hence, the model is
called naïve.
• It depends on the principle of Bayes' Theorem.
• Naive Bayes classifiers are a family of classifiers that are quite
similar to the linear models.
• They tend to be even faster in training.
• The reason that naive Bayes models are so efficient is that they
learn parameters by looking at each feature individually and
collect simple per-class statistics from each feature (see the sketch below).
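A minimal sketch with scikit-learn's GaussianNB, which learns exactly the per-class, per-feature statistics described above (the dataset choice is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB learns simple per-class statistics (mean and variance) for each
# feature independently, which is why training is so fast.
nb = GaussianNB().fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
```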
• Decision Tree Algorithm (decision-based algorithm, used both for
classification and regression):
• It is a predictive modeling technique used in classification and
prediction tasks.
• It uses a divide-and-conquer technique to split the search space
into subsets.
• Decision trees are widely used for classification and regression.
• Decision trees learn a hierarchy of if-else questions leading to
a decision.
• Root node: no incoming edges and zero or more outgoing edges.
• Internal node: exactly one incoming edge and two or more
outgoing edges.
• Leaf node: exactly one incoming edge and no outgoing edges.
• ADVANTAGES:
- It is simple to understand, as it follows the same process a
human follows while making any decision in real life.
- It can be very useful for solving decision-related problems.
- It helps to think about all the possible outcomes for a
problem.
- There is less requirement of data cleaning compared to other
algorithms.
• DISADVANTAGES:
- The decision tree contains lots of layers, which makes it
complex.
- It may have an over-fitting issue.
- For more class labels, the computational complexity of the
decision tree may increase.
(A small scikit-learn sketch follows below.)
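A small scikit-learn sketch; `max_depth=3` is an arbitrary choice that illustrates how limiting the layers counters the over-fitting issue:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_depth limits the number of layers, which counters the over-fitting
# issue mentioned above.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))  # the learned hierarchy of if-else questions
```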
- REGRESSION:
• A regression problem is when the output variable is a real value.
• The goal is to predict a continuous number (a float value in
programming terms).
• It allows one to make predictions from data by learning the relationship
between features of your data and some observed, continuous-valued
response.
• The ultimate goal of the regression algorithm is to plot a best-fit line or
a curve between the data.
• Linear Regression is an algorithm that belongs to supervised machine
learning.
• It is a study of the relationship between the variables (the independent
variable(s) and the dependent, or target, variable).
• A real number y has to be predicted for a given value of x.
• It tries to apply relations that will predict the outcome of an event.
• The relation is usually a straight line that fits the different data
points as closely as possible.
• It is a concept based on lines, planes, and hyperplanes.
• ADVANTAGES:
- Easy to implement: it is computationally simple and does not require
much engineering.
- Scalability: it can be applied to cases where scaling is needed, such as
applications that handle big data.
- Interpretability: easy to interpret and very efficient to train.
- Applicability in real time: can be applied to scenarios where real-time
predictions are important.
• DISADVANTAGES:
- Assumes a straight-line (linear) relationship between the dependent and
independent variables, which is rarely the case in real-world data.
- It is prone to noise and over-fitting.
- It is not a good choice for datasets where the number of
observations is less than the number of attributes, as this can lead
to over-fitting.
- Sensitive to outliers; hence it is essential to pre-process the
dataset and remove the outliers before applying Linear Regression to
the data.
- It assumes no relationship (multicollinearity) among the independent
variables; such correlated variables need to be removed using
dimensionality reduction techniques before applying Linear Regression.
(A minimal fitting sketch follows below.)
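A minimal fitting sketch with scikit-learn; the synthetic data (slope 2.5, intercept 1.0, Gaussian noise) is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy 1-D data: y is roughly a straight line with some noise.
rng = np.random.RandomState(0)
X = rng.rand(50, 1) * 10
y = 2.5 * X.ravel() + 1.0 + rng.randn(50)

# Fit the best-fit line y = a*x + b and predict for a new x.
reg = LinearRegression().fit(X, y)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("prediction at x=4:", reg.predict([[4.0]]))
```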
8.ALL ABOUT SEMI-SUPERVISED LEARNING?
->
• Semi-supervised learning is an approach to machine learning that
combines a small amount of labeled data with a large amount of
unlabelled data during training.
• It falls between unsupervised learning and supervised learning.
• Generating labels may not be easy or cheap, and hence, due to limited
resources, we may have labels for only a few observations.
• The acquisition of labeled data for a learning problem often requires a
skilled agent.
• The cost associated with labelling the complete data set may be
infeasible.
• It is a class of machine learning tasks and techniques that also make use
of unlabelled data for training, i.e., a small amount of labeled data plus a
large amount of unlabelled data.
• It has been found that unlabelled data, when used in conjunction with a
small amount of labeled data, can produce considerable improvement in
learning accuracy (a sketch follows below).
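A minimal sketch using scikit-learn's LabelPropagation; hiding 90% of the iris labels is an arbitrary way to simulate expensive labelling:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Pretend labels are expensive: keep only ~10% of them, mark the rest as -1 (unlabeled).
rng = np.random.RandomState(0)
y_partial = np.copy(y)
y_partial[rng.rand(len(y)) > 0.1] = -1

# LabelPropagation spreads the few known labels through the unlabeled points.
model = LabelPropagation().fit(X, y_partial)
print("accuracy on true labels:", (model.transduction_ == y).mean())
```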
11.WHAT IS Q-FUNCTION?
->
• The evaluation function Q(s, a) is the maximum discounted cumulative reward
value that can be achieved starting from state s and applying action a as the
first action.
• The Q value for each state-action transition equals the r value for this
transition plus the V* value for the resulting state discounted by gamma:
Q(s, a) = r(s, a) + γ · V*(δ(s, a)).
• If the agent learns the Q function instead of the V* function, it will be able
to select optimal actions even when it has no knowledge of the functions r
and δ (the reward and state-transition functions).
• It need only consider each available action a in its current state s and
choose the action that maximizes Q(s, a).
• One can choose globally optimal action sequences by reacting repeatedly to
the local values of Q for the current state.
• The agent can choose the optimal action without ever conducting a lookahead
search to explicitly consider what state results from the action.
• Q summarizes in a single number all the information needed to determine the
discounted cumulative reward that will be gained in the future if action a is
selected in state s.
STEPS:
• Set the Gamma Parameter and Environment rewards in matrix R
• Initialize matrix Q to Zero
• Select a random initial state
• Set current state=initial state
• Select one among all the possible actions for the current state
• Using this possible action, consider going to the next state
• Get the maximum Q value for this next state, based on all possible actions.
• Compute: Q(s, a) = R(s, a) + Gamma * max[Q(next state, all actions)]
• Repeat the above steps until current state = goal state.
(A runnable sketch of these steps follows below.)
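A runnable sketch of these steps; the 3-state reward matrix, the convention that taking action a leads to state a, and the choice of goal state are all made-up toy assumptions:

```python
import numpy as np

# Toy environment: 3 states; R[s, a] is the reward for taking action a in state s,
# and -1 marks an unavailable action. By convention here, action a moves the
# agent to state a. State 2 is the (made-up) goal state.
R = np.array([[ -1,   0,  -1],
              [  0,  -1, 100],
              [ -1,   0, 100]], dtype=float)
gamma = 0.8                       # the Gamma parameter
Q = np.zeros_like(R)              # initialize matrix Q to zero
goal = 2
rng = np.random.default_rng(0)

for _ in range(1000):                             # training episodes
    state = rng.integers(R.shape[0])              # select a random initial state
    while state != goal:                          # repeat until current state = goal
        actions = np.flatnonzero(R[state] >= 0)   # all possible actions here
        action = rng.choice(actions)              # select one of them
        next_state = action                       # consider going to the next state
        # Q(s, a) = R(s, a) + Gamma * max[Q(next state, all actions)]
        Q[state, action] = R[state, action] + gamma * Q[next_state].max()
        state = next_state

print(np.round(Q, 1))  # the learned Q table
```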
• In contrast to that are classical machine learning methods, such as SVMs or
neural networks, which have a given set of data and draw conclusions based
on it.
• Dynamic programming can be used to solve reinforcement learning
problems when someone tells us the structure of the MDP.
• Dynamic programming is all about remembering answers to the sub-problems
you have solved, to save time later.
• WHERE DO WE NEED DYNAMIC PROGRAMMING?
• When you are given a problem which can be broken down into smaller
subproblems.
• When the smaller sub-problems can still be broken down further, and you
find that some of these sub-problems overlap.
• In problems where we come across optimal substructure and
overlapping subproblems (the classic example is sketched below).
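The classic illustration is Fibonacci, sketched below; this is a generic example, not from the notes:

```python
from functools import lru_cache

# fib(n) breaks into fib(n-1) and fib(n-2), and those subproblems overlap
# heavily. Memoization remembers answers to subproblems already solved,
# turning exponential time into linear.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # instant with memoization; astronomically slow without it
```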
UNIT 2
13.WHAT IS SVM?
->
- A classification technique that has received considerable attention is the
support vector machine.
- This technique has its roots in statistical learning theory (Vladimir Vapnik,
1992).
- As a classification task, it searches for an optimal hyperplane separating
the tuples of one class from another.
- SVM works well with higher-dimensional data and thus avoids the
dimensionality problem.
- Although SVM-based classification is extremely slow, the result is
highly accurate.
- SVM is less prone to overfitting than other methods.
- It also facilitates a compact model for classification.
- SVMs are supervised learning models that analyze data, used for
classification and regression analysis.
- Goal: to create the best line or decision boundary that can segregate n-
dimensional space into classes; the best decision boundary is called a
hyperplane.
- Once trained, it can easily predict the class of new data.
- It chooses the extreme points that help in creating the hyperplane; these
are called support vectors, and hence the algorithm is termed Support
Vector Machine.
- The SVM learns how important each of the training data points is to
represent the decision boundary between the two classes.
- Typically only a subset of the training points matters for defining the
decision boundary.
- To make a prediction for a new point, the distance to each of the support
vectors is measured.
- A classification decision is made based on the distances to the support
vectors and the importance of the support vectors that was learned during
training (a minimal sketch follows below).
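A minimal scikit-learn sketch of training an SVM classifier; the breast-cancer dataset and linear kernel are arbitrary choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVMs need careful preprocessing; scaling the features is usually essential.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="linear").fit(scaler.transform(X_train), y_train)

print("test accuracy:", svm.score(scaler.transform(X_test), y_test))
print("support vectors per class:", svm.n_support_)
```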
14.PROS AND CONS OF SVM?
->
• PROS:
- Kernelized SVMs (KSVMs) are powerful models and perform well on a variety
of datasets.
- They work well on low-dimensional and high-dimensional data.
- Effective on datasets with multiple features, like financial or medical data.
- Effective in cases where the number of features is greater than the number
of data points.
- Uses a subset of training points in the decision function, called support
vectors, which makes it memory efficient.
- Different kernel functions can be specified for the decision function.
- You can use common kernels, but it's also possible to specify custom
kernels.
• CONS:
- SVMs require careful preprocessing of the data and tuning of the parameters.
- SVM models are hard to inspect; it can be difficult to understand why a
particular prediction was made, and it might be tricky to explain the model
to a non-expert.
- If the number of features is a lot bigger than the number of data points,
avoiding over-fitting when choosing kernel functions and the regularization
term is crucial.
- SVMs don't directly provide probability estimates; those are calculated
using an expensive five-fold cross-validation.
- Works best on small sample sets because of its high training time.
15.HYPERPLANE AND MAXIMUM MARGIN HYPERPLANE?
->
• Hyperplane:
- There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find the best decision boundary that
helps to classify the data points; this best boundary is known as the
hyperplane of the SVM.
- The dimensions of the hyperplane depend on the features present in the
dataset: for 2 features, the hyperplane will be a straight line; for 3
features, the hyperplane will be a two-dimensional plane.
- We always create a hyperplane that has a maximum margin, i.e., the
maximum distance between the data points of the two classes.
- A decision boundary is a boundary which is parallel to the hyperplane and
touches the closest class on one side of the hyperplane.
- The distance between the two decision boundaries of a hyperplane is
called the margin.
- The margin of the hyperplane implies the error of the classifier, i.e., the
larger the margin, the lower the classification error.
- Classifiers whose hyperplane has a small margin are more susceptible to
model over-fitting and tend to classify unseen data with weak confidence.
- Thus, during the training or learning phase, the approach is to search for
the hyperplane with the maximum margin; such a hyperplane is called the
maximum margin hyperplane (MMH).
- Also note that the shortest distance from a hyperplane to one of its
decision boundaries is equal to the shortest distance from the hyperplane
to the decision boundary on its other side.
- Alternatively, the hyperplane is in the middle of its decision boundaries
(the standard formulation is stated below).
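For reference, the standard textbook formulation of the maximum margin (not spelled out in the notes): for training points $(x_i, y_i)$ with labels $y_i \in \{-1, +1\}$ and hyperplane $w \cdot x + b = 0$, the two decision boundaries are $w \cdot x + b = \pm 1$ and the margin between them is $2 / \lVert w \rVert$, so the MMH is found by solving

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 \ \text{for all } i.$$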
16.TYPES OF SVM?
->
• Linear SVM:
- Used for linearly separable data.
- If a dataset can be classified into two classes by using a single
straight line, then such data is termed linearly separable data, and
the classifier used is called a Linear SVM classifier.
- An SVM which is used to classify data which are linearly separable is
called a linear SVM.
- When training data are linearly separable, a linear SVM searches for a
hyperplane with the maximum margin.
- This is why a linear SVM is often termed a maximal margin classifier
(MMC).
- A linear SVM is undoubtedly better at classifying data if it is trained
on linearly separable data.
- A linear SVM can also be used for non-linearly separable data, provided
that the number of such instances is small.
• Non-linear SVM:
- Non-linear SVM is used for non-linearly separable data.
- If a dataset cannot be classified by using a straight line, then
such data is termed non-linear data, and the classifier used is called a
non-linear SVM classifier.
- This can be achieved in two major steps:
• Transform the original input data into a higher-dimensional space.
This is feasible because SVM's performance is decided by the number of
support vectors, not by the dimension of the data.
• Search for linear decision boundaries to separate the transformed
higher-dimensional data, similar to a linear SVM.
- To have a nonlinear SVM, the trick is to transform non-linear data into
higher-dimensional linear data; this transformation is popularly called
non-linear mapping or attribute transformation.
17.NON-LINEAR TO LINEAR TRANSFORMATION:
ISSUES?
->
• The non-linear mapping, and hence the linear decision boundary concept,
looks pretty simple, but there are many potential problems in doing so.
• Mapping: How to choose the non-linear mapping to a higher-dimensional
space? In fact, the φ-transformation works fine for small examples, but fails
for realistically sized problems.
• Cost of mapping: For n-dimensional input instances there exists a
combinatorially large number of different monomials comprising the
transformed feature space, so its dimensionality grows very rapidly.
• Dimensionality problem: It may suffer from the curse-of-dimensionality
problem often associated with high-dimensional data.
• More specifically, in the calculation of W·X or Xi·X (in φ(X)), we need n
multiplications and n additions (in their dot products) for each of the n-
dimensional input instances and support vectors, respectively.
• As the number of input instances as well as support vectors can be enormously
large, this is therefore computationally expensive.
• Computational cost: Solving the quadratic constrained optimization problem in
the high-dimensional feature space is a too computationally expensive task.
• Fortunately, mathematicians have cleverly proposed an elegant solution to the
above problems, and the solution consists of the following: i. the dual
formulation of the optimization problem, and ii. the kernel trick.
18.EXPLAIN KSVM?
->
- The function of a kernel is to take data as input and transform it into the
required form, that is, to map data from a lower dimension to a higher
dimension in order to separate the data points linearly.
- The kernel trick is a clever mathematical trick that allows us to learn a
classifier in a higher-dimensional space without actually computing the
new, possibly very large representation.
- It directly computes the distance of the data points for the expanded
feature representation, without ever actually computing the expansion.
- Training data which are not linearly separable can be transformed into a
higher-dimensional feature space such that, in the higher-dimensional
transformed space, a hyperplane can be found to separate the
transformed data and hence the original data.
- For example, data that is not linearly separable in 2D can be mapped to a
3D space using φ; for the mapped data it is then possible to find decision
boundaries, and hence a hyperplane, in the 3D space (a sketch follows below).
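A minimal sketch of the kernel trick in action, using scikit-learn on a made-up concentric-circles dataset:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)

# A linear SVM struggles, while the RBF kernel implicitly maps the points to a
# higher-dimensional space where a separating hyperplane exists (the kernel
# trick computes the needed dot products without building that representation).
linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print("linear:", linear_acc, "rbf:", rbf_acc)
```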
19.KERNEL FUNCTIONS?
->
- Different kernel functions take different parameters; those parameters are
called magic parameters and must be decided a priori.
- Further, which kernel is to be used also depends on the pattern of the data
as well as the prudence of the user.
- In general, polynomial kernels result in large dot products, and a Gaussian
RBF kernel produces more support vectors than other kernels (the common
kernels are listed below).
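For reference, the standard (libsvm-style) definitions of the common kernels and their magic parameters:

- Linear: $K(x, z) = x \cdot z$
- Polynomial: $K(x, z) = (\gamma\, x \cdot z + r)^d$ (parameters $\gamma$, $r$, degree $d$)
- Gaussian RBF: $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ (parameter $\gamma$)
- Sigmoid: $K(x, z) = \tanh(\gamma\, x \cdot z + r)$ (parameters $\gamma$, $r$)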
20.CHARACTERISTICS AND APPLICATIONS OF SVM?
->
• CHARACTERISTICS:
- The SVM learning problem can be formulated as a convex optimization
problem, for which efficient algorithms are available to find the global
minimum of the objective function.
- SVM is best suited to classify both linear as well as non-linear
training data efficiently.
- SVM can be applied to categorical data by introducing a suitable similarity
measure.
- Computational complexity is influenced by the number of training data, not
the dimension of the data.
- In fact, learning is a bit computationally heavy and hence slow, but
classification of test data is extremely fast and accurate.
• APPLICATIONS:
- Image classification: SVM is widely used in image classification tasks, such
as face recognition and object detection. SVM can be used to classify
images based on their features, such as texture, shape, and colour.
- Text classification: SVM is also used for text classification tasks, such as
sentiment analysis and spam detection. SVM can learn from a large
dataset of text documents and then classify new documents based on
their features.
- Bioinformatics: SVM has been successfully used in bioinformatics to
classify different types of biological data, such as gene expression data
and protein classification.
- Financial forecasting: SVM can be used to predict stock prices and other
financial market trends. SVM can learn from historical data and then
predict future trends based on the learned patterns.
- Medical diagnosis: SVM can be used to diagnose medical conditions, such
as cancer and diabetes. SVM can learn from patient data and then classify
new patients based on their features.