MLP RL1
Chapter 1 - Introduction
● Machine learning is a set of methods that can automatically detect patterns in data and
then use those uncovered patterns to predict future data or perform other kinds of
decision making.
○ The best way to solve such problems is to use probability theory.
● The goal of predictive or supervised learning is to learn a mapping from inputs x to
outputs y, given a set of input-output pairs called the training set.
● Each training input x is a D-dimensional vector, where each number in that vector is
called a feature or attribute.
● The outputs y can be a categorical variable from some finite set of classes. This is
classification.
● When the outputs y are real scalar values, this is regression.
● The goal of unsupervised learning is to find interesting patterns in the given inputs.
● Binary classification is when the number of classes C is 2. If C > 2, it is
multi-class classification.
● The probability distribution over all possible labels, given the input vector x and the
training set D, is p(y|x, D). This represents a vector of length C, holding the
probability of each class.
● When choosing between different models, the notation is p(y|x, D, M).
● To get the model's predicted class, you simply take the argmax of the predictive
distribution p(y|x, D), as in the sketch below.
○ The most probable class label is called the mode of the distribution and is known
as a MAP (maximum a posteriori) estimate.
● Unsupervised learning is where we have the task of density estimation. We want to
build models of the form p(x | θ ).
● Supervised learning is conditional density estimation, while unsupervised learning is
unconditional density estimation. SL is conditional because we condition on the observed
inputs x when modeling p(y|x).
● Clustering data is a method in unsupervised learning.
○ First we determine how many clusters to create.
○ Then, we estimate which cluster each point belongs to. The cluster a point
belongs to is a hidden or latent variable because it is not observed in the
training set, but rather is something we infer. We can pick the cluster for each
point by taking the argmax of the posterior over cluster assignments,
p(z_i = k | x_i, D); see the sketch below.
● Logistic regression computes the linear combination of inputs, but also passes the
output through a sigmoid function, which is necessary for the output to be interpreted as
a probability.
○ If we threshold the probability, we can induce a decision rule.
● The data is not linearly separable if there is no straight line we can draw to separate the
1s from the 0s; in that case a linear model such as logistic regression will have nonzero
training error.
● The lower the value for K in KNN, the more likely the model is to overfit. A lower K value
signifies a complex model, while a large K underfits and is too simple.
● Generalization error is the error of a function that is tested on data it has never been
trained on.
● A common technique to measure a model’s performance is to split the training set into
two pieces: a training set and a validation set, which acts as a held-out test set.
● Cross validation is a technique where the training data is split into K folds; for each
fold, we train on all the folds except the k’th fold, and test on that k’th fold. The error is
then averaged across all of the folds.
● Leave-one-out cross validation (LOOCV) is when you set K = N, the number of training cases; see the sketch below.
● The no free lunch theorem states that there is no universally best algorithm. We use all
of those previous methods (validation sets, cross validation, minimization of test error) to
empirically choose the best method for our particular problem. The no free lunch
theorem basically says that the performance of two models is the same if it’s averaged
over all possible problems. This sounds terrible, but the caveat is that specific models
are better for specific problems, and most of the time we know the problem space we
have, so we’re not actually averaging across all problems.
Chapter 2 - Probability
● Two interpretations of probability
○ Frequentist - Probabilities represent long run frequencies of events. Ex) If we flip
a coin a bunch of times, it will land heads about half the time.
○ Bayesian - Probability is used to quantify uncertainty about something. Ex) 80%
chance of raining tomorrow. Here the probability represents our degree of belief
that the event will occur.
● p(A) denotes the probability that event A is true.
● Discrete random variables are variables that can take some value from a finite or countably infinite set X.
● The conditional probability of A, given that B is true, is defined as p(A|B) = p(A, B) / p(B), provided p(B) > 0.
● X and Y are unconditionally independent if p(X, Y) = p(X)p(Y). Equivalently, you can also
say p(X | Y) = p(X).
● The standard deviation is the square root of the variance: std(X) = √var[X].
● Suppose we toss a coin N times. Let X be the number of heads (anywhere from 0 to N).
○ If the probability of heads is θ , then X has a binomial distribution.
● Suppose we toss the coin once. Let X be 0 or 1 (depending on the coin flip result), with
the probability of heads as θ .
○ We say that X has a Bernoulli distribution.
○ A Bernoulli random variable is one that only has 2 outcomes.
● Encoding the states 1, 2, and 3 as (1,0,0) and (0,1,0) and (0,0,1) is a one-hot encoding.
● X can have a Poisson distribution with parameter λ, with probability mass function
Poi(x | λ) = e^(−λ) λ^x / x! for x = 0, 1, 2, …
● The multivariate Gaussian is the most widely used joint probability density function for
continuous variables.
● The Kullback-Leibler divergence is a measure of the dissimilarity between two probability
distributions p and q. It’s also known as the relative entropy.
● Laplace’s principle of insufficient reason argues in favor of using uniform distributions
when there are no other reasons to favor one distribution over another.