COMP4702 Notes 2019: Week 2 - Supervised Learning
…built. Thus, ensembles can be shown to have more flexibility in the functions they can represent. This flexibility can, in theory, enable them to over-fit the training data more than a single model would, but in practice, some ensemble techniques (especially bagging) tend to reduce problems related to over-fitting of the training data.
Empirically, ensembles tend to yield better results when there is a significant diversity among the models. Many ensemble methods, therefore, seek to promote diversity among the models they combine.
Bagging
Bootstrap aggregating, often abbreviated as bagging, involves having each model in the ensemble vote with equal weight. In order to promote model variance, bagging trains each model in the ensemble using a randomly drawn subset of the training set. As an example, the random forest algorithm combines random decision trees with bagging to achieve very high classification accuracy.
Exam questions:
2013 Q2.
d) Bagging is a technique for building an ensemble supervised learning model by training multiple base learners on bootstrap data samples from a given training set. Suggest a practical technique for deciding the number of base classifiers to use in bagging. You can describe it in words and/or use pseudocode. If you need to make any assumptions, state them.
The idea with having multiple base learners is to pick learners that improve accuracy in areas where the other learners aren't accurate. Because bagging also creates multiple models of the same learner using bootstrapped examples, it is best to pick a small number of unstable classifiers, as averaging stable classifiers across multiple samples has smaller benefit.
One possible technique using cross-validation (sketched in code below):
1: Create a separate validation set
2: Start rounds of bagging
3: Monitor validation set error
4: Stop when it starts to rise / bottoms out
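A minimal sketch of this stopping rule, assuming scikit-learn-style decision trees as base learners and pre-made training/validation splits (X_train, y_train, X_val, y_val, patience are illustrative names, not part of the original notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def bagged_predict(models, X):
    # Equal-weight majority vote (assumes integer class labels).
    votes = np.array([m.predict(X) for m in models])
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

def grow_bagging(X_train, y_train, X_val, y_val, max_learners=200, patience=10):
    models, best_err, best_size, since_best = [], np.inf, 0, 0
    for _ in range(max_learners):
        Xb, yb = resample(X_train, y_train)              # one bootstrap sample
        models.append(DecisionTreeClassifier().fit(Xb, yb))
        err = np.mean(bagged_predict(models, X_val) != y_val)
        if err < best_err:                               # validation error still falling
            best_err, best_size, since_best = err, len(models), 0
        else:
            since_best += 1
        if since_best >= patience:                       # error has risen / bottomed out
            break
    return models[:best_size]
```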
Boosting
Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data. By far, the most common implementation of boosting is AdaBoost, although some newer algorithms are reported to achieve better results.
Supervised Learning: Algorithms that rely on labelled training data. Training generally relies
on minimisation of error over the training data.
Generalisation: The capability of a model to maintain low error when evaluating unseen
data (ultimate goal is to have good generalisation). Can be measured by evaluating a trained
model on an independent test set.
Logistic Regression
Model Selection
There is a fundamental trade-off between complexity, training set size and generalisation. If
the model is too simple, there may be poor predictive power on the test set. If the model is
too complex (overfit), then the model may be too specific to the training set and therefore
may have high error on the test set. A test error much greater than the training error is an indicator of overfitting.
Overfitting can be detected by using a third data set called the validation set. When the validation error begins to rise, the model is beginning to overfit. The optimum is the point of minimum validation error.
Overfitting can be avoided via cross-validation, early stopping, or pruning, since further training may hinder generalisation.
Occam’s Razor: Given a set of hypotheses, choose the simplest one that describes the phenomenon. In ML this means that the simplest model that adequately explains the data should be used.
Week 3 – Optimisation & Statistical Learning
Maximum Likelihood Estimation (MLE): Method of determining which model parameters best fit the data points. The likelihood represents the probability of observing the data points under a given parameter setting of the distribution.
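In symbols (standard definition, assuming i.i.d. data points $x_1, \dots, x_N$ and parameters $\theta$):

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta} \prod_{i=1}^{N} p(x_i \mid \theta) = \arg\max_{\theta} \sum_{i=1}^{N} \log p(x_i \mid \theta)$$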
$$P(C \mid x) = \frac{P(x \mid C)\,P(C)}{\sum_{C_i} P(x \mid C = C_i)\,P(C = C_i)}$$
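A small numerical sketch of this posterior for two classes with 1-D Gaussian class-conditional densities; the means, standard deviations and priors are made-up illustrative values:

```python
import numpy as np
from scipy.stats import norm

means, stds, priors = [0.0, 2.0], [1.0, 1.0], [0.6, 0.4]

def posterior(x):
    likelihoods = np.array([norm.pdf(x, m, s) for m, s in zip(means, stds)])
    joint = likelihoods * np.array(priors)    # P(x | C_i) * P(C_i)
    return joint / joint.sum()                # divide by the evidence (denominator)

print(posterior(0.5))   # roughly [0.80, 0.20] -> predict class 0
```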
Simple case: No covariance (shared, axis-aligned). Gives a linear discriminant that predicts the class whose mean is closest to the data point. Also limits the distributions to be elliptical only along the axis directions (find a picture).
General case: Full covariance matrix. Gives a quadratic discriminant. Prediction is based on the nearest pdf (i.e. whichever class is most likely).
Curse of dimensionality?
The parameters of the model (a Gaussian mixture) are: k, the mean vector and covariance matrix for each mixture component, and the mixture weights/proportions.
Such probabilistic models are trained by maximising the likelihood; however, there is no closed-form solution for a mixture, so an alternative method must be used.
Expectation Maximisation (EM): The difficulty with a GMM is determining which Gaussian each data point should belong to. EM alternates between an E-step (estimate each component's responsibility for each data point given the current parameters) and an M-step (re-estimate the parameters given those responsibilities), which never decreases the likelihood.
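A minimal sketch using scikit-learn's GaussianMixture, which runs EM internally; the two-blob synthetic data and the choice of two components are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),      # two illustrative 2-D blobs
               rng.normal(4, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM runs inside fit()
resp = gmm.predict_proba(X)        # "responsibilities": P(component | x) from the E-step
print(gmm.weights_, gmm.means_)    # mixture proportions and component means
```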
Non-Parametric Density Estimation: Non-parametric models let the data speak for itself; there is no parametric model of the underlying distribution. These estimates can be used as the likelihoods in Bayes' rule.
Histogram: The hyperparameter is the bin width; keep a count of the number of data points within each bin and plot the (normalised) counts as the estimated distribution.
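A short sketch of a histogram density estimate with numpy; the sample and bin width are illustrative choices:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1000)    # illustrative sample
bin_width = 0.5
bins = np.arange(x.min(), x.max() + bin_width, bin_width)
counts, edges = np.histogram(x, bins=bins)
density = counts / (len(x) * bin_width)           # normalise so the total area is 1
```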
k-nearest neighbour
Autoencoder?
The idea is to construct a neural network with hidden layers where the input matches the output. By doing so you create an encoder phase and a decoder phase. At the narrowest layer in the network, the data is compressed but can still represent the input well, hence the dimensionality can be reduced by using these values. It is a denser representation of the input data.
Question 3
a)
i)
([784+1]*50)+([50+1]*50)+([50+1]*2)+([2+1]*50)+
([50+1]*50)+([50+1]*784) = 84586
ii)
As the number of inputs of an autoencoder matches the number of outputs, the goal is to reproduce the input directly at the output. Any hidden layer narrower than the input forces the data through a reduced-dimensionality representation of that layer's size. Therefore, the outputs of such a hidden layer can be taken as a reduced-dimensionality form of the data when the output = input.
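The count in part (i) can be checked with a short loop, assuming fully connected layers with one bias per unit and the 784-50-50-2-50-50-784 architecture implied by the calculation:

```python
# Each layer contributes (inputs + 1 bias) * outputs weights.
sizes = [784, 50, 50, 2, 50, 50, 784]
params = sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))
print(params)   # 84586
```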
Week 5 – Clustering
Unsupervised learning that tries to look for grouping structure in a dataset. Related to
density estimation as trying to locate areas of high density.
k-means clustering
Performs local optimisation of the distance between points and their cluster centres. Very similar to the EM algorithm for density estimation, but using a binary assignment for each cluster rather than a Gaussian probability (a numpy sketch follows the steps below).
1. Initialise k clusters (randomly) – have to choose k
2. Calculate closest centre for each data point
3. Recalculate the centre of the cluster with the new points
4. Repeat until convergence
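A minimal numpy sketch of the loop above; X is an assumed (n, d) data matrix, k is chosen by the user, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # 1. random initialisation
    for _ in range(n_iter):
        # 2. assign each point to its closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. recompute each centre as the mean of its assigned points
        new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centres, centres):                # 4. stop at convergence
            break
        centres = new_centres
    return centres, labels
```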
Hierarchical clustering
Make a tree that indicates the distance between nodes (a dendrogram). A cut can then be made through the dendrogram at a specific distance/radius to identify the corresponding number of clusters.
Forward search: Method of selecting features for feature extraction. Greedily add the best feature (the one giving minimum error). When a feature is added, check whether the total error has decreased; if not, terminate the algorithm at the previous step (sketched below). Does not always find the optimal choice because of the greedy decisions.
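A sketch of the greedy loop, assuming a scikit-learn-style estimator and cross-validated error as the selection criterion (all names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_search(X, y, estimator):
    remaining, selected, best_err = set(range(X.shape[1])), [], np.inf
    while remaining:
        # try adding each remaining feature and keep the one with the lowest error
        errs = {f: 1 - cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                for f in remaining}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:        # no improvement: stop at the previous step
            break
        best_err = errs[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected
```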
The eigenvalues of the covariance matrix of the data indicate the ‘explanatory power’ of each principal component towards the total variance, and each eigenvector gives the direction along which that variance lies. Hence projecting onto the eigenvectors that correspond to the largest k eigenvalues gives the data reduced to k dimensions while explaining the most variance. Highly correlated data points produce clusters.
Scree-graph: Used to determine the number of principal components to use in PCA feature extraction. Plot of eigenvalue vs component number, i.e. a plot of the eigenvalues sorted in descending order (as in the sketch below). Best to select at the elbow of the graph.
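A numpy sketch of PCA via the eigen-decomposition of the covariance matrix, returning the k-dimensional projection together with the sorted eigenvalues that a scree graph would plot (X is an assumed data matrix):

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                       # centre the data
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric covariance matrix
    order = np.argsort(eigvals)[::-1]             # descending "scree" order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs[:, :k], eigvals           # k-D projection, scree values
```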
Fishers LDA
Supervised technique which aims to find a direction such that projection of the data onto
the direction makes the classes well separated. It does so by maximising the distance
between the means of the data while minimising the variance within each class.
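A sketch of the two-class case, where the Fisher direction is proportional to $S_W^{-1}(m_1 - m_0)$ with $S_W$ the within-class scatter matrix (X and binary labels y are assumed inputs):

```python
import numpy as np

def fisher_direction(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class scatter: sum of the two class scatter matrices
    S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(S_w, m1 - m0)     # direction maximising between/within separation
    return w / np.linalg.norm(w)

# projecting with X @ w gives the 1-D representation with well-separated classes
```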
LDA vs PCA
Both reduce dimensionality by trying to find a linear combination of the features. LDA is
supervised and attempts to model differences between classes. PCA is unsupervised and
instead tries to preserve the trend in data (maintaining the variance).
A single neuron applying a sigmoid activation function to a linear combination of its inputs is equivalent to logistic regression (output between 0 and 1).
The original neural model was called a McCulloch-Pitts neuron and used a step activation
function for some specified threshold value.
- Cannot train with backprop as the function is not differentiable
Importance of bias/dummy input: Allows for the model to incorporate offset. It is no longer
required to output 0 given an input of 0.
(single) Perceptron learning rule: Original method of training a perceptron, requires that the
data is linearly separable for convergence (XOR problem).
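A numpy sketch of the rule, assuming labels in {-1, +1} and an illustrative learning rate; it converges only if the data is linearly separable:

```python
import numpy as np

def perceptron(X, y, lr=0.1, epochs=100):
    Xb = np.hstack([np.ones((len(X), 1)), X])        # prepend the bias/dummy input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:                   # misclassified point
                w += lr * yi * xi                    # nudge the boundary towards it
                errors += 1
        if errors == 0:                              # converged (linearly separable data)
            break
    return w
```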
Learning rate: Controls the step size along the error surface during optimisation. It is difficult to select: too high and the weight changes do not match changes in the gradient; too small and the network takes too long to learn. High learning rates can lead to overstepping of the solution and oscillation.
Why use sigmoid functions: Sigmoids are non-linear, continuous and differentiable versions of a hard threshold function. Their non-linearity means multi-layer networks built from them can solve linearly non-separable problems. Differentiability is also important because the network can then be trained through gradient descent/backpropagation. When the sigmoid saturates (output close to 0 or 1, i.e. large-magnitude inputs), its derivative is very small, resulting in small weight updates that can make learning slow.
Weight-initialisation: Weights should be close to 0 and similar in magnitude, because otherwise high variance (random information) may be propagated through the network. (Standardise the inputs.) If there is a large variance then different nodes will learn at different rates. It is also beneficial to have uncorrelated inputs, as the error can then be minimised independently for each. Large initial weights can lead to slow training because the sigmoid outputs saturate and the partial derivatives become small.
Determining number of hidden layers: Difficult to do (similar to choosing the degree in polynomial regression) because it reflects the degree of complexity of the model. There is a trade-off between training error and generalisation. Best to use cross-validation to see what works best.
Neural networks with a large number of hidden layers (deep networks) are again trained with backpropagation. They normally use ReLU functions instead of sigmoids because they train much faster. Because the networks are much larger than previously discussed, training time becomes a serious consideration.
Convolutional Networks/Layers
Specific structure designed to process data that comes in arrays (images, etc.). Convolutional layers apply a convolution operation over local subsets (patches) of the input, sharing the same weights across positions.
Pooling Layer
These layers are placed after convolutional layers and pool the results from the neighbouring units. This creates an invariance to small shifts and distortions (e.g. small translations of an image).
Week 10 – Kernel Machines
Discriminant-based: Don’t have to estimate densities. Discriminant is defined by support
vectors.
Margin: Distance from a discriminant to the closest data points (in both directions).
Support vector: Goal of a support vector machine is to maximise the margin. The support
vectors are the data points that limit the margin and hence define the discriminant (optimal
separating hyperplane).
Soft Margin Hyperplane: When the points are not linearly separable, a soft error function
allows for there to be classification errors.
Basis function: A function that transforms data into a higher-dimensional space. This can be used in maximising the margin and creating a non-linear discriminant, but it is quite computationally intensive to do.
Kernel function: A similarity function: the bigger its value, the more similar the two points are. It is the dot product in the higher-dimensional space.
Kernel Trick: Reduces computation by not requiring the use of the basis function. As such,
the higher dimensional transform can be avoided. Only need to know the kernel. This makes
kernel methods very adaptive.
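A small numerical check of the trick for the degree-2 polynomial kernel $k(x, z) = (x \cdot z)^2$, whose corresponding basis function for 2-D inputs is $\phi(x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$; the two points are arbitrary illustrative values:

```python
import numpy as np

def phi(x):                       # explicit map into the higher-dimensional space
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z))            # 1.0 -- dot product in the transformed space
print((x @ z) ** 2)               # 1.0 -- same value, computed in the input space
```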
Week 11 – Combined Learners/Ensembles
Comparing ML Algorithms
Bootstrapping
Sample with replacement to get datasets. Replacement means that multiple datasets can
contain the same data. Doing this multiple times can be used to infer characteristics (eg the
mean) of the estimator.
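A short numpy sketch of bootstrapping the sample mean; the data here is an illustrative stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # illustrative sample

boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()   # one bootstrap dataset
    for _ in range(1000)
])
print(boot_means.mean(), boot_means.std())   # estimate and its standard error
```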
Confusion matrices
Given labelled data and a classifier, a confusion matrix gives the breakdown of the predictive accuracy of the model. The diagonal entries indicate correct classifications, and the off-diagonal entries misclassifications.
Ensemble Learners
Use multiple learning algorithms to obtain better performance. It is a supervised approach that builds several base learners and combines them by voting (for classification) or weighting/averaging (for regression). Ensembles tend to perform better when the base learners are uncorrelated and diverse.
Bagging
An ensemble learning technique that uses bootstrapping to generate L training sets to train
L base-learners.
Boosting
Incrementally build an ensemble by training each model based on the previous models' misclassifications. May have better accuracy than bagging but is far more prone to over-fitting the training data. An example of boosting is the AdaBoost algorithm, whose steps are listed below and sketched in code after them.
1. Initialise all data points to have an equal probability of selection
2. Randomly select some number of data points for the sample based on their
probability
3. Train the learner on the selected subset
4. Test the learner on each data point in the large sample
5. Calculate error and the learner weight (Beta – this is based on how well it classified
the data).
6. Decrease the probability of each data point if correctly classified
7. Combine the learners, scaling by the log inverse of their error rate (higher
classification accuracy means heavier weight in the output)
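A sketch of these steps in the style of AdaBoost.M1, assuming decision stumps as base learners and labels in {-1, +1}; the base learner, data names and number of rounds are illustrative choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_learners=50, seed=0):
    n = len(X)
    p = np.full(n, 1.0 / n)                   # 1. equal selection probabilities
    learners, alphas = [], []
    rng = np.random.default_rng(seed)
    for _ in range(n_learners):
        idx = rng.choice(n, size=n, p=p)      # 2. sample points by probability
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])   # 3. train
        wrong = h.predict(X) != y             # 4. test on the full sample
        err = p[wrong].sum()                  # 5. weighted error
        if err >= 0.5 or err == 0:            # stop if no better than chance (or perfect)
            break
        beta = err / (1 - err)
        p[~wrong] *= beta                     # 6. shrink weights of correct points
        p /= p.sum()
        learners.append(h)
        alphas.append(np.log(1 / beta))       # 7. learner weight = log inverse error odds
    return learners, alphas

def predict(learners, alphas, X):
    votes = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(votes)
```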
Improving Ensembles
Try to get independence between the classifiers by either subset selection or metaclassifiers
(PCA).
Week 12 – Graphical Models/Bayesian Networks
Decision Trees
Supervised learning technique that provides a structure for classification based on specific
feature criteria. Benefit of transparency and interpretability (you can see exactly how the
classification decision was derived).
Impurity: A measure of how mixed the classes of the training examples reaching a node are; a pure node contains examples from only one class (common measures are entropy and the Gini index).
Random forests
An ensemble of decision trees, each trained with bagging (and typically with a random subset of features considered at each split), combined by voting; as noted earlier, this achieves very high classification accuracy.
Bayesian Networks
Number of entries in the joint/marginal table = product of the number of rows (values) of each variable in the conditionals. Commonly, the conditional independencies encoded in the Bayesian network reduce the table sizes needed for specifying the joint distribution.
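As a small illustrative example (three binary variables, assumed): the full joint table needs $2^3 - 1 = 7$ independent probabilities, whereas the chain $X \to Y \to Z$ with $P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid Y)$ needs only $1 + 2 + 2 = 5$.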
1. Head-to-tail ($X \to Y \to Z$)
$$P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid Y)$$
$$P(Z \mid X) = P(Z \mid Y)\,P(Y \mid X) + P(Z \mid \neg Y)\,P(\neg Y \mid X)$$
2. Tail-to-tail ($X \to Y$ and $X \to Z$)
$$P(X, Y, Z) = P(X)\,P(Y \mid X)\,P(Z \mid X)$$
3. Head-to-head ($X \to Z \leftarrow Y$)
$$P(X, Y, Z) = P(X)\,P(Y)\,P(Z \mid X, Y)$$
$$P(Z \mid X) = P(Z \mid X, Y)\,P(Y) + P(Z \mid X, \neg Y)\,P(\neg Y)$$
The above cases can be chained together to form more complex distributions. Bayes rule
can be used to find the reversed conditional probabilities.
Week 13 – Bayesian Inference
The main idea is that when creating a model, there is uncertainty surrounding the parameters used in the model. Bayesian inference works by updating the model as observations are made. It starts with a prior distribution over the parameters, which may be determined from guessing or through expert knowledge. When observations are made, the model is updated through Bayes' rule to obtain the posterior, giving a new adjusted distribution. The updated posterior is the new model, based on our original beliefs (the prior) and the observed data.
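A tiny sketch of such an update with a conjugate prior: a Beta(a, b) prior on a coin's probability of heads updated by observed flips (the prior values and data are illustrative assumptions):

```python
a, b = 2.0, 2.0                        # prior: weakly favours a fair coin
flips = [1, 0, 1, 1, 0, 1, 1, 1]       # observations (1 = heads)

heads, tails = sum(flips), len(flips) - sum(flips)
a_post, b_post = a + heads, b + tails  # posterior is Beta(a + heads, b + tails)

print(a_post / (a_post + b_post))      # posterior mean of P(heads) = 8/12, about 0.67
```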