Machine Learning Tutorial

Input Variables:  x = (x_1, x_2, ..., x_N)
Hidden Variables: h = (h_1, h_2, ..., h_K)
Output Variables: y = (y_1, y_2, ..., y_M)
When are ML algorithms NOT needed?

When the relationships between all system variables (input, output, and hidden) are completely understood!
This is NOT the case for almost any real system!
The Sub-Fields of ML

Supervised Learning
Reinforcement Learning
Unsupervised Learning
Supervised Learning

Given: training examples { (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P)) }
for some unknown function (system) y = f(x)

Find: f(x)
Predict: y = f(x), where x is not in the training set
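A minimal sketch of this setup in Python with scikit-learn (an assumption here; the tutorial itself works with Weka) on hypothetical toy data: fit a model on labeled pairs (x, f(x)), then predict y for an x that was not in the training set.

# Minimal supervised-learning sketch (hypothetical toy data, scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier

# Training examples (x_i, f(x_i)): two numeric features, binary label.
X_train = [[15, 0], [2, 1], [8, 0], [1, 1]]
y_train = [1, 0, 1, 0]

model = DecisionTreeClassifier()      # pick a learning algorithm
model.fit(X_train, y_train)           # learn an approximation of f

x_new = [[10, 0]]                     # an x not in the training set
print(model.predict(x_new))           # predicted y = f(x_new)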
Model, model quality
Definition: A computer program is said to learn
from experience E
with respect to some class of tasks T
and performance measure P,
if its performance at tasks in T, as measured by P, improves
with experience E.
Learned hypothesis: model of problem/task T
Model quality: accuracy/performance measured by P
Data / Examples / Sample / Instances

Data: experience E in the form of examples / instances
  characteristic of the whole input space
  representative sample
  independent and identically distributed (no bias in selection / observations)

Good example
  1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
  probably i.i.d.
  representative?
  if annotation is involved, it is always a question of compromises

Definitely bad example
  all abstracts that have John Smith as an author

Instances have to be comparable to each other
Data / Examples / Sample / Instances

Example: a set of queries and a set of top retrieved documents
(characterized via tf, idf, tf*idf, PRank, BM25 scores) for each
  try predicting relevance for reranking!
  the top retrieved set is dependent on the underlying IR system!
    issues with representativeness, but for reranking this is fine
  the characterization is dependent on the query (except PRank), i.e. only certain pairs
  (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)
    we have to normalize the features per query to have the same mean/variance (see the sketch below)
    we have to form pairs and compare e.g. the diff of feature values

Toy example:
  Q = learning,    rank 1: tf = 15,  rank 100: tf = 2
  Q = overfitting, rank 1: tf = 2,   rank 10:  tf = 0
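A sketch of the per-query normalization mentioned above, assuming pandas and using the toy tf values:

# Per-query feature normalization (z-scoring tf within each query);
# hypothetical numbers from the toy example above, pandas assumed.
import pandas as pd

df = pd.DataFrame({
    "query": ["learning", "learning", "overfitting", "overfitting"],
    "tf":    [15, 2, 2, 0],
})

# Normalize tf per query so scores become comparable across queries.
df["tf_norm"] = df.groupby("query")["tf"].transform(
    lambda s: (s - s.mean()) / s.std()
)
print(df)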
Features

The available examples (experience) have to be described to the algorithm in a consumable format.
Here: examples are represented as vectors of pre-defined features.
E.g. for credit risk assessment, typical features can be: income range, debt load,
employment history, real estate properties, criminal record, city of residence, etc.

Common feature types
  binary (criminal record, Y/N)
  nominal (city of residence, X)
  ordinal (income range, 0-10K, 10-20K, ...)
  numeric (debt load, $)
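A sketch of how one such example could be turned into a feature vector (plain Python; the feature names, vocabularies and encoding are hypothetical illustrations, not part of the tutorial):

# Sketch: turning one credit-risk example into a feature vector
# (hypothetical feature names and encoding).
example = {
    "criminal_record": False,         # binary
    "city_of_residence": "Darmstadt", # nominal
    "income_range": "10-20K",         # ordinal
    "debt_load": 12500.0,             # numeric
}

income_levels = ["0-10K", "10-20K", "20-50K"]   # ordinal scale
cities = ["Berlin", "Darmstadt", "Frankfurt"]   # nominal vocabulary

vector = (
    [1.0 if example["criminal_record"] else 0.0]
    + [1.0 if example["city_of_residence"] == c else 0.0 for c in cities]  # one-hot
    + [float(income_levels.index(example["income_range"]))]
    + [example["debt_load"]]
)
print(vector)   # [0.0, 0.0, 1.0, 0.0, 1.0, 12500.0]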
Machine Learning Tutorial
CB, GS, REC

Section 2
Experimental practice

By now you've learned what machine learning is; in the supervised approach you
need (carefully selected / prepared) examples that you describe through features;
the algorithm then learns a model of the problem based on the examples (usually
some kind of optimization is performed in the background); and as a result,
improvement is observed in terms of some performance measure.

Machine Learning Tutorial for the UKP lab
June 10, 2011
Model parameters

Two kinds of parameters
  one the user sets for the training procedure in advance: hyperparameter
    the degree of the polynomial to fit in regression
    number/size of hidden layers in a Neural Network
    number of instances per leaf in a decision tree
  one that actually gets optimized through the training: parameter
    regression coefficients
    network weights
    size/depth of the decision tree (in Weka; other implementations might allow you to control that)
  we usually do not talk about the latter, but refer to hyperparameters as parameters

Hyperparameters
  the fewer the algorithm has, the better
    Naive Bayes the best? No parameters!
    usually algorithms with better discriminative power are not parameter-free
  typically set to optimize performance (on a validation set, or through cross-validation)
    manual, grid search, simulated annealing, gradient descent, etc.
  common pitfall:
    select the hyperparameters via CV (e.g. 10-fold) and report the same cross-validation results;
    the reported performance is then optimistically biased (see the sketch below)
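A sketch of the safer setup, assuming scikit-learn and a standard toy dataset: tune via CV on the training part only, then report performance on a blind test split.

# Sketch: tune hyperparameters with CV on the training data only, then report
# performance on a held-out blind test set (scikit-learn assumed, toy data).
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid search over one hyperparameter (instances per leaf), 10-fold CV on the training set.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},
                      cv=10)
search.fit(X_train, y_train)

print("best hyperparameter:", search.best_params_)
print("CV accuracy (used for tuning, not the final result):", search.best_score_)
print("blind test accuracy:", search.score(X_test, y_test))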
Cross-validation, Illustration

[Figure: the data X = {x_1, ..., x_k} is split into 5 folds X_1, ..., X_5; in each
iteration one fold serves as the Test set and the remaining folds as the Train set.]

The result is an average over all iterations.
Cross-Validation

n-fold CV: common practice for making (hyper)parameter estimation more robust
  round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
  typical: random splits, without replacement (each instance is tested exactly once)
  the other way: random subsampling cross-validation

n-fold CV: common practice to report average performance, deviation, etc.
  "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet, 2004)
  bad practice? problem: training sets largely overlap, test errors are also dependent
    tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
  5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average: less overlap in training sets

Folding via natural units of processing for the given task
  typically, document boundaries; best practice is doing it yourself (see the sketch below)!
  the ML package / CSV representation is not aware of e.g. document boundaries!
  the PPI case
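One way to do such folding yourself is sketched below, assuming scikit-learn's GroupKFold and hypothetical document ids: all instances from one document end up in the same fold.

# Sketch: fold by natural units (here: document ids) so that no document is
# split across train and test (scikit-learn's GroupKFold, hypothetical data).
from sklearn.model_selection import GroupKFold

X      = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y      = [0, 1, 0, 1, 0, 1, 0, 1]
doc_id = ["d1", "d1", "d2", "d2", "d3", "d3", "d4", "d4"]

for fold, (train_idx, test_idx) in enumerate(
        GroupKFold(n_splits=4).split(X, y, groups=doc_id)):
    print(f"fold {fold}: train instances {train_idx}, test instances {test_idx}")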
Cross-Validation

Ideally, the valid settings are:
  take off-the-shelf algorithms, avoid parameter tuning and compare results, e.g. via cross-validation
    n.b. you probably do the folding yourself, trying to minimize biases!
  do parameter tuning (n.b. selecting/tuning your features is also tuning!)
    but then normally you have to have a blind set (from the beginning)
  e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn
  experimental best practice to align with the predefined standards (you might even
  benefit from comparative results, etc.)

You might want to do something different
  be aware of these & the consequences
The ML workflow

Common ML experimenting pipeline
1. define the task
   instance, target variable/labels, collect and label/annotate data
   credit risk assessment: 1 credit request, good/bad credit, ~s ran out in the previous year
2. define and collect/calculate features, define train / validation (development) ((test!)) / test (evaluation) data
3. pick a learning algorithm (e.g. decision tree), train a model
   train on the training set
   optimize/set model hyperparameters (e.g. number of instances / leaf, use pruning, ...) according to performance on the validation data
   cross-validation: use all training data as validation data
   test model accuracy on the (blind) test set
4. ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set
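To make steps 2-4 concrete, here is a sketch in Python with scikit-learn (an assumption; the tutorial uses Weka). The dataset, split sizes and hyperparameter grid are hypothetical choices, not the tutorial's recipe.

# Sketch of steps 2-4: train / validation / test split, pick the hyperparameter
# on the validation set, report accuracy on the blind test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_leaf, best_acc = None, 0.0
for leaf in [1, 2, 5, 10, 20]:                       # hyperparameter: instances per leaf
    model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    model.fit(X_train, y_train)
    acc = model.score(X_val, y_val)                  # validation accuracy
    if acc > best_acc:
        best_leaf, best_acc = leaf, acc

final = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=0)
final.fit(X_train, y_train)
print("expected accuracy on unseen data:", final.score(X_test, y_test))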
Try this in Weka

=== Run information ===
Relation:   segment
Instances:  1500
Attributes: 20
Test mode:  split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances     290    96.6667 %
Incorrectly Classified Instances    10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances     281    93.6667 %
Incorrectly Classified Instances    19     6.3333 %
Model complexity

Fitting a polynomial regression:

  y(x) = Σ_{j=0}^{M} a_j x^j

By, for instance, least squares:

  a* = argmin_a Σ_{n=1}^{l} ( y(x_n) - t_n )^2

[Figure: fits of polynomials of degree M = 0, 1, 3 and 9 to the same data points (x, t).]
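A sketch of the same experiment with numpy (assumed available); the data and noise level are hypothetical, but it shows how the training error shrinks as the degree M grows.

# Sketch: least-squares polynomial fits of increasing degree M on noisy data;
# a low M underfits, a very high M fits the training points almost perfectly.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # noisy targets

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)          # least-squares fit of degree M
    train_err = np.mean((np.polyval(coeffs, x) - t) ** 2)
    print(f"M = {M}: training MSE = {train_err:.4f}")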
Data size and model complexity

Important concept: discriminative power of the algorithm
  linear vs. nonlinear model
  some theoretical aspects:
    a 1-hidden-layer NN with an unlimited number of hidden nodes can approximate any smooth function/surface arbitrarily well
Data size and model complexity

Overfitting: the model perfectly learns to classify the training data, but
has no (or bad) generalization ability
  results in high test error (useless model)
  typical for small sample sizes and powerful models

Underfitting: the model is not capable of learning the (complex) patterns in the training set

Reasons for underfitting and overfitting:
  lack of discriminative power
  small sample size
  noise in the data /labels or features/
  generalization ability of the algorithm has to be chosen w.r.t. sample size

Size (complexity) of the learnt model
  grows with data size
  if the data is consistent, this is OK
Predictions: Confusion matrix

TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n

Good prediction: TP + TN
Error: FP (false alarm) + FN (miss)
Evaluation measures

Accuracy
  The rate of correct predictions made by the model over a data set (cf. coverage).
  (TP+TN) / (TP+FN+FP+TN)

Error rate
  The rate of incorrect predictions made by the model over a data set (cf. coverage).
  (FP+FN) / (TP+FN+FP+TN)

[Root]?[Mean|Absolute][Squared]?Error
  The difference between the predicted and actual values, e.g.

    RMSE = sqrt( (1/n) Σ_i ( f(x_i) - y_i )^2 )

Algorithms (e.g. those in Weka) typically optimize these
  there might be a mismatch between the optimization objective and the actual evaluation measure
  optimizing different measures is research on its own (e.g. in ML for IR, a.k.a. learning to rank)
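As a quick sketch, the three measures computed in plain Python on hypothetical labels:

# Sketch: accuracy, error rate and RMSE from predictions (hypothetical toy labels).
import math

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = (fp + fn) / (tp + tn + fp + fn)
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(accuracy, error_rate, rmse)   # 0.667, 0.333, 0.577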
Evaluation measures

Precision
  Fraction of correctly predicted positives among all predicted positives
  TP / (TP+FP)

Recall
  Fraction of correctly predicted positives among all actual positives
  TP / (TP+FN)

(TP: p classified as p, FP: n classified as p, TN: n classified as n, FN: p classified as n)

F measure
  weighted harmonic mean of Precision and Recall (usually equally weighted, β = 1)

    F_β = (1 + β^2) * precision * recall / (β^2 * precision + recall)

  Only makes sense for a subset of classes (usually measured for a single class)
  For all classes, it equals the accuracy
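A small sketch of these formulas in plain Python (the counts are hypothetical):

# Sketch: precision, recall and F-beta from TP/FP/FN counts.
def f_beta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(f_beta(tp=8, fp=2, fn=4))   # precision 0.8, recall 0.667, F1 ~0.727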
Evaluation measures

Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
  A sequence of tokens with the same label is treated as a single instance

Gold standard:
  John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG .

Why? We need complete phrases to be identified correctly
How? With an external evaluation script, e.g. conlleval for NER

Example tagging:
  John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG .

Multiple penalty:
  3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5
Loss types

1. The real loss function, given to us by the world. Typically involves notions of money saved,
   time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance
   assessments, etc. We can perform these evaluations, but they are slow and costly. They
   require humans in the loop.
3. Automatic correlation-driven functions. Typical examples are Bleu, Rouge, word error rate,
   mean average precision. These require humans at the front of the loop, but after that are
   cheap and quick. Typically some effort has been put into showing correlation between these
   and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), F-score (for
   parsing, chunking and named-entity recognition), alignment error rate (for word alignment)
   and perplexity (for language modeling). These also require humans at the front of the loop,
   but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4)
become dysfunctional when you are optimizing them!
  phrase P/R/F, e.g. in NER
  Readability measures
Evaluation measures

Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.

Gold standard:
  John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG .

Example tagging 1:
  John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG .
  3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5

Example tagging 2:
  John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG .
  3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  0 FPs
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 1.0, F(ORG) = 0.67

Optimizing phrase-F can encourage / prefer systems that do not mark entities!
  most likely, this is bad!!
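A sketch of how phrase-level TP/FP/FN can be counted from per-token tags (plain Python; the tag sequences encode Example tagging 1 above with punctuation omitted, and the span-extraction helper is a hypothetical illustration, not conlleval itself):

# Sketch: phrase-level TP/FP/FN from token tags; consecutive tokens with the
# same non-O label form one phrase (span start, span end, label).
def phrases(tags):
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):           # sentinel to close the last span
        if start is not None and (tag == "O" or tag != tags[start]):
            spans.add((start, i, tags[start]))
            start = None
        if tag != "O" and start is None:
            start = i
    return spans

gold = ["PER", "O", "O", "O", "ORG", "ORG", "ORG", "O", "O", "ORG"]
pred = ["PER", "O", "O", "O", "PER", "PER", "ORG", "O", "O", "ORG"]

tp = len(phrases(gold) & phrases(pred))
fp = len(phrases(pred) - phrases(gold))
fn = len(phrases(gold) - phrases(pred))
print(tp, fp, fn)   # 2 true positives, 2 false positives, 1 false negative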
ROC curve

ROC: Receiver Operating Characteristic curve
  Curve that depicts the relation between recall (sensitivity) and false positives (1 - specificity)

[Figure: ROC curves for a best case and a worst case; x-axis: False Positives FP / (FP+TN); y-axis: Recall (Sensitivity).]
Evaluation measures

Area under the ROC curve, AUC
  As you vary the decision threshold, you can plot the recall vs. the false positive rate
  The area under the curve measures how accurately your model separates positives from negatives
    perfect ranking: AUC = 1.0
    random decision: AUC = 0.5

Similarly (e.g. in IR): area under the P/R curve
  when there are too many (true) negatives
  correctly identifying negatives is not interesting anyway
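A minimal sketch, assuming scikit-learn and hypothetical scores:

# Sketch: AUC from predicted scores (scikit-learn assumed, hypothetical data).
from sklearn.metrics import roc_auc_score

y_true  = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # model's confidence in the positive class

print(roc_auc_score(y_true, y_score))      # ~0.889: most positives rank above negatives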
Evaluation measures (Ranking)

Precision @ K
  number of true positives in the top K predictions / ranks

MAP (Mean Average Precision)
  The average of precisions computed at the rank of each of the positives in the
  ranked list (P = 0 for positives not ranked at all)

NDCG (Normalized Discounted Cumulative Gain)
  For graded relevance / ranking
  Highly relevant documents appearing lower in a search result list should be
  penalized, as the graded relevance value is reduced logarithmically, proportional
  to the position of the result.
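A sketch of Precision@K and (per-query) Average Precision in plain Python, on a hypothetical ranked list; MAP is then the mean of the per-query values.

# Sketch: Precision@K and Average Precision for one ranked list
# (1 = relevant, 0 = not relevant, hypothetical ranking).
def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def average_precision(rels, n_relevant_total):
    hits, ap = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            ap += hits / i            # precision at this positive's rank
    return ap / n_relevant_total      # positives never retrieved contribute P = 0

ranked = [1, 0, 1, 1, 0]              # relevance of the ranked results, top first
print(precision_at_k(ranked, 3))      # 0.667
print(average_precision(ranked, n_relevant_total=4))  # one relevant doc not retrieved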
Learning curve

Measures how the
  accuracy
  error
of the model changes with
  sample size
  iteration number

Smaller sample
  worse accuracy
  more likely bias in the estimate (representative sample)
  variance in the estimate

[Figure: typical learning curve.]

If it looks different:
  you are plotting error vs. size/iteration
  you are doing something wrong!
  overfitting (iteration, not sample size)!
Data or Algorithm?

Compare the accuracy of various machine learning algorithms with a
varying amount of training data (Banko & Brill, 2001):
  Winnow
  perceptron
  naive Bayes
  memory-based learner

Features:
  bag of words: words within a window of the target word
  collocations containing specific words and/or part of speech

Training corpus: 1 billion words from a variety of English texts
(news articles, literature, scientific abstracts, etc.)
Take home messages (up until now)

Supervised learning: based on a set of labeled examples (x, f(x)), learn the
input-output mapping, i.e. f(x)

3 factors of successful machine learning models
  much data
  good features
  well-suited learning algorithm

ML workflow
  1. problem definition
  2. feature engineering; experimental setup /train, validation, test/
  3. selection of learning algorithm, (hyper)parameter tuning, training a final model
  4. predict unseen examples & fill tables / draw figures for the paper: test

Careful with
  data representation (i.i.d., comparability, ...)
  experimental setup (cross-validation, blind testing, ...)
  data size and algorithm selection (+ overfitting, underfitting, ...)
  evaluation measures