CS-431: INTRODUCTION TO NATURAL LANGUAGE PROCESSING
Exercises with solutions
J.-C. Chappelier & M. Rajman
(version 202110-1)
Contents
1 NLP levels
2 Evaluation
3 Tokenization/Lexicons/n-grams
4 Out-of-Vocabulary forms
5 Morphology
6 Part-of-Speech tagging
8 Parsing (CYK)
10 Stochastic Parsing
11 Text Classification
12 Information Retrieval
14 Lexical Semantics
1 NLP levels
Exercise I.1
A company active in automatic recognition of hand-written documents needs to improve the quality
of their recognizer. This recognizer produces sets of sequences of correct English words, but some of
the produced sequences do not make any sense. For instance the processing of a given hand-written
input can produce a set of transcriptions like: "A was salmon outer the does", "It was a afternoon
nice sunny", and "I Thomas at mice not the spoon".
What is wrong with such sentences? NLP techniques of what level might allow the system to select
the correct one(s)? What would be the required resources?
Solution
Those sentences are not "grammatically" (syntactically) correct. They should be filtered out at the
syntactic level using a (phrase-structure) grammar.
2 Evaluation
Exercise II.1
① Give some arguments justifying why evaluation is especially important for NLP. In particular,
explain the role of evaluation when a corpus-based approach is used.
② Many general evaluation metrics can be considered for various NLP tasks. The simplest one
is accuracy.
Give several examples of NLP tasks for which accuracy can be used as an evaluation metric.
Justify why.
In general, what property(ies) must an NLP task satisfy in order to be evaluable through accuracy?
④ What is the formal relation between accuracy and the error rate? In which case would you
recommend using one rather than the other?
⑥ Another very general evaluation framework concerns the kind of NLP tasks where the goal
of the system is to propose a set of outputs among which some might turn out to be correct,
while others might not (e.g. Information Retrieval (IR)). In this type of situation, the standard
evaluation metrics are Precision and Recall.
Give the formal definitions of Precision and Recall and indicate some examples of NLP tasks
(other than IR) that can be evaluated with the Precision/Recall metrics.
[Two Precision/Recall curves (Precision on the vertical axis, Recall on the horizontal axis, both from 0 to 1) are shown here; the figure is not reproduced in this text version.]
⑦ What conclusions can one derive from such curves? Provide a detailed interpretation of the
results.
⑧ It is often desirable to be able to express the performance of an NLP system in the form of one
single number, which is not the case with Precision/Recall curves.
Indicate what score can be used to convert a Precision/Recall performance into a unique num-
ber. Give the formula for the corresponding evaluation metric, and indicate how it can be
weighted.
⑨ Give well chosen examples of applications that can be evaluated with the single metric derived
from Precision/Recall and illustrate:
Solutions
① a few hints:
② (a) PoS tagging, but also Information Retrieval (IR), Text Classification, Information Extraction. For the latter, accuracy sounds like precision (but it depends on what we actually mean by "task" (vs. subtask)).
(b) a reference must be available, and "correct" and "incorrect" must be clearly defined
④ (a) err = 1 − acc. (b) the question does not really arise: they carry the same information (one being the complement of the other)
⑥ see lecture/slides
⑦ explain what the axes are; the higher the curve, the better: the ideal is the (theoretical) top right corner; the curves are decreasing by construction; the left corpus is certainly much bigger than the right one (faster decrease, very low recall).
⑨ Precision is preferred when a very large amount of data is available and only a few well-chosen results are enough: we want to have those very early, e.g. Web search.
Recall is preferred when having all the correct documents is important (implying that, if we want to handle them, they are not that many). Typically in legal situations.
Exercise II.2
You have been hired to evaluate an email monitoring system aimed at detecting potential security
issues. The targeted goal of the application is to decide whether a given email should be further
reviewed or not.
① Give four standard measures usually considered for the evaluation of such a system. Explain
their meaning. Briefly discuss their advantages/drawbacks.
• accuracy / error rate / "overall performance": number of correct/incorrect over total number; adv: simple; drawback: too simple, does not take class imbalance into account
• Precision (for one class): number of correctly classified emails over number of emails classified in that class by the system; ignores false negatives; can be biased by classifying only very few highly trusted emails
• Recall / true positive rate: number of correctly classified emails over number of emails classified in that class by the experts (in the referential); ignores false positives; can be biased by classifying all documents in the most important class
• Area under ROC curve; plots the true positive rate vs. the false positive rate; not easy to compute
• F score: harmonic mean of precision and recall; balances P and R; too simple: a single score for a complex situation
• false positive rate
② For three of the measures you mentioned in the previous question, what are the corresponding
scores for a system providing the following results:
The main point here is to discuss WHAT to compute: we know neither what C1 nor what C2 is.
So we have to compute either an overall score (not very good) or scores FOR EACH class.
The confusion matrix is (rows: reference, columns: system output):

              system C1   system C2
   ref. C1        5           3
   ref. C2        2           4
from where we get: accuracy = 9/14, thus overall error = 5/14
P/R for C1: P = 5/7, R = 5/8
P/R for C2: P = 4/7, R = 4/6
(note that "overall P and R" do not make any sense here and are equal to the accuracy)
C1: FPR = 2/6 = 1/3, FNR = 3/8 (and vice-versa for C2)
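As a quick cross-check of these numbers, here is a minimal Python sketch recomputing the accuracy and the per-class Precision/Recall from the confusion matrix above (the class names are just labels):

```python
# Confusion matrix from above: keys are (reference class, system class).
confusion = {("C1", "C1"): 5, ("C1", "C2"): 3,
             ("C2", "C1"): 2, ("C2", "C2"): 4}

total = sum(confusion.values())
accuracy = sum(v for (ref, sys), v in confusion.items() if ref == sys) / total
print(f"accuracy = {accuracy:.3f}")   # 9/14 ~ 0.643

for c in ("C1", "C2"):
    tp = confusion[(c, c)]
    predicted = sum(v for (_, sys), v in confusion.items() if sys == c)   # column sum
    actual = sum(v for (ref, _), v in confusion.items() if ref == c)      # row sum
    print(f"{c}: P = {tp}/{predicted} = {tp/predicted:.3f}, "
          f"R = {tp}/{actual} = {tp/actual:.3f}")
```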
③ You have been given the results of three different systems that have been evaluated on the same
panel of 157 different emails. Here are the classification errors and their standard deviations:

            system 1   system 2   system 3
error         0.079      0.081      0.118
std dev       0.026      0.005      0.004

Which system would you recommend? Why?
system 2: the error is the first criterion; then, for statistically non-significant differences in error (which is the case for systems 1 and 2), the smaller standard deviation is better (especially with such a big difference as here!)
[Figure (legend: system 1, system 2, system 3; horizontal axis from 0.02 to 0.16, vertical axis from 0 to 1) not reproduced here.]
④ Optional (too advanced for the current version of the course): What should be the minimal
size of a test set to ensure, at a 95% confidence level, that a system has an error 0.02 lower
(absolute difference) than system 3? Justify your answer.
We could consider at least two approaches here: either a binomial confidence interval or a t-test.
• binomial confidence interval approach: requiring the 95% half-width on the error estimate to be 0.02 gives

(0.02)² = (1.96)² · 0.118 · (1 − 0.118) / T

Thus T ≈ 1000. (A short numerical check of this computation is sketched right after this solution.)
• t-test approach: let's consider estimating their relative behaviour on each of the test
cases (i.e. each test estimation subset is of size 1). If the new system has an error of 0.098
(= 0.118 − 0.02), it can differ from system 3 on between 0.02 of the test cases (both systems
almost always agree except where the new system improves the results) and 0.216 of the test
cases (the two systems never make their errors on the same test case, so they disagree
on 0.118 + 0.098 of the cases). Thus the µ of the t-test is between 0.02 and 0.216, and
s = 0.004 (by assumption, same variance).
Thus t is between 5√T and 54√T, which is already bigger than 1.645 for any T bigger
than 1. So this doesn't help much.
So all we can say is that if we want to be able to observe a (lowest possible) difference of 0.02 we
should have at least 1/0.02 = 50 test cases ;-) And if we consider that we have a 0.216
difference, then we need at least 5 test cases...
The reason why these numbers are so low is simply because we here make strong assumptions
about the test setup: that it is a paired evaluation. In such a case, having a
difference (0.02) that is 5 times bigger than the standard deviation is always statistically
significant at a 95% level.
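Here is the short numerical check announced in the binomial bullet above (1.96 is the 95% two-sided normal quantile, 0.118 the error rate of system 3):

```python
import math

p, delta, z = 0.118, 0.02, 1.96
# (delta)^2 = z^2 * p * (1 - p) / T   =>   T = z^2 * p * (1 - p) / delta^2
T = z ** 2 * p * (1 - p) / delta ** 2
print(math.ceil(T))   # ~ 1000
```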
3 Tokenization/Lexicons/n-grams
Exercise III.1
According to your knowledge of English, split the following sentence into words and punctuation:
M. O’Connel payed $ 12,000 (V.T.A. not included) with his credit card.
Which of these words won’t usually be in a standard lexicon? Justify your answer.
How would you propose to go from tokens to words? (propose concrete implementations)
Solution
words and punctuation: M. O’Connel payed $12,000 ( V.T.A. not included ) with his credit card .
Usually not in a lexicon because hard to lexicalize (too many hard-to-predict occurrences): O’Connel,
$12,000
“O’Connel” could be in some lexicon of proper names (but not so usual), or recognized by some
NER (Named-Entity Recognizer).
“$12,000” could be in some lexicon making use of regular expressions (e.g. a FSA), but this is also
not so usual unless making use of some (other) NER.
• agglutinating several (consecutive) tokens when the resulting word is in our lexicon
• doing so, it would be good to keep all possible solutions, for instance in the compact form of
a graph/lattice; for instance, both readings of "credit card" (the two separate tokens "credit"
and "card", and the single compound word "credit card") would be kept in the lattice
• add our own ad-hoc rules, e.g. M + period + whitespace + proper name/unknown token with a
capital letter −→ proper noun
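A minimal sketch of the token-agglutination idea, here in a greedy longest-match version that keeps a single solution rather than the full lattice described above; the mini-lexicon is a made-up illustration:

```python
# Hypothetical mini-lexicon; a real one would contain far more entries.
LEXICON = {"credit", "card", "credit card", "with", "his"}

def merge_tokens(tokens, lexicon, max_len=3):
    """Greedily agglutinate consecutive tokens whose concatenation is in the lexicon."""
    words, i = [], 0
    while i < len(tokens):
        # Try the longest possible multi-token word first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon or n == 1:
                words.append(candidate)
                i += n
                break
    return words

print(merge_tokens(["with", "his", "credit", "card"], LEXICON))
# ['with', 'his', 'credit card']
```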
Exercise III.2
• How many different bigrams of characters (including whitespace) do you have in that corpus?
• Considering only lowercase alphabetical and whitespace, how many bigrams are possible?
• What are the parameters of a bigram model using the same set of characters (lowercase alpha-
betical and whitespace)?
• What is the probability of the following sequences, if the parameters are estimated using MLE
(maximum-likelihood estimation) on the above corpus (make use of a calculator or even a
short program):
– cutthechat
– cut the chat
• What is the probability of the same sequences, if the parameters are estimated using Dirichlet
prior with α having all its components equal to 0.05?
Fully justify your answer.
Solution
• there are 12 different bigrams (denoting here the whitespace with ’X’ to better see it): Xc, Xh,
Xt, at, ca, cu, eX, ha, he, tX, th, ut,
• the corpus being 19 characters long, there are 18 bigrams in total. Here are the counts Xc, 2;
Xh, 1; Xt, 1; at, 2; ca, 1; cu, 1; eX, 2; ha, 1; he, 2; tX, 2; th, 2; ut, 1
• the parameters are all the 729 probabilities of the 729 possible bigrams ('X' = whitespace): P(XX), P(Xa),
P(Xb), ..., P(aa), P(ab), ..., P(zz).
• Using MLE, the probabilities of the observed bigrams are proportional to their number of occurrences:
Xc: 2/18; Xh: 1/18; Xt: 1/18; at: 2/18; ca: 1/18; cu: 1/18; eX: 2/18; ha: 1/18; he: 2/18; tX: 2/18; th: 2/18; ut: 1/18,
and all the others are 0.
Thus the probability of any sequence containing an unseen bigram is 0 (as a product of terms,
at least one of which is 0), which is the case for both sequences (the bigram 'ch' was never seen).
• With a Dirichlet prior with parameter α = (0.05, ..., 0.05) (one component per possible bigram, i.e. 729 of them),
each observed bigram gets an extra 0.05 added to its count and the denominator is augmented by
729 × 0.05 = 36.45, leading thus to: Xc: 2.05/54.45; Xh: 1.05/54.45; Xt: 1.05/54.45; at: 2.05/54.45;
ca: 1.05/54.45; cu: 1.05/54.45; eX: 2.05/54.45; ha: 1.05/54.45; he: 2.05/54.45; tX: 2.05/54.45;
th: 2.05/54.45; ut: 1.05/54.45,
and all the unseen bigrams have a probability of 0.05/54.45.
The probability of the two sequences can then be computed using these smoothed estimates. For the
conditional probabilities we need marginals such as

P(u) = ∑_y P(uy) = P(ut) + 26 × 0.05/54.45 = (1.05 + 26 × 0.05)/54.45 = 2.35/54.45 ≈ 4.32%

and similarly P(t) = 5.35/54.45, P(X) = 5.35/54.45, P(h) = 4.35/54.45, P(e) = 3.35/54.45,
P(c) = 3.35/54.45 and P(a) = 3.35/54.45. Then, for instance:

P(cutXtheXchat) = P(cu) · P(ut)/P(u) · P(tX)/P(t) · P(Xt)/P(X) · P(th)/P(t) · P(he)/P(h) · P(eX)/P(e) · P(Xc)/P(X) · P(ch)/P(c) · P(ha)/P(h) · P(at)/P(a)
= (1.05/54.45) · (1.05/2.35) · (2.05/5.35) · (1.05/5.35) · (2.05/5.35) · (2.05/4.35) · (2.05/3.35) · (2.05/5.35) · (0.05/3.35) · (1.05/4.35) · (2.05/3.35)
≈ 6.1 × 10⁻⁸
Notice however that the two sequences do not have the same length, so their probabilities should
not be compared without a minimum amount of care. But in this case, since the probability of
the shorter one is smaller than the probability of the longer one, the comparison is conclusive: the longer is definitely
the better (since, for instance, any substring of length 10 (= the length of the shorter) of the longer
will be more probable than the shorter).
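The smoothed estimates above can be checked with a short program. The counts are exactly the ones listed in the solution (the training corpus itself is not reproduced here), α = 0.05, and the alphabet has 27 symbols, with 'X' standing for the whitespace:

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyzX"   # 27 symbols, 'X' = whitespace
ALPHA = 0.05
counts = Counter({"Xc": 2, "Xh": 1, "Xt": 1, "at": 2, "ca": 1, "cu": 1,
                  "eX": 2, "ha": 1, "he": 2, "tX": 2, "th": 2, "ut": 1})
N = sum(counts.values())                       # 18 observed bigrams
denom = N + ALPHA * len(ALPHABET) ** 2         # 18 + 36.45 = 54.45

def p_bigram(bg):                              # smoothed joint probability of a bigram
    return (counts[bg] + ALPHA) / denom

def p_first(c):                                # marginal probability of the first character
    return sum(p_bigram(c + y) for y in ALPHABET)

def p_sequence(s):                             # bigram chain rule
    s = s.replace(" ", "X")
    p = p_bigram(s[:2])
    for i in range(1, len(s) - 1):
        p *= p_bigram(s[i:i + 2]) / p_first(s[i])
    return p

print(p_sequence("cut the chat"))   # ~ 6e-8, as computed above
print(p_sequence("cutthechat"))     # shorter sequence, containing unseen bigrams
```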
4 Out-of-Vocabulary forms
Exercise IV.1
Consider an NLP application that needs to measure the edit distance between words using the chart-
based algorithm.
Provide the filled data structure resulting from the application of the algorithm to the pair “easy” and
“tease”. Briefly justify your answer.
Solution:
To compute the edit distance, we first have to define the set of transformations used. Let’s here
consider insertion, deletion and substitution (transposition does not make any difference in this very
example).
e a s y
0 1 2 3 4
t 1 1 2 3 4
e 2 1 2 3 4
a 3 2 1 2 3
s 4 3 2 1 2
e 5 4 3 2 2
Each cell contains the edit distance between the corresponding prefixes of the two strings.
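A minimal implementation of this chart-based computation (insertion, deletion and substitution, all at cost 1, as in the table above):

```python
def edit_distance(source, target):
    """Levenshtein distance, returning the full DP chart as in the table above."""
    n, m = len(source), len(target)
    chart = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        chart[i][0] = i                      # i deletions
    for j in range(m + 1):
        chart[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            chart[i][j] = min(chart[i - 1][j] + 1,         # deletion
                              chart[i][j - 1] + 1,         # insertion
                              chart[i - 1][j - 1] + cost)  # substitution / match
    return chart

chart = edit_distance("tease", "easy")
for row in chart:
    print(row)
print("distance:", chart[-1][-1])   # 2, the bottom-right cell of the table above
```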
5 Morphology
Exercise V.1
① Briefly describe the specific objectives of the morphological module in the general perspective
of automated Natural Language Processing.
Solution: The purpose of morphology is to study the internal structure and the variability
of the words in a language, e.g. verbal conjugations, plurals, nominalizations, ...
② What are the different types of morphologies that can be considered? Briefly describe the main
differences between them.
Solution: inflectional morphology: no change in the grammatical category (e.g. give, given,
gave, gives);
derivational morphology: change of category (e.g. process, processing, processable, processor,
processability)
③ For what type of languages is concatenative morphology well suited? Are there other types of
approaches to morphology? For what languages?
Solution: for languages where only prefixes and suffixes are used.
More complex languages can involve infixes (e.g. Tagalog, Hebrew) or circumfixes (e.g. German).
Pattern-based morphology should then be used.
Exercise V.2
② Provide a precise definition of concatenative morphology and illustrate your answer with concrete
examples in English or French.
Is this type of morphology relevant for all languages? More generally, is morphology of the
same complexity for all languages?
Solution: concatenative morphology uses roots, prefixes and suffixes only.
Example: in-cred-ible: in–: prefix, cred: root, –ible: suffix.
Concatenative morphology is not relevant for all languages: more complex languages can involve
infixes (e.g. Tagalog, Hebrew) or circumfixes (e.g. German). Pattern-based morphology should
then be used.
The complexity of the morphology can vary a lot between languages: it can be as simple as in Spanish,
or as hard as in Turkish, Hebrew or Tagalog...
③ Give some concrete examples of NLP applications that can benefit from some form of morphological
processing. Justify your answer.
In the specific case of Information Retrieval (IR), explain what can be done if a full fledged
morphological analyzer is not available. What consequence do you expect this would have on
the performance of the IR system?
Solution: any application that needs the structure of words either to understand them (e.g.
translation) or to simplify them (e.g. Information Retrieval, Text Classification, ...).
For IR, a stemmer can be used if nothing better is available. The overall performance (e.g. precision/recall)
might change but it is difficult to predict in which direction (it depends on the relative
quality of the modules used); however, processing time might be shorter.
④ Provide a formal definition of a transducer. Give some good reasons to use such a tool for
morphological processing.
Solution: It is a (Deterministic) Finite-State Automaton over character pairs (the cross-product of
two alphabets).
It is the most time- and space-efficient implementation for matching two languages (a cross-product
of languages), which is the purpose of morphology.
6 Part-of-Speech tagging
Exercise VI.1
computers N 0.123
process N 0.1
process V 0.2
programs N 0.11
programs V 0.15
accurately Adv 0.789
Solutions
i.e.:
Tagging obtained (not corresponding to the one expected by an average English reader ;-) ):
Exercise VI.2
We aim at tagging English texts with “Part-of-Speech” (PoS) tags. For this, we consider using the
following model (partial picture):
...some picture...
① What kind of model (of PoS tagger) is it? What assumption(s) does it rely on?
② What are its parameters? Give examples and the appropriate name for each.
③ With this lexicon, how many different PoS taggings does this sentence have? Justify your
answer.
④ What (formal) parameters make the difference in the choice of these different PoS taggings
(for the above model)?
Give the explicit mathematical formulas of these parts that are different.
⑤ Assume that the following tagging is produced:
my/PRP$ daughter/NN whose/WP$ first/JJ adult/JJ tooth/NN has/VBZ just/RB developed/VBN
programs/NNS
How is it possible? Give an explanation using the former formulas.
Solutions
① This is an HMM of order 1 (well, the picture is actually a part of a Markov chain; the "hidden"
part will be provided by the emission probabilities, i.e. the lexicon).
An HMM relies on two assumptions (see course): limited lexical conditioning (P(wi | ...Ci ...) = P(wi | Ci))
and limited scope of syntactic dependencies (P(Ci | C1 ...Ci−1) = P(Ci | Ci−k ...Ci−1)).
③ The possible tags for each word are:

my          PRP$
daughter    NN
whose       WP$
first       JJ RB
adult       JJ NN
tooth       NN
has         VBZ
just        RB
developed   VBN VBD
programs    NNS VBZ

i.e. 2 × 2 × 2 × 2 = 16 different taggings.
Examples:
④ On one hand:
P(X|WP$) · P(first|X) · P(Y|X) · P(adult|Y) · P(NN|Y)
for X either "JJ" or "RB" and Y either "JJ" or "NN", and on the other hand the corresponding products for the tags of "developed" and "programs".
NOTICE: do not forget the right-hand part of each tag, e.g. for "adult", not only P(NN|RB) (for instance),
but also P(NN|NN) for the transition to "tooth".
⑤ The proposed tagging is obtained whenever
P(JJ|WP$) · P(first|JJ) · P(JJ|JJ) · P(adult|JJ) · P(NN|JJ) · P(VBN|RB) · P(developed|VBN) · P(NNS|VBN) · P(programs|NNS)
is bigger than any other of the products for the same parts, which is possible (e.g. each term bigger
than any corresponding other, or even one much bigger than all the other products, etc.).
Exercise VI.3
② Assume that you have to quickly search for the existence of given {word , part-of-speech}
pairs within the set of all the English words associated with their part(s)-of-speech. Which
data structure(s) would you use if memory is an issue?
③ What are the two main methods used for PoS tagging?
What are their main differences?
④ Assume that the texts to be tagged contain unknown words, which are either capitalized words,
or spelling errors, or simply general common words not seen during the learning. Almost all
capitalized words correspond to proper nouns, and most of the spelling-errors correspond to
words already in the lexicon (only a few of the spelling errors correspond to words not seen
during the learning).
How would you handle such a situation in a concrete NLP application (that uses a PoS tagger)?
Make your solution(s) explicit.
⑤ Assume that the texts to be tagged contain 1.5% of unknown words and that the performance
of the tagger to be used is 98% on known words.
What will be its typical overall performance in the following two situations:
Provide both a calculation (a complete formula but not necessarily the final numerical result)
and an explanation.
Solutions
① The problem addressed by a PoS tagger is to assign part-of-speech tags (i.e. grammatical
roles) to words within a given context (sentence, text).
This task is not trivial because of lexical ambiguity (words can have multiple grammatical
roles, e.g. can/N, can/V) and out-of-vocabulary forms (i.e. unknown words).
Lexical ambiguity is not trivial to handle because it leads to an exponential number of possible
solutions w.r.t. the sentence length.
Unknown words are not trivial because we have to decide how to cope with them, which often
involves high-level linguistic features (and compromises to be made). This is the role of the
"guesser".
② Finite-State Transducers seem really appropriate for this task under the memory consumption
constraint, since they are the optimal representation of paired regular languages.
Another possible solution, however, could be to build the FSA of words, use it to map words
to numbers and then associate a table of lists of PoS tags (maybe also represented in the form
of numbers through another FSA).
It is not clear how the overheads of each implementation would compare to one another in a real
implementation.
(The second proposition is actually one possible implementation of the corresponding FST.)
③ The two main methods presented in the course for PoS tagging are Brill's tagger and Hidden
Markov Models.
Brill's tagger is rule-based whereas HMMs are a probabilistic model.
HMMs can handle unsupervised learning whereas Brill's tagger requires supervision.
Brill's tagger has an integrated guesser (through "lexical rules") whereas HMMs require an
external (or ad-hoc) treatment of OoV forms.
Brill's tagger might be a bit more linguistically oriented in the sense that the applied rules
can be explained and that, in principle, new rules, understandable by a human, might be
introduced into the system (although this is not so easy and not recommended).
④ The first idea is to tag capitalized words as proper nouns (which is actually what Brill's tagger does,
by the way).
Then we would like to cope with spelling errors as much as possible. This is hard in a completely
autonomous manner because there might be several corrections for a real spelling error, but also
because there might be some possible "correction" for unknown words that actually correspond to correct words
simply unseen during the learning. The idea is thus to use a low threshold for the spelling-error
correction and to keep all possible tags for all possible solutions in case of ambiguity, then letting
the tagger disambiguate.
For the rest, the guesser corresponding to the tagger used should be used anyway.
⑤ (a) is simple: 1.5% is for sure wrongly tagged. For the rest (100% − 1.5%), only 98% are
correctly tagged. So the overall score is 0.985 × 0.98 ≈ 0.96.
(b) this is less obvious: we still have 0.985 × 0.98, but for the remaining 1.5% we cannot be
sure:
• regarding capitalized words, we can expect to have 98% correct, thus: 0.015 × 0.8 × 0.98
• but for the rest, this really depends on the performance/ambiguities of the spelling-error
correction and on the performance of the guesser.
Exercise VI.4
① Consider an HMM Part-of-Speech tagger, the tagset of which contains, among others:
DET, N, V, ADV and ADJ,
and some of the parameters of which are:
8 Parsing (CYK)
Exercise VIII.1
Grammar:
S -> NP VP
NP -> Det N
NP -> N
NP -> NP PP
VP -> V
VP -> VP PP
VP -> VBP VBG PP
PP -> P NP

Lexicon:
2012 N
Switzerland N
USA N
are VBP
exports N
exports V
from P
in P
increasing VBG
the Det
to P
Using the CYK algorithm, parse the following sentence with the above lexicon/grammar:
Provide both the complete, fully filled, data structure used by the algorithm, as well as the result of
the parsing in the form of a/the parse tree(s).
Solution
Transform to CNF: the only non-binary rule is VP -> VBP VBG PP, which is binarized, e.g., as
X -> VBP VBG and VP -> X PP (this X is the one appearing in the chart and trees below).
Chart: (see below)
[The filled CYK chart is not reproduced in this text version. Its lexical (bottom) row is:
the/Det exports/N,V,NP,VP from/P the/Det USA/N,NP to/P Switzerland/N,NP are/VBP increasing/VBG in/P 2006/N,NP,
and the top cell contains S, i.e. the sentence is accepted.]
Notice: the NP covering "the exports from the USA to Switzerland" has two interpretations. This leads to two full parse trees:
[S [NP [NP [NP [Det the] [N exports]] [PP [P from] [NP [Det the] [N USA]]]] [PP [P to] [NP [N Switzerland]]]]
   [VP [X [VBP are] [VBG increasing]] [PP [P in] [NP [N 2006]]]]]

[S [NP [NP [Det the] [N exports]] [PP [P from] [NP [NP [Det the] [N USA]] [PP [P to] [NP [N Switzerland]]]]]]
   [VP [X [VBP are] [VBG increasing]] [PP [P in] [NP [N 2006]]]]]

(in the first tree "to Switzerland" is attached to "the exports from the USA"; in the second it is attached to "the USA")
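A compact sketch of a CYK recognizer for the grammar above. The binarization (X -> VBP VBG, VP -> X PP) follows the CNF conversion used in the chart, unary rules (NP -> N, VP -> V) are handled by a closure step on each cell, and the sentence is the one used in the parse trees (so 2006 is entered in the lexicon here, although the exercise lexicon lists 2012):

```python
# Grammar of the exercise, with VP -> VBP VBG PP binarized as X -> VBP VBG and VP -> X PP.
binary_rules = [("S", "NP", "VP"), ("NP", "Det", "N"), ("NP", "NP", "PP"),
                ("PP", "P", "NP"), ("VP", "VP", "PP"), ("X", "VBP", "VBG"),
                ("VP", "X", "PP")]
unary_rules = [("NP", "N"), ("VP", "V")]
lexicon = {"the": {"Det"}, "exports": {"N", "V"}, "from": {"P"}, "USA": {"N"},
           "to": {"P"}, "Switzerland": {"N"}, "are": {"VBP"},
           "increasing": {"VBG"}, "in": {"P"}, "2006": {"N"}}

def unary_closure(symbols):
    """Add left-hand sides of unary rules until nothing changes."""
    changed = True
    while changed:
        changed = False
        for lhs, rhs in unary_rules:
            if rhs in symbols and lhs not in symbols:
                symbols.add(lhs)
                changed = True
    return symbols

def cyk(words):
    n = len(words)
    # chart[i][length] = set of non-terminals covering words[i:i+length]
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][1] = unary_closure(set(lexicon[w]))
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            cell = chart[i][length]
            for split in range(1, length):
                left, right = chart[i][split], chart[i + split][length - split]
                for lhs, a, b in binary_rules:
                    if a in left and b in right:
                        cell.add(lhs)
            unary_closure(cell)
    return chart

sentence = "the exports from the USA to Switzerland are increasing in 2006".split()
chart = cyk(sentence)
print("S" in chart[0][len(sentence)])   # True: the sentence is accepted
```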
Exercise VIII.2
① Give the result of the CYK algorithm applied to the following sentence:
the cat is looking at the mouse
using the following grammar:
② Draw all the parse trees that could be obtained from the previous question.
③ The above grammar over-generates. One reason is that some adjectives, e.g. former, can only
occur before a noun. For instance
the cat is former
is incorrect in English (but accepted by the above grammar).
Another reason for over-generation is that PPs do not combine with adjectives occurring before
a noun. For instance:
the looking at the mouse cat is black
is incorrect in English (but accepted by the above grammar).
Explain how the above grammar might be modified to prevent these two types of over-generation.
④ This grammar also accepts the following examples, which are (either syntactically or seman-
tically) incorrect in English:
the cat is old at the mouse
the cat is nice under the mouse
the cat is nice at the mouse at the mouse
In the first example, attaching “at the mouse” to “old” is incorrect in English because some
adjectives (e.g. “old”) may not have a PP; the second example is incorrect because “nice” can
only take PPs where the preposition is limited to a certain subset (e.g. “at”, but not “under”);
and the third example is incorrect because adjectives may not combine with more than one PP.
Propose modifications to the grammar in order to prevent these types of over-generation.
Solutions
① [The filled CYK chart is not reproduced in this text version. Its lexical (bottom) row is:
the/Det cat/N is/VBe looking/V,VP,Ving,Adj at/Prep the/Det mouse/N,
and the top cell contains S, i.e. the sentence is accepted (with the two analyses shown below).]
② two derivations:

[S [NP [Det the] [N cat]] [VP [VBe is] [Adj [Adj [Ving looking]] [PP [Prep at] [NP [Det the] [N mouse]]]]]]

[S [NP [Det the] [N cat]] [VP [VP [VBe is] [Adj [Ving looking]]] [PP [Prep at] [NP [Det the] [N mouse]]]]]
(and, of course, add the right PoS tag into the lexicon, e.g. former: Adj-).
Here we keep the PoS tag Adj for "Adj- or Adj+", where Adj+PP is the kind of adjective that can be
complemented by a PP.
Furthermore, what should be avoided is the accumulation of PPs on the same non-terminal,
i.e. we should NOT have any rule X -> X PP with the same X on both the left- and right-hand sides.
The main idea here is to go for a feature grammar and lexicalize some of the dependencies.
10 Stochastic Parsing
Exercise X.1
Below is a part of the lexicon and grammar for parsing English queries. Note that there is no error
with the probabilities; the list of rules shown here is simply incomplete.

Grammar:
S -> NP VP (0.76)
NP -> Det N (0.34)
NP -> Det N PP (0.23)
NP -> Pron (0.20)
VP -> V NP (0.45)
VP -> V NP PP (0.13)
PP -> P NP (0.84)

Lexicon (word, PoS, Prob):
who Pron 0.30
started V 0.26
an Det 0.21
the Det 0.49
argument N 0.07
partners N 0.04
with P 0.35
② Using the CYK algorithm, and the above grammar and lexicon, analyze the sentence:
who started an argument with the partners
Show both the CYK data structure, with the values filled in, and all the possible parse tree(s).
Solutions
① recognition and analysis: see lecture slides.
chart:
[The filled CYK chart is not reproduced in this text version. Its lexical (bottom) row is:
who/Pron,NP started/V an/Det argument/N with/P the/Det partners/N,
and the top cell contains S, i.e. the sentence is accepted.]
[S [NP [Pron who]] [VP [V started] [NP [Det an] [N argument] [PP [P with] [NP [Det the] [N partners]]]]]]
③ The parse shown above (using NP -> Det N PP and VP -> V NP) is the most probable one. There is no need to
compute the whole products, only the parts that differ: the alternative parse uses VP -> V NP PP (0.13) and
NP -> Det N (0.34) where this one uses NP -> Det N PP (0.23) and VP -> V NP (0.45); since 0.13 < 0.23 and
0.34 < 0.45, no computation is needed at all!
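The probability comparison can be made explicit in a couple of lines, using only the rule probabilities that differ between the two analyses:

```python
# analysis 1: VP -> V NP PP (0.13) with NP -> Det N (0.34) for "an argument"
# analysis 2: VP -> V NP (0.45) with NP -> Det N PP (0.23) for "an argument with the partners"
p1 = 0.13 * 0.34
p2 = 0.45 * 0.23
print(p1, p2)   # 0.0442 < 0.1035: the second analysis is the most probable
```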
Exercise X.2
② Parse the sentence "the cat ran home from the garden" using the CYK algorithm. Provide
both the completely filled-in data-structure generated by the algorithm, as well as the resulting
parse tree(s).
③ What is the most probable parse of this sentence? Justify your answer.
Solutions
① PCFG rules: grammar and lexicon. For the CYK algorithm, the non-binary VP rule is binarized, e.g. as:
X -> V NP (1.0)
VP -> X PP (0.1)
[The filled CYK chart is not reproduced in this text version. Its lexical (bottom) row is:
the/Det cat/N,NP ran/V home/N,NP from/P the/Det garden/N,NP,
and the top cell contains S, i.e. the sentence is accepted.]
[S [NP [Det the] [N cat]] [VP [V ran] [NP [NP [N home]] [PP [P from] [NP [Det the] [N garden]]]]]]
③ The second parse is the best one. There is no need to compute the whole products, only the parts that differ:
P(t1) = K1 · 0.1 · K2
P(t2) = K1 · 0.4 · 0.5 · K2
and 0.4 · 0.5 = 0.2 > 0.1, so t2 is the most probable parse.
11 Text Classification
Exercise XI.1
In an automated email router of a company, we want to make the distinction between three kinds of
emails: technical (about computers), financial, and the rest ("irrelevant"). For this we plan to use a
Naive Bayes approach.
① What is the main assumption made by Naive Bayes classifiers? Why is it "Naive"?
The Dow industrials tumbled 120.54 to 10924.74, hurt by GM’s sales forecast
and two economic reports. Oil rose to $71.92.
from www.wsj.com/
Intel will sell its XScale PXAxxx applications processor and 3G baseband pro-
cessor businesses to Marvell for $600 million, plus existing liabilities. The deal
could make Marvell the top supplier of 3G and later smartphone processors, and
enable Intel to focus on its core x86 and wireless LAN chipset businesses, the
companies say.
from www.linuxdevices.com/
② What pre-processing steps (before actually using the Naive Bayes Classifier) do you consider
applying to the input text?
③ For the first text, give an example of the corresponding output of the pre-processor.
Suppose we have collected the following statistics² about the word frequencies within the corresponding
classes, where "0.00..." stands for some very small value:
④ In a typical NLP architecture, where/how would you store this information? Make your answer
explicit, e.g. provide an illustrative example.
⑤ For each of the above three texts, in what category will it be classified, knowing that on average
50% of the emails happen to be technical, 40% financial and 10% of no interest.
You can assume that all the missing information is irrelevant (i.e. does not impact the results).
Provide a full explanation of all the steps and computations that lead to your results.
We now want to specifically focus on the processing of compounds such as "network capacity" in
the second text.
⑥ How are compounds handled by a Naive Bayes classifier if no specific pre-processing of
compounds is used?
⑦ What changes if the compounds are handled by the NL pre-processor?
Discuss this situation (NL pre-processing handling compounds) with respect to the Naive
Bayes main assumption.
⑧ Outline how you would build a pre-processor for compound words.
Solutions
Q2.1 The main assumption is that the features/attributes contributing to the likelihood are independent,
conditionally on the class:

P(f1 ... fn | C) = ∏i P(fi | C)

² Note that this is only partial information; statistics about other words, not presented here, have also been collected.
This is in practice definitely a strong assumption; this is the reason why it is called "Naive".
Q2.2 In text classification, preprocessing is really crucial in order to allow a "good" representation,
mainly through a proper reduction of the lexical variability.
Usual NLP steps for reducing lexical variability include: tokenization (removal of punctuation), PoS
tagging, lemmatization and suppression of grammatical ("meaningless") words (stopwords, some PoS
tags, low frequencies).
We could also have a more evolved tokenizer including Entity Recognition (e.g. based on regular
patterns) or even Named Entity Recognition for proper nouns.
Q2.3 Lemmatized and with number entity recognition, this could lead to:
Dow industrial tumble <number> <number> hurt GM sale
forecast economic report oil rise $<number>
If a multi-set representation is also included in the preprocessing (this was not expected as an answer),
the output could even be:
($<number>,1) (<number>,2) (Dow,1) (GM,1) (economic,1) (forecast,1)
(hurt,1) (industrial,1) (oil,1) (report,1) (rise,1) (sale,1) (tumble,1)
Q2.4 This is a more difficult question than it seems because it actually depends on the representation
chosen for the lexicon. If this representation allows several numeric fields to be associated with
lexical entries, then it should definitely be stored there.
Otherwise some external (i.e. out of the lexicon) arrays would be built, the role of the lexicon
then being to provide a mapping between lexical entries and indexes in these arrays.
The choice of the implementation also highly depends on the size of the vocabulary to be stored (and
also on the timing specifications for this task: real-time, off-line, ...).
Example for the case where an associative memory (whatever its implementation) is available:
It should be noticed that these probability arrays are very likely to be very sparse. Thus sparse matrix
representations would be worth using here.
Q2.5 What makes the discrimination between the classes are the P(word | class) terms and the priors P(C).
Indeed, the Naive Bayes classifier uses (see lectures):

argmax_C P(C | w1 ... wn) = argmax_C P(C) ∏i P(wi | C)

As stated in the question, assuming that all the rest is irrelevant, the first text will have
the maximal product of which is clearly for the second class: "financial".
For the second text, the maximal product is clearly for the first class: "technical".
showing that the ∏i P(wi | C) part is the same for the first two classes (and much smaller for "irrelevant").
Thus the prior P(C) makes the decision, and this last text is classified as "technical".
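A minimal sketch of the Naive Bayes decision rule just described, using log-probabilities to avoid underflow. The priors are the ones given in the question; the per-class word probabilities are made-up placeholders, since the full statistics table is not reproduced here:

```python
import math

priors = {"technical": 0.50, "financial": 0.40, "irrelevant": 0.10}

# Hypothetical P(word | class) values, for illustration only:
likelihood = {
    "technical":  {"processor": 0.01,   "sale": 0.0001, "oil": 0.0001},
    "financial":  {"processor": 0.0001, "sale": 0.01,   "oil": 0.005},
    "irrelevant": {"processor": 0.0001, "sale": 0.0001, "oil": 0.0001},
}

def classify(words):
    """argmax_C  log P(C) + sum_i log P(w_i | C)"""
    scores = {}
    for c, prior in priors.items():
        scores[c] = math.log(prior) + sum(math.log(likelihood[c].get(w, 1e-6))
                                          for w in words)
    return max(scores, key=scores.get)

print(classify(["oil", "sale"]))   # 'financial' with these made-up numbers
```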
Q2.6 Compounds are simply ignored as such by the Naive Bayes classifier and are, due to the "Naive"
independence assumption, handled as separate tokens.
Q2.7 If the preprocessor is able to recognize compounds as such, they will be included as such in the
set of features and thus handled as such. This is actually a way (through the preprocessor) to increase
the independence between the "features" of the Naive Bayes classifier, these features no longer
corresponding to single tokens only.
Q2.8 Compound preprocessing is a wide topic in itself (lexical acquisition), but as in many NLP
domains two main ways can be considered, which should definitely be exploited as complements of
one another: the statistical way and the linguistic/human-knowledge way.
The most naive linguistic approach would be to add compounds to the lexicon by hand.
For the statistical approach, simply extract all correlated pairs or bigger tuples of words, using e.g. mutual
information, chi-square or whatever measure of correlation. This could be enhanced using human
knowledge by selecting which PoS tags can enter this correlation game (e.g. looking for NN NN,
NN of NN, etc.), but also by manually filtering automatically extracted lists of candidates.
Exercise XI.2
You are responsible for a project aiming at providing on-line recommendations to the customers of
an on-line book-selling company.
The general idea behind this recommendation system is to cluster books according to both customer
and content similarities, so as to propose books similar to the books already bought by a given
customer. The core of the recommendation system is a clustering algorithm aiming at regrouping
books likely to be appreciated by the same person. This clustering should not only be achieved
based on the purchase history of customers, but should also be refined by the content of the books
themselves. It is this latter aspect we want to address in this exam question.
① Briefly explain how books could be clustered according to similar content. Give the main steps
and ideas.
"Similar content" here means meaning vs. surface content, or even structural content.
Main steps:
• preprocessing: keep semantically meaningful elements, remove less semantically important
lexical variability.
Usual NLP steps for reducing lexical variability include: tokenization (removal of punctuation),
PoS tagging, lemmatization and suppression of grammatical ("meaningless") words
(stopwords, some well-chosen PoS tags).
If lemmatization is not possible, stemming could be considered instead.
We could also have a more evolved tokenizer including Named Entity Recognition (e.g.
based on regular patterns).
• counting: frequencies, IDF, ...
• indexing / bag-of-words representation: from word sequences to vectors
• computing (dis)similarities between representations
• (choosing and) using a clustering method.
② The chosen clustering algorithm is the dendrogram. What other algorithms could you propose
for the same task? Briefly review advantages and disadvantages of each of them (including
dendrograms). Which one would you recommend for the targeted task?
We are in the unsupervised case. A possible baseline alternative is K-means.
Drawbacks: what K should be used for K-means? It converges only to a local minimum. What linkage
should be used for dendrograms?
Advantages: planar representation for dendrograms (which could be complemented with a minimal
spanning tree); K-means is incremental: one can choose to stop if it takes too long (monitoring the intra-class
variance, however).
Maybe the best thing to do would be to try both (and even more) and evaluate them, if possible, in
a real context...
③ Consider the following six "documents" (toy example):
d1 "Because cows are not sorted as they return from the fields to their home pen, cow flows
are improved."
...
d6 "What pen for what cow? A red pen for a red cow, a black pen for a black cow, a brown
pen for a brown cow, ... Understand?"
and suppose (toy example) that they are indexed only by the two words: pen and cow.
(a) The six documents are represented in the two-dimensional vector space (pen, cow) at the following coordinates:
(d1) at (1,2), (d2) at (2,0), (d3) at (1,1), (d4) at (2,2), (d5) at (4,1), (d6) at (4,4)
[the corresponding scatter plot is not reproduced here]
(b) Give the definition of the cosine similarity. What feature(s) of the vectors is it sensitive to?
See lectures. It is only sensitive to the vector angle/direction, not to the length, i.e. to the
relative word proportions, not to the absolute counts.
(c) What is the result of the dendrogram clustering algorithm on those six documents, using
the cosine similarity and single linkage?
Explain all the steps.
Hint: 5/√34 < 3/√10 < 4/√17.
Solution : Notice that there is absolutely no need to compute every pair of similarities!!!
Only 3 of them are useful, cf drawing (a); and even the drawing alone might be sufficient
(no computation at all)!
+------------------+
| |
| +-------+
| | |
| +------+ |
+----+ | | |
| | +----+ | |
| | | | | |
(d2) (d5) (d3) (d4) (d6) (d1)
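The three similarities actually needed by the single-linkage algorithm (cf. the hint) can be checked directly from the coordinates given in (a). This is a minimal sketch; scipy.cluster.hierarchy could of course be used to build the full dendrogram:

```python
import math

docs = {"d1": (1, 2), "d2": (2, 0), "d3": (1, 1),
        "d4": (2, 2), "d5": (4, 1), "d6": (4, 4)}

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

for a, b in [("d2", "d5"), ("d1", "d3"), ("d3", "d5")]:
    print(a, b, round(cosine(docs[a], docs[b]), 3))
# d2 d5 0.970 (= 4/sqrt(17)), d1 d3 0.949 (= 3/sqrt(10)), d3 d5 0.857 (= 5/sqrt(34))
```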
12 Information Retrieval
Exercise XII.1
① Describe the main principles of the standard vector space model for semantics.
② Consider the following document:
Propose a possible indexing set for this document. Justify your answer.
③ What is the similarity between the above document D and
Solutions
① The standard approach to vector semantics can be decomposed into two main steps:
• the indexing (or desequentialization) phase: during this phase, the documents for which a
vectorial semantic representation needs to be produced are processed with linguistic
tools in order to identify the indexing features (words, stems, lemmas, ...) they will be
associated with.
This phase results in the association of a set of indexing features with each of the documents.
Notice that, for the rest of the processing, only the sets of indexing features will
be considered; the rest of the documents will be ignored. Notice also that the sets of
indexing features are sets!... and that therefore any notion of word order is lost after the
indexing phase.
For example, if we consider a toy document collection consisting of the two following
documents:
sim(D1, D2) = cos(V1, V2) = (V1 · V2) / (‖V1‖ ‖V2‖),

where X · Y denotes the dot-product between vector X and vector Y, and ‖X‖ = √(X · X)
represents the norm (i.e. the length) of vector X.
Notice that this simple similarity might be further sophisticated in order to take into
account varying importance of the various dimensions of the vector space.
A possible approach is to use a weighted dot-product of the form:

V1 · V2 = ∑_{i=1}^{n} a_i v1_i v2_i, for V1 = (v1_1, v1_2, ..., v1_n) and V2 = (v2_1, v2_2, ..., v2_n),

where the a_i are some (usually positive) coefficients.
A standard approach for the weighting of the vector space dimensions is to use the "inverse
document frequency" (in fact any function f() decreasing with the document frequency of an
indexing feature, i.e. with the number of documents containing the given indexing feature).
For example, if we take a_i = idf(i)² = log(1/DF(i))², where DF(i) is the document frequency
of the indexing feature associated with the i-th dimension of the vector space, we get:
sim(D1, D2) = cos(V1', V2'), where Vi' = (tf(i,k) · idf(k))_k, tf(i,k) being the measure of
importance of the k-th indexing feature for the i-th document and idf(k) a measure of
importance of the k-th dimension of the vector space.
This approach corresponds to the standard "tf.idf" weighting scheme.
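A minimal sketch of the tf.idf weighting and cosine similarity just described (the three toy documents are hypothetical, since the example collection mentioned above is not reproduced; idf is computed here as log(N/DF), one of the usual decreasing functions):

```python
import math
from collections import Counter

docs = ["export increase Switzerland",   # hypothetical toy collection
        "export increase USA",
        "mouse cat pen"]
tokenized = [d.split() for d in docs]

df = Counter(w for toks in tokenized for w in set(toks))        # document frequency
idf = {w: math.log(len(docs) / df[w]) for w in df}

def tfidf(tokens):
    tf = Counter(tokens)
    return {w: tf[w] * idf[w] for w in tf}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

v1, v2 = tfidf(tokenized[0]), tfidf(tokenized[1])
print(round(cosine(v1, v2), 3))   # documents sharing weighted terms get a non-zero similarity
```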
② The simplest solution to produce the indexing set associated with a document is to use a
stemmer together with stop lists allowing specific non-content-bearing terms to be ignored. In
this case, the indexing set associated with D might be:
A more sophisticated approach would consist in using a lemmatizer, in which case the indexing
set might be:
Then several similarity measures could be considered, e.g. Dice, Jaccard, cosine.
For the cosine: the dot product is 2 (export and increas) and the norms are √5 and √4 = 2, thus:

cos(D, D') = 2 / (2√5) = 1/√5
④ One of the important limitations of the standard vector space approach is that the use of the
cosine similarity imposes that the dimensions of the vector space are orthogonal, and that
therefore the indexing features associated with the dimensions have, by construction, a null
similarity.
This is in fact a problem, as it is extremely difficult to guarantee that the indexing features
associated with the dimensions are indeed semantically fully uncorrelated. For example, even
if one (or more) document(s) contain(s) both the words "car" and "vehicle", the model still imposes
sim("car", "vehicle") = 0, which should be interpreted as the (not very convincing)
fact that "car" and "vehicle" have nothing in common.
A possible (partial) solution to this problem is to use more sophisticated representation techniques
such as Distributional Semantics (DS).
In DS, the semantic content of an indexing feature does not only rely on its occurrences in
the document collection, but also on the co-occurrences of this indexing feature with other
indexing features appearing in the same documents. In fact, the vectorial representations used
in DS are a mixture of the standard occurrence vectors (as they are used in the traditional vector
space model) with the co-occurrence vectors characterizing the indexing features appearing
in the documents. Thus, even if the occurrence vectors of "car" and
"vehicle" have, by definition, a zero similarity, in DS the vectors representing the documents
are of the form:
V(D) = a · OccV(D) + (1 − a) · CoocV(D)
and therefore, if "car" and "vehicle" share some co-occurrences (i.e. appear in documents
together with some identical words), their similarity will not be zero anymore.
⑤ Any NLP application that requires assessing the semantic proximity between textual
entities (texts, segments, words, ...) might benefit from the semantic vectorial representation.
Information Retrieval is of course one of the prototypical applications illustrating the potential
of the VS techniques. However, many other applications can be considered:
• automated summarization: the document to summarize is split into passages; each of the
passages is represented in a vector space and the passage(s) that is/are the "most central" in
the set of vectors thus produced are taken as good candidates for the summary to generate;
• semantic disambiguation: when polysemic words (such as "pen", which can be a place
to put cows or a writing instrument) are a problem –for example in machine translation–
vectorial representations can be generated for the different possible meanings of a word
(for example from machine-readable dictionaries) and used to disambiguate the occurrences
of an ambiguous word in documents;
• automated routing of messages to users: each user is represented by the vector representing
the semantic content of the messages s/he has received so far, and any new incoming
message is routed only to those users whose representative vector is similar enough
to the vector representing the content of the incoming message;
• text categorization or clustering
• ...
⑥ No. The indexing sets associated with D and D' would be exactly the same and would therefore
not allow these two documents to be discriminated (although they do not mean the same
thing!...).
⑦ If a parser were available, grammatical roles could be automatically associated with the
reduced indexing features. For example, specific grammatical roles could be associated with
prepositional nominal phrases such as "to the USA" or "from Switzerland", which could then
be represented as "to_USA" and "from_Switzerland".
In this case, the indexing sets associated with D and D' would differ, e.g.:
I(D) = {2006, export, from_USA, increase, to_Switzerland}
and would allow D and D' to be discriminated.
Exercise XII.2
Solutions
À It’s a possible measure used for document semantic content similarity. It operated on a vector
representation of the document “meaning” and is computed as
d ·q d ·q
cos(d, q) = =p
||d|| ||q|| (d · d)(q · q)
② First, a part-of-speech tagger might be applied in order to both filter out some stop words (and
make the distinction between can/V, to be removed, and can/N, to be kept), and to prepare for
lemmatization, i.e. the normalization of the surface forms of the "full words".
On this example, typically (maybe "raining" or "rain/V"):
Then a vectorial representation is built, typically a word-frequency (tf) vector. In this example
(corresponding to the words dog, cat, eat, rain, home):
D1: (2, 2, 1, 0, 0)
D2: (1, 1, 1, 1, 1)
Then the above cosine formula is used, leading to 5/√45 = √5/3.
③ Well... in principle NO... unless the system is urged to answer something anyway.
The numerator of the cosine will be 0 for every document (this query is orthogonal to all
documents). However, depending on how the norm of the query is computed, this should also
lead to a 0. Thus the cosine is undefined (0 over 0).
And it is up to the system engineering details to decide what to do in such a case.
④ (a) yes: it is indeed easy to compute intersection and union from the tf vectors, for instance
by simply degrading them into binary vectors.
(b) This appeared to be a difficult question. Several missed the "boolean representation" constraint
and none was able to properly find the example.
First notice that on a boolean representation, the dot product is the same as the (cardinal of
the) intersection. Thus cosine and Jaccard have the same numerator.
Furthermore, in the boolean case (binary vectors) the squared L2 norm of a document corresponds to its
length (number of distinct terms).
Thus in the boolean case the cosine reduces to |d ∩ q| / √(|d| · |q|) (whereas Jaccard is |d ∩ q| / |d ∪ q|).
Notice also that |d ∪ q| = |d| + |q| − |d ∩ q|.
The fact that the two Jaccard values are the same but the cosines are different implies that |d1 ∩ d2| and
|d1 ∩ d3| have to differ (easy proof).
For instance, let's take the simplest case: |d1 ∩ d2| = 1 and |d1 ∩ d3| = 2.
This implies that |d1 ∪ d3| = 2 |d1 ∪ d2|. Since these unions cannot be 1 and 2, nor 2 and 4 (it is easy to
see that the cosines are then equal), let us try with 3 and 6, for which several examples can be
found, for instance
D1: 1 1 1 0 0 0
D2: 1 0 0 0 0 0
D3: 1 0 1 1 1 1
In this case the two Jaccard values are both equal to 1/3 while the two cosines are 1/√3 and 2/√15.
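These values can be checked directly on the three boolean vectors above (each document represented as the set of positions of its 1s):

```python
import math

D1 = {1, 2, 3}          # positions of the 1s in "1 1 1 0 0 0"
D2 = {1}                #                       "1 0 0 0 0 0"
D3 = {1, 3, 4, 5, 6}    #                       "1 0 1 1 1 1"

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):        # boolean case: |a & b| / sqrt(|a| * |b|)
    return len(a & b) / math.sqrt(len(a) * len(b))

print(jaccard(D1, D2), jaccard(D1, D3))   # both 1/3
print(cosine(D1, D2), cosine(D1, D3))     # 1/sqrt(3) ~ 0.577, 2/sqrt(15) ~ 0.516
```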
Exercise XII.3
① Official NLP evaluations (especially for tasks such as Information Retrieval or Information
Extraction) are often carried out in the form of "evaluation campaigns".
Precisely describe the various steps of such an evaluation campaign.
For each of the steps, clearly indicate the main goals.
Solution: The first step is to define the control task. Then you have to gather a significant
amount of data. Third, you have to annotate some reference test data (the "golden truth"),
typically by means of human experts. Then you can run your system on a test set (different
from both the learning and the tuning (a.k.a. validation) sets). You thus produce some quantitative
scores describing the results, which you can publish, analyse (confidence) and discuss.
② In an IR evaluation campaign, the following "referential" ("golden truth") has been produced
by a set of human judges:
where the list of document references dj associated with a query reference qi defines the set
of documents considered to be relevant for the query by the human judges.
Is such a referential easy to produce?
Indicate the various problems that might arise when one tries to produce it.
(a) task ambiguity (meaning of the text, of the question; is the solution unique?)
(b) subjectivity (see inter-annotator agreement)
(c) size matters (too small =⇒ too biased)
(d) exhaustivity
(a) and (b), resp. (c) and (d), are related, the latter being a consequence of the former.
③ Consider two Information Retrieval systems S1 and S2 that produced the following outputs for
the 4 reference queries q1, q2, q3, q4:
S1: | referential:
q1: d01 d02 d03 d04 dXX dXX dXX dXX | q1: d01 d02 d03 d04
q2: d06 dXX dXX dXX dXX | q2: d05 d06
q3: dXX d07 d09 d11 dXX dXX dXX dXX dXX | q3: d07 d08 d09 d10 d11
q4: d12 dXX dXX d14 d15 dXX dXX dXX dXX | q4: d12 d13 d14 d15
S2: | referential:
q1: dXX dXX dXX dXX d04 | q1: d01 d02 d03 d04
q2: dXX dXX d05 d06 | q2: d05 d06
q3: dXX dXX d07 d08 d09 | q3: d07 d08 d09 d10 d11
q4: dXX d13 dXX d15 | q4: d12 d13 d14 d15
where dXX refer to document references that do not appear in the referential. To make the
answer easier, we copied the referential on the right.
For each of the two systems, compute the mean Precision and Recall measures (provide the
results as fractions). Explain all the steps of your computation.
S1:
q1: P=4/8 R=4/4    q2: P=1/5 R=1/2
q3: P=3/9 R=3/5    q4: P=3/9 R=3/4
i.e. mean Precision = (1/2 + 1/5 + 1/3 + 1/3)/4 = 41/120, mean Recall = (1 + 1/2 + 3/5 + 3/4)/4 = 57/80.
S2:
q1: P=1/5 R=1/4    q2: P=2/4 R=2/2
q3: P=3/5 R=3/5    q4: P=2/4 R=2/4
i.e. mean Precision = (1/5 + 1/2 + 3/5 + 1/2)/4 = 9/20, mean Recall = (1/4 + 1 + 3/5 + 1/2)/4 = 47/80.
⑤ How is it possible to compute the average Precision/Recall curves? Explain in detail the
various steps of the computation.
As it would be too tedious to compute the average Precision/Recall curves by hand, plot, on a
Precision/Recall graph, the Precision and Recall values obtained in subquestion ③ for each of
the two systems and for each of the 4 queries.
Based on the resulting curves, what is your relative evaluation of the two systems?
(1) Use average precision: for each relevant document, compute the precision over all relevant
documents retrieved up to rank Rk (see the formula in the course).
Then we have different precisions for different recalls and can plot these values.
(3) The ideal is the top right corner.
S1 has better recall, S2 better precision. In general S2 performs slightly better.
⑥ The Precision/Recall based evaluation of the IR systems S1 and S2 above does not explicitly
take into account the order in which the documents have been retrieved by the systems. For
this purpose, another metric can be used: the Precision at k (P@k), which corresponds to the
fraction of truly relevant documents among the top k documents retrieved by a system.
Compute the average P@k values for k between 1 and 5 for the IR systems S1 and S2 above.
What additional insight do these values provide in addition to the Precision/Recall curves?
Based on these results, what is your relative evaluation of the two systems? How does it
compare to Â?
        k       1      2      3      4      5
S1   q1         1      1      1      1     4/5
     q2         1     1/2    1/3    1/4    1/5
     q3         0     1/2    2/3    3/4    3/5
     q4         1     1/2    1/3    1/2    3/5
     avg P     3/4    5/8    7/12  10/16  11/20
S2   q1         0      0      0      0     1/5
     q2         0      0     1/3    1/2    2/5
     q3         0      0     1/3    1/2    3/5
     q4         0     1/2    1/3    1/2    2/5
     avg P      0     1/8    3/12   6/16   8/20
(2) Since higher-ranked documents should be more relevant, we can see whether a system produces
relevant results quickly, among the first retrieved documents.
(3) S1 is better than S2 since it has a lot of relevant documents in the top results. This gives a
completely different view w.r.t. the former evaluation. S1 is better for web-like search, S2 maybe for
law-like search (see ⑧).
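A short sketch recomputing the average P@k values from the ranked outputs and the referential given above ("x" stands for the dXX documents, which are never relevant):

```python
systems = {
    "S1": {"q1": ["d01", "d02", "d03", "d04", "x", "x", "x", "x"],
           "q2": ["d06", "x", "x", "x", "x"],
           "q3": ["x", "d07", "d09", "d11", "x", "x", "x", "x", "x"],
           "q4": ["d12", "x", "x", "d14", "d15", "x", "x", "x", "x"]},
    "S2": {"q1": ["x", "x", "x", "x", "d04"],
           "q2": ["x", "x", "d05", "d06"],
           "q3": ["x", "x", "d07", "d08", "d09"],
           "q4": ["x", "d13", "x", "d15"]},
}
referential = {"q1": {"d01", "d02", "d03", "d04"}, "q2": {"d05", "d06"},
               "q3": {"d07", "d08", "d09", "d10", "d11"},
               "q4": {"d12", "d13", "d14", "d15"}}

def precision_at_k(ranking, relevant, k):
    return sum(1 for d in ranking[:k] if d in relevant) / k

for name, runs in systems.items():
    avg = [sum(precision_at_k(runs[q], referential[q], k) for q in runs) / len(runs)
           for k in range(1, 6)]
    print(name, [round(p, 3) for p in avg])
# S1 [0.75, 0.625, 0.583, 0.625, 0.55]    S2 [0.0, 0.125, 0.25, 0.375, 0.4]
```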
⑦ It is often desirable to be able to express the performance of an NLP system in the form of a
single number, which is not the case when the Precision/Recall framework is used.
Indicate what scores can be used to convert Precision/Recall measures into a unique number.
For each score, give the corresponding formula.
F score:

F_b = (b² + 1) · P · R / (b² · P + R)

When b² > 1, R is emphasized; when b² < 1, P is emphasized.
Accuracy: ratio of correct results provided by the system (w.r.t. the total number of results from the
system).
Error = 1 − Accuracy
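A two-line check of the weighting direction (the P and R values are arbitrary):

```python
def f_beta(p, r, b=1.0):
    return (b * b + 1) * p * r / (b * b * p + r)

p, r = 0.9, 0.3                          # arbitrary example values
print(f_beta(p, r, b=0.5), f_beta(p, r, b=2))
# ~0.64 (closer to P) vs ~0.35 (closer to R): a large b emphasizes Recall
```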
⑧ Give well chosen examples of applications that illustrate:
• a situation where more importance should be given to Precision;
• a situation where more importance should be given to Recall.
More importance should be given to Precision in Web-like search applications, because a few relevant
documents out of the huge amount of possibly pertinent documents are enough.
Recall should be preferred in legal or medical-like search, where exhaustivity (retrieving all the correct
documents) is important.
14 Lexical Semantics
Exercise XIV.1
The objective of this question is to illustrate the use of a lexical semantics resource to compute
lexical cohesion.
Consider the following toy ontology providing a semantic structuring for a (small) set of nouns (the
tree is rendered here as an indented list; the bracketed intermediate node names are reconstructed from
the distances used in the solutions below):
all
    animate entities
        [persons]: man, woman, child
        animals: cat, dog, mouse
    non animate entities
        abstract entities: freedom, happiness
        [concrete entities]: table, pen, mouse
① Give some examples of NLP tasks for which lexical cohesion might be useful. Explain why.
IR: find documents based on lexical cohesion with respect to the query words.
Automatic summarization: check the coherence of the extracted sentences.
Semantic disambiguation among the possible choices in spelling error correction (e.g. bag or bug for
"bxg").
WSD (Word Sense Disambiguation).
Machine translation (as a semantic filter).
② What is the semantic relation that has been used to build the ontology?
Cite another semantic relation that could also be useful for building lexical semantics resources.
For this semantic relation, give a short definition and a concrete example.
"is a" relation: hypernymy.
Other: meronymy ("part of" relation), e.g. "wheel" is a meronym of "car".
 The word "mouse" appears at two different places in the toy ontology. What does this mean?
What specific problems does it raise when the ontology is used?
How could such problems be solved? (just provide a sketch of explanation.)
semantic ambiguity (two different meanings, polysemy, homonymy(homography)).
WSD through the informations from the context (e.g. coehsion).
④ Consider the following short text:
Cats are fighting dogs. There are plenty of pens on the table.
What pre-processing should be performed on this text to make it suitable for the use of the
available ontology?
Identify surface forms (tokenization), maybe stopword filtering, PoS tagging to keep the nouns,
lemmatization (or stemming).
⑤ We want to use lexical cohesion to decide whether the provided text consists of one single
topical segment corresponding to both sentences, or of two distinct topical segments, each
corresponding to one of the sentences.
The lexical cohesion of any set of words (in canonical form) is defined as the average lexical
distance between all pairs of words present in the set. The lexical distance between any two
words is defined as the length of a shortest path between the two words in the available
ontology.
For example, "freedom" and "happiness" are at distance 2 (length, i.e. number of links, of the
path: happiness −→ abstract entities −→ freedom), while "freedom" and "dog" are at distance
6 (length of the path: freedom −→ abstract entities −→ non animate entities −→ all −→
animate entities −→ animals −→ dog).
Compute the lexical distance between all the pairs of words present in the above text and in
the provided ontology (there are 6 such pairs).
Solution: d(cat, dog) = 2 and d(pen, table) = 2 (each pair shares a direct parent node), while
d(cat, pen) = d(cat, table) = d(dog, pen) = d(dog, table) = 6 (these paths go through the root "all").
⑥ Compute the lexical cohesion of each of the two sentences, and then the lexical cohesion of
the whole text.
Based on the obtained values, what decision should be taken as far as the segmentation of the
text into topical segments is concerned?
D(S1) = 2, D(S2) = 2
D(S1,S2) = 1/6 · (2 + 6 + 6 + 6 + 6 + 2) = 14/3
We have here a distance that we want to minimize: the lower the distance, the more lexically
coherent the text is (words closer in the ontology). We thus decide to segment the text into two
topical segments (one sentence each).
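A small sketch of these distance and cohesion computations. The child-to-parent links encode the toy ontology (including the reconstructed intermediate nodes, with the two senses of "mouse" kept apart); the distance is the length of the shortest path in the tree:

```python
from itertools import combinations

# child -> parent links of the toy ontology (intermediate names partly reconstructed)
parent = {
    "animate entities": "all", "non animate entities": "all",
    "persons": "animate entities", "animals": "animate entities",
    "abstract entities": "non animate entities", "objects": "non animate entities",
    "man": "persons", "woman": "persons", "child": "persons",
    "cat": "animals", "dog": "animals", "mouse_animal": "animals",
    "freedom": "abstract entities", "happiness": "abstract entities",
    "table": "objects", "pen": "objects", "mouse_device": "objects",
}

def path_to_root(node):
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def distance(a, b):
    """Length of the shortest path between two nodes of the tree."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)      # lowest common ancestor
    return pa.index(common) + pb.index(common)

def cohesion(words):
    pairs = list(combinations(words, 2))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

print(distance("freedom", "happiness"))                       # 2
print(distance("freedom", "dog"))                              # 6
print(cohesion(["cat", "dog"]), cohesion(["pen", "table"]))    # 2.0 2.0
print(cohesion(["cat", "dog", "pen", "table"]))                # 14/3 ~ 4.67
```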