
Statistical Machine Translation:

the basic, the novel, and the speculative


Philipp Koehn, University of Edinburgh

4 April 2006


The Basic
• Translating with data
– how can computers learn from translated text?
– what translated material is out there?
– is it enough? how much is needed?
• Statistical modeling
– framing translation as a generative statistical process
• EM Training
– how do we automatically discover hidden data?
• Decoding
– algorithm for translation


The Novel
• Automatic evaluation methods
– can computers decide what are good translations?
• Phrase-based models
– what are atomic units of translation?
– the best method in statistical machine translation
• Discriminative training
– what are the methods that directly optimize translation performance?


The Speculative
• Syntax-based transfer models
– how can we build models that take advantage of syntax?
– how can we ensure that the output is grammatical?
• Factored translation models
– how can we integrate different levels of abstraction?


The Rosetta Stone

• The Egyptian language was a mystery for centuries


• In 1799, a stone with Egyptian text and its translation into Greek was found
⇒ Humans could learn how to translate Egyptian


Parallel Data
• Lots of translated text available: hundreds of millions of words of translated
text for some language pairs
– a book has a few hundred thousand words
– an educated person may read 10,000 words a day
→ 3.5 million words a year
→ 300 million words in a lifetime
→ soon computers will be able to see more translated text than humans read
in a lifetime
⇒ Machines can learn how to translate foreign languages


Statistical Machine Translation


• Components: Translation model, language model, decoder

[Diagram: foreign/English parallel text → statistical analysis → Translation Model;
English text → statistical analysis → Language Model;
both models feed the Decoding Algorithm]


Word-Based Models
Mary did not slap the green witch
n(3|slap)
Mary not slap slap slap the green witch
p-null
Mary not slap slap slap NULL the green witch
t(la|the)
Maria no daba una bofetada a la verde bruja
d(4|4)
Maria no daba una bofetada a la bruja verde

[from Knight, 1997]


• Translation process is decomposed into smaller steps,
each is tied to words
• Original models for statistical machine translation [Brown et al., 1993]


Phrase-Based Models
Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference in Canada

[from Koehn et al., 2003, NAACL]


• Foreign input is segmented in phrases
– any sequence of words, not necessarily linguistically motivated
• Each phrase is translated into English
• Phrases are reordered


Syntax-Based Models
[Diagram: an English parse tree for "he adores listening to music" is transformed
step by step — reorder child nodes, insert function words (ha, ga, no, desu),
translate the leaf words, take the leaves — yielding the Japanese sentence
"Kare ha ongaku wo kiku no ga daisuki desu"]
[from Yamada and Knight, 2001]


Language Models
• Language models indicate whether a sentence is good English
– p(Tomorrow I will fly to the conference) = high
– p(Tomorrow fly me at a summit) = low
→ ensures fluent output by guiding word choice and word order
• Standard: trigram language models
p(Tomorrow|START) ×
p(I|START,Tomorrow) ×
p(will|Tomorrow,I) ×
...
p(Canada|conference,in) ×
p(END|in,Canada) ×
• Often estimated using additional monolingual data (billions of words)
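As an aside (not from the tutorial): the chain of trigram probabilities above is easy to turn into code. A minimal sketch in Python, where trigram_prob is a hypothetical smoothed table estimated from monolingual data:

import math

trigram_prob = {
    ("<s>", "<s>", "Tomorrow"): 0.01,
    ("<s>", "Tomorrow", "I"): 0.2,
    ("Tomorrow", "I", "will"): 0.4,
    # ... estimated from monolingual data, with smoothing for unseen trigrams
}

def sentence_logprob(words, floor=1e-7):
    """Sum log p(w_i | w_{i-2}, w_{i-1}) over the sentence, with </s> at the end."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    logp = 0.0
    for i in range(2, len(padded)):
        p = trigram_prob.get(tuple(padded[i-2:i+1]), floor)  # floor stands in for smoothing
        logp += math.log(p)
    return logp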


Automatic Evaluation
• Why automatic evaluation metrics?
– Manual evaluation is too slow
– Evaluation on large test sets reveals minor improvements
– Automatic tuning to improve machine translation performance
• History
– Word Error Rate
– BLEU since 2002
• BLEU in short: Overlap with reference translations
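To make "overlap with reference translations" concrete, here is a rough single-sentence sketch of BLEU's core computation (the real metric clips counts against multiple references and aggregates over a whole test set):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())  # clipped counts
        total = max(1, sum(c_counts.values()))
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("the gunman was shot dead by the police .",
           "the gunman was shot to death by the police ."))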


Automatic Evaluation
• Reference Translation
– the gunman was shot to death by the police .
• System Translations
– the gunman was police kill .
– wounded police jaya of
– the gunman was shot dead by the police .
– the gunman arrested by police kill .
– the gunmen were killed .
– the gunman was shot to death by the police .
– gunmen were killed by police ?SUB>0 ?SUB>0
– al by the police .
– the ringer is killed by the police .
– police killed the gunman .
• Matches
– green = 4 gram match (good!)
– red = word not matched (bad!)


Automatic Evaluation

[from George Doddington, NIST]


• BLEU correlates with human judgement
– multiple reference translations may be used


Correlation? [Callison-Burch et al., 2006]


[Figure: two scatter plots of human score (2-4) against BLEU score (0.38-0.52),
one for adequacy correlation, one for fluency correlation]
[from Callison-Burch et al., 2006, EACL]
• DARPA/NIST MT Eval 2005
– Mostly statistical systems (all but one in graphs)
– One submission was a manual post-edit of a statistical system’s output
→ Good adequacy/fluency scores not reflected by BLEU


Correlation? [Callison-Burch et al., 2006]


[Figure: adequacy and fluency scores (2-4.5) plotted against BLEU (0.18-0.30) for
SMT System 1, SMT System 2, and the rule-based system (Systran)]

[from Callison-Burch et al., 2006, EACL]


• Comparison of
– good statistical system: high BLEU, high adequacy/fluency
– bad statistical sys. (trained on less data): low BLEU, low adequacy/fluency
– Systran: lowest BLEU score, but high adequacy/fluency


Automatic Evaluation: Outlook


• Research questions
– why does BLEU fail Systran and manual post-edits?
– how can this be overcome with novel evaluation metrics?
• Future of automatic methods
– automatic metrics too useful to be abandoned
– evidence still supports that during system development, a better BLEU
indicates a better system
– final assessment has to be human judgement


Competitions
• Progress driven by MT Competitions
– NIST/DARPA: Yearly campaigns for Arabic-English, Chinese-English,
news texts, since 2001
– IWSLT: Yearly competitions for Asian languages and Arabic into English,
speech travel domain, since 2003
– WPT/WMT: Yearly competitions for European languages, European
Parliament proceedings, since 2005
• Increasing number of statistical MT groups participate
• Competitions won by statistical systems


Competitions: Good or Bad?


• Pro:
– public forum for demonstrating the state of the art
– open data sets and evaluation metrics allow for comparison of methods
– credibility for a new approach by doing well
– sharing of ideas and implementation details
• Con:
– winning a competition is mostly due to better engineering
– having more data and faster machines plays a role
– limits research to a few directions (re-engineering of others’ methods)


Euromatrix
• Proceedings of the European Parliament
– translated into 11 official languages
– entry of new members in May 2004: more to come...
• Europarl corpus
– collected 20-30 million words per language
→ 110 language pairs
• 110 translation systems
– trained in 3 weeks on a 16-node cluster computer
• Basis of a new European Commission funded project


Quality of Translation Systems


• BLEU scores for all 110 systems (rows: source language, columns: target language)
da de el en es fr fi it nl pt sv
da - 18.4 21.1 28.5 26.4 28.7 14.2 22.2 21.4 24.3 28.3
de 22.3 - 20.7 25.3 25.4 27.7 11.8 21.3 23.4 23.2 20.5
el 22.7 17.4 - 27.2 31.2 32.1 11.4 26.8 20.0 27.6 21.2
en 25.2 17.6 23.2 - 30.1 31.1 13.0 25.3 21.0 27.1 24.8
es 24.1 18.2 28.3 30.5 - 40.2 12.5 32.3 21.4 35.9 23.9
fr 23.7 18.5 26.1 30.0 38.4 - 12.6 32.4 21.1 35.3 22.6
fi 20.0 14.5 18.2 21.8 21.1 22.4 - 18.3 17.0 19.1 18.8
it 21.4 16.9 24.8 27.8 34.0 36.0 11.0 - 20.0 31.2 20.2
nl 20.5 18.3 17.4 23.0 22.9 24.6 10.3 20.0 - 20.7 19.0
pt 23.2 18.2 26.4 30.1 37.9 39.0 11.9 32.0 20.2 - 21.9
sv 30.3 18.9 22.8 30.2 28.6 29.7 15.3 23.9 21.9 25.9 -
[from Koehn, 2005: Europarl]


Clustering Languages
[Figure: languages arranged by mutual translatability — pt/es/fr/it form one
cluster, sv/da/en and de/nl group nearby, fi and el stand apart]
[from Koehn, 2005, MT Summit]

• Clustering languages based on how easy they translate into each other
⇒ Approximation of language families


Translation examples
• Spanish-English
(1) the current situation , unsustainable above all for many self-employed drivers and in the
area of agriculture , we must improve without doubt .
(2) in itself , it is good to reach an agreement on procedures , but we have to ensure that this
system is not likely to be used as a weapon policy .
• Finnish-English
(1) the current situation , which is unacceptable , in particular , for many carriers and
responsible for agriculture , is in any case , to be improved .
(2) agreement on procedures in itself is a good thing , but there is a need to ensure that the
system cannot be used as a political lyömäaseena .
• English reference
(1) the current situation , which is intolerable , particularly for many independent haulage firms
and for agriculture , does in any case need to be improved .
(2) an agreement on procedures in itself is a good thing , but we must make sure that the
system cannot be used as a political weapon .


Translate into vs. out of a Language


• Some languages are easier to translate into than out of

Language From Into Diff


da 23.4 23.3 0.0
de 22.2 17.7 -4.5
el 23.8 22.9 -0.9
en 23.8 27.4 +3.6
es 26.7 29.6 +2.9
fr 26.1 31.1 +5.1
fi 19.1 12.4 -6.7
it 24.3 25.4 +1.1
nl 19.7 20.7 +1.1
pt 26.1 27.0 +0.9
sv 24.8 22.1 -2.6
[from Koehn, 2005: Europarl]

• Morphologically rich languages harder to generate (German, Finnish)


Backtranslations
• Checking translation quality by back-translation
• The spirit is willing, but the flesh is weak
• English → Russian → English
• The vodka is good but the meat is rotten


Backtranslations II
• Does not correlate with unidirectional performance

Language From Into Back


da 28.5 25.2 56.6
de 25.3 17.6 48.8
el 27.2 23.2 56.5
es 30.5 30.1 52.6
fi 21.8 13.0 44.4
it 27.8 25.3 49.9
nl 23.0 21.0 46.0
pt 30.1 27.1 53.6
sv 30.2 24.8 54.4
[from Koehn, 2005: Europarl]


Available Data
• Available parallel text
– Europarl: 30 million words in 11 languages http://www.statmt.org/europarl/
– Acquis Communautaire: 8-50 million words in 20 EU languages
– Canadian Hansards: 20 million words from Ulrich Germann, ISI
– Chinese/Arabic to English: over 100 million words from LDC
– lots more French/English, Spanish/French/English from LDC
• Available monolingual text (for language modeling)
– 2.8 billion words of English from LDC
– 100s of billions, trillions on the web


More Data, Better Translations


[Figure: BLEU learning curves for Swedish, French, German, and Finnish (best to
worst, roughly 0.30 down to 0.15) as training data grows from 10k to 320k
sentence pairs]

[from Koehn, 2003: Europarl]


• Log-scale improvements on BLEU:
Doubling the training data gives constant improvement (+1 %BLEU)
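Stated as a formula (my paraphrase of the slide’s claim, not an equation from the tutorial), the curves say BLEU grows linearly in the log of the corpus size N:

BLEU(N) ≈ BLEU(N0) + 1.0 · log2(N / N0)    e.g. 160k → 320k sentence pairs: +1 %BLEU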


More LM Data, Better Translations


LM training words: 75M   150M  300M  600M  1.2B  2.5B  5B    10B   18B   +web
BLEU:              48.5  49.1  49.8  50.0  50.5  51.2  51.7  51.9  52.3  53.1
[from Och, 2005: MT Eval presentation]
• Also log-scale improvements on BLEU:
doubling the training data gives constant improvement (+0.5 %BLEU)
(last addition is 218 billion words out-of-domain web data)


• Decoding
• Statistical Modeling
• EM Algorithm
• Word Alignment
• Phrase-Based Translation
• Discriminative Training
• Syntax-Based Statistical MT


Decoding Process
Maria no dio una bofetada a la bruja verde

• Build translation left to right


– select foreign words to be translated


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary

• Build translation left to right


– select foreign words to be translated
– find English phrase translation
– add English phrase to end of partial translation


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary

• Build translation left to right


– select foreign words to be translated
– find English phrase translation
– add English phrase to end of partial translation
– mark foreign words as translated


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary did not

• One to many translation


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary did not slap

• Many to one translation


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary did not slap the

• Many to one translation


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary did not slap the green

• Reordering


Decoding Process
Maria no dio una bofetada a la bruja verde

Mary did not slap the green witch

• Translation finished


Translation Options
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

• Look up possible phrase translations


– many different ways to segment words into phrases
– many different ways to translate each phrase


Hypothesis Expansion
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

e:
f: ---------
p: 1

• Start with empty hypothesis


– e: no English words
– f: no foreign words covered
– p: probability 1


Hypothesis Expansion
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

e: e: Mary
f: --------- f: *--------
p: 1 p: .534

• Pick translation option


• Create hypothesis
– e: add English phrase Mary
– f: first foreign word covered
– p: probability 0.534


A Quick Word on Probabilities


• Not going into detail here, but...
• Translation Model
– phrase translation probability p(Mary|Maria)
– reordering costs
– phrase/word count costs
– ...
• Language Model
– uses trigrams:
– p(Mary did not) =
p(Mary|START) × p(did|START,Mary) × p(not|Mary,did)


Hypothesis Expansion
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

e: witch
f: -------*-
p: .182

e: e: Mary
f: --------- f: *--------
p: 1 p: .534

• Add another hypothesis


Hypothesis Expansion
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

e: witch e: ... slap


f: -------*- f: *-***----
p: .182 p: .043

e: e: Mary
f: --------- f: *--------
p: 1 p: .534

• Further hypothesis expansion


Hypothesis Expansion
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

e: witch e: slap
f: -------*- f: *-***----
p: .182 p: .043

e: e: Mary e: did not e: slap e: the e:green witch


f: --------- f: *-------- f: **------- f: *****---- f: *******-- f: *********
p: 1 p: .534 p: .154 p: .015 p: .004283 p: .000271

• ... until all foreign words covered


– find best hypothesis that covers all foreign words
– backtrack to read off translation


Hypothesis Expansion
Maria no dio una bofetada a la bruja verde

Mary not give a slap to the witch green


did not a slap by green witch
no slap to the
did not give to
the
slap the witch

e: witch e: slap
f: -------*- f: *-***----
p: .182 p: .043

e: e: Mary e: did not e: slap e: the e:green witch


f: --------- f: *-------- f: **------- f: *****---- f: *******-- f: *********
p: 1 p: .534 p: .154 p: .015 p: .004283 p: .000271

• Adding more hypotheses


⇒ Explosion of search space


Explosion of Search Space


• Number of hypotheses is exponential with respect to sentence length
⇒ Decoding is NP-complete [Knight, 1999]
⇒ Need to reduce search space
– risk free: hypothesis recombination
– risky: histogram/threshold pruning


Hypothesis Recombination
p=0.092

p=1 p=0.534 p=0.092


Mary did not give

did not
give
p=0.164 p=0.044

• Different paths to the same partial translation


Hypothesis Recombination
p=0.092

p=1 p=0.534 p=0.092


Mary did not give

did not
give
p=0.164

• Different paths to the same partial translation


⇒ Combine paths
– drop weaker path
– keep pointer from weaker path


Hypothesis Recombination
p=0.092 p=0.017
did not give
Joe
p=1 p=0.534 p=0.092
Mary did not give

did not
give
p=0.164

• Recombined hypotheses do not have to match completely


• No matter what is added, the weaker path can be dropped, if:
– last two English words match (matters for language model)
– foreign word coverage vectors match (affects future paths)


Hypothesis Recombination
p=0.092
did not give
Joe
p=1 p=0.534 p=0.092
Mary did not give

did not
give
p=0.164

• Recombined hypotheses do not have to match completely


• No matter what is added, the weaker path can be dropped, if:
– last two English words match (matters for language model)
– foreign word coverage vectors match (affects future paths)
⇒ Combine paths
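A small sketch of that test in Python (illustrative names, not the tutorial’s code; a trigram language model is assumed, hence the last two English words):

def recombination_key(hyp):
    # two hypotheses are interchangeable for all future expansions iff they
    # agree on the last two English words and the foreign coverage vector
    return (tuple(hyp.english[-2:]), tuple(hyp.coverage))

def add_to_stack(stack, hyp):
    """stack: dict mapping recombination key -> best hypothesis so far."""
    key = recombination_key(hyp)
    rival = stack.get(key)
    if rival is None or hyp.prob > rival.prob:
        hyp.recombined = rival        # keep pointer to the weaker path
        stack[key] = hyp              # the weaker path is dropped from search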


Pruning
• Hypothesis recombination is not sufficient
⇒ Heuristically discard weak hypotheses early
• Organize hypotheses in stacks, e.g. by
– same foreign words covered
– same number of foreign words covered (Pharaoh does this)
– same number of English words produced
• Compare hypotheses in stacks, discard bad ones
– histogram pruning: keep top n hypotheses in each stack (e.g., n=100)
– threshold pruning: keep hypotheses that are at most α times the cost of
best hypothesis in stack (e.g., α = 0.001)
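Putting stacks, expansion, and both pruning styles together — a compact sketch of the search loop in the style of Pharaoh (expand() and the hypothesis attributes are assumed to exist; recombination uses the dictionary stacks keyed as in the earlier sketch):

def decode(n_foreign_words, empty_hyp, expand, beam_size=100, alpha=0.001):
    # one stack per number of covered foreign words
    stacks = [dict() for _ in range(n_foreign_words + 1)]
    stacks[0][recombination_key(empty_hyp)] = empty_hyp
    for n in range(n_foreign_words):
        hyps = sorted(stacks[n].values(), key=lambda h: h.score, reverse=True)
        hyps = hyps[:beam_size]                      # histogram pruning
        if hyps:
            best = hyps[0].score                     # score = probability x future cost
            hyps = [h for h in hyps if h.score >= alpha * best]  # threshold pruning
        for hyp in hyps:
            for new_hyp in expand(hyp):              # apply one translation option
                add_to_stack(stacks[new_hyp.n_covered], new_hyp)
    return max(stacks[n_foreign_words].values(), key=lambda h: h.score)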


Hypothesis Stacks

1 2 3 4 5 6

• Organization of hypothesis into stacks


– here: based on number of foreign words translated
– during translation, all hypotheses from one stack are expanded
– expanded hypotheses are placed into the appropriate stacks


Comparing Hypotheses
• Comparing hypotheses with same number of foreign words covered
Maria no dio una bofetada a la bruja verde

e: Mary did not          e: the
f: **-------             f: -----**--
p: 0.154                 p: 0.354
(better partial          (covers easier part
translation)             → lower cost)

• Hypothesis that covers easy part of sentence is preferred


⇒ Need to consider future cost of uncovered parts


Future Cost Estimation


a la

to the

• Estimate cost to translate remaining part of input


• Step 1: estimate future cost for each translation option
– look up translation model cost
– estimate language model cost (no prior context)
– ignore reordering model cost
→ LM * TM = p(to) * p(the|to) * p(to the|a la)


Future Cost Estimation: Step 2


a la

to the cost = 0.0372

to cost = 0.0299

the cost = 0.0354

• Step 2: find cheapest cost among translation options


Future Cost Estimation: Step 3


Maria no dio una bofetada a la bruja verde

Maria no dio una bofetada a la bruja verde

• Step 3: find cheapest future cost path for each span


– can be done efficiently by dynamic programming
– future cost for every span can be pre-computed
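The dynamic program is small enough to show in full. A sketch, where option_cost maps a span (i, j) to the best log-cost of any single translation option covering it — a hypothetical input computed in steps 1 and 2:

import math

def future_costs(n, option_cost):
    # log-costs, higher = cheaper; cost[i][j] is the best estimate for span i..j-1
    cost = [[-math.inf] * (n + 1) for _ in range(n + 1)]
    for length in range(1, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            cost[i][j] = option_cost.get((i, j), -math.inf)
            for k in range(i + 1, j):                 # best split of the span
                cost[i][j] = max(cost[i][j], cost[i][k] + cost[k][j])
    return cost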


Future Cost Estimation: Application


Maria no dio una bofetada a la bruja verde

[Diagram: the hypothesis "... slap" covers "Maria" and "dio una bofetada"; its two
uncovered spans have pre-computed future costs 0.1 ("no") and 0.006672 ("a la
bruja verde"), so fc = 0.1 × 0.006672 = .0006672]

e:              e: Mary          e: ... slap
f: ---------    f: *--------     f: *-***----
p: 1            p: .534          p: .043
                                 fc: .0006672
                                 p*fc: .000029

• Use future cost estimates when pruning hypotheses


• For each uncovered contiguous span:
– look up future costs for each maximal contiguous uncovered span
– add to actually accumulated cost for translation option for pruning


Pharaoh
• A beam search decoder for phrase-based models
– works with various phrase-based models
– beam search algorithm
– time complexity roughly linear with input length
– good quality takes about 1 second per sentence
• Very good performance in DARPA/NIST Evaluation
• Freely available for researchers http://www.isi.edu/licensed-sw/pharaoh/

• Coming soon: open source version of Pharaoh


Running the decoder


• An example run of the decoder:
% echo ’das ist ein kleines haus’ | pharaoh -f pharaoh.ini > out
Pharaoh v1.2.9, written by Philipp Koehn
a beam search decoder for phrase-based statistical machine translation models
(c) 2002-2003 University of Southern California
(c) 2004 Massachusetts Institute of Technology
(c) 2005 University of Edinburgh, Scotland
loading language model from europarl.srilm
loading phrase translation table from phrase-table, stored 21, pruned 0, kept 21
loaded data structures in 2 seconds
reading input sentences
translating 1 sentences.translated 1 sentences in 0 seconds
% cat out
this is a small house


Phrase Translation Table


• Core model component is the phrase translation table:
der ||| the ||| 0.3
das ||| the ||| 0.4
das ||| it ||| 0.1
das ||| this ||| 0.1
die ||| the ||| 0.3
ist ||| is ||| 1.0
ist ||| ’s ||| 1.0
das ist ||| it is ||| 0.2
das ist ||| this is ||| 0.8
es ist ||| it is ||| 0.8
es ist ||| this is ||| 0.2
ein ||| a ||| 1.0
ein ||| an ||| 1.0
klein ||| small ||| 0.8
klein ||| little ||| 0.8
kleines ||| small ||| 0.2
kleines ||| little ||| 0.2
haus ||| house ||| 1.0
alt ||| old ||| 0.8
altes ||| old ||| 0.2
gibt ||| gives ||| 1.0
es gibt ||| there is ||| 1.0
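A sketch of loading a table in this format (foreign ||| english ||| probability) into a lookup usable by a decoder:

from collections import defaultdict

def load_phrase_table(path):
    table = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            foreign, english, prob = [f.strip() for f in line.split("|||")]
            table[foreign].append((english, float(prob)))
    return table

# table["das ist"] -> [("it is", 0.2), ("this is", 0.8)]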


Trace
• Running the decoder with switch “-t”
% echo ’das ist ein kleines haus’ | pharaoh -f pharaoh.ini -t
[...]
this is |0.014086|0|1| a |0.188447|2|2| small |0.000706353|3|3|
house |1.46468e-07|4|4|

• Trace for each applied phrase translation:


– output phrase (this is)
– cost incurred by this phrase (0.014086)
– coverage of foreign words (0-1)


Reordering Example
• Sometimes phrases have to be reordered:
% echo ’ein kleines haus ist das’ | pharaoh -f pharaoh.ini -t -d 0.5
[...]
this |0.000632805|4|4| is |0.13853|3|3| a |0.0255035|0|0|
small |0.000706353|1|1| house |1.46468e-07|2|2|

• The first output phrase this is the translation of word 4 (das)


Hypothesis Accounting
• The switch “-v” allows for detailed run time information:
% echo ’das ist ein kleines haus’ | pharaoh -f pharaoh.ini -v 2
[...]
HYP: 114 added, 284 discarded below threshold, 0 pruned, 58 merged.
BEST: this is a small house -28.9234

• Statistics over how many hypotheses were generated


– 114 hypotheses were added to hypothesis stacks
– 284 hypotheses were discarded because they were too bad
– 0 hypotheses were pruned because a stack got too big
– 58 hypotheses were merged due to recombination
• Probability of the best translation: exp(-28.9234)


Translation Options
• Even more run time information is revealed with “-v 3”:
[das;2]
the<1>, pC=-0.916291, c=-5.78855
it<2>, pC=-2.30259, c=-8.0761
this<3>, pC=-2.30259, c=-8.00205
[ist;4]
is<4>, pC=0, c=-4.92223
’s<5>, pC=0, c=-6.11591
[ein;7]
a<8>, pC=0, c=-5.5151
an<9>, pC=0, c=-6.41298
[kleines;9]
small<10>, pC=-1.60944, c=-9.72116
little<11>, pC=-1.60944, c=-10.0953
[haus;10]
house<12>, pC=0, c=-9.26607
[das ist;5]
it is<6>, pC=-1.60944, c=-10.207
this is<7>, pC=-0.223144, c=-10.2906

• Translation model cost (pC) and future cost estimates (c)


Future Cost Estimation


• Pre-computation of the future cost estimates:
future costs from 0 to 0 is -5.78855
future costs from 0 to 1 is -10.207
future costs from 0 to 2 is -15.7221
future costs from 0 to 3 is -25.4433
future costs from 0 to 4 is -34.7094
future costs from 1 to 1 is -4.92223
future costs from 1 to 2 is -10.4373
future costs from 1 to 3 is -20.1585
future costs from 1 to 4 is -29.4246
future costs from 2 to 2 is -5.5151
future costs from 2 to 3 is -15.2363
future costs from 2 to 4 is -24.5023
future costs from 3 to 3 is -9.72116
future costs from 3 to 4 is -18.9872
future costs from 4 to 4 is -9.26607


Hypothesis Expansion
• Start of beam search: First hypothesis (das → the)
creating hypothesis 1 from 0 ( ... </s> <s> )
base score 0
covering 0-0: das
translated as: the => translation cost -0.916291
distance 0 => distortion cost 0
language model cost for ’the’ -2.03434
word penalty -0
score -2.95064 + futureCost -29.4246 = -32.3752
new best estimate for this stack
merged hypothesis on stack 1, now size 1


Hypothesis Expansion
• Another hypothesis (das ist → this is)
creating hypothesis 12 from 0 ( ... </s> <s> )
base score 0
covering 0-1: das ist
translated as: this is => translation cost -0.223144
distance 0 => distortion cost 0
language model cost for ’this’ -3.06276
language model cost for ’is’ -0.976669
word penalty -0
score -4.26258 + futureCost -24.5023 = -28.7649
new best estimate for this stack
merged hypothesis on stack 2, now size 2


Hypothesis Expansion
• Hypothesis recombination
creating hypothesis 27 from 3 ( ... <s> this )
base score -5.36535
covering 1-1: ist
translated as: is => translation cost 0
distance 0 => distortion cost 0
language model cost for ’is’ -0.976669
word penalty -0
score -6.34202 + futureCost -24.5023 = -30.8443
worse than existing path to 12, discarding


Hypothesis Expansion
• Bad hypothesis that falls out of the beam
creating hypothesis 52 from 6 ( ... <s> a )
base score -6.65992
covering 0-0: das
translated as: this => translation cost -2.30259
distance -3 => distortion cost -3
language model cost for ’this’ -8.69176
word penalty -0
score -20.6543 + futureCost -23.9095 = -44.5637
estimate below threshold, discarding


Generating Best Translation


• Generating best translation
– find best final hypothesis (442)
– trace back path to initial hypothesis
best hypothesis 442
[ 442 => 343 ]
[ 343 => 106 ]
[ 106 => 12 ]
[ 12 => 0 ]


Beam Size
• Trade-off between speed and quality via beam size
% echo ’das ist ein kleines haus’ | pharaoh -f pharaoh.ini -s 10 -v 2
[...]
collected 12 translation options
HYP: 78 added, 122 discarded below threshold, 33 pruned, 20 merged.
BEST: this is a small house -28.9234

Beam size Threshold Hyp. added Hyp. discarded Hyp. pruned Hyp. merged
1000 unlimited 634 0 0 1306
100 unlimited 557 32 199 572
100 0.00001 144 284 0 58
10 0.00001 78 122 33 20
1 0.00001 9 19 4 0


Limits on Reordering
• Reordering may be limited
– Monotone Translation: No reordering at all
– Only phrase movements of at most n words
• Reordering limits speed up search
• Current reordering models are weak, so limits improve translation quality


Word Lattice Generation


p=0.092
did not give
Joe
p=1 p=0.534 p=0.092
Mary did not give

did not
give
p=0.164

• Search graph can be easily converted into a word lattice


– can be further mined for n-best lists
→ enables reranking approaches
→ enables discriminative training
[Diagram: word lattice — the paths "Joe did not give" and "Mary did not give"
share the "did not give" arcs]


Sample N-Best List


• N-best list from Pharaoh:
Translation ||| Reordering LM TM WordPenalty ||| Score
this is a small house ||| 0 -27.0908 -1.83258 -5 ||| -28.9234
this is a little house ||| 0 -28.1791 -1.83258 -5 ||| -30.0117
it is a small house ||| 0 -27.108 -3.21888 -5 ||| -30.3268
it is a little house ||| 0 -28.1963 -3.21888 -5 ||| -31.4152
this is an small house ||| 0 -31.7294 -1.83258 -5 ||| -33.562
it is an small house ||| 0 -32.3094 -3.21888 -5 ||| -35.5283
this is an little house ||| 0 -33.7639 -1.83258 -5 ||| -35.5965
this is a house small ||| -3 -31.4851 -1.83258 -5 ||| -36.3176
this is a house little ||| -3 -31.5689 -1.83258 -5 ||| -36.4015
it is an little house ||| 0 -34.3439 -3.21888 -5 ||| -37.5628
it is a house small ||| -3 -31.5022 -3.21888 -5 ||| -37.7211
this is an house small ||| -3 -32.8999 -1.83258 -5 ||| -37.7325
it is a house little ||| -3 -31.586 -3.21888 -5 ||| -37.8049
this is an house little ||| -3 -32.9837 -1.83258 -5 ||| -37.8163
the house is a little ||| -7 -28.5107 -2.52573 -5 ||| -38.0364
the is a small house ||| 0 -35.6899 -2.52573 -5 ||| -38.2156
is it a little house ||| -4 -30.3603 -3.91202 -5 ||| -38.2723
the house is a small ||| -7 -28.7683 -2.52573 -5 ||| -38.294
it ’s a small house ||| 0 -34.8557 -3.91202 -5 ||| -38.7677
this house is a little ||| -7 -28.0443 -3.91202 -5 ||| -38.9563
it ’s a little house ||| 0 -35.1446 -3.91202 -5 ||| -39.0566
this house is a small ||| -7 -28.3018 -3.91202 -5 ||| -39.2139
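The last column is just the weighted sum of the feature columns; in this particular run the weights appear to be 1 for reordering, LM, and TM and 0 for the word penalty. For example, for "this is a house small":

1 · (-3) + 1 · (-31.4851) + 1 · (-1.83258) + 0 · (-5) = -36.3176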


XML Markup
Er erzielte <NUMBER english=’17.55’>17,55</NUMBER> Punkte .

• Add additional translation options


– number translation
– noun phrase translation [Koehn, 2003]
– name translation
• Additional options
– provide multiple translations
– provide probability distribution along with translations
– allow bypassing of provided translations


• Decoding

• Statistical Modeling
• EM Algorithm
• Word Alignment
• Phrase-Based Translation
• Discriminative Training
• Syntax-Based Statistical MT


Statistical Modeling
Mary did not slap the green witch

Maria no daba una bofetada a la bruja verde

[from Knight and Koehn, 2004, SMT Tutorial]


• Learn P(f|e) from a parallel corpus
• Not sufficient data to estimate P(f|e) directly


Statistical Modeling (2)


Mary did not slap the green witch

Maria no daba una bofetada a la bruja verde

• Decompose the process into smaller steps


Statistical Modeling (3)


Mary did not slap the green witch
n(3|slap)
Mary not slap slap slap the green witch
p-null
Mary not slap slap slap NULL the green witch
t(la|the)
Maria no daba una bofetada a la verde bruja
d(4|4)
Maria no daba una bofetada a la bruja verde

• Probabilities for smaller steps can be learned


Statistical Modeling (4)


• Generate a story of how an English string e gets to be a foreign string f
– choices in story are decided by reference to parameters
– e.g., p(bruja|witch)
• Formula for P (f |e) in terms of parameters
– usually long and hairy, but mechanical to extract from the story
• Training to obtain parameter estimates from possibly incomplete data
– off-the-shelf Expectation Maximization (EM)


Parallel Corpora
... la maison ... la maison bleu ... la fleur ...

... the house ... the blue house ... the flower ...

[from Knight and Koehn, 2004, SMT Tutorial]


• Incomplete data
– English and foreign words, but no connections between them
• Chicken and egg problem
– if we had the connections,
we could estimate the parameters of our generative story
– if we had the parameters,
we could estimate the connections in the data


• Decoding
• Statistical Modeling

• EM Algorithm
• Word Alignment
• Phrase-Based Translation
• Discriminative Training
• Syntax-Based Statistical MT


EM Algorithm
• Incomplete data
– if we had complete data, we could estimate the model
– if we had the model, we could fill in the gaps in the data
• EM in a nutshell
1. initialize model parameters (e.g. uniform)
2. assign probabilities to the missing data (the connections)
3. estimate model parameters from completed data
4. iterate steps 2 and 3
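A minimal IBM Model 1 style EM sketch over the toy corpus of the following slides (illustrative code, not the tutorial’s; the NULL word is omitted):

from collections import defaultdict

corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "blue", "house"], ["la", "maison", "bleu"]),
          (["the", "flower"], ["la", "fleur"])]

f_vocab = {f for _, fs in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))       # t[(f, e)], uniform initialization

for iteration in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for es, fs in corpus:
        for f in fs:                               # E-step: fractional alignment counts
            z = sum(t[(f, e)] for e in es)
            for e in es:
                count[(f, e)] += t[(f, e)] / z
                total[e] += t[(f, e)] / z
    for (f, e), c in count.items():                # M-step: relative frequencies
        t[(f, e)] = c / total[e]

print(round(t[("la", "the")], 3), round(t[("maison", "house")], 3))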


EM Algorithm (2)
... la maison ... la maison bleu ... la fleur ...

... the house ... the blue house ... the flower ...

• Initial step: all connections equally likely


• Model learns that, e.g., la is often connected with the


EM Algorithm (3)
... la maison ... la maison bleu ... la fleur ...

... the house ... the blue house ... the flower ...

• After one iteration


• Connections, e.g., between la and the are more likely


EM Algorithm (4)
... la maison ... la maison bleu ... la fleur ...

... the house ... the blue house ... the flower ...

• After another iteration


• It becomes apparent that connections, e.g., between fleur and flower are more
likely (pigeonhole principle)


EM Algorithm (5)
... la maison ... la maison bleu ... la fleur ...

... the house ... the blue house ... the flower ...

• Convergence
• Inherent hidden structure revealed by EM


EM Algorithm (6)
... la maison ... la maison bleu ... la fleur ...

... the house ... the blue house ... the flower ...

p(la|the) = 0.453
p(le|the) = 0.334
p(maison|house) = 0.876
p(bleu|blue) = 0.563
...

• Parameter estimation from the connected corpus


Flaws of Word-Based MT
• Multiple English words for one German word

one-to-many problem: Zeitmangel → lack of time


German: Zeitmangel erschwert das Problem .
Gloss: lack of time makes more difficult the problem .
Correct translation: Lack of time makes the problem more difficult.
MT output: Time makes the problem .

• Phrasal translation

non-compositional phrase: erübrigt sich → there is no point in


German: Eine Diskussion erübrigt sich demnach .
Gloss: a discussion is made unnecessary itself therefore .
Correct translation: Therefore, there is no point in a discussion.
MT output: A debate turned therefore .


Flaws of Word-Based MT (2)


• Syntactic transformations

reordering, genitive NP: der Sache → for this matter


German: Das ist der Sache nicht angemessen .
Gloss: that is the matter not appropriate .
Correct translation: That is not appropriate for this matter .
MT output: That is the thing is not appropriate .

object/subject reordering
German: Den Vorschlag lehnt die Kommission ab .
Gloss: the proposal rejects the commission off .
Correct translation: The commission rejects the proposal .
MT output: The proposal rejects the commission .


• Decoding
• Statistical Modeling
• EM Algorithm

• Word Alignment
• Phrase-Based Translation
• Discriminative Training
• Syntax-Based Statistical MT


Word Alignment
• Notion of word alignment valuable
• Shared task at NAACL 2003 and ACL 2005 workshops
[Alignment matrix: Spanish "Maria no daba una bofetada a la bruja verde" (columns)
against English "Mary did not slap the green witch" (rows)]


Word Alignment with IBM Models


• IBM Models create a many-to-one mapping
– words are aligned using an alignment function
– a function may return the same value for different input
(one-to-many mapping)
– a function cannot return multiple values for one input
(no many-to-one mapping)
• But we need many-to-many mappings


Improved Word Alignments


[Figure: three alignment matrices for "Maria no daba una bofetada a la bruja
verde" vs. "Mary did not slap the green witch" — the English-to-Spanish
alignment, the Spanish-to-English alignment, and their intersection]

• Intersection of GIZA++ bidirectional alignments


Improved Word Alignments (2)


[Figure: the intersected alignment matrix, grown with additional alignment points
taken from the union of the two directional alignments]

• Grow additional alignment points [Och and Ney, CompLing2003]


Growing Heuristic
GROW-DIAG-FINAL(e2f,f2e):
neighboring = ((-1,0),(0,-1),(1,0),(0,1),(-1,-1),(-1,1),(1,-1),(1,1))
alignment = intersect(e2f,f2e);
GROW-DIAG(); FINAL(e2f); FINAL(f2e);

GROW-DIAG():
iterate until no new points added
for english word e = 0 ... en
for foreign word f = 0 ... fn
if ( e aligned with f )
for each neighboring point ( e-new, f-new ):
if ( ( e-new not aligned and f-new not aligned ) and
( e-new, f-new ) in union( e2f, f2e ) )
add alignment point ( e-new, f-new )
FINAL(a):
for english word e-new = 0 ... en
for foreign word f-new = 0 ... fn
if ( ( e-new not aligned or f-new not aligned ) and
( e-new, f-new ) in alignment a )
add alignment point ( e-new, f-new )


• Decoding
• Statistical Modeling
• EM Algorithm
• Word Alignment

• Phrase-Based Translation
• Discriminative Training
• Syntax-Based Statistical MT


Phrase-Based Translation
Morgen fliege ich nach Kanada zur Konferenz

Tomorrow I will fly to the conference in Canada

• Foreign input is segmented in phrases


– any sequence of words, not necessarily linguistically motivated
• Each phrase is translated into English
• Phrases are reordered
• See [Koehn et al., NAACL2003] as introduction


Advantages of Phrase-Based Translation


• Many-to-many translation can handle non-compositional phrases
• Use of local context in translation
• The more data, the longer phrases can be learned


Phrase-Based Systems
• A number of research groups developed phrase-based systems
– RWTH Aachen – Univ. of Southern California/ISI – CMU
– IBM – Johns Hopkins U. – Cambridge U. – U. of Catalunya
– ITC-irst – Edinburgh U. – U. of Maryland – U. Valencia
• Systems differ in
– training methods
– model for phrase translation table
– reordering models
– additional feature functions
• Currently best method for SMT (MT?)
– top systems in DARPA/NIST evaluation are phrase-based
– best commercial system for Arabic-English is phrase-based


Phrase Translation Table


• Phrase Translations for den Vorschlag

English φ(e|f) English φ(e|f)


the proposal 0.6227 the suggestions 0.0114
’s proposal 0.1068 the proposed 0.0114
a proposal 0.0341 the motion 0.0091
the idea 0.0250 the idea of 0.0091
this proposal 0.0227 the proposal , 0.0068
proposal 0.0205 its proposal 0.0068
of the proposal 0.0159 it 0.0068
the proposals 0.0159 ... ...


How to Learn the Phrase Translation Table?


• Start with the word alignment:
[Alignment matrix: "Maria no daba una bofetada a la bruja verde" against
"Mary did not slap the green witch"]

• Collect all phrase pairs that are consistent with the word alignment


Consistent with Word Alignment


[Diagram: three candidate phrase pairs over the "Maria no daba" / "Mary did not
slap" alignment — one consistent, two inconsistent (marked X: an alignment point
links a word inside the phrase box to a word outside it)]

• Consistent with the word alignment :=
the phrase alignment has to contain all alignment points for all covered words

(e, f) ∈ BP ⇔ ∀ ei ∈ e : (ei, fj) ∈ A → fj ∈ f
          and ∀ fj ∈ f : (ei, fj) ∈ A → ei ∈ e
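In code, the definition is a one-pass check over the alignment points — a sketch (inclusive word spans, alignment A as a set of (e, f) index pairs):

def consistent(A, e_start, e_end, f_start, f_end):
    """True iff the phrase pair is consistent with alignment A and non-empty."""
    has_point = False
    for (e, f) in A:
        e_in = e_start <= e <= e_end
        f_in = f_start <= f <= f_end
        if e_in != f_in:          # a link leaves the phrase pair: inconsistent
            return False
        has_point = has_point or e_in
    return has_point

Enumerating all spans and keeping the consistent ones yields exactly the phrase lists on the following slides.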


Word Alignment Induced Phrases


[Alignment matrix as above]

(Maria, Mary), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green)


Word Alignment Induced Phrases (2)


[Alignment matrix as above]

(Maria, Mary), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the),
(bruja verde, green witch)


Word Alignment Induced Phrases (3)


[Alignment matrix as above]

(Maria, Mary), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the),
(bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch)


Word Alignment Induced Phrases (4)


[Alignment matrix as above]

(Maria, Mary), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the),
(bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the),
(daba una bofetada a la bruja verde, slap the green witch)


Word Alignment Induced Phrases (5)


[Alignment matrix as above]

(Maria, Mary), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the),
(bruja verde, green witch), (Maria no daba una bofetada, Mary did not slap),
(no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the), (daba una bofetada a la bruja verde,
slap the green witch), (no daba una bofetada a la bruja verde, did not slap the green witch),
(Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)


Probability Distribution of Phrase Pairs


• We need a probability distribution φ(f|e) over the collected phrase pairs
⇒ Possible choices
– relative frequency of collected phrases: φ(f|e) = count(f, e) / Σf′ count(f′, e)
– or, conversely, φ(e|f)
– use lexical translation probabilities


Reordering
• Monotone translation
– do not allow any reordering
→ worse translations
• Limiting reordering (to movement over max. number of words) helps
• Distance-based reordering cost
– moving a foreign phrase over n words: cost ω^n
• Lexicalized reordering model


Lexicalized Reordering Models


[Diagram: phrase alignment grid f1-f7 × e1-e6, each phrase labeled with its
orientation: m (monotone), s (swap), d (discontinuous)]

[from Koehn et al., 2005, IWSLT]


• Three orientation types: monotone, swap, discontinuous
• Probability p(swap|e, f ) depends on foreign (and English) phrase involved


Training

[Diagram: extracting the orientation of a phrase pair — is there an alignment
point to its top left (monotone) or top right (swap)?]

[from Koehn et al., 2005, IWSLT]


• Orientation type is learned during phrase extraction
• Alignment point to the top left (monotone) or top right (swap)?
• For more, see [Tillmann, 2003] or [Koehn et al., 2005]


• Decoding
• Statistical Modeling
• EM Algorithm
• Word Alignment
• Phrase-Based Translation

• Discriminative Training
• Syntax-Based Statistical MT


Log-Linear Models
• IBM Models provided mathematical justification for factoring components
together
pLM × pTM × pD
• These may be weighted
pLM^λLM × pTM^λTM × pD^λD
• Many components pi with weights λi
⇒ ∏i pi^λi = exp(∑i λi log(pi))
⇒ log ∏i pi^λi = ∑i λi log(pi)


Knowledge Sources
• Many different knowledge sources useful
– language model
– reordering (distortion) model
– phrase translation model
– word translation model
– word count
– phrase count
– drop word feature
– phrase pair frequency
– additional language models
– additional features


Set Feature Weights


• Contribution of components pi determined by weight λi
• Methods
– manual setting of weights: try a few, take best
– automate this process
• Learn weights
– set aside a development corpus
– set the weights, so that optimal translation performance on this
development corpus is achieved
– requires automatic scoring method (e.g., BLEU)
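A toy sketch of this weight-learning loop (random search stands in here for minimum error rate training, which appears a few slides below; decode and score are assumed to exist):

import random

def tune(decode, dev_inputs, dev_refs, score, n_features, trials=100):
    best_w, best_score = None, float("-inf")
    for _ in range(trials):
        w = [random.uniform(-1, 1) for _ in range(n_features)]
        outputs = [decode(x, w) for x in dev_inputs]   # translate the dev set
        s = score(outputs, dev_refs)                   # e.g., corpus BLEU
        if s > best_score:
            best_w, best_score = w, s
    return best_w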


Learn Feature Weights


[Diagram: tuning loop — the model with current feature weights generates an
n-best list, the translations are scored, and the feature weights are changed so
that good translations move up the list]


Discriminative vs. Generative Models


• Generative models
– translation process is broken down to steps
– each step is modeled by a probability distribution
– each probability distribution is estimated from the data by maximum
likelihood
• Discriminative models
– model consists of a number of features (e.g. the language model score)
– each feature has a weight, measuring its value for judging a translation as
correct
– feature weights are optimized on development data, so that the system
output matches correct translations as closely as possible


Discriminative Training (2)


• Training set (development set)
– different from original training set
– small (maybe 1000 sentences)
– must be different from test set
• Current model translates this development set
– n-best list of translations (n=100, 10000)
– translations in n-best list can be scored
• Feature weights are adjusted
• N-Best list generation and feature weight adjustment repeated for a number
of iterations


Learning Task
• Task: find weights, so that the feature vector of the correct translation is
ranked first
TRANSLATION LM TM WP SER

1 Mary not give slap witch green . -17.2 -5.2 -7 1


2 Mary not slap the witch green . -16.3 -5.7 -7 1
3 Mary not give slap of the green witch . -18.1 -4.9 -9 1
4 Mary not give of green witch . -16.5 -5.1 -8 1
5 Mary did not slap the witch green . -20.1 -4.7 -8 1
6 Mary did not slap green witch . -15.5 -3.2 -7 1
7 Mary not slap of the witch green . -19.2 -5.3 -8 1
8 Mary did not give slap of witch green . -23.2 -5.0 -9 1
9 Mary did not give slap of the green witch . -21.8 -4.4 -10 1
10 Mary did slap the witch green . -15.5 -6.9 -7 1
11 Mary did not slap the green witch . -17.4 -5.3 -8 0
12 Mary did slap witch green . -16.9 -6.9 -6 1
13 Mary did slap the green witch . -14.3 -7.1 -7 1
14 Mary did not slap the of green witch . -24.2 -5.3 -9 1
15 Mary did not give slap the witch green . -25.2 -5.5 -9 1

rank translation feature vector


Methods to Adjust Feature Weights


• Maximum entropy [Och and Ney, ACL2002]
– match expectation of feature values of model and data
• Minimum error rate training [Och, ACL2003]
– try to rank best translations first in n-best list
– can be adapted for various error metrics, even BLEU
• Ordinal regression [Shen et al., NAACL2004]
– separate k worst from the k best translations


Discriminative Training: Outlook


• Many more features
• Discriminative training on entire training set
• Reranking vs. decoding
– reranking: expensive, global features possible
– decoding: integrating features in search reduces search errors
⇒ First decoding, then reranking


• Decoding
• Statistical Modeling
• EM Algorithm
• Word Alignment
• Phrase-Based Translation
• Discriminative Training

• Syntax-Based Statistical MT


Syntax-based SMT
• Why Syntax?
• Yamada and Knight: translating into trees
• Wu: tree-based transfer
• Chiang: hierarchical transfer
• Collins, Kucerova, and Koehn: clause structure
• Koehn: factored translation models
• Other approaches


The Challenge of Syntax


[Diagram: the classical machine translation pyramid — foreign words up through
foreign syntax and foreign semantics to the interlingua, then down through
English semantics and English syntax to English words]

• The classical machine translation pyramid


Advantages of Syntax-Based Translation


• Reordering for syntactic reasons
– e.g., move German object to end of sentence
• Better explanation for function words
– e.g., prepositions, determiners
• Conditioning to syntactically related words
– translation of verb may depend on subject or object
• Use of syntactic language models
– ensuring grammatical output


Syntactic Language Model


• Good syntax tree → good English
• Allows for long distance constraints

[Diagram: two parse trees — a well-formed tree for "the house of the man is
small" and a defective one for "the house is the man is small"]

• Left translation preferred by syntactic LM


String to Tree Translation


[Diagram: the machine translation pyramid again, with translation mapping
foreign words to English syntax trees]

• Use of English syntax trees [Yamada and Knight, 2001]


– exploit rich resources on the English side
– obtained with statistical parser [Collins, 1997]
– flattened tree to allow more reorderings
– works well with syntactic language model


Yamada and Knight [2001]


[Diagram: the same derivation as on the earlier slide — reorder child nodes,
insert function words, translate the leaf words, take the leaves — turning the
English parse of "he adores listening to music" into "Kare ha ongaku wo kiku no
ga daisuki desu"]
[from Yamada and Knight, 2001]


Reordering Table
Original Order Reordering p(reorder|original)
PRP VB1 VB2 PRP VB1 VB2 0.074
PRP VB1 VB2 PRP VB2 VB1 0.723
PRP VB1 VB2 VB1 PRP VB2 0.061
PRP VB1 VB2 VB1 VB2 PRP 0.037
PRP VB1 VB2 VB2 PRP VB1 0.083
PRP VB1 VB2 VB2 VB1 PRP 0.021
VB TO VB TO 0.107
VB TO TO VB 0.893
TO NN TO NN 0.251
TO NN NN TO 0.749
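Reading the table: each row gives p(reordering | original child order), and the distributions for a given original order sum to one. A sketch of sampling a reordering from it (values copied from the table above; illustrative code):

import random

reorder_table = {
    ("PRP", "VB1", "VB2"): {("PRP", "VB2", "VB1"): 0.723, ("VB2", "PRP", "VB1"): 0.083,
                            ("PRP", "VB1", "VB2"): 0.074, ("VB1", "PRP", "VB2"): 0.061,
                            ("VB1", "VB2", "PRP"): 0.037, ("VB2", "VB1", "PRP"): 0.021},
    ("VB", "TO"): {("TO", "VB"): 0.893, ("VB", "TO"): 0.107},
    ("TO", "NN"): {("NN", "TO"): 0.749, ("TO", "NN"): 0.251},
}

def sample_reordering(children):
    dist = reorder_table[tuple(children)]
    orders, probs = zip(*dist.items())
    return random.choices(orders, weights=probs)[0]

print(sample_reordering(["VB", "TO"]))    # ('TO', 'VB') with probability 0.893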


Decoding as Parsing
• Chart Parsing

PRP
he
kare ha ongaku wo kiku no ga daisuki desu

• Pick Japanese words


• Translate into tree stumps


Decoding as Parsing
• Chart Parsing

PRP NN TO
he music to
kare ha ongaku wo kiku no ga daisuki desu

• Pick Japanese words


• Translate into tree stumps


Decoding as Parsing
PP

PRP NN TO
he music to
kare ha ongaku wo kiku no ga daisuki desu

• Adding some more entries...


Decoding as Parsing
PP

PRP NN TO VB
he music to listening
kare ha ongaku wo kiku no ga daisuki desu

• Combine entries


Decoding as Parsing
VB2

PP

PRP NN TO VB
he music to listening
kare ha ongaku wo kiku no ga daisuki desu


Decoding as Parsing
VB2

PP

PRP NN TO VB VB1
he music to listening adores
kare ha ongaku wo kiku no ga daisuki desu


Decoding as Parsing
VB

VB2

PP

PRP NN TO VB VB1
he music to listening adores
kare ha ongaku wo kiku no ga daisuki desu

• Finished when all foreign words covered


Yamada and Knight: Training


• Parsing of the English side
– using Collins statistical parser
• EM training
– translation model is used to map training sentence pairs
– EM training finds low-perplexity model
→ unity of training and decoding as in IBM models


Is the Model Realistic?


• Do English trees match foreign strings?
• Crossings between French-English [Fox, 2002]
– 0.29-6.27 per sentence, depending on how it is measured
• Can be reduced by
– flattening tree, as done by [Yamada and Knight, 2001]
– detecting phrasal translation
– special treatment for small number of constructions
• Most coherence between dependency structures


Inversion Transduction Grammars


• Generation of both English and foreign trees [Wu, 1997]
• Rules (binary and unary)
– A → A1 A2 ‖ A1 A2
– A → A1 A2 ‖ A2 A1
– A → e ‖ f
– A → e ‖ ∗
– A → ∗ ‖ f
⇒ Common binary tree required
– limits the complexity of reorderings


Syntax Trees

Mary did not slap the green witch

• English binary tree


Syntax Trees (2)

Maria no daba una bofetada a la bruja verde

• Spanish binary tree


Syntax Trees (3)

Mary did not slap * * * the green witch


Maria * no daba una bofetada a la verde bruja

• Combined tree with reordering of Spanish


Inversion Transduction Grammars


• Decoding by parsing (as before)
• Variations
– may use real syntax on either side or both
– may use multi-word units at leaf nodes


Chiang: Hierarchical Phrase Model


• Chiang [ACL, 2005] (best paper award!)
– context-free bi-grammar (a synchronous context-free grammar)
– one non-terminal symbol
– right-hand side of a rule may include non-terminals and terminals
• Competitive with phrase-based models in 2005 DARPA/NIST evaluation


Types of Rules
• Word translation
– X → maison | house
• Phrasal translation
– X → daba una bofetada | slap
• Mixed non-terminal / terminal
– X → X bleue | blue X
– X → ne X pas | not X
– X → X1 X2 | X2 of X1
• Technical rules
– S → S X | S X
– S → X | X
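A minimal sketch of how such rules might be represented and applied during decoding; the Rule class and the integer gap encoding are illustrative assumptions, not Chiang's actual data structures.

from dataclasses import dataclass

@dataclass
class Rule:
    src: list   # source tokens; integers stand for numbered gaps (X1, X2, ...)
    tgt: list   # target tokens with the same gap numbering

RULES = [
    Rule(["ne", 0, "pas"], ["not", 0]),   # ne X pas | not X
    Rule([0, 1], [1, "of", 0]),           # X1 X2 | X2 of X1
    Rule(["maison"], ["house"]),          # word translation
]

def apply_rule(rule, fillers):
    """Substitute already-translated fillers into a rule's target side."""
    out = []
    for t in rule.tgt:
        out.extend(fillers[t] if isinstance(t, int) else [t])
    return out

# "ne va pas" with the gap translated as "go" -> ['not', 'go']
print(apply_rule(RULES[0], {0: ["go"]}))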


147

Learning Hierarchical Rules


[Alignment grid: Maria no daba una bofetada a la bruja verde ↔ Mary did not slap the green witch. The aligned phrase pair (bruja verde ↔ green witch) contains the smaller pair (bruja ↔ witch); replacing the inner pair with a gap yields the rule below.]

X → X verde | green X



148

Learning Hierarchical Rules


[Same alignment grid: the phrase pair (a la bruja verde ↔ the green witch) contains (bruja verde ↔ green witch); replacing the inner pair with a gap yields the rule below.]

X → a la X | the X
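The extraction idea in both examples is subtraction: take an aligned phrase pair, find a smaller phrase pair nested inside it, and replace the inner pair with a gap. A toy sketch of that step (a deliberate simplification of Chiang's full extraction procedure):

def subtract(outer_src, outer_tgt, inner_src, inner_tgt):
    """Turn a nested phrase pair into a gap, yielding a hierarchical rule."""
    def gap(seq, inner):
        for start in range(len(seq) - len(inner) + 1):
            if seq[start:start + len(inner)] == inner:
                return seq[:start] + ["X"] + seq[start + len(inner):]
        raise ValueError("inner phrase not found in outer phrase")
    return gap(outer_src, inner_src), gap(outer_tgt, inner_tgt)

src, tgt = subtract(["a", "la", "bruja", "verde"], ["the", "green", "witch"],
                    ["bruja", "verde"], ["green", "witch"])
print("X →", " ".join(src), "|", " ".join(tgt))   # X → a la X | the X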


149

Details of Chiang’s Model


• Too many rules
→ filtering of rules necessary
• Efficient parse decoding possible
– hypothesis stack for each span of foreign words
– only one non-terminal → hypotheses are comparable
– length limit for spans that do not start at the beginning of the sentence
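A sketch of the span-indexed hypothesis stacks mentioned above; the representation, scores, and beam size are assumptions for illustration.

from collections import defaultdict

stacks = defaultdict(list)          # (i, j) -> hypotheses over span i..j

def add_hypothesis(i, j, english, score, beam=100):
    """Insert a hypothesis into its span's stack and prune to the beam."""
    stack = stacks[(i, j)]
    stack.append((score, english))
    stack.sort(reverse=True)        # best score first; directly comparable
    del stack[beam:]                # because there is only one non-terminal

add_hypothesis(0, 3, "listening to music", -2.3)
add_hypothesis(0, 3, "to listen to music", -4.1)
print(stacks[(0, 3)][0])            # best hypothesis over span (0, 3)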



150

Clause Level Restructuring [Collins et al.]


• Why clause structure?
– languages differ vastly in their clause structure
(English: SVO, Arabic: VSO, German: fairly free order;
many details differ: position of adverbs, subordinate clauses, etc.)
– large-scale restructuring is a problem for phrase models
• Restructuring
– reordering of constituents (main focus)
– adding/dropping/changing function words
• For details, see [Collins, Kucerova and Koehn, ACL 2005]


151

Clause Structure
[Parse tree, flattened here. Main clause: "Ich werde Ihnen die entsprechenden Anmerkungen aushaendigen" (I will pass on to you the corresponding comments); subordinate clause: "damit Sie das eventuell bei der Abstimmung uebernehmen koennen" (so that you can perhaps include that in the vote). Nodes carry TIGER-style category and function labels, e.g. PPER-SB (subject pronoun), VAFIN-HD (finite verb head), NP-OA (accusative object NP), S-MO (modifier clause).]

• Syntax tree from a German parser

– statistical parser by Amit Dubey, trained on the TIGER treebank



152

Reordering When Translating


[Same sentence with the tree flattened to clause level: Ich | werde | Ihnen | die entsprechenden Anmerkungen | aushaendigen | , | damit | Sie | das | eventuell | bei der Abstimmung | uebernehmen | koennen | . Each constituent lines up with an English counterpart: I | will | you | the corresponding comments | pass on | , | so that | you | that | perhaps | in the vote | include | can | .]

• Reordering when translating into English


– tree is flattened
– clause level constituents line up


153

Clause Level Reordering


[Constituents labeled with their English order. Main clause: Ich (1) | werde (2) | Ihnen (4) | die entsprechenden Anmerkungen (5) | aushaendigen (3). Subordinate clause: damit (1) | Sie (2) | das (6) | eventuell (4) | bei der Abstimmung (7) | uebernehmen (5) | koennen (3).]

• Clause level reordering is a well-defined task

– label German constituents with their English order
– done for 300 sentences by two annotators, with high agreement



154

Systematic Reordering German → English


• Many types of reorderings are systematic
– move the verb group together
– restore subject-verb-object order
– move negation in front of the verb
⇒ Write such rules by hand (a sketch follows the results table below)
– apply rules to test and training data
– train a standard phrase-based SMT system

System BLEU
baseline system 25.2%
with manual rules 26.8%
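To give the flavor of such hand-written rules, here is a toy version of one of them: moving the clause-final verb cluster of a German subordinate clause up behind the subject. The token format and the rule itself are illustrative simplifications, not the exact rules of the paper.

def reorder_subordinate(clause):
    """Move the clause-final verb cluster behind the subject (toy rule).

    clause: list of (POS tag, word) pairs for one subordinate clause."""
    verbs = [t for t in clause if t[0].startswith(("VV", "VM", "VA"))]
    rest = [t for t in clause if not t[0].startswith(("VV", "VM", "VA"))]
    head, tail = rest[:2], rest[2:]     # complementizer + subject stay put
    return head + verbs[::-1] + tail    # finite verb first, as in English

clause = [("KOUS", "damit"), ("PPER", "Sie"), ("PDS", "das"),
          ("ADJD", "eventuell"), ("APPR", "bei"), ("ART", "der"),
          ("NN", "Abstimmung"), ("VVINF", "uebernehmen"),
          ("VMFIN", "koennen")]
print(" ".join(w for _, w in reorder_subordinate(clause)))
# damit Sie koennen uebernehmen das eventuell bei der Abstimmung
# (cf. English: so that you can include that perhaps in the vote)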


155

Improved Translations
• we must also this criticism should be taken seriously .
→ we must also take this criticism seriously .

• i am with him that it is necessary , the institutional balance by means of a political revaluation
of both the commission and the council to maintain .
→ i agree with him in this , that it is necessary to maintain the institutional balance by means of
a political revaluation of both the commission and the council .

• thirdly , we believe that the principle of differentiation of negotiations note .


→ thirdly , we maintain the principle of differentiation of negotiations .

• perhaps it would be a constructive dialog between the government and opposition parties ,
social representative a positive impetus in the right direction .
→ perhaps a constructive dialog between government and opposition parties and social
representative could give a positive impetus in the right direction .



156

Factored Translation Models


• Factored representation of words

[Diagram: source and target words are each represented as a vector of factors (surface form, stem, part-of-speech, morphology, word class, ...) and the model maps source factors to target factors]

• Goals
– Generalization, e.g. by translating stems, not surface forms
– Additional information within model (using syntax for reordering, language
modeling)
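A factored representation is easy to sketch as a record of per-word annotations; the field names below are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class FactoredWord:
    surface: str
    stem: str
    pos: str
    morphology: str

w = FactoredWord(surface="Häuser", stem="Haus", pos="NN",
                 morphology="neut-plural-nominative")
# the model can now choose which factors to translate, reorder on, or
# score with a language model, instead of treating "Häuser" as one token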


157

Decomposing Translation: Example


• Translating stem and syntactic information separately

stem ⇒ stem
part-of-speech, morphology ⇒ part-of-speech, morphology

• Generate the surface form on the target side

stem + part-of-speech + morphology ⇒ surface
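A toy sketch of this decomposition, with invented one-entry mapping tables: the stem and the syntactic factors are translated independently, and a generation table produces the target surface form.

# Toy mapping tables (invented entries, continuing the example above)
STEM_TABLE = {"Haus": "house"}
MORPH_TABLE = {("NN", "plural"): ("NN", "plural")}
GENERATE = {("house", "NN", "plural"): "houses"}   # target-side generation

def translate_word(stem, pos, morph):
    tgt_stem = STEM_TABLE[stem]                        # translation step 1
    tgt_pos, tgt_morph = MORPH_TABLE[(pos, morph)]     # translation step 2
    return GENERATE[(tgt_stem, tgt_pos, tgt_morph)]    # generation step

print(translate_word("Haus", "NN", "plural"))   # houses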



158

Factored Models: Open Questions


• What is the best decomposition into translation and generation steps?
• What information is useful?
– translation: mostly lexical, or stems for richer statistics
– reordering: syntactic information useful
– language model: syntactic information for overall grammatical coherence
• Use of annotation tools
• Use of automatically discovered generalizations (word classes)
• Back-off models (use complex mappings, if available)


159

Other Syntax-Based Approaches


• ISI: extending work of Yamada/Knight
– more complex rules
– performance approaching phrase-based
• Prague: Translation via dependency structures
– parallel Czech–English dependency treebank
– tecto-grammatical translation model [EACL 2003]
• U.Alberta/Microsoft: treelet translation
– translating from English into foreign languages
– using dependency parser in English
– project dependency tree into foreign language for training
– map parts of the dependency tree (“treelets”) into foreign languages



160

Other Syntax-Based Approaches (2)


• Reranking phrase-based SMT output with syntactic features
– create n-best list with phrase-based system
– POS tag and parse candidate translations
– rerank with syntactic features
– see [Koehn, 2003] and JHU Workshop [Och et al., 2003]
• JHU Summer workshop 2005
– Genpar: tool for syntax-based SMT
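A sketch of the reranking step listed above: score each n-best candidate by a weighted sum of its model score and syntactic features, then keep the best. All names, values, and weights below are invented for illustration; parse_prob stands in for a feature computed by POS-tagging and parsing each candidate translation.

nbest = [
    {"translation": "we must also this criticism seriously take .",
     "model_score": -10.2, "parse_prob": -80.5},
    {"translation": "we must also take this criticism seriously .",
     "model_score": -10.9, "parse_prob": -60.1},
]
weights = {"model_score": 1.0, "parse_prob": 0.05}

def rerank(nbest, weights):
    """Return the candidate with the best weighted feature sum."""
    return max(nbest, key=lambda h: sum(weights[f] * h[f] for f in weights))

print(rerank(nbest, weights)["translation"])
# -> "we must also take this criticism seriously ."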


161

Syntax: Does it help?


• Not yet
– best systems still phrase-based, treat words as tokens
• Well, maybe...
– work on reordering German
– automatically trained tree transfer systems promising
• Why not yet?
– if real syntax, we need good parsers — are they good enough?
– syntactic annotations add a level of complexity
→ difficult to handle, slow to train and decode
– few researchers are skilled in both statistical modeling and syntactic theory
