Radu ION
Research Institute for AI,
Romanian Academy
Calea 13 Septembrie nr. 13
Bucharest, 050711, Romania
radu@racai.ro
Alexandru CEAUȘU
Dublin City University
Glasnevin, Dublin 9, Ireland
aceausu@computing.dcu.ie
Elena IRIMIA
Research Institute for AI,
Romanian Academy
Calea 13 Septembrie nr. 13
Bucharest, 050711, Romania
elena@racai.ro

Abstract

Introduction
A comparable corpus differs essentially from a parallel corpus in that its textual units do not follow a translation order, an order which otherwise greatly reduces the word alignment search space in a parallel corpus. Given this limitation of comparable corpora in general, and the sizes of the comparable corpora that we will have to deal with in particular, we have devised a variant of the Expectation Maximization (EM) algorithm (Dempster et al., 1977) that generates a 1:1 (p = 1) document assignment from a parallel and/or comparable corpus using only pre-existing translation lexicons. Its generality permits it to perform the same task on other textual units such as paragraphs or sentences.
In what follows, we briefly review the literature on document/paragraph alignment and then present the derivation of the EM algorithm that generates 1:1 document alignments. We end the article with a thorough evaluation of the performance of this algorithm and the conclusions that arise from this evaluation.
Related Work
EMACC
We propose an EM algorithm for aligning different types of textual units (documents, paragraphs, and sentences) which we name EMACC (an acronym for Expectation Maximization Alignment for Comparable Corpora). We draw our inspiration from the famous IBM models for word alignment (specifically, the IBM-1 model) (Brown et al., 1993), in which the translation probability is modeled through an EM algorithm whose hidden variable $a$ encodes the assignment (1:1 word alignments) from the French sequence of words to the English one.
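To make the analogy concrete, here is a minimal sketch of the IBM-1 EM estimation of the translation probabilities t(f|e) over toy sentence pairs; it illustrates the model the analogy builds on and is not the formulation of Brown et al. (1993), and all names in it are illustrative:

    from collections import defaultdict

    def ibm1_em(sentence_pairs, iterations=10):
        """EM estimation of IBM-1 translation probabilities t(f|e).

        sentence_pairs: list of (french_tokens, english_tokens) pairs.
        """
        t = defaultdict(lambda: 1.0)  # uniform start: every pair equally likely
        for _ in range(iterations):
            count = defaultdict(float)  # expected counts c(f, e)
            total = defaultdict(float)  # expected counts c(e)
            for fr, en in sentence_pairs:
                for f in fr:
                    # E-step: distribute the alignment mass of f over the
                    # English words of the sentence (the hidden variable a).
                    norm = sum(t[f, e] for e in en)
                    for e in en:
                        delta = t[f, e] / norm
                        count[f, e] += delta
                        total[e] += delta
            # M-step: re-estimate t(f|e) from the expected counts.
            for (f, e), c in count.items():
                t[f, e] = c / total[e]
        return t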
By analogy, we imagine that between two sets of documents (from now on we will refer to documents as our textual units, but what we present here is equally applicable, with different performance penalties, to paragraphs and/or sentences), let us call them $\mathbf{D}_s$ and $\mathbf{D}_t$, there is an assignment (a sequence of 1:1 document correspondences) whose distribution can be modeled by a hidden variable taking values in the set {true, false}. This assignment is largely determined by the existence of word translations between a pair of documents, translations that can differ from one another in their ability to indicate a correct document alignment versus an incorrect one. In other words, we hypothesize that certain pairs of translation equivalents are better indicators of a correct document correspondence than other translation equivalent pairs.
We take the general formulation and derivation of the EM optimization problem from (Borman, 2009). The general goal is to maximize $P(\mathbf{X} \mid \Theta)$, that is, to find the parameters $\Theta$ for which $P(\mathbf{X} \mid \Theta)$ is maximum. In a sequence of derivations that we are not going to repeat here, the EM equation (1) is given by:

$$\Theta^{n+1} = \underset{\Theta}{\operatorname{argmax}} \sum_{z} P(z \mid \mathbf{X}, \Theta^{n}) \ln P(\mathbf{X}, z \mid \Theta) \qquad (1)$$

where $\Theta^{n}$ is the set of parameters found at step $n$ and $z$ is the hidden variable. At step $n+1$, we try to obtain a new set of parameters $\Theta^{n+1}$ that is going to maximize (the maximization step) the sum over $z$ (the expectation step), which in its turn depends on the best set of parameters $\Theta^{n}$ obtained at step $n$. Thus, in principle, the algorithm should iterate over a set of parameters, compute the expectation expression for each of these parameters, and choose the parameters for which the expression has the largest value. But, as we will see, in practice the set of all possible parameters has a dimension that is exponential in the number of parameters. This renders the problem intractable, and one should back off to heuristic searches in order to find a near-optimal solution.

Having the equation of the EM algorithm, the next task is to tailor it to the problem at hand: document alignment. But before doing so, let us introduce a few notations that we will operate with:

- $\langle d_i^s, d_j^t \rangle$ is a pair of documents, $d_i^s \in \mathbf{D}_s$ and $d_j^t \in \mathbf{D}_t$;
- $\langle w^s, w^t \rangle$ is a pair of translation equivalents such that $w^s$ is a lexical item that belongs to $d_i^s$ and $w^t$ is a lexical item that belongs to $d_j^t$;
- $\mathbf{A}$ is the set of all possible 1:1 assignments between $\mathbf{D}_s$ and $\mathbf{D}_t$, the size of which is a function of the factorial of the integer $n$ and of the binomial coefficient $\binom{n}{k}$ (see the sketch after this list). It is clear that, with this kind of dimension of the set of all possible assignments, we cannot simply iterate over it in order to choose the assignment that maximizes the expectation;
- $z \in \{\mathit{true}, \mathit{false}\}$ is the hidden variable that signals whether a pair of documents $\langle d_i^s, d_j^t \rangle$ represents a correct alignment or not.
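The dimension problem can be made concrete with a short sketch, assuming two equal-sized document sets: an assignment of $k$ pairs chooses $k$ documents on each side and then one of the $k!$ ways to pair them.

    from math import comb, factorial

    def num_assignments(n, k):
        """Number of 1:1 assignments of k pairs between two n-document sets:
        C(n, k) choices on each side, times k! ways to pair the choices."""
        return comb(n, k) ** 2 * factorial(k)

    print(num_assignments(100, 100))  # 100! = ~9.3e157 full assignments

Even for two sets of only 100 documents each, the number of full 1:1 assignments already exceeds the number of atoms in the observable universe.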
With these notations, we instantiate equation (1) for document alignment: the observed data are the two sets of documents, the hidden variable $z$ judges the correctness of each candidate document pair, and the parameters $\Theta$ are the lexical document alignment probabilities. The expectation to be maximized at step $n+1$ thus becomes equation (2):

$$\Theta^{n+1} = \underset{\Theta}{\operatorname{argmax}} \sum_{\langle d_i^s, d_j^t \rangle} \sum_{z} P\big(z \mid \langle d_i^s, d_j^t \rangle, \Theta^{n}\big) \ln P\big(\langle d_i^s, d_j^t \rangle, z \mid \Theta\big) \qquad (2)$$

Expanding the logarithm over the translation equivalent pairs $\langle w^s, w^t \rangle$ found in a document pair and imposing the normalization constraints yields, as equation (3), the re-estimation of each translation pair probability from the contributions it makes to the document pairs of the current best assignment,
where $P(w^s, w^t)$ is the simplified lexical document alignment probability, which is initially equal to $p(w^s, w^t)$ from the translation lexicon. This probability is to be read as the contribution $\langle w^s, w^t \rangle$ makes to the correctness of the $\langle d_i^s, d_j^t \rangle$ alignment. We want the alignment contribution of one translation equivalents pair to distribute over the set of all possible document pairs, thus enforcing equation (4):

$$\sum_{\langle d_i^s, d_j^t \rangle \in \mathbf{D}_s \times \mathbf{D}_t} P\big(w^s, w^t \mid \langle d_i^s, d_j^t \rangle\big) = 1 \qquad (4)$$

The table of document alignment probabilities is initialized with the lexical measure in equation (5):

$$P_{\mathrm{init}}\big(d_i^s, d_j^t\big) = \frac{1}{\lvert d_i^s \rvert} \sum_{\substack{w^s \in d_i^s \\ \exists \langle w^s, w^t \rangle:\ w^t \in d_j^t}} f\big(w^s, d_i^s\big) \qquad (5)$$
where $\lvert d_i^s \rvert$ is the number of words in document $d_i^s$ and $f(w^s, d_i^s)$ is the frequency of word $w^s$ in document $d_i^s$. If every word in the source document has at least one translation (of a given threshold probability score) in the target document, then this measure is 1. We normalize the table initialized using this measure with equation (6).
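Read procedurally, the measure in equation (5) is the fraction of source tokens covered by the lexicon in the target document. A minimal sketch, assuming the translation lexicon is a dict from a source word to the set of its translations above the probability threshold:

    def lexical_overlap(source_doc, target_doc, lexicon):
        """Equation (5): mass of source tokens with at least one lexicon
        translation occurring in the target document, divided by |d_s|."""
        target_vocab = set(target_doc)
        covered = sum(1 for w in source_doc
                      if lexicon.get(w, set()) & target_vocab)
        return covered / len(source_doc)

Summing 1 per covered token is the same as summing the frequencies $f(w^s, d_i^s)$ over covered word types, so the measure is 1 exactly when every source word has a translation in the target document.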
EMACC finds only 1:1 textual unit alignments in its present form, but a document pair can easily be extended to a document bead, following the example from (Chen, 1993). The main difference between the algorithm described by Chen and ours is that the search procedure reported there is invalid for comparable corpora, in which no pruning is available due to the nature of the corpus. A second very important difference is that Chen relies only on lexical alignment information, on the parallel nature of the corpus and on sentence length correlations, while we add the probability of the whole assignment which, when initially set to the D2 distribution, produces a significant boost in the precision of the alignment. Finally, the maximization equations are quite different but, in principle, model the same thing: the indication of a good alignment through the use of translation equivalents.
Evaluation

The test data for document alignment were compiled from the corpora previously collected in the ACCURAT project (http://www.accurat-project.eu/), known to the project members as the Initial Comparable Corpora, or ICC for short. It is important to note that ICC contains all types of comparable corpora, from parallel to weakly comparable documents, and that we classified document pairs into three classes: parallel (class name: p), strongly comparable (cs) and weakly comparable (cw). We considered the following pairs of languages: English-Romanian (en-ro), English-Latvian (en-lv), English-Lithuanian (en-lt), English-Estonian (en-et), English-Slovene (en-sl) and English-Greek (en-el). For each pair of languages, ICC also contains a Gold Standard list of document alignments that were compiled by hand for testing purposes.
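Precision and recall against such a Gold Standard are computed over sets of document pairs; a small sketch in our notation:

    def precision_recall(proposed, gold):
        """P = correct / |proposed|, R = correct / |gold|,
        over sets of (source_id, target_id) document pairs."""
        proposed, gold = set(proposed), set(gold)
        correct = len(proposed & gold)
        return correct / len(proposed), correct / len(gold)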
…because the accuracy figures are always lower (by 10-20%) than those of EMACC with D2.
Table 1. Document alignment precision/recall (P/R) by corpus class (p, cs, cw). Prms. gives the corresponding parameter settings and # the number of document pairs in the Gold Standard.

p       P/R                 Prms.              P/R                 Prms.             #
en-ro   1/0.66666           *, *, <            1/1                 0.4, 0.001, 1     21
en-sl   0.98742/0.29511     0.001, *, 0.3      0.89097/0.89097     0.001, 0.001, 1   532
en-el   1/1                 <, *, *            1/1                 <, *, 1           87
en-lt   1/0.29971           0.4, 0.001, 0.3    0.93371/0.93371     0.001, 0.8, 1     347
en-lv   1/1                 *, *, <            1/1                 0.4, <, 1         184
en-et   1/0.69780           *, *, 0.3          0.96153/0.96153     0.001, 0.4, 1     182

cs      P/R                 Prms.              P/R                 Prms.             #
en-ro   1/0.69047           >, <               0.85714/0.85714     0.4, 1            42
en-sl   0.97777/0.29139     0.001, 0.3         0.81456/0.81456     0.4, 0.1          302
en-el   0.94124/0.28148     0.001, 0.3         0.71851/0.71851     0.001, 1          407
en-lt   0.95364/0.28514     0.001, 0.3         0.72673/0.72673     0.001, 1          507
en-lv   0.91463/0.27322     0.001, 0.3         0.80692/0.80692     0.001, 1          560
en-et   0.87030/0.26100     0.4, 0.3           0.57727/0.57727     0.4, 1            987

cw      P/R                 Prms.              P/R                 Prms.             #
en-ro   1/0.29411           0.4, 0.001, 0.3    0.66176/0.66176     0.4, 0.001, 1     68
en-sl   0.73958/0.22164     0.4, 0.4, 0.3      0.42767/0.42767     0.4, 0.4, 1       961
en-el   0.15238/0.04545     0.001, 0.8, 0.3    0.07670/0.07670     0.001, 0.8, 1     352
en-lt   0.55670/0.16615     0.4, 0.8, 0.3      0.28307/0.28307     0.4, 0.8, 1       325
en-lv   0.23529/0.07045     0.4, >, 0.3        0.10176/0.10176     0.4, 0.4, 1       511
en-et   0.59027/0.17634     0.4, 0.8, 0.3      0.27800/0.27800     0.4, 0.8, 1       483

Table 2. Document alignment precision/recall (P/R) for the second configuration, by corpus class, with the same column layout as Table 1.

p       P/R                 Prms.              P/R                 Prms.             #
en-ro   1/0.66666           *, <               1/1                 0.8, 1            21
en-sl   0.98382/0.68738     0.001, 0.7         0.93785/0.93785     0.001, 1          532
en-el   1/0.69411           *, <               1/1                 0.001, 1          87
en-lt   0.95192/0.28530     0.4, 0.3           0.90778/0.90778     0.001, 1          347
en-lv   1/0.29891           <, 0.3             0.97826/0.97826     <, 1              184
en-et   1/0.69780           <, <               0.97802/0.97802     0.001, 1          182

cs      P/R                 Prms.              P/R                 Prms.             #
en-ro   1/0.69047           >, *, <            0.85714/0.85714     0.4, >, 1         42
en-sl   0.96666/0.28807     0.4, 0.4, 0.3      0.83112/0.83112     0.4, 0.4, 1       302
en-el   0.97540/0.29238     0.001, 0.8, 0.3    0.80098/0.80098     0.001, 0.4, 1     407
en-lt   0.97368/0.29191     0.4, 0.8, 0.3      0.72978/0.72978     0.4, 0.4, 1       507
en-lv   0.95757/0.28675     0.4, >, 0.3        0.79854/0.79854     0.001, 0.8, 1     560
en-et   0.88135/0.26442     0.4, 0.8, 0.3      0.55182/0.55182     0.4, 0.4, 1       987

cw      P/R                 Prms.              P/R                 Prms.             #
en-ro   0.85/0.25           0.4, 0.3           0.61764/0.61764     0.4, 1            68
en-sl   0.65505/0.19624     0.4, 0.3           0.39874/0.39874     0.4, 1            961
en-el   0.11428/0.03428     0.4, 0.3           0.06285/0.06285     0.4, 1            352
en-lt   0.60416/0.18012     0.4, 0.3           0.24844/0.24844     0.4, 1            325
en-lv   0.13071/0.03921     0.4, 0.3           0.09803/0.09803     0.4, 1            511
en-et   0.48611/0.14522     0.001, 0.3         0.25678/0.25678     0.4, 1            483
References
BORMAN, S. 2009. The Expectation Maximization Algorithm. A short tutorial. Online at http://www.isi.edu/natural-language/teaching/cs562/2009/readings/B06.pdf

BROWN, P. F., LAI, J. C., and MERCER, R. L. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 169-176, June 18-21, 1991, University of California, Berkeley, California, USA.

BROWN, P. F., PIETRA, S. A. D., PIETRA, V. J. D., and MERCER, R. L. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.

CEAUȘU, A. 2009. Statistical Machine Translation for Romanian. PhD Thesis, Romanian Academy (in Romanian).

CHEN, S. F. 1993. Aligning Sentences in Bilingual Corpora Using Lexical Information. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 9-16, Columbus, Ohio, USA.

DEMPSTER, A. P., LAIRD, N. M., and RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.

GAO, Q., and VOGEL, S. 2008. Parallel implementations of word alignment tool. In ACL-08 HLT: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June 20, 2008, The Ohio State University, Columbus, Ohio, USA.

MUNTEANU, D. Ș., and MARCU, D. 2002. Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 289-295, July 6-7, 2002, University of Pennsylvania, Philadelphia, USA.

MUNTEANU, D. Ș., and MARCU, D. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477-504.