0% found this document useful (0 votes)

23 views8 pages

Eamt11 Complete

Eamt

Uploaded by

Maria Mitrofan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views8 pages

Eamt11 Complete

Eamt

Uploaded by

Maria Mitrofan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Textual Unit Alignment via Expectation Maximization

Radu ION
Research Institute for AI,
Romanian Academy
Calea 13 Septembrie nr. 13
Bucharest, 050711, Romania
radu@racai.ro

Alexandru CEAUU
Dublin City University
Glasnevin, Dublin 9, Ireland
aceausu@computing.dcu.ie

Abstract

The paper presents an Expectation Maximization (EM) algorithm for automatic

generation of parallel and quasi-parallel
data from any degree of comparable corpora ranging from parallel to weakly
comparable. Specifically, we address the
problem of extracting related textual
units (documents, paragraphs or sentences) relying on the hypothesis that, in a
given corpus, certain pairs of translation
equivalents are better indicators of a correct textual unit correspondence than other pairs of translation equivalents. We
evaluate our method on mixed types of
bilingual comparable corpora in six language pairs, obtaining state of the art accuracy figures.

Introduction

Statistical Machine Translation (SMT) is in a

constant need of good quality training data both
for translation models and for the language models. Regarding the latter, monolingual corpora is
evidently easier to collect than parallel corpora
and the truth of this statement is even more obvious when it comes to pairs of languages other
than those both widely spoken and computationally well-treated around the world such as English, Spanish, French or German.
Comparable corpora came as a possible solution to the problem of scarcity of parallel corpora
with the promise that it may serve as a seed for
parallel data extraction. A general definition of
comparability that we find operational is given
by Munteanu and Marcu (2005). They say that a
(bilingual) comparable corpus is a set of paired
2011 European Association for Machine Translation.

Elena IRIMIA
Research Institute for AI,
Romanian Academy
Calea 13 Septembrie nr. 13
Bucharest, 050711, Romania
elena@racai.ro

documents that, while not parallel in the strict

sense, are related and convey overlapping information.
Current practices of automatically collecting
domain-dependent bilingual comparable corpora
from the Web usually begin with collecting a list
of t terms as seed data in both the source and the
target languages. Each term (in each language) is
then queried on the most popular search engine
and the first N document hits are retained. The
final corpus will contain t N documents in each
language and in subsequent usage the document
boundaries are often disregarded.
At this point, it is important to stress out the
importance of the pairing of documents in a
comparable corpus. Suppose that we want to
word-align a bilingual comparable corpus consisting of M documents per language, each with k
words, using the IBM-1 word alignment algorithm (Brown et al., 1993). This algorithm
searches for each source word, the target words
that have a maximum translation probability with
the source word. Aligning all the words in our
corpus with no regard to document boundaries,
would yield a time complexity of
operations. The alternative would be in finding a 1:p
(with p a small positive integer, usually 1, 2 or 3)
document assignment (a set of aligned document
pairs) that would enforce the no search outside
the document boundary condition when doing
word alignment with the advantage of reducing
the time complexity to
operations. When
M is large, the reduction may actually be vital to
getting a result in a reasonable amount of time.
The downside of this simplification is the loss of
information: two documents may not be correctly
aligned thus depriving the word-alignment algorithm of the part of the search space that would
have contained the right alignments.
Word alignment forms the basis of the phrase
alignment procedure which, in turn, is the basis
of any statistical translation model. A compara-

ble corpus differs essentially from a parallel corpus by the fact that textual units do not follow a
translation order that otherwise greatly reduces
the word alignment search space in a parallel
corpus. Given this limitation of a comparable
corpus in general and the sizes of the comparable
corpora that we will have to deal with in particular, we have devised one variant of an Expectation Maximization (EM) algorithm (Dempster et
al., 1977) that generates a 1:1 (p = 1) document
assignment from a parallel and/or comparable
corpus using only pre-existing translation lexicons. Its generality would permit it to perform
the same task on other textual units such as paragraphs or sentences.
In what follows, we will briefly review the literature discussing document/paragraph alignment and then we will present the derivation of
the EM algorithm that generates 1:1 document
alignments. We will end the article with a thorough evaluation of the performances of this algorithm and the conclusions that arise from these
evaluations.

Related Work

Document alignment and other types of textual

unit alignment have been attempted in various
situations involving extracting parallel data from
comparable corpora. The first case study is offered by Munteanu and Marcu (2002). They
align sentences in an English-French comparable
corpus of 1.3M of words per language by comparing suffix trees of the sentences. Each sentence from each part of the corpus is encoded as
a suffix tree which is a tree that stores each possible suffix of a string from the last character to
the full string. The algorithm for sentence alignment proceeds as follows: a) generalized suffix
trees are constructed for sentences in the source
language and for those in the target language:
one tree per language which is the concatenation
of all suffix trees of all sentences in that language; b) the source tree is checked against the
target tree to determine branches that match.
Since the vocabulary is not the same (the branches contain words from different languages), an
initial bilingual lexicon is used to determine the
match. Using this method, Munteanu and Marcu
are able to detect correct sentence alignments
with a precision of 95% (out of 100 humanjudged and randomly selected sentences from the
generated output). The running time of their algorithm is approximately 100 hours for 50000
sentences in each of the languages.

A popular method of aligning sentences in a

comparable corpus is by classifying pairs of sentences as parallel or not parallel. Munteanu and
Marcu (2005) use a Maximum Entropy classifier
for the job trained with the following features:
sentence lengths and their differences and ratios,
percentage of the words in a source sentence that
have translations in a target sentence (translations are taken from pre-existing translation lexicons), the top three largest fertilities, length of
the longest sequence of words that have translations, etc. The training data consisted of a small
parallel corpus of 5000 sentences per language.
Since the number of negative instances (50002
5000) is far more large than the number of positive ones (5000), the negative training instances
were selected randomly out of instances that
passed a certain word overlap filter (see the paper for details). The classifier precision is around
97% with a recall of 40% at the Chinese-English
task and around 95% with a recall of 41% for the
Arabic-English task.
The last case study of sentence alignment that
we will present here is that of Chen (1993). He
employs an EM algorithm that will find a sentence alignment in the parallel corpus which
maximizes the translation probability for each
sentence bead in the alignment. The translation
probability to be maximized by the EM procedure considering each possible alignment
is
given by
(

( ) ([

])

The following notations were used:

is the
English corpus (a sequence of English
sentences), is the French corpus, [
] is a
sentence bead (a pairing of m sentences in
English with n sentences in French),
([
]
[
]) is the sentence alignment
(a sequence of sentence beads) and p(L) is the
probability that an alignment contains L beads.
The EM algorithm developed by Chen is similar
in principle with the one were about to describe
but there are several key differences that will be
pointed out. Its accuracy is around 96% and was
computed indirectly by checking disagreement
with the Brown sentence aligner (Brown et al.,
1991) on randomly selected 500 disagreement
cases.

EMACC

We propose an EM algorithm for aligning different types of textual units: documents, paragraphs,
and sentences which we will name EMACC (an
acronym for Expectation Maximization Alignment for Comparable Corpora). We draw our
inspiration from the famous IBM models (specifically from the IBM-1 model) for word alignment (Brown et al., 1993) where the translation
probability (eq. (5)) is modeled through an EM
algorithm where the hidden variable a models
the assignment (1:1 word alignments) from the
French sequence of words ( indexes) to the English one.
By analogy, we imagined that between two
sets of documents (from now on, we will refer to
documents as our textual units but what we present here is equally applicable but with different performance penalties to paragraphs and/or
sentences) lets call them and , there is an
assignment (a sequence of 1:1 document correspondences 1 ), the distribution of which can be
modeled by a hidden variable taking values in
the set {true, false}. This assignment will be
largely determined by the existence of word
translations between a pair of documents, translations that can differentiate between one another
in their ability to indicate a correct document
alignment versus an incorrect one. In other
words, we hypothesize that there are certain pairs
of translation equivalents that are better indicators of a correct document correspondence than
other translation equivalents pairs.
We take the general formulation and derivation of the EM optimization problem from (Borman, 2009). The general goal is to
mize (
), that is to find the parameters for
which (
) is maximum. In a sequence of
derivations that we are not going to repeat here,
the EM equation 1 is given by:
(

iterate over a set of parameters, compute the expectation expression for each of these parameters
and choose the parameters for which the expression has the largest value. But as we will see, in
practice, the set of all possible parameters has a
dimension that is exponential in terms of the
number of parameters. This renders the problem
intractable and one should back off to heuristic
searches in order to find a near-optimal solution.
Having the equation of the EM algorithm, the
next task is to tailor it to the problem at hand:
document alignment. But before doing so, lets
introduce a few notations that we will operate
with:

is the set of source documents,

is
the cardinal of this set;

is the set of target documents with

its cardinal;

is a pair of documents,
and
;

is a pair of translation equivalents

such that
is a lexical item that
belongs to and
is a lexical item that
belongs to ;

is the set of all existing translation

. is the transequivalents pairs
lation probability score (as the one given
for instance by GIZA++ (Gao and Vogel,
2008)). We assume that GIZA++ translation lexicons already exist for the pair of
languages of interest.
In order to tie equation 1 to our problem, we
define its variables as follows:

is the sequence of 1:1 document

alignments
of
the
form
,
{
an assignment
} . We call
which is basically a sequence of 1:1 document alignments. If there are
1:1
document alignments in
and if
, then the set of all possible assignments has the cardinal equal to
(

)
where (
. At step n+1, we try to
obtain a new set of parameters
that is going
to maximize (the maximization step) the sum
over z (the expectation step) that in its turn depends on the best set of parameters
obtained
at step n. Thus, in principle, the algorithm should

Or alignments or pairs. These terms will be used

with the same meaning throughout the presentation.

) where n! is the factorial func-

tion of the integer n and . / is the binomial coefficient. It is clear now that with
this kind of dimension of the set of all
possible assignments, we cannot simply
iterate over it in order to choose the assignment that maximizes the expectation;
*
+ is the hidden variable
that signals if a pair of documents

represents a correct alignment (true) or

not (false);

is the sequence of translation equivalents pairs

from T in the order they
appear in each document pair from .
Having defined the variables in equation 1 this
way, our job is then to maximize (
) meaning that we want to maximize the translation
equivalents probability over a given assignment.
In doing so, through the use of the hidden variable z, we are also able to find the 1:1 document
alignments that attest for this maximization.
We proceed by reducing equation 1 to a form
that is readily amenable to software coding. That
is, we aim at obtaining some distinct probability
tables that are going to be (re-)estimated by the
EM procedure. Throughout the presentation, we
will make some (overt and emphasized) independence assumptions that may or may not be
correct but that are necessary in order to obtain
the desired simplification. We acknowledge the
fact that other derivations based on different assumptions are also possible.
We begin by expanding the expectation expression from equation 1
(

)
(
(

) ( )
)

) ( )
(
)
(
)
( ) (
) ( ) ( )

( ) ( ) ( )

A final assumption (A4) that we make is that

(
)
(
) ( )or otherwise said,
and are conditionally independent given because only makes sense in the presence of a
document pair:
(

(
(

) ( (

) (

))

We thus end up with two probability tables:

(
) which we call the lexical (document)
alignment probability and ( ) which is the
estimated assignment probability. But because of
)
the fact that (
and being only interested in the
value, we end up with
the following simplified EM equation:

and making our first assumptions:

(A1) (
(A2) (
(A3) (

)
)

( ) ( )
)
( ) (
( )

The third assumption (A3) states that does not

depend on which only makes sense in the context of a document pair. The first assumption
(A1) mandates that and
are independent,
which is justifiable if we think that only depends on current and not on the previously estimated one. The second assumption (A2) extends the first one by also imposing the same
independence condition but conditioned on .
With these expressions at hand we proceed with
the simplifications:

) is going to be a conThe probability (

stant because it is computed based on the fixed
assignment that was found in the previous step
and thus, the previous equation is equivalent with
maximizing the new EM equation 2
,

Equation 2 suggests a method of updating the

assignment probability (
) with the lexi) in an effort to
cal alignment probability (
provide the alignment clues that will guide the
assignment probability towards the correct assignment. All it remains to do now is to define
the two probabilities according to our setup: document pairs and translation equivalents pairs.

The lexical document alignment probability

(
) is defined as follows (equation 3):
(

where (
) is the simplified lexical document alignment probability which is initially
equal to ( ) from the set . This probability
is to be read as the contribution
makes to
the correctness of the
alignment. We want
that the alignment contribution of one translation
equivalents pair
to distribute over the set of
all possible document pairs thus enforcing equation 4

for which we enforce the condition (equation 6):

(

Using equations 2, 3 and 5 we deduce the final

EM equation 7

[
)]

)
(
(

|
|

)
)

and normalize the two probability tables with

equations 6 and 4. The first update is to be interpreted as the contribution the lexical document
alignment probability makes to the alignment
probability. The second update equation aims at
boosting the probability of a translation equivalent if and only if it is found in a pair of documents belonging to the best assignment so far. In
this way, we hope that this translation equivalent
will make a better contribution to the discovery
of a correct document alignment that has not yet
been discovered at step n + 1.
Before we start the EM iterations, we need to
) and
initialize the probability tables (
(
| ). For the second table we used the
GIZA++ scores that we have for the
translation equivalents pair and normalized the table
with equation 4. For the first probability table we
have (and tried) two choices:
(D1) a uniform distribution:
;

)]

and construct the best 1:1 assignment

by
choosing those pairs
for which we have
counts with the maximum values. Before the EM
cycle is resumed, we perform the following updates (equations 7a and 7b):

The summation over in equation 3 is actually

over all translation equivalents pairs that are to
be found only in the current
document pair
and the presence of the product
ensures
that we still have a probability value.
The assignment probability (
) is also defined in the following way (equation 5):

Remember the fact that we cannot simply iterate

over all possible assignments which leads us to
a greedy algorithm: simply iterate over all possible 1:1 document pairs and for each document
pair
{
} compute the
alignment count (its not a probability so we call
it a count)

(D2) a lexical document alignment

measure (
) (values between 0 and
1) that is computed directly from a pair
of documents
using the
translation equivalents pairs from the dictionary
(equation 8):

where
is the number of words in document
( ) is the frequency of word
and
in
document . If every word in the source document has at least one translation (of a given
threshold probability score) in the target document, then this measure is 1. We normalize the
table initialized using this measure with equation
6.
EMACC finds only 1:1 textual units alignments in its present form but a document pair
can be easily extended to a document bead
following the example from (Chen, 1993). The
main difference between the algorithm described
by Chen and ours is that the search procedure
reported there is invalid for comparable corpora
in which no pruning is available due to the nature
of the corpus. A second very important difference is that Chen only relies on lexical alignment
information, on the parallel nature of the corpus
and on sentence lengths correlations while we
add the probability of the whole assignment
which, when initially set to the D2 distribution,
produces a significant boost of the precision of
the alignment. Finally, the maximization equations are quite different but, in principle, trying
to model the same thing: the indication of a good
alignment through the use of translation equivalents.

Experiments and Evaluations

The test data for document alignment was compiled from the corpora that was previously collected in the ACCURAT project 2 and that is
known to the project members as the Initial
Comparable Corpora or ICC for short. It is important to know the fact that ICC contains all
types of comparable corpora from parallel to
weakly comparable documents but we classified
document pairs in three classes: parallel (class
name: p), strongly comparable (cs) and weakly
comparable (cw). We have considered the following pairs of languages: English-Romanian
(en-ro), English-Latvian (en-lv), EnglishLithuanian (en-lt), English-Estonian (en-et), English-Slovene (en-sl) and English-Greek (en-el).
For each pair of languages, ICC also contains a
Gold Standard list of document alignments that
were compiled by hand for testing purposes.
2

http://www.accurat-project.eu/

We trained GIZA++ translation lexicons for

every language pair using the DGT-TM3 corpus.
The input texts were converted from their
Unicode encoding to UTF-8 and were tokenized
using a tokenizer web service described by
Ceauu (2009). Then, we applied a parallel version of GIZA++ (Gao and Vogel, 2008) that
gave us the translation dictionaries of content
words only (nouns, verbs, adjective and adverbs)
at wordform level. For Romanian, Lithuanian,
Latvian, Greek and English, we had lists of inflectional suffixes which we used to stem entries
in respective dictionaries and processed documents. Slovene remained the only language
which involved wordform level processing.
The accuracy of EMACC is influenced by
three parameters whose values have been experimentally set:
the threshold over which we use translation equivalents from the dictionary for
textual unit alignment; values for this
threshold (lets call it t-T) are from the
+;
ordered set *
the threshold over which we decide to
update the probabilities of translation
equivalents with equation 7b; values for
this threshold (named t-E) are from the
+;
same ordered set *
the top t-O% alignments from the best
assignment found by EMACC. This parameter will introduce precision and recall with the perfect value for recall
equal to t-O%. Values for this parameter
+.
are from the set *
We have run EMACC (10 EM steps) on every
possible combination of these parameters for the
pairs of languages in question on both initial distributions D1 and D2. For comparison, we also
performed a baseline document alignment using
the greedy algorithm of EMACC with the equation 8 supplying the document alignment
strength measure.
The following 6 tables report a synthesis of
the results we have obtained which, because of
the lack of space, we cannot give in full. Every
two tables are paired (and should be compared to
one another): the first one reports the results obtained with EMACC with D2 initial distribution
and second one gives the results with the D2
baseline alignment algorithm. We omit the results of EMACC with D1 initial distribution be-

http://langtech.jrc.it/DGT-TM.html

cause the accuracy figures are always lower (1020%) than those of EMACC with D2.
p

P/R

enro

1/
0.66666

ensl

0.98742/
0.29511

enel

1
/1

enlt

1/
0.29971

enlv

1/
1

enet

1/
0.69780

Prms.
*
*
<
0.001
*
0.3
<
*
*
0.4
0.001
0.3
*
*
<
*
*
0.3

P/R
1/
1
0.89097/
0.89097
1/
1
0.93371/
0.93371
1/
1
0.96153/
0.96153

Prms.
0.4
0.001
1
0.001
0.001
1
<
*
1
0.001
0.8
1
0.4
<
1
0.001
0.4
1

#
21

532

347

184

P/R

Prms.

P/R

Prms.

1/
0.69047

>
<

0.85714/
0.85714

0.4
1

0.97777/
0.29139

0.001
0.3

0.81456/
0.81456

0.4
0.1

302

0.94124/
0.28148

0.001
0.3

0.71851/
0.71851

0.001
1

407

0.95364/
0.28514

0.001
0.3

0.72673/
0.72673

0.001
1

507

0.91463/
0.27322

0.001
0.3

0.80692/
0.80692

0.001
1

560

0.87030/
0.26100

0.4
0.3

0.57727/
0.57727

0.4
1

987

Table 4: D2 baseline algorithm on strongly comparable corpora

182

Table 1: EMACC with D2 initial distribution on parallel corpora

p
enro
ensl
enel
enlt
enlv
enet

cs
enro
ensl
enel
enlt
enlv
enet

P/R

enro

1/
0.29411

ensl

0.73958/
0.22164

enel

0.15238/
0.04545

P/R

Prms.

P/R

Prms.

1/
0.66666

*
<

1/
1

0.8
1

0.98382/
0.68738

0.001
0.7

0.93785/
0.93785

0.001
1

532

enlt

0.55670/
0.16615

1/
0.69411

*
<

1/
1

0.001
1

0.95192/
0.28530

0.4
0.3

0.90778/
0.90778

0.001
1

enlv

0.23529/
0.07045

347

1/
0.29891

<
0.3

0.97826/
0.97826

<
1

184

enet

0.59027/
0.17634

1/
0.69780

<
<

0.97802/
0.97802

0.001
1

182

Prms.
0.4
0.001
0.3
0.4
0.4
0.3
0.001
0.8
0.3
0.4
0.8
0.3
0.4
>
0.3
0.4
0.8
0.3

P/R
0.66176/
0.66176
0.42767/
0.42767
0.07670/
0.07670
0.28307/
0.28307
0.10176/
0.10176
0.27800/
0.27800

Prms.
0.4
0.001
1
0.4
0.4
1
0.001
0.8
1
0.4
0.8
1
0.4
0.4
1
0.4
0.8
1

#
68

961

352

325

511

483

Table 5: EMACC with D2 initial distribution on

weakly comparable corpora

Table 2: D2 baseline algorithm on parallel corpora

P/R

enro

1/
0.69047

ensl

0.96666/
0.28807

enel

0.97540/
0.29238

enlt

0.97368/
0.29191

enlv

0.95757/
0.28675

enet

0.88135/
0.26442

Prms.
>
*
<
0.4
0.4
0.3
0.001
0.8
0.3
0.4
0.8
0.3
0.4
>
0.3
0.4
0.8
0.3

P/R
0.85714/
0.85714
0.83112/
0.83112
0.80098/
0.80098
0.72978/
0.72978
0.79854/
0.79854
0.55182/
0.55182

Prms.
0.4
>
1
0.4
0.4
1
0.001
0.4
1
0.4
0.4
1
0.001
0.8
1
0.4
0.4
1

#
42

302

407

507

560

987

Table 3: EMACC with D2 initial distribution on

strongly comparable corpora

cw
enro
ensl
enel
enlt
enlv
enet

P/R

Prms.

P/R

Prms.

0.85/
0.25

0.4
0.3

0.61764/
0.61764

0.4
1

0.65505/
0.19624

0.4
0.3

0.39874/
0.39874

0.4
1

961

0.11428/
0.03428

0.4
0.3

0.06285/
0.06285

0.4
1

352

0.60416/
0.18012

0.4
0.3

0.24844/
0.24844

0.4
1

325

0.13071/
0.03921

0.4
0.3

0.09803/
0.09803

0.4
1

511

0.48611/
0.14522

0.001
0.3

0.25678/
0.25678

0.4
1

483

Table 6: D2 baseline algorithm on weakly comparable corpora

In every table above, the P/R column gives the

maximum precision the associated (maximum)
recall EMACC was able to obtain for the corresponding pair of languages using the parameters
(Prms.) from the next column. The P/R column
gives the maximum recall with the associated
(maximum) precision that we obtained for that

pair of languages. The Prms. columns differ in

Tables 1, 3 and 5 from Tables 2, 4, and 6: in Tables 1, 3 and 5 the parameters are given in the
order t-T, t-E and t-O (from the top of the cell to
the bottom) and in Tables 2, 4 and 6 parameters
are given in the order t-T, t-O (the t-E threshold
is absent because it is EMACC specific). For the
sake of compactness of representation we used
some thresholds interval placeholders which are:
< for the first two values of a threshold, >
for the last two values and * for all values of a
threshold. For instance, in Table 3, we have obtained a precision of 1 and a recall of 0.69047
aligning 42 en-ro documents (# column gives the
number of documents per language) with any of
+ for t-T, any of the values
the values of *
*
+ for t-E and any of the values
*
+ for the t-O threshold.
To ease comparison between EMACC and the
D2 baseline for each type of corpora (p, cs or cw),
we grayed maximal values between the two: either the precision in the P/R column or the recall
in the P/R column. In the case of parallel corpora
(see Tables 1 and 2) we observe that EMACC is
not able to improve the baseline because of the
simple fact that the baseline is very high (the
smallest precision is 0.98). Moving on to the
strongly comparable corpora (Tables 3 and 4),
we see that the benefits of re-estimating the
probabilities of the translation equivalents (based
on which we judge document alignments) begin
to emerge with precisions for en-el, en-lt, en-lv
and en-et being better than those obtained with
the D2 baseline. Finally, in the case of weakly
comparable corpora, the benefits of EM reestimation are clear with EMACC constantly
delivering better results than D2 baseline (with
the single exception of en-lt precision).

Comments and Conclusions

The whole point in developing textual unit

alignment algorithms for comparable corpora is
to be able to provide good quality quasi-aligned
data to programs that are specialized in extracting parallel data from these alignments. In the
context of this paper, the most important result to
note is that translation probability re-estimation
is a good tool in discovering new correct textual
unit alignments in the case of weakly related
documents. We also tested EMACC at the
alignment of 200 parallel paragraphs (small texts
of no more than 50 words) for all pairs of languages that we have considered here. We can
briefly report that the results are very similar to

the parallel document alignments from Tables 1

and 2 which is a promising result because one
would think that a significant reduction in textual
unit size would negatively impact the alignment
accuracy.
The only drawback of this algorithm is its high
computing time. We use a cluster with a total of
32 CPU cores (4 nodes) with 6-8 GB of RAM
per node and, with this configuration, the total
running time is between 12h and 48h per language pair depending on the setting of the various parameters.

References
BORMAN, S. 2009. The Expectation Maximization
Algorithm. A short tutorial. Online at
http://www.isi.edu/naturallanguage/teaching/cs562/2009/readings/B06.pdf
BROWN, P. F., LAI, J. C., and MERCER, R. L.
1991. Aligning sentences in parallel corpora. In
Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 169
176, June 8-21, 1991, University of California,
Berkeley, California, USA.
BROWN, P. F., PIETRA, S. A. D., PIETRA, V. J. D.,
and MERCER, R. L. 1993. The Mathematics of
Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2): 263
311.
CEAUU, A. 2009. Statistical Machine Translation
for Romanian. PhD Thesis, Romanian Academy
(in Romanian).
CHEN, S. F. 1993. Aligning Sentences in Bilingual
Corpora Using Lexical Information. In Proceedings of the 31st Annual Meeting on Association for
Computational Linguistics, pp. 916, Columbus,
Ohio, USA.
DEMPSTER, A. P., LAIRD, N. M., and RUBIN, D. B.
1977. Maximum likelihood from incomplete data
via the EM algorithm. Journal of the Royal Statistical Society, 39(B):138.
GAO, Q., and VOGEL, S. 2008. Parallel implementations of word alignment tool. ACL-08 HLT:
Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 4957,
June 20, 2008, The Ohio State University, Columbus, Ohio, USA.
MUNTEANU, D. ., and MARCU, D. 2002. Processing comparable corpora with bilingual suffix
trees. In Proceedings of the 2002 Conference on
Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 289295, July 67, 2002, University of Pennsylvania, Philadelphia,
USA.
MUNTEANU, D. ., and MARCU, D. 2005. Improving machine translation performance by exploiting
non-parallel corpora. Computational Linguistics,
31(4):477504.

Hold On To Your Kids - Gabor Mate
85% (13)
Hold On To Your Kids - Gabor Mate
4 pages
Berlitz Schools of Languages - The Berlitz Self-Teacher - French-Perigee Books (1987 (1949) )
100% (6)
Berlitz Schools of Languages - The Berlitz Self-Teacher - French-Perigee Books (1987 (1949) )
308 pages
Inganekwane
No ratings yet
Inganekwane
3 pages
DR Kousmine Viata Sanatoasa
No ratings yet
DR Kousmine Viata Sanatoasa
684 pages
IELTS Writing Task 2
100% (1)
IELTS Writing Task 2
14 pages
Interactive English To Urdu Machine Translation Using Example-Based Approach
100% (2)
Interactive English To Urdu Machine Translation Using Example-Based Approach
8 pages
4 Natural Language Processing-Text Normalization
No ratings yet
4 Natural Language Processing-Text Normalization
10 pages
Verb To Be Exercises
100% (1)
Verb To Be Exercises
4 pages
Personal Pronouns
No ratings yet
Personal Pronouns
6 pages
MYPNA TE G07 EM Web
No ratings yet
MYPNA TE G07 EM Web
81 pages
Worksheet On Conditional Sentences B - Type1 and 2
No ratings yet
Worksheet On Conditional Sentences B - Type1 and 2
2 pages
5054 Reported Speech Step by Step Step 1 Grammar Part 1
100% (2)
5054 Reported Speech Step by Step Step 1 Grammar Part 1
2 pages
Spanish - Portuguese SMT
100% (2)
Spanish - Portuguese SMT
10 pages
Lect 07 - MT and Seq2seq
No ratings yet
Lect 07 - MT and Seq2seq
86 pages
View
No ratings yet
View
99 pages
05 Lecture08 NMT
No ratings yet
05 Lecture08 NMT
79 pages
An Overview of Statistical Machine Translation: Charles Schafer
No ratings yet
An Overview of Statistical Machine Translation: Charles Schafer
104 pages
Adjusted BOW Elementary - ENGLISH
No ratings yet
Adjusted BOW Elementary - ENGLISH
7 pages
Turkish-English Machine Translation System
No ratings yet
Turkish-English Machine Translation System
47 pages
Unit 5
No ratings yet
Unit 5
42 pages
ESOL 100 Intro. To ESOL 100 Grammar Component
No ratings yet
ESOL 100 Intro. To ESOL 100 Grammar Component
36 pages
A Systematic Comparison of Various Statistical Alignment Models
No ratings yet
A Systematic Comparison of Various Statistical Alignment Models
33 pages
T4 - Language Varieties and Registers
No ratings yet
T4 - Language Varieties and Registers
29 pages
What's in A Translation Rule?: Michel Galley Mark Hopkins Kevin Knight and Daniel Marcu
No ratings yet
What's in A Translation Rule?: Michel Galley Mark Hopkins Kevin Knight and Daniel Marcu
8 pages
Natural Language Processing, Problem Set 3: Training Data
No ratings yet
Natural Language Processing, Problem Set 3: Training Data
6 pages
Translingual Information Retrieval Learning From Bil - 1998 - Artificial Intell
No ratings yet
Translingual Information Retrieval Learning From Bil - 1998 - Artificial Intell
23 pages
Algorithms PDF
No ratings yet
Algorithms PDF
6 pages
8.5 Multilingual Speech Processing
No ratings yet
8.5 Multilingual Speech Processing
24 pages
Unsupervised Multilingual Learning For POS Tagging
100% (1)
Unsupervised Multilingual Learning For POS Tagging
10 pages
Diamond Theorem
No ratings yet
Diamond Theorem
5 pages
IELTS Lesson Plan For Securing Band Scor
No ratings yet
IELTS Lesson Plan For Securing Band Scor
2 pages
Concurs Creatie Literara
No ratings yet
Concurs Creatie Literara
25 pages
Diana FB
No ratings yet
Diana FB
26 pages
Ulllted States Patent (10) Patent N0.: US 8,275,604 B2
No ratings yet
Ulllted States Patent (10) Patent N0.: US 8,275,604 B2
21 pages
A Study of Business English Translation Skills Based On Parallel Corpus
No ratings yet
A Study of Business English Translation Skills Based On Parallel Corpus
16 pages
Notes From - Word Power Made Easy - The Complete Handbook For Building A Superior Vocabulary
No ratings yet
Notes From - Word Power Made Easy - The Complete Handbook For Building A Superior Vocabulary
15 pages
Tree Based Statistical Machine Translati
No ratings yet
Tree Based Statistical Machine Translati
15 pages
Measuring Bilingual Corpus Comparability
No ratings yet
Measuring Bilingual Corpus Comparability
27 pages
Jerome Bruner's Theory of Education: From Early Bruner To Later Bruner
No ratings yet
Jerome Bruner's Theory of Education: From Early Bruner To Later Bruner
20 pages
Learning Bilingual Word Mappings
No ratings yet
Learning Bilingual Word Mappings
10 pages
Sentence Alignment Using MR and GA
No ratings yet
Sentence Alignment Using MR and GA
7 pages
Demos 035
No ratings yet
Demos 035
12 pages
Karikatur Tulisan Siswa Peirce
No ratings yet
Karikatur Tulisan Siswa Peirce
9 pages
Cross-Lingual Language Model Pretraining PDF
No ratings yet
Cross-Lingual Language Model Pretraining PDF
10 pages
Use of Texts Stories Songs and Rhymes Sebogon
No ratings yet
Use of Texts Stories Songs and Rhymes Sebogon
14 pages
2005.mtsummit Papers.11
No ratings yet
2005.mtsummit Papers.11
8 pages
Parallel Texts Alignment Strategies
No ratings yet
Parallel Texts Alignment Strategies
7 pages
Seminar Sample Report
No ratings yet
Seminar Sample Report
20 pages
Volk - Graen - Callegaro - 2014-Innovations Parallel Corpus Tools
No ratings yet
Volk - Graen - Callegaro - 2014-Innovations Parallel Corpus Tools
7 pages
Learning Translation Rules From Bilingual English - Filipino Corpus
No ratings yet
Learning Translation Rules From Bilingual English - Filipino Corpus
10 pages
English Grammar: by Aylate Gudeno 2014/2021
No ratings yet
English Grammar: by Aylate Gudeno 2014/2021
10 pages
The Cued Hifst System For The Wmt10 Translation Shared Task
No ratings yet
The Cued Hifst System For The Wmt10 Translation Shared Task
6 pages
Addressing Word-Order Divergence in Multilingual Neural Machine Translation For Extremely Low Resource Languages
No ratings yet
Addressing Word-Order Divergence in Multilingual Neural Machine Translation For Extremely Low Resource Languages
6 pages
Math Act. 4
No ratings yet
Math Act. 4
5 pages
Translation Table Compression Under End-Tagged Dense Code
No ratings yet
Translation Table Compression Under End-Tagged Dense Code
6 pages
h3 P
No ratings yet
h3 P
6 pages
CONNEAU and Lample - 2019 - Cross-Lingual Language Model Pretraining
No ratings yet
CONNEAU and Lample - 2019 - Cross-Lingual Language Model Pretraining
11 pages
Tips To Help You Avoid Colloquial Informal Writing
No ratings yet
Tips To Help You Avoid Colloquial Informal Writing
6 pages
Statistical Approach To Machine Translation
No ratings yet
Statistical Approach To Machine Translation
7 pages
Strings (ALL PROGRAMS)
No ratings yet
Strings (ALL PROGRAMS)
4 pages
Improving Word Alignment in An English - Malay Parallel Corpus For Machine Translation
No ratings yet
Improving Word Alignment in An English - Malay Parallel Corpus For Machine Translation
4 pages
Exploring Properties of Intralingual and Interlingual Association Measures Visually
No ratings yet
Exploring Properties of Intralingual and Interlingual Association Measures Visually
4 pages
2022 Findings-Aacl 20
No ratings yet
2022 Findings-Aacl 20
6 pages
Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers
No ratings yet
Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers
7 pages
The CMU-EBMT Machine Translation System: Ralf D. Brown
No ratings yet
The CMU-EBMT Machine Translation System: Ralf D. Brown
17 pages
Case Study
No ratings yet
Case Study
4 pages
Tiered Tagging and Combined Language Models Classiers
No ratings yet
Tiered Tagging and Combined Language Models Classiers
6 pages
Pivot-Based Hybrid Machine Translation To Support Multilingual Communication For Closely Related Languages
No ratings yet
Pivot-Based Hybrid Machine Translation To Support Multilingual Communication For Closely Related Languages
6 pages
NLP 5
No ratings yet
NLP 5
3 pages
Findings of The WMT 2016 Bilingual Document Alignment Shared Task
No ratings yet
Findings of The WMT 2016 Bilingual Document Alignment Shared Task
10 pages
Building English-Punjabi Parallel Corpus For Machi
No ratings yet
Building English-Punjabi Parallel Corpus For Machi
5 pages
3D Hexagonal Network: Modeling, Topological Properties, Addressing Scheme, and Optimal Routing Algorithm
No ratings yet
3D Hexagonal Network: Modeling, Topological Properties, Addressing Scheme, and Optimal Routing Algorithm
10 pages
Ccmatrix: Mining Billions of High-Quality Parallel Sentences On The Web
No ratings yet
Ccmatrix: Mining Billions of High-Quality Parallel Sentences On The Web
13 pages
Dependencies vs. Constituents For Tree-Based Alignment: Daniel Gildea
No ratings yet
Dependencies vs. Constituents For Tree-Based Alignment: Daniel Gildea
8 pages
Sanchez Martinez11a
No ratings yet
Sanchez Martinez11a
12 pages
A Survey On Biomedical Named Entity Extraction: Research Article
No ratings yet
A Survey On Biomedical Named Entity Extraction: Research Article
4 pages
Weak - and - Strong - Forms - of - Function - Words
No ratings yet
Weak - and - Strong - Forms - of - Function - Words
2 pages
Automated Machine Translation For Regional Languages: Problem Statement
No ratings yet
Automated Machine Translation For Regional Languages: Problem Statement
2 pages
Improving The Performance of English-Tamil Statistical Machine Translation System Using Source-Side Pre-Processing
No ratings yet
Improving The Performance of English-Tamil Statistical Machine Translation System Using Source-Side Pre-Processing
11 pages
Eng122 Course Outline
No ratings yet
Eng122 Course Outline
8 pages
Modal Auxiliary Verbs: Can! ! Could! ! May! ! Might! Will! ! Shall! ! Should! Must! Ought To! ! Need To! ! Had Better!
No ratings yet
Modal Auxiliary Verbs: Can! ! Could! ! May! ! Might! Will! ! Shall! ! Should! Must! Ought To! ! Need To! ! Had Better!
6 pages
DLL-ENG8-2NDQ-1st-week Edited
No ratings yet
DLL-ENG8-2NDQ-1st-week Edited
9 pages
AMTA 2012 Goutte
No ratings yet
AMTA 2012 Goutte
8 pages
Machine Translation: A Presentation By: Julie Conlonova, Rob Chase, and Eric Pomerleau
No ratings yet
Machine Translation: A Presentation By: Julie Conlonova, Rob Chase, and Eric Pomerleau
31 pages
Contribution To AMTA2012
No ratings yet
Contribution To AMTA2012
8 pages
Acoustic Properties of Waray Vowels Towards A New Waray Orthography
No ratings yet
Acoustic Properties of Waray Vowels Towards A New Waray Orthography
16 pages
Sec 2 Ih SBQ Notes 1
No ratings yet
Sec 2 Ih SBQ Notes 1
2 pages
Relative Clauses THEORY 2BACH
No ratings yet
Relative Clauses THEORY 2BACH
4 pages
Math 30perm Restrictions Short
No ratings yet
Math 30perm Restrictions Short
4 pages
Comparative Study of Machine Translation Techniques
No ratings yet
Comparative Study of Machine Translation Techniques
16 pages
A DOM Tree Alignment Model For Mining Parallel Data From The Web
No ratings yet
A DOM Tree Alignment Model For Mining Parallel Data From The Web
8 pages
Improvements in Phrase-Based Statistical Machine Translation
No ratings yet
Improvements in Phrase-Based Statistical Machine Translation
8 pages
Statistical Phrase-Based Translation: Philipp Koehn, Franz Josef Och, Daniel Marcu
No ratings yet
Statistical Phrase-Based Translation: Philipp Koehn, Franz Josef Och, Daniel Marcu
7 pages
Toward A Name Entity Aligned Bilingual Corpus
No ratings yet
Toward A Name Entity Aligned Bilingual Corpus
7 pages
Exploiting Non-Parallel Corpora For Statistical Machine Translation
No ratings yet
Exploiting Non-Parallel Corpora For Statistical Machine Translation
6 pages
Using Synonyms For Arabic-to-English Example-Based Translation
No ratings yet
Using Synonyms For Arabic-to-English Example-Based Translation
10 pages
Grade 6 Ela Unpacked Documents
No ratings yet
Grade 6 Ela Unpacked Documents
35 pages
Gene Expression Programming: Fundamentals and Applications
From Everand
Gene Expression Programming: Fundamentals and Applications
Fouad Sabry
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Eamt11 Complete

Uploaded by

Eamt11 Complete

Uploaded by

Textual Unit Alignment via Expectation Maximization

The paper presents an Expectation Maximization (EM) algorithm for automatic

Statistical Machine Translation (SMT) is in a

documents that, while not parallel in the strict

Document alignment and other types of textual

A popular method of aligning sentences in a

The following notations were used:

is the set of source documents,

is the set of target documents with

is a pair of translation equivalents

is the set of all existing translation

is the sequence of 1:1 document

Or alignments or pairs. These terms will be used

) where n! is the factorial func-

represents a correct alignment (true) or

is the sequence of translation equivalents pairs

A final assumption (A4) that we make is that

We thus end up with two probability tables:

and making our first assumptions:

The third assumption (A3) states that does not

) is going to be a conThe probability (

Equation 2 suggests a method of updating the

The lexical document alignment probability

for which we enforce the condition (equation 6):

Using equations 2, 3 and 5 we deduce the final

and normalize the two probability tables with

and construct the best 1:1 assignment

The summation over in equation 3 is actually

Remember the fact that we cannot simply iterate

(D2) a lexical document alignment

Experiments and Evaluations

We trained GIZA++ translation lexicons for

Table 4: D2 baseline algorithm on strongly comparable corpora

Table 1: EMACC with D2 initial distribution on parallel corpora

Table 5: EMACC with D2 initial distribution on

Table 2: D2 baseline algorithm on parallel corpora

Table 3: EMACC with D2 initial distribution on

Table 6: D2 baseline algorithm on weakly comparable corpora

In every table above, the P/R column gives the

pair of languages. The Prms. columns differ in

Comments and Conclusions

The whole point in developing textual unit

the parallel document alignments from Tables 1

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.