
Textual Unit Alignment via Expectation Maximization

Radu ION
Research Institute for AI,
Romanian Academy
Calea 13 Septembrie nr. 13
Bucharest, 050711, Romania
radu@racai.ro

Alexandru CEAUŞU
Dublin City University
Glasnevin, Dublin 9, Ireland
aceausu@computing.dcu.ie

Elena IRIMIA
Research Institute for AI,
Romanian Academy
Calea 13 Septembrie nr. 13
Bucharest, 050711, Romania
elena@racai.ro

© 2011 European Association for Machine Translation.

Abstract

The paper presents an Expectation Maximization (EM) algorithm for the automatic generation of parallel and quasi-parallel data from any degree of comparable corpora, ranging from parallel to weakly comparable. Specifically, we address the problem of extracting related textual units (documents, paragraphs or sentences) relying on the hypothesis that, in a given corpus, certain pairs of translation equivalents are better indicators of a correct textual unit correspondence than other pairs of translation equivalents. We evaluate our method on mixed types of bilingual comparable corpora in six language pairs, obtaining state-of-the-art accuracy figures.

Introduction

Statistical Machine Translation (SMT) is in constant need of good-quality training data, both for the translation models and for the language models. Regarding the latter, monolingual corpora are evidently easier to collect than parallel corpora, and the truth of this statement is even more obvious when it comes to pairs of languages other than those that are both widely spoken and computationally well treated around the world, such as English, Spanish, French or German.
Comparable corpora came as a possible solution to the problem of scarcity of parallel corpora, with the promise that they may serve as a seed for parallel data extraction. A general definition of comparability that we find operational is given by Munteanu and Marcu (2005). They say that a (bilingual) comparable corpus is a set of paired documents that, while not parallel in the strict sense, are related and convey overlapping information.
Current practices of automatically collecting domain-dependent bilingual comparable corpora from the Web usually begin with collecting a list of t terms as seed data in both the source and the target languages. Each term (in each language) is then queried on the most popular search engine and the first N document hits are retained. The final corpus will contain t × N documents in each language, and in subsequent usage the document boundaries are often disregarded.
At this point, it is important to stress the importance of the pairing of documents in a comparable corpus. Suppose that we want to word-align a bilingual comparable corpus consisting of M documents per language, each with k words, using the IBM-1 word alignment algorithm (Brown et al., 1993). This algorithm searches, for each source word, the target words that have a maximum translation probability with the source word. Aligning all the words in our corpus with no regard to document boundaries would yield a time complexity of M^2 k^2 operations. The alternative would be to find a 1:p (with p a small positive integer, usually 1, 2 or 3) document assignment (a set of aligned document pairs) that would enforce the "no search outside the document boundary" condition when doing word alignment, with the advantage of reducing the time complexity to p M k^2 operations. When M is large, the reduction may actually be vital to getting a result in a reasonable amount of time. The downside of this simplification is the loss of information: two documents may not be correctly aligned, thus depriving the word-alignment algorithm of the part of the search space that would have contained the right alignments.
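To put rough numbers on this (the figures are ours, for illustration only): with M = 1,000 documents per language and k = 500 words per document, the unrestricted IBM-1 search compares 1,000 × 500 = 500,000 source words against 500,000 target words, i.e. (M k)^2 = 2.5 × 10^11 word-pair lookups, whereas a 1:1 document assignment (p = 1) reduces this to M k^2 = 2.5 × 10^8 lookups, a reduction by a factor of M.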
Word alignment forms the basis of the phrase alignment procedure which, in turn, is the basis of any statistical translation model. A comparable corpus differs essentially from a parallel corpus by the fact that its textual units do not follow a translation order, an order that otherwise greatly reduces the word alignment search space in a parallel corpus. Given this limitation of comparable corpora in general, and the sizes of the comparable corpora that we have to deal with in particular, we have devised a variant of the Expectation Maximization (EM) algorithm (Dempster et al., 1977) that generates a 1:1 (p = 1) document assignment from a parallel and/or comparable corpus using only pre-existing translation lexicons. Its generality permits it to perform the same task on other textual units such as paragraphs or sentences.
In what follows, we briefly review the literature discussing document/paragraph alignment, and then present the derivation of the EM algorithm that generates 1:1 document alignments. We end the article with a thorough evaluation of the performance of this algorithm and the conclusions that arise from these evaluations.

Related Work

Document alignment and other types of textual unit alignment have been attempted in various situations involving the extraction of parallel data from comparable corpora. The first case study is offered by Munteanu and Marcu (2002). They align sentences in an English-French comparable corpus of 1.3M words per language by comparing suffix trees of the sentences. Each sentence from each part of the corpus is encoded as a suffix tree, which is a tree that stores each possible suffix of a string, from the last character to the full string. The algorithm for sentence alignment proceeds as follows: a) generalized suffix trees are constructed for sentences in the source language and for those in the target language, one tree per language, which is the concatenation of all suffix trees of all sentences in that language; b) the source tree is checked against the target tree to determine branches that match. Since the vocabulary is not the same (the branches contain words from different languages), an initial bilingual lexicon is used to determine the match. Using this method, Munteanu and Marcu are able to detect correct sentence alignments with a precision of 95% (out of 100 human-judged and randomly selected sentences from the generated output). The running time of their algorithm is approximately 100 hours for 50,000 sentences in each of the languages.

A popular method of aligning sentences in a comparable corpus is to classify pairs of sentences as parallel or not parallel. Munteanu and Marcu (2005) use a Maximum Entropy classifier for the job, trained with the following features: sentence lengths and their differences and ratios, the percentage of the words in a source sentence that have translations in a target sentence (translations are taken from pre-existing translation lexicons), the top three largest fertilities, the length of the longest sequence of words that have translations, etc. The training data consisted of a small parallel corpus of 5,000 sentences per language. Since the number of negative instances (5000^2 − 5000) is far larger than the number of positive ones (5000), the negative training instances were selected randomly out of instances that passed a certain word-overlap filter (see the paper for details). The classifier precision is around 97% with a recall of 40% on the Chinese-English task and around 95% with a recall of 41% on the Arabic-English task.
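As an illustration of the kind of features listed above, the sketch below computes sentence lengths, their difference and ratio, and the translation coverage of a candidate sentence pair. The feature set is simplified and the lexicon format (a dictionary from source words to sets of target words) is our own assumption, not Munteanu and Marcu's implementation.

```python
# Simplified feature extraction for one candidate sentence pair.
# The lexicon format (source word -> set of target words) is assumed here.

def pair_features(src_tokens, tgt_tokens, lexicon):
    """Return a few parallelism features of the kind used by such classifiers."""
    len_s, len_t = len(src_tokens), len(tgt_tokens)
    tgt_set = set(tgt_tokens)
    # Fraction of source words having at least one translation in the target sentence.
    covered = sum(1 for w in src_tokens
                  if any(t in tgt_set for t in lexicon.get(w, ())))
    return {
        "len_src": len_s,
        "len_tgt": len_t,
        "len_diff": abs(len_s - len_t),
        "len_ratio": len_s / len_t if len_t else 0.0,
        "translation_coverage": covered / len_s if len_s else 0.0,
    }

# Toy usage with an invented two-entry lexicon:
lexicon = {"house": {"casa"}, "red": {"rosie"}}
print(pair_features(["the", "red", "house"], ["casa", "rosie"], lexicon))
```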
The last case study of sentence alignment that we present here is that of Chen (1993). He employs an EM algorithm that finds the sentence alignment of a parallel corpus which maximizes the translation probability of each sentence bead in the alignment. The translation probability to be maximized by the EM procedure, considering each possible alignment, is given by

P(E, F, A) = p(L) \prod_{k=1}^{L} p([E_k ; F_k])

The following notations were used: E is the English corpus (a sequence of English sentences), F is the French corpus, [E_k ; F_k] is a sentence bead (a pairing of m sentences in English with n sentences in French), A = ([E_1 ; F_1], ..., [E_L ; F_L]) is the sentence alignment (a sequence of sentence beads) and p(L) is the probability that an alignment contains L beads. The EM algorithm developed by Chen is similar in principle to the one we are about to describe, but there are several key differences that will be pointed out. Its accuracy is around 96% and was computed indirectly by checking disagreement with the Brown sentence aligner (Brown et al., 1991) on 500 randomly selected disagreement cases.
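As a small illustration (with invented numbers) of the quantity being maximized in such a bead model, the score of one candidate alignment decomposes into a length term plus one term per sentence bead:

```python
import math

def alignment_log_prob(bead_probs, p_length):
    """log p(L) plus the sum of the log probabilities of the L beads."""
    return math.log(p_length) + sum(math.log(p) for p in bead_probs)

# Three beads with invented probabilities and an invented length prior:
print(alignment_log_prob([0.02, 0.015, 0.03], p_length=0.1))
```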

EMACC

We propose an EM algorithm for aligning different types of textual units: documents, paragraphs and sentences, which we will name EMACC (an acronym for Expectation Maximization Alignment for Comparable Corpora). We draw our inspiration from the famous IBM models (specifically from the IBM-1 model) for word alignment (Brown et al., 1993), where the translation probability (eq. (5)) is modeled through an EM algorithm in which the hidden variable a models the assignment (1:1 word alignments) from the French sequence of words to the English one.
By analogy, we imagined that between two sets of documents (from now on, we will refer to documents as our textual units, but what we present here is equally applicable, with different performance penalties, to paragraphs and/or sentences), let us call them D_s and D_t, there is an assignment (a sequence of 1:1 document correspondences¹), the distribution of which can be modeled by a hidden variable taking values in the set {true, false}. This assignment will be largely determined by the existence of word translations between a pair of documents, translations that differ from one another in their ability to indicate a correct document alignment versus an incorrect one. In other words, we hypothesize that there are certain pairs of translation equivalents that are better indicators of a correct document correspondence than other translation equivalents pairs.

¹ Or alignments or pairs. These terms will be used with the same meaning throughout the presentation.
We take the general formulation and derivation of the EM optimization problem from (Borman, 2009). The general goal is to maximize P(x | θ), that is, to find the parameters θ for which P(x | θ) is maximum. In a sequence of derivations that we are not going to repeat here, the EM equation 1 is given by:

θ_{n+1} = argmax_θ \sum_z P(z | x, θ_n) \ln P(x, z | θ)    (1)

At step n+1 we try to obtain a new set of parameters θ_{n+1} that is going to maximize (the maximization step) the sum over z (the expectation step) which, in its turn, depends on the best set of parameters θ_n obtained at step n. Thus, in principle, the algorithm should iterate over a set of parameters, compute the expectation expression for each of these parameters and choose the parameters for which the expression has the largest value. But, as we will see, in practice the set of all possible parameters has a dimension that is exponential in terms of the number of parameters. This renders the problem intractable and one should back off to heuristic searches in order to find a near-optimal solution.

Having the equation of the EM algorithm, the next task is to tailor it to the problem at hand: document alignment. But before doing so, let us introduce a few notations that we will operate with:
- D_s is the set of source documents and |D_s| is the cardinal of this set;
- D_t is the set of target documents, with |D_t| its cardinal;
- (d_s, d_t) is a pair of documents, d_s ∈ D_s and d_t ∈ D_t;
- (w_s, w_t) is a pair of translation equivalents such that w_s is a lexical item that belongs to d_s and w_t is a lexical item that belongs to d_t;
- T is the set of all existing translation equivalents pairs (w_s, w_t); p(w_s, w_t) is the translation probability score (as the one given, for instance, by GIZA++ (Gao and Vogel, 2008)). We assume that GIZA++ translation lexicons already exist for the pair of languages of interest.

In order to tie equation 1 to our problem, we define its variables as follows:
- x is the sequence of 1:1 document alignments of the form (d_s, d_t), with d_s ∈ D_s and d_t ∈ D_t. We call x an assignment, which is basically a sequence of 1:1 document alignments. If there are q = min(|D_s|, |D_t|) 1:1 document alignments in x, then the set of all possible assignments has the cardinal equal to \binom{max(|D_s|, |D_t|)}{q} q!, where n! is the factorial function of the integer n and \binom{n}{k} is the binomial coefficient. It is clear now that, with this kind of dimension of the set of all possible assignments, we cannot simply iterate over it in order to choose the assignment that maximizes the expectation;
- z ∈ {true, false} is the hidden variable that signals whether a pair of documents (d_s, d_t) represents a correct alignment (true) or not (false);
- t is the sequence of translation equivalents pairs (w_s, w_t) from T in the order in which they appear in each document pair from x.
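To fix ideas, the objects above can be pictured as the following data structures (a sketch with invented toy content; the paper itself does not prescribe any particular representation):

```python
# D_s, D_t: documents as lists of (content-word) tokens, keyed by document id.
D_s = {"s1": "casa rosie casa".split(), "s2": "guvern lege vot".split()}
D_t = {"t1": "red house house".split(), "t2": "government law vote".split()}

# T: translation equivalents pairs (w_s, w_t) with GIZA++-style scores p(w_s, w_t).
T = {("casa", "house"): 0.7, ("rosie", "red"): 0.6,
     ("guvern", "government"): 0.8, ("lege", "law"): 0.75, ("vot", "vote"): 0.5}

# An assignment x: a set of 1:1 document pairs; z would mark each pair true/false.
x = {("s1", "t1"), ("s2", "t2")}
```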
Having defined the variables in equation 1 this way, our job is then to maximize P(t | x), meaning that we want to maximize the translation equivalents probability over a given assignment. In doing so, through the use of the hidden variable z, we are also able to find the 1:1 document alignments that attest to this maximization.
We proceed by reducing equation 1 to a form
that is readily amenable to software coding. That
is, we aim at obtaining some distinct probability
tables that are going to be (re-)estimated by the
EM procedure. Throughout the presentation, we
will make some (overt and emphasized) independence assumptions that may or may not be
correct but that are necessary in order to obtain
the desired simplification. We acknowledge the
fact that other derivations based on different assumptions are also possible.
We begin by expanding the expectation expression from equation 1. With the variables defined above, the observed data is the sequence of translation equivalents pairs t, the parameters are the assignment x (with x_n the assignment found at the previous step) and the hidden variable is z, so the expression to be maximized becomes

\sum_z P(z | t, x_n) \ln P(t, z | x)

and we make our first assumptions:

(A1) t and x_n are independent: P(t, x_n) = P(t) P(x_n);
(A2) the same independence holds conditioned on z: P(t, x_n | z) = P(t | z) P(x_n | z);
(A3) z does not depend on x_n: P(z | x_n) = P(z).

The third assumption (A3) states that z does not depend on x_n, which only makes sense in the context of a document pair. The first assumption (A1) mandates that t and x_n are independent, which is justifiable if we think that t only depends on the current assignment x and not on the previously estimated one. The second assumption (A2) extends the first one by also imposing the same independence condition conditioned on z. With these expressions at hand, we proceed with the simplifications.

A final assumption (A4) that we make is that P(t | z, x, x_n) = P(t | z, x) or, otherwise said, t and x_n are conditionally independent given z and x, because t only makes sense in the presence of a document pair.

We thus end up with two probability tables: P(t | z = true, d_s, d_t), which we call the lexical (document) alignment probability, and P(z = true | d_s, d_t), which is the estimated assignment probability. Because P(z = false | d_s, d_t) = 1 − P(z = true | d_s, d_t) and we are only interested in the z = true value, and because the posterior P(z | t, x_n) is a constant with respect to the new assignment (it is computed from the fixed assignment found at the previous step), maximizing the expectation expression is equivalent to maximizing the new EM equation 2:

x_{n+1} = argmax_x \sum_{(d_s, d_t) ∈ x} P(z = true | d_s, d_t) P(t | z = true, d_s, d_t)    (2)

Equation 2 suggests a method of updating the assignment probability P(z = true | d_s, d_t) with the lexical alignment probability P(t | z = true, d_s, d_t), in an effort to provide the alignment clues that will guide the assignment probability towards the correct assignment. All that remains to be done now is to define the two probabilities according to our setup: document pairs and translation equivalents pairs.
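For instance (numbers invented for illustration), if for one source document two candidate target documents currently have assignment probabilities 0.02 and 0.03 and lexical alignment probabilities 0.40 and 0.10 respectively, the per-pair products are 0.008 and 0.003, so the first candidate wins even though its current assignment probability is lower; this is exactly the kind of clue the lexical table is meant to contribute.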

The lexical document alignment probability P(t | z = true, d_s, d_t) is defined as follows (equation 3):

P(t | z = true, d_s, d_t) = \sum_{(w_s, w_t) ∈ T_{d_s, d_t}} \frac{1}{|T_{d_s, d_t}|} p(w_s, w_t | d_s, d_t)    (3)

where T_{d_s, d_t} is the set of translation equivalents pairs from T that are found in the document pair (d_s, d_t), and p(w_s, w_t | d_s, d_t) is the simplified lexical document alignment probability, which is initially equal to p(w_s, w_t) from the set T. This probability is to be read as the contribution (w_s, w_t) makes to the correctness of the (d_s, d_t) alignment. We want the alignment contribution of one translation equivalents pair (w_s, w_t) to distribute over the set of all possible document pairs, thus enforcing equation 4:

\sum_{(d_s, d_t) ∈ D_s × D_t} p(w_s, w_t | d_s, d_t) = 1    (4)

The summation in equation 3 is actually over all translation equivalents pairs that are to be found only in the current document pair (d_s, d_t), and the presence of the factor 1/|T_{d_s, d_t}| ensures that we still have a probability value.

The assignment probability P(z = true | d_s, d_t) is defined in a similar way (equation 5), as a table of probabilities over the document pairs, for which we enforce the condition (equation 6):

\sum_{(d_s, d_t) ∈ D_s × D_t} P(z = true | d_s, d_t) = 1    (6)

Using equations 2, 3 and 5 we deduce the final EM equation 7:

x_{n+1} = argmax_x \sum_{(d_s, d_t) ∈ x} [ P(z = true | d_s, d_t) \frac{1}{|T_{d_s, d_t}|} \sum_{(w_s, w_t) ∈ T_{d_s, d_t}} p(w_s, w_t | d_s, d_t) ]    (7)

Remember the fact that we cannot simply iterate over all possible assignments, which leads us to a greedy algorithm: simply iterate over all possible 1:1 document pairs and, for each document pair (d_s, d_t) ∈ D_s × D_t, compute the alignment count (it is not a probability, so we call it a count)

count(d_s, d_t) = P(z = true | d_s, d_t) \frac{1}{|T_{d_s, d_t}|} \sum_{(w_s, w_t) ∈ T_{d_s, d_t}} p(w_s, w_t | d_s, d_t)

and construct the best 1:1 assignment x_{n+1} by choosing those pairs (d_s, d_t) for which we have counts with the maximum values. Before the EM cycle is resumed, we perform the following updates (equations 7a and 7b):

P(z = true | d_s, d_t) ← P(z = true | d_s, d_t) + P(t | z = true, d_s, d_t)    (7a)
p(w_s, w_t | d_s, d_t) ← p(w_s, w_t | d_s, d_t) + p(w_s, w_t), for every (d_s, d_t) ∈ x_{n+1} and every (w_s, w_t) ∈ T_{d_s, d_t}    (7b)

and normalize the two probability tables with equations 6 and 4. The first update is to be interpreted as the contribution the lexical document alignment probability makes to the alignment probability. The second update equation aims at boosting the probability of a translation equivalent if and only if it is found in a pair of documents belonging to the best assignment so far. In this way, we hope that this translation equivalent will make a better contribution to the discovery of a correct document alignment that has not yet been discovered at step n + 1.

Before we start the EM iterations, we need to initialize the probability tables P(z = true | d_s, d_t) and p(w_s, w_t | d_s, d_t). For the second table we used the GIZA++ scores that we have for the translation equivalents pairs and normalized the table with equation 4. For the first probability table we have (and tried) two choices:

(D1) a uniform distribution: 1 / (|D_s| × |D_t|);

(D2) a lexical document alignment measure lex(d_s, d_t) (with values between 0 and 1) that is computed directly from a pair of documents (d_s, d_t) using the translation equivalents pairs from the dictionary T (equation 8):

lex(d_s, d_t) = \frac{1}{#(d_s)} \sum_{w_s ∈ d_s : ∃ w_t ∈ d_t, (w_s, w_t) ∈ T} f(w_s)    (8)

where #(d_s) is the number of words in document d_s and f(w_s) is the frequency of word w_s in document d_s. If every word in the source document has at least one translation (of a given threshold probability score) in the target document, then this measure is 1. We normalize the table initialized with this measure using equation 6.
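The sketch below puts the greedy count computation, the best-assignment construction and the two updates together for the 1:1 case. It is our own minimal re-implementation of the procedure as described above: the function and variable names are ours, the t-T and t-E thresholds are omitted, ties and efficiency are ignored, and it should not be read as the authors' actual code.

```python
from collections import defaultdict

def normalise(table):
    """Scale a dict of non-negative scores so that its values sum to 1."""
    z = sum(table.values())
    if z > 0:
        for k in table:
            table[k] /= z

def normalise_per_pair(p_lex):
    """Equation 4: each translation pair's mass over document pairs sums to 1."""
    totals = defaultdict(float)
    for (pair, dp), v in p_lex.items():
        totals[pair] += v
    for (pair, dp) in p_lex:
        if totals[pair] > 0:
            p_lex[(pair, dp)] /= totals[pair]

def lexical_overlap(src_tokens, tgt_tokens, T):
    """Equation-8-style measure: fraction of source tokens with a translation in the target."""
    translatable = {ws for (ws, wt) in T if wt in tgt_tokens}
    return (sum(1 for w in src_tokens if w in translatable) / len(src_tokens)
            if src_tokens else 0.0)

def emacc_sketch(D_s, D_t, T, steps=3):
    """Greedy 1:1 EM sketch in the spirit of EMACC (our reconstruction, not the original)."""
    doc_pairs = [(s, t) for s in D_s for t in D_t]
    # Translation equivalents pairs found in each document pair (fixed during EM).
    found = {(s, t): [(ws, wt) for (ws, wt) in T if ws in D_s[s] and wt in D_t[t]]
             for (s, t) in doc_pairs}
    # p(w_s, w_t | d_s, d_t): GIZA++ scores, normalised per translation pair (eq. 4).
    p_lex = {(pair, dp): T[pair] for dp in doc_pairs for pair in found[dp]}
    normalise_per_pair(p_lex)
    # P(z = true | d_s, d_t): D2-style initialisation (eq. 8), normalised (eq. 6).
    p_assign = {(s, t): lexical_overlap(D_s[s], D_t[t], T) for (s, t) in doc_pairs}
    normalise(p_assign)

    assignment = set()
    for _ in range(steps):
        # Alignment count for every candidate pair (the per-pair term of eq. 7).
        def lex_prob(dp):
            pairs = found[dp]
            return sum(p_lex[(p, dp)] for p in pairs) / len(pairs) if pairs else 0.0
        count = {dp: p_assign[dp] * lex_prob(dp) for dp in doc_pairs}

        # Greedy best 1:1 assignment: pick pairs by descending count.
        assignment, used_s, used_t = set(), set(), set()
        for (s, t) in sorted(count, key=count.get, reverse=True):
            if s not in used_s and t not in used_t:
                assignment.add((s, t)); used_s.add(s); used_t.add(t)

        # Updates 7a and 7b, then re-normalisation with equations 6 and 4.
        for dp in doc_pairs:
            p_assign[dp] += lex_prob(dp)                    # 7a
        for dp in assignment:
            for pair in found[dp]:
                p_lex[(pair, dp)] += T[pair]                # 7b
        normalise(p_assign)
        normalise_per_pair(p_lex)
    return assignment

# Toy run: two source and two target documents, an invented five-entry lexicon.
D_s = {"s1": "casa rosie casa".split(), "s2": "guvern lege vot".split()}
D_t = {"t1": "red house house".split(), "t2": "government law vote".split()}
T = {("casa", "house"): 0.7, ("rosie", "red"): 0.6,
     ("guvern", "government"): 0.8, ("lege", "law"): 0.75, ("vot", "vote"): 0.5}
print(emacc_sketch(D_s, D_t, T))   # expected content: ('s1', 't1') and ('s2', 't2')
```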
EMACC finds only 1:1 textual unit alignments in its present form, but a document pair (d_s, d_t) can easily be extended to a document bead following the example from (Chen, 1993). The main difference between the algorithm described by Chen and ours is that the search procedure reported there is invalid for comparable corpora, in which no pruning is available due to the nature of the corpus. A second very important difference is that Chen relies only on lexical alignment information, on the parallel nature of the corpus and on sentence length correlations, while we add the probability of the whole assignment which, when initially set to the D2 distribution, produces a significant boost of the precision of the alignment. Finally, the maximization equations are quite different but, in principle, try to model the same thing: the indication of a good alignment through the use of translation equivalents.

Experiments and Evaluations

The test data for document alignment was compiled from the corpora previously collected in the ACCURAT project² and known to the project members as the Initial Comparable Corpora, or ICC for short. It is important to know that ICC contains all types of comparable corpora, from parallel to weakly comparable documents, but we classified document pairs into three classes: parallel (class name: p), strongly comparable (cs) and weakly comparable (cw). We have considered the following pairs of languages: English-Romanian (en-ro), English-Latvian (en-lv), English-Lithuanian (en-lt), English-Estonian (en-et), English-Slovene (en-sl) and English-Greek (en-el). For each pair of languages, ICC also contains a Gold Standard list of document alignments that were compiled by hand for testing purposes.

² http://www.accurat-project.eu/

We trained GIZA++ translation lexicons for every language pair using the DGT-TM³ corpus. The input texts were converted from their Unicode encoding to UTF-8 and were tokenized using a tokenizer web service described by Ceauşu (2009). Then, we applied a parallel version of GIZA++ (Gao and Vogel, 2008) that gave us translation dictionaries of content words only (nouns, verbs, adjectives and adverbs) at wordform level. For Romanian, Lithuanian, Latvian, Greek and English we had lists of inflectional suffixes which we used to stem the entries of the respective dictionaries and the processed documents. Slovene remained the only language for which processing stayed at wordform level.

³ http://langtech.jrc.it/DGT-TM.html
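For illustration, a preprocessing step of this kind could look as follows; the three-column "source target probability" lexicon layout and the suffix list are assumptions made for this sketch, not a description of the actual ACCURAT pipeline.

```python
# Keep only lexicon entries above a probability threshold and strip a few
# inflectional suffixes. File format and suffix list are assumed, for illustration.

SUFFIXES = ("urile", "ului", "ilor", "ele", "ul", "a")   # invented suffix list

def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            return word[:-len(suf)]
    return word

def load_lexicon(path, threshold=0.001):
    """Read 'source target probability' lines into a {(stem_s, stem_t): prob} dict."""
    lex = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue
            src, tgt, prob = parts[0], parts[1], float(parts[2])
            if prob >= threshold:
                key = (strip_suffix(src), strip_suffix(tgt))
                lex[key] = max(prob, lex.get(key, 0.0))
    return lex
```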
The accuracy of EMACC is influenced by three parameters whose values have been experimentally set:
- the threshold over which we use translation equivalents from the dictionary for textual unit alignment; values for this threshold (let us call it t-T) are taken from a fixed ordered set of probability thresholds;
- the threshold over which we decide to update the probabilities of translation equivalents with equation 7b; values for this threshold (named t-E) are taken from the same ordered set;
- the top t-O% alignments from the best assignment found by EMACC; this parameter trades precision against recall, the perfect value for recall being equal to t-O%. Values for this parameter are taken from a fixed set of fractions (the resulting parameter grid is sketched right after this list).
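A minimal sketch of such a grid, with placeholder threshold values (the exact sets used in the paper are not reproduced here) and a hypothetical run_emacc driver:

```python
from itertools import product

T_T = [0.001, 0.4, 0.8]   # placeholder values for the dictionary threshold t-T
T_E = [0.001, 0.4, 0.8]   # placeholder values for the update threshold t-E
T_O = [0.3, 0.7, 1.0]     # placeholder values for the top-t-O fraction

for t_T, t_E, t_O in product(T_T, T_E, T_O):
    # run_emacc(corpus, t_T=t_T, t_E=t_E, t_O=t_O)   # hypothetical driver call
    print(f"t-T={t_T}  t-E={t_E}  t-O={t_O}")
```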
We have run EMACC (10 EM steps) on every possible combination of these parameters for the pairs of languages in question, on both initial distributions D1 and D2. For comparison, we also performed a baseline document alignment using the greedy algorithm of EMACC with equation 8 supplying the document alignment strength measure.
The following six tables report a synthesis of the results we have obtained which, for lack of space, we cannot give in full. Every two tables are paired (and should be compared to one another): the first one reports the results obtained with EMACC with the D2 initial distribution and the second one gives the results of the D2 baseline alignment algorithm. We omit the results of EMACC with the D1 initial distribution because the accuracy figures are always lower (10-20%) than those of EMACC with D2.
| p     | P/R               | Prms. (t-T, t-E, t-O) | P/R               | Prms. (t-T, t-E, t-O) | #   |
| en-ro | 1 / 0.66666       | *, *, <               | 1 / 1             | 0.4, 0.001, 1         | 21  |
| en-sl | 0.98742 / 0.29511 | 0.001, *, 0.3         | 0.89097 / 0.89097 | 0.001, 0.001, 1       | 532 |
| en-el | 1 / 1             | <, *, *               | 1 / 1             | <, *, 1               | 87  |
| en-lt | 1 / 0.29971       | 0.4, 0.001, 0.3       | 0.93371 / 0.93371 | 0.001, 0.8, 1         | 347 |
| en-lv | 1 / 1             | *, *, <               | 1 / 1             | 0.4, <, 1             | 184 |
| en-et | 1 / 0.69780       | *, *, 0.3             | 0.96153 / 0.96153 | 0.001, 0.4, 1         | 182 |

Table 1: EMACC with D2 initial distribution on parallel corpora

| p     | P/R               | Prms. (t-T, t-O) | P/R               | Prms. (t-T, t-O) | #   |
| en-ro | 1 / 0.66666       | *, <             | 1 / 1             | 0.8, 1           | 21  |
| en-sl | 0.98382 / 0.68738 | 0.001, 0.7       | 0.93785 / 0.93785 | 0.001, 1         | 532 |
| en-el | 1 / 0.69411       | *, <             | 1 / 1             | 0.001, 1         | 87  |
| en-lt | 0.95192 / 0.28530 | 0.4, 0.3         | 0.90778 / 0.90778 | 0.001, 1         | 347 |
| en-lv | 1 / 0.29891       | <, 0.3           | 0.97826 / 0.97826 | <, 1             | 184 |
| en-et | 1 / 0.69780       | <, <             | 0.97802 / 0.97802 | 0.001, 1         | 182 |

Table 2: D2 baseline algorithm on parallel corpora

| cs    | P/R               | Prms. (t-T, t-E, t-O) | P/R               | Prms. (t-T, t-E, t-O) | #   |
| en-ro | 1 / 0.69047       | >, *, <               | 0.85714 / 0.85714 | 0.4, >, 1             | 42  |
| en-sl | 0.96666 / 0.28807 | 0.4, 0.4, 0.3         | 0.83112 / 0.83112 | 0.4, 0.4, 1           | 302 |
| en-el | 0.97540 / 0.29238 | 0.001, 0.8, 0.3       | 0.80098 / 0.80098 | 0.001, 0.4, 1         | 407 |
| en-lt | 0.97368 / 0.29191 | 0.4, 0.8, 0.3         | 0.72978 / 0.72978 | 0.4, 0.4, 1           | 507 |
| en-lv | 0.95757 / 0.28675 | 0.4, >, 0.3           | 0.79854 / 0.79854 | 0.001, 0.8, 1         | 560 |
| en-et | 0.88135 / 0.26442 | 0.4, 0.8, 0.3         | 0.55182 / 0.55182 | 0.4, 0.4, 1           | 987 |

Table 3: EMACC with D2 initial distribution on strongly comparable corpora

| cs    | P/R               | Prms. (t-T, t-O) | P/R               | Prms. (t-T, t-O) | #   |
| en-ro | 1 / 0.69047       | >, <             | 0.85714 / 0.85714 | 0.4, 1           | 42  |
| en-sl | 0.97777 / 0.29139 | 0.001, 0.3       | 0.81456 / 0.81456 | 0.4, 0.1         | 302 |
| en-el | 0.94124 / 0.28148 | 0.001, 0.3       | 0.71851 / 0.71851 | 0.001, 1         | 407 |
| en-lt | 0.95364 / 0.28514 | 0.001, 0.3       | 0.72673 / 0.72673 | 0.001, 1         | 507 |
| en-lv | 0.91463 / 0.27322 | 0.001, 0.3       | 0.80692 / 0.80692 | 0.001, 1         | 560 |
| en-et | 0.87030 / 0.26100 | 0.4, 0.3         | 0.57727 / 0.57727 | 0.4, 1           | 987 |

Table 4: D2 baseline algorithm on strongly comparable corpora

| cw    | P/R               | Prms. (t-T, t-E, t-O) | P/R               | Prms. (t-T, t-E, t-O) | #   |
| en-ro | 1 / 0.29411       | 0.4, 0.001, 0.3       | 0.66176 / 0.66176 | 0.4, 0.001, 1         | 68  |
| en-sl | 0.73958 / 0.22164 | 0.4, 0.4, 0.3         | 0.42767 / 0.42767 | 0.4, 0.4, 1           | 961 |
| en-el | 0.15238 / 0.04545 | 0.001, 0.8, 0.3       | 0.07670 / 0.07670 | 0.001, 0.8, 1         | 352 |
| en-lt | 0.55670 / 0.16615 | 0.4, 0.8, 0.3         | 0.28307 / 0.28307 | 0.4, 0.8, 1           | 325 |
| en-lv | 0.23529 / 0.07045 | 0.4, >, 0.3           | 0.10176 / 0.10176 | 0.4, 0.4, 1           | 511 |
| en-et | 0.59027 / 0.17634 | 0.4, 0.8, 0.3         | 0.27800 / 0.27800 | 0.4, 0.8, 1           | 483 |

Table 5: EMACC with D2 initial distribution on weakly comparable corpora

| cw    | P/R               | Prms. (t-T, t-O) | P/R               | Prms. (t-T, t-O) | #   |
| en-ro | 0.85 / 0.25       | 0.4, 0.3         | 0.61764 / 0.61764 | 0.4, 1           | 68  |
| en-sl | 0.65505 / 0.19624 | 0.4, 0.3         | 0.39874 / 0.39874 | 0.4, 1           | 961 |
| en-el | 0.11428 / 0.03428 | 0.4, 0.3         | 0.06285 / 0.06285 | 0.4, 1           | 352 |
| en-lt | 0.60416 / 0.18012 | 0.4, 0.3         | 0.24844 / 0.24844 | 0.4, 1           | 325 |
| en-lv | 0.13071 / 0.03921 | 0.4, 0.3         | 0.09803 / 0.09803 | 0.4, 1           | 511 |
| en-et | 0.48611 / 0.14522 | 0.001, 0.3       | 0.25678 / 0.25678 | 0.4, 1           | 483 |

Table 6: D2 baseline algorithm on weakly comparable corpora

In every table above, the first P/R column gives the maximum precision (with the associated maximum recall) EMACC was able to obtain for the corresponding pair of languages using the parameters (Prms.) from the next column. The second P/R column gives the maximum recall (with the associated maximum precision) that we obtained for that pair of languages. The Prms. columns differ between Tables 1, 3 and 5 and Tables 2, 4 and 6: in Tables 1, 3 and 5 the parameters are given in the order t-T, t-E and t-O, while in Tables 2, 4 and 6 the parameters are given in the order t-T, t-O (the t-E threshold is absent because it is EMACC-specific). For the sake of compactness of representation we used some threshold interval placeholders: "<" stands for the first two values of a threshold's set, ">" for the last two values and "*" for all values. For instance, in Table 3 we obtained a precision of 1 and a recall of 0.69047 aligning 42 en-ro documents (the # column gives the number of documents per language) with any of the last two values for t-T, any value for t-E and either of the first two values for the t-O threshold.
To ease the comparison between EMACC and the D2 baseline for each type of corpora (p, cs or cw), the maximal values between the two should be compared: either the precision in the first P/R column or the recall in the second P/R column. In the case of parallel corpora (see Tables 1 and 2) we observe that EMACC is not able to improve on the baseline, for the simple reason that the baseline is very high (the smallest precision is 0.98). Moving on to the strongly comparable corpora (Tables 3 and 4), we see that the benefits of re-estimating the probabilities of the translation equivalents (based on which we judge document alignments) begin to emerge, with the precisions for en-el, en-lt, en-lv and en-et being better than those obtained with the D2 baseline. Finally, in the case of weakly comparable corpora, the benefits of EM re-estimation are clear, with EMACC consistently delivering better results than the D2 baseline (with the single exception of the en-lt precision).

Comments and Conclusions

The whole point of developing textual unit alignment algorithms for comparable corpora is to be able to provide good-quality quasi-aligned data to programs that are specialized in extracting parallel data from these alignments. In the context of this paper, the most important result to note is that translation probability re-estimation is a good tool for discovering new correct textual unit alignments in the case of weakly related documents. We also tested EMACC on the alignment of 200 parallel paragraphs (small texts of no more than 50 words) for all the pairs of languages that we have considered here. We can briefly report that the results are very similar to the parallel document alignments from Tables 1 and 2, which is a promising result because one would think that a significant reduction in textual unit size would negatively impact the alignment accuracy.
The only drawback of this algorithm is its high computing time. We use a cluster with a total of 32 CPU cores (4 nodes) with 6-8 GB of RAM per node and, with this configuration, the total running time is between 12h and 48h per language pair, depending on the setting of the various parameters.

References
BORMAN, S. 2009. The Expectation Maximization Algorithm. A short tutorial. Online at http://www.isi.edu/naturallanguage/teaching/cs562/2009/readings/B06.pdf
BROWN, P. F., LAI, J. C., and MERCER, R. L. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 169-176, June 18-21, 1991, University of California, Berkeley, California, USA.
BROWN, P. F., PIETRA, S. A. D., PIETRA, V. J. D., and MERCER, R. L. 1993. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2):263-311.
CEAUŞU, A. 2009. Statistical Machine Translation for Romanian. PhD Thesis, Romanian Academy (in Romanian).
CHEN, S. F. 1993. Aligning Sentences in Bilingual Corpora Using Lexical Information. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 9-16, Columbus, Ohio, USA.
DEMPSTER, A. P., LAIRD, N. M., and RUBIN, D. B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(B):1-38.
GAO, Q., and VOGEL, S. 2008. Parallel implementations of word alignment tool. In ACL-08 HLT: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, June 20, 2008, The Ohio State University, Columbus, Ohio, USA.
MUNTEANU, D. Ş., and MARCU, D. 2002. Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 289-295, July 6-7, 2002, University of Pennsylvania, Philadelphia, USA.
MUNTEANU, D. Ş., and MARCU, D. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477-504.
