
Automatic Quality Evaluation of Machine-Translated Output in Sociological-Philosophical-Spiritual Domain
Sanja Seljan; Ivan Dunđer
Department of Information and Communication Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
Zagreb, Croatia
sanja.seljan@ffzg.hr; ivandunder@gmail.com

Abstract — Automatic quality evaluation of machine translation systems has become an important issue in the field of natural language processing, due to the increased interest and needs of industry and everyday users. Development of online machine translation systems is also important for less-resourced languages, as they enable basic information transfer and communication. Although the quality of free online automatic translation systems is not perfect, it is important to assure acceptable quality. As human evaluation is time-consuming, expensive and subjective, automatic quality evaluation metrics try to approach and approximate human evaluation as much as possible. In this paper, several automatic quality metrics are utilised in order to assess the quality of specific machine-translated text. Namely, the research is performed on the sociological-philosophical-spiritual domain, resulting from the digitisation process of a scientific publication written in Croatian and English. The quality evaluation results are discussed and further analysis is proposed.

Keywords – automatic quality evaluation; machine translation; BLEU; NIST; METEOR; GTM; English-Croatian; Croatian-English; sociological-philosophical-spiritual domain.

I. INTRODUCTION

Machine translation evaluation has become a hot topic of interest to numerous researchers and projects, usually when comparing several machine translation systems with the same test set or when evaluating one system through different phases, such as automatic evaluation, correlation of automatic evaluation scores with human evaluation etc. [1].

Lately, extensive evaluation of machine translation quality has been conducted focusing on online machine translation systems, commercial or integrated systems applying statistical machine translation, sometimes combined with other machine translation approaches [2].

Statistical machine translation systems rely on huge amounts of parallel data, which is sometimes inconvenient for less-resourced languages or between different types of languages, and which requires more detailed quality error analysis and evaluation in order to improve the performance of a machine translation system [3].

Some authors point out the context of machine translation evaluation relating quality, purpose and context, trying to establish a coherent evaluation approach [4]. Automatic evaluation for less-resourced but morphologically rich languages is a topic of interest to numerous researchers and organisations, since the results could be useful to professional translators, the translation industry, researchers and everyday users. The main advantages of automatic evaluation metrics are speed, cost and objectiveness. They always perform in the same way and, besides being tuneable, they can provide meaningful, consistent, correct and reliable information on the level of machine translation quality [5].

Human evaluation, on the other hand, is considered to be the "gold standard", but it is subjective, tedious and more expensive.

II. RELATED WORK

Evaluation of machine translation has mainly been performed in the legislation, technical or general domain, due to available bilingual corpora [6]. Domains such as sociology, philosophy or religion are rarely investigated, as acquiring the necessary corpora and a specific reference translation represents a notable problem.

However, one research paper describes translation in cross-language information access performed by machine translation, supplemented by domain-specific phrase dictionaries, which were automatically mined from the online Wikipedia in the domain of cultural heritage [7]. Queries were translated from and into English, Spanish and Italian and then evaluated using human annotations.

In another study, machine translations produced by Google Translate were evaluated for the English-Swedish language pair on fictional and non-fictional texts, including samples of law documents, commercial company reports, social science texts (religion, welfare, astronomy) and medicine [8]. Evaluation was carried out with the BLEU metric, showing that law texts achieved roughly double the average BLEU score.

Evaluation of machine translation shows scores that are better by about 20% when two reference sets are used, and by up to 29% for three reference sets, with short and long sentences analysed separately [9]. Besides sentence length, other problems in the machine translation process were investigated, such as specific terminology, anaphora and ambiguity [8].
Another study describes the importance of the translation domain, which influences the quality of machine translation output; therefore, domain knowledge and specific terminology translations were added [10]. The research was conducted with the SYSTRAN translation system, which uses the transfer translation approach, for the Chinese-English, English-French, French-English and Russian-English language pairs.

Research on machine translation of religious texts by Google Translate for English-Urdu and Arabic-Urdu has also been conducted [11]. Evaluation of cross-language information retrieval using machine translation in the domain of sociology is presented in [12] for English, French, German and Italian.

III. RESEARCH

The following subsections describe the digitisation process and data set, and discuss the research methods, quality evaluation metrics and the tools used.

A. Digitisation

For the purpose of this research, a book of abstracts from a scientific conference containing mutual translations in Croatian and English was digitised with a scanner. Digitisation represents the systematic recording, storing and processing of content using digital cameras, scanners and computers [13]. It is the process of creating a digital representation of an object, image, document or signal, and allows them to be stored, displayed, disseminated and manipulated on a computer. In order to digitise the mutual bitexts, an HP Scanjet G3110 flatbed scanner was used, set to 300 dpi and grayscale scanning. Scanned abstracts were in A5 format and the text was written in Times New Roman font, size 10, standard black font colour on white paper.

Afterwards, optical character recognition (OCR) was carried out for extracting, editing, searching and repurposing data from the scanned book of abstracts. In this research ABBYY FineReader 8.0.0.677 was used as the OCR software, which identifies text by analysing the structure of the object that needs to be digitised, by dividing it into structural elements and by distinguishing characters through comparison with a set of pattern images stored in a database and built-in dictionaries. During optical character recognition, errors are inevitable, and the induced noise is a serious challenge to subsequent processes that attempt to make use of such data [14].

B. Data set

The book of abstracts, which consisted of 41 abstracts, was digitised and afterwards processed with OCR, then manually corrected and later on used as the reference set for machine translation, i.e. the gold standard. The book contained very specific abstracts of full scientific papers in the fields of sociology, psychology, theology and philosophy with emphasis on several topics, such as human dignity, religion, dialogue, freedom, peace, responsibility, family and community, and philosophical and sociological reflections.

The texts were compiled into a parallel bilingual sentence-aligned Croatian-English bitext consisting of mutual translations relating to the sociological-philosophical-spiritual domain. The process of preparing the data set included the digitisation of printed material, applying OCR techniques and its evaluation. All segments were sentence-aligned and a translation memory was created. The format of such a translation memory is ideal for further research in statistical terminology and collocation extraction, evaluation and analysis.

The following table shows the data set statistics (Table I). The data set consisted of 369 segments (sentences) and 107264 characters in total. The longest segment was 80 words long in Croatian, and 88 words in English, whereas 3677 distinct words appeared in Croatian, and 2782 in English. On average, English abstracts were composed of 10.6% more characters, and 19% more words. The specificity of terminology is also reflected in the large number of hapax legomena, which also indicates a variety of different topics in the digitised book of abstracts.

TABLE I. DATA SET STATISTICS

Data set                                        Croatian        English
No. of characters                               50631           56633
No. of words                                    7340            9062
No. of segments                                 369             369
No. of abstracts                                41              41
Max. words per segment                          80              88
Min. words per segment                          1               1
Distinct words                                  3677            2782
Words that appear only once (hapax legomena)    2837 (77.16%)   1892 (68.01%)
Words that appear twice (dis legomena)          424 (11.53%)    389 (13.98%)
Words that appear three times (tris legomena)   165 (4.49%)     159 (5.72%)
Words that appear more than three times         251 (6.83%)     342 (12.29%)
Arithmetical mean of characters per abstract    1234.90         1381.29
Arithmetical mean of characters per segment     137.21          153.48
Arithmetical mean of words per abstract         179.02          221.02
Arithmetical mean of words per segment          19.89           24.56
No. of OCR errors in total                      66              67
Arithmetical mean of OCR errors per abstract    1.61            1.63
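The frequency-band figures in Table I (hapax, dis and tris legomena) can be reproduced with a simple frequency count over the corpus. The following is a minimal sketch under the assumption of one segment per line in plain-text files; the file names are hypothetical, and the crude regular-expression tokenisation will not exactly match the tokenisation behind Table I.

# Minimal sketch: word-frequency bands (hapax/dis/tris legomena) per language.
# Assumes one segment per line in plain-text files; the file names are hypothetical.
import re
from collections import Counter

def frequency_bands(path):
    with open(path, encoding="utf-8") as f:
        tokens = re.findall(r"\w+", f.read().lower(), flags=re.UNICODE)
    freqs = Counter(tokens)
    return {
        "distinct words": len(freqs),
        "hapax legomena": sum(1 for c in freqs.values() if c == 1),
        "dis legomena": sum(1 for c in freqs.values() if c == 2),
        "tris legomena": sum(1 for c in freqs.values() if c == 3),
        "more than three times": sum(1 for c in freqs.values() if c > 3),
    }

for language, path in [("Croatian", "abstracts_hr.txt"), ("English", "abstracts_en.txt")]:
    print(language, frequency_bands(path))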
In total, 133 OCR errors occurred in the digitisation process. Typical errors during optical character recognition were misrecognitions of characters, missing whitespace characters or apostrophes, various forms of substitution errors, as well as space deletion and insertion errors. The most frequent OCR errors in Croatian were substitution errors (e.g. (l) → (i)) and space deletions, where two words were erroneously unified (e.g. (U postkomunističkim) → (Upostkomunističkim)). The most frequent OCR errors in English were substitution errors (e.g. («) → (() and missing apostrophes ('). OCR errors have an impact on later-stage processing and data usability; therefore, all scanned texts were manually post-edited afterwards.
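As a rough illustration of how the impact of OCR noise can be quantified against the post-edited reference, the sketch below computes a character-level difference rate per segment with Python's difflib. This is not the procedure used in the paper, and the file names are hypothetical.

# Rough sketch: character-level difference rate between raw OCR output and the
# manually post-edited reference, segment by segment (file names hypothetical).
import difflib

def char_difference_rate(ocr_line, reference_line):
    # 1 - similarity ratio: 0 means identical, higher means more OCR damage.
    return 1.0 - difflib.SequenceMatcher(None, ocr_line, reference_line).ratio()

with open("ocr_raw_hr.txt", encoding="utf-8") as raw, \
     open("postedited_hr.txt", encoding="utf-8") as ref:
    rates = [char_difference_rate(o.strip(), r.strip()) for o, r in zip(raw, ref)]

noisy_segments = sum(1 for r in rates if r > 0)
print(f"Segments with OCR differences: {noisy_segments} / {len(rates)}")
print(f"Mean character difference rate: {sum(rates) / len(rates):.4f}")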
C. Tools and methods

All machine translations for both directions (Croatian-English and English-Croatian) were generated by the freely available online machine translation service Google Translate (https://translate.google.com/). Automatic machine translation quality evaluation is performed for both directions by the following metrics: BLEU (BiLingual Evaluation Understudy), NIST (National Institute of Standards and Technology), METEOR (Metric for Evaluation of Translation with Explicit ORdering) and GTM (General Text Matcher).

The basic idea behind the mentioned metrics is to calculate the matching of automatic and reference translations. The metrics are based on the overlapping of identical surface forms, which is not suitable for languages with rich morphology and relatively free word order. Some of the metrics are based on fixed word order (METEOR, GTM), which is also not suitable for languages with relatively free word order, such as Croatian. BLEU is more order-independent, whereas METEOR introduces linguistic knowledge for n-grams having the same lemma, or for synonym matches.

GTM and METEOR are based on precision and recall, while BLEU and NIST are based on precision; to compensate for recall, BLEU introduces a brevity penalty [15]. The metrics are mainly focused on the evaluation of adequacy, as it indicates to what extent the meaning of the translation is preserved, and they penalise translations with missing words, which affects recall.

The BLEU metric, proposed by IBM, represents a standard for machine translation evaluation [16]. It matches machine translation n-grams with n-grams of its reference translation, and counts the number of matches on the sentence level, typically for n-grams of order 1-4. It assigns the same weight to each n-gram, which is one of the main drawbacks of this metric. The metric is based on identical surface forms, accepting only complete matches, and does not take into account words having the same lemma. BLEU also assigns a brevity penalty score, which is given to automatic translations shorter than the reference translation. It allows evaluation against multiple reference translations as well.
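As an illustration only, a corpus-level BLEU score of this kind can be computed with NLTK's implementation; the file names, whitespace tokenisation and smoothing settings below are assumptions, so the resulting figures would not be directly comparable to the scores reported in Table II.

# Illustrative corpus-level BLEU (n-grams of order 1-4, equal weights) with NLTK.
# File names are hypothetical; one segment per line, hypothesis and reference aligned.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def load_tokenised(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip().split() for line in f]

hypotheses = load_tokenised("mt_output_en-hr.txt")                   # machine translations
references = [[ref] for ref in load_tokenised("reference_hr.txt")]   # one reference per segment

score = corpus_bleu(
    references, hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),                 # equal weights for 1-4-grams
    smoothing_function=SmoothingFunction().method1,   # avoids zero scores on short segments
)
print(f"BLEU: {score:.4f}")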
The NIST metric is based on BLEU with some modifications [17]. While BLEU is based on n-gram precision assigning an equal weight to each word, NIST calculates an information weight for each word, i.e. higher scores are given to rarer n-grams, which are considered more informative. It also differs from BLEU in the brevity penalty calculation, where small differences in translation length do not impact the overall score. Stemming is significantly beneficial to BLEU and NIST [18].
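A corresponding NIST score can be obtained, for example, with NLTK's implementation of Doddington's measure; the file names are placeholders and the n-gram order shown is NLTK's default, not necessarily the setting behind the scores in Table II.

# Illustrative corpus-level NIST score (information-weighted n-grams) with NLTK.
# File names are hypothetical; segments are whitespace-tokenised.
from nltk.translate.nist_score import corpus_nist

def load_tokenised(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip().split() for line in f]

hypotheses = load_tokenised("mt_output_en-hr.txt")
references = [[ref] for ref in load_tokenised("reference_hr.txt")]

# n=5 is NLTK's default maximum n-gram order; rarer n-grams receive higher weights.
print(f"NIST: {corpus_nist(references, hypotheses, n=5):.4f}")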
The METEOR metric modifies BLEU by giving more emphasis to recall than to precision [19]. The metric incorporates linguistic knowledge, taking into account matches of the same lemma and synonym matches, which is suitable for languages with rich morphology [20]. This metric, like GTM, favours longer matches in the same order. It uses a fragmentation penalty, which reduces the F-measure if there are no bigram or longer matches [21]. The metric is calculated at the sentence or segment level, while the BLEU metric is usually computed at the corpus level. It cannot incorporate language knowledge from several references into a single score, but gives a score for each reference translation.
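NLTK also ships a METEOR implementation that can serve as an illustration; note that its synonym matching relies on the English WordNet, so it is only a rough stand-in for translations into Croatian, and the file names below are hypothetical.

# Illustrative segment-level METEOR with NLTK (requires nltk.download("wordnet")).
# Synonym matching uses the English WordNet; file names are hypothetical.
from nltk.translate.meteor_score import meteor_score

def load_tokenised(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip().split() for line in f]

hypotheses = load_tokenised("mt_output_hr-en.txt")
references = load_tokenised("reference_en.txt")

# METEOR is defined per segment; averaging gives a rough corpus-level figure.
scores = [meteor_score([ref], hyp) for ref, hyp in zip(references, hypotheses)]
print(f"Mean METEOR: {sum(scores) / len(scores):.4f}")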
The GTM metric computes the number of correct unigrams and favours longer matches, based on precision (the number of correct words divided by the length of the generated machine translation output) and recall (the number of correct words divided by the reference length), and calculates the F-measure [22]. This metric counts unigram matches referring to non-repeated words in the output and in the reference translation. It favours n-grams in the correct order and assigns them higher weights [23].
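The precision/recall core of GTM can be illustrated with a simplified unigram F-measure; this sketch is not the official GTM implementation, which additionally rewards longer contiguous runs of matching words, and the example sentences are invented.

# Simplified sketch of GTM's core idea: clipped unigram precision, recall and F-measure.
# The real GTM additionally rewards longer contiguous runs of matching words.
from collections import Counter

def unigram_f_measure(hypothesis_tokens, reference_tokens):
    # Clipped matches: each reference word can be matched at most as often as it occurs.
    matches = sum((Counter(hypothesis_tokens) & Counter(reference_tokens)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(hypothesis_tokens)
    recall = matches / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

hypothesis = "the quality of the translation is acceptable".split()
reference = "the translation quality is acceptable".split()
print(f"Unigram F-measure: {unigram_f_measure(hypothesis, reference):.4f}")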
Apart from the mentioned disadvantages of automatic evaluation metrics, there are numerous other drawbacks, such as the narrow aspect of evaluation, difficulties with the interpretation and meaning of scores, ignoring the importance of words, not addressing grammatical coherence etc. [23].

IV. RESULTS AND DISCUSSION

The following table shows the results of the automatic machine translation quality evaluation metrics (Table II). BLEU scores range from 0 (no overlap with the reference translation) to 1 (perfect overlap with the reference translation), whereas scores over 0.3 generally reflect understandable translations, and scores over 0.5 reflect good and fluent translations [24]. METEOR scores are usually higher than BLEU scores and reflect an understandable translation when higher than 0.5, and a good and fluent translation when higher than 0.7 [24]. NIST scores are 0 or higher and have no fixed maximum, whereas GTM scores range from 0 to 1. Different metric scores provide an overall overview of the machine translation quality with regard to various aspects of evaluation and can be correlated.

TABLE II. RESULTS OF AUTOMATIC QUALITY EVALUATION METRICS (HIGHER IS BETTER)

Machine translation direction    BLEU      NIST      METEOR    GTM
English-Croatian                 0.1656    4.6527    0.1976    0.3348
Croatian-English                 0.2383    5.8686    0.2439    0.5044

Overall, the results of the automatic quality assessment show scores that are better by 20-30% for the Croatian-English direction. This is mainly due to the fact that Croatian is a highly inflective language with rich morphology. Furthermore, metrics which rely on word matching penalise word types that appear in the form of different tokens. As the data set belongs to the sociological-philosophical-spiritual domain, it contains specific terminology for which Google Translate does not provide a correct translation. Namely, such terminology is unlikely to appear in the correct context in the language and translation models of the analysed machine translation system. The fact that the data set contains 77.16% hapax legomena in Croatian and 68.01% in English also points to infrequently used terminology. The BLEU metric, which ignores word relevance, penalises a machine translation that is shorter than the reference translation and counts the words having the same surface form. In this research the BLEU score is relatively low for English-Croatian (0.17) when compared to the Croatian-English direction (0.24). Generally, translating from morphologically rich languages into less rich languages results in better BLEU scores. The NIST metric, which is sensitive to more informative n-grams that occur less frequently, gives the following results: 4.65 for English-Croatian and 5.87 for Croatian-English. The METEOR metric shows scores close to the BLEU metric for English-Croatian (0.20) and for Croatian-English (0.24). Although METEOR counts matches at the stem level, in this research the raw data set was used, which was not lowercased and tokenised. The results of the GTM metric, which computes the F-measure, are as follows: 0.33 for English-Croatian and 0.50 for Croatian-English. The GTM score for English-Croatian is lower than for Croatian-English due to morphological variants of the same lemma, which causes lower scores due to the non-matching of words belonging to the same lemma but with different morphological suffixes.

V. CONCLUSIONS

In this research, a book of scientific abstracts was digitised with a scanner, subsequently processed with OCR, later on post-edited, and eventually used as a gold standard for machine translation. Automatic machine translations were generated by Google Translate and afterwards evaluated by means of several metrics and for both directions (Croatian-English and English-Croatian). The results for translation into Croatian are lower due to specific terminology that is not widely used on the internet and therefore not available in the correct context in the machine translation models, the morphological richness of the Croatian language, long sentences, relatively free word order and grammatical case agreement. Namely, this causes decreased scores since several forms of the same lemma, which are not identical to the reference morphological variant, count as mismatches. Overall, the results of the automatic machine translation quality evaluation for BLEU, NIST, METEOR and GTM are better for the Croatian-English direction (20-30% better). Further research on automatic quality evaluation would include more extensive evaluation applying other metrics, text lemmatisation, lowercasing, tokenisation and enlargement of the data set, using multiple reference sentences.
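As a pointer for the lowercasing and tokenisation mentioned as further work, a minimal preprocessing step of the kind that could precede re-evaluation might look like the sketch below; the regular-expression tokeniser and file names are assumptions, not part of this study, and a language-specific tokeniser or lemmatiser could be substituted.

# Minimal sketch of lowercasing and tokenising segments before re-running the metrics.
# File names are hypothetical; one segment per line in each file.
import re

def preprocess(path_in, path_out):
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            # lowercase and separate punctuation from words with a simple regex tokenisation
            tokens = re.findall(r"\w+|[^\w\s]", line.lower(), flags=re.UNICODE)
            fout.write(" ".join(tokens) + "\n")

for name in ("mt_output_en-hr.txt", "reference_hr.txt"):
    preprocess(name, name.replace(".txt", ".tok.txt"))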
REFERENCES

[1] D. R. Amancio, M. G. V. Nunes, O. N. Oliveira Jr., T. A. S. Pardo, L. Antiqueira and L. da F. Costa, "Using metrics from complex networks to evaluate MT," Physica A: Statistical Mechanics and its Applications, vol. 390, no. 1, 2011, pp. 131-142.
[2] S. Hampshire and C. Porta Salvia, "Translation and the internet: Evaluating the quality of free online machine translators," Quaderns: revista de traducció, no. 17, 2010, pp. 197-209.
[3] S. Stymne, "Pre- and postprocessing for statistical machine translation into Germanic languages," Proceedings of the ACL-HLT 2011 Student Session, 2011, pp. 12-17.
[4] E. Hovy, M. King and A. Popescu-Belis, "Principles of context-based machine translation evaluation," Machine Translation, vol. 17, 2002, pp. 43-75.
[5] P. Koehn, "What is a better translation? Reflections on six years of running evaluation campaigns," Tralogy 2011, 2011, p. 9, available at: http://homepages.inf.ed.ac.uk/pkoehn/publications/tralogy11.pdf
[6] C. Kit and T. M. Wong, "Comparative evaluation of online machine translation systems with legal texts," Law Library Journal, vol. 100, no. 2, 2008, pp. 299-321.
[7] G. J. F. Jones, F. Fantino, E. Newman and Y. Zhang, "Domain-specific query translation for multilingual information access using machine translation augmented with dictionaries mined from Wikipedia," Proceedings of the Second International Workshop on "Cross Lingual Information Access", 2008, pp. 34-41.
[8] J. Salimi, "Machine Translation of Fictional and Non-fictional Texts," Stockholm University Library, 2014, p. 16, available at: http://www.diva-portal.org/smash/get/diva2:737887/FULLTEXT01.pdf
[9] S. Seljan, T. Vičić and M. Brkić, "BLEU evaluation of machine-translated English-Croatian legislation," Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), 2012, pp. 2143-2148.
[10] E. D. Lange and J. Yang, "Automatic domain recognition for machine translation," Proceedings of the MT Summit VII, 1999, pp. 641-645.
[11] T. T. Soomro, G. Ahmad and M. Usman, "Google Translation service issues: Religious text perspective," Journal of Global Research in Computer Science, vol. 4, no. 8, 2013, pp. 40-43.
[12] M. Braschler, D. Harman, M. Hess, M. Kluck, C. Peters and P. Schäuble, "The evaluation of systems for cross-language information retrieval," Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), 2000, p. 6.
[13] D. Lopresti, "Optical character recognition errors and their effects on natural language processing," International Journal on Document Analysis and Recognition, vol. 12, no. 3, 2009, pp. 141-151.
[14] J. Smolčić and A. Valešić, "Legal contexts of digitization and preservation of written heritage," Proceedings of the INFuture2009 – Digital Resources and Knowledge Sharing Conference, 2009, pp. 87-94.
[15] C. Callison-Burch, M. Osborne and P. Koehn, "Re-evaluating the role of BLEU in machine translation research," Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics, 2006, pp. 249-256.
[16] K. Papineni, S. Roukos, T. Ward and W. J. Zhu, "BLEU: a method for automatic evaluation of machine translation," Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311-318.
[17] G. Doddington, "Automatic evaluation of machine translation quality using n-gram co-occurrence statistics," Proceedings of the Second Conference on Human Language Technology, 2002, pp. 128-132.
[18] A. Lavie, K. Sagae and S. Jayaraman, "The significance of Recall in Automatic Metrics for MT Evaluation," in Machine Translation: From Real Users to Research, R. E. Frederking and K. B. Taylor, Eds. Berlin, Heidelberg: Springer, 2004, pp. 134-143.
[19] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association of Computational Linguistics, 2005, pp. 65-72.
[20] M. Denkowski and A. Lavie, "Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems," Proceedings of the Sixth Workshop on Statistical Machine Translation (ACL), 2011, pp. 85-91.
[21] A. Agarwal and A. Lavie, "METEOR, M-BLEU and M-TER: Evaluation metrics for high-correlation with human rankings of machine translation output," Proceedings of the ACL 2008 Workshop on Statistical Machine Translation, 2008, pp. 115-118.
[22] Automated Community Content Editing PorTal (ACCEPT), "Analysis of existing metrics and proposal for a task-oriented metric," European Community's FP7 project deliverable, 2012, available at: http://cordis.europa.eu/docs/projects/cnect/9/288769/080/deliverables/001-D91Analysisofexistingmetricsandproposalofataskorientedmetric.pdf
[23] J. P. Turian, L. Shen and I. D. Melamed, "Evaluation of machine translation and its evaluation," Proceedings of the 9th Machine Translation Summit, 2003, pp. 386-393.
[24] A. Lavie, "Evaluating the Output of Machine Translation Systems," AMTA Tutorial, 2010, p. 86, available at: http://amta2010.amtaweb.org/AMTA/papers/6-04-LavieMTEvaluation.pdf
