A practical appraisal
Edited by
John Newton
A catalogue record for this book is available from the British Library.
Figures ix
Contributors x
Preface xvi
Abbreviations and acronyms xviii
1 Introduction and overview
John Newton 1
2 The story so far: an evaluation of machine
translation in the world today
Jeanette Pugh 14
3 Made to measure solutions
Annette Grimaila in collaboration with John Chandioux 33
4 The Perkins experience
John Newton 46
5 Machine translation in a high-volume translation
environment
Muriel Vasconcellos and Dale A. Bostad 58
6 Esperanto as an intermediate language for
machine translation
Klaus Schubert 78
7 Limitations of computers as translation tools
Alex Gross 96
8 Computerized term banks and translation
Patricia Thomas 131
9 The translator workstation
Alan Melby 147
Bibliography 208
Glossary of terms 222
Index 228
which are wrong. A translator who has been translating all day can
easily fail to spot a deviation from what appears to be a uniform
sequence of messages as the working day draws to a close. If
‘never’ occurs just once in a list of safety instructions beginning
‘always’, it is not difficult to imagine how an error could creep in.
In large organisations where translations are routinely proofread by
other translators or editors, errors of this kind would normally be
discovered and corrected, but many translators have to work
without this safety net.
As stated above, MT systems cannot deviate from the translation
equivalents that figure in the dictionary or dictionaries specified for
any particular task. However, if the human-supplied dictionary input
is wrong, the system will consistently reproduce errors in raw
translation until the entry is corrected. Fortunately, most MT output
is post-edited, and a post-editor, like the proofreader of human
translation, can be expected to notice problems of this kind and take
corrective measures in the text and in the dictionary. Moreover, MT
does not carry the risk of lines, paragraphs or pages being
unintentionally ‘skipped’ in translation.
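The consistency property described above — a wrong dictionary entry is reproduced uniformly in the raw output until someone corrects it — can be illustrated with a deliberately toy sketch. This is not how a real MT engine of the period worked (those involved full morphological and syntactic analysis); the glossary, words and the single-word substitution here are invented purely for illustration.

```python
# Toy illustration (not a real MT engine): dictionary-driven substitution
# reproduces whatever the entry says, right or wrong, with perfect
# consistency across the whole text.
glossary = {
    "always": "toujours",
    "never": "toujours",   # human coding error: should be "jamais"
    "wear": "portez",
    "goggles": "des lunettes de protection",
}

def raw_translate(sentence: str) -> str:
    """Word-for-word lookup; unknown words pass through unchanged."""
    return " ".join(glossary.get(word, word) for word in sentence.lower().split())

print(raw_translate("ALWAYS wear goggles"))  # toujours portez des lunettes de protection
print(raw_translate("NEVER wear goggles"))   # same wrong output, every time

# Once a post-editor corrects the entry, every later run is right:
glossary["never"] = "jamais"
print(raw_translate("NEVER wear goggles"))   # jamais portez des lunettes de protection
```

Unlike a tired human translator, the system makes the same mistake on every occurrence, which is precisely what makes the error easy for a post-editor to find and fix at the source.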
substantial and growing body of technical and scientific text that can
be handled efficiently and profitably using computers if the requisite
operational conditions exist, as is exemplified by the applications
described in Chapters 3, 4 and 5 of this volume. It would,
nonetheless, be unrealistic to expect any system to perform well with
source texts drawn randomly from a wide range of subject areas and
text typologies.
I have stated above that an MT system’s raw output—‘raw
translation’—is not usually regarded as a finished product. This
reflects the fact that most users practise post-editing to some degree.
The techniques applied range from rapid post-editing, for
information-only purposes, to comprehensive ‘polishing’ aimed at
making the finished product indistinguishable from human
translation. Most MT systems have a synchronized split-screen
editing mode which enables the post-editor to view and scroll the
raw translation in tandem with the corresponding section of source
text. Quality control rests with the post-editor and must be assumed
to be as rigorous as that applied to human translation produced for
similar purposes.
Syntax errors in raw translation are only a problem if they are so
numerous that post-editing requires as much effort as normal
human translation (see Chapter 3: Annette Grimaila and John
Chandioux). It is always advisable to conduct tests to ascertain
whether MT is suited to a particular environment before incurring
the expense of purchasing a system, but if such tests are to be fair
to the prospective purchaser and to the system, they should involve
a substantial volume of truly typical text and the requisite level of
dictionary updating. Anything less than this is unlikely to yield an
accurate picture.
system can only be fully exploited if users are trained in all aspects
of its operation and feel generally comfortable with it; thus, any
attempt to take shortcuts in the area of training would be likely to
have a very deleterious effect on the results obtained and on the
translators’ attitude to MT.
In Chapter 7, Alex Gross makes the point that MT demands
special skills of a very high order on the part of the operator.
Introducing it into a department which may lack the requisite
aptitude or commitment is therefore unlikely to produce optimum
results. While some translators take to MT like the proverbial duck
to water, others find the prospect of adapting to it extremely
daunting. My experience of training groups of translators to use MT
systems suggests that a good translator does not necessarily make a
good MT operator, although I did find that most translators—
including some who were nearing retirement—were able and willing
to make the necessary adjustment.
There should be no conflict of interest between human translators
and MT. The latter has proved beneficial to some translators
through opening up a new and challenging career, and the publicity
MT receives does, at least, bring discussion of translation into the
broader public domain. Furthermore, the presence of an MT system,
in a clearly subordinate role, serves to highlight the skills of the
human translator/post-editor. As one MT provider accurately
described it, ‘the system takes care of the donkey work, while the
human translator concentrates on fine-tuning the dictionaries and
polishing the raw output’. We should also recognize that some of the
tasks assigned to MT, e.g. information-only translation (as discussed
by Muriel Vasconcellos and Dale Bostad in Chapter 5), and
translation of time-sensitive restricted domain texts (as described in
Chapter 3), would probably not be performed at all if MT were not
available.
ATTITUDES TO MT
There are a number of (largely apocryphal) stories in circulation
about bizarre computer renderings of well-known phrases and
idioms; the most notorious of these is probably ‘out of sight, out of
mind’, allegedly rendered as ‘invisible idiot’. It calls to mind an
occasion when I was invited to demonstrate a French-English MT
system on television. It was clear from the outset that the presenter
wanted to maximize the entertainment value of the situation. After
CONCLUSIONS
Along with other writers, translators were quick to realize the
benefits of computers. Word processing is now used almost
universally among translators, and translation tools and
computerized dictionaries are steadily gaining ground.
Commercially-available MT systems are often designed to
meet wide-ranging needs. The fact that potential users tend to
have very specific needs renders a large proportion of an off-the-
shelf system’s capability superfluous in most applications. It
would therefore seem desirable to extend the principle of subject
specialization, as practised by human translators, to MT. Bespoke
systems (as described in Chapter 3) undoubtedly maximize the
potential for success. At some future time, it may even be possible
to assemble systems from ‘bolt-on’ modules, selected or adapted
to cope with specified features of a particular text type; for
instance, in an English-French context it would be useful to be
able to choose how the negative imperative should be handled. If
this idea does become a reality one day, what is left out (for any
INTRODUCTION
This chapter presents an assessment of the position of machine
translation (MT) in the world today. In this assessment I address
questions such as: how was the current situation of MT arrived at?
What does MT mean today? How is MT perceived in different parts
of the world and by different sectors of society? What contribution
does MT have to make now and in the future?
It is assumed that the reader is familiar with the general
background to MT, and only a very brief historical survey of the
field is given. The emphasis throughout the chapter is on the
contemporary position and status of MT; hence, the origins and
development of the field to date are considered only to the extent
that they may shed light on modern attitudes and perspectives.
To circumscribe the domain under examination, a short overview
of the kinds of activities that make up ‘machine translation’ today
and how these relate to other associated disciplines and domains is
given. The purpose of this overview is to situate MT in a scientific
sense, and to give the reader a feel for the divergences in breadth and
depth of MT-related activities in different parts of the world
(Western Europe, Eastern Europe, the United States, Japan, etc.),
and among different sectors (academic research, the public sector,
commercial research and development, international initiatives, etc.).
In the main body of the chapter, I present an analysis of the
present position of machine translation and explore the reasons for
the divergences identified earlier. Obviously, a thorough, in-depth
analysis of all the elements which emerge would be beyond the scope
of this chapter. What I will aim at, then, is rather a broad review of
possible factors underlying such differences as those in national and
Machine translation today 15
HISTORICAL PERSPECTIVE
The origins and development of machine translation have been well
documented, and I will not give yet another detailed historical
account here (the interested reader is invited to consult Hutchins
1986 which is an excellent source). However, while the historical
facts are amply recorded, their significance today and, in particular,
the extent to which they have shaped modern attitudes to and within
MT, have been paid relatively little attention. In this part of the
chapter, I review the history of MT from a contemporary
perspective.
While the concept of mechanical translation can be traced back as
far as the seventeenth century, the enabling technology for the
present-day concept of ‘machine translation’ appeared less than fifty
years ago. The electronic digital computer made its first impact in
the Second World War. In the immediate post-War period, there was
a natural move to explore its potential capabilities, and the
translation of languages was soon identified as one of the obvious
candidates for exploitation.
These early explorations reached a watershed in 1949 with the
famous Weaver memorandum, which ‘in effect…launched machine
translation…in the United States and subsequently elsewhere’
(Hutchins 1986:28). The significance of the Weaver memorandum
is undeniable. Interestingly, it focused on the general strategies and
long-term objectives of machine translation rather than on the
MT IN EUROPE
In Europe, the impact of the ALPAC report was initially dramatic,
but it took little more than a decade for its effects to disappear. As
we have seen, the late 1970s witnessed a veritable explosion of MT
activity in Europe, the most notable initiative being the launch by
the CEC of the EUROTRA programme, which has received
sustained high-level funding from both the European Commission
and the national authorities of all EC member states. It has
involved all twelve EC countries, working on all nine official EC
languages, with some twenty individual research sites and, at its
peak, about two hundred participants (Raw et al. 1989; Steiner
(ed.) 1991). At the time of writing, the EUROTRA programme has
taken a new and exciting turn, with the move to a two-year
transition programme in which both the scope of its activities and
the range of participants will be much more diversified. The
primary aim of this two-year programme is to prepare the
transition from EUROTRA’s pre-operational prototype to an
industrialization of the system. This will involve the active
participation of European industry which, it is hoped, will invest
not only manpower but also financial resources. The EUROTRA
stage has thus become much wider and the range and roles of its
actors more varied (EEC 1990).
Although EUROTRA has dominated the European MT scene by
virtue of its sheer size, it is by no means the sole MT effort in
Europe. There are also a number of national MT programmes which
testify to significant public-sector commitment, as well as substantial
involvement by the private sector.
France, Germany and the United Kingdom have the longest
traditions in MT in Western Europe. The famous French centre,
MT IN JAPAN
Although the future prospects for MT in the United States look
better now than at any other time in the post-ALPAC period, the
current situation still compares poorly with that in Japan. There,
MT enjoys a privileged status and is a highly valued, high-priority
activity in which an enormous investment of public and private
financial and human resources has been made. Every major
Japanese computer or electronics firm has invested considerable
effort in MT R&D, and many claim to have developed operational
systems.
Moreover, this investment is not confined to the private sector. A
major, long-term initiative in MT R&D was launched by the
Ministry of International Trade and Industry (MITI) which
MT IN THE FUTURE
In this chapter, I have sought to give the reader a ‘snapshot’ view
of the current status of MT in different parts of the world, focusing
on the contrast in attitudes between public and private sectors, and
on the varying policies of national governments. One fact which
clearly emerges is that MT is today a recognized international
scientific field with a worldwide community of researchers. Yet,
while its international status is now firmly established, the standing
which it enjoys at national level is by no means uniform across the
globe. Japan’s perception of MT as a key element in the future
development of an information-based society and its consequent
long-term commitment to MT activities lies at the extreme end of
the spectrum of national attitudes. In Europe, there is a solid
tradition of MT, and the EUROTRA programme has done much
to improve collaboration and to consolidate and expand expertise.
It is to be hoped—and expected—that future European research
programmes will ensure that ample place is given to MT and
related issues. As for the United States, the signs are very
encouraging and we can expect that, following the enlightened
example of Canada, the USA will come to play a leading role in the
future of MT.
There are also signs that involvement in MT will continue to
spread to include countries which have little experience in the field
so far. Recent developments in South Korea, for instance, indicate
that the private sector there is set to follow the example of its
Japanese counterpart with large-scale investment in MT R&D. It is
thus likely that a strong MT community will emerge in the Far East
which will set a challenge for the United States and Europe in an
BIBLIOGRAPHY
Abbou, A. (ed.) (1989) Traduction Assistée par Ordinateur: Perspectives
technologiques, industrielles et économiques envisageables à l’horizon 1990: l’offre, la
demande, les marchés et les évolutions en cours, Actes du Séminaire international
(Paris, March 1988), Paris: DAICADIF.
ALPAC (1966) Language and Machines: Computers in Translation and Linguistics
(Report by the Automatic Language Processing Advisory Committee,
Division of Behavioral Sciences, National Research Council),
Washington, DC: National Academy of Sciences.
Bennett, W.S. and Slocum, J. (1985) ‘The LRC machine translation system’,
Computational Linguistics 11:111–21; reprinted in J. Slocum (ed.) Machine
Translation Systems, Cambridge: Cambridge University Press, 1988, 111–
40.
Boitet, Ch. (1990) ‘Towards personal MT: general design, dialogue
structure, potential role of speech’, in H. Karlgren (ed.) COLING-90:
Papers presented to the 13th International Conference on Computational Linguistics,
Helsinki: Yliopistopaino, vol. 3, 30–5.
Carbonell, J.G. (1990) ‘Machine translation technology: status and
recommendations’, in T. Valentine (ed.) (1990) ‘Status of machine
translation (MT) technology: Hearing before the Subcommittee on
Science, Research and Technology of the Committee on Science, Space
and Technology, US House of Representatives, 101st Congress, Second
Session, September 11, 1990’, [no. 153], Chairman: Rep. T. Valentine,
Washington: US Government Printing Office, 119–31.
Chandioux, J. (1989) ‘METEO: 100 million words later’, in ATA
conference proceedings, 449–53.
EEC (1990) ‘Council decision of 26 November 1990 adopting a specific
programme concerning the preparation of the development of an
operational EUROTRA system’, Official Journal of the European
Communities, no. L 358/84, EEC/664/90, 21.12.90.
Farwell, D. and Wilks, Y. (1990) ‘Ultra: a multilingual machine translator’,
Research Report MCCS-90–202, Computing Research Laboratory, New
Mexico State University, Las Cruces, New Mexico.
Hammond, D.L. (ed.) (1989) Coming of Age and the proceedings of the Thirtieth
Annual Conference of the American Translators Association, October 11–15 1989,
Washington D.C., Medford, New Jersey: Learned Information.
Hutchins, W.J. (1986) Machine Translation: Past, Present, Future, Chichester:
Ellis Horwood.
——(1988) ‘Future perspectives in translation technologies’, in M.
Vasconcellos (ed.) Technology as Translation Strategy (American Translators
Association, Scholarly Monograph Series, vol. II), Binghamton, New
York: State University of New York Press, 223–40.
Iida, H. (1989) ‘Advanced dialogue translation techniques: plan-based,
memory-based and parallel approaches’, in ATR Symposium on Basic
Research for Telephone Interpretation, Kyoto, Japan, Proceedings 8/7–8/8.
Isabelle, P. (1989) ‘Bilan et perspectives de la traduction assistée par
ordinateur au Canada’, in A. Abbou (ed.) Traduction Assistée par Ordinateur:
Perspectives technologiques, industrielles et économiques envisageables à l’horizon
1990: l’offre, la demande, les marchés et les évolutions en cours, Actes du
séminaire international (Paris, March 1988), Paris: DAICADIF, 153–8.
JEIDA (1989) ‘A Japanese view of machine translation in light of the
considerations and recommendations reported by ALPAC, USA’,
Tokyo: JEIDA.
King, M. (1989) ‘Perspectives, recherches, besoins, marchés et projets en
Suisse’, in A. Abbou (ed.) Traduction Assistée par Ordinateur: Perspectives
technologiques, industrielles et économiques envisageables à l’horizon 1990: l’offre, la
demande, les marchés et les évolutions en cours, Actes du séminaire international
(Paris, March 1988), Paris: DAICADIF, 177–9.
Landsbergen, J. (1987) ‘Isomorphic grammars and their use in the
ROSETTA translation system’, in M. King (ed.) Machine Translation Today:
the state of the art, Edinburgh: Edinburgh University Press, 351–72.
Maas, D. (1987) ‘The Saarbrücken automatic translation system (SUSY)’, in
Overcoming the language barrier, Commission of the European
Communities, Munich: Verlag Dokumentation, vol. 1:585–92.
Nagao, M., Tsujii, J. and Nakamura, J. (1985) ‘The Japanese government
project for machine translation’, Computational Linguistics 11:91–110.
Nirenburg, S. (1989) ‘Knowledge-based machine translation’, Machine
Translation, 4:5–24.
Pappegaaij, B.C., Sadler, V. and Witkam, A.P.M. (eds) (1986) Word Expert
Semantics: an Interlingual Knowledge-based Approach, (Distributed Language
Translation 1), Dordrecht: Foris.
Pogson, G. (1989) ‘The LT/Electric Word multilingual wordworker’s
resource guide’, LT/Electric Word 13, Amsterdam: Language Technology
BV.
Raw, A., van Eynde, F., ten Hacken, P., Hoekstra, H. and Vandecapelle, B.
(1989) ‘An introduction to the Eurotra machine translation system’,
Working Papers in Natural Language Processing 1, TAAL Technologie,
Utrecht and Katholieke Universiteit Leuven.
Slocum, J. (1984) ‘Machine translation: its history, current status, and future
prospects’, in 10th International Conference on Computational Linguistics,
Proceedings of COLING-84, Stanford, California, 546–61.
Steer, M.G. and Stentiford, F.W.M. (1989) ‘Speech language translation’, in
J. Peckham (ed.) Recent Developments and Applications of Natural Language
Processing, London: Kogan Page, 129–40.
Steiner, E. (ed.) (1991) ‘Special issue on Eurotra’, Machine Translation 6.2–3.
Thurmair, G. (1990) ‘Complex lexical transfer in METAL’, in Third
International Conference on Theoretical and Methodological Issues in Machine
Translation of Natural Languages, Austin, Texas, 91–107.
Trabulsi, S. (1989) ‘Le système SYSTRAN’, in A.Abbou (ed.) Traduction
Assistée par Ordinateur: Perspectives technologiques, industrielles et économiques
envisageables à l’horizon 1990: l’offre, la demande, les marchés et les évolutions en
cours, Actes du séminaire international (Paris, March 1988), Paris:
DAICADIF, 15–27.
Tsujii, J. (1989) ‘Machine translation with Japan’s neighboring countries’, in
M. Nagao (ed.) Machine Translation Summit, Tokyo: Ohmsha, 50–3.
Uchida, H. and Kakizaki, T. (1989) ‘Electronic dictionary project’, in M.
Nagao (ed.) Machine Translation Summit, Tokyo: Ohmsha, 83–7.
Let us separate the machine from the translation and remember that
it is the machine that serves the translation and not the other way
round.
In all real-world applications of MT, the translator is not replaced.
In fact, he or she is the one person who must be consulted,
considered and helped by the application. If the machine output is of
such low quality or if its manipulation is so complex that the
translator wastes more time revising the results than he or she would
spend translating a source text, then the usefulness of the system is
seriously in doubt.
There is one world-renowned MT system which has been in
continuous use since the early 1980s: METEO and the Canadian
government’s use of it to translate public weather forecasts from
English to French (and from French to English since early 1989)
have been well publicized, but its creator’s views on the subject have
seldom been sought out or clearly understood.
METEO. What is less clear from the literature is that the prototype
developed in 1975–6 was never put into operation by the TAUM
group which thereafter concentrated its research on another project:
TAUM-AVIATION.
The Canadian government had practically shelved the
METEO project when Chandioux, who had by then left the
university, came to them with a proposal to resume development
of the prototype. What a shock when the first results showed that
only 40 per cent of the sentences in the weather bulletins were
adequately translated! In fact, the prototype had been developed
on a non-representative corpus of texts which was too small and
did not include all of the weather regions for which the system
was destined. The Canadian weather is so diverse that it took a
full year of analysis, development and adjustments to ensure an
80 per cent success rate.
METEO-1 finally reached this goal in 1978. It ran on a Cyber 7600
mainframe, required 1.6 megabytes of random access memory
and translated some 7,000 words per day.
Concurrently with his work on METEO, Chandioux had
undertaken the development of a programming language
specifically designed for linguistics. He wasn’t satisfied with what
could be done with traditional programming tools like PROLOG
or LISP, and even Q-SYSTEMS, with which METEO had been
developed, had had to be substantially rewritten in order to
eliminate a certain number of inconsistencies and unpredictable
bugs.
Chandioux had always been considered a visionary by the
computer experts at the University of Montreal. When he explained
what he wanted done, no one thought it was possible.
In 1976, Alain Colmerauer, a computer specialist who had
been the first to implement PROLOG as a programming
language, returned to Montreal to visit the TAUM group which
he had directed for some years and where he had written
Q-SYSTEMS. Discussions with Colmerauer, who considered the
project feasible, convinced Chandioux that the future of
computing and of his new language lay in the smaller machines
that would come to be known as microcomputers. In 1977,
Chandioux met with Professor Vauquois who, also agreeing with
Chandioux’s vision, took the time to detail what he would have
done differently with GETA’s machine translation projects were
he to start all over again.
translated and requires them to assume their role of revisers with the
computer looking after the routines of receiving, translating and
transmitting. The remaining 20 per cent of their workload takes up
half of their working hours even though terminals and other
microcomputers allow them quicker access to the bulletins and the
other tools which simplify their job.
Clearing up a misconception
It has often been said that the METEO system works because
weather bulletins use a simplified syntax, because the vocabulary is
limited or because the texts are so repetitive that there is no
challenge.
In reality, the syntax is not at all controlled at the input level: the
meteorologists are only required to respect certain semantic controls,
such as ‘strong winds’ being limited to velocities between ‘x’ and ‘y’
kilometers per hour. The recommended style resembles telegraphic
texts, with little if any punctuation or other syntactical points of
reference to help the translation process. The vocabulary, excluding
place names which are also translated, totals some 2,000 words.
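The input-side semantic controls mentioned above can be sketched in a few lines. The chapter does not publish METEO’s actual rules or velocity bands, so the descriptors, the ranges and the function below are hypothetical, chosen only to show the kind of check a meteorologist’s text might have to pass before translation.

```python
# Hypothetical sketch of a semantic control on bulletin input; the real
# METEO descriptors and km/h bands are not given in the source text,
# so these ranges are invented for illustration.
WIND_RANGES_KMH = {
    "light": (0, 19),
    "moderate": (20, 39),
    "strong": (40, 60),   # assumed band: "strong winds" limited to 40-60 km/h
}

def check_wind_phrase(descriptor: str, speed_kmh: int) -> bool:
    """Accept the phrase only if the speed fits the descriptor's band."""
    low, high = WIND_RANGES_KMH[descriptor]
    return low <= speed_kmh <= high

print(check_wind_phrase("strong", 50))   # within the assumed band: accepted
print(check_wind_phrase("strong", 90))   # outside it: the author must rephrase
```

The point of such controls is that they constrain meaning, not syntax: the meteorologist remains free to write in the terse telegraphic style the system expects.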
Repetition? Yes, the texts can be repetitive but not predictably so.
The eight meteorological centres across Canada each reflect regional
differences and diverging styles as well as meteorological realities as
diverse as the country is vast.
The following is an extract of a weather bulletin submitted to the
METEO system on 31 January 1991:
METRO TORONTO.
TODAY…MAINLY CLOUDY AND COLD WITH
OCCASIONAL FLURRIES. BRISK WESTERLY WINDS
TO 50 KM/H. HIGH NEAR MINUS 7.
TONIGHT…VARIABLE CLOUDINESS. ISOLATED
FLURRIES. DIMINISHING WINDS. LOW NEAR MINUS
15.
FRIDAY…VARIABLE CLOUDINESS. HIGH NEAR
MINUS 6.
PROBABILITY OF PRECIPITATION IN PERCENT 60
TODAY. 30 TONIGHT. 20 FRIDAY.
WATERLOO-WELLINGTON-DUFFERIN
BARRIE-HURONIA
GREY-BRUCE
HURON-PERTH
…SNOWSQUALL WARNING IN EFFECT…
TODAY…SNOW AND LOCAL SNOWSQUALLS. BRISK
WESTERLY WINDS TO 50 KM/H CAUSING REDUCED
VISIBILITIES IN BLOWING AND DRIFTING SNOW.
ACCUMULATIONS OF 15 TO 25 CM EXCEPT LOCALLY
UP TO 35 CM IN HEAVIER SQUALLS. HIGH NEAR
MINUS 9.
TONIGHT…FLURRIES AND LOCAL SNOWSQUALLS
CONTINUING. NORTHWEST WINDS TO 40 KM/H.
LOW NEAR MINUS 16.
FRIDAY…FLURRIES AND SQUALLS TAPERING TO
SCATTERED FLURRIES. HIGH NEAR MINUS 7.
PROBABILITY OF PRECIPITATION IN PERCENT 90
TODAY. 90 TONIGHT. 70 FRIDAY.
METEO’s machine output, without any human revision, reads as
follows:
LE GRAND TORONTO.
AUJOURD HUI…GENERALEMENT NUAGEUX ET
FROID AVEC QUELQUES AVERSES DE NEIGE. VENTS
VIFS D’OUEST A 50 KM/H. MAXIMUM D ENVIRON
MOINS 7.
CETTE NUIT…CIEL VARIABLE. AVERSES DE NEIGE
EPARSES. AFFAIBLISSEMENT DES VENTS. MINIMUM
D ENVIRON MOINS 15.
VENDREDI…CIEL VARIABLE. MAXIMUM D
ENVIRON MOINS 6.
PROBABILITE DE PRECIPITATIONS EN
POURCENTAGE 60 AUJOURD HUI. 30 CETTE NUIT. 20
VENDREDI.
WATERLOO-WELLINGTON-DUFFERIN
BARRIE-HURONIE
GREY-BRUCE
HURON-PERTH
…AVERTISSEMENT DE BOURRASQUES DE NEIGE
EN VIGUEUR…
AUJOURD HUI…NEIGE ET BOURRASQUES DE
NEIGE PAR ENDROITS. VENTS VIFS D OUEST A 50
KM/H OCCASIONNANT UNE VISIBILITE REDUITE
DANS LA POUDRERIE HAUTE ET BASSE.
DEFINITIONS
Earnings: gross monthly earnings, excluding bonus, commissions and
overtime. Earnings shall be determined where necessary on the basis
of 40 hours per week, 4.33 weeks per month and 12 months per year.
Renewal date: January 1st.
GENERAL PROVISIONS
If an employee suffers a specified loss as a direct result of a covered
accident, within 365 days after the date of such accident, the benefit
will be paid provided CONFED receives proof of claim.
The amount payable is based upon the amount specified in the
BENEFIT PLAN SUMMARY which is in effect at the time of loss and
calculated using the percentage for the loss set out in the following table:
ADMINISTRATIVE PROVISIONS
POLICY RENEWAL
At the end of each policy year, CONFED will renew this policy for
a further term of one policy year provided:
DÉFINITIONS
Salaire: rémunération mensuelle brute, à l’exclusion des
gratifications, commissions et heures supplémentaires. Le salaire est
calculé à raison de 40 heures par semaine, 4,33 semaines par mois et
de 12 mois par année.
*Renouvellement: le 1er janvier, à minuit une minute.
CONDITIONS GÉNÉRALES
En cas de sinistre directement attribuable à un accident garanti,
survenu dans les 365 jours qui suivent l’accident, CONFED verse la
prestation après avoir reçu les pièces justificatives. La prestation
correspond à un pourcentage du capital stipulé *aux
CONDITIONS PARTICULIÈRES au moment du sinistre,
conformément au tableau ci-dessous:
Here again the success of the development was assured by the assistance
of the department’s assistant manager, Paul Dupont, and the translators
who were directly involved in testing from the beginning and in
completing the grammars since October 1990. In December 1990, the
translator who is now in charge of the system, David Harris, confirmed
that he was able to produce a finished document from a General TAO
output in 30 minutes as compared to several hours before the system
was operational. Additional documents are being analyzed for possible
inclusion in the system and a French-English counterpart using the same
user interface is being considered.
Translated:
QA-0JQAA-WZ UWS USER/ADMIN V4.1 M/J
ENSEMBLE DOC
QA-GEJAA-H5 20/20 RISC SUPP ENREG DOC
QL-B16A9-JT WPS-PLUS VMS LIC E ENS CW:6000
QL-GHVA9-JN DECmsgQ VMS OPTION EXEC CW:2400
QT-0JXAG-9M WIN/TCP F/M DECSUPPORT 16MT9
QT-YNCAM-8M DPRT PRT SVCS F/M SER BASE 16MT9
Abbreviated:
QA-0JQAA-WZ UWS USER/ADMIN V4.1 M/J ENS DOC
QA-GEJAA-H5 20/20 RISC SUPP DOC
QL-B16A9-JT WPS-PLUS VMS LIC E ENS CW:6000
QL-GHVA9-JN DECmsgQ VMS OPT EXEC CW:2400
QT-0JXAG-9M WIN/TCP F/M DECSUPP 16MT9
QT-YNCAM-8M DPRT PRT SVCS F/M SER BASE 16MT9
present state of the art. We are still far from fully understanding the
nature of human language and its myriad linguistic representations.
A workable general translation system will eventually grow from
numerous specific translation projects, such as the ones we have
developed at John Chandioux Consultants. Through our research
and development activities we can slowly identify and solve
linguistic ambiguities and ensure a quality output that substantially
helps the translator to do what he or she is paid to do: translate
realities from one reference system to another.
Like any good tool, GramR has been found useful in related fields
for which it was not originally designed. Even though GramR was
originally developed to deal with English–French translation
problems, it is not specific to this language duo and can be used for
most European languages. It can even be used for other linguistic
applications.
John Chandioux Consultants Inc. has developed and launched
several general-use software products, all of which deal with the
French language: a spelling and grammar checker, a verb conjugator
and, most recently, a program that converts standard French spelling
to the ‘reformed’ spelling proposed by the Conseil Supérieur de la
Langue Française in Paris on 6 December 1990. These are known as
the GramR family of products: ORTOGRAF+, the name used in
Canada for the spelling and grammar checker, CONJUGUONS!
and L’ORTHOGRAPHE MODERNE.
The development of a similar class of products for the English
language is under way and should prove interesting to people
‘bilingually’ involved.
NOTE
GramR, METEO and ORTOGRAF+ are registered trade marks
and are the property of John Chandioux.
Chapter 4
BACKGROUND
One of the most successful machine translation applications known
to me is at Perkins Engines, Peterborough, England. It is an example
of what can be achieved when a system is introduced in a thoroughly
planned and methodical way into a restricted domain environment
to process controlled-language source texts.
Perkins has been a leading manufacturer of diesel engines since
1932 and is well established in worldwide export markets. Frequent
additions to the product range and modifications to existing
products have created a need for rapid production of high-quality
user documentation in five languages—English, French, German,
Spanish and Italian.
Until 1985, all translation had been done manually: some overseas
and some in the UK. The Technical Publications Manager, Peter Pym,
was keen to ensure that the four translated versions did not differ too
greatly in style or content from the English source texts or from each
other; close scrutiny of existing translations had revealed minor
semantic and stylistic divergences, as well as omissions and introduced
elements (i.e. elements not derived from the source text). As these
traits were particularly evident in translations produced or edited
overseas, greater control from Peterborough was clearly desirable but
the translations produced had to be acceptable to the overseas
subsidiaries, given that they and their customers were the consumers.
When Peter Pym decided to explore the possibility of using MT, he
already had a firm foundation on which to build: his department was
using a form of controlled English known as Perkins Approved Clear
English (PACE). PACE was initially based on the International
Language for Service and Maintenance (ILSAM), which in turn was
Pre-PACE
The heavy duty oil bath cleaners are usually fitted with a
centrifugal pre-cleaner mounted on top of the main cleaner.
Using PACE
Heavy-duty air cleaners of the oil bath type are usually fitted with
a central pre-cleaner, which is mounted on top of the main
cleaner.
Pre-PACE
There are a few engines fitted with spur gears instead of helical
gears shown in this section.
Using PACE
Certain engines are fitted with spur gears, instead of helical gears
which are shown in this section.
(Pym 1990:86–7)
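The effect of a controlled language like PACE can be approximated mechanically. The sketch below checks a sentence against an approved word list and a length limit; the word list and the limit are illustrative assumptions, not the actual PACE rules.

```python
# A minimal sketch of a controlled-language checker in the spirit of PACE.
# The approved vocabulary and the 20-word limit are invented for
# illustration; the real PACE rules are far more extensive.

APPROVED = {
    "certain", "engines", "are", "fitted", "with", "spur", "gears",
    "instead", "of", "helical", "which", "shown", "in", "this", "section",
}
MAX_WORDS = 20  # hypothetical sentence-length limit

def check_sentence(sentence):
    """Return a list of problems found in one sentence."""
    words = [w.strip(".,!?;:").lower() for w in sentence.split()]
    problems = []
    unapproved = [w for w in words if w not in APPROVED]
    if unapproved:
        problems.append(f"unapproved words: {unapproved}")
    if len(words) > MAX_WORDS:
        problems.append(f"sentence too long ({len(words)} words)")
    return problems

print(check_sentence(
    "Certain engines are fitted with spur gears, instead of helical gears "
    "which are shown in this section."))  # conforming sentence: no problems
```

A checker of this kind flags vocabulary drift before translation begins, which is precisely where a controlled language earns its keep.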
Multiple-word entry
The next step was to use the PACE English/French MicroCat
dictionary to produce raw translations of texts that had been
translated previously and then compare the machine’s output with
the human versions. This helped to identify recurrent structures
requiring multiple-word-entry solutions in the MicroCat dictionary;
for example, big end bearing, always ensure that.
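The role of multiple-word entries can be sketched as a longest-match look-up, in which a phrase such as big end bearing is consumed as a unit before any word-by-word translation is attempted. The French equivalents below are illustrative only; MicroCat's real dictionary format is not documented here.

```python
# Sketch: longest-match lookup so that multi-word entries such as
# 'big end bearing' are translated as units rather than word by word.
# All translations here are illustrative sample entries.

DICTIONARY = {
    ("big", "end", "bearing"): "coussinet de tête de bielle",
    ("always", "ensure", "that"): "assurez-vous toujours que",
    ("bearing",): "palier",
    ("big",): "grand",
    ("end",): "extrémité",
}
MAX_LEN = max(len(k) for k in DICTIONARY)

def translate(words):
    out, i = [], 0
    while i < len(words):
        # try the longest possible entry first, then shorter ones
        for n in range(min(MAX_LEN, len(words) - i), 0, -1):
            entry = tuple(words[i:i + n])
            if entry in DICTIONARY:
                out.append(DICTIONARY[entry])
                i += n
                break
        else:
            out.append(words[i])  # pass unknown words through unchanged
            i += 1
    return out

print(translate(["big", "end", "bearing"]))  # one unit, not three words
```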
Homographic entries
For the purposes of the MicroCat system, a homograph is defined as
a word that can function as more than one part of speech, in its base
form, in an inflected form or in some combination of both. Because
English verbs invariably have inflected forms that function as non-
verbal parts of speech, e.g. adjectives ending in -ing and -ed, all verbs
are treated as homographs. Provision is made for various
homographic combinations, including: noun/adjective, noun/
adjective/adverb, adjective/adverb, verb/adverb, verb/noun/adverb,
verb/noun/adjective and verb/noun/adjective/adverb, as well as verb
homographs which allow for the existence of a gerund or other -ing
noun, e.g. swim (swimming), build (building). Although the system
provides for up to nine parts of speech for each homographic entry,
it is unusual for more than five to be entered. The French
translations entered for the source homograph light in a general
context would probably be: lumière (noun), éclairage (-ing noun),
allumer (verb), léger (adjective), and qui allume (the so-called -ing
adjective); the latter form is entered to cover cases where the -ing form
is used adjectivally, e.g. the boy lighting the candles, the man walking his
dog; however, walking stick would need a separate entry to avoid its
being translated as la canne qui marche.
The system’s ability to differentiate between the various parts of a
homograph, as exemplified by its rendering of the nonsense sentence
the light lady lighting the lighting is lighting light lights as la dame légère qui
allume l’éclairage allume des lumières légères, is impressive.
Dictionary sequencing
In a situation where the prescribed vocabulary and terminology
were not contained within a single fully customized dictionary, a
hierarchical dictionary look-up sequence would be specified to
guide the system in its choice of translation equivalents. Up to
nine dictionaries can be sequenced by the user for any vocabulary
search or translation task, and up to twenty-six different
sequences can be resident at any time. In this mode, the system
searches for source-text words and idioms in each dictionary in
the specified sequence until it finds or finally fails to find the item
in question. In an automotive context, a look-up sequence might
contain a product-specific dictionary at the highest level, followed
by a general in-house dictionary, a general automotive dictionary
and the system’s core dictionary. In this sequence, terms coined
to describe features specific to the product would be found in the
first dictionary specified, terms relating to the company’s
products in general would be located in the second, general
automotive terms such as steering wheel and small end bush would be
found in the third and the fourth would provide general words
such as number and blue. Jumbling the sequence can produce
amusing but predictable results; e.g. sequencing the core
dictionary before the general automotive dictionary would result
in small end bush being rendered as le petit buisson de fin but this
would be attributable to human error (or mischief). The
MicroCat PACE dictionary’s total, yet exclusive, coverage of the
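The hierarchical look-up described above can be sketched in a few lines: dictionaries are searched in their specified sequence, and the first one containing the term wins. The entries below are illustrative.

```python
# Sketch of dictionary sequencing: a product-specific dictionary is
# consulted first, then a house dictionary, then a general automotive
# dictionary, and finally the core dictionary. All entries are invented
# examples in the spirit of the chapter.

product_dict = {"torque limiter": "limiteur de couple"}
house_dict = {"service manual": "manuel d'entretien"}
automotive_dict = {
    "steering wheel": "volant",
    "small end bush": "bague de pied de bielle",
}
core_dict = {"number": "nombre", "blue": "bleu", "bush": "buisson"}

SEQUENCE = [product_dict, house_dict, automotive_dict, core_dict]

def look_up(term, sequence=SEQUENCE):
    """Search each dictionary in order until the term is found."""
    for dictionary in sequence:
        if term in dictionary:
            return dictionary[term]
    return None  # the system 'finally fails to find the item'

print(look_up("small end bush"))  # found in the automotive dictionary
print(look_up("blue"))            # falls through to the core dictionary
```

Re-ordering SEQUENCE changes which dictionary wins for any term present in more than one of them, which is the mechanism behind the mishaps described above.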
Implementation
CONCLUSIONS
Not least of the benefits of Perkins’ bold decision to automate the
translation process was the opportunity it afforded for looking even
more closely at what was being written and how it was being
written. Introducing MT afforded Peter Pym a level of control that
REFERENCE
Pym, P.J. (1990) ‘Pre-editing and the use of simplified writing for MT: an
engineer’s experience of operating an MT system’, in P.Mayorcas (ed.)
Translating and the Computer 10: The Translation Environment 10 Years On,
London: Aslib, 80–96.
Chapter 5
Machine translation in a high-volume translation environment
Muriel Vasconcellos and Dale A. Bostad
The experiment
Current situation
The use of MT has now been stabilized in PAHO. The new
technology continues to do the lion’s share of the work. The decision
to use MT, which rests entirely with the terminology and translation
service, is based on the following characteristics of the input text:
Future applications
Thanks to ongoing dictionary work and system improvement,
ENGSPAN now produces raw output of sufficiently reliable quality
that consideration is being given to the translation of data bases and
other applications in which users can access MT directly. Of
particular interest are data bases that are available on compact disk
read only memory (CD-ROM). Several proposals have been made
and some of these may materialize into active projects.
The Consultative Group on International Agricultural Research
(CGIAR) has been collaborating with the PAHO MT project since
1986 and provided support for the installation of ENGSPAN at the
International Rice Research Institute in the Philippines and the
International Center for Tropical Agriculture (CIAT) in Colombia.
CGIAR is helping to form a donor consortium that will provide
PAHO with funds to adapt ENGSPAN to a microcomputer and
develop parallel systems from English into French and Portuguese,
as well as establish an MT center within the CGIAR network that
NOTES
1 Regional Office of the World Health Organization for the Americas.
WHO is a specialized agency in the United Nations system.
2 Developed with partial assistance from the US Agency for
International Development (AID) under Grant DEP-5443-G-SS-3048-00.
ENGSPAN is installed at AID and runs there on an IBM 3081
(OS/VMS).
3 SPANAM and ENGSPAN are written in PL/1 and run on PAHO’s
IBM mainframe computer, currently an IBM 4381 (DOS/VSE/SP),
which is used for many other purposes.
4 The Organization’s working languages are Spanish and English. The
English-Spanish and Spanish-English combinations account for 90 per
cent of the translation workload. Portuguese and French, which
together make up the other 10 per cent, are also official languages of
the Organization but are handled by a separate service.
5 For a detailed review of the data from the experiment, see Vasconcellos
1989b.
6 The equipment on hand was already old at the time of the experiment.
Current OCR technology would undoubtedly do much better.
BIBLIOGRAPHY
Grimes, J.E. (1975) The Thread of Discourse, The Hague: Mouton.
Halliday, M.A.K. (1967) ‘Notes on transitivity and theme in English’, part 2,
Journal of Linguistics 3: 199–244.
Hartmann, R.R.K. and Stork, F.C. (1976) Dictionary of Language and
Linguistics, New York: Wiley.
Chapter 6
Esperanto as an intermediate language for machine translation
Klaus Schubert
PRESTIGE
1 ‘the idea’;
2 ‘Esperanto on equal terms’; and
3 ‘Esperanto for its specificity’.
The first stage, ‘the idea’, had its origin in the very early years of
machine translation in the late 1940s and early 1950s. After the first
wishful attempts, it was soon understood that natural language is
more intricate than the decoding tasks the first computers had
performed well for military and intelligence applications. When
natural languages turned out to be too difficult, it was suggested that
something more consistent be tried, such as, for instance, Esperanto.
Yehoshua Bar-Hillel put forward this suggestion in his famous state-
of-the-art report of 1951 (Bar-Hillel 1951; Hutchins 1986:33–4).
This first stage of ideas about Esperanto in computational
language processing was preceded by a period when various scholars
used Esperanto as a kind of universal syntactic representation. Most
interesting in this respect are Lucien Tesnière’s dependency syntax
(Tesnière 1959/1982:64) and Petr Smirnov-Trojanskij’s method of
machine translation developed prior to the computer age (Denisov
1965:80ff.). Both systems were developed in the 1930s.
The second stage, ‘Esperanto on equal terms’, begins when
Esperanto is actually used in computational linguistics. First, a series
of studies appear which merely investigate the feasibility of the idea
(Sjögren 1970; Kelly 1978; Dietze 1986), then smaller programs are
written of which only a minority may have been published (Ben-Avi
1977), and, finally, larger implementations are realized. Such
implementations are carried out, for example, within the SUSY
machine-translation project in Saarbrücken (Maas 1982, 1985,
Productive word-formation
One of the central and most efficient aspects of the infiniteness of
language is the possibility of combining existing linguistic signs to
express new meanings. In all languages, the formation of syntagmata
and sentences offers an opportunity for the expression of new
meanings, but many languages also have a very productive
combinatorial capacity within words. Productive word formation is a
major instrument of infinite expressiveness. In English, this
possibility is marginal, in Esperanto, as in all agglutinative
languages, it is an essential feature.
Esperantology has brought about a highly elaborated theory of
word formation (Schubert 1989b). On the basis of a grammatical
foundation which was designed as an open, extensible system and
under the influence of the pragmatic factors to be discussed below,
Esperanto has, in its hundred-year history, developed a highly
compositional system of word-formation. As a consequence, the
language provides a powerful mechanism for forming new words
from existing roots, while at the same time, the derivation of the
meaning of these words can be automated to a high degree. (This
derivation mechanism cannot be fully exhaustive for theoretical
reasons. A language in which this were possible would have lost its
infiniteness in a crucial field.)
The ability to isolate the component elements of complex words
using automated processes is extremely important for a good and
effective functioning of word-based knowledge banks such as those
used in DLT, where the basic knowledge source is made up of
structured corpora. A speciality of Esperanto is, in this respect, its
unusually high degree of compositionality. This means that the
meaning of an extremely high percentage of complex words can be
inferred in a straightforward manner from the individual meanings
of their component elements. This is the case with Esperanto, as its
design objective was ease of learning, and inferability is an obvious
advantage in language learning. As it turns out, the same
characteristic pays off in machine translation as well. The fact that
Esperanto not only started with a perspicuous word-formation
system but maintained this system intact throughout a hundred
years of uncontrolled use can be attributed to the pragmatic factors
addressed below.
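The compositionality claimed here can be illustrated with the much-cited word malsanulejo ('hospital'). The morpheme glosses below are standard Esperanto, but the greedy segmenter itself is only a toy; DLT's actual word-formation analysis is far more sophisticated.

```python
# Toy segmenter for compositional Esperanto words. The example
# 'malsanulejo' decomposes as mal-san-ul-ej-o: roughly 'place for
# people characterized by the opposite of healthy', i.e. a hospital.

MORPHEMES = {
    "mal": "opposite of",
    "san": "healthy",
    "ul": "person characterized by",
    "ej": "place for",
    "o": "(noun ending)",
}

def segment(word):
    """Greedy longest-first segmentation into known morphemes."""
    parts = []
    while word:
        for n in range(len(word), 0, -1):
            if word[:n] in MORPHEMES:
                parts.append(word[:n])
                word = word[n:]
                break
        else:
            return None  # unanalysable residue
    return parts

parts = segment("malsanulejo")
print(parts)                          # ['mal', 'san', 'ul', 'ej', 'o']
print([MORPHEMES[p] for p in parts])
```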
Two prejudices
Finally, a few words about two frequently heard prejudices
concerning Esperanto that are closely related to the subject of this
discussion. First a practical, then a theoretical one.
Prejudice 1
‘I understand that some simple correspondence among pen pals and
stamp collectors is possible in Esperanto. But, of course, you can’t
say everything in that language, can you?’ Of course you can. Like
any other language, Esperanto can express everything which has
already been expressed in it. This sounds contradictory but it holds
for all languages, English as well as Yoruba: if you have to prove that
something can be said in a language, you will only succeed if this has
been done (and recorded) before. So the problem resides not only in
existing language use but in the language’s capacity to develop.
‘Western’ languages like English, German or Russian have very rich
vocabularies, but terminologists still work hard to extend them, as
unfortunately, they experience that you cannot really say everything
in English, German or Russian. The same holds true for Esperanto.
The language has the capacity to develop, and it has developed
whenever it was necessary for it to do so. Unlike ethnic languages,
however, Esperanto develops in the sociolinguistically unique setting
of international communication.
Prejudice 2
‘You argue that artificial symbol systems are insufficient. But
Esperanto is an artificial language par excellence, isn’t it?’ It is not.
What is often overlooked in this sort of discussion is the fact that
Esperanto has already entered its second century. When the first
textbook of Esperanto was published in 1887, Esperanto was
artificial, and it was not yet a language. But since then, Esperanto
has, in a slow and unnoticed development, become a language
spoken by people. According to Detlev Blanke (Blanke 1985:105ff
and Tabelle 2), Esperanto is the only project of a planned language
which has totally accomplished the transition from an artificial
CONCLUSION
The experience of the DLT machine translation project so far has
shown that Esperanto fulfils a specific requirement in language
technology: it can be used to good advantage as an intermediate
language in machine translation, when fully automatic high-quality
translation from the intermediate language into the target
language(s) is aimed at.
BIBLIOGRAPHY
Bar-Hillel, Y. (1951) ‘The state of machine translation in 1951’, American
Documentation 2:229–37.
Ben-Avi, S. (1977) ‘An investigation into the use of Esperanto as an
intermediate language in a machine translation project’, PhD Thesis,
Manchester.
Blanke, D. (1985) Internationale Plansprachen, Berlin: Akademie-Verlag.
Briem, S. (1990) ‘Maskinoversaettelse fra esperanto til islandsk’, in J.Pind
and E.Rögnvaldsson (eds) Papers from the Seventh Scandinavian Conference of
Computational Linguistics (Reykjavík 1989), Reykjavík: Institute of
Lexicography/Institute of Linguistics, 138–45.
Denisov, P.N. (1965) Principy modelirovanija jazyka (na materiale vspomogatel’nych
jazykov dlja avtomatičeskogo poiska i perevoda), Moscow: Moskovskij
Universitet.
Dietze, J. (1986) ‘Projekt der rechnergestützten Erarbeitung eines
Wörterbuchs von Esperantowurzeln’, Wissenschaftliche Zeitschrift der
Universität Halle, 35 [G, 5]: 90–1.
Dulicenko, A.D. (1989) ‘Ethnic language and planned language, and its
place in language science’, in K.Schubert (ed.) Interlinguistics. Aspects of
the science of planned languages, Berlin/New York: Mouton de Gruyter,
47–61.
Forster, P.G. (1987) ‘Some social sources of resistance to Esperanto’, in José-
Luis Melena Jiménez et al. (eds) Serta gratvlatoria in honorem Juan Régulo,
vol. 2: Esperantismo, La Laguna: Universidad, 203–11.
Gordos, G. (1985) ‘Parol-sintezo kun limigita vortaro’, in I.Koutny (ed.)
Perkomputila tekstoprilaboro, Budapest: Scienca Eldona Centro, 11–29.
Hjelmslev, L. (1963) Sproget, 2nd edn, Copenhagen: Berlingske forlag.
Hutchins, W.J. (1986) Machine Translation: Past, Present, Future, Chichester:
Ellis Horwood.
Janot-Giorgetti, M.T. (1985) ‘Parol-rekono kun limigita vortaro, gia apliko
en la lernado de parolataj lingvoj’, in I.Koutny (ed.) Perkomputila
tekstoprilaboro, Budapest: Scienca Eldona Centro, 57–68.
Chapter 7
Limitations of computers as translation tools
Alex Gross*
PRACTICAL LIMITATIONS
There are six important variables in any decision to use a computer
for translation: speed, subject matter, desired level of accuracy,
consistency of translation, volume and expense. These six
determinants can in some cases be merged harmoniously together in
a single task but they will at least as frequently tend to clash. Let us
take a brief look at each:
Speed
This is an area where the computer simply excels—one mainframe
system boasts 700 pages of raw output per night (while translators
are sleeping), and other systems are equally prodigious. How raw
the output actually is—and how much post-editing will be required,
another factor of speed—will depend on how well the computer has
been primed to deal with the technical vocabulary of the text being
translated. Which brings us to our second category:
Subject matter
Here, too, the computer has an enormous advantage, provided a
great deal of work has already gone into codifying the vocabulary of
the technical field and entering it into the computer’s dictionary.
Thus, translations of aeronautical material from Russian to English
can be not only speedy but can perhaps even graze the ‘98 percent
accurate’ target, because intensive work over several decades has
gone into building up this vocabulary. If you are translating from a
field the computer vocabulary of which has not yet been developed,
you may have to devote some time to bringing its dictionaries up to
a more advanced level. Closely related to this factor is the third
category:
Consistency of vocabulary
Here the computer rules supreme, always assuming that correct
prerequisite dictionary building has been done. Before computer
translation was readily available, large commercial jobs with a
deadline would inevitably be farmed out in pieces to numerous
translators with perhaps something resembling a technical glossary
distributed among them. Sometimes the task of ‘standardizing’ the
final version could be placed in the hands of a single person of
dubious technical attainments. Even without the added problem of a
Volume
From the foregoing, it should be obvious that some translation
tasks are best left to human beings. Any work of high or even
medium literary value is likely to fall into this category. But
volume, along with subject matter and accuracy, can also play a
role. Many years ago a friend of mine considered moving to
Australia, where he heard that sheep farming was quite profitable
on either a very small or a very large scale. Then he learned that
a very small scale meant from 10,000 to 20,000 head of sheep, a
very large one meant over 100,000. Anything else was a poor
prospect, and so he ended up staying at home. The numbers are
different for translation, of course, and vary from task to task and
system to system but the principle is related. In general, there will
be—all other factors being almost equal—a point at which the
physical size of a translation will play a role in reaching a
decision. Would-be users should carefully consider how all the
factors I have touched upon may affect their own needs and
intentions. Thus, the size and scope of a job can also determine
whether or not you may be better off using a computer alone,
some computer—human combination or having human
translators handle it for you from the start. One author proposes
8,000 pages per year in a single technical specialty with a fairly
standardized vocabulary as minimum requirements for
translating text on a mainframe system.5
Expense
Given the computer’s enormous speed and its virtually foolproof
vocabulary safeguards, one would expect it to be a clear winner in
this area. But for all the reasons I have already mentioned, this is
by no means true in all cases. The last word is far from having been
written here, and one of the oldest French companies in this field
DEEPER LIMITATIONS
This section explains how changing standards in the study of
linguistics may be related to the limitations in machine translation
we see today and perhaps prefigure certain lines of development in
this field. Those interested only in the practical side should turn
immediately to p. 115.
Some practical limitations of MT and even of CAT should
already be clear enough. Less evident are the limitations in some of
the linguistic theories which have sired much of the work in this
field. On the whole, Westerners are not accustomed to believing that
problems may be insoluble and, after four decades of labor, readers
might suppose that more progress had been made in this field than
appears to be the case. To provide several examples at once, I can
remember standing for some time by the display booth of a
prominent European computer translation firm during a science
conference at MIT and listening to the comments of passers-by. I
found it dismaying to overhear the same attitudes voiced over and
over again by quite sane and reasonable representatives from
government, business and education. Most of what I heard could be
summed up as:
These ideas are clearly not ones Bloomfield could have approved
of. They are not relativistic or cautious but universalist and all-
embracing; they do not emphasize the study of individual
languages and cultures but leap ahead into stunning
generalizations. As such, he would have considered them examples
of ‘secondary responses’ to language. In many ways they reflect the
USA of the late 1950s, a nation proud of its own new-found
dominance and convinced that its values must be more substantial
than those of ‘lesser’ peoples. Such ideas also coincide nicely with
a seemingly perennial need academia feels for theories offering a
seemingly scientific approach, suggestive diagrams, learned jargon
and a grandiose vision.
We all know that science progresses by odd fits and starts and
that the supreme doctrines of one period may become the
abandoned follies of a later one. But the turnabout we have
described is surely among the most extreme on record. It should
also be stressed that the outlook of Bloomfield, Whorf and Sapir
has never truly been disproved or rejected and still has followers
today.11 Moreover, there is little proof that these newer ideas, while
they may have been useful in describing the way children learn to
speak, have ever helped a single teacher to teach languages better
might argue that the Italian steer itself is different; technically and
anatomically, it might just qualify as a different subspecies.
This notion of ‘cutting the animal differently’ or of ‘slicing reality
differently’ can turn out to be a factor in many translation problems.
It is altogether possible for whole sets of distinctions, indeed whole
ranges of psychological or even tangible realities, to vanish when
going from one language to another. Those which do not vanish
may still be mangled beyond recognition. It is this factor which poses
one of the greatest challenges even for experienced translators. It
may also place an insurmountable stumbling block in the path of
computer-translation projects, which are based on the assumption
that simple conversions of obvious meanings between languages are
readily possible.
Another cross-cultural example concerns a well-known wager AI
pioneer Marvin Minsky has made with his MIT students. Minsky
has challenged them to create a program or device that can
unfailingly tell the difference, as humans supposedly can, between a
cat and a dog. Minsky has made many intriguing remarks on the
relation between language and reality16 but he shows in this instance
that he has unwittingly been manipulated by language-imposed
categories. The difference between a cat and a dog is by no means
obvious, and even ‘scientific’ Linnaean taxonomy may not provide
the last word. The Tzeltal Indians of Mexico’s Chiapas State in fact
classify some of our ‘cats’ in the ‘dog’ category, rabbits and squirrels
as ‘monkey,’ and a more doglike tapir as a ‘cat,’ thus proving in this
case that whole systems of animals can be sliced differently.
Qualified linguistic anthropologists have concluded that the Tzeltal
system of naming animals—making allowance for the fact that they
know only the creatures of their region—is ultimately just as useful
and informative as Linnaean latinisms and even includes
information that the latter may omit.17 Comparable examples from
other cultures are on record.18
An especially dramatic cross-cultural example suggests that at
least part of the raging battle as to whether acupuncture and the
several other branches of Chinese medicine can qualify as
‘scientific’ springs from the linguistic shortcomings of Western
observers. The relationships concerning illness the Chinese
observe and measure are not the ones we observe, their
measurements and distinctions are not the same as ours, their
interpretation of such distinctions is quite different from ours, the
diagnosis suggested by these procedures is not the same and the
assigned to each word. We’ll call them set A and set B. If each
numbered word within set A meant exactly the same thing as each
word with the same number in set B, translation would be no
problem at all, and no professional translators would be needed.
Absolutely anyone able to read would be able to translate any text
between these two languages by looking up the numbers for the
words in the first language and then substituting the words with the
same numbers in the second language. It would not even be
necessary to know either language. And computer translation in
such a case would be incredibly easy, a mere exercise in ‘search and
replace,’ immediately putting all the people searching through books
of words and numbers out of business.
But the sad reality of the matter—and the real truth behind
machine translation efforts—is that word #152 in language A does
not mean exactly what word #152 in language B means. In fact, you
may have to choose between words 152, 157, 478 and 1,027 to obtain
a valid translation. It may further turn out that word 152 in language
B can be translated back into language A not only as 152 but also
149, 462 and 876. In fact, word #152 in language B may turn out to
have no relation to word #152 in language A at all. This is because
forty-seven words with lower numbers in language B had meanings
that spilled over into further numbered listings. It could still be
argued that all these difficulties could be sorted out by complex trees
of search and ‘goto’ commands. But such altogether typical examples
are only the beginning of the problems faced by computational
linguists, as words are rarely used singly or in a vacuum but are
strung together in thick, clammy strings of beads according to
different rules for different languages. Each bead one uses influences
the number, shape and size of subsequent beads, so that each new
word in a language A sentence compounds the problems of
translation into language B by an extremely non-trivial factor, with a
possible final total exceeding by several orders of magnitude the
problems confronted by those who program computers for the game
of chess.
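The numbered-dictionary thought experiment can be made concrete in a few lines. The word numbers and candidate lists below are invented purely for illustration; the point is only that a first-candidate 'search and replace' fails even to round-trip.

```python
# Toy rendering of the numbered-dictionary thought experiment. If every
# number mapped one-to-one between the two languages, translation would
# be mere search-and-replace. In reality one word in A has several
# candidates in B, and each of those maps back to several words in A.

a_to_b = {152: [157, 478, 1027]}  # word 152 in A: three candidates in B
b_to_a = {157: [149, 152, 876]}   # the first of them: three candidates in A

def naive_translate(word_ids, table):
    """The search-and-replace fantasy: always take the first candidate."""
    return [table[w][0] for w in word_ids]

out = naive_translate([152], a_to_b)   # [157]
back = naive_translate(out, b_to_a)    # [149]: not the original [152]
print(out, back)
```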
There are, of course, some real technical experts, the linguistic
equivalents of chess grandmasters, who can easily determine most of
the time what the words mean in language A and how to render
them most correctly in language B. These experts are called
translators, though thus far no one has attributed to them the power
or standing of chess masters. Another large irony: so far, the only
people who have proved capable of manipulating the extremely
greatly enhanced general clarity about what languages are and how
they work.
In all this the translator is rarely perceived as a real person with
specific professional problems, as a writer who happens to
specialize in foreign languages. When MT systems are introduced,
the impetus is most often to retrain and/or totally reorganize the
work habits of translators or replace them with younger staff whose
work habits have not yet been formed, a practice likely to have
mixed results in terms of staff morale and competence. Another
problem, in common with word processing, is that no two MT
systems are entirely alike, and a translator trained on one system
cannot fully apply experience gained on it to another. Furthermore,
very little effort is made to persuade translators to become a factor
in their own self-improvement. Of any three translators trained on
a given system, only one at best will work to use the system to its
fullest extent and maximize what it has to offer. Doing so requires
a high degree of self-motivation and a willingness to improvise
glossary entries and macros that can speed up work. Employees
clever enough to do such things are also likely to be upwardly
mobile, which may mean soon starting the training process all over
again, possibly with someone less able. Such training also forces
translators to recognize that they are virtually wedded to creating a
system that will improve and grow over time. This is a great deal to
ask in either the USA’s fast-food job market or Europe’s
increasingly mobile work environment. Some may feel it is a bit
like singling out translators and asking them to willingly declare
their life-long serfdom to a machine.
The real truth may be far more sobering. As Bloomfield and his
contemporaries foresaw, language may be no puny afterthought of
culture, no mere envelope of experience but a major functioning
part of knowledge, culture and reality, their processes so
interpenetrating and mutually generating as to be inseparable. In a
sense humans may live in not one but two jungles, the first being
the tangible and allegedly real one with all its trials and travails.
But the second jungle is language itself, perhaps just as difficult to
deal with in its way as the first.
At this point I would like to make it abundantly clear that I am no
enemy either of computers or of computer translation. I spend
endless hours at the keyboard, am addicted to downloading all
manner of strange software from bulletin boards and have even
ventured into producing some software of my own. As I also love
Limitations of computers 123
NOTES
* I wish to express my gratitude to the following individuals, who read
this piece in an earlier version and assisted me with their comments and
criticisms: John Báez, Professor of Mathematics, Wellesley College;
Alan Brody, computer consultant and journalist; Sandra Celt,
translator and editor; André Chassigneux, translator and Maître de
Conférences at the Sorbonne’s École Supérieure d’Interprètes et de
Traducteurs (L’ESIT); Harald Hille, English terminologist, United
Nations; Joseph Murphy, Director, Bergen Language Institute; Lisa
Raphals, computer consultant and linguist; Laurie Treuhaft, English
Translation Department, United Nations; Vieri Tucci, computer
consultant and translator; Peter Wheeler, Director, Antler Translation
Services; and Apollo Wu, reviser, Chinese Department, United
Nations.
124 Alex Gross
10 Although presented here in summarized form, these ideas all form part
of the well-known Chomskian process and can be found elaborated in
various stages of complexity in many works by Chomsky and his
followers (Chomsky 1957, 1965 and 1975).
11 The bloodied battlefields of past scholarly warfare waged over these
issues are easily enough uncovered. In 1968 Charles Hockett, a noted
follower of Bloomfield, launched a full-scale attack on Chomsky
(Hockett 1968). Those who wish to follow this line of debate further
can use his bibliography as a starting point. Hostilities even spilled over
into a New Yorker piece and a book of the same name (Mehta 1971).
Other starting points are the works of Chomsky’s teacher (Harris 1951)
or a unique point of view related to computer translation (Lehmann
1987). Throughout this debate, there have been those who questioned
why these transformational linguists, who claim so much knowledge of
language, should write such dense and unclear English. When
questioned on this, Mehta relates Chomsky’s reply as follows:
12 See, for example, Fodor (Fodor and Katz 1964) or Chisholm (Chisholm
1981).
13 See Note 3 for reference to Esperanto. The South American Indian
language Aymara has been proposed and partially implemented as a
basis for multilingual machine translation by the Bolivian
mathematician Iván Guzmán de Rojas, who claims that its special
syntactic and logical structures make it an ideal vehicle for such a
purpose. On a surface analysis, such a notion sounds remarkably close
to Bloomfieldian secondary responses about the ideal characteristics of
BIBLIOGRAPHY
Bloomfield, L. (1933) Language, New York: Holt, Rinehart & Winston
(reprinted in great part in 1984, University of Chicago).
——(1944) ‘Secondary and tertiary responses to language’, in Language
20:45–55 and in C.F.Hockett (ed.) (1987) A Leonard Bloomfield Anthology,
Chicago: University of Chicago Press.
Booth, A.D. (ed.) (1967) Machine Translation, Amsterdam: North Holland.
Brower, R.A. (ed.) (1959) On Translation, Cambridge, Mass.: Harvard
University Press.
Carbonell, J.G. and Tomita, M. (1987) ‘Knowledge-based machine
translation, the CMU approach’, in S.Nirenburg (ed.) Machine Translation:
Theoretical and Methodological Issues, Cambridge: Cambridge University
Press.
Celt, S. and Gross, A. (1987) ‘The challenge of translating Chinese
medicine’, Language Monthly 43:19–21.
Chisholm, W.S. Jr. (1981) Elements of English Linguistics, London: Longman.
Chomsky, N. (1957) Syntactic Structures, The Hague: Mouton.
——(1965) Aspects of the Theory of Syntax, Cambridge, Mass.: MIT Press.
——(1975) The Logical Structure of Linguistic Theory, Chicago: University of
Chicago Press.
Coughlin, J. (1988) ‘Artificial intelligence and machine translation: present
developments and future prospects’, Babel 34:1, 1–9.
Datta, J. (1988) ‘MT in large organizations: revolution in the workplace’, in
M.Vasconcellos (ed.) Technology as Translation Strategy (American
Translators Association Scholarly Monograph Series, vol. II),
Binghamton, New York: State University of New York Press.
Drexler, E.K. (1986) Engines of Creation, New York: Anchor Press.
Fodor, J.A. and Katz, J.J. (1964) The Structure of Language, New York:
Prentice-Hall.
Gödel, K. (1931) ‘Über formal unentscheidbare Sätze der Principia
Mathematica und verwandte Systeme I’, Monatshefte für Mathematik und
Physik 38:173–98.
Greenberg, J. (1963) Universals of Language, Cambridge, Mass.: MIT Press.
Grosjean, F. (1982) Life with Two Languages: an Introduction to Bilingualism,
Cambridge, Mass.: Harvard University Press.
Guzmán de Rojas, I. (1985) ‘Logical and linguistic problems of social
communication with the Aymara people’, Ottawa: The International
Development Research Center.
Harel, D. (1987) Algorithmics: The Spirit of Computing, Addison-Wesley.
Harris, Z. (1951) Structural Linguistics, Chicago: University of Chicago Press.
Hjelmslev, L. (1961) Prolegomena to a Theory of Language, Madison: University
of Wisconsin Press.
Hockett, C.F. (1968) The State of the Art, The Hague: Mouton.
——(ed.) (1987) A Leonard Bloomfield Anthology, Chicago: University of
Chicago Press.
Hodges, A. (1983) Alan Turing: The Enigma, New York: Simon & Schuster.
Hunn, E.S. (1977) Tzeltal Folk Zoology: The Classification of Discontinuities in
Nature, New York: Academic Press.
Hutchins, W.J. (1986) Machine Translation: Past, Present, Future, Chichester:
Ellis Horwood.
Jakobson, R. (1959) ‘On linguistic aspects of translation’, in R.A. Brower
(ed.) On Translation, Cambridge, Mass.: Harvard University Press.
Kay, M. (1982) ‘Machine translation’, American Journal of Computational
Linguistics, April-June, 74–8.
Kingscott, G. (1990) ‘SITE buys B’Vital: relaunch of French national
MT project’, Language International, April.
Klein, F. (1988) ‘Factors in the evaluation of MT: a pragmatic approach’, in
M.Vasconcellos (ed.) Technology as Translation Strategy (American
Translators Association Scholarly Monograph Series, vol. II),
Binghamton, New York: State University of New York Press.
Lehmann, W.P. (1987) ‘The context of machine translation’, Computers and
Translation 2.
Malmberg, B. (1967) ‘Los nuevos caminos de la lingüística’, Siglo Veintiuno,
Mexico: 154–74.
Mehta, V. (1971) John is Easy to Please, New York: Ferrar, Straus & Giroux.
Minsky, M. (1986) The Society of Mind, New York: Simon & Schuster.
Nagel, E. and Newman, J.R. (1989) Gödel’s Proof, New York: New York
University Press.
Newman, P.E. (1988) ‘Information-only machine translation: a feasibility
study’, in M.Vasconcellos (ed.) Technology as Translation Strategy (American
Translators Association Scholarly Monograph Series, vol. II),
Binghamton, New York: State University of New York Press.
Nirenburg, S. (ed.) (1987) Machine Translation: Theoretical and Methodological
Issues, Cambridge: Cambridge University Press.
Paulos, J.A. (1989) Innumeracy, Mathematical Illiteracy and its Consequences, New
York: Hill & Wang.
TERM BANKS
Generally, little is known about term banks as, apart from one or
two, they have not received the same press as machine translation
(MT). Why is this? There seem to be three main reasons: first, it is
only now becoming possible to buy a term bank ‘off the shelf’ as one
might a personal computer (PC) version of an MT system. Second,
132 Patricia Thomas
Current developments
Since in the context of translation term banks are often linked into
machine translation systems, the current situation of both types of
help for translators will be reviewed, together with related aids.
Renewed interest in MT and particularly machine-assisted
translation (MAT) in recent years is due to a greater insistence on
the use of the mother tongue because of export marketing, with a
trend towards a lesser use of English. Industry’s need to penetrate
foreign markets has created an increasing and largely unsatisfied
demand for multilingual documentation. In addition to the
established term banks, and as an intermediate stage between these
and large MT systems, there is a flourishing growth in interactive
NOTES
1 Contact British Telecom UK Sales Operation, 8th Floor, Tenter House,
45 Moorfields, London EC2Y 9TH, tel: 0800 282444, telex: 8952558
NSSALG, fax: 071 250 8343.
2 Tel: 010 352 488 041.
3 One area in which cooperation is being achieved is work on subject
classification which began in the Scandinavian countries with the
development of NORDTERM. Representatives include
organizations in Denmark, Finland, Iceland, Norway, Sweden,
Germany and the Netherlands. The classification system is
hierarchical, consisting of a letter plus four digits, and it is hoped that
it will be implemented within the next few years. Its importance lies
in its effectiveness in providing small, tightly defined domains which
Computerized term banks 145
BIBLIOGRAPHY
Ahmad, K., Fulford, H., Holmes-Higgin, P., Rogers, M. and Thomas, P.
(1990) ‘The translator’s workbench project’, in C.Picken (ed.)
Translating and the Computer 11: Preparing for the Next Decade, London:
Aslib, 9–19.
Danzin, A., Allén, S., Coltof, H., Recoque, A., Steusloff, H. and O’Leary,
M. (1990) ‘Eurotra Programme Assessment Report’, Commission of the
European Communities, March 1990.
Fulford, H., Höge, M. and Ahmad, K. (1990) ‘User requirements study’,
European Commission Esprit II Project no. 2315, Translator’s
Workbench Project, Final Report on Workpackage 3.3.
Hutchins, W.J. (1986) Machine Translation: Past, Present, Future, Chichester:
Ellis Horwood.
Hutchins, W.J. and Somers, H.L. (1992) An Introduction to Machine Translation,
London: Academic Press.
Iljon, A. (1977) ‘Scientific and technical databases in a multilingual society’,
in Proceedings of Third European Congress on Information Systems, Munich:
Commission of the European Communities.
ISO 2788 (1974) ‘Documentation—guidelines for the establishment and
development of multilingual thesauri’.
ISO 6156 (1987) ‘Magnetic tape exchange format for terminological/
lexicographical records (MATER)’.
Knowles, F. (1979) ‘Error analysis of Systran output: a suggested criterion
for “internal” evaluation of translation quality and a possible corrective
design’, in B.M.Snell (ed.) (1979) Translation and the Computer, Amsterdam:
North-Holland, 109–33.
McNaught, J. (1988) ‘A survey of termbanks worldwide’, in C.Picken
(ed.) Translating and the Computer 9: Potential and Practice, London: Aslib,
112–29.
PREVIEW
The functions of a translator workstation can be divided into three
levels (Melby 1982) as follows:
148 Alan Melby
LEVEL-ONE FUNCTIONS
and-replace function will not find ‘broke’ because it does not know
enough about morphology to identify past, plural and other forms
of words. Again, translators would not be the only clients to use
this feature, so it is likely that normal market pressures will
eventually result in general-purpose word processors with
morphology-based search-and-replace functions becoming
available.
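The kind of morphology-based search described above can be sketched as follows. This is a minimal illustration, not any actual word processor’s feature: the inflection table is hand-made and would, in a real system, come from a proper morphological analyser.

```python
import re

# Toy inflection table; a real workstation would derive these forms
# from a morphological analyser rather than a hand-made list.
FORMS = {
    "break": ["break", "breaks", "breaking", "broke", "broken"],
}

def morph_find(text, lemma):
    """Find every inflected form of `lemma`, not just the exact string."""
    pattern = r"\b(%s)\b" % "|".join(re.escape(f) for f in FORMS[lemma])
    return re.findall(pattern, text)

# A plain search for 'break' would miss 'broke'; this does not.
print(morph_find("The machine broke; machines break often.", "break"))
```

A replace operation would work the same way, substituting the matching inflected form of the replacement word for each hit.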
An interesting enhancement to word processing proposed by
Professor Gregory Shreve (personal communication) of Kent State
University is a database of prototypical texts. When translating a
patent, for example, a translator would find it helpful to have ready
access to a typical patent in the target language and country as well
as a description of the elements of such a document. We see the
beginnings of functions to support this in the style sheets available in
some word processors.
Also, SGML, a very flexible mark-up language, promises to make
document structure descriptions available in a universal form. Once
SGML document descriptions are widely adopted, it will make it
easier to transfer texts among different word processing software
packages.
SGML, which stands for Standard Generalized Mark-up
Language, became an international standard (ISO 8879) in 1986.
The text encoding initiative (TEI) is an international effort
including representatives from the Association for Computing in
the Humanities and the Association for Computational
Linguistics. The TEI group is producing an application of SGM L
which will create a standardized system of embedded format
codes. Eventually, through end-user pressure, word processing
software packages will include utilities to import to and export
from TEI-SGML format. This will solve one of the problems
with word processing for translators today: some clients want the
final version of their translation returned to them containing the
internal format codes for one word processor while others prefer
that it contain those of another. Translators should not have to
switch back and forth between different word processors to meet
these requirements. However, lacking any other option, they
sometimes turn to utilities, either supplied with the word
processing software or by a third party, that switch between all
kinds of word processing formats. However, these utilities need to
be updated constantly for new releases of word processing
software and thus tend to be incomplete. The final solution to
Telecommunications
Telecommunications is a basic need for most translators today. The
apparent exception is the in-house translator who is handed a
source text from someone else in the organization and who hands
the translation to someone else. But even in this situation, it is
likely that some text transfer using telecommunications takes place
either as the document travels from the technical writers to the
organization or from the translation office to an office in another
country.
It seems that mail is just not fast enough for many text deliveries
today. One reason for this is the crucial need for business to reduce
the time lag between launching a new product in one language
market and making it available in another. Corporations will
probably find the problem of reducing this interval even more
critical in the 1990s, particularly after 1992 in Europe.
The most common form of telecommunications used in the late
1980s was text transfer by means of a fax machine. As useful as fax
has become, it is only a partial solution for the translator. Text
transfer of target-language texts via fax presents two important
problems: first, a fax cannot be edited using a word processor;
second, a faxed copy of a text is not camera-ready. Perhaps the
optimum solution to this problem would be the widespread use of
file transfer by modem with automatic error detection and
correction. Fortunately, there are several such error-calculating
transmission protocols available, such as X-modem and Kermit.
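The error detection these protocols rely on can be illustrated with the original XMODEM-style block checksum. This is only a sketch of the idea: the real protocol also involves 128-byte block framing, block numbering and ACK/NAK handshaking, and later variants use a 16-bit CRC instead of the simple sum shown here.

```python
def checksum(block: bytes) -> int:
    """Original XMODEM-style check value: the sum of the data bytes
    modulo 256, transmitted after each block."""
    return sum(block) % 256

def accept(block: bytes, sent: int) -> bool:
    """Receiver side: recompute the checksum and compare; on a mismatch
    the block would be rejected (NAK) and retransmitted."""
    return checksum(block) == sent

data = b"translated segment"
ck = checksum(data)
print(accept(data, ck))                   # clean transfer: accepted
print(accept(b"translatXd segment", ck))  # single-byte corruption: caught
```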
Another important translation-related function of
telecommunications is the obtaining of access to remote databases,
such as bibliographic databases, for doing research. This research
may relate either to single terms or to general research about the
domains of the source text and would be carried out by locating
related or explanatory documents.
Terminology management
Terminology management is extraordinarily important to
translation of special-language texts containing many technical
terms. Terminology is standardized at multiple levels. Some terms
The translator workstation 159
LEVEL-TWO FUNCTIONS
Level one does not assume that the source text is available in
machine-readable form. The target text can be created on a word
processor, various terminology files can be consulted and the
translation can be sent to the requestor by electronic file transfer, all
without the source text being available to the translator in a
compatible electronic form. Although most translation produced in
the 1980s was produced from hard copy only, this situation should
change during the 1990s.
When the source text is available in compatible electronic form,
three new functions can be added to the translator workstation: text
analysis, automatic term look-up in electronic terminology files and
synchronized retrieval of source and target texts.
The distinction between level-one functions and level-two
functions is simply that level one is restricted to what can be done
when the source text is only available in hard-copy form. Level two
comprises all functions included in level one plus additional
functions for processing the source text in various ways.
Text analysis
One basic text analysis tool is a dynamic concordance system which
indexes all the words in the document and which allows the user to
request all occurrences of a word or combination of words within the
document. This type of analysis may assist in the translation of a
long document because it allows the translator to quickly see how
troublesome terms are used in various contexts throughout the
document.
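Such a dynamic concordance can be sketched in a few lines: index the position of every word, then display each requested occurrence in its immediate context (a keyword-in-context listing). The sample sentence and the context window width are invented for illustration.

```python
from collections import defaultdict

def build_concordance(text):
    """Index the position of every word so that all occurrences of a
    term can be retrieved on request."""
    words = text.lower().split()
    index = defaultdict(list)
    for pos, word in enumerate(words):
        index[word.strip(".,;:")].append(pos)
    return words, index

def kwic(words, index, term, width=3):
    """Keyword-in-context: each occurrence of `term` with a few words
    of surrounding context."""
    return [" ".join(words[max(0, pos - width): pos + width + 1])
            for pos in index.get(term.lower(), [])]

words, index = build_concordance(
    "The valve controls flow. Close the valve before servicing the pump."
)
for line in kwic(words, index, "valve"):
    print(line)
```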
Several text-analysis software packages are available commercially
for PCs. There are two basic types: one which searches a text ‘on-
Automatic lookup
Once terminology files are available and routines are in place to find
the basic forms of inflected words (i.e. to perform morphological
analysis), it is a straightforward matter to identify the words in a
piece of source text and automatically look them up in a terminology
file. Terms found in the file would be displayed on the screen along
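The automatic lookup just described — morphological normalization followed by dictionary lookup — might be sketched as follows. The terminology entries, French equivalents and suffix list are all invented for illustration; a real workstation would use a full morphological analyser rather than crude suffix stripping.

```python
# Hypothetical bilingual terminology file keyed by base (dictionary) form.
TERM_FILE = {"valve": "soupape", "pump": "pompe", "flange": "bride"}

def base_forms(word):
    """Crude morphological analysis: offer the word itself, then the
    word with common English suffixes stripped."""
    yield word
    for suffix in ("s", "es", "ing", "ed"):
        if word.endswith(suffix):
            yield word[: -len(suffix)]

def auto_lookup(sentence):
    """Identify the words of a source sentence and look their base
    forms up automatically in the terminology file."""
    hits = {}
    for token in sentence.lower().replace(".", "").split():
        for base in base_forms(token):
            if base in TERM_FILE:
                hits[base] = TERM_FILE[base]
                break
    return hits

print(auto_lookup("Close the valves and drain the pumps."))
```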
Synchronized retrieval
Another level-two spin-off of morphological analysis could be the
automatic creation of synchronized (i.e. parallel) bilingual text files.
When a document had been translated and revised, the final version,
as well as its source text, would be stored in such a way that each
unit of source text was linked to a corresponding unit of target text.
The units would generally be sentences, except in cases where one
sentence in the source text becomes two in the target text or vice
versa. The benefits of synchronized bilingual text retrieval are
manifold with appropriate software. A translator beginning a
revision of a document could automatically incorporate unmodified
units taken from a previous translation into the revision with a
minimum of effort. Such a system has been developed and marketed
by Alpnet but it requires that the text be marked and segmented by
the software and then translated segment by segment. The next
generation of such software would automatically synchronize source
and target units ‘after the fact’, along the lines of several
experimental software research projects.
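The reuse of unmodified units can be sketched as a minimal translation-memory lookup. The sentence pairs below are invented, and this is an illustration of the idea only, not a description of the Alpnet product.

```python
# Each source unit (here, a sentence) is linked to its target unit.
memory = {
    "Press the red button.": "Appuyez sur le bouton rouge.",
    "Do not open the cover.": "N'ouvrez pas le couvercle.",
}

def translate_revision(source_sentences):
    """Reuse stored target units for unmodified source units and flag
    the rest for the translator."""
    out = []
    for sentence in source_sentences:
        if sentence in memory:
            out.append(memory[sentence])           # reused automatically
        else:
            out.append("[TO TRANSLATE] " + sentence)
    return out

print(translate_revision(["Press the red button.", "Wait five seconds."]))
```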
Another benefit of synchronized bilingual retrieval would be the
creation of large bilingual databases of previously translated texts.
Thus, an individual requestor, service bureau or, most interestingly,
a large corporation or agency, could provide to its translators a
database (perhaps on CD-ROM) showing how various terms had
been translated in the organization’s documents over the past several
years. Not only would such information be invaluable for helping
the translator or terminologist choose the appropriate equivalent
terms for the purposes of the current text, it would also have the
added benefit of instantly allowing them to determine whether or
not a particular use of a term was new.
BIBLIOGRAPHY
Hutchins, W.J. (1986) Machine Translation: Past, Present, Future, Chichester,
Ellis Horwood.
Kay, M. (1980) ‘The proper place of men and machines’, in Language
Translation, Research Report, Xerox Palo Alto Research Center, Palo Alto,
California.
Lakoff, G. (1987) Women, Fire, and Dangerous Things: What Categories Reveal
about the Mind, Chicago: University of Chicago Press.
Melby, A. (1982) ‘Multi-level translation aids in a distributed system’, in
J.Horecky (ed.) Proceedings of COLING-82, Prague, July 1982, Amsterdam:
North-Holland.
——(forthcoming) ‘Causes and effects of partial asymmetry between semantic
networks tied to natural languages’ (a lecture series given at the Collège
de France, Paris, February, 1990), to appear in Les Cahiers de Lexicologie,
Paris.
Shreve, G.M. and Vinciquerra, K.J. (1990) ‘Hypertext knowledge-bases for
computer-assisted translation: organization and structure’, in A.L.
Wilson (ed.) Proceedings of the 31st Annual Conference of the American
Translators Association, New Orleans, October 1990: Learned Information,
Inc.
Snell-Hornby, M. (1988) Translation Studies: An Integrated Approach,
Amsterdam: John Benjamins.
Chapter 10
INTRODUCTION
To this day the SYSTRAN system (Toma 1976) remains the existence
proof of machine translation (MT). When people argue (as they
sometimes still do) that usable MT (of a quality that benefits a large
class of consumers) does not exist, one can simply point to the
existence of SYSTRAN and, in particular, to its twenty-year history at
the Foreign Technology Division (FTD) in Dayton, Ohio, where it
translates large numbers of Russian scientific and engineering theses
every month.
At the present time, there is a resurgence of MT research based
only on text statistics at IBM, New York, a revived technique
(Brown et al. 1990) that has attracted both interest and funding.
Their claim that MT can and should be done on the basis of
correlations of English and foreign words, established between very
large bilingual corpora, assumes that ‘symbolic,’ non-quantitative,
MT cannot do the job. The quick answer to them is again
SYSTRAN, which IBM’s ‘proportion of sentences correct’
percentage (40 per cent versus SYSTRAN’s 60–70 per cent success
rate) lags far behind, with no evidence beyond hope, energy and
application of ever closing the gap. IBM’s argument simply ignores
what constitutes a refutation of their claims, namely, SYSTRAN.
A more detached observer might say of this clash of opinion that,
while SYSTRAN has twenty years of work to its name, the IBM
results would still be important even if all they could do was reach
the same levels of accuracy as SYSTRAN, simply because the IBM
procedures would be wholly automatic, requiring no linguistics,
translations, text marking, rules, dictionaries or even foreign-
language speakers.
SYSTRAN 167
vii A comparator program took the first and second runs of each text
and listed only those sentences that were changed between runs,
viii The two outputs of the comparator program (one object, one
control) were each divided into three parts,
ix Evaluators were chosen at three sites as follows:
A at Essex University, two bilinguals familiar with Russian-
English translation but not with MT, plus one monolingual
with qualifications in political science, were named A1, A2,
A3, respectively;
B at FTD, three bilinguals familiar with MT were named B1,
B2, B3, respectively;
C at LATSEC, the inverse of (A) was done: one non-MT
bilingual plus two monolinguals familiar with the subject
matter were chosen, called C3, C1, C2, respectively.
Each evaluator with a given digit in their name code received
the same one-third of the changed object and control texts.
x Each evaluator received the same instructions and questionnaire
(see below). The sentences came to them in the form of a
Russian sentence, a first-run English sentence or a second-run
English sentence. These last two sentences were randomly
ordered, so as to avoid any assumption that the second was
‘better’. For each of three questions the evaluator was asked to
choose one of the four answers A, B, C, or D. Their choice was
indicated by circling one of the letters on a computer form
containing the number of the sentence and the letters A to D.
Answer sheets were mailed directly back to LATSEC.
xi A totalizator program compiled the results from each evaluator
for each set of texts, plus monolingual and bilingual totals and
these in turn were subjected to statistical analysis.
xii The evaluators were asked to give their reactions to the test and
questionnaire, and some were asked to review sample sentences,
answering with different choice orders, and to count the Russian
words that survived translation (for the significance of this, see
questionnaire below).
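Step (vii)’s comparator program amounts to a sentence-by-sentence diff of the two runs, keeping only the pairs that differ. A minimal sketch, with invented sample sentences:

```python
def comparator(first_run, second_run):
    """List only the sentences that changed between the two runs,
    keyed by sentence number."""
    changed = {}
    for num, (a, b) in enumerate(zip(first_run, second_run), start=1):
        if a != b:
            changed[num] = (a, b)
    return changed

# Invented output from two runs of the same source text.
run1 = ["The device are new.", "It works well.", "Results was good."]
run2 = ["The device is new.", "It works well.", "Results were good."]
for num, (old, new) in comparator(run1, run2).items():
    print(num, old, "->", new)
```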
Test selection
A keypuncher at LATSEC prepared a card for each document in the
one-and-a-half million word Russian corpus. On each card was
punched the number of one of the documents and a random number
(these being taken from the standard library copy of a random
number table, starting on the first page with the first number, and
continuing in order from that page). The pack of cards prepared for
all the documents in the corpus went to Teledyne Ryan who ran it
through a standard program that sorts random numbers in
ascending order. LATSEC then keypunched the Russian documents
by taking their numbers from the ordered list provided by Teledyne
Ryan (taking the document numbers in turn which corresponded
one-to-one in that they were on the same card) to the random
numbers, now in numerical sequence.
For security, the original card pack was then sent to FTD so that
the whole procedure could be verified later with any standard
sorting program. We believe this procedure gave a random selection
of 150,000 Russian words by LATSEC keypunching down the list
until that total was reached (the first 75,000 becoming the object text
so that translation and updating could start immediately, and the
second 75,000 becoming the control text). While this method was
perhaps overly detailed, it yielded a significant sample of the corpus
by any normal statistical criteria, one which compared very well in
terms of sample size with experiments referred to in other surveys.
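The card-and-random-number procedure amounts to giving each document a random sort key, sorting the pack on those keys, and taking documents down the ordered list until the word target is reached. A sketch under that reading, with an invented corpus of document word counts:

```python
import random

def select_sample(docs, target_words, seed=42):
    """`docs` maps document number -> word count. Each document gets a
    random sort key (the punched card), the pack is sorted on those
    keys, and documents are taken in order until the target is met."""
    rng = random.Random(seed)
    cards = sorted(docs, key=lambda doc: rng.random())
    sample, total = [], 0
    for doc in cards:
        if total >= target_words:
            break
        sample.append(doc)
        total += docs[doc]
    return sample, total

# Invented corpus: 300 documents of 500-1,100 words each.
corpus = {n: 500 + (n % 7) * 100 for n in range(1, 301)}
sample, total = select_sample(corpus, 150_000)
print(len(sample), total)
```

Splitting the resulting list in half would then give the object and control texts, as in the procedure described above.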
The fluctuation range was much the same over evaluators as before,
though its net impact is less important. The table below gives the
balance, and, of course, displays the range of fluctuation:
Question 2: intelligibility
The results may be summarized as follows:
Note: Some sets of figures do not add up to 100 because of rounding errors.
All percentages accurate to ± 1%.
These results generally confirm that for both object (O) and control (C)
texts, there was a significant improvement in the second translation. For
the O text, this should be assessed as 44 percent to 53 percent
improvement. (All C code percentages have a margin of error of no
more than 2 percent.) Little attention should be paid to the monolingual
value (especially where it differs in the C text), as the three monolinguals
disagreed with each other sharply. The overall result for the bilinguals
can be seen by pooling the FTD and non-FTD teams as follows:
Note: Some sets of figures do not add up to 100 because of rounding errors.
All percentages accurate to ± 2%.
180 Yorick Wilks
Note: All these figures were subject to very small error margins (1.5–2.5 per
cent), and are statistically significant. The slots followed by an asterisk are
those in which the FTD/non-FTD differences are insignificant.
One might argue that the difference in question 3’s results was
dependent on ordering: those who answered it last (non-FTD and
B1 at FTD) were less likely to find improvement (only 18 percent of
the cases), while B2 and B3 (at FTD), who answered it first, found a
28 percent improvement. But if we look further, we find that there
was actually more disagreement within this order group (B2 = 19
percent, B3 = 33 percent) than between the two groups!
Furthermore, A3 (non-FTD) found as high an improvement score
(32 per cent) as did B3 (FTD), while C3 (non-FTD) found an even
higher one (38 percent).
To investigate the effect of choice order, the Essex (A) group re-
did a large sample of their data sheets in the choice order B C D A.
The average difference they found was around 4 per cent: most
likely an insignificant figure.
Evaluator variance
As already noted, a striking feature of the results is the high level
of evaluator variance. The standard deviation of the twenty-four
individual judgments made by nine evaluators on three questions
(monos did not answer the first question) is very high: 17.2
percent (19.4 percent for the control text) for the proportion
deemed judgeable (the sum of B+C percent). While this is
unfortunate, it is compensated for by the much lower standard
deviation for those judgments that were made, around 4.8–4.9
percent. In other words, we should attach little importance to the
figures when a sentence translation pair (i.e. first run, second run)
could not be judged as different, but considerable reliability to the
70–80 percent figure for improvement when a decision could be
made. Thus while the average unjudgeable (A+D) proportion was
50 percent for O text and 60 percent for C text, the range within
which the true figure lies is much greater, for the margin of error
was +7 percent. But for the actual judgments, we can confidently
state that the error range was less than 2 percent. This is
reassuring because had the reverse result occurred (i.e. had the
evaluations of improvement varied greatly in a subjective way),
we would have had cause to doubt our entire methodology of
evaluation.
Further confirmation of our method came from examining
correlations between evaluations of O and C texts. Reassuringly,
the correlation coefficient over the 25 judgements made on each
text is not significantly different from zero. On the other hand,
the tendency to deem sentences unjudgeable was shown to arise
from evaluators’ individual differences in outlook; the correlation
between each evaluator’s unjudgeability evaluations for each
question between O and C text was amazingly high (0.913). As
this clearly did not represent any actual link between such
evaluation items (the texts having been drawn at random), the
average level and the variance of these decisions should not
concern us.
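The two statistics used in this analysis — the standard deviation across evaluators and the correlation between each evaluator’s unjudgeability rates on the O and C texts — can be computed as follows. The per-evaluator percentages are invented, chosen only to reproduce the pattern described: a large spread across evaluators, but each evaluator highly consistent with him- or herself.

```python
from statistics import mean, stdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Invented per-evaluator unjudgeability percentages (O and C texts).
o_text = [42, 55, 61, 38, 70, 47]
c_text = [45, 58, 66, 41, 73, 52]
print(round(stdev(o_text), 1))            # spread across evaluators
print(round(pearson(o_text, c_text), 3))  # close to 1: a personal trait
```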
Monolingual expertise
It is generally accepted that the value of monolingual evaluation in
scientific subjects depends on monolingual subject expertise. While
our monolingual evaluators all had some expertise in the field of
political science, this simply did not transfer from Russian to English
in the way that a universally understood area like physics would. To
some degree, this explains the high variance among the
monolinguals, and their consequently diminished role compared to
the BR study of scientific texts.
APPENDIX
When looking at both the Russian and the English, you should be
careful not to assume that the sets of sentences form coherent,
continuous texts; rather, you should treat each triplet individually.
Also, do not assume that the second English sentence is better than
the first: the order of the sentence pairs is entirely random.
The difference between ‘understandable’ and ‘natural’ can be
illustrated as follows: the sentences ‘To John gave I the apple’ and
‘I want you go now’ are both understandable but not natural
English.
Questionnaire
Each line on the form corresponds to one sentence triplet (one
Russian and two English) by number. Each numbered section below
corresponds to a column on the answer form. You should circle one
and only one letter (A, B, C or D) for each question and then do that
for each sentence triplet in the data.
Circle
A if you do not speak Russian, OR if you speak Russian and
consider both English sentences to be such bad translations that
no choice can be made between them, OR if you speak Russian
and can see that BOTH English sentences contain Russian
words
B if you prefer the first sentence as an accurate translation
C if you prefer the second sentence
D if you have no preference
2 (enter in column 2 for each triplet) Now look at only the English
sentences in the triplet, and ask yourself if you can comprehend
them as such, accounting for your knowledge of the subject
matter.
Circle
A if you speak Russian, but consider both texts to be such bad
translations that you decline to form a judgment, OR if both
Circle
A if you speak Russian and consider both sentences such bad
translations of the Russian that you decline to make this
judgment, OR if both English sentences contain Russian words
(once again, you can select this if you do not speak Russian but
should NOT do so if only one of the English sentences contains
Russian words)
B if you prefer the first sentence for the naturalness of its English
C if you prefer the second sentence
D if you have no preference
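The totalizator program mentioned earlier would simply tally the circled letters into percentages per question. A minimal sketch with an invented answer sheet; as the result tables note, rounding means the columns may not sum to exactly 100.

```python
from collections import Counter

def totalize(answers):
    """Tally one batch of circled letters into rounded percentages."""
    counts = Counter(answers)
    n = len(answers)
    return {choice: round(100 * counts[choice] / n) for choice in "ABCD"}

# Invented answer sheet for one question over sixteen sentence triplets.
sheet = list("ABBCBDCABBACBBDC")
print(totalize(sheet))
```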
BIBLIOGRAPHY
ALPAC (1966) Language and Machines: Computers in Translation and Linguistics
(Report by the Automatic Language Processing Advisory Committee,
Division of Behavioral Sciences, National Research Council),
Washington, DC: National Academy of Sciences.
Battelle Columbus Laboratories (1977) ‘The evaluation and systems
analysis of the SYSTRAN machine translation system’, RADC-TR-76–
399 Technical Report.
Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F.,
Lafferty, J.D., Mercer, R.L. and Roossin, P.S. (1990) ‘A statistical
approach to machine translation’, Computational Linguistics 16:79–85.
Carroll, J.B. (1966) ‘An experiment in evaluating the quality of translations’,
Mechanical Translation and Computational Linguistics 9 (3 & 4): 55–66.
Jacobs, P., Krupka, G. and Rau, L. (1991) ‘Lexico-semantic pattern
matching as a companion to parsing in text understanding’, in Proceedings
of the DARPA Speech and Natural Language Workshop, Monterey, California.
Johnson, R., King, M. and des Tombe, L. (1985) ‘EUROTRA: a
multilingual system under development’, Computational Linguistics 11 (2–
3): 155–69.
Toma, P. (1976) ‘An operational machine translation system’, in R.W.
Brislin (ed.) Translation: Applications and Research, New York: Gardner, 247–
59.
188 Yorick Wilks
INTRODUCTION
190 Harold L.Somers
For a while now it has been the conventional wisdom that the
next advance in MT design—the ‘third generation’—would involve
the incorporation of techniques from AI. In his instant classic,
Hutchins is typical in this respect:
He goes on to say:
Incorporating AI techniques
Returning to the question of AI-oriented ‘third-generation’ MT
systems, it is probably fair to say that the most notable example of
this approach is at the Center for Machine Translation at Carnegie
Mellon University (CMU), where a significantly sized research team
was expressly set up to pursue the question of ‘knowledge-based MT’
(KBMT) (Carbonell and Tomita 1987). Similar work on the
Japanese ATR project has recently been reported (Kudo 1990).
What then are the ‘AI techniques’ which the CMU team has
incorporated into its MT system, and how do we judge them?
In the Nirenburg and Carbonell description of KBMT, the
emphasis seems to be on the need to integrate discourse pragmatics
in order to get pronouns and anaphora right (Nirenburg and
Carbonell 1987). This requires texts to be mapped onto a
corresponding knowledge representation in the form of a frame-based
conceptual interlingua. More recent descriptions of the project
(Nirenburg 1989; Nirenburg and Levin 1989) stress the use of
domain knowledge. In Kudo’s case, local ‘cohesive’ knowledge (in a
dialogue) is stressed. These are well-respected techniques in the
Current research 193
Sublanguage
Obviously, the most successful MT story of all is that of the
METEO system, which translates daily more than 30,000 words of
weather bulletins from English into French at a cost of less than 0.5¢
(Canadian) per word, with an accuracy rate of 95 per cent
(Chandioux 1987/9:169). The system performs a translation task too
boring for any human doing it to last for more than a few months,
yet sufficiently constrained to allow an MT system to be devised
which only makes mistakes when the input is ill formed. Some
research groups have looked for similarly constrained domains.
Alternatively, the idea of imposing constraints on authors has a long
history of association with MT. At the 1978 Aslib conference,
Elliston showed how, at Rank Xerox, acceptable output could be
obtained from SYSTRAN by requiring technical writers to write in a
style that would not catch the system out (Elliston 1979). It is interesting to see
much the same experience reported again ten years later, at the same
forum, but this time using Weidner’s MicroCat (Pym 1990). This
rather haphazard activity has fortunately been ‘legitimized’ by its
association with research in the field of language for special purposes
(LSP), and the word ‘sublanguage’ is starting to be widely used in
MT circles (e.g. Kosaka et al. 1988; Luckhardt 1991). In fact, I see
this as a positive move, as long as ‘sublanguage’ is not just used as a
convenient term to camouflage the same old MT design but with
simplified grammar and a reduced lexicon.
Studies of sublanguage (e.g. Kittredge and Lehrberger 1982)
remind us that the topic is much more complex than that: should a
sublanguage be defined prescriptively (or even proscriptively) as in
the Elliston and Pym examples, or descriptively, on the basis of some
corpus judged to be a homogeneous example of the sublanguage in
question? And note that even the term ‘sublanguage’ itself can be
misleading: in most of the literature on the subject, the term is taken
to mean ‘special language of a particular domain’ as in ‘the
sublanguage (of) meteorology’. Yet a more intuitive interpretation of
the term, especially from the point of view of MT system designers,
would be something like ‘the grammar, lexicon, etc. of a particular
text-type in a particular domain’, as in ‘the sublanguage of
meteorological reports as given on the radio’, which might share
some of the lexis of, say, ‘the sublanguage of scientific papers on
meteorology’, although clearly not (all) the grammar. By the same
token, scientific papers on various subjects might share a common
grammar, while differing in lexicon. Furthermore, there is the
question of whether the notion of a ‘core’ grammar or lexicon is
useful or even practical. Some of these questions are being addressed
as part of one of the MT projects started in 1990 at UMIST
(Manchester), concerning the design of an architecture for a system
which interacts with various types of experts to ‘generate’ a
sublanguage MT system: I will begin my final section with a brief
description of this research.
Sublanguage plus
One system recently proposed at UMIST is a sublanguage MT
system for the Matsushita company (Ananiadou et al. 1990). The
design is for a system with which individual sublanguage MT
systems can be created, on the basis of a bilingual corpus of ‘typical’
texts. The system therefore has two components: a core MT engine,
which is to a certain extent not unlike a typical second-generation
MT system, with explicitly separate linguistic and computational
components; and a set of expert systems which interact with humans
in order to extract from the corpus of texts the grammar and lexicon
that the linguistic part of the MT system will use. The expertise of
the expert systems and the human users is divided between domain
expertise and linguistic expertise, corresponding to the separate
domain knowledge and linguistic knowledge (i.e. of grammars,
lexicons and contrastive knowledge). Using various statistical
methods (see below), the linguistic expert system will attempt to
infer the grammar and lexicon of the sublanguage, on the
assumption that the corpus is fully representative (and approaches
closure). From our observation of other statistics-based approaches
to MT, we concluded that the statistical methods needed to be
‘primed’ with linguistic knowledge, for example, concerning the
nature of linguistic categories, morphological processes and so on.
We have investigated the extent to which this can be done without
going as far as to posit a core grammar, as we are uneasy about the
idea that a sublanguage be defined in terms of deviation from some
standard. The system will make hypotheses about the grammar and
lexicon, to be confirmed by a human user, who must clearly be a
linguist rather than, say, the end user. In the same way, the
contrastive linguistic knowledge is extracted from the corpus, to be
confirmed by interaction with a (probably different) human. Again,
some ‘priming’ is almost always necessary.
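The hypothesis-forming step can be illustrated with a minimal sketch (my own, not the UMIST implementation; the function names and the frequency threshold are invented for illustration). It proposes a lexicon from a corpus and makes a rough check of whether the corpus ‘approaches closure’:

```python
from collections import Counter

def sublanguage_lexicon(sentences, min_freq=2):
    """Hypothesise a sublanguage lexicon as the set of word types seen
    at least min_freq times; in the design described, such hypotheses
    would be confirmed or rejected by a human linguist."""
    counts = Counter(w.lower() for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_freq}

def closure_rate(sentences):
    """Rough test of closure: the proportion of word types in the
    second half of the corpus that already appeared in the first half."""
    half = len(sentences) // 2
    seen = {w.lower() for s in sentences[:half] for w in s.split()}
    later = {w.lower() for s in sentences[half:] for w in s.split()}
    return len(later & seen) / len(later) if later else 1.0
```

On a toy weather-bulletin corpus such as ['sunny today', 'sunny tomorrow', 'rain today', 'rain tomorrow'] every word type recurs and the closure rate is high; on open-domain text it would stay low, signalling that the corpus is not representative.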
Dialogue MT
A recent research direction to emerge is an MT system aimed at a
user who is the original author of a text to be composed in a foreign
language. Two such systems are Ntran, mentioned above, and
Huang’s system (Huang 1990), both for use by a monolingual writer
Corpus-based MT
The approaches to be described in this final section have in common
the idea that a pre-existing large corpus of already translated text
could be used in some way to construct an MT system. They can be
divided into three types, called ‘memory-based’, ‘example-based’ and
‘statistics-based’ translation.
Memory-based translation
The most ‘linguistic’ of the corpus-based approaches is ‘memory-
based translation’ (Sato and Nagao 1990): here, example translations
are used as the basis of new translations. The idea—first suggested by
Nagao (1984)—is that translation is achieved by imitating the
translation of a similar example in a database. The task becomes one
of matching new input to the appropriate stored translation. In this
connection, a secondary problem is the question of the most
appropriate means of storing the examples. As they believe that
combining fragments of sentences is an essential feature, it is natural
for Sato and Nagao to think in terms of storing linguistic objects—
notably partial syntactic trees (in their case, dependency trees). An
element of statistical manipulation is introduced by the need for a
scoring mechanism to choose between competing candidates.
Advantages of this system are its ease of modification, notably by
changing or adding to the examples, and the high quality of
translation, which results, as above, from translations being
established a priori rather than compositionally (although there is an
element of compositionality in Sato and Nagao’s approach). The
Example-based translation
A similar approach which overcomes this major demerit has been
developed quite independently by two groups of researchers at ATR
in Japan (Sumita et al. 1990), and at UMIST in Manchester (Carroll
1990). In both cases, the central point of interest is the development
of ‘distance’ or ‘similarity’ measures for sentences or parts of
sentences, which permit the input sentence to be translated to be
matched rapidly against a large corpus of existing translations. In
Carroll’s case, the measure can be ‘programmed’ to take account of
grammatical function words and punctuation, which has the effect of
making the algorithm apparently sensitive to syntactic structure
without actually parsing the input as such. While Sumita et al.’s
intention is to provide a single correct translation by this approach,
Carroll’s measure is used in an interactive environment as a
translator’s aid, selecting a set of apparently similar sentences from
the corpus, to guide the translator in the choice of the appropriate
translation. For this reason, spurious or inappropriate selections of
examples can be tolerated as long as the correct selections are also
made at the same time.
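A distance measure of the kind described, weighting grammatical function words and punctuation more heavily so that the ranking is loosely sensitive to syntactic shape without parsing, might be sketched as follows. The weights, the function-word list and the function names are illustrative assumptions, not Carroll’s or Sumita et al.’s actual formulations:

```python
FUNCTION_WORDS = {"the", "a", "of", "in", "to", "and", "is", "?", ".", ","}

def weight(token):
    # Function words and punctuation weigh more, so mismatches in the
    # grammatical skeleton of the sentence are penalised more heavily.
    return 2.0 if token.lower() in FUNCTION_WORDS else 1.0

def distance(s, t):
    """Weighted token-level edit distance between two tokenised sentences."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + weight(s[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + weight(t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if s[i - 1] == t[j - 1] else max(weight(s[i - 1]), weight(t[j - 1]))
            d[i][j] = min(d[i - 1][j] + weight(s[i - 1]),
                          d[i][j - 1] + weight(t[j - 1]),
                          d[i - 1][j - 1] + sub)
    return d[m][n]

def nearest(input_tokens, corpus):
    """Rank (source_tokens, translation) pairs by distance to the input,
    as a translator's aid would, rather than picking a single winner."""
    return sorted(corpus, key=lambda pair: distance(input_tokens, pair[0]))
```

Under this scheme, substituting one content word for another costs less than disturbing a function word, so sentences with the same grammatical frame rank as closer.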
Statistics-based approaches
Other corpus-based approaches have been more overtly statistical or
mathematical. The most notable of these is the work at IBM (Brown
et al. 1988a,b, 1990). These researchers, encouraged by the success
of statistics-based approaches to speech recognition and parsing,
decided to apply similar methods to translation. Taking a huge
corpus of bilingual text available in machine-readable form (3 million
sentences selected from the Canadian Hansard), the probability that
any one word in a sentence in one language corresponds to zero, one
or two words in the translation is calculated. The glossary of word
equivalences so established consists of lists of translation possibilities
for every word, each with a corresponding probability. For example,
‘the’ translates as ‘le’ with a probability of 0.610, as ‘la’ with a
probability of 0.178, and so on. These probabilities can be combined in various
ways, and the highest-scoring combination will determine the words
which will make up the target text. An algorithm to get the target
words in the right order is now needed. This can be calculated using
rather well-known statistical methods for measuring the probabilities
of word pairs, word triples, and so on.
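The scoring scheme can be sketched in miniature. In this illustration only the figures for ‘the’ come from the text; the other probabilities, the bigram language model and the assumption that the target words keep the English order are invented to keep the example small (a real system must also consider reorderings, and the fertility of a source word producing zero, one or two target words):

```python
from itertools import product

# Toy translation glossary: English word -> [(French word, probability), ...]
# Only the 'the' figures echo those quoted in the text; the rest are invented.
glossary = {
    "the": [("le", 0.610), ("la", 0.178)],
    "door": [("porte", 0.9)],
}

# Toy French bigram model: P(word | previous word); "<s>" marks sentence start.
bigram = {
    ("<s>", "le"): 0.4, ("<s>", "la"): 0.4,
    ("le", "porte"): 0.05, ("la", "porte"): 0.5,
}

def translate(english_words):
    """Pick the French word sequence maximising the product of lexical
    translation probabilities and bigram (word-order) probabilities."""
    best, best_score = None, 0.0
    for combo in product(*(glossary[w] for w in english_words)):
        words = [w for w, _ in combo]
        score = 1.0
        for _, p in combo:          # lexical translation probabilities
            score *= p
        prev = "<s>"
        for w in words:             # bigram probabilities
            score *= bigram.get((prev, w), 1e-6)
            prev = w
        if score > best_score:
            best, best_score = words, score
    return best
```

Note how the combination works: taken in isolation the glossary prefers ‘le’, but the bigram model favours ‘la porte’, so `translate(["the", "door"])` yields `["la", "porte"]`.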
The results of this experiment are certainly interesting.
Translations which were either the same as or preserved the meaning
of the official translations were achieved in about 48 per cent of the
cases. Although at first glance this level of success would not seem to
make this method viable as it stands, it is to be noted that not many
commercial MT systems achieve a significantly better quality. More
interesting is to consider the near-miss cases in the IBM experiment:
incorrect translations were often the result of the fact that the system
contains no linguistic ‘knowledge’ at all. Brown et al. admit that
serious problems arise when the translation of one word depends on
the translation of others, and suggest (Brown et al. 1988a:11–12) that
some simple morphological and/or syntactic analysis, also based on
probabilistic methods, would greatly improve the quality of the
translation.
As part of sublanguage MT research at UMIST, Arad (1991)
investigated the possibility of using statistical methods to
derive morphological and syntactic grammars and mono- and
bilingual lexica from bilingual corpora. A similar goal is reported by
Kay and Röscheisen (1988). That the text type is restricted permits
Arad to work with lower thresholds of statistical significance, and
hence smaller corpora. In the future she intends to prime the system
with a certain amount of a priori linguistic knowledge of a very basic
kind: for example, which morphological processes are likely (for
English, mainly non-agglutinative suffixes, with stems that contain a
vowel and are longer than their affixes), and typical characteristics
of phrase structure (notions of open- and closed-class words,
headedness and so on). Arad has resisted the idea of priming
the system with a core grammar but recognizes that this may prove
to be a necessary step.
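The kind of priming mentioned can be sketched as a constraint on candidate stem-plus-suffix splits; the code below is an illustrative toy of the general idea, not Arad’s method, and the function names are invented:

```python
from collections import Counter

VOWELS = set("aeiou")

def candidate_splits(word):
    """Stem+suffix splits satisfying the a priori constraints: the stem
    must contain a vowel and be strictly longer than the suffix."""
    splits = []
    for i in range(1, len(word)):
        stem, suffix = word[:i], word[i:]
        if len(stem) > len(suffix) and VOWELS & set(stem):
            splits.append((stem, suffix))
    return splits

def likely_suffixes(words, top=3):
    """Hypothesise productive suffixes by counting them over all
    admissible splits; a human linguist would confirm the hypotheses."""
    counts = Counter(suffix for w in words for _, suffix in candidate_splits(w))
    return [s for s, _ in counts.most_common(top)]
```

Even this crude constraint prunes most spurious segmentations: over a handful of regular past-tense and progressive forms, ‘ed’ and ‘ing’ surface among the most frequent candidate suffixes.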
CONCLUSIONS
In this chapter I have given a view—at times personal—of current
research in MT. Of course, there are probably numerous research
projects that I have omitted to mention, generally because I have not
been able to get information about them, or simply because they
have not come to my notice or have emerged while this book was in
press. I am conscious that some readers of this chapter will be
BIBLIOGRAPHY
Abeillé, A., Schabes, Y. and Joshi, A.K. (1990) ‘Using lexicalized tags for
machine translation’, in H.Karlgren (ed.) COLING-90: Papers Presented to
the 13th International Conference on Computational Linguistics, Helsinki:
Yliopistopaino, vol. 3:1–6.
Alam, Y.S. (1986) ‘A lexical-functional approach to Japanese for the purpose
of machine translation’, Computers and Translation 1:199–214.
ALPAC (1966) Language and Machines: Computers in Translation and Linguistics
(Report by the Automatic Language Processing Advisory Committee,
Division of Behavioral Sciences, National Research Council),
Washington, DC: National Academy of Sciences.
Amores Carredano, J.G. (1990) ‘An LFG machine translation system in
DCG-PROLOG’, unpublished MSc dissertation, UMIST, Manchester.
Ananiadou, S., Carroll, J.J. and Phillips, J.D. (1990) ‘Methodologies for
development of sublanguage MT systems’, CCL Report 90/10, Centre
for Computational Linguistics, UMIST, Manchester.
Arad, I. (1991) ‘A quasi-statistical approach to automatic generation of
linguistic knowledge’, PhD thesis, UMIST, Manchester.
Bateman, J.A. (1990) ‘Finding translation equivalents: an application of
grammatical metaphor’, in H.Karlgren (ed.) COLING-90: Papers Presented
to the 13th International Conference on Computational Linguistics, Helsinki:
Yliopistopaino, vol. 3:13–18.
Beaven, J.L. and Whitelock, P. (1988) ‘Machine translation using
isomorphic UCGs’, in D.Vargha (ed.) COLING Budapest: Proceedings of the
12th International Conference on Computational Linguistics, Budapest: John von
Neumann Society for Computing Sciences.
GLOSSARY OF TERMS
Text encoding initiative The goal of the TEI is to develop and disseminate
a clearly defined format for the interchange of machine-readable texts
among researchers so as to allow easier and more efficient sharing of
resources for textual computing and natural-language processing. The
interchange format is intended to specify how texts should be encoded or
marked so that they can be shared by different research projects for
different purposes. The use of specific delimiter characters and specific
tags to express specific information is prescribed by this interchange
format, based on the international standard ISO 8879, Standard
Generalized Markup Language (SGML).
Translation 1. A rendering of a text from one language into another. 2. The
product of such a rendering.
Translation tools Software packages which are designed to assist translators
in their work but which do not perform translation. The main
constituents of a translation tools package are a text analyser (parser) and
a user-updatable dictionary. The text analyser is used to identify lexical
items for dictionary entry. Such products usually allow translation
equivalents to be pasted directly into a text file via a window during
dictionary look-up.
Translator’s workbench see Translator workstation.
Translator workstation A custom-built, ergonomically designed unit having
a computer as its main platform, incorporating translation tools and
supporting a potentially wide range of software utilities and peripheral
devices. Translator workstations offer various levels of computer
assistance up to and including MT. They can also incorporate fax,
electronic mail and desk-top publishing facilities, and offer remote access
to terminology databases.
Vocabulary 1. A listing, selective or exhaustive, of the lexical items of a
language or used by a group or individual or within a specialized field of
knowledge, with definitions or translation equivalents. 2. The aggregate
of lexical items used or understood by a specified group, class,
profession, etc.
Vocabulary search A function of MT and translation tools packages in
which the system compares the words in a text with those that figure in
a specified dictionary or sequence of dictionaries and copies unfound
words into a file which can later be used as a basis for dictionary
updating.
Window A feature of certain software packages that allows the user to
access utilities from within a text file.
Wordage The number of words contained in a text.
Word count, Wordcount see Wordage.