Development of a Punjabi to English Transliteration System
521-526
ABSTRACT
Machine transliteration has gained prime importance as a supporting tool for machine translation and cross-language
information retrieval, especially when proper names and technical terms are involved. The performance of machine
translation and cross-language information retrieval depends heavily on accurate transliteration of named entities.
Hence, the transliteration model must aim to preserve the phonetic structure of words as closely as possible. This
paper addresses the problem of transliterating Punjabi to English using a rule-based approach. The proposed
transliteration scheme uses a grapheme-based method to model the transliteration problem. This technique has
demonstrated transliteration from Punjabi to English for common names and achieved an accuracy of 93.22%.
into two phases. The first phase performs pre-processing and rule-based transliteration tasks, and the second
phase performs the task of post-processing. The overall accuracy of the system has been reported to be 91.37%.

Malik [4] has developed the Punjabi Machine Transliteration (PMT) system, which is rule-based. PMT has been used
for the Shahmukhi to Gurmukhi Transliteration System. PMT preserves both the phonetics and the meaning of the
transliterated word. The primary limitation of this system is that it works only on input data that has been
manually edited for missing vowels or diacritical marks (the basic ambiguity of written Arabic script), which
limits its practical use. The accuracy of the system has been reported to be 98.95%.

Verma [5] has developed a Gurmukhi to Roman Transliteration System and named it GTrans. He surveyed existing
Roman-Indic script transliteration techniques and then developed a transliteration scheme based on ISO 15919
and ALA-LC. It is a rule-based system. He has also done reverse transliteration from Gurmukhi to Roman. The
overall accuracy of the system has been reported to be 98.43%.

Hong, Kim, Lee and Chang [6] have developed an English-Korean Name Transliteration system using a hybrid
approach. In the transliteration process, first, a phrase-based SMT model with some factored translation features
is used. Second, the base system is expanded by applying web-based n-best re-ranking of the results. Third, a
pronouncing-dictionary-based method is applied to the base system; it uses pronunciation symbols and is motivated
by linguistic knowledge. Finally, a phonics-based method is applied, which was originally designed for teaching
speakers of English to read and write that language. The experimental results with three n-best re-ranking
techniques showed that web-based re-ranking is a useful method. Their standard run and best standard run achieved
accuracies of 45.1% and 78.5%, respectively.

Ali and Ijaz [7] have developed an English to Urdu Transliteration System based on mapping rules. The whole
process has three steps. In the first step, mapping rules are used to generate Urdu text from the English
transcription; English text is converted to Urdu using both English pronunciation and the mapping rules. In the
second step, Urdu syllabification is applied to the English transcription. Consonants and vowels are combined to
make syllables, and breaking up a word into syllables is known as syllabification. To improve the system's
accuracy, Urduization rules are applied in the third step. The overall accuracy of the system is 96%.

Hoon Oh and Choi [8] have developed an English-Korean Transliteration system using a hybrid approach, so called
because it uses both phonetic information (such as a phoneme and its context) and orthography. The method is
composed of two phases: alignment and transliteration. First, an English pronunciation unit and its corresponding
phoneme are aligned phonetically. Second, English words are transliterated into Korean words through several
steps. Using an English pronunciation dictionary (P-DIC), the system assigns a pronunciation to a given English
word. If the word is not found in P-DIC, the system checks whether it has a complex word form. To detect a complex
word form, the given English word is divided into two words (word+word) using entries of P-DIC. If both of them are
in P-DIC, the system can assign a pronunciation to the given word; otherwise the system must estimate the
pronunciation. Then the system checks whether the English word is of Greek origin. Because E-K transliteration
for English words of Greek origin differs from that for pure English words, it is important to detect this.
Pronunciation for English words that are not registered in P-DIC is estimated in the next step. Finally, Korean
transliterated words are generated using conversion rules. Evaluation has been performed through Word Accuracy
(WA) and Character Accuracy (CA). This system has reported an accuracy of 90.82% for WA and 56% for CA.

Yaser and Knight [9] have developed an Arabic to English Transliteration system based on sound and spelling
mappings using finite state machines. They combined a phonetic-based model and a spelling-based model into a
single transliteration model. For testing they used a development data set and a blind data set. The overall
accuracy has been reported to be 53.66% on the development data set and 61% on the blind data set. The reason for
the higher accuracy on the blind data set is that it consists mostly of highly frequent names of prominent
politicians, whereas the development set also contains names of writers and less common political figures.

3. PUNJABI & ENGLISH LANGUAGE
In this section we discuss the Punjabi and English languages.

3.1 Punjabi Language
The Punjabi language is written in the Gurmukhi script. The Gurmukhi script was derived from the Sharada script
and standardized by Guru Angad Dev in the 16th century. It was designed to write the Punjabi language. The
meaning of Gurmukhi is ‘from the mouth of the Guru’. The Gurmukhi (or Punjabi) alphabet contains thirty-five
distinct letters. These are:
[Gurmukhi alphabet chart]
The first three letters are unique because they form the basis for vowels and are not consonants. Apart from
Aira, these characters are never used on their own.
Consonants are:
[Gurmukhi consonant chart]
In addition to these, there are six consonants created by placing a dot (bindi) at the foot (pair) of the
consonant:
[chart of the six consonants with bindi]
In addition to this, there are nine dependent vowel signs used to create ten independent vowels with three
bearer characters: Ura, Aira and Iri.

3.2 English Language
The English language is written in the Roman script. English is a West Germanic language that arose in the
Anglo-Saxon kingdoms of England. It is one of the six official languages of the United Nations. India is one of
the countries where English is spoken as a second language. There are 26 letters in English, of which 21 are
consonants and 5 are vowels. Vowels are: a, e, i, o, u.

4. ARCHITECTURE
Our basic rule-based transliteration system works by employing a set of character mapping or character sequence
mapping rules between the languages involved. Punjabi words are written in Gurmukhi script while English words
are written in Roman script. Each Gurmukhi consonant symbol that is not followed by a vowel represents that
consonant plus an inherent schwa vowel sound. For example, [...] is represented as [...]. Note that the schwa
vowel is not pronounced in certain contexts; in this case, the schwa sound after [...] is not pronounced. A
snippet of the direct mapping of vowels and consonants is shown in Tables 1, 2, 3 and 4. The accuracy of the
system using direct mapping alone was very low. To improve that accuracy we have developed different rules. In
our system, rules also include constraints that specify the context in which they are applicable, such as Start
of a Word (S), Ending of a Word (E), After Vowel (AV), After Consonant (AC), etc. Combining the different mapping
options for each character of an input Punjabi word results in different transliteration candidates. For example,
consider a Punjabi word whose six characters have 2, 2, 1, 1, 1, and 2 possible mappings, respectively. Hence a
total of 2*2*1*1*1*2 = 8 transliteration candidates should be considered (examples: waishali, wayshali, vaishali,
vayshali, waishalee, wayshalee, etc.).
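The candidate generation described in the Architecture section can be sketched as follows. This is only an illustration under stated assumptions, not the system's actual rule set: the character labels, the `RULES` table and the choice of which rule carries an end-of-word (E) constraint are hypothetical, chosen so that the six characters yield 2, 2, 1, 1, 1 and 2 options and reproduce the Roman examples given in the text.

```python
from itertools import product

# Placeholder description of a six-character input word. Each unit is
# (label, kind), where kind is "C" for a consonant and "V" for a vowel sign.
# The labels are illustrative names, not the paper's notation.
WORD = [
    ("VA", "C"), ("AI_SIGN", "V"), ("SHA", "C"),
    ("AA_SIGN", "V"), ("LA", "C"), ("II_SIGN", "V"),
]

# Hypothetical rule table: label -> ordered list of (context, options).
# A context of None means the rule applies everywhere; otherwise the rule
# applies only when that tag (S, E, AV or AC) holds at the position.
RULES = {
    "VA": [(None, ["w", "v"])],
    "AI_SIGN": [(None, ["ai", "ay"])],
    "SHA": [(None, ["sh"])],
    "AA_SIGN": [(None, ["a"])],
    "LA": [(None, ["l"])],
    "II_SIGN": [("E", ["i", "ee"]), (None, ["i"])],
}


def contexts(word, i):
    """Return the set of context tags that hold at position i."""
    tags = set()
    if i == 0:
        tags.add("S")  # start of word
    if i == len(word) - 1:
        tags.add("E")  # end of word
    if i > 0:
        prev_kind = word[i - 1][1]
        tags.add("AV" if prev_kind == "V" else "AC")
    return tags


def options_for(word, i):
    """Pick the options of the first rule whose context constraint is met."""
    tags = contexts(word, i)
    for context, options in RULES[word[i][0]]:
        if context is None or context in tags:
            return options
    raise KeyError(f"no applicable rule for {word[i][0]!r}")


def candidates(word):
    """Combine per-character options into all transliteration candidates."""
    per_char = [options_for(word, i) for i in range(len(word))]
    return ["".join(parts) for parts in product(*per_char)]


if __name__ == "__main__":
    for candidate in candidates(WORD):
        print(candidate)
```

Taking the first matching rule per character keeps the sketch simple. A real system would also need a way to rank the resulting candidates, which the text does not describe.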
Table 1: Independent Vowels Mapping
Table 2: Dependent Vowels Mapping
524 International Journal of Computer Science and Communication (IJCSC)
Table 3: Consonant Mapping
Table 4: Mapping of Special Symbols
Figure 1
5.4 Error Analysis
The overall accuracy of the system in testing is quite good. However, Test case 2 is less accurate than the first
one because the un-standardized language causes more ambiguities. There are several reasons for the errors in the
output.
• Multiple Transliterations: Sometimes a name, as pronounced in Punjabi, corresponds to many English words, so
the system fails to guess which one is the best for that particular transliteration.
• Wrong Input of Words: Sometimes the user does not enter correct data into the system, so the output is also
incorrect. For example, as [...], here the halant is used as such, but we know it is used to write a half letter.
• Character Gap: The number of characters in the English and Punjabi character sets differs, which makes the
transliteration process difficult. The numbers of vowels are 5 and 20, and the numbers of consonants are 21 and
41, in English and Punjabi respectively, as explained earlier. This character gap between the two languages leads
to problems in the transliteration process. For example, for the character [...] in Punjabi there is no
corresponding character in English.
• One-to-Multi Mapping Problem: In this problem, a single character in one script transforms to multiple
characters in another script. The multi-mapping ...

6. CONCLUSION
In this paper we have addressed the problem of transliterating Punjabi to English using a rule-based approach.
A Punjabi to English transliteration system is very beneficial for removing the language and script barrier. The
system gives promising results and can be further used by researchers working on Punjabi and English Natural
Language Processing tasks. Since most government departments in the Punjab area use the Punjabi language to store
their data, this transliteration system will help them transliterate Punjabi to English at the click of a button.

7. ACKNOWLEDGMENTS
I would like to express my deep and sincere gratitude to my guide, Dr. Vishal Goyal, Assistant Professor, Dept.
of Computer Science, Punjabi University, Patiala, for the continuous support of my work, and for his patience,
motivation, enthusiasm, and immense knowledge. His understanding, encouragement and personal guidance have
provided a good basis for the present work. I would like to thank my parents for supporting me throughout my
life. Above all, I thank ‘GOD’ for making this mortal venture possible.

REFERENCES
[1] Vijaya, V.P., Shivapratap and K.P. CEN (2009), “English to Tamil Transliteration using WEKA system”,
International Journal of Recent Trends in Engineering, May 2009, 1, No. 1, pages: 498-500.
[2] Chinnakotla, Damani, Satoskar (2009), “Transliteration for Resource Scarce Language”, ACM Transactions on
Asian Language Information Processing, V, No. N.
[3] Lehal and Singh (2008), “Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach”,
Proceeding of Advanced Centre for Technical Development of Punjabi Language, Literature & Culture, Punjabi
University, Patiala 147 002, Punjab, India, pages: 151-162.
[4] Malik (2006), “Punjabi Machine Transliteration System”, In Proceedings of the 21st International Conference
on Computational Linguistics and 44th Annual Meeting of the ACL (2006), pages: 1137-1144.
[5] Verma (2006), “A Roman-Gurmukhi Transliteration System”, Proceeding of the Department of Computer Science,
Punjabi University, Patiala, 2006.
[6] Hong, Kim, Lee and Chang (2009), “A Hybrid Approach to English-Korean Name Transliteration”, Proceedings of
the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 108–111, Suntec, Singapore, 7 August 2009, ACL and
AFNLP.
[7] Ali and Ijaz (2009), “English to Urdu Transliteration System”, Proceedings of the Conference on Language &
Technology 2009, pages: 15-23.
[8] Hoon Oh and Key-Sun Choi (2002), “An English-Korean Transliteration Model using Pronunciation and
Contextual Rules”, In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002),
pages: 393–399.
[9] Yaser, Knight (2002), “Machine Transliteration of Names in Arabic Text”, In Proceedings of the ACL Workshop
on Computational Approaches to Semitic Languages, Philadelphia, PA, pages: 1-13.
[10] “Transliteration”, Internet Source: http://en.wikipedia.org/wiki/transliteration, accessed January 2011.