Development of A Punjabi To English Tran

This paper presents a rule-based transliteration system for converting Punjabi names to English, achieving an accuracy of 93.22%. The system employs a grapheme-based approach and utilizes character mapping rules to maintain phonetic integrity. It highlights the challenges of transliteration, particularly with ambiguous names and user input errors.
International Journal of Computer Science and Communication Vol. 2, No. 2, July-December 2011, pp. 521-526

DEVELOPMENT OF A PUNJABI TO ENGLISH TRANSLITERATION SYSTEM

Kamal Deep¹ and Vishal Goyal²

¹ Department of Computer Science, Punjabi University, Patiala, India
E-mail: kamal.1cse@gmail.com
² Assistant Professor, Department of Computer Science, Punjabi University, Patiala, India
E-mail: vishal.pup@gmail.com

ABSTRACT
Machine transliteration has gained prime importance as a supporting tool for machine translation and cross-language information retrieval, especially when proper names and technical terms are involved. The performance of machine translation and cross-language information retrieval depends heavily on accurate transliteration of named entities. Hence, the transliteration model must aim to preserve the phonetic structure of words as closely as possible. This paper addresses the problem of transliterating Punjabi to English using a rule-based approach. The proposed transliteration scheme uses a grapheme-based method to model the transliteration problem. The technique has been demonstrated on Punjabi to English transliteration of common names and achieved an accuracy of 93.22%.

1. INTRODUCTION

Transliteration is the process of replacing words in a source language with their approximate phonetic or spelling equivalents in a target language. Commonly, transliteration is used to translate named entities across languages. Automatic transliteration is helpful for many applications, such as Machine Translation (MT), Cross Language Information Retrieval (CLIR) and Information Extraction (IE). Transliterating a word from the language of its origin to a foreign language is called Forward Transliteration, while transliterating a loan word written in a foreign language back to the language of its origin is called Backward Transliteration. This paper addresses the problem of forward transliteration of person names from Punjabi to English.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 introduces the English and Punjabi languages. Section 4 describes the transliteration system architecture. Experimental results and error analysis are discussed in Section 5. Finally, Section 6 concludes the paper.

2. RELATED WORK

Several approaches have been proposed for name transliteration. In grapheme-based approaches (ΨG), transliteration is viewed as a process of mapping a grapheme sequence from a source language to a target language, ignoring the phoneme-level processes. In contrast, in phoneme-based approaches (ΨP), the transliteration key is pronunciation, or the source phoneme, rather than spelling, or the source grapheme. This approach is basically a source grapheme-to-source phoneme transformation followed by a source phoneme-to-target grapheme transformation. Hybrid approaches (ΨH) simply combine the grapheme-based transliteration probability (Pr(ΨG)) and the phoneme-based transliteration probability (Pr(ΨP)) using linear interpolation.

Vijaya, VP, Shivapratap and KP CEN [1] developed an English to Tamil transliteration system using WEKA. It is a rule-based system that uses the J48 decision tree classifier of WEKA for classification purposes. The transliteration process consists of four phases: preprocessing, feature extraction, training and transliteration. The accuracy of this system was tested with 1000 English names that were out of corpus. The transliteration model produced an exact transliteration in Tamil from English words with an accuracy of 84.82%.

Chinnakotla, Damani, Satoskar [2] developed transliteration systems for resource-scarce languages. They developed rule-based systems for Hindi to English, English to Hindi, and Persian to English transliteration tasks. They used CSM (Character Sequence Modeling) on the source side for word origin identification, a manually generated non-probabilistic character mapping rule base for generating transliteration candidates, and then again used CSM on the target side for ranking the generated candidates. The reported overall efficiency using the CRF (Conditional Random Field) approach is 67.0% for English to Hindi, 70.7% for Hindi to English and 48.0% for Persian to English.

Lehal and Singh [3] developed a Shahmukhi to Gurmukhi transliteration system based on a corpus approach. In this system, script mapping is done first, covering simple consonants, aspirated consonants (AC), vowels, and other diacritical marks or symbols. This system has been virtually divided
into two phases. The first phase performs pre-processing and rule-based transliteration tasks, and the second phase performs post-processing. The overall accuracy of the system has been reported to be 91.37%.

Malik [4] developed the Punjabi Machine Transliteration (PMT) system, which is rule-based. PMT has been used for Shahmukhi to Gurmukhi transliteration. PMT preserves both the phonetics and the meaning of the transliterated word. The primary limitation of this system is that it works only on input data which has been manually edited for missing vowels or diacritical marks (the basic ambiguity of written Arabic script), which limits its practical use. The accuracy of the system has been reported to be 98.95%.

Verma [5] developed a Gurmukhi to Roman transliteration system named GTrans. He surveyed existing Roman-Indic script transliteration techniques and developed a transliteration scheme based on ISO 15919 and ALA-LC. It is a rule-based system. He also performed reverse transliteration from Gurmukhi to Roman. The overall accuracy of the system has been reported to be 98.43%.

Hong, Kim, Lee and Chang [6] developed an English-Korean name transliteration system using a hybrid approach. In the transliteration process, first, a phrase-based SMT model with some factored translation features is used. Second, they expanded the base system by applying web-based n-best re-ranking of the results. Third, they applied a pronouncing dictionary-based method to the base system, which utilizes pronunciation symbols motivated by linguistic knowledge. Finally, a phonics-based method is applied, which was originally designed for teaching speakers of English to read and write that language. The experimental results of the three n-best re-ranking techniques showed that web-based re-ranking is a useful method. Their standard run and best standard run achieved accuracies of 45.1% and 78.5%, respectively.

Ali and Ijaz [7] developed an English to Urdu transliteration system based on mapping rules. The whole process has three steps. In the first step, mapping rules are used to generate Urdu text from the English transcription: English text is converted to Urdu using both English pronunciation and mapping rules. In the second step, Urdu syllabification is applied to the English transcription. Consonants and vowels are combined to make syllables, and breaking up a word into syllables is known as syllabification. To improve the system's accuracy, they applied Urduization rules in the third step. The overall accuracy of the system is 96%.

Hoon Oh and Choi [8] developed an English-Korean transliteration system using a hybrid approach, since it uses both phonetic information, such as phonemes and their context, and orthography. The method is composed of two phases: alignment and transliteration. First, an English pronunciation unit and its corresponding phoneme are aligned phonetically. Second, English words are transliterated into Korean words through several steps. Using an English pronunciation dictionary (P-DIC), a pronunciation is assigned to a given English word. If the word is not found in P-DIC, the system investigates whether it has a complex word form. To detect a complex word form, the given English word is divided into two words (word+word) using entries of P-DIC. If both of them are in P-DIC, the system can assign a pronunciation to the given word; otherwise the system must estimate the pronunciation. Then the system checks whether the English word is of Greek origin. Because E-K transliteration for English words of Greek origin differs from that for pure English words, it is important to detect this. The pronunciation of English words not registered in P-DIC is estimated in the next step. Finally, Korean transliterated words are generated using conversion rules. Evaluation was performed through Word Accuracy (WA) and Character Accuracy (CA). This system reported an accuracy of 90.82% for WA and 56% for CA.

Yaser, Knight [9] developed an Arabic to English transliteration system based on sound and spelling mapping using finite state machines. They combined the phonetic-based model and the spelling-based model into a single transliteration model. For testing they used a development data set and a blind data set. The overall accuracy was reported to be 53.66% on the development data set and 61% on the blind data set. The reason for the higher accuracy on the blind data set was that it consists mostly of highly frequent, prominent politicians, whereas the development set also contains names of writers and less common political figures.

3. PUNJABI & ENGLISH LANGUAGE

In this section we discuss the Punjabi and English languages.

3.1 Punjabi Language

Punjabi is written in the Gurmukhi script. The Gurmukhi script was derived from the Sharada script and standardized by Guru Angad Dev in the 16th century. It was designed to write the Punjabi language. The meaning of Gurmukhi is "from the mouth of the Guru". The Gurmukhi (or Punjabi) alphabet contains thirty-five distinct letters. These are:

ੳ ਅ ੲ ਸ ਹ ਕ ਖ ਗ ਘ ਙ ਚ ਛ ਜ ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ

The first three letters are unique because they form the basis for vowels and are not consonants. Apart from Aira, these characters are never used on their own.
Consonants are:

ਸ ਹ ਕ ਖ ਗ ਘ ਙ ਚ ਛ ਜ ਝ ਞ ਟ ਠ ਡ ਢ ਣ ਤ ਥ ਦ ਧ ਨ ਪ ਫ ਬ ਭ ਮ ਯ ਰ ਲ ਵ ੜ

In addition to these, there are six consonants created by placing a dot (bindi) at the foot (pair) of the consonant:

ਸ਼ ਖ਼ ਗ਼ ਜ਼ ਫ਼ ਲ਼

In addition to this, there are nine dependent vowel signs used to create ten independent vowels with three bearer characters: Ura [ੳ], Aira [ਅ] and Iri [ੲ].

3.2 English Language

English is written in the Roman script. English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England. It is one of six official languages of the United Nations. India is one of the countries where English is spoken as a second language. There are 26 letters in English, of which 21 are consonants and 5 are vowels. The vowels are: a, e, i, o, u.

4. ARCHITECTURE

Our basic rule-based transliteration system works by employing a set of character mapping or character sequence mapping rules between the languages involved. Punjabi words are written in Gurmukhi script while English words are written in Roman script. Each Gurmukhi consonant symbol that is not followed by a vowel represents that consonant plus an inherent schwa vowel sound. For example, [Gurmukhi text missing] is represented as [romanization missing]. Note that the schwa vowel is not pronounced in certain contexts; in this case, the schwa after [Gurmukhi text missing] is not pronounced. A snippet of the direct mapping of vowels and consonants is shown in Tables 1, 2, 3 and 4. The accuracy of the system using direct mapping alone was very low. To improve the accuracy we developed additional rules. In our system, rules also include constraints which specify the context in which they are applicable, such as Start of a Word (S), End of a Word (E), After Vowel (AV), After Consonant (AC), etc. Combining the different mapping options for each character of an input Punjabi word results in different transliteration candidates. For example, consider the Punjabi word [Gurmukhi text missing], whose six characters have 2, 2, 1, 1, 1, and 2 possible mappings respectively. Hence a total of 2*2*1*1*1*2 = 8 transliteration candidates should be considered. (Examples: waishali, wayshali, vaishali, vayshali, waishalee, wayshalee etc.).
Table 1
Independent Vowels Mapping

Table 2
Dependent Vowels Mapping
Table 3
Consonant Mapping

Table 4
Mapping of Special Symbols
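The direct mapping and the context constraints of Section 4 can be illustrated with a small sketch. It is an illustration under stated assumptions, not the rule base of this system: the tiny mapping tables cover only a handful of characters, the only context-sensitive rule shown is dropping the inherent schwa at the end of a word (E), and the names contexts and transliterate are hypothetical. Characters are written as Unicode escapes so the code does not depend on how the Gurmukhi glyphs are encoded in this file.

```python
# Nine dependent vowel signs of Gurmukhi (Unicode block U+0A00-U+0A7F).
VOWEL_SIGNS = set("\u0A3E\u0A3F\u0A40\u0A41\u0A42\u0A47\u0A48\u0A4B\u0A4C")

# Assumed direct mappings for a few characters only (illustrative subset).
CONSONANTS = {
    "\u0A15": "k",   # ka
    "\u0A2E": "m",   # ma
    "\u0A32": "l",   # la
    "\u0A35": "v",   # va (could also map to "w")
    "\u0A36": "sh",  # sha
}
SIGNS = {
    "\u0A3E": "a",   # kanna
    "\u0A48": "ai",  # dulainvan
    "\u0A40": "i",   # bihari
}


def contexts(word, i):
    """Return the context tags (S, E, AV, AC) that hold at position i.
    Simplification: anything that is not a vowel sign counts as a consonant."""
    tags = set()
    if i == 0:
        tags.add("S")
    if i == len(word) - 1:
        tags.add("E")
    if i > 0:
        tags.add("AV" if word[i - 1] in VOWEL_SIGNS else "AC")
    return tags


def transliterate(word):
    """Direct mapping plus one assumed rule: a consonant not followed by a
    vowel sign takes the inherent schwa 'a', except at the end of a word."""
    out = []
    for i, ch in enumerate(word):
        if ch in SIGNS:
            out.append(SIGNS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            followed_by_sign = i + 1 < len(word) and word[i + 1] in VOWEL_SIGNS
            if not followed_by_sign and "E" not in contexts(word, i):
                out.append("a")
        else:
            out.append(ch)  # unknown characters pass through unchanged
    return "".join(out)


print(transliterate("\u0A15\u0A2E\u0A32"))                    # kamal
print(transliterate("\u0A35\u0A48\u0A36\u0A3E\u0A32\u0A40"))  # vaishali
```

Keeping the context test in a separate function means further constraints (for example a rule that applies only after a vowel) can be added without changing the main loop.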

5. EXPERIMENTAL RESULT AND ERROR ANALYSIS

In this section, we discuss the accuracy of our system.

5.1 Evaluation Metrics

The main aim of the system is effective transliteration of names from Punjabi to English. Thus, the system has been evaluated through an accuracy test and error analysis. To measure the quality of the transliteration results, Word Accuracy is calculated using the following equation:

Accuracy = (C / N) * 100

where
C = the total number of correctly transliterated words, and
N = the total number of test words.

5.2 Evaluation Data

We divided the data set into two parts: a training data set and a test data set. The training data set consisted of 1013 person names; using these names we made the rules for transliteration from Punjabi to English. For the test data set we used original data of the kind on which the system will be deployed. Our system is accurate for Punjabi words but not for foreign words. For evaluating the system we took names from different domains such as person names, city names, state names, river names, etc. We made two test cases. Test case 1 contains person names. Test case 2 contains city names, state names and river names.

Test Case 1    Person names                            1923 names (351 duplicates)
Test Case 2    City names, State names, River names    128 names
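The Word Accuracy metric of Section 5.1 can be computed as in the following minimal sketch. It assumes that a word counts as correct only when the system output matches the reference spelling exactly, which the text does not state explicitly. The function name word_accuracy and the sample names are hypothetical and are not taken from the test data.

```python
def word_accuracy(predicted, reference):
    """Return Word Accuracy in percent: (C / N) * 100, where C is the number
    of outputs that exactly match the reference and N is the number of test
    words."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one test word is required")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference) * 100


# Illustrative data only: three of the four outputs match the reference.
system_output = ["vaishali", "kamal", "wishal", "patiala"]
gold_standard = ["vaishali", "kamal", "vishal", "patiala"]
print(word_accuracy(system_output, gold_standard))  # 75.0
```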
5.3 Result

The results of the two test cases are given below. The overall accuracy of our system is 93.22%.

Test Case                                             Accuracy
Test case 1 (Person names)                            95.00%
Test case 2 (City names, State names, River names)    91.40%

The following figure gives a graphical view of the accuracy of Test case 1 and Test case 2.

Figure 1

5.4 Error Analysis

The overall accuracy of the system is quite good. However, Test case 2 is less accurate than Test case 1 because the un-standardized language causes more ambiguities. There are several reasons for the errors in the output.

• Multiple Transliterations: Sometimes a name as pronounced in Punjabi corresponds to many English words, so the system fails to guess which one is the best for that particular transliteration.

• Wrong Input of Words: Sometimes the user does not enter correct data into the system, due to which the output is also not correct. For example, [Gurmukhi text missing] as [Gurmukhi text missing]; here halant is used as such, but we know it is used to write a half letter.

• Character Gap: The number of characters in the English and Punjabi character sets differs, which makes the transliteration process difficult. The numbers of vowels are 5 and 20 and the numbers of consonants are 21 and 41 in English and Punjabi, respectively, as explained earlier. So there is a character gap between the two languages that leads to problems in the transliteration process. For example, for the character [Gurmukhi text missing] in Punjabi there is no corresponding character in English.

• One-to-Multi Mapping Problem: In this problem, a single character in one script transforms to multiple characters in another script. The multi-mapping problem is associated with the characters shown in Table 3. For example, the character ਵ in Punjabi can be transliterated into two characters in English, 'v' or 'w'. Some algorithm is required to select the appropriate character in different situations.

Some names that are not correctly transliterated by our system are: [examples missing].

The following names are transliterated by our system: [examples missing].

6. CONCLUSION

In this paper we have addressed the problem of transliterating Punjabi to English using a rule-based approach. A Punjabi to English transliteration system is very beneficial for removing the language and script barrier. The system gives promising results and can be further used by researchers working on Punjabi and English Natural Language Processing tasks. Since most government departments in the Punjab area use the Punjabi language to store their data, this transliteration system will help them transliterate Punjabi to English at the click of a button.

7. ACKNOWLEDGMENTS

I would like to express my deep and sincere gratitude to my guide Dr. Vishal Goyal, Assistant Professor, Dept. of Computer Science, Punjabi University, Patiala, for the continuous support of my work, for his patience, motivation, enthusiasm, and immense knowledge. His understanding, encouragement and personal guidance have provided a good basis for the present work. I would like to thank my parents for supporting me throughout my life. Above all, I thank 'GOD' for making this mortal venture possible.

REFERENCES

[1] Vijaya, V.P., Shivapratap and K.P. CEN (2009), "English to Tamil Transliteration using WEKA system", International Journal of Recent Trends in Engineering, May 2009, 1, No. 1, pages: 498-500.

[2] Chinnakotla, Damani, Satoskar (2009), "Transliteration for Resource Scarce Language", ACM Transactions on Asian Language Information Processing, V, No. N.
[3] Lehal and Singh (2008), "Shahmukhi to Gurmukhi Transliteration System: A Corpus based Approach", Proceeding of Advanced Centre for Technical Development of Punjabi Language, Literature & Culture, Punjabi University, Patiala 147 002, Punjab, India, pages: 151-162.
[4] Malik (2006), "Punjabi Machine Transliteration System", In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL (2006), pages: 1137-1144.
[5] Verma (2006), "A Roman-Gurmukhi Transliteration System", Proceeding of the Department of Computer Science, Punjabi University, Patiala, 2006.
[6] Hong, Kim, Lee and Chang (2009), "A Hybrid Approach to English-Korean Name Transliteration", Proceedings of the 2009 Named Entities Workshop, ACL-IJCNLP 2009, pages 108-111, Suntec, Singapore, 7 August 2009, ACL and AFNLP.
[7] Ali and Ijaz (2009), "English to Urdu Transliteration System", Proceedings of the Conference on Language & Technology 2009, pages: 15-23.
[8] Hoon Oh and Key-Sun Choi (2002), "An English-Korean Transliteration Model using Pronunciation and Contextual Rules", In Proc. of the 19th International Conference on Computational Linguistics (COLING 2002), pages: 393-399.
[9] Yaser, Knight (2002), "Machine Transliteration of Names in Arabic Text", In Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages, Philadelphia, PA, pages: 1-13.
[10] "Transliteration", Internet Source: http://en.wikipedia.org/wiki/transliteration, accessed Jan. 2011.
