Entity Recognition in Assamese Text: Abstract - Entity Recognition Detects All The Entities Present

This document discusses entity recognition in Assamese text. It introduces entity recognition using conditional random fields for the Assamese language. It describes some challenges of entity recognition in Assamese, including its free word order, lack of resources, ambiguity of names, agglutinative nature, spelling variations, lack of capitalization, nested entities, and unique features of Assamese grammar. The proposed system combines preprocessing of Assamese text with entity recognition using conditional random fields and natural language toolkits.


ENTITY RECOGNITION IN ASSAMESE TEXT

Nandana Mahanta, Sourish Dhar, Sudipta Roy


Department of CSE,
Assam University, Silchar,
Assam, India
Email: {1nandana.mahanta, dharsourish, sudipta.it}@gmail.com

Abstract: Entity Recognition detects all the entities present in a document to improve the performance of high-level Natural Language Processing (NLP) tasks such as Question Answering, Auto Summarization, Machine Translation and Information Extraction. The task is subdivided into two parts: Part-of-Speech (POS) tagging and Entity Recognition. Each sentence is annotated with part-of-speech tags, and the proper nouns are then classified with our own entity tag set. This paper introduces Entity Recognition in Assamese text using Conditional Random Fields (CRF). Results are reported as the F-measure for each entity class.

Keywords: POS tagging; Entity Recognition; CRF; Assamese Language

I. INTRODUCTION

Natural Language Processing (NLP) is a field of computer science, artificial intelligence and computational linguistics which deals with analyzing, understanding and generating human (natural) languages, in order to interface with computers in both written and spoken contexts instead of computer languages. Part-of-Speech tagging (POS tagging) and Named Entity Recognition (NER) are both individual tasks in the field of NLP.

Entity Recognition is a subtask of Information Extraction (IE) [1, 2], which addresses the problem of text simplification in order to form an organized view of the information present in free text. The aim of Entity Recognition is to make the text of a sentence more easily machine-readable.

Let us look at an Assamese sentence to understand the difference between Entity Recognition and NER.

ডাঃ ৰামে এ.বি.আই.ৰ ১০০টা য়াৰ কিনিছিল ।

In the above sentence NER will identify only ৰামে[PER] and এ.বি.আই.[ORG]. So a question like “What did Ram buy?” cannot be answered with the help of NER alone.

On the other hand, with the help of (supervised) POS tagging [9], Entity Recognition will tag the above sentence as

ডাঃ[P_NOM] ৰামে[PER] এ.বি.আই.[ORG]ৰ[SFX] ১০০[NUM] টা[SFX] য়াৰ[CN] কিনিছিল[VB] ।[PUN]

Thus Entity Recognition will improve the performance of high-level NLP tasks.

Assamese is an Indic or Indo-Aryan language (a branch of the Indo-European language family) spoken mainly in the state of Assam, where it is an official language. Assamese is spoken by over 30 million people in North East India. Assamese is a national language of India, but only limited computational linguistic work has been done on it [3, 4].

We have used Conditional Random Fields (CRF), a machine learning approach, for our Entity Recognition task. Although a lot of work on IE and its subtasks has been done with high accuracy for English and other languages such as Spanish, German and Chinese, not much work has been done yet for Indian languages. Ours is the first work on Entity Recognition for Assamese.

II. ISSUES REGARDING ENTITY RECOGNITION IN ASSAMESE

Assamese is written in the Assamese script, which is similar to that of Bengali except for the symbols for ৰ /ra/ and ৱ /wa/, and closely resembles the Devanagari script of Hindi, Sanskrit and other related Indic languages [11]. There are various issues related to Assamese Entity Recognition. Many of these issues are common to other NLP tasks and not specific to Assamese.

A. Sentence structure

Assamese is a relatively free word order language. The basic structure of an Assamese sentence is Subject + Object + Verb (SOV), but SVO, OSV, OVS, VOS and VSO orders can also yield the same meaning. Thus sentence formation in Assamese is diverse in nature [5, 7].

B. Scarcity of resources

Although Assamese is spoken by about 15 million people in the Indian state of Assam as a first language, the development of electronic resources for the language has lagged behind that of many other Indian languages [5]. Entity Recognition requires a large data set, but very few Assamese corpora are publicly available and most of them are driven by a specific agenda. Not much NLP work has been done for Assamese and other north eastern Indic languages.

C. Ambiguity

There is ambiguity in the names of people, since names are usually dictionary words, unlike Western names. For example, AKAAX and JON mean ‘sky’ and ‘moon’ respectively in Assamese, but can also be person names, thus creating ambiguity between common nouns and proper nouns [6].
D. Agglutinative nature

Assamese is agglutinative: complex words are created by adding affixes that change the meaning of the word. For example, অসম (Assam) is the name of a place, which is a location named entity, but অসমীয়া (Assamiya), produced by adding the suffix ইয়া (IYA) to অসম (Assam), signifies the people residing in Assam and is not a location named entity [8].

E. Spelling Variation

Variation in the spelling of proper names is another problem in Assamese Named Entity Recognition. For example, in চা (Shree Shreesanth) it is unclear whether (Shree) in চা (Shreesanth) is a pre-nominal word or a person named entity [6].

F. Lack of capitalization

In English, capitalization plays an important role in identifying named entities. There is no concept of capitalization in Assamese, which makes it difficult to identify proper nouns [6].

G. Nested Entities

When two or more proper nouns occur together, it becomes difficult to assign the proper named entity class. For example, in গৗহা িব িবদ ালয় (Gauhati bishabidyalay), গৗহা (Gauhati) is a location named entity while িব িবদ ালয় (bishabidyalay) refers to an organization, which makes it difficult to assign the proper class.

Some issues that we have seen in Assamese are not present in Hindi, Bengali or other Indic languages. They are:

A. Negation of verbs

The procedure for negating verbs in Assamese is a unique feature which clearly distinguishes it from the rest of the Indo-Aryan languages and from the Dravidian languages. In Assamese, ন /n/ is prefixed to the verb, followed by a vowel which is an exact copy of the vowel of the first syllable of the verb, as in নালাগে /nalage/, meaning ‘do not want’ (1st, 2nd, 3rd person). The negative markers in Assamese include ন/n/, না/no/, না/na/, ন/ne/ and নি/ni/ [5].

B. Use of plural suffixation

The use of plural suffixes is another feature of Assamese. Assamese nouns and pronouns are generally inflected for number, gender and case. For instance, bound forms such as হঁত /hɔt/, বাৰ /bur/, বিলাক /bilak/, মখা /mokha/, জাক /zak/ and সকল /xɔkɔl/ denote plurality and are suffixed to a noun or a pronoun [5, 7].

Plural forms can be made by adding suffixes to the singular forms of nouns [ৰামহঁত /ramhot/ ram (Ram) -hɔt (-PL) ‘Ram and others’]. This feature is not seen in Hindi or Bengali.

Pronouns are also made plural by suffixation [সিহঁত /xihɔt/ xi (he) -hɔt (-PL) ‘they’]. Plurality can also be expressed by adding qualifying words, in which case no suffix is added [ব ত মানুহ /bɔhut manuh/ bɔhut (many) manuh (man) ‘many men’].

C. Inflection of Adjectives

Assamese adjectives are generally not inflected, but when they are inflected they take the noun form [5, 7]. For example:

ধুনীয়াজনী হ আহিছা ।

Here ধুনীয়া (dhunia: beautiful) is an adjective, but after the feminine definitive জনী (zɒni) is added, the whole constituent becomes a noun.

III. PROPOSED WORK

Our proposed system is a combination of preprocessing of Assamese text and Entity Recognition. The system is developed in Python using the Natural Language Toolkit (NLTK).

[Figure 1: System Model. Raw (unstructured) Assamese text -> Preprocessing of raw text (Tokenization, POS Tagging) -> Entity Recognition using CRF -> Tagged (structured) Assamese text]

Preprocessing: The preprocessing task consists of loading text into the system, tokenization and POS tagging.

Loading text means loading the Assamese text file from the system. The text file is then tokenized, i.e. each word, symbol and punctuation mark is separated. The system then tags each token of the tokenized file with the most appropriate POS tag.

To train the model for POS tagging we used the partially annotated corpora from the Research Centre for Indian Language Technology Solution (RCILTS). The remaining unannotated part of the corpora (30,000 words) was tagged manually. The training dataset consists of 90k annotated words. We used the Stanford POS Tagger and the POS tag set described by Pallav Kumar Dutta [5]. The tag set and the meaning of each tag are given in Table 1.

TABLE 1: POS TAG SET

Tag      Description
NN       Noun
NNPC     Compound Proper Noun
NNP      Proper Noun
NLOC     Noun Location
NVB      Noun in Kriya (verb) Mula
PRP      Pronoun
CC       Conjunction
INTF     Intensifier
JJ       Adjective
JVB      Adjective in Kriya (verb) Mula
NEG      Negative
PSP      Post-position
PUNC     Punctuation
QF       Quantifier
QFNUM    Number Quantifier
QW       Question Word
RB       Adverb
RBVB     Adverb in Kriya (verb) Mula
RP       Particle
SYM      Symbol
UH       Interjection Word
VAUX     Auxiliary Verb
VAUXN    Negative Verb Auxiliary
VFM      Verb Finite Main
VFMN     Negative Verb Finite Main
VJJ      Verb Non-Finite Adjectival
VJJN     Negative Verb Non-Finite Adjectival
VNN      Verb Non-Finite Nominal
VNNN     Negative Verb Non-Finite Nominal
VRB      Verb Non-Finite Adverbial
VRBN     Negative Verb Non-Finite Adverbial
VNF      Non-Finite Verb
VNFN     Negative Non-Finite Verb

Source: POS tag set of Pallav Kumar Dutta [5]

Entity Recognition: The RCILTS corpora are not rich in proper nouns since they were developed for POS tagging. To bridge this gap we manually collected around 30 articles (around 3000 sentences) of Assamese text from the online Assamese Wikipedia for our Entity Recognition task. These articles were manually tagged to train the system, and 20% of this annotated data was kept aside for testing.

With the remaining 80% of the data we trained our Entity Recognition system using CRF. CRF combines the advantages of discriminative classification and graphical modeling, and yields a more accurate conditional model with a much simpler structure than a joint model [10].

We have used the following features while training the model.

A. POS tag

The POS tag is the most important feature in our system. It gives part-of-speech information about a token, which is very helpful in our Entity Recognition system. With the help of part-of-speech information, the relation between two entities can be found.

B. Word Prefix and Suffix

The starting and ending characters of a token play an important role in tagging. Suffixation of nouns is very extensive in Assamese: there are more than 100 suffixes for the Assamese noun [7].

C. N-gram

N-grams have been widely investigated for a number of text processing and retrieval applications. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus [12]. We have used the letters of a word as the n-gram (n=6) feature for the system.

D. Context Words

We have considered the tags of the words before and after the current word. Although Assamese sentence structure is not like that of English and other European languages, we can still get some information about the type of the current token by observing its surrounding tokens.

Using our own entity classes, and with the help of POS tagging, each entity is tagged with a category based on its meaning in the text. Our tag set for Entity Recognition is given in Table 2.

TABLE 2: ENTITY TAG SET

Tag                    Description                                                     Example
PER                    Single word person name                                         ৰাজীৱ/PER
LOC                    Single word location name                                       কলিকতা/LOC
ORG                    Single word organization name                                   কংে ছ/ORG
B-PER, I-PER, E-PER    Beginning, Internal or End of a multiword person name           মাহনদাস/B-PER কৰমচা /I-PER গা ী/E-PER
B-LOC, I-LOC, E-LOC    Beginning, Internal or End of a multiword location name         মহা া/B-LOC গা ী/I-LOC পথ/E-LOC
B-ORG, I-ORG, E-ORG    Beginning, Internal or End of a multiword organization name     মহাকাশ/B-ORG গৱেষণা/I-ORG সং া/E-ORG
NUM                    Number                                                          ১০০
DATE                   Date                                                            ১২ বহাগ/DATE
ABV                    Abbreviation                                                    ড°/ABV

IV. RESULT ANALYSIS

After finishing the classification task, we tested our system on the 20% of the data (600 sentences) that we had kept aside earlier.
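The held-out evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes token-level scoring over parallel lists of gold and predicted tags, uses the standard definitions of precision, recall and F-score, and the function name `per_class_scores` and the toy tags are hypothetical.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {tag: (precision, recall, f_score)} for two parallel tag lists.

    Assumption: scoring is done per token, and every tag (entity or not)
    is treated as a class of its own.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1      # correct prediction for tag g
        else:
            fp[p] += 1      # tag p was predicted where it did not belong
            fn[g] += 1      # an occurrence of gold tag g was missed

    scores = {}
    for tag in set(gold) | set(predicted):
        predicted_total = tp[tag] + fp[tag]
        gold_total = tp[tag] + fn[tag]
        precision = tp[tag] / predicted_total if predicted_total else 0.0
        recall = tp[tag] / gold_total if gold_total else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f_score)
    return scores


# Toy, made-up tags for illustration only (not data from the paper).
gold = ["PER", "LOC", "PER", "ORG", "LOC", "PER"]
pred = ["PER", "PER", "PER", "ORG", "LOC", "LOC"]
scores = per_class_scores(gold, pred)
for tag in sorted(scores):
    print(tag, scores[tag])
```

Whether scores are computed per token or per complete multiword entity changes the numbers, so any comparison with reported figures would need to use the same convention.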
So far as we know, Entity Recognition has not been attempted for Assamese to date, so we could not compare our results with any existing system.

Result Evaluation Parameters:

Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
F-Score = 2 * Precision * Recall / (Precision + Recall)

For each entity class the F-score is listed in Table 4.

TABLE 3: CLASSIFIER RESULT ON TEST DATA SET

ENTITY          PRECISION   RECALL
PERSON          0.9577      0.8551
LOCATION        0.8821      0.4074
ORGANIZATION    0.9218      0.8182
NUMBER          0.8219      0.5000
ABBREVIATION    0.9122      0.8235
DATE            0.7561      0.5820

FIGURE 2: PRECISION VS RECALL GRAPH ON ENTITY RESULT

TABLE 4: ENTITY EXTRACTED AND F-SCORE VALUE

ENTITY          NUMBER OF EXTRACTION   F-SCORE
PERSON          112                    0.9035
LOCATION        61                     0.6244
ORGANIZATION    20                     0.8669
NUMBER          79                     0.6218
ABBREVIATION    43                     0.8656
DATE            67                     0.6577

V. CONCLUSION AND FUTURE WORK

In this paper we briefly discussed our proposed Entity Recognition system for Assamese text and the components used to develop it. We believe that with better resources and a more varied Assamese dataset these results can be improved.

We obtained a maximum precision of 0.9577 for the PERSON class and a minimum of 0.7561 for the DATE class; the recall values for these two classes are 0.8551 and 0.5820 respectively. Since ours is the first system for Entity Recognition in Assamese, we are not able to compare our results with others. We hope that Entity Recognition will make a big difference to high-level NLP tasks for Assamese and other Indian languages.

REFERENCES
[1] Tang, J., Hong, M., Zhang, D., Liang, B. and Li, J., 2007, Information Extraction: Methodologies and Applications, in Prado, H. A. D. and Ferneda, E., eds., Emerging Technologies of Text Mining: Techniques and Applications, IGI Global, New York, p. 1-33.
[2] Gupta, V. and Lehal, G. S., 2009, A survey of Text Mining: Techniques and Applications, Journal of Emerging Technologies in Web Intelligence, vol. 1, No. 1, p. 60-76.
[3] Rahman, M., Das, S. and Sharma, U., 2009, Parsing of part-of-speech tagged Assamese Texts, IJCSI International Journal of Computer Science Issues.
[4] Assamese Website, http://www.iitg.ernet.in/rcilts/pdf/assamese.pdf, (August 4, 2016).
[5] Dutta, P. K., An Online Semi Automated Part of Speech Tagging Technique Applied to Assamese, PhD Thesis, Dept. of CSE, Indian Institute of Technology Guwahati, Guwahati - 781039, Assam, India, December 2013.
[6] Sharma, P., Sharma, U. and Kalita, J., Named Entity Recognition: A Survey for the Indian Languages, 2010, National Seminar on Lexical Resources and Computational Techniques on Indian Languages, Pondicherry.
[7] Saharia, N., Computational Morphology and Syntax for a Resource-Poor Inflectional Language, PhD Thesis, Dept. of CSE, School of Engineering, Tezpur University, Tezpur, Assam, India - 784028, January 2014.
[8] Talukdar, G., Borah, P. P. and Boruah, A., 2014, Supervised Named Entity Recognition in Assamese language, IC3I International Conference on Contemporary Computing and Informatics, Mysore.
[9] Guilder, L. V., Automated Part of Speech Tagging: A Brief Overview, Handout for LING361, Fall 1995, Georgetown University.
[10] Sutton, C. and McCallum, A., 2007, An Introduction to Conditional Random Fields for Relational Learning, in Getoor, L., ed., Introduction to Statistical Relational Learning, MIT Press, p. 93-123.
[11] Assamese design Guide, http://www.iitg.ernet.in/rcilts/phaseI/newassamesedesign.pdf, (August 4, 2016).
[12] Dey, A. and Purkayastha, B. S., 2013, Named Entity Recognition using Gazetteer Method and N-gram Technique for an Inflectional Language: A Hybrid Approach, International Journal of Computer Applications, vol. 84, No. 9, p. 31-35.
