Entity Recognition in Assamese Text: Abstract - Entity Recognition Detects All The Entities Present
Abstract - Entity Recognition detects all the entities present in a document to improve the performance of high-level Natural Language Processing (NLP) tasks such as Question Answering, Auto Summarization, Machine Translation and Information Extraction. The task is subdivided into two parts: Parts of Speech (POS) Tagging and Entity Recognition. Each sentence is annotated with part-of-speech tags, and the proper nouns are then classified with our own entity tag set. This paper introduces Entity Recognition in Assamese text using Conditional Random Fields (CRF). Results are measured with the F-measure metric for each entity class.

Keywords - POS tagging; Entity Recognition; CRF; Assamese Language

Assamese is an Indic or Indo-Aryan language (a branch of the Indo-European language family) spoken mainly in the state of Assam, where it is an official language. Assamese is spoken by over 30 million people in North East India. Assamese is a national language of India, but only limited computational linguistic work has been done on it [3, 4].

We have used Conditional Random Fields (CRF), a machine learning approach, for our Entity Recognition task. Although a lot of work on Information Extraction (IE) and its subtasks has been done with high accuracy in English and other foreign languages like Spanish, German and Chinese, not much work has been done yet for Indian languages. Ours is the first work on Entity Recognition for Assamese.

A. Negation of verbs
The procedure of negation of verbs in the Assamese language is a unique feature which clearly distinguishes it from the other Indo-Aryan languages and from the Dravidian languages. In Assamese, ন/n/ is prefixed to the verb, followed by a vowel which is an exact copy of the vowel of the first syllable of the verb, as in নালাগে/nalage/ meaning 'do not want' (1st, 2nd, 3rd person). The various negative markers in Assamese include ন/n/, নো/no/, না/na/, নে/ne/ and নি/ni/ [5].

B. Use of plural suffixation
The use of plural suffixes is another feature of Assamese. Assamese nouns and pronouns are generally inflected for number, gender and case. For instance, bound forms such as হঁত/hɔt/, বোৰ/bur/, বিলাক/bilak/, মখা/mokha/, জাক/zak/ and সকল/xɔkɔl/ denote plurality and are suffixed to a noun or a pronoun [5, 7].

[Figure 1: System Model. Preprocessing (tokenization of raw text, POS tagging), followed by Entity Recognition using CRF, producing tagged (structured) Assamese text.]

Preprocessing: The preprocessing task consists of loading text into the system, tokenization and POS tagging. Loading text means loading the Assamese text file from the system. The text file is then tokenized, i.e. each word, symbol and punctuation mark is separated. The system then tags each token of the tokenized file with the most appropriate POS tag.

To train the model for POS tagging we have used the partially annotated corpora from the Research Centre for Indian Language Technology Solution (RCILTS). The remaining unannotated portion (30,000 words) was tagged manually. The training dataset consists of 90k annotated words. We have used the Stanford POS Tagger and the POS tag set described by Pallav Kumar Dutta [5]. The tag set, along with the meaning of each tag, is given in Table 1.

TABLE 1: POS TAG SET

Tag      Description
NN       Noun
NNPC     Compound Proper Noun
NNP      Proper Noun
NLOC     Noun Location
NVB      Noun in Kriya (verb) Mula
PRP      Pronoun
CC       Conjunction
INTF     Intensifier
JJ       Adjective
JVB      Adjective in Kriya (verb) Mula
NEG      Negative
PSP      Post-position
PUNC     Punctuation
QF       Quantifier
QFNUM    Number Quantifier
QW       Question Word
RB       Adverb
RBVB     Adverb in Kriya (verb) Mula
RP       Particle
SYM      Symbol
UH       Interjection Word
VAUX     Auxiliary Verb
VAUXN    Negative Verb Auxiliary
VFM      Verb Finite Main
VFMN     Negative Verb Finite Main
VJJ      Verb Non-Finite Adjectival
VJJN     Negative Verb Non-Finite Adjectival
VNN      Verb Non-Finite Nominal
VNNN     Negative Verb Non-Finite Nominal
VRB      Verb Non-Finite Adverbial
VRBN     Negative Verb Non-Finite Adverbial
VNF      Non-Finite Verb
VNFN     Negative Non-Finite Verb
Source: POS tag set of Pallav Kumar Dutta [5]

Entity Recognition: The RCILTS corpora is not rich in proper nouns, since it was developed for POS tagging. To bridge this gap we have manually collected around 30 articles (around 3000 sentences) of Assamese text from the online Assamese Wikipedia for our Entity Recognition task. These articles are manually tagged to train the system, and 20% of this annotated data is kept aside for testing.

With the remaining 80% of the data we trained our Entity Recognition system using CRF. CRF combines the advantages of discriminative classification and graphical modeling, and yields a more accurate conditional model with a much simpler structure than a joint model [10].

We have used the following features while training the model.

A. POS tag
The POS tag is the most important feature in our system. It gives the part-of-speech information about a token, which is very helpful in our Entity Recognition system. With the help of part-of-speech information, the relation between two entities can be found.

B. Word Prefix and Suffix
The starting and ending characters of a token play an important role in tagging. Suffixation of nouns is very extensive in Assamese: there are more than 100 suffixes for the Assamese noun [7].

C. N-gram
N-grams have been widely investigated for a number of text processing and retrieval applications. An n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs, according to the application. N-grams are typically collected from a text or speech corpus [12]. We have considered the letters of a word as the n-gram (n=6) feature for the system.

D. Context Words
We have considered the previous and next tags of the current word. Although Assamese sentence structure is not similar to that of English and other European languages, we can still get some information about the type of the current token by observing its surrounding tokens.

Using our own entity classes and with the help of POS tagging, each entity is tagged with a category based on its meaning in the text. Our tag set for Entity Recognition is given in Table 2.

TABLE 2: ENTITY TAG SET

Tag                    Description                                                    Example
PER                    Single word person name                                        ৰাজীৱ/PER
LOC                    Single word location name                                      কলিকতা/LOC
ORG                    Single word organization name                                  কংগ্ৰেছ/ORG
B-PER, I-PER, E-PER    Beginning, Internal or End of a multiword person name          মোহনদাস/B-PER কৰমচান্দ/I-PER গান্ধী/E-PER
B-LOC, I-LOC, E-LOC    Beginning, Internal or End of a multiword location name        মহাত্মা/B-LOC গান্ধী/I-LOC পথ/E-LOC
B-ORG, I-ORG, E-ORG    Beginning, Internal or End of a multiword organization name    মহাকাশ/B-ORG গৱেষণা/I-ORG সংস্থা/E-ORG
NUM                    Number                                                         ১০০
DATE                   Date                                                           ১২ বহাগ/DATE
ABV                    Abbreviation                                                   ড°/ABV

IV. RESULT ANALYSIS
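The 80/20 hold-out split of the annotated sentences can be sketched as follows. This is a minimal illustration and not the authors' procedure: the source does not say how the held-out sentences were chosen, so the random sentence-level shuffle, the fixed seed and the helper name `split_train_test` are all assumptions.

```python
import random


def split_train_test(sentences, test_fraction=0.2, seed=0):
    """Return (train, test) lists.

    Assumption: a random, sentence-level split; the paper only states
    that 20% of the annotated data was kept aside for testing.
    """
    shuffled = list(sentences)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder ids standing in for the roughly 3000 annotated sentences.
sentences = [f"sentence-{i}" for i in range(3000)]
train, test = split_train_test(sentences)
print(len(train), len(test))  # 2400 600
```

With about 3000 sentences, a 20% hold-out gives the 600 test sentences mentioned in the text.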
After finishing the classification task, we tested our system with the 20% of the data (600 sentences) that we had kept aside earlier. As far as we know, Entity Recognition has not been done to date for the Assamese language, so we could not compare our results with any existing system.

Result Evaluation Parameters:
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
F-Score = 2 * Precision * Recall / (Precision + Recall)

For each entity class the F-score is listed in Table 3.

TABLE 3: CLASSIFIER RESULT ON TEST DATA SET

PERSON          112    0.9035
LOCATION         61    0.6244
ORGANIZATION     20    0.8669
NUMBER           79    0.6218
ABBREVIATION     43    0.8656
DATE             67    0.6577
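
The evaluation parameters can be sketched in a few lines of Python, using the standard definition of precision, TP / (TP + FP). The token-level counting, the function names and the toy tag sequences below are assumptions made for illustration; the source does not state whether scoring was done per token or per entity, and the numbers here are not the paper's data.

```python
def class_counts(gold, predicted, label):
    """Count true positives, false positives and false negatives for one class.

    Assumption: gold and predicted are equal-length lists of per-token tags.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score, returning 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical tags for five tokens (illustrative only).
gold = ["PER", "LOC", "PER", "ORG", "PER"]
predicted = ["PER", "PER", "PER", "ORG", "LOC"]

tp, fp, fn = class_counts(gold, predicted, "PER")
precision, recall, f_score = precision_recall_f(tp, fp, fn)
print(tp, fp, fn)
print(precision, recall, f_score)
```

Running the same two functions once per entity class gives one F-score per class, which is the form in which the results are reported.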