Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2103–2109

Marseille, 11–16 May 2020


© European Language Resources Association (ELRA), licensed under CC-BY-NC

Event Extraction from Unstructured Amharic Text


Ephrem Tadesse, Rosa Tsegaye Aga, Kuulaa Qaqqabaa
Jimma University, Armauer Hansen Research Institute, Addis Ababa Science and Technology University
Jimma, Addis Ababa, Akaki Kality Sub-City
ephe11ta@gmail.com, rosatsegaye@gmail.com, kuulaa@gmail.com

Abstract
Event extraction is an information extraction task that extracts specific knowledge about certain incidents from text. Event extraction has been studied for texts in many languages, but not for Amharic, a Semitic language. In this study, we present a system that extracts events from unstructured Amharic text. The system is designed by integrating supervised machine learning with a rule-based approach; we call it a hybrid system. It uses supervised machine learning to detect events in the text and handcrafted rules to extract them. Event extraction relies on event arguments. Event arguments identify event-triggering words or phrases that clearly express the occurrence of the event. Event argument attributes can be verbs, nouns, sometimes adjectives (such as ˜rg/wedding), and time expressions. The hybrid system is compared with the standalone rule-based method that is well known for event extraction. The study shows that the hybrid system outperforms the standalone rule-based method.

Keywords: Event extraction, under-resourced language, Machine learning algorithms, Nominal events.

1. Introduction

Amharic is a Semitic language, related to Hebrew, Arabic, and Syriac. With around 27 million speakers (Mulugeta and Gasser, 2012), primarily in Ethiopia, it is the second most spoken Semitic language after Arabic. It is currently the official language of government in Ethiopia, and has been since the 13th century. In addition, it is the medium of instruction in primary and secondary schools as well as the source language for a large body of historical texts. As a result, most documents in the country are produced in Amharic, and a large and growing number of Amharic documents are available electronically and online.

The predominant problem of underrepresented languages is the lack of resources (Sohail and Elahi, 2018). Recently, a few Amharic textual resources have become available online for everyday use. However, researchers and other interested people in linguistics and computing still face difficulties because Amharic presents sophisticated language-specific issues. Existing information extraction systems developed for Hebrew, Arabic, or other languages do not capture the linguistic structure and morphological richness of Amharic, in which events are predominantly expressed through verbs and nouns. Therefore, these systems cannot be used directly for Amharic texts. For example, consider the following sentence: "˜~ Œs¼’m 1965 ‚tÓÍÑ ¤wµ’t ¶g› ¶¤r." / "Ethiopia was in turmoil on Monday, September 1965". In this sentence, "¤wµ’t" and "¶g›" refer to an event, whereas the phrase "˜~ Œs¼’m 1965" is a time argument that indicates when the event happened. The word "‚tÓÍÑ" refers to the named entity or participant of the event.

Because extracting events from unstructured Amharic text is highly significant for high-level natural language processing (NLP) tasks, we are interested in tackling this problem. In this study we present a comprehensive technique for extracting events from unstructured Amharic text.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 explains the methodology; it motivates and elaborates the event extraction models and algorithms used in the study. Section 4 describes the model evaluation. Section 5 presents the experimental results, with a discussion and comparison of the results of the proposed models. Section 6 concludes the study and outlines future work.

2. Related Work

Recently, event extraction has gained popularity due to its wide applicability in various NLP applications. Most event extraction systems support English and other European language texts from different domains, using a variety of techniques. Nowadays, Semitic languages are also a topic of interest for researchers. Event extraction for Amharic has not been done yet; this study is therefore the first in this particular information extraction (IE) application. Because of differences in language structure, the existing techniques and tools applied to other languages cannot be used directly for this task.

There is some ongoing work on Amharic NLP tasks with promising results, including part-of-speech tagging, morphological analysis, named entity recognition, base phrase chunking, and text classification (Adafre, 2005; Ibrahim and Assabie, 2014; Sikdar and Gambäck, 2018; lasker et al., 2007). Various techniques have been employed for each task to enhance accuracy and handle linguistic exceptions. However, there are no ready-made components or well
organized datasets. Beyond these limitations, there has been no research on event extraction from unstructured Amharic text, owing to difficulties with the syntactic and semantic status of the class of functional verbs. Another challenge is identifying event arguments. In our case, temporal event arguments are considered, which is challenging in Amharic texts. Temporal expressions in Amharic texts are represented in various forms, such as sequences of words, Arabic numerals, and Ge'ez script numerals. Extra normalization and a syntactic analysis scheme are therefore needed to handle temporal arguments.

Semitic languages like Arabic, Hebrew, and Amharic have much more complex morphology than English. This morphological variation limits research progress in natural language processing in general. However, there are studies on other Semitic languages. For example, (Al-Smadi and Qawasmeh, 2016) studied automatic event extraction for Arabic using a knowledge-driven approach that concentrates on tagging event trigger instances and related entities. One of their main contributions is linking events with entity mentions. In our case, however, we mainly concentrate on extracting events and their arguments by combining handcrafted rules and machine learning classifiers.

Hindi is another under-resourced language, an Indo-European language that shares many words with Arabic. (Ramrakhiyani and Majumder, 2015) focused solely on temporal expression recognition in Hindi using interactive handcrafted rules. They pursue two basic goals: identifying temporal expressions in plain text and classifying the identified temporal expressions. However, extracting events along with their corresponding arguments is more useful, since it eases the chronological ordering of events. In addition, it can be extended to event-argument relationship extraction tasks.

(Smadi and Qawasmeh, 2018) proposed a supervised machine learning approach for extracting events from Arabic tweets. The study focuses on four main tasks: event trigger extraction, event time expression extraction, event type identification, and temporal resolution for ontology population. The reported scores are: T1, event trigger extraction, F-1 = 92.6; T2, event time expression extraction, F-1 = 92.8; and T3, event type identification, accuracy = 80.1. They claim that the result on the third task is relatively better than previous work using similar techniques such as document-term matrix or bag-of-words.

(Arnulphy et al., 2015) also proposed a supervised machine learning approach, but to detect Time Markup Language (TimeML) events in French and English. The study suggests combining different supervised machine learning algorithms, such as conditional random fields, decision trees, and k-nearest neighbors, together with language models.

(Al-Smadi and Qawasmeh, 2016) proposed a knowledge-based approach for event extraction from Arabic tweets. Their study covers three subtasks: event trigger extraction, event time extraction, and event type identification. The event expression includes important event arguments: event agent, event location, event trigger, event target, event product, and event time. The dataset used in their study was collected through the Twitter streaming API and preprocessed with the Java-based AraNLP package. Moreover, after event extraction, visualization services such as a calendar and a timeline are supplied with the help of ontological knowledge bases. Their experimental results show an accuracy of 75.9 for T1 (event trigger extraction), 87.5 for T2 (event time extraction), and 97.7 for T3 (event type identification). Their study claims that applying this kind of domain-dependent approach to extract events from tweets achieves significant results.

In general, there has been a lot of work on event extraction, such as (Arnulphy et al., 2015; Tourille et al., 2017), in European languages, predominantly English, but much less research in other languages. An approach or technique used to extract events in one language might be used in other languages as well if they have a similar grammar and character set. However, if languages have a very different grammar or a very different written representation, it will be difficult to reuse related approaches or techniques to extract events.

There has been research on part-of-speech tagging of Amharic text (Adafre, 2005) and on Amharic morphology (Mulugeta and Gasser, 2012), which is helpful for event detection but not directly related to the actual event extraction task from Amharic text. For this particular task, state-of-the-art event detection systems typically use robust machine learning techniques; an example of such a system is (Arnulphy et al., 2015). Because of the lack of sufficient labeled training data for Amharic, we bootstrap an event extractor using a rule-based algorithm.

3. Methodology

According to (Frederik Hogenboom and Kaymak, 2016), event extraction techniques are evaluated on a set of qualitative dimensions: the amount of required data, knowledge, and expertise, the interpretability of the results, and the required development and execution times. In this study, supervised machine learning techniques, handcrafted rules, and hybrid techniques are employed to detect and extract events and their arguments from unstructured text. Our focus is on extracting events and event arguments from unstructured Amharic text. Event arguments include the identification of event trigger words, where nominal events in unstructured Amharic text become ambiguous. Such events can be arguments of other events, and they are often hard to identify.

3.1. Dataset preparation

Unlike some other languages, Amharic does not have any standardized, annotated, publicly available corpora like Treebank (https://catalog.ldc.upenn.edu/LDC99T42) and PropBank (http://www.nltk.org/howto/propbank.html) for English. The news domain is a preferable data source because it is publicly available and contains a rich source of information that helps
for any NLP applications such as entity extraction, event and temporal information extraction, and co-reference resolution. In this study, we build our own dataset by scraping top local websites, namely Zehabesha (http://www.zehabesha.com/amharic/), Satenaw (https://www.satenaw.com/amharic/), and Ezega (https://www.ezega.com/News/am/), and one international website, BBC Amharic (https://www.bbc.com/amharic), all of which contain relevant unstructured Amharic text. The Python Beautiful Soup library (https://www.crummy.com/software/BeautifulSoup/bs4/doc/) was used for scraping the sites. The scraped texts come from all domains, such as economy, politics, technology, and sport. Simple regular expressions were used to retrieve only relevant text content. A total of 659,848 words were extracted.

Along with our own dataset, we used the Amharic corpus prepared by the Ethiopian Languages Research Center of Addis Ababa University in a project called the annotation of Amharic news documents (Demeke and Getachew, 2006). The project manually tagged each Amharic word in its context with the most appropriate part-of-speech tag. The corpus contains 210,000 words collected from 1,065 Amharic news documents of Walta Information Center (Demeke and Getachew, 2006). Walta Information Center is a private news and information service provider located in Addis Ababa, Ethiopia.

3.2. Data preprocessing

In this step, the data is converted to the format required by the respective information extraction process. The scraped texts contain a lot of junk, such as markup tags and other special characters. The first step in our study is therefore raw text preprocessing, which comprises cleaning unwanted junk, sentence splitting, tokenization, word stemming, character normalization, stop word removal, and part-of-speech (POS) tagging. Amharic is a morphologically rich language that possesses complicated syntactic features. This makes it cumbersome to analyze the morphological features of representative tokens during preprocessing. The sentence splitter splits on Amharic sentence delimiters ( ~ ; ? !).

Amharic has different characters with the same meaning and pronunciation. These characters should be treated as equivalent because substituting one for another does not change the meaning. For example, in each of the groups (€,K,ƒ), (˜, P), (a, A, €) and (Ð, Ø), the characters have the same meaning (Gasser, 2011). We therefore develop a character normalizer that maps such characters to a single canonical form. This step improves the performance of our system. Another preprocessing task is stop word removal. Like other languages, Amharic has its own list of stop words, such as conjunctions, articles, and prepositions. We adopted the stop word list used in (Tsedalu, 2010). In addition, we built our own stop word list with the help of linguistic experts. In total, 235 stop words were identified.

Another important preprocessing task is analyzing Amharic verb morphology to identify the lemma of words and their derivation. The lemma of a word is a crucial feature for the classifier. We applied HornMorpho (https://www.cs.indiana.edu/~gasser/HLTD11/), a system for processing the morphology of Amharic. The system supports the Ethiopian languages Amharic, Afaan Oromo, and Tigrinya. However, it misses some unique and compound words. Thus, we developed our own dictionary of exceptions (gazetteer) to handle exceptional keywords. Finding a pattern to extract only the lemma from the HornMorpho output has its own difficulties, because sometimes the corresponding word does not carry full information; in that case HornMorpho omits the subject, object, grammar, or word class of the word. For Amharic, HornMorpho was evaluated in (Gasser, 2011) on 200 randomly selected verbs and nouns/adjectives. The output was compared with manually identified Amharic verbs and nouns; the reported accuracy is 99% for Amharic verbs and 95.5% for Amharic nouns. Despite these limitations, we use this tool in our study because of the lack of other ready-made NLP components for Amharic. The Jython library (https://www.jython.org/) was used to integrate the Python-based morphological analyzer for Amharic and obtain the morphological features of words.

Besides analyzing verb morphology, annotating the exact word class of each instance is also a required preprocessing task in this study. To do so, we use the publicly available, language-independent part-of-speech tagger TreeTagger (https://reckart.github.io/tt4j/). TreeTagger is a tool for annotating text with part-of-speech and lemma information. It has been successfully used to tag German, English, French, Italian, Danish, Swedish, Norwegian, Dutch, Spanish, Bulgarian, and Coptic texts. It is adaptable to other languages as well if a lexicon and a manually tagged training corpus are available (Schmid, 1994). It consists of two programs: a training program, which creates a parameter file from a full-form lexicon and the lexicon generator along with a hand-tagged corpus, and a tagger program, which reads the parameter file and annotates the text with part-of-speech and lemma information. To prepare a parameter file for TreeTagger, we used a manually tagged Amharic corpus of 217,000 words with 9 distinct word classes and corresponding lemmas. We evaluated TreeTagger on 92,456 randomly selected untagged tokens. The output of TreeTagger reached 99.9% accuracy compared with manually tagged Amharic words.

The other crucial step in our preprocessing module is normalizing Amharic temporal arguments. Date and time expressions in Amharic are represented in various ways, such as with Arabic numerals, Ge'ez numerals, and alphanumeric characters. For example, the following sentences show the different date and time representations:

(€¨l ¤1995 A.m °Â†Ô ~ ) using Arabic characters
(€¨l ¤] Ȱ} Œµ Ȱ¹ €mst A.m °Â†Ô ~ ) using alphanumeric characters
(€¨l ¤ |:C9CB5| A.m °Â†Ô ~ ) using Ge'ez characters
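The character normalization, sentence splitting, and stop word removal steps of Section 3.2 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the normalizer built in the study: the homophone series used here (the ha, se, a, and tse families of the Ethiopic Unicode block), the choice of canonical form, the use of the Ethiopic full stop U+1362 as a sentence delimiter, and all function names are assumptions.

```python
import re

# Assumed homophone groups: (start of variant series, start of canonical series).
# Each Ethiopic consonant series stores its seven vowel orders at consecutive code points.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # HHA series -> HA series
    (0x1280, 0x1200),  # XA series -> HA series
    (0x1220, 0x1230),  # SZA series -> SA series
    (0x12D0, 0x12A0),  # PHARYNGEAL A series -> GLOTTAL A series
    (0x1340, 0x1338),  # TZA series -> TSA series
]

# Translation table that maps every variant character to its canonical twin.
NORMALIZE = str.maketrans({
    chr(src + order): chr(dst + order)
    for src, dst in HOMOPHONE_SERIES
    for order in range(7)
})

# Assumed sentence delimiters: Ethiopic full stop (U+1362), '?' and '!'.
SENTENCE_END = re.compile("[\u1362?!]")


def normalize_chars(text: str) -> str:
    """Replace homophone characters with one canonical form."""
    return text.translate(NORMALIZE)


def split_sentences(text: str) -> list:
    """Split text on sentence delimiters and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def preprocess(text: str, stop_words: set) -> list:
    """Normalize, split into sentences, tokenize on whitespace, drop stop words."""
    sentences = split_sentences(normalize_chars(text))
    return [[tok for tok in s.split() if tok not in stop_words] for s in sentences]
```

A translation table keeps normalization to a single pass over the text; in practice the stop word set would be the 235-word list described above rather than a hand-written set.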

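For date expressions written with Ge'ez numerals, one possible conversion to Arabic digits is sketched below. It assumes the Unicode Ethiopic code points U+1369 to U+1371 for the digits one to nine, U+1372 to U+137A for the tens, U+137B for one hundred, and U+137C for ten thousand, and it reads the hundred and ten-thousand signs as multipliers of the digits before them. The function names are hypothetical and the sketch is not the conversion scheme used in the study.

```python
import re

ONES = {chr(0x1369 + i): i + 1 for i in range(9)}          # U+1369..U+1371 -> 1..9
TENS = {chr(0x1372 + i): (i + 1) * 10 for i in range(9)}   # U+1372..U+137A -> 10..90
HUNDRED = chr(0x137B)
TEN_THOUSAND = chr(0x137C)
GEEZ_RUN = re.compile("[\u1369-\u137c]+")


def geez_to_int(numeral: str) -> int:
    """Convert a string of Ge'ez numeral characters to an integer."""
    total = 0   # value of completed ten-thousand groups
    group = 0   # value accumulated since the last ten-thousand sign
    pair = 0    # tens + ones read since the last multiplier sign
    for ch in numeral:
        if ch in ONES:
            pair += ONES[ch]
        elif ch in TENS:
            pair += TENS[ch]
        elif ch == HUNDRED:
            group += (pair or 1) * 100   # a bare hundred sign means 1 x 100
            pair = 0
        elif ch == TEN_THOUSAND:
            group += pair
            if group == 0 and total == 0:
                group = 1                # a bare ten-thousand sign means 1 x 10000
            total = (total + group) * 10000
            group = pair = 0
        else:
            raise ValueError("not a Ge'ez numeral character: %r" % ch)
    return total + group + pair


def normalize_geez_numerals(text: str) -> str:
    """Replace every run of Ge'ez numeral characters with Arabic digits."""
    return GEEZ_RUN.sub(lambda m: str(geez_to_int(m.group())), text)
```

For instance, the five-character numeral ten, nine, hundred, ninety, five is read as 19 × 100 + 95 = 1995. Irregular or malformed numerals are not validated by this sketch.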
The date expressions in the three example sentences above are logically equivalent but differ in their syntactic representation. To handle the temporal arguments of an event, we apply a normalization and conversion scheme that converts temporal representations into one form. Converting Ge'ez numerals to the uniform Arabic number system is not as straightforward as other normalization tasks because of the irregularities of the Unicode values for Ge'ez numerals. Some of the Ge'ez numerals are presented in Figure 1.

Figure 1: Ge'ez numerals

3.3. Event detection using supervised machine learning

In this study, a supervised machine learning approach is used to detect events in text. Supervised classifiers predict new events based on labeled training sets: they learn event properties and characteristics from the training data and generalize to unseen situations.

The datasets in this study are unstructured text and documents. Therefore, the unstructured text sequences are converted into a structured feature space using mathematical modeling. For classification, feature extraction can be seen as a search among all possible transformations of the feature set for the best one, which preserves class separability as much as possible in the space with the lowest possible dimension. In this study, the features carry information from the text that is associated with a given event. These features increase the confidence level of predicting a token as an event. Thus, the feature extractor component is responsible for extracting candidate attributes for the classifier. The features used in this study are the following:

• Words of the instance

• POS of the corresponding word

• Lemma of the corresponding word

• List of lexicons for exceptional events

A binary classifier is used to detect events in Amharic text. It classifies the text as on-event or off-event. The on-event class covers instances that contain event trigger keywords, whereas the off-event class covers instances that do not contain event trigger keywords. Among machine learning algorithms, Naive Bayes, decision tree, and SVM were chosen because they are widely used in text classification tasks (Pranckevicius and Marcinkevicius, 2017).

The Naive Bayes classifier is a linear classifier that is known for being simple and very efficient for text classification tasks. Its probabilistic model is based on Bayes' theorem, and it works on the assumption that the features in a dataset are independent of each other.

LIBSVM is a library for Support Vector Machines (SVM) that has gained wide popularity in machine learning and many other areas. An SVM finds an optimal solution that maximizes the distance between the hyperplane and the difficult points close to the decision boundary. As (Chang and Lin, 2011) stated, if there are no points near the decision surface, then there are no very uncertain classification decisions.

The other classifier used in this study is the decision tree, a tree-based classifier for instances represented as feature vectors. There is one branch for each value of a feature, and the leaves specify the category. It can represent an arbitrary classification function over discrete feature vectors. For the decision tree, the J48 algorithm is used. J48 is an algorithm used to generate a decision tree. The decision trees generated by J48 can be used for classification, and for this reason J48 is often referred to as a statistical classifier.

These algorithms are used to train models with the labeled dataset as input. The trained models then label the instances of a test set as on-event or off-event. Based on the feature selection recommendation, the POS tag performed best among the syntactic features for detecting events.

3.4. Event Extraction using Rule Based Approach

Rule-based learning is an information extraction method that uses extraction patterns to retrieve information from a text document. In this study, a standalone rule-based approach is proposed to enhance the accuracy of the event extraction system. Amharic has subject-object-verb agreement and other morphological features that make rule construction cumbersome. As (Yunita Sari and Zamin, 2010) mentioned, the construction of an extraction pattern is based on syntactic or semantic constraints and delimiters, or on a combination of both syntactic and semantic constraints. Events predominantly appear as nominals and verbs (Ramesh and Kumar, 2016). Nominal events are ambiguous because they can appear as deverbal or non-deverbal nouns. Thus, we need the morphological features of the instances to disambiguate nominal events. To do so, a morphological analyzer is employed to obtain the morphological features of the events mentioned in the instances. For example: ( ΂tÓÍÑ hz©m ¼Êh ¤ö‰ àݑ ŒÈ¹Ýt €yàlÛm ~ ). In this sentence, the underlined word (àݑ) is derived from the verb (fÕm). It looks like an adjective, but it is a deverbal entity, which we call a nominal event. The rules are developed based on the syntactic features of words with the help of carefully constructed gazetteers. The POS tag and lemma of the word are used as the basis for the handcrafted linguistic rules. The syntactic features of words are obtained with TreeTagger and HornMorpho. The pattern extractor is developed based on these syntactic features. Simple rules are applied to extract the detected events.
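To make the idea of POS- and lemma-based rules concrete, the following sketch applies three simple rules to already tagged tokens: verb-group words become event candidates, nouns found in a gazetteer of non-deverbal event nouns become nominal events, and words whose lemma is a temporal keyword become time arguments. Everything in it is an assumption made for illustration: the tag names, the romanized placeholder words, the tiny gazetteer and keyword sets, and the function name. It does not reproduce the actual rule set of the study.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str    # surface form
    pos: str     # part-of-speech tag from the tagger
    lemma: str   # lemma from the morphological analyzer


VERB_TAGS = {"V", "VP", "VN"}            # assumed verb-group tags
NOMINAL_EVENT_GAZETTEER = {"serg"}       # hypothetical non-deverbal event nouns
TEMPORAL_KEYWORDS = {"nege"}             # hypothetical temporal keywords


def extract_events(tokens):
    """Apply the three illustrative rules and collect events and time arguments."""
    events, times = [], []
    for tok in tokens:
        if tok.lemma in TEMPORAL_KEYWORDS:
            times.append(tok.word)                      # temporal argument
        elif tok.pos in VERB_TAGS:
            events.append(tok.word)                     # verbal event candidate
        elif tok.pos == "N" and tok.lemma in NOMINAL_EVENT_GAZETTEER:
            events.append(tok.word)                     # nominal event from gazetteer
    return {"events": events, "time_arguments": times}


# Hypothetical romanized tokens standing in for a tagged Amharic sentence.
sample = [
    Token("abebe", "N", "abebe"),
    Token("serg", "N", "serg"),
    Token("nege", "ADV", "nege"),
    Token("yideregal", "VP", "aderege"),
]
result = extract_events(sample)
```

The first noun is ignored because it is neither a verb nor in the gazetteer, which mirrors the observation that not every noun is an event.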

As an example of the POS tagger output on which the rules operate, consider: (€¤¤) N (t‰nt) ADV (ÎÚËw) VN (¤–) N (‘°) VP ( ~ ) . In this example, the handcrafted rules are applied on the basis of the POS tagger results. The formal structures are not always regular enough to develop stable rules. The morphological analyzer is therefore very helpful, because deverbal events are ambiguous.

Some of the rules applied in this study for the hybrid system include the following:

1. Automatically label preprocessed texts with their corresponding word classes (parts of speech).

2. Get the morphological features of words, including word, subject, root, lemma, object, grammar, and preposition.

3. Events are usually expressed using verbs and nouns. Check the neighboring words using bigram language models, because not all nouns are events: nouns that come at the beginning of a sentence are sometimes the subjects or participants of the event rather than the event itself.

4. Identify the nominal events. The morphological analyzer plays the main role here by indicating the citation form of the respective nouns: nominal words can be deverbal or non-deverbal nouns, but deverbal nouns have a verb as their citation form.

5. Select as primary candidates the words whose part of speech is a verb or verb-group word class, together with their infinitive forms.

6. Check non-deverbal nouns (which usually act as events) against carefully built gazetteers (lists of non-deverbal nouns). Because of our limited dictionary, a ternary search tree algorithm is applied to enhance efficiency.

7. Identify words that contain temporal keywords. The temporal indicator keywords come from a carefully built list of commonly used temporal expressions in Amharic. In addition, regular expressions are constructed to handle regular date-time expressions, and bigram language models are applied to find temporal arguments.
For example: ΀¤¤ ˜rg ¶Ú ¶w ~ / "Abebe's wedding is tomorrow."
In this sentence, the word ˜rg is a non-deverbal noun that is extracted as an event, and it is actually an event, whereas the word ¶Ú is extracted as the temporal argument of the major event ˜rg.

3.5. Event Extraction using hybrid Approach

Unlike in knowledge-driven systems, the amount of required data in hybrid event extraction systems increases because of the use of supervised machine learning techniques, yet it typically remains smaller than with purely data-driven techniques. The complexity, and hence the required expertise, is generally high compared to purely knowledge-driven techniques because multiple techniques are combined. Moreover, the interpretability of a system benefits to some extent from the use of semantics, as in knowledge-based techniques (Baradaran and Mineai-Bidgoli, 2015).

The other technique employed in this study combines supervised machine learning and rule-based techniques to extract events from unstructured Amharic text. The machine learning approach focuses mainly on coverage (recall) rather than precision, while the handcrafted rules aim at the highest achievable precision given the incorporated rules. In our case, the machine learning classifiers miss nominal events more often than verbal events. Therefore, we incorporate some rules to recover the events missed by the machine learning classifiers. Deverbal nouns exhibit both nominal and verbal syntactic representations. They serve as concrete nouns, but also participate in verbal constructions where they require arguments and accept aspectual modification. Nominal events appear as deverbal or non-deverbal entities: deverbal entities are derived from verbs, whereas non-deverbal entities are not. For example, (àՑ) is a deverbal entity, an event derived from the verb (fÕm). An event is a situation that lasts for a moment. By this definition, a nominal can be an event; for example, (˜rg)/wedding is a non-deverbal nominal event. Another example is (΀¤¤ ˜rg ƒmŠ 16, 2010 A.m nw.). Knowing the morphological variation of words and having a list of common non-deverbal nominals from the gazetteers (a list of exceptional non-deverbal events) helps to resolve event ambiguity. We obtain deverbal events from the morphological analyzer and non-deverbal events from the gazetteers. Applying this disambiguation scheme improves the accuracy of our system compared with the standalone rule-based approach.

4. Model Evaluation

Among the standard information extraction evaluation metrics, precision, recall, and F-measure are used to evaluate the performance of the models. In this study, a 10-fold cross-validation technique is used to split the dataset. In this case, after randomly shuffling the dataset, 80% of the data is used for training and 20% for testing.

5. Experimental Results

In this study, a total of five experiments are conducted. Three experiments use supervised learning algorithms (Naive Bayes, Decision Tree, and SVM) to detect events in unstructured Amharic text. One experiment uses the rule-based approach to extract events from unstructured Amharic text. The last experiment combines the supervised learning and rule-based approaches.

The first three experiments train a model with each of the three selected supervised learning algorithms. All features are used in order to see the effect of each attribute on event detection; each algorithm is run on the full feature set.

As Table 1 shows, among the three algorithms the Naïve Bayes (NB) classifier outperformed
the other classifiers in detecting events. It achieved an F-score of 0.915 on the weighted average, 0.831 on the On-Event class, and 0.944 on the Off-Event class. This experimental result confirms the advantage of the Naïve Bayes classifier for the event detection task. We obtain encouraging results using a machine learning classifier for event detection; the remaining problem lies in the ambiguity of deverbal entities.

Table 1: Experimental results for machine learning algorithms to detect events

Algorithms  Classes        Precision  Recall  F-measure
NB          On-Event       0.866      0.798   0.831
NB          Off-Event      0.932      0.957   0.944
NB          Weighted Ave.  0.915      0.916   0.915
LIBSVM      On-Event       0.895      0.395   0.548
LIBSVM      Off-Event      0.825      0.984   0.897
LIBSVM      Weighted Ave.  0.843      0.833   0.808
J48         On-Event       0.891      0.698   0.783
J48         Off-Event      0.903      0.971   0.935
J48         Weighted Ave.  0.9        0.9     0.896

From the machine learning event detection system, it is observed that, due to linguistic features, the classifier gives verb-triggered events the same weight as the non-event class. This motivated the development of handcrafted rules to remove the ambiguities. To allow a clear comparison with the hybrid event extraction system, the same dataset is used for the rule-based technique.

The next experiment is on the rule-based approach. As Table 2 shows, the F-score of this approach is 0.959. This shows that it outperforms the three supervised machine learning models.

The last experiment conducted in this study is on the hybrid event extraction technique. The performance of this method relies on using the rule-based and supervised machine learning methods in conjunction. The machine learning classifiers label the instances as on-event or off-event by assigning different weights. An instance that is assigned a high probability value by the classifier is categorized under the on-event class, that is, as an actual event. On the other hand, an instance that is assigned a low probabil-

technique. As the table shows, the combination of rule-based and supervised machine learning classifiers brings a significant result in extracting events from unstructured Amharic text.

Table 2: Overall event extraction evaluation and comparison of experimental results

Techniques           Precision  Recall  F-measure
Rule based Approach  0.976      0.952   0.959
Hybrid Approach      0.979      0.962   0.971

6. Conclusion and future work

In this study we have presented a system that extracts events from unstructured Amharic text. The system is built by combining supervised learning with the standalone rule-based technique. Supervised machine learning is used to detect events, and the standalone rule-based technique is used to extract the events from the unstructured Amharic text. For the supervised machine learning, three algorithms (Naïve Bayes, support vector machine, and decision tree) were considered, of which Naïve Bayes performed best at detecting events in unstructured Amharic texts.

The standalone rule-based approach was evaluated independently for extracting events from unstructured Amharic text. However, the proposed hybrid system, which uses the Naïve Bayes algorithm to detect events, outperformed it.

In the future, we need to address other relevant event extraction tasks, such as building a larger event and temporally annotated corpus, employing powerful deep learning techniques to extract relations between events and times, and extracting relations between events and document creation time.

7. Bibliographical References

Adafre, S. F. (2005). Part of speech tagging for amharic using conditional random fields. In Proceedings of the ACL workshop on computational approaches to semitic languages.
Al-Smadi, M. and Qawasmeh, O. (2016). Knowledge-based approach for event extraction from arabic tweets.
ity value than the on-event class instance has been mostly International Journal of Advanced Computer Science
non-event or categorized as off-event class. Thus positive and Applications, 7(6).
predicated values accepted as it’s i.e. instances categorized Arnulphy, B., Claveau, V., Tannier, X., and Vilnat, A.
as on-event with highest weighted value. Because, it is pre- (2015). Supervised Machine Learning Techniques to
dicted exactly as an event, while instances getting equal Detect TimeML Events in French and English. In Chris
weight by the classifier in both class are going to be the tar- Beimann, et al., editors, 20thInternational Conference
get instances for the heuristics. Equal weighted instances on Applications of Natural Language to Information Sys-
are considered as ambiguous. Using the help of syntactic tems, NLDB 2015, volume 9103 of Proceedings of the
features, ambiguous instances have been handled. As a re- NLDB conference, Passau, Germany, June. Springer.
sult, the number of on-event instances correctly extracted Baradaran, R. and Mineai-Bidgoli, B. (2015). Event ex-
increases when heuristics has applied. In order to get the traction from classical arabic texts. International Arab
false negative and the false positive values we have used a Journal of Information Technology", 12(5).
manual scanning of the result to be accurate. Chang, C.-C. and Lin, C.-J. (2011). Libsvm: A library for
Table 2 shows the hybrid technique experimental result as support vector machines. ACM Trans. Intell. Syst. Tech-
well and compare with the experimental result of rule based nol., 2(3):27:1–27:27, May.
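As an illustration of how the per-class figures in Table 1 relate to each other, the following minimal sketch recomputes the F-measure as the harmonic mean of precision and recall and compares it with the reported value. It assumes the standard F1 definition; the function names and the tolerance of 0.001 (to absorb rounding in the published three-decimal values) are illustrative choices, not part of the original study. The weighted-average rows are left out because the class sizes needed to recompute them are not given here.

```python
def f_measure(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class rows copied from Table 1:
# (algorithm, class, precision, recall, reported F-measure)
TABLE_1_ROWS = [
    ("NB", "On-Event", 0.866, 0.798, 0.831),
    ("NB", "Off-Event", 0.932, 0.957, 0.944),
    ("LIBSVM", "On-Event", 0.895, 0.395, 0.548),
    ("LIBSVM", "Off-Event", 0.825, 0.984, 0.897),
    ("J48", "On-Event", 0.891, 0.698, 0.783),
    ("J48", "Off-Event", 0.903, 0.971, 0.935),
]


def check_rows(rows, tolerance=0.001):
    """Recompute F for each row and flag whether it matches the reported value."""
    results = []
    for algorithm, label, precision, recall, reported in rows:
        computed = f_measure(precision, recall)
        matches = abs(computed - reported) <= tolerance
        results.append((algorithm, label, round(computed, 3), reported, matches))
    return results


for row in check_rows(TABLE_1_ROWS):
    print(row)
```

The low On-Event F-measure of LIBSVM (0.548) follows directly from its low recall (0.395) despite its high precision, which is the kind of imbalance the harmonic mean is designed to penalize.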

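The hybrid decision procedure described in the experiments (accept confident classifier predictions, and pass instances that receive roughly equal weight in both classes to the rule-based heuristics) can be sketched as follows. This is only a sketch under stated assumptions: the `Instance` fields, the `margin` threshold used to decide that two weights are "equal", and `toy_heuristic` are all hypothetical stand-ins, since the actual syntactic rules and thresholds are not specified in this section.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Instance:
    token: str          # placeholder identifier for the candidate word
    p_on_event: float   # classifier probability for the on-event class
    pos_tag: str        # hypothetical syntactic feature used by the toy rule


def hybrid_label(
    inst: Instance,
    heuristic: Callable[[Instance], str],
    margin: float = 0.05,  # assumed width of the "equal weight" band
) -> str:
    """Label an instance, deferring to the heuristic when the classifier is ambiguous."""
    p_off_event = 1.0 - inst.p_on_event
    if abs(inst.p_on_event - p_off_event) <= margin:
        # Both classes receive (nearly) equal weight: treat as ambiguous.
        return heuristic(inst)
    return "on-event" if inst.p_on_event > p_off_event else "off-event"


def toy_heuristic(inst: Instance) -> str:
    """Hypothetical rule: treat ambiguous verbs as events, everything else as non-events."""
    return "on-event" if inst.pos_tag == "V" else "off-event"


examples = [
    Instance("w1", 0.90, "N"),  # confident on-event, classifier decides
    Instance("w2", 0.10, "V"),  # confident off-event, classifier decides
    Instance("w3", 0.50, "V"),  # ambiguous, heuristic decides
    Instance("w4", 0.51, "N"),  # ambiguous, heuristic decides
]
for inst in examples:
    print(inst.token, hybrid_label(inst, toy_heuristic))
```

Passing the heuristic in as a callable keeps the classifier and the handcrafted rules independent, so either part can be replaced without touching the other.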
Demeke, G. and Getachew, M. (2006). Manual annotation pattern extractor and named entity recognition: A hybrid
of amharic news items with part-of-speech tags and its approach. IEEE.
challenges. 01.
Frederik Hogenboom, F. F. and Kaymak, U. (2016). A sur-
vey of event extraction methods from text for decision
support systems. Elsevier.
Gasser, M. (2011). Hornmorpho: a system for morpho-
logical processing of amharic, oromo, and tigrinya. In
Conference on Human Language Technology for Devel-
opment.
Ibrahim, A. and Assabie, Y. (2014). Amharic sentence
parsing using base phrase chunking. In COLING 2014.
lasker, L., Argaw, A. A., and Gamback, B. (2007). Apply-
ing machine learning to amharic text classification. In
Proceedings of the 5th World Congress of African Lin-
guistics.
Mulugeta, W. and Gasser, M. (2012). Learning morpho-
logical rules for amharic verbs using inductive logic pro-
gramming.
Pranckevicius, T. and Marcinkevicius, V. (2017). Compar-
ison of naïve bayes, random forest, decision tree, sup-
port vector machines, and logistic regression classifiers
for text reviews classification. In Baltic J. Modern Com-
puting.
Ramesh, D. and Kumar, S. S. (2016). Event extraction
from natural language text. International Journal of En-
gineering Sciences and Research Technology (IJESRT),
5(7).
Ramrakhiyani, N. and Majumder. (2015). Approaches to
temporal expression recognition in hindi. ACM Transac-
tions on Asian and Low-Resource Language Information
Processing (TALLIP), 14(1).
Schmid, H. (1994). Probabilistic part-of-speech tagging
using decision trees. In International Conference on
New Methods in Language Proceeding.
Sikdar, U. and Gambäck, B., (2018). Named Entity Recog-
nition for Amharic Using Stack-Based Deep Learning:
18th International Conference, CICLing 2017, Budapest,
Hungary, April 17–23, 2017, Revised Selected Papers,
Part I, pages 276–287. 01.
Smadi, M. and Qawasmeh, O. (2018). A supervised ma-
chine learning approach for events extraction out of ara-
bic tweets. In Fifth International Conference on Social
Networks Analysis, Management and Security, SNAMS
2018, Valencia, Spain, October 15-18, 2018, pages 114–
119.
Sohail, O. and Elahi, I. (2018). Text classification in an
under-resourced language via lexical normalization and
feature pooling. In Twenty-Second Pacific Asia Confer-
ence on Information Systems.
Tourille, J., Ferret, O., Tannier, X., and Neveol, A. (2017).
Temporal information extraction from clinical text. In
Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguis-
tics, volume 2 of EACL, page 739–745.
Tsedalu, G. (2010). Information extraction model from
amharic news texts. Master’s thesis, Addis Ababa Uni-
versity.
Yunita Sari, M. F. H. and Zamin, N. (2010). Rule based

2109
