2020 Lrec-1 258
Abstract
Event extraction is a type of information extraction that extracts specific knowledge about certain incidents from text. Event extraction has been studied for many languages, but not for Amharic, a Semitic language. In this study, we present a system that extracts events from unstructured Amharic text. The system is designed as an integration of supervised machine learning and rule-based approaches, which we call a hybrid system. It uses supervised machine learning to detect events in the text and handcrafted rules to extract them. Event arguments are used for the extraction. Event arguments identify event-triggering words or phrases that clearly express the occurrence of the event. Event argument attributes can be verbs, nouns, sometimes adjectives (such as ˜rg/wedding), and time expressions. The hybrid system is compared with the standalone rule-based method that is well known for event extraction. The study shows that the hybrid system outperforms the standalone rule-based method.
Keywords: event extraction, under-resourced language, machine learning algorithms, nominal events.
organized datasets. Besides these limitations, there has been no research on event extraction from unstructured Amharic text, owing to difficulties in the syntactic and semantic status of the class of functional verbs. Another challenge is identifying event arguments. In our case, temporal event arguments are considered, and these are challenging in Amharic texts because they are represented in various forms, such as sequences of words, Arabic numerals, and Ge'ez script numerals. Handling temporal arguments therefore requires an extra normalization and syntactic analysis scheme.

Semitic languages like Arabic, Hebrew, and Amharic have much more complex morphology than English. This morphological variation limits research progress in natural language processing in general. However, there are studies on other Semitic languages. For example, (Al-Smadi and Qawasmeh, 2016) studied automatic event extraction for Arabic using a knowledge-driven approach that concentrates on tagging event trigger instances and related entities. One of their main contributions is linking each event with its entity mention. In our case, however, we mainly concentrate on extracting events and their arguments with the help of handcrafted rules and machine learning classifiers.

Hindi is another under-resourced Indo-European language that shares many words with Arabic. (Ramrakhiyani and Majumder, 2015) focused solely on temporal expression recognition in Hindi using interactive handcrafted rules. They pursue two basic goals: identifying the temporal expressions in plain text and classifying the identified temporal expressions. However, extracting events along with their corresponding arguments has the added advantage of easing the chronological ordering of events. In addition, it can be extended to event-argument relationship extraction tasks.

(Smadi and Qawasmeh, 2018) proposed a supervised machine learning approach for extracting events from Arabic tweets. The study focuses on four main tasks: event trigger extraction, event time expression extraction, event type identification, and temporal resolution for ontology population. The reported scores are F-1 = 92.6 for T1 (event trigger extraction), F-1 = 92.8 for T2 (event time expression extraction), and accuracy = 80.1 for T3 (event type identification). They claim that the third task performs relatively better than previous work using similar techniques such as a document-term matrix or bag-of-words.

(Arnulphy et al., 2015) also proposed a supervised machine learning approach, but for detecting TimeML (Time Markup Language) events in French and English. The study suggests combining different supervised machine learning algorithms, such as conditional random fields, decision trees, and k-nearest neighbors, together with language models.

(Al-Smadi and Qawasmeh, 2016) proposed a knowledge-based approach for event extraction from Arabic tweets. Their study covers three subtasks: event trigger extraction, event time extraction, and event type identification. The event expression includes the important event arguments: event agent, event location, event trigger, event target, event product, and event time. Their data was collected through the Twitter streaming API and preprocessed with the Java-based AraNLP package. Moreover, after event extraction, visualization services such as a calendar and a timeline are supplied with the help of ontological knowledge bases. Their experimental results show an accuracy of 75.9 for T1 (event trigger extraction), 87.5 for T2 (event time extraction), and 97.7 for T3 (event type identification). Their study claims that applying this kind of domain-dependent approach to extract events from tweets yields significant results.

In general, there has been a lot of work on event extraction in European languages, predominantly English, such as (Arnulphy et al., 2015; Tourille et al., 2017), but much less research in other languages. An approach or technique used to extract events in one language might be usable in other languages as well if they have a similar grammar and character set. However, if languages have very different grammar or a very different written representation, it will be difficult to reuse related approaches or techniques to extract events.

There has been research on part-of-speech tagging of Amharic text (Adafre, 2005) and on Amharic morphology (Mulugeta and Gasser, 2012), which are helpful for event detection but not directly related to the actual task of event extraction from Amharic text. For this task, state-of-the-art event detection systems typically use robust machine learning techniques; an example of such a system is (Arnulphy et al., 2015). Because of the lack of sufficient labeled training data for Amharic, we bootstrap an event extractor using a rule-based algorithm.

3. Methodology
According to (Frederik Hogenboom and Kaymak, 2016), event extraction techniques have been evaluated on a set of qualitative dimensions: the amount of required data, knowledge, and expertise, the interpretability of the results, and the required development and execution times. In this study, supervised machine learning techniques, handcrafted rules, and hybrid techniques are employed to detect and extract events and their arguments from unstructured text. Our focus is on extracting events and event arguments from unstructured Amharic text. Extracting event arguments includes identifying event trigger words, and in unstructured Amharic text nominal events are ambiguous. Such events can be arguments of other events, and they are often hard to identify.

3.1. Dataset preparation
Unlike many other languages, Amharic does not have any standardized, annotated, publicly available corpora like Treebank[1] and PropBank[2] for English. The news domain is the preferred data source because it is publicly available and contains a rich source of information that helps

[1] https://catalog.ldc.upenn.edu/LDC99T42
[2] http://www.nltk.org/howto/propbank.html
for any NLP applications such as entity extraction, event and temporal information extraction, and co-reference resolution. In this study, we build our own dataset by scraping top local websites, namely Zehabesha[3], Satenaw[4], and Ezega[5], and one international website, BBC Amharic[6], all of which contain relevant unstructured Amharic text. The Python Beautiful Soup library[7] is used for scraping the sites. The scraped texts come from all domains, such as economy, politics, technology, and sport. Simple regular expressions are used to retrieve only the relevant text content. A total of 659,848 words were extracted. Along with our own dataset, we use the Amharic corpus prepared by the Ethiopian Languages Research Center of Addis Ababa University in a project called the annotation of Amharic news documents (Demeke and Getachew, 2006). The project manually tagged each Amharic word in its context with the most appropriate part-of-speech tag. The corpus contains 210,000 words collected from 1065 Amharic news documents of Walta Information Center (Demeke and Getachew, 2006). Walta Information Center is a private news and information service provider located in Addis Ababa, Ethiopia.

3.2. Data preprocessing
In this step, the data is converted to the format required for the respective information extraction process. The scraped texts contain a lot of junk such as markup tags and other special characters. The first step in our study is raw text preprocessing, which comprises cleaning unwanted junk, sentence splitting, tokenizing, word stemming, character normalization, stop word removal, and part-of-speech (POS) tagging. Amharic is a morphologically rich language that possesses complicated syntactic features, which makes analyzing the morphological features of representative tokens a cumbersome preprocessing task. The sentence splitter splits text at the Amharic sentence demarcations ( ~ ; ? !).

Amharic has different characters with the same meaning and pronunciation. Those characters should be treated equally because the choice among them does not change the meaning. For example, in (€,K,ƒ), (˜, P), (a, A, €) and (Ð, Ø), the characters within each group have the same meaning (Gasser, 2011). We therefore develop a character normalizer that maps those characters to a single common form. This task improves the performance of our system. Another preprocessing task is stop word removal. Like other languages, Amharic has its own list of stop words such as conjunctions, articles, and prepositions. We adopt the stop word list used in (Tsedalu, 2010) and, in addition, build our own stop word list with the help of linguistic experts. In total, 235 stop words were identified.

Another important preprocessing task is analyzing Amharic verb morphology to identify the lemma of words and their derivation. The lemma of a word is a crucial feature for the classifier. We apply HornMorpho[8], a system for processing the morphology of Amharic. The system covers the Ethiopian languages Amharic, Affan Oromo, and Tigrinya. However, it misses some unique and compound words. Thus, we develop our own exception dictionary (gazetteer) to handle exceptional keywords. Finding a pattern that retrieves only the lemma from the HornMorpho result poses further difficulties because the analysis of a word sometimes does not contain full information; in that case HornMorpho omits the subject, object, grammar, or word class of the word. For Amharic, HornMorpho was evaluated in (Gasser, 2011) on 200 randomly selected verbs and nouns/adjectives. The output was compared with manually identified Amharic verbs and nouns; the reported accuracy is 99% for Amharic verbs and 95.5% for Amharic nouns. Despite these limitations, we use this tool in our study because of the lack of other ready-made NLP components for Amharic. The Jython library[9] is used to integrate the Python-based morphological analyzer and obtain the morphological features of words.

Besides analyzing verb morphology, annotating the exact word class of each instance is also a required preprocessing task in this study. To do so, we use the publicly available, language-independent part-of-speech tagger TreeTagger[10]. TreeTagger is a tool for annotating text with part-of-speech and lemma information. It has been successfully used to tag German, English, French, Italian, Danish, Swedish, Norwegian, Dutch, Spanish, Bulgarian, and Coptic texts. It is adaptable to other languages as well if a lexicon and a manually tagged training corpus are available (Schmid, 1994). It consists of two programs: the training program, which creates a parameter file from a full-form lexicon and the lexicon generator along with a hand-tagged corpus, and the tagger program, which reads the parameter file and annotates the text with part-of-speech and lemma information. To prepare a parameter file for TreeTagger, we used a total of 217,000 manually tagged Amharic words with 9 distinct word classes and corresponding lemmas. We evaluated TreeTagger on 92,456 randomly selected untagged tokens. Its output reaches 99.9% accuracy compared with manually tagged Amharic words.

Another crucial step in our preprocessing module is normalizing Amharic temporal arguments. Date and time expressions in Amharic have various representations, using Arabic numerals, Ge'ez numerals, or alphanumeric characters.
For example, the following sentences show the different date-time representations:
(€¨l ¤1995 A.m °Â†Ô ~ ) using Arabic characters
(€¨l ¤] Ȱ} Œµ Ȱ¹ €mst A.m °Â†Ô ~ ) using alphanumeric characters
(€¨l ¤ |:C9CB5| A.m °Â†Ô ~ ) using geez characters
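As an illustration only, the conversion of Ge'ez numerals to Arabic digits needed to unify date renderings like the ones above could be sketched as follows. This is not the implementation used in the study; the function names `geez_to_int` and `normalize_numerals` are hypothetical, and the sketch assumes the numerals are encoded with the standard Unicode Ethiopic code points (U+1369 to U+137C). Because Ge'ez has separate symbols for ones, tens, hundred, and ten thousand rather than positional digits, the value has to be accumulated instead of mapped character by character.

```python
import re

# Unicode Ethiopic numerals: U+1369..U+1371 are 1..9, U+1372..U+137A are 10..90.
ONES = {chr(0x1369 + i): i + 1 for i in range(9)}
TENS = {chr(0x1372 + i): (i + 1) * 10 for i in range(9)}
HUNDRED = "\u137b"       # Ethiopic number hundred
TEN_THOUSAND = "\u137c"  # Ethiopic number ten thousand


def geez_to_int(numeral: str) -> int:
    """Convert a Ge'ez numeral string to an integer (illustrative sketch)."""
    total = 0    # value already multiplied by ten thousand
    group = 0    # hundreds accumulated in the current group
    pending = 0  # ones and tens still waiting for a multiplier
    for ch in numeral:
        if ch in ONES:
            pending += ONES[ch]
        elif ch in TENS:
            pending += TENS[ch]
        elif ch == HUNDRED:
            # A bare hundred sign counts as 1 x 100.
            group += (pending or 1) * 100
            pending = 0
        elif ch == TEN_THOUSAND:
            block = group + pending
            if block == 0 and total == 0:
                block = 1  # a bare ten-thousand sign counts as 1 x 10000
            total = (total + block) * 10000
            group = pending = 0
        else:
            raise ValueError(f"not a Ge'ez numeral: {ch!r}")
    return total + group + pending


def normalize_numerals(text: str) -> str:
    """Replace every run of Ge'ez numerals in text with Arabic digits."""
    return re.sub(r"[\u1369-\u137c]+",
                  lambda m: str(geez_to_int(m.group())), text)
```

With such a helper, a year written in Ge'ez numerals and the same year written in Arabic digits end up as the same token, so later rules only need to handle one form.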
[3] http://www.zehabesha.com/amharic/
[4] https://www.satenaw.com/amharic/
[5] https://www.ezega.com/News/am/
[6] https://www.bbc.com/amharic
[7] https://www.crummy.com/software/BeautifulSoup/bs4/doc/
[8] https://www.cs.indiana.edu/~gasser/HLTD11/
[9] https://www.jython.org/
[10] https://reckart.github.io/tt4j/
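As a rough illustration of the text-cleaning steps of Section 3.2 (sentence splitting, character normalization, and stop word removal), a minimal sketch is given below. It is not the authors' code. Several things are assumptions made for the example: the homophone groups are taken to be the commonly cited Ethiopic series (ሐ and ኀ mapped to ሀ, ሠ to ሰ, ዐ to አ, ፀ to ጸ), the sentence delimiters are taken to be ።, ፤, ? and !, and the three stop words are placeholders rather than the 235-word list of the study.

```python
import re

# Assumed homophone series (source series start, target series start).
# Each series has seven vowel orders at consecutive Unicode code points.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # ሐ series -> ሀ series
    (0x1280, 0x1200),  # ኀ series -> ሀ series
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x12D0, 0x12A0),  # ዐ series -> አ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]
CHAR_MAP = {src + i: dst + i for src, dst in HOMOPHONE_SERIES for i in range(7)}

# Assumed sentence delimiters: Ethiopic full stop, Ethiopic semicolon, ?, !
SENTENCE_END = re.compile(r"[\u1362\u1364?!]+")

# Placeholder stop words for the example only.
STOP_WORDS = {"እና", "ነው", "ላይ"}


def normalize_chars(text: str) -> str:
    """Map homophone characters to one canonical form."""
    return text.translate(CHAR_MAP)


def split_sentences(text: str) -> list[str]:
    """Split text at sentence delimiters and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def preprocess(text: str) -> list[list[str]]:
    """Return one list of non-stop-word tokens per sentence."""
    sentences = split_sentences(normalize_chars(text))
    return [[tok for tok in s.split() if tok not in STOP_WORDS]
            for s in sentences]
```

Normalizing before tokenizing means that the stop word list and the gazetteers only need to store one spelling of each word.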
Figure 1: Geez numerals

The above sentences have logically the same meaning but different syntactic representations. To handle the temporal arguments of an event, we apply a normalization and conversion scheme that converts the temporal representations into one form. Converting Ge'ez numerals to the uniform Arabic number system is not as straightforward as other normalization tasks because of the irregularities of the Unicode values for Ge'ez numerals. Some of the Ge'ez numerals are presented in Figure 1.

3.3. Event detection using supervised machine learning
In this study, a supervised machine learning approach is employed to detect events in text. Supervised machine learning classifiers predict new events based on labeled training sets: they use event properties and characteristics from the training data and generalize to unseen situations.

The datasets are unstructured text and documents. Therefore, the unstructured text sequences are converted into a structured feature space using mathematical modeling. For classification, feature extraction can be seen as a search among all possible transformations of the feature set for the best one, that is, the one that preserves class separability as much as possible in the space with the lowest possible dimension. In this study, the features carry information from the text associated with a given event. These features increase the confidence of predicting a token as an event. The feature extractor component is thus responsible for extracting candidate attributes for the classifier. The features used in this study are the following:-
• Words of the instance
• POS of the corresponding word
• Lemma of the corresponding word
• List of lexicons for exceptional events

A binary classifier is used to detect events in Amharic text. The classifier labels each instance as on-event or off-event. The on-event class refers to instances that contain event trigger keywords, whereas the off-event class refers to instances that do not. Among machine learning algorithms, Naive Bayes, decision tree, and SVM are chosen because of their wide use in text classification tasks (Pranckevicius and Marcinkevicius, 2017).

The Naive Bayes classifier is a linear classifier known for being simple and very efficient for text classification tasks. The probabilistic model of Naive Bayes classifiers is based on Bayes' theorem. The algorithm works on the assumption that the features in a dataset are independent of each other.

LIBSVM is a library for Support Vector Machines (SVM). It has gained wide popularity in machine learning and many other areas. SVM finds an optimal solution that maximizes the distance between the hyperplane and the difficult points close to the decision boundary. As (Chang and Lin, 2011) stated, if there are no points near the decision surface, then there are no very uncertain classification decisions.

The other classifier used in this study is the decision tree, a tree-based classifier for instances represented as feature vectors. There is one branch for each value of a feature, and the leaves specify the category. It can represent an arbitrary classification function over discrete feature vectors. For the decision tree, the J48 algorithm is used. J48 is an algorithm used to generate a decision tree. The decision trees generated by J48 can be used for classification, and for this reason J48 is often referred to as a statistical classifier.

The above algorithms are used to train models with the labeled dataset as input. The models then classify the instances of a test set into the on-event and off-event classes. According to the feature selection recommendation, the POS tag is the best-performing syntactic feature for detecting events.

3.4. Event Extraction using Rule Based Approach
Rule-based learning is an information extraction method that uses extraction patterns to retrieve information from a text document. In this study, a standalone rule-based approach is proposed to enhance the accuracy of the event extraction system. Amharic has subject-object-verb agreement and other morphological features that make rule construction cumbersome. As (Yunita Sari and Zamin, 2010) mention, the construction of an extraction pattern is based on syntactic or semantic constraints and delimiters, or on a combination of syntactic and semantic constraints. Events predominantly occur as nominals and verbs (Ramesh and Kumar, 2016). Nominal events are ambiguous because they can appear as deverbal or non-deverbal nouns. Thus, we need the morphological features of the instances to disambiguate nominal events. To do so, a morphological analyzer is employed to obtain the morphological features of the events mentioned in the instances. For example:- ( ΂tÓÍÑ hz©m ¼Êh ¤ö‰ àÝ ŒÈ¹Ýt €yàlÛm ~ ) In this sentence, the underlined word (àÝ) is derived from the verb (fÕm). It looks like an adjective, but it is a deverbal entity, which we call a nominal event. The rules are developed based on the syntactic features of words with the help of carefully constructed gazetteers. The POS tag and lemma of a word are used as the basis for the handcrafted linguistic rules. Different components, TreeTagger and HornMorpho, are used to obtain the syntactic features of words. The pattern extractor is developed based on these syntactic features. Simple rules are applied to extract the detected events.
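To make the idea of rules over POS tags, lemmas, and gazetteers concrete, the following is a minimal sketch and not the rule set of this study. The `Token` structure, the `extract` function, the tag names, the date pattern, and the transliterated entries ("serg" for wedding, "nege" for tomorrow) are all hypothetical placeholders; in the described system the POS tag would come from TreeTagger, the lemma and the deverbal flag from HornMorpho, and the word lists from the hand-built gazetteers.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    pos: str                # POS tag, e.g. "N", "V", "ADV" (assumed tag names)
    lemma: str
    deverbal: bool = False  # True if the analyzer reports a verbal citation form


# Hypothetical gazetteer and keyword list for the example.
NON_DEVERBAL_EVENTS = {"serg"}   # "wedding"
TEMPORAL_KEYWORDS = {"nege"}     # "tomorrow"
# Assumed shapes of regular date expressions: 16/08/2010, 16-08-2010, or 2010.
DATE_PATTERN = re.compile(r"^\d{1,2}[/-]\d{1,2}[/-]\d{2,4}$|^\d{4}$")


def extract(tokens: list[Token]) -> dict[str, list[str]]:
    """Apply simple handcrafted rules to one POS-tagged sentence."""
    events, times = [], []
    for tok in tokens:
        if tok.lemma in TEMPORAL_KEYWORDS or DATE_PATTERN.match(tok.word):
            times.append(tok.word)      # temporal argument
        elif tok.pos.startswith("V"):
            events.append(tok.word)     # verbs are primary event candidates
        elif tok.pos == "N" and (tok.deverbal
                                 or tok.lemma in NON_DEVERBAL_EVENTS):
            events.append(tok.word)     # nominal event
    return {"events": events, "temporal_arguments": times}
```

An ordinary noun such as a person name matches none of the rules and is left out, which is how the gazetteer and the deverbal flag keep subjects and participants from being reported as events. The neighbor check with bigram language models is omitted from this sketch.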
For example:- (€¤¤) N (t‰nt) ADV (ÎÚËw) VN (¤–) N (°) VP ( ~ ) . In this example, the handcrafted rules operate on the POS tagger results. The formal structures are not always regular enough to develop stable rules. The morphological analyzer, in contrast, is very helpful because of the existence of deverbal events, which are ambiguous.
Some of the rules applied in this study for the hybrid system include the following:-

1. Automatically label preprocessed texts with their corresponding word classes (parts of speech).

2. Get the morphological features of words, including word, subject, root, lemma, object, grammar, and preposition.

3. Events are usually expressed using verbs and nouns. Check the neighboring words using bigram language models, because not all nouns are events: nouns that come at the beginning of a sentence are sometimes the subject or a participant of the event rather than the event itself.

4. Identify the nominal events. The morphological analyzer plays the main role here by indicating the citation form of the respective nouns: nominal words can be deverbal or non-deverbal nouns, but deverbal nouns have the citation form of a verb.

5. Words categorized as verbs or verb-group word classes by the part-of-speech tagger, and their infinitive forms, are selected as primary candidates.

6. Check non-deverbal nouns (which usually act as events) against carefully built gazetteers (lists of non-deverbal noun lexical items). Because of our limited dictionary, a ternary search tree algorithm is applied to enhance efficiency.

7. Identify words that contain temporal keywords. The temporal indicator keywords come from a carefully built list of commonly used temporal expressions in Amharic. In addition, regular expressions are constructed to handle regular date-time expressions, and bigram language models are applied to find temporal arguments.
For example:- ΀¤¤ ˜rg ¶Ú ¶w ~ / "Abebe's wedding is tomorrow."
From this sentence, the word ˜rg is a nominal that is extracted as an event, and it is actually an event, whereas the word ¶Ú is extracted as the temporal event argument of the major event ˜rg.

3.5. Event Extraction using hybrid Approach
Unlike in knowledge-driven systems, in hybrid event extraction systems the amount of required data increases because of the use of supervised machine learning techniques, yet it typically remains less than with purely data-driven techniques. The complexity, and hence the required expertise, is generally high compared with purely knowledge-driven techniques because multiple techniques are combined. Moreover, the interpretability of a system benefits to some extent from the use of semantics, as in knowledge-based techniques (Baradaran and Mineai-Bidgoli, 2015).
The other technique employed in this study combines supervised machine learning and rule-based techniques to extract events from unstructured Amharic text. The machine learning approach mainly focuses on coverage (recall) rather than precision, while the handcrafted rules aim at the highest possible precision given the incorporated rules. In our case, the machine learning classifiers miss nominal events more often than verbal events. Therefore, we incorporate some rules to recover the events missed by the machine learning classifiers. Deverbal nouns exhibit both nominal and verbal syntactic representations. They serve as concrete nouns but also participate in verbal constructions, where they require arguments and accept aspectual modification. Nominal events appear as deverbal or non-deverbal entities: deverbal entities are derived from verbs, whereas non-deverbal entities are not. E.g., (àÕ) is a deverbal entity, an event derived from the verb (fÕm). An event is a situation that lasts for a moment. By this definition, a nominal can be an event; e.g., (˜rg)/wedding is a non-deverbal nominal event. Another example: (΀¤¤ ˜rg ƒmŠ 16, 2010 A.m nw.). Simply knowing the morphological variation of words and having a list of common non-deverbal nominals from the gazetteers (lists of exceptional non-deverbal events) helps to resolve event ambiguity. We obtain deverbal events from the morphological analyzer and non-deverbal events from the gazetteers. Applying this disambiguation scheme improves the accuracy of our system compared with the standalone rule-based approach.

4. Model Evaluation
Among the standard information extraction evaluation metrics, precision, recall, and F-measure are used to evaluate the performance of the models. A 10-fold cross-validation technique is used to split the dataset. The dataset is shuffled randomly, with 80% of the data used for training and 20% for testing.

5. Experimental Results
In this study, a total of five experiments are conducted. Three experiments use supervised learning algorithms (Naive Bayes, decision tree, and SVM) to detect events in unstructured Amharic text. One experiment uses the rule-based approach to extract events from unstructured Amharic text. The remaining experiment combines the supervised learning and rule-based approaches.
The first three experiments train a model with each of the three selected supervised learning algorithms. All features are used in order to see the effect of each attribute on event detection; each algorithm is run on the full feature set.
As the results in Table 1 show, the Naïve Bayes (NB) classifier outperforms the other classifiers at detecting events. It achieves an F-score of 0.915 on the weighted average, 0.831 on the On-Event class, and 0.944 on the Off-Event class. This experimental result confirms the advantage of the Naïve Bayes classifier for the event detection task. We obtain encouraging results using a machine learning classifier for event detection. The remaining problem lies in the ambiguity of deverbal entities.

technique. As the table shows, the combination of rule-based and supervised machine learning classifiers brings significant results in extracting events from unstructured Amharic text.

Table 2: Overall event extraction evaluation: comparison of experimental results
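For readers who want to reproduce the kind of scores reported in this section, the following minimal sketch shows how per-class precision, recall, and F-measure, and a support-weighted average F-score, can be computed from gold and predicted labels. It is an illustration under assumptions, not the evaluation code of the study: the function names `prf` and `weighted_f1` and the label strings are hypothetical, and the weighted average is assumed to weight each class by its number of gold instances.

```python
def prf(gold: list[str], pred: list[str], label: str) -> tuple[float, float, float]:
    """Precision, recall, and F-measure for one class label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)  # true positives
    fp = sum(g != label and p == label for g, p in pairs)  # false positives
    fn = sum(g == label and p != label for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def weighted_f1(gold: list[str], pred: list[str]) -> float:
    """Average of per-class F-measures, weighted by gold class frequency."""
    n = len(gold)
    return sum(prf(gold, pred, label)[2] * gold.count(label) / n
               for label in sorted(set(gold)))
```

Because the weighted average follows the class distribution, it lies between the On-Event and Off-Event scores and is pulled toward the score of the more frequent class.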
Demeke, G. and Getachew, M. (2006). Manual annotation of Amharic news items with part-of-speech tags and its challenges. 01.
Frederik Hogenboom, F. F. and Kaymak, U. (2016). A survey of event extraction methods from text for decision support systems. Elsevier.
Gasser, M. (2011). Hornmorpho: a system for morphological processing of Amharic, Oromo, and Tigrinya. In Conference on Human Language Technology for Development.
Ibrahim, A. and Assabie, Y. (2014). Amharic sentence parsing using base phrase chunking. In COLING 2014.
lasker, L., Argaw, A. A., and Gamback, B. (2007). Applying machine learning to Amharic text classification. In Proceedings of the 5th World Congress of African Linguistics.
Mulugeta, W. and Gasser, M. (2012). Learning morphological rules for Amharic verbs using inductive logic programming.
Pranckevicius, T. and Marcinkevicius, V. (2017). Comparison of naïve Bayes, random forest, decision tree, support vector machines, and logistic regression classifiers for text reviews classification. In Baltic J. Modern Computing.
Ramesh, D. and Kumar, S. S. (2016). Event extraction from natural language text. International Journal of Engineering Sciences and Research Technology (IJESRT), 5(7).
Ramrakhiyani, N. and Majumder. (2015). Approaches to temporal expression recognition in Hindi. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 14(1).
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing.
Sikdar, U. and Gambäck, B., (2018). Named Entity Recognition for Amharic Using Stack-Based Deep Learning: 18th International Conference, CICLing 2017, Budapest, Hungary, April 17–23, 2017, Revised Selected Papers, Part I, pages 276–287. 01.
Smadi, M. and Qawasmeh, O. (2018). A supervised machine learning approach for events extraction out of Arabic tweets. In Fifth International Conference on Social Networks Analysis, Management and Security, SNAMS 2018, Valencia, Spain, October 15-18, 2018, pages 114–119.
Sohail, O. and Elahi, I. (2018). Text classification in an under-resourced language via lexical normalization and feature pooling. In Twenty-Second Pacific Asia Conference on Information Systems.
Tourille, J., Ferret, O., Tannier, X., and Neveol, A. (2017). Temporal information extraction from clinical text. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 2 of EACL, page 739–745.
Tsedalu, G. (2010). Information extraction model from Amharic news texts. Master's thesis, Addis Ababa University.
Yunita Sari, M. F. H. and Zamin, N. (2010). Rule based pattern extractor and named entity recognition: A hybrid approach. IEEE.