IR Documentation
By
Amharic language Text Based Document Categorization and File Relation
1. Introduction
Amharic is the working language of the Federal Government of Ethiopia. It is the second-largest
Semitic language after Arabic, and it is the language in which the majority of Ethiopian literature
and documents are written. It has an inflectional morphology in which morphemes are fused
together, so a complex morphological analyzer is required to separate them. As a result,
organizing and categorizing Amharic documents, and retrieving information about them from a
database, is a difficult task.

Document categorization and file relation aims to assign a given document to its relevant
category and to check its relation to a file in the corpus. Categorization can be based on
keywords or on concepts. Keyword-based text categorization uses only the keywords extracted
from the text to identify the category of a given document. A category groups items according
to their similarities or shared characteristics. In this project we train on data from six
categories: Sport, Health, Politics, Technology, Transport, and Business. Beyond categorizing
documents, the project also identifies the percentage of a document associated with the query
term. Categorizing Amharic documents and relating files requires a domain expert to decide on
the appropriate topic class or classes. The abstract of a document will be freely accessible on
the web. It is therefore possible to create an extensive library of such abstracts. The system
works mainly by comparing keywords: a document to be categorized must contain specific
keywords that match those representing a predefined category.
1.2. Objectives of the project
1.2.1. General Objectives
The project is designed to categorize Amharic text documents, based on their contents, into the
category relevant to the user's need, to identify the percentage of query terms associated with
each document, and to list the documents ranked by term frequency.
➢ Normalization: Amharic has words with the same pronunciation but different symbols,
so the project handles this kind of ambiguity. For example, the variants ሀይሌ, ሃይሌ, and ሓይሌ
are all replaced with ሀይሌ.
➢ Stemming (identifying root words): removing affixes (prefixes and suffixes) to get the root
word.
➢ Categorization: assigning each document to its relevant category based on its contents, and
identifying the percentage of association between the query term and the trained documents
in the corpus.
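The normalization step described in the list above can be sketched as a simple character mapping. This is a minimal illustration, not the project's actual code: the table `CANONICAL` and the function `normalize` are hypothetical names, and the table covers only the three variants named in the example (ሀ, ሃ, ሓ). A fuller table would be needed for other same-sounding characters.

```python
# Hypothetical mapping from variant characters to one canonical character.
# Only the variants mentioned in the text are listed; this is an assumption,
# not the project's full normalization table.
CANONICAL = {
    "ሃ": "ሀ",
    "ሓ": "ሀ",
}

# str.maketrans builds a translation table from single-character keys.
_TABLE = str.maketrans(CANONICAL)


def normalize(text: str) -> str:
    """Replace each variant character with its canonical form."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    for word in ("ሀይሌ", "ሃይሌ", "ሓይሌ"):
        print(word, "->", normalize(word))
```

Using a translation table keeps the step to a single pass over the text, and new variant pairs can be added to the dictionary without changing the function.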
1.7. Conceptual Framework for the project
✓ Tokenizing, Removing Stop Words, Stemming, and Displaying Root Words
After browsing the documents, the system can perform basic text processing operations on the
Amharic documents. It removes stop words, prefixes, and suffixes, and stems the contents of the
documents in order to display root words (ስረወ፡ቃል). For example, the word “እየተንቀሳቀሱ” in the
document is stemmed to its root word “ተንቀሳቀሱ”.
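The pipeline described above (tokenize, drop stop words, strip affixes) might look roughly like the following. This is a sketch under stated assumptions: the stop word list and the prefix list are tiny, hypothetical samples chosen for illustration, the function names are invented here, and only prefix stripping is shown. The report does not specify the actual lists or stemming rules.

```python
import re

# Hypothetical, deliberately tiny resources (assumptions for illustration);
# a real system would load much larger stop word and affix lists.
STOP_WORDS = {"እና", "ነው", "ላይ"}
PREFIXES = ("እየ",)

# Split on whitespace and on Ethiopic punctuation (U+1360 to U+1368),
# which includes the word separator ፡ and the full stop ።.
_SEPARATORS = re.compile(r"[\s\u1360-\u1368]+")


def tokenize(text):
    """Return the non-empty tokens of a text."""
    return [t for t in _SEPARATORS.split(text) if t]


def strip_prefix(word):
    """Remove the first matching prefix, leaving at least two characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 1:
            return word[len(prefix):]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then strip prefixes from what remains."""
    return [strip_prefix(t) for t in tokenize(text) if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("እየተንቀሳቀሱ ነው።"))
```

The length check in `strip_prefix` is a simple guard against reducing short words to nothing. Suffix removal would follow the same pattern with `str.endswith`.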
✓ Categorizing Documents
Finally, the system categorizes the Amharic documents into the relevant document category based
on their contents and shows the association of documents with the query term by
calculating the percentage of similarity. When we click the button ስረወ ቃሉን አሳይ, the system displays
the category of the given document. For example, the document browsed above is categorized
under Business (ንግድ ና ገበያ).
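One way to picture keyword-based categorization and the similarity percentage is the sketch below. Everything here is an assumption for illustration: the keyword sets are two-word samples (the Business words come from the category label above, the Sport words are invented examples), and the percentage is taken to be the share of document tokens that equal a query term. The report does not state the exact formula the system uses.

```python
from collections import Counter

# Hypothetical keyword sets per category (assumption); a trained system
# would derive these from the labeled corpus.
CATEGORY_KEYWORDS = {
    "Business": {"ንግድ", "ገበያ"},
    "Sport": {"ስፖርት", "ኳስ"},
}


def categorize(tokens):
    """Return the category whose keywords occur most often, or None."""
    counts = Counter(tokens)
    scores = {
        category: sum(counts[k] for k in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def query_match_percent(tokens, query_terms):
    """Percentage of document tokens that equal one of the query terms."""
    if not tokens:
        return 0.0
    wanted = set(query_terms)
    hits = sum(1 for t in tokens if t in wanted)
    return 100.0 * hits / len(tokens)


if __name__ == "__main__":
    doc = ["ንግድ", "ገበያ", "ንግድ", "ኳስ"]
    print(categorize(doc), query_match_percent(doc, ["ንግድ"]))
```

The input tokens are assumed to be already normalized and stemmed, so that keyword comparison is done on root words.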
1.9. Conclusion
Amharic text-based document categorization and file association is implemented using basic
information retrieval text processing operations on documents, including tokenization, stop
word removal, normalization, stemming, and indexing of root words. The system compares the
frequency of each query word across the corpus collection in the database, ranks the associated
files, and displays them to the user based on the query entered. The document that contains the
highest number of query terms is selected as the first match for the query, and the remaining
documents are arranged in descending order of term frequency.
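The ranking described in this conclusion can be sketched as follows. The function name `rank_by_term_frequency` and the input shape (a dictionary from document name to its preprocessed tokens) are assumptions made for this example, as is breaking ties by document name.

```python
from collections import Counter


def rank_by_term_frequency(docs, query_terms):
    """Rank documents by how often the query terms occur in them.

    docs: mapping of document name -> list of preprocessed tokens.
    Returns (name, frequency) pairs, highest frequency first.
    """
    wanted = set(query_terms)
    scored = []
    for name, tokens in docs.items():
        counts = Counter(tokens)
        frequency = sum(counts[term] for term in wanted)
        scored.append((name, frequency))
    # Descending by frequency; ties are broken alphabetically by name
    # (the tie-break rule is an assumption, not stated in the report).
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    corpus = {
        "doc_a": ["ንግድ", "ገበያ", "ንግድ"],
        "doc_b": ["ንግድ"],
        "doc_c": ["ኳስ"],
    }
    for name, frequency in rank_by_term_frequency(corpus, ["ንግድ"]):
        print(name, frequency)
```

The first pair in the returned list corresponds to the "first match" in the text, and documents with no query terms fall to the end with a frequency of zero.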