
IR Documentation

This document is a thesis submitted to the University of Gondar regarding an Amharic language text-based document categorization and file relation project. It introduces the objectives, methodology, and significance of the project, which aims to categorize Amharic documents into relevant categories based on their texts and identify the percentage of query terms associated with each document. The project uses techniques like tokenization, stemming, lemmatization and stopword removal to process Amharic texts and categorize documents based on their contents.


UNIVERSITY OF GONDAR

COLLEGE OF COMPUTING AND INFORMATICS

DEPARTMENT OF INFORMATION TECHNOLOGY

Amharic Language Text Based Document Categorization and File Relation

By

1. Bekalu Tadele GUR/05201/11

2. Wasyihun Sema. GUR/05194/11

3. Nigus Wereta GUR/05192/11

Advisor: Minale Ashagrie (PhD)

Date: Saturday, June 22, 2019


Table of Contents
1. Introduction
1.2. Objectives of the project
1.2.1. General Objectives
1.2.2. Specific objectives of the project
1.3. Statement of the problems
1.4. Tools and Methodology used in the project
1.5. Significance of the project
1.6. Experimental procedures
1.7. Conceptual Framework for the project
1.8. Implementation
1.9. Conclusion
1.10 Future Work

Amharic language Text Based Document Categorization and File Relation
1. Introduction
Amharic is the working language of the Federal Government of Ethiopia. It is the second largest
Semitic language after Arabic, and the majority of Ethiopian literature and documents are
written in it. It has an inflectional morphological structure in which morphemes are fused
together, so a complex morphological analyzer is required to separate them. Retrieving
information from a database becomes a difficult task when organizing and categorizing Amharic
documents. Document categorization and file relation aims to assign a given document to its
relevant category and to check its relation to a file from the corpus. Document categorization
can be based on keywords or on concepts. Keyword-based text categorization uses only keywords
extracted from the text to identify the category of a given document. Categories organize
items into groups according to their similarities or shared characteristics.
In this project we train on data from six categories: Sport, Health, Politics, Technology,
Transport, and Business. Besides categorizing documents, the project identifies the
percentage of a document associated with the query term. Categorizing Amharic documents and
relating files requires a domain expert to decide on the appropriate topic class or classes. The
abstract of a document will be freely accessible on the web. It is therefore possible to create
an extensive library of such abstracts. The system works mainly by comparing keywords: a
document to be categorized must contain specific keywords that match those representing a
predefined category.
1.2. Objectives of the project
1.2.1. General Objectives
The project is designed to categorize Amharic text documents, based on their contents, into the
category relevant to the user's need, to identify the percentage of query terms associated with
each document, and to list the documents ranked by term frequency.

1.2.2. Specific objectives of the project


To achieve the general objective of the project, the following specific objectives should be met.
✓ To understand the basic operations of text processing
✓ To understand the concept of categorizing documents into relevant categories
✓ To design a text categorization and file relation system
✓ To implement a document categorization and file relation prototype
✓ To test the prototype.
1.3. Statement of the problems
Amharic text-based document categorization and file association (relation) is a problem in
information retrieval: we assign a document to one or more classes or categories. The problems
addressed by the two approaches differ, but they overlap, which is why there is interdisciplinary
research on document classification. Classifying documents can help an organization meet
legal and regulatory requirements for retrieving specific information within a set timeframe, and
this is often the motivation behind implementing data classification or categorization. However,
data strategies differ greatly from one organization to the next, as each generates different
types and volumes of data. Nowadays a large amount of unstructured data dominates the market,
but unstructured data are not complete, effective, or accurate when searching for information.
We think that organizing and categorizing documents is the best way to overcome this retrieval
problem. In the case of Amharic, many documents are available in every office and organization,
but they are mostly not organized into categories or kept in electronic form. This has a
negative effect on researchers and other professionals who need statistical data analysis, which
is why we developed this project.
1.4. Tools and Methodology used in the project
Methodology is the systematic, theoretical analysis of the methods applied to a field of study.
Therefore, the team members used the following tools and methods to carry out the project.
✓ Literature review: reviewing journal papers and books about document categorization,
English language structure and characteristics, and the approaches used so far in document
categorization, in order to identify the techniques applied to English document categorization.
✓ Data preparation: the project team prepared a data set from the Fana Broadcasting
Corporation news site, Soccer Ethiopia, and the official web pages of the Ministry of Transport
and the Ministry of Health. The data is labeled with predefined categories and used for training.
✓ Development tool: the project uses the Python 3.7 programming language with the
PyCharm IDE (integrated development environment), because Python is freely available
and makes solving a computing problem almost as easy as writing out your thoughts about
the solution.
✓ Microsoft Office packages: used to prepare the documentation and presentation for the project.
1.5. Significance of the project
The major contribution of the project is the design of Amharic text categorization and file
relation between query terms and documents using information retrieval. The project helps to
identify the type of a document by matching its content against the query terms. The user enters
a text query; the system processes the text by applying basic text operations and then
categorizes the query into the relevant document category. The project thus makes it easy to
access documents by type.
1.6. Experimental procedures
The experiment follows a set procedure in order to make the categorization of documents easy
and to provide a fast searching process. It starts with installing the software.
✓ Install Python 3.7.
✓ Install the Python packages used to develop the project. One of these packages is Tkinter,
which is used to build the graphical user interface.
✓ Planning the experiment: preparing documents and planning how to apply basic
document processing operations to Amharic documents.
✓ Conducting the experiment: in this step we write code showing how each basic operation
works on documents in the Amharic language. The basic text processing operations
implemented in the experiment are:
➢ Tokenizing the contents of the documents
Tokenization is the process of parsing text data into smaller units (tokens) such as words and
phrases.
➢ Stemming and lemmatizing the contents of the documents
Different tokens might carry similar information (e.g. tokenization and tokenizing). Calculating
similar information repeatedly can be avoided by reducing all tokens to their base forms using
various stemming and lemmatization dictionaries.
➢ Removing stop words
Some tokens are less important than others. For instance, common words such as “ዎች”, “የ”, “በ”,
and “ስለዚህ” might not be very helpful for revealing the essential characteristics of a text. So it
is usually a good idea to eliminate stop words and any prefixes and suffixes.

➢ Normalization: Amharic has words with the same pronunciation but different symbols,
so the project handles this kind of ambiguity. For example, the words ሀይሌ, ሃይሌ, and ሓይሌ
are all replaced with ሀይሌ.
➢ Identifying the root words (stemming): removing affixes (prefixes and suffixes) to get the
root word.
➢ Categorizing the document into the relevant category based on its contents and
identifying the percentage of association between the query term and the trained documents
in the corpus.
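
The tokenization, normalization, and stop word removal steps listed above could be sketched in Python roughly as follows. This is a minimal illustration and not the project's actual code: the function names, the separator pattern, and the contents of the two tables are assumptions, and only the four stop words and the ሃ/ሓ to ሀ replacement are taken from the examples in this section.

```python
import re

# Assumed stop word list: only these four entries are quoted in the text.
STOP_WORDS = {"ዎች", "የ", "በ", "ስለዚህ"}

# Assumed homophone table: only the mapping of ሃ and ሓ to ሀ comes from the text.
HOMOPHONES = str.maketrans({"ሃ": "ሀ", "ሓ": "ሀ"})

# Split on whitespace, Ethiopic punctuation (U+1361 to U+1368),
# and a few common ASCII punctuation marks.
SEPARATORS = re.compile(r"[\s\u1361-\u1368.,;:!?()\"']+")


def tokenize(text):
    """Split raw text into word tokens, dropping empty strings."""
    return [token for token in SEPARATORS.split(text) if token]


def normalize(token):
    """Replace characters with the same pronunciation by one canonical form."""
    return token.translate(HOMOPHONES)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(text):
    """Tokenize, normalize, then remove stop words."""
    tokens = [normalize(token) for token in tokenize(text)]
    return remove_stop_words(tokens)
```

With these assumed tables, `preprocess` would map both ሃይሌ and ሓይሌ to ሀይሌ and drop ስለዚህ. A real system would need a much larger stop word list and a complete homophone table.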
1.7. Conceptual Framework for the project

[Figure: conceptual framework of the project]

Where S1 = document source 1, DT1 = document type 1
S2 = document source 2, DT2 = document type 2
S3 = document source 3, DT3 = document type 3
1.8. Implementation
The system browses the test data prepared by the group members, then tokenizes, stems, and
normalizes the content of the documents. Once a document is loaded, the system tokenizes it,
removes stop words, displays the root words (ስረወ-ቃል), and associates them with the
predefined training data in the corpus.
Graphical user interface overview

✓ Tokenizing, removing stop words, stemming, and displaying root words
After browsing the documents, the system can perform basic text processing operations on the
Amharic documents. That is, the system can remove stop words, prefixes, and suffixes, and
stem the contents of the documents in order to display root words (ስረወ፡ቃል). For example, the
word “እየተንቀሳቀሱ” in the document is stemmed, or normalized, to its root word
“ተንቀሳቀሱ”.
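
The affix removal in the example above can be illustrated with a small Python sketch. It is a simplified assumption about how such a stemmer might look, not the project's implementation: the affix lists, the minimum stem length, and the name `strip_affixes` are hypothetical, and only the prefix “እየ” (implied by the example) and the suffix “ዎች” (quoted earlier) come from this document.

```python
# Hypothetical affix lists for illustration only.
PREFIXES = ("እየ",)
SUFFIXES = ("ዎች",)

# Assumed guard so that very short words are not stripped down to nothing.
MIN_STEM_LENGTH = 2


def strip_affixes(word):
    """Remove at most one known prefix and one known suffix from a word."""
    # Try longer affixes first so that the longest match wins.
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            word = word[:-len(suffix)]
            break
    return word
```

Here `strip_affixes("እየተንቀሳቀሱ")` removes the prefix and returns “ተንቀሳቀሱ”, matching the example. Plain affix stripping of this kind is only an approximation, since Amharic morphology is far richer than two short lists can capture.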

✓ Categorizing documents
Finally, the system categorizes the Amharic documents into the relevant document category based
on their contents and shows the association of the documents with the query term by
calculating the percentage of similarity. When we click the button ስረወ ቃሉን አሳይ, the system
displays the category of the given document. For example, the document browsed above is
categorized under Business (ንግድ ና ገበያ).
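
The report does not state how the percentage of similarity is computed. One plausible reading, shown below purely as an assumption, is the share of a document's root words that also occur in a category's trained keyword set. The keyword sets are small placeholders, and the names `association_percentages` and `categorize` are hypothetical.

```python
# Placeholder keyword sets; the category names come from the report,
# the keywords themselves are illustrative assumptions.
CATEGORY_KEYWORDS = {
    "Sport": {"እግር", "ኳስ", "ጨዋታ"},
    "Business": {"ንግድ", "ገበያ", "ዋጋ"},
}


def association_percentages(root_words, category_keywords):
    """Return, per category, the percentage of root words found in its keywords."""
    total = len(root_words)
    scores = {}
    for category, keywords in category_keywords.items():
        if total == 0:
            scores[category] = 0.0
        else:
            matches = sum(1 for word in root_words if word in keywords)
            scores[category] = 100.0 * matches / total
    return scores


def categorize(root_words, category_keywords):
    """Pick the category with the highest association percentage."""
    scores = association_percentages(root_words, category_keywords)
    best_category = max(scores, key=scores.get)
    return best_category, scores


# A document whose root words mostly match the Business keywords.
sample_words = ["ንግድ", "ገበያ", "ኳስ", "ሌላ"]
best, percentages = categorize(sample_words, CATEGORY_KEYWORDS)
```

For the sample words, two of four match Business and one of four matches Sport, so `best` is `"Business"` with 50.0 percent against 25.0 percent. If two categories tie, this sketch simply returns whichever comes first in the dictionary.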

1.9. Conclusion

Amharic text-based document categorization and file association is implemented using basic
information retrieval text processing operations, including tokenization, stop word removal,
normalization, stemming, and indexing of the root words. The system compares the frequency of
each word against the corpus collection in the database, ranks the associated files, and
displays them to the user based on the query entered. The document that contains the highest
number of the user's query terms is selected as the first match, or closest relative, of the
query, and the documents are arranged in descending order of term frequency.
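
The ranking described above can be sketched as follows, assuming each document has already been reduced to a list of root words. The function name `rank_documents`, the file names, and the sample data are hypothetical, and scoring by the summed frequency of the query terms is one reading of the description, not a confirmed detail of the system.

```python
from collections import Counter


def rank_documents(query_terms, documents):
    """Rank documents by how often the query terms occur in them.

    documents maps a file name to its list of root words. The result is a
    list of (file name, total term frequency) pairs, highest score first.
    """
    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        # Counter returns 0 for terms that do not occur in the document.
        scores[name] = sum(counts[term] for term in query_terms)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical corpus of three preprocessed .txt files.
corpus = {
    "a.txt": ["ንግድ", "ገበያ", "ንግድ"],
    "b.txt": ["ንግድ"],
    "c.txt": ["ኳስ"],
}
ranking = rank_documents(["ንግድ", "ገበያ"], corpus)
```

In this example `a.txt` contains the query terms three times, `b.txt` once, and `c.txt` not at all, so the files are listed in that order.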

1.10 Future Work


Currently the project relies on manual training of the documents in the corpus collection. For
future work we recommend replacing this manual training with unsupervised learning methods.
Secondly, our corpus collection consists only of .txt files, so we recommend extending it to
other document collections and file formats, such as files containing images, Word files,
database files, and so on.
