0% found this document useful (0 votes)

764 views12 pages

Bengali Information Retrieval System (Birs)

Information Retrieval System is an effective process that helps a user to trace relevant information by Natural Language Processing (NLP). In this research paper, we have presented present an algorithmic Information Retrieval System(BIRS) based on information and the system is significant mathematically and statistically. This paper is demonstrated by two algorithms for finding out the lemmatization of Bengali words.

Uploaded by

Darren

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

764 views12 pages

Bengali Information Retrieval System (Birs)

Uploaded by

Darren

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 12

International Journal on Natural Language Computing (IJNLC) Vol.8, No.

5, October 2019

BENGALI INFORMATION RETRIEVAL

SYSTEM (BIRS)
Md. Kowsher1, Imran Hossen2 and SkShohorab Ahmed2
1
Department of applied mathematics, Noakhali Science and Technology University,
Noakhali-3814, Bangladesh
ga.kowsher@gmail.com
2
Department of Information and Communication Engineering, University of Rajshai,
Rajshai-6205, Bangladesh
imranhsobuj97@gmail.com
shohorab.ahmed.it@gmail.com

ABSTRACT
Information Retrieval System is an effective process that helps a user to trace relevant information by
Natural Language Processing (NLP). In this research paper, we have presented present an algorithmic
Information Retrieval System(BIRS) based on information and the system is significant mathematically
and statistically. This paper is demonstrated by two algorithms for finding out the lemmatization of
Bengali words such as Trie and Dictionary Based Search by Removing Affix (DBSRA) as well as
compared with Edit Distance for the exact lemmatization. We have presented the Bengali Anaphora
resolution system using the Hobbs’ algorithm to get the correct expression of information. As the actions
of questions answering algorithms, the TF-IDF and Cosine Similarity are developed to find out the
accurate answer from the documents. In this study, we have introduced a Bengali Language Toolkit
(BLTK) and Bengali Language Expression (BRE) that make the easiest implication of our task. We have
also developed Bengali root word’s corpus, synonym word’s corpus, stop word’s corpus and gathered
672 articles from the popular Bengali newspapers ‘The Daily Prothom Alo’ which is our inserted
information. For testing this system, we have created 19335 questions from the introduced information
and got 97.22% accurate answer.

KEYWORDS
Bangla language Processing, Information retrieval, Corpus, Mathematics, and Statistics.

1. INTRODUCTION
Information Retrieval (IR) simply refers to retrieve information from a collection of sources
based on relevant query. It is a science for searching information in a document, searching
documents themselves and searching of text, images and sounds . Searching is mainly either
based on full-text or content-based. By default, IR means text retrieval because of historical
reasons [1]. Recommender System that is intimately related to the Information Retrieval System
work without a query.

Automated Information Retrieval are most likely used to reduce information overload. It is
basically a software environment that gives access to documents like books, journals, magazines
and so on. Documents can be stored and managed using this system. And today, web search
engines are the most visible IR application.

DOI: 10.5121/ijnlc.2019.8501 1
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

Everyday, a huge number of information are produced by newspapers, social networking sites
and different kinds of websites. Due to these large collections of digital documents in the web
or local machine, finding the desired information is a tedious process. Finding relevant
information based on query, has some challenges such as word mismatch that is a sentence can
be made in different ways, their meaning is same but structure is different and a question can be
formulated in different ways utilizing synonymuos words. So it is very challenging and difficult
task to retrieve the desired information.

The Boolean Retrieval Model was used in the first search engines and is still in using today [2].
Documents will be retrieved if they correspond exactly to the query terms but this does not
generate the ranking since this model assumes all the documents in the retrieved set to be
relevant. However, it is due to the lack of the appropriate ranking algorithm that is the major
drawback of this approach. Therefore, the Vector Space Model [3] was suggested. The word
weighting and similarity can be defined in this space between papers. Terms are usually
weighted by frequency in the document (TF or Term Frequency) or normalized with respect to
the number of documents where they appear (IDF, or Inverse Document Frequency).

Unlike the previously mentioned retrieval models, probabilistic models provide some levels of
theoretical assurances that the models will perform well as long as the model assumptions are
consistent with the data observed. The Principle of Probability Ranking or PRP[4] states that
ranking documents by the probability of relevance will maximize precision at any given rank —
assuming a document's relevance to a query is independent of other documents. One of the first
techniques was suggested in[5] to calculate the probability of relevance using document terms.
Other methods assess text statistics in papers and queries and then use them to construct the
term weighting function [6]. BM25 ranking algorithm works well in different tasks [7]. More
advanced methods such as the Relevance-Based Language Models (or Relevance Models for
short, RM) are the best-performing text retrieval ranking techniques [8 ].

So we are mainly interested to deal the problems using TF- IDF and cosine similarity. TF was
used to calculate the importance of every word in a sentence. IDF calculated the actual
importance of the words in a document. Then we used cosine similarity to figure out the
relationship between questions and sentences. Our main objective is to retrieve relevant
information within a short time with great accuracy.

2. RELATED WORK
In foreign language for instance English language, there are many distinct fields such as Web
information retrieval [9], the retrieval of the picture and video [10] or the content-based
recommendation [11], text retrieval methods that have been commonly studied.

However, in Bangla language, the study of information retrieval isn’t satisfactory. A joint
correlation technique [12] is used to extract and recognize Bangla numbers from the paper
picture. Bangla OCR by the Center for Research Bangla Language Processing (CRBLP) which
converts written text or pictures into editable Unicode text [13], has weaknesses as it is for a
restricted scope. However, greater efficiency has been accomplished by the OCR scheme for
Bangla[14] and the web handwritten OCR for Bangla [15].

Much research is performed previously on summarization, such as extracting Bangla sentences

for document summarization [16]. In addition, knowledge extraction from Bangla Blogs and
News [17], Bangla text extraction from real images [18], Bangla sentiment analysis from micro-
blogs, media channel and customer feedback portal [19] is widely studied.

2
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

Unlike these works, we present an information recovery system on Bangla language by the help
of mathematics and statistics.

3. PROPOSED WORK
In this paper, we introduced a Bengali Information Retrieval Systems (BIRS) based on Bengali
Natural Language Processing (BNLP). The procedure we adopted is isolated in three parts such
as collecting informative documents , pre-processing data and finding relationships between
informative documents and questions via the boost of TF-IDF model and cosine similarity.
Firstly, we collectd five types of corpus such as Bengali root words, stop words, our informative
documents, questions, synonym words . Secondly, we pre-processed our desired data. Finally,
Cosine similarity was used to obtain the relationship between the questions and the answers.
However, cosine similarity deals with vectors. In this case, the documents and questions were
converted to vectors using the TF-IDF model. For better explanation, we take a bunch of
information as paragraph along with two questions so that a workflow of information retrieval
can be explained.

Fig.1: Proposed Work.

3.1. Category: Information

<p>বঙ্গবন্ধু শেখ মুজিবুর রহমান ১৯২০ সালের ১৭ মার্চ টুঙ্গঙ্গপড়া গ্রালম িন্ম গ্রহণ কলরন। তার
রািননঙ্গতক িীবন শুরু হল়েঙ্গিে ১৯৩৯ সালে ঙ্গমেনাঙ্গর স্কুলে পড়ার সম়ে শেলকই। ভারত
পাঙ্গকস্থান ঙ্গবভক্ত হ়োর পর পূব চপাঙ্গকস্থালনর উপর পজিম পাঙ্গকস্থালনর অনযা়ে অঙ্গবর্ার বাড়লত
োলক। এিনয ঙ্গতঙ্গন ১৯৬৬ সালের ৫ শেবররু়োঙ্গর োলহালর ঙ্গবলরাধী দেসমূলহর একটট িাতী়ে
সলেেলন ঐঙ্গতহাঙ্গসক ি়ে দো দাবী শপে কলরন যা ঙ্গিে কাযতচ পূব পাঙ্গ চ কস্তালনর স্বা়েত্তোসলনর

পঙ্গরপূণ চ রূপলরখা। অবলেলে ঙ্গতঙ্গন ১৯৭১ সালের ২৬ মার্চ বাাংোলদলের স্বাধীনতার শ ােণা
শদন।</p>
3
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

Category: Questions

Question-1: বঙ্গবন্ধু শেখ মুজিবুর রহমান শকাো়ে িন্ম গ্রহণ কলরন?

Question-2: কত তাঙ্গরলখ ঙ্গতঙ্গন ি়ে দো দাবী শপে কলরঙ্গিে?

3.2. Corpus

For the implication of Bengali Informative Retrieval System (BIRS), we mainly described five
types of corpus. In the first corpus, there were 28,324 Bengali root words. The purpose of this
corpus was to lemmatize Bengali words. The second one that contained 382 Bengali stop words
was used to remove unnecessary information from documents e.g. stop words from the
informative documents and questions. We compiled 672 articles as informative documents
from variety field of interest such as politics, entertainment, sports, education, science, and
technology from the popular Bengali newspapers named ‘The Daily ProthomAlo’ that was our
third corpus as informative documents. In this work, as our fourth corpus, we created 19334
questions from our informative documents. Furthermore, for the sake of synonymous words
processing, we included 18454 more Bengali similar words and collectively that was our fifth
copus. However, there were also some other corpus e.g. verb processing, unknown word
processing and removing punctuations and so on.

3.3. Pre-Processing

We need to pre-process or normalize the informative documents and the questions through
cleaning, verb processing, removing stop words, tokenization, lemmatization, and synonyms
words processing . Basically, the unwanted or special characters and stop-words don’t affect on
any linguistic operations. Besides, the stopwords of Bengali language always remains as root
.So pre-processing these are also a good way for reducing execution time.

3.3.1. Anaphora

In linguistics, anaphora is the use of an expression whose interpretation depends upon another
expression in context. In narrower sense, it refers to replace the previously described words
(noun) with other words (pronoun) for further use in context. It requires a successful
identification and resolution of Natural language processing(NLP). And hence presents
challenges in computational linguistics.

In the proposed Bengali Informative Retrieval system (BIRS), we mentioned a review of work
done in the field of anaphora resolution which has an influence on pronouns, mainly personal
pronoun. The Hobbs’ algorithm of Anaphora resolution was used in our proposed work. The
algorithm has been adapted for Bengali language taking into account the roles of subject, object
and its impact on anaphora resolution for reflexive and possessive pronouns. Here is the two
sentences from our Bengali informative documents:

বঙ্গবন্ধু শেখ মুজিবুর রহমান ১৯২০ সালের ১৭ মার্চ টুঙ্গঙ্গপড়া গ্রালম িন্ম গ্রহণ কলরন। তার
রািননঙ্গতক িীবন শুরু হল়েঙ্গিে ১৯৩৯ সালে ঙ্গমেনাঙ্গর স্কুলে পড়ার সম়ে শেলকই।
Here, ‘তাাঁর’ (his) is the pronoun of ‘বঙ্গবন্ধু শেখ মুজিবুর রহমান’ (noun).

4
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

3.3.2. Tokenization

Tokenization is an NLP action. It separates a stream of text into words, sentences, symbols or
phrases or other elements which are meaningful. We used the sentence tokenization by using
the BLTK tool and we mapped every tokenized sentence to its root sentence.

3.3.3. Cleaning

Cleaning word refers to remove an unwanted character which doesn’t sentiment to an

informative documents. For example, colon, semicolon, comma, question mark, exclamation
sign, and other punctuations don’t provide meaningful connation. We used the Bangla Regular
Expression (BRE) tool to shift the unwanted characters from the informative documents and
the questions. The Bengali punctuation corpus was used to deal with BRE and carry away
punctuation as unwanted data.

3.3.4. Stop Words Removing

Stop words refer the words that does not affect to the overall meaning of the sentences in the
documents. For instance, in Bengali language stop words such as এবাং (and), শকাো়ে
(where),অেবা (or), শত (to), সালে (with) are meaningless making the overall meaning of the
sentences. As our proposed system i.e. BIRS is an algorithmic approach based on data, the stop
words need to be dismissed. In our documents , every sentence was checked whether there was
stopwords or not. If there was anstopword, the word was deleted. For the simplification of this
action, we used Bengali Language Toolkit (BLTK).

3.3.5. Lemmatization for Bangla Language

Lemmatization is a significant process to transfer a word to its root word. It is one kind of
Natural Language Processing technique and effective for use in various purposes such as text
mining, answering questions or Chatbot etc.In Bengali natural language processing, there are
few verbs that cannot be lemmatized by any system because of the limitation of lemmatization
algorithms. For example, শেলে (went) and ঙ্গেল়ে ( going) are generated from the root word
যাও়ো ( go). There is no relation of character between শেলে( went) and যাও়ো ( go). So
processing these words with algorithms are not good choice. That is why, these types of verbs
are mapped to their root verbs for easily accessing.

In our proposed BIRS , we used two novel techniques for lemmatization one is based on data
structure and the other is the hash table of computer programming. The novel techniques are
“DBSRA” and “Trie”. Dictionary Based Search by Removing Affix (DBSRA) which is really
an easy concept and more suitable for the lemmatization of Bengali words with the lowest time
and space complexity. In this method, at first it removes the ith character from any data (which
we want to shift into root word) where i =0, 1, 2,.... (Length of word - 1) and delete last jth
character where j = length of word n, n-1, n-2, ...., 1 .After that, this method will look up in our
corpus whether the intended words are in the root words corpus or not. “Trie” is a tree-based
data structure which is exoteric for storing data. Like DBSRA, Trie also requires lowest space
and time complexity. A Trie depot the inserted information’s and question’s words with the
similar prefix under the identical sequence of edges in the tree eliminating that essential for
storing the identical prefix each time for each word.

In BIRS, the Trie is taken into account as lemmatization procedure which accommodates for
retrieving all possible lemmas where every node contains a single character and every two
nodes are connected with a single edge. In our system, we used ‘Trie’ as an additional process

5
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

for the best lemma as sometimes, the Trie approach does not work properly if the input word
contains a prefix. So eliminating the prefix is mandatory for proper lemmatization.

Fig.2. Bangla Lemmatization Process

For the best accuracy, the whole informative documents and questions words are processed with
both lemmatization techniques. accurate lemma can be found out with the help of ‘Edit
distance’. Edit distance is egregious to count the similarity or dissimilarity between two words
using dynamic programming.

In the lemmatization process, we applied “Edit Distance” for the root word comparison of
‘DBSRA’ and ‘Trie’. In the applications of Natural Language Processing, Edit Distance is
massively applied for the spelling correction of a word.

Sometimes, the lemmatization algorithms are not a good choice for the unknown words. Here,
unknown words refer to the name of a place, a person or name of anything. ‘Edit Distance’
assists to determine which word is known or unknown. Edit Distance can count the probability
with the edit to its word (not lemma). If the probability P(edit|word) is greater than 50%
(P(edit|word) > 50%) then it is counted as unknown words.

6
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

In order to process the unknown words, we built a corpus, the suffix of Bengali Language that
included শত (te), শি(che), শ়ের(yer), etc. The longest common suffix was removed from the last
character of an unknown word obtained the lemma or root of an unknown word.

3.3.6. Synonyms Words Processing

Synonyms words refers same meaning with different words. There is a possibility when users
make questions which are not available in information data but the meaning is same. In thi.s
case, BIRS may be failed to answer the questions correctly. And hence the overall performance
will be degraded if this type of event happens for any user. So we had given much importance
in case of synonyms words processing in the BIQAS and also Natural Language
Understanding. Therefore, to handle this unwanted situation, a synonyms word corpus were
constructed containing total 13,189 words. Every word was mapped to a common identical
word in the informative documents and th questions. In our desired system, if a word was not
synonymous word with respect to the synonym corpus word, we didn’t count it as a similar
word.

After preprocessing the sentences and questions look the following:

Sentence-1: <s>বঙ্গবন্ধু শেখ মুজিবুর রহমান ১৯২০ সাে ১৭ মার্চ টুঙ্গঙ্গপড়া গ্রাম িন্ম গ্রহণ
করা</s>
Sentence-2: <s>বঙ্গবন্ধু শেখ মুজিবুর রহমান রািনীঙ্গত িীবন শুরু হ়ে ১৯৩৯ সাে ঙ্গমেনাঙ্গর
স্কুে পড়া সম়ে োকা</s>
Sentence-3:<s> ভারত পাঙ্গকস্থান ঙ্গবভক্ত হ়ে পূব চ পাঙ্গকস্থান পজিম পাঙ্গকস্থান অনযা়ে অঙ্গবর্ার
বাড়া োকা</s>
Sentecnce-4: <s>বঙ্গবন্ধু শেখ মুজিবুর রহমান ১৯৬৬ সাে ৫ শেবররু়োঙ্গর োলহার ঙ্গবলরাধ দে
িাতী সলেেন ঐঙ্গতহাঙ্গসক ি়ে দো দাবী শপে করা োকা কায চপূব চপাঙ্গকস্তান স্বা়েত্তোসন পূণ চ
রূপলরখা</s>
Sentence-5: <s>অবলেলে বঙ্গবন্ধু শেখ মুজিবুর রহমান ১৯৭১ সাে ২৬ মার্চ বাাংোলদে স্বাধীন
শ ােণা শদও়ো</s>

Question-1: বঙ্গবন্ধু শেখ মুজিবুর রহমান িন্ম গ্রহণ করা

Question-2: তাঙ্গরখ বঙ্গবন্ধু শেখ মুজিবুর রহমান ি়ে দো দাবী শপে করা

4. ALGORITHMS
4.1. TF-IDF

TF-IDF is an abbreviation for the Term Frequency-Inverse Document Frequency. Finding the
importance of a word to a sentence is a numerical technique and is mathematically-statistically
important.

In our proposed system of the TF-IDF model, there are several steps to figure out the TF-IDF of
an inserted sentence. Actually, TF calculates the frequency of a term in a document. It refers
how many times a term occurs in a document.

Firstly, the term rule has been ensured to measure the value of TF in every pre-processed
sentence. For the standard data, normalization has been introduced due to the variation of the
length of the document. In small document, there might be less word terms than the large

7
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

documents. Therefore there would be wrong in the retrieved information based on the query.
And hence we introduced the normalization technique as a term (word÷length) of the sentence.
Secondly, to figure out a relevant sentence by searching questions, IDF is pretty useful in this
case. In TF all the words are treated as equal importance. But IDF determines the actual
importance of a word in the document.

Finally, we counted the desired TF-IDF by multiplication with the term TF and IDF from the
inserted questions and sentences. In order to reduce the time and the space complexity, we
calculated the TF-IDF only of those words which are related to input questions inquired by the
users.

Here, the TF-IDF from the previous example is shown in Table 1.

Table 1.TF-IDF of Sentences and Questions

Terms Sent-1 Sent-2 Sent-3 Sent-4 Sent-5 Ques-1 Ques-2

টুঙ্গঙ্গপড়া 0.053767 0 0 0 0 0 0
রূপলরখা 0 0 0 0.026883 0 0 0
করা 0.007455 0 0 0.015305 0 0.056849 0.036176
রািনীঙ্গত 0 0.046598 0 0 0 0 0
স্বা়েত্তোসন 0 0 0 0.026883 0 0 0
স্বাধীন 0 0 0 0 0.053767 0 0
১৯৩৯ 0 0.046598 0 0 0 0 0
ঐঙ্গতহাঙ্গসক 0 0 0 0.026883 0 0 0
সাে 0.030611 0.006461 0 0.003727 0.007455 0 0
৫ 0 0 0 0.026883 0 0 0
োকা 0 0.01479 0.018487 0.008533 0 0 0
অবলেলে 0 0 0 0 0.053767 0 0
পূব চ 0 0 0.033162 0.015305 0 0 0
ঙ্গমেনাঙ্গর 0 0.046598 0 0 0 0 0
পজিম 0 0 0.058248 0 0 0 0
শ ােণা 0 0 0 0 0.053767 0 0
১৯২০ 0.007455 0 0 0 0 0 0
বাাংোলদে 0 0 0 0 0.053767 0 0
বঙ্গবন্ধু 0 0.006461 0 0.003727 0.007455 0.013844 0.00881
শেবররু়োঙ্গর 0 0 0 0.026883 0 0 0
মার্চ 0.053767 0 0 0 0.030611 0 0
হ়ে 0 0 0.058248 0 0 0 0
শেখ 0.030611 0.006461 0 0.003727 0.007455 0.013844 0.00881
সলেেন 0 0 0 0.026883 0 0 0
পাঙ্গকস্থান 0 0 0.174743 0 0 0 0
শুরু 0 0.046598 0 0 0 0 0
দো 0 0 0 0.026883 0 0 0.063543
১৭ 0.007455 0 0 0 0 0 0
বাড় 0 0 0.058248 0 0 0 0
দাবী 0 0 0 0.026883 0 0 0.063543
পূণ চ 0 0 0 0.026883 0 0 0
োলহার 0 0 0 0.026883 0 0 0
িন্ম 0.053767 0 0 0 0 0.099853 0

8
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

পড়া 0 0.046598 0 0 0 0 0
ি়ে 0 0 0 0.026883 0 0 0.063543
মুজিবুর 0.007455 0.006461 0 0.003727 0.007455 0.013844 0.00881
ঙ্গবভক্ত 0 0 0.058248 0 0 0 0
অনযা়ে 0 0 0.058248 0 0 0 0
পাঙ্গকস্তান 0 0 0 0.026883 0 0 0
িাতী 0 0 0 0.026883 0 0 0
সম়ে 0 0.046598 0 0 0 0 0
গ্রাম 0.007455 0 0 0 0 0 0
শদও়ো 0 0 0 0 0.053767 0 0
দে 0 0 0 0.026883 0 0 0
অঙ্গবর্ার 0 0 0.058248 0 0 0 0
গ্রহণ 0.053767 0 0 0 0 0.099853 0
স্কুে 0 0.046598 0 0 0 0 0
ঙ্গবলরাধ 0 0 0 0.026883 0 0 0
১৯৬৬ 0 0 0 0.026883 0 0 0
িীবন 0 0.046598 0 0 0 0 0
ভারত 0 0 0.058248 0 0 0 0
শপে 0 0 0 0.026883 0 0 0.063543
হ়ে 0 0.046598 0 0 0 0 0
কায চ 0 0 0 0.026883 0 0 0
১৯৭১ 0 0 0 0 0.053767 0 0
২৬ 0 0 0 0 0.053767 0 0
রহমান 0.053767 0.006461 0 0.003727 0.007455 0.013844 0.00881

4.2. Cosine Similarity

Cosine similarity is applied to obtain the relationship between questions. The cosine similarity
is a measurement between two vectors that counts the cosine angle between them. The angle is
the judgment of orientation but not magnitude that can be identified by comparing between
vectors on a normalized space.

The cosine of two non-zero vectors can be derived as:

With the help of the TF-IDF table, we calculated the cosine similarly which is shown in the
table 2.

9
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

Table 2. Cosine Similarly

Relations Sentence-1 Sentence-2 Sentence-3 Sentence-4 Sentence-5

Question-1 0.597 0.016 0.0 0.060 0.018

Question-2 0.074 0.012 0.0 0.483 0.013

The Cosine Similarity (Q1, S1) is greater than the Cosine Similarity (Q1, S2) (0.597>0.016).
So it states, the percentage of question1's answer is 59.7% related to sentence-1 and 1.6%
related to sentence-2. In the same way, the cosine similarity can be ordered from the table such
as (Q1, S1) > (Q1, S4) > (Q1, S5) > ( Q1, S2) > (Q1, S3). Similarly, the Cosine Similarity (Q2,
S1) is less than the Cosine Similarity (Q2, S4) i.e. (0.074 < 0.483). So, the percentage of
question2's answer is 7.4% related to sentence1 and 48.3% related to sentence-4. So the cosine
similarity can be ordered from the table like (Q2, S4) > (Q2, S1) > (Q2, S5) > (Q2, S2) >( Q2,
S3). Threfore, the answer of question-1 stays in sentence-1 and the answer of question-2
remains in sentence-4. In this way, we can find out the relevant answer from the corpus.

5. Experiments
5.1. Experimental Tools

The whole work was performed in Anaconda distribution and Python 3.6 tools. For the sake of
clearing data, removing stop words, we have constructed a tool for Bengali language processing
mentioned as Bengali language Toolkit (BLTK) and we have applied NLTK system in many
pre-processing tasks.

5.2. Final Result

To test our proposed Bengali Information Retrieval System, we collected 672 articles from the
popular Bengali newspapers ‘The Daily ProthomAlo’ as our input documents. We created
19334 questions for testing data and obtained 97.22 % accuracy of our BIRS. From 19334
questions, the number of correct answers was 18797 and incorrect was 537. The performance to
retrieve information is pretty good and flexible with the lowest time complexity.

6. CONCLUSION AND FUTURE WORK

In this paper, we have presented a Bengali Information Retrieval System. To establish our
proposed system we have used various types of algorithms and methods such as lemmatization,
anaphora resolution procedure, TF-IDF and Cosine Similarity. The whole actions have been
processed with Bengali Language as part of BNLP. We have tested our proposed BIRS, noted
the accuracy, compared the correct and incorrect results.

In future, we have a plan to improve the system for educational purposes, industry, business and
personal tasks. And we want to use deep learning algorithms such as neural network for the
development of the system.

10
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

REFERENCES
[1] Singhal, A. (2001). “Modern information retrieval: A brief overview.”, IEEE Data
EngineeringBulletin 24(4), 35–43.

[2] Croft, W.B., Metzler, D. &Strohman, T. (2009). “ Search engines-information retrieval in

practice.”,Pearson education. http://www.search-engines-book.com/.

[3] Salton, G., Wong, A., & Yang, C. S. (1975). “A vector space model for automatic
indexing.” Communications of the ACM 18(11), 613–620.
http://dx.doi.org/10.1145/361219.361220.

[4] Robertson &S.E. (1997) “Readings in information retrieval”,The probability ranking principle in IR
(pp. 281–286). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.
acm.org/citation.cfm?id=275537.275701.

[5] Robertson, S. E., & Jones, K. S. (1988) “Relevance weighting of search terms” (pp. 143–160).
London, UK: Taylor Graham Publishing.

[6] Amati, G., & Van Rijsbergen, C. J. (2002). “Probabilistic models of information retrieval based on
measuring the divergence from randomness.” ACM Transactions on Information Systems 20(4),
357–389.

[7] Robertson, S. (2010). “The probabilistic relevance framework: BM25 and Beyond.” Foundations
and Trends in Information Retrieval 3(4), 333–389.

[8] Lavrenko, V., & Croft, W. B. (2001) “Relevance-based language models.” In W. B. Croft, D. J.
Harper, D.H.Kraft, &J.Zobel (eds.) SIGIR2001:Proceedings of the 24th annual international ACM
SIGIR conference on research and development in information retrieval, New Orleans, Louisiana,
USA(pp.120–127). ACM.https://doi.org/10.1145/383952. 383972.

[9]. Agichtein, E., Brill, E., &Dumais, S. (2006) “Improving web search ranking by incorporating user
behavior information.” , Proceedings of the 29th annual international ACM SIGIR conference on
Research and development in information retrieval, SIGIR 2006 (pp. 19–26). New York, NY, USA:
ACM. https://doi.org/10.1145/1148170.1148177.

[10] Sivic, J., &Zisserman, A. (2003) “Videogoogle: A text retrieval approach to object matching in
videos.” ,Proceedings of the ninth IEEE international conference on computer vision, ICCV 2003
(Vol. 2, pp. 1470–1477). Washington, DC, USA: IEEE Computer Society. http://dl.acm.
org/citation.cfm?id=946247.946751.

[11] Xu, S., Bao, S., Fei, B., Su, Z., & Yu, Y. (2008). “Exploring folksonomy for personalized search.” ,
Proceedings of the 31st annual international ACM SIGIR conference on Research and development
in information retrieval, SIGIR 2008 (pp. 155–162). New York, NY, USA: ACM

[12] M. K. I.Molla, & K. M.Talukder, ( 2007) “Bangla number extraction and recognition from the
document image” , International Conference. on Computer and Information Technology, ICCIT
2007, pp. 512-517.

[13] M. S. Islam, (2009) “Research on Bangla Language Processing in Bangladesh: Progress and
Challenges”, International Conference on Language & Development pp. 23-25.

[14] M.A. Hasnat, S.M. Habib, & M. Khan (2008) “A high-performance domain specific OCR for
Bangla script”, Novel Algorithms and Techniques In Telecommunications, Automation and
Industrial Electronics pp. 174-178, Springer, Dordrecht

11
International Journal on Natural Language Computing (IJNLC) Vol.8, No.5, October 2019

[15] G. Fink, S. Vajda, U. Bhattacharya, S. K. Parui& B. B. Chaudhuri, (2010). “ Online Bangla word
recognition using sub-stroke level features and hidden Markov models” International Conference.
on Frontiers in Handwriting Recognition, ICFHR 2010, pp. 393-398.

[16] K .Sarkar, (2012) “Bengali text summarization by sentence extraction”, arXiv preprint
arXiv:1201.224.

[17] A. Das & S. Bandyopadhyay, (2010).“Phrase-level Polarity Identification for Bengali” International
Journal of Computational Linguistics and Applications, IJCLA, 1(1-2), pp. 169-182.

[18] U. Bhattacharya, S. K. Parui, & S. Mondal, (2009) “Devanagari and Bangla Text Extraction from
Natural Scene Images”, International Conference on Document Analysis and Recognition, pp. 171-
175.

[19] A. Hassan, M.R. Amin, N. Mohammed, & A.K.A. Azad, (2016). “Sentiment Analysis on Bangla
and Romanized Bangla Text (BRBT) using Deep Recurrent models”, arXiv preprint
arXiv:1610.00369

Thesis Summary
No ratings yet
Thesis Summary
117 pages
Question Answering
No ratings yet
Question Answering
68 pages
NLP Week10 IR Enc Dec
No ratings yet
NLP Week10 IR Enc Dec
68 pages
Mod 4
No ratings yet
Mod 4
35 pages
Unit Iii - Information Retrieval Design Features of Information Retrieval Systems
No ratings yet
Unit Iii - Information Retrieval Design Features of Information Retrieval Systems
57 pages
AI Unit V
No ratings yet
AI Unit V
64 pages
NLP Unit-Ii (Part-I)
No ratings yet
NLP Unit-Ii (Part-I)
19 pages
ISE Information Retrieval Mod-V
No ratings yet
ISE Information Retrieval Mod-V
48 pages
NLP 4
No ratings yet
NLP 4
33 pages
Information Retrieval: Adt-V Unit
No ratings yet
Information Retrieval: Adt-V Unit
106 pages
Applications of NLP
No ratings yet
Applications of NLP
85 pages
Information Retrieval Dissertation
100% (2)
Information Retrieval Dissertation
5 pages
Ch2 - IR and LT
No ratings yet
Ch2 - IR and LT
45 pages
UNIT 6 Applications of NLP
No ratings yet
UNIT 6 Applications of NLP
60 pages
Bulu
No ratings yet
Bulu
47 pages
ISE Information Retrieval Mod-V (Uploaded by Snaptricks - In)
No ratings yet
ISE Information Retrieval Mod-V (Uploaded by Snaptricks - In)
48 pages
Text Similarity Algorithms
No ratings yet
Text Similarity Algorithms
28 pages
Unit 5 - Query Operations and Languages
No ratings yet
Unit 5 - Query Operations and Languages
11 pages
Question Answering, Information Retrieval, and Retrieval Augmented Generation
No ratings yet
Question Answering, Information Retrieval, and Retrieval Augmented Generation
22 pages
Information Retreival Methods
No ratings yet
Information Retreival Methods
19 pages
Designing and Building An Automatic Information Re
No ratings yet
Designing and Building An Automatic Information Re
7 pages
A Bangla Text Search Engine Using Pointwise Approach of Learn To Rank (LTR) Algorithm
No ratings yet
A Bangla Text Search Engine Using Pointwise Approach of Learn To Rank (LTR) Algorithm
6 pages
SQL Server Notes
No ratings yet
SQL Server Notes
396 pages
NLP - Module 5
No ratings yet
NLP - Module 5
58 pages
Chapter Five (ISR)
No ratings yet
Chapter Five (ISR)
17 pages
Unit 2 Irt
No ratings yet
Unit 2 Irt
33 pages
Information Retrieval - 1
No ratings yet
Information Retrieval - 1
47 pages
PPT08-Natural Language Processing
100% (1)
PPT08-Natural Language Processing
44 pages
SQL PL SQL Interview Questions
100% (1)
SQL PL SQL Interview Questions
74 pages
Cs8080 Ir Unit2 I Modeling and Retrieval Evaluation
No ratings yet
Cs8080 Ir Unit2 I Modeling and Retrieval Evaluation
42 pages
1 Information Retrieval System
No ratings yet
1 Information Retrieval System
10 pages
IR Models: - Why IR Models? - Boolean IR Model - Vector Space IR Model - Probabilistic IR Model
No ratings yet
IR Models: - Why IR Models? - Boolean IR Model - Vector Space IR Model - Probabilistic IR Model
46 pages
Ccs369-Unit 3
No ratings yet
Ccs369-Unit 3
28 pages
Artificial Intelligence in Information Retrieval
No ratings yet
Artificial Intelligence in Information Retrieval
5 pages
BigData Unit4
No ratings yet
BigData Unit4
70 pages
Ii-Ii-Dbms-Courses File
No ratings yet
Ii-Ii-Dbms-Courses File
207 pages
IR Chapter 4
No ratings yet
IR Chapter 4
15 pages
BDDcpaq
No ratings yet
BDDcpaq
664 pages
Informatin and Storage Retrieval Group - 5 Sec - 2 Assiment
No ratings yet
Informatin and Storage Retrieval Group - 5 Sec - 2 Assiment
14 pages
CCS369 - TSS-Unit 3
No ratings yet
CCS369 - TSS-Unit 3
55 pages
Tamrakar 2015
No ratings yet
Tamrakar 2015
6 pages
DAMA DMBOK2 Framework V2 20140317 FINAL PDF
88% (8)
DAMA DMBOK2 Framework V2 20140317 FINAL PDF
27 pages
Performance Enhancement and Customization of Information Storage and Retrieval System
No ratings yet
Performance Enhancement and Customization of Information Storage and Retrieval System
32 pages
Language Independent Document
No ratings yet
Language Independent Document
10 pages
Chapter 2 Introduction To Data Science
No ratings yet
Chapter 2 Introduction To Data Science
47 pages
Unit Vapplications Notes
No ratings yet
Unit Vapplications Notes
13 pages
Database Management System (Mca) (1sem) Continue....... (Repaired) PDF
33% (3)
Database Management System (Mca) (1sem) Continue....... (Repaired) PDF
108 pages
Information Retrieval System-Chapter-1
No ratings yet
Information Retrieval System-Chapter-1
23 pages
Data Mining:: Concepts and Techniques
No ratings yet
Data Mining:: Concepts and Techniques
37 pages
Information Retrieval Models
No ratings yet
Information Retrieval Models
4 pages
Completed Unit II 17.7.17
No ratings yet
Completed Unit II 17.7.17
113 pages
IR Unit 2
No ratings yet
IR Unit 2
54 pages
Modern Information Retrieval: A Brief Overview
No ratings yet
Modern Information Retrieval: A Brief Overview
9 pages
SQL Basic
No ratings yet
SQL Basic
33 pages
Irt-23 Unit 2
No ratings yet
Irt-23 Unit 2
10 pages
Chap 4 Text IR PDF
No ratings yet
Chap 4 Text IR PDF
19 pages
Text Databases and Information Retrieval: Riloff, Hollaar@cs - Utah.edu&
No ratings yet
Text Databases and Information Retrieval: Riloff, Hollaar@cs - Utah.edu&
3 pages
A Language Independent Approach To Develop URDUIR System
No ratings yet
A Language Independent Approach To Develop URDUIR System
10 pages
Web Information Retrieval
No ratings yet
Web Information Retrieval
10 pages
Information Retrieval System
No ratings yet
Information Retrieval System
4 pages
Modern Information Retrieval Amit Singhal
No ratings yet
Modern Information Retrieval Amit Singhal
9 pages
DatabaseSystem Final
No ratings yet
DatabaseSystem Final
20 pages
Query Questions
No ratings yet
Query Questions
2 pages
Dspace-Cris-2022 02 00
No ratings yet
Dspace-Cris-2022 02 00
227 pages
International Journal on Natural Language Computing (IJNLC)
No ratings yet
International Journal on Natural Language Computing (IJNLC)
2 pages
QA Review: IR-based Question Answering
No ratings yet
QA Review: IR-based Question Answering
11 pages
cs419-519 Slides Part 2
No ratings yet
cs419-519 Slides Part 2
6 pages
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMMETRY AND A NOISYLABELS
No ratings yet
A ROBUST JOINT-TRAINING GRAPHNEURALNETWORKS MODEL FOR EVENT DETECTIONWITHSYMMETRY AND A NOISYLABELS
12 pages
Tutorial 4 Relational Data Model
No ratings yet
Tutorial 4 Relational Data Model
14 pages
Name Shivam Prasad Reg No. 15BCE1196
No ratings yet
Name Shivam Prasad Reg No. 15BCE1196
8 pages
International Journal On Natural Language Computing (IJNLC)
No ratings yet
International Journal On Natural Language Computing (IJNLC)
2 pages
SQL Introduction
No ratings yet
SQL Introduction
17 pages
Acm Iconiaac 2014
No ratings yet
Acm Iconiaac 2014
8 pages
Information Retrieval Is A Complex Process Because There Is No Infallible Way To Provide A Direct Connection Between A User
No ratings yet
Information Retrieval Is A Complex Process Because There Is No Infallible Way To Provide A Direct Connection Between A User
4 pages
Checklist For DWH Testing
No ratings yet
Checklist For DWH Testing
3 pages
Sridhar Rapala
No ratings yet
Sridhar Rapala
3 pages
Interlingual Syntactic Parsing: An Optimized Head-Driven Parsing For English To Indian Language Machine Translation
No ratings yet
Interlingual Syntactic Parsing: An Optimized Head-Driven Parsing For English To Indian Language Machine Translation
11 pages
Call For Papers - 6th International Conference On Natural Language Processing and Computational Linguistics (NLPCL 2025)
No ratings yet
Call For Papers - 6th International Conference On Natural Language Processing and Computational Linguistics (NLPCL 2025)
2 pages
Making Medical Experts Fit4ner: Transforming Domain Knowledge Through Machine Learning-Based Named Entity Recognition
No ratings yet
Making Medical Experts Fit4ner: Transforming Domain Knowledge Through Machine Learning-Based Named Entity Recognition
20 pages
Call For Papers - 6th International Conference On NLP Trends & Technologies (NLPTT 2025)
No ratings yet
Call For Papers - 6th International Conference On NLP Trends & Technologies (NLPTT 2025)
2 pages
Call For Papers - 6th International Conference On Natural Language Computing Advances (NLCA 2025)
No ratings yet
Call For Papers - 6th International Conference On Natural Language Computing Advances (NLCA 2025)
2 pages
9.OSINT Documentation
No ratings yet
9.OSINT Documentation
86 pages
Analysis of Algorithms CS 477/677: Instructor: Monica Nicolescu
No ratings yet
Analysis of Algorithms CS 477/677: Instructor: Monica Nicolescu
39 pages
Unit Ii
No ratings yet
Unit Ii
45 pages
International Journal On Natural Language Computing (IJNLC)
No ratings yet
International Journal On Natural Language Computing (IJNLC)
2 pages
Orchestrating Multi-Agent Systems For Multi-Source Information Retrieval and Question Answering With Large Language Models
No ratings yet
Orchestrating Multi-Agent Systems For Multi-Source Information Retrieval and Question Answering With Large Language Models
20 pages
Big Data Analytics in Cloud Computing: An Overview
No ratings yet
Big Data Analytics in Cloud Computing: An Overview
11 pages
Fundamentals of Database Systems: Lesson 2: Data Models
No ratings yet
Fundamentals of Database Systems: Lesson 2: Data Models
39 pages
Practical No 1 Aim Horizontal Fragmentat
No ratings yet
Practical No 1 Aim Horizontal Fragmentat
42 pages
W5a. Peer-Graded Final Assignement - Analyze The Data
No ratings yet
W5a. Peer-Graded Final Assignement - Analyze The Data
3 pages
Jena API - Introduction
No ratings yet
Jena API - Introduction
14 pages
Human Intention Space - Natural Language Phrase Driven Approach To Place Social Computing Interaction in A Designed Space
No ratings yet
Human Intention Space - Natural Language Phrase Driven Approach To Place Social Computing Interaction in A Designed Space
19 pages
International Journal On Natural Language Computing (IJNLC)
No ratings yet
International Journal On Natural Language Computing (IJNLC)
2 pages
CS Holiday Assignment
No ratings yet
CS Holiday Assignment
5 pages
Inventory Classification With Ai: Evaluating How Large Language Models Enhance Categorization Using Unpsc Codes
No ratings yet
Inventory Classification With Ai: Evaluating How Large Language Models Enhance Categorization Using Unpsc Codes
12 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC)
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC)
2 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC)
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC)
2 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC)
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC)
2 pages
End-to-End Bangla AI For Solving Math Olympiad Problem Benchmark:Leveraging Large Language Model Using Integrated Approach
No ratings yet
End-to-End Bangla AI For Solving Math Olympiad Problem Benchmark:Leveraging Large Language Model Using Integrated Approach
11 pages
Ijnlc CFP 09
No ratings yet
Ijnlc CFP 09
2 pages
Call For Papers - 6th International Conference On Natural Language Processing and Applications (NLPA 2025)
No ratings yet
Call For Papers - 6th International Conference On Natural Language Processing and Applications (NLPA 2025)
2 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC)
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC)
2 pages
Call For Papers - 6th International Conference On NLP & Big Data (NLPD 2025)
No ratings yet
Call For Papers - 6th International Conference On NLP & Big Data (NLPD 2025)
2 pages
Design and Development of Morphological Analyzer For Tigrigna Verbs Using Hybrid Approach
No ratings yet
Design and Development of Morphological Analyzer For Tigrigna Verbs Using Hybrid Approach
12 pages
#0 e Scan Protected
No ratings yet
#0 e Scan Protected
1 page
Call For Papers - 6th International Conference On NLP & Big Data (NLPD 2025)
No ratings yet
Call For Papers - 6th International Conference On NLP & Big Data (NLPD 2025)
2 pages
Call For Papers - 4th International Conference On Machine Learning, NLP and Data Mining (MLDA 2025)
No ratings yet
Call For Papers - 4th International Conference On Machine Learning, NLP and Data Mining (MLDA 2025)
2 pages
High Accuracy Location Information Extraction From Social Network Texts Using Natural Language Processing
No ratings yet
High Accuracy Location Information Extraction From Social Network Texts Using Natural Language Processing
12 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC)
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC)
2 pages
CFP - International Journal On Natural Language Computing (IJNLC)
No ratings yet
CFP - International Journal On Natural Language Computing (IJNLC)
2 pages
Mathematical Formulas For Generating Syllables Used in Arabic Speech Synthesis
No ratings yet
Mathematical Formulas For Generating Syllables Used in Arabic Speech Synthesis
10 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC) - H - Index - 25
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC) - H - Index - 25
2 pages
Shared Dimensions
No ratings yet
Shared Dimensions
21 pages
Call For Papers - International Journal On Natural Language Computing (IJNLC)
No ratings yet
Call For Papers - International Journal On Natural Language Computing (IJNLC)
2 pages
Index - Technical Documentation For IFS Cloud
No ratings yet
Index - Technical Documentation For IFS Cloud
4 pages
Committing and Rolling Back A Transaction Using A Stored Procedure
No ratings yet
Committing and Rolling Back A Transaction Using A Stored Procedure
9 pages

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Bengali Information Retrieval System (Birs)

Uploaded by

Bengali Information Retrieval System (Birs)

Uploaded by

International Journal on Natural Language Computing (IJNLC) Vol.8, No.

BENGALI INFORMATION RETRIEVAL

Much research is performed previously on summarization, such as extracting Bangla sentences

Fig.1: Proposed Work.

3.1. Category: Information

Question-1: বঙ্গবন্ধু শেখ মুজিবুর রহমান শকাো়ে িন্ম গ্রহণ কলরন?

Cleaning word refers to remove an unwanted character which doesn’t sentiment to an

3.3.4. Stop Words Removing

3.3.5. Lemmatization for Bangla Language

Fig.2. Bangla Lemmatization Process

3.3.6. Synonyms Words Processing

After preprocessing the sentences and questions look the following:

Question-1: বঙ্গবন্ধু শেখ মুজিবুর রহমান িন্ম গ্রহণ করা

Here, the TF-IDF from the previous example is shown in Table 1.

Table 1.TF-IDF of Sentences and Questions

Terms Sent-1 Sent-2 Sent-3 Sent-4 Sent-5 Ques-1 Ques-2

4.2. Cosine Similarity

The cosine of two non-zero vectors can be derived as:

Table 2. Cosine Similarly

Relations Sentence-1 Sentence-2 Sentence-3 Sentence-4 Sentence-5

Question-1 0.597 0.016 0.0 0.060 0.018

Question-2 0.074 0.012 0.0 0.483 0.013

5.2. Final Result

6. CONCLUSION AND FUTURE WORK

[2] Croft, W.B., Metzler, D. &Strohman, T. (2009). “ Search engines-information retrieval in

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.