NLP 4
By
Dr. Pankaj Dadure
Assistant Professor
SoCS, UPES Dehradun
Basic assumptions of Information Retrieval
• Collection: A set of documents
• Assume it is a static collection for the moment
Based on the above computation, Document 1 and Document 4 are relevant to the
given query.
Vector Space Model
• A vector space model is an algebraic model that involves two steps:
• In the first step, each text document is represented as a vector of words.
• In the second step, that vector is transformed into a numerical format.
• Text mining techniques such as information retrieval, information extraction,
and information filtering can then be applied to the numerical vectors.
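The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `to_count_vectors` is hypothetical, and raw word counts are assumed as the numerical format.

```python
from collections import Counter

def to_count_vectors(documents):
    """Step 1: split each document into words. Step 2: map the words to counts."""
    # Step 1: each document becomes a list of lowercase words.
    tokenized = [doc.lower().split() for doc in documents]
    # Shared vocabulary, sorted so every vector uses the same term order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    # Step 2: each document becomes a numeric vector over the vocabulary.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = to_count_vectors(["Cat runs behind rat", "Dog runs behind cat"])
for row in vectors:
    print(dict(zip(vocabulary, row)))
```

Once every document is a vector over the same vocabulary, documents and queries can be compared with ordinary vector operations.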
Example
Consider the statements below and a query term. The statements are referred to as documents hereafter.
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat
Result: the document relevant to the Query = the one with the greater of (similarity score between (Document 1, Query),
similarity score between (Document 2, Query))
Document vectors representation
Preprocessing: Break each document into words and apply preprocessing steps such as
removing stopwords, punctuation, and special characters.
Below is a sample representation of the document vectors.
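As a rough sketch of how such vectors could be produced, the code below preprocesses the two example documents and weights each term by TF-IDF. The weighting scheme is an assumption: idf(t) = log10(N / df(t)) is used because it reproduces the value 0.30103 that appears in this example. The stopword list and the function names are hypothetical.

```python
import math
import string

STOPWORDS = {"a", "an", "the", "is", "of"}  # hypothetical, deliberately tiny list

def preprocess(text):
    """Lowercase, strip punctuation, split into words, and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]

def tfidf_vectors(documents):
    tokenized = [preprocess(doc) for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Assumed weighting: idf(t) = log10(N / df(t)), df(t) = documents containing t.
    idf = {
        term: math.log10(n_docs / sum(term in tokens for tokens in tokenized))
        for term in vocabulary
    }
    # Weight of a term in a document = raw term count * idf.
    vectors = [[tokens.count(term) * idf[term] for term in vocabulary]
               for tokens in tokenized]
    return vocabulary, idf, vectors

vocabulary, idf, vectors = tfidf_vectors(["Cat runs behind rat", "Dog runs behind cat"])
for name, row in zip(["Document 1", "Document 2"], vectors):
    print(name, {term: round(weight, 5) for term, weight in zip(vocabulary, row)})
```

Under this assumption, terms that occur in both documents (cat, runs, behind) get weight 0, while "rat" and "dog" each occur in one of the two documents and get a non-zero weight.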
Substituting the values from the document vectors: the document relevant to the Query = the one with the greater of
(similarity score between (0.30103, 0.30103), similarity score between (0.30103, 0))
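The comparison can be sketched as follows. The slides do not name the similarity measure, so cosine similarity is assumed here. The vectors are also an assumption: they use the term order (behind, cat, dog, rat, runs) and log10-based IDF weights, chosen to be consistent with the value 0.30103.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

W = math.log10(2)  # approximately 0.30103

# Assumed term order: behind, cat, dog, rat, runs
doc_vectors = {
    "Document 1": [0, 0, 0, W, 0],  # only "rat" has a non-zero weight
    "Document 2": [0, 0, W, 0, 0],  # only "dog" has a non-zero weight
}
query_vector = [0, 0, 0, W, 0]      # Query: rat

scores = {name: cosine_similarity(vec, query_vector)
          for name, vec in doc_vectors.items()}
best = max(scores, key=scores.get)
print(scores, "-> most relevant:", best)
```

Document 1 shares the term "rat" with the query and Document 2 does not, so Document 1 gets the higher score and is returned as the relevant document.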
Graphical overview of the Vector Space IR Model
Question Answering (QA)
• Question answering (QA) is a field of natural language processing (NLP) and artificial
intelligence (AI) that aims to develop systems that can understand and answer questions
posed in natural language.
• The point of a QA system is to understand the question and give an answer that is correct
and helpful.
• QA systems can be based on various techniques, including information retrieval,
knowledge-based, generative, and rule-based approaches. Each method has its strengths
and weaknesses, and the choice of method depends on the project’s specific needs.
A classification of question answering systems
• Open domain Question Answering system
Open domain Question Answering systems are not restricted to any specific domain
and provide a short answer to a question posed in natural language.
• Closed domain Question Answering system
In a closed domain QA system, there is a restriction of domain, which is based on the web, and
questions are related to a specific domain. The system consists of a limited repository
of domain-specific questions, so it can answer only a limited number of questions.
As a result, the quality of its answers is high. Answers are searched for within
domain-specific document collections.
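A closed domain system built on a limited repository can be sketched as below. This is a toy illustration only: the library help-desk repository, the function names, the word-overlap (Jaccard) matching, and the 0.5 threshold are all hypothetical choices, not a method taken from these slides.

```python
def tokens(text):
    """Lowercase a question and return its set of words, ignoring '?'."""
    return set(text.lower().replace("?", "").split())

# Hypothetical domain-specific repository (library help desk).
FAQ = {
    "What are the library opening hours?": "The library is open from 9 am to 8 pm.",
    "How many books can I borrow?": "You can borrow up to five books at a time.",
}

def answer(question, repository=FAQ, threshold=0.5):
    """Return the stored answer whose question overlaps most, or None if out of domain."""
    asked = tokens(question)
    best_score, best_answer = 0.0, None
    for stored_question, stored_answer in repository.items():
        stored = tokens(stored_question)
        # Jaccard overlap: shared words divided by all distinct words.
        score = len(asked & stored) / len(asked | stored)
        if score > best_score:
            best_score, best_answer = score, stored_answer
    return best_answer if best_score >= threshold else None

print(answer("What are the opening hours of the library?"))
print(answer("Who won the match yesterday?"))
```

The second question falls outside the repository, so the function returns None instead of guessing. This reflects the trade-off described above: fewer questions can be answered, but the answers come from curated domain content.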
Classification based on types of questions
• If the data is in the form of emoji, then you need to detect whether it is good or bad.