
Information Retrieval

By
Dr. Pankaj Dadure
Assistant Professor
SoCS, UPES Dehradun
Basic assumptions of Information Retrieval
• Collection: A set of documents
• Assume it is a static collection for the moment

• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Information Retrieval
• Information retrieval (IR) may be defined as a software program that deals with
the organization, storage, retrieval and evaluation of information from document
repositories.
• The system assists users in finding the information they require, but it does not explicitly return the answers to their questions. Instead, it informs the user of the existence and location of documents that might contain the required information.
• The documents that satisfy the user's requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.
The classic search model
Design features of Information Retrieval systems
• Inverted Index: The primary data structure of most IR systems is the inverted index. An inverted index is a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document.
• Stop Word Elimination: Stop words are high-frequency words that are deemed unlikely to be useful for searching.
• Stemming: Stemming, a simplified form of morphological analysis, is the heuristic process of extracting the base form of words by chopping off their endings. (A combined sketch of these three features follows this list.)
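Below is a minimal Python sketch combining these three design features; the toy stop-word list, the crude suffix-stripping rule, and the two example documents are illustrative assumptions, not part of the original slides.

from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "is"}   # assumed toy stop-word list

def stem(word):
    # Crude heuristic stemming: chop common suffixes off the end of the word.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    # Map each term to {doc_id: frequency of the term in that document}.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue                      # stop word elimination
            index[stem(token)][doc_id] += 1   # posting list with term frequencies
    return index

docs = {1: "the cat runs behind the rat", 2: "the dog runs behind the cat"}
print(dict(build_inverted_index(docs)["run"]))   # {1: 1, 2: 1}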
IR vs DBMS
The Boolean IR Model
• It is a simple retrieval model based on set theory and Boolean algebra. Queries are expressed as Boolean expressions, which have precise semantics. The retrieval strategy is based on a binary decision criterion: the Boolean model only considers whether index terms are present or absent in a document.
• In exact match, a query specifies precise criteria, and each document either matches or fails to match the query. The result of an exact match is a set of documents (without ranking).
• Partial matches and ranking are not supported.
Basic Assumptions of the Boolean Model
1. An index term is either present (1) or absent (0) in the document.
2. All index terms provide equal evidence with respect to information needs.
3. Queries are Boolean combinations of index terms.
• X AND Y: documents that contain both X and Y
• X OR Y: documents that contain either X or Y
• NOT X: documents that do not contain X
Document-term matrix
• A document-term matrix is a mathematical matrix that describes the frequency of terms
that occur in a collection of documents.
Example:
Input collection
• document 1 = 'term1 term3'
• document 2 = 'term2 term4 term6'
• document 3 = 'term1 term2 term3 term4 term5'
• document 4 = 'term1 term3 term6'
• document 5 = 'term3 term4'
• Query problem: Find the documents containing term1 and term3 but not term2 (term1 ∧ term3 ∧ ¬term2)
Input collection in the Boolean model
• document 1 = 'term1 term3'
• document 2 = 'term2 term4 term6'
• document 3 = 'term1 term2 term3 term4 term5'
• document 4 = 'term1 term3 term6'
• document 5 = 'term3 term4'
Query in the Boolean Model: (term1 ∧ term3 ∧ ¬term2)
Retrieved Results for the given Query
• document 1 : 1 ∧ 1 ∧ 1 = 1
• document 2 : 0 ∧ 0 ∧ 0 = 0
• document 3 : 1 ∧ 1 ∧ 0 = 0
• document 4 : 1 ∧ 1 ∧ 1 = 1
• document 5 : 0 ∧ 1 ∧ 1 = 0

Based on the above computation, document 1 and document 4 are relevant to the given query (a small code sketch of this evaluation follows).
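The same evaluation can be sketched in Python; the dictionary of index-term sets below simply encodes the example collection, and the helper name is illustrative.

# Boolean model: each document is the set of index terms it contains; a term is present (1) or absent (0).
docs = {
    1: {"term1", "term3"},
    2: {"term2", "term4", "term6"},
    3: {"term1", "term2", "term3", "term4", "term5"},
    4: {"term1", "term3", "term6"},
    5: {"term3", "term4"},
}

def matches(terms):
    # Query: term1 AND term3 AND NOT term2
    return ("term1" in terms) and ("term3" in terms) and ("term2" not in terms)

relevant = [doc_id for doc_id, terms in docs.items() if matches(terms)]
print(relevant)   # [1, 4] -- the same result as the hand computation above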
Vector Space Model
• A vector space model is an algebraic model involving two steps:
• in the first step, we represent the text documents as vectors of words
• in the second step, we transform these vectors into a numerical format
• so that we can apply text mining techniques such as information retrieval, information extraction, and information filtering
Example
Consider the statements below and a query term. The statements are referred to as documents hereafter.
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat

Result: the relevant document for the Query = greater of (similarity score between (Document 1, Query), similarity score between (Document 2, Query))
Document vectors representation
Preprocessing: each document is broken into words, and preprocessing steps such as removing stop words, punctuation, and special characters are applied.
Below is a sample representation of the document vectors.
• Document 1: (cat, run, behind, rat)
• Document 2: (dog, run, behind, cat)
• Query: (rat)
Term Document Matrix
• A term document matrix is a way of representing document vectors in a matrix format in which each row represents a term vector across all the documents and each column represents a document vector across all the terms.
• The weights can be computed using a method known as term frequency – inverse document frequency (tf-idf), which gives higher weight to terms that occur often in a document but rarely in other documents, and lower weight to terms that occur commonly within and across all the documents.
tf-idf = tf × idf

tf = term frequency, the number of times a term occurs in a document
idf = inverse document frequency, where idf = log(N/df), N is the total number of documents, and df is the number of documents containing the term
Term Document Matrix
Term document matrix and inverse document frequency for the example collection (N = 2 documents):
Term     Document 1  Document 2  Query   df   idf = log(N/df)
cat          1           1         0      2    log(2/2) = 0
run          1           1         0      2    log(2/2) = 0
behind       1           1         0      2    log(2/2) = 0
rat          1           0         1      1    log(2/1) = 0.30103
dog          0           1         0      1    log(2/1) = 0.30103
tf-idf calculation

The relevant document for the Query = greater of (similarity score between (Document 1, Query), similarity score between (Document 2, Query))

= greater of (similarity score between (0.30103, 0.30103), similarity score between (0.30103, 0))

Since the tf-idf weight of 'rat' matches in Document 1 but is 0 in Document 2, Document 1 is returned as the relevant document (a worked code sketch follows).
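A compact Python sketch of this calculation for the two-document example is given below; it assumes the document frequency is computed over the two documents only and that the logarithm is base 10, which reproduces the 0.30103 weight shown above. Cosine similarity is used as the similarity score, a common choice for the vector space model.

import math

docs = {
    "Document 1": ["cat", "run", "behind", "rat"],
    "Document 2": ["dog", "run", "behind", "cat"],
}
query = ["rat"]
N = len(docs)                                    # total number of documents

vocab = sorted({t for terms in docs.values() for t in terms})
df = {t: sum(t in terms for terms in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}  # idf(rat) = log10(2/1) = 0.30103

def tfidf_vector(terms):
    # tf-idf weight for every vocabulary term of a document (or query).
    return [terms.count(t) * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(query)
for name, terms in docs.items():
    print(name, round(cosine(tfidf_vector(terms), q_vec), 5))
# Document 1 1.0, Document 2 0.0 -- Document 1 is returned as the relevant document.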
Graphical overview of the Vector Space IR Model
Question Answering (QA)
• Question answering (QA) is a field of natural language processing (NLP) and artificial intelligence (AI) that aims to develop systems that can understand and answer questions posed in natural language.
• The point of a QA system is to understand the question and give an answer that is correct and helpful.
• QA systems can be based on various techniques, including information retrieval–based, knowledge-based, generative, and rule-based approaches. Each method has its strengths and weaknesses, and the choice of method depends on the project's specific needs.
A classification of question answering systems
• Open domain Question Answering system
Open domain Question Answering systems are not restricted to any specific domain and provide a short answer to a question addressed in natural language.
• Closed domain Question Answering system
In a closed domain QA system, the domain is restricted, often to a web-based collection, and the questions are related to that specific domain. A closed domain Question Answering system maintains a limited repository of domain-specific documents and can answer a limited number of questions; hence, in closed domain QA systems, the quality of answers is high. Closed domain QA systems answer domain-specific questions, and answers are searched within domain-specific document collections.
Classification based on types of questions

The different categories of question types are:

• Factoid type questions
Factoid type questions commonly begin with a wh-word. These questions are simple to answer and fact based, needing answers in a single sentence or short phrase. For instance, the factoid type question "What is the capital of India?"
• List type questions
List type questions need a list of facts or entities as answers, e.g., list the names of movies released in 2017.
• Confirmation questions
Confirmation questions need answers in the form of yes or no. For instance, the confirmation type question "Is Rahul good?" asks for the answer yes or no.
Classification based on types of questions
• Causal Questions
The answers to causal questions are not named entities, as in factoid type questions; causal questions need descriptive answers about an entity. Causal questions are asked by users who desire reasons, explanations, elaborations, etc. related to particular objects or events.
• Hypothetical Questions
Hypothetical questions request information associated with a hypothetical event and have no specific answers. Hypothetical questions usually start with 'what would happen if'. The reliability and accuracy of the answers to these questions are low and depend upon the user and context.
Classification based on types of questions
• Complex questions
Complex questions are more difficult to answer, and their answers generally consist of a list of "nuggets". A complex question such as "Two trains running in opposite directions cross a man standing on the platform in 27 seconds and 17 seconds respectively and they cross each other in 23 seconds. What is the ratio of their speeds?" often requires inferring and synthesizing information from multiple documents to get multiple nuggets as answers.
Architecture of QA
A natural language question-answering system is a computer program that automatically answers questions using NLP. The basic process of a natural language QA system includes the following steps (a toy end-to-end sketch follows this list):
1. Text pre-processing: The question is pre-processed to remove irrelevant information and standardize the text’s
format. This step includes tokenization, lemmatization, and stop-word removal, among others.
2. Question processing: The pre-processed question is analyzed to extract the relevant entities and concepts and
to identify the type of question being asked. This step can be done using natural language processing (NLP)
techniques such as named entity recognition, dependency parsing, and part-of-speech tagging.
3. Information retrieval: The question is used to search a database or corpus of text to retrieve the most relevant
information. This can be done using information retrieval techniques such as keyword search or semantic
search.
4. Answer processing: The retrieved information is analyzed to extract the specific answer to the question. This
can be done using various techniques, such as machine learning algorithms, rule-based systems, or a
combination.
5. Ranking: The extracted answers are ranked based on relevance and confidence score.
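The steps above can be illustrated with a highly simplified Python sketch that uses keyword overlap for both retrieval and answer scoring; the tiny corpus, the stop-word list, and the function names are illustrative assumptions, and a real system would use named entity recognition, parsing, and stronger retrieval as described in the steps.

import re

STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in"}

corpus = [
    "New Delhi is the capital of India.",
    "The Ganges is a major river in India.",
]

def preprocess(text):
    # Step 1: lowercase, tokenize, and remove stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def answer(question, corpus, top_k=1):
    q_terms = set(preprocess(question))                     # Step 2: question processing (keywords only)
    scored = []
    for sentence in corpus:
        overlap = len(q_terms & set(preprocess(sentence)))  # Step 3: retrieval by keyword overlap
        scored.append((overlap, sentence))                  # Step 4: candidate answer with a crude confidence
    scored.sort(reverse=True)                               # Step 5: ranking by confidence score
    return [s for _, s in scored[:top_k]]

print(answer("What is the capital of India?", corpus))
# ['New Delhi is the capital of India.']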
Applications of QA
• Fact-checking: verifying whether a fact is true by posing a question like "Is fact X true or false?"
• Customer service
• Technical support
• Market research
• Generating reports or conducting research.
Sentiment Analysis
• Sentiment analysis is a popular task in natural language processing. The goal of sentiment analysis is to classify text based on the mood or mentality expressed in it, which can be positive, negative, or neutral.
• Sentiment analysis aims to analyze people's opinions in a way that can help businesses expand.
• It focuses not only on polarity (positive, negative & neutral) but also on emotions (happy, sad, angry, etc.).
Types of Sentiment Analysis
1. Fine-grained sentiment analysis: This depends on polarity. The text can be categorized as very positive, positive, neutral, negative, or very negative. The rating is done on a scale of 1 to 5: if the rating is 5 it is very positive, 2 is negative, and 3 is neutral.
2. Emotion detection: Sentiments such as happy, sad, angry, upset, jolly, pleasant, and so on come under emotion detection. It is also known as the lexicon method of sentiment analysis.
3. Aspect-based sentiment analysis: This focuses on a particular aspect; for instance, if a person wants to evaluate the features of a cell phone, aspects such as the battery, screen, and camera quality are checked.
4. Multilingual sentiment analysis: Multilingual analysis covers different languages, where the classification still needs to be done as positive, negative, or neutral. This is highly challenging and comparatively difficult.
How does Sentiment Analysis work
1. Rule-based approach: Here, the lexicon method, tokenization, and parsing are used. The approach counts the number of positive and negative words in the given text: if the number of positive words is greater than the number of negative words, the sentiment is positive, and vice versa (a minimal sketch follows this list).
2. Machine Learning approach: This approach works with machine learning techniques. First, models are trained on labeled datasets and predictive analysis is done. Features are then extracted from the text and classified using techniques such as Naive Bayes, Support Vector Machines, hidden Markov models, and conditional random fields.
3. Neural network approach: In the last few years neural networks have evolved at a very fast rate. This approach uses artificial neural networks, inspired by the structure of the human brain, to classify text into positive, negative, or neutral sentiments. Recurrent neural networks, long short-term memory (LSTM), gated recurrent units (GRU), etc. are used to process sequential data like text.
4. Hybrid approach: This is the combination of two or more approaches, e.g. rule-based and machine learning approaches. The benefit is that the accuracy is higher compared to either approach alone.
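A minimal Python sketch of the rule-based (lexicon-counting) approach from item 1 above; the tiny positive and negative word lists are illustrative assumptions rather than a real sentiment lexicon.

# Rule-based sentiment: count positive and negative lexicon words in the text.
POSITIVE = {"good", "great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "poor", "sad", "terrible", "hate"}

def rule_based_sentiment(text):
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The battery is great but the camera is bad and the screen is poor"))
# 'negative' -- one positive word against two negative words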
Applications
1. Social Media: Comments on social media sites such as Instagram are analyzed and categorized as positive, negative, or neutral.
2. Customer Service: On the Play Store, reviews with ratings from 1 to 5 are analyzed with the help of sentiment analysis approaches.
3. Marketing Sector: In the marketing area, a particular product can be reviewed as good or bad.
4. Reviewer side: Reviewers can look at the comments and give an overall review of the product.
Challenges of Sentiment Analysis
• If the sentiment is carried by tone, it becomes really difficult to detect whether the comment is pessimistic or optimistic.
• If the data is in the form of emojis, the system needs to detect whether the meaning is good or bad.
• Detecting ironic, sarcastic, and comparative comments is really hard.
• Classifying neutral statements correctly is a big task.
