
Information Retrieval

By
Dr. Pankaj Dadure
Assistant Professor
SoCS, UPES Dehradun
Basic assumptions of Information Retrieval
• Collection: A set of documents
• Assume it is a static collection for the moment

• Goal: Retrieve documents with information that is relevant to the user's information need and helps the user complete a task
Information Retrieval
• Information retrieval (IR) may be defined as a software program that deals with
the organization, storage, retrieval and evaluation of information from document
repositories.
• The system assists users in finding the information they require, but it does not explicitly return the answers to their questions. Instead, it informs the user of the existence and location of documents that might contain the required information.
• The documents that satisfy the user's requirement are called relevant documents. A perfect IR system would retrieve only relevant documents.
The classic search model
Design features of Information Retrieval systems
• Inverted Index: The primary data structure of most IR systems is the inverted index. An inverted index is a data structure that lists, for every word, all documents that contain it and the frequency of its occurrences in each document.
• Stop Word Elimination: Stop words are high-frequency words that are deemed unlikely to be useful for searching.
• Stemming: Stemming, a simplified form of morphological analysis, is the heuristic process of extracting the base form of words by chopping off their endings. (A combined sketch of these three features follows this list.)
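Below is a minimal Python sketch combining these three design features; the toy stop-word list, the crude suffix-stripping rule, and the two example documents are illustrative assumptions, not part of the original slides.

from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "in", "is"}   # assumed toy stop-word list

def stem(word):
    # Crude heuristic stemming: chop common suffixes off the end of the word.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    # Map each term to {doc_id: frequency of the term in that document}.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().split():
            if token in STOP_WORDS:
                continue                      # stop word elimination
            index[stem(token)][doc_id] += 1   # posting list with term frequencies
    return index

docs = {1: "the cat runs behind the rat", 2: "the dog runs behind the cat"}
print(dict(build_inverted_index(docs)["run"]))   # {1: 1, 2: 1}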
IR vs DBMS
The Boolean IR Model
• It is a simple retrieval model based on set theory and Boolean algebra. Queries are expressed as Boolean expressions, which have precise semantics. The retrieval strategy is based on a binary decision criterion: the Boolean model only considers whether index terms are present or absent in a document.
• In exact match, a query specifies precise criteria, and each document either matches or fails to match the query. The result of an exact match is a set of documents (without ranking).
• Partial matches and ranking are not supported.
Basic Assumptions of the Boolean Model
1. An index term is either present (1) or absent (0) in the document.
2. All index terms provide equal evidence with respect to information needs.
3. Queries are Boolean combinations of index terms.
• X AND Y: documents that contain both X and Y
• X OR Y: documents that contain either X or Y
• NOT X: documents that do not contain X
Document-term matrix
• A document-term matrix is a mathematical matrix that describes the frequency of terms
that occur in a collection of documents.
Example:
Input collection
• document 1 = 'term1 term3'
• document 2 = 'term2 term4 term6'
• document 3 = 'term1 term2 term3 term4 term5'
• document 4 = 'term1 term3 term6'
• document 5 = 'term3 term4'
• Query problem: Find the documents containing term1 and term3 but not term2 (term1 ∧ term3 ∧ ¬term2)
Input collection in the Boolean model
• document 1 = 'term1 term3'
• document 2 = 'term2 term4 term6'
• document 3 = 'term1 term2 term3 term4 term5'
• document 4 = 'term1 term3 term6'
• document 5 = 'term3 term4'
Query in the Boolean Model: (term1 ∧ term3 ∧ ¬term2)
Retrieved Results for the given Query
• document 1 : 1 ∧ 1 ∧ 1 = 1
• document 2 : 0 ∧ 0 ∧ 0 = 0
• document 3 : 1 ∧ 1 ∧ 0 = 0
• document 4 : 1 ∧ 1 ∧ 1 = 1
• document 5 : 0 ∧ 1 ∧ 1 = 0

Based on the above computation, document 1 and document 4 are relevant to the given query (a small code sketch of this evaluation follows).
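The same evaluation can be sketched in Python; the dictionary of index-term sets below simply encodes the example collection, and the helper name is illustrative.

# Boolean model: each document is the set of index terms it contains; a term is present (1) or absent (0).
docs = {
    1: {"term1", "term3"},
    2: {"term2", "term4", "term6"},
    3: {"term1", "term2", "term3", "term4", "term5"},
    4: {"term1", "term3", "term6"},
    5: {"term3", "term4"},
}

def matches(terms):
    # Query: term1 AND term3 AND NOT term2
    return ("term1" in terms) and ("term3" in terms) and ("term2" not in terms)

relevant = [doc_id for doc_id, terms in docs.items() if matches(terms)]
print(relevant)   # [1, 4] -- the same result as the hand computation above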
Vector Space Model
• A vector space model is an algebraic model involving two steps:
• in the first step, we represent the text documents as vectors of words
• in the second step, we transform these vectors into a numerical format
• so that we can apply text mining techniques such as information retrieval, information extraction, and information filtering
Example
Consider the statements below and a query term. The statements are referred to as documents hereafter.
Document 1: Cat runs behind rat
Document 2: Dog runs behind cat
Query: rat

Result: the relevant document for the Query = greater of (similarity score between (Document 1, Query), similarity score between (Document 2, Query))
Document vectors representation
Preprocessing: each document is broken into words, and preprocessing steps such as removing stop words, punctuation, and special characters are applied.
Below is a sample representation of the document vectors.
• Document 1: (cat, run, behind, rat)
• Document 2: (dog, run, behind, cat)
• Query: (rat)
Term Document Matrix
• A term document matrix is a way of representing document vectors in a matrix format in which each row represents a term vector across all the documents and each column represents a document vector across all the terms.
• The weights can be computed using a method known as term frequency – inverse document frequency (tf-idf), which gives higher weight to terms that occur often in a document but rarely in other documents, and lower weight to terms that occur commonly within and across all the documents.
tf-idf = tf × idf

tf = term frequency, the number of times a term occurs in a document
idf = inverse document frequency, where idf = log(N/df), N is the total number of documents, and df is the number of documents containing the term
Term Document Matrix
Term document matrix and inverse document frequency for the example collection (N = 2 documents):
Term     Document 1  Document 2  Query   df   idf = log(N/df)
cat          1           1         0      2    log(2/2) = 0
run          1           1         0      2    log(2/2) = 0
behind       1           1         0      2    log(2/2) = 0
rat          1           0         1      1    log(2/1) = 0.30103
dog          0           1         0      1    log(2/1) = 0.30103
tf-idf calculation

The relevant document for the Query = greater of (similarity score between (Document 1, Query), similarity score between (Document 2, Query))

= greater of (similarity score between (0.30103, 0.30103), similarity score between (0.30103, 0))

Since the tf-idf weight of 'rat' matches in Document 1 but is 0 in Document 2, Document 1 is returned as the relevant document (a worked code sketch follows).
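A compact Python sketch of this calculation for the two-document example is given below; it assumes the document frequency is computed over the two documents only and that the logarithm is base 10, which reproduces the 0.30103 weight shown above. Cosine similarity is used as the similarity score, a common choice for the vector space model.

import math

docs = {
    "Document 1": ["cat", "run", "behind", "rat"],
    "Document 2": ["dog", "run", "behind", "cat"],
}
query = ["rat"]
N = len(docs)                                    # total number of documents

vocab = sorted({t for terms in docs.values() for t in terms})
df = {t: sum(t in terms for terms in docs.values()) for t in vocab}
idf = {t: math.log10(N / df[t]) for t in vocab}  # idf(rat) = log10(2/1) = 0.30103

def tfidf_vector(terms):
    # tf-idf weight for every vocabulary term of a document (or query).
    return [terms.count(t) * idf[t] for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q_vec = tfidf_vector(query)
for name, terms in docs.items():
    print(name, round(cosine(tfidf_vector(terms), q_vec), 5))
# Document 1 1.0, Document 2 0.0 -- Document 1 is returned as the relevant document.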
Graphical overview of the Vector Space IR Model
Question Answering (QA)
• Question answering (QA) is a field of natural language processing (NLP) and artificial intelligence (AI) that aims to develop systems that can understand and answer questions posed in natural language.
• The point of a QA system is to understand the question and give an answer that is correct and helpful.
• QA systems can be based on various techniques, including information retrieval–based, knowledge-based, generative, and rule-based approaches. Each method has its strengths and weaknesses, and the choice of method depends on the project's specific needs.
A classification of question answering systems
• Open domain Question Answering system
Open domain Question Answering systems are not restricted to any specific domain and provide a short answer to a question addressed in natural language.
• Closed domain Question Answering system
In a closed domain QA system, the domain is restricted, often to a web-based collection, and the questions are related to that specific domain. A closed domain Question Answering system maintains a limited repository of domain-specific documents and can answer a limited number of questions; hence, in closed domain QA systems, the quality of answers is high. Closed domain QA systems answer domain-specific questions, and answers are searched within domain-specific document collections.
Classification based on types of questions

The different categories of question types are:

• Factoid type questions
Factoid type questions commonly begin with a wh-word. These questions are simple to answer and fact based, needing answers in a single sentence or short phrase. For instance, the factoid type question "What is the capital of India?"
• List type questions
List type questions need a list of facts or entities as answers, e.g., list the names of movies released in 2017.
• Confirmation questions
Confirmation questions need answers in the form of yes or no. For instance, the confirmation type question "Is Rahul good?" asks for the answer yes or no.
Classification based on types of questions
• Causal Questions
The answers to causal questions are not named entities, as in factoid type questions; causal questions need descriptive answers about an entity. Causal questions are asked by users who desire reasons, explanations, elaborations, etc. related to particular objects or events.
• Hypothetical Questions
Hypothetical questions request information associated with a hypothetical event and have no specific answers. Hypothetical questions usually start with 'what would happen if'. The reliability and accuracy of the answers to these questions are low and depend upon the user and context.
Classification based on types of questions
• Complex questions
Complex questions are more difficult to answer, and their answers generally consist of a list of "nuggets". A complex question such as "Two trains running in opposite directions cross a man standing on the platform in 27 seconds and 17 seconds respectively and they cross each other in 23 seconds. What is the ratio of their speeds?" often requires inferring and synthesizing information from multiple documents to get multiple nuggets as answers.
Architecture of QA
A natural language question-answering system is a computer program that automatically answers questions using NLP. The basic process of a natural language QA system includes the following steps (a toy end-to-end sketch follows this list):
1. Text pre-processing: The question is pre-processed to remove irrelevant information and standardize the text’s
format. This step includes tokenization, lemmatization, and stop-word removal, among others.
2. Question processing: The pre-processed question is analyzed to extract the relevant entities and concepts and
to identify the type of question being asked. This step can be done using natural language processing (NLP)
techniques such as named entity recognition, dependency parsing, and part-of-speech tagging.
3. Information retrieval: The question is used to search a database or corpus of text to retrieve the most relevant
information. This can be done using information retrieval techniques such as keyword search or semantic
search.
4. Answer processing: The retrieved information is analyzed to extract the specific answer to the question. This
can be done using various techniques, such as machine learning algorithms, rule-based systems, or a
combination.
5. Ranking: The extracted answers are ranked based on relevance and confidence score.
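The steps above can be illustrated with a highly simplified Python sketch that uses keyword overlap for both retrieval and answer scoring; the tiny corpus, the stop-word list, and the function names are illustrative assumptions, and a real system would use named entity recognition, parsing, and stronger retrieval as described in the steps.

import re

STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in"}

corpus = [
    "New Delhi is the capital of India.",
    "The Ganges is a major river in India.",
]

def preprocess(text):
    # Step 1: lowercase, tokenize, and remove stop words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def answer(question, corpus, top_k=1):
    q_terms = set(preprocess(question))                     # Step 2: question processing (keywords only)
    scored = []
    for sentence in corpus:
        overlap = len(q_terms & set(preprocess(sentence)))  # Step 3: retrieval by keyword overlap
        scored.append((overlap, sentence))                  # Step 4: candidate answer with a crude confidence
    scored.sort(reverse=True)                               # Step 5: ranking by confidence score
    return [s for _, s in scored[:top_k]]

print(answer("What is the capital of India?", corpus))
# ['New Delhi is the capital of India.']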
Applications of QA
• Fact-checking: verifying whether a fact is true by posing a question like "Is fact X true or false?"
• Customer service
• Technical support
• Market research
• Generating reports or conducting research.
Sentiment Analysis
• Sentiment analysis is a popular task in natural language processing. The goal of sentiment analysis is to classify text based on the mood or mentality expressed in it, which can be positive, negative, or neutral.
• Sentiment analysis aims to analyze people's opinions in a way that can help businesses expand.
• It focuses not only on polarity (positive, negative & neutral) but also on emotions (happy, sad, angry, etc.).
Types of Sentiment Analysis
1. Fine-grained sentiment analysis: This depends on polarity. The text can be categorized as very positive, positive, neutral, negative, or very negative. The rating is done on a scale of 1 to 5: if the rating is 5 it is very positive, 2 is negative, and 3 is neutral.
2. Emotion detection: Sentiments such as happy, sad, angry, upset, jolly, pleasant, and so on come under emotion detection. It is also known as the lexicon method of sentiment analysis.
3. Aspect-based sentiment analysis: This focuses on a particular aspect; for instance, if a person wants to evaluate the features of a cell phone, aspects such as the battery, screen, and camera quality are checked.
4. Multilingual sentiment analysis: Multilingual analysis covers different languages, where the classification still needs to be done as positive, negative, or neutral. This is highly challenging and comparatively difficult.
How does Sentiment Analysis work
1. Rule-based approach: Here, the lexicon method, tokenization, and parsing are used. The approach counts the number of positive and negative words in the given text: if the number of positive words is greater than the number of negative words, the sentiment is positive, and vice versa (a minimal sketch follows this list).
2. Machine Learning approach: This approach works with machine learning techniques. First, models are trained on labeled datasets and predictive analysis is done. Features are then extracted from the text and classified using techniques such as Naive Bayes, Support Vector Machines, hidden Markov models, and conditional random fields.
3. Neural network approach: In the last few years neural networks have evolved at a very fast rate. This approach uses artificial neural networks, inspired by the structure of the human brain, to classify text into positive, negative, or neutral sentiments. Recurrent neural networks, long short-term memory (LSTM), gated recurrent units (GRU), etc. are used to process sequential data like text.
4. Hybrid approach: This is the combination of two or more approaches, e.g. rule-based and machine learning approaches. The benefit is that the accuracy is higher compared to either approach alone.
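A minimal Python sketch of the rule-based (lexicon-counting) approach from item 1 above; the tiny positive and negative word lists are illustrative assumptions rather than a real sentiment lexicon.

# Rule-based sentiment: count positive and negative lexicon words in the text.
POSITIVE = {"good", "great", "happy", "excellent", "love"}
NEGATIVE = {"bad", "poor", "sad", "terrible", "hate"}

def rule_based_sentiment(text):
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(rule_based_sentiment("The battery is great but the camera is bad and the screen is poor"))
# 'negative' -- one positive word against two negative words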
Applications
1. Social Media: Comments on social media sites such as Instagram are analyzed and categorized as positive, negative, or neutral.
2. Customer Service: On the Play Store, reviews with ratings from 1 to 5 are analyzed with the help of sentiment analysis approaches.
3. Marketing Sector: In the marketing area, a particular product can be reviewed as good or bad.
4. Reviewer side: Reviewers can look at the comments and give an overall review of the product.
Challenges of Sentiment Analysis
• If the sentiment is carried by tone, it becomes really difficult to detect whether the comment is pessimistic or optimistic.
• If the data is in the form of emojis, the system needs to detect whether the meaning is good or bad.
• Detecting ironic, sarcastic, and comparative comments is really hard.
• Classifying neutral statements correctly is a big task.
