Pertemuan 5 - Information Retrieval

The document outlines the NLP pipeline and its components, including text preprocessing, feature extraction, modeling, and deployment. It discusses Information Retrieval (IR) as the process of finding relevant unstructured text documents and emphasizes the importance of precision and recall in evaluating retrieval effectiveness. Additionally, it highlights challenges in handling large document collections and introduces the concept of inverted indexing for efficient data representation.

Uploaded by

tian Indra

Advanced NLP

NLP (Information Retrieval)

Dr. Sajarwo Anggai, S.ST., M.T.

NIDN : 0421108703
NLP Pipeline

Text Preprocessing
• Segmentation/Tokenization
• Normalization/Lowercasing
• Stopword Removal
• Punctuation Removal
• Stemming/Lemmatization
• Dependency Parsing
• Part-of-Speech Tagging

Feature Extraction
• Bag of Words (BoW)
• TF-IDF
• Word Embeddings

Modeling
• Machine Learning
• Deep Learning
• Fine-Tuning

Post-Processing
• Prediction
• Evaluation

Deployment
• Production
• Monitoring and Updating
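
As a rough illustration of the first two pipeline stages, the sketch below lowercases and tokenizes a sentence, drops punctuation and stopwords, and counts the remaining tokens as a Bag of Words. The stopword list, the function names, and the sample sentence are assumptions made for this example; they are not taken from the slides.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs (dropping punctuation), remove stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(text):
    """Count how often each remaining token occurs."""
    return Counter(preprocess(text))

tokens = preprocess("The quick fox and the lazy dog.")
bow = bag_of_words("The quick fox and the lazy dog.")
print(tokens)  # ['quick', 'fox', 'lazy', 'dog']
```

Stemming, lemmatization, parsing, and tagging are left out here because they normally rely on a dedicated NLP library.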
Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

These days we frequently think first of web search, but there are many other cases:
• E-mail search
• Searching your laptop
• Corporate knowledge bases
• Legal information retrieval
Unstructured (text) vs. structured (database)

[Figure: two bar charts, one for the mid-nineties and one for today, comparing unstructured and structured data by data volume and market cap (vertical axis 0 to 250).]
Basic assumptions of Information Retrieval

Collection: A set of documents

Assume it is a static collection for the moment

Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
How good are the retrieved docs?

• Precision: Fraction of retrieved docs that are relevant to the user’s information need
• Recall: Fraction of relevant docs in collection that are retrieved
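
The two definitions above can be sketched in a few lines of Python. The function name and the document IDs are hypothetical, chosen only to make the fractions easy to check by hand.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved docs that are also relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 docs retrieved, 3 of them relevant; 6 relevant docs exist.
retrieved_ids = [1, 2, 3, 4]
relevant_ids = [2, 3, 4, 7, 8, 9]
p, r = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

Returning 0.0 for an empty set is a convention chosen for this sketch; other conventions exist.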
Unstructured data in 1620

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.
Why is that not the answer?
• Slow (for large corpora)
• NOT Calpurnia is non-trivial
• Other operations (e.g., find the word Romans near countrymen) are not feasible
• No ranked retrieval (best documents to return); covered in later lectures
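
To make the cost of the grep-style approach concrete, here is a minimal sketch of a linear scan that re-reads every play for each query. The play titles and texts are invented placeholders, and the function works on whole plays rather than lines, which sidesteps the line-level NOT problem but not the cost of scanning everything.

```python
def linear_scan(plays, must_have, must_not_have):
    """Return titles whose text contains every must_have word and no must_not_have word."""
    matches = []
    for title, text in plays.items():  # every play is scanned on every query
        words = set(text.lower().split())
        has_all = all(w in words for w in must_have)
        has_none = not any(w in words for w in must_not_have)
        if has_all and has_none:
            matches.append(title)
    return matches

# Hypothetical stand-ins for the full play texts.
plays = {
    "play_a": "brutus speaks to caesar",
    "play_b": "caesar greets calpurnia and brutus",
    "play_c": "caesar alone",
}
result = linear_scan(plays, ["brutus", "caesar"], ["calpurnia"])
print(result)  # ['play_a']
```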
Term-document incidence matrices

Query: Brutus AND Caesar BUT NOT Calpurnia
Each matrix entry is 1 if the play contains the word, 0 otherwise.
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented), then bitwise AND them:
110100 AND
110111 AND
101111 =
100100
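
The same computation can be reproduced with Python integers used as bit vectors. The uncomplemented Calpurnia vector 010000 is inferred from the complemented value 101111 on the slide; the variable names are choices made for this sketch.

```python
N_PLAYS = 6
MASK = (1 << N_PLAYS) - 1  # 0b111111, keeps the complement to 6 bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000  # its 6-bit complement is 101111

not_calpurnia = ~calpurnia & MASK
answer = brutus & caesar & not_calpurnia
print(format(answer, "06b"))  # 100100

# 1-based positions (left to right) of the plays that satisfy the query
matches = [i + 1 for i, bit in enumerate(format(answer, "06b")) if bit == "1"]
print(matches)  # [1, 4]
```

The mask is needed because Python integers are unbounded, so `~calpurnia` alone would be a negative number rather than a 6-bit vector.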
Bigger collections

• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
  - 6 GB of data in the documents.
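
The 6 GB figure follows directly from the numbers on the slide, as this short check shows (taking 1 GB as 10^9 bytes, which is an assumption of this sketch):

```python
n_docs = 1_000_000   # N = 1 million documents
words_per_doc = 1_000
bytes_per_word = 6   # average, including spaces/punctuation

total_bytes = n_docs * words_per_doc * bytes_per_word
print(total_bytes / 10**9, "GB")  # 6.0 GB
```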
• Say there are M = 500K distinct terms among these.
Can’t build the matrix
• A 500K x 1M matrix has half a trillion 0’s and 1’s.
• But it has no more than one billion 1’s. Why?
• The matrix is extremely sparse.
• What’s a better representation?
• We only record the 1 positions.
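
A quick calculation shows where both numbers come from. Each document has about 1000 words, so it can contribute at most 1000 ones to the matrix, which bounds the total at N x 1000. The variable names are choices made for this sketch.

```python
n_terms = 500_000     # M distinct terms
n_docs = 1_000_000    # N documents
words_per_doc = 1_000

cells = n_terms * n_docs           # total matrix entries
max_ones = n_docs * words_per_doc  # upper bound on the number of 1 entries
density = max_ones / cells

print(f"{cells:,} cells")         # 500,000,000,000 cells
print(f"{max_ones:,} ones max")   # 1,000,000,000 ones max
print(f"{density:.1%} non-zero")  # 0.2% non-zero
```

At most 0.2% of the entries can be 1, which is why storing only the 1 positions is so much cheaper.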
Inverted index
• For each term t, we must store a list of all documents that contain t.
  - Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus    → 1 2 4 11 31 45 173 174
Caesar    → 1 2 4 5 6 16 57 132
Calpurnia → 2 31 54 101

What happens if the word Caesar is added to document 14?
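
One way to explore the question is to keep each postings list as a variable-length sorted Python list and insert new docIDs in order. This is only a sketch: the dictionary layout and the `add_posting` helper are assumptions for illustration, not a prescribed implementation.

```python
from bisect import bisect_left

# Postings lists from the slide, kept sorted by docID.
postings = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(index, term, doc_id):
    """Insert doc_id into the term's postings list, keeping it sorted and duplicate-free."""
    plist = index.setdefault(term, [])
    pos = bisect_left(plist, doc_id)
    if pos == len(plist) or plist[pos] != doc_id:
        plist.insert(pos, doc_id)  # the list grows by one; a fixed-size array could not

add_posting(postings, "Caesar", 14)
print(postings["Caesar"])  # [1, 2, 4, 5, 6, 14, 16, 57, 132]
```

The list has to grow and later entries have to shift, which suggests why fixed-size arrays are a poor fit and variable-length postings lists are preferred.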
Inverted index construction
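
A minimal sketch of index construction, assuming a simple whitespace tokenizer with lowercasing and basic punctuation stripping. The three sample documents and the helper names are invented for this example; a real system would use a proper preprocessing pipeline.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the docIDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:!?")].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items() if term}

# Hypothetical toy corpus: docID -> text.
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar and Calpurnia spoke with Brutus.",
    3: "Caesar was ambitious.",
}
index = build_inverted_index(docs)

def lookup(term):
    """Return the set of docIDs for a term (empty if the term is unknown)."""
    return set(index.get(term.lower(), []))

# Brutus AND Caesar AND NOT Calpurnia, answered from the postings alone.
result = sorted((lookup("Brutus") & lookup("Caesar")) - lookup("Calpurnia"))
print(result)  # [1]
```

Once the index is built, the query touches only three postings lists instead of rescanning every document.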
References
• https://nlp.stanford.edu/IR-book/information-retrieval-book.html
• https://web.stanford.edu/class/cs276/
• https://courses.cs.washington.edu/courses/cse454/09sp/slides.html
Assignment

• Find 10 journal articles related to Information Retrieval!

• Write an Inverted Index program to hold a corpus/dataset!
Universitas Pamulang
Master's Program in Informatics Engineering (Prodi Teknik Informatika S-2)

Thank you

Sajarwo Anggai
Lecturer, Universitas Pamulang
NIDN: 0421108703
Email: dosen02832@unpam.ac.id
