
Session 5 - Information Retrieval

The document outlines the NLP pipeline and its components, including text preprocessing, feature extraction, modeling, and deployment. It discusses Information Retrieval (IR) as the process of finding relevant unstructured text documents and emphasizes the importance of precision and recall in evaluating retrieval effectiveness. Additionally, it highlights challenges in handling large document collections and introduces the concept of inverted indexing for efficient data representation.

Uploaded by

tian Indra
Advanced NLP

NLP (Information Retrieval)

Dr. Sajarwo Anggai, S.ST., M.T.
NIDN : 0421108703

NLP Pipeline

Text Preprocessing
• Segmentation/Tokenization
• Normalization/Lowercasing
• Stopword Removal
• Punctuation Removal
• Stemming/Lemmatization
• Dependency parsing
• Part-of-Speech Tagging

Feature Extraction
• Bag of Words (BoW)
• TF-IDF
• Word Embeddings

Modeling
• Machine Learning
• Deep Learning
• Fine-Tuning

Post-Processing
• Prediction
• Evaluation

Deployment
• Production
• Monitoring and Updating

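
As a rough illustration of the first two pipeline stages, the sketch below lowercases a sentence, removes punctuation, tokenizes on whitespace, drops stopwords, and counts the remaining tokens as a bag of words. The `STOPWORDS` set, the function names, and the sample sentence are assumptions made for this example; a real pipeline would normally rely on an NLP library and a fuller stopword list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def preprocess(text: str) -> list[str]:
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                    # normalization / lowercasing
    text = re.sub(r"[^\w\s]", " ", text)   # punctuation removal
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(tokens: list[str]) -> Counter:
    """Feature extraction: count how often each token occurs."""
    return Counter(tokens)

tokens = preprocess("The noble Brutus, and the ambition of Caesar!")
bow = bag_of_words(tokens)
print(tokens)  # ['noble', 'brutus', 'ambition', 'caesar']
```

Stemming, lemmatization, parsing, and tagging are left out here because they need linguistic resources beyond the standard library.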
Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need from
within large collections (usually stored on computers).

These days we frequently think first of web search, but there are many other
cases:
E-mail search
Searching your laptop
Corporate knowledge bases
Legal information retrieval
Unstructured (text) vs. structured (database)

[Figure: bar charts comparing unstructured and structured data by data volume and by market cap, for the mid-nineties and for today; vertical axis from 0 to 250]
Basic assumptions of Information Retrieval

Collection: A set of documents
• Assume it is a static collection for the moment

Goal: Retrieve documents with information that is relevant to the user’s information need and helps the user complete a task
How good are the retrieved docs?

• Precision : Fraction of retrieved docs that are relevant to the user’s information need
• Recall : Fraction of relevant docs in collection that are retrieved
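
The two measures can be computed directly from sets of docIDs. The following is a minimal sketch; the function name `precision_recall` and the docID values are made up for illustration.

```python
def precision_recall(retrieved: set[int], relevant: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical docIDs: the system returned 4 docs, 5 docs are truly relevant.
retrieved = {1, 2, 3, 4}
relevant = {2, 4, 5, 6, 7}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.40
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).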
Unstructured data in 1620

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
Why is that not the answer?
• Slow (for large corpora)
• NOT Calpurnia is non-trivial
• Other operations (e.g., find the word Romans near countrymen) not feasible
• Ranked retrieval (best documents to return)
  - Later lectures
Term-document incidence matrices

Query: Brutus AND Caesar BUT NOT Calpurnia
Matrix entry: 1 if play contains word, 0 otherwise
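
A term-document incidence matrix can be sketched in a few lines. The three "plays" below are invented one-line stand-ins, not Shakespeare's text, and the variable names are assumptions for this example.

```python
# Hypothetical toy collection: each "play" is reduced to a few words.
docs = {
    "Play A": "brutus caesar calpurnia",
    "Play B": "brutus caesar",
    "Play C": "caesar",
}

names = list(docs)  # column order of the matrix
vocab = sorted({word for text in docs.values() for word in text.split()})

# One row per term; entry is 1 if the play contains the term, 0 otherwise.
matrix = {
    term: [1 if term in docs[name].split() else 0 for name in names]
    for term in vocab
}

for term, row in matrix.items():
    print(f"{term:10s} {row}")
# brutus     [1, 1, 0]
# caesar     [1, 1, 1]
# calpurnia  [1, 0, 0]
```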
Incidence vectors
So we have a 0/1 vector for each term.
To answer query: take the vectors for Brutus, Caesar and Calpurnia (complemented) → bitwise AND.
110100 AND
110111 AND
101111 =
100100
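
The same computation can be reproduced with Python integers as bit vectors. This is a sketch under one assumption: the uncomplemented Calpurnia vector `010000` is inferred from the complemented value `101111` shown above. Column positions are numbered 0 to 5 from the left.

```python
WIDTH = 6                      # number of plays (columns) in the example
MASK = (1 << WIDTH) - 1        # 0b111111, keeps the complement within 6 bits

brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000           # assumed: inferred from the complement 101111

not_calpurnia = ~calpurnia & MASK          # 0b101111
result = brutus & caesar & not_calpurnia   # Brutus AND Caesar AND NOT Calpurnia

print(format(result, "06b"))   # 100100

# Column positions (0 = leftmost) whose bit is set in the result.
matches = [i for i in range(WIDTH) if result >> (WIDTH - 1 - i) & 1]
print(matches)                 # [0, 3]
```

The mask matters because Python integers have unbounded width, so `~calpurnia` alone would be a negative number rather than a 6-bit complement.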
Bigger collections

• Consider N = 1 million documents, each with about 1000 words.
• Avg 6 bytes/word including spaces/punctuation
  - 6GB of data in the documents.
• Say there are M = 500K distinct terms among these.
Can’t build the matrix
• 500K x 1M matrix has half-a-trillion 0’s and 1’s.
• But it has no more than one billion 1’s. Why?
  - The matrix is extremely sparse.
• What’s a better representation?
  - We only record the 1 positions.
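
The figures above follow from simple arithmetic. Each document has about 1000 words, so it can contain at most 1000 distinct terms, which bounds the number of 1's at N x 1000. The sketch below just redoes that calculation; the variable names are chosen for this example.

```python
N = 1_000_000          # documents
WORDS_PER_DOC = 1_000  # about 1000 words each
BYTES_PER_WORD = 6     # average, including spaces/punctuation
M = 500_000            # distinct terms

corpus_bytes = N * WORDS_PER_DOC * BYTES_PER_WORD  # total text size
matrix_cells = M * N                               # cells in the 0/1 matrix
max_ones = N * WORDS_PER_DOC                       # upper bound on 1 entries
density = max_ones / matrix_cells                  # fraction of cells that can be 1

print(f"corpus size : {corpus_bytes / 1e9:.0f} GB")  # 6 GB
print(f"matrix cells: {matrix_cells:,}")             # 500,000,000,000
print(f"max 1's     : {max_ones:,}")                 # 1,000,000,000
print(f"density     : {density:.1%}")                # 0.2%
```

At most 0.2% of the cells can be 1, which is why storing only the 1 positions is far cheaper than storing the full matrix.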
Inverted index
• For each term t, we must store a list of all documents that contain t.
  - Identify each doc by a docID, a document serial number
• Can we use fixed-size arrays for this?

Brutus     1  2  4  11  31  45  173  174
Caesar     1  2  4  5   6   16  57   132
Calpurnia  2  31 54 101

What happens if the word Caesar is added to document 14?
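
One way to see why fixed-size arrays are awkward is to store each postings list as a growable sorted list. The sketch below uses the three postings lists shown above; the helper names `add_posting` and `intersect` are assumptions for this example, and a Python list stands in for whatever variable-length structure a real system would use.

```python
from bisect import insort

# Postings lists from the example above, kept sorted by docID.
postings = {
    "Brutus": [1, 2, 4, 11, 31, 45, 173, 174],
    "Caesar": [1, 2, 4, 5, 6, 16, 57, 132],
    "Calpurnia": [2, 31, 54, 101],
}

def add_posting(term: str, doc_id: int) -> None:
    """Insert doc_id into the term's postings list, keeping it sorted."""
    plist = postings.setdefault(term, [])
    if doc_id not in plist:
        insort(plist, doc_id)

def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """Merge-style intersection of two sorted postings lists."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# The word Caesar is added to document 14: the list simply grows.
add_posting("Caesar", 14)
print(postings["Caesar"])  # [1, 2, 4, 5, 6, 14, 16, 57, 132]

# Brutus AND Caesar AND NOT Calpurnia
both = intersect(postings["Brutus"], postings["Caesar"])
excluded = set(postings["Calpurnia"])
answer = [d for d in both if d not in excluded]
print(answer)              # [1, 4]
```

With a fixed-size array, the insertion of docID 14 would require reallocating or reserving spare slots in advance; a variable-length list avoids that, at the cost of shifting elements on insert.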
Inverted index construction
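
A common way to construct an inverted index is to tokenize each document, collect (term, docID) pairs, sort them, and group them into postings lists. The sketch below follows those steps under simplifying assumptions: the two-document corpus is a made-up toy, tokenization is a plain regular expression, and `build_inverted_index` is a name chosen for this example.

```python
import re
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Build term -> sorted list of docIDs from a {docID: text} mapping."""
    # Step 1: tokenize and emit (term, docID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            pairs.append((token, doc_id))

    # Step 2: sort by term, then by docID.
    pairs.sort()

    # Step 3: group into postings lists, skipping duplicate docIDs.
    index = defaultdict(list)
    for term, doc_id in pairs:
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

# Hypothetical toy corpus.
docs = {
    1: "Brutus killed Caesar in the Capitol",
    2: "The noble Brutus said Caesar was ambitious",
}

index = build_inverted_index(docs)
print(index["brutus"])   # [1, 2]
print(index["capitol"])  # [1]
print(index["noble"])    # [2]
```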
Ref
• https://nlp.stanford.edu/IR-book/information-retrieval-book.html
• https://web.stanford.edu/class/cs276/
• https://courses.cs.washington.edu/courses/cse454/09sp/slides.html
Assignment

• Find 10 journal articles related to Information Retrieval!
• Write an Inverted Index program to hold a corpus/dataset!
Universitas Pamulang
Master's Program in Informatics Engineering

Sajarwo Anggai
Lecturer, Universitas Pamulang
NIDN : 0421108703
Email : dosen02832@unpam.ac.id

Thank You
