
Commit 5146f77

antmarakis authored and norvig committed
Text Notebook: Information Retrieval (aimacode#576)
* spacing in text.py
* information retrieval notebook section
1 parent b785561 commit 5146f77

File tree

2 files changed

+159
-0
lines changed


text.ipynb

Lines changed: 156 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,7 @@
2929
"\n",
3030
"* Text Models\n",
3131
"* Viterbi Text Segmentation\n",
32+
"* Information Retrieval\n",
3233
"* Decoders\n",
3334
" * Introduction\n",
3435
" * Shift Decoder\n",
@@ -404,6 +405,161 @@
404405
"The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string."
405406
]
406407
},
408+
{
409+
"cell_type": "markdown",
410+
"metadata": {},
411+
"source": [
412+
"## INFORMATION RETRIEVAL\n",
413+
"\n",
414+
"### Overview\n",
415+
"\n",
416+
"With **Information Retrieval (IR)** we find documents that are relevant to a user's needs for information. A popular example is a web search engine, which finds and presents to a user pages relevant to a query. An IR system is comprised of the following:\n",
417+
"\n",
418+
"* A body (called corpus) of documents: A collection of documents, where the IR will work on.\n",
419+
"\n",
420+
"* A query language: A query represents what the user wants.\n",
421+
"\n",
422+
"* Results: The documents the system grades as relevant to a user's query and needs.\n",
423+
"\n",
424+
"* Presententation of the results: How the results are presented to the user.\n",
425+
"\n",
426+
"How does an IR system determine which documents are relevant though? We can sign a document as relevant if all the words in the query appear in it, and sign it as irrelevant otherwise. We can even extend the query language to support boolean operations (for example, \"paint AND brush\") and then sign as relevant the outcome of the query for the document. This technique though does not give a level of relevancy. All the documents are either relevant or irrelevant, but in reality some documents are more relevant than others.\n",
427+
"\n",
428+
"So, instead of a boolean relevancy system, we use a *scoring function*. There are many scoring functions around for many different situations. One of the most used takes into account the frequency of the words appearing in a document, the frequency of a word appearing across documents (for example, the word \"a\" appears a lot, so it is not very important) and the length of a document (since large documents will have higher occurences for the query terms, but a short document with a lot of occurences seems very relevant). We combine these properties in a formula and we get a numeric score for each document, so we can then quantify relevancy and pick the best documents.\n",
429+
"\n",
430+
"These scoring functions are not perfect though and there is room for improvement. For instance, for the above scoring function we assume each word is independent. That is not the case though, since words can share meaning. For example, the words \"painter\" and \"painters\" are closely related. If in a query we have the word \"painter\" and in a document the word \"painters\" appears a lot, this might be an indication that the document is relevant but we are missing out since we are only looking for \"painter\". There are a lot of ways to combat this. One of them is to reduce the query and document words into their stems. For example, both \"painter\" and \"painters\" have \"paint\" as their stem form. This can improve slightly the performance of algorithms.\n",
431+
"\n",
432+
"To determine how good an IR system is, we give the system a set of queries (for which we know the relevant pages beforehand) and record the results. The two measures for performance are *precision* and *recall*. Precision measures the proportion of result documents that actually are relevant. Recall measures the proportion of relevant documents (which, as mentioned before, we know in advance) appearing in the result documents."
433+
]
434+
},
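As an aside (not part of this commit), here is a minimal, self-contained sketch of these ideas: a frequency-based score of the `log(1+k)/log(1+n)` form described later in this notebook, plus precision and recall computed against an assumed set of relevant documents. The toy corpus, the query and the "relevant" set are made up for illustration.

```python
import math

# Toy corpus: each document is just a list of words.
docs = ["the painter bought a new brush".split(),
        "a a a of the the the".split(),
        "paint the fence then clean the brush and the brush again".split()]

def score(word, doc):
    # log(1 + k) / log(1 + n): k = occurrences of the word in the document,
    # n = document length, so long documents are not favoured unduly.
    return math.log(1 + doc.count(word)) / math.log(1 + len(doc))

def total_score(query_words, doc):
    # Sum the per-word scores over all query words.
    return sum(score(w, doc) for w in query_words)

query = "paint brush".split()
ranking = sorted(range(len(docs)),
                 key=lambda d: total_score(query, docs[d]), reverse=True)
print(ranking)                        # docids ordered from most to least relevant

# Evaluating the system: precision and recall against known relevant documents.
relevant = {0, 2}                     # assumed ground truth for this toy query
retrieved = set(ranking[:2])          # the two best-scoring documents
precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
print(precision, recall)
```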
435+
{
436+
"cell_type": "markdown",
437+
"metadata": {},
438+
"source": [
439+
"### Implementation\n",
440+
"\n",
441+
"You can read the source code by running the command below:"
442+
]
443+
},
444+
{
445+
"cell_type": "code",
446+
"execution_count": 2,
447+
"metadata": {
448+
"collapsed": true
449+
},
450+
"outputs": [],
451+
"source": [
452+
"%psource IRSystem"
453+
]
454+
},
455+
{
456+
"cell_type": "markdown",
457+
"metadata": {},
458+
"source": [
459+
"The `stopwords` argument signifies words in the queries that should not be accounted for in documents. Usually they are very common words that do not add any significant information for a document's relevancy.\n",
460+
"\n",
461+
"A quick guide for the functions in the `IRSystem` class:\n",
462+
"\n",
463+
"* `index_document`: Add document to the collection of documents (named `documents`), which is a list of tuples. Also, count how many times each word in the query appears in each document.\n",
464+
"\n",
465+
"* `index_collection`: Index a collection of documents given by `filenames`.\n",
466+
"\n",
467+
"* `query`: Returns a list of `n` pairs of `(score, docid)` sorted on the score of each document. Also takes care of the special query \"learn: X\", where instead of the normal functionality we present the output of the terminal command \"X\".\n",
468+
"\n",
469+
"* `score`: Scores a given document for the given word using `log(1+k)/log(1+n)`, where `k` is the number of query words in a document and `k` is the total number of words in the document. Other scoring functions can be used and you can overwrite this function to better suit your needs.\n",
470+
"\n",
471+
"* `total_score`: Calculate the sum of all the query words in given document.\n",
472+
"\n",
473+
"* `present`/`present_results`: Presents the results as a list.\n",
474+
"\n",
475+
"We also have the class `Document` that holds metadata of documents, like their title, url and number of words. An additional class, `UnixConsultant`, can be used to initialize an IR System for Unix command manuals. This is the example we will use to showcase the implementation."
476+
]
477+
},
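As an aside (not part of this commit), here is a minimal sketch of overriding `score`, assuming `IRSystem` is importable from `text.py` and using the `index` and `documents` attributes described above; the extra damping factor is made up for the example.

```python
import math

from text import IRSystem

class DampedIRSystem(IRSystem):
    """IRSystem with a slightly modified scoring function."""

    def score(self, word, docid):
        k = self.index[word][docid]          # occurrences of the word in the document
        n = self.documents[docid].nwords     # total number of words in the document
        damping = 0.5 if n > 10000 else 1.0  # made-up penalty for very long documents
        return damping * math.log(1 + k) / math.log(1 + n)
```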
478+
{
479+
"cell_type": "markdown",
480+
"metadata": {},
481+
"source": [
482+
"### Example\n",
483+
"\n",
484+
"First let's take a look at the source code of `UnixConsultant`."
485+
]
486+
},
487+
{
488+
"cell_type": "code",
489+
"execution_count": 3,
490+
"metadata": {
491+
"collapsed": true
492+
},
493+
"outputs": [],
494+
"source": [
495+
"%psource UnixConsultant"
496+
]
497+
},
498+
{
499+
"cell_type": "markdown",
500+
"metadata": {},
501+
"source": [
502+
"The class creates an IR System with the stopwords \"how do i the a of\". We could add more words to exclude, but the queries we will test will generally be in that format, so it is convenient. After the initialization of the system, we get the manual files and start indexing them.\n",
503+
"\n",
504+
"Let's build our Unix consultant and run a query:"
505+
]
506+
},
507+
{
508+
"cell_type": "code",
509+
"execution_count": 9,
510+
"metadata": {},
511+
"outputs": [
512+
{
513+
"name": "stdout",
514+
"output_type": "stream",
515+
"text": [
516+
"0.7682667868462166 aima-data/MAN/rm.txt\n"
517+
]
518+
}
519+
],
520+
"source": [
521+
"uc = UnixConsultant()\n",
522+
"\n",
523+
"q = uc.query(\"how do I remove a file\")\n",
524+
"\n",
525+
"top_score, top_doc = q[0][0], q[0][1]\n",
526+
"print(top_score, uc.documents[top_doc].url)"
527+
]
528+
},
529+
{
530+
"cell_type": "markdown",
531+
"metadata": {},
532+
"source": [
533+
"We asked how to remove a file and the top result was the `rm` (the Unix command for remove) manual. This is exactly what we wanted! Let's try another query:"
534+
]
535+
},
536+
{
537+
"cell_type": "code",
538+
"execution_count": 10,
539+
"metadata": {},
540+
"outputs": [
541+
{
542+
"name": "stdout",
543+
"output_type": "stream",
544+
"text": [
545+
"0.7546722691607105 aima-data/MAN/diff.txt\n"
546+
]
547+
}
548+
],
549+
"source": [
550+
"q = uc.query(\"how do I delete a file\")\n",
551+
"\n",
552+
"top_score, top_doc = q[0][0], q[0][1]\n",
553+
"print(top_score, uc.documents[top_doc].url)"
554+
]
555+
},
556+
{
557+
"cell_type": "markdown",
558+
"metadata": {},
559+
"source": [
560+
"Even though we are basically asking for the same thing, we got a different top result. The `diff` command shows the differences between two files. So the system failed us and presented us an irrelevant document. Why is that? Unfortunately our IR system considers each word independent. \"Remove\" and \"delete\" have similar meanings, but since they are different words our system will not make the connection. So, the `diff` manual which mentions a lot the word `delete` gets the nod ahead of other manuals, while the `rm` one isn't in the result set since it doesn't use the word at all."
561+
]
562+
},
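As an aside (not part of this commit), one way to mitigate the word-independence problem is to expand the query with hand-picked synonyms before passing it to the IR system. The synonym table below is made up for the example and is not part of aima-python.

```python
# Hypothetical synonym table; made up for this example.
SYNONYMS = {'delete': ['remove', 'erase'],
            'remove': ['delete']}

def expand_query(query_text):
    """Append known synonyms so that "delete" also matches manuals
    that only ever use the word "remove"."""
    qwords = query_text.lower().split()
    extra = [syn for w in qwords for syn in SYNONYMS.get(w, [])]
    return ' '.join(qwords + extra)

# Usage with the Unix consultant built earlier:
# q = uc.query(expand_query("how do I delete a file"))
```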
407563
{
408564
"cell_type": "markdown",
409565
"metadata": {},

text.py

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -168,6 +168,7 @@ def query(self, query_text, n=10):
168168
doctext = os.popen(query_text[len("learn:"):], 'r').read()
169169
self.index_document(doctext, query_text)
170170
return []
171+
171172
qwords = [w for w in words(query_text) if w not in self.stopwords]
172173
shortest = argmin(qwords, key=lambda w: len(self.index[w]))
173174
docids = self.index[shortest]
@@ -202,11 +203,13 @@ class UnixConsultant(IRSystem):
202203

203204
def __init__(self):
204205
IRSystem.__init__(self, stopwords="how do i the a of")
206+
205207
import os
206208
aima_root = os.path.dirname(__file__)
207209
mandir = os.path.join(aima_root, 'aima-data/MAN/')
208210
man_files = [mandir + f for f in os.listdir(mandir)
209211
if f.endswith('.txt')]
212+
210213
self.index_collection(man_files)
211214

212215

0 commit comments
