
Commit 5146f77

antmarakis authored and norvig committed
Text Notebook: Information Retrieval (aimacode#576)
* spacing in text.py
* information retrieval notebook section
1 parent b785561 commit 5146f77

File tree

2 files changed

+159
-0
lines changed


text.ipynb

Lines changed: 156 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -29,6 +29,7 @@
2929
"\n",
3030
"* Text Models\n",
3131
"* Viterbi Text Segmentation\n",
32+
"* Information Retrieval\n",
3233
"* Decoders\n",
3334
" * Introduction\n",
3435
" * Shift Decoder\n",
@@ -404,6 +405,161 @@
404405
"The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string."
405406
]
406407
},
408+
{
409+
"cell_type": "markdown",
410+
"metadata": {},
411+
"source": [
412+
"## INFORMATION RETRIEVAL\n",
413+
"\n",
414+
"### Overview\n",
415+
"\n",
416+
"With **Information Retrieval (IR)** we find documents that are relevant to a user's needs for information. A popular example is a web search engine, which finds and presents to a user pages relevant to a query. An IR system is comprised of the following:\n",
417+
"\n",
418+
"* A body (called corpus) of documents: A collection of documents, where the IR will work on.\n",
419+
"\n",
420+
"* A query language: A query represents what the user wants.\n",
421+
"\n",
422+
"* Results: The documents the system grades as relevant to a user's query and needs.\n",
423+
"\n",
424+
"* Presententation of the results: How the results are presented to the user.\n",
425+
"\n",
426+
"How does an IR system determine which documents are relevant though? We can sign a document as relevant if all the words in the query appear in it, and sign it as irrelevant otherwise. We can even extend the query language to support boolean operations (for example, \"paint AND brush\") and then sign as relevant the outcome of the query for the document. This technique though does not give a level of relevancy. All the documents are either relevant or irrelevant, but in reality some documents are more relevant than others.\n",
427+
"\n",
428+
"So, instead of a boolean relevancy system, we use a *scoring function*. There are many scoring functions around for many different situations. One of the most used takes into account the frequency of the words appearing in a document, the frequency of a word appearing across documents (for example, the word \"a\" appears a lot, so it is not very important) and the length of a document (since large documents will have higher occurences for the query terms, but a short document with a lot of occurences seems very relevant). We combine these properties in a formula and we get a numeric score for each document, so we can then quantify relevancy and pick the best documents.\n",
429+
"\n",
430+
"These scoring functions are not perfect though and there is room for improvement. For instance, for the above scoring function we assume each word is independent. That is not the case though, since words can share meaning. For example, the words \"painter\" and \"painters\" are closely related. If in a query we have the word \"painter\" and in a document the word \"painters\" appears a lot, this might be an indication that the document is relevant but we are missing out since we are only looking for \"painter\". There are a lot of ways to combat this. One of them is to reduce the query and document words into their stems. For example, both \"painter\" and \"painters\" have \"paint\" as their stem form. This can improve slightly the performance of algorithms.\n",
431+
"\n",
432+
"To determine how good an IR system is, we give the system a set of queries (for which we know the relevant pages beforehand) and record the results. The two measures for performance are *precision* and *recall*. Precision measures the proportion of result documents that actually are relevant. Recall measures the proportion of relevant documents (which, as mentioned before, we know in advance) appearing in the result documents."
433+
]
434+
},
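As an aside (not part of this commit), here is a minimal, self-contained sketch of these ideas: a frequency-based score of the `log(1+k)/log(1+n)` form described later in this notebook, plus precision and recall computed against an assumed set of relevant documents. The toy corpus, the query and the "relevant" set are made up for illustration.

```python
import math

# Toy corpus: each document is just a list of words.
docs = ["the painter bought a new brush".split(),
        "a a a of the the the".split(),
        "paint the fence then clean the brush and the brush again".split()]

def score(word, doc):
    # log(1 + k) / log(1 + n): k = occurrences of the word in the document,
    # n = document length, so long documents are not favoured unduly.
    return math.log(1 + doc.count(word)) / math.log(1 + len(doc))

def total_score(query_words, doc):
    # Sum the per-word scores over all query words.
    return sum(score(w, doc) for w in query_words)

query = "paint brush".split()
ranking = sorted(range(len(docs)),
                 key=lambda d: total_score(query, docs[d]), reverse=True)
print(ranking)                        # docids ordered from most to least relevant

# Evaluating the system: precision and recall against known relevant documents.
relevant = {0, 2}                     # assumed ground truth for this toy query
retrieved = set(ranking[:2])          # the two best-scoring documents
precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
print(precision, recall)
```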
435+
{
436+
"cell_type": "markdown",
437+
"metadata": {},
438+
"source": [
439+
"### Implementation\n",
440+
"\n",
441+
"You can read the source code by running the command below:"
442+
]
443+
},
444+
{
445+
"cell_type": "code",
446+
"execution_count": 2,
447+
"metadata": {
448+
"collapsed": true
449+
},
450+
"outputs": [],
451+
"source": [
452+
"%psource IRSystem"
453+
]
454+
},
455+
{
456+
"cell_type": "markdown",
457+
"metadata": {},
458+
"source": [
459+
"The `stopwords` argument signifies words in the queries that should not be accounted for in documents. Usually they are very common words that do not add any significant information for a document's relevancy.\n",
460+
"\n",
461+
"A quick guide for the functions in the `IRSystem` class:\n",
462+
"\n",
463+
"* `index_document`: Add document to the collection of documents (named `documents`), which is a list of tuples. Also, count how many times each word in the query appears in each document.\n",
464+
"\n",
465+
"* `index_collection`: Index a collection of documents given by `filenames`.\n",
466+
"\n",
467+
"* `query`: Returns a list of `n` pairs of `(score, docid)` sorted on the score of each document. Also takes care of the special query \"learn: X\", where instead of the normal functionality we present the output of the terminal command \"X\".\n",
468+
"\n",
469+
"* `score`: Scores a given document for the given word using `log(1+k)/log(1+n)`, where `k` is the number of query words in a document and `k` is the total number of words in the document. Other scoring functions can be used and you can overwrite this function to better suit your needs.\n",
470+
"\n",
471+
"* `total_score`: Calculate the sum of all the query words in given document.\n",
472+
"\n",
473+
"* `present`/`present_results`: Presents the results as a list.\n",
474+
"\n",
475+
"We also have the class `Document` that holds metadata of documents, like their title, url and number of words. An additional class, `UnixConsultant`, can be used to initialize an IR System for Unix command manuals. This is the example we will use to showcase the implementation."
476+
]
477+
},
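As an aside (not part of this commit), here is a minimal sketch of overriding `score`, assuming `IRSystem` is importable from `text.py` and using the `index` and `documents` attributes described above; the extra damping factor is made up for the example.

```python
import math

from text import IRSystem

class DampedIRSystem(IRSystem):
    """IRSystem with a slightly modified scoring function."""

    def score(self, word, docid):
        k = self.index[word][docid]          # occurrences of the word in the document
        n = self.documents[docid].nwords     # total number of words in the document
        damping = 0.5 if n > 10000 else 1.0  # made-up penalty for very long documents
        return damping * math.log(1 + k) / math.log(1 + n)
```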
478+
{
479+
"cell_type": "markdown",
480+
"metadata": {},
481+
"source": [
482+
"### Example\n",
483+
"\n",
484+
"First let's take a look at the source code of `UnixConsultant`."
485+
]
486+
},
487+
{
488+
"cell_type": "code",
489+
"execution_count": 3,
490+
"metadata": {
491+
"collapsed": true
492+
},
493+
"outputs": [],
494+
"source": [
495+
"%psource UnixConsultant"
496+
]
497+
},
498+
{
499+
"cell_type": "markdown",
500+
"metadata": {},
501+
"source": [
502+
"The class creates an IR System with the stopwords \"how do i the a of\". We could add more words to exclude, but the queries we will test will generally be in that format, so it is convenient. After the initialization of the system, we get the manual files and start indexing them.\n",
503+
"\n",
504+
"Let's build our Unix consultant and run a query:"
505+
]
506+
},
507+
{
508+
"cell_type": "code",
509+
"execution_count": 9,
510+
"metadata": {},
511+
"outputs": [
512+
{
513+
"name": "stdout",
514+
"output_type": "stream",
515+
"text": [
516+
"0.7682667868462166 aima-data/MAN/rm.txt\n"
517+
]
518+
}
519+
],
520+
"source": [
521+
"uc = UnixConsultant()\n",
522+
"\n",
523+
"q = uc.query(\"how do I remove a file\")\n",
524+
"\n",
525+
"top_score, top_doc = q[0][0], q[0][1]\n",
526+
"print(top_score, uc.documents[top_doc].url)"
527+
]
528+
},
529+
{
530+
"cell_type": "markdown",
531+
"metadata": {},
532+
"source": [
533+
"We asked how to remove a file and the top result was the `rm` (the Unix command for remove) manual. This is exactly what we wanted! Let's try another query:"
534+
]
535+
},
536+
{
537+
"cell_type": "code",
538+
"execution_count": 10,
539+
"metadata": {},
540+
"outputs": [
541+
{
542+
"name": "stdout",
543+
"output_type": "stream",
544+
"text": [
545+
"0.7546722691607105 aima-data/MAN/diff.txt\n"
546+
]
547+
}
548+
],
549+
"source": [
550+
"q = uc.query(\"how do I delete a file\")\n",
551+
"\n",
552+
"top_score, top_doc = q[0][0], q[0][1]\n",
553+
"print(top_score, uc.documents[top_doc].url)"
554+
]
555+
},
556+
{
557+
"cell_type": "markdown",
558+
"metadata": {},
559+
"source": [
560+
"Even though we are basically asking for the same thing, we got a different top result. The `diff` command shows the differences between two files. So the system failed us and presented us an irrelevant document. Why is that? Unfortunately our IR system considers each word independent. \"Remove\" and \"delete\" have similar meanings, but since they are different words our system will not make the connection. So, the `diff` manual which mentions a lot the word `delete` gets the nod ahead of other manuals, while the `rm` one isn't in the result set since it doesn't use the word at all."
561+
]
562+
},
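As an aside (not part of this commit), one way to mitigate the word-independence problem is to expand the query with hand-picked synonyms before passing it to the IR system. The synonym table below is made up for the example and is not part of aima-python.

```python
# Hypothetical synonym table; made up for this example.
SYNONYMS = {'delete': ['remove', 'erase'],
            'remove': ['delete']}

def expand_query(query_text):
    """Append known synonyms so that "delete" also matches manuals
    that only ever use the word "remove"."""
    qwords = query_text.lower().split()
    extra = [syn for w in qwords for syn in SYNONYMS.get(w, [])]
    return ' '.join(qwords + extra)

# Usage with the Unix consultant built earlier:
# q = uc.query(expand_query("how do I delete a file"))
```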
407563
{
408564
"cell_type": "markdown",
409565
"metadata": {},

text.py

Lines changed: 3 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -168,6 +168,7 @@ def query(self, query_text, n=10):
168168
doctext = os.popen(query_text[len("learn:"):], 'r').read()
169169
self.index_document(doctext, query_text)
170170
return []
171+
171172
qwords = [w for w in words(query_text) if w not in self.stopwords]
172173
shortest = argmin(qwords, key=lambda w: len(self.index[w]))
173174
docids = self.index[shortest]
@@ -202,11 +203,13 @@ class UnixConsultant(IRSystem):
202203

203204
def __init__(self):
204205
IRSystem.__init__(self, stopwords="how do i the a of")
206+
205207
import os
206208
aima_root = os.path.dirname(__file__)
207209
mandir = os.path.join(aima_root, 'aima-data/MAN/')
208210
man_files = [mandir + f for f in os.listdir(mandir)
209211
if f.endswith('.txt')]
212+
210213
self.index_collection(man_files)
211214

212215

0 commit comments
