|
1 | 1 | {
|
2 | 2 | "cells": [
|
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# NATURAL LANGUAGE PROCESSING\n", |
| 8 | + "\n", |
| 9 | + "This notebook covers chapters 22 and 23 from the book *Artificial Intelligence: A Modern Approach*, 3rd Edition. The implementations of the algorithms can be found in [nlp.py](https://github.com/aimacode/aima-python/blob/master/nlp.py).\n", |
| 10 | + "\n", |
| 11 | + "Run the below cell to import the code from the module and get started!" |
| 12 | + ] |
| 13 | + }, |
| 14 | + { |
| 15 | + "cell_type": "code", |
| 16 | + "execution_count": 1, |
| 17 | + "metadata": { |
| 18 | + "collapsed": true |
| 19 | + }, |
| 20 | + "outputs": [], |
| 21 | + "source": [ |
| 22 | + "import nlp\n", |
| 23 | + "from nlp import Page, HITS" |
| 24 | + ] |
| 25 | + }, |
| 26 | + { |
| 27 | + "cell_type": "markdown", |
| 28 | + "metadata": { |
| 29 | + "collapsed": true |
| 30 | + }, |
| 31 | + "source": [ |
| 32 | + "## CONTENTS\n", |
| 33 | + "\n", |
| 34 | + "* Overview\n", |
| 35 | + "* HITS" |
| 36 | + ] |
| 37 | + }, |
| 38 | + { |
| 39 | + "cell_type": "markdown", |
| 40 | + "metadata": {}, |
| 41 | + "source": [ |
| 42 | + "## OVERVIEW\n", |
| 43 | + "\n", |
| 44 | + "`TODO...`" |
| 45 | + ] |
| 46 | + }, |
| 47 | + { |
| 48 | + "cell_type": "markdown", |
| 49 | + "metadata": {}, |
| 50 | + "source": [ |
| 51 | + "## HITS\n", |
| 52 | + "\n", |
| 53 | + "### Overview\n", |
| 54 | + "\n", |
| 55 | + "**Hyperlink-Induced Topic Search** (or HITS for short) is an algorithm for information retrieval and page ranking. You can read more on information retrieval in the [text](https://github.com/aimacode/aima-python/blob/master/text.ipynb) notebook. Essentially, given a collection of documents and a user's query, such systems return to the user the documents most relevant to what the user needs. The HITS algorithm differs from a lot of other similar ranking algorithms (like Google's *Pagerank*) as the page ratings in this algorithm are dependent on the given query. This means that for each new query the result pages must be computed anew. This cost might be prohibitive for many modern search engines, so a lot steer away from this approach.\n", |
| 56 | + "\n", |
| 57 | + "HITS first finds a list of relevant pages to the query and then adds pages that link to or are linked from these pages. Once the set is built, we define two values for each page. **Authority** on the query, the degree of pages from the relevant set linking to it and **hub** of the query, the degree that it points to authoritative pages in the set. Since we do not want to simply count the number of links from a page to other pages, but we also want to take into account the quality of the linked pages, we update the hub and authority values of a page in the following manner, until convergence:\n", |
| 58 | + "\n", |
| 59 | + "* Hub score = The sum of the authority scores of the pages it links to.\n", |
| 60 | + "\n", |
| 61 | + "* Authority score = The sum of hub scores of the pages it is linked from.\n", |
| 62 | + "\n", |
| 63 | + "So the higher quality the pages a page is linked to and from, the higher its scores.\n", |
| 64 | + "\n", |
| 65 | + "We then normalize the scores by dividing each score by the sum of the squares of the respective scores of all pages. When the values converge, we return the top-valued pages. Note that because we normalize the values, the algorithm is guaranteed to converge." |
| 66 | + ] |
| 67 | + }, |
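| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The following is a minimal, self-contained sketch of the update-and-normalize loop described above. It only illustrates the idea and is *not* the `nlp.py` implementation; the `hits_sketch` function and the `inlinks`/`outlinks` dictionaries are made up for this example.\n", |
| | + "\n", |
| | + "```python\n", |
| | + "def hits_sketch(inlinks, outlinks, iterations=50):\n", |
| | + "    # inlinks[p]: pages linking to p; outlinks[p]: pages that p links to\n", |
| | + "    auth = {p: 1.0 for p in inlinks}\n", |
| | + "    hub = {p: 1.0 for p in outlinks}\n", |
| | + "    for _ in range(iterations):\n", |
| | + "        # Authority score = sum of the hub scores of the pages linking to it\n", |
| | + "        auth = {p: sum(hub[q] for q in inlinks[p]) for p in inlinks}\n", |
| | + "        # Hub score = sum of the authority scores of the pages it links to\n", |
| | + "        hub = {p: sum(auth[q] for q in outlinks[p]) for p in outlinks}\n", |
| | + "        # Normalize: divide by the square root of the sum of squared scores\n", |
| | + "        a_norm = sum(v ** 2 for v in auth.values()) ** 0.5 or 1.0\n", |
| | + "        h_norm = sum(v ** 2 for v in hub.values()) ** 0.5 or 1.0\n", |
| | + "        auth = {p: v / a_norm for p, v in auth.items()}\n", |
| | + "        hub = {p: v / h_norm for p, v in hub.items()}\n", |
| | + "    return auth, hub\n", |
| | + "\n", |
| | + "# A tiny made-up link graph: B links to A and C, and A links to C\n", |
| | + "inlinks = {'A': ['B'], 'B': [], 'C': ['A', 'B']}\n", |
| | + "outlinks = {'A': ['C'], 'B': ['A', 'C'], 'C': []}\n", |
| | + "auth, hub = hits_sketch(inlinks, outlinks)\n", |
| | + "```" |
| | + ] |
| | + }, |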
| 68 | + { |
| 69 | + "cell_type": "markdown", |
| 70 | + "metadata": { |
| 71 | + "collapsed": true |
| 72 | + }, |
| 73 | + "source": [ |
| 74 | + "### Implementation\n", |
| 75 | + "\n", |
| 76 | + "The source code for the algorithm is given below:" |
| 77 | + ] |
| 78 | + }, |
| 79 | + { |
| 80 | + "cell_type": "code", |
| 81 | + "execution_count": 2, |
| 82 | + "metadata": { |
| 83 | + "collapsed": true |
| 84 | + }, |
| 85 | + "outputs": [], |
| 86 | + "source": [ |
| 87 | + "%psource HITS" |
| 88 | + ] |
| 89 | + }, |
| 90 | + { |
| 91 | + "cell_type": "markdown", |
| 92 | + "metadata": {}, |
| 93 | + "source": [ |
| 94 | + "First we compile the collection of pages as mentioned above. Then, we initialize the authority and hub scores for each page and finally we update and normalize the values until convergence.\n", |
| 95 | + "\n", |
| 96 | + "A quick overview of the helper functions functions we use:\n", |
| 97 | + "\n", |
| 98 | + "* `relevant_pages`: Returns relevant pages from `pagesIndex` given a query.\n", |
| 99 | + "\n", |
| 100 | + "* `expand_pages`: Adds to the collection pages linked to and from the given `pages`.\n", |
| 101 | + "\n", |
| 102 | + "* `normalize`: Normalizes authority and hub scores.\n", |
| 103 | + "\n", |
| 104 | + "* `ConvergenceDetector`: A class that checks for convergence, by keeping a history of the pages' scores and checking if they change or not.\n", |
| 105 | + "\n", |
| 106 | + "* `Page`: The template for pages. Stores the address, authority/hub scores and in-links/out-links." |
| 107 | + ] |
| 108 | + }, |
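| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "The convergence check itself is conceptually simple. Here is a rough, illustrative sketch of the idea behind `ConvergenceDetector`; the `ConvergenceSketch` class below is invented for this notebook, so refer to `nlp.py` for the actual code.\n", |
| | + "\n", |
| | + "```python\n", |
| | + "class ConvergenceSketch:\n", |
| | + "    # Keep the previous snapshot of scores and report convergence when\n", |
| | + "    # no score has changed by more than a small tolerance.\n", |
| | + "    def __init__(self, tolerance=1e-9):\n", |
| | + "        self.previous = None\n", |
| | + "        self.tolerance = tolerance\n", |
| | + "\n", |
| | + "    def check(self, scores):\n", |
| | + "        current = dict(scores)\n", |
| | + "        done = (self.previous is not None and\n", |
| | + "                all(abs(current[k] - self.previous[k]) < self.tolerance\n", |
| | + "                    for k in current))\n", |
| | + "        self.previous = current\n", |
| | + "        return done\n", |
| | + "\n", |
| | + "detector = ConvergenceSketch()\n", |
| | + "scores = {'A': 0.5, 'B': 0.5}   # made-up scores\n", |
| | + "print(detector.check(scores))   # False: nothing to compare against yet\n", |
| | + "print(detector.check(scores))   # True: the scores did not change\n", |
| | + "```" |
| | + ] |
| | + }, |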
| 109 | + { |
| 110 | + "cell_type": "markdown", |
| 111 | + "metadata": { |
| 112 | + "collapsed": true |
| 113 | + }, |
| 114 | + "source": [ |
| 115 | + "### Example\n", |
| 116 | + "\n", |
| 117 | + "Before we begin we need to define a list of sample pages to work on. The pages are `pA`, `pB` and so on and their text is given by `testHTML` and `testHTML2`. The `Page` class takes as arguments the in-links and out-links as lists. For page \"A\", the in-links are \"B\", \"C\" and \"E\" while the sole out-link is \"D\".\n", |
| 118 | + "\n", |
| 119 | + "We also need to set the `nlp` global variables `pageDict`, `pagesIndex` and `pagesContent`." |
| 120 | + ] |
| 121 | + }, |
3 | 122 | {
|
4 | 123 | "cell_type": "code",
|
5 |
| - "execution_count": null, |
| 124 | + "execution_count": 3, |
6 | 125 | "metadata": {
|
7 |
| - "collapsed": false |
| 126 | + "collapsed": true |
8 | 127 | },
|
9 | 128 | "outputs": [],
|
10 | 129 | "source": [
|
11 |
| - "import nlp" |
| 130 | + "testHTML = \"\"\"Like most other male mammals, a man inherits an\n", |
| 131 | + " X from his mom and a Y from his dad.\"\"\"\n", |
| 132 | + "testHTML2 = \"a mom and a dad\"\n", |
| 133 | + "\n", |
| 134 | + "pA = Page(\"A\", [\"B\", \"C\", \"E\"], [\"D\"])\n", |
| 135 | + "pB = Page(\"B\", [\"E\"], [\"A\", \"C\", \"D\"])\n", |
| 136 | + "pC = Page(\"C\", [\"B\", \"E\"], [\"A\", \"D\"])\n", |
| 137 | + "pD = Page(\"D\", [\"A\", \"B\", \"C\", \"E\"], [])\n", |
| 138 | + "pE = Page(\"E\", [], [\"A\", \"B\", \"C\", \"D\", \"F\"])\n", |
| 139 | + "pF = Page(\"F\", [\"E\"], [])\n", |
| 140 | + "\n", |
| 141 | + "nlp.pageDict = {pA.address: pA, pB.address: pB, pC.address: pC,\n", |
| 142 | + " pD.address: pD, pE.address: pE, pF.address: pF}\n", |
| 143 | + "\n", |
| 144 | + "nlp.pagesIndex = nlp.pageDict\n", |
| 145 | + "\n", |
| 146 | + "nlp.pagesContent ={pA.address: testHTML, pB.address: testHTML2,\n", |
| 147 | + " pC.address: testHTML, pD.address: testHTML2,\n", |
| 148 | + " pE.address: testHTML, pF.address: testHTML2}" |
| 149 | + ] |
| 150 | + }, |
| 151 | + { |
| 152 | + "cell_type": "markdown", |
| 153 | + "metadata": {}, |
| 154 | + "source": [ |
| 155 | + "We can now run the HITS algorithm. Our query will be 'mammals' (note that while the content of the HTML doesn't matter, it should include the query words or else no page will be picked at the first step)." |
12 | 156 | ]
|
13 | 157 | },
|
14 | 158 | {
|
15 | 159 | "cell_type": "code",
|
16 |
| - "execution_count": null, |
| 160 | + "execution_count": 4, |
17 | 161 | "metadata": {
|
18 | 162 | "collapsed": true
|
19 | 163 | },
|
20 | 164 | "outputs": [],
|
21 |
| - "source": [] |
| 165 | + "source": [ |
| 166 | + "HITS('mammals')\n", |
| 167 | + "page_list = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"]\n", |
| 168 | + "auth_list = [pA.authority, pB.authority, pC.authority, pD.authority, pE.authority, pF.authority]\n", |
| 169 | + "hub_list = [pA.hub, pB.hub, pC.hub, pD.hub, pE.hub, pF.hub]" |
| 170 | + ] |
| 171 | + }, |
| 172 | + { |
| 173 | + "cell_type": "markdown", |
| 174 | + "metadata": {}, |
| 175 | + "source": [ |
| 176 | + "Let's see how the pages were scored:" |
| 177 | + ] |
| 178 | + }, |
| 179 | + { |
| 180 | + "cell_type": "code", |
| 181 | + "execution_count": 5, |
| 182 | + "metadata": {}, |
| 183 | + "outputs": [ |
| 184 | + { |
| 185 | + "name": "stdout", |
| 186 | + "output_type": "stream", |
| 187 | + "text": [ |
| 188 | + "A: total=0.7696163397038682, auth=0.5583254178509696, hub=0.2112909218528986\n", |
| 189 | + "B: total=0.7795962360479534, auth=0.23657856688600404, hub=0.5430176691619494\n", |
| 190 | + "C: total=0.8204496913590655, auth=0.4211098490570872, hub=0.3993398423019784\n", |
| 191 | + "D: total=0.6316647735856309, auth=0.6316647735856309, hub=0.0\n", |
| 192 | + "E: total=0.7078245882072104, auth=0.0, hub=0.7078245882072104\n", |
| 193 | + "F: total=0.23657856688600404, auth=0.23657856688600404, hub=0.0\n" |
| 194 | + ] |
| 195 | + } |
| 196 | + ], |
| 197 | + "source": [ |
| 198 | + "for i in range(6):\n", |
| 199 | + " p = page_list[i]\n", |
| 200 | + " a = auth_list[i]\n", |
| 201 | + " h = hub_list[i]\n", |
| 202 | + " \n", |
| 203 | + " print(\"{}: total={}, auth={}, hub={}\".format(p, a + h, a, h))" |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "cell_type": "markdown", |
| 208 | + "metadata": { |
| 209 | + "collapsed": true |
| 210 | + }, |
| 211 | + "source": [ |
| 212 | + "The top score is 0.82 by \"C\". This is the most relevant page according to the algorithm. You can see that the pages it links to, \"A\" and \"D\", have the two highest authority scores (therefore \"C\" has a high hub score) and the pages it is linked from, \"B\" and \"E\", have the highest hub scores (so \"C\" has a high authority score). By combining these two facts, we get that \"C\" is the most relevant page. It is worth noting that it does not matter if the given page contains the query words, just that it links and is linked from high-quality pages." |
| 213 | + ] |
22 | 214 | }
|
23 | 215 | ],
|
24 | 216 | "metadata": {
|
|
37 | 229 | "name": "python",
|
38 | 230 | "nbconvert_exporter": "python",
|
39 | 231 | "pygments_lexer": "ipython3",
|
40 |
| - "version": "3.5.1" |
| 232 | + "version": "3.5.2+" |
41 | 233 | }
|
42 | 234 | },
|
43 | 235 | "nbformat": 4,
|
44 |
| - "nbformat_minor": 0 |
| 236 | + "nbformat_minor": 1 |
45 | 237 | }
|