
Commit be0a10e

antmarakis authored and norvig committed
NLP: Notebook + Minor Changes (aimacode#579)
* Update nlp.py
* Update test_nlp.py
* Update nlp.ipynb
1 parent 39db351 commit be0a10e

File tree

3 files changed: +211, -21 lines changed

nlp.ipynb

Lines changed: 199 additions & 7 deletions
@@ -1,24 +1,216 @@
 {
 "cells": [
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"# NATURAL LANGUAGE PROCESSING\n",
+"\n",
+"This notebook covers chapters 22 and 23 from the book *Artificial Intelligence: A Modern Approach*, 3rd Edition. The implementations of the algorithms can be found in [nlp.py](https://github.com/aimacode/aima-python/blob/master/nlp.py).\n",
+"\n",
+"Run the cell below to import the code from the module and get started!"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 1,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"import nlp\n",
+"from nlp import Page, HITS"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"## CONTENTS\n",
+"\n",
+"* Overview\n",
+"* HITS"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## OVERVIEW\n",
+"\n",
+"`TODO...`"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## HITS\n",
+"\n",
+"### Overview\n",
+"\n",
+"**Hyperlink-Induced Topic Search** (or HITS for short) is an algorithm for information retrieval and page ranking. You can read more on information retrieval in the [text](https://github.com/aimacode/aima-python/blob/master/text.ipynb) notebook. Essentially, given a collection of documents and a user's query, such systems return to the user the documents most relevant to the query. HITS differs from many other ranking algorithms (like Google's *PageRank*) in that its page ratings depend on the given query; for each new query the scores must be computed anew. This cost might be prohibitive for many modern search engines, so a lot of them steer away from this approach.\n",
+"\n",
+"HITS first finds a list of pages relevant to the query and then adds pages that link to or are linked from these pages. Once the set is built, we define two values for each page: **authority** on the query, the degree to which pages from the relevant set link to it, and **hub** of the query, the degree to which it points to authoritative pages in the set. Since we do not want to simply count the number of links from a page to other pages, but also want to take into account the quality of the linked pages, we update the hub and authority values of a page in the following manner, until convergence:\n",
+"\n",
+"* Hub score = The sum of the authority scores of the pages it links to.\n",
+"\n",
+"* Authority score = The sum of the hub scores of the pages it is linked from.\n",
+"\n",
+"So the higher the quality of the pages a page links to and is linked from, the higher its scores.\n",
+"\n",
+"We then normalize the scores by dividing each score by the sum of the squares of the respective scores of all pages. When the values converge, we return the top-valued pages. Note that because we normalize the values, the algorithm is guaranteed to converge."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"### Implementation\n",
+"\n",
+"The source code for the algorithm is given below:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 2,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"%psource HITS"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"First we compile the collection of pages as mentioned above. Then we initialize the authority and hub scores for each page, and finally we update and normalize the values until convergence.\n",
+"\n",
+"A quick overview of the helper functions we use:\n",
+"\n",
+"* `relevant_pages`: Returns relevant pages from `pagesIndex` given a query.\n",
+"\n",
+"* `expand_pages`: Adds to the collection pages linked to and from the given `pages`.\n",
+"\n",
+"* `normalize`: Normalizes authority and hub scores.\n",
+"\n",
+"* `ConvergenceDetector`: A class that checks for convergence by keeping a history of the pages' scores and checking whether they still change.\n",
+"\n",
+"* `Page`: The template for pages. Stores the address, authority/hub scores and in-links/out-links."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"### Example\n",
+"\n",
+"Before we begin, we need to define a list of sample pages to work on. The pages are `pA`, `pB` and so on, and their text is given by `testHTML` and `testHTML2`. The `Page` class takes the in-links and out-links as lists. For page \"A\", the in-links are \"B\", \"C\" and \"E\", while the sole out-link is \"D\".\n",
+"\n",
+"We also need to set the `nlp` global variables `pageDict`, `pagesIndex` and `pagesContent`."
+]
+},
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 3,
 "metadata": {
-"collapsed": false
+"collapsed": true
 },
 "outputs": [],
 "source": [
-"import nlp"
+"testHTML = \"\"\"Like most other male mammals, a man inherits an\n",
+" X from his mom and a Y from his dad.\"\"\"\n",
+"testHTML2 = \"a mom and a dad\"\n",
+"\n",
+"pA = Page(\"A\", [\"B\", \"C\", \"E\"], [\"D\"])\n",
+"pB = Page(\"B\", [\"E\"], [\"A\", \"C\", \"D\"])\n",
+"pC = Page(\"C\", [\"B\", \"E\"], [\"A\", \"D\"])\n",
+"pD = Page(\"D\", [\"A\", \"B\", \"C\", \"E\"], [])\n",
+"pE = Page(\"E\", [], [\"A\", \"B\", \"C\", \"D\", \"F\"])\n",
+"pF = Page(\"F\", [\"E\"], [])\n",
+"\n",
+"nlp.pageDict = {pA.address: pA, pB.address: pB, pC.address: pC,\n",
+" pD.address: pD, pE.address: pE, pF.address: pF}\n",
+"\n",
+"nlp.pagesIndex = nlp.pageDict\n",
+"\n",
+"nlp.pagesContent ={pA.address: testHTML, pB.address: testHTML2,\n",
+" pC.address: testHTML, pD.address: testHTML2,\n",
+" pE.address: testHTML, pF.address: testHTML2}"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can now run the HITS algorithm. Our query will be 'mammals' (note that while the rest of the HTML content doesn't matter, it must include the query words, or else no page will be picked in the first step)."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": null,
+"execution_count": 4,
 "metadata": {
 "collapsed": true
 },
 "outputs": [],
-"source": []
+"source": [
+"HITS('mammals')\n",
+"page_list = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"]\n",
+"auth_list = [pA.authority, pB.authority, pC.authority, pD.authority, pE.authority, pF.authority]\n",
+"hub_list = [pA.hub, pB.hub, pC.hub, pD.hub, pE.hub, pF.hub]"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's see how the pages were scored:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 5,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"A: total=0.7696163397038682, auth=0.5583254178509696, hub=0.2112909218528986\n",
+"B: total=0.7795962360479534, auth=0.23657856688600404, hub=0.5430176691619494\n",
+"C: total=0.8204496913590655, auth=0.4211098490570872, hub=0.3993398423019784\n",
+"D: total=0.6316647735856309, auth=0.6316647735856309, hub=0.0\n",
+"E: total=0.7078245882072104, auth=0.0, hub=0.7078245882072104\n",
+"F: total=0.23657856688600404, auth=0.23657856688600404, hub=0.0\n"
+]
+}
+],
+"source": [
+"for i in range(6):\n",
+"    p = page_list[i]\n",
+"    a = auth_list[i]\n",
+"    h = hub_list[i]\n",
+"    \n",
+"    print(\"{}: total={}, auth={}, hub={}\".format(p, a + h, a, h))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"The top score is 0.82 by \"C\". This is the most relevant page according to the algorithm. You can see that the pages it links to, \"A\" and \"D\", have the two highest authority scores (therefore \"C\" has a high hub score) and the pages it is linked from, \"B\" and \"E\", have the highest hub scores (so \"C\" has a high authority score). Combining these two facts, \"C\" comes out as the most relevant page. It is worth noting that it does not matter whether the given page contains the query words, just that it links to and is linked from high-quality pages."
+]
 }
 ],
 "metadata": {
@@ -37,9 +229,9 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.5.1"
+"version": "3.5.2+"
 }
 },
 "nbformat": 4,
-"nbformat_minor": 0
+"nbformat_minor": 1
 }
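The markdown cells added above describe the hub/authority update only in prose. As a reading aid, here is a minimal, self-contained sketch of that update loop. It is not the module's implementation: it uses plain dicts instead of the `Page` class, a fixed iteration count instead of the `ConvergenceDetector`, and it skips the query-based selection of relevant pages, so the exact numbers will differ from the notebook output.

import math

# Link structure mirroring the notebook's example pages A-F.
inlinks = {
    "A": ["B", "C", "E"], "B": ["E"], "C": ["B", "E"],
    "D": ["A", "B", "C", "E"], "E": [], "F": ["E"],
}
outlinks = {
    "A": ["D"], "B": ["A", "C", "D"], "C": ["A", "D"],
    "D": [], "E": ["A", "B", "C", "D", "F"], "F": [],
}

auth = {p: 1.0 for p in inlinks}   # authority scores
hub = {p: 1.0 for p in inlinks}    # hub scores

for _ in range(50):                # fixed number of rounds instead of a convergence test
    # authority = sum of the hub scores of the pages linking to this page
    new_auth = {p: sum(hub[q] for q in inlinks[p]) for p in inlinks}
    # hub = sum of the authority scores of the pages this page links to
    new_hub = {p: sum(auth[q] for q in outlinks[p]) for p in outlinks}
    # normalize so the squared scores sum to 1 and the values stay bounded
    a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
    h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
    auth = {p: v / a_norm for p, v in new_auth.items()}
    hub = {p: v / h_norm for p, v in new_hub.items()}

for p in sorted(auth, key=lambda q: -(auth[q] + hub[q])):
    print(p, round(auth[p], 3), round(hub[p], 3))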
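A related quick check one can run against the scores printed in the notebook output above (values copied verbatim from the output cell): after normalization the squared authority scores sum to roughly 1, and the squared hub scores do as well.

# Authority and hub values copied from the notebook output above.
auth_scores = [0.5583254178509696, 0.23657856688600404, 0.4211098490570872,
               0.6316647735856309, 0.0, 0.23657856688600404]
hub_scores = [0.2112909218528986, 0.5430176691619494, 0.3993398423019784,
              0.0, 0.7078245882072104, 0.0]

print(sum(a * a for a in auth_scores))  # ~1.0
print(sum(h * h for h in hub_scores))   # ~1.0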

nlp.py

Lines changed: 6 additions & 8 deletions
@@ -285,7 +285,7 @@ def onlyWikipediaURLS(urls):
 # HITS Helper Functions

 def expand_pages(pages):
-    """From Textbook: adds in every page that links to or is linked from one of
+    """Adds in every page that links to or is linked from one of
     the relevant pages."""
     expanded = {}
     for addr, page in pages.items():
@@ -301,7 +301,7 @@ def expand_pages(pages):


 def relevant_pages(query):
-    """Relevant pages are pages that contain all of the query words. They are obtained by
+    """Relevant pages are pages that contain all of the query words. They are obtained by
     intersecting the hit lists of the query words."""
     hit_intersection = {addr for addr in pagesIndex}
     query_words = query.split()
@@ -314,8 +314,8 @@ def relevant_pages(query):
     return {addr: pagesIndex[addr] for addr in hit_intersection}

 def normalize(pages):
-    """From the pseudocode: Normalize divides each page's score by the sum of
-    the squares of all pages' scores (separately for both the authority and hubs scores).
+    """Normalize divides each page's score by the sum of the squares of all
+    pages' scores (separately for both the authority and hub scores).
     """
     summed_hub = sum(page.hub**2 for _, page in pages.items())
     summed_auth = sum(page.authority**2 for _, page in pages.items())
@@ -371,7 +371,7 @@ def getOutlinks(page):
 # HITS Algorithm

 class Page(object):
-    def __init__(self, address, hub=0, authority=0, inlinks=None, outlinks=None):
+    def __init__(self, address, inlinks=None, outlinks=None, hub=0, authority=0):
         self.address = address
         self.hub = hub
         self.authority = authority
@@ -390,7 +390,7 @@ def HITS(query):
     for p in pages.values():
         p.authority = 1
         p.hub = 1
-    while True:  # repeat until... convergence
+    while not convergence():
         authority = {p: pages[p].authority for p in pages}
         hub = {p: pages[p].hub for p in pages}
         for p in pages:
@@ -399,6 +399,4 @@ def HITS(query):
             # p.hub ← ∑i Outlinki(p).Authority
             pages[p].hub = sum(authority[x] for x in getOutlinks(pages[p]))
         normalize(pages)
-        if convergence():
-            break
     return pages
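The loop rewrite above (`while True: ... if convergence(): break` becoming `while not convergence()`) relies on the convergence check returning False until it has seen at least one round of scores. The notebook describes `ConvergenceDetector` as keeping a history of the pages' scores and checking whether they still change; the snippet below is a rough sketch of that idea, not the module's actual class, and `ScoreHistory` is a made-up name.

class ScoreHistory:
    """Callable convergence check: remembers the last scores it saw and
    reports convergence once they stop changing."""

    def __init__(self, pages, tol=1e-9):
        self.pages = pages
        self.tol = tol
        self.prev = None

    def __call__(self):
        snapshot = [(p.authority, p.hub) for p in self.pages.values()]
        converged = self.prev is not None and all(
            abs(a - pa) <= self.tol and abs(h - ph) <= self.tol
            for (a, h), (pa, ph) in zip(snapshot, self.prev))
        self.prev = snapshot
        return converged

With this sketch, the first call necessarily returns False (there is no previous snapshot yet), so `while not convergence()` still performs at least one update round, matching the structure of the old `while True ... break` loop.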

tests/test_nlp.py

Lines changed: 6 additions & 6 deletions
@@ -45,12 +45,12 @@ def test_lexicon():
 </html>
 """

-pA = Page("A", 1, 6, ["B", "C", "E"], ["D"])
-pB = Page("B", 2, 5, ["E"], ["A", "C", "D"])
-pC = Page("C", 3, 4, ["B", "E"], ["A", "D"])
-pD = Page("D", 4, 3, ["A", "B", "C", "E"], [])
-pE = Page("E", 5, 2, [], ["A", "B", "C", "D", "F"])
-pF = Page("F", 6, 1, ["E"], [])
+pA = Page("A", ["B", "C", "E"], ["D"], 1, 6)
+pB = Page("B", ["E"], ["A", "C", "D"], 2, 5)
+pC = Page("C", ["B", "E"], ["A", "D"], 3, 4)
+pD = Page("D", ["A", "B", "C", "E"], [], 4, 3)
+pE = Page("E", [], ["A", "B", "C", "D", "F"], 5, 2)
+pF = Page("F", ["E"], [], 6, 1)
 pageDict = {pA.address: pA, pB.address: pB, pC.address: pC,
             pD.address: pD, pE.address: pE, pF.address: pF}
 nlp.pagesIndex = pageDict
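The test changes above follow directly from the `Page` signature reordering in nlp.py: with the new parameter order, an old-style call such as Page("A", 1, 6, [...], [...]) would bind the scores to the link-list parameters and vice versa. A small usage sketch of the new ordering (the variable names are illustrative, not from the test file):

from nlp import Page

# New signature: Page(address, inlinks=None, outlinks=None, hub=0, authority=0)
pA = Page("A", ["B", "C", "E"], ["D"])                      # hub/authority default to 0
pB = Page("B", ["E"], ["A", "C", "D"], hub=2, authority=5)  # positional links, keyword scores

print(pA.hub, pA.authority)   # 0 0
print(pB.hub, pB.authority)   # 2 5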

0 commit comments
