
Commit 2f03807

antmarakis authored and norvig committed
NLP: Chart Parsing (aimacode#612)
* Update nlp.py
* add chart parsing test
* add chart parsing section
1 parent d84c3bf commit 2f03807

File tree

3 files changed: +258 additions, -8 deletions

nlp.ipynb

Lines changed: 250 additions & 3 deletions
@@ -22,7 +22,7 @@
 "import nlp\n",
 "from nlp import Page, HITS\n",
 "from nlp import Lexicon, Rules, Grammar, ProbLexicon, ProbRules, ProbGrammar\n",
-"from nlp import CYK_parse"
+"from nlp import CYK_parse, Chart"
 ]
 },
 {
@@ -36,7 +36,9 @@
 "* Overview\n",
 "* Languages\n",
 "* HITS\n",
-"* Question Answering"
+"* Question Answering\n",
+"* CYK Parse\n",
+"* Chart Parsing"
 ]
 },
 {
@@ -45,7 +47,11 @@
 "source": [
 "## OVERVIEW\n",
 "\n",
-"`TODO...`"
+"**Natural Language Processing (NLP)** is a field of AI concerned with understanding, analyzing and using natural languages. It is considered a difficult yet intriguing area of study, since it is closely tied to how humans and their languages work.\n",
+"\n",
+"Applications of the field include translation, speech recognition, topic segmentation, information extraction and retrieval, and much more.\n",
+"\n",
+"Below we take a look at some algorithms in the field. Before we get right into it though, we will take a look at a very useful class of languages: **context-free** languages. Even though they are a bit restrictive, they have been widely used in natural language processing research."
 ]
 },
 {
@@ -908,6 +914,247 @@
 "\n",
 "Notice how the probability for the whole string (given by the key `('S', 0, 4)`) is 0.015. This means the most probable parsing of the sentence has a probability of 0.015."
 ]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## CHART PARSING\n",
+"\n",
+"### Overview\n",
+"\n",
+"Let's now take a look at a more general chart parsing algorithm. Given a non-probabilistic grammar and a sentence, this algorithm builds a parse tree in a top-down manner, with the words of the sentence as the leaves. It works with a dynamic programming approach, building a chart to store parses for substrings so that it doesn't have to analyze them again (just like the CYK algorithm). Each non-terminal, starting from S, gets replaced by its right-hand side rules in the chart, until we end up with the correct parses.\n",
+"\n",
+"### Implementation\n",
+"\n",
+"A parse is in the form `[start, end, non-terminal, sub-tree, expected-transformation]`, where `sub-tree` is a tree with the corresponding `non-terminal` as its root and `expected-transformation` is a right-hand side rule of the `non-terminal`.\n",
+"\n",
+"Chart parsing is implemented in the `Chart` class. It is initialized with a grammar and can return the list of all the parses of a sentence via its `parses` function.\n",
+"\n",
+"The chart is a list of lists, with one list for each position in the sentence, from 0 up to its length. The list at position `j` holds the edges that end at `j`. When we say 'a point in the chart', we refer to one of these positions.\n",
+"\n",
+"A quick rundown of the class functions:"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {
+"collapsed": true
+},
+"source": [
+"* `parses`: Returns a list of parses for a given sentence. If the sentence can't be parsed, it returns an empty list. It initializes the process by calling `parse` from the starting symbol.\n",
+"\n",
+"\n",
+"* `parse`: Parses the list of words and builds the chart.\n",
+"\n",
+"\n",
+"* `add_edge`: Adds another edge to the chart at a given point. It also examines whether the edge extends or predicts another edge: if the edge itself is not expecting a transformation, it extends other edges; otherwise, it predicts new edges.\n",
+"\n",
+"\n",
+"* `scanner`: Given a word and a point in the chart, it extends edges that were expecting a transformation that can result in the given word. For example, if the word 'the' is an 'Article' and we are examining two edges at a chart's point, with one expecting an 'Article' and the other a 'Verb', the first one will be extended while the second one will not.\n",
+"\n",
+"\n",
+"* `predictor`: If an edge can't extend other edges (because it is expecting a transformation itself), we add to the chart rules/transformations that can help extend the edge. The new edges come from the right-hand sides of the expected transformation's rules. For example, if an edge is expecting the transformation 'Adjective Noun', we add to the chart an edge for each right-hand side rule of the non-terminal 'Adjective'.\n",
+"\n",
+"\n",
+"* `extender`: Extends edges given an edge (called `E`). If `E`'s non-terminal is the same as the expected transformation of another edge (let's call it `A`), add to the chart a new edge with the non-terminal of `A` and the transformations of `A` minus the non-terminal that matched with `E`'s non-terminal. For example, if an edge `E` has 'Article' as its non-terminal and is expecting no transformation, we need to see which edges it can extend. Let's examine the edge `N`, which expects a transformation of 'Noun Verb'. 'Noun' does not match 'Article', so we move on. Another edge, `A`, expects a transformation of 'Article Noun' and has a non-terminal of 'NP'. We have a match! A new edge will be added with 'NP' as its non-terminal (the non-terminal of `A`) and 'Noun' as the expected transformation (the rest of the expected transformation of `A`)."
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"### Example\n",
+"\n",
+"We will use the grammar `E0` to parse the sentence \"the stench is in 2 2\".\n",
+"\n",
+"First we need to build a `Chart` object:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 2,
+"metadata": {
+"collapsed": true
+},
+"outputs": [],
+"source": [
+"chart = Chart(nlp.E0)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"And then we simply call the `parses` function:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 3,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]]\n"
+]
+}
+],
+"source": [
+"print(chart.parses('the stench is in 2 2'))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"You can see which edges get added by setting the optional initialization argument `trace` to `True`."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 4,
+"metadata": {
+"collapsed": true
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Chart: added [0, 0, 'S_', [], ['S']]\n",
+"Chart: added [0, 0, 'S', [], ['NP', 'VP']]\n",
+"Chart: added [0, 0, 'NP', [], ['Pronoun']]\n",
+"Chart: added [0, 0, 'NP', [], ['Name']]\n",
+"Chart: added [0, 0, 'NP', [], ['Noun']]\n",
+"Chart: added [0, 0, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [0, 0, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [0, 0, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [0, 0, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [0, 0, 'S', [], ['S', 'Conjunction', 'S']]\n",
+"Chart: added [0, 1, 'NP', [('Article', 'the')], ['Noun']]\n",
+"Chart: added [0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]\n",
+"Chart: added [0, 2, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['VP']]\n",
+"Chart: added [2, 2, 'VP', [], ['Verb']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'NP']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'Adjective']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'PP']]\n",
+"Chart: added [2, 2, 'VP', [], ['VP', 'Adverb']]\n",
+"Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['PP']]\n",
+"Chart: added [2, 2, 'PP', [], ['Preposition', 'NP']]\n",
+"Chart: added [0, 2, 'NP', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []]], ['RelClause']]\n",
+"Chart: added [2, 2, 'RelClause', [], ['That', 'VP']]\n",
+"Chart: added [2, 3, 'VP', [('Verb', 'is')], []]\n",
+"Chart: added [0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]\n",
+"Chart: added [0, 3, 'S_', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], []]\n",
+"Chart: added [0, 3, 'S', [[0, 3, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 3, 'VP', [('Verb', 'is')], []]], []]], ['Conjunction', 'S']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['NP']]\n",
+"Chart: added [3, 3, 'NP', [], ['Pronoun']]\n",
+"Chart: added [3, 3, 'NP', [], ['Name']]\n",
+"Chart: added [3, 3, 'NP', [], ['Noun']]\n",
+"Chart: added [3, 3, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [3, 3, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [3, 3, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [3, 3, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adjective']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['PP']]\n",
+"Chart: added [3, 3, 'PP', [], ['Preposition', 'NP']]\n",
+"Chart: added [2, 3, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []]], ['Adverb']]\n",
+"Chart: added [3, 4, 'PP', [('Preposition', 'in')], ['NP']]\n",
+"Chart: added [4, 4, 'NP', [], ['Pronoun']]\n",
+"Chart: added [4, 4, 'NP', [], ['Name']]\n",
+"Chart: added [4, 4, 'NP', [], ['Noun']]\n",
+"Chart: added [4, 4, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [4, 4, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [4, 4, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [4, 4, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [4, 5, 'NP', [('Digit', '2')], ['Digit']]\n",
+"Chart: added [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]\n",
+"Chart: added [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]\n",
+"Chart: added [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]\n",
+"Chart: added [0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]\n",
+"Chart: added [0, 6, 'S_', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], []]\n",
+"Chart: added [0, 6, 'S', [[0, 6, 'S', [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []], [2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], []]], ['Conjunction', 'S']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['NP']]\n",
+"Chart: added [6, 6, 'NP', [], ['Pronoun']]\n",
+"Chart: added [6, 6, 'NP', [], ['Name']]\n",
+"Chart: added [6, 6, 'NP', [], ['Noun']]\n",
+"Chart: added [6, 6, 'NP', [], ['Article', 'Noun']]\n",
+"Chart: added [6, 6, 'NP', [], ['Digit', 'Digit']]\n",
+"Chart: added [6, 6, 'NP', [], ['NP', 'PP']]\n",
+"Chart: added [6, 6, 'NP', [], ['NP', 'RelClause']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adjective']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['PP']]\n",
+"Chart: added [6, 6, 'PP', [], ['Preposition', 'NP']]\n",
+"Chart: added [2, 6, 'VP', [[2, 6, 'VP', [[2, 3, 'VP', [('Verb', 'is')], []], [3, 6, 'PP', [('Preposition', 'in'), [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]], ['Adverb']]\n",
+"Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['PP']]\n",
+"Chart: added [4, 6, 'NP', [[4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], ['RelClause']]\n",
+"Chart: added [6, 6, 'RelClause', [], ['That', 'VP']]\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"[[0,\n",
+" 6,\n",
+" 'S',\n",
+" [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []],\n",
+" [2,\n",
+" 6,\n",
+" 'VP',\n",
+" [[2, 3, 'VP', [('Verb', 'is')], []],\n",
+" [3,\n",
+" 6,\n",
+" 'PP',\n",
+" [('Preposition', 'in'),\n",
+" [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]],\n",
+" []]],\n",
+" []]],\n",
+" []]]"
+]
+},
+"execution_count": 4,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
+"source": [
+"chart_trace = Chart(nlp.E0, trace=True)\n",
+"chart_trace.parses('the stench is in 2 2')"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Let's try to parse a sentence that is not recognized by the grammar:"
+]
+},
+{
+"cell_type": "code",
+"execution_count": 5,
+"metadata": {},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[]\n"
+]
+}
+],
+"source": [
+"print(chart.parses('the stench 2 2'))"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"An empty list was returned."
+]
 }
 ],
 "metadata": {
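The nested parse printed by the notebook can be hard to read at a glance. As a purely illustrative helper (it is not part of nlp.py), one can recover the leaf words of a parse, assuming the `[start, end, non-terminal, children, expected]` format described above, where each child is either a nested parse or a `(category, word)` tuple:

```python
def parse_leaves(parse):
    """Collect, left to right, the words at the leaves of a parse of the
    form [start, end, non-terminal, children, expected]."""
    _, _, _, children, _ = parse
    words = []
    for child in children:
        if isinstance(child, tuple):        # lexical leaf: (category, word)
            words.append(child[1])
        else:                               # nested parse: recurse
            words.extend(parse_leaves(child))
    return words

# The single parse returned for "the stench is in 2 2" above:
parse = [0, 6, 'S',
         [[0, 2, 'NP', [('Article', 'the'), ('Noun', 'stench')], []],
          [2, 6, 'VP',
           [[2, 3, 'VP', [('Verb', 'is')], []],
            [3, 6, 'PP',
             [('Preposition', 'in'),
              [4, 6, 'NP', [('Digit', '2'), ('Digit', '2')], []]], []]], []]],
         []]

print(parse_leaves(parse))  # -> ['the', 'stench', 'is', 'in', '2', '2']
```

A parse is well formed exactly when its leaves reproduce the input sentence, so a helper like this is a convenient sanity check.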

nlp.py

Lines changed: 1 addition & 4 deletions
@@ -1,8 +1,5 @@
 """Natural Language Processing; Chart Parsing and PageRanking (Chapter 22-23)"""
 
-# (Written for the second edition of AIMA; expect some discrepanciecs
-# from the third edition until this gets reviewed.)
-
 from collections import defaultdict
 from utils import weighted_choice
 import urllib.request
@@ -274,7 +271,7 @@ def __repr__(self):
 
 class Chart:
 
-    """Class for parsing sentences using a chart data structure. [Figure 22.7]
+    """Class for parsing sentences using a chart data structure.
     >>> chart = Chart(E0);
     >>> len(chart.parses('the stench is in 2 2'))
     1

tests/test_nlp.py

Lines changed: 7 additions & 1 deletion
@@ -5,7 +5,7 @@
 from nlp import expand_pages, relevant_pages, normalize, ConvergenceDetector, getInlinks
 from nlp import getOutlinks, Page, determineInlinks, HITS
 from nlp import Rules, Lexicon, Grammar, ProbRules, ProbLexicon, ProbGrammar
-from nlp import CYK_parse
+from nlp import Chart, CYK_parse
 # Clumsy imports because we want to access certain nlp.py globals explicitly, because
 # they are accessed by functions within nlp.py
 
@@ -101,6 +101,12 @@ def test_prob_generation():
     assert len(sentence) == 2
 
 
+def test_chart_parsing():
+    chart = Chart(nlp.E0)
+    parses = chart.parses('the stench is in 2 2')
+    assert len(parses) == 1
+
+
 def test_CYK_parse():
     grammar = nlp.E_Prob_Chomsky
     words = ['the', 'robot', 'is', 'good']
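The new test exercises the real `Chart` class. As a rough sketch of the idea behind it, here is a self-contained Earley-style recognizer that plays out the scanner/predictor/extender roles described in the notebook. The three-rule grammar and tiny lexicon are invented for illustration (far smaller than `E0`), and unlike the real implementation this version only recognizes, recording child categories rather than building full sub-trees:

```python
from collections import namedtuple

# An edge mirrors the notebook's parse format:
# [start, end, non-terminal, sub-tree, expected-transformation].
Edge = namedtuple('Edge', 'start end lhs found expected')

# Toy grammar and lexicon, invented for illustration only.
GRAMMAR = {
    'S':  [['NP', 'VP']],
    'NP': [['Article', 'Noun']],
    'VP': [['Verb']],
}
LEXICON = {'the': 'Article', 'stench': 'Noun', 'is': 'Verb'}

def parses(words, grammar=GRAMMAR, lexicon=LEXICON):
    """Earley-style recognizer: chart[j] holds the edges ending at j."""
    chart = [set() for _ in range(len(words) + 1)]

    def add(edge):
        if edge in chart[edge.end]:
            return
        chart[edge.end].add(edge)
        if not edge.expected:
            # Complete edge: extend every edge that was expecting
            # its non-terminal (the 'extender' role).
            for e in list(chart[edge.start]):
                if e.expected and e.expected[0] == edge.lhs:
                    add(Edge(e.start, edge.end, e.lhs,
                             e.found + (edge.lhs,), e.expected[1:]))
        else:
            # Incomplete edge: predict expansions of the next expected
            # symbol (the 'predictor' role).
            for rhs in grammar.get(edge.expected[0], []):
                add(Edge(edge.end, edge.end, edge.expected[0], (), tuple(rhs)))

    add(Edge(0, 0, 'S_', (), ('S',)))          # dummy start edge
    for j, word in enumerate(words):           # the 'scanner' role
        cat = lexicon[word]                    # KeyError on unknown words
        for e in list(chart[j]):
            if e.expected and e.expected[0] == cat:
                add(Edge(e.start, j + 1, e.lhs,
                         e.found + ((cat, word),), e.expected[1:]))
    return [e for e in chart[len(words)]
            if e.lhs == 'S' and e.start == 0 and not e.expected]

print(len(parses(['the', 'stench', 'is'])))   # -> 1
print(parses(['the', 'is'])) 											# -> []
```

A sentence the toy grammar cannot derive yields an empty list, which is the same contract the new `test_chart_parsing` asserts against the real `Chart`.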

0 commit comments
