Commit 52eb90e

Chipe1 authored and norvig committed
Added ShiftDecoder to notebook (#463)
* Added ShiftDecoder to notebook
* replaced code with psource
* fix spelling mistakes
1 parent 034d279 commit 52eb90e

text.ipynb

Lines changed: 184 additions & 36 deletions
Original file line number | Diff line number | Diff line change
@@ -13,6 +13,20 @@
1313
"This notebook serves as supporting material for topics covered in **Chapter 22 - Natural Language Processing** from the book *Artificial Intelligence: A Modern Approach*. This notebook uses implementations from [text.py](https://github.com/aimacode/aima-python/blob/master/text.py)."
1414
]
1515
},
16+
{
17+
"cell_type": "code",
18+
"execution_count": 1,
19+
"metadata": {
20+
"collapsed": true,
21+
"deletable": true,
22+
"editable": true
23+
},
24+
"outputs": [],
25+
"source": [
26+
"from text import *\n",
27+
"from utils import DataFile"
28+
]
29+
},
1630
{
1731
"cell_type": "markdown",
1832
"metadata": {
@@ -26,7 +40,11 @@
2640
"* Viterbi Text Segmentation\n",
2741
" * Overview\n",
2842
" * Implementation\n",
29-
" * Example"
43+
" * Example\n",
44+
"* Decoders\n",
45+
" * Introduction\n",
46+
" * Shift Decoder\n",
47+
" * Permutation Decoder"
3048
]
3149
},
3250
{
@@ -49,7 +67,7 @@
4967
},
5068
{
5169
"cell_type": "code",
52-
"execution_count": 4,
70+
"execution_count": 2,
5371
"metadata": {
5472
"collapsed": false,
5573
"deletable": true,
@@ -66,9 +84,6 @@
6684
}
6785
],
6886
"source": [
69-
"from text import UnigramTextModel, NgramTextModel, words\n",
70-
"from utils import DataFile\n",
71-
"\n",
7287
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
7388
"wordseq = words(flatland)\n",
7489
"\n",
@@ -117,38 +132,15 @@
117132
},
118133
{
119134
"cell_type": "code",
120-
"execution_count": 1,
135+
"execution_count": 3,
121136
"metadata": {
122-
"collapsed": true,
137+
"collapsed": false,
123138
"deletable": true,
124139
"editable": true
125140
},
126141
"outputs": [],
127142
"source": [
128-
"def viterbi_segment(text, P):\n",
129-
" \"\"\"Find the best segmentation of the string of characters, given the\n",
130-
" UnigramTextModel P.\"\"\"\n",
131-
" # best[i] = best probability for text[0:i]\n",
132-
" # words[i] = best word ending at position i\n",
133-
" n = len(text)\n",
134-
" words = [''] + list(text)\n",
135-
" best = [1.0] + [0.0] * n\n",
136-
" # Fill in the vectors best words via dynamic programming\n",
137-
" for i in range(n+1):\n",
138-
" for j in range(0, i):\n",
139-
" w = text[j:i]\n",
140-
" newbest = P[w] * best[i - len(w)]\n",
141-
" if newbest >= best[i]:\n",
142-
" best[i] = newbest\n",
143-
" words[i] = w\n",
144-
" # Now recover the sequence of best words\n",
145-
" sequence = []\n",
146-
" i = len(words) - 1\n",
147-
" while i > 0:\n",
148-
" sequence[0:0] = [words[i]]\n",
149-
" i = i - len(words[i])\n",
150-
" # Return sequence of best words and overall probability\n",
151-
" return sequence, best[-1]"
143+
"%psource viterbi_segment"
152144
]
153145
},
154146
{
@@ -177,7 +169,7 @@
177169
},
178170
{
179171
"cell_type": "code",
180-
"execution_count": 6,
172+
"execution_count": 4,
181173
"metadata": {
182174
"collapsed": false,
183175
"deletable": true,
@@ -194,9 +186,6 @@
194186
}
195187
],
196188
"source": [
197-
"from text import UnigramTextModel, words, viterbi_segment\n",
198-
"from utils import DataFile\n",
199-
"\n",
200189
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
201190
"wordseq = words(flatland)\n",
202191
"P = UnigramTextModel(wordseq)\n",
@@ -216,6 +205,165 @@
216205
"source": [
217206
"The algorithm correctly retrieved the words from the string. It also gave us the probability of this sequence, which is small, but still the most probable segmentation of the string."
218207
]
208+
},
209+
{
210+
"cell_type": "markdown",
211+
"metadata": {
212+
"deletable": true,
213+
"editable": true
214+
},
215+
"source": [
216+
"## Decoders\n",
217+
"\n",
218+
"### Introduction\n",
219+
"\n",
220+
"In this section we will try to decode ciphertext using probabilistic text models. A ciphertext is obtained by performing encryption on a text message. This encryption lets us communicate safely, as anyone who has access to the ciphertext but doesn't know how to decode it cannot read the message. We will restrict our study to <b>Monoalphabetic Substitution Ciphers</b>. These are primitive forms of cipher where each letter in the message text (also known as plaintext) is replaced by another letter of the alphabet.\n",
221+
"\n",
222+
"### Shift Decoder\n",
223+
"\n",
224+
"#### The Caesar cipher\n",
225+
"\n",
226+
"The Caesar cipher, also known as the shift cipher, is a form of monoalphabetic substitution cipher in which each letter is <i>shifted</i> by a fixed value. A shift by <b>`n`</b> in this context means that each letter in the plaintext is replaced with the letter `n` positions further down the alphabet. For example, the plaintext `\"ABCDWXYZ\"` shifted by `3` yields `\"DEFGZABC\"`. Note how `X` became `A`. This is because the alphabet is cyclic, i.e. the letter after the last letter in the alphabet, `Z`, is the first letter of the alphabet, `A`."
227+
]
228+
},
229+
{
230+
"cell_type": "code",
231+
"execution_count": 5,
232+
"metadata": {
233+
"collapsed": false,
234+
"deletable": true,
235+
"editable": true
236+
},
237+
"outputs": [
238+
{
239+
"name": "stdout",
240+
"output_type": "stream",
241+
"text": [
242+
"DEFGZABC\n"
243+
]
244+
}
245+
],
246+
"source": [
247+
"plaintext = \"ABCDWXYZ\"\n",
248+
"ciphertext = shift_encode(plaintext, 3)\n",
249+
"print(ciphertext)"
250+
]
251+
},
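Note on the cell above: the diff only shows `shift_encode` being called, not how it is defined in `text.py`. The following is a minimal sketch of one way such a shift could be written, assuming only ASCII letters are shifted and everything else is left untouched. The name `shift_encode_sketch` is hypothetical and is not the function from the repository.

```python
import string


def shift_encode_sketch(plaintext, n):
    """Shift every ASCII letter n places down the alphabet, wrapping past Z.

    Hypothetical helper for illustration; not the repository's shift_encode.
    """
    n %= 26  # the alphabet is cyclic, so a shift of 26 is the same as 0
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    # Map each letter to the letter n positions later, preserving case.
    table = str.maketrans(lower + upper,
                          lower[n:] + lower[:n] + upper[n:] + upper[:n])
    # Characters without an entry in the table (spaces, digits) pass through.
    return plaintext.translate(table)


print(shift_encode_sketch("ABCDWXYZ", 3))
print(shift_encode_sketch("This is a secret message", 13))
```

Because the shift is taken modulo 26, encoding with `n` and then with `26 - n` returns the original text, which is what makes brute-force decoding over all shifts possible.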
252+
{
253+
"cell_type": "markdown",
254+
"metadata": {
255+
"deletable": true,
256+
"editable": true
257+
},
258+
"source": [
259+
"#### Decoding a Caesar cipher\n",
260+
"\n",
261+
"To decode a Caesar cipher we exploit the fact that not all letters in the alphabet are used equally. Some letters are used more than others and some pairs of letters are more likely to occur together. We call a pair of consecutive letters a <b>bigram</b>."
262+
]
263+
},
264+
{
265+
"cell_type": "code",
266+
"execution_count": 6,
267+
"metadata": {
268+
"collapsed": false,
269+
"deletable": true,
270+
"editable": true
271+
},
272+
"outputs": [
273+
{
274+
"name": "stdout",
275+
"output_type": "stream",
276+
"text": [
277+
"['th', 'hi', 'is', 's ', ' i', 'is', 's ', ' a', 'a ', ' s', 'se', 'en', 'nt', 'te', 'en', 'nc', 'ce']\n"
278+
]
279+
}
280+
],
281+
"source": [
282+
"print(bigrams('this is a sentence'))"
283+
]
284+
},
285+
{
286+
"cell_type": "markdown",
287+
"metadata": {
288+
"deletable": true,
289+
"editable": true
290+
},
291+
"source": [
292+
"We use `CountingProbDist` to get the probability distribution of bigrams. The Latin alphabet consists of only `26` letters, which limits the number of distinct shifts to `26`. We reverse the shift encoding for a given `n` and check how probable the result is under the bigram distribution. We try all `26` values of `n`, i.e. from `n = 0` to `n = 25`, and use the value of `n` that gives the most probable plaintext."
293+
]
294+
},
295+
{
296+
"cell_type": "code",
297+
"execution_count": 7,
298+
"metadata": {
299+
"collapsed": false,
300+
"deletable": true,
301+
"editable": true
302+
},
303+
"outputs": [],
304+
"source": [
305+
"%psource ShiftDecoder"
306+
]
307+
},
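Note on the cell above: `%psource ShiftDecoder` prints the class at run time, so its source does not appear in this diff. The sketch below illustrates the procedure the notebook describes (score every one of the 26 shifts with a bigram model and keep the most probable). It is an assumption-laden stand-in: it uses `collections.Counter` with add-one smoothing and log-probabilities instead of `CountingProbDist`, and the names `BigramShiftDecoder`, `shift`, `corpus`, and the tiny training text are all hypothetical.

```python
import math
import string
from collections import Counter


def shift(text, n):
    """Shift ASCII letters by n places (hypothetical helper, case-preserving)."""
    n %= 26
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(lower + upper,
                          lower[n:] + lower[:n] + upper[n:] + upper[:n])
    return text.translate(table)


def bigrams(text):
    """Return every pair of consecutive characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramShiftDecoder:
    """Illustrative decoder: pick the shift whose output looks most like the training text."""

    def __init__(self, training_text):
        self.counts = Counter(bigrams(training_text.lower()))
        self.total = sum(self.counts.values())
        # Assumed symbol set: 26 letters plus space, so 27 * 27 possible bigrams.
        self.vocab = 27 * 27

    def score(self, text):
        # Sum of log-probabilities with add-one smoothing, so unseen
        # bigrams are penalised instead of zeroing out the whole score.
        return sum(math.log((self.counts[b] + 1) / (self.total + self.vocab))
                   for b in bigrams(text.lower()))

    def decode(self, ciphertext):
        # Try all 26 shifts (n = 0 .. 25) and keep the best-scoring candidate.
        candidates = [shift(ciphertext, n) for n in range(26)]
        return max(candidates, key=self.score)


# Tiny made-up training text; the notebook trains on flatland.txt instead.
corpus = ("this is a short sample of english text used to train the model "
          "the secret message is hidden in this sentence and the decoder "
          "should recover the message from the shifted text")

decoder = BigramShiftDecoder(corpus)
print(decoder.decode(shift("This is a secret message", 13)))
```

Working in log space avoids multiplying many small probabilities together, which would underflow on longer messages. With a corpus as small as the one here the result depends heavily on the training text, so a larger text such as the one the notebook uses is the more realistic choice.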
308+
{
309+
"cell_type": "markdown",
310+
"metadata": {
311+
"deletable": true,
312+
"editable": true
313+
},
314+
"source": [
315+
"#### Example\n",
316+
"\n",
317+
"Let us encode a secret message using the Caesar cipher and then try decoding it using `ShiftDecoder`. We will again use `flatland.txt` to build the text model."
318+
]
319+
},
320+
{
321+
"cell_type": "code",
322+
"execution_count": 8,
323+
"metadata": {
324+
"collapsed": false,
325+
"deletable": true,
326+
"editable": true
327+
},
328+
"outputs": [
329+
{
330+
"name": "stdout",
331+
"output_type": "stream",
332+
"text": [
333+
"The code is \"Guvf vf n frperg zrffntr\"\n"
334+
]
335+
}
336+
],
337+
"source": [
338+
"plaintext = \"This is a secret message\"\n",
339+
"ciphertext = shift_encode(plaintext, 13)\n",
340+
"print('The code is', '\"' + ciphertext + '\"')"
341+
]
342+
},
343+
{
344+
"cell_type": "code",
345+
"execution_count": 9,
346+
"metadata": {
347+
"collapsed": false,
348+
"deletable": true,
349+
"editable": true
350+
},
351+
"outputs": [
352+
{
353+
"name": "stdout",
354+
"output_type": "stream",
355+
"text": [
356+
"The decoded message is \"This is a secret message\"\n"
357+
]
358+
}
359+
],
360+
"source": [
361+
"flatland = DataFile(\"EN-text/flatland.txt\").read()\n",
362+
"decoder = ShiftDecoder(flatland)\n",
363+
"\n",
364+
"decoded_message = decoder.decode(ciphertext)\n",
365+
"print('The decoded message is', '\"' + decoded_message + '\"')"
366+
]
219367
}
220368
],
221369
"metadata": {
@@ -234,7 +382,7 @@
234382
"name": "python",
235383
"nbconvert_exporter": "python",
236384
"pygments_lexer": "ipython3",
237-
"version": "3.5.2"
385+
"version": "3.6.0"
238386
}
239387
},
240388
"nbformat": 4,
