
VANCOUVER SUMMER PROGRAM

Package G (Linguistics): Computation for Natural Language Processing


Class 10
Instructor: Michael Fry
THE PLAN TODAY
• Assignment 2 tips
• Review:
• Functions
• Go over bigram model
• Convert sentence generation options into functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
• Synonyms and antonyms
• Introduce Parts of Speech
N-GRAM MODEL
• Once you have your bigram dictionary, you can build a sentence using it
• There are lots of ways to implement this, one option:
• Provide a first word and the number of words you want in your sentence
• Access the bigrams from your first word as a key
• Select the next word with the highest occurrence
• Repeat until you reach the number of words in your sentence
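A minimal sketch of this greedy option, assuming the bigram dictionary maps each word to a dictionary of next-word counts, e.g. {'green': {'eggs': 2, 'ham': 1}} (your structure may differ):

def generate_greedy(bigrams, first_word, num_of_words):
    # Assumes bigrams maps each word to a {next_word: count} dictionary
    sentence = [first_word]
    current = first_word
    while len(sentence) < num_of_words and current in bigrams:
        options = bigrams[current]                  # counts for each possible next word
        current = max(options, key=options.get)     # pick the most frequent next word
        sentence.append(current)
    return ' '.join(sentence)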
N-GRAM MODEL
• Another possible way to generate sentences, use the random module
• import random
• Provide the first word and number of desired words
• Get the possible next words from your bigram list
• Use random.randint(lower_bound, upper_bound) to select a random word
• Repeat with your new word until you reach the desired number of words
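A sketch of the random option, under the same assumed {word: {next_word: count}} structure (random.choice(options) would be shorter, but random.randint matches the slide):

import random

def generate_random(bigrams, first_word, num_of_words):
    sentence = [first_word]
    current = first_word
    while len(sentence) < num_of_words and current in bigrams:
        options = list(bigrams[current])                         # possible next words
        current = options[random.randint(0, len(options) - 1)]   # pick one at random
        sentence.append(current)
    return ' '.join(sentence)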
N-GRAM MODEL
• A final option is to use probabilities (this is probably the best option if we want to
model how the brain might actually work)
• Provide a word and the number of desired words in the sentence
• Access the bigram list
• Find the probability of each following-word option
• Use random to generate a number between 0-1, select where that number is based on
the probabilities of the items in the dictionary
N-GRAM MODEL
• In other words, divide up the space between 0 and 1, and assign each gram to a
range. Pick a random number using random.random() and see which word's range it
falls in.
• You need to know the probability of each word, meaning its frequency over the sum
of all possibilities

"like" "them" "green"


1/4 1/4 2/4
0.25 0.25 0.5
0.0 0.25 0.5 1.0
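A sketch of the probabilistic option, walking the cumulative ranges exactly as in the table above (same assumed {word: {next_word: count}} structure):

import random

def generate_probabilistic(bigrams, first_word, num_of_words):
    sentence = [first_word]
    current = first_word
    while len(sentence) < num_of_words and current in bigrams:
        options = bigrams[current]
        total = sum(options.values())          # sum of all following-word counts
        r = random.random()                    # random number between 0 and 1
        cumulative = 0.0
        for word, count in options.items():
            cumulative += count / total        # upper edge of this word's range
            if r < cumulative:
                break                          # r fell in this word's range
        current = word                         # falls back to the last word on float edge cases
        sentence.append(current)
    return ' '.join(sentence)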
N-GRAM MODEL: FUNCTIONS
FOR SENTENCE GENERATION
• Now, write a function that takes three arguments:
• A starting word
• The number of words in the sentence
• The method to generate the sentence
• most_probable, random, probabilistic
• And returns the generated sentence
• It should start and end like this:
def generate_sentence(starting_word, num_of_words, method_to_use):
    ...
    return generated_sentence
• Note: A function must be defined before it is called
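One way the finished function could dispatch on the method argument, reusing the three sketches above (generate_greedy, generate_random and generate_probabilistic are my names, not required by the assignment, and bigrams is assumed to be your bigram dictionary):

def generate_sentence(starting_word, num_of_words, method_to_use):
    if method_to_use == 'most_probable':
        generated_sentence = generate_greedy(bigrams, starting_word, num_of_words)
    elif method_to_use == 'random':
        generated_sentence = generate_random(bigrams, starting_word, num_of_words)
    elif method_to_use == 'probabilistic':
        generated_sentence = generate_probabilistic(bigrams, starting_word, num_of_words)
    else:
        raise ValueError('unknown method: ' + method_to_use)
    return generated_sentence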
PUTTING EVERYTHING
TOGETHER
• Open up your text_to_speech.py file
• Copy and paste your code from text_to_speech.py into your n-gram script
• Use your text_to_speech.py code to create IPA transcriptions of the generated
sentences your program creates
• Load your text into: https://itinerarium.github.io/phoneme-synthesis/
• We can also change the input text we trained on
• e.g. I could change austen-emma.txt to melville-moby_dick.txt
THE PLAN TODAY
• Assignment 2 tips
• Review:
• Functions
• Go over bigram model
• Convert sentence generation options into functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
• Synonyms and antonyms
• Introduce Parts of Speech
THE PROBLEM SPACE
• With an n-gram model, every data point is valuable, but language makes it hard to
get the most out of a dataset:
• We’ve already talked a bit about punctuation
• Another part we can address relates to morphology and part-of-speech
• Consider this: ‘I sing every morning and she sings every night’
• It would be nice to be able to get some correspondence between sing and sings
NLTK
• NLTK has functions to help here, particularly as we build to lemmatization
• Tokenizing
• Stemming
• Lemmatization
• A few that pull from Semantics:
• Finding synonyms
• Finding antonyms
NLTK - TOKENIZING
• Before we can do anything with language data, we need to access some structure in it
• Tokenizing means breaking a string into smaller parts
• Sentence tokenization finds individual sentences
• Word tokenization finds individual words.
• Tokenization is an important part of NLP work, since we normally have to find
individual words in order to make sense of a document
• NLTK has a number of specialized tokenizers.
NLTK TOKENIZING – SENTENCE
TOKENS
• If we want to isolate sentences, why is the following code not ideal:

text = ("This is Mr. Green. He lives at 1212 East Main St. "
        "and works at a grocery store. The store is only 500m from "
        "his house, which is very convenient. (Don't tell anyone, but "
        "he doesn't like his job!)")

sentences = text.split('.')
NLTK TOKENIZING – SENTENCE
TOKENS
• NLTK has a "smarter" tokenizer that knows how to split properly.

import nltk

text = 'This is Mr. Green. He lives at 1212 East Main St.


and works at a grocery store. The store is only 500m from
his house, which is very convenient. (Don't tell anyone, but
he doesn't like his job!)'

sentences = nltk.tokenize.sent_tokenize(text)
NLTK TOKENIZING – WORD
TOKENS
• Now, if we wanted words, why isn’t this code great:

text = ("This is Mr. Green. He lives at 1212 East Main St. "
        "and works at a grocery store. The store is only 500m from "
        "his house, which is very convenient. (Don't tell anyone, but "
        "he doesn't like his job!)")

words = text.split(' ')

• It doesn't remove punctuation: we get strings like 'Green.' and 'house,' and
'job!)'
NLTK TOKENIZING – WORD
TOKENS
• NLTK has a smarter tokenizer, that will split apart the punctuation and make tokens
out of it (it does not remove punctuation!)

import nltk

text = 'This is Mr. Green. He lives at 1212 East Main St.


and works at a grocery store. The store is only 500m from
his house, which is very convenient. (Don't tell anyone, but
he doesn't like his job!)'

words = nltk.tokenize.word_tokenize(text)
NLTK TOKENIZING –
PUNCTUATION
• Even though NLTK is much better than our basic .split() functionality, it still runs into
trouble. Try this:
import nltk

text = ('The police yelled "Stop!" but the thief kept running. '
        '"You\'ll never catch me!" he said.')

print(nltk.tokenize.sent_tokenize(text))

• Do we get our desired output?


NLTK TOKENIZING –
PUNCTUATION
• Things get more complicated when we want to branch away from English:
• This is a question in Greek:
• Αλλά ποια είναι η πηγή της γνώσης; ('But what is the source of knowledge?')
• This is a question in Spanish:
• ¿para qué sirve? ('What is it for?')
• This is a quote in German:
• „Guten Tag!“ ('Good day!')
• This is a quote in French:
• « Bonjour! » ('Hello!')
NLTK TOKENIZING – CROSS-
LINGUISTIC SUPPORT
• Fortunately, NLTK is able to tokenize in 17 different languages:
• Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian,
Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, Turkish
• (Notice the European focus)
• To get a language-specific tokenizer, we need to specify the language
import nltk
nltk.tokenize.word_tokenize(data, language='french')  # note: lowercase language name
• Let’s try it!
• Go to: https://ici.radio-canada.ca/info or: https://cnnespanol.cnn.com/
• Copy and paste some French or Spanish text
• See if you can get a tokenizer to identify words
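For example, with a made-up French sentence (any text you paste from the sites above works the same way):

import nltk

french_text = "« Bonjour ! » Le magasin n'est qu'à 500 m de chez lui."
print(nltk.tokenize.word_tokenize(french_text, language='french'))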
NLTK STEMMING
• Stemming goes one step beyond simple tokenizing
• Stemming means removing all the affixes on a word, to leave just a root morpheme
• Root morphemes can provide a lot of information on their own
• walk
• walker
• walking
• walked
• walks
• Stemming these words reduces them all to "walk".
• This can be useful for tasks such as searching, or topic identification.
NLTK STEMMING
• Suppose someone is searching the Internet:
• "buying cheapest iPads in Vancouver"
• It would not be useful to only find webpages that have exactly these words
• It would also be good to match "buy", "cheap" and "iPad"
• This can be accomplished by stemming.
NLTK STEMMING
• A nice thing about stemming is that text is still comprehensible
• Compare:
• New evidence suggests the presence of a lake beneath Mars’s south pole, according to
new research published Wednesday. Scientists say the lake stretches 20 kilometers
across. The findings, if confirmed, would mark the detection of the largest body of liquid
water on Mars.
• With:
• New evidence suggest_ the presence of a lake beneath Mars_ south pole, accord_ to
new research publish_ Wednesday. Scientist_ say the lake stretch_ 20 kilometer_
across. The finding_, if confirm_, would mark the detect_ of the larg_ body of liquid water
on Mars.
NLTK STEMMING
• The simplest technique is a look-up table, which is just a list of roots and affixes
• runs, running, ran, runner -> run
• eats, eating, eatery, eater, ate -> eat
• Advantages: Simple, fast, handles irregular roots. Disadvantage: Can't handle new roots/affixes.
• More sophisticated stemming is rule-based.
• If a word ends with -ing, remove the -ing
• If a word ends with -ly, remove the –ly
• If a word starts with pre- remove the pre-
• Advantages: Flexible, can handle new words
• Disadvantages: Might create a non-existing root (arguing -> argu); might remove a
non-affix (James -> Jame)
NLTK STEMMING
• Since each language has different morphology rules, no stemmer can work for every
single language
• A popular stemmer is the Porter Stemmer.
• You can read the technical details here: https://tartarus.org/martin/PorterStemmer/def.txt
• This stemmer is included in NLTK
import nltk
stemmer = nltk.stem.PorterStemmer()
stemmer.stem('cats')
stemmer.stem('thinking')
stemmer.stem('governmental')
stemmer.stem('preheat')
NLTK LEMMATIZATION
• Just like stemming, this is a process of reducing words to their root forms. Unlike
stemming, it can make use of context and doesn’t suffer from stemming issues
• What happens if we stem ‘caring’?
• Lemmatizing also tries to get to the most meaningful lemma in context, so it doesn’t
always reduce to the base root
NLTK LEMMATIZATION
• The most commonly used lemmatizer in nltk is the WordNetLemmatizer:
import nltk
# First use may require the WordNet data: nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizer.lemmatize('cats')
lemmatizer.lemmatize('governmental')
lemmatizer.lemmatize('tallest')
NLTK LEMMATIZATION
• You can get better results by specifying the part of speech
lemmatizer.lemmatize('thinking', pos='v')
lemmatizer.lemmatize('governmental', pos='a')
lemmatizer.lemmatize('tallest', pos='a')
NLTK LEMMATIZATION
• Lemmatization is generally more useful than stemming or simple word tokens
• But it requires more knowledge
• To get the ideal lemmatization, you need a part-of-speech tagger as well (so that
lemmas are reduced appropriately); there is a preview sketch below
• We’ll come back to parts of speech in a bit, but first we’ll talk about synonyms and
antonyms
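As a preview of that combination, a sketch that tags first and then lemmatizes, mapping NLTK's Penn Treebank tags onto WordNet's pos codes (penn_to_wordnet is my own helper; the tagger may need nltk.download('averaged_perceptron_tagger')):

import nltk

lemmatizer = nltk.stem.WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet pos code; default to noun
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('R'):
        return 'r'
    return 'n'

words = nltk.word_tokenize('She was thinking about the tallest buildings')
for word, tag in nltk.pos_tag(words):
    print(lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))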
THE PLAN TODAY
• Assignment 2 tips
• Review:
• Functions
• Go over bigram model
• Convert sentence generation options into functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
• Synonyms and antonyms
• Introduce Parts of Speech
NLTK WORDNET
• WordNet® is a large lexical database of English. Nouns, verbs, adjectives and
adverbs are grouped into sets of synonyms (synsets), each expressing a distinct
concept. Synsets are interlinked by means of conceptual-semantic and lexical
relations.
• https://wordnet.princeton.edu/
• NLTK lets us access WordNet's powerful structure and associations easily
import nltk
wordnet = nltk.corpus.wordnet
NLTK WORDNET
• With WordNet, you can look up something called a synonym set (Synset). A
Synset has a variety of methods/attributes such as .definition()
wordnet = nltk.corpus.wordnet
data = wordnet.synsets('kind')
for d in data:
    print(d.definition())
NLTK WORDNET
• You can also access synonyms and antonyms using NLTK very easily
import nltk
wordnet = nltk.corpus.wordnet
synonyms = list()
antonyms = list()

for syn in wordnet.synsets("hot"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(list(set(synonyms)))
print(list(set(antonyms)))
NLTK WORDNET: PRACTICAL
APPLICATION
• Let's have some fun: read in cat_in_the_hat.txt, replace all the words with synonyms,
and write out the result to a file
• Pseudo-pseudocode:
load wordnet
read text file
initialize empty output list
for line in file:
    initialize empty output line
    for word in line:
        look up synonym
        add synonym to line
    add line to output list
print to file
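A sketch of that pseudocode in Python; picking the first lemma of the first synset is just one simple way to choose a synonym, and cat_in_the_hat.txt is assumed to be in the working directory:

import nltk

wordnet = nltk.corpus.wordnet

def synonym_for(word):
    # Return the first WordNet synonym found, or the word itself
    synsets = wordnet.synsets(word)
    if synsets:
        return synsets[0].lemmas()[0].name().replace('_', ' ')
    return word

output_lines = []
with open('cat_in_the_hat.txt') as infile:
    for line in infile:
        new_line = [synonym_for(word) for word in line.split()]
        output_lines.append(' '.join(new_line))

with open('cat_in_the_hat_synonyms.txt', 'w') as outfile:
    outfile.write('\n'.join(output_lines))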
