
VANCOUVER SUMMER PROGRAM

Package G (Linguistics): Computation for Natural Language Processing


Class 10
Instructor: Michael Fry
THE PLAN TODAY
• Assignment 2 tips
• Review:
• Functions
• Go over bigram model
• Convert sentence generation options into functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
• Synonyms and antonyms
• Introduce Parts of Speech
N-GRAM MODEL
• Once you have your bigram dictionary, you can build a sentence using it
• There are lots of ways to implement this, one option:
• Provide a first word and the number of words you want in your sentence
• Access the bigrams from your first word as a key
• Select the next word with the highest occurrence
• Repeat until you reach the number of words in your sentence
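A minimal sketch of this greedy option, assuming the bigram dictionary maps each word to a dictionary of next-word counts, e.g. {'green': {'eggs': 2, 'ham': 1}} (your structure may differ):

def generate_greedy(bigrams, first_word, num_of_words):
    # Assumes bigrams maps each word to a {next_word: count} dictionary
    sentence = [first_word]
    current = first_word
    while len(sentence) < num_of_words and current in bigrams:
        options = bigrams[current]                  # counts for each possible next word
        current = max(options, key=options.get)     # pick the most frequent next word
        sentence.append(current)
    return ' '.join(sentence)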
N-GRAM MODEL
• Another possible way to generate sentences, use the random module
• import random
• Provide the first word and number of desired words
• Get the possible next words from your bigram list
• Use random.randint(lower_bound, upper_bound) to select a random word
• Repeat with your new word until you reach the desired number of words
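A sketch of the random option, under the same assumed {word: {next_word: count}} structure (random.choice(options) would be shorter, but random.randint matches the slide):

import random

def generate_random(bigrams, first_word, num_of_words):
    sentence = [first_word]
    current = first_word
    while len(sentence) < num_of_words and current in bigrams:
        options = list(bigrams[current])                         # possible next words
        current = options[random.randint(0, len(options) - 1)]   # pick one at random
        sentence.append(current)
    return ' '.join(sentence)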
N-GRAM MODEL
• A final option is to use probabilities (this is probably the best option if we want to
model how the brain might actually work)
• Provide a word and the number of desired words in the sentence
• Access the bigram list
• Find the probability of each following-word option
• Use random to generate a number between 0-1, select where that number is based on
the probabilities of the items in the dictionary
N-GRAM MODEL
• In other words, divide up the space between 0 and 1, and assign each gram to a
range. Pick a random number using random.random() and see which word's range it
falls in.
• You need to know the probability of each word, meaning its frequency over the sum
of all possibilities

"like" "them" "green"


1/4 1/4 2/4
0.25 0.25 0.5
0.0 0.25 0.5 1.0
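A sketch of the probabilistic option, walking the cumulative ranges exactly as in the table above (same assumed {word: {next_word: count}} structure):

import random

def generate_probabilistic(bigrams, first_word, num_of_words):
    sentence = [first_word]
    current = first_word
    while len(sentence) < num_of_words and current in bigrams:
        options = bigrams[current]
        total = sum(options.values())          # sum of all following-word counts
        r = random.random()                    # random number between 0 and 1
        cumulative = 0.0
        for word, count in options.items():
            cumulative += count / total        # upper edge of this word's range
            if r < cumulative:
                break                          # r fell in this word's range
        current = word                         # falls back to the last word on float edge cases
        sentence.append(current)
    return ' '.join(sentence)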
N-GRAM MODEL: FUNCTIONS
FOR SENTENCE GENERATION
• Now, write a function that takes three arguments:
• A starting word
• The number of words in the sentence
• The method to generate the sentence
• most_probable, random, probabilistic
• And returns the generated sentence
• It should start and end like this:
def generate_sentence(starting_word, num_of_words, method_to_use):
    ...
    return generated_sentence
• Note: A function must be defined before it is called
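One way the finished function could dispatch on the method argument, reusing the three sketches above (generate_greedy, generate_random and generate_probabilistic are my names, not required by the assignment, and bigrams is assumed to be your bigram dictionary):

def generate_sentence(starting_word, num_of_words, method_to_use):
    if method_to_use == 'most_probable':
        generated_sentence = generate_greedy(bigrams, starting_word, num_of_words)
    elif method_to_use == 'random':
        generated_sentence = generate_random(bigrams, starting_word, num_of_words)
    elif method_to_use == 'probabilistic':
        generated_sentence = generate_probabilistic(bigrams, starting_word, num_of_words)
    else:
        raise ValueError('unknown method: ' + method_to_use)
    return generated_sentence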
PUTTING EVERYTHING
TOGETHER
• Open up your text_to_speech.py file
• Copy and paste your code from text_to_speech.py into your n-gram script
• Use your text_to_speech.py code to create IPA transcriptions of the generated
sentences your program creates
• Load your text into: https://itinerarium.github.io/phoneme-synthesis/
• We can also change the input text we trained on
• e.g. I could change austen-emma.txt to melville-moby_dick.txt
THE PLAN TODAY
• Assignment 2 tips
• Review:
• Functions
• Go over bigram model
• Convert sentence generation options into functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
• Synonyms and antonyms
• Introduce Parts of Speech
THE PROBLEM SPACE
• With an n-gram model, every data point is valuable, but language makes it hard to
get the most out of a dataset:
• We’ve already talked a bit about punctuation
• Another part we can address relates to morphology and part-of-speech
• Consider this: ‘I sing every morning and she sings every night’
• It would be nice to be able to get some correspondence between sing and sings
NLTK
• NLTK has functions to help here, particularly as we build to lemmatization
• Tokenizing
• Stemming
• Lemmatization
• A few that pull from Semantics:
• Finding synonyms
• Finding antonyms
NLTK - TOKENIZING
• Before we can do anything with language data, we need to access some structure in it
• Tokenizing means breaking a string into smaller parts
• Sentence tokenization finds individual sentences
• Word tokenization finds individual words.
• Tokenization is an important part of NLP work, since we normally have to find
individual words in order to make sense of a document
• NLTK has a number of specialized tokenizers.
NLTK TOKENIZING – SENTENCE
TOKENS
• If we want to isolate sentences, why is the following code not ideal:

text = ("This is Mr. Green. He lives at 1212 East Main St. "
        "and works at a grocery store. The store is only 500m from "
        "his house, which is very convenient. (Don't tell anyone, but "
        "he doesn't like his job!)")

sentences = text.split('.')
NLTK TOKENIZING – SENTENCE
TOKENS
• NLTK has a "smarter" tokenizer that knows how to split properly.

import nltk

text = 'This is Mr. Green. He lives at 1212 East Main St.


and works at a grocery store. The store is only 500m from
his house, which is very convenient. (Don't tell anyone, but
he doesn't like his job!)'

sentences = nltk.tokenize.sent_tokenize(text)
NLTK TOKENIZING – WORD
TOKENS
• Now, if we wanted words, why isn’t this code great:

text = ("This is Mr. Green. He lives at 1212 East Main St. "
        "and works at a grocery store. The store is only 500m from "
        "his house, which is very convenient. (Don't tell anyone, but "
        "he doesn't like his job!)")

words = text.split(' ')

• It doesn't remove punctuation: we get strings like 'Green.' and 'house,' and
'job!)'
NLTK TOKENIZING – WORD
TOKENS
• NLTK has a smarter tokenizer, that will split apart the punctuation and make tokens
out of it (it does not remove punctuation!)

import nltk

text = 'This is Mr. Green. He lives at 1212 East Main St.


and works at a grocery store. The store is only 500m from
his house, which is very convenient. (Don't tell anyone, but
he doesn't like his job!)'

words = nltk.tokenize.word_tokenize(text)
NLTK TOKENIZING –
PUNCTUATION
• Even though NLTK is much better than our basic .split() functionality, it still runs into
trouble. Try this:
import nltk

text = ('The police yelled "Stop!" but the thief kept running. '
        '"You\'ll never catch me!" he said.')

print(nltk.tokenize.sent_tokenize(text))

• Do we get our desired output?


NLTK TOKENIZING –
PUNCTUATION
• Things get more complicated when we want to branch away from English:
• This is a question in Greek:
• Αλλά ποια είναι η πηγή της γνώσης; ('But what is the source of knowledge?')
• This is a question in Spanish:
• ¿para qué sirve? ('What is it for?')
• This is a quote in German:
• „Guten Tag!“ ('Good day!')
• This is a quote in French:
• « Bonjour! » ('Hello!')
NLTK TOKENIZING – CROSS-
LINGUISTIC SUPPORT
• Fortunately, NLTK is able to tokenize in 17 different languages:
• Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian,
Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, Turkish
• (Notice the European focus)
• To get a language-specific tokenizer, we need to specify the language
import nltk
nltk.tokenize.word_tokenize(data, language='french')  # note: lowercase language name
• Let’s try it!
• Go to: https://ici.radio-canada.ca/info or: https://cnnespanol.cnn.com/
• Copy and paste some French or Spanish text
• See if you can get a tokenizer to identify words
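For example, with a made-up French sentence (any text you paste from the sites above works the same way):

import nltk

french_text = "« Bonjour ! » Le magasin n'est qu'à 500 m de chez lui."
print(nltk.tokenize.word_tokenize(french_text, language='french'))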
NLTK STEMMING
• Stemming goes one step beyond simple tokenizing
• Stemming means removing all the affixes on a word, to leave just a root morpheme
• Root morphemes can provide a lot of information on their own
• walk
• walker
• walking
• walked
• walks
• Stemming these words reduces them all to "walk".
• This can be useful for tasks such as searching, or topic identification.
NLTK STEMMING
• Suppose someone is searching the Internet:
• "buying cheapest iPads in Vancouver"
• It would not be useful to only find webpages that have exactly these words
• It would also be good to match "buy", "cheap" and "iPad"
• This can be accomplished by stemming.
NLTK STEMMING
• A nice thing about stemming is that text is still comprehensible
• Compare:
• New evidence suggests the presence of a lake beneath Mars’s south pole, according to
new research published Wednesday. Scientists say the lake stretches 20 kilometers
across. The findings, if confirmed, would mark the detection of the largest body of liquid
water on Mars.
• With:
• New evidence suggest_ the presence of a lake beneath Mars_ south pole, accord_ to
new research publish_ Wednesday. Scientist_ say the lake stretch_ 20 kilometer_
across. The finding_, if confirm_, would mark the detect_ of the larg_ body of liquid water
on Mars.
NLTK STEMMING
• The simplest technique is a look-up table, which is just a list of roots and affixes
• runs, running, ran, runner -> run
• eats, eating, eatery, eater, ate -> eat
• Advantages: Simple, fast, handles irregular roots. Disadvantage: Can't handle new roots/affixes.
• More sophisticated stemming is rule-based.
• If a word ends with -ing, remove the -ing
• If a word ends with -ly, remove the –ly
• If a word starts with pre- remove the pre-
• Advantages: Flexible, can handle new words
• Disadvantages: Might create a non-existing root (arguing -> argu); might remove a
non-affix (James -> Jame)
NLTK STEMMING
• Since each language has different morphology rules, no stemmer can work for every
single language
• A popular stemmer is the Porter Stemmer.
• You can read the technical details here: https://tartarus.org/martin/PorterStemmer/def.txt
• This stemmer is included in NLTK
import nltk
stemmer = nltk.stem.PorterStemmer()
stemmer.stem('cats')
stemmer.stem('thinking')
stemmer.stem('governmental')
stemmer.stem('preheat')
NLTK LEMMATIZATION
• Just like stemming, this is a process of reducing words to their root forms. Unlike
stemming, it can make use of context and doesn’t suffer from stemming issues
• What happens if we stem ‘caring’?
• Lemmatizing also tries to get to the most meaningful lemma in context, so it doesn’t
always reduce to the base root
NLTK LEMMATIZATION
• The most commonly used lemmatizer in nltk is the WordNetLemmatizer:
import nltk
# First use may require the WordNet data: nltk.download('wordnet')
lemmatizer = nltk.stem.WordNetLemmatizer()
lemmatizer.lemmatize('cats')
lemmatizer.lemmatize('governmental')
lemmatizer.lemmatize('tallest')
NLTK LEMMATIZATION
• You can get better results by specifying the part of speech
lemmatizer.lemmatize('thinking', pos='v')
lemmatizer.lemmatize('governmental', pos='a')
lemmatizer.lemmatize('tallest', pos='a')
NLTK LEMMATIZATION
• Lemmatization is generally more useful than stemming or simple word tokens
• But it requires more knowledge
• To get the ideal lemmatization, you need a part-of-speech tagger as well (so that
lemmas are reduced appropriately); there is a preview sketch below
• We’ll come back to parts of speech in a bit, but first we’ll talk about synonyms and
antonyms
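As a preview of that combination, a sketch that tags first and then lemmatizes, mapping NLTK's Penn Treebank tags onto WordNet's pos codes (penn_to_wordnet is my own helper; the tagger may need nltk.download('averaged_perceptron_tagger')):

import nltk

lemmatizer = nltk.stem.WordNetLemmatizer()

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet pos code; default to noun
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('R'):
        return 'r'
    return 'n'

words = nltk.word_tokenize('She was thinking about the tallest buildings')
for word, tag in nltk.pos_tag(words):
    print(lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))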
THE PLAN TODAY
• Assignment 2 tips
• Review:
• Functions
• Go over bigram model
• Convert sentence generation options into functions
• Build end-to-end model of generated sentences to speech
• NLTK methods
• Tokenizing, Stemming, Lemmatization
• Synonyms and antonyms
• Introduce Parts of Speech
NLTK WORDNET
• WordNet® is a large lexical database of English. Nouns, verbs, adjectives and
adverbs are grouped into sets of synonyms (synsets), each expressing a distinct
concept. Synsets are interlinked by means of conceptual-semantic and lexical
relations.
• https://wordnet.princeton.edu/
• NLTK lets us access WordNet's powerful structure and associations easily
import nltk
wordnet = nltk.corpus.wordnet
NLTK WORDNET
• With WordNet, you can look up something called a synonym set (Synset). A
Synset has a variety of methods/attributes such as .definition()
wordnet = nltk.corpus.wordnet
data = wordnet.synsets('kind')
for d in data:
    print(d.definition())
NLTK WORDNET
• You can also access synonyms and antonyms using NLTK very easily
import nltk
wordnet = nltk.corpus.wordnet
synonyms = list()
antonyms = list()

for syn in wordnet.synsets("hot"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(list(set(synonyms)))
print(list(set(antonyms)))
NLTK WORDNET: PRACTICAL
APPLICATION
• Let's have some fun: read in cat_in_the_hat.txt, replace all the words with synonyms,
and write out the result to a file
• Pseudo-pseudocode:
load wordnet
read text file
initialize empty output list
for line in file:
    initialize empty output line
    for word in line:
        look up synonym
        add synonym to line
    add line to output list
print to file
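A sketch of that pseudocode in Python; picking the first lemma of the first synset is just one simple way to choose a synonym, and cat_in_the_hat.txt is assumed to be in the working directory:

import nltk

wordnet = nltk.corpus.wordnet

def synonym_for(word):
    # Return the first WordNet synonym found, or the word itself
    synsets = wordnet.synsets(word)
    if synsets:
        return synsets[0].lemmas()[0].name().replace('_', ' ')
    return word

output_lines = []
with open('cat_in_the_hat.txt') as infile:
    for line in infile:
        new_line = [synonym_for(word) for word in line.split()]
        output_lines.append(' '.join(new_line))

with open('cat_in_the_hat_synonyms.txt', 'w') as outfile:
    outfile.write('\n'.join(output_lines))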
