UBC Summer School in NLP - VSP 2019 Lecture 10
# naive approach: split the text on periods
sentences = text.split('.')
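To see why this naive approach falls short, here is a small sketch (the sample sentence is my own): splitting on `'.'` treats the period in an abbreviation like "Dr." as a sentence boundary, and leaves an empty string after the final period.

```python
# Splitting on '.' breaks on abbreviations like "Dr." and
# leaves an empty trailing string after the final period.
text = 'Dr. Smith arrived. He left.'
sentences = text.split('.')
print(sentences)  # → ['Dr', ' Smith arrived', ' He left', '']
```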
NLTK TOKENIZING – SENTENCE TOKENS
• NLTK has a "smarter" tokenizer that knows how to split properly.
import nltk
sentences = nltk.tokenize.sent_tokenize(text)
NLTK TOKENIZING – WORD TOKENS
• Now, if we wanted words, why isn’t this code great?
• It doesn’t remove punctuation: we get strings like 'Green.' and 'house,' and 'job)!'
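The code the slide refers to is not in the transcript; presumably it was a plain whitespace split. A minimal sketch of that approach, with a sample text of my own, shows how punctuation stays glued to the words:

```python
# Splitting on spaces keeps punctuation attached to words.
text = 'The Green. house, (and the job)!'
words = text.split(' ')
print(words)  # → ['The', 'Green.', 'house,', '(and', 'the', 'job)!']
```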
NLTK TOKENIZING – WORD TOKENS
• NLTK has a smarter tokenizer that will split the punctuation apart and make separate tokens out of it (it does not remove punctuation!)
import nltk
words = nltk.tokenize.word_tokenize(text)
NLTK TOKENIZING – PUNCTUATION
• Even though NLTK is much better than our basic .split() functionality, it still runs into
trouble. Try this:
import nltk
text = 'The police yelled "Stop!" but the thief kept running. "You\'ll never catch me!" he said.'
print(nltk.tokenize.sent_tokenize(text))