Python - Word Tokenization



Word tokenization is the process of splitting a large sample of text into words. This is a requirement in natural language processing tasks where each word needs to be captured and subjected to further analysis, such as classifying and counting words for a particular sentiment. The Natural Language Toolkit (NLTK) is a library used to achieve this. Install NLTK before proceeding with the Python program for word tokenization.

conda install -c anaconda nltk
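
After installation, NLTK's tokenizers also need their model data. As a minimal sketch, the Punkt models used by word_tokenize and sent_tokenize can be fetched as shown below (the resource name 'punkt' covers most NLTK versions; some newer releases may ask for additional resources).

import nltk

# Download the Punkt tokenizer models required by word_tokenize and sent_tokenize.
nltk.download('punkt')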

Next, we use the word_tokenize function to split the paragraph into individual words.

import nltk

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# Split the text into a list of word and punctuation tokens.
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

When we execute the above code, it produces the following result.

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 
'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the',
'comforts', 'of', 'their', 'drawing', 'rooms']
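
Once the text is tokenized, the individual tokens can be counted for further analysis, as mentioned above. Below is a minimal sketch using Python's collections.Counter on the same tokens; lowercasing the tokens for a case-insensitive count is an assumption, not part of the original example.

import nltk
from collections import Counter

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
tokens = nltk.word_tokenize(word_data)

# Tally how often each token appears, ignoring case.
counts = Counter(token.lower() for token in tokens)
print(counts.most_common(3))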

Tokenizing Sentences

We can also tokenize the sentences in a paragraph, just as we tokenized the words. We use the sent_tokenize function to achieve this. Below is an example.

import nltk

sentence_data = "Sun rises in the east. Sun sets in the west."

# Split the text into a list of sentences.
nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

When we execute the above code, it produces the following result.

['Sun rises in the east.', 'Sun sets in the west.']
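
The two tokenizers can also be combined: first split the text into sentences, then split each sentence into words. The loop below is a minimal sketch of this two-step approach using the same sample text.

import nltk

sentence_data = "Sun rises in the east. Sun sets in the west."

# Tokenize into sentences first, then tokenize each sentence into words.
for sentence in nltk.sent_tokenize(sentence_data):
    print(nltk.word_tokenize(sentence))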