TSP Unit1 Own
Natural Language Processing (NLP) is a field in Artificial Intelligence (AI) that helps
computers understand and work with human language. It allows machines to process and
generate meaningful sentences just like humans do.
Applications of NLP
To train NLP models, computers study large amounts of text written by humans. This helps
the models learn how words and sentences work together.
Phases of NLP
1. Lexical and Morphological Analysis – This phase focuses on breaking down words into their smallest meaningful parts.
💡 Example:
Helps search engines understand words and variations (e.g., "run," "running," "ran").
Useful in translation, transcription, and search engine optimization (SEO).
2. Syntactic Analysis – This step ensures that sentences follow the correct grammar rules.
💡 Example:
3. Semantic Analysis – This phase focuses on understanding what words and sentences actually mean.
💡 Example:
4. Discourse Integration – This phase ensures that sentences connect well with previous and future sentences.
💡 Example:
5. Pragmatic Analysis – This final phase helps AI understand the true intent behind words.
It extracts deep meaning from text, helping AI understand human emotions and
conversations.
It allows chatbots to give more human-like responses.
💡 Example:
Imagine you have a bag, and you put all the words from a document into it.
You don’t care about the order of the words, just how many times each word appears.
This helps in finding patterns in text and making comparisons between documents.
1. Build a Vocabulary → Make a list of all unique words from the text.
2. Count Word Occurrences → Count how often each word appears in the text.
3. Convert to a Vector → Create a list of numbers that represent the word counts.
Example sentences:
Sentence 1: "Welcome to Great Learning, Now start learning"
Sentence 2: "Learning is a good practice"
Go through all the words in both sentences and create a list of unique words.
Vocabulary List:
[Welcome, to, Great, Learning, ",", Now, start, learning, is, a, good, practice]
⚠ Problems: "Learning" and "learning" are counted as different words, and the comma is treated as a word of its own.
Now, create a vector (list of numbers) for each sentence. The numbers represent the presence
(1) or absence (0) of each word.
Word        Sentence 1   Sentence 2
Welcome     1            0
to          1            0
Great       1            0
Learning    1            1
,           1            0
Now         1            0
start       1            0
learning    1            0
is          0            1
a           0            1
good        0            1
practice    0            1
💡 Sentence 1 Vector: [1,1,1,1,1,1,1,1,0,0,0,0]
💡 Sentence 2 Vector: [0,0,0,1,0,0,0,0,1,1,1,1]
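A minimal Python sketch of these steps, assuming the two example sentences above and a simple whitespace split (the comma is written as a separate token so it matches the vocabulary):

```python
# Bag-of-Words sketch for the two example sentences above.
sentences = [
    "Welcome to Great Learning , Now start learning",
    "Learning is a good practice",
]
tokenized = [s.split() for s in sentences]

# Step 1: build a vocabulary of unique words, in order of first appearance.
vocab = []
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)

# Steps 2-3: mark presence (1) or absence (0) of each vocabulary word.
vectors = [[1 if word in tokens else 0 for word in vocab] for tokens in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
# Sentence 1 -> [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
# Sentence 2 -> [0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1]
```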
To improve the BoW model, we clean the text before converting it into a vector.
Convert all text to lowercase so that "Learning" and "learning" are treated as the same word, and drop punctuation and very common words such as "to," "is," and "a."
Word        Sentence 1   Sentence 2
welcome     1            0
great       1            0
learning    2            1
now         1            0
start       1            0
good        0            1
practice    0            1
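A short sketch of this cleaning step. The stopword list ("to", "is", "a") is an assumption chosen to reproduce the counts in the table above:

```python
# Cleaning sketch: lowercase, strip punctuation, drop a few common stopwords,
# then count word occurrences.
import re
from collections import Counter

STOPWORDS = {"to", "is", "a"}

def clean_and_count(text):
    text = text.lower()                      # "Learning" and "learning" become one word
    words = re.findall(r"[a-z]+", text)      # drops the comma and other punctuation
    return Counter(w for w in words if w not in STOPWORDS)

print(clean_and_count("Welcome to Great Learning, Now start learning"))
# learning: 2; welcome, great, now, start: 1 each
print(clean_and_count("Learning is a good practice"))
# learning, good, practice: 1 each
```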
⚠ Limitations of the Bag of Words model:
1. Ignores word order → "Apple is red" and "Red is apple" are treated the same.
2. Sparse representation → Large vocabulary creates big data, making processing slow.
3. Loses meaning → It doesn’t understand context (e.g., "old bike" vs. "used bike").
To improve BoW, advanced models like TF-IDF, Word2Vec, and BERT are used.
3) Bag-of-N-Grams
Example sentence: "James is the best person ever."
Bigrams (n = 2):
James is
is the
the best
best person
person ever.
In this model, a document is represented by the set of n-grams (word pairs, in this case) it
contains. Instead of just counting words like in Bag-of-Words, this method keeps some word
order information. This makes it better for understanding meaning while still being simple to
use.
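A small sketch of bigram extraction for this example (simple whitespace tokenization assumed):

```python
# Bag-of-N-Grams sketch: slide a window of n words over the sentence.
def ngrams(text, n=2):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("James is the best person ever."))
# ['James is', 'is the', 'the best', 'best person', 'person ever.']
```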
TF-IDF (Term Frequency – Inverse Document Frequency)
1. Term Frequency (TF) – Measures how often a word appears in a document.
o Example: a word that appears 10 times in a 100-word document has TF = 10/100 = 0.1.
2. Document Frequency (DF) – Counts how many documents contain the word.
o Example: DF = 50 means the word appears in 50 of the documents.
o If a word appears in many documents, it is less unique and not very special.
3. Inverse Document Frequency (IDF) – Measures how rare a word is.
o If a word appears everywhere, it is not useful for identifying unique topics.
o If a word appears in only a few documents, it is likely important.
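A worked sketch using the numbers above. The notes do not fix the exact IDF formula or the corpus size, so the common log-based IDF and a corpus of 1,000 documents are assumptions here:

```python
# TF-IDF sketch: TF from the example above, IDF = log(N / DF) (a common choice).
import math

tf = 10 / 100            # the word appears 10 times in a 100-word document -> 0.1
N = 1000                 # assumed total number of documents in the corpus
df = 50                  # the word appears in 50 of those documents
idf = math.log(N / df)   # rarer words get a larger IDF
tf_idf = tf * idf

print(round(tf, 3), round(idf, 3), round(tf_idf, 3))   # 0.1 2.996 0.3
```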
Text Preprocessing Steps
1. Contraction Mapping – Expanding shortened words (e.g., "I'll" → "I will", "don't" → "do not") to maintain consistency.
2. Tokenization – Splitting text into smaller parts called "tokens" (words, characters, or
subwords).
3. Noise Cleaning – Removing unnecessary symbols, punctuation, and special
characters.
4. Spell-checking – Fixing spelling mistakes for cleaner text.
5. Stopwords Removal – Removing common words like "the," "is," and "and" that do
not add much meaning.
6. Stemming/Lemmatization – Converting words to their root form (e.g., "running" →
"run").
Tokenization
1. Word Tokenization – Splitting text into words (e.g., "Never give up" → ["Never",
"give", "up"]).
o Problem: It struggles with Out Of Vocabulary (OOV) words (new words not
in training data).
o Solution: Replace rare words with an "UNK" (unknown) token.
2. Character Tokenization – Breaking text into individual characters (e.g., "smarter"
→ ['s', 'm', 'a', 'r', 't', 'e', 'r']).
o Advantage: Handles OOV words better.
o Problem: The sequence becomes too long, making it harder for models to
learn word meanings.
3. Subword Tokenization – Breaking words into meaningful sub-parts (e.g., "smartest"
→ ["smart", "est"]).
o Advantage: A balance between word and character tokenization.
o Used in: Advanced models like Transformers.
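A tiny sketch contrasting word-level tokenization (with "UNK" replacement for out-of-vocabulary words) and character-level tokenization; the vocabulary here is an assumption:

```python
VOCAB = {"never", "give", "up"}          # assumed training vocabulary

def word_tokenize(text):
    # OOV words are replaced with the special "UNK" token.
    return [w if w in VOCAB else "UNK" for w in text.lower().split()]

def char_tokenize(text):
    return list(text)

print(word_tokenize("Never give up today"))   # ['never', 'give', 'up', 'UNK']
print(char_tokenize("smarter"))               # ['s', 'm', 'a', 'r', 't', 'e', 'r']
```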
BPE is a widely used tokenization approach that efficiently handles OOV words and reduces
input length.
Steps in BPE
Let's go step by step with an example to understand Byte Pair Encoding (BPE) in a simple way.
Let's say we have a small text corpus with just three words:
Corpus:
low
lower
lowest
We first break each word into individual characters and add a special end-of-word symbol
(</w>).
low -> l o w </w>
lower -> l o w e r </w>
lowest -> l o w e s t </w>
We count how often each pair of characters appears together in the corpus:
Pair      Count
l o       3
o w       3
w </w>    1
w e       2
e r       1
e s       1
s t       1
r </w>    1
t </w>    1
The most common pair is "o w" (appears 3 times). We merge it into a new token "ow".
Pair counts after the merge (the words are now l ow </w>, l ow e r </w>, l ow e s t </w>):
Pair      Count
l ow      3
ow </w>   1
ow e      2
e r       1
e s       1
s t       1
r </w>    1
t </w>    1
The most common pair now is "l ow" (appears 3 times). We merge it:
The words are now:
low </w>
low e r </w>
low e s t </w>
Final vocabulary: {low, e, r, s, t, </w>}
Now, instead of splitting words into individual characters, the model tokenizes them using the learned merges:
"low" → [low, </w>]
"lower" → [low, e, r, </w>]
"lowest" → [low, e, s, t, </w>]
(With more merges, frequent whole words such as "lower" and "lowest" would eventually become single tokens.)
This reduces vocabulary size and handles Out Of Vocabulary (OOV) words efficiently.
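A compact BPE training sketch for the three-word corpus above. Ties between equally frequent pairs ("l o" and "o w" both appear 3 times) may be merged in a different order than in the walkthrough, but the resulting tokens are the same:

```python
from collections import Counter

corpus = {"low": 1, "lower": 1, "lowest": 1}                 # word -> frequency
words = {tuple(w) + ("</w>",): f for w, f in corpus.items()}

def count_pairs(words):
    # Count how often each adjacent symbol pair occurs across the corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(words, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

for _ in range(2):                                           # two merge steps
    best = count_pairs(words).most_common(1)[0][0]
    words = merge(words, best)

print(list(words))
# [('low', '</w>'), ('low', 'e', 'r', '</w>'), ('low', 'e', 's', 't', '</w>')]
```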
Key Takeaways