TSP Unit1 Own

Natural Language Processing (NLP) is a field of AI that enables computers to understand and generate human language, encompassing Natural Language Understanding (NLU) and Natural Language Generation (NLG). Key components include data preprocessing and algorithm development, with applications in machine translation, chatbots, and sentiment analysis. Techniques like Bag of Words, TF-IDF, and text preprocessing methods are essential for analyzing and processing text data effectively.

1) Foundations of Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field in Artificial Intelligence (AI) that helps
computers understand and work with human language. It allows machines to process and
generate meaningful sentences just like humans do.

NLP consists of two major areas:

1. Natural Language Understanding (NLU): Helps machines understand human language.


2. Natural Language Generation (NLG): Helps machines generate human-like responses.

NLP combines Linguistics, Computer Science, and AI to improve communication between humans and machines. It is widely used today in chatbots, search engines, and language translation tools.

Two Main Parts of NLP

1. Data Preprocessing – Cleaning and preparing text for analysis.


2. Algorithm Development – Creating models that understand and generate text.

Applications of NLP

NLP is used in many areas, including:

 Machine Translation (e.g., Google Translate)


 Information Retrieval (e.g., Google Search)
 Question Answering (e.g., Chatbots like Siri, Alexa)
 Dialogue Systems (e.g., Virtual Assistants)
 Summarization (e.g., AI-generated article summaries)
 Sentiment Analysis (e.g., Analyzing customer feedback)
How Does NLP Work?

To train NLP models, computers study large amounts of text written by humans. This helps
the models learn how words and sentences work together.

NLP follows five main phases to understand language:

1. Lexical Analysis (Word Structure Analysis)


2. Syntax Analysis (Grammar & Sentence Structure Analysis)
3. Semantic Analysis (Meaning Analysis)
4. Discourse Integration (Context Understanding)
5. Pragmatic Analysis (Real-World Meaning Analysis)

Phase I: Lexical or Morphological Analysis (Word Structure Analysis)

This phase focuses on breaking down words into their smallest meaningful parts.

 Lexicon: A collection of words in a language.


 Morpheme: The smallest unit of meaning in a word.
o Free morpheme (e.g., "walk") can stand alone.
o Bound morpheme (e.g., "-ing," "-ed") needs to be attached to a free
morpheme to form a meaningful word.

💡 Example:

 "Walking" → "walk" (free morpheme) + "-ing" (bound morpheme).

How is this useful?

 Helps search engines understand words and variations (e.g., "run," "running," "ran").
 Useful in translation, transcription, and search engine optimization (SEO).
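
💡 A minimal sketch of morphological analysis in Python, assuming the NLTK library is installed. The Porter stemmer strips bound morphemes such as "-ing" and "-ed" to recover the stem (the free morpheme):

from nltk.stem import PorterStemmer  # assumes NLTK is installed

stemmer = PorterStemmer()
for word in ["walk", "walking", "walked", "runs", "running"]:
    # The stemmer removes suffixes (bound morphemes) to expose the stem.
    print(word, "->", stemmer.stem(word))

# Output:
# walk -> walk
# walking -> walk
# walked -> walk
# runs -> run
# running -> run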

Phase II: Syntax Analysis (Grammar Checking)

This step ensures that sentences follow the correct grammar rules.

 It checks word arrangement and the relationship between words.


 A syntax tree is created to visualize sentence structure.

💡 Example:

 Correct: "The cat sat on the mat." ✅


 Incorrect: "Mat the sat cat on." ❌(Wrong word order)

Why is this important?


 Helps in grammar checking for AI-based writing assistants.
 Makes sure AI-generated content is readable and meaningful.
 Used in SEO to ensure content is properly structured.
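
💡 A small sketch of syntax analysis using spaCy (an assumption for this example; it requires spaCy and its small English model, en_core_web_sm, to be installed). It prints each word's part of speech and its grammatical relation to its head word, which is the information a syntax tree encodes:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("The cat sat on the mat.")
for token in doc:
    # token.dep_ is the grammatical relation, token.head is the word it attaches to
    print(token.text, token.pos_, token.dep_, "->", token.head.text)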

Phase III: Semantic Analysis (Understanding Meaning)

This phase focuses on understanding what words and sentences actually mean.

 It ensures words and phrases make sense together.


 It identifies synonyms (words with similar meanings), antonyms (opposites), and
homonyms (words that sound the same but have different meanings).

💡 Example:

 "I bank my money in the bank."


o "Bank" (financial institution) vs. "Bank" (side of a river).
o AI must understand the correct meaning from the context.

How is this useful?

 Helps chatbots and virtual assistants understand conversations.


 Used in sentiment analysis (e.g., detecting positive or negative reviews).
 Helps classify topics for better internal linking in websites (SEO).

Phase IV: Discourse Integration (Context Understanding)

This phase ensures that sentences connect well with previous and future sentences.

 Helps AI understand the context in a conversation or document.


 Identifies relationships between different topics and ideas.

💡 Example:

 Without discourse integration:


o "I love this restaurant. The food is amazing. The service is horrible."
 AI may wrongly assume it's a positive review.
 With discourse integration:
o AI recognizes mixed sentiments because it understands the context.

Why is this important?

 Helps in detecting fake reviews or hate speech.


 Makes AI better at answering follow-up questions in chatbots.
 Useful in text summarization and SEO localization.
Phase V: Pragmatic Analysis (Real-World Meaning)

This final phase helps AI understand the true intent behind words.

 It extracts deep meaning from text, helping AI understand human emotions and
conversations.
 It allows chatbots to give more human-like responses.

💡 Example:

 "Can you tell me the time?"


o AI should not just reply "Yes" (literal meaning).
o It should actually provide the time (pragmatic meaning).

How is this used?

 In virtual assistants (e.g., Siri, Alexa).


 In question-answering systems (e.g., Google search).
 In automated FAQ generation for websites.

2) Bag of Words (BoW)


The Bag of Words (BoW) model is a technique used in Natural Language Processing
(NLP) to analyze text. It helps computers understand and work with text by turning words
into numbers, which is easier for machines to process.

What is Bag of Words?

 Imagine you have a bag, and you put all the words from a document into it.
 You don’t care about the order of the words, just how many times each word appears.
 This helps in finding patterns in text and making comparisons between documents.

How Does It Work?

1. Build a Vocabulary → Make a list of all unique words from the text.
2. Count Word Occurrences → Count how often each word appears in the text.
3. Convert to a Vector → Create a list of numbers that represent the word counts.

Example 1: Without Preprocessing (Raw Text)

Step 1: Take Sample Sentences

Let’s consider two sentences:

1. Sentence 1: "Welcome to Great Learning, Now start learning"


2. Sentence 2: "Learning is a good practice"

Step 2: Create a Vocabulary List

Go through all the words in both sentences and create a list of unique words.

Vocabulary List:
[Welcome, to, Great, Learning, ,, Now, start, learning, is, a, good,
practice]

⚠ Problems:

 "Learning" and "learning" are treated as different words (case-sensitive issue).


 Punctuation marks like , are included, which are unnecessary.

Step 3: Convert Sentences into Vectors

Now, create a vector (list of numbers) for each sentence. The numbers represent the presence
(1) or absence (0) of each word.

Word        Sentence 1   Sentence 2

Welcome     1            0
to          1            0
Great       1            0
Learning    1            1
,           1            0
Now         1            0
start       1            0
learning    1            0
is          0            1
a           0            1
good        0            1
practice    0            1

💡 Sentence 1 Vector: [1,1,1,1,1,1,1,1,0,0,0,0]
💡 Sentence 2 Vector: [0,0,0,1,0,0,0,0,1,1,1,1]

⚠ Problems with this approach:

 Repetition of words ("Learning" and "learning") due to case sensitivity.


 Unnecessary punctuation (,).

Example 2: With Preprocessing (Cleaned Text)

To improve the BoW model, we clean the text before converting it into a vector.

Step 1: Convert Sentences to Lowercase

Convert all text to lowercase so that "Learning" and "learning" are treated as the same word.

 Sentence 1 → "welcome to great learning, now start learning"


 Sentence 2 → "learning is a good practice"

Step 2: Remove Unnecessary Words and Symbols

1. Remove punctuation (,)


2. Remove stopwords (common words that do not add much meaning, like “is,” “a,”
“to”)

 Cleaned Sentence 1: "welcome great learning now start learning"


 Cleaned Sentence 2: "learning good practice"

Step 3: Create a New Vocabulary List

Now, we only keep important words.

New Vocabulary List:


[welcome, great, learning, now, start, good, practice]

Step 4: Convert Sentences into Vectors

Now, assign numbers based on word frequency.


Word        Sentence 1   Sentence 2

welcome     1            0
great       1            0
learning    2            1
now         1            0
start       1            0
good        0            1
practice    0            1

💡 Sentence 1 Vector: [1,1,2,1,1,0,0]


💡 Sentence 2 Vector: [0,0,1,0,0,1,1]

✅ Advantages of Preprocessing:

 Removes duplicates (e.g., "learning" counted only once).


 Removes punctuation.
 Removes stopwords to reduce unnecessary words.
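
💡 A minimal sketch of the cleaned Bag-of-Words example using scikit-learn's CountVectorizer (an assumption: scikit-learn is installed, and we pass the same small stopword list used above; get_feature_names_out is available in recent scikit-learn versions). CountVectorizer lowercases the text and ignores punctuation by default, so it reproduces the preprocessing above. Note that it sorts the vocabulary alphabetically, so the columns appear in a different order than the table, but the counts match:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Welcome to Great Learning, Now start learning",
    "Learning is a good practice",
]

# Small hand-written stopword list matching the example above.
vectorizer = CountVectorizer(stop_words=["to", "is", "a"])
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
# ['good' 'great' 'learning' 'now' 'practice' 'start' 'welcome']
print(bow.toarray())
# [[0 1 2 1 0 1 1]
#  [1 0 1 0 1 0 0]]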

Limitations of the Bag-of-Words Model

1. Ignores word order → "Apple is red" and "Red is apple" are treated the same.
2. Sparse representation → Large vocabulary creates big data, making processing slow.
3. Loses meaning → It doesn’t understand context (e.g., "old bike" vs. "used bike").

Why Use BoW?

✅ Simple and easy to implement


✅ Works well for tasks like spam detection, sentiment analysis
❌ Does not capture word meaning and order

To improve BoW, advanced models like TF-IDF, Word2Vec, and BERT are used.

3) Bag-of-N-Grams

A Bag-of-N-Grams model is a way to represent a document by breaking it into small groups of words (called n-grams), similar to the Bag-of-Words model. Instead of looking at single words separately, this model captures short sequences of words to keep some context.

For example, if we take the sentence:

"James is the best person ever."

and break it into bigrams (n = 2), we get:

 James is
 is the
 the best
 best person
 person ever

In this model, a document is represented by the set of n-grams (word pairs, in this case) it
contains. Instead of just counting words like in Bag-of-Words, this method keeps some word
order information. This makes it better for understanding meaning while still being simple to
use.
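
💡 A minimal sketch of Bag-of-N-Grams using CountVectorizer's ngram_range parameter (assuming scikit-learn is installed). Setting ngram_range=(2, 2) extracts bigrams only; (1, 2) would mix single words and bigrams:

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(2, 2) -> only bigrams (pairs of consecutive words)
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigrams = vectorizer.fit_transform(["James is the best person ever"])

print(vectorizer.get_feature_names_out())
# ['best person' 'is the' 'james is' 'person ever' 'the best']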

TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF is a method used to find important words in a document compared to a collection of documents (called a corpus). It helps in identifying key terms by considering both how often a word appears in a document and how rare it is across all documents.

Key Terms in TF-IDF

1. Term Frequency (TF) – How often a word appears in a document.


o If a word appears many times in a document, it is likely important.
o Example: If "computer" appears 10 times in a document with 100 words,
then:

TF=10/100=0.1

o The more a word appears, the more relevant it is to that document.


2. Document Frequency (DF) – Counts how many documents contain a word.
o Example: If there are 1,000 documents and 50 documents contain the word
"computer", then:

DF=50

o If a word appears in many documents, it is less unique and not very special.
3. Inverse Document Frequency (IDF) – Measures how rare a word is.
o If a word appears everywhere, it is not useful for identifying unique topics.
o If a word appears in only a few documents, it is likely important.

How Does TF-IDF Work?

To calculate the importance of a word in a document, we multiply TF × IDF.


For example, if:

 TF = 0.1 (word appears frequently in a document)


 IDF = 2 (word is somewhat rare in the whole corpus)

Then, TF-IDF Score = 0.1 × 2 = 0.2

 A higher score means the word is important in that document.


 A lower score means the word is common and not very significant.
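
💡 A minimal sketch of this calculation in plain Python. The exact IDF formula varies between libraries (scikit-learn, for example, adds smoothing terms); here we use the common log(N / DF) form with the illustrative counts from the text:

import math

tf = 10 / 100        # "computer" appears 10 times in a 100-word document -> 0.1
n_docs = 1000        # total documents in the corpus
df = 50              # documents that contain "computer"

idf = math.log10(n_docs / df)   # log10(20) ≈ 1.30
tfidf = tf * idf

print(round(idf, 2), round(tfidf, 3))   # 1.3 0.13
# With the illustrative IDF = 2 used in the text, the score would be 0.1 × 2 = 0.2.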

Why is TF-IDF Useful?

✔Search engines – Helps find the most relevant results.


✔Keyword extraction – Identifies important words in a document.
✔Spam detection – Finds unusual or suspicious words in emails.

4) Text Preprocessing or Wrangling


Text preprocessing is the process of cleaning text data before feeding it into a model. Since
machines understand numbers, not words, we must transform text into numbers in a
meaningful way. Text data often contains noise, such as punctuation, special characters, and
different word forms.

Techniques for Text Preprocessing

1. Contraction Mapping – Expanding shortened words (e.g., "I’ll" → "I will", "don’t"
→ "do not") to maintain consistency.
2. Tokenization – Splitting text into smaller parts called "tokens" (words, characters, or
subwords).
3. Noise Cleaning – Removing unnecessary symbols, punctuation, and special
characters.
4. Spell-checking – Fixing spelling mistakes for cleaner text.
5. Stopwords Removal – Removing common words like "the," "is," and "and" that do
not add much meaning.
6. Stemming/Lemmatization – Converting words to their root form (e.g., "running" →
"run").

Tokenization

Tokenization is a fundamental step in Natural Language Processing (NLP), where text is broken into small pieces called tokens. Tokenization can be done at three main levels:

1. Word Tokenization – Splitting text into words (e.g., "Never give up" → ["Never",
"give", "up"]).
o Problem: It struggles with Out Of Vocabulary (OOV) words (new words not
in training data).
o Solution: Replace rare words with an "UNK" (unknown) token.
2. Character Tokenization – Breaking text into individual characters (e.g., "smarter"
→ ['s', 'm', 'a', 'r', 't', 'e', 'r']).
o Advantage: Handles OOV words better.
o Problem: The sequence becomes too long, making it harder for models to
learn word meanings.
3. Subword Tokenization – Breaking words into meaningful sub-parts (e.g., "smartest"
→ ["smart", "est"]).
o Advantage: A balance between word and character tokenization.
o Used in: Advanced models like Transformers.
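
💡 Word and character tokenization are easy to sketch in plain Python (whitespace splitting is a simplification; libraries like NLTK and spaCy handle punctuation and edge cases). Subword tokenization needs a learned vocabulary, which is what Byte Pair Encoding below produces:

text = "Never give up"

# Word tokenization (naive whitespace split)
print(text.split())          # ['Never', 'give', 'up']

# Character tokenization
print(list("smarter"))       # ['s', 'm', 'a', 'r', 't', 'e', 'r']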

Byte Pair Encoding (BPE) – A Popular Tokenization Method

BPE is a widely used tokenization approach that efficiently handles OOV words and reduces
input length.

Steps in BPE

1. Split words into characters.


2. Create a vocabulary with unique characters.
3. Identify frequently occurring character pairs.
4. Merge the most common pairs.
5. Repeat until the vocabulary is optimized.

Let's go step by step with an example to understand Byte Pair Encoding (BPE) in a simple way.

Example: Byte Pair Encoding (BPE)

Let's say we have a small text corpus with just three words:

Corpus:

 low
 lower
 lowest

Step 1: Split words into characters

We first break each word into individual characters and add a special end-of-word symbol
(</w>).
low -> l o w </w>
lower -> l o w e r </w>
lowest -> l o w e s t </w>

Step 2: Create an initial vocabulary

The vocabulary consists of all unique characters in the corpus:

Vocabulary: {l, o, w, e, r, s, t, </w>}

Step 3: Count character pair frequencies

We count how often each pair of characters appears together in the corpus:

Character Pair   Frequency

l o              3
o w              3
w </w>           1
w e              2
e r              1
r </w>           1
e s              1
s t              1
t </w>           1

Step 4: Merge the most frequent pair

The pairs "l o" and "o w" are tied as the most frequent (each appears 3 times). We pick "o w" and merge it into a new token "ow" (ties can be broken arbitrarily).

New corpus after merging "o w":

low -> l ow </w>


lower -> l ow e r </w>
lowest -> l ow e s t </w>

New vocabulary:

{l, ow, e, r, s, t, </w>}

Step 5: Repeat the process


Now, we recalculate the frequencies:

Character Pair   Frequency

l ow             3
ow </w>          1
ow e             2
e r              1
r </w>           1
e s              1
s t              1
t </w>           1

The most common pair now is "l ow" (appears 3 times). We merge it:

low -> low </w>


lower -> low e r </w>
lowest -> low e s t </w>

New vocabulary:

{low, e, r, s, t, </w>}

Step 6: Continue merging until vocabulary is optimized

Next, we merge "low e" (appears 2 times):

low -> low </w>


lower -> lowe r </w>
lowest -> lowe s t </w>

Now merge "lowe r" (appears once):

low -> low </w>


lower -> lower </w>
lowest -> lowe s t </w>

And merge "lowe s" → "lowes", then "lowes t" → "lowest":

low -> low </w>


lower -> lower </w>
lowest -> lowest </w>

Final vocabulary:

{low, lower, lowest, </w>}


Final Tokenization

Now, instead of splitting words into individual characters, the model tokenizes them
efficiently as:

 "low" → low
 "lower" → lower
 "lowest" → lowest

This reduces vocabulary size and handles Out Of Vocabulary (OOV) words efficiently.
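
💡 A minimal Python sketch of the BPE merge loop described above. Ties between equally frequent pairs are broken arbitrarily (here "l o" happens to be merged before "o w"), and this sketch stops once no pair occurs more than once, whereas the walkthrough above keeps merging single-count pairs until whole words emerge:

import re
from collections import Counter

def pair_counts(vocab):
    # Count adjacent symbol pairs across all words, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with the merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Step 1: each word split into characters plus the end-of-word marker </w>
vocab = {"l o w </w>": 1, "l o w e r </w>": 1, "l o w e s t </w>": 1}

for step in range(10):                    # at most 10 merges for this sketch
    pairs = pair_counts(vocab)
    best = max(pairs, key=pairs.get)      # most frequent pair (ties broken arbitrarily)
    if pairs[best] < 2:                   # stop when no pair repeats
        break
    vocab = merge_pair(best, vocab)
    print("merge", step + 1, ":", best, "->", "".join(best))

print(vocab)
# merge 1 : ('l', 'o') -> lo
# merge 2 : ('lo', 'w') -> low
# merge 3 : ('low', 'e') -> lowe
# {'low </w>': 1, 'lowe r </w>': 1, 'lowe s t </w>': 1}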

Key Takeaways

✅ Reduces vocabulary size (instead of storing each character, we store meaningful subwords).
✅ Handles OOV words by breaking them into known subwords instead of replacing them with an "UNK" token.
✅ Used in Transformer models such as GPT (which uses BPE) and BERT (which uses the closely related WordPiece method).

