Lecture 02 - NLU concepts
BITS Pilani, Pilani Campus
Session Content
Importance: These steps help in converting raw text into a more manageable and
analyzable format, facilitating better model performance and more accurate results.
• Tools
• NLTK: A leading platform for building Python programs to work with human language data.
• SpaCy: An open-source software library for advanced NLP in Python.
• Tokenizer APIs: Available in various programming languages and platforms.
• Importance: Tokenization is the first step in the text pre-processing pipeline and is crucial
for the performance of subsequent steps.
Model Performance: Machine learning models require numerical or categorical input. Without
tokenization, converting text into a suitable format is impossible, leading to poor model performance.
Difficulty in Feature Extraction: Tokenization allows for extracting features such as word
frequencies, n-grams, and more. Skipping this step hinders effective feature extraction.
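The n-gram features mentioned above can be produced directly from a token list. A minimal sketch (the window size `n` and the sample tokens are illustrative):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list (assumes tokens are ready-made)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "quick", "brown", "fox"]
print(ngrams(tokens, 2))  # bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```

NLTK offers the same idea as `nltk.util.ngrams`; the point here is only that the feature cannot be computed until tokenization has happened.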
Ineffective Text Cleaning: Tokenization is often the first step in text cleaning. Without it, removing
stop words, punctuation, and performing stemming/lemmatization becomes challenging.
Error Propagation: Errors in initial steps propagate through the pipeline, leading to inaccuracies in
tasks like sentiment analysis, NER, and POS tagging.
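The tokenization step itself can be sketched with a single regular expression; this is a minimal stand-in for the much richer rules in NLTK's `word_tokenize` or spaCy's tokenizer:

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens with a simple regex.
    (A minimal sketch; production tokenizers handle contractions, URLs, etc.)"""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
```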
• Purpose:
• Noise Reduction: Removing stop words helps in reducing the noise in the text data.
• Efficiency: It reduces the size of the text data, making processing faster and more efficient.
• Tools:
• NLTK: Provides a predefined list of stop words and functions for their removal.
• SpaCy: Offers built-in support for stop word removal in various languages.
• Example:
• Input: "The quick brown fox jumps over the lazy dog."
• Output: "quick brown fox jumps lazy dog"
• Importance: Removing stop words helps in focusing on the words that are more likely to be
significant in the analysis, thereby improving the performance of NLP models.
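The stop-word example above can be reproduced with a few lines of Python. The stop-word set here is a small illustrative sample; NLTK's full English list (`nltk.corpus.stopwords.words("english")`) is much larger:

```python
# Small illustrative stop-word list (NLTK ships a far larger one).
STOP_WORDS = {"the", "a", "an", "and", "is", "in", "over", "of", "to"}

def remove_stop_words(text):
    tokens = text.replace(".", "").split()
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(" ".join(remove_stop_words("The quick brown fox jumps over the lazy dog.")))
# quick brown fox jumps lazy dog
```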
• Stemming: Reduces words to their root form by stripping suffixes. It is fast but may produce non-words.
• Lemmatization: Converts words to their base form (lemma) using morphological analysis. It
always returns a valid word.
• Examples:
• Stemming: "running" -> "run", "jumps" -> "jump"
• Lemmatization: "better" -> "good", "running" -> "run"
• Tools:
• NLTK:
• PorterStemmer for stemming
• WordNetLemmatizer for lemmatization
• SpaCy: Offers built-in lemmatization capabilities.
• Use Case: Choose stemming for quick and dirty text processing; use lemmatization for
tasks requiring higher accuracy.
• Importance: Both techniques help in normalizing words to their base forms, which reduces
the dimensionality of the text data and improves the performance of NLP models.
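A toy suffix-stripping stemmer illustrates the idea; it is not the Porter algorithm (use NLTK's `PorterStemmer` or spaCy's lemmatizer in practice), just a sketch of rule-based suffix removal:

```python
def crude_stem(word):
    """Toy suffix stripper: removes a common suffix and undoes
    consonant doubling, e.g. 'running' -> 'runn' -> 'run'."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                stem = stem[:-1]  # drop the doubled consonant
            return stem
    return word

print(crude_stem("running"), crude_stem("jumps"))  # run jump
```

Note that no such rule set will map "better" to "good"; that mapping needs the dictionary lookup a lemmatizer performs, which is exactly the accuracy/speed trade-off described above.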
• Purpose:
• Consistency: Ensures that words are treated equally regardless of their case.
• Reduction of Redundancy: Helps in reducing redundancy by treating "Apple" and "apple" as the same word.
• Example:
• Input: "Natural Language Processing"
• Output: "natural language processing"
• Tools:
• Python String Methods: .lower() and .upper()
• NLTK: Provides functions for case normalization.
• SpaCy: Built-in support for case normalization.
• Importance:
• Improves Text Processing: Case normalization simplifies text processing by reducing the number of unique tokens.
• Enhances Model Performance: Models become more efficient as they deal with fewer variations of the same word.
• Note: Case normalization is particularly useful when the case of the text does not carry significant meaning for the
analysis.
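In Python, case normalization is a one-liner on the string itself:

```python
text = "Natural Language Processing"
normalized = text.lower()
print(normalized)  # natural language processing

# str.casefold() is a more aggressive variant for non-English text,
# e.g. German "Straße".casefold() == "strasse"
```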
• Tools:
• Regular Expressions (Regex): Powerful for pattern matching and substitution.
• NLTK: Provides functions for various text cleaning tasks.
• SpaCy: Built-in functions for text cleaning.
• Example:
• Input: "Hello, world! Visit us at https://example.com #NLP"
• Output: "Hello world Visit us at example com NLP"
• Importance:
• Enhances Data Quality: Cleaned text is more consistent and easier to analyze.
• Improves Model Accuracy: Cleaner data leads to better-performing models by reducing noise and irrelevant information.
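The cleaning example above can be sketched with regular expressions (a minimal pipeline; the exact rules depend on the task — sometimes URLs or hashtags should be kept, not stripped):

```python
import re

def clean(text):
    """Strip URL schemes, punctuation, and hash signs; collapse whitespace."""
    text = re.sub(r"https?://", "", text)     # drop the URL scheme
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation and '#' -> space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

print(clean("Hello, world! Visit us at https://example.com #NLP"))
# Hello world Visit us at example com NLP
```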
Importance:
Choosing the right technique depends on the specific requirements of the task, available
data, and resources. Machine learning and deep learning approaches are preferred for
their accuracy and scalability.
Example: "The quick brown fox jumps over the lazy dog."
• POS Tags:
The (DT) quick (JJ) brown (JJ) fox (NN) jumps (VBZ) over (IN) the (DT) lazy (JJ) dog (NN)
Tools:
• NLTK: Provides a comprehensive POS tagging module.
• SpaCy: Offers efficient and accurate POS tagging capabilities.
• Stanford POS Tagger: A robust tool developed by Stanford University.
Importance:
POS tagging is fundamental for many NLP tasks such as parsing, text-to-speech conversion, and information
extraction. It enables a deeper understanding of the syntactic and semantic properties of text.
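The (word, tag) output format can be illustrated with a toy lookup tagger. This is purely for illustration: real taggers such as NLTK's `pos_tag` and spaCy's are trained statistical models, not dictionaries:

```python
# Toy lexicon covering only the example sentence (illustrative, not real).
TOY_LEXICON = {
    "the": "DT", "quick": "JJ", "brown": "JJ", "fox": "NN",
    "jumps": "VBZ", "over": "IN", "lazy": "JJ", "dog": "NN",
}

def toy_pos_tag(tokens):
    """Look each token up in the lexicon, defaulting unknown words to NN."""
    return [(t, TOY_LEXICON.get(t.lower(), "NN")) for t in tokens]

print(toy_pos_tag("The quick brown fox jumps over the lazy dog".split()))
```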
Example:
Input: "I love the new design of your website!"
Output: Positive
Importance:
Sentiment analysis provides valuable insights into the emotions and opinions
expressed in text, enabling better decision-making and strategy formulation.
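A minimal lexicon-based scorer shows the simplest form of sentiment analysis (the word lists here are invented for the example; real systems use trained classifiers or transformer models):

```python
# Tiny illustrative sentiment lexicons.
POSITIVE = {"love", "great", "new", "good"}
NEGATIVE = {"hate", "bad", "broken", "slow"}

def sentiment(text):
    """Count positive vs. negative lexicon hits and return a label."""
    words = {w.strip("!.,?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

print(sentiment("I love the new design of your website!"))  # Positive
```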
Key Techniques:
Rule-based Systems : Use predefined rules and templates to generate text.
Example: Fill-in-the-blank templates for automated report generation
Markov Chains : Use probabilistic models based on the likelihood of word sequences.
Example: Generating text by predicting the next word based on the previous one
Recurrent Neural Networks (RNNs): Use neural networks with loops to maintain context over
sequences.
Example: Generating poetry or short stories
Long Short-Term Memory Networks (LSTMs) : A type of RNN designed to better handle long-
term dependencies.
Example: Generating more coherent paragraphs and articles.
Transformer Models: Use self-attention mechanisms to capture long-range dependencies in text.
Example: GPT-3 generating articles, stories, and dialogue.
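Of the techniques above, a Markov chain is small enough to sketch in full: each word maps to the list of words observed after it, and generation repeatedly samples a successor (the corpus and seed are illustrative):

```python
import random
from collections import defaultdict

def build_bigram_model(words):
    """Map each word to the list of words that follow it in the corpus."""
    model = defaultdict(list)
    for w1, w2 in zip(words, words[1:]):
        model[w1].append(w2)
    return model

def generate(model, start, length, seed=0):
    """Walk the chain from `start`, sampling the next word each step."""
    random.seed(seed)  # deterministic for the demo
    out = [start]
    for _ in range(length - 1):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat the cat ran".split()
print(generate(build_bigram_model(corpus), "the", 5))
```

RNNs, LSTMs, and transformers replace this count-based table with learned representations, which is what lets them maintain context far beyond the previous word.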
[Diagram: dialog system architecture — the USER's input goes to the NLU, which produces Intents & Slots; the DIALOG MANAGER tracks dialog STATE and consults a DATABASE (personal or public information); the NLG module renders the resulting ACTION back to the user via the MACHINE.]
Example Intents:
Booking a flight
Slot Filling: Extracting specific pieces of information (slots) from the user's input that are necessary to complete the intent.
Purpose: Provides detailed information required to fulfill the user's request.
Example: Slots for Flight Booking:
• Destination
• Departure Date
• Return Date
• Number of Passengers
Techniques:
Rule-based Methods: Use predefined patterns and templates to recognize intents and extract slots.
Machine Learning Approaches: Train classifiers on labeled datasets to predict intents and extract slots.
Deep Learning Approaches: Use neural networks, particularly sequence-to-sequence models, to handle more complex and varied
inputs.
Example:
User Input: "I want to book a flight to New York on June 5th."
Recognized Intent: Book Flight
Extracted Slots:
Destination: New York
Departure Date: June 5th
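The rule-based approach from the techniques list can be sketched with a single regex pattern per intent (the pattern and slot names below are invented for this one example; ML/DL approaches replace such hand-written rules with trained classifiers and sequence taggers):

```python
import re

# One hand-written pattern for the "Book Flight" intent (illustrative only).
FLIGHT_PATTERN = re.compile(
    r"book a flight to (?P<destination>[A-Za-z ]+?) on "
    r"(?P<date>[A-Za-z]+ \d+(?:st|nd|rd|th)?)"
)

def parse(utterance):
    """Return the recognized intent and any extracted slots."""
    m = FLIGHT_PATTERN.search(utterance)
    if m:
        return {"intent": "Book Flight", "slots": m.groupdict()}
    return {"intent": "Unknown", "slots": {}}

print(parse("I want to book a flight to New York on June 5th."))
# {'intent': 'Book Flight', 'slots': {'destination': 'New York', 'date': 'June 5th'}}
```

The brittleness is visible immediately: a rephrasing like "get me to New York" misses the pattern entirely, which is why labeled-data approaches scale better.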