Learn 4
An encoder-decoder is a neural network architecture commonly used in sequence-to-sequence (Seq2Seq) models, particularly in tasks involving natural
language processing (NLP) and machine translation.
The Seq2Seq model is a type of machine learning model that takes sequential
data as input and generates sequential data as output.
Encoder Block
The main purpose of the encoder block is to process the input sequence and
capture information in a fixed-size context vector.
• The input sequence is put into the encoder.
• The encoder processes each element of the input sequence using
neural networks (or transformer architecture).
• Throughout this process, the encoder keeps an internal state, and the
ultimate hidden state functions as the context vector that encapsulates
a compressed representation of the entire input sequence. This
context vector captures the semantic meaning and important
information of the input sequence.
The final hidden state of the encoder is then passed as the context vector to the decoder.
Decoder Block
The decoder block is similar to the encoder block. The decoder processes the context vector from the encoder to generate the output sequence incrementally.
• In the training phase, the decoder receives both the context vector and the desired target output sequence (ground truth).
• During inference, the decoder relies on its own previously generated outputs as inputs for subsequent steps.
The decoder uses the context vector to comprehend the input sequence and create the corresponding output sequence. It engages in autoregressive
generation, producing individual elements sequentially. At each time step, the decoder uses the current hidden state, the context vector, and the previous
output token to generate a probability distribution over the possible next tokens. The token with the highest probability is then chosen as the output, and
the process continues until the end of the output sequence is reached.
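A minimal sketch of this encoder-decoder loop is shown below, assuming PyTorch is available; the GRU layers, layer sizes, greedy (argmax) decoding, and toy inputs are illustrative assumptions rather than a prescribed implementation, and the model here is untrained.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len); the final hidden state acts as the context vector
        _, hidden = self.rnn(self.embed(src))
        return hidden                                   # (1, batch, hid_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1); one autoregressive step at a time
        output, hidden = self.rnn(self.embed(prev_token), hidden)
        return self.out(output.squeeze(1)), hidden      # logits over the next token

def greedy_decode(encoder, decoder, src, sos_idx, max_len=10):
    # Feed the previously generated token back in at each step (autoregressive generation)
    hidden = encoder(src)                               # context vector from the encoder
    token = torch.full((src.size(0), 1), sos_idx, dtype=torch.long)
    outputs = []
    for _ in range(max_len):
        logits, hidden = decoder(token, hidden)
        token = logits.argmax(dim=-1, keepdim=True)     # pick the highest-probability token
        outputs.append(token)
    return torch.cat(outputs, dim=1)

# Tiny usage example with random weights (untrained, so the output is arbitrary)
enc, dec = Encoder(50, 16, 32), Decoder(50, 16, 32)
src = torch.randint(0, 50, (2, 7))                      # batch of 2 source sequences
print(greedy_decode(enc, dec, src, sos_idx=1))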
Advantages:
1. Handling Sequential Data: Well-suited for tasks involving natural language, speech, and time series data.
2. Context Understanding: The encoder-decoder architecture captures the input sequence's context for generating accurate outputs.
3. Attention Mechanism: Improves performance for long input sequences by allowing the model to focus on specific parts of the input during output
generation.
Disadvantages:
1. High Computational Cost: Training Seq2Seq models demands significant computational resources and can be challenging to optimize.
2. Limited Interpretability: Understanding the reasoning behind the model's decisions is difficult.
3. Overfitting: Without proper regularization, these models can overfit training data, leading to poor generalization.
4. Rare Word Handling: Seq2Seq models struggle with words that are rare in, or absent from, the training data.
5. Long Input Sequences: Context vectors may fail to capture all the information in very long input sequences, impacting performance.
Applications of Seq2Seq model
1. Text Summarization: The Seq2Seq model effectively understands the input text, which makes it suitable for news and document summarization.
2. Speech Recognition: Seq2Seq models, especially those with attention mechanisms, excel at processing audio waveforms for ASR. They are able to
capture spoken-language patterns effectively.
3. Image Captioning: Seq2Seq models integrate image features from CNNs with text-generation capabilities for image captioning. They are
able to describe images in a human-readable format.
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the relevance of a word to a document in a collection (or
corpus) of documents. It balances two factors: how frequently a term appears in a specific document (Term Frequency, TF) and how rare the term is across
the entire corpus (Inverse Document Frequency, IDF). This ensures that commonly used words (e.g., "the," "and") are given less weight, while more unique
words are emphasized.
TF-IDF assigns high scores to words that are frequent in a document but rare in the overall corpus. Common words have low scores, making TF-IDF effective
for distinguishing important terms.
4. Text Classification:
TF-IDF is also used in machine learning models for text classification, where words that are unique to a particular class or category help the model
identify the correct label for a document.
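A brief illustrative sketch using scikit-learn's TfidfVectorizer; the three-document corpus below is an assumed toy example.
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "the" appears in most documents, so its IDF (and TF-IDF weight) stays low
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "quantum computing is fascinating"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))   # one row per document, one column per term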
Rule Based Approach
There are three broad types of NLP approaches: rule-based, machine learning (statistical), and deep learning (neural) approaches.
The rule-based approach is one of the foundational methods in Natural Language Processing (NLP). It relies on predefined linguistic rules and patterns to
analyze, process, and understand text data. Unlike machine learning-based methods, which learn patterns from data, rule-based approaches explicitly
encode domain-specific knowledge through logical rules, grammar structures, or pattern matching.
The rule-based approach involves defining a set of rules to capture patterns in text and applying them to perform specific NLP tasks such as text
classification, entity recognition, or information extraction. These rules are often manually created by experts based on domain knowledge and linguistic
principles.
Machine learning approaches use data to automatically learn patterns and adapt to new examples, while rule-based systems require manual definition of
patterns. Rule-based methods are more interpretable but less flexible.
While modern methods like machine learning and neural networks dominate NLP, rule-based systems remain relevant in resource-constrained settings or
for niche applications where explicit knowledge and control are critical.
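As a concrete illustration, a rule-based extractor can be written with hand-crafted regular expressions; the patterns and sample text below are assumptions made for the sake of example.
import re

# Hand-written rules (regular expressions) for simple information extraction
rules = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "DATE":  r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
    "PRICE": r"\$\d+(?:\.\d{2})?",
}

text = "Contact sales@example.com before 12/31/2025 to get the plan for $9.99."

for label, pattern in rules.items():
    for match in re.findall(pattern, text):
        print(label, "->", match)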
Probabilistic Model in Machine Learning
Probabilistic models are a class of machine learning algorithms that use probability theory to model the uncertainty and variability in data. These models
rely on the principles of statistics and probability to make predictions, infer patterns, and reason under uncertainty. Instead of making deterministic
decisions, probabilistic models provide a measure of confidence or likelihood for their predictions.
Probabilistic models represent data as distributions rather than fixed values. They assume that the observed data is generated by some underlying
probability distribution. By learning these distributions, the models can predict outcomes, assess uncertainty, and even handle missing or noisy data.
Probabilistic models are used in a variety of machine learning tasks such as classification, regression, clustering, and dimensionality reduction. Some
popular probabilistic models include:
4. Bayesian Networks:
Represent probabilistic dependencies among variables using a directed acyclic graph. Useful for reasoning and decision-making under uncertainty.
5. Markov Models:
Use the Markov property, assuming the future state depends only on the current state. Often used in time series analysis.
Applications
1. Natural Language Processing (NLP): Text classification with Naive Bayes
2. Computer Vision: Image segmentation using probabilistic graphical models.
3. Finance: Risk assessment and decision-making under uncertainty.
4. Healthcare: Predictive modelling for patient diagnosis and treatment outcomes.
Probabilistic models offer a principled way to deal with uncertainty and variability in data. While they can be computationally intensive, their ability to
provide interpretable results and manage uncertainty makes them invaluable in many applications, from recommendation systems to scientific research.
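For instance, a Naive Bayes text classifier (a common probabilistic model in NLP) can be sketched as follows; the tiny labelled dataset is assumed purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny assumed dataset: 1 = positive, 0 = negative
texts = ["I love this movie", "great acting and plot",
         "terrible film", "I hate this boring movie"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)

test = vectorizer.transform(["what a great movie"])
print(model.predict(test))          # predicted class
print(model.predict_proba(test))    # probability (confidence) for each class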
Part-of-Speech (POS) Tagging in NLP
Part-of-Speech (POS) tagging is a fundamental task in Natural Language Processing (NLP) that involves
identifying and labelling the grammatical category of each word in a sentence. These categories, known as
parts of speech, include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and
others. The process is essential for understanding the structure and meaning of text, making it a critical step
in tasks like text analysis, machine translation, and information extraction.
# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."
# Sample text
text = "Barack Obama was born in Hawaii and served as President of the United States."
Bag of Words (BoW)
Key Concepts of Bag of Words (BoW):
1. Tokenization:
   o Split the text into individual words or tokens.
   o Example: "I love NLP" → ["I", "love", "NLP"]
2. Vocabulary Creation:
   o Create a list of unique words from all documents in the dataset.
   o Example for texts:
      ▪ Text 1: "I love NLP"
      ▪ Text 2: "NLP loves Python"
      ▪ Vocabulary: ["I", "love", "NLP", "loves", "Python"]
3. Vectorization:
   o Represent each text as a numerical vector based on the vocabulary.
   o Each position in the vector corresponds to a word in the vocabulary, and the value is the word's frequency in the text.
   o Example:
      ▪ For Text 1: [1, 1, 1, 0, 0]
      ▪ For Text 2: [0, 0, 1, 1, 1]
4. Fixed-Length Representation:
   o Regardless of the text length, each document is converted into a vector of fixed size (equal to the vocabulary size).
Disadvantages:
• Ignores Context: Word order and semantics are lost, so it cannot differentiate between "not good" and "good."
• High Dimensionality: Large vocabularies result in sparse and high-dimensional vectors.
• Sensitivity to Frequent Words: Common words (e.g., "the," "is") dominate, though techniques like TF-IDF can mitigate this.
Code:
from sklearn.feature_extraction.text import CountVectorizer
# Sample sentences
texts = ["I love NLP", "NLP loves Python"]
# Build the vocabulary and count matrix; this token pattern keeps one-letter words such as "I"
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
bow_matrix = vectorizer.fit_transform(texts)
print("Vocabulary:", vectorizer.get_feature_names_out().tolist())
print("Bag of Words Matrix:")
print(bow_matrix.toarray())
Output:
Vocabulary: ['i', 'love', 'loves', 'nlp', 'python']
Bag of Words Matrix:
[[1 1 0 1 0]
[0 0 1 1 1]]
Word Embeddings in NLP
Word embeddings are a technique in Natural Language Processing (NLP) to represent words in a continuous vector space. Unlike traditional methods like
Bag of Words or TF-IDF, which treat words as discrete symbols, word embeddings assign each word a dense vector that captures its meaning and semantic
relationships with other words.
Word embeddings are dense vector representations of words in a continuous space, where semantically similar words are positioned closer together.
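A minimal sketch of training word embeddings with gensim's Word2Vec, assuming gensim is installed; the toy corpus and hyperparameters are illustrative, so the learned neighbours are not meaningful here.
from gensim.models import Word2Vec

# Toy tokenized corpus (in practice, a large corpus is needed)
sentences = [["i", "love", "nlp"],
             ["nlp", "loves", "python"],
             ["i", "love", "python"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["nlp"][:5])            # first few dimensions of the dense vector for "nlp"
print(model.wv.most_similar("nlp"))   # words closest to "nlp" in this toy vector space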
1. Lexical Analysis
The purpose of lexical analysis is to read the input code and break it down into meaningful elements called tokens. These tokens become the
building blocks for the later phases of compilation.
The output of the lexical analysis phase is a stream of tokens, that can be more easily processed by the syntax analyzer, which is responsible for checking
the program for correct syntax and structure.
The actual text or character sequences that form the tokens are called lexemes.
In programming languages, lexical analysis helps in breaking down the source code so that the compiler or interpreter can understand and process it.
In NLP, lexical analysis is the first step in processing natural language text for tasks such as part-of-speech tagging, named entity recognition, and text
parsing.
Example: In the sentence "The cat sat on the mat," the tokens would be "The", "cat", "sat", "on", "the", "mat".
2. Syntactic Analysis
Syntactic Analysis, also known as parsing, is the process of analyzing the structure of a sentence or expression to determine its grammatical structure
according to a formal grammar. It is a crucial step in both Natural Language Processing (NLP) and compilers for programming languages, where the goal is
to understand how the tokens (from lexical analysis) fit together according to the rules of syntax.
Syntax analysis (parsing) is the second phase of the compilation process, following lexical analysis. It takes the tokens generated by the lexical analyzer and
attempts to build a Parse Tree or Abstract Syntax Tree (AST), representing the program’s structure. During this phase, the syntax analyzer checks whether
the input string adheres to the grammatical rules of the language using context-free grammar. If the syntax is correct, the analyzer moves forward;
otherwise, it reports an error.
The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of the source code, which is a hierarchical representation of the source
code that reflects the grammatical structure of the program.
Example: "The cat chases the mouse in the garden" uses the following constituent labels:
• Sentence (S)
• Noun Phrase (NP)
• Determiner (Det)
• Verb Phrase (VP)
• Prepositional Phrase (PP)
• Verb (V)
• Noun (N)
One possible bracketed parse (with P for the preposition "in") is: (S (NP (Det The) (N cat)) (VP (V chases) (NP (Det the) (N mouse)) (PP (P in) (NP (Det the) (N garden))))).
In short, syntactic analysis checks grammar and word arrangement and shows the relationships among the words.
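As a concrete sketch, the example sentence above can be parsed with NLTK's chart parser and a hand-written toy grammar; the grammar below is an assumption made for illustration, not part of the source.
import nltk

# A toy context-free grammar covering the example sentence
grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N  -> 'cat' | 'mouse' | 'garden'
V  -> 'chases'
P  -> 'in'
""")

parser = nltk.ChartParser(grammar)
tokens = "the cat chases the mouse in the garden".split()
for tree in parser.parse(tokens):
    print(tree)   # two parses are printed: the PP can attach to the VP or to the NP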
3. Semantic Analysis
Semantic Analysis is the process of extracting and understanding the meaning from a sentence or text. While syntactic analysis focuses on the structure of
sentences, semantic analysis focuses on interpreting the meaning behind the words and their relationships. It seeks to resolve ambiguities and determine
the precise meaning of phrases and sentences in context.
Semantic analysis in NLP can be divided into two main parts, Lexical Semantic Analysis and Compositional Semantic Analysis, each focusing on
understanding meaning at different levels.
Key elements of semantic analysis include various types of word relationships, which help in deciphering the underlying meaning of text. Here’s an
explanation of each element:
1. Hyponymy
• Definition: Refers to terms that are specific instances of a general category (hypernym).
• Analogy: Class-object relationship.
• Example: Color is the hypernym, while grey, blue, and red are its hyponyms.
2. Homonymy
• Definition: Words with the same spelling but completely different meanings.
• Example: Rose can mean "a flower" or "the past tense of rise."
3. Synonymy
• Definition: Words with similar or identical meanings.
• Example: (Job, Occupation), (Large, Big), (Stop, Halt).
4. Antonymy
• Definition: Pairs of words with opposite meanings.
• Example: (Day, Night), (Hot, Cold), (Large, Small).
5. Polysemy
• Definition: Words with the same spelling that have multiple closely related meanings.
• Difference from Homonymy: In polysemy, meanings are related; in homonymy, they are not.
• Example: Man can mean:
o "Human species"
o "Male human"
o "Adult male human"
In short, Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.
4. Discourse Integration
In short, discourse integration is the process of connecting and interpreting different parts of a text to maintain coherence and meaning across sentences
and paragraphs.
5. Pragmatic analysis
Pragmatic analysis deals with understanding the intended meaning behind words or sentences based on the situation, social norms, or cultural context.
This step goes beyond literal meanings and interprets implied meanings.
Pragmatic analysis refers to understanding the intended meaning of a sentence in context, beyond its literal meaning. It involves interpreting how language
is used in real-world situations, considering factors like the speaker's intentions, the relationship between the speaker and listener, and the context of the
conversation.
For example, in the sentence "Can you pass the salt?" the literal meaning is a question about someone's ability to pass the salt, but pragmatically it is a request
for the salt.
In short, it checks the real-world knowledge or context to derive the real meaning of the sentence.
Conclusion:
1. Lexical Analysis: Breaking down the text into tokens (words, punctuation, etc.) and identifying the basic components of the language.
2. Syntactic Analysis: Analyzing the grammatical structure of sentences, identifying parts of speech, and constructing a syntax tree to understand the
relationships between words.
3. Semantic Analysis: Extracting meaning from sentences by interpreting the meanings of words and how they combine to form the overall meaning of the
text.
4. Discourse Integration: Linking together different sentences or phrases within a text, understanding how previous and subsequent sentences contribute
to the overall meaning of the discourse.
5. Pragmatic Analysis: Understanding the context and intent behind the text, including resolving ambiguities and interpreting indirect meanings (e.g.,
recognizing sarcasm or requests).
Dataset Preparation for NLP Applications
1. Data Collection
Data can be collected from various sources, including text files, web scraping with tools like BeautifulSoup, APIs from platforms like Twitter, and public
datasets from Kaggle or UCI. For instance, customer reviews are often gathered for sentiment analysis.
2. Text Cleaning and Preprocessing
• Tokenization splits text into words or sentences. For example, "I love NLP!" becomes ["I", "love", "NLP", "!"].
• Lowercasing converts text to uniform lowercase, e.g., "I Love NLP" to "i love nlp."
• Removing punctuation eliminates unnecessary marks; "Hello, World!" turns into "Hello World."
• Removing stopwords filters out common words, like turning "This is a sample sentence" into "sample sentence."
• Stemming or lemmatization reduces words to their root forms, so "running," "runs," and "ran" all become "run."
• Handling numbers requires decisions on removal or conversion, as in changing "In 2023, there were 5000 participants" to "In year there were
participants."
• Handling misspellings corrects errors with libraries; "Ths is a smple text" becomes "This is a simple text."
• Removing HTML tags cleans web-scraped data, changing <p>This is a paragraph.</p> to "This is a paragraph."
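The cleaning steps listed above can be chained into a short pipeline; the sketch below is one possible version, assuming NLTK and its punkt, stopwords, and wordnet data are installed, and using an assumed sample sentence.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads if not already present:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "In 2023, there were 5000 participants running in the event!"

tokens = word_tokenize(text.lower())                           # tokenize + lowercase
tokens = [t for t in tokens if t not in string.punctuation]    # remove punctuation
tokens = [t for t in tokens if not t.isdigit()]                # remove bare numbers
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]            # remove stopwords
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]             # lemmatize
print(tokens)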
3. Text Normalization
Text normalization includes expanding contractions, such as "I'm" to "I am," and handling special characters or emojis.
4. Feature Engineering
Feature engineering techniques include Bag of Words (BoW) for word counts, TF-IDF for word importance, and word embeddings like Word2Vec. n-grams
capture sequences of words.
5. Splitting the Dataset
The dataset is typically split into training and testing sets, using an 80-20 ratio, and validated with techniques like KFold.
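A brief sketch of an 80-20 split and k-fold validation with scikit-learn; the feature matrix and labels below are assumed placeholders standing in for vectorized text.
import numpy as np
from sklearn.model_selection import train_test_split, KFold

X = np.arange(20).reshape(10, 2)   # assumed feature matrix (e.g., TF-IDF vectors)
y = np.array([0, 1] * 5)           # assumed labels

# 80-20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))   # 8 2

# 5-fold cross-validation indices
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(X):
    print("train:", train_idx, "validate:", val_idx)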
6. Handling Imbalanced Datasets
To manage imbalanced datasets, oversampling duplicates minority class examples, while undersampling reduces majority class examples. Class
weights can prioritize the minority class.
7. Preparing for Specific NLP Tasks
Preparation for tasks includes labeling text for sentiment analysis and annotating entities for Named Entity Recognition (NER).
8. Data Augmentation
Data augmentation techniques include synonym replacement, random insertion of words, and back translation for paraphrasing.
9. Final Checks
Final checks ensure data consistency, verifying no missing labels or formatting issues, and reviewing data quality for errors.
10. Documentation and Metadata
Documenting preprocessing steps and tools used, along with metadata like data source and collection date, adds valuable context.
Manner of Articulation
Manner of Articulation refers to how airflow is manipulated during the production of speech sounds. It describes the way in which the vocal tract modifies
the airflow to create different types of sounds.
1. Plosive (Stop): Complete closure of the vocal tract followed by a sudden release of air.
Examples: /p/, /b/, /t/, /d/, /k/, /g/.
2. Nasal: Air flows through the nose due to a lowered velum.
Examples: /m/, /n/, /ŋ/.
3. Fricative: Partial closure of the vocal tract, creating a turbulent airflow.
Examples: /f/, /v/, /s/, /z/, /ʃ/, /ʒ/, /θ/, /ð/, /h/.
4. Affricate: A combination of a plosive and a fricative, starting with a complete closure followed by a turbulent release.
Examples: /tʃ/ ("chess"), /dʒ/ ("judge").
5. Approximant: The vocal tract is narrowed, but not enough to create turbulence.
Examples: /j/, /w/, /ɹ/ ("red").
6. Lateral Approximant: The airstream flows along the sides of the tongue.
Example: /l/.
7. Tap or Flap: A single, quick touch of the tongue against the roof of the mouth.
Example: /ɾ/ ("t" in "butter" in American English).
8. Trill: Rapid, repeated contact of the articulator with the place of articulation.
Example: /r/ (rolled "r").
Place vs. Manner of Articulation:
Place          | Plosive | Nasal | Fricative | Affricate | Approximant | Lateral Approximant
Bilabial       | p, b    | m     |           |           | w           |
Labiodental    |         |       | f, v      |           |             |
Dental         |         |       | θ, ð      |           |             |
Alveolar       | t, d    | n     | s, z      |           | ɹ           | l
Post-alveolar  |         |       | ʃ, ʒ      | tʃ, dʒ    |             |
Palatal        |         |       |           |           | j           |
Velar          | k, g    | ŋ     |           |           |             |
Glottal        | ʔ       |       | h         |           |             |
Speech Processing in NLP
1. Speaker Recognition
Speaker Verification: Confirms a speaker's identity by comparing their voice to a claimed identity.
• Process:
   1. Speaker provides a voice sample.
   2. Match with a pre-stored voiceprint.
• Types:
   o Text-dependent: Specific phrase required.
   o Text-independent: Any speech allowed.
• Applications: Biometric authentication, secure access, mobile devices.
• Challenges: Background noise, microphone quality, voice changes.
Speaker Identification: Identifies a speaker from a group based on unique voice characteristics.
• Process:
   1. Extract features (pitch, tone, accent).
   2. Compare against a database of voiceprints.
• Applications: Security systems, voice assistants, call centers.
• Challenges: Variability due to mood, health, background noise.
Speaker Diarisation: Determines who spoke when in an audio recording.
• Process:
   1. Segment audio by silence and speech.
   2. Extract features from segments.
   3. Cluster segments by voice characteristics.
• Applications: Meeting transcription, multi-speaker conversations.
• Challenges: Speech variability, overlapping dialogue, background noise.
2. Speech Recognition
Speaker Mode
• Speaker Dependent: Trained on a specific user's voice.
Speech Modes
• Isolated Speech Recognition: Recognizes words spoken with clear pauses.
   o Features: Commonly used, identifies words ~0.96 seconds in length.
   o Use: Ideal for distinct voice commands (e.g., home automation).
• Connected Speech Recognition: Bridges isolated and continuous speech, allowing natural multi-word input with short pauses.
   o Features: Recognizes phrases up to ~1.92 seconds; limited vocabulary (~20 words).
   o Use: Suitable for short commands in voice-controlled applications.
• Continuous Speech Recognition: Processes natural, conversational speech.
   o Challenges: Difficult to segment words as they merge together.
   o Use: Virtual assistants, transcription, communication tools.
Speaking Style
• Dictation: Formal, structured speech used in controlled settings like note-taking.
   o Features: Predefined vocabulary, clear and organized speech, distinct pronunciation, proper pauses for context and punctuation.
• Spontaneous Speech: Unplanned, natural conversation with varied expressions.
   o Features: Free-flowing, variable speech patterns, influenced by background noise and interruptions, requires context understanding for accurate recognition.
3. Language Identification
It is the process of determining a text or audio's language. It is used in tasks like translation, document classification, and speech recognition. Techniques include character-based, word-based, and machine learning methods. Challenges include dialects, code-switching, and limited data. Future research aims to improve robustness, address low-resource languages, and combine multiple modalities.
Word Boundary Detection
Word boundary detection involves identifying where one word ends and another begins, essential for speech recognition and natural language processing.
Two types of speech are considered in word boundary detection:
1. Constrained Speech:
In constrained speech, words have well-defined boundaries with clear pauses between them. The speech is structured, making it easier to detect
where one word ends and another begins.
2. Unconstrained Speech:
Unconstrained speech lacks clear boundaries, grammar, or pauses. This natural, free-flowing speech makes word boundary detection more
challenging. Without a robust detection algorithm, it often leads to false alarms and missed word boundaries due to the continuous nature of
speech.
Speech Recognition
1. Overview of Speech Recognition:
   o The process of converting spoken words into text by using computational techniques.
   o Involves three major stages: feature extraction, pattern recognition, and language modeling.
2. Feature Extraction:
   o Converts raw speech signals (audio waveform) into a sequence of feature vectors that represent the acoustic characteristics of the speech.
   o Common techniques: MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction).
3. Pattern Recognition:
   o This is where HMMs play a key role. The feature vectors extracted from the audio are matched to phonemes or words using HMMs.
   o Acoustic Models: HMMs are used to model the sequence of phonemes or sub-word units that generate speech sounds.
4. Language Modeling:
   o Uses statistical models to predict the probability of a word sequence. It helps to handle ambiguities in speech recognition by providing context.
   o Common models: n-gram models (bigram, trigram), neural network-based models.
5. Decoding Process:
   o Given a sequence of feature vectors, the goal is to find the most likely sequence of words or phonemes. This is done using dynamic programming techniques like the Viterbi algorithm.
Key Points to Remember:
• HMMs model the sequential nature of speech, where each state corresponds to a phoneme or other linguistic unit.
• Feature extraction (e.g., MFCC) is a critical step in transforming raw audio into a format usable by HMMs.
• The Viterbi algorithm is central to decoding the best sequence of phonemes or words from a sequence of acoustic features.
• Language models help disambiguate the output by providing contextual probabilities for word sequences.
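To make the decoding step concrete, here is a minimal Viterbi sketch over a toy HMM; the states, observations, and probabilities below are assumptions made for illustration, not a real acoustic model.
import numpy as np

# Toy HMM: 2 hidden states (phoneme-like units), 3 possible observation symbols
states = ["S1", "S2"]
start_p = np.array([0.6, 0.4])                 # initial state probabilities
trans_p = np.array([[0.7, 0.3],                # transition probabilities
                    [0.4, 0.6]])
emit_p = np.array([[0.5, 0.4, 0.1],            # emission probabilities
                   [0.1, 0.3, 0.6]])
observations = [0, 1, 2]                       # an observed feature sequence

# Viterbi: dynamic programming over the best path probability for each state
n_states, T = len(states), len(observations)
viterbi = np.zeros((n_states, T))
backptr = np.zeros((n_states, T), dtype=int)
viterbi[:, 0] = start_p * emit_p[:, observations[0]]
for t in range(1, T):
    for s in range(n_states):
        scores = viterbi[:, t - 1] * trans_p[:, s] * emit_p[s, observations[t]]
        backptr[s, t] = np.argmax(scores)
        viterbi[s, t] = np.max(scores)

# Backtrack the most likely state sequence
path = [int(np.argmax(viterbi[:, -1]))]
for t in range(T - 1, 0, -1):
    path.insert(0, backptr[path[0], t])
print("Most likely state sequence:", [states[s] for s in path])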
Language Structure and Analyzer in NLP
A Language Structure and Analyzer is a critical component within Natural Language Processing (NLP). Its goal is to break down and understand the
grammatical structure of language. This involves identifying elements like parts of speech, syntactic patterns, and sometimes even the semantic meanings
within a given text.
Overview of Language Structure
1. Phonetics and Phonology:
o Phonetics: Focuses on the physical properties of speech sounds. It studies:
▪ How sounds are produced (articulatory phonetics).
▪ How sounds travel (acoustic phonetics).
▪ How sounds are perceived (auditory phonetics).
o Phonology: Deals with how sounds function in particular languages. It studies:
▪ How sounds are mentally organized.
▪ The interaction of sounds within the language.
2. Morphology:
o Morphology: The study of the structure of words. It looks at how words are formed by smaller units called morphemes, which are the
smallest meaningful units of language.
▪ Example: The word "unhappiness" consists of three morphemes: "un-", "happy", and "-ness".
3. Syntax:
o Syntax: The study of how words combine to form sentences. It focuses on:
▪ The rules governing sentence structure.
▪ How words are arranged to form meaningful and grammatically correct sentences.
4. Semantics:
o Semantics: The study of meaning in language. It examines:
▪ Lexical Semantics: The meanings of individual words and how they relate to one another.
▪ Compositional Semantics: How the meanings of individual words combine to form the meaning of larger expressions, like phrases
or sentences.
5. Pragmatics:
o Pragmatics: The study of how context affects the interpretation of language. It explores:
▪ How speakers use language in different social contexts.
▪ How listeners interpret what speakers say based on the context.
Computational Grammar
Computational Grammar is the set of formal grammatical rules that allow computers to process and understand human languages. It plays a key role in
various Natural Language Processing (NLP) tasks such as machine translation, speech recognition, and text generation. Here’s a simple breakdown of the key
components and why they are important.
Key Components of Computational Grammar
1. Formal Grammar:
o Formal Grammar refers to a system of rules that defines how sentences and phrases are structured in a language.
o There are different types of formal grammars:
▪ Context-Free Grammar (CFG): This is a type of grammar where the rules apply independently of the context. It helps in breaking
down sentences into smaller parts, like dividing them into subject, verb, and object.
▪ Dependency Grammar: This focuses on the relationships between words. For example, in the sentence "The cat eats fish," "eats" is
the main word, and "cat" depends on "eats" as the subject, and "fish" as the object.
2. Part-of-Speech (POS) Tagging:
o POS Tagging is the process of labeling each word in a sentence with its part of speech, such as noun, verb, or adjective.
o Computers use models like Hidden Markov Models (HMMs) and neural networks to tag words based on their roles in a sentence.
3. Parsing:
o Parsing refers to analyzing the sentence structure based on grammar rules.
o Parsers are algorithms that break down sentences to identify their structure. There are different kinds of parsers, like shift-reduce parsers
and CYK parsers.
4. Syntactic Tree Representation:
o Parse Trees represent the syntactic structure of a sentence. Each word in the sentence is connected to show how the sentence is constructed
according to grammar rules.
o Dependency Trees show how individual words are related to one another. For example, in "The dog chased the cat," the word "chased" has
two dependencies: the subject "dog" and the object "cat."
Why Computational Grammar is Important
1. Accuracy in Language Processing: It helps computers understand and generate language correctly.
2. Disambiguation: It resolves ambiguities in language, like when words have multiple meanings.
3. Contextual Understanding: It allows the computer to understand the context in which words and sentences are used.
4. Language Translation: It’s crucial for translating text from one language to another while keeping the meaning intact.
5. Speech Recognition: It improves the accuracy of converting spoken language into text.
Applications of Computational Grammar
1. Machine Translation: Converting text from one language to another, like from English to Spanish.
2. Speech Recognition: Turning spoken words into written text.
3. Text-to-Speech: Generating speech from written text, useful in virtual assistants.
4. Sentiment Analysis: Determining if a text has positive, negative, or neutral emotions.
5. Information Retrieval: Finding specific information from large amounts of text, like search engines.
Example: Parsing a Sentence
Let’s take the sentence:
“The quick brown fox jumps over the lazy dog.”
1. Part-of-Speech (POS) Tagging:
o The/DT (determiner), quick/JJ (adjective), brown/JJ (adjective), fox/NN (noun), jumps/VBZ (verb), over/IN (preposition), the/DT
(determiner), lazy/JJ (adjective), dog/NN (noun).
2. Syntactic Structure (Parse Tree):
o The sentence is broken into two main parts: Noun Phrase (NP) and Verb Phrase (VP). The NP includes "The quick brown fox," and the VP
includes "jumps over the lazy dog."
o Each word is placed under its respective category, showing the hierarchical structure of the sentence.
3. Dependency Structure (Dependency Tree):
o The word "jumps" is the main verb, and the words "fox" (subject) and "dog" (object) depend on it. The other words, like adjectives and
prepositions, provide more details about the nouns.
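A short sketch of the same kind of analysis in code, assuming spaCy and its small English model en_core_web_sm are installed (the source does not prescribe a specific library).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Part-of-speech tags for each token
print([(token.text, token.tag_) for token in doc])

# Dependency relations: each token points to its syntactic head
for token in doc:
    print(f"{token.text:6s} --{token.dep_}--> {token.head.text}")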
Words and Their Analysis
Understanding words and how they function in sentences is key to processing and analyzing language, especially in fields like Natural Language Processing (NLP) and linguistics. Below is a simple explanation of key concepts related to words and their analysis.
1. Morphology
Morphology is the study of the structure and formation of words. It looks at how words are built from smaller units called morphemes.
• Morphemes are the smallest units of meaning in a word:
   o Free Morphemes: These can stand alone as words (e.g., "book," "run").
   o Bound Morphemes: These need to attach to other morphemes to make sense (e.g., "un-" in "unhappy," "-ness" in "happiness").
2. Lexical Semantics
Lexical Semantics deals with the meanings of words and how they relate to each other.
• Synonyms: Words with similar meanings (e.g., "big" and "large").
• Antonyms: Words with opposite meanings (e.g., "hot" and "cold").
• Hyponyms: Words that are specific examples of a general category (e.g., "rose" is a hyponym of "flower").
• Polysemy: A word with multiple related meanings (e.g., "bank" as a financial institution and the side of a river).
• Homonyms: Words that sound the same but have different meanings (e.g., "bat" as an animal and "bat" used in sports).
3. Part-of-Speech (POS) Tagging
Part-of-Speech Tagging is the process of assigning a label to each word in a sentence based on its grammatical role. These roles help computers and humans understand how a word functions within a sentence.
• Nouns (NN): Words that represent people, places, things, or ideas (e.g., "dog," "city").
• Verbs (VB): Words that show actions or states (e.g., "run," "is").
• Adjectives (JJ): Words that describe nouns (e.g., "quick," "blue").
• Adverbs (RB): Words that modify verbs, adjectives, or other adverbs (e.g., "quickly," "very").
• Pronouns (PRP): Words that replace nouns (e.g., "he," "they").
• Prepositions (IN): Words that show relationships between words (e.g., "in," "on").
• Conjunctions (CC): Words that connect clauses or sentences (e.g., "and," "but").
• Determiners (DT): Words that introduce nouns (e.g., "the," "a").
4. Word Formation Processes
Word Formation refers to how new words are created in a language.
• Derivation: Adding prefixes or suffixes to create new words (e.g., "happiness" from "happy").
• Compounding: Combining two words to form a new word (e.g., "notebook" from "note" and "book").
• Conversion: Changing a word's part of speech without changing its form (e.g., "to email" from "email").
• Blending: Combining parts of two words to create a new word (e.g., "brunch" from "breakfast" and "lunch").
• Acronyms: Creating a word from the initials of a phrase (e.g., "NASA" from "National Aeronautics and Space Administration").
5. Collocations
Collocations are word combinations that frequently appear together in natural language.
• Common Collocations: Words that sound natural together, like "make a decision" or "strong coffee."
• Idiomatic Expressions: Phrases where the meaning is different from the literal meaning of the words (e.g., "kick the bucket" means "to die").
6. Word Sense Disambiguation (WSD)
Word Sense Disambiguation is the process of identifying the correct meaning of a word in a specific context.
• Contextual Clues: Words around the target word help determine its meaning. For example, "bat" could mean the animal or the sports equipment depending on the sentence.
7. Lemmatization and Stemming
Both Lemmatization and Stemming are used to reduce words to their base forms for easier processing in tasks like search engines or text analysis.
• Lemmatization: Converts a word to its dictionary form (e.g., "running" becomes "run").
• Stemming: Trims words down to their root form, often by cutting off suffixes (e.g., "running" and "runner" both become "run").
8. Tokenization
Tokenization is the process of splitting text into smaller units called tokens, which can be words, phrases, or symbols. Tokenization is the first step in analyzing any text.
• Steps in Tokenization:
   1. Text Segmentation: Breaking down the text into sentences or paragraphs.
   2. Word Tokenization: Splitting sentences into individual words.
   3. Handling Special Cases: Managing punctuation, numbers, and contractions.
• Example:
   o Input Sentence: "The quick brown fox jumps over the lazy dog."
   o Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Knowledge of Language
Understanding language involves several key components, each focusing on a different aspect of how we process and produce words, sentences, and
meaning. Here's a breakdown of the important elements of language knowledge:
1. Phonology
Phonology deals with the sounds of language and how they differ between words. It focuses on understanding how sounds (phonemes) are organized and
used in speech.
• Example: In English, the words "bat" and "pat" differ by only one sound—the initial consonant. Phonology explains how changing just one sound
changes the meaning of a word.
2. Morphology
Morphology is about how words are built from smaller units called morphemes. A morpheme is the smallest unit of meaning in a language, such as prefixes,
roots, and suffixes.
• Example: The word "unhappiness" consists of three morphemes:
o "un-" (prefix meaning "not")
o "happy" (root word)
o "-ness" (suffix indicating a state or condition)
3. Syntax
Syntax is the study of sentence structure. It examines how words combine to form grammatically correct sentences and determines the relationships between
words in a sentence.
• Example: In the sentence "The dog chased the cat," syntax explains how the words are ordered to convey that the dog is doing the chasing, not the
cat.
4. Semantics
Semantics focuses on the meaning of words and how these meanings combine in sentences to convey a complete thought. It's about the meaning of
sentences without considering the context in which they are used.
• Example: In the sentence "The cat sat on the mat," semantics tells us what the sentence means literally—there's a cat, and it's sitting on a mat.
5. Pragmatics
Pragmatics goes beyond the literal meaning of words and looks at how language is used in different situations. It deals with how context affects the
interpretation of sentences.
• Example: The sentence "Could you shut the window?" is phrased like a question, but pragmatically, it is a polite request rather than an actual
question.
6. Discourse
Discourse is about understanding how sentences relate to each other in a conversation or text. It looks at how the meaning of a sentence is influenced by
the sentences before or after it.
• Example: In a conversation, if someone says, "John was late again today," and the next sentence is "He always oversleeps," we know "He" refers to
John because of discourse context.
7. World Knowledge
World knowledge refers to the general knowledge and understanding of the world that helps people interpret language. It involves knowing about social
norms, facts, and shared beliefs.
• Example: If someone says, "It’s raining cats and dogs," you understand that this is an idiom meaning heavy rain, based on your knowledge of common
expressions.
Argmax Based Computations in NLP (PYQ)
The argmax function is crucial in NLP tasks, used to determine the index of the maximum value in an array or tensor, often translating probabilistic outputs
into definitive predictions.
Mathematical Formulation
The argmax function identifies the index of the maximum value in a given array: for an array a with elements a_i, argmax_i a_i is the index i at which a_i takes its largest value.
Properties of Argmax
• Uniqueness: Returns a unique index if the maximum value is unique; may vary in the case of ties.
• Non-linearity: Non-differentiable, complicating gradient-based optimization (often mitigated using the softmax function).
• Composability: Can be combined with other functions for complex decision-making pipelines.
Code:
import numpy as np
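# NOTE: the original snippet ends after the import; the lines below are an
# assumed toy continuation showing argmax over a probability vector.
probs = np.array([0.1, 0.6, 0.3])            # e.g., softmax output over three classes
predicted_index = np.argmax(probs)           # index of the highest-probability class
print("Predicted index:", predicted_index)   # -> 1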
Morphological Analysis
Morphological analysis is the study and process of analyzing the structure and formation of words. It involves breaking down words into their smallest
meaningful units, known as morphemes. Morphemes can be classified as either free morphemes (which can stand alone as words, like "book") or bound
morphemes (which need to be attached to other morphemes to convey meaning, like prefixes or suffixes, such as "un-" or "-ed").
The purpose of morphological analysis in Natural Language Processing (NLP) is to understand the internal structure of words and how they change based on
grammatical rules. It plays a crucial role in tasks such as text analysis, language modeling, and machine translation. Through morphological analysis, NLP
systems can better handle variations of a word (e.g., "run" vs. "running") and improve their understanding of context.
Key Points:
• Morphemes: Smallest units of meaning in a language, either free (standalone words) or bound (prefixes/suffixes).
• Inflectional Morphology: Deals with variations of words to express tense, case, gender, etc., without changing the core meaning (e.g., "walk" to
"walked").
• Derivational Morphology: Changes the meaning of a word by adding morphemes (e.g., "happy" to "unhappy").
• Applications: Used in NLP tasks such as part-of-speech tagging, spell checking, and machine translation to enhance understanding of language.
• Challenges: Handling irregular word forms and languages with rich morphology, where the number of inflections is large (e.g., Turkish or Finnish).
Stemming and Lemmatization (PYQ)
Stemming is a rule-based process that removes suffixes or prefixes from words to produce their stem form. This method often leads to non-dictionary words,
as it focuses on removing affixes based on predefined rules. For example, "running," "runner," and "ran" may all be stemmed to "run." While stemming is
typically faster and simpler, it may sacrifice accuracy and clarity since it doesn't always produce valid root words.
Lemmatization, on the other hand, involves a more sophisticated approach that considers the context and meaning of a word. It reduces words to their base
form, or lemma, by analysing the word's intended meaning and its grammatical role in a sentence. For instance, "better" would be lemmatized to "good,"
which is a valid dictionary entry. This method requires more computational resources and linguistic knowledge but yields more accurate and meaningful
results.
Stemming focuses on reducing words to their root form by simply removing prefixes and suffixes. This can sometimes result in non-standard or incorrect forms.
Example words:
   o Caring → Car
   o Loveliness → Lovel
   o Happiness → Happi
   o Running → Run
   o Unhappily → Unhappili
   o Faster → Fast
   o Studies → Studi
Lemmatization, on the other hand, removes affixes based on context and converts the word to its valid root form. This method uses vocabulary and morphological analysis to ensure the resulting word is valid.
Example words:
   o Caring → Care (verb lemma: the base verb)
   o Loveliness → Lovely (adjective lemma: the base adjective)
   o Happiness → Happy (adjective lemma: the base adjective)
   o Running → Run (verb lemma: the base verb)
   o Unhappily → Unhappy (adjective lemma: the base adjective)
   o Faster → Fast (adjective lemma: the base form of the adjective)
   o Studies → Study (noun lemma: the base noun)
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Sample sentence
sentence = "The cats are running faster than the dogs."
# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]
print("Stemming:", stems)
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatization:", lemmas)
Output:
Stemming: ['The', 'cat', 'are', 'run', 'faster', 'than', 'the', 'dog', '.']
Lemmatization: ['The', 'cat', 'are', 'running', 'faster', 'than', 'the', 'dog', '.']
Sentiment Analysis Code:
from textblob import TextBlob
# Sample sentence
sentence = "I love sunny days, but I hate the rain."
# Sentiment Analysis
analysis = TextBlob(sentence)
print("Sentiment Polarity:", analysis.sentiment.polarity)
print("Sentiment Subjectivity:", analysis.sentiment.subjectivity)
Output:
import numpy as np
# Sample dataset
X = np.array([[0], [1], [2], [3]])
y = np.array([[0], [1], [2], [3]])
# The model definition and training step are missing in the source; a simple
# linear regression is assumed here purely for illustration
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
# Prediction
print("Prediction for input 4:", model.predict([[4]]))
Output (approximate):
• The model approximates the value for the input 4, as the training data follows a linear trend.