
Encoders and decoders

An encoder-decoder is a neural network architecture commonly used in sequence-to-sequence (Seq2Seq) models, particularly in tasks involving natural
language processing (NLP) and machine translation.

A Seq2Seq model is a kind of machine learning model that takes sequential
data as input and generates sequential data as output.

Encoder Block
The main purpose of the encoder block is to process the input sequence and
capture information in a fixed-size context vector.
• The input sequence is put into the encoder.
• The encoder processes each element of the input sequence using
neural networks (or transformer architecture).
• Throughout this process, the encoder keeps an internal state, and the
ultimate hidden state functions as the context vector that encapsulates
a compressed representation of the entire input sequence. This
context vector captures the semantic meaning and important
information of the input sequence.
The final hidden state of the encoder is then passed as the context vector to the decoder.

Decoder Block
The decoder block is similar to the encoder block. The decoder processes the context vector from the encoder to generate the output sequence incrementally.
• In the training phase, the decoder receives both the context vector and the desired target output sequence (ground truth).
• During inference, the decoder relies on its own previously generated outputs as inputs for subsequent steps.
The decoder uses the context vector to comprehend the input sequence and create the corresponding output sequence. It engages in autoregressive
generation, producing individual elements sequentially. At each time step, the decoder uses the current hidden state, the context vector, and the previous
output token to generate a probability distribution over the possible next tokens. The token with the highest probability is then chosen as the output, and
the process continues until the end of the output sequence is reached.
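To make the data flow concrete, the following is a minimal, illustrative PyTorch sketch of an encoder and a decoder built around a GRU. The vocabulary size, dimensions, dummy input, and start-token id (1) are assumptions for demonstration only, and training (e.g., teacher forcing) is omitted.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                      # src: (batch, src_len) of token ids
        _, hidden = self.gru(self.embed(src))
        return hidden                            # final hidden state = context vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=32, hid_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, token, hidden):            # one decoding step at a time
        output, hidden = self.gru(self.embed(token), hidden)
        return self.out(output), hidden          # logits over next token, updated state

# Greedy autoregressive decoding from an assumed start-of-sequence token (id 1)
encoder, decoder = Encoder(vocab_size=100), Decoder(vocab_size=100)
src = torch.randint(0, 100, (1, 5))              # dummy input sequence
hidden = encoder(src)                            # context vector from the encoder
token = torch.tensor([[1]])
for _ in range(5):
    logits, hidden = decoder(token, hidden)
    token = logits.argmax(dim=-1)                # pick the most probable next token
    print(token.item())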

Advantages:
1. Handling Sequential Data: Well-suited for tasks involving natural language, speech, and time series data.
2. Context Understanding: The encoder-decoder architecture captures the input sequence's context for generating accurate outputs.
3. Attention Mechanism: Improves performance for long input sequences by allowing the model to focus on specific parts of the input during output
generation.
Disadvantages:
1. High Computational Cost: Training Seq2Seq models demands significant computational resources and can be challenging to optimize.
2. Limited Interpretability: Understanding the reasoning behind the model's decisions is difficult.
3. Overfitting: Without proper regularization, these models can overfit training data, leading to poor generalization.
4. Rare Word Handling: Struggle with rare or unseen words in the training data.
5. Long Input Sequences: Context vectors may fail to capture all the information in very long input sequences, impacting performance.
Applications of Seq2Seq model
1. Text Summarization: The Seq2Seq model effectively understands the input text, which makes it suitable for news and document summarization.
2. Speech Recognition: Seq2Seq models, especially those with attention mechanisms, excel at processing audio waveforms for ASR and capture spoken language patterns effectively.
3. Image Captioning: Seq2Seq models integrate image features from CNNs with text generation capabilities for image captioning, and can describe images in a human-readable format.
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the relevance of a word to a document in a collection (or
corpus) of documents. It balances two factors: how frequently a term appears in a specific document (Term Frequency, TF) and how rare the term is across
the entire corpus (Inverse Document Frequency, IDF). This ensures that commonly used words (e.g., "the," "and") are given less weight, while more unique
words are emphasized.

1. Term Frequency (TF):


o Measures how often a word (term t) appears in a document (d).
o Formula: TF(t, d) = (count of t in d) / (total words in d)
o Higher frequency indicates higher relevance to that specific document.
2. Document Frequency (DF):
o Counts the number of documents in the corpus where a term t appears.
o Formula: DF(t) = number of documents containing t
o A term that appears in many documents is less significant since it is more "generic."
3. Inverse Document Frequency (IDF):
o Adjusts the weight of a term based on how unique it is across the corpus.
o Formula: IDF(t) = log(N / DF(t))
Here, N is the total number of documents in the corpus.
o Words that appear in fewer documents receive a higher weight because they are more specific to certain documents.
4. TF-IDF Calculation:
o Combines TF and IDF to assign a weight to each term t in document d:
TF-IDF(t, d) = TF(t, d) × IDF(t)
o Higher TF-IDF scores indicate that the term is important in the document but rare in the corpus.

TF-IDF assigns high scores to words that are frequent in a document but rare in the overall corpus. Common words have low scores, making TF-IDF effective
for distinguishing important terms.
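As a quick illustration, the following sketch computes TF-IDF vectors with scikit-learn's TfidfVectorizer on a made-up three-document corpus. Note that scikit-learn uses a smoothed IDF and L2 normalization, so the scores differ slightly from the raw formulas above.

from sklearn.feature_extraction.text import TfidfVectorizer

# A made-up corpus of three short documents
docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "dogs and cats are pets"]

# Build TF-IDF weighted vectors for each document
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", tfidf_matrix.toarray().round(2))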

Why TF-IDF is Useful

1. Captures the Importance of Words:


Words that are common in a single document but rare across the corpus are considered more significant, so TF-IDF helps identify unique terms that
better represent a document.

2. Focuses on Distinctive Words:


Common stopwords like "the", "and", or "is" are likely to appear in every document, making them irrelevant for distinguishing documents. TF-IDF
downscales their importance, focusing on words that are more informative.

3. Enhances Search and Information Retrieval:


When searching through a corpus, TF-IDF helps to rank documents by relevance based on the terms they contain. Documents that have high TF-IDF
scores for a given search query term are more likely to be relevant.

4. Text Classification:
TF-IDF is also used in machine learning models for text classification, where words that are unique to a particular class or category help the model
identify the correct label for a document.
Rule Based Approach
There are three types of NLP approaches:

1. Rule-based Approach – Based on linguistic rules and patterns


2. Machine Learning Approach – Based on statistical analysis
3. Neural Network Approach – Based on various artificial, recurrent, and convolutional neural network algorithms

The rule-based approach is one of the foundational methods in Natural Language Processing (NLP). It relies on predefined linguistic rules and patterns to
analyze, process, and understand text data. Unlike machine learning-based methods, which learn patterns from data, rule-based approaches explicitly
encode domain-specific knowledge through logical rules, grammar structures, or pattern matching.

The rule-based approach involves defining a set of rules to capture patterns in text and applying them to perform specific NLP tasks such as text
classification, entity recognition, or information extraction. These rules are often manually created by experts based on domain knowledge and linguistic
principles.

Steps in the Rule-Based Approach


1. Rule Creation:
Based on the desired tasks, domain-specific linguistic rules are created. These
rules can involve:
o Grammar rules to understand sentence structure.
o Syntax patterns to identify relationships between words.
o Semantic rules to analyse meaning.
o Regular expressions for matching text patterns.
2. Rule Application:
The predefined rules are applied to input text to identify patterns or structures. For instance:
o Extracting dates using a regular expression: \d{2}/\d{2}/\d{4} to match "12/05/2023" (a code sketch of this rule appears after these steps).
o Identifying noun phrases using syntax patterns.
3. Rule Processing:
Based on the matched patterns, the system processes the data to:
o Extract relevant information (e.g., extracting names or addresses).
o Perform text classification (e.g., categorizing emails as spam or not).
4. Rule Refinement:
As the system is used, rules are iteratively refined based on feedback and errors observed in performance. This step improves accuracy and ensures
the rules remain relevant as data or requirements evolve.
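The sketch below illustrates the date-extraction rule from step 2 together with a toy keyword rule for spam classification; the patterns and keywords are illustrative, not an exhaustive rule set.

import re

# Illustrative input containing a date and some spam-like keywords
text = "Meeting moved to 12/05/2023. WIN a FREE prize now!"

date_pattern = r"\d{2}/\d{2}/\d{4}"          # rule: a date looks like dd/mm/yyyy
spam_keywords = {"win", "free", "prize"}     # rule: these words suggest spam

dates = re.findall(date_pattern, text)       # rule application: information extraction
words = re.findall(r"[a-z]+", text.lower())
is_spam = any(word in spam_keywords for word in words)   # rule application: classification

print("Dates found:", dates)                 # ['12/05/2023']
print("Flagged as spam:", is_spam)           # True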

Advantages of the Rule-Based Approach


1. Interpretability:
Rules are human-readable and can be easily understood and modified.
2. Domain-Specific Precision:
Tailored rules allow for high precision in specific domains where linguistic patterns are well-understood.
3. Resource Efficiency:
No need for large datasets or training, unlike machine learning methods.

Machine learning approaches use data to automatically learn patterns and adapt to new examples, while rule-based systems require manual definition of
patterns. Rule-based methods are more interpretable but less flexible.

While modern methods like machine learning and neural networks dominate NLP, rule-based systems remain relevant in resource-constrained settings or
for niche applications where explicit knowledge and control are critical.
Probabilistic Model in Machine Learning
Probabilistic models are a class of machine learning algorithms that use probability theory to model the uncertainty and variability in data. These models
rely on the principles of statistics and probability to make predictions, infer patterns, and reason under uncertainty. Instead of making deterministic
decisions, probabilistic models provide a measure of confidence or likelihood for their predictions.

Probabilistic models represent data as distributions rather than fixed values. They assume that the observed data is generated by some underlying
probability distribution. By learning these distributions, the models can predict outcomes, assess uncertainty, and even handle missing or noisy data.

Probabilistic models are used in a variety of machine learning tasks such as classification, regression, clustering, and dimensionality reduction. Some
popular probabilistic models include:

1. Naive Bayes Classifier:


Assumes features are independent given the class label. It is used for tasks like spam detection and text classification (a short code sketch follows this list).

2. Hidden Markov Models (HMM):


Used for modelling sequential data such as speech, text, or biological sequences. HMMs describe a system with hidden states transitioning
probabilistically.

3. Gaussian Mixture Models (GMM):


Models data as a mixture of multiple Gaussian distributions. Commonly used for clustering and density estimation.

4. Bayesian Networks:
Represent probabilistic dependencies among variables using a directed acyclic graph. Useful for reasoning and decision-making under uncertainty.

5. Markov Models:
Use the Markov property, assuming the future state depends only on the current state. Often used in time series analysis.
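To illustrate the Naive Bayes classifier from item 1 above, here is a minimal scikit-learn sketch for spam detection; the training texts and labels are invented for demonstration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = spam, 0 = not spam)
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer", "project report attached"]
labels = [1, 0, 1, 0]

# Bag-of-words features followed by a Multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize"]))          # predicted class label
print(model.predict_proba(["claim your free prize"]))    # class probabilities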

Applications
1. Natural Language Processing (NLP): Text classification with Naive Bayes
2. Computer Vision: Image segmentation using probabilistic graphical models.
3. Finance: Risk assessment and decision-making under uncertainty.
4. Healthcare: Predictive modelling for patient diagnosis and treatment outcomes.

Probabilistic models offer a principled way to deal with uncertainty and variability in data. While they can be computationally intensive, their ability to
provide interpretable results and manage uncertainty makes them invaluable in many applications, from recommendation systems to scientific research.
Part-of-Speech (POS) Tagging in NLP
Part-of-Speech (POS) tagging is a fundamental task in Natural Language Processing (NLP) that involves
identifying and labelling the grammatical category of each word in a sentence. These categories, known as
parts of speech, include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and
others. The process is essential for understanding the structure and meaning of text, making it a critical step
in tasks like text analysis, machine translation, and information extraction.

How POS Tagging Works


1. Input Sentence: The text is tokenized into words (tokens).
2. Feature Extraction: Each word is analyzed using its surrounding context, position in the sentence,
and inherent properties like suffix or prefix.
3. Tagging: Based on linguistic rules or machine learning algorithms, the words are assigned their
corresponding POS tags (e.g., NN for noun, VB for verb, JJ for adjective).

Approaches to POS Tagging
1. Rule-Based Approaches:
o Uses predefined grammatical rules and dictionaries.
o Example: If a word ends in "-ing," it is likely a verb.
o Simple but can be error-prone for ambiguous words.
2. Statistical Approaches:
o Based on probabilities calculated from labeled corpora (e.g., the Brown Corpus).
o Example: Hidden Markov Models (HMMs) predict tags based on the likelihood of word sequences.
o Handles ambiguity better than rule-based methods.
3. Machine Learning Approaches:
o Uses supervised learning models trained on annotated datasets.
o Algorithms like Conditional Random Fields (CRFs) or deep learning models (e.g., LSTMs) are commonly used.
o Highly accurate and adaptable to various languages and domains.

Challenges
1. Ambiguity: Words with multiple meanings or roles (e.g., "run" as a verb or noun) can confuse models.
2. Out-of-Vocabulary Words: New or unknown words pose difficulties, especially in rule-based systems.
3. Context Dependence: POS tagging accuracy relies on understanding the context of sentences, which can vary across domains.

Applications of POS Tagging in NLP
1. Syntactic Parsing: Understanding grammatical structures in sentences.
2. Named Entity Recognition (NER): Identifying entities like people, organizations, and locations in text.
3. Information Retrieval: Enhancing search engines and document retrieval systems.
4. Text Summarization: Condensing large texts into shorter, meaningful summaries.
5. Machine Translation: Translating text between languages.

Example Sentence: John eats an apple daily.
POS Tags:
• John: Noun (NOUN)
• eats: Verb (VERB)
• an: Article (DET)
• apple: Noun (NOUN)
• daily: Adverb (ADV)
Code:
from nltk import word_tokenize, pos_tag
# First-time use may require: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Perform POS tagging and print the (word, tag) pairs
pos_tags = pos_tag(tokens)
print(pos_tags)
Named Entity Recognition
Named Entity Recognition (NER) is a technique in natural language processing (NLP) that focuses on
identifying and classifying entities. The purpose of NER is to automatically extract structured information
from unstructured text, enabling machines to understand and categorize entities in a meaningful manner
for various applications.

How NER Works

1. Tokenization: The text is split into smaller units, like words or phrases.
o Example: "Apple Inc. announced new iPhones in California on September 12, 2023."
Tokens: [Apple, Inc., announced, new, iPhones, in, California, on, September, 12, 2023].
2. Entity Detection: The model identifies potential named entities within the tokens.
o Example: [Apple Inc., California, September 12, 2023].
3. Entity Classification: Each detected entity is assigned a category.
o Example:
▪ Apple Inc. → Organizations
▪ California → Location
▪ September 12, 2023 → Date
Techniques for NER
NER systems are developed using different approaches:
1. Rule-Based Systems:
o Use predefined linguistic patterns or rules to identify entities.
Example: A rule may specify that any word after "Mr." or "Dr." is likely a person's name.
2. Machine Learning Models:
o Use annotated data to train models that learn to recognize entities. Examples include Hidden Markov Models (HMMs) and Conditional
Random Fields (CRFs).
3. Deep Learning Models:
o Leverage neural networks like Recurrent Neural Networks (RNNs), Bi-directional LSTMs (BiLSTMs), or Transformers (e.g., BERT) to identify
and classify entities.
Code:
import spacy

# Load spaCy's pre-trained English model
# (install once with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Sample text
text = "Barack Obama was born in Hawaii and served as President of the United States."

# Process the text
doc = nlp(text)

# Extract named entities with their labels
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")
Output:
Barack Obama – PERSON, Hawaii – GPE (Geo-Political Entity), United States – GPE (Geo-Political Entity)
Bag of Words (BoW)
Whenever we apply any algorithm in NLP, it works on numbers; we cannot feed raw text directly into an algorithm. The Bag of Words model is therefore used to preprocess text by converting it into a "bag" of its words, keeping a count of how often each word occurs.

Key Concepts of Bag of Words (BoW):
1. Tokenization:
o Split the text into individual words or tokens.
o Example: "I love NLP" → ["I", "love", "NLP"]
2. Vocabulary Creation:
o Create a list of unique words from all documents in the dataset.
o Example for texts:
▪ Text 1: "I love NLP"
▪ Text 2: "NLP loves Python"
▪ Vocabulary: ["I", "love", "NLP", "loves", "Python"]
3. Vectorization:
o Represent each text as a numerical vector based on the vocabulary.
o Each position in the vector corresponds to a word in the vocabulary, and the value is the word’s frequency in the text.
o Example:
▪ For Text 1: [1, 1, 1, 0, 0]
▪ For Text 2: [0, 0, 1, 1, 1]
4. Fixed-Length Representation:
o Regardless of the text length, each document is converted into a vector of fixed size (equal to the vocabulary size).

Disadvantages:
• Ignores Context: Word order and semantics are lost, so it cannot differentiate between "not good" and "good."
• High Dimensionality: Large vocabularies result in sparse and high-dimensional vectors.
• Sensitivity to Frequent Words: Common words (e.g., "the," "is") dominate, though techniques like TF-IDF can mitigate this.

Code:
from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences
texts = ["I love NLP", "NLP loves Python"]

# Create a CountVectorizer instance
# (the custom token_pattern keeps single-character tokens such as "i",
#  which the default pattern would drop)
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")

# Fit and transform the texts to create the bag-of-words matrix
bow_matrix = vectorizer.fit_transform(texts)

# Display the vocabulary and bag-of-words matrix
print("Vocabulary:", vectorizer.get_feature_names_out())
print("Bag of Words Matrix:\n", bow_matrix.toarray())

Output:
Vocabulary: ['i', 'love', 'loves', 'nlp', 'python']
Bag of Words Matrix:
[[1 1 0 1 0]
[0 0 1 1 1]]
Word Embeddings in NLP
Word embeddings are a technique in Natural Language Processing (NLP) to represent words in a continuous vector space. Unlike traditional methods like
Bag of Words or TF-IDF, which treat words as discrete symbols, word embeddings assign each word a dense vector that captures its meaning and semantic
relationships with other words.

Word embeddings are dense vector representations of words in a continuous space, where semantically similar words are positioned closer together.

Need for Word Embedding?


• To reduce dimensionality
• To use a word to predict the words around it.
• Inter-word semantics must be captured.

Key Concepts of Word Embeddings in NLP


1. Continuous Vector Representation:
o Each word is represented as a dense vector in a continuous, high-dimensional space.
o These vectors encode semantic and syntactic properties of words, capturing relationships between them.
2. Semantic Similarity:
o Words with similar meanings are mapped closer together in the vector space.
o For instance, the embeddings for "king" and "queen" or "dog" and "cat" will have smaller distances than unrelated words like "dog" and
"car."
3. Dimensionality:
o The embeddings reduce the dimensionality of representing words compared to traditional one-hot encoding or sparse vectors.
o Common dimensions range from 50 to 300, balancing computational efficiency and expressiveness.
4. Contextual Relationships:
o Word embeddings capture contextual relationships. For example, embeddings can distinguish between "bank" in "river bank" and "money
bank" using context-based models.
5. Mathematical Operations:
o Embeddings allow meaningful operations such as:
king − man + woman ≈ queen
o This demonstrates how embeddings capture analogies and relationships between words.
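A minimal sketch of training word embeddings with gensim's Word2Vec is shown below. The toy corpus is far too small to produce meaningful analogies; it only demonstrates the API for similarity and analogy-style queries.

from gensim.models import Word2Vec

# A toy corpus of tokenized sentences (far too small for good embeddings)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
    ["the", "dog", "chases", "the", "cat"],
]

# Train 50-dimensional Word2Vec embeddings
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100, seed=42)

# Cosine similarity between two word vectors
print(model.wv.similarity("king", "queen"))

# Analogy-style query: king - man + woman ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))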
What is NLP
Natural Language Processing (NLP) is a field of artificial intelligence (AI) that focuses on the interaction between
computers and human language. It involves enabling machines to understand, interpret, and generate human
language in a way that is both meaningful and useful. NLP aims to bridge the gap between human
communication and computer understanding.
NLP can be broadly divided into two main areas: Natural Language Understanding (NLU) and Natural Language
Generation (NLG).

1. Natural Language Understanding (NLU)


NLU focuses on enabling machines to understand and extract meaningful information from text or speech. It
involves interpreting human language inputs in a way that allows the system to understand the intent,
context, and entities involved.
Key tasks in NLU:
• Tokenization
• Part-of-Speech Tagging (POS Tagging)
• Named Entity Recognition (NER)
• Sentiment Analysis
• Syntax and Parsing
• Word Sense Disambiguation (WSD)
• Coreference Resolution
2. Natural Language Generation (NLG)
NLG is concerned with the generation of human-like text or speech from structured data or inputs. It involves taking a machine's output and
transforming it into coherent, fluent, and contextually appropriate language.
Key tasks in NLG:
• Text Generation
• Paraphrasing
• Machine Translation
• Dialogue Systems (Chatbots)
Relationship Between NLU and NLG
• NLU enables the computer to understand the input (i.e., what the user says or writes), while NLG focuses on generating meaningful responses or
outputs.
• Both NLU and NLG work together in many NLP applications, such as conversational AI systems, where understanding the user's intent (NLU) and
providing a relevant response (NLG) are key.
• NLU is more difficult than NLG because understanding human language requires dealing with ambiguities, context, and complex structures that
aren't always straightforward. While generating language is complex, interpreting it is harder due to the rich variety and subtleties in natural
language.

Applications of NLP:
1. Question Answering: Systems that automatically answer questions posed in natural language.
2. Spam Detection: Identifying unwanted emails in a user's inbox.
3. Sentiment Analysis: Analyzing the mood, behavior, and emotional state of the sender by classifying text as positive, negative, or neutral.
4. Machine Translation: Translating text or speech from one language to another (e.g., Google Translate).
5. Spelling Correction: Correcting spelling errors in word processors like MS Word.
6. Speech Recognition: Converting spoken words into text for various applications, such as dictation or voice commands.
7. Chatbots: Automated systems providing customer service via text-based conversation.
8. Information Extraction: Extracting structured data from unstructured or semi-structured documents.
9. Natural Language Understanding (NLU): Converting natural language text into formal representations, like logic structures, for easier processing by computers.

Challenges:
1. Ambiguity: Words and sentences can have multiple meanings.
• Lexical Ambiguity: A word can have different meanings (e.g., "bank" as a financial institution or a riverbank).
• Syntactic Ambiguity: A sentence can be parsed in different ways (e.g., "He saw the man with a telescope" could mean either the man has a telescope or he used a telescope to see the man).
• Contextual Ambiguity: The meaning can depend on context (e.g., the meaning of pronouns like "he" or "this" depends on the surrounding sentences).
Five phases of NLP:
The five phases of Natural Language Processing (NLP) form a structured approach to understanding and processing human language, enabling machines to
interpret and manipulate text or speech. These phases represent different levels of language analysis, from basic word recognition to understanding the
deeper meaning and context of a given text. Below is an overview of each phase:

1. Lexical Analysis

Lexical analysis reads the input and breaks it down into meaningful elements called tokens. These tokens become the building blocks for the later phases of analysis.
The output of the lexical analysis phase is a stream of tokens that can be more easily processed by the syntax analyzer, which is responsible for checking the input for correct syntax and structure.
The actual text or character sequences that form the tokens are called lexemes.
In programming languages, lexical analysis helps in breaking down the source code so that the compiler or interpreter can understand and process it.
In NLP, lexical analysis is the first step in processing natural language text for tasks such as part-of-speech tagging, named entity recognition, and text
parsing.
Example: In the sentence "The cat sat on the mat," the tokens would be "The", "cat", "sat", "on", "the", "mat".

2. Syntactic Analysis
Syntactic Analysis, also known as parsing, is the process of analyzing the structure of a sentence or expression to determine its grammatical structure
according to a formal grammar. It is a crucial step in both Natural Language Processing (NLP) and compilers for programming languages, where the goal is
to understand how the tokens (from lexical analysis) fit together according to the rules of syntax.

Syntax analysis (parsing) is the second phase of the compilation process, following lexical analysis. It takes the tokens generated by the lexical analyzer and
attempts to build a Parse Tree or Abstract Syntax Tree (AST), representing the program’s structure. During this phase, the syntax analyzer checks whether
the input string adheres to the grammatical rules of the language using context-free grammar. If the syntax is correct, the analyzer moves forward;
otherwise, it reports an error.

The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of the source code, which is a hierarchical representation of the source
code that reflects the grammatical structure of the program.
Example: The cat chases the mouse in the garden

• Sentence(S)
• Noun Phrase (NP)
• Determiner (Det)
• Verb Phrase (VP)
• Prepositional Phrase (PP)
• Verb(V)
• Noun(N)

In short, syntactic analysis checks grammar and word arrangement, and shows the relationships among the words.
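As an illustration, the sketch below parses the example sentence with NLTK's chart parser using a small hand-written context-free grammar; the grammar is a toy assumption, not a complete grammar of English.

import nltk
from nltk import CFG

# A toy grammar covering only the example sentence
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP PP | V NP
PP -> P NP
Det -> 'the'
N -> 'cat' | 'mouse' | 'garden'
V -> 'chases'
P -> 'in'
""")

parser = nltk.ChartParser(grammar)
sentence = "the cat chases the mouse in the garden".split()

# Print every parse tree licensed by the grammar
for tree in parser.parse(sentence):
    tree.pretty_print()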

3. Semantic Analysis

Semantic Analysis is the process of extracting and understanding the meaning from a sentence or text. While syntactic analysis focuses on the structure of
sentences, semantic analysis focuses on interpreting the meaning behind the words and their relationships. It seeks to resolve ambiguities and determine
the precise meaning of phrases and sentences in context.

Semantic analysis in NLP can be divided into two main parts, Lexical Semantic Analysis and Compositional Semantic Analysis, each focusing on
understanding meaning at different levels.

1. Lexical Semantic Analysis


• Focus: Understanding the meaning of each word individually.
• Goal: Retrieve the dictionary meaning or contextual significance of each word in the text.
• Example: For the word "bank", lexical semantics identifies its meaning as either a financial institution or the side of a river, based on context.

2. Compositional Semantic Analysis


• Focus: Understanding how the meanings of individual words combine to form the meaning of an entire sentence or phrase.
• Goal: Interpret how word combinations and syntactic structures influence the overall meaning.
• Example:
o "Students love GeeksforGeeks" means that students have affection for GeeksforGeeks.
o "GeeksforGeeks loves Students" means that GeeksforGeeks has affection for students.
Even though both sentences use the same words (students, love, GeeksforGeeks), their meaning changes due to the order and syntactic
roles of the words.

Key elements of semantic analysis include various types of word relationships, which help in deciphering the underlying meaning of text. Here’s an
explanation of each element:

1. Hyponymy
• Definition: Refers to terms that are specific instances of a general category (hypernym).
• Analogy: Class-object relationship.
• Example: Color is the hypernym, while grey, blue, and red are its hyponyms.

2. Homonymy
• Definition: Words with the same spelling but completely different meanings.
• Example: Rose can mean "a flower" or "the past tense of rise."

3. Synonymy
• Definition: Words with similar or identical meanings.
• Example: (Job, Occupation), (Large, Big), (Stop, Halt).

4. Antonymy
• Definition: Pairs of words with opposite meanings.
• Example: (Day, Night), (Hot, Cold), (Large, Small).

5. Polysemy
• Definition: Words with the same spelling that have multiple closely related meanings.
• Difference from Homonymy: In polysemy, meanings are related; in homonymy, they are not.
• Example: Man can mean:
o "Human species"
o "Male human"
o "Adult male human"

In short, Semantic analysis is concerned with the meaning representation. It mainly focuses on the literal meaning of words, phrases, and sentences.

4. Discourse Integration (Contextual Analysis)


Discourse Integration in NLP refers to the process of understanding the relationships between sentences and how they come together to form a coherent
and meaningful discourse. This involves integrating information across sentences to maintain consistency and continuity throughout a text. Key components
of discourse integration include:
1. Coherence: Ensuring that the text makes sense as a whole. The sentences must connect logically, with ideas flowing from one to the next.
2. Anaphora Resolution: Resolving references to earlier parts of the text, especially pronouns (e.g., "he," "she," "it") and other referring expressions
(e.g., "this," "that"). This helps in determining what these terms refer to in the broader discourse.
Example Text: "Tom was hungry. He went to the kitchen to make a sandwich."
In this example:
• The first sentence introduces "Tom" and his state (hungry).
• The second sentence connects to the first by using the pronoun "He," which refers to "Tom" (this is anaphora resolution).
• The two sentences are connected logically because the action (going to the kitchen) makes sense after the statement about being hungry. This
maintains coherence in the discourse.

In short, Discourse integration is the process of connecting and interpreting different parts of a text to maintain coherence and meaning across sentences
and paragraphs.
5. Pragmatic analysis
Pragmatic analysis deals with understanding the intended meaning behind words or sentences based on the situation, social norms, or cultural context.
This step goes beyond literal meanings and interprets implied meanings.
Pragmatic analysis refers to understanding the intended meaning of a sentence in context, beyond its literal meaning. It involves interpreting how language
is used in real-world situations, considering factors like the speaker's intentions, the relationship between the speaker and listener, and the context of the
conversation.
Example, in the sentence "Can you pass the salt?" the literal meaning is a question about someone's ability to pass the salt, but pragmatically, it's a request
for the salt.
In short, it uses real-world knowledge and context to derive the actual meaning of the sentence.

Conclusion:
1. Lexical Analysis: Breaking down the text into tokens (words, punctuation, etc.) and identifying the basic components of the language.
2. Syntactic Analysis: Analyzing the grammatical structure of sentences, identifying parts of speech, and constructing a syntax tree to understand the
relationships between words.
3. Semantic Analysis: Extracting meaning from sentences by interpreting the meanings of words and how they combine to form the overall meaning of the
text.
4. Discourse Integration: Linking together different sentences or phrases within a text, understanding how previous and subsequent sentences contribute
to the overall meaning of the discourse.
5. Pragmatic Analysis: Understanding the context and intent behind the text, including resolving ambiguities and interpreting indirect meanings (e.g.,
recognizing sarcasm or requests).
Dataset Preparation for NLP Applications
1. Data Collection
Data can be collected from various sources, including text files, web scraping with tools like BeautifulSoup, APIs from platforms like Twitter, and public
datasets from Kaggle or UCI. For instance, customer reviews are often gathered for sentiment analysis.
2. Text Cleaning and Preprocessing
• Tokenization splits text into words or sentences. For example, "I love NLP!" becomes ["I", "love", "NLP", "!"].
• Lowercasing converts text to uniform lowercase, e.g., "I Love NLP" to "i love nlp."
• Removing punctuation eliminates unnecessary marks; "Hello, World!" turns into "Hello World."
• Removing stopwords filters out common words, like turning "This is a sample sentence" into "sample sentence."
• Stemming or lemmatization reduces words to their root forms, so "running," "runs," and "ran" all become "run."
• Handling numbers requires decisions on removal or conversion, as in changing "In 2023, there were 5000 participants" to "In year there were
participants."
• Handling misspellings corrects errors with libraries; "Ths is a smple text" becomes "This is a simple text."
• Removing HTML tags cleans web-scraped data, changing <p>This is a paragraph.</p> to "This is a paragraph."
3. Text Normalization
Text normalization includes expanding contractions, such as "I'm" to "I am," and handling special characters or emojis.
4. Feature Engineering
Feature engineering techniques include Bag of Words (BoW) for word counts, TF-IDF for word importance, and word embeddings like Word2Vec. n-Grams capture sequences of words.
5. Splitting the Dataset
The dataset is typically split into training and testing sets, using an 80-20 ratio, and validated with techniques like KFold.
6. Handling Imbalanced Datasets
To manage imbalanced datasets, oversampling duplicates minority class examples, while undersampling reduces majority class examples. Class
weights can prioritize the minority class.
7. Preparing for Specific NLP Tasks
Preparation for tasks includes labeling text for sentiment analysis and annotating entities for Named Entity Recognition (NER).
8. Data Augmentation
Data augmentation techniques include synonym replacement, random insertion of words, and back translation for paraphrasing.
9. Final Checks
Final checks ensure data consistency, verifying no missing labels or formatting issues, and reviewing data quality for errors.
10. Documentation and Metadata
Documenting preprocessing steps and tools used, along with metadata like data source and collection date, adds valuable context.
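The following sketch combines several of the cleaning and normalization steps from steps 2 and 3 above into one NLTK-based function; the exact order and choice of steps is illustrative and varies by application.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads, if missing:
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)              # remove HTML tags
    text = text.lower()                               # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)             # drop punctuation and numbers
    tokens = nltk.word_tokenize(text)                 # tokenize
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]    # remove stopwords
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]  # lemmatize (defaults to noun POS)

print(preprocess("<p>In 2023, the 5000 participants were running happily!</p>"))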
Understanding Data Attribute Types

Qualitative (Categorical) Data:


• Nominal Data: Categories without any specific order.
o Example: Gender (Male, Female)
• Ordinal Data: Categories with a meaningful order, but no defined difference between them.
o Example: Education level (High School, Bachelor’s, Master’s)
• Binary Data: Two possible categories or outcomes.
o Example: Yes/No
Quantitative (Numeric) Data:
• Discrete Data: Countable, distinct values, often whole numbers.
o Example: Number of children in a family
• Continuous Data: Measurable values that can take any value within a range.
o Example: Height of a person (5.7 feet, 5.75 feet)

What is a Target Attribute?


A target attribute (or target variable) is the specific attribute in a dataset representing the outcome or prediction target in supervised learning. For instance,
in a housing dataset, the target attribute might be the sale price, while predictor variables could include the number of bedrooms and square footage.

Different File Formats for Corpora


Corpora can be stored in various file formats based on their type and structure:
1. Text Corpora Formats
o Plain Text Files (.txt): Simple text files, suitable for raw text.
o CSV (.csv): Tabular data with metadata.
o JSON (.json): Key-value pairs for hierarchical data.
o XML (.xml): Markup language for annotated texts.
o TEI (.xml): Specialized XML for literary texts.
o HTML (.html): Used for web-extracted text.
2. Speech Corpora Formats
o WAV (.wav): Uncompressed audio format for high-quality recordings.
o MP3 (.mp3): Compressed audio format.
o FLAC (.flac): Lossless audio format.
o TextGrid (.TextGrid): Annotations for speech data.
o ELAN (.eaf): Time-aligned annotations.
3. Parallel Corpora Formats
o XLIFF (.xliff): XML format for localization data.
o TMX (.tmx): XML format for translation memory.
o Aligned Text (.txt/.xml): Aligned texts in different languages.
4. Annotated Corpora Formats
o CoNLL (.conll): Tabular form for linguistic annotations.
o BIO Format (.bio): Sequence tagging format.
o Penn Treebank Format (.mrg/.ptb): Syntactically annotated corpora.
5. Multimodal Corpora Formats
o HDF5 (.h5): Stores large amounts of complex data.
o MATLAB (.mat): Binary format used for processing datasets.

Detailed Example of File Formats


1. Comma-Separated Values (.csv)
CSV files are plain text formats for tabular data, where each line represents a row and values are separated by commas. They consist of a header row for
column names and data rows for records. Key characteristics include simplicity, universality, and human readability.
Common uses involve data import/export, analysis in tools like Python and Excel, and storing simple datasets. Notable considerations include handling commas
and newlines in data by enclosing them in double quotes.
Advantages include lightweight design and cross-platform compatibility, while disadvantages are a lack of data types, metadata, and the inability to represent
hierarchical data.

2. Speech Corpora Formats


These formats are collections of spoken language data used in linguistic and speech processing tasks:
• WAV (.wav): Uncompressed audio for high quality, suitable for professional applications.
• MP3 (.mp3): Lossy compression for smaller file sizes, ideal for storage-limited uses.
• FLAC (.flac): Lossless compression that retains original quality for archival purposes.
• TextGrid (.TextGrid): Contains time-aligned annotations for phonetic analysis.
• ELAN (.eaf): Used for annotating audio/video data with multimodal annotations.
Applications include automatic speech recognition, phonetic research, language learning, and multimodal analysis.

3. JavaScript Object Notation (.json)


JSON is a lightweight, human-readable format for data interchange, characterized by key-value pairs, objects (curly braces), and arrays (square brackets).
It’s widely used in web APIs, application configuration, NoSQL databases, and data serialization. Its advantages include simplicity and broad support, while
disadvantages are limited support for complex data types and potential security risks with untrusted data.
Corpus
What is a Corpus?
A corpus is a structured collection of authentic text or audio, often used in Natural Language Processing (NLP) and machine learning. It includes data created
by native speakers of a language and can encompass a variety of sources, such as newspapers, novels, recipes, radio broadcasts, and social media posts. A
corpus is essential for training AI models, helping them learn language patterns, meanings, and structures.
Features of a Good Corpus
1. Large Size: A larger corpus generally provides more data for training models, especially for tasks like sentiment analysis, where diverse examples are
crucial.
2. High Quality: The data must be accurate and free from errors. Even small mistakes can lead to significant issues in machine learning outputs.
3. Clean Data: Proper data cleansing helps eliminate errors and duplicates, resulting in a more reliable corpus.
4. Balance: A well-structured corpus should represent various topics and genres to avoid skewed results.
Challenges in Creating a Corpus
• Determining Data Type: Identifying what data is necessary to address specific problems.
• Data Availability: Accessing the required data can be difficult.
• Data Quality: Ensuring the data meets quality standards is essential.
• Adequate Volume: Having enough data to support effective training.
Types of Corpora
1. General Corpora: Wide-ranging text types and genres (e.g., British National Corpus).
2. Specialized Corpora: Focused on specific domains (e.g., medical or legal texts).
3. Parallel Corpora: Texts in multiple languages, often for translation tasks (e.g., Europarl Corpus).
4. Annotated Corpora: Texts with linguistic annotations, such as POS tags (e.g., Penn Treebank).
5. Spoken Corpora: Transcriptions of spoken language, useful for speech recognition (e.g., Switchboard Corpus).
Need for Corpora
1. Training Machine Learning Models: Provides the extensive data necessary for training NLP algorithms, helping models learn diverse linguistic patterns.
2. Benchmarking and Evaluation: Supplies standardized datasets for comparing model performance and creating benchmarks.
3. Linguistic Research: Facilitates the study of language usage, structure, and meaning, contributing to linguistic analysis.
4. Development of Language Technologies: Essential for refining applications like speech synthesis and machine translation, using real-world data.
5. Enhancing Language Resources: Aids in creating dictionaries and educational tools, making language learning more effective.
6. Addressing Bias and Fairness: Ensures diversity in datasets, reducing biases in language models and promoting inclusivity.
Example Use Cases of Corpora
1. Sentiment Analysis: Training models on product reviews to identify positive or negative sentiments.
2. Machine Translation: Using parallel corpora to train systems like Google Translate.
3. Speech Recognition: Training systems to accurately transcribe spoken language using spoken corpora.
4. Part-of-Speech Tagging: Training taggers with annotated corpora to label parts of speech in texts.
5. Named Entity Recognition (NER): Using annotated corpora to identify and classify named entities in text.
Place of Articulation and Manner of Articulation (PYQ)
Place of Articulation refers to the point in the vocal tract where the airflow is restricted or blocked to produce different speech sounds. It's an essential
concept in phonetics and relevant to speech processing in Natural Language Processing (NLP).
1. Bilabial: Both lips come together. Examples: /p/, /b/, /m/.
2. Labiodental: Lower lip touches upper teeth. Examples: /f/, /v/.
3. Dental: Tongue touches the upper teeth. Examples: /θ/ ("think"), /ð/ ("this").
4. Alveolar: Tongue touches or is near the alveolar ridge. Examples: /t/, /d/, /s/, /z/, /n/, /l/.
5. Post-alveolar: Tongue nears the area just behind the alveolar ridge. Examples: /ʃ/ ("sh"), /ʒ/ ("measure").
6. Retroflex: Tongue curls back towards the hard palate. Example: /ɻ/ (in some pronunciations of "r").
7. Palatal: Tongue touches the hard palate. Example: /j/ ("yes").
8. Velar: Back of the tongue touches the soft palate (velum). Examples: /k/, /g/, /ŋ/ ("sing").
9. Uvular: Back of the tongue nears the uvula. Example: /ʀ/ (uvular trill in some languages).
10. Glottal: Vocal cords (glottis) constrict. Examples: /h/, /ʔ/ (glottal stop in "uh-oh").

Manner of Articulation refers to how airflow is manipulated during the production of speech sounds. It describes the way in which the vocal tract modifies
the airflow to create different types of sounds.
1. Plosive (Stop): Complete closure of the vocal tract followed by a sudden release of air.
Examples: /p/, /b/, /t/, /d/, /k/, /g/.
2. Nasal: Air flows through the nose due to a lowered velum.
Examples: /m/, /n/, /ŋ/.
3. Fricative: Partial closure of the vocal tract, creating a turbulent airflow.
Examples: /f/, /v/, /s/, /z/, /ʃ/, /ʒ/, /θ/, /ð/, /h/.
4. Affricate: A combination of a plosive and a fricative, starting with a complete closure followed by a turbulent release.
Examples: /tʃ/ ("chess"), /dʒ/ ("judge").
5. Approximant: The vocal tract is narrowed, but not enough to create turbulence.
Examples: /j/, /w/, /ɹ/ ("red").
6. Lateral Approximant: The airstream flows along the sides of the tongue.
Example: /l/.
7. Tap or Flap: A single, quick touch of the tongue against the roof of the mouth.
Example: /ɾ/ ("t" in "butter" in American English).
8. Trill: Rapid, repeated contact of the articulator with the place of articulation.
Example: /r/ (rolled "r").

Place \ Manner  | Plosive | Nasal | Fricative | Affricate | Approximant | Lateral Approximant
Bilabial        | p, b    | m     |           |           | w           |
Labiodental     |         |       | f, v      |           |             |
Dental          |         |       | θ, ð      |           |             |
Alveolar        | t, d    | n     | s, z      |           | ɹ           | l
Post-alveolar   |         |       | ʃ, ʒ      | tʃ, dʒ    |             |
Palatal         |         |       |           |           | j           |
Velar           | k, g    | ŋ     |           |           |             |
Glottal         | ʔ       |       | h         |           |             |
Speech Processing in NLP

1. Speaker Recognition

Speaker Verification: Confirms a speaker's identity by comparing their voice to a claimed identity.
• Process:
1. Speaker provides a voice sample.
2. Match with a pre-stored voiceprint.
• Types:
o Text-dependent: Specific phrase required.
o Text-independent: Any speech allowed.
• Applications: Biometric authentication, secure access, mobile devices.
• Challenges: Background noise, microphone quality, voice changes.

Speaker Identification: Identifies a speaker from a group based on unique voice characteristics.
• Process:
1. Extract features (pitch, tone, accent).
2. Compare against a database of voiceprints.
• Applications: Security systems, voice assistants, call centers.
• Challenges: Variability due to mood, health, background noise.

Speaker Diarisation: Determines who spoke when in an audio recording.
• Process:
1. Segment audio by silence and speech.
2. Extract features from segments.
3. Cluster segments by voice characteristics.
• Applications: Meeting transcription, multi-speaker conversations.
• Challenges: Speech variability, overlapping dialogue, background noise.

2. Speech Recognition

Speaker Mode
• Speaker Dependent: Trained on a specific user's voice.
o Advantage: High accuracy (~95%) due to personalized training.
o Disadvantage: Limited to the trained user, less effective for others.
• Speaker Independent: Works for any speaker.
o Advantage: Flexible for multiple users.
o Disadvantage: Lower accuracy as it's not tailored to a specific voice.

Speech Modes
• Isolated Speech Recognition: Recognizes words spoken with clear pauses.
o Features: Commonly used, identifies words ~0.96 seconds in length.
o Use: Ideal for distinct voice commands (e.g., home automation).
• Connected Speech Recognition: Bridges isolated and continuous speech, allowing natural multi-word input with short pauses.
o Features: Recognizes phrases up to ~1.92 seconds; limited vocabulary (~20 words).
o Use: Suitable for short commands in voice-controlled applications.
• Continuous Speech Recognition: Processes natural, conversational speech.
o Challenges: Difficult to segment words as they merge together.
o Use: Virtual assistants, transcription, communication tools.

Speaking Style
• Dictation: Formal, structured speech used in controlled settings like note-taking.
o Features: Predefined vocabulary, clear and organized speech, distinct pronunciation, proper pauses for context and punctuation.
• Spontaneous Speech: Unplanned, natural conversation with varied expressions.
o Features: Free-flowing, variable speech patterns, influenced by background noise and interruptions, requires context understanding for accurate recognition.

3. Language Identification
It is the process of determining a text or audio's language. It's used in tasks like translation, document classification, and speech recognition. Techniques include character-based, word-based, and machine learning methods. Challenges include dialects, code-switching, and limited data. Future research aims to improve robustness, address low-resource languages, and combine multiple modalities.
Word Boundary Detection
Word boundary detection involves identifying where one word ends and another begins, essential for speech recognition and natural language processing.
Two types of speech are considered in word boundary detection:
1. Constrained Speech:
In constrained speech, words have well-defined boundaries with clear pauses between them. The speech is structured, making it easier to detect
where one word ends and another begins.
2. Unconstrained Speech:
Unconstrained speech lacks clear boundaries, grammar, or pauses. This natural, free-flowing speech makes word boundary detection more
challenging. Without a robust detection algorithm, it often leads to false alarms and missed word boundaries due to the continuous nature of
speech.

Key Approaches for Word Boundary Detection


Rule-based Method:
• Relies on predefined rules and patterns to identify word boundaries.
• Often language-specific, tailored to the unique characteristics of each language.
• Effective in languages with clear delimiters such as spaces and punctuation.
1. White Space Separation: Detects spaces to identify boundaries (e.g., "Hello world").
2. Punctuation Marks: Signals boundaries (e.g., "Let's eat, Grandma").
3. Capitalization: Indicates new words (e.g., "Alice went to Wonderland").
4. Hyphenation: Defines components (e.g., "mother-in-law").
5. Special Characters: Acts as delimiters (e.g., "file_path/example_2023").
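A minimal rule-based sketch of these ideas using a regular expression is shown below; the pattern keeps internal apostrophes and hyphens so that tokens like "Let's" and "mother-in-law" stay intact (the pattern is illustrative, not exhaustive).

import re

text = "Let's eat, Grandma. Alice went to Wonderland with her mother-in-law."

# Words may contain internal apostrophes or hyphens; everything else is a boundary
pattern = r"[A-Za-z]+(?:['-][A-Za-z]+)*"
tokens = re.findall(pattern, text)

print(tokens)   # keeps "Let's" and "mother-in-law" as single tokens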

Challenges in Word Boundary Detection


1. Ambiguity:
o Languages without clear word delimiters (e.g., Chinese, Japanese) create significant challenges for word boundary detection.
o Homonyms and Polysemes: The same character sequence may represent different words depending on context, complicating identification.
For example, the word "bank" can refer to a financial institution or the side of a river.
2. Variability in Text:
o Informal text, such as that found on social media or chat platforms, often lacks standard punctuation and spacing.
o Abbreviations, acronyms, and domain-specific jargon can confuse boundary detection systems, leading to inaccuracies.
3. Multilingual Texts:
o Detecting word boundaries in texts containing multiple languages or code-switching within a single sentence presents additional challenges.
Applications of Word Boundary Detection
1. Text Tokenization:
2. Speech Recognition:
3. Text-to-Speech Conversion:
4. Machine Translation:
Tools and Libraries for Word Boundary Detection
• NLTK: Tokenization tools for Python.
• spaCy: Pre-trained NLP models.
• Stanford NLP: Comprehensive NLP tools.
• Hugging Face Transformers: Fine-tunable models for boundary detection tasks.
Hidden Markov Models (HMM) and Speech Recognition
Hidden Markov Models (HMM)
1. Overview of HMMs:
o An HMM is a statistical model used to describe systems that are modeled by a Markov process with unobservable (hidden) states.
o It is widely used in speech recognition, part-of-speech tagging, and bioinformatics.
o An HMM is defined by:
▪ States (Q): The hidden states of the system (e.g., phonemes in speech).
▪ Observations (O): Observable events or outputs (e.g., audio signal features).
▪ Transition Probabilities (A): The probability of transitioning from one state to another.
▪ Emission Probabilities (B): The probability of an observation being generated by a state.
▪ Initial State Probabilities (π): The probability of starting in each state.
2. HMM for Speech Recognition:
o Speech can be broken down into distinct phonemes or words, which correspond to different states in an HMM.
o The acoustic signal is analyzed and matched to these states using the forward-backward algorithm, Viterbi algorithm, or Baum-Welch algorithm.
o Training: Involves estimating the parameters of the HMM (transition, emission, and initial state probabilities) from a corpus of labeled speech data.
o Decoding: The process of finding the most likely sequence of hidden states (phonemes/words) for a given observed sequence of features (audio data).
3. HMM Algorithms:
o Viterbi Algorithm: Used for decoding, it finds the most likely sequence of states given the observations.
o Forward-Backward Algorithm: Used to compute the probability of the observed sequence, helping in training the model.
o Baum-Welch Algorithm: A form of the Expectation-Maximization algorithm, used for training HMMs when the observations are partially observable (as in speech recognition).

Speech Recognition

1. Overview of Speech Recognition:
o The process of converting spoken words into text by using computational techniques.
o Involves three major stages: feature extraction, pattern recognition, and language modeling.
2. Feature Extraction:
o Converts raw speech signals (audio waveform) into a sequence of feature vectors that represent the acoustic characteristics of the speech.
o Common techniques: MFCC (Mel-Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction).
3. Pattern Recognition:
o This is where HMMs play a key role. The feature vectors extracted from the audio are matched to phonemes or words using HMMs.
o Acoustic Models: HMMs are used to model the sequence of phonemes or sub-word units that generate speech sounds.
4. Language Modeling:
o Uses statistical models to predict the probability of a word sequence. It helps to handle ambiguities in speech recognition by providing context.
o Common models: n-gram models (bigram, trigram), neural network-based models.
5. Decoding Process:
o Given a sequence of feature vectors, the goal is to find the most likely sequence of words or phonemes. This is done using dynamic programming techniques like the Viterbi algorithm.

Key Points to Remember:
• HMMs model the sequential nature of speech, where each state corresponds to a phoneme or other linguistic unit.
• Feature extraction (e.g., MFCC) is a critical step in transforming raw audio into a format usable by HMMs.
• The Viterbi algorithm is central to decoding the best sequence of phonemes or words from a sequence of acoustic features.
• Language models help disambiguate the output by providing contextual probabilities for word sequences.
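To make the decoding step concrete, below is a minimal Viterbi decoder for a toy two-state HMM written with NumPy; the states, observation symbols, and probabilities are invented for illustration.

import numpy as np

# Toy HMM: two hidden states and two observation symbols (all numbers are invented)
states = ["S1", "S2"]
obs_symbols = ["low", "high"]
pi = np.array([0.6, 0.4])                  # initial state probabilities
A = np.array([[0.7, 0.3],                  # transition probabilities A[i, j] = P(next=j | current=i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],                  # emission probabilities B[i, k] = P(obs=k | state=i)
              [0.2, 0.8]])

def viterbi(observations):
    T, N = len(observations), len(states)
    delta = np.zeros((T, N))               # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)      # back-pointers to the best previous state
    delta[0] = pi * B[:, observations[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores.max() * B[j, observations[t]]
    # Backtrack from the most probable final state
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.insert(0, int(psi[t, path[0]]))
    return [states[i] for i in path]

obs = [obs_symbols.index(o) for o in ["low", "low", "high"]]
print(viterbi(obs))                        # ['S1', 'S1', 'S2'] for these toy numbers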
Language Structure and Analyzer in NLP
A Language Structure and Analyzer is a critical component within Natural Language Processing (NLP). Its goal is to break down and understand the
grammatical structure of language. This involves identifying elements like parts of speech, syntactic patterns, and sometimes even the semantic meanings
within a given text.
Overview of Language Structure
1. Phonetics and Phonology:
o Phonetics: Focuses on the physical properties of speech sounds. It studies:
▪ How sounds are produced (articulatory phonetics).
▪ How sounds travel (acoustic phonetics).
▪ How sounds are perceived (auditory phonetics).
o Phonology: Deals with how sounds function in particular languages. It studies:
▪ How sounds are mentally organized.
▪ The interaction of sounds within the language.
2. Morphology:
o Morphology: The study of the structure of words. It looks at how words are formed by smaller units called morphemes, which are the
smallest meaningful units of language.
▪ Example: The word "unhappiness" consists of three morphemes: "un-", "happy", and "-ness".
3. Syntax:
o Syntax: The study of how words combine to form sentences. It focuses on:
▪ The rules governing sentence structure.
▪ How words are arranged to form meaningful and grammatically correct sentences.
4. Semantics:
o Semantics: The study of meaning in language. It examines:
▪ Lexical Semantics: The meanings of individual words and how they relate to one another.
▪ Compositional Semantics: How the meanings of individual words combine to form the meaning of larger expressions, like phrases
or sentences.
5. Pragmatics:
o Pragmatics: The study of how context affects the interpretation of language. It explores:
▪ How speakers use language in different social contexts.
▪ How listeners interpret what speakers say based on the context.
Computational Grammar
Computational Grammar is the set of formal grammatical rules that allow computers to process and understand human languages. It plays a key role in
various Natural Language Processing (NLP) tasks such as machine translation, speech recognition, and text generation. Here’s a simple breakdown of the key
components and why they are important.
Key Components of Computational Grammar
1. Formal Grammar:
o Formal Grammar refers to a system of rules that defines how sentences and phrases are structured in a language.
o There are different types of formal grammars:
▪ Context-Free Grammar (CFG): This is a type of grammar where the rules apply independently of the context. It helps in breaking
down sentences into smaller parts, like dividing them into subject, verb, and object.
▪ Dependency Grammar: This focuses on the relationships between words. For example, in the sentence "The cat eats fish," "eats" is
the main word, and "cat" depends on "eats" as the subject, and "fish" as the object.
2. Part-of-Speech (POS) Tagging:
o POS Tagging is the process of labeling each word in a sentence with its part of speech, such as noun, verb, or adjective.
o Computers use models like Hidden Markov Models (HMMs) and neural networks to tag words based on their roles in a sentence.
3. Parsing:
o Parsing refers to analyzing the sentence structure based on grammar rules.
o Parsers are algorithms that break down sentences to identify their structure. There are different kinds of parsers, like shift-reduce parsers
and CYK parsers.
4. Syntactic Tree Representation:
o Parse Trees represent the syntactic structure of a sentence. Each word in the sentence is connected to show how the sentence is constructed
according to grammar rules.
o Dependency Trees show how individual words are related to one another. For example, in "The dog chased the cat," the word "chased" has
two dependencies: the subject "dog" and the object "cat."
Why Computational Grammar is Important
1. Accuracy in Language Processing: It helps computers understand and generate language correctly.
2. Disambiguation: It resolves ambiguities in language, like when words have multiple meanings.
3. Contextual Understanding: It allows the computer to understand the context in which words and sentences are used.
4. Language Translation: It’s crucial for translating text from one language to another while keeping the meaning intact.
5. Speech Recognition: It improves the accuracy of converting spoken language into text.
Applications of Computational Grammar
1. Machine Translation: Converting text from one language to another, like from English to Spanish.
2. Speech Recognition: Turning spoken words into written text.
3. Text-to-Speech: Generating speech from written text, useful in virtual assistants.
4. Sentiment Analysis: Determining if a text has positive, negative, or neutral emotions.
5. Information Retrieval: Finding specific information from large amounts of text, like search engines.
Example: Parsing a Sentence
Let’s take the sentence:
“The quick brown fox jumps over the lazy dog.”
1. Part-of-Speech (POS) Tagging:
o The/DT (determiner), quick/JJ (adjective), brown/JJ (adjective), fox/NN (noun), jumps/VBZ (verb), over/IN (preposition), the/DT
(determiner), lazy/JJ (adjective), dog/NN (noun).
2. Syntactic Structure (Parse Tree):
o The sentence is broken into two main parts: Noun Phrase (NP) and Verb Phrase (VP). The NP includes "The quick brown fox," and the VP
includes "jumps over the lazy dog."
o Each word is placed under its respective category, showing the hierarchical structure of the sentence.
3. Dependency Structure (Dependency Tree):
o The word "jumps" is the main verb, and the words "fox" (subject) and "dog" (object) depend on it. The other words, like adjectives and
prepositions, provide more details about the nouns.
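The example above can be reproduced in code. The sketch below uses NLTK with a small hand-written context-free grammar; the grammar covers only this one sentence and is an illustrative toy, not a general grammar of English (NLTK is assumed to be installed).

Code (illustrative sketch):

import nltk

# A toy CFG for the example sentence (illustrative only)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT JJ JJ NN | DT JJ NN
VP -> VBZ PP
PP -> IN NP
DT -> 'The' | 'the'
JJ -> 'quick' | 'brown' | 'lazy'
NN -> 'fox' | 'dog'
VBZ -> 'jumps'
IN -> 'over'
""")

sentence = "The quick brown fox jumps over the lazy dog".split()
parser = nltk.ChartParser(grammar)
for tree in parser.parse(sentence):
    tree.pretty_print()   # shows the NP / VP constituent structure described above

A dependency structure like the one in point 3 can be obtained in a similar way with a dependency parser such as the one in spaCy, which reports each token's head and dependency label.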
Words and Their Analysis
Understanding words and how they function in sentences is key to processing and analyzing language, especially in fields like Natural Language Processing (NLP) and linguistics. Below is a simple explanation of key concepts related to words and their analysis.
1. Morphology
Morphology is the study of the structure and formation of words. It looks at how words are built from smaller units called morphemes.
• Morphemes are the smallest units of meaning in a word:
o Free Morphemes: These can stand alone as words (e.g., "book," "run").
o Bound Morphemes: These need to attach to other morphemes to make sense (e.g., "un-" in "unhappy," "-ness" in "happiness").
2. Lexical Semantics
Lexical Semantics deals with the meanings of words and how they relate to each other.
• Synonyms: Words with similar meanings (e.g., "big" and "large").
• Antonyms: Words with opposite meanings (e.g., "hot" and "cold").
• Hyponyms: Words that are specific examples of a general category (e.g., "rose" is a hyponym of "flower").
• Polysemy: A word with multiple related meanings (e.g., "bank" as a financial institution and the side of a river).
• Homonyms: Words that sound the same but have different meanings (e.g., "bat" as an animal and "bat" used in sports).
3. Part-of-Speech (POS) Tagging
Part-of-Speech Tagging is the process of assigning a label to each word in a sentence based on its grammatical role. These roles help computers and humans understand how a word functions within a sentence.
• Nouns (NN): Words that represent people, places, things, or ideas (e.g., "dog," "city").
• Verbs (VB): Words that show actions or states (e.g., "run," "is").
• Adjectives (JJ): Words that describe nouns (e.g., "quick," "blue").
• Adverbs (RB): Words that modify verbs, adjectives, or other adverbs (e.g., "quickly," "very").
• Pronouns (PRP): Words that replace nouns (e.g., "he," "they").
• Prepositions (IN): Words that show relationships between words (e.g., "in," "on").
• Conjunctions (CC): Words that connect clauses or sentences (e.g., "and," "but").
• Determiners (DT): Words that introduce nouns (e.g., "the," "a").
4. Word Formation Processes
Word Formation refers to how new words are created in a language.
• Derivation: Adding prefixes or suffixes to create new words (e.g., "happiness" from "happy").
• Compounding: Combining two words to form a new word (e.g., "notebook" from "note" and "book").
• Conversion: Changing a word's part of speech without changing its form (e.g., "to email" from "email").
• Blending: Combining parts of two words to create a new word (e.g., "brunch" from "breakfast" and "lunch").
• Acronyms: Creating a word from the initials of a phrase (e.g., "NASA" from "National Aeronautics and Space Administration").
5. Collocations
Collocations are word combinations that frequently appear together in natural language.
• Common Collocations: Words that sound natural together, like "make a decision" or "strong coffee."
• Idiomatic Expressions: Phrases where the meaning is different from the literal meaning of the words (e.g., "kick the bucket" means "to die").
6. Word Sense Disambiguation (WSD)
Word Sense Disambiguation is the process of identifying the correct meaning of a word in a specific context.
• Contextual Clues: Words around the target word help determine its meaning. For example, "bat" could mean the animal or the sports equipment depending on the sentence.
7. Lemmatization and Stemming
Both Lemmatization and Stemming are used to reduce words to their base forms for easier processing in tasks like search engines or text analysis.
• Lemmatization: Converts a word to its dictionary form (e.g., "running" becomes "run").
• Stemming: Trims words down to their root form, often by cutting off suffixes (e.g., "running" and "runner" both become "run").
8. Tokenization
Tokenization is the process of splitting text into smaller units called tokens, which can be words, phrases, or symbols. Tokenization is the first step in analyzing any text.
• Steps in Tokenization:
1. Text Segmentation: Breaking down the text into sentences or paragraphs.
2. Word Tokenization: Splitting sentences into individual words.
3. Handling Special Cases: Managing punctuation, numbers, and contractions.
• Example (see the NLTK sketch below):
o Input Sentence: "The quick brown fox jumps over the lazy dog."
o Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
Knowledge of Language
Understanding language involves several key components, each focusing on a different aspect of how we process and produce words, sentences, and
meaning. Here's a breakdown of the important elements of language knowledge:

1. Phonology
Phonology deals with the sounds of language and how they differ between words. It focuses on understanding how sounds (phonemes) are organized and
used in speech.
• Example: In English, the words "bat" and "pat" differ by only one sound—the initial consonant. Phonology explains how changing just one sound
changes the meaning of a word.

2. Morphology
Morphology is about how words are built from smaller units called morphemes. A morpheme is the smallest unit of meaning in a language, such as prefixes,
roots, and suffixes.
• Example: The word "unhappiness" consists of three morphemes:
o "un-" (prefix meaning "not")
o "happy" (root word)
o "-ness" (suffix indicating a state or condition)

3. Syntax
Syntax is the study of sentence structure. It examines how words combine to form grammatically correct sentences and determines the relationships between
words in a sentence.
• Example: In the sentence "The dog chased the cat," syntax explains how the words are ordered to convey that the dog is doing the chasing, not the
cat.

4. Semantics
Semantics focuses on the meaning of words and how these meanings combine in sentences to convey a complete thought. It's about the meaning of
sentences without considering the context in which they are used.
• Example: In the sentence "The cat sat on the mat," semantics tells us what the sentence means literally—there's a cat, and it's sitting on a mat.

5. Pragmatics
Pragmatics goes beyond the literal meaning of words and looks at how language is used in different situations. It deals with how context affects the
interpretation of sentences.
• Example: The sentence "Could you shut the window?" is phrased like a question, but pragmatically, it is a polite request rather than an actual
question.

6. Discourse
Discourse is about understanding how sentences relate to each other in a conversation or text. It looks at how the meaning of a sentence is influenced by
the sentences before or after it.
• Example: In a conversation, if someone says, "John was late again today," and the next sentence is "He always oversleeps," we know "He" refers to
John because of discourse context.

7. World Knowledge
World knowledge refers to the general knowledge and understanding of the world that helps people interpret language. It involves knowing about social
norms, facts, and shared beliefs.
• Example: If someone says, "It’s raining cats and dogs," you understand that this is an idiom meaning heavy rain, based on your knowledge of common
expressions.
Argmax-Based Computations in NLP (PYQ)
The argmax function is central to many NLP tasks: it returns the index of the maximum value in an array or tensor, turning a model's probabilistic outputs into definitive predictions.
Mathematical Formulation
The argmax function identifies the index of the maximum value in a given array. For an array a with elements a_i, argmax(a) = arg max_i a_i, i.e., the index i at which a_i takes its largest value.
Properties of Argmax
• Uniqueness: Returns a unique index if the maximum value is unique; may vary in the case of ties.
• Non-linearity: Non-differentiable, complicating gradient-based optimization (often mitigated using the softmax function).
• Composability: Can be combined with other functions for complex decision-making pipelines.
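Because argmax itself is not differentiable, models are usually trained with a softmax over the scores and argmax is applied only at prediction time. A minimal NumPy sketch of this relationship, using arbitrary made-up scores:

Code (illustrative sketch):

import numpy as np

scores = np.array([1.2, 0.4, 3.1, 2.2])   # illustrative class scores (logits)

# Softmax: a differentiable way to turn scores into a probability distribution
probs = np.exp(scores - scores.max())     # subtract the max for numerical stability
probs /= probs.sum()

print("Probabilities:", np.round(probs, 3))
print("Predicted class index:", np.argmax(probs))   # same index as np.argmax(scores)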

Key Applications of Argmax in NLP


1. Token Classification
Token classification involves assigning labels to tokens in an input sequence. This is common in tasks such as Named Entity Recognition (NER).
Process Overview:
1. Input Sequence: A sequence of tokens (e.g., "John Doe works at OpenAI").
2. Model Prediction: The model outputs a probability distribution over categories for each token.
3. Argmax Function: Selects the category with the highest probability for each token.
2. Sequence Classification
In tasks like sentiment analysis, the model outputs probabilities for each class (e.g., positive, negative). Argmax selects the class with the highest probability.
Applications in Other NLP Tasks
1. Part-of-Speech Tagging: Tags each word in a sentence with its corresponding part of speech.
2. Parsing: Selects the most likely parse tree or syntactic structure for a sentence.
3. Text Classification: Used in tasks like spam detection or sentiment analysis to select the most probable class.

Code:

import numpy as np

# Sample data: scores of different classes
scores = np.array([2.5, 3.8, 1.2, 4.5, 3.1])

# Use np.argmax to find the index of the maximum value
max_index = np.argmax(scores)

print("The index of the maximum score is:", max_index)
print("The maximum score is:", scores[max_index])

Morphological Analysis
Morphological analysis is the study and process of analyzing the structure and formation of words. It involves breaking down words into their smallest
meaningful units, known as morphemes. Morphemes can be classified as either free morphemes (which can stand alone as words, like "book") or bound
morphemes (which need to be attached to other morphemes to convey meaning, like prefixes or suffixes, such as "un-" or "-ed").
The purpose of morphological analysis in Natural Language Processing (NLP) is to understand the internal structure of words and how they change based on
grammatical rules. It plays a crucial role in tasks such as text analysis, language modeling, and machine translation. Through morphological analysis, NLP
systems can better handle variations of a word (e.g., "run" vs. "running") and improve their understanding of context.
Key Points:
• Morphemes: Smallest units of meaning in a language, either free (standalone words) or bound (prefixes/suffixes).
• Inflectional Morphology: Deals with variations of words to express tense, case, gender, etc., without changing the core meaning (e.g., "walk" to
"walked").
• Derivational Morphology: Changes the meaning of a word by adding morphemes (e.g., "happy" to "unhappy").
• Applications: Used in NLP tasks such as part-of-speech tagging, spell checking, and machine translation to enhance understanding of language.
• Challenges: Handling irregular word forms and languages with rich morphology, where the number of inflections is large (e.g., Turkish or Finnish).
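As a toy illustration of morpheme segmentation (not a real morphological analyzer), a few prefix/suffix stripping rules can split a word like "unhappiness"; the affix lists below are hypothetical and deliberately tiny, and the example also shows why real analyzers need extra rules for spelling changes.

Code (illustrative sketch):

# Toy, rule-based morpheme splitter (illustrative only; real analyzers use
# finite-state morphology, larger affix inventories, and spelling rules)
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ing", "ed", "s"]

def split_morphemes(word):
    prefixes, suffixes = [], []
    stem = word.lower()
    for p in PREFIXES:                     # strip at most one known prefix
        if stem.startswith(p):
            prefixes.append(p + "-")
            stem = stem[len(p):]
            break
    for s in SUFFIXES:                     # strip at most one known suffix
        if stem.endswith(s):
            suffixes.insert(0, "-" + s)
            stem = stem[:-len(s)]
            break
    return prefixes + [stem] + suffixes

print(split_morphemes("unhappiness"))  # ['un-', 'happi', '-ness']  (note 'happi', not 'happy')
print(split_morphemes("walked"))       # ['walk', '-ed']

The leftover stem "happi" (rather than "happy") is exactly the kind of irregularity referred to under Challenges above.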
Stemming and Lemmatization (PYQ)
Stemming is a rule-based process that removes suffixes or prefixes from words to produce their stem form. This method often leads to non-dictionary words,
as it focuses on removing affixes based on predefined rules. For example, "running," "runner," and "ran" may all be stemmed to "run." While stemming is
typically faster and simpler, it may sacrifice accuracy and clarity since it doesn't always produce valid root words.
Lemmatization, on the other hand, involves a more sophisticated approach that considers the context and meaning of a word. It reduces words to their base
form, or lemma, by analysing the word's intended meaning and its grammatical role in a sentence. For instance, "better" would be lemmatized to "good,"
which is a valid dictionary entry. This method requires more computational resources and linguistic knowledge but yields more accurate and meaningful
results.

Differences Between Lemmatization and Stemming


Aspect | Lemmatization | Stemming
Output | Produces valid words | May produce non-words
Context Sensitivity | Considers context (part of speech) | Ignores context
Complexity | More complex and resource-intensive | Simpler and faster
Accuracy | More accurate in yielding meaningful forms | Less accurate due to aggressive truncation

Stemming focuses on reducing words to their root form by simply removing prefixes and suffixes. This can sometimes result in non-standard or incorrect forms.
Example Words (Stemming):
o Caring → Car
o Loveliness → Lovel
o Happiness → Happi
o Running → Run
o Unhappily → Unhappili
o Faster → Fast
o Studies → Studi
Lemmatization, on the other hand, removes affixes based on context and converts the word to its valid root form. This method uses vocabulary and morphological analysis to ensure the resulting word is valid.
Example Words (Lemmatization):
o Caring → Care (verb lemma: the base verb)
o Loveliness → Lovely (adjective lemma: the base adjective)
o Happiness → Happy (adjective lemma: the base adjective)
o Running → Run (verb lemma: the base verb)
o Unhappily → Unhappy (adjective lemma: the base adjective)
o Faster → Fast (adjective lemma: the base form of the adjective)
o Studies → Study (noun lemma: the base noun)
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the resources used by the tokenizer and lemmatizer (first run only;
# depending on your NLTK version you may also need 'punkt_tab' or 'omw-1.4')
nltk.download('punkt')
nltk.download('wordnet')

# Sample sentence
sentence = "The cats are running faster than the dogs."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Stemming
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in tokens]
print("Stemming:", stems)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in tokens]
print("Lemmatization:", lemmas)

Stemming: ['The', 'cat', 'are', 'run', 'faster', 'than', 'the', 'dog', '.']
Lemmatization: ['The', 'cat', 'are', 'running', 'faster', 'than', 'the', 'dog', '.']
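Note that the lemmatizer leaves "running" unchanged because NLTK's WordNetLemmatizer assumes a noun part of speech by default; supplying the part of speech explicitly gives the verb lemma, as in this minimal illustration:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running"))            # 'running' (treated as a noun by default)
print(lemmatizer.lemmatize("running", pos="v"))   # 'run' (verb lemma)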
Sentiment Analysis Code:

from textblob import TextBlob

# Sample sentence
sentence = "I love sunny days, but I hate the rain."

# Sentiment Analysis
analysis = TextBlob(sentence)
print("Sentiment Polarity:", analysis.sentiment.polarity)
print("Sentiment Subjectivity:", analysis.sentiment.subjectivity)

Output:

Sentiment Polarity: 0.2

Sentiment Subjectivity: 0.9

• Polarity: Indicates a slightly positive sentiment (scale -1 to 1).

• Subjectivity: Indicates high subjectivity (scale 0 to 1).

Deep Learning (Simple Neural Network using Keras) Code:

from keras.models import Sequential
from keras.layers import Dense
import numpy as np

# Sample dataset (y follows the same linear trend as X)
X = np.array([[0], [1], [2], [3]])
y = np.array([[0], [1], [2], [3]])

# Build the model
model = Sequential([
    Dense(10, activation='relu', input_dim=1),
    Dense(1, activation='linear')
])

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Train the model
model.fit(X, y, epochs=100, verbose=0)

# Prediction (passing a NumPy array avoids Keras treating a plain list as multiple inputs)
print("Prediction for input 4:", model.predict(np.array([[4]])))

Output (approximate):

Prediction for input 4: [[3.9994]]

• The model approximates the value for the input 4, as the training data follows a linear trend.
