NLP Viva

CHAPTER 1

The Need for NLP, Generic NLP System, Levels of NLP, Stages in Building a Natural Language
Processing System, Challenges and Ambiguities in NLP Design

1. The Need for NLP:


● Efficient Communication: NLP enables efficient communication between humans
and machines. For example, chatbots use NLP to understand user queries and
provide appropriate responses, enhancing customer service.
● Information Retrieval: Search engines like Google use NLP techniques to understand
user queries and retrieve relevant web pages. For instance, when you search for
"weather forecast," Google uses NLP to understand your query and provide weather
information.
● Sentiment Analysis: Companies use NLP for sentiment analysis of customer reviews
on social media to understand customer satisfaction levels. For example, analyzing
tweets mentioning a brand to gauge public opinion.
2. Generic NLP System:
● Tokenization: Input: "The cat is sitting on the mat." Output tokens: ["The", "cat", "is",
"sitting", "on", "the", "mat"].
● Parsing: Analyzing the sentence structure to determine relationships between words.
Example: "The cat chased the mouse." Parsing identifies that "cat" is the subject,
"chased" is the verb, and "mouse" is the object.
● NER: Input: "Apple is headquartered in Cupertino, California." Named Entities:
{"Apple": Organization, "Cupertino, California": Location}.
● POS Tagging: Input: "The quick brown fox jumps over the lazy dog." POS Tags: {"The":
Determiner, "quick": Adjective, "brown": Adjective, "fox": Noun, "jumps": Verb, "over":
Preposition, "the": Determiner, "lazy": Adjective, "dog": Noun}.
● Semantic Analysis: Input: "I love my new phone!" Sentiment: Positive.
● Text Generation: Given the prompt "Once upon a time," the system generates: "there
was a beautiful princess living in a faraway kingdom."
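The components above can be exercised with an off-the-shelf library. Below is a minimal sketch using spaCy covering tokenization, POS tagging, and NER; the model name and example sentence are illustrative choices, and the small English model must be installed first.

```python
# A minimal sketch of tokenization, POS tagging, and NER using spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is headquartered in Cupertino, California.")

# Tokenization: the pipeline splits the text into tokens
print([token.text for token in doc])

# POS tagging: a coarse part-of-speech tag for each token
print([(token.text, token.pos_) for token in doc])

# NER: named entities with labels such as ORG and GPE (location)
print([(ent.text, ent.label_) for ent in doc.ents])
```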
3. Levels of NLP:
● Syntax: Analyzing sentence structure without considering meaning. Example: "The
man bites the dog." vs. "The dog bites the man."
● Semantics: Understanding the meaning of words and sentences. Example: "I saw her
duck." (Did you see her pet duck or did you see her quickly move her head down?)
● Pragmatics: Interpreting language in context. Example: "Can you pass the salt?"
(Literal request vs. indirect request depending on context).
4. Stages in Building an NLP System:
● Data Collection: Gathering a large corpus of text data, e.g., articles, books, or social
media posts.
● Preprocessing: Removing noise, tokenization, stemming, and lemmatization.
Example: Converting "running" to "run" and "swimming" to "swim".
● Feature Engineering: Selecting relevant features like word frequency, n-grams, or
word embeddings.
● Model Training: Using algorithms like Naive Bayes, Support Vector Machines, or deep
learning models such as Recurrent Neural Networks (RNNs) or Transformers.
● Evaluation: Assessing model performance using metrics like accuracy, precision,
recall, or F1-score.
● Deployment: Integrating the trained model into a web application, chatbot, or any
other system for real-world use.
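As a hedged illustration of the feature-engineering, training, and prediction stages, here is a minimal scikit-learn sketch; the four-document corpus and its labels are invented purely for demonstration.

```python
# A minimal text-classification sketch: feature engineering (TF-IDF)
# and model training (Naive Bayes) chained in one scikit-learn pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["the match ended in a draw", "parliament passed the bill",
         "the striker scored twice", "the senate debated the budget"]
labels = ["sports", "politics", "sports", "politics"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)  # training stage

# Deployment-style use: classify unseen text
print(model.predict(["the goalkeeper made a great save"]))
```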
5. Challenges and Ambiguities in NLP Design:
● Ambiguity: Example: "I saw her duck." (The word "duck" can be a verb or a noun,
leading to ambiguity without context).
● Data Sparsity: Limited availability of annotated data for training models in
specialized domains.
● Domain Adaptation: Models trained on news articles may not perform well on legal
documents due to differences in vocabulary and writing style.
● Lack of Context: Understanding sarcasm or subtle nuances in language requires
understanding broader context, which is challenging for machines.
● Ethical and Bias Concerns: Example: Gender bias in language models can lead to
biased predictions in applications like resume screening or language translation.

CHAPTER 2
Probability Theory, Conditional Probability and Independence, Bayes' Rule, Random
Variables, Probability Distributions, Statistics, Counting, Frequency, Mean and Variance,
English Grammar, Parts of Speech, Phrase Structures

Probability Theory:

● Definition: Probability theory is a branch of mathematics concerned with quantifying
uncertainty.
● Example: When rolling a fair six-sided die, the probability of rolling a 3 is 1/6, because
there is one favorable outcome out of six possible outcomes.

Conditional Probability and Independence:

● Conditional Probability: The probability of an event A given that another event B has already
occurred.
● Example: The probability of rain (A) given that the sky is cloudy (B) might be higher than the
probability of rain without considering the sky condition.
● Independence: Events A and B are independent if the occurrence of one does not affect the
probability of the other.
● Example: When flipping a fair coin twice, the outcome of the first flip does not influence the
outcome of the second flip.

Bayes' Rule:

● Definition: Bayes' Rule allows us to update our beliefs about the probability of an event based
on new evidence.
● Example: In medical diagnosis, Bayes' Rule can be used to update the probability of a
disease given the results of a diagnostic test.

Random Variables:

● Definition: A random variable is a variable whose possible values are outcomes of a random
phenomenon.
● Example: In rolling a fair six-sided die, the random variable X could represent the number
showing on the die.

Probability Distributions:

● Definition: A probability distribution describes how the values of a random variable are
distributed.
● Example: The probability distribution of rolling a fair six-sided die is uniform, meaning each
outcome (1 through 6) has an equal probability of 1/6.

Statistics:

● Definition: Statistics involves collecting, analyzing, interpreting, and presenting data.


● Example: Calculating the average height of students in a class is a statistical measure.

Counting, Frequency, Mean, and Variance:

● Counting: Counting involves determining the number of possible outcomes in a given scenario.
● Example: Counting the number of ways to arrange the letters in the word "HELLO."
● Frequency: Frequency refers to how often something occurs.
● Example: The frequency of the word "the" in a text document.
● Mean: The mean (average) of a set of numbers is calculated by adding up all the numbers
and dividing by the total count.
● Example: The mean of the numbers 1, 3, 5, 7, and 9 is (1 + 3 + 5 + 7 + 9) / 5 = 5.
● Variance: Variance measures how much the values in a dataset differ from the mean.
● Example: The variance of the numbers 1, 3, 5, 7, and 9 is calculated by finding the average
squared difference from the mean.
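Worked out for the same numbers (using the population variance, i.e., dividing by n):

```latex
\mu = \frac{1 + 3 + 5 + 7 + 9}{5} = 5, \qquad
\sigma^2 = \frac{(1-5)^2 + (3-5)^2 + (5-5)^2 + (7-5)^2 + (9-5)^2}{5}
         = \frac{16 + 4 + 0 + 4 + 16}{5} = 8
```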

English Grammar, Parts of Speech, Phrase Structures:

● English Grammar: English grammar encompasses the rules and structures governing the
English language.
● Parts of Speech: Words in English are categorized into different parts of speech such as
nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections.
● Example: In the sentence "The cat sat on the mat," "cat" is a noun, "sat" is a verb, "on" is a
preposition, and so on.
● Phrase Structures: Phrase structures describe the arrangement of words into meaningful
units such as noun phrases, verb phrases, and prepositional phrases.
● Example: In the sentence "The big brown dog chased the squirrel," "the big brown dog" is a
noun phrase, and "chased the squirrel" is a verb phrase.
CHAPTER 3

Tokenization, Segmentation, Lemmatization, Edit Distance, Collocations, Porter Stemmer,
N-gram Language Model, Morphological Analysis, Derivational and Inflectional Morphology

Tokenization:
● Definition: Tokenization is the process of breaking text into smaller units called tokens.
● Steps/Process:
● Identify delimiters such as spaces, punctuation marks, etc.
● Split the text based on these delimiters to form tokens.
● Example: Sentence: "The quick brown fox jumps over the lazy dog."
● Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
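A minimal sketch of the two steps above in Python; the regular expression is a simplification that drops punctuation, which real tokenizers handle more carefully.

```python
# A minimal tokenizer: \w+ extracts runs of word characters, so the
# text is split on spaces and punctuation is dropped. Real tokenizers
# also handle contractions, hyphens, URLs, etc.
import re

sentence = "The quick brown fox jumps over the lazy dog."
tokens = re.findall(r"\w+", sentence)
print(tokens)
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
```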

Segmentation:

● Definition: Segmentation involves dividing a continuous piece of text into smaller meaningful
units, such as sentences or words.
● Steps/Process:
● Identify boundaries between units, such as periods for sentence segmentation or
spaces for word segmentation.
● Use these boundaries to segment the text.
● Example: Text: "Natural language processing is an exciting field. It has many applications."
● Segmented sentences: ["Natural language processing is an exciting field.", "It has
many applications."]

Lemmatization:

● Definition: Lemmatization is the process of reducing words to their base or dictionary form
(lemma).
● Steps/Process:
● Identify the part of speech of the word.
● Apply morphological rules to reduce the word to its lemma.
● Example: Word: "running"
● Lemmatized form (lemma): "run"
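A small sketch using NLTK's WordNet lemmatizer; it assumes the WordNet data has been downloaded once, and shows why the part-of-speech step matters.

```python
# Lemmatization with NLTK's WordNet lemmatizer.
# Assumes a one-time: import nltk; nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# With the verb POS hint, "running" reduces to its lemma "run"
print(lemmatizer.lemmatize("running", pos="v"))  # run

# Without the hint it is treated as a noun and left unchanged,
# which is why step 1 (identify the part of speech) matters
print(lemmatizer.lemmatize("running"))           # running
```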

Edit Distance:

● Definition: Edit distance measures the similarity between two strings by calculating the
minimum number of operations (insertion, deletion, substitution) required to transform one
string into the other.
● Steps/Process:
● Define the operations (insertion, deletion, substitution) and their costs.
● Calculate the minimum number of operations required using dynamic programming.
● Example: Calculate the edit distance between "kitten" and "sitting".
● Operations:
● Substitution: 'k' -> 's' ("sitten")
● Substitution: 'e' -> 'i' ("sittin")
● Insertion: 'g' at the end ("sitting")
● Edit distance: 3
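The dynamic-programming calculation can be written compactly in Python; this is a standard Levenshtein implementation with unit costs, matching the example above.

```python
# Levenshtein edit distance via dynamic programming (unit costs).
def edit_distance(a: str, b: str) -> int:
    m, n = len(a), len(b)
    # dp[i][j] = minimum operations to turn a[:i] into b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of a
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```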

Collocations:
● Definition: Collocations are words that frequently co-occur together in a text and exhibit
some form of semantic association.
● Steps/Process:
1. Analyze a corpus to identify pairs of words that occur together more frequently than
expected by chance.
● Example: "Strong coffee" is a collocation because the words "strong" and "coffee" frequently
occur together.
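One way to carry out that corpus analysis is to rank bigrams by pointwise mutual information (PMI), for example with NLTK's collocation finder; the toy token list below is invented for illustration.

```python
# Ranking candidate collocations by PMI with NLTK.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = ("i drank strong coffee then more strong coffee "
          "while she drank weak tea").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# Bigrams with the highest PMI co-occur more often than chance predicts
print(finder.nbest(bigram_measures.pmi, 3))
```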

Porter Stemmer:

● Definition: The Porter Stemmer is an algorithm for removing suffixes from words to obtain
their root form (stem).
● Steps/Process:
● Apply a series of rules to remove common suffixes from words.
● Example: Word: "running"
● Stem: "run"
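A brief NLTK sketch; note that, unlike lemmas, stems need not be real dictionary words.

```python
# Suffix stripping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))    # run
print(stemmer.stem("happiness"))  # happi -- stems need not be real words
```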

N-gram Language Model:

● Definition: An N-gram language model predicts the probability of a word given the previous
N-1 words in a sequence of text.
● Steps/Process:
● Break the text into sequences of N consecutive words (N-grams).
● Calculate the probability of each word given the preceding N-1 words.
● Example:
● Text: "The quick brown fox jumps over the lazy dog."
● Trigram (3-gram) language model predicts the next word given the previous two
words.
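A tiny maximum-likelihood bigram (2-gram) model estimated from raw counts illustrates the idea; real models add smoothing so unseen bigrams do not get zero probability.

```python
# A maximum-likelihood bigram model:
# P(word | prev) = count(prev, word) / count(prev)
from collections import Counter

tokens = "the quick brown fox jumps over the lazy dog".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev: str, word: str) -> float:
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# "the" occurs twice and is followed by "quick" once, so P = 0.5
print(bigram_prob("the", "quick"))  # 0.5
```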

Morphological Analysis:

● Definition: Morphological analysis involves analyzing the structure and form of words,
particularly how they are constructed from smaller meaningful units called morphemes.
● Steps/Process:
1. Identify morphemes (smallest units of meaning) within words.
2. Analyze how these morphemes combine to form the word's meaning.
● Example: In the word "unhappiness", "un-" is a prefix indicating negation, "happy" is the root
word, and "-ness" is a suffix indicating a state or quality.

Derivational and Inflectional Morphology:

● Derivational Morphology: Derivational morphology involves forming new words by adding


prefixes or suffixes to existing words, often resulting in a change in meaning or word class.
● Example: Deriving "happiness" from "happy" by adding the suffix "-ness".
● Inflectional Morphology: Inflectional morphology involves adding affixes to words to indicate
grammatical features such as tense, number, or case without changing the basic meaning or
word class.
● Example: Adding "-s" to "dog" to indicate plural ("dogs").

CHAPTER 4

Tag set for English, Penn Treebank, Introduction to Parts of Speech Tagging (POS Tagging),
Markov Processes, Hidden Markov Models (HMM), Parts of Speech Tagging using Hidden Markov
Models, Viterbi Algorithm

Tag set for English, Penn Treebank:

● Definition: A tag set is a collection of tags assigned to words in a corpus to indicate their
grammatical categories, such as parts of speech.
● Example: In the Penn Treebank tag set, "NN" represents a singular noun, "VB" represents a
verb in the base form, "JJ" represents an adjective, and so on.

Introduction to Parts of Speech Tagging (POS Tagging):

● Definition: POS tagging is the process of assigning parts of speech to words in a text.
● Example: In the sentence "The cat sat on the mat," POS tagging would assign "The" as a
determiner (DT), "cat" as a noun (NN), "sat" as a verb (VBD), "on" as a preposition (IN),
"the" as a determiner (DT), and "mat" as a noun (NN).

Markov Processes:

● Definition: A Markov process is a stochastic model describing a sequence of possible events


where the probability of each event depends only on the state attained in the previous event.
● Example: In weather forecasting, a Markov process can model the transition between
weather states (e.g., sunny, rainy, cloudy) based on the current weather state.
Hidden Markov Models (HMM):

● Definition: A Hidden Markov Model is a statistical model that involves a sequence of


observable events generated by a sequence of hidden states.
● Example: In speech recognition, an HMM can model the sequence of phonemes (hidden
states) that generate the observed sequence of audio features (observable events).

Parts of Speech Tagging using Hidden Markov Models (HMM):

● Process:
1. Model Construction: Build an HMM with states representing parts of speech (e.g.,
noun, verb, adjective) and emissions representing words.
2. Training: Estimate the parameters of the HMM (transition probabilities between POS
states and emission probabilities of words given POS states) using a labeled training
corpus.
3. Decoding: Given a sequence of words, use the Viterbi algorithm to find the most
likely sequence of POS tags.

Viterbi Algorithm:

● Definition: The Viterbi algorithm is a dynamic programming algorithm used to find the most
likely sequence of hidden states in a Hidden Markov Model.
● Steps:
1. Initialization: Initialize probabilities for the initial states.
2. Recursion: Calculate the probabilities of transitioning from each state to every other
state for each word in the sequence.
3. Termination: Select the most likely final state.
4. Backtracking: Trace back through the sequence to find the most likely path of states.

Example: Let's say we have an HMM for POS tagging with states representing "noun" (N), "verb" (V),

and "adjective" (Adj). Given the sentence "The cat sat on the mat":

● We initialize the probabilities for the initial states.


● We calculate the probabilities of transitioning between states for each word.
● We select the most likely final state.
● We trace back through the sequence to find the most likely path of states.
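The four steps can be sketched directly in Python. All probabilities below are invented toy values, not trained parameters, and the state set is reduced to three tags for readability.

```python
# A compact Viterbi sketch for a toy POS-tagging HMM.
# All probabilities are invented for illustration only.
def viterbi(obs, states, start_p, trans_p, emit_p):
    # Initialization: best score for each state at time 0
    V = [{s: start_p[s] * emit_p[s].get(obs[0], 0.0) for s in states}]
    back = [{}]
    # Recursion: extend the best path one observation at a time
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = V[t - 1][prev] * trans_p[prev][s] * emit_p[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Termination: pick the most likely final state
    last = max(states, key=lambda s: V[-1][s])
    # Backtracking: recover the most likely state sequence
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("DT", "NN", "VBD")
start_p = {"DT": 0.8, "NN": 0.1, "VBD": 0.1}
trans_p = {"DT": {"DT": 0.1, "NN": 0.8, "VBD": 0.1},
           "NN": {"DT": 0.2, "NN": 0.2, "VBD": 0.6},
           "VBD": {"DT": 0.6, "NN": 0.3, "VBD": 0.1}}
emit_p = {"DT": {"the": 0.9}, "NN": {"cat": 0.5, "mat": 0.5},
          "VBD": {"sat": 0.9}}

print(viterbi(["the", "cat", "sat"], states, start_p, trans_p, emit_p))
# ['DT', 'NN', 'VBD']
```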
CHAPTER 5

Lexical Semantics, ambiguous words, word senses. Relations between senses: synonym, antonym,
reversives, hyponym, hypernym, meronym, structured polysemy, metonymy, zeugma.
Introduction to WordNet, gloss, synset, sense relations in WordNet. Cosine distance between
documents. Word sense disambiguation.

Lexical Semantics:

● Definition: Lexical semantics is the branch of linguistics concerned with the meaning of
words and their relationships with other words.
● Example: Understanding that the word "bank" can refer to a financial institution or the side of
a river.

Ambiguous Words and Word Senses:

● Definition: Ambiguous words are those with multiple meanings or senses.


● Example: "Bat" can refer to a flying mammal or a piece of sports equipment used in baseball.
● Word Senses: Word senses are different meanings associated with an ambiguous word.
● Example: The word "bank" has different senses depending on context: financial institution vs.
the side of a river.

Relations Between Senses:

● Synonym: Words with similar meanings.


● Example: Synonyms for "happy" include "joyful," "content," and "cheerful."
● Antonym: Words with opposite meanings.
● Example: "Hot" is an antonym of "cold."
● Reversives: Pairs of verbs indicating opposite actions.
● Example: "Open" and "close" are reversives.
● Hyponym: A word that is more specific than another word.
● Example: "Rose" is a hyponym of "flower."
● Hypernym: A word that is more general than another word.
● Example: "Animal" is a hypernym of "cat" and "dog."
● Meronym: A word that denotes a part of the whole named by another word.
● Example: "Wheel" is a meronym of "car."

Structured Polysemy, Metonymy, Zeugma:

● Structured Polysemy: When a word has multiple distinct meanings that are systematically
related.
● Example: "Chicken" can refer to the animal or to its meat; the two senses are regularly
related, and the same animal/food pattern holds for "lamb," "salmon," and many other words.
● Metonymy: A figure of speech in which a word is replaced by another word closely
associated with it.
● Example: Referring to the "crown" to mean a monarch or royal authority.
● Zeugma: A figure of speech in which a word applies to two others in different senses.
● Example: "She broke his heart and his car."

Introduction to WordNet:

● Definition: WordNet is a lexical database of English nouns, verbs, adjectives, and adverbs
grouped into sets of cognitive synonyms called synsets.
● Gloss: A brief definition or explanation of a word or synset.
● Example: The gloss for the synset {dog, domestic dog, Canis familiaris} is "a member of the
genus Canis (probably descended from the common wolf) that has been domesticated by
man since prehistoric times."

Sense Relations in WordNet:

● Synset: A set of synonyms that represent a word sense.


● Example: {car, auto, automobile, machine, motorcar}
● Sense Relations: Various semantic relationships between synsets in WordNet, including
hypernymy, hyponymy, meronymy, and antonymy.
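These relations can be browsed programmatically; below is a short sketch with NLTK's WordNet interface, assuming the WordNet data has been downloaded.

```python
# Browsing WordNet synsets and sense relations with NLTK.
# Assumes a one-time: import nltk; nltk.download('wordnet')
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.definition())      # the gloss
print(dog.lemma_names())     # synonyms in the synset
print(dog.hypernyms())       # more general synsets (hypernymy)
print(dog.hyponyms()[:3])    # more specific synsets (hyponymy)
print(dog.part_meronyms())   # parts of a dog (meronymy)
```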

Cosine Distance Between Documents:

● Definition: Cosine distance measures the similarity between two documents based on the
cosine of the angle between their feature vectors in a high-dimensional space.
● Example: Cosine distance can be used in text mining to compare the similarity of documents
based on the frequency of words they contain.
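A self-contained sketch using raw word counts as the feature vectors; cosine distance is simply 1 minus the similarity computed here.

```python
# Cosine similarity between two bag-of-words vectors.
# Cosine distance = 1 - cosine similarity.
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)  # contributions of shared words
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

print(cosine_similarity("the cat sat on the mat",
                        "the cat lay on the rug"))  # 0.75
```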

Word Sense Disambiguation (WSD):

● Definition: WSD is the task of determining the correct sense of an ambiguous word in
context.
● Example: Given the sentence "He went to the bank to deposit his money," WSD would
determine whether "bank" refers to a financial institution or the side of a river.
CHAPTER 6

Reference resolution: Discourse model, Reference Phenomenon, Syntactic and Semantic Constraints
on coreference. Applications of NLP: Categorization, Summarization, Sentiment Analysis, Named
Entity Recognition, Machine Translation, Information Retrieval, Question Answering System

Reference Resolution:

● Discourse Model: A discourse model is a representation of the structure and content of a


conversation or text.
● Reference Phenomenon: Reference phenomenon deals with how language refers to entities
mentioned earlier in the discourse.
● Syntactic and Semantic Constraints on Coreference: Syntactic constraints involve
grammatical relationships between words, while semantic constraints involve the meanings
of words and their references.

Applications:

1. Categorization:
● Steps/Processes:
1. Data Collection: Gather a large corpus of text documents.
2. Preprocessing: Clean and tokenize the text data.
3. Feature Extraction: Extract features from the text, such as word frequency or
TF-IDF scores.
4. Model Training: Train a classification algorithm (e.g., Naive Bayes, Support
Vector Machine) using labeled data.
5. Evaluation: Assess the performance of the model using metrics like
accuracy, precision, recall, or F1-score.
● Example: Classifying news articles into categories like politics, sports, or
entertainment.
2. Summarization:
● Steps/Processes:
1. Text Parsing: Parse the input text to identify important sentences or phrases.
2. Content Selection: Select the most informative sentences or phrases based
on criteria like relevance or importance.
3. Summary Generation: Construct a concise summary using the selected
sentences or phrases.
● Example: Automatically generating a summary of a news article or research paper.
3. Sentiment Analysis:
● Steps/Processes:
1. Text Preprocessing: Clean and tokenize the text data.
2. Sentiment Lexicon: Use a sentiment lexicon or dictionary to assign sentiment
scores to words.
3. Sentiment Classification: Apply machine learning algorithms (e.g., Logistic
Regression, Neural Networks) to classify text into positive, negative, or
neutral sentiments.
● Example: Analyzing customer reviews of a product to determine overall sentiment
(positive or negative); a minimal lexicon-based sketch appears after this list.
4. Named Entity Recognition (NER):
● Steps/Processes:
1. Tokenization: Break the text into words or tokens.
2. POS Tagging: Assign Part-of-Speech tags to each word in the text.
3. Named Entity Classification: Classify words or phrases as named entities
(e.g., person names, organization names, location names).
● Example: Identifying and categorizing named entities like people, organizations, and
locations in news articles.
5. Machine Translation:
● Steps/Processes:
1. Data Collection: Gather parallel corpora containing texts in multiple
languages.
2. Preprocessing: Tokenize and clean the text data.
3. Alignment: Align corresponding sentences or phrases in the source and
target languages.
4. Model Training: Train a translation model using neural networks or statistical
methods.
5. Evaluation: Assess the quality of translations using metrics like BLEU score.
● Example: Translating English text into French or vice versa.
6. Information Retrieval:
● Steps/Processes:
1. Indexing: Create an index of words or phrases in the text corpus.
2. Query Processing: Process user queries to identify relevant terms and
concepts.
3. Ranking: Rank documents based on their relevance to the query using
techniques like TF-IDF or BM25.
● Example: Retrieving relevant web pages from a search engine in response to a user
query.
7. Question Answering System:
● Steps/Processes:
1. Question Analysis: Analyze the structure and meaning of the user's question.
2. Information Retrieval: Retrieve relevant information from a knowledge base
or text corpus.
3. Answer Extraction: Extract the answer from the retrieved information based
on the question type (e.g., factoid, definition).
● Example: Answering factual questions like "Who is the president of France?" or "What
is the capital of Japan?"
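The lexicon-based sentiment approach from application 3 above can be sketched with NLTK's VADER analyzer; the thresholds on the compound score are conventional choices, not mandated values.

```python
# Lexicon-based sentiment scoring with NLTK's VADER analyzer.
# Assumes a one-time: import nltk; nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I love my new phone!")
print(scores)  # includes 'neg', 'neu', 'pos', and a 'compound' score

# Conventional thresholds on the compound score
if scores["compound"] >= 0.05:
    print("positive")
elif scores["compound"] <= -0.05:
    print("negative")
else:
    print("neutral")
```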
