NLP Viva
CHAPTER 1
The need for NLP. Generic NLP system, Levels of NLP, Stages in building a Natural Language
Processing System. Challenges and ambiguities in NLP Design
CHAPTER 2
Probability Theory, Conditional Probability and Independence, Bayes Rule, Random
Variables, Probability Distributions, Statistics, Counting, Frequency, Mean and Variance, English
Grammar, Parts of Speech, Phrase Structures
Probability Theory:
● Conditional Probability: The probability of an event A given that another event B has already
occurred, written P(A | B) = P(A ∩ B) / P(B) for P(B) > 0.
● Example: The probability of rain (A) given that the sky is cloudy (B) might be higher than the
probability of rain without considering the sky condition.
● Independence: Events A and B are independent if the occurrence of one does not affect the
probability of the other, that is, P(A ∩ B) = P(A) P(B).
● Example: When flipping a fair coin twice, the outcome of the first flip does not influence the
outcome of the second flip.
Bayes' Rule:
● Definition: Bayes' Rule allows us to update our beliefs about the probability of an event based
on new evidence: P(A | B) = P(B | A) P(A) / P(B).
● Example: In medical diagnosis, Bayes' Rule can be used to update the probability of a
disease given the results of a diagnostic test.
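The medical-diagnosis example can be worked through numerically. The sketch below is illustrative only: the prevalence, sensitivity, and false-positive rate are made-up numbers, and `posterior` is a hypothetical helper name.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(disease | positive test) using Bayes' Rule."""
    # P(positive) by the law of total probability:
    # true positives among the sick plus false positives among the healthy.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive


# Assumed values: 1% of people have the disease, the test detects it 95% of
# the time, and it wrongly flags 5% of healthy people.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(disease | positive) = {p:.3f}")
```

With these assumed numbers the posterior is only about 0.16: even after a positive test the disease remains unlikely, because the prior is so low.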
Random Variables:
● Definition: A random variable is a variable whose possible values are outcomes of a random
phenomenon.
● Example: In rolling a fair six-sided die, the random variable X could represent the number
showing on the die.
Probability Distributions:
● Definition: A probability distribution describes how the values of a random variable are
distributed.
● Example: The probability distribution of rolling a fair six-sided die is uniform, meaning each
outcome (1 through 6) has an equal probability of 1/6.
Statistics:
● English Grammar: English grammar encompasses the rules and structures governing the
English language.
● Parts of Speech: Words in English are categorized into different parts of speech such as
nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections.
● Example: In the sentence "The cat sat on the mat," "cat" is a noun, "sat" is a verb, "on" is a
preposition, and so on.
● Phrase Structures: Phrase structures describe the arrangement of words into meaningful
units such as noun phrases, verb phrases, and prepositional phrases.
● Example: In the sentence "The big brown dog chased the squirrel," "the big brown dog" is a
noun phrase, and "chased the squirrel" is a verb phrase.
CHAPTER 3
Tokenization:
● Definition: Tokenization is the process of breaking text into smaller units called tokens.
● Steps/Process:
● Identify delimiters such as spaces, punctuation marks, etc.
● Split the text based on these delimiters to form tokens.
● Example: Sentence: "The quick brown fox jumps over the lazy dog."
● Tokens: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"].
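A minimal sketch of the two steps above, assuming that whitespace and punctuation are the only delimiters. The function name `tokenize` is illustrative, not taken from any particular library.

```python
import re


def tokenize(text):
    """Split text into word tokens, treating whitespace and punctuation as delimiters."""
    # \w+ matches runs of letters, digits, and underscores; everything else
    # (spaces, periods, commas, ...) acts as a delimiter and is dropped.
    return re.findall(r"\w+", text)


tokens = tokenize("The quick brown fox jumps over the lazy dog.")
print(tokens)
```

This simple rule splits contractions such as "don't" into two tokens, so practical tokenizers usually add special cases on top of it.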
Segmentation:
● Definition: Segmentation involves dividing a continuous piece of text into smaller meaningful
units, such as sentences or words.
● Steps/Process:
● Identify boundaries between units, such as periods for sentence segmentation or
spaces for word segmentation.
● Use these boundaries to segment the text.
● Example: Text: "Natural language processing is an exciting field. It has many applications."
● Segmented sentences: ["Natural language processing is an exciting field.", "It has
many applications."]
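Sentence segmentation as described above can be sketched with a regular expression. This assumes that every ".", "!" or "?" followed by whitespace ends a sentence, which fails on abbreviations such as "Dr."; `split_sentences` is a hypothetical helper name.

```python
import re


def split_sentences(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sentences = split_sentences(
    "Natural language processing is an exciting field. It has many applications."
)
print(sentences)
```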
Lemmatization:
● Definition: Lemmatization is the process of reducing words to their base or dictionary form
(lemma).
● Steps/Process:
● Identify the part of speech of the word.
● Apply morphological rules to reduce the word to its lemma.
● Example: Word: "running"
● Lemmatized form (lemma): "run"
Edit Distance:
● Definition: Edit distance measures how dissimilar two strings are by counting the
minimum number of operations (insertion, deletion, substitution) required to transform one
string into the other; a smaller distance means more similar strings.
● Steps/Process:
● Define the operations (insertion, deletion, substitution) and their costs.
● Calculate the minimum number of operations required using dynamic programming.
● Example: Calculate the edit distance between "kitten" and "sitting".
● Operations:
● Substitution: 'k' -> 's' (kitten -> sitten)
● Substitution: 'e' -> 'i' (sitten -> sittin)
● Insertion: '' -> 'g' at the end (sittin -> sitting)
● Edit distance: 3
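The dynamic-programming step can be sketched as follows, assuming every insertion, deletion, and substitution costs 1 (some textbooks charge 2 for a substitution, which would change the result).

```python
def edit_distance(source, target):
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution or match
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))  # 3
```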
Collocations:
● Definition: Collocations are words that frequently co-occur together in a text and exhibit
some form of semantic association.
● Steps/Process:
1. Analyze a corpus to identify pairs of words that occur together more frequently than
expected by chance.
● Example: "Strong coffee" is a collocation because the words "strong" and "coffee" frequently
occur together.
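One common way to test "more frequently than expected by chance" is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. The sketch below is a simplified illustration on a made-up toy corpus. PMI is only one of several possible measures, and the `min_count` cutoff is an assumption added because PMI overrates pairs seen only once.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimate is unreliable
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # PMI > 0 means the pair occurs more often than independence predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Hypothetical toy corpus, lowercased and already tokenized.
corpus = "i like strong coffee and she likes strong coffee but he likes weak tea".split()
scores = bigram_pmi(corpus)
print(scores)
```

On this corpus only ("strong", "coffee") passes the cutoff, and its PMI is positive.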
Porter Stemmer:
● Definition: The Porter Stemmer is a rule-based algorithm that strips suffixes from words to obtain
their stem, which need not be a valid dictionary word (for example, "happiness" becomes "happi").
● Steps/Process:
● Apply a series of rules to remove common suffixes from words.
● Example: Word: "running"
● Stem: "run"
N-gram Language Model:
● Definition: An N-gram language model estimates the probability of a word given the previous
N-1 words in a sequence of text.
● Steps/Process:
● Break the text into sequences of N consecutive words (N-grams).
● Calculate the probability of each word given the preceding N-1 words.
● Example:
● Text: "The quick brown fox jumps over the lazy dog."
● Trigram (3-gram) language model predicts the next word given the previous two
words.
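A trigram model like the one in the example can be sketched with plain counts. The sketch assumes maximum-likelihood estimation with no smoothing, so any unseen context or word gets probability 0. The function names are illustrative.

```python
from collections import Counter, defaultdict


def train_trigram_model(tokens):
    """Map each two-word context to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        following[(w1, w2)][w3] += 1
    return following


def trigram_prob(model, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for an unseen context."""
    context_counts = model.get((w1, w2))
    if not context_counts:
        return 0.0
    return context_counts[w3] / sum(context_counts.values())


tokens = "the quick brown fox jumps over the lazy dog".split()
model = train_trigram_model(tokens)
print(trigram_prob(model, "the", "quick", "brown"))  # 1.0
```

Because the training text is a single sentence, every seen context has exactly one continuation; a larger corpus would spread the probability mass over several next words.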
Morphological Analysis:
● Definition: Morphological analysis involves analyzing the structure and form of words,
particularly how they are constructed from smaller meaningful units called morphemes.
● Steps/Process:
1. Identify morphemes (smallest units of meaning) within words.
2. Analyze how these morphemes combine to form the word's meaning.
● Example: In the word "unhappiness", "un-" is a prefix indicating negation, "happy" is the root
word, and "-ness" is a suffix indicating a state or quality.
CHAPTER 4
Tag set for English, Penn Treebank, Introduction to Parts of Speech Tagging (POST),
Markov Processes, Hidden Markov Models (HMM), Parts of Speech Tagging using Hidden Markov
Models, Viterbi Algorithm
Tag set for English, Penn Treebank:
● Definition: A tag set is a collection of tags assigned to words in a corpus to indicate their
grammatical categories, such as parts of speech.
● Example: In the Penn Treebank tag set, "NN" represents a singular noun, "VB" represents a
verb in the base form, "JJ" represents an adjective, and so on.
Introduction to Parts of Speech Tagging (POST):
● Definition: POS tagging is the process of assigning a part-of-speech tag to each word in a text.
● Example: In the sentence "The cat sat on the mat," POS tagging would tag "The" as a
determiner (DT), "cat" as a noun (NN), "sat" as a past-tense verb (VBD), "on" as a preposition (IN),
"the" as a determiner (DT), and "mat" as a noun (NN).
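Markov Processes, Hidden Markov Models (HMM):
● Definition: A Markov process is a stochastic process in which the next state depends only on
the current state. In a Hidden Markov Model the states are not observed directly and must be
inferred from the observations they emit.
Parts of Speech Tagging using Hidden Markov Models: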
● Process:
1. Model Construction: Build an HMM with states representing parts of speech (e.g.,
noun, verb, adjective) and emissions representing words.
2. Training: Estimate the parameters of the HMM (transition probabilities between POS
states and emission probabilities of words given POS states) using a labeled training
corpus.
3. Decoding: Given a sequence of words, use the Viterbi algorithm to find the most
likely sequence of POS tags.
Viterbi Algorithm:
● Definition: The Viterbi algorithm is a dynamic programming algorithm used to find the most
likely sequence of hidden states in a Hidden Markov Model.
● Steps:
1. Initialization: Initialize probabilities for the initial states.
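2. Recursion: For each word in the sequence and each state, compute the probability of the
best path ending in that state (best previous-path probability × transition probability ×
emission probability) and record which previous state produced it.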
3. Termination: Select the most likely final state.
4. Backtracking: Trace back through the sequence to find the most likely path of states.
Example: Let's say we have an HMM for POS tagging with states representing "noun" (N), "verb" (V),
and "adjective" (Adj). Given the sentence "The cat sat on the mat":
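One way to make this example concrete is the sketch below. Because N, V, and Adj alone cannot tag "The" or "on", it assumes a small Penn-Treebank-style tag set (DT, NN, VBD, IN) instead, and every probability in it is made up for illustration rather than estimated from a corpus.

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for words under the given HMM."""
    # Initialization: best[t][s] is the probability of the best path that
    # ends in state s after emitting words[0..t].
    best = [{s: start_p[s] * emit_p[s].get(words[0], 0.0) for s in states}]
    back = [{}]
    # Recursion: extend the best path into every state, one word at a time.
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] * trans_p[p].get(s, 0.0)) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score * emit_p[s].get(words[t], 0.0)
            back[t][s] = prev
    # Termination: pick the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    # Backtracking: follow the stored backpointers to recover the path.
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy parameters (not learned from data).
states = ["DT", "NN", "VBD", "IN"]
start_p = {"DT": 0.7, "NN": 0.2, "VBD": 0.05, "IN": 0.05}
trans_p = {
    "DT": {"NN": 0.9, "VBD": 0.05, "IN": 0.05},
    "NN": {"VBD": 0.6, "IN": 0.3, "NN": 0.1},
    "VBD": {"IN": 0.6, "DT": 0.3, "NN": 0.1},
    "IN": {"DT": 0.8, "NN": 0.2},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"cat": 0.5, "mat": 0.5},
    "VBD": {"sat": 1.0},
    "IN": {"on": 1.0},
}

tags = viterbi("the cat sat on the mat".split(), states, start_p, trans_p, emit_p)
print(tags)
```

In this toy model each word can be emitted by only one tag, so the result is determined by the emissions alone. With ambiguous words the transition probabilities decide between competing tags.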
CHAPTER 5
Lexical Semantics, ambiguous words, word senses, Relations between senses: synonym, antonym,
reversives, hyponym, hypernym, meronym, structured polysemy, metonymy, zeugma.
Introduction to WordNet, gloss, synset, sense relations in WordNet. Cosine distance between
documents. Word sense disambiguation.
Lexical Semantics:
● Definition: Lexical semantics is the branch of linguistics concerned with the meaning of
words and their relationships with other words.
● Example: Understanding that the word "bank" can refer to a financial institution or the side of
a river.
● Structured Polysemy: When a word has multiple meanings that are systematically
related.
● Example: "Bank" can refer to a financial institution or to the building that houses it ("The
bank raised its rates" vs. "I walked into the bank"). The river-bank sense, by contrast, is an
unrelated homonym.
● Metonymy: A figure of speech in which a word is replaced by another word closely
associated with it.
● Example: Referring to the "crown" to mean a monarch or royal authority.
● Zeugma: A figure of speech in which a word applies to two others in different senses.
● Example: "She broke his heart and his car."
Introduction to WordNet:
● Definition: WordNet is a lexical database of English nouns, verbs, adjectives, and adverbs
grouped into sets of cognitive synonyms called synsets.
● Gloss: A brief definition or explanation of a word or synset.
● Example: The gloss for the synset {dog, domestic dog, Canis familiaris} is "a member of the
genus Canis (probably descended from the common wolf) that has been domesticated by
man since prehistoric times."
Cosine Distance between Documents:
● Definition: Cosine similarity is the cosine of the angle between two documents' feature
vectors in a high-dimensional space; cosine distance is 1 minus that similarity, so similar
documents have a distance close to 0.
● Example: Cosine distance can be used in text mining to compare documents
based on the frequency of the words they contain.
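A minimal sketch of cosine distance over raw word counts. It assumes whitespace tokenization and lowercasing, with no TF-IDF weighting or stop-word removal; the function names are illustrative.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the word-count vectors of two documents."""
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares nothing with any other
    return dot / (norm_a * norm_b)


def cosine_distance(doc_a, doc_b):
    """1 minus the cosine similarity; near 0 for similar documents, 1 for disjoint ones."""
    return 1.0 - cosine_similarity(doc_a, doc_b)


print(cosine_distance("the cat sat on the mat", "the cat lay on the rug"))
```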
Word Sense Disambiguation (WSD):
● Definition: WSD is the task of determining the correct sense of an ambiguous word in
context.
● Example: Given the sentence "He went to the bank to deposit his money," WSD would
select the financial-institution sense of "bank" rather than the river-side sense, because of
context words such as "deposit" and "money".
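One simple approach to this task is a simplified Lesk-style method: choose the sense whose gloss shares the most words with the context. The sketch below is an illustration only. The two glosses and the stop-word list are hand-written assumptions, not WordNet entries, and no stemming is applied.

```python
# Hypothetical stop-word list and hand-written glosses for two senses of "bank".
STOPWORDS = {"a", "an", "the", "to", "of", "his", "he", "or", "and", "that", "in", "is"}
SENSES = {
    "financial institution": "an institution that accepts deposits of money and makes loans",
    "river side": "the sloping land beside a river or lake",
}


def content_words(text):
    """Lowercase the text, strip simple punctuation, and drop stop words."""
    return {word.strip(".,").lower() for word in text.split()} - STOPWORDS


def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context words."""
    context_set = content_words(context)
    return max(senses, key=lambda s: len(context_set & content_words(senses[s])))


sense = simplified_lesk("He went to the bank to deposit his money", SENSES)
print(sense)
```

Here the only overlapping word is "money" ("deposit" does not match "deposits" without stemming), which is still enough to prefer the financial sense.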
CHAPTER 6
Reference resolution: Discourse model, Reference Phenomenon, Syntactic and Semantic Constraints
on coreference. Applications of NLP: Categorization, Summarization, Sentiment Analysis, Named
Entity Recognition, Machine Translation, Information Retrieval, Question Answer System
Reference Resolution:
Applications:
1. Categorization:
● Steps/Processes:
1. Data Collection: Gather a large corpus of text documents.
2. Preprocessing: Clean and tokenize the text data.
3. Feature Extraction: Extract features from the text, such as word frequency or
TF-IDF scores.
4. Model Training: Train a classification algorithm (e.g., Naive Bayes, Support
Vector Machine) using labeled data.
5. Evaluation: Assess the performance of the model using metrics like
accuracy, precision, recall, or F1-score.
● Example: Classifying news articles into categories like politics, sports, or
entertainment.
2. Summarization:
● Steps/Processes:
1. Text Parsing: Parse the input text to identify important sentences or phrases.
2. Content Selection: Select the most informative sentences or phrases based
on criteria like relevance or importance.
3. Summary Generation: Construct a concise summary using the selected
sentences or phrases.
● Example: Automatically generating a summary of a news article or research paper.
3. Sentiment Analysis:
● Steps/Processes:
1. Text Preprocessing: Clean and tokenize the text data.
2. Sentiment Lexicon: Use a sentiment lexicon or dictionary to assign sentiment
scores to words.
3. Sentiment Classification: Apply machine learning algorithms (e.g., Logistic
Regression, Neural Networks) to classify text into positive, negative, or
neutral sentiments.
● Example: Analyzing customer reviews of a product to determine overall sentiment
(positive or negative).
4. Named Entity Recognition (NER):
● Steps/Processes:
1. Tokenization: Break the text into words or tokens.
2. POS Tagging: Assign Part-of-Speech tags to each word in the text.
3. Named Entity Classification: Classify words or phrases as named entities
(e.g., person names, organization names, location names).
● Example: Identifying and categorizing named entities like people, organizations, and
locations in news articles.
5. Machine Translation:
● Steps/Processes:
1. Data Collection: Gather parallel corpora containing texts in multiple
languages.
2. Preprocessing: Tokenize and clean the text data.
3. Alignment: Align corresponding sentences or phrases in the source and
target languages.
4. Model Training: Train a translation model using neural networks or statistical
methods.
5. Evaluation: Assess the quality of translations using metrics like BLEU score.
● Example: Translating English text into French or vice versa.
6. Information Retrieval:
● Steps/Processes:
1. Indexing: Create an index of words or phrases in the text corpus.
2. Query Processing: Process user queries to identify relevant terms and
concepts.
3. Ranking: Rank documents based on their relevance to the query using
techniques like TF-IDF or BM25.
● Example: Retrieving relevant web pages from a search engine in response to a user
query.
7. Question Answer System:
● Steps/Processes:
1. Question Analysis: Analyze the structure and meaning of the user's question.
2. Information Retrieval: Retrieve relevant information from a knowledge base
or text corpus.
3. Answer Extraction: Extract the answer from the retrieved information based
on the question type (e.g., factoid, definition).
● Example: Answering factual questions like "Who is the president of France?" or "What
is the capital of Japan?"
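The indexing, query processing, and ranking steps listed under Information Retrieval can be sketched with a basic TF-IDF score. This is a simplified illustration on three made-up documents. It uses raw term counts and an unsmoothed logarithmic IDF, which is only one of several TF-IDF variants, and the function names are hypothetical.

```python
import math
from collections import Counter


def build_index(docs):
    """Indexing: return per-document term counts and an IDF weight for every term."""
    term_counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {term: math.log(len(docs) / df) for term, df in doc_freq.items()}
    return term_counts, idf


def rank(query, term_counts, idf):
    """Ranking: return document indices sorted from most to least relevant."""
    terms = query.lower().split()  # query processing: lowercase and split
    scores = [
        sum(counts[t] * idf.get(t, 0.0) for t in terms) for counts in term_counts
    ]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy collection.
docs = [
    "tokyo is the capital of japan",
    "paris is the capital of france",
    "the viterbi algorithm decodes hidden markov models",
]
term_counts, idf = build_index(docs)
ranking = rank("capital of japan", term_counts, idf)
print(ranking)
```

The rare term "japan" has the highest IDF, so the first document outranks the second even though both contain "capital" and "of".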