NLP Notes
CHAPTER – 1
Language is studied in different academic disciplines, each addressing specific problems with unique methodologies.
Linguists study language structure.
Psycholinguists focus on human language processes.
Philosophers explore language meaning and its connection to beliefs, goals, and intentions.
Computational linguists aim to develop computational theories of language, incorporating algorithms and data structures from
computer science.
Scientific Motivation:
Objective: Understand language comprehension and production.
Rationale: Traditional disciplines are limited; computational models offer testing and improvement.
Role of Computational Models: Predict human behavior for refining understanding.
Interdisciplinary Collaboration: Involves linguists, psycholinguists, philosophers, and computer scientists.
Emerging Field: Drives interdisciplinary research, advancing cognitive science.
Practical or Technological Motivation:
Objective: Transform computer usage with natural language processing.
Rationale: Access vast human knowledge in linguistic form; enhance accessibility to complex systems with natural language
interfaces.
ELIZA (Mid-1960s)
Illustration: ELIZA serves as a cautionary example, highlighting the importance of accurate evaluation of AI capabilities.
User's Dialogue: ELIZA simulates therapeutic interaction, responding empathetically to user input.
Operation: ELIZA uses keyword matching to generate contextually relevant responses.
Nature: Despite initial impression, ELIZA lacks genuine understanding and relies on scripted responses.
Program Limitations: ELIZA lacks comprehension and context retention, producing nonsensical responses.
Evaluation Challenges: Flaws in ELIZA's performance may not be readily apparent in scripted transcripts.
Challenge for Progress: Advancing beyond ELIZA-style systems requires theoretical advancements and rigorous evaluation measures.
CHAPTER – 2
Levels of Language Analysis:
Components of natural language processing (NLP):
1. Morphological and Lexical Analysis: This component deals with the analysis of the structure of words and their components
(morphemes).
Tokenization: Breaking text into smaller units, such as words or subwords.
Stemming: Reducing words to their root form to normalize variations.
Lemmatization: Reducing words to their base or dictionary form.
Stopword Removal: Eliminating common words (e.g., "the", "is") that carry little semantic value.
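A short illustration of these lexical-analysis steps, sketched with NLTK (assuming the relevant NLTK data packages, such as 'punkt', 'wordnet', and 'stopwords', have been downloaded; the sentence is illustrative):
```python
# A minimal sketch of tokenization, stemming, lemmatization, and stopword removal.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

text = "The children are studying natural languages"
tokens = nltk.word_tokenize(text)                            # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]            # 'studying' -> 'studi'
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # 'children' -> 'child'
content = [t for t in tokens if t.lower() not in stopwords.words('english')]
print(stems)
print(lemmas)
print(content)   # stopwords such as 'The' and 'are' removed
```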
2. Syntactic Analysis: Also known as parsing, this component focuses on analyzing the grammatical structure of sentences to
understand how words relate to each other syntactically.
Part-of-Speech Tagging: Assigning grammatical categories (e.g., noun, verb) to words.
Parsing: Analyzing the grammatical structure of sentences to understand relationships between words and phrases.
Grammar Checking: Identifying and correcting grammatical errors in text.
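A quick POS-tagging sketch with NLTK (assuming the 'punkt' and 'averaged_perceptron_tagger' data packages are installed); the tags shown are what the default tagger typically produces:
```python
import nltk

tokens = nltk.word_tokenize("The dog chased the cat")
print(nltk.pos_tag(tokens))
# Typically: [('The', 'DT'), ('dog', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('cat', 'NN')]
```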
3. Semantic Analysis: Semantic analysis aims to understand the meaning of words, phrases, and sentences in a given context.
Named Entity Recognition (NER): Identifying and categorizing entities like people, organizations, and locations.
Word Sense Disambiguation: Determining the correct meaning of ambiguous words based on context.
Semantic Role Labelling: Identifying the roles of words or phrases in sentences, such as agents, patients, and instruments.
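As an example of word sense disambiguation, NLTK ships a simple Lesk implementation that picks the WordNet sense whose dictionary gloss overlaps the context most (a weak baseline, so the chosen sense may be imperfect; requires the 'wordnet' data package):
```python
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("I deposited money in the bank")
sense = lesk(context, 'bank')          # returns a WordNet Synset
print(sense, '-', sense.definition())
```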
4. Discourse Integration: This component focuses on understanding how sentences or utterances relate to each other in a larger
context, such as a paragraph or conversation.
Anaphora Resolution: Determining the referents of pronouns or other referring expressions.
Coreference Resolution: Identifying expressions that refer to the same entity.
Discourse Parsing: Analyzing the structure of discourse units like paragraphs or conversations.
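A toy sketch of anaphora resolution by recency: resolve a pronoun to the most recently mentioned candidate noun. Real resolvers also use gender/number agreement, syntax, and salience; the noun inventory here is hypothetical:
```python
CANDIDATE_NOUNS = {'printer', 'package', 'tray'}   # hypothetical noun inventory

def resolve_pronoun(tokens, pronoun_pos):
    # Recency heuristic: scan backwards from the pronoun for the nearest noun.
    for i in range(pronoun_pos - 1, -1, -1):
        if tokens[i] in CANDIDATE_NOUNS:
            return tokens[i]
    return None

text = "I bought a printer today . On installation , it did not work".split()
print(resolve_pronoun(text, text.index('it')))   # printer
```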
5. Pragmatic Analysis: Pragmatic analysis deals with understanding the intentions, implicatures, and contextual effects of language use
beyond the literal meaning of words and sentences.
Speech Act Recognition: Identifying the intended illocutionary force of utterances (e.g., assertion, question, request).
Conversational Implicature Analysis: Recognizing implied meanings in conversation beyond literal interpretation.
Discourse Coherence Analysis: Evaluating the coherence and relevance of discourse units to maintain meaningful
communication.
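A crude sketch of speech act recognition from surface cues alone (punctuation and opening phrases; the heuristics are hypothetical, and real systems classify illocutionary force with trained models):
```python
def speech_act(utterance):
    # Classify the intended illocutionary force from surface form.
    u = utterance.strip()
    if u.endswith('?'):
        return 'question'
    if u.lower().startswith(('please', 'could you', 'can you')):
        return 'request'
    return 'assertion'

print(speech_act('Does Kingfisher serve Hyderabad?'))   # question
print(speech_act('Please book me a ticket.'))           # request
print(speech_act('I want to go to a hill station.'))    # assertion
```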
3. Derivations:
Grammars derive sentences through sequences of rewrite rules.
Sentence generation constructs legal sentences using derivations.
Parsing identifies sentence structure given a grammar.
4. Generation and Parsing Processes:
Generation: Randomly choose rewrite rules starting from the start symbol until a sequence of words is obtained.
Parsing: Two methods—top-down and bottom-up.
Top-down: Start with the start symbol and search through different rewrite possibilities until the input
sentence is generated or all possibilities are explored.
Bottom-up: Start with the words in the sentence and use rewrite rules backward to reduce the sequence of
symbols until it consists solely of the start symbol.
5. Parse Tree Representation:
Parse trees serve as records of CFG rules that account for the structure of sentences.
They provide a visual representation of the parsing process, whether top-down or bottom-up.
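The generation and parsing processes can be sketched with NLTK's CFG utilities (the grammar below is a toy example): generate enumerates sentences derivable from the start symbol, and a chart parser recovers the parse tree for a given sentence.
```python
import nltk
from nltk import CFG
from nltk.parse.generate import generate

grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'saw' | 'chased'
""")

# Generation: derive legal sentences by expanding rewrite rules from S.
for sent in generate(grammar, n=5):
    print(' '.join(sent))

# Parsing: recover the parse tree that accounts for the sentence's structure.
parser = nltk.ChartParser(grammar)
for tree in parser.parse('the dog chased a cat'.split()):
    tree.pretty_print()
```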
Top-Down Parser
1. Parsing Algorithm Overview:
Parsing algorithm searches for combinations of grammatical rules to generate a tree structure representing the input
sentence.
Top-down parsing method is discussed, focusing on a simple approach that returns a yes or no answer regarding whether
a tree could be built for the sentence.
2. State Representation:
Parse state is represented as a symbol list and a word position in the sentence.
State transitions occur based on operations applied, such as applying grammar rules or matching lexical categories.
3. Parsing Procedure:
Algorithm starts with the initial state and systematically explores possible states until success or failure.
Lexicon stores possible categories for each word to efficiently guide state transitions.
4. Backtracking Technique:
Backtracking is used to explore all possible new states if a current state cannot lead to a solution.
If a dead-end is reached, the algorithm can switch to a backup state and continue parsing.
5. Search Procedure:
Parsing is likened to a search problem in AI, where the possibilities list represents the search space.
Depth-first and breadth-first search strategies are discussed, each with its advantages and potential pitfalls.
6. Left-Recursive Rules:
Left-recursive rules can cause issues in parsing, potentially leading to infinite loops.
Strategies like depth-first search may encounter problems with left-recursive rules unless explicit checks are incorporated.
7. Implementation Considerations:
Many parsers prefer the depth-first strategy due to its efficiency in memory usage and reduced need for backup states.
The algorithm starts with the initial state ((S) 1) and no backup states. It then repeats the following steps:
1. Select the current state: take the first state off the possibilities list and call it C. If the possibilities list is empty, then the
algorithm fails (that is, no successful parse is possible).
2. If C consists of an empty symbol list and the word position is at the end of the sentence, then the algorithm succeeds.
3. Otherwise, generate the next possible states:
If the first symbol on the symbol list of C is a lexical symbol, and the next word in the sentence can be in that class, then
create a new state by removing the first symbol from the symbol list and updating the word position, and add it to the
possibilities list.
Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new state for each rule in the grammar
that can rewrite that non-terminal symbol and add them all to the possibilities list.
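A minimal Python sketch of this procedure, with a toy grammar and lexicon (both hypothetical). A state is a (symbol list, word position) pair, and the possibilities list is treated as a stack, giving depth-first search with backtracking:
```python
GRAMMAR = {                       # non-terminal -> list of right-hand sides
    'S':  [['NP', 'VP']],
    'NP': [['ART', 'N']],
    'VP': [['V', 'NP']],
}
LEXICON = {                       # word -> possible lexical categories
    'the': ['ART'], 'dog': ['N'], 'cat': ['N'], 'chased': ['V'],
}

def top_down_parse(words):
    possibilities = [(('S',), 0)]            # initial state ((S) 1)
    while possibilities:                     # empty list => failure
        symbols, pos = possibilities.pop(0)  # select the current state C
        if not symbols:
            if pos == len(words):            # empty symbol list at end of input
                return True                  # success
            continue                         # dead end; try a backup state
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:                 # non-terminal: one new state per rule
            for rhs in GRAMMAR[first]:
                possibilities.insert(0, (tuple(rhs) + rest, pos))
        elif pos < len(words) and first in LEXICON.get(words[pos], []):
            possibilities.insert(0, (rest, pos + 1))   # match lexical symbol
        # otherwise this state generates nothing: backtracking happens
        # automatically when the next state is popped off the list
    return False

print(top_down_parse('the dog chased the cat'.split()))   # True
print(top_down_parse('dog the chased'.split()))           # False
```
Note that a left-recursive rule such as NP → NP PP would send this depth-first version into an infinite loop, exactly as point 6 above warns, unless an explicit check is added.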
Introduction:
Semantic Compositionality: The principle that the meaning of a whole sentence is composed of the meanings of its parts is
introduced. However, natural languages do not always strictly adhere to this principle, owing to phenomena such as
collocations and idioms.
Lexical Semantics: Focuses on the meaning of individual words and their relationships. Early theories decompose lexical meanings
into semantic primitives, but this approach is found inadequate for handling compositional semantics.
Model Theoretic Semantics: Inspired by logic, this approach creates models of the world to determine the truth of sentences. It's
effective in studying pragmatics as well as semantics.
MEANING REPRESENTATION:
Meaning Representation Languages: Bridge linguistic inputs and non-linguistic knowledge. Examples: FOPC, semantic networks.
Characteristic:
Verifiability: Ensure that meaning representations can be checked against a knowledge base.
For example, in the sentence "Does Kingfisher serve Hyderabad?", the representation "Serves (Kingfisher, Hyderabad)"
can be compared against a database to verify its truthfulness.
Unambiguous: Meaning representations should have only one interpretation.
For instance, although the sentence "I want to go to a hill station" is vague, the representation assigned to it should still
provide a single, clear indication of the user's intention, despite the lack of specificity.
Vagueness Handling: Allow for a certain level of vagueness in representations.
In the sentence "I want to go to a hill station," the representation should capture the general desire to visit a hill station
without specifying a particular location.
Canonical Form: Ensure consistency in representations despite different expressions.
For example, the sentences "Does Kingfisher offer a flight to Hyderabad?" and "Does Kingfisher have a flight to
Hyderabad?" should both be represented in a canonical form to indicate that they convey the same meaning.
Inference and Variables: Support deriving conclusions and using variables for flexibility. In the sentence "I would like to
catch a flight to Hyderabad," the representation "Goes(x, Hyderabad)" allows for the use of a variable (x) to represent any
flight that goes to Hyderabad.
Expressiveness: Enable representation of diverse content effectively. This means the language should be capable of
representing various types of linguistic inputs, such as questions, statements, requests, etc., with accuracy and richness in
meaning.
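A toy sketch of verifiability and variable-based inference: the representation "Serves(Kingfisher, Hyderabad)" is checked against a small knowledge base of ground facts (the facts themselves are hypothetical):
```python
KB = {('Serves', 'Kingfisher', 'Hyderabad'),
      ('Serves', 'Kingfisher', 'Delhi')}

def verify(pred, *args):
    # Verifiability: compare a representation against the knowledge base.
    return (pred, *args) in KB

def query(pred, dest):
    # Inference with a variable: all x such that pred(x, dest) holds in the KB.
    return [a for (p, a, b) in KB if p == pred and b == dest]

print(verify('Serves', 'Kingfisher', 'Hyderabad'))   # True
print(query('Serves', 'Hyderabad'))                  # ['Kingfisher']
```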
Semantic Grammars:
Address limitations of syntax-driven semantic analysis by integrating semantics into grammar.
Developed specifically for handling semantics, unlike traditional grammars.
Rules and constituents directly correspond to entities and activities in the domain.
Problems with Syntax-driven Semantic Analysis:
Semantic elements often distributed widely across parse trees.
Parse trees contain constituents not crucial for semantic distinctions.
Syntactic constituents may lead to meaningless representations.
Advantages of Semantic Grammars:
Focused Semantic Rules: Rules are tailored to specific domain semantics.
Effective Handling of Anaphors and Ellipsis: Facilitates better treatment of these linguistic phenomena.
Example:
Request: "I want to go from Delhi to Chennai on 24th December."
Semantic Grammar Rule: InfoRequest → User wants to go from City to City TimeExpr.
Non-terminals like "User," "City," and "TimeExpr" eliminate the need for lambda expressions.
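A minimal sketch of how such a rule might be applied, approximating the non-terminals City and TimeExpr with regular-expression groups (the pattern and frame name are illustrative):
```python
import re

INFO_REQUEST = re.compile(
    r"I want to go from (?P<source>\w+) to (?P<dest>\w+) on (?P<time>[\w ]+)\.?")

m = INFO_REQUEST.match("I want to go from Delhi to Chennai on 24th December.")
if m:
    print({'frame': 'InfoRequest', **m.groupdict()})
# {'frame': 'InfoRequest', 'source': 'Delhi', 'dest': 'Chennai', 'time': '24th December'}
```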
Drawbacks:
Lack of Generality: Domain-specific rules require new grammars for each domain.
Increased Rule Complexity: Specific rules for various semantic distinctions lead to larger rule sets compared to traditional
grammars.
LEXICAL SEMANTICS:
Concerned with the systematic, meaning-related structure of words or lexemes.
Focuses on understanding relationships among lexemes.
Relationships Among Lexemes:
Homonymy: Words with the same form but different, unrelated meanings (e.g., "bank" for river bank or financial
institution).
Polysemy: Words with multiple related meanings (e.g., "chair" for furniture, person, presiding over a discussion).
Hypernymy and Hyponymy: A hypernym is a more general word (e.g., "automobile" for car and truck), while a hyponym is
more specific (e.g., "car" is a hyponym of "automobile").
Antonymy: Words expressing opposite meanings (e.g., "dark" and "light").
Meronymy: Part-whole relationship (e.g., "wall," "ceiling," "floor" as meronyms of "room").
Synonymy: Words with similar meanings, often interchangeable without changing sentence meaning.
WordNet:
Widely known lexical database that organizes words based on these relationships.
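These relations can be explored directly in WordNet through NLTK (assuming the 'wordnet' data package is installed); the outputs noted in the comments are typical but may vary with the WordNet version:
```python
from nltk.corpus import wordnet as wn

car = wn.synsets('car')[0]              # first sense, e.g. Synset('car.n.01')
print(car.hypernyms())                  # e.g. [Synset('motor_vehicle.n.01')]
print(car.part_meronyms()[:3])          # part-whole relations of 'car'
print(wn.synsets('bank')[:2])           # distinct senses of the homonym 'bank'
dark = wn.synsets('dark', pos=wn.ADJ)[0].lemmas()[0]
print(dark.antonyms())                  # e.g. [Lemma('light.a.01.light')]
```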
Cohesion
Definition:
Cohesion binds text together, maintaining its continuity and unity.
Example:
Text: "Yesterday, my friend invited me to her house..."
Cohesion achieved through the use of the pronoun "her," referring back to "my friend's."
Avoidance of Over-specification:
Use of cohesive devices like pronouns prevents unnecessary repetition and over-specification.
Enhances readability and efficiency of communication.
Shared Knowledge:
Communication assumes the presence of shared knowledge between speaker and hearer.
Speaker avoids encoding information already known to the hearer.
Pronominal Reference:
Type of reference used for cohesion.
Example: "her" referring back to "my friend's" in the text.
Other Types of Reference:
Includes ellipsis, which is discussed further below.
Reference
Definition: Reference links a referring expression to another expression in the text.
Anaphoric Reference:
Refers to entities previously introduced in the discourse.
Types:
Indefinite Reference: Introduces a new object.
Definite Reference: Refers to an existing object.
Pronominal Reference: Uses pronouns to refer to entities.
Pronominal Reference:
Example: "It" in "I bought a printer today. On installation, it didn't work properly."
Can be anaphoric or cataphoric.
Pleonastic uses, where "it" refers to no entity (as in "It is raining"), should be distinguished from true pronominal reference.
Demonstrative Reference:
Example: "This" and "that" in "I bought a printer today. This one cost me Rs. 6,000 whereas that one cost me Rs.
12,000."
Quantifier or Ordinal Reference:
Uses quantifiers like "one" or ordinals like "first".
Example: "one" in "I visited a computer shop to buy a printer. I have seen many and now I have to select one."
Inferables:
Referents inferred from explicitly mentioned entities.
Example: "I bought a printer today. On opening the package, I found the paper tray broken."
Generic Reference:
Refers to a whole class instead of a specific entity.
Example: "I saw two laser printers in a shop. They were the fastest printers available."
Usage: Helps maintain coherence and clarity in discourse by establishing connections between expressions.
Ellipsis:
Definition: Ellipsis refers to the omission of a part of a sentence, where the omitted part is understood from the context.
Example:
"Yes I do" in response to "Do you take fish?" instead of "Yes I take fish."
"Do you?" in response to "I know that lady. Do you?" instead of "Do you know that lady?"
Recovery: The reader or hearer uses the surrounding text to understand the omitted part.
Target and Source Clauses:
Target Clause: The clause containing the omitted part.
Source Clause: The clause from which the omitted part is understood.
Ambiguity:
References in the source clause may lead to ambiguities in the target clause.
Example: "Seema loves her mother and Suha does too."
Ambiguity between whether Suha loves Seema's mother or her own mother.
Strict reading: Suha loves Seema's mother.
Sloppy reading: Suha loves her own mother.
Lexical Cohesion:
Definition: Lexical cohesion involves the use of lexical phenomena to achieve cohesion in text. It includes repetition, synonymy,
and hypernymy.
Repetition:
Example: "Ba ba black sheep, Have you any wool? Yes sir, yes sir, Three bags full."
Function: Repetition of words or phrases creates a stylistic effect and reinforces the connection between clauses.
Synonymy:
Definition: The use of words that have the same or similar meanings in a given context.
Example: Using "happy" instead of "joyful" in a sentence.
Function: Enhances clarity and adds variety to the language without changing the meaning.
Hypernymy:
Definition: The use of a super-ordinate term that encompasses other terms.
Example: Using "vehicle" instead of "car" or "bus."
Function: Provides a broader category for referring to specific entities, offering a more general perspective.
CHAPTER – 5 NLG
Direct Machine Translation
1. Monolithic Approach: Direct translation systems focus on specific language pairs and take a monolithic approach to development.
They involve little analysis of the source text and rely primarily on bilingual dictionaries.
2. Steps in Translation:
Morphological Analysis: Remove morphological inflections to obtain the root form of source-language words.
Bilingual Dictionary Lookup: Retrieve target-language words corresponding to the source-language words from a bilingual
dictionary.
Syntactic Rearrangement: Change the word order to match the target language's default structure. For example, in
English to Hindi translation, this involves changing prepositions to postpositions and adjusting the subject-verb-object
order.
3. Example: English Sentence: "Khushbu slept in the garden." Translation Steps:
Word Translation: "Khushbu soyi mein baag"
Syntactic Rearrangement: "Khushbu baag mein soyi"
4. Idiomatization: Adjustments are made to match idiomatic expressions and linguistic conventions of the target language. This
includes handling suffixes and making minor modifications for idiomatic correctness.
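A toy sketch of these steps for the example sentence, with a hypothetical bilingual dictionary (Hindi shown in Roman transliteration, as above) and a hard-coded reordering rule:
```python
BILINGUAL = {'khushbu': 'Khushbu', 'slept': 'soyi',
             'in': 'mein', 'the': '', 'garden': 'baag'}

def direct_translate(sentence):
    # Bilingual dictionary lookup, word by word (no structural analysis).
    words = [BILINGUAL.get(w.lower(), w) for w in sentence.rstrip('.').split()]
    words = [w for w in words if w]          # drop words with no equivalent
    # Syntactic rearrangement: verb to the end, preposition -> postposition.
    subj, verb, prep, obj = words
    return ' '.join([subj, obj, prep, verb])

print(direct_translate("Khushbu slept in the garden."))  # Khushbu baag mein soyi
```
The reordering only works for this one sentence shape, which illustrates the limited structural analysis discussed in the challenges below.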
5. Challenges and Limitations:
Word Ambiguity: Direct translation may lead to ambiguous or nonsensical translations, especially when context and word
senses are not considered.
Limited Structural Analysis: Direct MT systems lack deeper analysis of sentence structure and relationships between
words, leading to lower translation quality.
Language Pair Specificity: These systems are developed for specific language pairs and cannot be easily adapted to new
pairs without significant effort.
Cost and Development: Developing direct MT systems for multiple language pairs can be expensive and time-consuming
due to their monolithic nature.
Rule-Based Machine Translation
Rule-based machine translation systems parse the source text and generate an intermediate representation, which could be a parse tree or an
abstract representation. The target language text is then produced from this intermediate representation. These systems rely on predefined
rules for various linguistic aspects like morphology, lexical selection, syntax, and semantics, hence they are termed rule-based systems.
Components of Rule-Based Machine Translation:
1. Analysis: This stage involves parsing the source text to produce a structure conforming to the rules of the source language. It
includes morphological, syntactic, and sometimes semantic analyses.
2. Transfer: The source language representation is transformed into a target language representation based on predefined rules and
linguistic differences between the languages.
3. Generation: Finally, the target language text is generated using the target language representation.
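A toy sketch of the three stages as separate modules, reusing the earlier English-to-Hindi example (the parse, transfer rules, and mini-lexicon are all hypothetical simplifications):
```python
def analyze(sentence):
    # Analysis: parse the source sentence into a simple structure
    # (a real system would build a full parse tree).
    subj, verb, prep, det, obj = sentence.rstrip('.').split()
    return {'subj': subj, 'verb': verb, 'prep': prep, 'obj': obj}

def transfer(struct, lexicon):
    # Transfer: map the source-language structure into a target-language
    # structure via lexical transfer rules.
    return {k: lexicon.get(v.lower(), v) for k, v in struct.items()}

def generate(struct):
    # Generation: linearize the target structure (SOV order, postposition).
    return ' '.join([struct['subj'], struct['obj'], struct['prep'], struct['verb']])

LEX = {'slept': 'soyi', 'in': 'mein', 'garden': 'baag'}
print(generate(transfer(analyze("Khushbu slept in the garden."), LEX)))
# Khushbu baag mein soyi
```
Because the stages are modular, swapping the transfer rules and lexicon retargets the system to a new language pair, which is the advantage noted under transfer-based translation below.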
Transfer-Based Translation:
In transfer-based translation, the structure of the input text is transformed to match the rules of the target language. This involves
parsing the source text and transferring its structure into the target language.
The transfer component handles language-specific differences between source and target languages, while the generation module
produces the actual target language text.
Transfer systems offer modularity and language independence in analysis and generation, making them suitable for multilingual
environments.
Interlingua-Based Machine Translation:
Inspired by Chomsky's notion of a universal 'deep structure,' interlingua-based MT converts source language text into a language-
independent meaning representation called 'interlingua.'
Translation involves two stages: analysis to represent the source text in interlingua and synthesis to generate target language text
from interlingual representation.
While interlingua systems require more extensive analysis compared to transfer systems, they offer advantages such as meaning-
based representation and applicability in information retrieval.
The major challenge in interlingua-based MT lies in defining a universal interlingua that preserves the meaning of sentences across
languages, considering cultural and linguistic differences.