
NLP Notes

CHAPTER – 1

 Language is studied in different academic disciplines, each addressing specific problems with unique methodologies.
 Linguists study language structure.
 Psycholinguists focus on human language processes.
 Philosophers explore language meaning and its connection to beliefs, goals, and intentions.
 Computational linguists aim to develop computational theories of language, incorporating algorithms and data structures from
computer science.

Scientific Motivation:
 Objective: Understand language comprehension and production.
 Rationale: Traditional disciplines are limited; computational models offer testing and improvement.
 Role of Computational Models: Predict human behavior for refining understanding.
 Interdisciplinary Collaboration: Involves linguists, psycholinguists, philosophers, and computer scientists.
 Emerging Field: Drives interdisciplinary research, advancing cognitive science.
Practical or Technological Motivation:
 Objective: Transform computer usage with natural language processing.
 Rationale: Access vast human knowledge in linguistic form; enhance accessibility to complex systems with natural language
interfaces.

Applications of Natural Language Understanding:


Major Classes:
 Text-based and dialogue-based applications.
Text-Based Applications:
 Examples: Processing written text from various sources like books, newspapers, and emails.
 Tasks include finding relevant documents, extracting information, translating across languages, and summarizing
texts; all of these remain active areas of research.
 System Limitations:
 Keyword-based systems employ simple matching techniques, limiting true understanding.
 Inherent limitations include difficulty handling complex retrieval tasks.
 Crucial Characteristic: Representation and inference are necessary for complex tasks.
Machine Translation Systems:
 Some use pattern matching to associate sequences of words in one language with sequences in another.
 Challenges exist in disambiguating word senses and sentence meanings.
Dialogue-Based Applications:
 Communication involves spoken language or interaction through keyboards.
 Potential applications include question-answering systems, customer service, tutoring, and spoken language control of machines.
 Challenges include maintaining natural dialogue flow and processing techniques.
Distinguishing Speech Recognition and Language Understanding:
 Speech recognition identifies spoken words without general language understanding.
 Language understanding systems process recognized speech for deeper comprehension.
 Techniques discussed apply regardless of input medium.

Evaluating Language Understanding Systems


 System Evaluation Methods:
 Black Box Evaluation:
 Evaluates system performance without examining its internal workings.
 Involves running the program and assessing its effectiveness in performing the intended task.
 Problematic in early research stages as results might be misleading.
 Long-term viability may not be accurately predicted based on short-term performance.
 Practical significance increases only when success rates are consistently high.
 Glass Box Evaluation:
 Examines the internal structure by identifying and evaluating subcomponents.
 Requires consensus on defining natural language system components.
 Currently an area of active development to establish such consensus.
 Challenges in Evaluation:
 Consensus on Components:
 Glass box evaluation relies on defining subcomponents, requiring consensus.
 Ongoing activity in the field to establish agreement on natural language system components.

ELIZA (Mid-1960s)
 Illustration: ELIZA serves as a cautionary example, highlighting the importance of accurate evaluation of AI capabilities.
 User's Dialogue: ELIZA simulates therapeutic interaction, responding empathetically to user input.
 Operation: ELIZA uses keyword matching to generate contextually relevant responses.
 Nature: Despite initial impression, ELIZA lacks genuine understanding and relies on scripted responses.
 Program Limitations: ELIZA lacks comprehension and context retention, producing nonsensical responses.
 Evaluation Challenges: Flaws in ELIZA's performance may not be readily apparent in scripted transcripts.
 Challenge for Progress: Advancing beyond ELIZA-style systems requires theoretical advancements and rigorous evaluation measures.
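
The keyword-matching operation described above can be sketched in a few lines of Python. This is an illustrative toy, not Weizenbaum's original script; the patterns and response templates below are invented for demonstration.

```python
import re

# Illustrative ELIZA-style rules: a keyword pattern and a response template.
# These patterns are invented for demonstration; the original script differed.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
]
DEFAULT = "Please go on."

def eliza_respond(utterance: str) -> str:
    """Return a canned response for the first matching keyword pattern."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return template.format(*match.groups())
    return DEFAULT  # no keyword matched: fall back to a neutral prompt

print(eliza_respond("I am feeling sad"))     # How long have you been feeling sad?
print(eliza_respond("The weather is nice"))  # Please go on.
```

Note how the program never represents what the user said; it only reflects surface patterns back, which is exactly why transcripts can look impressive while no understanding is taking place.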
Levels of Language Analysis:

Representation and Understanding:


 Levels of Representations:
 Syntactic Representation:
 Focuses on representing sentence structure using context-free grammars.
 Illustrated through tree structures, exemplifying the syntactic analysis of sentences.
 Logical Form:
 Represents the context-independent meaning of a sentence.
 Encodes possible word senses and identifies semantic relationships, such as abstract roles like AGENT, THEME,
and TO-POSS.
 Eliminates impossible word senses based on the semantic relationships in a sentence.
 Final Meaning Representation (KR):
 Represents general knowledge about the application domain.
 The system uses this representation to reason about and perform tasks within the domain.
 Question-Answering Application Example:
 Maps a question to a database query in the final knowledge representation.
 Story-Understanding Application Example:
 Maps a sentence into expressions representing the situation described in the sentence.

The Organization of Natural Language Understanding Systems:


 Organized around three levels of representation: syntactic structure, logical form, and final meaning representation.
 Interpretation processes map between these representations.
 Parser:
 Maps a sentence to its syntactic structure and logical form.
 Utilizes knowledge about words and their meanings (lexicon) and a set of rules defining legal structures (grammar).
 Syntactic and Semantic Processing:
 Combining syntactic and semantic processing reduces the number of possible interpretations.
 Example: Sentences with identical syntactic structures may have different semantic interpretations, and combining both
processes helps in detecting semantic anomalies early.
 Contextual Processing:
 Transforms syntactic structure and logical form into a final meaning representation.
 Addresses issues like identifying objects, analyzing temporal aspects, and determining speaker's intention.
 Uses knowledge of discourse context and application to produce a final representation.
 Reasoning Tasks:
 The system performs reasoning tasks appropriate for the application using the final meaning representation.
 Generation Component:
 Produces a response when required.
 Uses knowledge of discourse context, grammar, and lexicon to plan and realize the form of an utterance.
 Spoken Language Application:
 In a spoken language application, words may be the output of a speech recognizer and the input to a speech synthesizer.
 Bidirectional Grammars:
 Grammars that support both understanding and generation tasks.
 While bidirectional grammars are preferred, in practice, grammars are often tailored for specific tasks (understanding or
generation).
 Computational Theories of Natural Language Understanding:
 Understanding systems compute representations of sentence meanings and use them in reasoning tasks.
 Three principal stages of processing: syntactic, semantic, and contextual.

CHAPTER - 2
Components of natural language processing (NLP):
1. Morphological and Lexical Analysis: This component deals with the analysis of the structure of words and their components
(morphemes).
 Tokenization: Breaking text into smaller units, such as words or subwords.
 Stemming: Reducing words to their root form to normalize variations.
 Lemmatization: Reducing words to their base or dictionary form.
 Stopword Removal: Eliminating common words (e.g., "the", "is") that carry little semantic value.
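
As a rough sketch, the four operations above can be tried with NLTK; resource names may vary slightly by NLTK version, and the required data (punkt, wordnet, stopwords) must be downloaded first.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

# One-time setup (names as in recent NLTK releases):
# nltk.download('punkt'); nltk.download('wordnet'); nltk.download('stopwords')

text = "The children were studying natural languages"

tokens = nltk.word_tokenize(text)                            # tokenization
stems = [PorterStemmer().stem(t) for t in tokens]            # crude root forms
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]  # dictionary forms
content = [t for t in tokens
           if t.lower() not in stopwords.words('english')]   # stopword removal

print(stems)    # e.g. 'studying' -> 'studi' (stems need not be real words)
print(lemmas)   # e.g. 'children' -> 'child'
print(content)  # 'The' and 'were' removed
```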

2. Syntactic Analysis: Also known as parsing, this component focuses on analyzing the grammatical structure of sentences to
understand how words relate to each other syntactically.
 Part-of-Speech Tagging: Assigning grammatical categories (e.g., noun, verb) to words.
 Parsing: Analyzing the grammatical structure of sentences to understand relationships between words and phrases.
 Grammar Checking: Identifying and correcting grammatical errors in text.
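
A quick illustration of part-of-speech tagging with NLTK's default tagger; the output shown is indicative, and the tagger model must be downloaded first.

```python
import nltk
# Requires: nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("Murthy likes watching movies")
print(nltk.pos_tag(tokens))
# e.g. [('Murthy', 'NNP'), ('likes', 'VBZ'), ('watching', 'VBG'), ('movies', 'NNS')]
```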

3. Semantic Analysis: Semantic analysis aims to understand the meaning of words, phrases, and sentences in a given context.
 Named Entity Recognition (NER): Identifying and categorizing entities like people, organizations, and locations.
 Word Sense Disambiguation: Determining the correct meaning of ambiguous words based on context.
 Semantic Role Labelling: Identifying the roles of words or phrases in sentences, such as agents, patients, and instruments.
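
A minimal NER sketch using NLTK's ne_chunk; the entity labels shown are indicative, and the chunker's model data must be downloaded first.

```python
import nltk
# Requires: nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Kingfisher operates flights to Hyderabad"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree:
    if isinstance(subtree, nltk.Tree):           # chunked named entities
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)           # e.g. ORGANIZATION Kingfisher
```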

4. Discourse Integration: This component focuses on understanding how sentences or utterances relate to each other in a larger
context, such as a paragraph or conversation.
 Anaphora Resolution: Determining the referents of pronouns or other referring expressions.
 Coreference Resolution: Identifying expressions that refer to the same entity.
 Discourse Parsing: Analyzing the structure of discourse units like paragraphs or conversations.

5. Pragmatic Analysis: Pragmatic analysis deals with understanding the intentions, implicatures, and contextual effects of language use
beyond the literal meaning of words and sentences.
 Speech Act Recognition: Identifying the intended illocutionary force of utterances (e.g., assertion, question, request).
 Conversational Implicature Analysis: Recognizing implied meanings in conversation beyond literal interpretation.
 Discourse Coherence Analysis: Evaluating the coherence and relevance of discourse units to maintain meaningful
communication.

Grammars and Sentence Structure:


1. Sentence Structure Representation:
 Sentences can be represented as trees to show their structure.
 Trees consist of labeled nodes connected by links, resembling upside-down trees.
 Terminology includes root (top node), leaves (bottom nodes), parent nodes, and child nodes.
 Nodes are dominated by their ancestor nodes, and the root node dominates all others.
2. Constructing Tree Structures:
 Legal tree structures for English are determined by sets of rewrite rules.
 Rewrite rules define how symbols can be expanded in the tree.
 Context-free grammars (CFGs) are important for describing natural language structure.
 Terminal symbols cannot be further decomposed; the other symbols are called nonterminal symbols, and those
describing word categories (such as N or V) are called lexical symbols.

3. Derivations:
 Grammars derive sentences through sequences of rewrite rules.
 Sentence generation constructs legal sentences using derivations.
 Parsing identifies sentence structure given a grammar.
4. Generation and Parsing Processes:
 Generation: Randomly choose rewrite rules starting from the start symbol until a sequence of words is obtained.
 Parsing: Two methods—top-down and bottom-up.
 Top-down: Start with the start symbol and search through different rewrite possibilities until the input
sentence is generated or all possibilities are explored.
 Bottom-up: Start with the words in the sentence and use rewrite rules backward to reduce the sequence of
symbols until it consists solely of the start symbol.
5. Parse Tree Representation:
 Parse trees serve as records of CFG rules that account for the structure of sentences.
 They provide a visual representation of the parsing process, whether top-down or bottom-up.
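
The generation and parsing processes above can be illustrated with a toy CFG in NLTK; the grammar and lexicon here are invented for demonstration.

```python
import nltk
from nltk.parse.generate import generate

# A toy set of rewrite rules; grammar and lexicon are illustrative.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> ART N | NAME
VP -> V NP
ART -> 'the'
N -> 'dog' | 'man'
V -> 'saw'
NAME -> 'John'
""")

# Generation: expand rules from the start symbol S to derive sentences.
for words in generate(grammar, n=5):
    print(" ".join(words))

# Parsing: recover the tree structure of a given sentence.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("John saw the dog".split()):
    tree.pretty_print()
```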

What Makes a Good Grammar:


1. Criteria for Evaluating Grammars:
 Generality: The ability of the grammar to analyze a wide range of sentences correctly.
 Selectivity: The ability of the grammar to identify problematic non-sentences.
 Understandability: The simplicity of the grammar itself.
2. Testing Constituents:
 Test by attempting to conjoin the proposed constituent with another of the same type.
 Test by inserting the proposed constituent into other sentences that take the same category of constituent.
 Conjunction Test:
1. Conjoining constituents of the same class helps determine correctness.
 Insertion Test:
1. Inserting the proposed constituent into other sentences of the same category verifies its behavior.
 Further Tests:
1. Consider changing the constituent in ways usually allowed, like replacing it with a pronoun.
3. Grammar Development:
 As the grammar develops, more tests become available to evaluate new analyses.
 Modifications to existing rules may be necessary when introducing new ones.
4. Generative Capacity (Box 3.1):
 Grammatical formalisms vary in their ability to describe languages.
 The Chomsky Hierarchy categorizes languages based on the complexity of their grammars.
 Regular grammars, context-free grammars, context-sensitive grammars, and type 0 grammars form a hierarchy of
languages.

Top-Down Parser
1. Parsing Algorithm Overview:
 Parsing algorithm searches for combinations of grammatical rules to generate a tree structure representing the input
sentence.
 Top-down parsing method is discussed, focusing on a simple approach that returns a yes or no answer regarding whether
a tree could be built for the sentence.
2. State Representation:
 Parse state is represented as a symbol list and a word position in the sentence.
 State transitions occur based on operations applied, such as applying grammar rules or matching lexical categories.
3. Parsing Procedure:
 Algorithm starts with the initial state and systematically explores possible states until success or failure.
 Lexicon stores possible categories for each word to efficiently guide state transitions.
4. Backtracking Technique:
 Backtracking is used to explore all possible new states if a current state cannot lead to a solution.
 If a dead-end is reached, the algorithm can switch to a backup state and continue parsing.
5. Search Procedure:
 Parsing is likened to a search problem in AI, where the possibilities list represents the search space.
 Depth-first and breadth-first search strategies are discussed, each with its advantages and potential pitfalls.
6. Left-Recursive Rules:
 Left-recursive rules can cause issues in parsing, potentially leading to infinite loops.
 Strategies like depth-first search may encounter problems with left-recursive rules unless explicit checks are incorporated.
7. Implementation Considerations:
 Many parsers prefer the depth-first strategy due to its efficiency in memory usage and reduced need for backup states.

 The algorithm starts with the initial state ((S) 1) and no backup states.
 Select the current state: Take the first state off the possibilities list and call it C. If the possibilities list is empty, then the algorithm
fails (that is, no successful parse is possible).
 If C consists of an empty symbol list and the word position is at the end of the sentence, then the algorithm succeeds.
 Otherwise, generate the next possible states.
 If the first symbol on the symbol list of C is a lexical symbol, and the next word in the sentence can be in that class, then
create a new state by removing the first symbol from the symbol list and updating the word position, and add it to the
possibilities list.
 Otherwise, if the first symbol on the symbol list of C is a non-terminal, generate a new state for each rule in the grammar
that can rewrite that nonterminal symbol and add them all to the possibilities list.
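
A minimal Python sketch of this algorithm, with a toy grammar and lexicon; a state is a (symbol list, word position) pair, and new states are pushed onto the front of the possibilities list to get depth-first behaviour.

```python
# Toy grammar and lexicon; the algorithm follows the steps listed above.
GRAMMAR = {                      # nonterminal -> possible right-hand sides
    "S":  [["NP", "VP"]],
    "NP": [["ART", "N"], ["NAME"]],
    "VP": [["V", "NP"]],
}
LEXICON = {                      # word -> possible lexical categories
    "the": {"ART"}, "dog": {"N"}, "John": {"NAME"}, "saw": {"V"},
}

def top_down_parse(words):
    possibilities = [(["S"], 0)]                 # initial state ((S) 1)
    while possibilities:
        symbols, pos = possibilities.pop(0)      # current state C
        if not symbols:
            if pos == len(words):                # empty list at sentence end
                return True
            continue
        first, rest = symbols[0], symbols[1:]
        if first in GRAMMAR:
            # Nonterminal: one new state per rewrite rule (depth-first:
            # push on the front). Left-recursive rules would loop here.
            for rhs in GRAMMAR[first]:
                possibilities.insert(0, (rhs + rest, pos))
        elif pos < len(words) and first in LEXICON.get(words[pos], set()):
            # Lexical symbol matches the next word: consume it.
            possibilities.insert(0, (rest, pos + 1))
    return False                                 # possibilities exhausted

print(top_down_parse("the dog saw John".split()))  # True
print(top_down_parse("saw the dog".split()))       # False
```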

A Bottom-Up Chart Parser:


1. Difference between Top-Down and Bottom-Up Parsers:
 A top-down parser begins with the start symbol and expands rules toward the words, whereas a bottom-up parser
begins with the words and applies rewrite rules in reverse until only the start symbol remains.
2. Problems with Simple Implementation:
 A straightforward bottom-up parsing implementation would be inefficient due to redundant matching of the same
sequences.
 To address this, a data structure called a chart is introduced to store partial parsing results and avoid duplicating work.
3. Chart Construction and Management:
 Matches are considered from the perspective of a key constituent.
 Completed constituents and active arcs (partially matched constituents) are recorded on the chart.
 The chart maintains a record of constituents derived from the input sentence and incomplete matches.
4. Chart-Based Parsing Algorithm:
 The algorithm involves combining active arcs with completed constituents to generate new constituents or extend existing
ones.
 New completed constituents are added to the chart while active arcs are maintained for further processing.
 The parsing process can be conducted using depth-first or breadth-first search strategies.
5. Efficiency Considerations:
 Chart-based parsers can be more efficient than pure search-based parsers because they avoid redundant construction of
constituents.
 The worst-case complexity of a chart-based parser is K × n³, where n is the length of the input sentence and K is a
constant that depends on the algorithm.
 Despite requiring more work per step compared to search-based parsers, chart parsers can be significantly faster for
parsing complex sentences.
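
A glimpse of the chart's contents using NLTK's bottom-up chart parser; the grammar is a toy example, and the edge printing format is NLTK's own.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> ART N
VP -> V NP
ART -> 'the'
N -> 'dog' | 'man'
V -> 'saw'
""")

# The chart records completed constituents and active arcs so that
# no partial match is ever rebuilt.
parser = nltk.parse.BottomUpChartParser(grammar)
chart = parser.chart_parse("the dog saw the man".split())

for edge in chart.edges():
    status = "completed" if edge.is_complete() else "active arc"
    print(status, edge)

print(len(list(chart.parses(grammar.start()))), "parse(s) found")
```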

Transition Network grammar:


1. Transition Networks:
 Transition networks consist of nodes and labelled arcs.
 One node is specified as the initial state, and arcs are labelled with word categories.
 A legal phrase is recognized by traversing arcs from the initial state to a pop arc, accounting for each word in the phrase.
 Simple transition networks are equivalent to finite state machines (FSMs) and regular grammars.
2. Recursive Transition Networks (RTNs):
 RTNs allow arcs to refer to other networks or word categories.
 They provide the descriptive power of context-free grammars (CFGs) by allowing recursion.
 RTN parsers use algorithms similar to CFG parsers, involving current position, current node, and return points.
3. Parsing Algorithm for RTNs:
 The parsing algorithm involves traversing arcs based on word categories or network references.
 Cases include following word category arcs, push arcs to other networks, pop arcs, and completion of parsing.
4. Backup States and Backtracking:
 Backup states are saved during parsing to recover from failures.
 Backtracking allows the parser to explore alternative paths when initial parsing attempts fail.
5. Combining Top-Down and Bottom-Up Parsing:
 Combining top-down and bottom-up parsing methods provides the advantages of both.
 The algorithm involves maintaining an agenda of completed constituents and extending arcs based on grammar rules.
6. Generative Capacity of Transition Networks:
 Simple transition networks without push arcs are equivalent to regular grammars.
 RTNs are equivalent to context-free grammars, allowing for greater expressive power.
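
A minimal sketch of a simple (non-recursive) transition network for noun phrases; node names, arcs, and lexicon are illustrative.

```python
# arcs: state -> list of (category, next_state); 'POP' marks an accepting exit
LEXICON = {"the": "ART", "a": "ART", "dog": "N", "old": "ADJ"}

NP_NETWORK = {
    "NP":  [("ART", "NP1")],
    "NP1": [("ADJ", "NP1"), ("N", "NP2")],   # ADJ loops, N advances
    "NP2": [("POP", None)],
}

def traverse(words, network, state="NP", pos=0):
    """Follow arcs from the initial state; succeed on a pop arc at the end."""
    for label, nxt in network.get(state, []):
        if label == "POP":
            if pos == len(words):
                return True
        elif pos < len(words) and LEXICON.get(words[pos]) == label:
            if traverse(words, network, nxt, pos + 1):
                return True    # backtracking happens via the recursion
    return False

print(traverse("the old old dog".split(), NP_NETWORK))  # True
print(traverse("dog the".split(), NP_NETWORK))          # False
```

Replacing a category arc with a push to another network (e.g., an embedded NP inside a PP network) is what turns this FSM-equivalent machine into an RTN with context-free power.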
CHAPTER – 3 Semantic analysis

Introduction:
 Semantic Compositionality: The principle that the meaning of a whole sentence is composed of the meanings of its parts.
Natural languages do not always strictly adhere to this principle, owing to factors like collocations and idioms.
 Lexical Semantics: Focuses on the meaning of individual words and their relationships. Early theories decompose lexical meanings
into semantic primitives, but this approach is found inadequate for handling compositional semantics.
 Model Theoretic Semantics: Inspired by logic, this approach creates models of the world to determine the truth of sentences. It's
effective in studying pragmatics as well as semantics.

MEANING REPRESENTATION:
 Meaning Representation Languages: Bridge linguistic inputs and non-linguistic knowledge. Examples: FOPC, semantic networks.

 Characteristic:
 Verifiability: Ensure that meaning representations can be checked against a knowledge base.
For example, in the sentence "Does Kingfisher serve Hyderabad?", the representation "Serves (Kingfisher, Hyderabad)"
can be compared against a database to verify its truthfulness.
 Unambiguous: Meaning representations should have only one interpretation.
For instance, in the sentence "I want to go to a hill station," although vague, the representation should provide a clear
indication of the user's intention, despite the lack of specificity.
 Vagueness Handling: Allow for a certain level of vagueness in representations.
In the sentence "I want to go to a hill station," the representation should capture the general desire to visit a hill station
without specifying a particular location.
 Canonical Form: Ensure consistency in representations despite different expressions.
For example, the sentences "Does Kingfisher offer a flight to Hyderabad?" and "Does Kingfisher have a flight to
Hyderabad?" should both be represented in a canonical form to indicate that they convey the same meaning.
 Inference and Variables: Support deriving conclusions and using variables for flexibility. In the sentence "I would like to
catch a flight to Hyderabad," the representation "Goes(x, Hyderabad)" allows for the use of a variable (x) to represent any
flight that goes to Hyderabad.
 Expressiveness: Enable representation of diverse content effectively. This means the language should be capable of
representing various types of linguistic inputs, such as questions, statements, requests, etc., with accuracy and richness in
meaning.
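
A tiny sketch of verifiability and variable-based inference against a toy knowledge base; the predicate and constant names follow the examples above.

```python
# Toy knowledge base of ground facts, stored as tuples.
KB = {("Serves", "Kingfisher", "Hyderabad"),
      ("Serves", "Kingfisher", "Delhi")}

def verify(predicate, *args):
    """Verifiability: check a representation against the knowledge base."""
    return (predicate, *args) in KB

print(verify("Serves", "Kingfisher", "Hyderabad"))  # True
print(verify("Serves", "Kingfisher", "Chennai"))    # False

def find(predicate, arg2):
    """Inference with a variable: all x such that predicate(x, arg2) holds."""
    return [a1 for (p, a1, a2) in KB if p == predicate and a2 == arg2]

print(find("Serves", "Hyderabad"))  # ['Kingfisher']
```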

Meaning Structure of the Language:


 Predicate Argument Structure: Fundamental to the semantic structure of languages, it establishes relationships among concepts
expressed in sentences.
 Grammar's Role: Grammar helps organize predicate argument structures, guiding how words and phrases contribute to meaning.
 Syntactic Argument Frames: Define constraints on the number, position, and syntactic category of arguments expected with a verb.
 Example (Syntactic Argument Frames):
 "Murthy likes music."
 Syntactic Frame: NP likes NP
 "Murthy likes watching movies."
 Syntactic Frame: NP likes VP
 "Murthy likes to perform in the theatre."
 Syntactic Frame: NP likes inf-VP
 Semantic Roles: Noun phrases in sentences play thematic or theta-roles, defined by verbs. Verbs impose selection restrictions on
arguments based on their semantic roles.
 Example (Semantic Roles):
 The verb "eat" requires an animate entity as the eater and an edible item as the thing being eaten.
 Representation Requirements: A meaning representation language must support:
 Variable number of predicates
 Semantic labelling of arguments
 Semantic restrictions on arguments
 Supporting Formalisms: Formalisms like FOPC, semantic networks, and conceptual graphs enable representation of predicate
argument structures.

Syntax-driven Semantic Analysis


 Syntax-driven Semantic Analysis:
 Utilizes static knowledge from the lexicon and grammar for semantic analysis.
 Corresponds to literal meaning, independent of context and inference.
 Principle of Compositionality: Meaning of the whole is composed from meanings of its parts.
 Approach:
 Syntactic analysis guides semantic representation creation.
 Semantic analyzer produces multiple ambiguous representations, resolved later.
 Example:
 Parse tree for "President nominates speaker."
 Steps:
1. Retrieve meaning representation from the verb subtree.
2. Identify meaning representations for noun phrases.
3. Associate noun phrase meanings with verb variables.
 Augmented Rules:
 Specify mapping between grammar rules and semantic representations.
 Example: Noun → President (President)
 Lambda Calculus:
 Extension of FOPC, used for systematic combination of semantic representations.
 Supports binding variables and replacement in semantic attachments.
 Idioms and Collocates:
 Challenge to compositionality.
 Handled by introducing new grammar rules with semantic attachments not related to constituents.
 Semantic Analysis Approaches:
 Pipeline Approach: Sequential syntactic and semantic analysis.
 Integrated Semantic Approach: Semantic operations incorporated into parsing process.
 Can detect ill-formed semantic fragments early, but may involve processing unnecessary constituents.
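
The lambda-calculus style of combination can be mimicked with ordinary Python functions; the predicate names follow the "President nominates speaker" example above, and the semantic attachments are simplified for illustration.

```python
# VP rule "nominates NP" gets the attachment \x. Nominates(x, obj);
# here a closure stands in for the lambda expression.
def nominates(obj):
    return lambda subj: f"Nominates({subj}, {obj})"

# NP semantic attachments from the lexicon
president, speaker = "President", "Speaker"

# S -> NP VP: apply the VP meaning to the subject NP meaning
vp = nominates(speaker)
print(vp(president))   # Nominates(President, Speaker)
```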

Semantic Grammars:
 Semantic Grammars:
 Address limitations of syntax-driven semantic analysis by integrating semantics into grammar.
 Developed specifically for handling semantics, unlike traditional grammars.
 Rules and constituents directly correspond to entities and activities in the domain.
 Problems with Syntax-driven Semantic Analysis:
 Semantic elements often distributed widely across parse trees.
 Parse trees contain constituents not crucial for semantic distinctions.
 Syntactic constituents may lead to meaningless representations.
 Advantages of Semantic Grammars:
 Focused Semantic Rules: Rules are tailored to specific domain semantics.
 Effective Handling of Anaphors and Ellipsis: Facilitates better treatment of these linguistic phenomena.
 Example:
 Request: "I want to go from Delhi to Chennai on 24th December."
 Semantic Grammar Rule: InfoRequest: User wants to go to City from City TimeExpr.
 Non-terminals like "User," "City," and "TimeExpr" eliminate the need for lambda expressions.
 Drawbacks:
 Lack of Generality: Domain-specific rules require new grammars for each domain.
 Increased Rule Complexity: Specific rules for various semantic distinctions lead to larger rule sets compared to traditional
grammars.
LEXICAL SEMANTICS:
 Lexical Semantics:
 Concerned with the systematic, meaning-related structure of words or lexemes.
 Focuses on understanding relationships among lexemes.
 Relationships Among Lexemes:
 Homonymy: Words with the same form but different, unrelated meanings (e.g., "bank" for river bank or financial
institution).
 Polysemy: Words with multiple related meanings (e.g., "chair" for furniture, person, presiding over a discussion).
 Hypernymy and Hyponymy: Hypernym is a more general word (e.g., "automobile" for car and truck), while hyponym is
more specific (e.g., "car" for automobile).
 Antonymy: Words expressing opposite meanings (e.g., "dark" and "light").
 Meronymy: Part-whole relationship (e.g., "wall," "ceiling," "floor" as meronyms of "room").
 Synonymy: Words with similar meanings, often interchangeable without changing sentence meaning.
 WordNet:
 Widely known lexical database that organizes words based on these relationships.
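
These relationships can be explored directly in WordNet through NLTK; the synset identifiers and outputs shown are indicative and may differ across WordNet versions.

```python
from nltk.corpus import wordnet as wn
# Requires the WordNet data: nltk.download('wordnet')

# Homonymy/polysemy: 'bank' has several distinct senses (synsets)
for synset in wn.synsets('bank')[:3]:
    print(synset.name(), '-', synset.definition())

# Hypernymy: more general concepts above 'car'
car = wn.synsets('car')[0]
print(car.hypernyms())          # e.g. [Synset('motor_vehicle.n.01')]

# Meronymy: parts that make up a car
print(car.part_meronyms()[:3])

# Antonymy is recorded on lemmas, e.g. dark <-> light
dark = wn.synsets('dark', pos=wn.ADJ)[0].lemmas()[0]
print(dark.antonyms())          # e.g. [Lemma('light.a.01.light')]
```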

Internal Structure of Words:


 Thematic Role:
 Semantic relationship between a predicate (e.g., verb) and its arguments (e.g., noun phrases) in a sentence.
 Helps capture the semantic commonality between actors of events.
 Introduced by Gruber (1965) and Fillmore (1968), with roots in Paninian grammar dating back to 500–400 BC.
 Panini's Karakas include Karma (action/object desired), Karana (means/instrument), Karta (agent), Sampradana
(transmission), Apadana (removal), and Adhikarana (locative).
 Thematic Roles:
 Agent: Deliberate performer of an action/event.
 Theme: Entity affected by the action.
 Other thematic roles include Experiencer, Goal, Source, Instrument, Location, etc.
 Used in shallow semantic language and interlingua in machine translation.

CHAPTER – 4 Discourse processing


 Definition of Discourse:
 Language above the sentence level, involving related sentences forming a structure known as discourse.
 Discourse units can be smaller than sentences, relying on contextual knowledge for interpretation.
 Contextual Knowledge:
 Situational context: Physical situations existing during utterance.
 Background knowledge: Cultural and interpersonal knowledge.
 Co-textual context: Knowledge of preceding statements.
 Types of Discourse:
 Written, spoken, signed, monologue, and dialogue.
 Focus on monologue type, involving unidirectional communication from speaker to hearer.
 Cohesion and Coherence:
 Cohesion: Textual phenomenon of linking elements together using cohesive devices.
 Coherence: Mental phenomenon of making sense and being meaningful.
 Cohesive Devices:
 References, ellipsis, repetitions, conjunctions, etc., used to link words in a text.
 Important for maintaining continuity and unity in discourse.
 Pronominal Reference Resolution:
 Essential for applications like information extraction and text summarization.
 Resolving pronouns referring to named entities improves readability and comprehension.
 Coherence Relations:
 Property of being meaningful and unified in discourse.
 Hobbs's abductive framework for determining local coherence is discussed.

Cohesion
 Definition:
 Cohesion binds text together, maintaining its continuity and unity.
 Example:
 Text: "Yesterday, my friend invited me to her house..."
 Cohesion achieved through the use of the pronoun "her," referring back to "my friend."
 Avoidance of Over-specification:
 Use of cohesive devices like pronouns prevents unnecessary repetition and over-specification.
 Enhances readability and efficiency of communication.
 Shared Knowledge:
 Communication assumes the presence of shared knowledge between speaker and hearer.
 Speaker avoids encoding information already known to the hearer.
 Pronominal Reference:
 Type of reference used for cohesion.
 Example: "her" referring back to "my friend" in the text.
 Other Types of Reference:
 Includes ellipses, which will be discussed further.

Reference
 Definition: Reference links a referring expression to another expression in the text.
 Anaphoric Reference:
 Refers to entities previously introduced in the discourse.
 Types:
 Indefinite Reference: Introduces a new object.
 Definite Reference: Refers to an existing object.
 Pronominal Reference: Uses pronouns to refer to entities.
 Pronominal Reference:
 Example: "It" in "I bought a printer today. On installation, it didn't work properly."
 Can be anaphoric or cataphoric.
 Pleonastic use should be distinguished.
 Demonstrative Reference:
 Example: "This" and "that" in "I bought a printer today. This one cost me Rs. 6,000 whereas that one cost me Rs.
12,000."
 Quantifier or Ordinal Reference:
 Uses ordinals like "first" or "one".
 Example: "one" in "I visited a computer shop to buy a printer. I have seen many and now I have to select one."
 Inferables:
 Referents inferred from explicitly mentioned entities.
 Example: "I bought a printer today. On opening the package, I found the paper tray broken."
 Generic Reference:
 Refers to a whole class instead of a specific entity.
 Example: "I saw two laser printers in a shop. They were the fastest printers available."
 Usage: Helps maintain coherence and clarity in discourse by establishing connections between expressions.
Ellipsis:
 Definition: Ellipsis refers to the omission of a part of a sentence, where the omitted part is understood from the context.
 Example:
 "Yes I do" in response to "Do you take fish?" instead of "Yes I take fish."
 "Do you?" in response to "I know that lady. Do you?" instead of "Do you know that lady?"
 Recovery: The reader or hearer uses the surrounding text to understand the omitted part.
 Target and Source Clauses:
 Target Clause: The clause containing the omitted part.
 Source Clause: The clause from which the omitted part is understood.
 Ambiguity:
 References in the source clause may lead to ambiguities in the target clause.
 Example: "Seema loves her mother and Suha does too."
 Ambiguity between whether Suha loves Seema's mother or her own mother.
 Strict reading: Suha loves Seema's mother.
 Sloppy reading: Suha loves her own mother.

Lexical Cohesion:
 Definition: Lexical cohesion involves the use of lexical phenomena to achieve cohesion in text. It includes repetition, synonymy,
and hypernymy.
 Repetition:
 Example: "Baa, baa, black sheep, Have you any wool? Yes sir, yes sir, Three bags full."
 Function: Repetition of words or phrases creates a stylistic effect and reinforces the connection between clauses.
 Synonymy:
 Definition: The use of words that have the same or similar meanings in a given context.
 Example: Using "happy" instead of "joyful" in a sentence.
 Function: Enhances clarity and adds variety to the language without changing the meaning.
 Hypernymy:
 Definition: The use of a super-ordinate term that encompasses other terms.
 Example: Using "vehicle" instead of "car" or "bus."
 Function: Provides a broader category for referring to specific entities, offering a more general perspective.

Discourse Coherence and Structure:


Discourse Coherence Structures:
 Discourse has a structure with relations between sentences.
 Sentences combine to form segments, linked by coherence relations.
 Coherence structure assigns a tree-like organization to discourse.
 Helps explain issues like 'topic', 'genre', and 'coherence drift'.
Hobbs' Four-Step Procedure for Discourse Analysis:
1. Segmentation:
 Identify major breaks and divide text into segments.
 Repeat process iteratively until single clauses are obtained.
 Output: Tree structure of the text.
2. Labeling Non-Terminal Nodes:
 Label nodes with coherence relations using bottom-up approach.
 Understand the meaning represented by composed segments.
 Utilize heuristics and conjunctions to identify relations.
3. Identifying Underlying Knowledge:
 Specify the knowledge or beliefs supporting coherence relations.
 Understand the discourse content underlying the composed segment.
4. Validation of Hypotheses:
 Validate hypotheses made in previous steps.
 Consider longer corpus and construct knowledge base for analysis.
Coherence Relation Indicators:
 Explanation: because, and so, hence, That's why.
 Occasion: then.
 Elaboration: Also, i.e., or that is, in addition, note that.
 Parallel: Similarly, likewise.
 Exemplification: for example.
 Contrast: but, however.
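
A crude heuristic sketch that guesses a coherence relation from the cue words listed above; as Hobbs' procedure makes clear, real discourse analysis needs far more than cue matching, and naive substring tests will misfire.

```python
# Cue-word table taken from the indicator list above.
CUES = {
    "Explanation": ["because", "and so", "hence", "that's why"],
    "Occasion": ["then"],
    "Elaboration": ["also", "that is", "in addition", "note that"],
    "Parallel": ["similarly", "likewise"],
    "Exemplification": ["for example"],
    "Contrast": ["but", "however"],
}

def guess_relation(clause: str):
    lowered = clause.lower()
    for relation, cues in CUES.items():
        if any(cue in lowered for cue in cues):   # naive substring match
            return relation
    return None   # no cue found; the relation must be inferred from content

print(guess_relation("He stayed home because it rained."))  # Explanation
print(guess_relation("However, the parser failed."))        # Contrast
```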
Important Points:
 Coherence structures aid in understanding discourse organization.
 Hobbs' procedure involves iterative segmentation and labeling.
 Knowledge and beliefs support coherence relations.
 Validation involves considering a longer corpus and constructing a knowledge base.

CHAPTER – 5 NLG

Architecture of nlg systems:


1. Pipelined Architecture:
 In the pipelined architecture, NLG tasks are divided into three distinct groups: discourse planner, text planner, and surface
realizer. Information flows sequentially from one stage to another in a serial manner. Each stage must complete its tasks
before passing the processed information to the next stage. However, this architecture may face challenges when there's
a need to revisit earlier decisions due to ambiguities or changes in meaning.
2. Interleaved Architecture:
 In the interleaved architecture, the discourse and text planning stages are merged into a single stage, allowing control and
information to be passed back to any task within these stages. This flexibility enables rethinking and adjustments during
generation, as decisions are made iteratively based on the evolving context. Systems like PAULINE and MUMBLE utilize
this architecture to generate pragmatic effects in text.
3. Integrated Architecture:
 The integrated architecture, introduced by Kantrowitz and Bates, is designed for interactive NLG environments. It employs
a "blackboard" architecture, allowing for interaction between computer-simulated agents and humans. Systems like
GLINDA utilize this architecture for real-time interaction and dynamic generation of sentences.
4. Hybrid Approach:
 A hybrid approach integrates all three stages of NLG into a single framework. Fine-grained sub-tasks within each stage can
be implemented as needed, breaking the boundaries between stages. Operations like serialization and transformation are
abstracted into common functions. This approach combines the advantages of pipelined, interleaved, and integrated
architectures while providing flexibility and efficiency.
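
A schematic sketch of the pipelined architecture, with each stage completing before the next; the stage interfaces and data shapes are invented for illustration.

```python
# Pipeline: discourse planner -> text planner -> surface realizer.
def discourse_planner(goal):
    # decide what to say and in what order
    return [("report", "temperature", 31), ("report", "sky", "clear")]

def text_planner(messages):
    # group messages into sentence-sized specifications
    return [{"subject": attr, "value": val} for _, attr, val in messages]

def surface_realizer(specs):
    # map each specification to a concrete sentence
    return " ".join(f"The {s['subject']} is {s['value']}." for s in specs)

print(surface_realizer(text_planner(discourse_planner("weather_summary"))))
# The temperature is 31. The sky is clear.
```

Because information only flows forward, a bad early decision (say, an ambiguous message ordering) cannot be revisited by a later stage, which is exactly the weakness the interleaved and integrated architectures address.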

Applications of Natural Language Generation (NLG):


1. Summarizing Statistical Data: NLG systems are utilized to condense statistical data extracted from databases or spreadsheets into
human-readable summaries. This application streamlines the comprehension of complex data sets for non-expert users.
2. Weather Reporting: NLG is employed in systems like the Weather Reporter to generate multi-sentence weather summaries. These
summaries are created from databases containing numerical meteorological data, often collected automatically by weather services.
By transforming raw data into coherent narratives, NLG enhances the accessibility of weather information.
3. Answering Queries about Objects in Knowledge Bases: NLG systems are utilized to generate responses to queries about objects
described in knowledge bases. This application enables users to interact with databases or knowledge systems using natural
language, facilitating easier understanding and accessibility of information.
4. Summarizing Textual Documents: NLG is crucial for creating abstractive summaries of textual documents. By analyzing the content
and context of documents, NLG systems can generate concise summaries that capture the key information and main points, aiding in
information retrieval and comprehension.
CHAPTER – 6 Machine translation

Problems in Machine Translation


1. Word Order:
 Languages have different word orders in sentences. For instance, English typically follows the subject-verb-object order,
while Indian languages may have the object precede the verb. Direct word-by-word translation becomes impractical due
to these differences.
2. Word Sense:
 Words in one language may have multiple meanings or nuances that don't directly correspond to words in another
language. Selecting the appropriate word sense in the target language poses a challenge for machine translation systems.
3. Pronoun Resolution:
 Resolving pronoun references accurately is crucial for coherent translation. Failure to resolve pronouns correctly can lead
to incorrect translations and loss of meaning.
4. Idioms:
 Idiomatic expressions present a challenge for translation as they often have meanings that cannot be inferred from the
individual words. Translating idioms word-for-word can result in nonsensical or humorous translations.
5. Ambiguity:
 Languages differ in their tolerance for ambiguity. Resolving ambiguities, such as syntactic or semantic ambiguities, is
necessary for accurate translation. For example, the sentence "The man saw the star with a telescope" has a prepositional
phrase ambiguity that needs resolution before translation.

Characteristics of Indian Languages


1. Sentence Structure:
 Indian languages often follow the Subject-Object-Verb (SOV) sentence structure as the default order.
2. Free Word Order:
 Indian languages allow flexibility in word order, meaning that rearranging words within a sentence typically does not alter
its meaning. The relationship between sentence components is conveyed through inflections.
3. Morphological Variations:
 Indian languages have a rich set of morphological variations. Adjectives, for example, undergo changes based on number
and gender, unlike in English.
4. Complex Predicates:
 Indian languages extensively use complex predicates, combining light verbs with other verbs, nouns, or adjectives to form
a single verb.
5. Post Position Case Markers:
 Instead of prepositions, Indian languages often use post-position case markers (Karaks) to indicate relationships between
words. Some languages also attach inflections to objects to handle prepositions.
6. Verb Complexes:
 Indian languages employ verb complexes, consisting of sequences of verbs (e.g., ga raha hai and khel rahi hai), to convey
information about tense, aspect, and modality.
7. Gender Information in Verbs:
 Gender information is embedded within verb groups in Indian languages. Most languages have only two genders—
masculine and feminine—while adjectives may also agree with gender.
8. Pronouns:
 Unlike English, Indian-language pronouns typically lack associated gender information.

Machine Translation Approaches


1. Direct Translation:
 In direct translation, each source-language word or phrase is translated into the target language with the help of a
bilingual dictionary, with little structural analysis of the source text; the output is then rearranged to match the
target language's word order (see Direct Machine Translation below).
2. Rule-based Translation:
 Rule-based translation utilizes linguistic rules and grammatical structures of both the source and target languages. These
rules are often manually crafted by linguists and computational linguists. The translation process involves analyzing the
input sentence, applying syntactic and semantic rules, and generating the corresponding output in the target language.
3. Hybrid Approaches:
 Hybrid approaches combine elements of both statistical and rule-based methods. These systems integrate statistical
models to capture translation probabilities from data while also incorporating linguistic rules and constraints to improve
accuracy and handle linguistic nuances.
4. Knowledge-based Translation:
 Knowledge-based translation systems rely on explicit representations of linguistic knowledge, including lexicons,
grammars, semantic networks, and ontologies. These systems often use formal representations of meaning and world
knowledge to generate translations. Knowledge-based approaches may also incorporate domain-specific knowledge to
improve translation quality in specialized domains.

Direct Machine Translation


Direct machine translation systems provide translation without using any intermediate representation. These systems typically perform word-
by-word translation with the aid of a bilingual dictionary, followed by syntactic rearrangement. Here are the key characteristics and steps
involved in direct machine translation:

1. Monolithic Approach: Direct translation systems focus on specific language pairs and take a monolithic approach to development.
They involve little analysis of the source text and primarily rely on bilingual dictionaries.
2. Steps in Translation:
 Morphological Analysis: Remove morphological inflections to obtain the root form of source-language words.
 Bilingual Dictionary Lookup: Retrieve target-language words corresponding to the source-language words from a bilingual
dictionary.
 Syntactic Rearrangement: Change the word order to match the target language's default structure. For example, in
English to Hindi translation, this involves changing prepositions to postpositions and adjusting the subject-verb-object
order.
3. Example: English Sentence: "Khushbu slept in the garden." Translation Steps:
 Word Translation: "Khushbu soyi mein baag"
 Syntactic Rearrangement: "Khushbu baag mein soyi"
4. Idiomatization: Adjustments are made to match idiomatic expressions and linguistic conventions of the target language. This
includes handling suffixes and making minor modifications for idiomatic correctness.
5. Challenges and Limitations:
 Word Ambiguity: Direct translation may lead to ambiguous or nonsensical translations, especially when context and word
senses are not considered.
 Limited Structural Analysis: Direct MT systems lack deeper analysis of sentence structure and relationships between
words, leading to lower translation quality.
 Language Pair Specificity: These systems are developed for specific language pairs and cannot be easily adapted to new
pairs without significant effort.
 Cost and Development: Developing direct MT systems for multiple language pairs can be expensive and time-consuming
due to their monolithic nature.
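
A heavily simplified sketch of the direct-translation steps for the "Khushbu slept in the garden" example; the dictionary entries and the reordering rule are illustrative, and the rearrangement is hardcoded to this sentence pattern.

```python
# Toy bilingual dictionary; the empty string drops the English article,
# which Hindi does not use here.
DICTIONARY = {"khushbu": "Khushbu", "slept": "soyi", "in": "mein",
              "the": "", "garden": "baag"}

def direct_translate(sentence):
    # Step 1: word-by-word bilingual dictionary lookup
    words = [DICTIONARY.get(w.strip('.').lower(), w) for w in sentence.split()]
    words = [w for w in words if w]           # drop untranslated articles
    # Step 2: crude syntactic rearrangement, assuming exactly the
    # subject-verb-preposition-object pattern of this example:
    # move the verb to the end and make the preposition a postposition.
    subject, verb, prep, obj = words
    return f"{subject} {obj} {prep} {verb}"

print(direct_translate("Khushbu slept in the garden."))
# Khushbu baag mein soyi
```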
Rule-Based Machine Translation
Rule-based machine translation systems parse the source text and generate an intermediate representation, which could be a parse tree or an
abstract representation. The target language text is then produced from this intermediate representation. These systems rely on predefined
rules for various linguistic aspects like morphology, lexical selection, syntax, and semantics, hence they are termed rule-based systems.
Components of Rule-Based Machine Translation:
1. Analysis: This stage involves parsing the source text to produce a structure conforming to the rules of the source language. It
includes morphological, syntactic, and sometimes semantic analyses.
2. Transfer: The source language representation is transformed into a target language representation based on predefined rules and
linguistic differences between the languages.
3. Generation: Finally, the target language text is generated using the target language representation.
Transfer-Based Translation:
 In transfer-based translation, the structure of the input text is transformed to match the rules of the target language. This involves
parsing the source text and transferring its structure into the target language.
 The transfer component handles language-specific differences between source and target languages, while the generation module
produces the actual target language text.
 Transfer systems offer modularity and language independence in analysis and generation, making them suitable for multilingual
environments.
Interlingua-Based Machine Translation:
 Inspired by Chomsky's notion of a universal 'deep structure,' interlingua-based MT converts source language text into a language-
independent meaning representation called 'interlingua.'
 Translation involves two stages: analysis to represent the source text in interlingua and synthesis to generate target language text
from interlingual representation.
 While interlingua systems require more extensive analysis compared to transfer systems, they offer advantages such as meaning-
based representation and applicability in information retrieval.
 The major challenge in interlingua-based MT lies in defining a universal interlingua that preserves the meaning of sentences across
languages, considering cultural and linguistic differences.

Semantic or Knowledge-Based MT Systems


1. Semantic Processing: Unlike earlier approaches where semantic processing occurred after syntactic analysis, semantic or knowledge-
based systems integrate semantic analysis into the translation process from the outset. This allows for a more nuanced
understanding of the meaning of the source text.
2. Knowledge Base: These systems rely on a large knowledge base that encompasses various aspects of language, including ontologies,
lexical knowledge, semantic rules, and grammatical structures. This knowledge base is crucial for accurate translation and includes
information about word meanings, relationships between words, and domain-specific knowledge.
3. Ontologies: Ontologies are used to represent knowledge about concepts, entities, and their relationships in a particular domain.
They provide a structured framework for organizing and interpreting information, allowing MT systems to understand the context
and meaning of the source text.
4. Lexical Knowledge: Semantic or knowledge-based systems incorporate detailed lexical knowledge, including information about word
senses, semantic roles, and syntactic patterns. This enables more accurate translation by capturing the nuances of word meanings
and usage.
Example System: KANT
An example of a semantic or knowledge-based MT system is the KANT (Knowledge-based, Accurate Natural Language Translation) system. KANT
utilizes explicit grammars, semantic rules, and a rich knowledge base to perform translation. It incorporates ontologies, lexical knowledge, and
syntactic structures to analyze and translate source language text accurately.

TRANSLATION INVOLVING INDIAN LANGUAGES:


Research on translation involving Indian languages has predominantly focused on the English—Hindi language pair, resulting in the development
of various machine translation (MT) systems and language accessors. Let's explore some of these systems:
1. ANGLABHARTI:
 Type: Rule-based hybrid MT system.
 Approach: Uses a pseudo-interlingua for translation between English and multiple Indian languages.
 Method: Utilizes rule-based and example-based translation along with post-editing.
 Framework: Based on the Paninian framework, exploiting structural similarities among Indian languages.
2. SHAKTI:
 Type: Rule-based MT system.
 Approach: Follows a rule-based approach with hybridization incorporating example-based learning.
 Stages:
 English Sentence Analysis
 Transfer-Grammar for English to Hindi
 Hindi Sentence Generation
 Method: Implements part-of-speech tagging, morphological analysis, chunking, parsing, semantic analysis, word sense
disambiguation, transfer rules, and target language generation.
3. MaTra:
 Type: Human-assisted MT system.
 Approach: Structured editor creates a hierarchical representation of input sentences.
 Method: Uses rules and heuristics to resolve ambiguities, assisting translators, editors, and content providers.
4. MANTRA (Machine-Assisted Translation Tool):
 Type: MT system based on tree-adjoining grammar formalism.
 Languages: Includes Hindi, Hindi—English, and Hindi—Bengali.
 Domain: Initially focused on a restricted domain (e.g., government appointments, parliament proceedings), expanding to other
domains and languages.
5. Anusaarak:
 Project Focus: Development of language accessors for Indian languages.
 Languages Covered: Punjabi, Bengali, Telugu, and Marathi into Hindi.
 Approach: Based on Paninian grammar principles, mapping word groups between source and target languages.
