NLPNotes
Words are the smallest linguistic units that can form a complete utterance by
themselves.
Components:
Tokens
Lexemes
Morphemes
1. TOKENS:
A "token" refers to a unit of text that has been extracted from a larger body
of text.
Tokens are the building blocks that NLP models use to process and
understand language.
The process of breaking down a text into individual tokens is known as
tokenization.
Each token typically corresponds to a word, although it can also represent
sub-word units, characters, or other linguistic elements depending on the
specific tokenization strategy.
Word Tokens: Each word in a sentence is treated as a separate token.
example "I love natural language processing": ["I", "love", "natural",
"language", "processing"].
Sub-word Tokens: In sub-word tokenization, words are broken down into
smaller units. One common approach is Byte Pair Encoding (BPE).
Example: "processing" → sub-word tokens (after BPE): ["pro", "cess", "ing"]
(see the tokenization sketch at the end of this section).
Character Tokens: In character tokenization, each character in the text is
treated as a separate token. Example: "natural" → character tokens: ["n", "a",
"t", "u", "r", "a", "l"].
Role in NLP
o Used in text classification, sentiment analysis, and named entity
recognition.
o Tokenization splits text into tokens.
o Part-of-Speech (POS) tagging assigns grammatical categories.
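The three tokenization strategies above can be illustrated with a short Python
sketch. This is a minimal illustration using only the standard library; the
sub-word split is hard-coded for the example, since real BPE segmentations come
from a learned merge table (a toy BPE learner appears later in these notes).

import re

text = "I love natural language processing"

# Word tokens: split on word characters (real tokenizers also handle punctuation).
word_tokens = re.findall(r"\w+", text)
print(word_tokens)  # ['I', 'love', 'natural', 'language', 'processing']

# Character tokens: every character becomes a token.
char_tokens = list("natural")
print(char_tokens)  # ['n', 'a', 't', 'u', 'r', 'a', 'l']

# Sub-word tokens: hard-coded here; a trained BPE model would produce
# splits like this from learned merge rules.
subword_tokens = ["pro", "cess", "ing"]
print(subword_tokens)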
2. LEXEMES
A lexeme is a unit of vocabulary that represents a single concept,
regardless of its grammatical variations.
Lexemes can be divided by their behavior into the lexical categories of
verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
i. "Run," "runs," "running," and "ran" belong to the same lexeme
(concept of running).
ii. "Bank" (financial institution) and "bank" (river edge) are different
lexemes.
Role in NLP
o Lexical analysis identifies and categorizes lexemes.
o Helps in stemming, lemmatization, and part-of-speech tagging.
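A quick way to see lexemes in practice is lemmatization, which maps inflected
forms back to a common dictionary form. A minimal sketch using NLTK's
WordNetLemmatizer (assumes nltk is installed and the WordNet data can be
downloaded):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time corpus download

lemmatizer = WordNetLemmatizer()
# "run", "runs", "running", and "ran" all reduce to the lexeme "run"
for form in ["run", "runs", "running", "ran"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))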
3. MORPHEMES
A morpheme is the smallest unit of meaning in a language.
Free Morphemes: Can stand alone (e.g., "book," "run," "happy").
Bound Morphemes: Cannot stand alone, must attach to another
morpheme.
i. Prefixes: Added at the beginning (e.g., "un-" in "unhappy").
ii. Suffixes: Added at the end (e.g., "-ed" in "walked").
Examples
i. Unhappily = "un-" (not) + "happy" + "-ly" (manner).
ii. Rearrangement = "re-" (again) + "arrange" + "-ment" (act of
arranging).
iii. Cats = "cat" (free morpheme) + "-s" (plural).
Role in NLP
o Used in part-of-speech tagging, sentiment analysis, and
machine translation.
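The examples above can be approximated by stripping known affixes. The sketch
below uses small hypothetical affix lists; real morphological analyzers use much
richer rules or statistical models (e.g., Morfessor), since naive stripping
misfires on words like "under".

PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ment", "ing", "ly", "ed", "s"]

def segment(word):
    """Split a word into prefix, stem, and suffix morphemes (toy version)."""
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffixes = []
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, "-" + s)
                word = word[:-len(s)]
                changed = True
                break
    return morphemes + [word] + suffixes

print(segment("unhappily"))      # ['un-', 'happi', '-ly']  (note the y->i spelling change)
print(segment("rearrangement"))  # ['re-', 'arrange', '-ment']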
Word order
The order of words in a sentence can have a significant impact on the meaning
of the sentence, making it important to correctly identify the relationships
between words.
Irregularity
Irregular verbs: In English, verbs like go → went and do → did don’t follow the
usual -ed past tense rule.
Irregular plurals: Words like child → children and foot → feet don’t follow the
usual -s pluralization rule.
Morphological irregularity: Some languages, like Spanish, have irregular
verb conjugations (tener → tengo).
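Irregular forms are one reason practical lemmatizers combine rules with
exception tables. A toy sketch (the word lists are illustrative, not taken from
any real system):

# Exception table: regular rules alone cannot recover "go" from "went".
IRREGULAR = {"went": "go", "did": "do", "children": "child", "feet": "foot"}

def lemma(word):
    if word in IRREGULAR:     # lookup for irregular forms
        return IRREGULAR[word]
    if word.endswith("ed"):   # regular past-tense rule
        return word[:-2]
    if word.endswith("s"):    # regular plural rule
        return word[:-1]
    return word

for w in ["walked", "went", "cats", "children"]:
    print(w, "->", lemma(w))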
Ambiguity
Ambiguity occurs when a word, phrase, or sentence has multiple meanings, making it
difficult for NLP to determine the correct interpretation. Key types of ambiguity:
Homonyms: Words that sound/spell the same but have different meanings
(bank = financial institution or riverbank).
Polysemy: Words with multiple related meanings (book = a reading object or
the action of reserving something).
Syntactic ambiguity: A sentence that can be interpreted in multiple ways (I
saw her duck = seeing a bird or watching someone lower their head).
Cultural/Linguistic differences: Idioms like kick the bucket (meaning to die)
may confuse NLP models.
To resolve ambiguity, NLP systems use context: word sense disambiguation
algorithms and context-sensitive models select among candidate meanings based
on the surrounding words, as in the sketch below.
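One classic context-based method is the Lesk algorithm, which picks the WordNet
sense whose dictionary definition overlaps most with the surrounding words. A
minimal sketch using NLTK's implementation (assumes nltk is installed and the
WordNet data can be downloaded):

import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

context = "I deposited my money in the bank yesterday".split()
sense = lesk(context, "bank")
# Lesk is a simple baseline; the chosen sense is not always correct.
print(sense, "-", sense.definition())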
Productivity
Productivity refers to the ability of a language to generate new words using existing
rules, which creates challenges for NLP systems. Examples include:
Inflectional morphology: productive rules such as the regular -s plural and
-ed past tense apply to newly coined words (e.g., "googled," "tweets"),
producing forms never seen in training data.
Since many new word forms may not exist in dictionaries or training data, NLP
systems use:
1. Rule-Based Models
2. Statistical Models
3. Neural Models
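Because productive word formation guarantees out-of-vocabulary forms, modern
systems often fall back on sub-word units. Below is a toy sketch of the BPE
merge-learning idea mentioned earlier: repeatedly merge the most frequent
adjacent symbol pair. The corpus and merge count are made up for illustration.

from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
words = {tuple("processing"): 5, tuple("process"): 4, tuple("procession"): 2}
for step in range(6):  # learn 6 merges
    best = max(get_pair_counts(words), key=get_pair_counts(words).get)
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best}")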
DOCUMENT STRUCTURE
In natural language processing (NLP), documents are not just a random collection of
words, but they have an inherent structure. This structure typically includes
sentences, paragraphs, and topics that make the text coherent and meaningful.
Understanding this structure is crucial for various NLP tasks such as parsing, machine
translation, and semantic labeling. These tasks rely on identifying the organization
and boundaries within the text.
SENTENCE BOUNDARY DETECTION (SBD)
o Sentence boundary detection (sentence segmentation) deals with automatically
segmenting a sequence of word tokens into sentence units, i.e., identifying
where one sentence ends and the next begins.
o In languages like English, the beginning of a sentence is often marked by an
uppercase letter, and the end of a sentence is explicitly marked with punctuation (e.g.,
a period, question mark, or exclamation mark).
o Example:
"I spoke with Dr. Smith." → The abbreviation "Dr." does not mark the end of a
sentence.
"My house is on Mountain Dr." → The abbreviation "Dr." marks the end of the
sentence.
o Used for text summarization, machine translation, sentiment analysis, and
part-of-speech tagging.
o Challenges: punctuation is ambiguous; a period may mark an abbreviation, a
decimal number, or an ellipsis rather than a sentence boundary, as the "Dr."
examples above show (a toy illustration follows).
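A toy sketch of the "Dr." problem: treat each period as a candidate boundary
and apply simple hand-written rules. The abbreviation list is hypothetical;
real systems learn such cues from data, as described under Methods below.

import re

ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        words_before = text[start:match.start()].split()
        prev_word = words_before[-1].lower() if words_before else ""
        at_end_of_text = not text[end:].strip()
        # A period following a known abbreviation usually does NOT end a
        # sentence ("Dr. Smith"), unless nothing follows it ("Mountain Dr.").
        if prev_word not in ABBREVIATIONS or at_end_of_text:
            sentences.append(text[start:end].strip())
            start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("I spoke with Dr. Smith. My house is on Mountain Dr."))
# -> ['I spoke with Dr. Smith.', 'My house is on Mountain Dr.']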
TOPIC BOUNDARY DETECTION
o Topic boundary detection (topic segmentation) aims to identify the points in
a text where the subject or topic of discussion changes. This is an essential
task for understanding the overall structure of a document and breaking it
down into more manageable parts.
o Challenges: topic shifts are rarely marked explicitly and often occur
gradually, so boundaries must be inferred from indirect cues such as changes
in vocabulary.
METHODS
Classification Problem
Given a set of training examples (x, y), the goal is to find a function that assigns the
most accurate label y for unseen examples x.
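A minimal sketch of this formulation applied to sentence boundary detection,
using scikit-learn; the library and the two features are my choices, not
specified in the notes. Each candidate punctuation mark becomes a feature
vector x with a boundary/non-boundary label y:

from sklearn.linear_model import LogisticRegression

# Hypothetical features: [previous token is a known abbreviation,
#                         next token is capitalized]
X_train = [
    [1, 1],  # "Dr. Smith"  -> abbreviation + capitalized name: not a boundary
    [0, 1],  # "mat. The"   -> ordinary word + capitalized next: boundary
    [1, 0],  # "etc. and"   -> abbreviation + lowercase next: not a boundary
    [0, 0],  # "3.14 is"    -> inside a number: not a boundary
    [0, 1],  # "done. We"   -> boundary
]
y_train = [0, 1, 0, 0, 1]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[0, 1]]))  # an unseen "word. Capitalized" case -> likely 1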
The methods for segmentation can be classified into two primary categories,
generative and discriminative approaches (compared under "Complexity of
Approaches" below).
Text Tiling is a method for dividing a document into segments that share a
common topic. It uses a lexical cohesion metric in a word vector space: the
document is split wherever the similarity between adjacent blocks falls below
some threshold.
1. Block Comparison: This method compares two adjacent blocks of text (e.g.,
sentences or paragraphs) and measures their similarity by comparing how many
words they share.
o Formula: the similarity score between blocks b1 and b2 can be computed as
the normalized inner product (cosine) of their term-weight vectors:
\mathrm{sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1} \, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \sum_t w_{t,b_2}^2}}
o where w_{t,b_1} is the weight assigned to term t in block b_1, and the weights
can be binary or computed using retrieval metrics such as term frequency (TF).
2. Vocabulary Introduction: This method scores a gap between two blocks by
counting how many new words appear in the interval between the blocks.
o Formula: the topical cohesion score at gap i is computed as:
\mathrm{score}(i) = \frac{\mathrm{NumNewTerms}(b_1) + \mathrm{NumNewTerms}(b_2)}{2w}
o where NumNewTerms(b) is the number of terms in block b seen for the first
time in the text, and w is the block size in tokens.
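Both scores translate directly into Python. Below is a minimal sketch of each,
using raw term frequencies as the weights and reading "new" as "not seen
earlier in the text"; the toy blocks are made up for illustration.

from collections import Counter
import math

def block_similarity(b1, b2):
    """Block comparison: cosine similarity of term-frequency vectors."""
    w1, w2 = Counter(b1), Counter(b2)
    num = sum(w1[t] * w2[t] for t in set(w1) & set(w2))
    den = math.sqrt(sum(v * v for v in w1.values()) *
                    sum(v * v for v in w2.values()))
    return num / den if den else 0.0

def vocab_introduction_score(b1, b2, seen, w):
    """Vocabulary introduction: first-time terms around a gap, normalized by 2w."""
    new1 = len(set(b1) - seen)
    new2 = len(set(b2) - (seen | set(b1)))
    return (new1 + new2) / (2 * w)

b1 = "the cat sat on the mat".split()
b2 = "stock markets fell sharply again today".split()
print(block_similarity(b1, b2))                      # low similarity -> likely boundary
print(vocab_introduction_score(b1, b2, {"the"}, 6))  # many new terms -> likely boundary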
COMPLEXITY OF APPROACHES
1. Discriminative Approaches:
2. Generative Models:
3. Discriminative Classifiers:
Training complexity: manageable for smaller datasets, and these models
can handle a large number of different features.
Prediction complexity: slower than generative models because prediction
involves more complex computations.
4. Sequence Approaches:
PERFORMANCE OF APPROACHES
o Error rates can be low, but even small mistakes can affect later
processing stages in NLP tasks.
o Error Rate: Measures how many mistakes the system made in finding
sentence boundaries.
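One common way to compute a boundary error rate is to count missed and
spurious boundaries against a reference; exact definitions vary across papers,
so this is an illustrative choice.

def boundary_error_rate(predicted, gold):
    """Missed + spurious boundaries, normalized by the number of gold boundaries."""
    predicted, gold = set(predicted), set(gold)
    missed = len(gold - predicted)    # gold boundaries the system failed to find
    spurious = len(predicted - gold)  # boundaries the system invented
    return (missed + spurious) / len(gold)

# Boundary positions as token indices (made-up numbers for illustration).
print(boundary_error_rate(predicted={10, 42, 77}, gold={10, 42, 80}))  # ~0.667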