NLPNotes

The document provides an overview of Natural Language Processing (NLP), focusing on its goals, challenges, and the structure of words and documents. It discusses key components such as tokens, lexemes, and morphemes, as well as issues like ambiguity and irregularity that NLP systems face. Additionally, it covers various morphological models used in NLP and techniques for detecting sentence and topic boundaries in text.

UNIT 1

Natural Language Processing (NLP):

 NLP is a subset of artificial intelligence (AI) that focuses on the interaction
between computers and human language. It involves the development of
algorithms and models to enable computers to understand, interpret, and
generate human language.
 NLP deals with practical applications of computational linguistics, such as
machine translation, sentiment analysis, text summarization, and chatbot
development.
 The goal of NLP is to facilitate communication between humans and computers
by enabling machines to process and understand natural language input.

FINDING STRUCTURE OF WORDS

• "Finding the structure of words" generally refers to the process of analyzing


and understanding the components and organization of words in a language.
• Human language is complex but not random; it has structure and organization.
• Words and expressions consist of smaller components that give them meaning.
• Different Linguistic Disciplines:
• Morphology studies how words change form and function.
• Syntax looks at how words combine into sentences.
• Phonology examines pronunciation rules.
• Orthography deals with writing conventions.
• Semantics focuses on meaning.
• Etymology & Lexicology study word origins and relationships.
• Morphological parsing is the process of identifying word structure and how it
connects to grammar and meaning.
• It helps in understanding language, especially in multilingual contexts.

WORDS AND THEIR COMPONENTS

Words are the smallest linguistic units that can form a complete utterance by
themselves

Components:

 Tokens
 Lexemes
 Morphemes

1. TOKENS:
 A "token" refers to a unit of text that has been extracted from a larger body
of text.
 Tokens are the building blocks that NLP models use to process and
understand language.
 The process of breaking down a text into individual tokens is known as
tokenization.
 Each token typically corresponds to a word, although it can also represent
sub-word units, characters, or other linguistic elements depending on the
specific tokenization strategy.
 Word Tokens: Each word in a sentence is treated as a separate token.
example "I love natural language processing": ["I", "love", "natural",
"language", "processing"].
 Sub-word Tokens: In the context of sub-word tokenization, words are
broken down into smaller units. One common approach is Byte Pair
Encoding (BPE). example: "processing" Sub-word Tokens (after BPE):
["pro", "cess", "ing"]
 Character Tokens: In character tokenization, each character in the text is
treated as a separate token. example: "natural" Character Tokens: ["n", "a",
"t", "u", "r", "a", "l"]

 Role in NLP
o Used in text classification, sentiment analysis, named entity
recognition.
o Tokenization splits text into tokens.
o Part-of-Speech (POS) tagging assigns grammatical categories.
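
A minimal Python sketch of the three tokenization strategies above; the sub-word vocabulary is a hypothetical stand-in for the fragment inventory a real BPE model would learn from data.

```python
# Word, character, and (toy) sub-word tokenization.
text = "I love natural language processing"

word_tokens = text.split()   # real tokenizers also handle punctuation
print(word_tokens)           # ['I', 'love', 'natural', 'language', 'processing']

char_tokens = list("natural")
print(char_tokens)           # ['n', 'a', 't', 'u', 'r', 'a', 'l']

# Toy stand-in for BPE: greedily match fragments from a hypothetical
# learned vocabulary, longest fragment first.
subword_vocab = ["pro", "cess", "ing"]

def subword_tokenize(word, vocab):
    tokens, rest = [], word
    while rest:
        for piece in sorted(vocab, key=len, reverse=True):
            if rest.startswith(piece):
                tokens.append(piece)
                rest = rest[len(piece):]
                break
        else:
            tokens.append(rest[0])  # fall back to single characters
            rest = rest[1:]
    return tokens

print(subword_tokenize("processing", subword_vocab))  # ['pro', 'cess', 'ing']
```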

2. LEXEMES
 A lexeme is a unit of vocabulary that represents a single concept,
regardless of its grammatical variations.
 Lexemes can be divided by their behavior into the lexical categories of
verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
i. "Run," "runs," "running," and "ran" belong to the same lexeme
(concept of running).
ii. "Bank" (financial institution) and "bank" (river edge) are different
lexemes.
 Role in NLP
o Lexical analysis identifies and categorizes lexemes.
o Helps in stemming, lemmatization, and part-of-speech tagging.
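
A small sketch of lexeme grouping, assuming NLTK is installed along with its WordNet data; note that even the irregular form "ran" maps to the same lemma.

```python
# Group inflected forms of a verb under one lexeme via lemmatization.
import nltk
nltk.download("wordnet", quiet=True)  # one-time data download
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for form in ["run", "runs", "running", "ran"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))
# All four forms map to the lemma "run", i.e., the same lexeme.
```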
3. MORPHEMES
 A morpheme is the smallest unit of meaning in a language.
 Free Morphemes: Can stand alone (e.g., "book," "run," "happy").
 Bound Morphemes: Cannot stand alone, must attach to another
morpheme.
i. Prefixes: Added at the beginning (e.g., "un-" in "unhappy").
ii. Suffixes: Added at the end (e.g., "-ed" in "walked").
 Examples
i. Unhappily = "un-" (not) + "happy" + "-ly" (manner).
ii. Rearrangement = "re-" (again) + "arrange" + "-ment" (act of
arranging).
iii. Cats = "cat" (free morpheme) + "-s" (plural).
 Role in NLP
o Used in part-of-speech tagging, sentiment analysis,
machine translation.
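
A toy analyzer for the examples above, assuming a hand-listed affix inventory and a tiny free-morpheme lexicon; real systems use far larger resources and richer spelling rules.

```python
# Peel off known prefixes and suffixes (bound morphemes) around a small
# free-morpheme lexicon, handling the common y -> i change (happy -> happi-ly).
PREFIXES = {"un": "not", "re": "again"}
SUFFIXES = {"ment": "act of", "ly": "manner", "ed": "past", "s": "plural"}
LEXICON = {"happy", "arrange", "cat", "walk"}

def restore(stem):
    # undo the y -> i spelling change ("happi" -> "happy")
    return stem[:-1] + "y" if stem.endswith("i") else stem

def analyze(word):
    parts, rest = [], word
    for p in PREFIXES:  # strip at most one prefix
        if rest.startswith(p):
            parts.append(p + "-")
            rest = rest[len(p):]
            break
    suffix = None
    for s in sorted(SUFFIXES, key=len, reverse=True):  # longest suffix first
        if rest.endswith(s) and restore(rest[:-len(s)]) in LEXICON:
            suffix, rest = "-" + s, restore(rest[:-len(s)])
            break
    parts.append(rest)
    if suffix:
        parts.append(suffix)
    return parts

print(analyze("unhappily"))      # ['un-', 'happy', '-ly']
print(analyze("rearrangement"))  # ['re-', 'arrange', '-ment']
print(analyze("cats"))           # ['cat', '-s']
```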

ISSUES AND CHALLENGES

Finding the structure of words in natural language processing (NLP) can be a
challenging task due to various issues and challenges. Some of these issues and
challenges are:

1. Ambiguity: Many words in natural language have multiple meanings, and it
can be difficult to determine the correct meaning of a word in a particular
context.

2. Morphology: Many languages have complex morphology, meaning that
words can change their form based on various grammatical features like tense,
gender, and number. This makes it difficult to identify the underlying structure
of a word.

3. Word order: The order of words in a sentence can have a significant impact
on the meaning of the sentence, making it important to correctly identify the
relationship between words.

4. Informal language: Informal language, such as slang or colloquialisms, can
be challenging for NLP systems to process since they often deviate from the
standard rules of grammar.

5. Out-of-vocabulary words: NLP systems may not have encountered a word
before, making it difficult to determine its structure and meaning.

6. Named entities: Proper nouns, such as names of people or organizations, can
be challenging to recognize and structure correctly.

7. Language-specific challenges: Different languages have different structures
and rules, making it necessary to develop language-specific approaches for
NLP.

8. Domain-specific challenges: NLP systems trained on one domain may not
be effective in another domain, such as medical or legal language.

Key challenges in Natural Language Processing (NLP):

Irregularity

Irregularity refers to words that do not follow standard patterns of formation or
inflection, making it difficult for NLP systems to process them accurately. Examples
include:

 Irregular verbs: In English, verbs like go → went and do → did don’t follow the
usual -ed past tense rule.
 Irregular plurals: Words like child → children and foot → feet don’t follow the
usual -s pluralization rule.
 Morphological irregularity: Some languages, like Spanish, have irregular
verb conjugations (tener → tengo).

To handle this, NLP models use:

 Rule-based systems to incorporate irregular forms into standard rules.
 Machine learning algorithms to learn patterns from large datasets.

Despite advancements, irregularity remains a major challenge, especially in
languages with complex morphology.
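
A minimal sketch of the rule-based strategy: irregular forms live in an exception dictionary that is consulted before the regular -ed rule applies (the verb list here is illustrative, not exhaustive).

```python
# Hybrid handling of irregularity: lexicon lookup first, regular rule second.
IRREGULAR_PAST = {"go": "went", "do": "did", "eat": "ate", "have": "had"}

def past_tense(verb):
    if verb in IRREGULAR_PAST:   # irregular: exception dictionary
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):       # regular spelling variant: bake -> baked
        return verb + "d"
    return verb + "ed"           # regular rule: walk -> walked

print(past_tense("go"), past_tense("walk"), past_tense("bake"))
# went walked baked
```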

Ambiguity

Ambiguity occurs when a word, phrase, or sentence has multiple meanings, making it
difficult for NLP to determine the correct interpretation. Key types of ambiguity:

 Homonyms: Words that sound/spell the same but have different meanings
(bank = financial institution or riverbank).
 Polysemy: Words with multiple related meanings (book = a reading object or
the action of reserving something).
 Syntactic ambiguity: A sentence that can be interpreted in multiple ways (I
saw her duck = seeing a bird or watching someone lower their head).
 Cultural/Linguistic differences: Idioms like kick the bucket (meaning to die)
may confuse NLP models.
To resolve ambiguity, NLP systems use:

 Contextual information: Analyzing surrounding words.
 Part-of-speech tagging: Identifying word roles (noun, verb, etc.).
 Syntactic parsing: Determining sentence structure.
 Machine learning models trained on large datasets.
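
A small sketch of the part-of-speech-tagging signal on the "I saw her duck" example, assuming NLTK and its tagger data are installed (data-package names vary slightly across NLTK versions).

```python
# POS tags as one contextual disambiguation signal.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("I saw her duck")
print(nltk.pos_tag(tokens))
# e.g., [('I', 'PRP'), ('saw', 'VBD'), ('her', 'PRP$'), ('duck', 'NN')]
# The tagger commits to one reading (duck as a noun); recovering the verb
# reading ("watched her lower her head") needs wider context than POS alone.
```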

Productivity

Productivity refers to the ability of a language to generate new words using existing
rules, which creates challenges for NLP systems. Examples include:

 Word formation: New words like smartphone, cyberbully, or workaholic
combine existing words.
 Prefix/suffix usage:
o Un- → happy → unhappy
o -er → run → runner
 Inflectional morphology:
o walk → walked (past tense)
o big → bigger (comparative adjective)

Since many new word forms may not exist in dictionaries or training data, NLP
systems use:

 Morphological analysis algorithms to predict word structure.
 Statistical models to recognize new words.
 Machine learning to identify word patterns.
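
A toy sketch of the morphological-analysis fallback: when dictionary lookup fails for a newly coined word, productive affixes suggest a plausible structure (KNOWN is a stand-in for a real lexicon).

```python
# Guess the structure of out-of-vocabulary words from productive affixes.
KNOWN = {"smartphone", "runner", "unhappy"}

def guess_unknown(word):
    if word in KNOWN:
        return "known word"
    if word.startswith("un"):
        return f"likely negated form of '{word[2:]}'"
    if word.endswith("er"):
        return f"likely agent noun: '{word[:-2]}' + -er"
    if word.endswith("ed"):
        return f"likely past tense of '{word[:-2]}'"
    return "no analysis"

print(guess_unknown("streamer"))  # likely agent noun: 'stream' + -er
print(guess_unknown("uncool"))    # likely negated form of 'cool'
```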

MORPHOLOGICAL MODELS IN NLP

Morphological models in Natural Language Processing (NLP) analyse the structure of
words, including inflectional and derivational patterns. They help in tasks like part-of-
speech tagging, named entity recognition, machine translation, and text-to-speech
synthesis. The main types of morphological models include:

1. Rule-Based Models

o Use handcrafted linguistic rules to analyse word structure.
o These rules are based on linguistic knowledge and are manually created
by experts in the language.
o Best for languages with simple morphology (e.g., English).

2. Statistical Models

o Learn the morphological structure of words from large datasets of
annotated text, using machine learning techniques like Hidden Markov
Models (HMMs) or Conditional Random Fields (CRFs).
o Generally more accurate than rule-based models.

3. Neural Models

o Use deep learning architectures like Recurrent Neural Networks (RNNs)
and Transformers.
o Achieve state-of-the-art results, especially in complex languages like
Arabic and Turkish.
4. Dictionary Lookup
o A dictionary or lexicon is used to store information about the words in a
language, including their inflectional and derivational forms, parts of
speech, and other relevant features.
o When a word is encountered in a text, the dictionary is consulted to
retrieve its properties.
o Techniques to improve accuracy:
 Lemmatization: Reducing a word to its base form (e.g., running
→ run).
 Stemming: Reducing a word to its root (e.g., jumping → jump).
 Morphological Analysis: Identifying word components (prefixes,
suffixes, roots).
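
A brief sketch contrasting the stemming and lemmatization techniques above, assuming NLTK and its WordNet data are available.

```python
# Stemming vs. lemmatization on the same words.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["running", "jumping", "studies"]:
    print(word, "| stem:", stemmer.stem(word),
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))
# Stemming chops suffixes ("studies" -> "studi"); lemmatization maps to a
# real dictionary base form ("studies" -> "study").
```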
5. Finite-State Morphology
o Based on the principles of finite-state automata. Uses Finite-State
Transducers (FSTs) to model word formation.
o FSTs accept a set of strings or sequences of symbols, which represent
the morphemes that make up the word. Each morpheme is associated
with a set of features that describe its properties, such as its part of
speech, gender, tense, or case.
o There are two primary operations:
 Analysis: The transducer breaks down words into morphemes,
identifying their grammatical features.
 Generation: The transducer combines morphemes to generate a
word, applying appropriate inflections and derivations.
o Works well for languages with regular morphology (e.g., Turkish,
Finnish).
o Advantages:
 Fast and efficient.
 Transparent rules for linguists.
6. Unification-Based Morphology
o Based on the principles of unification and feature-based grammar.
o Words are modeled as a set of feature structures, which are
hierarchically organized representations of the properties and attributes
of a word.
o Each feature structure is associated with a set of features and values
that describe the word's morphological and syntactic properties
o There are two main operations:
 Analysis: The system applies rules to a word's feature structure to identify its morphemes
and properties.
 Generation: It constructs a feature structure to represent a set of morphemes, generating
a word with the required features.
o Helps in complex and irregular morphology (e.g., Arabic, German).
o Advantages:
 Flexible and expressive (handles many linguistic phenomena).
 Modular and reusable.
o Disadvantages:
 Computationally expensive.
7. Functional Morphology
o This approach is based on functional and cognitive linguistics and
emphasizes how words function in communication.
o Words are modeled as lexemes (units of meaning) with a set of abstract
features that reflect their semantic, pragmatic, and discourse
properties.
o The goal is to capture how morphological and syntactic structures
reflect their communicative functions in context.
o Advantages:
 Corpus-driven and usage-based.
 Can integrate with cognitive linguistics.
o Disadvantages:
 Requires large annotated datasets.
8. Morphology Induction
o This is an unsupervised, data-driven approach where algorithms learn
the underlying morphology of a language by analysing large corpora of
raw text.
o The task is to identify morpheme boundaries based on statistical
patterns and distributional properties.
o Works for agglutinative or low-resource languages.
o Techniques:
 Clustering.
 Probabilistic modeling.
 Neural networks.
o Advantages:
 No need for manual rules or annotated data.
o Disadvantages:
 May be less accurate than other models.

FINDING THE STRUCTURE OF DOCUMENTS

In natural language processing (NLP), documents are not just a random collection of
words, but they have an inherent structure. This structure typically includes
sentences, paragraphs, and topics that make the text coherent and meaningful.
Understanding this structure is crucial for various NLP tasks such as parsing, machine
translation, and semantic labeling. These tasks rely on identifying the organization
and boundaries within the text.

1. Sentence Boundary Detection (SBD):

o SBD involves identifying where one sentence ends and another begins in a
sequence of words.
o Sentence boundary detection (Sentence segmentation) deals with
automatically segmenting a sequence of word tokens into sentence units.
o In languages like English, the beginning of a sentence is often marked by an
uppercase letter, and the end of a sentence is explicitly marked with punctuation (e.g.,
period, question mark, exclamation mark).
o Example:
 "I spoke with Dr. Smith." → The abbreviation "Dr." does not mark the end of a
sentence.
 "My house is on Mountain Dr." → The abbreviation "Dr." marks the end of the
sentence.
o Used for text summarization, machine translation, sentiment analysis, and
part-of-speech tagging.

o Challenges:

 Ambiguities: Sentences can end with punctuation marks like
periods (.), question marks (?), or exclamation marks (!), or may not
end with punctuation at all.
 Abbreviations: In cases like "Dr." (Doctor) or "e.g." (for example), a
period does not mark the end of a sentence, leading to errors if
punctuation alone is used to detect boundaries.
 Domain-Specific Text: Specialized domains (e.g., legal, medical)
may have non-standard sentence structures that confuse traditional
boundary detection systems.
 Multilingual Text: Different languages follow different conventions
for sentence boundaries, making cross-lingual sentence
segmentation challenging.

o Techniques for SBD:

 Rule-Based Methods: These methods use a set of pre-defined
rules or heuristics to identify sentence boundaries, considering
punctuation, abbreviations, and context.
 Machine Learning: Algorithms like Conditional Random Fields (CRF)
and Recurrent Neural Networks (RNNs) are trained on labelled data
to predict sentence boundaries based on the characteristics of the
input text.
 Language-Specific Models: Certain languages may require
customized models that account for their specific punctuation,
abbreviations, or syntactic rules.
 Pre-trained Models: Language models like BERT or GPT can be
fine-tuned for sentence boundary detection tasks, leveraging their
ability to understand context.
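
As a concrete illustration of the rule-based approach, here is a minimal splitter that handles the "Dr." examples above; the abbreviation list and the never-split-after-abbreviation heuristic are deliberate simplifications of what production systems do.

```python
# Rule-based sentence boundary detection with an abbreviation list.
import re

ABBREVIATIONS = {"dr", "mr", "mrs", "etc", "vs"}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]", text):
        i = m.end()
        words = text[start:i].rstrip(".!?").split()
        prev = words[-1].lower() if words else ""
        if prev in ABBREVIATIONS:
            continue  # "Dr." in "Dr. Smith" is not a boundary
        sentences.append(text[start:i].strip())
        start = i
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # catches "Mountain Dr." at end of text
    return sentences

print(split_sentences("I spoke with Dr. Smith. My house is on Mountain Dr."))
# ['I spoke with Dr. Smith.', 'My house is on Mountain Dr.']
```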

2. Topic Boundary Detection (Topic Segmentation):

o Topic boundary detection aims to identify the points in a text where the
subject or topic of discussion changes. This is an essential task for
understanding the overall structure of a document and breaking it down
into more manageable parts.

o By detecting topic boundaries, we can improve applications like
information retrieval, machine translation, and text summarization. In
information retrieval, for example, segmenting a long document into
smaller topic-based sections can help retrieve only the relevant content
in response to a user's query.

o Challenges:

 Topic Ambiguity: What constitutes a "topic" can vary depending
on the context, and different people might interpret the topic of a
text differently.
 Language Variations: Different languages and writing styles use
different signals (e.g., discourse markers, paragraph breaks) to
indicate a shift in topic.
 Implicit vs. Explicit Boundaries: Topics may be signaled
explicitly (e.g., through headings or keywords) or implicitly (e.g.,
through discourse markers or logical transitions), making it harder
to identify boundaries in some cases.

o Techniques for Topic Boundary Detection:

 Rule-Based Methods: These methods rely on predefined rules
or heuristics to detect topic boundaries, considering clues like
paragraph breaks, topic-related keywords, or specific linguistic
markers.
 Machine Learning: Supervised machine learning techniques can
be used to train models that predict topic boundaries based on
labeled training data. Algorithms like Support Vector Machines
(SVM), CRFs, or RNNs can be applied to recognize transitions
between topics.
 Topic Models: Statistical models like Latent Dirichlet Allocation
(LDA) or Non-Negative Matrix Factorization (NMF) can be used to
detect topic changes. These models analyze how topic
distributions change throughout a document or corpus, identifying
segments where a shift in topic occurs.
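
As an illustration of the topic-model approach, here is a hedged sketch using scikit-learn's LDA implementation: a sharp change in the topic distribution between adjacent blocks suggests a boundary (the three blocks are toy data, so the output is only indicative).

```python
# Topic-shift detection with LDA (pip install scikit-learn).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

blocks = [
    "the bank approved the loan and the interest rate",
    "loan payments and mortgage interest at the bank",
    "the river bank was muddy after the heavy rain",
]
X = CountVectorizer(stop_words="english").fit_transform(blocks)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
dist = lda.transform(X)  # per-block topic distributions

# A large jump between adjacent blocks suggests a topic boundary
# (here, expected between blocks 1 and 2).
for i in range(len(blocks) - 1):
    shift = np.abs(dist[i] - dist[i + 1]).sum()
    print(f"boundary candidate after block {i}: shift = {shift:.2f}")
```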

METHODS

Boundary Classification Problem

 Both sentence segmentation and topic segmentation are framed as boundary
classification problems. The task involves predicting whether a given boundary
candidate is an actual boundary.
 Let x ∈ X be the vector of features associated with a boundary candidate.
 Let y ∈ Y be the label predicted for that candidate (boundary or
non-boundary).
 The label y can be:
o b̄ (b-bar) for non-boundary.
o b for boundary.

Classification Problem

 Given a set of training examples (x, y), the goal is to find a function that assigns the
most accurate label y for unseen examples x.

Fine-grained Boundary Types

 Instead of a simple binary classification, segmentation can involve multiple
categories:
 For text segmentation, a three-class problem might include:
o b_a: sentence boundary marked by an abbreviation.
o b_ā: sentence boundary not involving an abbreviation.
o b̄_a: abbreviation that does not mark a sentence boundary.
 For spoken language segmentation, a three-class problem can include:
o b_s: statement boundary.
o b_q: question boundary.
o b̄: non-boundary.

The methods for segmentation can be classified into two primary categories:

1. Local Classification: Each boundary is treated independently.
2. Sequence Classification: The sequence of boundaries is considered as a
whole.

Independently of this, the models used can be:

3. Generative Models: These estimate the joint distribution P(X, Y) of the
observations (words, punctuation) and the labels (sentence boundary, topic
boundary).
4. Discriminative Models: These estimate the conditional probability
P(Y | X), which focuses on distinguishing between different boundary labels
given the features.

3. Generative Sequence Classification Methods

 The most commonly used generative sequence classification method for topic
and sentence segmentation is the hidden Markov model (HMM).
 The most probable label sequence is chosen using Bayes' rule:

   Ŷ = argmax_Y P(Y|X) = argmax_Y P(X|Y) P(Y) / P(X) = argmax_Y P(X|Y) P(Y)

o Y = (y1, y2, ..., yk) = sequence of class (boundary) labels
o X = (x1, x2, ..., xn) = sequence of feature vectors
o P(Y|X) = probability of the label sequence Y given the feature sequence X
o P(X) = probability of the word (feature) sequence
o P(Y) = prior probability of the class (boundary) sequence
 With the joint probability distribution function, given a Y, one can calculate
("generate") its respective X. For this reason, these are called "generative" models.
 P(X) in the denominator is dropped because it is fixed for different Y and
hence does not change the argument of the max.
 P(X|Y) and P(Y) can be estimated from counts on the training data, e.g.,
with n-gram language models and maximum likelihood estimation.
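
A minimal generative sketch in the spirit of the formula above, using a Naive-Bayes-style simplification of a full HMM: P(x|y) and P(y) are estimated from counts over a small hypothetical training set, and the label is chosen by argmax of P(x|y)·P(y).

```python
# Count-based generative boundary classifier (toy data).
from collections import Counter, defaultdict

# Each example: (feature = token before the candidate period, label)
# "b" = boundary, "nb" = non-boundary.
train = [("Smith", "b"), ("today", "b"), ("Dr", "nb"), ("etc", "nb"),
         ("home", "b"), ("Mr", "nb")]

prior = Counter(label for _, label in train)   # counts of y
likelihood = defaultdict(Counter)              # counts of x given y
for feat, label in train:
    likelihood[label][feat] += 1

def classify(feat):
    def score(label):
        p_y = prior[label] / len(train)
        # add-one smoothing over a hypothetical feature vocabulary of 100
        p_x_given_y = (likelihood[label][feat] + 1) / (prior[label] + 100)
        return p_x_given_y * p_y                  # P(x|y) * P(y)
    return max(prior, key=score)                  # argmax over labels

print(classify("Dr"))     # nb -> "Dr." does not end the sentence
print(classify("Smith"))  # b  -> boundary
```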
4. Discriminative Local Classification Methods

 TextTiling is a method for dividing a document into segments that share a
common topic. It uses a lexical cohesion metric in a word vector space.
 The document is divided wherever the similarity between adjacent blocks
falls below some threshold.

There are two methods to compute similarity:

1. Block Comparison: This method compares two adjacent blocks of text (e.g.,
sentences or paragraphs) and measures their similarity by comparing how many
words they share.
o Formula: The similarity score between blocks b1 and b2 can be computed
using:

   sim(b1, b2) = Σ_t w(t,b1) · w(t,b2) / sqrt( Σ_t w(t,b1)² · Σ_t w(t,b2)² )

o where w(t,b1) is the weight assigned to term t in block b1, and the weights can
be binary or computed using retrieval metrics like term frequency (TF).
2. Vocabulary Introduction: This method scores a gap between two blocks by
counting how many new words appear in the interval between the blocks.
o Formula: The topical cohesion score at gap i, for blocks of w token-sequences
each, is computed as:

   score(i) = ( NumNewTerms(b1) + NumNewTerms(b2) ) / (2w)

o where NumNewTerms(b) returns the number of terms in block b seen for the
first time in the text.
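
A short sketch of the block-comparison measure as cosine similarity over term-frequency vectors (weights here are raw TF counts).

```python
# Cosine similarity between adjacent text blocks, as in TextTiling's
# block comparison; a low score suggests a topic boundary.
from collections import Counter
from math import sqrt

def block_similarity(block1, block2):
    w1, w2 = Counter(block1.lower().split()), Counter(block2.lower().split())
    dot = sum(w1[t] * w2[t] for t in w1)          # Σ_t w(t,b1)·w(t,b2)
    norm = sqrt(sum(v * v for v in w1.values()) *
                sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0

same_topic = block_similarity("the loan interest rate rose",
                              "the bank raised the loan rate")
new_topic = block_similarity("the loan interest rate rose",
                             "the hiking trail follows the river")
print(f"{same_topic:.2f} vs {new_topic:.2f}")  # higher vs lower similarity
```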

5. Discriminative Sequence Classification Methods

 When segmenting text or speech, the decision for a given example (e.g., word,
sentence, or paragraph) depends not just on the features of the example itself but
also on the context (i.e., surrounding boundaries).
 To handle this, discriminative sequence classification methods are used.
These methods go beyond local classifiers by considering the dependencies
between consecutive examples.

Types of Sequence Models:

1. Conditional Random Fields (CRFs):
o CRFs are log-linear models used for labeling sequences (e.g., segmenting text
into sentences or topics).
o They consider the whole sequence of boundary candidates, rather than
labelling each boundary independently. This allows the model to make
decisions by considering neighbouring labels.
2. SVM Struct:
o An extension of SVMs for structured prediction tasks, where the goal is to
predict sequences of labels rather than just individual labels.
3. Maximum Margin Markov Networks (M3N):
o Extensions of Hidden Markov Models (HMMs) that use a maximum margin
approach to classify sequences. M3Ns are particularly useful when there is a
need to model complex dependencies in sequential data.
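
A hedged sketch of CRF-based boundary labeling using the third-party sklearn-crfsuite package (pip install sklearn-crfsuite); the feature set, labels, and training data are illustrative only.

```python
# Sequence labeling of sentence-boundary candidates with a CRF.
import sklearn_crfsuite

# One sequence = one document; one item = one token with simple features.
X_train = [[{"word": "spoke", "next_upper": False},
            {"word": "Dr",    "next_upper": True},
            {"word": "Smith", "next_upper": True}]]
y_train = [["nb", "nb", "b"]]  # label per token: boundary after it or not

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))    # labels predicted jointly over the sequence
```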

COMPLEXITY OF APPROACHES

1. Discriminative Approaches:

 Training complexity: Requires more steps and time because it keeps
adjusting to find the best model.
 Prediction complexity: Takes longer to make predictions because it
needs to process many features.

2. Generative Models:

 Training complexity: Easier to train, especially on large datasets, but
struggles with new, unseen data.
 Prediction complexity: Faster than discriminative models because it's
based on simple assumptions about data.

3. Discriminative Classifiers:

 Training complexity: Great for smaller datasets and can handle a lot of
different features.
 Prediction complexity: Slower than generative models because it
involves more complex calculations during prediction.

4. Sequence Approaches:

 Extra Complexity: These methods need to look at entire sequences of
decisions, not just individual decisions, which makes them more
complex.

PERFORMANCE OF APPROACHES

1. Sentence Segmentation in Text:

o Error rates can be low, but even small mistakes can affect later
processing stages in NLP tasks.

o For example, one rule-based system had an error rate of 1.41% on a
dataset of 27,000 sentences.

2. Sentence Segmentation in Speech:

o Error Rate: Measures how many mistakes the system made in finding
sentence boundaries.

o Precision: How many of the predicted boundaries were actually correct.
o Recall: How many of the correct boundaries were found by the system.
o F1-Measure: The harmonic mean of precision and recall, showing the
overall performance.
