NLPNotes
Words are the smallest linguistic units that can form a complete utterance by
themselves.
Components:
Tokens
Lexemes
Morphemes
1. TOKENS:
A "token" refers to a unit of text that has been extracted from a larger body
of text.
Tokens are the building blocks that NLP models use to process and
understand language.
The process of breaking down a text into individual tokens is known as
tokenization.
Each token typically corresponds to a word, although it can also represent
sub-word units, characters, or other linguistic elements depending on the
specific tokenization strategy.
Word Tokens: Each word in a sentence is treated as a separate token.
example "I love natural language processing": ["I", "love", "natural",
"language", "processing"].
Sub-word Tokens: In sub-word tokenization, words are broken down into
smaller units. One common approach is Byte Pair Encoding (BPE).
Example: "processing" → sub-word tokens (after BPE): ["pro", "cess", "ing"]
(see the tokenization sketch at the end of this section).
Character Tokens: In character tokenization, each character in the text is
treated as a separate token. Example: "natural" → character tokens: ["n", "a",
"t", "u", "r", "a", "l"].
Role in NLP
o Used in text classification, sentiment analysis, and named entity
recognition.
o Tokenization splits text into tokens.
o Part-of-Speech (POS) tagging assigns grammatical categories.
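The three tokenization strategies above can be illustrated with a short Python
sketch. This is a minimal illustration using only the standard library; the
sub-word split is hard-coded for the example, since real BPE segmentations come
from a learned merge table (a toy BPE learner appears later in these notes).

import re

text = "I love natural language processing"

# Word tokens: split on word characters (real tokenizers also handle punctuation).
word_tokens = re.findall(r"\w+", text)
print(word_tokens)  # ['I', 'love', 'natural', 'language', 'processing']

# Character tokens: every character becomes a token.
char_tokens = list("natural")
print(char_tokens)  # ['n', 'a', 't', 'u', 'r', 'a', 'l']

# Sub-word tokens: hard-coded here; a trained BPE model would produce
# splits like this from learned merge rules.
subword_tokens = ["pro", "cess", "ing"]
print(subword_tokens)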
2. LEXEMES
A lexeme is a unit of vocabulary that represents a single concept,
regardless of its grammatical variations.
Lexemes can be divided by their behavior into the lexical categories of
verbs, nouns, adjectives, conjunctions, particles, or other parts of speech.
i. "Run," "runs," "running," and "ran" belong to the same lexeme
(concept of running).
ii. "Bank" (financial institution) and "bank" (river edge) are different
lexemes.
Role in NLP
o Lexical analysis identifies and categorizes lexemes.
o Helps in stemming, lemmatization, and part-of-speech tagging.
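A quick way to see lexemes in practice is lemmatization, which maps inflected
forms back to a common dictionary form. A minimal sketch using NLTK's
WordNetLemmatizer (assumes nltk is installed and the WordNet data can be
downloaded):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time corpus download

lemmatizer = WordNetLemmatizer()
# "run", "runs", "running", and "ran" all reduce to the lexeme "run"
for form in ["run", "runs", "running", "ran"]:
    print(form, "->", lemmatizer.lemmatize(form, pos="v"))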
3. MORPHEMES
A morpheme is the smallest unit of meaning in a language.
Free Morphemes: Can stand alone (e.g., "book," "run," "happy").
Bound Morphemes: Cannot stand alone, must attach to another
morpheme.
i. Prefixes: Added at the beginning (e.g., "un-" in "unhappy").
ii. Suffixes: Added at the end (e.g., "-ed" in "walked").
Examples
i. Unhappily = "un-" (not) + "happy" + "-ly" (manner).
ii. Rearrangement = "re-" (again) + "arrange" + "-ment" (act of
arranging).
iii. Cats = "cat" (free morpheme) + "-s" (plural).
Role in NLP
o Used in part-of-speech tagging, sentiment analysis, and
machine translation.
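The examples above can be approximated by stripping known affixes. The sketch
below uses small hypothetical affix lists; real morphological analyzers use much
richer rules or statistical models (e.g., Morfessor), since naive stripping
misfires on words like "under".

PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ment", "ing", "ly", "ed", "s"]

def segment(word):
    """Split a word into prefix, stem, and suffix morphemes (toy version)."""
    morphemes = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffixes = []
    changed = True
    while changed:
        changed = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, "-" + s)
                word = word[:-len(s)]
                changed = True
                break
    return morphemes + [word] + suffixes

print(segment("unhappily"))      # ['un-', 'happi', '-ly']  (note the y->i spelling change)
print(segment("rearrangement"))  # ['re-', 'arrange', '-ment']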
Word order
The order of words in a sentence can have a significant impact on the meaning
of the sentence, making it important to correctly identify the relationships
between words.
Irregularity
Irregular verbs: In English, verbs like go → went and do → did don’t follow the
usual -ed past tense rule.
Irregular plurals: Words like child → children and foot → feet don’t follow the
usual -s pluralization rule.
Morphological irregularity: Some languages, like Spanish, have irregular
verb conjugations (tener → tengo).
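Irregular forms are one reason practical lemmatizers combine rules with
exception tables. A toy sketch (the word lists are illustrative, not taken from
any real system):

# Exception table: regular rules alone cannot recover "go" from "went".
IRREGULAR = {"went": "go", "did": "do", "children": "child", "feet": "foot"}

def lemma(word):
    if word in IRREGULAR:     # lookup for irregular forms
        return IRREGULAR[word]
    if word.endswith("ed"):   # regular past-tense rule
        return word[:-2]
    if word.endswith("s"):    # regular plural rule
        return word[:-1]
    return word

for w in ["walked", "went", "cats", "children"]:
    print(w, "->", lemma(w))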
Ambiguity
Ambiguity occurs when a word, phrase, or sentence has multiple meanings, making it
difficult for NLP to determine the correct interpretation. Key types of ambiguity:
Homonyms: Words that sound/spell the same but have different meanings
(bank = financial institution or riverbank).
Polysemy: Words with multiple related meanings (book = a reading object or
the action of reserving something).
Syntactic ambiguity: A sentence that can be interpreted in multiple ways (I
saw her duck = seeing a bird or watching someone lower their head).
Cultural/Linguistic differences: Idioms like kick the bucket (meaning to die)
may confuse NLP models.
To resolve ambiguity, NLP systems use context: word sense disambiguation
algorithms and context-sensitive models select among candidate meanings based
on the surrounding words, as in the sketch below.
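One classic context-based method is the Lesk algorithm, which picks the WordNet
sense whose dictionary definition overlaps most with the surrounding words. A
minimal sketch using NLTK's implementation (assumes nltk is installed and the
WordNet data can be downloaded):

import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

context = "I deposited my money in the bank yesterday".split()
sense = lesk(context, "bank")
# Lesk is a simple baseline; the chosen sense is not always correct.
print(sense, "-", sense.definition())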
Productivity
Productivity refers to the ability of a language to generate new words using existing
rules, which creates challenges for NLP systems. Examples include:
Inflectional morphology: productive rules such as the regular -s plural and
-ed past tense apply to newly coined words (e.g., "googled," "tweets"),
producing forms never seen in training data.
Since many new word forms may not exist in dictionaries or training data, NLP
systems use:
1. Rule-Based Models
2. Statistical Models
3. Neural Models
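Because productive word formation guarantees out-of-vocabulary forms, modern
systems often fall back on sub-word units. Below is a toy sketch of the BPE
merge-learning idea mentioned earlier: repeatedly merge the most frequent
adjacent symbol pair. The corpus and merge count are made up for illustration.

from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
words = {tuple("processing"): 5, tuple("process"): 4, tuple("procession"): 2}
for step in range(6):  # learn 6 merges
    best = max(get_pair_counts(words), key=get_pair_counts(words).get)
    words = merge_pair(words, best)
    print(f"merge {step + 1}: {best}")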
DOCUMENT STRUCTURE
In natural language processing (NLP), documents are not just a random collection of
words, but they have an inherent structure. This structure typically includes
sentences, paragraphs, and topics that make the text coherent and meaningful.
Understanding this structure is crucial for various NLP tasks such as parsing, machine
translation, and semantic labeling. These tasks rely on identifying the organization
and boundaries within the text.
SENTENCE BOUNDARY DETECTION (SBD)
o Sentence boundary detection (sentence segmentation) deals with automatically
segmenting a sequence of word tokens into sentence units, i.e., identifying
where one sentence ends and the next begins.
o In languages like English, the beginning of a sentence is often marked by an
uppercase letter, and the end of a sentence is explicitly marked with punctuation (e.g.,
a period, question mark, or exclamation mark).
o Example:
"I spoke with Dr. Smith." → The abbreviation "Dr." does not mark the end of a
sentence.
"My house is on Mountain Dr." → The abbreviation "Dr." marks the end of the
sentence.
o Used for text summarization, machine translation, sentiment analysis, and
part-of-speech tagging.
o Challenges: punctuation is ambiguous; a period may mark an abbreviation, a
decimal number, or an ellipsis rather than a sentence boundary, as the "Dr."
examples above show (a toy illustration follows).
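A toy sketch of the "Dr." problem: treat each period as a candidate boundary
and apply simple hand-written rules. The abbreviation list is hypothetical;
real systems learn such cues from data, as described under Methods below.

import re

ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "etc"}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        words_before = text[start:match.start()].split()
        prev_word = words_before[-1].lower() if words_before else ""
        at_end_of_text = not text[end:].strip()
        # A period following a known abbreviation usually does NOT end a
        # sentence ("Dr. Smith"), unless nothing follows it ("Mountain Dr.").
        if prev_word not in ABBREVIATIONS or at_end_of_text:
            sentences.append(text[start:end].strip())
            start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("I spoke with Dr. Smith. My house is on Mountain Dr."))
# -> ['I spoke with Dr. Smith.', 'My house is on Mountain Dr.']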
TOPIC BOUNDARY DETECTION
o Topic boundary detection (topic segmentation) aims to identify the points in
a text where the subject or topic of discussion changes. This is an essential
task for understanding the overall structure of a document and breaking it
down into more manageable parts.
o Challenges: topic shifts are rarely marked explicitly and often occur
gradually, so boundaries must be inferred from indirect cues such as changes
in vocabulary.
METHODS
Classification Problem
Given a set of training examples (x, y), the goal is to find a function that assigns the
most accurate label y for unseen examples x.
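A minimal sketch of this formulation applied to sentence boundary detection,
using scikit-learn; the library and the two features are my choices, not
specified in the notes. Each candidate punctuation mark becomes a feature
vector x with a boundary/non-boundary label y:

from sklearn.linear_model import LogisticRegression

# Hypothetical features: [previous token is a known abbreviation,
#                         next token is capitalized]
X_train = [
    [1, 1],  # "Dr. Smith"  -> abbreviation + capitalized name: not a boundary
    [0, 1],  # "mat. The"   -> ordinary word + capitalized next: boundary
    [1, 0],  # "etc. and"   -> abbreviation + lowercase next: not a boundary
    [0, 0],  # "3.14 is"    -> inside a number: not a boundary
    [0, 1],  # "done. We"   -> boundary
]
y_train = [0, 1, 0, 0, 1]

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict([[0, 1]]))  # an unseen "word. Capitalized" case -> likely 1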
The methods for segmentation can be classified into two primary categories,
generative and discriminative approaches (compared under "Complexity of
Approaches" below).
Text Tiling is a method for dividing a document into segments that share a
common topic. It uses a lexical cohesion metric in a word vector space: the
document is split wherever the similarity between adjacent blocks falls below
some threshold.
1. Block Comparison: This method compares two adjacent blocks of text (e.g.,
sentences or paragraphs) and measures their similarity by comparing how many
words they share.
o Formula: the similarity score between blocks b1 and b2 can be computed as
the normalized inner product (cosine) of their term-weight vectors:
\mathrm{sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1} \, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \sum_t w_{t,b_2}^2}}
o where w_{t,b_1} is the weight assigned to term t in block b_1, and the weights
can be binary or computed using retrieval metrics such as term frequency (TF).
2. Vocabulary Introduction: This method scores a gap between two blocks by
counting how many new words appear in the interval between the blocks.
o Formula: the topical cohesion score at gap i is computed as:
\mathrm{score}(i) = \frac{\mathrm{NumNewTerms}(b_1) + \mathrm{NumNewTerms}(b_2)}{2w}
o where NumNewTerms(b) is the number of terms in block b seen for the first
time in the text, and w is the block size in tokens.
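Both scores translate directly into Python. Below is a minimal sketch of each,
using raw term frequencies as the weights and reading "new" as "not seen
earlier in the text"; the toy blocks are made up for illustration.

from collections import Counter
import math

def block_similarity(b1, b2):
    """Block comparison: cosine similarity of term-frequency vectors."""
    w1, w2 = Counter(b1), Counter(b2)
    num = sum(w1[t] * w2[t] for t in set(w1) & set(w2))
    den = math.sqrt(sum(v * v for v in w1.values()) *
                    sum(v * v for v in w2.values()))
    return num / den if den else 0.0

def vocab_introduction_score(b1, b2, seen, w):
    """Vocabulary introduction: first-time terms around a gap, normalized by 2w."""
    new1 = len(set(b1) - seen)
    new2 = len(set(b2) - (seen | set(b1)))
    return (new1 + new2) / (2 * w)

b1 = "the cat sat on the mat".split()
b2 = "stock markets fell sharply again today".split()
print(block_similarity(b1, b2))                      # low similarity -> likely boundary
print(vocab_introduction_score(b1, b2, {"the"}, 6))  # many new terms -> likely boundary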
COMPLEXITY OF APPROACHES
1. Discriminative Approaches:
2. Generative Models:
3. Discriminative Classifiers:
Training complexity: manageable for smaller datasets, and these models
can handle a large number of different features.
Prediction complexity: slower than generative models because prediction
involves more complex computations.
4. Sequence Approaches:
PERFORMANCE OF APPROACHES
o Error rates can be low, but even small mistakes can affect later
processing stages in NLP tasks.
o Error Rate: Measures how many mistakes the system made in finding
sentence boundaries.
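One common way to compute a boundary error rate is to count missed and
spurious boundaries against a reference; exact definitions vary across papers,
so this is an illustrative choice.

def boundary_error_rate(predicted, gold):
    """Missed + spurious boundaries, normalized by the number of gold boundaries."""
    predicted, gold = set(predicted), set(gold)
    missed = len(gold - predicted)    # gold boundaries the system failed to find
    spurious = len(predicted - gold)  # boundaries the system invented
    return (missed + spurious) / len(gold)

# Boundary positions as token indices (made-up numbers for illustration).
print(boundary_error_rate(predicted={10, 42, 77}, gold={10, 42, 80}))  # ~0.667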