Reference Material NLP - 2
Sentiment Analysis
Sentiment analysis is the process of analyzing text to determine whether
it's positive, negative, or neutral. It's also known as opinion mining or emotion
AI, and it draws on natural language processing and
artificial intelligence.
For example, many sentiment analysis systems use machine
learning to analyze text data, which involves algorithms that learn from
training data.
Sentiment analysis is a useful tool for businesses to understand their
customers' opinions across many kinds of text, such as:
● Surveys
● News articles
● Tweets
● Blog posts
● Customer support chat transcripts
● Social media comments
● Reviews
Sentiment analysis is a major NLP task that involves extracting the feeling
or opinion expressed in a piece of text.
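As a quick illustration, here is a minimal sketch using NLTK's built-in VADER
sentiment analyzer; the sample sentence is made up for illustration.

# Sentiment scoring with NLTK's VADER analyzer
import nltk
nltk.download('vader_lexicon')  # one-time download of the VADER lexicon
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos proportions and a compound score in [-1, 1]
print(sia.polarity_scores("The cake was absolutely delicious!"))
# A compound score near +1 indicates strongly positive sentiment.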
Text Similarity
● Jaccard similarity is primarily used for sets or binary data. It compares the
presence or absence of elements between two sets.
● Cosine similarity is commonly used for vector representations, particularly in
text analysis or document comparisons.
Jaccard similarity is a measure of how similar two sets (of n-grams, in this case) are:
J(A, B) = |A ∩ B| / |A ∪ B|
Let us take an example of how to find the Jaccard similarity between two strings using a
character-bigram model.
For example, if you have the two strings “abcde” and “abdcde”, it works as follows:
A = bigrams(“abcde”) = {ab, bc, cd, de}
B = bigrams(“abdcde”) = {ab, bd, dc, cd, de}
|A ∩ B| = |{ab, cd, de}| = 3 and |A ∪ B| = |{ab, bc, cd, de, bd, dc}| = 6, so
J(A, B) = (3 / 6) = 0.5
There is also the Jaccard distance which captures the dissimilarity between two sets, and
is calculated by taking one minus the Jaccard coefficient (in this case, 1 - 0.5 = 0.5)
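A minimal sketch of this computation in Python; the helper names are our own,
not from the text above.

# Jaccard similarity over character bigrams
def char_bigrams(s):
    # the set of all adjacent character pairs in s
    return {s[i:i+2] for i in range(len(s) - 1)}

def jaccard_similarity(a, b):
    A, B = char_bigrams(a), char_bigrams(b)
    return len(A & B) / len(A | B)

print(jaccard_similarity("abcde", "abdcde"))      # 3 shared / 6 total = 0.5
print(1 - jaccard_similarity("abcde", "abdcde"))  # Jaccard distance = 0.5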
Language is a structured system of communication that consists of
grammar and vocabulary. It is the primary means by which humans convey meaning,
both in spoken and signed forms, and may also be conveyed through writing.
In NLP, the grammar of a language is commonly written down as a context-free
grammar (CFG), a 4-tuple (N, T, P, S) in which:
● N is a set of non-terminals
● T is a set of terminals (the words of the language)
● P is a set of production rules of the form A → α, where α is a string of
terminals and non-terminals (possibly empty).
● S is the start symbol.
This grammar is what drives syntactic parsing.
What is a Parser?
A parser is a software component that takes input data (typically text) and builds
a data structure – often some kind of parse tree, abstract syntax tree or other
hierarchical structure – while
checking for correct syntax. A parser also requires a CFG for generating the parse tree.
A parser in NLP uses the grammar rules to verify if the input text
is grammatically valid. There are two broad strategies:
● A top-down parser begins with the start symbol of the grammar and works its
way down to the lowest level, the sentence’s words.
● A bottom-up parser begins with the
sentence’s words and works its way up to the highest level of the
tree, the start symbol.
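As a concrete sketch, the following uses NLTK's chart parser with a toy CFG; the
grammar rules and sentence are illustrative assumptions, not taken from the text above.

# Parsing a sentence with a toy context-free grammar in NLTK
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
# The parser verifies the sentence against the grammar and yields parse trees.
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)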
Word Embeddings (Word2Vec)
Word2Vec maps each word to a vector so that words used in similar contexts sit close together
in the vector space – effectively speaking, similar words will have similar
vectors. Imagine scoring words such as “king”, “queen” and “prince”
against a set of hand-picked
criteria (for example royalty, masculinity, femininity, age etc.) for each
word: “queen”, say, would score high on “royalty” and “femininity”.
Each word then becomes a vector of these scores. Word2Vec works in much the same way,
except for the fact that the criterion we have used for each of the words
is not hand-picked: the model learns each dimension of the
vector, so we cannot say in advance what any individual dimension will
be.
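To make the intuition concrete, here is a small sketch with hand-picked, entirely
hypothetical feature scores; real Word2Vec dimensions are learned, not labelled like this.

# Hand-crafted "criteria" vectors and cosine similarity (illustrative values)
import numpy as np

# scores on [royalty, masculinity, femininity, age] -- hypothetical
vectors = {
    "king":   np.array([0.95, 0.90, 0.05, 0.70]),
    "queen":  np.array([0.95, 0.05, 0.90, 0.65]),
    "prince": np.array([0.90, 0.85, 0.05, 0.20]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words with similar roles end up with similar vectors.
print(cosine(vectors["king"], vectors["prince"]))
print(cosine(vectors["king"], vectors["queen"]))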
Model Architecture
Word2Vec comes in two architectures: CBOW and Skipgram.
1. CBOW (Continuous Bag of Words)
● In this architecture, the model predicts the current word from the context
words given as input.
● The input layer contains the context words and the output layer contains the
current word. The hidden layer contains the dimensions we want to
represent the word with.
● At the end of the training process, the hidden weights are treated
as the word embeddings.
Consider the example sentence “The cake was chocolate flavoured”.
The model will then iterate over this sentence for different target
words, with e.g. “The ____ was chocolate flavoured” being the input and “cake” being the
target word.
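A minimal sketch of how CBOW training pairs could be generated from the example
sentence; the real model then trains a shallow neural network on such (context, target) pairs.

# Generating (context words -> target word) pairs for CBOW
sentence = "the cake was chocolate flavoured".split()
window = 2  # number of context words taken on each side

for i, target in enumerate(sentence):
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    print(context, "->", target)
# e.g. ['the', 'was', 'chocolate'] -> 'cake'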
2. Skipgram
In this architecture, we do the opposite of CBOW: we give the model an
input word and expect the model to tell us what words it is expected to
be surrounded by. The input layer contains the current word and the output
layer contains the context words. The hidden layer contains the number of
dimensions in which we want to represent the word present at the input
layer.
Taking the same example, with “cake” as the input we would expect the model to
predict the surrounding words “The”, “was”, “chocolate” and “flavoured”.
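In practice both architectures are available in libraries such as gensim; the sketch
below assumes gensim is installed, with sg=1 selecting skipgram and sg=0 selecting
CBOW (the toy corpus is made up).

# Training a tiny Word2Vec model with gensim
from gensim.models import Word2Vec

sentences = [["the", "cake", "was", "chocolate", "flavoured"],
             ["the", "pie", "was", "apple", "flavoured"]]

# sg=1 -> skipgram; sg=0 -> CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cake"].shape)         # each word gets a 50-dimensional vector
print(model.wv.most_similar("cake"))  # nearest words in the vector space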
Bag of words (BoW) and Word2Vec are both natural language processing
(NLP) techniques that represent text data as vectors, but they differ
in how they process text:
Bag of words
A statistical model that counts the number of times each word appears in a document
and represents the text as a fixed-length vector. BoW disregards word order and
syntax, and only considers whether a word appears in a document, not where.
Word2Vec
A two-layer neural network that processes text by converting words into vectors that
represent their meaning, semantic similarity, and relationship to other words.
Word2Vec takes in batches of raw text data and outputs a vector space where each
word is assigned a corresponding vector.
Implementation
BoW can be implemented as a Python dictionary, while Word2Vec uses a dense
neural network.
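A minimal sketch of a bag-of-words representation as a plain Python dictionary,
as mentioned above; the sample document is made up.

# Bag of words as a word-count dictionary
from collections import Counter

doc = "the cake was chocolate flavoured and the cake was good"
bow = Counter(doc.split())
print(bow)  # e.g. Counter({'the': 2, 'cake': 2, 'was': 2, ...})
# Word order is discarded: only the counts survive.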
Limitations
BoW has several limitations, including not understanding word order, context, or
meaning, and always representing a given word in the same way regardless of context.
Applications
BoW is used in information retrieval, while Word2Vec can help computers learn the
context of expressions and keywords from large text collections.
Understanding Linguistics:
Linguistics is the study of language, its structure, and the rules that govern its structure.
Linguists, specialists in linguistics, have traditionally analyzed language in terms of
several subfields of study. Speech-language pathologists study these subfields of
language and are specially trained to assess and treat language and its subfields.
These include morphology, syntax, semantics, pragmatics and phonology.
Morphology
Morphology is the study of word structure. It describes how words are formed out of
more basic elements of language called morphemes. A morpheme is the smallest
meaningful unit of a language. Morphemes are considered minimal because if they were subdivided
any further, they would become meaningless. Each morpheme is different from the
others because each signals a distinct meaning. Morphemes are used to form words.
Base, root or free morphemes are words that have meaning, cannot be broken down
into smaller parts, and can have other morphemes added to them. Examples of free
morphemes are ocean, establish, book, color, connect, and hinge. These words mean
something, can stand by themselves, and cannot be broken down into smaller units.
These words can also have other morphemes added to them.
Syntax :
Syntax and morphology are concerned with two major categories of language structure.
Morphology is the study of word structure. Syntax is the study of sentence structure.
The basic meaning of the word syntax is “to join,” “to put together.” In the study of
language, syntax involves the following:
A collection of rules that specify the ways and order in which words may be combined to
form sentences in a particular language. As they mature in syntactic development,
children begin to use compound sentences (two independent clauses joined by a
conjunction) and complex sentences (an independent clause plus a dependent clause).
Semantic analysis
Search Engines:
Semantic analysis aids search engines in comprehending user queries more
effectively, consequently retrieving more relevant results by considering the
meaning of words, phrases, and context.
Information Retrieval:
By matching the meaning of a query rather than its exact words, semantic analysis
helps retrieval systems rank documents that are conceptually relevant.
Pragmatic Analysis
There are many aspects of language that require knowledge derived from pragmatic
analysis. It helps us understand whether the sentence “Give me a glass of water” is an
order or a request. We cannot say this unless we have some context.
Thus pragmatics in NLP helps the computer to understand the real meaning of the
sentences in certain situations. Pragmatics is also concerned with the roles the speaker
and the listener play in creating sense.
For example, if a teacher asks a student who comes late to class,
“What time do you call this?!”, it does not mean the teacher is asking for the time.
That would be the semantic meaning of the sentence. Here it is meant as “Why are you so
late?”, which is the pragmatic meaning of the sentence.
● Actual Meaning: This may mean that someone was looking for something,
and they found it. The sentence gives a positive sentiment here.
Example 3
An example of pragmatic meaning is: “It's hot in here! Can you crack a window?"
Here we can infer that the speaker wants the window to be opened a little and does not
want the window to be physically damaged.
Look at the situation/context to understand the actual meaning of the sentence spoken.
If you were told to “crack the window” when the room was a little stuffy, and the speaker had
just said that they were feeling hot, we would know that the speaker wants you
to “open the window” and not literally crack it.
Tokenization
Tokenization is the process of breaking text down into smaller units called tokens,
which may be words, characters, or
fragments of words. The
process involves splitting a string, or text, into a list of tokens. One can think
of a token as a part: a word is a token in a sentence, and a sentence is a
token in a paragraph. Tokenization converts raw text into a form
suitable for machine learning. This rapid conversion enables the immediate
use of the tokens in further
processes or behaviors.
Types of Tokenization
Tokenization can be classified into several types based on how the text is
segmented:
Word Tokenization:
Word tokenization divides the text into individual words. Many NLP tasks use
this approach, in which words are treated as the basic units of meaning.
Example:
Input: "Tokenization is an important NLP task."
Output: ["Tokenization", "is", "an", "important", "NLP", "task", "."]
Sentence Tokenization:
Sentence tokenization divides the text into individual sentences, which is useful
when the unit of analysis is the sentence rather than the word.
Example:
Input: "Tokenization is an important NLP task. It helps break down
text into smaller units."
Output: ["Tokenization is an important NLP task.", "It helps break
down text into smaller units."]
Character Tokenization:
This process divides the text into individual characters. This can be useful for
character-level modelling or for languages that lack clear word boundaries.
Example:
Input: "Tokenization"
Output: ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o",
"n"]
Need of Tokenization
● Effective Text Processing: Tokenization reduces raw text to units that can be
counted, compared, and fed into machine
learning models.
● Language Modelling: Tokenization in NLP facilitates the creation of
organized representations of language; a language model predicts each next token from the
corpus’s vocabulary.
● Task Flexibility: The granularity of tokenization can be matched to
the needs of particular NLP tasks, meaning that it will work best in
a given scenario when the right unit (word, sentence, or character) is chosen.
Stemming
Stemming is a text normalization technique used in natural language processing and
information retrieval. Its primary goal is to reduce words to their base or root form,
known as the stem. Stemming helps group words with similar meanings or roots
together.
The process involves removing common affixes (prefixes, suffixes) from words, resulting
in a simplified form that represents the word’s core meaning. Stemming is a heuristic
process and may not always produce a valid word. Still, it is effective for tasks like
information retrieval, where the focus is on matching the essential meaning of words
rather than their exact form.
For example: “connection”, “connected” and “connecting” are all reduced to the stem
“connect”.
Stemming algorithms use various rules and heuristics to identify and remove affixes,
trading some linguistic precision for speed in downstream processing
and analysis.
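A short sketch using NLTK's PorterStemmer; the word list is our own illustration.

# Stemming with the Porter algorithm
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connection", "connected", "connecting", "history", "running"]:
    print(word, "->", stemmer.stem(word))
# e.g. history -> histori (the stem need not be a valid word)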
Lemmatization
Lemmatization is a linguistic process that involves reducing words to their base or root
form, known as the lemma. The goal is to normalize different inflected forms of a word
so that they can be analyzed or compared more easily. This is particularly useful in
tasks such as text analysis, search, and information retrieval.
For example: “running” becomes “run”, and “better” (as an adjective) becomes “good”.
The process typically involves the following steps:
● Tokenization: The first step is to break down a text into individual words or
tokens. This can be done using various methods, such as splitting the text based
on spaces.
● Part-of-Speech (POS) Tagging: The next step assigns a grammatical
category (like noun, verb, adjective, etc.) to each token. Lemmatization often
relies on this information, as the base form of a word can depend on its
part of speech.
● Lemma Lookup: Each token is then mapped to its lemma, the dictionary form of the word,
which may not necessarily be the same as the word’s root. For example, the
lemma of “running” is “run,” and the lemma of “better” (in the context of an
adjective) is “good.”
● Morphological Rules: Lemmatizers apply language-specific rules for regular inflection
patterns. For irregular verbs or words with multiple possible lemmas, these rules
are supplemented by exception lists so that the correct lemma is chosen for each word in the
text.
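A short sketch using NLTK's WordNetLemmatizer; the pos argument supplies the
part-of-speech information the steps above describe.

# Lemmatization with WordNet
import nltk
nltk.download('wordnet')  # data needed by the lemmatizer
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good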
While stemming involves chopping off prefixes or suffixes from words to obtain a common
root, lemmatization aims for a valid base form through linguistic analysis. Lemmatization
tends to be more accurate but can be computationally more expensive than stemming.
Difference between stemming vs. lemmatization
● Stemming: For instance, stemming the word ‘History‘ would return ‘Histori‘.
● Lemmatization: For instance, lemmatizing the word ‘History‘ would return ‘History‘.
Stop Words
In natural language processing (NLP), stop words are commonly used words
that are removed from text processing tasks because they don't carry much
meaning. Key words are identified using methods like YAKE, which analyzes
a text's structure, word usage, and co-occurrence patterns.
Stop words vs. key words:
● Definition: Stop words are words that are commonly used but don't carry
much meaning; key words are words that are most relevant to the content of a
document.
● Examples: Stop words include "a", "an", "the", "and", "is", "in", "on", "at", "with",
"he", "she", "it", "they"; key words are identified by methods like YAKE.
Stop words are often removed from text processing tasks to ensure that the
remaining words carry the meaning of the text. For example, if a
search query is "what is a stop word?", a search system can eliminate the
words "what is a" and focus on documents that talk about stop words.
YAKE is a method that analyzes a text's structure, word usage, and
co-occurrence patterns to extract the most relevant key words and
phrases.
# Stopwords
import nltk
from nltk.corpus import stopwords

# Download the stop-word list and the 'punkt' tokenizer data.
nltk.download('stopwords')
nltk.download('punkt')

# Build a set of English stop words for fast membership tests.
english_stopwords = set(stopwords.words('english'))
print(english_stopwords)
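Building on the block above, a short sketch of removing stop words from a query;
the sample sentence echoes the search example earlier.

from nltk.tokenize import word_tokenize

query = "what is a stop word?"
tokens = word_tokenize(query.lower())

# Keep only tokens that are not stop words.
filtered = [t for t in tokens if t not in english_stopwords]
print(filtered)  # e.g. ['stop', 'word', '?']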
What’s A Phoneme?
A phoneme is the smallest unit of sound. A phoneme is the smallest unit of spoken
sound and is often the one thing that distinguishes one word from another. For example,
cat and rat are only differentiated by the first phoneme. In many cases, a single letter
represents a single phoneme, but there are often multiple ways of
representing a particular phoneme in English spelling. And in the case of the letter x, it
represents two phonemes, /k/ and /s/.
Generally, the accepted belief is that there are 44 phonemes in English. This includes
short vowel sounds, long vowel sounds, digraph sounds such as /sh/, /th/ (voiced and
unvoiced), and /ch/, and single consonant sounds. Most people consider the diphthong
sounds /oy/ and /ou/ to be single phonemes as well. Linguistically, /ng/ and /ar/, /or/,
/er/, /ear/, /oar/, and schwa are also phonemes.
So, a single phoneme such as /n/ may be represented by letters in numerous different
ways such as n, nn, kn, gn, or pn. Phonemes can be indicated through the International
Phonetic Alphabet (IPA) or by writing a sound between slashes.
What’s A Grapheme?
In linguistics, a grapheme is the smallest unit of a written language whether or
not it carries meaning or corresponds to a single phoneme. In different languages,
a grapheme may represent a syllable or unit of meaning.
Graphemes can include other printed symbols such as punctuation marks. For
example, the grapheme <x> represents the phonemes /k/ and /s/, while a single character in
Japanese may represent a syllable. Different types/fonts of a single letter are
considered the same grapheme. A basic alphabetic understanding and rapid recognition
of graphemes is a necessary first step in learning to read. Your student’s ability to quickly, accurately,
and easily write graphemes is necessary for fluent writing and spelling.
When we are speaking purely about English, you will often see another definition of
grapheme. In this case, a grapheme is a letter or group of letters that represent a
single phoneme. This is a term used more or less synonymously with phonograms.
There are often numerous graphemes (or phonograms) that can represent a single
phoneme. For example, the /ā/ sound is a phoneme that can be represented by
numerous graphemes including ai, ay, ey, ei, eigh, a-e.
To make things even more confusing for young learners, a single grapheme such as ea
may represent three different phonemes /ē/, /ā/, or /ĕ/. While English has 26 letters and
44 phonemes, there are approximately 250 graphemes.
What’s A Morpheme?
Perhaps the most neglected term and concept in the study of teaching reading is the
morpheme. A morpheme is the smallest unit of meaning that cannot be further
divided. So, a base word might be a morpheme, but a suffix, prefix, or root also
represents a morpheme. For example, the word red is a single morpheme, but the word
unpredictable is made of the morphemes un + pre + dict + able. While none of these
units stand alone as words, they are the smallest units of meaning.
Morphemes can vary tremendously in length from a single letter such as the prefix a- to
a place name derived from another language such as Massachusetts, which in English
represents a single morpheme. As students move into reading and writing more
sophisticated academic language, the concept of morphemes becomes increasingly
important to their decoding and their spelling as well as their ability to infer meanings of
new vocabulary. The root of a word makes a tremendous difference in spelling words
with the -ion suffix, for instance.