
COURSE CODE: CSC 528

PART 2

COURSE TITLE:
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
TOPIC: INTRODUCTION TO NATURAL LANGUAGE PROCESSING (NLP)
What Is NLP?

• Humans communicate with each other using words and text.
• The way that humans convey information to each other is called Natural Language.
• Every day, humans share a large quantity of information with each other in various languages as speech or text.
• However, computers cannot interpret this data, which is in natural language, as they communicate in 1s and 0s.
• The data produced is precious and can offer valuable insights.
• Hence, you need computers to be able to understand, emulate and respond intelligently to human speech.
• Natural Language Processing (NLP) refers to the branch of Artificial Intelligence that gives machines the ability to read, understand and derive meaning from human languages.
• NLP combines the fields of linguistics and computer science to decipher language structure and rules, and to build models that can comprehend, break down and extract significant details from text and speech.
How does natural language processing work?

• NLP enables computers to understand natural language as humans do.
• Whether the language is spoken or written, NLP uses AI to
take real-world input, process it, and make sense of it in a way
a computer can understand.
• Just as humans have different sensors -- such as ears to hear
and eyes to see -- computers have programs to read and
microphones to collect audio.
• And just as humans have a brain to process that input,
computers have a program to process their respective inputs.
• At some point in processing, the input is converted to code
that the computer can understand.
• There are two main phases to NLP:
i. Data preprocessing
ii. Algorithm development.
The steps to perform preprocessing of data
in NLP include:
• Segmentation:
• You first need to break the entire document down into its constituent sentences. You can do this by segmenting the article at punctuation marks like full stops.
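A minimal sketch of sentence segmentation using the NLTK library (assuming the nltk package and its punkt tokenizer data are installed; the sample text is made up for illustration):

import nltk

nltk.download("punkt")  # sentence tokenizer data (one-time download)

document = "NLP is fun. It lets computers read text. It also handles speech."
sentences = nltk.sent_tokenize(document)  # split the document into sentences
print(sentences)
# ['NLP is fun.', 'It lets computers read text.', 'It also handles speech.']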

Tokenizing:

• For the algorithm to understand these sentences, you need to get the words in a sentence and explain them individually to the algorithm. So, you break down each sentence into its constituent words and store them. This is called tokenizing, and each word is called a token.
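A minimal word-tokenization sketch, again assuming NLTK and its punkt tokenizer data:

import nltk

sentence = "NLP lets computers read text."
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
print(tokens)
# ['NLP', 'lets', 'computers', 'read', 'text', '.']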
Removing Stop Words:

• You can make the learning process faster by getting rid of non-essential words, which add little meaning to a statement and are just there to make it sound more cohesive. Words such as "was", "in", "is", "and" and "the" are called stop words and can be removed.
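A minimal stop-word removal sketch using NLTK's English stop-word list (the token list is made up for illustration):

import nltk

nltk.download("stopwords")  # stop-word lists (one-time download)
from nltk.corpus import stopwords

tokens = ["overall", "I", "liked", "the", "movie", "and", "the", "story"]
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)  # ['overall', 'liked', 'movie', 'story']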
Stemming:

• Stemming is the process of obtaining the word stem of a word.
• A word stem gives rise to new words when affixes are added to it; stemming strips those affixes off (for example, "playing", "played" and "plays" all reduce to the stem "play").
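A minimal stemming sketch using NLTK's Porter stemmer:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["playing", "played", "plays", "studies"]
print([stemmer.stem(w) for w in words])
# ['play', 'play', 'play', 'studi']  -- note that stems need not be dictionary words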
Lemmatization:

• The process of obtaining the root stem (lemma) of a word.
• The root stem is the base form of a word that is present in the dictionary and from which the word is derived.
• You can also identify the base word for different inflected forms based on tense, mood, gender, etc.
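A minimal lemmatization sketch using NLTK's WordNet lemmatizer (assuming the WordNet data is downloaded; some NLTK versions also need the omw-1.4 resource):

import nltk

nltk.download("wordnet")   # lexical database used by the lemmatizer
nltk.download("omw-1.4")   # extra WordNet data needed by some NLTK versions
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # 'study'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good'
print(lemmatizer.lemmatize("ran", pos="v"))     # 'run'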
Part of Speech Tagging
• Now, you must explain the concept of nouns,
verbs, articles, and other parts of speech to
the machine by adding these tags to our
words.
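A minimal part-of-speech tagging sketch with NLTK (assuming the tagger model has been downloaded; tag names follow the Penn Treebank convention):

import nltk

nltk.download("averaged_perceptron_tagger")  # POS tagging model (one-time download)

tokens = nltk.word_tokenize("Overall I liked the movie")
print(nltk.pos_tag(tokens))
# roughly: [('Overall', 'RB'), ('I', 'PRP'), ('liked', 'VBD'), ('the', 'DT'), ('movie', 'NN')]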
Named Entity Tagging:
• Next, introduce your machine to pop culture references and everyday names by flagging names of movies, important personalities, locations, etc. that may occur in the document.
• You do this by classifying the words into subcategories. This helps you find any keywords in a sentence.
• The subcategories include person, location, monetary value, quantity, organization and movie.
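A minimal named-entity tagging sketch, here using the spaCy library as one possible tool (assuming its small English model en_core_web_sm is installed; the sentence is made up for illustration):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline that includes an NER component
doc = nlp("Christopher Nolan shot parts of Inception in Paris for Warner Bros.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# roughly: 'Christopher Nolan' PERSON, 'Paris' GPE, 'Warner Bros.' ORG (exact labels depend on the model)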

• Note:
• TF-IDF is a statistical NLP technique for evaluating how important a word is to a particular document within a large collection. It involves the multiplication of two distinct values:
• Term frequency: The term frequency value gives you the number of times a word appears in a particular document. Stop words generally get a high term frequency in a document.
• Inverse document frequency: Inverse document frequency, on the other hand, highlights the terms that are highly specific to a document, i.e. words that occur in few documents across the whole corpus.
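A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer (the three short documents are made up for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "overall liked the movie",
    "boring movie with a sad ending",
    "nice story and nice songs",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # rows = documents, columns = vocabulary terms
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))          # high values = frequent in a document but rare in the corpus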

• After performing the preprocessing steps, you then give your resultant data to a machine
learning algorithm like Naive Bayes, etc., to create your NLP application.
Search and learning
• Many natural language processing problems can be written mathematically in the form of an optimization problem, as sketched below.
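A hedged reconstruction of the optimization formulation the slide refers to (the symbols Ψ, θ and Y(x) follow common textbook notation and are assumptions here):

\hat{y} = \operatorname*{argmax}_{y \in \mathcal{Y}(x)} \Psi(x, y; \theta)

where x is the input, Y(x) is the set of candidate outputs for x, Ψ is a scoring function, and θ is a vector of parameters.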
• This basic structure can be applied to a huge range of problems.
• For example, the input x might be a social media post, and the
output y might be a labeling of the emotional sentiment
expressed by the author,
• or x could be a sentence in French, and the output y could be a
sentence in Tamil.
• or x might be a sentence in English, and y might be a
representation of the syntactic structure of the sentence .
• or x might be a news article and y might be a structured record
of the events that the article describes.
• This formulation reflects an implicit decision that language processing algorithms will have two distinct modules: search and learning.
The search module

• The search module is responsible for computing the argmax of the scoring function.
• In other words, it finds the output that gets the best score with respect to the input x.
• This is easy when the search space Y(x) is small enough to enumerate, or when the scoring function has a convenient decomposition into parts.
• In many cases, we will want to work with scoring functions that do not have these properties, motivating the use of more sophisticated search algorithms, such as bottom-up dynamic programming and beam search.
• Because the outputs are usually discrete in language processing problems, search often relies on the machinery of combinatorial optimization.
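A minimal sketch of the easy case, where the search space is small enough to enumerate: the search module is just an argmax over candidate outputs (the scoring function here is a made-up toy, not a real model):

def search(x, candidates, score):
    """Brute-force search: return the candidate output with the highest score for input x."""
    return max(candidates, key=lambda y: score(x, y))

# hypothetical usage with a toy scoring function
labels = ["POSITIVE", "NEGATIVE"]
score = lambda x, y: x.count("liked") if y == "POSITIVE" else x.count("boring")
print(search("overall liked the movie", labels, score))  # POSITIVE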
The learning module
• The learning module is responsible for finding the parameters θ of the scoring function.
• This is typically (but not always) done by processing a large dataset of labeled examples {(x^(i), y^(i))}, i = 1, ..., N, where:
• x^(i) is a column vector of feature counts for instance i, often word counts
• y^(i) is a structured label for instance i, such as a tag sequence
• Like search, learning is also approached
through the framework of optimization.
• Because the parameters are usually
continuous, learning algorithms generally rely
on numerical optimization to identify vectors
of real-valued parameters that optimize some
function of the model and the labeled data.
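As a hedged sketch in the same notation, learning can itself be written as an optimization problem (ℓ is an assumed per-example loss measuring how badly the model with parameters θ fits the labeled pair):

\hat{\theta} = \operatorname*{argmin}_{\theta} \sum_{i=1}^{N} \ell\big(x^{(i)}, y^{(i)}; \theta\big)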
Linear text classification
• We begin with the problem of text classification: given a text document, assign it a discrete label y ∈ Y, where Y is the set of possible labels.
• Text classification has many applications, from
spam filtering to the analysis of electronic
health records.
The bag of words
• In the bag-of-words representation, a document is reduced to a column vector of word counts x.
• The feature vector for a (document, label) pair is mostly zeros, with the column vector of word counts x inserted in a location that depends on the specific label y.
• But it is usually not easy to set classification weights
by hand, due to the large number of words and the
difficulty of selecting exact numerical weights.
Instead, we will learn the weights from data.
• Email users manually label messages as SPAM;
newspapers label their own articles as BUSINESS or
STYLE.
• Using such instance labels, we can automatically
acquire weights using supervised machine learning.
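A minimal bag-of-words sketch using scikit-learn's CountVectorizer (the texts and SPAM/HAM labels are made up for illustration); the resulting count vectors are exactly what a supervised learner would consume:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["win money now", "meeting agenda attached", "money money win"]
labels = ["SPAM", "HAM", "SPAM"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # each row is a vector of word counts
print(vectorizer.get_feature_names_out())  # ['agenda' 'attached' 'meeting' 'money' 'now' 'win']
print(X.toarray())                         # e.g. first row -> [0 0 0 1 1 1]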
Machine learning approach for
classification
• Naive Bayes:
• The Naive Bayes classifier comes in three different variants: Gaussian Naive Bayes, Multinomial Naive Bayes and Bernoulli Naive Bayes.
• It is quite involved to understand all three algorithms deeply.
• However, you only need to understand how Multinomial Naive Bayes fits text classification.
• Multinomial Naive Bayes is typically used with a multinomial event model such as bag-of-words, which represents a document as a vector of word counts.
Applying Multinomial Naive Bayes to NLP Problems

• Multinomial Naive Bayes (MNB) is a popular machine learning algorithm for text classification problems in Natural Language Processing (NLP). It is particularly useful for problems that involve text data with discrete features such as word frequency counts. MNB works on the principle of Bayes' theorem and assumes that the features are conditionally independent given the class variable.
Steps for applying Multinomial Naive Bayes to NLP problems
• Preprocessing the text data: The text data needs to
be preprocessed before applying the algorithm. This
involves steps such as tokenization, stop-word
removal, stemming, and lemmatization.
• Feature extraction: The text data needs to be
converted into a feature vector format that can be
used as input to the MNB algorithm. The most
common method of feature extraction is to use a
bag-of-words model, where each document is
represented by a vector of word frequency counts.
• Splitting the data: The data needs to be split into
training and testing sets. The training set is used to
train the MNB model, while the testing set is used
to evaluate its performance.
• Training the MNB model: The MNB model is
trained on the training set by estimating the
probabilities of each feature given each class. This
involves calculating the prior probabilities of each
class and the likelihood of each feature given each
class.
Evaluating the performance of the model:

• The performance of the model is evaluated using metrics such as accuracy, precision, recall, and F1-score on the testing set.
• MNB has some limitations, such as the assumption of independence between features, which may not hold true in some cases. Therefore, it is important to carefully evaluate the performance of the model before using it in a real-world application (a short code sketch of the full workflow is given below).
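A minimal end-to-end sketch of the steps above using scikit-learn (the tiny labeled dataset is made up for illustration; a real application would use far more examples):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# made-up labeled reviews
texts = [
    "overall liked the movie", "nice story and nice songs",
    "boring movie with a sad ending", "liked the acting and the story",
    "sad and boring", "good movie overall",
]
labels = ["positive", "positive", "negative", "positive", "negative", "positive"]

# Feature extraction: bag-of-words counts (stop words removed as preprocessing)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=0)

# Training the MNB model
model = MultinomialNB()
model.fit(X_train, y_train)

# Evaluating the performance on the testing set
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))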
• Bayes theorem calculates probability P(c|x)
where c is the class of the possible outcomes
and x is the given instance which has to be
classified, representing some certain features.
P(c|x) = P(x|c) * P(c) / P(x)
Naive Bayes predicts the tag of a text.
• It calculates the probability of each tag for a given text and then outputs the tag with the highest probability.
How Does the Naive Bayes Algorithm Work?
• Let's consider an example: classifying whether a review is positive or negative.
Training Dataset:
• We classify whether the text "overall liked the movie" has a positive review or a negative review. We have to calculate:
P(positive | overall liked the movie) — the probability that the tag of a sentence is positive given that the sentence is "overall liked the movie".
P(negative | overall liked the movie) — the probability that the tag of a sentence is negative given that the sentence is "overall liked the movie".
Before that, we first apply Removing Stopwords and Stemming to the text.
Removing Stopwords: These are common words that don't really add anything to the classification, such as a, able, either, else, ever and so on.
Stemming: Stemming takes out the root of the word.
Now, after applying these two techniques, our text becomes:
• Feature Engineering:
The important part is to find the features from the data to make machine learning algorithms work.
In this case, we have text. We need to convert this text into numbers that we can do calculations on.
We use word frequencies. That is, treating every document as a set of the words it contains. Our features will be the counts of each of these words.
In our case, we have P(positive | overall liked the movie); by Bayes' theorem:
• P(positive | overall liked the movie) = P(overall liked the movie | positive) * P(positive) / P(overall liked the movie)

• Since for our classifier we have to find out which tag has the bigger probability, we can discard the divisor, which is the same for both tags, and compare
P(overall liked the movie | positive) * P(positive) with P(overall liked the movie | negative) * P(negative).
There's a problem though: "overall liked the movie" doesn't appear in our training dataset, so that probability is zero.
• Here, we assume the ‘naive’ condition that every word in a
sentence is independent of the other ones. This means that now
we look at individual words.
• We can write this as:
P(overall liked the movie) = P(overall) * P(liked) * P(the) * P(movie)
• The next step is just applying Bayes' theorem:

• P(overall liked the movie | positive) = P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive)
• And now, these individual words actually show up several times in our training data, and we can calculate them!
Calculating probabilities:
First, we calculate the a priori probability of each tag: for a given sentence in our training data, the probability that it is positive, P(positive), is 3/5. Then, P(negative) is 2/5.
Then, calculating P(overall | positive) means counting how many times the word "overall" appears in positive texts (1) divided by the total number of words in positive texts (17). Therefore, P(overall | positive) = 1/17, P(liked | positive) = 1/17, P(the | positive) = 2/17, P(movie | positive) = 3/17.
• If a probability comes out to be zero, then we use Laplace smoothing: we add 1 to every count so it is never zero. To balance this, we add the number of possible words to the divisor, so the result will never be greater than 1. In our case, the total count of possible words is 21.
Applying smoothing, the results are:
• Now we just multiply all the probabilities, and see which is bigger:

• P(overall | positive) * P(liked | positive) * P(the | positive) * P(movie | positive) * P(positive) = 1.38 * 10^{-5} = 0.0000138
P(overall | negative) * P(liked | negative) * P(the | negative) * P(movie | negative) * P(negative) = 0.13 * 10^{-5} = 0.0000013
• Our classifier gives "overall liked the movie" the positive tag.
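A small sketch of the same calculation in Python. The per-class word counts below are assumptions: the original training table is not reproduced here, so they are filled in to match the figures quoted above (17 words in the positive texts, 7 in the negative texts, a 21-word vocabulary):

# Assumed per-class statistics, chosen to match the figures in the worked example above.
counts_pos = {"overall": 1, "liked": 1, "the": 2, "movie": 3}   # word counts in positive texts
counts_neg = {"overall": 0, "liked": 0, "the": 0, "movie": 1}   # word counts in negative texts
total_pos, total_neg = 17, 7      # total words in positive / negative texts
vocab_size = 21                   # total number of possible words
prior_pos, prior_neg = 3 / 5, 2 / 5

def score(words, counts, total, prior):
    """Laplace-smoothed Naive Bayes score: prior * product of P(word | class)."""
    p = prior
    for w in words:
        p *= (counts.get(w, 0) + 1) / (total + vocab_size)
    return p

words = ["overall", "liked", "the", "movie"]
print(score(words, counts_pos, total_pos, prior_pos))  # ~1.38e-05 -> positive wins
print(score(words, counts_neg, total_neg, prior_neg))  # ~1.3e-06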
