
SNLP

UNIT-1

Introduction and Word Level Analysis

1.1 Origins and Challenges of NLP

1.1.1 Origins of NLP

Natural Language Processing, known as NLP, is a form of Artificial Intelligence that allows machines and computers to understand and interpret human language. It is essential for letting users communicate with a computer much as they would with any other human being.

NLP is essentially the way humans communicate with a machine so that it can perform the tasks required of it. Its origins are usually traced to Alan Turing. During the Second World War, Turing helped design machines that used logic and other key techniques to break the encrypted signals sent by the Germans, and his 1950 paper introduced the Turing Test, which judges a machine's intelligence by its ability to hold a conversation in natural language. The idea of communicating with a machine in human language can be traced back to this work.

Soon after, researchers started to use machine intelligence to perform tasks that could make life a lot easier for people. To make the machine perform the tasks it was given, people needed to feed it appropriate data. The 1960s saw a great deal of work led by Chomsky and other renowned researchers on formal grammars, with the aim of developing a universal syntax that people around the world could use to communicate with machines.

Development in this field was not rapid, but there was growing interest in pushing it further to automate many manual activities. Progress was slow, and it was only around 1990 that interest really took hold. The 1990s saw a surge of people building models with artificial intelligence, and to do so they relied heavily on language processing, which received a boost around that time.

All of this gave people an incentive to use NLP along with machine learning and artificial intelligence to develop probabilistic models that rely heavily on data. NLP has remained key to letting machines read and interpret that data right up to the present. Much of this data comes from speech recognition, which allows audio to be transcribed and stored as text on the internet. In the Internet age, language processing is used in almost every area around you; social media platforms, which play a major part in the lives of recent generations, rely heavily on it. In short, without NLP we would not have achieved the technological advances of recent times.

1.1.2 Challenges of NLP

NLP is one of the major ways in which a person can communicate with machines to solve problems, and it comes with a plethora of advantages. However, as advanced as the technology may be, there are always a few challenges that lie ahead and need to be overcome. A little about each of these challenges is given below.

1. Text Summarization

This challenge concerns readers who need to extract information from a large amount of data in a short time and would prefer a summary of it. AI uses NLP to interpret the text and structure a summary out of all the important points. However, a lot of points do get missed, and solving this would help many problems that depend on inspecting data.

2. Chatbots and Other AI Answering Machines

Chatbots are frequently used by people surfing a website to navigate it and find what they need most. Replies are easy to produce for questions that are asked frequently. However, at a higher level, with more complex reasoning involved, not every AI-powered chatbot can provide reliable answers, even though the aim is for everyone to be able to use this feature to solve their daily problems.

1.2 Language Modelling: Grammar-based LM, Statistical LM

Language models (LMs) are used to estimate the likelihood of different words or phrases relative to other words or phrases, in order to improve other NLP applications. Most language models use probabilistic inference to calculate the probability of the next word the user is likely to write. They can also be used to find the probability of a certain word or phrase occurring in existing data.

P(W) = P(w1, w2, w3, ..., wn) = ∏ P(wi | w1, w2, ..., wi-1)

where n is the number of words in the sequence and wi is the i-th word, whose probability is conditioned on all of the words that precede it.
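As a rough illustration of this formula, the Python sketch below multiplies hypothetical conditional probabilities (the values are invented, not estimated from any corpus) for the sentence "Today is Monday", which reappears as an example under statistical language modelling below.

# Sketch of the chain rule: P(w1..wn) = P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1).
# All probability values here are hypothetical, chosen only for illustration.
cond_probs = [
    ("Today",  (),               0.10),   # P(Today)
    ("is",     ("Today",),       0.10),   # P(is | Today)
    ("Monday", ("Today", "is"),  0.10),   # P(Monday | Today is)
]

p_sentence = 1.0
for word, history, p in cond_probs:
    p_sentence *= p          # multiply in P(wi | w1 ... wi-1)

print(p_sentence)            # 0.1 * 0.1 * 0.1 ≈ 0.001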

1.2.1 Grammar Based Language Modelling

As the name rightly suggests, this model is based on grammatical knowledge that has been given to the machine: it makes a prediction by combining the grammar with probabilistic evidence that the sentence has occurred in a similar context before.

In other words, the way we expect the user's sentence to continue is grounded in what has already been observed in the data. Let us understand this model further with the help of an example.

E.g. Sentence: "This glass is transparent."

P("This glass is transparent.") = P(w1, w2, w3, w4)

To construct or predict the above sentence we need to consider all the conditional probabilities given by the rule below.

1. P(This) = P(w1)
2. P(glass | This) = P(w2 | w1)
3. P(is | This glass) = P(w3 | w1, w2)
4. P(transparent | This glass is) = P(w4 | w1, w2, w3)

If the model needs to calculate the probability of the above sentence, it multiplies these terms together:

Therefore, P(W) = ∏ P(wi | w1, w2, ..., wi-1) = P(w1) · P(w2 | w1) · P(w3 | w1, w2) · P(w4 | w1, w2, w3)

This gives us the probability of the grammatically correct statement, which can then be predicted and shown to the user. On this basis one could build a sentence or phrase completion model using NLP.

1.2.2 Statistical Based Language Modelling


As the name suggests, this model is based on statistical inferences made from the data that has been entered into the machine that is to make the final inferences. The model relies heavily on data, supplied in a correct and appropriate form. A probability distribution is defined over the required word sequences, which is then applied to the textual data to gain inferences based on the values assigned to them.

E.g. 1) p("Today is Monday") = 0.001

2) p("Today Monday is") = 0.00000001

Approximate values like these are assigned for the model to work with when making inferences.

Because it provides a probabilistic mechanism for generating text, this model is also known as a 'generative model'. All of the values are estimated from the data that has been entered, so as to gain appropriate context about everything surrounding the probability distribution. A statistical LM also happens to be context-dependent: the way a certain piece of data is entered and interpreted affects the probabilistic outcome.

This model is extremely useful for quantifying the uncertainty lurking in data that has been entered as speech or as plain text. This can be seen clearly in the example below.

E.g. 1) John feels _________. (happy, habit)

Here the language model needs to predict the next word that the person is going to say or type, to save time. Nearly everyone would agree that "happy" is the correct option. However, the language model does not think the way an English-speaking human does: it simply compares the probabilities of the candidate words in this context and picks the more likely one.

Hence the final answer would be "John feels happy."
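A minimal sketch of how a statistical LM might fill in the blank: it counts bigrams in a tiny invented corpus and picks whichever candidate has the higher count after "feels". The corpus, and the probabilities it yields, are made up purely for illustration.

from collections import Counter

# A tiny invented corpus; a real model would be trained on far more text.
corpus = "john feels happy . mary feels happy . john feels tired ."
tokens = corpus.split()

bigrams = Counter(zip(tokens, tokens[1:]))
context = "feels"
candidates = ["happy", "habit"]

# Relative frequency of each candidate after the context word.
total = sum(c for (w1, _), c in bigrams.items() if w1 == context)
for cand in candidates:
    p = bigrams[(context, cand)] / total if total else 0.0
    print(cand, p)                    # 'happy' gets 2/3, 'habit' gets 0/3

best = max(candidates, key=lambda c: bigrams[(context, c)])
print("John feels", best)             # -> John feels happy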

1.3 Regular Expressions

A regular expression is a language that has been developed to help the user specify a search string to be found in the data. The pattern needs to be specified to get the desired results.
E.g. If a person wishes to find ten numbers in the data but does not know the exact values, he/she can simply specify the pattern that those numbers follow in order to find them.

A regular-expression search can be divided into two parts: the corpus and the pattern.

Corpus: A corpus is essentially the body of text over which the entire search is executed. It holds the whole bulk of data that can be searched.

Pattern: This is the pattern that is entered in the search to obtain the required results. The basic expressions used for this are given below.

There are a few basic operators of a regular expression that need to be understood in order to use 'RegEx' (regular expressions) efficiently.
1. Brackets (“[ ]”)

Brackets are used to specify a disjunction of characters: the expression matches any one of the characters listed inside them. For instance, putting 's' and 'S' inside brackets matches either the capital or the lower-case letter.

E.g. /[sS]mall/ = "Small" or "small" (either of the two combinations)

/[abc]/ = 'a', 'b' or 'c' (any one of the letters in the brackets)

/[012345]/ = any one of the digits in the brackets

2. Dash (“-”)

The dash specifies a range for a particular search: a dash between two characters inside brackets matches any character in that range.

E.g. /[A-Z]/ = any upper-case letter from A to Z would match the search.

/[a-z]/ = any lower-case letter from a to z would match the search.

3. Caret (“^”)

When it is the first symbol inside brackets, the caret indicates negation of the search conducted with the regular expression. Anywhere else it is simply treated as an ordinary caret character.

E.g. /[^A-Z]mall/ = "mall" preceded by any character that is not an upper-case letter

/[^Dd]/ = any character that is neither 'd' nor 'D'

/[e^]/ = either 'e' or '^'; the caret is literal here because it is not the first symbol in the brackets

4. Question Mark (“?”)

The question mark marks the preceding character as optional. Putting "?" after a certain letter indicates that the search should match both the string with that letter and the string without it.

E.g. /malls?/ = this would return either "mall" or "malls"

/colou?r/ = this would return either "color" or "colour"

5. Asterisk ("*") and Period (".")

The asterisk "*" (Kleene star) matches zero or more occurrences of the previous character, so /beg*n/ matches "ben", "begn", "beggn", and so on. To match exactly one arbitrary character in that position, the period (wildcard) is used instead: /beg.n/ matches words such as 'begin', 'began' and 'begun'.

6. Anchors

Anchors are used to match a certain position in the text being searched rather than a character.

E.g. Consider "^" and "$" as the specified anchors. Then

• /^The/ = matches a string that starts with 'The'

• /bye$/ = matches a string that ends with 'bye'

7. Character Classes

• "\d": matches a single digit character.
• "\w": matches a word character, i.e. an alphanumeric character or an underscore.
• "\s": matches a whitespace character, including tabs and line breaks.
• ".": matches any single character.
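The short Python sketch below exercises the operators listed above using the re module; the test string is arbitrary.

import re

text = "Small malls sell colour and color; begin, began, begun. Call 555-0199."

print(re.findall(r"[sS]mall", text))    # brackets: 'Small' or 'small'
print(re.findall(r"[0-9]", text))       # dash: any single digit
print(re.findall(r"[^A-Za-z ]", text))  # caret inside []: anything that is not a letter or space
print(re.findall(r"colou?r", text))     # question mark: 'color' or 'colour'
print(re.findall(r"beg.n", text))       # period: any one character between 'beg' and 'n'
print(re.findall(r"^Small", text))      # anchor ^: only at the start of the string
print(re.findall(r"\d\d\d-\d\d\d\d", text))   # character class \d: the phone-style number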
1.4 Finite State Automata
A finite state automaton is an abstract computing device designed as a mathematical model of a system. The system has discrete inputs and outputs, a set of transitions, and different states that are reached on input symbols. It can be represented in three forms, namely:

1. Graphical form (transition diagram)
2. Tabular form (transition table)
3. Mathematical form (mapping or transition functions)

A finite automaton is made up of 5 tuples, which are given as a set:

M = { Q, Σ, δ, q0, F }

• Q: a finite set of states
• Σ: a finite set of input symbols, the alphabet
• δ: Q × Σ → Q, the transition function
• q0: the initial (start) state, which is a member of Q
• F: the set of final (accepting) states, which is a subset of Q

Fig. 1: The three steps in the formation of an FSA.
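A minimal sketch of the 5-tuple in Python, assuming a toy automaton that accepts binary strings ending in '1' (the states and alphabet are invented for illustration):

# Toy finite automaton M = (Q, Sigma, delta, q0, F) accepting binary strings that end in '1'.
Q     = {"q0", "q1"}
Sigma = {"0", "1"}
delta = {                       # transition function: (state, symbol) -> state
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q0", ("q1", "1"): "q1",
}
q0 = "q0"
F  = {"q1"}                     # final (accepting) states

def accepts(string):
    state = q0
    for symbol in string:
        if symbol not in Sigma:
            return False        # symbol outside the alphabet
        state = delta[(state, symbol)]
    return state in F

print(accepts("0101"))   # True  - ends in '1'
print(accepts("0110"))   # False - ends in '0'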

1.5 English Morphology

English morphology is essentially the process of structuring the language so that new words are formed from base words, which can then be used further in NLP. This kind of word formation is also known as 'derivational morphology'. We can achieve it in several ways, as follows:
We can pair the stem (root word) with different suffixes and affixes to make a new word that is relevant to our language processing. A word derived from a pre-existing one in this way is the product of morphology, and the English language has a great many words that were made through it. English derivation, however, is complex. The reasons are:

1. Different meanings: Many words share the same root word but differ when it comes to the true meaning of the word. For example, take the word "conform". A person can add two different suffixes to form the words "conformity" and "conformation"; however, the two words have completely different meanings despite stemming from the same root word.
2. Non-uniform effects: Affixes do not behave the same way across all the words formed in the same manner. Take for example the verb "summarize": adding the nominalizing suffix "-ation" forms "summarization", a word with a very similar meaning, but the same pattern does not hold throughout the entire plethora of the language.

E.g.

Category    Root        Affix   Derived word    New category
Noun        character   -ize    characterize    Verb
Verb        give        -er     giver           Noun
Adjective   real        -ize    realize         Verb
Noun        colour      -ful    colourful       Adjective
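The table above can be read as simple derivation rules. The toy sketch below encodes them as a lookup; real derivation also needs spelling rules and exceptions, which are ignored here.

# Toy derivational rules taken from the table above: (root category, affix) -> new category.
rules = {
    ("Noun", "-ize"):      "Verb",
    ("Verb", "-er"):       "Noun",
    ("Adjective", "-ize"): "Verb",
    ("Noun", "-ful"):      "Adjective",
}

def derive(root, category, affix):
    new_category = rules.get((category, affix))
    if new_category is None:
        return None                           # no rule for this combination
    return root + affix.lstrip("-"), new_category

print(derive("character", "Noun", "-ize"))    # ('characterize', 'Verb')
print(derive("colour", "Noun", "-ful"))       # ('colourful', 'Adjective')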

Key Takeaways

• English morphology is used by the system in order to structure the language and make sense of it.

1.6 Transducers for Lexicon and Rules

A finite-state transducer (FST) is essentially a representation of a set of string pairs, used to analyse outputs and gain a better understanding of them. The pairs can be any kind of string pair, such as input and output pairs over the data.

E.g. An FST can represent {(run, run +Verb +PL), (run, run +Noun +SG), (ran, run +Verb +Past), ...}
A transducer maps an input to zero, one, or many outputs. Because a single input can be related to several outputs, the regular relation defined by an FST is not, in general, a function.
E.g. An FST can transduce: transduce(run) = {run +Verb +PL, run +Noun +SG}
Finite-state transducers essentially have two tapes, known as the input tape and the output tape, and the pairs are made up of input and output strings. There is also the concept of transducer inversion, which is important for switching the pairs, i.e. the input and output labels. The inversion of a transducer T, written T⁻¹, simply swaps the input and output labels: T maps the input I to the output O, whereas T⁻¹ maps O back to I.
T = {(a, a1), (a, a2), (b, b1), (c, c1)}
T⁻¹ = {(a1, a), (a2, a), (b1, b), (c1, c)}
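A rough sketch of an FST viewed as a set of (input, output) string pairs, with transduction and inversion; the pairs follow the run/ran example above, and the helper functions are illustrative, not a standard library API.

# An FST seen as a regular relation: a set of (input, output) string pairs.
T = {
    ("run", "run +Verb +PL"),
    ("run", "run +Noun +SG"),
    ("ran", "run +Verb +Past"),
}

def transduce(fst, inp):
    """All outputs the relation associates with this input (zero, one, or many)."""
    return {out for i, out in fst if i == inp}

def invert(fst):
    """T^-1: swap the input and output labels of every pair."""
    return {(out, i) for i, out in fst}

print(transduce(T, "run"))                        # {'run +Verb +PL', 'run +Noun +SG'}
print(transduce(invert(T), "run +Verb +Past"))    # {'ran'}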
As an example, we can place the two tapes side by side:

• Lexical tape (output alphabet): dog +Noun +Plural

• Surface tape (input alphabet): dogs

A finite-state transducer T thus defines a relation between two languages, an input language and an output language. FSTs have a few applications, which are given below.
1. Recognizer

The transducer takes a pair of strings and decides whether to accept or reject it, based on certain criteria.

2. Generator

The transducer can also generate pairs of strings (viz. from L(input) and L(output)) as output for the language in question.

3. Translator

This process involves reading a string that has been placed as input by the user. The FST then uses morphological parsing to translate the input into its morphemes.

4. Relater

FSTs are key as a medium for computing relationships between different kinds of sets.
Lexicon
Many words in a language are built from existing words that have meaning: essentially a base word with an affix added to it, the result being a word formed through morphology. This process is largely governed by rules, also known as "orthographic rules", and they form the main basis on which morphological parsing takes place. Morphological parsing builds up, or breaks down, a complex structured form into its 'morphemes', and these morphemes in turn make up words that have meaning and sense.
A "lexicon" is essentially a list of words that can be used as stems, together with the affixes that suit them, to form new words. A lexicon should contain the list of base words from which the different derived word forms can be built.

Key Takeaways

• A "lexicon" is essentially a list of words that can be used as stems, together with the affixes that suit them, to form new words.

1.7 Tokenization

Before a user proceeds with processing the language, the input text, in whatever form, needs to undergo a process of normalization. This ensures that the text is clearly delimited and easy to work with before any further processing takes place. The first step in the entire normalization process is known as tokenization, or 'segmentation' of words.

Word tokenization can be achieved very easily with the help of the UNIX command line for text that is solely in English. The process of breaking a given string into smaller parts (tokens) is known as tokenization. Some of the basic commands used to do this are as follows.
1. "tr": this command makes systematic changes to the characters that are part of the input.
2. "sort": this command, as the name suggests, sorts the lines of the input into alphabetical order for easier understanding and inference.
3. "uniq": this command collapses identical adjacent lines of the input and can count how often each occurs.

A simple way to begin tokenizing is to split the text on whitespace. In the Python programming language one can simply call raw.split() on the raw string, as in the NLTK book. Splitting on the space character alone, however, does not handle newlines and tabs; a regular-expression split that matches any run of whitespace covers tabs and newlines as well.
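A small Python sketch of these ideas, assuming an arbitrary sample string: whitespace splitting, a regular-expression split that also handles tabs and newlines, and a word-frequency count that mirrors what the tr | sort | uniq pipeline produces.

import re
from collections import Counter

raw = "His teams are not winning.\nHis fans\tare not happy."

print(raw.split(" "))            # splitting on the space character leaves '\n' and '\t' inside tokens
print(re.split(r"\s+", raw))     # regex split on any whitespace: spaces, tabs, newlines

words = re.findall(r"[a-z]+", raw.lower())         # keep only alphabetic word tokens
print(Counter(words).most_common(3))               # frequencies, like tr | sort | uniq -c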
Problems with tokenization:
Tokenization is a great way to analyse and break down complex language structures, but it can turn out to be a difficult task to execute, and a single solution does not always serve every purpose equally well.
• Deciding what should be grouped as a token and what should not is key to letting users work easily with the output of tokenization.
• Keeping some manually tokenized raw data to compare against the output of the tokenizer helps to inspect its flaws and finally choose an appropriate tokenization.
• Contractions such as "couldn't" pose a notable problem: when analysing sentence meaning it is often preferable to break them down into two separate tokens.
Normalization
In linguistics and NLP, a morpheme is the smallest meaning-bearing unit of a word. A token is basically made up of two kinds of component: a base morpheme (the stem) and inflectional forms such as prefixes and suffixes.

For example, consider the word antinationalist (anti + national + ist), which is made up of "anti" and "ist" as the affixes and "national" as the base morpheme.

Normalization is the process of converting a token into its base form. In the normalization process the inflectional form of a word is removed so that the base form can be obtained. So in our example above, the normal form of antinationalist is national.

Normalization is helpful in reducing the number of unique tokens present in the text, removing the variations in a text, and also cleaning the text by removing redundant information.

Two popular methods used for normalization are stemming and lemmatization.
Let’s discuss them in detail.

Stemming
Stemming is an elementary rule-based process for removing inflectional forms from a given token. The output of this process is the stem of the word; for example laughing, laughed, laughs and laugh will all become laugh after the stemming process.

Stemming is not always a good process for normalization, since it can sometimes produce non-meaningful words which are not present in the dictionary. Consider the sentence "His teams are not winning". After stemming we get "hi team are not winn". Notice that "winn" is not a regular word; also, "hi" has changed the context of the entire sentence.
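As a quick illustration, the sketch below runs NLTK's Porter stemmer (this assumes the nltk package is installed; exact outputs depend on the stemmer, so the expected results in the comments are indicative).

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["laughing", "laughed", "laughs", "laugh", "His"]
print([stemmer.stem(w) for w in words])
# Expect something like ['laugh', 'laugh', 'laugh', 'laugh', 'hi']:
# the inflected forms collapse to one stem, while 'His' becomes the non-word 'hi',
# showing the limits of rule-based stemming.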

Lemmatization
Lemmatization is a systematic process of removing the inflectional form of a token and transforming it into a lemma. It makes use of word structure, vocabulary, part-of-speech tags, and grammatical relations.

The output of lemmatization is a root word called a lemma. For example "am", "are" and "is" will be converted to "be". Similarly, "running", "runs" and "ran" will be replaced by "run".

Also, since it is a systematic process, one can specify the part-of-speech tag for the desired term while performing lemmatization.

Further, lemmatization only changes a word if it is given the proper part-of-speech tag. For instance, if we try to lemmatize the word running as a verb it will be converted to run; but if we try to lemmatize the same word running as a noun it won't be transformed.
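A short sketch with NLTK's WordNet lemmatizer (this assumes nltk and its WordNet data are available), showing how the part-of-speech tag changes the result.

from nltk.stem import WordNetLemmatizer   # requires the WordNet corpus: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # verb reading -> 'run'
print(lemmatizer.lemmatize("running", pos="n"))   # noun reading -> stays 'running'
print(lemmatizer.lemmatize("ran", pos="v"))       # -> 'run'
print(lemmatizer.lemmatize("is", pos="v"))        # -> 'be'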

1.8 Detecting and Correcting Spelling Errors


When typing text into the computer, it matters that people enter the words correctly. Most of the time the words that make up a corpus are entered appropriately. However, if at a certain point people mishear or do not understand the word being dictated to them, an incorrect spelling ends up attached to it. The computer then has to deal with this before the corpus can be examined for further analysis, and this is one of the main reasons why the detection of spelling errors is essential for any corpus.
To get an idea of what could go wrong, we can list the kinds of errors that occur in a corpus:
• Ignoring grammar rules
• Semantically similar components
• Phonetically similar components
There are also different edit errors that are important to recognise while typing. The machine has to get a good idea of the following before letting any of the data into the sorting algorithm.
• Deletion – a letter is missing from the string

• Insertion – a certain letter or letters need to be added

• Substitution – a certain letter or letters in the string need to be replaced by other letters

• Transposition – all the letters are present but in a different order; letters in the string could be swapped with each other to get a meaningful word

In order to detect spelling issues, there are two models that can be employed, depending on the corpus in question. They are:
• Direct detection
The system keeps a list of valid words, built up for example from corrections it has made before. Running the corpus against this list flags the words that do not appear on it, which catches many commonly misspelled words; the words that check out are considered acceptable.

• Language model detection

This model is a little different from the previous one, because some words are valid dictionary entries and yet still wrong in their context. This is why the user builds a model made just for the purpose of correction: a well-trained language model that looks at the context and suggests alternative words for the one that appears to be misspelled.
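A toy sketch of the direct-detection idea, assuming a small hand-made word list (a real system would use a far larger dictionary):

# Direct detection: flag any token that is not in a known word list.
# The vocabulary here is a tiny invented stand-in for a real dictionary.
vocabulary = {"his", "teams", "are", "not", "winning", "the", "match"}

def detect_errors(sentence):
    return [w for w in sentence.lower().split() if w not in vocabulary]

print(detect_errors("His teams are not winnig"))   # ['winnig'] is flagged as a possible error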
Key Takeaways
• Some of the basic operators used in regular expressions are brackets, dash, caret, question mark, concatenation symbols, anchors, etc.

1.9 Minimum Edit Distance

Just as the name rightly suggests, this concept is about the comparison between two different strings. Measuring the similarity between two or more phrases (strings) constitutes a major part of NLP. An example of this can be seen in everyday life: the autocorrect feature you notice while typing a message on your mobile. In short, if the user types a misspelling such as 'kangroo' instead of 'kangaroo', the system tries to work out what exactly the user meant. It concludes that the user has made a mistake and meant a word that is similar to the one typed; based on simple NLP, the system deciphers 'kangaroo' to be the appropriate spelling and consciously makes the change to save the user time and effort.
At times the spelling is not the only thing that differs between strings. Often two strings differ by only a single word. E.g.
1. "Andrew is a football player."
2. "Andrew is a good football player."

On closer observation, we notice that the two sentences are very similar to each other and differ by only a single word, which suggests that they may refer to the same thing. Now that we have an idea of the similarities that can occur between two strings, we can move on to the definition of "edit distance", or simply "minimum edit distance".
Edit distance is the distance between two strings in terms of their similarities and differences. It is the minimum number of operations the user would have to execute to make two strings alike (to transform S1 into S2). The operations include the deletion and insertion of characters, as well as substitution in certain cases, to arrive at the target string.
We can look at the example of the two words 'execution' and 'intention' to understand the idea.

If the two words are aligned character by character, we notice that many letters resemble each other and sit in very similar positions to the word we intend to transform into. The transformation itself can be made with the help of 3 distinct operations, viz. deletion, insertion, and substitution, and such an alignment is a visual depiction of how knowledge of the minimum edit distance can help in text transformation and editing.

We now have a clear idea of exactly what the minimum edit distance is in NLP. However, to calculate it, we need to get familiar with an algorithm known as the "minimum edit distance algorithm": we need to find the shortest sequence of the simple operations mentioned above that gets us to the intended word.
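A sketch of the standard dynamic-programming algorithm for minimum edit distance, with unit cost for insertion, deletion and substitution (some textbook formulations charge 2 for a substitution):

def min_edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions (each cost 1)
    needed to turn source into target."""
    n, m = len(source), len(target)
    # dp[i][j] = distance between source[:i] and target[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                # delete all remaining source characters
    for j in range(m + 1):
        dp[0][j] = j                # insert all remaining target characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + sub_cost)   # substitution / match
    return dp[n][m]

print(min_edit_distance("intention", "execution"))   # 5 with unit substitution cost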
