Unit1 SNLP Osmania University
UNIT-1
NLP is essentially the way humans communicate with a machine so that it can perform the tasks required of it. Around 1950 Alan Turing's work became famous; during the war he had helped develop machines to interpret the encrypted signals being sent by the Germans. These machines used logic and various other key techniques to work out the structure behind each message and crack the code. One of the earliest glimpses of NLP can be seen right there, in communicating with such machines.
Development in this field was not rapid, but there was a growing interest in taking it further to automate many manual activities. The learning curve was slow, and it was only around the year 1990 that people began to take a keen interest in the field. The 90s saw a surge of people developing models using artificial intelligence. To do so they relied heavily on language processing, which received a boost around the same time.
All of this gave people an incentive to use NLP along with machine learning and artificial intelligence to develop probabilistic models that relied heavily on data. Letting the machine read and interpret data has kept NLP central right up to the present time. A lot of this data came from speech recognition, which allowed people to transfer audio files and store them as data on the internet.
In the Internet Age, language processing is used in almost every area around you. Social media platforms, which form a major part of recent generations' lives, rely heavily on language processing. In short, without NLP we would not have achieved the technological advancements of recent times.
NLP is one of the major ways in which a person can communicate with machines to solve problems, and it comes with a plethora of advantages. However, as advanced as the technology may be, a few challenges still lie ahead that need to be overcome. A little about each of these challenges is given below.
1. Text Summarization
This is a challenge for readers who are extracting data and going through a lot of it in a short time; they would prefer to have a summary of it. AI uses NLP to interpret the text and structure a summary out of all the important points. However, a lot of points still get missed. Solving this challenge would help a great many tasks that rely on data inspection.
2. Chatbots
Chatbots are frequently used by people while surfing a website, to navigate the site and find what they need most. Replies are easy to generate for questions that are asked frequently. However, when chatbots are used at a much higher level with more complex reasoning, not every chatbot implemented with AI is able to provide reliable answers. The intention is to allow everyone to use this feature to solve their daily problems.
Language models (LMs) are used to estimate the likelihood of different words or phrases relative to other words or phrases, in order to improve other NLP applications. Most models based on language modelling use probabilistic inference to calculate the probability of the next word that the user might intend to write. A language model can also be used to find the probability of a certain word or phrase in data that already exists, using the simple relative-frequency estimate
P(wi) = C(wi) / n
where 'n' is the total number of word occurrences in the given data and C(wi) is the number of occurrences of the instance wi whose probability we need to find.
In other words, the way the model interprets a certain event is based on past occurrences in the data. Let us understand this model further with the help of an example.
To construct the above sentence, or predict it, we need to consider all the possible combinations based on the rule given below.
1. P(This) = P(w1)
2. P(glass | This) = P(w2 | w1)
3. P(is | This glass) = P(w3 | w1 w2)
4. P(transparent | This glass is) = P(w4 | w1 w2 w3)
If the model needs to calculate the probability of the above sentence, it multiplies these terms together following the chain rule:
P(This glass is transparent) = P(w1) x P(w2 | w1) x P(w3 | w1 w2) x P(w4 | w1 w2 w3)
This is an example where the language model needs to predict the next word the person is going to say or type, in order to save time. Ninety-nine percent of people would agree that "happy" is the correct option. However, the language model does not think the way a normal English-speaking human does; LMs simply use the probability of each candidate word to predict the next word.
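To make this concrete, here is a minimal sketch of count-based next-word prediction using bigram counts. The tiny corpus and the function names are purely illustrative; a real language model would be estimated from far more data.

```python
from collections import Counter, defaultdict

# A tiny illustrative corpus (hypothetical); a real LM is trained on far more text.
corpus = [
    "i am happy", "i am happy", "i am sad",
    "this glass is transparent",
]

# Count unigrams and bigrams.
unigram_counts = Counter()
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def next_word_probability(prev, word):
    """Estimate P(word | prev) from bigram counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

def predict_next(prev):
    """Return the most likely next word after `prev`."""
    candidates = bigram_counts[prev]
    return max(candidates, key=candidates.get) if candidates else None

print(next_word_probability("am", "happy"))  # 2/3
print(predict_next("am"))                    # 'happy'
```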
A regular expression is a language that has been developed to help the user specify a search string to be found in the data. The pattern needs to be specified to get the desired results.
E.g. if a person wishes to find 10 numbers in the data but does not know the exact numbers, he or she can simply specify the pattern that those numbers follow in order to find them.
We can divide a regular-expression search into two parts: the corpus and the pattern.
Corpus: the body of text over which the search is carried out.
Pattern: the expression entered in the search to obtain the required results. The basic expressions used for this purpose are given below.
1. Brackets ("[ ]")
Square brackets specify a disjunction of characters to match, e.g. /[Tt]he/ matches both "The" and "the".
2. Dash (“-”)
Dash is a regular-expression operator that specifies a range for a particular search. If we put a dash between two characters inside brackets, the expression matches any character in the range between them.
E.g. /[A-Z] = All upper-case letters between A and Z would match the search.
/[a-z] = All lower-case letters between a and z would match the search.
3. Caret (“^”)
When it appears as the first symbol inside square brackets, this sign indicates negation: the search matches any character not in the bracketed set, e.g. /[^A-Z]/ matches anything that is not an upper-case letter. Elsewhere it can simply be used to match a literal 'caret' symbol.
4. Question Mark ("?")
The question mark is essentially used to express optionality in the search. Putting a "?" after a certain letter indicates that the user would like the search to match both the form with that letter and the form without it, e.g. /colou?r/ matches both "colour" and "color".
5. Period (".")
The period "." is a wildcard that matches any single character at that position in the expression. If the user has an expression that would make sense once another letter is filled in, the period mark is very useful.
E.g. /beg.n/ matches any word of that shape, such as 'begin', 'begun' or 'began'.
6. Anchors
Anchors are used to match positions (assertions) in the text rather than characters: the caret "^" anchors a match to the start of a line and the dollar sign "$" anchors it to the end of a line.
7. Character Classes
Character classes are shorthand notations for common sets of characters, e.g. \d matches any digit, \w any alphanumeric character or underscore, and \s any whitespace character.
Finite-State Automata (FSA)
A finite-state automaton can be described formally as a 5-tuple M = { Q, Σ, δ, q0, F }, where Q is the set of states, Σ the input alphabet, δ the transition function, q0 the start state and F the set of final states.
Fig. Description 1: The steps in the formation of an FSA.
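The sketch below illustrates the operators described above using Python's re module; the sample text and patterns are made up purely for demonstration.

```python
import re

text = "begin begun began bean Colour color 42nd"

# Dash inside brackets: a range of characters.
print(re.findall(r"[A-Z]\w*", text))        # words starting with an upper-case letter -> ['Colour']
# Caret inside brackets: negation (anything that is NOT a digit or space).
print(re.findall(r"[^0-9 ]+", "a1 b2"))     # ['a', 'b']
# Question mark: the previous character is optional.
print(re.findall(r"colou?r", text, re.IGNORECASE))  # ['Colour', 'color']
# Period: any single character in that position.
print(re.findall(r"beg.n", text))           # ['begin', 'begun', 'began']
# Anchor: ^ matches the start of the string/line.
print(re.findall(r"^begin", text))          # ['begin']
# Character classes: \d matches digits.
print(re.findall(r"\d+", text))             # ['42']
```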
English Morphology is essentially the process of structuring the language so that new words can be formed from other base words, which can then be used further in the field of NLP. This type of morphology is also known as 'derivational morphology'. We can achieve this in many different ways, as follows:
We can pair the stem (root word) with different suffixes and affixes to make a new word that may be relevant to our language processing. A word derived from one that is pre-existing is the product of morphology. The English language has a lot of words that were made through morphology. English derivation, however, is complex. The reasons are:
1. Different meanings: A lot of words share the same root word but differ in their true meaning. For example, take the word "conform". A person can add two different suffixes to form the words "conformity" and "conformation"; however, the two words have completely different meanings despite stemming from the same root word.
2. Non-uniform effects: Affixes do not always behave the same way across words formed in the same manner. Take for example the word "summarize": adding the nominalizing suffix "-ation" forms "summarization", another word with a very similar meaning, but the same pattern does not carry over uniformly across the entire vocabulary of the language.
Key Takeaways
English Morphology is used by the system in order to structure the entire language and
make some sense of it.
E.g. An FST can represent {(run, Run + Verb + PL), (run, run + Noun + SG), (ran, run +
Verb + Past) …..}
A transducer maps each input to zero or more outputs. Because a single input may be related to several outputs, a regular relation such as this is not, in general, a function.
E.g. an FST can transduce(run) = {run +Verb +PL, run +Noun +SG}
Finite-state transducers essentially have two tapes, known as the input tape and the output tape. The pairs they define are made up of input and output labels. There is also the concept of transducer inversion, which is important for switching the pairs, i.e. the input and the output labels. The inversion of a transducer T, written T^-1, simply switches the input and output labels: where T maps the input I to the output O, T^-1 maps O back to I.
T = {(a, a1), (a, a2), (b, b1), (c, c1)}
T^-1 = {(a1, a), (a2, a), (b1, b), (c1, c)}
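A quick sketch of inversion, treating the relation T as a plain set of (input, output) pairs as in the example above:

```python
# A toy relation represented as a set of (input, output) pairs.
T = {("a", "a1"), ("a", "a2"), ("b", "b1"), ("c", "c1")}

# Inversion simply swaps the input and output labels in every pair.
T_inv = {(out, inp) for inp, out in T}

print(sorted(T_inv))  # [('a1', 'a'), ('a2', 'a'), ('b1', 'b'), ('c1', 'c')]
```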
As an example, consider the two tapes for the word "dogs": the surface (input) tape carries the characters d, o, g, s drawn from the input alphabet, while the lexical (output) tape carries the analysis dog +Noun +PL drawn from the output alphabet.
A finite-state transducer T defines a relation between two languages, the input language L(in) and the output language L(out), i.e. a subset of L(in) × L(out). FSTs have a few applications, which are given below.
1. Recognizer
The transducer takes a pair of strings and decides whether to accept or reject it based on certain criteria.
2. Generator
The transducer can also output a pair of strings (viz. one from L(input) and one from L(output)) for a given language.
3. Translator
This involves reading a string placed as input by the user. The FST then uses "morphological parsing" to translate the input into its morphemes.
4. Relater
FSTs can be used as a medium that relates elements of two different sets, as in the sketch below.
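As a rough illustration of the recognizer and translator roles, the toy sketch below stands in for an FST with a simple Python dictionary; the lexicon entries mirror the run/ran example above and are purely illustrative, and a real FST toolkit would of course be far more capable.

```python
# A toy "transducer" stored as a mapping from surface forms to lexical analyses.
relation = {
    "run": ["run +Verb +PL", "run +Noun +SG"],
    "ran": ["run +Verb +Past"],
}

def recognize(surface, lexical):
    """Recognizer: accept or reject a (surface, lexical) pair."""
    return lexical in relation.get(surface, [])

def translate(surface):
    """Translator: map a surface string to its possible morphological parses."""
    return relation.get(surface, [])

print(recognize("ran", "run +Verb +Past"))  # True
print(translate("run"))                     # ['run +Verb +PL', 'run +Noun +SG']
```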
Lexicon
Every word in a language is built from existing elements that have meaning: essentially a base word with an affix added to it. The result is a word that is formed through morphology. Most of this is governed by rules, also known as "orthographic rules", which form the main basis on which morphological parsing takes place. Morphological parsing is the process used to break a complex word form into its 'morphemes'; these morphemes are in turn used to make words that have meaning and sense.
A "lexicon" is essentially a list of words that can be used as stems, together with the affixes that suit them to form new words. A lexicon should contain the list of base words from which the different derived word forms can be built.
Key Takeaways
A “Lexicon” is essentially a list of words that can be used as the stem and affixes that
would suit them to form a new word.
1.7 Tokenization
Before a user proceeds with processing the language, the input text, in whatever form, needs to undergo a process of normalization. This ensures that the text is easily distinguished and easy to work with before any further processing takes place. The first step in the normalization process is known as tokenization, or 'segmentation' of words.
Word tokenization can be achieved very easily with the help of the UNIX command line for text that is solely in English. The process of breaking a given string into smaller parts (tokens) is known as tokenization. Some of the basic commands used for this are as follows.
1. "tr": This command is used to make systematic changes to the characters that are already part of the input (for example, translating every non-alphabetic character into a newline).
2. "sort": As the name suggests, this command sorts the lines of the input into alphabetical order for better understanding and inference.
3. "uniq": This command collapses adjacent identical lines of the input into one (and, with the -c option, counts how many times each line occurs).
A simple method that can be used to start the process of tokenization is to split the text on whitespace. In the Python programming language, one can simply call raw.split() on the raw text (a plain string method, often shown alongside the NLTK library). When splitting with a regular expression, a pattern that matches only the space character will miss other whitespace; we also need to match tabs and newlines while using this process.
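A minimal sketch of these options in Python, assuming only the standard re module; the sample string is made up:

```python
import re

raw = "NLP is fun.\tIt is\nalso challenging."

# Naive tokenization: split on single spaces only (misses tabs and newlines).
print(raw.split(" "))

# Splitting on any whitespace (spaces, tabs, newlines) with a regular expression.
print(re.split(r"\s+", raw))
# ['NLP', 'is', 'fun.', 'It', 'is', 'also', 'challenging.']

# Or keep only word-internal characters as tokens, dropping punctuation.
print(re.findall(r"\w+", raw))
# ['NLP', 'is', 'fun', 'It', 'is', 'also', 'challenging']
```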
Problems with tokenization:
Tokenization may be a great way to analyse and break down complex language structures, but it can turn out to be a difficult task to execute. A single solution may not always serve the purpose as efficiently as one would expect.
Deciding what should be grouped as a token and what should not, is key to enabling users
to work easily with the help of tokenization.
Having raw manually tokenized data to compare with data that has been given back as
output from the tokenizer helps to inspect the flaws and finally choose an appropriate
token in the process.
Contractions pose a major problem, e.g. "couldn't". When analysing the meaning of a sentence, such a form should potentially be broken down into two separate tokens, since each part carries some of the meaning.
Normalization
In the field of linguistics and NLP, a morpheme is defined as the base form of a word. A token is basically made up of two components: morphemes and inflectional forms such as a prefix or suffix.
For example, consider the word antinationalist (anti + national + ist), which is made up of "anti" and "ist" as inflectional forms and "national" as the morpheme.
Normalization is the process of converting a token into its base form. In the normalization process, the inflectional form of a word is removed so that the base form can be obtained; so, in our example above, the normal form of antinationalist is national.
Normalization is helpful in reducing the number of unique tokens present in the text, removing the variations in a text, and also cleaning the text by removing redundant information.
Two popular methods used for normalization are stemming and lemmatization.
Let’s discuss them in detail.
Stemming
It is an elementary rule-based process for removing inflectional forms from a given token. The output of this process is the stem of the word; for example laughing, laughed, laughs and laugh will all become laugh after the stemming process.
Stemming is not always a good process for normalization, since it can sometimes produce non-meaningful words that are not present in the dictionary. Consider the sentence "His teams are not winning". After stemming we get "hi team are not winn". Notice that the output "winn" is not a regular word. Also, "hi" has changed the context of the entire sentence.
Lemmatization
Lemmatization is a systematic process of removing the inflectional form of a token and transforming it into its lemma. It makes use of word structure, vocabulary, part-of-speech tags, and grammatical relations.
The output of lemmatization is a root word called a lemma; for example "am", "are" and "is" will be converted to "be". Similarly, "running", "runs" and "ran" will be replaced by "run".
Also, since it is a systematic process, while performing lemmatization one can specify the part-of-speech tag for the desired term.
Further, lemmatization can only be performed correctly if the given word has the proper part-of-speech tag. For instance, if we try to lemmatize the word "running" as a verb it will be converted to "run", but if we try to lemmatize the same word "running" as a noun it won't be transformed.
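A small sketch contrasting the two processes with NLTK's PorterStemmer and WordNetLemmatizer, assuming nltk is installed and the WordNet data has been downloaded; the exact outputs can vary with the stemmer used.

```python
# Assumes: pip install nltk  and  nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["laughing", "laughed", "laughs", "running", "ran"]

# Stemming: crude rule-based suffix stripping; may yield non-words.
print([stemmer.stem(w) for w in words])
# e.g. ['laugh', 'laugh', 'laugh', 'run', 'ran']

# Lemmatization: needs a part-of-speech tag; 'v' marks the word as a verb.
print([lemmatizer.lemmatize(w, pos="v") for w in words])
# e.g. ['laugh', 'laugh', 'laugh', 'run', 'run']

# The same word lemmatized as a noun is left unchanged.
print(lemmatizer.lemmatize("running", pos="n"))  # 'running'
```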
Spelling errors generally fall into a few categories, including:
Substitution – a certain letter or group of letters in the string needs to be replaced by another letter or letters.
Transposition – all the letters are present but in a different order; letters in the string must be swapped with one another to obtain the meaningful word.
To detect spelling issues there are two models that can be employed, based on the corpus in question. They are:
Direct Detection
After working through a lot of corrections, the system keeps a list of all the words that have been corrected before. Running the corpus against this list can catch many commonly misspelled words; words that check out against the list are considered acceptable.
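A minimal sketch of direct detection, assuming a hypothetical list of known-correct words (in practice this would be a large dictionary or the list of previously corrected words described above):

```python
# A hypothetical list of known-correct words; purely illustrative.
known_words = {"the", "kangaroo", "is", "a", "marsupial"}

def flag_misspellings(tokens):
    """Return the tokens that do not check out against the known-word list."""
    return [t for t in tokens if t.lower() not in known_words]

tokens = "The kangroo is a marsupial".split()
print(flag_misspellings(tokens))  # ['kangroo']
```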
Just as the name suggests, this is about comparing two different strings. Measuring the similarity between two or more phrases (strings) constitutes a major part of NLP. An example can be seen in everyday life: while typing a message on your mobile you notice the autocorrect feature. In short, if the user types a misspelled form such as 'kangroo' instead of 'kangaroo', the system tries to work out what exactly the user meant. This would mean that the user has made a mistake and intended a word similar to the one typed. Based on simple NLP, the system decides that 'kangaroo' is the appropriate spelling and makes the change to save the user time and effort.
At times the spelling is not the only thing that is different about the string. Many a
time people have two strings that differ by only a single word. E.g.
1. “ Andrew is a football player.”
2. “Andrew is a good football player.”
On closer observation, we notice that both sentences are very similar to each other and differ by only a single word ("good"). This tells us that the two strings might refer to the same thing. Now that we have an idea about the similarities that can occur between two strings, we can move on to the definition of "edit distance", or simply "minimum edit distance".
Edit distance is the distance between two strings in terms of their similarities and differences. It refers to the minimum number of operations that the user would have to execute to make two strings alike (i.e. to transform S1 into S2). The operations include deletion and insertion of characters, as well as substitution in certain cases, to reach the target string.
We can look at the following example to understand the case between the 2 words
‘execution’ and ‘intention’.
The alignment between the two words is depicted in the figure above. We can notice that many letters in one word bear a resemblance to letters in a very similar position in the word we intend to transform it into. We notice that the transformation can be made with the help of three distinct operations, viz. deletion, insertion, and substitution. This is a visual depiction of how knowledge of the minimum edit distance can help in text transformation and editing.
We now have a clear idea about what exactly the minimum edit distance in NLP is. However, to calculate it, we need to get familiar with an algorithm known as the "minimum edit distance algorithm". We need to find the shortest path, in terms of the simple operations mentioned above, that gets us from one word to the intended word.
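A standard dynamic-programming sketch of the algorithm (Levenshtein distance with unit costs for insertion, deletion and substitution); other cost schemes, such as counting a substitution as two operations, are also common:

```python
def min_edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions and substitutions
    (each with cost 1) needed to turn `source` into `target`."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,             # deletion
                dist[i][j - 1] + 1,             # insertion
                dist[i - 1][j - 1] + sub_cost,  # substitution (or match)
            )
    return dist[m][n]

print(min_edit_distance("intention", "execution"))  # 5 with unit substitution cost
```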