Chapter II
LITERATURE REVIEW
2.1 Word-Embedding Techniques and Contextual Understanding
Word embedding is one of the most popular techniques for representing document vocabulary. It can capture a word's position within a text, its relationships with other words, and any semantic or syntactic parallels between words, and it is among the most widely used techniques in natural language processing (NLP). Word embeddings are commonly cited as crucial to the operation and efficiency of state-of-the-art (SOTA) models, and they have greatly aided the rapid development of language models such as RNNs, LSTMs, ELMo, BERT, ALBERT, GPT-2, and, most recently, GPT-3. These models effectively accomplish both the construction of language sequences and downstream tasks such as understanding context and word relationships. Word embeddings allow individual words and complete phrases to be expressed mathematically. Since computers can only communicate in the language of numbers, we convert the words of a phrase into numbers so that the computer can read and process them [38]. We do not, however, want computers to be restricted to merely reading and analysing data; we also want them to determine the relationship between each word in a phrase or text and the other words in the same text. Along with preserving the semantic and grammatical similarity of related words, we want word embeddings to preserve the context of the paragraph or preceding phrases.
Grammar is the set of rules for constructing valid sentences, and it plays an important role in describing the syntactic structure of a language. In the simplest terms, grammar in generative linguistics refers to the syntactic constraints used in communication. Beyond this setting, formal language theory is also relevant in computer science, notably in the context of programming languages and data structures.
For instance, the precise grammar rules of the "C" programming language specify how lists
and statements are utilized to create functions.
A grammar G can be represented mathematically as a 4-tuple (N, T, S, P), where N (also written VN) is the set of non-terminal symbols or variables, T is the set of terminal symbols, S is the start symbol with S ∈ N, and P is the set of production rules, which apply to both terminals and non-terminals. The left-hand side of each production is a string over VN ∪ T in which at least one symbol is a non-terminal. A context-free grammar (CFG) is a superset of regular grammar that is used to describe languages. The following four components make up the condensed set of grammar rules known as a CFG:
● A set of non-terminal symbols
● A set of terminal symbols
● A set of productions
● A start symbol
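As an illustration, the 4-tuple definition above can be written out directly in Python; the toy grammar below (simple arithmetic expressions) is our own illustrative example and is not drawn from the text.

# A sketch of the grammar 4-tuple G = (N, T, S, P) as plain Python data
# structures. The toy grammar is illustrative only.
N = {"Expr", "Term"}                     # non-terminal symbols (VN)
T = {"x", "y", "+", "(", ")"}            # terminal symbols
S = "Expr"                               # start symbol, a member of N

# Production rules P: each left-hand side maps to one or more right-hand
# sides, written as sequences of terminals and non-terminals.
P = {
    "Expr": [["Term"], ["Expr", "+", "Term"]],
    "Term": [["x"], ["y"], ["(", "Expr", ")"]],
}

G = (N, T, S, P)

# In a context-free grammar every left-hand side is a single non-terminal.
assert all(lhs in N for lhs in P)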
2.1.1 Word2Vec:
In Word2Vec, a vector is assigned to each word. The starting point is either a one-hot vector or a random vector. A one-hot vector is a representation in which exactly one element of the vector is 1 and the rest are 0 [40]; if the corpus has 500 words, then the vector length is 500. We assign a vector to each word, choose a window size, and then iterate over the entire corpus. Word2Vec creates word vectors, which are distributed numerical representations of word features, and these features can capture the context in which the individual vocabulary words are used. Through the generated vectors, word embeddings eventually help establish the relationship between a word and other words with similar meanings. For this, we use the following neural embedding methods:
2.1.1.2 Skip-Gram
The skip-gram model essentially reverses the CBOW model: a search based on a given target term (the centre word) yields words with comparable meanings, i.e. synonyms for that keyword. Consider the short sentence "The quick brown fox leaps over the slow dog." Many (context, target) combinations can be derived from it, such as ([quick, fox], brown), ([the, brown], quick), ([the, dog], slow), and so on [43]. The skip-gram model switches the places of the context and target words in order to create more precise predictions. Successful completion of this task requires predictions such as [quick, fox] for the target word brown, or [the, brown] for the target word quick, and so on. The model then makes educated guesses as to which other words might be present in the context window around the desired keyword.
The target (centre) word in the skip-gram model predicts the context words. Consequently, consider the sentence: "Word embeddings use a deep learning model in the backend." With "learning" as its centre word and a context window size of 2, the model tries to predict words like "deep" and "model", among others.
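As a brief illustration, the skip-gram setup described above can be sketched with the gensim library (assumed to be available); the tiny corpus, parameter values, and probe word below are illustrative choices, not part of any experiment reported here.

# Minimal skip-gram sketch with gensim; a real model needs far more text.
from gensim.models import Word2Vec

sentences = [
    ["word", "embeddings", "use", "a", "deep", "learning", "model", "in", "the", "backend"],
    ["the", "quick", "brown", "fox", "leaps", "over", "the", "slow", "dog"],
]

# sg=1 selects the skip-gram architecture; window=2 mirrors the context
# window size used in the example above.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["learning"].shape)                 # the 50-dimensional vector for "learning"
print(model.wv.most_similar("learning", topn=3))  # nearest neighbours in this tiny vocabulary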
Because the skip-gram method must predict numerous words from a single input word, we feed it pairs of (X, Y), where X is the input and Y is the label. Positive and negative input samples are generated to achieve this.
For positive input samples, the training data takes the format [(target, context), 1], where the target is the target (centre) word, the context is the words immediately around it, and the label 1 indicates that the pair is relevant. For negative input samples, the training data has the same format, [(target, random), 0]: instead of the real surrounding words, randomly selected words are paired with the target word, and the label 0 indicates that the pair is irrelevant.
Like the CBOW model, the skip-gram architecture requires a deep learning classification model; here, however, the target word is taken as input and the surrounding context words are predicted [43]. Our context contains a large number of terms, which can make the task difficult, so each (target, context words) pair is transformed into (target, context) pairs in which each context consists of a single word. The dataset now consists of pairs such as (brown, quick), (brown, fox), (quick, the), (quick, brown), and so on.
These examples provide the model with information about contextually meaningful words, which in turn produces similar embeddings for words with related meanings. The model operates through the following steps:
1. The target and context word pairs are passed to separate embedding layers, producing dense word embeddings for each of the two terms.
2. A "merge layer" then calculates the dot product of these two embeddings.
3. The dot product value is fed into a dense sigmoid layer, which predicts a label of 1 or 0.
4. The output is compared with the real label, the loss is computed, and backpropagation is used in each epoch to update the embedding layers.
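For illustration only, the four steps can be sketched as a small TensorFlow/Keras model (assumed to be available); the vocabulary size, embedding dimension, and dummy training pairs are placeholder values.

# A sketch of the four steps above with the Keras functional API.
import numpy as np
from tensorflow.keras import layers, Model

vocab_size, embed_dim = 10000, 100

# Step 1: separate embedding layers for the target and the context word.
target_in = layers.Input(shape=(1,), dtype="int32", name="target")
context_in = layers.Input(shape=(1,), dtype="int32", name="context")
target_emb = layers.Embedding(vocab_size, embed_dim)(target_in)    # (batch, 1, dim)
context_emb = layers.Embedding(vocab_size, embed_dim)(context_in)  # (batch, 1, dim)

# Step 2: the "merge layer" computes the dot product of the two embeddings.
dot = layers.Dot(axes=2)([target_emb, context_emb])                # (batch, 1, 1)
dot = layers.Flatten()(dot)

# Step 3: a dense sigmoid layer maps the dot product to a 0/1 prediction.
output = layers.Dense(1, activation="sigmoid")(dot)

# Step 4: binary cross-entropy loss and backpropagation update the embeddings.
model = Model([target_in, context_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Dummy (target, context, label) pairs, only to show the expected input format.
targets = np.array([[1], [2], [3]])
contexts = np.array([[4], [5], [6]])
labels = np.array([[1], [0], [1]])
model.fit([targets, contexts], labels, epochs=1, verbose=0)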
Depending on the relationships involved, the co-occurrence ratio for a specific probe word can be small, large, or close to 1. For instance, if the ratio is large, the probe word is related to wi but not to wj. This ratio therefore reveals the link between the three terms, and in size it is comparable to a bi-gram or tri-gram statistic.
2.1.2 TF-IDF
The significance or relevance of string representations (words, phrases, lemmas, etc.) in a document, in relation to a group of documents (also known as a corpus), can be assessed using a metric known as TF-IDF, or term frequency-inverse document frequency, which is used in the fields of information retrieval (IR) and machine learning. Term frequency is calculated by looking at how frequently a particular term is used within the document [46]. There are several definitions or measurements of frequency, such as:
● Instances of the word in the text (raw count).
● Term frequency adjusted for document length (raw count of occurrences divided by the
number of words in the document).
● Logarithmically scaled frequency, such as log (1 + raw count).
● Binary occurrence (e.g., 1 if the term occurs, or 0 if the term does not occur, in the
document).
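For illustration, these frequency variants can be computed with a short, self-written helper; the function name and sample document below are our own and are not taken from any library.

# Term-frequency variants for a single term in a single document.
import math

def term_frequencies(term, document):
    tokens = document.lower().split()
    raw = tokens.count(term)                     # raw count
    return {
        "raw_count": raw,
        "length_adjusted": raw / len(tokens),    # raw count / number of words
        "log_scaled": math.log(1 + raw),         # log(1 + raw count)
        "binary": 1 if raw > 0 else 0,           # 1 if the term occurs, else 0
    }

print(term_frequencies("fox", "the quick brown fox leaps over the slow dog"))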
Inverse document frequency measures how common or rare a term is across the corpus. IDF is calculated as IDF(t, D) = log(N / |{d ∈ D : t appears in d}|), where N is the number of documents d in the corpus D and t is the term whose weight we are estimating; the denominator is the number of documents that contain the term t.
Textual data, and indeed any activity involving natural language processing (NLP), a branch of machine learning and artificial intelligence (ML/AI) that works with text, must first be vectorized, i.e. transformed into a vector of numerical data, since machine learning algorithms commonly operate on numerical data. To perform TF-IDF vectorization, the TF-IDF score of each word in the corpus is determined in relation to a given document. As a result, each word in the whole corpus of documents receives a TF-IDF score, and each document in the corpus is represented by its own vector. A common example of how TF-IDF is applied in the realm of information retrieval is search engines.
This is because TF-IDF indicates the relative importance of a term within a document. Since TF-IDF weights words based on relevance, this strategy can be used to identify the most significant words, to choose keywords or even tags for a document, and to produce more accurate article summaries [48]. BERT is a Google ML/NLP technique that uses a transformer-based ML model to convert phrases, words, etc. into vectors. The key differences between TF-IDF and BERT are as follows: BERT, as opposed to TF-IDF, takes the context and semantic meaning of the words into account. Additionally, BERT's architecture uses deep neural networks, potentially making it far more expensive to compute than TF-IDF, which has no such requirements. The main advantages of TF-IDF are its simplicity and ease of use: it is simple to compute and run, and it provides a clear starting point for similarity computations (using TF-IDF vectorization and cosine similarity). It should be emphasised, however, that TF-IDF cannot carry semantic meaning. It weighs the words and takes that weight into consideration when calculating their worth, but it is not able to discern the meaning of the words from their context. Compound nouns like "Queen of England" will not be recognised as a "single unit" because BoW and TF-IDF both disregard word order. This is relevant in situations where the order has a big impact, as when negating: "pay the bill" vs. "not pay the bill". Using NER tools or underscores, the phrase in each case can be analysed as a single unit, e.g. "queen_of_england" or "not_pay".
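A minimal sketch of this workflow (TF-IDF vectorization followed by cosine similarity) is shown below, assuming scikit-learn is available; the two documents are illustrative.

# TF-IDF vectors for two short documents and the cosine similarity between them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the queen of england pays the bill",
    "the king does not pay the bill",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # one TF-IDF vector per document

print(vectorizer.get_feature_names_out())       # the vocabulary of the corpus
print(cosine_similarity(tfidf[0], tfidf[1]))    # similarity between the two documents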
2.1.3 Count-Vectorizer
Count Vectorizer is a convenient utility offered by the Python scikit-learn module. It is used to convert a text into a vector based on how frequently (the count) each word appears in the full text. This is useful when we have many such texts and wish to turn each word into a vector. The text should first be cleaned to remove unnecessary words while keeping the ones that are useful for predictive modelling [49]. This procedure is referred to as tokenization in machine learning. The usable words or tokens are then subjected to an encoding process: meaningful words are transformed into integers or floating-point values, because these are the only inputs that machine learning algorithms can correctly interpret. Feature extraction is the process of converting the input textual data into this machine-understandable format.
CountVectorizer is employed to convert the raw text into a matrix of word and n-gram counts. This structure can then be used directly as features (signals) in machine learning tasks such as text classification and clustering. In the example considered here, the numbers are 2 and 7 (two documents and seven unique tokens), so a 2 x 7 matrix is created. In essence, the columns of the matrix M are used to obtain the vector for a given word: the rows of the matrix correspond to the documents of the corpus, while the columns correspond to the tokens found in the dictionary. As we construct this matrix, a few adjustments may be needed. In real application cases the corpus will contain a substantial number of documents, from which many rare words and phrases are learned, and the resulting matrix will be very sparse and therefore expensive and awkward to compute with.
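A minimal CountVectorizer sketch, assuming a recent scikit-learn version, is shown below; the two example documents are our own and are chosen so that they yield seven unique tokens and hence a 2 x 7 matrix, matching the dimensions mentioned above.

# Converting two documents into a document-term count matrix.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)      # sparse document-term matrix

print(vectorizer.get_feature_names_out())    # the tokens (columns)
print(counts.toarray())                      # rows = documents, columns = token counts
print(counts.shape)                          # (2, 7)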
Note that the words "The" and "the" are encoded differently, since they are treated as different words. Each word and symbol in the text data is represented by a unique one-hot vector of 1s and 0s. Because each word in a sentence is represented as a vector, the list of words in a sentence can be represented as a matrix, i.e. an array of vectors. With one-hot encoding, a collection of such sentences therefore becomes an array of matrices, so we can build a three-dimensional tensor for the neural network.
The first line contains our list of samples, which consists of two sentences. To keep track of our words (keys) and the indices (values) that correspond to them, we first create an empty Python dictionary, together with a counter, set to 0, that counts the number of key-value pairs in the dictionary. The outer for loop cycles through the samples, and the inner for loop cycles over the words of the selected sentence, which is split into a list of strings. Then, if the current word is not already included in the dictionary token_index, we add it, assign it an index equal to the counter's value, and increase both the counter and the index by one so that indices begin at 1 rather than 0. As a result, we obtain the vocabulary described above. The word "had" appears several times, and because symbols and numerals are simply text characters delimited by whitespace, they too receive an index. As a result, while the samples contain 11 words, they yield only 10 indices, so the dictionary contains a maximum of ten terms. The result is a three-dimensional tensor with two matrices as members (the number of items in samples). Each of those matrices has a fixed number of rows equal to the maximum sample length (here, 6) and max(token_index.values()) + 1 columns (here, 10 + 1 = 11). We therefore obtain a tensor of shape (2, 6, 11).
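The walk-through above can be sketched as follows; the two sample sentences are our own stand-ins for the samples discussed in the text, chosen so that they likewise yield 11 words, 10 indices, and a tensor of shape (2, 6, 11).

# Word-level one-hot encoding built by hand with NumPy.
import numpy as np

samples = ["The cat sat on the mat.", "The dog ate my homework."]

# Build the word -> index dictionary, starting indices at 1 rather than 0.
token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 6                                         # rows kept per sample
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        results[i, j, token_index[word]] = 1.0         # set the one-hot position

print(token_index)
print(results.shape)    # (2, 6, 11) for these samples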
Text vectorization was generally performed through one-hot encoding prior to the widespread use of neural networks in NLP, in what we will refer to as "traditional" NLP (note that this persists as a useful encoding practice for a number of tasks and has not fallen out of fashion due to the use of neural networks) [52]. In one-hot encoding, each word or token in a text corresponds to a vector element. Consider the sentence "The queen entered the chamber": only the "queen" element of the vector is activated, with the "king", "man", and so on remaining dormant. Consider likewise the one-hot vector representation, over the same vector elements, of the phrase "The king was once a man but is now a child."
The result of using one-hot encoding on a corpus is a sparse matrix. Consider a corpus of 20,000 distinct words: a single short document containing, say, 40 words would be represented by a matrix with 20,000 rows (one for each word) and a maximum of 40 non-zero components (and potentially far fewer if many of the 40 words are not unique). This produces a large number of zeroes and may require a large amount of memory to store these sparse representations.
One-hot encoding presents a considerable disadvantage in terms of meaning representation, in addition to potential memory issues. While this method properly records the presence and absence of words in a document, we cannot infer any meaning from their mere presence or absence. With one-hot encoding we lose positional relationships between words as well as word order. This structure is essential for expressing meaning in words, and it will be examined further in a later section.
Because one-hot word vectors are mutually orthogonal, calculating word similarity is challenging. Consider the pairs "dog" and "dogs", or "vehicle" and "auto": these word pairs are clearly linked in a variety of ways. Traditional NLP methods such as stemming and lemmatization can be used during preprocessing in a linear system with one-hot encoding to highlight the similarity within the first pair; discovering the similarity within the second pair, however, requires a more robust strategy [53]. One-hot encoded word vectors offer the critical advantage of recording binary word co-occurrence (the bag-of-words representation), which is sufficient to perform a wide range of NLP tasks, including text classification, one of the most useful and widely applied tasks in the discipline. Despite its association with linear systems, this type of word vector can be employed in both linear machine learning and neural networks. The n-gram and TF-IDF representations are one-hot-style encoding approaches that perform well for linear systems and serve to alleviate some of the above-mentioned difficulties. They are similar in that they are simple vector representations rather than embeddings, as further detailed below.
This word embedding method has two key drawbacks. The first is what is known as the problem of dimensionality, which refers to the multitude of issues that arise when working with high-dimensional data. Even though our example text has just eight dimensions, the representation demands an increasing amount of memory, and because zeros dominate the matrix, the pertinent information becomes scarce. Assume we have a vocabulary of 50,000 words (the English language contains approximately one million words). Because each word's vector has 49,999 zeros and a single one, we need 50,000 squared = 2.5 billion units of memory, which is inefficient to process. The second issue is that meanings are difficult to capture. Each word is embedded individually, and each vector consists of a single one and N - 1 zeros, where N is the total number of dimensions, so the produced vectors have little in common. We would expect to detect certain similarities among the words "orange", "banana", and "watermelon", such as the fact that they are all types of fruit or that they usually follow some version of the verb "eat"; we can immediately form a mental map or cluster in which these words live together. With one-hot vectors, however, all words are the same distance apart.
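This equal-distance property is easy to verify numerically; the short check below uses NumPy and a toy vocabulary of our own choosing.

# One-hot vectors are mutually orthogonal, so every pair of distinct words
# has a dot product of 0 and exactly the same Euclidean distance.
import numpy as np

vocab = ["orange", "banana", "watermelon", "car"]
one_hot = np.eye(len(vocab))                 # one row per word

orange, banana, car = one_hot[0], one_hot[1], one_hot[3]

print(np.dot(orange, banana))                # 0.0 - no similarity signal
print(np.dot(orange, car))                   # 0.0 - "car" looks no less similar
print(np.linalg.norm(orange - banana))       # sqrt(2)
print(np.linalg.norm(orange - car))          # sqrt(2) - the same distance apart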
A co-occurrence matrix offers another way of capturing such relationships between words. Consider, for example, the following sentences:
i. I love painting
ii. I like Computers
iii. I like Machine Learning
If we set the window size to 1, only the adjacent words are considered as the context of every word, and the co-occurrences are recorded as follows:
i. (I, love), (love, painting)
ii. (I, like), (like, Computers)
iii. (I, like), (like, Machine), (Machine, Learning)
Hence, the final co-occurrence matrix with a window size of 1 would look like:
             Computers  Painting  I  Like  Machine  Learning  Love
I                0          0     0    2      0        0       1
Like             1          0     2    0      1        0       0
Machine          0          0     0    1      0        1       0
Learning         0          0     0    0      1        0       0
Love             0          1     1    0      0        0       0
Painting         0          0     0    0      0        0       1
Computers        0          0     0    1      0        0       0
The method of building a co-occurrence matrix normally involves three elements: the matrix of unique words, the focus word, and the length of the window. The matrix of unique words is constructed from each of the document's unique words, with all of its values initially set to zero. After the matrix has been built, the entire document is combed through word by word to pull out each individual word, and this process is repeated for each sentence in the document. In addition, the length of the window is determined: the window length is the number of surrounding words that are taken into consideration as context words, and it is the simplest way to control how much context is captured. In this stage, for each focus word we determine which context words fall within the window on either side of it and record their co-occurrence counts in the matrix.
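The three elements described above can be put together in a short sketch of our own, applied to the example sentences with a window size of 1, using plain Python and NumPy.

# Building the co-occurrence matrix for the three example sentences.
import numpy as np

sentences = [
    ["I", "love", "painting"],
    ["I", "like", "Computers"],
    ["I", "like", "Machine", "Learning"],
]
window = 1

# Step 1: the matrix of unique words, initialised to zero.
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
matrix = np.zeros((len(vocab), len(vocab)), dtype=int)

# Steps 2 and 3: walk through each sentence word by word (the focus word)
# and count the context words that fall inside the window.
for sentence in sentences:
    for i, focus in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                matrix[index[focus], index[sentence[j]]] += 1

print(vocab)
print(matrix)   # e.g. the ("I", "like") cell is 2, since they co-occur twice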
When applied to a more extensive text corpus or document, this strategy for detecting co-occurrences becomes high-dimensional, which in turn increases the complexity of the process [60]; this is one of the problems with utilising this approach. Eigenvalue techniques such as Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) can be used to address this problem. We put these methods into action when we have a large dataset that naturally includes an excessive number of dimensions, which can increase the complexity of processing and create new problems. With the assistance of SVD and PCA, the model can keep the vital characteristics of a dataset while efficiently ignoring the unnecessary ones; the end result is a reduction in dimensionality and an increase in effectiveness.
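As an illustrative sketch, the co-occurrence matrix built above can be reduced with truncated SVD from scikit-learn (assumed available); the choice of two retained components is arbitrary.

# Reducing the 7 x 7 co-occurrence matrix to dense 2-dimensional word vectors.
import numpy as np
from sklearn.decomposition import TruncatedSVD

vocab = ["Computers", "I", "Learning", "Machine", "like", "love", "painting"]
matrix = np.array([
    [0, 0, 0, 0, 1, 0, 0],   # Computers
    [0, 0, 0, 0, 2, 1, 0],   # I
    [0, 0, 0, 1, 0, 0, 0],   # Learning
    [0, 0, 1, 0, 1, 0, 0],   # Machine
    [1, 2, 0, 1, 0, 0, 0],   # like
    [0, 1, 0, 0, 0, 0, 1],   # love
    [0, 0, 0, 0, 0, 1, 0],   # painting
])

svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(matrix)          # each word becomes a 2-dimensional vector

for word, vector in zip(vocab, reduced):
    print(word, vector)

# How much of the original variance the two retained dimensions preserve.
print(svd.explained_variance_ratio_)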