
Chapter 2

LITERATURE REVIEW
2.1 Word-Embedding Techniques and Contextual Understanding
Word embedding is one of the most popular techniques for representing document vocabulary, and one of the most widely used techniques in natural language processing (NLP). It can capture a word's position within a text, its relationships with other words, and any semantic or syntactic parallels. Word embeddings are commonly cited as crucial to the operation and efficiency of state-of-the-art (SOTA) models, and they have greatly aided the rapid development of language models such as RNNs, LSTMs, ELMo, BERT, ALBERT, GPT-2, and the more recent GPT-3. These models use embeddings both to build language sequences and to perform downstream tasks such as understanding context and word relationships. Word embeddings allow both individual words and complete phrases to be expressed mathematically. Since computers can only operate on numbers, we convert the words in a phrase into numbers so that a computer can read and process them [38]. We do not want computers to be restricted to merely reading and analyzing data, though. We also want them to determine the relationship between each word in a phrase or text and the other words in the same text. Along with preserving the semantic and syntactic similarity of words, we want word embeddings to preserve the context of the paragraph or the preceding sentences.
Grammar is the set of rules for constructing valid sentences, and it plays an important role in describing the syntactic structure of a language. In the simplest terms, grammar refers to the syntactic constraints used in communication; in generative grammar these constraints are made explicit. Formal language theory is relevant not only in this setting but also in computer science, notably in the context of programming languages and data structures.

For instance, the precise grammar rules of the "C" programming language specify how lists
and statements are utilized to create functions.
A grammar G can be represented mathematically as a 4-tuple (N, T, S, P), where:
N (also written VN) is the set of non-terminal symbols or variables;
T is the set of terminal symbols;
S is the start symbol, with S a member of N;
P is the set of production rules, which rewrite strings over the combined alphabet of terminals and non-terminals; the left-hand side of each rule must contain at least one non-terminal symbol.
Context-free grammar (CFG) is a superset of regular grammar that is used to describe languages. The following four components make up the condensed set of grammar rules known as a CFG:
● a set of non-terminals
● a set of terminals
● a set of productions
● a start symbol

Figure 2.1: Context Free Grammar representations
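To make the 4-tuple G = (N, T, S, P) concrete, the following minimal Python sketch encodes a toy grammar and performs a left-most derivation. The grammar itself is an invented example rather than one taken from this work.

# A toy context-free grammar G = (N, T, S, P), encoded as plain Python data.
# The grammar is illustrative only.
N = {"S", "NP", "VP"}                      # non-terminal symbols
T = {"the", "dog", "barks"}                # terminal symbols
S = "S"                                    # start symbol, a member of N
P = {                                      # production rules
    "S":  [["NP", "VP"]],
    "NP": [["the", "dog"]],
    "VP": [["barks"]],
}

def derive(symbols):
    """Expand the left-most non-terminal until only terminals remain."""
    for i, sym in enumerate(symbols):
        if sym in N:                       # found a non-terminal: rewrite it
            expansion = P[sym][0]          # always take the first rule in this toy example
            return derive(symbols[:i] + expansion + symbols[i + 1:])
    return symbols                         # only terminals remain

print(" ".join(derive([S])))               # -> "the dog barks"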


The techniques required to extract pertinent features from a specific corpus of text are fundamentally different from, say, producing a vector of continuous numbers. The goal is that the semantic positioning of words, the way information is organized into structured sequences in sentences and other texts, conveys the meaning of the text. To achieve the goal of text categorization, we studied and employed two different NLP models that satisfy these two requirements: appropriate data representation and preservation of the text's context.
Figure 2.2: Word Embedding Representation

A word is represented by a lower-dimensional numeric vector called a word vector or word embedding. It allows words with comparable meanings to have comparable representations, and these representations can loosely convey meaning. A word vector with 50 values can represent 50 different attributes, that is, properties that relate words to one another, such as age, profession, level of fitness, or sport [39]. These attributes correspond to the values of each word vector.
Word embeddings are a method for feeding textual features into a machine learning model that can handle textual input, while trying to keep the syntactic and semantic information intact. Techniques such as Bag of Words (BoW), Count Vectorizer, and TF-IDF rely on the word counts in a sentence and store no syntactic or semantic information. In these techniques, the size of the vector depends on the number of vocabulary elements, and the result is often a sparse matrix in which most elements are 0. Large input vectors produce a large number of weights, which increases the amount of computation required for training. Word embeddings provide a solution to these problems. Word embeddings and related text representations can be obtained using the following methods:

2.1.1 Word2Vec:
In Word2Vec, a vector is assigned to each word. The starting point is either a one-hot vector or a random vector. A one-hot vector is a representation in which exactly one element of the vector is 1 and all others are 0 [40]; if the corpus has 500 words, the vector length is 500. We assign a vector to each word, choose a window size, and then iterate over the entire corpus. Word2Vec creates word vectors, which are distributed numerical representations of word attributes. These attributes can also incorporate the context in which the specific vocabulary words are used. Through the generated vectors, word embeddings ultimately help establish the relationship between a word and other words with similar meanings. For this, we use the following neural embedding methods:

2.1.1.1 Continuous Bag of Words (CBOW):


Although Word2Vec is an unsupervised model that allows you to provide a corpus without
any label information and construct dense word embeddings, it uses a supervised
classification model internally to extract these embeddings from the corpus [41]. We feed
context words (X) into the deep learning classification model of the CBOW architecture to
forecast our target word (Y).
In a sentence such as "Word2Vec has a deep learning model functioning in the backend," there are pairs of context words and target (center) words. If we choose a context window size of 2, we obtain pairs such as ([deep, model], learning), ([model, in], functioning), ([a, learning], deep), and so on. The deep learning network then attempts to predict these target words based on the context words.

Figure 2.3: Continuous Bag of Words Representation


The CBOW model incorporates information about the context of individual words, and in the following step the network draws conclusions from the data it has collected. Consider the sentence "It is a pleasant day": the context words are fed into the neural network in order to predict the target word, and both the input words and the target word used for monitoring the error are one-hot encoded.
The accompanying diagram shows a typical application of the CBOW paradigm. The model takes the context of the surrounding words into account to determine the word it must predict. The training data is laid out as (context word, target word) pairs, and the width of the context window is configurable [42]. With a context window of 2, the sentence "It is a pleasant day" yields pairs such as ([it, a], is), ([is, pleasant], a), and ([a, day], pleasant). The approach uses inter-word relationships to locate the target word in a given text. When a single target word is predicted from four context words, the input layer receives four 1 x V one-hot vectors, where V is the vocabulary size. In the hidden layer, this set of input vectors is multiplied by a V x N weight matrix; the resulting vectors are added element-wise in the sum layer to give a 1 x N hidden representation, which is passed through the activation and mapped to the output layer to predict the target word.
The following steps constitute the model's operation:
1. The context words are initially provided as an input to an embedding layer, as shown in
the Figure below (which is first initialized with some random weights).
2. After delivering the word embeddings to the lambda layer, we average the word
embeddings.
3. These averaged embeddings are then passed to a dense softmax layer that predicts
our target word. The prediction is compared with the target word, the loss is computed,
and we backpropagate in each epoch to update the embedding layer.
4. After training is complete, we may extract the embeddings for the required words
from our embedding layer.
Figure 2.4: Word Embedding Operation
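As a minimal illustration of CBOW training, the following sketch uses the gensim library; the toy corpus and the parameter values are placeholders chosen for this example, not the settings used in this work.

# Minimal CBOW training sketch using gensim (illustrative corpus and settings).
from gensim.models import Word2Vec

corpus = [
    ["word2vec", "has", "a", "deep", "learning", "model", "functioning", "in", "the", "backend"],
    ["it", "is", "a", "pleasant", "day"],
]

# sg=0 selects the CBOW architecture; window=2 matches the context size used above.
model = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

print(model.wv["learning"][:5])               # first few embedding values for "learning"
print(model.wv.most_similar("deep", topn=3))  # nearest neighbours in the toy embedding space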

2.1.1.2 Skip-Gram
The skip-gram model essentially reverses the CBOW model: given a target (centre)
word, it searches for the words likely to appear in its context, which also makes it
possible to find words with meanings comparable to the target word. Consider the
sentence "The quick brown fox leaps over the lazy dog." With a window size of 2,
we obtain pairs such as ([quick, fox], brown), ([the, brown], quick), ([the, lazy],
dog), and so on [43]. The skip-gram model then switches the places of the context
and target words in order to make more precise predictions: the task becomes
predicting [quick, fox] from the target word brown, [the, brown] from the target
word quick, and so on. The model therefore makes educated guesses as to which
other words are likely to appear in the context window around the given target word.
In the skip-gram model, the target (center) word predicts the context words.
Consider again the sentence "Word2Vec has a deep learning model functioning in
the backend." With "learning" as the center word and a context window size of 2,
the model tries to predict words such as "deep" and "model," among others.

Figure 2.5: Skip-Gram Representations

Because the skip-gram method must predict multiple words from a single input
word, we feed it pairs (X, Y), where X is the input and Y is the label. Positive and
negative input samples are generated for this purpose.
For positive input samples, the training data has the format [(target, context), 1],
where target is the target (center) word, context is a word from its immediate
surroundings, and the label 1 indicates that the pair is relevant. For negative input
samples, the training data has the same format, [(target, random), 0]: instead of the
actual surrounding words, randomly selected words are paired with the target word,
and the label 0 indicates that the pair is irrelevant.
Like the CBOW model, the skip-gram architecture relies internally on a deep
learning classification model, but it takes the target word as input and predicts the
context words [43]. Because a context may contain many words, which complicates
training, each (target, context words) pair is transformed into (target, context) pairs
in which each context consists of a single word. The original dataset then becomes
(brown, quick), (brown, fox), (quick, the), (quick, brown), and so on.
These examples provide the model with information about the contextually
meaningful words, which in turn generates similar embeddings for words with related
meanings. The following steps constitute the model's operation:

1. By sending the target and context word pairs to different embedding layers,
we can create dense word embeddings for each of these two terms.
2. A merge layer is then used to compute the dot product of these two
embeddings.
3. The dot product is fed into a dense sigmoid layer, which outputs a
prediction between 0 and 1.
4. The output is compared to the real label, the loss is determined, and
backpropagation is then used with each epoch to update the embedding layer.

Figure 2.6: Skip-Gram Operation
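The positive and negative sample format described above can be sketched in plain Python as follows; this is a simplified illustration (the negative samples are drawn uniformly and may occasionally coincide with true context words), not the exact sampling scheme of any particular library.

# Generate [(target, context), label] pairs with label 1 for true context words
# and label 0 for randomly drawn words (simplified negative sampling).
import random

sentence = "the quick brown fox leaps over the lazy dog".split()
vocab = list(set(sentence))
window = 2

pairs = []
for i, target in enumerate(sentence):
    # Positive samples: words inside the context window get label 1.
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append(((target, sentence[j]), 1))
    # Negative samples: randomly chosen vocabulary words get label 0.
    for _ in range(window):
        pairs.append(((target, random.choice(vocab)), 0))

print(pairs[:6])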


2.1.1.3 GloVe
GloVe is another word embedding technique. Using this approach, we iterate
through the corpus and count how often each word co-occurs with other words [44],
which yields a co-occurrence matrix. Adjacent words contribute a value of 1, words
separated by one other word contribute 1/2, words separated by two words
contribute 1/3, and so on. Given a corpus of V words, the co-occurrence matrix X
will be a V x V matrix, where the entry in the ith row and jth column, Xij, represents
the (weighted) frequency of co-occurrence between word i and word j. The
co-occurrence matrix shown below is an example.

Figure 2.7: Co-Occurrence Glove-matrix

For two words wi and wj and a specific probe word, the ratio of their co-occurrence
probabilities can be small, large, or close to 1 depending on how the words are related.
If the ratio is large, for instance, the probe word is related to wi but not to wj. This
ratio therefore reveals the relationship between the three words, and the window-based
counts it is built on are comparable in spirit to bi-gram or tri-gram statistics.
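A small sketch of the distance-weighted counting described above is given below; the corpus, window size, and weighting are illustrative assumptions (adjacent words add 1, words one position further away add 1/2, and so on), not the full GloVe training procedure.

# Distance-weighted co-occurrence counts in the GloVe style (toy corpus).
from collections import defaultdict

corpus = [["i", "love", "painting"], ["i", "like", "machine", "learning"]]
window = 3

X = defaultdict(float)                      # X[(w_i, w_j)] = weighted co-occurrence count
for sentence in corpus:
    for i, wi in enumerate(sentence):
        for j in range(i + 1, min(i + 1 + window, len(sentence))):
            weight = 1.0 / (j - i)          # 1, 1/2, 1/3, ... with increasing distance
            X[(wi, sentence[j])] += weight
            X[(sentence[j], wi)] += weight  # keep the matrix symmetric

print(X[("machine", "learning")])           # 1.0  (adjacent words)
print(X[("like", "learning")])              # 0.5  (separated by one word)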

2.1.1.4 FastText


FastText is a free, open-source library for learning text representations and text
classifiers. It works on standard, generic hardware, and models can later be scaled
down to fit even on portable devices. FastText differs from word2vec-style word
vectors in that it treats each word as being composed of character n-grams, which
plain word vectors do not [45]. The constituent n-grams of the word sunny, for
instance, include [sun, unn, nny] and [sunn, unny], where n may range from 1 to the
length of the word. This gives FastText's word representation several advantages
over word2vec or GloVe. It is useful for finding vector representations of
uncommon words: rare words can still be divided into character n-grams, so they
may share n-grams with common terms. Medical terminology such as "diseases,"
for example, may be rare for a model trained on a news dataset.

Figure 2.8: Fast text Implementation


Because out-of-vocabulary (OOV) words can also be broken down into character n-grams,
FastText can provide vector representations for words that are not included in its dictionary,
whereas GloVe and word2vec cannot generate any vector representation for such words.
Gensim might, for instance, return a zero vector or a small random vector for a term like
"stupefantabulously wonderful," which may never have appeared in any corpus. By splitting
the word into smaller pieces and using the vectors of those pieces to build a final vector for
the word, FastText can provide vectors that are better than random ones; in this case the
resulting vector might resemble the vectors of "wonderful" and "fantabulous" more closely.
Character n-gram embeddings also tend to outperform word2vec and GloVe on smaller
datasets [45].
In machine learning tasks, words usually cannot be used in their raw form; one approach is
to turn them into representations that capture some of their attributes. A person, for example,
might be described as ["height": 5.10, "weight": 75, "colour": "dusky", ...], where height,
weight, and so on are characteristics of the individual. In the same way that words with
similar meanings generally have related representations, word representations capture some
abstract aspects of words. In FastText, each word is represented by a set of character n-grams
in addition to the word itself. With n = 3, FastText's character n-gram representation of the
word matter is <ma, mat, att, tte, ter, er>, where the boundary symbols < and > distinguish a
word's n-grams from the word itself appearing elsewhere in the vocabulary; the word mat, if
it is in the dictionary, is rendered as <mat>. Shorter words that also occur inside longer ones
therefore remain meaningful, and the significance of suffixes and prefixes is easy to capture.
The -minn and -maxn arguments, which denote the minimum and maximum n-gram lengths,
can be used to change the overall range of n-grams; they determine the values of n for which
n-grams are extracted. The model is referred to as a "bag of words" model because, aside
from the sliding window used for n-gram selection, no internal word structure is considered
for featurization; in other words, as long as the characters fall within the window, the order
of the character n-grams is irrelevant. N-gram embeddings can also be turned off entirely by
setting both arguments to 0.
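The character n-gram decomposition with boundary symbols can be sketched as follows; this mirrors the "matter" example above but is an independent illustration rather than FastText's internal code.

# FastText-style character n-gram extraction with "<" and ">" as boundary symbols.
def char_ngrams(word, minn=3, maxn=6):
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)                 # the full word itself is also kept
    return grams

print(sorted(char_ngrams("matter", minn=3, maxn=3)))
# -> ['<ma', '<matter>', 'att', 'er>', 'mat', 'ter', 'tte']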

2.1.2 TF-IDF
TF-IDF, or term frequency-inverse document frequency, is a metric used to assess the
significance or relevance of string representations (words, phrases, lemmas, etc.) in a
document relative to a collection of documents (a corpus). It is used in information retrieval
(IR) and machine learning. Term frequency is calculated by looking at how frequently a
particular term is used within the document [46].
There are several definitions or measurements for frequency, such as:
● Instances of the word in the text (raw count).
● Term frequency adjusted for document length (raw count of occurrences divided by the
number of words in the document).
● Logarithmically scaled frequency, such as log (1 + raw count).
● Binary occurrence (e.g., 1 if the term occurs, or 0 if the term does not occur, in the
document).
Inverse document frequency looks at how common (or uncommon) a term is across the
corpus. It is calculated as idf(t, D) = log(N / df(t)), where N is the number of documents in
the corpus D, t is the term whose weight we are estimating, and df(t) is the number of
documents that contain the term t (the denominator).

Figure 2.9: IDF formula


We need IDF because words like "of," "as," and "the" appear frequently in an English corpus
and would otherwise dominate the weighting. Using inverse document frequency, we can
decrease the weight of frequent terms while boosting the significance of infrequent terms.
IDFs may be computed either from a background corpus, which corrects for sampling bias,
or from the dataset being used in the current experiment. In a nutshell, the essential idea
behind TF-IDF is that the importance of a term is inversely related to its frequency across
documents. TF tells us how often a term appears in each document, while IDF tells us how
rare the term is within the collection of documents. Multiplying these two quantities gives
the final TF-IDF value.

Figure 2.10: TF-IDF formula
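The following worked sketch applies the definitions above, tf-idf(t, d) = tf(t, d) x idf(t, D) with idf(t, D) = log(N / df(t)); the two toy documents are invented for illustration.

# Manual TF-IDF computation on two toy documents.
import math

docs = [
    "the man went out for a walk",
    "the children sat around the fire",
]

def tf(term, doc):
    words = doc.split()
    return words.count(term) / len(words)           # raw count adjusted for document length

def idf(term, docs):
    df = sum(1 for d in docs if term in d.split())  # number of documents containing the term
    return math.log(len(docs) / df)

for term in ["the", "fire"]:
    print(term, round(tf(term, docs[1]) * idf(term, docs), 3))
# "the" appears in every document, so its score is 0.0; "fire" is rarer and scores higher.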


The TF-IDF score of a term rises with increasing relevance or importance and falls towards 0
as relevance decreases, so the TF-IDF measure is a highly useful tool for assessing a term's
significance within a document. The primary uses of TF-IDF are text summarization,
keyword extraction, machine learning, and information retrieval.

Figure 2.11: TF-IDF Implementation

Textual data used in any natural language processing (NLP) activity, the branch of machine
learning and artificial intelligence (ML/AI) that deals with text, must first be vectorized, that
is, transformed into a vector of numerical values, because machine learning algorithms
operate on numerical data. To perform TF-IDF vectorization, you first compute the TF-IDF
score for each word in your corpus relative to a given document (see the example documents
"A" and "B" in the figure below). Each word in the corpus thus receives a TF-IDF score, and
each document in the corpus gets its own vector. A common example of how TF-IDF is
applied in information retrieval is search engines.

Figure 2.12: TF-IDF Example

This is because TF-IDF can indicate a term's relative importance within a document. Since
TF-IDF weights words by relevance, this strategy can be used to identify the most significant
words, to choose keywords or tags for a document, and to produce more accurate article
summaries [48]. BERT, by contrast, is a Google ML/NLP technique that uses a
transformer-based model to convert phrases, words, and so on into vectors. The key
differences between TF-IDF and BERT are as follows: BERT, unlike TF-IDF, takes the
context and semantic meaning of words into account; on the other hand, BERT's architecture
uses deep neural networks, which can make it far more expensive to compute than TF-IDF,
which has no such requirements. The main advantages of TF-IDF are its simplicity and ease
of use: it is simple to compute and run, and it provides a clear starting point for similarity
computations (using TF-IDF vectorization and cosine similarity). It should be emphasised,
however, that TF-IDF cannot capture semantic meaning. It weights words when calculating
their importance, but it cannot always infer the meaning of words from their context.
Compound nouns such as "Queen of England" will not be recognised as a single unit,
because BoW and TF-IDF both disregard word order. This matters in situations where order
has a big impact, such as negation: "pay the bill" versus "not pay the bill." Two ways to treat
such expressions as single units are to use NER tools or to join the tokens with underscores,
for example "queen_of_england" or "not_pay."
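A short scikit-learn sketch of TF-IDF vectorization followed by cosine similarity, as discussed above, is shown below; the example documents are invented.

# TF-IDF vectorization and cosine similarity with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the queen of england pays the bill",
    "the queen of england does not pay the bill",
    "word embeddings capture context and semantics",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)          # one TF-IDF vector per document

print(tfidf.shape)                              # (3, vocabulary size)
print(cosine_similarity(tfidf[0], tfidf[1]))    # high: the documents share most words
print(cosine_similarity(tfidf[0], tfidf[2]))    # low: little vocabulary overlap

Note that, exactly as discussed above, the first two documents receive a high similarity score even though one negates the other, because word order is ignored.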

2.1.3 Count-Vectorizer
CountVectorizer is a useful utility offered by the Python scikit-learn library. It converts a text
into a vector based on how frequently (the count) each word appears in the text, which is
helpful when we have many such texts and wish to turn the words in each of them into
vectors. The text is first cleaned to remove unnecessary words while keeping those that are
useful for predictive modelling [49]; in machine learning this procedure is referred to as
tokenization. The usable words, or tokens, are then subjected to an encoding process:
meaningful words are transformed into integers or floating-point values, because these are
the only inputs that machine learning algorithms can correctly interpret. Converting the input
textual data into a machine-understandable format in this way is called feature extraction.

Figure 2.13: Count-Vectorizer


To use textual information for parameter estimation, the text must first be tokenized, that is,
split into individual terms, and those terms must then be converted into integers or
floating-point values so that they can be used as inputs to machine learning algorithms; this
process is known as feature extraction (or vectorization). Scikit-learn's CountVectorizer
converts a corpus of text documents into a matrix of token counts, and it also allows the text
to be pre-processed before the vector representation is created [50]. This makes it an
extremely versatile feature extraction module for text. To demonstrate how CountVectorizer
works, consider the title of a book from a well-known series. Suppose the title contains nine
different terms; nine columns follow. The rows of the array correspond to the documents in
our collection, and each column represents an individual term from the vocabulary. Because
the record contains just the one book title, there is only one row in this case. The number in
each cell expresses how many times that term occurs; if a given term does not appear in the
specific document, its count is simply 0.

Figure 2.14: Count-Vectorizer example

CountVectorizer is employed to convert raw text into a numerical representation of its words
and n-grams, and this structure can be used directly as features (signals) in machine learning
tasks such as text classification and clustering. In the next example there are 2 documents and
7 unique tokens, so a 2 x 7 matrix is created. In essence, the columns give the count vector for
each token in the matrix M: the rows of the matrix correspond to the documents in the corpus,
while the columns correspond to the tokens found in the vocabulary. As we construct this
matrix, a few adjustments may be needed. In real application cases the corpus will contain a
substantial number of documents, from which many rare words and phrases are learned, so
the matrix becomes sparse and high-dimensional, which makes it expensive to build and
unstable for computation.
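The behaviour described above can be sketched with scikit-learn as follows; the two book titles are invented placeholders for the example mentioned in the text.

# CountVectorizer producing a document-term count matrix for a two-document corpus.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Harry Potter and the Philosopher's Stone",
    "Harry Potter and the Chamber of Secrets",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)     # sparse matrix of token counts

print(vectorizer.get_feature_names_out())     # the vocabulary (one entry per column)
print(counts.toarray())                       # rows = documents, cells = term counts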

2.1.4 One-Hot Encoding


In one-hot encoding, every word in the input, including any symbols, is encoded as a vector
containing only the digits 1 and 0. Accordingly, a vector that contains a single 1 and 0s
everywhere else is known as a one-hot vector [51]. A unique, independent one-hot vector is
used to describe each term: a term can be identified from its one-hot vector and vice versa,
and no two terms ever share the same one-hot representation. In the figure below, the words
from the given text have been one-hot encoded.

Figure 2.15: One-Hot Encoding

In the figure, the words "The" and "the" are encoded differently, demonstrating that they are
treated as different words. Each word and symbol in the text data is represented by a unique
one-hot vector of 1s and 0s. Since each word in a sentence is a vector, the list of words in a
sentence can be shown as a matrix (an array of vectors). With one-hot encoding, the corpus
therefore becomes an array whose elements are these matrices, so we can build a
three-dimensional tensor for the neural network.
The first line contains our list of examples, which consists of two sentences. To keep track of
our words (keys) and the indices (values) that correspond to them, we first create an empty
Python dictionary, together with a counter, initialised to 0, that counts the number of
key-value pairs in the dictionary. The outer for loop cycles through the samples, and the inner
for loop cycles through the words of the selected sentence, splitting it into a list of strings.
Then, if the current word is not already included in the dictionary token_index, we add it,
assign it an index equal to the counter, and increment the counter, so that indices begin at 1
rather than 0. As a result, we obtain the vocabulary illustrated in the figure above. The word
"had" appears several times, and because symbols and numerals are also text characters
delimited by whitespace, they too receive an index. Consequently, although the samples
contain 11 words, they yield only 10 indices; the dictionary holds a maximum of ten terms.
The result is a three-dimensional tensor containing two matrices (the number of items in
"samples"). Each of these matrices has a number of columns equal to
max(token_index.values()) + 1 (here, 10 + 1) and a number of rows equal to the maximum
sample length (here, 6). We therefore obtain a tensor of shape (2, 6, 11).
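The procedure just described can be sketched as follows. The sample sentences here are placeholders (the original samples are not reproduced in this chapter), so the resulting tensor shape differs from the (2, 6, 11) quoted above.

# Build a word-to-index dictionary (indices starting at 1) and a one-hot tensor
# of shape (num_samples, max_length, max_index + 1).
import numpy as np

samples = ["She had a dog", "He had a red pen"]        # placeholder sentences

token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1   # indices start at 1, not 0

max_length = max(len(s.split()) for s in samples)
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()):
        results[i, j, token_index[word]] = 1.0         # switch on the word's slot

print(results.shape)    # (2, 5, 8) for these placeholder sentences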
Before the widespread use of neural networks in NLP, in what we will refer to as "traditional"
NLP, text vectorization was generally performed through one-hot encoding (note that this
remains a useful encoding practice for a number of tasks and has not fallen out of fashion
because of neural networks) [52]. In one-hot encoding, each word or token in a text
corresponds to a vector element. Consider the sentence "The queen entered the chamber":
only the "queen" element of the corresponding vector is set to 1, with the "king," "man," and
so on remaining 0. The phrase "The king was once a man but is now a child" can be
represented with one-hot vectors in the same way. Applying one-hot encoding to a corpus
produces a sparse matrix. Consider a corpus of 20,000 distinct words: a single short document
containing, say, 40 words would be represented by a matrix with 20,000 rows (one for each
vocabulary word) and at most 40 non-zero entries (potentially far fewer if many of the 40
words are repeated). This produces a large number of zeroes and may require a large amount
of memory to store these sparse representations.
One-hot encoding presents a considerable disadvantage in terms of meaning representation,
in addition to the potential memory issues. While this method properly records the presence
and absence of words in a document, we cannot infer any meaning from their mere presence
or absence. With one-hot encoding we also lose positional relationships between words as
well as word order, structure that is essential for expressing meaning and that will be
examined further in a later section.
Because one-hot word vectors are mutually orthogonal, calculating word similarity is
challenging. Consider the terms "dog" and "dogs," or "vehicle" and "auto": each pair is
clearly related. Traditional NLP methods such as stemming and lemmatization, applied
during preprocessing in a linear system with one-hot encoding, can highlight the similarity of
the first pair, but discovering the similarity of the second pair requires a more robust strategy
[53]. One-hot encoded word vectors do offer the critical advantage of recording binary word
co-occurrence (the bag-of-words view), which is sufficient for a wide range of NLP tasks,
including text classification, one of the most useful and widely used tasks in the discipline.
Despite its association with linear systems, this type of word vector can be employed in both
linear machine learning models and neural networks. The n-gram and TF-IDF
representations are one-hot-style encoding approaches that perform well for linear systems
and alleviate some of the difficulties mentioned above. They are similar in that they are
simple vector representations rather than embeddings, as detailed further below.
This encoding method has two key drawbacks. The first is what is known as the curse of
dimensionality, which refers to the multitude of issues that arise when working with
high-dimensional data. Even though our example text has only a handful of dimensions, the
representation demands an ever-increasing amount of memory as the vocabulary grows, and
because zeros dominate the matrix, pertinent information becomes scarce. Assume we have a
vocabulary of 50,000 words (the English language contains approximately one million). We
would need 50,000 squared = 2.5 billion units of memory, because each word vector consists
of 49,999 zeros and a single one; in terms of processing, this is inefficient. The second issue
is that it is difficult to capture meaning. Each word is embedded individually as a single one
and N zeros, where N is the number of dimensions, so the produced vectors have little in
common. We would expect to detect certain similarities among words such as "orange,"
"banana," and "watermelon," for example the fact that they are all types of fruit or that they
usually follow some form of the verb "eat"; we can immediately form a mental map or
cluster in which these words live together. With one-hot vectors, nevertheless, all words are
the same distance apart.

2.1.5 Dense Embedding Vectors


Sparse word vectors appear to be a reasonable means of encoding text data while taking
binary word co-occurrence into consideration, and some of the most visible problems with
one-hot encodings can be avoided by employing related linear techniques such as n-grams
and TF-IDF. Beyond that, word embedding vectors can be used to make inferences about the
meaning of text and the semantic link between tokens [54]. "Traditional" natural language
processing relies on part-of-speech (POS) tagging, which can be learned or done manually,
to identify the role of each token in a text (noun, verb, adverb, indefinite article, etc.). This
kind of feature extraction yields data that can be used as input to a wide variety of NLP
techniques. Take named entity recognition: to locate named entities in a text stream, it is
necessary to identify which words and phrases are nouns, since named entities are almost
always a subset of all nouns.
One strategy is to encode feature data as dense vectors of relevant attribute values in an
embedding space of size d: an embedding of size fifty or one hundred can represent a
vocabulary of, say, twenty thousand different words. In this approach, rather than having
their own independent dimensions, features are converted into vectors. As mentioned before,
the meaning of a phrase is determined by the placement of its words rather than merely by
how often its constituent parts occur together. It is not easy to understand what is meant if I
only mention that the terms "foo" and "bar" appear in a phrase; given the sentence "The foo
growled, which terrified the young bar," the true meaning of the terms can be understood.
What, then, are these attributes, precisely? A neural network is trained to capture them from
nothing more than the words being exchanged. Even though humans have a hard time
interpreting the resulting dimensions directly, the accompanying image clarifies the logic
behind the time-honoured King-Man-Woman-Queen paradigm. Where do these features
come from? Word embeddings are widely employed in natural language processing, but their
construction does not itself require "deep learning"; rather, the learned weights are generally
reused in deep learning tasks [55]. Two popular and pioneering word2vec embedding
techniques, Continuous Bag of Words (CBOW) and Skip-gram, are based on the tasks of
word-in-context prediction and context-in-word prediction, respectively (where the context is
a sliding window of words in the text). After training, we discard the model's output layer,
because it adds no further value, and use the weights of the embedding layer for subsequent
NLP neural network tasks. Word vectors are thus the final product of the embedding layer
trained on one-hot encoded input.

2.1.6 Similarity based Representation vectors


Chatbots and other conversational AI technologies now allow intelligent devices in the IoT
to hold conversations with one another that are both natural and seamless. One of their most
useful features is the ability to interpret the user's intent from spoken or written commands
[56]. This is a challenging task because of the wide range of topics and writing styles they
must cover. Para2Vec stands out as a language-model-based method that uses deep neural
networks to construct paragraph-level representations of text. While such approaches show
promise, they cannot reach their full potential without a massive amount of training data,
which is rarely available in the natural language text domain. Pretrained vectors, that is,
vectors generated from an external corpus, can be used to provide a unified (static) vector
representation across application areas. It would be far more useful if an utterance could be
represented by several vectors, each with a slightly different meaning depending on the
circumstances in which it was uttered.
Additionally, traditional methods typically generate embeddings that lean too heavily on the
domain of the training data. SIMVECS is a vector representation approach that builds
dynamic vector representations of utterances, which can then be used in a variety of
language understanding (LU) applications. In SIMVECS, each utterance is represented as a
vector of similarity scores, computed automatically against a set of "typical utterances" from
the same application. A text can be interpreted in many ways: a pizza delivery app would
interpret "I want a huge pizza" as a request of type Order Food, whereas a bus tracking app
would interpret it as a request of type None (out of scope). Given this, application-specific
vector representations are necessary for capturing the semantics of utterances in LU services
and accurately mapping them to user-defined intents. Word embedding is a common method
in which a word or sentence is represented by a vector of real numbers. Among the many
word embedding methods available, Word2Vec and GloVe stand out as two of the most
common and effective [56]. Before the advent of word embedding, n-grams were frequently
employed to represent individual words numerically. To see which of these strategies holds
up best, consider a simple real-world example: we are given two statements about having a
dog or a puppy as a household pet (quite prevalent in India) and asked to select the one(s)
that relate to animal companions. The basic concepts in both statements, which both pertain
to canines, are the same, yet word similarity cannot be inferred from n-gram data. In other
words, an n-gram model does not know that "dog" and "puppy" refer to the same kind of
animal companion; it can only predict the occurrence of particular word sequences. Word
embeddings, by contrast, cluster together words that are used in comparable contexts.
Embeddings of words are a numerical representation of text written in a natural language.
Phrase embeddings are one approach that attempts to achieve the same goal; this method
begins with dictionaries to translate words and phrases into a vector representation, and
phrase embeddings can be represented in a discrete or a distributed fashion. Word
embeddings are very convenient, since word comparison can be carried out through
operations on lower-dimensional matrices. Since their input must typically be numeric, most
machine learning and deep learning algorithms cannot consume natural language content
directly.
One issue with these simpler options is that their vector representations of texts are often
inaccurate, since they fail to capture the semantics of individual words. The resulting sparse
vectors, sometimes referred to as sparse representations, also increase the memory and
processing requirements of the modelling process.
Neural word embeddings answer all of these problems: they provide compact,
lower-dimensional descriptions and a more succinct depiction of meaning via contextual
analogies [57].
Word embeddings are a learned representation for text in which words with similar meanings
receive similar numerical representations, which makes analogies such as "King - Man +
Woman = Queen" possible. A key part of the method is the use of a dense distributed
representation for each word: each term is written as a real-valued vector of fixed
dimensionality (often tens of dimensions).
Word sense disambiguation, named entity recognition (NER), and other downstream natural
language processing tasks rely heavily on contextual or vector representations of words. The
goal is to develop a word embedding that is syntactically and semantically rich yet relatively
low-dimensional. Within the wider discipline of natural language processing (NLP), word
embedding is closely tied to language modelling and feature learning techniques.
Mathematically, each word is transformed into a vector representation in a high-dimensional
space. The vector space model developed for information retrieval in the 1960s is considered
the origin of word embedding methods [58], and additional background on how neural
networks are used to form such vectors can be found in the series of publications titled
"Neural Probabilistic Language Models," initially published in 2000 by Bengio et al.
Because computers cannot understand human language, numerical input is necessary for
machine learning and deep learning algorithms. As discussed in the literature survey, a wide
variety of word embedding algorithms exist, and neural network-based word embedding
methods outperform their non-neural counterparts by a significant margin. We therefore
focus on popular word embedding models such as word2vec and GloVe. These two
cutting-edge word embedding techniques have benefited greatly from model training on
massive text corpora: the availability of huge corpora allowed high-quality word embeddings
to be learned by capturing syntactic and semantic information. The resulting word
embeddings are applicable to both supervised and unsupervised NLP applications. An
illustration of the role of context is the phrase "He cannot find his [buddy]": the missing
word can be deduced from the context, which suggests that he cannot find his friend, and
this kind of inference requires contextual information rather than the word in isolation.
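The analogy and similarity behaviour described above can be checked with pretrained vectors; the sketch below loads a small published GloVe model through gensim's downloader (an assumed choice of model, requiring a one-off download), so the exact numbers will vary with the vectors used.

# Word analogies and similarities with pretrained GloVe vectors via gensim.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")     # small pretrained GloVe model

print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
print(vectors.similarity("dog", "puppy"))        # related words receive a high score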
Existing Approach to identify offensive language using multilingual versions of BERT
The existing process is built on multilingual BERT and XLM-RoBERTa (BERT base) together
with neural networks. The BERT multilingual base model (cased) was built with a masked
language modelling (MLM) objective on the 104 languages with the largest Wikipedias. This
model is case sensitive, meaning that it distinguishes, for example, between "english" and
"English". BERT is a self-supervised transformer model that was developed using a large
multilingual dataset collection. This means it was pre-trained on raw text only, with no human
labelling (which is why it can exploit the abundance of publicly available data), using an
automatic procedure to construct inputs and labels from those texts. The XLM-RoBERTa
model has been pre-trained on 2.5TB of cleaned Common Crawl data covering multiple
languages. RoBERTa is likewise a self-supervised transformer model trained on a large
corpus: it was pre-trained on raw text only, without any manual labelling, and used an
automatic method to build inputs and labels from these texts.
It was pre-trained with two objectives in mind:
1. Masked language modelling: the model randomly masks 15% of the words in the
input, then runs the entire masked sentence through the network and predicts the
masked words. This differs from classic recurrent neural networks, which usually
see the words sequentially, and from autoregressive models such as GPT, which
internally mask the future tokens. It enables the model to learn a bidirectional
representation of the sentence.
2. Next sentence prediction: the model concatenates two masked sentences as input
during pretraining. Sometimes they correspond to sentences that are adjacent in the
original text, and sometimes they do not; the model must then predict whether the
two sentences followed each other. This allows the model to learn an internal
representation of the languages in the training data, which can then be used to
extract features useful for downstream tasks. The raw model can be used for masked
language modelling or next sentence prediction, but it is intended to be fine-tuned
for downstream applications.
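A minimal sketch of the masked language modelling objective, using the multilingual BERT checkpoint named above through the Hugging Face transformers pipeline, is shown below; the example sentence is invented and the model must be downloaded on first use.

# Masked language modelling with multilingual BERT via the transformers pipeline.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-multilingual-cased")

for prediction in unmasker("Offensive language detection is an important [MASK] task."):
    print(prediction["token_str"], round(prediction["score"], 3))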
Dropout’s introduction in Neural Networks and its effects:
Deep learning is ultimately used to predict outcomes from a set of features; as a corollary,
anything we can do to improve the effectiveness of our models is a win. Dropout is a strategy
for preventing model overfitting: it sets the outgoing edges of randomly selected hidden units
to 0 at each update of the training process. Deep learning models are, without a doubt, among
the most powerful machine learning models currently available. Because of their large
number of parameters they can learn incredibly complex functions, but as a result they also
run the risk of overfitting the training data.
Other regularisation approaches, such as weight decay or early stopping, are often less
effective than dropout. This is because on each pass through the network dropout deactivates
a different subset of neurons, so you are effectively averaging the results of many networks
with different neuron compositions. In machine learning, one typical method of ensuring
model robustness is to train several models and average their results. Ensembling is a
technique for identifying and correcting the errors made by single models, and aggregation
approaches work best when the models have different architectures and are trained on
different portions of the training data.
This strategy would be prohibitively expensive in deep learning, because training even a
single neural network takes a significant amount of time and computing resources. This is
particularly true in domains such as natural language processing, where there may be
millions of training instances; there may also be insufficient labelled training data to train
several models on distinct subsets. Dropout is used only during training, to improve the
network's robustness to fluctuations in the data. When we perform testing, we want to take
full advantage of the whole network, so dropout is not employed on test data or during
inference.
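A minimal Keras sketch of a classifier with dropout is shown below; the layer sizes, input dimension, and dropout rate are illustrative choices. Keras applies dropout only during training and disables it automatically at inference time, matching the behaviour described above.

# Small feed-forward classifier with dropout layers (illustrative sizes).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(300,)),
    tf.keras.layers.Dropout(0.5),          # randomly zero 50% of activations per update
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()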

2.1.7 Co-Occurrence Matrix


It is possible to draw certain inferences about the relationship between two words simply
from the fact that they appear together. In a document that has N sentences, some terms are
likely to appear in more than one sentence [59]. One of the components of a co-occurrence
matrix is the context window, whose primary function is to call attention to word
combinations that occur rather frequently. Since our document is nothing more than a
collection of sentences, the co-occurrence matrix can be read as a matrix representation of
the text. The main objective of this stage is to look for occurrences of each word within the
body of the text. For the sake of this discussion, suppose the document contains the
following sentences:

i. I love painting
ii. I like Computers
iii. I like Machine Learning
If we set the window size to 1, this indicates that only the adjacent words would be
considered as the context of every word and the co-occurrence would be written down as
follows:
i. I = love (1 occurrence), like (2 occurrences)
ii. love = I (1 occurrence), painting (1 occurrence)
iii. painting = love (1 occurrence)
iv. like = I (2 occurrences), Computers (1 occurrence), Machine (1 occurrence)
v. Computers = like (1 occurrence)
vi. Machine = like (1 occurrence), Learning (1 occurrence)
vii. Learning = Machine (1 occurrence)

Hence, the final co-occurrence matrix with a window size of 1 would look like:

Table 2.1 - Co-occurrence matrix

            Computers  Painting  I  Like  Machine  Learning  Love
Computers       0         0      0    1      0        0       0
Painting        0         0      0    0      0        0       1
I               0         0      0    2      0        0       1
Like            1         0      2    0      1        0       0
Machine         0         0      0    1      0        1       0
Learning        0         0      0    0      1        0       0
Love            0         1      1    0      0        0       0

Building a co-occurrence matrix normally involves three elements: the matrix of unique
words, the focus word, and the window length. The matrix of unique words is constructed
from each of the document's distinct words, with all of its values initially set to zero. After
the matrix has been built, the entire document is combed through word by word, sentence by
sentence, to extract each individual word. In addition, the length of the window is chosen:
the window length is the number of surrounding words, the context words, that are taken
into consideration. In this stage of the process, for each focus word we count the context
words that fall within the window and record them in the matrix.
When applied to a more extensive text corpus or document, this strategy for detecting
co-occurrences becomes very high dimensional, which increases the complexity of the
process [60]; this is one of the problems with using this approach. Eigenvalue-based
techniques such as Singular Value Decomposition (SVD) and Principal Component Analysis
(PCA) can be used to address this problem. We apply these methods when we have a large
dataset that naturally includes an excessive number of dimensions, which can increase the
complexity of the computation and create new problems. With the help of SVD and PCA,
the model can keep the vital characteristics of a dataset while efficiently ignoring the
unnecessary ones; the end result is a reduction in the number of dimensions and an increase
in effectiveness.
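The window-size-1 counting of Table 2.1 and the SVD-based dimensionality reduction discussed above can be sketched as follows; the choice of two output components is an illustrative assumption.

# Build the window-size-1 co-occurrence matrix of Table 2.1 and reduce it with truncated SVD.
import numpy as np
from sklearn.decomposition import TruncatedSVD

sentences = [["I", "love", "painting"],
             ["I", "like", "Computers"],
             ["I", "like", "Machine", "Learning"]]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

M = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i in range(len(s) - 1):                        # window size 1: adjacent words only
        a, b = index[s[i]], index[s[i + 1]]
        M[a, b] += 1
        M[b, a] += 1

print(vocab)
print(M.astype(int))                                   # matches Table 2.1 up to row/column order

svd = TruncatedSVD(n_components=2, random_state=0)     # compress each word vector to 2 dimensions
print(svd.fit_transform(M))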
