Machine Learning Applications
Authors
- Mohan Raparthi
- Astha Sharma
- Dr. Haewon Byeon
- Sahil Arora
www.xoffencerpublication.in
Copyright © 2024 Xoffencer
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material
is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter
developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis
or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive
use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the
provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must
always be obtained from the Publisher. Permissions for use may be obtained through RightsLink at the Copyright
Clearance Center. Violations are liable to prosecution under the respective Copyright Law.
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every
occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion
and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not
identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary
rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither
the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may
be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
MRP: 499/-
Published by:
Satyam Soni
Contact us:
Email: mr.xoffencer@gmail.com
Author Details
Mohan Raparthi
Astha Sharma
Dr. Haewon Byeon
Dr. Haewon Byeon received the DrSc degree in Biomedical Science from Ajou University School of Medicine. He currently works at the Department of Medical Big Data, Inje University. His recent interests focus on health promotion, AI in medicine, and biostatistics. He is currently a member of the international committee for Frontiers in Psychiatry and an editorial board member of the World Journal of Psychiatry. He has also worked as Principal Investigator on four projects funded by the Ministry of Education, the Korea Research Foundation, and the Ministry of Health and Welfare. Byeon has published more than 343 articles and 19 books.
Sahil Arora
Sahil Arora is an experienced professional with a solid educational background,
having obtained a Bachelor's degree in Information Technology and a Master's degree
in Information Systems with a specialization in Data Sciences. With over 11 years of
experience in Information Technology, Sahil has refined his expertise across various
domains, including technical product management, software development, critical edge
infrastructure development, and Identity & Access Management (IAM).
His breadth of knowledge extends to various areas closely aligned with Data Science,
such as machine learning, artificial intelligence, natural language processing (NLP),
data mining, and predictive analytics. These proficiencies play a crucial role in his
capacity as a Staff Product Manager at Twilio Inc.
Preface
The text has been written in simple language and style, in a well-organized and systematic way, and utmost care has been taken to cover the entire prescribed procedures for science students.
We express our sincere gratitude to the authors, not only for their effort in preparing the procedures for the present volume, but also for their patience in waiting to see their work in print. Finally, we are also thankful to our publisher, Xoffencer Publishers, Gwalior, Madhya Pradesh, for taking all the effort to bring out this volume in a short span of time.
Abstract
Machine learning, one of the oldest and most established subfields of artificial intelligence, focuses on the study of computational approaches for discovering new information and managing existing knowledge. Machine learning techniques have been adopted in a wide range of application industries. In recent years, however, new kinds of data have become available as a consequence of a multitude of technological developments and research projects (for instance, the completion of the Human Genome Project and the spread of the Web). Consequently, new domains that can make use of machine learning have come into existence. Modern applications currently being explored include learning from biological sequences, learning from email data, learning in complex environments such as the Web, and learning from natural language processing. The objective of this study is to present these application domains, together with some of the recent attempts that have been made to use machine learning techniques to analyze the data that these domains supply.
Contents
Chapter No. Chapter Names Page No.
Chapter 1 Introduction 1-32
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
Learning is a deeply individual process for every one of us. Will Durant raised the question "Is Man a Machine?" in his well-known book The Pleasures of Philosophy. In the chapter titled "Is Man a Machine?", Durant composed lines that are regarded as masterpieces. These include: "Here is a youth; consider it striving, with both fear and courage, to lift itself to a vertical dignity for the very first time; why should it be so eager to stand and walk? And why should it shake with an insatiable curiosity, with a hazardous and unquenchable ambition, with touching and tasting, with watching and listening, with manipulating and experimenting, with observing and wondering, with growing, until it weighs the globe and charts and measures the stars?" The ability to learn, however, is not something that human beings alone possess.
This extraordinary phenomenon can be seen even in the most fundamental of species, such as the amoeba and the paramecium. Plants, too, may exhibit intelligent activity. In the natural world, the only things that do not take part in the process of learning are those that are not alive. From this perspective, learning and living appear inextricably linked to one another. Little can be said about learning in the domain of nonliving objects produced by nature. Machines, in turn, are nonliving objects that humans have developed.
Can we put learning into these devices? It may be a pipe dream, but perhaps one day we will be able to create a computer that is capable of learning in the same way that humans do. If that objective is ever accomplished, it will result in deterministic machines that possess freedom (or, more precisely, the illusion of freedom). Only then will we be able to say boldly that our humanoids are depictions of people in the form of machines, and the closest things to humans in existence.
1.2 PRELIMINARIES
Signals collected by the sense organs are carried by the nervous system to the human brain, where they are processed for perception and acted upon. The process of perception begins with the organisation of the information, continues with its recognition, achieved by differentiating it from previous experiences stored in memory, and ends with its interpretation. The brain is then the organ responsible for making a decision and instructing the other parts of the body to react to it. Once the experience is complete, it may be retained in the mind so that it can be of benefit in the future.
A computer, in contrast, cannot process the information it receives in an intelligent way. Because machines are unable to assess data for classification, learn from previous experiences, and store new experiences in memory units, they cannot learn from experience. To put it another way, machines do not acquire knowledge through experience. Machines are not expected to do any of the following: understand the play Romeo and Juliet, jump over a hole in the street, form friendships, interact with other machines through a common language, recognise dangers and the ways to avoid them, determine a disease based on its symptoms and laboratory tests, recognise the face of a criminal, and so on. Machines are, however, expected to perform mechanical jobs far more swiftly than humans.
It is a challenging endeavour to instruct unskilled machines on how to react correctly to situations such as these. Because robots were originally designed to assist people in their day-to-day tasks, it is vital for machines to possess cognitive skills similar to those of humans: the capacity to think, grasp, and solve problems, as well as the ability to make accurate judgements. Another way of putting it is that we need machines that are intelligent. In fact, the term "smart machine" captures both what machine learning has already accomplished and the objectives it will strive to reach in the future. We return to the issue of intelligent machines in Section 1.4.
The question of whether a computer is capable of thinking was first asked by Alan Turing, a British mathematician, in 1950. This inquiry served as the starting point for the development of artificial intelligence. He is also the one who conceived the idea of a test that would assess the level of intelligence possessed by a computer in order to facilitate its evaluation. Section 1.4 also discusses the progress that has been made in determining whether our computers are capable of passing the Turing test.
Computers are machines that carry out the activities expected of them and help us find answers to problems by following the instructions written into them; the central processing unit (CPU) is what resolves these concerns and problems for us. Consider the following scenario: we have a collection of numbers that are not sorted in any specific order, and we want to find the smallest one. Completing this task is not a problem for us.
Different individuals, however, may tackle the same task in different ways when it comes to them. To put it another way, different individuals may use different algorithms to achieve the same objective. In essence, these procedures, often referred to as algorithms, consist of a set of instructions that are carried out to move from one state to another in order to produce output from input. When several algorithms can perform the same task, it is reasonable to ask which algorithm is better than the others. For example, if two programs are developed, based on two different algorithms, to find the smallest number in an unordered list, then for the same list of unordered numbers (the same input) and on the same machine, one measure of efficiency can be the speed of the program, and another can be the amount of memory it uses.
The metrics most often used to measure the efficiency of an algorithm are therefore time and space. In some situations time and space are related to one another; for example, a reduction in the amount of memory used can lead to the algorithm completing more rapidly. A program whose efficient algorithm allows it to keep all of its data in cache memory will, for instance, also execute more quickly.
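To make the time and space discussion concrete, the following minimal sketch (in Python, with a hypothetical list of random numbers) compares two ways of finding the smallest number in an unordered list: a single linear scan and sorting a copy of the list first. The timing calls are standard library functions; the exact figures will, of course, vary from machine to machine.

```python
import random
import time

numbers = [random.randint(0, 10**6) for _ in range(100_000)]  # unordered input

def smallest_by_scan(values):
    # Single linear pass: O(n) time, O(1) extra space.
    current_min = values[0]
    for v in values[1:]:
        if v < current_min:
            current_min = v
    return current_min

def smallest_by_sorting(values):
    # Sort a copy and take the first element: O(n log n) time, O(n) extra space.
    return sorted(values)[0]

for finder in (smallest_by_scan, smallest_by_sorting):
    start = time.perf_counter()
    result = finder(numbers)
    elapsed = time.perf_counter() - start
    print(f"{finder.__name__}: {result} in {elapsed:.4f} s")
```

Both functions produce the same output for the same input; they differ only in how much time and memory they need, which is exactly what the efficiency metrics above measure.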
Machine learning is a subfield of artificial intelligence whose purpose is to make it feasible for machines to perform their duties expertly by using intelligent software. Statistical learning methods are the essential building blocks of the intelligent software used to produce machine intelligence. Because machine learning algorithms need data in order to learn, the field must also be connected with database and information science, and the two areas have to work together.
Terms such as Knowledge Discovery from Data (KDD), data mining, and pattern recognition are often used in closely related senses. When the wider picture is taken into consideration, it is natural to ask how these fields relate to one another. SAS Institute Inc., headquartered in North Carolina, develops the Statistical Analysis System (SAS), which is widely considered to be among the best-known pieces of analytical software. To illustrate how the area of machine learning is connected to a wide range of related disciplines, we use an example from SAS: the representation shown in Figure 1.1, which was used in a data mining course taught by SAS in 1998.
In a 2006 paper titled "The Discipline of Machine Learning," Tom Mitchell offered the following description of the field. Machine learning arose as a natural branch of the merging of computer science and statistics. The question that perhaps best captures the core of computer science is "How can we construct machines that solve problems, and which problems are inherently tractable and which are intractable?" The question that largely defines statistics is "What can be inferred from data plus a set of modelling assumptions, and with what reliability?"
The defining question of machine learning draws on both of these fields, even though it is a distinct discipline. Computer science has traditionally focused on how to manually program computers, whereas machine learning is concerned with how to get computers to program themselves (using their own experience in conjunction with some initial structure). Machine learning also raises additional questions about the computational architectures and algorithms that can be used to most effectively capture, store, index, retrieve, and merge these data.
If someone were to ask how we recognise voices, we would have a very difficult time articulating a reply. Because we do not sufficiently understand such a phenomenon (in this case, speech recognition), we are unable to hand-craft algorithms that are ideal for scenarios such as this one. Machine learning strategies are likely to be of enormous help in bridging this understanding gap. The argument is really quite simple: we do not need to obtain a grasp of the underlying processes that drive our own learning. Instead, the computer programs we develop are designed to teach computers how to learn and to give them the capacity to perform tasks such as prediction.
The goal of learning is to construct a model that is capable of delivering the desired result from the information presented to it. In some situations we may be able to grasp the model, but in others it may seem to us like a mysterious black box whose functioning cannot be articulated in an intuitive manner. The process that we want machines to emulate can be thought of as an approximation of the model we have built. Because of this, it is quite probable that the model will make errors for a portion of the input, yet most of the time it will provide replies that are correct. The accuracy of the results is therefore another measure of performance for a machine learning algorithm, in addition to metrics such as speed and memory utilisation. In this context, it is appropriate to quote another observation that Professor Tom Mitchell of Carnegie Mellon University made about learning computer programs:
Figure 1.2 Different machine learning techniques and their required data.
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. More explanation, with examples placed in the right spots, will be needed to fully comprehend this definition.
Before we get into the topic, however, we will go over a few vocabulary terms that are often used in the data mining and machine learning community. This is a prerequisite for understanding the examples of machine learning applications that will be provided. Figure 1.2 gives a quick description of the characteristics of the data required by each of the four different machine learning approaches. The four approaches are dissected and discussed in turn in Sections 1.2.2 through 1.2.5.
1.2.2 Supervised Learning
The goal of supervised learning is to extract a function or mapping from training data that has been properly labeled. The input vector is denoted by X, and the output vector of labels or tags is denoted by Y. Together they constitute the training data: the label or tag in Y is the explanation of the corresponding input example in X. To phrase it another way, training data is comprised of training instances. If there is no labeling for the input vector X, the data is deemed unlabeled. Why, then, is this particular kind of learning referred to as supervised learning?
Every training sample in the training data has a label associated with it, and the output vector Y is comprised of these labels. A supervisor is responsible for delivering each of these labels. Most of the time these supervisors are human beings, although computers are also capable of performing labeling tasks of this kind. Even though human judgments are more expensive than machine labels, the higher error rates in data labeled by machines demonstrate that human judgment is superior. For supervised learning, manually labeled data is therefore a resource that is both trustworthy and very useful. On the other hand, there are some situations in which machines can be used to provide reliable labels.
Table 1.1 presents five samples of data that have not been labeled. A wide range of criteria may be used to determine the labels applied to these samples. The second column of the table, titled "Example judgment for labeling," lists a possible criterion for each data instance. The third column describes the possible labels that might be applied once that judgment has been made. The fourth column identifies the actor who is competent to act as the supervisor. Although machines could be used in each of the first four situations described in Table 1.1, their use is rather unlikely because of the low accuracy rates they currently achieve.
Despite the many advances made in speech recognition, image identification, and emotion analysis technologies over the last three decades, there is still substantial room for improvement before their performance can be compared to that of humans. In the fifth scenario, tumor detection, ordinary people are unable to label the X-ray data; the labeling procedure needs the services of skilled specialists, which may be very expensive.
Supervised learning encompasses two separate groups of algorithms for processing information (a short sketch of each is given after the list):
1. Regression
2. Classification
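The following minimal sketch, assuming the scikit-learn library is available and using a tiny made-up dataset, illustrates the difference between the two: regression predicts a continuous number, while classification predicts a discrete label.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy supervised data: X holds the input examples, the y vectors hold the labels/targets.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]

# Regression: the target is a continuous number (e.g., a price).
y_regression = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2]
reg = LinearRegression().fit(X, y_regression)
print("Predicted value for x=7:", reg.predict([[7.0]])[0])

# Classification: the target is a discrete label (e.g., "small" vs "large").
y_classification = ["small", "small", "small", "large", "large", "large"]
clf = LogisticRegression().fit(X, y_classification)
print("Predicted label for x=7:", clf.predict([[7.0]])[0])
```

In both cases the supervisor has already provided the labels or targets; the only difference lies in whether the output is a number or a category.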
1.2.3 Unsupervised Learning
In unsupervised learning we have no supervisor and no labeled training data available to us. To put it another way, none of the data we have is labeled in any manner. The purpose of this endeavor is to unearth a structure that is hidden inside these data. The absence of labels might be attributed to a number of different reasons: there may be no money available to pay for human labeling, or it may be due to the basic nature of the data itself. Because there are so many different devices that can collect data, the rate at which data is collected has reached an all-time high. The variety, velocity, and volume of Big Data are the parameters considered in its evaluation and assessment. If you want to learn anything from such data, you must do so without a supervisor. This is the challenge that everyone who works in the field of machine learning must contend with. The dilemma that a practitioner of machine learning faces is, in some respects, analogous to a scene in Alice's Adventures in Wonderland, published in 1865.
In this scene Alice, who is looking for somewhere to go, strikes up a conversation with the Cheshire Cat before continuing her journey. She asks the Cat to advise her on the path she ought to take from that point forward. "That depends a great deal on the destination you have in mind," the Cat replies. Alice answers that she does not really care where, "so long as I get somewhere." The Cat observes, "Oh, you are sure to do that, if you only walk for a sufficient amount of time."
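As a concrete illustration of discovering hidden structure without a supervisor, the following sketch (assuming scikit-learn and NumPy are available, with synthetic data generated purely for the example) applies k-means clustering to unlabeled points and lets the algorithm find the groups on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points, with no label vector Y at all.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# Ask the algorithm to discover 2 clusters hidden in the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments for the first five points:", kmeans.labels_[:5])
print("Discovered cluster centres:\n", kmeans.cluster_centers_)
```

No supervisor told the algorithm what the groups mean; it only exposed the structure that was already present in the data.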
1.2.4 Semi-Supervised Learning
In this kind of learning, the data presented are a mix of labeled and unlabeled information. This combination of labeled and unlabeled examples is used to generate an appropriate model for the classification of data. As noted above in the definition of unsupervised learning, most of the time there is a dearth of labeled data and an abundance of unlabeled data. The aim is to acquire a model that can predict the classes of future test data with a higher degree of accuracy than a model produced from the labeled data alone.
The labeled portion of the data is provided by a supervisor; for example, a father may teach his children the names (labels) of objects by pointing at them and saying their names. Within the scope of this study, no further detail on the semi-supervised learning process will be provided.
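Although semi-supervised learning is not discussed further here, a minimal sketch may make the idea concrete. It assumes scikit-learn is available and uses its LabelPropagation estimator, which expects unlabeled examples to be marked with -1; the choice of dataset and the fraction of hidden labels are arbitrary, for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelPropagation

X, y = load_iris(return_X_y=True)

# Pretend most labels are missing: scikit-learn marks unlabeled points with -1.
rng = np.random.default_rng(0)
y_partial = np.copy(y)
unlabeled_mask = rng.random(len(y)) < 0.8   # hide roughly 80% of the labels
y_partial[unlabeled_mask] = -1

# The model learns from the few labels plus the structure of the unlabeled data.
model = LabelPropagation().fit(X, y_partial)
print("Accuracy on the originally hidden points:",
      (model.transduction_[unlabeled_mask] == y[unlabeled_mask]).mean())
```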
1.2.5 Reinforcement Learning
The objective of the reinforcement learning strategy is to use the observations obtained through interaction with the environment to choose actions that either maximize the reward or minimize the risk. The process of reinforcement learning comprises a number of stages, the culmination of which is the generation of intelligent programs, sometimes referred to as agents (a minimal sketch of this loop is given at the end of this subsection):
1. The agent monitors the state of the input. Using its decision-making function (policy), the agent chooses and carries out a certain action.
2. Once the action has been completed, the environment presents the agent with a reward or reinforcement to guide its future behavior. A record of the information on the state-action pair and its reward is preserved.
Using the stored information, the policy for a certain state can be fine-tuned in terms of actions, which supports our agent in making the most effective judgments conceivable. Given the scope of this inquiry, there will be no additional investigation into the concept of reinforcement learning.
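Before leaving the topic, the loop described above can be made concrete with a minimal sketch of tabular Q-learning on a made-up corridor environment (plain Python, no external libraries; the environment, reward values, and parameter settings are invented purely for illustration).

```python
import random

# A toy environment: states 0..4 in a corridor; reaching state 4 gives reward 1.
N_STATES, ACTIONS = 5, ["left", "right"]

def step(state, action):
    next_state = max(0, state - 1) if action == "left" else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

# Q-table: the agent's stored knowledge about each state-action pair.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # 1. Observe the state and pick an action (explore occasionally).
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        # 2. The environment returns a reward; update the stored state-action value.
        next_state, reward, done = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print("Learned preference in state 0:", max(ACTIONS, key=lambda a: Q[(0, a)]))
```

After training, the stored state-action values let the agent choose "right" from the starting state, which is exactly the fine-tuning of the policy described above.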
To establish whether or not the model learned by a machine learning algorithm is of good quality, it is necessary to perform both validation and assessment. Before we go into these two important terms, however, it is worth recalling the writings of the great philosopher Plato that are connected to this aspect of the subject matter. Box 1.1 presents an excerpt that introduces this intriguing argument.
In addition, it is essential for Plato to take into consideration the several conditions that might result in the abandoning of true beliefs. It is possible that, in order to fulfill a variety of expectations of stability, several reliability requirements will have to be met. If, for instance, I have a strong opinion that these animals are sheep, and they are sheep, then my conviction about these animals is valid, and it does not matter that I do not know what traits make them sheep. If, on the other hand, I am unable to discriminate between sheep and goats and do not comprehend why these animals are categorized as sheep rather than goats, then my lack of understanding would become an issue were I to encounter goats.
"counterfactual reliability," which is the tendency to be correct under scenarios that are
counterfactual and not necessarily empirically realistic, then the fact that I am unable
to differentiate between sheep and goats renders my view that animals with specific
appearances are sheep unreliable according to the definition of "counterfactual
reliability."
By declaring that my belief about sheep is counterfactually unreliable, we are demonstrating that my reasoning for believing that these things are sheep is flawed. The fact that the inaccuracy does not influence my judgment in actual scenarios does not change this. When Plato describes a certain belief as "wandering," he is presenting a fault that we would be able to notice more clearly if it were presented in a different manner. If I identify sheep based on qualities that do not separate them from goats, then I am depending on erroneous conceptions in order to arrive at the accurate conclusion that "this is a sheep" in an environment where goats are not present.
If I use the same methods to identify sheep in an environment that also includes goats, I will often arrive at the incorrect conclusion that "this is a sheep" when I come across a goat. We might wish to describe these facts by referring to three different things:
(1) the true token belief that "this is a sheep" (which is applied to a sheep in the first environment),
(2) the false token belief that "this is a sheep" (which is applied to a goat in the second environment), and
(3) the false general principle that I use to identify sheep in both environments.
Models that fit their training data too closely may appear to deliver the highest performance on that data, but on new, unseen data they will fail miserably; this phenomenon is known as overfitting. Because this is a common problem, the labeled data is often divided into two parts in order to detect and avoid overfitting.
A training set is employed to develop the model, and a testing set is used to validate the model that has been built. The assumption is that a part of the data will be held back for testing purposes; this is called holdout testing and validation. The greater quantity of the data is used for training the model, and the test metrics of the model are assessed using the data that is withheld. Cross-validation is a technique that can be of great help in cases where the available labeled dataset is fairly restricted and it would be impractical to retain a section of the data exclusively for the purpose of validation.
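A minimal sketch of both ideas, assuming scikit-learn is available and using one of its bundled datasets purely as an example, is given below: a holdout split for testing, and k-fold cross-validation for the case where the labeled data is too small to hold any of it back.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Holdout validation: keep part of the labeled data aside purely for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: useful when the labeled dataset is too small to hold data back.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("5-fold cross-validation accuracy:", scores.mean())
```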
1.3 APPLICATIONS OF MACHINE LEARNING
Machine learning has proved to be the answer to a significant number of challenges experienced in the real world, yet there are still a number of problems that await a breakthrough in machine learning. Bill Gates, the cofounder and former chairman of Microsoft, once articulated this need himself. In the following sections we discuss a number of different applications of machine learning and present examples to illustrate the points we are making.
1.3.1 Automatic Recognition of Handwritten Postal Codes
In today's world we connect with one another through a broad variety of digital devices. Postal services, however, still exist, and they make it possible for us to send our mail, gifts, and important documents to the precise place we specify. The United States Postal Service offers a good example of the ways in which machine learning has been beneficial to this sector of the economy. In the 1960s the Postal Service successfully used computers to automatically read the city, state, and ZIP code line of printed addresses in order to sort mail. In doing so, it was able to take advantage of the enormous potential that machine learning offered.
Optical character recognition (OCR) technology was able to read the postal address with precision thanks to the assistance of a machine learning system. Images containing written, scribbled, or printed text are readable by human beings; for such textual information to be made readable by computers, optical character recognition technology is required. A scanned document stored in an image format such as a bitmap is nothing more than a picture of the text.
OCR software examines the image and attempts to identify every alphabetic letter and every numeric digit. When it successfully identifies a character, it converts the character into machine-encoded text. This machine-encoded text can be electronically edited, searched, and compressed, and it can also be used as input for applications such as text mining, text-to-speech, and automatic translation. An accurate optical character recognition system makes the process of data entry simpler, more effective, and more cost-effective.
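As a rough illustration of how OCR is typically invoked in practice, the following sketch uses the third-party pytesseract wrapper around the open-source Tesseract engine (both must be installed separately); the image file name is hypothetical. This is not the Postal Service's system, merely a small example of converting a scanned image into machine-encoded text.

```python
from PIL import Image          # pip install pillow
import pytesseract             # pip install pytesseract (requires the Tesseract engine)

# A scanned envelope or document stored as a bitmap image (hypothetical file name).
scanned_page = Image.open("scanned_envelope.png")

# Tesseract tries to identify each letter and digit and returns machine-encoded text,
# which can then be searched, edited, or fed into text mining and translation tools.
recognized_text = pytesseract.image_to_string(scanned_page)
print(recognized_text)
```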
By the year 1900 the United States Postal Service was processing 7.1 billion pieces of mail every year, without automobiles or any other form of high technology. In 2006 the United States Postal Service sorted and delivered more than 213 billion individual pieces of mail, more than any previous postal administration in the history of the world and about forty percent of the total mail volume of the whole planet. The Postal Service delivers this remarkable and unprecedented service through various forms of automation. The use of optical character recognition (OCR) has brought about a new revolution in the effectiveness of the postal system.
A connection between the physical mail and the information system responsible for guiding it to its designated place was established with the assistance of the optical character recognition camera shown in Figure 1.3. The improved OCR technology available today, in combination with other mail processing services, has the potential to raise the efficiency of postal systems in a number of countries. A statement on the website of the United States Postal Service reads, "the Postal Service is the world leader in optical character recognition technology."
The reason is that machines are able to read around 98 percent of all hand-addressed letter mail and 99.5 percent of machine-printed mail. Google has recently introduced a free tool that converts images of text into text documents, with support for over 200 languages across more than 25 writing systems. Google accomplished this by using hidden Markov models and by handling the input as a full sequence, rather than first trying to break it down into its component parts. A list of the languages supported by the service can be found on the Google website. It is not difficult to comprehend the challenges associated with carrying out such an accomplishment; the identification of the language alone is a clear difficulty.
The assumption that the language of the document to be processed is already known no longer holds. If the language is identified inaccurately, we should expect poor performance from the optical character recognition system. Optical character recognition is an example of an application of pattern recognition, which is a branch of machine learning. The major goal of pattern recognition is the identification of patterns and regularities within data. The data may be represented as text, voice, and/or images; in the case of optical character recognition, the input data is an image. Another example of the use of pattern recognition on visual data is the field of computer-aided diagnosis, some applications of which are discussed in Section 1.3.2.
1.3.2 Computer-Aided Diagnosis
Pattern recognition algorithms used in computer-aided diagnosis have the potential to support medical practitioners in interpreting medical images in a relatively short period of time. Data describing the condition of a patient is derived from medical images gathered through a variety of medical procedures, including X-rays, magnetic resonance imaging (MRI), and ultrasound.
The output of these medical tests, in the form of a digital image, is the responsibility of a radiologist, who assesses and evaluates the findings. Because of the short amount of time available, it is essential for the radiologist to have the assistance of a machine. To identify structures in an image that may be considered concerning, a computer-aided diagnosis system makes use of pattern recognition techniques developed from machine learning. In what specific ways might an algorithm recognize a potentially dangerous structure? Supervised learning is needed to accomplish this task.
A machine learning technique such as a Bayesian classifier, an artificial neural network, a radial basis function network, or a support vector machine receives a few thousand labeled images. The classifier that is developed is then expected to categorize freshly collected medical images accurately. If the machine learning algorithm makes an error in its diagnosis, the repercussions for a family can be quite severe.
Not only does such an inaccuracy put a person's life at risk, it also has the potential to do financial harm to the individual. Here are two examples of such circumstances:
1. Suppose our classifier diagnoses breast cancer in a patient who did not, in fact, suffer from the ailment in the first place. Because of the result produced by the classifier, the patient will very probably be subjected to psychological situations that are damaging to their well-being. The patient may also suffer financial losses as a result of further tests, possibly performed more than once, that are designed to confirm the findings of the classifier.
2. Suppose instead that our classification system is unable to establish the presence of breast cancer in a patient who is, in reality, afflicted with the condition. This will lead to the patient obtaining medical treatment that is not suitable for them, which may put their life in danger in the near future or further down the road. To prevent mistakes of this kind, it is not recommended to replace the doctor entirely with technology; rather, the process ought to benefit from the use of technology. A physician, generally a radiologist, remains ultimately responsible for the interpretation of a medical image.
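To make the classification step described above concrete, the following sketch trains a support vector machine on a small labeled image dataset. It assumes scikit-learn is available and uses its bundled digit images purely as a stand-in for labeled medical scans; a real computer-aided diagnosis system would involve far more careful feature extraction and validation.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data: small labeled images (here, digits) in place of labeled medical scans.
images, labels = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.3, random_state=0)

# A support vector machine learns from a few thousand labeled examples...
classifier = SVC(kernel="rbf", gamma=0.001).fit(X_train, y_train)

# ...and is then asked to categorize freshly collected, previously unseen images.
print("Accuracy on unseen images:", classifier.score(X_test, y_test))
```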
Computer-aided diagnosis is now supporting radiologists and other medical professionals in the process of diagnosing a wide range of health problems.
Beyond perceiving its surroundings, we want a robot to be able to observe and react appropriately to changes in the environment. Images may be collected using the cameras installed in a robot, but the cameras alone will not help the robot identify or understand a particular picture. What kinds of learning can a robot perform through pattern recognition? Consider the example of RoboCup, the "Robot Soccer World Cup," a robotic soccer competition that is contested on a global scale. The project has been given a formal objective that is very challenging to accomplish. It is written in the following manner:
"By the middle of the 21st century, a team of fully autonomous humanoid robot soccer players shall win a soccer game, complying with the official rules of FIFA, against the winner of the most recent World Cup." There were 175 intelligent sports robot teams from 47 different countries competing at RoboCup 2015, which was hosted in China. During the tournament the United States team competed in the adult-size category, the largest category overall. The University of Pennsylvania defeated the Iranian team by a very tough score of 5 goals to 4 (Figure 1.4).
Figure 1.4 US and Iranian robot teams competing in the RoboCup 2014 final.
To advance to the next phase of the tournament, the autonomous robots are expected to collaborate with the other members of their team, who are also robots, in an unfamiliar and dynamic environment. They must be able to categorize objects and recognize actions in order to fulfill their responsibilities, and they can carry out these tasks because of the input they receive from their cameras. Each of these tasks belongs to the pattern recognition domain, a branch of machine learning.
The evolution of computer vision technology has made it possible for machine vision to be used in a range of applications, including autonomous vehicles, which are already capable of operating without drivers. There is, quite obviously, a continual competition within the industry to manufacture autonomous cars that can take to the roads as quickly as possible. According to the BBC article titled "Toyota promises driverless cars on roads by 2020," a number of competitors are jumping on the bandwagon and declaring their objectives for the development of driverless cars.
According to the article, Toyota is the most recent automotive company to move forward with plans for an autonomous vehicle, presenting a new challenge to companies based in Silicon Valley such as Google, Cruise, and Tesla. Earlier in the previous week, General Motors announced that it would be providing autonomous rides to staff working at its research and development campus in Warren, Michigan. Nissan has committed to bringing an autonomous car to the Japanese market as early as 2016, and Google has already started testing its driverless cars on urban streets in the United States.
Figure 1.5 Toyota tested its self-driving Highway Teammate car on a public road.
Elon Musk, the chief executive officer of Tesla, also said in July that his company was "almost ready" to make its cars capable of driving themselves on major roads and parking themselves in parallel. How are these cars going to accomplish this task? The BBC report explains Toyota's approach as follows: according to the manufacturer, the car "makes use of multiple external sensors to identify nearby vehicles and hazards, and it chooses appropriate routes and lanes based on the destination." The car then "automatically operates the steering wheel, accelerator, and brakes" in order to drive in a manner analogous to the way a person would drive. The use of these data inputs is what makes this possible.
Toyota's test of its self-driving Highway Teammate car on a public road is shown in Figure 1.5. The applications that currently use computer vision-related technology, as well as those that will use it in the future, are quite sensitive in nature. If an accident involving an autonomous vehicle occurs, a family or families may experience a devastating loss as a result. The use of computer vision technology in drones is a similarly sensitive topic: the algorithms responsible for the vision of drones used in combat have the potential to harm innocent people if they do not perform correctly.
With the advent of smartphones and security cameras, the rate at which images are being generated has reached a level never seen before. Matching a photograph of a face to the identity that corresponds to that face is one of the many challenges that face recognition presents. Constructing a classifier for this task is not a straightforward effort, since there is an excessive number of classes involved, each with its own image-related challenges. Face recognition may help security organizations use large amounts of data from a number of sources to automatically locate something that would be very challenging for humans to find manually.
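A rough sketch of the matching step, assuming the third-party face_recognition library is installed and using hypothetical image file names, looks like this: each face is reduced to a numerical encoding, and two encodings are then compared.

```python
import face_recognition   # pip install face_recognition

# Encode a known face and an unknown face as numerical feature vectors
# (the file names here are hypothetical examples).
known_image = face_recognition.load_image_file("known_person.jpg")
unknown_image = face_recognition.load_image_file("camera_frame.jpg")
known_encoding = face_recognition.face_encodings(known_image)[0]
unknown_encoding = face_recognition.face_encodings(unknown_image)[0]

# Compare the two encodings; True means the faces likely belong to the same identity.
match = face_recognition.compare_faces([known_encoding], unknown_encoding)
print("Same person?", match[0])
```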
The field of speech recognition seeks to develop methods and technologies that enable computers to comprehend spoken language and translate it into text; this is the ultimate objective of the field. Stenography, sometimes known as writing in shorthand, is consequently no longer required. The practice of mechanically translating spoken words into written text has been used in a number of industries, including video captioning and court reporting, and this technology can also be beneficial to people with disabilities. Over time, speech recognition engines are achieving ever higher levels of accuracy. Voice-controlled systems such as Apple's Siri, Google Now, Amazon's Alexa, and Microsoft's Cortana certainly do not always recognize human speech correctly, but there is a strong probability that this will change in the not-too-distant future.
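As a small illustration, the following sketch uses the third-party SpeechRecognition package (with a hypothetical WAV file and an external speech-to-text service, so an internet connection is assumed) to transcribe a recording into text.

```python
import speech_recognition as sr   # pip install SpeechRecognition

recognizer = sr.Recognizer()

# A hypothetical recording of spoken words stored as a WAV file.
with sr.AudioFile("dictation.wav") as source:
    audio = recognizer.record(source)

# Send the audio to a speech-to-text engine and print the transcribed text.
try:
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("The engine could not understand the recording.")
```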
For the most part, the examples we have examined up to this point involve the use of voice or visual data for the purpose of learning. Data in the form of text is yet another source of knowledge that can be used for learning. Careful investigation has shown that the vast bulk of the information relevant to a company is stored in text format, and this unstructured data or text presented a challenge that needed to be conquered. According to an IBM journal, the first definition of a business intelligence system described it as the use of data-processing technology for automatically extracting and encoding documents, and for building interest profiles for each of the "action points" at which an organization operates.
Documents developed internally, as well as those received from other sources, are automatically abstracted, characterized by a word pattern, and delivered automatically to the appropriate action points. Researchers also have access to a substantial amount of written, unstructured data through social media, another forum for the dissemination of information. On social media platforms we can observe the growth of text data at a level that has never been witnessed before. Because human experiences are transmitted in the form of text, many stakeholders, including enterprises, have been given the opportunity to investigate and make use of these experiences with the intention of reaching favorable results. A wide variety of applications can potentially benefit from text mining, including:
Business intelligence
National security
Life sciences
Sentiment classification
Automated placement of advertisements
Automated classification of news articles
Social media monitoring
Spam filtering
Determining the identity of the writer: using a known corpus of handwritten documents, a classifier can be developed that assigns a document to an author based on numerous features. The detection of different writing styles is one of the challenges that can be addressed through text mining.
We search for traits that are connected with a certain author by using well-known works attributed to that author; these documents are used to investigate the features. With the help of these features, a classifier can be built that determines whether or not a particular document belongs to the author.
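A minimal sketch of such an author-identification classifier, assuming scikit-learn is available and using a tiny invented corpus purely for illustration, converts each document into TF-IDF word-usage features and trains a classifier to map those features to an author.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny hypothetical corpus of documents with known authors.
documents = [
    "The whale surfaced beneath a pale and silent moon.",
    "Call me when the harpoon lines are coiled and ready.",
    "It is a truth universally acknowledged that a ball requires dancing.",
    "She walked the garden path, wondering about the invitation.",
]
authors = ["author_A", "author_A", "author_B", "author_B"]

# TF-IDF features capture word-usage habits; the classifier maps them to an author.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(documents, authors)

print(model.predict(["The moon hung silent over the whale road."]))
```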
The two classifiers could also be combined to produce a new classifier with higher performance for author identification. A further area in which such data could help solve a problem is the identification of undesirable material contained within a video.
To establish which types of material are not acceptable, two distinct approaches can be taken to the challenge. First, by using video frames and applying machine learning algorithms to the image data, a model can be developed that is capable of identifying potentially dangerous material contained within a video frame.
Second, comments from social media platforms that are linked to the video can be used to gain knowledge of the content of the video, by constructing a model that detects whether or not the video includes unwanted material. As noted earlier, the two classifiers can then be combined into a single system component in order to improve the overall performance of the system.
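One simple way to combine the two classifiers, sketched below under the assumption that scikit-learn and NumPy are available and using tiny invented data, is to train one model on frame-level features and another on the associated comments, and then average their predicted probabilities.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one feature view from video frame statistics,
# one view made of the social-media comments attached to the same videos.
frame_features = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
comments = ["totally harmless clip", "fine for kids",
            "graphic and violent scene", "disturbing content, do not watch"]
labels = np.array([0, 0, 1, 1])   # 0 = acceptable, 1 = unwanted material

image_clf = RandomForestClassifier(random_state=0).fit(frame_features, labels)
vectorizer = TfidfVectorizer().fit(comments)
text_clf = LogisticRegression().fit(vectorizer.transform(comments), labels)

# Combine the two classifiers by averaging their predicted probabilities.
new_frames = np.array([[0.15, 0.85]])
new_comments = vectorizer.transform(["really violent, flag this video"])
combined = (image_clf.predict_proba(new_frames) +
            text_clf.predict_proba(new_comments)) / 2.0
print("Probability the video contains unwanted material:", combined[0, 1])
```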
As mentioned earlier, the question of whether a computer is capable of thinking was first asked by Alan Turing, a British mathematician, in 1950, and this inquiry served as the starting point for the development of artificial intelligence. He proposed a test that would assess a machine's performance in terms of its intelligence-related skills. In 2014 a chatbot was able to pass this Turing test (for more details, please refer to Box 1.2). A chatbot is a computer program that simulates an intelligent conversation with one or more human users.
The conversation may be carried out using either audio or text communication; both options are available. Box 1.3 describes a fascinating event from 2015 in which a member of the panel of judges for the annual Loebner Prize discussed the flaws of chatbots. Additionally, Appendix I contains a full transcript of the discussion that took place between one of the judges and the chatbot that was awarded the Loebner Prize in 2015.
Reading the transcript helps readers grasp how chatbots strive to avoid answering difficult questions when confronted with them. Researchers at Google have built a sophisticated form of chatbot that is able to learn from training data comprising samples from a variety of conversation scenarios. A movie transcript dataset and an IT helpdesk troubleshooting repository were the two datasets used for training.
BOX 1.2 TURING TEST PASSED BY CHATBOT NAMED EUGENE
"Can machines think?", the well-known question posed by Alan Turing in 1950 in the form of an imitation game, served as the impetus for the development of the Turing test. Turing, a mathematician and code-breaker, was a prolific figure of the twentieth century. The purpose of the test is to establish whether people can distinguish between human and machine conversation partners through keyboard conversation alone. A machine is judged to have passed the test if it is mistaken for a human being more than thirty percent of the time over a series of five-minute keyboard conversations. On 7 June 2014, at the illustrious Royal Society in London, the test was passed by a computer program dubbed Eugene Goostman, the first time the Turing test had been passed. Out of a total of thirty human judges, Eugene succeeded in convincing a third of them that it was a human being.
A story published in the technology section of the BBC website, titled "AI bots try to fool human judges," reported on the live broadcast of the 2015 Loebner Prize. One of the judges at the event was Rory Cellan-Jones, a technology journalist for the BBC, who was entrusted with judging the intelligence exhibited by a chatbot. The full text of his conversation with the chatbot Rose, the recipient of the prize in 2015, is available on the BBC website. After the interaction was complete, Rory made the following remarks:
I get the sense that I am a psychiatrist who has just spent two hours delving into the most intimate thoughts of four separate couples. Serving as a judge for the Loebner Prize has encouraged me to contemplate the dynamics of conversation and the attributes that distinguish a good conversationalist. I quickly became aware of a basic approach for identifying the bot, which was to behave like a chaotic human conversationalist. Inquiries that were rather straightforward, such as "where do you live," "what do you do," and "how did you get here," were within the scope of the bots' skills.
On the other hand, as soon as I started asking about things like how to deal with slugs in your garden or how much homes cost in London, they immediately fell apart. Their strategy was to try to move the conversation in a new direction while ignoring what I was saying. As a result, it took me no more than two or three questions to decide which of the two was the bot and which was the human. After careful reflection, I have come to the conclusion that it will be quite some time before a computer is able to pass the Turing test. To put it another way, humans are still a great deal more interesting to talk to.
Google's researchers trained their conversational assistant using a language model built on recurrent neural networks. This means the replies are not pre-programmed responses delivered after recognizing certain patterns in human conversation; rather, they are generated by the model itself. Some of the fascinating and inventive replies offered by Google's chatbot can be found in the research publication titled "A neural conversational model" [10]. The researchers acknowledged in their report that the chatbot was unable to carry on a true conversation at the time, and so it could not be said to pass the Turing test. What came as a surprise, however, was that it produced proper answers to a broad range of questions without adhering to any hand-crafted rules.
There is still a significant distance to go before the dream of producing computers that seem to possess the same level of intelligence as people is realized. A smart machine is, in general, an intelligent system that uses hardware such as sensors, RFID tags and a Wi-Fi or cellular communications connection to gather data, analyses that data, and then makes decisions based on its interpretation of the data. Smart machines use machine learning algorithms to perform tasks traditionally carried out by humans, enhancing efficiency and productivity in the process. Gartner, Inc. is a research and consulting firm that specializes in information technology (IT) and offers technology-related insights to chief information officers (CIOs) and senior IT professionals.
The company is headquartered in Stamford, Connecticut, in the United States of America. Gartner disseminates its findings through a number of channels, one of which is its symposiums; the Gartner Symposium/ITxpo brings together hundreds of chief information officers from a wide variety of sectors. Gartner has put forward two prerequisites for a true intelligent machine. Two conditions must be met for a machine to be considered really intelligent:
Figure 1.6 The top 10 strategic technologies in years 2014 and 2015.
First, a smart machine is able to do something that was previously thought to be impossible for any machine because of the intelligence it requires. By this criterion, a drone that is capable of delivering packages, something that Amazon is now contemplating, would be regarded as a smart machine.
Second, the machine is able to learn new things on its own. Against this second requirement, the delivery drone does not fulfil the standard needed to be considered truly smart.
Even so, the same delivery drone, however limited its intelligence, might still have a significant influence on the amount of work completed and the number of people employed in the shipping industry. Gartner, Inc. named smart machines as one of the top 10 strategic technologies and trends for businesses in both 2014 and 2015. In the projection for 2014, 3D printers and smart machines were placed in the category of new technologies expected to create disruption in the future; in the prediction for 2015, intelligent devices appeared again, this time in the domain of intelligence that is present everywhere (Figure 1.6).
Gartner made a number of observations in its 2014 and 2015 forecasts regarding smart machines, including the following:
By 2015, there would be more than forty vendors offering commercially available managed services that make use of industrialized services and smart machines.
By 2018, the total cost of ownership for business operations would be reduced by thirty per cent as a consequence of industrialized services and smart machines.
By 2020, the era of smart machines would flourish, driven by contextually aware, intelligent personal assistants, smart advisers (such as IBM Watson), advanced global industrial systems, and the availability of early examples of autonomous cars to the general public. The era of smart machines is expected to be the most disruptive in the history of information technology.
In a recent report titled Cool Vendors in Smart Machines, 2014, Gartner included three examples of well-known smart machines: Apple's Siri, Google Now, and IBM's Watson platform. Later in this chapter we will discuss a number of the intelligent machines mentioned in the predictions above; before doing so, we turn to Deep Blue, a chess-playing computer built by IBM.
IBM's Deep Blue made history in May 1997 when it became the first computer system to win a match against Garry Kasparov, the reigning chess world champion at the time. Thanks to the sophisticated technology it was equipped with and its enormous processing power, Deep Blue could examine 200 million positions every second. In other words, the 259th most powerful supercomputer of 1997 defeated the human world chess champion, a quite astonishing achievement for those working in the field of artificial intelligence.
Why was Deep Blue able to evaluate the situation on the chess board so precisely? The answer is that Deep Blue's evaluation function was initially designed in a general fashion, with many parameters left undetermined (for instance, how important a safe king position is compared with a space advantage in the centre). The values reflecting the ideal settings of these parameters were then determined by the system itself, by analysing thousands of master games. The evaluation process had been decomposed into eight thousand distinct components, the bulk of which were designed for positions requiring specific expertise. More than seven hundred thousand grandmaster games and more than four thousand positions were considered in the preliminary tuning, and the endgame database compiled over the course of development included a significant number of six-piece endgames and positions with five or fewer pieces.
Deep Blue was a supercomputer of the 1990s, developed with the express purpose of competing against human beings. Today, the main focus of computer-chess research is on improving the efficiency of the software so that less powerful hardware can do the job. In 2006, the program Deep Fritz faced Vladimir Kramnik, the reigning world chess champion, in a world championship match; the software ran on a personal computer equipped with two Intel Core 2 Duo processors. The program could analyse only 8 million positions per second, a capacity significantly lower than Deep Blue's 200 million positions per second.
IBM's Watson, named in honour of Thomas J. Watson, the first Chief Executive Officer of IBM, is a remarkable piece of software capable of delivering replies to questions asked in natural language. Whether you call it a supercomputer, a cognitive computing system, or simply a question-answering system, IBM Watson is arguably the best-known example of artificial intelligence employed in the modern world. Watson was brought to the attention of people around the world when it won first prize on the game show "Jeopardy!". With its capabilities in advanced computing and artificial intelligence, Watson powers a wide range of practical applications and can benefit many different industries, including the retail sector, the legal sector, the financial sector and the healthcare business.
Among Google's creations, the product known as "Google Now" is yet another noteworthy accomplishment in the field of machine learning. It is an intelligent personal assistant that can make suggestions, respond to queries, and carry out actions by delegating particular requests to a selection of web services. Using voice commands, users of this program can obtain answers to trivia questions and create reminders for themselves. The software is proactive: it analyses the search patterns of its users, uses those patterns to generate predictions about the information that may be beneficial to them, and then offers that information to the users accordingly.
Siri, the intelligent personal assistant developed by Apple Inc., is becoming more and more popular. Siri is an acronym that stands for Speech Interpretation and Recognition Interface. Siri can communicate in a variety of languages, including Arabic, English, Spanish, French, German, Italian, Japanese, Korean, Mandarin, Russian and Turkish, and, as with every other personal assistant, the capabilities of its responses are continually improved through updates. Recognizing the relevance of context is of the highest importance. Take, for example, the scenario in which a terrorist tells Siri that he wants to blow up a particular restaurant: instead of presenting a map of the restaurant, Siri could respond by notifying a centre that specializes in the prevention of terrorist acts of the plot.
CHAPTER 2
2.1 INTRODUCTION
To someone who is not familiar with machine learning, the process by which computers learn can appear quite strange. The idea that a computer is capable of thinking and behaving intelligently may seem as enticing as a story from a science fiction or fantasy novel. On closer investigation, however, it turns out to be less exceptional than it appears at first glance. At its core, the learning process rests on three elements:
1. Data Input
2. Abstraction
3. Generalization
Although we have already studied these characteristics in detail, let us quickly refresh our memory with an example. The situation described here is entirely fictitious. Based on information obtained by the detective department of the New City Police Department, it has been established that a criminal intends to launch an assault on the primary candidate during a campaign gathering for the next election. The identity of the person is unknown, however, and it is quite realistic to assume that the person will try to disguise themselves in some way. All that can be said with complete certainty is that the person in question is a history-sheeter, a criminal with a long record of serious offences.
A criminal database has been used to construct a list of persons who have committed comparable offences, together with photographs of those individuals. The investigating agency also has access to images captured by security cameras installed at various locations in close proximity to the gathering. To identify the potential perpetrator of the attack, they need to match the images from the criminal database against the faces of the individuals present at the gathering.
In the real world, however, exact matching is not feasible. Because of the vast number of criminals housed in the database, the number of images is likely to run into the hundreds, if not thousands, and it would be impossible to look at all of them and commit them to memory. To make matters worse, a perfect match is unlikely because the perpetrator will probably arrive disguised. The strategy to be used in this scenario is therefore to match the images against a smaller set of significant physical features, such as the curve of the jaw, the slope of the forehead, the size of the eyes and the structure of the ear. The images collected from the criminal database thus constitute the input data.
From this input, important features can be abstracted. Generalizing over the abstracted, feature-based data is a clever strategy for finding potentially criminal faces in the gathering, because human matching of each and every photo would quickly lead to visual and mental exhaustion. Using the abstracted features, suppose it is observed that the majority of criminals have a shorter distance between the inner corners of the eyes, a smaller angle between the nose and the corners of the lips, a higher curve of the upper lip, and so on. A face in the gathering can then be classified as "potentially criminal" depending on whether or not it matches these general findings.
All of this is done to determine whether a face in the gathering belongs to a potential criminal. Abstraction is an essential element of the learning process because it presents the raw input in a summarized and organized form, which makes it possible to analyse the data and generate valuable insights from it. The word "model" refers to this structured representation of the raw data, cast into a meaningful pattern for analysis. A model may take a number of different forms: a mathematical equation, a tree structure or graph, a computational block, or something else depending on the circumstances. It is the responsibility of the learning task to decide which model should be selected for a given data set, based on the information available.
The problem to be solved, in addition to the type of data, is taken into consideration when making this selection. For example, a regression model is assigned when the problem at hand is one of prediction and the target field is numerical and continuous. The process of assigning a model and fitting it to a data set is referred to as model training, also known as "model fitting."
After the training procedure has been completed, the raw input data has been abstracted and summarized. Through abstraction, however, the learner has only summarized what it has seen; because the summary still contains a great deal of feature-based data and interrelationships, it may remain extremely detailed, and it is quite difficult to obtain actionable ideas from such a large amount of information. This is where generalization becomes a significant factor. The goal of generalization is to comb through the large amount of abstracted material and find a few fundamental conclusions that are manageable and compact in size.
An exhaustive search cannot be carried out, since each of the abstracted findings would need to be analysed separately. The strategy applied instead is a heuristic search, the same "gut-feel" approach used in human learning, which maximizes efficiency. Clearly, heuristics can occasionally produce findings that are wrong; when the outcome is regularly discordant with expectations, the learning is said to be biased.
Points to Ponder:
A machine learning algorithm builds up its capability largely on its own from the data it is given; this is the manner in which the algorithm accomplishes the development of its cognitive capacities.
Someone needs to provide certain non-learnable parameters, also known as hyper-parameters, in machine learning; these are the parameters that cannot be learned from the data. The situation is comparable to a child who is learning something for the first time and needs the help of her parents to establish whether her understanding of the material is correct. Machine learning algorithms could never succeed without these human inputs.
Now that you are familiar with the core learning process and understand model abstraction and generalization in that context, let us try to formalize it within the framework of a motivating example. Continuing the earlier scenario of the prospective assault during the election campaign, the New City Police Department has succeeded in foiling the attempt on the electoral candidate. Nevertheless, the incident acted as a wake-up call, and they now wish to take pre-emptive steps to eliminate any illicit operations taking place in the neighbourhood.
They are interested in determining the pattern of criminal acts that have taken place in the recent past. More precisely, they want to evaluate whether the number of criminal incidents per month is influenced by the average income of the local population, the sales of weapons, the influx of immigrants, and other factors of a similar type. It is therefore essential to establish a relationship between these potentially disruptive factors and the occurrence of criminal activity. In other words, the objective is to develop a model that can infer how the number of criminal incidents changes with the potential influencing factors stated above.
Within the machine learning paradigm, the potential sources of disturbance are treated as input variables: the average income of the local population, the sales of weapons, the influx of immigrants, and so on. A variety of terms may be used to refer to these, including predictors, attributes, features, independent variables, or simply variables. The number of criminal incidents is the output variable, also known as the response variable or dependent variable. The set of input variables is represented by the symbol X, with the individual input variables written as X1, X2, X3, ..., Xn, while the output variable is denoted by Y. The relationship between X and Y can be represented in the general form

Y = f(X) + e,

where 'f' stands for the target function and 'e' stands for a random error term.
Note that, in connection with a machine learning model, a few additional functions are often tracked in addition to the target function. A cost function, also known as an error function, measures the degree to which the model is inaccurate in its assessment of the relationship between X and Y; in other words, it offers insight into how far the model is from working at its optimal level.
One example of a cost function for a regression model is the R-squared statistic, which will be discussed in greater detail later in this chapter. The loss function is closely related to the cost function; the primary difference is that a loss function is usually stated for a single data point, whereas the cost function is defined over the whole training data set. Machine learning is, in essence, an optimization problem: we first propose a model and then adjust its parameters to find the solution that is most appropriate and optimal for the given problem.
We therefore need a method that allows us to decide whether a solution is of good quality, or indeed optimal; this is what the objective function is for. The objective function takes the data, the model and its parameters as inputs and produces a value as its output.
The goal is to identify the values of the model parameters that make the returned value either as high or as low as possible. When the objective is to minimize the value, the term "cost function" becomes synonymous with the objective function. Examples include maximizing the reward function in reinforcement learning, maximizing the posterior probability in Naïve Bayes, and minimizing the squared error in regression.
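The distinction drawn above between a per-point loss and a data-set-level cost, and the mention of R-squared, can be illustrated with a few lines of code. This is a minimal sketch of standard formulas, not a definition taken from the text.

# A per-point loss versus a data-set-level cost, following the distinction above.
import numpy as np

def squared_loss(y_true, y_pred):
    """Loss for a single observation."""
    return (y_true - y_pred) ** 2

def mse_cost(y_true, y_pred):
    """Cost over the whole training set: the mean of the per-point losses."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """R-squared, mentioned above as a cost-style measure for regression models."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot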
The problem we have just discussed is one illustration of the sort of challenge encountered in machine learning. As noted earlier, there are three basic classes of machine learning methodologies, applied to different sorts of problems. They can be restated simply as follows:
1. Supervised
Classification
Regression
2. Unsupervised
Clustering
Association analysis
3. Reinforcement
Different models need to be constructed and trained for each of these scenarios.
When picking the model that will be used to address a machine learning problem, we take a variety of distinct factors into consideration.
The two most important aspects are, first, the kind of problem we intend to solve with machine learning and, second, the characteristics of the data that will serve as the basis for the solution.
The problem may, for example, be associated with the prediction of a class value.
Examples include deciding whether a tumour is malignant or benign, or predicting whether the weather forecast for the following day will be snowy or rainy. Alternatively, the prediction may concern a numerical value, such as what the price of a property will be in the next six months, or the expected growth of a certain information technology stock over the next seven days.
A problem related to the grouping of data might be the identification of the consumer groups that use a certain product, or recognizing the movie genres that have had more success at the box office in the past year, and similar concerns. Consequently, it is quite difficult to make a broad recommendation about the appropriate choice of machine learning model; put differently, there is no single model that is superior to all others for every machine learning task.
The 'No Free Lunch' theorem asserts exactly this. Every learning model attempts to replicate some aspect of the real world and, in doing so, strips away the difficult qualities, resulting in a substantial degree of simplification. These simplifications rest on a variety of assumptions, each highly dependent on the circumstances. Whether the assumptions hold depends on the precise situation, in particular the nature of the data and the problem being addressed.
The same model may therefore produce remarkable outcomes in one set of circumstances and be completely ineffective in another. The conclusion is that, to complete the data exploration described earlier, we must first grasp the characteristics of the data, combine this knowledge with the problem we are attempting to solve, and only then select the model to be used. To begin, let us try to understand the idea behind model selection in a more ordered manner.
Two basic kinds of machine learning algorithms are models for supervised learning, which are largely focused on solving prediction problems, and models for unsupervised learning, which address descriptive problems. As the name makes clear, supervised learning models, or predictive models, aim to predict a certain value by making use of the values contained in an input data set. The learning model attempts to establish, on its own, a connection between the target feature, which is the feature being predicted, and the predictor features. Prediction models therefore place a strong emphasis both on what they are to learn and on how they are to learn it.
Models used for predicting target features of categorical value are referred to as classification models. The targeted attribute is known as a class, and the categories into which it is divided are referred to as levels. Classification models that are gaining more and more popularity include k-Nearest Neighbour (kNN), Naïve Bayes and Decision Tree, to name just a few. Predictive models can also be employed to predict numerical values of the target feature, using the predictor features as the basis for the prediction. Some examples of such cases are given below.
Regression models aim to predict the numerical value of a target feature of a data instance, and they are well known for serving this particular purpose. The linear regression model and the logistic regression model are two of the most prevalent forms of regression models.
Points to Ponder:
Numerical values can be converted into categorical values, and vice versa. When forecasting the increase in stock prices, for example, growth percentages that fall within specific ranges can be represented by category values: growth between 0% and 5% may be labelled "low", between 5% and 10% "moderate", between 10% and 20% "high", and growth exceeding 20% "booming". Conversely, a categorical value can be converted into a numerical one in an analogous way.
In the tumour-identification problem, for example, the label "benign" can be replaced by the value 0 and "malignant" by the value 1. This makes it possible to use the models interchangeably, although that will not always be appropriate; a short sketch of both conversions follows.
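The snippet below mirrors the two conversions described above. The numbers and labels are taken from the text's examples; the use of pandas is simply a convenient, assumed implementation choice.

# Converting a numerical growth percentage into categories, and a categorical
# label into a number, following the examples in the text.
import pandas as pd

growth = pd.Series([2.5, 7.0, 14.0, 35.0])
bands = pd.cut(growth,
               bins=[0, 5, 10, 20, float("inf")],
               labels=["low", "moderate", "high", "booming"])
print(bands.tolist())            # ['low', 'moderate', 'high', 'booming']

tumour = pd.Series(["benign", "malignant", "benign"])
encoded = tumour.map({"benign": 0, "malignant": 1})
print(encoded.tolist())          # [0, 1, 0]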
When selecting a model, a number of important factors need to be taken into consideration. The quantity of training data, for example, is an essential consideration when choosing a model for prediction. Generally speaking, models with low variance, like Naïve Bayes, are expected to perform better when the training data set is quite small, because in such situations it is essential to avoid overfitting the model. Conversely, when there is a considerable quantity of training data, low-bias models such as logistic regression are recommended, since they can represent complicated relationships more accurately.
A select few models, such as Support Vector Machines and Neural Networks, are able to handle both classification and regression problems.
For clustering, the most commonly used model is k-Means. Descriptive models associated with pattern discovery are applied, for example, in market basket analysis of transactional data. Market basket analysis evaluates, from the purchase patterns available in the transactional data, the likelihood of one product being purchased given the purchase of another product.
For instance, transactional data may reveal a pattern suggesting that a consumer who purchases milk usually purchases biscuits at the same time. This can be helpful for purposes such as setting up in-store displays or targeted marketing: promotions relevant to biscuits can be sent to consumers who purchase milk products, and vice versa, or items associated with milk can be placed in close proximity to biscuits within the store.
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
The holdout method is a strategy in which the input data is partitioned into two parts, training data and test data, at the beginning of the process; a demonstration of this method is shown in Figure 2.1. Typically, twenty to thirty per cent of the data is held back as test data to validate the performance of the model, although the input data may be divided between training data and test data in varying proportions. To ensure that the data in both buckets are of a similar kind, the split is carried out completely at random, with random numbers used when assigning data items to the two partitions.
Once the model has been trained with the training data, its target function is used to generate predictions for the labels of the test data. The predicted label value is then compared with the actual label value; this is possible because the test data is a part of the input data whose labels are already known. In most cases, the accuracy with which the model predicts the label value is used as the metric for evaluating the model's performance, as in the sketch below.
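A minimal holdout evaluation is sketched here. The 70/30 split proportion, the breast cancer data set and the Naïve Bayes classifier are illustrative choices, not prescriptions from the text.

# A minimal holdout evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold back 30% of the labelled data as test data; the split is random.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = GaussianNB().fit(X_train, y_train)     # train on the training portion
y_pred = model.predict(X_test)                  # predict labels of the held-out data
print("holdout accuracy:", accuracy_score(y_test, y_pred))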
In some situations, the input data is split into three distinct portions: a training data portion, a test data portion and a validation data portion. The validation data, unlike the test data, is used to assess the performance of the model iteratively, with each iteration serving to refine the model. The test data is used only once, after the model has been tuned and finalized, to assess and report its final performance as a reference for future learning efforts; this ensures the reported figure is as honest as possible.
One drawback of this strategy is that the data of the various classes may not be proportionately distributed between the training data and the test data. The situation becomes considerably more challenging when the proportion of data belonging to certain classes is much lower than that of other classes; even though random sampling is employed to choose the test data, this imbalance can still occur. To a certain extent, the problem can be addressed by employing stratified random sampling instead of simple random sampling.
Stratified random sampling is a method in which the entire data set is divided into a number of separate groups, referred to as strata, and a random sample is then selected from each stratum. This ensures that the random partitions produced contain comparable proportions of each class while remaining random. A minimal illustration follows.
Even with stratified random sampling, the holdout method can still run into difficulties in specific situations. One potential problem is distributing the data of some classes proportionately between the training and test data sets, which can be particularly challenging for smaller data sets. A particular form of the holdout strategy known as repeated holdout is employed in such circumstances to guard against the randomness of any single split.
In the repeated holdout approach, the performance of the model is evaluated using multiple random holdouts, and the average of all the performances is calculated at the end. Because a large number of holdouts are drawn, the training and test data (and the validation data, if drawn) are more likely to contain data representative of all classes and to closely resemble the original input data.
This technique of repeated holdouts forms the basis on which the k-fold cross-validation strategy is built. In k-fold cross-validation, the data set is divided into k random folds that are completely distinct and do not overlap with one another. The overall approach to k-fold cross-validation is depicted in Figure 2.2.
In k-fold cross-validation, the value of 'k' can be set to any integer; however, two choices have become particularly popular:
The 10-fold cross-validation approach is by far the most frequently used. In this approach, each of the 10 folds holds roughly ten per cent of the entire data; one fold at a time is used as the test data to validate the performance of a model trained on the remaining nine folds, which together account for ninety per cent of the data.
This procedure is repeated 10 times, with each of the ten folds serving once as the test data while the remaining folds act as the training data, and the accuracy is measured on the test fold each time. The average performance over all folds is then reported. The detailed procedure for selecting the 'k' folds in k-fold cross-validation is illustrated in Figure 2.3, and a minimal sketch follows.
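The sketch below runs 10-fold cross-validation as described above; the data set and the decision tree classifier are assumed, illustrative choices.

# 10-fold cross-validation: each fold serves once as test data while the other
# nine folds are used for training; the reported figure is the mean accuracy.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("per-fold accuracy:", scores.round(3))
print("mean accuracy    :", scores.mean().round(3))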
As can be seen in the figure, each circle represents a record in the input data set, and each colour represents one of the categories to which the records belong. The complete data set is partitioned into k folds selected at random, a single fold is chosen as the test data set in each iteration, and a different fold is selected in each of the 'k' rounds.
Although the circles in Figure 2.3 correspond to the records in the input data set, the fact that contiguous circles are depicted as a fold does not mean that the fold consists of consecutive records; the representation is virtual rather than physical. As explained above, the records included in a fold are chosen by random sampling.
Fig. 2.2 Overall approach for K-fold cross-validation
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
Fig. 2.3 Detailed approach for selecting the 'k' folds in k-fold cross-validation
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
Leave-one-out cross-validation, commonly referred to as LOOCV, is an extreme variant of k-fold cross-validation in which one record, or data instance, at a time is used as the test data. This is done to maximize the quantity of data used to train the model. Unsurprisingly, the number of iterations required equals the number of data instances in the input data set, so LOOCV is computationally expensive and is therefore not applied very frequently in regular practice.
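A short sketch of LOOCV follows; the small iris data set and the kNN classifier are illustrative assumptions chosen to keep the run inexpensive.

# Leave-one-out cross-validation: one instance is held out per iteration, so the
# number of iterations equals the number of records (costly for large data sets).
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("iterations:", len(scores), " mean accuracy:", scores.mean().round(3))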
Bootstrapping, also known as bootstrap sampling, is another common way of constructing training and test data sets from the input data. It draws random samples using Simple Random Sampling with Replacement (SRSWR), a well-known approach in the field of sampling theory. We saw in the previous section that k-fold cross-validation separates the data into disjoint partitions, ten of them in the case of 10-fold cross-validation.
In that scheme, the data instances from the partition under consideration are used for testing while the remaining partitions are used for training. Bootstrapping, in contrast, picks data instances from the input data set at random with replacement, so the same data instance may be picked more than once. From an input data set containing 'n' data instances, bootstrapping can therefore construct one or more training data sets of 'n' instances in which some instances are repeated several times. The bootstrap sampling approach is presented in condensed form in Figure 2.4. It is particularly useful when the input data set is small, that is, when there are very few data instances.
Fig. 2.4 Bootstrap sampling
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
While the model is being trained, eager learning attempts to construct a generalized target function that does not depend on the particular inputs it will later be asked about. Following the usual machine learning steps of abstraction and generalization, it produces a trained model at the end of the learning phase. The eager learner is therefore already equipped with the model by the time the test data arrives for classification and does not need to look back at the training data. Compared with lazy learners, eager learners spend more time during the training phase.
Lazy learning, by contrast, largely skips these abstraction and generalization steps: the stored training data itself is applied to the test data in order to classify the instances that have not been provided with labels. Because lazy learning uses the training material in its existing form, it is frequently referred to as rote learning, a way of memorizing based on repetition. It is also often called instance learning, because of its strong dependence on the specific instances of training data supplied. Such methods are also known as non-parametric learning.
For lazy learners, it is classification rather than training that is time-consuming: they put in very little effort during the training phase, but assigning a label requires comparing each tuple of test data against the stored instances. The k-nearest neighbour approach is one of the most frequently used lazy learning algorithms.
In parametric learning models, the number of parameters is fixed and limited. Non-parametric models, somewhat contrary to what the name suggests, can have a potentially unlimited number of parameters. In models such as Linear Regression and Support Vector Machines, the learned parameters are a fixed set of coefficients, so these models are considered parametric. In models such as k-Nearest Neighbour (kNN) and decision trees, on the other hand, the effective number of parameters grows with the amount of training data, which is why they are considered non-parametric learning models. A sketch of kNN as a lazy, non-parametric learner follows.
As we have seen, the purpose of supervised machine learning is to learn a target function that identifies the target variable from the set of input variables as accurately as possible. In learning the target function from the training data, the degree of generalization is an essential consideration, because the input data offers only a limited and specialized view, and new, unseen data in the test set may differ considerably from the data used for training. The fitness of a target function approximated by a learning algorithm is the degree to which it can reliably identify a collection of data it has never encountered before.
If the target function is kept in an unduly simplified form, it may fail to reflect the underlying data and miss essential nuances. A typical case of underfitting occurs when a linear model is used to represent non-linear data; the two instances of underfitting depicted in Figure 2.5 illustrate this phenomenon. Underfitting happens rather frequently and is typically caused by the absence of suitable training data. Its consequences are poor performance on the training data and poor generalization to the test data. Underfitting can usually be prevented by using a more expressive model and by providing more, and more representative, training data.
Fig. 2.5 Underfitting and overfitting of a model
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
2.4.2 Overfitting
"Overfitting" refers to a model that has been designed to replicate the training data too closely. In such a situation, the model takes into account every variation found in the training data, including noise and outliers, and as a consequence its performance on the test data suffers. One of the most common causes of overfitting is the attempt to fit an overly sophisticated model to the training data in order to achieve a greater degree of similarity. Figure 2.5 illustrates this idea with an example data set: the target function tries to guarantee that the decision boundary separates all of the training data points exactly, but the unseen test data rarely follows the same pattern, so the target function ends up classifying the evaluated data inaccurately. Overfitting, in short, results in poor generalization and hence poor performance on the test data set, precisely because the model has been tuned to the training data set. Overfitting may be avoided by
using re-sampling strategies such as k-fold cross-validation, holding back a validation data set, and removing (pruning) the nodes that have little or no predictive power for the machine learning problem at hand. Both underfitting and overfitting contribute to poor classification quality, which is reflected in low classification accuracy.
While supervised learning is being used, the class value assigned by the learning model built from the training data may differ from the actual class value. Two distinct types of error can arise during the learning process: errors caused by bias and errors caused by variance. Let us try to understand each of them in more depth.
Errors induced by bias arise from the simplifying assumptions a model makes in order to keep the target function easier to learn and to comprehend; the model then does not fit the data well. Parametric models usually have a high degree of bias, which makes them easier to understand and interpret, and quicker to learn, but they perform poorly on data sets that are complex and do not conform to the simplifying assumptions the algorithm makes. Underfitting results in a substantial amount of bias.
Errors caused by variance arise from differences between the training data sets used to train the model; several different training data sets, chosen at random from the input data, may be employed during training. Ideally, the differences between these data sets would be small, and models trained on different training data sets would not differ much from one another. Overfitting, however, occurs when the model is matched so closely to the training data that even small departures in the training data are picked up and accentuated by the model, which leads to inaccurate predictions.
Problems can therefore arise in training a model for one of two reasons: (a) the model is overly simplistic and consequently unable to interpret the data fully, or (b) the model is highly complicated and magnifies even the most minute variations in the training data.
54 | P a g e
Increasing the bias will lead to a decrease in the variance, and vice versa; this should
not come as a surprise to anybody. On the other hand, increasing the variance will lead
to a drop in the bias.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
The goal of supervised machine learning is to achieve a balance between bias and variance.
The learning algorithm chosen and its user-configurable parameters both help in striking this balance between bias and variance. The well-known supervised method k-Nearest Neighbours (kNN), for example, allows the trade-off to be controlled through the user-configurable parameter 'k'. Increasing the value of 'k' makes the model simpler, because each prediction is averaged over more neighbours, and this increases the bias; decreasing the value of 'k' lets the model follow the training data more closely, which increases the variance.
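A minimal sketch of this trade-off, assuming scikit-learn is available (the data set and the values of 'k' are illustrative, not taken from the text):

```python
# Vary k in kNN and observe the bias-variance trade-off via cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=42)

for k in (1, 5, 25, 75):
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=5)
    # Small k: flexible fit, low bias, high variance.
    # Large k: smoother fit, high bias, low variance.
    print(f"k={k:>2}  mean CV accuracy = {scores.mean():.3f}")
```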
Classification is one of the most important tasks in supervised learning. The classification model examines the values of the predictor features and, based on those values, assigns a class label to the target feature. Taking the problem of predicting the win/loss of a cricket match as an example, the classifier would assign the class value win or loss to the target feature based on the values of other variables, such as whether the team won the toss, the number of spinners in the team, the number of victories the side has had in the competition, and so on.
To evaluate the performance of a model, it is vital to keep track of the number of correct classifications or predictions it makes. In the situation described above, a classification is accurate if the model predicted that the team would win and the team really did win. The accuracy of a model is obtained by dividing the number of correct classifications or predictions by the total number of classifications or predictions made; if the model has classified 99 out of 100 games correctly, its accuracy is said to be 99%. The accuracy value alone, however, is not enough to judge whether a model has performed well, because accuracy is a relative measure. For a sports win-prediction model, an accuracy of 99% may be judged adequate; when the learning problem involves predicting a fatal disease, the same value might not be considered a decent threshold.
Other metrics are therefore used alongside accuracy, including sensitivity, specificity, precision, and measures related to them. Let us first examine model accuracy in more detail with the help of an example. When predicting the win or loss of a cricket match, there are four distinct outcomes to consider.
The class of interest in this particular case is clearly 'win'. The first scenario is one in which the model correctly classified a data instance as belonging to the class of interest: the model predicted that the team would win, and the team actually won. Such cases are called True Positive (TP) cases. The second scenario is one in which the model incorrectly identified a data instance as belonging to the class of interest: the model predicted that the team would win, but it ended up losing. Such cases are called False Positive (FP) cases. The third scenario is one in which the model incorrectly classified a data instance as not belonging to the class of interest: the model predicted that the team would lose, but it ended up winning. Such cases are called False Negative (FN) cases.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
The fourth scenario is one in which the model predicted that the team would lose, and the team did in fact lose. Such cases are called True Negative (TN) cases. Figure 2.7 provides a visual representation of these four possible outcomes. The accuracy of any classification model is determined by dividing the total number of correct classifications (either as the class of interest, i.e. True Positive, or as not the class of interest, i.e. True Negative) by the total number of classifications made.
A confusion matrix is a matrix that summarizes correct and incorrect predictions in the form of true positives, false positives, false negatives, and true negatives. The win/loss prediction of a cricket match involves two classes of interest, win and loss, so it produces a 2 × 2 confusion matrix; a classification task involving three classes would produce a 3 × 3 confusion matrix, and so on. Assume that the confusion matrix for the cricket match win/loss prediction problem is as follows:
In the context of the above confusion matrix, the total count of TPs is 85, the count of FPs is 4, the count of FNs is 2, and the count of TNs is 9. The percentage of incorrect classifications is given by the error rate, which is quantified as

error rate = (FP + FN) / (TP + FP + FN + TN) = 1 − accuracy

For the cricket match win-prediction problem, the accuracy and error rate can be computed directly from these counts.
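A small worked sketch of this calculation (plain Python, using the counts quoted above):

```python
# Confusion-matrix counts for the cricket win-prediction example.
TP, FP, FN, TN = 85, 4, 2, 9
total = TP + FP + FN + TN

accuracy = (TP + TN) / total        # correct classifications / all classifications
error_rate = (FP + FN) / total      # equivalently, 1 - accuracy

print(f"accuracy   = {accuracy:.2%}")    # 94.00%
print(f"error rate = {error_rate:.2%}")  # 6.00%
```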
Specificity is another helpful statistic for establishing whether a model strikes an adequate balance between being too cautious and too aggressive. The specificity of a model is the proportion of negative instances that have been classified correctly. Applied to the problem of determining whether a tumour is malignant, specificity is the proportion of benign tumours that have been given the correct classification. For the cricket match win-prediction problem, the confusion matrix shown earlier again serves as the basis for the calculation.
Higher values of specificity indicate that the model is performing well. It is easy to see, however, that a cautious approach aimed at reducing the number of false negatives may actually push up the number of false positives: in its attempt to reduce the FNs, the model will classify a greater proportion of tumours as malignant, so benign tumours become more likely to be labelled malignant. Two further performance measures of a supervised learning model are closely related to sensitivity and specificity: recall and precision. Precision gives the proportion of positive predictions that are actually positive, while recall gives the proportion of truly positive instances that were predicted as positive.
It is not surprising that a model with higher precision is seen as more reliable. Recall reflects the proportion of all positives that were correctly predicted; for the cricket win/loss prediction problem, it corresponds to the proportion of the total victories that were correctly anticipated.
In the context of the above confusion matrix for the cricket match win-prediction problem, precision and recall can be computed from the same counts. The F-score combines precision and recall into a single measurement and therefore offers a convenient measure for comparing the performance of several different models. The computation, however, assumes that precision and recall are of equal importance, which may not always be the case in practice; in disease prediction, for instance, recall may deserve a significantly higher weightage. In such a situation it is possible to allocate different weightages to precision and recall, although deciding what value to assign to each, and justifying that particular value, can itself become a dilemma.
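A worked sketch of these measures, again using the counts from the confusion matrix above (the F-score shown is the standard F1, which weights precision and recall equally):

```python
TP, FP, FN, TN = 85, 4, 2, 9

recall = TP / (TP + FN)         # sensitivity: share of actual wins predicted as wins
specificity = TN / (TN + FP)    # share of actual losses predicted as losses
precision = TP / (TP + FP)      # share of predicted wins that really were wins
f1 = 2 * precision * recall / (precision + recall)

print(f"recall      = {recall:.3f}")       # 0.977
print(f"specificity = {specificity:.3f}")  # 0.692
print(f"precision   = {precision:.3f}")    # 0.955
print(f"F1-score    = {f1:.3f}")           # 0.966
```

A weighted F-beta score (beta greater than 1 favouring recall, beta smaller than 1 favouring precision) can be used when the two measures should not carry equal weight.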
CHAPTER 3
3.1 INTRODUCTION
In the chapters that came before this one, we were given an introduction to the theory and practice of machine learning. We began by discussing what human learning is and how the different types of machine learning aim to emulate its various aspects, and we were given an overview of the kinds of problems that can be handled with machine learning algorithms. A number of basic steps have to be completed before machine learning can be applied to problem-solving, and detailed information on these steps has been supplied. We then completed an in-depth look at the tasks involved in modelling a problem using machine learning. Modelling on its own, however, does not tell us how effective machine learning is as a problem-solving tool, so we also learned how to evaluate the performance of machine learning models, and the levers that can be used to significantly improve a poorly performing model were noted. Now that we are prepared (or as near to prepared as we can be!) to begin applying machine learning to solve problems, we need to examine another vital component that plays an important part in solving any machine learning problem.
That component is feature engineering. Although feature engineering is part of the preparatory activities described in Chapter 2, it deserves to be treated as a separate topic because it is both highly significant and rather broad in scope. This chapter focuses on the features of the data set, which are a crucial component of any machine learning problem, whether supervised or unsupervised. Feature engineering is one of the most significant preparation techniques used in machine learning: it takes the raw input data and transforms it into features that are correctly aligned and ready to be used by the machine learning models. Before we start discussing feature engineering, however, let us try to understand what exactly a feature is.
A feature is an attribute of a data set that is used in a machine learning process. Some machine learning practitioners hold the view that only the attributes that are meaningful to a machine learning problem should be called features, although this view should be taken with a grain of salt. In fact, the selection of the subset of features that are useful for machine learning is a sub-area of feature engineering that attracts a significant amount of research attention.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
The features that make up a data set are also referred to as its dimensions; a data set with 'n' features is called an n-dimensional data set. Take, for example, the well-known machine learning data set Iris, originally introduced by the British statistician and biologist Ronald Fisher. A portion of this data set is depicted in Figure 3.1. It has five attributes or features: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. Among them, the feature Species represents the class variable, while the remaining features are the predictor variables. It is therefore a five-dimensional data set.
Feature engineering refers to the process of transforming a data set into features that represent the data set more precisely and thereby lead to better learning performance. As already noted, it is a key step in the pre-processing stage of machine learning. It has two basic components:
1. Feature transformation
2. Feature subset selection
Feature transformation, in turn, takes two forms:
1. Feature construction
2. Feature extraction
Feature construction discovers previously unknown relationships between features and enlarges the feature space by creating additional features: if a data set already has 'n' features or dimensions, 'm' more features may be added through feature construction, so that the data set finally contains 'n + m' dimensions. Feature extraction, on the other hand, generates a new set of features from the original set of features by making use of some functional mapping.
While feature transformation creates new features, feature subset selection (or simply feature selection) does not create any new features. The objective of feature selection for a specific machine learning job is to derive, from the full feature set, the subset of features that is most relevant to the problem at hand; this subset then becomes the focus of the machine learning work. In other words, the fundamental goal of feature selection is to select a subset F' = {Fi1, Fi2, ..., Fim} from the full set F = {F1, F2, ..., Fn}, where m < n, such that F' is the most meaningful subset and produces the best result for the machine learning assignment. These concepts are explained in more depth in the following sections.
Constructing a high-quality feature space is an essential prerequisite for the success of any machine learning model. It is not always obvious, however, which features are the important ones. As a consequence, every attribute available in the data set is often used as a feature, and the learning model is left with the task of deciding which features matter most.
This approach is not at all practicable in domains such as the classification of medical images, the categorization of text, and comparable areas. If a model has to be trained to detect whether or not a document is spam, we may represent a document as a bag, that is, a collection of words, for the purpose of training the model. In that case the feature space will incorporate all of the unique terms present in every document, and it is almost certain to comprise a few hundred thousand features.
If we start using bigrams or trigrams in addition to single words, the number of features will rapidly cross the million mark. Feature transformation is used to handle this problem: it is an effective technique for dimensionality reduction and, subsequently, for improving the performance of learning models. Feature transformation can pursue two distinct goals: achieving the most accurate possible reconstruction of the attributes present in the original data set, and achieving the highest possible efficiency in the learning task being performed.
Feature construction involves transforming a given set of input features to generate a new set of more powerful features. To understand this better, consider a real-estate data set containing information on all the apartments sold in a particular locality. The data set has three attributes: apartment length, apartment breadth, and apartment price. If such data are used as the input to a regression problem, they can act as training data for the regression model, and the trained model should then be able to forecast the price of an apartment whose price is unknown or which has recently come on the market.
Rather than using the length and breadth of the apartment as predictors, it is far more convenient and sensible to use the area of the apartment, which is not a variable present in the original data set. A feature such as this, the apartment area, can therefore be added to the data set. In other words, we transform the three-dimensional data set into a four-dimensional data set by adding the newly derived feature, apartment area. This is depicted graphically in Figure 3.2.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
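A minimal pandas sketch of this construction; the column names and values below are illustrative, not taken from the book's data set:

```python
import pandas as pd

# Original three attributes: length, breadth and price of each apartment.
apartments = pd.DataFrame({
    "apartment_length": [40, 55, 38],
    "apartment_breadth": [30, 42, 25],
    "apartment_price": [3_500_000, 6_200_000, 2_800_000],
})

# Constructed feature: area = length x breadth, the fourth dimension of the data set.
apartments["apartment_area"] = apartments["apartment_length"] * apartments["apartment_breadth"]
print(apartments)
```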
Note: Although the apartment length and apartment breadth attributes have been retained in Figure 3.2 for the sake of simplicity, it is better to omit them while building the model. In certain situations, feature construction is an essential activity that must be completed before the machine learning task can begin. Three situations fall into this category: when features have categorical values but the machine learning algorithm needs numeric inputs; when features have numeric (continuous) values and need to be converted to ordinal values; and when text-specific feature construction has to be carried out.
Consider, for example, the data set on athletes presented in Figure 3.3a. Assume that the data set includes the following attributes: age, city of origin, parents athlete (that is, whether either of the parents was an athlete), and chance of winning. Chance of winning is the class variable, and the other attributes are predictor variables. As we know, any machine learning approach, whether a classification algorithm (such as kNN) or a regression technique, needs numeric values to learn from.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
It may be noted that three of the attributes are categorical in nature and cannot be used directly by any machine learning task: City of origin, Parents athlete, and Chance of winning. This is a case in which feature construction is needed to generate new dummy features that machine learning techniques can use. Because the feature City of origin can take three different values, City A, City B and City C, three dummy features have to be created, which may be named origin_city_A, origin_city_B and origin_city_C. In the second row, for example, the feature City of origin has the categorical value 'City B'; the dummy features origin_city_A, origin_city_B and origin_city_C that replace it will therefore take the values 0, 1 and 0 respectively. Similarly, because the original feature Parents athlete has the value 'No' in row 2, the dummy features parents_athlete_Y and parents_athlete_N will take the values 0 and 1 respectively in that row. The complete set of transformations carried out on the athletes data set is depicted in Figure 3.3b.
On closer analysis, however, we find that the features Parents athlete and Chance of winning in the original data set can each take only two possible values. Creating two dummy features from each of them can therefore be regarded as a kind of duplication, because the value of one dummy feature can be derived from the value of the other. As shown in Figure 3.3c, this duplication is easily avoided by retaining one of the two dummy features and dropping the other.
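A pandas sketch of this encoding; the rows and column names are illustrative stand-ins for the athletes data set of Figure 3.3:

```python
import pandas as pd

athletes = pd.DataFrame({
    "Age": [18, 20, 23],
    "City of origin": ["City A", "City B", "City C"],
    "Parents athlete": ["Yes", "No", "Yes"],
    "Chance of winning": ["Yes", "No", "Yes"],
})

# Three dummy columns for the three-valued 'City of origin'.
city_dummies = pd.get_dummies(athletes["City of origin"], prefix="origin")

# For the two-valued features, drop_first=True keeps a single dummy each,
# avoiding the duplication discussed above (Figure 3.3c).
binary_dummies = pd.get_dummies(
    athletes[["Parents athlete", "Chance of winning"]], drop_first=True
)

encoded = pd.concat([athletes[["Age"]], city_dummies, binary_dummies], axis=1)
print(encoded)
```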
3.2.1.2 Encoding categorical (ordinal) variables
To illustrate this, let us look at a data set of students. Assume there are three variables, as shown in Figure 3.4a: grade, mark in science, and mark in mathematics. The grade is clearly ordinal and can take the values A, B, C, and D. To convert this variable into a numeric one, we can create a feature named Num_grade that maps a numeric value to each ordinal value. In the current illustration, the grades A, B, C, and D shown in Figure 3.4a map to the values 1, 2, 3, and 4 in the converted variable shown in Figure 3.4b.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
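A pandas sketch of this ordinal mapping (the marks shown are illustrative values):

```python
import pandas as pd

students = pd.DataFrame({
    "Grade": ["A", "C", "B", "D"],
    "Science mark": [78, 56, 81, 42],
    "Maths mark": [81, 62, 74, 38],
})

# Map each ordinal grade onto its numeric counterpart.
students["Num_grade"] = students["Grade"].map({"A": 1, "B": 2, "C": 3, "D": 4})
print(students)
```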
A numeric (continuous) feature may likewise need to be converted into a categorical one by grouping its values into a number of distinct groups according to the data range. In the real-estate price-estimation example, the original data set includes a numeric attribute, apartment price, shown in Figure 3.5a. As shown in Figures 3.5b and 3.5c, it can be transformed into a categorical variable representing the price grade.
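A pandas sketch of this kind of binning; the bin edges and grade labels are illustrative, not the ones used in Figure 3.5:

```python
import pandas as pd

prices = pd.Series([2_800_000, 3_500_000, 6_200_000, 9_100_000], name="apartment_price")

# Group the numeric prices into categorical price grades by value range.
price_grade = pd.cut(
    prices,
    bins=[0, 3_000_000, 6_000_000, float("inf")],
    labels=["Low", "Medium", "High"],
)
print(pd.concat([prices, price_grade.rename("price_grade")], axis=1))
```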
One might argue that text is the most widely used mode of communication in modern society. Whether we consider social networks such as Facebook, micro-blogging channels such as Twitter, emails, or short messaging services such as WhatsApp, text plays a key role in the distribution of information. Text mining is consequently a vital area of research, not only for the technology community but also for industry.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
Text mining relies on machines to extract information from text. Because of the inherently unstructured nature of the data, however, it is not easy to make sense of the information contained in text. To begin with, chunks of text data do not come with readily available attributes, unlike structured data sets, and such attributes are necessary for machine learning operations. No machine learning model can work without numeric input data, so it is essential to transform the textual information in the data set into numeric features. In the process of vectorization, text data, more commonly referred to as a corpus, is converted into a numerical representation. Using this method, a bag-of-words representation is composed from the word occurrences contained in all of the documents of the corpus. The process has three basic steps:
1. Tokenize
2. Count
3. Normalize
To separate out the words, referred to as tokens, punctuation marks and blank spaces are used as delimiters while the corpus is tokenized. The number of times each token occurs in each document is then counted, document by document. Finally, the tokens are normalized: tokens that occur in the majority of the documents are given diminishing importance.
Following this step, a matrix is built in which each token is represented by a column and each document of the corpus by a row; each cell holds a tally of the number of times the token appears in that particular document. This matrix is known as a document-term matrix (also occasionally referred to as a term-document matrix). The document-term matrix shown in Figure 3.6 is a typical example of the input given to a machine learning model.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
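A minimal sketch of these three steps, assuming scikit-learn is available (the three documents are made-up examples):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "the team won the toss and won the match",
    "the team lost the match",
    "spinners won the match for the team",
]

vectorizer = CountVectorizer()                    # steps 1 and 2: tokenize and count
counts = vectorizer.fit_transform(corpus)         # rows = documents, columns = tokens
tfidf = TfidfTransformer().fit_transform(counts)  # step 3: down-weight very common tokens

print(vectorizer.get_feature_names_out())         # the token vocabulary
print(counts.toarray())                           # the document-term matrix of raw counts
```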
In feature extraction, new features are created by combining the original features. The original features are combined by applying a variety of commonly used operators; for nominal features, examples include the Cartesian product, the M of N operator, and others.
Let us look at an example to understand this. Suppose we have a data set whose feature set is F = {F1, F2, ..., Fn}. After feature extraction through a mapping function f(F1, F2, ..., Fn), a new set of features is obtained, as illustrated below.
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
Let’s discuss the most popular feature extraction algorithms used in machine learning:
As mentioned earlier, every data set has a number of features or dimensions, and many of these features are likely to be similar to one another. For example, there is a statistically significant correlation between a person's height and weight: in general, greater height goes with greater weight, and vice versa. So if a data set contains the two variables height and weight, it is reasonable to assume that these two features have a considerable degree of similarity. In general, the performance of any machine learning algorithm improves when the number of features is small and the degree of similarity between the features is low; this is one of the most important factors contributing to the success of machine learning.
This idea is the primary guiding principle of the principal component analysis (PCA) approach to feature extraction. PCA extracts a new collection of features from the original features such that the new features are sufficiently distinct from one another. A feature space with n dimensions is thereby transformed into a feature space with m dimensions in which the dimensions are orthogonal to one another, that is, completely independent of one another.
To understand the concept of orthogonality, we need to step back briefly into the idea of a vector space in linear algebra. As is well known, a vector is a quantity that has both magnitude and direction, and it can be used to determine the position of one point relative to another in a Euclidean space of two, three, or in general 'n' dimensions.
Vectors are the fundamental elements of a vector space. A vector space can be defined in terms of a more compact collection of vectors called basis vectors: every vector in the space can be expressed as a linear combination of the basis vectors. Each vector 'v' in the vector space can therefore be represented as

v = a1·u1 + a2·u2 + ... + an·un

where a1, ..., an are the 'n' scalars and u1, ..., un are the basis vectors. The basis vectors are mutually orthogonal, that is, every basis vector is orthogonal to every other basis vector.
The orthogonality of vectors in an n-dimensional vector space can be thought of as an extension of the idea that vectors in a two-dimensional space are perpendicular to one another. Two vectors that are orthogonal to one another are completely independent of each other. Transforming a set of vectors into the corresponding set of basis vectors makes it easier to decompose the collection into a number of independent components, since every vector in the original set can then be expressed as a linear combination of the basis vectors.
Let us now apply this notion to the feature space of a data set. The feature vectors can be transformed into a vector space made up of basis vectors, which in this context are called principal components. Just as the basis vectors are orthogonal to one another, these principal components are orthogonal to one another.
In other words, a collection of feature vectors that may share similarities with one another is transformed into a collection of principal components that have no similarity with one another, yet the principal components are still able to capture the variability that was present in the original feature space. Note also that the number of principal components produced, like the number of basis vectors, is substantially smaller than the original number of features. PCA carries out the transformation in such a way that the covariance between the newly created features, the principal components, is zero, which demonstrates that the new features are uncorrelated.
The principal components are ordered by the amount of variability in the data that each of them captures: the first principal component captures the highest degree of variability, the second principal component the next highest degree, and so on.
It is reasonable to expect that the total variance of the new features, the principal components, will be comparable to the total variance of the original features. PCA begins with the eigenvalue decomposition of the covariance matrix of the data set; this decomposition is the foundation of the procedure. The steps to be followed are sketched below.
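A NumPy sketch of the standard eigenvalue-decomposition procedure (the data are random placeholders, and the step numbering follows the usual textbook sequence rather than quoting this book):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # 100 samples, 5 original features

X_centered = X - X.mean(axis=0)          # 1. remove the mean of each feature
cov = np.cov(X_centered, rowvar=False)   # 2. covariance matrix of the data set
eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigenvalue decomposition (symmetric matrix)

order = np.argsort(eigvals)[::-1]        # 4. sort components by captured variance
k = 2
components = eigvecs[:, order[:k]]       # top-k orthogonal principal directions

X_reduced = X_centered @ components      # 5. project onto the new k-dimensional space
print(X_reduced.shape)                   # (100, 2)
```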
3.2.2.2 Singular value decomposition
Singular value decomposition (SVD) is a matrix factorization technique commonly applied in linear algebra. The SVD of an m × n matrix A is a factorization of the form

A = U Σ V^T

where U is an m × m unitary (orthonormal) matrix, V is an n × n unitary (orthonormal) matrix, and Σ is an m × n rectangular diagonal matrix. The diagonal entries of Σ are known as the singular values of matrix A, and the columns of U and V are called the left-singular and right-singular vectors of A, respectively.
SVD is frequently used within principal component analysis (PCA), where it is applied after the mean of each variable has been removed. SVD on its own is also an attractive choice for dimensionality reduction in some situations, because removing the mean of a data attribute is not always advisable, particularly when the data set is sparse (as with text data). The SVD of a data matrix has several useful properties.
In particular, keeping only the top k singular values and the corresponding singular vectors reduces the dimensionality to k; SVD is therefore often used in the context of text data.
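A NumPy sketch of this truncation (the matrix is a random placeholder):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                              # a small m x n data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # A = U @ diag(s) @ Vt

k = 2
A_reduced = U[:, :k] * s[:k]        # k-dimensional representation of the rows of A
A_approx = A_reduced @ Vt[:k, :]    # rank-k reconstruction of the original matrix
print(A_reduced.shape, A_approx.shape)              # (6, 2) (6, 4)
```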
Linear discriminant analysis (LDA) is another method widely applied for feature extraction, alongside principal component analysis (PCA) and singular value decomposition (SVD). Like those techniques, LDA aims to transform a data set into a feature space with fewer dimensions. In contrast to PCA, however, the focus of LDA is not on capturing the variability of the data set.
Instead, LDA focuses on class separability: it projects the features so as to maximize the separation between classes, which also helps to keep the machine learning model from fitting the data too closely. Whereas PCA computes the eigenvalues of the covariance matrix of the data set, LDA computes eigenvalues and eigenvectors from the intra-class (within-class) and inter-class (between-class) scatter matrices. Its main steps are as follows:
1. Calculate the mean vector for each of the different classes.
2. Build the scatter matrices for both the intra-class and the inter-class levels.
3. Calculate the eigenvalues and eigenvectors for S_W and S_B, where S_W denotes the intra-class scatter matrix and S_B the inter-class scatter matrix (in practice, the eigenvectors of S_W^-1·S_B are computed). Here m_i denotes the sample mean of each class, m the overall mean of the data set, and N_i the sample size of each class.
4. Retain the top k eigenvectors corresponding to the top k eigenvalues.
where m_i is the mean vector of the i-th class
Source: Machine Learning, Data collection and processing through by Saikat Dutt
(2022)
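A NumPy sketch of these steps (the data and class labels are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(90, 4))                       # 90 samples, 4 features
y = np.repeat([0, 1, 2], 30)                       # 3 classes

m = X.mean(axis=0)                                 # overall mean of the data set
S_W = np.zeros((4, 4))                             # intra-class scatter matrix
S_B = np.zeros((4, 4))                             # inter-class scatter matrix
for c in np.unique(y):
    Xc = X[y == c]
    m_i = Xc.mean(axis=0)                          # mean vector of class i
    S_W += (Xc - m_i).T @ (Xc - m_i)
    d = (m_i - m).reshape(-1, 1)
    S_B += len(Xc) * (d @ d.T)                     # weighted by the class size N_i

eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:2]].real                     # top-2 discriminant directions
X_lda = X @ W
print(X_lda.shape)                                 # (90, 2)
```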
All of this information is kept in a data set referred to as 'student weight'. The attributes recorded for each student include the roll number, age, height, and weight. It is easy to see that the roll number has no bearing whatsoever on predicting a student's weight. We can therefore eliminate the feature roll number and build a feature subset to be considered for this machine learning assignment, and we expect this subset of attributes to give results superior to those obtained from the full set of attributes. The same is depicted in Figure 3.8.
Before we go deeper into feature selection, however, let us try to understand the circumstances that make feature selection such an important issue; this will put us in a better position to carry the discussion forward.
The rapid advances in the digital arena have resulted in an astounding increase in the amount of data being generated. At the same time, developments in storage technology have made it feasible to store large quantities of data at extremely low cost. These developments have given further impetus to the storage and mining of very large, high-dimensional data sets.
In the context of data sets, 'high-dimensional' refers to a high number of variables, attributes, or features. Such data sets are common in fields such as genetic analysis, geographic information systems (GIS), social networking, and closely related areas. Many high-dimensional spaces have hundreds or thousands of dimensions or attributes; the data obtained from DNA microarrays, for example, might include as many as 450,000 distinct variables (gene probes).
Considerable development has also taken place in two newer application domains. One is biomedical research, which includes the selection of genes based on microarray data and gives rise to data sets containing features in the range of a few tens of thousands. The other is text categorization, which works with massive volumes of text data originating from a variety of sources, including emails and social networking sites.
Text data generated from such sources likewise have unusually high dimensionality: in a large document corpus containing a few thousand documents, the number of unique word tokens that characterize the text data set can be in the range of a few tens of thousands. Extracting valuable information from data of such high dimensionality can be a substantial hurdle for any machine learning system. On the one hand, it requires a considerable amount of time and computational resources.
On the other hand, the unnecessary noise present in the data causes a severe reduction in model performance, and this is true for both supervised and unsupervised machine learning tasks. A model built on an extremely large number of attributes may also be extremely difficult to interpret. It is therefore of the utmost importance to select a subset of the attributes rather than using the complete collection. Doing so helps in:
Having a learning model that is not only more effective but also less costly (that is, in terms of the computational resources required)
Improving the performance of the learning model
Gaining a more in-depth understanding of the underlying model that was responsible for generating the data
3.3.2 Key drivers of feature selection – feature relevance and redundancy
In supervised learning, the input data set (often referred to as the training data set) carries a class label, and a model is induced from the training data so that it can assign class labels to new, unlabelled data. Each of the predictor variables is expected to contribute information that helps determine the class label. A variable that contributes no information is regarded as irrelevant, and a variable whose information contribution to the prediction is very small is said to be weakly relevant. The remaining variables, which make a significant contribution to the prediction task, are considered strongly relevant.
Unsupervised learning, unlike supervised learning, has no training data set or labelled data. Data instances that are similar to one another are grouped together, and the degree of similarity between instances is assessed based on the values of a number of different variables. Some variables contribute no information that is useful for deciding whether instances are similar or different; such variables contribute nothing meaningful to the grouping process.
In the context of an unsupervised machine learning problem, such variables are classified as irrelevant. To understand this better, consider again the simple student data set discussed at the beginning of this section. A student's roll number provides no useful information for estimating the student's weight, and it likewise provides no information for grouping students with similar academic characteristics. The variable roll number is therefore largely irrelevant both for the supervised task of predicting student weight and for the unsupervised task of grouping students of similar academic quality. Every feature that is irrelevant in the context of a machine learning task is a candidate for rejection when we select a subset of features; whether the weakly relevant features should also be discarded can be considered on a case-by-case basis.
A person's height usually increases along with age, so for the problem of estimating weight, age and height provide roughly the same amount of information. In other words, the learning model will generate almost the same results whether or not the feature height is included in the feature subset, and the results will likewise be comparable if age is left out of the predictors. When one feature is similar to another in this way, it is considered potentially redundant in the context of the learning problem.
All features that are potentially redundant are candidates for rejection from the final feature subset. Out of a group of potentially redundant features, only a small number of representative features are retained in the feature set that is ultimately employed.
As stated earlier, the criterion used to judge the significance of a feature is the amount of information it contributes. For supervised learning, mutual information is considered a trustworthy measure of a feature's value in determining the class label, because it measures the information contribution that the feature makes; it is therefore an excellent indicator of the relevance of the feature with respect to the class variable. The larger the mutual information of a feature, the more relevant that feature is. Mutual information can be calculated as

MI(C, f) = H(C) + H(f) − H(C, f)

where H(C) is the marginal entropy of the class variable, H(f) is the marginal entropy of the feature 'f', and H(C, f) is their joint entropy. Here K denotes the total number of classes taken by the class variable C, and f is a feature that takes discrete values.
In unsupervised learning there is no class variable, so feature-to-class mutual information cannot be used to estimate the information contribution of the features. In the unsupervised case, the entropy of the set of features is computed instead, leaving out one feature at a time, so that the amount of information contributed by each feature can be estimated. The features are then arranged in decreasing order of information contribution, and the features in the top 'β' percentage (the value of 'β' being a design parameter of the algorithm) are selected as the significant ones. The entropy of a feature f can be calculated using Shannon's formula

H(f) = − Σ p(f = x) log2 p(f = x), summed over all values x taken by f

which is used only for features that take discrete values; for continuous features, discretization has to be performed first in order to estimate the probabilities p(f = x).
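A NumPy sketch of these quantities for a toy discrete feature and class variable (the values are illustrative):

```python
import numpy as np

def entropy(values):
    # Shannon entropy H = -sum p * log2(p) over the observed value frequencies.
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

f = np.array(["low", "low", "high", "high", "medium", "low"])   # a discrete feature
C = np.array(["win", "loss", "win", "win", "loss", "loss"])     # the class variable

joint = np.array([a + "|" + b for a, b in zip(f, C)])           # joint (f, C) outcomes
mi = entropy(C) + entropy(f) - entropy(joint)                   # MI(C, f)
print(f"H(f) = {entropy(f):.3f} bits, MI(C, f) = {mi:.3f} bits")
```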
The idea of feature redundancy, based on the notion that several features may convey similar information, has already been discussed. The degree of similarity between the information contributions of two features can be evaluated in a number of ways, the most important of which are:
1. Correlation-based measures
2. Distance-based measures, and
3. Other coefficient-based measures
Correlation can take values ranging from +1 to −1. A correlation of +1 or −1 indicates a perfect linear relationship between the two features, while a correlation of 0 indicates that the features have no linear relationship. In general, for every feature selection problem a threshold value is chosen to decide whether two features are sufficiently similar to each other.
The Euclidean distance is the most commonly used distance metric. For a data set with n instances, the Euclidean distance between two features F1 and F2 is calculated as

d(F1, F2) = sqrt( (F11 − F21)² + (F12 − F22)² + ... + (F1n − F2n)² )

Figure 3.9 shows such a data set, in which the two features under consideration are communication (F1) and aptitude (F2); applying the formula above gives the Euclidean distance between them.
The Euclidean distance, also referred to as the L2 norm, is in fact the special case of the Minkowski distance obtained when r = 2. When r = 1, the Minkowski distance takes the form of the Manhattan distance, also known as the L1 norm, as shown in the example below.
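A NumPy sketch of these distance measures for two illustrative feature vectors:

```python
import numpy as np

F1 = np.array([5.0, 3.0, 8.0, 1.0])
F2 = np.array([4.0, 6.0, 7.0, 2.0])

def minkowski(a, b, r):
    # Minkowski distance of order r; r=2 gives Euclidean, r=1 gives Manhattan.
    return np.sum(np.abs(a - b) ** r) ** (1.0 / r)

print("Euclidean (r=2):", minkowski(F1, F2, 2))   # same as np.linalg.norm(F1 - F2)
print("Manhattan (r=1):", minkowski(F1, F2, 1))   # the L1 norm
print("Minkowski (r=3):", minkowski(F1, F2, 3))
```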
The Jaccard index (or Jaccard coefficient) is an approach that can be used to determine the degree of similarity between two features, while the Jaccard distance measures the degree of dissimilarity between them.
For two features taking binary values, the Jaccard index is measured as

J = n11 / (n01 + n10 + n11)

where n11 is the number of cases in which both features have the value 1, n01 is the number of cases in which feature 1 has the value 0 and feature 2 has the value 1, and n10 is the number of cases in which feature 1 has the value 1 and feature 2 has the value 0. The Jaccard distance is then d = 1 − J. As an example, consider two features F1 and F2 having the values (0, 1, 1, 0, 1, 0, 1, 0) and (1, 1, 0, 0, 1, 0, 0, 0) respectively. Figure 3.10b illustrates how the values of n11, n01 and n10 are determined; the cases in which both values are 0 are shown without borders, signalling that they are left out of the computation of the Jaccard coefficient.
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
The simple matching coefficient (SMC) is quite similar to the Jaccard coefficient,
except that it also counts the cases in which both features have the value 0:

$SMC = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{01} + n_{10}}$
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
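For comparison with the Jaccard sketch above, a minimal illustration of the SMC for the same two binary features (again, not part of the original text):

```python
import numpy as np

F1 = np.array([0, 1, 1, 0, 1, 0, 1, 0])
F2 = np.array([1, 1, 0, 0, 1, 0, 0, 0])

n11 = np.sum((F1 == 1) & (F2 == 1))
n00 = np.sum((F1 == 0) & (F2 == 0))
n01 = np.sum((F1 == 0) & (F2 == 1))
n10 = np.sum((F1 == 1) & (F2 == 0))

smc = (n11 + n00) / (n11 + n00 + n01 + n10)
print(smc)   # unlike the Jaccard index, the (0, 0) cases are counted here
```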
FIG. 3.12 Feature selection process
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
The first step in any feature selection approach is "subset generation," a search
procedure that, in an ideal scenario, would produce every viable candidate subset.
However, for a data set with n features there are 2^n possible subsets, so as n grows
it quickly becomes impractical to examine every candidate and pick the best one. For
this reason, a range of approximate search techniques is applied to discover promising
subsets for evaluation.
On the one hand, the search may start with an empty set of features and add features
one at a time; this strategy is called sequential forward selection. Alternatively, the
search may start with the complete set and eliminate features one at a time, a method
referred to as sequential backward elimination. Under some conditions the search may
even start from both ends, adding and removing features simultaneously; this is known
as bi-directional selection. Using a set of evaluation criteria as the basis for
comparison, each candidate subset is then analysed and compared against the
best-performing subset found so far.
If the new subset performs better than the previous best, it replaces it. This cycle of
subset generation and evaluation continues until a predefined stopping criterion is
fulfilled. Commonly used stopping criteria include:
The search reaches a predefined bound, such as a fixed number of iterations.
Adding (or removing) a further feature no longer produces a better subset.
An acceptable subset is found, i.e. one that achieves a higher classification accuracy
than the current benchmark.
Finally, the best subset that has been selected is validated, either against earlier
benchmarks or by running tests on additional data sets, which may be real-life or
synthetic but legitimate. For supervised learning, the accuracy of the learning model
can serve as the performance parameter for this validation: the accuracy of the model
built with the selected subset is compared against the accuracy of the model built with
a subset derived from another benchmark approach. If the process is unsupervised,
cluster quality can serve as the measuring stick for validation instead.
Feature selection approaches fall into four broad categories:
1. Filter approach
2. Wrapper approach
3. Hybrid approach
4. Embedded approach
As shown in Figure 3.13, the filter approach selects the feature subset on the basis of
statistical measures that rank the merits of the features from the perspective of the
data alone; no learning algorithm is involved in judging the quality of the selected
features. Common statistical tests applied to features in the filter approach include
Pearson's correlation, information gain, Fisher score, analysis of variance (ANOVA)
and Chi-square, to name only a few.
Source: Machine Learning, data collection and processing, by Saikat Dutt (2022)
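As an illustrative sketch of the filter approach (assuming scikit-learn, which the book does not mention, and synthetic data in place of a real data set), features are ranked with a statistical test and the top k are kept without training any model on the candidate subsets.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data standing in for a real data set
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Filter approach: rank features with a statistical measure (ANOVA F-score)
# and keep the k best ones; no learning algorithm is involved.
selector = SelectKBest(score_func=f_classif, k=4)
X_reduced = selector.fit_transform(X, y)
print(selector.get_support())   # boolean mask of the selected features
```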
In the wrapper approach, depicted in Figure 3.14, the optimal feature subset is
identified by using the induction algorithm itself as a black box: the feature
selection method searches for a suitable subset and uses the induction algorithm as
part of the evaluation function. Because a learning model has to be trained and
evaluated for every candidate subset, the wrapper approach is computationally very
expensive. On the other hand, its performance is frequently superior to that of the
filter approach.
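A minimal sketch of a wrapper-style method, again assuming scikit-learn and synthetic data: recursive feature elimination uses the chosen estimator as a black box and removes features one at a time, in the spirit of sequential backward elimination.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Wrapper approach: the estimator (here logistic regression) is trained
# repeatedly; features the model relies on least are eliminated one by one.
estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the retained features
```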
CHAPTER 4
In machine learning, it is standard practice for algorithms to ingest and interpret
data in order to understand the patterns relating to individuals, business processes,
transactions, events, and so on. In this chapter we discuss several distinct types of
real-world data, as well as several classifications of machine learning algorithms.
The ability to obtain data quickly is given significant weight both in data-driven
real-world systems and in machine learning models. Data comes in many forms, the most
common of which are referred to as "structured," "semi-structured," and "unstructured."
In addition, there is "metadata," which represents data about data. In the section that
follows, we briefly discuss these various kinds of data.
"Data about data" is what metadata is, and it is not the same thing as a normal data
format. The main distinction between "data" and "metadata" is that "data" only refers
to the real resources that may be used to categorize, measure, or define something in
connection to the data attributes of an organization. Metadata, on the other hand, refers
to the information about the information. On the other hand, metadata is information
that describes the information that is being discussed. On the other hand, metadata
condenses the vital data information, making it more relevant to the individuals who
will be utilizing the data. This is accomplished by ensuring that the information is
accurate. The metadata associated with a document may contain a variety of
information, including the author of the document, the date it was written, the size of
the file, keywords that describe the content, and a great lot more.
Researchers in machine learning and data science make use of a broad variety of
publicly available datasets in their work. Examples include cybersecurity datasets such
as NSL-KDD, UNSW-NB15, ISCX'12, CIC-DDoS2019 and Bot-IoT; smartphone datasets such as
call logs, SMS logs, mobile application usage logs and notification logs; IoT data;
agriculture and e-commerce data; health data such as heart disease, diabetes and
COVID-19 records; and many more from a wide variety of application domains. Because
different real-world applications may require different data formats, the kinds of
information that are collected can also vary.
To analyse such data in a specific problem area and derive insights or actionable
knowledge from it for the development of real-world intelligent applications, a wide
variety of machine learning approaches can be applied according to their individual
learning capacities.
As shown in Figure 4.1, machine learning algorithms can be broken down into four
groups: supervised learning, unsupervised learning, semi-supervised learning, and
reinforcement learning. A condensed description of these categories of learning
strategies follows, together with an appraisal of the extent to which they can be used
to solve real-world problems.
Fig. 4.1 Types of machine learning techniques
Supervised learning follows a task-driven approach. The two most common supervised
tasks are "classification," which divides the data into discrete categories, and
"regression," which fits a continuous function to the data. Text classification is one
application of supervised learning: predicting the category label or sentiment of a
piece of written text, such as a tweet or a customer review of a product.
Semi-supervised learning combines a small amount of labelled data with a large amount
of unlabelled data, rather than training a model on labelled data alone. Application
areas that make use of semi-supervised learning include machine translation, fraud
detection, data labelling, and text classification, to name just a few.
Reinforcement learning aims to use the feedback obtained from the environment in order
to take actions that increase the reward or reduce the risk. It is not recommended for
basic or straightforward problems, but it is a powerful tool for training AI models
that can help increase automation or optimize the operational efficiency of
sophisticated systems such as robotics, autonomous driving, manufacturing, and
supply-chain logistics.
In Table 4.1 we offer a concise summary of these kinds of machine learning techniques,
along with examples of each. Applying particular machine learning techniques, which are
discussed in greater depth in the next portion of this study, can greatly boost the
intelligence and capabilities of a data-driven application.
Table 4.1 Examples of the different types of machine learning techniques
In this section we examine a variety of algorithms used in machine learning, including
feature engineering for dimensionality reduction, classification analysis, regression
analysis, data clustering, association rule learning, and deep learning approaches. A
predictive model is built in two phases: in the first phase the model is trained using
historical data, and in the second phase the output is produced for new test data.
Figure 4.2 depicts the generic structure of a machine-learning-based predictive model,
in which the first step is training the model to make accurate predictions.
Classification is a supervised learning technique within machine learning. The term
also refers to a class of predictive modelling problems in which a class label is
predicted for a given sample. In mathematical terms, classification approximates a
mapping function f from the input variables (X) to an output variable (Y) representing
the target, label, or category, that is Y = f(X); in other words, it assigns data to
categories on the basis of the input variables.
The procedure can be carried out on either structured or unstructured data in order to
generate an accurate prediction for the data points that are provided. For instance,
when email service providers try to identify spam, categorizing messages as "spam" or
"not spam" is a classification problem. The most frequently encountered types of
classification problems are outlined in the following parts of this study.
The term "binary classification" refers to classification tasks in which there are only
two possible class labels, such as "true" and "false" or "yes" and "no." Binary
classification is a term that is used to define classification jobs. The term "binary
classification" sometimes goes by the name "two-way classification." When doing a
task that involves binary classification, for example, the normal condition would be
represented by one class, while the abnormal state would be represented by another
class. For example, the phrase "cancer not detected" describes the normal condition of
a task that requires a medical checkup, but the phrase "cancer detected" describes the
aberrant state in which the work is currently found to be. Both "spam" and "not spam"
in the example of email service providers that was presented before are good
illustrations of binary categories because they both appear on the same line.
The word "multiclass classification" is most usually used to refer to those jobs in the
field of classification in which there are more than two class labels. In contrast to
problems requiring binary classification, problems involving multiclass classification
do not involve the usage of the ideas of normal and abnormal result distributions.
Instead, instances are categorized as belonging to one of many unique classes out of a
spectrum of probable categories to which they may potentially be assigned. There are
a great number of other categories that might be used here as well. In the NSL-KDD
dataset, for instance, where the attack categories are classified into four class labels,
such as "DoS" (Denial of Service Attack), "U2R" (User to Root Attack), "R2L" (Root
to Local Attack), and "Probing Attack," a multiclass classification task may be required
to categorize the various types of network attacks that are included in the dataset. This
is because the attack categories have already been classified into these four class labels.
104 | P a g e
This is due to the fact that the various types of attacks have already been categorized
into these four different class names.
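A minimal sketch of a four-class classification task (assuming scikit-learn; the synthetic data here merely stands in for something like the NSL-KDD attack categories).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a 4-class problem such as DoS / U2R / R2L / Probing
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)   # handles multiclass labels directly
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```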
Fig. 4.2 A conceptual framework for a machine-learning-based predictive model,
covering both the training and testing phases
Multi-label classification differs from both binary and multiclass classification. For
instance, a Google News article might be assigned several categories at once, such as
"city name," "technology," and "latest news." Multi-label classification therefore
relies on machine learning algorithms that support the prediction of multiple, mutually
non-exclusive classes or labels for the same instance. Traditional classification
tasks, in contrast, use class labels that are mutually exclusive, so each instance
receives exactly one label.
In the academic literature on machine learning and data science, the development of
classification algorithms has attracted a significant amount of attention. In what
follows we describe the most prevalent and well-known methods, which are widely
deployed across a variety of application areas.
Naive Bayes (NB): The naive Bayes algorithm is based on Bayes' theorem together with
the assumption that every pair of features is independent of one another. It performs
well and can be used for both binary and multiclass categories in a number of
real-world scenarios, such as document or text classification, spam filtering, and
similar applications. The NB classifier can also cope with noisy instances in the data
and still construct a trustworthy prediction model.
Its key benefit is that, in comparison with more complicated approaches, it needs only
a small quantity of training data to estimate the necessary parameters quickly. Its
performance may suffer, however, because of the strong assumptions it makes about the
independence of the features. The NB classifier comes in several variants, the most
common being the Gaussian, Multinomial, Complement, Bernoulli, and Categorical models.
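A brief sketch of the Gaussian variant on a small, well-known dataset (assuming scikit-learn, which the book does not reference).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Gaussian naive Bayes trained on the iris data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_test, y_test))   # classification accuracy on held-out data
```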
Linear discriminant analysis (LDA): LDA projects the data onto a lower-dimensional
space in a way that separates the classes; the strategy is also known as the Fisher
linear discriminant projection. The aim is to reduce the complexity of the resulting
model, and the processing costs associated with it, as far as possible. Assuming that
all classes share the same covariance matrix [82], the traditional LDA model represents
each class with a Gaussian density. LDA is closely related to analysis of variance
(ANOVA) and regression analysis, both of which seek to express one dependent variable
as a linear combination of other features or measurements.
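A minimal sketch (scikit-learn assumed) showing LDA used both as a classifier and as a supervised projection onto two discriminant axes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# LDA as a supervised projection to 2 dimensions and as a classifier
lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)

print(X_projected.shape)   # (150, 2): data projected onto two discriminant axes
print(lda.score(X, y))     # training accuracy of the LDA classifier
```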
Logistic regression (LR): Logistic regression is a common probabilistic statistical
model used to address classification problems in machine learning. It estimates
probabilities by means of a logistic function, also known as the sigmoid function:

$\sigma(z) = \frac{1}{1 + e^{-z}}$

It works well when the dataset is linearly separable, but it can overfit
high-dimensional data sets. In such scenarios, over-fitting can be prevented by using
regularization (L1 or L2) techniques.
One of the most significant shortcomings of logistic regression is its implicit
assumption of a linear relationship between the dependent and independent variables. It
can also be used for regression problems, but classification is where it is most
commonly applied.
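A short sketch of L1 and L2 regularization with logistic regression (scikit-learn assumed; the data are synthetic and the C value is arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2 regularization is the scikit-learn default; smaller C means stronger
# regularization. L1 regularization drives some coefficients to exactly zero.
l2_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

for model in (l2_model, l1_model):
    model.fit(X_train, y_train)
    print(model.penalty, model.score(X_test, y_test))
```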
In recent years, the term "machine learning" has grown more common, and its use in a
variety of subfields of technology has been rapidly developing on a daily basis. Every
day, even if we aren't conscious of it, we engage in activities that include machine
learning. Some examples of this type of technology are Google Maps, the Google
107 | P a g e
Assistant, Alexa, and other such platforms. The applications listed below are some of
the most common and widespread uses of machine learning in the real world:
1. Image Recognition
Image recognition, the task of identifying objects, people, and places in digital
images, is one of the most common applications of machine learning.
2. Speech Recognition
When we use Google, we are given the option to "Search by voice," which falls under
speech recognition, another very common application of machine learning. Speech
recognition is the process of converting voice commands into text; it is also known as
"speech to text" or "computer speech recognition."
3. Traffic Prediction
When we want to travel to a new place, we use Google Maps, which shows us the best
route and predicts the traffic conditions along the way. It assesses the current state
of the traffic, whether it is clear, slow-moving, or heavily congested, using two
sources of information: the real-time location of vehicles, obtained from the Google
Maps app and device sensors, and the average time taken on the same route on previous
days. Everyone who uses Google Maps contributes to improving the service, because the
data collected from users is fed back into the system to improve its performance.
4. Product Recommendations
E-commerce and entertainment companies such as Amazon and Netflix make extensive use of
machine learning to provide users with product recommendations, increasing both sales
and customer satisfaction. Once we search for a product on Amazon, machine learning
ensures that we soon begin to see advertisements for that product while browsing the
internet in the same browser. Google uses a number of machine learning algorithms to
understand a user's interests and recommends products accordingly. Similarly, when we
use Netflix, we receive suggestions for series, movies, and other media, which is
likewise made possible by machine learning.
5. Self-Driving Cars
One of the most exciting applications of machine learning is self-driving cars, and
machine learning plays a central role in their development. Tesla, one of the most
well-known car manufacturers, is working hard to produce a vehicle capable of driving
itself, using an unsupervised learning technique to teach the car models to detect
people and objects while driving.
6. Email Spam and Malware Filtering
Whenever we receive a new email, it is automatically categorized as important, normal,
or spam. Machine learning is the technology that allows important emails to appear in
our inbox marked with a significance symbol while spam emails are routed to the spam
folder. Some of the spam filters used by Gmail are:
Content filter
Header filter
General blacklists filter
Rules-based filter
Permission filter
Machine learning techniques such as the Multi-Layer Perceptron, Decision Tree, and
Naive Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistants
Virtual personal assistants such as Google Assistant and Alexa, which are powered by
artificial intelligence, rely heavily on machine learning techniques. These assistants
record our voice commands, send them to a server in the cloud where machine learning
algorithms interpret them, and then carry out the commands in line with the results of
that analysis.
8. Online Fraud Detection
The output of each legitimate transaction is hashed, and those hash values are used as
the input for the next round of computation; this procedure is repeated until the chain
is complete. Each legitimate transaction follows its own distinctive pattern, and that
pattern changes whenever a fraudulent transaction occurs. Because of this, the system
is able to recognize potentially fraudulent transactions, which makes conducting
financial business online far less risky.
9. Stock Market Trading
Machine learning is also widely used in finance, particularly in the stock market.
Because share prices are always at risk of moving up and down, long short-term memory
(LSTM) neural networks are applied to anticipate stock market trends and reduce the
impact of potential price swings.
10. Medical Diagnosis
In medical science, machine learning is applied to the diagnosis of diseases. Medical
technology is consequently advancing rapidly, and it is now feasible to generate
three-dimensional models that can determine the exact position of lesions in the brain,
which makes it easier to diagnose brain tumours and other neurological diseases.
Learning from experience comes naturally to people and animals, and machine learning is
a data analytics technique that teaches computers to do the same. Machine learning
algorithms are computational procedures that "learn" directly from data rather than
relying on a predetermined equation as a model, and they are central to the field of
artificial intelligence.
As the number of data points available for training increases, the algorithm adapts
itself to achieve better results. Deep learning is a branch of machine learning that
presents its own distinct set of challenges. Machine learning makes use of both
supervised and unsupervised techniques: in supervised learning, a model is trained on
data with known inputs and outputs so that it can make predictions for new inputs,
whereas unsupervised learning exploits hidden patterns or underlying structures that
are already present in the input data.
Fig.4.3
Supervised machine learning builds a model that can make predictions in the face of
uncertainty by using evidence as the basis for those predictions. A supervised learning
algorithm is trained on a known set of input data together with the known responses
(outputs) to that data, and the resulting model can then make reasonable predictions of
the response to new data. Use supervised learning if you already have data for the
outcome you are trying to estimate.
Classification models assign incoming data to a number of discrete groups, allowing
particular hypotheses about the responses to be formed: an email is either legitimate
or spam, and a tumour is either malignant or benign. Frequent uses include medical
imaging, speech recognition, and credit scoring.
Use classification if your data can be labelled, categorized, or separated into
specific groups or classes. Handwriting recognition software, for example, uses
classification to distinguish letters and numbers, as do voice recognition and face
recognition software. In image processing and computer vision, unsupervised pattern
recognition techniques are used for object detection and image segmentation.
Use regression techniques if you are working with a range of data or if your response
is a real number, such as a temperature or the time remaining until a piece of
equipment fails. Commonly used regression methods include linear and nonlinear models,
regularization, stepwise regression, boosted and bagged decision trees, neural
networks, and adaptive neuro-fuzzy learning.
Developing a system for the early detection of heart attacks through the use of
supervised learning.
Many health care workers need to determine whether a patient is likely to suffer a
heart attack within the next year. They hold information on past patients, such as age,
weight, height, and blood pressure values, and they know whether each of those patients
had a heart attack during the following year. The objective, then, is to combine the
available data into a model that can predict whether a new person will experience a
heart attack during the next twelve months.
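A minimal sketch of this supervised workflow; the patient records and column meanings below are entirely synthetic stand-ins (scikit-learn and NumPy assumed), not real medical data or a recommended model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Hypothetical patient records: age, weight (kg), height (cm), systolic BP
X = np.column_stack([
    rng.integers(30, 80, n),    # age
    rng.normal(80, 15, n),      # weight
    rng.normal(170, 10, n),     # height
    rng.normal(130, 20, n),     # blood pressure
])
y = rng.integers(0, 2, n)       # synthetic label: 1 = heart attack within a year

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(model.predict([[55, 95.0, 172.0, 150.0]]))   # prediction for a new patient
```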
Clustering is the most widely used technique in unsupervised learning. It is used in
exploratory data analysis to reveal and understand previously unknown groupings and
patterns within the data. Applications of cluster analysis include market research,
object recognition, and DNA sequence analysis, among many others.
For instance, a mobile phone company that wants to optimize the locations at which it
builds cell towers can use machine learning to estimate the number of customers who
will rely on each tower. Because a phone can only communicate with one tower at a time,
the team uses clustering algorithms to discover the best positions for the towers so
that signal reception is optimized for each group, or cluster, of customers.
Clustering can be carried out with a variety of methods, including k-means and
k-medoids, hierarchical clustering, Gaussian mixture models, hidden Markov models,
self-organizing maps, fuzzy c-means clustering, and subtractive clustering.
Fig.4.4
This study covers ten different approaches to machine learning, providing a foundation
on which you can build your knowledge and skills in the field. The ten strategies
discussed are:
1. Regression
2. Classification
3. Clustering
4. Dimensionality reduction
5. Ensemble methods
6. Neural networks and deep learning
7. Transfer learning
8. Reinforcement learning
9. Natural language processing
10. Word embeddings
Supervised methods are used, for example, when a service provider wants to estimate how
many new customers will sign up for an offering during the following month.
Unsupervised machine learning, by contrast, looks at ways of relating and grouping data
points without using a target variable to predict. In other words, it evaluates data in
terms of shared traits and uses those traits to form clusters of similar items. You
might, for instance, use unsupervised learning techniques to help a retailer that wants
to segment products with similar characteristics, without having to decide in advance
which characteristics to use for the segmentation.
1. Regression
Regression methods fall within the category of supervised machine learning. They help
to predict or explain a particular numerical value based on prior data, for example
estimating the price of a property on the basis of previous pricing data for comparable
properties.
The most basic method is linear regression, in which the data are modelled with the
equation of a line (y = m·x + b). We train a linear regression model on many data pairs
(x, y) by finding the position and slope of the line that minimizes the total distance
between the line and the data points, which gives us the best possible fit and hence
the best estimate of the relationship between the two variables. In other words, we
find the slope (m) and the y-intercept (b) of the line that best represents the
observations in the data.
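A minimal illustration of fitting y = m·x + b by least squares (the data values below are made up for the example):

```python
import numpy as np

# Toy data roughly following y = 2x + 1 with some noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8, 13.2])

# Least-squares fit of a straight line y = m*x + b
m, b = np.polyfit(x, y, deg=1)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
print("prediction at x = 7:", m * 7 + b)
```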
Let us look at a more realistic example of linear regression. To predict the amount of
energy (in kWh) consumed by a few different buildings, I built a linear regression
model whose inputs were the age of the building, the number of floors, the square
footage, and the number of wall devices plugged in.
Because there was more than one input (age and square footage, among others), I used
multivariable linear regression, which takes all of the variables into account
simultaneously. The idea is the same as for simple one-to-one linear regression, except
that in this case the "line" I created lives in a multi-dimensional space, its exact
position determined by several different factors. As the following graphic shows, the
regression model matches the actual energy use of the buildings fairly well.
Keep in mind that linear regression also lets you examine the weight that each input
carries in the overall forecast of the energy consumed. Once you have the formula, for
instance, you can determine whether age, size, or height is the most important factor.
Fig.4.5 Estimates of a building's energy consumption (in kWh) based on a linear
regression model
Regression techniques range from the very simple (linear regression) to the very
complex (regularized linear regression, polynomial regression, decision trees, random
forest regression, and neural networks), with linear regression being the easiest of
the available options. To avoid confusion, start by studying simple linear regression;
once you have grasped the fundamentals, you can move on to the more advanced material.
2. Classification
The following chart shows previous students' grade point averages together with whether
or not they were admitted to the institution. Logistic regression allows us to draw a
line that represents the decision boundary between the two outcomes.
Fig.4.6
Logistic regression, the simplest model that can be used for categorizing data, is a
good starting point for classification work because it provides a straightforward
approach to the task. As you progress, you will eventually have the opportunity to
investigate nonlinear classifiers such as neural networks, decision trees, random
forests, support vector machines, and other comparable approaches.
3. Clustering
Clustering algorithms belong to unsupervised machine learning: they attempt to group or
classify data points that share similar characteristics without using any output
information for training, so it is left to the algorithm itself to decide what the
output should be. When using clustering algorithms, the quality of the results can
generally only be judged through visualization.
K-means clustering is currently the most frequently used clustering method, where "K"
refers to the number of clusters chosen by the user. (Bear in mind that the value of K
can be estimated in a variety of ways, such as the elbow method.) The algorithm
proceeds as follows (see the sketch after this list):
Randomly pick K points from the data as the initial cluster centers.
Assign each data point to the closest center.
Recompute each center as the mean of the data points assigned to it, then return to the
previous step. (If the centers keep moving, you can prevent the process from getting
caught in an infinite loop by defining a maximum number of iterations in advance.)
The procedure is complete when the centers no longer shift, or shift only very slightly.
The K-means method was applied to the building data set shown in the following plot,
using four variables: air conditioning, plug-in equipment (such as microwaves and
refrigerators), domestic gas, and heating gas. Each column of the plot corresponds to
one building and shows how efficient that building is.
The plot groups the buildings into those that are efficient (shown in green) and those
that are inefficient (shown in red).
As you learn more about clustering, you will discover a great many different
approaches, each of which can be very useful. Examples include Density-Based Spatial
Clustering of Applications with Noise (DBSCAN), mean shift clustering, agglomerative
hierarchical clustering, and expectation-maximization clustering using Gaussian mixture
models.
4. Dimensionality Reduction
Dimensionality reduction removes the least important information from a data set (in
certain cases, entire columns that are not required at all). For instance, an image may
contain hundreds of pixels that are irrelevant to the analysis you are carrying out.
Or, when microchips are being manufactured, you might subject each chip to hundreds of
measurements and tests, most of which provide highly redundant information. In these
situations you need an algorithm that can reduce the dimensionality of the data to make
the data set more manageable.
Principal component analysis (PCA), which looks for new vectors that maximize the
linear variance of the data, is the most popular method for reducing the number of
dimensions in a feature space. When there is a strong linear relationship between the
data points, PCA can dramatically reduce the dimensionality of the data without losing
too much information. (You can also measure the amount of information that has been
lost and adjust accordingly.)
Another frequently used method, which can reduce the number of dimensions nonlinearly,
is t-distributed stochastic neighbor embedding (t-SNE). Data visualization is the most
prevalent use of t-SNE, but it can also be used for machine learning tasks such as
feature space reduction and clustering.
The graphic below presents results obtained on the MNIST database of handwritten
digits. MNIST is a standard benchmark containing many images of the digits 0 to 9,
which researchers use to test their clustering and classification algorithms. Each row
of the data set is a vector representation of the original image (28 by 28 pixels, for
a total of 784 values) together with a label for the image (a digit from 0 to 9). The
dimensionality is therefore reduced from 784 (the number of pixels) to 2 (the number of
dimensions in the display). Projecting the data onto a lower dimensionality makes it
possible to examine data sets that were originally stored in a much higher-dimensional
space.
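A minimal sketch of both techniques (scikit-learn assumed); its small 8x8 digits data set (64 dimensions) stands in here for the 784-dimensional MNIST images.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# The 8x8 digits data set stands in for the 784-dimensional MNIST images
X, y = load_digits(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)                     # linear 2-D projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)   # nonlinear 2-D embedding

print(X.shape, X_pca.shape, X_tsne.shape)   # (1797, 64) reduced to (1797, 2)
```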
5. Ensemble Methods
Imagine that you are dissatisfied with the bicycles available in shops and online, so
you decide to build one yourself from the best individual components you can find; by
combining all of these high-quality parts, you end up with a bike that is better than
any single option. Ensemble methods follow the same idea: they combine a number of
distinct predictive models built with supervised machine learning in order to produce
forecasts of higher quality than any single model could.
Random forest is an example of an ensemble method. It combines many decision trees,
each trained on a different sample drawn from the data, and as a result the predictions
produced by a random forest are better than those produced by a single decision tree.
Think of ensemble methods as a way to reduce the bias and variance that a single
machine learning model can exhibit. Any given model may be accurate under some
conditions and inaccurate under others, and a second model may be accurate precisely
where the first one is not; combining the two therefore raises the overall accuracy of
the forecasts.
On Kaggle, the majority of the most successful competitors use some form of ensembling.
The ensemble algorithms most widely used today are random forest, XGBoost, and
LightGBM.
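A short sketch contrasting a single decision tree with a random forest ensemble (scikit-learn assumed, synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=6,
                           random_state=0)

# A single decision tree versus an ensemble of 200 trees trained on
# different bootstrap samples of the same data
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("forest:", cross_val_score(forest, X, y, cv=5).mean())
```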
6. Neural Networks and Deep Learning
Unlike linear and logistic regression, which are both linear models, neural networks
are designed to capture nonlinear patterns in data by adding extra layers of parameters
to the model. A basic neural network, like the one shown below, has three inputs, a
hidden layer, an output layer, and five parameters.
The neural network contains a hidden layer in its architecture.
The architecture of a neural network is flexible enough that both linear and logistic
regression can be reproduced as special cases. Deep learning is an umbrella term for a
wide range of architectures; it takes its name from neural networks composed of many
hidden layers.
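As a toy illustration of such a network, the sketch below runs a single forward pass through a small three-input network with one hidden layer; the weights are arbitrary values chosen only to show the mechanics, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 2 hidden units -> 1 output (weights are arbitrary)
x = np.array([0.5, -1.2, 3.0])

W1 = np.array([[0.2, -0.4, 0.1],
               [0.7,  0.3, -0.5]])   # hidden-layer weights
b1 = np.array([0.1, -0.2])           # hidden-layer biases

W2 = np.array([0.6, -0.9])           # output-layer weights
b2 = 0.05                            # output-layer bias

hidden = sigmoid(W1 @ x + b1)        # nonlinear hidden activations
output = sigmoid(W2 @ hidden + b2)   # network prediction in (0, 1)
print(output)
```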
Because the academic and industrial communities have recently redoubled their efforts
in this field, with new approaches appearing almost daily, it is particularly difficult
to keep up with the latest breakthroughs in deep learning. Conceptually, deep learning
can simply be thought of as a neural network with a large number of hidden layers.
Deep learning requires both a sizeable quantity of input data and a sizeable amount of
computing power, because the approach involves automatically tuning a very large number
of parameters inside massive structures. It is easy to see why deep learning
practitioners need powerful computers equipped with GPUs (graphical processing units).
Deep learning algorithms have demonstrated remarkable success in vision (image
classification), text, audio, and video. Most researchers and practitioners in the
field work with one of two software packages, TensorFlow or PyTorch.
7. Transfer Learning
Suppose that you are a data scientist working in the retail sector. You have invested a
significant amount of time and effort in training an accurate model to recognize
pictures of shirts, t-shirts, and polos, and your efforts have been successful. Your
new task is to build a similar model that classifies photographs of different types of
trousers: jeans, cargo pants, casual pants, and formal pants.
Transfer learning refers to reusing knowledge captured in one part of a neural network
for another, conceptually related task, which can greatly speed up the learning
process. Once you have trained a neural network on the data for one job, you can take a
portion of its trained layers and combine them with a few new layers for a different
job. By adding just those extra layers, the newly constructed network can quickly learn
and adapt to the new task.
One of the primary benefits of transfer learning is that it reduces the amount of data
needed to train a neural network. This matters because training deep learning
algorithms is costly in both time and money (computational resources), and, as should
go without saying, finding enough labelled data for training is far from easy.
Returning to the example above, suppose the shirt model is built with a neural network
that has 20 hidden layers. After running a few experiments, you realize that you can
transfer 18 of the shirt model's layers and combine them with one new layer of
parameters, then train on the images of trousers. The trousers model would thus have 19
hidden layers. Although the inputs and outputs of the two jobs are different, the
reusable layers can summarize information that is relevant to both, such as
characteristics of the fabric.
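A minimal sketch of this idea in code, assuming PyTorch and a recent torchvision: a network pretrained on ImageNet stands in for the (hypothetical) shirt model, its layers are frozen, and only a new head for the four trouser classes is trained.

```python
import torch.nn as nn
from torchvision import models

# Load a network pretrained on ImageNet as a stand-in for the "shirt" model
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the reusable layers so their weights are not updated
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for the 4 trouser classes
model.fc = nn.Linear(model.fc.in_features, 4)

# Only the new layer's parameters will be trained
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)   # ['fc.weight', 'fc.bias']
```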
Transfer learning is becoming increasingly popular, and a huge number of solid
pre-trained models are now available for basic deep learning tasks such as image and
text classification. Using these models can speed up the learning process considerably.
8. Reinforcement Learning
Imagine a mouse searching through the twists and turns of a maze to find hidden pieces
of cheese. At first the mouse may move around haphazardly, but after a while it gets a
sense of which actions bring it closer to the cheese, and the more often we put it
through the maze, the better it becomes at finding the cheese at the end.
The same sequence of steps the mouse goes through is followed when a system or game is
trained with reinforcement learning (RL). Broadly speaking, RL is a branch of machine
learning in which an agent learns from its past experiences.
RL maximizes the total reward it receives by recording the actions it takes and then
following a trial-and-error strategy within a fixed environment. In this scenario, the
mouse is the agent and the maze is the environment. The possible actions of the mouse
are to move forward, backward, left, or right, and the reward is the cheese.
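A toy sketch of the same idea using tabular Q-learning on a tiny one-dimensional "maze" (the environment, reward, and hyperparameters are all invented for illustration; this is not the book's method).

```python
import numpy as np

# A tiny corridor of 5 cells; the "cheese" (reward 1) is in the last cell.
# Actions: 0 = move left, 1 = move right.
N_STATES, N_ACTIONS = 5, 2
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != N_STATES - 1:
        # epsilon-greedy choice between exploring and exploiting
        if rng.random() < epsilon:
            action = int(rng.integers(N_ACTIONS))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == N_STATES - 1 else 0.0
        # Q-learning update rule
        Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
        state = next_state

print(np.argmax(Q, axis=1))   # learned greedy action (1 = move right) per cell
```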
You can use RL in situations where you have very little or no historical data about a
problem because, in contrast with traditional machine learning approaches, it does not
need prior data in order to function. Within the RL framework you learn from the data
as you go, acquiring new knowledge at each stage of the process.
It should come as no surprise that RL is exceptionally good at games, particularly
games of "perfect information" such as chess and Go. In games, the agent and the
environment provide feedback quickly, which allows the model to learn fast. The
downside is that if the problem to be solved is very complicated, training an RL system
can take an extremely long time; this is one of the limitations of using RL.
In 1997 IBM's Deep Blue defeated the world's best human chess player, and in 2016 the
RL-based algorithm AlphaGo defeated the world's top Go player; both achievements were
made possible by advances in artificial intelligence. Teams associated with DeepMind
are currently among the leaders in RL research.
Because no RL algorithm had ever managed to win at the video game Dota 2, which is
recognized as incredibly tough, the OpenAI Five team chose to take it on. In April 2019
OpenAI Five became the first artificial intelligence to defeat the reigning
world-champion Dota 2 team, ushering in a new era for the game. Reinforcement learning
is clearly a very powerful form of artificial intelligence, and we will certainly see
more progress from these teams, but it is crucial to keep in mind the constraints that
the method places on a problem.
9. Natural Language Processing
The great majority of the world's information and knowledge is either written down or spoken in some human language. For instance, we can teach our phones to autocomplete the text messages we type or to correct words that were entered incorrectly. Similarly, we might set our watches to adjust the time automatically. We can also teach a computer, given the appropriate instructions, how to hold a basic conversation with a person.
Natural Language Processing, often known by its acronym NLP, is not in itself a machine learning method; rather, it is a methodology that is frequently used to prepare text for machine learning. Imagine you have a large number of text documents in different formats, such as Word files and web blogs. How would you go about organizing them? The vast majority of these documents will contain typos, missing letters, and other tokens that need to be removed from the text. NLTK, the "Natural Language Toolkit," originally developed by researchers at the University of Pennsylvania, is one of the most commonly used software packages for processing text.
Mapping text to a numerical representation becomes far more manageable if one keeps track of the number of times each word appears in each text file. Imagine a matrix of numbers in which each row stands for a separate text document and each column stands for a different word; documents can then be searched for particular terms using this matrix. This representation of term frequencies in matrix form is typically called the term frequency matrix (TFM). We can produce a more useful matrix representation of a text document by dividing each entry in the matrix by a weighting of how important each word is to the entire corpus of texts. When applied to machine learning applications, this method, known as Term Frequency-Inverse Document Frequency (TFIDF), is usually more effective than using raw counts alone.
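As a sketch of how such a weighted term-frequency matrix can be produced in practice, the example below uses scikit-learn's TfidfVectorizer on a few invented toy documents; the method names follow recent scikit-learn releases.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A few toy documents standing in for the mixed collection described above.
docs = [
    "the mouse found the cheese in the maze",
    "reinforcement learning rewards the agent",
    "the agent explores the maze for a reward",
]

vectorizer = TfidfVectorizer()                  # builds the vocabulary and IDF weights
tfidf_matrix = vectorizer.fit_transform(docs)   # rows = documents, columns = terms

print(vectorizer.get_feature_names_out())       # the column (term) labels
print(tfidf_matrix.toarray().round(2))          # TF-IDF weighted document-term matrix
```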
10. Word Embedding
The term frequency matrix (TFM) and TFIDF are both numerical representations of text documents, but they consider only frequency and weighted frequency when characterizing a document. Word embedding, on the other hand, can capture the context in which a word appears within a document. Because word context and embeddings can quantify the degree to which two words are related to one another, we gain the capacity to perform mathematical operations on words.
Word2Vec is a neural-network-based method that maps each word in a corpus to its own numerical vector. We can then use these vectors to locate synonyms, carry out mathematical operations with words, or represent whole text documents (for example, by taking the mean of all the word vectors contained in a document). To obtain useful embeddings, we need a suitably broad corpus of text documents.
Suppose that the notation vector('word') denotes the numerical vector corresponding to the word "word". We can then perform arithmetic on the vectors to approximate the vector that represents 'queen': vector('king') plus vector('woman') minus vector('man') results, approximately, in vector('queen').
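A minimal sketch of this idea, assuming the gensim library (version 4 style API) and a toy corpus of tokenized sentences, is shown below; on such a small corpus the analogy will not actually be reliable, which is exactly why a suitably broad corpus is needed.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; a real application would use a very large text collection.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "man", "walks"],
    ["the", "woman", "walks"],
]

# Train small word vectors (parameters chosen arbitrarily for the sketch).
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=200)

# vector('king') + vector('woman') - vector('man') should land near vector('queen')
# once the corpus is large enough for the geometry to be meaningful.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```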
CHAPTER 5
Supervised learning, in which the algorithm creates a function that maps inputs to outputs that have been specified by the user. One frequent formulation of the supervised learning task is the classification problem, in which the learner must learn (or approximate the behavior of) a function that maps a vector into one of several classes by observing many input-output examples of that function.
Unsupervised learning, which models a collection of inputs without the use of labeled examples as a guide. This type of learning is very commonly used in machine learning.
Semi-supervised learning, in which both labeled and unlabeled examples are used to build an appropriate function or classifier.
Reinforcement learning, in which an algorithm learns a policy for how to act based on observations of the world presented to it. Every action has some influence on the environment, and the learning algorithm uses the feedback it receives from the environment to guide its own behavior.
Transduction, which is similar to supervised learning but does not explicitly construct a function; instead, it tries to predict new outputs based on the training inputs, the training outputs, and the new inputs.
Learning to learn, in which the algorithm learns its own inductive bias based on its previous experience.
Machine learning refers to the process of developing algorithms that give a computer the ability to learn. Learning here does not require conscious thought; it is a matter of recognizing statistical regularities and other patterns in the data, and finding such patterns is an essential part of the learning process. As a consequence, most machine learning algorithms will only superficially resemble the way a person might approach a learning task. Learning algorithms can, however, offer some insight into the relative difficulty of learning in different contexts.
Because the goal of many classification problems is to train the computer to learn a classification system that we have established, supervised learning is applied relatively frequently to this kind of challenge. Learning to recognize numbers, for example, is a common illustration of classification learning. More generally, classification learning can be used for any problem where deducing a classification is useful and the classification itself is easy to determine. In some cases it may not even be necessary to assign pre-determined classes to every instance of a problem, because the agent can work out the classifications on its own; this is an example of unsupervised learning in a classification context. When supervised learning is used, the probability distribution of the inputs is typically not modeled. Such a model is not essential if all of the input values are known, but if some of the input values are missing, it becomes difficult to infer anything about the outputs.
Supervised learning is the most commonly used approach for training neural networks and decision trees. For both of these methods to work properly, they need access to data labeled with pre-determined classes. In the case of neural networks, the classification is used to determine the error of the network, and the network is then adjusted so that the error is reduced to its smallest possible value.
In the case of decision trees, on the other hand, the classification is used to decide which attributes provide the most information for solving the classification problem. For the time being, it is enough to be aware that both of these examples require a certain level of "supervision" in the form of pre-determined categories; both will receive a more in-depth treatment later on.
Inductive machine learning is the process of building a classifier that can generalize to new cases; in other words, it is the process of learning a set of rules from instances (examples in a training set). Figure F offers a detailed explanation of the process of applying supervised machine learning to a real-world problem. The first step is to gather the dataset; this is an absolute prerequisite. If a suitable expert is available, he or she can offer advice on which features (attributes, characteristics) are the most informative. If no expert is available, the most straightforward method is "brute force": measuring everything available in the hope of capturing the right (informative, relevant) features. A dataset compiled with the brute-force approach, however, is not immediately suitable for induction. According to Zhang et al. (Zhang, 2002), it usually contains noise and missing feature values, and therefore requires significant pre-processing.
The second phase comprises data preparation and any necessary pre-processing. According to Batista (2003), researchers can choose from a number of different approaches for dealing with missing data, and the appropriate choice depends on the situation. Hodge et al. (Hodge, 2004) have recently published a review of techniques for detecting outliers and noise in data, analyzing the advantages and disadvantages of each. Instance selection is used not only to control noise but also to cope with the infeasibility of learning from excessively large datasets. When working with such datasets, instance selection is an optimization problem that tries to maintain high mining quality while
simultaneously minimizing the sample size. It does this by selecting only the most relevant instances, which cuts down the total quantity of data and makes it feasible for a data mining algorithm to work effectively even with very large datasets. Many different strategies can be used to pick instances from a large collection to serve as a sample; please refer to the figure labeled 5.2 below.
According to Yu (2004), feature subset selection is the process of identifying and removing as many irrelevant and redundant attributes as possible in order to keep only the most relevant features. This reduces the dimensionality of the data and enables data mining algorithms to work faster and more effectively. The accuracy of supervised machine learning classification models is often hurt by the interdependence of many features on one another. Constructing additional features on top of the primary feature set is one strategy that can be employed to address this problem.
This is done in order to obtain the lowest possible error rate given the information provided. Note something important: in the context of the classification problem, the purpose of the learning algorithm is to minimize the error with respect to the inputs that have been supplied to it. These inputs, frequently referred to as the "training set," are the examples from which the agent tries to learn. Learning the training set well, however, is not necessarily the most useful thing to do.
For example, if I tried to teach you the exclusive-or operator, but could only show you combinations that had one true and one false input, and never both false or both true, you might pick up the rule that the answer is always true.
In a similar vein, one of the most common problems with machine learning algorithms is that they end up "over-fitting" the data, in effect memorizing the training set rather than learning a more general classification approach. You probably already realize that not all training sets have correctly classified inputs, but it bears repeating. This can lead to complications if the algorithm being used is powerful enough to memorize even the "special cases" that do not fit the more general rules. This, too, leads to overfitting, and it is challenging to design algorithms that are powerful enough to learn sophisticated functions while still yielding conclusions that generalize.
Unsupervised learning seems the more difficult of the two: the goal is to have the computer learn how to do something that we do not tell it how to do. There are actually two distinct approaches to unsupervised learning. The first entails teaching the agent not by providing explicit categorizations but by using some form of reward system to indicate success; in other words, the agent is not given explicit instructions. Note that this sort of training can usually be framed as a decision-making problem.
This is because the goal is not to produce a classification but to make decisions that maximize rewards. The approach generalizes readily to the real world, where agents may be rewarded for some activities and penalized for others. In many circumstances, this kind of unsupervised learning can be accomplished through reinforcement learning.
In this sort of learning, the agent bases its actions on previous rewards and punishments, without necessarily possessing any information about the precise ways in which its actions affect the world. Once the agent has acquired a reward function through training, it knows what to do without further computation, because it knows the exact reward it can expect for each action it might take and can therefore predict its payoff regardless of which action it selects. This can be immensely helpful in situations where calculating every possible option would take an extremely long time (even if all the transition probabilities between world states were already known). On the other hand, acquiring knowledge through what is essentially trial and error can be very time-consuming. Nevertheless, because it does not need a previously classified dataset, this strategy can be highly effective; for instance, our own categorizations may not always be the most precise ones possible.
A striking example is how the conventional wisdom about the game of backgammon was turned on its head when a series of computer programs (Neurogammon and TD-Gammon) that learned in this unsupervised, self-play fashion became stronger than the best human backgammon players simply by playing against themselves over and over again. These programs beat earlier backgammon systems that had been trained on pre-classified examples, and they discovered backgammon concepts that astounded the game's professionals.
One subtype of unsupervised learning is clustering. This kind of learning is not intended to maximize a utility function; instead, it seeks to find patterns of similarity in the training data. The assumption is often made that the clusters discovered will match reasonably well with an intuitive classification. For instance, clustering individuals based on demographics might place those with more money in one group and those with less money in another. The algorithm will not have names to assign to these clusters, but it can produce them and then use them to assign new examples to one cluster or another. This is a data-driven approach that can work well when there is a sufficiently large volume of data. For instance, social information filtering algorithms, such as those used by Amazon.com to recommend books, work by finding groups of similar people and then assigning new users to groups.
In some cases, such as social information filtering, the information about other members of a cluster (for example, what books they enjoy reading) may be sufficient for the algorithm to produce relevant results. In other cases, the clusters are merely a useful tool for a human analyst. Unfortunately, even unsupervised learning suffers from the problem of overfitting the training data, and there is no foolproof way to avoid it, because any algorithm that can learn from its inputs has to be quite powerful.
In this discussion, I will consider the problem of overfitting within a certain class of histogram clustering models, which are widely used in applications such as information retrieval, language analysis, and computer vision. Large deviation theory and the maximum entropy principle, which drives the learning process, are used to design learning algorithms with a high degree of robustness to changes in sample size.
Unsupervised learning has led to a variety of advances, including computer programs that can play backgammon at world-champion level and computers that can drive vehicles. Reward-based learning can be extremely effective when there is an easy way to associate values with actions. Clustering is useful when there is enough data to form clusters (though this can be difficult at times), and especially when additional data about the members of a cluster can be used to draw further conclusions owing to dependencies in the data. Classification learning is powerful when the classifications are known to be accurate (for instance, with diseases, it is often possible to determine the diagnosis after the fact by performing an autopsy) or when the classifications are simply arbitrary labels that we would like the computer to be able to recognize for us.
1. Linear Classifiers:
   Logistic Regression
   The Naive Bayes Classifier
   The Perceptron
   Support Vector Machine
2. Quadratic Classifiers
3. The K-Means Clustering Algorithm
4. Boosting
5. The Decision Tree and the Random Forest
6. Neural Networks
7. Bayesian Networks
Linear Classifiers: In machine learning, the aim of classification is to group items that have comparable feature values into classes, and linear classifiers are one way of achieving this. According to Timothy et al. (Timothy Jason Shepard, 1998), a linear classifier achieves this by making its classification decision based on the value of a linear combination of the features. If the feature vector given to the classifier is a real vector x, then the output score is
y = f(w · x) = f(Σj wj xj),
where w is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. The weight vector w is learned from a collection of labeled training examples. The function f is often a simple threshold that maps all values above a preset cutoff to the first class and all remaining values to the second class. A more complex f might give the probability that an item belongs to a particular class.
Linear classifiers are often used in situations where the speed of classification is an essential concern, since a linear classifier is typically the fastest classifier, although decision trees can be more productive. In addition, linear classifiers frequently perform very well when the number of dimensions is large, as in document classification, where each element of x is typically the count of a word in a document (see the document-term matrix). In such cases, the classifier should be well regularized.
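A minimal NumPy sketch of the scoring rule y = f(w · x), with made-up weights and a simple threshold standing in for f, might look like this:

```python
import numpy as np

def linear_classify(x, w, threshold=0.0):
    """Return class 1 if the weighted sum w . x exceeds the threshold, else class 0."""
    score = np.dot(w, x)              # y = f(w . x): the linear combination of features
    return 1 if score > threshold else 0

# Made-up weights and a feature vector, e.g. word counts from a document-term matrix.
w = np.array([0.4, -0.2, 0.1])
x = np.array([3.0, 1.0, 0.0])
print(linear_classify(x, w))          # -> 1, since 0.4*3 - 0.2*1 = 1.0 > 0
```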
Support Vector Machine (SVM): According to Luis et al. (Luis Gonz, 2005), a Support Vector Machine performs classification by constructing an N-dimensional hyperplane that separates the input into two groups in the most effective way. SVM models are closely related to the family of neural network models; in fact, an SVM model that uses a sigmoid kernel function is equivalent to a two-layer perceptron neural network.
There is thus a close connection between Support Vector Machine models and more conventional multilayer perceptron neural networks. Classifiers such as polynomial, radial basis function, and multilayer perceptron classifiers all employ a kernel function, and support vector machines offer an alternative way of training them: the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving the non-convex, unconstrained minimization problem used in standard neural network training. The goal of SVM modeling is to find the hyperplane that partitions vector clusters in such a way that cases belonging to one category of the target variable lie on one side of the plane and cases belonging to the other category lie on the other side. The vectors located closest to the hyperplane are called the support vectors. The figure below presents an overview of the SVM process.
Before discussing hyperplanes in N dimensions, let us begin with a simple two-dimensional example. Assume that we wish to perform a classification and that our data contain a categorical target variable with two possible values. Also assume that there are two predictor variables, both with continuous ranges of values. If we plot the data points with the value of one predictor on the X axis and the value of the other on the Y axis, we might end up with a picture like the one shown below, in which one category of the target variable is represented by rectangles and the other by ovals.
Fig.5.3
In this idealized picture, the cases belonging to one category sit in the bottom left corner, the cases belonging to the other category sit in the top right corner, and the two groups are completely separated. The SVM analysis attempts to find a one-dimensional hyperplane (that is, a line) that cleanly separates the cases into their respective target categories. Two of the infinitely many lines that could be drawn are shown in the figure above. The questions to be answered are which line is preferable, and how the optimal line should be defined.
The dashed lines drawn parallel to the separating line show the distance between the dividing line and the vectors closest to it. The space contained inside the dashed lines is referred to as the margin. The vectors (points) that define the width of the margin are called support vectors. This idea is illustrated in the figure below.
Fig.5.4
An SVM analysis finds the line (or, more generally, the hyperplane) oriented so that the margin between the support vectors is maximized. In the figure above, the line in the right panel is superior to the line in the left panel.
If all studies consisted of a target variable with two categories, two predictor variables, and a cluster of points that could be divided by a straight line, life would be much easier. Unfortunately, this is not typically the case, so SVM must be able to deal with (a) more than two predictor variables, (b) separating the points with non-linear curves, (c) handling the cases where clusters cannot be completely separated, and (d) handling classifications with more than two categories.
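In practice, libraries such as scikit-learn handle these complications through kernels, soft margins, and multi-class strategies. The brief sketch below, on synthetic two-moon data with arbitrarily chosen parameters, shows an RBF-kernel SVM drawing a non-linear boundary:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a case where no straight line separates the classes.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An RBF kernel lets the SVM draw a non-linear boundary; C controls the soft margin.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)
```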
In this chapter, we explore three of the most significant machine learning techniques, along with examples of their application and a practical demonstration of how they operate: K-Means Clustering, Neural Networks, and Self-Organizing Maps.
5.4.1 The K-Means Clustering Algorithm
K-means is one of the simplest unsupervised learning algorithms for solving the clustering problem. The procedure classifies a given data set using a certain number of clusters, k, fixed in advance, and its steps are repeated until no point changes group. The major purpose of
this endeavor is to locate k centroids, with the intention of assigning one to each cluster.
Because placing these centroids in different locations produces different results, their positioning needs to be considered carefully: the most judicious course is to place them as far apart from one another as possible. The next step is to take each point in the data set and assign it to the closest centroid. When no points remain to be assigned, the first step is complete and an early grouping has been carried out. At this stage we must compute k new centroids to act as barycenters of the clusters formed in the previous phase. After obtaining these k new centroids, we carry out a new binding between the points of the data set and the nearest new centroid. A loop has thus been created.
As we iterate through this loop, we may notice that the k centroids gradually change position until no further modifications are made; in other words, the centroids stop moving. In conclusion, the goal of this approach is to minimize an objective function, in this instance a squared error function (see Fig. 5.6).
Fig.5.6
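For reference, the squared-error objective that k-means attempts to minimize (presumably the formula that Fig. 5.6 depicts) is commonly written as follows, where S_j is the set of points assigned to cluster j and c_j is its centroid:

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in S_j} \left\lVert x_i - c_j \right\rVert^{2}
```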
Although it can be shown that the procedure will always terminate, the k-means algorithm does not necessarily find the configuration corresponding to the minimum of the global objective function. The method is also quite sensitive to the initial randomly chosen cluster centers; running the k-means algorithm several times can reduce this effect. K-means is a simple algorithm that has been applied to many problem domains and, as we will see, it is a good candidate for extension to work with fuzzy feature vectors.
An Example
Suppose that we have n sample feature vectors x1, x2, ..., xn, all from the same class, and we know that they fall into k compact clusters, with k less than n. Let mi be the mean of the vectors in cluster i. If the clusters are well separated, we can use a minimum-distance classifier to separate them: that is, we can say that x is in cluster i if the distance between x and mi is the smallest of all k distances. This suggests the following procedure for finding the k means:
Make initial guesses for the means m1, m2, ..., mk
Until there are no changes in any mean:
    Use the estimated means to classify the samples into clusters
    For i from 1 to k:
        Replace mi with the mean of all of the samples for cluster i
    End for
End until
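A compact Python sketch of this procedure, using NumPy, random initial means, and invented two-dimensional data, might look like the following:

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    """Cluster the rows of X into k groups by iteratively updating the means."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # initial guesses: k random samples
    for _ in range(n_iters):
        # Assign every sample to the nearest current mean.
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Replace each mean with the average of the samples assigned to it.
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
                              for i in range(k)])
        if np.allclose(new_means, means):                   # stop when nothing moves
            break
        means = new_means
    return means, labels

# Two obvious blobs of 2-D points, invented for the demonstration.
X = np.vstack([np.random.randn(50, 2) + [0, 0], np.random.randn(50, 2) + [5, 5]])
centers, labels = k_means(X, k=2)
print(centers.round(2))
```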
The following example provides a visual representation of how the means m1 and m2
make their way towards the centers of two clusters:
Fig.5.7
The k-means procedure shown here is a simple, straightforward implementation. It can be viewed as a greedy algorithm for partitioning the n samples into k clusters so as to minimize the sum of the squared distances to the cluster centers. It does, however, have the following problems:
1. The way the initial means should be chosen is not specified. A common way to start is to pick k of the samples at random.
2. The results depend on the initial values of the means, and it frequently happens that suboptimal partitions are found. A common remedy is to try a number of different starting points.
3. It can happen that the set of samples closest to mi is empty, so that mi cannot be updated. This is an annoyance that must be handled during the implementation.
4. The results depend on the metric used to measure the distance between x and mi. It is common practice to normalize each variable by its standard deviation, although this is not always desirable.
5. The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how many clusters exist. Applying the same algorithm to the same data as in the example just shown produces the following 3-means clustering. Is it an improvement or a loss in overall clustering quality compared with the 2-means clustering?
Fig.5.8
Unfortunately, there is no general theoretical answer for how many clusters should be used for any given collection of data. A simple and effective approach is to compare the outcomes of multiple runs with different numbers of classes k and choose the one that produces the best overall performance.
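One common empirical way of making this comparison is to run the algorithm for several values of k and inspect the total squared error of each run (exposed by scikit-learn's KMeans as the inertia_ attribute); the data below is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Fit K-means for a range of k and compare the squared-error objective of each run.
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: total squared error = {model.inertia_:.1f}")
```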
5.4.2 Neural Networks
Although it is customary for each network to perform only a single task, neural networks (Bishop C. M., 1995) are in fact capable of performing several regression and/or classification tasks concurrently. In practice, however, a network usually has only a single output variable, although in the case of many-state classification problems this may correspond to several output units (the post-processing stage is responsible for mapping output units to output variables); the great majority of classification problems involve only a small number of states. If you do build a single network with numerous output variables, it may suffer from cross-talk: the hidden neurons have difficulty learning because they are attempting to represent at least two functions at once. The most productive course of action is usually to train separate networks for each output and then combine them into an ensemble so that they can be executed as a single entity. The most widely used neural method of this kind is the feed-forward network (the multilayer perceptron): the output of each node is produced by passing a weighted sum of its inputs through a transfer function, and the nodes are organized in a layered topology that feeds information forward. As a consequence, the network can be understood as a simple input-output model, with the weights and thresholds (biases) acting as the free parameters of the model.
The number of input and output units is determined by the problem itself. For now, we will work on the assumption that the input variables have been chosen sensibly and that each is at least somewhat significant. There is, however, no hint of how many hidden units should be employed. A reasonable starting point is to use one hidden layer with a number of units equal to half the total number of input and output units. We will discuss how to choose an appropriate number later.
Putting Some Multilayer Perceptrons Through Their Paces: Once the number of layers and the number of units in each layer have been chosen, the network's weights and thresholds must be set so that the error made by the network's predictions is brought down to the lowest achievable level. This is the primary job of the training algorithms: the historical examples that you have gathered are used to automatically adjust the weights and thresholds in order to reduce that error. The process is equivalent to fitting the model represented by the network to the training data available.
The error associated with a particular configuration of the network can be computed by running all of the training examples through the network and comparing the actual outputs generated with the desired or target outputs. The differences are combined by an error function to give the network error. The two error functions used most often are the sum squared error and the cross-entropy. The sum squared error is used for regression problems.
With this method, the individual errors of the output units are squared and then summed. Cross-entropy functions are used for maximum likelihood classification. With more conventional modeling techniques, such as linear modeling, it is possible to algorithmically determine the model configuration that absolutely minimizes this error; this is one of the benefits of those approaches. Other modeling methods, such as non-linear modeling, do not have this property: although we can adjust a neural network to reduce its error, there is no way to determine with absolute certainty that the error could not be reduced further.
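As a small illustration of the two error functions described above, the NumPy snippet below computes a sum squared error and a cross-entropy for a single invented target/output pair:

```python
import numpy as np

def sum_squared_error(targets, outputs):
    """Square each output unit's error and add them up (used for regression)."""
    return np.sum((targets - outputs) ** 2)

def cross_entropy(targets, outputs, eps=1e-12):
    """Negative log-likelihood of the targets under the network's class probabilities."""
    outputs = np.clip(outputs, eps, 1.0 - eps)
    return -np.sum(targets * np.log(outputs))

y_true = np.array([0.0, 1.0, 0.0])          # a one-hot classification target (invented)
y_prob = np.array([0.2, 0.7, 0.1])          # the network's predicted probabilities (invented)
print(sum_squared_error(y_true, y_prob))    # 0.04 + 0.09 + 0.01 = 0.14
print(cross_entropy(y_true, y_prob))        # -log(0.7), roughly 0.357
```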
This is the price to be paid for the enhanced non-linear modeling capability that neural networks provide. The concept of the error surface is helpful here. Each of the N weights and thresholds of the network (the "free parameters" of the model) is taken to be a dimension in space. The N+1th dimension is the network error. For any possible configuration of weights, the error can be plotted in the N+1th dimension, forming an error surface. The objective of network training is to find the lowest point on this many-dimensional surface.
In a linear model with a sum squared error function, this error surface is a parabola (a quadratic), meaning it is a smooth bowl shape with a single minimum. As a direct result, locating the minimum is "easy." Neural network error surfaces are far more complex and are characterized by a number of unhelpful features, such as local minima (points that are lower than the surrounding terrain but higher than the global minimum), flat spots and plateaus, saddle points, and long, narrow ravines.
Because it is not possible to determine analytically where the global minimum of the error surface lies, neural network training is essentially an exploration of the error surface. From an initially random configuration of weights and thresholds (a random point on the error surface), the training algorithms incrementally seek the global minimum. Typically, the gradient (slope) of the error surface is calculated at the current point and used to make a downhill move. Eventually, the algorithm stops at a low point, which may be a local minimum (but hopefully is the global minimum).
This is the approach taken by the back propagation algorithm. Back propagation (Haykin, 1994; Patterson, 1996; Fausett, 1994) is probably the best-known example of a neural network training algorithm. Although more modern second-order algorithms such as conjugate gradient descent and Levenberg-Marquardt (see Bishop, 1995; Shepherd, 1997), both included in ST Neural Networks, are substantially faster (for example, an order of magnitude faster) for many problems, back propagation still has advantages in certain situations and is the easiest algorithm to understand and to implement.
We introduce it here and will discuss more sophisticated algorithms later. In back propagation, the gradient vector of the error surface is calculated. This vector points along the line of steepest descent from the current point, so we know that if we move along it a "short" distance, the error will decrease. A sequence of such moves (slowing as we near the bottom) will eventually find a minimum of some kind. The difficult part is deciding how large the steps should be at each stage of the process.
Large steps may converge more quickly, but they may also overshoot the solution or (if the error surface is sufficiently eccentric) go off in the wrong direction. A classic example of this in neural network training is when the algorithm progresses very slowly along a steep, narrow valley, bouncing from one side to the other. In contrast, very small steps may head in the correct direction but require a large number of iterations. In practice, the step size is determined both by the slope (to ensure that the algorithm eventually converges on a minimum) and by a special constant known as the learning rate. The best value for the learning rate is usually found by experiment, and it may be varied as training proceeds, often becoming more cautious over time.
The algorithm is also typically modified by the inclusion of a momentum term. This encourages movement in a fixed direction, so that if several steps are taken in the same direction the algorithm "picks up speed," which gives it the ability (sometimes) to escape local minima and also to move rapidly over flat spots and plateaus.
The algorithm therefore progresses iteratively through a number of epochs. In each epoch, the training cases are submitted to the network in turn, the target and actual outputs are compared, and the error is calculated. This error, together with the error surface gradient, is used to adjust the weights, and then the process is repeated. The initial network configuration is random, and training stops when a predetermined number of epochs has elapsed, when the error reaches an acceptable level, or when it no longer makes progress; you can choose which of these stopping conditions to use.
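The sketch below shows the core of such a loop for a deliberately tiny problem: gradient descent on a single weight of a toy error surface, with a learning rate and a momentum term of the kind discussed above; every constant here is arbitrary.

```python
# Minimize a toy error surface E(w) = (w - 3)^2 with gradient descent plus momentum.
def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)

w, velocity = -5.0, 0.0            # an arbitrary starting point on the error surface
learning_rate, momentum = 0.1, 0.8
tolerance, max_epochs = 1e-8, 1000

for epoch in range(max_epochs):
    # Step downhill: combine the current slope with a fraction of the previous step.
    velocity = momentum * velocity - learning_rate * gradient(w)
    w += velocity
    if error(w) < tolerance:       # stop once the error is acceptably small
        break

print(f"stopped after {epoch + 1} epochs, w = {w:.4f}")
```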
Over-learning and Generalization: One major problem with the approach outlined above is that it does not actually minimize the error we are really interested in, which is the error the network is expected to make whenever new examples are presented to it. In other words, the property network designers covet most is the ability of a network to generalize to new cases. In reality, the network is trained to minimize the error on the training set, and unless one has a training set that is both perfect and infinitely large, this is not the same thing as minimizing the error on the real error surface, the error surface of the underlying model (Bishop C. M., 1995).
A useful analogy is fitting a polynomial curve to a set of noisy data points: because of the noise, we should not expect the best model to pass exactly through every point in the dataset. A high-order polynomial, however, may be excessively flexible, fitting the data precisely by adopting a highly eccentric shape that is actually unrelated to the underlying function; a low-order polynomial, in contrast, may not have enough flexibility to fit close to the points. This trade-off is illustrated in the figure labeled 5.4.
Exactly the same problem arises in neural networks. A network with more weights models a more complex function and is therefore more prone to overfitting, whereas a network with fewer weights may not be powerful enough to reproduce the underlying function. For example, a network with no hidden layers can only model a simple linear function.
This raises the question of how to select the right complexity for the network. A larger network will almost invariably achieve a lower training error eventually, but this may indicate overfitting rather than good modeling. The answer is to compare progress against an independent data set, the selection set. In back propagation, some cases are held back and are not actually used for training; they are instead used to keep an independent check on the progress of the algorithm. The initial performance of the network on the training and selection sets is always the same (if it is not at least approximately the same, the division of cases between the two sets is probably biased).
As training progresses, the training error naturally drops, and provided training is minimizing the true error function, the selection error drops too. If, however, the selection error stops dropping, or indeed starts to rise, this indicates that the network is starting to overfit the data, and training should stop and the network should be restored to its best previous state. Overfitting that occurs in this way during training is referred to as over-learning, and it is damaging to the network's performance.
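A minimal sketch of this early-stopping check is given below. It is not taken from any particular package; the train_one_epoch and selection_error routines are hypothetical placeholders standing in for whatever training and evaluation code is actually used.

```python
import copy

def train_with_early_stopping(network, train_data, selection_data,
                              max_epochs=1000, patience=10):
    """Stop training when the selection (validation) error stops improving.

    `network` is assumed to expose hypothetical train_one_epoch() and
    selection_error() routines; this is a sketch of the idea, not a
    reference implementation.
    """
    best_error = float("inf")
    best_state = copy.deepcopy(network)   # remember the best network so far
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        network.train_one_epoch(train_data)           # lowers the training error
        sel_error = network.selection_error(selection_data)

        if sel_error < best_error:                    # still generalizing
            best_error = sel_error
            best_state = copy.deepcopy(network)
            epochs_without_improvement = 0
        else:                                         # possible over-learning
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                 # stop and discard the recent epochs

    return best_state                                 # the network at its best selection error
```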
If over-learning occurs, it is usually advisable to reduce the number of hidden units and/or hidden layers, since the network is over-powerful for the problem at hand. In contrast, if the network is not powerful enough to model the underlying function, over-learning is unlikely to occur, and neither the training error nor the selection error will drop to a satisfactory level. The problems associated with local minima, and with decisions over the size of network to use, mean that using a neural network typically involves experimenting with a large number of different networks, probably training each one several times (to avoid being fooled by local minima) and examining their individual performances. The key indicator of performance in this setting is the selection error. However, in line with the general scientific principle that, all else being equal, a simple model is preferable to a complicated one, you may also decide to prefer a smaller network to a larger one whose selection error is only marginally better.
The strategy of repeated experimentation has a weakness: the selection set plays a key role in choosing the model, which means that it is actually part of the training process. Its reliability as an independent guide to the performance of the model is therefore compromised; with enough experiments, you may simply get lucky and find a network that happens to perform well on the selection set. It is therefore common practice (at least where the volume of training data allows it) to reserve a distinct collection of cases as a test set.
The final model is then checked against the test set, to confirm that the results obtained on the selection and training sets are real and not artifacts of the training process. Of course, to fulfill this role properly the test set should be used only once; if it is subsequently used to adjust and repeat the training process, it is effectively being used as selection data, and can no longer provide an independent check on the model.
The need to divide the data into multiple subsets is unfortunate, given that we usually have less data than we would ideally like even for a single subset. We can deal with this problem, to some extent, by resampling: experiments can be conducted using different divisions of the available data into training, selection, and test sets, which gives the findings a greater degree of robustness. There are a number of approaches to this kind of subset analysis, including cross-validation, bootstrapping, and random (Monte Carlo) resampling, to mention just a few.
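As one concrete illustration, the sketch below draws repeated random training/selection/test partitions from a single pool of cases (a simple Monte Carlo resampling scheme). Dedicated cross-validation utilities exist in most libraries; plain NumPy is used here only to show the idea, and the split fractions are illustrative.

```python
import numpy as np

def random_partitions(n_cases, n_repeats=10, fractions=(0.6, 0.2, 0.2), seed=0):
    """Yield (train, selection, test) index arrays for repeated random resampling."""
    rng = np.random.default_rng(seed)
    n_train = int(fractions[0] * n_cases)
    n_sel = int(fractions[1] * n_cases)
    for _ in range(n_repeats):
        order = rng.permutation(n_cases)           # a fresh random shuffle each repeat
        yield (order[:n_train],                    # training cases
               order[n_train:n_train + n_sel],     # selection cases
               order[n_train + n_sel:])            # test cases

# Example: report the subset sizes over three resamples of 100 cases.
for train_idx, sel_idx, test_idx in random_partitions(100, n_repeats=3):
    print(len(train_idx), len(sel_idx), len(test_idx))
```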
If we conduct a number of experiments using a range of different subset samples and base our design decisions on their outcomes, the reliability of the findings increases considerably. For instance, we may determine which configuration of neural network performs best. We then have two options: either use those experiments solely to decide which kind of network to use and train the final networks from scratch on new samples (which removes any sampling bias), or retain the best networks found during the sampling process and average their results in an ensemble (which at least mitigates the sampling bias). Both options are reasonable.
Following the selection of the input variables, the development of a network typically involves the following steps. This gives a basic picture of what is involved.
Pick a starting configuration (typically one hidden layer, with the number of hidden units equal to half the sum of the number of input and output units).
Iteratively conduct a number of experiments with each configuration, and retain the configuration that produces the best selection error. Several experiments per configuration are needed to reduce the chance of being misled if training locates a local minimum; it is also strongly advisable to resample the data.
On each trial, if under-learning occurs (the network does not achieve an acceptable performance level), try adding more neurons to the hidden layer(s). Take this step only if the under-learning is genuine rather than an artifact of the experiment. If that does not help, try adding a second hidden layer.
If the selection error starts to rise, which may indicate over-learning, consider removing hidden units (and possibly layers).
Once you have identified, by this trial-and-error process, an effective configuration for your networks, resample and generate new networks with that configuration, as in the sketch below.
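The loop below sketches this search over configurations. The build_network constructor and the train and selection_error routines are hypothetical stand-ins rather than the API of any particular tool; only the outer structure (several trials per candidate size, keep the best by selection error) is the point.

```python
def search_configurations(n_inputs, n_outputs, train_data, selection_data,
                          build_network, hidden_options=None, trials_per_config=5):
    """Try several hidden-layer sizes and keep the one with the lowest selection error.

    `build_network(n_hidden)` is assumed to return an object with hypothetical
    train() and selection_error() methods -- a sketch, not a fixed API.
    """
    if hidden_options is None:
        start = (n_inputs + n_outputs) // 2            # rule-of-thumb starting point
        hidden_options = [start, start * 2, start * 3]

    best = None
    for n_hidden in hidden_options:
        for _ in range(trials_per_config):             # several trials to dodge local minima
            net = build_network(n_hidden)
            net.train(train_data)
            err = net.selection_error(selection_data)
            if best is None or err < best[0]:
                best = (err, n_hidden, net)

    return best                                        # (selection error, size, network)
```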
Figuring Out Which Pieces of Data to Use: All of the preceding steps rest on one key assumption: the data used for training, selection, and testing must be representative of the underlying model (and, furthermore, each of the three sets must individually be representative of it). The old computer-science adage "garbage in, garbage out" could not apply more strongly than in neural modeling. If the training data are not representative of the target population, the usefulness of the model is at best substantially reduced, and at worst the model may be of no use at all. It is worth spelling out the kinds of problems that can corrupt a training set, including the following:
The future is not the past. Training data is usually historical; if circumstances have changed, relationships that held in the past may no longer hold. All eventualities must be covered. A neural network can only learn from the cases it has already seen. If people with incomes over $100,000 per year are a poor credit risk, but your training data includes nobody with an income over $100,000, you cannot expect the network to make the right decision when it encounters such a case for the first time. Extrapolation is dangerous with any model, but some types of neural network can make particularly wild predictions under such circumstances.
A network learns the easiest features it can find. A classic (possibly apocryphal) illustration of this is a vision project intended to recognize tanks automatically. A network is trained on one hundred pictures containing tanks and one hundred pictures without tanks, and achieves a perfect score. When tested on new data, however, it proves completely useless. The reason? The pictures containing tanks were taken on dull, rainy days, while those without tanks were taken on bright, sunny days.
The network learned to discriminate on the basis of the (trivial) difference in overall light levels. To work properly, the network would need training cases covering every type of terrain, weather, and lighting condition under which it is expected to operate, not to mention all distances and angles of view.
Unbalanced data sets. Since a network minimizes an overall error, the proportions of the different types of data in the training set are critically important.
A network trained on a data set with 900 good cases and only 100 bad cases will bias its decisions towards the good cases, since this allows the algorithm to lower the overall error, which is much more heavily influenced by the good cases. If the actual population has a different distribution of good and bad cases, the network's decisions may be wrong, because they reflect the training distribution rather than the population. Disease diagnosis is a good example. Routine screening might show ninety percent of those screened to be free of the disease, and a network is trained on the available data with that 90/10 split.
The network is then used to diagnose patients who present with particular symptoms and for whom the prior likelihood of having the disease is closer to even. The network will react over-cautiously and fail to recognize the disease in some individuals who are not healthy. If, in contrast, it is trained on "complainant" data and then tested on "routine" data, it is quite likely to produce a considerable number of false positives. In such circumstances, the data set may need to be crafted to take account of the distribution (for example, you could replicate the less common cases or remove some of the more common ones), or the network's decisions can be modified by adding a loss matrix (Bishop C. M., 1995). Often the most efficient approach is to ensure that the different outcomes are represented in roughly equal numbers, and then to interpret the network's decisions in the light of the actual circumstances.
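By way of illustration, the sketch below rebalances a data set by replicating the under-represented cases, and then rescales the network's output probability when the deployment prior differs from the training prior; the 90/10 and 50/50 figures are simply the ones used in the example above, not fixed values.

```python
import numpy as np

def oversample_minority(X, y, rng=None):
    """Replicate minority-class rows until all classes are equally represented."""
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.where(y == c)[0]
        extra = rng.choice(idx, size=n_max - n, replace=True)  # duplicates of the rarer cases
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

def adjust_for_prior(p_disease_given_x, train_prior=0.1, deploy_prior=0.5):
    """Rescale a probability estimate when the class prior changes between training and use."""
    odds = (p_disease_given_x / (1 - p_disease_given_x)) * \
           (deploy_prior / train_prior) * ((1 - train_prior) / (1 - deploy_prior))
    return odds / (1 + odds)
```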
Self-Organizing Feature Map (SOFM, or Kohonen) networks are used quite differently from the other types of network. Whereas all of the other networks are designed for supervised learning tasks, SOFM networks are designed primarily for unsupervised learning (Haykin, 1994; Patterson, 1996; Fausett, 1994). In supervised learning, the training data set contains cases featuring input variables together with the associated outputs, and the network must infer a mapping from the inputs to the outputs; in unsupervised learning, the training data set contains only input variables. At first this may seem strange: without outputs, what can the network possibly learn? The answer is that the SOFM network attempts to learn the structure of the data.
As Kohonen (Kohonen, 1997) elaborated, one possible application is therefore exploratory data analysis. The SOFM network can learn to recognize clusters of data, and can also relate classes that are similar to one another. The user can build up an understanding of the data, which is then used to refine the network. Once the network has recognized the various classes of data and learned to label them, it can be used for classification tasks. SOFM networks can also be used for classification when output classes are immediately available; one of the advantages of using them in this way is their ability to highlight similarities between classes.
A SOFM network has only two layers: the input layer and an output layer of radial units (also known as the topological map layer). Normally the units that make up the topological map layer are laid out in two-dimensional space, although ST Neural Networks also supports one-dimensional Kohonen networks.
SOFM networks are trained using an iterative algorithm (Patterson, 1996). Starting with an initially random set of radial centers, the algorithm gradually adjusts them so that they reflect the clustering of the training data. On one level this is analogous to the sub-sampling and K-Means techniques used to assign centers in the types of network described above, and indeed the SOFM algorithm can also be used to assign centers for those networks. However, the algorithm also operates on another, independent level.
The iterative training procedure also arranges the network so that units representing centers that are close together in the input space are also situated close together on the topological map; in this way the network more faithfully reflects the input space. You can think of the network's topological layer as a crude two-dimensional grid that must be bent and distorted to fit into the N-dimensional input space while preserving its original structure as far as possible. Clearly, any attempt to represent an N-dimensional space in two dimensions loses detail; nevertheless, the technique can be valuable in allowing the user to visualize data that might otherwise be impossible to understand.
The core iterative Kohonen algorithm simply runs through a predetermined number of epochs, on each epoch applying the following procedure to each training case:
• First, select the winning neuron: the one whose center is positioned closest to the input case.
• Second, adjust the winning neuron so that it is more like the input case, using a weighted sum of the old neuron center and the training case.
The algorithm uses a learning rate that decays over time. This learning rate is used in the weighted sum, so that the adjustments become more subtle as the epochs pass, and the centers eventually settle into a stable representation of the cases for which that neuron wins. The topological ordering property is achieved by adding the concept of a neighborhood to the algorithm. The neighborhood is the set of neurons surrounding the winning neuron. Like the learning rate, the neighborhood decays over time, so that initially quite a large number of neurons belong to it (perhaps almost the whole topological map), whereas in the later stages it shrinks to zero (that is, it consists only of the winning neuron itself). In the Kohonen algorithm, the adjustment of neurons is actually applied not just to the winning neuron, but to all the members of the current neighborhood.
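Put together, one epoch of this procedure might look like the sketch below. It assumes the centers are stored as a NumPy array of shape (grid_height, grid_width, n_inputs); the linear decay schedules and parameter names are illustrative choices, not part of any particular package.

```python
import numpy as np

def kohonen_epoch(centers, cases, epoch, n_epochs,
                  lr_start=0.5, lr_end=0.01, radius_start=None):
    """One epoch of the basic Kohonen algorithm with decaying rate and neighborhood."""
    grid_h, grid_w, _ = centers.shape
    frac = epoch / max(1, n_epochs - 1)                       # 0 -> 1 as training proceeds
    lr = lr_start + frac * (lr_end - lr_start)                # decaying learning rate
    radius_start = radius_start or max(grid_h, grid_w) / 2
    radius = max(radius_start * (1 - frac), 0.0)              # shrinking neighborhood

    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    for x in cases:
        dist2 = ((centers - x) ** 2).sum(axis=2)              # squared distance to every unit
        wr, wc = np.unravel_index(np.argmin(dist2), dist2.shape)   # winning neuron
        grid_dist = np.hypot(rows - wr, cols - wc)            # distance on the map itself
        in_hood = (grid_dist <= radius) | ((rows == wr) & (cols == wc))
        # pull the winner and its current neighborhood towards the input case
        centers[in_hood] += lr * (x - centers[in_hood])
    return centers
```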
The effect of this neighborhood update is that initially quite large areas of the network are "dragged towards" training cases, and dragged quite substantially. The network develops a crude topological ordering, with similar cases activating clumps of neurons in the topological map. As the epochs pass, the learning rate and the neighborhood both decrease, allowing finer distinctions to be drawn within areas of the map, and ultimately the fine-tuning of individual neurons.
Training is therefore typically divided into two distinct phases: a relatively short initial phase with high learning rates and a large neighborhood, and a deliberately long second phase with low learning rates and a zero or near-zero neighborhood. Once the network has been trained to recognize structure in the data, it can be used as a visualization tool for examining the data.
When the training cases are executed, the Win Frequencies Datasheet, which counts the number of times each neuron wins, can be examined to see whether distinct clusters have formed on the map. Individual cases are then executed and the topological map observed, to see whether some meaning can be assigned to the clusters (this usually involves referring back to the original application area, so as to establish the relationship between the cases grouped together). Once the clusters have been identified, the neurons in the topological map are labeled to indicate the meanings of the clusters (sometimes individual cases are labeled too). Once the topological map has been built in this way, new cases can be submitted to the network. If the winning neuron has been labeled with a class name, the network can perform classification; if not, the network is regarded as undecided.
SOFM networks also make use of the accept threshold when performing classification. Since the activation level of a neuron in a SOFM network is the distance of the neuron from the input case, the accept threshold acts as a maximum acceptable distance: if the distance of the winning neuron exceeds this threshold, the SOFM network is regarded as undecided. By labeling all the neurons and setting the accept threshold appropriately, a SOFM network can therefore act as a novelty detector (it reports "undecided" only if the input case is sufficiently different from all of the radial units).
SOFM networks, as described by Kohonen (Kohonen, 1997), are inspired by some known properties of the brain. The cerebral cortex is actually a large flat sheet, about 0.5 square meters in area, folded into its familiar convoluted shape only so that it can fit inside the skull, and it has known topological properties (for example, the area corresponding to the hand is located adjacent to the area for the arm, and a distorted human frame can be topologically mapped out in two dimensions on its surface).
The first component of a SOM is the data. It is common practice in experiments with SOMs to use three-dimensional data such as the data sets shown below, a three-dimensional representation of the colors red, blue, and green. The goal of the self-organizing map is to project the n-dimensional data (in this case color, which has three dimensions) into something that can be better perceived visually (in this case a two-dimensional picture map).
In this case, a good map could be expected to place the dark blues and greys close to one another, and the yellow close to both the red and the green. The weight vectors are the second component of a SOM. Each weight vector has two parts, as illustrated in the picture below. The first part of a weight vector is its data, which has the same dimensions as the sample vectors; the second part is its natural location within the map. In this particular example the color itself is the data (one of the benefits of working with color is that the data can be conveyed simply by displaying the color), and the location is the x,y coordinate of the pixel on the screen. This demonstration uses a two-dimensional array of weight vectors, represented in the same way as shown previously in figure 5; the picture shows the grid from an angled viewpoint. Each weight is an n-dimensional vector occupying a position in the grid that is unique with respect to the other weights.
Fig. 5.11. A two-dimensional array of weight vectors
Through the selection and learning process, the weights organize themselves into a map that illustrates the relationships between the samples. Given these two components (the sample vectors and the weight vectors), how can the weight vectors be arranged so that they correctly represent the similarities contained in the sample vectors? The relatively simple procedure outlined in this study is all that is required to accomplish this.
The first step in building a SOM is the initialization phase, in which the weight vectors are initialized. A sample vector is then selected at random, and the map of weight vectors is searched to identify the weight vector that best describes the sample. Because each weight vector has a location, it also has neighboring weights in its immediate area. The chosen weight is rewarded by being allowed to become more like the randomly selected sample vector, and its neighbors are also rewarded by being allowed to become more like the sample vector, though to a lesser degree. Because both the number of neighbors and the amount each weight is able to learn decrease with time, t is then increased by a small amount. The whole process is then repeated an exceptionally large number of times, usually more than a thousand.
In the case of the colors, the program starts by picking a color from the samples, say green, and then searches the weights for the location whose color is closest to green overall. The colors surrounding that weight are then altered so that they contain a larger proportion of green. Another color, such as red, is then selected and the process is carried out once more. The individual steps are described below:
There are many different approaches to initializing the weight vectors. The simplest is to give each weight vector random values for its data; an example of this can be seen in the image to the left, which shows a screen of pixels with arbitrary values assigned to red, blue, and green. Unfortunately, as Kohonen (Kohonen, 1997) notes, computing a SOM requires a significant amount of processing. For this reason, there are ways of initializing the weights so that samples you are certain are unrelated start out far apart; initializing in this manner lets you build a good map with fewer iterations, which saves time.
In this part of the study, in addition to the random-number approach, two further initializations were developed. The first places squares of black, red, blue, and green in the four corners and gradually blends the colors into one another towards the center. The second places the red, green, and blue regions at equal distances from the center and from one another.
Finding the best match is rather simple: loop through all of the weight vectors and compute the distance between each weight and the chosen sample vector. The weight with the shortest distance is declared the winner. If several weights share the same shortest distance, the winner is chosen at random from among them.
If x[i] denotes the ith data member of the sample vector, w[i] the ith data member of a weight vector, and n the total number of dimensions in the sample vectors, the (squared) Euclidean distance is dist = (x[1] − w[1])² + (x[2] − w[2])² + ... + (x[n] − w[n])².
This computation of distances, and the comparison between them, is carried out over the entire map; the weight with the shortest distance to the sample vector is the winner, the Best Matching Unit (BMU). For the green sample, the most intensely green unit ends up as the closest match, and the process continues. In the Java program, the square root is skipped in this part of the computation so that the processing speed can be improved; a short sketch of this search is given after this list.
When you scale the nearby weights, there are actually two aspects to consider: determining which weights count as neighbors, and determining the degree to which each weight may become more like the sample vector. Any one of a number of methods can be used to identify the weights immediately adjacent to a winning weight: some people prefer concentric squares, others hexagons. Here a Gaussian function was judged most useful, since it defines a neighbor as any point whose function value is greater than zero.
As mentioned previously, the neighborhood shrinks as time goes on. This is done so that the samples can first travel to the broad region where they are most likely to belong, and then compete for position within that region; the approach is analogous to first making a coarse adjustment and then going back to make a finer one. The precise function used to decrease the radius of influence is not really that important as long as it does decrease, so a simple linear function was used, as in the sketch that follows.
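The shortest-distance search and the Gaussian neighborhood described in the items above can be sketched in a few lines of Python. The book's own demonstration is written in Java; the functions below are a hedged re-statement of the same ideas, not that program's code, and the parameter names are illustrative.

```python
import math

def best_matching_unit(weights, sample):
    """Return the index of the weight vector closest to `sample` (the BMU).

    The square root is skipped, as in the demonstration, because it does not
    change which squared distance is smallest.
    """
    best_index, best_dist2 = 0, float("inf")
    for idx, w in enumerate(weights):
        dist2 = sum((s - wi) ** 2 for s, wi in zip(sample, w))  # squared Euclidean distance
        if dist2 < best_dist2:
            best_index, best_dist2 = idx, dist2
    return best_index

def neighborhood_strength(grid_dist, iteration, n_iterations, start_radius):
    """Gaussian neighborhood: any unit with a non-zero value counts as a neighbor.

    The radius of influence decreases linearly with the iteration number,
    so the region pulled towards the sample narrows as training proceeds.
    """
    radius = start_radius * (1 - iteration / n_iterations)    # linear decay
    if radius <= 0:
        return 1.0 if grid_dist == 0 else 0.0                 # only the winner remains
    return math.exp(-(grid_dist ** 2) / (2 * radius ** 2))    # 1 at the winner, falls off outward

# Example: the pure-green weight wins for a greenish sample, and early in training
# a unit two grid cells from the winner still receives a substantial share of the update.
print(best_matching_unit([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0.1, 0.9, 0.2]))   # -> 1
print(round(neighborhood_strength(2.0, iteration=0, n_iterations=100, start_radius=8), 3))
```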
Figure 8, at the top of this page, provides a visual representation of this operation. Because the radius of influence gradually shrinks towards the winning unit over the course of training, the number of weights counted as its neighbors gradually decreases. The starting radius is set to a very large value, close to either the width or the height of the map, whichever is relevant.
1. Learning
The algorithm for scaling the neighbors consists of two components: the learning function and the scaling function. The winning weight is the vector that most closely resembles the sample vector, and the neighbors of the winning weight also come to resemble the sample vector more and more. A characteristic of this method of learning is that the amount a neighbor learns is proportional to its distance from the winning vector. You have full control over the rate at which a weight's capacity to learn decreases, and can set it to whatever value you like.
A Gaussian function was chosen for this job: it produces a value between 0 and 1, which is then used in the parametric equation that adjusts each neighbor. The new color is given by: new color = current color × (1 − t) + sample vector × t. On the very first iteration, the unit whose data most closely matches the sample is assigned t = 1 by the learning function, so the resulting weight takes exactly the same values as the randomly selected sample.
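Written out as code (with t assumed to lie between 0 and 1 and to have already been scaled by the learning rate and the Gaussian neighborhood value described above), the blend is simply:

```python
def blend_towards_sample(weight, sample, t):
    """Move a weight vector towards the sample: new = current * (1 - t) + sample * t."""
    return [(1.0 - t) * w + t * s for w, s in zip(weight, sample)]

# With t = 1 the weight becomes an exact copy of the sample, as on the first iteration.
print(blend_towards_sample([0.2, 0.5, 0.8], [1.0, 0.0, 0.0], t=1.0))  # -> [1.0, 0.0, 0.0]
```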
The amount a weight is able to learn also decreases over time, mirroring the shrinking of the neighborhood. Since t takes values between 0 and 1, the winning weight simply adopts the sample vector on the very first iteration; later, as the maximum value of t decreases, the winning weight only moves slightly towards the sample. The rate at which a weight can learn decreases linearly with time. The preceding graph illustrates this idea: the amount a weight can learn is proportional to the height of the bump at that weight's position, and as more time passes the height of the bump falls. Combined with the neighborhood function, the base of the bump also becomes more concentrated, so its height and reach both diminish.
Therefore, once a weight has been determined to be the winner, its neighbors are identified, and the winning weight and each of its neighbors are modified so that they become more like the sample vector.
CHAPTER 6
6.1 INTRODUCTION
In this part we introduce the idea of a learning task and illustrate it with cross-sectional supervised learning, a familiar example that is suitable for this context. As part of the scope of this research, we construct a taxonomy of time series learning tasks. The taxonomy is the first scientific contribution of this thesis, as set out in Section 1.6, and in this section we address the first set of research questions presented in Section 1.5. The taxonomy is used to formalize the learning tasks that are most significant in relation to time series.
It will consequently be an essential part of the conceptual model of the time series domain, which serves as the foundation for the unified software framework implemented in sktime, since it provides the basis for the structure that is eventually built. As the study will show, there is a substantial correspondence between the structure and vocabulary of sktime and the taxonomy of tasks.
It is this temporal dependency that distinguishes time series data from other types of data. In machine learning with time series, the phrase "time series analysis" refers to a collection of techniques concerned with exploring this dependency. What matters most is the relevance of time series analysis to real-world applications: a better understanding of the processes underlying the data we observe puts us in a position to make more accurate predictions about it, which in turn guides our decision-making and can improve outcomes. Since predictive learning problems are the primary focus of this thesis, non-predictive tasks are left for future work.
Most examples provide only the data and the definition of success; they do not describe the process of learning. As a direct consequence, the challenges traditionally associated with learning become even more confusing: it is not unusual for tasks that should be regarded as distinct to be grouped together and classified under the same title. Because no task requirements are defined, unfair comparisons between algorithms may take place, an inappropriate method may be matched to the problem at hand, or performance forecasts may be overly optimistic. Every one of these outcomes is a real possibility.
encompassing in its coverage. Detailed descriptions of the data and of the criteria for success can be found in the publications included in the competition guide. On the other hand, it does not offer a thorough account of the methodology behind the learning process.
Looking at the code provided, we can observe that certain models, particularly the "statistical" ones, are trained on a single series, whereas the vast majority of the machine learning models are trained on multiple series and can therefore take advantage of patterns that repeat across series. Comparing models trained on a single series with models trained on several series, without making this contrast evident, is not only unfair but potentially misleading. Technically, the learning task appears to be panel forecasting, also frequently referred to as "cross-learning".
These terms are used interchangeably. This is in contrast to forecasting from a single univariate series. Even though many of the univariate models were initially designed to be trained on a single series (or require multivariate data to be additionally available during prediction), it is essential to keep in mind that they can still make use of multiple series during training. One way of accomplishing this is to optimise hyper-parameters across series, as an alternative to temporal cross-validation.
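For concreteness, the sketch below generates expanding-window train/test index splits of the kind used in temporal cross-validation on a single series. Libraries such as sktime provide ready-made splitters; the plain-Python version here is only meant to show the underlying logic, and the parameter names are illustrative.

```python
def expanding_window_splits(n_obs, initial_window, horizon=1, step=1):
    """Yield (train_indices, test_indices) pairs that respect temporal order."""
    train_end = initial_window
    while train_end + horizon <= n_obs:
        train_idx = list(range(0, train_end))                   # everything up to the cutoff
        test_idx = list(range(train_end, train_end + horizon))  # the next `horizon` points
        yield train_idx, test_idx
        train_end += step

# Example: 10 observations, start with 6, forecast 2 steps ahead each time.
for tr, te in expanding_window_splits(10, initial_window=6, horizon=2, step=2):
    print(tr[-1], te)   # last training index and the test indices
```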
A more explicit task definition, including a description of the learning process, would not only have made these differences more apparent, but would also have prevented misconceptions from arising in the first place. In addition, because certain models are trained on all of the available series, statistical comparisons between models may become invalid: in the absence of a distinct separation of series into a training set and a test set, the assumption that each prediction is based on an independent sample no longer holds. Differences in performance across models that appear statistically significant may in fact be insignificant.
The M4 competition is not the only example of an implied or insufficient task description in the literature, and we do not wish to single it out. The purpose of this research is to provide proper definitions of the tasks considered most significant for time series, in the hope of reducing the ambiguity and uncertainty that further research on the subject might otherwise inherit.
A number of studies treat tasks as characterising isolated learning problems. However, a considerable number of time series tasks are intricately connected with one another. One method for understanding the connection between them is the concept of reduction: through reduction, we can transfer information from one task to another, and we can use an algorithm developed for one task to help solve another. Within time series analysis, reduction is among the most crucial ideas to consider.
A variety of time series algorithms depend on reduction, and a wide range of reduction approaches can be used. However, not only are time series tasks rarely formalized completely in the literature, they are also rarely studied in relation to one another. This is a significant limitation. The discussion at the end of this study, which focuses on the reduction relations between tasks, is intended to narrow this gap. Clearly differentiating between these learning tasks, while also gaining an understanding of the connections between them, is necessary in order to identify the significant abstractions and the modular architecture for the unified software framework built in this study. The remainder of this study proposal is structured as follows:
In Section 6.2 we first formalize the fundamental components responsible for the generation of temporal data. In Section 6.3 we then develop the taxonomy of time series learning tasks, with particular emphasis on three tasks: forecasting, time series classification/regression, and supervised forecasting; these steps are intended to ensure that the taxonomy is accurate and comprehensive. Having established the distinctions between these tasks, we discuss in Section 6.4 how they are connected through reduction.
In what follows we use the terminology introduced in Section 2.5. In some circumstances we observe a single time series, composed of observations of a single variable (a specific kind of measurement) on a single instance (the experimental unit) over a number of distinct time periods; such data are sometimes referred to as "univariate" data. We may also observe several series arising in a variety of settings.
This can occur in two quite different ways. For the purposes of this article, "multivariate" data refers to a collection of series representing observations of a number of different variables on a single instance; we may observe such series with an extremely large number of variables. Alternatively, we may come across a number of distinct series grouped together as "panel" data, providing observations on a large number of instances of one or more variables.
The following material clearly identifies these three kinds of time series data, together with the hypothesized generative mechanisms that may give rise to them, and examines the outcomes those mechanisms are likely to produce. Once that is complete, we continue with the construction of the taxonomy of time series tasks, which refers back to these data types.
We partly follow the standard exposition of univariate and multivariate series, both of which are explained in this section and the one that follows. A time series is an indexed sequence of values observed sequentially in time. In our view, a time series consists of the values we observe and the time points, or indexes, at which we observe them: the values tell us what we observed, while the index tells us when we observed it. In formal notation, a time series is written (z(t1), z(t2), ..., z(tT)), where z(t) represents an observation made at time point t and T ∈ N+ is the total number of observations.
A univariate time series can alternatively be represented as a vector of dimension T × 1, z = (z(t1), z(t2), ..., z(tT))⊤. To keep the notation as simple as possible, we do not carry the time index explicitly throughout the notation: in our notation, a time series is written simply as a sequence of values without time indices. We operate under the assumption that the information associated with the time index is available whenever access to it is required.
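As a small illustration (using pandas, an assumed choice here; the sktime framework discussed above builds on it), a univariate series can be stored as values plus a time index, with the index set aside whenever only the ordered values are required:

```python
import pandas as pd

# A univariate time series: T = 4 observed values together with their time points.
z = pd.Series(
    [20.1, 20.4, 19.8, 21.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
    name="temperature",
)

values_only = z.to_numpy()   # the T x 1 view: just the ordered values
time_points = z.index        # the index, kept to one side until it is needed
print(values_only, time_points[0])
```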
We say that a time series is deterministic when its values are exactly specified by a mathematical function, for example z(t) = f(t) for some function f and time point t. A time series is non-deterministic, or stochastic, when its values can only be characterised in terms of a probability distribution. Throughout this thesis we concentrate specifically on probabilistic time series.
domain, denoted Z. Given this, the observation z(t) can be regarded as a realisation of the random variable Zt at time point t. More generally, a time series (z(t1), z(t2), ..., z(tT)) can be regarded as a realisation of a multi-dimensional random variable Z = (Zt1, Zt2, ..., ZtT) with a joint probability distribution.
In the context of a time series, the value domain is frequently Z ⊆ R, where R denotes the set of real numbers. The temperature of a chemical reaction is an example of such a continuous variable: the random variable Zt can be used to represent the temperature at a given time. In this case the time series is called "value-continuous", since it takes values from a continuous domain such as the real numbers.
Alternatively, the random variable may take values from a discrete domain, for example Z ⊆ N+ (for instance, count values) or Z ⊆ C, where C is a finite collection of discrete values (for instance, categories). Both of these possibilities are plausible. A time series of this kind is referred to as "value-discrete".
In the majority of situations the time index indicates a particular instant in time. We use the notation t = (ti ∈ T for i = 1, ..., T; ti < tj for all i < j, with i, j ∈ N+) to describe the ordered sequence of time points at which we make observations, where T ∈ N+ is the total number of observations. The domain of the random variable Z can then be defined as series(Z, t), the set of tuples of values ordered in time at the specified time points in t.
We often simplify this to series(Z), on the assumption that the time points are given. The time domain, denoted T, is commonly defined as a subset of R+, which indicates that time points lie on the positive real half-line. Nevertheless, as noted in the opening remark, it may also represent other scales depending on the circumstances.
The first observation concerns data that are not temporal. For the purposes of this discussion, the term "time series data" is also used for data obtained in non-temporal settings: the data need not have originated from a process that is directed, or flows, in time. The concept can be applied to many types of data, as long as the data consist of values arranged in a sequential order. Examples of non-temporal but sequential data include wavelength data, as in spectroscopy, and image outlines that have been unrolled into a series with an arbitrary starting point, using the distance from the image centre to points on the image contour as values.
Regardless of the value domain, the index indicates the position of the associated value within the series and thereby conveys the ordering of the data. Whether the index carries additional information depends on its domain: an index representing points in absolute time may, for example, incorporate calendar information such as the day of the week or whether a day is a holiday. A time series is called "time-continuous" when the index, i.e. the set of time points, is continuous, and "time-discrete" when the set of time points is discrete. Throughout this work we are exclusively concerned with time-discrete series.
A time-discrete series can be obtained in two distinct ways: either by sampling a time-continuous series at a number of time points (for example, taking temperature readings during a chemical process at one-minute intervals) or by accumulating a time-continuous series over a period of time (for example, measuring the amount of rain that has accumulated over the course of a year).
To simplify matters, it is common to assume that the intervals between time points are regular, i.e. constant and of equal length. Formally, we assume that the T consecutive values of a series are observed at equally spaced time points t0 + d, t0 + 2d, ..., t0 + Td, where d is a fixed interval between time points. The time series z(t1), z(t2), ..., z(tT) may then be written as z1, z2, ..., zT.
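A minimal sketch of a regularly spaced series in Python with pandas follows; the start time t0, the interval d and the values are illustrative assumptions.

import pandas as pd

# Regularly spaced time points t0 + d, t0 + 2d, ..., t0 + T*d
t0, d, T = pd.Timestamp("2024-01-01"), pd.Timedelta(minutes=1), 5
index = pd.date_range(start=t0 + d, periods=T, freq=d)

# With t0 and d fixed, z(t1), ..., z(tT) can simply be written z_1, ..., z_T
z = pd.Series([1.0, 1.2, 0.9, 1.1, 1.3], index=index)
print(z)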
Although the values of t0 and d are often unnecessary, they can be supplied whenever the observation times need to be fixed precisely. Taking t0 as the origin and d as the unit of time, we may write zt for the observation at time t. The second remark concerns randomly timed events. Observations are not always recorded as a regular time series with equal, fixed intervals. Consider, for example, a sequence of events in which each point corresponds to the time at which a random event occurred, and we record the precise moment of each occurrence. Events might include the failure of a piece of equipment, the purchase of a product from a retail store, or admission to a medical facility. Formally, we can construct time points on the positive half-line [0, τ], where τ ∈ R+.
The event times satisfy 0 ≤ t1 ≤ t2 ≤ ... ≤ tN(τ) ≤ τ, and N(t) is a non-decreasing, piece-wise constant counting function that increases at the event times. In addition to the sequence of observations, the duration of the gap between consecutive events, commonly referred to as the "waiting time", carries information. This contrasts with regular time series, whose intervals are predetermined and of equal length. Point process models are frequently used to capture the inherent uncertainty of such processes, as illustrated below.
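The following sketch, using numpy, shows hypothetical event times, the associated counting function N(t) and the waiting times between events; the event times and the helper name count_events are illustrative, not part of the original formalism.

import numpy as np

# Hypothetical event times on [0, tau], e.g. machine failures (in hours)
tau = 100.0
event_times = np.array([3.2, 7.9, 21.4, 22.0, 58.3, 91.7])

# N(t): piece-wise constant counting function, evaluated at arbitrary times t
def count_events(t, times=event_times):
    return np.searchsorted(times, t, side="right")

# Waiting times between consecutive events carry additional information
waiting_times = np.diff(event_times, prepend=0.0)
print(count_events(np.array([10.0, 50.0, tau])))  # [2 4 6]
print(waiting_times)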
The discussion so far has dealt with single, univariate time series. In many contexts, however, several series are observed, and it is important to distinguish two fundamentally different ways in which this can happen: multivariate time series and panel data. We begin with multivariate data and turn to panel data in the next part.
The term "multivariate time series" refers to a collection of univariate time series that
are connected to one another. These series are what constitute the multivariate time
series. In a manner analogous to how univariate series are made up of observations on
several instances, multivariate series are made up of observations on a single instance
for each and every occurrence. Multivariate series, on the other hand, are made up of
data that refers to a number of different variables, in contrast to univariate series, which
only contain data that belongs to one variable. For the purpose of representing
experimental units and variables, which are various types of measurements, instances
are utilised, just as they were demonstrated in the instance that came before it. To
illustrate this point, let's say that we are able to ascertain the temperature and pressure
that are generated by a single chemical reaction.
Formally, such a multivariate time series can be described as a stochastic process characterised by a joint probability distribution across time points and variables. Here Z ⊆ RL, where L ∈ N+ denotes the number of variables. Note that, in this context, we simplify matters by assuming that all univariate component series take values in the same domain; the ways in which this assumption may fail are examined in a later remark.
The observation zjt may be expressed as the realisation of the random variable Zjt for variable j at time point t, with j = 1, ..., L. Provided the set of time points t is shared by all variables, a multivariate time series can be represented as a T × L matrix in which rows correspond to time points and columns to variables. One of the most significant aspects of multivariate time series is that observations are statistically dependent not only on previous observations within the same univariate series, but also across series, both contemporaneously and across time periods.
The joint probability distribution can capture this dependence: some series may lead or lag other series, or there may be feedback relationships between them. The assumption that the univariate component series are statistically independent is therefore not justified. As we will see, this is the most important point of divergence from the panel data examined in the following section.
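As an illustration of the T × L representation, the following pandas sketch stores two variables of a single hypothetical instance in one table; the variable names and values are assumptions made for the example.

import pandas as pd

# One instance, L = 2 variables observed at the same T time points:
# rows = time points, columns = variables (a T x L representation)
index = pd.date_range("2024-01-01 10:00", periods=4, freq="min")
reaction = pd.DataFrame(
    {"temperature": [21.4, 22.1, 23.8, 25.0],
     "pressure":    [1.01, 1.03, 1.08, 1.12]},
    index=index,
)
print(reaction.shape)  # (4, 2): T = 4 time points, L = 2 variables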
Remark 3 concerns the use of different time indices across variables. The observed time indices of one univariate component series may differ from those of another, i.e. not all series need share the same index. For example, the component series may differ in the number of observed time points or in the spacing between them.
As noted in the second remark, the time points may also be entirely arbitrary. This is an important consideration when constructing a suitable container for time series data, which is discussed later in this work. Remark 4 concerns different value domains across variables: another common issue is that the observed univariate component series may have different value domains Z. Again, this is highly relevant for the design of a suitable time series data container.
Multivariate time series are only one of several ways of representing multiple time series. The next topic is panel data, an alternative data setting.
Panel data, sometimes referred to as "longitudinal data", are generated by gathering observations on several instances of the same kind(s) of measurements over a period of time. Panel data are closely related to cross-sectional data, which consist of observations taken across a range of different instances; the distinction is that panel data comprise repeated observations gathered over time, whereas cross-sectional data are collected at a single point in time. Put differently, panel data consist of a number of cross-sections obtained repeatedly over the course of time.
Formally, the data-generating process can be described by N jointly independent random variables, each taking values in series(Z), with N ∈ N+. In the case of panel data it is therefore reasonable to assume that each instance represents an independent realisation of the same stochastic process. This is a considerable departure from the multivariate time series setting. It is important to keep in mind that the independence assumption is only reasonable across instances; applying it to the time series observations within a particular instance would be illogical, since those observations may still depend on earlier observations. Panel data may also be combined with multivariate time series.
In that case, as described in Section 6.2.2, the individual instances remain independent of one another, but the univariate time series variables within each instance are not. To simplify notation, we adopt the assumption that we observe "time-homogeneous" data, i.e. observations that share the same time index across instances and variables.
For time-homogeneous multivariate time series, for example, every univariate component series has the same time index. In the context of panel data, "time-homogeneous" means that all variables associated with each instance share the same time index, as sketched below.
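A minimal numpy sketch of time-homogeneous multivariate panel data follows; the dimensions and the random values are illustrative assumptions, not taken from the text.

import numpy as np

# Panel data: N independent instances, each a multivariate series with
# L variables observed at T shared ("time-homogeneous") time points.
N, L, T = 3, 2, 50
rng = np.random.default_rng(0)
panel = rng.normal(size=(N, L, T))   # 3-dimensional array: instance x variable x time

# The i-th instance is itself a multivariate series with L variables of length T
instance_0 = panel[0]
print(panel.shape, instance_0.shape)  # (3, 2, 50) (2, 50)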
In general, hierarchical data structures or three-dimensional arrays are required to represent multivariate panel data. For our purposes, mixed panel data are an important scenario: samples that contain observations on both time series variables and cross-sectional variables. For example, we may have data on patients for whom we observe both cross-sectional characteristics (such as the date of birth) and time series variables (such as the heart rate). The underlying data-generating process then involves two components.
For the purpose of this discussion, Z and X are both random variables, where Z stands for Z1, Z2, ..., ZT. Together they take values in the domain series(Z) × X. For the cross-sectional characteristics, X is a subset of RK, where K is the number of characteristics. This is analogous to the cross-sectional scenario addressed in Section 2.5.
A cross-sectional variable may, for example, serve as the target variable to be predicted from the observed time series together with any other time-constant variables. This scenario is discussed within the framework of time series classification and time series regression in Section 6.3.1. Remark 5 (different time indices across instances) elaborates on a related point: as noted in Remark 3, time indices may differ not only across variables but also across instances when working with panel data.
Whether or not the variables themselves differ, two situations may be observed: (i) a different number of time points for each instance, and/or (ii) varying intervals between time points. Panel data of this kind are frequently referred to as "asynchronous", in contrast to "synchronous" data, in which the time indices are shared across instances. As noted in Remarks 2 and 6 (different variables across instances), the time points may also be entirely random, with no discernible pattern.
A less frequent complication is that different instances may have different sets of variables: some variables may be observable for particular instances but not for others. Up to this point we have explored three distinct types of time series data: univariate, multivariate, and panel data. We have highlighted the differences between the generating settings from which these types of data may originate and contrasted them with the cross-sectional scenario. In the next section we formalize the most important time series learning tasks, and these data descriptions will serve as the basis for that discussion.
Just as there is a wide range of scenarios that can generate such data, there is a wide range of learning tasks that can be formed from it. The purpose of this section is to offer a stylized description of the most important time series learning tasks. We formalize three tasks: time series classification, time series regression, and forecasting, making use of the template constructed in Section 2.3. These three tasks are the most important ones when working with time series.
The formalization of these tasks will serve as the foundation for a unified software framework for machine learning with time series; the formalization of further tasks is left for future work. The following description of learning tasks refers to the various data settings presented in Section 6.2 above, which will be brought up at several points in the discussion.
Two of the most important challenges in time series learning are time series classification and time series regression. In these tasks, we are interested in predicting a cross-sectional target variable from one or more input time series, based on patterns discovered across instances that relate the input series to the target variable. They are natural extensions of cross-sectional classification and regression to panel data. The cross-sectional and time series versions of these learning tasks are distinct because they make different data-generation assumptions.
At first glance the distinction between these tasks may not be immediately apparent. The most important difference is that features are now represented as series rather than scalars (such as numbers, categories, or strings). For this reason, we will occasionally refer to these tasks as "series-as-features" tasks.
We assume that the panel data are mixed, as defined in Section 6.2.3. For a number of instances N ∈ N+, let there be a collection of jointly independent random variables. The letter Z denotes the time series feature variables and Y denotes the cross-sectional target variable. The random variables Z and Y must not be assumed to be independent of one another, and in general they are not: if they were, the target values would be entirely unrelated to the features.
For a given set of time points t, Z takes values in series(Z), and these values make up the time series; Z may be univariate or multivariate. Analogously to the cross-sectional setting, the random variable Y takes values in Y.
When Y = R, we refer to the learning problem as "time series regression". When Y = C, where C is a finite set of class values determined in advance, the learning task is referred to as "time series classification".
We assume a close relationship between Y and f(Z), where f is a fixed but unknown function of the time series Z, and we wish to estimate this relationship. The fundamental difference from the cross-sectional setting is that the feature variables are now series rather than scalars; apart from that, the learning process is almost identical. For training, a collection of training samples D = {(Z(1), Y(1)), ..., (Z(M), Y(M))} is used, and a collection of test cases Z*, Y* is kept for evaluating the procedure, where the M training instances form a subset of the N available instances. To fit a prediction functional f̂ := A(D) to a given training set, an algorithm A is employed, which maps the training data to
a prediction functional of type f̂: series(Z) → Y. Choosing a training set is a crucial phase of the learning process. Information flows from the time series variables to the target variable. In the classification case, A is referred to as a "time series classifier"; a small sketch follows below.
Cross-sectional algorithms are typically unaffected by the order in which instances are presented (up to any randomness incorporated in the algorithms themselves). Changing the order in which time series observations are presented, by contrast, may affect how well time series algorithms fit the data. Even so, cross-sectional approaches can still be used to solve time series tasks; these strategies are discussed in greater depth in the section on reduction. Concerning evaluation, the procedure for time series regression and time series classification is comparable to the cross-sectional scenario covered in Section 2.5: we evaluate performance on new data, i.e. on instances different from those used for training.
After the algorithm has been trained on the training set, the test cases (Z*, Y*) are used for evaluation. Performance is measured with a loss function L: Y × Y → R. The goal is to minimise the generalisation error with respect to a given training set D or, to account for the fact that f̂ depends on the training data D, the expected generalisation error. GE and EGE are theoretical quantities; in practice, both must be estimated.
6.3.2 Forecasting
Forecasting is concerned with predicting future values of a time series, based on patterns identified across past time intervals through the analysis of historical data. Forecasting tasks can be set up in many ways; the definition presented here focuses on the basic univariate case without exogenous variables.
Note also that the time series is observed for a single instance only, and its observations are not assumed to be independent. As before, the algorithm should not only perform well on the data used for training, but also generalise to data it has not seen; it is therefore essential to evaluate algorithms with respect to genuine out-of-sample forecasts.
We therefore divide the available data into two distinct subsets, a training set and a test set: the training set is used to train the algorithm, while the test set is used to evaluate its performance. Because the test set is not used during training and prediction, it can provide a reliable signal of how well the model is likely to perform on new data.
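A minimal sketch of such a temporal split with pandas follows; the series, the 80/20 proportion and the daily frequency are assumptions for illustration.

import pandas as pd

# A single univariate series; the split must respect time order:
# earlier observations for training, later observations held out for testing.
y = pd.Series(range(100), index=pd.date_range("2020-01-01", periods=100, freq="D"), dtype=float)

cutoff = int(len(y) * 0.8)
y_train, y_test = y.iloc[:cutoff], y.iloc[cutoff:]   # no shuffling
print(len(y_train), len(y_test))  # 80 20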
Algorithm A is a function that receives data of type series(Z) and returns a prediction functional of type f̂: T → series(Z). The prediction functional is given an additional time point, referred to as the "forecasting horizon", and generates a prediction for that time point. Information flows from past observations to future ones. In this context, A is called a "forecaster".
In the definition above, the forecasting horizon is a single time point. In practice, we may need to produce forecasts for several time points at once, and many prediction functionals require forecasts for earlier time points in order to generate forecasts multiple steps ahead.
To keep things simple, we assume that the prediction functional has access to these predictions. The fitted prediction functional often requires not just a time point but also some past data in order to generate predictions; in that case we may define f̂ as f̂: series(Z) × T → series(Z) instead, and the algorithm A takes the observed series as an additional input accordingly.
Some algorithms require the forecasting horizon to be known already during the training phase. One example is the "direct" reduction approach [34], which fits separate parameters for each time point in the forecasting horizon. We will proceed with the simpler notation given in Equation 6.8, although these considerations played an important role in the construction of sktime, described in Section 8.3.3.
The only way to determine how accurate predictions are is to consider how well a model performs on new data that were not used for training, that is, observations taken at future time points outside the training period. After using the time points tτ = (ti : ti ∈ t; ti ≤ τ) for training, we then use test time points t*, with tT ≥ t* > τ, for evaluation.
For performance assessment, a loss function L: Z × Z → R is defined, comparing the predicted value with the value actually observed at a given time point. Given a particular training set D, the learning task can be viewed as an optimisation problem: minimising the generalisation error at the test time point t*, where the expectation is taken over z(t*).
As before, the GE and the EGE are defined in terms of random variables, and in practice both quantities must be estimated. Although GE and EGE are defined for a single test time point t*, we may also be interested in errors aggregated over a number of different test time points.
Performance over several time periods can be analysed by aggregating the individual loss values, for example by taking their mean. Two loss functions frequently used in forecasting are the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE).
One further point is important to bear in mind: because both the training set and the test set are derived from the same time series, their observations are likely to be statistically dependent on one another. When assessing the uncertainty associated with our performance estimates, the dependence structure of the data must therefore be taken into account.
6.4 REDUCTION
The definitions of the learning tasks presented above have repeatedly highlighted the distinctions between them. In the following paragraphs, we focus instead on the links between them. The idea of "reduction" is a useful tool for understanding these links. In machine learning, the term "reduction" is used for a broad variety of techniques, such as data reduction and dimensionality reduction.
Here, however, "reduction" refers to the process of reducing one task to another, as explained in more detail below. Reduction allows us to apply any approach developed for one task to another task. Although the idea of reductions has existed for a long time, it is rarely referenced in the time series literature as a concept for linking different tasks.
Reductions are nevertheless an essential and widely applicable component of machine learning, and they are particularly important for time series. Figure 6.1 provides an overview of the most significant reduction relations in the time series domain. The standard way to accomplish a reduction is to break a difficult task down into a number of more manageable steps; the solutions to the simpler problems can then be recombined into a solution for the more challenging task.
Reductions can be viewed from the perspective of both methods and applications. For instance, cross-sectional regressors are frequently used to solve forecasting problems, and cross-sectional classifiers are frequently used as components inside specialised time series classification algorithms.
Another way of looking at reductions is as mappings from one kind of algorithm to another. A reduction from forecasting to cross-sectional regression, for example, is a mapping that takes an algorithm of the cross-sectional regressor type as input and returns an algorithm of the forecaster type. As we will see in the next section, reduction is a key part of the design of the unified software framework for machine learning with time series established here.
A more in-depth discussion of reduction in this respect is provided later in this work. In what follows, we first present a more detailed analysis of two scenarios, then focus on some of the most significant properties of reduction relations, and finally draw some conclusions.
This reduction relation is illustrated by the arrow labelled (g) in the bottom row of Fig. 6.1. There is a wide range of approaches for achieving this reduction, but all of them require feature extraction, i.e. a technique for converting the time series into scalar features, since scalar features are the input format that cross-sectional classification algorithms require.
A simple approach is to treat each time point as a distinct feature, discarding any information contained in the ordering of the time series data. Alternatively, scalar characteristics can be extracted from the time series; these range from simple features such as the mean to more complex ones such as the coefficients of the Fourier transform or the time reversal asymmetry statistic.
A second reduction, from forecasting to cross-sectional regression, applies a sliding window to the series: to obtain the target variable, we simply take the value that follows each window. Given a univariate time series (z1, z2, ..., zT) and a window length w, this yields a feature matrix of dimensions (T − w) × w and a target matrix of dimensions (T − w) × 1, where T denotes the total number of time points and w ∈ N+.
Note that T must be greater than w. In some fields this technique is also known as "auto-regression" or "lagged-variable regression", because we regress the target variable on its own lagged values; in other words, we predict the future from the "lagged" past values. After converting the series into this tabular format, any cross-sectional regression approach can be applied to the data, as illustrated in the sketch below.
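A minimal sketch of this sliding window transformation and a subsequent cross-sectional fit follows, using numpy and scikit-learn; the series is simulated, and the helper name sliding_window and the choice of a linear regressor are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

def sliding_window(z, w):
    # Tabularise a univariate series: each row is a window of length w (features),
    # the value following each window is the regression target.
    X = np.array([z[i:i + w] for i in range(len(z) - w)])
    y = np.array([z[i + w] for i in range(len(z) - w)])
    return X, y

rng = np.random.default_rng(2)
z = np.cumsum(rng.normal(size=200))        # a hypothetical univariate series z_1, ..., z_T
w = 10
X, y = sliding_window(z, w)                # X has shape (T - w, w), y has shape (T - w,)

regressor = LinearRegression().fit(X, y)   # any cross-sectional regressor can now be used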
After fitting, we use the last available window, [zT−w+1, zT−w+2, ..., zT], as input to the fitted algorithm in order to produce the one-step-ahead forecast. To generate multi-step-ahead forecasts, two steps are applied recursively: (i) previously predicted values are used to update the most recent window, and (ii) a new prediction is made for the following step, as sketched below.
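The following self-contained sketch shows this recursive strategy; the series, window length and horizon are illustrative assumptions, and the regressor is again an arbitrary cross-sectional choice.

import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a lagged-variable regressor (window length w), then forecast recursively.
rng = np.random.default_rng(2)
z = np.cumsum(rng.normal(size=200))
w, horizon = 10, 5
X = np.array([z[i:i + w] for i in range(len(z) - w)])
y = np.array([z[i + w] for i in range(len(z) - w)])
regressor = LinearRegression().fit(X, y)

# Recursive multi-step forecasts: each prediction is fed back into the last window.
window = list(z[-w:])                      # [z_{T-w+1}, ..., z_T]
forecasts = []
for _ in range(horizon):
    y_hat = regressor.predict(np.array(window[-w:]).reshape(1, -1))[0]
    forecasts.append(y_hat)                # (ii) one-step prediction rolled forward
    window.append(y_hat)                   # (i) update the window with the prediction
print(forecasts)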
Besides the recursive approach, other strategies are possible. It is essential to bear in mind that cross-sectional regression assumes data on a large number of independent instances, whereas in forecasting we observe only a single instance. The rows of the transformed matrix cannot be regarded as independent instances, even though the data are now in tabular format: each row is likely to depend on the rows that came before it. Model assessment approaches based on the random shuffling of examples, such as a hold-out test set or resampling procedures like cross-validation, are therefore prone to produce overly optimistic estimates of performance.
To evaluate the success of cross-sectional regression algorithms in this setting, their performance should be assessed with respect to the original forecasting task rather than the reduced regression task: although we can use these algorithms to address forecasting problems, what matters is how successful they are in that original context. A sketch of such an evaluation follows below.
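The following sketch evaluates the reduced forecaster with a temporal, expanding-window hold-out rather than shuffled cross-validation; the series, window length, test size and the helper name one_step_backtest are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

def one_step_backtest(z, w, n_test):
    # Evaluate the reduced forecaster on the original forecasting task:
    # refit on an expanding training window and predict the next true value.
    errors = []
    for t in range(len(z) - n_test, len(z)):
        train = z[:t]
        X = np.array([train[i:i + w] for i in range(len(train) - w)])
        y = np.array([train[i + w] for i in range(len(train) - w)])
        model = LinearRegression().fit(X, y)
        y_hat = model.predict(train[-w:].reshape(1, -1))[0]
        errors.append(abs(z[t] - y_hat))
    return np.mean(errors)

rng = np.random.default_rng(3)
z = np.cumsum(rng.normal(size=150))
print(one_step_backtest(z, w=10, n_test=20))   # temporal hold-out MAE, no shuffling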
The following paragraphs provide an overview of some of the most important properties of reduction relations:
The performance estimates of the algorithm: estimates obtained for the new task do not, in general, carry over to the task that was originally posed, because the underlying assumptions connected with the new task may not hold for the original one. For example, when a forecasting task is reduced to cross-sectional regression by applying a sliding window transformation, as in example 2, the transformed data can be displayed in tabular form, yet the rows do not represent instances identified by unique IDs: although the relational data structure is the same, the assumption about the setting in which the data were generated is not.
This is a modular system: reducing an algorithm for one task produces a new algorithm that serves a different purpose than the original. Applying a reduction to N base algorithms yields N new algorithms that can be used for the new task, and every improvement made to a base algorithm transfers immediately to the new task. This reduces the effort spent on research and software development.
Reductions typically involve additional choices, such as the window length parameter or the strategy used for generating predictions (for example, the "recursive" strategy or some other approach). These choices may be understood as hyper-parameters of the reduction approach; once made, they can be tuned or modified using appropriate model selection techniques.