SEMINAR REPORT
(Submitted in Partial Fulfillment of the Requirement for the B.Tech Degree Course in Electronics
and Communication Engineering of APJ Abdul Kalam Technological University)
Submitted by
SUHANA SAINAB. S (AME19EC013)
CERTIFICATE
This is to certify that the seminar report entitled "AUTOMATED MACHINE LEARNING: THE
NEW WAVE OF MACHINE LEARNING" is a bonafide record of the work done
by SUHANA SAINAB. S (AME19EC013) at Rajadhani Institute of Science and Technology,
in partial fulfillment of the requirements of the B.Tech Degree course in Electronics and
Communication Engineering of APJ Abdul Kalam Technological University, 2019-2023
batch.
Ms. SITHARA KRISHNAN
Head of Department
ACKNOWLEDGEMENT
It is with great enthusiasm and learning spirit that I am bringing out this seminar report.
Here I would like to mark my token of gratitude to all those who influenced me during
the period of my work. I would like to express my sincere thanks to the Management
of Rajadhani Institute of Science and Technology, Palakkad, and Dr. RAMANI K, the
Principal, Rajadhani Institute of Science and Technology, for the facilities provided here.
I express my heartfelt gratitude to the Head of the Department, Ms. SITHARA KRISHNAN,
Assistant Professor, Department of Electronics and Communication Engineering, for allowing
me to take up this work.
With immense pleasure and gratitude, I express sincere thanks to my guide Ms. ASHA
ARVIND and Co-ordinator Ms. SITHARA KRISHNAN, Assistant Professor, for her committed
guidance, valuable suggestions and constructive criticism. Her stimulating suggestions
and encouragement helped me through my seminar work. I extend my gratitude to all teachers
in the Department of Electronics and Communication Engineering, Rajadhani Institute of
Science and Technology, Palakkad, for their support and inspiration.
Above all, I praise and thank the Almighty God, who showered abundant grace on me to
make this work a success. I also express my special thanks and gratitude to my family and
all my friends for their support and encouragement.
ABSTRACT
With the explosion in the use of machine learning in various domains, the need for an
efficient pipeline for the development of machine learning models has never been more critical.
However, the task of forming and training models largely remains traditional, with a
dependency on domain experts and time-consuming data manipulation operations, which
impedes the development of machine learning models in both academia and industry.
This demand has opened a new research era concerned with fitting machine learning
models fully automatically, i.e., AutoML. Automated Machine Learning (AutoML) is an end-
to-end process that aims at automating the model development pipeline without any external
assistance. First, we provide insights into AutoML. Second, we delve into the individual
segments of the AutoML pipeline and cover their approaches in brief. We also provide a case
study on the industrial use and impact of AutoML, with a focus on practical applicability in a
business context. Finally, we conclude with the open research issues and future research
directions.
Index Terms: Automated Machine Learning, Artificial Intelligence, Meta Learning,
Hyperparameter Optimization.
Contents
4 DATA PREPROCESSING
4.1 Data Imputation
4.2 Data Balancing
4.3 Data Encoding
5 FEATURE ENGINEERING
5.1 Feature Mining
5.2 Feature Generation
5.3 Feature Selection
7 DISCUSSION
8 CONCLUSION
BIBLIOGRAPHY
List of Figures
2.1 Machine learning
Chapter 1
INTRODUCTION
Data analysis is a powerful tool for learning how to improve decision making, business
models and even products. It involves the construction and training of a machine learning
model, which faces several challenges due to lack of expert knowledge. These challenges
can be overcome by the field of automated machine learning (AutoML). AutoML refers to
the process of studying a traditional machine learning model development pipeline,
segmenting it into modules and automating each of those modules to accelerate the
workflow. With the advent of deeper models, such as the ones used in image
processing, Natural Language Processing, etc., there is an increasing need for tailored models
that can be crafted for specific workloads. However, such specific models require immense
resources such as high-capacity memory, strong GPUs, domain experts to help during
development and long wait times during training.
The task gets critical as there is not much work done on creating a formal framework for
deciding model parameters without trial and error. These nuances emphasize the
need for AutoML, where automation can reduce turnaround times and also increase the
accuracy of the derived models by removing human errors. In recent years, several tools and
models have been proposed in the domain of AutoML. Some of these focus on particular
segments of AutoML, such as feature engineering or model selection, whereas some
attempt to optimize the complete pipeline. These tools have matured enough to
compete with human experts in Kaggle competitions, and at times have beaten them as well,
showcasing their capability. There is a wide variety of applications based on AutoML, such as
autonomic cloud computing, intelligent vehicular networks, blockchain and software-defined
networking, among others.
This paper aims at providing an overview of the advances seen in the realm of AutoML in
recent years. We focus on individual aspects of AutoML and summarize the improvements
achieved in recent years. The motivation of this paper stems from the unavailability of a
compact study of the current state of AutoML. While we acknowledge the existence of other
surveys, their motive is either to provide an in-depth understanding of a particular segment of
AutoML, to provide just an experimental comparison of various tools, or they are fixated on
deep learning models.
Chapter 2
MACHINE LEARNING
The machine learning process begins with observations or data, such as examples, direct
experience or instruction. It looks for patterns in data so it can later make inferences based on
the examples provided. The primary aim of ML is to allow computers to learn autonomously,
without human intervention or assistance, and adjust actions accordingly. Machine learning
as a concept has been around for quite some time. The term “machine learning” was coined by
Arthur Samuel, a computer scientist at IBM and a pioneer in AI and computer gaming. Samuel
designed a computer program for playing checkers. The more the program played, the more
it learned from experience, using algorithms to make predictions.
ML has proven valuable because it can solve problems at a speed and scale that cannot
be duplicated by the human mind alone. With massive amounts of computational ability behind
a single task or multiple specific tasks, machines can be trained to identify patterns in and
relationships between input data and automate routine processes.
Supervised Learning: More Control, Less Bias
Supervised machine learning algorithms apply what has been learned in the past to new
data, using labeled examples to predict future
events. By analyzing a known training dataset, the learning algorithm produces an inferred
function to predict output values. The system can provide targets for any new input after
sufficient training. It can also compare its output with the correct, intended output to find
errors and modify the model accordingly.
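As a toy illustration of this train-then-predict loop (a minimal sketch with an invented dataset, not any particular library's API), a one-nearest-neighbour classifier learns from labeled examples and predicts targets for new inputs:

```python
# Minimal supervised learning sketch: a 1-nearest-neighbour classifier.
# The toy dataset below is invented for illustration.

def euclidean(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(train_X, train_y, query):
    """Predict the label of the closest training example."""
    best = min(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    return train_y[best]

# Labeled training data: two clusters with labels "low" and "high".
train_X = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.5, 8.2)]
train_y = ["low", "low", "high", "high"]

print(predict_1nn(train_X, train_y, (1.1, 0.9)))  # close to the "low" cluster
print(predict_1nn(train_X, train_y, (8.1, 7.9)))  # close to the "high" cluster
```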
Unsupervised Learning: Speed and Scale
Unsupervised machine learning algorithms are used when the information used to train is
neither classified nor labeled. Unsupervised learning
studies how systems can infer a function to describe a hidden structure from unlabeled data.
At no point does the system know the correct output with certainty. Instead, it draws inferences
from datasets as to what the output should be.
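The same idea can be sketched without labels using a tiny k-means loop (the data and initial centers below are invented for illustration):

```python
# Minimal unsupervised learning sketch: k-means clustering on unlabeled data.
# No labels are given; the algorithm infers group structure on its own.

def kmeans_1d(points, centers, iterations=10):
    """Very small k-means for 1-D data; returns the final cluster centers."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(v) / len(v) if v else c for c, v in clusters.items()]
    return sorted(centers)

data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]   # two obvious groups, no labels
centers = kmeans_1d(data, [0.0, 5.0])
print(centers)  # roughly [1.0, 10.0]
```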
Reinforcement Learning
Reinforcement learning is a feedback-based learning method,
in which a learning agent gets a reward for each right action and gets a penalty for each
wrong action. The agent learns automatically with these feedbacks and improves its
performance. In reinforcement learning, the agent interacts with the environment and
explores it. The goal of an agent is to get the most reward points, and hence, it improves its
performance. The robotic dog, which automatically learns its movements, is an
example of reinforcement learning.
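The reward-and-penalty loop described above can be illustrated with a minimal tabular Q-learning sketch; the corridor environment, reward values and constants are invented for illustration:

```python
# Minimal reinforcement learning sketch: tabular Q-learning on a 5-cell corridor.
# The agent starts in cell 0; moving right eventually reaches the goal (cell 4),
# which yields a reward. Environment and rewards are toy choices.
import random

random.seed(0)
N_STATES, ACTIONS = 5, [-1, +1]          # move left / move right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda x: Q[(s, x)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0   # reward only at the goal
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After training, the learned policy prefers moving right in every state.
policy = [max(ACTIONS, key=lambda x: Q[(s, x)]) for s in range(N_STATES - 1)]
print(policy)
```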
Chapter 3
AUTOML TOOLS
1. Google Cloud AutoML
Google has launched several AutoML products for building our own custom machine
learning models as per business needs, and it also allows us to integrate these models
into our applications or websites. Google has created the following products:
• AutoML Natural Language
• AutoML Tables
• AutoML Translation
• AutoML Video Intelligence
• AutoML Vision
The above products provide various tools to train models for specific use cases with
limited machine learning expertise. For Cloud AutoML, we don't need to have knowledge of
transfer learning or how to create a neural network, as it provides out-of-the-box deep
learning models.
2. Microsoft Azure AutoML
Microsoft Azure AutoML was released in the year 2018. It offers a transparent
model selection process that lets non-ML experts build ML models.
3. H2O.ai
H2O is an open-source platform that enables the user to create ML models. It can be
used for automating the machine learning workflow, such as automatic training and tuning
of many models within a user-specified time limit. Although H2O AutoML can make the
development of ML models easy for non-experts, a good knowledge of data science
is still required to build high-performing ML models.
4. TPOT
TPOT (Tree-based Pipeline Optimization Tool) is an open-source Python library built on
top of scikit-learn that uses genetic programming to optimize complete machine learning
pipelines.
5. DataRobot
DataRobot is one of the best AutoML tool platforms. It provides complete automation
by automating the ML pipeline and supports all the steps required for preparing,
building, deploying, monitoring and maintaining powerful AI applications.
6. Auto-Sklearn
Auto-Sklearn is an open-source library built on top of scikit-learn. It automatically does
algorithm selection and parameter tuning for a machine learning model. It provides out-of-the-
box features of supervised learning.
7. MLBox
MLBox is another powerful Python library for automated machine learning.
3.2 AUTO ML
AutoML is the process of automating the end-to-end process of applying machine
learning to real-world problems. The problem of AutoML is a combinatorial one, where any
proposed algorithm is required to find a suitable combination of operations for each segment
of the ML pipeline so as to minimize the error.
The standard data pre-processing operations are well defined and discussed in Chapter 4.
While a completely raw data collection cannot be processed with these standard operations,
datasets are usually refined to some extent and can work well with such operations. The
automation in data pre-processing is defined as a series of actions that are selected from the
standard pre-defined operation set and performed on the dataset. Feature engineering is
performed by selecting relevant features from the dataset, finding dependent pairs and using
them to generate new features. Model selection and hyperparameter optimization work on
finding the optimal parametric configuration from an infinite search space or from learning
it (reinforcement learning) from previous models designed for various tasks. The final
term in equation 1 demonstrates the probabilistic reinforcement learning used in recent years
for constraining the configuration space.
The solution-space explosion due to exponentials and factorials, as shown in equation 1,
is the core issue of AutoML. This explosion makes the search computationally expensive and
voids any accuracy advantage over humans. To address this problem, various proposed
research works allow a parameter configuration to granularly adjust the volume of the search
space explored by any algorithm. Some works have removed the combination configurations
deemed ineffective based on previous experience.
Chapter 4
DATA PREPROCESSING
• Data Imputation
• Data Balancing
• Data Encoding
This section describes the various segments of AutoML as per the taxonomy shown.
We present the most notable contributions seen in the domain of AutoML. We compare the
various approaches adopted for each individual segment of AutoML.
Data pre-processing guarantees the delivery of quality data derived from the original
dataset. It is an important step due to the unavailability of quality data, as a large portion of the
information generated and stored is usually semi-structured or even unstructured in form.
However, even though it is a crucial part of any machine learning pipeline, it is reported
to be the least enjoyable part, with authors stating that 60-80 percent of data scientists
find it to be the most mundane and tedious job. In AutoML, certain data pre-processing
operations are hard-coded, which are then applied to a given dataset in certain combinations
such that the overall clarity and usability of the data increases. We have largely classified these
operations into the following categories based on our survey of recent papers.
Data Imputation
Most missing values in real-world datasets are missing completely at random
(MCAR). The randomness of MCAR data is high enough that there is no overall bias. In
data imputation, we deal with inconsistencies such as NaNs, spaces, null values, incorrect data
types, etc. This is addressed by replacing these values using multiple methods, such as default
value selection, in which every problematic value is removed and a pre-selected value takes its
place. Another approach is to use the mean or median of the dataset column to replace any
missing value. Some approaches, such as regression imputation, have used standard deviation
and variance to compute the replacement value for a given data column. Data imputation
techniques with lighter time constraints use the successive halving approach, as in Auto-WEKA.
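A minimal sketch of these replacement strategies (pure Python; the toy column and the `impute` helper are ours for illustration, not any tool's API):

```python
# Sketch of standard imputation strategies on a toy column where missing
# values are represented by None. Real AutoML tools choose among such
# strategies automatically.
from statistics import mean, median

def impute(column, strategy="mean", default=0.0):
    """Replace missing entries (None) using the chosen strategy."""
    observed = [v for v in column if v is not None]
    if strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:                      # "default": a pre-selected constant takes the place
        fill = default
    return [fill if v is None else v for v in column]

ages = [25, None, 31, 40, None, 28]
print(impute(ages, "mean"))    # missing entries become 31.0
print(impute(ages, "median"))  # missing entries become 29.5
```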
Data Balancing
The XGBoost algorithm is also used widely in the TPOT tool and in Auto-WEKA for data
balancing. Cost-sensitive learning uses the variable cost of misclassification to balance the
bias of an imbalanced class. It is suitable for highly skewed datasets where certain classes are
minorities. In the case of AutoML, tools such as TPOT provide an implementation in their API
to adjust for class-specific sensitivity in skewed classes.
Resampling Techniques
Dealing with imbalanced datasets entails strategies such as improving classification
algorithms or balancing classes in the training data (data preprocessing) before providing the
data as input to the machine learning algorithm. The latter technique is preferred, as it has
wider application.
The main objective of balancing classes is either to increase the frequency of the
minority class or to decrease the frequency of the majority class. This is done in order to obtain
approximately the same number of instances for both classes. Let us look at a few
resampling techniques:
Random Under-Sampling
Random under-sampling aims to balance the class distribution by randomly eliminating
majority class examples. This is done until the majority and minority class instances are
balanced out.
Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2%
Random Over-Sampling
Over-Sampling increases the number of instances in the minority class by randomly
replicating them in order to present a higher representation of the minority class in the sample.
Total Observations = 1000
Fraudulent Observations = 20
Non-Fraudulent Observations = 980
Event Rate = 2%
In this case we are replicating the 20 fraud observations 20 times.
Non-Fraudulent Observations = 980
Fraudulent Observations after replicating the minority class = 400
Total Observations in the new data set after oversampling = 1380
Event Rate for the new data set after oversampling = 400/1380 = 29%
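The arithmetic of this worked example can be checked with a few lines (the fraud counts are the hypothetical ones above):

```python
# Random over-sampling sketch reproducing the worked example:
# 20 fraud observations replicated 20 times against 980 non-fraud observations.

minority = ["fraud"] * 20
majority = ["non-fraud"] * 980
event_rate = len(minority) / (len(minority) + len(majority))
print(round(event_rate * 100))          # 2 (percent)

oversampled_minority = minority * 20    # replicate each minority instance 20 times
new_total = len(oversampled_minority) + len(majority)
new_event_rate = len(oversampled_minority) / new_total
print(len(oversampled_minority), new_total)   # 400 1380
print(round(new_event_rate * 100))            # 29 (percent)
```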
Advantages
Unlike under-sampling, this method leads to no information loss, and it outperforms
under-sampling.
Disadvantages
It increases the likelihood of overfitting since it replicates the minority class events.
Cluster-Based Over-Sampling
Advantages
This clustering technique helps overcome the challenge of between-class imbalance,
where the number of examples representing the positive class differs from the number of
examples representing the negative class. It also overcomes challenges within class imbalance,
where a class is composed of different sub-clusters and each sub-cluster does not contain the
same number of examples.
Disadvantages
The main drawback of this algorithm, like most oversampling techniques, is the
possibility of over-fitting the training data.
SMOTE (Synthetic Minority Over-Sampling Technique)
A sample of 15 instances is taken from the minority class, and similar synthetic instances
are generated 20 times.
Post generation of synthetic instances, the following data set is created:
Minority Class (Fraudulent Observations) = 300
Majority Class (Non-Fraudulent Observations) = 980
Event rate = 300/1280 = 23.4%
Advantages
Mitigates the problem of overfitting caused by random oversampling, as synthetic
examples are generated rather than instances being replicated. There is no loss of useful
information.
Disadvantages
While generating synthetic examples, SMOTE does not take into consideration
neighboring examples from other classes. This can result in an increase in the overlapping of
classes and can introduce additional noise. SMOTE is not very effective for high-dimensional
data.
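The interpolation idea behind SMOTE can be sketched as follows; this toy version picks a random minority-class neighbour rather than the k-nearest neighbours a real implementation (such as the one in imbalanced-learn) would use:

```python
# Simplified SMOTE sketch: each synthetic point is interpolated between a
# minority sample and another minority-class point. Toy 2-D data.
import random

random.seed(42)

def smote(minority, n_synthetic):
    synthetic = []
    for _ in range(n_synthetic):
        a, b = random.sample(minority, 2)      # a sample and a "neighbour"
        t = random.random()                    # interpolation factor in [0, 1]
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return synthetic

minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.4)]
new_points = smote(minority, 8)
print(len(minority) + len(new_points))  # minority class grows from 4 to 12
```

Because every synthetic point lies on a segment between two real minority points, no replication takes place and no majority-class information is touched.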
Data Encoding
Now that we have discussed the types of categorical variables, let us see the different
types of encoding:
Nominal Encoding and Ordinal Encoding
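The two encodings can be sketched with toy categories and illustrative helper functions (ordinal encoding assumes a meaningful order; nominal categories get one binary indicator column each):

```python
# Sketch of the two encodings: ordinal encoding maps ordered categories to
# integer ranks; one-hot (nominal) encoding gives each unordered category its
# own binary column. Toy categories are invented for illustration.

def ordinal_encode(values, order):
    """Map ordered categories to integer ranks."""
    rank = {cat: i for i, cat in enumerate(order)}
    return [rank[v] for v in values]

def one_hot_encode(values):
    """Map unordered categories to binary indicator vectors."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

sizes = ["small", "large", "medium", "small"]               # ordinal: order matters
print(ordinal_encode(sizes, ["small", "medium", "large"]))  # [0, 2, 1, 0]

colours = ["red", "green", "red"]                           # nominal: no order
print(one_hot_encode(colours))  # [[0, 1], [1, 0], [0, 1]]
```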
Chapter 5
FEATURE ENGINEERING
• Feature Mining
• Feature Generation
• Feature Selection
During the process of feature selection, either the analyst or the modeling tool or
algorithm actively selects or discards attributes based on their usefulness for analysis. The
analyst might perform feature engineering to add features and remove or modify existing data,
while the machine learning algorithm typically scores columns and validates their usefulness
in the model.
In short, feature selection helps solve two problems: having too much data that is of little
value, or having too little data that is of high value. Your goal in feature selection should
be to identify the minimum number of columns from the data source that are significant in
building a model.
Feature Generation
Feature generation can improve model performance when there is a feature interaction.
Two or more features interact if their combined effect is greater or less than the sum of their
individual effects. It is possible to make interactions with three or more features, but this tends
to result in diminishing returns.
Feature generation is often overlooked, as it is assumed that the model will learn any
relevant relationships between features to predict the target variable. However, the generation
of new flexible features is important, as it allows us to use less complex models that are faster
to run and easier to understand and maintain.
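A tiny example of such an interaction (synthetic data built so the target is exactly the product of two features, a relationship neither feature exposes on its own):

```python
# Feature generation sketch: a target that depends on the *product* of two
# features cannot be captured by either feature alone; adding the interaction
# term as a new column makes the relationship trivially learnable.
# The data below is synthetic, built so that y = x1 * x2.

rows = [(2, 3), (4, 5), (1, 7), (6, 2)]
y = [x1 * x2 for x1, x2 in rows]

# Generated feature: the pairwise interaction x1 * x2.
augmented = [(x1, x2, x1 * x2) for x1, x2 in rows]

# With the interaction column present, y is a perfect linear function of it.
print(all(row[2] == target for row, target in zip(augmented, y)))  # True
```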
Feature Selection
In fact, not all generated features are relevant. Moreover, too many features may
adversely affect model performance. This is because, as the number of features increases,
it becomes more difficult for the model to learn the mappings between features and target
(this is known as the curse of dimensionality).
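One of the simplest selection filters, dropping near-constant columns, can be sketched as follows (a toy matrix; scikit-learn's VarianceThreshold implements the same idea):

```python
# Feature selection sketch: drop near-constant columns (low variance), a simple
# filter method. The toy matrix is invented: its middle column is constant and
# carries no information about any target.
from statistics import pvariance

def select_by_variance(rows, threshold=0.0):
    """Keep the indices of columns whose variance exceeds the threshold."""
    columns = list(zip(*rows))
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]

X = [
    [1.0, 5.0, 0.2],
    [2.0, 5.0, 0.9],
    [3.0, 5.0, 0.1],
]
print(select_by_variance(X))  # [0, 2] -> the constant middle column is dropped
```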
Chapter 6
MODEL SELECTION AND HYPERPARAMETER OPTIMIZATION
Most of the AutoML tools and methods combine the problem of model selection and
hyperparameter optimization into a single problem called the CASH (Combined Algorithm
Selection and Hyperparameter optimization) problem. The CASH problem considers model
selection and hyperparameter optimization as one joint optimization problem.
Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide
the domain of the hyperparameters into a discrete grid. Then, we try every combination of
values in this grid, calculating some performance metric using cross-validation. The point
of the grid that maximizes the average value in cross-validation is the optimal combination
of values for the hyperparameters.
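The procedure can be sketched as follows; the scoring function here is a toy stand-in for a real cross-validated metric, and the grid values are invented:

```python
# Grid search sketch: exhaustively score every combination on a discrete grid.
# cv_score is an assumed stand-in for a cross-validated metric; in practice the
# score would come from actually training and validating a model.
from itertools import product

def cv_score(learning_rate, depth):
    """Toy stand-in for a cross-validated metric, peaking at (0.1, 4)."""
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(depth - 4)

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

best_params, best_score = None, float("-inf")
for lr, d in product(grid["learning_rate"], grid["depth"]):
    score = cv_score(lr, d)              # evaluate every grid point
    if score > best_score:
        best_params, best_score = {"learning_rate": lr, "depth": d}, score

print(best_params)  # {'learning_rate': 0.1, 'depth': 4}
```

Note the cost: the number of evaluations is the product of the grid sizes, which is exactly the combinatorial explosion the chapter describes.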
Early work on automated algorithm configuration was limited to the optimization of a few
numerical algorithm parameters on single instances. Later work extended this paradigm to
general algorithm configuration problems, allowing many categorical parameters and
optimization over sets of instances. Such configuration procedures have been validated
experimentally by optimizing a local search and a tree search solver for the propositional
satisfiability problem (SAT), as well as the commercial mixed integer programming (MIP)
solver CPLEX, yielding state-of-the-art performance and in many cases outperforming the
previous best configuration approach. Random search and grid search
perform their hyperparameter checks independently of each other and often end up performing
repeated and wasteful computations. To improve on their shortcomings, Sequential Model-
based Optimization (SMBO) was proposed, which uses a combination of regression and
Bayesian optimization to select hyperparameters. It sequentially applies the hyperparameters
and adjusts their values based on a Bayesian probabilistic model of the objective.
The probabilistic approach of SMBO resolves the scalability issues that were rampant in grid
search and random search.
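The SMBO loop can be sketched with a deliberately simple surrogate (a quadratic fit standing in for the Gaussian process or regression model a real SMBO implementation would use; the objective is a toy stand-in for an expensive validation score):

```python
# SMBO sketch for one hyperparameter: fit a cheap surrogate to the points
# evaluated so far, then spend the next expensive evaluation on the
# configuration the surrogate predicts to be best.
import numpy as np

def objective(x):
    """Pretend validation error, minimised at x = 0.3 (expensive in reality)."""
    return (x - 0.3) ** 2 + 0.05

candidates = np.linspace(0.0, 1.0, 101)
xs = [0.0, 0.5, 1.0]                 # initial design points
ys = [objective(x) for x in xs]

for _ in range(5):
    coeffs = np.polyfit(xs, ys, 2)   # surrogate model of the observations
    preds = np.polyval(coeffs, candidates)
    x_next = float(candidates[int(np.argmin(preds))])  # most promising point
    xs.append(x_next)
    ys.append(objective(x_next))     # one expensive evaluation per iteration

best = xs[int(np.argmin(ys))]
print(round(best, 2))  # 0.3
```

Unlike grid search, each evaluation here is informed by all previous ones, which is what lets SMBO reach good configurations with far fewer expensive model fits.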
Chapter 7
DISCUSSION
Even though data pre-processing consumes a large chunk of time in an ML pipeline, it is
astonishing to see the inadequate amount of work done to automate it. For data pre-processing,
it can be noted that while the existing approaches are adequate for structured and semi-
structured data, work still needs to be done to assimilate unstructured data. We suggest the
incorporation of data-mining methods, as they can deal with such unformed data. This can
allow AutoML pipelines to create models capable of learning from Internet sources. In feature
engineering, it should be noted that most methods used until now adhere to supervised
learning. However, dataset specificity is high, and therefore AutoML pipelines should be
as generic as possible to accommodate diverse datasets. A gradual paradigm shift towards
unsupervised learning is required to increase the ability of AutoML. To replace
domain experts, feature generation should be able to work flexibly (such as with the
introduction of non-standard transforms) with the original feature sets. Reinforcement
learning is a step in the right direction and needs to be incorporated further into feature
engineering. Hyperparameter optimization has seen large improvements over the years,
especially with the introduction of Bayesian optimization strategies such as SMBO.
However, the use of a continuously integrating meta-learning framework needs to
be researched, as its performance gain is high. Transfer learning has also been successfully
used in the context of AutoML to show promising results. With the increase in the availability
of task-specific pre-trained models, an increase in the usage of transfer learning should be
expected.
Chapter 8
CONCLUSION
In this paper, we provide insights to the readers about the various segments of AutoML
from a conceptual perspective. Each of these segments has various approaches that have been
briefly explained to provide a concise overview. We also discuss the various trends seen in
recent years, including suggestions of under-explored research areas which need attention.
We also put forward some future directions that can be explored to extend the research in the
domain of AutoML. We suggest that research can be explored in the direction of a generalized
AutoML pipeline, which can accept datasets of a wide range, and that a central meta-learning
framework be established that acts as a central brain for approximating the pipelines
for all future problem statements.
We almost forget it in these times of strong focus on technological innovation, but in the
end, technology is only there to support your business. In other words, data analysis is
not our core business; it must only support our core business. As an entrepreneur, it
is therefore beneficial to choose technology that works as efficiently as possible.
Then our analysts can put their cognitive energy back into thinking about business
problems instead of doing endless repetitive work before the technology can do its job.
Bibliography
[1] Lukas Tuggener, Mohammadreza Amirian, Katharina Rombach, Stefan Lörwald,
Anastasia Varlet, Christian Westermann, and Thilo Stadelmann. Automated machine
learning in practice: state of the art and recent results. In 2019 6th Swiss Conference
on Data Science (SDS), pages 31-36. IEEE, 2019.
[2] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-
scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training
of deep bidirectional transformers for language understanding. arXiv preprint
arXiv:1810.04805, 2018.
[4] Avatar Jaykrushna, Pathik Patel, Harshal Trivedi, and Jitendra Bhatia. Linear regression
assisted prediction based load balancer for cloud computing. In 2018 IEEE Punecon,
pages 1-3. IEEE, 2018.
[5] Jitendra Bhatia, Ruchi Mehta, and Madhuri Bhavsar. Variants of software defined
network (SDN) based load balancing in cloud computing: A quick review. In International
Conference on Future Internet Technologies and Trends, pages 164-173. Springer, 2017.
[6] Ishan Mistry, Sudeep Tanwar, Sudhanshu Tyagi, and Neeraj Kumar. Blockchain for 5G-
enabled IoT for industrial automation: A systematic review, solutions, and challenges.
Mechanical Systems and Signal Processing, 135:106382, 2020.