
Machine Learning for Fluid Mechanics:

Challenges, Opportunities and Perspectives


Miguel A. Mendez, Fabio Pino and Matilde Fiore
von Karman Institute for Fluid Dynamics

Version 1

9 September 2020

This lecture gives an introduction to machine learning and an overview of its current and potential applications to fluid dynamics. The lecture opens with the types of learning (supervised, unsupervised, semi-supervised and reinforcement) and the classification of machine learning algorithms based on their scope and their reliance on data. The presentation will indirectly unveil the motivation for such a lecture in a course on Optimization, as learning means optimizing a functional. Examples of fluid mechanics problems that can be framed as machine learning problems are discussed.
Three demonstrative exercises are proposed to give the attendee hands-on experience on the subject. These are 1) the regression problem of deriving a turbulence model using Artificial Neural Networks (ANNs), 2) the unsupervised problem of extracting a low-dimensional representation of a turbulent velocity field and 3) the Reinforcement Learning problem of using feedback control for wave stabilization. These examples should give enough practical experience to develop both the optimism and the skepticism required to assess the capability of machine learning to advance fluid mechanics.

Preamble This is an ‘alpha’ version of the notes. Feedback and suggestions for improvements are very welcome at mendez@vki.ac.be. To cite this version:
@InProceedings{Mendez2020,
  author    = {Miguel A. Mendez and Fabio Pino and Matilde Fiore},
  title     = {Machine Learning for Fluid Mechanics: Challenges, Opportunities and Perspectives},
  booktitle = {Optimization Methods for Computational Fluid Dynamics, VKI Lecture Series},
  year      = {2020},
  publisher = {von Karman Institute},
  doi       = {XXXX},
}


Contents
1 Introduction and Motivation

2 What is Machine Learning?
  2.1 Supervised Learning
  2.2 Unsupervised Learning
  2.3 Semi-Supervised Learning
  2.4 Reinforcement Learning

3 Regression and Modeling
  3.1 Linear Methods
  3.2 Common Optimization Tools
  3.3 Artificial Neural Networks (ANNs)
  3.4 An exercise on Turbulence Modeling
    3.4.1 The Constant Prt
    3.4.2 The ANN turbulence model

4 Dimensionality Reduction
  4.1 A first simple example
  4.2 Dimensionality reduction of a flow field

5 Reinforcement Learning and Control
  5.1 Problem Set
  5.2 RL Definitions
  5.3 Results

6 Machine Learning for Fluid Mechanics

7 Conclusions and Perspectives

A Learning Machine Learning


1 Introduction and Motivation


Machine learning is driving comprehensive economic and social transformations and has
earned its enormous popularity by solving problems that were deemed impossible a couple
of decades ago. The main driver of such a revolution is the ever-increasing amount of
data produced in our everyday life. Big data and machine learning have given computers
the ability to learn to recognize people from their faces, detect and track objects from
videos, understand and translate speech in real time, recommend which movie to watch,
diagnose diseases, and drive cars.
These tasks are incredibly complex and share a common ground: they are not amenable
to a precise mathematical formulation. There is no mathematical model, for instance, that can be used to distinguish objects in images, words in audio recordings, or customer preferences in customer review databases. To succeed in these tasks, our brain (and the
computer’s ‘brain’) must learn to detect features and patterns –and distill information
accordingly– using only experience and (a lot of) data. In other words, these tasks can
only be solved by learning from data (Abu-Mostafa et al., 2012).
Extracting knowledge from data without using first principles is potentially opening
the path towards a shift in our working paradigm, redefining the role of data and al-
gorithms in the scientific method. Data might not only be used to test and validate
hypotheses but also to formulate new ones and algorithms might not only be used to
answer questions but also to ask new ones (Jiménez, 2020a).
Whether Science could ever be automated is a question with deep philosophical im-
plications, and the reader is referred to Jiménez (2020b) for an interesting perspective.
It is nevertheless commonly accepted that, thanks to its continuous advancements in nu-
merical and experimental methods, fluid mechanics has grown into a field of big data, and
offers fertile grounds to the data-driven paradigm. The extent to which machine learning
might develop in our field and revolutionize our methodological portfolio is subject
to intense debate (see Brenner et al. 2019; Brunton et al. 2020; Kutz 2017; Duraisamy
et al. 2019b; Müller et al. 1999; Raissi et al. 2020) accompanied, like every revolution, by
optimism and skepticism. This lecture should hopefully promote both, and highlight the need for combining first principles and machine learning.
We begin by introducing the fundamentals of the learning process and its different forms in Section 2, along with a brief description of the most popular tools. We proceed with three tutorial sessions to promote hands-on experience. The first, in Section 3, provides a tour of regression analysis, which begins with linear regression, briefly presents
Gaussian Processes and Artificial Neural Networks (ANNs) and then closes with an ex-
ercise on data-driven turbulence modeling. The second, in Section 4, provides a tour
on Reduced Order Modeling using Principal Component Analysis (PCA), Kernel PCA
(kPCA) and ANNs auto-encoders. The third tutorial, in Section 5, presents an exercise
on feedback control via Reinforcement Learning (RL). We give a short overview of how
these tools are used in fluid mechanics in Section 6 and we offer some perspectives in
Section 7.
To limit the lecture’s scope, the pitch of the presentation is rather introductory, and
the desire to span a large portion of such an active field is likely to result in a tour
de force. We hope the attendees will forgive us for this, and we offer in return some
recommendations for further reading and online courses in the Appendix A.


2 What is Machine Learning?


Machine learning is a subset of Artificial Intelligence (AI) at the intersection of computer
science, statistics, engineering, neuroscience, and biology. The term ‘machine learning’
was coined by Samuel (1959), who defined it as follows:
Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.
This definition sets a clear change of paradigm over the earliest approaches to AI,
known as symbolic AI, which conceived the learning process as the assimilation of a large
set of rules that programmers could hard-code into algorithms. The size of such a set
of rules grew beyond imagination even for the simplest task [1] and the paradigm faced an
insurmountable barrier with problems that cannot be described by a list of formal rules.
Machine learning developed as a new paradigm of AI, focusing on the notion of learning from data, or, putting it in human (and animal) behavioral terms, learning from experience. This is a central ingredient of intelligence, as it encompasses the important human abilities to remember, adapt and generalize. There are nevertheless many traits of intelligence, such as reasoning or logical deduction, that are outside the scope of machine learning [2].
If the focus is solely on the learning process, the engineering-oriented definition by
Mitchell (1997) gives all the necessary ingredients:
A computer program is said to learn from experience E with respect to some task T
and some performance measure P, if its performance on T, as measured by P, improves
with experience E.
This definition calls out the four ingredients of learning: 1) a task, i.e., what exactly must be learned, 2) some experience (i.e., some data), 3) a performance measure, i.e., a way to evaluate performance, and 4) an optimization, i.e. a way to drive the learner towards better performance.
At this stage of the course, the reader should recognize, in the third ingredient, the notion of cost function, and should have an arsenal of tools to deal with the fourth ingredient, which is the topic of this course and the engine of the machine learning revolution: optimization. What the reader might still need to learn, to enter the machine learning world, is how to formulate ingredient 1) and how to use ingredient 2) in the whole process. This is the subject of this section.
Combining the definitions previously given, we can formulate the problem in the general framework [3] of Figure 1. The task to be learned is formulated in terms of an unknown function f : X → Y that links inputs x ∈ X to outputs y ∈ Y. Drawing from a set of hypotheses, the learning algorithm operates with a parametric representation y = f̃(x, w),
[1] See Domingos 2015 for a compelling historical review.
[2] These, however, must remain within your scope, as you learn how to use machine learning for scientific purposes.
[3] The presentation that follows is strongly simplified. We have deliberately omitted the probabilistic and the Bayesian frameworks of the process. More comprehensive introductions are proposed by Abu-Mostafa et al. (2012), who analyze in more detail the hypothesis set and the theory of generalization, and by Vladimir Cherkassky (2008) and Murphy (2012), who emphasize the probabilistic framework. More advanced introductions to the Bayesian aspects of the learning process are provided by Bishop (2006) and Rasmussen and Williams (2005).


Figure 1: General setup of a machine learning problem: an algorithm learns to approximate a function f with a parametric model f̃(x, w) that depends on some parameters w, using experience (data).

which depends [4] on a set of weights w ∈ W. This is our data-driven (surrogate) model for the function f and is the final outcome of the learning process.
Learning consists in tuning the parameters (weights) w such that our model performs satisfactorily (according to the chosen measure), that is [5], f̃ ≈ f. The learning process is also called training. The bold notation recalls that both inputs x and outputs y are in general high-dimensional: these are vectors in R^d, with d the dimension, or sequences of numbers encoding discrete quantities such as categorical data (classes).
The dimension of W, and more generally of the hypothesis set, can be loosely linked to
the notion of model complexity or model capacity (see Abu-Mostafa et al. 2012 and
Vladimir Cherkassky (2008) for more formal definitions). The more complex the model,
the larger the dataset required to train it.
The performance of the algorithm is measured in terms of the error on the available data, called in-sample error Ein or training error, while the (unknown) error on new data is called out-of-sample error Eout (or generalization error). A model that does not have enough capacity (i.e. is not sufficiently complex) for a given task will produce a large Ein regardless of the parameters w. This is called underfitting and can be handled by increasing the model complexity. On the contrary, a model that is too complex for a given dataset produces overfitting. A model overfits if it yields a small Ein but a large Eout, and the difficulty in avoiding this problem is due to the fact that we can only estimate Eout.
[4] Not all learning algorithms are parametric (i.e., have parameters). Non-parametric methods link inputs to outputs without building a model but using the training data itself. A classic example is nearest neighbor regression, in which a prediction is based on the similarity of a new input to the inputs available in the training set. These methods are also called instance-based or memory-based, since they essentially store the training set in lookup tables and make predictions based on interpolation. We do not cover these methods here; the reader is referred to Alpaydin (2020) for more information.
[5] The distinctive feature of machine learning, over function approximation, is that the function f (referred to as the target function) is unknown.

The need for minimizing Ein anchors machine learning to optimization, while the


hope of minimizing Eout bridges with statistical inference. The overfitting problem is central in machine learning and stems from its inherently ill-posed mathematical framework: estimating a mapping f : X → Y from a finite set of points has no unique solution, and there is no guarantee that a model performing well on a set of data generalizes to the set of all possible inputs. Dealing with such a generalization is the essence of inductive reasoning, which lays the foundation of all empirical sciences (Popper, 2002). As empirical scientists, our tools to cope with overfitting (hence our ability to generalize) stem from the fundamental principles and laws of physics (e.g., conservation laws), which we accept based on experience. We trust these laws not because we have a mathematical proof of their validity, but because we have never seen problems falsifying them [6].
Physical laws can be incorporated in the hypothesis set, hence in the definition of the
model and its complexity; this flexibility is driving the diffusion of machine learning in
all empirical sciences (Frank et al., 2020).
When no prior knowledge about the model can be used in the hypothesis set, machine
learning offers two common (and general) tools: cross-validation and regularization.
We will practice with both aspects in the tutorial of Section 3. Briefly, cross-validation
consists in splitting the data into a training set x∗ , y∗ and a validation set x∗∗ , y∗∗ . The
first is used in the training process (to minimize Ein ), whereas the second is used to
estimate Eout . Regularization consists of adding additional constraints for the weights w
in the optimization, with the scope of guiding the learner towards a parsimonious choice
of complexity.
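As a minimal illustration of the data split used in cross-validation, the following sketch (an assumption for illustration, not code from these notes) randomly partitions paired arrays x and y into a training and a validation set:

import numpy as np

def split(x, y, n_train, seed=0):
    # Shuffle the sample indices and cut them into two disjoint sets
    idx = np.random.default_rng(seed).permutation(len(x))
    tr, va = idx[:n_train], idx[n_train:]
    return (x[tr], y[tr]), (x[va], y[va])

The training set is used to minimize Ein, while the held-out set provides the estimate of Eout.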
We close this introduction by recalling that, whether a model is data-driven or physics-driven, it is good practice to prioritize the simplest formulation. This principle goes by the name of Occam's Razor [7] and has accompanied the history of Science since the 14th century.
Equipped with the general framework of Figure 1, we proceed with the formulation of
the four main kinds of learning, along with the main machine learning tasks and methods.
These are briefly reviewed in what follows and summarized in Figure 2.

Figure 2: Overview of various machine learning techniques and their classification. The list is by no means exhaustive, but solely offers a mind map of the machine learning landscape.

[6] At least within the scales that concern us: as fluid dynamicists, we work with Newton's principles and the laws of thermodynamics, although we know that both fail in other contexts of modern physics.
[7] Attributed to William of Occam (1287-1347), the razor is meant to trim down an explanation to the simplest one that is consistent with the data. This is also known as the law of parsimony.


2.1 Supervised Learning


Supervised learning is also known as predictive learning and is the learning approach
that most naturally fits in the general scheme of Figure 1. The goal is to learn the mapping
from inputs x ∈ X to outputs y ∈ Y from a set of training data. Such training data is
often referred to as labeled data, which has been collected by a supervisor. The main
supervised learning tasks are classification and regression: both involve learning the
function f : X → Y, but differ in the nature of the output space Y.
In classification problems, the output space is a finite set of classes, e.g. Y :=
{1, 2, . . . C}. If this set only contains two classes (e.g. y = 1 for ‘yes’ and y = −1 for
‘no’), the classification is binary (or binomial, in some sources); if it contains more, the
classification is multiclass (or multinomial). An example of multinomial classification
is that of recognizing handwritten digits from an image, or classifying a flow condition as belonging to a certain regime (e.g. bubbly flow versus churn flow).
In regression problems, the output space is continuous (typically Y := R). Regression encompasses methods used in curve fitting, as both aim at finding a relation between variables, but it differs by its roots in statistical inference: a regression model allows one to make inferences on how well the model reflects the data, or how likely it is that a new data point belongs to the same process that generated the model. While most of the regression methods discussed in this chapter can be derived in a curve-fitting framework, there are methods that do not involve fitting a curve [8]. Regression is also different from interpolation or smoothing, which lack the statistical framework and hence do not account for the presence of noise in the training data. An example of a regression problem is that of predicting the price of a car given a set of features (e.g., mileage and age), or developing a turbulence model from data.
All classification algorithms can be used for regression and vice-versa, with a minor
change in the problem formulation. In a regression problem, the function to be learned (or
an approximation of it) is the one that most likely links input to output. In a classification
problem, the function to be learned is the one that best separates the input domain into the subdomains belonging to each class, i.e., that best represents the decision boundary(ies). The difference is pictorially illustrated in Figure 3. In both cases, the estimation of the function f has uncertainties that lead to the notion of margins, which in Figure 3 are represented by the dashed lines. Note that in a regression problem we hope
to have narrow margins while the opposite is true in a classification problem. Methods
for margin minimization/maximization in regression/classification can be hard or soft:
in the first case we tolerate no margin violation (all points are inside the dashed lines
in regression and outside the dashed lines in classification) while in the second case we
accept some violation. An example of a classification problem is that of labeling emails as spam or not, or identifying and mapping flow regimes.
To give the reader an idea of what can be considered as ‘simple’ problems in Ma-
chine learning, we consider the classification task of recognizing hand-written digits from
grayscale images (Lecun et al., 1998). This is nowadays one of the first tutorials in almost
all programming manuals [9]. The dataset is the famous MNIST dataset, provided by the
[8] This is the case of ANOVA (ANalysis Of VAriance) regression, for instance (Paolella, 2018).
[9] We recommend Müller and Sarah (2016); Chollet (2017); Géron (2019); Raschka and Mirjalili (2019) for practicing with Machine Learning using Python. Python has become the lingua franca for most data science applications, as it combines the power of a general-purpose programming language with the ease of use of scripting languages such as Matlab, R or Julia. It is free, and it is equipped with an extremely vast toolbox of libraries for scientific computing and machine learning. Some of these (like TensorFlow) have been recently open-sourced by tech giants such as Google.


Figure 3: Schematic illustration of the distinction between regression (left) and classification problems (right). In regression, the function f is used to predict the output, i.e. f(x) = y; in classification, the function f defines the decision boundary separating the domains of each class. In both figures, the function is represented with continuous lines, while the dashed lines demarcate the margins: the region within which the function is expected to lie.

National Institute of Standards and Technology [10], consisting of n∗ = 60000 images for training and n∗∗ = 10000 images for validation. Each image consists of 28 × 28 pixels with an 8-bit color depth, hence each pixel stores an integer from 0 (black) to 255 (white). The decision boundary for the classification is a function from X := R^784 to Y = {0, 1, 2, . . . 9}. Using the Python library Keras (Chollet, 2017), this problem is solved in less than 20 lines of code using ANNs and reaches an accuracy of 97.8% within 4 minutes of training on the writers' laptops.
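To make this concrete, the following is a minimal sketch of such a digit classifier in Keras; the architecture, the number of epochs and the resulting accuracy are illustrative assumptions, not necessarily the configuration quoted above.

from tensorflow import keras

# Load MNIST and flatten each 28 x 28 image into a 784-dimensional vector
(x_tr, y_tr), (x_va, y_va) = keras.datasets.mnist.load_data()
x_tr = x_tr.reshape(-1, 784).astype("float32") / 255.0
x_va = x_va.reshape(-1, 784).astype("float32") / 255.0

# A small fully-connected network: one hidden layer, ten softmax outputs
model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=5, batch_size=128,
          validation_data=(x_va, y_va))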
Classic supervised algorithms are linear regressions with or without kernel formulation, Support Vector Machines (SVMs), ANNs, Decision Trees, Random Forests, and Gaussian process and logistic regressions. Some of these are covered in Section 3, while the reader is referred to standard textbooks such as Alpaydin (2020); Vladimir Cherkassky (2008); Frank et al. (2020); Mitchell (1997) for more exhaustive introductions.

2.2 Unsupervised Learning


Unsupervised learning is also known as descriptive learning. The goal is to discover
“interesting” or “essential” structures and patterns in the data. Setting the problem in
the framework of Figure 1, the function to be learned is a mapping from the input space
to itself, i.e. f : X → X and the training data is unlabeled: there is no output domain.
The main unsupervised learning tasks are dimensionality reduction and clustering.

[10] Available at http://yann.lecun.com/exdb/mnist/.


Dimensionality Reduction
Dimensionality reduction aims at identifying a lower-dimensional representation of the
data. The underlying assumption is that the degrees of variability in the data can be
reduced if the focus is placed on essential features (called hidden or latent factors). A
successful face recognition algorithm, for example, focuses on patterns in the images that
are well associated with features such as age, gender or pose, and constructs a reduced set
of images that encodes all the information required for the recognition (Swets and Weng,
1996; Turk and Pentland, 1991).
Dimensionality reduction is an “information bottleneck” (Vladimir Cherkassky, 2008)
composed of an encoder mapping z = E(x) ∈ Rr and a decoder mapping x̃ = D(z) ∈ Rd
as shown in the schematic of Figure 4. The function to be learned is g(x, w) := D(z) =
D(E(x)).

Figure 4: Schematic illustration of the process of dimensionality reduction viewed as an "information bottleneck". Dimensionality reduction consists in using an encoder mapping E to distill the input into a low-order representation z that contains sufficient information for a decoder mapping D to recover x ≈ x̃ = D(E(x)).

The compressed representation z ∈ R^r is the result of the distillation process and has a (much) lower dimension than the input x ∈ R^d, i.e. r ≪ d. Yet, if the features encoded in z are truly essential, the decoder should be able to recover a good approximation x̃ ≈ x ∈ R^d. The composition of an encoder and a decoder is commonly known as an autoencoder, although this term is mostly used when the process in Figure 4 is carried out using ANNs. A popular (and dangerous) application of autoencoders that has hit the headlines recently is that of Deep Fakes [11], i.e. the generation of deceptively realistic fake videos. These can be created by swapping the decoders and the encoders of different datasets (Nguyen et al., 2019), effectively blending them into a synthetic dataset.
Dimensionality reduction is an essential tool in the data scientist's toolbox for at least three reasons. The first obvious reason is that of an economy in interpretation and analysis: if the relevant information is hidden in r ≪ d dimensions, then there is no interest in considering the remaining d − r. The second reason is that the simpler representations z might be used to derive simpler models or as inputs to supervised techniques, reducing the computational complexity involved and hence the risk of overfitting. The third reason is that, by focusing on the essential features, an algorithm can learn to recognize (and thus remove) irrelevant sources of variance such as noise or outliers. Any denoising operation is, in essence, an exercise in dimensionality reduction.
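As a minimal illustration of the encoder/decoder view, the sketch below implements a linear autoencoder (PCA via the SVD) on a toy data matrix; the matrix X and the bottleneck size r are arbitrary assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 50)) @ rng.standard_normal((50, 50))  # toy data, shape (n, d)

Xm = X - X.mean(axis=0)                    # center the data
U, S, Vt = np.linalg.svd(Xm, full_matrices=False)
r = 5                                       # size of the bottleneck
Z = Xm @ Vt[:r].T                           # encoder: z = E(x), shape (n, r)
X_tilde = Z @ Vt[:r] + X.mean(axis=0)       # decoder: x_tilde = D(z)

# The relative reconstruction error decreases as r grows towards d
print(np.linalg.norm(X - X_tilde) / np.linalg.norm(X))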
[11] See https://en.wikipedia.org/wiki/Deepfake.


Algorithms for dimensionality reduction (i.e., autoencoders) can be broadly classified


into linear and nonlinear. The most popular example of linear autoencoder is the Principal
Component Analysis (PCA), known in the fluid dynamics community as Proper Orthog-
onal Decomposition (POD), and its many variants. Examples of nonlinear autoencoders
are kernel PCA, isometric feature mapping, Locally Linear Embedding (LLE), Laplacian
Eigenmaps, and ANN autoencoders. Section 4 gives a short review of PCA and ANN autoencoders; for more material on the subject, the reader is referred to Alpaydin (2020);
Farcomeni and Greco (2015); Goodfellow et al. (2016); Mendez (2020a); Jolliffe (2002);
Lee and Verleysen (2007).

Clustering
Clustering aims at partitioning the dataset into groups (clusters). Each of the members
of a cluster is assumed to have some “similarity”, hence to belong to a certain “class”.
Clustering differs from classification in that it is an unsupervised technique: no labeled
data is available and no “right answer” is known upfront– not even the number of clusters.
The training set only contains unlabeled data and the only possible mapping remains
f : X → X . A simple example of a clustering problem is that of finding customers that have similar purchase behavior, as a basis for recommendation engines.
Clustering methods can be classified based on the type of input or the type of output.
In terms of inputs, clustering can be feature-based if the input is the dataset itself or
similarity-based if the input is some measure of similarity (distances between samples).
The first is often less sensitive to noise, while the second allows for introducing prior (domain-specific) knowledge of similarity (Vladimir Cherkassky, 2008). In terms of outputs, clustering can be partitional (or prototype-based) or hierarchical. In the first, all clusters are at ‘the same level’ and characterized by a representative prototype (typically the centroid in the case of continuous features, or the medoid in the case of categorical features). In the second, the data is grouped over a variety of scales by creating trees and dendrograms. Finally,
clustering can be model-based or model-free. In the first case, a probabilistic model is
assumed (as in Gaussian Mixture Models (GMM), see Bishop (2006)) while in the second
case no assumption is made on the probability density distributions of the clusters.
The first step of any clustering method is the definition of a measure of similarity,
which implicitly dictates what constitutes a cluster. Intuitive definitions are the Euclidean
distance (or its square) while less intuitive measures are the Mahalanobis, the Chernoff
or the Bhattacharyya distance (see Snyder and Qi (2004) for more).
The simplest and most popular prototype-based method for clustering is the k-means algorithm. In its basic implementation, this algorithm requires the number of clusters as input and returns a set of labels which assigns each data point to a cluster. This assignment is driven by an optimization that minimizes the intra-cluster variance, often called cluster inertia or distortion. To select the number of clusters, two classic approaches are the elbow plot and the silhouette plot (a minimal example is sketched below). The reader is referred to Raschka and Mirjalili (2019) for an excellent tutorial on these tools; Section 5 will briefly illustrate their use in the context of cluster-based reduced order models. Variants of the k-means algorithm arise from different initialization strategies or the deterministic (hard) versus probabilistic (soft) criteria used in the classification: in the first case, each data point belongs to a single cluster, while in the second case a data point has a probability of belonging to one cluster or the other. This second approach is also known as fuzzy clustering (Bezdek, 1981).
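The sketch below illustrates this workflow with scikit-learn; the toy dataset and the range of candidate k are assumptions for illustration only.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy dataset

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the intra-cluster variance ("distortion"): plotted
    # against k it gives the elbow plot; the silhouette score instead
    # peaks near a good choice of k.
    print(k, km.inertia_, silhouette_score(X, km.labels_))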


Methods for hierarchical clustering can be classified into agglomerative or divisive. These methods do not need the number of clusters as an input. An agglomerative algorithm initially assigns each data point to a different cluster and iteratively merges the "closest" clusters down to a single one; a divisive approach is conceptually similar but follows the opposite process. During the iterative merging/grouping, different levels of ‘similarity’ between the data points are revealed and schematically represented in a dendrogram. Again, the reader is referred to Raschka and Mirjalili (2019) for an excellent hands-on tutorial.
Another popular clustering method is the Density-Based Spatial Clustering of Applications with Noise (DBSCAN, see Schubert et al. (2017)). This approach assigns cluster labels from the ‘density’ of various regions in the input space, making no assumption on the probability density functions within each cluster. The method classifies each point into 1) core points, 2) border points and 3) noise points. A point is labeled as a core point if at least np points fall within a prescribed radius ε, while it is labeled as a border point if it has fewer neighbors than np but at least one of these is a core point. A noise point fulfills none of the two criteria and can be removed. This is a major distinction over prototype-based or hierarchical methods, in that not all the points are necessarily assigned to a cluster.
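A minimal sketch of DBSCAN follows, with eps and min_samples playing the roles of the radius ε and the threshold np described above; the dataset and parameter values are illustrative assumptions.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# Points labeled -1 are noise points and are assigned to no cluster
print(set(db.labels_))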
A more advanced class of methods is that of graph-based clustering, of which the
spectral clustering represents the most prominent representative. A self-contained intro-
duction to these methods is proposed by von Luxburg (2007). Extensive treatment of
clustering analysis can be found in Nasraoui et al. (2019); Gan et al. (2007); Aggarwal
and Reddy (2013).
Besides its obvious relevance in data analysis, clustering is a fundamental tool in data compression via quantization. In image compression, for instance, clustering can be used to compress an image with a 16-bit color depth (65536 shades of gray) to an 8-bit color depth (256 shades of gray): a cluster-based dimensionality reduction is in essence a coarse graining of the input space.
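A minimal sketch of this quantization idea, using k-means to map a toy 16-bit image (an assumption for illustration) onto 256 representative gray levels:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
img = rng.integers(0, 65536, size=(64, 64))            # toy 16-bit image
km = KMeans(n_clusters=256, n_init=4, random_state=0)
labels = km.fit_predict(img.reshape(-1, 1).astype(float))
img8 = km.cluster_centers_[labels].reshape(img.shape)  # 256 gray levels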

2.3 Semi-Supervised Learning


Semi-supervised learning is conceptually situated between the previous two: the mapping to learn is still predictive (f : X → Y), but we have (some) labeled data (x∗, y∗) ∈ (X , Y) and (mostly) unlabeled data x∗∗ ∈ X . The overall goal is to use an unsupervised method as the supervisor of a supervised method.
In most regression and classification tasks, preparing the training data is the most expensive and time-consuming part and might be unfeasible. Unlabeled data is comparatively cheaper and requires less human supervision. The most classic application of this framework is semi-supervised classification, where clustering techniques are used on the labeled data (Xl, Yl) ∈ (X , Y) to partition the input space and then train a classifier in a supervised setting. This of course implies that the notion of a cluster can be generalized to that of a class, which in turn implies that the underlying probability density functions of each class are smooth. This is sometimes referred to as the cluster assumption (Olivier Chapelle and Zien, 2006). Other assumptions are the continuity [12] and the manifold [13] assumptions.
[12] Points that are close together are more likely to share the same label.
[13] The data lies on a manifold of much lower dimension, on which distances and densities can be conveniently computed.


The reader is referred to Olivier Chapelle and Zien (2006); Zhou and Goldman (2004); van Engelen and Hoos (2019) for a complete survey. The broader classification of these methods is between inductive and transductive. Inductive methods seek to find a model from the labeled data that generalizes to the unlabeled one; once the model is constructed, the labeled data is no longer necessary. Transductive methods do not build such a predictive model and need to rely on the labeled data for making predictions; these are also called graph-based, as their predictions rely on graph models connecting labeled and unlabeled data. The previous example of a semi-supervised classifier trained via clustering is an example of inductive semi-supervised learning with an unsupervised pre-processing.
Following van Engelen and Hoos (2019), subcategories of inductive methods include wrapper methods and intrinsically semi-supervised methods. The first category combines supervised learning with data augmentation: the unlabeled data is pseudo-labeled using wrapping techniques and fed to a supervised learning method. In the class of intrinsically semi-supervised methods, the most prominent example is that of generative models, and particularly Generative Adversarial Nets (GANs, Goodfellow et al. (2014)). In generative models, the labeled data is used to infer a model that generates synthetic data to artificially enlarge the labeled dataset. In GANs, the idea is to combine a generative and a discriminative learner having opposite objectives: the first must learn to produce data that is hardly distinguishable from the initial set; the second must learn to distinguish them. The two ANNs train each other in an endless process. A spectacular demonstration of the power of this ‘self-learning’ process is the generation of photorealistic images [14].

2.4 Reinforcement Learning


Reinforcement learning is about learning a mapping f : X → Y having no data at all. The only viable learning approach is by trial and error. The learner is thus a decision-making agent which interacts with an environment by taking actions that lead to rewards (or penalties).
The agent learns from its mistakes and seeks to maximize the rewards (or minimize
the penalties). Reinforcement learning is strongly inspired by behavioral psychology and
is one of the most effective ways of learning in nature: our actions are influenced by
previous experience on how our environment responds.
The mapping to be learned is a policy, typically denoted as π : st → at , that maps
the state of the agent at time t (st ) to the best action (at ) to take to achieve a goal.
The best action is the one that maximizes rewards along the path. The framework is
schematically illustrated in Figure 5. It is assumed that the agent and the environment
constitute a Markov Decision Process (MDP), a probabilistic model governed by a
transition probability matrix such that:

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_t, a_t, \ldots, s_0, a_0) \qquad (1a)$$

$$P(r_t \mid s_t, a_t) = P(r_t \mid s_t, a_t, \ldots, s_0, a_0)\,, \qquad (1b)$$

[14] We recommend a tour at http://whichfaceisreal.com to better grasp what we are referring to.


Figure 5: Schematic illustration of the reinforcement learning framework: an agent interacts with an environment and learns a policy π that maps the current state st to the best action at to achieve a certain goal. The learning is promoted by rewards rt (or penalties, if rt < 0) provided by the environment.

where the symbol | stands for conditional probability. In words: the evolution to the following state only depends on the current state-action pair, and not on the previous pairs nor on the past trajectory of the system.
The reader should notice the similarity of this framework with that of feedback control
(Ogata, 2009; Stengel, 1994; Kirk, 2004): here, the agent plays the role of the controller,
which is informed by sensors and acts through actuators, whereas the environment is the
plant to be controlled. It is thus not surprising that reinforcement learning is rooted in
optimal control theory and dynamic programming (Richard S. Sutton, 2018).
In the machine learning landscape, the distinctive feature of reinforcement learning is that the environment does not tell the agent what to do, but only grades its actions; this is why reinforcement learning is also called learning with a critic. Such a critique, in the form of a reward/penalty, is received only after the action is performed. The criterion for assigning rewards is one of the main challenges in RL. On the one hand, we could assign rewards only after a specific objective is achieved. Because a goal certainly involves a sequence of actions, the agent must figure out what part of the action sequence led to the reward: this is the credit assignment problem. On the other hand, we might decide to assign many intermediate rewards. This implies that we already have a good knowledge of what the sequence of actions should be. In the limit of using frequent rewards to force the agent to behave as we desire, reinforcement learning reduces [15] to supervised learning. Balancing these two extremes constitutes the reward shaping problem.
[15] An agent trained via supervised learning cannot, by definition, perform better than its supervisor. The remarkable achievements of agents reaching super-human performance at tasks such as, e.g., playing board games, would have never been possible.
Another key challenge of the field is identifying a good trade-off between exploration
and exploitation. To obtain rewards, the agent should exploit actions that are known,
from experience, to be successful. Yet, to discover improvements, the agent must explore
new actions.
Much of the recent success of this field has been motivated by standout successes in strategy board games (Silver et al., 2016, 2018) and video games (Szita, 2012), robotics (Kober and Peters, 2014) and language processing (Luketina et al., 2019). A historical success was the development of a hybrid RL system that defeated the Korean world


champion Lee Sedol in the game of Go (Silver et al., 2016). This game has $10^{170}$ legal board positions; neither large memory nor hand-crafted rules can be of any help.
All of the aforementioned milestones were achieved by Deep Reinforcement Learning (DRL), which is a combination of RL strategies and the regression capabilities of deep ANNs. The reader is referred to Arulkumaran et al. (2017); Hernandez-Leal et al. (2019) for extensive surveys, to Richard S. Sutton (2018); François-Lavet et al. (2018) for complete introductions to the topic, and to Alexander Zai (2020) for hands-on tutorials using Python. The internet abounds with online tutorials and courses on Reinforcement Learning. We highly recommend the course by Pieter et al. (2017).
Reinforcement learning can be broadly classified into four categories: (1) off-policy (or
value-based), (2) on-policy (or policy optimization), (3) model-based and (4) imitation
learning (or behavior cloning).
Off-policy methods do not focus on learning the policy but on the maximization of the value of each state (denoted as V^π(st)) and each state-action pair (denoted as Q^π(st, at)). The first represents how ‘good’ it is for the agent to be in a given state, while the second represents how ‘good’ it is to perform an action at a certain state. In both definitions, the notion of ‘good’ is quantified in terms of expected rewards. Once the optimal sequence of states is identified, the policy is determined as the one that allows the agent to follow that sequence. These methods are suited for systems having a finite set of possible actions; the most successful application of this framework is Q-learning (Watkins, 1989) and its many variants (a minimal sketch of its update rule is given below). On-policy algorithms (or policy optimization) aim at optimizing a parameterized policy πθ(a|s), with θ the set of parameters, so as to maximize the expected future reward. This framework naturally handles continuous actions and is rooted in classic optimization, usually relying on gradient-based methods (for which reason these are also known as policy-gradient methods).
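As an illustration of the value-based family, the sketch below shows the tabular Q-learning update rule; the environment interface (reset()/step()) and all hyperparameters are assumptions for illustration, not part of these notes.

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.99, 0.1   # learning rate, discount, exploration

def q_learning_episode(env):
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy: explore with probability eps, otherwise exploit
        a = np.random.randint(n_actions) if np.random.rand() < eps else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        # Move Q(s, a) towards the reward plus the discounted best next value
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next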
Both of the aforementioned methods are usually used in a model-free framework. In a model-based approach, the agent interrogates a surrogate or predictive model based on the observation history. This model can give faster and computationally cheaper evaluations than a numerical simulation of the environment [16] and thus allows for increasing the learning speed of the agent. Finally, imitation learning refers to an approach in which the agent is supported by a demonstrator (e.g. a human or another RL agent) which the agent can imitate: this approach essentially introduces a supervised framework in the process.

3 Regression and Modeling


To introduce many of the relevant concepts of supervised learning, it is convenient to consider a simple regression problem in one dimension. We here introduce and test three regression strategies; conclusions on their advantages and limitations are drawn in Sec. 7. The dataset discussed in this section is plotted in Figure 6.
This dataset is randomly sampled from a process

yk = f (xk ) + n(xk ) ←→ y = f (x) + n(x) , (2)


[16] The reader should note the link with surrogate-based optimization.



Figure 6: Plot of the dataset considered in this section: np = 400 points randomly sampled
from a process that has a deterministic and a stochastic part. Data is missing in the range
x ∈ [4.3, 4.8] and several outliers are added around x = 2 and x = 8.

where f () represents a deterministic model and n() represents a stochastic process.


The bold notation is used to denote a vector, hence x, y ∈ Rnp ×1 are the column vectors
collecting the sampled points, while xk , yk with k ∈ [1, np ] is the sequence notation. Both
notations are used in this section.
For the readers familiar with Python [17], the script generating the data is:
import numpy as np

# Problem to fit: sample the domain, leaving a gap in x ∈ [4.3, 4.8]
x1 = np.linspace(0, 4.3, 200, endpoint=True)
x2 = np.linspace(4.8, 10, 200, endpoint=True)
x = np.concatenate((x1, x2))
# Create the deterministic part
y_clean = 3*x + (x/100)**3 + 4*np.sin(3/2*np.pi*x)
# Add a (seeded) stochastic part
np.random.seed(0)
y = y_clean + 1*np.random.randn(len(x))
# Introduce some outliers around x = 2 and x = 8
G1 = 10*np.exp(-(x - 2)**2/0.005)*np.random.randn(len(x))
G2 = 15*np.exp(-(x - 8)**2/0.005)*np.random.randn(len(x))
y_final = y + G1 + G2

The ansatz in (2) is the starting point of most data-driven strategies aiming at developing a predictive model [18] for a quantity yk given an input xk. The primary scope of a regression problem is to learn an approximation f̃ of the deterministic contribution f using the available data.
The learned approximation can be used to make predictions on new data points, i.e. f̃(x) ≈ f(x), under the assumption that the stochastic contribution in (2) is less important than the deterministic one [19].
[17] All the Python codes for the discussed examples are available on request and are given in the VKI courses Tools for Scientific Computing and Data-Driven Fluid Mechanics and Machine Learning.
[18] The problem is generally of dimension d, while here we only consider d = 1 for simplicity. For d > 1, both input and output data can be conveniently cast as matrices X, Y ∈ R^{d×np}, with every column representing a ‘snapshot’ of the data. This approach is common in modal analysis (see Mendez (2020a)).


The prediction can be supported by an uncertainty evaluation which accounts for the finite number of samples available for constructing the approximation, as well as for the stochastic contribution. The uncertainty quantification relies on some assumptions that are briefly discussed later in this section. For the moment, we only assume that E{n(x)} = 0, where E is the expectation operator

$$\mathrm{E}\{n(x)\} \approx \frac{1}{N}\sum_{k=1}^{N} n(x_k) = \sum_{k=1}^{N} n(x_k)\, p(n(x_k))\,, \qquad (3)$$

with $p(n(x_k))$ the probability of sampling $n(x_k)$.


Finally, the reader should notice that no assumption is made on the ordering of the
data points nor the uniformity of the sampling. In other words, we are here interested
in a statistical framework and not a dynamical framework. In the latter, the variable
xk usually represents time: both the ordering (e.g. t3 > t2 > t1 ) and the uniform
sampling (tk = k∆t with ∆t = 1/fs and fs the sampling frequency) are essential for
many data processing tools (e.g. Fourier and Wavelet analysis) and system identification
tools. The dynamic analysis of data-sequences is known as Time Series Analysis (see
Mendez (2020b); Brockwell and Davis (1991); Nielsen (2019)).
We begin our tour of regression analysis with linear methods in Section 3.1 and proceed with nonlinear methods and Artificial Neural Networks (ANNs) in Section 3.3. Section 3.2 introduces classic optimization tools for regression problems. Finally, Section 3.4 proposes a simple exercise on turbulence modeling.

3.1 Linear Methods


In linear methods, the approximation f̃ is constructed as a linear combination of prescribed basis functions [20], which we denote as φj(x), with j = [1, nb] and nb the number of basis functions. To model nonlinear relations, these basis functions are nonlinear; yet the regression problem remains linear as long as the unknowns are the coefficients of the combination, which we denote as wj.
Using a summation, the approximation is written as

$$y_k = \tilde{f}(x_k) = \sum_{j=1}^{n_b} w_j\, \phi_j(x_k)\,. \qquad (4)$$

In a more compact matrix notation, this reads

$$\mathbf{y} \approx \tilde{f}(\mathbf{x}) =
\begin{bmatrix}
\phi_1(x_1) & \phi_2(x_1) & \dots & \phi_{n_b}(x_1) \\
\phi_1(x_2) & \phi_2(x_2) & \dots & \phi_{n_b}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_1(x_{n_p}) & \phi_2(x_{n_p}) & \dots & \phi_{n_b}(x_{n_p})
\end{bmatrix}
\begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_{n_b} \end{bmatrix}
\iff \mathbf{y} = \Phi(\mathbf{x})\,\mathbf{w} \qquad (5)$$
[19] In a fully stochastic system (n(x) ≫ f(x)), one can only aim at predicting some of its statistical properties. This is not the case we consider here.
[20] An alternative view is the kernel formulation: the φ()'s are nonlinear transforms (sometimes also referred to as features) allowing to solve nonlinear tasks using linear tools on the transformed space. We skip the kernel formalism for brevity, although this will be briefly recalled in the framework of kernel PCA in Section 4; the reader is referred to Bishop (2006) and Murphy (2012) for detailed overviews.


where Φ(x) ∈ Rnp ×nb is the basis matrix, with np the number of data points involved and
w = [w1 · · · wnb ]T is the (unknown) vector of coefficients. This equation drives both the
training and the testing of a model. We therefore begin by randomly splitting the data into
a training set, denoted as x∗ , y∗ ∈ Rn∗ ×1 and a testing set, denoted as x∗∗ , y∗∗ ∈ Rn∗∗ ×1 .
In this tutorial, we split the np = 400 available points into n∗ = 300 and n∗∗ = 100.
The training consists in identifying the coefficients w from the training data. In an ordinary least squares method, this means solving the linear system in (5) with x = x∗ and y = y∗, i.e. y∗ = Φ(x∗)w. Note that since usually np ≫ nb, the system is over-determined and no exact solution is available [21]. Hence, once the coefficients w are available, the prediction ỹ∗ = Φ(x∗)w will not match the training points. We thus define the Root Mean Square (RMS) in-sample error [22] as $E_{in}(\mathbf{w}) = \|\mathbf{y}_* - \tilde{\mathbf{y}}_*\|_2/\sqrt{n_*} = \|\mathbf{y}_* - \Phi(\mathbf{x}_*)\mathbf{w}\|_2/\sqrt{n_*}$.
The validation consists in using the coefficients w on the validation data. This means computing ỹ∗∗ = Φ(x∗∗)w, i.e. an estimate of y∗∗. We can define the RMS out-of-sample error as $E_{out}(\mathbf{w}) = \|\mathbf{y}_{**} - \tilde{\mathbf{y}}_{**}\|_2/\sqrt{n_{**}} = \|\mathbf{y}_{**} - \Phi(\mathbf{x}_{**})\mathbf{w}\|_2/\sqrt{n_{**}}$.
The balance between Ein and Eout is linked to the overfitting problem introduced in Sec. 2: very low values of Ein are often achieved at the cost of large values of Eout. Excellent introductions to the subject in terms of the bias-variance trade-off are provided by Bishop (2006); Abu-Mostafa et al. (2012), along with the required mathematical tools. We here invite the reader to experiment with this exercise to build confidence.
The scope of the training is to identify w such that Ein is minimized. More generally, one might choose among a large set of cost functions that aim to achieve this goal while adding additional constraints on w. For example, a classical choice is the regularized least squares cost function:

$$J(\mathbf{w}) = E_{in} + \lambda R(\mathbf{w}) = \|\mathbf{y}_* - \Phi(\mathbf{x}_*)\mathbf{w}\|_2^2 + \lambda R(\mathbf{w}) \qquad (6)$$


where $\|\mathbf{a}\|_2^2 = \mathbf{a}^T\mathbf{a}$ denotes the squared L2 norm of a vector and the parameter λ acts as a weighting term [23] that penalizes certain choices of w according to the regularizing function R(w). The most popular choices are $R(\mathbf{w}) := \|\mathbf{w}\|_2^2$, known as Tikhonov regularization (or L2 regularization), and $R(\mathbf{w}) := \|\mathbf{w}\|_1 = \sum_k |w_k|$, known as Lasso regularization (or L1 regularization). A combination of both methods is known as Elastic Net, while setting λ = 0 yields Ordinary Least Squares (OLS).
An L2 regularization promotes a ‘small but dense’ vector w, i.e. one composed of many entries with relatively low values. An L1 regularization promotes sparsity, i.e. a ‘large but sparse’ vector w, composed of few (large) nonzero coefficients. Linear regression using the L2 penalization is also known as Ridge regression. Since this is amenable to simple analytic treatment, we consider it first.
[21] And when it is available, such a solution is not desirable, since it would strongly depend on the dataset used and likely yield poor generalization.
[22] Note that many authors introduce a factor 1/2 in this cost function.
[23] Readers familiar with Lagrange multipliers should recognize in (6) an augmented cost function, with λ the Lagrange multiplier that allows for turning a constrained optimization into an unconstrained one. Here the constraint is ‘small’ R(w) < t.
[24] It is worth recalling the following matrix differentiation rules: $\nabla_x(x^T A^T b) = A^T b; \quad \nabla_x(b^T A x) = A^T b; \quad \nabla_x(x^T A^T A x) = 2 A^T A x$

The gradient of the cost function is [24]:

$$\begin{aligned}
\nabla_{\mathbf{w}} J(\mathbf{w}) &= \nabla_{\mathbf{w}}\left[\left(\mathbf{y}_* - \Phi(\mathbf{x}_*)\mathbf{w}\right)^T\left(\mathbf{y}_* - \Phi(\mathbf{x}_*)\mathbf{w}\right) + \lambda\,\mathbf{w}^T\mathbf{w}\right] \\
&= \nabla_{\mathbf{w}}\left[\mathbf{w}^T\Phi^T(\mathbf{x}_*)\Phi(\mathbf{x}_*)\mathbf{w} - \mathbf{w}^T\Phi^T(\mathbf{x}_*)\mathbf{y}_* - \mathbf{y}_*^T\Phi(\mathbf{x}_*)\mathbf{w} + \mathbf{y}_*^T\mathbf{y}_* + \lambda\,\mathbf{w}^T\mathbf{w}\right] \\
&= 2\left(\Phi^T(\mathbf{x}_*)\Phi(\mathbf{x}_*)\mathbf{w} - \Phi^T(\mathbf{x}_*)\mathbf{y}_* + \lambda\mathbf{w}\right) \qquad (7)
\end{aligned}$$

The Hessian of (6) is $H(J(\mathbf{w})) = 2\left(\Phi^T(\mathbf{x}_*)\Phi(\mathbf{x}_*) + \lambda I_{n_b}\right)$, with $I_{n_b}$ the identity matrix of dimension nb × nb. Since Φ^T(x∗)Φ(x∗) is a positive definite matrix for all the classic choices of basis functions [25], the optimization problem is convex and the minimum can be found as the solution of ∇w J(w∗) = 0. This solution, from (7), reads:

$$\mathbf{w}^* = \left(\Phi^T(\mathbf{x}_*)\Phi(\mathbf{x}_*) + \lambda I_{n_b}\right)^{-1} \Phi^T(\mathbf{x}_*)\,\mathbf{y}_* = \left(K(\mathbf{x}_*) + \lambda I_{n_b}\right)^{-1} \Phi^T(\mathbf{x}_*)\,\mathbf{y}_* \qquad (8)$$

with $K(\mathbf{x}_*) = \Phi^T(\mathbf{x}_*)\Phi(\mathbf{x}_*)$ the covariance matrix of the chosen basis functions. Note that the set of coefficients is determined using only the training dataset x∗.
We begin with a bad choice of basis functions, namely a polynomial basis of the form $\phi_j(x) = x^{\,j-1}$. For such a choice, the computation of the polynomial coefficients in (8) (with λ = 0) is implemented in the popular function polyfit in Matlab or Python. We consider nb = 10, i.e. we represent the approximated function as a polynomial of order 9. Note that the predicted output for the testing set, combining (8) and (5), is:

$$\tilde{\mathbf{y}}_{**} = \Phi_{**}\,\mathbf{w}^* = \Phi_{**}\left(K_{*,*} + \lambda I_{n_b}\right)^{-1}\Phi_*^T\,\mathbf{y}_* = S\,\mathbf{y}_*\,, \qquad (9)$$


where the matrix S is often referred to as the smoother matrix or hat matrix, and we have eased the notation by writing Φ(x∗) = Φ∗. We perform a loop of 100-fold cross-validations [26]: we randomly split the dataset into n∗ = 300 training points and n∗∗ = 100 testing points, and we repeat the regression 100 times. This results in 100 possible fits, with associated coefficients w and estimates of Ein and Eout. We repeat the process for three values of the regularization parameter, namely λ = 0, 10 n∗∗ and 100 n∗∗. Note that the weight of this parameter scales with the number of points and is a complex function of the basis chosen. We skip these mathematical details here. The results [27] are shown in Figure 7.
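Before examining the results, the following is a minimal sketch (not the authors' code) of one such cross-validation loop, assuming the arrays x and y_final generated by the earlier script; the value of λ is illustrative.

import numpy as np

def ridge_poly_cv(x, y, n_b=10, lam=0.0, n_splits=100, n_train=300):
    E_in, E_out = [], []
    for _ in range(n_splits):
        idx = np.random.permutation(len(x))
        tr, te = idx[:n_train], idx[n_train:]
        Phi_tr = np.vander(x[tr], n_b, increasing=True)  # phi_j(x) = x**(j-1)
        Phi_te = np.vander(x[te], n_b, increasing=True)
        K = Phi_tr.T @ Phi_tr
        w = np.linalg.solve(K + lam*np.eye(n_b), Phi_tr.T @ y[tr])  # Eq. (8)
        E_in.append(np.linalg.norm(y[tr] - Phi_tr @ w) / np.sqrt(len(tr)))
        E_out.append(np.linalg.norm(y[te] - Phi_te @ w) / np.sqrt(len(te)))
    return np.array(E_in), np.array(E_out)

E_in, E_out = ridge_poly_cv(x, y_final, lam=10*100)  # e.g. lambda = 10 n**
print(E_in.mean(), E_out.mean())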
The histograms on the left show the sampled distribution of Ein and Eout for the three cases. The plots on the right compare the average predictions with the data and include an estimation of the uncertainty from the K-fold cross-validation. From the 100 predictions obtained, the average is shown as a continuous blue line, while the colored area covers the uncertainty region. This is computed as a span of $\pm(\bar{E}_{out} + 1.96\,\sigma(\tilde{f}(x)))$ around the mean, with σ(f̃(x)) denoting the standard deviation observed at each location x over all predictions. The average standard deviation in each case is indicated on the bottom right of these plots.
These figures show the impact of the regularization and the reason why the polynomial basis is a bad choice.
[25] In other words, Φ(x∗) is usually of full rank.
[26] This is sometimes referred to as ‘k-fold’ cross-validation.
[27] Interested readers are strongly encouraged to reproduce these plots!


Figure 7: Results from 100-fold cross-validations for the polynomial fitting using λ = 0 (top row), λ = 10 n∗∗ (middle row) and λ = 100 n∗∗ (bottom row). The left plots show the histograms of Ein and Eout; the right plots show the average prediction with the estimated uncertainty.


Increasing the regularization decreases the standard deviation in the prediction but increases the average [28] Eout. Overall, the loss outweighs the gain in the third case, suggesting that an acceptable regularization is closer to λ = 10 n∗∗ than to λ = 100 n∗∗. A similar analysis can be performed for multiple values of the regularization to identify the optimal value: it is possible to show that there always exists a value of λ such that the Ridge regression has a lower mean squared error than the OLS estimator (see Taboga (2017)).
Regardless of the regularization chosen, this model selection underfits the data. Increasing the model complexity, i.e. increasing the number of basis elements, further deteriorates the predictions [29]. The natural solution is therefore a change of basis.
A popular choice are Radial Basis Functions (RBFs). These are real-valued functions
whose value solely depends on the distance between the input and some fixed point. The
most classic example of RBF is the Gaussian basis φj(x) = e^{−(x−xj)²/(2 lc²)}, where xj are the
center points and lc acts as a characteristic length scale of the data. This controls how
far one should move in a given direction (in this case along x) for the function values to
become uncorrelated. Basis functions that are sufficiently far apart (e.g. at a distance of
10 lc) do not overlap and hence their corresponding weights can be tuned independently.
In other words, these basis functions become more and more localized as lc → 0.
RBFs are normally introduced in the framework of kernel methods (see Bishop (2006);
Murphy (2012)) and thus offer here an opportunity to briefly present this alternative
perspective and its link with the basis representation.
We first consider a set of nb = 50 Gaussians with lc = 0.5, with centers xj uniformly
spaced over the domain of interest. We repeat the same analysis presented in Figure 7.
The results for the Gaussian RBFs are collected in Figure 8 for three levels of regularization.
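Building the Gaussian RBF design matrix requires only a few lines; the sketch below is a minimal transcription under the assumptions of the text (uniformly spaced centers), reusing the hypothetical ridge_fit of the previous sketch.

import numpy as np

def gaussian_rbf_basis(x, centers, lc):
    """Phi[i, j] = exp(-(x_i - c_j)^2 / (2 lc^2)), following the definition above."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * lc ** 2))

# nb = 50 Gaussians with lc = 0.5, centers uniformly spaced on the domain [0, 10]
centers = np.linspace(0, 10, 50)
# Phi_train = gaussian_rbf_basis(x_train, centers, lc=0.5)
# w = ridge_fit(Phi_train, y_train, lam)   # same ridge solver as before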
It is evident that the RBF approach performs better. For λ = 0.1 the minor increase in
the bias error is compensated by the reduction of variance, while λ = 1 results in larger
bias and variance. In all three cases, it is interesting to notice how the uncertainty grows
in the regions lacking data or populated by outliers.
The reader is encouraged to repeat the exercise increasing the number of basis elements
nb. In Gaussian Processes (Rasmussen and Williams, 2005), one is often interested in
cases for which nb ≫ np. This could be the case either because limited data is available or
because one wishes to stretch the notion of bases, as we shall see shortly. In this setting,
a simple alternative to (9) can be derived using the matrix inversion lemma³⁰, which can
be written for the basis matrix Φ∗ as

( Φ∗ᵀ Φ∗ + λ Inb )⁻¹ Φ∗ᵀ = Φ∗ᵀ ( Φ∗ Φ∗ᵀ + λ Inp )⁻¹ .   (10)
The right-hand side becomes more attractive for nb ≫ np as it involves the inversion
of a smaller matrix. The matrix G∗,∗ = Φ(x∗)Φ(x∗)ᵀ is known as the Gram matrix.
Simplifying the notation Φ(x∗) = Φ∗ and Φ(x∗∗) = Φ∗∗ and using (10) in (9) gives³¹:

y∗∗ = Φ∗∗ Φ∗ᵀ ( Φ∗ Φ∗ᵀ + λ Inp )⁻¹ y = G∗∗,∗ ( G∗,∗ + λ Inp )⁻¹ y   (11)
²⁸ In the bias-variance trade-off, we are decreasing the variance while increasing the bias.
²⁹ The reader is invited to try with a polynomial of order 15, for example: without sufficient regularization, the inversion of K becomes problematic and returns NaNs.
³⁰ This is also known as the Woodbury matrix identity.
³¹ The link between ridge regression and Gaussian processes is well described by Belousov (2017).


[Figure 8 about here; panels from top to bottom: λ = 0.001, λ = 0.1, λ = 1, with average standard deviations σ(f̃) = 0.214, 0.196 and 0.257 respectively.]

Figure 8: Same as Figure 7, but considering Gaussian RBFs as bases.


We are now ready to take the limit nb → ∞ and stretch the introduced framework to
the case of an infinite basis: while K∗ = Φ∗ᵀΦ∗ becomes infinite dimensional, the Gram
matrix remains of finite size. The entries of the Gram matrix are solely functions of the
points xi and xj and can be computed using a continuous function G∗,∗[i, j] = g(xi, xj),
called kernel function. This is one of the many facets of the kernel trick (see Murphy
(2012); Alpaydin (2020)).
For Gaussian RBFs, the kernel function is again a Gaussian, and takes the form:

g(xi, xj) = σf² exp( −(xi − xj)² / (2 l′c²) ) ,   (12)

where σf is a scaling parameter and l′c is the continuous analogue³² of lc. Note that the
function is symmetric, i.e. g(xi, xj) = g(xj, xi), and G∗,∗∗ = g(x∗, x∗∗) = G∗∗,∗ᵀ.

Under these assumptions, the ridge regression closely resembles a Gaussian Process
(GP) regression. While an introduction to the theory of GPs and the required Bayesian
framework is out of the scope of these notes, the introduced background allows one to grasp
the fundamental ideas and gives working knowledge of the method³³.
Having opened the presentation to an infinite basis, a GP is a Gaussian distribution
of functions. Given a set of points x ∈ R^{np×1}, every element of the distribution represents
a possible set of targets y = f(x) ∈ R^{np×1} and the joint distribution of np points is
assumed to be Gaussian with zero mean and kernel G, i.e. p(y|x) = N(y|0, G). Because
the kernel G must be assumed beforehand, this ansatz is called prior.
Note that this assumption allows one to compute the uncertainties of the regression analyti-
cally, without resorting to test data and k-fold cross-validation. Therefore, in the remain-
der of this section, we define (x∗∗, y∗∗) as the set of predictions y∗∗ = f(x∗∗).
Leveraging the Gaussian assumption, the prediction is carried out via conditioning,
using the Bayesian theorem and the prior to predict the posterior. This turns out to
be a Gaussian of the form p(y∗∗|x∗∗, x∗, y∗) = N(y∗∗|µ∗, Σ∗), with

µ∗ = g(x∗∗, x∗) ( g(x∗, x∗) + σy² Inp )⁻¹ y   (13a)

Σ∗ = g(x∗∗, x∗∗) − g(x∗∗, x∗) ( g(x∗, x∗) + σy² Inp )⁻¹ g(x∗, x∗∗)   (13b)

where σy is an estimation of the standard deviation of the noise and the uncertainty can
be computed as δf(x) = 1.96 √(diag(Σ∗)).

Comparing (13a) with (11) shows that the predicted mean corresponds to the ridge
regression as long as the kernel function is such that g(x₁, x₂) = Φ(x₁)Φᵀ(x₂) and λ = σy²;
this comparison gives a statistical interpretation to the regularization term previously
introduced. The reader is invited to use (13a) and (13b) on the provided dataset and
compare with the k-fold cross-validated ridge regressions. The results are shown in Figure
9. The chosen parameters are indicated on the top of the figure.
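Equations (13a) and (13b) translate almost literally into code. The following is a minimal NumPy sketch (not the exact script used for the figure), with the hyper-parameters of Figure 9 assumed as defaults.

import numpy as np

def kernel(xa, xb, sf=10.0, lc=0.7):
    """Gaussian kernel (12): g(xi, xj) = sf^2 exp(-(xi - xj)^2 / (2 lc^2))."""
    return sf ** 2 * np.exp(-(xa[:, None] - xb[None, :]) ** 2 / (2 * lc ** 2))

def gp_posterior(x_s, x, y, sy=2.0):
    """Posterior mean and covariance from (13a) and (13b)."""
    K = kernel(x, x) + sy ** 2 * np.eye(len(x))
    K_s = kernel(x_s, x)
    mu = K_s @ np.linalg.solve(K, y)
    Sigma = kernel(x_s, x_s) - K_s @ np.linalg.solve(K, K_s.T)
    return mu, Sigma

# mu, Sigma = gp_posterior(x_test, x_train, y_train)
# uncertainty = 1.96 * np.sqrt(np.diag(Sigma))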
³² Because of the assumption nb → ∞, lc ≠ l′c.
³³ An excellent tutorial on GPs, supported by a very didactic implementation in Python, is provided by Krasser (2018). Krasser’s blog (http://krasserm.github.io) is a precious source of machine learning tutorials.


[Figure 9 about here; hyper-parameters: lc = 0.7, σf = 10, σy = 2.]

Figure 9: Solution to the provided regression exercise using Gaussian Process Regression
(GPR). The left plot shows the regression with the uncertainties; the right plot shows
a zoom in the region where data is missing.

We conclude this subsection by recalling that the methods presented are only two of
the most popular (linear) tools. Different formulations can be obtained from different
cost functions, with common machine learning packages offering a dozen alternatives. A
valuable alternative to the regularized least squares in (6) is the ε-insensitive loss function,
which takes the form:

J(w) = { 0 if |y − Φw| < ε ;  |y − Φw| − ε otherwise }   (14)

In words, this approach tolerates errors within ±ε and gives a linear penalty to points
that are beyond this range. The region of tolerance promotes sparseness while the linear
penalty gives robustness to outliers. This loss function is used in Support Vector
Regression (SVR), the adaptation of the popular classification algorithm Support
Vector Machines (SVM, Boser et al. 1992). We refer to Smola and Schölkopf
(2004) for an excellent overview of this powerful tool.
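For readers who wish to experiment, scikit-learn exposes this loss through its SVR class. The snippet below is a minimal, hypothetical usage on synthetic data; the values of C (the penalty weight) and epsilon are illustrative, not tuned.

import numpy as np
from sklearn.svm import SVR

# Hypothetical stand-in for the tutorial data (x, y as 1D arrays)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300); y = np.sin(x) * x + 0.5 * rng.standard_normal(300)

# epsilon sets the width of the tolerance tube in (14)
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1)
svr.fit(x.reshape(-1, 1), y)
y_pred = svr.predict(x.reshape(-1, 1))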

3.2 Common Optimization Tools


The cost function in (6), combined with the linear model in (5), allowed for an analytical
solution of the optimization problem of finding the best weights w∗. This is not possible
for more complex cost functions and more complex models.
Not surprisingly, then, the engine of the training process in all machine learning al-
gorithms is an optimizer. More specifically, within the vast optimization literature, the
machine learning community has mostly focused on first order methods. The
reason for this is certainly feasibility: most machine learning applications involve large
datasets and large parameter spaces, hence the computation of the Hessian is usually
too costly (if even possible) and impractical. While recent advances might open the path
towards second order methods (see Yao et al. (2020); Anil et al. (2020)), today most
common optimizers are variants of the first order gradient descent method.
Reviews of optimization algorithms are available in most machine learning manuals
(see Géron (2019); Raschka and Mirjalili (2019)) and in the documentation of computa-
tional libraries. This section merely provides the minimal background to understand the
terminology of the field, as well as the settings used for solving the provided tutorial
exercises.
A gradient descent algorithm is an iterative method that adapts the weights of a model
in the opposite direction of the gradient ∇w J(w), i.e.

w^{i+1} = w^i − η ∇w J(w) ,   η ∈ R   (15)

where the superscript denotes an iteration and the scalar η is known as the learning rate.
State-of-the-art libraries are equipped with automatic differentiation to compute the gra-
dient, and variants of this approach differ in the amount of data used in (15).
The simplest algorithm is the Batch Gradient Descent (BGD), which computes
the gradient on the entire dataset, i.e. ∇w J(w) = ∇w J(w, x, y). This approach has
a high cost per iteration, is memory demanding and is not suited for ‘online’ learning,
that is, continuous optimization as new data becomes available. On the other extreme is the
Stochastic Gradient Descent (SGD), which computes the gradient on each data point
independently, i.e. ∇w J(w) = ∇w J(w, xi, yi). This makes the algorithm faster but
gives strong fluctuations in the convergence. These fluctuations often allow for jumping
towards better local minima but become harmful once a minimum is approached. To
limit this effect, the SGD is usually combined with a learning schedule that reduces η
as a function of the iterations.
A practical measure of the optimization performance is often given in terms of epochs.
An epoch is a full pass of (15) over the entire dataset. Hence, in BGD one epoch
corresponds to one iteration while in SGD one epoch corresponds to n∗ iterations.
The most common approach is an intermediate formulation between BGD and SGD
called mini-batch gradient descent (MGD), in which the gradient evaluation is carried
out over a batch of data of size N, i.e. ∇w J(w) = ∇w J(w, x_{i:i+N}, y_{i:i+N}). This approach
is now standard and is often implied when referring to gradient descent (GD).
The main challenge of these algorithms is in the definition of an appropriate learning
rate (or learning schedule) and the risk of being trapped in suboptimal local minima or
saddle points (especially in the case of ANNs, see Dauphin et al. (2014)). To (partially)
overcome these problems, standard optimization tools use momentum and adaptive
learning or gradient re-scaling as well as their combinations.
The idea of momentum consists in introducing a sort of memory of earlier gradi-
ents; the simplest implementation (Polyak, 1964) reads

m^{i+1} = β m^i − η ∇w J(w)
w^{i+1} = w^i + m^{i+1} ,   η, β ∈ R; m ∈ R^{np}   (16)

The parameter β acts as a momentum/friction control which varies between 0 (high
‘friction’) and 1 (no ‘friction’): since the gradient now acts as an ‘acceleration’ and not
as a ‘velocity’, the algorithm tends to go faster and uses inertia to escape from plateaus.
A variant of this approach is to compute the gradient slightly ahead, i.e. ∇w J(w + βm);
this is known as the Nesterov accelerated gradient (Nesterov, 1983).
As an example of gradient re-scaling, we here consider the popular RMSprop proposed
by G. Hinton in his Coursera course on neural networks³⁴. This algorithm introduces a
scaling of the gradient such that

w^{i+1} = w^i − η ∇w J(w) ⊘ ( √(s^i) + ε )
s^{i+1} = β s^i + (1 − β) ∇w J(w) ⊗ ∇w J(w) ,   η, β, ε ∈ R; s ∈ R^{np}   (17)

where ⊘ and ⊗ denote the entry-by-entry division and multiplication, β is a decay
rate, s is a scaling parameter and ε is a small term introduced to avoid division by zero.
The idea of this re-scaling is to decrease the learning rate faster for the steepest directions,
while the parameter β gives importance only to recent updates when computing the scal-
ing s. More sophisticated algorithms such as the ADAptive Momentum estimation
(ADAM) combine both momentum and scaling. This combination reads:

w^{i+1} = w^i − η m̂^i ⊘ ( √(ŝ^i) + ε )
s^{i+1} = β₂ s^i + (1 − β₂) ∇w J(w) ⊗ ∇w J(w)
m^{i+1} = β₁ m^i + (1 − β₁) ∇w J(w)   (18)
m̂ = m / (1 − β₁^i) ,  ŝ = s / (1 − β₂^i) ,   η, β₁, β₂, ε ∈ R; s, m ∈ R^{np}
The weight update in the first line has a re-scaled momentum. The momentum vector m
updates as in the momentum formulation in (16) while the scaling vector s follows
the RMSprop idea in (17). Both quantities are then scaled before their use in the weight
update: this has the main objective of avoiding a bias towards zero (Kingma and Ba, 2014).
With the increasing complexity of the optimizer, the number of tuning parameters
(referred to as hyper-parameters) to adjust increases. While many guidelines are avail-
able, mastering the use of these optimizers for training complex models such as ANNs
requires experience and, often, a tedious trial and error procedure.
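A compact NumPy transcription of the ADAM update (18) may help fixing the ideas; this is a didactic sketch, not the implementation found in deep learning libraries.

import numpy as np

def adam_step(w, grad, m, s, i, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One update of (18); i is the iteration counter, starting at 1."""
    m = b1 * m + (1 - b1) * grad
    s = b2 * s + (1 - b2) * grad * grad            # entry-by-entry product
    m_hat = m / (1 - b1 ** i)                      # bias correction
    s_hat = s / (1 - b2 ** i)
    w = w - eta * m_hat / (np.sqrt(s_hat) + eps)   # entry-by-entry division
    return w, m, s

# Minimal usage: minimize J(w) = ||w||^2, whose gradient is 2w
w = np.array([5.0, -3.0]); m = np.zeros(2); s = np.zeros(2)
for i in range(1, 2001):
    w, m, s = adam_step(w, 2 * w, m, s, i, eta=0.05)
print(w)  # should approach [0, 0]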

3.3 Artificial Neural Networks (ANNs)


ANNs are the most popular nonlinear tools in machine learning. They were initially
introduced in the late 50’s as simplified models of the human brain, although their
biological analogy is limited and certainly unnecessary for the scopes of a data scientist.
In the early developments of ANNs, this analogy triggered scientific controversies and
exaggerated (and unfulfilled³⁵) claims (see Olazaran (1993)) that resulted in skepticism
and a drop of interest and funding. Today, the regained interest in ANNs is fueled
by the tremendous increase in computing power (particularly the recent developments in
GPU technology), the availability of data, and improvements in training algorithms. The
relevance of ANN research has given them their own sub-field of machine learning, called
Deep Learning (with the adjective ‘deep’ referring to networks with many layers).
³⁴ A curiosity: this algorithm remains unpublished and is cited by the community as “slide 29 in lecture 6”! This great lecture is available at https://www.youtube.com/watch?v=defQQqkXEfE.
³⁵ Here is an excerpt from an article in the New York Times from 8 July 1958: The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.


To understand their operation and harvest their potential, one should see ANNs as
nonlinear mathematical models with a distributed architecture, consisting of a large
number of simple connected units (called neurons). An example of an ANN (the first that
will be tested in this section) is shown in Figure 10. This network consists of seven neurons,
organized in four layers: an input and an output layer, together with two intermediate
ones (referred to as hidden layers). Because this is used for the regression analysis of the
data in Fig 6, the input and the output layers have only one neuron (since we have one
input and one output) while the hidden layers have two and three neurons (labeled from
bottom to top). This network is fully connected because each neuron in a layer is
connected to all the neurons of the following one, and is feed-forward because the flow
of information is unidirectional, from one layer to the next. The reader should note that
feedforward networks are often also referred to as Multilayer Perceptrons for historical
reasons: the Perceptron was the first ANN architecture, consisting of a single neuron, used
for binary classification (Rosenblatt, 1957).

Figure 10: First of the two configurations of ANNs tested in this regression tutorial:
a feedforward fully connected architecture with two hidden layers and a total of seven
neurons.

A feedforward network is a static model that maps a set of inputs xj to a set of
outputs yj. Since the mapping of a point is independent from the others, feedforward
networks are also memory-less. Conversely, recurrent neural networks are dynamical
systems characterized by feedback connections (e.g. from output to input) so that each
input triggers a sequence of outputs (see Bianchi et al. 2017 for a comprehensive overview).
The most popular alternative to the fully connected architectures are convolutional
neural networks (CNNs), in which a much more limited set of connections exists between
different layers: these connections perform a convolution³⁶ that gives the name to the
architecture. CNNs are mostly used in image processing and in applications that benefit from
their better trade-off between connectedness and complexity. This section is solely
limited to the fundamentals of feedforward networks and gives the minimal background
³⁶ The use of the word convolution in this field is colloquial, since the operation is in fact a sliding dot product.


to understand the potential of this technology. Interested readers are referred to Bishop
(1995); Goodfellow et al. (2016); Haykin (1998) for a comprehensive treatment.
We begin our analysis of the network in Fig 10 from the output layer and its single
neuron. This neuron receives the output of three neurons from the previous layer, weighted
by the connection weights w^{(l)}_{i,j}, where l denotes the layer hosting the neurons and the
subscripts map the connection: for instance w^{(3)}_{2,1} is the weight of the connection from
neuron 2 (in layer 3) to neuron 1 (in layer 4). This neuron responds to these inputs as:

y = y^{(4)} = σ^{(4)}( Σ_{j=1}^{3} w^{(3)}_{j,1} y^{(3)}_j + b^{(4)}_1 ) = σ^{(4)}( W_{34}^T y^{(3)} + b^{(4)}_1 ) .   (19)

In addition to the weights, we have introduced the bias term of this neuron, b^{(4)}_1, the
output of the neurons in the previous layer, y^{(3)}_j, and the activation function σ^{(4)} of this
layer. The activation function is a usually³⁷ nonlinear mapping between input and output. Within
the vast zoology of activation functions, we here use the following two:

if x > 0 1

x
σ(x) =  ; σ(x) = (20)
α(e − 1) if x ≤ 0
x 1 + e−x

The one on the left is called Exponential Linear Unit (ELU); the one on the right is the
sigmoid function. Note that one could introduce a different activation function for each
neuron. However this gives no particular gains, the activation functions are usually chosen
for each layer. For this last layer, we chose the ELU activation.
The nonlinearity introduced by the activation functions is the first key difference with
respect to the methods described in Sec 3.1: without the activation function, (19) is
analogous to (4) and can be seen as a linear combination of basis functions³⁸. This is
why (19) can be conveniently written as a matrix multiplication, with the matrix W₃₄
collecting all the weights³⁹. The weights are initially unknown and must be inferred during the
training of the ANN.
The second key distinction is in the basis functions and their composite nature: the
basis in (19) is represented by the outputs of the previous layer, which read:

y^{(3)} = σ^{(3)}( [ Σ_{j=1}^{2} w^{(2)}_{j,1} y^{(2)}_j + b^{(3)}_1 ,  Σ_{j=1}^{2} w^{(2)}_{j,2} y^{(2)}_j + b^{(3)}_2 ,  Σ_{j=1}^{2} w^{(2)}_{j,3} y^{(2)}_j + b^{(3)}_3 ]^T ) = σ^{(3)}( W_{23}^T y^{(2)} + b^{(3)} )   (21)

Differently from the last layer, this layer outputs a vector y^{(3)} ∈ R^{3×1} because it
contains three neurons. Again, the input to this layer is the output of the previous one, which
reads:
³⁷ These could also be linear in some layers. However, if all the activations are linear, the network becomes linear and the relation between input and output is just a matrix multiplication!
³⁸ The bias shift could be condensed within the weights, taking one of the basis functions as the vector of constants; we refrain from using this notational simplification as it is seldom encountered in the literature. Note that the bias term gives an additional degree of freedom.
³⁹ It is left as a simple exercise to the reader to write down this matrix!


y^{(2)} = σ^{(2)}( [ w^{(1)}_{1,1} y^{(1)}_1 + b^{(2)}_1 ,  w^{(1)}_{1,2} y^{(1)}_1 + b^{(2)}_2 ]^T ) = σ^{(2)}( W_{12}^T y^{(1)}_1 + b^{(2)} ) .   (22)

Note that this layer receives as input the scalar y^{(1)}_1, which is the output of the neuron
in the first layer, i.e.:

y^{(1)}_1 = σ^{(1)}( w^{(0)}_1 x + b^{(1)}_1 )   (23)

If one now tracks back the full path from the input x to the output y, inserting (23)
into (22) and all the way up to (19), it is evident that even a simple network with seven
neurons represents a fairly complex function:
     
y=σ (w10 x + b11 ) +b +b + (24)
4 T (3) T (2) (1) (2) (3) (4)
W34 σ W23 σ W12 σ b1

In this simple architecture, the number of parameters (weights and biases) to be tuned
during the training amounts to 19. Common architectures in deep learning have thousands
of neurons and millions of parameters. For example, the famous AlexNet (Krizhevsky
et al., 2017) that revolutionized image classification and computer vision in 2012 is an
ANN with 8 layers (5 convolutional and 3 feedforward) consisting of about 650,000 neurons and
60 million parameters. The training of this network took between five and six days using
two GTX 580 3GB GPUs and used a training set of 1.2 million labeled images. This
network significantly outperformed any classification strategy and set new standards in
image classification.
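To make (24) concrete, the sketch below evaluates the forward pass of the seven-neuron network of Figure 10 in NumPy. The weights are random placeholders (19 parameters in total); sigmoid activations in the hidden layers and an ELU output are assumed, as in the text.

import numpy as np

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Random placeholder weights for the 1-2-3-1 architecture of Figure 10
rng = np.random.default_rng(0)
w10, b11 = rng.standard_normal(), rng.standard_normal()
W12 = rng.standard_normal((1, 2)); b2 = rng.standard_normal(2)
W23 = rng.standard_normal((2, 3)); b3 = rng.standard_normal(3)
W34 = rng.standard_normal((3, 1)); b4 = rng.standard_normal(1)

def forward(x):
    """Evaluate (24) for a scalar input x."""
    y1 = sigmoid(np.atleast_1d(w10 * x + b11))   # first layer, eq (23)
    y2 = sigmoid(W12.T @ y1 + b2)                # first hidden layer (2 neurons)
    y3 = sigmoid(W23.T @ y2 + b3)                # second hidden layer (3 neurons)
    return elu(W34.T @ y3 + b4)                  # output layer (ELU activation)

print(forward(1.5))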
The complexity of the nested mathematical model constructed by an ANN has found
theoretical support in the popular function approximation theorem(s), pioneered by Cy-
benko (1989), which prove that sufficiently large networks can approximate any continu-
ous function. Recent results have focused on the dilemma between width and
depth (see Lu et al. (2017); Hanin (2019)). Nevertheless, a robust theory of the ANNs' ex-
pressivity (i.e., the ability to approximate a rich class of functions) and their architecture
is still in its infancy, and the design of ANNs is nowadays largely based on experience.
The biggest challenge in the use of ANNs remains the training process, which constitutes an
extremely high dimensional optimization problem. Modern training strategies are based
on the back propagation algorithm, which computes the gradient of the cost function
with respect to the weights and biases, i.e. ∇w,b J with w and b vectors containing all the
weights and biases of the network. This gradient is then given to any of the optimizers
described in Section 3.2.
The backpropagation algorithm was first proposed by Werbos (1974), reinvented sev-
eral times and popularized by Rumelhart and McClelland (1989). While a detailed treat-
ment is out of the scope of this section, it is possible to build intuition on its operating
principles recalling the chain rule of differentiation.
For the network in Fig 10, the gradient ∇w,b J is a vector with 19 components providing
the sensitivity of the error to each of the network parameters. A direct computation of
this gradient involves a large chain of differentiations and is thus impractical. The back-
propagation algorithm offers a more convenient alternative by propagating the error
backward, from the output to the input.


The first step of the algorithm is to look at how an error propagates from one layer
to the following one, without attempting to get the entire chain. We simplify the notation by
writing the neuron’s output as

y^{(l)}_j = σ^{(l)}( z^{(l)}_j )   with   z^{(l)}_j = W^{(l)} y^{(l−1)}_j + b^{(l)}_j   (25)

Therefore, the output of the network is simply σ^{(4)}(z^{(4)}_1), where z^{(4)}_1 is clearly a com-
plicated function of the input xj (of no interest for the moment). Taking the sum of the
squared errors as a cost function over the n∗ training points gives

J(w, b) = Σ_{i=1}^{n∗} ( y_i − σ^{(4)}( z^{(4)}_1 ) )²   (26)

We introduce next the variable δ^{(l)}_j = ∂J/∂z^{(l)}_j as the gradient of the cost function
with respect to the output of the neuron j at layer l before this goes into the activation
function. This quantity can be easily computed in the last layer from (26):

δ^{(4)}_1 = −2 Σ_{i=1}^{n∗} ( y_i − σ^{(4)}( z^{(4)}_1 ) ) dσ^{(4)}/dz^{(4)}   (27)
where the derivative of the activation function dσ/dz is readily available from its
definition⁴⁰. The scope of the back propagation is to use this entry to compute the others
(δ^{(3)}, δ^{(2)} and so on). The error of a neuron propagates to the ones in the consecutive layer
as follows:

δ^{(l)}_j = Σ_{j′} ( ∂J/∂z^{(l+1)}_{j′} ) ( ∂z^{(l+1)}_{j′}/∂y^{(l)}_j ) ( ∂y^{(l)}_j/∂z^{(l)}_j )   (28)

where j′ is the index spanning the neurons connected with j. The first term hides the
effect of what happens from one neuron to the next while the other two are a direct
application of the chain rule, having considered ‘whatever happens next’ as a function of
the form J(z^{(l+1)}(y^{(l)})). Introducing the definition of δ^{(l+1)}_{j′} and (25) gives:

δ^{(l)}_j = ( Σ_{j′} δ^{(l+1)}_{j′} w^{(l)}_{j,j′} ) dσ^{(l)}/dz^{(l)}   (29)
Starting from (27) it is thus possible to backpropagate the errors and compute all the
δ’s. This is the key intermediate step towards the computation of the gradient, which can
be readily obtained by noticing that:

∂J/∂w^{(l)}_{j,j′} = δ^{(l)}_{j′} y^{(l−1)}_j   and   ∂J/∂b^{(l)}_j = δ^{(l)}_j   (30)
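The backward sweep (27)-(30) for the small network above is equally compact. The sketch below reuses the placeholder weights and activations of the previous snippet and returns the gradients for one training pair; it is a didactic transcription of the chain rule, not the automatic differentiation used by Keras, and the index conventions follow the code comments.

import numpy as np

def d_sigmoid(z):
    s = 1 / (1 + np.exp(-z)); return s * (1 - s)

def d_elu(z, alpha=1.0):
    return np.where(z > 0, 1.0, alpha * np.exp(z))

def grads(x, y_t):
    """Gradients of J = (y_t - y)^2 for the 1-2-3-1 network, via (27)-(30)."""
    # forward sweep, storing pre-activations z and outputs y
    z1 = np.atleast_1d(w10 * x + b11); y1 = sigmoid(z1)
    z2 = W12.T @ y1 + b2;              y2 = sigmoid(z2)
    z3 = W23.T @ y2 + b3;              y3 = sigmoid(z3)
    z4 = W34.T @ y3 + b4;              y = elu(z4)
    # backward sweep: deltas from (27), then propagated with (29)
    d4 = -2 * (y_t - y) * d_elu(z4)
    d3 = (W34 @ d4) * d_sigmoid(z3)
    d2 = (W23 @ d3) * d_sigmoid(z2)
    d1 = (W12 @ d2) * d_sigmoid(z1)
    # gradients from (30): dJ/dW[j, j'] = y_j(prev) * delta_j'(next); dJ/db = delta
    return (np.outer(y3, d4), d4, np.outer(y2, d3), d3,
            np.outer(y1, d2), d2, x * d1, d1)

print(grads(1.5, 0.7)[0])   # gradient with respect to W34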
We conclude this introduction by showing the network in Figure 10 in action on the
data in Figure 6. The network is implemented in Keras (Chollet, 2017). The results of
100 training attempts with this network are shown in Fig. 11 (left); the same exercise with a
larger network, discussed below, is shown in Fig. 11 (right). The weights and biases
⁴⁰ This should already inform the reader about the importance of having differentiable (and smooth) activation functions!


are initialized randomly and the optimization is carried out using the ADAM optimizer
with η = 0.02, taking all the default parameters of the optimizer. The dataset has been
pre-processed using Support Vector Machines to remove the outliers (which were found
to have a significant impact). The training is performed using mini-batches of 120 data
points and an ‘early stopping’ approach is used to control overfitting. This technique
consists in stopping the optimization if the validation error starts increasing.
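A minimal Keras sketch of this setup is reported below. The synthetic data is a stand-in for the dataset of Figure 6, and the early-stopping patience is an assumption (the text only states that training stops when the validation error starts increasing).

import numpy as np
from tensorflow import keras

# Hypothetical stand-in for the (outlier-cleaned) dataset of Figure 6
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 450); y = np.sin(x) * x + 0.5 * rng.standard_normal(450)

model = keras.Sequential([
    keras.layers.Dense(2, activation='sigmoid', input_shape=(1,)),
    keras.layers.Dense(3, activation='sigmoid'),
    keras.layers.Dense(1, activation='elu')])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.02), loss='mse')
stop = keras.callbacks.EarlyStopping(patience=50, restore_best_weights=True)
history = model.fit(x, y, batch_size=120, epochs=2000,
                    validation_split=0.25, callbacks=[stop], verbose=0)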


Figure 11: Left: regression results of the ANN with 2 and 3 neurons in the hidden layers.
The regression is repeated 100 times (continuous red lines) and the data is preprocessed
via Support Vector Machines for outlier removal. Right: same plot but considering an
ANN with 64 and 128 neurons in the hidden layers.

The variance of the tests appears acceptable, but shows that the optimizer does have
a tendency to occasionally fall into local minima. The results show that the simple
network in Figure 10 underfits the data. The same analysis is then carried out increasing
the number of neurons to 64 in the second layer and 128 in the third layer. The results
are collected in Figure 11 (right). While the same observations on the variance hold, the larger
network performs better (as expected). On the other hand, readers accustomed to linear
methods might find the lack of overfitting somewhat surprising: this network consists
of 8579 parameters and is trained using only 450 datapoints. Underdetermined networks,
having many more parameters than datapoints, are very common in deep learning. An
interesting discussion on this topic is provided by Zhang et al. (2016).

3.4 An exercise on Turbulence Modeling


We close this section on regression with a simple exercise on turbulence modeling, adapted
from Fiore et al. (2020).
Consider the flow configuration in Figure 12: a turbulent channel flow within two
plates at a distance 2δ, subject to constant and uniform heat fluxes from the walls. The
flow is stationary and fully developed, hence governed by the following Reynolds-averaged
momentum and energy equations:

−(1/ρ) dp/dx − d(uv)/dy + ν d²U/dy² = 0   (31)

α d²T/dy² − d(vθ)/dy = 0   (32)


where ρ is the fluid density, ν is the kinematic viscosity, α is the thermal diffusivity and
p(x) represents the mean pressure field. The velocity field is decomposed as the sum of a
mean flow U (y) and a fluctuating velocity with stream-wise component u(x, y) and cross-
stream component v(x, y); the temperature field is decomposed as the sum of a mean
temperature T and a fluctuating temperature θ. The overbar denotes spatial averaging,
and the terms uv and vθ are the Reynolds stresses and the turbulent heat fluxes. Both
quantities need to be modelled in order to solve (31) and (32).
Equations (31) and (32) can be written in dimensionless form as follows:

−dp⁺/dx∗ − d(u⁺v⁺)/dy∗ + (1/Reτ) d²U⁺/dy∗² = 0   (33)

(1/(Reτ Pr)) d²T⁺/dy∗² − d(v⁺θ⁺)/dy∗ = 0   (34)

where p⁺ = p/(ρu²τ), U⁺ = U/uτ and T⁺ = T/Tτ, x∗ = x/δ and y∗ = y/δ. The reference
velocity of the problem is the friction velocity uτ = √(τw/ρ), with τw the wall shear stress;
the reference temperature is the friction temperature Tτ = qw/(ρ cp uτ), with cp the fluid’s
specific heat. The dimensionless numbers in (33) and (34) are the turbulent Reynolds
number Reτ = uτ δ/ν and the Prandtl number Pr = ν/α.

Figure 12: Scheme of fully developed turbulent channel flow with uniform heat flux bound-
ary conditions.

Because of the full development of the velocity field, a self-similar transform can be
used to render the temperature equation 1D, i.e. solely a function of y, while accounting
for its linear increase along x due to the uniform wall heating. Following Kawamura et al.
(2000), it is convenient to introduce the self-similar variable T̃⁺:

T̃⁺(y) = ( d⟨T⁺⟩/dx∗ ) x∗ − T⁺(x, y)   (35)

where

d⟨T⁺⟩/dx∗ = 1/⟨U⁺⟩   (36)

and ⟨ ⟩ denotes the spatial average along y to get bulk quantities, i.e.

⟨T⁺⟩ = (1/δ) ∫₀^δ T⁺(x, y) dy   and   ⟨U⁺⟩ = (1/δ) ∫₀^δ U⁺(x, y) dy .   (37)


Introducing (35) and (36) in (34) yields the 1-D energy equation for the self-similar
variable T̃⁺:

∂/∂y∗ [ (1/(Reτ Pr)) ∂T̃⁺/∂y∗ − v⁺θ̃⁺ ] + U⁺/⟨U⁺⟩ = 0   (38)

Focusing on the thermal problem, we assume that the velocity field is given and we seek
to develop a model for the turbulent heat flux v⁺θ̃⁺. This quantity is known to be a function
of the Reynolds number Re = ⟨U⟩δ/ν, the Prandtl number Pr, the velocity gradients
dU/dy and the temperature gradients dT/dy. The scope of the regression problem is to
identify this function, which we here denote as v⁺θ⁺ = g(x, w), where x is the vector
containing the quantities that are assumed to play a role and w is the set of weights
defining the function parametrization.
For a given function v⁺θ⁺ = g(x, w), it is possible to integrate (38) and obtain a
prediction of the temperature field over a mesh y ∈ R^{np}, denoted as TP(y, x, w). Given
some data T(y), it is possible to evaluate the quality of the prediction in terms of an
l2-norm cost function:

J(w) = ||T(y) − TP(y, x, w)||₂ .   (39)


The target function is thus the turbulence model that minimizes the discrepancy be-
tween prediction and available data, as measured by J(w). In this exercise, the training is
carried out with the DNS dataset freely available from Kawamura et al. (2000), consisting
of 11 temperature profiles obtained at different Re and P r numbers. The list of condition
is shown in table 1.

Table 1: Available DNS database Kawamura et al. (2000).

Application  | Reτ | Pr    | Optimum Prt (SLSQP)
Training     | 640 | 0.71  | 0.828
             | 640 | 0.025 | 1.352
             | 395 | 0.025 | 1.6
             | 180 | 0.71  | 0.947
             | 180 | 0.4   | 0.95
             | 180 | 0.2   | 1.049
             | 180 | 0.1   | 1.263
             | 180 | 0.05  | 1.7
             | 180 | 0.025 | 2.547
Validation   | 395 | 0.71  | 0.83
             | 180 | 0.6   | 0.92

In this exercise, the integration of (38) is carried out using a finite difference discretiza-
tion on a mesh of np = 500 points, hence solving the following matrix equation

L T̃⁺ + G α⁺ₜ G T̃⁺ + S = 0   (40)

where L is the Laplacian matrix multiplied by (Reτ Pr)⁻¹, G is the backward finite
difference matrix and S is the source vector containing the values of U⁺/⟨U⁺⟩.


In the following subsections, we consider two possible parametrizations of the turbulence model.

3.4.1 The Constant P rt


The most common formulation for the turbulent heat flux is the Reynolds Analogy (Si-
monson, 1988), which introduces a thermal turbulence diffusivity αt that links the turbulent
heat flux to the average temperature gradients:

vθ = −αt dT/dy = −(νt/Prt) dT/dy .   (41)
This quantity is further assumed to be linked to the eddy viscosity νt via the turbulent
Prandtl number P rt = νt /αt , which is often taken as a constant in the range P rt = 0.5−1
(Bird et al., 2006).
Nevertheless, this assumption has been shown to be inaccurate even in simple flow
configurations. Figure 13 plots the turbulent Prandtl number from several of the test
cases in Kawamura et al. (2000)’s dataset, highlighting that this value is not constant and
is larger than unity for fluids with low Prandtl number.


Figure 13: Profiles of the turbulent Prandtl number along the height of the channel
(Kawamura et al., 2000) for different Reτ and Pr.

While keeping the constant Prt assumption, Reynolds (1975) proposed the
following correlation to better account for the dependency on Re and Pr:

Prt = ( 1 + 100 (Re Pr)^{−1/2} ) ( 1/(1 + 120 Re^{−1/2}) − 0.15 )   (42)

In this first approach to the exercise, we assume that the turbulent Prandtl number
is constant but we leave it as a regression parameter. The model therefore consists of a
single weight and the regression task reduces to the optimization problem of identifying
the Prt that minimizes the cost function in (39). This optimization is solved using the
Sequential Least Squares Programming (SLSQP) implemented in the Scipy library and the
results for the optimal turbulent Prandtl number are shown in the last column of Table
1. These are compared with the prediction from (42) in Figure 14 (left), while Figure 14
(right) compares the dimensionless temperature profiles produced by both methods with
the DNS data at the same Reynolds and Prandtl numbers (Reτ = 180, Pr = 0.6). The
optimized Prt significantly improves the temperature prediction.
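The single-weight optimization can be set up with scipy.optimize.minimize. In the sketch below, solve_temperature is a hypothetical placeholder for the finite-difference solver of (40), which is not reproduced here; the initial guess and bounds are illustrative.

import numpy as np
from scipy.optimize import minimize

def cost(Pr_t, T_dns, y):
    """J(w) from (39) for a constant turbulent Prandtl number.
    solve_temperature is a user-provided solver integrating (40)."""
    T_pred = solve_temperature(Pr_t[0], y)
    return np.linalg.norm(T_dns - T_pred)

# res = minimize(cost, x0=[0.9], args=(T_dns, y), method='SLSQP',
#                bounds=[(0.1, 5.0)])
# print(res.x)   # optimal Pr_t for this profile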


Figure 14: (a) Turbulent Prandtl number as a function of the molecular Prandtl number
for Reτ = 180: comparison between the Reynolds correlation for Prt (Reynolds, 1975) and
the optimal values calculated with the SLSQP. (b) Temperature profiles computed with
the optimized Prt and with the value of Prt given by eq. (42) (Reτ = 180, Pr = 0.6):
comparison with the corresponding DNS temperature profile.

3.4.2 The ANN turbulence model


The second approach assumes that αt is an additional flow variable that depends on the
turbulence level, the distance from the closest wall and the molecular Prandtl number.
An ANN is used for this regression exercise, considering the following input features:

• x₁ = k²/(νε) (approximated, dimensionless eddy viscosity)

• x₂ = k/(U² + k) (turbulence intensity)

• x₃ = √k y/ν (dimensionless wall distance)

• x₄ = Pr (molecular Prandtl number)

where k is the turbulent kinetic energy, k = ½ Σᵢ uᵢ², and ε is the turbulent dissipation
rate, ε = 2ν Σᵢ,ⱼ ∂uᵢ/∂xⱼ ∂uⱼ/∂xᵢ. The parameter x₂ is an alternative definition of the
turbulence intensity that makes this quantity bounded between 0 and 1. Figure 15
shows the evolution of the features x₁–x₃ along the channel for different values of the
Reynolds number. All these input variables are dimensionless quantities to enforce the model
invariance with respect to changes of the geometry or a specific choice of flow parameters. The



Figure 15: Evolution of the input parameters x1 − x3 over the channel width at different
Reynolds numbers.

structure of the network is shown in Figure 16. It consists of a 4-neuron input layer, a
Gaussian noise layer of 4 neurons, two hidden layers of 50 neurons each and a single-neuron
output layer. The output is then introduced into the numerical solver to predict the temperature
distribution and compute the associated cost in (39). The activation functions in the two
hidden layers are chosen as linear and hyperbolic tangent. The dataset is scaled to the
range [−1, 1].
The training is performed using the ADAM optimizer with stochastic (mini-batch) gradients.
The training-validation split is defined as shown in Table 1. The Gaussian noise layer acts as a
regularization strategy that mitigates overfitting. The noise is generated with zero mean
and a standard deviation of σ = 0.005. Figure 17 shows the evolution of the loss function
during training for both the training and the validation sets: the training terminates
after 3500 epochs.
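A sketch of the network of Figure 16 in Keras is reported below; the layer sizes and activations follow the text, while the coupling with the numerical solver (needed to evaluate (39)) is omitted.

from tensorflow import keras

# Sketch of Figure 16: inputs x1..x4 -> Gaussian noise -> 50 (linear) -> 50 (tanh) -> alpha_t
inp = keras.Input(shape=(4,))
h = keras.layers.GaussianNoise(0.005)(inp)        # regularizing noise layer
h = keras.layers.Dense(50, activation='linear')(h)
h = keras.layers.Dense(50, activation='tanh')(h)
out = keras.layers.Dense(1)(h)                    # turbulent diffusivity alpha_t
model = keras.Model(inp, out)
model.compile(optimizer='adam', loss='mse')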
Figure 18 shows the predicted turbulent Prandtl number for four profiles for which this quantity
is available in the DNS dataset, together with the one extracted from the DNS. The
prediction from the ANN appears in good agreement in the logarithmic layer (y⁺ > 30),
where the turbulent flux is significantly higher than the viscous one. On the other hand,
both νt and αt tend to zero towards the viscous sub-layer (y⁺ < 10) and hence the cost
function becomes insensitive to the temperature. A more complex formulation, capable
of modelling the turbulent heat flux down to y = 0, could be built by introducing the
role of the imposed heat fluxes in the cost function.


Figure 16: Scheme of the ANN employed to predict the turbulent diffusivity αt .


Figure 17: Evolution of the loss function during training.

Finally, Figure 19 compares the temperature and turbulent heat flux profiles obtained
with the Prandtl number optimized by SLSQP (first approach, green lines) and the
ANN (second approach, light blue lines), along with the corresponding DNS profiles (red
lines). The nonlinear mapping offered by the ANN clearly yields a more accurate and
robust turbulence model, as visible from the comparison at Reτ = 395, Pr = 0.71, which
is one of the validation profiles.

4 Dimensionality Reduction
4.1 A first simple example
Consider the dataset in Fig 20. The set consists of np = 150 datapoints x ∈ R3×1 that
have been sampled from three distributions in a three-dimensional space. The colors in



Figure 18: Profiles of the turbulent Prandtl number predicted by the ANN (solid lines)
over the channel width. Comparison with the DNS profiles (dashed lines).


Figure 19: Comparison of the results achieved with the optimization approach (green
lines) and the ANN-regression approach (light blue lines). DNS data are indicated with
red lines.

the markers encode the different distributions (clusters). The dataset can be compactly
represented by a dataset matrix X ∈ R^{3×150} with each point (vector) along its columns.
To minimize the amount of technical details, the data has been mean-centered (i.e. the
mean vector over the rows of X is zero).
To illustrate the main ideas behind dimensionality reduction, we here seek to
identify the best bidimensional representation of the data. That is, we seek a compression
ratio 3 : 2, from R³ to R². Of course, a simple approach would consist in projecting onto
any of the planes available from the Cartesian coordinate frame. The projection implies
the definition of a basis matrix, here denoted as B, whose columns collect the basis vectors
of the projected space. For instance, the projection onto the (i, j) plane is



Figure 20: Plot of the dataset considered in this section: np = 150 points sampled from
three probability density functions in a three-dimensional space.

X_B = Bᵀ X   with   B = [ 1 0 ; 0 1 ; 0 0 ] ,   X_B ∈ R^{2×150}   (43)
We evaluate the quality of a projection by the amount of information lost. The opera-
tion in (43) is a linear encoder. The matrix B has no inverse. Nevertheless, it is possible
to show (Mendez, 2020a) that the ‘best possible’ inversion of (43) is

X̃_B = B X_B = B Bᵀ X = A_B X ,   (44)

where A_B = BBᵀ is the linear autoencoder with respect to the basis B. For this trivial
case, the autoencoder simply removes the last row of X and the inversion cannot recover
it. We could evaluate the performance of this linear autoencoder using an L2 norm:

J(B) = ||X − X̃_B||₂ = ||X − A_B X||₂   (45)


A natural question is that of identifying the basis vectors that minimize
(45). This leads to the Principal Component Analysis (PCA). Without entering into
the details of the derivation⁴¹, the basis that leads to this optimal reconstruction is the
solution of the following eigenvalue problem

(XXᵀ) uⱼ = λⱼ uⱼ   ⇐⇒   (XXᵀ) U = U Λ   (46)

where the eigenvectors uⱼ are the principal components, collected as columns of the
matrix U, and the eigenvalues λⱼ (collected in the diagonal of Λ) control their relative
importance. It is easy to show that these are non-negative numbers and that (46) is
the first step to compute the Singular Value Decomposition (SVD) of X. Keeping
the focus on the definition of the best linear autoencoder, we have
the focus on the definition of the best linear autoencoder, we have
⁴¹ An extensive presentation is proposed by Jolliffe (2002) and Bishop (2006).


X̃_U = U X_U = U Uᵀ X   ⇐⇒   x̃ₖ = Σ_{j=1}^{2} c_{kj} uⱼ   with   c_{kj} = uⱼᵀ xₖ ,   (47)

with A_U = UUᵀ the PCA autoencoder. The summation on the right recalls that each
of the approximations (x̃ₖ) of a data point (xₖ) is a linear combination of the principal
components uⱼ. The set of coefficients c_{kj}, written as a vector over the index j, is the
PCA-transformed data: that is, the projection of a data point onto the principal components.
Returning to the notation in Fig 4, this is the encoding function E; the entries of the
encoded representation are inner products:

zᵢ = E(xᵢ) → zᵢ[j] = xᵢᵀ uⱼ = uⱼᵀ xᵢ   (48)

The decoding function for the PCA is simply x̃ᵢ = D(zᵢ) = U zᵢ.
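In code, the PCA encoder and decoder reduce to a truncated SVD. The sketch below, on a random stand-in for the 3 × 150 dataset, follows (46)-(48).

import numpy as np

def pca_autoencode(X, q=2):
    """Encode/decode a mean-centered dataset X (features x samples) via (47)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # left vectors solve (46)
    Uq = U[:, :q]          # first q principal components
    Z = Uq.T @ X           # encoder, eq (48)
    X_tilde = Uq @ Z       # decoder
    return Z, X_tilde

# Hypothetical 3 x 150 stand-in dataset
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 150)); X -= X.mean(axis=1, keepdims=True)
Z, X_tilde = pca_autoencode(X, q=2)
print(np.linalg.norm(X - X_tilde) / np.sqrt(3 * X.shape[1]))  # normalized error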
Figure 21 shows a comparison between the PCA and the trivial projection onto B.

Figure 21: Comparison of the data reconstruction via the autoencoder A_U (top left)
versus the autoencoder A_B (bottom left). On the right, the data is shown in the 2D
representations provided by each basis, keeping the cluster color code.

The figures on the left show the reconstructions X̃_U (top) and X̃_B (bottom) while the
figures on the right show the data in the reduced space: the plane (u₁, u₂) for the PCA
and the plane (i, j) for the trivial projection. In the reconstruction plots, the blue markers
indicate the original data while the orange ones indicate the reconstructions. In the 2D
plots, the color code of the clusters is maintained.

The normalized error (i.e. (45) divided by 3np) for the PCA is E = 0.14 while it is
E = 1.05 for the trivial projection. Clearly, the 2D representation from the PCA preserves


more information and keeps the distant cluster (red markers) far from the other two as
in the original 3D. Note that the area of machine learning that focuses on the mapping
from a higher dimensional space onto a lower dimensional space is known as manifold
learning. This is a subset of dimensionality reduction and encompasses mostly nonlinear
methods. Manifold learning is generally not concerned with the possibility of mapping
back the data from the reduced space onto the original one, but rather aims at preserving
certain metrics (e.g. distances) in the reduced space.
Before proceeding with a more interesting example, it is worth testing two nonlinear
methods and assessing whether these allow for autoencoders that lead to a lower error for the
same compression ratio. A nonlinear generalization of PCA is the Kernel PCA (KPCA,
Schölkopf et al. (1997)). The underlying idea is to perform PCA on a dataset that has
been first transformed by a nonlinear function Φ(), similarly
to what was introduced in Section 3 for the regression problem. The transformed data is
thus X_T = Φ(X) ∈ R^{nF×np}, with nF the dimension of the new space, which is increased,
possibly to infinity. This new space is referred to as the feature space⁴².
Linear operations in the feature space (e.g. a projection) are nonlinear in the original
space. The principal components in the feature space become nonlinear functions that,
in general, cannot be represented in the original space. The eigenvalue problem in the
feature space reads

( Φ(X)Φ(X)ᵀ ) wⱼ = λⱼ wⱼ   (49)


In KPCA, the kernel trick allows for avoiding every operation in the feature space.
Moreover, one wishes to compute the projection onto wj ∈ RnF ×1 without computing
wj . The entire derivation of the algorithm is presented in Schölkopf et al. (1997); Bishop
(2006). The main idea is to introduce the expansion
np
wj = aji Φ(xi ) = Φ(X)aj (50)
X

i=1

that is, writing the principal components in the feature space as a linear combination of
the features. Introducing (50) in (49) and multiplying both sides by Φ(X)ᵀ gives

Φ(X)ᵀ ( Φ(X)Φ(X)ᵀ ) Φ(X) aⱼ = λⱼ Φ(X)ᵀ Φ(X) aⱼ  →  K² aⱼ = λⱼ K aⱼ .   (51)

where K = Φ(X)ᵀΦ(X) can be computed with a kernel function:

K[i, j] = Φ(xᵢ)ᵀ Φ(xⱼ) = κ(xᵢ, xⱼ)   (52)


similarly to what was done in Section 3 with the function g. If this matrix is invertible, the last
equation becomes K aⱼ = λⱼ aⱼ, i.e. an eigenvalue problem for K. Given the eigenvectors
⁴² The kernel trick was historically introduced as a tool to enable linear classifiers to learn nonlinear decision boundaries. This can be done by increasing the dimensionality of the problem. For example, consider a nonlinear curve in 2D: the circle x² + y² = 1. This can be seen as the intersection of a paraboloid z = x² + y² and the plane z = 1. Hence, rather than looking at the dataset (x, y) in the 2D plane, one could look at the transformed set Φ(x, y) = [x, y, x² + y²] in the feature space. The data has been lifted from 2D to 3D, but this allows for linear separation: the linear decision boundary in the feature space is a nonlinear decision boundary in the original space.


aⱼ, the projection of a feature vector Φ(xᵢ) onto the principal component wⱼ in the feature
space gives the encoding function for the KPCA:

zᵢ = E(xᵢ) → zᵢ[j] = wⱼᵀ Φ(xᵢ) = Φ(xᵢ)ᵀ wⱼ = Φ(xᵢ)ᵀ Φ(X) aⱼ = κ(xᵢ, X) aⱼ   (53)

The first part is analogous to (48), while in the second part the kernel function is intro-
duced. The reader should notice that the computational cost of the KPCA is significantly
increased, since it involves the diagonalization of K ∈ R^{np×np} while the PCA requires the
diagonalization of XXᵀ ∈ R^{3×3}. The major difficulty, moreover, is in the decoding step,
which in the KPCA has no analytical form. The reader is referred to Bakir et al. (1999)
for details on the decoding of KPCA.
The KPCA for this example is performed using scikit-learn (Pedregosa et al., 2011).
The results are shown in Fig. 22, both in terms of reconstruction and 2D representation.
The chosen kernel is a radial basis function with γ = 1. The reconstruction error is of
the order of 10⁻¹⁴. In the reduced space, the clusters appear fairly well separated: the
nonlinear mapping allows preserving most of the information in the reduced space so
that the decoder successfully recovers the 3D representation.
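A minimal scikit-learn sketch is reported below; the random matrix is again a stand-in for the 3 × 150 dataset, and fit_inverse_transform enables the (approximate, learned) decoder mentioned above.

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 150))              # stand-in for the 3 x 150 dataset

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=1.0,
                 fit_inverse_transform=True)   # learns an approximate decoder
Z = kpca.fit_transform(X.T)                    # rows are samples; encoder, eq (53)
X_tilde = kpca.inverse_transform(Z)            # learned pre-image map
print(np.sqrt(np.mean((X.T - X_tilde) ** 2)))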


Figure 22: Same as Fig 21 but considering an autoencoding via KPCA.

Finally, the last autoencoder tested is the fully connected ANN architecture in Figure
23. Recalling the input-output relations of a neuron, described in Sec 3.3, the reader
should notice that the PCA autoencoder is recovered in the case of linear activation functions. The
training is performed using the ADAM solver with batches of 32 samples, and the chosen activation
functions are sigmoids for both encoder and decoder. The dataset is scaled to the range
[0, 1].
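A minimal Keras sketch of this architecture reads as follows; the training call is commented out and the epoch count is illustrative (a linear counterpart is obtained with activation='linear').

from tensorflow import keras

# 3 -> 2 -> 3 autoencoder of Figure 23, with sigmoid activations as in the text
inp = keras.Input(shape=(3,))
code = keras.layers.Dense(2, activation='sigmoid')(inp)    # encoder
out = keras.layers.Dense(3, activation='sigmoid')(code)    # decoder
autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X.T, X.T, batch_size=32, epochs=5000, verbose=0)
# encoder = keras.Model(inp, code)   # 2D representation via encoder.predict(X.T)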
Figure 24 compares the error convergence during the training for the nonlinear au-
toencoder and the linear one, obtained by setting linear activation functions. The error in
this plot is computed on the scaled dataset, so it should not be compared with the root
mean square errors in the titles of the 3D plots in Figures 20-22. As expected, the lin-
ear autoencoder matches the results from the PCA to machine precision. This gives an
interesting alternative tool to compute the PCA of extremely large datasets, for which the
matrix multiplications involved in (47) might become too memory demanding.
The convergence for the nonlinear autoencoder is much more gentle, but the final result
yields comparable errors. Figure 25 shows the results for the nonlinear autoencoder in


Figure 23: Architecture of the ANN autoencoder used in this first exercise.

terms of the 3D reconstruction and the 2D mapping. The performances are comparable
with those of the standard PCA, with an RMS error of about E = 0.14.


Figure 24: Convergence of the training error for the nonlinear autoencoder (using sigmoid
activation functions) and the linear autoencoder (using linear activation functions).

4.2 Dimensionality reduction of a flow field


We repeat the same exercise considering a dataset from 2D Time-Resolved Particle Image
Velocimetry (TR-PIV). This dataset can be downloaded and stored in a local directory by
running the following Python script:
import urllib.request
print('Downloading Data for Tutorial 2...')
url = 'https://osf.io/47ftd/download'
urllib.request.urlretrieve(url, 'Ex_4_TR_PIV_Jet.zip')
print('Download Completed! I prepare data Folder')
# Unzip the file
from zipfile import ZipFile
String = 'Ex_4_TR_PIV_Jet.zip'
zf = ZipFile(String, 'r'); zf.extractall('./'); zf.close()

Details on the experimental configuration can be found in Mendez et al. (2020).



Figure 25: Same as Fig 21 but considering an autoencoding via ANNs. The results,
compared with the top plots in Figure 21, are similar to those from the PCA.

The set consists of nt = 13200 velocity fields, sampled at fs = 3 kHz over a grid of
71 × 30 points. The spatial resolution is approximately ∆x = 0.85 mm. A snapshot of the
velocity field is shown in Figure 26 (left).
This test case is characterized by a large scale variation of the free-stream velocity.
A plot of the velocity magnitude in the main stream is shown in Figure 26 (right).
In the first 1.5 s, the free-stream velocity is approximately U∞ ≈ 12 m/s. Between
t = 1.5 s and t = 2.5 s, this drops down to U∞ ≈ 8 m/s. The variation of the flow velocity
is sufficiently slow to let the vortex shedding adapt, and hence preserve an approximately
constant Strouhal number St = f d/U∞ ≈ 0.19, with d = 5 mm the diameter of the
cylinder. Consequently, the vortex shedding frequency varies from f ≈ 459 Hz to f ≈ 303 Hz.


Figure 26: Snapshot of the velocity field in the test case considered in this section. This is
the flow past a cylinder in transient conditions, measured via TR-PIV (see Mendez et al.
(2020)).

The dataset matrix is now X ∈ R^{4260×13200}. We here seek a compression of the snap-
shots from R^{4260} to R^{3×1}, that is (4260 : 3). We consider the same autoencoders tested in
the previous section: PCA, KPCA and ANNs.
The data is scaled to the range [0, 1]. For the KPCA, the γ parameter is set to γ = 1.
For the ANN autoencoder, the network input and output consist of 4260 neurons while the
hidden layer consists of 3 neurons. The total number of training weights is thus 38,343.
The chosen activation function is the ELU, as it allowed for smoother training. The
training is performed with the ADAM optimizer and a batch size of 128, and starts from a
random initialization of the weights. Dropout and batch normalization are used between


input and hidden layer to control over-fitting and facilitate the training. The first tech-
nique consists in randomly deactivating some of the neurons (in this case 20%), hence forcing
the network to level the contribution of all the weights; the second consists in enforcing
that the outputs of the neurons have zero mean and unitary standard deviation.
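A sketch of the linear (PCA) compression of the snapshot matrix is reported below; the random matrix is a small stand-in, as the assembly of X from the downloaded PIV files is not reproduced here.

import numpy as np

# X = ...  snapshot matrix of shape (4260, 13200), scaled to [0, 1] (not shown)
rng = np.random.default_rng(0)
X = rng.random((4260, 200))                   # small stand-in for illustration

X_mean = X.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(X - X_mean, full_matrices=False)
Z = U[:, :3].T @ (X - X_mean)                 # 3D encoding of each snapshot
X_tilde = U[:, :3] @ Z + X_mean               # decoded snapshots
print(np.sqrt(np.mean((X - X_tilde) ** 2)))   # global mean square error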
The results are collected in Figure 27. The figures on the left show the data in the reduced
space identified by each of the autoencoders. The figures on the right show an example snap-
shot, with the title indicating the RMS error over the entire field. In each case, the snapshot
corresponds to the red marker on the left.
The three autoencoders learn valuable 3D representations of a 4260-dimensional space,
with the reconstructed snapshots being indistinguishable from the original data. In the case of the
PCA, the learned representation appears to collapse around a paraboloid with principal
axis along z₃. Each of the steady states corresponds to circles at z₁ = const, and the
transient leads the shift from the first circular orbit (with larger radius) to the second.
The existence of such a paraboloid in an optimally linear basis of eigenfunctions was first
derived by Noack et al. (2003) (see also Noack et al. (2020) for a review). It is nevertheless
worth highlighting that most of the previous works on reduced order modeling of
the cylinder wake flow focus on much lower Reynolds numbers.
The 3D mapping produced by the kernel PCA appears as a distorted version of the one
produced by the PCA. The kinematics of the data in this space is similar to the PCA, but
the regions of higher velocities are nonlinearly compressed in orbits of much lower radius.
The error is remarkably close to machine precision. Finally, the ANN autoencoder yields
results that are comparable to the PCA. Up to a rotation, the learned manifold is similar
to the paraboloid, with the only difference being an inclined axis. Interestingly, as no
special constraints are introduced in the network, the nonlinear mappings differ at each
run by a rotation of the encoded representation. It is also worth noticing that the
results produced by this autoencoder are obviously strongly dependent on the choice of
the activation function, but an analysis of such influence is out of the scope of this lecture.
The main message from this exercise, further discussed in Sec. 6, is that it has been possible to achieve impressive levels of compression. By running regression tools or system identification methods, one could construct predictive models in the reduced space and then use the decoder to map these back to the full space. By solving a regression problem in a 3-dimensional domain, one could predict with remarkable accuracy the evolution of a flow field in a 4260-dimensional domain.

5 Reinforcement Learning and Control


We illustrate the working principle as well as the main challenges in implementing Rein-
forcement Learning (RL), introduced in Sec 2.4, for flow control problems. We consider
a simple exercise, for which an analytic solution is available, to evaluate the obtained
control laws. This exercise is adapted from Pino et al. (2020).

5.1 Problem Set


The environment (i.e. system to be controlled) is a 1D linear advection equation subject to a disturbance f(t, x) and a control action g(t, x):


[Figure 27 panel titles — PCA: global mean square error ≈ 2.3 × 10⁻⁴; KPCA: global mean square error ≈ 3.4 × 10⁻¹⁷; ANN: global mean square error ≈ 2.3 × 10⁻⁴.]

Figure 27: Results of the dimensionality reduction exercise on the flow past a cylinder, available from TR-PIV measurements (Mendez et al., 2020). The scatter plots on the left show the 3D representation identified from the 4260-dimensional velocity field. The red points on these scatter plots correspond to the snapshot shown on the right.


$$\frac{\partial u}{\partial t} + 330\,\frac{\partial u}{\partial x} = \underbrace{f(t,x)}_{\text{disturbance}} + \underbrace{g(t,x)}_{\text{control}}\,, \qquad (54)$$

over a domain x ∈ [0, 50]. Both sides of the domain are taken as open (non-reflecting conditions). The domain and a snapshot of the system are illustrated in Figure 28.
The disturbance term (in yellow in Fig. 28) consists of a Gaussian function multiplied by a sinusoid with given frequency and amplitude. The control term (in green in Fig. 28) consists of an identical Gaussian placed downstream, with its amplitude driven by the agent (i.e. the controller):

$$\underbrace{f(t,x) = 300\sin(100\pi t)\,\mathcal{N}(x-5,\,0.2)}_{\text{disturbance}}\,; \qquad \underbrace{g(t,x) = a_t\,\mathcal{N}(x-18.2,\,0.2)}_{\text{control}} \qquad (55)$$

The amplitude $a_t$ constitutes the action of the agent. This is taken as a function of the system state $s_t$, which in this case consists of a vector of 18 values, sampled at three time steps (t, t−1 and t−2) in 6 points (indicated with red dots).
The goal of the agent is to cancel the disturbance downstream of its location, i.e. achieving u(x > 18.2) ≈ 0 ∀t. In an open-loop approach, it is easy to show that the optimal control action is a sinusoid with the same amplitude and frequency as the forcing and a phase shift that accounts for the traveling time from the disturbance to the control location. In a closed-loop approach, it is easy to show that the optimal law consists of a linear combination of the observations at a given location at the time steps (t, t−1).
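To make the setup concrete, the following is a minimal sketch of the two source terms in (55); interpreting N(µ, σ) as a Gaussian bump of unit peak amplitude centered in µ with width σ is an assumption on the normalization:

```python
import numpy as np

def gaussian(x, mu, sigma):
    # Gaussian bump; the normalization convention is an assumption.
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def disturbance(t, x):
    # f(t, x) = 300 sin(100 pi t) * N(x - 5, 0.2), cf. Eq. (55)
    return 300.0 * np.sin(100 * np.pi * t) * gaussian(x, 5.0, 0.2)

def control(a_t, x):
    # g(t, x) = a_t * N(x - 18.2, 0.2): the amplitude is chosen by the agent
    return a_t * gaussian(x, 18.2, 0.2)
```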

5.2 RL Definitions
The solution of the control problem consists in identifying the control law $a_t = \pi(s_t)$, known as the policy, capable of achieving the control objective. In RL, we set this objective in the form of rewards at each time step, here taken as the negative L2 norm of the snapshot at time t, i.e. $r_t = -||u(t, x)||_2$.
Among the four categories of algorithms introduced in Sec. 2.4, we here focus on on-policy methods, aiming at learning the function π directly. In Deep Reinforcement Learning (DRL), the policy is parametrized by an ANN. In a deterministic approach, this policy is thus written as $a_t = \pi(s_t, \theta)$, with θ the set of weights43; in a stochastic approach, the ANN outputs the parameters of a distribution from which an action is sampled, and the policy is hence more commonly denoted as $\pi_\theta(a_t|s_t)$, where | denotes conditioning.
In this exercise we opt for a stochastic approach and parameterize the policy with a feed-forward ANN with an input layer of 18 neurons, a single-neuron output and 2 hidden layers with 500 neurons each. The activation functions are ReLU throughout the network. The training of the ANN therefore requires the optimization of about 6.8 million parameters. We here focus on model-free methods, for which we have no surrogate model to estimate how the environment responds to actions; hence the learning can only proceed via trial and error, as the agent interacts with the environment.
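A minimal Keras sketch of such an actor network is given below; interpreting the single output as the mean of a Gaussian action distribution (with the standard deviation as a separate trainable parameter, as is common in PPO implementations) is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Actor network from the text: 18 inputs -> 500 -> 500 -> 1, ReLU activations.
# The single output is taken here as the mean of the Gaussian from which the
# action a_t is sampled (this interpretation is an assumption).
actor = keras.Sequential([
    keras.Input(shape=(18,)),
    layers.Dense(500, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(1),  # mean of pi_theta(a_t | s_t)
])
```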
The interaction occurs through a series of episodes. Each episode produces a set
of triplets τ = {(s0 , a0 , r0 ), (s1 , a1 , r1 ) . . . } which are processed to update the weights of
43
While we have used w for the weights in the previous sections, we here follow the RL literature and opt for θ.


Figure 28: Description of the 1D advection equation environment with the source of perturbation (yellow), the control action (green) and the observation points (red dots).

the network towards better policy estimates. The performance is measured in terms of
cumulative rewards, defined as

$$R = \sum_{t=0}^{H} \gamma^t\, r_t$$

with H the duration of the episode and γ ∈ [0, 1] the discount rate. This parameter makes the agent prioritize immediate rewards over long-term gains. In this exercise, we fix the length of the episode to T = 0.3s and set its duration as H = T/∆t, with ∆t the time step of the numerical scheme solving (54). This is a simple explicit first-order finite difference method with CFL = 0.9 and nx = 200 grid points.
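The following is a minimal sketch of one explicit upwind step of (54), reusing the `disturbance` and `control` functions sketched in Sec. 5.1; the grid construction, the left-boundary treatment and the evaluation of the reward at every solver step are assumptions:

```python
import numpy as np

c = 330.0                       # advection speed in Eq. (54)
nx = 200                        # number of grid points
x = np.linspace(0.0, 50.0, nx)  # domain x in [0, 50]
dx = x[1] - x[0]
dt = 0.9 * dx / c               # CFL = 0.9

def step(u, t, a_t):
    # One explicit, first-order upwind step of Eq. (54).
    rhs = disturbance(t, x) + control(a_t, x)
    u_new = u.copy()
    u_new[1:] = u[1:] - c * dt / dx * (u[1:] - u[:-1]) + dt * rhs[1:]
    u_new[0] = u[0] + dt * rhs[0]   # left boundary: assumption (no incoming wave)
    return u_new

def reward(u):
    # r_t = -||u(t, x)||_2, the reward defined above
    return -np.linalg.norm(u)
```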
The cumulative reward allows for estimating the value of a state $s_t$, based on how much reward can be achieved if the system follows the policy $\pi_\theta(a_t|s_t)$ from that state until the end of the episode. This is given by the value function

$$V^\pi(s) = \mathbb{E}[R\,|\,s, \pi].$$

Similarly, the quality of an action $a_t$ at the state $s_t$ is given by the Q function

$$Q^\pi(s, a) = \mathbb{E}[R\,|\,s, a, \pi].$$

In both definitions, the expectation operator is simply the empirical average over a set of batch examples. The difference between these two quantities is the advantage function

$$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)\,,$$

which indicates the relative advantage of taking the action a in s as compared to the average action selected by the policy.
We recall in passing that off-policy methods aim at learning estimates of $Q^\pi$ and select the policy indirectly, by always choosing the action with the highest $Q^\pi$ value.
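For concreteness, the return entering these empirical averages can be computed from the rewards recorded along one episode as in the following minimal sketch:

```python
import numpy as np

def discounted_return(rewards, gamma):
    # R = sum_{t=0}^{H} gamma^t r_t, the cumulative reward defined above
    t = np.arange(len(rewards))
    return float(np.sum((gamma ** t) * np.asarray(rewards)))
```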


The exercise is here solved with the Proximal Policy Optimization (PPO) algorithm proposed by Schulman et al. (2017). This algorithm combines policy-gradient methods with value-based methods and hence belongs to the actor-critic family. The aim is to maximize an objective function, using the first-order optimization tools discussed in 3.2, which in RL become gradient ascent methods. In an actor-critic formalism, this objective function is usually defined as follows:

$$J(\theta) = \mathbb{E}_t\left[\log \pi_\theta(a_t|s_t)\, A_t\right] \qquad (56)$$


The importance of the first term can be derived from the policy gradient theorem (see Sutton and Barto (2018)) and accounts for the role of the actor: a policy ANN that must learn $\pi_\theta(a_t|s_t)$. The second term encodes the role of the critic: an additional ANN that learns estimates of the state values and enforces better actions based on their actual value. In this exercise, the critic and the actor networks share the same hidden layers and solely differ in the last layers (see Hill et al. 2018). It is worth noticing that the state-value network operates in a fully supervised setting.
It is worth highlighting that PPO introduces a number of modifications to (56) to increase the smoothness and the stability of the learning. This is one of the most recent and advanced methods, and its detailed description goes beyond the scope of these notes44.
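As an indication of how little glue code is needed, the following is a hedged sketch of the agent setup with the stable-baselines library (Hill et al., 2018) used in the next section; `AdvectionEnv` is a hypothetical gym-style wrapper of the solver of (54), and only the shared hidden-layer sizes follow the text:

```python
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

# `AdvectionEnv` is a hypothetical gym.Env wrapping the solver of Eq. (54):
# its step() advances the wave and returns the 18-value state and r_t = -||u||_2.
env = AdvectionEnv()

# Two shared hidden layers of 500 neurons (as in the text); other settings default.
model = PPO2(MlpPolicy, env, policy_kwargs=dict(net_arch=[500, 500]), verbose=1)
model.learn(total_timesteps=int(1e7))  # ~10 million interactions, as reported below
```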

5.3 Results
The environment and the agent were prepared using Python classes from stable-baselines (Hill et al., 2018). Figure 29 shows the reward evolution over the number of episodes, while Figure 30 (left) shows a snapshot of a control iteration after a few episodes. Figure 30 (right) compares the control action taken by the agent as a function of time with the optimal sinusoidal law predicted from first principles. The performances are obviously very poor in this first part of the training.
Fig. 31 shows the same plots as Fig. 30, but testing the agent towards the end of its training. It is remarkable that the agent ultimately achieves the control task and learns how to cancel the disturbance, while solely resorting to trial and error. While this problem is straightforward and amenable to analytic solutions, no simplifying assumptions on the problem's physics have been introduced. No parametrization linking observations to actions has been enforced.
On the other hand, the significant downside is that the training involved about 10 million interactions with the environment. In this exercise, the training required about 8h of computational time on a personal laptop, but such a low sample efficiency should warn the reader about the use of RL in more complex 3D CFD simulations. Techniques such as multi-environment frameworks (where multiple environments run simultaneously) and methods for restricting the action space are thus necessary enablers.

44
Besides the original publication, the reader is referred to Abbeel et al. (2017) for an overview.



Figure 29: Evolution of the rewards as a function of the number of simulation time steps.
The rewards are given as rt = −||u(x, t)||2 , and approach zero as the agent learns how to
cancel the incoming wave.

Figure 30: (a) Snapshot of the environment interacting with the agent at the beginning of its training, leading to complete failure of the control task. (b) RL control law versus the optimal control action (a harmonic).


Figure 31: Same as Figure 30, but considering the agent at the end of its training. Remarkably, the agent learns the control law and achieves good stabilization performance. The figure on the right shows that the agent learns a nearly optimal sinusoidal law from experience.


6 Machine Learning for Fluid Mechanics


We here present a short literature review of how (and in which subfields) the tools pre-
sented in this lecture have entered (or are entering) the fluid mechanics community. The
review is by no means exhaustive, but solely aimed at giving some ideas on the current
trends. The provided bibliography should serve as a starting point for a deeper review.

Supervised Learning
Regression models from the machine learning literature have been extensively used in fluid mechanics for a wide range of applications, including surrogate model-based optimization (Kim and Boukouvala, 2019), turbulence modeling (Duraisamy et al., 2019a), non-intrusive Reduced Order Modeling (Daniel et al., 2020; Hesthaven and Ubbiali, 2018; Renganathan et al., 2020; Pawar et al., 2019) and system identification for prediction and control (Pan and Duraisamy, 2018; Brunton et al., 2016; Kaiser et al., 2018; Huang and Kim, 2008).
With the growing capacity of data-driven models, the main challenges remain those of defining the amount of prior knowledge to be incorporated in the training phase, the amount of data required and its preparation (nondimensionalization, normalization and transformation), and the model's generalization outside the training range.
In turbulence modeling, machine learning techniques are increasingly used for algebraic modelling of the Reynolds stress tensor (Ling et al., 2016; Jiang et al., 2020; Akolekar et al., 2018; Sotgiu et al., 2019) and for the improvement of one- or two-equation models (Parish and Duraisamy, 2016; Holland et al., 2019). In some cases (Parish and Duraisamy, 2016; Holland et al., 2019), the turbulence modelling focuses on operational parameters (e.g. ω and ε in the k−ω and k−ε models) rather than purely physical ones: these quantities are not available in high-fidelity turbulence databases and must then be inferred to obtain the labelled data needed for any supervised approach. This field inversion process (Parish and Duraisamy, 2016; Diez Sanhueza, 2018; Singh et al., 2017) can be rather expensive, and the inference problem can be ill-posed in regions of weak turbulence, where the sensitivity of the mean flow to the turbulence quantities diminishes. Despite these limitations, field inversion was successfully applied in the works of Parish and Duraisamy (2016), Singh et al. (2017) and Diez Sanhueza (2018) to infer correction terms for the Spalart-Allmaras model (Singh et al., 2017), the k−ω model (Parish and Duraisamy, 2016) and the MK turbulence model (Diez Sanhueza, 2018).
In non-intrusive ROMs and system identification, regression tools are used to construct models in the low-dimensional space produced by a suitable dimensionality reduction technique. These have been used for constructing predictive models of complex reactive systems (Parente, 2020; Swischuk et al., 2020), in multidisciplinary optimization (Goertz, 2020a; Parrish et al., 2014), and for the sensor-based extraction of sparse representations of nonlinear dynamical systems (Loiseau et al., 2018; Goertz, 2020b).
Finally, while these notes did not cover classification problems, it is worth noticing that these are also increasingly used for the automatic classification of flow regimes (Majors, 2018; Hobold and da Silva, 2018; Kang et al., 2020).

Unsupervised Learning
Fluid mechanics has a long history (and a vast literature) in methods for dimension-
ality reduction. The notion of principal components in PCA is tightly linked to the


notion of coherent structures in turbulent flows. The PCA is known in fluid mechanics as Proper Orthogonal Decomposition (POD), and was introduced by Lumley (1967) as a method to identify (and define) coherent structures in a turbulent flow. These structures are the most energy-containing ones and are thus mainly responsible for phenomena of engineering relevance, including heat and mass transfer, noise emission or fluid-structure interaction.
The POD has been used to construct reduced order models of fluid flows (Holmes et al., 1997; Noack et al., 2003), to find optimally balanced control laws (Rowley, 2005; Ilak and Rowley, 2008), to perform correlation-based filtering (Mendez et al., 2017; Raiola et al., 2015) or to identify correlations between different fields (Borée, 2003; Antoranz et al., 2018; Ianiro, 2020), to name a few examples. An overview of POD and its applications to fluid mechanics is provided by Holmes et al. (2012) and Dawson (2020a).
Over the last couple of decades, many variants of the POD/PCA have been developed in fluid mechanics (Sieber et al., 2016; Mendez et al., 2019), along with alternative linear decomposition tools. The most popular (linear) alternative is certainly the Dynamic Mode Decomposition (DMD, Schmid (2010); Rowley et al. (2009); Schmid (2020)), which decomposes the data as a linear combination of modes, i.e. structures with a single complex frequency. DMD modes are the eigenvectors of the operator that best approximates the data as a linear dynamical system45.
Most of the decomposition methods developed in fluid mechanics are linear, and the literature has grown into a subfield of data processing often referred to as Modal Analysis (Taira et al., 2017), where the notion of 'mode' generalizes that of coherent structure, principal component or harmonic (Fourier) mode. Nonlinear methods of dimensionality reduction have a comparatively much shorter history and have been popularized mostly in the last few years. Notable examples are the use of manifold learning techniques such as Locally Linear Embedding (Ehlert et al., 2020), cluster-based reduced order models (Kaiser et al., 2014) and autoencoders (Agostini, 2020; Murata et al., 2019). As dimensionality reduction has become the building block of many other tools in fluid mechanics, these tools are likely going to enter the toolbox of most researchers.
Reinforcement Learning and Control
The control of turbulent flows poses unique challenges to classic closed-loop methods (Brunton and Noack, 2015). Machine learning based techniques offer a viable alternative based on trial and error, bypassing the need to derive complex models. Control tasks can be set in the form of an optimization problem, and the use of machine learning approaches such as Genetic Programming has been pioneered by the group of Prof. Noack (Noack, 2020; Chovet et al., 2017; Duriez et al., 2017).
The first applications of RL in fluid mechanics focused on the study of the collective behavior of swimmers, pioneered by the group of Prof. Koumoutsakos (Wang et al., 2018; Verma et al., 2018; Novati et al., 2017; Novati and Koumoutsakos, 2019; Novati et al., 2019), while the first applications to flow control were presented by Pivot et al. (2017) and by Rabault et al. (2019) (see also Rabault et al. 2020; Rabault and Kuhnle 2019, 2020). Recently, Bucci et al. (2019) used RL to control the one-dimensional Kuramoto–Sivashinsky equation, while Beintema et al. (2020) used it to control heat transport in a two-dimensional Rayleigh–Bénard system. Verma et al. (2018) used RL to study
45
An overview of linear dynamical systems is provided by Mendez (2020b) and Dawson (2020b)


how fish can reduce energy expenditure by schooling, while Reddy et al. (2016) used RL to analyze how birds and gliders minimize energy expenditure by soaring, exploiting turbulent fluctuations. With a careful choice of the reward function, Viquerat et al. (2019) demonstrated the use of RL for shape optimization, while an application of RL for flow control in an experimental configuration has recently been presented by Fan et al. (2020).

7 Conclusions and Perspectives


The exercises discussed in this document were not aimed at drawing definite conclusions as to which method should be used for a given task. The main goal was to provide a minimal background (and possibly some working knowledge) of what machine learning methods can do. We here draw some conclusions on the potential and the main challenges for their diffusion in fluid mechanics.
Concerning the regression exercise, both the RBF Ridge Regression and the GP performed better than the ANN in terms of variance, sensitivity to outliers, and computation time. These are the most natural tools for 1D problems and small datasets (of the order of a few thousand points). In the RBF Ridge Regression, the most computationally demanding task is the preparation and the diagonalization of the covariance matrix (of size $n_b \times n_b$). In the GP, the matrices involved are of size $n_p \times n_p$; while their construction is relatively cheap thanks to the kernel trick, their inversion and their memory requirements can become prohibitive for large datasets. For high-dimensional and large datasets (tens or hundreds of thousands of points), the ANN is the natural solution.
While many open-source tools allow for constructing ANNs easily (e.g., Keras or PyTorch), the reader should be aware that their implementation and training require considerable experience. The number of hyperparameters to tune is large. Although many guidelines are available on the internet, there is no definitive answer on the choice of the number of layers, the size and the architecture of the network, nor the optimization settings. Moreover, while regularization, drop-out, Gaussian noise layers, and (most importantly) cross-validation are important tools to control overfitting, there is no guarantee that overfitting is entirely avoided, nor that the network will be capable of extrapolating outside the training range.
In applications that are heavily rooted in regression analysis, such as turbulence modeling, it is likely that ANNs will play an essential role in the future. The same is valid for system identification and time series analysis and forecasting (not treated in these notes), which can be performed using Recurrent ANNs. Because all ANNs are very data-demanding, the importance of open databases in fluid mechanics is likely to increase considerably in the future. As the ImageNet data supported the ANN revolution in computer vision, a sufficiently large dataset of fluid flows might one day lead to 'better' turbulence models. The word 'better' here refers solely to performance (lower MSE); yet this is only one of the demands we have on a model. The main limitation of ANNs remains that of leading to models that are difficult to interpret, because it is challenging to understand how the network succeeds in specific tasks. As the idea of working with models that are not fully understood is disturbing, the quest for systematic methods for analyzing how ANN models do their job is likely going to open new paths of research.
Similar considerations hold for the field of dimensionality reduction. The presented


exercise illustrated the three main paradigms: linear, kernel-based nonlinear, and ANN-based nonlinear. The fluid mechanics community has vast experience in the linear methods (and has pioneered several related decompositions) but has limited experience with the nonlinear tools. There is no doubt that this will change shortly, and a revolution in the field is about to set in. While nonlinear methods are likely going to yield more efficient representations, the interpretability of the resulting models will become a significant challenge. The reduced models from linear tools can sometimes be derived from first principles and have an intuitive interpretation, while this is not the case for nonlinear approaches. The results achievable by kernel methods (or any other manifold learning technique) are extremely sensitive to the hyperparameters, which have no physical interpretation.
Tasks that do not demand such interpretations, such as filtering or data compression, are most likely going to be the first to benefit from these nonlinear tools: most POD-based filters, for instance, might soon be replaced by KPOD filters. Tasks such as reduced-order modeling for prediction and control, on the other hand, still require a significant amount of additional research. The main challenge in ROM-based control, for instance, is that of constructing a low-dimensional space that models both the non-actuated and the actuated flow. This often requires (1) prohibitively large training data and (2) carefully constructed models. It is unclear whether nonlinear methods will help with (1), but they will certainly challenge (2).
The cost of kernel-based methods grows with the number of data points, and hence ANNs are again the only viable solution for very large datasets. All the considerations on ANNs for regression hold for dimensionality reduction. While experimenting with the exercise's solution, it appeared clear that the error convergence was sensitive to a wide range of hyperparameters, and the results obtained required a fair amount of experimentation. On a practical level, the link between POD and the linear autoencoder brings a useful method for computing POD-based decompositions using ANNs and their training strategies, which are considerably more memory-efficient than matrix multiplications. ANN-based methods are also better suited for big data and 'on-line' or 'streaming' approaches.
Finally, we conclude with some remarks on the control problem solved via Reinforcement Learning. Compared to supervised and unsupervised techniques, RL appears somewhat less mature, with new agent-training strategies published every year. None of the most popular RL training strategies (e.g. PPO, DDPG or A3C) used for controlling continuous systems is older than four years. While this reflects in part the enormous effort that the community is investing in RL, it also shows that the field is not yet settled, and the 'best' agent-training algorithm is yet to come. For a physicist, the paradigm of controlling a system in a purely 'black-box' approach, hence without resorting to any physics knowledge but solely to trial and error, is disturbing. On the other hand, classic methods exclusively based on physical principles have so far failed in the task of controlling turbulent flows in complex scenarios.
Provided that a black-box approach succeeds in such tasks, we might then invest in learning how such success was achieved. In such a case, which today appears on the edge of science fiction for fluid mechanics, the usual roles might be inverted, and the agent would become our supervisor. This is already happening in board games, where agents have reached superhuman performance and are now teaching us to play better. But the essential obstacle, today, is that the agent usually needs to interact with the environment millions of times (with no guarantees of success). This is no problem with a board game


but is a significant challenge in flow control.


A compromise is needed. An important avenue for future research is combining our knowledge with RL's exploratory capabilities, so as not to ask the RL agent to reinvent the wheel. This will require significant efforts in developing new optimization strategies that incorporate our physical knowledge in the form of constraints, or a 'behavioral cloning' process that lets us train the agent, at least initially, in a supervised setting. The field is open, and this lecture series gave you the background to contribute to its evolution.

A Learning Machine Learning


Gentle introductions to Machine Learning

• Paul Wilmott, Machine Learning: An Applied Mathematics Introduction, Panda Ohana Publishing, 2019.

• Andriy Burkov, The Hundred-Page Machine Learning Book, 2019.

• Ethem Alpaydin, Machine Learning, MIT Press, Essential Knowledge Series.

• Oliver Theobald, Machine Learning For Absolute Beginners: A Plain English In-
troduction, 2018.

Essential Books

• Abu-Mostafa et al., Learning From Data, AMLBook (see References).

• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly, 2019 (see References).

• Bishop, C. M., Pattern Recognition and Machine Learning, Springer (see References).

Excellent online courses

• Abu-Mostafa's course at Caltech. Available at this link.

• Andrew Ng's course at Stanford University. Available at this link.

• Gilbert Strang's course at MIT. Available at this link.

• Kilian Weinberger's course at Cornell University. Available at this link.


References
Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012). Learning from Data.
AMLBook.

Aggarwal, C. C. and Reddy, C. K. (2013). Data Clustering: Algorithms and Applications.


Taylor & Francis Inc.

Agostini, L. (2020). Exploration and prediction of fluid dynamical systems using auto-
encoder technology. Physics of Fluids, 32(6):067103.

Akolekar, H. D., Weatheritt, J., Hutchins, N., Sandberg, R. D., Laskowski, G., and
Michelassi, V. (2018). Development and use of machine-learnt algebraic reynolds stress
models for enhanced prediction of wake mixing in lpts. In Turbo Expo: Power for Land,
Sea, and Air, volume 51012, page V02CT42A009. American Society of Mechanical
Engineers.

Zai, A. and Brown, B. (2020). Deep Reinforcement Learning in Action. Manning Publications.

Alpaydin, E. (2020). Introduction to Machine Learning. MIT University Press.

Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. (2020). Second order optimiza-
tion made practical.

Antoranz, A., Ianiro, A., Flores, O., and Villalba, M. (2018). Extended proper orthogonal
decomposition of non-homogeneous thermal fields in a turbulent pipe flow. Int. J. Heat
Mass Transfer, 118:1264–1275.

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). Deep
reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38.

Bakir, G., Weston, J., and Bernhard, S. (1999). Learning to find pre-images. In Proceed-
ings of the 16th International Conference on Neural Information Processing Systems.

Beintema, G., Corbetta, A., Biferale, L., and Toschi, F. (2020). Controlling
rayleigh–bénard convection via reinforcement learning. Journal of Turbulence, pages
1–21.

Belousov, B. (2017). Gaussian process vs kernel ridge regression. http://boris-belousov.net.

Bezdek, J. C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms.


Springer US.

Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., and Jenssen, R. (2017).
Recurrent Neural Networks for Short-Term Load Forecasting. Springer International
Publishing.

Bird, R. B., Stewart, W. E., and Lightfoot, E. N. (2006). Transport Phenomena. John
Wiley & Sons Inc.


Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer-Verlag New


York Inc.

Borée, J. (2003). Extended proper orthogonal decomposition: a tool to analyse correlated


events in turbulent flows. Exp. Fluids, 35(2):188–192.

Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for opti-
mal margin classifiers. In Proceedings of the fifth annual workshop on Computational
learning theory - COLT 92. ACM Press.

Brenner, M. P., Eldredge, J. D., and Freund, J. B. (2019). Perspective on machine learning
for advancing fluid mechanics. Physical Review Fluids, 4(10).

Brockwell, P. J. and Davis, R. A. (1991). Time Series: Theory and Methods. Springer
New York.

Brunton, S. L. and Noack, B. R. (2015). Closed-loop turbulence control: Progress and


challenges. Applied Mechanics Reviews, 67(5).

Brunton, S. L., Noack, B. R., and Koumoutsakos, P. (2020). Machine learning for fluid
mechanics. Annual Review of Fluid Mechanics, 52(1):477–508.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. (2016). Discovering governing equations
from data by sparse identification of nonlinear dynamical systems. Proceedings of the
National Academy of Sciences, 113(15):3932–3937.

Bucci, M. A., Semeraro, O., Allauzen, A., Wisniewski, G., Cordier, L., and Mathelin, L.
(2019). Control of chaotic systems by deep reinforcement learning. Proceedings of the
Royal Society A, 475(2231):20190351.

Chollet, F. (2017). Deep Learning with Python. Manning.

Chovet, C., Lippert, M., Keirsbulck, L., Noack, B. R., and Foucaut, J.-M. (2017). Machine
learning control for experimental turbulent flow targeting the reduction of a recircula-
tion bubble. In ASME 2017 Fluids Engineering Division Summer Meeting. American
Society of Mechanical Engineers.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathe-


matics of Control, Signals, and Systems, 2(4):303–314.

Daniel, T., Casenave, F., Akkari, N., and Ryckelynck, D. (2020). Model order reduction
assisted by deep neural networks (ROM-net). Advanced Modeling and Simulation in
Engineering Sciences, 7(1).

Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization.


Dawson, S. (2020a). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter The Proper Orthogonal Decomposition. Cambridge University
Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brun-
ton. Also available as online lecture from the VKI Lecture series ’Machine Learning for
Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=TcqBbtWTcIc.

Dawson, S. (2020b). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Linear Dynamical Systems and Control. Cambridge University
Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve
Brunton. Also available as online lecture from the VKI Lecture series ’Machine Learn-
ing for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=Y5jWRnya3ds&
feature=emb_logo.

Diez Sanhueza, R. (2018). Machine learning for RANS turbulence modelling of variable property flows.

Domingos, P. (2015). The Master Algorithm. Basic Books.

Duraisamy, K., Gianluca, I., and Xiao, H. (2019a). Turbulence modeling in the age of
data. Annual review of Fluid Mechanics, 51:357–377.

Duraisamy, K., Iaccarino, G., and Xiao, H. (2019b). Turbulence modeling in the age of
data. Annual Review of Fluid Mechanics, 51(1):357–377.

Duriez, T., Brunton, S. L., and Noack, B. R. (2017). Machine Learning Control – Taming
Nonlinear Dynamics and Turbulence. Springer International Publishing.

Ehlert, A., Nayeri, C. N., Morzynski, M., and Noack, B. R. (2020). Locally linear embed-
ding for transient cylinder wakes.

Fan, D., Yang, L., Triantafyllou, M. S., and Karniadakis, G. E. (2020). Reinforcement
learning for active flow control in experiments. arXiv preprint arXiv:2003.03419.

Farcomeni, A. and Greco, L. (2015). Robust Methods for Data Reduction. Taylor &
Francis Inc.

Fiore, M., Kolozar, L., Mendez, M., van Beeck, J., and Bartosiewicz, Y. (24-28 feb. 2020).
Thermal turbulence modelling of liquid metal flows using artificial neural networks. In
Lecture Series: Machine Learning for Fluid Mechanics: Analysis, Modeling, Control
and Closures.

François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., and Pineau, J. (2018). An introduction to deep reinforcement learning. Foundations and Trends® in Machine Learning, 11(3-4):219–354.

Frank, M., Drikakis, D., and Charissis, V. (2020). Machine-learning methods for compu-
tational science and engineering. Computation, 8(1):15.

Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applica-
tion. ASA-SIAM Series on Statistics and Applied Probability).


Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow.
O’Reilly UK Ltd.
Goertz, S. (2020a). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Reduced-Order Modeling for Aerodynamic Applications and
MDO. Cambridge University Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro
and Bernd Noack and Steve Brunton. Also available as online lecture from the VKI
Lecture series ’Machine Learning for Fluid Mechanics, 2020’: https://www.youtube.
com/watch?v=JUqNMjVCR_k&feature=emb_logo.
Goertz, S. (2020b). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Methods for System Identification. Cambridge University Press;
Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brunton.
Also available as online lecture from the VKI Lecture series ’Machine Learning for Fluid
Mechanics, 2020’: https://www.youtube.com/watch?v=TL86S3mmlqg&feature=emb_
logo.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. the MIT Press.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014). Generative adversarial networks.
Hanin, B. (2019). Universal function approximation by deep neural nets with bounded
width and ReLU activations. Mathematics, 7(10):992.
Haykin, S. (1998). Neural Networks: A Comprehensive Foundation. Prentice Hall; 2nd
Edition.
Hernandez-Leal, P., Kartal, B., and Taylor, M. E. (2019). A survey and critique of
multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems,
33(6):750–797.
Hesthaven, J. and Ubbiali, S. (2018). Non-intrusive reduced order modeling of nonlinear
problems using neural networks. Journal of Computational Physics, 363:55–78.
Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhari-
wal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman,
J., Sidor, S., and Wu, Y. (2018). Stable baselines. https://github.com/hill-a/
stable-baselines.
Hobold, G. M. and da Silva, A. K. (2018). Machine learning classification of boiling
regimes with low speed, direct and indirect visualization. International Journal of Heat
and Mass Transfer, 125:1296–1309.
Holland, J. R., Baeder, J. D., and Duraisamy, K. (2019). Towards integrated field inversion
and machine learning with embedded neural networks for rans modeling. In AIAA
Scitech 2019 Forum, page 1884.
Holmes, P., Lumley, J. L., Berkooz, G., and Rowley, C. (2012). Turbulence, Coherent
Structures, Dynamical Systems and Symmetry. Cambridge University Press, 2nd edi-
tion.


Holmes, P. J., Lumley, J. L., Berkooz, G., Mattingly, J. C., and Wittenberg, R. W. (1997).
Low-dimensional models of coherent structures in turbulence. Phys. Rep., 287(4):337–
384.

Huang, S.-C. and Kim, J. (2008). Control and system identification of a separated flow.
Physics of Fluids, 20(10):101509.

Ianiro, A. (2020). Data Driven Fluid Mechanics: Combining First Principles and Machine
Learning, chapter Applications and Good Practice. Cambridge University Press; Ed.
Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brunton. Also
available as online lecture from the VKI Lecture series ’Machine Learning for Fluid
Mechanics, 2020’: https://www.youtube.com/watch?v=H6twKFTCd2k&feature=emb_
logo.

Ilak, M. and Rowley, C. W. (2008). Modeling of transitional channel flow using balanced
proper orthogonal decomposition. Physics of Fluids, 20(3):034103.

Jiang, C., Mi, J., Laima, S., and Li, H. (2020). A novel algebraic stress model with
machine-learning-assisted parameterization. Energies, 13(1):258.

Jiménez, J. (2020a). Computers and turbulence. European Journal of Mechanics - B/Flu-


ids, 79:1–11.

Jiménez, J. (2020b). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter The Computer as Turbulence Researcher. Cambridge Univer-
sity Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve
Brunton. Also available as online lecture from the VKI Lecture series ’Machine Learning
for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=i6lbZkK8rVI.

Jolliffe, I. T. (2002). Principal Component Analysis. Springer-Verlag.

Kaiser, E., Kutz, J. N., and Brunton, S. L. (2018). Sparse identification of nonlinear
dynamics for model predictive control in the low-data limit. Proceedings of the Royal
Society A: Mathematical, Physical and Engineering Sciences, 474(2219):20180335.

Kaiser, E., Noack, B. R., Cordier, L., Spohn, A., Segond, M., Abel, M., Daviller, G., Östh,
J., Krajnović, S., and Niven, R. K. (2014). Cluster-based reduced-order modelling of a
mixing layer. Journal of Fluid Mechanics, 754:365–414.

Kang, M., Hwang, L. K., and Kwon, B. (2020). Machine learning flow regime classification
in three-dimensional printed tubes. Physical Review Fluids, 5(8).

Kawamura, H., Abe, H., and Shingai, K. (2000). Dns of turbulence and heat transport in
a channel flow with different reynolds and prandtl numbers and boundary conditions.
Turbulence, Heat and Mass Transfer, 3:15–32.

Kim, S. H. and Boukouvala, F. (2019). Machine learning-based surrogate modeling for


data-driven optimization: a comparison of subset selection for regression techniques.
Optimization Letters, 14(4):989–1010.


Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization.


Kirk, D. E. (2004). Optimal Control Theory: An Introduction. Dover Books on Electrical
Engineering.
Kober, J. and Peters, J. (2014). Reinforcement learning in robotics: A survey. In Springer
Tracts in Advanced Robotics, pages 9–67. Springer International Publishing.
Krasser, M. (2018). Gaussian processes. https://krasserm.github.io.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the ACM, 60(6):84–90.
Kutz, J. N. (2017). Deep learning in fluid dynamics. Journal of Fluid Mechanics, 814:1–4.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
Lee, J. A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Springer New
York.
Ling, J., Kurzawski, A., and Templeton, J. (2016). Reynolds averaged turbulence mod-
elling using deep neural networks with embedded invariance. Journal of Fluid Mechan-
ics, 807:155–166.
Loiseau, J.-C., Noack, B. R., and Brunton, S. L. (2018). Sparse reduced-order modelling:
sensor-based dynamics to full-state estimation. Journal of Fluid Mechanics, 844:459–
490.
Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural
networks: A view from the width.
Luketina, J., Nardelli, N., Farquhar, G., Foerster, J., Andreas, J., Grefenstette, E., White-
son, S., and Rocktäschel, T. (2019). A survey of reinforcement learning informed by
natural language.
Lumley, J. (1967). The structure of inhomogeneous turbulent flows. In Yaglom, A. M. and Tatarski, V. I., editors, Atmospheric Turbulence and Radio Propagation, pages 166–178.
Majors, J. H. (2018). Machine learning identifies flow patterns faster and better with a
combined analysis approach. Scilight, 2018(3):030007.
Mendez, M., Raiola, M., Masullo, A., Discetti, S., Ianiro, A., Theunissen, R., and Buchlin,
J.-M. (2017). POD-based background removal for particle image velocimetry. Experi-
mental Thermal and Fluid Science, 80:181–192.
Mendez, M. A. (2020a). Data Driven Fluid Mechanics: Combining First Principles and
Machine Learning, chapter Generalized and Multiscale Data-Driven Modal Analysis.
Cambridge University Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd
Noack and Steve Brunton. Also available as online lecture from the VKI Lecture series
’Machine Learning for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=
i6lbZkK8rVI.


Mendez, M. A. (2020b). Data Driven Fluid Mechanics: Combining First Principles and
Machine Learning, chapter Mathematical Tools, Part I: Continuous and Discrete LTI
Systems. Cambridge University Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro
and Bernd Noack and Steve Brunton. Also available as online lecture from the VKI
Lecture series ’Machine Learning for Fluid Mechanics, 2020’: https://www.youtube.
com/watch?v=qvZmKr6fhW4&feature=emb_logo.

Mendez, M. A., Balabane, M., and Buchlin, J.-M. (2019). Multi-scale proper orthogonal
decomposition of complex fluid flows. Journal of Fluid Mechanics, 870:988–1036.

Mendez, M. A., Hess, D., Watz, B. B., and Buchlin, J.-M. (2020). Multiscale proper
orthogonal decomposition (mPOD) of TR-PIV data—a case study on stationary and
transient cylinder wake flows. Measurement Science and Technology, 31(9):094014.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill Education.

Müller, A. C. and Sarah, G. (2016). Introduction to Machine Learning with Python.


O’Reilly UK Ltd.

Müller, S. D., Milano, M., and Koumoutsakos, P. (1999). Application of machine learning
algorithms to flow modeling and optimization.

Murata, T., Fukami, K., and Fukagata, K. (2019). Nonlinear mode decomposition with
convolutional neural networks for fluid dynamics. Journal of Fluid Mechanics, 882.

Murphy, K. P. (2012). Machine Learning. MIT Press Ltd.

Nasraoui, O. and Ben N'Cir, C.-E., editors (2019). Clustering Methods for Big Data Analytics. Springer International Publishing.

Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Doklady AN USSR, 269:543–547.

Nguyen, T. T., Nguyen, C. M., Nguyen, D. T., Nguyen, D. T., and Nahavandi, S. (2019).
Deep learning for deepfakes creation and detection.

Nielsen, A. (2019). Practical Time Series Analysis. O’Reilly UK Ltd.

Noack, B. R. (2020). Machine Learning for Turbulence Control, chapter Machine Learning
for Turbulence Control. Cambridge University Press; Ed. Miguel Alfonso Mendez and
Andrea Ianiro and Bernd Noack and Steve Brunton. Also available as online lecture
from the VKI Lecture series ’Machine Learning for Fluid Mechanics, 2020’: https:
//www.youtube.com/watch?v=7rT1Hjs5poc&feature=emb_logo.

Noack, B. R., Afanesiev, K., Morzyński, M., Tadmor, G., and Thiele, F. (2003). A
hierarchy of low-dimensional models for the transient and post-transient cylinder wake.
Journal of Fluid Mechanics, 497:335–363.


Noack, B. R., Ehlert, A., Nayeri, C. N., and Morzyński, M. (2020). Data Driven Fluid
Mechanics: Combining First Principles and Machine Learning, chapter Analysis, Mod-
eling and Control of the Cylinder Wake. Cambridge University Press; Ed. Miguel Al-
fonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brunton. Also available
as online lecture from the VKI Lecture series ’Machine Learning for Fluid Mechanics,
2020’: https://www.youtube.com/watch?v=iehMMhDqmys&feature=emb_logo.

Novati, G. and Koumoutsakos, P. (2019). Remember and forget for experience replay. In
Proceedings of the 36th International Conference on Machine Learning.

Novati, G., Mahadevan, L., and Koumoutsakos, P. (2019). Controlled gliding and perching
through deep-reinforcement-learning. Phys. Rev. Fluids, 4(9).

Novati, G., Verma, S., Alexeev, D., Rossinelli, D., van Rees, W. M., and Koumoutsakos,
P. (2017). Synchronisation through learning for two self-propelled swimmers. Bioinspir.
Biomim., 12(3):036001.

Ogata, K. (2009). Modern Control Engineering. Pearson, 5-th Edition.

Olazaran, M. (1993). A sociological history of the neural network controversy. In Advances


in Computers, pages 335–425. Elsevier.

Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press (Adaptive Computation and Machine Learning series).

Pan, S. and Duraisamy, K. (2018). Long-time predictive modeling of nonlinear dynamical


systems using neural networks. Complexity, 2018:1–26.

Paolella, M. S. (2018). Linear Models and Time-Series Analysis. John Wiley & Sons, Inc.

Parente, A. (2020). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Advancing Reacting Flow Simulations with Data-Driven Mod-
els: Chemistry Accelerations and Reduced-Order Modelling. Cambridge University
Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve
Brunton. Also available as online lecture from the VKI Lecture series ’Machine Learn-
ing for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=Ys5_0YY730M&
feature=emb_logo.

Parish, E. J. and Duraisamy, K. (2016). A paradigm for data-driven predictive modeling


using field inversion and machine learning. Journal of Computational Physics, 305:758–
774.

Parrish, J., Rais-Rohani, M., and Janus, J. M. (2014). Reduced order techniques for
sensitivity analysis and design optimization of aerospace systems. In 10th AIAA Mul-
tidisciplinary Design Optimization Conference. American Institute of Aeronautics and
Astronautics.

Pawar, S., Rahman, S. M., Vaddireddy, H., San, O., Rasheed, A., and Vedula, P. (2019).
A deep learning enabler for nonintrusive reduced order modeling of fluid flows. Physics
of Fluids, 31(8):085101.


Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research, 12:2825–2830.

Abbeel, P., Duan, Y., Chen, X., and Karpathy, A. (2017). Deep RL Bootcamp. https://sites.google.com/view/deep-rl-bootcamp/lectures. Accessed: 2020-09-1.

Pino, F., Mendez, M. A., and Benoit, S. (24-28 feb. 2020). Feedback control of liquid
metal coating. In Lecture Series: Machine Learning for Fluid Mechanics: Analysis,
Modeling, Control and Closures.

Pivot, C., Cordier, L., and Mathelin, L. (2017). A continuous reinforcement learning
strategy for closed-loop control in fluid dynamics. In 35th AIAA Applied Aerodynamics
Conference. American Institute of Aeronautics and Astronautics.

Polyak, B. (1964). Some methods of speeding up the convergence of iteration methods.


USSR Computational Mathematics and Mathematical Physics, 4(5):1–17.

Popper, K. R. (2002). The Logic of Scientific Discovery. Taylor & Francis Ltd.

Rabault, J., Kuchta, M., Jensen, A., Réglade, U., and Cerardi, N. (2019). Artificial
neural networks trained through deep reinforcement learning discover control strategies
for active flow control. Journal of Fluid Mechanics, 865:281–302.

Rabault, J. and Kuhnle, A. (2019). Accelerating deep reinforcement learning strategies of


flow control through a multi-environment approach. Physics of Fluids, 31(9):094105.

Rabault, J. and Kuhnle, A. (2020). Data Driven Fluid Mechanics: Combining First
Principles and Machine Learning, chapter Deep Reinforcement Learning Applied to
Active Flow Control. Cambridge University Press; Ed. Miguel Alfonso Mendez and
Andrea Ianiro and Bernd Noack and Steve Brunton.

Rabault, J., Ren, F., Zhang, W., Tang, H., and Xu, H. (2020). Deep reinforcement
learning in fluid mechanics: A promising method for both active flow control and shape
optimization. Journal of Hydrodynamics, 32(2):234–246.

Raiola, M., Discetti, S., and Ianiro, A. (2015). On PIV random error minimization with
optimal POD-based low-order reconstruction. Experiments in Fluids, 56(4).

Raissi, M., Yazdani, A., and Karniadakis, G. E. (2020). Hidden fluid mechanics: Learning
velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030.

Raschka, S. and Mirjalili, V. (2019). Python Machine Learning, Third Edition. Packt
Publishing.

Rasmussen, C. E. and Williams, C. K. I. (2005). Gaussian Processes for Machine Learning.


MIT Press Ltd.


Reddy, G., Celani, A., Sejnowski, T. J., and Vergassola, M. (2016). Learning to soar in tur-
bulent environments. Proceedings of the National Academy of Sciences, 113(33):E4877–
E4884.

Renganathan, S. A., Maulik, R., and Rao, V. (2020). Machine learning for nonintrusive
model order reduction of the parametric inviscid transonic flow past an airfoil. Physics
of Fluids, 32(4):047110.

Reynolds, A. (1975). The prediction of turbulent prandtl and schmidt numbers. Interna-
tional Journal of heat and mass transfer, 18(9):1055–1069.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, second edition.

Rosenblatt, F. (1957). The Perceptron—a perceiving and recognizing automaton. Tech-


nical Report Report 85-460-1, Cornell Aeronautical Laboratory.

Rowley, C., Mezić, I., Bagheri, S., Schlatter, P., and Henningson, D. (2009). Spectral
analysis of nonlinear flows. J. Fluid Mech., 641:115.

Rowley, C. W. (2005). Model reduction for fluids, using balanced proper orthogonal decomposition. International Journal of Bifurcation and Chaos, 15(03):997–1013.

Rumelhart, D. E. and McClelland, J. L. (1989). Parallel Distributed Processing: Explo-


rations in the Microstructure of Cognition: Foundations. MIT PRess.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 3(3):210–229.

Schölkopf, B., Smola, A., and Müller, K.-R. (1997). Kernel principal component analysis.
In Lecture Notes in Computer Science, pages 583–588. Springer Berlin Heidelberg.

Schmid, P. (2010). Dynamic mode decomposition of numerical and experimental data. J.


Fluid Mech., 656:5–28.

Schmid, P. (2020). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter The Dynamic Mode Decomposition: From Koopman The-
ory to Applications. Cambridge University Press; Ed. Miguel Alfonso Mendez and
Andrea Ianiro and Bernd Noack and Steve Brunton. Also available as online lec-
ture from the VKI Lecture series ’Machine Learning for Fluid Mechanics, 2020’:
https://www.youtube.com/watch?v=xAYimi7x4Lc&feature=emb_logo.

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. (2017). DBSCAN revisited,
revisited. ACM Transactions on Database Systems, 42(3):1–21.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal
policy optimization algorithms.

Sieber, M., Paschereit, C. O., and Oberleithner, K. (2016). Spectral proper orthogonal
decomposition. J Fluid Mech, 792:798–828.


Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrit-
twieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe,
D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu,
K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural
networks and tree search. Nature, 529(7587):484–489.

Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M.,
Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D.
(2018). A general reinforcement learning algorithm that masters chess, shogi, and go
through self-play. Science, 362(6419):1140–1144.

Simonson, J. (1988). Forced convection: Reynolds analogy and dimensional analysis. In


Engineering Heat Transfer, pages 101–123. Springer.

Singh, A. P., Medida, S., and Duraisamy, K. (2017). Machine-learning-augmented predic-


tive modeling of turbulent separated flows over airfoils. AIAA journal, 55(7):2215–2227.

Smola, A. J. and Schölkopf, B. (2004). A tutorial on support vector regression. Statistics


and Computing, 14(3):199–222.

Snyder, W. E. and Qi, H. (2004). Machine Vision. Cambridge University Press.

Sotgiu, C., Weigand, B., Semmler, K., and Wellinger, P. (2019). Towards a general data-
driven explicit algebraic reynolds stress prediction framework. International Journal of
Heat and Fluid Flow, 79:108454.

Stengel, R. F. (1994). Optimal Control and Estimation. Dover Books.

Swets, D. and Weng, J. (1996). Using discriminant eigenfeatures for image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(8):831–836.

Swischuk, R., Kramer, B., Huang, C., and Willcox, K. (2020). Learning physics-based
reduced-order models for a single-injector combustion process. AIAA Journal, 58(6).

Szita, I. (2012). Reinforcement learning in games. In Adaptation, Learning, and Opti-


mization, pages 539–577. Springer Berlin Heidelberg.

Taboga, M. (2017). Lectures on Probability Theory and Mathematical Statistics. CreateS-


pace Independent Publishing Platform.

Taira, K., Brunton, S. L., Dawson, S. T. M., Rowley, C. W., Colonius, T., McKeon, B. J.,
Schmidt, O. T., Gordeyev, S., Theofilis, V., and Ukeiley, L. S. (2017). Modal analysis
of fluid flows: An overview. AIAA J., 55(12):4013–4041.

Turk, M. and Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive


Neuroscience, 3(1):71–86.

van Engelen, J. E. and Hoos, H. H. (2019). A survey on semi-supervised learning. Machine


Learning, 109(2):373–440.


Verma, S., Novati, G., and Koumoutsakos, P. (2018). Efficient collective swimming by
harnessing vortices through deep reinforcement learning. Proceedings of the National
Academy of Sciences, 115(23):5849–5854.

Viquerat, J., Rabault, J., Kuhnle, A., Ghraieb, H., Larcher, A., and Hachem, E.
(2019). Direct shape optimization through deep reinforcement learning. arXiv preprint
arXiv:1908.09885.

Cherkassky, V. and Mulier, F. M. (2008). Learning from Data: Concepts, Theory, and Methods. John Wiley & Sons.

von Luxburg, U. (2007). A tutorial on spectral clustering.

Wang, L., Fortunati, S., Greco, M. S., and Gini, F. (2018). Reinforcement learning-
based waveform optimization for MIMO multi-target detection. In 2018 52nd Asilomar
Conference on Signals, Systems, and Computers. IEEE.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge.

Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences. PhD thesis, Dept of Applied Mathematics, Harvard University.

Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M. W. (2020). Adahessian:
An adaptive second order optimizer for machine learning.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep
learning requires rethinking generalization.

Zhou, Y. and Goldman, S. (2004). Democratic co-learning. In 16th IEEE International


Conference on Tools with Artificial Intelligence. IEEE Comput. Soc.
