Machine Learning for Fluid Mechanics: Challenges, Opportunities and Perspectives
Miguel A. Mendez, Fabio Pino and Matilde Fiore
Version 1
9 September 2020
This lecture gives an introduction to machine learning and an overview of its current and potential applications to fluid dynamics. The lecture opens with the types of learning (supervised, unsupervised, semi-supervised and reinforcement learning) and the classification of machine learning algorithms based on their scope and their reliance on data. The presentation will indirectly unveil the motivation for such a lecture in a course on optimization, as learning means optimizing a functional. Examples of fluid mechanics problems that can be framed as machine learning problems are discussed.

Three demonstrative exercises are proposed to give the attendee hands-on experience on the subject. These are 1) the regression problem of deriving a turbulence model using Artificial Neural Networks (ANNs), 2) the unsupervised problem of extracting a low-dimensional representation of a turbulent velocity field, and 3) the reinforcement learning problem of using feedback control for wave stabilization. These examples should give enough practical experience to develop the optimism and skepticism required to assess the capabilities of machine learning to advance fluid mechanics.
Preamble. This is an 'alpha' version of the notes. Feedback and suggestions for improvement are very welcome at mendez@vki.ac.be. To cite this version:
@InProceedings{Mendez2020,
  author    = {Miguel A. Mendez and Fabio Pino and Matilde Fiore},
  title     = {Machine Learning for Fluid Mechanics: Challenges, Opportunities and Perspectives},
  booktitle = {Optimization Methods for Computational Fluid Dynamics, VKI Lecture Series},
  year      = {2020},
  publisher = {von Karman Institute},
  doi       = {XXXX},
}
Contents

1 Introduction and Motivation
4 Dimensionality Reduction
  4.1 A first simple example
  4.2 Dimensionality reduction of a flow field
1 Introduction and Motivation

2 What is Machine Learning?
which depends⁴ on a set of weights w ∈ W. This is our data-driven (surrogate) model for the function f and is the final outcome of the learning process.

Learning consists in tuning the parameters (weights) w such that our model performs satisfactorily (according to the chosen measure), that is⁵, f̃ ≈ f. The learning process is also called training. The bold notation recalls that both inputs x and outputs y are in general high-dimensional: these are vectors in R^d, with d the dimension, or sequences of numbers encoding discrete quantities such as categorical data (classes).

The dimension of W, and more generally of the hypothesis set, can be loosely linked to the notion of model complexity or model capacity (see Abu-Mostafa et al. (2012) and Vladimir Cherkassky (2008) for more formal definitions). The more complex the model, the larger the dataset required to train it.
The performance of the algorithm is measured in terms of the error on the available data, called in-sample error E_in or training error, while the (unknown) error on new data is called out-of-sample error E_out (or generalization error). A model that lacks the capacity (i.e., is not sufficiently complex) for a given task will produce a large E_in regardless of the parameters w. This is called underfitting and can be handled by increasing the model complexity. On the contrary, a model that is too complex for a given dataset produces overfitting. A model overfits if it yields a small E_in but a large E_out, and the difficulty in avoiding this problem is due to the fact that we can only estimate E_out.
The need for minimizing E_in anchors machine learning to optimization, while the hope of minimizing E_out bridges it to statistical inference.
⁴ Not all learning algorithms are parametric (i.e., have parameters). Non-parametric methods link inputs to outputs without building a model, using the training data itself instead. A classic example is nearest-neighbor regression, in which a prediction is based on the similarity of a new input to the inputs available in the training set. These methods are also called instance-based or memory-based, since they essentially store the training set in lookup tables and make predictions based on interpolation. We do not cover these methods here; the reader is referred to Alpaydin (2020) for more information.
⁵ The distinctive feature of machine learning, compared to function approximation, is that the function f (referred to as the target function) is unknown.
The overfitting problem is central in machine learning and stems from its inherently ill-posed mathematical framework: estimating a mapping f : X → Y from a finite set of points has no unique solution,
and there is no guarantee that a model performing well on a set of data generalizes to the
set of all possible inputs. Dealing with such a generalization is the essence of inductive reasoning, which lies at the foundation of all empirical sciences (Popper, 2002). As empirical scientists, our tools to cope with overfitting (hence our ability to generalize) stem from the fundamental principles and laws of physics (e.g., conservation laws), which we accept based on experience. We trust these laws not because we have a mathematical proof of their validity, but because we have never seen problems falsifying them⁶.
Physical laws can be incorporated in the hypothesis set, hence in the definition of the model and its complexity; this flexibility is driving the adoption of machine learning in all empirical sciences (Frank et al., 2020).
When no prior knowledge about the model can be used in the hypothesis set, machine learning offers two common (and general) tools: cross-validation and regularization. We will practice with both in the tutorial of Section 3. Briefly, cross-validation consists in splitting the data into a training set x∗, y∗ and a validation set x∗∗, y∗∗. The first is used in the training process (to minimize E_in), whereas the second is used to estimate E_out. Regularization consists of adding constraints on the weights w in the optimization, with the aim of guiding the learner towards a parsimonious choice of complexity.
We close this introduction recalling that, whether a model is data-driven or physics-driven, it is good practice to prioritize the simplest formulation. This principle goes by the name of Occam's Razor⁷ and has accompanied the history of science since the 14th century.

Equipped with the general framework of Figure 1, we proceed with the formulation of the four main kinds of learning, along with the main machine learning tasks and methods. These are briefly reviewed in what follows and summarized in Figure 2.

Figure 2: Overview of various machine learning techniques and their classification. The list is by no means exhaustive, but solely offers a mind map of the machine learning landscape.
⁶ At least within the scales that concern us: as fluid dynamicists, we work with Newton's principles and the laws of thermodynamics, although we know that both fail in other contexts of modern physics.
⁷ Attributed to William of Occam (1287-1347), the razor is meant to trim down an explanation to the simplest one that is consistent with the data. This is also known as the law of parsimony.
2.1 Supervised Learning
Figure 3: Schematic illustration of the distinction between regression (left) and classification problems (right). In regression, the function f is used to predict the output, i.e. f(x) = y; in classification, the function f defines the decision boundary separating the domains of each class. In both figures, the function is represented with continuous lines, while the dashed lines demarcate the margins: the region within which the function is expected to lie.
⁹ …science applications, as it combines the power of a general-purpose programming language with the ease of use of scripting languages such as Matlab, R or Julia. It is free, and it is equipped with an extremely vast toolbox of libraries for scientific computing and machine learning. Some of these (like TensorFlow) have been recently open-sourced by tech giants such as Google.
¹⁰ Available at http://yann.lecun.com/exdb/mnist/.
2.2 Unsupervised Learning
Dimensionality Reduction
Dimensionality reduction aims at identifying a lower-dimensional representation of the
data. The underlying assumption is that the degrees of variability in the data can be
reduced if the focus is placed on essential features (called hidden or latent factors). A
successful face recognition algorithm, for example, focuses on patterns in the images that
are well associated with features such as age, gender or pose, and constructs a reduced set
of images that encodes all the information required for the recognition (Swets and Weng,
1996; Turk and Pentland, 1991).
Dimensionality reduction is an “information bottleneck” (Vladimir Cherkassky, 2008)
composed of an encoder mapping z = E(x) ∈ Rr and a decoder mapping x̃ = D(z) ∈ Rd
as shown in the schematic of Figure 4. The function to be learned is g(x, w) := D(z) =
D(E(x)).
The compressed representation z ∈ Rr is the result of the distillation process and has
a (much) lower dimension than the input x ∈ R^d, i.e. r ≪ d. Yet, if the features encoded
in z are truly essential, the decoder should be able to recover a good approximation
x̃ ≈ x ∈ Rd . The composition of an encoder and a decoder is commonly known as
autoencoder, although this term is mostly used when the process in Figure 4 is carried out using ANNs. A popular (and dangerous) application of autoencoders that has hit the headlines recently is that of Deep Fakes¹¹, i.e. the generation of deceptively realistic fake videos. These can be created by swapping the decoders and encoders trained on different datasets (Nguyen et al., 2019), effectively blending them in a synthetic dataset.
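To make the encoder-decoder idea concrete, the following Python sketch builds a linear 'autoencoder' via principal component analysis: the r leading right singular vectors of the centered data play the role of E and D. The data matrix and the rank r are illustrative assumptions, not part of the original text.

import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: 1000 samples of dimension d = 50 with correlated features
X = rng.standard_normal((1000, 50)) @ rng.standard_normal((50, 50))
x_mean = X.mean(axis=0)

# The r leading right singular vectors of the centered data define E and D
r = 5
_, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)

def encode(x):          # z = E(x) in R^r
    return (x - x_mean) @ Vt[:r].T

def decode(z):          # x_tilde = D(z) in R^d
    return z @ Vt[:r] + x_mean

X_tilde = decode(encode(X))
print("relative reconstruction error:",
      np.linalg.norm(X - X_tilde) / np.linalg.norm(X))

Replacing the two linear maps with ANNs (and a nonlinear activation) yields the autoencoder discussed above.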
Dimensionality reduction is an essential tool in the data scientist's toolbox for at least three reasons. The first, obvious reason is economy in interpretation and analysis: if the relevant information is hidden in r ≪ d dimensions, then there is no interest in considering the remaining d − r. The second reason is that the simpler representations z might be used to derive simpler models or as inputs to supervised techniques, reducing the computational complexity involved and hence the risk of overfitting. The third reason is that, by focusing on the essential features, an algorithm can learn to recognize (and thus remove) irrelevant sources of variance such as noise or outliers. Any denoising operation is, in essence, an exercise in dimensionality reduction.
¹¹ See https://en.wikipedia.org/wiki/Deepfake.
Clustering

Clustering aims at partitioning the dataset into groups (clusters). Each of the members of a cluster is assumed to have some "similarity", hence to belong to a certain "class". Clustering differs from classification in that it is an unsupervised technique: no labeled data is available and no "right answer" is known upfront, not even the number of clusters. The training set only contains unlabeled data and the only possible mapping remains f : X → X. A simple example of a clustering problem is that of finding customers with similar purchase behavior as a basis for recommendation engines.
Clustering methods can be classified based on the type of input or the type of output. In terms of inputs, clustering can be feature-based, if the input is the dataset itself, or similarity-based, if the input is some measure of similarity (distances between samples). The first is often less sensitive to noise, while the second allows for introducing prior (domain-specific) knowledge of similarity (Vladimir Cherkassky, 2008). In terms of outputs, clustering can be partitional (or prototype-based) or hierarchical. In the first, all clusters are at 'the same level' and characterized by a representative prototype (typically the centroid in the case of continuous features or the medoid in the case of categorical features). In the second, the data is grouped over a variety of scales by creating trees and dendrograms. Finally, clustering can be model-based or model-free. In the first case, a probabilistic model is assumed (as in Gaussian Mixture Models (GMM), see Bishop (2006)), while in the second case no assumption is made on the probability density distributions of the clusters.
The first step of any clustering method is the definition of a measure of similarity, which implicitly dictates what constitutes a cluster. Intuitive choices are the Euclidean distance (or its square), while less intuitive measures are the Mahalanobis, Chernoff or Bhattacharyya distances (see Snyder and Qi (2004) for more).
The simplest and most popular prototype-based method for clustering is the k-means algorithm. In its basic implementation, this algorithm requires the number of clusters as input and returns a set of labels assigning each data point to a cluster. This assignment is driven by an optimization that minimizes the intra-cluster variance, often called cluster inertia or distortion. To select the number of clusters, two classic approaches are the elbow plot and the silhouette plot. The reader is referred to Raschka and Mirjalili (2019) for an excellent tutorial on these tools; Section 5 will briefly illustrate their use in the context of cluster-based reduced-order models. Variants of the k-means algorithm arise from different initialization strategies or from the deterministic (hard) versus probabilistic (soft) criteria used in the classification: in the first case, each data point belongs to a single class, while in the second case a data point has a probability of belonging to one cluster or the other. This second approach is also known as fuzzy clustering (Bezdek, 1981).
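As a minimal illustration of k-means and of the elbow-plot data mentioned above, the sketch below uses scikit-learn's KMeans on synthetic blobs; the dataset and the candidate numbers of clusters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three Gaussian blobs in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2))
               for c in ((0, 0), (3, 0), (0, 3))])

# Elbow plot data: cluster inertia (intra-cluster variance) vs number of clusters
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)

# Hard assignment of each data point to one of three clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

Plotting the printed inertia against k would show the characteristic 'elbow' at the true number of clusters.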
2.3 Semi-Supervised Learning
The reader is referred to Olivier Chapelle and Zien (2006); Zhou and Goldman (2004); van Engelen and Hoos (2019) for a complete survey. The broader classification of these methods is between inductive and transductive. Inductive methods seek to find a model from the labeled data that generalizes to the unlabeled data; once the model is constructed, the labeled data is no longer necessary. Transductive methods do not build such a predictive model and need to rely on the labeled data for making predictions; these are also called graph-based, as their predictions rely on graph models connecting labeled and unlabeled data. The previous example of a semi-supervised classifier trained via clustering is an example of inductive semi-supervised learning with an unsupervised pre-processing.
Following van Engelen and Hoos (2019), subcategories of inductive methods include wrapper methods and intrinsically semi-supervised methods. The first category combines supervised learning with data augmentation: the unlabeled data is pseudo-labeled using wrapping techniques and fed to a supervised learning method. In the class of intrinsically semi-supervised methods, the most prominent example is that of generative models, and particularly Generative Adversarial Nets (GANs, Goodfellow et al. (2014)). In generative models, the labeled data is used to infer a model that generates synthetic data to artificially enlarge the labeled dataset. In GANs, the idea is to combine a generative and a discriminative learner having opposite objectives: the first must learn to produce data that is hardly distinguishable from the initial set; the second must learn to distinguish them. The two ANNs train each other in an endless process. A spectacular demonstration of the power of this 'self-learning' process is the generation of photorealistic images¹⁴.
¹⁴ We recommend a tour of http://whichfaceisreal.com to better grasp what we are referring to.
2.4 Reinforcement Learning
Formally, the environment is assumed to satisfy the Markov property, p(s_{t+1}|s_t, a_t, s_{t−1}, a_{t−1}, …) = p(s_{t+1}|s_t, a_t), where the symbol | stands for conditional probability. In words: the evolution to the following state only depends on the current state-action pair and not on the previous pairs, nor on the past trajectory of the system.
The reader should notice the similarity of this framework with that of feedback control
(Ogata, 2009; Stengel, 1994; Kirk, 2004): here, the agent plays the role of the controller,
which is informed by sensors and acts through actuators, whereas the environment is the
plant to be controlled. It is thus not surprising that reinforcement learning is rooted in
optimal control theory and dynamic programming (Richard S. Sutton, 2018).
In the machine learning landscape, the distinctive feature of reinforcement learning is that the environment does not tell the agent what to do, but only grades its actions; this is why reinforcement learning is also called learning with a critic. Such a critique, in the form of a reward or penalty, is received only after the action is performed. The criterion for assigning rewards is one of the main challenges in RL. On the one hand, we could assign rewards only after a specific objective is achieved. Because a goal generally involves a sequence of actions, the agent must figure out what part of the action sequence led to the reward: this is the credit assignment problem. On the other hand, we might decide to assign many intermediate rewards. This implies that we already have a good knowledge of what the sequence of actions should be. In the limit of using frequent rewards to force the agent to behave as we desire, reinforcement learning reduces¹⁵ to supervised learning. Balancing these two extremes constitutes the reward shaping problem.
Another key challenge of the field is identifying a good trade-off between exploration
and exploitation. To obtain rewards, the agent should exploit actions that are known,
from experience, to be successful. Yet, to discover improvements, the agent must explore
new actions.
Much of the recent enthusiasm for this field has been motivated by standout successes in strategy board games (Silver et al., 2016, 2018) and video games (Szita, 2012), robotics (Kober and Peters, 2014) and language processing (Luketina et al., 2019). A historic success was the development of a hybrid RL system that defeated the Korean world champion Lee Sedol in the game of Go (Silver et al., 2016).
¹⁵ An agent trained via supervised learning cannot, by definition, perform better than its supervisor. The remarkable achievements of agents reaching super-human performance at tasks such as, e.g., playing board games, would never have been possible in such a setting.
This game has about 10^170 legal board positions; neither large memory nor hand-crafted rules can be of any help.
All of the aforementioned achievements have been attained with Deep Reinforcement Learning (DRL), which combines RL strategies with the regression capabilities of deep ANNs. The reader is referred to Arulkumaran et al. (2017); Hernandez-Leal et al. (2019) for extensive surveys, to Richard S. Sutton (2018); François-Lavet et al. (2018) for complete introductions to the topic, and to Alexander Zai (2020) for hands-on tutorials using Python. The internet abounds with online tutorials and courses on reinforcement learning. We highly recommend the course by Pieter et al. (2017).
Reinforcement learning can be broadly classified into four categories: (1) off-policy (or
value-based), (2) on-policy (or policy optimization), (3) model-based and (4) imitation
learning (or behavior cloning).
Off-policy methods do not focus on learning the policy but on the maximization of the value of each state (denoted as V^π(s_t)) and each state-action pair (denoted as Q^π(s_t, a_t)). The first represents how 'good' it is for the agent to be in a given state, while the second represents how 'good' it is to perform an action in a certain state. In both definitions, the notion of 'good' is quantified in terms of expected rewards. Once the optimal sequence of states is identified, the policy is determined as the one that allows the agent to follow it. These methods are suited for systems having a finite set of possible actions; the most successful application of this framework is Q-learning (Watkins, 1989) and its many variants. On-policy algorithms (or policy optimization) aim at optimizing a parameterized policy π_θ(a|s), with θ the set of parameters, so as to maximize the expected future reward. This framework naturally handles continuous actions and is rooted in classic optimization, usually using gradient-based methods (for which reason these are also known as policy-gradient methods).
Both of the aforementioned methods are usually used in a model-free framework. In a model-based approach, the agent interrogates a surrogate or predictive model built from the observation history. This model can give faster and computationally cheaper evaluations than a numerical simulation of the environment¹⁶ and thus allows for increasing the learning speed of the agent. Finally, imitation learning refers to an approach in which the agent is supported by a demonstrator (e.g. a human or another RL agent) which the agent can imitate: this approach essentially introduces a supervised framework into the process.
3 Regression and Modeling
Figure 6: Plot of the dataset considered in this section: np = 400 points randomly sampled
from a process that has a deterministic and a stochastic part. Data is missing in the range
x ∈ [4.3, 4.8] and several outliers are added around x = 2 and x = 8.
The ansatz in (2) is the starting point of most data-driven strategies aiming at developing a predictive model¹⁸ for a quantity y_k given an input x_k. The primary scope of a regression problem is to learn an approximation f̃ of the deterministic contribution f using the available data.

The learned approximation can be used to make predictions on new data points, i.e. f̃(x) ≈ f(x), under the assumption that the stochastic contribution in (2) is less important than the deterministic one¹⁹.
¹⁷ All the Python codes for the discussed examples are available on request and are given in the VKI courses Tools for Scientific Computing and Data-Driven Fluid Mechanics and Machine Learning.
¹⁸ The problem is generally of dimension d, while here we only consider d = 1 for simplicity. For d > 1, both input and output data can be conveniently cast as matrices X, Y ∈ R^{d×np}, with every column representing a 'snapshot' of the data. This approach is common in modal analysis (see Mendez (2020a)).
The prediction can be supported by an uncertainty evaluation which accounts for the finite number of samples available for constructing the approximation, as well as for the stochastic contribution. The uncertainty quantification relies on some assumptions that are briefly discussed later in this section. For the moment, we only assume that E{n(x)} = 0, where E is the expectation operator,

E{n(x)} ≈ (1/N) Σ_{k=1}^{N} n(x_k) = Σ_{k=1}^{N} n(x_k) p(n(x_k)).    (3)

We look for an approximation in the form of a linear combination of n_b basis functions φ_j(x),

f̃(x, w) = Σ_{j=1}^{n_b} w_j φ_j(x),    (4)

which, evaluated on the available data points, can be written compactly as the linear system

y = Φ(x) w,    (5)
3.1 Linear Methods
where Φ(x) ∈ Rnp ×nb is the basis matrix, with np the number of data points involved and
w = [w1 · · · wnb ]T is the (unknown) vector of coefficients. This equation drives both the
training and the testing of a model. We therefore begin by randomly splitting the data into
a training set, denoted as x∗ , y∗ ∈ Rn∗ ×1 and a testing set, denoted as x∗∗ , y∗∗ ∈ Rn∗∗ ×1 .
In this tutorial, we split the np = 400 available points into n∗ = 300 and n∗∗ = 100.
The training consists in identifying the coefficients w from the training data. In an ordinary least squares method, this means solving the linear system in (5) with x = x∗ and y = y∗, i.e. y∗ = Φ(x∗)w. Note that since usually n_p ≫ n_b, the system is overdetermined and no exact solution is available²¹. Hence, once the coefficients w are available, the prediction ỹ∗ = Φ(x∗)w will not match the training points. We thus define the Root Mean Square (RMS) in-sample error²² as E_in(w) = ||y∗ − ỹ∗||₂/√n∗ = ||y∗ − Φ(x∗)w||₂/√n∗.
The validation consists in using the coefficients w on the validation data. This means computing ỹ∗∗ = Φ(x∗∗)w, i.e. an estimate of y∗∗. We can define the RMS out-of-sample error as E_out(w) = ||y∗∗ − ỹ∗∗||₂/√n∗∗ = ||y∗∗ − Φ(x∗∗)w||₂/√n∗∗.
The balance between E_in and E_out is linked to the overfitting problem introduced in Section 2: very low values of E_in are often achieved at the cost of large values of E_out. Excellent introductions to the subject in terms of the bias-variance trade-off are provided by Bishop (2006); Abu-Mostafa et al. (2012), along with the required mathematical tools. We here invite the reader to experiment with this exercise to build confidence.
The goal of the training is to identify w such that E_in is minimized. More generally, one might choose among a large set of cost functions that aim to achieve this goal while adding constraints on w. For example, a classical choice is the regularized least squares cost function:

J(w) = (y∗ − Φ(x∗)w)^T (y∗ − Φ(x∗)w) + λ w^T w.    (6)
" #
T
∇w J(w) = ∇w y − Φ(x∗ )w y − Φ(x∗ )w + λw w =
T
" #
= ∇w w Φ (x∗ )Φ(x∗ )w − w Φ (x∗ )y − y Φ(x∗ )w − y y + λw w
T T T T T T T (7)
= 2 Φ (x∗ )Φ(x∗ )w − Φ (x∗ )y + λw
T T
The Hessian of (6) is H(J(w)) = 2(Φ^T(x∗)Φ(x∗) + λI_{nb}), with I_{nb} the identity matrix of dimension n_b × n_b. Since Φ^T(x∗)Φ(x∗) is a positive definite matrix for all the classic choices of basis functions²⁵, the optimization problem is convex and the minimum can be found as the solution of ∇_w J(w∗) = 0. This solution, from (7), reads:

w∗ = ( Φ^T(x∗)Φ(x∗) + λI_{nb} )^{-1} Φ^T(x∗)y = ( K(x∗) + λI_{nb} )^{-1} Φ^T(x∗)y,    (8)

with K(x∗) = Φ^T(x∗)Φ(x∗) the covariance matrix of the chosen basis functions. Note that the set of coefficients is determined using only the training dataset x∗.
We begin with a bad choice of basis functions, namely a polynomial basis of the form φ_j(x) = x^{j−1}. For such a choice, the definition of the polynomial coefficients in (8) is implemented in the popular polyfit functions of Matlab or Python. We consider n_b = 10, i.e. we represent the approximated function as a polynomial of order 9. Note that the predicted output y∗∗ for the testing set, combining (8) and (5), is:

y∗∗ = Φ∗∗ w∗ = Φ∗∗ ( K∗,∗ + λI_{nb} )^{-1} Φ∗^T y∗ = S y∗,    (9)
where the matrix S is often referred to as the smoother matrix or hat matrix, and we have eased the notation by writing Φ(x∗) = Φ∗. We perform a loop of 100-fold cross-validations²⁶: we randomly split the dataset into n∗ = 300 training points and n∗∗ = 100 testing points, and we repeat the regression 100 times. This results in 100 possible fits, with associated coefficients w and estimates of E_in and E_out. We repeat the process for three values of the regularization parameter, namely λ/n∗ = 0, 10, 100. Note that the weight of this parameter scales with the number of points and is a complex function of the chosen basis. We skip these mathematical details here. The results²⁷ are in Figure 7, and a minimal sketch of the procedure is given below.
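The sketch below reproduces the procedure under stated assumptions: the synthetic data merely stands in for the dataset of Figure 6 (available on request), while the split sizes, basis and regularization follow the text.

import numpy as np

# Synthetic stand-in for the dataset of Figure 6
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 400)
y = 3*np.sin(x) + 0.5*x + rng.normal(0, 1.5, 400)

nb, lam = 10, 10*100            # polynomial order 9; lambda = 10 n_**
def basis(x):                   # Phi with phi_j(x) = x^(j-1)
    return np.vander(x, nb, increasing=True)

E_in, E_out = [], []
for _ in range(100):            # 100-fold random-split cross-validation
    idx = rng.permutation(400)
    tr, te = idx[:300], idx[300:]
    P = basis(x[tr])
    # Regularized least squares, eq. (8)
    w = np.linalg.solve(P.T @ P + lam*np.eye(nb), P.T @ y[tr])
    E_in.append(np.linalg.norm(y[tr] - P @ w) / np.sqrt(300))
    E_out.append(np.linalg.norm(y[te] - basis(x[te]) @ w) / np.sqrt(100))
print(np.mean(E_in), np.mean(E_out))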
The histograms on the left show the sampled distributions of E_in and E_out for the three cases. The plots on the right compare the average predictions with the data and include an estimation of the uncertainty from the k-fold cross-validation. From the 100 predictions obtained, the average is shown as a continuous blue line, while the colored area covers the uncertainty region. This is computed as a span of ±(E̅_out + 1.96 σ(f̃(x))) around the mean, with σ(f̃(x)) denoting the standard deviation observed at each location x over all predictions. The average standard deviation in each case is indicated at the bottom right of these plots.
These figures show the impact of the regularization and the reason why the polynomial basis is a bad choice.
²⁵ In other words, Φ(x∗) is usually of full rank.
²⁶ This is sometimes referred to as 'k-fold' cross-validation.
²⁷ Interested readers are strongly encouraged to reproduce these plots!
Figure 7: Results from the 100-fold cross-validations for the polynomial fitting using λ = 0 (top row), λ = 10 n∗∗ (middle row) and λ = 100 n∗∗ (bottom row). The left plots show the histograms of E_in and E_out; the right plots show the average prediction with the estimated uncertainty.
Increasing the regularization decreases the standard deviation in the prediction but increases the average²⁸ E_out. Overall, the loss outweighs the gain in the third case, suggesting that an acceptable regularization is closer to λ = 10n∗∗ than to λ = 100n∗∗. A similar analysis can be performed for multiple values of the regularization parameter to identify the optimal value: it is possible to show that there always exists a value of λ such that the ridge regression has a lower mean squared error than the OLS estimator (see Taboga (2017)).

Regardless of the regularization chosen, this model selection underfits the data. Increasing the model complexity, i.e. increasing the number of basis elements, further deteriorates the predictions²⁹. The natural solution is therefore a change of basis.
A popular choice are Radial Basis Functions (RBFs). These are real-valued functions whose value solely depends on the distance between the input and some fixed point. The most classic example of RBF is the Gaussian basis φ_j(x) = e^{−(x−x_j)²/(2l_c²)}, where the x_j are the center points and l_c acts as a characteristic length scale of the data. This controls how far one should move in a given direction (in this case along x) for the function values to become uncorrelated. Basis functions that are sufficiently far apart (e.g. at a distance of 10l_c) do not overlap, and hence their corresponding weights can be tuned independently. In other words, these basis functions become more and more localized as l_c → 0.
RBFs are normally introduced in the framework of kernel methods (see Bishop (2006);
Murphy (2012)) and thus offer here an opportunity to briefly present this alternative
perspective and its link with the basis representation.
We first consider a set of n_b = 50 Gaussians with l_c = 0.5, with centers x_j uniformly spaced over the domain of interest. We repeat the same analysis presented in Figure 7. The results for the Gaussian RBFs are collected in Figure 8 for three levels of regularization. It is evident that the RBF approach performs better. For λ = 0.1, the minor increase in the bias error is compensated by the reduction of variance, while λ = 1 results in larger bias and variance. In all three cases, it is interesting to notice how the uncertainty grows in the regions lacking data or populated by outliers.
The reader is encouraged to repeat the exercise increasing the number of basis elements n_b. In Gaussian Processes (Rasmussen and Williams, 2005), one is often interested in cases for which n_b ≫ n_p. This could be the case either because limited data is available or because one wishes to stretch the notion of bases, as we shall see shortly. In this setting, a simple alternative to (9) can be derived using the matrix inversion lemma³⁰, which can be written for the basis matrix Φ∗ as

( Φ∗^T Φ∗ + λI_{nb} )^{-1} Φ∗^T = Φ∗^T ( Φ∗ Φ∗^T + λI_{np} )^{-1}.    (10)
The right-hand side becomes more attractive for n_b ≫ n_p, as it involves the inversion of a smaller matrix. The matrix G∗,∗ = Φ(x∗)Φ(x∗)^T is known as the Gram matrix. Simplifying the notation Φ(x∗) = Φ∗ and Φ(x∗∗) = Φ∗∗, and using (10) in (9), gives³¹:

y∗∗ = Φ∗∗ Φ∗^T ( Φ∗ Φ∗^T + λI_{np} )^{-1} y = G∗∗,∗ ( G∗,∗ + λI_{np} )^{-1} y.    (11)
²⁸ In the bias-variance trade-off, we are decreasing the variance while increasing the bias.
²⁹ The reader is invited to try with a polynomial of order 15, for example: without sufficient regularization, the inversion of K becomes problematic and returns NaNs.
³⁰ This is also known as the Woodbury matrix identity.
³¹ The link between ridge regression and Gaussian processes is well described by Belousov (2017).
Figure 8: Results of the 100-fold cross-validations for the Gaussian RBF regression with λ = 0.001 (top row), λ = 0.1 (middle row) and λ = 1 (bottom row); layout as in Figure 7.
We are now ready to take the limit n_b → ∞ and stretch the introduced framework to the case of an infinite basis: while K∗ = Φ∗^T Φ∗ becomes infinite-dimensional, the Gram matrix remains of finite size. The entries of the Gram matrix are solely functions of the points x_i and x_j and can be computed using a continuous function G∗,∗[i, j] = g(x_i, x_j), called the kernel function. This is one of the many facets of the kernel trick (see Murphy (2012); Alpaydin (2020)).
For Gaussian RBFs, the kernel function is again a Gaussian, and takes the form:

g(x_i, x_j) = σ_f² exp( −(x_i − x_j)² / (2 l_c′²) ),    (12)

where σ_f is a scaling parameter and l_c′ is the continuous analogue³² of l_c. Note that the function is symmetric, i.e. g(x_i, x_j) = g(x_j, x_i), and G∗,∗∗ = g(x∗, x∗∗) = G∗∗,∗^T.
Under these assumptions, ridge regression closely resembles Gaussian Process (GP) regression. While an introduction to the theory of GPs and the required Bayesian framework is out of the scope of these notes, the background introduced so far allows the reader to grasp the fundamental ideas and gain working knowledge of the method³³.
Having opened the presentation to an infinite basis, a GP is a Gaussian distribution over functions. Given a set of points x ∈ R^{np×1}, every element of the distribution represents a possible set of targets y = f(x) ∈ R^{np×1}, and the joint distribution of the n_p points is assumed to be Gaussian with zero mean and kernel G, i.e. p(y|x) = N(y|0, G). Because the kernel G must be assumed beforehand, this ansatz is called the prior.

Note that this assumption allows to compute the uncertainties of the regression analytically, without resorting to test data and k-fold cross-validation. Therefore, in the remainder of this section, we define (x∗∗, y∗∗) as the set of predictions y∗∗ = f(x∗∗).
Leveraging the Gaussian assumption, the prediction is carried out via conditioning, using Bayes' theorem and the prior to compute the posterior. This turns out to be a Gaussian of the form p(y∗∗|x∗∗, x∗, y∗) = N(y∗∗|µ∗, Σ∗), with

µ∗ = g(x∗∗, x∗) ( g(x∗, x∗) + σ_y² I_{np} )^{-1} y,    (13a)
Σ∗ = g(x∗∗, x∗∗) − g(x∗∗, x∗) ( g(x∗, x∗) + σ_y² I_{np} )^{-1} g(x∗, x∗∗),    (13b)

where σ_y is an estimate of the standard deviation of the noise, and the uncertainty can be computed as δf(x) = 1.96 √diag(Σ∗).
Comparing (13a) with (11) shows that the predicted mean corresponds to the ridge regression as long as the kernel function is such that g(x₁, x₂) = Φ(x₁)Φ^T(x₂) and λ = σ_y²; this comparison gives a statistical interpretation to the regularization term previously introduced. The reader is invited to use (13a) and (13b) on the provided dataset and compare with the k-fold cross-validated ridge regressions; a minimal sketch is given below. The results are shown in Figure 9. The chosen parameters are indicated on top of the figure.
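A minimal sketch of (13a)-(13b) with the Gaussian kernel of (12) follows; the training data is a synthetic stand-in for the provided dataset, and the hyper-parameters follow Figure 9.

import numpy as np

def kernel(xa, xb, lc=0.7, sf=10.0):
    # Gaussian kernel of eq. (12)
    return sf**2 * np.exp(-(xa[:, None] - xb[None, :])**2 / (2*lc**2))

def gp_predict(x_s, y_s, x_ss, sy=2.0):
    # Posterior mean and covariance, eqs. (13a)-(13b)
    K = kernel(x_s, x_s) + sy**2 * np.eye(x_s.size)
    Ks = kernel(x_ss, x_s)
    mu = Ks @ np.linalg.solve(K, y_s)
    Sigma = kernel(x_ss, x_ss) - Ks @ np.linalg.solve(K, Ks.T)
    # 95% uncertainty band, clipping tiny negative diagonal round-off
    return mu, 1.96*np.sqrt(np.maximum(np.diag(Sigma), 0.0))

# Synthetic stand-in for the training data of Figure 6
rng = np.random.default_rng(0)
x_s = np.sort(rng.uniform(0, 10, 300))
y_s = 3*np.sin(x_s) + rng.normal(0, 2.0, 300)
mu, df = gp_predict(x_s, y_s, np.linspace(0, 10, 200))

Note how the band df widens wherever the training points are sparse, which is the analytic counterpart of the cross-validated uncertainty of Figures 7 and 8.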
³² Because of the assumption n_b → ∞, l_c ≠ l_c′.
³³ An excellent tutorial on GPs, supported by a very didactic implementation in Python, is provided by Krasser (2018). Krasser's blog (http://krasserm.github.io) is a precious source of machine learning tutorials.
Figure 9: Solution to the provided regression exercise using Gaussian Process regression (GPR), with l_c = 0.7, σ_f = 10, σ_y = 2. The left plot shows the regression with the uncertainties; the right plot shows a zoom of the region where data is missing.
We conclude this subsection recalling that the methods presented are only two of the most popular (linear) tools. Different formulations can be obtained from different cost functions, with common machine learning packages offering a dozen alternatives. A valuable alternative to the regularized least squares in (6) is the ε-insensitive loss function, which takes the form:

J(w) = 0 if |y − Φw| < ε;  J(w) = |y − Φw| − ε otherwise.    (14)

In words, this approach tolerates errors within ±ε and gives a linear penalty to points that are beyond this range. The region of tolerance promotes sparseness, while the linear penalty gives robustness to outliers. This cost function is used in Support Vector Regression (SVR), the adaptation of the popular classification algorithm Support Vector Machines (SVM, Boser et al. 1992). We refer to Smola and Schölkopf (2004) for an excellent overview of this powerful tool.
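As a quick illustration of the robustness to outliers granted by (14), the sketch below fits an SVR with scikit-learn; the dataset and the hyper-parameters C and epsilon are illustrative assumptions.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, (400, 1))
y = 3*np.sin(x).ravel() + rng.normal(0, 0.5, 400)
y[::50] += 8.0                            # a few artificial outliers

# RBF-kernel SVR with the eps-insensitive loss of eq. (14): errors below
# epsilon are ignored, larger ones are penalized linearly
svr = SVR(kernel="rbf", C=10.0, epsilon=0.5).fit(x, y)
y_hat = svr.predict(np.linspace(0, 10, 200)[:, None])

Compared with a squared loss, the linear penalty prevents the few shifted points from dragging the fit upwards.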
3.2 Common Optimization Tools

This section briefly reviews the optimization tools most commonly used in machine learning, with the aim of familiarizing the reader with the terminology of the field, as well as clarifying the settings used for solving the provided tutorial exercises.
A gradient descent algorithm is an iterative method that adapts the weights of a model in the opposite direction of the gradient ∇_w J(w), i.e.

w^{i+1} = w^i − η ∇_w J(w^i),    (15)

where the superscript denotes the iteration and the scalar η is known as the learning rate. State-of-the-art libraries are equipped with automatic differentiation to compute the gradient, and variants of this approach differ in the amount of data used in (15).
The simplest algorithm is Batch Gradient Descent (BGD), which computes the gradient on the entire dataset, i.e. ∇_w J(w) = ∇_w J(w, x, y). This approach has a high cost per iteration, is memory demanding, and is not suited for 'online' learning, that is, continuous optimization as new data becomes available. On the other extreme is Stochastic Gradient Descent (SGD), which computes the gradient on each data point independently, i.e. ∇_w J(w) = ∇_w J(w, x_i, y_i). This makes the algorithm faster but gives strong fluctuations in the convergence. These fluctuations often allow for jumping towards better local minima, but become harmful once the minimum is approached. To limit this effect, SGD is usually combined with a learning schedule that reduces η as a function of the iterations.
A practical measure of the optimization performance is often given in terms of epochs.
An epoch is a full pass of (15) over the entire dataset. Hence, in BGD one epoch
corresponds to one iteration while in SGD one epoch corresponds to n∗ iterations.
The most common approach is an intermediate formulation between BGD and SGD, called mini-batch gradient descent (MGD), in which the gradient evaluation is carried out over a batch of data of size N, i.e. ∇_w J(w) = ∇_w J(w, x_{i:i+N}, y_{i:i+N}). This approach is now standard, and it is often implied when referring to gradient descent (GD).
The main challenge of these algorithms lies in the definition of an appropriate learning rate (or learning schedule) and in the risk of being trapped in suboptimal local minima or saddle points (especially in the case of ANNs, see Dauphin et al. (2014)). To (partially) overcome these problems, standard optimization tools use momentum and adaptive learning (or gradient re-scaling), as well as their combinations.
The idea of momentum consists in introducing a sort of memory of the earlier gradients; the simplest implementation (Polyak, 1964) reads

w^{i+1} = w^i + m^i,
m^{i+1} = β m^i − η ∇_w J(w),    (16)

with η, β ∈ R and m ∈ R^{np}.
Adaptive learning methods re-scale each component of the gradient; a popular example (RMSprop) reads

w^{i+1} = w^i − η ∇_w J(w) ⊘ ( √(s^i) + ε ),
s^{i+1} = β s^i + (1 − β) ∇_w J(w) ⊗ ∇_w J(w),    (17)

where ⊘ and ⊗ denote the entry-by-entry division and multiplication, β is a decay rate, s is a scaling parameter and ε is a small term introduced to avoid division by zero. The idea of this re-scaling is to decrease the learning rate faster along the steepest directions, while the parameter β gives importance only to recent updates when computing the scaling s. More sophisticated algorithms, such as the ADAptive Momentum estimation (ADAM), combine both momentum and scaling. This combination reads:
w^{i+1} = w^i − η m̂ ⊘ ( √ŝ + ε ),
m^{i+1} = β₁ m^i + (1 − β₁) ∇_w J(w),
s^{i+1} = β₂ s^i + (1 − β₂) ∇_w J(w) ⊗ ∇_w J(w),    (18)
m̂ = m^{i+1}/(1 − β₁^i),  ŝ = s^{i+1}/(1 − β₂^i),

with η, β₁, β₂ ∈ R and s, m ∈ R^{np}.
The weight update in the first line has a re-scaled momentum. The momentum vector m updates as in the momentum formulation in (16), while the scaling vector s follows the RMSprop idea in (17). Both quantities are then scaled before their use in the weight update: this has the main objective of avoiding a bias towards zero (Kingma and Ba, 2014). A minimal implementation is sketched below.
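The following NumPy sketch implements the update (18) on a toy quadratic cost; the hyper-parameter values follow the common defaults of Kingma and Ba (2014), and the cost function is purely illustrative.

import numpy as np

def adam_step(w, grad, m, s, i, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # One update of eq. (18): momentum m, scaling s, bias-corrected hats
    m = b1*m + (1 - b1)*grad
    s = b2*s + (1 - b2)*grad*grad
    m_hat = m / (1 - b1**i)
    s_hat = s / (1 - b2**i)
    w = w - eta * m_hat / (np.sqrt(s_hat) + eps)
    return w, m, s

# Usage on the toy cost J(w) = ||w||^2, whose gradient is 2w
w, m, s = np.array([1.0, -3.0]), np.zeros(2), np.zeros(2)
for i in range(1, 5001):
    w, m, s = adam_step(w, 2*w, m, s, i)
print(w)    # approaches the minimum at the origin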
With the increasing complexity of the optimizer, the number of tuning parameters (referred to as hyper-parameters) to adjust increases. While many guidelines are available, mastering the use of these optimizers for training complex models such as ANNs requires experience and, often, a tedious trial-and-error procedure.
3.3 Artificial Neural Networks (ANNs)
To understand their operation and harvest their potential, one should see ANNs as nonlinear mathematical models with a distributed architecture, consisting of a large number of simple connected units (called neurons). An example of an ANN (the first that will be tested in this section) is shown in Figure 10. This network consists of seven neurons, organized in four layers: an input and an output layer, together with two intermediate ones (referred to as hidden layers). Because this network is used for the regression analysis of the data in Fig. 6, the input and output layers have only one neuron (since we have one input and one output), while the hidden layers have two and three neurons (labeled from bottom to top). This network is fully connected because each neuron in a layer is connected to all the neurons of the following one, and is feed-forward because the flow of information is unidirectional, from one layer to the next. The reader should note that feedforward networks are often also referred to as Multilayer Perceptrons for historical reasons: the Perceptron is the first ANN architecture, consisting of a single neuron, used for binary classification (Rosenblatt, 1957).
Figure 10: First of the two configurations of ANNs tested in this regression tutorial:
a feedforward fully connected architecture with two hidden layers and a total of seven
neurons.
to understand the potential of this technology. Interested readers are referred to Bishop (1995); Goodfellow et al. (2016); Haykin (1998) for a comprehensive treatment.
We begin our analysis of the network in Fig. 10 from the output layer and its single neuron. This neuron receives the output of the three neurons from the previous layer, weighed by the connection weights w_{i,j}^{(l)}, where l denotes the layer hosting the sending neurons and the subscripts map the connection: for instance, w_{2,1}^{(3)} is the weight of the connection from neuron 2 (in layer 3) to neuron 1 (in layer 4). This neuron responds to these inputs as:

y = y^{(4)} = σ^{(4)}( Σ_{j=1}^{3} w_{j,1}^{(3)} y_j^{(3)} + b_1^{(4)} ) = σ^{(4)}( W_{34}^T y^{(3)} + b_1^{(4)} ).    (19)
In addition to the weights, we have introduced the bias term of this neuron, b_1^{(4)}, the outputs of the neurons in the previous layer, y_j^{(3)}, and the activation function σ^{(4)} of this layer. The activation function is usually³⁷ a nonlinear mapping between input and output. Within the vast zoology of activation functions, we here use the following two:
σ(x) = x if x > 0, σ(x) = α(eˣ − 1) if x ≤ 0;    σ(x) = 1/(1 + e⁻ˣ).    (20)
The one on the left is called the Exponential Linear Unit (ELU); the one on the right is the sigmoid function. Note that one could introduce a different activation function for each neuron. However, since this gives no particular gain, the activation functions are usually chosen per layer. For this last layer, we choose the ELU activation.
The nonlinearity introduced by the activation functions is the first key difference with respect to the methods described in Sec 3.1: without the activation function, (19) is analogous to (4) and can be seen as a linear combination of basis functions³⁸. This is why (19) can be conveniently written as a matrix multiplication, with the matrix W_{34} collecting all the weights³⁹. These weights are initially unknown and must be inferred during the training of the ANN.
The second key distinction is in the basis functions and their composite nature: the basis in (19) is represented by the outputs of the previous layer, which read:

y^{(3)} = σ^{(3)}( [ Σ_{j=1}^{2} w_{j,1}^{(2)} y_j^{(2)} + b_1^{(3)} ;  Σ_{j=1}^{2} w_{j,2}^{(2)} y_j^{(2)} + b_2^{(3)} ;  Σ_{j=1}^{2} w_{j,3}^{(2)} y_j^{(2)} + b_3^{(3)} ] ) = σ^{(3)}( W_{23}^T y^{(2)} + b^{(3)} ).    (21)
Differently from the last layer, this layer outputs a vector y^{(3)} ∈ R^{3×1} because it contains three neurons. Again, the input to this layer is the output of the previous one, which reads:
³⁷ These could also be linear in some layers. However, if all the activations are linear, the network becomes linear and the relation between input and output is just a matrix multiplication!
³⁸ The bias shift could be condensed within the weights, taking one of the basis functions as the vector of constants; we refrain from using this notational simplification as it is seldom encountered in the literature. Note that the bias term gives an additional degree of freedom.
³⁹ It is left as a simple exercise to the reader to write down this matrix!
y^{(2)} = σ^{(2)}( [ w_{1,1}^{(1)} y_1^{(1)} + b_1^{(2)} ;  w_{1,2}^{(1)} y_1^{(1)} + b_2^{(2)} ] ) = σ^{(2)}( W_{12}^T y_1^{(1)} + b^{(2)} ).    (22)

Note that this layer receives as input the scalar y_1^{(1)}, which is the output of the neuron in the input layer:

y_1^{(1)} = σ^{(1)}( w_{1,0} x + b_1^{(1)} ).    (23)
If one now tracks back the full path from the input x to the output y, inserting (23) into (22) and all the way up to (19), it is evident that even a simple network with seven neurons represents a fairly complex function:

y = σ^{(4)}( W_{34}^T σ^{(3)}( W_{23}^T σ^{(2)}( W_{12}^T σ^{(1)}(w_{1,0} x + b_1^{(1)}) + b^{(2)} ) + b^{(3)} ) + b_1^{(4)} ).    (24)
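A minimal NumPy sketch of the forward pass (24) for the 1-2-3-1 network of Figure 10 is given below; the weights are random, and the ELU is used in every layer for simplicity (the text leaves the choice of activation per layer open).

import numpy as np

def elu(z, a=1.0):
    # Exponential Linear Unit, eq. (20)
    return np.where(z > 0, z, a*(np.exp(z) - 1.0))

rng = np.random.default_rng(0)
# The 19 parameters of the 1-2-3-1 network of Figure 10, randomly initialized
w10, b1 = rng.standard_normal(), rng.standard_normal()
W12, b2 = rng.standard_normal((1, 2)), rng.standard_normal(2)
W23, b3 = rng.standard_normal((2, 3)), rng.standard_normal(3)
W34, b4 = rng.standard_normal((3, 1)), rng.standard_normal(1)

def forward(x):
    # Nested mapping of eq. (24)
    y1 = elu(np.atleast_1d(w10*x + b1))   # input neuron, eq. (23)
    y2 = elu(W12.T @ y1 + b2)             # first hidden layer, eq. (22)
    y3 = elu(W23.T @ y2 + b3)             # second hidden layer, eq. (21)
    return elu(W34.T @ y3 + b4)           # output neuron, eq. (19)

print(forward(2.5))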
In this simple architecture, the number of parameters (weights and biases) to be tuned during the training amounts to 19. Common architectures in deep learning have thousands of neurons and millions of parameters. For example, the famous AlexNet (Krizhevsky et al., 2017), which revolutionized image classification and computer vision in 2012, is an ANN with 8 layers (5 convolutional and 3 feedforward) consisting of about 650000 neurons and 60 million parameters. The training of this network took between five and six days on two GTX 580 3GB GPUs and was trained on a subset of ImageNet, a database of about 15 million labeled images. This network significantly outperformed any previous classification strategy and set new standards in image classification.
The complexity of the nested mathematical model constructed by an ANN finds theoretical support in the popular function approximation theorem(s), pioneered by Cybenko (1989), which prove that sufficiently large networks can approximate any continuous function. Recent results have focused on the dilemma between width and depth (see Lu et al. (2017); Hanin (2019)). Nevertheless, a robust theory of the expressivity of ANNs (i.e., their ability to approximate a rich class of functions) in relation to their architecture is still in its infancy, and the design of ANNs is nowadays largely based on experience.
The biggest challenge in the use of ANNs remains the training process, which constitutes an extremely high-dimensional optimization problem. Modern training strategies are based on the back-propagation algorithm, which computes the gradient of the cost function with respect to the weights and biases, i.e. ∇_{w,b} J, with w and b vectors containing all the weights and biases of the network. This gradient is then given to any of the optimizers described in Section 3.2.
The back-propagation algorithm was first proposed by Werbos (1974), reinvented several times and popularized by Rumelhart and McClelland (1989). While a detailed treatment is out of the scope of this section, it is possible to build intuition on its operating principles recalling the chain rule of differentiation.

For the network in Fig. 10, the gradient ∇_{w,b} J is a vector with 19 components providing the sensitivity of the error to each of the network parameters. A direct computation of this gradient involves a long chain of differentiations and is thus impractical. The back-propagation algorithm offers a more convenient alternative by propagating the error backward, from the output to the input.
The first step of the algorithm is to look at how an error propagates from one layer to the following, without attempting to get the entire chain. We simplify the notation by writing the neuron's output as

y_j^{(l)} = σ^{(l)}(z_j^{(l)}),  with  z_j^{(l)} = Σ_{j'} w_{j',j}^{(l−1)} y_{j'}^{(l−1)} + b_j^{(l)}.    (25)

Therefore, the output of the network is simply σ^{(4)}(z_1^{(4)}), where z_1^{(4)} is clearly a complicated function of the inputs x_i (of no interest for the moment). Taking the squared error as a cost function over the n∗ training points gives

J(w, b) = Σ_{i=1}^{n∗} ( y_i − σ^{(4)}(z_1^{(4)}) )².    (26)
We introduce next the variable δ_j^{(l)} = ∂J/∂z_j^{(l)}, the gradient of the cost function with respect to the output z_j^{(l)} of neuron j at layer l, before this goes into the activation function. This quantity can be easily computed in the last layer from (26):

δ_1^{(4)} = −2 Σ_{i=1}^{n∗} ( y_i − σ^{(4)}(z_1^{(4)}) ) dσ^{(4)}/dz^{(4)},    (27)

where the derivative of the activation function, dσ/dz, is readily available from its definition⁴⁰. The scope of back-propagation is to use this entry to compute the others (δ^{(3)}, δ^{(2)} and so on). The error of a neuron propagates to the ones in the consecutive layer as follows:
δ_j^{(l)} = Σ_{j'} ( ∂J/∂z_{j'}^{(l+1)} ) ( ∂z_{j'}^{(l+1)}/∂y_j^{(l)} ) ( ∂y_j^{(l)}/∂z_j^{(l)} ),    (28)

where j' is the index spanning the neurons connected with j. The first term hides the effect of what happens from one neuron to the next, while the other two are a direct application of the chain rule, having considered 'whatever happens next' as a function of the form J(z^{(l+1)}(y^{(l)})). Introducing δ_{j'}^{(l+1)} and (25) gives:
δ_j^{(l)} = ( Σ_{j'} δ_{j'}^{(l+1)} w_{j,j'}^{(l)} ) dσ^{(l)}/dz^{(l)}.    (29)
Starting from (27), it is thus possible to backpropagate the errors and compute all the δ's. This is the key intermediate step towards the computation of the gradient, which can then be assembled layer by layer: with the notation above, ∂J/∂w_{j,j'}^{(l)} = δ_{j'}^{(l+1)} y_j^{(l)} and ∂J/∂b_j^{(l)} = δ_j^{(l)}.
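To complement the derivation, a minimal sketch of (27)-(29) for the same 1-2-3-1 network follows, reusing the weight shapes of the earlier forward-pass sketch; the sigmoid is used in every layer so that dσ/dz = σ(1 − σ), and the computation is for a single training sample.

import numpy as np

sig = lambda z: 1.0/(1.0 + np.exp(-z))     # sigmoid of eq. (20)

rng = np.random.default_rng(0)
w10, b1 = rng.standard_normal(), rng.standard_normal()
W12, b2 = rng.standard_normal((1, 2)), rng.standard_normal(2)
W23, b3 = rng.standard_normal((2, 3)), rng.standard_normal(3)
W34, b4 = rng.standard_normal((3, 1)), rng.standard_normal(1)

def gradients(x, y):
    # Forward pass, storing the output a of each layer
    a1 = sig(np.atleast_1d(w10*x + b1))
    a2 = sig(W12.T @ a1 + b2)
    a3 = sig(W23.T @ a2 + b3)
    a4 = sig(W34.T @ a3 + b4)
    # Output delta, eq. (27) for one sample, with dsig/dz = a(1 - a)
    d4 = -2.0*(y - a4) * a4*(1 - a4)
    # Backward recursion, eq. (29)
    d3 = (W34 @ d4) * a3*(1 - a3)
    d2 = (W23 @ d3) * a2*(1 - a2)
    d1 = (W12 @ d2) * a1*(1 - a1)
    # Gradients: dJ/dW = outer(previous output, delta); dJ/db = delta
    return {"W34": np.outer(a3, d4), "b4": d4,
            "W23": np.outer(a2, d3), "b3": d3,
            "W12": np.outer(a1, d2), "b2": d2,
            "w10": (d1*x)[0], "b1": d1[0]}

g = gradients(2.5, 1.0)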
The weights of the network in Figure 10 are initialized randomly, and the optimization is carried out using the ADAM optimizer with η = 0.02, taking all the default parameters of the optimizer. The dataset has been pre-processed using Support Vector Machines to remove the outliers (which were found to have a significant impact). The training is performed using mini-batches of 120 data points, and an 'early stopping' approach is used to control overfitting. This technique consists in stopping the optimization if the validation error starts increasing.
Figure 11: Left: regression results of the ANN with 2 and 3 neurons in the hidden layers. The regression is repeated 100 times (continuous red lines) and the data is preprocessed via Support Vector Machines for outlier removal. Right: same plot but considering an ANN with 64 and 128 neurons in the hidden layers.
The variance of the tests appears acceptable, but shows that the optimizer does have a tendency to occasionally fall into local minima. The results show that the simple network in Figure 10 underfits the data. The same analysis is then carried out increasing the number of neurons to 64 in the second layer and 128 in the third layer. The results are collected in Figure 11. While the same observations on the variance hold, the larger network performs better (as expected). On the other hand, readers accustomed to linear methods might find the lack of overfitting somewhat surprising: this network consists of 8579 parameters and is trained using only 450 data points. Underdetermined networks, having many more parameters than data points, are very common in deep learning. An interesting discussion on this topic is provided by Zhang et al. (2016).
3.4 An exercise on Turbulence Modeling

For the fully developed turbulent channel flow sketched in Figure 12, the mean momentum and energy balances reduce to

−(1/ρ) dp/dx − d(uv)/dy + ν d²U/dy² = 0,    (31)
α d²T/dy² − d(vθ)/dy = 0,    (32)
where ρ is the fluid density, ν is the kinematic viscosity, α is the thermal diffusivity and p(x) represents the mean pressure field. The velocity field is decomposed as the sum of a mean flow U(y) and a fluctuating velocity with stream-wise component u(x, y) and cross-stream component v(x, y); the temperature field is decomposed as the sum of a mean temperature T and a fluctuating temperature θ. The overbar denotes spatial averaging, and the terms uv and vθ are the Reynolds stresses and the turbulent heat fluxes. Both quantities need to be modelled in order to solve (31) and (32).
Equations (31) and (32) can be written in dimensionless form as follows:

−dp⁺/dx* − d(u⁺v⁺)/dy* + (1/Re_τ) d²U⁺/dy*² = 0,    (33)
(1/(Re_τ Pr)) d²T⁺/dy*² − d(v⁺θ⁺)/dy* = 0,    (34)
where p⁺ = p/(ρu_τ²), U⁺ = U/u_τ, T⁺ = T/T_τ, x* = x/δ and y* = y/δ. The reference velocity of the problem is the friction velocity u_τ = √(τ_w/ρ), with τ_w the wall shear stress; the reference temperature is the friction temperature T_τ = q_w/(ρ c_p u_τ), with c_p the fluid's specific heat. The dimensionless numbers in (33) and (34) are the turbulent Reynolds number Re_τ = u_τ δ/ν and the Prandtl number Pr = ν/α.
Figure 12: Scheme of fully developed turbulent channel flow with uniform heat flux bound-
ary conditions.
Because of the full development of the velocity field, a self-similar transform can be used to render the temperature equation 1D, i.e. solely a function of y, while accounting for its linear increase along x due to the uniform wall heating. Following Kawamura et al. (2000), it is convenient to introduce the self-similar variable T̃⁺:

T̃⁺(y) = (d⟨T⁺⟩/dx*) x* − T⁺(x, y),    (35)

where

d⟨T⁺⟩/dx* = 1/⟨U⁺⟩,    (36)
and ⟨·⟩ denotes the spatial average along y yielding bulk quantities, i.e.

⟨T⁺⟩ = (1/δ) ∫₀^δ T⁺(x, y) dy  and  ⟨U⁺⟩ = (1/δ) ∫₀^δ U⁺(x, y) dy.    (37)
Introducing (35) and (36) in (34) yields the 1-D energy equation for the self-similar variable T̃⁺:

∂/∂y* [ (1/(Re_τ Pr)) ∂T̃⁺/∂y* − v⁺θ⁺ ] + U⁺/⟨U⁺⟩ = 0.    (38)
Focusing on the thermal problem, we assume that the velocity field is given and we seek to develop a model for the turbulent heat flux v⁺θ⁺. This quantity is known to be a function of the Reynolds number Re = ⟨U⟩δ/ν, the Prandtl number Pr, the velocity gradients dU/dy and the temperature gradients dT/dy. The scope of the regression problem is to identify this function, which we here denote as v⁺θ⁺ = g(x, w), where x is the vector containing the quantities that are assumed to play a role and w is the set of weights defining the function parametrization.

For a given function v⁺θ⁺ = g(x, w), it is possible to integrate (38) and obtain a prediction of the temperature field over a mesh y ∈ R^{np}, denoted as T_P(y, x, w). Given some data T(y), it is possible to evaluate the quality of the prediction in terms of a root mean square cost error:

J(w) = ||T(y) − T_P(y, x, w)||₂ / √n_p.    (39)
In this exercise, the integration of (38) is carried out using a finite difference discretization on a mesh of n_p = 500 points, hence solving the following matrix equation:

L T̃⁺ + G ( α_t⁺ G T̃⁺ ) + S = 0,    (40)

where L is the Laplacian matrix multiplied by (Re_τ Pr)^{-1}, G is the backward finite difference matrix and S is the source vector containing the values of U⁺/⟨U⁺⟩.
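For concreteness, a possible assembly of (40) is sketched below; the eddy-diffusivity profile and the source term are crude placeholders for the model output and the velocity data, and the boundary rows (not discussed in the text) are our assumption, enforcing T̃⁺ = 0 at the wall and a zero gradient at the centerline.

import numpy as np

n_p, Re_tau, Pr = 500, 180.0, 0.71
y = np.linspace(0.0, 1.0, n_p)
dy = y[1] - y[0]

# Backward-difference matrix G and scaled Laplacian L of eq. (40)
G = (np.eye(n_p) - np.eye(n_p, k=-1)) / dy
L = (np.eye(n_p, k=-1) - 2*np.eye(n_p) + np.eye(n_p, k=1)) / dy**2 / (Re_tau*Pr)

alpha_t = 0.1*y*(1.0 - y)        # placeholder for the modeled eddy diffusivity
S = np.ones(n_p)                 # placeholder for U+/<U+>

# Eq. (40) rearranged as A T = -S, with boundary rows overwritten
A = L + G @ np.diag(alpha_t) @ G
b = -S.copy()
A[0, :] = 0.0;  A[0, 0] = 1.0;  b[0] = 0.0                        # wall: T = 0
A[-1, :] = 0.0; A[-1, -1] = 1.0; A[-1, -2] = -1.0; b[-1] = 0.0    # center: dT/dy = 0
T = np.linalg.solve(A, b)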
A classic closure for the turbulent heat flux is the gradient diffusion hypothesis, which links vθ to the mean temperature gradient via an eddy diffusivity α_t:

vθ = −α_t dT/dy = −(ν_t/Pr_t) dT/dy.    (41)

The eddy diffusivity is further assumed to be linked to the eddy viscosity ν_t via the turbulent Prandtl number Pr_t = ν_t/α_t, which is often taken as a constant in the range Pr_t = 0.5−1 (Bird et al., 2006).
Nevertheless, this assumption has been shown to be inaccurate even in simple flow configurations. Figure 13 plots the turbulent Prandtl number from several of the test cases in Kawamura et al. (2000)'s dataset, highlighting that this value is not constant and is larger than unity for fluids with low Prandtl number.
Figure 13: Profiles of the turbulent Prandtl number along the height of the channel (Kawamura et al., 2000) for different Re_τ and Pr.
While keeping the constant-Prt assumption, Reynolds (1975) proposed the
following correlation to better account for the dependency on Re and Pr:

\[
Pr_t = \left(1 + 100\,(Re\,Pr)^{-1/2}\right)\left(\frac{1}{1 + 120\,Re^{-1/2}} - 0.15\right) \tag{42}
\]
In this first approach to the exercise, we assume that the turbulent Prandtl number
is constant, but we leave it as a regression parameter. The model therefore consists of a
single weight, and the regression task reduces to the optimization problem of identifying
the Prt that minimizes the cost function in (39). This optimization is solved using the
Sequential Least Squares Programming (SLSQP) implemented in the SciPy library, and the
results for the optimal turbulent Prandtl number are shown in the last column of Table 1.
These are compared with the prediction from (42) in Figure 14 (left), while Figure 14
(right) compares the dimensionless temperature profiles produced by both methods with
the DNS data at the same Reynolds and Prandtl numbers (Reτ = 180, Pr = 0.6). The
optimized Prt significantly improves the temperature prediction.
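The following sketch illustrates this single-weight optimization with SciPy's SLSQP. The function solve_temperature below is a toy stand-in for the finite-difference solver of (40), and the 'data' is synthetic; in the exercise, both are replaced by the actual solver and the DNS profiles.

import numpy as np
from scipy.optimize import minimize

y = np.linspace(0.0, 1.0, 500)

def solve_temperature(Pr_t):
    # toy stand-in profile, for illustration only; the exercise solves (40)
    return (1.0 - (1.0 - y)**2) / Pr_t

T_dns = solve_temperature(0.9)        # synthetic 'data' generated with Pr_t = 0.9

def cost(w):
    # RMS cost (39) between the predicted and the reference temperature
    return np.sqrt(np.mean((solve_temperature(w[0]) - T_dns)**2))

res = minimize(cost, x0=np.array([1.5]), method='SLSQP', bounds=[(0.1, 5.0)])
print('Optimal Pr_t:', res.x[0])      # recovers ~0.9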
Figure 14: (a) Turbulent Prandtl number as a function of the molecular Prandtl number
for Reτ = 180: comparison between the Reynolds (1975) correlation for Prt and the
optimal values calculated with the SLSQP. (b) Temperature profiles computed with
the optimized Prt and with the value of Prt given by eq. (42) (Reτ = 180, Pr = 0.6):
comparison with the corresponding DNS temperature profile.
In the second approach to the exercise, the turbulent diffusivity in (40) is modelled
with an ANN. The chosen input features are:

• x1 = k²/(νε) (approximated, dimensionless eddy viscosity)

• x2 = k/(U² + k) (turbulence intensity)

• x3 = √k y/ν (dimensionless wall distance)

where k is the turbulent kinetic energy, k = 1/2 Σᵢ uᵢ², and ε is the turbulent dissipation
rate, ε = 2ν Σᵢ,ⱼ ∂uᵢ/∂xⱼ ∂uⱼ/∂xᵢ. The parameter x2 is an alternative definition of the
turbulence intensity that makes this quantity bounded between 0 and 1. Figure 15
shows the evolution of x1–x3 along the channel for different values of the Reynolds
number. All these input variables are dimensionless quantities to enforce the model's
invariance with respect to changes of the geometry or a specific choice of flow parameters.
Figure 15: Evolution of the input parameters x1–x3 over the channel width at different
Reynolds numbers.
The structure of the network is shown in Figure 16. This consists of an input layer of
4 neurons, a Gaussian noise layer, two hidden layers of 50 neurons each, and a single-neuron
output layer. This output is then fed to the numerical solver to predict the temperature
distribution and compute the associated cost in (39). The activation functions in the two
hidden layers are chosen as linear and hyperbolic tangent. The dataset is scaled to the
range [−1, 1].
The training is performed using the ADAM optimizer, a variant of stochastic gradient
descent. The training-validation split is defined as shown in Table 1. The Gaussian layer
acts as a regularization strategy that mitigates overfitting. The noise is generated with
zero mean and a standard deviation of σ = 0.005. Figure 17 shows the evolution of the
loss function during training for both the training and the validation ranges: the training
terminates after 3500 epochs.
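A minimal sketch of this architecture, written with the Keras API, is given below. Only the layer sizes, activations and noise level follow the text; in the actual exercise the loss is not a plain supervised MSE but the cost (39), evaluated by feeding the network output to the solver of (40).

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.GaussianNoise(0.005, input_shape=(4,)),  # regularization: zero-mean noise
    layers.Dense(50, activation='linear'),          # first hidden layer
    layers.Dense(50, activation='tanh'),            # second hidden layer
    layers.Dense(1)                                 # predicted turbulent diffusivity
])
model.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# Supervised placeholder for the training loop; in the exercise the loss is
# the cost (39), computed through the finite-difference solver of (40):
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=3500)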
Figure 18 shows the predicted Prandtl number for four profiles for which this quantity
is available in the DNS dataset, together with the values extracted from the DNS. The
prediction from the ANN appears in good agreement in the logarithmic layer (y+ > 30),
where the turbulent flux is significantly higher than the viscous one. On the other hand,
both νt and αt tend to zero towards the viscous sub-layer (y+ < 10) and hence the cost
function becomes insensitive to the temperature. A more complex formulation, capable
of modelling the turbulent heat flux down to y = 0, could be built by introducing the
role of the imposed heat fluxes in the cost function.
Figure 16: Scheme of the ANN employed to predict the turbulent diffusivity αt .
Figure 17: Convergence of the loss function during training, for both the training and
the validation data, as a function of the epoch.
Finally, Figure 19 compares the temperature and turbulent heat flux profiles obtained
with the Prandtl number optimized by SLSQP (first approach, green lines) and with the
ANN (second approach, light blue lines), along with the corresponding DNS profiles (red
lines). The nonlinear mapping offered by the ANN clearly yields a more accurate and
robust turbulence model, as visible from the comparison at Reτ = 395, Pr = 0.71, which
is one of the validation profiles.
4 Dimensionality Reduction
4.1 A first simple example
Consider the dataset in Fig 20. The set consists of np = 150 datapoints x ∈ R3×1 that
have been sampled from three distributions in a three-dimensional space. The colors in
Figure 18: Profiles of the turbulent Prandtl number predicted by the ANN (solid lines)
over the channel width. Comparison with the DNS profiles (dashed lines).
Figure 19: Comparison of the results achieved with the optimization approach (green
lines) and the ANN-regression approach (light blue lines). DNS data are indicated with
red lines.
the markers encode the different distributions (clusters). The dataset can be compactly
represented by a dataset matrix X ∈ R^{3×150}, with each point (vector) along its columns.
To minimize the amount of technical details, the data has been mean-centered (i.e. the
mean over the columns of X is the zero vector).
To illustrate the main ideas behind dimensionality reduction, we here seek to
identify the best two-dimensional representation of the data. That is, we seek a compression
ratio of 3 : 2, from R³ to R². Of course, a simple approach would consist in projecting onto
any of the planes of the Cartesian coordinate frame. The projection implies
the definition of a basis matrix, here denoted as B, whose columns collect the basis vectors
of the projected space. For instance, the projection onto the (i, j) plane is
VKI - 37 -
4 DIMENSIONALITY REDUCTION 4.1 A first simple example
Figure 20: Plot of the dataset considered in this section: np = 150 points belonging to
three probability density functions are sampled in a three-dimensional space.
\[
X_B = B^T X \quad \text{with} \quad B = \begin{bmatrix} 1 & 0\\ 0 & 1\\ 0 & 0 \end{bmatrix}, \qquad X_B \in \mathbb{R}^{2\times 150}\,. \tag{43}
\]
We evaluate the quality of a projection by the amount of information lost. The operation
in (43) is a linear encoder. The matrix B has no inverse. Nevertheless, it is possible
to show (Mendez, 2020a) that the 'best possible' inversion of (43) is obtained using the
transpose, i.e. X̃_B = B Bᵀ X = A_B X, with A_B the associated linear autoencoder.
VKI - 38 -
4.1 A first simple example 4 DIMENSIONALITY REDUCTION
Using instead the first two principal components u1, u2, collected in a matrix U ∈ R^{3×2},
the PCA approximation reads

\[
\tilde{X}_U = U X_U = U U^T X \;\Longleftrightarrow\; \tilde{\mathbf{x}}_k = \sum_{j=1}^{2} c_{kj}\,\mathbf{u}_j \quad \text{with} \quad c_{kj} = \mathbf{u}_j^T \mathbf{x}_k\,, \tag{47}
\]

with A_U = U Uᵀ the PCA autoencoder. The summation on the right recalls that each
approximation (x̃_k) of a data point (x_k) is a linear combination of the principal
components u_j. The coefficients c_{kj}, written as a vector over the index j, are the
PCA-transformed data: that is, the projection of a data point onto the principal components.
Returning to the notation in Fig 4, this is the encoding function E; the entries of the
encoded representation are inner products between the data and the principal components.
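As a quick illustration, the PCA encoder and decoder of (47) can be written in a few lines of NumPy; the random dataset below is a stand-in for the mean-centered data matrix X.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 150))      # stand-in for the 3 x 150 dataset
X -= X.mean(axis=1, keepdims=True)     # mean-centering

# Principal components: left singular vectors of X (eigenvectors of X X^T);
# keep the two dominant ones.
U_full, _, _ = np.linalg.svd(X, full_matrices=False)
U = U_full[:, :2]

Z = U.T @ X                            # encoder: PCA-transformed data
X_tilde = U @ Z                        # decoder: the autoencoder A_U = U U^T
err = np.linalg.norm(X - X_tilde) / np.sqrt(3 * X.shape[1])
print('normalized reconstruction error:', err)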
Figure 21: Comparison of the data reconstruction via the autoencoder A_U (top left)
versus the autoencoder A_B (bottom left). On the right, the data is shown in the 2D
representations provided by each basis, keeping the cluster color code.
The figures on the left show the reconstructions X̃_U (top) and X̃_B (bottom), while the
figures on the right show the data in the reduced space: the plane (u1, u2) for the PCA
and the plane (i, j) for the trivial projection. In the reconstruction plots, the blue markers
indicate the original data while the orange ones indicate the reconstructions. In the 2D
plots, the color code of the clusters is maintained.
The normalized error (i.e. (46) divided by √(3 np)) is E = 0.14 for the PCA, while it is
E = 1.05 for the trivial projection. Clearly, the 2D representation from the PCA preserves
more information and keeps the distant cluster (red markers) far from the other two, as
in the original 3D space. Note that the area of machine learning that focuses on the mapping
from a higher dimensional space onto a lower dimensional one is known as manifold
learning. This is a subset of dimensionality reduction and encompasses mostly nonlinear
methods. Manifold learning is generally not concerned with the possibility of mapping
the data back from the reduced space onto the original one, but rather aims at preserving
certain metrics (e.g. distances) in the reduced space.
Before proceeding with a more interesting example, it is worth testing two nonlinear
methods and assessing whether these allow for autoencoders that lead to lower errors for the
same compression ratio. A nonlinear generalization of PCA is the Kernel PCA (KPCA,
Schölkopf et al. (1997)). The underlying idea is to perform PCA on a dataset that has
first been transformed by a nonlinear function Φ(·), the feature map associated with a
kernel function, similarly to what is introduced in Section 3 for the regression problem.
The transformed data is thus XT = Φ(X) ∈ R^{nF×np}, with nF the dimension of the new
space, which is increased, possibly to infinity. This new space is referred to as the feature
space42.
Linear operations in the feature space (e.g. a projection) are nonlinear in the original
space. The principal components in the feature space become nonlinear functions that, in
general, cannot be represented in the original space. The eigenvalue problem in the
feature space reads

\[
\Phi(X)\,\Phi(X)^T\,\mathbf{w}_j = \lambda_j\,\mathbf{w}_j\,, \tag{49}
\]

and the principal components are sought in the form

\[
\mathbf{w}_j = \sum_{i=1}^{n_p} a_{ji}\,\Phi(\mathbf{x}_i) = \Phi(X)\,\mathbf{a}_j\,, \tag{50}
\]

that is, writing the principal components in the feature space as a linear combination of
the features. Introducing (50) in (49) and multiplying both sides by Φ(X)ᵀ yields an
equivalent eigenvalue problem for the kernel matrix K = Φ(X)ᵀΦ(X) ∈ R^{np×np}, whose
eigenvectors are the coefficient vectors
a_j. The projection of a feature vector Φ(x_i) onto the principal component w_j in the feature
space then gives the encoding function for the KPCA:

\[
\mathbf{z}_i = E(\mathbf{x}_i) \;\rightarrow\; z_i[j] = \mathbf{w}_j^T\,\Phi(\mathbf{x}_i) = \Phi(\mathbf{x}_i)^T\,\mathbf{w}_j = \Phi(\mathbf{x}_i)^T\,\Phi(X)\,\mathbf{a}_j = \kappa(\mathbf{x}_i, X)\,\mathbf{a}_j \tag{53}
\]
The first part is analogous to the PCA encoding via inner products, while in the second
part the kernel function κ is introduced. The reader should notice that the computational cost of the KPCA is significantly
increased, since it involves the diagonalization of K ∈ Rnp ×np while the PCA required the
diagonalization of XX T ∈ R3×3 . The major difficulty, moreover, is in the decoder step,
which in the KPCA has no analytical form. The reader is referred to Bakir et al. (1999)
for details on the decoding of KPCA.
The KPCA for this example is performed using scikit-learn (Pedregosa et al., 2011).
The results are shown in Fig. 22, both in terms of reconstruction and 2D representation.
The chosen kernel is a radial basis function with γ = 1. The reconstruction error is of
the order of 10⁻¹⁴. In the reduced space, the clusters appear fairly well separated: the
nonlinear mapping preserves most of the information in the reduced space, so
that the decoder successfully recovers the 3D representation.
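For reference, a minimal sketch of this computation with scikit-learn follows. The random dataset is a stand-in for the actual data, and the approximate decoder is enabled via the fit_inverse_transform option, which learns a pre-image map in the spirit of Bakir et al. (1999).

import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.standard_normal((150, 3))     # scikit-learn expects samples as rows

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=1.0,
                 fit_inverse_transform=True)
Z = kpca.fit_transform(X)             # encoder: 2D representation
X_tilde = kpca.inverse_transform(Z)   # decoder: approximate pre-image
print('RMS:', np.sqrt(np.mean((X - X_tilde)**2)))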
Figure 22: Same as Figure 21, but for the KPCA autoencoder: data reconstruction (left)
and 2D representation (right). The reconstruction RMS error is of the order of 4.3 × 10⁻¹⁴.
Finally, the last autoencoder tested is the fully connected ANN architecture in Figure
23. Recalling the input-output relations for a neuron, described in Sec 3.3, the reader
should notice that the PCA autoencoder is recovered in the case of linear activation functions.
The training is performed using the ADAM solver with batches of 32, and the chosen
activation functions are sigmoids for both encoder and decoder. The dataset is scaled to
the range [0, 1].
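A minimal sketch of such an autoencoder in Keras is given below, assuming a single two-neuron bottleneck (the exact layer count of Figure 23 is not detailed here); replacing the sigmoids with linear activations recovers the linear, PCA-like autoencoder discussed next.

from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    layers.Dense(2, activation='sigmoid', input_shape=(3,)),  # encoder -> 2D bottleneck
    layers.Dense(3, activation='sigmoid')                     # decoder -> 3D reconstruction
])
autoencoder.compile(optimizer=keras.optimizers.Adam(), loss='mse')
# X_scaled is the (150, 3) dataset scaled to [0, 1]; input equals target:
# autoencoder.fit(X_scaled, X_scaled, batch_size=32, epochs=2000)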
Figure 24 compares the error convergence during the training for the nonlinear
autoencoder and the linear one, obtained by setting linear activation functions. The error in
this plot is computed on the scaled dataset, so it should not be compared with the root
mean square errors in the titles of the 3D plots in Figures 20–22. As expected, the linear
autoencoder matches the results from the PCA to machine precision. This gives an
interesting alternative tool to compute the PCA for extremely large datasets, for which the
matrix multiplications involved in (47) might become too memory demanding.
The convergence for the nonlinear autoencoder is much gentler, but the final result
yields comparable errors. Figure 25 shows the results for the nonlinear autoencoder in
Figure 23: Architecture of the ANN autoencoder used in this first exercise.
terms of the 3D reconstruction and the 2D mapping. The performance is comparable
with that of the standard PCA, with an RMS error of about E = 0.14.
Figure 24: Convergence of the training error for the nonlinear autoencoder (using sigmoid
activation functions) and the linear autoencoder (using linear activation functions).
Figure 25: Same as Fig 21 but considering an autoencoding via ANNs. The results,
compared with the top plots in Figure 21, are similar to those from the PCA.
4.2 Dimensionality reduction of a flow field

We now consider the flow past a cylinder in transient conditions, measured via TR-PIV
(Mendez et al., 2020). The set consists of nt = 13200 velocity fields, sampled at fs = 3 kHz
over a grid of 71 × 30 points. The spatial resolution is approximately ∆x = 0.85 mm.
A snapshot of the velocity field is shown in Figure 26 on the left.
This test case is characterized by a large variation of the free-stream velocity.
A plot of the velocity magnitude in the main stream is shown in Figure 26 on the right.
In the first 1.5 s, the free-stream velocity is approximately U∞ ≈ 12 m/s. Between
t = 1.5 s and t = 2.5 s, this drops down to U∞ ≈ 8 m/s. The variation of the flow velocity
is sufficiently slow to let the vortex shedding adapt, and hence preserve an approximately
constant Strouhal number St = f d/U∞ ≈ 0.19, with d = 5 mm the diameter of the
cylinder. Consequently, the vortex shedding frequency varies from f ≈ 459 Hz to f ≈ 303 Hz.
Figure 26: Snapshot of the velocity field in the test case considered in this section. This is
the flow past a cylinder in transient conditions, measured via TR-PIV (see Mendez et al.
(2020)).
The dataset matrix is now X ∈ R^{4260×13200}. We here seek a compression of the snapshots
from R^{4260} to R³, that is 4260 : 3. We consider the same autoencoders tested in
the previous section: PCA, KPCA and ANNs.
The data is scaled to the range [0, 1]. For the KPCA, the γ parameter is set to γ = 1.
For the ANN autoencoder, the network input and output consist of 4260 neurons, while the
hidden layer consists of 3 neurons. The total number of training weights is thus 38,343.
The chosen activation function is the ELU, as it allowed for smoother training. The
training is performed with the ADAM optimizer and a batch size of 128, and starts from a
random initialization of the weights. Dropout and batch normalization are used between
the input and the hidden layer to control over-fitting and facilitate the training. The first
technique consists in randomly deactivating some of the neurons (in this case 20%), hence
forcing the network to level the contribution of all the weights; the second consists in
enforcing that the outputs of the neurons have zero mean and unitary standard deviation.
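A sketch of this network in Keras is given below. With batch normalization applied to the 4260 inputs, the trainable parameter count (12783 + 17040 + 8520) matches the 38,343 weights quoted above; the exact ordering of the regularization layers remains an assumption based on the description.

from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    layers.Dropout(0.2, input_shape=(4260,)),  # randomly deactivate 20% of units
    layers.BatchNormalization(),               # enforce zero mean / unit std
    layers.Dense(3, activation='elu'),         # 3-neuron bottleneck (the encoding)
    layers.Dense(4260, activation='elu')       # reconstruction of the snapshot
])
autoencoder.compile(optimizer='adam', loss='mse')
# snapshots: (13200, 4260) array scaled to [0, 1]
# autoencoder.fit(snapshots, snapshots, batch_size=128)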
The results are collected in Figure 27. The figures on the left show the data in the reduced
space identified by each of the autoencoders. The figures on the right show an example
snapshot, with the title indicating the RMS error over the entire field. In each case, the
snapshot corresponds to the red marker on the left.
The three autoencoders learn valuable 3D representations of a 4260-dimensional space,
with the reconstructed snapshots being indistinguishable from the original data. In the case
of the PCA, the learned representation appears to collapse around a paraboloid with principal
axis along z3. Each of the steady states corresponds to circles at z1 = const, and the
transient drives the shift from the first circular orbit (with larger radius) to the second.
The existence of such a paraboloid in an optimal linear basis of eigenfunctions was first
derived by Noack et al. (2003) (see also Noack et al. (2020) for a review). It is nevertheless
worth highlighting that most of the previous works on reduced order modeling of
the cylinder wake flow focus on much lower Reynolds numbers.
The 3D mapping produced by the kernel PCA appears as a distorted version of the one
produced by the PCA. The kinematics of the data in this space is similar to that of the PCA,
but the regions of higher velocity are nonlinearly compressed into orbits of much smaller radius.
The error is remarkably close to machine precision. Finally, the ANN autoencoder yields
results that are comparable to the PCA. Up to a rotation, the learned manifold is similar
to the paraboloid, with the only difference being an inclined axis. Interestingly, as no
special constraints are introduced in the network, the nonlinear mappings differ at each
run by a rotation of the encoded representation. Finally, it is worth noticing that the
results produced by this autoencoder are strongly dependent on the choice of
the activation function, but an analysis of such influence is outside the scope of this lecture.
The main message from this exercise, further discussed in Sec. 6, is that it has been
possible to achieve impressive levels of compression. Using regression tools or system
identification methods, one could construct predictive models in the reduced space and
then use the decoder to map these back to the full space. By solving a regression problem
in a 3-dimensional domain, one could predict with remarkable accuracy the evolution of
a flow field in a 4260-dimensional domain.
Figure 27: Results of the dimensionality reduction exercise on the flow field of the flow
past a cylinder, available from TR-PIV from Mendez et al. (2020). The scatter plots on
the left show the 3D representations identified from the 4260-dimensional velocity field.
The red points on these scatter plots correspond to the snapshot shown on the right.
5 Reinforcement Learning and Control

5.1 Problem Set

The RL exercise considers the control of waves governed by the one-dimensional advection
equation

\[
\frac{\partial u}{\partial t} + 330\,\frac{\partial u}{\partial x} = \underbrace{f(t, x)}_{\text{disturbance}} + \underbrace{g(t, x)}_{\text{control}}\,, \tag{54}
\]
over a domain x ∈ [0, 50]. Both ends of the domain are taken as open (non-reflecting
conditions). The domain and a snapshot of the system are illustrated in Figure 28.
The disturbance term (in yellow in Fig. 28) consists of a Gaussian function multiplied
by a sinusoid with given frequency and amplitude. The control term (in green in Fig. 28)
consists of an identical Gaussian placed downstream, with amplitude driven by the
agent (i.e. the controller).
The amplitude a_t constitutes the action of the agent. This is taken as a function of
the system state s_t, which in this case consists of a vector of 18 values: the solution sampled
at three time steps (t, t − 1 and t − 2) in 6 points (indicated with red dots in Fig. 28).
The goal of the agent is to cancel the disturbance downstream of its location, i.e.
achieving u(x > 18.2) ≈ 0 ∀t. In an open-loop approach, it is easy to show that the optimal
control action is a sinusoid with the same amplitude and frequency as the forcing and
a phase shift that accounts for the travel time from the disturbance to the control
location. In a closed-loop approach, it is easy to show that the optimal law consists of a
linear combination of the observations at a given location at the time steps (t, t − 1).
5.2 RL Definitions

The solution of the control problem consists in identifying the control law a_t = π(s_t),
known as the policy, capable of achieving the control objective. In RL, we set this objective
in the form of rewards at a given time step, which are here taken as the negative L2 norm
of the snapshot at time t, i.e. r_t = −||u(t, x)||₂.
Among the four categories of algorithms introduced in Sec. 2.4, we here focus on
on-policy methods, aiming at learning the function π. In Deep Reinforcement Learning
(DRL), the policy is parametrized by an ANN. In a deterministic approach, this policy
is thus written as at = π(st , θ) with θ the set of weights43 ; in a stochastic approach, the
ANN outputs the parameters of a distribution from which an action is sampled and it is
hence more commonly denoted as πθ (at |st ), where | denotes the conditioning operator.
In this exercise we opt for a stochastic approach and parameterize the policy with
a feed-forward ANN with an input layer of 18 neurons, a single-neuron output and 2
hidden layers with 500 neurons each. The activation functions are ReLU in the entire
network. The training of the ANN therefore requires the optimization of about 6.8 million
parameters. We here focus on model-free methods, for which we have no surrogate model
that allows for estimating how the environment responds to actions; hence the learning
can only proceed via trial and error, as the agent interacts with the environment.
The interaction occurs through a series of episodes. Each episode produces a set
of triplets τ = {(s₀, a₀, r₀), (s₁, a₁, r₁), . . . } which are processed to update the weights of
the network towards better policy estimates.

43 While we have used w for the weights in the previous sections, we here follow the RL literature and
opt for θ.
Figure 28: Description of the 1D advection equation environment, with the source of
perturbation (yellow), the control action (green) and the observation points (red dots).
The performance is measured in terms of the cumulative reward, defined as

\[
R = \sum_{t=0}^{H} \gamma^{t}\, r_t
\]

with H the duration of the episode and γ ∈ [0, 1] the discount rate. This parameter
makes the agent prioritize immediate rewards over long-term gains. In this exercise, we
fix the length of the episode to T = 0.3 s and set its duration as H = T/∆t,
with ∆t the time step of the numerical scheme solving (54). This is a simple explicit
first-order finite difference method with CFL = 0.9 and nx = 200 grid points.
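As a small worked example, the cumulative reward can be computed from a recorded list of rewards as follows (the reward values below are placeholders):

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Cumulative reward R = sum_t gamma^t r_t over one episode."""
    gammas = gamma ** np.arange(len(rewards))
    return float(np.sum(gammas * np.asarray(rewards)))

rewards = [-1.0, -0.8, -0.5, -0.1]   # placeholder values, e.g. rt = -||u(t, x)||_2
print(discounted_return(rewards))    # approximately -2.379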
The cumulative reward allows for estimating the value of a state s_t, based on how
much reward can be achieved if the system follows the policy πθ(a_t|s_t) from that state
until the end of the episode. This is given by the value function V^π(s_t) = E_π[Σ_{k≥0} γ^k r_{t+k} | s_t];
similarly, the action-value function Q^π(s_t, a_t) = E_π[Σ_{k≥0} γ^k r_{t+k} | s_t, a_t] also conditions
on the first action taken. In both definitions, the expectation operator is in practice the
empirical average over a set of batch examples. The difference between these two quantities
is the advantage function A^π(s_t, a_t) = Q^π(s_t, a_t) − V^π(s_t).
The exercise is here solved with the Proximal Policy Optimization (PPO) proposed
by Schulman et al. (2017). This algorithm combines policy gradient methods with value-based
methods and hence belongs to the actor-critic family. The aim is to maximize an
objective function, using the first-order optimization tools discussed in Sec. 3.2, which in RL
become gradient ascent methods. In an actor-critic formalism, this objective is usually
defined in terms of the policy likelihood weighted by the advantage function44.
5.3 Results

The environment and the agent were prepared using Python classes from stable-baselines
(Hill et al., 2018). Figure 29 shows the reward evolution over the training,
while Figure 30 (left) shows a snapshot of a control iteration after a few episodes. Figure
30 (right) compares the control action taken by the agent as a function of time with the
optimal sinusoidal law predicted from first principles. The performance is obviously
very poor in this first part of the training.
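A minimal sketch of this setup with stable-baselines is given below. The environment class WaveEnv, which should wrap the numerical solver of (54) as a Gym environment, is a hypothetical skeleton; only the PPO algorithm and the two 500-neuron ReLU hidden layers follow the text.

import gym
import numpy as np
import tensorflow as tf
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

class WaveEnv(gym.Env):
    """Hypothetical skeleton of the environment wrapping the solver of (54)."""
    def __init__(self):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(18,),
                                                dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
    def reset(self):
        return np.zeros(18, dtype=np.float32)
    def step(self, action):
        u = np.zeros(200)                  # placeholder: one explicit FD step of (54)
        obs = np.zeros(18, dtype=np.float32)
        reward = -np.linalg.norm(u)        # rt = -||u(t, x)||_2
        return obs, reward, False, {}

model = PPO2(MlpPolicy, WaveEnv(), verbose=1,
             policy_kwargs=dict(layers=[500, 500], act_fun=tf.nn.relu))
model.learn(total_timesteps=int(1e7))      # ~10 million interactions (cf. Fig. 29)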
Fig. 31 shows the same plots as Fig. 30, but testing the agent towards the end of its training.
It is remarkable that the agent ultimately achieves the control task and learns how to
cancel the disturbance, while solely resorting to trial and error. While this problem
is straightforward and amenable to analytic solutions, no simplifying assumptions on
the problem's physics have been introduced. No parametrization linking observations to
actions has been enforced.
On the other hand, the significant downside is that the training involved about 10
million interactions with the environment. In this exercise, the training required about
8 h of computational time on a personal laptop, but such a low sampling efficiency should
warn the reader about the use of RL in more complex 3D CFD simulations. Techniques such as
multi-agent frameworks (where multiple environments run simultaneously) and methods
for restricting the action space are thus necessary enablers.
44 Besides the original publication, the reader is referred to Pieter et al. (2017) for an overview.
Figure 29: Evolution of the rewards as a function of the number of simulation time steps.
The rewards are given as rt = −||u(x, t)||2 , and approach zero as the agent learns how to
cancel the incoming wave.
Figure 30: (a) Snapshot of the environment interacting with the agent at the beginning of its
training, leading to complete failure of the control task. (b) RL control law versus the optimal
control action (a harmonic).
Figure 31: Same as Figure 30, but considering the agent at the end of its training. Remarkably,
the agent learns the control law and achieves good stabilization performance. The figure on the
right shows that the agent learns a nearly optimal sinusoidal law from experience.
6 Machine Learning for Fluid Mechanics
Supervised Learning
Regression models from the machine learning literature have been extensively used
in fluid mechanics for a wide range of applications, including surrogate model-based
optimization (Kim and Boukouvala, 2019), turbulence modeling (Duraisamy et al., 2019a),
non-intrusive Reduced Order Modeling (Daniel et al., 2020; Hesthaven and Ubbiali, 2018;
Renganathan et al., 2020; Pawar et al., 2019) and system identification for prediction and
control (Pan and Duraisamy, 2018; Brunton et al., 2016; Kaiser et al., 2018; Huang and
Kim, 2008).
With the growing capacity of data-driven models, the main challenges remain those of
defining the amount of prior knowledge to be incorporated in the training phase, the
amount of required data and its preparation (nondimensionalization, normalization
and transformation), and the model's generalization outside the training range.
In turbulence modeling, machine learning techniques are increasingly used for the algebraic
modelling of the Reynolds stress tensor (Ling et al., 2016; Jiang et al., 2020; Akolekar et al.,
2018; Sotgiu et al., 2019) and for the improvement of one- or two-equation models (Parish
and Duraisamy, 2016; Holland et al., 2019). In some cases (Parish and Duraisamy, 2016;
Holland et al., 2019), the turbulence modelling focuses on operational parameters (e.g. ω
and ε in the k−ω and k−ε models) rather than purely physical ones: these quantities are
not available in high-fidelity turbulence databases and must then be inferred to obtain the
labelled data needed for any supervised approach. This field inversion process (Parish and
Duraisamy, 2016; Diez Sanhueza, 2018; Singh et al., 2017) can be rather expensive, and the
inference problem could be ill-posed in regions of weak turbulence, where the sensitivity
of the mean flow to the turbulence quantities diminishes. Despite these limitations,
field inversion was successfully applied in the works of Parish and Duraisamy (2016),
Singh et al. (2017) and Diez Sanhueza (2018) to infer
correction terms for the Spalart-Allmaras model (Singh et al., 2017), the k−ω model
(Parish and Duraisamy, 2016) and the MK turbulence model (Diez Sanhueza, 2018).
In non-intrusive ROMs and system identification, regression tools are used to construct
models in the low-dimensional space produced by a suitable dimensionality reduction
technique. These have been used for constructing predictive models of complex reactive
systems (Parente, 2020; Swischuk et al., 2020), in multidisciplinary optimization (Goertz,
2020a; Parrish et al., 2014), and for the sensor-based extraction of sparse representations
of nonlinear dynamical systems (Loiseau et al., 2018; Goertz, 2020b).
Finally, while these notes did not cover classification problems, it is worth noticing
that these are also increasingly used for the automatic classification of flow regimes (Majors,
2018; Hobold and da Silva, 2018; Kang et al., 2020).
Unsupervised Learning
Fluid mechanics has a long history (and a vast literature) in methods for dimensionality
reduction. The notion of principal components in PCA is tightly linked to the
notion of coherent structures in turbulent flows. The PCA is known in fluid mechanics
as Proper Orthogonal Decomposition, and was introduced by Lumley (1967) as a
method to identify (and define) coherent structures in a turbulent flow. These structures
are the most energy-containing ones and are thus mainly responsible for phenomena of
engineering relevance, including heat and mass transfer, noise emission or fluid-structure
interaction.
The POD has been used to construct reduced order models of fluid flows (Holmes
et al., 1997; Noack et al., 2003), to find optimally balanced control laws (Rowley,
2005; Ilak and Rowley, 2008), to perform correlation-based filtering (Mendez et al., 2017;
Raiola et al., 2015) or to identify correlations between different fields (Borée, 2003; Antoranz
et al., 2018; Ianiro, 2020), to name a few examples. An overview of POD and its
applications to fluid mechanics is provided by Holmes et al. (2012) and Dawson (2020a).
Over the last couple of decades, many variants of the POD/PCA have been developed
in fluid mechanics (Sieber et al., 2016; Mendez et al., 2019), along with alternative linear
decomposition tools. The most popular (linear) alternative is certainly the Dynamic
Mode Decomposition (DMD, Schmid (2010); Rowley et al. (2009); Schmid (2020)),
which decomposes the data as a linear combination of modes, i.e. structures with
a single complex frequency. DMD modes are the eigenvectors of the operator that best
approximates the data as a linear dynamical system45.
Most of the decomposition methods developed in fluid mechanics are linear, and the
literature has grown into a subfield of data processing often referred to as Modal Analysis
(Taira et al., 2017), where the notion of 'mode' generalizes that of coherent structure,
principal component or harmonic (Fourier) mode. Nonlinear methods of dimensionality
reduction have a comparatively much shorter history and have been popularized mostly
in the last few years. Notable examples are the use of manifold learning techniques such
as Locally Linear Embedding (Ehlert et al., 2020), cluster-based reduced-order models
(Kaiser et al., 2014) and autoencoders (Agostini, 2020; Murata et al., 2019). As
dimensionality reduction has become the building block of many other tools in
fluid mechanics, these methods are likely going to enter the toolbox of most researchers.
Reinforcement Learning and Control
The control of turbulent flows poses unique challenges to classic closed-loop methods
(Brunton and Noack, 2015). Machine learning based techniques offer a viable alternative,
bypassing the need to derive complex models, via trial and error. Control tasks can be
set in the form of optimization problems, and the use of machine learning approaches
such as Genetic Programming has been pioneered by the group of Prof. Noack (Noack,
2020; Chovet et al., 2017; Duriez et al., 2017).
The first applications of RL in fluid mechanics focused on the study of the collective
behavior of swimmers, pioneered by the group of Prof. Koumoutsakos (Wang et al.,
2018; Verma et al., 2018; Novati et al., 2017; Novati and Koumoutsakos, 2019; Novati
et al., 2019), while the first applications to flow control were presented by Pivot et al.
(2017) and by Rabault et al. (2019) (see also Rabault et al. 2020; Rabault and Kuhnle
2019, 2020). Recently, Bucci et al. (2019) used RL to control the one-dimensional Ku-
ramoto–Sivashinsky equation, while Beintema et al. (2020) used it to control heat transport
in a two-dimensional Rayleigh–Bénard system. Verma et al. (2018) used RL to study
how fish can reduce energy expenditure by schooling, while Reddy et al. (2016) used RL
to analyze how birds and gliders minimize energy expenditure by soaring, exploiting
turbulent fluctuations. With a careful choice of the reward function, Viquerat et al. (2019)
demonstrated the use of RL for shape optimization, while an application of RL for flow
control in an experimental configuration has been recently presented by Fan et al. (2020).

45 An overview of linear dynamical systems is provided by Mendez (2020b) and Dawson (2020b).
7 Conclusions and Perspectives

The dimensionality reduction exercise illustrated the three main paradigms: linear,
kernel-based nonlinear, and ANN-based nonlinear. The fluid mechanics community has
vast experience in the linear methods (and has pioneered several related decompositions)
but has limited experience with the nonlinear tools. There is no doubt that this will
change shortly, and a revolution in the field is about to set in. While nonlinear methods
are likely going to yield more efficient representations, the resulting models' interpretability
will become a significant challenge. The reduced models from linear tools can sometimes
be derived from first principles and have an intuitive interpretation, while this is not the
case for nonlinear approaches. The results achievable by kernel methods (or any other
manifold learning technique) are extremely sensitive to the hyperparameters, which have
no physical interpretation.
Tasks that do not demand such interpretations, such as filtering or data compression,
are most likely going to be the first to benefit from these nonlinear tools: most POD-based
filters, for instance, might soon be replaced by KPOD filters. Tasks such as reduced-order
modeling for prediction and control, on the other hand, still require a significant amount
of additional research. The main challenge in ROM-based control, for instance, is that of
constructing a low-dimensional space that models both the non-actuated and the actuated
flow. This often requires (1) prohibitively large training data and (2) carefully constructed
models. It is unclear whether nonlinear methods will help with (1), but they will certainly
challenge (2).
The cost of kernel-based methods grows with the number of data points, and hence
ANNs are again the only viable solution for huge datasets. All the considerations on ANNs
for regression hold for dimensionality reduction. While experimenting with the exercise's
solution, it appeared clear that the error convergence was sensitive to a wide range of
hyperparameters, and the results obtained required a fair amount of experimentation. On
a practical level, the link between the POD and the linear autoencoder brings a useful method
for computing POD-based decompositions using ANNs and their training strategies, which
are considerably more memory efficient than matrix multiplications. ANN-based methods
are also better suited for big data and 'on-line' or 'streaming' approaches.
Finally, we conclude with some remarks on the control problem solved via Reinforcement
Learning. Compared to supervised and unsupervised techniques, RL appears
somewhat less mature, with new agent-training strategies published every year. None of
the most popular RL training strategies (e.g. PPO, DDPG or A3C) used for controlling
continuous systems is older than four years. While this reflects in part the enormous
effort that the community is investing in RL, it also shows that the field is not yet
settled, and the 'best' agent-training algorithm is yet to come. To a physicist, the paradigm
of controlling a system in a purely 'black-box' approach, hence without resorting to any
physics knowledge but relying solely on trial and error, is disturbing. On the other hand, classic
methods exclusively based on physical principles have so far failed in the task of controlling
turbulent flows in complex scenarios.
Provided that a black-box approach succeeds in such tasks, we might then invest in
learning how such success was achieved. In such a case, which today appears on the
edge of science fiction for fluid mechanics, the usual roles might be inverted, and the agent
would become our supervisor. This is already happening in board games, where agents
have reached superhuman performances and are now teaching us to play better. But the
essential obstacle, today, is that the agent usually needs to interact with the environment
millions of times (with no guarantees of success). This is no problem with a board game,
but it is a severe limitation when every interaction requires a costly experiment or simulation.
A Learning Machine Learning
• Oliver Theobald, Machine Learning For Absolute Beginners: A Plain English In-
troduction, 2018.
Essential Books
• Aurélien Géron, Hands-On Machine Learning with Scikit-Learn, Keras, and Ten-
sorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, O'Reilly,
2019 (see References)
• Bishop, C. M., Pattern Recognition and Machine Learning, Springer (see References).
References
Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H.-T. (2012). Learning from Data.
AMLBook.
Agostini, L. (2020). Exploration and prediction of fluid dynamical systems using auto-
encoder technology. Physics of Fluids, 32(6):067103.
Akolekar, H. D., Weatheritt, J., Hutchins, N., Sandberg, R. D., Laskowski, G., and
Michelassi, V. (2018). Development and use of machine-learnt algebraic Reynolds stress
models for enhanced prediction of wake mixing in LPTs. In Turbo Expo: Power for Land,
Sea, and Air, volume 51012, page V02CT42A009. American Society of Mechanical
Engineers.
Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. (2020). Second order optimiza-
tion made practical.
Antoranz, A., Ianiro, A., Flores, O., and Villalba, M. (2018). Extended proper orthogonal
decomposition of non-homogeneous thermal fields in a turbulent pipe flow. Int. J. Heat
Mass Transfer, 118:1264–1275.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). Deep
reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38.
Bakir, G., Weston, J., and Schölkopf, B. (1999). Learning to find pre-images. In Proceed-
ings of the 16th International Conference on Neural Information Processing Systems.
Beintema, G., Corbetta, A., Biferale, L., and Toschi, F. (2020). Controlling
Rayleigh–Bénard convection via reinforcement learning. Journal of Turbulence, pages
1–21.
Bianchi, F. M., Maiorino, E., Kampffmeyer, M. C., Rizzi, A., and Jenssen, R. (2017).
Recurrent Neural Networks for Short-Term Load Forecasting. Springer International
Publishing.
Bird, R. B., Stewart, W. E., and Lightfoot, E. N. (2006). Transport Phenomena. John
Wiley & Sons Inc.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
Boser, B. E., Guyon, I. M., and Vapnik, V. N. (1992). A training algorithm for opti-
mal margin classifiers. In Proceedings of the fifth annual workshop on Computational
learning theory - COLT 92. ACM Press.
Brenner, M. P., Eldredge, J. D., and Freund, J. B. (2019). Perspective on machine learning
for advancing fluid mechanics. Physical Review Fluids, 4(10).
Brockwell, P. J. and Davis, R. A. (1991). Time Series: Theory and Methods. Springer
New York.
Brunton, S. L., Noack, B. R., and Koumoutsakos, P. (2020). Machine learning for fluid
mechanics. Annual Review of Fluid Mechanics, 52(1):477–508.
Brunton, S. L., Proctor, J. L., and Kutz, J. N. (2016). Discovering governing equations
from data by sparse identification of nonlinear dynamical systems. Proceedings of the
National Academy of Sciences, 113(15):3932–3937.
Bucci, M. A., Semeraro, O., Allauzen, A., Wisniewski, G., Cordier, L., and Mathelin, L.
(2019). Control of chaotic systems by deep reinforcement learning. Proceedings of the
Royal Society A, 475(2231):20190351.
Chovet, C., Lippert, M., Keirsbulck, L., Noack, B. R., and Foucaut, J.-M. (2017). Machine
learning control for experimental turbulent flow targeting the reduction of a recircula-
tion bubble. In ASME 2017 Fluids Engineering Division Summer Meeting. American
Society of Mechanical Engineers.
Daniel, T., Casenave, F., Akkari, N., and Ryckelynck, D. (2020). Model order reduction
assisted by deep neural networks (ROM-net). Advanced Modeling and Simulation in
Engineering Sciences, 7(1).
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014).
Identifying and attacking the saddle point problem in high-dimensional non-convex
optimization.
Dawson, S. (2020a). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter The Proper Orthogonal Decomposition. Cambridge University
Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brun-
ton. Also available as online lecture from the VKI Lecture series ’Machine Learning for
Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=TcqBbtWTcIc.
Dawson, S. (2020b). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Linear Dynamical Systems and Control. Cambridge University
Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve
Brunton. Also available as online lecture from the VKI Lecture series ’Machine Learn-
ing for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=Y5jWRnya3ds&
feature=emb_logo.
Diez Sanhueza, R. (2018). Machine learning for RANS turbulence modelling of variable
property flows.
Duraisamy, K., Iaccarino, G., and Xiao, H. (2019a). Turbulence modeling in the age of
data. Annual Review of Fluid Mechanics, 51:357–377.
Duraisamy, K., Iaccarino, G., and Xiao, H. (2019b). Turbulence modeling in the age of
data. Annual Review of Fluid Mechanics, 51(1):357–377.
Duriez, T., Brunton, S. L., and Noack, B. R. (2017). Machine Learning Control – Taming
Nonlinear Dynamics and Turbulence. Springer International Publishing.
Ehlert, A., Nayeri, C. N., Morzynski, M., and Noack, B. R. (2020). Locally linear embed-
ding for transient cylinder wakes.
Fan, D., Yang, L., Triantafyllou, M. S., and Karniadakis, G. E. (2020). Reinforcement
learning for active flow control in experiments. arXiv preprint arXiv:2003.03419.
Farcomeni, A. and Greco, L. (2015). Robust Methods for Data Reduction. Taylor &
Francis Inc.
Fiore, M., Kolozar, L., Mendez, M., van Beeck, J., and Bartosiewicz, Y. (24-28 feb. 2020).
Thermal turbulence modelling of liquid metal flows using artificial neural networks. In
Lecture Series: Machine Learning for Fluid Mechanics: Analysis, Modeling, Control
and Closures.
François-Lavet, V., Henderson, P., Islam, R., Bellemare, M. G., and Pineau, J. (2018).
An introduction to deep reinforcement learning. Foundations and Trends in Machine
Learning, 11(3-4):219–354.
Frank, M., Drikakis, D., and Charissis, V. (2020). Machine-learning methods for compu-
tational science and engineering. Computation, 8(1):15.
Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applica-
tion. ASA-SIAM Series on Statistics and Applied Probability).
Géron, A. (2019). Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow.
O’Reilly UK Ltd.
Goertz, S. (2020a). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Reduced-Order Modeling for Aerodynamic Applications and
MDO. Cambridge University Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro
and Bernd Noack and Steve Brunton. Also available as online lecture from the VKI
Lecture series ’Machine Learning for Fluid Mechanics, 2020’: https://www.youtube.
com/watch?v=JUqNMjVCR_k&feature=emb_logo.
Goertz, S. (2020b). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Methods for System Identification. Cambridge University Press;
Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brunton.
Also available as online lecture from the VKI Lecture series ’Machine Learning for Fluid
Mechanics, 2020’: https://www.youtube.com/watch?v=TL86S3mmlqg&feature=emb_
logo.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. the MIT Press.
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
Courville, A., and Bengio, Y. (2014). Generative adversarial networks.
Hanin, B. (2019). Universal function approximation by deep neural nets with bounded
width and ReLU activations. Mathematics, 7(10):992.
Haykin, S. (1998). Neural Networks: A Comprehensive Foundation. Prentice Hall; 2nd
Edition.
Hernandez-Leal, P., Kartal, B., and Taylor, M. E. (2019). A survey and critique of
multiagent deep reinforcement learning. Autonomous Agents and Multi-Agent Systems,
33(6):750–797.
Hesthaven, J. and Ubbiali, S. (2018). Non-intrusive reduced order modeling of nonlinear
problems using neural networks. Journal of Computational Physics, 363:55–78.
Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhari-
wal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman,
J., Sidor, S., and Wu, Y. (2018). Stable baselines. https://github.com/hill-a/
stable-baselines.
Hobold, G. M. and da Silva, A. K. (2018). Machine learning classification of boiling
regimes with low speed, direct and indirect visualization. International Journal of Heat
and Mass Transfer, 125:1296–1309.
Holland, J. R., Baeder, J. D., and Duraisamy, K. (2019). Towards integrated field inversion
and machine learning with embedded neural networks for rans modeling. In AIAA
Scitech 2019 Forum, page 1884.
Holmes, P., Lumley, J. L., Berkooz, G., and Rowley, C. (2012). Turbulence, Coherent
Structures, Dynamical Systems and Symmetry. Cambridge University Press, 2nd edi-
tion.
Holmes, P. J., Lumley, J. L., Berkooz, G., Mattingly, J. C., and Wittenberg, R. W. (1997).
Low-dimensional models of coherent structures in turbulence. Phys. Rep., 287(4):337–
384.
Huang, S.-C. and Kim, J. (2008). Control and system identification of a separated flow.
Physics of Fluids, 20(10):101509.
Ianiro, A. (2020). Data Driven Fluid Mechanics: Combining First Principles and Machine
Learning, chapter Applications and Good Practice. Cambridge University Press; Ed.
Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brunton. Also
available as online lecture from the VKI Lecture series ’Machine Learning for Fluid
Mechanics, 2020’: https://www.youtube.com/watch?v=H6twKFTCd2k&feature=emb_
logo.
Ilak, M. and Rowley, C. W. (2008). Modeling of transitional channel flow using balanced
proper orthogonal decomposition. Physics of Fluids, 20(3):034103.
Jiang, C., Mi, J., Laima, S., and Li, H. (2020). A novel algebraic stress model with
machine-learning-assisted parameterization. Energies, 13(1):258.
Jiménez, J. (2020b). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter The Computer as Turbulence Researcher. Cambridge Univer-
sity Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve
Brunton. Also available as online lecture from the VKI Lecture series ’Machine Learning
for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=i6lbZkK8rVI.
Kaiser, E., Kutz, J. N., and Brunton, S. L. (2018). Sparse identification of nonlinear
dynamics for model predictive control in the low-data limit. Proceedings of the Royal
Society A: Mathematical, Physical and Engineering Sciences, 474(2219):20180335.
Kaiser, E., Noack, B. R., Cordier, L., Spohn, A., Segond, M., Abel, M., Daviller, G., Östh,
J., Krajnović, S., and Niven, R. K. (2014). Cluster-based reduced-order modelling of a
mixing layer. Journal of Fluid Mechanics, 754:365–414.
Kang, M., Hwang, L. K., and Kwon, B. (2020). Machine learning flow regime classification
in three-dimensional printed tubes. Physical Review Fluids, 5(8).
Kawamura, H., Abe, H., and Shingai, K. (2000). DNS of turbulence and heat transport in
a channel flow with different Reynolds and Prandtl numbers and boundary conditions.
Turbulence, Heat and Mass Transfer, 3:15–32.
Mendez, M. A. (2020b). Data Driven Fluid Mechanics: Combining First Principles and
Machine Learning, chapter Mathematical Tools, Part I: Continuous and Discrete LTI
Systems. Cambridge University Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro
and Bernd Noack and Steve Brunton. Also available as online lecture from the VKI
Lecture series ’Machine Learning for Fluid Mechanics, 2020’: https://www.youtube.
com/watch?v=qvZmKr6fhW4&feature=emb_logo.
Mendez, M. A., Balabane, M., and Buchlin, J.-M. (2019). Multi-scale proper orthogonal
decomposition of complex fluid flows. Journal of Fluid Mechanics, 870:988–1036.
Mendez, M. A., Hess, D., Watz, B. B., and Buchlin, J.-M. (2020). Multiscale proper
orthogonal decomposition (mPOD) of TR-PIV data—a case study on stationary and
transient cylinder wake flows. Measurement Science and Technology, 31(9):094014.
Müller, S. D., Milano, M., and Koumoutsakos, P. (1999). Application of machine learning
algorithms to flow modeling and optimization.
Murata, T., Fukami, K., and Fukagata, K. (2019). Nonlinear mode decomposition with
convolutional neural networks for fluid dynamics. Journal of Fluid Mechanics, 882.
Nasraoui, O. and Ben N'Cir, C.-E., editors (2019). Clustering Methods for Big
Data Analytics. Springer International Publishing.
Nesterov, Y. (1983). A method for unconstrained convex minimization problem with the
rate of convergence O(1/k²). Doklady AN USSR, 269:543–547.
Nguyen, T. T., Nguyen, C. M., Nguyen, D. T., Nguyen, D. T., and Nahavandi, S. (2019).
Deep learning for deepfakes creation and detection.
Noack, B. R. (2020). Machine Learning for Turbulence Control, chapter Machine Learning
for Turbulence Control. Cambridge University Press; Ed. Miguel Alfonso Mendez and
Andrea Ianiro and Bernd Noack and Steve Brunton. Also available as online lecture
from the VKI Lecture series ’Machine Learning for Fluid Mechanics, 2020’: https:
//www.youtube.com/watch?v=7rT1Hjs5poc&feature=emb_logo.
Noack, B. R., Afanesiev, K., Morzyński, M., Tadmor, G., and Thiele, F. (2003). A
hierarchy of low-dimensional models for the transient and post-transient cylinder wake.
Journal of Fluid Mechanics, 497:335–363.
Noack, B. R., Ehlert, A., Nayeri, C. N., and Morzyński, M. (2020). Data Driven Fluid
Mechanics: Combining First Principles and Machine Learning, chapter Analysis, Mod-
eling and Control of the Cylinder Wake. Cambridge University Press; Ed. Miguel Al-
fonso Mendez and Andrea Ianiro and Bernd Noack and Steve Brunton. Also available
as online lecture from the VKI Lecture series ’Machine Learning for Fluid Mechanics,
2020’: https://www.youtube.com/watch?v=iehMMhDqmys&feature=emb_logo.
Novati, G. and Koumoutsakos, P. (2019). Remember and forget for experience replay. In
Proceedings of the 36th International Conference on Machine Learning.
Novati, G., Mahadevan, L., and Koumoutsakos, P. (2019). Controlled gliding and perching
through deep-reinforcement-learning. Phys. Rev. Fluids, 4(9).
Novati, G., Verma, S., Alexeev, D., Rossinelli, D., van Rees, W. M., and Koumoutsakos,
P. (2017). Synchronisation through learning for two self-propelled swimmers. Bioinspir.
Biomim., 12(3):036001.
Chapelle, O., Schölkopf, B., and Zien, A., editors (2006). Semi-Supervised Learning. MIT Press
(Adaptive Computation and Machine Learning series).
Paolella, M. S. (2018). Linear Models and Time-Series Analysis. John Wiley & Sons, Inc.
Parente, A. (2020). Data Driven Fluid Mechanics: Combining First Principles and Ma-
chine Learning, chapter Advancing Reacting Flow Simulations with Data-Driven Mod-
els: Chemistry Accelerations and Reduced-Order Modelling. Cambridge University
Press; Ed. Miguel Alfonso Mendez and Andrea Ianiro and Bernd Noack and Steve
Brunton. Also available as online lecture from the VKI Lecture series ’Machine Learn-
ing for Fluid Mechanics, 2020’: https://www.youtube.com/watch?v=Ys5_0YY730M&
feature=emb_logo.
Parrish, J., Rais-Rohani, M., and Janus, J. M. (2014). Reduced order techniques for
sensitivity analysis and design optimization of aerospace systems. In 10th AIAA Mul-
tidisciplinary Design Optimization Conference. American Institute of Aeronautics and
Astronautics.
Pawar, S., Rahman, S. M., Vaddireddy, H., San, O., Rasheed, A., and Vedula, P. (2019).
A deep learning enabler for nonintrusive reduced order modeling of fluid flows. Physics
of Fluids, 31(8):085101.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning
in Python. Journal of Machine Learning Research, 12:2825–2830.
Pieter, A., Duan, Y., Chen, X., and Karpathy, A. (2017). Deep RL bootcamp. https:
//sites.google.com/view/deep-rl-bootcamp/lectures. Accessed: 2020-09-1.
Pino, F., Mendez, M. A., and Benoit, S. (24-28 feb. 2020). Feedback control of liquid
metal coating. In Lecture Series: Machine Learning for Fluid Mechanics: Analysis,
Modeling, Control and Closures.
Pivot, C., Cordier, L., and Mathelin, L. (2017). A continuous reinforcement learning
strategy for closed-loop control in fluid dynamics. In 35th AIAA Applied Aerodynamics
Conference. American Institute of Aeronautics and Astronautics.
Popper, K. R. (2002). The Logic of Scientific Discovery. Taylor & Francis Ltd.
Rabault, J., Kuchta, M., Jensen, A., Réglade, U., and Cerardi, N. (2019). Artificial
neural networks trained through deep reinforcement learning discover control strategies
for active flow control. Journal of Fluid Mechanics, 865:281–302.
Rabault, J. and Kuhnle, A. (2020). Data Driven Fluid Mechanics: Combining First
Principles and Machine Learning, chapter Deep Reinforcement Learning Applied to
Active Flow Control. Cambridge University Press; Ed. Miguel Alfonso Mendez and
Andrea Ianiro and Bernd Noack and Steve Brunton.
Rabault, J., Ren, F., Zhang, W., Tang, H., and Xu, H. (2020). Deep reinforcement
learning in fluid mechanics: A promising method for both active flow control and shape
optimization. Journal of Hydrodynamics, 32(2):234–246.
Raiola, M., Discetti, S., and Ianiro, A. (2015). On PIV random error minimization with
optimal POD-based low-order reconstruction. Experiments in Fluids, 56(4).
Raissi, M., Yazdani, A., and Karniadakis, G. E. (2020). Hidden fluid mechanics: Learning
velocity and pressure fields from flow visualizations. Science, 367(6481):1026–1030.
Raschka, S. and Mirjalili, V. (2019). Python Machine Learning, Third Edition. Packt
Publishing.
Reddy, G., Celani, A., Sejnowski, T. J., and Vergassola, M. (2016). Learning to soar in tur-
bulent environments. Proceedings of the National Academy of Sciences, 113(33):E4877–
E4884.
Renganathan, S. A., Maulik, R., and Rao, V. (2020). Machine learning for nonintrusive
model order reduction of the parametric inviscid transonic flow past an airfoil. Physics
of Fluids, 32(4):047110.
Reynolds, A. (1975). The prediction of turbulent Prandtl and Schmidt numbers. International Journal of Heat and Mass Transfer, 18(9):1055–1069.
Rowley, C. W. (2005). Model reduction for fluids, using balanced proper orthogonal decomposition. International Journal of Bifurcation and Chaos, 15(03):997–1013.
Rowley, C. W., Mezić, I., Bagheri, S., Schlatter, P., and Henningson, D. (2009). Spectral analysis of nonlinear flows. Journal of Fluid Mechanics, 641:115–127.
Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 3(3):210–229.
Schölkopf, B., Smola, A., and Müller, K.-R. (1997). Kernel principal component analysis.
In Lecture Notes in Computer Science, pages 583–588. Springer Berlin Heidelberg.
Schmid, P. (2020). Data Driven Fluid Mechanics: Combining First Principles and Machine Learning, chapter The Dynamic Mode Decomposition: From Koopman Theory to Applications. Cambridge University Press; Eds. Miguel Alfonso Mendez, Andrea Ianiro, Bernd Noack, and Steve Brunton. Also available as an online lecture from the VKI Lecture Series 'Machine Learning for Fluid Mechanics, 2020': https://www.youtube.com/watch?v=xAYimi7x4Lc&feature=emb_logo.
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. (2017). DBSCAN revisited,
revisited. ACM Transactions on Database Systems, 42(3):1–21.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sieber, M., Paschereit, C. O., and Oberleithner, K. (2016). Spectral proper orthogonal decomposition. Journal of Fluid Mechanics, 792:798–828.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144.
Sotgiu, C., Weigand, B., Semmler, K., and Wellinger, P. (2019). Towards a general data-driven explicit algebraic Reynolds stress prediction framework. International Journal of Heat and Fluid Flow, 79:108454.
Swets, D. and Weng, J. (1996). Using discriminant eigenfeatures for image retrieval. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 18(8):831–836.
Swischuk, R., Kramer, B., Huang, C., and Willcox, K. (2020). Learning physics-based
reduced-order models for a single-injector combustion process. AIAA Journal, 58(6).
Taira, K., Brunton, S. L., Dawson, S. T. M., Rowley, C. W., Colonius, T., McKeon, B. J.,
Schmidt, O. T., Gordeyev, S., Theofilis, V., and Ukeiley, L. S. (2017). Modal analysis
of fluid flows: An overview. AIAA J., 55(12):4013–4041.
Verma, S., Novati, G., and Koumoutsakos, P. (2018). Efficient collective swimming by
harnessing vortices through deep reinforcement learning. Proceedings of the National
Academy of Sciences, 115(23):5849–5854.
Viquerat, J., Rabault, J., Kuhnle, A., Ghraieb, H., Larcher, A., and Hachem, E.
(2019). Direct shape optimization through deep reinforcement learning. arXiv preprint
arXiv:1908.09885.
Cherkassky, V. and Mulier, F. M. (2008). Learning from Data: Concepts, Theory, and Methods. John Wiley & Sons.
Wang, L., Fortunati, S., Greco, M. S., and Gini, F. (2018). Reinforcement learning-
based waveform optimization for MIMO multi-target detection. In 2018 52nd Asilomar
Conference on Signals, Systems, and Computers. IEEE.
Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge.
Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences. PhD thesis, Dept of Applied Mathematics, Harvard University.
Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M. W. (2020). AdaHessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.