0% found this document useful (0 votes)
136 views103 pages

Generative Modelling With Tensor Networks

Research on generative modelling using tensor networks algorithms

Uploaded by

Martin Molnar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
136 views103 pages

Generative Modelling With Tensor Networks

Research on generative modelling using tensor networks algorithms

Uploaded by

Martin Molnar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 103

Generative machine learning with tensor networks

Martin Molnar

May 25, 2023


Generative machine learning with tensor
networks

Martin Molnar

Supervisor Prof. Frank Verstraete


Department of Physics and Astronomy
Ghent University

Supervisor Prof. Jutho Haegeman


Department of Physics and Astronomy
Ghent University

Counsellor Dr. Maarten Van Damme

May 25, 2023


Martin Molnar
Generative machine learning with tensor networks
May 25, 2023
Supervisors: Prof. Frank Verstraete and Prof. Jutho Haegeman
Counsellor: Dr. Maarten Van Damme
Acknowledgement

The completion of this master’s thesis signifies the culmination of my academic


journey in Ghent, and its accomplishment would have been impossible without the
invaluable support of specific individuals.

First of all, I wish to express my gratitude to Prof. Frank Verstraete and Prof. Jutho
Haegeman, my supervisors at the Quantum Group, for granting me the opportunity
to undertake my master’s thesis in their research group. I would like to extend a
special appreciation to Prof. Jutho Haegeman for his dedicated time and invaluable
guidance throughout the entire process of my thesis. I am also grateful for his advice
that greatly influenced the refinement of my numerous drafts.

I would like to express my sincere gratitude to Dr. Maarten Van Damme, my


counselor, for his dedication, and enthusiasm. His valuable time and numerous ideas
have significantly contributed to my thesis. Furthermore, I would like to extend my
appreciation to the members of the Quantum Group for their warm reception and
willingness to provide assistance throughout my thesis journey when needed.

I also want to extend a special thanks to Céline Molnar and Sébastien Van Laecke for
their assistance in meticulously identifying typos and helping to correct my written
work.

I would also like to express my heartfelt gratitude to Doriane for her unwavering
support and encouragement throughout the whole academic year.

Lastly, I would like to express my deepest gratitude to my family for their unwavering
presence and support. I am particularly grateful to my grandfather, who nurtured
my curiosity during my childhood, which undoubtedly played a significant role
in shaping my decision to study Engineering physics. I would also like to extend
my thanks to my sisters, Lucie and Camille, for their constant support and for the
delightful companionship they provided at home. Lastly, but certainly not least, I
want to express my deepest gratitude to my parents, Nadine and Michel, who are
the epitome of love and hard work. I am immensely thankful for their unwavering
support and belief in me throughout the years. They have not only allowed me
to pursue my passion in studying Engineering Physics, but also instilled in me the

v
invaluable lessons of hard work. I will forever cherish their guidance and I am
eternally grateful for the countless sacrifices they have made. Merci à vous.

vi
Admission to Loan

The author gives permission to make this master dissertation available for consul-
tation and to copy parts of this master dissertation for personal use. In all cases of
other use, the copyright terms have to be respected, in particular with regard to
the obligation to state explicitly the source when quoting results from this master
dissertation.

Martin Molnar, May 2023

vii
Declaration

This master’s dissertation is part of an exam. Any comments formulated by the


assessment committee during the oral presentation of the master’s dissertation are
not included in this text.

ix
Generative machine learning with tensor
networks
Martin Molnar
Student number: 01700533

Supervisors: Prof. Frank Verstraete, Prof. Jutho Haegeman


Consellor: Dr. Maarten Van Damme

Master’s dissertation submitted in order to obtain the academic degree of


Master of Science in Engineering Physics

Faculty of Engineering and Architecture: Academic year 2022-2023

Abstract In conventional machine learning approaches, discriminative tasks aim to


model the conditional probability of obtaining a specific label given a specific input
feature. In contrast, the field of generative modeling considers that the data points
that constitutes a dataset can be regarded as samples extracted from an unknown
underlying data distribution. The main objective of generative models is to model
this underlying distribution. The probabilistic formulation of quantum mechanics
provides a framework for modeling probability distributions using quantum states.
This formulation can serve as an inspiration for constructing generative models
in a similar way. This work focus on a model that uses Matrix Product States, a
(one-dimensional) tensor network, as a quantum states to model the underlying
probability distribution of a dataset. The algorithm used to optimized the model
reassembles the two-site Density Matrix Renormalization Group algorithm. The
model also benefits from a direct sampling method to efficiently generate new
samples. Considering the one-dimensional nature of language, this work investigate
the applicability of MPS to model underlying probability distributions existing in
textual data. More specifically, it is applied for the purpose of text generation and
language recognition. Generating text with MPS remains a challenging task and the
MPS-based model shows a poor ability to generate text as well as a clear lack of
semantic understanding. However, it seems that the MPS representation is still able
to learn the specificities of a language and be used as a language recognition model
with a high accuracy.

Keywords Generative Modelling, Tensor Networks (TN), Matrix Product States


(MPS), Natural Language Processing (NLP)

xi
1

Generative machine learning with tensor networks


Martin Molnar
Supervisors: Prof. Frank Verstraete, Prof. Jutho Haegeman
Counsellor: Dr. Maarten Van Damme

Abstract—In conventional machine learning approaches, data can be regarded as samples extracted from a genuine real-
discriminative tasks aim to model the conditional probability world distribution, x ∼ Pd (x), commonly referred to as the data
of obtaining a specific label given a specific input feature. distribution. The fundamental objective of all generative models
In contrast, the field of generative modeling considers that is to model the data distribution to some extent. The literature
the data points that constitutes a dataset can be regarded regarding generative models roughly subdivides the field in four
as samples extracted from an unknown underlying data main groups [1]: Auto-regressive generative models (ARM), Flow-
distribution. The main objective of generative models is to based models, Latent variable models (e.g. Generative Adversial
model this underlying distribution. The probabilistic for- Networks (GAN) or Variational Auto Encoder (VAE)) and Energy-
mulation of quantum mechanics provides a framework for based models (e.g. Boltzmann Machines).
modeling probability distributions using quantum states. This There exists a significant synergy between the fields of physics
formulation can serve as an inspiration for constructing and machine learning, wherein each has a substantial influence
generative models in a similar way. This work focus on a on the other. On the one hand, machine learning has emerged
model that uses Matrix Product States, a (one-dimensional) as an important tool in physics and other related fields [3]–
tensor network, as a quantum states to model the underlying [6]. On the other hand, ideas from the field of physics can
probability distribution of a dataset. The algorithm used to serve as inspiration for the developments of new, physics-inspired,
optimized the model reassembles the two-site Density Matrix learning schemes. Examples of this are the Hopfield model [7] and
Renormalization Group algorithm. The model also benefits Boltzmann machines [8] which are closely related to the Ising
from a direct sampling method to efficiently generate new model and its inverse version. Another type of generative model
samples. Considering the one-dimensional nature of language, that draws inspiration from physics is the Born machine, where
this work investigate the applicability of MPS to model under- the probability distribution is modeled by a quantum state wave
lying probability distributions existing in textual data. More function Ψ and is given by its squared amplitude according to
specifically, it is applied for the purpose of text generation Born’s rule. There exists different Ansätze in quantum mechanics
and language recognition. Generating text with MPS remains to represent quantum states. This work focuses on models that use
a challenging task and the MPS-based model shows a poor Tensor Networks (TN) states as a parametrization class to model
ability to generate text as well as a clear lack of semantic data distributions.
understanding. However, it seems that the MPS representation Tensor networks (TN) [9]–[14] are variational quantum states
is still able to learn the specificities of a language and be used initially proposed to describe ground states of many-body systems.
as a language recognition model with a high accuracy. They have already been used in the context of machine learning.
Notably, they have been employed for various learning tasks,
Keywords— Generative Modelling, Tensor Networks (TN), such as classification [15], [16], for certain unsupervised learning
Matrix Product States (MPS), Natural Language Processing problems [17], [18], for the compression of neural networks [19],
(NLP) for reinforcement learning [20] and also to model the underlying
distribution of datasets [21]–[23].
I. I NTRODUCTION This work employs a learning scheme which is based on
Matrix Product States (MPS) as described in Ref. [21]. MPS
In order to build reliable decision-making AI systems, dis- refers to a specific type of Tensor Network where the tensors are
crimative models alone are not sufficient. In addition to requiring interconnected in a one-dimensional fashion. This arrangement
extensive labeled datasets, which may be limited in availability, is also referred to as the tensor train decomposition within the
discriminative models often lack a semantic comprehension of mathematical community. Learning is achieved by optimizing
their environment and also lack the aptitude to express uncertainty the MPS using an algorithm that resembles the two-site DMRG
concerning their decision making [1]. Generative models could algorithm [24]. It has been shown that the MPS-model exhibits a
potentially present themselves as a solution to these concerns. strong learning ability and also additionally benefits from a direct
Additionally, they might make it easier to assess the effectiveness sampling method for the generation of new samples that is more
of ML models. Indeed, classifiers frequently produce identical efficient than other traditional generative models. The generative
results for a wide variety of inputs, making it challenging to deter- model based on Matrix Product States, will be employed to
mine exactly what the model has learned [2]. Unlike discrimative capture the underlying distributions in textual data, serving as
models which aim at modelling a conditional probability distribu- a foundation for text generation. Moreover, this model will be
tion P (y|x) where y is a target value or label of a data point x, used for constructing a language classifier, enabling accurate
generative models try to model the joint data distribution P (x). categorization of linguistic content in three different languages
At the core of generative models lies the fundamental concept that (English, French and Dutch).
2

The rest of this paper is organized as follows: In Section II the in this manifold possess unique entanglement characteristics.
fundamental principles of Tensor Networks are presented within In fact, it can be demonstrated that low-energy eigenstates of
the framework of many-body physics. Section III presents the gapped Hamiltonians with local interactions follow the area-law
MPS-based learning scheme (initially proposed in Ref. [21]) as for entanglement entropy [27] and those states are thereby heavily
well as a direct sampling procedure to draw samples from the constrained by the local behavior of physical interactions [13]. The
MPS. Section IV displays the results achieved in text generation states that satisfy the area-law for entanglement are precisely the
and language classification tasks utilizing an MPS-based genera- ones targeted by tensor network states. This implies that tensor
tive framework. Subsequently, Section V discusses the potential networks specifically focus on the crucial corner of the Hilbert
further research areas that worth exploring to model textual data space where physical states reside.
using MPS. Matrix Product States (MPS) are one specific type of tensor net-
works and they have been successfully used to describe quantum
II. T ENSOR N ETWORKS ground states (especially) for physical systems in one dimension.
Quantum many-body physics aims at describing and under- In a MPS the rank-N coefficient tensor is expressed as a product
standing systems that consists of many interacting particles, that of lower rank tensors:
are subjected to the Schrödinger Equation. Although the last
XX
|ψ⟩ = A1j1 α1 A2α1 j2 α2 A3α2 j3 α3 ...AN
αN −1 jN |j1 j2 ...jN ⟩ (2)
decades have seen a tremendous amount of developments in {ji } {αi }
theoretical and numerical methods for the treatment of many-
body systems, the many-body quantum problem remains one of which can be expressed by means of the tensor diagrammatic
the most challenging problems in physics. One of the postulates notation1 as:
of quantum theory is that quantum states can be represented
A1 A2 A3 AN −1 AN
by state vectors in a Hilbert space [25]. In a quantum system, (3)
interacting particles can be in a superposition and one of the v1 v2 v3 vN −1 vN
consequences of this superposition principle is that the dimension
of the Hilbert space grows exponentially with the number of By convention, a tensor is represented in pictorial language by
particles in system. For instance, consider a general quantum an arbitrary shape or block (tensor shapes can have a certain
many-body system consisting of N particles. The system can be meaning depending on the context) with emerging lines, where
described by the following wavefunction: each line corresponds to one of the tensor’s indices. An index that
X is shared by two tensors designates a contraction between them
|ψ⟩ = Ψj1 j2 ...jN |j1 ⟩ ⊗ |j2 ⟩ ⊗ ... ⊗ |jN ⟩ (1) (i.e. a sum over the values of that index), in pictorial language
j1 j2 ...jN
that is represented by a line connecting two blocks.
where each |ji ⟩ represents the basis of an individual particle
with label i and with dimension d (for simplicity, the dimensions III. MPS- BASED GENERATIVE MODEL
of the individual particle’s basis are assumed to be equal). The
Considering the inherent similarities between the task of gen-
number of possible configurations of the system and hence the
erative modelling and quantum physics, the application of tensor
dimension of the Hilbert space is dN [10]. The wavefunction of the
networks to effectively capture underlying probability distributions
many-body system is expressed as a linear combination of all the
of datasets is not unexpected. Indeed, both fields strive to model
possible configuration where each configuration is weighted with a
a probability distribution within a vast parameter space. Further-
specific coefficient. The coefficients corresponding to the different
more, the pertinent configurations encompass only a small fraction
dN configurations can be represented using the coefficient tensor
of the exponentially vast parameter space. Consider, for example,
Ψj1 ,j2 ,...,jN that contains dN parameters. This tensor has N d-
the case of images, where the majority of pixel value combinations
dimensional entries and fully determines the wavefunction of the
are regarded as noise, while the actual images constitute a small
quantum system. Describing the wavefunction by specifying the
subset of feasible configurations. This has again some striking
coefficients related to all possible configurations of the system
similarities with the field of quantum physics, wherein the relevant
is clearly an inefficient way to deal with large systems. The
states exist within a tiny manifold of the Hilbert space. The
solution proposed by tensor network approaches is to represent
success of tensor networks in efficiently parameterizing such
and approximate the coefficient tensor Ψj1 ,j2 ,...,jN by a network
states, suggests that they could be useful for generative modeling.
of interconnected lower-rank tensors with a given structure. The
In this section, the MPS-based generative model used in this work
structure of a tensor network is determined by the entanglement
is presented. It was initially proposed in Ref. [21].
patterns between the local degrees of freedom within the quantum
system it represents.
Tensor Network methods have been successfully use to de- A. Model representation
scribe quantum ground states in quantum many-body systems. Consider a datasets T consisting of |T | N -dimensional data
The reason behind their success is their efficient parametrization ⊗N
points vi ∈ V = {0, 1, ..., p} where the entries of the vectors
of quantum systems (i.e. the number of parameters scales only vi can take p different values. The data points vi are potentially
polynomially with N [26]), which focuses specifically on a special repeated in the dataset and can be mapped to basis vectors of
group of states within the Hilbert space. In quantum many-body a Hilbert space of dimension pN . The datasets can be seen
systems, the presence of certain structures within the Hamiltonians as a collection of samples extracted from the underlying data
(e.g. the local character of physical interactions found in many
physical Hamiltonians) is such that the physical states exist in a 1
See Ref. [10] for a comprehensive introduction to the tensor diagram-
tiny manifold of the gigantic Hilbert space. The states located matic notation.
3

distribution Pd (v), which is unknown. The approach followed C. Sampling


in this work bears strong similarities to the manner in which The generation of new samples is a difficult task for tradi-
probability distributions are represented in quantum mechanics. tional generative models as it often involves dealing with the
From a practical standpoint, quantum mechanics models the intractability of the partition function. For instance, energy-based
wavefunction Ψ(v), whose squared norm provides the probability models like Boltzmann machines often rely on Markov Chain
distribution in accordance with Born’s rule: Monte Carlo (MCMC) methods to generate new samples [33].
One of the advantages of MPS-based generative models is that
|Ψ(v)|2
P (v) = (4) independent samples can directly be drawn from the model
Z
probability distributions, alleviating thereby the need for MCMC
where Z = v∈V |Ψ(v)|2 is a normalization factor, also referred
P methods. The generating process is done site by site and starts at
to as the partition function. The generative models that leverage one end of the MPS. Consider that the first feature to be sampled is
this probabilistic interpretation are referred to as Born machines the N -th feature. This canPdirectly be obtained from the marginal
[28]. One advantage of this probabilistic formulation is that it probability Pm (vN ) = v1 ,v2 ,...,vN −1 Pm (v). If the MPS is in
ensures the non-negativity of the probability distribution. In this the left-canonical form then it can be expressed in tensor diagram
work, the wavefunction is parameterized using the MPS form notation as follows:
with real-valued parameters and is subjected to open boundary
conditions. This corresponds to the MPS given in Equation 3. AN

One of the advantages of this formulation is that the partition vN


function Z given by: Pm (vN ) = (7)
vN

AN
(5)

where the MPS was assumed to be normed. The orange circles


can be exactly and efficiently computed by leveraging the relevant represent the one-hot encoded value of vN which can take p
contraction schemes [10]. In fact computing the partition function different values (i.e. vn ∈ {1, 2, ..., p} , ∀n ∈ {1, 2, ..., N }). From
a significant obstacle that is frequently faced in the field of a practical point of view, Pm (vN ) is computed for vN = 1, 2, ..., p
generative modelling. Machine learning models which uses the and the value of vN is then drawn from these probabilities. Once
maximum likelihood approximation often resorts to approximate the N -th feature is sampled, one can then move on to the (N −1)-
methods such as annealed importance sampling (AIS) [29], [30]. th feature. More generally, given the values of vk , vk+1 , ...vN ,
Other alternative models such as GAN, completely alleviate the the (k − 1)-th feature can be sampled from the one-dimensional
need to explicitly compute the partition function [31]. conditional probability:

Pm (vk−1 , vk , vk+1 , ..., vN )


B. Optimization Pm (vk−1 |vk , vk+1 , ..., vN ) = (8)
Pm (vk , vk+1 , ..., vN )
After establishing the parameterization class of the model as
By employing this iterative sampling procedure, it becomes
the MPS form, the objective is to identify the optimal MPS
feasible to sample all the feature values based on the different
that produces a probability distribution that closely matches the
one-dimensional conditional probabilities. As a result, a sample is
distribution of the given data. Consequently, the parameters of
obtained that strictly obeys the probability distribution represented
the Matrix Product State are fine-tuned to minimize the Negative
by the MPS [21]. It is also worth noting that this sampling
Log-Likelihood (NLL) loss function:
approach can be extended to inference tasks where only a part
|T | |T |
of the sample is given and can thereby be used for tasks such as
1 X 1 X missing data imputation [21].
L=− ln P (vi ) = ln Z − ln |Ψ(vi )|2 (6)
|T | i=1 |T | i=1
IV. A PPLICATION : NATURAL L ANGUAGE
It is well known that minimizing the Negative Log-Likelihood
is equivalent to minimizing the Kullback-Leiber divergence. The
P ROCESSING (NLP)
minimization of the loss function is achieved through the use of Natural Language Processing (NLP) is a field that lies at
a Gradient Descent (GD) approach, yet with a notable difference the crossroads between computer science, artificial intelligence
from conventional GD methods where all parameters are updated and linguistics. Its aim is to enable machines and computers
simultaneously [22]. In this case, the parameters of the MPS to understand, interpret and generate languages, both in written
are iteratively updated in a manner that resembles the DMRG and spoken formats, in a manner akin to humans beings. In
algorithm, and more specifically the two-sites DMRG variant this work the focus lies on capturing the underlying distributions
[32]. The two-sites DMRG enables a dynamic adjustment of the in textual data for the purpose of text generation and language
bond dimensions of the MPS during the learning phase. This recognition. As previously indicated, MPS are not well-suited
adaptive process efficiently allocates computational resources to for analyzing two-dimensional datasets. However, considering
areas where stronger correlations among the physical variables the inherent one-dimensional nature of language, it suggests the
exist [21]. potential applicability of MPS in this context.
4

A. Dataset and workflow


The dataset used to train the MPS-based models was obtained
via the Project Gutenberg (making books freely available on
www.gutenberg.org) and consists of 50 different books written in
English and randomly selected in the database. To render the data
usable, a preprocessing step is implemented wherein capital letters
are transformed into lowercase letters. The dataset is subsequently
filtered to only include the 26 letters of the alphabet and the ’space’
token, while excluding all other special characters. All the text
sections consisting of N characters that occur in the books are then
extracted from the text data. Each section is subsequently mapped
to a sequence of numerical values ranging from 0 to 26, achieved
Fig. 1. NLL calculated on the training and test sets as functions of
by replacing each character with a corresponding numerical value. the number of optimization sweeps for a 5-site MPS-base model with
The overall goal is to capture the inherent probability distribution Dmax = 150. The results are shown for different training set sizes.
of the text sections. In other words, the aim is to model the
frequency of occurrence associated with distinct text sections by
means of an N -site MPS-based model with a physical dimension data distribution are [3]. It is important to note that for small
of 27. In what follows, MPS-based models will be trained for systems, such as N = 5, the data distribution can be kept in
different values of N (ranging from 4 to 10). memory and stored as a rank-N tensor. Despite the computational
expense and inefficiency associated with this task, it allows for an
B. Text generation easy assessment of the learning ability of the model and forms a
proof-of-concept for the approach used. Having done this, it was
The optimization scheme used to optimize the MPS is the one observe that the MPS trained on the largest training set accurately
resembling the two-site DMRG. The gradient of the NLL with represents the real data distribution.
respect to a merged tensor is given by The overall goal of this work is to build a model capable
|T | of generating text, therefore Table I illustrates sentences that
∂L Z′ 2 X Ψ′ (vi ) were generated based on a given input text section (given in
(k,k+1)j j
= − (9)
k k+1
∂Aαk−1 αk+1 Z |T | i=1 Ψ(vi ) bold). The text is generated in a sequential manner: Given an
input consisting of n letters l1 , l2 , ..., ln−1 , ln , the 5-site MPS
Computing the gradient for large datasets is computationally samples the next character from the conditional probabilities:
expensive. One solution to this predicament is to use the so- P (ln+1 |ln−3 , ln−2 , ln−1 , ln ). Based on the sentences generated, it
called Stochastic Gradient Descent (SGD) approach as was put is evident that the model’s ability to produce coherent sentences
forward in Ref. [29]. In fact, when using SGD, the gradient is is considerably limited. However, there are several noteworthy
evaluated using a mini-batch which represents a subset of samples observations to emphasize. Despite the lack of meaningful content,
drawn uniformly from the training set [34]. At each iteration, certain existing words do appear, and the MPS accurately positions
the gradient descent is performed in multiple steps on the same ’space’ tokens at specific instances, such as following the words
merged tensor, where in each step a different mini-batch is used should, have, my, et cetera. The limited text generation ability
for the computation of the gradient. exhibited by the model, despite its capability in capturing the
Figure 1 shows the NLL calculated on the training and test underlying data distribution, could potentially be attributed to the
sets as a function of the number of optimization sweeps for relatively short length of the considered MPS. Specifically, the
a 5-site MPS-based model with Dmax = 150. The results are generation process relies on the preceding 4 characters, which
shown for different training set sizes (103 samples on the left, might be insufficient to yield satisfactory text generation capabil-
104 samples in the center and 105 samples ont the right of the ities.
figure). In the three different cases, the NLL calculated on the
training sets decreases monotonically. The behavior of the test
a the person whose such we must spee while and then the
loss varies with the size of the training set. When trained on b i am they hindi cupiess the sating withou were he
a small training set, the NLL calculated on the test set rapidly c which is ther and falled in allwore about a righting he
converges to a minimum after a few sweeps. However, further d one should have had say wear when salvolated essed
optimization beyond this point leads to overfitting, as evidenced
by an increase in the test loss. With larger training set sizes, the TABLE I
S ENTENCES GENERATED BY THE 5- SITE MPS- BASED MODEL WITH
test loss attains systematically a lower minimal value, indicating Dmax = 150 TRAINED ON TRAINING SET CONTAINING 105 SAMPLES .
an enhanced capability to generalize to unseen samples. This T HE BOLDED PORTION OF THE SENTENCES REPRESENTS THE INPUT
behavior, commonly observed in machine learning, highlights the PROVIDED TO THE MODEL FOR PREDICTING THE REMAINDER OF THE
necessity of abundant data to limit overfitting and to attain an SENTENCE .
optimal model in terms of generalization abilities.
Seeing that minimizing the NLL is equivalent to minimizing
the KL divergence which quantify the similarity between the data To address the issue raised above, other MPS-based models
distribution and the model distribution, plotting the distribution were trained with a higher value of N . Considering the sus-
log (Pd (x)/Pm (x)) for samples taken from the data distribution ceptibility of the MPS model to overfits [3], an early stopping
Pd (x) can be used to assess how similar the model and the procedure was incorporated during the training process to mitigate
5

this problem [35]. The MPS were trained for different values of
N (ranging from for to ten) and for different values of Dmax . The
training and validations sets systematically consisted of 0.75×105
and 105 samples, respectively. Unfortunately the trained models
did not show a better ability to generate text. The sentences
generated by these models have revealed that models based on
longer MPS still exhibit limited abilities to generate coherent
sentences. While all the models are capable of generating some
existing words, they also generate non-existent ones. Furthermore,
it is crucial to highlight that none of the models exhibit any level
of semantic understanding which is a fundamental requirement for
a proficient text generator.
During training, the MPS adapts the bond dimension of the Fig. 2. Accuracy of the 5-site (a) and 4-site (b) MPS-based language
different bonds to capture the correlation that exists within the recognition models as a function of the number of characters in the text
segments used for evaluation. The results are shown for different values
training set. The problem in this case is that the physical dimen- of the maximal bond dimension allowed.
sion is relatively high and the correlation within textual data is
significant. Thereby, the bond dimensions of the Matrix Product
States systematically attain the maximal allowed value of Dmax , 1
PS−N +1
by S−N +1 i=1 log Planguage (vi ). The classification process
except for the first and last bond, which have a value equal to
involves selecting the language for which the model gives the
the physical dimension. This renders the training of longer MPS
highest average log-likelihood. To assess the model’s performance,
more challenging. Furthermore, when the training set size is too
the accuracy is used as an evaluation metric. Accuracy is an
small, the resulting trained MPS may be sub-optimally trained, as
appropriate measure in this case because both the training and
was the case for the 5-site MPS which was trained on 103 samples
the test sets are perfectly balanced between the different classes.
(see Figure 1). Longer MPS might therefore require larger training
Models with different values of N and Dmax were trained.
sets but this would render the training computationally even more
In Figure 2, the accuracy is shown for the 4-site MPS-based
expensive.
model as a function of the number of characters in the text
Another attempt that was investigated to improve the perfor-
segments used for evaluation. Results are shown for different
mances of the model, while limiting the increase in physical
values of Dmax . A similar trend was observed for the 5-site MPS
dimension, is to selectively incorporate the most frequently oc-
with a slightly better accuracy achieved for smaller evaluation
curring bi-grams (i.e. sequences of two characters) alongside the
text sequences. The MPS-based classification models demonstrate
uni-grams (i.e. the characters) already taken into account up until
a high accuracy, especially for longer text segments. This holds
now. A 5-site MPS model has been trained by considering the
true for all values of Dmax . The results indicate that an optimal
top 1/15 fraction of the most frequent bi-grams, resulting in a
value for Dmax is 25, as increasing the bond dimension beyond
physical dimension of 65. The model was trained on a training
this value does not result in a significant improvement in accuracy
set consisting of 5×104 samples. The text generation performance
while decreasing it to 10 significantly reduces the accuracy (blue
of the model trained using the selected bi-grams appeared to be
curve in Figure 2). This observation valid for text segments of
inferior compared to the basic 5-site MPS model. This observation
various lengths. Additionally, the results demonstrate that the
might be attributed to either a too small training set size (as
MPS-based approach performs better in classifying longer text
increasing the physical dimension leads to longer computational
segments. This observation seems logic as longer text segments
times, a smaller training set was chosen to balance computational
can be divided into more text sections, thereby increasing the
efficiency.) or a sub-optimal choice for the maximum bond di-
chance of having a large average likelihood and improving the
mension (Dmax ).
classification accuracy. The accuracy achieved for very short text
segments, specifically those containing five characters, although
C. Language Recognition slightly lower, is still quite good as it ranges between 0.70 and
In this section, an MPS-based model for language classi- 0.85. In comparison, a simple naive Bayes classifier trained on
fication is presented. The fact that the partition function can the same three books achieved an average accuracy close to 0.85.
exactly and efficiently be computed allows for the evaluation
of the log-likelihood of any given text section. The proposed
approach involves training multiple MPS-based models, where
V. S UMMARY AND OUTLOOK
each model corresponds to a specific language. The goal is that In conclusion, generating text with an MPS-based model is
each model learns the specificities of the probability distribution a challenging task, mainly due to the high physical dimension
of the language it tries to model, so as to be able to give a required and the significant correlation that exists with textual
high likelihood to text section derived from that language and data. Further research should be conducted to thoroughly explore
low likelihood to text section from other languages. In this work, the complete potential of MPS-based generative models for text
three languages were considered: English, French and Dutch. generation. This could be accomplished through conducting more
When considering an N -site MPS, the classification process is extensive and computationally intensive simulations, incorporating
as follows. Consider a given text segment s of size S originating a higher physical dimension and larger training set sizes. While
from a text written in English, French or Dutch. The average theoretically feasible, such simulations were deemed too demand-
log-likelihood is calculated for the three models trained on the ing to be conducted within the scope of this study. Comparing the
English, French and Dutch datasets respectively and it is given performances of models with different MPS sizes is also a delicate
6

task because of the fundamental differences that exists between the [15] E. Stoudenmire and D. J. Schwab, “Supervised learning with tensor
dataset on which those models are evaluated with the NLL. Further networks,” Advances in neural information processing systems,
research should incorporate a universal quantifiable measure of vol. 29, 2016.
the performance of a model that can be used for all models. This [16] A. Novikov, M. Trofimov, and I. Oseledets, “Exponential ma-
chines,” arXiv preprint arXiv:1605.03795, 2016.
measure could be inspired by the BLEU or ROUGE evaluation [17] J. Liu, S. Li, J. Zhang, and P. Zhang, “Tensor networks for
methods commonly used to evaluate text generation performances, unsupervised machine learning,” Physical Review E, vol. 107, no. 1,
where the generated text is compared to a reference text segment p. L012103, 2023.
[36]. Additional studies should also focus on the development of [18] E. M. Stoudenmire, “Learning relevant features of data with multi-
problem-specific datasets. For example, when tackling a task such scale tensor networks,” Quantum Science and Technology, vol. 3,
as predicting the most probable words based on an input sequence no. 3, p. 034003, 2018.
for email text, it might be interesting to use a dataset tailored for [19] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Ten-
sorizing neural networks,” Advances in neural information process-
that task. ing systems, vol. 28, 2015.
Despite the complexity of generating text using an MPS-based [20] S. T. Wauthier, B. Vanhecke, T. Verbelen, and B. Dhoedt, “Learning
model, and the suboptimal performances observed, the model still generative models for active inference using tensor networks,”
exhibits the ability to capture the specificities of the language in Active Inference: Third International Workshop, IWAI 2022,
it aims to model. This proficiency is sufficient for achieving Grenoble, France, September 19, 2022, Revised Selected Papers.
good results in language classification tasks. Future research Springer, 2023, pp. 285–297.
[21] Z.-Y. Han, J. Wang, H. Fan, L. Wang, and P. Zhang, “Unsupervised
should prioritize the inclusion of additional languages. However,
generative modeling using matrix product states,” Physical Review
it is important to acknowledge that evaluating the average log- X, vol. 8, no. 3, p. 031012, 2018.
likelihood for numerous MPS models (corresponding to all the [22] S. Cheng, L. Wang, T. Xiang, and P. Zhang, “Tree tensor networks
languages taken into account) would be relatively time-consuming. for generative modeling,” Physical Review B, vol. 99, no. 15, p.
155131, 2019.
[23] T. Vieijra, L. Vanderstraeten, and F. Verstraete, “Generative
R EFERENCES modeling with projected entangled-pair states,” arXiv preprint
arXiv:2202.08177, 2022.
[1] J. M. Tomczak, Deep generative modeling. Springer, 2022. [24] S. R. White, “Density matrix formulation for quantum renormal-
[2] A. Lamb, “A brief introduction to generative models,” arXiv ization groups,” Physical review letters, vol. 69, no. 19, p. 2863,
preprint arXiv:2103.00265, 2021. 1992.
[3] Vieijra, Tom, “Artificial neural networks and tensor networks in [25] H. Bruus and K. Flensberg, Many-body quantum theory in con-
Variational Monte Carlo,” Ph.D. dissertation, Ghent University, densed matter physics: an introduction. OUP Oxford, 2004.
2022. [26] D. Poulin, A. Qarry, R. Somma, and F. Verstraete, “Quantum simu-
lation of time-dependent hamiltonians and the convenient illusion of
[4] D. Baron, “Machine learning in astronomy: A practical overview,”
hilbert space,” Physical review letters, vol. 106, no. 17, p. 170501,
arXiv preprint arXiv:1904.07248, 2019.
2011.
[5] R. E. Goodall and A. A. Lee, “Predicting materials properties
[27] J. Eisert, M. Cramer, and M. B. Plenio, “Area laws for the
without crystal structure: Deep representation learning from stoi-
entanglement entropy-a review,” arXiv preprint arXiv:0808.3773,
chiometry,” Nature communications, vol. 11, no. 1, p. 6280, 2020.
2008.
[6] Z.-A. Jia, B. Yi, R. Zhai, Y.-C. Wu, G.-C. Guo, and G.-P. Guo,
[28] S. Cheng, J. Chen, and L. Wang, “Information perspective to prob-
“Quantum neural network states: A brief review of methods and
abilistic modeling: Boltzmann machines versus born machines,”
applications,” Advanced Quantum Technologies, vol. 2, no. 7-8, p.
Entropy, vol. 20, no. 8, p. 583, 2018.
1800077, 2019.
[29] J. Van Gompel, J. Haegeman, and J. Ryckebusch, Tensor netwerken
[7] J. J. Hopfield, “Neural networks and physical systems with emer- als niet-gesuperviseerde generatieve modellen, 2020.
gent collective computational abilities.” Proceedings of the national [30] R. M. Neal, “Annealed importance sampling,” Statistics and com-
academy of sciences, vol. 79, no. 8, pp. 2554–2558, 1982. puting, vol. 11, pp. 125–139, 2001.
[8] N. Zhang, S. Ding, J. Zhang, and Y. Xue, “An overview on restricted [31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
boltzmann machines,” Neurocomputing, vol. 275, pp. 1186–1199, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial
2018. networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–
[9] F. Verstraete, V. Murg, and J. I. Cirac, “Matrix product states, 144, 2020.
projected entangled pair states, and variational renormalization [32] S. R. White, “Density-matrix algorithms for quantum renormaliza-
group methods for quantum spin systems,” Advances in physics, tion groups,” Physical review b, vol. 48, no. 14, p. 10345, 1993.
vol. 57, no. 2, pp. 143–224, 2008. [33] A. Fischer and C. Igel, “An introduction to restricted boltzmann
[10] J. C. Bridgeman and C. T. Chubb, “Hand-waving and interpretive machines,” in Progress in Pattern Recognition, Image Analysis,
dance: an introductory course on tensor networks,” Journal of Computer Vision, and Applications: 17th Iberoamerican Congress,
physics A: Mathematical and theoretical, vol. 50, no. 22, p. 223001, CIARP 2012, Buenos Aires, Argentina, September 3-6, 2012. Pro-
2017. ceedings 17. Springer, 2012, pp. 14–36.
[11] S. Montangero, E. Montangero, and Evenson, Introduction to tensor [34] L. Bottou, “Large-scale machine learning with stochastic gradient
network methods. Springer, 2018. descent,” in Proceedings of COMPSTAT’2010: 19th International
[12] J. I. Cirac and F. Verstraete, “Renormalization and tensor product Conference on Computational StatisticsParis France, August 22-27,
states in spin chains and lattices,” Journal of physics a: mathemat- 2010 Keynote, Invited and Contributed Papers. Springer, 2010, pp.
ical and theoretical, vol. 42, no. 50, p. 504004, 2009. 177–186.
[13] R. Orús, “A practical introduction to tensor networks: Matrix prod- [35] Y. Yao, L. Rosasco, and A. Caponnetto, “On early stopping in
uct states and projected entangled pair states,” Annals of physics, gradient descent learning,” Constructive Approximation, vol. 26,
vol. 349, pp. 117–158, 2014. no. 2, pp. 289–315, 2007.
[14] T. E. Baker, S. Desrosiers, M. Tremblay, and M. P. Thompson, [36] C.-Y. Lin, “Rouge: A package for automatic evaluation of sum-
“Méthodes de calcul avec réseaux de tenseurs en physique,” Cana- maries,” in Text summarization branches out, 2004, pp. 74–81.
dian Journal of Physics, vol. 99, no. 4, pp. 207–221, 2021.
Contents

1 Introduction 1
1.1 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Generative modeling 3
2.1 Fundamentals of machine learning . . . . . . . . . . . . . . . . . . . 3
2.1.1 Motivation and definition . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Building blocks of a ML model . . . . . . . . . . . . . . . . . . 4
2.2 Generative models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Probabilistic framework and maximum likelihood estimation . 13
2.2.2 Evaluating a generative model . . . . . . . . . . . . . . . . . . 15
2.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3 Tensor Networks 19
3.1 Quantum many-body systems . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Motivation and representation of Tensor Network states . . . . . . . 21
3.2.1 Diagrammatic representation . . . . . . . . . . . . . . . . . . 22
3.2.2 Tensor Networks and quantum states . . . . . . . . . . . . . . 24
3.3 Matrix Product State (MPS) . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Construction via successive Singular Value Decomposition . . 25
3.3.2 Gauge degree of freedom and canonical form . . . . . . . . . 28
3.3.3 Expectation values and operators . . . . . . . . . . . . . . . . 29
3.3.4 Optimizing MPS: finding ground states . . . . . . . . . . . . . 31
3.4 Tree Tensor Networks and Projected Entangled Pair States . . . . . . 33

4 Generative modeling using Tensor Networks 35


4.1 Overview of TN-based methods in machine learning . . . . . . . . . . 35
4.2 MPS-based generative model . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Model representation . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.3 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 TTN and PEPS as generative models . . . . . . . . . . . . . . . . . . 44

xix
5 Application: Natural Language Processing (NLP) 45
5.1 Dataset and workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Text generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Language recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

6 Conclusion 61

Bibliography 63

A Appendix A: Tensor Manipulations 73


A.1 Index grouping and splitting . . . . . . . . . . . . . . . . . . . . . . . 73
A.2 Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

xx
List of Figures

2.1 A typical machine learning pipeline that includes the formulation of the
problem, model selection, optimization, and generalization steps. . . . 5
2.2 Basic neural network architecture. . . . . . . . . . . . . . . . . . . . . 9
2.3 Typical relationships between training loss and generalization loss and
the model’s capacity. Figure from Ref. [41]. . . . . . . . . . . . . . . . 11
2.4 Example of a regression problem where (from left to right) a linear
model, a quadratic model and a degree-9 polynomial model attempt to
capture an underlying quadratic distribution. Figure from Ref. [41]. . . 12
2.5 Image reconstruction from partial images from the MNIST database
[24]. The given parts are in black and the reconstructed parts are in
orange. Figure from [46]. . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Tensor T of rank 3 represented (a) by all its elements, (b) by the
diagrammatic representation. . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 The coefficient of a quantum many-body state (top) can be represented
in the form of a high-dimensional tensor (middle) with an exponentially
large number of parameters in the system size. The high-dimensional
tensor can be expressed in a network of interconnected tensors (bottom)
that takes into account the structure and the amount of entanglement
in the quantum many-body state. . . . . . . . . . . . . . . . . . . . . . 25
3.3 Illustration of Tree Tensor Network (TTN). . . . . . . . . . . . . . . . . 33
3.4 Illustration of a Projected Entangled-Pair State (PEPS). . . . . . . . . . 34

5.1 NLL calculated on the training and test sets as functions of the number
of optimization sweeps for a 5-site MPS-base model with Dmax = 150.
The results are shown for different training set sizes. . . . . . . . . . . 49
5.2 Histogram of the difference in log-likelihoods under distributions P (x) =
Pd (x) and Q(x) = Pm (x) for samples randomly drawn from the data
distribution. The results are shown for models trained with different
training set size: |Tr| = 103 in blue, |Tr| = 104 in orange and |Tr| = 105
in turquoise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

xxi
5.3 Negative log-likelihood as a function of the maximal bond dimension
Dmax for different values of N . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 Frequency of occurrence of the bi-grams encountered in the textual data. 55
5.5 Accuracy of the 5-site (a) and 4-site (b) MPS-based language recognition
models as a function of the number of characters in the text segments
used for evaluation. The results are shown for different values of the
maximal bond dimension allowed. . . . . . . . . . . . . . . . . . . . . 57
5.6 Confusion matrices shown for the 4-site MPS model with Dmax = 10
evaluated in the two extreme cases where the text segments contain
five characters (a), and 100 characters (b) and for the 5-site MPS model
with Dmax = 250 evaluated on text segment containing five characters
(c), and 100 characters (d). . . . . . . . . . . . . . . . . . . . . . . . . 59

xxii
List of Tables

5.1 Sentences generated by the 5-site MPS-based model with Dmax = 150
trained on training set containing 105 samples. The bolded portion of
the sentences represents the input provided to the model for predicting
the remainder of the sentence. . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Sentences generated by the different MPS-based models with Dmax =
100 trained on training set containing 0.75 × 105 samples. The bolded
portion of the sentences represents the input provided to the model for
predicting the remainder of the sentence. . . . . . . . . . . . . . . . . 53
5.3 Sentences generated by the 5-site MPS-based which takes the top 1/15
fraction of the most frequent bi-grams into account. The model is
trained with Dmax = 150 on training set containing 5 × 104 samples.
The bolded portion of the sentences represents the input provided to
the model for predicting the remainder of the sentence. . . . . . . . . . 56

xxiii
Acronyms

ADAM Adaptive Moment Estimation. 10

AIS Annealed Importance Sampling. 37

ANN Artificial Neural Networks. 35

BGD Batch Gradient Descent. 10

DBSCAN Density-Based Clustering Algorithm. 8

DMRG Density Matrix Renormalization Group. 32

GAN Generative Adversial Networks. 13

GD Gradient Descent. 10, 38

GPU Graphics Processing Units. 3

KL Kullback–Leibler. 14

LHC Large Hadron Collider. 4

MCMC Markov Chain Monte Carlo. 42

MPS Matrix Product State. 24

MSE Mean Squared Error. 10

NISQ Noisy-Intermediate-Scale Quantum. 35

NLL Negative-Log Likelihood. 10

NQS Neural Networks Quantum States. 35

PEPS Projected Entangled Pair States. 33

xxv
RNN Recurrent Neural Networks. 46

SGD Stochastic Gradient Descent. 10

SVD Singular Value Decomposition. 25

SVM Support Vector Machine. 8

TEBD Time Evolving Block Decimation. 33

TN Tensor Network. 21

TT Tensor Train. 25

TTN Tree Tensor Network. 33

VAE Variational Auto Encoder. 13

xxvi
Introduction 1
„ Without locality, physics would be wild.

— J. Ignacio Cirac et al.


(Matrix product states and projected entangled
pair states: Concepts, symmetries, theorems)

The prominence of machine learning has experienced a substantial growth in the last
decades, paralleled by a remarkable expansion in the scope of its applications. In this
thesis, the focus lies on generative modelling, a particular unsupervised learning task.
In generative modelling, it is common practice to assume that datasets can be seen as
a collection of samples extracted from an underlying data distribution. The principal
aim of generative model is to capture and represent this probability distribution. In
this thesis, generative models inspired by the probabilistic formulation of quantum
mechanics are considered. These models are referred to as Born Machines and
represent the underlying data distribution by means of a quantum state wave
function.

The quote selected to commence this thesis portrays a significant concept which has
profound consequences in the realm of physics. In quantum many-body physics, the
physical interactions exhibit a local behavior, which is manifested through systematic
structures embedded within the Hamiltonians that describe many-body systems.
This locality of physical interactions has profound consequences in terms of the
entanglement properties of quantum ground states. Indeed, it has been shown
that the ground states of physical system exhibit an area-law for the entanglement
entropy and those state are therefore heavily constrained by locality. As a result
they live in small manifold of the exponentially large Hilbert space. It has been
proven that Tensor Network (TN) methods have the ability to efficiently represent
those states. In other words, TN states can be used to represent the ground states of
physical system efficiently and faithfully.

Seeing the similarities between the task of generative modelling and quantum
physics, the application of tensor networks to effectively capture underlying prob-
ability distributions of datasets is not unexpected. Indeed both fields strive to
model a probability distribution within a vast parameter space. This thesis focus

1
on generative models represented by means of a specific type of (one-dimensional)
tensor network, the matrix product state. The overall goal is to use an MPS-based
generative model to capture and represent the underlying distributions existing in
textual data, serving as a foundation for the tasks of text generation and language
classification. In what follows, an overview of the different chapters of this thesis is
provided.

1.1 Thesis overview

Chapter 2
This chapter serves as a comprehensive introduction to the field of generative
modelling. The fundamental concepts of machine learning are explained and the
basic building blocks needed to construct a ML model are presented. Then the field of
generative modeling will be introduced, encompassing a comprehensive discussion
on the probabilistic framework often considered as well as diverse applications.

Chapter 3
In order to gain a comprehensive understanding of the application of Tensor Net-
works in the domain of Machine Learning, it is crucial to present the basic concepts
related to Tensor Networks in Physics. This chapter aims to provide a comprehensive
and thorough introduction to the field of Tensor Networks and more specifically to
Matrix Product States within the context of quantum-many body physics.

Chapter 4
This chapter presents a generative model which represents underlying distribution
by using a matrix product state as parametrization class. The presented model is
mainly based on the work presented in Ref. [46]. The optimization algorithm used
for learning as well as the advantageous direct sampling method will be presented
and explained.

Chapter 5
In this chapter the MPS-based model presented in Chapter 4 is employed for the
purpose of text generation and language recognition. A framework is proposed to
model textual data by using an MPS. Then the results related to the text generation
are presented. Finally it is shown that the presented framework can be used to build
a language classifier. Results will be shown for a MPS-based classifier trained on
three different languages, namely English, French and Dutch.

2 Chapter 1 Introduction
Generative modeling 2
2.1 Fundamentals of machine learning

When an individual perceives a darkened sky several kilometers away, they may infer
that precipitation is imminent. Similarly, when an individual is stuck in a traffic jam,
they may assume that there is an accident or roadwork ahead. These predictions
are based on personal experiences and learned associations. The individual has
learned from previous experiences that darkened skies are frequently followed
by precipitation, leading them to conclude that it is about to rain. Likewise, the
individual has learned that traffic jams are often indicative of obstacles on the road,
leading them to conclude that there is an accident or roadwork ahead. Since the
advent of the computer age, huge efforts have been made to build machines or
computers with the ability to learn in a manner akin to human cognition, enabling
them to make decisions and predictions without the need for explicit programming.
These efforts have led to the emergence of the field of Machine Learning (ML),
which is a branch of the broader field of artificial intelligence (AI). The general goal
of ML is to retrieve information from existing data and transform it into knowledge
that can be used to address unseen problems.

2.1.1 Motivation and definition

The significance of machine learning has increased exponentially over the past
decades. This can be mainly attributed to the remarkable advancements in com-
puting power and memory resources (Moore’s law [75]), and more specifically to
the evolution of graphics processing units (GPU) [71]. Additionally, the explosion
of available data [7, 124] has played a crucial role in the growing importance of
the field of machine learning. Today’s society has become highly dependent on
data-driven methodologies. The practice of data collection and analysis is now
omnipresent in industry but also in science. It has resulted in the adoption of
machine learning techniques for the effective extraction of valuable insights from
massive datasets. Examples of this trend can be observed in numerous sectors,
including agriculture, where data collection is for instance used to optimize the use

of fertilizers [89]. The field of astronomy is also experiencing an impressive growth
in dataset size and complexity [6, 5]. In astronomy, data are used to gain a
better understanding of properties of stars and galaxies. The field of Physics has
likewise entered the era of the Big Data where, for instance some experiments at the
Large Hadron Collider (LHC) generate petabytes (i.e. 1000 terabytes or $10^{15}$ bytes)
of data per year [71]. Making sense of these huge amounts of data can prove to be
a challenging task. Nevertheless, machine learning methodologies can be leveraged
to unearth significant patterns and gain valuable insights from datasets, thereby
enabling the resolution of problems that are beyond the scope of human capacity.

There is no real consensus on how to clearly define the concept of Machine Learning
[106]. Nevertheless there exists a, rather formal, description of the field given by
the widely cited quote of Tom Mitchell [73]:

"A computer program is said to learn from experience E with respect to


some class of tasks T and performance measure P if its performance at tasks
in T, as measured by P, improves with experience E."

Therefore, in this dissertation the field of machine learning is considered to include


all computer programs or algorithms capable of performing a particular task without
requiring explicit programming. Instead, these programs leverage information
obtained through experience, which is extracted from data. The learning process
involves optimizing the model with respect to a performance measure that is used
to evaluate the model’s proficiency in accomplishing the task. The best performing
models are those with the best ability to generalize. Generalizing means being
able to make correct predictions/decisions when given unseen data. While Tom
Mitchell’s quote has already highlighted some essential elements, the following
section provides a clear overview of the different building blocks that form the basis
of every machine learning model.

2.1.2 Building blocks of a ML model

The construction of all ML models can be considered as a systematic process that


follows four fundamental phases: problem formulation, model representation,
optimization, and ultimately, generalization. Figure 2.1 summarizes and illustrates
the general workflow followed in the construction of a machine learning model.
The task to be completed and the accessible data are stated during the problem
formulation process. An appropriate model representation is chosen based on the
problem that is being addressed. A learning algorithm is then used to optimize



the chosen model, resulting in an optimized model with the best generalization
properties. It is worth noting that when considering a certain learning problem, both
the task to be accomplished and the available data are intrinsically dictated by the
problem itself (turquoise elements). The choice of the representation and learning
algorithm (orange elements) is not predetermined, hence there exists a certain level
of flexibility in their choice. When tackling a learning problem, it is common to make
an informed choice of the model representation (e.g. seeing a linear relationship
between input and output hints towards the use of a linear model). Moreover, it is
standard practice to experiment with multiple models and algorithms and compare
their performances in order to identify the best performing model. The remainder
of this section will provide a succinct explanation and presentation of the four
fundamental phases mentioned above.


Fig. 2.1.: A typical machine learning pipeline that includes the formulation of the problem,
model selection, optimization, and generalization steps.

Problem formulation

In the field of machine learning, diverse categories of problem can be addressed


through distinct ML architectures, which present fundamental variations in terms
of expected functionalities. The field can be roughly subdivided into two main
categories: supervised learning and unsupervised learning.

Supervised learning: Supervised learning is one of the most encountered types


of machine learning. In the case of supervised learning the algorithm learns
from labeled data. It means that the training set $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ contains
N training examples and consists of two types of variables: the input variables
xn and the target variable(s) or label(s) yn . For instance, xn can be a vector
containing the pixel values of an image, while the label yn is a variable
expressing if the inputs (i.e. the image in this case) belongs to one of several
predetermined categories such as images containing an animal or images
showing a vehicle. The overall goal of supervised learning is to predict the
value of the labels for unseen inputs $x_n$ (i.e. $x_n \notin \mathcal{D}$). Stated differently,



supervised learning aims at finding a mapping function fθ (depending on the
model’s parameters θ) from x to y. The two main classes of supervised learning
problems are classification [58] and regression [66]. In classification tasks, the
output variable yn belongs to a categorical or discrete set, as illustrated in the
example above. In contrast, regression problems seek to estimate a continuous
output variable. A good example of this can be observed in the context of
house price prediction, where a model tries to find the price of a house (i.e.
the output variable yn ) based on a set of input features represented by the
vector xn , which may include attributes such as square footage, number of
bedrooms, and other relevant factors.

Unsupervised learning: Although unsupervised learning is not as precisely de-


fined as supervised learning, it is still useful for a variety of purposes [106].
In the case of unsupervised learning the dataset $\mathcal{D} = \{x_n\}_{n=1}^{N}$ contains
unlabeled training examples. The absence of information related to the labels
of the data make the unsupervised learning problem fundamentally different
from the supervised learning one. In supervised learning, the objective is
to predict the labels of a given input, whereas in unsupervised learning, the
primary goal is to capture the underlying patterns and structures within an un-
labeled dataset. It may seem surprising that learning can be achieved without
the use of labeled data, and one may question what is actually learned from
the unlabeled information. To demonstrate the usefulness of unsupervised
learning, consider the book classification task. In supervised learning, the input
consists of book features and their corresponding output labels (e.g. romance,
history, science-fiction, et cetera). The supervised learning approach learns
to associate the input features of a book with a specific genre. In contrast,
unsupervised learning algorithms tend to group or cluster books with similar
features together without explicitly labeling the groups or clusters. This type
of problem is commonly referred to as clustering [93, 54]. Other important
classes of unsupervised learning problems are dimensionality reduction [32],
feature extraction and generative modelling [12, 105] which lie at the core of
this dissertation and will be discussed further below.

Supervised and unsupervised learning are the two main classes of problem en-
countered in the field of machine learning. However, it should be noted that this
subdivision can be extended to include other, less common, types of learning such as
semi-supervised learning [90] or reinforcement learning [62]. Semi-supervised
learning uses both labeled and unlabeled data to learn. Usually in this approach,
only a small subset of the training set is labeled, while most of the dataset is unla-
beled. Semi-supervised learning paradigms have proven to be rather useful in cases



where labeled data is scarce or difficult to obtain. In reinforcement learning, the
training data does not provide explicit target labels. Instead, an agent learns to
make decisions through interactions with an environment over time, with the aim
of maximizing a cumulative reward signal. Reinforcement learning is inspired by
the way humans and animals learn from experience: whereby the learner attempts
different strategies and receives feedback in the form of failures or successes. For
example, a toddler learns to drink from a glass by trying different approaches and
receiving feedback in the form of failures (e.g. spilling the drink) or successes (e.g.
managed to drink from the glass).

Model representation

After formulating the problem, the task to be performed is clear (e.g. in the case
of classification that is predicting a class label given some input features) and the
structure of the available training data is determined (e.g. labeled or unlabeled
data). The next step consists of selecting the model representation, which involves
determining the mapping function fθ (x) : X → Y that depends on the model’s
internal parameters θ, where X is the input space and Y is the output space (e.g.
{0, 1} for binary classification). Choosing a model representation fθ (x) entirely
specifies the hypothesis space H through the functional form of fθ (x). The hypothesis
space is the set of all possible candidate solutions (or hypotheses) that a machine
learning algorithm is allowed to consider in order to solve a problem. To illustrate
the meaning of the hypothesis space, consider the regression problem of predicting
the price of a house based on its square footage and assume that the price of a
house is linearly related to its square footage (note that this forms a rather bold
approximation that might not be true in a real situation). In this case, the obvious
choice for the model representation would be a linear mapping function of the
form f (s) = a × s + b, where a and b are the model parameters and s is the square
footage. The hypothesis space for this problem would include all possible linear
functions of the form f (s) = a × s + b, and thus H = {f (s) = a × s + b|∀a, b ∈ R}.
During the following optimization stage, the machine learning algorithm will search
through the hypothesis space for the optimal set of parameters a and b that best fit
the training data and minimize a certain error between predicted and actual house
prices.
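To make the hypothesis space concrete, the short sketch below (a minimal illustration using a hypothetical synthetic dataset, not data discussed in this thesis) searches H = {f(s) = a × s + b} for the parameters that best fit a handful of (square footage, price) pairs via an ordinary least-squares solve.

```python
import numpy as np

# Hypothetical training data: square footage (m^2) and price (in 1000 EUR).
s = np.array([50.0, 75.0, 100.0, 120.0, 150.0])
y = np.array([110.0, 155.0, 205.0, 240.0, 300.0])

# Hypothesis space H = {f(s) = a*s + b}: stack a column of ones so that the
# least-squares solve returns both the slope a and the intercept b at once.
X = np.column_stack([s, np.ones_like(s)])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"best hypothesis: f(s) = {a:.3f} * s + {b:.3f}")
print("predicted price for 90 m^2:", a * 90 + b)
```

Here the least-squares solve plays the role of the learning algorithm that is discussed in the optimization stage below.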

As one might expect, there are many different model representations to use in the
field of machine learning. The following is a list of some of the widely known
and most significant model representations in the field; however, it is important
to note that this list is not exhaustive. When dealing with classification tasks, the



most important and widely used models are Support Vector Machines (SVM) [79],
decision trees [44], Naive Bayes classifiers [11] and random forests [8]. Examples
of regression models include the aforementioned linear regression models [66],
Polynomial regression [66] (which is similar to linear regression but considers
higher order dependencies between input feature(s) and output variable(s)) as
well as Lasso and Ridge regression [68]. An important sub-field of unsupervised
learning is clustering where the goal is to group similar objects together. This is
done in such a way as to minimize the intra-cluster distance while maximizing the
inter-cluster distance. Examples of model representations used for the clustering task
are: representation-based clustering, hierarchical clustering [77] and density-based
clustering [59] such as DBSCAN.

Some model representations can be used to address various supervised and unsu-
pervised problems. Among these, neural networks are recognized as one of the
most prevalent and widely used model representations in the field of machine learn-
ing. Any comprehensive introduction to machine learning ought to incorporate the
concept of neural networks, which have received considerable attention in the last
decades. They lie at the foundation of many advanced machine learning models
that address various issues such as pattern recognition, prediction, optimization,
and more [53]. Neural networks are inspired by the structure and function of
the human brain, and consist of a collection of interconnected nodes, or "neurons",
that are organized in layers. The fundamental idea behind neural networks is to
create linear combinations of input variables and express the output as a nonlinear
function of these features. Figure 2.2 illustrates the general architecture of a neural
network. The neurons are represented by circles and are the core processing unit of
the network. In a neural network, neurons are arranged in multiple layers, where
the first layer receives the inputs and the final layer predicts the output values. The
number of nodes in the input and output layers is determined by the specific problem
being addressed. Most of the computations occur in the hidden part of the network,
which can consist of one or more layers. Each node in the hidden layer receives a
linear combination of values from the previous layer, weighted by the connections
between the neurons. The resulting linear combination value at each node is then
passed through an activation or threshold function, which produces the output value
of the node and serves as input for the following layer. The parameters of the model
are the weights connecting the different neurons.
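As a minimal sketch of the computation just described (with hypothetical layer sizes, randomly chosen weights and a sigmoid activation; none of these choices are prescribed by the text), the snippet below propagates a single input vector through one hidden layer and an output layer.

```python
import numpy as np

def sigmoid(z):
    # Nonlinear activation applied element-wise after each linear combination.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# A small 3-4-2 network: 3 input features, 4 hidden neurons, 2 output neurons.
W1 = rng.normal(size=(4, 3))   # weights connecting input layer to hidden layer
b1 = np.zeros(4)
W2 = rng.normal(size=(2, 4))   # weights connecting hidden layer to output layer
b2 = np.zeros(2)

x = np.array([0.2, -1.0, 0.5])     # input vector
h = sigmoid(W1 @ x + b1)           # hidden layer: weighted linear combination + activation
out = sigmoid(W2 @ h + b2)         # output layer
print("network output:", out)
```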

It is clear that there exist various model representations in the field of machine
learning. It is important to note that there is no one-size-fits-all machine learning
model. The performance of an ML model representation can only be deemed superior
in relation to a particular problem or dataset. It is therefore the responsibility of ML



Fig. 2.2.: Basic neural network architecture, consisting of an input layer, a hidden layer and an output layer.

researchers to fine-tune and customize their models to suit the specific problems
they are addressing. This dissertation focuses on exploring model representations
that are based on tensor networks and more specifically on matrix product states.

Optimization

The learning process can be reformulated as an optimization problem where the


model parameters are fine-tuned in order to increase the model’s performance. In
order to be able to assess the performance of the model, it is necessary to define
a suitable measure, which is often referred to as the cost- or loss function. The
choice of the cost function depends on the problem that is being addressed [10].
To illustrate this, consider a supervised learning problem, where the training set
consists of labeled data $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ and where the model representation
is given by fθ (x). Call ŷi = fθ (xi ) the model prediction for the label of an input xi .
The loss function L(ŷi , yi ) = L(fθ (xi ), yi ) gives a performance measure, assessing
how well the model prediction is compared to the actual label. The parameters θ
are fine-tuned in order to minimize the loss function on the training dataset D, thus
to minimize the following:
$$\sum_{i=1}^{N} L(f_\theta(x_i), y_i) \qquad (2.1)$$



Commonly used cost functions in machine learning are the cross-entropy function,
the mean-square error (MSE) or the negative-log likelihood (NLL). To illustrate this
consider the example of house price prediction where the model prediction fθ (si )
gives the predicted price. The optimization algorithm aims to minimize the MSE:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (2.2)$$

Gradient descent (GD) [92] is a popular machine learning optimization technique


for determining the ideal values of a model’s parameters θ that minimize the cost
function. The algorithm iteratively updates the parameters by computing the gradi-
ent of the cost function with respect to the parameters and by moving in the negative
direction of the gradient.
$$\theta^{k+1} = \theta^{k} - \gamma \frac{\partial L}{\partial \theta^{k}} \qquad (2.3)$$
In the above equation $\theta^{k}$ represents the model parameters at iteration step k and
γ is the learning rate. The learning rate is a hyperparameter of the model that
controls the rate at which the model updates its parameters. The initialization of
the parameters can be random but it is worth noting that a clever choice for the
initialization of the model’s parameter might lead to a faster or better convergence
in some cases. There also exist some variations of the gradient descent algorithms
[92] such as Stochastic Gradient Descent (SGD), Batch Gradient Descent (BGD) and
Adaptive Moment Estimation (ADAM) [57].
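The sketch below illustrates the plain gradient descent update of Equation 2.3 applied to the MSE of Equation 2.2 for the linear house-price model; the data, learning rate and number of iterations are arbitrary choices made for this example.

```python
import numpy as np

# Toy house-price data (square footage in 100 m^2, price in 100k EUR), rescaled
# so that a single fixed learning rate converges without further tuning.
s = np.array([0.5, 0.75, 1.0, 1.2, 1.5])
y = np.array([1.1, 1.55, 2.05, 2.4, 3.0])

a, b = 0.0, 0.0      # model parameters theta = (a, b), initialized at zero
gamma = 0.1          # learning rate (hyperparameter)

for k in range(2000):
    y_hat = a * s + b                    # model predictions f_theta(s_i)
    err = y_hat - y
    grad_a = 2.0 * np.mean(err * s)      # derivative of the MSE w.r.t. a
    grad_b = 2.0 * np.mean(err)          # derivative of the MSE w.r.t. b
    a -= gamma * grad_a                  # update of Equation 2.3
    b -= gamma * grad_b

print(f"learned parameters: a = {a:.3f}, b = {b:.3f}")
print("final MSE:", np.mean((a * s + b - y) ** 2))
```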

Generalization

As repeatedly mentioned in this thesis, the ultimate goal of machine learning is to be


able to generalize. The model’s generalization capability refers to its ability to gener-
ate accurate predictions or decisions when confronted with previously unseen data. In
the field of machine learning, it is standard practice to divide the dataset into two
subsets, namely a training set and a test set. The training set is used to optimize the
model parameters, and the test set is used to evaluate its generalization performance.
Improving the performance of the model is often related to minimizing some error
measure. The error calculated on the training set is called the training error while
the error calculated on the test set is called the generalization error or test error [41].
The goal of machine learning differs from that of traditional optimization problems.
While the goal of optimization problems is to improve a specific performance metric
on a specific set, the goal of machine learning is to achieve satisfactory performance
on both the training and test sets. The two key challenges that occur in the pursuit



of generalization in machine learning are underfitting and overfitting. Underfitting
occurs when the model cannot capture the underlying patterns and structure present
in the data and therefore fails to reduce its error on both the training and the test
set. Overfitting, on the other hand, occurs when the model has captured (too much
of) the specifics of the training set. Although the model accurately describes the
training data, it will fail to generalize to unseen data, resulting in a low training
error and a high generalization error.

In machine learning, the capacity of a model refers to its ability to capture a wide
range of functions and is mostly determined by the number of parameters used as
well as the optimization algorithm employed [41]. The model capacity determines
whether it is likely to underfit or overfit. Figure 2.3 illustrates the typical behaviour
of training and testing loss as a function of model capacity. At low capacity the model
is in the underfitting regime, where both training and test loss remain high; in the
overfitting regime, a further increase in model capacity keeps decreasing the training
loss while the test loss starts to increase. The
ideal model complexity lies somewhere between the extremes of underfitting and
overfitting. The capacity of the model should be sufficient to capture the underlying
patterns in the data, but should not be so complex that it simply memorizes the
training set without being able to generalize successfully to new data. When the test
loss is minimal, the model with the best generalization potential is achieved.

Fig. 2.3.: Typical relationship between training loss, generalization loss and the model's
capacity. Figure from Ref. [41].

Figure 2.4 illustrates the concepts of underfitting and overfitting by comparing three
different models (linear, quadratic and degree-9 polynomial) that attempt to capture
an underlying quadratic distribution. The linear model cannot properly capture
the curvature in the data distribution simply because it does not dispose of enough

2.1 Fundamentals of machine learning 11


resources do to so. It will therefore lead to underfitting. The degree-9 polynomial
model captures (too much of) the specifics of the training data and therefore leads to
overfitting. Finally, the quadratic model perfectly captures the underlying structure
present in the data and is likely to be able to generalize correctly, resulting in a low
training and testing loss.

Fig. 2.4.: Example of a regression problem where (from left to right) a linear model,
a quadratic model and a degree-9 polynomial model attempt to capture an
underlying quadratic distribution. Figure from Ref. [41].
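The behaviour shown in Figure 2.4 can be reproduced numerically with a small sketch (synthetic noisy quadratic data and numpy polynomial fits; the exact coefficients and noise level are arbitrary choices): the degree-1 fit underfits, the degree-2 fit generalizes well, and the degree-9 fit drives the training error to (almost) zero while the test error grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def quadratic(x):
    # The underlying (unknown) data-generating function.
    return 1.0 - 2.0 * x + 3.0 * x ** 2

x_train = rng.uniform(0.0, 1.0, 10)
y_train = quadratic(x_train) + 0.1 * rng.normal(size=10)
x_test = rng.uniform(0.0, 1.0, 200)
y_test = quadratic(x_test) + 0.1 * rng.normal(size=200)

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a polynomial model of given capacity
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```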

2.2 Generative models

So far, the basic principles of machine learning have been presented, along with
the essential stages involved in the construction of an ML model. The focus will
now shift to the field of generative modelling, a particular unsupervised learning
task that lies at the core of this dissertation. In order to build reliable decision-
making AI systems, discriminative models alone are not sufficient. In addition to
requiring extensive labeled datasets, the availability of which may be limited (e.g.
in medical domains and fraud detection), discriminative models often lack semantic
understanding of their environment and also lack the aptitude to express uncertainty
concerning their decision making [105]. Generative models could potentially offer
a solution to these concerns. Additionally, they might make it easier to assess
the effectiveness of ML models. Indeed, classifiers frequently produce identical
results for a wide variety of inputs, making it challenging to determine exactly
what the model has learned [60]. Unlike discriminative models, which aim to model
a conditional probability distribution P (y|x) where y is a target value or label of



a data point x, generative models try to model the joint data distribution P (x).
At the core of generative models lies the fundamental concept that data can be
regarded as samples extracted from a genuine real-world distribution, x ∼ Pd (x),
commonly referred to as the data distribution. The fundamental objective of all
generative models is to model the data distribution to some extent. While certain
models enable explicit evaluation of the probability distribution function, others
may not explicitly model it but still allow implicit operations such as sampling
from it [41]. The literature regarding generative models roughly subdivides the
field into four main groups [105]: Auto-regressive generative models (ARM), Flow-
based models, Latent variable models (e.g. Generative Adversarial Networks (GAN)
or Variational Auto-Encoders (VAE)) and Energy-based models (e.g. Boltzmann
Machines). Certain methodologies do not conform to the probabilistic framework of
generative modeling and deviate significantly from it by encouraging the generation
of data points that have no density under any observed distribution. Examples are
deep style transfer and texture synthesis [36, 35] or Creative Adversarial Network
[28]. The Tensor Network-based model employed for generative modelling, as
presented in this dissertation (see below) is fully consistent with the probabilistic
framework of generative modeling. Henceforth, the following will be restricted to
the treatment of generative models in this framework.

2.2.1 Probabilistic framework and maximum likelihood estimation

The following section presents the probabilistic framework (as introduced in Ref.
[60]) as well as a widely used optimization method for generative models. As
stated previously, the fundamental concept of generative modelling is that data
can be considered as samples drawn from a probability distribution x ∼ Pd (x). In
the case of generative modelling, the machine learning model fθ (x) represents an
estimation of the data distribution. It is parameterized by a set of parameters θ and
is often referred to as Pm (x) being the model distribution. The generative modelling
problem can be viewed as minimizing the discrepancy between the true data dis-
tribution, Pd (x), and the model distribution, Pm (x), so that they become as close
as possible. Statistical divergence offers a way to quantify the difference between
two distributions. A statistical divergence is a non-negative (and generally non-symmetric)
function $D(P\|Q) : S \times S \to \mathbb{R}^{+}$ that takes as input two distributions from the space of
possible distributions S. Generative modeling can be restated as an optimization
problem within the probabilistic framework, with the goal of minimizing a statistical
divergence expressing the difference between Pd (x) and Pm (x; θ) (with this notation
the dependence of the model distribution on the model parameter is explicit). The



Kullback-Leibler (KL) divergence DKL is a commonly used statistical divergence in
the context of generative modelling [56]. This popularity can be explained by the
fact that minimizing the KL divergence is equivalent to maximizing the likelihood of
the data [60]. The expression for the KL divergence is:
$$\begin{aligned}
D_{\mathrm{KL}}(P\|Q) &= \int_{-\infty}^{\infty} P(x)\log(P(x))\,dx - \int_{-\infty}^{\infty} P(x)\log(Q(x))\,dx \\
&= \int_{-\infty}^{\infty} P(x)\log\left(\frac{P(x)}{Q(x)}\right)dx \\
&= \mathbb{E}_{x\sim P(x)}\left[\log\left(\frac{P(x)}{Q(x)}\right)\right] \\
&= \mathbb{E}_{x\sim P(x)}\left[\log(P(x)) - \log(Q(x))\right]
\end{aligned} \qquad (2.4)$$

The parameters $\theta_{\mathrm{KL}}$ that minimize $D_{\mathrm{KL}}(P_d(x)\|P_m(x;\theta))$ are given by:

$$\begin{aligned}
\theta_{\mathrm{KL}} &= \arg\min_{\theta} D_{\mathrm{KL}}(P_d(x)\|P_m(x;\theta)) \\
&= \arg\min_{\theta} \mathbb{E}_{x\sim P_d(x)}\left[\log(P_d(x)) - \log(P_m(x;\theta))\right] \\
&= \arg\min_{\theta} -\mathbb{E}_{x\sim P_d(x)}\left[\log(P_m(x;\theta))\right]
\end{aligned} \qquad (2.5)$$

Note that in the second line of Equation 2.5, the term log(Pd (x)) does not affect
the argument of the minimum and can therefore simply be omitted. Consider now
the parameters $\theta_{\mathrm{MLE}}$ that maximize the likelihood of a set of N data points:

$$\begin{aligned}
\theta_{\mathrm{MLE}} &= \arg\max_{\theta} \prod_{i=1}^{N} P_m(x_i;\theta) \\
&= \arg\max_{\theta} \sum_{i=1}^{N} \log(P_m(x_i;\theta)) \\
&= \arg\min_{\theta} -\frac{1}{N}\sum_{i=1}^{N} \log(P_m(x_i;\theta)) \\
&\approx \arg\min_{\theta} -\mathbb{E}_{x\sim P_d(x)}\left[\log(P_m(x;\theta))\right] \\
&= \theta_{\mathrm{KL}}
\end{aligned} \qquad (2.6)$$

Therefore, for a sufficiently large number of data points N, minimizing the KL
divergence corresponds to maximizing the likelihood.
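The equivalence above can be checked numerically. The sketch below (a toy example with a hypothetical three-symbol categorical distribution, chosen only for illustration) draws samples from P_d and verifies that the empirical average negative log-likelihood under P_m matches D_KL(P_d||P_m) plus the entropy of P_d, which does not depend on θ.

```python
import numpy as np

rng = np.random.default_rng(2)

p_data = np.array([0.5, 0.3, 0.2])    # true data distribution P_d over three symbols
p_model = np.array([0.4, 0.4, 0.2])   # some model distribution P_m(.; theta)

samples = rng.choice(3, size=100_000, p=p_data)   # data drawn from P_d

nll = -np.mean(np.log(p_model[samples]))              # empirical average negative log-likelihood
kl = np.sum(p_data * np.log(p_data / p_model))        # D_KL(P_d || P_m)
entropy = -np.sum(p_data * np.log(p_data))            # H(P_d), independent of theta

print(f"average NLL           : {nll:.4f}")
print(f"KL divergence + H(P_d): {kl + entropy:.4f}")  # agrees with the NLL up to sampling noise
```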



2.2.2 Evaluating a generative model

This section provides a few important notes regarding the challenging task of
evaluating generative models’ performances. One of the primary objectives of a
generative model is to generate realistic samples from its probability distribution. A
rather instinctive method to evaluate the model’s performance is by assessing the
quality of the samples drawn from the model probability distribution. However, this
approach can be inadequate as the model may still produce high-quality samples
even when its performance is poor. This situation can arise when the model
overfits the training data and reproduces the training instances precisely. In that
case the generated samples will be of good quality despite the model’s poor ability to
generalize. It is even possible to simultaneously underfit and overfit while producing
good quality samples [41]. To illustrate this, consider a generative model trained on
an image dataset with two categories, such as tables and chairs. Assume that the
model only replicates the training images of tables and ignores the images of chairs.
Because it does not produce images that were not in the training data, the model
clearly overfits the training set and also displays severe underfitting since images of
chairs are completely missing from the generated samples.

As the qualitative assessment of the samples is not a reliable way of evaluating a


model’s performance, one often resorts to the evaluation of the log-likelihood that
the model assigns to a certain test set. The evaluation of the log-likelihood might
be a challenging task, as some models may encounter difficulties in computing
the log probability of the data. Moreover, in certain situations, the likelihood may
not capture significant attributes of the model [104]. Finding good and robust
evaluation metrics is often considered a hard research problem [41], hence the
difficulty of fairly comparing different models. As discussed in the following section,
the scope of application of generative models extends beyond the mere generation
of new samples. When one is concerned with the usefulness of the model for a specific
task (e.g. anomaly detection), it is best to evaluate the model’s performance using
metrics that are related to the task being addressed.

2.2.3 Applications

Generative models are often referred to as generators of new data. However it is


important to understand that the knowledge of the joint probability distribution has
a much broader range of applications that extends beyond the mere generation of
new data [105, 41, 107]. A few examples of these applications are presented below
to illustrate this point.



• Denoising: Given some noisy input x̃, the model is able to return an estimate
of the original sample x that belongs to the probability distribution estimated
by the model Pm (x) [65]. For instance in medical imaging, noise can be
present in scans due to various factors such as the imaging hardware or motion
of the patient. Denoising techniques can be used to remove this noise and
improve the clarity of the images.

• Density estimation: Given some input x the model returns an estimation


Pm (x) of the real data probability density Pd (x). Despite its apparent simplicity,
the task requires the machine learning model to possess an efficient method of
parameterization as well as a thorough understanding of the patterns
and structure of the underlying data distribution. Density estimation can be
used for anomaly identification, which is the process of identifying abnormal
cases (i.e. outliers, anomalies) in a dataset that differ considerably from
the rest of the data. By modeling the data distribution of normal samples
and finding examples with a low likelihood of belonging to that distribution,
abnormal samples can be recognized. For instance this can be applied to detect
lesions in medical imaging [94], or to detect credit-card fraud in finance [3].

• Missing value imputation: Missing data imputation is the process of providing


a prediction for the values of some features which are missing from a given sam-
ple. By approximating the data distribution Pd (x) with Pm (x) it is possible to
estimate the conditional distribution of the missing features: Pm (xmissing |xgiven )
where xgiven corresponds to the corrupted sample containing every feature
except the missing one(s). An example of missing data imputation is illustrated in
Figure 2.5, where the model reconstructs the missing part (orange) of a given
image (black).

• Classification: When dealing with a problem in a supervised context, gener-


ative models can be used to learn the joint probability distribution Pd (x, y)
where x and y represent the feature(s) and label(s) respectively. Knowledge
of this joint distribution enables the model to carry out discriminative tasks as
well. Indeed, by leveraging Bayes' rule it follows [107]:

$$P_m(y|x) = \frac{P_m(x|y)P_m(y)}{P_m(x)} = \frac{P_m(x,y)}{\sum_y P_m(x,y)} \qquad (2.7)$$

Generative models can thereby be used to obtain the conditional distribution
Pm (y|x), paving the way to classification tasks (a small numerical sketch of this
conversion is given after this list).



Fig. 2.5.: Image reconstruction from partial images from the MNIST database [24]. The
given parts are in black and the reconstructed parts are in orange. Figure from
[46].

• Sampling: Finally the most straightforward application of generative models


is the generation of new samples. One of the key attributes of these models
is that it is possible to draw samples (that are similar, but not identical to
the training samples) from the explicitly defined (or not) model distribution
Pm (x). Generating new samples can be used to perform data augmentation
to provide ML models with larger and more diverse datasets. Huge
amounts of data are often needed to train state-of-the-art machine learning
models such as neural networks. If the available amount of data is too low
then the parameters of the model are under-determined and the model will
show poor generalization abilities. Standard data augmentation techniques are
often constrained in their ability to produce diverse and novel data points that
can effectively augment a given dataset. Generative models offer a solution
to this predicament as they can be used to generate broader sets of new data,
thereby enhancing the learning potential of ML models [2].
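As announced in the classification item above, the following sketch (with a hypothetical discrete joint distribution P_m(x, y) over four inputs and two classes, invented for this example) applies Equation 2.7 to turn a learned joint distribution into the conditional distribution P_m(y|x) used for classification.

```python
import numpy as np

# Hypothetical joint model distribution P_m(x, y) over 4 inputs (rows) and
# 2 classes (columns); the entries sum to one.
p_joint = np.array([
    [0.10, 0.05],
    [0.20, 0.05],
    [0.05, 0.25],
    [0.15, 0.15],
])

p_x = p_joint.sum(axis=1, keepdims=True)   # P_m(x) = sum_y P_m(x, y)
p_y_given_x = p_joint / p_x                # Equation 2.7: P_m(y|x) = P_m(x, y) / P_m(x)

print("P_m(y|x) for each x:")
print(p_y_given_x)
print("most probable class per x:", p_y_given_x.argmax(axis=1))
```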



3 Tensor Networks
3.1 Quantum many-body systems

People have long sought to comprehend the laws of nature and their impact on
the universe. The primary means of doing so was by following the reductionist
technique which attempts to explain the physical world in terms of ever-smaller
entities. However it has been realised that this approach cannot easily be extended
to the study of systems with a large number of interacting degrees of freedom [1]. In
fact, the microscopic physics is, to a large extent, known. The properties of a system
(e.g. a material, a chemical system or a spin system) can, in theory, be determined
by the many-body wave function [22]:

$$\psi(\vec{x}_1, \vec{x}_2, ..., \vec{x}_N, t) \qquad (3.1)$$

which is governed by the Schrödinger equation [97], expressed as¹

$$i\hbar\frac{\partial}{\partial t}\psi = \left[\frac{-\hbar^2}{2m}\sum_{j=1}^{N}\nabla_j^2 + \sum_{i<j} V(\vec{x}_i - \vec{x}_j) + \sum_{j} U(\vec{x}_j)\right]\psi \qquad (3.2)$$

Unfortunately this knowledge cannot be easily extended to the study of many-body


systems [61]. From a pragmatic point of view, the system of equations simply be-
comes too complex to solve and the system is thereby intractable. Even for few-body
systems (e.g. modest multi-electron atoms) the solution cannot be obtained without
bold approximations. In other words, while the underlying physical description
of individual degrees of freedom is known and easily obtainable, the study of a
collection of interacting degrees of freedom can become extremely complex. In
quantum systems, the many interactions between constituents lead to quantum
correlation or entanglement, making the purely reductionist-based description of
many-body quantum systems complex and inefficient [117].

¹ The many-body Schrödinger equation contains the essential microscopic physics that governs
the macroscopic behavior of materials even if one could argue that this expression could be
further refined to take other relevant aspects into account, e.g. the spin, the presence of external
electromagnetic fields, et cetera [61].

The objective of quantum many-body physics is to understand the characteristics of
systems consisting of many interacting particles, that are subjected to the Schrödinger
Equation (see Equation 3.2) and to acquire a better understanding of the resulting
emergent properties of these systems [22]. Quantum many-body physics lies at the
basis of our understanding of nature and tries to provide answers to fundamental
questions in theoretical physics, but it is also key to the advancement of new
technologies [117]. Fields such as quantum computing, quantum optics, high-
energy physics, nuclear physics, condensed matter physics or quantum chemistry are
all in essence related to the many-body quantum problem. Finding and developing
methods to effectively solve this problem is therefore the Holy Grail in modern
research in physics.

Although the last decades have seen a tremendous amount of developments in


theoretical and numerical methods for the treatment of many-body systems, the
many-body quantum problem remains one of the most challenging problems in
physics. One of the postulates of quantum theory is that quantum states can
be represented by state vectors in a Hilbert space [15]. In a quantum system,
interacting particles can be in a superposition. This superposition principle does
confer its particular tensor product structure to the Hilbert space, meaning that the
Hilbert space of interacting particles is the tensor product of the Hilbert spaces of
each particle [112]. Because of this structure, the Hilbert space is gigantic as its
dimension grows exponentially with the number of particles in the system. This is
known as the curse of dimensionality related to the Hilbert space [14]. To illustrate
this, consider N particles with d degrees of freedom (e.g. spin configurations,
electronic or molecular orbitals). The number of possible configurations for the
system is given by dN , which means that the dimension of the Hilbert space is dN .
Describing the wavefunction (i.e. the fundamental object of interest in quantum
mechanics [74]) by specifying the coefficients related to all possible configurations
of the system is clearly an inefficient way to deal with large systems. To illustrate
this, consider systems with $N \sim 10^{23}$ (which is of the order of Avogadro's number,
i.e. the number of particles contained in one mole), then the dimension of the
Hilbert space is of the order $O(2^{10^{23}})$, which is much larger than the number of atoms
in the universe [82]. Computers cannot cope with the exponentially big number of
parameters and performing calculations would imply a number of operations that
exponentially increases with N . Many-body calculations would therefore not be
feasible in a reasonable amount of time [21]. This means that the wavefunctions for
systems in the thermodynamic limit or large systems (such systems often capture
the attention of physicists and are a subject of great interest) are unobtainable and
the systems are thereby intractable [74]. It becomes evident that conventional



systems can result in Hilbert spaces of exponential size. Consequently, to efficiently
characterize quantum systems and correspondingly describe the surrounding world,
the use of more efficient representations becomes imperative.

3.2 Motivation and representation of Tensor Network states

Many (analytical or numerical) approximate methods have been developed to deal


with many-body systems. Some of them are based on tensor networks. Tensor
networks (TN) [112, 14, 74, 21, 82, 4] are variational quantum states that retain
the tensor product structure of the local degrees of freedom. They are successful in
handling large N -particle systems and represent quantum states in terms of inter-
connected tensors. The reason behind their success is their efficient parametrization
of quantum systems (i.e. the number of parameters scales only polynomially with
N [85]), which focuses specifically on a special group of states within the Hilbert
space. In quantum many-body systems, the presence of certain structures within
the Hamiltonians (e.g. the local character of physical interactions found in many
physical Hamiltonians) is such that the physical states exist in a tiny manifold of the
gigantic Hilbert space. The states located in this manifold possess unique entangle-
ment characteristics. A useful entanglement measure to analyze the entanglement
structure of a quantum state is the bipartite entanglement entropy [45]. Consider a
quantum state |Ψ⟩ living in a Hilbert space H. By bipartitioning the system into two
subsystems, A and B, with H = HA ⊗ HB , the reduced density matrix can be defined
as
ρA = TrB [|ψ⟩ ⟨ψ|] (3.3)

The eigenvalues pi = e−λi of ρA determine the entanglement spectrum λi and the


entanglement entropy is given by the Von Neumann entropy [9] of the reduced
density matrix ρA :

$$S = -\mathrm{Tr}\left[\rho_A \log\rho_A\right] = -\sum_i p_i \log p_i = \sum_i \lambda_i e^{-\lambda_i} \qquad (3.4)$$

In fact, the entanglement entropy between two subsystems of a gapped quantum


system scales with the area of the region between them, rather than with the volume.
In other words, the entropy is proportional to the surface area of the boundary,
rather than to its volume. For a generic quantum state located in the huge Hilbert
space, this is not the case. In fact, it can be demonstrated that low-energy



eigenstates of gapped Hamiltonians with local interactions follow the area-law for
entanglement entropy [27] and those states are thereby heavily constrained by the
local behavior of physical interactions [82]. It has also been shown that the quantum
many-body states generated through time evolution controlled by a time-dependent
local Hamiltonian during a time period that scales polynomially with the system size
N, can only occupy an exponentially small fraction of the Hilbert space. This implies
that the majority of the states in the Hilbert space are inaccessible and therefore
considered nonphysical. This is the reason why the Hilbert space is sometimes
referred to as a convenient illusion [85] (it is convenient from a mathematical point
of view, but it is an illusion because the major part of the Hilbert space consists
of nonphysical states [82]). The states that satisfy the area-law for entanglement
are precisely the ones targeted by tensor network states. This implies that tensor
networks specifically focus on the crucial corner of the Hilbert space where physical
states reside. Another factor contributing to the success of tensor network (TN)
techniques is the advantageous diagrammatic representation they provide. As will be
made clear below, TN methods offer a straightforward and intuitive way of depicting
tensors and tensor networks, thus obviating the need for complicated mathematical
equations. In what follows the basics of TN diagrammatic notation will be provided
and the TN depiction of quantum many-body states will be elucidated.

3.2.1 Diagrammatic representation

For our purposes, a tensor is a mathematical object characterized by its multidimen-


sional array structure. It can be thought of as a generalization of the concepts of
vectors and matrices, encompassing them in a higher-dimensional framework. The
dimensionality of a tensor is referred to as its rank or order, which specifies the
number of indices that the tensor has. Consider the following rank-3 tensor T with
indices α, β and γ (expressed here by the Einstein index notation):

Tα,β,γ (3.5)

Each index of the tensor corresponds to a specific dimension and can have different
values. Figure 3.1 shows the tensors in two distinct but equivalent representations
(in Figure 3.1a the tensor is represented by all its elements while in Figure 3.1b
it is represented in the diagrammatic representation.) Figure 3.1a is intended to
explicitly depict the fact that a tensor is a multi-dimensional data structure. By
convention, a tensor is represented in pictorial language by an arbitrary shape or
block (tensor shapes can have a certain meaning depending on the context) with



emerging lines, where each line corresponds to one of the tensor’s indices. This
is illustrated in Figure 3.1b where the lines are explicitly identified for the sake of
clarity but this is usually not the case.

Fig. 3.1.: Tensor T of rank 3 represented (a) by all its elements, (b) by the diagrammatic
representation.

An index that is shared by two tensors designates a contraction between them (i.e. a
sum over the values of that index), in pictorial language that is represented by a line
connecting two blocks. An example of a tensor contraction is the matrix product
shown in mathematical form in Equation 3.6. It can also be expressed using tensor
diagrams. This is shown in Equation 3.7.
$$C_{ik} = \sum_j A_{ij} B_{jk} \qquad (3.6)$$

⇐⇒ i C k = i A j B k (3.7)

In physics, contracted indices are commonly referred to as virtual indices, and their
dimension is referred to as the bond dimension. On the other hand, the physical
degrees of freedom are represented by external non-connected lines referred to as
physical indices. Numerically, tensors can be contracted by summing each of the con-
tracted indices. Nonetheless, a faster approach exists, which involves transforming
the problem into a matrix product [4]. The computational cost associated with a
contraction depends on the size of the contracted indices.
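The sketch below (random tensors of arbitrary shapes, chosen purely for illustration) contracts a shared index in three equivalent ways: by explicit summation, with numpy's einsum, and by reshaping the problem into a matrix product, which is the faster approach mentioned above.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(4, 6, 5))   # tensor A with indices (i, j, k)
B = rng.normal(size=(5, 3))      # tensor B with indices (k, l); the index k is shared

# 1) Naive contraction: explicit summation over the shared index k.
C_naive = np.zeros((4, 6, 3))
for k in range(5):
    C_naive += A[:, :, k, None] * B[None, None, k, :]

# 2) The same contraction written with einsum.
C_einsum = np.einsum("ijk,kl->ijl", A, B)

# 3) The contraction cast as a matrix product: group (i, j) into a single row index.
C_matmul = (A.reshape(4 * 6, 5) @ B).reshape(4, 6, 3)

print(np.allclose(C_naive, C_einsum), np.allclose(C_einsum, C_matmul))
```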



3.2.2 Tensor Networks and quantum states

Consider now a general N -particle quantum many-body system described by the


following wavefunction:
$$|\psi\rangle = \sum_{j_1 j_2 ... j_N} \Psi_{j_1 j_2 ... j_N} |j_1\rangle \otimes |j_2\rangle \otimes ... \otimes |j_N\rangle \qquad (3.8)$$

where each |ji ⟩ represents the basis of an individual particle with label i and with
dimension d (for simplicity, the dimensions of the individual particle’s basis are
assumed to be equal). The coefficients corresponding to the different dN config-
urations can be represented using the coefficient tensor Ψj1 ,j2 ,...,jN that contains
dN parameters. This tensor has N d-dimensional entries and fully determines the
wavefunction of the quantum system. As mentioned above, the exponentially large
number of parameters needed to fully describe the system makes the system in-
tractable. The solution proposed by tensor network approaches is to represent and
approximate the coefficient tensor Ψj1 ,j2 ,...,jN by a network of interconnected lower-
rank tensors with a given structure. The structure of a tensor network is determined
by the entanglement patterns between the local degrees of freedom within the
quantum system it represents. Figure 3.2 illustrates how a high-dimensional tensor
(e.g. the coefficient tensor Ψj1 ,j2 ,...,jN ) can be decomposed as a tensor network. The
structure of the tensor network shown here is meant to be "generic". More useful
TN structures will be presented in the following sections.

3.3 Matrix Product State (MPS)

As mentioned above, the relatively restricted entanglement in quantum ground states,


in contrast to generic states, implies the potential for a more efficient description of
those states. Matrix Product States (MPS) offer such a successful parametrization
of quantum ground states for physical systems in one dimension [14, 74, 82, 112,
21, 4, 45]. Several independent occurrences of the concept of MPS can be observed
in the literature. In fact, the year 1992 saw two independent papers give birth to
the, nowadays well-known, Matrix Product State. In one of them, a class of finitely
correlated and translational invariant states in quantum spin chains was introduced
[29] and can be associated with translational invariant infinite MPS. In conjunction
with the introduction of finitely correlated states, Steve White’s work on the Density
Matrix Renormalization Group (DMRG) in the field of condensed matter physics
also led to the implicit emergence of the MPS [119]. Furthermore, the notion of




Fig. 3.2.: The coefficient of a quantum many-body state (top) can be represented in the
form of a high-dimensional tensor (middle) with an exponentially large number of
parameters in the system size. The high-dimensional tensor can be expressed in a
network of interconnected tensors (bottom) that takes into account the structure
and the amount of entanglement in the quantum many-body state.

tensor train (TT) is equivalent to that of matrix product state (MPS) and was initially
introduced in a more mathematical framework [83].

3.3.1 Construction via successive Singular Value Decomposition

For a one-dimensional lattice system consisting of N qudits (d-dimensional quantum


system), the general wavefunction of the N-particle quantum many-body system is
given by Equation 3.8. As previously stated, tensor network-based methods aim at
decomposing the rank-N coefficient tensor Ψj1 j2 ...jN into a network of interconnected
lower rank tensors. For this purpose, the Singular Value Decomposition (SVD) is
used (see Appendix A for a detailed explanation of the tensor manipulations required
to construct an MPS via successive SVD). Reshaping the tensor Ψj1 j2 ...jN to a matrix



by isolating the physical index that belongs to the first lattice site and then applying
the SVD to it results in
$$\begin{aligned}
|\psi\rangle &= \sum_{\{j_i\}} \Psi_{(j_1);(j_2...j_N)} |j_1 j_2 ... j_N\rangle \\
&= \sum_{\{j_i\}} \sum_{\alpha_1} A^1_{j_1;\alpha_1} \Sigma_{\alpha_1;\alpha_1} B^1_{\alpha_1;j_2 j_3...j_N} |j_1 j_2 ... j_N\rangle \\
&= \sum_{\{j_i\}} \sum_{\alpha_1} A^1_{j_1\alpha_1} \Psi'_{\alpha_1 j_2 j_3...j_N} |j_1 j_2 ... j_N\rangle
\end{aligned} \qquad (3.9)$$

where $\sum_{\{j_i\}} = \sum_{j_1=1}^{d} \cdots \sum_{j_N=1}^{d}$. The matrix Σ is a diagonal matrix with the singular
values on the diagonal. In the last line of Equation 3.9 the singular values are
absorbed into the matrix $B^1$. This results in the new tensor $\Psi'$. This can be expressed
using diagrammatic notation as follows:

[Equation (3.10): diagrammatic form of Equation 3.9 — the coefficient tensor Ψ with physical indices $j_1, ..., j_N$ is split by an SVD into $A^1$, Σ and $B^1$, after which the singular values are absorbed into $B^1$ to yield the new tensor $\Psi'$.]

By repeating the procedure and isolating the next lattice site, the newly created

tensor $\Psi'$ can be transformed into a matrix:


$$\Psi'_{(\alpha_1 j_2);(j_3...j_N)} \qquad (3.11)$$

which can in turn be decomposed via SVD. In fact, by repeating this procedure for
all lattice sites it follows
$$|\psi\rangle = \sum_{\{j_i\}} \sum_{\{\alpha_i\}} A^1_{j_1\alpha_1} A^2_{\alpha_1 j_2\alpha_2} A^3_{\alpha_2 j_3\alpha_3} \cdots A^N_{\alpha_{N-1} j_N} |j_1 j_2 ... j_N\rangle \qquad (3.12)$$



where $\sum_{\{\alpha_i\}}$ expresses the sum over the different virtual indices or virtual degrees of
freedom and the sums over the individual values $\alpha_n$ run from 1 to $D_n$ with

$$D_n = \min\left(\dim\bigotimes_{i=1}^{n}\mathcal{H}_i,\; \dim\bigotimes_{i=n+1}^{N}\mathcal{H}_i\right) = d^{\min(n,\,N-n)} \qquad (3.13)$$

where Hi is the Hilbert space of the qudit at site i. Choosing D0 = DN = 1 is


interpreted as a state with open boundary conditions. Using the tensor diagrammatic
notation the coefficient tensor of Equation 3.12 is given by

[Equation (3.14): MPS diagram with open boundary conditions — a chain of tensors $A^1, A^2, A^3, ..., A^{N-1}, A^N$ connected by virtual bonds, each carrying one physical index $j_1, ..., j_N$.]

The construction can also be extended to a state with periodic boundary condition,
which gives

[Equation (3.15): the same chain of tensors with an additional virtual bond connecting $A^N$ back to $A^1$, closing the ring.]

The expressions in Equations 3.12 and 3.14 represent the Matrix Product State
ansatz and more specifically the left-canonical form of the MPS. Note that the MPS
is not unique (see below) and that there exists a more general way to construct it
via successive Schmidt decompositions [45, 37]. Alternatively, the MPS form of a
quantum state can also be obtained via a construction as one-dimensional projected
entangled pair states in the valence bond picture [45].

The Matrix Product State provides an exact representation of the rank-N tensor.
However, one might wonder about the usefulness of the MPS expression in Equation
3.12; and rightly so, as the number of parameters is the same as for the coefficient
tensor Ψ. Upon careful analysis of the dimensionality of the virtual bond indices (i.e.
αn ), it becomes apparent that the bond dimension grows exponentially towards the
middle of the tensor sequence, meaning that the exponential scaling problem of the
system has not been solved yet. It has merely been cast in a more complicated form.
A potential solution emerges from this predicament; the SVD can be truncated. The
number of parameters contained in the MPS can be drastically reduced by limiting
the bond dimension of the MPS to a value D. Providing an upper limit for the
bond dimension, boils down to taking at most the D largest singular values in the
SVD decomposition at each step of the procedure depicted above. By doing so, the
maximum number of parameters that the MPS can contain is N pD2 . The choice of



D is crucial as it determines the accuracy of the TN methods. Finally, the tensor
obtained by fully contracting the truncated MPS does approximate the coefficient
tensor Ψ with only O(N ) parameters.
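The successive-SVD construction with a fixed maximum bond dimension D can be written down compactly. The sketch below (applied to a random coefficient vector, for which the truncation is lossy, unlike for the area-law states discussed next) returns a list of left-canonical site tensors; the index convention (left bond, physical, right bond) and the helper name mps_from_tensor are choices made for this example.

```python
import numpy as np

def mps_from_tensor(psi, d, N, D):
    """Decompose a length d**N coefficient vector into an MPS with bond dimension <= D."""
    tensors = []
    rest = psi.reshape(1, -1)
    bond = 1
    for _ in range(N - 1):
        rest = rest.reshape(bond * d, -1)              # isolate the next physical index
        U, S, Vh = np.linalg.svd(rest, full_matrices=False)
        keep = min(D, len(S))                          # keep at most the D largest singular values
        U, S, Vh = U[:, :keep], S[:keep], Vh[:keep, :]
        tensors.append(U.reshape(bond, d, keep))       # left-canonical site tensor
        rest = np.diag(S) @ Vh                         # absorb the singular values to the right
        bond = keep
    tensors.append(rest.reshape(bond, d, 1))           # the last site collects what is left
    return tensors

# Example: N = 8 sites with d = 2, a random state, and maximum bond dimension D = 4.
d, N, D = 2, 8, 4
psi = np.random.default_rng(4).normal(size=d ** N)
mps = mps_from_tensor(psi, d, N, D)
print([t.shape for t in mps])
```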

It is clear that this approximation is only applicable in cases where the truncated
singular values are negligible. Fortunately, the entanglement entropy of ground
states in physical systems exhibits an area-law behavior, as previously described.
This limited amount of entanglement for gapped ground states results in highly
constrained systems, in which a part of the Schmidt weights (i.e. the singular
values) tend to vanish [14]. Thanks to that, Matrix Product States can faithfully
represent one-dimensional area-law states (i.e. gapped ground states) with only
O(N ) parameters [109].

3.3.2 Gauge degree of freedom and canonical form

As previously stated, MPS are not unique, which means that a quantum state
can have multiple MPS representations. In order to illustrate this, consider the
subsequent local transformation used to change the tensor entries:

$$A^i_{\alpha_{i-1} j_i \alpha_i} \rightarrow \left[X^{i-1}\right]^{-1}_{\alpha_{i-1}\beta_{i-1}} A^i_{\beta_{i-1} j_i \beta_i} X^i_{\beta_i \alpha_i} \qquad (3.16)$$

where the Einstein notation was used to imply the summation over the repeated
indices. The matrices $X^i$ only need to satisfy $X^i_{\alpha_i\beta_i}\left[X^i\right]^{-1}_{\beta_i\gamma_i} = \delta_{\alpha_i,\gamma_i}$. Substituting
the above transformation in Equation 3.15 boils down to inserting an identity matrix
between each site, which yields:

$$|\psi\rangle = \sum_{\{j_i\}} \mathrm{Tr}\left[\left(X^N\right)^{-1} A^1_{j_1} X^1 \left(X^1\right)^{-1} A^2_{j_2} X^2 \cdots \left(X^{N-1}\right)^{-1} A^N_{j_N} X^N\right] |j_1 j_2 ... j_N\rangle \qquad (3.17)$$

The above MPS is specified to have periodic boundary conditions. In the case of open
boundary conditions, the trace operation is not taken into account and D0 = DN = 1.
Consequently, it can be observed that the tensors Ai are transformed while the
quantum states remain unaltered. This ability to modify the MPS parametrization
without affecting the quantum state is known as the gauge freedom. There exist



canonical forms of the MPS that gauge fix the representation. One might indeed
choose the tensors Ai such that they satisfy

$$\sum_{j_i} (A^i_{j_i})^\dagger (A^i_{j_i}) = I_{\alpha_i} \qquad (3.18)$$

[Equation (3.19): diagrammatic form of Equation 3.18 — contracting a site tensor with its complex conjugate over the physical and left virtual indices yields the identity line.]

for all lattice sites except the rightmost one. This is known as the left canonical
form of the MPS. Clearly, by decomposing a generic rank-N tensor by applying the
SVD for each lattice site i = 1, ..., (N − 1) (procedure outlined above), the resulting
tensors Ai (as defined in Appendix A) conform to Equation 3.18. The tensors Ai are
then said to be in their left canonical form. The tensors Ai can also be chosen to
satisfy
$$\sum_{j_i} (A^i_{j_i})(A^i_{j_i})^\dagger = I_{\alpha_{i-1}} \qquad (3.20)$$

[Equation (3.21): diagrammatic form of Equation 3.20 — contracting a site tensor with its complex conjugate over the physical and right virtual indices yields the identity line.]

for all lattice sites except the leftmost one. This form is known as the right canonical
form. Finally, a mixed-canonical form can be introduced with respect to a lattice site
i. This form is obtained by left-canonicalising the tensors on the left of the lattice
site i and right-canonicalising the tensors on the right while keeping the tensor
corresponding to the site i neither left- nor right-canonical. By using the canonical
forms of the Matrix Product State (MPS), it is possible to make the computation
time of certain operations independent of the number of lattice sites N . This leads
to a significant reduction in computational cost, which is why the canonical forms
are widely used in the numerical implementation of tensor network methods.
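Equation 3.18 can be verified numerically for a single site. The sketch below (with arbitrary bond and physical dimensions, using a QR decomposition instead of an SVD, which also produces a tensor satisfying the left-canonical condition) left-canonicalizes a random real site tensor and checks that the contraction over the physical and left virtual indices gives the identity.

```python
import numpy as np

rng = np.random.default_rng(5)
Dl, d, Dr = 3, 2, 4                      # left bond, physical and right bond dimensions

# A generic (non-canonical) real MPS site tensor A[alpha_left, j, alpha_right].
A = rng.normal(size=(Dl, d, Dr))

# Left-canonicalize: group (alpha_left, j) into the row index and QR-decompose.
Q, R = np.linalg.qr(A.reshape(Dl * d, Dr))
A_left = Q.reshape(Dl, d, -1)            # left-canonical site tensor
# (R would be absorbed into the tensor on the next site so the state stays unchanged.)

# Check Equation 3.18: summing (A^j)^T A^j over the physical index gives the identity
# on the right virtual index (the dagger reduces to a transpose for real tensors).
check = sum(A_left[:, j, :].T @ A_left[:, j, :] for j in range(d))
print(np.allclose(check, np.eye(A_left.shape[2])))
```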

3.3.3 Expectation values and operators

In quantum mechanics, a fundamental challenge is to extract information from a


quantum state representation. To do so, one often needs to calculate the expectation
values of local observables. The use of tensor networks, and more specifically, of
Matrix Product States (MPS), raises the question of how efficiently this information



can be extracted. If the coefficients related to all configurations of the system have
to be extracted from the MPS in order to calculate expectation values, then the
representation loses its usefulness. Fortunately, it is possible to calculate expectation
values using MPS in a time complexity of O(N pD3 ) [82]. Moreover, the use of
the canonical form can significantly reduce the computational complexity of such
calculations. As an instructive example, consider the calculation of the norm ⟨ψ|ψ⟩
of an MPS with open boundary conditions in left-canonical form.

[Equation (3.22): diagrammatic computation of ⟨Ψ|Ψ⟩ — contracting the bra and ket MPS site by site from the left, each left-canonical pair of tensors collapses to an identity (Equation 3.18), so that only the contraction of the rightmost tensor $A^N$ with its conjugate remains.]

It is clear from Equation 3.22 that the norm of an MPS in left-canonical form
is encapsulated in the rightmost tensor AN and that the computation becomes
independent of N . Analogous results are obtained when using the right- or mixed-
canonical forms.

In quantum mechanics, operators can be seen as linear maps from one Hilbert space
onto itself. Moreover, they can be expressed in the same TN-based framework as
quantum states [117]. Consider a two-site local operator acting on the sites i and
i + 1:

[Equation 3.23: the two-site operator Ô drawn as a single tensor with two incoming and two outgoing physical legs acting on sites i and i + 1.]



Its expectation value with respect to some MPS state becomes:

[Equation 3.24: tensor diagram of ⟨ψ|Ô|ψ⟩. In the full diagram the operator is sandwiched between the MPS and its conjugate; in mixed-canonical form all tensors to the left and right of sites i and i + 1 contract to identities, so the expectation value reduces to the contraction of Ai, Ai+1, their conjugates, and Ô.]

Similarly to Equation 3.22, a canonical form can be leveraged to simplify the calculations. In this case, the tensors Ak are left- (right-) canonical for all k < i (respectively k > i + 1), bringing the MPS into a mixed-canonical form. It is also worth noting that an operator that acts on multiple sites can be decomposed and represented as a tensor network of smaller tensors that operate on individual sites. Specifically, in one-dimensional systems, many-body operators that exhibit local interactions can often be expressed in a tensor network-based representation known as Matrix Product Operators (MPOs) [117, 49, 111]. An MPO consists of multiple rank-4 tensors (with two physical indices instead of the single physical index of an MPS tensor).
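As an illustration of how the mixed-canonical form collapses the calculation, the sketch below evaluates the expectation value of a two-site operator by contracting only the two centre tensors with the operator. It assumes a real-valued, normalised MPS in mixed-canonical form around sites i and i + 1, a (left bond, physical, right bond) tensor layout, and an operator stored as a rank-4 array; all of these choices are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def two_site_expectation(A_i, A_ip1, O):
    """Expectation value of a two-site operator for a mixed-canonical MPS.

    A_i, A_ip1 : centre tensors with shape (D_left, p, D_right),
                 assumed real and mixed-canonical around sites i, i+1.
    O          : rank-4 operator with shape (p, p, p, p),
                 ordered as (out_i, out_ip1, in_i, in_ip1).
    """
    # Merge the two centre tensors into one rank-4 block: (D_left, p, p, D_right).
    theta = np.einsum('apb,bqc->apqc', A_i, A_ip1)
    # Apply the operator to the physical legs of the ket.
    O_theta = np.einsum('pqrs,arsc->apqc', O, theta)
    # Contract with the bra; the left and right environments are identities
    # by canonicality, so no other tensors are needed.
    return np.einsum('apqc,apqc->', theta, O_theta)
```

For a normalised state this single local contraction gives the full expectation value, independent of the total number of sites N.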

3.3.4 Optimizing MPS: finding ground states

So far, it has been demonstrated how general rank-N tensors can be decomposed
into tensor trains of lower rank tensors. In Section 3.3.3, a method for efficiently
extracting information from a Matrix Product State (MPS) was presented, under
the assumption that the MPS representation was readily accessible. However, when
considering a generic physical system that is subjected to a certain Hamiltonian
H, the ground state and thus its MPS representation is not known beforehand.
The relevant question is therefore: how can a suitable MPS representation of a
physical system's ground state be found? This section describes a variational method
used to find MPS that accurately represent such a state. The goal of variational
methods is to obtain an approximation of the ground states of physical systems. The
approach consists of choosing an initial parametrization class for the wavefunction
|Ψ⟩ and then finding the parameter values that minimize the expectation value of
the energy.



At the beginning of this chapter it was mentioned that tensor networks are states
that directly target the relevant corner (i.e. the area-law corner) of the Hilbert space
H. It means that the area-law for entanglement entropy is directly encoded in their
structure [82]. As a result, TN states form an interesting class of variational Ansätze
that can be used to approximate the ground states of gapped quantum systems.
Consider a physical system that is subjected to a Hamiltonian H and assume that
H is given in Matrix Product Operator (MPO) form. The parametrization class
considered here consists of MPS with fixed bond dimensions. Therefore, the best
approximation of the system’s ground state within this class is the MPS |Ψ⟩ that
minimizes
E = \frac{\langle\Psi|H|\Psi\rangle}{\langle\Psi|\Psi\rangle}    (3.25)
As explained in Ref. [96], this problem can be solved by introducing the Lagrange multiplier λ and by minimizing the following:

⟨Ψ|H|Ψ⟩ − λ ⟨Ψ|Ψ⟩ (3.26)

[Equation 3.27: tensor-diagram form of Equation 3.26, in which the MPO for H is sandwiched between the MPS and its conjugate, minus λ times the norm diagram of the MPS.]
The high non-linearity of the problem resulting from the fact that the parameters
of the wavefunction (i.e. the parameters of the tensors Ak , ∀k) appear in the form
of products makes the problem rather difficult to solve [96]. One solution to this
predicament is to proceed in an iterative manner. Specifically, all tensors except
for the one located at site k are kept constant. The variational parameters of the
wavefunction are now only encapsulated in the tensor Ak which reduces Equation
3.27 to a quadratic form, for which the determination of the extremum is a simpler
linear algebra problem. The resulting wavefunction leads to a decrease in energy.
However, it is clear that this state is not the optimal approximation within the class
under consideration. Therefore, the parameters of another tensor are modified in
a similar way to obtain a state with a lower energy. This technique is performed
several times for each tensor of the MPS, by sweeping back and forth, until the
energy reaches a point of convergence.

The optimization process described above corresponds to the well-known Density


Matrix Renormalization Group algorithm (DMRG) in the language of MPS. When
initially presented, the DMRG algorithm was not explicitly expressed in terms of
MPS [119, 120]. However, it has been shown that it has a natural interpretation as a



variational method over the class of Matrix Product States [84, 26]. The DMRG has
been extended to MPS with periodic boundary conditions [113] but also to other
TN structures [67, 110, 100, 76]. It is worth noting that other methods exist to find
suitable MPS representations such as Time Evolving Block Decimation (TEBD) [114]
or Riemannian optimization [47].

3.4 Tree Tensor Networks and Projected Entangled Pair


States

It can be shown that the correlation functions of MPS decay exponentially [82].
However, many ground states of gapless quantum systems in one dimension exhibit a power-law decay of correlations [14]. The implication that can be drawn from
this observation is that MPS, although being very useful for representing low energy
states of one-dimensional gapped systems, might be inadequate to accurately and
efficiently represent some other relevant quantum states. Fortunately, there are
other Tensor Network architectures that possess other characteristics. The present
section is not intended to provide an in-depth analysis of alternative TN models;
instead, it introduces two other TN architectures: Tree Tensor Networks (TTN) and Projected Entangled Pair States (PEPS).

Tree Tensor Networks are Tensor Networks exhibiting a tree-like structure, as


shown in Figure 3.3. The TTN can be used to simulate one-dimensional systems
with long-range interactions [100]. It can be shown that the correlation function
in a TTN decays with a power law. It is therefore possible to use TTN to efficiently
represent some quantum states that are beyond the representation capability of
MPS.

Fig. 3.3.: Illustration of Tree Tensor Network (TTN).



As explained above, MPS are very well suited for one-dimensional lattice systems
(e.g. a one-dimensional spin system) as they properly incorporate the area-law
behaviour of those systems. However, the representation of wavefunctions on two-
dimensional lattices using MPS requires exponentially large bond dimensions [117].
Two-dimensional systems are best described using the Projected Entangled-Pair
States (PEPS) ansatz (illustrated in Figure 3.4), which provides an efficient way to
parameterize interesting many-body wavefunctions on two-dimensional lattices [20]
(i.e. they incorporate the correct area-law behaviour for such systems). Unlike the
MPS, PEPS contain loops within the tensor network, and tasks such as exact contraction of the tensor network are exponentially hard to achieve [117, 98]. Fortunately, there exist efficient algorithms that can provide approximations for the contraction of PEPS [110].

Fig. 3.4.: Illustration of a Projected Entangled-Pair State (PEPS).



4 Generative modeling using Tensor Networks
4.1 Overview of TN-based methods in machine learning

There exists a significant synergy between the fields of physics and machine learning,
wherein each has a substantial influence on the other. On the one hand, machine
learning has emerged as an important tool in physics and other related fields. As
previously mentioned it is used in astronomy to mine large datasets and unearth
novel information from them [6]. ML algorithms have also shown potential to enhance the discovery of new materials [40] and have been successfully employed in the classification of different phases of matter [16]. Another example is the utilization of neural-network quantum states (NQS), which describe quantum systems in terms of artificial neural networks (ANN) [117, 55].
On the other hand, ideas from the field of physics can serve as inspiration for the
developments of new, physics-inspired, learning schemes. Examples of this are the
Hopfield model [48] and energy-based models [105].

Similarly, the last decade has witnessed a notable interest towards the use of tensor
networks in the realm of machine learning. Notably, tensor networks have been
employed for various learning tasks, such as classification [103, 81, 38, 19], for
certain unsupervised learning problems [64, 102], for the compression of neural
networks [80, 87], for reinforcement learning [118, 72] and also to model the
underlying distribution of datasets [46, 18, 116, 39, 101].

Tensor networks have also been proposed as a potential framework for the implementation of both discriminative and generative tasks on near-term quantum computers [50, 121, 25, 23]. In the foreseeable future, near-term quantum devices, often referred to as Noisy-Intermediate-Scale Quantum (NISQ) technology, are expected
to become accessible, featuring an increased number of qubits ranging from 50 to
100 [86]. However, the presence of noise in quantum gates will impose restrictions
on the scale of quantum circuits that can be effectively executed [86]. Machine
learning is a promising application of quantum computing: it is relatively resilient to noise, which would allow implementation on near-term quantum devices without the need for error correction [50]. The primary obstacle lies in the fact that real
datasets often possess a high number of features, necessitating a substantial number
of qubits for representation. Ref. [50] illustrates the potential of tensor networks for quantum machine learning by presenting qubit-efficient quantum circuits for the implementation of ML algorithms whereby the required number of physical qubits
scales logarithmically with, or independently of, the input data sizes.

The following section focuses on an MPS-based architecture for generative modelling


and is mostly based on Ref. [46].

4.2 MPS-based generative model

Considering the inherent similarities between the task of generative modelling


and quantum physics, the application of tensor networks to effectively capture
underlying probability distributions of datasets is not unexpected. Both fields strive
to model a probability distribution within a vast parameter space. In general,
when a model tries to capture a distribution over a random vector x comprising n
discrete variables that can assume k distinct values, representing P (x) naively by
storing the probabilities associated with each configuration is not an efficient way to
deal with the problem. This echoes the fact that describing the wave function by
specifying the coefficients related to all configurations of the system is not a good
approach for representing quantum many-body systems. Furthermore, it is worth
emphasizing that the pertinent configurations encompass only a small fraction of
the exponentially vast parameter space. Consider, for example, the case of images,
where the majority of pixel value combinations are regarded as noise, while the
actual images constitute a small subset of feasible configurations. This has again
some striking similarities with the field of quantum physics, wherein the relevant
states exist within a tiny manifold of the Hilbert space. The success of tensor
networks in efficiently parameterizing such states suggests that they could be useful
for generative modeling.

4.2.1 Model representation

Consider a dataset T consisting of |T | N -dimensional data points vi ∈ V =


$\{1, 2, \ldots, p\}^{\otimes N}$, where the entries of the vectors vi can take p different values. The
data points vi are potentially repeated in the dataset and can be mapped to basis
vectors of a Hilbert space of dimension p^N. The dataset can be seen as a collection



of samples extracted from the underlying data distribution Pd (v), which is unknown.
The approach followed in this work bears strong similarities to the manner in which
probability distributions are represented in quantum mechanics. From a practical
standpoint, quantum mechanics models the wavefunction Ψ(v), whose squared
norm provides the probability distribution in accordance with Born’s rule:

P(v) = \frac{|\Psi(v)|^2}{Z}    (4.1)

where $Z = \sum_{v \in \mathcal{V}} |\Psi(v)|^2$ is a normalization factor, also referred to as the partition function. Modelling a probability distribution using a quantum state wavefunction


fundamentally differs from traditional approaches. The generative models that
leverage this probabilistic interpretation are referred to as Born machines [17]. One
advantage of this probabilistic formulation is that it ensures the non-negativity
of the probability distribution. In this work, the wavefunction is parameterized
using the MPS form with real-valued parameters and is subjected to open boundary
conditions:

\Psi(v_1, v_2, \ldots, v_N) = A^1_{v_1} A^2_{v_2} A^3_{v_3} \cdots A^{N-1}_{v_{N-1}} A^N_{v_N}    (4.2)

It is also possible to explicitly model the probability distribution using an MPS with
non-negative tensor elements as opposed to describing it as the square of a matrix
product state. Born machines, however, have been suggested to be more expressive
than conventional probability functions [107, 18, 17].

In the field of generative modelling, a significant obstacle frequently faced is the


computation of the partition function. Evaluating Z requires a summation over all
p^N possible configurations and the problem is thereby often intractable. Machine
learning models which use the maximum likelihood approach often resort to approximate methods such as annealed importance sampling (AIS) [107, 78]. Other models, such as GANs, completely alleviate the need to explicitly compute the partition function [42]. The use of MPS-based Born machines allows for efficient and exact calculation of Z, which is the squared norm of the Matrix Product State.
In tensor diagram notation it is expressed as:



[Equation 4.3: Z expressed as the contraction of the MPS with its complex conjugate over all physical indices.]

As explained in Section 3.3.3, by leveraging canonical forms, the partition function


can be accurately and efficiently computed. This constitutes a distinct advantage of
this MPS-based method over alternative approaches.
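As a concrete illustration, the partition function of a real-valued open-boundary MPS can be computed exactly with a transfer-matrix-style contraction, as in the sketch below; the cost scales as O(N p D^3) rather than with the p^N terms of the naive sum. The tensor layout and the function name are assumptions used only for this example.

```python
import numpy as np

def partition_function(tensors):
    """Exact Z = <Psi|Psi> for a real open-boundary MPS.

    tensors[i] has shape (D_left, p, D_right), with D_left = 1 on the first
    site and D_right = 1 on the last site.
    """
    # Environment matrix carrying the contraction of bra and ket bonds.
    env = np.ones((1, 1))
    for A in tensors:
        # Contract the ket tensor, then the bra tensor, into the environment.
        env = np.einsum('ab,apc,bpd->cd', env, A, A)
    return env[0, 0]

# Example: Z for a small random MPS over p = 27 characters.
mps = [np.random.rand(1, 27, 8), np.random.rand(8, 27, 8), np.random.rand(8, 27, 1)]
print(partition_function(mps))
```

If the MPS is kept in left-canonical form, the same quantity reduces to the squared norm of the rightmost tensor and the full loop is not even needed.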

4.2.2 Optimization

After establishing the parameterization class of the model as the MPS form, the
objective is to identify the optimal MPS that produces a probability distribution that
closely matches the distribution of the given data. Consequently, the parameters of
the Matrix Product State are fine-tuned to minimize the Negative Log-Likelihood
(NLL) loss function:

L = -\frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \ln P(v_i) = \ln Z - \frac{1}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \ln |\Psi(v_i)|^2    (4.4)

The minimization of the Negative Log-Likelihood is achieved through the use of a


Gradient Descent (GD) approach, yet with a notable difference from conventional
GD methods where all parameters are updated simultaneously [18]. In this case,
the parameters of the MPS are iteratively updated in a manner that resembles the DMRG algorithm, more specifically the two-site DMRG variant. This variant deviates from the conventional one-site DMRG algorithm that was presented in Section 3.3.4. The two-site DMRG enables a dynamic adjustment of the bond
dimensions of the MPS during the learning phase. This adaptive process efficiently
allocates computational resources to areas where stronger correlations among the
physical variables exist [46].

The learning process starts by initializing the MPS with low bond dimensions (typ-
ically Dk = 2 except for the right- and leftmost tensors) and with random tensor
elements. As discussed in Ref. [107], the initialization of the tensor elements can
have an impact on the performance of the model. Following the initialization, the



next step involves canonicalizing the MPS, wherein all tensors, except the right-
most one, are transformed into their left-canonical form (Equation 3.18). The
left-canonicalization of a tensor Ak of the MPS is performed by first merging Ak
with its right neighbor Ak+1 and then using the Singular Value Decomposition on
the merged tensor Ak,k+1 :

[Equation 4.5: the merged tensor Ak,k+1 is factorized by an SVD into Ũ Σ̃ Ṽ; the reshaped Ũ becomes the new left-canonical tensor Ak, while Σ̃ Ṽ is absorbed into Ak+1.]

The left-most tensor in the last line of Equation 4.5 is left-canonical, as the matrices U and V of the SVD are, by definition, orthogonal
matrices (see Equation A.4). The diagonal matrix Σ that contains the singular values
is then merged with the right-most tensor. The left-canonicalization of the MPS is
achieved by iteratively applying the aforementioned procedure to each tensor in the
MPS, beginning with the leftmost one and progressing towards the right.

In the two-site DMRG-like algorithm the parameters of the MPS are fine-tuned by sweeping back and forth through the MPS and by using gradient descent to iteratively optimize the parameters of two adjacent tensors. At each iteration, the initial step
involves combining two neighboring tensors to form an order-4 tensor:

A^{k,k+1}_{\alpha_{k-1} j_k j_{k+1} \alpha_{k+1}} = \sum_{\alpha_k} A^k_{\alpha_{k-1} j_k \alpha_k} A^{k+1}_{\alpha_k j_{k+1} \alpha_{k+1}}    (4.6)

The parameters of the merged tensor Ak,k+1 are then updated to minimize the loss
function given in Equation 4.4. To this end, the gradient of the loss function with
respect to the element of a merged tensor can be computed and is given by [46]:



\frac{\partial L}{\partial A^{(k,k+1)}_{\alpha_{k-1} j_k j_{k+1} \alpha_{k+1}}} = \frac{Z'}{Z} - \frac{2}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \frac{\Psi'(v_i)}{\Psi(v_i)}    (4.7)

where Ψ′ (vi ) represents the derivative of the MPS with respect to the merged tensor
Ak,k+1. It is known that the derivative of a linear function of tensors, defined through a tensor network structure, with respect to a specific tensor A, is given by the tensor network where the tensor A has been removed [74], whence
[Equation 4.8: diagrammatic form of Ψ'(v), i.e. the tensor network for Ψ(v) with the merged tensor Ak,k+1 removed, leaving open legs wk and wk+1 connected to the fixed values vk and vk+1.]

The vertical connections of wk , vk and wk+1 , vk+1 stand for δwk ,vk and δwk+1 ,vk+1
respectively. This means that in the second term of Equation 4.7, where a sum
is carried over all training samples, only the input data vi with a certain pat-
tern that contains vk vk+1 will contribute to the gradient with respect to the ten-
sor elements A(k,k+1)vk vk+1 [46]. The first term of Equation 4.7 contains Z and
$Z' = 2 \sum_{v \in \mathcal{V}} \Psi'(v)\Psi(v)$. Even though Z and Z' contain a summation over an exponentially large number of terms, both can be efficiently computed by leveraging the adequate canonical form. Considering the case where 1 < k < N − 1, Z' can be efficiently computed by leveraging the mixed-canonical form:

[Equation 4.9: in mixed-canonical form the environment of the merged tensor contracts to the identity, so Z'/2 reduces to the merged tensor Ak,k+1 itself.]

In the cases where k = 1 and k = N − 1, the calculation of Z ′ can still be performed


efficiently by leveraging the right-canonical or left-canonical form, respectively. After
calculating the gradient, the optimization process using gradient descent can be
carried out, and the parameters of the merged tensor can be updated according to
the following equation:



\tilde{A}^{k,k+1} = A^{k,k+1} - \gamma \, \frac{\partial L}{\partial A^{k,k+1}}    (4.10)

where Ãk,k+1 stands for the updated merged tensor and γ is the learning rate. In order to retrieve the MPS form, the updated merged tensor is decomposed by applying a truncated SVD:

[Equation 4.11: the updated merged tensor Ãk,k+1 is decomposed by a truncated SVD into Ũ Σ̃ Ṽ.]

where the matrix Σ̃ only contains the singular values whose ratio to the largest
one is greater or equal than a prescribed cutoff value ϵcut . All the other singular
values, along with the corresponding rows and columns of U and V , are disregarded.
Ultimately, the dimension of the virtual bond is determined by the number of singular
values that are retained. The bond dimension tends to grow as the optimization
process captures the correlations present in the training data. To control this growth,
a hyperparameter called Dmax is defined, representing the maximum allowable bond
dimension within the MPS. The result is that, even if the ratio of certain singular
values exceeds ϵcut , they can still be disregarded to prevent an overly large bond
dimension, thereby effectively managing the model complexity. Depending on the
direction in which the sweeping process is happening, the matrix Σ̃ will be merged
with the matrix U or V . In fact if the next bond to train is the (k − 1)-th bond
(i.e. if the optimization of the MPS proceeds to the left), then Ak+1 = V will be
right-canonical and Ak = U Σ̃, whereas if the next bond is the (k + 1)-th bond (i.e.
if the optimization of the MPS proceeds to the right) Ak = U and Ak+1 = Σ̃V .
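A minimal sketch of this truncation step is given below. It assumes the merged tensor is stored with shape (D_left, p, p, D_right); the cutoff criterion and the Dmax cap follow the description above, while the function and variable names are illustrative rather than taken from the thesis code.

```python
import numpy as np

def split_merged_tensor(theta, eps_cut=1e-7, D_max=100, sweep_right=True):
    """Split an updated merged tensor back into two MPS tensors.

    theta : array of shape (D_left, p, p, D_right).
    Keeps singular values whose ratio to the largest one is >= eps_cut,
    capped at D_max, and absorbs Sigma according to the sweep direction.
    """
    Dl, p1, p2, Dr = theta.shape
    U, S, Vh = np.linalg.svd(theta.reshape(Dl * p1, p2 * Dr), full_matrices=False)
    # Truncation: relative cutoff first, then the hard cap D_max.
    keep = max(1, min(int(np.sum(S / S[0] >= eps_cut)), D_max))
    U, S, Vh = U[:, :keep], S[:keep], Vh[:keep, :]
    if sweep_right:
        # Next bond to train is the (k+1)-th one: A_k = U is left-canonical.
        A_k = U.reshape(Dl, p1, keep)
        A_kp1 = (np.diag(S) @ Vh).reshape(keep, p2, Dr)
    else:
        # Next bond to train is the (k-1)-th one: A_{k+1} = V is right-canonical.
        A_k = (U @ np.diag(S)).reshape(Dl, p1, keep)
        A_kp1 = Vh.reshape(keep, p2, Dr)
    return A_k, A_kp1
```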

As previously stated, the learning process, which involves optimizing the MPS
with respect to the NLL loss function, follows an iterative procedure. The described
process of merging, updating, and decomposing adjacent tensors is repeated multiple
times, in a sweeping fashion that alternates back and forth. Each iteration starts
from the far-right end (i.e. k = N − 1) of the MPS, where the two rightmost
tensors are considered, and then proceeds towards the leftmost tensor (i.e. k = 1).
Subsequently, the process reverses direction and moves back to the rightmost part of
the MPS. Throughout the sweeping process, the MPS is consistently maintained in a
canonical form, either a mixed-, left-, or right-canonical form, ensuring thereby an
efficient computation of the gradient in each step.



4.2.3 Sampling

The generation of new samples is a difficult task for traditional generative models
as it often involves dealing with the intractability of the partition function. For
instance, energy-based models like Boltzmann machines often rely on Markov Chain
Monte Carlo (MCMC) methods to generate new samples [31]. MCMC methods can
produce sequences of configurations s → s′ → s′′ → ... by starting from an initial
configuration s = (s1 , s2 , ..., sN ), sweeping through it and performing the changes
si → s′i according to a certain probability, known as the Metropolis probability
[30]. The produced configurations are, in general, correlated and it requires a
certain number of sweeps τ between two successive configurations in order for
them to be independent from each other. The value of τ is often referred to as
the autocorrelation time. Furthermore, to ensure that the first configuration s is
correctly drawn from the correct probability distribution, τ' sweeps need to be
performed on an initially random configuration. In this case τ ′ is referred to as the
equilibration time [30]. In some cases, generating independent samples becomes
computationally expensive and in machine learning, this problem is often referred
to as the slow-mixing problem [41].

One of the advantages of MPS-based generative models is that independent samples


can directly be drawn from the model probability distributions, alleviating thereby
the need for MCMC methods. In the remainder of this section the direct sampling
method presented in Ref. [30] will be discussed. The generating process is done site
by site and starts at one end of the MPS. Consider that the first feature to be sampled
is the N -th feature. This can directly be obtained from the marginal probability
$P_m(v_N) = \sum_{v_1, v_2, \ldots, v_{N-1}} P_m(v)$. If the MPS is in the left-canonical form, then it can be expressed in tensor diagram notation as follows:



[Equation 4.12: tensor diagram of P_m(v_N): the MPS and its conjugate are contracted with the one-hot encoded value of v_N on the last physical index. By left-canonicality all tensors except AN contract to identities, so only AN, its conjugate, and the one-hot vector remain.]

where the MPS was assumed to be normalized. The one-hot encoded value of v_N can take p different values (i.e. $v_n \in \{1, 2, \ldots, p\}$, $\forall n \in \{1, 2, \ldots, N\}$). From a practical point of view, P_m(v_N) is computed for v_N = 1, 2, ..., p
and the value of vN is then drawn from these probabilities. Once the N -th feature
is sampled, one can then move on to the (N − 1)-th feature. More generally,
given the values of vk , vk+1 , ...vN , the (k − 1)-th feature can be sampled from the
one-dimensional conditional probability:

P_m(v_{k-1} \mid v_k, v_{k+1}, \ldots, v_N) = \frac{P_m(v_{k-1}, v_k, v_{k+1}, \ldots, v_N)}{P_m(v_k, v_{k+1}, \ldots, v_N)}    (4.13)

[Diagrammatically, the numerator of Equation 4.13 is the squared norm of the vector obtained by contracting A^{k-1}, with its physical index fixed to v_{k-1}, with X^k, while the denominator is the squared norm of X^k.]

where the vectors X^k are determined according to



X^k = A^k_{v_k} X^{k+1} \quad \text{and} \quad X^N = A^N_{v_N}    (4.14)

By employing this iterative sampling procedure, it becomes feasible to sample all


the feature values based on the different one-dimensional conditional probabilities.
As a result, a sample is obtained that strictly obeys the probability distribution
represented by the MPS [46]. It is also worth noting that this sampling approach
can be extended to inference tasks where only a part of the sample is given and can
thereby be used for tasks such as missing data imputation [46].
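The sketch below implements this backward, site-by-site sampling for a real-valued, left-canonical, normalized MPS with the (left bond, physical, right bond) layout used in the previous sketches; it builds the vectors X^k of Equation 4.14 on the fly and draws each feature from the corresponding one-dimensional conditional probability. Function and variable names are illustrative assumptions.

```python
import numpy as np

def sample_from_mps(tensors, rng=np.random.default_rng()):
    """Draw one configuration (v_1, ..., v_N) from a left-canonical, normalized MPS."""
    N = len(tensors)
    sample = [None] * N
    X = None  # X^{k+1} of Equation 4.14; None means nothing has been sampled yet.
    for k in reversed(range(N)):
        A = tensors[k]                           # shape (D_left, p, D_right)
        if X is None:
            vecs = A[:, :, 0]                    # rightmost site: D_right = 1
        else:
            vecs = np.einsum('apb,b->ap', A, X)  # candidate X^k for every value of v_k
        probs = np.sum(vecs**2, axis=0)          # unnormalized conditional probabilities
        probs /= probs.sum()
        v = rng.choice(len(probs), p=probs)      # draw v_k
        sample[k] = int(v)
        X = vecs[:, v]                           # fix X^k = A^k_{v_k} X^{k+1}
    return sample
```

Because the sites to the left of k are left-canonical, normalizing the squared norms of the candidate vectors directly yields the conditional probability of Equation 4.13, so no rejection or Markov chain is required.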

4.3 TTN and PEPS as generative models

Although this dissertation only focuses on the MPS-based generative model, it is


worth mentioning that other TN structures have successfully been used to build
generative models. For instance, in Ref. [18] Tree Tensor Networks have been presented as a way to construct generative models in terms of Born machines. Tree Tensor Networks have been introduced as a solution to address the issue of the exponential decay of correlations which is typical of MPS and limits the representational capacity
of the model. The presented TTN model has demonstrated superior capabilities
compared to MPS in the field of generative modeling. The TTN makes similar use of
the canonicalization tricks to efficiently optimize the tensor network and to directly
draw samples from the model probability distribution.

Furthermore, a direct sampling method for Projected Entangled-Pair States was


recently proposed in Ref. [115], which opened up possibilities for the use of PEPS
for the construction of generative models. One critical aspect of applying Tensor
Networks for modeling specific data distributions is to efficiently capture the local
structure of the data. In the case of two-dimensional datasets, such as images,
PEPS emerges as a natural choice. Although the utilization of PEPS was previously
hindered by a lack of efficient algorithms, the introduction of the direct sampling
algorithm now allows for the effective application of PEPS in generative modeling [116].



5 Application: Natural Language Processing (NLP)

Natural Language Processing (NLP) is a field that lies at the crossroads between
computer science, artificial intelligence and linguistics. Its aim is to enable machines
and computers to understand, interpret and generate languages, both in written
and spoken formats, in a manner akin to human beings. NLP integrates the prin-
ciples of computational linguistics, which involve rule-based modeling of human
language, with statistical and machine learning methodologies [51]. When dealing
with NLP tasks, one faces important challenges arising from the intricate nature of
human languages. These languages are filled with inherent ambiguities and intrica-
cies. While individuals acquire an understanding of the subtle nuances of natural
language through the formative years of their childhood and teenage years, NLP
models must efficiently assimilate and represent the intricate structure and nuanced
characteristics of natural language from the start in order to effectively execute NLP
tasks [51]. Commonly encountered tasks within the field of NLP encompass an array
of applications, including but not limited to, speech recognition [33], which entails
converting spoken data into textual format; sentiment analysis [70], which aims to
capture emotions and subjective qualities from textual content; caption generation
for images [122], which involves producing descriptive textual explanations for
pictures; and text generation [69], which encompasses the generation of textual
content.

In this dissertation the focus lies on capturing the underlying distributions in textual
data for the purpose of text generation and language recognition. As previously
indicated, MPS are not well-suited for analyzing two-dimensional datasets. How-
ever, considering the inherent one-dimensional nature of language, it suggests the
potential applicability of MPS in this context. Text generation can be used to pro-
duce coherent linguistic constructions, ranging from individual sentences to entire
documents. Furthermore, the trained models can be leveraged for predictive text
applications, where the objective is to anticipate the subsequent words within a
given sequence of words. Predictive text methodologies have gained substantial

interest, mainly finding application in touchscreen keyboards and email services.
Popular methods for the generation of text are based on Generative Adversarial Networks (GANs) [91] or Recurrent Neural Networks (RNNs) [52]. In recent years a
cutting-edge approach has been presented to build sequence transduction models,
which is based on the transformer architecture, introduced in the highly influential
paper "Attention is all you need" (Ref. [108]). Transformers rely on the concept
of self-attention and serve as the fundamental framework in the renowned Gen-
erative Pretrained Transformers (GPTs). One notable advantage of Transformers,
distinguishing them from competing models, lies in their parallelizability, allowing
for faster and more efficient training [108].

5.1 Dataset and workflow

The dataset used to train the MPS-based models was obtained via Project Gutenberg (which makes books freely available on www.gutenberg.org) and consists of 50 different books written in English and randomly selected from the database. To
render the data usable, a preprocessing step is implemented wherein capital letters
are transformed into lowercase letters. The dataset is subsequently filtered to only
include the 26 letters of the alphabet and the ’space’ token, while excluding all other
special characters. All the text sections consisting of N characters that occur in the
books are then extracted from the text data. Each section is subsequently mapped to
a sequence of numerical values ranging from 0 to 26, achieved by replacing each
character with a corresponding numerical value. The overall goal is to capture the
inherent probability distribution of the text sections. In other words, the aim is to
model the frequency of occurrence associated with distinct text sections by means
of an N-site MPS-based model with a physical dimension of 27. In what follows,
MPS-based models will be trained for different values of N (ranging from 4 to 10).
It is important to note that the characteristics of the problem treated here set it apart from typical problems solved with MPS in the field of physics. In physics,
scenarios often involve a large number of entities (represented by N ) with relatively
smaller physical dimensions (e.g., 2 for spin-1/2 systems).
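A minimal preprocessing sketch along these lines is shown below; the exact filtering rules of the thesis code are not specified, so the character handling and the windowing implemented here are assumptions that merely match the description above.

```python
import re
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "          # 26 letters + space -> p = 27
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

def preprocess(text):
    """Lowercase the text and keep only the 26 letters and the space token."""
    text = text.lower()
    text = re.sub(r"[^a-z ]+", " ", text)          # drop all other characters
    return re.sub(r" +", " ", text)                # collapse repeated spaces

def extract_sections(text, N):
    """Map every window of N consecutive characters to a sequence of integers."""
    ids = np.array([CHAR_TO_ID[c] for c in text], dtype=np.int64)
    return np.stack([ids[i:i + N] for i in range(len(ids) - N + 1)])

# Example: build 5-character training samples from a snippet of text.
samples = extract_sections(preprocess("Tensor networks, for text!"), N=5)
print(samples.shape)   # (number of sections, 5)
```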



5.2 Text generation

The first (and rather naive) idea was to optimize the Matrix Product States by
minimizing the Mean Squared Error (MSE) given by:

\mathrm{MSE} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \left(P_m(v) - P_d(v)\right)^2    (5.1)

where the summation runs over the whole Hilbert space of the problem (i.e. all
the possible data configurations) and where Pd (v) is stored in a rank-N tensor.
Performing a summation over the entire Hilbert space is computationally intractable.
Hence, to enhance the efficiency of the process, an attempt was made to decompose
the summation term into two components: the first one containing a summation over
the text sections present in the dataset, and the second one containing a summation
over the text sections that do not occur in the dataset. Performing the decomposition
of the summation and rearranging the terms yields:

\mathrm{MSE} = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}_{\mathrm{occ}}} \left[\left(P_m(v) - P_d(v)\right)^2 - \left(P_m(v)\right)^2\right] + \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \left(P_m(v)\right)^2    (5.2)

where Vocc stands for the subset of V which contains the occurring text sections.
While the contraction $\sum_{v \in \mathcal{V}} P_m(v)$ is easily obtainable by leveraging the canonical form of the MPS, the last term in Equation 5.2 cannot be efficiently computed. There are two solutions to this predicament; the first is to model the prob-
ability distribution directly as an MPS (and not the wavefunction as is done in
Born machines) and the second is to use a Mean Absolute Error (MAE)-type of
loss function instead of the MSE. The problem with the former is that the non-negativity of the probability distribution is no longer guaranteed, and the problem with the latter is that the MAE is not differentiable everywhere. In addition to the aforementioned challenges, it is imperative to emphasize that, although not directly acknowledged, the approach described above is fundamentally flawed. This is due to its reliance on the knowledge of Pd(v), which can be stored in a rank-N tensor only for small systems (i.e. systems with small values of N). However, this methodology cannot be effectively scaled to handle longer
text sections and was therefore abandoned. The subsequent findings presented
in this dissertation have been achieved by optimizing the MPS using the Negative
Log-Likelihood (NLL) loss function. This particular approach is better aligned with



the characteristics and requirements of the problem being addressed.¹

The optimization scheme used for the MPS is the one resembling the two-site DMRG presented in Section 4.2.2. The gradient of the NLL with respect to a merged tensor is given by Equation 4.7, repeated here for clarity:

\frac{\partial L}{\partial A^{(k,k+1)}_{\alpha_{k-1} j_k j_{k+1} \alpha_{k+1}}} = \frac{Z'}{Z} - \frac{2}{|\mathcal{T}|} \sum_{i=1}^{|\mathcal{T}|} \frac{\Psi'(v_i)}{\Psi(v_i)}    (5.3)

Computing the gradient for large datasets is computationally expensive. One solu-
tion to this predicament is to use the so-called Stochastic Gradient Descent (SGD)
approach as was put forward in Ref. [107]. In fact, when using SGD, the gradient is
evaluated using a mini-batch which represents a subset of samples drawn uniformly
from the training set [13]. At each iteration, the gradient descent (Equation 4.10)
is performed in multiple steps on the same merged tensor, where in each step a
different mini-batch is used for the computation of the gradient.

The primary objective of machine learning is to be able to generalize. In order to


assess the model’s ability to generalize, the NLL is calculated on an independent set
of samples (i.e. the test set) that were drawn from the same underlying data distri-
bution but are distinct from the training set. Figure 5.1 shows the NLL calculated
on the training and test sets as a function of the number of optimization sweeps
for a 5-site MPS-based model with Dmax = 150. The results are shown for different
training set sizes (103 samples on the left, 104 samples in the center and 105 samples
ont the right of the figure). In the three different cases, the NLL calculated on the
training sets decreases monotonically. The behavior of the test loss varies with the
size of the training set. When trained on a small training set, the NLL calculated on
the test set rapidly converges to a minimum after a few sweeps. However, further
optimization beyond this point leads to overfitting, as evidenced by an increase in
the test loss. With larger training set sizes, the test loss attains systematically a lower
minimal value, indicating an enhanced capability to generalize to unseen samples.
This behavior, commonly observed in machine learning, highlights the necessity
of abundant data to limit overfitting and to attain an optimal model in terms of
generalization abilities.
¹ Despite the seemingly intuitive choice of utilizing the Negative Log-Likelihood (NLL), the decision to employ it was not immediately obvious in the initial stages of this research, as the author's experience with generative models was mainly developed during the course of this work. As the sayings go, "C’est en forgeant qu’on devient forgeron." or "Practice makes perfect.".



Fig. 5.1.: NLL calculated on the training and test sets as functions of the number of opti-
mization sweeps for a 5-site MPS-based model with Dmax = 150. The results are
shown for different training set sizes.

As explained in Section 2.2.1, minimizing the NLL is equivalent to minimizing the


KL divergence, which quantifies the similarity between the data distribution and the model distribution. For the sake of clarity, the KL divergence between distributions P(x) and Q(x) is given here (for the discrete case) by:

D_{\mathrm{KL}}(P \| Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right)    (5.4)
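For reference, a direct NumPy implementation of this discrete KL divergence could look as follows; the function name is illustrative and Q(x) is assumed to be strictly positive wherever P(x) is non-zero.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) for probability vectors p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with P(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Example with two small distributions over four outcomes.
print(kl_divergence([0.4, 0.3, 0.2, 0.1], [0.25, 0.25, 0.25, 0.25]))
```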

In order to assess how similar the model distribution Pm represented by the MPS
is to the data distribution, the distribution log (Pd (x)/Pm (x)) can be plotted for
samples taken from the data distribution Pd (x). It is important to note that for small
systems, such as N = 5, the data distribution can be kept in memory and stored as a
rank-N tensor. Despite the computational expense and inefficiency associated with
this task, it allows for an easy assessment of the learning ability of the model and
forms a proof-of-concept for the approach used. Figure 5.2 shows the distribution
log (Pd (x)/Pm (x)) for models trained with different training set sizes (same MPS
as the one presented in Figure 5.1). When the probability distributions Pm (x) and
Pd (x) are similar, the distribution log (Pd (x)/Pm (x)) tends to have its peak value
around zero. This pattern is observed for the model trained with the largest training
set size. Conversely, the model trained with a smaller training set displays a peak
around 4, indicating a high dissimilarity between the modeled distribution Pm (x)
and the true distribution Pd (x). This discrepancy further validates that the MPS has
overfitted the training data. The presented results offer a validation of the employed



methods for the MPS and demonstrate that the effective modeling of the underlying
data distribution in text using MPS is possible.

Fig. 5.2.: Histogram of the difference in log-likelihoods under distributions P (x) = Pd (x)
and Q(x) = Pm (x) for samples randomly drawn from the data distribution. The
results are shown for models trained with different training set sizes: |Tr| = 10^3 in blue, |Tr| = 10^4 in orange and |Tr| = 10^5 in turquoise.

The aforementioned results have demonstrated the feasibility of accurately modeling


the underlying distribution of text sections consisting of 5 characters using an MPS-
based model. However, the objective is to generate text; Table 5.1 therefore illustrates
sentences that were generated based on a given input text section (given in bold).
The text is generated in a sequential manner: Given an input consisting of n letters
l1 , l2 , ..., ln−1 , ln , the 5-site MPS samples the next character from the conditional
probabilities: P (ln+1 |ln−3 , ln−2 , ln−1 , ln ). Based on the sentences generated, it is
evident that the model’s ability to produce coherent sentences is considerably limited.
However, there are several noteworthy observations to emphasize. Despite the lack
of meaningful content, certain existing words do appear, and the MPS accurately
positions ’space’ tokens at specific instances, such as following the words should,
have, my, et cetera. The limited text generation ability exhibited by the model,
despite its capability in capturing the underlying data distribution, could potentially
be attributed to the relatively short length of the considered MPS. Specifically, the
generation process relies on the preceding 4 characters, which might be insufficient
to yield satisfactory text generation capabilities.
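The sketch below illustrates this sequential generation loop for a 5-site MPS with the same (left bond, physical, right bond) layout as in the previous sketches: the amplitude Ψ(v) of each candidate 5-character window is evaluated by a plain chain contraction, and the next character is drawn from the resulting conditional probabilities. The seed is assumed to contain at least four characters, and all names are illustrative rather than taken from the thesis code.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def psi(mps, v):
    """Amplitude Psi(v) of an open-boundary MPS for one fixed configuration v."""
    vec = mps[0][0, v[0], :]
    for A, x in zip(mps[1:], v[1:]):
        vec = vec @ A[:, x, :]
    return vec[0]

def generate(mps, seed_text, length, rng=np.random.default_rng()):
    """Extend seed_text character by character with a 5-site MPS model."""
    chars = list(seed_text)               # seed_text assumed to have >= 4 characters
    for _ in range(length):
        context = [ALPHABET.index(c) for c in chars[-4:]]      # last 4 characters
        # Unnormalized conditional P(l_{n+1} | context) ~ |Psi(context, x)|^2.
        weights = np.array([psi(mps, context + [x]) ** 2 for x in range(len(ALPHABET))])
        next_id = rng.choice(len(ALPHABET), p=weights / weights.sum())
        chars.append(ALPHABET[next_id])
    return "".join(chars)
```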



a the person whose such we must spee while and then the secretence
b i am they hindi cupiess the sating withou were here he
c which is ther and falled in allwore about a righting henin
d one should have had say wear when salvolated essed me house
e one should informing irhyddidntal nevertush ogede my did the
Tab. 5.1.: Sentences generated by the 5-site MPS-based model with Dmax = 150 trained
on a training set containing 10^5 samples. The bolded portion of the sentences
represents the input provided to the model for predicting the remainder of the
sentence.

To address the issue raised above, other MPS-based models were trained with a
higher value of N. Considering the susceptibility of the MPS model to overfitting [117], an early stopping procedure was incorporated during the training process
to mitigate this problem [123]. More specifically, the NLL was computed on a
separate validation set (distinct from the training set) at intervals during training.
The training was stopped as soon as an increase in validation error was detected,
thereby preventing the model from overfitting the training data. The MPS were trained for different values of N and for different values of Dmax. The training and validation sets systematically consisted of 0.75 × 10^5 and 10^5 samples, respectively. Figure 5.3
shows the lowest value of the validation error obtained for the different models
that were trained. It should be made clear that the value of the NLL obtained for
MPS-based models with different values of N should not be directly compared with
each other (the values are plotted in the same figure for the sake of compactness).
This is due to the fact that the models were trained on fundamentally different
training sets (i.e. each consisting of text sections of length N ). The results indicate
that increasing the maximum bond dimension (Dmax ) results in a reduction of the
validation error until reaching a plateau. At this point, further increase of Dmax
does not yield any significant improvement in the model’s performance. The extent
of improvement in validation error between Dmax = 10 and Dmax = 100 is more
pronounced for larger values of N . This can be attributed to the increased complexity
of modeling the probability distribution for higher N values, which requires a model
with a higher number of parameters. Additionally, it can be observed that for the
scenario where N = 5, a value of Dmax = 50 appears to be sufficient to obtain
satisfactory performance. Moreover, increasing Dmax beyond this point does not lead
to improved results, as evidenced by the rise in validation error when Dmax = 200.
For other values of N, a maximal bond dimension of 100 appears to be an optimal choice.



Fig. 5.3.: Negative log-likelihood as a function of the maximal bond dimension Dmax for
different values of N .

Table 5.2 displays sentences that were generated using MPS models of different
sizes. For each model the maximum bond dimension is 100 and the text was
generated following a similar sequential generation approach as described previously.
The generated sentences reveal that models based on longer MPS still exhibit
limited abilities to generate coherent sentences. While all the models are capable of
generating some existing words, they also generate non-existent ones. Furthermore,
it is crucial to highlight that none of the models exhibit any level of semantic
understanding which is a fundamental requirement for a proficient text generator.

During training, the MPS adapts the bond dimension of the different bonds to
capture the correlation that exists within the training set. The problem in this case
is that the physical dimension is relatively high and the correlation within textual
data is significant. As a result, the bond dimensions of the Matrix Product States
systematically attain the maximal allowed value of Dmax , except for the first and
last bond, which have a value equal to the physical dimension. This renders the
training of longer MPS more challenging. Furthermore, when the training set size
is too small, the resulting trained MPS may be sub-optimally trained, as was the
case for the 5-site MPS which was trained on 10^3 samples (see Figure 5.1). Longer
MPS might therefore require larger training sets but this would render the training
computationally even more expensive.



N =5 the person who so handable you who ai have so couraek a fell put
N =6 the person who appremove all gotshis a posed durious and most p
N =8 the person who was really breased let to bleat together i warran
N = 10 the person who door somethourself i disass owned wilsonated ther
N =5 i went there to may ligacemetely abuealousm ik aognike him a such
N =6 i went there to destridyman had repliestawhat do after it ita ama
N =8 i went there to lay devairst from the garely desticed of deaomise
N = 10 i went there to come to triand we wours who stard limeilo you by
Tab. 5.2.: Sentences generated by the different MPS-based models with Dmax = 100 trained
on a training set containing 0.75 × 10^5 samples. The bolded portion of the sentences
represents the input provided to the model for predicting the remainder of the
sentence.

To try and overcome this issue, an alternative approach to Stochastic Gradient


Descent was considered where the training set is subdivided into a certain number
of mini-batches and where a validation set is used. The training of an MPS is
conducted in a sequential manner on one mini-batch at a time, where multiple
optimization sweeps are performed. As previously done, training is stopped
upon detecting an increase in validation error. It should be mentioned that a similar
approach to SGD (without the early stopping strategy) was presented in Ref. [107],
and was previously deemed unstable when applied to the MNIST dataset. However,
the implementation of an early stopping strategy effectively mitigates the instability.
Unfortunately, the MPS trained using this approach systematically exhibited inferior performance in comparison to the other optimization method described
earlier, even when trained on significantly larger datasets. Furthermore, one of the
motivations for using this method was to make the training process faster and more
efficient. But when the training is conducted on a mini-batch, the validation error is only calculated at certain intervals, that is, after a fixed number of optimization
loops have been performed. Consequently, in certain instances, the MPS is optimized
for a few iterations, only to realize later that overfitting has occurred (indicated by an
increase in validation loss). Therefore, the obtained MPS configuration is discarded,
and the previous version of the MPS (with a lower validation loss) is reinstated and
employed as starting point for training on another mini-batch. This means that a
few iterations have been executed without yielding substantial progress, and when
longer MPS with high bond dimensions are involved, the procedure becomes time-
consuming. Additionally, depending on the number of mini-batches and the size
of the validation set, certain batches were excluded from training because training
on them resulted in an immediate increase in validation error. This resulted in a
training performed on an effective training set substantially smaller than expected.



In an attempt to solve this issue, an adaptive learning-rate approach was introduced, where the learning rate is decreased batch after batch, but it did not yield significant improvements.

The previously discussed MPS-based approach was primarily focused on processing


textual data at the character (often referred to as uni-gram) level. In contrast,
many state-of-the-art generative models employed for text generation tasks adopt
a different approach, segmenting textual data into words or tokens. Tokens can
be seen as sequences of characters that are grouped together as a useful semantic
unit for processing [99]. For instance, the GPT2 model presented in Ref. [88] has
been made available in Python and provides a tokenizer module. Applying this
tokenizer to the textual data used in this work would have resulted in more than
a thousand different tokens to take into account. The reason behind the choice of
a character-based approach was mainly to circumvent excessively high physical
dimensions in the MPS, which would result in computationally more demanding
training. By operating at the character level, the MPS could maintain a manageable
physical dimension, thereby mitigating the computational complexity. It was also
considered to treat the textual data on a bi-gram (sequences of two characters)
level. However, this would have necessitated a physical dimension ranging between
500 and 600, which is still relatively high. Moreover it should be noted that many
bi-grams occurring in the textual data have a rather low frequency of occurrence,
as is shown in Figure 5.4. The idea to improve the MPS-based performance, while
limiting the increase in physical dimension, is to selectively incorporate the most
frequently occurring bi-grams alongside the uni-grams (i.e. the characters) already
taken into account up until now. A 5-site MPS model has been trained by considering
the top 1/15 fraction of the most frequent bi-grams, resulting in a physical dimension
of 65. The model was trained on a training set consisting of 5 × 10^4 samples using the
first stochastic gradient descent (SGD) method described in this chapter. Sentences
generated by the models are shown in Table 5.3. The text generation performance of
the model trained using the selected bi-grams appears to be inferior compared to the
basic 5-site MPS model. This observation might be attributed to either a too small
training set size (as increasing the physical dimension leads to longer computational
times, a smaller training set was chosen to balance computational efficiency) or a
sub-optimal choice for the maximum bond dimension (Dmax ). Unfortunately, no
further investigations have been conducted at this stage. However it should be noted
that further research should be carried out in order to fully explore the potential of
this method.
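A possible way to build such a mixed uni-gram/bi-gram vocabulary and to segment text with it is sketched below; the greedy left-to-right segmentation and the 1/15 fraction are assumptions matching the description above, and the helper names are illustrative.

```python
from collections import Counter

def build_vocabulary(text, alphabet="abcdefghijklmnopqrstuvwxyz ", fraction=1 / 15):
    """Uni-grams plus the top `fraction` of the most frequent bi-grams."""
    bigram_counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_keep = int(len(bigram_counts) * fraction)
    top_bigrams = [bg for bg, _ in bigram_counts.most_common(n_keep)]
    return list(alphabet) + top_bigrams          # physical dimension = its length

def tokenize(text, vocabulary):
    """Greedy left-to-right segmentation, preferring bi-grams when available."""
    bigrams = {tok for tok in vocabulary if len(tok) == 2}
    tokens, i = [], 0
    while i < len(text):
        if text[i:i + 2] in bigrams:
            tokens.append(text[i:i + 2])
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = build_vocabulary("the person who went there to see the other person")
print(len(vocab), tokenize("the person", vocab)[:6])
```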

Fig. 5.4.: Frequency of occurrence of the bi-grams encountered in the textual data.

In conclusion, generating text with an MPS-based model is a challenging task, mainly due to the high physical dimension required and the significant correlations that exist within textual data. Further research should be conducted to thoroughly explore
the complete potential of MPS-based generative models for text generation. This
could be accomplished through conducting more extensive and computationally
intensive simulations, incorporating a higher physical dimension and larger training
set sizes. While theoretically feasible, such simulations were deemed too demanding
to be conducted within the scope of this study. Comparing the performances of
models with different MPS sizes is also a delicate task because of the fundamental
differences that exist between the datasets on which those models are evaluated with
the NLL. It would also be interesting to investigate other generative models based on
other tensor network structures. In fact, TTN were proposed for language design tasks in Ref. [34]. The proposed model works on a word level and therefore requires
a high physical dimension. This implies that the proposed model would be practically
hard to train. Further research should incorporate a universal quantifiable measure
of the performance of a model that can be used for all models. This measure could
be inspired by the BLEU or ROUGE evaluation methods commonly used to evaluate
text generation performances, where the generated text is compared to a reference
text segment [63]. Additional studies should also focus on the development of
problem-specific datasets. For example, when tackling a task such as predicting
the most probable words based on an input sequence for email text, it might be
interesting to use a dataset tailored for that task.



a the person whociduring in the chrvansolent to active will restrendd
b i am whens and even is all fld cent opinionprandinto cts undive wad ecum
c which is mostere is the pitionbabondso which of continceit of feelings what
d one should instinct confl a ve in ructs de which we king pizationalizes act
Tab. 5.3.: Sentences generated by the 5-site MPS-based model which takes the top 1/15 fraction of the most frequent bi-grams into account. The model is trained with Dmax = 150 on a training set containing 5 × 10^4 samples. The bolded portion of the sentences
represents the input provided to the model for predicting the remainder of the
sentence.

5.3 Language recognition

As explained in Section 2.2.3, the knowledge of the joint probability distributions


opens up a wide range of possible applications for generative models. In
this section, an MPS-based model for language classification is presented. The fact
that the partition function can exactly and efficiently be computed allows for the
evaluation of the log-likelihood of any given text section. The proposed approach
involves training multiple MPS-based models, where each model corresponds to
a specific language. The goal is that each model learns the specificities of the
probability distribution of the language it tries to model, so as to be able to give
a high likelihood to text sections derived from that language and a low likelihood to text sections from other languages. In this work, three languages were considered:
English, French and Dutch. Naturally, distinct datasets were used to train each of
the respective models. The training data for the English model consists of randomly
selected text sections extracted from Jonathan Swift’s book, "Gulliver’s Travels".
Similarly, the training sets for the French and Dutch models were extracted from the
corresponding French and Dutch versions of the same book. The same book was
employed for training all three models to mitigate potential biases introduced by using different sources, such as the use of an older book for one language, which could skew the evaluation results. Each MPS was trained using a training set consisting of 10^5 text sections.

When considering an N -site MPS, the classification process is as follows. Consider


a given text segment s of size S originating from a text written in English, French
or Dutch. The text is subdivided into S − N + 1 text sections. For instance, given the text segment "tensor" consisting of six characters, and considering a 4-site MPS, the three text sections extracted are 'tens', 'enso' and 'nsor'. The average log-likelihood



is calculated for the three models trained on the English, French and Dutch datasets
respectively and it is given by

\frac{1}{S - N + 1} \sum_{i=1}^{S-N+1} \log P_{\mathrm{language}}(v_i)    (5.5)

The classification process involves selecting the language for which the model gives
the highest average log-likelihood. To assess the model’s performance, the accuracy
is used as an evaluation metric. Accuracy is an appropriate measure in this case
because both the training and the test sets are perfectly balanced between the
different classes.
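The decision rule can be summarised by the short sketch below. It assumes a helper log_likelihood(mps, section), passed in as a function, that returns log P(v) = log(|Ψ(v)|²/Z) for one N-character section under a given trained MPS (for a normalised MPS this is just 2 log |Ψ(v)|); the dictionary of models and the function names are illustrative assumptions.

```python
import numpy as np

def average_log_likelihood(mps, text_ids, N, log_likelihood):
    """Average log-likelihood (Equation 5.5) over all N-character sections of a segment."""
    sections = [text_ids[i:i + N] for i in range(len(text_ids) - N + 1)]
    return np.mean([log_likelihood(mps, s) for s in sections])

def classify_language(models, text_ids, N, log_likelihood):
    """Pick the language whose MPS assigns the highest average log-likelihood.

    models : dict mapping a language name (e.g. 'english', 'french', 'dutch')
             to its trained MPS.
    """
    scores = {lang: average_log_likelihood(mps, text_ids, N, log_likelihood)
              for lang, mps in models.items()}
    return max(scores, key=scores.get), scores
```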

The number of parameters of the model is determined by the physical dimension


considered (i.e. the number of characters taken into account), the size of the MPS
used and the maximum allowed bond dimension. In order to find a balance between
efficiency of representation (low number of parameters to describe the model) and
performance, various models were trained with different MPS sizes and different
values of the maximum bond dimension Dmax . Figure 5.5 shows the results for 4-
and 5-site MPS based models trained on the three different languages for various
values of Dmax. The accuracy is shown as a function of the number of characters in
the text segments used for evaluation.

(a) N = 5 (b) N = 4

Fig. 5.5.: Accuracy of the 5-site (a) and 4-site (b) MPS-based language recognition models
as a function of the number of characters in the text segments used for evaluation.
The results are shown for different values of the maximal bond dimension allowed.

The MPS-based classification models demonstrate a high accuracy, especially for


longer text segments. This holds true for all values of Dmax and is observed for both



the 4-site and 5-site MPS models. The results indicate that an optimal value for
Dmax is 25, as increasing the bond dimension beyond this value does not result in a
significant improvement in accuracy while decreasing it to 10 significantly reduces
the accuracy (blue curve in Figure 5.5). This observation valid for text segments of
various lengths. Additionally, the results demonstrate that the MPS-based approach
performs better in classifying longer text segments. This observation seems logic
as longer text segments can be divided into more text sections, thereby increasing
the chance of having a large average likelihood and improving the classification
accuracy. The accuracy achieved for very short text segments, specifically those
containing five characters, although slightly lower, is still quite good as it ranges
between 0.70 and 0.85. In comparison, a simple naive Bayes classifier trained on
the same three books achieved an average accuracy close to 0.85.

Figure 5.6 shows the confusion matrices for the 4-site MPS model with Dmax = 10
(i.e. the model with the smallest number of parameters) evaluated in the two extreme
cases where the text segments contain five characters (a) and 100 characters (b).
Figure 5.6 also illustrates the confusion matrices for the 5-site MPS model with
Dmax = 250 (i.e. the model with the highest number of parameters) evaluated
on text segments containing five characters (c) and 100 characters (d). All four
confusion matrices are predominantly diagonal, meaning that all models perform well.
When working with short text segments, the 4-site MPS model with fewer parameters
demonstrates slightly lower performance compared to the 5-site MPS model (as
mentioned above). However, for the longest text segments, the performance of the two
models is nearly identical, indicating that the model trained with the lowest number
of parameters achieves the best balance between efficiency and performance.
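For completeness, a small sketch of how such a confusion matrix and the corresponding accuracy can be assembled from the classifier's predictions is given below; the label set and helper names are illustrative rather than taken from the actual implementation.

```python
import numpy as np

LANGUAGES = ["English", "French", "Dutch"]

def confusion_matrix(true_labels, predicted_labels):
    """Rows correspond to the true language, columns to the predicted language."""
    index = {lang: i for i, lang in enumerate(LANGUAGES)}
    cm = np.zeros((len(LANGUAGES), len(LANGUAGES)), dtype=int)
    for true, pred in zip(true_labels, predicted_labels):
        cm[index[true], index[pred]] += 1
    return cm

def accuracy(cm):
    """Fraction of correctly classified segments (appropriate here because the test set is balanced)."""
    return np.trace(cm) / cm.sum()
```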

In future work, the model could be extended to include more languages. However,
it should be noted that a good language classifier should be able to make a decision
rapidly (e.g. language recognition on Google Translate). If more languages are taken
into account, this might become a problem, since the average log-likelihood of the
text segments must be computed for every MPS. For longer text segments, this
problem might (partially) be mitigated by only computing the average log-likelihood
on a sub-section (of about 50 to 60 characters) of the given text segments. This can
be motivated by the fact that the accuracy of the model evaluated on text segments
longer than 50 to 60 characters is not significantly better.




Fig. 5.6.: Confusion matrices for the 4-site MPS model with Dmax = 10 evaluated in
the two extreme cases where the text segments contain five characters (a) and
100 characters (b), and for the 5-site MPS model with Dmax = 250 evaluated on
text segments containing five characters (c) and 100 characters (d).



6 Conclusion
The overall goal of this thesis was to model the probability distributions existing
in textual data by using an MPS-based generative model. This Chapter provides
a summary of this thesis and of the obtained results, with an emphasis on further
research directions.

After having introduced the fundamental concepts of Machine Learning and generative
modelling in Chapter 2, an introduction to tensor networks within the context
of many-body physics was provided in Chapter 3. Chapter 4 then presented a
generative model based on a specific type of (one-dimensional) tensor network, the
matrix product state. As explained, the application of tensor network states to model
underlying data distributions is not unexpected, given the numerous similarities
between the two fields.

Finally, in Chapter 5 a framework was presented to use the MPS-based generative
model for Natural Language Processing tasks. In Section 5.2 the model was used to
capture the underlying distribution of text sections consisting of a fixed number of
characters. The initial model works at the character level and only considers the 26
letters of the alphabet and the 'space' token. The results have shown that a 5-site
MPS can accurately capture the data distribution; however, its text generation ability
has been shown to be rather limited. Its poor performance might be attributed
to the short length of the MPS. To solve this, longer MPS were trained, but the
high physical dimension and the significant correlations that exist in textual data
render the optimization of the MPS challenging. An alternative stochastic gradient
descent approach was proposed but, unfortunately, it did not yield any improvement
to the training procedure. Finally, an alternative model was considered which takes
into account the most frequent bi-grams (sequences of two characters) that occur
in texts. The increase in physical dimension has led to an even more challenging
optimization, resulting in a poor ability to generate text. Further research should
focus on performing computationally more expensive optimization to investigate the
full potential of the MPS-based model for text generation. Alternative, more efficient
optimization procedures should also be investigated. Finally, research should be carried
out on the use of alternative tensor networks, such as tree tensor networks, for the purpose
of text generation. The challenge will, once again, be the high physical dimension of
the system under consideration. It might also be interesting to work with datasets
that are more specific to the task being addressed; for instance, a predictive e-mail
text model should be trained on a dataset consisting of existing e-mails and not
on random books. Moreover, comparing MPS-based models of different sizes also
requires a more suitable evaluation method, in which the generated text could be
compared with (a) reference text segment(s).

Although unable to generate coherent and cohesive text, the MPS-based model was
successfully employed as a language classifier. In Section 5.3, the model that was
presented consisted of three different MPS, each trained on a specific language. It
seems that, even with a low value of the maximal bond dimension, the MPS can learn
the distinct specificities of the different languages considered. The classification
process is done by calculating the average log-likelihood of given text segments and
selecting the language for which this value is the highest. The results obtained have
shown that the classifier has a high accuracy, especially when the text segments used
for evaluation are longer. Further developments on this topic should aim to include
a higher number of languages.

Bibliography

[1]Philip W Anderson. “More is different: broken symmetry and the nature of the
hierarchical structure of science.” In: Science 177.4047 (1972), pp. 393–396 (cit. on
p. 19).

[2]Antreas Antoniou, Amos Storkey, and Harrison Edwards. “Data augmentation


generative adversarial networks”. In: arXiv preprint arXiv:1711.04340 (2017) (cit.
on p. 17).

[3]John O Awoyemi, Adebayo O Adetunmbi, and Samuel A Oluwadare. “Credit card


fraud detection using machine learning techniques: A comparative analysis”. In:
2017 international conference on computing networking and informatics (ICCNI).
IEEE. 2017, pp. 1–9 (cit. on p. 16).

[4]Thomas E Baker, Samuel Desrosiers, Maxime Tremblay, and Martin P Thompson.


“Méthodes de calcul avec réseaux de tenseurs en physique”. In: Canadian Journal of
Physics 99.4 (2021), pp. 207–221 (cit. on pp. 21, 23, 24, 73).

[5]Nicholas M Ball and Robert J Brunner. “Data mining and machine learning in
astronomy”. In: International Journal of Modern Physics D 19.07 (2010), pp. 1049–
1106 (cit. on p. 4).

[6]Dalya Baron. “Machine learning in astronomy: A practical overview”. In: arXiv


preprint arXiv:1904.07248 (2019) (cit. on pp. 4, 35).

[7]Cynthia Beath, Irma Becerra-Fernandez, Jeanne Ross, and James Short. “Finding
value in the information explosion”. In: MIT Sloan Management Review (2012)
(cit. on p. 3).

[8]Mariana Belgiu and Lucian Drăguţ. “Random forest in remote sensing: A review of
applications and future directions”. In: ISPRS journal of photogrammetry and remote
sensing 114 (2016), pp. 24–31 (cit. on p. 8).

[9]Ingemar Bengtsson and Karol Życzkowski. Geometry of quantum states: an introduc-


tion to quantum entanglement. Cambridge university press, 2017 (cit. on p. 21).

[10]Ben Bright Benuwa, Yong Zhao Zhan, Benjamin Ghansah, Dickson Keddy Wornyo,
and Frank Banaseka Kataka. “A review of deep machine learning”. In: International
Journal of Engineering Research in Africa 24 (2016), pp. 124–136 (cit. on p. 9).

[11]Concha Bielza and Pedro Larranaga. “Discrete Bayesian network classifiers: A


survey”. In: ACM Computing Surveys (CSUR) 47.1 (2014), pp. 1–43 (cit. on p. 8).

[12]Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. “Deep generative
modelling: A comparative review of vaes, gans, normalizing flows, energy-based
and autoregressive models”. In: IEEE transactions on pattern analysis and machine
intelligence (2021) (cit. on p. 6).

[13]Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In:
Proceedings of COMPSTAT’2010: 19th International Conference on Computational
StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers.
Springer. 2010, pp. 177–186 (cit. on p. 48).

[14]Jacob C Bridgeman and Christopher T Chubb. “Hand-waving and interpretive dance:


an introductory course on tensor networks”. In: Journal of physics A: Mathematical
and theoretical 50.22 (2017), p. 223001 (cit. on pp. 20, 21, 24, 28, 33).

[15]Henrik Bruus and Karsten Flensberg. Many-body quantum theory in condensed matter
physics: an introduction. OUP Oxford, 2004 (cit. on p. 20).

[16]Juan Carrasquilla and Roger G Melko. “Machine learning phases of matter”. In:
Nature Physics 13.5 (2017), pp. 431–434 (cit. on p. 35).

[17]Song Cheng, Jing Chen, and Lei Wang. “Information perspective to probabilistic
modeling: Boltzmann machines versus born machines”. In: Entropy 20.8 (2018),
p. 583 (cit. on p. 37).

[18]Song Cheng, Lei Wang, Tao Xiang, and Pan Zhang. “Tree tensor networks for
generative modeling”. In: Physical Review B 99.15 (2019), p. 155131 (cit. on pp. 35,
37, 38, 44).

[19]Song Cheng, Lei Wang, and Pan Zhang. “Supervised learning with projected entan-
gled pair states”. In: Physical Review B 103.12 (2021), p. 125117 (cit. on p. 35).

[20]J Ignacio Cirac, David Perez-Garcia, Norbert Schuch, and Frank Verstraete. “Matrix
product states and projected entangled pair states: Concepts, symmetries, theorems”.
In: Reviews of Modern Physics 93.4 (2021), p. 045003 (cit. on p. 34).

[21]J Ignacio Cirac and Frank Verstraete. “Renormalization and tensor product states
in spin chains and lattices”. In: Journal of physics a: mathematical and theoretical
42.50 (2009), p. 504004 (cit. on pp. 20, 21, 24).

[22]Piers Coleman. Introduction to many-body physics. Cambridge University Press, 2015


(cit. on pp. 19, 20).

[23]James Dborin, Fergus Barratt, Vinul Wimalaweera, Lewis Wright, and Andrew
G Green. “Matrix product state pre-training for quantum machine learning”. In:
Quantum Science and Technology 7.3 (2022), p. 035014 (cit. on p. 35).

[24]Li Deng. “The mnist database of handwritten digit images for machine learning
research”. In: IEEE Signal Processing Magazine 29.6 (2012), pp. 141–142 (cit. on
p. 17).

[25]Rohit Dilip, Yu-Jie Liu, Adam Smith, and Frank Pollmann. “Data compression for
quantum machine learning”. In: Physical Review Research 4.4 (2022), p. 043007
(cit. on p. 35).

[26]Jorge Dukelsky, Miguel A Martın-Delgado, Tomotoshi Nishino, and Germán Sierra.
“Equivalence of the variational matrix product method and the density matrix
renormalization group applied to spin chains”. In: Europhysics letters 43.4 (1998),
p. 457 (cit. on p. 33).

[27]Jens Eisert, Marcus Cramer, and Martin B Plenio. “Area laws for the entanglement
entropy-a review”. In: arXiv preprint arXiv:0808.3773 (2008) (cit. on p. 22).

[28]Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. “Can:
Creative adversarial networks, generating" art" by learning about styles and de-
viating from style norms”. In: arXiv preprint arXiv:1706.07068 (2017) (cit. on
p. 13).

[29]Mark Fannes, Bruno Nachtergaele, and Reinhard F Werner. “Finitely correlated


states on quantum spin chains”. In: Communications in mathematical physics 144
(1992), pp. 443–490 (cit. on p. 24).

[30]Andrew J Ferris and Guifre Vidal. “Perfect sampling with unitary tensor networks”.
In: Physical Review B 85.16 (2012), p. 165146 (cit. on p. 42).

[31]Asja Fischer and Christian Igel. “An introduction to restricted Boltzmann machines”.
In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications:
17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3-6,
2012. Proceedings 17. Springer. 2012, pp. 14–36 (cit. on p. 42).

[32]Imola K Fodor. A survey of dimension reduction techniques. Tech. rep. Lawrence


Livermore National Lab., CA (US), 2002 (cit. on p. 6).

[33]Santosh K Gaikwad, Bharti W Gawali, and Pravin Yannawar. “A review on speech


recognition technique”. In: International Journal of Computer Applications 10.3
(2010), pp. 16–24 (cit. on p. 45).

[34]Angel J Gallego and Roman Orus. “Language design as information renormaliza-


tion”. In: SN Computer Science 3.2 (2022), p. 140 (cit. on p. 55).

[35]Leon Gatys, Alexander S Ecker, and Matthias Bethge. “Texture synthesis using con-
volutional neural networks”. In: Advances in neural information processing systems
28 (2015) (cit. on p. 13).

[36]Leon A Gatys, Alexander S Ecker, and Matthias Bethge. “A neural algorithm of


artistic style”. In: arXiv preprint arXiv:1508.06576 (2015) (cit. on p. 13).

[37]Raffael Gawatz. Matrix Product State Based Algorithms: For Ground States and
Dynamics. Niels Bohr Institute, Copenhagen University, 2017 (cit. on p. 27).

[38]Ivan Glasser, Nicola Pancotti, and J Ignacio Cirac. “From probabilistic graphical
models to generalized tensor networks for supervised learning”. In: IEEE Access 8
(2020), pp. 68169–68182 (cit. on p. 35).

[39]Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and Ignacio Cirac. “Expres-
sive power of tensor-network factorizations for probabilistic modeling”. In: Advances
in neural information processing systems 32 (2019) (cit. on p. 35).

[40]Rhys EA Goodall and Alpha A Lee. “Predicting materials properties without crystal
structure: Deep representation learning from stoichiometry”. In: Nature communica-
tions 11.1 (2020), p. 6280 (cit. on p. 35).

[41]Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press,
2016 (cit. on pp. 10–13, 15, 42).

[42]Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. “Generative adversarial


networks”. In: Communications of the ACM 63.11 (2020), pp. 139–144 (cit. on
p. 37).

[43]Gunst, Klaas. “Tree tensor networks in quantum chemistry”. eng. PhD thesis. Ghent
University, 2020, xvii, 171 (cit. on p. 75).

[44]Gaurav Gupta et al. “A self explanatory review of decision tree classifiers”. In:
International conference on recent advances and innovations in engineering (ICRAIE-
2014). IEEE. 2014, pp. 1–7 (cit. on p. 8).

[45]Jutho Haegeman. Strongly Correlated Quantum Systems. 2nd ed. Ghent, Belgium,
2020-2021 (cit. on pp. 21, 24, 27, 74).

[46]Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. “Unsupervised
generative modeling using matrix product states”. In: Physical Review X 8.3 (2018),
p. 031012 (cit. on pp. 2, 17, 35, 36, 38–40, 44).

[47]Markus Hauru, Maarten Van Damme, and Jutho Haegeman. “Riemannian optimiza-
tion of isometric tensor networks”. In: SciPost Physics 10.2 (2021), p. 040 (cit. on
p. 33).

[48]John J Hopfield. “Neural networks and physical systems with emergent collective
computational abilities.” In: Proceedings of the national academy of sciences 79.8
(1982), pp. 2554–2558 (cit. on p. 35).

[49]C Hubig, IP McCulloch, and Ulrich Schollwöck. “Generic construction of efficient


matrix product operators”. In: Physical Review B 95.3 (2017), p. 035129 (cit. on
p. 31).

[50]William Huggins, Piyush Patil, Bradley Mitchell, K Birgitta Whaley, and E Miles
Stoudenmire. “Towards quantum machine learning with tensor networks”. In:
Quantum Science and technology 4.2 (2019), p. 024001 (cit. on pp. 35, 36).

[51]IBM. What is Natural Language Processing? https : / / www . ibm . com / topics /
natural-language-processing. Accessed: 2023-05-20 (cit. on p. 45).

[52]Touseef Iqbal and Shaima Qureshi. “The survey: Text generation models in deep
learning”. In: Journal of King Saud University-Computer and Information Sciences
34.6 (2022), pp. 2515–2528 (cit. on p. 46).

[53]Anil K Jain, Jianchang Mao, and K Moidin Mohiuddin. “Artificial neural networks:
A tutorial”. In: Computer 29.3 (1996), pp. 31–44 (cit. on p. 8).

[54]Anil K Jain, M Narasimha Murty, and Patrick J Flynn. “Data clustering: a review”.
In: ACM computing surveys (CSUR) 31.3 (1999), pp. 264–323 (cit. on p. 6).

[55]Zhih-Ahn Jia, Biao Yi, Rui Zhai, et al. “Quantum neural network states: A brief
review of methods and applications”. In: Advanced Quantum Technologies 2.7-8
(2019), p. 1800077 (cit. on p. 35).

[56]James M Joyce. “Kullback-leibler divergence”. In: International encyclopedia of


statistical science. Springer, 2011, pp. 720–722 (cit. on p. 14).

[57]Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”.
In: arXiv preprint arXiv:1412.6980 (2014) (cit. on p. 10).

[58]Sotiris B Kotsiantis, Ioannis Zaharakis, P Pintelas, et al. “Supervised machine


learning: A review of classification techniques”. In: Emerging artificial intelligence
applications in computer engineering 160.1 (2007), pp. 3–24 (cit. on p. 6).

[59]Hans-Peter Kriegel, Peer Kröger, Jörg Sander, and Arthur Zimek. “Density-based
clustering”. In: Wiley interdisciplinary reviews: data mining and knowledge discovery
1.3 (2011), pp. 231–240 (cit. on p. 8).

[60]Alex Lamb. “A Brief Introduction to Generative Models”. In: arXiv preprint arXiv:2103.00265
(2021) (cit. on pp. 12–14).

[61]Robert B Laughlin and David Pines. “The theory of everything”. In: Proceedings of
the national academy of sciences 97.1 (2000), pp. 28–31 (cit. on p. 19).

[62]Yuxi Li. “Deep reinforcement learning: An overview”. In: arXiv preprint arXiv:1701.07274
(2017) (cit. on p. 6).

[63]Chin-Yew Lin. “Rouge: A package for automatic evaluation of summaries”. In: Text
summarization branches out. 2004, pp. 74–81 (cit. on p. 55).

[64]Jing Liu, Sujie Li, Jiang Zhang, and Pan Zhang. “Tensor networks for unsupervised
machine learning”. In: Physical Review E 107.1 (2023), p. L012103 (cit. on p. 35).

[65]Gabriel Loaiza-Ganem, Brendan Leigh Ross, Luhuan Wu, et al. “Denoising Deep
Generative Models”. In: Proceedings on. PMLR. 2023, pp. 41–50 (cit. on p. 16).

[66]Dastan Maulud and Adnan M Abdulazeez. “A review on linear regression compre-


hensive in machine learning”. In: Journal of Applied Science and Technology Trends
1.4 (2020), pp. 140–147 (cit. on pp. 6, 8).

[67]Ian P McCulloch. “Infinite size density matrix renormalization group, revisited”. In:
arXiv preprint arXiv:0804.2509 (2008) (cit. on p. 33).

[68]Gary C McDonald. “Ridge regression”. In: Wiley Interdisciplinary Reviews: Computa-


tional Statistics 1.1 (2009), pp. 93–100 (cit. on p. 8).

[69]Kathleen McKeown. Text generation. Cambridge University Press, 1992 (cit. on


p. 45).

[70]Walaa Medhat, Ahmed Hassan, and Hoda Korashy. “Sentiment analysis algorithms
and applications: A survey”. In: Ain Shams engineering journal 5.4 (2014), pp. 1093–
1113 (cit. on p. 45).

[71]Pankaj Mehta, Marin Bukov, Ching-Hao Wang, et al. “A high-bias, low-variance
introduction to machine learning for physicists”. In: Physics reports 810 (2019),
pp. 1–124 (cit. on pp. 3, 4).

[72]Friederike Metz and Marin Bukov. “Self-correcting quantum many-body control us-
ing reinforcement learning with tensor networks”. In: arXiv preprint arXiv:2201.11790
(2022) (cit. on p. 35).

[73]Tom M Mitchell. Machine learning. Vol. 1. 9. McGraw-hill New York, 1997 (cit. on
p. 4).

[74]Simone Montangero, Evenson Montangero, and Evenson. Introduction to tensor


network methods. Springer, 2018 (cit. on pp. 20, 21, 24, 40, 73).

[75]Gordon E Moore et al. Cramming more components onto integrated circuits. 1965
(cit. on p. 3).

[76]Valentin Murg, Frank Verstraete, Örs Legeza, and Reinhard M Noack. “Simulating
strongly correlated quantum systems with tree tensor networks”. In: Physical Review
B 82.20 (2010), p. 205105 (cit. on p. 33).

[77]Fionn Murtagh and Pedro Contreras. “Algorithms for hierarchical clustering: an


overview”. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
2.1 (2012), pp. 86–97 (cit. on p. 8).

[78]Radford M Neal. “Annealed importance sampling”. In: Statistics and computing 11


(2001), pp. 125–139 (cit. on p. 37).

[79]William S Noble. “What is a support vector machine?” In: Nature biotechnology


24.12 (2006), pp. 1565–1567 (cit. on p. 8).

[80]Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. “Ten-
sorizing neural networks”. In: Advances in neural information processing systems 28
(2015) (cit. on p. 35).

[81]Alexander Novikov, Mikhail Trofimov, and Ivan Oseledets. “Exponential machines”.


In: arXiv preprint arXiv:1605.03795 (2016) (cit. on p. 35).

[82]Román Orús. “A practical introduction to tensor networks: Matrix product states


and projected entangled pair states”. In: Annals of physics 349 (2014), pp. 117–158
(cit. on pp. 20–22, 24, 30, 32, 33).

[83]Ivan V Oseledets. “Tensor-train decomposition”. In: SIAM Journal on Scientific


Computing 33.5 (2011), pp. 2295–2317 (cit. on p. 25).

[84]Stellan Östlund and Stefan Rommer. “Thermodynamic limit of density matrix


renormalization”. In: Physical review letters 75.19 (1995), p. 3537 (cit. on p. 33).

[85]David Poulin, Angie Qarry, Rolando Somma, and Frank Verstraete. “Quantum
simulation of time-dependent Hamiltonians and the convenient illusion of Hilbert
space”. In: Physical review letters 106.17 (2011), p. 170501 (cit. on pp. 21, 22).

[86]John Preskill. “Quantum computing in the NISQ era and beyond”. In: Quantum 2
(2018), p. 79 (cit. on p. 35).

[87]Yong Qing, Peng-Fei Zhou, Ke Li, and Shi-Ju Ran. “Compressing neural network by
tensor network with exponentially fewer variational parameters”. In: arXiv preprint
arXiv:2305.06058 (2023) (cit. on p. 35).

[88]Alec Radford, Jeffrey Wu, Rewon Child, et al. “Language models are unsupervised
multitask learners”. In: OpenAI blog 1.8 (2019), p. 9 (cit. on p. 54).

[89]Dorijan Radočaj, Mladen Jurišić, and Mateo Gašparović. “The role of remote sensing
data and methods in a modern approach to fertilization in precision agriculture”.
In: Remote Sensing 14.3 (2022), p. 778 (cit. on p. 4).

[90]YCAP Reddy, P Viswanath, and B Eswara Reddy. “Semi-supervised learning: A brief


review”. In: Int. J. Eng. Technol 7.1.8 (2018), p. 81 (cit. on p. 6).

[91]Gustavo H de Rosa and Joao P Papa. “A survey on text generation using generative
adversarial networks”. In: Pattern Recognition 119 (2021), p. 108098 (cit. on p. 46).

[92]Sebastian Ruder. “An overview of gradient descent optimization algorithms”. In:


arXiv preprint arXiv:1609.04747 (2016) (cit. on p. 10).

[93]Amit Saxena, Mukesh Prasad, Akshansh Gupta, et al. “A review of clustering tech-
niques and developments”. In: Neurocomputing 267 (2017), pp. 664–681 (cit. on
p. 6).

[94]Thomas Schlegl, Philipp Seeböck, Sebastian M Waldstein, Ursula Schmidt-Erfurth,


and Georg Langs. “Unsupervised anomaly detection with generative adversarial
networks to guide marker discovery”. In: Information Processing in Medical Imag-
ing: 25th International Conference, IPMI 2017, Boone, NC, USA, June 25-30, 2017,
Proceedings. Springer. 2017, pp. 146–157 (cit. on p. 16).

[95]Erhard Schmidt. “Zur Theorie der linearen und nichtlinearen Integralgleichungen”.


In: Mathematische Annalen 63.4 (1907), pp. 433–476 (cit. on p. 74).

[96]Ulrich Schollwöck. “The density-matrix renormalization group in the age of matrix


product states”. In: Annals of physics 326.1 (2011), pp. 96–192 (cit. on p. 32).

[97]Erwin Schrödinger. “An undulatory theory of the mechanics of atoms and molecules”.
In: Physical review 28.6 (1926), p. 1049 (cit. on p. 19).

[98]Norbert Schuch, Michael M Wolf, Frank Verstraete, and J Ignacio Cirac. “Compu-
tational complexity of projected entangled pair states”. In: Physical review letters
98.14 (2007), p. 140506 (cit. on p. 34).

[99]Hinrich Schutze, Christopher D Manning, and Prabhakar Raghavan. Introduction to


information retrieval. Cambridge University Press, 2008 (cit. on p. 54).

[100]Y-Y Shi, L-M Duan, and Guifre Vidal. “Classical simulation of quantum many-body
systems with a tree tensor network”. In: Physical review a 74.2 (2006), p. 022320
(cit. on p. 33).

[101]James Stokes and John Terilla. “Probabilistic modeling with matrix product states”.
In: Entropy 21.12 (2019), p. 1236 (cit. on p. 35).

[102]E Miles Stoudenmire. “Learning relevant features of data with multi-scale tensor
networks”. In: Quantum Science and Technology 3.3 (2018), p. 034003 (cit. on
p. 35).

[103]Edwin Stoudenmire and David J Schwab. “Supervised learning with tensor net-
works”. In: Advances in neural information processing systems 29 (2016) (cit. on
p. 35).

[104]Lucas Theis, Aäron van den Oord, and Matthias Bethge. “A note on the evaluation
of generative models”. In: arXiv preprint arXiv:1511.01844 (2015) (cit. on p. 15).

[105]Jakub M Tomczak. Deep generative modeling. Springer, 2022 (cit. on pp. 6, 12, 13,
15, 35).

[106]van den Oord, Aäron. “Deep architectures for feature extraction and generative
modeling”. eng. PhD thesis. Ghent University, 2015, XVII, 144 (cit. on pp. 4, 6).

[107]J Van Gompel, J Haegeman, and J Ryckebusch. Tensor netwerken als niet-gesuperviseerde
generatieve modellen. 2020 (cit. on pp. 15, 16, 37, 38, 48, 53).

[108]Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. “Attention is all you need”. In:
Advances in neural information processing systems 30 (2017) (cit. on p. 46).

[109]Frank Verstraete and J Ignacio Cirac. “Matrix product states represent ground states
faithfully”. In: Physical review b 73.9 (2006), p. 094423 (cit. on p. 28).

[110]Frank Verstraete and J Ignacio Cirac. “Renormalization algorithms for quantum-


many body systems in two and higher dimensions”. In: arXiv preprint cond-mat/0407066
(2004) (cit. on pp. 33, 34).

[111]Frank Verstraete, Juan J Garcia-Ripoll, and Juan Ignacio Cirac. “Matrix product
density operators: Simulation of finite-temperature and dissipative systems”. In:
Physical review letters 93.20 (2004), p. 207204 (cit. on p. 31).

[112]Frank Verstraete, Valentin Murg, and J Ignacio Cirac. “Matrix product states, pro-
jected entangled pair states, and variational renormalization group methods for
quantum spin systems”. In: Advances in physics 57.2 (2008), pp. 143–224 (cit. on
pp. 20, 21, 24).

[113]Frank Verstraete, Diego Porras, and J Ignacio Cirac. “Density matrix renormalization
group and periodic boundary conditions: A quantum information perspective”. In:
Physical review letters 93.22 (2004), p. 227205 (cit. on p. 33).

[114]Guifré Vidal. “Efficient classical simulation of slightly entangled quantum computa-


tions”. In: Physical review letters 91.14 (2003), p. 147902 (cit. on p. 33).

[115]Tom Vieijra, Jutho Haegeman, Frank Verstraete, and Laurens Vanderstraeten. “Direct
sampling of projected entangled-pair states”. In: Physical Review B 104.23 (2021),
p. 235141 (cit. on p. 44).

[116]Tom Vieijra, Laurens Vanderstraeten, and Frank Verstraete. “Generative modeling


with projected entangled-pair states”. In: arXiv preprint arXiv:2202.08177 (2022)
(cit. on pp. 35, 44).

[117]Vieijra, Tom. “Artificial neural networks and tensor networks in Variational Monte
Carlo”. eng. PhD thesis. Ghent University, 2022, XII, 163 (cit. on pp. 19, 20, 30, 31,
34, 35, 51).

[118]Samuel T Wauthier, Bram Vanhecke, Tim Verbelen, and Bart Dhoedt. “Learning
Generative Models for Active Inference using Tensor Networks”. In: Active Inference:
Third International Workshop, IWAI 2022, Grenoble, France, September 19, 2022,
Revised Selected Papers. Springer. 2023, pp. 285–297 (cit. on p. 35).

[119]Steven R White. “Density matrix formulation for quantum renormalization groups”.


In: Physical review letters 69.19 (1992), p. 2863 (cit. on pp. 24, 32).

[120]Steven R White. “Density-matrix algorithms for quantum renormalization groups”.


In: Physical review b 48.14 (1993), p. 10345 (cit. on p. 32).

[121]L Wright, F Barratt, J Dborin, et al. “Deterministic Tensor Network Classifiers”. In:
arXiv preprint arXiv:2205.09768 (2022) (cit. on p. 35).

[122]Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and Russ R Salakhutdinov. “Re-
view networks for caption generation”. In: Advances in neural information processing
systems 29 (2016) (cit. on p. 45).

[123]Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. “On early stopping in gradient
descent learning”. In: Constructive Approximation 26.2 (2007), pp. 289–315 (cit. on
p. 51).

[124]Yangyong Zhu, Ning Zhong, and Yun Xiong. “Data explosion, data nature and
dataology”. In: Brain Informatics: International Conference, BI 2009 Beijing, China,
October 22-24, 2009 Proceedings. Springer. 2009, pp. 147–158 (cit. on p. 3).

Appendix A: Tensor Manipulations
A.1 Index grouping and splitting

A tensor can be manipulated through the grouping or ungrouping of its indices,
thereby changing its structure. For instance, the rank-3 tensor from Equation 3.5 can
be reshaped into a rank-2 tensor by combining the α and β indices:

$$T_{\mu,\gamma} \qquad (A.1)$$

where µ = (α, β). The splitting of indices is the inverse operation. This involves
separating a single index into multiple indices, which in turn changes the organization
of the tensor's data. Generally speaking, a tensor can be reshaped into another
tensor of any rank, following some given (invertible) rule [74]. Naturally, both
tensors must possess the same number of elements. An example of such a rule is
given for the above case, where a 2 × 2 × 2 tensor with elements T_{ijk} is reshaped
into a 4 × 2 matrix with elements:
 
$$\begin{pmatrix}
T_{111} & T_{112} \\
T_{121} & T_{122} \\
T_{211} & T_{212} \\
T_{221} & T_{222}
\end{pmatrix} \qquad (A.2)$$

In terms of computational cost, reshaping a tensor can be regarded as a negligible
operation [4].
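As an illustration, the grouping of Equation A.2 corresponds to a simple row-major reshape in NumPy; the sketch below assumes that convention for the combined index µ = (α, β).

```python
import numpy as np

# a 2 x 2 x 2 tensor T_{ijk}
T = np.arange(8).reshape(2, 2, 2)

# group the first two indices into mu = (i, j): a 4 x 2 matrix as in (A.2)
M = T.reshape(4, 2)
assert M[1, 0] == T[0, 1, 0]   # row mu = 1 corresponds to (i, j) = (0, 1)

# splitting is the inverse operation
assert np.array_equal(M.reshape(2, 2, 2), T)
```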

A.2 Decomposition

The process of grouping the indices of a tensor enables the reshaping of a tensor
with any rank-N to a matrix form. This can be expressed as follows:

$$T_{\alpha_1,\dots,\alpha_N} = T_{(\alpha_1,\dots,\alpha_n)(\alpha_{n+1},\dots,\alpha_N)} = T_{i,j} \qquad (A.3)$$

73
As tensors can be flattened into matrices, it is possible to utilize the mathematical
tools of matrix algebra and extend them to manipulate tensors. The Singular Value
Decomposition (SVD) is an example of such an operation that can factorize any
rank-r matrix. This technique involves decomposing a matrix into three matrices.
More specifically, the SVD of any matrix T ∈ C^{m×n} is a factorization of the form

$$T = U \Sigma V^\dagger \qquad (A.4)$$

where U ∈ C^{m×m}, V ∈ C^{n×n} and V^† is the conjugate transpose of V. The matrices
U and V are unitary, meaning that U^†U = I_{m×m} and V^†V = I_{n×n}. The matrix
Σ ∈ C^{m×n} is a diagonal, non-negative real matrix (the singular values are the square
roots of the eigenvalues of T^†T, hence their non-negativity). The diagonal elements of
Σ are referred to as the singular values σ_i = Σ_{ii} and are uniquely determined by
the matrix T. By convention the singular values are arranged in descending order,
σ_1 ≥ σ_2 ≥ · · · ≥ σ_r, where r, the number of non-zero singular values, is called the
rank of the matrix T. The left-singular vectors of a matrix T are represented
by the columns of matrix U, while the right-singular vectors are represented by the
columns of matrix V. These vectors form two sets of orthonormal bases.

A useful reduced description of the SVD is the compact SVD. In this case, we
restrict Σ to its k ≤ min(m, n) non-zero diagonal elements and only keep the
corresponding vectors of U and V, such that U ∈ C^{m×k} and V ∈ C^{n×k} are isometric
matrices satisfying U^†U = I_k and V^†V = I_k, while UU^† and VV^† are merely
projectors [45].
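The properties listed above can be checked numerically; the sketch below uses NumPy, whose svd routine returns V^† (named Vh) rather than V, and where full_matrices=False yields the compact form.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((6, 4)) + 1j * rng.standard_normal((6, 4))

# full_matrices=False gives the compact form: U is 6 x k, Vh is k x 4 with k = min(m, n)
U, s, Vh = np.linalg.svd(T, full_matrices=False)

assert np.allclose(U @ np.diag(s) @ Vh, T)              # T = U Sigma V^dagger
assert np.allclose(U.conj().T @ U, np.eye(len(s)))      # U^dagger U = I_k (isometry)
assert np.allclose(Vh @ Vh.conj().T, np.eye(len(s)))    # V^dagger V = I_k (Vh stores V^dagger)
assert np.all(s[:-1] >= s[1:]) and np.all(s >= 0)       # ordered, non-negative singular values
```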

In fact, it is common practice in the field of quantum information to use the Schmidt
decomposition [95], which is basically a restatement of the SVD. In order to see this,
consider a general pure bipartite state

$$|\psi\rangle_{AB} = \sum_{\alpha\beta} c_{\alpha\beta}\, |\alpha\rangle \otimes |\beta\rangle \in \mathcal{H}_A \otimes \mathcal{H}_B.$$

The matrix C with elements c_{αβ} can be written as C = UΣV^† using the singular value
decomposition (SVD). Now applying U and V as basis transformations in H_A and H_B
respectively gives
$$|\psi\rangle = \sum_{i=1}^{k} \sigma_i\, |u_i\rangle \otimes |v_i\rangle \qquad (A.5)$$

The singular values σ_i are also called the Schmidt weights. The number of non-zero
singular values σ_i is called the Schmidt number, and a bipartite state is entangled if
the Schmidt number is greater than one. When incorporating Equation A.5 in the
expression for the Von Neumann entropy of the reduced density matrix (given by
Equation 3.4), this results in

$$S = -\sum_i \sigma_i^2 \log \sigma_i^2 \qquad (A.6)$$

This result shows how the Schmidt weights provide a measure of the entanglement
between a part of the system and its environment [43].
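As a small illustration, the Schmidt weights and the entropy of Equation A.6 can be obtained directly from the SVD of the coefficient matrix c_{αβ}; the dimensions and the random state below are arbitrary choices made for the sketch.

```python
import numpy as np

dA, dB = 4, 6
rng = np.random.default_rng(1)

# a random normalized pure state |psi> in H_A (x) H_B, stored as the matrix c_{alpha beta}
C = rng.standard_normal((dA, dB)) + 1j * rng.standard_normal((dA, dB))
C /= np.linalg.norm(C)                      # Frobenius norm, so sum_i sigma_i^2 = 1

# Schmidt weights = singular values of the coefficient matrix
sigma = np.linalg.svd(C, compute_uv=False)

schmidt_number = np.sum(sigma > 1e-12)      # entangled if greater than one
entropy = -np.sum(sigma**2 * np.log(sigma**2))   # Eq. (A.6)
print(schmidt_number, entropy)
```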

Colophon

This thesis was typeset with LaTeX 2ε. It uses the Clean Thesis style developed by
Ricardo Langner. The design of the Clean Thesis style is inspired by user guide
documents from Apple Inc.

Download the Clean Thesis style at http://cleanthesis.der-ric.de/.
