Generative Modelling With Tensor Networks
Martin Molnar
First of all, I wish to express my gratitude to Prof. Frank Verstraete and Prof. Jutho
Haegeman, my supervisors at the Quantum Group, for granting me the opportunity
to undertake my master’s thesis in their research group. I would like to extend a
special appreciation to Prof. Jutho Haegeman for his dedicated time and invaluable
guidance throughout the entire process of my thesis. I am also grateful for his advice
that greatly influenced the refinement of my numerous drafts.
I also want to extend a special thanks to Céline Molnar and Sébastien Van Laecke for
their assistance in meticulously identifying typos and helping to correct my written
work.
I would also like to express my heartfelt gratitude to Doriane for her unwavering
support and encouragement throughout the whole academic year.
Lastly, I would like to express my deepest gratitude to my family for their unwavering
presence and support. I am particularly grateful to my grandfather, who nurtured
my curiosity during my childhood, which undoubtedly played a significant role
in shaping my decision to study Engineering physics. I would also like to extend
my thanks to my sisters, Lucie and Camille, for their constant support and for the
delightful companionship they provided at home. Lastly, but certainly not least, I
want to express my deepest gratitude to my parents, Nadine and Michel, who are
the epitome of love and hard work. I am immensely thankful for their unwavering
support and belief in me throughout the years. They have not only allowed me
to pursue my passion in studying Engineering Physics, but also instilled in me the
invaluable lessons of hard work. I will forever cherish their guidance and I am
eternally grateful for the countless sacrifices they have made. Merci à vous.
Admission to Loan
The author gives permission to make this master dissertation available for consul-
tation and to copy parts of this master dissertation for personal use. In all cases of
other use, the copyright terms have to be respected, in particular with regard to
the obligation to state explicitly the source when quoting results from this master
dissertation.
Declaration
Generative machine learning with tensor
networks
Martin Molnar
Student number: 01700533
Abstract—In conventional machine learning approaches, discriminative tasks aim to model the conditional probability of obtaining a specific label given a specific input feature. In contrast, the field of generative modeling considers that the data points that constitute a dataset can be regarded as samples extracted from an unknown underlying data distribution. The main objective of generative models is to model this underlying distribution. The probabilistic formulation of quantum mechanics provides a framework for modeling probability distributions using quantum states. This formulation can serve as an inspiration for constructing generative models in a similar way. This work focuses on a model that uses Matrix Product States, a (one-dimensional) tensor network, as a quantum state to model the underlying probability distribution of a dataset. The algorithm used to optimize the model resembles the two-site Density Matrix Renormalization Group algorithm. The model also benefits from a direct sampling method to efficiently generate new samples. Considering the one-dimensional nature of language, this work investigates the applicability of MPS to model underlying probability distributions existing in textual data. More specifically, it is applied for the purpose of text generation and language recognition. Generating text with MPS remains a challenging task and the MPS-based model shows a poor ability to generate text as well as a clear lack of semantic understanding. However, it seems that the MPS representation is still able to learn the specificities of a language and can be used as a language recognition model with a high accuracy.

Keywords—Generative Modelling, Tensor Networks (TN), Matrix Product States (MPS), Natural Language Processing (NLP)

I. INTRODUCTION

In order to build reliable decision-making AI systems, discriminative models alone are not sufficient. In addition to requiring extensive labeled datasets, which may be limited in availability, discriminative models often lack a semantic comprehension of their environment and also lack the aptitude to express uncertainty concerning their decision making [1]. Generative models could potentially present themselves as a solution to these concerns. Additionally, they might make it easier to assess the effectiveness of ML models. Indeed, classifiers frequently produce identical results for a wide variety of inputs, making it challenging to determine exactly what the model has learned [2]. Unlike discriminative models, which aim at modelling a conditional probability distribution P(y|x) where y is a target value or label of a data point x, generative models try to model the joint data distribution P(x). At the core of generative models lies the fundamental concept that data can be regarded as samples extracted from a genuine real-world distribution, x ∼ P_d(x), commonly referred to as the data distribution. The fundamental objective of all generative models is to model the data distribution to some extent. The literature regarding generative models roughly subdivides the field into four main groups [1]: auto-regressive generative models (ARM), flow-based models, latent variable models (e.g. Generative Adversarial Networks (GAN) or Variational Auto-Encoders (VAE)) and energy-based models (e.g. Boltzmann machines).

There exists a significant synergy between the fields of physics and machine learning, wherein each has a substantial influence on the other. On the one hand, machine learning has emerged as an important tool in physics and other related fields [3]–[6]. On the other hand, ideas from the field of physics can serve as inspiration for the development of new, physics-inspired learning schemes. Examples of this are the Hopfield model [7] and Boltzmann machines [8], which are closely related to the Ising model and its inverse version. Another type of generative model that draws inspiration from physics is the Born machine, where the probability distribution is modeled by a quantum state wave function Ψ and is given by its squared amplitude according to Born's rule. There exist different Ansätze in quantum mechanics to represent quantum states. This work focuses on models that use Tensor Network (TN) states as a parametrization class to model data distributions.

Tensor networks (TN) [9]–[14] are variational quantum states initially proposed to describe ground states of many-body systems. They have already been used in the context of machine learning. Notably, they have been employed for various learning tasks, such as classification [15], [16], for certain unsupervised learning problems [17], [18], for the compression of neural networks [19], for reinforcement learning [20] and also to model the underlying distribution of datasets [21]–[23].

This work employs a learning scheme which is based on Matrix Product States (MPS) as described in Ref. [21]. MPS refers to a specific type of tensor network where the tensors are interconnected in a one-dimensional fashion. This arrangement is also referred to as the tensor train decomposition within the mathematical community. Learning is achieved by optimizing the MPS using an algorithm that resembles the two-site DMRG algorithm [24]. It has been shown that the MPS model exhibits a strong learning ability and additionally benefits from a direct sampling method for the generation of new samples that is more efficient than that of other traditional generative models. The generative model based on Matrix Product States will be employed to capture the underlying distributions in textual data, serving as a foundation for text generation. Moreover, this model will be used for constructing a language classifier, enabling accurate categorization of linguistic content in three different languages (English, French and Dutch).
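The Born-machine construction mentioned above can be made concrete with a small numerical sketch. The example below is purely illustrative and is not the model of this work: a wave function assigns one amplitude to each discrete configuration, and Born's rule turns the squared amplitudes into a normalized probability distribution.

```python
import numpy as np

# Minimal Born-machine illustration (not the MPS model of this work):
# a wavefunction over all 2^N binary configurations; Born probabilities
# are the squared amplitudes divided by the normalization Z.
rng = np.random.default_rng(0)
N = 4                                    # number of binary variables
psi = rng.normal(size=2 ** N)            # one real amplitude per configuration
Z = np.sum(psi ** 2)                     # partition function (normalization)
p = psi ** 2 / Z                         # Born probabilities, sum(p) == 1

x = (1, 0, 1, 1)                         # an example configuration
idx = int("".join(map(str, x)), base=2)  # index of that configuration
print(f"P{x} = {p[idx]:.4f}, total probability = {p.sum():.4f}")
```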
The rest of this paper is organized as follows: In Section II the fundamental principles of Tensor Networks are presented within the framework of many-body physics. Section III presents the MPS-based learning scheme (initially proposed in Ref. [21]) as well as a direct sampling procedure to draw samples from the MPS. Section IV displays the results achieved in text generation and language classification tasks utilizing an MPS-based generative framework. Subsequently, Section V discusses the potential further research areas that are worth exploring to model textual data using MPS.

II. TENSOR NETWORKS

Quantum many-body physics aims at describing and understanding systems that consist of many interacting particles that are subjected to the Schrödinger equation. Although the last decades have seen a tremendous amount of developments in theoretical and numerical methods for the treatment of many-body systems, the many-body quantum problem remains one of the most challenging problems in physics. One of the postulates of quantum theory is that quantum states can be represented by state vectors in a Hilbert space [25]. In a quantum system, interacting particles can be in a superposition and one of the consequences of this superposition principle is that the dimension of the Hilbert space grows exponentially with the number of particles in the system. For instance, consider a general quantum many-body system consisting of N particles. The system can be described by the following wavefunction:

|ψ⟩ = ∑_{j_1 j_2 ... j_N} Ψ_{j_1 j_2 ... j_N} |j_1⟩ ⊗ |j_2⟩ ⊗ ... ⊗ |j_N⟩   (1)

where each |j_i⟩ represents the basis of an individual particle with label i and with dimension d (for simplicity, the dimensions of the individual particles' bases are assumed to be equal). The number of possible configurations of the system and hence the dimension of the Hilbert space is d^N [10]. The wavefunction of the many-body system is expressed as a linear combination of all the possible configurations where each configuration is weighted with a specific coefficient. The coefficients corresponding to the different d^N configurations can be represented using the coefficient tensor Ψ_{j_1,j_2,...,j_N} that contains d^N parameters. This tensor has N d-dimensional indices and fully determines the wavefunction of the quantum system. Describing the wavefunction by specifying the coefficients related to all possible configurations of the system is clearly an inefficient way to deal with large systems. The solution proposed by tensor network approaches is to represent and approximate the coefficient tensor Ψ_{j_1,j_2,...,j_N} by a network of interconnected lower-rank tensors with a given structure. The structure of a tensor network is determined by the entanglement patterns between the local degrees of freedom within the quantum system it represents.

Tensor Network methods have been successfully used to describe quantum ground states in quantum many-body systems. The reason behind their success is their efficient parametrization of quantum systems (i.e. the number of parameters scales only polynomially with N [26]), which focuses specifically on a special group of states within the Hilbert space. In quantum many-body systems, the presence of certain structures within the Hamiltonians (e.g. the local character of physical interactions found in many physical Hamiltonians) is such that the physical states exist in a tiny manifold of the gigantic Hilbert space. The states located in this manifold possess unique entanglement characteristics. In fact, it can be demonstrated that low-energy eigenstates of gapped Hamiltonians with local interactions follow the area-law for entanglement entropy [27] and those states are thereby heavily constrained by the local behavior of physical interactions [13]. The states that satisfy the area-law for entanglement are precisely the ones targeted by tensor network states. This implies that tensor networks specifically focus on the crucial corner of the Hilbert space where physical states reside.

Matrix Product States (MPS) are one specific type of tensor network and they have been successfully used to describe quantum ground states (especially) for physical systems in one dimension. In an MPS the rank-N coefficient tensor is expressed as a product of lower-rank tensors:

|ψ⟩ = ∑_{{j_i}} ∑_{{α_i}} A^1_{j_1 α_1} A^2_{α_1 j_2 α_2} A^3_{α_2 j_3 α_3} · · · A^N_{α_{N−1} j_N} |j_1 j_2 ... j_N⟩   (2)

which can be expressed by means of the tensor diagrammatic notation¹ as:

[Equation (3): the diagrammatic form of Eq. (2), a chain of tensors A^1, A^2, A^3, ..., A^{N−1}, A^N connected by virtual bonds, each carrying one open physical index.]

By convention, a tensor is represented in pictorial language by an arbitrary shape or block (tensor shapes can have a certain meaning depending on the context) with emerging lines, where each line corresponds to one of the tensor's indices. An index that is shared by two tensors designates a contraction between them (i.e. a sum over the values of that index); in pictorial language that is represented by a line connecting two blocks.
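As a concrete illustration of Equation (2), the following sketch (random tensors, purely illustrative and not code from this work) evaluates one amplitude Ψ_{j_1 ... j_N} by multiplying the matrices selected by the physical indices.

```python
import numpy as np

# Toy MPS with open boundaries: tensors A[n] of shape (D_left, d, D_right),
# with bond dimension 1 on the outer bonds. The amplitude of a configuration
# (j1, ..., jN) is the product of the matrices A[n][:, j_n, :], cf. Eq. (2).
rng = np.random.default_rng(1)
N, d, D = 6, 2, 3                                   # sites, physical dim, bond dim
dims = [1] + [D] * (N - 1) + [1]                    # bond dimensions D_0 ... D_N
A = [rng.normal(size=(dims[n], d, dims[n + 1])) for n in range(N)]

def amplitude(config):
    """Contract the MPS from left to right for one basis configuration."""
    M = np.eye(1)
    for n, j in enumerate(config):
        M = M @ A[n][:, j, :]                       # matrix product over the bonds
    return M[0, 0]

config = (0, 1, 1, 0, 1, 0)
print("Psi(config) =", amplitude(config))
```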
III. MPS-BASED GENERATIVE MODEL

Considering the inherent similarities between the task of generative modelling and quantum physics, the application of tensor networks to effectively capture underlying probability distributions of datasets is not unexpected. Indeed, both fields strive to model a probability distribution within a vast parameter space. Furthermore, the pertinent configurations encompass only a small fraction of the exponentially vast parameter space. Consider, for example, the case of images, where the majority of pixel value combinations are regarded as noise, while the actual images constitute a small subset of feasible configurations. This has again some striking similarities with the field of quantum physics, wherein the relevant states exist within a tiny manifold of the Hilbert space. The success of tensor networks in efficiently parameterizing such states suggests that they could be useful for generative modeling. In this section, the MPS-based generative model used in this work is presented. It was initially proposed in Ref. [21].

A. Model representation

Consider a dataset T consisting of |T| N-dimensional data points v_i ∈ V = {0, 1, ..., p}^{⊗N} where the entries of the vectors v_i can take p different values. The data points v_i are potentially repeated in the dataset and can be mapped to basis vectors of a Hilbert space of dimension p^N. The dataset can be seen as a collection of samples extracted from the underlying data distribution.

¹ See Ref. [10] for a comprehensive introduction to the tensor diagrammatic notation.
this problem [35]. The MPS were trained for different values of N (ranging from four to ten) and for different values of Dmax. The training and validation sets systematically consisted of 0.75×10⁵ and 10⁵ samples, respectively. Unfortunately, the trained models did not show a better ability to generate text. The sentences generated by these models have revealed that models based on longer MPS still exhibit limited abilities to generate coherent sentences. While all the models are capable of generating some existing words, they also generate non-existent ones. Furthermore, it is crucial to highlight that none of the models exhibit any level of semantic understanding, which is a fundamental requirement for a proficient text generator.

During training, the MPS adapts the bond dimension of the different bonds to capture the correlation that exists within the training set. The problem in this case is that the physical dimension is relatively high and the correlation within textual data is significant. Thereby, the bond dimensions of the Matrix Product States systematically attain the maximal allowed value of Dmax, except for the first and last bond, which have a value equal to the physical dimension. This renders the training of longer MPS more challenging. Furthermore, when the training set size is too small, the resulting trained MPS may be sub-optimally trained, as was the case for the 5-site MPS which was trained on 10³ samples (see Figure 1). Longer MPS might therefore require larger training sets, but this would render the training computationally even more expensive.

Another attempt that was investigated to improve the performance of the model, while limiting the increase in physical dimension, is to selectively incorporate the most frequently occurring bi-grams (i.e. sequences of two characters) alongside the uni-grams (i.e. the characters) already taken into account up until now. A 5-site MPS model has been trained by considering the top 1/15 fraction of the most frequent bi-grams, resulting in a physical dimension of 65. The model was trained on a training set consisting of 5×10⁴ samples. The text generation performance of the model trained using the selected bi-grams appeared to be inferior compared to the basic 5-site MPS model. This observation might be attributed to either a too small training set size (as increasing the physical dimension leads to longer computational times, a smaller training set was chosen to balance computational efficiency) or a sub-optimal choice for the maximum bond dimension (Dmax).

C. Language Recognition

In this section, an MPS-based model for language classification is presented. The fact that the partition function can be computed exactly and efficiently allows for the evaluation of the log-likelihood of any given text section. The proposed approach involves training multiple MPS-based models, where each model corresponds to a specific language. The goal is that each model learns the specificities of the probability distribution of the language it tries to model, so as to be able to give a high likelihood to text sections derived from that language and a low likelihood to text sections from other languages. In this work, three languages were considered: English, French and Dutch. When considering an N-site MPS, the classification process is as follows. Consider a given text segment s of size S originating from a text written in English, French or Dutch. The average log-likelihood is calculated for the three models trained on the English, French and Dutch datasets respectively, and it is given by (1/(S−N+1)) ∑_{i=1}^{S−N+1} log P_language(v_i). The classification process involves selecting the language for which the model gives the highest average log-likelihood. To assess the model's performance, the accuracy is used as an evaluation metric. Accuracy is an appropriate measure in this case because both the training and the test sets are perfectly balanced between the different classes.

Fig. 2. Accuracy of the 5-site (a) and 4-site (b) MPS-based language recognition models as a function of the number of characters in the text segments used for evaluation. The results are shown for different values of the maximal bond dimension allowed.

Models with different values of N and Dmax were trained. In Figure 2, the accuracy is shown for the 4-site MPS-based model as a function of the number of characters in the text segments used for evaluation. Results are shown for different values of Dmax. A similar trend was observed for the 5-site MPS, with a slightly better accuracy achieved for smaller evaluation text sequences. The MPS-based classification models demonstrate a high accuracy, especially for longer text segments. This holds true for all values of Dmax. The results indicate that an optimal value for Dmax is 25, as increasing the bond dimension beyond this value does not result in a significant improvement in accuracy, while decreasing it to 10 significantly reduces the accuracy (blue curve in Figure 2). This observation is valid for text segments of various lengths. Additionally, the results demonstrate that the MPS-based approach performs better in classifying longer text segments. This observation seems logical, as longer text segments can be divided into more text sections, thereby increasing the chance of having a large average likelihood and improving the classification accuracy. The accuracy achieved for very short text segments, specifically those containing five characters, although slightly lower, is still quite good as it ranges between 0.70 and 0.85. In comparison, a simple naive Bayes classifier trained on the same three books achieved an average accuracy close to 0.85.
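The decision rule just described, averaging the log-likelihood of the length-N character windows of a text segment under each per-language model and selecting the largest value, can be sketched as follows. The per-language scorers used here are simple stand-ins; in the actual experiments they would be the trained MPS models.

```python
import numpy as np

def average_log_likelihood(segment, log_p, N):
    """Average log-likelihood of the S - N + 1 windows of length N,
    cf. 1/(S-N+1) * sum_i log P_language(v_i)."""
    S = len(segment)
    windows = [segment[i:i + N] for i in range(S - N + 1)]
    return np.mean([log_p(w) for w in windows])

def classify(segment, models, N):
    """Pick the language whose model gives the highest average log-likelihood.
    `models` maps a language name to a function returning log P(window)."""
    scores = {lang: average_log_likelihood(segment, log_p, N)
              for lang, log_p in models.items()}
    return max(scores, key=scores.get)

# Stand-in scorers (hypothetical): a real run would evaluate the trained
# N-site MPS of each language on the character window.
models = {
    "english": lambda w: -1.0 * len(set(w)),
    "french":  lambda w: -1.2 * len(set(w)),
    "dutch":   lambda w: -1.1 * len(set(w)),
}
print(classify("the quick brown fox", models, N=5))
```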
V. SUMMARY AND OUTLOOK

In conclusion, generating text with an MPS-based model is a challenging task, mainly due to the high physical dimension required and the significant correlation that exists within textual data. Further research should be conducted to thoroughly explore the complete potential of MPS-based generative models for text generation. This could be accomplished through conducting more extensive and computationally intensive simulations, incorporating a higher physical dimension and larger training set sizes. While theoretically feasible, such simulations were deemed too demanding to be conducted within the scope of this study.
Comparing the performance of models with different MPS sizes is also a delicate task because of the fundamental differences that exist between the datasets on which those models are evaluated with the NLL. Further research should incorporate a universal quantifiable measure of the performance of a model that can be used for all models. This measure could be inspired by the BLEU or ROUGE evaluation methods commonly used to evaluate text generation performance, where the generated text is compared to a reference text segment [36]. Additional studies should also focus on the development of problem-specific datasets. For example, when tackling a task such as predicting the most probable words based on an input sequence for email text, it might be interesting to use a dataset tailored for that task.

Despite the complexity of generating text using an MPS-based model, and the suboptimal performances observed, the model still exhibits the ability to capture the specificities of the language it aims to model. This proficiency is sufficient for achieving good results in language classification tasks. Future research should prioritize the inclusion of additional languages. However, it is important to acknowledge that evaluating the average log-likelihood for numerous MPS models (corresponding to all the languages taken into account) would be relatively time-consuming.

REFERENCES

[1] J. M. Tomczak, Deep Generative Modeling. Springer, 2022.
[2] A. Lamb, "A brief introduction to generative models," arXiv preprint arXiv:2103.00265, 2021.
[3] T. Vieijra, "Artificial neural networks and tensor networks in Variational Monte Carlo," Ph.D. dissertation, Ghent University, 2022.
[4] D. Baron, "Machine learning in astronomy: A practical overview," arXiv preprint arXiv:1904.07248, 2019.
[5] R. E. Goodall and A. A. Lee, "Predicting materials properties without crystal structure: Deep representation learning from stoichiometry," Nature Communications, vol. 11, no. 1, p. 6280, 2020.
[6] Z.-A. Jia, B. Yi, R. Zhai, Y.-C. Wu, G.-C. Guo, and G.-P. Guo, "Quantum neural network states: A brief review of methods and applications," Advanced Quantum Technologies, vol. 2, no. 7-8, p. 1800077, 2019.
[7] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
[8] N. Zhang, S. Ding, J. Zhang, and Y. Xue, "An overview on restricted Boltzmann machines," Neurocomputing, vol. 275, pp. 1186–1199, 2018.
[9] F. Verstraete, V. Murg, and J. I. Cirac, "Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems," Advances in Physics, vol. 57, no. 2, pp. 143–224, 2008.
[10] J. C. Bridgeman and C. T. Chubb, "Hand-waving and interpretive dance: an introductory course on tensor networks," Journal of Physics A: Mathematical and Theoretical, vol. 50, no. 22, p. 223001, 2017.
[11] S. Montangero, Introduction to Tensor Network Methods. Springer, 2018.
[12] J. I. Cirac and F. Verstraete, "Renormalization and tensor product states in spin chains and lattices," Journal of Physics A: Mathematical and Theoretical, vol. 42, no. 50, p. 504004, 2009.
[13] R. Orús, "A practical introduction to tensor networks: Matrix product states and projected entangled pair states," Annals of Physics, vol. 349, pp. 117–158, 2014.
[14] T. E. Baker, S. Desrosiers, M. Tremblay, and M. P. Thompson, "Méthodes de calcul avec réseaux de tenseurs en physique," Canadian Journal of Physics, vol. 99, no. 4, pp. 207–221, 2021.
[15] E. Stoudenmire and D. J. Schwab, "Supervised learning with tensor networks," Advances in Neural Information Processing Systems, vol. 29, 2016.
[16] A. Novikov, M. Trofimov, and I. Oseledets, "Exponential machines," arXiv preprint arXiv:1605.03795, 2016.
[17] J. Liu, S. Li, J. Zhang, and P. Zhang, "Tensor networks for unsupervised machine learning," Physical Review E, vol. 107, no. 1, p. L012103, 2023.
[18] E. M. Stoudenmire, "Learning relevant features of data with multi-scale tensor networks," Quantum Science and Technology, vol. 3, no. 3, p. 034003, 2018.
[19] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, "Tensorizing neural networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[20] S. T. Wauthier, B. Vanhecke, T. Verbelen, and B. Dhoedt, "Learning generative models for active inference using tensor networks," in Active Inference: Third International Workshop, IWAI 2022, Grenoble, France, September 19, 2022, Revised Selected Papers. Springer, 2023, pp. 285–297.
[21] Z.-Y. Han, J. Wang, H. Fan, L. Wang, and P. Zhang, "Unsupervised generative modeling using matrix product states," Physical Review X, vol. 8, no. 3, p. 031012, 2018.
[22] S. Cheng, L. Wang, T. Xiang, and P. Zhang, "Tree tensor networks for generative modeling," Physical Review B, vol. 99, no. 15, p. 155131, 2019.
[23] T. Vieijra, L. Vanderstraeten, and F. Verstraete, "Generative modeling with projected entangled-pair states," arXiv preprint arXiv:2202.08177, 2022.
[24] S. R. White, "Density matrix formulation for quantum renormalization groups," Physical Review Letters, vol. 69, no. 19, p. 2863, 1992.
[25] H. Bruus and K. Flensberg, Many-Body Quantum Theory in Condensed Matter Physics: An Introduction. OUP Oxford, 2004.
[26] D. Poulin, A. Qarry, R. Somma, and F. Verstraete, "Quantum simulation of time-dependent Hamiltonians and the convenient illusion of Hilbert space," Physical Review Letters, vol. 106, no. 17, p. 170501, 2011.
[27] J. Eisert, M. Cramer, and M. B. Plenio, "Area laws for the entanglement entropy: a review," arXiv preprint arXiv:0808.3773, 2008.
[28] S. Cheng, J. Chen, and L. Wang, "Information perspective to probabilistic modeling: Boltzmann machines versus Born machines," Entropy, vol. 20, no. 8, p. 583, 2018.
[29] J. Van Gompel, J. Haegeman, and J. Ryckebusch, Tensor netwerken als niet-gesuperviseerde generatieve modellen, 2020.
[30] R. M. Neal, "Annealed importance sampling," Statistics and Computing, vol. 11, pp. 125–139, 2001.
[31] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
[32] S. R. White, "Density-matrix algorithms for quantum renormalization groups," Physical Review B, vol. 48, no. 14, p. 10345, 1993.
[33] A. Fischer and C. Igel, "An introduction to restricted Boltzmann machines," in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3-6, 2012, Proceedings 17. Springer, 2012, pp. 14–36.
[34] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010: 19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010, Keynote, Invited and Contributed Papers. Springer, 2010, pp. 177–186.
[35] Y. Yao, L. Rosasco, and A. Caponnetto, "On early stopping in gradient descent learning," Constructive Approximation, vol. 26, no. 2, pp. 289–315, 2007.
[36] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, 2004, pp. 74–81.
Contents
1 Introduction 1
1.1 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Generative modeling 3
2.1 Fundamentals of machine learning . . . . . . . . . . . . . . . . . . . 3
2.1.1 Motivation and definition . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Building blocks of a ML model . . . . . . . . . . . . . . . . . . 4
2.2 Generative models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Probabilistic framework and maximum likelihood estimation . 13
2.2.2 Evaluating a generative model . . . . . . . . . . . . . . . . . . 15
2.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3 Tensor Networks 19
3.1 Quantum many-body systems . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Motivation and representation of Tensor Network states . . . . . . . 21
3.2.1 Diagrammatic representation . . . . . . . . . . . . . . . . . . 22
3.2.2 Tensor Networks and quantum states . . . . . . . . . . . . . . 24
3.3 Matrix Product State (MPS) . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 Construction via successive Singular Value Decomposition . . 25
3.3.2 Gauge degree of freedom and canonical form . . . . . . . . . 28
3.3.3 Expectation values and operators . . . . . . . . . . . . . . . . 29
3.3.4 Optimizing MPS: finding ground states . . . . . . . . . . . . . 31
3.4 Tree Tensor Networks and Projected Entangled Pair States . . . . . . 33
5 Application: Natural Language Processing (NLP) 45
5.1 Dataset and workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 Text generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3 Language recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6 Conclusion 61
Bibliography 63
List of Figures
2.1 A typical machine learning pipeline that includes the formulation of the
problem, model selection, optimization, and generalization steps. . . . 5
2.2 Basic neural network architecture. . . . . . . . . . . . . . . . . . . . . 9
2.3 Typical relationships between training loss and generalization loss and
the model’s capacity. Figure from Ref. [41]. . . . . . . . . . . . . . . . 11
2.4 Example of a regression problem where (from left to right) a linear
model, a quadratic model and a degree-9 polynomial model attempt to
capture an underlying quadratic distribution. Figure from Ref. [41]. . . 12
2.5 Image reconstruction from partial images from the MNIST database
[24]. The given parts are in black and the reconstructed parts are in
orange. Figure from [46]. . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 Tensor T of rank 3 represented (a) by all its elements, (b) by the
diagrammatic representation. . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 The coefficient of a quantum many-body state (top) can be represented
in the form of a high-dimensional tensor (middle) with an exponentially
large number of parameters in the system size. The high-dimensional
tensor can be expressed in a network of interconnected tensors (bottom)
that takes into account the structure and the amount of entanglement
in the quantum many-body state. . . . . . . . . . . . . . . . . . . . . . 25
3.3 Illustration of Tree Tensor Network (TTN). . . . . . . . . . . . . . . . . 33
3.4 Illustration of a Projected Entangled-Pair State (PEPS). . . . . . . . . . 34
5.1 NLL calculated on the training and test sets as functions of the number
of optimization sweeps for a 5-site MPS-based model with Dmax = 150.
The results are shown for different training set sizes. . . . . . . . . . . 49
5.2 Histogram of the difference in log-likelihoods under distributions P (x) =
Pd (x) and Q(x) = Pm (x) for samples randomly drawn from the data
distribution. The results are shown for models trained with different
training set size: |Tr| = 10³ in blue, |Tr| = 10⁴ in orange and |Tr| = 10⁵
in turquoise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3 Negative log-likelihood as a function of the maximal bond dimension
Dmax for different values of N . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 Frequency of occurrence of the bi-grams encountered in the textual data. 55
5.5 Accuracy of the 5-site (a) and 4-site (b) MPS-based language recognition
models as a function of the number of characters in the text segments
used for evaluation. The results are shown for different values of the
maximal bond dimension allowed. . . . . . . . . . . . . . . . . . . . . 57
5.6 Confusion matrices shown for the 4-site MPS model with Dmax = 10
evaluated in the two extreme cases where the text segments contain
five characters (a), and 100 characters (b) and for the 5-site MPS model
with Dmax = 250 evaluated on text segment containing five characters
(c), and 100 characters (d). . . . . . . . . . . . . . . . . . . . . . . . . 59
List of Tables
5.1 Sentences generated by the 5-site MPS-based model with Dmax = 150
trained on a training set containing 10⁵ samples. The bolded portion of
the sentences represents the input provided to the model for predicting
the remainder of the sentence. . . . . . . . . . . . . . . . . . . . . . . . 51
5.2 Sentences generated by the different MPS-based models with Dmax =
100 trained on a training set containing 0.75 × 10⁵ samples. The bolded
portion of the sentences represents the input provided to the model for
predicting the remainder of the sentence. . . . . . . . . . . . . . . . . 53
5.3 Sentences generated by the 5-site MPS-based model which takes the top 1/15
fraction of the most frequent bi-grams into account. The model is
trained with Dmax = 150 on a training set containing 5 × 10⁴ samples.
The bolded portion of the sentences represents the input provided to
the model for predicting the remainder of the sentence. . . . . . . . . . 56
Acronyms
KL Kullback–Leibler. 14
RNN Recurrent Neural Networks. 46
TN Tensor Network. 21
TT Tensor Train. 25
1 Introduction
"Without locality, physics would be wild."
The prominence of machine learning has experienced a substantial growth in the last
decades, paralleled by a remarkable expansion in the scope of its applications. In this
thesis, the focus lies on generative modelling, a particular unsupervised learning task.
In generative modelling, it is common practice to assume that datasets can be seen as
a collection of samples extracted from an underlying data distribution. The principal
aim of a generative model is to capture and represent this probability distribution. In
this thesis, generative models inspired by the probabilistic formulation of quantum
mechanics are considered. These models are referred to as Born Machines and
represent the underlying data distribution by means of a quantum state wave
function.
The quote selected to commence this thesis portrays a significant concept which has
profound consequences in the realm of physics. In quantum many-body physics, the
physical interactions exhibit a local behavior, which is manifested through systematic
structures embedded within the Hamiltonians that describe many-body systems.
This locality of physical interactions has profound consequences in terms of the
entanglement properties of quantum ground states. Indeed, it has been shown
that the ground states of physical systems exhibit an area-law for the entanglement
entropy and those states are therefore heavily constrained by locality. As a result,
they live in a small manifold of the exponentially large Hilbert space. It has been
proven that Tensor Network (TN) methods have the ability to efficiently represent
those states. In other words, TN states can be used to represent the ground states of
physical systems efficiently and faithfully.
Seeing the similarities between the task of generative modelling and quantum
physics, the application of tensor networks to effectively capture underlying prob-
ability distributions of datasets is not unexpected. Indeed, both fields strive to
model a probability distribution within a vast parameter space. This thesis focuses
on generative models represented by means of a specific type of (one-dimensional)
tensor network, the matrix product state. The overall goal is to use an MPS-based
generative model to capture and represent the underlying distributions existing in
textual data, serving as a foundation for the tasks of text generation and language
classification. In what follows, an overview of the different chapters of this thesis is
provided.
Chapter 2
This chapter serves as a comprehensive introduction to the field of generative
modelling. The fundamental concepts of machine learning are explained and the
basic building blocks needed to construct a ML model are presented. Then the field of
generative modeling will be introduced, encompassing a comprehensive discussion
on the probabilistic framework often considered as well as diverse applications.
Chapter 3
In order to gain a comprehensive understanding of the application of Tensor Net-
works in the domain of Machine Learning, it is crucial to present the basic concepts
related to Tensor Networks in Physics. This chapter aims to provide a comprehensive
and thorough introduction to the field of Tensor Networks and more specifically to
Matrix Product States within the context of quantum-many body physics.
Chapter 4
This chapter presents a generative model which represents the underlying distribution
by using a matrix product state as its parametrization class. The presented model is
mainly based on the work presented in Ref. [46]. The optimization algorithm used
for learning as well as the advantageous direct sampling method will be presented
and explained.
Chapter 5
In this chapter the MPS-based model presented in Chapter 4 is employed for the
purpose of text generation and language recognition. A framework is proposed to
model textual data by using an MPS. Then the results related to the text generation
are presented. Finally it is shown that the presented framework can be used to build
a language classifier. Results will be shown for an MPS-based classifier trained on
three different languages, namely English, French and Dutch.
2 Generative modeling
2.1 Fundamentals of machine learning
When an individual perceives a darkened sky several kilometers away, they may infer
that precipitation is imminent. Similarly, when an individual is stuck in a traffic jam,
they may assume that there is an accident or roadwork ahead. These predictions
are based on personal experiences and learned associations. The individual has
learned from previous experiences that darkened skies are frequently followed
by precipitation, leading them to conclude that it is about to rain. Likewise, the
individual has learned that traffic jams are often indicative of obstacles on the road,
leading them to conclude that there is an accident or roadwork ahead. Since the
advent of the computer age, huge efforts have been made to build machines or
computers with the ability to learn in a manner akin to human cognition, enabling
them to make decisions and predictions without the need for explicit programming.
These efforts have led to the emergence of the field of Machine Learning (ML),
which is a branch of the broader field of artificial intelligence (AI). The general goal
of ML is to retrieve information from existing data and transform it into knowledge
that can be used to address unseen problems.
The significance of machine learning has increased exponentially over the past
decades. This can be mainly attributed to the remarkable advancements in com-
puting power and memory resources (Moore’s law [75]), and more specifically to
the evolution of graphics processing units (GPU) [71]. Additionally, the explosion
of available data [7, 124] has played a crucial role in the growing importance of
the field of machine learning. Today’s society has become highly dependent on
data-driven methodologies. The practice of data collection and analysis is now
omnipresent in industry but also in science. It has resulted in the adoption of
machine learning techniques for the effective extraction of valuable insights from
massive datasets. Examples of this trend can be observed in numerous sectors,
including agriculture, where data collection is for instance used to optimize the use
of fertilizers [89]. The field of astronomy is also experiencing an impressive growth
in dataset size and complexity [6, 5]. In astronomy, data are used to gain a
better understanding of properties of stars and galaxies. The field of Physics has
likewise entered the era of Big Data where, for instance, some experiments at the
Large Hadron Collider (LHC) generate petabytes (i.e. 1000 terabytes or 10¹⁵ bytes)
of data per year [71]. Making sense of these huge amounts of data can prove to be
a challenging task. Nevertheless, machine learning methodologies can be leveraged
to unearth significant patterns and gain valuable insights from datasets, thereby
enabling the resolution of problems that are beyond the scope of human capacity.
There is no real consensus on how to clearly define the concept of Machine Learning
[106]. Nevertheless there exists a, rather formal, description of the field given by
the widely cited quote of Tom Mitchell [73]:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

[Figure 2.1 depicts the pipeline as a block diagram: a given task and the data feed into the model representation, which is tuned by the learning algorithm to produce the optimized model.]

Fig. 2.1.: A typical machine learning pipeline that includes the formulation of the problem,
model selection, optimization, and generalization steps.
Problem formulation
Supervised and unsupervised learning are the two main classes of problem en-
countered in the field of machine learning. However, it should be noted that this
subdivision can be extended to include other, less common, types of learning such as
semi-supervised learning [90] or reinforcement learning [62]. Semi-supervised
learning uses both labeled and unlabeled data to learn. Usually in this approach,
only a small subset of the training set labeled, while most of the dataset is unla-
beled. Semi-supervised learning paradigms have proven to be rather useful in cases
Model representation
After formulating the problem, the task to be performed is clear (e.g. in the case
of classification that is predicting a class label given some input features) and the
structure of the available training data is determined (e.g. labeled or unlabeled
data). The next step consists of selecting the model representation, which involves
determining the mapping function fθ (x) : X → Y that depends on the model’s
internal parameters θ, where X is the input space and Y is the output space (e.g.
{0, 1} for binary classification). Choosing a model representation fθ (x) entirely
specifies the hypothesis space H through the functional form of fθ (x). The hypothesis
space is the set of all possible candidate solutions (or hypotheses) that a machine
learning algorithm is allowed to consider in order to solve a problem. To illustrate
the meaning of the hypothesis space, consider the regression problem of predicting
the price of a house based on its square footage and assume that the price of a
house is linearly related to its square footage (note that this forms a rather bold
approximation that might not be true in a real situation). In this case, the obvious
choice for the model representation would be a linear mapping function of the
form f (s) = a × s + b, where a and b are the model parameters and s is the square
footage. The hypothesis space for this problem would include all possible linear
functions of the form f (s) = a × s + b, and thus H = {f (s) = a × s + b|∀a, b ∈ R}.
During the following optimization stage, the machine learning algorithm will search
through the hypothesis space for the optimal set of parameters a and b that best fits
the training data and minimize a certain error between predicted and actual house
prices.
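For the house-price example above, the search through the hypothesis space H = {f(s) = a × s + b} can be carried out in closed form with ordinary least squares; the snippet below is a minimal sketch using made-up data.

```python
import numpy as np

# Least-squares search over the hypothesis space f(s) = a*s + b
# for the (made-up) house-price example.
square_footage = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
price = np.array([150_000.0, 230_000.0, 290_000.0, 350_000.0, 440_000.0])

# Design matrix with a column of ones for the intercept b.
X = np.column_stack([square_footage, np.ones_like(square_footage)])
(a, b), *_ = np.linalg.lstsq(X, price, rcond=None)

predict = lambda s: a * s + b
print(f"a = {a:.1f}, b = {b:.1f}, predicted price for s = 110: {predict(110):.0f}")
```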
As one might expect, there are many different model representations to use in the
field of machine learning. The following is a list of some of the widely known
and most significant model representations in the field; however, it is important
to note that this list is not exhaustive. When dealing with classification tasks, the
Some model representations can be used to address various supervised and unsu-
pervised problems. Among these, neural networks are recognized as one of the
most prevalent and widely used model representations in the field of machine learn-
ing. Any comprehensive introduction to machine learning ought to incorporate the
concept of neural networks, which have received considerable attention in the last
decades. They lie at the foundation of many advanced machine learning models
that address various issues such as pattern recognition, prediction, optimization,
and more [53]. Neural networks are inspired by the structure and function of
the human brain, and consist of a collection of interconnected nodes, or "neurons",
that are organized in layers. The fundamental idea behind neural networks is to
create linear combinations of input variables and express the output as a nonlinear
function of these features. Figure 2.2 illustrates the general architecture of a neural
network. The neurons are represented by circles and are the core processing unit of
the network. In a neural network, neurons are arranged in multiple layers, where
the first layer receives the inputs and the final layer predicts the output values. The
number of nodes in the input and output layers is determined by the specific problem
being addressed. Most of the computations occur in the hidden part of the network,
which can consist of one or more layers. Each node in the hidden layer receives a
linear combination of values from the previous layer, weighted by the connections
between the neurons. The resulting linear combination value at each node is then
passed through an activation or threshold function, which produces the output value
of the node and serves as input for the following layer. The parameters of the model
are the weights connecting the different neurons.
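The mechanism just described, each node forming a linear combination of the previous layer's outputs and passing it through an activation function, corresponds to the forward pass sketched below for a small fully connected network with random weights (purely illustrative).

```python
import numpy as np

def relu(z):
    # Activation function applied element-wise to each node's value.
    return np.maximum(0.0, z)

def forward(x, weights, biases):
    """Forward pass: each layer forms a linear combination of the previous
    layer's outputs (W @ a + b) and applies the activation function."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    W, b = weights[-1], biases[-1]    # linear output layer
    return W @ a + b

rng = np.random.default_rng(3)
sizes = [4, 8, 1]                      # input, hidden, output layer widths
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [np.zeros(sizes[i + 1]) for i in range(len(sizes) - 1)]

x = rng.normal(size=4)
print("network output:", forward(x, weights, biases))
```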
It is clear that there exist various model representations in the field of machine
learning. It is important to note that there is no one-size-fits-all machine learning
model. The performance of an ML model representation can only be deemed superior
in relation to a particular problem or dataset. It is therefore the responsibility of ML
researchers to fine-tune and customize their models to suit the specific problems
they are addressing. This dissertation focuses on exploring model representations
that are based on tensor networks and more specifically on matrix product states.
Optimization
MSE = (1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²   (2.2)
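Equation 2.2 translates directly into code; a minimal sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, Eq. 2.2: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.4]))  # approximately 0.06
```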
Generalization
In machine learning, the capacity of a model refers to its ability to capture a wide
range of functions and is mostly determined by the number of parameters used as
well as the optimization algorithm employed [41]. The model capacity determines
whether it is likely to underfit or overfit. Figure 2.3 illustrates the typical behaviour
of training and testing loss as a function of model capacity. The first regime is the
underfitting regime, while the overfitting regime corresponds to an increasing model
capacity that results in a decreasing training loss and an increasing test loss. The
ideal model complexity lies somewhere between the extremes of underfitting and
overfitting. The capacity of the model should be sufficient to capture the underlying
patterns in the data, but should not be so complex that it simply memorizes the
training set without being able to generalize successfully to new data. When the test
loss is minimal, the model with the best generalization potential is achieved.
Fig. 2.3.: Typical relationships between training loss and generalization loss and the model’s
capacity. Figure from Ref. [41].
Figure 2.4 illustrates the concepts of underfitting and overfitting by comparing three
different models (linear, quadratic and degree-9 polynomial) that attempt to capture
an underlying quadratic distribution. The linear model cannot properly capture
the curvature in the data distribution simply because it does not have enough capacity.
Fig. 2.4.: Example of a regression problem where (from left to right) a linear model,
a quadratic model and a degree-9 polynomial model attempt to capture an
underlying quadratic distribution. Figure from Ref. [41].
So far, the basic principles of machine learning have been presented, along with
the essential stages involved in the construction of an ML model. The focus will
now shift to the field of generative modelling, a particular unsupervised learning
task that lies at the core of this dissertation. In order to build reliable decision-
making AI systems, discriminative models alone are not sufficient. In addition to
requiring extensive labeled datasets, the availability of which may be limited (e.g.
in medical domains and fraud detection), discriminative models often lack semantic
understanding of their environment and also lack the aptitude to express uncertainty
concerning their decision making [105]. Generative models could potentially offer
a solution to these concerns. Additionally, they might make it easier to assess
the effectiveness of ML models. Indeed, classifiers frequently produce identical
results for a wide variety of inputs, making it challenging to determine exactly
what the model has learned [60]. Unlike discriminative models, which aim to model
a conditional probability distribution P (y|x) where y is a target value or label of
The following section presents the probabilistic framework (as introduced in Ref.
[60]) as well as a widely used optimization method for generative models. As
stated previously, the fundamental concept of generative modelling is that data
can be considered as samples drawn from a probability distribution x ∼ Pd (x). In
the case of generative modelling, the machine learning model fθ (x) represents an
estimation of the data distribution. It is parameterized by a set of parameters θ and
is often referred to as Pm (x) being the model distribution. The generative modelling
problem can be viewed as minimizing the discrepancy between the true data dis-
tribution, Pd (x), and the model distribution, Pm (x), so that they become as close
as possible. Statistical divergence offers a way to quantify the difference between
two distributions. A statistical divergence is a non-symmetric and non-negative
function D(P ||Q) : S × S → R+ with, as input, two distributions over the space of
possible distributions S. Generative modeling can be restated as an optimization
problem within the probabilistic framework, with the goal of minimizing a statistical
divergence expressing the difference between Pd (x) and Pm (x; θ) (with this notation
the dependence of the model distribution on the model parameter is explicit). The
The parameters θ_KL that minimize D_KL(P_d(x) || P_m(x; θ)) are given by:

θ_KL = arg min_θ D_KL(P_d(x) || P_m(x; θ))
     = arg min_θ E_{x∼P_d(x)}[log(P_d(x)) − log(P_m(x; θ))]   (2.5)

Note that in the second line of Equation 2.5, the term log(P_d(x)) does not affect the argument of the minimum and can therefore simply be omitted. Consider now the parameters θ_MLE that maximize the likelihood for a set of N data points:

θ_MLE = arg max_θ ∏_{i=1}^{N} P_m(x_i; θ)
      = arg max_θ ∑_{i=1}^{N} log(P_m(x_i; θ))
      = arg min_θ (1/N) ∑_{i=1}^{N} −log(P_m(x_i; θ))
      = θ_KL   (2.6)
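Equation 2.6 states that maximizing the likelihood is equivalent to minimizing the average negative log-likelihood, and hence, up to the constant entropy term, the KL divergence to the data distribution. The sketch below checks this numerically for a simple Bernoulli model; it is an illustration, not part of the thesis.

```python
import numpy as np

# Numerical check that the arg min of the average NLL coincides with
# the arg min of D_KL(P_d || P_m) for a two-outcome (Bernoulli) model.
rng = np.random.default_rng(4)
p_data = 0.3                                  # true P_d(x = 1)
samples = rng.random(100_000) < p_data        # draws from the data distribution

thetas = np.linspace(0.01, 0.99, 99)          # candidate model parameters P_m(x = 1)

def avg_nll(theta):
    return -np.mean(np.where(samples, np.log(theta), np.log(1.0 - theta)))

def kl(theta):
    return (p_data * np.log(p_data / theta)
            + (1.0 - p_data) * np.log((1.0 - p_data) / (1.0 - theta)))

best_nll = thetas[np.argmin([avg_nll(t) for t in thetas])]
best_kl = thetas[np.argmin([kl(t) for t in thetas])]
print("arg min NLL:", best_nll, " arg min KL:", best_kl)  # both close to 0.3
```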
This section provides a few important notes regarding the challenging task of
evaluating generative models’ performances. One of the primary objectives of a
generative model is to generate realistic samples from its probability distribution. A
rather instinctive method to evaluate the model’s performance is by assessing the
quality of the samples drawn from the model probability distribution. However, this
approach can be inadequate as the model may still produce high-quality samples
even when its performance is poor. This situation can arise when the model
overfits the training data and reproduces the training instances precisely. In that
case the generated samples will be of good quality despite the model’s poor ability to
generalize. It is even possible to simultaneously underfit and overfit while producing
good quality samples [41]. To illustrate this, consider a generative model trained on
an image dataset with two categories, such as tables and chairs. Assume that the
model only replicates the training images of tables and ignores the images of chairs.
Because it does not produce images that were not in the training data, the model
clearly overfits the training set and displays also severe underfitting since images of
chairs are completely missing from the generated samples.
2.2.3 Applications
People have long sought to comprehend the laws of nature and their impact on
the universe. The primary means of doing so was by following the reductionist
technique which attempts to explain the physical world in terms of ever-smaller
entities. However it has been realised that this approach cannot easily be extended
to the study of system with a large number of interacting degrees of freedom [1]. In
fact, the microscopic physics is, to a large extent, known. The properties of a system
(e.g. a material, a chemical system or a spin system) can, in theory, be determined
by the many-body wave function [22]:
ψ(x⃗_1, x⃗_2, ..., x⃗_N, t)   (3.1)
¹ The many-body Schrödinger equation contains the essential microscopic physics that governs
the macroscopic behavior of materials even if one could argue that this expression could be
further refined to take other relevant aspects into account, e.g. the spin, the presence of external
electromagnetic fields, et cetera [61].
The objective of quantum many-body physics is to understand the characteristics of
systems consisting of many interacting particles, that are subjected to the Schrödinger
Equation (see Equation 3.2) and to acquire a better understanding of the resulting
emergent properties of these systems [22]. Quantum many-body physics lies at the
basis of our understanding of nature and tries to provide answers to fundamental
questions in theoretical physics, but it is also key to the advancement of new
technologies [117]. Fields such as quantum computing, quantum optics, high-
energy physics, nuclear physics, condensed matter physics or quantum chemistry are
all in essence related to the many-body quantum problem. Finding and developing
methods to effectively solve this problem is therefore the Holy Grail in modern
research in physics.
S = −Tr[ρ_A log ρ_A] = −∑_i p_i log p_i = ∑_i λ_i e^{−λ_i}   (3.4)

where the eigenvalues p_i of the reduced density matrix ρ_A are written as p_i = e^{−λ_i} in terms of the entanglement spectrum {λ_i}.
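As a small numerical illustration of Equation 3.4 (a generic sketch, not taken from the thesis), the entanglement entropy of a random bipartite pure state can be computed from the singular values of the reshaped state vector, whose squares are the eigenvalues p_i of the reduced density matrix ρ_A.

```python
import numpy as np

# Entanglement entropy of a random bipartite pure state |psi> in C^dA (x) C^dB:
# reshape into a dA x dB matrix, take its singular values s_i, then
# p_i = s_i^2 are the eigenvalues of rho_A and S = -sum_i p_i log p_i.
rng = np.random.default_rng(5)
dA, dB = 4, 6
psi = rng.normal(size=dA * dB) + 1j * rng.normal(size=dA * dB)
psi /= np.linalg.norm(psi)

s = np.linalg.svd(psi.reshape(dA, dB), compute_uv=False)
p = s ** 2                                  # Schmidt weights, sum to 1
S = -np.sum(p * np.log(p))
print("entanglement entropy S =", S, " (maximum =", np.log(dA), ")")
```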
T_{α,β,γ}   (3.5)
Each index of the tensor corresponds to a specific dimension and can have different
values. Figure 3.1 shows the tensors in two distinct but equivalent representations
(in Figure 3.1a the tensor is represented by all its elements while in Figure 3.1b
it is represented in the diagrammatic representation.) Figure 3.1a is intended to
explicitly depict the fact that a tensor is a multi-dimensional data structure. By
convention, a tensor is represented in pictorial language by an arbitrary shape or
block (tensor shapes can have a certain meaning depending on the context) with
emerging lines, where each line corresponds to one of the tensor's indices.

Fig. 3.1.: Tensor T of rank 3 represented (a) by all its elements, (b) by the diagrammatic
representation.
An index that is shared by two tensors designates a contraction between them (i.e. a
sum over the values of that index), in pictorial language that is represented by a line
connecting two blocks. An example of a tensor contraction is the matrix product
shown in mathematical form in Equation 3.6. It can also be expressed using tensor
diagrams. This is shown in Equation 3.7.
C_{ik} = ∑_j A_{ij} B_{jk}   (3.6)

[Equation (3.7): the same contraction in diagrammatic notation, with the blocks A and B joined by a line for the shared index j and open lines i and k, equal to the block C with open lines i and k.]
In physics, contracted indices are commonly referred to as virtual indices, and their
dimension is referred to as the bond dimension. On the other hand, the physical
degrees of freedom are represented by external non-connected lines referred to as
physical indices. Numerically, tensors can be contracted by summing each of the con-
tracted indices. Nonetheless, a faster approach exists, which involves transforming
the problem into a matrix product [4]. The computational cost associated with a
contraction depends on the size of the contracted indices.
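The remark that contractions can be evaluated faster by recasting them as matrix products can be illustrated with a short numpy sketch (illustrative only): the pairwise contraction of Equation 3.6 is literally a matrix multiplication, and a higher-rank contraction can be reshaped into one.

```python
import numpy as np

rng = np.random.default_rng(6)

# Eq. 3.6: C_ik = sum_j A_ij B_jk is exactly a matrix product.
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C_sum = np.einsum('ij,jk->ik', A, B)       # explicit sum over the shared index j
C_mat = A @ B                              # the same contraction as a matrix product
print(np.allclose(C_sum, C_mat))           # True

# A rank-3 example: contracting T_abj with B_jk over j by reshaping
# (a, b, j) -> (a*b, j), multiplying, and reshaping back.
T = rng.normal(size=(2, 3, 4))
R_einsum = np.einsum('abj,jk->abk', T, B)
R_matmul = (T.reshape(2 * 3, 4) @ B).reshape(2, 3, 5)
print(np.allclose(R_einsum, R_matmul))     # True
```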
where each |ji ⟩ represents the basis of an individual particle with label i and with
dimension d (for simplicity, the dimensions of the individual particle’s basis are
assumed to be equal). The coefficients corresponding to the different dN config-
urations can be represented using the coefficient tensor Ψj1 ,j2 ,...,jN that contains
dN parameters. This tensor has N d-dimensional entries and fully determines the
wavefunction of the quantum system. As mentioned above, the exponentially large
number of parameters needed to fully describe the system makes the system in-
tractable. The solution proposed by tensor network approaches is to represent and
approximate the coefficient tensor Ψj1 ,j2 ,...,jN by a network of interconnected lower-
rank tensors with a given structure. The structure of a tensor network is determined
by the entanglement patterns between the local degrees of freedom within the
quantum system it represents. Figure 3.2 illustrates how a high-dimensional tensor
(e.g. the coefficient tensor Ψ_{j_1,j_2,...,j_N}) can be decomposed as a tensor network. The
structure of the tensor network shown here is meant to be "generic". More useful
TN architectures will be presented in the following sections.
Fig. 3.2.: The coefficient of a quantum many-body state (top) can be represented in the
form of a high-dimensional tensor (middle) with an exponentially large number of
parameters in the system size. The high-dimensional tensor can be expressed in a
network of interconnected tensors (bottom) that takes into account the structure
and the amount of entanglement in the quantum many-body state.
The notion of a tensor train (TT) is equivalent to that of a matrix product state (MPS); it was initially
introduced in a more mathematical framework [83].
where Σ_{{j_i}} denotes Σ_{j_1=1}^{d} ⋯ Σ_{j_N=1}^{d}. The matrix Σ is a diagonal matrix with the singular
values on the diagonal. In the last line of Equation 3.9 the singular values are
absorbed into the matrix B. This results in the new tensor Ψ′. This can be expressed
using diagrammatic notation as follows:

[Equation 3.10: diagrammatic form of this first SVD step: the coefficient tensor Ψ is split into A^1, Σ and B^1 over the new bond index α_1, after which Σ and B^1 are merged into the tensor Ψ′; the physical indices j_1, j_2, ..., j_N remain open.]
By repeating the procedure and isolating the next lattice site, the newly created
tensor Ψ′ can be transformed into a matrix:

Ψ′_{(α_1 j_2);(j_3 ... j_N)}    (3.11)
which can in turn be decomposed via SVD. In fact, by repeating this procedure for
all lattice sites it follows
|ψ⟩ = Σ_{{j_i}} Σ_{{α_i}} A^1_{j_1 α_1} A^2_{α_1 j_2 α_2} A^3_{α_2 j_3 α_3} ⋯ A^N_{α_{N−1} j_N} |j_1 j_2 ... j_N⟩    (3.12)
where the sum over {j_i} runs over the physical degrees of freedom and the sums over the individual values α_n run from 1 to D_n, with

D_n = min( dim ⊗_{i=1}^{n} H_i , dim ⊗_{i=n+1}^{N} H_i ) = d^{min(n, N−n)}    (3.13)

In diagrammatic notation, the resulting state with open boundary conditions reads

[Equation 3.14: diagrammatic representation of the open-boundary MPS: a chain of tensors A^1, A^2, ..., A^N connected by virtual bonds, each carrying one open physical index j_1, ..., j_N.]
The construction can also be extended to a state with periodic boundary condition,
which gives
[Equation 3.15: diagrammatic representation of the MPS with periodic boundary conditions; an additional virtual bond connects A^N back to A^1.]
The expressions in Equations 3.12 and 3.14 represent the Matrix Product State
ansatz and more specifically the left-canonical form of the MPS. Note that the MPS
is not unique (see below) and that there exists a more general way to construct it
via successive Schmidt decompositions [45, 37]. Alternatively, the MPS form of a
quantum state can also be obtained via a construction as one-dimensional projected
entangled pair states in the valence bond picture [45].
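The successive-SVD construction outlined above can be sketched in a few lines of numpy, assuming the full coefficient tensor is small enough to be stored in memory. The function name tensor_to_mps and the (D_left, d, D_right) storage convention are illustrative choices, not fixed by the text.

```python
import numpy as np

def tensor_to_mps(psi):
    """Decompose a rank-N coefficient tensor into a list of MPS tensors of shape
    (D_left, d, D_right) via successive (untruncated) SVDs, as described above.
    All tensors except the last one come out left-canonical."""
    dims = psi.shape
    tensors = []
    rest = psi.reshape(1, -1)                      # trivial virtual bond on the left
    for d in dims[:-1]:
        D_left = rest.shape[0]
        U, S, Vh = np.linalg.svd(rest.reshape(D_left * d, -1), full_matrices=False)
        tensors.append(U.reshape(D_left, d, -1))   # isometry -> left-canonical tensor
        rest = np.diag(S) @ Vh                     # singular values pushed to the right
    tensors.append(rest.reshape(rest.shape[0], dims[-1], 1))
    return tensors

# Example: a random 5-site state with physical dimension 2.
psi = np.random.default_rng(1).normal(size=(2,) * 5)
mps = tensor_to_mps(psi)
```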
The Matrix Product State provides an exact representation of the rank-N tensor.
However, one might wonder about the usefulness of the MPS expression in Equation
3.12; and rightly so, as the number of parameters is the same as for the coefficient
tensor Ψ. Upon careful analysis of the dimensionality of the virtual bond indices (i.e.
αn ), it becomes apparent that the bond dimension grows exponentially towards the
middle of the tensor sequence, meaning that the exponential scaling problem of the
system has not been solved yet. It has merely been cast in a more complicated form.
A potential solution emerges from this predicament; the SVD can be truncated. The
number of parameters contained in the MPS can be drastically reduced by limiting
the bond dimension of the MPS to a value D. Providing an upper limit for the
bond dimension, boils down to taking at most the D largest singular values in the
SVD decomposition at each step of the procedure depicted above. By doing so, the
maximum number of parameters that the MPS can contain is N p D^2. The choice of D thus controls the trade-off between the accuracy of the approximation and the computational cost.
It is clear that this approximation is only applicable in cases where the truncated
singular values are negligible. Fortunately, the entanglement entropy of ground
states in physical systems exhibits an area-law behavior, as previously described.
This limited amount of entanglement for gapped ground states results in highly
constrained systems, in which a part of the Schmidt weights (i.e. the singular
values) tend to vanish [14]. Thanks to that, Matrix Product States can faithfully
represent one-dimensional area-law states (i.e. gapped ground states) with only
O(N ) parameters [109].
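A minimal sketch of the truncation step, keeping at most the D largest singular values of a matrix and reporting the discarded weight; the function name and return convention are illustrative.

```python
import numpy as np

def truncated_svd(mat, D):
    """SVD of mat keeping at most the D largest singular values; returns the
    truncated factors and the discarded weight (sum of squared dropped values)."""
    U, S, Vh = np.linalg.svd(mat, full_matrices=False)
    keep = min(D, S.size)
    discarded = float(np.sum(S[keep:] ** 2))
    return U[:, :keep], S[:keep], Vh[:keep, :], discarded
```

The discarded weight equals the squared Frobenius-norm error of the best rank-D approximation, which is why the approximation is faithful whenever the dropped singular values are small.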
As previously stated, MPS are not unique, which means that a quantum state
can have multiple MPS representations. In order to illustrate this, consider the
subsequent local transformation used to change the tensors entries:
A^i_{α_{i−1} j_i α_i} → [X^{i−1}]^{−1}_{α_{i−1} β_{i−1}} A^i_{β_{i−1} j_i β_i} X^i_{β_i α_i}    (3.16)
where the Einstein notation was used to imply the summation over the repeated
indices. The matrices X^i only need to satisfy X^i_{α_i β_i} [X^i]^{−1}_{β_i γ_i} = δ_{α_i γ_i}. Substituting
the above transformation in Equation 3.15 boils down to inserting an identity matrix
between each site, which yields:
|ψ⟩ = Σ_{{j_i}} Tr[ (X^N)^{−1} A^1_{j_1} X^1 (X^1)^{−1} A^2_{j_2} X^2 ⋯ (X^{N−1})^{−1} A^N_{j_N} X^N ] |j_1 j_2 ... j_N⟩    (3.17)
The above MPS is specified to have periodic boundary conditions. In the case of open
boundary conditions, the trace operation is not needed and D_0 = D_N = 1.
Consequently, it can be observed that the tensors Ai are transformed while the
quantum states remain unaltered. This ability to modify the MPS parametrization
without affecting the quantum state is known as the gauge freedom. There exist canonical forms of the MPS that fix (part of) this gauge freedom. The tensors A^i can be chosen to satisfy

Σ_{j_i} (A^{i j_i})^† (A^{i j_i}) = I    (3.18)

[Equation 3.19: diagrammatic representation of this left-canonical condition: contracting a tensor with its conjugate over the physical and left virtual indices yields the identity on the remaining virtual index.]
for all lattice sites except the rightmost one. This is known as the left canonical
form of the MPS. Clearly, by decomposing a generic rank-N tensor by applying the
SVD for each lattice site i = 1, ..., (N − 1) (procedure outlined above), the resulting
tensors Ai (as defined in Appendix A) conform to Equation 3.18. The tensors Ai are
then said to be in their left canonical form. The tensors Ai can also be chosen to
satisfy
Σ_{j_i} (A^{i j_i})(A^{i j_i})^† = I_{α_i}    (3.20)
[Equation 3.21: diagrammatic representation of the right-canonical condition: contracting a tensor with its conjugate over the physical and right virtual indices yields the identity on the remaining virtual index.]
for all lattice sites except the leftmost one. This form is known as the right canonical
form. Finally, a mixed-canonical form can be introduced with respect to a lattice site
i. This form is obtained by left-canonicalising the tensors on the left of the lattice
site i and right-canonicalising the tensors on the right while keeping the tensor
corresponding to the site i neither left- nor right-canonical. By using the canonical
forms of the Matrix Product State (MPS), it is possible to make the computation
time of certain operations independent of the number of lattice sites N . This leads
to a significant reduction in computational cost, which is why the canonical forms
are widely used in the numerical implementation of tensor network methods.
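A possible left-canonicalization sweep is sketched below, assuming MPS tensors stored as numpy arrays of shape (D_left, d, D_right); the function name is an illustrative choice. The remainder of each SVD is absorbed into the next tensor, so the overall norm of the state ends up in the rightmost tensor.

```python
import numpy as np

def left_canonicalize(mps):
    """Sweep from left to right, replacing each tensor (shape (D_l, d, D_r)) by the
    isometric part of its SVD and absorbing the remainder into the next tensor."""
    mps = [A.copy() for A in mps]
    for k in range(len(mps) - 1):
        D_l, d, D_r = mps[k].shape
        U, S, Vh = np.linalg.svd(mps[k].reshape(D_l * d, D_r), full_matrices=False)
        mps[k] = U.reshape(D_l, d, -1)                          # left-canonical tensor
        mps[k + 1] = np.einsum('ab,bjc->ajc', np.diag(S) @ Vh, mps[k + 1])
    return mps
```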
⟨Ψ|Ψ⟩ = [Equation 3.22: diagrammatic contraction of the MPS with its conjugate. For a left-canonical MPS, contracting site by site from the left, each pair A^i, (A^i)^* reduces to the identity, so that in the end only the contraction of A^N with its conjugate remains.]
It is clear from Equation 3.22 that the norm of an MPS in left-canonical form
is encapsulated in the rightmost tensor AN and that the computation becomes
independent of N . Analogous results are obtained when using the right- or mixed-
canonical forms.
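The statement can be checked numerically with the following sketch, reusing the (D_left, d, D_right) storage convention assumed above: the full left-to-right contraction of ⟨Ψ|Ψ⟩ and the contraction of the rightmost tensor alone agree for a left-canonical MPS, and the latter has a cost independent of N.

```python
import numpy as np

def norm_squared_full(mps):
    """<Psi|Psi> by contracting the MPS with its conjugate from left to right."""
    E = np.ones((1, 1))
    for A in mps:
        E = np.einsum('ab,aic,bid->cd', E, A.conj(), A)
    return E[0, 0].real

def norm_squared_left_canonical(mps):
    """For a left-canonical MPS the same quantity only involves the last tensor."""
    A = mps[-1]
    return np.einsum('aib,aib->', A.conj(), A).real
```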
In quantum mechanics, operators can be seen as linear maps from one Hilbert space
onto itself. Moreover, they can be expressed in the same TN-based framework as
quantum states [117]. Consider a two-site local operator acting on the sites i and
i + 1:
[Equation 3.23: diagrammatic representation of the two-site operator Ô acting on sites i and i+1.]

Its expectation value is evaluated by sandwiching the operator between the MPS and its conjugate:

[Equation 3.24: diagrammatic evaluation of ⟨ψ|Ô|ψ⟩; by exploiting the appropriate mixed-canonical form, the contraction reduces to the tensors A^i and A^{i+1} on the two sites on which Ô acts.]
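A minimal sketch of this reduction, assuming a normalized MPS in mixed-canonical form centred on the two sites on which the operator acts; the index convention of the operator array is an illustrative assumption.

```python
import numpy as np

def two_site_expectation(A_k, A_kp1, O):
    """<psi|O|psi> for an operator O acting on sites k and k+1, assuming a normalized
    MPS in mixed-canonical form centred on these two sites.
    A_k: (D_l, d, D), A_kp1: (D, d, D_r); O: (d, d, d, d) ordered as
    (bra_k, bra_kp1, ket_k, ket_kp1) -- an illustrative index convention."""
    theta = np.einsum('asb,btc->astc', A_k, A_kp1)       # merged two-site tensor
    return np.einsum('apqc,pqst,astc->', theta.conj(), O, theta).real
```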
So far, it has been demonstrated how general rank-N tensors can be decomposed
into tensor trains of lower rank tensors. In Section 3.3.3, a method for efficiently
extracting information from a Matrix Product State (MPS) was presented, under
the assumption that the MPS representation was readily accessible. However, when
considering a generic physical system that is subjected to a certain Hamiltonian
H, the ground state and thus its MPS representation is not known beforehand.
The relevant question is therefore: how can a suitable MPS representation of a
physical system's ground state be found? This section describes a variational method
used to find MPS that accurately represent such a state. The goal of variational
methods is to obtain an approximation of the ground states of physical systems. The
approach consists of choosing an initial parametrization class for the wavefunction
|Ψ⟩ and then finding the parameter values that minimize the expectation value of
the energy.
[Equation 3.27: the objective function ⟨Ψ|Ĥ|Ψ⟩ − λ⟨Ψ|Ψ⟩ written in diagrammatic form, with the MPS tensors appearing in both the energy and the norm terms.]
The high non-linearity of the problem resulting from the fact that the parameters
of the wavefunction (i.e. the parameters of the tensors Ak , ∀k) appear in the form
of products makes the problem rather difficult to solve [96]. One solution to this
predicament is to proceed in an iterative manner. Specifically, all tensors except
for the one located at site k are kept constant. The variational parameters of the
wavefunction are now only encapsulated in the tensor Ak which reduces Equation
3.27 to a quadratic form, for which the determination of the extremum is a simpler
linear algebra problem. The resulting wavefunction leads to a decrease in energy.
However, it is clear that this state is not the optimal approximation within the class
under consideration. Therefore, the parameters of another tensor are modified in
a similar way to obtain a state with a lower energy. This technique is performed
several times for each tensor of the MPS, by sweeping back and forth, until the
energy reaches a point of convergence.
It can be shown that the correlation functions of MPS decay exponentially [82].
However, many ground states of gapless quantum systems in one dimension exhibit a
power-law decay of the correlations [14]. The implication that can be drawn from
this observation is that MPS, although being very useful for representing low energy
states of one-dimensional gapped systems, might be inadequate to accurately and
efficiently represent some other relevant quantum states. Fortunately, there are
other Tensor Network architectures that possess other characteristics. The present
section is not intended to provide an in-depth analysis of alternative TN models;
instead, it introduces two other TN architectures: Tree Tensor Networks (TTN) and
Projected Entangled Pair States (PEPS).
There exists a significant synergy between the fields of physics and machine learning,
wherein each has a substantial influence on the other. On the one hand, machine
learning has emerged as an important tool in physics and other related fields. As
previously mentioned it is used in astronomy to mine large datasets and unearth
novel information from them [6]. ML algorithms have also shown potential to
enhance the discovery of new materials [40] and they have been successfully
employed in the classification of different phases of matter [16]. Another example
is the utilization of neural network quantum states (NQS), which are used to
describe quantum systems in terms of artificial neural networks (ANN) [117, 55].
On the other hand, ideas from the field of physics can serve as inspiration for the
developments of new, physics-inspired, learning schemes. Examples of this are the
Hopfield model [48] and energy-based models [105].
Similarly, the last decade has witnessed a notable interest towards the use of tensor
networks in the realm of machine learning. Notably, tensor networks have been
employed for various learning tasks, such as classification [103, 81, 38, 19], for
certain unsupervised learning problems [64, 102], for the compression of neural
networks [80, 87], for reinforcement learning [118, 72] and also to model the
underlying distribution of datasets [46, 18, 116, 39, 101].
Tensor networks have also been proposed as a potential framework for the implementation
of both discriminative and generative tasks on near-term quantum computers
[50, 121, 25, 23]. In the foreseeable future, near-term quantum devices, often
referred to as Noisy-Intermediate-Scale Quantum (NISQ) technology, are expected
to become accessible, featuring an increased number of qubits ranging from 50 to
100 [86]. However, the presence of noise in quantum gates will impose restrictions
on the scale of quantum circuits that can be effectively executed [86]. Machine
learning is a promising application of quantum computing: it is in fact resilient to
noise, which would allow implementation on near-term quantum devices without
the need for error correction [50]. The primary obstacle lies in the fact that real
datasets often possess a high number of features, necessitating a substantial number
of qubits for representation. Ref. [50] illustrates the potential of tensor networks for
quantum machine learning by presenting qubit-efficient quantum circuits for the
implementation of ML algorithms whereby the required quantity of physical qubits
scales logarithmically with, or independently of, the input data sizes.
P(v) = |Ψ(v)|² / Z    (4.1)

Ψ(v_1, v_2, ..., v_N) = [diagrammatic MPS contraction of the tensors A^1, A^2, ..., A^N with the physical indices fixed to v_1, v_2, ..., v_N]    (4.2)
It is also possible to explicitly model the probability distribution using an MPS with
non-negative tensor elements as opposed to describing it as the square of a matrix
product state. Born machines, however, have been suggested to be more expressive
than conventional probability functions [107, 18, 17].
[Equation 4.3: diagrammatic representation of a probability distribution modeled directly by an MPS A^1 A^2 ... A^N with non-negative tensor elements.]
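Equations 4.1 and 4.2 can be sketched as follows for an open-boundary MPS stored as numpy arrays of shape (D_left, p, D_right); the helper names amplitude, partition_function and born_probability are illustrative.

```python
import numpy as np

def amplitude(mps, v):
    """Psi(v) for an open-boundary MPS (tensors of shape (D_l, p, D_r)) and an
    integer-encoded configuration v, cf. Equation 4.2."""
    m = np.ones((1, 1))
    for A, vk in zip(mps, v):
        m = m @ A[:, vk, :]
    return m[0, 0]

def partition_function(mps):
    """Z = sum_v |Psi(v)|^2, obtained by contracting the MPS with its conjugate."""
    E = np.ones((1, 1))
    for A in mps:
        E = np.einsum('ab,aic,bid->cd', E, A.conj(), A)
    return E[0, 0].real

def born_probability(mps, v):
    """P(v) of Equation 4.1."""
    return abs(amplitude(mps, v)) ** 2 / partition_function(mps)
```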
4.2.2 Optimization
After establishing the parameterization class of the model as the MPS form, the
objective is to identify the optimal MPS that produces a probability distribution that
closely matches the distribution of the given data. Consequently, the parameters of
the Matrix Product State are fine-tuned to minimize the Negative Log-Likelihood
(NLL) loss function:
L = −(1/|T|) Σ_{i=1}^{|T|} ln P(v_i) = ln Z − (1/|T|) Σ_{i=1}^{|T|} ln |Ψ(v_i)|²    (4.4)
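A direct, non-optimized evaluation of Equation 4.4, reusing the hypothetical amplitude and partition_function helpers from the previous sketch:

```python
import numpy as np

def negative_log_likelihood(mps, training_set):
    """Equation 4.4 evaluated directly on a batch of integer-encoded configurations."""
    log_z = np.log(partition_function(mps))
    log_psi_sq = [np.log(abs(amplitude(mps, v)) ** 2) for v in training_set]
    return log_z - float(np.mean(log_psi_sq))
```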
The learning process starts by initializing the MPS with low bond dimensions (typ-
ically Dk = 2 except for the right- and left most tensors) and with random tensor
elements. As discussed in Ref. [107], the initialization of the tensor elements can
have an impact on the performance of the model. Following the initialization, the MPS is brought into left-canonical form by successive SVDs:

[Equation 4.5: the pair of neighbouring tensors A^k, A^{k+1} is contracted and decomposed via SVD into Ũ, Σ̃ and Ṽ; Ũ becomes the new (left-canonical) tensor at site k, while Σ̃ and Ṽ are absorbed into the tensor at site k+1.]
The left-most tensor in the last line of Equation 4.5 is left-canonical (represented
in turquoise) as the matrices U and V of the SVD are, by definition, orthogonal
matrices (see Equation A.4). The diagonal matrix Σ that contains the singular values
is then merged with the right-most tensor. The left-canonicalization of the MPS is
achieved by iteratively applying the aforementioned procedure to each tensor in the
MPS, beginning with the leftmost one and progressing towards the right.
In the two-sites DMRG-like algorithm the parameters of the MPS are fine-tuned by
sweeping back and forth through the MPS and by using gradient descent to optimize
iteratively the parameters of two adjacent tensors. At each iteration, the initial step
involves combining two neighboring tensors A^k and A^{k+1} to form an order-4 merged tensor A^{k,k+1}.
The parameters of the merged tensor Ak,k+1 are then updated to minimize the loss
function given in Equation 4.4. To this end, the gradient of the loss function with
respect to the element of a merged tensor can be computed and is given by [46]:

∂L/∂A^{(k,k+1) j_k j_{k+1}}_{α_{k−1} α_{k+1}} = Z′/Z − (2/|T|) Σ_{i=1}^{|T|} Ψ′(v_i)/Ψ(v_i)    (4.7)

where Ψ′(v_i) represents the derivative of the MPS with respect to the merged tensor
A^{k,k+1}. It is known that the derivative of a linear function of tensors, defined through
a tensor network structure, with respect to a specific tensor A, is given by the tensor
network where the tensor A has been removed [74], whence
[Equation 4.8: diagrammatic form of Ψ′(v): the MPS contracted with the input configuration v, with the merged tensor A^{k,k+1} removed; the physical indices v_k and v_{k+1} of the removed tensor are connected to open indices w_k and w_{k+1}.]
The vertical connections of wk , vk and wk+1 , vk+1 stand for δwk ,vk and δwk+1 ,vk+1
respectively. This means that in the second term of Equation 4.7, where a sum
is carried over all training samples, only the input data vi with a certain pat-
tern that contains vk vk+1 will contribute to the gradient with respect to the ten-
sor elements A(k,k+1)vk vk+1 [46]. The first term of Equation 4.7 contains Z and
Z′ = 2 Σ_{v∈V} Ψ′(v)Ψ(v). Even though Z and Z′ contain a summation over an expo-
nentially large number of terms, both can be efficiently computed by leveraging the
adequate canonical form. Considering the case where 1 < k < N − 1, Z′ can be
efficiently computed by leveraging the mixed-canonical form:
[Equation 4.9: in the mixed-canonical form centred on sites k and k+1 the environment of the merged tensor contracts to the identity, so that Z′ reduces to (twice) the merged tensor A^{k,k+1} itself.]

The merged tensor is then updated with a gradient-descent step,

Ã^{k,k+1} = A^{k,k+1} − γ ∂L/∂A^{k,k+1}    (4.10)
where Ã^{k,k+1} stands for the updated merged tensor and γ is the learning rate. In
order to retrieve the MPS form, the updated merged tensor is decomposed
by applying a truncated SVD:

[Equation 4.11: diagrammatic truncated SVD of the updated merged tensor into Ũ, Σ̃ and Ṽ.]
where the matrix Σ̃ only contains the singular values whose ratio to the largest
one is greater than or equal to a prescribed cutoff value ϵ_cut. All the other singular
values, along with the corresponding rows and columns of U and V , are disregarded.
Ultimately, the dimension of the virtual bond is determined by the number of singular
values that are retained. The bond dimension tends to grow as the optimization
process captures the correlations present in the training data. To control this growth,
a hyperparameter called Dmax is defined, representing the maximum allowable bond
dimension within the MPS. The result is that, even if the ratio of certain singular
values exceeds ϵcut , they can still be disregarded to prevent an overly large bond
dimension, thereby effectively managing the model complexity. Depending on the
direction in which the sweeping process is happening, the matrix Σ̃ will be merged
with the matrix U or V . In fact if the next bond to train is the (k − 1)-th bond
(i.e. if the optimization of the MPS proceeds to the left), then Ak+1 = V will be
right-canonical and Ak = U Σ̃, whereas if the next bond is the (k + 1)-th bond (i.e.
if the optimization of the MPS proceeds to the right) Ak = U and Ak+1 = Σ̃V .
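The splitting step can be sketched as follows; the function name, the eps_cut/D_max argument names and the going_right flag are illustrative, but the truncation rule and the direction-dependent absorption of Σ̃ follow the description above.

```python
import numpy as np

def split_merged_tensor(A_merged, eps_cut, D_max, going_right):
    """Split an updated merged tensor of shape (D_l, p, p, D_r) back into two site
    tensors with a truncated SVD. Singular values whose ratio to the largest one
    falls below eps_cut are dropped, and at most D_max values are kept."""
    D_l, p, _, D_r = A_merged.shape
    U, S, Vh = np.linalg.svd(A_merged.reshape(D_l * p, p * D_r), full_matrices=False)
    keep = max(1, min(D_max, int(np.sum(S / S[0] >= eps_cut))))
    U, S, Vh = U[:, :keep], S[:keep], Vh[:keep, :]
    if going_right:                 # next bond to train lies to the right
        A_k = U.reshape(D_l, p, keep)                    # left-canonical
        A_kp1 = (np.diag(S) @ Vh).reshape(keep, p, D_r)
    else:                           # next bond to train lies to the left
        A_k = (U @ np.diag(S)).reshape(D_l, p, keep)
        A_kp1 = Vh.reshape(keep, p, D_r)                 # right-canonical
    return A_k, A_kp1
```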
As previously stated, the learning process, which involves optimizing the MPS
with respect to the NLL loss function, follows an iterative procedure. The described
process of merging, updating, and decomposing adjacent tensors is repeated multiple
times, in a sweeping fashion that alternates back and forth. Each iteration starts
from the far-right end (i.e. k = N − 1) of the MPS, where the two rightmost
tensors are considered, and then proceeds towards the leftmost tensor (i.e. k = 1).
Subsequently, the process reverses direction and moves back to the rightmost part of
the MPS. Throughout the sweeping process, the MPS is consistently maintained in a
canonical form, either a mixed-, left-, or right-canonical form, ensuring thereby an
efficient computation of the gradient in each step.
The generation of new samples is a difficult task for traditional generative models
as it often involves dealing with the intractability of the partition function. For
instance, energy-based models like Boltzmann machines often rely on Markov Chain
Monte Carlo (MCMC) methods to generate new samples [31]. MCMC methods can
produce sequences of configurations s → s′ → s′′ → ... by starting from an initial
configuration s = (s1 , s2 , ..., sN ), sweeping through it and performing the changes
si → s′i according to a certain probability, known as the Metropolis probability
[30]. The produced configurations are, in general, correlated and it requires a
certain number of sweeps τ between two successive configurations in order for
them to be independent from each other. The value of τ is often referred to as
the autocorrelation time. Furthermore, to ensure that the first configuration s is
correctly drawn from the correct probability distribution, τ′ sweeps need to be
performed on an initially random configuration. In this case τ′ is referred to as the
equilibration time [30]. In some cases, generating independent samples becomes
computationally expensive and in machine learning, this problem is often referred
to as the slow-mixing problem [41].
P_m(v_N) = [contraction of the MPS A^1 A^2 ... A^N with its conjugate, with all physical indices summed over except the N-th, which is fixed to the one-hot encoded value v_N] = [for a left-canonical MPS, the same quantity computed from the tensor A^N alone]    (4.12)
where the MPS was assumed to be normalized. The orange circles represent the one-hot
encoded value of vN which can take p different values (i.e. vn ∈ {1, 2, ..., p} , ∀n ∈
{1, 2, ..., N }). From a practical point of view, Pm (vN ) is computed for vN = 1, 2, ..., p
and the value of vN is then drawn from these probabilities. Once the N -th feature
is sampled, one can then move on to the (N − 1)-th feature. More generally,
given the values of vk , vk+1 , ...vN , the (k − 1)-th feature can be sampled from the
one-dimensional conditional probability:
P(v_{k−1} | v_k, ..., v_N) = [contraction of A^{k−1}, with its physical index fixed to v_{k−1}, with the partial contraction X_k of the sites k, ..., N whose physical indices are fixed to the already sampled values, normalized by the contraction of X_k with its conjugate]    (4.13)
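The sequential generation procedure of Equations 4.12 and 4.13 can be sketched as below for a left-canonical, normalized MPS; the vector X plays the role of the partial contraction X_k in the text, and the function name is an illustrative choice.

```python
import numpy as np

def sample(mps, rng):
    """Draw one configuration from a left-canonical, normalized MPS Born machine,
    sampling the features from site N down to site 1 (cf. Equations 4.12 and 4.13)."""
    n, p = len(mps), mps[0].shape[1]
    v = np.empty(n, dtype=int)
    X = np.ones(1)                       # partial contraction of the already sampled sites
    for k in reversed(range(n)):
        # unnormalized one-dimensional conditional probabilities P(v_k | v_{k+1}, ..., v_N)
        probs = np.array([np.linalg.norm(mps[k][:, s, :] @ X) ** 2 for s in range(p)])
        v[k] = rng.choice(p, p=probs / probs.sum())
        X = mps[k][:, v[k], :] @ X       # extend the partial contraction
    return v

# Example usage: rng = np.random.default_rng(0); configuration = sample(mps, rng)
```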
Natural Language Processing (NLP) is a field that lies at the crossroads between
computer science, artificial intelligence and linguistics. Its aim is to enable machines
and computers to understand, interpret and generate languages, both in written
and spoken formats, in a manner akin to human beings. NLP integrates the prin-
ciples of computational linguistics, which involve rule-based modeling of human
language, with statistical and machine learning methodologies [51]. When dealing
with NLP tasks, one faces important challenges arising from the intricate nature of
human languages. These languages are filled with inherent ambiguities and intrica-
cies. While individuals acquire an understanding of the subtle nuances of natural
language through the formative years of their childhood and teenage years, NLP
models must efficiently assimilate and represent the intricate structure and nuanced
characteristics of natural language from the start in order to effectively execute NLP
tasks [51]. Commonly encountered tasks within the field of NLP encompass an array
of applications, including but not limited to, speech recognition [33], which entails
converting spoken data into textual format; sentiment analysis [70], which aims to
capture emotions and subjective qualities from textual content; caption generation
for images [122], which involves producing descriptive textual explanations for
pictures; and text generation [69], which encompasses the generation of textual
content.
In this dissertation the focus lies on capturing the underlying distributions in textual
data for the purpose of text generation and language recognition. As previously
indicated, MPS are not well-suited for analyzing two-dimensional datasets. How-
ever, considering the inherent one-dimensional nature of language, it suggests the
potential applicability of MPS in this context. Text generation can be used to pro-
duce coherent linguistic constructions, ranging from individual sentences to entire
documents. Furthermore, the trained models can be leveraged for predictive text
applications, where the objective is to anticipate the subsequent words within a
given sequence of words. Predictive text methodologies have gained substantial
interest, mainly finding application in touchscreen keyboards and email services.
Popular methods for the generation of text are based on Generative Adversarial Net-
works (GAN) [91] or Recurrent Neural Networks (RNN) [52]. In recent years a
cutting-edge approach has been presented to build sequence transduction models,
which is based on the transformer architecture, introduced in the highly influential
paper "Attention is all you need" (Ref. [108]). Transformers rely on the concept
of self-attention and serve as the fundamental framework in the renowned Gen-
erative Pretrained Transformers (GPTs). One notable advantage of Transformers,
distinguishing them from competing models, lies in their parallelizability, allowing
for faster and more efficient training [108].
The dataset used to train the MPS-based models was obtained via Project
Gutenberg (which makes books freely available on www.gutenberg.org) and consists of
50 different books written in English, randomly selected from the database. To
render the data usable, a preprocessing step is implemented wherein capital letters
are transformed into lowercase letters. The dataset is subsequently filtered to only
include the 26 letters of the alphabet and the ’space’ token, while excluding all other
special characters. All the text sections consisting of N characters that occur in the
books are then extracted from the text data. Each section is subsequently mapped to
a sequence of numerical values ranging from 0 to 26, achieved by replacing each
character with a corresponding numerical value. The overall goal is to capture the
inherent probability distribution of the text sections. In other words, the aim is to
model the frequency of occurrence associated with distinct text sections by means
of an N-site MPS-based model with a physical dimension of 27. In what follows,
MPS-based models will be trained for different values of N (ranging from 4 to 10).
It is important to note that the characteristics of the problem treated here set it
apart from typical problems solved with MPS in the field of physics. In physics,
scenarios often involve a large number of entities (represented by N ) with relatively
smaller physical dimensions (e.g., 2 for spin-1/2 systems).
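The preprocessing described above can be sketched as follows, assuming the space character is mapped to the last symbol of the 27-letter alphabet (the exact character-to-integer mapping used in the thesis is not specified here).

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz "     # 26 letters plus the 'space' token

def preprocess(text, N):
    """Lowercase the text, keep only the 27 allowed symbols, map each character to
    an integer in {0, ..., 26} and extract every contiguous section of N characters."""
    encoded = np.array([ALPHABET.index(c) for c in text.lower() if c in ALPHABET],
                       dtype=np.int64)
    return np.stack([encoded[i:i + N] for i in range(len(encoded) - N + 1)])

# Example: preprocess("An example Sentence!", 5) yields integer sequences of length 5.
```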
The first (and rather naive) idea was to optimize the Matrix Product States by
minimizing the Mean Squared Error (MSE) given by:
MSE = (1/|V|) Σ_{v∈V} (P_m(v) − P_d(v))²    (5.1)
where the summation runs over the whole Hilbert space of the problem (i.e. all
the possible data configuration) and where Pd (v) is stored in a rank-N tensor.
Performing a summation over the entire Hilbert space is computationally intractable.
Hence, to enhance the efficiency of the process, an attempt was made to decompose
the summation term into two components: the first one containing a summation over
the text sections present in the dataset, and the second one containing a summation
over the text sections that do not occur in the dataset. Performing the decomposition
of the summation and rearranging the terms yields:
MSE = (1/|V|) Σ_{v∈V_occ} [ (P_m(v) − P_d(v))² − (P_m(v))² ] + (1/|V|) Σ_{v∈V} (P_m(v))²    (5.2)
where Vocc stands for the subset of V which contains the occurring text sections.
While the contraction Σ_{v∈V} P_m(v) is easily obtainable by leveraging the canonical
form of the MPS, the last term in Equation 5.2 cannot be efficiently computed.
There are two solutions to this predicament; the first is to model the probability
distribution directly as an MPS (and not the wavefunction as is done in
Born machines) and the second is to use a Mean Absolute Error (MAE)-type
loss function instead of the MSE. The problem with the former is that the
non-negativity of the probability distribution is no longer guaranteed, and the
problem with the latter is that the MAE is not differentiable everywhere. In addition
to the aforementioned challenges, it is imperative to emphasize that, although not
directly acknowledged, the approach described above to address the problem is
fundamentally flawed. This is due to the reliance on the knowledge of P_d(v), which
can be stored in a rank-N tensor for small systems (i.e. systems with small values
of N). However, this methodology cannot be effectively scaled to handle longer
text sections and was therefore abandoned. The subsequent findings presented
in this dissertation have been achieved by optimizing the MPS using the Negative
Log-Likelihood (NLL) loss function. This particular approach is better aligned with the probabilistic nature of the MPS-based Born machine.
The optimization scheme used to optimize the MPS is the one resembling the two-site
DMRG presented in Section 4.2.2. The gradient of the NLL with respect to a merged
tensor is given by Equation 4.7, given here for the sake of clarity
∂L/∂A^{(k,k+1) j_k j_{k+1}}_{α_{k−1} α_{k+1}} = Z′/Z − (2/|T|) Σ_{i=1}^{|T|} Ψ′(v_i)/Ψ(v_i)    (5.3)
Computing the gradient for large datasets is computationally expensive. One solu-
tion to this predicament is to use the so-called Stochastic Gradient Descent (SGD)
approach as was put forward in Ref. [107]. In fact, when using SGD, the gradient is
evaluated using a mini-batch which represents a subset of samples drawn uniformly
from the training set [13]. At each iteration, the gradient descent (Equation 4.10)
is performed in multiple steps on the same merged tensor, where in each step a
different mini-batch is used for the computation of the gradient.
D_KL(P || Q) = Σ_x P(x) log [ P(x) / Q(x) ]    (5.4)
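For small systems where both distributions can be stored explicitly, Equation 5.4 can be evaluated directly; the small eps guarding against zero model probabilities is an implementation choice.

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """D_KL(P || Q) of Equation 5.4 for two distributions stored as arrays over the
    same (small) configuration space; terms with P(x) = 0 contribute zero."""
    P, Q = np.asarray(P, dtype=float).ravel(), np.asarray(Q, dtype=float).ravel()
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))
```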
In order to assess how similar the model distribution Pm represented by the MPS
is to the data distribution, the distribution log (Pd (x)/Pm (x)) can be plotted for
samples taken from the data distribution Pd (x). It is important to note that for small
systems, such as N = 5, the data distribution can be kept in memory and stored as a
rank-N tensor. Despite the computational expense and inefficiency associated with
this task, it allows for an easy assessment of the learning ability of the model and
forms a proof-of-concept for the approach used. Figure 5.2 shows the distribution
log (Pd (x)/Pm (x)) for models trained with different training set sizes (same MPS
as the one presented in Figure 5.1). When the probability distributions Pm (x) and
Pd (x) are similar, the distribution log (Pd (x)/Pm (x)) tends to have its peak value
around zero. This pattern is observed for the model trained with the largest training
set size. Conversely, the model trained with a smaller training set displays a peak
around 4, indicating a high dissimilarity between the modeled distribution Pm (x)
and the true distribution Pd (x). This discrepancy further validates that the MPS has
overfitted the training data. The presented results offer a validation of the employed
Fig. 5.2.: Histogram of the difference in log-likelihoods under distributions P (x) = Pd (x)
and Q(x) = Pm (x) for samples randomly drawn from the data distribution. The
results are shown for models trained with different training set size: |Tr| = 103 in
blue, |Tr| = 104 in orange and |Tr| = 105 in turquoise.
To address the issue raised above, other MPS-based models were trained with a
higher value of N. Considering the susceptibility of the MPS model to overfitting
[117], an early stopping procedure was incorporated during the training process
to mitigate this problem [123]. More specifically, the NLL was computed on a
separate validation set (distinct from the training set) at intervals during training.
The training was stopped as soon as an increase in validation error was detected,
thereby preventing the model from overfitting the training data. The MPS were trained for
different values of N and for different values of Dmax . The training and validations
sets systematically consisted of 0.75 × 105 and 105 samples, respectively. Figure 5.3
shows the lowest value of the validation error obtained for the different models
that were trained. It should be made clear that the value of the NLL obtained for
MPS-based models with different values of N should not be directly compared with
each other (the values are plotted in the same figure for the sake of compactness).
This is due to the fact that the models were trained on fundamentally different
training sets (i.e. each consisting of text sections of length N ). The results indicate
that increasing the maximum bond dimension (Dmax ) results in a reduction of the
validation error until reaching a plateau. At this point, further increase of Dmax
does not yield any significant improvement in the model’s performance. The extent
of improvement in validation error between Dmax = 10 and Dmax = 100 is more
pronounced for larger values of N . This can be attributed to the increased complexity
of modeling the probability distribution for higher N values, which requires a model
with a higher number of parameters. Additionally, it can be observed that for the
scenario where N = 5, a value of Dmax = 50 appears to be sufficient to obtain
satisfactory performance. Moreover, increasing Dmax beyond this point does not lead
to improved results, as evidenced by the rise in validation error when Dmax = 200.
For other values of N, a maximal bond dimension of 100 appears to be an optimal
choice.
Table 5.2 displays sentences that were generated using MPS models of different
sizes. For each model the maximum bond dimension is 100 and the text was
generated following a similar sequential generation approach as described previously.
The generated sentences reveal that models based on longer MPS still exhibit
limited abilities to generate coherent sentences. While all the models are capable of
generating some existing words, they also generate non-existent ones. Furthermore,
it is crucial to highlight that none of the models exhibit any level of semantic
understanding which is a fundamental requirement for a proficient text generator.
During training, the MPS adapts the bond dimension of the different bonds to
capture the correlations that exist within the training set. The problem in this case
is that the physical dimension is relatively high and the correlation within textual
data is significant. Thereby, the bond dimensions of the Matrix Product States
systematically attain the maximal allowed value of Dmax , except for the first and
last bond, which have a value equal to the physical dimension. This renders the
training of longer MPS more challenging. Furthermore, when the training set size
is too small, the resulting trained MPS may be sub-optimally trained, as was the
case for the 5-site MPS which was trained on 103 samples (see Figure 5.1). Longer
MPS might therefore require larger training sets but this would render the training
computationally even more expensive.
exists with textual data. Further research should be conducted to thoroughly explore
the complete potential of MPS-based generative models for text generation. This
could be accomplished through conducting more extensive and computationally
intensive simulations, incorporating a higher physical dimension and larger training
set sizes. While theoretically feasible, such simulations were deemed too demanding
to be conducted within the scope of this study. Comparing the performances of
models with different MPS sizes is also a delicate task because of the fundamental
differences that exist between the datasets on which those models are evaluated with
the NLL. It would also be interesting to investigate other generative models based on
other tensor networks structures. In fact, TTN were proposed for language design
tasks in Ref. [34]. The proposed model works on a word level and therefore requires
a high physical dimension. This implies that the proposed model would be practically
hard to train. Further research should incorporate a universal quantifiable measure
of the performance of a model that can be used for all models. This measure could
be inspired by the BLEU or ROUGE evaluation methods commonly used to evaluate
text generation performances, where the generated text is compared to a reference
text segment [63]. Additional studies should also focus on the development of
problem-specific datasets. For example, when tackling a task such as predicting
the most probable words based on an input sequence for email text, it might be
interesting to use a dataset tailored for that task.
(1/(S − N + 1)) Σ_{i=1}^{S−N+1} log P_language(v_i)    (5.5)
The classification process involves selecting the language for which the model gives
the highest average log-likelihood. To assess the model’s performance, the accuracy
is used as an evaluation metric. Accuracy is an appropriate measure in this case
because both the training and the test sets are perfectly balanced between the
different classes.
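The classification rule can be sketched as follows, reusing the hypothetical born_probability helper introduced earlier; models is assumed to map a language name to its trained MPS, segment is an integer-encoded character sequence, and the language keys in the usage example are placeholders.

```python
import numpy as np

def classify_language(models, segment, N):
    """Assign a text segment (an integer-encoded character sequence of length S >= N)
    to the language whose MPS gives the highest average log-likelihood over all
    length-N sections, cf. Equation 5.5."""
    sections = [segment[i:i + N] for i in range(len(segment) - N + 1)]
    scores = {lang: np.mean([np.log(born_probability(mps, s)) for s in sections])
              for lang, mps in models.items()}
    return max(scores, key=scores.get)

# Example usage with hypothetical trained models:
# language = classify_language({"lang_a": mps_a, "lang_b": mps_b, "lang_c": mps_c}, segment, N=5)
```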
Fig. 5.5.: Accuracy of the 5-site (a) and 4-site (b) MPS-based language recognition models
as a function of the number of characters in the text segments used for evaluation.
The results are shown for different values of the maximal bond dimension allowed.
Figure 5.6 shows the confusion matrices for the 4-site MPS model with Dmax = 10
(i.e. the model with smallest number of parameters) evaluated in the two extreme
cases where the text segments contain five characters (a), and 100 characters (b).
Figure 5.6 also illustrates the confusion matrices for the 5-site MPS model with
Dmax = 250 (i.e. the model with the highest number of parameters) evaluated
on text segment containing five characters (c), and 100 characters (d). All four
confusion matrices are diagonal, meaning that all models perform well. When
working with short text segments, the 4-site MPS model with fewer parameters
demonstrates slightly lower performance compared to the 5-site MPS model (as
mentioned above). However, for the longest text sections, the performance of the two
models is nearly identical, indicating that the model trained with the lowest number
of parameters achieves the optimal balance between efficiency and performance.
In further work, the model could be extended to include more languages. However,
it should be noted that a good language classifier should be able to make a decision
rapidly (e.g. language recognition on Google Translate). If more languages
are taken into account, this might become a problem seeing that the average
log-likelihood of the text segments must be computed for all MPS. For longer
text segments, this problem might (partially) be mitigated by only computing the
average log-likelihood on a sub-section (of about 50 to 60 characters) of the given
text segments. This can be motivated by the fact that the accuracy of the model
evaluated on text segments longer than 50/60 characters is not significantly better.
Fig. 5.6.: Confusion matrices shown for the 4-site MPS model with Dmax = 10 evaluated in
the two extreme cases where the text segments contain five characters (a), and
100 characters (b) and for the 5-site MPS model with Dmax = 250 evaluated on
text segment containing five characters (c), and 100 characters (d).
After having introduced the fundamental concepts of Machine Learning and generative
modelling in Chapter 2, an introduction to tensor networks within the context
of many-body physics was provided in Chapter 3. Chapter 4 then presented a
generative model based on a specific type of (one-dimensional) tensor network, the
matrix product state. As explained, the application of tensor network states to model
underlying data distributions is not unexpected, given the numerous similarities
between both fields.
the system under consideration. It might also be interesting to work with datasets
that are more specific to the task being addressed, for instance a predictive e-mail
text model should be trained on a dataset consisting of existing e-mails and not
on random books. Moreover, comparing MPS-based models of different sizes also
requires a more suited evaluation method where the generated text results could be
compared with (a) reference text segment(s).
Although unable to generate coherent and cohesive text, the MPS-based model was
successfully employed as language classifier. In Section 5.3, the model that was
presented consisted of three different MPS, each trained on a specific language. It
seems that, even with a low value of the maximal bond dimension, the MPS can learn
the distinct specificities of the different languages considered. The classification
process is done by calculating the average log-likelihood of given text segments and
selecting the language for which this value is the highest. The results obtained have
shown that the classifier has a high accuracy, especially when the text segments used
for evaluation are longer. Further developments on this topic should aim to include
a higher number of languages.
Bibliography
[1]Philip W Anderson. “More is different: broken symmetry and the nature of the
hierarchical structure of science.” In: Science 177.4047 (1972), pp. 393–396 (cit. on
p. 19).
[5]Nicholas M Ball and Robert J Brunner. “Data mining and machine learning in
astronomy”. In: International Journal of Modern Physics D 19.07 (2010), pp. 1049–
1106 (cit. on p. 4).
[7]Cynthia Beath, Irma Becerra-Fernandez, Jeanne Ross, and James Short. “Finding
value in the information explosion”. In: MIT Sloan Management Review (2012)
(cit. on p. 3).
[8]Mariana Belgiu and Lucian Drăguţ. “Random forest in remote sensing: A review of
applications and future directions”. In: ISPRS journal of photogrammetry and remote
sensing 114 (2016), pp. 24–31 (cit. on p. 8).
[10]Ben Bright Benuwa, Yong Zhao Zhan, Benjamin Ghansah, Dickson Keddy Wornyo,
and Frank Banaseka Kataka. “A review of deep machine learning”. In: International
Journal of Engineering Research in Africa 24 (2016), pp. 124–136 (cit. on p. 9).
[12]Sam Bond-Taylor, Adam Leach, Yang Long, and Chris G Willcocks. “Deep generative
modelling: A comparative review of vaes, gans, normalizing flows, energy-based
and autoregressive models”. In: IEEE transactions on pattern analysis and machine
intelligence (2021) (cit. on p. 6).
[13]Léon Bottou. “Large-scale machine learning with stochastic gradient descent”. In:
Proceedings of COMPSTAT’2010: 19th International Conference on Computational
StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers.
Springer. 2010, pp. 177–186 (cit. on p. 48).
[15]Henrik Bruus and Karsten Flensberg. Many-body quantum theory in condensed matter
physics: an introduction. OUP Oxford, 2004 (cit. on p. 20).
[16]Juan Carrasquilla and Roger G Melko. “Machine learning phases of matter”. In:
Nature Physics 13.5 (2017), pp. 431–434 (cit. on p. 35).
[17]Song Cheng, Jing Chen, and Lei Wang. “Information perspective to probabilistic
modeling: Boltzmann machines versus born machines”. In: Entropy 20.8 (2018),
p. 583 (cit. on p. 37).
[18]Song Cheng, Lei Wang, Tao Xiang, and Pan Zhang. “Tree tensor networks for
generative modeling”. In: Physical Review B 99.15 (2019), p. 155131 (cit. on pp. 35,
37, 38, 44).
[19]Song Cheng, Lei Wang, and Pan Zhang. “Supervised learning with projected entan-
gled pair states”. In: Physical Review B 103.12 (2021), p. 125117 (cit. on p. 35).
[20]J Ignacio Cirac, David Perez-Garcia, Norbert Schuch, and Frank Verstraete. “Matrix
product states and projected entangled pair states: Concepts, symmetries, theorems”.
In: Reviews of Modern Physics 93.4 (2021), p. 045003 (cit. on p. 34).
[21]J Ignacio Cirac and Frank Verstraete. “Renormalization and tensor product states
in spin chains and lattices”. In: Journal of physics a: mathematical and theoretical
42.50 (2009), p. 504004 (cit. on pp. 20, 21, 24).
[23]James Dborin, Fergus Barratt, Vinul Wimalaweera, Lewis Wright, and Andrew
G Green. “Matrix product state pre-training for quantum machine learning”. In:
Quantum Science and Technology 7.3 (2022), p. 035014 (cit. on p. 35).
[24]Li Deng. “The mnist database of handwritten digit images for machine learning
research”. In: IEEE Signal Processing Magazine 29.6 (2012), pp. 141–142 (cit. on
p. 17).
[25]Rohit Dilip, Yu-Jie Liu, Adam Smith, and Frank Pollmann. “Data compression for
quantum machine learning”. In: Physical Review Research 4.4 (2022), p. 043007
(cit. on p. 35).
[26]Jorge Dukelsky, Miguel A Martın-Delgado, Tomotoshi Nishino, and Germán Sierra.
“Equivalence of the variational matrix product method and the density matrix
renormalization group applied to spin chains”. In: Europhysics letters 43.4 (1998),
p. 457 (cit. on p. 33).
[27]Jens Eisert, Marcus Cramer, and Martin B Plenio. “Area laws for the entanglement
entropy-a review”. In: arXiv preprint arXiv:0808.3773 (2008) (cit. on p. 22).
[28]Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. “CAN:
Creative adversarial networks, generating “art” by learning about styles and deviating
from style norms”. In: arXiv preprint arXiv:1706.07068 (2017) (cit. on p. 13).
[30]Andrew J Ferris and Guifre Vidal. “Perfect sampling with unitary tensor networks”.
In: Physical Review B 85.16 (2012), p. 165146 (cit. on p. 42).
[31]Asja Fischer and Christian Igel. “An introduction to restricted Boltzmann machines”.
In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications:
17th Iberoamerican Congress, CIARP 2012, Buenos Aires, Argentina, September 3-6,
2012. Proceedings 17. Springer. 2012, pp. 14–36 (cit. on p. 42).
[35]Leon Gatys, Alexander S Ecker, and Matthias Bethge. “Texture synthesis using con-
volutional neural networks”. In: Advances in neural information processing systems
28 (2015) (cit. on p. 13).
[37]Raffael Gawatz. Matrix Product State Based Algorithms: For Ground States and
Dynamics. Niels Bohr Institute, Copenhagen University, 2017 (cit. on p. 27).
[38]Ivan Glasser, Nicola Pancotti, and J Ignacio Cirac. “From probabilistic graphical
models to generalized tensor networks for supervised learning”. In: IEEE Access 8
(2020), pp. 68169–68182 (cit. on p. 35).
[39]Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and Ignacio Cirac. “Expres-
sive power of tensor-network factorizations for probabilistic modeling”. In: Advances
in neural information processing systems 32 (2019) (cit. on p. 35).
[40]Rhys EA Goodall and Alpha A Lee. “Predicting materials properties without crystal
structure: Deep representation learning from stoichiometry”. In: Nature communica-
tions 11.1 (2020), p. 6280 (cit. on p. 35).
[41]Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press,
2016 (cit. on pp. 10–13, 15, 42).
[43]Gunst, Klaas. “Tree tensor networks in quantum chemistry”. eng. PhD thesis. Ghent
University, 2020, xvii, 171 (cit. on p. 75).
[44]Gaurav Gupta et al. “A self explanatory review of decision tree classifiers”. In:
International conference on recent advances and innovations in engineering (ICRAIE-
2014). IEEE. 2014, pp. 1–7 (cit. on p. 8).
[45]Jutho Haegeman. Strongly Correlated Quantum Systems. 2nd ed. Ghent, Belgium,
2020-2021 (cit. on pp. 21, 24, 27, 74).
[46]Zhao-Yu Han, Jun Wang, Heng Fan, Lei Wang, and Pan Zhang. “Unsupervised
generative modeling using matrix product states”. In: Physical Review X 8.3 (2018),
p. 031012 (cit. on pp. 2, 17, 35, 36, 38–40, 44).
[47]Markus Hauru, Maarten Van Damme, and Jutho Haegeman. “Riemannian optimiza-
tion of isometric tensor networks”. In: SciPost Physics 10.2 (2021), p. 040 (cit. on
p. 33).
[48]John J Hopfield. “Neural networks and physical systems with emergent collective
computational abilities.” In: Proceedings of the national academy of sciences 79.8
(1982), pp. 2554–2558 (cit. on p. 35).
[50]William Huggins, Piyush Patil, Bradley Mitchell, K Birgitta Whaley, and E Miles
Stoudenmire. “Towards quantum machine learning with tensor networks”. In:
Quantum Science and technology 4.2 (2019), p. 024001 (cit. on pp. 35, 36).
[51]IBM. What is Natural Language Processing? https://www.ibm.com/topics/natural-language-processing. Accessed: 2023-05-20 (cit. on p. 45).
[52]Touseef Iqbal and Shaima Qureshi. “The survey: Text generation models in deep
learning”. In: Journal of King Saud University-Computer and Information Sciences
34.6 (2022), pp. 2515–2528 (cit. on p. 46).
[53]Anil K Jain, Jianchang Mao, and K Moidin Mohiuddin. “Artificial neural networks:
A tutorial”. In: Computer 29.3 (1996), pp. 31–44 (cit. on p. 8).
[54]Anil K Jain, M Narasimha Murty, and Patrick J Flynn. “Data clustering: a review”.
In: ACM computing surveys (CSUR) 31.3 (1999), pp. 264–323 (cit. on p. 6).
[55]Zhih-Ahn Jia, Biao Yi, Rui Zhai, et al. “Quantum neural network states: A brief
review of methods and applications”. In: Advanced Quantum Technologies 2.7-8
(2019), p. 1800077 (cit. on p. 35).
[57]Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”.
In: arXiv preprint arXiv:1412.6980 (2014) (cit. on p. 10).
[59]Hans-Peter Kriegel, Peer Kröger, Jörg Sander, and Arthur Zimek. “Density-based
clustering”. In: Wiley interdisciplinary reviews: data mining and knowledge discovery
1.3 (2011), pp. 231–240 (cit. on p. 8).
[60]Alex Lamb. “A Brief Introduction to Generative Models”. In: arXiv preprint arXiv:2103.00265
(2021) (cit. on pp. 12–14).
[61]Robert B Laughlin and David Pines. “The theory of everything”. In: Proceedings of
the national academy of sciences 97.1 (2000), pp. 28–31 (cit. on p. 19).
[62]Yuxi Li. “Deep reinforcement learning: An overview”. In: arXiv preprint arXiv:1701.07274
(2017) (cit. on p. 6).
[63]Chin-Yew Lin. “Rouge: A package for automatic evaluation of summaries”. In: Text
summarization branches out. 2004, pp. 74–81 (cit. on p. 55).
[64]Jing Liu, Sujie Li, Jiang Zhang, and Pan Zhang. “Tensor networks for unsupervised
machine learning”. In: Physical Review E 107.1 (2023), p. L012103 (cit. on p. 35).
[65]Gabriel Loaiza-Ganem, Brendan Leigh Ross, Luhuan Wu, et al. “Denoising Deep
Generative Models”. In: Proceedings on. PMLR. 2023, pp. 41–50 (cit. on p. 16).
[67]Ian P McCulloch. “Infinite size density matrix renormalization group, revisited”. In:
arXiv preprint arXiv:0804.2509 (2008) (cit. on p. 33).
[70]Walaa Medhat, Ahmed Hassan, and Hoda Korashy. “Sentiment analysis algorithms
and applications: A survey”. In: Ain Shams engineering journal 5.4 (2014), pp. 1093–
1113 (cit. on p. 45).
[71]Pankaj Mehta, Marin Bukov, Ching-Hao Wang, et al. “A high-bias, low-variance
introduction to machine learning for physicists”. In: Physics reports 810 (2019),
pp. 1–124 (cit. on pp. 3, 4).
[72]Friederike Metz and Marin Bukov. “Self-correcting quantum many-body control us-
ing reinforcement learning with tensor networks”. In: arXiv preprint arXiv:2201.11790
(2022) (cit. on p. 35).
[73]Tom M Mitchell. Machine learning. Vol. 1. 9. McGraw-hill New York, 1997 (cit. on
p. 4).
[75]Gordon E Moore et al. Cramming more components onto integrated circuits. 1965
(cit. on p. 3).
[76]Valentin Murg, Frank Verstraete, Örs Legeza, and Reinhard M Noack. “Simulating
strongly correlated quantum systems with tree tensor networks”. In: Physical Review
B 82.20 (2010), p. 205105 (cit. on p. 33).
[80]Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. “Ten-
sorizing neural networks”. In: Advances in neural information processing systems 28
(2015) (cit. on p. 35).
[85]David Poulin, Angie Qarry, Rolando Somma, and Frank Verstraete. “Quantum
simulation of time-dependent Hamiltonians and the convenient illusion of Hilbert
space”. In: Physical review letters 106.17 (2011), p. 170501 (cit. on pp. 21, 22).
[86]John Preskill. “Quantum computing in the NISQ era and beyond”. In: Quantum 2
(2018), p. 79 (cit. on p. 35).
[87]Yong Qing, Peng-Fei Zhou, Ke Li, and Shi-Ju Ran. “Compressing neural network by
tensor network with exponentially fewer variational parameters”. In: arXiv preprint
arXiv:2305.06058 (2023) (cit. on p. 35).
[88]Alec Radford, Jeffrey Wu, Rewon Child, et al. “Language models are unsupervised
multitask learners”. In: OpenAI blog 1.8 (2019), p. 9 (cit. on p. 54).
[89]Dorijan Radočaj, Mladen Jurišić, and Mateo Gašparović. “The role of remote sensing
data and methods in a modern approach to fertilization in precision agriculture”.
In: Remote Sensing 14.3 (2022), p. 778 (cit. on p. 4).
[91]Gustavo H de Rosa and Joao P Papa. “A survey on text generation using generative
adversarial networks”. In: Pattern Recognition 119 (2021), p. 108098 (cit. on p. 46).
[93]Amit Saxena, Mukesh Prasad, Akshansh Gupta, et al. “A review of clustering tech-
niques and developments”. In: Neurocomputing 267 (2017), pp. 664–681 (cit. on
p. 6).
[97]Erwin Schrödinger. “An undulatory theory of the mechanics of atoms and molecules”.
In: Physical review 28.6 (1926), p. 1049 (cit. on p. 19).
[98]Norbert Schuch, Michael M Wolf, Frank Verstraete, and J Ignacio Cirac. “Compu-
tational complexity of projected entangled pair states”. In: Physical review letters
98.14 (2007), p. 140506 (cit. on p. 34).
[100]Y-Y Shi, L-M Duan, and Guifre Vidal. “Classical simulation of quantum many-body
systems with a tree tensor network”. In: Physical review a 74.2 (2006), p. 022320
(cit. on p. 33).
[101]James Stokes and John Terilla. “Probabilistic modeling with matrix product states”.
In: Entropy 21.12 (2019), p. 1236 (cit. on p. 35).
[102]E Miles Stoudenmire. “Learning relevant features of data with multi-scale tensor
networks”. In: Quantum Science and Technology 3.3 (2018), p. 034003 (cit. on
p. 35).
[103]Edwin Stoudenmire and David J Schwab. “Supervised learning with tensor net-
works”. In: Advances in neural information processing systems 29 (2016) (cit. on
p. 35).
[104]Lucas Theis, Aäron van den Oord, and Matthias Bethge. “A note on the evaluation
of generative models”. In: arXiv preprint arXiv:1511.01844 (2015) (cit. on p. 15).
[105]Jakub M Tomczak. Deep generative modeling. Springer, 2022 (cit. on pp. 6, 12, 13,
15, 35).
[106]van den Oord, Aäron. “Deep architectures for feature extraction and generative
modeling”. eng. PhD thesis. Ghent University, 2015, XVII, 144 (cit. on pp. 4, 6).
[107]J Van Gompel, J Haegeman, and J Ryckebusch. Tensor netwerken als niet-gesuperviseerde
generatieve modellen. 2020 (cit. on pp. 15, 16, 37, 38, 48, 53).
[108]Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. “Attention is all you need”. In:
Advances in neural information processing systems 30 (2017) (cit. on p. 46).
[109]Frank Verstraete and J Ignacio Cirac. “Matrix product states represent ground states
faithfully”. In: Physical review b 73.9 (2006), p. 094423 (cit. on p. 28).
[111]Frank Verstraete, Juan J Garcia-Ripoll, and Juan Ignacio Cirac. “Matrix product
density operators: Simulation of finite-temperature and dissipative systems”. In:
Physical review letters 93.20 (2004), p. 207204 (cit. on p. 31).
[112]Frank Verstraete, Valentin Murg, and J Ignacio Cirac. “Matrix product states, pro-
jected entangled pair states, and variational renormalization group methods for
quantum spin systems”. In: Advances in physics 57.2 (2008), pp. 143–224 (cit. on
pp. 20, 21, 24).
[113]Frank Verstraete, Diego Porras, and J Ignacio Cirac. “Density matrix renormalization
group and periodic boundary conditions: A quantum information perspective”. In:
Physical review letters 93.22 (2004), p. 227205 (cit. on p. 33).
[115]Tom Vieijra, Jutho Haegeman, Frank Verstraete, and Laurens Vanderstraeten. “Direct
sampling of projected entangled-pair states”. In: Physical Review B 104.23 (2021),
p. 235141 (cit. on p. 44).
[117]Vieijra, Tom. “Artificial neural networks and tensor networks in Variational Monte
Carlo”. eng. PhD thesis. Ghent University, 2022, XII, 163 (cit. on pp. 19, 20, 30, 31,
34, 35, 51).
[118]Samuel T Wauthier, Bram Vanhecke, Tim Verbelen, and Bart Dhoedt. “Learning
Generative Models for Active Inference using Tensor Networks”. In: Active Inference:
Third International Workshop, IWAI 2022, Grenoble, France, September 19, 2022,
Revised Selected Papers. Springer. 2023, pp. 285–297 (cit. on p. 35).
[121]L Wright, F Barratt, J Dborin, et al. “Deterministic Tensor Network Classifiers”. In:
arXiv preprint arXiv:2205.09768 (2022) (cit. on p. 35).
[122]Zhilin Yang, Ye Yuan, Yuexin Wu, William W Cohen, and Russ R Salakhutdinov. “Re-
view networks for caption generation”. In: Advances in neural information processing
systems 29 (2016) (cit. on p. 45).
[123]Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. “On early stopping in gradient
descent learning”. In: Constructive Approximation 26.2 (2007), pp. 289–315 (cit. on
p. 51).
[124]Yangyong Zhu, Ning Zhong, and Yun Xiong. “Data explosion, data nature and
dataology”. In: Brain Informatics: International Conference, BI 2009 Beijing, China,
October 22-24, 2009 Proceedings. Springer. 2009, pp. 147–158 (cit. on p. 3).
Appendix A: Tensor Manipulations
A.1 Index grouping and splitting
As an example, the first two indices of the rank-3 tensor T_{α,β,γ} can be grouped into a single index, yielding the matrix

T_{µ,γ}    (A.1)

where µ = (α, β). The splitting of indices is the inverse operation. This involves
separating a single index into multiple indices, which in turn changes the organiza-
tion of the tensor’s data. Generally speaking, a tensor can be reshaped into another
tensor of any rank, following some given (invertible) rule [74]. Naturally, both
tensors must possess the same number of elements. An example of one such rule is
given for the above example where a 2 × 2 × 2 tensor with elements Tijk is reshaped
in a 4 × 2 matrix with elements:
T T112
111
T T122
121
(A.2)
T211 T212
T221 T222
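In numpy this grouping is a plain reshape; the following minimal sketch reproduces the 4 × 2 matrix of Equation A.2 (row-major ordering of the grouped index is an implementation choice).

```python
import numpy as np

T = np.arange(8).reshape(2, 2, 2)   # rank-3 tensor with elements T_{ijk}

# Group the first two indices into a single index mu = (i, j): a 4 x 2 matrix,
# whose rows appear in the order of Equation A.2. Splitting is the inverse reshape.
M = T.reshape(4, 2)
assert np.array_equal(M.reshape(2, 2, 2), T)
```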
A.2 Decomposition
The process of grouping the indices of a tensor enables the reshaping of a tensor
of any rank N into a matrix. This can be expressed as follows:

T_{j_1, j_2, ..., j_N} → T_{(j_1 ... j_k);(j_{k+1} ... j_N)}    (A.3)
As tensors can be flattened into matrices, it is possible to utilize the mathematical
tools of matrix algebra and extend them to manipulate tensors. The Singular Value
Decomposition (SVD) is an example of such an operation that can factorize any
rank-r matrix. This technique involves decomposing a matrix into three matrices.
More specifically, the SVD of any matrix T ∈ C^{m×n} is a factorization of the form:

T = U Σ V^†    (A.4)
A useful reduced description of the SVD is the compact SVD. In this case, we
restrict Σ to its k ≤ min(m, n) non-zero diagonal elements and only consider the
corresponding vectors of U and V, such that U ∈ C^{m×k} and V ∈ C^{n×k} are isometric
matrices satisfying U^†U = I_k and V^†V = I_k, while UU^† and VV^† are only projectors [45].
In fact, it is common practice in the field of quantum information to use the Schmidt
Decomposition [95], which is basically a restatement of the SVD. In order to see this,
consider a general pure bipartite state |ψ⟩_{AB} = Σ_{αβ} c_{αβ} |α⟩ ⊗ |β⟩ ∈ H_A ⊗ H_B. The
matrix C with elements c_{αβ} can be written as C = U Σ V^† using the singular value
decomposition (SVD). Now applying U and V as basis transforms in H_A and H_B,
respectively, gives
|ψ⟩ = Σ_{i=1}^{k} σ_i |u_i⟩ ⊗ |v_i⟩    (A.5)
The singular values σi are also called the Schmidt weights. The number of non-zero
singular values σi is called the Schmidt number and a bipartite state is entangled if
the Schmidt number is greater than one. When incorporating Equation A.5 in the
expression for the entanglement entropy, the entropy becomes a function of the
Schmidt weights only, S = −Σ_i σ_i² log σ_i². (The singular values are the square
roots of the eigenvalues of the matrix M^†M; hence the non-negativity of Σ.)
This result shows how the Schmidt weights provide a measurement of the entanglement
between a part of the system and its environment [43].
Colophon
This thesis was typeset with LaTeX 2ε. It uses the Clean Thesis style developed by
Ricardo Langner. The design of the Clean Thesis style is inspired by user guide
documents from Apple Inc.