
Arbitrary quantum states preparation aided by deep reinforcement learning

Zhao-Wei Wang1 and Zhao-Ming Wang1,∗

1 College of Physics and Optoelectronic Engineering, Ocean University of China, Qingdao 266100, China
∗ wangzhaoming@ouc.edu.cn

arXiv:2407.16368v1 [quant-ph] 23 Jul 2024

The preparation of quantum states is essential in the realm of quantum information processing,
and the development of efficient methodologies can significantly alleviate the strain on quantum
resources. Within the framework of deep reinforcement learning (DRL), we integrate the initial- and
target-state information of the state preparation task into a single representation, so as to design
control trajectories between two arbitrary quantum states. Utilizing a semiconductor double quantum
dots (DQDs) model, our results demonstrate that the resulting control trajectories can effectively
achieve arbitrary quantum state preparation (AQSP) for both single-qubit and two-qubit systems,
with average fidelities of 0.9868 and 0.9556 for the test sets, respectively. Furthermore, we consider
noise acting on the system and find that the control trajectories exhibit commendable robustness against
charge and nuclear noise. Our study not only substantiates the efficacy of DRL in QSP, but also
provides a new solution for quantum control tasks with multiple initial and multiple target states, and is
expected to be extended to a wider range of quantum control problems.


I. INTRODUCTION

Precise control of quantum dynamics in physical systems is a cornerstone of quantum information processing. Achieving high-fidelity quantum state preparation (QSP) is crucial for quantum computing and simulation [1, 2]. This process often requires iterative solutions of a set of nonlinear equations [3–6], which is complex and time-consuming. Consequently, the quest for efficient methods to prepare arbitrary quantum states has become a prominent issue in quantum control.

In this context, protocols based on quantum optimal control theory [7, 8, 10] have been acquiring increasing attention. Traditional gradient-based optimization methods, such as stochastic gradient descent (SGD) [7], chopped random-basis optimization (CRAB) [8, 9], and gradient ascent pulse engineering (GRAPE) [10, 11], have been employed to address optimization challenges. However, these methods tend to produce nearly continuous pulses, which may not be ideal for experimental implementation. Recently, machine learning techniques such as deep reinforcement learning (DRL) have emerged as a more efficient approach for designing discrete control pulses, offering lower control costs and significant control effects compared to traditional optimization algorithms [12, 13].

DRL enhances reinforcement learning with neural networks, significantly boosting the ability of agents to recognize and learn from complex features [14]. This enhancement has paved the way for a broad spectrum of applications in quantum physics [15–21], including QSP [12, 13, 22], quantum circuit gate optimization [23–26], coherent state transmission [27], adiabatic quantum control [28], and the measurement of quantum devices [29].

DRL-assisted QSP has been studied extensively, for example fixed QSP from specific quantum states to other designated states [22, 30]. Multi-objective control from fixed states to arbitrary states [12, 31], or from arbitrary states to a fixed state [13], has also shown promising results. Arbitrary quantum state preparation (AQSP) can then be obtained by combining these two opposite directions of QSP, which involves preparing multiple initial states and multiple target states [32]. However, two steps, from an arbitrary state to a fixed state and then to an arbitrary state, are required in the above strategy. Can we use a unified method to realize AQSP from arbitrary states to arbitrary states directly? In this paper, we successfully accomplish the task of designing control trajectories for the AQSP by harnessing the DRL algorithm to condense the initial and target quantum state information into a unified representation. We also incorporate the positive-operator valued measure (POVM) method [33–36] to address the complexity of the density matrix elements, which are not readily applicable in machine learning. Taking the semiconductor double quantum dots (DQDs) model as a testbed, we assess the performance of our algorithm-designed action trajectories in the AQSP for both single-qubit and two-qubit systems. Finally, we consider the effectiveness of our algorithm in the presence of noise, and the results show the robustness of the designed control trajectories against charge noise and nuclear noise.

II. MODEL

In the architecture of a circuit-model quantum computer, the construction of any quantum logic gate is facilitated through the combination of single-qubit gates and entangled two-qubit gates [1]. This study focuses on the AQSP for both single-qubit and two-qubit systems. We have adopted the spin singlet-triplet (S − T0) encoding scheme within DQDs for qubit encoding, a method that is favored for its ability to be manipulated solely by electrical pulses [37–39].
The spin singlet state and the spin triplet state are encoded as |0⟩ = |S⟩ = (|↑↓⟩ − |↓↑⟩)/√2 and |1⟩ = |T0⟩ = (|↑↓⟩ + |↓↑⟩)/√2, respectively. Here |↑⟩ and |↓⟩ denote the two spin eigenstates of a single electron. The control Hamiltonian for a single qubit in the semiconductor DQDs model is given by [40, 41]:

H(t) = J(t)σz + hσx,   (1)

where σz and σx are the Pauli matrices in the z and x directions, respectively. J(t) is a positive, adjustable parameter, while h symbolizes the Zeeman energy level separation between two spins, typically regarded as a constant [42]. For simplicity, we set h = 1 and use the reduced Planck constant ℏ = 1 throughout.

In the field of quantum information processing, operations on entangled qubits are indispensable. Within semiconductor DQDs, interqubit operations are performed on two adjacent S − T0 qubits that are capacitively coupled. The Hamiltonian, in the basis {|SS⟩, |T0S⟩, |ST0⟩, |T0T0⟩}, is expressed as [37, 38, 43, 44]:

H(t) = (1/2) [ J1 (σz ⊗ I) + J2 (I ⊗ σz) + (J12/2) (σz − I) ⊗ (σz − I) + h1 (σx ⊗ I) + h2 (I ⊗ σx) ],   (2)

where Ji and hi represent the exchange coupling and the Zeeman energy gap of the ith qubit, respectively. J12 is proportional to J1 J2, representing the magnitude of the Coulomb coupling between the two qubits. It is crucial for Ji to be positive to maintain consistent interqubit coupling. To streamline the model, we assume h1 = h2 = 1 and set J12 = J1 J2 /2 in this context.
III. METHODS

A. Positive-Operator Valued Measure (POVM)

Typically, the density matrix elements ρij are complex numbers. However, standard machine learning algorithms cannot handle complex numbers directly. To solve this problem, a straightforward approach is to decompose each complex number into its real and imaginary components, reorganize them into new data following specific protocols, and subsequently input these data into the machine learning model. Once processed, the data can be reassembled into their original complex form using the inverse of the initial transformation rules. Beyond this method, recent advancements in applying machine learning to quantum information tasks have used the POVM method to deal with the complex-number problem of the density matrix [33–36]. Specifically, a collection of positive semi-definite measurement operators M = {M(a)} is utilized to translate the density matrix into a corresponding set of measurement outcomes P = {P(a)}. When these outcomes fully capture the information content of the density matrix, they are termed informationally complete POVMs (IC-POVMs). These operators adhere to the normalization condition Σa M(a) = I.

For an N-qubit system, the density matrix can be converted to the form of a probability distribution by

P(a) = tr[ρ M(a)],   (3)

where M(a) = M(a1) ⊗ · · · ⊗ M(aN). In this work, we employ the Pauli-4 POVM MPauli−4 = {M(ai)¹ = (1/3)|0⟩⟨0|, M(ai)² = (1/3)|l⟩⟨l|, M(ai)³ = (1/3)|+⟩⟨+|, M(ai)⁴ = I − M(ai)¹ − M(ai)² − M(ai)³}. By inverting Eq. (3), we can retrieve the density matrix ρ as follows:

ρ = Σ_{a,a′} P(a) (T⁻¹)_{aa′} M(a′),   (4)

where Taa′ = tr(M(a) M(a′)) represents an element of the overlap matrix T. More details of the POVM can be found in Ref. [33].
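The mappings of Eqs. (3) and (4) can be written down directly in NumPy. In the sketch below, the single-qubit Pauli-4 factors follow the definition quoted above, with |l⟩ assumed to be the σy eigenstate (|0⟩ + i|1⟩)/√2 (our assumption), and the N-qubit elements are built as tensor products; the round-trip check at the end confirms that the overlap matrix T is invertible, i.e., that the POVM is informationally complete.

```python
import numpy as np
from itertools import product

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
ket_l = (ket0 + 1j * ket1) / np.sqrt(2)     # assumed sigma_y eigenstate
ket_plus = (ket0 + ket1) / np.sqrt(2)


def proj(v):
    return np.outer(v, v.conj())


POVM_1Q = [proj(ket0) / 3, proj(ket_l) / 3, proj(ket_plus) / 3]
POVM_1Q.append(np.eye(2) - sum(POVM_1Q))    # M^(4) = I - M^(1) - M^(2) - M^(3)


def povm_elements(n_qubits):
    """N-qubit elements M(a) = M(a_1) x ... x M(a_N)."""
    ops = []
    for idx in product(range(4), repeat=n_qubits):
        op = np.array([[1.0 + 0j]])
        for i in idx:
            op = np.kron(op, POVM_1Q[i])
        ops.append(op)
    return ops


def rho_to_probs(rho, ops):
    """Eq. (3): P(a) = tr[rho M(a)]."""
    return np.array([np.trace(rho @ M).real for M in ops])


def probs_to_rho(P, ops):
    """Eq. (4): rho = sum_{a,a'} P(a) (T^-1)_{aa'} M(a')."""
    T = np.array([[np.trace(Ma @ Mb).real for Mb in ops] for Ma in ops])
    coeffs = np.linalg.solve(T, P)           # T^-1 P, with T symmetric
    return sum(c * M for c, M in zip(coeffs, ops))


# Round-trip check on a random single-qubit density matrix.
A = np.random.randn(2, 2) + 1j * np.random.randn(2, 2)
rho = A @ A.conj().T
rho /= np.trace(rho)
ops = povm_elements(1)
assert np.allclose(probs_to_rho(rho_to_probs(rho, ops), ops), rho)
```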
B. Arbitrary quantum states preparation via DRL

Our objective is to accomplish AQSP using discrete rectangular pulses [4, 5, 10, 43, 45]. To this end, we use the Deep Q-Network (DQN) algorithm [14, 46], one of the important methods of DRL, to formulate action trajectories. Details of the DQN algorithm are given in Appendix A.

At first, we sample uniformly on the surface of the Bloch sphere to identify the initial quantum states ρini and the target quantum states ρtar for the QSP. Then we construct a data set for training, validation, and testing. In the context of universal state preparation (USP) [13], which involves transitioning from any arbitrary ρini to a predetermined ρtar, the data set is compiled solely from ρini instances, using the fixed ρtar to assess the efficacy of potential actions. For tasks involving the preparation of diverse ρtar states, supplementary network training is required.

In scenarios that demand handling multiple initial states and objectives, such as AQSP, our aim is to enable the Agent to discern among various ρtar states and to devise the corresponding control trajectories. Thus, in the process of data set design, we take the information from ρini and ρtar to form the state s within the DQN algorithm, expressed as s = [P¹ini, ..., Pⁿini, P¹tar, ..., Pⁿtar]. Here, the POVM method is employed to transform the density matrix ρ into a probability distribution {P(a)}. The first segment of s primarily serves the evolutionary computations, while the latter portion is utilized to distinguish between different tasks and to compute the reward values associated with actions.
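As an illustration of how one entry of this data set is assembled, the short sketch below converts an example pair of initial and target density matrices into Pauli-4 probability vectors with QuTiP and concatenates them into s = [Pini, Ptar]. The particular states chosen, and again the assumption |l⟩ = (|0⟩ + i|1⟩)/√2, are illustrative.

```python
import numpy as np
import qutip as qt

# Single-qubit Pauli-4 POVM as QuTiP operators.
k0, k1 = qt.basis(2, 0), qt.basis(2, 1)
kl, kplus = (k0 + 1j * k1).unit(), (k0 + k1).unit()
M = [k0.proj() / 3, kl.proj() / 3, kplus.proj() / 3]
M.append(qt.qeye(2) - M[0] - M[1] - M[2])


def povm_probs(rho):
    """P(a) = tr[rho M(a)] for each POVM element."""
    return np.array([qt.expect(m, rho) for m in M])


# One task of the data set: |0> as the initial state, |+> as the target.
rho_ini = qt.ket2dm(k0)
rho_tar = qt.ket2dm((k0 + k1).unit())
s = np.concatenate([povm_probs(rho_ini), povm_probs(rho_tar)])
print(s)   # eight real numbers: the network input for this task
```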
The data set is then randomly shuffled and partitioned into training, validation, and test subsets. The training set is predominantly used for the Main Net's training, the validation set assists in estimating the generalization error during the training phase, and the test set is employed to assess the Main Net's performance post-training.

Subsequently, the Main Net, initialized at random, samples an input state s from the training set (step k = 1) and predicts the optimal action ak (i.e., the pulse intensity J(t)). From s, the sets {Pk} and {Ptar} are isolated, and the states ρk and ρtar are calculated using Eq. (4). Given the current ρk and the chosen action ak, we compute the next state ρ′k by evolving under ρ̇k = −i[H(ak), ρk] and determine the fidelity F(ρtar, ρk) ≡ Tr[√(√ρtar ρk √ρtar)]. The fidelity F serves as a critical metric, quantifying the proximity between the subsequent state and the target state. Utilizing Eq. (3), we derive the set {P′k}, which, when merged with {Ptar}, allows us to reconstruct the new state s′. This updated state s′ is then introduced to the Main Network as the current state s, with the iteration index k incremented by one. The reward value r = r(F), which is instrumental in training the Main Network, is formulated as a function of the fidelity. This process is iteratively executed until the iteration count k reaches its maximum limit or the task completion criterion is satisfied, indicated by F > Fthreshold. After completing the action sequence designed by the AQSP algorithm, the initial state ρini evolves to the final state ρfin. We use the fidelity F between the final state ρfin and the target state ρtar to judge the quality of the action sequence. The average fidelity is the average of the fidelities of all tasks in the entire data set. Ultimately, following extensive training, the Main Net attains the capability to assign a Q-value to each state-action pair. With precise Q-values at our disposal, we can determine an appropriate action for any given state, including those that were not explicitly trained.
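One iteration of this loop can be sketched with QuTiP as follows: apply the chosen pulse for one duration dt, quantify the proximity to the target with the square-root fidelity, and use it as the reward. This is a minimal sketch of the environment step only (the network that selects ak is omitted), and the helper names and example task are illustrative.

```python
import numpy as np
import qutip as qt

H_FIELD, DT, F_THRESHOLD = 1.0, np.pi / 5, 0.999
ACTIONS = [0.0, 1.0, 2.0, 3.0, 4.0]       # allowed single-qubit pulses J(t)


def env_step(rho_k, rho_tar, action_index):
    """Apply one rectangular pulse and return (rho', fidelity, reward, done)."""
    J = ACTIONS[action_index]
    H = J * qt.sigmaz() + H_FIELD * qt.sigmax()    # Eq. (1)
    U = (-1j * H * DT).expm()
    rho_next = U * rho_k * U.dag()
    F = qt.fidelity(rho_tar, rho_next)   # Tr sqrt(sqrt(rho_tar) rho sqrt(rho_tar))
    reward = F                            # r = F, as in the single-qubit setup
    return rho_next, F, reward, F > F_THRESHOLD


# Example: one step of the task |0> -> |1> with the pulse J = 2.
rho_k = qt.ket2dm(qt.basis(2, 0))
rho_tar = qt.ket2dm(qt.basis(2, 1))
rho_k, F, r, done = env_step(rho_k, rho_tar, action_index=2)
print(F, done)
```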
A complete description of the training process and the data format conversion is delineated in Algorithm 1, which we define as the AQSP algorithm. For the computational aspects of the algorithm, we employed the Quantum Toolbox in Python (QuTiP) [47].

Algorithm 1: Pseudocode for training the AQSP algorithm
1:  Initialize the Experience Memory D.
2:  Initialize the Main Network θ.
3:  Initialize the Target Network θ− by θ− ← θ.
4:  Set ϵ = 0.
5:  for episode = 0 to episodemax do
6:      for each state s = sinput selected from the training set do
7:          Split P and Ptar from s.
8:          Convert P (Ptar) to ρ (ρtar) by the inverse operation of the POVM.
9:          while True do
10:             With probability 1 − ϵ select a random action; otherwise ai = argmaxa Q(s, a; θ).
11:             Set ϵ = ϵ + δϵ, unless ϵ = ϵmax.
12:             Perform ai and get the next quantum state ρ′.
13:             Calculate the fidelity F and obtain the reward r.
14:             Convert ρ′ to P′ with the POVM.
15:             Concatenate P′ and Ptar to form s′.
16:             Store the experience unit (s, ai, r, s′) in D.
17:             Select a batch of Nbs experience units randomly from D for training.
18:             Update θ by minimizing the Loss function.
19:             Every C steps, θ− ← θ.
20:             break if F > Fthreshold or step ≥ T/dt.
21:         end while
22:     end for
23: end for

TABLE I: Hyperparameter table
Qubit quantity                    Single-qubit    Two-qubit
Allowed actions a (J(t))          0, 1, 2, 3, 4   {(i, j)}∗
Size of the training set          100             100
Size of the validation set        100             100
Size of the test set              9506            39600
Batch size Nbs                    32              32
Memory size M                     20000           30000
Learning rate α                   0.001           0.001
Replace period C                  200             200
Reward discount factor γ          0.9             0.9
Number of hidden layers           2               3
Neurons per hidden layer          64/64           128/128/64
Activation function               ReLU            ReLU
ϵ-greedy increment δϵ             0.001           0.0001
Maximal ϵ in training ϵmax        0.95            0.95
ϵ in validation and test          1               1
Fthreshold per episode            0.999           0.99
Episodes for training             100             400
Total time T                      2π              5π
Action duration dt                π/5             π/4
Maximum steps per episode         10              20
∗ {(i, j)}, i, j = 0, 1, 2, 3, 4, 5
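Lines 10 and 11 of Algorithm 1 amount to an ϵ-greedy rule with a linearly annealed ϵ. A minimal stand-alone version, using the single-qubit values δϵ = 0.001 and ϵmax = 0.95 from Table I, could look like the following; the q_values argument stands in for the Main Net output and is an assumption of the sketch.

```python
import numpy as np

EPS_INCREMENT, EPS_MAX = 0.001, 0.95   # single-qubit values from Table I
rng = np.random.default_rng(0)


def select_action(q_values, eps):
    """Greedy with probability eps, random otherwise (Algorithm 1, line 10)."""
    if rng.random() < eps:
        action = int(np.argmax(q_values))
    else:
        action = int(rng.integers(len(q_values)))
    eps = min(eps + EPS_INCREMENT, EPS_MAX)   # line 11: anneal eps up to eps_max
    return action, eps


eps = 0.0
a, eps = select_action(np.array([0.1, 0.7, 0.2, 0.0, 0.4]), eps)
print(a, eps)
```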
IV. RESULTS AND DISCUSSIONS

A. Single-qubit

We now employ our AQSP algorithm to achieve the task of preparing a single qubit from an arbitrary initial state to an arbitrary target state. To this end, we sample 98 quantum states uniformly across the Bloch sphere by varying the parameters α and β, and subsequently construct a data set comprising 98 × 97 = 9506 data points by pairing each state as both the initial and the target state. The training and validation sets each contain 100 data points, with the remaining 9306 data points allocated to the test set. We define five distinct actions J(t) with values 0, 1, 2, 3, and 4. Each action pulse has a duration dt = π/5, the total evolution time is 2π, and a maximum of 10 actions are allowed per task. The Main Network is structured with two hidden layers, each consisting of 64 neurons. The reward function is set as r = F, with all algorithm hyperparameters detailed in Table I.
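The single-qubit data set described above can be generated along the following lines. Here α and β are the usual polar and azimuthal Bloch angles, and the 98 states are drawn at random on the sphere rather than on the exact grid used in the paper (which is not reproduced here), so the sketch only reproduces the structure of the data set.

```python
import numpy as np
import qutip as qt
from itertools import permutations

rng = np.random.default_rng(0)
N_STATES = 98


def bloch_state(alpha, beta):
    """|psi> = cos(alpha/2)|0> + e^{i beta} sin(alpha/2)|1>."""
    return (np.cos(alpha / 2) * qt.basis(2, 0)
            + np.exp(1j * beta) * np.sin(alpha / 2) * qt.basis(2, 1))


# Uniform sampling on the sphere: cos(alpha) uniform in [-1, 1], beta in [0, 2*pi).
alphas = np.arccos(rng.uniform(-1, 1, N_STATES))
betas = rng.uniform(0, 2 * np.pi, N_STATES)
states = [qt.ket2dm(bloch_state(a, b)) for a, b in zip(alphas, betas)]

# Every ordered pair (initial, target): 98 * 97 = 9506 tasks.
tasks = list(permutations(range(N_STATES), 2))
print(len(tasks))   # 9506
```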
As shown in Fig. 1, the average fidelity and total reward surge rapidly during the initial 6 training episodes, experience a slight fluctuation, and then plateau around the 77th episode, signaling the convergence of the Main Network for the AQSP task.
FIG. 1: The average fidelity and total reward over the validation set as functions of the number of episodes in the training process for the single-qubit AQSP. The average fidelity of the test set is F̄ = 0.9864.

To illustrate the efficacy of our algorithm, we contrast it with the USP algorithm [13]. Specifically, we train a model using the USP algorithm with the same hyperparameters to design action trajectories from arbitrary quantum states to the state |1⟩. For comparative analysis, we replace the original AQSP test set with one where the target state ρtar is |1⟩ during testing. We then assess the performance of both algorithms in the task of preparing any quantum state to the state |1⟩. Table II records the fidelity (average fidelity) of these different tests. Although the USP algorithm achieves a high average fidelity F̄ = 0.9972 for tasks with the target state |1⟩, it fails when the target state is |0⟩. In contrast, the AQSP algorithm demonstrates adaptability across the different tasks, yielding favorable outcomes.

TABLE II: Fidelity (average fidelity) of USP and AQSP in different QSP tasks
Task    |a⟩ → |1⟩    |0⟩ → |1⟩    |1⟩ → |0⟩    |a⟩ → |a⟩
USP     0.9972       0.9941       0.2892       -
AQSP    0.9932       0.9975       0.9972       0.9864

To provide a more intuitive comparison of the control action sequences crafted by the two algorithms, we plot their specific control trajectories for two QSP tasks, |0⟩ → |1⟩ and |1⟩ → |0⟩, as shown in Fig. 2. The AQSP algorithm proves to be highly effective in both tasks, whereas the USP algorithm is only capable of completing the first task. This is attributed to the AQSP algorithm's incorporation of the target state within the training process, enabling it to adapt to a variety of QSP tasks. Conversely, the USP algorithm sets the target state as fixed during training, which results in the designed action trajectory being confined to the target state selected during training, even if the target state changes. Should the USP algorithm be utilized for the second task, an additional model with the target state |0⟩ must be trained, which would still be incapable of completing the first task.

FIG. 2: Action trajectories designed by the AQSP algorithm and the USP algorithm in two preparation tasks. (a) and (b) are the action trajectories designed by the USP algorithm in the tasks |0⟩ → |1⟩ and |1⟩ → |0⟩, respectively. (c) and (d) are the action trajectories designed by the AQSP algorithm in the tasks |0⟩ → |1⟩ and |1⟩ → |0⟩, respectively.

B. Two-qubit

We now turn our attention to the AQSP for two qubits. Our data set encompasses 6912 points, defined as {[a1, a2, a3, a4]ᵀ}, where aj = bj cj and each bj belongs to the set {1, i, −1, −i}, representing the phase. Collectively, a set of cj values defines a point on a four-dimensional unit hypersphere, as described by Eq. (5):

c1 = cos θ1,
c2 = sin θ1 cos θ2,
c3 = sin θ1 sin θ2 cos θ3,
c4 = sin θ1 sin θ2 sin θ3,   (5)

where θi ∈ {π/8, π/4, 3π/8}. The quantum states corresponding to these points adhere to the normalization condition. For training and testing, we randomly selected 200 points from this database and used the corresponding quantum states as both the initial and target states for the QSP task, constructing a data set of 200 × 199 = 39800 entries. From this data set, we randomly designated 100 quantum states for both the training set and the validation set, with the remaining 39600 states allocated for the test set.
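The 6912-point database can be enumerated directly from Eq. (5). The sketch below assumes, consistent with 27 × 4⁴ = 6912, that each of the four amplitudes carries its own phase factor bj ∈ {1, i, −1, −i}.

```python
import numpy as np
from itertools import product

thetas = [np.pi / 8, np.pi / 4, 3 * np.pi / 8]
phases = [1, 1j, -1, -1j]


def c_vector(t1, t2, t3):
    """Eq. (5): a point on the four-dimensional unit hypersphere."""
    return np.array([np.cos(t1),
                     np.sin(t1) * np.cos(t2),
                     np.sin(t1) * np.sin(t2) * np.cos(t3),
                     np.sin(t1) * np.sin(t2) * np.sin(t3)])


states = []
for t1, t2, t3 in product(thetas, repeat=3):
    c = c_vector(t1, t2, t3)
    for b in product(phases, repeat=4):      # per-component phase b_j
        states.append(np.array(b) * c)        # amplitudes a_j = b_j * c_j

states = np.array(states)
print(states.shape)                           # (6912, 4)
assert np.allclose(np.linalg.norm(states, axis=1), 1.0)   # normalization
```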
The standard pulse strengths for the two qubits are given by the set {(J1, J2) | J1, J2 ∈ {1, 2, 3, 4, 5}}. The pulse duration for each action is dt = π/4, the total evolution time is 5π, and a maximum of 20 actions are permitted per task. The reward function is also defined as r = F. All hyperparameters are detailed in Table I.

As shown in Fig. 3, the average fidelity and total reward increase rapidly over the first 40 training episodes, then begin to increase slowly, and the Main Network converges after 280 training episodes. It is worth mentioning that, in order to reduce the pressure on the server memory during the training process, we use a step-by-step training method. Specifically, after every 50 episodes of training, we temporarily save the Main Network and Target Network parameters and then release the data in the memory. At the next training stage we reload the saved network parameters and train further. The Experience Memory D that stores the experience units is also emptied as the data in the memory is released, so the average fidelity and average total reward fluctuate at the beginning of each training session. This fluctuation decreases as the number of training episodes increases, and disappears when the Main Network converges.

FIG. 3: The average fidelity and total reward over the validation set as functions of the number of episodes in the training process for the two-qubit AQSP.

After 400 training episodes, we assessed the Main Network's performance using data from the test set, recording an average fidelity of F̄ = 0.9556. Fig. 4 illustrates the fidelity distribution for the two-qubit AQSP in the test set, under the control trajectories designed by the AQSP algorithm. The results indicate that the fidelity for the majority of tasks surpasses 0.95, signifying that the overall performance of the Main Network is commendable.

FIG. 4: The fidelity distribution of the 39600 samples of the test set in the two-qubit AQSP; the average fidelity is F̄ = 0.9556.

C. AQSP in noisy environments

In the AQSP tasks discussed so far, the influence of the external environment was not considered. However, in practice, quantum systems are inevitably disturbed by their surroundings, which can significantly hinder precise control of the quantum system. We now proceed to evaluate the performance of the control trajectories designed by the AQSP algorithm in the presence of noise. The semiconductor DQDs model mainly suffers from two kinds of noise: charge noise and nuclear noise. Charge noise primarily originates from flaws in the applied voltage field, while nuclear noise is mainly attributed to uncontrollable hyperfine spin coupling within the material [48–51].

For a single-qubit system, these two types of noise can be modeled by introducing minor variations, δσz and δσx, into the Hamiltonian in Eq. (1). In the case of a two-qubit system, δiσz and δiσx are incorporated into the Hamiltonian to account for the noise effects, where i belongs to the set {1, 2}, representing each qubit, and δ signifies the noise amplitude. These noise factors are superimposed on the system's evolution after the Main Network has formulated a control trajectory. Specifically, we introduce random-intensity noise to the various actions within a control trajectory, which is a plausible assumption given the often unpredictable nature of environmental impacts.
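A minimal sketch of how such noise can be superimposed on an already designed single-qubit trajectory is given below: for every pulse in the sequence, a random charge-like term δσz and a random nuclear-like term δσx are added to Eq. (1) before the state is propagated. The uniform noise distribution, the example amplitude, and the example pulse sequence are our assumptions.

```python
import numpy as np
import qutip as qt

H_FIELD, DT = 1.0, np.pi / 5
rng = np.random.default_rng(0)


def run_noisy(rho, pulses, amp_z, amp_x):
    """Replay a designed pulse sequence with random delta*sigma_z / delta*sigma_x noise."""
    for J in pulses:
        dz = rng.uniform(-amp_z, amp_z)   # charge-like noise on the sigma_z term
        dx = rng.uniform(-amp_x, amp_x)   # nuclear-like noise on the sigma_x term
        H = (J + dz) * qt.sigmaz() + (H_FIELD + dx) * qt.sigmax()
        U = (-1j * H * DT).expm()
        rho = U * rho * U.dag()
    return rho


# Replay an (arbitrary example) trajectory for |0> -> |1> with noise amplitude 0.05.
pulses = [2, 0, 4, 1, 3]
rho_fin = run_noisy(qt.ket2dm(qt.basis(2, 0)), pulses, amp_z=0.05, amp_x=0.05)
print(qt.fidelity(qt.ket2dm(qt.basis(2, 1)), rho_fin))
```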
Fig. 5 and Fig. 6 depict the average fidelity of the preparation tasks for single-qubit and two-qubit arbitrary quantum states, respectively, for the control trajectories generated by the AQSP algorithm under varying noise amplitudes within the test set. It is observable that the average fidelity of the test set remains relatively stable, suggesting that the control trajectories designed by our algorithm exhibit commendable robustness within a certain range of noise amplitude.
FIG. 5: The average fidelity over the test set as a function of the amplitudes of charge and nuclear noise for the single-qubit AQSP.

FIG. 6: The average fidelity over the test set as a function of the amplitudes of charge and nuclear noise for the two-qubit AQSP.

V. CONCLUSION

In this paper, we have effectively designed control trajectories for the AQSP. This was accomplished by integrating the initial and target state information into a unified state within the architecture of the DQN algorithm. To overcome the challenge posed by the intricate nature of quantum state elements, which are typically not conducive to machine learning applications, we have implemented the POVM method. This approach allows for the successful incorporation of these complex elements into our machine learning framework.

We have assessed the efficacy of the designed control trajectories by testing them on the AQSP for both single- and two-qubit scenarios within a semiconductor DQDs model. The average fidelities achieved for the test set in the preparation of single-qubit and two-qubit arbitrary quantum states are F̄ = 0.9864 and F̄ = 0.9556, respectively. Additionally, our findings indicate that the control trajectories have substantial robustness against both charge noise and nuclear noise, provided the noise levels remain within a specific threshold. Although our current focus is on state preparation, the proposed scheme possesses the versatility to be extended and applied to a diverse array of multi-objective quantum control challenges.

VI. ACKNOWLEDGMENTS

This work is supported by the Natural Science Foundation of Shandong Province (Grant No. ZR2021LLZ004) and the Fundamental Research Funds for the Central Universities (Grant No. 202364008).

[1] Michael A Nielsen and Isaac L Chuang. Quantum computation and quantum information. Cambridge university press, 2010.
[2] Chien-Hung Cho, Chih-Yu Chen, Kuo-Chin Chen, Tsung-Wei Huang, Ming-Chien Hsu, Ning-Ping Cao, Bei Zeng, Seng-Ghee Tan, and Ching-Ray Chang. Quantum computation: Algorithms and applications. Chinese Journal of Physics, 72:248–269, 2021.
[3] Matthias G Krauss, Christiane P Koch, and Daniel M Reich. Optimizing for an arbitrary Schrödinger cat state. Physical Review Research, 5(4):043051, 2023.
[4] Xin Wang, Lev S Bishop, Edwin Barnes, JP Kestner, and S Das Sarma. Robust quantum gates for singlet-triplet spin qubits using composite pulses. Physical Review A, 89(2):022310, 2014.
[5] Xin Wang, Lev S Bishop, JP Kestner, Edwin Barnes, Kai Sun, and S Das Sarma. Composite pulses for robust universal control of singlet–triplet qubits. Nature communications, 3(1):997, 2012.
[6] Robert E Throckmorton, Chengxian Zhang, Xu-Chen Yang, Xin Wang, Edwin Barnes, and S Das Sarma. Fast pulse sequences for dynamically corrected gates in singlet-triplet qubits. Physical Review B, 96(19):195424, 2017.
[7] Christopher Ferrie. Self-guided quantum tomography. Physical review letters, 113(19):190404, 2014.
[8] Patrick Doria, Tommaso Calarco, and Simone Montangero. Optimal control technique for many-body quantum dynamics. Physical review letters, 106(19):190501, 2011.
[9] Tommaso Caneva, Tommaso Calarco, and Simone Montangero. Chopped random-basis quantum optimization. Physical Review A, 84(2):022326, 2011.
[10] Navin Khaneja, Timo Reiss, Cindie Kehlet, Thomas Schulte-Herbrüggen, and Steffen J Glaser. Optimal control of coupled spin dynamics: design of NMR pulse sequences by gradient ascent algorithms. Journal of Magnetic Resonance, 172(2):296–305, 2005.
[11] Benjamin Rowland and Jonathan A Jones. Implementing quantum logic gates with gradient ascent pulse engineering: principles and practicalities. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 370(1976):4636–4650, 2012.
[12] Xiao-Ming Zhang, Zezhu Wei, Raza Asad, Xu-Chen Yang, and Xin Wang. When does reinforcement learning stand out in quantum control? A comparative study on state preparation. npj Quantum Information, 5(1):85, 2019.
[13] Run-Hong He, Rui Wang, Shen-Shuang Nie, Jing Wu, Jia-Hui Zhang, and Zhao-Ming Wang. Deep reinforcement learning for universal quantum state preparation via dynamic pulse control. EPJ Quantum Technology, 8(1):29, 2021.
[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
[15] Haixu Yu and Xudong Zhao. Deep reinforcement learning with reward design for quantum control. IEEE Transactions on Artificial Intelligence, 2022.
[16] Haixu Yu and Xudong Zhao. Event-based deep reinforcement learning for quantum control. IEEE Transactions on Emerging Topics in Computational Intelligence, 2023.
[17] Matteo M Wauters, Emanuele Panizon, Glen B Mbeng, and Giuseppe E Santoro. Reinforcement-learning-assisted quantum optimization. Physical Review Research, 2(3):033446, 2020.
[18] Daoyi Dong, Chunlin Chen, Hanxiong Li, and Tzyh-Jong Tarn. Quantum reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(5):1207–1220, 2008.
[19] Tanmay Neema, Susmit Jha, and Tuhin Sahai. Non-markovian quantum control via model maximum likelihood estimation and reinforcement learning. arXiv preprint arXiv:2402.05084, 2024.
[20] Marin Bukov, Alexandre GR Day, Dries Sels, Phillip Weinberg, Anatoli Polkovnikov, and Pankaj Mehta. Reinforcement learning in different phases of quantum control. Physical Review X, 8(3):031086, 2018.
[21] José D Martín-Guerrero and Lucas Lamata. Reinforcement learning and physics. Applied Sciences, 11(18):8589, 2021.
[22] Chunlin Chen, Daoyi Dong, Han-Xiong Li, Jian Chu, and Tzyh-Jong Tarn. Fidelity-based probabilistic Q-learning for control of quantum systems. IEEE Transactions on Neural Networks and Learning Systems, 25(5):920–933, 2013.
[23] Omar Shindi, Qi Yu, Parth Girdhar, and Daoyi Dong. Model-free quantum gate design and calibration using deep reinforcement learning. IEEE Transactions on Artificial Intelligence, 5(1):346–357, 2023.
[24] Murphy Yuezhen Niu, Sergio Boixo, Vadim N Smelyanskiy, and Hartmut Neven. Universal quantum control through deep reinforcement learning. npj Quantum Information, 5(1):33, 2019.
[25] Zheng An and DL Zhou. Deep reinforcement learning for quantum gate control. Europhysics Letters, 126(6):60002, 2019.
[26] Yuval Baum, Mirko Amico, Sean Howell, Michael Hush, Maggie Liuzzi, Pranav Mundada, Thomas Merkh, Andre RR Carvalho, and Michael J Biercuk. Experimental deep reinforcement learning for error-robust gate-set design on a superconducting quantum computer. PRX Quantum, 2(4):040324, 2021.
[27] Riccardo Porotti, Dario Tamascelli, Marcello Restelli, and Enrico Prati. Coherent transport of quantum states by deep reinforcement learning. Communications Physics, 2(1):61, 2019.
[28] Yongcheng Ding, Yue Ban, José D Martín-Guerrero, Enrique Solano, Jorge Casanova, and Xi Chen. Breaking adiabatic quantum control with deep learning. Physical Review A, 103(4):L040401, 2021.
[29] V Nguyen, SB Orbell, Dominic T Lennon, Hyungil Moon, Florian Vigneau, Leon C Camenzind, Liuqi Yu, Dominik M Zumbühl, G Andrew D Briggs, Michael A Osborne, et al. Deep reinforcement learning for efficient measurement of quantum devices. npj Quantum Information, 7(1):100, 2021.
[30] Wenjie Liu, Bosi Wang, Jihao Fan, Yebo Ge, and Mohammed Zidan. A quantum system control method based on enhanced reinforcement learning. Soft Computing, 26(14):6567–6575, 2022.
[31] Tobias Haug, Wai-Keong Mok, Jia-Bin You, Wenzu Zhang, Ching Eng Png, and Leong-Chuan Kwek. Classifying global state preparation via deep reinforcement learning. Machine Learning: Science and Technology, 2(1):01LT02, 2020.
[32] Run-Hong He, Hai-Da Liu, Sheng-Bin Wang, Jing Wu, Shen-Shuang Nie, and Zhao-Ming Wang. Universal quantum state preparation via revised greedy algorithm. Quantum Science and Technology, 6(4):045021, 2021.
[33] Juan Carrasquilla, Giacomo Torlai, Roger G Melko, and Leandro Aolita. Reconstructing quantum states with generative models. Nature Machine Intelligence, 1(3):155–161, 2019.
[34] Juan Carrasquilla, Di Luo, Felipe Pérez, Ashley Milsted, Bryan K Clark, Maksims Volkovs, and Leandro Aolita. Probabilistic simulation of quantum circuits using a deep-learning architecture. Physical Review A, 104(3):032610, 2021.
[35] Di Luo, Zhuo Chen, Juan Carrasquilla, and Bryan K Clark. Autoregressive neural network for simulating open quantum systems via a probabilistic formulation. Physical Review Letters, 128(9):090501, 2022.
[36] Moritz Reh, Markus Schmitt, and Martin Gärttner. Time-dependent variational principle for open quantum systems with artificial neural networks. Physical Review Letters, 127(23):230501, 2021.
[37] JM Taylor, H-A Engel, W Dür, Amnon Yacoby, CM Marcus, P Zoller, and MD Lukin. Fault-tolerant architecture for quantum computation using electrically controlled semiconductor spins. Nature Physics, 1(3):177–183, 2005.
[38] John M Nichol, Lucas A Orona, Shannon P Harvey, Saeed Fallahi, Geoffrey C Gardner, Michael J Manfra, and Amir Yacoby. High-fidelity entangling gate for double-quantum-dot spin qubits. npj Quantum Information, 3(1):3, 2017.
[39] Xian Wu, Daniel R Ward, JR Prance, Dohun Kim, John King Gamble, RT Mohr, Zhan Shi, DE Savage, MG Lagally, Mark Friesen, et al. Two-axis control
of a singlet–triplet qubit with an integrated micromagnet. Proceedings of the National Academy of Sciences, 111(33):11938–11942, 2014.
[40] Filip K Malinowski, Frederico Martins, Peter D Nis-
sen, Edwin Barnes, Lukasz Cywiński, Mark S Rudner,
Saeed Fallahi, Geoffrey C Gardner, Michael J Manfra,
Charles M Marcus, et al. Notch filtering the nuclear
environment of a spin qubit. Nature nanotechnology,
12(1):16–20, 2017.
[41] Brett M Maune, Matthew G Borselli, Biqin Huang,
Thaddeus D Ladd, Peter W Deelman, Kevin S Holabird,
Andrey A Kiselev, Ivan Alvarado-Rodriguez, Richard S
Ross, Adele E Schmitz, et al. Coherent singlet-triplet os-
cillations in a silicon-based double quantum dot. Nature,
481(7381):344–347, 2012.
[42] Xin Zhang, Hai-Ou Li, Gang Cao, Ming Xiao, Guang-
Can Guo, and Guo-Ping Guo. Semiconductor quantum
computation. National Science Review, 6(1):32–54, 2019.
[43] Michael D Shulman, Oliver E Dial, Shannon P Harvey,
Hendrik Bluhm, Vladimir Umansky, and Amir Yacoby.
Demonstration of entanglement of electrostatically cou-
pled singlet-triplet qubits. science, 336(6078):202–205,
2012.
[44] I Van Weperen, BD Armstrong, EA Laird, J Medford,
CM Marcus, MP Hanson, and AC Gossard. Charge-state
conditional operation of a spin qubit. Physical review
letters, 107(3):030506, 2011.
[45] Philip Krantz, Morten Kjaergaard, Fei Yan, Terry P Or-
lando, Simon Gustavsson, and William D Oliver. A quan-
tum engineer’s guide to superconducting qubits. Applied
physics reviews, 6(2), 2019.
[46] Volodymyr Mnih, Koray Kavukcuoglu, David Silver,
Alex Graves, Ioannis Antonoglou, Daan Wierstra, and
Martin Riedmiller. Playing atari with deep reinforcement
learning. arXiv preprint arXiv:1312.5602, 2013.
[47] J Robert Johansson, Paul D Nation, and Franco Nori.
Qutip: An open-source python framework for the dynam-
ics of open quantum systems. Computer Physics Com-
munications, 183(8):1760–1772, 2012.
[48] JR Petta, AC Johnson, JM Taylor, EA Laird, A Yacoby, MD Lukin, CM Marcus, MP Hanson, and AC Gossard. Coherent manipulation of coupled electron spins in semiconductor quantum dots. Science, 309(5744):2180–2184, 2005.
[49] Robert Roloff, Thomas Eissfeller, Peter Vogl, and Walter
Pötz. Electric g tensor control and spin echo of a hole-
spin qubit in a quantum dot molecule. New Journal of
Physics, 12(9):093012, 2010.
[50] Edwin Barnes, Lukasz Cywiński, and S Das Sarma.
Nonperturbative master equation solution of central
spin dephasing dynamics. Physical review letters,
109(14):140403, 2012.
[51] Nga TT Nguyen and S Das Sarma. Impurity effects on
semiconductor quantum bits in coupled quantum dots.
Physical Review B—Condensed Matter and Materials
Physics, 83(23):235322, 2011.
[52] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep
learning. nature, 521(7553):436–444, 2015.
[53] Richard S Sutton and Andrew G Barto. Reinforcement
learning: An introduction. MIT press, 2018.
[54] Shai Shalev-Shwartz and Shai Ben-David. Understanding
machine learning: From theory to algorithms. Cambridge
university press, 2014.
[55] Christopher JCH Watkins and Peter Dayan. Q-learning.
Machine learning, 8:279–292, 1992.

Appendix A: Deep reinforcement learning and deep Q network

In this section, we detail the DRL and DQN algorithms, which constitute the core of our AQSP framework. DRL is an amalgamation of deep learning and reinforcement learning techniques. Deep learning employs multi-layered neural networks to discern features and patterns from intricate tasks [52]. In contrast, reinforcement learning is a paradigm where a learning Agent progressively masters the art of decision-making to achieve a predefined objective, through continuous interaction with its Environment [53]. Within the purview of DRL, deep neural networks are harnessed to approximate value functions or policy functions, thereby equipping intelligent systems with the acumen to make optimal decisions [54].

In the realm of reinforcement learning, an Agent symbolizes an intelligent system that is endowed with the capability to make decisions. The Agent's action selection process is predicated on the Markov decision process framework, wherein an action is chosen solely based on the current state, discounting any past state influences [53]. As the Agent and the Environment engage in a dynamic interaction at a given time t, the Agent selects the most advantageous action ai from a set of possible actions a = {a1, a2, · · · , an}, in response to the Environment's current state s, and subsequently executes it. The Environment then transitions to a subsequent state s′ and bestows an immediate reward r upon the Agent. The Agent employs a policy function π(s) to ascertain the most suitable action to undertake, effectively determining ai = π(s).

A complete decision task yields a cumulative reward R, which can be mathematically expressed as [53]:

R = r1 + γ r2 + γ² r3 + · · · + γ^(N−1) rN = Σ_{t=1}^{N} γ^(t−1) rt,   (A.1)

where γ denotes the discount rate, ranging within the interval [0, 1], and N is the total number of actions executed throughout the decision task. The Agent's goal is to maximize R, as a higher cumulative reward R signifies superior performance. Owing to the discounted nature of cumulative rewards, the Agent is inherently motivated to secure larger rewards promptly, thereby ensuring a substantial cumulative R. To determine the optimal action to take in a given state, we rely on the action-value function, commonly referred to as the Q-value [55]:

Qπ(s, ai) = E[rt + γ rt+1 + · · · | s, ai] = E[rt + γ Qπ(s′, a′) | s, ai].   (A.2)

The Q-value embodies the anticipated cumulative reward R that the Agent will garner by executing action ai in a specific state s, in accordance with policy π. This value can be iteratively computed based on the Q-values associated with the ensuing state. In Q-learning [55], a Q-Table is utilized to record these Q-values. Armed with an accurate Q-Table, the optimal action to undertake in a given state s becomes readily apparent. The learning process, in essence, revolves around the continual updating of the Q-Table, with the Q-learning update formula articulated as follows:

Q(s, ai) ← Q(s, ai) + α [rt + γ max_{a′} Q(s′, a′) − Q(s, ai)],   (A.3)

where α is the learning rate. When updating the Q-value, we consider not only the immediate reward but also the prospective future rewards. The update of the current Q-value Q(s, ai) necessitates identifying the maximum Q-value Q(s′, a′) for the subsequent state s′, which requires evaluating multiple actions to ascertain the largest Q-value. Confronted with this trade-off between exploitation and exploration, we employ the ϵ-greedy algorithm to select actions [13]. Specifically, we allocate a probability ϵ to choose the currently most advantageous action, and a probability of 1 − ϵ to explore additional actions. As training advances, ϵ gradually increases from 0 to a value just below 1. This approach enables the Q-Table to expand swiftly during the initial phase of training and facilitates efficient Q-value updates during the intermediate and final stages of training.
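For reference, the update of Eq. (A.3) combined with the ϵ-greedy rule described above reduces to a few lines of NumPy in a small tabular toy problem; the random-reward environment used here is purely illustrative and is not the QSP task.

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 3
ALPHA, GAMMA = 0.1, 0.9
rng = np.random.default_rng(0)

Q = np.zeros((N_STATES, N_ACTIONS))        # the Q-Table
eps = 0.0

for step in range(1000):
    s = rng.integers(N_STATES)
    # epsilon-greedy: exploit with probability eps, explore otherwise
    a = int(np.argmax(Q[s])) if rng.random() < eps else int(rng.integers(N_ACTIONS))
    # toy environment: random next state and reward (illustration only)
    s_next = rng.integers(N_STATES)
    r = rng.random()
    # Eq. (A.3): Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])
    eps = min(eps + 0.001, 0.95)

print(Q)
```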
Calculation of Q-values for tasks that involve multiple steps and actions can be time-consuming, as the outcome is contingent upon the sequence of actions chosen. To address this challenge, we utilize a multi-layer artificial neural network as an alternative to a Q-Table. A trained neural network is capable of estimating Q-values for the various actions within a given state. The Deep Q-Network (DQN) algorithm [14, 46] comprises two neural networks with identical architectures. The Main Net θ and the Target Net θ− are employed to predict the terms Q(s, ai) and max_{a′} Q(s′, a′) from Eq. (A.3), respectively.

We implement an experience replay strategy [46] to train the Main Net. Throughout the training process, the Agent accumulates an experience unit (s, a, r, s′) at each step, storing it in an Experience Memory D with a memory capacity M.
The Agent then randomly selects a batch of Nbs experience units from the Experience Memory D to train the Main Net. The loss function is calculated as follows:

Loss = (1/Nbs) Σ_{i=1}^{Nbs} [ r + γ max_{a′} Q(s′, a′) − Q(s, a) ]_i²,   (A.4)

where we use Eq. (A.4) to calculate the Loss and refine the parameters of the Main Net θ using the mini-batch gradient descent (MBGD) algorithm [14, 46]. The Target Net θ− remains inactive during the training process, only updating its parameters by directly copying from the Main Net θ after every C steps. The schematic diagram illustrating the AQSP algorithm is presented in Fig. 7.

FIG. 7: Schematic for the AQSP algorithm.
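To make Eq. (A.4) and the role of the two networks explicit without introducing a deep-learning framework, the sketch below uses a deliberately simplified linear stand-in Q(s, a; θ) = θa · s for both the Main Net θ and the Target Net θ−, samples a mini-batch from a toy Experience Memory, and performs one mini-batch gradient-descent step on the loss. All dimensions and the synthetic experience units are illustrative.

```python
import numpy as np

STATE_DIM, N_ACTIONS, N_BS, GAMMA, LR = 8, 5, 32, 0.9, 0.001
rng = np.random.default_rng(0)

theta = rng.normal(size=(N_ACTIONS, STATE_DIM))   # Main Net (linear stand-in)
theta_target = theta.copy()                       # Target Net, copied every C steps

# Toy Experience Memory D: tuples (s, a, r, s')
memory = [(rng.normal(size=STATE_DIM), rng.integers(N_ACTIONS),
           rng.random(), rng.normal(size=STATE_DIM)) for _ in range(500)]


def train_step(theta, theta_target):
    """One MBGD step on the loss of Eq. (A.4)."""
    batch = [memory[i] for i in rng.choice(len(memory), N_BS, replace=False)]
    grad = np.zeros_like(theta)
    loss = 0.0
    for s, a, r, s_next in batch:
        q_sa = theta[a] @ s                                  # Q(s, a; theta)
        target = r + GAMMA * np.max(theta_target @ s_next)   # r + gamma max_a' Q(s', a'; theta-)
        err = target - q_sa
        loss += err ** 2 / N_BS                              # Eq. (A.4)
        grad[a] += -2.0 * err * s / N_BS                     # d(loss)/d(theta_a)
    return theta - LR * grad, loss


theta, loss = train_step(theta, theta_target)
print(loss)
```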
