Efficient Sampling From Noisy Hardware
Anonymous Author(s)
Affiliation
Address
email
Abstract
1 Introduction

The widespread implementation of artificial intelligence (AI) incurs significant energy use, financial costs, and CO2 emissions. This not only increases the cost of products, but also presents obstacles in addressing climate change. Traditional AI methods like deep learning lack the ability to quantify uncertainties, which is crucial for addressing issues such as hallucinations or for ensuring safety in critical tasks. Probabilistic machine learning provides a theoretical framework for achieving this much-needed uncertainty quantification, but it also suffers from high energy consumption and is unviable on a truly large scale due to insufficient computational resources [11]. At the heart of probabilistic machine learning and Bayesian inference is Markov Chain Monte Carlo (MCMC) sampling [14, 21, 8]. Although effective in generating samples from complex distributions, MCMC is known for its substantial computational and energy requirements, making it unsuitable for large-scale deployment in applications such as Bayesian neural networks [11].
Addressing these challenges, this paper proposes a novel hardware framework designed to enhance the energy efficiency of sampling-based probabilistic machine learning. Unlike traditional approaches that depend on MCMC, our framework builds uniform floating-point sampling on stochastically switching magnetic tunnel junction (s-MTJ) devices, achieving significant gains in both computational resources and energy consumption compared to current pseudorandom number generators. In contrast to existing generators, this device-focused strategy not only enhances sampling efficiency but also incorporates genuine randomness originating from the thermal noise in our devices. This noise is simultaneously essential for the probabilistic operation of the s-MTJs and is associated with low energy costs during operation.
We present an acceleration approach for efficiently handling probability distributions. Our experiments confirm its effectiveness by quantifying potential approximation errors in key scenarios. This work does not seek to create a one-size-fits-all setting for all possible probabilistic algorithms that manage probability distributions. Rather, we offer a solution approach that researchers can customize and apply based on the sampling resolution required by a specific algorithm. Our contributions can be summarized as follows:
2 Related Work

A majority of artificial intelligence algorithms rely on random number generators (RNGs). RNGs are employed for weight initialization or dropout in deep learning and for taking random actions in reinforcement learning. In probabilistic machine learning, Markov Chain Monte Carlo (MCMC) algorithms use them for sampling from proposal distributions and for deciding whether to accept or reject samples based on random draws.

Hence, the research community focused on the development of efficient random number generators [15] and their infrastructure [26, 22] shares similarities with this work. Physical (true) random number generators (TRNGs) based on physical devices have been an active research field since the 1950s [16]. Currently used random number generators are often feasibility-motivated free-running oscillators with randomness from electronic noise [25]. A very recent subfield is quantum-based random number generators (QRNGs) [17, 12, 25, 5]. However, it should be noted that our conceptual approach can in principle be applied with any RNG that generates parametrizable Bernoulli distributions, given that it is sufficiently (energy-)efficient.
MCMC methods like Metropolis-Hastings [4, 19] and the state-of-the-art Hamiltonian Monte Carlo (HMC) [23] algorithm are crucial for this research. The use of MCMC for Bayesian inference and probabilistic machine learning represents the core application area of this paper, aiming to achieve computationally and energy-efficient deployment at a large scale. Furthermore, (pseudo)random number generation is often discussed in the context of Monte Carlo approaches, as the two are closely intertwined. MCMC algorithms can take advantage of efficient random number sampling as proposed here. At the same time, our mixture model offers an alternative, hardware-supported approach to the MCMC algorithms themselves. In general, our approach differs from the traditional pseudorandom number generation used in MCMC algorithms, as we employ a genuinely random sampling method, which makes it less suitable for scenarios requiring reproducibility [16, 9, 6] or reversibility [28]. Nevertheless, our objectives align in aiming for (energy-)efficient random number generation and genuine statistical independence.
Antunes and Hill [1] accurately measured the energy usage of random number generators (Mersenne Twister, PCG, and Philox) in programming languages and frameworks such as Python, C, NumPy, TensorFlow, and PyTorch, thus providing a quantification of energy consumption in tools relevant to AI. The energy measurements of this benchmark serve as the baseline for comparison with our approach.
3 Preliminaries

We use the floating-point format as the number representation of interest, as this is also the format that machine learning algorithms use. We define a generic floating-point number as follows:

$x = \pm 2^{e-b} \cdot d_1.d_2 \ldots d_t$,  (1)

where $e$ is the exponent adjusted by a bias $b$, $d_1.d_2 \ldots d_t$ represent the mantissa, $d_i \in \{0, 1\}$, and $d_1 = 1$ indicates an implicit leading bit for normalized numbers.
While our approach is generally applicable to any floating-point format, we demonstrate it for the Float16 format in this paper. The use of the Float16 format, compared to formats with more precision bits, is advantageous in a real-world setting as it demands less rigor in setting the current bias for the s-MTJ devices, which is especially relevant for higher-order exponent bits.

In the following, we describe a Float16 number by its 16-bit organization

$B = (b_0, b_1, \ldots, b_{15})$,  (2)

where $b_{15}$ is the sign bit, $b_{14}$ to $b_{10}$ are the exponent bits with a bias of 15, and $b_9$ to $b_0$ are the mantissa bits. The implicit bit remains unexpressed. This arrangement represents the actual storage format of the bits in memory. By expressing the floating-point format in terms of its bit structure, we can directly map an s-MTJ device's output bit to its equivalent position in the Float16 format.
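For concreteness, the bit-level mapping in Equation (2) can be checked with a few lines of NumPy. This is a minimal sketch; the function name and the example values are ours, not part of the paper:

```python
import numpy as np

def bits_to_float16(bits):
    """Assemble a Float16 value from its stored bits b0..b15, where
    b15 is the sign, b14..b10 the exponent (bias 15), b9..b0 the mantissa."""
    assert len(bits) == 16
    word = 0
    for i, b in enumerate(bits):      # bit b_i occupies position i of the stored word
        word |= (int(b) & 1) << i
    return np.uint16(word).view(np.float16)

# Example: exponent field b14..b10 = 01111 (stored 15, unbiased 0),
# mantissa and sign all zero, which encodes the value 1.0.
bits = [0] * 16
bits[10] = bits[11] = bits[12] = bits[13] = 1
print(bits_to_float16(bits))          # -> 1.0
```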
4 Approach

4.1 Probabilistic Spintronic Devices

Spintronic devices are a class of computing (logic and memory) devices that harness the spin of electrons (in addition to their charge) for computation [10]. This contrasts with traditional electronic devices, which only use electron charges for computation. Spintronic devices are built using magnetic materials, as the magnetization (magnetic moment per unit volume) of a magnet is a macroscopic manifestation of its correlated electron spins. The prototypical spintronic device, called the magnetic tunnel junction (MTJ), is a three-layer device which can act both as a memory unit and a switch [20]. It consists of two ferromagnetic layers separated by a thin, insulating non-magnetic layer. When the magnetizations of the two ferromagnetic layers are aligned parallel to each other, the MTJ exhibits a low resistance ($R_P$). Conversely, when the two magnetizations are aligned anti-parallel, the MTJ exhibits a high resistance ($R_{AP}$). By virtue of the two discrete resistance states, an MTJ can act as a memory bit as well as a switch. In practice, the MTJs are constructed such that one of the ferromagnetic layers stays fixed, while the other layer's magnetization can be easily toggled (free layer, FL). Thus, by toggling the FL, using a magnetic field or electric currents, the MTJ can be switched between its '0' and '1' state.

Table 1: Required 1-bit occurrences in a 3-bit exponent representation

       (1-bit count for exponent values 0-7)
e3:  0    0    0     0     2^4   2^5   2^6   2^7
e2:  0    0    2^2   2^3   0     0     2^6   2^7
e1:  0    2^1  0     2^3   0     2^5   0     2^7

An MTJ can serve as a natural source of randomness upon aggressive scaling, i.e., when the FL of the MTJ is shrunk to such a small volume that it toggles randomly just due to the thermal energy in its vicinity. It is worth noting that the s-MTJ can produce a Bernoulli-distribution-like probability density function (PDF), with p = 0.5, without any external stimulus, by virtue of only the ambient temperature. However, applying a bias current across the s-MTJ allows tuning of the PDF through the spin transfer torque mechanism. As shown in Figure 5c-f of Appendix A, applying a positive bias current across the device makes the high resistance state more favorable, while applying a negative current has the opposite effect. In fact, by applying an appropriate bias current across the s-MTJ, using a simple current-mode digital-to-analog converter as shown in Figure 6a of Appendix A, we can achieve precise control over the Bernoulli parameter (p) exhibited by the s-MTJ. The p-value of the s-MTJ responds to the bias current through a sigmoidal dependence. A more detailed version of this section covering the physical principles, device structure, and simulations of the s-MTJ device can be found in Appendix A.
4.2 Uniform Random Number Sampling

This section describes the configuration of s-MTJ devices representing Bernoulli distributions for generating uniform random numbers in floating-point formats, particularly Float16. To apply this method to other floating-point formats, modify the number of total bits in Equations (3), (5), and (6), as well as the number of exponent bits in (8) and their positions in the format in variable e of (6).
The configuration C for a set of s-MTJ devices is defined as follows:

$C = \{(b_i, p_i) \mid p_i \in [0, 1],\; b_i \in \{b_0, \ldots, b_{15}\}\}$,  (3)

where each $p_i$ is the parameter of a Bernoulli distribution representing the probability of the corresponding Float16 format bit being '1' in the output.

The goal is to configure C so that, with infinite resampling, the sequence $B_n$ of Float16 values converges to a uniform distribution D over the full format. Formally, we seek C such that

$\lim_{n \to \infty} P(B_n = b \mid C) = D(b)$, where $D = \mathrm{Uniform}(-65504, 65504)$.  (4)
In order to meet this condition, we need to assign each bit position $b_i$ of the Float16 format a probability $p_i$ representing the frequency of that bit's occurrence in a uniform Float16 distribution (Equations (5)-(8)). The mantissa bits are assigned a value of 0.5, as detailed in Equation (6), ensuring uniformity across the range they cover. This method extends to the sign bit, whose equal likelihood of toggling maintains the format's symmetry.
In floating-point formats, increasing the exponent doubles the range covered by the mantissa due to the base-2 system. Higher exponent ranges therefore need more frequent sampling to maintain uniform coverage, as simply doubling the sample occurrence from one range to the next does not preserve uniformity. Table 1 shows the required number of 1-bits for each exponent bit in a 3-bit example. In general, one can see a specific overall pattern: $e_1$ has four groups of size 1, $e_2$ has two groups of size 2, and $e_3$ has one group of size 4. More generally, the first count of any exponent group is always $2^{2^{i-1}}$. For the first exponent bit, groups are of size 1 (excludable by $\mathbb{1}_{\{i>1\}}$). For the other exponent bits, the remaining 1-bit counts in the first group are $\sum_{k=1}^{c-1} 2^{2^{i-1}+k}$, where $c = 2^{i-1}$ is the group size, depending on the position $i$ in the floating-point format. The number of groups based on bit position $i$ and the number of exponent bits $e$ is $z = 2^{e-i}$. The count sums for the remaining groups are given by $\sum_{k=1}^{z-1} \sum_{g=1}^{c-1} 2^{2^{i-1} + 2^i \cdot k + g}$, together with the first count $2^{2^{i-1} + 2^i \cdot k}$ of each remaining group, where $z$ is the number of groups and $c$ their size. The highest exponent bit $e_3$, which has only one group, is excluded using $\mathbb{1}_{\{z>1\}}$. To find the probability of 1-bit occurrences for each exponent bit $e_i$, we divide by the total count $2^{(2^e)} - 1$, which depends on the number of exponent bits $e$.
Combining everything, we derive the equation for the configuration C as follows:

$C = \{(b_i, p_i) \mid p_i \in [0, 1],\; b_i \in \{b_0, \ldots, b_{15}\}\}$, where  (5)

$p_i = \begin{cases} \dfrac{o_{i-9}}{2^{(2^e)} - 1} & \text{if } i \in \{10, \ldots, 14\}, \\ 0.5 & \text{otherwise,} \end{cases}$  and  (6)

$o_i = 2^{2^{i-1}} + \sum_{k=1}^{c-1} 2^{2^{i-1}+k} \cdot \mathbb{1}_{\{i>1\}} + \sum_{k=1}^{z-1} 2^{2^{i-1}+2^i \cdot k} + \sum_{k=1}^{z-1} \sum_{g=1}^{c-1} 2^{2^{i-1}+2^i \cdot k + g} \cdot \mathbb{1}_{\{z>1\}}$.  (7)
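As a sanity check, Equations (6)-(7) can be evaluated directly for the five Float16 exponent bits (e = 5); the resulting values reproduce the probabilities used for $c_{10}$ to $c_{14}$ in Section 5.2. This is a minimal sketch in plain Python, and the function and variable names are ours:

```python
def o(i, e):
    """1-bit count for exponent bit i (1 = lowest) with e exponent bits, Eq. (7)."""
    c = 2 ** (i - 1)          # group size
    z = 2 ** (e - i)          # number of groups
    first = 2 ** (2 ** (i - 1))
    rest_first_group = sum(2 ** (2 ** (i - 1) + k) for k in range(1, c)) if i > 1 else 0
    group_starts = sum(2 ** (2 ** (i - 1) + 2 ** i * k) for k in range(1, z))
    rest_groups = (sum(2 ** (2 ** (i - 1) + 2 ** i * k + g)
                       for k in range(1, z) for g in range(1, c)) if z > 1 else 0)
    return first + rest_first_group + group_starts + rest_groups

e = 5                                    # Float16 has five exponent bits
total = 2 ** (2 ** e) - 1                # denominator of Eq. (6)
probs = {10 + (i - 1): o(i, e) / total for i in range(1, e + 1)}
print(probs)
# {10: 0.666..., 11: 0.8, 12: 0.9411..., 13: 0.9961..., 14: 0.99998...}
```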
After obtaining a sample s, min-max normalization can be applied to linearly transform it into a sample s′ that adheres to any specified uniform distribution within the Float16 range:

$s' \sim \mathrm{Uniform}(a, b) = a + \dfrac{(s + 65504) \cdot (b - a)}{131008}$.  (9)

The transformation must be performed in a format exceeding Float16, such as Float32 or a specialized circuit, to maintain numerical stability and precision, because the denominator of Equation (9) exceeds the Float16 limits. We assume that special cases like NaNs or infinities are discarded.
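A direct reading of Equation (9), carried out in Float32 as required above, might look as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def to_uniform(s, a, b):
    """Map a uniform Float16 sample s from [-65504, 65504] to Uniform(a, b), Eq. (9).
    Computed in Float32, since the constant 131008 is not representable in Float16."""
    s32 = np.float32(s)
    return np.float32(a) + (s32 + np.float32(65504)) * np.float32(b - a) / np.float32(131008)

print(to_uniform(np.float16(0.0), 0.0, 1.0))   # midpoint of the range -> 0.5
```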
4.3 Arbitrary Distribution Sampling and Learning

This section addresses how to represent and sample from any arbitrary one-dimensional distribution, aiming for random and energy-efficient non-parametric sampling without closed-form solutions. Sampling from a uniform distribution within the Float16 range is an energy-efficient operation. Given that hardware representations of continuous distributions are inherently discretized during real computations, we use a mixture model of uniform distributions as the distributional representation. This approach [3] is well-established for handling real-world data that standard distributions do not adequately represent. In general, mixture models of all forms are used in probabilistic machine learning to approximate multimodal and complex distributions [21]. We break a distribution down into several non-overlapping uniform distributions, where the approximation error depends on the interval size. The weights of these components indicate the relative probability density of each interval within the overall distribution.
Let D be the distribution to be represented, $\mathbb{F}_{16}$ the set of Float16 values, and $U_i \sim \mathrm{Uniform}(a_i, b_i)$ for $i = 1, 2, \ldots, k$ the non-overlapping interval components of our mixture model, where each $U_i$ is uniform on $[a_i, b_i)$ with $a_i, b_i \in \mathbb{F}_{16}$. The mixture probability density function $f_U$ is defined by

$D(x) = f_U(x) = \sum_{i=1}^{k} w_i f_{U_i}(x)$  (10)

such that $\sum_{x \in X} w_i f_{U_i}(x) = 1$, where $f_{U_i}(x)$ is the probability density function of component $U_i$:

$f_{U_i}(x) = \begin{cases} \dfrac{1}{b_i - a_i} & \text{if } x \in [a_i, b_i), \\ 0 & \text{otherwise.} \end{cases}$
The distribution D is sampled by drawing candidates from all non-overlapping components $U_i$ and selecting a final sample based on the component weights, mapped proportionally to the [0, 1] interval. A uniform random number determines the chosen sample. Possible optimizations involving multiple final samples are omitted here for the sake of scope and simplicity.
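The selection step can be sketched as follows. The paper draws a candidate from every component and keeps one according to the weights; the sketch below uses the equivalent select-then-sample order, and a NumPy pseudorandom generator stands in for the s-MTJ-based uniform source. Function and variable names are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)          # stand-in for the s-MTJ uniform sampler

def sample_mixture(components, n=1):
    """Draw n samples from a mixture of non-overlapping uniform components.
    `components` is a list of ((a, b), w) tuples with interval [a, b) and weight w."""
    intervals = np.array([c[0] for c in components], dtype=np.float64)
    weights = np.array([c[1] for c in components], dtype=np.float64)
    weights = weights / weights.sum()    # normalization keeps the relative proportions
    idx = rng.choice(len(components), size=n, p=weights)   # pick a component per draw
    lo, hi = intervals[idx, 0], intervals[idx, 1]
    return lo + rng.random(n) * (hi - lo)                  # uniform draw inside [a, b)

# Example: a coarse two-component approximation concentrated near zero
mix = [((-0.1, 0.0), 0.7), ((0.0, 0.1), 0.3)]
print(sample_mixture(mix, n=5))
```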
Our approach is particularly suited to statistical distributions whose probability mass is concentrated in certain ranges (e.g., near zero due to data normalization). Using a high component resolution in this range ensures precise sampling, though it may cause inaccuracies further afield. We propose using a balanced number of s-MTJ devices to manage errors, offering a viable and energy-efficient solution. More research is needed to tailor distribution resolutions to specific algorithms. The effectiveness of our method is demonstrated through the analysis of cumulative approximation errors in Section 5.3.
Probabilistic machine learning relies heavily on thorough sampling from the posterior distribution. We have introduced an efficient sampling method, but operations involving two arbitrary distributions are necessary to derive a posterior distribution. Modern probabilistic machine learning mainly uses distributions that have closed-form solutions and methods for approximating unknown distributions with familiar ones. We introduce both the sum (convolution) and the computation of the prior-likelihood product (pointwise multiplication) as methods to facilitate the learning of posterior distributions in a non-parametric manner, bypassing the need for closed-form solutions.
In all definitions, it is assumed that the intervals $\{[a_i, b_i)\}$ in our mixture models are consistent across all represented distributions. Variations in notation (e.g., $\{[c_i, d_i)\}$) highlight different distributions.

The convolution $Z = X + Y$ for two independent random variables X and Y is defined as $f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x)\, dx$. For approximating the convolution using interval sets with weights, we calculate the mean of the sums of interval bounds for each combination (Cartesian product). Let $\{X_i = ([a_i, b_i), w_i)\}_{i=1}^{n}$ and $\{Y_j = ([c_j, d_j), v_j)\}_{j=1}^{n}$ represent the mixture models for X and Y, respectively, covering the entire Float16 range.

Calculating the means results in

$m_{ij} = \dfrac{a_i + b_i}{2} + \dfrac{c_j + d_j}{2}$,  (11)

with a combined weight

$u_{ij} = w_i \cdot v_j$.  (12)
This intermediate set $\{(m_{ij}, u_{ij})\}_{i,j=1}^{n}$ contains pairs of means and weights. Define $\{Z_l = ([g_l, h_l), r_l)\}_{l=1}^{n}$ as the desired distribution. Update the weights for $Z_l$ by

$r_l = \sum_{i,j=1}^{n} u_{ij} \cdot \mathbb{1}_{[g_l, h_l)}(m_{ij})$,  (13)

where $\mathbb{1}_{[g_l, h_l)}(x)$ is the indicator function that is 1 if $x \in [g_l, h_l)$ and 0 otherwise. Lastly, the weights are normalized:

$r_l' = \dfrac{r_l}{\sum_{s=1}^{n} r_s}$.  (14)
It should be noted that sampling from the normalized and the unnormalized distribution yields equivalent results, since normalization maintains the relative proportions within our distributions. However, it keeps the weights bounded and manageable for storage purposes.
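A direct implementation of Equations (11)-(14) on a shared interval grid might look as follows. This is a minimal sketch with our own function and variable names, and the Gaussian example weights are only an illustration:

```python
import numpy as np

def convolve_mixtures(edges, w, v):
    """Approximate the distribution of Z = X + Y for two mixture models that share
    the interval edges `edges` (length n+1) and have component weights w and v (length n).
    Implements Eqs. (11)-(14): Cartesian product of component means, product weights,
    re-binning into the same grid, and normalization."""
    mids = 0.5 * (edges[:-1] + edges[1:])             # component means (a_i + b_i) / 2
    m = mids[:, None] + mids[None, :]                 # Eq. (11): pairwise sums of means
    u = w[:, None] * v[None, :]                       # Eq. (12): product weights
    r, _ = np.histogram(m.ravel(), bins=edges, weights=u.ravel())   # Eq. (13)
    return r / r.sum()                                # Eq. (14)

# Example: intervals of width 0.002 on [-1, 1), matching Section 5.3
edges = np.arange(-1.0, 1.0 + 1e-9, 0.002)
mids = 0.5 * (edges[:-1] + edges[1:])
gauss = np.exp(-0.5 * ((mids - 0.2) / 0.1) ** 2)      # weights approximating N(0.2, 0.1^2)
w = gauss / gauss.sum()
z_weights = convolve_mixtures(edges, w, w)            # approx. of N(0.4, 0.1^2 + 0.1^2)
```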
The intermediate pairs of means and weights for the prior-likelihood computation are obtained by

$m_{ii} = \dfrac{a_i + b_i}{2} = \dfrac{c_i + d_i}{2}$, where $a_i = c_i$ and $b_i = d_i$,  (15)

and

$u_{ii} = w_i \cdot v_i$.  (16)
Aside from the above equations, the remaining algorithm is the same as for the convolution. Note that the joint distribution is derived by simultaneously sampling from the two mixture models. The components of the models remain unchanged during sampling.
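Under the same shared-grid assumption, the prior-likelihood (pointwise multiplication) variant reduces to an element-wise product followed by the same normalization. Again a minimal sketch with our own names:

```python
import numpy as np

def prior_likelihood(w_prior, v_likelihood):
    """Pointwise multiplication of two mixture models defined on identical intervals,
    Eqs. (15)-(16): the means coincide, so only the weights are combined and renormalized."""
    u = np.asarray(w_prior) * np.asarray(v_likelihood)   # Eq. (16)
    return u / u.sum()                                   # same normalization as Eq. (14)

# Usage: pass prior and likelihood weights discretized on the same interval grid,
# e.g. a Beta(2, 5) prior and an N(0.1, 0.1^2) likelihood as in Section 5.3.
```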
5 Evaluation

5.1 Energy Consumption of the s-MTJ Approach
Figure 1 depicts our hardware configuration for sampling a single Float16 value. Each $d_i$ is an s-MTJ device. The devices $d_{10}, \ldots, d_{14}$ for the exponent are equipped with 4 control bits to adjust the current bias $c_i$, which corresponds to the Bernoulli probability. The other devices are set to a fixed current bias equivalent to a Bernoulli parameter of 0.5. The resolution, which determines how accurately we can set the Bernoulli distributions for a device, depends on the number of control bits and is visualized in Figure 2. This figure displays the specific Bernoulli values achievable with 4 control bits. Although additional control bits could allow for more precise settings, we restrict this number to 4 due to physical limitations in setting current biases in hardware with higher resolution while keeping the bias circuit simple (and hence energy-efficient). Our approach focuses on achieving high accuracy around a probability of 1 (cf. the configuration in Section 5.2) by taking advantage of the characteristics of the sigmoid function, thus making 4 bits sufficient for achieving the required probability density function.

Figure 1: Hardware setup for sampling one value from a uniform Float16 distribution. Devices $d_0, \ldots, d_{15}$ produce the bits $b_0, \ldots, b_{15}$ (mantissa, exponent, and sign); the exponent devices $d_{10}, \ldots, d_{14}$ receive the 4-bit control words $c_{10}, \ldots, c_{14}$.
For our specific case, where the s-MTJs are configured to generate a uniform distribution of Float16 samples, the p for each s-MTJ is predetermined and fixed. All the mantissa bits and the sign bit require p = 0.5, which the s-MTJ exhibits without any current bias (cf. Sections 4.1 and 4.2). Thus, these eleven s-MTJs do not require a current-biasing circuit. The predetermined p-values for the five exponent bits correspond to specific current biases as shown in Figure 2, which amount to a total power consumption of 20.86 µW, as determined through SPICE simulations (details in Appendix C). For a sampling rate of 1 MHz, this corresponds to 20.86 pJ of biasing energy per Float16 sample. Additionally, reading the state of all sixteen s-MTJs, assuming a nominal resistance of 1 kΩ and a 10 ns readout with a 10 µA probe current, amounts to a readout energy dissipation of 16 fJ per Float16 sample.
Given a hardware accelerator-style architecture, our system is designed with an embarrassingly parallel structure, capable of producing samples every 1 µs. Energy-wise, there is no difference between parallel and sequential setups. Using min-max normalization, sampled intervals can be transformed efficiently into other intervals. It is reasonable to assume that each of the five floating-point operations in Equation (9) within a normalization circuit consumes about 150 fJ on modern microprocessors [7], leading to an extra energy cost of 750 fJ per sample. Consequently, generating 2^30 samples without the linear transformation yields an energy consumption of approximately 0.0224 J; with the linear transformation included, it is approximately 0.0232 J (cf. Figure 7 of Appendix D).
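The per-sample and aggregate figures can be reproduced from the numbers above with a few lines of arithmetic (a worked check of the stated values, not part of the paper's artifacts):

```python
n_samples = 2 ** 30

bias_energy = 20.86e-6 / 1e6        # 20.86 µW biasing power at 1 MHz -> J per sample
readout_energy = 16 * (10e-6) ** 2 * 1e3 * 10e-9   # 16 devices, 10 µA through 1 kΩ for 10 ns
transform_energy = 5 * 150e-15      # five Float32 operations of Eq. (9) at ~150 fJ each

per_sample = bias_energy + readout_energy           # ~20.876 pJ
print(n_samples * per_sample)                       # ~0.0224 J without transformation
print(n_samples * (per_sample + transform_energy))  # ~0.0232 J with transformation
```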
Our method's energy usage is compared to actual energy measurements taken by Antunes and Hill [1]. They benchmarked advanced pseudorandom number generators like Mersenne Twister, PCG, and Philox. This includes evaluations across the original C versions (O2 and O3 suffixes refer to C compiler flags) and adaptations in Python, NumPy, TensorFlow, and PyTorch, platforms and languages relevant to AI. Each measurement reports the total energy used to produce 2^30 pseudorandom 32-bit integers or 64-bit doubles, which are common outputs of these generators. Often, specific algorithms and implementations are limited to producing only certain numeric formats (like integers or doubles), particular bit sizes, or specific stochastic properties. As such, comparing different implementations and floating-point formats is somewhat limited. However, given that all implementations serve the same machine learning algorithms and that our energy consumption estimates show vast differences, this comparison is deemed both reasonable and significant.

Although our method introduces considerable energy costs due to transformations, the overall energy usage, including the linear transformations, is reduced by a factor of 5649 (pcg32integer) compared to the most efficient pseudorandom number generator currently available. Compared to the double-generating Mersenne Twister (mt19937arO2), we obtain an improvement by a factor of 9721. We provide a full comparison against all benchmarked generators in Figure 7 of Appendix D.
Figure 2: Possible Bernoulli parameter resolutions for an s-MTJ device with 4 control bits. The 16 achievable values of E[b_i = 1] over the 0-20 µA current-bias range are 0.6682, 0.7284, 0.7813, 0.8264, 0.8637, 0.8941, 0.9183, 0.9374, 0.9523, 0.9637, 0.9725, 0.9792, 0.9843, 0.9882, 0.9911, and 0.9933.
(a) First Moment (Mean) (b) Second Moment (Variance) (c) Third Moment (Kurtosis)
Figure 3: Physical approximation error comparison for the first three moments of the uniform
distribution (s-MTJ-based approach vs. closed-form solution sampling). Second moment standard
deviation omitted due to equivalence to the means.
5.2 Physical Approximation Error

The number of control bits in an s-MTJ device impacts both the energy consumption and the precision of setting the current bias, which in turn affects the available probabilities of obtaining bit samples. Figure 2 illustrates this relationship. This section evaluates the approximation error caused by imprecision in achieving a desired Bernoulli distribution.

Four control bits allow 16 distinct, uniformly spaced current biases for an s-MTJ device. The probability of reading a '1' or '0' from the device follows a sigmoid function of the bias current, enhancing resolution near 0 and 1 but reducing it around 0.5.
This effect is beneficial, as it yields the configuration $c_{10}, c_{11}, \ldots, c_{14} = \{(10, 0.66666), (11, 0.80000), (12, 0.94118), (13, 0.99611), (14, 0.99998)\}$ for our hardware setup shown in Figure 1, as derived from Equations (5)-(8). Higher exponent bits demand greater precision than lower ones, highlighting the advantages of the Float16 format over larger formats given the physical constraints in setting the current bias.
Figure 8 of Appendix E plots samples obtained with perfect-resolution sampling and with sampling that respects the physical control-bit boundaries. Control Bits Sampling v1 uses the closest achievable value, assigning equal probabilities of 0.9933 to $c_{13}$ and $c_{14}$. Control Bits Sampling v2 assigns probabilities of 0.9911 to $c_{13}$ and 0.9933 to $c_{14}$, testing whether enforcing distinct values is more effective than the closest-distance method (see Figure 2). All versions yield a uniform distribution without noticeable discrepancies.
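The closest-distance assignment of v1 can be expressed directly against the 16 achievable values from Figure 2 (a minimal sketch; the variable names are ours):

```python
import numpy as np

# The 16 Bernoulli values achievable with 4 control bits (Figure 2)
achievable = np.array([0.6682, 0.7284, 0.7813, 0.8264, 0.8637, 0.8941, 0.9183, 0.9374,
                       0.9523, 0.9637, 0.9725, 0.9792, 0.9843, 0.9882, 0.9911, 0.9933])

# Target probabilities for the exponent bits b10..b14, from Equations (5)-(8)
targets = {10: 0.66666, 11: 0.80000, 12: 0.94118, 13: 0.99611, 14: 0.99998}

# v1: pick the closest achievable value for each target
v1 = {bit: float(achievable[np.argmin(np.abs(achievable - p))]) for bit, p in targets.items()}
print(v1)   # b13 and b14 both map to 0.9933, as noted in the text
```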
To precisely analyze distribution shifts, we compare the first three moments (mean, variance, kurtosis) of the uniform Float16 distribution in Figures 3a, 3b, and 3c. We evaluated the empirical moments of these distributions against theoretical expectations using closed-form solutions. We generated 100 000 samples per measurement, repeated each measurement 100 times, and report the results as means and standard deviations.

The mean values over all three moments are consistent across all bit resolutions. The observed deviations are primarily attributed to variations within the extensive Float16 scale rather than to the impact of individual approximation errors. Furthermore, the deviation in the second moment is relatively minor given its high absolute value in the closed-form expression. Based on these observations, it is evident that all versions show negligible effects of approximation errors due to physical inaccuracies.
Table 2: Approximation error comparison (mixture-based approach vs. closed-form solution)

Approach for Distribution P                 | D_KL(P ∥ Q_closed-form) | Δ Q_closed-form
Sampling Q_closed-form (Convolution)        | 0.7846 ± 0.0695         | -
Mixture-based (Convolution)                 | 0.9192 ± 0.0805         | 0.1346 ± 0.1060
Sampling Q_closed-form (Prior-Likelihood)   | 0.8074 ± 0.1422         | -
Mixture-based (Prior-Likelihood)            | 0.8545 ± 0.1713         | 0.0471 ± 0.2284
5.3 Conceptual Approximation Error

This section discusses the approximation errors induced by our conceptual approach due to interval resolution and transformation errors. It examines the convolution and the prior-likelihood transformation of two distributions. The convolution analysis spans the interval [-1, 1) with a resolution of 0.002, comprising 1000 elements. Similarly, the prior-likelihood transformation is analyzed over the interval [-0.5, 1.5) using the same resolution.
The approximation error is quantified by setting up the transformations as described in Section 4.3. Control-bit errors are not considered, attributing the error solely to the theoretical approach. We set up input distributions and their closed-form probability density functions. We convolved two Gaussian distributions $\mathcal{N}(0.2, 0.1^2)$ to get $\mathcal{N}(0.4, (\sqrt{0.1^2 + 0.1^2})^2)$. Using a Beta(2, 5) prior and an $\mathcal{N}(0.1, 0.1^2)$ likelihood, we derived the final distribution by multiplying their densities.
We evaluated the difference in outcomes between our mixture-based approach and the closed-form solution using the Kullback-Leibler (KL) divergence. We used kernel density estimation with a uniform kernel and a bandwidth of 0.002 for density estimation. To assess the inherent offset between closed-form densities and sampling-based ones due to limited sample sizes, we sampled 50 000 times from the Gaussian closed-form distribution. We also used rejection sampling with a uniform proposal distribution, allowing us to obtain samples from the prior-likelihood multiplication. Remaining KL discrepancies can be attributed to the approximation errors of our mixture model. We repeated these sampling-based evaluations 100 times, recording the mean and standard deviation.
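The kind of comparison described here can be sketched as follows, reusing the convolution helper from Section 4.3 above. This is a rough illustration under our own assumptions, not the paper's exact evaluation pipeline; in particular, a histogram on the 0.002 grid stands in for the uniform-kernel KDE, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

edges = np.arange(-1.0, 1.0 + 1e-9, 0.002)
mids = 0.5 * (edges[:-1] + edges[1:])
width = 0.002

# Mixture weights approximating N(0.2, 0.1^2) on the grid
g = np.exp(-0.5 * ((mids - 0.2) / 0.1) ** 2)
w = g / g.sum()

# Mixture-based convolution (see convolve_mixtures above) vs. closed form N(0.4, 0.02)
p = convolve_mixtures(edges, w, w) / width                 # density from the mixture model
samples = rng.normal(0.4, np.sqrt(0.02), size=50_000)      # samples from the closed form
q, _ = np.histogram(samples, bins=edges, density=True)     # grid-based density estimate

mask = (p > 0) & (q > 0)                                   # restrict to jointly supported bins
kl = np.sum(p[mask] * np.log(p[mask] / q[mask])) * width   # D_KL(mixture || closed form)
print(kl)
```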
Table 2 shows the approximation errors observed. As shown in Table 2, each method aligns well with the closed-form probability densities. The additional approximation errors of the mixture-based approach over the sampling baseline are 0.0471 ± 0.2284 for prior-likelihood transformations and 0.1346 ± 0.1060 for convolutions. The higher error for convolutions is likely due to the more frequent recalculation of means and weights (Cartesian product), while prior-likelihood transformations are linear (pointwise multiplication). Figures 9 and 10 in Appendix F illustrate the sampled distributions for these calculations, helping to explain the KL-divergence metrics.
6 Conclusion and Future Work

We introduced a hardware-driven, highly energy-efficient acceleration method for transforming and sampling one-dimensional probability distributions, using stochastically switching magnetic tunnel junctions as a complement or alternative to Markov Chain Monte Carlo techniques. This method includes a precise initialization of these devices for uniform random number sampling that beats the most efficient currently benchmarked pseudorandom number generator by a factor of 5649, a uniform mixture model for distribution sampling, and convolution and prior-likelihood computations to enhance learning and sampling efficiency.
We assessed the approximation error associated with the s-MTJ devices and with our theoretical framework. The findings show that the physical approximation error is negligible when sampling uniform random numbers. Furthermore, the KL divergence showed only minor deviations compared to sampling from the closed-form solution, namely 0.1346 ± 0.1060 for the convolution and 0.0471 ± 0.2284 for the prior-likelihood operation. Our framework allows for the development of tailored solutions focused on specific algorithms and tasks. It enables researchers to customize distributional transformations by adjusting the weight components in the mixture model methodology. Future studies will explore specific algorithms in probabilistic machine learning currently unsuitable for MCMC methods and validate the s-MTJ method by building a prototype, including statistical randomness testing of the device [18, 16]. We also plan to extend our method to multi-dimensional distributions.
References

[1] Benjamin Antunes and David R. C. Hill. Reproducibility, energy efficiency and performance of pseudorandom number generators in machine learning: a comparative study of python, numpy, tensorflow, and pytorch implementations. CoRR, abs/2401.17345, 2024.

[2] Kerem Y. Camsari, Brian M. Sutton, and Supriyo Datta. p-bits for probabilistic spin logic. Applied Physics Reviews, 6(1):011305, 03 2019.

[3] Jianxiong Gao, Zongwen An, and Xuezong Bai. A new representation method for probability distributions of multimodal and irregular data based on uniform mixture model. Ann. Oper. Res., 311(1):81–97, 2022.

[4] Keith Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[5] Miguel Herrero-Collantes and Juan Carlos Garcia-Escartin. Quantum random number generators. Reviews of Modern Physics, 89(1):015004, 2017.

[6] David R. C. Hill. Parallel random numbers, simulation, and reproducible research. Computing in Science & Engineering, 17(4):66–71, 2015.

[7] Anson Ho, Ege Erdil, and Tamay Besiroglu. Limits to the energy efficiency of CMOS microprocessors. In IEEE International Conference on Rebooting Computing, ICRC 2023, San Diego, CA, USA, December 5-6, 2023, pages 1–10. IEEE, 2023.

[8] Matthew Hoffman and Andrew Gelman. The no-u-turn sampler: adaptively setting path lengths in hamiltonian monte carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.

[9] Naoise Holohan. Random number generators and seeding for differential privacy. CoRR, abs/2307.03543, 2023.

[10] Igor Žutić, Jaroslav Fabian, and S. Das Sarma. Spintronics: Fundamentals and applications. Reviews of Modern Physics, 76:323–410, Apr 2004.

[11] Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, and Andrew Gordon Wilson. What are bayesian neural network posteriors really like? In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 4629–4640. PMLR, 2021.

[12] Piotr Józwiak, Janusz Jacak, and Witold Jacak. New concepts and construction of quantum random number generators. Quantum Inf. Process., 23(4):132, 2024.

[13] Shivam N. Kajale, Thanh Nguyen, Corson A. Chao, David C. Bono, Artittaya Boonkird, Mingda Li, and Deblina Sarkar. Current-induced switching of a van der waals ferromagnet at room temperature. Nature Communications, 15(1):1485, Feb 2024.

[14] Robert Kass, Bradley Carlin, Andrew Gelman, and Radford Neal. Markov chain monte carlo in practice: a roundtable discussion. The American Statistician, 52(2):93–100, 1998.

[15] Pierre L'Ecuyer. Uniform random number generation. Annals of Operations Research, 53(1):77–120, 1994.

[16] Pierre L'Ecuyer. History of uniform random number generation. In 2017 Winter Simulation Conference, WSC 2017, Las Vegas, NV, USA, December 3-6, 2017, pages 202–230. IEEE, 2017.

[17] Vaisakh Mannalatha, Sandeep Mishra, and Anirban Pathak. A comprehensive review of quantum random number generators: concepts, classification and the origin of randomness. Quantum Inf. Process., 22(12):439, 2023.

[18] Aldo C. Martínez, Aldo Solís, Rafael Díaz Hernández Rojas, Alfred B. U'Ren, Jorge G. Hirsch, and Isaac Pérez Castillo. Advanced statistical testing of quantum random number generators. Entropy, 20(11):886, 2018.

[19] Nicholas Metropolis, Arianna Rosenbluth, Marshall Rosenbluth, Augusta Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

[20] J. S. Moodera, Lisa R. Kinder, Terrilyn M. Wong, and R. Meservey. Large magnetoresistance at room temperature in ferromagnetic thin film tunnel junctions. Phys. Rev. Lett., 74:3273–3276, Apr 1995.

[21] Kevin Murphy. Machine Learning - A Probabilistic Perspective. Adaptive Computation and Machine Learning series. MIT Press, 2012. ISBN 0262018020.

[22] Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka, Kenichi Miura, and John Shalf. MRG8: random number generation for the exascale era. In Proceedings of the Platform for Advanced Scientific Computing Conference, PASC 2018, Basel, Switzerland, July 02-04, 2018, pages 6:1–6:11. ACM, 2018.

[23] Radford Neal et al. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

[24] M. D. Stiles and A. Zangwill. Anatomy of spin-transfer torque. Physical Review B, 66:014407, Jun 2002.

[25] Mario Stipcevic and Çetin Kaya Koç. True random number generators. In Çetin Kaya Koç, editor, Open Problems in Mathematics and Computational Science, pages 275–315. Springer, 2014.

[26] Hongshi Tan, Xinyu Chen, Yao Chen, Bingsheng He, and Weng-Fai Wong. Thundering: generating multiple independent random number sequences on fpgas. In Huiyang Zhou, Jose Moreira, Frank Mueller, and Yoav Etsion, editors, ICS '21: 2021 International Conference on Supercomputing, Virtual Event, USA, June 14-17, 2021, pages 115–126. ACM, 2021.

[27] Arne Vansteenkiste, Jonathan Leliaert, Mykola Dvornik, Mathias Helsen, Felipe Garcia-Sanchez, and Bartel Van Waeyenberge. The design and verification of MuMax3. AIP Advances, 4(10):107133, 10 2014.

[28] Srikanth Yoginath and Kalyan Perumalla. Efficient reversible uniform and non-uniform random number generation in UNU.RAN. In Erika Frydenlund, Shafagh Jafer, and Hamdi Kavak, editors, Proceedings of the Annual Simulation Symposium, SpringSim (ANSS) 2018, Baltimore, MD, USA, April 15-18, 2018, pages 2:1–2:10. ACM, 2018.

[29] Gaojie Zhang, Fei Guo, Hao Wu, Xiaokun Wen, Li Yang, Wen Jin, Wenfeng Zhang, and Haixin Chang. Above-room-temperature strong intrinsic ferromagnetism in 2d van der waals fe3gate2 with large perpendicular magnetic anisotropy. Nature Communications, 13(1):5067, Aug 2022.
A Additional Information on the Spintronic Device

Spintronic devices are a class of computing (logic and memory) devices that harness the spin of electrons (in addition to their charge) for computation. This contrasts with traditional electronic devices, which only use electron charges for computation. Spintronic devices are built using magnetic materials, as the magnetization (magnetic moment per unit volume) of a magnet is a macroscopic manifestation of its correlated electron spins. The prototypical spintronic device, called the magnetic tunnel junction (MTJ), is a three-layer device which can act both as a memory unit and a switch [10, 20]. It consists of two ferromagnetic layers separated by a thin, insulating non-magnetic layer. When the magnetizations of the two ferromagnetic layers are aligned parallel to each other, the MTJ exhibits a low resistance ($R_P$). Conversely, when the two magnetizations are aligned anti-parallel, the MTJ exhibits a high resistance ($R_{AP}$). By virtue of the two discrete resistance states, an MTJ can act as a memory bit as well as a switch. In practice, the MTJs are constructed such that one of the ferromagnetic layers stays fixed, while the other layer's magnetization can be easily toggled (free layer, FL). Thus, by toggling the FL, using a magnetic field or electric currents, the MTJ can be switched between its '0' and '1' state.
An MTJ can serve as a natural source of randomness upon aggressive scaling, i.e., when the FL of the MTJ is shrunk to such a small volume that it toggles randomly just due to the thermal energy in its vicinity. As schematically illustrated in Figure 4a, the self-energy of the magnetic layer is minimal and equal for the magnetization pointing vertically up or down, i.e., polar angle $\theta_M = 0°$ or $180°$, respectively. The self-energy is maximal for the horizontal orientation ($\theta_M = 90°$). The corresponding energy barrier $\Delta E$ dictates the time scale at which the magnet can toggle between the up- and down-oriented states owing to thermal energy. This time scale follows an Arrhenius-law dependence [2], i.e.,

$\tau_{\uparrow\downarrow} = \tau_0\, e^{\frac{\Delta E}{kT}}$,  (19)

where $\tau_0$ is the inverse of the attempt frequency, typically of the order of 1 ns, $k$ is the Boltzmann constant, and $T$ is the ambient temperature. The energy barrier for a magnet is $\Delta E = K_U V = \mu_0 H_K M_S V / 2$, where $K_U$, $V$, $H_K$ and $M_S$ are the magnet's uniaxial anisotropy energy, volume, effective magnetic anisotropy field and saturation magnetization, respectively. $\mu_0$ is the magnetic permeability of free space. Thus, it can be observed that by reducing the volume $V$ of the magnetic free layer, we can make its $\Delta E$ comparable to $kT$ and achieve natural toggling frequencies of computational relevance, as shown in Figure 4b. Figure 5a shows a time-domain plot of the normalized state of such an s-MTJ, calculated using micromagnetic simulations with the MuMax3 package [27]. Further details on the micromagnetic simulations are included in Appendix B. A histogram of the resistance state of this s-MTJ is presented in Figure 5b. It is worth noting that the s-MTJ can produce such a Bernoulli-distribution-like probability density function (PDF), with p = 0.5, without any external stimulus, by virtue of only the ambient temperature. However, applying a bias current across the s-MTJ allows tuning of the PDF through the spin transfer torque mechanism [24]. As shown in Figure 5c-f, applying a positive bias current across the device makes the high resistance state more favorable, while applying a negative current has the opposite effect. In fact, by applying an appropriate bias current across the s-MTJ, using a simple current-mode digital-to-analog converter as shown in Figure 6a, we can achieve precise control over the Bernoulli parameter (p) exhibited by the s-MTJ. Details on the current-biasing circuit are included in Appendix C. The p-value of the s-MTJ responds to the bias current through a sigmoidal dependence, as shown in Figure 6b.
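To make Equation (19) concrete, the following sketch evaluates the mean dwell time for a few barrier heights, using the attempt time of about 1 ns stated above; the specific ΔE/kT values are illustrative assumptions, not device parameters from the paper:

```python
import numpy as np

tau_0 = 1e-9                                    # attempt time of ~1 ns (inverse attempt frequency)
barrier_ratios = np.array([1, 5, 10, 20, 40])   # illustrative values of ΔE / kT

tau = tau_0 * np.exp(barrier_ratios)            # Eq. (19): mean dwell time per state
for r, t in zip(barrier_ratios, tau):
    print(f"ΔE = {r:2d} kT  ->  tau ≈ {t:.3e} s")
# Small barriers give ns-to-µs random toggling (the stochastic regime);
# ΔE/kT ≈ 40 gives dwell times of years, i.e. the stable-memory regime.
```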
Figure 4: (a) Schematic illustration of the self-energy (E) of a nanomagnet with respect to the polar
angle (θM ) of its magnetization (indicated by thick arrows). (b) Natural frequency of stochastic
switching for a nanomagnet of a particular diameter at different temperatures.
Figure 5: Dynamics of the normalized resistance of a stochastic MTJ for different bias current
densities. (a) Ibias = 0 produces equal probability of observing the high or low state. (b) Histogram of
the observed resistance state for Ibias = 0. (c, d) Trace and histogram of the observed resistance for a
bias current of 2 × 1011 A/m2 . (e, f) Trace and histogram of the observed resistance for a bias current
of -2 × 1011 A/m2 .
Figure 6: (a) Schematic diagram of a current-mode digital to analog converter for providing the
biasing current to a stochastic MTJ. (b) Variation of the Bernoulli parameter of the stochastic MTJ
with bias current. Red triangles are data points obtained from micromagnetic simulations, while the
grey dotted line is a theoretical fit (sigmoid function).
B Micromagnetic Simulations

The dynamics of a ferromagnet's magnetization in response to external stimuli, such as magnetic fields, currents or heat, can be modelled using micromagnetic simulations. The magnetization dynamics can be described using a differential equation known as the Landau-Lifshitz-Gilbert-Slonczewski (LLGS) equation:
$\dfrac{d\vec{m}}{dt} = -\gamma\, \vec{m} \times \vec{H}_{\mathrm{eff}} + \alpha\, \vec{m} \times \dfrac{d\vec{m}}{dt} + \tau_{\parallel}\, \dfrac{\vec{m} \times (\vec{x} \times \vec{m})}{|\vec{x} \times \vec{m}|} + \tau_{\perp}\, \dfrac{\vec{x} \times \vec{m}}{|\vec{x} \times \vec{m}|}$  (20)
C Power Estimation of the Current Biasing Circuit

The current-biasing circuit was simulated in Cadence Virtuoso using the Global Foundries 22FDX (22 nm FDSOI) process design kit. The circuit has been designed for a maximum bias current of 20 µA to attain an s-MTJ with Bernoulli parameter p = 0.99. The current levels corresponding to p = 0.67 and p = 0.99 are divided into a 4-bit resolution (Figure 2). The four bias bits (B0-B3) are fed to the transistors P0, P1, P2, P3 (LSB to MSB), which are sized to produce currents $I_0$, $2I_0$, $4I_0$ and $8I_0$, respectively, when the corresponding bias bit is '1'. A constant current $I_{\mathrm{base}} = 2.82$ µA is additionally supplied through P4 to create a baseline of p = 0.67 for the s-MTJs. The transistors are operated at a low supply voltage of 0.35 V to achieve a small $I_0 = 1.14$ µA. Thus, each exponent bit can be set to its requisite Bernoulli parameter by appropriately setting the 4-bit bias word, and the power dissipation in the biasing circuit can be estimated for each of the exponent bits. The lengths of all transistors are set to 20 nm. The width of P4 is set to 260 nm, while the widths of P0, P1, P2 and P3 are 100 nm, 200 nm, 400 nm, and 800 nm, respectively. As discussed in the main text, our proposed method requires only positive current biases for the stochastic MTJs. Thus, the unipolar current-mode DAC proposed here suffices for our application. For more general use cases where both positive and negative bias currents may be needed, a bipolar current-steering DAC can be utilized.
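The bias current delivered for a given 4-bit word follows directly from the sizing above. A small sketch of this mapping (our own helper, with the constants taken from this appendix):

```python
I0 = 1.14e-6        # unit current of the binary-weighted branches P0..P3 (A)
I_BASE = 2.82e-6    # constant baseline current through P4 (A)

def bias_current(word):
    """Current delivered to the s-MTJ for a 4-bit bias word B3..B0 (0..15)."""
    assert 0 <= word <= 15
    return I_BASE + I0 * word           # P0..P3 contribute I0, 2*I0, 4*I0, 8*I0

print(bias_current(0) * 1e6)    # 2.82 µA  -> baseline, p ≈ 0.67
print(bias_current(15) * 1e6)   # ~19.9 µA -> near the 20 µA design maximum, p ≈ 0.99
```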
D Energy Consumption of Random Number Generators

Figure 7: Power consumption analysis in Joules (logarithmic scale) for 2^30 random numbers, grouped by generator type (32-bit integer, 64-bit double, and our 16-bit Float approach). Benchmark values are taken from Antunes and Hill [1].

32-bit integer generators (J): pcg32Integer 131.17; numpyIntegerTasksetAtOnce 206.41; mt19937arIntegerO3 258.09; tensorflowIntegerAtOnce 262.52; mt19937arIntegerO2 262.77; numpyIntegerAtOnce 267.33; numpyIntegerMtAtOnce 267.77; mt19937arInteger 367.77; numpyIntegerPhiloxAtOnce 404.65; tensorflowIntegerTasksetAtOnce 547.38; pytorchIntegerTasksetAtOnce 566.63; pytorchIntegerAtOnce 640.54; philoxInteger 728.84; pythonIntegerTasksetAtOnce 7.27E+03; pythonIntegerAtOnce 3.48E+04; pythonIntegerOneByOne 3.57E+04; pytorchIntegerOneByOne 3.84E+04; numpyIntegerOneByOne 3.36E+05; numpyIntegerMtOneByOne 6.38E+05; numpyIntegerPhiloxOneByOne 4.77E+06.

64-bit double generators (J): mt19937arO3 225.74; well19937O3 247.41; tensorflowAtOnce 288.69; numpyTasksetAtOnce 330.27; mt19937ar 397.49; pytorchTasksetAtOnce 436.27; numpyAtOnce 568.32; tensorflowAtOncev2 649.36; well19937a 663.01; pcg64O3 711.7; pcg64O2 724.4; pytorchAtOnce 821.43; numpyMtAtOnce 950.24; pcg64 982.61; mrg32k3aO3 998.58; numpyPhiloxAtOnce 1.11E+03; mrg32k3aO2 1.44E+03; mrg32k3a 2.00E+03; pythonOneByOne 2.38E+03; pythonTasksetAtOnce 4.63E+03; pythonAtOnce 8.20E+03; numpyOneByOne 3.27E+04; numpyMtOneByOne 1.99E+05; numpyPhiloxOneByOne 2.02E+05; pytorchOneByOne 2.24E+05.

16-bit Float (ours, J): our approach w/o transform 0.02242; our approach with transform 0.02322.
E Additional Figures on Physical Approximation Error

Figure 8: Visualization of samples obtained with three different assumptions. Perfect Resolution Sampling assumes the precise values obtained from Equations (5)-(8) in Section 4.2. Control Bits Sampling v1 assumes the closest-distance mapping to the actually obtainable control-bit values. Control Bits Sampling v2 assumes that each exponent bit should receive a distinct value rather than the closest one, even if the physically closest distance would imply redundant values.
F Additional Figures for Conceptual Approximation Error

Figure 9: Sampling from the convolution of two Gaussian distributions, $\mathcal{N}(0.2, 0.1^2)$ and $\mathcal{N}(0.2, 0.1^2)$, resulting in $\mathcal{N}(0.4, (\sqrt{0.1^2 + 0.1^2})^2)$.

Figure 10: Sampling after the prior-likelihood transformation: using a Beta(2, 5) prior and an $\mathcal{N}(0.1, 0.1^2)$ likelihood, the final distribution is derived by multiplying their densities.
524 NeurIPS Paper Checklist
525 1. Claims
526 Question: Do the main claims made in the abstract and introduction accurately reflect the
527 paper’s contributions and scope?
528 Answer: [Yes]
529 Justification: Each claim/proposal is addressed and evaluated in distinct subsections, and
530 backed by supplementary materials in the Appendix. For spintronic devices, see Approach
531 Section 4.1, and Appendix A and B. For random number sampling, see Section 4.2. For its
532 energy-consumption, see Section 5.1 and Appendix C. For potential approximation error due
533 to physical hardware constraints, see Section 5.2 and Appendix E. For arbitrary distribution
534 sampling and learning, see Approach Section 4.3. See Section 5.3 and Appendix F for
the evaluation of approximation errors for distribution sampling and learning. In doing so, we also demonstrate the use of the conceptual approach itself.
537 Guidelines:
538 • The answer NA means that the abstract and introduction do not include the claims
539 made in the paper.
540 • The abstract and/or introduction should clearly state the claims made, including the
541 contributions made in the paper and important assumptions and limitations. A No or
542 NA answer to this question will not be perceived well by the reviewers.
543 • The claims made should match theoretical and experimental results, and reflect how
544 much the results can be expected to generalize to other settings.
545 • It is fine to include aspirational goals as motivation as long as it is clear that these goals
546 are not attained by the paper.
547 2. Limitations
548 Question: Does the paper discuss the limitations of the work performed by the authors?
549 Answer: [Yes]
550 Justification: We discuss limitations from approximation error coming both from hardware
551 and the conceptual approach. For potential approximation error due to physical hardware
552 constraints, see Section 5.2 and Appendix E. See Section 5.3 and Appendix F for approxima-
553 tion errors evaluation of distribution sampling and learning. We point out the need for further
554 research on setting individual interval resolution for specific algorithms in the Introduction
555 Section as we propose a generic solution framework. Due to the inherent uncertainty of
556 device simulations of novel/innovative hardware, we acknowledged the need for building
557 and evaluating a real prototype in the Conclusion and Future Work Section.
558 Guidelines:
559 • The answer NA means that the paper has no limitation while the answer No means that
560 the paper has limitations, but those are not discussed in the paper.
561 • The authors are encouraged to create a separate "Limitations" section in their paper.
562 • The paper should point out any strong assumptions and how robust the results are to
563 violations of these assumptions (e.g., independence assumptions, noiseless settings,
564 model well-specification, asymptotic approximations only holding locally). The authors
565 should reflect on how these assumptions might be violated in practice and what the
566 implications would be.
567 • The authors should reflect on the scope of the claims made, e.g., if the approach was
568 only tested on a few datasets or with a few runs. In general, empirical results often
569 depend on implicit assumptions, which should be articulated.
570 • The authors should reflect on the factors that influence the performance of the approach.
571 For example, a facial recognition algorithm may perform poorly when image resolution
572 is low or images are taken in low lighting. Or a speech-to-text system might not be
573 used reliably to provide closed captions for online lectures because it fails to handle
574 technical jargon.
575 • The authors should discuss the computational efficiency of the proposed algorithms
576 and how they scale with dataset size.
577 • If applicable, the authors should discuss possible limitations of their approach to
578 address problems of privacy and fairness.
579 • While the authors might fear that complete honesty about limitations might be used by
580 reviewers as grounds for rejection, a worse outcome might be that reviewers discover
581 limitations that aren’t acknowledged in the paper. The authors should use their best
582 judgment and recognize that individual actions in favor of transparency play an impor-
583 tant role in developing norms that preserve the integrity of the community. Reviewers
584 will be specifically instructed to not penalize honesty concerning limitations.
585 3. Theory Assumptions and Proofs
586 Question: For each theoretical result, does the paper provide the full set of assumptions and
587 a complete (and correct) proof?
588 Answer: [NA]
589 Justification: This paper does not contain theoretical results. We provide simulation-based
590 results, empirical results, and descriptive equations of our approach.
591 Guidelines:
592 • The answer NA means that the paper does not include theoretical results.
593 • All the theorems, formulas, and proofs in the paper should be numbered and cross-
594 referenced.
595 • All assumptions should be clearly stated or referenced in the statement of any theorems.
596 • The proofs can either appear in the main paper or the supplemental material, but if
597 they appear in the supplemental material, the authors are encouraged to provide a short
598 proof sketch to provide intuition.
599 • Inversely, any informal proof provided in the core of the paper should be complemented
600 by formal proofs provided in appendix or supplemental material.
601 • Theorems and Lemmas that the proof relies upon should be properly referenced.
602 4. Experimental Result Reproducibility
603 Question: Does the paper fully disclose all the information needed to reproduce the main ex-
604 perimental results of the paper to the extent that it affects the main claims and/or conclusions
605 of the paper (regardless of whether the code and data are provided or not)?
606 Answer: [Yes]
607 Justification: The code provided and the use of seeds is sufficient to fully reproduce all
608 of our concept-related results. We provide the necessary information/assumptions on the
609 micromagnetic simulations and utilized simulation software to reproduce our device results.
610 Guidelines:
611 • The answer NA means that the paper does not include experiments.
612 • If the paper includes experiments, a No answer to this question will not be perceived
613 well by the reviewers: Making the paper reproducible is important, regardless of
614 whether the code and data are provided or not.
615 • If the contribution is a dataset and/or model, the authors should describe the steps taken
616 to make their results reproducible or verifiable.
617 • Depending on the contribution, reproducibility can be accomplished in various ways.
618 For example, if the contribution is a novel architecture, describing the architecture fully
619 might suffice, or if the contribution is a specific model and empirical evaluation, it may
620 be necessary to either make it possible for others to replicate the model with the same
621 dataset, or provide access to the model. In general. releasing code and data is often
622 one good way to accomplish this, but reproducibility can also be provided via detailed
623 instructions for how to replicate the results, access to a hosted model (e.g., in the case
624 of a large language model), releasing of a model checkpoint, or other means that are
625 appropriate to the research performed.
626 • While NeurIPS does not require releasing code, the conference does require all submis-
627 sions to provide some reasonable avenue for reproducibility, which may depend on the
628 nature of the contribution. For example
629 (a) If the contribution is primarily a new algorithm, the paper should make it clear how
630 to reproduce that algorithm.
631 (b) If the contribution is primarily a new model architecture, the paper should describe
632 the architecture clearly and fully.
633 (c) If the contribution is a new model (e.g., a large language model), then there should
634 either be a way to access this model for reproducing the results or a way to reproduce
635 the model (e.g., with an open-source dataset or instructions for how to construct
636 the dataset).
637 (d) We recognize that reproducibility may be tricky in some cases, in which case
638 authors are welcome to describe the particular way they provide for reproducibility.
639 In the case of closed-source models, it may be that access to the model is limited in
640 some way (e.g., to registered users), but it should be possible for other researchers
641 to have some path to reproducing or verifying the results.
642 5. Open access to data and code
643 Question: Does the paper provide open access to the data and code, with sufficient instruc-
644 tions to faithfully reproduce the main experimental results, as described in supplemental
645 material?
646 Answer: [Yes]
647 Justification: We release all relevant data/code on Github necessary to reproduce the re-
sults for our conceptual approach. We provide the relevant information on the micromagnetic simulations in the Approach Section and the Appendix.
650 Guidelines:
651 • The answer NA means that paper does not include experiments requiring code.
652 • Please see the NeurIPS code and data submission guidelines (https://nips.cc/
653 public/guides/CodeSubmissionPolicy) for more details.
654 • While we encourage the release of code and data, we understand that this might not be
655 possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not
656 including code, unless this is central to the contribution (e.g., for a new open-source
657 benchmark).
658 • The instructions should contain the exact command and environment needed to run to
659 reproduce the results. See the NeurIPS code and data submission guidelines (https:
660 //nips.cc/public/guides/CodeSubmissionPolicy) for more details.
661 • The authors should provide instructions on data access and preparation, including how
662 to access the raw data, preprocessed data, intermediate data, and generated data, etc.
663 • The authors should provide scripts to reproduce all experimental results for the new
664 proposed method and baselines. If only a subset of experiments are reproducible, they
665 should state which ones are omitted from the script and why.
666 • At submission time, to preserve anonymity, the authors should release anonymized
667 versions (if applicable).
668 • Providing as much information as possible in supplemental material (appended to the
669 paper) is recommended, but including URLs to data and code is permitted.
670 6. Experimental Setting/Details
671 Question: Does the paper specify all the training and test details (e.g., data splits, hyper-
672 parameters, how they were chosen, type of optimizer, etc.) necessary to understand the
673 results?
674 Answer: [Yes]
675 Justification: We provide the experimental code for our conceptual approach such that the
676 results are fully reproducible. The information necessary to reproduce the micromagnetic
677 simulations is provided in the Appendix.
678 Guidelines:
679 • The answer NA means that the paper does not include experiments.
680 • The experimental setting should be presented in the core of the paper to a level of detail
681 that is necessary to appreciate the results and make sense of them.
682 • The full details can be provided either with the code, in appendix, or as supplemental
683 material.
684 7. Experiment Statistical Significance
685 Question: Does the paper report error bars suitably and correctly defined or other appropriate
686 information about the statistical significance of the experiments?
687 Answer: [Yes]
688 Justification: We report the mean and standard deviation over repeated experiment runs, along with
689 the number of repetitions and the experiment settings (see the sketch after the guidelines below).
690 Guidelines:
691 • The answer NA means that the paper does not include experiments.
692 • The authors should answer "Yes" if the results are accompanied by error bars, confi-
693 dence intervals, or statistical significance tests, at least for the experiments that support
694 the main claims of the paper.
695 • The factors of variability that the error bars are capturing should be clearly stated (for
696 example, train/test split, initialization, random drawing of some parameter, or overall
697 run with given experimental conditions).
698 • The method for calculating the error bars should be explained (closed form formula,
699 call to a library function, bootstrap, etc.)
700 • The assumptions made should be given (e.g., Normally distributed errors).
701 • It should be clear whether the error bar is the standard deviation or the standard error
702 of the mean.
703 • It is OK to report 1-sigma error bars, but one should state it. The authors should
704 preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the
705 hypothesis of Normality of errors is not verified.
706 • For asymmetric distributions, the authors should be careful not to show in tables or
707 figures symmetric error bars that would yield results that are out of range (e.g. negative
708 error rates).
709 • If error bars are reported in tables or plots, the authors should explain in the text how
710 they were calculated and reference the corresponding figures or tables in the text.
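As a concrete reading of the statistics reported in the justification above, the following minimal sketch shows how a mean and 1-sigma standard deviation over repeated experiment runs can be computed. It is an illustration only, not taken from the released experiment code; the metric values and the array name `run_results` are hypothetical.

```python
# Minimal illustrative sketch (not the authors' released analysis code): computing
# the reported mean and 1-sigma standard deviation over repeated experiment runs.
# The metric values and the name `run_results` are hypothetical.
import numpy as np

run_results = np.array([0.912, 0.905, 0.918, 0.909, 0.914])  # one metric value per repetition

mean = run_results.mean()
std = run_results.std(ddof=1)  # sample standard deviation across runs (1-sigma error bar)

print(f"{mean:.3f} +/- {std:.3f} (mean +/- 1-sigma std over {run_results.size} runs)")
```

Using ddof=1 makes explicit that the error bar is the sample standard deviation across runs rather than the standard error of the mean, in line with the guidelines above.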
711 8. Experiments Compute Resources
712 Question: For each experiment, does the paper provide sufficient information on the com-
713 puter resources (type of compute workers, memory, time of execution) needed to reproduce
714 the experiments?
715 Answer: [Yes]
716 Justification: All code can be executed on consumer-grade computers, as reported at the end
717 of the Introduction Section and in the Appendix Section on Micromagnetic Simulations.
718 Guidelines:
719 • The answer NA means that the paper does not include experiments.
720 • The paper should indicate the type of compute workers CPU or GPU, internal cluster,
721 or cloud provider, including relevant memory and storage.
722 • The paper should provide the amount of compute required for each of the individual
723 experimental runs as well as estimate the total compute.
724 • The paper should disclose whether the full research project required more compute
725 than the experiments reported in the paper (e.g., preliminary or failed experiments that
726 didn’t make it into the paper).
727 9. Code Of Ethics
728 Question: Does the research conducted in the paper conform, in every respect, with the
729 NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
730 Answer: [Yes]
731 Justification: None of the listed harms, impacts, or consequences are applicable to this
732 research.
733 Guidelines:
734 • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
735 • If the authors answer No, they should explain the special circumstances that require a
736 deviation from the Code of Ethics.
737 • The authors should make sure to preserve anonymity (e.g., if there is a special consid-
738 eration due to laws or regulations in their jurisdiction).
739 10. Broader Impacts
740 Question: Does the paper discuss both potential positive societal impacts and negative
741 societal impacts of the work performed?
742 Answer: [Yes]
743 Justification: In the Introduction Section, we briefly motivate our work by the potential to
744 reduce the significant energy use of AI, which implies lower energy-related financial costs and
745 reduced CO2 emissions in the context of climate change, and by improved safety in critical
746 applications through uncertainty quantification.
747 Guidelines:
748 • The answer NA means that there is no societal impact of the work performed.
749 • If the authors answer NA or No, they should explain why their work has no societal
750 impact or why the paper does not address societal impact.
751 • Examples of negative societal impacts include potential malicious or unintended uses
752 (e.g., disinformation, generating fake profiles, surveillance), fairness considerations
753 (e.g., deployment of technologies that could make decisions that unfairly impact specific
754 groups), privacy considerations, and security considerations.
755 • The conference expects that many papers will be foundational research and not tied
756 to particular applications, let alone deployments. However, if there is a direct path to
757 any negative applications, the authors should point it out. For example, it is legitimate
758 to point out that an improvement in the quality of generative models could be used to
759 generate deepfakes for disinformation. On the other hand, it is not needed to point out
760 that a generic algorithm for optimizing neural networks could enable people to train
761 models that generate deepfakes faster.
762 • The authors should consider possible harms that could arise when the technology is
763 being used as intended and functioning correctly, harms that could arise when the
764 technology is being used as intended but gives incorrect results, and harms following
765 from (intentional or unintentional) misuse of the technology.
766 • If there are negative societal impacts, the authors could also discuss possible mitigation
767 strategies (e.g., gated release of models, providing defenses in addition to attacks,
768 mechanisms for monitoring misuse, mechanisms to monitor how a system learns from
769 feedback over time, improving the efficiency and accessibility of ML).
770 11. Safeguards
771 Question: Does the paper describe safeguards that have been put in place for responsible
772 release of data or models that have a high risk for misuse (e.g., pretrained language models,
773 image generators, or scraped datasets)?
774 Answer: [NA]
775 Justification: We do not release data or models with a high risk for misuse and see no obvious misuse risk for our work.
776 Guidelines:
777 • The answer NA means that the paper poses no such risks.
778 • Released models that have a high risk for misuse or dual-use should be released with
779 necessary safeguards to allow for controlled use of the model, for example by requiring
780 that users adhere to usage guidelines or restrictions to access the model or implementing
781 safety filters.
782 • Datasets that have been scraped from the Internet could pose safety risks. The authors
783 should describe how they avoided releasing unsafe images.
784 • We recognize that providing effective safeguards is challenging, and many papers do
785 not require this, but we encourage authors to take this into account and make a best
786 faith effort.
787 12. Licenses for existing assets
788 Question: Are the creators or original owners of assets (e.g., code, data, models), used in
789 the paper, properly credited and are the license and terms of use explicitly mentioned and
790 properly respected?
791 Answer: [NA]
792 Justification: We do not use any existing assets. Where we build on previous scientific work
793 or benchmarks, the original authors are properly cited.
794 Guidelines:
795 • The answer NA means that the paper does not use existing assets.
796 • The authors should cite the original paper that produced the code package or dataset.
797 • The authors should state which version of the asset is used and, if possible, include a
798 URL.
799 • The name of the license (e.g., CC-BY 4.0) should be included for each asset.
800 • For scraped data from a particular source (e.g., website), the copyright and terms of
801 service of that source should be provided.
802 • If assets are released, the license, copyright information, and terms of use in the
803 package should be provided. For popular datasets, paperswithcode.com/datasets
804 has curated licenses for some datasets. Their licensing guide can help determine the
805 license of a dataset.
806 • For existing datasets that are re-packaged, both the original license and the license of
807 the derived asset (if it has changed) should be provided.
808 • If this information is not available online, the authors are encouraged to reach out to
809 the asset’s creators.
810 13. New Assets
811 Question: Are new assets introduced in the paper well documented and is the documentation
812 provided alongside the assets?
813 Answer: [NA]
814 Justification: There are no new assets introduced.
815 Guidelines:
816 • The answer NA means that the paper does not release new assets.
817 • Researchers should communicate the details of the dataset/code/model as part of their
818 submissions via structured templates. This includes details about training, license,
819 limitations, etc.
820 • The paper should discuss whether and how consent was obtained from people whose
821 asset is used.
822 • At submission time, remember to anonymize your assets (if applicable). You can either
823 create an anonymized URL or include an anonymized zip file.
824 14. Crowdsourcing and Research with Human Subjects
825 Question: For crowdsourcing experiments and research with human subjects, does the paper
826 include the full text of instructions given to participants and screenshots, if applicable, as
827 well as details about compensation (if any)?
828 Answer: [NA]
829 Justification: We have not conducted crowdsourcing or research with human subjects.
830 Guidelines:
831 • The answer NA means that the paper does not involve crowdsourcing nor research with
832 human subjects.
833 • Including this information in the supplemental material is fine, but if the main contribu-
834 tion of the paper involves human subjects, then as much detail as possible should be
835 included in the main paper.
836 • According to the NeurIPS Code of Ethics, workers involved in data collection, curation,
837 or other labor should be paid at least the minimum wage in the country of the data
838 collector.
839 15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human
840 Subjects
841 Question: Does the paper describe potential risks incurred by study participants, whether
842 such risks were disclosed to the subjects, and whether Institutional Review Board (IRB)
843 approvals (or an equivalent approval/review based on the requirements of your country or
844 institution) were obtained?
845 Answer: [NA]
846 Justification: We have not conducted crowdsourcing or research with human subjects.
847 Guidelines:
848 • The answer NA means that the paper does not involve crowdsourcing nor research with
849 human subjects.
850 • Depending on the country in which research is conducted, IRB approval (or equivalent)
851 may be required for any human subjects research. If you obtained IRB approval, you
852 should clearly state this in the paper.
853 • We recognize that the procedures for this may vary significantly between institutions
854 and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the
855 guidelines for their institution.
856 • For initial submissions, do not include any information that would break anonymity (if
857 applicable), such as the institution conducting the review.