gECC: A GPU-Based High-Throughput Framework for Elliptic Curve Cryptography
January 8, 2025
ABSTRACT
Elliptic Curve Cryptography (ECC) is an encryption method that provides security comparable to
traditional techniques like Rivest–Shamir–Adleman (RSA) but with lower computational complexity
and smaller key sizes, making it a competitive option for applications such as blockchain, secure
multi-party computation, and database security. However, the throughput of ECC is still hindered by
the significant performance overhead associated with elliptic curve (EC) operations, which can affect
their efficiency in real-world scenarios. This paper presents gECC, a versatile framework for ECC
optimized for GPU architectures, specifically engineered to achieve high-throughput performance in
EC operations. To maximize throughput, gECC incorporates batch-based execution of EC operations
and microarchitecture-level optimization of modular arithmetic. It employs Montgomery’s trick [1]
to enable batch EC computation and incorporates novel computation parallelization and memory
management techniques to maximize the computation parallelism and minimize the access overhead
of GPU global memory. Furthermore, we analyze the primary bottleneck in modular multiplication by
investigating how the user codes of modular multiplication are compiled into hardware instructions and
what these instructions’ issuance rates are. We identify that the efficiency of modular multiplication
is highly dependent on the number of Integer Multiply-Add (IMAD) instructions. To eliminate
this bottleneck, we propose novel techniques to minimize the number of IMAD instructions by
leveraging predicate registers to pass the carry information and using addition and subtraction
instructions (IADD3) to replace IMAD instructions. Our experimental results show that, for ECDSA
and ECDH, the two commonly used ECC algorithms, gECC can achieve performance improvements
of 5.56 × and 4.94 ×, respectively, compared to the state-of-the-art GPU-based system. In a real-
world blockchain application, we can achieve performance improvements of 1.56 ×, compared
to the state-of-the-art CPU-based system. gECC is completely and freely available at https://github.com/CGCL-codes/gECC.
1 Introduction
Elliptic Curve Cryptography (ECC) [2, 3] is a method for public-key encryption by using elliptic curves. It offers
security that is on par with conventional public-key cryptographic systems like Rivest–Shamir–Adleman (RSA) but with
a smaller key size and lower computational complexity. Recently, ECC gained increased attention due to its efficient
privacy protection and fewer interactions in verifiable databases [4, 5, 6, 7, 8].
ECC serves as a powerful encryption tool widely used in areas such as data encryption, digital signatures, blockchain,
secure transmission, and secure multi-party computation. The two most prevalent public-key algorithms based
on ECC are Elliptic Curve Diffie-Hellman (ECDH) [9] for data encryption and Elliptic Curve Digital Signature
Algorithm (ECDSA) [10] for digital signatures. Transport Layer Security (TLS), the preferred protocol for securing 5G
communications, incorporates ECDH in its handshake process [9]. ECDH also plays a crucial role in sensitive data
storage [11, 12] and data sharing, allowing basic database query instructions to be executed without any data information
leakage. This includes Private Set Intersection (PSI) [13, 14, 15] and Private Intersection Sum [16]. ECDSA is widely
employed in blockchain systems to safeguard data integrity and transaction accountability [17]. Currently, numerous blockchain-database systems (referred to as verifiable databases) [4, 5, 6, 7, 8], which combine the properties of blockchains with those of classic Database Management Systems (DBMS), protect data history against malicious tampering.
Additionally, researchers developed verifiable SQL [18, 19], protecting the integrity of user data and query execution on
untrusted database providers. As an up-and-coming field approaching commercial applications, cloud service providers,
such as Amazon [20] and Microsoft [21], provide services that can maintain an append-only and cryptographically
verifiable log of data operations.
To enhance the efficiency of ECC, researchers have dedicated significant efforts to designing specialized curve forms
that reduce the computational overhead associated with modular arithmetic in elliptic curve operations. For instance,
blockchain systems, such as Bitcoin [22], Ethereum [23], Zcash [24], and Hyperledger Fabric [25], use the secp256k1
curve [26] and the P-256 curve endorsed by the National Institute of Standards and Technology (NIST) [27] for ECDSA.
In contrast, China maintains the SM2 curve as the standard for electronic authentication systems, key management, and e-commerce applications [28, 29].
Despite such progress, ECC remains a bottleneck in the performance of these throughput-sensitive applications. On
mainstream server environments, a single fundamental EC operation typically takes more than 6 milliseconds to execute.
A PSI computation to identify the intersection between two datasets containing millions of items has to perform tens
of millions of EC operations, which could take around 632 seconds to complete [13]. Although researchers have developed blockchain-database systems with ASICs [30] to accelerate ECDSA and improve transaction throughput, the improvement compared to a CPU-based system is limited, with a maximum of 12%.
There are recent efforts [31, 32, 33] in employing GPUs to minimize the latency of individual EC operations. However, achieving the high-throughput requirements of emerging big data applications remains a challenge. To close the gap, we present a high-throughput GPU-based ECC framework, which is holistically optimized to maximize the data processing throughput. The optimizations consist of the following three major aspects.

Figure: Overview of gECC — applications (PSI, blockchain, etc.) are supported by the elliptic curve cryptography framework (ECDSA, ECDH), which is built on the gECC algorithm (a throughput-oriented EC operation layer) mapped onto the gECC hardware view (a multi-level cache hierarchy).
Second, we employ data-locality-aware kernel fusion optimization and design multi-level cache management to
minimize the memory access overhead incurred by the frequent data access for point data and the large intermediate
results caused by batching EC operations with Montgomery’s Trick. Each EC operation computation inherently requires
access to the point data and the intermediate results twice. These data require hundreds of megabytes of memory space, which far exceeds the limited size of the GPU's shared memory. For example, when 2^20 EC point additions are performed simultaneously, the total size of the data to be temporarily stored is 96 MB and the inputs are 128 MB. Our
techniques optimize data access to minimize the pressure on the registers and GPU’s shared memory.
Third, all of the EC arithmetic operations, including addition, subtraction, and multiplication, are based on modular arithmetic over finite fields of at least 256 bits. Among these, the most time-consuming modular multiplication operation is also the most frequently performed operation in all kinds of EC operations. Previous studies [31, 32,
33, 35, 36] fall short in evaluating the performance of modular multiplication from the perspective of instructions’
issuance rates at the microarchitectural level. Furthermore, there is a noticeable absence of arithmetic optimization for
specific prime moduli in the existing work [33, 36]. We identify that the efficiency of modular multiplication is highly
dependent on the number of Integer Multiply-Add (IMAD) instructions. To eliminate this bottleneck, we propose
novel techniques to minimize the number of IMAD instructions by leveraging predicate registers to pass the carry
information. In addition, we develop an arithmetic optimization for the SM2 curve, using addition and subtraction
instructions (IADD3) to replace the expensive IMAD instructions.
We implement gECC using CUDA and conduct evaluations on the Nvidia A100 GPU to assess its performance.
Our comparative analysis includes gECC and the leading GPU-based ECC system, RapidEC [33], highlighting the
advancements and improvements offered by our framework. On standard cryptographic benchmarks for two commonly
used ECC algorithms, our results demonstrate that gECC achieves a speedup of 4.18× for signature generation and
5.56× for signature verification in ECDSA. If we measure the two key layers in our framework separately, gECC
achieves up to 4.04× and 4.94× speedup for fixed-point multiplication and unknown point multiplication of the EC
operator layer, and a 1.72× speedup in modular multiplication for the SM2 curve in the modular arithmetic layer. In a real-world blockchain system, we achieve a 1.56× performance improvement compared to the state-of-the-art CPU-based system. gECC is completely and freely available at https://github.com/CGCL-codes/gECC.
ECC is a form of public-key cryptography that relies on the algebraic properties of elliptic curves over finite fields. It is
widely employed in various cryptographic algorithms and protocols, including key exchange, digital signatures, and
others. ECC is also utilized in several standard algorithms, such as ECDSA and ECDH. One of the key features of
ECC is its ability to offer the same level of security as RSA encryption with smaller key sizes, making it particularly
appealing for resource-constrained environments.
The SM2 elliptic curve [29] utilized in ECDSA applications is defined by Eqn. 1. The parameters a and b are elements of a finite field Fq, and they establish the specific definition of the curve. A pair (x, y), where x, y ∈ Fq, is considered a point on the curve if it satisfies Eqn. 1.

y^2 = x^3 + ax + b  (Short Weierstrass curve)    (1)
Table 1: Performance analysis of EC operations on Short Weierstrass curves in different coordinate systems
Performing EC operations on points represented in the original coordinate system (called the affine coordinate system), such as Eqn. 2 and 3, results in lower efficiency. This is mainly due to the time-consuming nature of the modular inversion operation when computing 1/(x_P − x_T), which significantly affects overall performance. Various types of elliptic curve coordinate systems have been introduced to enhance the computational efficiency of single EC operations. For instance, the Jacobian coordinate system, used in [33, 37, 38], represents points using triplets (X, Y, Z), where x = X/Z^2 and y = Y/Z^3 [39]. This coordinate system eliminates the need for costly modular inversion and improves the performance
of single PADD and PDBL operations. Table 1 lists the number of modular multiplication (modmul), modular addition
(modadd), and modular inversion (modinv) involved in both PADD and PDBL under different coordinate systems.
PMUL operation. This operation computes the product of a scalar s, a large integer in a finite field, and an elliptic
curve point P . The result, Q, is a new EC point. As demonstrated in Algorithm 1, the PMUL operation can be computed
by repeated PADD and PDBL operations based on the binary representation of the scalars. There are two types of point
multiplication: fixed point multiplication (FPMUL) and unknown point multiplication (UPMUL). For FPMUL, the
point P in Algorithm 1 is known. The common practice [33, 31, 40, 38] is to preprocess the known point to eliminate
the PDBL operation (line 3 in Algorithm 1). UPMUL, on the other hand, is typically processed faithfully following
Algorithm 1.
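To make the control flow of Algorithm 1 concrete, the snippet below sketches the LSB-first double-and-add loop. It is illustrative only: a toy additive group modulo a 32-bit prime stands in for real EC point arithmetic, so `padd`, `pdbl`, and `Q_TOY` are stand-ins rather than gECC's actual functions.

```cpp
#include <cstdint>

// Toy stand-ins: a "point" is an integer modulo a 32-bit prime and PADD/PDBL
// are modular addition and doubling.  In gECC these would be 256-bit EC point
// operations; only the structure of Algorithm 1 is illustrated here.
using Point = uint64_t;
static const uint64_t Q_TOY = 4294967291ULL;            // 2^32 - 5, a prime

static Point padd(Point a, Point b) { return (a + b) % Q_TOY; }
static Point pdbl(Point a)          { return (a + a) % Q_TOY; }

// Q = s * P, scanning the scalar s from the least to the most significant bit.
Point pmul(uint64_t s, Point P) {
    Point Q = 0;                        // identity (the point at infinity on a real curve)
    while (s) {
        if (s & 1) Q = padd(Q, P);      // PADD when the current scalar bit is one
        P = pdbl(P);                    // PDBL (line 3 of Algorithm 1)
        s >>= 1;
    }
    return Q;
}
```

For FPMUL, where P is known in advance, the repeated `pdbl` results can be precomputed offline, which is exactly the preprocessing mentioned above.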
Finite fields have extensive applications in areas such as cryptography, computer algebra, and numerical analysis, playing a crucial role in ECC. The integer x belongs to a finite field Fq, that is, x ∈ [0, q). Here, q represents a large prime modulus with a bit width l that typically spans 256 to 1024. Generally, the large integer x can be decomposed as x = Σ_{i=0}^{m−1} x_i D^i, where D symbolizes the base and q ≤ D^m. D is usually set to 64 or 32, allowing the array X[m:0] = {x_0, ..., x_{m−1}} to be stored in word-size integers.

Algorithm 2: Montgomery multiplication with SOS strategy
Input: a and b are stored in A[m:0] and B[m:0] respectively, where a, b ∈ Fq and q_inv ≡ −q^{−1} mod D^m
Output: c ≡ a ∗ b mod q, stored in C[2m:m]
1: C[2m:0] = A[m:0] ∗ B[m:0]        ▷ integer multiply
2: for i = 0 to m − 1 do            ▷ modular reduce
3:     M_i = (C[i] ∗ q_inv) & (D − 1);
4:     C[2m:i] += M_i ∗ q;
5: end for
6: return c > q ? (c − q) : c;
In a finite field, the results of modmul, modadd, and modinv over integers remain within the field. The most time-consuming operations are modmul and modinv, which have roughly 5 times and 500 times the latency of modadd on a mainstream server, respectively. These operations are the primary targets of our acceleration efforts.
Montgomery multiplication [42] is designed for modmul. By altering the structure of the upper-level loop, various
optimization strategies such as SOS and CIOS have been extensively explored [43] to reduce read and write operations.
As demonstrated in Algorithm 2 (Montgomery multiplication with SOS Strategy), the operation modmul comprises
two phases: integer multiplication and modular reduction. The multiplication phase involves multiplying two integers,
resulting in a value within the range [0, q^2). Subsequently, the value is reduced to the interval [0, q) through the modular reduction phase.
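The following host-side reference sketch makes the two phases of Algorithm 2 concrete for little-endian 32-bit limbs (m ≤ 8, i.e., up to 256-bit operands). It assumes q_inv is the single-word value −q^{−1} mod 2^32 (the value actually consumed by the masking in line 3), and it omits the final conditional subtraction of q (line 6) for brevity; it is not gECC's optimized GPU kernel.

```cpp
#include <cstdint>
#include <cstddef>

// Reference sketch of Algorithm 2 (SOS Montgomery multiplication),
// little-endian 32-bit limbs, m <= 8.  q_inv = -q^{-1} mod 2^32.
void mont_mul_sos(uint32_t c[], const uint32_t a[], const uint32_t b[],
                  const uint32_t q[], uint32_t q_inv, size_t m) {
    uint32_t C[2 * 8 + 1] = {0};                       // 2m+1 limb accumulator

    // Phase 1: integer multiplication, C[2m:0] = A * B.
    for (size_t i = 0; i < m; ++i) {
        uint64_t carry = 0;
        for (size_t j = 0; j < m; ++j) {
            uint64_t t = (uint64_t)a[j] * b[i] + C[i + j] + carry;
            C[i + j] = (uint32_t)t;
            carry    = t >> 32;
        }
        C[i + m] = (uint32_t)carry;
    }

    // Phase 2: modular reduction; each step zeroes one low limb by adding M_i * q.
    for (size_t i = 0; i < m; ++i) {
        uint32_t Mi = C[i] * q_inv;                    // (C[i] * q_inv) mod 2^32
        uint64_t carry = 0;
        for (size_t j = 0; j < m; ++j) {
            uint64_t t = (uint64_t)Mi * q[j] + C[i + j] + carry;
            C[i + j] = (uint32_t)t;
            carry    = t >> 32;
        }
        for (size_t k = i + m; carry && k < 2 * m + 1; ++k) {   // propagate carry upward
            uint64_t t = (uint64_t)C[k] + carry;
            C[k]  = (uint32_t)t;
            carry = t >> 32;
        }
    }

    // The Montgomery product lives in the upper half C[2m:m]; a final
    // conditional subtraction of q (Algorithm 2, line 6) is omitted here.
    for (size_t i = 0; i < m; ++i) c[i] = C[m + i];
}
```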
As discussed previously, researchers are dedicated to studying the prime moduli of special forms, which provide
opportunities to effectively reduce the number of arithmetic operations required for the operation modmul. In particular,
the use of moduli with characteristics similar to Mersenne primes proves to be beneficial. For instance, the State
Cryptography Administration (SCA) of China recommends q = 2^256 − 2^224 − 2^96 + 2^64 − 1, known as SCA-256 [29], which is used in the SM2 curve. On the one hand, the exponents in both cases are chosen as multiples of 32, which can expedite implementation on a 32-bit platform. On the other hand, the q_inv value of this prime in Algorithm 2 is equal to 1, indicating that there is room for optimization in the modular reduction phase.
The modinv operation can be calculated either via Fermat's little theorem (i.e., x^{−1} ≡ x^{q−2} mod q) or a variant of the extended Euclidean algorithm [44, 45]. The former converts a modinv operation into multiple modmul operations,
allowing one to reduce the number of modmul based on known q, as demonstrated in the solution [33]. The latter
employs binary shift, addition, and subtraction operations instead of modmul operations. This results in a performance
improvement of about 3 times compared to the former. However, due to the presence of multiple branch instructions
within the latter algorithm, it is typically used in CPU-based systems [46] and is not suitable for parallel execution on
GPU-based systems.
Our key hypothesis is that we can batch and execute EC operations in parallel to enhance throughput. The overhead of
processing a number of primitive operations (PADD and PDBL) could be reduced by employing a more efficient method
based on the affine coordinate system. For N PADD operations, gECC first calculates λ (Eqn. 2) of each PADD by
one modmul operation and one modinv operation. Specifically, gECC batches all the modinv operations in multiple
PADD operations together and processes them with Montgomery’s trick. The N modinv operations are converted
into one modinv operation and 3N modmul operations. After getting the λ of each PADD operation, gECC only
needs another 2 modmul and 6 modadd operations by leveraging Eqn. 3 to compute the output. In summary, there are
about one modinv, 6N modmul, and 6N modadd operations for a batch of N PADD operations with affine coordinates, while there are 11N modmul and 6N modadd operations for N PADD operations with Jacobian coordinates, as shown in Table 1. Although a few expensive modinv operations are involved, by amortizing these costs over large batch sizes, the total overhead of batch PADD operations can be reduced. Theoretically, when the batch size N is greater than 20, the total overhead of processing a batch of N PADD operations in the affine coordinate system is less than that in the Jacobian coordinate system. In practice, N is set much larger than 20 to fully utilize the large-scale parallelism provided by GPUs. gECC accelerates the processing of a batch of PDBL operations in a similar fashion.
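For concreteness, the following sequential sketch shows the compress/inverse/decompress structure of Montgomery's trick. A toy 32-bit prime field keeps it self-contained; gECC applies the same structure to 256-bit SM2 field elements.

```cpp
#include <cstdint>
#include <vector>

// Toy 32-bit prime field, used only so the sketch is self-contained.
static const uint64_t Q = 4294967291ULL;                        // 2^32 - 5

static uint64_t modmul(uint64_t a, uint64_t b) { return (a * b) % Q; }
static uint64_t modinv(uint64_t a) {                            // Fermat: a^(Q-2) mod Q
    uint64_t r = 1, e = Q - 2;
    while (e) { if (e & 1) r = modmul(r, a); a = modmul(a, a); e >>= 1; }
    return r;
}

// Montgomery's trick: invert all N nonzero elements of x in place with a
// single modinv and roughly 3N modmul operations.
void batch_inverse(std::vector<uint64_t>& x) {
    const size_t n = x.size();
    std::vector<uint64_t> prefix(n);            // compress step: running products
    uint64_t acc = 1;
    for (size_t i = 0; i < n; ++i) { prefix[i] = acc; acc = modmul(acc, x[i]); }

    uint64_t inv = modinv(acc);                 // the only expensive inverse

    for (size_t i = n; i-- > 0; ) {             // decompress step, back to front
        uint64_t xi_inv = modmul(inv, prefix[i]);
        inv  = modmul(inv, x[i]);               // strip x[i] from the running inverse
        x[i] = xi_inv;
    }
}
```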
Although this idea sounds promising, it is non-trivial to capitalize on its benefits due to the following two challenges.
Challenge 1. The calculation of the batch modinv with Montgomery's trick is inherently sequential, since the modinv needs the results of the multiple accumulated modmul operations. A natural remedy is data parallelism: partitioning the batch across threads so that each thread applies Montgomery's trick to its own chunk. Although this shortens the sequential dependency in the compress/decompress step, we have to introduce more time-consuming modinv operations. Moreover, the large number of cores on a GPU requires a high degree of data parallelism to fully utilize its computing power, which results in introducing many more modinv calculations. Even worse, these modinv operations are computed serially rather
than in parallel on the GPU Streaming Processor (SP), as demonstrated in the right part of Fig. 3(a). The reason is that
the fastest algorithm [44, 45] for modinv employs numerous branching conditions to achieve efficient computation,
which causes a large number of GPU warp divergences in parallel computing and then results in worse performance. To
address this, we carefully devise a parallel workflow for the Montgomery's Trick algorithm in Section 3.2 to minimize the number of modinv computations in the inverse step and improve the efficiency of parallel computing.
Challenge 2. When processing a batch of EC operations using Montgomery’s Trick, there are much higher memory
access overheads compared to existing methods, such as [33]. Specifically, in the compress step (Fig. 2), we need
to load two EC points for each PADD operation to calculate the numerator and denominator of λ, and then multiply
the denominator of λ of different PADD operations together to compress the modinv operations needed and store
them for the subsequent calculation in the decompress step (Fig. 2). Suppose that we process a batch of millions (e.g., 2^20) of EC operations to maximize the parallelism. The point data and the intermediate arrays would consume more than 128 MB and 96 MB, respectively. However, this far exceeds the capacity of the L1 cache or shared memory of modern GPUs. For example, the combined L1 cache and shared memory in the NVIDIA A100 GPU is only 20 MB. Offloading these data to global memory would incur huge overhead for reloading them into the cache for computation in both the compress step and the decompress step (Fig. 2). To tackle this issue, we propose two significant optimizations: first, a
memory management mechanism for batch PADD operations that actively utilizes a multi-level cache to cache data
(see Section 3.3); second, a data-locality-aware kernel fusion method for batch PMUL operations, which aims to reduce
memory access frequency and enhance data locality (Section 3.4).
Figure 3: Different mechanisms of batch modular inversion on GPU and the corresponding runtime execution. (a) The throughput bottleneck on each GPU Streaming Processor with data parallelism: the threads T_1 ... T_32 of a warp each run their own compress, inverse, and decompress steps on their accumulated products K_1 ... K_32, and the per-thread modinv operations diverge and execute serially within the warp. (b) gECC's runtime execution on each GPU Streaming Processor with the GAS mechanism: threads compress locally, synchronize to gather the accumulated products, a single inverse is applied, and the result is scattered back before the decompress step.
Here, we introduce the design of batch modinv operation in gECC, which serves as a crucial component of the
throughput-centric EC operations. In a GPU, the fundamental scheduling unit is a warp that consists of 32 threads.
These threads run concurrently on a single GPU SP, which houses multiple CUDA cores. For example, NVIDIA A100
contains 432 SPs.
To address Challenge 1, we borrow the wisdom of parallel graph processing systems and adopt the Gather-Apply-
Scatter (GAS) mechanism [34] to reduce the overhead of the modinv operation. Similar to data parallelism, we
evenly distribute the N inputs that require inversion across 32n threads in n warps for parallel processing as shown in
the left part of Fig. 3(b). Each thread then executes the compress step to obtain the final accumulated products, Ki
(1 ≤ i ≤ 32n). The difference lies in the fact that each thread does not directly compute the inverse step to get K_i^{−1}. Instead, it waits for the GAS synchronization to complete, obtaining K_i^{−1}, and then performs the decompress step.
In the gather phase, the accumulated products K_i from all threads in the n/sp warps running on each GPU SP are collected. After applying the compress step again, we obtain the accumulated products K'_j (where j ranges from 1 to sp, and sp is the number of GPU streaming processors). For instance, the accumulated products K_i (where i ranges from 1 to 32n/sp) from Warp_1 to Warp_{n/sp} are compressed using shared memory for inter-thread communication, yielding the result K'_1 = Π_{i=1}^{32n/sp} K_i, as shown in the left part of Fig. 3(b).
During the apply phase, each GPU SP performs a modinv operation in the inverse step on K'_j using one thread, resulting in K'_j^{−1}, as shown in the right part of Fig. 3(b). This effectively avoids the divergence issue caused by multiple threads performing modinv operations simultaneously. Although some CUDA cores in the GPU SPs are idle during this time, this period is very short relative to the entire computation process. In the scatter phase, the K'_j^{−1} obtained by each GPU SP is decompressed to produce the corresponding K_i^{−1}.
Compared to a naive data parallel approach, batch inversion based on the GAS mechanism reduces the number of required modinv operations from 32n to sp, where n is typically several times the value of sp so that computation can overlap the memory access overhead. This effectively reduces the overhead of batch inversion.
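The CUDA kernel below is a simplified, block-level sketch of this GAS workflow over the same toy field (gECC gathers per streaming processor, so it uses sp inverses in total; here one thread per block plays the role of the apply phase). THREADS, PER_THREAD, and the helper names are illustrative, not gECC's API.

```cuda
#include <cstdint>

#define THREADS    128      // block size (illustrative)
#define PER_THREAD 8        // elements inverted by each thread

// Toy field: q must fit in 32 bits so that products fit in 64 bits.
__device__ __forceinline__ uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t q) {
    return (a * b) % q;
}
__device__ uint64_t inv_mod(uint64_t a, uint64_t q) {   // Fermat: a^(q-2) mod q
    uint64_t r = 1, e = q - 2;
    while (e) { if (e & 1) r = mul_mod(r, a, q); a = mul_mod(a, a, q); e >>= 1; }
    return r;
}

// Block-level Gather-Apply-Scatter batch inversion: every thread compresses its
// own inputs, one thread performs the block's single modinv, and the inverses
// are scattered back before each thread decompresses its own chunk.
__global__ void gas_batch_inverse(uint64_t* x, uint64_t q) {
    __shared__ uint64_t K[THREADS];    // per-thread accumulated products
    __shared__ uint64_t pre[THREADS];  // block-level prefix products
    const int tid = threadIdx.x;
    uint64_t* my = x + ((size_t)blockIdx.x * blockDim.x + tid) * PER_THREAD;

    // Compress: accumulate this thread's inputs, keeping the per-element
    // prefixes (the role of the array M in the paper).
    uint64_t pfx[PER_THREAD], acc = 1;
    for (int i = 0; i < PER_THREAD; ++i) { pfx[i] = acc; acc = mul_mod(acc, my[i], q); }
    K[tid] = acc;
    __syncthreads();

    // Gather + Apply: a single thread compresses the per-thread products and
    // performs the only modinv, so no warp diverges inside the inversion.
    if (tid == 0) {
        uint64_t a = 1;
        for (int t = 0; t < blockDim.x; ++t) { pre[t] = a; a = mul_mod(a, K[t], q); }
        uint64_t run = inv_mod(a, q);
        for (int t = blockDim.x - 1; t >= 0; --t) {      // scatter K_t^{-1} in place
            uint64_t kt_inv = mul_mod(run, pre[t], q);
            run  = mul_mod(run, K[t], q);
            K[t] = kt_inv;
        }
    }
    __syncthreads();

    // Decompress: each thread turns its prefix products into element inverses.
    uint64_t inv = K[tid];
    for (int i = PER_THREAD - 1; i >= 0; --i) {
        uint64_t xi_inv = mul_mod(inv, pfx[i], q);
        inv   = mul_mod(inv, my[i], q);
        my[i] = xi_inv;
    }
}
```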
For batch PADD operation, gECC first calculates the λ for all PADD operations using batch modinv and then completes
the computation of Eqn. 3 of each PADD, which are combined with the decompress step of batch λ to reuse the point
data. We call this combined decompress step the DCWPA step. As we discussed in Challenge 2, there are high
memory access overheads due to frequent global memory access. To alleviate this, gECC employs multi-level cache
management to minimize data access overhead, as shown in Fig. 4. gECC reduces the overhead of memory access with
two major techniques. 1) We minimize caching the intermediate data produced in the compress step and recompute
them when needed in the DCWPA step, decreasing the required cache space. 2) gECC classifies the data according to
the computing characteristics and allocates them at the optimized cache level, improving the data access efficiency.
Figure 4: Multi-level cache management for batch PADD. In the compress step, each thread accumulates the λ denominators t_i = X_Pi − X_Ti of its PADD operations into the intermediate product array M; the inverse step gathers the last accumulated products, performs the modinv, and scatters the results back; the DCWPA step recomputes the denominators t_i from the loaded point data, recovers each inverse from M, and completes every PADD via Eqn. 3.
The cost of this recomputation is small compared with the global-memory access of a large integer. Furthermore, the DCWPA step has to load the complete EC point data anyway; therefore, the recomputation does not incur additional data-loading overhead. As shown in the DCWPA step in Fig. 4, operation ➌ recalculates the denominator t of λ based on the loaded point data. Therefore, gECC reduces the stored intermediate data by introducing a little computation overhead. This scheme is also applicable to PDBL operations.
In addition, gECC assigns different data to different caches to align with the computation. We observe that the data generated later in the compress step are used earlier in the inverse step and the DCWPA step. For example, the last result M_5 of the accumulated products for each thread in the compress step is immediately used in the inverse step, and its inverse value M_5^{−1} is the first used in the DCWPA step. To support the gather phase in the inverse step, gECC utilizes
GPU shared memory to cache the last result of the accumulated products for all threads in a GPU block in the compress
step, as shown in the inverse step in Fig. 4. The gather phase requires frequent data access to merge data from different
threads in one block. GPU’s shared memory could meet this demand by enabling data communication between threads
and supporting simultaneous access by multiple threads. Moreover, the shared memory of Nvidia GPUs allows for very
low memory access latency, thus reducing memory access overhead in the inverse step.
gECC also leverages the L2 persistent cache to store all intermediate data of the accumulated products in the compress step. As demonstrated in Fig. 4, the array M, which stores the intermediate result of each accumulated product in the compress step, is actively cached in the L2 persistent cache of the Nvidia A100 GPU. As we discussed in Section 3.1, the intermediate
arrays will occupy tens of megabytes of memory space. Although gECC has reduced the stored intermediate data, the
needed memory space still far exceeds the limited shared memory space. For example, the stored data in the compress step is 32 MB when the scale of batch PADD operations is 2^20. Besides, in the last DCWPA step, we need to first access
the array M to generate the inverse value of the denominator of λ. To meet these requirements, gECC uses the L2
persistent cache with larger capacity and lower access latency to optimize these data accesses. The Nvidia Ampere
architecture supports the L2 cache residency control feature [53], allowing users to customize the use of the large 40 MB L2 cache more efficiently. According to the CUDA programming guide [52], users can configure up to 75% of the L2 cache as a persistent region.
gECC sets aside a portion of the L2 cache as the persistent cache and configures the array M on it, thus achieving lower
latency accesses. When the allocated space exceeds the configured L2 persistent cache, gECC uses the first-in-first-out
(FIFO) cache replacement policy provided by CUDA to align with the calculation of the DCWPA step. We know that
the last result of multiplication in the compress step is first used in the DCWPA step to compute the corresponding
inverse value. The FIFO cache replacement strategy is perfectly suited to such a computational feature, improving the
hit rate of the L2 cache. Finally, the original point data array (e.g., Pi and Ti in Fig. 4), which occupies several hundred
megabytes of space, is stored in the GPU’s global memory.
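For illustration, the host-side snippet below shows how such an L2 persistent region can be reserved and a device buffer holding the array M mapped onto it with CUDA's access-policy-window interface (available since CUDA 11 on Ampere). The function name and the choice of sizes are illustrative, not gECC's actual code.

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Reserve part of the L2 cache as persisting and pin (a prefix of) M to it.
void pin_M_to_l2(cudaStream_t stream, void* d_M, size_t m_bytes) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Set aside up to 75% of the L2 cache for persisting accesses.
    size_t reserve = std::min(m_bytes, (size_t)prop.l2CacheSize * 3 / 4);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, reserve);

    // Attach an access-policy window covering the array M for this stream.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_M;
    attr.accessPolicyWindow.num_bytes =
        std::min(m_bytes, (size_t)prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f;                        // try to keep M resident
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming; // evict normally on miss
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```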
Cache-efficient data layout. gECC adopts a cache-efficient column-major data layout for large integers to achieve efficient concurrent data access, improving the efficiency of global memory access. Usually, a large integer is represented by an integer array on a CPU, as mentioned in Section 2.2. As illustrated in Fig. 5(a), the two coordinate values (X, Y) of the EC point P_i are stored in the two integer arrays X and Y, respectively, each element occupying 32 bits.

Figure 5: Data layouts for 256-bit point coordinates. (a) The conventional per-point (row-major) layout, where the eight 32-bit limbs X_0 ... X_7 and Y_0 ... Y_7 of each point P_i are stored contiguously. (b) The cache-efficient column-major layout, where the j-th limb of all n points is stored contiguously.
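The following kernel sketch shows the access pattern such a column-major layout enables: limb j of point i is assumed to live at X[j·n + i], so consecutive threads in a warp read consecutive 32-bit words and the global-memory loads coalesce. The kernel and array names are hypothetical.

```cuda
#include <cstdint>

#define LIMBS 8                           // 8 x 32-bit limbs = one 256-bit coordinate

__global__ void touch_x_column_major(const uint32_t* __restrict__ X,
                                      uint32_t* __restrict__ out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;   // point index
    if (i >= n) return;
    uint32_t x[LIMBS];
    for (int j = 0; j < LIMBS; ++j)
        x[j] = X[(size_t)j * n + i];      // stride n between limbs of one coordinate
    // ... modular / EC arithmetic on x[] would run here ...
    for (int j = 0; j < LIMBS; ++j)       // write-back is coalesced the same way
        out[(size_t)j * n + i] = x[j];
}
```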
Figure 6: The find-then-recompute method applied to Montgomery's trick: the intermediate product T1·t1 is discarded during the compress step, and it is recomputed from the stored value T1 and the loaded input t1 when the inverses t1^{−1} and t2^{−1} are produced in the decompress step.
According to the double-and-add method (Algorithm 1), for each bit of the scalar, a PADD operation is needed when the current bit of the scalar is one, and a PDBL operation follows the PADD operation when the loop runs from the least significant bit of the scalar to the most significant bit. The input of the PDBL is independent of the output of the PADD, and its output
is used to calculate the next loop. Based on this, gECC proposes data-locality-aware kernel fusion optimization for
batch PMUL operations. The goal of our design is to reduce the memory access frequency and enhance data locality.
Initially, the three steps of batch PADD operation are combined with the corresponding steps of batch PDBL operation.
This native fusion could improve the data locality by reusing the loaded point data but requires more memory space
to store the intermediate data. For example, two intermediate arrays (i.e., the array M ) are necessary for each thread
in the combined compress step: one for the PADD operation and the other for the PDBL operation. The next inverse
step could work on both of them, simply doubling the input size and further merging the inverse calculations into one.
However, the limited shared memory will limit the input
scale, thus decreasing the throughput of batch PMUL
operations.

To achieve the design goal, gECC uses a find-then-recompute method to fuse the two kernels, which reduces memory space and improves data locality. Fig. 6 illustrates an example of applying the find-then-recompute method to Montgomery's trick. This method discards part of the results of the accumulated multiplications to reduce the space occupied by the array M, then recomputes the discarded data from the stored intermediate results and the original inputs. For example, when multiplying t1 and t2 with the previous accumulated product T1, we discard the result T1·t1 of multiplying t1 and T1 and only store the final result T1·T2 of the product of t1, t2, and T1. Then, to get the correct inverse values of t1 and t2 in the decompress step, the values T1, T1·t1, and (T1·T2)^{−1} are needed according to Montgomery's trick. We recompute the intermediate result T1·t1 based on the stored value T1 and the input t1 to complete the remaining computation.

Algorithm 3: Batch UPMUL operation
Input: the (scalar, point) set {(s_0, P_0), (s_1, P_1), ..., (s_{n−1}, P_{n−1})}
Output: {Q_0, Q_1, Q_2, ..., Q_{n−1}}
1: len_s ← the number of bits of the scalar;
2: M[n];
3: for i ← 0 to len_s − 1 do
4:     T1 = 1;
5:     for j ← 0 to n − 1 do
6:         M[j] = T1;
7:         t1 = P_j.Y + P_j.Y;   T1t1 = T1 ∗ t1;
8:         t2 = P_j.X − Q_j.X;   T1T2 = T1t1 ∗ t2;
9:         T1 = T1T2;
10:    end for
11:    T1^{−1} = inverse_step_func(T1);
12:    for j ← n − 1 to 0 do
13:        T1t1 = (P_j.Y + P_j.Y) ∗ M[j];
14:        t2^{−1} = T1^{−1} ∗ T1t1;
15:        λ_PADD = λ_PADD_func(P_j, Q_j, t2^{−1});
16:        R_j = padd_func(λ_PADD, P_j, Q_j);
17:        T1^{−1} = T1^{−1} ∗ (P_j.X − Q_j.X);
18:        t1^{−1} = T1^{−1} ∗ M[j];
19:        λ_PDBL = λ_PDBL_func(P_j, t1^{−1});
20:        T1^{−1} = T1^{−1} ∗ (P_j.Y + P_j.Y);
21:        P_j = pdbl_func(λ_PDBL, P_j);
22:        if s[i] == 1 then
23:            Q_j = R_j
24:        end if
25:    end for
26: end for

Specifically, gECC uses the same variable T1 to multiply with the denominator of λ_PADD and the denominator of λ_PDBL together and stores the last result in the intermediate array M, as shown in lines 6−9 of Algorithm 3. In addition, gECC adjusts the order of the calculation to overlap global memory access with large-integer computation to reduce latency. First, we only load the y-coordinate of the point P_j to calculate the denominator of λ_PDBL. By leveraging the cache-efficient data layout, gECC can aggregate data access to any part of a large integer at a smaller granularity (32 bits). When calculating the denominator of λ_PDBL and then multiplying by T1, gECC loads the x-coordinates of the points P_j and Q_j to overlap the long latency of global memory access with the large-integer multiplication. Then,
we calculate the denominator of λ_PADD and multiply it by T1. Finally, the value of T1 for the point pair (P_j, Q_j) is stored in the array M. Next, the calculation of the inverse step is the same as the one in batch PADD operations. gECC also reduces the latency of the inverse step compared to the native kernel fusion method.
In the DCWPA step, we first complete the calculation of PADD operation and then the calculation of PDBL operation.
gECC gets the inverse value of the denominator of λ_PADD by recomputing the discarded values, namely the denominator t1 of λ_PDBL and the value T1t1, as shown in lines 13−15 of Algorithm 3. Then, it completes the computation of
Eqn. 3 to obtain the result Rj of PADD operations and updates the value Qj using Rj according to the j-th bit value of
the scalar. At last, just like handling PADD operation, gECC calculates the value of λP DBL and then performs the
PDBL’s calculation. Despite introducing additional computation overhead (i.e., one modmul operation) to get the
correct inverse values, the benefits of the discard-then-recompute scheme far outweigh it. This scheme fully utilizes the
point data locality of the integrated PADD and PDBL calculation process to reduce data access, thus improving the
performance of batch PMUL operations.
Our design ensures constant time for the EC operations to realize side-channel protection [54]. To prevent any leakage of timing information related to the secret scalar in the PMUL operation, instead of omitting unnecessary operations, the PMUL operation conducts a dummy operation with the point-at-infinity when required (for instance, zeros of the scalar are absorbed without revealing timing information, as shown in Algorithm 3).
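A common building block for this kind of constant-time behavior is a branchless conditional copy for the final update of Q_j (lines 22–24 of Algorithm 3). The helper below is a generic sketch of that pattern; gECC's actual constant-time implementation may differ.

```cpp
#include <cstdint>

// Branchless conditional copy: overwrite q with r only when bit == 1, while
// touching both operands identically in either case.
static inline void ct_select(uint64_t* q, const uint64_t* r, unsigned bit, int limbs) {
    uint64_t mask = 0ULL - (uint64_t)(bit & 1u);     // all ones if bit == 1, else all zeros
    for (int i = 0; i < limbs; ++i)
        q[i] ^= mask & (q[i] ^ r[i]);                // q[i] = bit ? r[i] : q[i]
}
```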
Table 2: GPU Ampere microarchitecture-level SASS instruction issuance analysis

| Arithmetic operation | PTX instructions | SASS operators | Issue rate per warp scheduler (dependent / independent) | Meaning |
|---|---|---|---|---|
| floating-point mul-add | fma.rz.f64 d, a, b, c | DFMA | 1/8 / 1/4 cycle | d = a ∗ b + c |
| floating-point add | add.rz.f64 d, a, b | DADD | 1/8 / 1/4 cycle | d = a + b |
| integer mul-add | madc.cc.lo.u32 (d)lo, a, b, (c)lo ; madc.cc.hi.u32 (d)hi, a, b, (c)hi | IMAD.WIDE.X | 1/8 / 1/4 cycle | cc, (d)lo = (a ∗ b)lo + (c)lo + cc ; (d)hi = (a ∗ b)hi + (c)hi + cc |
| integer add | addc.cc.u32 d, a, b | IADD3.X | 1/4 / 1/2 cycle | cc, d = a + b + cc |
To measure these issue rates, we run a micro-benchmark in which each warp executes a sequence of 10,000 instructions. The 10,000 instructions consist of two SASS instruction types (any two of the types shown in Table 2), arranged in a 1:1 ratio. For instance, an IMAD instruction is followed by a DFMA instruction. The results indicate that when a warp
issues one DFMA followed by one IMAD instruction, the total cycles required are the sum of issuing DFMA or IMAD
alone, which means that although these two types of instructions run in different compute units, they are still launched
at intervals of 4 cycles. Similarly, the interval between the DFMA instruction and the DADD instruction, as well as the
IMAD instruction and the DADD instruction, remains at 4 cycles. However, the interval between the IMAD, DFMA, or
DADD instruction and the IADD3 instruction is 2 cycles.
Theoretically, a 256-bit integer represented with D = 32 necessitates m = 8 32-bit integers for storage. As shown in Algorithm 2, the modmul operation of two 256-bit integers consists of two parts. The integer multiplication requires m^2 = 64 IMAD instructions, and the modular reduction requires m + m^2 = 72 IMAD instructions. This results in a total of 136 IMAD instructions. When considering D = 52 in the solution (DPF) presented in [36], storing a 256-bit integer requires 5 double-precision floating-point numbers. Due to considerations regarding floating-point precision, storing the product of two 52-bit integers necessitates 2 DFMA instructions. So, in the modmul operation of this solution, the integer multiplication demands 2m^2 = 50 DFMA instructions, while the modular reduction requires 100 DFMA instructions and m = 5 IMAD instructions. Moreover, the solution incorporates extra DADD instructions and data conversions to ensure precision.
We used the GPU performance analysis tool NVIDIA Nsight Compute [56] to disassemble the fastest modmul
implementation [35] based on integer instructions. This implementation uses 1, 2, and 4 threads with D = 32, and is
referred to as CGBN-1, CGBN-2, and CGBN-4, respectively. Additionally, we disassembled a solution [36, 31] based
on a mixture of multi-precision floating-point instructions and integer arithmetic instructions, which uses 1 thread and
has a parameter D = 52. This solution is named DPF-1.
Table 3: Instruction analysis of a modmul operation in different solutions for the SM2 curve
The results of the performance test depicted in Fig. 12 indicate the following performance ranking: CGBN-1 outperforms
DPF-1, which is in turn more efficient than CGBN-2 and CGBN-4. As shown in Table 3, the primary performance
bottlenecks for CGBN-4 and CGBN-2 arise from the SHFL instructions necessary for inter-thread communication [52].
These instructions take approximately 5 clock cycles [57] and introduce synchronization overhead within the modmul
operation. DPF-1 has attempted to replace some IMAD instructions with DFMA instructions. Regrettably, despite these two types of instructions operating in different computational units, they cannot be executed concurrently, as confirmed by our benchmark results (Table 2). Furthermore, additional DADD instructions and data conversion instructions have hindered DPF-1 from outperforming the integer-based implementation CGBN-1. Although CGBN-1 has shown impressive performance, a microarchitecture analysis reveals that the actual usage of IMAD instructions is 38% higher than the theoretical prediction, suggesting that there is still potential for further optimization.
In general, the main challenge in speeding up modmul lies in minimizing the number of IMAD instructions and reducing the dependencies between instructions. Additionally, it is important to avoid register bank conflicts, as these can also delay the issuance of instructions [58, 59, 55].
Each operand of these instructions is a 32-bit word, and both carry-in and carry-out are each 1-bit. After compilation, the SASS instruction executed on the hardware is IMAD.WIDE.U32.X R1, P0, R2, R3, R1, P0;. In this instruction, R1, R2, and R3 are general-purpose registers that store c, a, and b, respectively, while P0 is a predicate register used to pass the carry information to another SASS instruction. Improper propagation of carry information can lead to instruction overhead. For example, in the study of [49], if the high-word carry and the low-word carry of a ∗ b are propagated separately, the number of IMAD instructions can be doubled compared to the theoretical prediction.

Figure 7: Divergent carry addition chains for odd and even word-size integers.
Our implementation of multiplication carry propagation, akin to sppark [50], effectively circumvents such issues. As depicted in Fig. 7, the product of integers A and B can be computed by generating m row-by-row multiplications with the mad_n function, specifically A ∗ B_i, where i ranges from 0 to m − 1. The carry from the high word of A_j ∗ B_i can be passed as the carry input to the low word of A_{j+2} ∗ B_i, where j ranges from 0 to m − 1. This chain includes the products of all even elements of A with B_i and avoids the extra addition instructions caused by the conflict that arises when the carry from the high word of A_j ∗ B_i carries over into the high word of A_{j+1} ∗ B_i. This propagation of the carry also applies to the remaining odd elements of A; for example, the carry from the high word of A_{j+1} ∗ B_i can be passed as the carry input to the low word of A_{j+3} ∗ B_i.
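A minimal inline-PTX sketch of one such even-element chain for m = 4 is shown below; the function name and the fixed operand layout are illustrative. The carry is threaded through the condition code so that, as Table 2 indicates, the mad/madc pairs can be lowered to IMAD.WIDE.X SASS instructions with the carry held in a predicate register.

```cuda
#include <cstdint>

// One even-element chain of a row multiplication C += {A0, A2} * b (cf. Fig. 7):
// the carry out of hi(A0*b) feeds the carry in of lo(A2*b), so no limb is
// written twice inside the chain.
__device__ void mad_n_even(uint32_t C[5], const uint32_t A[4], uint32_t b) {
    asm("mad.lo.cc.u32  %0, %5, %7, %0;\n\t"   // C0 += lo(A0*b)        , set carry
        "madc.hi.cc.u32 %1, %5, %7, %1;\n\t"   // C1 += hi(A0*b) + carry, set carry
        "madc.lo.cc.u32 %2, %6, %7, %2;\n\t"   // C2 += lo(A2*b) + carry, set carry
        "madc.hi.cc.u32 %3, %6, %7, %3;\n\t"   // C3 += hi(A2*b) + carry, set carry
        "addc.u32       %4, %4, 0;"            // absorb the final carry into C4
        : "+r"(C[0]), "+r"(C[1]), "+r"(C[2]), "+r"(C[3]), "+r"(C[4])
        : "r"(A[0]), "r"(A[2]), "r"(b));
}
```

A second, analogous chain over the odd elements A1 and A3 would accumulate into a separate array, which is the "two mad_n functions" arrangement referred to in the next subsection.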
4.1.3 Reorder IMAD instructions to reduce register moves and register bank conflicts.
After customizing the one row-by-row multiplication with two mad_n functions, we still need to pay attention to the IMAD order when applying it to the integer multiplication stage of the modmul operation, because register moves and bank conflicts will also affect the instructions' issuance rate. When processing m row-by-row multiplications in the integer multiplication of the modmul operation, we modify the write-back registers of the IMAD instruction, as shown in Fig. 8. Specifically, the product results of the even elements in A ∗ B[0] are accumulated in the C array, while those of the even elements in A ∗ B[1] are accumulated in the D array. This adjustment offers the advantage of eliminating the need to separately read the high word (C_0)_hi of the C_0 register and the low word (C_1)_lo of the C_1 register, thus reducing the movement and reading operations of the registers.

Figure 8: General modmul algorithm for m = 4. (1) Integer multiply: the row products mad_n(C, A, B0), mad_n(C+1, A+1, B1), mad_n(C+1, A, B2), and mad_n(C+2, A+1, B3) are accumulated into the C array, and mad_n(D, A+1, B0), mad_n(D, A, B1), mad_n(D+1, A+1, B2), and mad_n(D+1, A, B3) into the D array. (2) Modular reduce:
1: for (size_t i = 0; i < m/2; i += 1) do
2:     Mi = (Ci)lo ∗ q_inv;
3:     mad_n(C+i, q, Mi);
4:     mad_n(D+i, q+1, Mi);
5:     (carry-out, (Di)lo) = (Di)lo + (Ci)hi;    // keep the carry and pass it to (Ci+1)lo
6:     Mi+1 = (Di)lo ∗ q_inv;
7:     mad_n(C+i+1, q+1, Mi+1);
8:     mad_n(D+i, q, Mi+1);
9:     (carry-out, (Ci+1)lo) = (Ci+1)lo + (Di)hi; // keep the carry and pass it to (Di+1)lo
end for
(3) Merge: the shifted C and D arrays are added with IADD3 instructions to produce the final result.
During the modular reduction phase of the modmul operation, as shown in Algorithm 2, while accumulating the product of q ∗ M_i in each iteration, sppark ensures result accuracy by shifting the C or D array 64 bits to the right, such as C_i = C_{i+1} + q ∗ M_i. However, the corresponding SASS instructions require the IMAD instruction to access four
registers, potentially causing a one-cycle delay in instruction issuance due to register bank conflicts [58, 59, 55]. We employ an in-place write-back strategy, namely C_i = C_i + q ∗ M_i, which allows the IMAD instruction to access only three registers. To ensure that the results of the C and D arrays are shifted 64 bits to the right in each iteration without increasing the number of IMAD instructions as CGBN does, we utilize predicate registers to store carries, as shown in phase (2) of Fig. 8. For instance, when i = 0, the value of (C_0)_hi is added to (D_0)_lo, and the carry-out is stored in the predicate register to be passed to (C_1)_lo in the subsequent mad_n function. Subsequently, the IADD3 instruction is employed to accumulate the C and D arrays to produce the final result in the merge phase of Fig. 8. Upon disassembly of our design, the number of IMAD instructions in a modmul operation aligns with the theoretical prediction.
To further reduce the number of IMAD instructions, we observe that, for the SM2 curve, the IADD3 instruction, which has a higher issuance rate, can be used to replace IMAD instructions. The unique form of the SCA-256 prime modulus on the SM2 curve facilitates a distinct modular reduction phase. On the SM2 curve, q = 2^256 − 2^224 − 2^96 + 2^64 − 1, and therefore the q_inv of q in Algorithm 2 is equal to 1. A previous work [60], an ASIC-based accelerator for the SM2 curve, also utilizes this property. However, its design is not portable to GPUs, as it introduces a multitude of intermediate results that require more registers and introduces many additional addition operations.
So, in the middle part of Fig. 8, q_inv = 1 is specified in lines 2 and 6, rendering M_i equivalent to (C_i)_lo or (D_i)_lo. Incorporating the value of q into the reduction phase, the mad_n calls in lines 3, 4, 6, and 7 can be simplified to addition instead of multiplication operations. For example, at iteration i = 0, lines 2 to 4 can be simplified using Eqn. 4. Therefore, 8 subtractions are required to construct q ∗ (C_0)_lo, and 10 additions are required to update the C array (from (C_0)_lo to (C_4)_hi).
C = C + (C_0)_lo ∗ q
  = Σ_{i=0}^{7} ((C_i)_lo + (C_i)_hi ∗ 2^32) ∗ 2^{64i} + (2^256 − 2^224 − 2^96 + 2^64 − 1) ∗ (C_0)_lo
  = (C_0)_hi ∗ 2^32 + ((C_1)_lo + (C_0)_lo) ∗ 2^64 + ((C_1)_hi − (C_0)_lo) ∗ 2^96 + (C_2)_lo ∗ 2^128 + (C_2)_hi ∗ 2^160
    + (C_3)_lo ∗ 2^192 + ((C_3)_hi − (C_0)_lo) ∗ 2^224 + ((C_4)_lo + (C_0)_lo) ∗ 2^256 + (C_4)_hi ∗ 2^288
    + Σ_{i=5}^{7} ((C_i)_lo + (C_i)_hi ∗ 2^32) ∗ 2^{64i}    (4)
After that, both the array D and (C_0)_hi, required by line 5, remain unchanged, allowing lines 6 to 8 to be simplified in a similar manner. Consequently, in each iteration of the reduction phase, 8 ∗ 2 subtractions are still necessary to construct (C_i)_lo ∗ q and (D_i)_lo ∗ q, and 10 ∗ 2 additions are required to update the arrays C and D.
Nonetheless, there is potential for further optimization. As depicted in Fig. 8, the merge phase of modmul does not necessitate precise values for C_0, C_1 and D_0, D_1, but merely requires accurate carry information. Hence, we can simplify the computation from line 2 to line 8 when i = 0 using Eqn. 5. In the calculation of tmp, we employ one addition operation and use the predicate register to store the carry information. When computing (C_i)_lo ∗ q + ((C_i)_hi + (D_i)_lo) ∗ q, we require an additional eight 32-bit integers and seven subtractions with IADD3 instructions, as we can disregard the specific value of the lower 64 bits. When updating the array C', we also ignore the lower 64 bits and simply add the carry information stored in the predicate register, which can be accomplished through nine additions with IADD3 instructions. Compared to the previous work in [60], which requires 176 addition and subtraction instructions, we only need (1 + 7 + 9 + 1) ∗ 4 = 72 addition and subtraction instructions to complete the reduction phase.
5 Experimental Evaluation
5.1 Methodology
Experimental Setup: Our experiments are conducted on a virtual machine with an Intel Xeon processor (Skylake,
IBRS) and 72 GB DRAM, running Ubuntu 18.04.2 with CUDA Toolkit version 12.2. It is equipped with one NVIDIA
Ampere A100 GPU (with 40 GB memory) and one NVIDIA Volta V100 GPU (with 32 GB memory). Because the optimization of EC operations in gECC relies on the L2 cache persistence feature introduced with the A100 GPU, we only evaluate the throughput of the ECC algorithms (Section 5.2), EC operations (Section 5.3), and the application (Section 5.5) on the A100 GPU. Because the optimization of the modmul operation in gECC is general and does not rely on this feature, we evaluate it on both the V100 and A100 GPUs (Section 5.4). We use random synthetic data generated by RapidEC based on the SM2 curve to evaluate the performance of the ECC algorithms, EC operations, and the modmul operation. We use real workload data to evaluate the application performance.
Baselines: We compare gECC with the state-of-the-art method for GPU-accelerated EC operations, called Rapi-
dEC [33].
Performance Metrics: We focus on enhancing the throughput of the GPU programs. Therefore, for the ECDSA
algorithm, we use the number of signatures generated/verified per second (i.e., signature/s) as the performance metric for
the signature generation and verification phases, respectively. The ECDH algorithm comprises two FPMUL operations
and a single key value comparison. Given that the time taken for the latter is negligible, the throughput of the ECDH
algorithm is dominated by the throughput of the FPMUL operation. Therefore, we only report the throughput value of
the FPMUL operation. Specifically, we record the calculation time t when performing a batch of n signature generations, verifications, and FPMUL operations, respectively, and calculate the throughput as n/t, which is consistent with RapidEC.
We evaluate the end-to-end throughput of signature generation and verification based on the SM2 curve, respectively.
Table 4: Throughput performance of ECDSA on NVIDIA A100 based on the SM2 curve

| Solution | Signature Generation (signature/s) | Signature Verification (signature/s) |
|---|---|---|
| RapidEC | 3,386,544 | 773,481 |
| gECC | 14,141,978 | 4,372,853 |
Signature Generation. We examine the throughput of signature generation in the ECDSA algorithm. As shown in
Table 4, the throughput of signature generation in gECC is 14,141,978 signature/s, which achieves 4.18× speedup
compared to the corresponding part of RapidEC. The end-to-end performance advantage of gECC can be attributed to
its comprehensive optimization from the low-level modmul to cryptographic operation.
To take a closer look, we investigate the improvement brought about by the major optimization techniques of gECC for the generation of signatures. This generation mainly contains a modinv operation and an FPMUL operation. First, to verify the improvement brought about by our optimized modular arithmetic, we re-implement the RapidEC solution with it ("RapidEC+OM").

Figure 9: Breakdown analysis of signature generation (normalized throughput of RapidEC, RapidEC+OM, RapidEC+OM+BI, and gECC).
The throughput of signature verification is 4,372,853 signature/s, which achieves 5.65× speedup compared to RapidEC.

We provide a breakdown analysis of the signature verification process as follows. First, we re-implemented the RapidEC solution with our optimized modular arithmetic ("RapidEC+OM"). Then, we evaluate an alternative solution ("RapidEC+OM+BF") that also accelerates the FPMUL operation with the optimization proposed in Section 3.2. Finally, "gECC" contains all our proposed optimizations, including the data-locality-aware kernel fusion (Section 3.4) and memory management (Section 3.3) optimizations. As shown in Fig. 10, the breakdown analysis shows that RapidEC+OM achieves 3.2× speedup against RapidEC. Signature verification contains more calculations than signature generation, so our optimized modular arithmetic provides a more significant speedup compared to RapidEC. With the accelerated FPMUL operation, RapidEC+OM+BF further improves the throughput by 17% over RapidEC+OM. By accelerating the UPMUL operation, gECC achieves another 54% improvement over RapidEC+OM+BF.

Figure 10: Breakdown analysis of signature verification (normalized throughput of RapidEC, RapidEC+OM, RapidEC+OM+BF, and gECC).
This set of experiments evaluates the throughput of PMUL based on the SM2 curve.
Table 5: Throughput performance of PMUL on NVIDIA A100 GPU based on the SM2 curve

| Batch size per GPU SP | FPMUL op/s (RapidEC) | FPMUL op/s (gECC) | UPMUL op/s (RapidEC) | UPMUL op/s (gECC) |
|---|---|---|---|---|
| 2^11 | 3,928,144 | 11,723,016 (2.98×) | 1,476,365 | 6,270,010 (4.25×) |
| 2^12 | 3,935,789 | 13,005,468 (3.30×) | 1,405,325 | 6,561,449 (4.67×) |
| 2^13 | 3,939,345 | 13,755,865 (3.49×) | 1,427,610 | 6,689,186 (4.69×) |
| 2^14 | 3,810,161 | 14,389,168 (3.78×) | 1,464,398 | 6,766,892 (4.62×) |
| 2^15 | 3,863,144 | 14,385,871 (3.72×) | 1,471,910 | 6,837,619 (4.65×) |
| 2^16 | 3,906,517 | 14,299,320 (3.66×) | 1,477,929 | 6,732,287 (4.56×) |
Fixed Point Multiplication (FPMUL). First, we evaluate the throughput of the FPMUL operation. As shown in
Table 5, gECC achieves an average speedup of 3.16× compared to RapidEC. To ensure GPU streaming processors load
balancing, we set the batch size by the product of the number of GPU stream processors (432) and the amount of data
processed by each processor. As the batch size increases, the throughput of FPMUL operations in gECC gradually
increases until it saturates the GPU. This is because, with a larger batch size, Montgomery’s trick can bring about
a greater degree of reduction in the computational complexity. As shown in Fig. 3(b), the total duration of FPMUL
computation in gECC comprises three parts: compress step, GAS step, and DCWPA step. The duration of the compress
and DCWPA steps is linearly proportional to the batch size, while the duration of the GAS step is almost constant
regardless of the batch size (Section 3.2). Therefore, the throughput can increase as the batch size increases. When the batch size reaches about 432 ∗ 2^15, the duration of the GAS part only occupies a small portion of the total elapsed time,
so the overall performance tends to be stabilized as the batch size continues to increase. RapidEC’s throughput remains
almost constant with different batch sizes, as it does not perform any optimization for batch processing.
In the next experiment, we investigate the impact of the major optimization techniques of gECC on the FPMUL operation.
First, we re-implemented the RapidEC’s FPMUL solution with our optimized modular arithmetic (“RapidEC+OM”).
As shown in Fig. 11(a), the throughput of RapidEC+OM is stable over different scales, reaching up to 10,592,834 operations/s. We experiment with two variants of gECC: gECC−MO, a gECC variant without the memory management
mechanism, and gECC−MO−KF, a variant without both the memory management and the data-locality-aware kernel
fusion mechanisms. The stabilized throughput of gECC−MO−KF is lower than RapidEC+OM by 8%. The reason
is that gECC−MO−KF incurs a high overhead of global memory access due to the lack of relevant optimizations.
The throughput of gECC−MO is about 10% higher than that of RapidEC+OM because the kernel fusion optimization
reduces the global memory access frequency and enhances data locality to a certain degree, thus improving the
throughput of batch PADD operations. Finally, the throughput of gECC with all the optimizations can achieve 36%
and 23% improvement over RapidEC+OM and gECC−MO, respectively. This shows that our memory management optimization effectively optimizes global memory access through the column-major data layout and reduces data access latency through multi-level caching.

Figure 11: Breakdown analysis of point multiplication: (a) fixed point multiplication; (b) unknown point multiplication (throughput versus batch size per GPU stream processor for RapidEC, RapidEC+OM, gECC−MO−KF, gECC−MO, and gECC).
Unknown Point Multiplication (UPMUL). This experiment examines the throughput of the UPMUL operation, the
most time-consuming operation in ECC. As shown in Table 5, gECC achieves, on average, a 4.36× speedup over RapidEC. RapidEC is implemented using the NAF (non-adjacent form) algorithm [39], which converts a scalar into
its NAF representation with signed bits to reduce the number of PADD operations. However, when multiple UPMUL
operations are performed concurrently in a GPU Streaming Processor, it leads to severe warp divergence issues due to
the different NAF values, resulting in higher inter-thread synchronization costs than gECC.
We further investigate the improvement brought about by the optimizations of gECC for the UPMUL operation. Simi-
larly, we re-implemented RapidEC’s UPMUL operation using the NAF algorithm with our optimized modular arithmetic
(“RapidEC+OM”). We also experiment with the same two variants of gECC, gECC−MO and gECC−MO−KF. As
shown in Fig. 11(b), the stabilized throughput of gECC−MO−KF is 45% higher than that of RapidEC+OM. The reason
is that the benefit of decreasing the computational complexity brought about by Montgomery’s trick dominates the
increased overhead of global memory access. gECC−MO brings about 20% improvement and the memory management
achieves a further 9% improvement.
operation, referred to as CGBN-1 and CGBN-2, respectively. SOS does not make special modular reduction optimizations for the SCA-256 prime modulus and is a standard modmul implementation suitable for any prime modulus. Then, the full gECC ("gECC") adopts the fast reduction for the SCA-256 prime modulus (Section 4.2).

Figure 12: Throughput analysis of the modmul operation.
As shown in Fig. 12, the gECC on SCA-256 prime modulus achieves 1.63× and 1.72× speedup against CGBN-1, and
2.68× and 2.58× speedups against CGBN-4 on the V100 and A100 GPU, respectively. To take a deeper look, we
investigate the improvement brought by the major gECC proposals for the modmul operation. First, SOS, gECC, and
16
A PREPRINT - JANUARY 8, 2025
CGBN-1 avoid the inter-thread communication overhead in the CGBN-4, and significantly improve the throughput
performance. SOS achieves 1.17× speedup against CGBN-1 on A100 GPU. As a highly optimized assembly code
implementation, SOS can minimize number of IMAD instructions and reduce stall caused by register bank conflict
and register move. gECC reduce can deliver an additional 1.39× performance, since we use as few additions and
subtractions as possible to replace the IMAD operations in the reduce phase.
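To make the SOS baseline concrete, the sketch below implements Montgomery multiplication in the separated operand
scanning form [42, 43] with four 64-bit limbs. It is a portable host-side illustration, not gECC’s GPU kernel; the limb
width, names, and carry handling are chosen for clarity. Each inner-loop iteration is a limb-level multiply-add, which is
the work that compiles to IMAD instructions on the GPU; the reduction loop in Step 2 is the reduce phase that gECC
targets with its SCA-256 fast reduction and its replacement of IMADs by addition and subtraction instructions.

    // SOS Montgomery multiplication sketch: r = a * b * 2^{-256} mod p.
    // Requires p odd, a, b < p, and n0_inv = -p[0]^{-1} mod 2^64.
    #include <cstdint>

    using u64  = uint64_t;
    using u128 = unsigned __int128;
    constexpr int N = 4;                           // 4 limbs * 64 bits = 256 bits

    void mont_mul_sos(u64 r[N], const u64 a[N], const u64 b[N],
                      const u64 p[N], u64 n0_inv) {
        u64 t[2 * N + 1] = {0};

        // Step 1: schoolbook product t = a * b (the "separated" multiplication).
        for (int i = 0; i < N; ++i) {
            u64 carry = 0;
            for (int j = 0; j < N; ++j) {
                u128 acc = (u128)a[j] * b[i] + t[i + j] + carry;
                t[i + j] = (u64)acc;
                carry    = (u64)(acc >> 64);
            }
            t[i + N] = carry;
        }

        // Step 2: Montgomery reduction, one limb of t at a time (the reduce phase).
        for (int i = 0; i < N; ++i) {
            u64 m = t[i] * n0_inv;                 // wraps mod 2^64 by design
            u64 carry = 0;
            for (int j = 0; j < N; ++j) {
                u128 acc = (u128)m * p[j] + t[i + j] + carry;
                t[i + j] = (u64)acc;
                carry    = (u64)(acc >> 64);
            }
            for (int k = i + N; carry != 0 && k < 2 * N + 1; ++k) {
                u128 acc = (u128)t[k] + carry;     // propagate the final carry
                t[k]  = (u64)acc;
                carry = (u64)(acc >> 64);
            }
        }

        // Step 3: take the upper half and conditionally subtract p once.
        for (int j = 0; j < N; ++j) r[j] = t[j + N];
        bool ge = (t[2 * N] != 0);
        if (!ge) {
            int j = N - 1;
            while (j > 0 && r[j] == p[j]) --j;
            ge = r[j] >= p[j];
        }
        if (ge) {
            u64 borrow = 0;
            for (int j = 0; j < N; ++j) {
                u128 diff = (u128)r[j] - p[j] - borrow;
                r[j]   = (u64)diff;
                borrow = (u64)((diff >> 64) & 1);
            }
        }
    }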
In the final set of experiments, we investigate the end-to-end effect of gECC on the overall performance of ECC
applications. ECDSA is widely used in blockchain systems and blockchain databases for user identity authorization
and for verifying transaction security and integrity. For instance, in a blockchain database, every modification to the
database is recorded as a transaction, and ECDSA signatures safeguard the data from malicious tampering. We evaluate
how gECC accelerates the ECDSA algorithm of FISCO-BCOS [61], a real-world permissioned blockchain system.
We run the blockchain on a four-node cluster, where each node has a 2.40 GHz Intel Xeon CPU and 512 MB of memory.
Without gECC, the system throughput is approximately 5,948 transactions per second, and the time breakdown reveals
that ECDSA signature generation and verification account for 37.2% of the total time. After applying gECC, the
throughput reaches 9,313 transactions per second, a performance improvement of 1.56×. Although our ECDSA
implementation itself is much faster, the end-to-end gain is bounded by other factors, such as consensus and hash
computations.
6 Conclusion
ECC has a lower computational complexity and a smaller key size compared to RSA, making it competitive for digital
signatures, blockchain, and secure multi-party computation. Despite its superiority, ECC remains the bottleneck in
these applications because the EC operations in ECC are still time-consuming, which makes it imperative to optimize
their performance. In this paper, we study how to optimize a batch of EC operations using the GAS mechanism
and Montgomery’s Trick on GPUs. We propose locality-aware kernel fusion optimization and design multi-level
cache management to minimize the memory access overhead incurred by frequent data access for point data and the
intermediate results when batching EC operations. Finally, we optimize modmul, the operation performed most
frequently in all types of EC operations. Our results reveal that gECC significantly improves the parallel execution
efficiency of batch EC operations and achieves much higher throughput than the state of the art.
References
[1] Peter L Montgomery. Speeding the pollard and elliptic curve methods of factorization. Mathematics of computation,
48(177):243–264, 1987.
[2] Victor S Miller. Use of elliptic curves in cryptography. In Conference on the theory and application of
cryptographic techniques, pages 417–426. Springer, 1985.
[3] Neal Koblitz. Elliptic curve cryptosystems. Mathematics of computation, 48(177):203–209, 1987.
[4] Cong Yue, Tien Tuan Anh Dinh, Zhongle Xie, Meihui Zhang, Gang Chen, Beng Chin Ooi, and Xiaokui Xiao.
Glassdb: An efficient verifiable ledger database system through transparency. arXiv preprint arXiv:2207.00944,
2022.
[5] Zerui Ge, Dumitrel Loghin, Beng Chin Ooi, Pingcheng Ruan, and Tianwen Wang. Hybrid blockchain database
systems: design and performance. Proceedings of the VLDB Endowment, 15(5):1092–1104, 2022.
[6] Dumitrel Loghin. The anatomy of blockchain database systems. Data Engineering, page 48, 2022.
[7] Xinying Yang, Yuan Zhang, Sheng Wang, Benquan Yu, Feifei Li, Yize Li, and Wenyuan Yan. Ledgerdb: A
centralized ledger database for universal audit and verification. Proceedings of the VLDB Endowment, 13(12):3138–
3151, 2020.
[8] Meihui Zhang, Zhongle Xie, Cong Yue, and Ziyue Zhong. Spitz: a verifiable database system. Proc. VLDB
Endow., 13(12):3449–3460, August 2020.
[9] Simon Blake-Wilson, Nelson Bolyard, Vipul Gupta, Chris Hawk, and Bodo Moeller. Elliptic curve cryptography
(ecc) cipher suites for transport layer security (tls). Technical report, 2006.
[10] Don Johnson, Alfred Menezes, and Scott Vanstone. The elliptic curve digital signature algorithm (ecdsa).
International journal of information security, 1:36–63, 2001.
[11] Than Myo Zaw, Min Thant, and S. V. Bezzateev. Database security with aes encryption, elliptic curve encryption
and signature. In 2019 Wave Electronics and its Application in Information and Telecommunication Systems
(WECONF), pages 1–6, 2019.
[12] Pradeep Suthanthiramani, Sannasy Muthurajkumar, Ganapathy Sannasi, and Kannan Arputharaj. Secured data
storage and retrieval using elliptic curve cryptography in cloud. Int. Arab J. Inf. Technol., 18(1):56–66, 2021.
[13] Benny Pinkas, Mike Rosulek, Ni Trieu, and Avishay Yanai. Spot-light: lightweight private set intersection from
sparse ot extension. In Advances in Cryptology–CRYPTO 2019: 39th Annual International Cryptology Conference,
Santa Barbara, CA, USA, August 18–22, 2019, Proceedings, Part III 39, pages 401–431. Springer, 2019.
[14] Quanyu Zhao, Yuan Zhang, Bingbing Jiang, Heng Wang, Yunlong Mao, and Sheng Zhong. Unbalanced private
set intersection with linear communication complexity. SCIENCE CHINA Information Sciences, 67(3):132105,
2024.
[15] Yibiao Lu, Zecheng Wu, Bingsheng Zhang, and Kui Ren. Efficient secure computation from sm series cryptography.
Wireless Communications and Mobile Computing, 2023(1):6039034, 2023.
[16] Yuanyuan Li, Hanyue Xiao, Peng Han, and Zhihao Zhou. Practical private intersection-sum protocols with good
scalability. In Jianming Zhu, Qianhong Wu, Yong Ding, Xianhua Song, and Zeguang Lu, editors, Blockchain
Technology and Application, pages 49–63, Singapore, 2024. Springer Nature Singapore.
[17] Licheng Wang, Xiaoying Shen, Jing Li, Jun Shao, and Yixian Yang. Cryptographic primitives in blockchains.
Journal of Network and Computer Applications, 127:43–58, 2019.
[18] Yupeng Zhang, Jonathan Katz, and Charalampos Papamanthou. Integridb: Verifiable sql for outsourced databases.
In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1480–
1491, 2015.
[19] HweeHwa Pang, Jilian Zhang, and Kyriakos Mouratidis. Scalable verification for outsourced dynamic databases.
Proceedings of the VLDB Endowment, 2(1):802–813, 2009.
[20] Amazon. Amazon quantum ledger database, 2019. https://aws.amazon.com/qldb/.
[21] Panagiotis Antonopoulos, Raghav Kaushik, Hanuma Kodavalla, Sergio Rosales Aceves, Reilly Wong, Jason
Anderson, and Jakub Szymaszek. Sql ledger: Cryptographically verifiable data in azure sql database. In
Proceedings of the 2021 international conference on management of data, pages 2437–2449, 2021.
[22] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system. 2008.
[23] Gavin Wood et al. Ethereum: A secure decentralised generalised transaction ledger. Ethereum project yellow
paper, 151(2014):1–32, 2014.
[24] Eli Ben Sasson, Alessandro Chiesa, Christina Garman, Matthew Green, Ian Miers, Eran Tromer, and Madars
Virza. Zerocash: Decentralized anonymous payments from bitcoin. In 2014 IEEE symposium on security and
privacy, pages 459–474. IEEE, 2014.
[25] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David
Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, et al. Hyperledger fabric: a distributed
operating system for permissioned blockchains. In Proceedings of the thirteenth EuroSys conference, pages 1–15,
2018.
[26] Minghua Qu. Sec 2: Recommended elliptic curve domain parameters. Certicom Res., Mississauga, ON, Canada,
Tech. Rep. SEC2-Ver-0.6, 1999.
[27] National Institute of Standards and Technology (NIST). Digital signature standard (dss). Federal Information
Processing Standards (FIPS) Publication 186-4, 2024. https://nvlpubs.nist.gov/nistpubs/fips/nist.
fips.186-4.pdf.
[28] Ang Yang, Junghyun Nam, Moonseong Kim, and Kim-Kwang Raymond Choo. Provably-secure (chinese
government) sm2 and simplified sm2 key exchange protocols. The Scientific World Journal, 2014(1):825984,
2014.
[29] State Cryptography Administration of China (SCA). Public key cryptographic algorithm sm2 based on elliptic
curves, 2016.
[30] Rares Ifrim, Dumitrel Loghin, and Decebal Popescu. Baldur: A hybrid blockchain database with fpga or gpu
acceleration. In Proceedings of the 1st Workshop on Verifiable Database Systems, VDBS ’23, page 19–27, New
York, NY, USA, 2023. Association for Computing Machinery.
[31] Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, and Charles Weems. Dpf-ecc: Accelerating
elliptic curve cryptography with floating-point computing power of gpus. In 2020 IEEE International Parallel
and Distributed Processing Symposium (IPDPS), pages 494–504. IEEE, 2020.
[32] Lili Gao, Fangyu Zheng, Rong Wei, Jiankuo Dong, Niall Emmart, Yuan Ma, Jingqiang Lin, and Charles Weems.
Dpf-ecc: A framework for efficient ecc with double precision floating-point computing power. IEEE Transactions
on Information Forensics and Security, 16:3988–4002, 2021.
[33] Zonghao Feng, Qipeng Xie, Qiong Luo, Yujie Chen, Haoxuan Li, Huizhong Li, and Qiang Yan. Accelerating
elliptic curve digital signature algorithms on gpus. In SC22: International Conference for High Performance
Computing, Networking, Storage and Analysis, pages 1–13. IEEE, 2022.
[34] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: Distributed
Graph-Parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 12), pages 17–30, Hollywood, CA, October 2012. USENIX Association.
[35] NVIDIA. Cgbn: Cuda accelerated multiple precision arithmetic (big num) using cooperative groups, 2018.
https://github.com/NVlabs/CGBN.
[36] Niall Emmart, Fangyu Zheng, and Charles Weems. Faster modular exponentiation using double precision floating
point arithmetic on the gpu. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 130–137.
IEEE, 2018.
[37] OpenSSL Software Foundation. Tls/ssl and crypto library, 2024. https://github.com/openssl/openssl.
[38] Wuqiong Pan, Fangyu Zheng, Yuan Zhao, Wen-Tao Zhu, and Jiwu Jing. An efficient elliptic curve cryptography
signature server with gpu acceleration. IEEE Transactions on Information Forensics and Security, 12(1):111–122,
2017.
[39] Darrel Hankerson, Alfred J Menezes, and Scott Vanstone. Guide to elliptic curve cryptography. Springer Science
& Business Media, 2006.
[40] Long Mai, Yuan Yan, Songlin Jia, Shuran Wang, Jianqiang Wang, Juanru Li, Siqi Ma, and Dawu Gu. Accelerating
sm2 digital signature algorithm using modern processor features. In International Conference on Information and
Communications Security, pages 430–446. Springer, 2019.
[41] Junhao Huang, Zhe Liu, Zhi Hu, and Johann Großschädl. Parallel implementation of sm2 elliptic curve cryptogra-
phy on intel processors with avx2. In Information Security and Privacy: 25th Australasian Conference, ACISP
2020, Perth, WA, Australia, November 30–December 2, 2020, Proceedings 25, pages 204–224. Springer, 2020.
[42] Peter L Montgomery. Modular multiplication without trial division. Mathematics of computation, 44(170):519–
521, 1985.
[43] C Kaya Koc, Tolga Acar, and Burton S Kaliski. Analyzing and comparing montgomery multiplication algorithms.
IEEE micro, 16(3):26–33, 1996.
[44] Daniel J Bernstein and Bo-Yin Yang. Fast constant-time gcd computation and modular inversion. IACR
Transactions on Cryptographic Hardware and Embedded Systems, pages 340–398, 2019.
[45] Benjamin Salling Hvass, Diego F Aranha, and Bas Spitters. High-assurance field inversion for curve-based
cryptography. In 2023 IEEE 36th Computer Security Foundations Symposium (CSF), pages 552–567. IEEE, 2023.
[46] Daniel J. Bernstein, Billy Bob Brumley, Ming-Shing Chen, and Nicola Tuveri. OpenSSLNTRU: Faster post-
quantum TLS key exchange. In 31st USENIX Security Symposium (USENIX Security 22), pages 845–862, Boston,
MA, August 2022. USENIX Association.
[47] Jiankuo Dong, Fangyu Zheng, Niall Emmart, Jingqiang Lin, and Charles Weems. sdpf-rsa: Utilizing floating-point
computing power of gpus for massive digital signature computations. In 2018 IEEE International Parallel and
Distributed Processing Symposium (IPDPS), pages 599–609, 2018.
[48] Fangyu Zheng, Wuqiong Pan, Jingqiang Lin, Jiwu Jing, and Yuan Zhao. Exploiting the potential of gpus for
modular multiplication in ecc. In International Workshop on Information Security Applications, pages 295–306.
Springer, 2014.
[49] Jiankuo Dong, Fangyu Zheng, Juanjuan Cheng, Jingqiang Lin, Wuqiong Pan, and Ziyang Wang. Towards high-
performance x25519/448 key agreement in general purpose gpus. In 2018 IEEE Conference on Communications
and Network Security (CNS), pages 1–9. IEEE, 2018.
[50] Supranational Corp. Zero-knowledge template library, 2024. https://github.com/supranational/sppark.
[51] ZPrize. Accelerating the future of zero-knowledge cryptography, 2023. https://www.zprize.io/.