
gECC: A GPU-Based High-Throughput Framework for Elliptic Curve Cryptography

arXiv:2501.03245v1 [cs.CR] 22 Dec 2024

Qian Xiong (Huazhong University of Science and Technology, Wuhan, China, xiongqian@hust.edu.cn)
Weiliang Ma (Huazhong University of Science and Technology, Wuhan, China, weiliangma@hust.edu.cn)
Xuanhua Shi (Huazhong University of Science and Technology, Wuhan, China, xhshi@hust.edu.cn)
Yongruan Zhou (University of Copenhagen, Copenhagen, Denmark, zhou@di.ku.dk)
Hai Jin (Huazhong University of Science and Technology, Wuhan, China, hjin@hust.edu.cn)
Kaiyi Huang (Huazhong University of Science and Technology, Wuhan, China, kyhuang@hust.edu.cn)
Haozhou Wang (Huazhong University of Science and Technology, Wuhan, China, whz_tec003@hust.edu.cn)
Zhengru Wang (Nvidia, Shanghai, China, zhengruw@nvidia.com)

January 8, 2025

Abstract

Elliptic Curve Cryptography (ECC) is an encryption method that provides security comparable to
traditional techniques like Rivest–Shamir–Adleman (RSA) but with lower computational complexity
and smaller key sizes, making it a competitive option for applications such as blockchain, secure
multi-party computation, and database security. However, the throughput of ECC is still hindered by
the significant performance overhead associated with elliptic curve (EC) operations, which can affect
their efficiency in real-world scenarios. This paper presents gECC, a versatile framework for ECC
optimized for GPU architectures, specifically engineered to achieve high-throughput performance in
EC operations. To maximize throughput, gECC incorporates batch-based execution of EC operations
and microarchitecture-level optimization of modular arithmetic. It employs Montgomery’s trick [1]
to enable batch EC computation and incorporates novel computation parallelization and memory
management techniques to maximize the computation parallelism and minimize the access overhead
of GPU global memory. Furthermore, we analyze the primary bottleneck in modular multiplication by
investigating how the user codes of modular multiplication are compiled into hardware instructions and
what these instructions’ issuance rates are. We identify that the efficiency of modular multiplication
is highly dependent on the number of Integer Multiply-Add (IMAD) instructions. To eliminate
this bottleneck, we propose novel techniques to minimize the number of IMAD instructions by
leveraging predicate registers to pass the carry information and using addition and subtraction
instructions (IADD3) to replace IMAD instructions. Our experimental results show that, for ECDSA
and ECDH, the two commonly used ECC algorithms, gECC can achieve performance improvements
of 5.56× and 4.94×, respectively, compared to the state-of-the-art GPU-based system. In a real-world blockchain
application, we achieve a performance improvement of 1.56× compared to the state-of-the-art CPU-based system.
gECC is completely and freely available at https://github.com/CGCL-codes/gECC.

Keywords elliptic curve cryptography, modular multiplication, GPU acceleration

1 Introduction

Elliptic Curve Cryptography (ECC) [2, 3] is a method for public-key encryption based on elliptic curves. It offers
security on par with conventional public-key cryptographic systems like Rivest–Shamir–Adleman (RSA) but with
a smaller key size and lower computational complexity. Recently, ECC has gained increased attention due to its efficient
privacy protection and fewer interactions in verifiable databases [4, 5, 6, 7, 8].
ECC serves as a powerful encryption tool widely used in areas such as data encryption, digital signatures, blockchain,
secure transmission, and secure multi-party computation. The two most prevalent public-key algorithms based
on ECC are Elliptic Curve Diffie-Hellman (ECDH) [9] for data encryption and Elliptic Curve Digital Signature
Algorithm (ECDSA) [10] for digital signatures. Transport Layer Security (TLS), the preferred protocol for securing 5G
communications, incorporates ECDH in its handshake process [9]. ECDH also plays a crucial role in sensitive data
storage [11, 12] and data sharing, allowing basic database query instructions to be executed without any data information
leakage. This includes Private Set Intersection (PSI) [13, 14, 15] and Private Intersection Sum [16]. ECDSA is widely
employed in blockchain systems to safeguard data integrity and transaction accountability [17]. Currently, numerous
blockchain-databases (referred to as verifiable databases) [4, 5, 6, 7, 8], which combine the properties of blockchains
with those of classic Database Management Systems (DBMS), protect data history against malicious tampering.
Additionally, researchers have developed verifiable SQL [18, 19], protecting the integrity of user data and query execution on
untrusted database providers. As an up-and-coming field approaching commercial application, cloud service providers
such as Amazon [20] and Microsoft [21] offer services that maintain an append-only and cryptographically
verifiable log of data operations.
To enhance the efficiency of ECC, researchers have dedicated significant efforts to designing specialized curve forms
that reduce the computational overhead associated with modular arithmetic in elliptic curve operations. For instance,
blockchain systems, such as Bitcoin [22], Ethereum [23], Zcash [24], and Hyperledger Fabric [25], use the secp256k1
curve [26] and the P-256 curve endorsed by the National Institute of Standards and Technology (NIST) [27] for ECDSA.
In contrast, China maintains the SM2 curve as its standard for electronic authentication systems, key management, and
e-commerce applications [28, 29].
Despite such progress, ECC remains a bottleneck in the performance of these throughput-sensitive applications. In
mainstream server environments, a single fundamental EC operation typically takes more than 6 milliseconds to execute.
A PSI computation that identifies the intersection between two datasets containing millions of items has to perform tens
of millions of EC operations, which can take around 632 seconds to complete [13]. Although researchers have developed
blockchain-database systems with ASICs [30] that accelerate ECDSA to improve transaction throughput, the improvement
over a CPU-based system is limited to at most 12%.
There are recent efforts [31, 32, 33] to employ GPUs to minimize the latency of individual EC operations.
However, achieving the high-throughput requirements of emerging big data applications remains a challenge. To
close the gap, we present a high-throughput GPU-based ECC framework, which is holistically optimized to
maximize the data processing throughput. The optimizations consist of the following three major aspects.

First, we employ Montgomery's trick [1] to batch EC operations and enhance the overall throughput. Existing
GPU-based ECC solutions solely offer interfaces for individual EC operations, such as single-point addition
and single-point multiplication, and therefore they cannot batch EC operations. We reconstruct the entire ECC
framework, called gECC, around batch EC operations, including point multiplication, point addition, and
point doubling. However, the modular inversion operation in the algorithm incurs significant overhead due to
poor parallelism, thus reducing performance. To address the issue, we borrow the wisdom of parallel graph processing
systems and incorporate the Gather-Apply-Scatter (GAS) mechanism [34] to reduce the overhead of the modular
inversion operation. To the best of our knowledge, gECC is the first system to implement batch PMUL operations on a
GPU using Montgomery's trick.

[Figure 1: Overview of the gECC framework. Applications (PSI, blockchain, etc.) are supported by an ECC framework
(ECDSA, ECDH) that maps onto the gECC algorithm layers: a throughput-oriented EC operation layer (batch fixed or
unknown point multiplication with kernel fusion, batch point addition and batch point doubling, assembled from batch
modular inversion with the GAS mechanism), composed over a modular arithmetic layer (add, mul, reduce, inv with
instruction optimization), and onto the GPU hardware's multi-level cache (shared memory, L2 cache, global memory).]


Second, we employ a data-locality-aware kernel fusion optimization and design multi-level cache management to
minimize the memory access overhead incurred by the frequent accesses to point data and the large intermediate
results caused by batching EC operations with Montgomery's trick. Each EC operation inherently requires
access to the point data and the intermediate results twice. These data need hundreds of megabytes of
memory space, which far exceeds the GPU's limited shared memory size. For example, when 2^20 EC point additions
are performed simultaneously, the total size of the data to be temporarily stored is 96 MB and the inputs occupy 128 MB. Our
techniques optimize data access to minimize the pressure on the registers and the GPU's shared memory.
Third, all of the EC's arithmetic operations, including addition, subtraction, and multiplication, are based on modular
arithmetic over a finite field of at least 256 bits. Among these, modular multiplication is both the most time-consuming
and the most frequently performed operation in all kinds of EC operations. Previous studies [31, 32,
33, 35, 36] fall short in evaluating the performance of modular multiplication from the perspective of instruction
issue rates at the microarchitectural level. Furthermore, there is a noticeable absence of arithmetic optimizations for
specific prime moduli in existing work [33, 36]. We identify that the efficiency of modular multiplication is highly
dependent on the number of Integer Multiply-Add (IMAD) instructions. To eliminate this bottleneck, we propose
novel techniques that minimize the number of IMAD instructions by leveraging predicate registers to pass carry
information. In addition, we develop an arithmetic optimization for the SM2 curve, using addition and subtraction
instructions (IADD3) to replace the expensive IMAD instructions.
We implement gECC using CUDA and conduct evaluations on the Nvidia A100 GPU to assess its performance.
Our comparative analysis pits gECC against the leading GPU-based ECC system, RapidEC [33], highlighting the
advancements offered by our framework. On standard cryptographic benchmarks for two commonly
used ECC algorithms, our results demonstrate that gECC achieves a speedup of 4.18× for signature generation and
5.56× for signature verification in ECDSA. Measuring the two key layers of our framework separately, gECC
achieves up to 4.04× and 4.94× speedups for fixed-point multiplication and unknown-point multiplication in the EC
operation layer, and a 1.72× speedup in modular multiplication for the SM2 curve in the modular arithmetic layer. In a
real-world blockchain system, we achieve a 1.56× performance improvement compared to state-of-the-art
CPU-based systems. gECC is completely and freely available at https://github.com/CGCL-codes/gECC.

2 Background and Related Work


2.1 Elliptic Curve Cryptography

ECC is a form of public-key cryptography that relies on the algebraic properties of elliptic curves over finite fields. It is
widely employed in various cryptographic algorithms and protocols, including key exchange, digital signatures, and
others. ECC is also utilized in several standard algorithms, such as ECDSA and ECDH. One of the key features of
ECC is its ability to offer the same level of security as RSA encryption with smaller key sizes, making it particularly
appealing for resource-constrained environments.
The SM2 elliptic curve [29] used in ECDSA applications is defined by Eqn. 1. The parameters a and b are elements
of a finite field Fq, and they establish the specific definition of the curve. A pair (x, y), where x, y ∈ Fq, is considered
a point on the curve if it satisfies Eqn. 1.

y^2 = x^3 + ax + b    (Short Weierstrass curve)    (1)

2.1.1 Elliptic Curve Point Operations


The fundamental EC operations are point addition (PADD) and point doubling (PDBL), which are essential build-
ing blocks for more complex EC operations, such as point multiplication (PMUL), forming the basis for various
cryptographic protocols in ECC.
PADD operation. This operation adds two points, P(xp, yp) and T(xt, yt), on an elliptic curve to obtain the
resulting point R(xr, yr) = (xp, yp) + (xt, yt). The addition is performed using the following two equations: Eqn. 2
calculates λ, and Eqn. 3 calculates the X and Y coordinates of the result point R.

λ = (yp − yt) / (xp − xt)    (2)

xr = λ^2 − xp − xt,    yr = λ(xp − xr) − yp    (3)
PDBL operation. This operation is used when the points P and T are identical. In this case, λ is calculated differently,
as λ = (3xp^2 + a) / (2yp), where a is the parameter of the elliptic curve in Eqn. 1; the rest of the PDBL operation follows
the same steps as the PADD operation using this new value of λ.
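For illustration, the two formulas translate directly into code. The sketch below is ours, not the paper's: it expresses an affine PADD in CUDA in terms of hypothetical field primitives fq_sub, fq_mul, and fq_inv over Fq. gECC's actual kernels avoid the per-point fq_inv by batching it, as Section 3 describes.

struct fq_t { unsigned int limb[8]; };      // one 256-bit field element
__device__ fq_t fq_sub(fq_t a, fq_t b);     // assumed: (a - b) mod q
__device__ fq_t fq_mul(fq_t a, fq_t b);     // assumed: (a * b) mod q
__device__ fq_t fq_inv(fq_t a);             // assumed: a^-1 mod q
struct point_t { fq_t x, y; };              // affine EC point

// Affine PADD following Eqns. 2-3 (assumes P != +-T, so xp != xt).
__device__ point_t padd_affine(point_t p, point_t t) {
    fq_t lam = fq_mul(fq_sub(p.y, t.y), fq_inv(fq_sub(p.x, t.x)));  // Eqn. 2
    point_t r;
    r.x = fq_sub(fq_sub(fq_mul(lam, lam), p.x), t.x);               // Eqn. 3
    r.y = fq_sub(fq_mul(lam, fq_sub(p.x, r.x)), p.y);
    return r;
}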


Table 1: Performance analysis of EC operations on Short Weierstrass curves in different coordinate systems

operation | affine coordinate              | Jacobian coordinate
PADD      | 1 modinv + 3 modmul + 6 modadd | 11 modmul + 6 modadd
PDBL      | 1 modinv + 4 modmul + 5 modadd | 10 modmul + 10 modadd

Performing EC operations on points represented in the original coordinate system (the affine coordinate system), as in
Eqns. 2 and 3, results in lower efficiency. This is mainly due to the time-consuming nature of the modular inversion
when computing 1/(xp − xt), which significantly affects overall performance. Various types of elliptic curve coordinate
systems have been introduced to enhance the computational efficiency of single EC operations. For instance, the
Jacobian coordinate system, used in [33, 37, 38], represents points using triplets (X, Y, Z), where x = X/Z^2 and
y = Y/Z^3 [39]. This coordinate system eliminates the need for costly modular inversion and improves the performance
of single PADD and PDBL operations. Table 1 lists the number of modular multiplications (modmul), modular additions
(modadd), and modular inversions (modinv) involved in both PADD and PDBL under different coordinate systems.
PMUL operation. This operation computes the product of a scalar s, a large integer in a finite field, and an elliptic
curve point P. The result, Q, is a new EC point. As demonstrated in Algorithm 1, the PMUL operation can be computed
by repeated PADD and PDBL operations based on the binary representation of the scalar. There are two types of point
multiplication: fixed-point multiplication (FPMUL) and unknown-point multiplication (UPMUL). For FPMUL, the
point P in Algorithm 1 is known in advance. The common practice [33, 31, 40, 38] is to preprocess the known point to eliminate
the PDBL operation (line 5 in Algorithm 1). UPMUL, on the other hand, is typically processed faithfully following
Algorithm 1.

2.1.2 Acceleration for ECC


As mentioned previously, existing GPU-based solutions [33, 38, 31, 40, 41] have mainly focused on reducing the latency
of individual PADD and PDBL operations to enhance the throughput of PMUL operations in the Jacobian coordinate
system. These solutions typically employ data parallelism, utilizing multi-core processors to process multiple PMUL
operations concurrently.

Algorithm 1: Double-And-Add Algorithm
Input: an elliptic curve point P, a scalar s with bit-width l, s = (s_{l-1}, ..., s_1, s_0)_2
Output: the result point Q
1: for i ← 0 to l − 1 do
2:   if s_i = 1 then
3:     Q = Q + P
4:   end if
5:   P = P + P
6: end for

The solutions above have neglected the potential to enhance throughput by batching PMUL operations using
Montgomery's trick to reduce the number of modinv operations required in the affine coordinate system, which is one
of the contributions of gECC (Section 3).

2.2 Modular Arithmetic on Finite Field

Finite fields have extensive applications in areas such as cryptography, computer algebra, and numerical analysis, and
they play a crucial role in ECC. An integer x belongs to a finite field Fq, that is, x ∈ [0, q). Here, q represents a large
prime modulus with a bit width l that typically spans 256 to 1024. Generally, the large integer x can be decomposed as
x = Σ_{i=0}^{m−1} x_i · D^i, where D symbolizes the base and q ≤ D^m. D is usually set to 2^64 or 2^32, allowing the
array X[m:0] = {x_0 ... x_{m−1}} to be stored in word-size integers.

Algorithm 2: Montgomery multiplication with the SOS strategy
Input: a and b, stored in A[m:0] and B[m:0] respectively, where a, b ∈ Fq and q_inv ≡ −q^−1 mod D
Output: c ≡ a ∗ b mod q, stored in C[2m:m]
1: C[2m:0] = A[m:0] ∗ B[m:0]            ▷ integer multiply
2: for i = 0 to m − 1 do                ▷ modular reduce
3:   M_i = (C[i] ∗ q_inv) & (D − 1);
4:   C[2m:i] += M_i ∗ q;
5: end for
6: return c > q ? (c − q) : c;
In a finite field, the results of modmul, modadd, and modinv remain within the field. The most
time-consuming operations are modmul and modinv, which have about 5× and 500× the latency of modadd, respectively,
on a mainstream server. These operations are the primary targets of our acceleration efforts.
Montgomery multiplication [42] is designed for modmul. By altering the structure of the outer loop, various
optimization strategies such as SOS and CIOS have been extensively explored [43] to reduce read and write operations.
As demonstrated in Algorithm 2 (Montgomery multiplication with the SOS strategy), the modmul operation comprises
two phases: integer multiplication and modular reduction. The multiplication phase involves multiplying the two integers,

resulting in a value within the range of [0, q 2 ). Subsequently, the value is reduced to the interval [0, q) through the
modular reduction phase.
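As a concrete reference for Algorithm 2, the following host-side sketch is our illustration, with 32-bit words (D = 2^32) and m = 8 for a 256-bit modulus; mont_mul_sos and its layout are illustrative, not gECC's API, whose GPU kernels use inline PTX instead of portable C++.

#include <stdint.h>

#define M 8  // number of 32-bit words per 256-bit operand

// Computes c[2M..M] = a * b * 2^(-32*M) mod q (Montgomery form), up to the
// final conditional subtraction of line 6, assuming q_inv = -q^(-1) mod 2^32.
static void mont_mul_sos(uint32_t c[2 * M + 1], const uint32_t a[M],
                         const uint32_t b[M], const uint32_t q[M],
                         uint32_t q_inv) {
    // Phase 1: schoolbook integer multiplication, C[2m:0] = A * B.
    for (int i = 0; i <= 2 * M; i++) c[i] = 0;
    for (int i = 0; i < M; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < M; j++) {
            uint64_t t = (uint64_t)a[j] * b[i] + c[i + j] + carry;
            c[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        c[i + M] = (uint32_t)carry;
    }
    // Phase 2: word-by-word Montgomery reduction, zeroing C[i] each round.
    for (int i = 0; i < M; i++) {
        uint32_t m_i = c[i] * q_inv;           // (C[i] * q_inv) & (D - 1)
        uint64_t carry = 0;
        for (int j = 0; j < M; j++) {
            uint64_t t = (uint64_t)m_i * q[j] + c[i + j] + carry;
            c[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        for (int j = i + M; carry != 0 && j <= 2 * M; j++) {
            uint64_t t = (uint64_t)c[j] + carry;  // propagate into upper words
            c[j] = (uint32_t)t;
            carry = t >> 32;
        }
    }
    // The result lives in c[2M:M]; a final conditional subtraction of q
    // brings it into [0, q).
}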
As discussed previously, researchers have dedicated themselves to studying prime moduli of special forms, which provide
opportunities to effectively reduce the number of arithmetic operations required for the modmul operation. In particular,
the use of moduli with characteristics similar to Mersenne primes proves to be beneficial. For instance, the State
Cryptography Administration (SCA) of China recommends q = 2^256 − 2^224 − 2^96 + 2^64 − 1, known as SCA-256 [29],
which is used in the SM2 curve. On the one hand, the exponents are chosen as multiples of 32, which can
expedite implementation on a 32-bit platform. On the other hand, the q_inv value of this prime in Algorithm 2 is
equal to 1, indicating that there is room for optimization in the modular reduction phase.
The modinv operation can be calculated either via Fermat's little theorem (i.e., x^−1 ≡ x^(q−2) mod q) or via a variant of the
Extended Euclidean Algorithm [44, 45]. The former converts a modinv operation into multiple modmul operations,
and the number of modmul operations can be reduced for a known q, as demonstrated in [33]. The latter
employs binary shift, addition, and subtraction operations instead of modmul operations, which yields a performance
improvement of about 3× over the former. However, due to the presence of multiple branch instructions
in the latter algorithm, it is typically used in CPU-based systems [46] and is not suitable for parallel execution on
GPU-based systems.

2.2.1 Montgomery’s trick for batch inversion


In this section, we review a well-known method, called Montgomery's trick, which compresses n modinv operations
into one modinv operation at the cost of 3n additional modmul operations. There are three key steps in Montgomery's
trick. For example, to calculate a^−1, b^−1, c^−1, and d^−1 from the inputs a, b, c, and d, as shown in Fig. 2, the first
step, called the compress step, merges the inputs through cumulative multiplications (ab = a ∗ b, abc = ab ∗ c,
abcd = abc ∗ d). Then we calculate the inverse of abcd to get (abcd)^−1, called the inverse step. Finally, we obtain
a^−1, b^−1, c^−1, and d^−1 by reversing the compression process (d^−1 = (abcd)^−1 ∗ abc, (abc)^−1 = (abcd)^−1 ∗ d,
c^−1 = (abc)^−1 ∗ ab, (ab)^−1 = (abc)^−1 ∗ c, b^−1 = (ab)^−1 ∗ a, a^−1 = (ab)^−1 ∗ b), called the decompress step.

[Figure 2: The example of Montgomery's trick: (1) compress, (2) inverse, (3) decompress; arrows indicate the
intermediate results and data dependencies.]
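To make the three steps concrete, here is a host-side sketch of Montgomery's trick over a toy 64-bit prime field (our illustration; gECC applies the same steps to 256-bit field elements on the GPU). mul_mod and inv_mod are simple reference helpers (Fermat inversion), not gECC's implementation, and __uint128_t assumes a 64-bit GCC/Clang toolchain. The routine converts n modinv operations into one modinv plus about 3n modmul operations.

#include <cstdint>

static const uint64_t Q = 0xffffffff00000001ULL;   // a 64-bit prime (toy field)

static uint64_t mul_mod(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a * b) % Q);
}
static uint64_t pow_mod(uint64_t a, uint64_t e) {  // square-and-multiply
    uint64_t r = 1;
    for (; e; e >>= 1, a = mul_mod(a, a))
        if (e & 1) r = mul_mod(r, a);
    return r;
}
static uint64_t inv_mod(uint64_t a) {              // Fermat: a^(q-2) mod q
    return pow_mod(a, Q - 2);
}

// Compress, invert once, decompress.
void batch_inverse(const uint64_t *x, uint64_t *inv_x, uint64_t *prefix, int n) {
    prefix[0] = x[0];                               // compress step
    for (int i = 1; i < n; i++) prefix[i] = mul_mod(prefix[i - 1], x[i]);
    uint64_t acc = inv_mod(prefix[n - 1]);          // inverse step (the only modinv)
    for (int i = n - 1; i > 0; i--) {               // decompress step
        inv_x[i] = mul_mod(acc, prefix[i - 1]);     // = x[i]^-1
        acc = mul_mod(acc, x[i]);                   // = (x[0]*...*x[i-1])^-1
    }
    inv_x[0] = acc;
}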

2.2.2 Acceleration for Modular Arithmetic


Extensive research has been conducted on accelerating the modmul operation on GPUs, falling into two categories: one
uses a mixture of multi-precision floating-point instructions and integer instructions, and the other uses only integer
instructions. Since the introduction of independent floating-point units in Nvidia's Pascal GPU architecture, many
studies [31, 32, 36, 47] have replaced integer multiplication with multi-precision floating-point multiplication. Among
them, the fastest implementation is based on D = 2^52, called DPF. Efforts persist to implement modmul using only
integer arithmetic instructions, with notable studies [48, 35, 49, 50] such as CGBN and sppark, which were the fastest in the
ZPrize competition [51], though sppark does not support 256-bit prime moduli due to carry propagation issues. From a
parallelism perspective, implementations of the modmul operation can be single-threaded or multi-threaded. The latter
utilizes inter-thread communication via warp-level primitives [52] to complete a modmul operation collaboratively. For
instance, CGBN and DPF support using 1, 2, 4, 16, or 32 threads.

3 Throughput-oriented Elliptic Curve Operations


In this section, we describe the design of throughput-oriented EC operations based on the affine coordinate system with
Montgomery's trick.

3.1 Opportunities and Challenges of Batching EC Operations

Our key hypothesis is that we can batch and execute EC operations in parallel to enhance throughput. The overhead of
processing a number of primitive operations (PADD and PDBL) can be reduced by employing a more efficient method
based on the affine coordinate system. For N PADD operations, gECC first calculates the λ (Eqn. 2) of each PADD with
one modmul operation and one modinv operation. Specifically, gECC batches the modinv operations of multiple
PADD operations together and processes them with Montgomery's trick, so that the N modinv operations are converted
into one modinv operation and 3N modmul operations. After obtaining the λ of each PADD operation, gECC only
needs another 2 modmul and 6 modadd operations per PADD, following Eqn. 3, to compute the output. In summary, a
batch of N PADD operations requires about one modinv, 6N modmul, and 6N modadd operations in affine coordinates,
whereas N PADD operations require 11N modmul and 6N modadd operations in Jacobian coordinates, as shown in Table 1.
Although a few expensive modinv operations are involved, their cost is amortized over large batch sizes, so the
total overhead of batch PADD operations can be reduced. Theoretically, when the batch size N is greater than 20, the
total overhead of processing a batch of N PADD operations in the affine coordinate system is less than that in the Jacobian
coordinate system; the actual N is much larger than 20 to fully utilize the large-scale parallelism provided by
GPUs. gECC accelerates the processing of a batch of PDBL operations in a similar fashion.
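For concreteness, the break-even point follows from Table 1 together with the relative latencies reported in Section 2.2 (modmul ≈ 5× modadd, modinv ≈ 500× modadd, hence modinv ≈ 100× modmul); the short derivation below is ours, not the paper's:

cost_affine(N) ≈ T_inv + 6N · T_mul + 6N · T_add,    cost_Jacobian(N) ≈ 11N · T_mul + 6N · T_add

cost_affine < cost_Jacobian  ⟺  T_inv < 5N · T_mul  ⟺  N > T_inv / (5 · T_mul) ≈ 100 / 5 = 20.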
Although this idea sounds promising, it is non-trivial to capitalize on its benefits due to the following two challenges.
Challenge 1. The calculation of the batch modinv with Montgomery's trick is inherently sequential, since the modinv
needs the result of the chain of accumulated modmul operations. Distributing the inputs across many threads shortens
the sequential dependency chains in the compress/decompress steps, but it forces us to introduce more of the
time-consuming modinv operations, one per thread. Moreover, the large number of cores on a GPU requires a high
degree of data parallelism to fully utilize its computing power, which introduces many more modinv calculations.
Even worse, these modinv operations are computed serially rather than in parallel on each GPU Streaming Processor (SP),
as demonstrated in the right part of Fig. 3(a): the fastest algorithms [44, 45] for modinv employ numerous branching
conditions to achieve efficient computation, which causes a large number of GPU warp divergences and thus poor
performance. To address this, we carefully devise a parallel workflow for Montgomery's trick in Section 3.2 that
minimizes the number of modinv computations in the inverse step and improves the efficiency of parallel computing.
Challenge 2. When processing a batch of EC operations with Montgomery's trick, the memory access overheads are
much higher than in existing methods such as [33]. Specifically, in the compress step (Fig. 2), we need
to load two EC points for each PADD operation to calculate the numerator and denominator of λ, and then multiply
the denominators of λ of different PADD operations together to compress the required modinv operations, storing
the intermediate products for the subsequent decompress step (Fig. 2). Suppose that we process a batch of millions (e.g.,
2^20) of EC operations to maximize parallelism: the point data and the intermediate arrays would consume 128 MB
and 96 MB of memory, respectively. This far exceeds the capacity of the L1 cache or shared memory of modern
GPUs; for example, the combined L1 cache and shared memory of an NVIDIA A100 GPU totals only about 20 MB. Offloading
these data to global memory would incur huge overhead for reloading them into the cache for computation in both the
compress step and the decompress step (Fig. 2). To tackle this issue, we propose two significant optimizations: first, a
memory management mechanism for batch PADD operations that actively utilizes the multi-level cache hierarchy to cache data
(see Section 3.3); second, a data-locality-aware kernel fusion method for batch PMUL operations, which reduces
memory access frequency and enhances data locality (Section 3.4).
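As a back-of-the-envelope check of these figures (our arithmetic, consistent with the paper's numbers): each PADD reads two affine points of 2 × 32 B = 64 B each and keeps roughly 96 B of intermediate results (e.g., three 256-bit values per operation), so

2^20 PADDs × 2 points × 64 B = 128 MB of inputs,    2^20 × 96 B = 96 MB of intermediates.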
[Figure 3: Different mechanisms of batch modular inversion on a GPU and the corresponding runtime execution.
(a) The throughput bottleneck on each GPU streaming processor under plain data parallelism: every thread T1...T32
of a warp runs its own compress, inverse, and decompress steps, so the per-thread inverse computations
K1^−1...K32^−1 diverge and are serialized within the warp. (b) gECC's runtime execution on each GPU streaming
processor with the GAS mechanism: all threads run the compress step, a synchronization gathers the accumulated
products across the warps via inter-warp communication, one thread applies a single inverse (K′1^−1), and the result
is scattered back so that all threads proceed with the decompress step.]

3.2 Batch Modular Inversion Optimization

Here, we introduce the design of the batch modinv operation in gECC, which serves as a crucial component of the
throughput-centric EC operations. In a GPU, the fundamental scheduling unit is a warp that consists of 32 threads.


These threads run concurrently on a single GPU SP, which houses multiple CUDA cores. For example, NVIDIA A100
contains 432 SPs.
To address Challenge 1, we borrow the wisdom of parallel graph processing systems and adopt the Gather-Apply-
Scatter (GAS) mechanism [34] to reduce the overhead of the modinv operation. Similar to data parallelism, we
evenly distribute the N inputs that require inversion across 32n threads in n warps for parallel processing, as shown in
the left part of Fig. 3(b). Each thread then executes the compress step to obtain its final accumulated product K_i
(1 ≤ i ≤ 32n). The difference lies in the fact that each thread does not directly compute the inverse step to get K_i^−1.
Instead, it waits for the GAS synchronization to complete, obtains K_i^−1, and then performs the decompress step.

In the gather phase, the accumulated products K_i from all threads in the n/sp warps running on each GPU SP are
collected. After applying the compress step again, we obtain the accumulated products K′_j (where j ranges from 1 to sp,
and sp is the number of GPU streaming processors). For instance, the accumulated products K_i (where i ranges from 1
to 32n/sp) from Warp_1 to Warp_{n/sp} are compressed using shared memory for inter-thread communication, yielding
the result K′_1 = Π_{i=1}^{32n/sp} K_i, as shown in the left part of Fig. 3(b).
During the apply phase, each GPU SP performs a modinv operation in the inverse step on K′_j using one thread,
resulting in K′_j^−1, as shown in the right part of Fig. 3(b). This effectively avoids the divergence issue caused by multiple
threads performing modinv operations simultaneously. Although some CUDA cores in the GPU SPs are idle during
this time, the period is very short relative to the entire computation process. In the scatter phase, the K′_j^−1 obtained by
each GPU SP is decompressed to produce the corresponding K_i^−1.
Compared to a naive data-parallel approach, batch inversion based on the GAS mechanism reduces the number of
required modinv operations from 32n to sp, where n is typically several times sp so that computation can overlap the
memory access overhead. This effectively reduces the cost of batch inversion.
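A block-level sketch of this workflow (ours; one GPU thread block stands in for the per-SP grouping, and the gather is done serially by one thread for simplicity) might look as follows, with fq_t, fq_mul, and fq_inv assumed to be provided by the modular arithmetic layer of Section 4. Each thread compresses its own chunk, thread 0 performs the block's single inverse, and every thread then decompresses. Launch with 2 * blockDim.x * sizeof(fq_t) bytes of dynamic shared memory.

struct fq_t { unsigned int limb[8]; };        // 256-bit field element
__device__ fq_t fq_mul(fq_t a, fq_t b);       // assumed: Montgomery modmul
__device__ fq_t fq_inv(fq_t a);               // assumed: single modinv

__global__ void batch_inv_gas(const fq_t *x, fq_t *inv_x, int per_thread) {
    extern __shared__ fq_t smem[];
    fq_t *k = smem;                           // per-thread accumulated products K_i
    fq_t *p = smem + blockDim.x;              // prefix products over the K_i
    int tid = threadIdx.x;
    size_t base = (size_t)(blockIdx.x * blockDim.x + tid) * per_thread;

    // Compress step: inv_x[i] temporarily caches the running product.
    fq_t acc = x[base];
    inv_x[base] = acc;
    for (int i = 1; i < per_thread; i++) {
        acc = fq_mul(acc, x[base + i]);
        inv_x[base + i] = acc;
    }
    k[tid] = acc;
    __syncthreads();

    // Gather + apply: thread 0 compresses the K_i, performs the block's single
    // modinv, and scatters K_i^-1 back into shared memory.
    if (tid == 0) {
        p[0] = k[0];
        for (int t = 1; t < blockDim.x; t++) p[t] = fq_mul(p[t - 1], k[t]);
        fq_t inv = fq_inv(p[blockDim.x - 1]);     // the only modinv in the block
        for (int t = blockDim.x - 1; t > 0; t--) {
            fq_t next = fq_mul(inv, k[t]);        // = (K_0*...*K_{t-1})^-1
            k[t] = fq_mul(inv, p[t - 1]);         // = K_t^-1
            inv = next;
        }
        k[0] = inv;
    }
    __syncthreads();

    // Decompress step: each thread unwinds its own chain with K_tid^-1.
    fq_t kinv = k[tid];
    for (int i = per_thread - 1; i > 0; i--) {
        fq_t next = fq_mul(kinv, x[base + i]);    // = (x[0]*...*x[i-1])^-1
        inv_x[base + i] = fq_mul(kinv, inv_x[base + i - 1]);
        kinv = next;
    }
    inv_x[base] = kinv;
}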

3.3 Batch Point Addition Optimization

For the batch PADD operation, gECC first calculates the λ of all PADD operations using batch modinv and then completes
the computation of Eqn. 3 for each PADD; the latter computation is combined with the decompress step of the batch λ
calculation to reuse the point data. We call this combined decompress step the DCWPA step. As discussed in Challenge 2,
there are high memory access overheads due to frequent global memory accesses. To alleviate this, gECC employs
multi-level cache management to minimize data access overhead, as shown in Fig. 4. gECC reduces the overhead of
memory access with two major techniques. 1) We minimize the caching of intermediate data produced in the compress
step and recompute them when needed in the DCWPA step, decreasing the required cache space. 2) gECC classifies the
data according to their computing characteristics and allocates them at the best-suited cache level, improving data access
efficiency.
[Figure 4: Multi-level cache management for batch PADD. In the compress step, only the X coordinates of the points
P_i and T_i are loaded from global memory to form the denominators t_i = X_{P_i} − X_{T_i} and the accumulated
products m_i, which are stored in the array M in the L2 persistent cache; the last product of each thread is kept in
shared memory for the gather/scatter of the inverse step. In the DCWPA step, the denominators t_i are recomputed
from the reloaded point data, the inverses are peeled off one by one (e.g., t_5^−1 = M_4 ∗ M_5^−1), and the PADD
results R_i are produced via Eqn. 3. Point data P_i/T_i reside in global memory, the array M of accumulated modmul
results in the L2 cache, and the per-thread tails in shared memory.]


To minimize the number of memory accesses when processing PADD operations, gECC simplifies the computation
of the compress step. The core idea is that gECC only accesses the data needed to compute the modular inverse of the
denominator of λ. As shown in the compress step in Fig. 4, we only load each EC point's X coordinate to obtain the
denominator t_i of λ and store the intermediate result m_i of the accumulated products in the array M in the compress
step. The numerator of λ is only used to produce the result of the PADD, so we calculate it in the DCWPA step. It should be
noted that gECC does not store the denominator and the numerator of λ in the intermediate array M for each PADD
but recomputes them in the DCWPA step. The reason is that modadd operations are cheaper than a global memory
access of a large integer. Furthermore, the DCWPA step has to load the complete EC point data anyway; therefore, the
recomputation does not incur additional data loading overhead. As shown in the DCWPA step in Fig. 4, the operation ➌
recalculates the denominator t of λ based on the loaded point data. Therefore, gECC reduces the stored intermediate
data by introducing a little computation overhead. This scheme is also applicable to PDBL operations.
In addition, gECC assigns different data to different caches to align with the computation. We observe that the data
generated later in the compress step are used earlier in the inverse step and the DCWPA step. For example, the last result
M_5 of the accumulated products of each thread in the compress step is immediately used in the inverse step, and its
inverse value M_5^−1 is the first used in the DCWPA step. To support the gather phase in the inverse step, gECC utilizes
GPU shared memory to cache the last accumulated product of every thread in a GPU block during the compress
step, as shown in the inverse step in Fig. 4. The gather phase requires frequent data accesses to merge data from different
threads in one block; the GPU's shared memory meets this demand by enabling data communication between threads
and supporting simultaneous access by multiple threads. Moreover, the shared memory of Nvidia GPUs has very
low access latency, thus reducing the memory access overhead of the inverse step.
gECC also leverages the L2 persistent cache to store all intermediate data of the accumulated products in the compress step.
As demonstrated in Fig. 4, the array M, which stores the intermediate result of each accumulated product in the compress step,
is actively cached in the L2 persistent cache of the Nvidia A100 GPU. As discussed in Section 3.1, the intermediate
arrays occupy tens of megabytes of memory space. Although gECC has reduced the stored intermediate data, the
required memory space still far exceeds the limited shared memory; for example, the data stored in the compress
step amounts to 32 MB when the scale of batch PADD operations is 2^20. Besides, in the final DCWPA step, we need to first access
the array M to generate the inverse value of the denominator of λ. To meet these requirements, gECC uses the L2
persistent cache, with its larger capacity and lower access latency, to optimize these data accesses. The Nvidia Ampere
architecture supports L2 cache residency control [53], allowing users to manage the large 40 MB L2 cache
more efficiently; according to the CUDA programming guide [52], up to 75% of the L2 cache can be configured as persistent.
gECC sets aside a portion of the L2 cache as a persistent cache and places the array M in it, thus achieving lower-latency
accesses. When the allocated space exceeds the configured L2 persistent cache, gECC uses the first-in-first-out
(FIFO) cache replacement policy provided by CUDA to align with the calculation of the DCWPA step. The
last result of multiplication in the compress step is the first used in the DCWPA step to compute the corresponding
inverse value, and the FIFO replacement strategy is perfectly suited to this access pattern, improving the
hit rate of the L2 cache. Finally, the original point data array (e.g., P_i and T_i in Fig. 4), which occupies several hundred
megabytes of space, is stored in the GPU's global memory.
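Concretely, the residency control is exposed through CUDA's access-policy-window API [52, 53]. A sketch of how the array M could be pinned is shown below; the names d_M and bytes_of_M and the stream setup are illustrative placeholders, not gECC's actual code.

#include <cuda_runtime.h>
#include <algorithm>

void configure_persistent_M(cudaStream_t stream, void *d_M, size_t bytes_of_M) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Carve out part of L2 as persistent (capped by the device's limit).
    size_t carve = std::min(bytes_of_M, (size_t)prop.persistingL2CacheMaxSize);
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carve);

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = d_M;          // device pointer to array M
    attr.accessPolicyWindow.num_bytes =
        std::min(bytes_of_M, (size_t)prop.accessPolicyMaxWindowSize);
    attr.accessPolicyWindow.hitRatio  = 1.0f;         // try to keep the window resident
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}

Calling such a routine once before launching the compress and DCWPA kernels directs accesses to M into the persistent L2 region.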
Cache-efficient data layout. gECC adopts a cache-efficient column-majored data layout for large integers to achieve
efficient concurrent data access, improving the efficiency of global memory access. Usually, a large integer is represented
by an integer array on a CPU, as mentioned in Section 2.2. As illustrated in Fig. 5(a), the two coordinate values (X, Y)
of an EC point P_i are stored in two integer arrays X and Y, respectively, with each element occupying 32 bits. In the
row-majored data layout, the X and Y of an EC point occupy a contiguous physical memory space. However, this
row-majored layout is unsuitable for concurrent access on a GPU. When the threads in a GPU warp load the point data,
each thread T_i first accesses the elements of its own array X serially. As shown in Fig. 5(a), the stride between threads is
up to 64 bytes, which means the accesses to each large integer cannot be coalesced within a warp, causing long data
access latency. To address this issue, gECC separates the X and Y coordinates of the point data and places data of the
same type together in a column-majored layout, such as the X coordinates of all the points in Fig. 5(b). For example,
we store the first 32-bit words (X0) of all X-coordinate large integers contiguously, then all the second words, and so
on up to the last words (X7). Now the data accessed by the threads within a warp are contiguous, supporting efficient
coalesced access. In addition, we eliminate the extra overhead of data transposition by overlapping the transpose of the
next batch of data with the computation of the current batch.

[Figure 5: The data layout of n EC points: (a) the row-majored layout, where each point's eight 32-bit X words and
eight Y words occupy consecutive addresses; (b) the column-majored layout, where the w-th words of all n points' X
coordinates are stored contiguously, followed by the Y coordinates.]
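The resulting address computation is a one-liner. In the sketch below (ours), word w of point p's X coordinate lives at offset w·n + p, so the 32 threads of a warp read 32 consecutive 32-bit words and the load coalesces; n is the batch size.

// Column-majored layout of Fig. 5(b); the row-majored equivalent would be
// X[p * 8 + w], whose 32 B per-thread stride defeats coalescing.
__device__ inline unsigned int load_x_word(const unsigned int *X,
                                           int n, int p, int w) {
    return X[w * n + p];
}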


[Figure 6: An example of the find-then-recompute method. During the compress step, the intermediate product T1t1
is discarded and only T1 and T1T2 are stored; after the inverse step produces (T1T2)^−1, the discarded T1t1 is
recomputed from the stored T1 and the reloaded input t1, so that t2^−1 = (T1T2)^−1 ∗ T1t1 and then t1^−1 can still
be recovered in the decompress step.]

3.4 Batch Point Multiplication Optimization

According to the double-and-add algorithm (Algorithm 1), a PADD operation is needed for each bit of the scalar that is
one, and a PDBL operation follows the PADD operation when the loop runs from the least significant bit of the scalar to
the most significant bit. The input of the PDBL is independent of the output of the PADD, and its output is used in the
next loop iteration. Based on this, gECC proposes a data-locality-aware kernel fusion optimization for batch PMUL
operations. The goal of our design is to reduce the memory access frequency and enhance data locality.
Initially, the three steps of the batch PADD operation are combined with the corresponding steps of the batch PDBL
operation. This naive fusion improves data locality by reusing the loaded point data but requires more memory space
to store the intermediate data. For example, two intermediate arrays (i.e., two copies of the array M) are necessary for
each thread in the combined compress step: one for the PADD operation and the other for the PDBL operation. The
next inverse step can work on both of them, simply doubling the input size and further merging the inverse calculations
into one. However, the limited shared memory restricts the input scale, thus decreasing the throughput of batch PMUL
operations.

To achieve the design goal, gECC uses a find-then-recompute method to fuse the two kernels, which reduces memory
space and improves data locality. Fig. 6 shows an example of applying the find-then-recompute method to Montgomery's
trick. This method discards part of the results of the accumulated multiplications to reduce the space occupied by the
array M, then recomputes the discarded data from the stored intermediate results and the original inputs. For example,
when multiplying t1 and t2 with the previous accumulated product T1, we discard the result T1t1 of multiplying t1
and T1 and only store the final result T1T2 of the product of t1, t2, and T1. To get the correct inverse values of t1 and
t2 in the decompress step, the values T1, T1t1, and (T1T2)^−1 are needed according to Montgomery's trick; we
recompute the intermediate result T1t1 from the stored value T1 and the input t1 to complete the remaining
computation.

Algorithm 3: Batch UPMUL operation
Input: the (scalar, point) set {(s_0, P_0), (s_1, P_1), ..., (s_{n−1}, P_{n−1})}
Output: {Q_0, Q_1, Q_2, ..., Q_{n−1}}
1: len_s ← the bit-width of the scalars;
2: M[n];
3: for i ← 0 to len_s − 1 do
4:   T1 = 1;
5:   for j ← 0 to n − 1 do
6:     M[j] = T1;
7:     t1 = P_j.Y + P_j.Y;  T1t1 = T1 ∗ t1;
8:     t2 = P_j.X − Q_j.X;  T1T2 = T1t1 ∗ t2;
9:     T1 = T1T2;
10:  end for
11:  T1^−1 = inverse_step_func(T1);
12:  for j ← n − 1 to 0 do
13:    T1t1 = (P_j.Y + P_j.Y) ∗ M[j];
14:    t2^−1 = T1^−1 ∗ T1t1;
15:    λ_PADD = λ_PADD_func(P_j, Q_j, t2^−1);
16:    R_j = padd_func(λ_PADD, P_j, Q_j);
17:    T1^−1 = T1^−1 ∗ (P_j.X − Q_j.X);
18:    t1^−1 = T1^−1 ∗ M[j];
19:    λ_PDBL = λ_PDBL_func(P_j, t1^−1);
20:    T1^−1 = T1^−1 ∗ (P_j.Y + P_j.Y);
21:    P_j = pdbl_func(λ_PDBL, P_j);
22:    if s_j[i] == 1 then
23:      Q_j = R_j
24:    end if
25:  end for
26: end for

Specifically, gECC uses the same variable T1 to multiply with the denominator of λ_PADD and the denominator of
λ_PDBL together and stores the last result in the intermediate array M, as shown in lines 6-9 of Algorithm 3. In
addition, gECC adjusts the order of the calculation to overlap global memory access with large-integer computation to
reduce latency. First, we only load the Y coordinate of the point P_j to calculate the denominator of λ_PDBL. By
leveraging the cache-efficient data layout, gECC can aggregate data accesses to any part of a large integer at a smaller
granularity (32 bits). While calculating the denominator of λ_PDBL and multiplying it by T1, gECC loads the X
coordinates of the points P_j and Q_j to overlap the long latency of global memory access with the large-integer
multiplication. Then we calculate the denominator of λ_PADD and multiply it by T1. Finally, the value of T1 for the
point pair (P_j, Q_j) is stored in the array M. The calculation of the inverse step is then the same as in batch PADD
operations; gECC also reduces the latency of the inverse step compared to the naive kernel fusion method.
In the DCWPA step, we first complete the calculation of the PADD operation and then the calculation of the PDBL operation.
gECC gets the inverse value of the denominator of λ_PADD by recomputing the discarded values (the denominator t1
of λ_PDBL and the value T1t1), as shown in lines 13-15 of Algorithm 3. Then it completes the computation of
Eqn. 3 to obtain the result R_j of the PADD operation and updates the value Q_j with R_j according to the i-th bit of
the scalar s_j. Finally, just as for the PADD operation, gECC calculates the value of λ_PDBL and then performs the
PDBL's calculation. Despite introducing additional computation overhead (i.e., one modmul operation) to obtain the
correct inverse values, the benefits of the find-then-recompute scheme far outweigh it. This scheme fully utilizes the
point data locality of the integrated PADD and PDBL calculation process to reduce data accesses, thus improving the
performance of batch PMUL operations.
Our design ensures constant time for the EC operations to realize side-channel protection [54]. To prevent any leakage
of timing information related to the secret scalar in the PMUL operation, instead of omitting unnecessary operations,
the PMUL operation conducts a dummy operation with the point-at-infinity when required (for instance, zeros of the
scalar are absorbed without revealing timing information, as shown in Algorithm 3).
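The paper realizes this with dummy point-at-infinity operations. A complementary, branch-free way to apply lines 22-24 of Algorithm 3 in constant time (our sketch, a different but standard technique, reusing the fq_t/point_t types of the earlier example) is to blend with a mask derived from the scalar bit:

struct fq_t { unsigned int limb[8]; };
struct point_t { fq_t x, y; };

// Constant-time Q = bit ? R : Q; no data-dependent branch, fixed work per call.
__device__ void ct_select_point(point_t *q, const point_t *r, unsigned int bit) {
    unsigned int mask = 0u - (bit & 1u);   // all-ones if bit == 1, else zero
    for (int w = 0; w < 8; w++) {
        q->x.limb[w] = (r->x.limb[w] & mask) | (q->x.limb[w] & ~mask);
        q->y.limb[w] = (r->y.limb[w] & mask) | (q->y.limb[w] & ~mask);
    }
}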

4 Modular Arithmetic Design


In this section, we first introduce the design of our modular arithmetic, which mainly includes a general modmul
algorithm for any modulus. We then develop a curve-specific modular reduction optimization for the SM2 curve,
which is the standard in China.

4.1 Modular Multiplication Optimization

4.1.1 Identify bottlenecks in modular multiplication solutions.


In this section, we aim to identify the potential optimizations of modmul. To this end, we benchmark the execution of
SASS instructions at the microarchitectural level. SASS instructions directly determine the performance of a program,
but the issuance and latency of SASS instructions are not openly documented. Official reports [55, 52] only provide
the throughput of native arithmetic operations per cycle per streaming processor. There is a huge gap between PTX
instructions, with which developers program their computations, and SASS instructions, which are actually executed on
GPUs. This gap is evident in two main areas. 1) How the arithmetic operations expressed in PTX instructions are compiled
into SASS instructions: the same arithmetic operations can be implemented with different PTX instructions by different
programmers (e.g., different orders of arithmetic operations or different storage allocations for their results), and the
compiler also plays an important role in which SASS instructions are produced for an arithmetic operation. Therefore,
it is critical to understand the correspondence between arithmetic operations, PTX instructions, and SASS instructions,
including SASS operators and register usage. 2) Programs with the same arithmetic complexity in terms of PTX
instructions may differ in performance. Although we can obtain the total clock cycles and the total number of each type
of SASS instruction of a program, it is unclear how the SASS instructions are pipelined on the GPU, and hence we
cannot estimate the performance of the program accurately: because instruction issue rates depend on the types of SASS
instructions and their execution order, a program with more SASS instructions may even run faster than one with fewer.
The lack of documentation drives us to simulate the pipeline of instruction issuance, which enables us to identify the
bottleneck of a program and the potential for optimization. Therefore, we also need to benchmark the issue rates of
the different types of SASS instructions.
We tested four types of SASS instructions related to computation, which account for a large proportion of the instructions
in previous solutions [31, 32, 36, 47, 48, 35, 49, 50]. Among them, DFMA and DADD denote the multiply-add and addition
operations on double-precision floating-point numbers, respectively; IMAD and IADD3 are the corresponding instructions
for 32-bit integers, as shown in Table 2. We count the total number of cycles required for a warp to execute 10,000 SASS
instructions of the same type (with and without data dependencies). By combining the results in the official reports with
those of our benchmark, we obtain an estimate of the issue rate per warp scheduler for each SASS instruction, as shown
in Table 2. It is evident that data-independent DFMA and IMAD instructions are both issued at intervals of 4 cycles.
In addition, we conducted experiments in which a warp consecutively issues 10,000 SASS instructions of two different
types (without data dependencies) and tallied the total number of cycles required to process all the instructions.

Table 2: GPU Ampere microarchitecture-level SASS instruction issuance analysis

arithmetic operation   | PTX instructions                          | SASS Operators | issue rate per warp scheduler (dependent / independent) | meaning
floating-point mul-add | fma.rz.f64 d, a, b, c                     | DFMA           | 1/8 / 1/4 cycle | d = a ∗ b + c
floating-point add     | add.rz.f64 d, a, b                        | DADD           | 1/8 / 1/4 cycle | d = a + b
integer mul-add        | madc.cc.lo.u32 (d)lo, a, b, (c)lo;        | IMAD.WIDE.X    | 1/8 / 1/4 cycle | cc, (d)lo = (a ∗ b)lo + (c)lo + cc;
                       | madc.cc.hi.u32 (d)hi, a, b, (c)hi         |                |                 | (d)hi = (a ∗ b)hi + (c)hi + cc
integer add            | addc.cc.u32 d, a, b;                      | IADD3.X        | 1/4 / 1/2 cycle | cc, d = a + b + cc

The 10,000 instructions consist of two types of SASS instructions (any two of the types shown in Table 2), arranged in a
1:1 ratio; for instance, an IMAD instruction is followed by a DFMA instruction. The results indicate that when a warp
issues one DFMA followed by one IMAD instruction, the total cycles required are the sum of issuing DFMA or IMAD
alone, which means that although these two types of instructions run in different compute units, they are still issued
at intervals of 4 cycles. Similarly, the interval between a DFMA instruction and a DADD instruction, as well as between an
IMAD instruction and a DADD instruction, remains 4 cycles. However, the interval between an IMAD, DFMA, or
DADD instruction and an IADD3 instruction is 2 cycles.
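The paper does not list its benchmark kernels. A minimal sketch of the methodology (ours) times a long run of integer multiply-adds with clock64(), using several independent accumulator chains so that the issue rate, rather than the dependency latency, dominates:

__global__ void imad_issue_bench(unsigned int *sink, long long *cycles) {
    unsigned int a = threadIdx.x | 1, b = blockIdx.x + 3;
    unsigned int r0 = 1, r1 = 2, r2 = 3, r3 = 4;   // four independent chains
    long long t0 = clock64();
    for (int i = 0; i < 2500; i++) {               // 4 * 2500 = 10,000 mul-adds
        r0 = r0 * a + b;  r1 = r1 * a + b;         // each lowers to one IMAD
        r2 = r2 * a + b;  r3 = r3 * a + b;
    }
    long long t1 = clock64();
    if (threadIdx.x == 0) { *cycles = t1 - t0; *sink = r0 ^ r1 ^ r2 ^ r3; }
}

Launching a single warp and dividing the cycle count by the instruction count approximates the per-warp-scheduler issue interval; collapsing the body into one dependent chain exposes the dependent rate instead.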
Theoretically, a 256-bit integer represented in base D = 2^32 necessitates m = 8 32-bit integers for storage. As shown
in Algorithm 2, the modmul operation on two 256-bit integers consists of two parts: the integer multiplication requires
m^2 = 64 IMAD instructions, and the modular reduction requires m + m^2 = 72 IMAD instructions, for a
total of 136 IMAD instructions. With D = 2^52 in the solution (DPF) presented in [36], storing a 256-bit
integer requires 5 double-precision floating-point numbers. Due to floating-point precision considerations,
storing the product of two 52-bit integers necessitates 2 DFMA instructions. So, in the modmul operation of this solution,
the integer multiplication demands 2m^2 = 50 DFMA instructions, while the modular reduction requires 100
DFMA instructions and m = 5 IMAD instructions. Moreover, the solution incorporates extra DADD instructions and
data conversions to ensure precision.
We used the GPU performance analysis tool NVIDIA Nsight Compute [56] to disassemble the fastest modmul
implementation [35] based on integer instructions. This implementation uses 1, 2, or 4 threads with D = 2^32 and is
referred to as CGBN-1, CGBN-2, and CGBN-4, respectively. Additionally, we disassembled a solution [36, 31] based
on a mixture of multi-precision floating-point instructions and integer arithmetic instructions, which uses 1 thread and
D = 2^52. This solution is named DPF-1.
Table 3: Instruction analysis of a modmul in different solutions for the SM2 curve

       | total inst | DFMA | DADD | IMAD | IADD3 | SHFL
DPF-1  | 325        | 100  | 50   | 29   | 114   | 0
CGBN-1 | 227        | 0    | 0    | 185  | 32    | 0
CGBN-2 | 382        | 0    | 0    | 148  | 94    | 50
CGBN-4 | 584        | 0    | 0    | 188  | 144   | 100

The results of the performance test depicted in Fig. 12 indicate the following performance ranking: CGBN-1 outperforms
DPF-1, which is in turn more efficient than CGBN-2 and CGBN-4. As shown in Table 3, the primary performance
bottlenecks for CGBN-4 and CGBN-2 arise from the SHFL instructions necessary for inter-thread communication [52].
These instructions take approximately 5 clock cycles [57] and introduce synchronization overhead within the modmul
operation. DPF-1 attempts to replace some IMAD instructions with DFMA instructions. Regrettably, despite these
two types of instructions operating in different computational units, they cannot be issued concurrently, as confirmed
by our benchmark results in Table 2. Furthermore, the additional DADD instructions and data conversion instructions
prevent DPF-1 from outperforming the integer-based implementation CGBN-1. Although CGBN-1 shows impressive
performance, our microarchitectural analysis reveals that its actual usage of IMAD instructions is 38% higher than the
theoretical prediction, suggesting that there is still potential for further optimization.

In general, the main challenge in speeding up modmul lies in minimizing the number of IMAD instructions and
reducing the dependencies between instructions. Additionally, it is important to avoid register bank conflicts, as these
can also delay the issuance of instructions [58, 59, 55].


4.1.2 Minimize IMAD instructions.


Based on the preceding analysis (see Table 2), the native arithmetic operation for implementing the modmul operation
is the multiply-add operation defined as (carry-out, c) = a ∗ b + c + carry-in. Here, a and b are 32-bit integers, c is a
64-bit integer consisting of a high 32-bit word and a low 32-bit word, and carry-in and carry-out are each 1 bit. After
compilation, the SASS instruction executed on the hardware is IMAD.WIDE.U32.X R1, P0, R2, R3, R1, P0;. In this
instruction, R1, R2, and R3 are general-purpose registers that store c, a, and b, respectively, while P0 is a predicate
register used to pass the carry information to another SASS instruction. Improper propagation of carry information can
lead to instruction overhead. For example, in the study of [49], if the high-word carry and the low-word carry of a ∗ b
are propagated separately, the number of IMAD instructions can be doubled compared to the theoretical prediction.

Our implementation of multiplication carry propagation, akin to sppark [50], effectively circumvents such issues. As
depicted in Fig. 7, the product of the integers A and B can be computed by generating m row-by-row multiplications
with the mad_n function, specifically A ∗ B_i, where i ranges from 0 to m − 1. The carry from the high word of
A_j ∗ B_i is passed as the carry input to the low word of A_{j+2} ∗ B_i. This chain covers the products of all even
elements of A with B_i and avoids the extra addition instructions that would be caused by the conflict when the carry
from the high word of A_j ∗ B_i flows into the high word of A_{j+1} ∗ B_i. The same carry propagation also applies to
the remaining odd elements of A; for example, the carry from the high word of A_{j+1} ∗ B_i is passed as the carry
input to the low word of A_{j+3} ∗ B_i.

[Figure 7: Divergent carry addition chains for the odd and even words of an integer. The PTX code of mad_n and the
corresponding SASS instructions are:]

def mad_n(acc[m], const a[m], Base bi):
    for (size_t j = 0; j < m; j += 2)
        asm("madc.lo.cc.u32 %0, %2, %3, %0;"
            "madc.hi.cc.u32 %1, %2, %3, %1;"
            : "+r"(acc[j]), "+r"(acc[j+1])
            : "r"(a[j]), "r"(bi));
    asm("addc.u32 %0, 0, 0;" : "+r"(acc[m]));

mad_n(C, A, B[i])      // even carry chain of A * B[i]
mad_n(D, A+1, B[i])    // odd carry chain of A * B[i]

// SASS instructions
IMAD.WIDE.U32   R18, P0, R2,  R25, R18 ;
IMAD.WIDE.U32   R22, P1, R0,  R25, R22 ;
IMAD.WIDE.U32.X R20, P0, R25, R5,  R20, P0 ;
IMAD.WIDE.U32.X R16, P1, R25, R3,  R16, P1 ;

4.1.3 Reorder IMAD instructions to reduce register moves and register bank conflicts.
After customizing the row-by-row multiplication with two mad_n functions, we still need to pay attention to the IMAD
order when applying it to the integer multiplication stage of the modmul operation, because register moves and register
bank conflicts also affect the instructions' issuance rate. When processing the m row-by-row multiplications in the
integer multiplication of the modmul operation, we modify the write-back registers of the IMAD instructions, as shown
in Fig. 8. Specifically, the products of the even elements in A ∗ B[0] are accumulated in the C array, while those of the
even elements in A ∗ B[1] are accumulated in the D array. This adjustment eliminates the need to separately read the
high word (C0)hi of the C0 register and the low word (C1)lo of the C1 register, thus reducing register move and read
operations.

[Figure 8: General modmul algorithm for m = 4, in three phases:]

(1) integer multiply: the accumulators C = {(C0)lo, (C0)hi, ..., (C4)hi} and D = {(D0)lo, ..., (D3)hi} gather the even
and odd carry chains:
    mad_n(C,   A,   B0);  mad_n(C+1, A+1, B1);  mad_n(C+1, A,   B2);  mad_n(C+2, A+1, B3);
    mad_n(D,   A+1, B0);  mad_n(D,   A,   B1);  mad_n(D+1, A+1, B2);  mad_n(D+1, A,   B3);

(2) modular reduce:
    1: for (size_t i = 0; i < m/2; i += 1) do
    2:   Mi = (Ci)lo * q_inv;
    3:   mad_n(C+i, q, Mi);
    4:   mad_n(D+i, q+1, Mi);
    5:   (carry-out, (Di)lo) = (Di)lo + (Ci)hi    // keep the carry and pass it to (Ci+1)lo
    6:   Mi+1 = (Di)lo * q_inv;
    7:   mad_n(C+i+1, q+1, Mi+1);
    8:   mad_n(D+i, q, Mi+1);
    9:   (carry-out, (Ci+1)lo) = (Ci+1)lo + (Di)hi    // keep the carry and pass it to (Di+1)lo

(3) merge: accumulate the C and D arrays with IADD3 instructions to produce the final result.
During the modular reduction phase of the modmul operation, as shown in Algorithm 2, while accumulating the product of q ∗ Mi in each iteration, sppark ensures result accuracy by shifting the C or D array 64 bits to the right, e.g., Ci = Ci+1 + q ∗ M1. However, the corresponding SASS instructions require the IMAD instruction to access four registers, potentially causing a one-cycle delay in instruction issuance due to register bank conflicts [58, 59, 55]. We employ an in-place write-back strategy, namely Ci = Ci + q ∗ M1, which allows the IMAD instruction to access only three registers. To ensure that the results of the C and D arrays are still shifted 64 bits to the right in each iteration, without increasing the number of IMAD instructions as CGBN does, we utilize predicate registers to store the carries, as shown in phase (2) of Fig. 8. For instance, when i = 0, the value of (C0)hi is added to (D0)lo, and the carry-out is stored in a predicate register to be passed to (C1)lo in the subsequent mad_n function. Subsequently, the IADD3 instruction is employed to accumulate the C and D arrays into the final result in the merge phase of Fig. 8. Upon disassembly of our design, the number of IMAD instructions in a modmul operation aligns with the theoretical prediction.
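A minimal sketch of this hand-off follows, under the same illustrative assumptions as before; in the real kernel the carry stays in the carry flag and is consumed by the first madc of the next mad_n row rather than being added out explicitly.

// Sketch of the phase (2) hand-off: fold (C0)hi into (D0)lo instead of
// shifting the whole array, and keep the carry in the carry flag (a predicate
// register in SASS) so the next instruction absorbs it without an extra IADD3.
__device__ __forceinline__ void fold_high_word(uint32_t &d0_lo, uint32_t c0_hi,
                                               uint32_t &c1_lo) {
    asm("add.cc.u32 %0, %0, %2;\n\t"  // (D0)lo += (C0)hi, carry-out -> CC.CF
        "addc.u32   %1, %1, 0;"       // pass the carry into (C1)lo
        : "+r"(d0_lo), "+r"(c1_lo)
        : "r"(c0_hi));
}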
4.2 Modular Reduction Optimization on SM2 Curves

To further reduce the number of IMAD instructions, we observe that, for the SM2 curve, the IADD3 instruction, which has a higher issuance rate, can be used to replace IMAD instructions. The unique form of the SCA-256 prime modulus on the SM2 curve enables a distinct modular reduction phase. On the SM2 curve, q = 2^256 − 2^224 − 2^96 + 2^64 − 1, and therefore the q_inv of q in Algorithm 2 is equal to 1. A previous work [60], an ASIC-based accelerator for the SM2 curve, also utilizes this property. However, its design is not portable to GPUs, as it introduces a multitude of intermediate results that require more registers, and it introduces many addition operations.
Since q_inv = 1 in lines 2 and 6 of the middle part of Fig. 8, Mi is simply (Ci)lo and Mi+1 is simply (Di)lo. By incorporating the value of q into the reduction phase, the mad_n calls in lines 3, 4, 7, and 8 can be simplified into additions instead of multiplications. For example, at iteration i = 0, lines 2 to 4 can be simplified using Eqn. 4. Therefore, 8 subtractions are required to construct q ∗ (C0)lo, and 10 additions are required to update the C array (from (C0)lo to (C4)hi).
\begin{aligned}
C &= C + (C_0)_{lo} \cdot q = \sum_{i=0}^{7} \left((C_i)_{lo} + (C_i)_{hi} \cdot 2^{32}\right) \cdot 2^{64i} + \left(2^{256} - 2^{224} - 2^{96} + 2^{64} - 1\right) \cdot (C_0)_{lo} \\
  &= (C_0)_{hi} \cdot 2^{32} + ((C_1)_{lo} + (C_0)_{lo}) \cdot 2^{64} + ((C_1)_{hi} - (C_0)_{lo}) \cdot 2^{96} + (C_2)_{lo} \cdot 2^{128} + (C_2)_{hi} \cdot 2^{160} + (C_3)_{lo} \cdot 2^{192} \\
  &\quad + ((C_3)_{hi} - (C_0)_{lo}) \cdot 2^{224} + ((C_4)_{lo} + (C_0)_{lo}) \cdot 2^{256} + (C_4)_{hi} \cdot 2^{288} + \sum_{i=5}^{7} \left((C_i)_{lo} + (C_i)_{hi} \cdot 2^{32}\right) \cdot 2^{64i}
\end{aligned} \tag{4}
After that, both the D array and (C0)hi, required by line 5, remain unchanged, allowing lines 6 to 8 to be simplified in a similar manner. Consequently, in each iteration of the reduction phase, 8 ∗ 2 subtractions are still necessary to construct (Ci)lo ∗ q and (Di)lo ∗ q, and 10 ∗ 2 additions are required to update the arrays C and D.
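To illustrate, here is a hedged sketch of one such step (the i = 0 case of Eqn. 4) with carry propagation omitted for readability; the flat limb layout and function name are our own, and the real kernel chains every addition and subtraction through the carry flag so that each lands as an IADD3.

#include <cstdint>

// Sketch of C += (C0)lo * q for q = 2^256 - 2^224 - 2^96 + 2^64 - 1 (Eqn. 4),
// carries omitted. c[0..9] hold (C0)lo ... (C4)hi as 32-bit limbs.
__device__ void sm2_reduce_step_sketch(uint32_t c[10]) {
    uint32_t m = c[0];  // M0 = (C0)lo, since q_inv = 1 on the SM2 curve
    c[2] += m;          // + m * 2^64   -> (C1)lo
    c[3] -= m;          // - m * 2^96   -> (C1)hi
    c[7] -= m;          // - m * 2^224  -> (C3)hi
    c[8] += m;          // + m * 2^256  -> (C4)lo
    // The -m * 2^0 term cancels c[0] exactly, which is the point of the
    // Montgomery step: the low word is eliminated without a single IMAD.
}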
Nonetheless, there is potential for further optimization. As depicted in Fig. 8, the merge phase of modmul does not require precise values for C0, C1 and D0, D1, but merely accurate carry information. Hence, we can simplify the computation from line 2 to line 8 when i = 0 using Eqn. 5. In the calculation of tmp, we employ one addition operation and use the predicate register to store the carry information. When computing (Ci)lo ∗ q + ((Ci)hi + (Di)lo) ∗ q, we require an additional eight 32-bit integers and seven subtractions with IADD3 instructions, as we can disregard the specific value of the lower 64 bits. When updating the array C′, we also ignore the lower 64 bits and simply add the carry information stored in the predicate register, which can be accomplished through nine additions with IADD3 instructions. Compared to the previous work in [60], which requires 176 addition and subtraction instructions, we need only (1 + 7 + 9 + 1) ∗ 4 = 72 addition and subtraction instructions to complete the reduction phase.
\begin{aligned}
(cc, tmp) &= (C_0)_{hi} + (D_0)_{lo} \\
C &= C + (C_0)_{lo} \cdot q + ((C_0)_{hi} + (D_0)_{lo}) \cdot q \\
C' &= (cc + (C_1)_{lo} + (C_0)_{lo}) \cdot 2^{64} + ((C_1)_{hi} - (C_0)_{lo} + tmp) \cdot 2^{96} + ((C_2)_{lo} - tmp) \cdot 2^{128} + (C_2)_{hi} \cdot 2^{160} \\
  &\quad + (C_3)_{lo} \cdot 2^{192} + ((C_3)_{hi} - (C_0)_{lo}) \cdot 2^{224} + ((C_4)_{lo} + (C_0)_{lo} - tmp) \cdot 2^{256} + ((C_4)_{hi} + tmp) \cdot 2^{288} \\
  &\quad + \sum_{i=5}^{7} \left((C_i)_{lo} + (C_i)_{hi} \cdot 2^{32}\right) \cdot 2^{64i}
\end{aligned} \tag{5}
5 Experimental Evaluation
5.1 Methodology

Experimental Setup: Our experiments are conducted on a virtual machine with an Intel Xeon processor (Skylake, IBRS) and 72 GB DRAM, running Ubuntu 18.04.2 with CUDA Toolkit version 12.2. It is equipped with one NVIDIA Ampere A100 GPU (40 GB memory) and one NVIDIA Volta V100 GPU (32 GB memory). Because the optimization of EC operations in gECC relies on the L2 cache persistence feature of the A100 GPU, we evaluate the throughput of the ECC algorithms (Section 5.2), EC operations (Section 5.3), and the application (Section 5.5) only on the A100 GPU. Because the optimization of the modmul operation in gECC is general and does not rely on this feature, we evaluate it on both the V100 and A100 GPUs (Section 5.4). We use random synthetic data generated by RapidEC based on the SM2 curve to evaluate the performance of the ECC algorithms, EC operations, and the modmul operation, and real workload data to evaluate the application performance.
Baselines: We compare gECC with the state-of-the-art method for GPU-accelerated EC operations, called RapidEC [33].
Performance Metrics: We focus on enhancing the throughput of the GPU programs. Therefore, for the ECDSA algorithm, we use the number of signatures generated/verified per second (i.e., signature/s) as the performance metric for the signature generation and verification phases, respectively. The ECDH algorithm comprises two FPMUL operations and a single key-value comparison. Given that the time taken for the latter is negligible, the throughput of the ECDH algorithm is dominated by that of the FPMUL operation, so we only report the throughput of the FPMUL operation. Specifically, we record the computation time t when performing a batch of n signature generations, verifications, or FPMUL operations, and calculate the throughput as n/t, which is consistent with RapidEC.

5.2 ECDSA Algorithm Performance

We evaluate the end-to-end throughput of signature generation and verification based on the SM2 curve.
Table 4: Throughput Performance of ECDSA on NVIDIA A100 based on SM2 curve

Solution    Signature Generation    Signature Verification
RapidEC     3,386,544               773,481
gECC        14,141,978              4,372,853
Signature Generation. We examine the throughput of signature generation in the ECDSA algorithm. As shown in Table 4, the throughput of signature generation in gECC is 14,141,978 signature/s, a 4.18× speedup over the corresponding part of RapidEC. The end-to-end performance advantage of gECC can be attributed to its comprehensive optimization, from the low-level modmul arithmetic up to the cryptographic operations.
To take a closer look, we investigate the improvement brought about by the major optimization techniques of gECC for signature generation, which mainly contains a modinv operation and an FPMUL operation. First, to verify the improvement brought about by our optimized modular arithmetic, we re-implemented the RapidEC solution with our optimized modular arithmetic ("RapidEC+OM"). Then, we evaluate an alternative solution ("RapidEC+OM+BI"), which further incorporates the batch modular inversion optimization presented in Section 3.2. Finally, gECC ("gECC") contains all our proposed optimizations; it adopts the optimized FPMUL by leveraging the memory management optimization (Section 3.3). As shown in Fig. 9, the breakdown analysis shows that RapidEC+OM achieves a 2.3× speedup against RapidEC, and RapidEC+OM+BI further improves the performance by 21% over RapidEC+OM. gECC achieves another 50% improvement against RapidEC+OM+BI. This verifies that batch modular inversion based on Montgomery's trick significantly reduces the number of modinv operations, and that the GAS mechanism minimizes the overall latency and fully utilizes the high degree of parallelism provided by the GPU. Furthermore, the batch PADD operations based on affine coordinates reduce the overall computation overhead.

Figure 9: Breakdown analysis of signature generation (throughput: RapidEC 3.39E+6, RapidEC+OM 7.79E+6, RapidEC+OM+BI 9.45E+6, gECC 1.41E+7 signature/s)


Signature Verification. In this experiment, we examine the throughput of signature verification in the ECDSA algorithm, which mainly contains one FPMUL operation and one UPMUL operation. As shown in Table 4, gECC's throughput of signature verification is 4,372,853 signature/s, a 5.65× speedup compared to RapidEC.
We provide a breakdown analysis of the signature verification process as follows. First, we re-implemented the RapidEC solution with our optimized modular arithmetic ("RapidEC+OM"). Then, we evaluate an alternative solution ("RapidEC+OM+BF") that also accelerates the FPMUL operation with the optimization proposed in Section 3.2. Finally, "gECC" contains all our proposed optimizations, including the data-locality-aware kernel fusion (Section 3.4) and memory management (Section 3.3) optimizations. As shown in Fig. 10, the breakdown analysis shows that RapidEC+OM achieves a 3.2× speedup against RapidEC. Signature verification contains more calculations than signature generation, so our optimized modular arithmetic provides a more significant speedup compared to RapidEC. With the accelerated FPMUL operation, RapidEC+OM+BF further improves the throughput by 17% over RapidEC+OM. By accelerating the UPMUL operation, gECC achieves another 54% improvement over RapidEC+OM+BF.

Figure 10: Breakdown analysis of signature verification (throughput: RapidEC 7.73E+5, RapidEC+OM 2.43E+6, RapidEC+OM+BF 2.84E+6, gECC 4.37E+6 signature/s)

5.3 Performance Analysis of Batch PMUL on GPU

This set of experiments evaluates the throughput of PMUL based on the SM2 curve.

Table 5: Throughput Performance of PMUL on NVIDIA A100 GPU based on SM2 curve

Batch size     FPMUL operations per second           UPMUL operations per second
per GPU SP     RapidEC      gECC                     RapidEC      gECC
2^11           3,928,144    11,723,016 (2.98×)       1,476,365    6,270,010 (4.25×)
2^12           3,935,789    13,005,468 (3.30×)       1,405,325    6,561,449 (4.67×)
2^13           3,939,345    13,755,865 (3.49×)       1,427,610    6,689,186 (4.69×)
2^14           3,810,161    14,389,168 (3.78×)       1,464,398    6,766,892 (4.62×)
2^15           3,863,144    14,385,871 (3.72×)       1,471,910    6,837,619 (4.65×)
2^16           3,906,517    14,299,320 (3.66×)       1,477,929    6,732,287 (4.56×)

Fixed Point Multiplication (FPMUL). First, we evaluate the throughput of the FPMUL operation. As shown in Table 5, gECC achieves an average speedup of 3.16× compared to RapidEC. To ensure load balancing across the GPU streaming processors, we set the batch size to the product of the number of GPU stream processors (432) and the amount of data processed by each processor. As the batch size increases, the throughput of FPMUL operations in gECC gradually increases until it saturates the GPU. This is because, with a larger batch size, Montgomery's trick brings a greater reduction in computational complexity. As shown in Fig. 3(b), the total duration of FPMUL computation in gECC comprises three parts: the compress step, the GAS step, and the DCWPA step. The duration of the compress and DCWPA steps is linearly proportional to the batch size, while the duration of the GAS step is almost constant regardless of the batch size (Section 3.2). Therefore, the throughput increases with the batch size. When the batch size reaches about 432 ∗ 2^15, the GAS part occupies only a small portion of the total elapsed time, so the overall performance stabilizes as the batch size continues to increase. RapidEC's throughput remains almost constant across batch sizes, as it does not perform any optimization for batch processing.
In the next experiment, we investigate the impact of the major optimization techniques of gECC on the FPMUL operation. First, we re-implemented RapidEC's FPMUL solution with our optimized modular arithmetic ("RapidEC+OM"). As shown in Fig. 11(a), the throughput of RapidEC+OM is stable over different scales, reaching up to 10,592,834 operation/s. We experiment with two variants of gECC: gECC−MO, a gECC variant without the memory management mechanism, and gECC−MO−KF, a variant without both the memory management and the data-locality-aware kernel fusion mechanisms. The stabilized throughput of gECC−MO−KF is lower than RapidEC+OM by 8%; the reason is that gECC−MO−KF incurs a high overhead of global memory access due to the lack of relevant optimizations. The throughput of gECC−MO is about 10% higher than that of RapidEC+OM because the kernel fusion optimization reduces the global memory access frequency and enhances data locality to a certain degree, thus improving the
throughput of batch PADD operations. Finally, the throughput of gECC with all the optimizations achieves 36% and 23% improvements over RapidEC+OM and gECC−MO, respectively. This shows that our memory management optimization effectively optimizes global memory access through a column-major data layout and reduces data access latency through multi-level caching.

Figure 11: Breakdown analysis of point multiplication across batch sizes 2^11–2^16 per GPU stream processor for RapidEC, RapidEC+OM, gECC−MO−KF, gECC−MO, and gECC: (a) fixed point multiplication (fpmul/s); (b) unknown point multiplication (upmul/s)
Unknown Point Multiplication (UPMUL). This experiment examines the throughput of the UPMUL operation, the most time-consuming operation in ECC. As shown in Table 5, gECC achieves, on average, a 4.36× speedup over RapidEC. RapidEC is implemented using the NAF (non-adjacent form) algorithm [39], which converts a scalar into its NAF representation with signed bits to reduce the number of PADD operations. However, when multiple UPMUL operations are performed concurrently in a GPU streaming processor, the differing NAF values lead to severe warp divergence, resulting in higher inter-thread synchronization costs than gECC.
We further investigate the improvement brought about by the optimizations of gECC for the UPMUL operation. Similarly, we re-implemented RapidEC's UPMUL operation using the NAF algorithm with our optimized modular arithmetic ("RapidEC+OM"). We also experiment with the same two variants of gECC, gECC−MO and gECC−MO−KF. As shown in Fig. 11(b), the stabilized throughput of gECC−MO−KF is 45% higher than that of RapidEC+OM. The reason is that the benefit of decreased computational complexity brought about by Montgomery's trick outweighs the increased overhead of global memory access. gECC−MO brings a further 20% improvement, and the memory management optimization achieves another 9% improvement.

5.4 Performance Analysis of Modular Multiplication on the GPU

In this section, we examine the throughput of the modmul operation, the major underlying computation within ECC. We compare against several accelerated modmul implementations. CGBN [35] is the finite field arithmetic library used by RapidEC. RapidEC uses four threads to calculate a modmul operation, which we refer to as CGBN-4. In addition, we add two cases, using one or two threads to calculate a modmul operation, referred to as CGBN-1 and CGBN-2, respectively. We also study the solution presented in [31], which is based on floating-point instructions and uses one thread to calculate a modmul operation, referred to as DFP-1. First, we implement the general modmul operation ("SOS") as shown in Fig. 8, which minimizes the number of IMAD instructions and organizes their order (Section 4.1). Like CGBN-1, SOS makes no special modular reduction optimization for the SCA-256 prime modulus and is a standard modmul implementation suitable for any prime modulus. Then, the full gECC (labeled "SOS+FR" in Fig. 12) additionally adopts the fast reduction for the SCA-256 prime modulus (Section 4.2).

Figure 12: Throughput analysis of the modmul operation (modmul/s) on the V100 and A100 GPUs for CGBN-1, CGBN-2, CGBN-4, DFP-1, SOS, and SOS+FR
As shown in Fig. 12, gECC on the SCA-256 prime modulus achieves 1.63× and 1.72× speedups against CGBN-1, and 2.68× and 2.58× speedups against CGBN-4, on the V100 and A100 GPUs, respectively. To take a deeper look, we investigate the improvement brought by the major gECC proposals for the modmul operation. First, SOS, gECC, and CGBN-1 all avoid the inter-thread communication overhead of CGBN-4 and significantly improve throughput. SOS achieves a 1.17× speedup against CGBN-1 on the A100 GPU; as a highly optimized assembly-level implementation, SOS minimizes the number of IMAD instructions and reduces the stalls caused by register bank conflicts and register moves. gECC's fast reduction delivers an additional 1.39× speedup, since we replace the IMAD operations in the reduction phase with as few addition and subtraction instructions as possible.

5.5 Application Performance

In the final set of experiments, we investigate the end-to-end effect of gECC on the overall performance of ECC applications. ECDSA is widely used in blockchain systems and blockchain databases for user identity authorization and verification, ensuring transaction security and integrity. For instance, in a blockchain database, each modification history of the database is recorded in a transaction, and the ECDSA signature safeguards the data from malicious tampering. We evaluate the effect of gECC on accelerating the ECDSA algorithm of FISCO-BCOS [61], a real-world permissioned blockchain system.
We run the blockchain on a four-node cluster, each node with a 2.40 GHz Intel Xeon CPU and 512 MB memory. The system transaction throughput without gECC is approximately 5,948 transactions per second, with the time breakdown revealing that ECDSA signature generation and verification accounts for 37.2% of the total time. Despite the significant performance improvement of our ECDSA, the overall throughput is constrained by other factors, such as consensus and hash computations. After applying gECC, the throughput reaches 9,313 transactions per second, a performance improvement of 1.56×.

6 Conclusion

ECC has lower computational complexity and a smaller key size than RSA, making it competitive for digital signatures, blockchain, and secure multi-party computation. Despite this superiority, ECC remains the bottleneck in these applications because the EC operations in ECC are still time-consuming, which makes it imperative to optimize their performance. In this paper, we study how to optimize a batch of EC operations using the GAS mechanism and Montgomery's trick on GPUs. We propose a locality-aware kernel fusion optimization and design multi-level cache management to minimize the memory access overhead incurred by frequent access to point data and intermediate results when batching EC operations. Finally, we optimize modmul, the operation performed most frequently in all types of EC operations. Our results show that gECC significantly improves the parallel execution efficiency of batch EC operations and achieves much higher throughput than the state of the art.

References

[1] Peter L Montgomery. Speeding the pollard and elliptic curve methods of factorization. Mathematics of computation,
48(177):243–264, 1987.
[2] Victor S Miller. Use of elliptic curves in cryptography. In Conference on the theory and application of
cryptographic techniques, pages 417–426. Springer, 1985.
[3] Neal Koblitz. Elliptic curve cryptosystems. Mathematics of computation, 48(177):203–209, 1987.
[4] Cong Yue, Tien Tuan Anh Dinh, Zhongle Xie, Meihui Zhang, Gang Chen, Beng Chin Ooi, and Xiaokui Xiao.
Glassdb: An efficient verifiable ledger database system through transparency. arXiv preprint arXiv:2207.00944,
2022.
[5] Zerui Ge, Dumitrel Loghin, Beng Chin Ooi, Pingcheng Ruan, and Tianwen Wang. Hybrid blockchain database
systems: design and performance. Proceedings of the VLDB Endowment, 15(5):1092–1104, 2022.
[6] Dumitrel Loghin. The anatomy of blockchain database systems. Data Engineering, page 48, 2022.
[7] Xinying Yang, Yuan Zhang, Sheng Wang, Benquan Yu, Feifei Li, Yize Li, and Wenyuan Yan. Ledgerdb: A
centralized ledger database for universal audit and verification. Proceedings of the VLDB Endowment, 13(12):3138–
3151, 2020.
[8] Meihui Zhang, Zhongle Xie, Cong Yue, and Ziyue Zhong. Spitz: a verifiable database system. Proc. VLDB
Endow., 13(12):3449–3460, August 2020.
[9] Simon Blake-Wilson, Nelson Bolyard, Vipul Gupta, Chris Hawk, and Bodo Moeller. Elliptic curve cryptography
(ecc) cipher suites for transport layer security (tls). Technical report, 2006.


[10] Don Johnson, Alfred Menezes, and Scott Vanstone. The elliptic curve digital signature algorithm (ecdsa).
International journal of information security, 1:36–63, 2001.
[11] Than Myo Zaw, Min Thant, and S. V. Bezzateev. Database security with aes encryption, elliptic curve encryption
and signature. In 2019 Wave Electronics and its Application in Information and Telecommunication Systems
(WECONF), pages 1–6, 2019.
[12] Pradeep Suthanthiramani, Sannasy Muthurajkumar, Ganapathy Sannasi, and Kannan Arputharaj. Secured data
storage and retrieval using elliptic curve cryptography in cloud. Int. Arab J. Inf. Technol., 18(1):56–66, 2021.
[13] Benny Pinkas, Mike Rosulek, Ni Trieu, and Avishay Yanai. Spot-light: lightweight private set intersection from
sparse ot extension. In Advances in Cryptology–CRYPTO 2019: 39th Annual International Cryptology Conference,
Santa Barbara, CA, USA, August 18–22, 2019, Proceedings, Part III 39, pages 401–431. Springer, 2019.
[14] Quanyu Zhao, Yuan Zhang Bingbing Jiang, Heng Wang, Yunlong Mao, and Sheng Zhong. Unbalanced private
set intersection with linear communication complexity. SCIENCE CHINA Information Sciences, 67(3):132105–,
2024.
[15] Yibiao Lu, Zecheng Wu, Bingsheng Zhang, and Kui Ren. Efficient secure computation from sm series cryptography.
Wireless Communications and Mobile Computing, 2023(1):6039034, 2023.
[16] Yuanyuan Li, Hanyue Xiao, Peng Han, and Zhihao Zhou. Practical private intersection-sum protocols with good
scalability. In Jianming Zhu, Qianhong Wu, Yong Ding, Xianhua Song, and Zeguang Lu, editors, Blockchain
Technology and Application, pages 49–63, Singapore, 2024. Springer Nature Singapore.
[17] Licheng Wang, Xiaoying Shen, Jing Li, Jun Shao, and Yixian Yang. Cryptographic primitives in blockchains.
Journal of Network and Computer Applications, 127:43–58, 2019.
[18] Yupeng Zhang, Jonathan Katz, and Charalampos Papamanthou. Integridb: Verifiable sql for outsourced databases.
In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1480–
1491, 2015.
[19] HweeHwa Pang, Jilian Zhang, and Kyriakos Mouratidis. Scalable verification for outsourced dynamic databases.
Proceedings of the VLDB Endowment, 2(1):802–813, 2009.
[20] Amazon. Amazon quantum ledger database, 2019. https://aws.amazon.com/qldb/.
[21] Panagiotis Antonopoulos, Raghav Kaushik, Hanuma Kodavalla, Sergio Rosales Aceves, Reilly Wong, Jason
Anderson, and Jakub Szymaszek. Sql ledger: Cryptographically verifiable data in azure sql database. In
Proceedings of the 2021 international conference on management of data, pages 2437–2449, 2021.
[22] Satoshi Nakamoto. Bitcoin: A peer-to-peer electronic cash system. 2008.
[23] Gavin Wood et al. Ethereum: A secure decentralised generalised transaction ledger. Ethereum project yellow
paper, 151(2014):1–32, 2014.
[24] Eli Ben Sasson, Alessandro Chiesa, Christina Garman, Matthew Green, Ian Miers, Eran Tromer, and Madars
Virza. Zerocash: Decentralized anonymous payments from bitcoin. In 2014 IEEE symposium on security and
privacy, pages 459–474. IEEE, 2014.
[25] Elli Androulaki, Artem Barger, Vita Bortnikov, Christian Cachin, Konstantinos Christidis, Angelo De Caro, David
Enyeart, Christopher Ferris, Gennady Laventman, Yacov Manevich, et al. Hyperledger fabric: a distributed
operating system for permissioned blockchains. In Proceedings of the thirteenth EuroSys conference, pages 1–15,
2018.
[26] Minghua Qu. Sec 2: Recommended elliptic curve domain parameters. Certicom Res., Mississauga, ON, Canada,
Tech. Rep. SEC2-Ver-0.6, 1999.
[27] National Institute of Standards and Technology (NIST). Digital signature standard (dss). Federal Information Processing Standards (FIPS) Publication 186-4, 2024. https://nvlpubs.nist.gov/nistpubs/fips/nist.fips.186-4.pdf.
[28] Ang Yang, Junghyun Nam, Moonseong Kim, and Kim-Kwang Raymond Choo. Provably-secure (chinese
government) sm2 and simplified sm2 key exchange protocols. The Scientific World Journal, 2014(1):825984,
2014.
[29] State Cryptography Administration of China (SCA). Public key cryptographic algorithm sm2 based on elliptic
curves, 2016.
[30] Rares Ifrim, Dumitrel Loghin, and Decebal Popescu. Baldur: A hybrid blockchain database with fpga or gpu
acceleration. In Proceedings of the 1st Workshop on Verifiable Database Systems, VDBS ’23, page 19–27, New
York, NY, USA, 2023. Association for Computing Machinery.

18
A PREPRINT - JANUARY 8, 2025

[31] Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, and Charles Weems. Dpf-ecc: Accelerating
elliptic curve cryptography with floating-point computing power of gpus. In 2020 IEEE International Parallel
and Distributed Processing Symposium (IPDPS), pages 494–504. IEEE, 2020.
[32] Lili Gao, Fangyu Zheng, Rong Wei, Jiankuo Dong, Niall Emmart, Yuan Ma, Jingqiang Lin, and Charles Weems.
Dpf-ecc: A framework for efficient ecc with double precision floating-point computing power. IEEE Transactions
on Information Forensics and Security, 16:3988–4002, 2021.
[33] Zonghao Feng, Qipeng Xie, Qiong Luo, Yujie Chen, Haoxuan Li, Huizhong Li, and Qiang Yan. Accelerating
elliptic curve digital signature algorithms on gpus. In SC22: International Conference for High Performance
Computing, Networking, Storage and Analysis, pages 1–13. IEEE, 2022.
[34] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. PowerGraph: Distributed
Graph-Parallel computation on natural graphs. In 10th USENIX Symposium on Operating Systems Design and
Implementation (OSDI 12), pages 17–30, Hollywood, CA, October 2012. USENIX Association.
[35] NVIDIA. Cgbn: Cuda accelerated multiple precision arithmetic (big num) using cooperative groups, 2018.
https://github.com/NVlabs/CGBN.
[36] Niall Emmart, Fangyu Zheng, and Charles Weems. Faster modular exponentiation using double precision floating
point arithmetic on the gpu. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 130–137.
IEEE, 2018.
[37] OpenSSL Software Foundation. Tls/ssl and crypto library, 2024. https://github.com/openssl/openssl.
[38] Wuqiong Pan, Fangyu Zheng, Yuan Zhao, Wen-Tao Zhu, and Jiwu Jing. An efficient elliptic curve cryptography
signature server with gpu acceleration. IEEE Transactions on Information Forensics and Security, 12(1):111–122,
2017.
[39] Darrel Hankerson, Alfred J Menezes, and Scott Vanstone. Guide to elliptic curve cryptography. Springer Science
& Business Media, 2006.
[40] Long Mai, Yuan Yan, Songlin Jia, Shuran Wang, Jianqiang Wang, Juanru Li, Siqi Ma, and Dawu Gu. Accelerating
sm2 digital signature algorithm using modern processor features. In International Conference on Information and
Communications Security, pages 430–446. Springer, 2019.
[41] Junhao Huang, Zhe Liu, Zhi Hu, and Johann Großschädl. Parallel implementation of sm2 elliptic curve cryptogra-
phy on intel processors with avx2. In Information Security and Privacy: 25th Australasian Conference, ACISP
2020, Perth, WA, Australia, November 30–December 2, 2020, Proceedings 25, pages 204–224. Springer, 2020.
[42] Peter L Montgomery. Modular multiplication without trial division. Mathematics of computation, 44(170):519–
521, 1985.
[43] C Kaya Koc, Tolga Acar, and Burton S Kaliski. Analyzing and comparing montgomery multiplication algorithms.
IEEE micro, 16(3):26–33, 1996.
[44] Daniel J Bernstein and Bo-Yin Yang. Fast constant-time gcd computation and modular inversion. IACR
Transactions on Cryptographic Hardware and Embedded Systems, pages 340–398, 2019.
[45] Benjamin Salling Hvass, Diego F Aranha, and Bas Spitters. High-assurance field inversion for curve-based
cryptography. In 2023 IEEE 36th Computer Security Foundations Symposium (CSF), pages 552–567. IEEE, 2023.
[46] Daniel J. Bernstein, Billy Bob Brumley, Ming-Shing Chen, and Nicola Tuveri. OpenSSLNTRU: Faster post-
quantum TLS key exchange. In 31st USENIX Security Symposium (USENIX Security 22), pages 845–862, Boston,
MA, August 2022. USENIX Association.
[47] Jiankuo Dong, Fangyu Zheng, Niall Emmart, Jingqiang Lin, and Charles Weems. sdpf-rsa: Utilizing floating-point
computing power of gpus for massive digital signature computations. In 2018 IEEE International Parallel and
Distributed Processing Symposium (IPDPS), pages 599–609, 2018.
[48] Fangyu Zheng, Wuqiong Pan, Jingqiang Lin, Jiwu Jing, and Yuan Zhao. Exploiting the potential of gpus for
modular multiplication in ecc. In International Workshop on Information Security Applications, pages 295–306.
Springer, 2014.
[49] Jiankuo Dong, Fangyu Zheng, Juanjuan Cheng, Jingqiang Lin, Wuqiong Pan, and Ziyang Wang. Towards high-
performance x25519/448 key agreement in general purpose gpus. In 2018 IEEE Conference on Communications
and Network Security (CNS), pages 1–9. IEEE, 2018.
[50] Supranational Corp. Zero-knowledge template library, 2024. https://github.com/supranational/sppark.
[51] ZPrize. Accelerating the future of zero-knowledge cryptography, 2023. https://www.zprize.io/.


[52] Nvidia Corp. Cuda c++ programming guide, 2024. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions.
[53] NVIDIA. Nvidia a100 tensor core gpu architecture, 2022. https://images.nvidia.com/aem-dam/en-zz/
Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.
[54] Paul C. Kocher. Timing attacks on implementations of diffie-hellman, rsa, dss, and other systems. In Neal
Koblitz, editor, Advances in Cryptology — CRYPTO ’96, pages 104–113, Berlin, Heidelberg, 1996. Springer
Berlin Heidelberg.
[55] Nvidia Corp. Cuda kernel profiling using nvidia nsight compute, 2019. https://developer.download.nvidia.cn/video/gputechconf/gtc/2019/presentation/s9345-cuda-kernel-profiling-using-nvidia-nsight-compute.pdf.
[56] Nvidia Corp. Nsight compute documentation, 2024. https://docs.nvidia.com/nsight-compute/2023.2/index.html.
[57] Jie Wang, Xinfeng Xie, and Jason Cong. Communication optimization on gpu: A case study of sequence alignment
algorithms. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 72–81.
IEEE, 2017.
[58] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P Scarpazza. Dissecting the nvidia volta gpu architecture
via microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
[59] Z Jia and PV Sandt. Dissecting the ampere gpu architecture through microbenchmarking. In GTC, 2021.
[60] Xianghong Hu, Xin Zheng, Shengshi Zhang, Weijun Li, Shuting Cai, and Xiaoming Xiong. A high-performance
elliptic curve cryptographic processor of sm2 over gf (p). Electronics, 8(4):431, 2019.
[61] FISCO. FISCO BCOS: an enterprise-level financial blockchain platform developed and open-sourced by the
Financial Services Blockchain Consortium (Shenzhen) "FISCO" led by WeBank., 2024. https://github.com/
FISCO-BCOS/java-sdk-demo.git.
