Final Project Report
Title:
FPGA Design and Implementation of Galois
Field Arithmetic Circuits
for Cryptography Applications
Presented by:
Benslimane Mohammed el Bachir, Zorgani Merouane
Supervisor:
Prof. Khouas Abdelhakim
Abstract
The internet has become the dominant medium of virtually all communi-
cation and transactions between individuals and even large corporations and
governments. However, it remains vulnerable to hacker attacks that may ex-
pose our information to unauthorized parties. Cryptography deals with this
problem to ensure that exchanged information cannot be exposed. This field
exploits computationally expensive mathematical problems to secure commu-
nication over the internet. Elliptic curves (ECs) are one such mathematical
tool. Our cryptosystems need to be fast and secure, otherwise they would
be susceptible to hacker attacks. We adopt hardware implementation as a
solution to this problem. By implementing the low level arithmetic oper-
ations in hardware, we lay the foundation to a safe and efficient EC based
cryptosystem as an application to our work. Our work achieved an optimized
implementation of prime and binary field arithmetic circuits, and in this re-
port we present a comparison between the two fields in terms of hardware
resources and efficiency.
Acknowledgements
We express our appreciation to those who helped, in any way, in the com-
pletion of this project; our supervisor Prof. Khouas Abdelhakim, our friends,
and our families.
Contents

List of Figures
List of Tables

1 Mathematical background
1.1 Elliptic curves
1.2 Modular arithmetic
1.3 Galois fields
1.4 Prime fields
1.5 Binary fields

2 Design
2.1 Prime field arithmetic circuits
2.1.1 Design of the prime field modulo adder
2.1.2 Design of the prime field modulo subtractor
2.1.3 Design of the prime field modulo multiplier
2.1.4 Design of the prime field modulo inverter
2.2 Binary field arithmetic circuits
2.2.1 Design of the binary field adder and subtractor
2.2.2 Design of the binary field multiplier
2.2.3 Design of the binary field inverter

3 Implementation
3.1 Simulation of the prime field multiplier
3.2 Simulation of the prime field inverter
3.3 Simulation of the binary field multiplier
3.4 Simulation of the binary field inverter
3.5 Performance

Bibliography
Appendix
List of Figures

2.1 Top level diagram of the prime field adder
2.2 Proposed hardware implementation of modulo adder
2.3 Proposed hardware implementation of modulo subtractor
2.4 Top level diagram of the prime field multiplication circuit
2.5 Circuit diagram of the prime field multiplier
2.6 Top level diagram of the prime field inversion circuit
2.7 State diagram of the FSM controller
2.8 Conceptual diagram of a modulo inverter
2.9 Circuit diagram for register U datapath
2.10 Circuit diagram for register V datapath
2.11 Circuit diagram for register X datapath
2.12 Circuit diagram for register Y datapath
2.13 Top level diagram of the binary field multiplication circuit
2.14 Top level diagram of the binary field inversion circuit
2.15 Bitwise comparison of u ⊕ v under degree equality and inequality
3.1 Circuit diagram of the prime field multiplier
3.2 Simulation results of the inversion for inputs a = 45 and p = 103
3.3 Simulation results of the binary multiplication circuit
3.4 Simulation results of the binary inversion simulation
3.5 Graph showing maximum operating frequency vs operand size for prime field multiplier circuit
3.6 Graph showing LUT consumption vs operand size for prime field multiplier circuit
List of Tables

1.1 Addition in GF(7)
1.2 Multiplication in GF(7)
2.1 An example multiplication using Algorithm 2.1.3
2.2 Table of values for u, v, x, and y
3.1 Prime field multiplier performance and used resources
3.2 Binary field multiplier performance and used resources
3.3 Prime field inversion performance and resource usage
3.4 Binary field inversion performance and resource usage
List of acronyms
• CPU: Central Processing Unit
• RSA: Rivest-Shamir-Adleman
Introduction
Cryptography is the process of shielding information used in communica-
tion such that no party except the sender and receiver is able to read or
process it. Throughout history, several techniques and methods have been
utilized for this purpose. From primitive encryption schemes, such as the Caesar
cipher, to modern algorithms like AES (Advanced Encryption Standard)
and RSA (named after its inventors Ron Rivest, Adi Shamir, and Leonard Adleman), the
aim of cryptography has always been to ensure secure communication be-
tween two parties. In today’s computer-dominated world, millions of devices
are exchanging critical data all the time. Each second, thousands of gigabytes
are being passed around. And so with malevolent hackers peering from every
corner, the need for fast, efficient and secure cryptosystems is as strong as
ever. To tackle the problem of building cryptosystems that are up to this
task, we need to define what a cryptosystem does. A cryptosystem takes the
sender’s text (known as plaintext) and produces an unreadable text (known
as ciphertext). The latter is sent over an insecure channel, where it’s suscep-
tible to being intercepted by unwanted third parties, to the receiving end,
where it is to be decrypted back into the plaintext.
For this system to be of any use at all, only the sender and receiver must
have the ability of decryption. A cryptosystem achieves this by using algorithms
that somehow enable the sender and receiver to share a piece of
information known as the key without which decryption is extremely diffi-
cult. To ensure the security of the cryptosystem as a whole, these algorithms
are based on mathematical problems that are computationally expensive and
require a great deal of time to reverse without the key. Our work will deal
with one such mathematical tool. A cryptosystem does not only need to be
secure, it needs to be fast and efficient as well. In fact, this reinforces its
security. Therefore, for a lot of applications, an implementation of such algorithms
on a general purpose CPU (Central Processing Unit) won't cut it. This
motivates the use of custom hardware circuits that are designed and optimized
for these specific tasks. Indeed, our work is centered around implementing
operations necessary for one of the standard encryption schemes which use
a mathematical problem associated with a specific type of curves called El-
liptic Curves. Defining operations on these curves turns out to produce a
discrete logarithm problem¹, which is at the core of one of the fundamental
key generation algorithms that underpin the realization of secure communi-
cation over the Internet. Our work proposes an optimized implementation of
the arithmetic operations associated with elliptic curves both over binary and
prime fields (these will be explained later). This report is organized as fol-
lows: in chapter 1 we go through an explanation of the math that is relevant
to our work, namely, modular arithmetic, Galois fields, prime and binary
field arithmetic. Then in chapter 2 we present and explain the algorithms
and the designs of the basic arithmetic blocks. In chapter 3 we inspect the
performance of the presented designs and compare between binary field and
prime field circuit performance.
¹ The discrete logarithm problem is the problem of finding x such that b^x = a, with a and b known. In many definitions of this exponentiation operation, it is extremely difficult to find x.
Chapter 1
Mathematical background
To understand how a cryptosystem permits the communicating parties to
possess the same key without directly transmitting it over the internet, which
is one of the fundamental tasks of a cryptosystem, we first need to entertain
the notion of operations. Think of an operation as a function that takes in
two numbers and outputs a number. The operation of addition takes in 9 and
5 and returns 14. If it were multiplication, the result would be 45. Curiously,
there exists an operation that takes in 9 and 5 and produces 2: grab a watch,
start at the number 9, and move 5 hours ahead. You will land at the number
2. If you input 12 and 39 into this operation, it yields 3, since 12 + 39 = 51,
and 12 goes into 51 four times, leaving a remainder of 3. This is one example
of an operation alternative to what's typically taught in primary school, and it's
called addition modulo 12.
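The clock examples above reduce to the mod operator, as this quick Python check shows (Python appears in this report only as a software model; the designs themselves are in VHDL):

```python
# Clock arithmetic is exactly the mod operator: add, then take the
# remainder upon division by 12.
assert (9 + 5) % 12 == 2    # start at 9, move 5 hours ahead
assert (12 + 39) % 12 == 3  # 51 = 4 * 12 + 3
```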
Cryptosystems make use of operations with certain properties to actualize
the following scheme: Alice, the sender, and Bob, the receiver, agree publicly
on an integer number g. Then each party generates their own secret number
a and b for Alice and Bob respectively. What happens now is, using an
operation denoted ⋆, Alice transmits, not a, but g ⋆ a. And Bob transmits,
not b, but g ⋆ b. This way, Alice can compute (g ⋆ b) ⋆ a using Bob's transmitted
value, and Bob can compute (g ⋆ a) ⋆ b using Alice's. This results in only Alice
and Bob having the same key g ⋆ a ⋆ b¹,
which they can use to communicate securely.
An astute reader should notice a problem in this scheme. If g and g ⋆ a
are both public knowledge, then it might be possible for unwanted third
¹ Note that this means that in order to make this a joint key for the two parties, we need the operation ⋆
to yield (g ⋆ b) ⋆ a = (g ⋆ a) ⋆ b. For example, (3 × 5) × 7 = (3 × 7) × 5.
parties to derive a. The same thing goes for Bob’s secret number, b. If
g = 428, a = 43 and we make the bad choice of the addition operation for ⋆,
then it’s possible for a spy to derive a from g + a = 471 simply by performing
the inverse operation of addition, subtraction; 471−g = 43 = a. This renders
the whole cryptosystem pointless. It is for this reason that we need operations
that are hard to reverse: an operation ⋆ such that, given g and
g ⋆ x, it would be extremely difficult for anyone to deduce x through any
means. This is precisely the discrete logarithm problem, and elliptic curves
provide exactly that.
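As an illustration only (not the elliptic curve operation this work targets), the scheme above can be sketched in Python with exponentiation modulo a prime standing in for ⋆; the toy numbers are ours:

```python
# Toy sketch of the key-exchange scheme above, with exponentiation
# modulo a prime standing in for the hard-to-reverse operation "star".
# Real systems use far larger numbers; these are illustrative only.
p = 23            # small public modulus (toy size)
g = 5             # public number agreed on by Alice and Bob
a = 6             # Alice's secret number
b = 15            # Bob's secret number

A = pow(g, a, p)  # Alice transmits g^a mod p, not a itself
B = pow(g, b, p)  # Bob transmits g^b mod p, not b itself

key_alice = pow(B, a, p)  # Alice computes (g^b)^a mod p
key_bob = pow(A, b, p)    # Bob computes (g^a)^b mod p
assert key_alice == key_bob  # both hold the same shared key
```

A spy who sees only p, g, A, and B must solve a discrete logarithm to recover a or b.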
the second insight, and it sets the stage for introducing the mathematical
prerequisites relevant to this work. This will be done in the next sections.
1.3 Galois fields
The mathematical operations we use in cryptography must have a specific
set of properties and constraints. A good demonstration of this statement
is the concept of a field. A field is basically a set equipped with two op-
erations, which we call addition, denoted +, and multiplication, denoted ·,
where these operations "behave nicely". For example, both operations must
be commutative. The precise definition of a field is provided in the appendix.
Fields of an infinite number of elements do not constitute much of a
concern in this work. We deal only with the so called finite fields or Galois
fields.
Definition 3. A finite field or a Galois field is a field with a finite number of
elements (finite order).
For example, there exists a field with 7 elements. If we represent each
element by a number:
{0, 1, 2, 3, 4, 5, 6}
We can, in fact, use addition modulo 7 and multiplication modulo 7 as the
operations associated with this Galois field. Table 1.1 shows the results of
performing the addition operation on any two elements of the field.
Table 1.1: Addition in GF(7)

+ 0 1 2 3 4 5 6
0 0 1 2 3 4 5 6
1 1 2 3 4 5 6 0
2 2 3 4 5 6 0 1
3 3 4 5 6 0 1 2
4 4 5 6 0 1 2 3
5 5 6 0 1 2 3 4
6 6 0 1 2 3 4 5
Note that every row contains a zero. In a field, any element x has what
we call an additive inverse y, often denoted −x, where x + y = 0.
Table 1.2 describes the multiplication result of every two elements in this
field.
Table 1.2: Multiplication in GF(7)

× 0 1 2 3 4 5 6
0 0 0 0 0 0 0 0
1 0 1 2 3 4 5 6
2 0 2 4 6 1 3 5
3 0 3 6 2 5 1 4
4 0 4 1 5 2 6 3
5 0 5 3 1 6 4 2
6 0 6 5 4 3 2 1
Definition 6. Let a and b be elements of the prime field GF(p). Their sum,
c, which belongs to GF(p), is then computed as follows: c = (a + b) mod p.
Subtraction is defined in a similar fashion:
Definition 7. Let a and b be elements of the prime field GF(p). Their
difference, d, which belongs to GF(p), is then computed as follows: d =
(a − b) mod p.
Example. The addition of 2 and 6 in GF (7) = {0, 1, 2, 3, 4, 5, 6} yields 1
since 2 + 6 mod 7 = 1.
Subtracting 6 from 4 in GF (7) yields 5 since 4 − 6 mod 7 = 5.
Multiplication is, unsurprisingly, also defined in a similar fashion;
Definition 8. Let a and b be elements of the prime field GF(p). Their
product, m, which belongs to GF(p), is then computed as follows: m =
(a × b) mod p.
Example. Multiplication of 4 and 9 in GF (11) yields 3; 4 × 9 mod 11 =
36 mod 11 = 3
Naturally, the next operation to consider is the inverse operation of prime
field multiplication; inversion. Essentially, it is the operation that returns the
multiplicative inverse of a number modulo p.
Definition 9. Let a be an element of GF (p). The multiplicative inverse of
a, denoted a−1 , is the element of GF (p) that satisfies: a × a−1 mod p = 1
Example. The multiplicative inverse of 9 in GF (11) is 5 , since 9 × 5 mod
11 = 45 mod 11 = 1
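These inverse examples are easy to check in software; since Python 3.8, the built-in pow exposes the modular inverse directly:

```python
# Multiplicative inverse in the prime field GF(p), reproducing the
# example above: the inverse of 9 in GF(11) is 5.
p, a = 11, 9
inv = pow(a, -1, p)        # modular inverse (Python 3.8+)
assert inv == 5
assert (a * inv) % p == 1  # the defining property of the inverse
```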
The smallest Galois field is GF (2) = {0, 1}, which is a prime field . The
addition and multiplication are, of course, done modulo 2, as demonstrated
by the tables below.
addition multiplication
+ 0 1 × 0 1
0 0 1 0 0 0
1 1 0 1 0 1
Notice that in this field, addition and subtraction are the same operation;
each element is its own inverse. The reader might find it interesting that
addition and multiplication operations in GF (2) are no more than XOR and
AND gates respectively.
Note that finding the multiplicative inverse of an element in GF(p) is not as
straightforward as the other operations in the field.
Binary fields with m > 1 are often called Extension fields. It’s not hard
to see why binary fields could offer convenient characteristics. For m = 8, for
example, each field element could be represented by one byte.
Since the order is not prime, we cannot use modular arithmetic to handle
the operations of binary fields. How, then, does one go about performing
addition and multiplication on the elements of binary fields? The answer lies
in the way we represent these elements.
Every element in a binary field GF (2m ) can be represented by a polynomial
with binary coefficients. Each polynomial has a degree of at most m − 1, which
amounts to m coefficients for each element. For example, elements in GF(2^8) have the
form

A(x) = a_7 x^7 + ... + a_1 x + a_0, where a_i ∈ GF(2) = {0, 1}
Addition in binary fields is simple classical polynomial addition, where
we add corresponding coefficients in GF(2). It is therefore natural to think
of addition (or subtraction) in GF(2^m) as simply a bitwise XOR function.
Definition 11. Let A(x), B(x) be elements of GF(2^m). Their sum (or
difference) is then computed according to:
C(x) = A(x) + B(x) = A(x) − B(x) = Σ_{i=0}^{m−1} c_i x^i,
where c_i = (a_i + b_i) mod 2 = (a_i − b_i) mod 2.
A(x) = x^7 + x^6 + x^4 + 1
B(x) = x^4 + x^2 + 1
C(x) = x^7 + x^6 + x^2
Subtraction yields the same result.
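Encoding each polynomial as a bit vector (bit i holding the coefficient of x^i, an encoding of our choosing), the addition above is a single XOR:

```python
# Binary field addition as a bitwise XOR of coefficient vectors.
A = 0b11010001          # x^7 + x^6 + x^4 + 1
B = 0b00010101          # x^4 + x^2 + 1
C = A ^ B               # coefficient-wise addition mod 2
assert C == 0b11000100  # x^7 + x^6 + x^2, as computed above
```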
In other words, we find the product A(x) · B(x), then divide it by P (x),
and take the remainder. We consider an example.
Example. Let A(x) = x^2 + 1 and B(x) = x^2 + x + 1 be elements of GF(2^4),
with the irreducible polynomial P(x) = x^4 + x^3 + 1.
We shall perform the multiplication operation on the elements A and B.
To do so, we first calculate the plain polynomial product:
A(x)B(x) = x^4 + x^3 + x + 1.
Dividing by P(x) gives x^4 + x^3 + x + 1 = P(x) · 1 + x, so the remainder is the result:
A · B = (x^2 + 1) · (x^2 + x + 1) = x mod P(x),
or, in binary vector form, (0 1 0 1) · (0 1 1 1) = (0 0 1 0).
In regards to operations on binary fields, only inversion is left to discuss.
Definition 14. For a given binary field GF(2^m) and its associated irreducible
polynomial P(x), the inverse A^(−1)(x) of a non-zero element A(x) ∈ GF(2^m)
must satisfy
A(x) · A^(−1)(x) mod P(x) = 1
Example 1. For the binary field GF(2^6) with the irreducible polynomial
P(x) = x^6 + x^5 + x^2 + x + 1, the inverse of the element 110101 is 001001,
because
(x^5 + x^4 + x^2 + 1)(x^3 + 1) mod P(x) = 1
Chapter 2
Design
In this chapter, we shall go through the design and implementation of the
basic arithmetic blocks of prime and binary fields. The approach we take is
as follows: We take a look at the algorithm or mathematical formalism of the
operations in question, then derive a circuit that implements it.
These algorithms are sequential in nature; that is, they are a series of operations
that are to be performed in a specific order, with some depending
on others. To enforce this sequential order in hardware, which is concurrent in
nature, we need to follow a specific methodology that leads to the realization
of a circuit that will faithfully follow the algorithm and yield the expected
results.
First of all, the variables in the algorithms are mapped to registers that take
their appropriate values from the datapath; a combinational circuit that out-
puts the register’s next value based on the algorithm’s instructions.
To enforce the order of the operations, we use a finite state machine (FSM)
that will act as the datapath controller.
This way we have circuitry that performs the data manipulation and calcu-
lations for our algorithm’s variables and an FSM that controls them. This is
known as the register transfer level methodology (RTL methodology [1]).
2.1.1 Design of the prime field modulo adder
Our circuit will work with inputs that do not exceed the prime modulus.
This offers a simple way to implement modular addition; if a and b belong to
a prime field, then
(a + b) mod p = a + b        if a + b < p
(2.1)
(a + b) mod p = a + b − p    if a + b ≥ p
Taking a look at equation 2.1, we see that the result of modular addition is
nothing but a selection between two computed results based on a comparison
of their magnitudes. For this, we just need an adder to compute a + b, a
subtractor to take off p from that result, and a multiplexer that selects the
output of the subtractor if a + b ≥ p; otherwise it selects the output of the
adder. The circuit diagram that performs the described operation is shown
in Figure 2.2
Following a similar logic to the previous design, the modulo subtractor circuit
is therefore derived and is shown in figure 2.3.
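The selection logic of both circuits can be modelled in a few lines of Python (a behavioural sketch of the datapath, not the VHDL itself; the function names are ours):

```python
# Behavioural model of the modulo adder and subtractor: compute both
# candidate results, then select one, exactly as the multiplexer does.
def mod_add(a, b, p):
    s = a + b
    return s - p if s >= p else s  # subtractor output chosen when a + b >= p

def mod_sub(a, b, p):
    d = a - b
    return d + p if d < 0 else d   # a borrow means we must add p back

assert mod_add(2, 6, 7) == 1
assert mod_sub(4, 6, 7) == 5
```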
Figure 2.2: Proposed hardware implementation of modulo adder
Figure 2.4: Top level diagram of the prime field multiplication circuit
There are many algorithms that implement modular multiplication pre-
sented in the literature. We chose one that directly follows from the basic
shift-and-add algorithm. It makes use of the linearity of the mod operator
and reduces the partial product in each iteration of its execution. The
algorithm's pseudocode is presented below (Algorithm 2.1.3):
Algorithm 2.1.3 Conventional Interleaved Modular Multiplication (Left-to-
Right) [3]
Input: X, Y are n-bit vectors such that X, Y ∈ [1, p − 1], with each bit
given by X(i), Y (i) ∈ {0, 1}.
Output: Z = (X · Y ) mod p
1. Z ← 0
2. For i from n − 1 down to 0, do:
   2.1. U ← 2Z
   2.2. V ← Y(i) · X
   2.3. W ← U + V
   2.4. Z ← W mod p
3. Return Z
To compute the mod product, this algorithm follows the basic shift and
add algorithm from usual multiplication (lines 2.1, 2.2 and 2.3) then reduces
the partial product modulo p (line 2.4) and repeats these steps for all multi-
plier bits Y(i) (loop in line 2). The usual shift and add multiplication of two
vectors simply accumulates shifted partial products. But we are doing multiplication
in GF(p), which means our result needs to be in the range [0, p − 1].
Hence the mod reduction in line 2.4, which takes the remainder of the usual
multiplication partial product upon division by the modulus p. This is possible
because the mod operator is linear; that is, the remainder of a sum equals the
sum of the remainders, reduced modulo p.
This guarantees that the partial product will be brought back into the range
[0, p-1] whenever it exceeds it. In fact it will never exceed the range by more
than 2p. To see that this is indeed valid (and to illustrate the algorithm) we
consider the worst case for a 3-bit multiplication example: let the modulus
p = 7, X = 110 and Y = 110. The results of each step are presented in table 2.1.
Table 2.1: An example multiplication using Algorithm 2.1.3

i  U     V    W      Z
2  000   110  110    110
1  1100  110  10010  100
0  1000  000  1000   001
If we track the values of W, we see that it never exceeds the range by more
than 2p. This gives us insight as to how we should implement the reduction
in line 2.4. We distinguish 3 cases:
Case 1: W < p. W is already in the range.
Case 2: p ≤ W < 2p. In this case we need to subtract p to bring W back into the range.
Case 3: W ≥ 2p. In this case we need to subtract 2p to bring W back into the range.
This tells us that the result is a selection between the results in the above
3 cases. To implement this in hardware, we use 2 subtractors to compute
W − p and W − 2p, whose results feed into a multiplexer along with the
unreduced W. To choose the value to be fed into Z, the select lines
of the multiplexer are simply the sign bits of the subtractors' outputs, with
s1 being the sign bit of W − p and s2 the sign bit of W − 2p. There are again 3 cases:
Case 1: s1 s2 = 00, meaning neither subtraction makes the result negative,
so we choose W − 2p.
Case 2: s1 s2 = 01, meaning W − 2p is negative while W − p is not, so we
select W − p.
Case 3: s1 s2 = 11, meaning both subtractions make the result negative, so
W is already in the range and we select W.
The case s1 s2 = 10 is impossible, because it would imply that subtracting p
from W makes the result negative while subtracting 2p does not. In hardware,
line 2.1 is simply a shift left operation. Line 2.2 is
done using AND gates as shown in the figure. To index into Y we use a shift
register, take its MSB and feed it into the AND gates. Line 2.3 is simply an
adder. All that remains is to implement the loop in line 2. We use a counter
along with an FSM with two states: an Idle state to initialize the registers
X, Y and Z. The FSM stays in the idle state until the Start input is asserted.
Thereafter it rolls to the ”compute” state where the operations in 2.1, 2.2,
2.3 and 2.4 are carried out concurrently. The FSM keeps entering this state
until the value of the counter reaches 0, then it goes back to the idle state
where the ready output signal is asserted announcing the end of execution.
The circuit is presented in Figure 2.5
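The behaviour of this circuit can be modelled bit-for-bit in software. The sketch below (ours, in Python; n is the multiplier bit width) mirrors the shift, the AND gates, the adder, and the two-subtractor reduction:

```python
# Bit-level model of interleaved modular multiplication (Algorithm 2.1.3).
def interleaved_mod_mul(x, y, p, n):
    z = 0
    for i in reversed(range(n)):      # multiplier bits, left to right
        u = z << 1                    # U <- 2Z (shift left)
        v = x if (y >> i) & 1 else 0  # V <- Y(i) * X (the AND gates)
        w = u + v                     # W <- U + V
        # W never exceeds the range by 2p or more, so selecting among
        # W, W - p and W - 2p (the two subtractors) suffices:
        z = w - 2 * p if w >= 2 * p else (w - p if w >= p else w)
    return z

# Worked example from Table 2.1: X = Y = 110, p = 7, result 001.
assert interleaved_mod_mul(0b110, 0b110, 7, 3) == 1
```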
Figure 2.6: Top level diagram of the prime field inversion circuit
To find the multiplicative inverse of an integer a modulo p, that is; the
integer x such that a × x mod p = 1, several algorithms were developed.
These include the Itoh-Tsujii algorithm [4] (which relies
on repeated exponentiation), the extended Euclidean algorithm (which relies
on reduction by division), the binary extended Euclidean algorithm, etc.
These algorithms are explained extensively in the literature, so we will solely
present the latter algorithm, the one relevant to our work. Its pseudocode is
presented in Algorithm 2.1.4.
Algorithm 2.1.4 Binary extended Euclidean algorithm for inversion in GF(p)
[2]
1. u ← a, v ← p, x ← 1, y ← 0
2. While u ≠ 0, do:
   2.1. While u is even, do:
      2.1.1. u ← u/2
      2.1.2. If x is even, then x ← x/2; else x ← (x + p)/2
   2.2. While v is even, do:
      2.2.1. v ← v/2
      2.2.2. If y is even, then y ← y/2; else y ← (y + p)/2
   2.3. If u ≥ v, then:
      2.3.1. u ← u − v
      2.3.2. x ← x − y mod p
   2.4. Else:
      2.4.1. v ← v − u
      2.4.2. y ← y − x mod p
3. If u = 1, then R ← x mod p; else R ← y mod p
The algorithm uses 4 variables: u, v, x, and y. They are first initialized as
specified in line 1. Then the algorithm enters the main while loop and stays
there until u is reduced to 0. Inside this loop there are 2 consecutive while loops;
after them there is a conditional assignment that depends on the values of
u, v, x, and y computed in the previous two loops. To better illustrate the
algorithm, we go through an example (the inverse of 4 in GF(7)) and compute
the values of the variables at each step. The results are summarized in table 2.2.
u v x y
4 7 1 0
2 7 4 0
1 6 2 5
1 2 2 4
0 1 2 2
Table 2.2: Table of values for u, v, x, and y
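A software model of the algorithm (ours, in Python) reproduces the table row by row; its three branches correspond directly to the FSM states introduced next:

```python
# Software model of binary extended Euclidean inversion in GF(p)
# (Algorithm 2.1.4).
def beea_inverse(a, p):
    u, v, x, y = a, p, 1, 0
    while u != 0:
        while u % 2 == 0:                              # "while u even"
            u //= 2
            x = x // 2 if x % 2 == 0 else (x + p) // 2
        while v % 2 == 0:                              # "while v even"
            v //= 2
            y = y // 2 if y % 2 == 0 else (y + p) // 2
        if u >= v:                                     # "u v comp"
            u, x = u - v, (x - y) % p
        else:
            v, y = v - u, (y - x) % p
    return y % p  # at exit v = 1, so y holds the inverse

assert (4 * beea_inverse(4, 7)) % 7 == 1  # the example of Table 2.2
```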
To preserve the order, loop structure and interdependency of the steps our
FSM must do the following: Have a state where only u and x are changed as
specified by lines 2.1.1 and 2.1.2 until u is no longer even. The FSM should
then enter a new state where only v and y this time are changed as specified
by lines 2.2.1 and 2.2.2 until v is no longer even. After these two loops are
executed the FSM should enter a state where the values of U,V,X and Y
are ready and the selective assignment can take place as specified by lines
2.3 through 2.4.2. We label the previously outlined states the "while u even",
"while v even" and "u v comp" states respectively. These come in addition to the "idle"
state, where the registers are initialized, the start input signal is checked to
begin execution, and the ready signal is asserted. This way, the FSM enforces
the nested loop structure of the algorithm and the flow of execution.
The state diagram is shown in figure 2.7
To use the FSM, we let it control the values that the registers should take
in each state, by using the state register as the select line of the multiplexers that
feed the registers from the datapath, as shown in the conceptual diagram in
Figure 2.7: State diagram of the FSM controller
figure 2.8.
We now specify the register values and derive the datapath by tracking the
possible values of each register individually:
The U register: It takes U/2 if the FSM is inside the while-u-even state,
retains its value if inside the while-v-even state, and in the u-v-comparison
state it receives u − v if u ≥ v; otherwise it retains its value.
Division by 2 is a shift right operation, so the datapath for the register
simply consists of MUXes for selection, a shifter and a subtractor. The
circuit is shown in Figure 2.9
The V register: Similar to the U register, the V register takes V/2 if the
FSM is inside the while-v-even state, retains its value if the FSM is
inside the while-u-even state, and in the u-v-comparison state it
takes v − u if the u ≥ v condition is false, retaining its value otherwise. The
circuit is shown in Figure 2.10
The x register: In the while-u-even state, the x register takes x/2 if x is
even, or (x + p)/2 otherwise; it retains its value in the while-v-even state.
In the u-v-comparison state it takes x − y if x > y; if not, it receives x + p − y.
The circuit is shown in Figure 2.11

The y register: In the while-v-even state, the y register takes y/2 if y is even,
or (y + p)/2 otherwise; in the while-u-even state it retains its value, and in the u-v-comparison state a
similar selection to register x determines its value. The circuit is shown
in figure 2.12

Figure 2.10: Circuit diagram for register V datapath
Figure 2.13: Top level diagram of the binary field multiplication circuit
1. If a_0 = 1, then c ← b; else c ← 0.
2. For i from 1 to m − 1, do:
   2.1. b ← b · z mod f(z)
   2.2. If a_i = 1, then c ← c + b
3. Return c.
In essence, the algorithm shifts b one bit to the left, then accumulates it
into c if a_i = 1. Except that when we shift b, it could exceed the range
of GF(2^m), in which case we subtract f(z) from it (i.e., XOR them). This
explains the implementation of the mod operation in line 2.1.
The datapath is therefore apparent. Now, note that the operations inside
the loop are possible to perform in one clock cycle. We therefore need two
states for the control path; the idle state where we load the registers with
their initial values, and an Operation state where we perform the necessary
operations (shift, Xor, decrement) based on the relevant conditions. The
FSM will simply keep entering the Operation state to loop over all the a
register indices. To index into a we simply shift it right and process its LSB.
For its simplicity, we leave the circuit out.
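A software model (ours, in Python) captures the shift, reduce and accumulate steps of the loop:

```python
# Model of the binary field multiplier: shift b left each iteration,
# reduce it by the irreducible polynomial f whenever it leaves GF(2^m),
# and XOR it into the accumulator c whenever the current bit of a is 1.
def gf2m_mul(a, b, f, m):
    c = 0
    for _ in range(m):
        if a & 1:    # current LSB of a (a is shifted right below)
            c ^= b   # accumulate b into c
        a >>= 1
        b <<= 1      # b <- b * z
        if b >> m:   # b left the range of GF(2^m)...
            b ^= f   # ...so reduce: XOR with f(z)
    return c

# GF(2^4) with f(z) = z^4 + z^3 + 1, reproducing the earlier example:
# (x^2 + 1)(x^2 + x + 1) = x.
assert gf2m_mul(0b0101, 0b0111, 0b11001, 4) == 0b0010
```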
Figure 2.14: Top level diagram of the binary field inversion circuit
Algorithm 2.2.3 Binary extended Euclidean algorithm for inversion in GF(2^m)
1. u ← a, v ← f
2. x ← 1, y ← 0
3. While u ≠ 1 and v ≠ 1, do:
   3.1. While u is divisible by z, do:
        u ← u/z; if x is divisible by z, then x ← x/z; else x ← (x + f)/z
   3.2. While v is divisible by z, do:
        v ← v/z; if y is divisible by z, then y ← y/z; else y ← (y + f)/z
   3.3. If deg(u) ≥ deg(v), then:
        u ← u + v, x ← x + y
        Else:
        v ← v + u, y ← y + x
4. If u = 1, then return x; else return y
We note that this is very similar to algorithm 2.1.4 for prime field inver-
sion. The difference is that the addition here is just bitwise XORing and that
we are comparing the degrees of the polynomials instead of their magnitudes.
Therefore, the implementation will be very similar.
The FSM is exactly the same in both designs. The register datapaths, here,
will have XOR gates instead of adders. What might be a bit tricky is the im-
plementation of the comparison in line 3.3. Explicitly computing the degrees
of each polynomial and then performing a comparison is quite expensive in
hardware. To work around this, we reasoned as follows: if two vectors u and
v have the same degree, then u XOR v will be smaller than both u and v. If,
instead, polynomial u has a greater degree than v, then XORing them must
yield a polynomial that is greater than v. This is demonstrated in figure 2.15
Figure 2.15: Bitwise comparison of u⊕v under degree equality and inequality
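The trick is easy to confirm on small bit vectors (a Python check with values of our choosing):

```python
# Degree comparison via XOR: no explicit degree computation needed.
u, v = 0b110110, 0b100011  # equal degrees: the leading 1s cancel,
assert (u ^ v) < u         # so u XOR v is smaller than u...
assert (u ^ v) < v         # ...and smaller than v

u, v = 0b110110, 0b010011  # deg(u) > deg(v): u's leading 1 survives,
assert (u ^ v) > v         # so u XOR v is greater than v
```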
Chapter 3
Implementation
In this chapter we present the simulation results of the designs presented in
the previous chapter. We also examine and compare the performance and
used resources of both prime and binary field designs.
it converges to the value 89, which is indeed equal to 107 × 100 mod 131.
The ready output signal is finally asserted, announcing the end of execution.
Figure 3.2: Simulation results of the inversion for inputs a = 45 and p = 103
We can see that upon asserting the start signal, the output starts to go
through different values following the algorithm (this can easily be verified
by carrying out the algorithm by hand, checking the values of registers U, V,
X, Y) until, after 14 clock cycles, it converges to the value 87, which is indeed
the multiplicative inverse of 45 mod 103.
The ready output signal is finally asserted, announcing the end of execution.
Figure 3.3: Simulation results of the binary multiplication circuit
We can see that upon asserting the start signal, the output starts to go
through different values following the algorithm until, after 6 clock cycles, it
converges to the value (189)_10 = (10111101)_2, which is indeed equal to the
product of a and b in this field.
The ready output signal is finally asserted, announcing the end of execution.
We can see that upon asserting the start signal, the output starts to go
through different values following the algorithm until, after 13 clock cycles, it
converges to the value (9)_10 = (001001)_2, which is indeed correct, as
Example 1 shows. The ready output signal is finally asserted, announcing the end of
execution.
3.5 Performance
The designs presented in the previous sections have been implemented us-
ing VHDL to target the Xilinx Artix 7 FPGA.
The performance results and used resources for different NIST recommended
field sizes of both prime and binary field multipliers and inverters are sum-
marized in tables 3.1, 3.2, 3.3, and 3.4 respectively.
Overall, the binary field multiplier achieves a better frequency and consumes
fewer resources. This is due to the simplicity of its algorithm and datapath,
which consist only of XOR gates and multiplexers, whereas the
prime field multiplier datapath contains adders as well, which introduce
more delay and consume more look-up tables (LUTs).
Table 3.3: Prime field inversion performance and resource usage
Table 3.4: Binary field inversion performance and resource usage
From Tables 3.3 and 3.4 we can see that, again, the binary field inverter
outperforms its prime field counterpart and consumes fewer resources.
Once again, this is expected, especially since both circuits implement the
same algorithm (Algorithms 2.1.4 and 2.2.3). However, because the associated
binary field operations are much cheaper to perform (an XOR compared to a
full adder), the critical path in the binary field inversion circuit is shorter. This
further confirms that binary fields are generally more efficient and consume
less area.
To visualize how operand size affects frequency and resource consumption,
we present in Figures 3.5 and 3.6, as an example, the prime field multiplier's
maximum frequency and LUT usage as functions of operand size.
Figure 3.5: Graph showing maximum operating frequency vs operand size
for prime field multiplier circuit
The reason the curve slightly changes slope is that the synthesis tools can
only approximate the maximum operating frequency of the design; we estimate
it by adjusting the timing constraints that the synthesis tool should try
to achieve. For example, if we tighten the timing constraints, the routing
tools may "work harder" to achieve a better maximum frequency, while if we
loosen them, they may settle for a suboptimal frequency. Therefore, this is
not an exact curve of maximum frequency versus size. However, it reveals
that the maximum frequency of this multiplier design is roughly related to
size through a linear fit.
It should be noted that a full comparison that takes into account other
parameters, such as security and power consumption, is beyond the scope of
our work.
Figure 3.6: Graph showing LUT consumption vs operand size for prime field
multiplier circuit
Conclusion
In our work, overall, we achieved good performance in the individual de-
signs. We showed that binary field arithmetic circuits generally consume
fewer resources and achieve a higher maximum operating frequency. However,
it should be noted that the inversion circuits achieve a significantly lower
maximum frequency than the other designs, which would introduce a bottleneck
if they were integrated into a larger system. This is a limitation of our work.
As future work, we will first apply optimization techniques such as pipelining
to increase frequency and avoid bottlenecks; we would then use our designs
to implement the EC group operations, which would in turn implement an
ECC protocol such as key exchange or digital signature generation and
verification. To make our designs practical, we could use the processor built
into the Virtex-4 FPGA board to obtain a functioning system-on-chip that
serves as an ECC core.
Appendix
Definition 15. A field is a set F with two binary operations called addition,
denoted +, and multiplication, denoted ·, satisfying the following field axioms:
FA10 (Distributivity) For all x, y, z ∈ F, multiplication distributes over
addition: x · (y + z) = x · y + x · z.