Looting The LUTs
1 Introduction
Fig. 1: Evolution of the AES Sbox/ISbox area vs. Xilinx FPGA families
Our Contributions. In this article, we propose new improvements for FPGA
implementations of AEAD schemes based on AES-like primitives. These
improvements are twofold.
Firstly, we provide a new efficient hardware architecture for OCB-like AEAD
modes (Section 2). The architecture uses a generic multi-stream AES-like cipher,
such as AES or Deoxys-BC (the tweakable block cipher used in CAESAR compe-
tition candidate Deoxys [13]) as an underlying primitive. This architecture can
be easily modified to support the OTR or SCT AEAD modes for example.
Secondly, we improve the implementation efficiency of several AES-like
ciphers, such as AES, LED and Deoxys-BC. In particular, the problem of FPGA
mapping and under-utilized hardware discussed earlier is studied in detail for
two applications (Section 3):
– we show how to design low-area logic primitives optimized for FPGA LUTs
instead of the number of logic gates (Section 3.2).
– we explain how to select the locations of pipelining registers to accommodate
as many independent streams as possible without any additional area cost
compared to the single stream architecture (Section 3.3).
Finally, as practical results, following these implementation strategies we
obtained very efficient LED and AES implementations (Section 4). For example,
our AES implementation achieves an efficiency of 38 Mbps/slice, which is, to
the best of our knowledge, the most efficient AES FPGA implementation in the
literature. We also applied our techniques to Deoxys and obtained the
current best Deoxys-I FPGA implementation, improving on the efficiency of
previous implementations by a factor ∼1.7 with almost the same area. Table 1
shows a summary of our results compared to state-of-the-art implementations.
Table 1: Summary of our results compared to the state-of-the-art implementa-
tions
Algorithm   Family   Impl.   Slices   Throughput (Gbps)   Efficiency (Mbps/slice)
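[Figure residue: mode-of-operation diagram. Labels: message blocks m0 ... mn, ciphertext blocks c0 ... cn, checksum Σ mi, associated data blocks AD0 ... ADn, Tag]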
1. The first and second parts of execution do not depend on each other. Con-
sequently, following the implementation from Poschmann and Stöttinger on
the ATHENa website [2], the order can be reversed. This enables one to use
the same storage for both the checksum and tag computation.
2. In Figure 2 the computations are completely independent, while in Figure 3,
there is an output dependency between different blocks. Since there is no
input dependency, both the structures are fully parallelisable. Additionally,
a small temporal shift saves the temporary storage needed. For example, the
first block starts at time t = 0 and the second block starts at t = ∆t. At
t = T the first block is finished and stored in the tag storage. Finally, at
time t = T + ∆t the second block is finished and XOR-ed with tag, in-place.
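The in-place accumulation described in item 2 can be sketched as follows. This is a minimal Python model under stated assumptions, not the hardware itself: `toy_block_fn` is a hypothetical stand-in for the block cipher call (it is not a real cipher), and the parameters `latency` and `shift` play the roles of T and ∆t.

```python
def toy_block_fn(block: int) -> int:
    """Hypothetical stand-in for one block-cipher call (NOT a real cipher)."""
    return (block * 0x9E3779B1 + 0x7F4A7C15) & 0xFFFFFFFF


def staggered_tag(blocks, latency: int, shift: int) -> int:
    """XOR block outputs into a single tag register as they complete.

    Block i enters the data path at time i * shift and its output is
    available at time i * shift + latency. With shift >= 1 the outputs
    arrive at distinct times, so each one can be XOR-ed into the tag
    register directly, without a temporary buffer.
    """
    if shift < 1:
        raise ValueError("shift must be at least one cycle")
    completions = [(i * shift + latency, toy_block_fn(m))
                   for i, m in enumerate(blocks)]
    tag = 0
    previous_time = None
    for time, output in sorted(completions):
        # At most one result per time step reaches the tag register.
        assert time != previous_time
        tag ^= output  # in-place update of the tag storage
        previous_time = time
    return tag


if __name__ == "__main__":
    print(hex(staggered_tag([1, 2, 3, 4], latency=10, shift=1)))
```

The result equals the plain XOR of all block outputs; the model only makes explicit that the staggered arrival times are what removes the need for extra storage.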
2.3 Proposed Architecture
The proposed high-level architecture is shown in Figure 4. For simplicity, only
the encryption data path is drawn; a similar data path for decryption
can also be included. The architecture consists of a single round of the
underlying block cipher, divided into N stages, each of which takes one cycle
to process. If the block cipher requires r rounds, the architecture loads
and processes N blocks every r · N cycles, which leads to an average latency of r
cycles per block, equivalent to a simple single-round implementation. The
selection of N depends on several considerations:
[Figure 4: Proposed high-level architecture (encryption data path). Labels: Tweak, Key, Stage 0 ... Stage N, SRL_N shift registers on the tweak and key paths, Tag Management]
1. This architecture is intended for high speed over long messages. Note that
any number of blocks smaller than N takes the same time to encrypt as N blocks.
Consequently, a very large N leads to a large overhead for short messages or
for messages whose length in blocks is not divisible by N.
2. In order to minimize the key scheduling overhead, it is performed in only one
pipeline stage and then shifted N cycles. This is based on the SRL feature
of the FPGA LUTs, which allows the utilization of very compact serial shift
registers using logic LUTs. For most FPGAs, a single LUT can implement
either a 16-bit or 32-bit SRL, which we consider as the upper bound on the
value of N .
3. The pipeline registers can add a huge overhead over the simple round im-
plementation. Therefore, in Section 3.3 we describe a technique to select the
optimal locations of the pipeline registers in the FPGA implementation.
From these three considerations, we concluded that the optimal value for N
is between 2 and 4, neglecting the control overhead. This leads to a speed-up be-
tween 2x and 4x. Additionally, for applications that require ultra high speed over
very long messages, e.g. disk encryption, high speed multimedia interfaces, etc.,
and do not care about the area, the same architecture can be unrolled into a fully
pipelined implementation. This can lead to a huge increase of the throughput.
Specifically, the single-round multi-stream architecture requires about
$r \cdot N \cdot \lceil B/N \rceil$ cycles to compute $B$ blocks. On the other
hand, a fully unrolled architecture has an initial latency of $r \cdot N$ cycles
and generates a new block every cycle, leading to a total of
$r \cdot N + B - 1$ cycles. The speed-up over the round implementation is given by

$$G = \frac{r \cdot B}{r \cdot N + B - 1}$$

and for very long messages, the unrolled architecture has a speed-up of r times.
Since the area increases less than r times (only the round part is replicated
while the tag and control part almost have the same area), the efficiency remains
unchanged. In Section 4.1 we show that an AES round can be implemented with a
clock frequency greater than 700 MHz on FPGA, with almost the same number
of slices/LUTs. Therefore, we estimate that this variant can be suitable for
applications that require very high speed authenticated encryption.
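The cycle counts and the speed-up G above can be evaluated with a short script. This is a minimal sketch: the function names are illustrative, and the values r = 10 and N = 4 in the usage part are example parameters chosen here, not results from the paper.

```python
from math import ceil


def cycles_single_round(r: int, n_stages: int, blocks: int) -> int:
    """Single-round, N-stage multi-stream core: r * N * ceil(B / N) cycles."""
    return r * n_stages * ceil(blocks / n_stages)


def cycles_unrolled(r: int, n_stages: int, blocks: int) -> int:
    """Fully unrolled pipeline: r * N cycles of latency, then one block per cycle."""
    return r * n_stages + blocks - 1


def speedup_over_round(r: int, n_stages: int, blocks: int) -> float:
    """G = r * B / (r * N + B - 1), relative to a plain r-cycle round core."""
    return (r * blocks) / cycles_unrolled(r, n_stages, blocks)


if __name__ == "__main__":
    r, n = 10, 4  # example parameters (assumption): 10 rounds, 4 stages per round
    for b in (1, 4, 5, 64, 10**6):
        print(b,
              cycles_single_round(r, n, b),
              cycles_unrolled(r, n, b),
              round(speedup_over_round(r, n, b), 3))
```

For short messages the unrolled pipeline is dominated by its fill latency, and G only approaches r as B grows, which matches the remark that the unrolled variant targets very long messages.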
implementations. In [14], the authors proposed the AES data path shown in Fig-
ure 5. Each box in Figure 5 represents a pipeline stage, and it can be noticed that
the selection of the pipeline stages is based on the functionality of each stage,
which leads to two very fast stages in the beginning, then two slow stages after-
wards. This limits the maximum possible frequency. In the next sections, we will
show why this architecture might not be optimal and describe a new four-stream
data path designed for FPGA to achieve higher performance efficiency.
[Figure 5: AES data path from [14], with pipeline stages Input Selection, Add Key, Sbox, MixColumns]
Specifically, only 108 XOR gates are required for implementing the 32-bit
mapping [26]. However, as discussed earlier, since modern FPGAs use large
6/5-input LUTs to implement logic circuits, having many small shared
2/3-input gates is not the most efficient solution. Synthesizing the circuit
used in [26] or [27] for a Virtex-6 FPGA requires 41 LUTs for low area and
44 LUTs for high speed. On the other hand, the dot-product view is given by
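$$p = 2 \cdot a \oplus 3 \cdot b \oplus c \oplus d$$

$$\begin{array}{cccccccc}
a_6 & a_5 & a_4 & a_3 & a_2 & a_1 & a_0 & 0 \\
0 & 0 & 0 & a_7 & a_7 & 0 & a_7 & a_7 \\
b_6 & b_5 & b_4 & b_3 & b_2 & b_1 & b_0 & 0 \\
0 & 0 & 0 & b_7 & b_7 & 0 & b_7 & b_7 \\
b_7 & b_6 & b_5 & b_4 & b_3 & b_2 & b_1 & b_0 \\
c_7 & c_6 & c_5 & c_4 & c_3 & c_2 & c_1 & c_0 \\
d_7 & d_6 & d_5 & d_4 & d_3 & d_2 & d_1 & d_0
\end{array} \tag{1}$$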
where the elements of each column represent the inputs of one output function.
From this perspective, it can be seen that 5 outputs can each be implemented
using one 5-input LUT, while 3 outputs each require a 7-input function, which
can be implemented using two 6-input LUTs. That sums to a total of 11 LUTs
per output coefficient, or 44 LUTs per output column. This shows that logic
optimization does not offer much gain over the straightforward implementation
of the transformation. Besides, a closer look at the decomposition
in (1) shows that the three outputs that need 7-input LUTs share two input
bits, namely $a_7$ and $b_7$. Decomposition (1) can be written as decomposition (2),
where $x = a_7 \oplus b_7$. This decomposition can be implemented using eight 6-input
LUTs and one 2-input LUT, a total of 9 LUTs per output coefficient, or 36 LUTs
per output column (which is smaller than the best-reported implementations),
i.e. 1.125 LUTs per output bit. It is worth mentioning that this number is
near-optimal for any linear transformation over 32 bits, as the optimal number
is 1 LUT/bit, which corresponds to a transformation where each output bit
depends on $n$ bits, where $2 \le n \le 6$ (the case where $n = 1$ corresponds
to an identity function and can be neglected, w.l.o.g.).³
³ In fact, each 6:1 LUT can be implemented as a 5:2 LUT with shared inputs. Using
this feature, our circuit can indeed be implemented using only 8 LUTs, which is
the optimal figure. However, in this paper we handle the optimization at the
front-end stage, and this feature is incorporated automatically by the placement
and routing tool.
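$$\begin{array}{cccccccc}
a_6 & a_5 & a_4 & a_3 & a_2 & a_1 & a_0 & a_7 \\
b_6 & b_5 & b_4 & b_3 & b_2 & b_1 & b_0 & b_7 \\
0 & 0 & 0 & x & x & 0 & x & 0 \\
b_7 & b_6 & b_5 & b_4 & b_3 & b_2 & b_1 & b_0 \\
c_7 & c_6 & c_5 & c_4 & c_3 & c_2 & c_1 & c_0 \\
d_7 & d_6 & d_5 & d_4 & d_3 & d_2 & d_1 & d_0
\end{array} \tag{2}$$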
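Decomposition (2) can be checked against the plain GF(2^8) computation of p = 2·a ⊕ 3·b ⊕ c ⊕ d. The following Python sketch is a software model written for this check, not the authors' HDL; the function names are illustrative. It builds each output bit from the six rows of (2) and also reports how many inputs each output function needs.

```python
def xtime(v: int) -> int:
    """Multiply by 2 in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    v <<= 1
    return v ^ 0x11B if v & 0x100 else v


def bit(v: int, i: int) -> int:
    return (v >> i) & 1


def mixcol_coeff_reference(a: int, b: int, c: int, d: int) -> int:
    """p = 2*a xor 3*b xor c xor d, computed directly."""
    return xtime(a) ^ xtime(b) ^ b ^ c ^ d


def mixcol_coeff_decomposed(a: int, b: int, c: int, d: int) -> int:
    """Same output, built bit by bit from the rows of decomposition (2)."""
    x = bit(a, 7) ^ bit(b, 7)  # the single 2-input function
    p = 0
    for i in range(8):
        # Rows 1 and 2: a and b rotated left by one position.
        t = bit(a, (i - 1) % 8) ^ bit(b, (i - 1) % 8)
        # Row 3: x feeds output bits 4, 3 and 1 only.
        if i in (1, 3, 4):
            t ^= x
        # Rows 4 to 6: b, c and d unshifted.
        t ^= bit(b, i) ^ bit(c, i) ^ bit(d, i)
        p |= t << i
    return p


def inputs_per_output_bit() -> list:
    """Input count of each output function in decomposition (2), bit 0 first."""
    return [6 if i in (1, 3, 4) else 5 for i in range(8)]


if __name__ == "__main__":
    ok = all(mixcol_coeff_decomposed(a, b, a ^ 0x5A, b ^ 0xC3)
             == mixcol_coeff_reference(a, b, a ^ 0x5A, b ^ 0xC3)
             for a in range(256) for b in range(256))
    luts = len(inputs_per_output_bit()) + 1  # eight output LUTs plus one for x
    print(ok, luts, 4 * luts, 4 * luts / 32)
```

Every output function has at most 6 inputs, so each fits one LUT, and the extra LUT for x gives the 9 LUTs per coefficient, 36 per column and 1.125 per output bit stated in the text.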
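The optimization of the AES inverse MixColumns circuit is less straightforward,
as $M^{-1}$ includes larger coefficients. $M^{-1}$ is given by

$$M^{-1} = \begin{pmatrix}
\mathrm{E} & \mathrm{B} & \mathrm{D} & 9 \\
9 & \mathrm{E} & \mathrm{B} & \mathrm{D} \\
\mathrm{D} & 9 & \mathrm{E} & \mathrm{B} \\
\mathrm{B} & \mathrm{D} & 9 & \mathrm{E}
\end{pmatrix}$$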
A lot of work has been done on how to reuse the circuit for $M$
to implement $M^{-1}$ with minimal overhead. This is done by using any of the
following relations: $M^{-1} = M^3$, $M^{-1} = M \cdot N$ or $M^{-1} = M \oplus K$,
where $N$ and $K$ are matrices with low coefficients. In that direction, the
circuit given by decomposition (2) will also be the smallest, and the same
reasoning can be used to achieve small area for both $K$ and $N$. However, this
approach is most useful for low-area serial implementations with a shared
encryption/decryption data path. It does not achieve the best results for
high-speed round implementations with a dedicated decryption data path. For
example, using $M^{-1} = M^3$ requires 3.375 LUTs/bit and produces a
large-depth circuit (low performance), while using $M^{-1} = M \oplus K$ is
even larger. The most promising approach is $M^{-1} = M \cdot N$, which
requires 288 LUTs/block, corresponding to 2.25 LUTs/bit, which is still
far from optimal. On the other hand, the straightforward implementation of
$M^{-1}$ leads to output functions that include 19 input bits, which can lead to
very low performance. Here, we give a circuit that requires 60 LUTs per output
column, corresponding to 1.875 LUTs/bit. First, we use the same dot-product
view mentioned earlier, which is given by equation (3).
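$$p_0 = \mathrm{E} \cdot a \oplus \mathrm{B} \cdot b \oplus \mathrm{D} \cdot c \oplus 9 \cdot d = \mathrm{F} \cdot (a \oplus b \oplus c \oplus d) \oplus (a \oplus 4 \cdot b \oplus 2 \cdot c \oplus 4 \cdot d \oplus 2 \cdot d) \tag{3}$$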
Using these two observations, a circuit that requires only 60 LUTs per output
column can be implemented. The circuit diagram is given in Figure 6. This is
17% smaller than the best reported implementation. Given that MixColumns
is the main difference between the AES encryption and decryption data paths,
optimizing this primitive is crucial. On the other hand, since 1.875 LUTs/bit is
still far from the optimal 1 LUT/bit figure, there may be some room for further
optimization.
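The identity in equation (3) can be confirmed numerically. The sketch below is a software check written for this purpose, with illustrative function names; it says nothing about how the circuit in Figure 6 is mapped to LUTs. It compares both sides of (3) using a generic shift-and-add multiplication in GF(2^8).

```python
def xtime(v: int) -> int:
    """Multiply by 2 in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    v <<= 1
    return v ^ 0x11B if v & 0x100 else v


def gf_mul(k: int, v: int) -> int:
    """Multiply byte v by the constant k in GF(2^8) (shift-and-add)."""
    acc = 0
    while k:
        if k & 1:
            acc ^= v
        v = xtime(v)
        k >>= 1
    return acc


def inv_mixcol_direct(a: int, b: int, c: int, d: int) -> int:
    """Left-hand side of (3): E*a xor B*b xor D*c xor 9*d."""
    return gf_mul(0xE, a) ^ gf_mul(0xB, b) ^ gf_mul(0xD, c) ^ gf_mul(0x9, d)


def inv_mixcol_split(a: int, b: int, c: int, d: int) -> int:
    """Right-hand side of (3): shared F*(a^b^c^d) term plus low-coefficient terms."""
    common = gf_mul(0xF, a ^ b ^ c ^ d)
    rest = a ^ gf_mul(4, b) ^ gf_mul(2, c) ^ gf_mul(4, d) ^ gf_mul(2, d)
    return common ^ rest


if __name__ == "__main__":
    # Both sides are linear over GF(2), so checking each input byte on its
    # own (with the other three set to zero) already covers all cases.
    ok = all(inv_mixcol_direct(*args) == inv_mixcol_split(*args)
             for v in range(256)
             for args in ((v, 0, 0, 0), (0, v, 0, 0), (0, 0, v, 0), (0, 0, 0, v)))
    print(ok)
```

The check reflects the coefficient relations E = F ⊕ 1, B = F ⊕ 4, D = F ⊕ 2 and 9 = F ⊕ 4 ⊕ 2, which is why the F·(a ⊕ b ⊕ c ⊕ d) term can be shared by all four output coefficients of a column.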
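[Figure 6: Circuit diagram of the proposed inverse MixColumns implementation. Labels: shared term F applied to a, b, c, d; multiplication by 4; per-output terms built from 2·a, 2·b, 2·c, 2·d]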
While the impact of moving the flip-flops one logic level forward in the previous
example is obvious, designers usually do not have an accurate estimate of
the exact LUT utilization before synthesis. Consequently, they choose
the pipeline stages based on the logical functions, e.g. Sbox, MixColumns, input
selection, etc. In our work, we follow a different approach. First, we synthesize
a single-stream sequential implementation of the required block cipher. Second,
we study the output layout to determine the precise distribution of pipeline
stages without affecting the structure of the utilized LUTs.⁴

[Figure residue: example circuit before and after moving the flip-flops. Labels: Logic Circuits 1 to 4 with a single Storage Element; Logic Circuits 1 to 4 with Storage Elements 1 and 2]
Using the techniques described in Sections 3.2 and 3.3, we have implemented
two multi-stream AES data paths (two and four streams), shown in Figure 9. In
addition to the use of the low-area MixColumns circuit and ROM-based Sbox,
the locations of the pipelining registers have been selected specifically to ensure
as efficient LUT utilization as possible. As a result, both the two-stream and
four-stream implementations use the same number of logical LUTs (944 LUTs,
without key scheduling), out of which 95% (896 LUTs) are 6-input LUTs. The
⁴ The term zero overhead refers to the number of LUT-FF pairs, as this is the
important metric, not the number of LUTs or FFs.
[Figure 9: Proposed multi-stream AES data paths. Four-stage version: Input Selection/Addkey, Sbox 1, Sbox 2, MixColumns. Two-stage version: Input Selection/Addkey/Sbox 1, Sbox 2/MixColumns]
1. The critical path in the implementation from [14] consists of three levels
of logic inside the Sbox used (one LUT6 followed by MUX7 and MUX8).
In our design, the critical path also consists of three levels of logic (Sbox
part 2 (LUT6) and the MixColumns circuit proposed in Section 3.2 (LUT3
+ LUT6)).
2. The MixColumns circuit used is smaller.
3. The first two pipeline stages in Figure 5 necessitate the use of 256 LUTs.
The two stages altogether can be viewed as a 6-input function, which can be
easily merged into a single stage of 128 LUTs (of type LUT6).
This implies that our proposed implementation achieves the same performance
as [14] with lower latency and using only two independent streams (easier
to achieve). In fact, following the architecture in Section 2, it can be used even
for slightly dependent streams (even and odd blocks of an OCB message).
Additionally, by choosing to add two more stages at the outputs of the Sbox 2 and
key addition circuits, the performance and efficiency can be further enhanced
without any additional increase in the area occupied, as shown in Table 2. The
results show that our four-stream implementation outperforms all the AES FPGA
implementations in the literature in terms of efficiency, to the best of our
knowledge. In Table 2, we show the implementation results of the AES decryption
data path. It achieves a speed-up of around 2x over the similar implementation
from [14].
Based on the architecture proposed in Section 2 and the AES data paths
proposed in Section 4.1, we have implemented two complete data paths for
Deoxys-I. They include two and four pipeline stages, respectively. Each
consists of four parts: the encryption data path, the decryption data path, the
key schedule and the tweak schedule. Using the pipeline selection technique,
both implementations consume the same area, except for the decryption data
path, which we implement only for 4 streams. Table 4 shows that the
bottleneck of the design, not considering the control overhead, is the decryption
data path.
1. While the data path is very fast, the control unit still requires
optimization to cope with such speed.
2. Although we chose a small FPGA from the Virtex-6 family, the design
is still small compared to the FPGA size, which makes it I/O dominated
and leads to a lot of wiring delay related to the I/O pins. This would
not be an issue if the design were used as part of a larger on-chip system.
To verify that the second problem is part of the reason for the performance
degradation, we also implemented the design on the small Spartan-6
(xc6slx9ftg256-3) FPGA. The maximum operating frequency is 273 MHz vs. 333 MHz
pre-layout (only 18% degradation). The results of the Virtex-6 implementation
are summarized in Table 5. For a fair comparison, we have also downloaded and
implemented the Deoxys-I-128 implementation reported on the ATHENa website [2]
by the CERG team. We only compare the cipher circuits without the overhead of the
hardware API. Our results show an efficiency gain of 75% (1.75x) for Virtex-6
and 74% (1.74x) for Spartan-6. Table 5 shows the results for the encryption-only
implementation. Our implementation is 5.536x more efficient than the
implementation by Poschmann and Stöttinger.
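As a quick check of the Spartan-6 degradation figure, using only the two frequencies quoted in the text (pre-layout 333 MHz, post-layout 273 MHz):

$$\frac{333 - 273}{333} = \frac{60}{333} \approx 0.18$$

that is, about 18%, consistent with the stated value.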
Table 5: Post-layout results of the Deoxys-I-128 implementation on FPGA
Table 7. We are currently in the process of preparing the HDL code to be pro-
vided publicly soon so that other researchers can verify our results.
[Figure 10: LED data path with pipeline stages Input Selection/Addkey, Sbox, MixColumns]
LED [29] is a 64-bit block cipher based on an AES-like SPN. Its state is a 4×4
matrix of 4-bit nibbles. In this paper we focus on the 64-bit key version, LED-64.
However, the same results extend to the other variants of LED, since the
only difference is the key scheduling part, which can be easily adjusted for this
architecture. The (χ4) round implementation from [18], Section 3.1, has been
replicated for the Xilinx Spartan-3 FPGA. Using the guidelines from Section 3.3,
we have been able to add two extra pipeline stages at the outputs of the Sbox and
the MixColumns operations, as shown in Figure 10. Table 8 shows that
almost all the available flip-flops have been used, increasing both the throughput
and efficiency by 2.57x at no additional area cost.
Table 8: Results of the three-stream LED-64 implementation compared to the
single stream counterpart on Spartan 3 FPGA.
Acknowledgments
The authors would like to thank the anonymous referees for their helpful com-
ments. This work is partly supported by the Singapore National Research Foun-
dation Fellowship 2012 (NRF-NRFF2012-06).
References
13. Jean, J., Nikolic, I., Peyrin, T., Seurin, Y.: Deoxys v1.41. Technical report,
Nanyang Technological University, Singapore/ANSSI, Paris, France (2016)
14. Bulens, P., Standaert, F.X., Quisquater, J.J., Pellegrin, P., Rouvroy, G.: Imple-
mentation of the AES-128 on Virtex-5 FPGAs. In: International Conference on
Cryptology in Africa, Springer (2008) 16–26
15. Liu, Q., Xu, Z., Yuan, Y.: A 66.1 Gbps single-pipeline AES on FPGA. In: 2013
International Conference on Field-Programmable Technology (FPT). (Dec 2013)
378–381
16. : Deoxys-I-128 implementation by CERG team.
https://cryptography.gmu.edu/athena/ (2016)
17. Poschmann, A., Stöttinger, M.: Deoxys-I-128 implementation by Poschmann and
Stöttinger. https://cryptography.gmu.edu/athena/ (2016)
18. Anandakumar, N.N., Peyrin, T., Poschmann, A.: A very compact FPGA imple-
mentation of LED and PHOTON. In: International Conference in Cryptology in
India, Springer (2014) 304–321
19. Black, J., Rogaway, P.: A block-cipher mode of operation for parallelizable message
authentication. In: International Conference on the Theory and Applications of
Cryptographic Techniques, Springer (2002) 384–397
20. Krovetz, T., Rogaway, P.: OCB (v1.1). (2016)
21. Minematsu, K.: AES-OTR v3.1. Technical report, NEC Corporation, Japan
(2016)
22. Homsirikamol, E., Diehl, W., Ferozpuri, A., Farahmand, F., Yalla, P., Kaps, J.P.,
Gaj, K.: CAESAR Hardware API. Cryptology ePrint Archive, Report 2016/626
(2016)
23. NIST: National Institute of Standards and Technology: Advanced Encryption
Standard AES (2001)
24. El Maraghy, M., Hesham, S., El Ghany, M.A.A.: Real-time efficient FPGA
implementation of AES algorithm. In: SOC Conference (SOCC), 2013 IEEE 26th
International, IEEE (2013) 203–208
25. Chaves, R., Kuzmanov, G., Vassiliadis, S., Sousa, L.: Reconfigurable memory based
AES co-processor. In: Parallel and Distributed Processing Symposium, 2006. IPDPS
2006. 20th International, IEEE (2006) 8–pp
26. Banik, S., Bogdanov, A., Regazzoni, F.: Atomic-AES v 2.0. Cryptology ePrint
Archive, Report 2016/1005 (2016)
27. Ghaznavi, S., Gebotys, C., Elbaz, R.: Efficient technique for the FPGA
implementation of the AES MixColumns transformation. In: Reconfigurable Computing and
FPGAs, 2009. ReConFig’09. International Conference on, IEEE (2009) 219–224
28. Resende, J.C., Chaves, R.: Compact dual block AES core on FPGA for CCM protocol.
In: Field Programmable Logic and Applications (FPL), 2015 25th International
Conference on, IEEE (2015) 1–8
29. Guo, J., Peyrin, T., Poschmann, A., Robshaw, M.: The LED block cipher. In: Inter-
national Workshop on Cryptographic Hardware and Embedded Systems, Springer
(2011) 326–341