Cryptacus 2018 Paper 4
Ko Stoffelen
Digital Security Group, Radboud University,
Nijmegen, The Netherlands
Email: k.stoffelen@cs.ru.nl
Abstract—This paper describes two related contributions to the area of designing lightweight mixing (diffusion) layers for symmetric cryptographic primitives such as permutations and block ciphers. First, we show how existing algorithms can provide cheaper implementations for MDS matrices than previously thought possible, by viewing an MDS matrix over a finite field as a binary matrix and performing global optimization. Then, we define column parity mixers as a generalization of the mixing layer used in KECCAK, and we study their interesting algebraic and diffusion properties. We show that column parity mixers are a suitable alternative to MDS matrices.

This very concise overview is based on two recently published papers [1], [2], which should be consulted for more details.

1. Introduction

Lightweight cryptography has been a major trend in symmetric cryptography over the last years. While it is sometimes not very clear when something is lightweight, the main goal can be summarized as very efficient cryptography. Here, the meaning of efficiency ranges from small chip size to low latency and low energy.

In light of this, researchers started to optimize the construction of many parts of permutations and block ciphers, more recently with a special focus on the linear layers and, even more specifically, on the implementation of maximum distance separable (MDS) matrices, that is, linear layers with an optimal branch number.

Starting with [3] and followed by a whole series of papers (e.g., [4], [5], [6]), researchers focused on finding MDS constructions that minimize the number of XOR operations needed for their implementation. Considering an n × n MDS matrix over a finite field F_{2^k}, given as A = (α_{i,j}), the aim was to choose the elements α_{i,j} in such a way that implementing all of the multiplications x ↦ α_{i,j}·x in parallel becomes as cheap as possible. In order to compute the matrix A entirely, those partial results have to be added together, for which an additional number of XORs is required. So far, researchers have focused on local optimization, taking the cost of combining the parts as a given.

Global optimization of matrix multiplication is another extensively studied line of research. The problem is known to be NP-hard [7] and thus quickly becomes infeasible for increasing matrix dimensions. However, quite a number of heuristic algorithms for finding the shortest linear straight-line program, which corresponds exactly to minimizing the number of XORs, have been proposed in the literature [8], [9], [10].

We take several well-locally-optimized MDS matrices from the literature and apply the known algorithms to all of them. This immediately leads to significant improvements: we often obtain an implementation using fewer XOR operations than what was previously considered a fixed cost.

Not all ciphers use (near-)MDS matrices for their diffusion. A different type of mixing layer is found as θ in KECCAK-f, the permutation underlying KECCAK [11]. This mixing layer θ has a branch number of 4 and requires only 2 XORs per bit. Despite the low cost, it appears to have quite good diffusion in combination with the other parts of the round function. In particular, [12] reports on proofs of promising upper bounds for the differential probability of differential trails. It appears that θ-like mappings can form mixing layers with a good trade-off between implementation cost and mixing power.

In [2] we present a generalization of the θ mixing layer in KECCAK-f, called column parity mixers (CPMs). CPMs operate on two-dimensional arrays, and parities computed over the columns play a central role in their definition. In Section 4 we provide an elegant description using matrix arithmetic, allowing us to easily derive algebraic and diffusion properties.

We also show that CPMs operating on states with an even number of rows have quite different properties from those operating on an odd number of rows. The former are involutions and are ideally suited for block ciphers and permutations that need to have an efficient inverse. The latter may have an inverse with a high implementation cost, but also very interesting diffusion properties.

2. Preliminaries

2.1. Matrices

We use I to denote a (square) identity matrix and 0 to denote an all-zero matrix. We assume that the dimensions of these matrices are determined by the context. The transpose of a matrix A is denoted as A^T.
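To make the XOR-counting problem from the introduction concrete, here is a toy example of common-subexpression sharing in a linear straight-line program (a sketch in Python with numpy; the matrix M is our own illustration, not taken from any cipher):

```python
import numpy as np

# A binary matrix applied to a bit vector: each output bit is the XOR of
# the input bits selected by one row.
M = np.array([[1, 1, 0, 1],
              [1, 1, 1, 0],
              [0, 1, 1, 1]])

# Naive straight-line program: a row of Hamming weight w costs w - 1 XORs.
naive_xors = sum(np.count_nonzero(row) - 1 for row in M)
assert naive_xors == 6

# Rows 0 and 1 share the subexpression x0 + x1, so computing it once and
# reusing it saves an XOR: 5 XORs in total instead of 6.
x = np.array([1, 0, 1, 1])
t = x[0] ^ x[1]                               # 1 XOR, reused twice
y = [t ^ x[3], t ^ x[2], x[1] ^ x[2] ^ x[3]]  # 4 more XORs
assert np.array_equal(y, M @ x % 2)           # same result as M * x over F_2
```

This kind of saving is exactly what the heuristics of Section 3 search for systematically.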
We use 1_x to denote a column vector of x components that are all equal to 1. Consequently, 1_x^T is an all-one row vector with x components. We use 1_x^y to denote a matrix with x rows and y columns in which all components are 1. Clearly, 1_x^y = 1_x 1_y^T.

The element of a matrix A at row i and column j is denoted by α_{i,j}. If B = A^T, we have β_{i,j} = α_{j,i}. The trace of a square matrix is the linear function that simply takes the sum of its diagonal elements. It is denoted by tr(A), so tr(A) = Σ_i α_{i,i}.

The Hamming weight of a vector u or of a matrix A is denoted by hw(u) and hw(A), respectively, and is defined as the number of nonzero entries in u and A.

Consider an n × n matrix A with α_{i,j} ∈ F_{2^k}. Then every multiplication by an element α can be described by a left-multiplication with a matrix T_α ∈ F_2^{k×k}. For 1 ≤ i, j ≤ n, we define B(A) := (T_{α_{i,j}}) ⊆ GL(k, F_2)^{n×n} ⊆ (F_2^{k×k})^{n×n} ≅ F_2^{nk×nk} and call this the binary representation of A.

2.2. MDS Matrices

For a binary vector v ∈ F_2^{nk}, we define hw_k(v) := hw(v'), where v' ∈ (F_2^k)^n is the vector that has been constructed by partitioning v into groups of k bits. Furthermore, the branch number of a matrix A is defined as bn(A) := min_{u ≠ 0} {hw(u) + hw(Au)}. For a binary matrix B ∈ F_2^{nk×nk}, the branch number for k-bit words is defined as bn_k(B) := min_{u ∈ F_2^{nk} \ {0}} {hw_k(u) + hw_k(Bu)}.

In the design of block ciphers, maximum distance separable (MDS) matrices play an important role.

Definition 1. An n × n matrix A is MDS if and only if bn(A) = n + 1.

MDS matrices do not exist for every choice of n, k. The exact parameters for which MDS matrices do or do not exist are investigated in the context of the famous MDS conjecture. For binary matrices, we need to modify Definition 1.

Definition 2. A binary matrix B ∈ F_2^{nk×nk} is MDS for k-bit words if and only if bn_k(B) = n + 1.

MDS matrices have a common application in linear layers of block ciphers, due to the wide trail strategy proposed for the AES, see [13]. We typically deal with n × n MDS matrices over F_{2^k}, respectively binary F_2^{nk×nk} matrices that are MDS for k-bit words, where k ∈ {4, 8} is the size of the S-box. In either case, when we call a matrix MDS, the size of k will always be clear from the context when not explicitly mentioned.

It is easy to see that, if A ∈ F_{2^k}^{n×n} is MDS, then B(A) is also MDS for k-bit words. On the other hand, there might also exist binary MDS matrices for k-bit words that have no corresponding representation over F_{2^k}.

3. Global Optimization of MDS Matrices

3.1. Algorithms

Back in 1997, Paar [8] studied how to optimize the arithmetic used by Reed-Solomon encoders. This boils down to reducing the number of XORs that are necessary for a multiplier that operates on matrices A over the field F_{2^k}. Paar described two algorithms that find a local optimum. Here we focus on the second algorithm. Intuitively, the idea is to iteratively eliminate common subexpressions. Let T_α be the multiplication matrix, to be applied to a variable field element x = (x_1, ..., x_k) ∈ F_2^k. The algorithm for computing T_α·x finds a pair (i, j), with i ≠ j, for which the bitwise AND between columns i and j of T_α has the highest Hamming weight. In other words, it finds a pair (x_i, x_j) that occurs most frequently as a subexpression in the output bits of T_α·x. When multiple pairs are equally common, all of them are tried recursively. The XOR of x_i and x_j is then computed, and the matrix is updated accordingly, with x_i + x_j as a newly available variable. This is repeated until no common subexpressions are left. Compared to the naive XOR count, Paar noted an average reduction in the number of XORs of 17.5% for matrices over F_{2^4} and 40% for matrices over F_{2^8}.

Paar's algorithms lead to so-called cancellation-free programs. This means that for every XOR operation u + v, none of the input bit variables x_i occurs in both u and v. Thus, the possibility that two variables cancel each other out is never taken into consideration, while this may in fact yield a more efficient solution in terms of the total number of XORs. In 2008, Boyar, Matthews, and Peralta [7] showed that cancellation-free techniques can often not be expected to yield optimal solutions for non-trivial inputs. They also showed that, even under the restriction to cancellation-free programs, the problem of finding an optimal program is NP-complete.

Around 2010, Boyar and Peralta [9] came up with a heuristic that is not cancellation-free and that improved on Paar's algorithms in most scenarios. Their idea was to keep track of a distance vector that contains, for each targeted output-bit expression, the minimum number of XORs of the already computed intermediate values that are necessary to obtain that target. To decide which values will be added, the pair that minimizes the sum of the new distances is picked. If there is a tie, the pair that maximizes the Euclidean norm of the new distances is chosen. Additionally, if the XOR of two values immediately yields a targeted output, this is always done without searching further.

At BFA 2017, an improvement was presented that simultaneously reduces the number of XORs and the depth of the resulting circuit [10].

3.2. Results

Using the heuristic methods described in the previous section, we can easily and significantly reduce the XOR counts for many matrices that have been used in the
TABLE 1. Number of XORs required for matrices in ciphers.

Cipher          | Type           | Naive | Literature | Paar [8] | BP [9]
----------------|----------------|-------|------------|----------|-------
AES             | F_{2^8}^{4×4}  |  152  | 7 + 96*    |   108    |   97
ANUBIS          | F_{2^8}^{4×4}  |  184  | 20 + 96†   |   121    |  113
CLEFIA M0       | F_{2^8}^{4×4}  |  184  | —‡         |   121    |  106
CLEFIA M1       | F_{2^8}^{4×4}  |  208  | —‡         |   121    |  111
FOX MU4         | F_{2^8}^{4×4}  |  219  | —‡         |   143    |  137
TWOFISH         | F_{2^8}^{4×4}  |  327  | —‡         |   149    |  129
FOX MU8         | F_{2^8}^{8×8}  | 1257  | —‡         |   611    |  594
GRØSTL          | F_{2^8}^{8×8}  | 1112  | 504 + 448† |   493    |  475
KHAZAD          | F_{2^8}^{8×8}  | 1232  | 584 + 448† |   488    |  507
WHIRLPOOL       | F_{2^8}^{8×8}  |  840  | 304 + 448† |   481    |  465
JOLTIK          | F_{2^4}^{4×4}  |   72  | 20 + 48†   |    48    |   48
SMALLSCALE AES  | F_{2^4}^{4×4}  |   72  | —‡         |    54    |   47
WHIRLWIND M0    | F_{2^4}^{8×8}  |  488  | 200 + 224† |   218    |  212
WHIRLWIND M1    | F_{2^4}^{8×8}  |  536  | 200 + 224† |   244    |  235

* Reported by [14].
† Reported by [15].
‡ We are not aware of any reported results for this matrix.

literature. The running times for the optimizations are in the range of seconds to minutes. Table 1 summarizes the main results.

A number of issues arise from this that are worth highlighting. First, it turns out that there are cases where the n(n − 1)k XORs for summing the products for all rows is not a correct lower bound. In fact, all the 4 × 4 matrices over GL(4, F_2) that we studied can be implemented in at most 48 XORs. Second, the implementation of the MDS matrix used in AES with 97 XORs is, to the best of our knowledge, the most efficient implementation so far and improves on the previous implementation of 103 XORs reported by [14]. As a side note, cancellations do occur in this implementation; we thus conjecture that such a low XOR count is not possible with cancellation-free programs.

4. Column Parity Mixers

The column parity of an m × n state A is computed by multiplying it with the all-one row vector 1_m^T at the left; the parity is then folded back into the state by multiplying with a square matrix Z at the right. We call the n × n matrix Z the parity-folding matrix of θ. We are now ready to define the θ-effect of a matrix A.

Definition 5. The θ-effect of A with respect to Z is a row vector, denoted as e_Z(A) (or just e(A) if Z is clear from the context), and is defined by e_Z(A) = 1_m^T A Z.

For a given input A and parity-folding matrix Z, a column x is called unaffected (affected) if the component with index x in e_Z(A) is zero (nonzero). Whether a column is affected or not is fully determined by the column parity of A and the column x of the parity-folding matrix Z.

Definition 6. The expanded θ-effect of A with respect to Z is a matrix with m rows all equal to the θ-effect, namely, E_Z(A) = 1_m^m A Z.

A CPM θ simply consists in computing the expanded θ-effect of a matrix A and adding it to A.

Definition 7. The column parity mixer θ using parity-folding matrix Z is defined as:

    θ(A) = A + E_Z(A) = A + 1_m^m A Z.

Note that a CPM is fully defined by a parity-folding matrix Z and the number of rows m.

4.2. Group Properties

In this section we list a few algebraic properties of CPMs θ for given dimensions m × n. Proofs and examples are omitted here and can be found in [2].

• Let ψ = θ' ∘ θ be the composition of two CPMs. Then ψ is again a CPM.
  – If m is even, the parity-folding matrix of ψ is Z + Z'.
  – If m is odd, its parity-folding matrix is (Z' + I)(Z + I) + I.
• The set of all CPMs with m even forms a group under composition that is isomorphic to the abelian group of n × n parity-folding matrices under addition.
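The definitions and the even-m properties above can be checked numerically. The following is a minimal sketch over F_2 (assuming Python with numpy; the function and variable names are ours):

```python
import numpy as np

def cpm(A, Z):
    """Column parity mixer theta(A) = A + 1_m^m A Z over F_2, where A is an
    m x n state and Z an n x n parity-folding matrix, entries in {0, 1}."""
    m = A.shape[0]
    ones = np.ones((m, m), dtype=int)  # the all-one matrix 1_m^m
    return (A + ones @ A @ Z) % 2      # add the expanded theta-effect to A

rng = np.random.default_rng(1)
m, n = 4, 5                            # an even number of rows
A = rng.integers(0, 2, (m, n))
Z = rng.integers(0, 2, (n, n))
Zp = rng.integers(0, 2, (n, n))        # a second parity-folding matrix Z'

# The theta-effect e_Z(A) = 1_m^T A Z; column x is affected iff e[x] != 0.
e = np.ones(m, dtype=int) @ A @ Z % 2

# For even m, a CPM is an involution: applying it twice gives A back ...
assert np.array_equal(cpm(cpm(A, Z), Z), A)

# ... and composing two CPMs yields the CPM with parity-folding matrix Z + Z'.
assert np.array_equal(cpm(cpm(A, Z), Zp), cpm(A, (Z + Zp) % 2))
```

Both assertions mirror the even-m statements above; for odd m, a CPM is in general not an involution, in line with the (Z' + I)(Z + I) + I composition rule.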