1 Matrix Multiplication: Strassen's Algorithm
Tuan Nguyen, Alex Adamson, Andreas Santucci
http://stanford.edu/~rezab/dao.
Instructor: Reza Zadeh, Matroid and Stanford.
Idea - Block Matrix Multiplication The idea behind Strassen's algorithm lies in formulating
matrix multiplication as a recursive problem. We first cover a variant of the naive algorithm,¹
formulated in terms of block matrices, and then parallelize it. Assume A, B ∈ R^{n×n} and C = AB,
where n is a power of two.²
We write A and B as block matrices,
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad
B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \quad
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},
where each block Aij is of size n/2 × n/2 (and similarly for the blocks of B and C).
Trivially, we may apply the definition of block-matrix multiplication to write down a formula for
the block entries of C, i.e.

C11 = A11 B11 + A12 B21,    C12 = A11 B12 + A12 B22,
C21 = A21 B11 + A22 B21,    C22 = A21 B12 + A22 B22.
Parallelizing the Algorithm Realize that the Aij and Bkl are smaller matrices; hence we have
broken down our initial problem of multiplying two n × n matrices into a problem requiring 8
matrix multiplies between matrices of size n/2 × n/2, as well as a total of 4 matrix additions.
¹ Refresher: to compute C = AB, we need to compute the entries cij, of which there are n². Each one may be
computed via cij = ⟨ai, bj⟩ (the inner product of the i-th row of A with the j-th column of B) in 2n − 1 = Θ(n)
operations. Hence the total work is O(n³).
² If n is not a power of two, then from a theoretical perspective we may simply pad the matrix with additional
zeros. From a practical perspective, we would simply use blocks of unequal size.
There is nothing fundamentally different between the matrix multiplies that we need to compute
at this level relative to our original problem.
Further, realize that the four block entries of C may be computed independently from one
another; hence we may write down the following recurrence for the work:

W(n) = 8W(n/2) + O(n²).

By the Master Theorem,³ W(n) = O(n^{log_2 8}) = O(n³). So we have not made any progress
(other than making our algorithm parallel). We already saw in lecture two that we can naively
parallelize matrix multiplication very simply to yield O(n³) work and O(log n) depth.
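To make the recursive formulation concrete, the following is a minimal Python/NumPy sketch of the block recursion (the function name block_multiply is ours), assuming n is a power of two:

import numpy as np

def block_multiply(A, B):
    """Recursive block-matrix multiply: 8 sub-multiplies per level.

    Assumes A and B are n x n with n a power of two."""
    n = A.shape[0]
    if n == 1:                      # base case: scalar product
        return A * B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The four blocks below are independent; on a PRAM each could be
    # assigned to its own group of processors.
    C11 = block_multiply(A11, B11) + block_multiply(A12, B21)
    C12 = block_multiply(A11, B12) + block_multiply(A12, B22)
    C21 = block_multiply(A21, B11) + block_multiply(A22, B21)
    C22 = block_multiply(A21, B12) + block_multiply(A22, B22)
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(block_multiply(A, B), A @ B)   # agrees with NumPy's product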
Strassen's Algorithm We now turn toward Strassen's algorithm, which reduces the number of
recursive calls to matrix multiply from 8 to 7 using just a bit of algebra. In this way,
we bring the work down to O(n^{log_2 7}).
How do we do this? We use the following factoring scheme: we write the Cij in terms of
block matrices Mk, each of which may be calculated simply from products and sums of sub-blocks of
A and B. That is, we let

M1 = (A11 + A22)(B11 + B22),
M2 = (A21 + A22) B11,
M3 = A11 (B12 − B22),
M4 = A22 (B21 − B11),
M5 = (A11 + A12) B22,
M6 = (A21 − A11)(B11 + B12),
M7 = (A12 − A22)(B21 + B22).
Crucially, each of the above factors can be evaluated using exactly one matrix multiplication.
And yet, since each of the Mk ’s expands by the distributive property of matrix multiplication,
they capture additional information. Also important is that these matrices Mk may be computed
independently of one another; this is where the parallelization of our algorithm occurs.
It can be verified that
C11 = M1 + M4 − M5 + M7,
C12 = M3 + M5,
C21 = M2 + M4,
C22 = M1 − M2 + M3 + M6.
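As a quick sanity check, one level of this factoring can be verified numerically; the sketch below (our own, using NumPy and random blocks) forms M1 through M7 and confirms that the reassembled C equals AB.

import numpy as np

m = 4                                   # block size (so n = 2m)
A11, A12, A21, A22 = (np.random.rand(m, m) for _ in range(4))
B11, B12, B21, B22 = (np.random.rand(m, m) for _ in range(4))

M1 = (A11 + A22) @ (B11 + B22)
M2 = (A21 + A22) @ B11
M3 = A11 @ (B12 - B22)
M4 = A22 @ (B21 - B11)
M5 = (A11 + A12) @ B22
M6 = (A21 - A11) @ (B11 + B12)
M7 = (A12 - A22) @ (B21 + B22)

C = np.block([[M1 + M4 - M5 + M7, M3 + M5],
              [M2 + M4,           M1 - M2 + M3 + M6]])
AB = np.block([[A11, A12], [A21, A22]]) @ np.block([[B11, B12], [B21, B22]])
assert np.allclose(C, AB)               # the seven products suffice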
³ Case 1 of the Master Theorem: f(n) = O(n²), so c = 2 < 3 = log_2 8.
Realize that our algorithm requires quite a few summations; however, this number is a constant,
independent of the size of our matrix multiplies. Hence the work is given by a recurrence of the
form

W(n) = 7W(n/2) + O(n²)  ⟹  W(n) = O(n^{log_2 7}).
What about the depth of this algorithm? Since all of our recursive matrix multiplies may be
computed in parallel, and since we can add matrices together in unit depth,⁴ we see that the depth
is given by

D(n) = D(n/2) + O(1)  ⟹  D(n) = O(log n).
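Putting the pieces together, a recursive Strassen multiply might look like the following sketch (the function name strassen and the cut-over threshold are our own choices; we assume square matrices whose size is a power of two and fall back to the naive product on small blocks, as one typically would in practice). The seven recursive calls are mutually independent, which is where the parallelism lives.

import numpy as np

def strassen(A, B, threshold=64):
    """Strassen's algorithm: 7 recursive multiplies per level.

    Assumes square matrices whose size is a power of two; below
    `threshold` we fall back to ordinary multiplication, since the
    constant hidden in the O(n^2) additions is large."""
    n = A.shape[0]
    if n <= threshold:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The seven products below are independent and may run in parallel.
    M1 = strassen(A11 + A22, B11 + B22, threshold)
    M2 = strassen(A21 + A22, B11, threshold)
    M3 = strassen(A11, B12 - B22, threshold)
    M4 = strassen(A22, B21 - B11, threshold)
    M5 = strassen(A11 + A12, B22, threshold)
    M6 = strassen(A21 - A11, B11 + B12, threshold)
    M7 = strassen(A12 - A22, B21 + B22, threshold)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
assert np.allclose(strassen(A, B), A @ B)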
Communication Cost Our PRAM model assumes zero communication cost between processors.
This is because the PRAM model assumes shared memory, in which each processor has fast access
to a single memory bank. Realistically, we never have such efficient communication, since in the
real world we often have clusters of computers, each with its own private bank of memory. In these
cases, divide and conquer is often impractical.
It is true that when our data are split across multiple machines, having an algorithm operate
on blocks of data at a time can be useful. However, as Strassen's algorithm continues to chop
matrices into smaller and smaller chunks, it places a large communication burden on distributed
setups: after the first iteration, it is likely that we will incur a shuffle cost as we are forced
to send data between machines.
⁴ To perform the matrix addition W = X + Y of two n × n matrices, we may calculate each of the n² entries
Wij = Xij + Yij in parallel using n² processors. Each entry requires only one fundamental unit of computation;
hence the work for matrix addition is O(n²) and the depth is O(1).
Caveat - Big O and Big Constants One last caveat specific to Strassen's algorithm is that
in practice, the O(n²) term requires 20 · n² operations, which is quite a large constant to hide. If
our data are large enough that they must be distributed across machines just to be stored, then
really we can often afford only a single pass through the entire data set. If each matrix multiply
requires twenty passes through the data, we're in big trouble. Big O notation is great to get you
started, and it tells us to throw away egregiously inefficient algorithms. But once we get down to
comparing two reasonable algorithms, we often have to look at them more closely.
When is Strassen's worth it? If we're actually in the PRAM model, i.e. we have a shared-memory
cluster, then Strassen's algorithm tends to be advantageous only if n ≥ 1,000, assuming no
communication costs. Higher communication costs drive up the n at which Strassen's becomes useful
very quickly. Even at n = 1,000, naive matrix multiply requires 10⁹ operations; we can't really do
much more than this with a single processor. Strassen's is mainly interesting as a theoretical idea.
For more on Strassen in distributed models, see [1].
Disk vs. RAM Trade-off Why can we only pass through our data once? There is a big trade-off
between having data in RAM and having it on disk. If we have tons of data, the data are stored on
disk. Streaming data imposes an additional constraint: as the data come in they are held in memory,
where we have fast random access, but once we write the data to disk, retrieving it again is
expensive.
2 Mergesort
Merge-sort is a very simple routine. It was fully parallelized in 1988 by Cole [2]. The algorithm
itself has been known for several decades longer.
It's critical to note how the merge sub-routine works, since it determines our algorithm's
work and depth. We can think of the process as simply "zipping" together two sorted arrays.
Algorithm 2: Merge
Input:  Two sorted arrays A, B, each of length n
Output: Merged array C, consisting of the elements of A and B in sorted order
  a ← pointer to the head of array A (i.e. pointer to the smallest element in A)
  b ← pointer to the head of array B (i.e. pointer to the smallest element in B)
  while a and b are not null do
      compare the value of the element at a with the value of the element at b
      if value(a) < value(b) then
          add value(a) to the output C
          increment pointer a to the next element in A
      else
          add value(b) to the output C
          increment pointer b to the next element in B
  if elements remain in either A or (exclusively) B then
      append these sorted elements to the output C
  return C
Since we iterate over each of the elements exactly one time, and each time we make a constant
time comparison, we require Θ(n) operations. Hence the merge routine on a single machine takes
O(n) work.
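For concreteness, here is a direct Python translation of the merge pseudocode above (using indices in place of pointers):

def merge(A, B):
    """Merge two sorted lists A and B into one sorted list: O(len(A) + len(B)) work."""
    C = []
    a, b = 0, 0                      # indices into A and B
    while a < len(A) and b < len(B):
        if A[a] < B[b]:
            C.append(A[a]); a += 1
        else:
            C.append(B[b]); b += 1
    C.extend(A[a:])                  # at most one of these two is nonempty
    C.extend(B[b:])
    return C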
When mergeSort performs its two recursive calls in parallel and then merges the results with the
(sequential) routine above, the work and depth satisfy

W(n) = 2W(n/2) + O(n),
D(n) = D(n/2) + O(n).

Therefore W(n) = O(n log n) and D(n) = O(n). Note that the bottleneck lies in merge, which
takes O(n) time. That is, even though we have an infinitude of processors, the time it takes to
merge two sorted arrays of size n/2 on the first call to mergeSort dominates the time it takes to
complete the recursive calls.
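The mergeSort driver itself is just the usual recursion; in the analysis above its two recursive calls run in parallel while merge stays sequential. A sketch, assuming the merge function from the previous listing:

def merge_sort(X):
    """Sort X; the two recursive calls are independent and, in the PRAM
    model, would be executed in parallel, leaving merge as the bottleneck."""
    if len(X) <= 1:
        return X
    mid = len(X) // 2
    left = merge_sort(X[:mid])       # these two calls are independent
    right = merge_sort(X[mid:])      # (run in parallel on a PRAM)
    return merge(left, right)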
To find the rank of an element x ∈ A in another sorted array B requires O(log n) work using
binary search on a sequential processor. Notice, however, that these rank computations, one per
element of A, are independent of one another; hence the binary searches may be performed in parallel.
That is, we can use n processors and assign each a single element from A. Each processor then
performs a binary search with O(log n) work. Hence in total, this parallel merge routine requires
O(n log n) work and O(log n) depth.
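A rank-based parallel merge can be sketched as follows (our own code, written sequentially using Python's bisect module; each loop iteration is independent, and on a PRAM each would be handed to its own processor):

from bisect import bisect_left, bisect_right

def parallel_merge(A, B):
    """Merge sorted A and B by ranking: element A[i] lands at position
    i + rank_B(A[i]).  Each loop iteration is independent, so on a PRAM
    the n binary searches run in parallel: O(n log n) work, O(log n) depth."""
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):        # independent iterations
        C[i + bisect_left(B, x)] = x
    for j, y in enumerate(B):        # bisect_right puts ties from B after equal values from A
        C[j + bisect_right(A, y)] = y
    return C

assert parallel_merge([1, 3, 5], [2, 3, 6]) == [1, 2, 3, 3, 5, 6]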
Hence when we use parallelMerge in our mergeSort algorithm, we realize the following work
and depth, by the Master Theorem:

W(n) = 2W(n/2) + O(n log n) = O(n log² n),
D(n) = D(n/2) + O(log n) = O(log² n),

so (recalling that with p processors the running time scales like W(n)/p + D(n)) for large p we
significantly outperform the naive implementation! The best known implementation (work O(n log n),
depth O(log n)) was found by Richard Cole [2].
Motivating the Next Step We notice that we use many binary searches in our recently defined
parallel merge routine. Can we do better? Yes.
Let Lm denote the median index of array L. We then find the corresponding index in R
using binary search, with logarithmic work. We then observe that all of the elements of L at or
below index Lm, and all of the elements of R below rank rankR(value(Lm)), are at most the value of
L's median element. Hence if we were to recursively merge the first Lm elements of L with the
first rankR(value(Lm)) elements of R, and correspondingly merge the upper parts of L and R, we
could simply append the two results to maintain sorted order. This leads us to Richard Cole
(1988) [2], who works out all the intricate details of this approach nicely to achieve O(n log n)
work and O(log n) depth.
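The splitting idea can be sketched as follows (our own code, written sequentially for clarity; the two recursive calls are independent and would run in parallel, and Cole's actual algorithm layers a pipelined merging scheme on top of this basic idea to reach the bounds above):

from bisect import bisect_left

def split_merge(L, R):
    """Merge sorted L and R by splitting L at a median index and R at the
    corresponding rank, then merging the two halves recursively.
    The two recursive calls are independent (parallelizable)."""
    if not L:
        return list(R)
    if len(L) == 1:                       # base case: insert the single element
        r = bisect_left(R, L[0])
        return R[:r] + [L[0]] + R[r:]
    Lm = (len(L) - 1) // 2                # median index of L
    r = bisect_left(R, L[Lm])             # rank of L[Lm] in R: O(log n) work
    # Every element of L[:Lm+1] and R[:r] is at most every element of
    # L[Lm+1:] and R[r:], so the two halves can be merged independently
    # and the results concatenated.
    lower = split_merge(L[:Lm + 1], R[:r])
    upper = split_merge(L[Lm + 1:], R[r:])
    return lower + upper

assert split_merge([1, 3, 5, 7], [2, 4, 6, 8]) == [1, 2, 3, 4, 5, 6, 7, 8]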
A final remark on scheduling: when running a parallel algorithm on a finite number of processors,
you wish for all your processors to be busy. Depending on how you schedule the operations, you may
sometimes end up with processors that are idle and not working.
It is the scheduler's task to schedule things in tandem in such a way that you look ahead a little
bit and minimize the idle time of processors. We could do this greedily, i.e. as soon as there is any
computation to be done, we assign it to a processor. Or we could be a little bit clever about it,
and perhaps look ahead further in our DAG to see if we can plan more efficiently.
We will talk about scheduling after we are done with divide-and-conquer algorithms. Spark
has a scheduler. Every distributed computing setup has a scheduler. Your operating system and your
phone have schedulers. Every computer runs many processes in parallel: your computer might have
fifty Chrome tabs open and must decide which one to prioritize in order to optimize the performance
of your machine.
References
[1] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz, Communication-optimal parallel
algorithm for Strassen's matrix multiplication, CoRR, abs/1202.3173 (2012).
[2] R. Cole, Parallel merge sort, SIAM J. Comput., 17 (1988), pp. 770–785.
[4] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik, 13 (1969), pp. 354–356.