
CME 323: Distributed Algorithms and Optimization, Spring 2016

http://stanford.edu/~rezab/dao.
Instructor: Reza Zadeh, Matroid and Stanford.

Lecture 3, 04/04/2016. Scribed by Tuan Nguyen, Alex Adamson, Andreas Santucci.

1 Matrix multiplication: Strassen’s algorithm


We’ve all learned the naive way to perform matrix multiplies in O(n^3) time.^1 In today’s lecture, we
review Strassen’s sequential algorithm for matrix multiplication, which requires O(n^{log_2 7}) = O(n^2.81)
operations; the algorithm is amenable to parallelization [4].
A variant of Strassen’s sequential algorithm was developed by Coppersmith and Winograd, who
achieved a running time of O(n^2.375) [3]. The current best algorithm for matrix multiplication, with
running time O(n^2.373), was developed by Stanford’s own Virginia Williams [5].

Idea - Block Matrix Multiplication The idea behind Strassen’s algorithm is to formulate
matrix multiplication as a recursive problem. We first cover a variant of the naive algorithm,
formulated in terms of block matrices, and then parallelize it. Assume A, B ∈ R^{n×n} and C = AB,
where n is a power of two.^2
We write A and B as block matrices,
    A = [ A11  A12 ]     B = [ B11  B12 ]     C = [ C11  C12 ]
        [ A21  A22 ],        [ B21  B22 ],        [ C21  C22 ],
where block matrices Aij are of size n/2 × n/2 (same with respect to block entries of B and C).
Trivially, we may apply the definition of block-matrix multiplication to write down a formula for
the block-entries of C, i.e.

C11 = A11 B11 + A12 B21
C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21
C22 = A21 B12 + A22 B22

Parallelizing the Algorithm Realize that the Aij and Bkl are smaller matrices; hence we have
broken down our initial problem of multiplying two n × n matrices into a problem requiring 8
matrix multiplies between matrices of size n/2 × n/2, as well as a total of 4 matrix additions.
^1 Refresher: to compute C = AB, we need to compute cij, of which there are n^2 entries. Each one may be
computed as cij = ⟨ai, bj⟩, the inner product of the i-th row of A with the j-th column of B, in 2n − 1 = Θ(n)
operations. Hence the total work is O(n^3).
^2 If n is not a power of two, then from a theoretical perspective we may simply pad the matrices with additional
zeros. From a practical perspective, we would simply use unequal block sizes.

There is nothing fundamentally different between the matrix multiplies that we need to compute
at this level relative to our original problem.
Further, realize that the four block entries of C may be computed independently from one
another, hence we may come up with the following recurrence for work:

W(n) = 8W(n/2) + O(n^2)

By the Master Theorem,^3 W(n) = O(n^{log_2 8}) = O(n^3). So we have not made any progress
(other than making our algorithm parallel). We already saw in lecture two that we can naively
parallelize matrix multiplies very simply to yield O(n^3) work and O(log n) depth.
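
To make the block formulation concrete, here is a minimal recursive sketch of the 8-multiply block
algorithm (our illustration, not part of the original notes); block_multiply is a hypothetical name,
and we assume square inputs whose dimension n is a power of two.

import numpy as np

def block_multiply(A, B):
    """Recursive block matrix multiply: 8 sub-multiplies and 4 block additions per level."""
    n = A.shape[0]
    if n == 1:                       # base case: 1 x 1 blocks
        return A * B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # In the PRAM model, these 8 recursive multiplies run in parallel.
    C11 = block_multiply(A11, B11) + block_multiply(A12, B21)
    C12 = block_multiply(A11, B12) + block_multiply(A12, B22)
    C21 = block_multiply(A21, B11) + block_multiply(A22, B21)
    C22 = block_multiply(A21, B12) + block_multiply(A22, B22)
    return np.block([[C11, C12], [C21, C22]])

Each level performs 8 recursive multiplies plus a constant number of (n/2) × (n/2) additions, which
is exactly the recurrence W(n) = 8W(n/2) + O(n^2) above.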

Strassen’s Algorithm We now turn toward Strassen’s algorithm, which uses just a bit of algebra
to reduce the number of recursive matrix multiplies from 8 to 7. In this way, we bring the work
down to O(n^{log_2 7}).
How do we do this? We use the following factoring scheme. We write down the Cij in terms of
matrices Mk, where each Mk may be calculated from products and sums of sub-blocks of A and B.
That is, we let

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 − B22)
M4 = A22 (B21 − B11)
M5 = (A11 + A12) B22
M6 = (A21 − A11)(B11 + B12)
M7 = (A12 − A22)(B21 + B22)

Crucially, each of the above factors can be evaluated using exactly one matrix multiplication, and
yet, because each Mk expands by the distributive property of matrix multiplication, it captures
additional information. Also important is that the matrices Mk may be computed independently
of one another; this is where the parallelization of our algorithm occurs.
It can be verified that

C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
^3 Case 1 of the Master Theorem: f(n) = O(n^2), so c = 2 < 3 = log_2 8.

Realize that our algorithm requires quite a few summations; however, their number is a constant
independent of the size of our matrix multiplies. Hence, the work is given by a recurrence of the
form

W(n) = 7W(n/2) + O(n^2) =⇒ W(n) = O(n^{log_2 7}).
What about the depth of this algorithm? Since all of our recursive matrix multiplies may be
computed in parallel, and since we can add matrices together in unit depth,^4 we see that the depth
is given by

D(n) = D(n/2) + O(1) =⇒ D(n) = O(log n).


By Brent’s theorem, Tp ≤ n^2.81/p + O(log n). In the years since Strassen published his paper,
people have been playing this game to bring down the work required marginally, but nobody has
come up with a fundamentally different approach.
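
For concreteness, here is a minimal sketch of Strassen’s recursion (our illustration, not part of the
original notes); strassen_multiply is a hypothetical name, and we again assume square inputs whose
dimension n is a power of two.

import numpy as np

def strassen_multiply(A, B):
    """Strassen's recursive multiply: 7 sub-multiplies per level instead of 8."""
    n = A.shape[0]
    if n == 1:                       # base case: 1 x 1 blocks
        return A * B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The seven products; in the PRAM model they are computed in parallel.
    M1 = strassen_multiply(A11 + A22, B11 + B22)
    M2 = strassen_multiply(A21 + A22, B11)
    M3 = strassen_multiply(A11, B12 - B22)
    M4 = strassen_multiply(A22, B21 - B11)
    M5 = strassen_multiply(A11 + A12, B22)
    M6 = strassen_multiply(A21 - A11, B11 + B12)
    M7 = strassen_multiply(A12 - A22, B21 + B22)
    # Recombine the seven products into the four blocks of C.
    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6
    return np.block([[C11, C12], [C21, C22]])

A quick sanity check is to compare the output against A @ B for small random matrices whose
dimension is a power of two.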

1.1 Drawbacks of Divide and Conquer


We now discuss some bottlenecks of Strassen’s algorithm (and of Divide and Conquer algorithms in
general):

• We haven’t considered communication bottlenecks; in real life communication is expensive.

• Disk/RAM differences are a bottleneck for recursive algorithms.

• PRAM assumes perfect scheduling.

Communication Cost Our PRAM model assumes zero communication cost between processors.
This is because PRAM is a shared-memory model, in which every processor has fast access to a
single memory bank. Realistically, communication is rarely this cheap: in the real world we often
have clusters of computers, each with its own private bank of memory. In these cases, divide and
conquer is often impractical.
It is true that when our data are split across multiple machines, having an algorithm operate
on blocks of data at a time can be useful. However, as Strassen’s algorithm keeps chopping matrices
into smaller and smaller chunks, it places a large communication burden on distributed setups:
after the first level of recursion, we are likely to incur a shuffle cost, since we are forced to send
data between machines.
^4 We note that to perform matrix addition of two n × n matrices X + Y = W, we may calculate each of the n^2
entries Wij = Xij + Yij in parallel using n^2 processors. Each entry requires only one fundamental unit of computation,
hence the work for matrix addition is O(n^2) and the depth is O(1).

Caveat - Big O and Big Constants One last caveat specific to Strassen’s algorithm is that in
practice the O(n^2) term hides roughly 20 · n^2 operations, which is quite a large constant. If our
data are so large that they must be distributed across machines just to be stored, then we can often
afford only one pass through the entire data set. If each matrix multiply requires twenty passes
through the data, we’re in big trouble. Big O notation is great to get you started, and it tells us to
throw away egregiously inefficient algorithms; but once we get down to comparing two reasonable
algorithms, we often have to look at them more closely.

When is Strassen’s worth it? If we’re actually in the PRAM model, i.e. we have a shared-memory
cluster, then Strassen’s algorithm tends to be advantageous only for n ≥ 1,000, assuming no
communication costs; higher communication costs quickly drive up the n at which Strassen’s becomes
useful. Even at n = 1,000, the naive matrix multiply requires about 10^9 operations, and we can’t
really do much more than this with a single processor. Strassen’s is mainly interesting as a theoretical
idea. For more on Strassen in distributed models, see [1].

Disk vs. RAM Trade-off Why can we only pass through our data once? There is a big trade-off
between having data in RAM and having it on disk. If we have tons of data, it is stored on disk.
Streaming data adds a further constraint: as the data come in they are held in memory, where we
have fast random access, but once we write them to disk, retrieving them again is expensive.

2 Mergesort
Merge sort is a very simple routine. It was fully parallelized in 1988 by Cole [2]; the algorithm itself
has been known for several decades longer.

Algorithm 1: Merge Sort

Input : Array A with n elements
Output: Sorted A
n ← |A|
if n = 1 then
    return A
else
    // in parallel:
    L ← MERGESORT(A[0, ..., n/2))    // indices 0, 1, ..., n/2 − 1
    R ← MERGESORT(A[n/2, ..., n))    // indices n/2, n/2 + 1, ..., n − 1
    return MERGE(L, R)
end

It’s critical to note how the merge sub-routine works, since this is important to our algorithm’s
work and depth. We can think of the process as simply “zipping” together two sorted arrays.

Algorithm 2: Merge

Input : Two sorted arrays A, B, each of length n
Output: Merged array C, consisting of the elements of A and B in sorted order
a ← pointer to head of array A (i.e. pointer to the smallest element in A)
b ← pointer to head of array B (i.e. pointer to the smallest element in B)
while a and b are not null do
    compare the value of the element at a with the value of the element at b
    if value(a) < value(b) then
        append value(a) to the output C
        advance pointer a to the next element of A
    else
        append value(b) to the output C
        advance pointer b to the next element of B
    end
end
if elements remain in either A or (exclusively) B then
    append these remaining sorted elements to the output C
end
return C

Since we iterate over each element exactly once, and each time we make a constant-time
comparison, we require Θ(n) operations. Hence the merge routine on a single machine takes
O(n) work.
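
As a runnable counterpart to Algorithms 1 and 2, a sequential Python sketch (ours, not the scribes’
code) might look as follows; in the PRAM model the two recursive calls to mergesort would be
issued in parallel.

def merge(L, R):
    """Zip two sorted lists together in Theta(|L| + |R|) work."""
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        if L[i] < R[j]:
            out.append(L[i]); i += 1
        else:
            out.append(R[j]); j += 1
    out.extend(L[i:])    # at most one of these two extends is non-trivial
    out.extend(R[j:])
    return out

def mergesort(A):
    """Sort A by recursively sorting each half and merging the results."""
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    return merge(mergesort(A[:mid]), mergesort(A[mid:]))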

2.1 Naive parallelization


Suppose we parallelize the algorithm via the obvious divide-and-conquer approach, i.e. by delegating
the recursive calls to individual processors. The work done is then

W(n) = 2W(n/2) + O(n) = O(n log n)

by case 2 of the Master Theorem.
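
To spell out the Master Theorem step (our gloss): here a = 2, b = 2, and f(n) = Θ(n) = Θ(n^{log_b a}),
so case 2 applies and W(n) = Θ(n^{log_b a} log n) = Θ(n log n).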


As you’ll recall from earlier algorithms classes, the canonical implementation of the merge
routine iterates over L and R simultaneously: starting at the first index of each, we repeatedly
place the smaller of the two currently pointed-to elements at the back of a new list, advance the
pointer in the list that element came from, and continue until we run off the end of one list.
Crucially, this merge has depth O(n). The depth of the naive parallelization is then

D(n) = D(n/2) + O(n) = O(n),

again by the Master Theorem.


Using Brent’s theorem, we have that

Tp ≤ O(n log n)/p + O(n)

Therefore W(n) = O(n log n) and D(n) = O(n). Note that the bottleneck lies in merge, which
takes O(n) time. That is, even though we have an infinitude of processors, the time it takes to
merge two sorted arrays of size n/2 on the first call to mergeSort dominates the time it takes to
complete the recursive calls.

2.2 Improved parallelization


How do we merge L and R in parallel? The merge routine we have used is written in a way that
is inherently sequential; it is not immediately obvious how to interleave the elements of L and R
together even with an infinitude of processors.
Let us call the output of our algorithm M. For an element x in R, define rank_M(x) to
be the index of x in the output M. For any such x ∈ R, we know how many elements (say a) in R
come before x, since R is sorted. But we don’t immediately know the rank of x in M.
If we also know how many elements (say b) in L are less than x, then we know we should place x in
the (a + b)-th position of the merged array M. It remains to find b, which we can do by performing a
binary search over L. We perform the symmetric procedure for each l ∈ L (i.e. we find how many
elements of R are less than it), so for a call to merge on an input of size n, we perform n binary
searches, each of which takes O(log(n/2)) = O(log n) time. In summary,

rank_M(x) = rank_L(x) + rank_R(x).

Algorithm 3: Parallel Merge

Input : Two sorted arrays A, B, each of length n
Output: Merged array C, consisting of the elements of A and B in sorted order
for each a ∈ A (in parallel) do
    binary search to find where a would be inserted into B, i.e. rank_B(a)
    the final rank of a is given by rank_M(a) = rank_A(a) + rank_B(a)
end
(symmetrically, a binary search over A gives the final rank of each b ∈ B)

To find the rank of an element x ∈ A in another sorted array B requires O(log n) work on a
sequential processor. Notice, however, that each of the n iterations of the for loop in our algorithm
is independent of the others, hence the binary searches may be performed in parallel. That is,
we can use n processors and assign each a single element of A. Each processor then performs
a binary search with O(log n) work. Hence in total, this parallel merge routine requires O(n log n)
work and O(log n) depth.
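
A sequential Python sketch of this rank-based merge (our illustration, using the standard-library
bisect module; parallel_merge is a hypothetical name, and the two for-loops are exactly the parts
that would run in parallel, one processor per element, on a PRAM):

from bisect import bisect_left, bisect_right

def parallel_merge(A, B):
    """Rank-based merge of sorted lists A and B: each element's final position
    is its rank in its own list plus its rank in the other list, where the
    latter is found by an O(log n) binary search."""
    C = [None] * (len(A) + len(B))
    # Conceptually a parallel for-loop: one processor per element of A.
    for i, a in enumerate(A):
        C[i + bisect_left(B, a)] = a      # count of elements of B strictly less than a
    # Conceptually a parallel for-loop: one processor per element of B.
    for j, b in enumerate(B):
        C[j + bisect_right(A, b)] = b     # count of elements of A less than or equal to b
    return C

The asymmetric tie-breaking (bisect_left for elements of A, bisect_right for elements of B) ensures
that equal elements from A and B land in distinct slots of C.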
Hence when we use parallelMerge in our mergeSort algorithm, we obtain the following work
and depth, by the Master Theorem:

W(n) = 2W(n/2) + O(n log n) =⇒ W(n) = O(n log^2 n),

D(n) = D(n/2) + O(log n) =⇒ D(n) = O(log^2 n).

By Brent’s Theorem, we get

Tp ≤ O(n log^2 n)/p + O(log^2 n),

so for large p we significantly outperform the naive implementation! The best known implementation
(work O(n log n), depth O(log n)) was found by Richard Cole [2].

Motivating the Next Step We notice that we use many binary searches in our recently defined
parallel merge routine. Can we do better? Yes.
Let L_m denote the median index of array L. We then find the corresponding index in R,
i.e. rank_R(value(L_m)), using a binary search with logarithmic work. Observe that all of the
elements of L at or below index L_m, and all of the elements of R with rank below rank_R(value(L_m)),
are at most the value of L’s median element. Hence if we recursively merge the first L_m elements of
L with the first rank_R(value(L_m)) elements of R, and correspondingly the upper parts of L and R,
we may simply append the two results to obtain sorted order. This leads us to Richard Cole
(1988) [2], who works out all the intricate details of this approach to achieve

W(n) = O(n log n),

D(n) = O(log n).

3 Coming Up: Quick-Sort and Scheduling


We briefly recall how quicksort operates. We arbitrarily pick an element as a pivot. In O(n) time,
we place all elements smaller than the pivot on the left-hand side of the array and all elements
larger than the pivot on the right-hand side. This partitioning step is an inherently sequential process.
With regard to scheduling, we have assumed that processors are assigned tasks in an optimal
way. However, assigning tasks to processors is actually a non-trivial problem in and of itself.
Suppose you have a program which is very large and recursive in nature. At the end of the day,
it’s a DAG of computations. At any level in the DAG, a certain number of computations are ready
to execute at the same time, and that number may be more (or less) than the number of processors
available to you. Ideally, you wish for all your processors to be busy; depending on how you schedule
the operations, you may end up with processors which are idle.
It is the scheduler’s task to schedule things in such a way that it looks ahead a little
bit and minimizes the idle time of processors. We could do this greedily, i.e. as soon as there is any
computation to be done, we assign it to a processor. Or we could be a bit more clever
and look further ahead in the DAG to see if we can plan more efficiently.
We will talk about scheduling after we are done with Divide and Conquer algorithms. Spark
has a scheduler; every distributed computing setup has a scheduler. Your operating system and
your phone have schedulers. Every computer runs many processes in parallel: your computer might
have fifty Chrome tabs open and must decide which one to prioritize in order to optimize the
performance of your machine.

References
[1] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz, Communication-optimal
parallel algorithm for Strassen's matrix multiplication, CoRR, abs/1202.3173 (2012).

[2] R. Cole, Parallel merge sort, SIAM J. Comput., 17 (1988), pp. 770–785.

[3] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, J.
Symbolic Computation, 9 (1990), pp. 251–280.

[4] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik, 13 (1969), pp. 354–356.

[5] V. V. Williams, Multiplying matrices in O(n^2.373) time, Stanford University, (2014).
