1 Matrix Multiplication: Strassen's Algorithm
Tuan Nguyen, Alex Adamson, Andreas Santucci
http://stanford.edu/~rezab/dao.
Instructor: Reza Zadeh, Matroid and Stanford.
Idea - Block Matrix Multiplication The idea behind Strassen's algorithm lies in formulating
matrix multiplication as a recursive problem. We first cover a variant of the naive algorithm,¹
formulated in terms of block matrices, and then parallelize it. Assume A, B ∈ R^{n×n} and C = AB,
where n is a power of two.²
We write A and B as block matrices,
A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad
B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \quad
C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},
where each block Aij is of size n/2 × n/2 (and similarly for the blocks of B and C).
Trivially, we may apply the definition of block-matrix multiplication to write down a formula for
the block entries of C, i.e.

C11 = A11 B11 + A12 B21,    C12 = A11 B12 + A12 B22,
C21 = A21 B11 + A22 B21,    C22 = A21 B12 + A22 B22.
Parallelizing the Algorithm Realize that the Aij and Bkl are smaller matrices; hence we have
broken down our initial problem of multiplying two n × n matrices into a problem requiring 8
matrix multiplies between matrices of size n/2 × n/2, as well as a total of 4 matrix additions.
¹ Refresher: to compute C = AB, we need to compute the entries cij, of which there are n². Each one may be
computed via cij = ⟨ai, bj⟩ (the inner product of the i-th row of A with the j-th column of B) in 2n − 1 = Θ(n)
operations. Hence the total work is O(n³).
² If n is not a power of two, then from a theoretical perspective we may simply pad the matrix with additional
zeros. From a practical perspective, we would simply use blocks of unequal size.
There is nothing fundamentally different between the matrix multiplies that we need to compute
at this level relative to our original problem.
Further, realize that the four block entries of C may be computed independently from one
another; hence we may write down the following recurrence for the work:

W(n) = 8W(n/2) + O(n²).

By the Master Theorem,³ W(n) = O(n^{log_2 8}) = O(n³). So we have not made any progress
(other than making our algorithm parallel). We already saw in lecture two that we can naively
parallelize matrix multiplication very simply to yield O(n³) work and O(log n) depth.
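To make the recursive formulation concrete, the following is a minimal Python/NumPy sketch of the block recursion (the function name block_multiply is ours), assuming n is a power of two:

import numpy as np

def block_multiply(A, B):
    """Recursive block-matrix multiply: 8 sub-multiplies per level.

    Assumes A and B are n x n with n a power of two."""
    n = A.shape[0]
    if n == 1:                      # base case: scalar product
        return A * B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The four blocks below are independent; on a PRAM each could be
    # assigned to its own group of processors.
    C11 = block_multiply(A11, B11) + block_multiply(A12, B21)
    C12 = block_multiply(A11, B12) + block_multiply(A12, B22)
    C21 = block_multiply(A21, B11) + block_multiply(A22, B21)
    C22 = block_multiply(A21, B12) + block_multiply(A22, B22)
    return np.block([[C11, C12], [C21, C22]])

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(block_multiply(A, B), A @ B)   # agrees with NumPy's product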
Strassen's Algorithm We now turn toward Strassen's algorithm, which reduces the number of
recursive calls to matrix multiply from 8 to 7 using just a bit of algebra. In this way,
we bring the work down to O(n^{log_2 7}).
How do we do this? We use the following factoring scheme: we write the Cij in terms of
block matrices Mk, each of which may be calculated simply from products and sums of sub-blocks of
A and B. That is, we let

M1 = (A11 + A22)(B11 + B22),
M2 = (A21 + A22) B11,
M3 = A11 (B12 − B22),
M4 = A22 (B21 − B11),
M5 = (A11 + A12) B22,
M6 = (A21 − A11)(B11 + B12),
M7 = (A12 − A22)(B21 + B22).
Crucially, each of the above factors can be evaluated using exactly one matrix multiplication.
And yet, since each of the Mk ’s expands by the distributive property of matrix multiplication,
they capture additional information. Also important is that these matrices Mk may be computed
independently of one another; this is where the parallelization of our algorithm occurs.
It can be verified that
C11 = M1 + M4 − M5 + M7,
C12 = M3 + M5,
C21 = M2 + M4,
C22 = M1 − M2 + M3 + M6.
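As a quick sanity check, one level of this factoring can be verified numerically; the sketch below (our own, using NumPy and random blocks) forms M1 through M7 and confirms that the reassembled C equals AB.

import numpy as np

m = 4                                   # block size (so n = 2m)
A11, A12, A21, A22 = (np.random.rand(m, m) for _ in range(4))
B11, B12, B21, B22 = (np.random.rand(m, m) for _ in range(4))

M1 = (A11 + A22) @ (B11 + B22)
M2 = (A21 + A22) @ B11
M3 = A11 @ (B12 - B22)
M4 = A22 @ (B21 - B11)
M5 = (A11 + A12) @ B22
M6 = (A21 - A11) @ (B11 + B12)
M7 = (A12 - A22) @ (B21 + B22)

C = np.block([[M1 + M4 - M5 + M7, M3 + M5],
              [M2 + M4,           M1 - M2 + M3 + M6]])
AB = np.block([[A11, A12], [A21, A22]]) @ np.block([[B11, B12], [B21, B22]])
assert np.allclose(C, AB)               # the seven products suffice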
³ Case 1 of the Master Theorem: f(n) = O(n²), so c = 2 < 3 = log_2 8.
Realize that our algorithm requires quite a few summations; however, this number is a constant,
independent of the size of our matrix multiplies. Hence the work is given by a recurrence of the
form

W(n) = 7W(n/2) + O(n²)  ⟹  W(n) = O(n^{log_2 7}).
What about the depth of this algorithm? Since all of our recursive matrix multiplies may be
computed in parallel, and since we can add matrices together in unit depth,⁴ we see that the depth
is given by

D(n) = D(n/2) + O(1)  ⟹  D(n) = O(log n).
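Putting the pieces together, a recursive Strassen multiply might look like the following sketch (the function name strassen and the cut-over threshold are our own choices; we assume square matrices whose size is a power of two and fall back to the naive product on small blocks, as one typically would in practice). The seven recursive calls are mutually independent, which is where the parallelism lives.

import numpy as np

def strassen(A, B, threshold=64):
    """Strassen's algorithm: 7 recursive multiplies per level.

    Assumes square matrices whose size is a power of two; below
    `threshold` we fall back to ordinary multiplication, since the
    constant hidden in the O(n^2) additions is large."""
    n = A.shape[0]
    if n <= threshold:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # The seven products below are independent and may run in parallel.
    M1 = strassen(A11 + A22, B11 + B22, threshold)
    M2 = strassen(A21 + A22, B11, threshold)
    M3 = strassen(A11, B12 - B22, threshold)
    M4 = strassen(A22, B21 - B11, threshold)
    M5 = strassen(A11 + A12, B22, threshold)
    M6 = strassen(A21 - A11, B11 + B12, threshold)
    M7 = strassen(A12 - A22, B21 + B22, threshold)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A = np.random.rand(128, 128); B = np.random.rand(128, 128)
assert np.allclose(strassen(A, B), A @ B)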
Communication Cost Our PRAM model assumes zero communication cost between processors.
This is because the PRAM model assumes shared memory, in which each processor has fast access
to a single memory bank. Realistically, we never have such efficient communication, since in the
real world we often have clusters of computers, each with its own private bank of memory. In these
cases, divide and conquer is often impractical.
It is true that when our data are split across multiple machines, having an algorithm operate
on blocks of data at a time can be useful. However, as Strassen's algorithm continues to chop
matrices into smaller and smaller chunks, it places a large communication burden on distributed
setups: after the first iteration, it is likely that we will incur a shuffle cost as we are forced
to send data between machines.
⁴ To perform the matrix addition W = X + Y of two n × n matrices, we may calculate each of the n² entries
Wij = Xij + Yij in parallel using n² processors. Each entry requires only one fundamental unit of computation;
hence the work for matrix addition is O(n²) and the depth is O(1).
Caveat - Big O and Big Constants One last caveat specific to Strassen's algorithm is that
in practice, the O(n²) term requires 20 · n² operations, which is quite a large constant to hide. If
our data are large enough that they must be distributed across machines just to be stored, then
really we can often afford only a single pass through the entire data set. If each matrix multiply
requires twenty passes through the data, we're in big trouble. Big O notation is great to get you
started, and it tells us to throw away egregiously inefficient algorithms. But once we get down to
comparing two reasonable algorithms, we often have to look at them more closely.
When is Strassen's worth it? If we're actually in the PRAM model, i.e. we have a shared-memory
cluster, then Strassen's algorithm tends to be advantageous only if n ≥ 1,000, assuming no
communication costs. Higher communication costs drive up the n at which Strassen's becomes useful
very quickly. Even at n = 1,000, naive matrix multiply requires 10⁹ operations; we can't really do
much more than this with a single processor. Strassen's is mainly interesting as a theoretical idea.
For more on Strassen in distributed models, see [1].
Disk vs. RAM Trade-off Why can we only pass through our data once? There is a big trade-off
between having data in RAM and having it on disk. If we have tons of data, the data are stored on
disk. Streaming data imposes an additional constraint: as the data come in they are held in memory,
where we have fast random access, but once we write the data to disk, retrieving it again is
expensive.
2 Mergesort
Merge-sort is a very simple routine. It was fully parallelized in 1988 by Cole [2]. The algorithm
itself has been known for several decades longer.
It's critical to note how the merge sub-routine works, since it determines our algorithm's
work and depth. We can think of the process as simply "zipping" together two sorted arrays.
Algorithm 2: Merge
Input:  Two sorted arrays A, B, each of length n
Output: Merged array C, consisting of the elements of A and B in sorted order
  a ← pointer to the head of array A (i.e. pointer to the smallest element in A)
  b ← pointer to the head of array B (i.e. pointer to the smallest element in B)
  while a and b are not null do
      compare the value of the element at a with the value of the element at b
      if value(a) < value(b) then
          add value(a) to the output C
          increment pointer a to the next element in A
      else
          add value(b) to the output C
          increment pointer b to the next element in B
  if elements remain in either A or (exclusively) B then
      append these sorted elements to the output C
  return C
Since we iterate over each of the elements exactly one time, and each time we make a constant
time comparison, we require Θ(n) operations. Hence the merge routine on a single machine takes
O(n) work.
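For concreteness, here is a direct Python translation of the merge pseudocode above (using indices in place of pointers):

def merge(A, B):
    """Merge two sorted lists A and B into one sorted list: O(len(A) + len(B)) work."""
    C = []
    a, b = 0, 0                      # indices into A and B
    while a < len(A) and b < len(B):
        if A[a] < B[b]:
            C.append(A[a]); a += 1
        else:
            C.append(B[b]); b += 1
    C.extend(A[a:])                  # at most one of these two is nonempty
    C.extend(B[b:])
    return C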
When mergeSort performs its two recursive calls in parallel and then merges the results with the
(sequential) routine above, the work and depth satisfy

W(n) = 2W(n/2) + O(n),
D(n) = D(n/2) + O(n).

Therefore W(n) = O(n log n) and D(n) = O(n). Note that the bottleneck lies in merge, which
takes O(n) time. That is, even though we have an infinitude of processors, the time it takes to
merge two sorted arrays of size n/2 on the first call to mergeSort dominates the time it takes to
complete the recursive calls.
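The mergeSort driver itself is just the usual recursion; in the analysis above its two recursive calls run in parallel while merge stays sequential. A sketch, assuming the merge function from the previous listing:

def merge_sort(X):
    """Sort X; the two recursive calls are independent and, in the PRAM
    model, would be executed in parallel, leaving merge as the bottleneck."""
    if len(X) <= 1:
        return X
    mid = len(X) // 2
    left = merge_sort(X[:mid])       # these two calls are independent
    right = merge_sort(X[mid:])      # (run in parallel on a PRAM)
    return merge(left, right)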
To find the rank of an element x ∈ A in another sorted array B requires O(log n) work using
binary search on a sequential processor. Notice, however, that these rank computations, one per
element of A, are independent of one another; hence the binary searches may be performed in parallel.
That is, we can use n processors and assign each a single element from A. Each processor then
performs a binary search with O(log n) work. Hence in total, this parallel merge routine requires
O(n log n) work and O(log n) depth.
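A rank-based parallel merge can be sketched as follows (our own code, written sequentially using Python's bisect module; each loop iteration is independent, and on a PRAM each would be handed to its own processor):

from bisect import bisect_left, bisect_right

def parallel_merge(A, B):
    """Merge sorted A and B by ranking: element A[i] lands at position
    i + rank_B(A[i]).  Each loop iteration is independent, so on a PRAM
    the n binary searches run in parallel: O(n log n) work, O(log n) depth."""
    C = [None] * (len(A) + len(B))
    for i, x in enumerate(A):        # independent iterations
        C[i + bisect_left(B, x)] = x
    for j, y in enumerate(B):        # bisect_right puts ties from B after equal values from A
        C[j + bisect_right(A, y)] = y
    return C

assert parallel_merge([1, 3, 5], [2, 3, 6]) == [1, 2, 3, 3, 5, 6]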
Hence when we use parallelMerge in our mergeSort algorithm, we realize the following work
and depth, by the Master Theorem:

W(n) = 2W(n/2) + O(n log n) = O(n log² n),
D(n) = D(n/2) + O(log n) = O(log² n),

so (recalling that with p processors the running time scales like W(n)/p + D(n)) for large p we
significantly outperform the naive implementation! The best known implementation (work O(n log n),
depth O(log n)) was found by Richard Cole [2].
Motivating the Next Step We notice that we use many binary searches in our recently defined
parallel merge routine. Can we do better? Yes.
Let Lm denote the median index of array L. We then find the corresponding index in R
using binary search, with logarithmic work. We then observe that all of the elements of L at or
below index Lm, and all of the elements of R below rank rankR(value(Lm)), are at most the value of
L's median element. Hence if we were to recursively merge the first Lm elements of L with the
first rankR(value(Lm)) elements of R, and correspondingly merge the upper parts of L and R, we
could simply append the two results to maintain sorted order. This leads us to Richard Cole
(1988) [2], who works out all the intricate details of this approach nicely to achieve O(n log n)
work and O(log n) depth.
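The splitting idea can be sketched as follows (our own code, written sequentially for clarity; the two recursive calls are independent and would run in parallel, and Cole's actual algorithm layers a pipelined merging scheme on top of this basic idea to reach the bounds above):

from bisect import bisect_left

def split_merge(L, R):
    """Merge sorted L and R by splitting L at a median index and R at the
    corresponding rank, then merging the two halves recursively.
    The two recursive calls are independent (parallelizable)."""
    if not L:
        return list(R)
    if len(L) == 1:                       # base case: insert the single element
        r = bisect_left(R, L[0])
        return R[:r] + [L[0]] + R[r:]
    Lm = (len(L) - 1) // 2                # median index of L
    r = bisect_left(R, L[Lm])             # rank of L[Lm] in R: O(log n) work
    # Every element of L[:Lm+1] and R[:r] is at most every element of
    # L[Lm+1:] and R[r:], so the two halves can be merged independently
    # and the results concatenated.
    lower = split_merge(L[:Lm + 1], R[:r])
    upper = split_merge(L[Lm + 1:], R[r:])
    return lower + upper

assert split_merge([1, 3, 5, 7], [2, 4, 6, 8]) == [1, 2, 3, 4, 5, 6, 7, 8]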
A final remark on scheduling: when running a parallel algorithm on a finite number of processors,
you wish for all your processors to be busy. Depending on how you schedule the operations, you may
sometimes end up with processors that are idle and not working.
It is the scheduler's task to schedule things in tandem in such a way that you look ahead a little
bit and minimize the idle time of processors. We could do this greedily, i.e. as soon as there is any
computation to be done, we assign it to a processor. Or we could be a little bit clever about it,
and perhaps look ahead further in our DAG to see if we can plan more efficiently.
We will talk about scheduling after we are done with divide-and-conquer algorithms. Spark
has a scheduler. Every distributed computing setup has a scheduler. Your operating system and your
phone have schedulers. Every computer runs many processes in parallel: your computer might have
fifty Chrome tabs open and must decide which one to prioritize in order to optimize the performance
of your machine.
References
[1] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz, and O. Schwartz, Communication-optimal parallel
algorithm for Strassen's matrix multiplication, CoRR, abs/1202.3173 (2012).
[2] R. Cole, Parallel merge sort, SIAM J. Comput., 17 (1988), pp. 770–785.
[4] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik, 13 (1969), pp. 354–356.