13 Wrapup


University of Washington

thanks to Dan Grossman for the succinct definitions

What is parallel processing?

When can we execute things in parallel?

Parallelism: use extra resources to solve a problem faster.
Concurrency: correctly and efficiently manage access to shared resources.


What is parallel processing?

• Brief introduction to the key ideas of parallel processing:
  - instruction-level parallelism
  - data-level parallelism
  - thread-level parallelism


Exploiting Parallelism

• Of the computing problems for which performance is important, many have inherent parallelism.

• Computer games
  - Graphics, physics, sound, AI, etc. can be done separately
  - Furthermore, there is often parallelism within each of these:
    - The color of each pixel on the screen can be computed independently
    - Non-contacting objects can be updated/simulated independently
    - Artificial intelligence of non-human entities can be done independently

• Search engine queries
  - Every query is independent
  - Searches are (ehm, pretty much) read-only!


Instruction-Level Parallelism
add  %r2 <- %r3, %r4
or   %r2 <- %r2, %r4
lw   %r6 <- 0(%r4)
addi %r7 <- %r6, 0x5
sub  %r8 <- %r8, %r4

Dependences?
  RAW – read after write
  WAW – write after write
  WAR – write after read

When can we reorder instructions?
When should we reorder instructions?

add  %r2 <- %r3, %r4
or   %r5 <- %r2, %r4
lw   %r6 <- 0(%r4)
sub  %r8 <- %r8, %r4
addi %r7 <- %r6, 0x5

Superscalar processors: multiple instructions executing in parallel at the *same* stage.

Take 352 to learn more.



Data Parallelism
• Consider adding together two arrays:

void array_add(int A[], int B[], int C[], int length) {
    int i;
    for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
    }
}

Operating on one element at a time


Data Parallelism with SIMD


• Consider adding together two arrays:

void array_add(int A[], int B[], int C[], int length) {
    int i;
    for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
    }
}
Operate on MULTIPLE elements with a single instruction: Single Instruction, Multiple Data (SIMD).
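
To make the idea concrete, here is a minimal sketch of array_add written with x86 SSE2 intrinsics. The instruction set, the header <emmintrin.h>, and the name array_add_simd are assumptions of this sketch; the slide does not commit to a particular SIMD extension. Each _mm_add_epi32 performs four integer additions at once.

#include <emmintrin.h>   /* SSE2 intrinsics (assumed; not specified by the slide) */

/* Hypothetical SIMD version of array_add: 4 ints per instruction. */
void array_add_simd(int A[], int B[], int C[], int length) {
    int i;
    for (i = 0; i + 4 <= length; i += 4) {
        __m128i a = _mm_loadu_si128((__m128i *)&A[i]);  /* load 4 ints from A */
        __m128i b = _mm_loadu_si128((__m128i *)&B[i]);  /* load 4 ints from B */
        __m128i c = _mm_add_epi32(a, b);                /* 4 additions at once */
        _mm_storeu_si128((__m128i *)&C[i], c);          /* store 4 results */
    }
    for (; i < length; ++i) {                           /* scalar leftover elements */
        C[i] = A[i] + B[i];
    }
}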


Is it always that easy?


• Not always… a more challenging example:

unsigned sum_array(unsigned *array, int length) {
    unsigned total = 0;
    for (int i = 0; i < length; ++i) {
        total += array[i];
    }
    return total;
}

• Is there parallelism here?
• Each loop iteration uses data from the previous iteration.


Restructure the code for SIMD…


// one option...
unsigned sum_array2(unsigned *array, int length) {
    unsigned total, i;
    unsigned temp[4] = {0, 0, 0, 0};
    // chunks of 4 at a time
    for (i = 0; i < (length & ~0x3); i += 4) {
        temp[0] += array[i];
        temp[1] += array[i+1];
        temp[2] += array[i+2];
        temp[3] += array[i+3];
    }
    // add the 4 sub-totals
    total = temp[0] + temp[1] + temp[2] + temp[3];
    // add the non-4-aligned parts
    for ( ; i < length; ++i) {
        total += array[i];
    }
    return total;
}
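
The four sub-totals in temp[] map directly onto the lanes of one SIMD register. Purely as an illustration (the slide stops at the scalar restructuring), an SSE2 intrinsics version might look like the following; the name sum_array_simd and the use of <emmintrin.h> are assumptions of this sketch.

#include <emmintrin.h>   /* SSE2 intrinsics (assumed) */

unsigned sum_array_simd(unsigned *array, int length) {
    __m128i totals = _mm_setzero_si128();               /* {0, 0, 0, 0} */
    int i;
    for (i = 0; i + 4 <= length; i += 4) {
        __m128i chunk = _mm_loadu_si128((__m128i *)&array[i]);
        totals = _mm_add_epi32(totals, chunk);          /* 4 independent sub-totals */
    }
    unsigned temp[4];
    _mm_storeu_si128((__m128i *)temp, totals);          /* spill the 4 sub-totals */
    unsigned total = temp[0] + temp[1] + temp[2] + temp[3];
    for (; i < length; ++i) {                           /* non-multiple-of-4 tail */
        total += array[i];
    }
    return total;
}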

What are threads?


• Independent “thread of control” within a process
• Like multiple processes within one process, but sharing the same virtual address space
  - Each thread has its own logical control flow, program counter, and stack
  - Shared virtual address space: all threads in a process use the same virtual address space
• Lighter-weight than processes
  - faster context switching
  - the system can support more threads
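
As a minimal illustration (not part of the slides), the POSIX threads program below creates two threads inside one process. Both can touch the same global variable because they share the address space; the update is left unsynchronized on purpose, which is exactly the kind of shared access the concurrency slides warn about later.

#include <pthread.h>
#include <stdio.h>

int shared = 0;                                /* one copy, visible to every thread */

void *worker(void *arg) {
    shared += 1;                               /* unsynchronized shared write (a data race) */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);   /* two threads, same address space */
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);           /* usually 2, but not guaranteed */
    return 0;
}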


Thread-level parallelism: Multicore Processors


• Two (or more) complete processors, fabricated on the same silicon chip
• Execute instructions from two (or more) programs/threads at the same time

(figure: die photo of an IBM Power5 chip with two cores, #1 and #2)


Multicores are everywhere. (circa 2013)


• Laptops, desktops, servers
  - Most any machine from the past few years has at least 2 cores
• Game consoles:
  - Xbox 360: 3 PowerPC cores; Xbox One: 8 AMD cores
  - PS3: 9 Cell cores (1 master; 8 special SIMD cores); PS4: 8 custom AMD x86-64 cores
  - Wii U: 2 Power cores
• Smartphones
  - iPhone 4S, 5: dual-core ARM CPUs
  - Galaxy S II, III, IV: dual-core ARM or Snapdragon
  - …


Why Multicores Now?


• The number of transistors we can put on a chip keeps growing exponentially…
• But performance is no longer growing along with transistor count.
• So let's use those transistors to add more cores and do more at once…


As programmers, do we care?
• What happens if we run this program on a multicore?

void array_add(int A[], int B[], int C[], int length) {
    int i;
    for (i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
    }
}



What if we want one program to run on multiple processors (cores)?

• We have to explicitly tell the machine exactly how to do this
  - This is called parallel programming or concurrent programming
• There are many parallel/concurrent programming models
  - We will look at a relatively simple one: fork-join parallelism (sketched below)
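
To make fork-join concrete, here is a sketch of array_add split across POSIX threads. The thread count NTHREADS, the chunk struct, and the helper names are inventions of this sketch; the slides do not prescribe a particular threading API.

#include <pthread.h>

#define NTHREADS 4                       /* hypothetical thread count */

struct chunk { int *A, *B, *C; int lo, hi; };

static void *add_chunk(void *arg) {      /* each thread adds its own slice */
    struct chunk *ch = arg;
    for (int i = ch->lo; i < ch->hi; ++i)
        ch->C[i] = ch->A[i] + ch->B[i];
    return NULL;
}

void array_add_parallel(int A[], int B[], int C[], int length) {
    pthread_t tid[NTHREADS];
    struct chunk ch[NTHREADS];
    for (int t = 0; t < NTHREADS; ++t) { /* fork: one thread per slice */
        ch[t].A = A;  ch[t].B = B;  ch[t].C = C;
        ch[t].lo = t * length / NTHREADS;
        ch[t].hi = (t + 1) * length / NTHREADS;
        pthread_create(&tid[t], NULL, add_chunk, &ch[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)   /* join: wait for every slice */
        pthread_join(tid[t], NULL);
}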


How does this help performance?

• Parallel speedup measures improvement from parallelization:

    speedup(p) = (time for best serial version) / (time for version with p processors)

• What can we realistically expect?


Reason #1: Amdahl’s Law


• In general, the whole computation is not (easily) parallelizable
• Serial regions limit the potential parallel speedup

(figure: an execution timeline with its serial regions highlighted)


Reason #1: Amdahl’s Law


• Suppose a program takes 1 unit of time to execute serially
• A fraction of the program, s, is inherently serial (unparallelizable)

    New execution time = (1 - s)/p + s

• For example, consider a program that, when executing on one processor, spends 10% of its time in a non-parallelizable region. How much faster will this program run on a 3-processor system?

    New execution time = 0.9T/3 + 0.1T = 0.4T        Speedup = 1T / 0.4T = 2.5

• What is the maximum speedup from parallelization?
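
Written out in full (the closed form below is not printed on the slide, but it follows directly from the definitions above):

    \text{speedup}(p) = \frac{1}{s + \frac{1-s}{p}}, \qquad
    \text{speedup}(3) = \frac{1}{0.1 + 0.9/3} = 2.5, \qquad
    \lim_{p \to \infty} \text{speedup}(p) = \frac{1}{s} = 10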


Reason #2: Overhead

• Forking and joining is not instantaneous
  - Involves communicating between processors
  - May involve calls into the operating system
• The overhead depends on the implementation

    New execution time = (1 - s)/p + s + overhead(p)


Multicore: what should worry us?


• Concurrency: what if we're sharing resources, memory, etc.?
• Cache Coherence
  - What if two cores have the same data in their own caches? How do we keep those copies in sync?
• Memory Consistency, Ordering, Interleaving, Synchronization…
  - With multiple cores, we can have truly concurrent execution of threads. In what order do their memory accesses appear to happen? Do the orders seen by different cores/threads agree?
• Concurrency Bugs
  - When it all goes wrong…
  - Hard to reproduce, hard to debug
  - http://cacm.acm.org/magazines/2012/2/145414-you-dont-know-jack-about-shared-variables-or-memory-models/fulltext

Summary
• Multicore: more than one processor on the same chip
  - Almost all devices now have multicore processors
  - Results from Moore's Law and the power constraint
• Exploiting multicore requires parallel programming
  - Automatically extracting parallelism is, in general, too hard for the compiler
  - But the compiler can do much of the bookkeeping for us (see the OpenMP sketch below)
• Fork-join model of parallelism
  - At a parallel region, fork a bunch of threads, do the work in parallel, and then join, continuing with just one thread
  - Expect a speedup of less than P on P processors
  - Amdahl's Law: speedup is limited by the serial portion of the program
  - Overhead: forking and joining are not free
• Take 332, 352, 451 to learn more!
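
As an aside on letting the compiler do the bookkeeping: with OpenMP, one common fork-join implementation that the slides do not cover, the parallel array_add reduces to a single pragma and the compiler generates the forking, work splitting, and joining. A sketch (assumes a compiler flag such as -fopenmp with gcc/clang):

/* A hypothetical OpenMP version of array_add; compile with -fopenmp. */
void array_add_omp(int A[], int B[], int C[], int length) {
    #pragma omp parallel for   /* fork a team of threads, split the iterations, join at the end */
    for (int i = 0; i < length; ++i) {
        C[i] = A[i] + B[i];
    }
}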

