PARALLEL PROCESSING UNIT 01
UNDERSTANDING PARALLEL ENVIRONMENT
QUIZ
What are 3 traditional ways HW designers make computers run faster?
Faster Clocks
Longer Clock Period
PARALLEL COMPUTING
Parallel computing was originally intended for supercomputing; now all computers and mobile devices use it.
Modern GPUs:
⚫ Hundreds of processors
⚫ Thousands of ALUs (~3,000)
⚫ Tens of thousands of concurrent threads
This requires a different way of programming than a single scalar processor: general-purpose programmability on the GPU (GPGPU).
TRANSISTORS CONTINUE ON
MOORE’S PATH . . . FOR NOW
CLOCK SPEED (NO MORE SPEED)
QUIZ
Are processors today still getting faster? If so, why?
MORE TO UNDERSTAND (CONT.)
⚫ Simple structure: less speed, less power
⚫ Complex structure: more speed, more power
QUIZ
Which techniques are computer designers using today to build more power-efficient chips?
ANOTHER FACTOR FOR POWER EFFICIENCY
GPU DESIGN BELIEFS
Lots of simple compute units
Explicitly parallel programming model
⚫ The programmer knows there are many processors and does not depend on the compiler, for example, to parallelize the task.
Optimized for throughput, not latency
INTRO TO PARALLEL PROGRAMMING
IMPORTANCE OF PARALLEL PROGRAMMING
Intel 8-core Ivy Bridge
⚫ 8-wide AVX vector operations per core (8 × 8 = 64 simultaneous operations)
CUDA PLATFORM
CUDA program: C with extensions
⚫ CPU: the "Host", with its own memory
⚫ GPU: the "Device", a co-processor with its own memory
QUIZ
What is the GPU good at?
Sequential solution:
for (int i = 0; i < 64; i++)
    out[i] = in[i] * in[i];
⚫ Here we have 1 thread doing 64 multiplications, each taking 2 ns (128 ns total).
GPU POWER (CONT.)
Example:
⚫ In: [1, 2, 3, …, 64]
⚫ Out: [1², 2², 3², …, 64²]
Parallel solution:
⚫ CPU code: square_kernel<<<1, 64>>>(out, in) (1 block of 64 threads)
⚫ Here we have 64 threads, each doing 1 multiplication taking 10 ns (10 ns total).
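As a complete program, the parallel solution above might look like the following minimal CUDA sketch (the host-side allocation, the memory copies, and the exact <<<1, 64>>> launch shape are illustrative assumptions):

```cuda
#include <cstdio>

// Kernel: each thread squares exactly one element.
__global__ void square_kernel(float *out, const float *in) {
    int i = threadIdx.x;              // one thread per element, single block
    out[i] = in[i] * in[i];
}

int main() {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)(i + 1);   // In: [1, 2, …, 64]

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    square_kernel<<<1, N>>>(d_out, d_in);   // 1 block of 64 threads

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0]=%.0f out[63]=%.0f\n", h_out[0], h_out[63]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```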
EXAMPLE
THREADS AND BLOCKS
THREADS
BLOCKS
GRID
MAXIMUMS
You can launch up to 1024 threads per block (or
512 if your card is compute capability 1.3 or less).
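These limits need not be hard-coded; they can be queried at run time with the standard CUDA runtime call cudaGetDeviceProperties. A small sketch (querying device 0 is an assumption for illustration):

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
    return 0;
}
```

On cards of compute capability 2.0 or newer, maxThreadsPerBlock reports 1024.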
WHY BLOCKS AND THREADS?
You may be wondering why not just say "launch 67 million threads" instead of organizing them into blocks.
Suppose you wrote a program for a GPU which can run 2,000 threads concurrently. Then you want to execute the same code on a higher-end GPU with 6,000 threads. Are you going to change the whole code for each GPU?
Each GPU has a limit on the number of threads per block but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, each executing some number of threads simultaneously.
By adding this extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker, with absolutely no change to the code.
NVIDIA has done this to allow automatic performance gains when your code is run on different, higher-performance GPUs.
DIM3
DIM3 DATA TYPE
SomeKernel<<<100, 25>>>(...);
Inside the kernel, each thread can calculate a unique id with:
⚫ int id = blockIdx.x * blockDim.x + threadIdx.x;
So the thread with threadIdx.x = 5 in the block with blockIdx.x = 4 would calculate:
⚫ int id = 4 * 25 + 5 = 105
The thread with threadIdx.x = 14 in the block with blockIdx.x = 76 would calculate:
⚫ int id = 76 * 25 + 14 = 1914
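A small sketch that checks this id calculation for the <<<100, 25>>> launch above (the write_ids kernel name and the host code are illustrative):

```cuda
#include <cstdio>

// Each thread writes its own global id into the slot with that index.
__global__ void write_ids(int *out) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    out[id] = id;
}

int main() {
    const int blocks = 100, threads = 25, n = blocks * threads;  // 2500 threads
    int h_out[2500];
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    write_ids<<<blocks, threads>>>(d_out);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    // The thread with blockIdx.x = 4, threadIdx.x = 5 wrote slot 4*25+5 = 105:
    printf("h_out[105] = %d\n", h_out[105]);
    printf("h_out[1914] = %d\n", h_out[1914]);

    cudaFree(d_out);
    return 0;
}
```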
MAPPING
MAP
Set of elements to process [64 floats]
Function to run on each element [square]
Map(element, function)
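In CUDA, Map is expressed by giving each thread exactly one element to apply the function to; a minimal sketch (the kernel and helper names are illustrative):

```cuda
// The function to run on each element.
__device__ float square_elem(float x) { return x * x; }

// Map: every thread applies the same function to its own element,
// independently of all other elements.
__global__ void map_square(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square_elem(in[i]);   // one element per thread
}
```

Because each output element depends only on the corresponding input element, Map has no communication between threads, which is exactly the pattern GPUs execute fastest.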
QUIZ
Which problems can be solved using Map?