Parallel 01

The document discusses parallel processing units and parallel computing. It explains that traditional CPUs are not the most energy-efficient processors because of their complex control hardware, while GPU-like processors are more efficient because they have simpler control structures and devote more transistors to computation. It also discusses how computer designers build more power-efficient chips today by using many simpler processors rather than fewer complex ones.

Uploaded by demro channel

PARALLEL PROCESSING UNIT
UNDERSTANDING PARALLEL ENVIRONMENT
QUIZ
What are 3 traditional ways HW designers make computers run faster?

⚫ Faster clocks ✓
⚫ Longer clock period
⚫ More work per clock cycle ✓
⚫ Larger hard disk
⚫ More processors ✓
⚫ Reduce amount of memory
SEYMOUR CRAY (SUPERCOMPUTER DESIGNER)
 If you are plowing a field, which would you rather use?
⚫ Two strong oxen?
⚫ 1024 chickens?
PARALLEL COMPUTING
 It was originally intended for supercomputing.
 Now all computers and mobiles use parallel computing.
 Modern GPUs:
⚫ Hundreds of processors
⚫ Thousands of ALUs (≈3,000)
⚫ Tens of thousands of concurrent threads
 This requires a different way of programming than a single scalar processor.
 General-purpose programmability on the GPU (GPGPU).
TRANSISTORS CONTINUE ON MOORE'S PATH . . . FOR NOW
CLOCK SPEED (NO MORE SPEED)
QUIZ
 Are processors today getting faster because:
⚫ We are clocking their transistors faster?
⚫ We have more transistors available for computation? ✓
 Why don't we keep increasing the clock speed of a single processor, instead of using multiple processors with a lower clock speed?
⚫ We can't, because of power (heat).
WHAT KIND OF PROCESSORS WILL WE BUILD?
 Assume the major design constraint is power.
 Why are traditional CPU-like processors not the most energy-efficient processors?
⚫ They have complex control hardware.
⚫ This increases flexibility and performance,
⚫ but it also increases power consumption and design complexity.
 How to increase power efficiency (GPU-like)?
⚫ Build a simple control structure.
⚫ Take those transistors and devote them to supporting more computation on the data path.
⚫ The challenge becomes: how to program it?
MORE TO UNDERSTAND

MORE TO UNDERSTAND (CONT.)
[Figure: a simple control structure gives less speed at less power; a complex control structure gives more speed at more power]
QUIZ
 Which techniques are computer designers using today to build more power-efficient chips?
⚫ Fewer, more complex processors
⚫ More, simpler processors ✓
⚫ Maximizing the speed of the processor clock
⚫ Increasing the complexity of the control hardware
ANOTHER FACTOR FOR POWER EFFICIENCY
Power Efficiency

⚫ Decrease latency: the amount of time to complete a single task ("time").
⚫ Increase throughput: the number of tasks completed per unit time ("number").

 The two goals are not aligned:
⚫ CPU-like: designed to decrease latency.
⚫ GPU-like: designed to increase throughput.
 The choice depends on the application (image processing prefers to increase throughput).
SUPER QUIZ
 Why do I say GPU-like and not multi-core CPU? Is there a difference?!
⚫ They are both built for parallel programming. However, multi-core CPUs can be used for sequential as well as parallel programming (they provide branches and interrupts). On the other hand, the GPU is built for parallel programming from scratch.
GPU DESIGN BELIEFS
 Lots of simple compute units
 Explicitly parallel programming model
⚫ We know there are many processors; we don't depend on the compiler, for example, to parallelize the task for us.
 Optimized for throughput, not latency
INTRO TO PARALLEL PROGRAMMING
IMPORTANCE OF PARALLEL PROGRAMMING
 Intel 8-core Ivy Bridge
 8-wide AVX vector operations per core
 2 threads per core (hyper-threading)
 This means the processor has 128-way parallelism.
 Parallel programming is more complex; however, running a sequential C program means using less than 1% of this processor's power.
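The 128-way figure is just the product of the three numbers above; a minimal sanity check in C (the counts are taken from the slide, not queried from real hardware):

```c
#include <assert.h>

/* 8 cores x 8-wide AVX x 2 hyper-threads per core (slide figures). */
static int parallel_ways(int cores, int avx_width, int threads_per_core) {
    return cores * avx_width * threads_per_core;
}
```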
CUDA PLATFORM
CUDA Program: "C with extensions"

CPU ("Host") ↔ GPU ("Device", a co-processor)
Each has its own memory.

 The CUDA compiler generates two separate programs: one for the CPU (Host) and another for the GPU (Device).
 The CPU is in charge and controls the GPU:
⚫ Moves data between memories (cudaMemcpy)
⚫ Allocates memory on the GPU (cudaMalloc)
⚫ Invokes programs (kernels) on the GPU: "the Host launches kernels on the Device"
QUIZ
The GPU can do the following:

⚫ Initiate a data send from GPU to CPU
⚫ Respond to a CPU request to send data from GPU to CPU ✓
⚫ Initiate a data request from CPU to GPU
⚫ Respond to a CPU request to receive data from CPU to GPU ✓
⚫ Compute a kernel launched by the CPU ✓
⚫ Compute a kernel launched by the GPU
TYPICAL GPU PROGRAM
 The CPU allocates storage on the GPU.
 The CPU copies input data from the CPU to the GPU.
 The CPU launches kernels on the GPU to process the data.
 The CPU copies the results back from the GPU to the CPU.

 If you need to move data many times between CPU and GPU, CUDA is not a good fit for your program, because each round trip repeats the steps above.
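The four steps above can be sketched in CUDA C. This is a minimal illustration under assumptions of mine, not code from the course: the kernel name `square` and the 64-element workload are borrowed from the squaring example later in the deck, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(float *out, const float *in) {
    int i = threadIdx.x;
    out[i] = in[i] * in[i];
}

int main(void) {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    // 1. CPU allocates storage on the GPU
    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    // 2. CPU copies input data from CPU to GPU
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // 3. CPU launches the kernel on the GPU to process the data
    square<<<1, N>>>(d_out, d_in);

    // 4. CPU copies results back from the GPU to the CPU
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("%.0f\n", h_out[8]); // 8 squared = 64
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```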
MAIN ISSUE
 Defining the GPU computation:
⚫ Write a kernel like a serial program.
⚫ When launching the kernel, tell the GPU how many threads to launch.
QUIZ
What is the GPU good at?

⚫ Launching a small number of threads efficiently
⚫ Launching a large number of threads efficiently ✓
⚫ Running one thread very quickly
⚫ Responding to a CPU request to receive data from CPU to GPU
⚫ Running one thread that does lots of work in parallel
⚫ Running a large number of threads in parallel ✓
GPU POWER
 Example:
⚫ In:  [1, 2, 3, …, 64]
⚫ Out: [1², 2², 3², …, 64²]

 Sequential solution:
for (int i = 0; i < 64; i++)
    out[i] = in[i] * in[i];
⚫ Here we have 1 thread doing 64 multiplications, each taking 2 ns.
GPU POWER (CONT.)
 Example:
⚫ In:  [1, 2, 3, …, 64]
⚫ Out: [1², 2², 3², …, 64²]

CPU: allocates memory, copies data to/from the GPU, launches the kernel.
GPU: computes out = in * in.

 Parallel solution:
⚫ CPU code: square<<<1, 64>>>(out, in)
⚫ Here we have 64 threads, each doing 1 multiplication, which takes 10 ns.
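Using the slide's figures (2 ns per multiplication sequentially, 10 ns for the slower GPU multiply), the trade-off can be made concrete with a small C sketch; the totals below are idealized, ignoring launch and copy overhead:

```c
/* Sequential: one thread performs all multiplications back to back. */
static int sequential_total_ns(int mults, int ns_per_mult) {
    return mults * ns_per_mult;
}

/* Parallel: 64 threads each perform one multiplication concurrently,
   so the elapsed time is just one (slower) multiplication. */
static int parallel_total_ns(int ns_per_mult) {
    return ns_per_mult;
}
```

Even though each GPU multiplication is slower (10 ns vs 2 ns), finishing all 64 at once (10 ns total vs 128 ns) is the throughput win the slide is pointing at.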
EXAMPLE
[Figure: worked example, starting from "start"]
THREADS AND BLOCKS
THREADS
 A thread is a single execution unit that runs a kernel on the GPU. Threads are similar to CPU threads, but there are usually many more of them. They are sometimes drawn as arrows.
BLOCKS
 Thread blocks are a virtual collection of threads.
 All the threads in any single thread block can communicate.
GRID
 A kernel is launched as a collection of thread blocks, called the grid.
MAXIMUMS
 You can launch up to 1024 threads per block (or 512 if your card is compute capability 1.3 or less).
 You can launch 2³²−1 blocks in a single launch (or 2¹⁶−1 if your card is compute capability 2.0 or less).
 So my relatively inexpensive GeForce GT 440 can launch a rather ridiculous 67,108,864 threads.
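The GT 440 figure can be checked against the limits above; a small C check, on my assumption that the slide rounds the 2¹⁶−1 block limit (the card is compute capability 2.x) up to 2¹⁶:

```c
/* Max threads in one launch = threads per block x blocks per launch. */
static long max_threads(long threads_per_block, long blocks) {
    return threads_per_block * blocks;
}
```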
WHY BLOCKS AND THREADS?
 You may be wondering why not just say "launch 67 million threads" instead of organizing them into blocks.
 Suppose you wrote a program for a GPU which can run 2000 threads concurrently. Then you want to execute the same code on a higher-end GPU with 6000 threads. Are you going to change the whole code for each GPU?
 Each GPU has a limit on the number of threads per block, but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, executing some number of threads simultaneously.
 By adding the extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker with absolutely no change to the code.
 nVidia has done this to allow automatic performance gains when your code is run on different, higher-performance GPUs.
DIM3
DIM3 DATA TYPE
 dim3 is a 3D structure or vector type with three integers: x, y, and z. You can initialize as many of the three coordinates as you like:
⚫ dim3 threads(256);           // x is 256; y and z will both be 1
⚫ dim3 blocks(100, 100);       // initializes x and y; z will be 1
⚫ dim3 anotherOne(10, 54, 32); // initializes all three: x is 10, y gets 54, z gets 32
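As a sketch of how these dim3 values are then passed to a kernel launch; the kernel name and its (unused) argument are hypothetical, not from the slides:

```cuda
__global__ void someKernel(float *data) {          // hypothetical kernel
    int x = blockIdx.x * blockDim.x + threadIdx.x; // this thread's column
    // ... use x (and blockIdx.y for the row) to pick this thread's work
}

int main(void) {
    float *d_data = 0;        // assume allocated with cudaMalloc beforehand
    dim3 blocks(100, 100);    // 100 x 100 grid of blocks (z = 1)
    dim3 threads(256);        // 256 threads per block (y = z = 1)

    // Grid configuration comes first, then the block configuration:
    someKernel<<<blocks, threads>>>(d_data);
    return 0;
}
```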
THREAD ACCESS PARAMETERS
 Each of the running threads is individual; each knows the following:
⚫ threadIdx ← thread index within the block
⚫ blockIdx ← block index within the grid
⚫ blockDim ← number of threads in the block
⚫ gridDim ← number of blocks in the grid
 Each of these is a dim3 structure and can be read in the kernel to assign particular workloads to any thread.
THREAD ACCESS PATTERN
 It's common to have threads calculate a unique id within the kernel to process some specific data. If we launch a kernel with:
⚫ SomeKernel<<<100, 25>>>(...);
 then inside the kernel, each thread can calculate a unique id with:
⚫ int id = blockIdx.x * blockDim.x + threadIdx.x;
 So the thread with threadIdx.x = 5 in the block with blockIdx.x = 4 would calculate:
⚫ int id = 4 * 25 + 5 = 105
 The thread with threadIdx.x = 14 in the block with blockIdx.x = 76 would calculate:
⚫ int id = 76 * 25 + 14 = 1914
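Since the id formula is plain integer arithmetic, it can be checked on the host in ordinary C (the helper name is mine):

```c
/* Global thread id, as computed inside the kernel:
   id = blockIdx.x * blockDim.x + threadIdx.x */
static int global_id(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}
```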
MAPPING
MAP
 Set of elements to process [64 floats]
 Function to run on each element [square]

Map(element, function)
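A host-side sketch of map in plain C, assuming nothing beyond the slide: apply one function independently to each element of a set (a GPU would give each element to its own thread; the function names are mine):

```c
#include <stddef.h>

/* map(elements, function): apply f independently to every element. */
static void map_apply(float *out, const float *in, size_t n,
                      float (*f)(float)) {
    for (size_t i = 0; i < n; ++i)
        out[i] = f(in[i]);   /* no element depends on any other */
}

static float square(float x) { return x * x; }
```

Because no element depends on any other, every application of f can run in parallel, which is exactly what makes map a good fit for the GPU.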
QUIZ
Which problems can be solved using map?

⚫ Sort an input array
⚫ Add one to each element of an input array ✓
⚫ Sum up all elements of an input array
⚫ Compute the average of an input array
