PARALLEL PROCESSING UNIT 01
UNDERSTANDING PARALLEL ENVIRONMENT
QUIZ
What are 3 traditional ways HW designers make computers run faster?
Faster Clocks
Longer Clock Period
PARALLEL COMPUTING
Parallel computing was originally intended for supercomputing; now all computers and mobile devices use it.
Modern GPUs:
⚫ Hundreds of processors
⚫ Thousands of ALUs (~3,000)
⚫ Tens of thousands of concurrent threads
This requires a different way of programming than a single scalar processor: general-purpose programmability on the GPU (GPGPU).
TRANSISTORS CONTINUE ON
MOORE’S PATH . . . FOR NOW
CLOCK SPEED (NO MORE SPEED)
QUIZ
Are processors today still getting faster? If so, why?
MORE TO UNDERSTAND (CONT.)
⚫ Simple structure: less speed, less power
⚫ Complex structure: more speed, more power
QUIZ
Which techniques are computer designers using today to build more power-efficient chips?
ANOTHER FACTOR FOR POWER EFFICIENCY
GPU DESIGN BELIEFS
Lots of simple compute units
Explicitly parallel programming model
⚫ The programmer knows there are many processors and does not depend on the compiler, for example, to parallelize the task.
Optimized for throughput, not latency
INTRO TO PARALLEL PROGRAMMING
IMPORTANCE OF PARALLEL PROGRAMMING
Intel 8-core Ivy Bridge
⚫ 8-wide AVX vector operations per core (8 × 8 = 64 simultaneous operations)
CUDA PLATFORM
CUDA program: C with extensions
⚫ CPU: the "Host", with its own memory
⚫ GPU: the "Device", a co-processor with its own memory
QUIZ
What is the GPU good at?
Sequential solution:
for (int i = 0; i < 64; i++)
    out[i] = in[i] * in[i];
⚫ Here we have 1 thread doing 64 multiplications, each taking 2 ns (128 ns total).
GPU POWER (CONT.)
Example:
⚫ In: [1, 2, 3, …, 64]
⚫ Out: [1², 2², 3², …, 64²]
Parallel solution:
⚫ CPU code: square_kernel<<<1, 64>>>(out, in) (1 block of 64 threads)
⚫ Here we have 64 threads, each doing 1 multiplication taking 10 ns (10 ns total).
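As a complete program, the parallel solution above might look like the following minimal CUDA sketch (the host-side allocation, the memory copies, and the exact <<<1, 64>>> launch shape are illustrative assumptions):

```cuda
#include <cstdio>

// Kernel: each thread squares exactly one element.
__global__ void square_kernel(float *out, const float *in) {
    int i = threadIdx.x;              // one thread per element, single block
    out[i] = in[i] * in[i];
}

int main() {
    const int N = 64;
    float h_in[N], h_out[N];
    for (int i = 0; i < N; i++) h_in[i] = (float)(i + 1);   // In: [1, 2, …, 64]

    float *d_in, *d_out;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    square_kernel<<<1, N>>>(d_out, d_in);   // 1 block of 64 threads

    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[0]=%.0f out[63]=%.0f\n", h_out[0], h_out[63]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```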
EXAMPLE
THREADS AND BLOCKS
THREADS
BLOCKS
GRID
MAXIMUMS
You can launch up to 1024 threads per block (or
512 if your card is compute capability 1.3 or less).
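These limits need not be hard-coded; they can be queried at run time with the standard CUDA runtime call cudaGetDeviceProperties. A small sketch (querying device 0 is an assumption for illustration):

```cuda
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x):     %d\n", prop.maxGridSize[0]);
    return 0;
}
```

On cards of compute capability 2.0 or newer, maxThreadsPerBlock reports 1024.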
WHY BLOCKS AND THREADS?
You may be wondering why not just say "launch 67 million threads" instead of organizing them into blocks.
Suppose you wrote a program for a GPU which can run 2,000 threads concurrently. Then you want to execute the same code on a higher-end GPU with 6,000 threads. Are you going to change the whole code for each GPU?
Each GPU has a limit on the number of threads per block but (almost) no limit on the number of blocks. Each GPU can run some number of blocks concurrently, each executing some number of threads simultaneously.
By adding this extra level of abstraction, higher-performance GPUs can simply run more blocks concurrently and chew through the workload quicker, with absolutely no change to the code.
NVIDIA has done this to allow automatic performance gains when your code is run on different, higher-performance GPUs.
DIM3
DIM3 DATA TYPE
SomeKernel<<<100, 25>>>(...);
Inside the kernel, each thread can calculate a unique id with:
⚫ int id = blockIdx.x * blockDim.x + threadIdx.x;
So the thread with threadIdx.x = 5 in the block with blockIdx.x = 4 would calculate:
⚫ int id = 4 * 25 + 5 = 105
The thread with threadIdx.x = 14 in the block with blockIdx.x = 76 would calculate:
⚫ int id = 76 * 25 + 14 = 1914
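A small sketch that checks this id calculation for the <<<100, 25>>> launch above (the write_ids kernel name and the host code are illustrative):

```cuda
#include <cstdio>

// Each thread writes its own global id into the slot with that index.
__global__ void write_ids(int *out) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    out[id] = id;
}

int main() {
    const int blocks = 100, threads = 25, n = blocks * threads;  // 2500 threads
    int h_out[2500];
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));

    write_ids<<<blocks, threads>>>(d_out);

    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    // The thread with blockIdx.x = 4, threadIdx.x = 5 wrote slot 4*25+5 = 105:
    printf("h_out[105] = %d\n", h_out[105]);
    printf("h_out[1914] = %d\n", h_out[1914]);

    cudaFree(d_out);
    return 0;
}
```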
MAPPING
MAP
Set of elements to process [64 floats]
Function to run on each element [square]
Map(element, function)
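In CUDA, Map is expressed by giving each thread exactly one element to apply the function to; a minimal sketch (the kernel and helper names are illustrative):

```cuda
// The function to run on each element.
__device__ float square_elem(float x) { return x * x; }

// Map: every thread applies the same function to its own element,
// independently of all other elements.
__global__ void map_square(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square_elem(in[i]);   // one element per thread
}
```

Because each output element depends only on the corresponding input element, Map has no communication between threads, which is exactly the pattern GPUs execute fastest.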
QUIZ
Which problems can be solved using Map?