This document provides an overview of OpenCL concepts and programming using a simple example. It discusses OpenCL concepts like work-items, work-groups, and memory types. It then demonstrates how to program with OpenCL by initializing the platform and devices, compiling a kernel, setting arguments, launching the kernel, and mapping memory to access the results. The example kernel simply writes each work-item's ID to the corresponding element in the output buffer.

Lecture 19: OpenCL

ECE 459: Programming for Performance

March 18, 2014

Last Time: Compiler Optimizations

Compiler reads your program


and emits one just like it, but faster.
Also: profile-guided optimizations.


Part I
OpenCL concepts


Introduction

OpenCL: coding on a heterogeneous architecture.


No longer just programming the CPU;
will also leverage the GPU.
OpenCL = Open Computing Language.
Usable on both NVIDIA and AMD GPUs.


SIMT

Another term you may see vendors using:


Single Instruction, Multiple Threads.
Runs on a vector of data.
Similar to SIMD instructions (e.g. SSE).
However, the vector is spread out over the GPU.


Other Heterogeneous Programming Examples

PlayStation 3 Cell
CUDA

[PS4: back to a regular CPU/GPU system,


albeit on one chip.]


(PS3) Cell Overview


Cell consists of:
a PowerPC core; and
8 SIMD co-processors.

(from the Linux Cell documentation)



CUDA Overview

Compute Unified Device Architecture:


NVIDIA's architecture for processing on GPUs.
C for CUDA predates OpenCL;
NVIDIA supports it first and foremost.
May be faster than OpenCL on NVIDIA hardware.
The API allows you to use (most) C++ features in CUDA;
OpenCL has more restrictions.


GPU Programming Model

The abstract model is simple:


Write the code for the parallel computation (kernel)
separately from main code.
Transfer the data to the GPU co-processor
(or execute it on the CPU).
Wait . . .
Transfer the results back.

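The four steps above can be sketched in ordinary C++, with the device "transfer" played by a plain copy and a doubling loop standing in for the kernel. The function name and the kernel are invented for illustration; none of this is OpenCL API.

```cpp
#include <vector>

// Hedged sketch of the abstract GPU model, using the CPU as a stand-in.
// run_kernel_on_device is an invented placeholder, not an OpenCL call.
std::vector<int> run_kernel_on_device(const std::vector<int>& host_data) {
    // 1. Transfer the data to the co-processor (here: just a copy).
    std::vector<int> device_buffer = host_data;
    // 2. The kernel runs over every element (conceptually in parallel).
    for (int& x : device_buffer) x *= 2;
    // 3. Wait for completion (implicit here), then
    // 4. transfer the results back.
    return device_buffer;
}
```

The real program structure later in the lecture follows exactly this shape, with clEnqueueWriteBuffer / clEnqueueNDRangeKernel / clEnqueueReadBuffer in place of the copy, loop, and return.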

Data Parallelism
Key idea: evaluate a function (or kernel)
over a set of points (data).

Another example of data parallelism.


Another name for the set of points: index space.
Each point corresponds to a work-item.

Note: OpenCL also supports task parallelism (using


different kernels), but documentation is sparse.

Work-Items

Work-item: the fundamental unit of work in OpenCL.


Stored in an n-dimensional grid (ND-Range); the lecture's figure shows a 2-D example.
OpenCL spawns a bunch of threads to handle work-items.
When executing, the range is divided into work-groups,
which execute on the same compute unit.
The group of work-items that executes in lockstep on a compute unit
is called something different depending on the manufacturer:
NVIDIA: warp
AMD/ATI: wavefront


Work-Items: Three more details

One thread per work item, each with a different thread ID.
You can say how to divide the ND-Range into work-groups,
or the system can do it for you.
The scheduler assigns work-items to warps/wavefronts
until none are left.

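The relation between thread IDs, work-groups, and the ND-Range can be sketched with plain arithmetic. This is not the OpenCL API; get_global_id(0) effectively computes this for the running work-item, and the sizes in the usage below (512 items, groups of 64) are invented for illustration.

```cpp
// Hedged sketch: a work-item's global ID in dimension 0 is its
// work-group's index times the work-group (local) size, plus its
// local ID within the group.
int global_id(int group_id, int local_size, int local_id) {
    return group_id * local_size + local_id;
}
```

For example, with 512 work-items split into work-groups of 64, the last work-item of the last group (group 7, local ID 63) has global ID 511.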

Shared Memory

There are many different types of memory available to you:


private memory: available to a single work-item;
local memory (aka shared memory): shared between
work-items belonging to the same work-group;
like a user-managed cache;
global memory: shared between all work-items
as well as the host;
constant memory: resides on the GPU and is cached;
it does not change.
There is also host memory (normal memory);
usually contains app data.


Example Kernel
Here's some traditional code to evaluate C_i = A_i * B_i:

void traditional_mul( int n,
                      const float *a,
                      const float *b,
                      float *c ) {
    int i;
    for ( i = 0; i < n; i++ ) c[i] = a[i] * b[i];
}

And as a kernel:

__kernel void opencl_mul( __global const float *a,
                          __global const float *b,
                          __global float *c ) {
    int id = get_global_id(0);    // dimension 0
    c[id] = a[id] * b[id];
}

Restrictions when writing kernels in OpenCL

It's mostly C, but:


No function pointers.
No bit-fields.
No variable-length arrays.
No recursion.
No standard headers.


OpenCL's Additions to C in Kernels

In kernels, you can also use:


Work-items.
Work-groups.
Vectors.
Synchronization.
Declarations of memory type.
Kernel-specific library.


Branches in kernels
kernel void contains_branch( global float *a,
                             global float *b ) {
    int id = get_global_id(0);
    if ( cond ) {    // some condition on the data
        a[id] += 5.0;
    } else {
        b[id] += 5.0;
    }
}

The hardware will execute all branches that any thread in a warp
executes; this can be slow!
In other words: an if statement will cause each thread to execute
both branches; we keep only the result of the taken branch.
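A minimal cost model of that lockstep behaviour, with invented unit costs: the warp pays for every path that any of its threads takes.

```cpp
// Hedged sketch: a warp executes in lockstep, so if any thread takes
// the 'then' branch and any thread takes the 'else' branch, the whole
// warp steps through both. Costs are invented unit times, not
// measured numbers.
int warp_branch_cost(bool any_thread_takes_then, bool any_thread_takes_else,
                     int then_cost, int else_cost) {
    int total = 0;
    if (any_thread_takes_then) total += then_cost;  // warp steps through 'then'
    if (any_thread_takes_else) total += else_cost;  // ...and through 'else'
    return total;
}
```

If all threads in the warp agree on the branch, the cost is that of one path; if they diverge, the costs add.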

Loops in kernels
kernel void contains_loop( global float *a,
                           global float *b ) {
    int id = get_global_id(0);
    for ( int i = 0; i < id; i++ ) {
        b[i] += a[i];
    }
}

A loop will cause the work-group to wait for the maximum number
of iterations of the loop in any work-item.
Note: when you set up work-groups, it is best to arrange for all
work-items in a work-group to execute the same branches & loops.
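The waiting rule can be sketched the same way: the wavefront/warp finishes only when its slowest work-item does. The per-iteration time is an invented unit, not a measurement.

```cpp
#include <vector>
#include <algorithm>

// Hedged model: total loop time for a wavefront/warp is the maximum
// iteration count over its work-items, times the time per iteration.
int warp_loop_time(const std::vector<int>& iterations_per_item,
                   int t_per_iteration) {
    int max_iters = *std::max_element(iterations_per_item.begin(),
                                      iterations_per_item.end());
    return max_iters * t_per_iteration;
}
```

So in the kernel above, where work-item id loops id times, one work-item with id = 100 makes the whole wavefront/warp take 100 iterations' worth of time, even if every other item finishes after a handful.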

Synchronization

Different workgroups execute independently.


You can only put barriers and memory fences between
work-items in the same workgroup.
OpenCL supports:
Memory fences (load and store).
Barriers.
volatile (beware!)


Part II
Programming with OpenCL


Introduction
Today, we'll see how to program with OpenCL.
We're using OpenCL 1.1.
There is a lot of initialization and querying.
When you compile your program, include -lOpenCL.

You can find the official documentation here:


http://www.khronos.org/opencl/
More specifically:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

Let's just dive into an example.


First, reminders

All data belongs to an NDRange.


The range can be divided into work-groups. (in software)
The work-groups run on wavefronts/warps. (in hardware)
Each wavefront/warp executes work-items.

All branches in a wavefront/warp should execute the same path.


If an iteration of a loop takes t:
when one work-item executes 100 iterations,
the total time to complete the wavefront/warp is 100t.


Part III
Simple Example


Simple Example (1)

// Note by PL: don't use this example as a template;
// it uses the C bindings! Instead, use the C++ bindings.
// source: pages 1-9 through 1-11,
// http://developer.amd.com/wordpress/media/2013/07/
// AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )      \n"
"{                                               \n"
"    dst[ get_global_id(0) ] = get_global_id(0); \n"
"}                                               \n";

int main( int argc, char **argv )
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );

Explanation (1)

Include the OpenCL header.


Request a platform (also known as a host).
A platform contains compute devices:
GPUs or CPUs.


Simple Example (2)

    // 2. Find a GPU device.
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU,
                    1,
                    &device,
                    NULL );

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL,
                                          1,
                                          &device,
                                          NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context,
                                                   device,
                                                   0, NULL );

Explanation (2)

Request a GPU device.


Request an OpenCL context (representing all of OpenCL's state).
Create a command-queue:
get OpenCL to do work by telling it to run a kernel in a queue.


Simple Example (3)

    // 4. Perform runtime source compilation, and obtain
    //    kernel entry point.
    cl_program program = clCreateProgramWithSource( context,
                                                    1,
                                                    &source,
                                                    NULL,
                                                    NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "memset",
                                       NULL );

    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer( context,
                                    CL_MEM_WRITE_ONLY,
                                    NWITEMS * sizeof(cl_uint),
                                    NULL, NULL );

Explanation (3)
We create an OpenCL program (runs on the compute unit):
kernels;
functions; and
declarations.
In this case, we create a kernel called memset from source.
OpenCL may also create programs from binaries
(may be in intermediate representation).
Next, we need a data buffer (enables inter-device communication).
This program does not have any input,
so we don't put anything into the buffer (just declare its size).


Simple Example (4)

    // 6. Launch the kernel. Let OpenCL pick the local work
    //    size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg( kernel, 0, sizeof(buffer), (void *) &buffer );
    clEnqueueNDRangeKernel( queue,
                            kernel,
                            1,                  // dimensions
                            NULL,               // initial offsets
                            &global_work_size,  // number of work-items
                            NULL,               // work-items per work-group
                            0, NULL, NULL );    // events
    clFinish( queue );

    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer,
                                          CL_TRUE, CL_MAP_READ,
                                          0, NWITEMS * sizeof(cl_uint),
                                          0, NULL, NULL, NULL );

Explanation (4)
Set kernel arguments to buffer.
We launch the kernel, enqueuing the 1-dimensional
index space starting at 0.
We specify that the index space has NWITEMS elements;
and not to subdivide the program into work-groups.
There is also an event interface, which we do not use.
We copy the results back; the call is blocking (CL_TRUE),
hence we don't need an explicit clFinish() call.
We specify that we want to read the results back into
buffer.


Simple Example (5)

    int i;
    for ( i = 0; i < NWITEMS; i++ )
        printf( "%d %d\n", i, ptr[i] );
    return 0;
}

The program simply prints 0 0, 1 1, . . . , 511 511.


Note: I didn't clean up or include error handling
for any of the OpenCL functions.


Part IV
Another Example


C++ Bindings

If we use the C++ bindings, we'll get automatic
resource release and exceptions.
C++ likes to use the RAII style
(resource acquisition is initialization).
Change the header to CL/cl.hpp and define
__CL_ENABLE_EXCEPTIONS.
We'd also like to store our kernel in a file instead of a string.
The C API is not so nice to work with.


Vector Addition Kernel

Let's write a kernel that adds two vectors and stores the result.
This kernel will go in the file vector_add_kernel.cl.

__kernel void vector_add( __global const int *A,
                          __global const int *B,
                          __global int *C ) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}

Other possible qualifiers: local, constant and private.

Vector Addition (1)

// Vector add example, C++ bindings (use these!)
// source:
// http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Create the two input vectors
    const int LIST_SIZE = 1000;
    int *A = new int[LIST_SIZE];
    int *B = new int[LIST_SIZE];
    for ( int i = 0; i < LIST_SIZE; i++ ) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }

Vector Addition (2)

    try {
        // Get available platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);

        // Select the default platform and create a context
        // using this platform and the GPU
        cl_context_properties cps[3] = {
            CL_CONTEXT_PLATFORM,
            (cl_context_properties)(platforms[0])(),
            0
        };
        cl::Context context(CL_DEVICE_TYPE_GPU, cps);

        // Get a list of devices on this platform
        std::vector<cl::Device> devices =
            context.getInfo<CL_CONTEXT_DEVICES>();

        // Create a command queue and use the first device
        cl::CommandQueue queue = cl::CommandQueue(context,
                                                  devices[0]);

Explanation (2)

You can define __NO_STD_VECTOR and use cl::vector
(same with strings).
You can enable profiling by adding
CL_QUEUE_PROFILING_ENABLE as the third argument to the
queue constructor.


Vector Addition (3)

        // Read source file
        std::ifstream sourceFile("vector_add_kernel.cl");
        std::string sourceCode(
            std::istreambuf_iterator<char>(sourceFile),
            (std::istreambuf_iterator<char>())
        );
        cl::Program::Sources source(
            1,
            std::make_pair(sourceCode.c_str(),
                           sourceCode.length() + 1)
        );

        // Make program of the source code in the context
        cl::Program program = cl::Program(context, source);

        // Build program for these specific devices
        program.build(devices);

        // Make kernel
        cl::Kernel kernel(program, "vector_add");

Vector Addition (4)

        // Create memory buffers
        cl::Buffer bufferA = cl::Buffer(
            context,
            CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int)
        );
        cl::Buffer bufferB = cl::Buffer(
            context,
            CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int)
        );
        cl::Buffer bufferC = cl::Buffer(
            context,
            CL_MEM_WRITE_ONLY,
            LIST_SIZE * sizeof(int)
        );

Vector Addition (5)

        // Copy lists A and B to the memory buffers
        queue.enqueueWriteBuffer(
            bufferA,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            A
        );
        queue.enqueueWriteBuffer(
            bufferB,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            B
        );

        // Set arguments to kernel
        kernel.setArg(0, bufferA);
        kernel.setArg(1, bufferB);
        kernel.setArg(2, bufferC);

Explanation (5)

enqueue*Buffer arguments:
    buffer
    cl_bool blocking_write
    ::size_t offset
    ::size_t size
    const void *ptr


Vector Addition (6)

        // Run the kernel on specific ND range
        cl::NDRange global(LIST_SIZE);
        cl::NDRange local(1);
        queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,
            global,
            local
        );

        // Read buffer C into a local list
        int *C = new int[LIST_SIZE];
        queue.enqueueReadBuffer(
            bufferC,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            C
        );

Vector Addition (7)

        for ( int i = 0; i < LIST_SIZE; i++ ) {
            std::cout << A[i] << " + " << B[i] << " = "
                      << C[i] << std::endl;
        }
    } catch ( cl::Error error ) {
        std::cout << error.what() << "(" << error.err()
                  << ")" << std::endl;
    }
    return 0;
}

This program just prints all the additions (each sum equalling 1000).

Other Improvements

The host memory is still unreleased.


With the same number of lines, we could use the C++11
unique_ptr, which would free the memory for us.
You can use a vector instead of an array,
and pass &v[0] where the raw pointer is expected.
Valid as long as the vector is not resized.

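The suggested cleanup can be sketched as follows: a std::vector owns the host buffer (so it frees itself, with no delete[]), and &v[0] or v.data() yields the raw pointer the enqueue calls expect. The helper name is invented for illustration; it mirrors the B array from the example above.

```cpp
#include <vector>

// Hedged sketch: build one of the example's input arrays with a
// std::vector instead of new int[LIST_SIZE].
std::vector<int> make_input_b(int list_size) {
    std::vector<int> B(list_size);          // replaces: int *B = new int[list_size];
    for (int i = 0; i < list_size; ++i)
        B[i] = list_size - i;
    // &B[0] (valid as long as B is not resized) can be passed where
    // the OpenCL calls take an int*; no delete[] is ever needed.
    return B;
}
```

The design choice: ownership lives with the vector, so an exception thrown by the C++ bindings cannot leak the host memory, which the raw new[] version would.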

OpenCL Programming Summary

Went through real OpenCL examples.


Have the reference card for the API.
Saw a C++ template for setting up OpenCL.
Aside: if you're serious about programming in C++, check
out Effective C++ by Scott Meyers (slightly dated with
C++11, but it still has some good stuff).


Overall summary

First Half: Brief overview of OpenCL and its programming


model.
Many concepts are similar to plain parallel programming
(more structure).
Second Half: Looked at an OpenCL implementation and
how to organize it.
Need to write lots of boilerplate!

