Lecture 19: OpenCL
ECE 459: Programming for Performance
Part I
OpenCL concepts
Introduction
SIMT
PlayStation 3 Cell
CUDA
CUDA Overview
Data Parallelism
Key idea: evaluate a function (or kernel)
over a set of points (data).
Work-Items
One thread per work item, each with a different thread ID.
You can say how to divide the ND-Range into work-groups,
or the system can do it for you.
The scheduler assigns work-items to warps/wavefronts
until none are left.
Shared Memory
Example Kernel
Here's some traditional code to evaluate C_i = A_i * B_i:

void traditional_mul(int n,
                     const float *a,
                     const float *b,
                     float *c) {
    int i;
    for (i = 0; i < n; i++) c[i] = a[i] * b[i];
}
And as a kernel:

kernel void opencl_mul(global const float *a,
                       global const float *b,
                       global float *c) {
    int id = get_global_id(0);  // dimension 0
    c[id] = a[id] * b[id];
}
Branches in kernels
kernel void contains_branch(global float *a,
                            global float *b) {
    int id = get_global_id(0);
    if (a[id] < b[id]) {   /* some per-work-item condition */
        a[id] += 5.0;
    } else {
        b[id] += 5.0;
    }
}
The hardware will execute all branches that any thread in a warp
executes, which can be slow!
In other words: an if statement will cause each thread to execute
both branches; we keep only the result of the taken branch.
Loops in kernels
kernel void contains_loop(global float *a,
                          global float *b) {
    int id = get_global_id(0);
    for (int i = 0; i < id; i++) {
        b[i] += a[i];
    }
}
A loop will cause the workgroup to wait for the maximum number
of iterations of the loop in any work-item.
Note: when you set up work-groups, it's best to arrange for all
work-items in a work-group to execute the same branches & loops.
Synchronization
Part II
Programming with OpenCL
Introduction
Today, we'll see how to program with OpenCL.
We're using OpenCL 1.1.
There is a lot of initialization and querying.
When you compile your program, include -lOpenCL.
First, reminders
Part III
Simple Example
Note by PL: don't use this example as a template;
it uses the C bindings! Instead, use the C++ bindings.
source: pages 1-9 through 1-11,
http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf
#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )           \n"
"{                                                    \n"
"    dst[get_global_id(0)] = get_global_id(0);        \n"
"}                                                    \n";

int main(int argc, char **argv)
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );
Explanation (1)
    // 2. Find a GPU device.
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU,
                    1,
                    &device,
                    NULL );

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL,
                                          1,
                                          &device,
                                          NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context,
                                                   device,
                                                   0, NULL );
Explanation (2)
Explanation (3)
We create an OpenCL program (runs on the compute unit):
kernels;
functions; and
declarations.
In this case, we create a kernel called memset from source.
OpenCL may also create programs from binaries
(may be in intermediate representation).
Next, we need a data buffer (enables inter-device communication).
This program does not have any input,
so we don't put anything into the buffer (just declare its size).
Explanation (4)
We set the kernel's argument to the buffer.
We launch the kernel, enqueuing the 1-dimensional
index space starting at 0.
We specify that the index space has NWITEMS elements,
and we leave the division into work-groups up to the system.
There is also an event interface, which we do not use.
We copy the results back; the call is blocking (CL_TRUE),
hence we don't need an explicit clFinish() call.
We specify that we want to read the results back into
buffer.
    int i;
    for (i = 0; i < NWITEMS; i++)
        printf("%d %d\n", i, ptr[i]);

    return 0;
}
Part IV
Another Example
C++ Bindings
Let's write a kernel that adds two vectors and stores the result.
This kernel will go in the file vector_add_kernel.cl.

__kernel void vector_add(__global const int *A,
                         __global const int *B,
                         __global int *C) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}
#include <CL/cl.hpp>   // C++ bindings (needed by this example)
#include <iostream>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Create the two input vectors
    const int LIST_SIZE = 1000;
    int *A = new int[LIST_SIZE];
    int *B = new int[LIST_SIZE];
    for (int i = 0; i < LIST_SIZE; i++) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }
Explanation (2)
    // Build the program for these devices
    program.build(devices);

    // Make kernel
    cl::Kernel kernel(program, "vector_add");
    // Set arguments to kernel
    kernel.setArg(0, bufferA);
    kernel.setArg(1, bufferB);
    kernel.setArg(2, bufferC);
Explanation (5)
enqueue*Buffer arguments:
buffer
cl_bool blocking_write
::size_t offset
::size_t size
const void *ptr
    return 0;
}
Other Improvements
Overall summary