This document provides an overview of OpenCL concepts and programming using a simple example. It discusses OpenCL concepts like work-items, work-groups, and memory types. It then demonstrates how to program with OpenCL by initializing the platform and devices, compiling a kernel, setting arguments, launching the kernel, and mapping memory to access the results. The example kernel simply writes each work-item's ID to the corresponding element in the output buffer.

Lecture 19: OpenCL

ECE 459: Programming for Performance

March 18, 2014

Last Time: Compiler Optimizations

Compiler reads your program


and emits one just like it, but faster.
Also: profile-guided optimizations.


Part I
OpenCL concepts


Introduction

OpenCL: coding on a heterogeneous architecture.


No longer just programming the CPU;
will also leverage the GPU.
OpenCL = Open Computing Language.
Usable on both NVIDIA and AMD GPUs.


SIMT

Another term you may see vendors using:


Single Instruction, Multiple Threads.
Runs on a vector of data.
Similar to SIMD instructions (e.g. SSE).
However, the vector is spread out over the GPU.


Other Heterogeneous Programming Examples

PlayStation 3 Cell
CUDA

[PS4: back to a regular CPU/GPU system,


albeit on one chip.]


(PS3) Cell Overview


Cell consists of:
a PowerPC core; and
8 SIMD co-processors.

(from the Linux Cell documentation)



CUDA Overview

Compute Unified Device Architecture:


NVIDIA's architecture for processing on GPUs.
C for CUDA predates OpenCL;
NVIDIA supports it first and foremost.
May be faster than OpenCL on NVIDIA hardware.
The API allows you to use (most) C++ features in CUDA;
OpenCL has more restrictions.


GPU Programming Model

The abstract model is simple:


Write the code for the parallel computation (kernel)
separately from main code.
Transfer the data to the GPU co-processor
(or execute it on the CPU).
Wait . . .
Transfer the results back.

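The four steps above can be sketched in ordinary C++, with the device "transfer" played by a plain copy and a doubling loop standing in for the kernel. The function name and the kernel are invented for illustration; none of this is OpenCL API.

```cpp
#include <vector>

// Hedged sketch of the abstract GPU model, using the CPU as a stand-in.
// run_kernel_on_device is an invented placeholder, not an OpenCL call.
std::vector<int> run_kernel_on_device(const std::vector<int>& host_data) {
    // 1. Transfer the data to the co-processor (here: just a copy).
    std::vector<int> device_buffer = host_data;
    // 2. The kernel runs over every element (conceptually in parallel).
    for (int& x : device_buffer) x *= 2;
    // 3. Wait for completion (implicit here), then
    // 4. transfer the results back.
    return device_buffer;
}
```

The real program structure later in the lecture follows exactly this shape, with clEnqueueWriteBuffer / clEnqueueNDRangeKernel / clEnqueueReadBuffer in place of the copy, loop, and return.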

Data Parallelism
Key idea: evaluate a function (or kernel)
over a set of points (data).

Another example of data parallelism.


Another name for the set of points: index space.
Each point corresponds to a work-item.

Note: OpenCL also supports task parallelism (using


different kernels), but documentation is sparse.

Work-Items

Work-item: the fundamental unit of work in OpenCL.


Stored in an n-dimensional grid (ND-Range); the lecture's figure shows a 2-D example.
OpenCL spawns a bunch of threads to handle work-items.
When executing, the range is divided into work-groups,
which execute on the same compute unit.
The group of work-items that executes in lockstep on a compute unit
is called something different depending on the manufacturer:
NVIDIA: warp
AMD/ATI: wavefront


Work-Items: Three more details

One thread per work item, each with a different thread ID.
You can say how to divide the ND-Range into work-groups,
or the system can do it for you.
The scheduler assigns work-items to warps/wavefronts
until none are left.

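The relation between thread IDs, work-groups, and the ND-Range can be sketched with plain arithmetic. This is not the OpenCL API; get_global_id(0) effectively computes this for the running work-item, and the sizes in the usage below (512 items, groups of 64) are invented for illustration.

```cpp
// Hedged sketch: a work-item's global ID in dimension 0 is its
// work-group's index times the work-group (local) size, plus its
// local ID within the group.
int global_id(int group_id, int local_size, int local_id) {
    return group_id * local_size + local_id;
}
```

For example, with 512 work-items split into work-groups of 64, the last work-item of the last group (group 7, local ID 63) has global ID 511.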

Shared Memory

There are many different types of memory available to you:


private memory: available to a single work-item;
local memory (aka shared memory): shared between
work-items belonging to the same work-group;
like a user-managed cache;
global memory: shared between all work-items
as well as the host;
constant memory: resides on the GPU and is cached;
it does not change.
There is also host memory (normal memory);
usually contains app data.


Example Kernel
Here's some traditional code to evaluate C_i = A_i * B_i:

void traditional_mul( int n,
                      const float *a,
                      const float *b,
                      float *c ) {
    int i;
    for ( i = 0; i < n; i++ ) c[i] = a[i] * b[i];
}

And as a kernel:

__kernel void opencl_mul( __global const float *a,
                          __global const float *b,
                          __global float *c ) {
    int id = get_global_id(0);    // dimension 0
    c[id] = a[id] * b[id];
}

Restrictions when writing kernels in OpenCL

It's mostly C, but:


No function pointers.
No bit-fields.
No variable-length arrays.
No recursion.
No standard headers.


OpenCL's Additions to C in Kernels

In kernels, you can also use:


Work-items.
Work-groups.
Vectors.
Synchronization.
Declarations of memory type.
Kernel-specific library.


Branches in kernels
kernel void contains_branch( global float *a,
                             global float *b ) {
    int id = get_global_id(0);
    if ( cond ) {    // some condition on the data
        a[id] += 5.0;
    } else {
        b[id] += 5.0;
    }
}

The hardware will execute all branches that any thread in a warp
executes; this can be slow!
In other words: an if statement will cause each thread to execute
both branches; we keep only the result of the taken branch.
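A minimal cost model of that lockstep behaviour, with invented unit costs: the warp pays for every path that any of its threads takes.

```cpp
// Hedged sketch: a warp executes in lockstep, so if any thread takes
// the 'then' branch and any thread takes the 'else' branch, the whole
// warp steps through both. Costs are invented unit times, not
// measured numbers.
int warp_branch_cost(bool any_thread_takes_then, bool any_thread_takes_else,
                     int then_cost, int else_cost) {
    int total = 0;
    if (any_thread_takes_then) total += then_cost;  // warp steps through 'then'
    if (any_thread_takes_else) total += else_cost;  // ...and through 'else'
    return total;
}
```

If all threads in the warp agree on the branch, the cost is that of one path; if they diverge, the costs add.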

Loops in kernels
kernel void contains_loop( global float *a,
                           global float *b ) {
    int id = get_global_id(0);
    for ( int i = 0; i < id; i++ ) {
        b[i] += a[i];
    }
}

A loop will cause the work-group to wait for the maximum number
of iterations of the loop in any work-item.
Note: when you set up work-groups, it is best to arrange for all
work-items in a work-group to execute the same branches & loops.
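The waiting rule can be sketched the same way: the wavefront/warp finishes only when its slowest work-item does. The per-iteration time is an invented unit, not a measurement.

```cpp
#include <vector>
#include <algorithm>

// Hedged model: total loop time for a wavefront/warp is the maximum
// iteration count over its work-items, times the time per iteration.
int warp_loop_time(const std::vector<int>& iterations_per_item,
                   int t_per_iteration) {
    int max_iters = *std::max_element(iterations_per_item.begin(),
                                      iterations_per_item.end());
    return max_iters * t_per_iteration;
}
```

So in the kernel above, where work-item id loops id times, one work-item with id = 100 makes the whole wavefront/warp take 100 iterations' worth of time, even if every other item finishes after a handful.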

Synchronization

Different workgroups execute independently.


You can only put barriers and memory fences between
work-items in the same workgroup.
OpenCL supports:
Memory fences (load and store).
Barriers.
volatile (beware!)


Part II
Programming with OpenCL


Introduction
Today, we'll see how to program with OpenCL.
We're using OpenCL 1.1.
There is a lot of initialization and querying.
When you compile your program, include -lOpenCL.

You can find the official documentation here:


http://www.khronos.org/opencl/
More specifically:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

Let's just dive into an example.


First, reminders

All data belongs to an NDRange.


The range can be divided into work-groups. (in software)
The work-groups run on wavefronts/warps. (in hardware)
Each wavefront/warp executes work-items.

All branches in a wavefront/warp should execute the same path.


If an iteration of a loop takes t:
when one work-item executes 100 iterations,
the total time to complete the wavefront/warp is 100t.


Part III
Simple Example


Simple Example (1)

// Note by PL: don't use this example as a template;
// it uses the C bindings! Instead, use the C++ bindings.
// source: pages 1-9 through 1-11,
// http://developer.amd.com/wordpress/media/2013/07/
// AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf

#include <CL/cl.h>
#include <stdio.h>

#define NWITEMS 512

// A simple memset kernel
const char *source =
"__kernel void memset( __global uint *dst )      \n"
"{                                               \n"
"    dst[ get_global_id(0) ] = get_global_id(0); \n"
"}                                               \n";

int main( int argc, char **argv )
{
    // 1. Get a platform.
    cl_platform_id platform;
    clGetPlatformIDs( 1, &platform, NULL );

Explanation (1)

Include the OpenCL header.


Request a platform (also known as a host).
A platform contains compute devices:
GPUs or CPUs.


Simple Example (2)

    // 2. Find a GPU device.
    cl_device_id device;
    clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU,
                    1,
                    &device,
                    NULL );

    // 3. Create a context and command queue on that device.
    cl_context context = clCreateContext( NULL,
                                          1,
                                          &device,
                                          NULL, NULL, NULL );
    cl_command_queue queue = clCreateCommandQueue( context,
                                                   device,
                                                   0, NULL );

Explanation (2)

Request a GPU device.


Request an OpenCL context (representing all of OpenCL's state).
Create a command-queue:
get OpenCL to do work by telling it to run a kernel in a queue.


Simple Example (3)

    // 4. Perform runtime source compilation, and obtain
    //    kernel entry point.
    cl_program program = clCreateProgramWithSource( context,
                                                    1,
                                                    &source,
                                                    NULL,
                                                    NULL );
    clBuildProgram( program, 1, &device, NULL, NULL, NULL );
    cl_kernel kernel = clCreateKernel( program, "memset",
                                       NULL );

    // 5. Create a data buffer.
    cl_mem buffer = clCreateBuffer( context,
                                    CL_MEM_WRITE_ONLY,
                                    NWITEMS * sizeof(cl_uint),
                                    NULL, NULL );

Explanation (3)
We create an OpenCL program (runs on the compute unit):
kernels;
functions; and
declarations.
In this case, we create a kernel called memset from source.
OpenCL may also create programs from binaries
(may be in intermediate representation).
Next, we need a data buffer (enables inter-device communication).
This program does not have any input,
so we don't put anything into the buffer (just declare its size).


Simple Example (4)

    // 6. Launch the kernel. Let OpenCL pick the local work
    //    size.
    size_t global_work_size = NWITEMS;
    clSetKernelArg( kernel, 0, sizeof(buffer), (void *) &buffer );
    clEnqueueNDRangeKernel( queue,
                            kernel,
                            1,                  // dimensions
                            NULL,               // initial offsets
                            &global_work_size,  // number of work-items
                            NULL,               // work-items per work-group
                            0, NULL, NULL );    // events
    clFinish( queue );

    // 7. Look at the results via synchronous buffer map.
    cl_uint *ptr;
    ptr = (cl_uint *) clEnqueueMapBuffer( queue, buffer,
                                          CL_TRUE, CL_MAP_READ,
                                          0, NWITEMS * sizeof(cl_uint),
                                          0, NULL, NULL, NULL );

Explanation (4)
Set kernel arguments to buffer.
We launch the kernel, enqueuing the 1-dimensional
index space starting at 0.
We specify that the index space has NWITEMS elements;
and not to subdivide the program into work-groups.
There is also an event interface, which we do not use.
We copy the results back; the call is blocking (CL_TRUE),
hence we don't need an explicit clFinish() call.
We specify that we want to read the results back into
buffer.


Simple Example (5)

    int i;
    for ( i = 0; i < NWITEMS; i++ )
        printf( "%d %d\n", i, ptr[i] );
    return 0;
}

The program simply prints 0 0, 1 1, . . . , 511 511.


Note: I didn't clean up or include error handling
for any of the OpenCL functions.


Part IV
Another Example


C++ Bindings

If we use the C++ bindings, we'll get automatic
resource release and exceptions.
C++ likes to use the RAII style
(resource acquisition is initialization).
Change the header to CL/cl.hpp and define
__CL_ENABLE_EXCEPTIONS.
We'd also like to store our kernel in a file instead of a string.
The C API is not so nice to work with.


Vector Addition Kernel

Let's write a kernel that adds two vectors and stores the result.
This kernel will go in the file vector_add_kernel.cl.

__kernel void vector_add( __global const int *A,
                          __global const int *B,
                          __global int *C ) {
    // Get the index of the current element to be processed
    int i = get_global_id(0);

    // Do the operation
    C[i] = A[i] + B[i];
}

Other possible qualifiers: local, constant and private.

Vector Addition (1)

// Vector add example, C++ bindings (use these!)
// source:
// http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Create the two input vectors
    const int LIST_SIZE = 1000;
    int *A = new int[LIST_SIZE];
    int *B = new int[LIST_SIZE];
    for ( int i = 0; i < LIST_SIZE; i++ ) {
        A[i] = i;
        B[i] = LIST_SIZE - i;
    }

Vector Addition (2)

    try {
        // Get available platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);

        // Select the default platform and create a context
        // using this platform and the GPU
        cl_context_properties cps[3] = {
            CL_CONTEXT_PLATFORM,
            (cl_context_properties)(platforms[0])(),
            0
        };
        cl::Context context(CL_DEVICE_TYPE_GPU, cps);

        // Get a list of devices on this platform
        std::vector<cl::Device> devices =
            context.getInfo<CL_CONTEXT_DEVICES>();

        // Create a command queue and use the first device
        cl::CommandQueue queue = cl::CommandQueue(context,
                                                  devices[0]);

Explanation (2)

You can define __NO_STD_VECTOR and use cl::vector
(same with strings).
You can enable profiling by adding
CL_QUEUE_PROFILING_ENABLE as the third argument to the
queue constructor.


Vector Addition (3)

        // Read source file
        std::ifstream sourceFile("vector_add_kernel.cl");
        std::string sourceCode(
            std::istreambuf_iterator<char>(sourceFile),
            (std::istreambuf_iterator<char>())
        );
        cl::Program::Sources source(
            1,
            std::make_pair(sourceCode.c_str(),
                           sourceCode.length() + 1)
        );

        // Make program of the source code in the context
        cl::Program program = cl::Program(context, source);

        // Build program for these specific devices
        program.build(devices);

        // Make kernel
        cl::Kernel kernel(program, "vector_add");

Vector Addition (4)

        // Create memory buffers
        cl::Buffer bufferA = cl::Buffer(
            context,
            CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int)
        );
        cl::Buffer bufferB = cl::Buffer(
            context,
            CL_MEM_READ_ONLY,
            LIST_SIZE * sizeof(int)
        );
        cl::Buffer bufferC = cl::Buffer(
            context,
            CL_MEM_WRITE_ONLY,
            LIST_SIZE * sizeof(int)
        );

Vector Addition (5)

        // Copy lists A and B to the memory buffers
        queue.enqueueWriteBuffer(
            bufferA,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            A
        );
        queue.enqueueWriteBuffer(
            bufferB,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            B
        );

        // Set arguments to kernel
        kernel.setArg(0, bufferA);
        kernel.setArg(1, bufferB);
        kernel.setArg(2, bufferC);

Explanation (5)

enqueue*Buffer arguments:
    buffer
    cl_bool blocking_write
    ::size_t offset
    ::size_t size
    const void *ptr


Vector Addition (6)

        // Run the kernel on specific ND range
        cl::NDRange global(LIST_SIZE);
        cl::NDRange local(1);
        queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,
            global,
            local
        );

        // Read buffer C into a local list
        int *C = new int[LIST_SIZE];
        queue.enqueueReadBuffer(
            bufferC,
            CL_TRUE,
            0,
            LIST_SIZE * sizeof(int),
            C
        );

Vector Addition (7)

        for ( int i = 0; i < LIST_SIZE; i++ ) {
            std::cout << A[i] << " + " << B[i] << " = "
                      << C[i] << std::endl;
        }
    } catch ( cl::Error error ) {
        std::cout << error.what() << "(" << error.err()
                  << ")" << std::endl;
    }
    return 0;
}

This program just prints all the additions (each sum equalling 1000).

Other Improvements

The host memory is still unreleased.


With the same number of lines, we could use the C++11
unique_ptr, which would free the memory for us.
You can use a vector instead of an array,
and pass &v[0] where the raw pointer is expected.
Valid as long as the vector is not resized.

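The suggested cleanup can be sketched as follows: a std::vector owns the host buffer (so it frees itself, with no delete[]), and &v[0] or v.data() yields the raw pointer the enqueue calls expect. The helper name is invented for illustration; it mirrors the B array from the example above.

```cpp
#include <vector>

// Hedged sketch: build one of the example's input arrays with a
// std::vector instead of new int[LIST_SIZE].
std::vector<int> make_input_b(int list_size) {
    std::vector<int> B(list_size);          // replaces: int *B = new int[list_size];
    for (int i = 0; i < list_size; ++i)
        B[i] = list_size - i;
    // &B[0] (valid as long as B is not resized) can be passed where
    // the OpenCL calls take an int*; no delete[] is ever needed.
    return B;
}
```

The design choice: ownership lives with the vector, so an exception thrown by the C++ bindings cannot leak the host memory, which the raw new[] version would.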

OpenCL Programming Summary

Went through real OpenCL examples.


Have the reference card for the API.
Saw a C++ template for setting up OpenCL.
Aside: if you're serious about programming in C++, check
out Effective C++ by Scott Meyers (slightly dated with
C++11, but it still has some good stuff).


Overall summary

First Half: Brief overview of OpenCL and its programming


model.
Many concepts are similar to plain parallel programming
(more structure).
Second Half: Looked at an OpenCL implementation and
how to organize it.
Need to write lots of boilerplate!

