
OPENMP

1. Turning up the clock speed results in an increase in power consumption.


2. The trend is toward many smaller, simpler processors.
3. (Feature size) As transistor size decreases, transistors run faster, consume less power, and
more of them fit on a chip.
4. (Clock frequency) Clock speed has also increased over time, but in the last few years it has
stagnated.
5. Processors are getting faster because we have more transistors available for computation,
not because we are clocking those transistors faster; clock speeds have been roughly constant
for the past few years.
6. Why don't we keep increasing clock speed?
It is not that we can't make transistors any smaller or push the clock speed any higher. The
problem is that doing so produces a lot of heat, and it is hard to cool the processors down.
So the main limiting factor today is power, not the number of transistors: smaller transistors
indeed take up less space and consume less power individually, but combining millions of them
produces a lot of heat. So instead of flooding a single processor with ever more transistors,
we are moving toward having more processors to run programs faster.
7. What kind of processors will we build? (Major design constraint: power)
CPU:
Complex control hardware
(+) Flexibility + performance
(-) Expensive in terms of power
GPU:
Simpler control hardware
(+) More HW for computation
(+) Potentially more power efficient (operations/watt)
(-) More restrictive programming model
8. Latency: time required to complete a task
9. Throughput: work done per unit of time
10. CPUs choose to optimise for latency, while GPUs choose to optimise for throughput
11. CORE GPU DESIGN TENETS:
Lots of simple compute units
Trade simple control for more compute
Explicitly parallel programming model
Optimize for throughput not latency
(GPUs are therefore most important for workloads where throughput is the key metric)
12. GPUs from the point of view of a software developer
-Importance of programming in parallel
Example: an 8-core Intel "Bridge"-class CPU
8-wide AVX vector operations/core
2 threads/core (Hyper-Threading)
= 8 × 8 × 2 = 128-way parallelism
13. For this kind of parallelism, computers are heterogeneous: they have two different
processors in them, (i) the CPU (the "HOST") and (ii) the GPU (the "DEVICE")
14. A plain sequential program runs only on the CPU; to utilize the GPU we use the CUDA
programming model, written in C with extensions
15. CUDA treats the GPU as a coprocessor to the CPU and assumes the two have separate
memories (each physically backed by its own DRAM)
16. The CPU is in charge: it tells the GPU what to do

Tasks involve:

Moving data from CPU to GPU (done with cudaMemcpy)

Copying data back from GPU to CPU (done with cudaMemcpy)

Allocating GPU memory (cudaMalloc)

Launching kernels on the GPU (the host launches kernels on the device)

17. The GPU can do the following:

Respond to a CPU request to SEND data: GPU -> CPU
Respond to a CPU request to RECEIVE data: CPU -> GPU
Compute a kernel launched by the CPU
((Advanced) GPUs can also launch their own kernels and copy data from the CPU)
18. A typical GPU program (see the sketch below this item):
The CPU allocates storage on the GPU (cudaMalloc)
The CPU copies input data from CPU -> GPU (cudaMemcpy)
The CPU launches kernel(s) on the GPU to process the data (kernel launch)
The CPU copies the results back from GPU -> CPU (cudaMemcpy)
[The program must have a high ratio of computation to communication. If communication is
high but the computation on the communicated data is low, parallelism fails, so we must aim
for a lot of computation per unit of data communicated]
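A minimal sketch assembling these steps into one complete program. It reuses the square kernel, the h_in/d_in/h_out/d_out names, ARRAY_SIZE = 64, and ARRAY_BYTES from items 21-24 below; the input values and the printing at the end are illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel: each thread squares one element. The body is written as if it
// runs on a single thread; the GPU runs one copy per launched thread.
__global__ void square(float *d_out, float *d_in) {
    int idx = threadIdx.x;      // this thread's index within its block
    float f = d_in[idx];
    d_out[idx] = f * f;
}

int main(void) {
    const int ARRAY_SIZE  = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // Generate the input array on the host (CPU).
    float h_in[ARRAY_SIZE], h_out[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) h_in[i] = (float)i;

    // 1. CPU allocates storage on the GPU.
    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in,  ARRAY_BYTES);
    cudaMalloc((void **)&d_out, ARRAY_BYTES);

    // 2. CPU copies input data CPU -> GPU.
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // 3. CPU launches the kernel on the GPU: 1 block of 64 threads.
    square<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // 4. CPU copies the results back GPU -> CPU.
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    // Print the results and free GPU memory.
    for (int i = 0; i < ARRAY_SIZE; i++) printf("%f\n", h_out[i]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}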
19. DEFINING THE GPU COMPUTATION:
Kernels look like serial programs
Write your program as if it will run on one thread
The GPU will run that program on MANY THREADS
20. What is the GPU good at?
Efficiently launching lots of threads
Running lots of threads in parallel
21. cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice)
22. cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost)
23. square<<<1, ARRAY_SIZE>>>(d_out, d_in); {launch the kernel named square on 1 block of 64
elements, one thread per element}
24. threadIdx.x for thread id
25. Configuring the kernel launch (see the sketch after this item):
kernel<<<GRID OF BLOCKS, BLOCK OF THREADS>>>(...)
dim3(w,1,1) == dim3(w) == w
square<<<1,64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
Each block can have a maximum of 512 or 1024 threads, depending on the GPU
square<<<dim3(bx,by,bz), dim3(tx,ty,tz), shmem>>>(...)
dim3(bx,by,bz) = grid of bx.by.bz blocks
dim3(tx,ty,tz) = block of tx.ty.tz threads
shmem = shared memory per block, in bytes
threadIdx: thread within the block (threadIdx.x, threadIdx.y, ...)
blockDim: size of a block (threads per block)
blockIdx: block within the grid
gridDim: size of the grid (blocks per grid)
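A minimal sketch of how a multi-block launch maps to a per-thread global index; the add_one kernel, the d_data name, and the sizes are illustrative assumptions, not from the notes.

// Each thread computes its global index from its block and its position in the block.
__global__ void add_one(float *d_data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)               // guard: the grid may launch a few spare threads
        d_data[idx] += 1.0f;
}

// Processing 1024 elements with 256 threads per block needs 4 blocks:
// add_one<<<dim3(4, 1, 1), dim3(256, 1, 1)>>>(d_data, 1024);
// which is equivalent to the shorthand:
// add_one<<<4, 256>>>(d_data, 1024);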
