OPENMP Notes
Summary:
1. GPUs are better suited than CPUs for tasks that can be processed in parallel and require high throughput rather than low latency. GPUs have simpler control hardware that allows for more computational units, making them more power efficient for parallel workloads.
2. Programming for GPUs requires an explicitly parallel programming model like CUDA and optimizing for throughput. Data must be copied between CPU and GPU memory, and kernels launched on the GPU to perform computation on the device.
3. The CPU acts as the host, launching kernels on the GPU device and managing data transfer between CPU and GPU memory via APIs like cudaMemcpy. Kernels define code to run identically on many parallel threads to leverage the GPU's parallel architecture.
OPENMP
1. Turning up the clock speed increases power consumption.
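As background for why this is so (a standard first-order CMOS result, not stated in these notes): dynamic power scales roughly as
$P_{\text{dynamic}} \approx \alpha \, C \, V^{2} \, f$
where $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Raising $f$ typically also requires raising $V$, so power grows faster than linearly with clock speed.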
2. Many smaller, simpler processors.
3. (Feature size) As transistor size decreases, transistors run faster, consume less power, and more of them fit on a chip.
4. (Clock frequency) Clock speed also increased over time, but in the last few years it has stagnated.
5. Processors are getting faster because we have more transistors available for computation, not because we are clocking those transistors faster; clock speeds have been roughly constant for the past few years.
6. Why don't we keep increasing the clock speed? It is not that we cannot make transistors smaller or clock them faster; the problem is heat. Smaller transistors individually take less space and consume less power, but combining millions of them produces a lot of heat, which is hard to dissipate. So the main design constraint today is power, and instead of flooding a single processor with ever more transistors, we are moving toward more processors to run programs faster.
7. What kind of processors will we build? (Major design constraint: power)
CPU: complex control hardware; flexibility and performance; expensive in terms of power.
GPU: simpler control hardware (+); more hardware for computation (+); potentially more power efficient, measured in operations/watt (+); more restrictive programming model (-).
8. Latency: time required to complete a task.
9. Throughput: work done per unit of time.
10. The CPU optimises for latency, while the GPU optimises for throughput.
11. CORE GPU DESIGN TENETS: lots of simple compute units; trade simple control for more compute; an explicitly parallel programming model; optimise for throughput, not latency. (GPUs are therefore most suitable for workloads where throughput is the important metric.)
12. GPUs from the point of view of a software developer: the importance of programming in parallel. Example: an 8-core Ivy Bridge (Intel) CPU with 8-wide AVX vector operations per core and 2 threads per core (HyperThreading) gives 8 x 8 x 2 = 128-way parallelism.
13. Computers are heterogeneous for this task of parallelism: they have two different processors in them, (i) the CPU (the "HOST") and (ii) the GPU (the "DEVICE").
14. A plain sequential program will only run on the CPU; to utilise the GPU we use the CUDA programming model, written in C with extensions.
15. CUDA treats the GPU as a coprocessor to the CPU and assumes each has its own separate memory (physically allocated to both in the form of DRAM).
16. The CPU is in charge: it tells the GPU what to do.
Tasks involve (a code sketch follows this list):
Moving data from CPU to GPU (done via cudaMemcpy)
Copying data back from GPU to CPU (done via cudaMemcpy)
Allocating GPU memory (cudaMalloc)
Launching kernels on the GPU (the host launches kernels on the device)
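A minimal sketch of these host-side calls in CUDA C (ARRAY_SIZE and the square kernel follow the example later in these notes; error checking omitted):

const int ARRAY_SIZE  = 64;                        // hypothetical array size
const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

float h_in[ARRAY_SIZE], h_out[ARRAY_SIZE];         // host (CPU) arrays, h_ = host
float *d_in, *d_out;                               // device (GPU) pointers, d_ = device

cudaMalloc((void **) &d_in,  ARRAY_BYTES);         // allocate GPU memory
cudaMalloc((void **) &d_out, ARRAY_BYTES);

cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);   // move data CPU -> GPU

square<<<1, ARRAY_SIZE>>>(d_out, d_in);            // host launches kernel on device

cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost); // copy data back GPU -> CPU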
17. The GPU can do the following:
Respond to a CPU request to SEND data (GPU -> CPU)
Respond to a CPU request to RECEIVE data (CPU -> GPU)
Compute a kernel launched by the CPU
((Advanced) GPUs can also launch their own kernels and copy data from the CPU)
18. A typical GPU program:
CPU allocates storage on the GPU (cudaMalloc)
CPU copies input data from CPU -> GPU (cudaMemcpy)
CPU launches kernel(s) on the GPU to process the data (kernel launch)
CPU copies results back from GPU -> CPU (cudaMemcpy)
[The program must have a high ratio of computation to communication. If communication is high but the computation on that communicated data is low, parallelism fails, so we must focus on doing a lot of computation per unit of data communicated.]
19. DEFINING THE GPU COMPUTATION: Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on MANY THREADS.
20. What is the GPU good at? Efficiently launching lots of threads, and running lots of threads in parallel.
21. cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
22. cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);
23. square<<<1, ARRAY_SIZE>>>(d_out, d_in); // launch the kernel named square on 1 block of 64 elements
24. threadIdx.x gives a thread's index within its block.
25. Configuring the kernel launch:
kernel<<<GRID OF BLOCKS, BLOCK OF THREADS>>>(...)
dim3(w, 1, 1) == dim3(w) == w
square<<<1, 64>>> == square<<<dim3(1,1,1), dim3(64,1,1)>>>
Each block can have a maximum of 512 or 1024 threads, depending on the GPU.
square<<<dim3(bx,by,bz), dim3(tx,ty,tz), shmem>>>(...)
dim3(bx,by,bz) = grid of bx * by * bz blocks
dim3(tx,ty,tz) = blocks of tx * ty * tz threads each
shmem = shared memory per block, in bytes
threadIdx: thread index within its block (threadIdx.x, threadIdx.y, threadIdx.z)
blockDim: size of a block (threads per block)
blockIdx: block index within the grid
gridDim: size of the grid (blocks per grid)
A complete example putting these pieces together follows below.
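Putting items 18-25 together, a minimal complete sketch of the square program these notes quote (the exact source program may differ slightly):

#include <stdio.h>

// Kernel: written as if it runs on one thread; the GPU runs it on many threads.
__global__ void square(float *d_out, float *d_in) {
    int idx = threadIdx.x;            // this thread's index within its block
    float f = d_in[idx];
    d_out[idx] = f * f;
}

int main(void) {
    const int ARRAY_SIZE  = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // Generate the input array on the host.
    float h_in[ARRAY_SIZE], h_out[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) h_in[i] = (float) i;

    // 1. CPU allocates storage on the GPU.
    float *d_in, *d_out;
    cudaMalloc((void **) &d_in,  ARRAY_BYTES);
    cudaMalloc((void **) &d_out, ARRAY_BYTES);

    // 2. CPU copies input data CPU -> GPU.
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);

    // 3. CPU launches the kernel: 1 block of 64 threads,
    //    equivalent to square<<<dim3(1,1,1), dim3(64,1,1)>>>(d_out, d_in).
    square<<<1, ARRAY_SIZE>>>(d_out, d_in);

    // 4. CPU copies results back GPU -> CPU.
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    for (int i = 0; i < ARRAY_SIZE; i++)
        printf("%f\n", h_out[i]);

    cudaFree(d_in);                   // free GPU memory
    cudaFree(d_out);
    return 0;
}

With more than one block, each thread's global index would be computed from the built-ins in item 25 as blockIdx.x * blockDim.x + threadIdx.x.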