WRF-GPU DR Young-Tae+Kim

This document discusses accelerating the Weather Research and Forecasting (WRF) model using GPUs. It describes implementing key WRF physics routines on GPUs using CUDA Fortran. Performance tests showed a 3-4x speedup for physics routines on a Tesla GPU compared to a CPU. Overall WRF runs were accelerated by over 1.5x using the GPU implementation. While GPUs provide efficient acceleration, data transfer between the CPU and GPU remains a bottleneck, and translating code to GPUs requires non-trivial modifications.


Gangneung-Wonju National University

Youngtae Kim

Agenda
1 Background
1.1 The future of high performance computing
1.2 GP-GPU
1.3 CUDA Fortran

2 Implementation
2.1 Implementation of the WRF Fortran program
2.2 Execution profile of WRF physics
2.3 Implementation of parallel programs

3 Performance
3.1 Performance comparison
3.2 Performance of WRF

4 Conclusions

1 Background
1.1 The future of High Performance Computing
H. Meuer, Scientific Computing World: June/July 2009

A thousand-fold performance increase over an 11-year period:
1986: Gigaflops
1997: Teraflops
2008: Petaflops
2019: Exaflops
For the near future, we expect that the hardware architecture will be a combination of specialized CPU and GPU-type cores.

1 Background
GP-GPU performance

[Figure: FLOPS and memory bandwidth for the CPU and the GP-GPU]

*FLOPS: Floating-Point Operations per Second

1 Background
GP-GPU Acceleration of WRF WSM5

1 Background
1.2 GP-GPU (General-Purpose Graphics Processing Unit)
Originally designed for graphics processing
Grid of multiprocessors
Uses the PCI bus
Thread block (computes in parallel)

Grid (data domain)

1 Background
Caller: call function<<<dimGrid, dimBlock>>>()
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
[Figure: a grid of 2 x 3 thread blocks, each containing 3 x 3 threads. Every cell shows the global (i, j) index of its thread computed as above, ranging from (3*0+1, 3*0+1) to (3*1+3, 3*2+3); the blocks are labeled (1,1) through (2,3) and together form the grid.]
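A minimal CUDA Fortran sketch (not from the slides) of this launch and indexing pattern; the kernel name fill_kernel, the 6 x 9 domain, and the fill values are hypothetical:

module fill_mod
  use cudafor
contains
  attributes(global) subroutine fill_kernel(a, ni, nj)
    integer, value :: ni, nj
    real :: a(ni, nj)              ! kernel dummies reside in device memory
    integer :: i, j
    ! Global (i, j) index of this thread, exactly as on the slide
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i <= ni .and. j <= nj) a(i, j) = 10.0*i + j
  end subroutine fill_kernel
end module fill_mod

program indexing_demo
  use cudafor
  use fill_mod
  implicit none
  integer, parameter :: ni = 6, nj = 9
  real, allocatable, device :: a_d(:,:)
  real :: a(ni, nj)
  type(dim3) :: dimGrid, dimBlock
  allocate(a_d(ni, nj))
  dimBlock = dim3(3, 3, 1)         ! 3 x 3 threads per block, as in the figure
  dimGrid  = dim3(2, 3, 1)         ! 2 x 3 blocks cover the 6 x 9 domain
  call fill_kernel<<<dimGrid, dimBlock>>>(a_d, ni, nj)
  a = a_d                          ! device-to-host copy by assignment
  print *, a(1, 1), a(ni, nj)      ! expected: 11.0 and 69.0
  deallocate(a_d)
end program indexing_demo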

1 Background
1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and Nvidia (December 2009)
Supports CUDA
Uses Fortran 90 (95/03) syntax
Some limitations of CUDA Fortran:
  Does not support automatic arrays or module variables
  Does not support common or equivalence

CUDA (Compute Unified Device Architecture): the GP-GPU programming interface by Nvidia

2 Implementation
2.1 Implementation of the WRF Fortran program (v3.4)
Physics routines run on GP-GPUs:
Microphysics: WSM6 and WDM6
Boundary-layer physics: YSUPBL
Radiation physics: RRTMG_LW, RRTMG_SW
Surface-layer physics: SFCLAY

2 Implementation
2.2 Execution profile of WRF physics routines
[Figure: pie chart of the WRF execution profile. RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and other routines account for shares including 22.5%, 21.1%, 14.4%, and 2.1% of the run time.]

2 Implementation
2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (environment set-up file)
  Compatible with the original WRF program
  ARCH_LOCAL = -DRUN_ON_GPU
  (the GP-GPU code is compiled only if -DRUN_ON_GPU is defined)

Create a directory for the CUDA codes only: cuda
  GP-GPU source codes
  Exclusive Makefile
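As an illustration of this set-up, a hypothetical fragment of configure.wrf and of the exclusive Makefile in the cuda directory might look as follows; the object list, flags, and rule are assumptions, not taken from the slides:

# configure.wrf: enable the GP-GPU code path
ARCH_LOCAL      =       -DRUN_ON_GPU

# cuda/Makefile (assumed): build the CUDA Fortran modules with the PGI compiler
FC      = pgf90
FCFLAGS = -Mcuda -O2
OBJS    = module_ra_rrtmg_lw_gpu.o   # plus the other *_gpu modules

.SUFFIXES: .F .o
.F.o:
	$(FC) $(FCFLAGS) -c $<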

2 Implementation
2.3.2 Structure of the GP-GPU program

(Original code)
Initialize
Physics routine:
  do j = ...
    call 2d routine(.., j, ..)
  enddo
Time steps

(GPU code)
Initialize: dynamic allocation of GP-GPU variables
Physics routine:
  copy CPU variables to the GPU
  call 3d routine (GPU)
  copy GPU variables to the CPU
Time steps
Finalize: deallocation of GPU variables
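To make this structure concrete, here is a small self-contained CUDA Fortran toy with the same shape: allocate device data at initialization, copy in / launch / copy out at every time step, and deallocate at finalization. The module toy_gpu_mod, the kernel step_kernel, and the field f are illustrative only, not WRF code:

module toy_gpu_mod
  use cudafor
  implicit none
  real, allocatable, device :: f_d(:,:,:)
contains
  attributes(global) subroutine step_kernel(f, ni, nk, nj)
    integer, value :: ni, nk, nj
    real :: f(ni, nk, nj)
    integer :: i, j, k
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y
    if (i <= ni .and. j <= nj) then
      do k = 1, nk                    ! one thread handles a whole vertical column
        f(i, k, j) = f(i, k, j) + 1.0
      end do
    end if
  end subroutine step_kernel
end module toy_gpu_mod

program structure_demo
  use cudafor
  use toy_gpu_mod
  implicit none
  integer, parameter :: ni = 64, nk = 30, nj = 48, nsteps = 3
  real :: f(ni, nk, nj)
  type(dim3) :: dimGrid, dimBlock
  integer :: step

  f = 0.0
  allocate(f_d(ni, nk, nj))                     ! Initialize: allocate GPU variables
  dimBlock = dim3(32, 8, 1)
  dimGrid  = dim3((ni+31)/32, (nj+7)/8, 1)

  do step = 1, nsteps                           ! Time steps
    f_d = f                                     ! copy CPU variables to the GPU
    call step_kernel<<<dimGrid, dimBlock>>>(f_d, ni, nk, nj)
    f = f_d                                     ! copy GPU variables to the CPU
  end do

  deallocate(f_d)                               ! Finalize: deallocate GPU variables
  print *, f(1, 1, 1)                           ! expected: 3.0
end program structure_demo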

2 Implementation
2.3.3 Initialize & Finalize

phys/module_physics_init.F
#ifndef RUN_ON_GPU
  CALL rrtmg_lwinit( )
#else
  CALL rrtmg_lwinit_gpu()
#endif

main/module_wrf_top.F
#ifdef RUN_ON_GPU
  call rrtmg_lwfinalize_gpu()
#endif

cuda/module_ra_rrtmg_lw_gpu.F
Initialization of constants and allocation of GPU device variables:
subroutine rrtmg_lwinit_gpu(...)
  call rrtmg_lw_ini(cp)
  allocate(p8w_d(dime,djme), stat=istate)

Deallocation of GPU device variables:
subroutine rrtmg_lwfinalize_gpu(...)
  deallocate(p8w_d)
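A minimal sketch of how this init/finalize pair might be organized, assuming the device array is held as allocatable module data; the argument list of rrtmg_lwinit_gpu and the omission of the constant set-up (rrtmg_lw_ini) are simplifications, and the module name is hypothetical:

module module_ra_rrtmg_lw_gpu_sketch          ! hypothetical, condensed version
  use cudafor
  implicit none
  real, allocatable, device :: p8w_d(:,:)     ! device copy of the corresponding CPU field
contains
  subroutine rrtmg_lwinit_gpu(dime, djme)
    integer, intent(in) :: dime, djme
    integer :: istate
    allocate(p8w_d(dime, djme), stat=istate)  ! allocate GPU device memory once, at start-up
  end subroutine rrtmg_lwinit_gpu

  subroutine rrtmg_lwfinalize_gpu()
    if (allocated(p8w_d)) deallocate(p8w_d)   ! release GPU device memory at the end of the run
  end subroutine rrtmg_lwfinalize_gpu
end module module_ra_rrtmg_lw_gpu_sketch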

2 Implementation
2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
  USE module_ra_rrtmg_lw, only: rrtmg_lwrad
#else
  USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
  CALL RRTMG_LWRAD(...)
#else
  CALL RRTMG_LWRAD_GPU(...)
#endif

2 Implementation
2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal loops (the i- and j-loops) from the GPU (global) function

(Original code)
SUBROUTINE wdm62D(...)
  ...
  do k = kts, kte
    do i = its, ite
      cpm(i,k) = cpmcal(q(i,k))
      xl(i,k) = xlcal(t(i,k))
    enddo
  enddo
  do k = kts, kte
    do i = its, ite
      delz_tmp(i,k) = delz(i,k)
      den_tmp(i,k) = den(i,k)
    enddo
  enddo
  ...
END SUBROUTINE wdm62D

(GPU code)
attributes(global) subroutine wdm6_gpu_kernel(...)
  ...
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  if (((i.ge.its).and.(i.le.ite)).and. &
      ((j.ge.jts).and.(j.le.jte))) then
    do k = kts, kte
      cpm(i,k,j) = cpmcal(q(i,k,j))
      xl(i,k,j) = xlcal(t(i,k,j))
    enddo
    do k = kts, kte
      delz_tmp(i,k,j) = delz(i,k,j)
      den_tmp(i,k,j) = den(i,k,j)
    enddo
    ...
  endif
end subroutine wdm6_gpu_kernel
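The slides do not show how dimGrid and dimBlock are chosen for wdm6_gpu_kernel; a plausible host-side sketch, assuming one thread per (i, j) column and an illustrative 32 x 8 block shape:

! Hypothetical launch configuration covering the horizontal domain its:ite, jts:jte.
type(dim3) :: dimGrid, dimBlock

dimBlock = dim3(32, 8, 1)                               ! assumed block shape
dimGrid  = dim3((ite + dimBlock%x - 1)/dimBlock%x, &
                (jte + dimBlock%y - 1)/dimBlock%y, 1)   ! enough blocks to reach ite and jte
call wdm6_gpu_kernel<<<dimGrid, dimBlock>>>(...)        ! arguments elided, as on the slide
! Threads whose (i, j) falls outside its:ite, jts:jte are masked by the if-test in the kernel.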

2 Implementation
2.3.6 Memory allocation of arrays and copying of CPU data

(CPU code)
subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,j) = ...

(GPU code)
real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

rthratenlw_d = rthratenlw
emiss_d = emiss
call rrtmg_lwrad_gpu_kernel<<<dimGrid, dimBlock>>>(rthratenlw_d, emiss_d, ...)

attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,j) = ...
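In CUDA Fortran the assignments rthratenlw_d = rthratenlw and emiss_d = emiss perform host-to-device copies. The copy back to the CPU after the kernel is not shown on the slide; presumably it is the reverse assignment:

rthratenlw = rthratenlw_d      ! device-to-host copy by array assignment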

3 Performance
3.1 Performance comparison
System specification used for performance checking
CPU: Intel Xeon E5405 (2.0 GHz)
GPU: Tesla C1060 (1.3 GHz)
  Global memory: 4 GB
  Multiprocessors: 30
  Cores: 240
  Registers per block: 16384
  Max threads per block: 512

3 Performance
Performance of WRF physics routines

[Figure: bar chart comparing CPU and GPU run times for WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY; the vertical axis runs from 0 to 4000.]

3 Performance
Performance comparison of CUDA C and CUDA Fortran

[Figure: bar chart comparing CPU and GPU run times, in microseconds, for WSM5 and WSM6; the vertical axis runs from 0 to 300,000.]

3 Performance
3.2 Performance of WRF
[Figure: bar chart comparing the CPU and GPU run times of a full WRF run; the vertical axis runs from 0 to 18,000.]

4 Conclusions
Pros
  GP-GPUs can be used as efficient hardware accelerators.
  GP-GPUs are cheap and energy efficient.
Cons
  Communication between CPUs and GPUs is slow.
    Data transfer between the CPU and the GP-GPU is a bottleneck.
    Overlap of communication and computation is necessary.
  Translation into GP-GPU code is not trivial.
    Parameter-passing methods and local resources are limited.
    CUDA Fortran needs to be improved.
