WRF-GPU DR Young-Tae+Kim

This document discusses accelerating the Weather Research and Forecasting (WRF) model using GPUs. It describes implementing key WRF physics routines on GPUs using CUDA Fortran. Performance tests showed a 3-4x speedup for physics routines on a Tesla GPU compared to a CPU. Overall WRF runs were accelerated by over 1.5x using the GPU implementation. While GPUs provide efficient acceleration, data transfer between the CPU and GPU remains a bottleneck, and translating code to GPUs requires non-trivial modifications.


Gangneung-Wonju National University

Youngtae Kim

Agenda
1 Background
1.1 The future of high performance computing
1.2 GP-GPU
1.3 CUDA Fortran

2 Implementation
2.1 Implementation of the WRF Fortran program
2.2 Execution profile of WRF physics
2.3 Implementation of parallel programs

3 Performance
3.1 Performance comparison
3.2 Performance of WRF

4 Conclusions

1 Background
1.1 The future of High Performance Computing
H. Meuer, Scientific Computing World: June/July 2009

A thousand-fold performance increase over an 11-year period:
1986: Gigaflops
1997: Teraflops
2008: Petaflops
2019: Exaflops
For the near future, we expect that the hardware architecture will be a combination of specialized CPU and GPU-type cores.

1 Background
GP-GPU performance

[Figure: FLOPS and memory bandwidth for the CPU and the GP-GPU]

*FLOPS: Floating-Point Operations per Second

1 Background
GP-GPU Acceleration of WRF WSM5

1 Background
1.2 GP-GPU (General-Purpose Graphics Processing Unit)
Originally designed for graphics processing
Grid of multiprocessors
Uses the PCI bus
Thread block (computes in parallel)

Grid (data domain)

1 Background
Caller: call function<<<dimGrid, dimBlock>>>()
Callee: i = blockDim%x*(blockIdx%x-1) + threadIdx%x
        j = blockDim%y*(blockIdx%y-1) + threadIdx%y
[Figure: a grid of 2 x 3 thread blocks, each containing 3 x 3 threads. Every cell shows the global (i, j) index of its thread computed as above, ranging from (3*0+1, 3*0+1) to (3*1+3, 3*2+3); the blocks are labeled (1,1) through (2,3) and together form the grid.]
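A minimal CUDA Fortran sketch (not from the slides) of this launch and indexing pattern; the kernel name fill_kernel, the 6 x 9 domain, and the fill values are hypothetical:

module fill_mod
  use cudafor
contains
  attributes(global) subroutine fill_kernel(a, ni, nj)
    integer, value :: ni, nj
    real :: a(ni, nj)              ! kernel dummies reside in device memory
    integer :: i, j
    ! Global (i, j) index of this thread, exactly as on the slide
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    j = blockDim%y*(blockIdx%y-1) + threadIdx%y
    if (i <= ni .and. j <= nj) a(i, j) = 10.0*i + j
  end subroutine fill_kernel
end module fill_mod

program indexing_demo
  use cudafor
  use fill_mod
  implicit none
  integer, parameter :: ni = 6, nj = 9
  real, allocatable, device :: a_d(:,:)
  real :: a(ni, nj)
  type(dim3) :: dimGrid, dimBlock
  allocate(a_d(ni, nj))
  dimBlock = dim3(3, 3, 1)         ! 3 x 3 threads per block, as in the figure
  dimGrid  = dim3(2, 3, 1)         ! 2 x 3 blocks cover the 6 x 9 domain
  call fill_kernel<<<dimGrid, dimBlock>>>(a_d, ni, nj)
  a = a_d                          ! device-to-host copy by assignment
  print *, a(1, 1), a(ni, nj)      ! expected: 11.0 and 69.0
  deallocate(a_d)
end program indexing_demo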

1 Background
1.3 CUDA Fortran (PGI Fortran version 10)
Developed by The Portland Group (PGI) and Nvidia (December 2009)
Supports CUDA
Uses Fortran 90 (95/03) syntax
Some limitations of CUDA Fortran:
  Does not support automatic arrays or module variables
  Does not support common or equivalence

CUDA (Compute Unified Device Architecture): the GP-GPU programming interface by Nvidia

2 Implementation
2.1 Implementation of the WRF Fortran program (v3.4)
Physics routines run on GP-GPUs:
Microphysics: WSM6 and WDM6
Boundary-layer physics: YSUPBL
Radiation physics: RRTMG_LW, RRTMG_SW
Surface-layer physics: SFCLAY

2 Implementation
2.2 Execution profile of WRF physics routines
[Figure: pie chart of the WRF execution profile. RRTMG_LW, RRTMG_SW, WDM6, YSUPBL, and other routines account for shares including 22.5%, 21.1%, 14.4%, and 2.1% of the run time.]

2 Implementation
2.3 Implementation of parallel programs
2.3.1 Running environment
Modification of configure.wrf (environment set-up file)
  Compatible with the original WRF program
  ARCH_LOCAL = -DRUN_ON_GPU
  (the GP-GPU code is compiled only if -DRUN_ON_GPU is defined)

Create a directory for the CUDA codes only: cuda
  GP-GPU source codes
  Exclusive Makefile
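As an illustration of this set-up, a hypothetical fragment of configure.wrf and of the exclusive Makefile in the cuda directory might look as follows; the object list, flags, and rule are assumptions, not taken from the slides:

# configure.wrf: enable the GP-GPU code path
ARCH_LOCAL      =       -DRUN_ON_GPU

# cuda/Makefile (assumed): build the CUDA Fortran modules with the PGI compiler
FC      = pgf90
FCFLAGS = -Mcuda -O2
OBJS    = module_ra_rrtmg_lw_gpu.o   # plus the other *_gpu modules

.SUFFIXES: .F .o
.F.o:
	$(FC) $(FCFLAGS) -c $<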

2 Implementation
2.3.2 Structure of the GP-GPU program

(Original code)
Initialize
Physics routine:
  do j = ...
    call 2d routine(.., j, ..)
  enddo
Time steps

(GPU code)
Initialize: dynamic allocation of GP-GPU variables
Physics routine:
  copy CPU variables to the GPU
  call 3d routine (GPU)
  copy GPU variables to the CPU
Time steps
Finalize: deallocation of GPU variables
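To make this structure concrete, here is a small self-contained CUDA Fortran toy with the same shape: allocate device data at initialization, copy in / launch / copy out at every time step, and deallocate at finalization. The module toy_gpu_mod, the kernel step_kernel, and the field f are illustrative only, not WRF code:

module toy_gpu_mod
  use cudafor
  implicit none
  real, allocatable, device :: f_d(:,:,:)
contains
  attributes(global) subroutine step_kernel(f, ni, nk, nj)
    integer, value :: ni, nk, nj
    real :: f(ni, nk, nj)
    integer :: i, j, k
    i = (blockIdx%x-1)*blockDim%x + threadIdx%x
    j = (blockIdx%y-1)*blockDim%y + threadIdx%y
    if (i <= ni .and. j <= nj) then
      do k = 1, nk                    ! one thread handles a whole vertical column
        f(i, k, j) = f(i, k, j) + 1.0
      end do
    end if
  end subroutine step_kernel
end module toy_gpu_mod

program structure_demo
  use cudafor
  use toy_gpu_mod
  implicit none
  integer, parameter :: ni = 64, nk = 30, nj = 48, nsteps = 3
  real :: f(ni, nk, nj)
  type(dim3) :: dimGrid, dimBlock
  integer :: step

  f = 0.0
  allocate(f_d(ni, nk, nj))                     ! Initialize: allocate GPU variables
  dimBlock = dim3(32, 8, 1)
  dimGrid  = dim3((ni+31)/32, (nj+7)/8, 1)

  do step = 1, nsteps                           ! Time steps
    f_d = f                                     ! copy CPU variables to the GPU
    call step_kernel<<<dimGrid, dimBlock>>>(f_d, ni, nk, nj)
    f = f_d                                     ! copy GPU variables to the CPU
  end do

  deallocate(f_d)                               ! Finalize: deallocate GPU variables
  print *, f(1, 1, 1)                           ! expected: 3.0
end program structure_demo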

2 Implementation
2.3.3 Initialize & Finalize

phys/module_physics_init.F
#ifndef RUN_ON_GPU
  CALL rrtmg_lwinit( )
#else
  CALL rrtmg_lwinit_gpu()
#endif

main/module_wrf_top.F
#ifdef RUN_ON_GPU
  call rrtmg_lwfinalize_gpu()
#endif

cuda/module_ra_rrtmg_lw_gpu.F
Initialization of constants and allocation of GPU device variables:
subroutine rrtmg_lwinit_gpu(...)
  call rrtmg_lw_ini(cp)
  allocate(p8w_d(dime,djme), stat=istate)

Deallocation of GPU device variables:
subroutine rrtmg_lwfinalize_gpu(...)
  deallocate(p8w_d)
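A minimal sketch of how this init/finalize pair might be organized, assuming the device array is held as allocatable module data; the argument list of rrtmg_lwinit_gpu and the omission of the constant set-up (rrtmg_lw_ini) are simplifications, and the module name is hypothetical:

module module_ra_rrtmg_lw_gpu_sketch          ! hypothetical, condensed version
  use cudafor
  implicit none
  real, allocatable, device :: p8w_d(:,:)     ! device copy of the corresponding CPU field
contains
  subroutine rrtmg_lwinit_gpu(dime, djme)
    integer, intent(in) :: dime, djme
    integer :: istate
    allocate(p8w_d(dime, djme), stat=istate)  ! allocate GPU device memory once, at start-up
  end subroutine rrtmg_lwinit_gpu

  subroutine rrtmg_lwfinalize_gpu()
    if (allocated(p8w_d)) deallocate(p8w_d)   ! release GPU device memory at the end of the run
  end subroutine rrtmg_lwfinalize_gpu
end module module_ra_rrtmg_lw_gpu_sketch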

2 Implementation
2.3.4 Calling GPU Functions
phys/module_radiation_driver.F
#ifndef RUN_ON_GPU
  USE module_ra_rrtmg_lw, only: rrtmg_lwrad
#else
  USE module_ra_rrtmg_lw_gpu, only: rrtmg_lwrad_gpu
#endif
...
#ifndef RUN_ON_GPU
  CALL RRTMG_LWRAD(...)
#else
  CALL RRTMG_LWRAD_GPU(...)
#endif

2 Implementation
2.3.5 Translation into GPU code
Use a 3-dimensional domain
Remove the horizontal loops (the i- and j-loops) from the GPU (global) function

(Original code)
SUBROUTINE wdm62D(...)
  ...
  do k = kts, kte
    do i = its, ite
      cpm(i,k) = cpmcal(q(i,k))
      xl(i,k) = xlcal(t(i,k))
    enddo
  enddo
  do k = kts, kte
    do i = its, ite
      delz_tmp(i,k) = delz(i,k)
      den_tmp(i,k) = den(i,k)
    enddo
  enddo
  ...
END SUBROUTINE wdm62D

(GPU code)
attributes(global) subroutine wdm6_gpu_kernel(...)
  ...
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x
  j = (blockIdx%y-1)*blockDim%y + threadIdx%y
  if (((i.ge.its).and.(i.le.ite)).and. &
      ((j.ge.jts).and.(j.le.jte))) then
    do k = kts, kte
      cpm(i,k,j) = cpmcal(q(i,k,j))
      xl(i,k,j) = xlcal(t(i,k,j))
    enddo
    do k = kts, kte
      delz_tmp(i,k,j) = delz(i,k,j)
      den_tmp(i,k,j) = den(i,k,j)
    enddo
    ...
  endif
end subroutine wdm6_gpu_kernel
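The slides do not show how dimGrid and dimBlock are chosen for wdm6_gpu_kernel; a plausible host-side sketch, assuming one thread per (i, j) column and an illustrative 32 x 8 block shape:

! Hypothetical launch configuration covering the horizontal domain its:ite, jts:jte.
type(dim3) :: dimGrid, dimBlock

dimBlock = dim3(32, 8, 1)                               ! assumed block shape
dimGrid  = dim3((ite + dimBlock%x - 1)/dimBlock%x, &
                (jte + dimBlock%y - 1)/dimBlock%y, 1)   ! enough blocks to reach ite and jte
call wdm6_gpu_kernel<<<dimGrid, dimBlock>>>(...)        ! arguments elided, as on the slide
! Threads whose (i, j) falls outside its:ite, jts:jte are masked by the if-test in the kernel.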

2 Implementation
2.3.6 Memory allocation of arrays and copying of CPU data

(CPU code)
subroutine rrtmg_lwrad(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,j) = ...

(GPU code)
real, allocatable, device :: rthratenlw_d(:,:,:), emiss_d(:,:), ...

rthratenlw_d = rthratenlw
emiss_d = emiss
call rrtmg_lwrad_gpu_kernel<<<dimGrid, dimBlock>>>(rthratenlw_d, emiss_d, ...)

attributes(global) subroutine rrtmg_lwrad_gpu_kernel(rthratenlw, emiss, ...)
  ...
  rthratenlw(i,k,j) = ...
  emiss(i,j) = ...
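In CUDA Fortran the assignments rthratenlw_d = rthratenlw and emiss_d = emiss perform host-to-device copies. The copy back to the CPU after the kernel is not shown on the slide; presumably it is the reverse assignment:

rthratenlw = rthratenlw_d      ! device-to-host copy by array assignment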

3 Performance
3.1 Performance comparison
System specification used for performance checking
CPU: Intel Xeon E5405 (2.0 GHz)
GPU: Tesla C1060 (1.3 GHz)
  Global memory: 4 GB
  Multiprocessors: 30
  Cores: 240
  Registers per block: 16384
  Max threads per block: 512

3 Performance
Performance of WRF physics routines

[Figure: bar chart comparing CPU and GPU run times for WSM6, WDM6, YSUPBL, RRTMG_LW, RRTMG_SW, and SFCLAY; the vertical axis runs from 0 to 4000.]

3 Performance
Performance comparison of CUDA C and CUDA Fortran

[Figure: bar chart comparing CPU and GPU run times, in microseconds, for WSM5 and WSM6; the vertical axis runs from 0 to 300,000.]

3 Performance
3.2 Performance of WRF
[Figure: bar chart comparing the CPU and GPU run times of a full WRF run; the vertical axis runs from 0 to 18,000.]

4 Conclusions
Pros
  GP-GPUs can be used as efficient hardware accelerators.
  GP-GPUs are cheap and energy efficient.
Cons
  Communication between CPUs and GPUs is slow.
    Data transfer between the CPU and the GP-GPU is a bottleneck.
    Overlap of communication and computation is necessary.
  Translation into GP-GPU code is not trivial.
    Parameter-passing methods and local resources are limited.
    CUDA Fortran needs to be improved.
