S3064 Pedraforca ARM GPU Cluster HPC
Pedraforca: a First
ARM + GPU Cluster for HPC
Nikola Puzovic, Alex Ramirez
We’ve hit the power wall
ALL computers are limited by power consumption
Energy-efficient approaches
Multi-core
– Fujitsu SPARC64 VIIIfx
– Intel SandyBridge
– AMD Bulldozer
Low-power processors
– IBM BlueGene/Q
Compute accelerators
– IBM Cell
– NVIDIA Tesla
– AMD Radeon
– Intel Xeon Phi
The next step in the commodity chain
HPC
Servers
Desktop
Tegra 3
– Quad-core ARM Cortex-A9
– 12-core Embedded GPU
Tegra 4
– Quad-core ARM Cortex-A15
– 72-core Embedded GPU
Tibidabo: The first ARM multicore cluster
Q7 Tegra 2
– 2 x Cortex-A9 @ 1GHz
– 2 GFLOPS
– 5 Watts (?)
– 0.4 GFLOPS / W
Q7 carrier board
– 2 x Cortex-A9
– 2 GFLOPS
– 1 GbE + 100 MbE
– 7 Watts
– 0.3 GFLOPS / W
1U Rackable blade
– 8 nodes
– 16 GFLOPS
– 65 Watts
– 0.25 GFLOPS / W
2 Racks
– 32 blade containers
– 256 nodes
– 512 cores
– 9x 48-port 1GbE switch
– 512 GFLOPS
– 3.4 kW
– 0.15 GFLOPS / W
Proof of concept
– It is possible to deploy a cluster of smartphone processors
Enable software stack development
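The efficiency figures quoted for Tibidabo are peak GFLOPS divided by power at each level of integration. A minimal sketch of that arithmetic follows, using only the numbers from the slide. The level names and the helper `gflops_per_watt` are illustrative, and the 5 W module figure is marked as uncertain on the slide itself.

```python
# Peak performance (GFLOPS) and power draw (W) at each level of integration,
# copied from the Tibidabo slide. The 5 W module figure carries a "(?)" there.
levels = [
    ("Q7 Tegra 2 module", 2.0, 5.0),
    ("Q7 carrier board", 2.0, 7.0),
    ("1U blade (8 nodes)", 16.0, 65.0),
    ("2 racks (256 nodes)", 512.0, 3400.0),
]


def gflops_per_watt(gflops, watts):
    """Energy efficiency as peak GFLOPS per watt (hypothetical helper)."""
    return gflops / watts


efficiency = {name: gflops_per_watt(g, w) for name, g, w in levels}

for name, value in efficiency.items():
    print(f"{name:22s} {value:.2f} GFLOPS/W")
```

Each step up (carrier board, blade, rack) adds power for networking, conversion and cooling without adding FLOPS, which is why the figure drops from roughly 0.4 to roughly 0.15 GFLOPS / W.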
HPC System software stack on ARM
[Diagram: software stack, top to bottom]
– Source files (C, C++, FORTRAN, …)
– Compiler(s): gcc, gfortran, OmpSs, …
– Executable(s)
– Scientific libraries: ATLAS, FFTW, HDF5, …
– Developer tools: Paraver, Scalasca, …
– Cluster management (Slurm)
– OmpSs runtime library (NANOS++)
– GASNet, CUDA, OpenCL, MPI
– Linux
– CPU, GPU, …
Open source system software stack
– Ubuntu/Debian Linux OS
– GNU compilers
• gcc, g++, gfortran
– Scientific libraries
• ATLAS, FFTW, HDF5, ...
– Slurm cluster management
Runtime libraries
– MPICH2, CUDA, …
– OmpSs toolchain*
Developer tools
– Paraver, Scalasca
– Allinea DDT debugger
* S3232 - OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators. Thursday, 10:00, Marriott Ballroom 3
Porting applications to ARM
                                                                Prog. Model
Application        Domain                          Institution  MPI  OpenMP  Other             Scalability     ARM port
YALES2             Combustion                      CNRS/CORIA   Y                              >32K
EUTERPE            Fusion                          BSC          Y    Y                         >60K
SPECFEM3D          Wave propagation                CNRS         Y            CUDA, SMPSs       >150K, >1K GPU
MP2C               Multi-particle collision        JSC          Y                              >65K
BigDFT             Elect. Structure                CEA          Y    Y       CUDA, OpenCL      >2K, >300 GPU
Quantum ESPRESSO   Elect. Structure                CINECA       Y    Y       CUDA              Good
PEPC               Coulomb + gravitational forces  JSC          Y            Pthreads, SMPSs   >300K
SMMP               Protein folding                 JSC          Y            OpenCL            16K
ProFASI            Protein folding                 JSC          Y                              Good
COSMO              Weather forecast                CINECA       Y    Y
BQCD               Particle physics                LRZ          Y    Y                         ~300K
Mellanox ConnectX-3
– 8x PCIe Gen3
– 40 Gb/s
2.5” SSD
– 250 GB
– SATA 3 MLC
Pedraforca: Interconnect
4x IB switch
Login nodes
– Intel SandyBridge E5
NFS Storage
[Diagram: network links labelled IB and GbE]
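The adapter figures quoted for the node (8x PCIe Gen3, 40 Gb/s) can be put side by side with a quick calculation. The sketch below assumes the standard PCIe Gen3 parameters (8 GT/s per lane, 128b/130b encoding) and treats the 40 Gb/s InfiniBand figure as a raw link rate. The function names are illustrative, and protocol overheads are ignored.

```python
# PCIe Gen3 signalling rate per lane and its 128b/130b line encoding.
PCIE_GEN3_GT_PER_LANE = 8.0        # GT/s per lane
PCIE_GEN3_ENCODING = 128.0 / 130.0


def pcie_gen3_gbps(lanes):
    """Link bandwidth in Gb/s after line encoding, before packet overhead."""
    return lanes * PCIE_GEN3_GT_PER_LANE * PCIE_GEN3_ENCODING


def gbps_to_gbytes(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0


pcie_x8 = pcie_gen3_gbps(8)  # host interface the adapter is rated for
ib_link = 40.0               # Gb/s, link rate quoted on the slide

print(f"PCIe Gen3 x8 : {pcie_x8:.1f} Gb/s ({gbps_to_gbytes(pcie_x8):.2f} GB/s)")
print(f"InfiniBand   : {ib_link:.1f} Gb/s ({gbps_to_gbytes(ib_link):.2f} GB/s)")
```

Under these assumptions the adapter's rated host interface has more bandwidth than its network link. What a node actually sustains also depends on the PCIe capabilities of the host it is plugged into, which the slide does not state.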
– … apps
– Unused CPU in heavily accelerated apps
Decouple CPU from GPU
– Off-load kernels to remote GPU
– Direct GPU to GPU data transfers
• Orchestrated by light-weight ARM CPU
[Diagrams: CPUs and GPUs attached to an interconnection network]
Conclusions
CARMA is not an HPC solution …
… but it enables software development already
http://www.bsc.es/about_bsc/employment/vacancies
– Senior Researchers in Energy-Efficient Supercomputers
– HPC Application Developers