S3064 Pedraforca ARM GPU Cluster HPC


www.bsc.es

Pedraforca: a First
ARM + GPU Cluster for HPC
Nikola Puzovic, Alex Ramirez
We’ve hit the power wall
ALL computers are limited by power consumption
Energy-efficient approaches
Multi-core
– Fujitsu SPARC64 VIIIfx
– Intel SandyBridge
– AMD Bulldozer

Low-power processors
– IBM BlueGene/Q

Compute accelerators
– IBM Cell
– NVIDIA Tesla
– AMD Radeon
– Intel Xeon Phi
The next step in the commodity chain

HPC
Servers
Desktop
Mobile

Build the next HPC system on commodity and super-commodity components
– 100M tablets in 2012
– 750M smartphones in 2012
NVIDIA Tegra: Commodity CPU + GPU platform
Tegra 2
– Dual-core ARM Cortex-A9
– ULP Embedded GPU

Tegra 3
– Quad-core ARM Cortex-A9
– 12-core Embedded GPU

Tegra 4
– Quad-core ARM Cortex-A15
– 72-core Embedded GPU
Tibidabo: The first ARM multicore cluster
Q7 Tegra 2
– 2 x Cortex-A9 @ 1GHz
– 2 GFLOPS
– 5 Watts (?)
– 0.4 GFLOPS / W

Q7 carrier board
– 2 x Cortex-A9
– 2 GFLOPS
– 1 GbE + 100 MbE
– 7 Watts
– 0.3 GFLOPS / W

1U Rackable blade
– 8 nodes
– 16 GFLOPS
– 65 Watts
– 0.25 GFLOPS / W

2 Racks
– 32 blade containers
– 256 nodes
– 512 cores
– 9x 48-port 1GbE switch
– 512 GFLOPS
– 3.4 kW
– 0.15 GFLOPS / W

Proof of concept
– It is possible to deploy a cluster of smartphone processors
Enable software stack development
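
The GFLOPS / W figures quoted for each Tibidabo packaging level can be reproduced from the performance and power numbers on the slide. The following is a minimal sketch; the names `TIBIDABO_LEVELS` and `efficiency_table` are illustrative, and the 5 W module figure is marked uncertain ("?") in the source.

```python
# Figures copied from the Tibidabo slide: (label, GFLOPS, watts).
# The 5 W value for the Q7 module is marked "(?)" in the source.
TIBIDABO_LEVELS = [
    ("Q7 Tegra 2 module", 2.0, 5.0),
    ("Q7 carrier board", 2.0, 7.0),
    ("1U blade (8 nodes)", 16.0, 65.0),
    ("2 racks (256 nodes)", 512.0, 3400.0),
]


def gflops_per_watt(gflops, watts):
    """Energy efficiency as performance divided by power draw."""
    return gflops / watts


def efficiency_table(levels=TIBIDABO_LEVELS):
    """Return (label, GFLOPS/W rounded to two decimals) for each level."""
    return [(label, round(gflops_per_watt(g, w), 2)) for label, g, w in levels]


if __name__ == "__main__":
    for label, eff in efficiency_table():
        print(f"{label:22s} {eff:.2f} GFLOPS/W")
```

Efficiency drops at every level of integration because each step adds power (network interfaces, switches, power supplies) without adding compute.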
HPC System software stack on ARM
Stack layers (top to bottom)
– Source files (C, C++, FORTRAN, …)
– Compiler(s): gcc, gfortran, OmpSs, …
– Executable(s)
– Scientific libraries: ATLAS, FFTW, HDF5, …
– Developer tools: Paraver, Scalasca, …
– Cluster management (Slurm)
– OmpSs runtime library (NANOS++)
– GASNet, CUDA, OpenCL, MPI
– Linux on each node
– CPU + GPU in each node

Open source system software stack
– Ubuntu/Debian Linux OS
– GNU compilers
  • gcc, g++, gfortran
– Scientific libraries
  • ATLAS, FFTW, HDF5, ...
– Slurm cluster management

Runtime libraries
– MPICH2, CUDA, …
– OmpSs toolchain*

Developer tools
– Paraver, Scalasca
– Allinea DDT debugger

* S3232 - OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators. Thursday, 10:00, Marriott Ballroom 3
Porting applications to ARM
                                                              Prog. Model
Application       Domain                          Institution MPI  OpenMP  Other             Scalability     ARM port
YALES2            Combustion                      CNRS/CORIA  Y                              >32K            ✓
EUTERPE           Fusion                          BSC         Y    Y                         >60K            ✓
SPECFEM3D         Wave propagation                CNRS        Y            CUDA, SMPSs       >150K, >1K GPU  ✓
MP2C              Multi-particle collision        JSC         Y                              >65K            ✓
BigDFT            Elect. Structure                CEA         Y    Y       CUDA, OpenCL      >2K, >300 GPU   ✓
Quantum ESPRESSO  Elect. Structure                CINECA      Y    Y       CUDA              Good            ✓
PEPC              Coulomb + gravitational forces  JSC         Y            Pthreads, SMPSs   >300K           ✓
SMMP              Protein folding                 JSC         Y            OpenCL            16K             ✓
ProFASI           Protein folding                 JSC         Y                              Good            ✓
COSMO             Weather forecast                CINECA      Y    Y                                         ✓
BQCD              Particle physics                LRZ         Y    Y                         ~300K           ✓

Porting full-scale HPC applications to an ARM cluster requires minimal effort
CARMA: CUDA on ARM developer kit
Tegra3 SoC
– Quad-core ARM Cortex-A9
– 6 PCIe lanes (gen1)
Quadro 1000M
– CUDA supported
1 GbE
First hybrid ARM + CUDA platform
CARMA Kit: Energy Efficiency
The CARMA platform is much more energy-efficient than Tegra3 alone
Pedraforca v1: The first ARM + GPU cluster

Development cluster of 16 CARMA kits @ BSC


Pedraforca v1: Initial application performance results

Only 3.72 GFLOPS in Linpack … but
– DGEMM: 21.3 GFLOPS (0.78 GFLOPS/W)
– SGEMM: 127.8 GFLOPS (5.04 GFLOPS/W)
– Low PCIe bandwidth (400 MB/s peak)
– No overlap of data transfers and computation
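
To illustrate why low PCIe bandwidth and the lack of transfer/compute overlap matter, the sketch below estimates the time for one matrix multiplication with and without overlap. It is a rough model under stated assumptions, not a description of the measured runs: the matrix size `n = 4096` is hypothetical, the slide's GEMM rates are treated as pure kernel rates, the 400 MB/s peak is used as the sustained transfer rate, and perfect overlap is idealised as `max(compute, transfer)`.

```python
def gemm_times(n, gflops, pcie_mb_per_s, bytes_per_elem):
    """Estimate compute and PCIe transfer time (seconds) for an n x n GEMM.

    Assumes 2*n^3 floating-point operations and three n x n matrices
    crossing PCIe (A and B to the GPU, C back to the host).
    """
    flops = 2.0 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    compute_s = flops / (gflops * 1e9)
    transfer_s = bytes_moved / (pcie_mb_per_s * 1e6)
    return compute_s, transfer_s


def serial_and_overlapped(compute_s, transfer_s):
    """Total time without overlap (sum) and with ideal overlap (max)."""
    return compute_s + transfer_s, max(compute_s, transfer_s)


if __name__ == "__main__":
    n = 4096  # hypothetical problem size, not taken from the slides
    for name, rate, width in (("DGEMM", 21.3, 8), ("SGEMM", 127.8, 4)):
        compute_s, transfer_s = gemm_times(n, rate, 400.0, width)
        serial, overlapped = serial_and_overlapped(compute_s, transfer_s)
        print(f"{name}: compute {compute_s:.2f} s, transfer {transfer_s:.2f} s, "
              f"no overlap {serial:.2f} s, ideal overlap {overlapped:.2f} s")
```

Under these assumptions the transfer cost is a much larger fraction of the total for SGEMM than for DGEMM, so the faster the kernel, the more the 400 MB/s link hurts.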
Pedraforca v2: Next generation ARM + GPU platform
Tegra3 Q7 module
– 4x ARM Cortex-A9 @ 1.3 GHz
– 2GB DDR2

NVIDIA Tesla K20
– 16x PCIe Gen3
– 1170 GFLOPS (peak)

Mini-ITX carrier
– 4x PCIe Gen1
– SATA 2.0
– 1 GbE

Mellanox ConnectX-3
– 8x PCIe Gen3
– 40 Gb/s

2.5” SSD
– 250 GB
– SATA 3 MLC

Ethernet 1 Gb/s (service + storage)
InfiniBand 40 Gb/s (MPI)
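
The PCIe links listed for the v2 node differ widely in theoretical bandwidth. The sketch below computes the per-direction figures from the standard per-lane signalling rates and line encodings (Gen1: 2.5 GT/s with 8b/10b; Gen3: 8 GT/s with 128b/130b). Protocol overhead is ignored, so real throughput is lower. The function and table names are illustrative, and how the devices are wired together is not stated on the slide.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per generation.
PCIE_GEN = {
    1: (2.5, 8 / 10),     # Gen1: 8b/10b encoding
    3: (8.0, 128 / 130),  # Gen3: 128b/130b encoding
}


def pcie_gb_per_s(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    gt_per_s, efficiency = PCIE_GEN[gen]
    return gt_per_s * efficiency / 8 * lanes  # bits -> bytes


if __name__ == "__main__":
    links = [
        ("Mini-ITX carrier (4x Gen1)", 1, 4),
        ("Mellanox ConnectX-3 (8x Gen3)", 3, 8),
        ("NVIDIA Tesla K20 (16x Gen3)", 3, 16),
    ]
    for label, gen, lanes in links:
        print(f"{label:32s} {pcie_gb_per_s(gen, lanes):6.2f} GB/s")
    print(f"{'InfiniBand 40 Gb/s line rate':32s} {40 / 8:6.2f} GB/s")
```

On paper, the carrier's 4x Gen1 link (about 1 GB/s) is the narrowest of the listed interfaces, well below both the 40 Gb/s InfiniBand line rate and what the Gen3 devices can accept.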
Pedraforca: Rack enclosure
2x GbE switch

4x IB switch

Login nodes
– Intel SandyBridge E5

64x Compute nodes
– 4x ARM Cortex-A9
– 1x NVIDIA Tesla K20

NFS Storage
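
As a quick sanity check on the scale of the rack, the aggregate numbers follow directly from the per-node figures on the slides (64 compute nodes, one Tesla K20 at 1170 GFLOPS peak and four Cortex-A9 cores each). This sketch counts GPU peak only and ignores the ARM cores' and login nodes' contribution; the constant and function names are illustrative.

```python
COMPUTE_NODES = 64         # from the rack enclosure slide
K20_PEAK_GFLOPS = 1170.0   # per-GPU peak from the v2 node slide
ARM_CORES_PER_NODE = 4     # quad-core Cortex-A9 per node


def aggregate_gpu_peak_tflops(nodes=COMPUTE_NODES, per_gpu=K20_PEAK_GFLOPS):
    """Sum of GPU peak performance across all compute nodes, in TFLOPS."""
    return nodes * per_gpu / 1000.0


def total_arm_cores(nodes=COMPUTE_NODES, cores=ARM_CORES_PER_NODE):
    """Total number of ARM cores across all compute nodes."""
    return nodes * cores


if __name__ == "__main__":
    print(f"GPU peak:  {aggregate_gpu_peak_tflops():.2f} TFLOPS")
    print(f"ARM cores: {total_arm_cores()}")
```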
Pedraforca: Interconnect

[Diagram: compute nodes connected through IB and GbE switches]

GbE network for service and storage
IB network for MPI
– With extra ports to connect to other clusters …
GPU-accelerated cluster vs. GPU-accelerator cluster
Current GPU clusters
– Fixed ratio of CPU to GPU
– Unused GPU in not-accelerated apps
– Unused CPU in heavily accelerated apps
[Diagram: CPU + GPU pairs attached to an interconnection network]

Decouple CPU from GPU
– Off-load kernels to remote GPU
– Direct GPU to GPU data transfers
  • Orchestrated by light-weight ARM CPU
[Diagram: separate pools of CPUs and GPUs attached to an interconnection network]
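
The resource waste described above can be made concrete with a small model. This is a sketch under simplifying assumptions, not the Pedraforca scheduler: jobs on a fixed-ratio cluster are assumed to take whole nodes, a decoupled cluster is assumed to hand out exactly what is requested, and the node shape (8 CPUs + 2 GPUs) and job sizes are hypothetical.

```python
import math


def fixed_ratio_allocation(cpus_needed, gpus_needed, cpus_per_node, gpus_per_node):
    """Whole-node allocation: take enough nodes to satisfy both requests."""
    nodes = max(
        math.ceil(cpus_needed / cpus_per_node),
        math.ceil(gpus_needed / gpus_per_node),
        1,
    )
    return {
        "nodes": nodes,
        "idle_cpus": nodes * cpus_per_node - cpus_needed,
        "idle_gpus": nodes * gpus_per_node - gpus_needed,
    }


def decoupled_allocation(cpus_needed, gpus_needed):
    """Independent CPU and GPU pools: allocate exactly what is requested."""
    return {"cpus": cpus_needed, "gpus": gpus_needed, "idle_cpus": 0, "idle_gpus": 0}


if __name__ == "__main__":
    # Hypothetical jobs on hypothetical nodes with 8 CPUs and 2 GPUs each.
    cpu_only = fixed_ratio_allocation(16, 0, cpus_per_node=8, gpus_per_node=2)
    gpu_heavy = fixed_ratio_allocation(2, 8, cpus_per_node=8, gpus_per_node=2)
    print("CPU-only job, fixed ratio: ", cpu_only)    # GPUs sit idle
    print("GPU-heavy job, fixed ratio:", gpu_heavy)   # CPUs sit idle
    print("GPU-heavy job, decoupled:  ", decoupled_allocation(2, 8))
```

In this model the CPU-only job strands every GPU on its nodes and the GPU-heavy job strands most of its CPUs, which is the motivation for a pool of GPUs driven by light-weight ARM CPUs.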
Conclusions
CARMA is not an HPC solution …
… but it enables software development already

Pedraforca is the second-generation ARM + GPU prototype
– GPU-accelerator cluster, instead of GPU-accelerated cluster
  • ARM CPU used to orchestrate direct GPU to GPU communication

CPU + GPU integration is happening already
– Embedded mobile platforms with OpenCL-capable GPU

Get ready for your next-generation CPU + GPU platforms!


We’re hiring!
Do you want to work on the next generation of energy-efficient HPC systems?
Lead the way to the Exascale?
Change the HPC world forever?

http://www.bsc.es/about_bsc/employment/vacancies
– Senior Researchers in Energy-Efficient Supercomputers
– HPC Application Developers
