S3064 Pedraforca ARM GPU Cluster HPC

www.bsc.

Pedraforca: a First
ARM + GPU Cluster for HPC
Nikola Puzovic, Alex Ramirez
We’ve hit the power wall
ALL computers are limited by power consumption
Energy-efficient approaches
Multi-core
– Fujitsu SPARC64 VIIIfx
– Intel Sandy Bridge
– AMD Bulldozer

Low-power processors
– IBM BlueGene/Q

Compute accelerators
– IBM Cell
– NVIDIA Tesla
– AMD Radeon
– Intel Xeon Phi
The next step in the commodity chain

HPC

Servers

Desktop

Build the next HPC system on commodity and super-commodity mobile components
– 100M tablets in 2012
– 750M smartphones in 2012
NVIDIA Tegra: Commodity CPU + GPU platform
Tegra 2
– Dual-core ARM Cortex-A9
– ULP Embedded GPU

Tegra 3
– Quad-core ARM Cortex-A9
– 12-core Embedded GPU

Tegra 4
– Quad-core ARM Cortex-A15
– 72-core Embedded GPU
Tibidabo: The first ARM multicore cluster
Q7 Tegra 2
– 2 x Cortex-A9 @ 1GHz
– 2 GFLOPS
– 5 Watts (?)
– 0.4 GFLOPS / W
Q7 carrier board
– 2 x Cortex-A9
– 2 GFLOPS
– 1 GbE + 100 MbE
– 7 Watts
– 0.3 GFLOPS / W
1U Rackable blade
– 8 nodes
– 16 GFLOPS
– 65 Watts
– 0.25 GFLOPS / W
2 Racks
– 32 blade containers
– 256 nodes
– 512 cores
– 9x 48-port 1GbE switch
– 512 GFLOPS
– 3.4 kW
– 0.15 GFLOPS / W
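The GFLOPS / W figures quoted for Tibidabo can be checked by dividing peak GFLOPS by power draw at each packaging level. The following is a minimal Python sketch of that arithmetic; the dictionary and function names are illustrative only, and the 5 W module figure is the one the slide itself marks with "(?)".

```python
# Peak GFLOPS and power (watts) at each Tibidabo packaging level,
# copied from the slide. Names are illustrative, not from the source.
levels = {
    "Q7 Tegra 2 module": (2.0, 5.0),      # 5 W is marked "(?)" on the slide
    "Q7 carrier board": (2.0, 7.0),
    "1U blade (8 nodes)": (16.0, 65.0),
    "2 racks (256 nodes)": (512.0, 3400.0),
}


def gflops_per_watt(gflops, watts):
    """Energy efficiency as peak GFLOPS divided by power draw."""
    return gflops / watts


efficiency = {name: gflops_per_watt(g, w) for name, (g, w) in levels.items()}

for name, eff in efficiency.items():
    print(f"{name}: {eff:.2f} GFLOPS/W")
```

The efficiency drops at each level because every packaging step adds power (network ports, switches, power supplies) without adding compute.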
Proof of concept
– It is possible to deploy a cluster of smartphone processors
Enable software stack development
HPC System software stack on ARM
Open source system software stack
– Ubuntu/Debian Linux OS
– GNU compilers
• gcc, g++, gfortran
– Scientific libraries
• ATLAS, FFTW, HDF5,...
– Slurm cluster management
Runtime libraries
– MPICH2, CUDA, …
– OmpSs toolchain*
Developer tools
– Paraver, Scalasca
– Allinea DDT debugger
[Figure: software stack layers, top to bottom: source files (C, C++, FORTRAN, …); compilers (gcc, gfortran, OmpSs, …); executables; scientific libraries (ATLAS, FFTW, HDF5, …); developer tools (Paraver, Scalasca, …); cluster management (Slurm); OmpSs runtime library (NANOS++); GASNet, CUDA, OpenCL, MPI; Linux; CPU and GPU]
* S3232 - OmpSs: Leveraging CUDA and OpenCL to Exploit Heterogeneous Clusters of Hardware Accelerators. Thursday, 10:00, Marriott Ballroom 3
Porting applications to ARM
Application       Domain                          Institution  MPI/OpenMP  Other            Scalability      ARM port
YALES2            Combustion                      CNRS/CORIA   Y                            >32K
EUTERPE           Fusion                          BSC          Y Y                          >60K
SPECFEM3D         Wave propagation                CNRS         Y           CUDA, SMPSs      >150K, >1K GPU
MP2C              Multi-particle collision        JSC          Y                            >65K
BigDFT            Elect. structure                CEA          Y Y         CUDA, OpenCL     >2K, >300 GPU
Quantum ESPRESSO  Elect. structure                CINECA       Y Y         CUDA             Good
PEPC              Coulomb + gravitational forces  JSC          Y           Pthreads, SMPSs  >300K
SMMP              Protein folding                 JSC          Y           OpenCL           16K
ProFASI           Protein folding                 JSC          Y                            Good
COSMO             Weather forecast                CINECA       Y Y
BQCD              Particle physics                LRZ          Y Y                          ~300K

Porting full-scale HPC applications to an ARM cluster requires minimal effort
CARMA: CUDA on ARM developer kit
Tegra3 SoC
– Quad-core ARM Cortex-A9
– 6 PCIe lanes (gen1)
Quadro 1000M
– CUDA supported
1 GbE
First hybrid ARM + CUDA platform
CARMA Kit: Energy Efficiency
The CARMA platform is much more energy-efficient than Tegra3 alone
Pedraforca v1: The first ARM + GPU cluster

Development cluster of 16 CARMA kits @ BSC

Pedraforca v1: Initial application performance results

Only 3.72 GFLOPS in Linpack … but

– DGEMM: 21.3 GFLOPS (0.78 GFLOPS/W)
– SGEMM: 127.8 GFLOPS (5.04 GFLOPS/W)
– Low PCIe bandwidth (400 MB/s peak)
– No overlap of data transfers and computation
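Two quick calculations help put these figures in context. The first derives the power draw implied by each GFLOPS and GFLOPS/W pair. The second estimates how a 400 MB/s PCIe link compares with compute time for one matrix multiply when transfers and computation are not overlapped. This is only a rough Python sketch: the matrix size, the assumption that the two inputs are copied to the GPU and the result copied back, and all function names are illustrative and do not come from the slides.

```python
def implied_power_watts(gflops, gflops_per_watt):
    """Power draw implied by a performance and an efficiency figure."""
    return gflops / gflops_per_watt


dgemm_power = implied_power_watts(21.3, 0.78)
sgemm_power = implied_power_watts(127.8, 5.04)


def gemm_times(n, bytes_per_elem, gflops, pcie_mb_per_s=400.0):
    """Estimate compute and transfer time (seconds) for an n x n GEMM.

    Assumption: A and B are sent to the GPU and C is sent back, so three
    n x n matrices cross the PCIe link once each, at the peak link rate.
    """
    flops = 2.0 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    compute_s = flops / (gflops * 1e9)
    transfer_s = bytes_moved / (pcie_mb_per_s * 1e6)
    return compute_s, transfer_s


# Hypothetical problem size: 4096 x 4096 in double precision.
compute_s, transfer_s = gemm_times(4096, 8, 21.3)
serialized_s = compute_s + transfer_s      # no overlap, as on Pedraforca v1
overlapped_s = max(compute_s, transfer_s)  # ideal overlap of copy and compute

print(f"DGEMM implied power: {dgemm_power:.1f} W")
print(f"SGEMM implied power: {sgemm_power:.1f} W")
print(f"serialized: {serialized_s:.2f} s, overlapped: {overlapped_s:.2f} s")
```

GEMM has high arithmetic intensity, so the transfer cost is modest in this estimate. Kernels that do less work per byte moved would be hit much harder by the slow link and the lack of overlap.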
Pedraforca v2: Next generation ARM + GPU platform
Tegra3 Q7 module
– 4x ARM Cortex-A9 @ 1.3 GHz
– 2GB DDR2
NVIDIA Tesla K20
– 16x PCIe Gen3
– 1170 GFLOPS (peak)
Mini-ITX carrier
– 4x PCIe Gen1
– SATA 2.0
– 1 GbE
Mellanox ConnectX-3
– 8x PCIe Gen3
– 40 Gb/s
2.5” SSD
– 250 GB
– SATA 3 MLC

Ethernet 1 Gb/s (service + storage)

InfiniBand 40 Gb/s (MPI)
Pedraforca: Rack enclosure
2x GbE switch

4x IB switch

64x Compute nodes
– 4x ARM Cortex-A9
– 1x NVIDIA Tesla K20

NFS Storage
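For a rough sense of scale, the aggregate GPU peak of one rack can be estimated from the node count and the per-GPU peak quoted for the Tesla K20 (1170 GFLOPS). The Python sketch below counts only the GPUs and ignores the ARM cores; the variable names are illustrative, and this is a back-of-envelope peak, not a measured result.

```python
# Figures taken from the slides; variable names are illustrative.
nodes_per_rack = 64
gpus_per_node = 1
k20_peak_gflops = 1170.0

# Aggregate GPU peak for one rack. The ARM Cortex-A9 cores are ignored here,
# since their contribution is small next to the GPU peak.
rack_peak_gflops = nodes_per_rack * gpus_per_node * k20_peak_gflops
rack_peak_tflops = rack_peak_gflops / 1000.0

print(f"Rack GPU peak: {rack_peak_tflops:.2f} TFLOPS")
```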
Pedraforca: Interconnect

[Figure: interconnect diagram with InfiniBand (IB) and GbE links]

apps
– Unused CPU in heavily accelerated apps
Decouple CPU from GPU
– Off-load kernels to remote GPU
– Direct GPU to GPU data transfers
• Orchestrated by light-weight ARM CPU
[Figure: CPU and GPU blocks attached to an interconnection network]
Conclusions
CARMA is not an HPC solution …
… but it enables software development already

Pedraforca is the second generation ARM + GPU prototype

– GPU-accelerator cluster, instead of GPU-accelerated cluster
• ARM CPU used to orchestrate direct GPU to GPU communication

CPU + GPU integration is happening already

– Embedded mobile platforms with OpenCL capable GPU

Get ready for your next generation CPU + GPU platforms!

We’re hiring!
Do you want to work on the next generation of energy-efficient
HPC systems?
Lead the way to the Exascale?
Change the HPC world forever?

http://www.bsc.es/about_bsc/employment/vacancies
– Senior Researchers in Energy-Efficient Supercomputers
– HPC Application Developers
