Tibidabo✩: Making the case for an ARM-based HPC system
Nikola Rajovic a,b,∗, Alejandro Rico a,b, Nikola Puzovic a, Chris Adeniyi-Jones c, Alex Ramirez a,b

a Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
b Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
c ARM Ltd., Cambridge, United Kingdom
Abstract
It is widely accepted that future HPC systems will be limited by their power
consumption. Current HPC systems are built from commodity server pro-
cessors, designed over the years to achieve maximum performance, with energy
efficiency being an afterthought. In this paper we advocate a different approach:
building HPC systems from low-power embedded and mobile technology parts that
have been designed over time for maximum energy efficiency and now show promise
for competitive performance.
We introduce the architecture of Tibidabo, the first large-scale HPC clus-
ter built from ARM multicore chips, and a detailed performance and energy
efficiency evaluation. We present the lessons learned for the design and im-
provement in energy efficiency of future HPC systems based on such low-
power cores. Based on our experience with the prototype, we perform simu-
lations to show that a theoretical cluster of 16-core ARM Cortex-A15 chips
would increase the energy efficiency of our cluster by 8.7x, reaching an energy
efficiency of 1046 MFLOPS/W.
Keywords: high-performance computing, embedded processors, mobile processors, low power, cortex-a9, cortex-a15, energy efficiency

✩ Tibidabo is a mountain overlooking Barcelona.
∗ Corresponding author
Email addresses: nikola.rajovic@bsc.es (Nikola Rajovic), alejandro.rico@bsc.es (Alejandro Rico), nikola.puzovic@bsc.es (Nikola Puzovic), chris.adeniyi-jones@arm.com (Chris Adeniyi-Jones), alex.ramirez@bsc.es (Alex Ramirez)

Preprint of the article accepted for publication in Future Generation Computer Systems, Elsevier
1. Introduction
In High Performance Computing (HPC), there is a continued need for
higher computational performance. Scientific grand challenges in, e.g., engineering, geophysics, and bioinformatics, as well as other compute-intensive applications, require increasing amounts of compute power. On the other hand,
energy is increasingly becoming one of the most expensive resources and it
substantially contributes to the total cost of running a large supercomputing
facility. In some cases, the total energy cost over a few years of operation
can exceed the cost of the hardware infrastructure acquisition [1, 2, 3].
This trend is not limited to HPC systems; it also holds true for data centres in general. Energy efficiency is already a primary concern for the design of any computer system and it is unanimously recognized that reaching the next milestone in supercomputers' performance, e.g. one EFLOPS (exaFLOPS, 10^18 floating-point operations per second), will be strongly
constrained by power. The energy efficiency of a system will define the max-
imum achievable performance.
In this paper, we take a first step towards HPC systems developed from
low-power solutions used in embedded and mobile devices. However, using
CPUs from this domain is a challenge: these devices are designed neither to exploit high ILP nor to deliver high memory bandwidth. Most embedded CPUs lack a vector floating-point unit and their software ecosystem is not tuned for HPC. What makes them particularly interesting are their size and power characteristics, which allow for higher packaging density and lower cost. In
the following three subsections we further motivate our proposal from several
important aspects.
only 30-40% of the total power will actually be spent on the cores, the rest going to power supply overhead, interconnect, storage, and memory. That leads to a power budget of 6 MW to 8 MW for 62.5 million cores, which is 0.10 W to 0.13 W per core. Current high-performance processors integrating this type of core require tens of watts at 2 GHz. However, ARM processors, designed for the embedded mobile market, consume less than 0.9 W at that frequency [5] and are thus worth exploring: even though they do not yet provide a sufficient level of performance, they have a promising roadmap ahead.
1 Cortex-A8 is the processor generation prior to Cortex-A9; it has a non-pipelined floating-point unit. In the best case it can deliver one floating-point ADD every ∼10 cycles; MUL and MAC have lower throughput.
1.3. Bell’s Law
Our approach for an HPC system is novel because we argue for the use of
mobile cores. We consider the improvements expected in mobile SoCs in the
near future that would make them real candidates for HPC. As Bell’s law
states [15], a new computer class is usually based on lower cost components,
which continue to evolve at a roughly constant price but with increasing per-
formance from Moore's law. This trend holds today: the class of computing systems on the rise in HPC is that of systems with large numbers of closely coupled small cores (BlueGene/Q and Xeon Phi systems). From the
architectural point of view, our proposal fits into this computing class and it
has the potential for performance growth given the size and evolution of the
mobile market.
1.4. Contributions
In this paper, we present Tibidabo, an experimental HPC cluster that we
built using NVIDIA Tegra2 chips, each featuring a performance-optimized
dual-core ARM Cortex-A9 processor. We use the PCIe support in Tegra2
to connect a 1 GbE NIC, and build a tree interconnect with 48-port 1 GbE
switches.
We do not intend our first prototype to achieve an energy efficiency com-
petitive with today’s leaders. The purpose of this prototype is to be a proof
of concept to demonstrate that building such energy-efficient clusters with
mobile processors is possible, and to learn from the experience. On the soft-
ware side, the goal is to deploy an HPC-ready software stack for ARM-based
systems, and to serve as an early application development and tuning vehicle.
Detailed analysis of performance and power distribution points to a ma-
jor problem when building HPC systems from low-power parts: the system
integration glue takes more power than the microprocessor cores themselves.
The main building block of our cluster, the Q7 board, is designed with embedded and mobile software development in mind, and is not particularly optimized for energy-efficient operation. Nevertheless, the energy efficiency
of our cluster is 120 MFLOPS/W, still competitive with Intel Xeon X5660
and AMD Opteron 6128 based clusters,2 but much lower than what could
be anticipated from the performance and power figures of the Cortex-A9
processor.
2 In the November 2012 edition of the Green500 list these systems are ranked 395th and 396th, respectively.
We use our performance analysis to model and simulate a potential HPC
cluster built from ARM Cortex-A9 and Cortex-A15 chips with higher multi-
core density (number of cores per chip) and higher bandwidth interconnects,
and conclude that such a system would deliver competitive energy efficiency.
The work presented here, and the lessons that we learned are a first step
towards such a system that will be built with the next generation of ARM
cores implementing the ARMv8 architecture.
The contributions of this paper are:
• The design of the first ARM-based HPC cluster architecture, with
a complete performance evaluation, energy efficiency evaluation, and
comparison with state-of-the-art high-performance architectures.
Figure 1: (a) Q7 module; (b) Q7 carrier board.
TSMC's 40nm LPG performance-optimized process. Tegra2 features a number of application-specific accelerators targeted at the mobile market, such as a video and audio encoder/decoder and an image signal processor, but none of these can be used for general-purpose computation; they only contribute SoC area overhead. The GPU in Tegra2 does not support general programming models such as CUDA or OpenCL, so it cannot be used for HPC computation either. However, more advanced GPUs do support these programming models, and a variety of HPC systems use them to accelerate certain kinds of workloads.
Tegra2 is the central part of the Q7 module [16] (see Figure 1(a)). The module also integrates 1 GB of DDR2-667 memory, 16 GB of eMMC storage and a 100 MbE NIC (connected to Tegra2 through USB), and exposes PCIe connectivity to the carrier board. Using Q7 modules allows an easy upgrade when next-generation SoCs become available, and reduces the cost of replacement in case of failure.
Each Tibidabo node is built using a Q7-compliant carrier board [17] (see Figure 1(b)). Each board hosts one Q7 module, integrates a 1 GbE NIC (connected to Tegra2 through PCIe) and a µSD card adapter, and exposes other connectors and related circuitry that are not required for our HPC cluster but are needed for embedded software/hardware development (RS232, HDMI, USB, SATA, embedded keyboard controller, compass controller, etc.).
These boards are organized into blades (See Figure 1(c)), and each blade
hosts 8 nodes and a shared Power Supply Unit (PSU). In total, Tibidabo
has 128 nodes and it occupies 42 U standard rack space: 32 U for compute
blades, 4 U for interconnect switches and 2 U for the file server.
The interconnect has a tree topology and is built from 1 GbE 48-port
switches, with 1 to 8 Gb/s link bandwidth aggregation between switches.
Each node in the network is reachable within three hops.
The Linux kernel (version 2.6.32.2) and a single Ubuntu 10.10 filesystem image are hosted on an NFSv4 server with 1 Gb/s of bandwidth. Each node has its own local scratch storage on a 16 GB µSD Class 4 memory card. Tibidabo relies on MPICH2 v1.4.1 as its MPI library; at the time of writing, this was the only MPI distribution that worked reliably with the SLURM job manager on our cluster.
We use ATLAS 3.9.51 [18] as our linear algebra library. It was chosen due to the lack of a hand-optimized algebra library for our platform and for its ability to auto-tune to the underlying architecture. Applications that need an FFT library rely on FFTW v3.3.1 [19] for the same reasons.
3. Evaluation
In this section we present a performance and power evaluation of Tibidabo in two phases: first for a single compute chip in a node, and then for the whole cluster. We also provide a breakdown of single-node power consumption to understand the potential sources of inefficiency for HPC.
3.1. Methodology
For the measurement of energy efficiency (MFLOPS/W), and energy-to-
solution (joules) in single-core benchmarks, we used a Yokogawa WT230 power meter [20] with an effective sampling rate3 of 10 Hz, a basic precision of 0.1%, and RMS output values given as voltage/current pairs. We repeat our runs so that the acquisition interval is at least 10 minutes. The meter is
connected to act as an AC supply bridge and to directly measure power drawn
from the AC line. We have developed a measurement daemon that integrates
with the OS and triggers the power meter to start collecting samples when the
benchmark starts, and to stop when it finishes. Collected samples are then
used to calculate the energy-to-solution and energy efficiency. To measure the
energy efficiency of the whole cluster, the measurement daemon is integrated
with the SLURM [21] job manager, and after the execution of a job, power
measurement samples are included alongside the outputs from the job. In
this case, the measurement point is the power distribution unit of the entire
rack.
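For reference, the arithmetic applied to the collected samples is straightforward; the sketch below is our own illustration (not the actual daemon code) and assumes the samples are stored as (voltage, current) RMS pairs at the meter's 10 Hz effective rate.

    # Illustration only (not the measurement daemon itself): convert logged
    # RMS (voltage, current) pairs sampled at 10 Hz into energy-to-solution
    # (joules) and energy efficiency (MFLOPS/W).
    SAMPLE_PERIOD_S = 0.1  # 10 Hz effective sampling rate of the power meter

    def energy_to_solution(samples):
        """samples: iterable of (volts_rms, amps_rms) pairs covering the run."""
        return sum(v * i for v, i in samples) * SAMPLE_PERIOD_S  # joules

    def energy_efficiency_mflops_per_watt(flop_count, samples):
        """MFLOPS/W is equivalent to MFLOP executed per joule consumed."""
        return (flop_count / 1e6) / energy_to_solution(samples)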
For single-node energy efficiency, we have measured a single Q7 board and
compared the results against a power-optimized Intel Core i7 [22] laptop (Ta-
ble 1), whose processor chip has a thermal design power of 35 W. Due to the
different natures of the laptop and the development board, and in order to give a fair comparison in terms of energy efficiency, we measure only the power of the components that are necessary for executing the benchmarks; all unused devices are disabled. On our Q7 board, we disable Ethernet during benchmark execution. On the Intel Core i7 platform, graphic
output, sound card, touch-pad, bluetooth, WiFi, and all USB devices are
disabled, and the corresponding modules are unloaded from the kernel. The
hard disk is spun down, and the Ethernet is disabled during the execution
3 Internal sampling frequencies are not known. This is the frequency at which the meter outputs new pairs of samples.
Table 1: Experimental platforms

                        ARM Platform                        Intel Platform
SoC                     Tegra 2                             Intel Core i7-640M
Architecture            ARM Cortex-A9 (ARMv7-a)             Nehalem
Core count              Dual core                           Dual core
Operating frequency     1 GHz                               2.8 GHz
Cache sizes             L1: 32 KB I, 32 KB D per core       L1: 32 KB I, 32 KB D per core
                        L2: 1 MB I/D shared                 L2: 256 KB I/D per core
                                                            L3: 4 MB I/D shared
RAM                     1 GB DDR2-667                       8 GB DDR3-1066
                        32-bit single channel               64-bit dual channel
                        2666.67 MB/s per channel            8533.33 MB/s per channel
Compiler                GCC 4.6.2                           GCC 4.6.2
OS                      Linux 2.6.36.2 (Ubuntu 10.10)       Linux 2.6.38.11 (Ubuntu 10.10)
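The per-channel peak bandwidth figures in Table 1 follow from the memory transfer rate and bus width; the short sketch below shows that arithmetic (the 666.67 MT/s and 1066.67 MT/s transfer rates are our reading of the DDR2-667 and DDR3-1066 specifications).

    # Peak bandwidth per channel = transfer rate (MT/s) x bus width (bytes).
    def peak_bw_mb_s(transfers_mt_s, bus_width_bits):
        return transfers_mt_s * bus_width_bits / 8

    print(peak_bw_mb_s(2000 / 3, 32))   # DDR2-667, 32-bit channel  -> ~2666.67 MB/s
    print(peak_bw_mb_s(3200 / 3, 64))   # DDR3-1066, 64-bit channel -> ~8533.33 MB/s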
[Figure 2: Performance (GFLOPS) of the FPADD and FPMA microbenchmarks as a function of problem size (4K to 100M), with the L1 and L2 cache capacities marked.]
Q7 module memory on both platforms. Our results show that the DDR2-667 memory in our Q7 modules delivers a memory bandwidth of 1348 MB/s for copy and 720 MB/s for add, so the Cortex-A9 chip achieves 51% and 27% bandwidth efficiency, respectively. Meanwhile, the DDR3-1066 in the Core i7 delivers around 7000 MB/s for both copy and add, which is 41% bandwidth efficiency considering the two memory channels available.
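As a quick cross-check, the efficiency percentages quoted here and in Table 3 are simply the measured STREAM bandwidth divided by the theoretical peak from Table 1; a minimal sketch:

    # Bandwidth efficiency (%) = measured STREAM bandwidth / theoretical peak.
    def bw_efficiency(measured_mb_s, peak_mb_s):
        return 100.0 * measured_mb_s / peak_mb_s

    print(bw_efficiency(1348, 2666))    # Cortex-A9, copy -> ~50.6%
    print(bw_efficiency(720, 2666))     # Cortex-A9, add  -> ~27.0%
    print(bw_efficiency(7005, 17066))   # Core i7, add    -> ~41.0%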
Table 2: Dhrystone and SPEC CPU2006: Intel Core i7 and ARM Cortex-A9 performance and energy-to-solution comparison. SPEC CPU2006 results are normalized to Cortex-A9 and averaged across all benchmarks in the CINT2006 and CFP2006 subsets of the suite.

                    Dhrystone                            CINT2006             CFP2006
Platform            perf (DMIPS)  energy abs (J)  norm   perf     energy      perf      energy
Intel Core i7       19246         116.8           1.056  9.071    1.185       9.4735    1.172
ARM Cortex-A9       2466          110.8           1.0    1.0      1.0         1.0       1.0
Table 3: STREAM: Intel Core i7 and ARM Cortex-A9 memory bandwidth and bandwidth efficiency comparison.

                Peak mem.    STREAM perf (MB/s)            Energy (avg.)     Efficiency (%)
Platform        BW (MB/s)    copy   scale  add    triad    abs (J)   norm    copy    add
Intel Core i7   17066        6912   6898   7005   6937     481.5     1.059   40.5    41.0
ARM Cortex-A9   2666         1348   1321   720    662      454.8     1.0     50.6    27.0
processors in order to achieve competitive time-to-solution. More processing
cores in the system mean more need for scalability. In this section we evalu-
ate the performance, energy efficiency and scalability of the whole Tibidabo
cluster.
Figure 3 shows the parallel speed-up achieved by the High-Performance
Linpack benchmark (HPL) [27] and several other HPC applications. Fol-
lowing common practice, we perform a weak scalability test for HPL and a
strong scalability test for the rest.4 We have considered several widely used
MPI applications: GROMACS [28], a versatile package to perform molecular
dynamics simulations; SPECFEM3D GLOBE [29] that simulates continen-
tal and regional scale seismic wave propagation; HYDRO, a 2D Eulerian
code for hydrodynamics; and PEPC [30], an application that computes long-
range Coulomb forces for a set of charged particles. All applications are
compiled and executed out-of-the-box, without any hand tuning of the re-
spective source codes.
If an application cannot execute on a single node due to large memory requirements, we calculate the speed-up with respect to the smallest number of nodes that can handle the problem. For example, PEPC with a reference input set requires at least 24 nodes, so we plot the results assuming that the speed-up on 24 nodes is 24.
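A minimal sketch of this speed-up convention (the function and argument names are ours, for illustration only):

    # Speed-up anchored at the smallest node count that fits the problem,
    # e.g. for PEPC the speed-up on 24 nodes is defined to be 24.
    def speedup(time_on_n_nodes, time_on_base_nodes, base_nodes):
        return base_nodes * time_on_base_nodes / time_on_n_nodes

    print(speedup(10.0, 40.0, 24))  # perfect scaling from 24 to 96 nodes -> 96.0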
We have executed SPECFEM3D and HYDRO with an input set that is
able to fit into the memory of a single node, and they show good strong scaling
up to the maximum available number of nodes in the cluster. In order to
achieve good strong scaling with GROMACS, we have used two input sets,
both of which can fit into the memory of two nodes. We have observed
that scaling of GROMACS improves when the input set size is increased.
PEPC does not show optimal scalability because the input set that we can
fit on our cluster is too small to show the strong scalability properties of the
application [30].
HPL shows good weak scaling. In addition to HPL performance, we also measure power consumption, so that we can derive the MFLOPS/W metric used to rank HPC systems in the Green500 list. Our cluster achieves 120 MFLOPS/W (97 GFLOPS on 96 nodes, 51% HPL efficiency),
4 Weak scalability refers to the capability of solving a larger problem size in the same amount of time using a larger number of nodes (the problem size is limited by the available memory in the system). Strong scalability, on the other hand, refers to the capability of solving a fixed problem size in less time while increasing the number of nodes.
[Figure 3: Parallel speed-up of HP Linpack, PEPC, HYDRO, GROMACS (small and big inputs) and SPECFEM3D against the ideal speed-up, for 4 to 96 nodes.]
competitive with AMD Opteron 6128 and Intel Xeon X5660-based clusters, but 19x lower than the most efficient GPU-accelerated systems, and 21x lower than Intel Xeon Phi (November 2012 Green500 #1). The reasons for the low HPL efficiency include the lack of architecture-specific tuning of the algebra library and the lack of optimization in the MPI communication stack for ARM cores using Ethernet.
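These figures are mutually consistent; the back-of-the-envelope check below assumes a per-core peak of 1 GFLOPS at 1 GHz (one double-precision floating-point operation per cycle), which is our assumption rather than a number stated above.

    # Back-of-the-envelope check of the reported HPL figures.
    nodes, cores_per_node = 96, 2
    peak_gflops = nodes * cores_per_node * 1.0   # assumed 1 GFLOPS per core at 1 GHz
    hpl_gflops = 97.0
    print(hpl_gflops / peak_gflops)              # ~0.51 -> the quoted 51% HPL efficiency
    print(hpl_gflops * 1000 / 120)               # ~808 W implied total cluster power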
Figure 4: Power consumption breakdown of the main components of a compute node (in watts): Core1 0.26, Core2 0.26, L2 cache 0.1, Memory 0.7, Eth1 0.9, Eth2 0.5, Other 5.68. The compute node power consumption while executing HPL is 8.4 W; this power is computed by measuring the total cluster power and dividing it by the number of nodes.
and voltage.
Figure 4 shows the average power breakdown of the major components in
a compute node over the total compute node power during an HPL run on the
entire cluster. As can be seen, the total measured power on the compute node
is significantly higher than the sum of the major parts. Other on-chip and on-board peripherals in the compute node are not used for computation, so they are assumed to be shut off when idle. However, the large unaccounted-for power (labeled as Other) amounts to more than 67% of the total power. That part of the power includes on-board low-dropout (LDO) voltage regulators, on-board multimedia devices with related circuitry, the corresponding share of the blade PSU losses, and on-chip power sinks. Figure 5 shows the Tegra2 chip die. The white-outlined area shows the chip components that are used by HPC applications. This area is less than 35% of the total chip area. If the rest of the chip area is not properly power and clock gated, it would leak power even though it is not being used, thus also contributing to the Other part of the compute node power.
Although the estimations in this section are not exact, we actually overestimate the power consumption of some of the major components when taking their power figures from multiple data sources. Therefore, our analysis shows that up to 16% of the power goes to the computation components: cores (includ-
Figure 5: Tegra2 die: the area marked with a white border contains the components actually used by HPC applications; it represents less than 35% of the total chip area. Source: www.anandtech.com
overheads in order to glue them together to create a large system with a large
number of cores. Also, although Cortex-A9 is the leader in mobile computing for its high performance, it trades off some performance for power savings to improve battery life. Cortex-A15 is the highest-performing processor in the ARM family and includes features more suitable for HPC. Therefore, in this section we evaluate cluster configurations with higher multicore density (more cores per chip) and also project the performance and energy efficiency we would obtain if we used Cortex-A15 cores instead. To complete the
study, we evaluate multiple frequency operating points to show how frequency
affects performance and energy efficiency.
For our projections, we use an analytical power model and the DIMEMAS
cluster simulator [35]. DIMEMAS performs high-level simulation of the exe-
cution of MPI applications on cluster systems. It uses a high-level model of
the compute nodes—modeled as symmetric multi-processing (SMP) nodes—
to predict the execution time of computation phases. At the same time, it
simulates the cluster interconnect to account for MPI communication de-
lays. The interconnect and computation node models accept configuration
parameters such as interconnect bandwidth, interconnect latency, number of
links, number of cores per computation node, core performance ratio, and
memory bandwidth. DIMEMAS has been used to model the interconnect of
the MareNostrum supercomputer with an accuracy within 5% [36], and its
MPI communication model has been validated showing an error below 10%
for the NAS benchmarks [37].
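To make the role of these parameters concrete, the toy replay loop below (our own illustration, not DIMEMAS or its trace format) scales computation bursts by a core performance ratio and charges each MPI transfer its latency plus size over bandwidth.

    # Toy illustration of a high-level MPI replay model (not DIMEMAS itself).
    def replay(trace, cpu_ratio, latency_s, bandwidth_bytes_s):
        t = 0.0
        for kind, value in trace:            # ('compute', seconds) or ('send', bytes)
            if kind == 'compute':
                t += value / cpu_ratio       # faster cores shorten computation bursts
            else:
                t += latency_s + value / bandwidth_bytes_s
        return t

    # 1 GbE ~ 125e6 bytes/s, 50 us latency; a 4x faster CPU shortens computation
    # only, so communication weighs relatively more, as discussed for Figure 6.
    trace = [('compute', 1.0), ('send', 8 * 1024**2), ('compute', 1.0)]
    print(replay(trace, 1.0, 50e-6, 125e6), replay(trace, 4.0, 50e-6, 125e6))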
The input to our simulations is a trace obtained from an HPL execution
on Tibidabo. As an example, the PARAVER [38] visualizations of the input and output traces of a DIMEMAS simulation are shown in Figure 6. The
chart shows the activity of the application threads (vertical axis) over time
(horizontal axis). Figure 6(a) shows the visualization of the original execu-
tion on Tibidabo, and Figure 6(b) shows the visualization of the DIMEMAS
simulation using a configuration that mimics the characteristics of our ma-
chine (including the interconnect characteristics) except for the CPU speed
which is, as an example, 4 times faster. As it can be observed in the real
execution, threads do not start communication all at the same time, and
thus have computation in some threads overlapping with communication in
others. In the DIMEMAS simulation, where CPU speed is increased 4 times,
computation phases (in grey) become shorter and all communication phases
get closer in time. However, the application shows the same communication
pattern and communications take a similar time as that in the original ex-
Figure 6: An example of a DIMEMAS simulation, where each row presents the activity of a single processor: it is either in a computation phase (grey) or in MPI communication (black). (a) Part of the original HPL execution on Tibidabo; (b) the DIMEMAS simulation of the same trace with a 4x faster CPU.
Table 4: Core architecture, performance and power model parameters, and results for performance and energy efficiency of clusters with 16 cores per node.

Configuration #                1           2           3           4            5
CPU input parameters
  Core architecture            Cortex-A9   Cortex-A9   Cortex-A9   Cortex-A15   Cortex-A15
  Frequency (GHz)              1.0         1.4         2.0         1.0          2.0
  Performance over A9@1GHz     1.0         1.2         1.5         2.6          4.5
  Power over A9@1GHz           1.0         1.1         1.8         1.54         3.8
Per-node power figures for the 16 cores per chip configuration [W]
  CPU cores                    4.16        4.58        7.49        7.64         18.85
  L2 cache                     0.8         0.88        1.44        (integrated with cores)
  Memory                       5.6         5.6         5.6         5.6          5.6
  Ethernet NICs                1.4         1.4         1.4         1.4          1.4
Aggregate power figures [W]
  Per node                     17.66       18.16       21.63       20.34        31.55
  Total cluster                211.92      217.87      259.54      244.06       378.58
[Figure 7: Performance relative to 1 GHz operation as a function of core frequency, compared against perfect frequency scaling: Cortex-A9 from 456 MHz to 2000 MHz (left) and Cortex-A15 from 500 MHz to 2000 MHz (right).]
ber of cores in each compute node. When we increase the number of cores per compute node, the number of nodes is reduced, thus reducing integration overhead and pressure on the interconnect (i.e. fewer boards, cables and switches). To model this effect, our analytical model is as follows:
From the power breakdown of a single node presented in Figure 4, we
subtract the power corresponding to the CPUs and the memory subsystem
(L2 + memory). The remaining power in the compute node is considered to
be board overhead, and does not change with the number of cores. The board
overhead is part of the power of a single node, to which we add the power
of the cores, L2 cache and memory. For each configuration, the CPU core
power is multiplied by the number of cores per node. Same as in Tibidabo,
our projected cluster configurations are assumed to have 0.5 MB of L2 cache
per core and 500 MB of RAM per core—this assumption allows for simple
scaling to large numbers of cores. Therefore, the L2 cache power (0.1 W/MB)
and the memory power (0.7 W/GB) are multiplied both by half the number
of cores. The L2 cache power for the Cortex-A9 configurations is also factored for frequency, for which we use the core power ratio. The L2 in Cortex-A15
is part of the core macro, so the core power already includes the L2 power.
For both Cortex-A9 and Cortex-A15, the CPU macro power includes the
L1 caches, cache coherence unit and L2 controller. Therefore, the increase in
power due to a more complex L2 controller and cache coherence unit for a
larger multicore is accounted for when that power is factored by the number of cores. The memory power is overestimated, so the increased power of a more complex memory controller that scales to a higher number of cores is also covered for the same reason. Furthermore, a Cortex-A9
system cannot address more than 4 GB of memory so, strictly speaking,
Cortex-A9 systems with more than 4 GB are not realistic. However, we include configurations with higher core counts per chip to show what the performance and energy efficiency would be if Cortex-A9 included large physical address extensions, as Cortex-A15 does, to address up to 1 TB of memory [40].
The power model is summarized in these equations:

P_pred = (n_tc / n_cpc) × [ P_over / n_nin + P_eth + n_cpc × ( P_mem / 2 + p_r × ( P_A9_1G + P_L2$ / 2 ) ) ]    (1)

P_over = P_tot − n_nin × ( P_mem + 2 × P_A9_1G + P_L2$ + P_eth )    (2)

where n_tc is the total number of cores, n_cpc the number of cores per chip, n_nin the number of compute nodes in Tibidabo, p_r the core power ratio relative to a Cortex-A9 at 1 GHz, P_tot the total measured cluster power, and P_A9_1G, P_L2$, P_mem and P_eth the per-node power of one Cortex-A9 core at 1 GHz, the L2 cache, the memory and the Ethernet NICs, respectively.
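The sketch below is our re-derivation of Eqs. (1)-(2) in code, using the per-node figures from Figure 4 and the cluster size from footnote 5; it reproduces the Table 4 totals to within rounding.

    # Analytical power model of Eqs. (1)-(2); all powers in watts (Figure 4).
    P_A9_1G, P_L2, P_MEM, P_ETH = 0.26, 0.1, 0.7, 1.4
    P_NODE_MEASURED = 8.4            # per-node power during HPL (Figure 4)
    N_NODES = 96                     # compute nodes used (footnote 5)
    N_TOTAL_CORES = 192              # total MPI processes, kept constant

    # Eq. (2): integration overhead of the whole cluster.
    P_over = N_NODES * P_NODE_MEASURED - N_NODES * (P_MEM + 2 * P_A9_1G + P_L2 + P_ETH)

    def predicted_cluster_power(cores_per_chip, power_ratio):
        """Eq. (1); power_ratio is the core power relative to a Cortex-A9 at 1 GHz."""
        per_node = (P_over / N_NODES + P_ETH
                    + cores_per_chip * (P_MEM / 2 + power_ratio * (P_A9_1G + P_L2 / 2)))
        return (N_TOTAL_CORES / cores_per_chip) * per_node

    print(predicted_cluster_power(16, 3.8))  # 16-core Cortex-A15 @ 2 GHz -> ~378 W (Table 4: 378.58)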
5 Out of 128 nodes with a total of 256 processors, 4 nodes are used as login nodes and 28 are unstable. There are two major identified sources of instability: cooling issues and problems with the PCIe driver, which drops the network connection on the problematic nodes.
new core performance and multicore density, accounting for synchronizations
and communication delays. Figure 8 shows the results. In all simulations we
keep a network bandwidth of 1 Gb/s (1GbE) and a memory bandwidth of
1400 MB/s (from peak bandwidth results using STREAM).
The results show that, as we increase the number of cores per node (at
the same time reducing the total number of nodes), performance does not
show further degradation with the 1 GbE interconnect until we reach the performance level of Cortex-A15. None of the Cortex-A15 configurations reaches its maximum speed-up due to interconnect limitations. The configuration with two Cortex-A15 cores at 1 GHz per node scales worst because the interconnect is the same as in Tibidabo. With a higher number of cores per node, we reach 96% of the ideal speed-up of Cortex-A15 at 1 GHz. The further performance increase of Cortex-A15 at 2 GHz exposes additional limitations due to interconnect communication, reaching 82% of the ideal speed-up with two cores per node and 91% with sixteen.
Figure 8: Projected speed-up for the evaluated cluster configurations (2, 4, 8 and 16 cores per chip; Cortex-A9 at 1, 1.4 and 2 GHz; Cortex-A15 at 1 and 2 GHz). The total number of MPI processes is constant across all experiments.
Increasing computation density potentially improves MPI communication because more processes communicate on chip rather than over the network, and memory bandwidth is larger than interconnect bandwidth. Building a machine larger than Tibidabo, with faster mobile cores and a higher core count, will require a faster interconnect. In Section 4.2 we explore the interconnect requirements when using faster mobile cores.
[Figure 9: Projected energy efficiency (MFLOPS/W) as a function of the number of cores per node (2, 4, 8 and 16) for Cortex-A9 at 1, 1.4 and 2 GHz and Cortex-A15 at 1 and 2 GHz.]
The benefit of increased computation density (more cores per node) is ac-
tually the reduction of the integration overhead and the resulting improved
energy efficiency of the system (Figure 9). The results show that, by increasing the computation density with Cortex-A9 cores running at 2 GHz, we can achieve an energy efficiency of 563 MFLOPS/W using 16 cores per node (a ∼4.7x improvement). The configuration with 16 Cortex-A15 cores per node has an energy efficiency of 1004 MFLOPS/W at 1 GHz and 1046 MFLOPS/W at 2 GHz (a ∼8.7x improvement).
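As a rough cross-check of these numbers (using our reading of the projected speed-ups in Figure 8 together with the power totals from Table 4):

    # Projected MFLOPS/W ~ (projected speed-up x measured 97 GFLOPS) / projected power.
    def mflops_per_watt(speedup, cluster_power_w, base_gflops=97.0):
        return speedup * base_gflops * 1000.0 / cluster_power_w

    print(mflops_per_watt(1.5, 259.54))         # 16x Cortex-A9 @ 2 GHz   -> ~560 MFLOPS/W
    print(mflops_per_watt(0.91 * 4.5, 378.58))  # 16x Cortex-A15 @ 2 GHz  -> ~1050 MFLOPS/W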
Using these models, we project the energy efficiency of our cluster if it used higher-performance cores and included more cores per node. However, all other features remain the same, so inefficiencies due to the use of non-optimized development boards, the lack of software optimization, and the lack of vector double-precision floating-point execution units are accounted for in the model. Even with all these inefficiencies, our projections show that such a cluster would be competitive in terms of energy efficiency with Sandy Bridge
and GPU-accelerated systems in the Green500 list, which shows promise for
future ARM-based platforms actually optimized for HPC.
Figure 10: Interconnection network impact for the cluster configuration with 16 cores per node: (a) relative performance as a function of network bandwidth (0.1 to 10 Gb/s); (b) relative performance as a function of network latency (0 to 500 µs).
4.2. Interconnection network requirements
Cluster configurations with higher-performance cores and more cores per node put higher pressure on the interconnection network. The result of increasing the node computation power while maintaining the same network bandwidth is that the interconnect bandwidth-to-FLOPS ratio decreases. This may lead to the network becoming a bottleneck. To evaluate this effect, we carry
out DIMEMAS simulations of the evaluated cluster configurations using a
range of network bandwidth (Figure 10(a)) and latency values (Figure 10(b)).
The baseline for these results is the cluster configuration with Cortex-A9 at
1 GHz, 1 Gb/s of bandwidth and 50 µs of latency.
The results in Figure 10(a) show that a network bandwidth of 1 Gb/s is
sufficient for the evaluated cluster configurations with Cortex-A9 cores and
the same size as Tibidabo. The Cortex-A9 configurations show a negligible
improvement with 10 Gb/s interconnects. On the other hand, configurations
with Cortex-A15 do benefit from an increased interconnect bandwidth: the
1 GHz configuration reaches its maximum at 3 Gb/s and the 2 GHz config-
uration at 8 Gb/s.
The latency evaluation in Figure 10(b) shows the relative performance
with network bandwidths of 1 Gb/s and 10 Gb/s for a range of latencies
normalized to 50 µs. An ideal zero latency does not show a significant im-
provement over 50 µs and increasing the latency in a factor of ten, only has a
significant impact on the Cortex-A15 at 2 GHz configuration. Therefore, the
latency of Tibidabo’s Ethernet network, although being larger than that of
specialized and custom networks used in supercomputing, is small enough for
all the evaluated cluster configurations which have the same size as Tibidabo.
the CPU cores. Although the core itself provides a theoretical peak energy
efficiency of 2-4 GFLOPS/W, this design imbalance results in the measured
HPL energy efficiency of 120 MFLOPS/W.
In order to achieve system balance, we identified two fundamental im-
provements to put into practice. The first is to make use of higher-end ARM multicore chips like Cortex-A15, which provides an architecture more suitable for HPC while maintaining comparable single-core energy efficiency. The
second is to increase the compute density by adding more cores to the
chip. The recently announced ARM CoreLink CCN-504 cache coherence
network [41, 42] scales up to 16 cores and is targeted to high-performance
architectures such as Cortex-A15 and next-generation 64-bit ARM proces-
sors. In a system that puts these design improvements together, the CPU core power is better balanced with that of other components such as the memory. Our projections based on ARM Cortex-A15 processors with
higher multicore integration density show that such systems are a promising
alternative to current designs built from high performance parts. For exam-
ple, a cluster of the same size as Tibidabo, based on 16-core ARM Cortex-A15
chips at 2 GHz would provide 1046 MFLOPS/W.
A well known technique to improve energy efficiency is the use of SIMD
units. As an example, BlueGene/Q uses 256-bit-wide vectors for quad double-
precision floating-point computations, and the Intel MIC architecture uses
512-bit-wide SIMD units. Both Cortex-A9 and Cortex-A15 processors implement the ARMv7-A architecture, which only supports single-precision SIMD computation. Most HPC applications require calculations in double precision, so they cannot exploit the current ARMv7 SIMD units. The ARMv8 archi-
tecture specification includes double-precision floating-point SIMD, so fur-
ther energy efficiency improvements for HPC computation are expected from
next-generation ARMv8 chips featuring those SIMD units.
In all of our experiments, we ran the benchmarks out of the box and did not hand-tune any of the codes. Libraries and compilers include architecture-dependent optimizations that, in the case of ARM processors, target mobile computing. This leads to two different scenarios: the optimizations of libraries used in HPC, such as ATLAS or MPI, for ARM processors are one step behind; and optimizations in compilers, operating systems and drivers target mobile computing, thus trading off performance for quality of service or battery life. We have put together an HPC-ready software stack for Tibidabo, but we have not yet put effort into optimizing its components for HPC computation. Further energy efficiency improvements are ex-
pected when critical components such as MPI communication functions are optimized for ARM-based platforms, or when the Linux kernel is stripped of those components not used by HPC applications.
As shown in Figure 5, the Tegra2 chip includes a number of application-
specific accelerators that are not programmable using standard industrial
programming models such as CUDA or OpenCL. If those accelerators were
programmable and used for HPC computation, that would reduce the in-
tegration overhead of Tibidabo. The use of SIMD or SIMT programmable
accelerators is widely adopted in supercomputers, such as those including
general-purpose programmable GPUs (GPGPUs). Although the effective performance of GPGPUs is between 40% and 60% of their peak, their efficient compute-targeted design provides them with high energy efficiency. GPUs in mobile
SoCs are starting to support general-purpose programming. One example is
the Samsung Exynos5 [43] chip, which includes two Cortex-A15 cores and
an OpenCL-compatible ARM Mali T-604 GPU [44]. This design, apart from
providing the improved energy efficiency of GPGPUs, has the advantage of
having the compute accelerator close to the general purpose cores, thus re-
ducing data transfer latencies. Such an on-chip programmable accelerator
is an attractive feature to improve energy efficiency in an HPC system built
from low-power components.
Another important issue to keep in mind when designing this kind of system is that the memory bandwidth-to-FLOPS ratio must be maintained. Currently available ARM-based platforms make use of either memory technology that lags behind top-class standards (e.g., many platforms use DDR2 memory instead of DDR3), or memory technology targeting low
power (e.g., LPDDR2). For a higher-performance node with a higher num-
ber of cores and including double-precision floating-point SIMD units, cur-
rent memory choices in ARM platforms may not provide enough bandwidth,
so higher-performance memories must be adopted. Low-power ARM-based
products including DDR3 are already announced [45] and the recently an-
nounced DMC-520 [41] memory controller enables DDR3 and DDR4 memory
for ARM processors. These upcoming technologies are indeed good news for
low-power HPC computing. Moreover, package-on-package memories, which reduce the distance between the computation cores and the memory and increase pin density, can be used to include several memory controllers and provide higher memory bandwidth.
Finally, Tibidabo employs 1 Gbit Ethernet for the cluster interconnect.
Our experiments show that 1 GbE is not a performance limiting factor for
a cluster of Tibidabo's size employing Cortex-A9 processors at up to 2 GHz and
for compute-bound codes such as HPL. However, when using faster mobile
cores such as Cortex-A15, a 1 GbE interconnect starts becoming a bottleneck.
Current ARM-based mobile chips include peripherals targeted to the mobile market and thus do not provide enough bandwidth, or are not compatible with faster network technologies used in supercomputing, such as 10 GbE or InfiniBand. However, the use of 1 GbE is extensive in supercomputing (32% of the systems in the November 2012 Top500 list use 1 GbE interconnects), and potential communication bottlenecks are in many cases addressable in software [46]. Therefore, although support for a high-performance network technology would be desirable for ARM-based HPC systems, using 1 GbE may not be a limitation as long as the communication libraries are optimized for Ethernet communication and the communication patterns in HPC applications are tuned with the network capabilities in mind.
6. Related work
One of the first attempts to use low-power commodity processors in HPC systems was GreenDestiny [47]. It relied on the Transmeta TM5600 processor, and although it looked promising as a leading platform in energy efficiency, a large-scale HPC system was never produced. Its computing-to-space ratio was also leading at the time.
MegaProto systems [48] were another approach in this direction. They
were based on more advanced versions of Transmeta’s processors, namely
TM5800 and TM8820. These systems achieved good energy efficiency for the time, reaching up to 100 MFLOPS/W with 512 processors. Like its predecessor, MegaProto never made it into a com-
mercial HPC product.
Roadrunner [49] topped the Top500 list in June 2008 as the first system to break the petaflop barrier. It uses IBM PowerXCell 8i [50] processors together with
dual-core AMD Opteron processors. The Cell/B.E. architecture emphasizes
performance per watt by prioritizing bandwidth over latency and favours
peak computation capabilities over simplifying programmability. In the June
2008 Green500 list, it held third place with 437.43 MFLOPS/W, behind two
smaller homogeneous Cell/B.E.-based clusters.
There has been a proposal to use the Intel Atom family of processors
in clusters [51]. The platform is built and tested with a range of different
types of workloads, but those target data centres rather than HPC. One of the main contributions of this work is determining the types of workloads for which Intel Atom can compete in terms of energy efficiency with a commodity Intel Core i7. A follow-up of this work [52] concludes that a cluster made homogeneously of low-power nodes (Intel Atom) is not suited for complex database loads. The authors propose future research on heterogeneous cluster architectures using low-power nodes combined with high-performance ones.
The use of low-power processors for scale-out systems was assessed in a
study by Stanley-Marbell and Caparros-Cabezas [53]. They did a compara-
tive study of three different low-power architecture implementations: x86-64
(Intel Atom D510MO), Power Architecture e500 (Freescale P2020RDB) and
ARM Cortex-A8 (TI DM3730, BeagleBoard xM). The authors presented a
study with performance, power and thermal analyses. One of their findings
is that a single core Cortex-A8 platform is suitable for energy-proportional
computing, meaning very low idle power. However, it lacks sufficient comput-
ing resources to exploit coarse-grained task-level parallelism and be a more
energy efficient solution than the dual-core Intel Atom platform. They also
concluded that a large fraction of the platforms’ power consumption (up to
67% for the Cortex-A8 platform) cannot be attributed to a specific compo-
nent, despite the use of sophisticated techniques such as thermal imaging.
The AppleTV cluster [54, 55] is an effort to assess the performance of
the ARM Cortex-A8 processor in a cluster environment running HPL. The
authors built a small cluster with four nodes based on AppleTV devices with
a 100 MbE network. They achieved 160.4 MFLOPS with an energy efficiency
of 16 MFLOPS/W. They also compared the memory bandwidth against a BeagleBoard xM platform and attributed the performance differences to different design decisions in the memory subsystems. In our system, we
employ more recent low-power core architectures and show how improved
floating-point units, memory subsystems, and an increased number of cores
can significantly improve the overall performance and energy efficiency, while
still maintaining a small power footprint.
The BlueGene family of supercomputers has been around since 2004 in
several generations [56, 57, 58]. BlueGene systems are composed of em-
bedded cores integrated on ASIC together with architecture-specific fabrics.
BlueGene/L, the first such system, is based on the PowerPC 440, with a
theoretical peak performance of 5.6 GFLOPS. BlueGene/P increased the
peak performance of the compute card to 13.6 GFLOPS by using 4-core
28
PowerPC 450. BlueGene/Q-based clusters are one of the most power effi-
cient HPC machines nowadays delivering around 2.1 GFLOPS/W. A Blue-
Gene/Q compute chip includes 16 4-way SMT in-order cores, each one with
a 256-bit-wide quad double-precision SIMD floating-point unit, delivering
a total of 204.8 GFLOPS per chip on a power budget of around 55 W
(3.7 GFLOPS/W).
The most energy-efficient machine in the November 2012 Green500 list
is based on the Intel Xeon Phi coprocessor. It has a design similar to Blue-
Gene/Q: 4-way SMT in-order cores with wide SIMD units, but it integrates more cores per chip (60) and the SIMD units are 512 bits wide. The use
of a more recent technology process (22nm instead of the 45nm of Blue-
Gene/Q) allows this larger integration and results in an energy efficiency of
2.5 GFLOPS/W for the number one machine in the Green500 list.
There is a lot of hype about the use of low-power ARM processors in
servers. Currently, the most exciting, commercially available approaches are
the ones from Boston Ltd. [10], Penguin Computing [59] and EXXACT Cor-
poration [60]. They offer solutions based on the Calxeda ECX-1000 SoC [6]
with up to 48 server nodes (192 cores) and up to 4 GB of memory per
server node (192 GB in total) in a 2U enclosure. HP went one step further
with Project Moonshot [9], where they introduce the Redstone Development
Server Platform [61]. It has a compute integration option with up to 288
Calxeda SoCs in 4U.
The Calxeda ECX-1000 SoC is built for server workloads: it is a quad-core
chip with Cortex-A9 cores running at 1.4 GHz, 4 MB of L2 cache with ECC
protection, a 72-bit memory controller with ECC support, five 10 Gb lanes for
connecting with other SoCs, support for 1 GbE and 10 GbE, and SATA 2.0
controllers with support for up to five SATA disks. Unlike ARM-based mobile
SoCs, the ECX-1000 does not have a power overhead in terms of unnecessary on-chip resources and thus seems better suited for energy-efficient HPC. However, to the best of our knowledge, there are neither reported energy-efficiency numbers for HPL running in a cluster environment (only single-node executions) nor scalability tests of scientific applications for any of the aforementioned enclosures.
AppliedMicro announced an ARM server platform based on their own ARMv8-based SoC design, the X-Gene [8]. There are still no enclosures announced, and no benchmark reports, but we expect better performance than from ARMv7-based enclosures, due to an improved CPU core architecture and three levels of cache hierarchy.
7. Conclusions
In this paper we presented Tibidabo, the world’s first ARM-based HPC
cluster, for which we set up an HPC-ready software stack to execute HPC
applications widely used in scientific research such as SPECFEM3D and
GROMACS. Tibidabo was built using commodity off-the-shelf components
that are not designed for HPC. Nevertheless, our prototype cluster achieves 120 MFLOPS/W on HPL, competitive with AMD Opteron 6128 and Intel Xeon X5660-based systems. We identified a set of inefficiencies in our design, given that its components target mobile computing. The main inefficiency
is that the power taken by the components required to integrate small low-
power dual-core processors offsets the high energy efficiency of the cores
themselves. We performed a set of simulations to project the energy efficiency of our cluster had we been able to use chips featuring higher-performance ARM cores and integrating a larger number of them.
Based on these projections, a cluster configuration with 16-core Cortex-
A15 chips would be competitive with Sandy Bridge-based homogeneous sys-
tems and GPU-accelerated heterogeneous systems in the Green500 list.
We also explained the major issues and how they should evolve or be addressed for future clusters built from low-power ARM processors. These
issues include, apart from the aforementioned integration overhead, the lack
of optimized software, the use of mobile-targeted memories, the lack of
double-precision floating-point SIMD units, and the lack of support for high-
performance interconnects. Based on our recommendations, an HPC-ready
ARM processor design should include a larger number of cores per chip (e.g.,
16) and use a core microarchitecture suited for high-performance, like the
one in Cortex-A15. It should also include double-precision floating-point
SIMD units, support for multiple memory controllers servicing DDR3 or
DDR4 memory modules, and probably support for a higher-performance
network, such as InfiniBand, although Gigabit Ethernet may be sufficient for
many HPC applications. On the software side, libraries, compilers, drivers
and operating systems need tuning for high performance, and architecture-
dependent optimizations for ARM processor chips.
Recent announcements show an increasing interest in server-class low-
power systems that may benefit HPC. The new 64-bit ARMv8 ISA improves
some features that are important for HPC. First, using 64-bit addresses removes the 4 GB memory limitation per application. This allows more mem-
ory per node, so one process can compute more data locally, requiring less
network communication. Also, ARMv8 increases the size of the general-
purpose register file from 16 to 32 registers. This reduces register spilling
and provides more room for compiler optimization. It also improves floating-point performance by extending the NEON instructions with fused multiply-add and multiply-subtract, and with cross-lane vector operations. More importantly, double-precision floating-point is now part of NEON. Altogether, this provides a theoretical peak double-precision floating-point performance of 4 FLOPS/cycle for a fully-pipelined SIMD unit. As an example, ARM
Cortex-A57, the highest performance ARM implementation of the ARMv8
ISA, includes two NEON units, totalling 8 double-precision floating-point
FLOPS/cycle—this is 4 times better than ARM Cortex-A15 and equivalent
to Intel implementations with one AVX unit.
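The peak figures quoted above follow directly from the SIMD width and the fused multiply-add operation; a short worked check:

    # A 128-bit NEON unit holds two 64-bit double-precision lanes, and a fused
    # multiply-add counts as two FLOPs per lane.
    lanes = 128 // 64
    flops_per_fma = 2
    print(lanes * flops_per_fma)        # 4 FLOPS/cycle for one fully-pipelined unit
    print(2 * lanes * flops_per_fma)    # 8 FLOPS/cycle with Cortex-A57's two NEON units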
These encouraging industrial roadmaps, together with research initia-
tives such as the EU-funded Mont-Blanc project [62], may lead ARM-based
platforms to fulfil the recommendations given in this paper in the near future.
8. Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments. In addition, the authors would like to thank Bernard Ortiz de Montellano and Paul M. Carpenter for their help in improving the quality of this paper.
This project and the research leading to these results have received fund-
ing from the European Union’s Seventh Framework Programme [FP7/2007-
2013] under grant agreement no 288777. Part of this work is supported by
the PRACE project (European Union funding under grants RI-261557 and
RI-283493).
References
[1] D. Göddeke, D. Komatitsch, M. Geveler, D. Ribbrock, N. Rajovic,
N. Puzovic, A. Ramirez, Energy efficiency vs. performance of the numer-
ical solution of PDEs: an application study on a low-power ARM-based
cluster, Journal of Computational Physics 237 (2013) 132–150.
07-12/new_mexico_to_pull_plug_on_encanto_former_top_5_
supercomputer.html, accessed: 5-May-2013 (7 2012).
[3] HPCwire, Requiem for Roadrunner, http://www.hpcwire.com/
hpcwire/2013-04-01/requiem_for_roadrunner.html, accessed: 5-
May-2013 (4 2013).
[4] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler,
D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely,
T. Sterling, R. S. Williams, K. Yelick, P. Kogge, Exascale Comput-
ing Study: Technology Challenges in Achieving Exascale Systems, in:
DARPA Technical Report, 2008.
[5] ARM Ltd., Cortex-A9 Processor, http://www.arm.com/products/
processors/cortex-a/cortex-a9.php, accessed: 5-May-2013.
[6] Calxeda, EnergyCore™ processors, http://www.calxeda.com/technology/products/processors/, accessed: 5-May-2013.
[7] Marvell, Marvell Quad-Core ARMADA XP Series SoC,
http://www.marvell.com/embedded-processors/armada-
xp/assets/Marvell-ArmadaXP-SoC-product%20brief.pdf, accessed
5-May-2013.
[8] AppliedMicro, AppliedMicro X-Gene, http://www.apm.com/
products/x-gene/, accessed: 5-May-2013.
[9] HP, HP Labs research powers project Moonshot, HPs new archi-
tecture for extreme low-energy computing, http://www.hpl.hp.com/
news/2011/oct-dec/moonshot.html, accessed: 15-May-2013.
[10] Boston Ltd., Boston Viridis - ARM Microservers, http://www.boston.
co.uk/solutions/viridis/default.aspx, accessed: 5-May-2013.
[11] ARM Ltd., VFPv3 Floating Point Unit, http://www.arm.com/
products/processors/technologies/vector-floating-point.php,
accessed: 5-May-2013.
[12] ARM Ltd., The ARM® NEON™ general-purpose SIMD engine, http://www.arm.com/products/processors/technologies/neon.php, accessed: 5-May-2013.
[13] J. Turley, Cortex-A15 “Eagle” flies the coop, Microprocessor Report
24 (11) (2010) 1–11.
[15] G. Bell, Bell’s law for the birth and death of computer classes, Commu-
nications of ACM 51 (1) (2008) 86–94.
[21] A. Yoo, M. Jette, M. Grondona, SLURM: Simple Linux utility for resource management, in: Job Scheduling Strategies for Parallel Processing, Springer, 2003, pp. 44–60.
[25] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, A. Ramirez, The
low power architecture approach towards exascale computing, Journal
of Computational Science.
[26] J. D. McCalpin, Memory bandwidth and machine balance in current
high performance computers, IEEE Computer Society Technical Com-
mittee on Computer Architecture (TCCA) Newsletter (1995) 19–25.
[27] J. Dongarra, P. Luszczek, A. Petitet, The LINPACK Benchmark: past,
present and future, Concurrency and Computation: Practice and Expe-
rience 15 (9) (2003) 803–820.
[28] H. Berendsen, D. van der Spoel, R. van Drunen, Gromacs: A message-
passing parallel molecular dynamics implementation, Computer Physics
Communications 91 (1) (1995) 43–56.
[29] D. Komatitsch, J. Tromp, Introduction to the spectral element method
for three-dimensional seismic wave propagation, Geophysical Journal
International 139 (3) (1999) 806–822.
[30] DEISA 2: Distributed European Infrastructure for Supercomputing Applications; Maintenance of the DEISA Benchmark Suite in the Second Year, available online at: www.deisa.eu.
[31] ARM Ltd., ARM Announces 2GHz Capable Cortex-A9 Dual Core
Processor Implementation, http://www.arm.com/about/newsroom/
25922.php, accessed: 5-May-2013.
[32] Intel Corporation, Intel® 82574 GbE Controller Family, http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf, accessed: 5-May-2013.
[33] SMSC, LAN9514/LAN9514i: USB 2.0 Hub and 10/100 Eth-
ernet Controller, http://www.smsc.com/media/Downloads_Public/
Data_Sheets/9514.pdf, accessed: 5-May-2013.
[34] Micron, DDR2 SDRAM System-Power Calculator, http://www.
micron.com/support/dram/power_calc.html, accessed: 5-May-2013.
[35] R. Badia, J. Labarta, J. Gimenez, F. Escale, DIMEMAS: Predicting
MPI applications behavior in Grid environments, in: Workshop on Grid
Applications and Programming Tools (GGF8), Vol. 86, 2003.
[36] A. Ramirez, O. Prat, J. Labarta, M. Valero, Performance Impact of the
Interconnection Network on MareNostrum Applications, in: 1st Work-
shop on Interconnection Network Architectures: On-Chip, Multi-Chip,
2007.
[45] Calxeda, Calxeda Quad-Node EnergyCard, http://www.calxeda.com/
technology/products/energycards/quadnode, accessed: 14-May-
2013.
[46] V. Marjanović, J. Labarta, E. Ayguadé, M. Valero, Overlapping com-
munication and computation by using a hybrid mpi/smpss approach, in:
Proceedings of the 24th ACM International Conference on Supercom-
puting, ACM, 2010, pp. 5–16.
[47] M. Warren, E. Weigle, W. Feng, High-density computing: A 240-
processor beowulf in one cubic meter, in: Supercomputing, ACM/IEEE
2002 Conference, IEEE, 2002, pp. 61–61.
[48] H. Nakashima, H. Nakamura, M. Sato, T. Boku, S. Matsuoka, D. Takahashi, Y. Hotta, MegaProto: 1 TFLOPS/10 kW rack is feasible even with only commodity technology, in: Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, IEEE, 2005, pp. 28–28.
[49] K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin, J. San-
cho, Entering the petaflop era: the architecture and performance of
roadrunner, in: Proceedings of the 2008 ACM/IEEE conference on Su-
percomputing, IEEE Press, 2008, p. 1.
[50] T. Chen, R. Raghavan, J. Dale, E. Iwata, Cell Broadband Engine Architecture and its first implementation: a performance view, IBM Journal of Research and Development 51 (5) (2007) 559–572.
[51] V. Vasudevan, D. Andersen, M. Kaminsky, L. Tan, J. Franklin,
I. Moraru, Energy-efficient cluster computing with FAWN: Workloads
and implications, in: Proceedings of the 1st International Conference
on Energy-Efficient Computing and Networking, ACM, 2010, pp. 195–
204.
[52] W. Lang, J. Patel, S. Shankar, Wimpy node clusters: What about non-
wimpy workloads?, in: Proceedings of the Sixth International Workshop
on Data Management on New Hardware, ACM, 2010, pp. 47–55.
[53] P. Stanley-Marbell, V. C. Cabezas, Performance, power, and thermal
analysis of low-power processors for scale-out systems, in: Parallel and
Distributed Processing Workshops and Phd Forum (IPDPSW), 2011
IEEE International Symposium on, IEEE, 2011, pp. 863–870.
[54] K. Fürlinger, C. Klausecker, D. Kranzlmüller, Towards energy efficient
parallel computing on consumer electronic devices, in: Information and
Communication on Technology for the Fight against Global Warming,
Springer, 2011, pp. 1–9.
[58] IBM Systems and Technology, IBM System Blue Gene/Q Data Sheet,
http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12345usen/
DCD12345USEN.PDF, accessed: 5-May-2013.
[61] HP, HP Project Moonshot and the Redstone Development Server Plat-
form, http://h10032.www1.hp.com/ctg/Manual/c03442116.pdf, ac-
cessed: 15-May-2013.