Tibidabo✩: Making the case for an ARM-based HPC system
Nikola Rajovic a,b,∗, Alejandro Rico a,b, Nikola Puzovic a, Chris Adeniyi-Jones c, Alex Ramirez a,b

a Computer Sciences Department, Barcelona Supercomputing Center, Barcelona, Spain
b Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
c ARM Ltd., Cambridge, United Kingdom
Abstract
It is widely accepted that future HPC systems will be limited by their power
consumption. Current HPC systems are built from commodity server pro-
cessors, designed over the years to achieve maximum performance, with energy
efficiency being an afterthought. In this paper we advocate a different approach:
building HPC systems from low-power embedded and mobile technology parts that
have been designed over time for maximum energy efficiency and now show promise
for competitive performance.
We introduce the architecture of Tibidabo, the first large-scale HPC clus-
ter built from ARM multicore chips, and a detailed performance and energy
efficiency evaluation. We present the lessons learned for the design and im-
provement in energy efficiency of future HPC systems based on such low-
power cores. Based on our experience with the prototype, we perform simu-
lations to show that a theoretical cluster of 16-core ARM Cortex-A15 chips
would increase the energy efficiency of our cluster by 8.7x, reaching an energy
efficiency of 1046 MFLOPS/W.
Keywords: high-performance computing, embedded processors, mobile processors, low power, cortex-a9, cortex-a15, energy efficiency

✩ Tibidabo is a mountain overlooking Barcelona.
∗ Corresponding author
Email addresses: nikola.rajovic@bsc.es (Nikola Rajovic), alejandro.rico@bsc.es (Alejandro Rico), nikola.puzovic@bsc.es (Nikola Puzovic), chris.adeniyi-jones@arm.com (Chris Adeniyi-Jones), alex.ramirez@bsc.es (Alex Ramirez)

Preprint of the article accepted for publication in Future Generation Computer Systems, Elsevier
1. Introduction
In High Performance Computing (HPC), there is a continued need for
higher computational performance. Scientific grand challenges in, e.g., engineering, geophysics, and bioinformatics, as well as other compute-intensive applications, require increasing amounts of compute power. On the other hand,
energy is increasingly becoming one of the most expensive resources and it
substantially contributes to the total cost of running a large supercomputing
facility. In some cases, the total energy cost over a few years of operation
can exceed the cost of the hardware infrastructure acquisition [1, 2, 3].
This trend is not limited to HPC systems; it also holds true for data centres in general. Energy efficiency is already a primary concern for the design of any computer system and it is unanimously recognized that reaching the next milestone in supercomputers' performance, e.g. one EFLOPS (exaFLOPS, 10^18 floating-point operations per second), will be strongly
constrained by power. The energy efficiency of a system will define the max-
imum achievable performance.
In this paper, we take a first step towards HPC systems developed from
low-power solutions used in embedded and mobile devices. However, using
CPUs from this domain is a challenge: these devices are designed neither to exploit high ILP nor to deliver high memory bandwidth. Most embedded CPUs lack a vector floating-point unit and their software ecosystem is not tuned for HPC. What makes them particularly interesting are their size and power characteristics, which allow for higher packaging density and lower cost. In
the following three subsections we further motivate our proposal from several
important aspects.
only 30-40% of the total power will actually be spent on the cores, the rest going to power supply overhead, interconnect, storage, and memory. That leads to a power budget of 6 MW to 8 MW for 62.5 million cores, which is 0.10 W to 0.13 W per core. Current high-performance processors integrating this type of core require tens of watts at 2 GHz. However, ARM processors, designed for the embedded mobile market, consume less than 0.9 W at that frequency [5] and are thus worth exploring: even though they do not yet provide a sufficient level of performance, they have a promising roadmap ahead.
1 Cortex-A8 is the processor generation prior to Cortex-A9; it has a non-pipelined floating-point unit. In the best case it can deliver one floating-point ADD every ∼10 cycles; MUL and MAC have lower throughput.
1.3. Bell’s Law
Our approach for an HPC system is novel because we argue for the use of
mobile cores. We consider the improvements expected in mobile SoCs in the
near future that would make them real candidates for HPC. As Bell’s law
states [15], a new computer class is usually based on lower cost components,
which continue to evolve at a roughly constant price but with increasing per-
formance from Moore's law. This trend holds today: the class of computing systems on the rise in HPC is that of systems with large numbers of closely coupled small cores (BlueGene/Q and Xeon Phi systems). From the
architectural point of view, our proposal fits into this computing class and it
has the potential for performance growth given the size and evolution of the
mobile market.
1.4. Contributions
In this paper, we present Tibidabo, an experimental HPC cluster that we
built using NVIDIA Tegra2 chips, each featuring a performance-optimized
dual-core ARM Cortex-A9 processor. We use the PCIe support in Tegra2
to connect a 1 GbE NIC, and build a tree interconnect with 48-port 1 GbE
switches.
We do not intend our first prototype to achieve an energy efficiency com-
petitive with today’s leaders. The purpose of this prototype is to be a proof
of concept to demonstrate that building such energy-efficient clusters with
mobile processors is possible, and to learn from the experience. On the soft-
ware side, the goal is to deploy an HPC-ready software stack for ARM-based
systems, and to serve as an early application development and tuning vehicle.
Detailed analysis of performance and power distribution points to a ma-
jor problem when building HPC systems from low-power parts: the system
integration glue takes more power than the microprocessor cores themselves.
The main building block of our cluster, the Q7 board, is designed with embedded and mobile software development in mind, and is not particularly optimized for energy-efficient operation. Nevertheless, the energy efficiency
of our cluster is 120 MFLOPS/W, still competitive with Intel Xeon X5660
and AMD Opteron 6128 based clusters,2 but much lower than what could
be anticipated from the performance and power figures of the Cortex-A9
processor.
2 In the November 2012 edition of the Green500 list these systems are ranked 395th and 396th, respectively.
We use our performance analysis to model and simulate a potential HPC
cluster built from ARM Cortex-A9 and Cortex-A15 chips with higher multi-
core density (number of cores per chip) and higher bandwidth interconnects,
and conclude that such a system would deliver competitive energy efficiency.
The work presented here, and the lessons that we learned are a first step
towards such a system that will be built with the next generation of ARM
cores implementing the ARMv8 architecture.
The contributions of this paper are:
• The design of the first ARM-based HPC cluster architecture, with
a complete performance evaluation, energy efficiency evaluation, and
comparison with state-of-the-art high-performance architectures.
Figure 1: (a) Q7 module; (b) Q7 carrier board.
TSMC's 40nm LPG performance-optimized process. Tegra2 features a number of application-specific accelerators targeted at the mobile market, such as a video and audio encoder/decoder and an image signal processor, but none of these can be used for general-purpose computation; they only contribute SoC area overhead. The GPU in Tegra2 does not support general programming models such as CUDA or OpenCL, so it cannot be used for HPC computation either. However, more advanced GPUs do support these programming models, and a variety of HPC systems use them to accelerate certain kinds of workloads.
Tegra2 is the central part of the Q7 module [16] (see Figure 1(a)). The module also integrates 1 GB of DDR2-667 memory, 16 GB of eMMC storage and a 100 MbE NIC (connected to Tegra2 through USB), and exposes PCIe connectivity to the carrier board. Using Q7 modules allows an easy upgrade when next-generation SoCs become available, and reduces the cost of replacement in case of failure.
Each Tibidabo node is built using a Q7-compliant carrier board [17] (see Figure 1(b)). Each board hosts one Q7 module, integrates a 1 GbE NIC (connected to Tegra2 through PCIe) and a µSD card adapter, and exposes other connectors and related circuitry that are not required for our HPC cluster but are needed for embedded software/hardware development (RS232, HDMI, USB, SATA, embedded keyboard controller, compass controller, etc.).
These boards are organized into blades (See Figure 1(c)), and each blade
hosts 8 nodes and a shared Power Supply Unit (PSU). In total, Tibidabo
has 128 nodes and it occupies 42 U standard rack space: 32 U for compute
blades, 4 U for interconnect switches and 2 U for the file server.
The interconnect has a tree topology and is built from 1 GbE 48-port
switches, with 1 to 8 Gb/s link bandwidth aggregation between switches.
Each node in the network is reachable within three hops.
The Linux kernel (version 2.6.32.2) and a single Ubuntu 10.10 filesystem image are hosted on an NFSv4 server with 1 Gb/s of bandwidth. Each node has its own local scratch storage on a 16 GB µSD Class 4 memory card. Tibidabo relies on MPICH2 v1.4.1 as its MPI library; at the time of writing, this was the only MPI distribution that worked reliably with the SLURM job manager on our cluster.
We use ATLAS 3.9.51 [18] as our linear algebra library. It was chosen due to the lack of a hand-optimized algebra library for our platform and for its ability to auto-tune to the underlying architecture. Applications that need an FFT library rely on FFTW v3.3.1 [19] for the same reasons.
3. Evaluation
In this section we present a performance and power evaluation of Tibidabo in two phases: first for a single compute chip in a node, and then for the whole cluster. We also provide a breakdown of single-node power consumption to understand the potential sources of inefficiency for HPC.
3.1. Methodology
For the measurement of energy efficiency (MFLOPS/W), and energy-to-
solution (joules) in single-core benchmarks, we used a Yokogawa WT230 power meter [20] with an effective sampling rate3 of 10 Hz, a basic precision of 0.1%, and RMS output values given as voltage/current pairs. We repeat our runs so that the acquisition interval is at least 10 minutes. The meter is
connected to act as an AC supply bridge and to directly measure power drawn
from the AC line. We have developed a measurement daemon that integrates
with the OS and triggers the power meter to start collecting samples when the
benchmark starts, and to stop when it finishes. Collected samples are then
used to calculate the energy-to-solution and energy efficiency. To measure the
energy efficiency of the whole cluster, the measurement daemon is integrated
with the SLURM [21] job manager, and after the execution of a job, power
measurement samples are included alongside the outputs from the job. In
this case, the measurement point is the power distribution unit of the entire
rack.
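For reference, the arithmetic applied to the collected samples is straightforward; the sketch below is our own illustration (not the actual daemon code) and assumes the samples are stored as (voltage, current) RMS pairs at the meter's 10 Hz effective rate.

    # Illustration only (not the measurement daemon itself): convert logged
    # RMS (voltage, current) pairs sampled at 10 Hz into energy-to-solution
    # (joules) and energy efficiency (MFLOPS/W).
    SAMPLE_PERIOD_S = 0.1  # 10 Hz effective sampling rate of the power meter

    def energy_to_solution(samples):
        """samples: iterable of (volts_rms, amps_rms) pairs covering the run."""
        return sum(v * i for v, i in samples) * SAMPLE_PERIOD_S  # joules

    def energy_efficiency_mflops_per_watt(flop_count, samples):
        """MFLOPS/W is equivalent to MFLOP executed per joule consumed."""
        return (flop_count / 1e6) / energy_to_solution(samples)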
For single-node energy efficiency, we have measured a single Q7 board and
compared the results against a power-optimized Intel Core i7 [22] laptop (Ta-
ble 1), whose processor chip has a thermal design power of 35 W. Due to the
different natures of the laptop and the development board, and in order to give a fair comparison in terms of energy efficiency, we measure only the power of the components that are necessary for executing the benchmarks; all unused devices are disabled. On our Q7 board, we disable Ethernet during benchmark execution. On the Intel Core i7 platform, graphic
output, sound card, touch-pad, bluetooth, WiFi, and all USB devices are
disabled, and the corresponding modules are unloaded from the kernel. The
hard disk is spun down, and the Ethernet is disabled during the execution
3 Internal sampling frequencies are not known. This is the frequency at which the meter outputs new pairs of samples.
Table 1: Experimental platforms

                        ARM Platform                        Intel Platform
SoC                     Tegra 2                             Intel Core i7-640M
Architecture            ARM Cortex-A9 (ARMv7-a)             Nehalem
Core count              Dual core                           Dual core
Operating frequency     1 GHz                               2.8 GHz
Cache sizes             L1: 32 KB I, 32 KB D per core       L1: 32 KB I, 32 KB D per core
                        L2: 1 MB I/D shared                 L2: 256 KB I/D per core
                                                            L3: 4 MB I/D shared
RAM                     1 GB DDR2-667                       8 GB DDR3-1066
                        32-bit single channel               64-bit dual channel
                        2666.67 MB/s per channel            8533.33 MB/s per channel
Compiler                GCC 4.6.2                           GCC 4.6.2
OS                      Linux 2.6.36.2 (Ubuntu 10.10)       Linux 2.6.38.11 (Ubuntu 10.10)
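The per-channel peak bandwidth figures in Table 1 follow from the memory transfer rate and bus width; the short sketch below shows that arithmetic (the 666.67 MT/s and 1066.67 MT/s transfer rates are our reading of the DDR2-667 and DDR3-1066 specifications).

    # Peak bandwidth per channel = transfer rate (MT/s) x bus width (bytes).
    def peak_bw_mb_s(transfers_mt_s, bus_width_bits):
        return transfers_mt_s * bus_width_bits / 8

    print(peak_bw_mb_s(2000 / 3, 32))   # DDR2-667, 32-bit channel  -> ~2666.67 MB/s
    print(peak_bw_mb_s(3200 / 3, 64))   # DDR3-1066, 64-bit channel -> ~8533.33 MB/s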
[Figure 2: Performance (GFLOPS) of the FPADD and FPMA microbenchmarks as a function of problem size (4K to 100M), with the L1 and L2 cache capacities marked.]
Q7 module memory on both platforms. Our results show that the DDR2-667 memory in our Q7 modules delivers a memory bandwidth of 1348 MB/s for copy and 720 MB/s for add, so the Cortex-A9 chip achieves 51% and 27% bandwidth efficiency, respectively. Meanwhile, the DDR3-1066 in the Core i7 delivers around 7000 MB/s for both copy and add, which is 41% bandwidth efficiency considering the two memory channels available.
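As a quick cross-check, the efficiency percentages quoted here and in Table 3 are simply the measured STREAM bandwidth divided by the theoretical peak from Table 1; a minimal sketch:

    # Bandwidth efficiency (%) = measured STREAM bandwidth / theoretical peak.
    def bw_efficiency(measured_mb_s, peak_mb_s):
        return 100.0 * measured_mb_s / peak_mb_s

    print(bw_efficiency(1348, 2666))    # Cortex-A9, copy -> ~50.6%
    print(bw_efficiency(720, 2666))     # Cortex-A9, add  -> ~27.0%
    print(bw_efficiency(7005, 17066))   # Core i7, add    -> ~41.0%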
Table 2: Dhrystone and SPEC CPU2006: Intel Core i7 and ARM Cortex-A9 performance and energy-to-solution comparison. SPEC CPU2006 results are normalized to Cortex-A9 and averaged across all benchmarks in the CINT2006 and CFP2006 subsets of the suite.

                    Dhrystone                            CINT2006             CFP2006
Platform            perf (DMIPS)  energy abs (J)  norm   perf     energy      perf      energy
Intel Core i7       19246         116.8           1.056  9.071    1.185       9.4735    1.172
ARM Cortex-A9       2466          110.8           1.0    1.0      1.0         1.0       1.0
Table 3: STREAM: Intel Core i7 and ARM Cortex-A9 memory bandwidth and bandwidth efficiency comparison.

                Peak mem.    STREAM perf (MB/s)            Energy (avg.)     Efficiency (%)
Platform        BW (MB/s)    copy   scale  add    triad    abs (J)   norm    copy    add
Intel Core i7   17066        6912   6898   7005   6937     481.5     1.059   40.5    41.0
ARM Cortex-A9   2666         1348   1321   720    662      454.8     1.0     50.6    27.0
processors in order to achieve competitive time-to-solution. More processing
cores in the system mean more need for scalability. In this section we evalu-
ate the performance, energy efficiency and scalability of the whole Tibidabo
cluster.
Figure 3 shows the parallel speed-up achieved by the High-Performance
Linpack benchmark (HPL) [27] and several other HPC applications. Fol-
lowing common practice, we perform a weak scalability test for HPL and a
strong scalability test for the rest.4 We have considered several widely used
MPI applications: GROMACS [28], a versatile package to perform molecular
dynamics simulations; SPECFEM3D GLOBE [29] that simulates continen-
tal and regional scale seismic wave propagation; HYDRO, a 2D Eulerian
code for hydrodynamics; and PEPC [30], an application that computes long-
range Coulomb forces for a set of charged particles. All applications are
compiled and executed out-of-the-box, without any hand tuning of the re-
spective source codes.
If an application cannot execute on a single node due to large memory requirements, we calculate the speed-up with respect to the smallest number of nodes that can handle the problem. For example, PEPC with a reference input set requires at least 24 nodes, so we plot the results assuming that the speed-up on 24 nodes is 24.
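A minimal sketch of this speed-up convention (the function and argument names are ours, for illustration only):

    # Speed-up anchored at the smallest node count that fits the problem,
    # e.g. for PEPC the speed-up on 24 nodes is defined to be 24.
    def speedup(time_on_n_nodes, time_on_base_nodes, base_nodes):
        return base_nodes * time_on_base_nodes / time_on_n_nodes

    print(speedup(10.0, 40.0, 24))  # perfect scaling from 24 to 96 nodes -> 96.0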
We have executed SPECFEM3D and HYDRO with an input set that is
able to fit into the memory of a single node, and they show good strong scaling
up to the maximum available number of nodes in the cluster. In order to
achieve good strong scaling with GROMACS, we have used two input sets,
both of which can fit into the memory of two nodes. We have observed
that scaling of GROMACS improves when the input set size is increased.
PEPC does not show optimal scalability because the input set that we can
fit on our cluster is too small to show the strong scalability properties of the
application [30].
HPL shows good weak scaling. In addition to HPL performance, we also measure power consumption, so that we can derive the MFLOPS/W metric used to rank HPC systems in the Green500 list. Our cluster achieves 120 MFLOPS/W (97 GFLOPS on 96 nodes, 51% HPL efficiency),
4 Weak scalability refers to the capability of solving a larger problem size in the same amount of time using a larger number of nodes (the problem size is limited by the available memory in the system). Strong scalability, on the other hand, refers to the capability of solving a fixed problem size in less time while increasing the number of nodes.
[Figure 3: Parallel speed-up of HP Linpack, PEPC, HYDRO, GROMACS (small and big inputs) and SPECFEM3D against the ideal speed-up, for 4 to 96 nodes.]
competitive with AMD Opteron 6128 and Intel Xeon X5660-based clusters, but 19x lower than the most efficient GPU-accelerated systems, and 21x lower than Intel Xeon Phi (November 2012 Green500 #1). The reasons for the low HPL efficiency include the lack of architecture-specific tuning of the algebra library and the lack of optimization in the MPI communication stack for ARM cores using Ethernet.
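These figures are mutually consistent; the back-of-the-envelope check below assumes a per-core peak of 1 GFLOPS at 1 GHz (one double-precision floating-point operation per cycle), which is our assumption rather than a number stated above.

    # Back-of-the-envelope check of the reported HPL figures.
    nodes, cores_per_node = 96, 2
    peak_gflops = nodes * cores_per_node * 1.0   # assumed 1 GFLOPS per core at 1 GHz
    hpl_gflops = 97.0
    print(hpl_gflops / peak_gflops)              # ~0.51 -> the quoted 51% HPL efficiency
    print(hpl_gflops * 1000 / 120)               # ~808 W implied total cluster power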
Figure 4: Power consumption breakdown of the main components of a compute node (in watts): Core1 0.26, Core2 0.26, L2 cache 0.1, Memory 0.7, Eth1 0.9, Eth2 0.5, Other 5.68. The compute node power consumption while executing HPL is 8.4 W; this power is computed by measuring the total cluster power and dividing it by the number of nodes.
and voltage.
Figure 4 shows the average power breakdown of the major components in
a compute node over the total compute node power during an HPL run on the
entire cluster. As can be seen, the total measured power on the compute node
is significantly higher than the sum of the major parts. Other on-chip and on-board peripherals in the compute node are not used for computation, so they are assumed to be shut off when idle. However, the large unaccounted-for power (labeled as Other) amounts to more than 67% of the total power. That part of the power includes on-board low-dropout (LDO) voltage regulators, on-board multimedia devices with related circuitry, the corresponding share of the blade PSU losses, and on-chip power sinks. Figure 5 shows the Tegra2 chip die. The white-outlined area shows the chip components that are used by HPC applications. This area is less than 35% of the total chip area. If the rest of the chip area is not properly power and clock gated, it would leak power even though it is not being used, thus also contributing to the Other part of the compute node power.
Although the estimations in this section are not exact, we actually overestimate the power consumption of some of the major components when taking their power figures from multiple data sources. Therefore, our analysis shows that up to 16% of the power goes to the computation components: cores (includ-
Figure 5: Tegra2 die: the area marked with a white border contains the components actually used by HPC applications; it represents less than 35% of the total chip area. Source: www.anandtech.com
overheads in order to glue them together to create a large system with a large
number of cores. Also, although Cortex-A9 is the leader in mobile computing for its high performance, it trades off some performance for power savings to improve battery life. Cortex-A15 is the highest-performing processor in the ARM family and includes features more suitable for HPC. Therefore, in this section we evaluate cluster configurations with higher multicore density (more cores per chip) and also project the performance and energy efficiency we would obtain if we used Cortex-A15 cores instead. To complete the
study, we evaluate multiple frequency operating points to show how frequency
affects performance and energy efficiency.
For our projections, we use an analytical power model and the DIMEMAS
cluster simulator [35]. DIMEMAS performs high-level simulation of the exe-
cution of MPI applications on cluster systems. It uses a high-level model of
the compute nodes—modeled as symmetric multi-processing (SMP) nodes—
to predict the execution time of computation phases. At the same time, it
simulates the cluster interconnect to account for MPI communication de-
lays. The interconnect and computation node models accept configuration
parameters such as interconnect bandwidth, interconnect latency, number of
links, number of cores per computation node, core performance ratio, and
memory bandwidth. DIMEMAS has been used to model the interconnect of
the MareNostrum supercomputer with an accuracy within 5% [36], and its
MPI communication model has been validated showing an error below 10%
for the NAS benchmarks [37].
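To make the role of these parameters concrete, the toy replay loop below (our own illustration, not DIMEMAS or its trace format) scales computation bursts by a core performance ratio and charges each MPI transfer its latency plus size over bandwidth.

    # Toy illustration of a high-level MPI replay model (not DIMEMAS itself).
    def replay(trace, cpu_ratio, latency_s, bandwidth_bytes_s):
        t = 0.0
        for kind, value in trace:            # ('compute', seconds) or ('send', bytes)
            if kind == 'compute':
                t += value / cpu_ratio       # faster cores shorten computation bursts
            else:
                t += latency_s + value / bandwidth_bytes_s
        return t

    # 1 GbE ~ 125e6 bytes/s, 50 us latency; a 4x faster CPU shortens computation
    # only, so communication weighs relatively more, as discussed for Figure 6.
    trace = [('compute', 1.0), ('send', 8 * 1024**2), ('compute', 1.0)]
    print(replay(trace, 1.0, 50e-6, 125e6), replay(trace, 4.0, 50e-6, 125e6))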
The input to our simulations is a trace obtained from an HPL execution
on Tibidabo. As an example, the PARAVER [38] visualizations of the input and output traces of a DIMEMAS simulation are shown in Figure 6. The
chart shows the activity of the application threads (vertical axis) over time
(horizontal axis). Figure 6(a) shows the visualization of the original execu-
tion on Tibidabo, and Figure 6(b) shows the visualization of the DIMEMAS
simulation using a configuration that mimics the characteristics of our ma-
chine (including the interconnect characteristics) except for the CPU speed
which is, as an example, 4 times faster. As it can be observed in the real
execution, threads do not start communication all at the same time, and
thus have computation in some threads overlapping with communication in
others. In the DIMEMAS simulation, where CPU speed is increased 4 times,
computation phases (in grey) become shorter and all communication phases
get closer in time. However, the application shows the same communication
pattern and communications take a similar time as that in the original ex-
Figure 6: An example of a DIMEMAS simulation, where each row presents the activity of a single processor: it is either in a computation phase (grey) or in MPI communication (black). (a) Part of the original HPL execution on Tibidabo; (b) the DIMEMAS simulation of the same trace with a 4x faster CPU.
Table 4: Core architecture, performance and power model parameters, and results for performance and energy efficiency of clusters with 16 cores per node.

Configuration #                1           2           3           4            5
CPU input parameters
  Core architecture            Cortex-A9   Cortex-A9   Cortex-A9   Cortex-A15   Cortex-A15
  Frequency (GHz)              1.0         1.4         2.0         1.0          2.0
  Performance over A9@1GHz     1.0         1.2         1.5         2.6          4.5
  Power over A9@1GHz           1.0         1.1         1.8         1.54         3.8
Per-node power figures for the 16 cores per chip configuration [W]
  CPU cores                    4.16        4.58        7.49        7.64         18.85
  L2 cache                     0.8         0.88        1.44        (integrated with cores)
  Memory                       5.6         5.6         5.6         5.6          5.6
  Ethernet NICs                1.4         1.4         1.4         1.4          1.4
Aggregate power figures [W]
  Per node                     17.66       18.16       21.63       20.34        31.55
  Total cluster                211.92      217.87      259.54      244.06       378.58
[Figure 7: Performance relative to 1 GHz operation as a function of core frequency, compared against perfect frequency scaling: Cortex-A9 from 456 MHz to 2000 MHz (left) and Cortex-A15 from 500 MHz to 2000 MHz (right).]
ber of cores in each compute node. When we increase the number of cores per compute node, the number of nodes is reduced, thus reducing integration overhead and pressure on the interconnect (i.e. fewer boards, cables and switches). To model this effect, our analytical model is as follows:
From the power breakdown of a single node presented in Figure 4, we
subtract the power corresponding to the CPUs and the memory subsystem
(L2 + memory). The remaining power in the compute node is considered to
be board overhead, and does not change with the number of cores. The board
overhead is part of the power of a single node, to which we add the power
of the cores, L2 cache and memory. For each configuration, the CPU core
power is multiplied by the number of cores per node. Same as in Tibidabo,
our projected cluster configurations are assumed to have 0.5 MB of L2 cache
per core and 500 MB of RAM per core—this assumption allows for simple
scaling to large numbers of cores. Therefore, the L2 cache power (0.1 W/MB)
and the memory power (0.7 W/GB) are multiplied both by half the number
of cores. The L2 cache power for the Cortex-A9 configurations is also factored for frequency, for which we use the core power ratio. The L2 in Cortex-A15
is part of the core macro, so the core power already includes the L2 power.
For both Cortex-A9 and Cortex-A15, the CPU macro power includes the
L1 caches, cache coherence unit and L2 controller. Therefore, the increase in
power due to a more complex L2 controller and cache coherence unit for a
larger multicore is accounted for when that power is factored by the number of cores. The memory power is overestimated, so the increased power of a more complex memory controller that scales to a higher number of cores is also covered for the same reason. Furthermore, a Cortex-A9
system cannot address more than 4 GB of memory so, strictly speaking,
Cortex-A9 systems with more than 4 GB are not realistic. However, we include configurations with higher core counts per chip to show what the performance and energy efficiency would be if Cortex-A9 included large physical address extensions, as Cortex-A15 does, to address up to 1 TB of memory [40].
The power model is summarized in these equations:

P_pred = (n_tc / n_cpc) × [ P_over / n_nin + P_eth + n_cpc × ( P_mem / 2 + p_r × ( P_A9_1G + P_L2$ / 2 ) ) ]    (1)

P_over = P_tot − n_nin × ( P_mem + 2 × P_A9_1G + P_L2$ + P_eth )    (2)

where n_tc is the total number of cores, n_cpc the number of cores per chip, n_nin the number of compute nodes in Tibidabo, p_r the core power ratio relative to a Cortex-A9 at 1 GHz, P_tot the total measured cluster power, and P_A9_1G, P_L2$, P_mem and P_eth the per-node power of one Cortex-A9 core at 1 GHz, the L2 cache, the memory and the Ethernet NICs, respectively.
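The sketch below is our re-derivation of Eqs. (1)-(2) in code, using the per-node figures from Figure 4 and the cluster size from footnote 5; it reproduces the Table 4 totals to within rounding.

    # Analytical power model of Eqs. (1)-(2); all powers in watts (Figure 4).
    P_A9_1G, P_L2, P_MEM, P_ETH = 0.26, 0.1, 0.7, 1.4
    P_NODE_MEASURED = 8.4            # per-node power during HPL (Figure 4)
    N_NODES = 96                     # compute nodes used (footnote 5)
    N_TOTAL_CORES = 192              # total MPI processes, kept constant

    # Eq. (2): integration overhead of the whole cluster.
    P_over = N_NODES * P_NODE_MEASURED - N_NODES * (P_MEM + 2 * P_A9_1G + P_L2 + P_ETH)

    def predicted_cluster_power(cores_per_chip, power_ratio):
        """Eq. (1); power_ratio is the core power relative to a Cortex-A9 at 1 GHz."""
        per_node = (P_over / N_NODES + P_ETH
                    + cores_per_chip * (P_MEM / 2 + power_ratio * (P_A9_1G + P_L2 / 2)))
        return (N_TOTAL_CORES / cores_per_chip) * per_node

    print(predicted_cluster_power(16, 3.8))  # 16-core Cortex-A15 @ 2 GHz -> ~378 W (Table 4: 378.58)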
5 Out of 128 nodes with a total of 256 processors, 4 nodes are used as login nodes and 28 are unstable. There are two major identified sources of instability: cooling issues and problems with the PCIe driver, which drops the network connection on the problematic nodes.
new core performance and multicore density, accounting for synchronizations
and communication delays. Figure 8 shows the results. In all simulations we
keep a network bandwidth of 1 Gb/s (1GbE) and a memory bandwidth of
1400 MB/s (from peak bandwidth results using STREAM).
The results show that, as we increase the number of cores per node (at
the same time reducing the total number of nodes), performance does not
show further degradation with the 1 GbE interconnect until we reach the performance level of Cortex-A15. None of the Cortex-A15 configurations reaches its maximum speed-up due to interconnect limitations. The configuration with two Cortex-A15 cores at 1 GHz per node scales worst because the interconnect is the same as in Tibidabo. With a higher number of cores per node, we reach 96% of the ideal speed-up of Cortex-A15 at 1 GHz. The further performance increase of Cortex-A15 at 2 GHz exposes additional limitations due to interconnect communication, reaching 82% of the ideal speed-up with two cores per node and 91% with sixteen.
Figure 8: Projected speed-up for the evaluated cluster configurations (2, 4, 8 and 16 cores per chip; Cortex-A9 at 1, 1.4 and 2 GHz; Cortex-A15 at 1 and 2 GHz). The total number of MPI processes is constant across all experiments.
Increasing computation density potentially improves MPI communication because more processes communicate on chip rather than over the network, and memory bandwidth is larger than interconnect bandwidth. Building a machine larger than Tibidabo, with faster mobile cores and a higher core count, will require a faster interconnect. In Section 4.2 we explore the interconnect requirements when using faster mobile cores.
[Figure 9: Projected energy efficiency (MFLOPS/W) as a function of the number of cores per node (2, 4, 8 and 16) for Cortex-A9 at 1, 1.4 and 2 GHz and Cortex-A15 at 1 and 2 GHz.]
The benefit of increased computation density (more cores per node) is ac-
tually the reduction of the integration overhead and the resulting improved
energy efficiency of the system (Figure 9). The results show that, by increasing the computation density with Cortex-A9 cores running at 2 GHz, we can achieve an energy efficiency of 563 MFLOPS/W using 16 cores per node (a ∼4.7x improvement). The configuration with 16 Cortex-A15 cores per node has an energy efficiency of 1004 MFLOPS/W at 1 GHz and 1046 MFLOPS/W at 2 GHz (a ∼8.7x improvement).
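As a rough cross-check of these numbers (using our reading of the projected speed-ups in Figure 8 together with the power totals from Table 4):

    # Projected MFLOPS/W ~ (projected speed-up x measured 97 GFLOPS) / projected power.
    def mflops_per_watt(speedup, cluster_power_w, base_gflops=97.0):
        return speedup * base_gflops * 1000.0 / cluster_power_w

    print(mflops_per_watt(1.5, 259.54))         # 16x Cortex-A9 @ 2 GHz   -> ~560 MFLOPS/W
    print(mflops_per_watt(0.91 * 4.5, 378.58))  # 16x Cortex-A15 @ 2 GHz  -> ~1050 MFLOPS/W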
Using these models, we project the energy efficiency of our cluster if it used higher-performance cores and included more cores per node. However, all other features remain the same, so inefficiencies due to the use of non-optimized development boards, the lack of software optimization, and the lack of vector double-precision floating-point execution units are accounted for in the model. Even with all these inefficiencies, our projections show that such a cluster would be competitive in terms of energy efficiency with Sandy Bridge
and GPU-accelerated systems in the Green500 list, which shows promise for
future ARM-based platforms actually optimized for HPC.
Figure 10: Interconnection network impact for the cluster configuration with 16 cores per node: (a) relative performance as a function of network bandwidth (0.1 to 10 Gb/s); (b) relative performance as a function of network latency (0 to 500 µs).
4.2. Interconnection network requirements
Cluster configurations with higher-performance cores and more cores per node put higher pressure on the interconnection network. The result of increasing the node computation power while maintaining the same network bandwidth is that the interconnect bandwidth-to-FLOPS ratio decreases. This may lead to the network becoming a bottleneck. To evaluate this effect, we carry
out DIMEMAS simulations of the evaluated cluster configurations using a
range of network bandwidth (Figure 10(a)) and latency values (Figure 10(b)).
The baseline for these results is the cluster configuration with Cortex-A9 at
1 GHz, 1 Gb/s of bandwidth and 50 µs of latency.
The results in Figure 10(a) show that a network bandwidth of 1 Gb/s is
sufficient for the evaluated cluster configurations with Cortex-A9 cores and
the same size as Tibidabo. The Cortex-A9 configurations show a negligible
improvement with 10 Gb/s interconnects. On the other hand, configurations
with Cortex-A15 do benefit from an increased interconnect bandwidth: the
1 GHz configuration reaches its maximum at 3 Gb/s and the 2 GHz config-
uration at 8 Gb/s.
The latency evaluation in Figure 10(b) shows the relative performance
with network bandwidths of 1 Gb/s and 10 Gb/s for a range of latencies
normalized to 50 µs. An ideal zero latency does not show a significant im-
provement over 50 µs and increasing the latency in a factor of ten, only has a
significant impact on the Cortex-A15 at 2 GHz configuration. Therefore, the
latency of Tibidabo’s Ethernet network, although being larger than that of
specialized and custom networks used in supercomputing, is small enough for
all the evaluated cluster configurations which have the same size as Tibidabo.
the CPU cores. Although the core itself provides a theoretical peak energy
efficiency of 2-4 GFLOPS/W, this design imbalance results in the measured
HPL energy efficiency of 120 MFLOPS/W.
In order to achieve system balance, we identified two fundamental im-
provements to put into practice. The first is to make use of higher-end ARM multicore chips like Cortex-A15, which provides an architecture more suitable for HPC while maintaining comparable single-core energy efficiency. The
second is to increase the compute density by adding more cores to the
chip. The recently announced ARM CoreLink CCN-504 cache coherence
network [41, 42] scales up to 16 cores and is targeted to high-performance
architectures such as Cortex-A15 and next-generation 64-bit ARM proces-
sors. In a system that puts these design improvements together, the CPU core power is better balanced with that of other components such as the memory. Our projections based on ARM Cortex-A15 processors with
higher multicore integration density show that such systems are a promising
alternative to current designs built from high performance parts. For exam-
ple, a cluster of the same size as Tibidabo, based on 16-core ARM Cortex-A15
chips at 2 GHz would provide 1046 MFLOPS/W.
A well known technique to improve energy efficiency is the use of SIMD
units. As an example, BlueGene/Q uses 256-bit-wide vectors for quad double-
precision floating-point computations, and the Intel MIC architecture uses
512-bit-wide SIMD units. Both Cortex-A9 and Cortex-A15 processors implement the ARMv7-A architecture, which only supports single-precision SIMD computation. Most HPC applications require calculations in double precision, so they cannot exploit the current ARMv7 SIMD units. The ARMv8 archi-
tecture specification includes double-precision floating-point SIMD, so fur-
ther energy efficiency improvements for HPC computation are expected from
next-generation ARMv8 chips featuring those SIMD units.
In all of our experiments, we ran the benchmarks out of the box and did not hand-tune any of the codes. Libraries and compilers include architecture-dependent optimizations that, in the case of ARM processors, target mobile computing. This leads to two different scenarios: the optimizations of libraries used in HPC, such as ATLAS or MPI, for ARM processors are one step behind; and optimizations in compilers, operating systems and drivers target mobile computing, thus trading off performance for quality of service or battery life. We have put together an HPC-ready software stack for Tibidabo, but we have not yet put effort into optimizing its components for HPC computation. Further energy efficiency improvements are ex-
pected when critical components such as MPI communication functions are optimized for ARM-based platforms, or when the Linux kernel is stripped of those components not used by HPC applications.
As shown in Figure 5, the Tegra2 chip includes a number of application-
specific accelerators that are not programmable using standard industrial
programming models such as CUDA or OpenCL. If those accelerators were
programmable and used for HPC computation, that would reduce the in-
tegration overhead of Tibidabo. The use of SIMD or SIMT programmable
accelerators is widely adopted in supercomputers, such as those including
general-purpose programmable GPUs (GPGPUs). Although the effective performance of GPGPUs is between 40% and 60% of their peak, their efficient compute-targeted design provides them with high energy efficiency. GPUs in mobile
SoCs are starting to support general-purpose programming. One example is
the Samsung Exynos5 [43] chip, which includes two Cortex-A15 cores and
an OpenCL-compatible ARM Mali T-604 GPU [44]. This design, apart from
providing the improved energy efficiency of GPGPUs, has the advantage of
having the compute accelerator close to the general purpose cores, thus re-
ducing data transfer latencies. Such an on-chip programmable accelerator
is an attractive feature to improve energy efficiency in an HPC system built
from low-power components.
Another important issue to keep in mind when designing this kind of system is that the memory bandwidth-to-FLOPS ratio must be maintained. Currently available ARM-based platforms make use of either memory technology that lags behind top-class standards (e.g., many platforms use DDR2 memory instead of DDR3), or memory technology targeting low
power (e.g., LPDDR2). For a higher-performance node with a higher num-
ber of cores and including double-precision floating-point SIMD units, cur-
rent memory choices in ARM platforms may not provide enough bandwidth,
so higher-performance memories must be adopted. Low-power ARM-based
products including DDR3 are already announced [45] and the recently an-
nounced DMC-520 [41] memory controller enables DDR3 and DDR4 memory
for ARM processors. These upcoming technologies are indeed good news for
low-power HPC computing. Moreover, package-on-package memories, which reduce the distance between the computation cores and the memory and increase pin density, can be used to include several memory controllers and provide higher memory bandwidth.
Finally, Tibidabo employs 1 Gbit Ethernet for the cluster interconnect.
Our experiments show that 1 GbE is not a performance limiting factor for
a cluster of Tibidabo's size employing Cortex-A9 processors at up to 2 GHz and
for compute-bound codes such as HPL. However, when using faster mobile
cores such as Cortex-A15, a 1 GbE interconnect starts becoming a bottleneck.
Current ARM-based mobile chips include peripherals targeted to the mobile market and thus do not provide enough bandwidth, or are not compatible with faster network technologies used in supercomputing, such as 10 GbE or InfiniBand. However, the use of 1 GbE is extensive in supercomputing (32% of the systems in the November 2012 Top500 list use 1 GbE interconnects), and potential communication bottlenecks are in many cases addressable in software [46]. Therefore, although support for a high-performance network technology would be desirable for ARM-based HPC systems, using 1 GbE may not be a limitation as long as the communication libraries are optimized for Ethernet communication and the communication patterns in HPC applications are tuned with the network capabilities in mind.
6. Related work
One of the first attempts to use low-power commodity processors in HPC systems was GreenDestiny [47]. It relied on the Transmeta TM5600 processor, and although it looked promising as a leading platform in energy efficiency, a large-scale HPC system was never produced. Its computing-to-space ratio was also leading at the time.
MegaProto systems [48] were another approach in this direction. They
were based on more advanced versions of Transmeta’s processors, namely
TM5800 and TM8820. These systems achieved good energy efficiency for the time, reaching up to 100 MFLOPS/W with 512 processors. Like its predecessor, MegaProto never made it into a com-
mercial HPC product.
Roadrunner [49] topped the Top500 list in June 2008 as the first system to break the petaflop barrier. It uses IBM PowerXCell 8i [50] processors together with
dual-core AMD Opteron processors. The Cell/B.E. architecture emphasizes
performance per watt by prioritizing bandwidth over latency and favours
peak computation capabilities over simplifying programmability. In the June
2008 Green500 list, it held third place with 437.43 MFLOPS/W, behind two
smaller homogeneous Cell/B.E.-based clusters.
There has been a proposal to use the Intel Atom family of processors
in clusters [51]. The platform is built and tested with a range of different
types of workloads, but those target data centres rather than HPC. One of the main contributions of this work is determining the types of workloads for which Intel Atom can compete in terms of energy efficiency with a commodity Intel Core i7. A follow-up of this work [52] concludes that a cluster made homogeneously of low-power nodes (Intel Atom) is not suited for complex database loads. The authors propose future research on heterogeneous cluster architectures using low-power nodes combined with high-performance ones.
The use of low-power processors for scale-out systems was assessed in a
study by Stanley-Marbell and Caparros-Cabezas [53]. They did a compara-
tive study of three different low-power architecture implementations: x86-64
(Intel Atom D510MO), Power Architecture e500 (Freescale P2020RDB) and
ARM Cortex-A8 (TI DM3730, BeagleBoard xM). The authors presented a
study with performance, power and thermal analyses. One of their findings
is that a single core Cortex-A8 platform is suitable for energy-proportional
computing, meaning very low idle power. However, it lacks sufficient comput-
ing resources to exploit coarse-grained task-level parallelism and be a more
energy efficient solution than the dual-core Intel Atom platform. They also
concluded that a large fraction of the platforms’ power consumption (up to
67% for the Cortex-A8 platform) cannot be attributed to a specific compo-
nent, despite the use of sophisticated techniques such as thermal imaging.
The AppleTV cluster [54, 55] is an effort to assess the performance of
the ARM Cortex-A8 processor in a cluster environment running HPL. The
authors built a small cluster with four nodes based on AppleTV devices with
a 100 MbE network. They achieved 160.4 MFLOPS with an energy efficiency
of 16 MFLOPS/W. They also compared the memory bandwidth against a BeagleBoard xM platform and attributed the performance differences to different design decisions in the memory subsystems. In our system, we
employ more recent low-power core architectures and show how improved
floating-point units, memory subsystems, and an increased number of cores
can significantly improve the overall performance and energy efficiency, while
still maintaining a small power footprint.
The BlueGene family of supercomputers has been around since 2004 in
several generations [56, 57, 58]. BlueGene systems are composed of em-
bedded cores integrated on ASIC together with architecture-specific fabrics.
BlueGene/L, the first such system, is based on the PowerPC 440, with a
theoretical peak performance of 5.6 GFLOPS. BlueGene/P increased the
peak performance of the compute card to 13.6 GFLOPS by using 4-core
28
PowerPC 450. BlueGene/Q-based clusters are one of the most power effi-
cient HPC machines nowadays delivering around 2.1 GFLOPS/W. A Blue-
Gene/Q compute chip includes 16 4-way SMT in-order cores, each one with
a 256-bit-wide quad double-precision SIMD floating-point unit, delivering
a total of 204.8 GFLOPS per chip on a power budget of around 55 W
(3.7 GFLOPS/W).
The most energy-efficient machine in the November 2012 Green500 list
is based on the Intel Xeon Phi coprocessor. It has a design similar to Blue-
Gene/Q: 4-way SMT in-order cores with wide SIMD units, but it integrates more cores per chip (60) and the SIMD units are 512 bits wide. The use
of a more recent technology process (22nm instead of the 45nm of Blue-
Gene/Q) allows this larger integration and results in an energy efficiency of
2.5 GFLOPS/W for the number one machine in the Green500 list.
There is a lot of hype about the use of low-power ARM processors in
servers. Currently, the most exciting, commercially available approaches are
the ones from Boston Ltd. [10], Penguin Computing [59] and EXXACT Cor-
poration [60]. They offer solutions based on the Calxeda ECX-1000 SoC [6]
with up to 48 server nodes (192 cores) and up to 4 GB of memory per
server node (192 GB in total) in a 2U enclosure. HP went one step further
with Project Moonshot [9], where they introduce the Redstone Development
Server Platform [61]. It has a compute integration option with up to 288
Calxeda SoCs in 4U.
The Calxeda ECX-1000 SoC is built for server workloads: it is a quad-core
chip with Cortex-A9 cores running at 1.4 GHz, 4 MB of L2 cache with ECC
protection, a 72-bit memory controller with ECC support, five 10 Gb lanes for
connecting with other SoCs, support for 1 GbE and 10 GbE, and SATA 2.0
controllers with support for up to five SATA disks. Unlike ARM-based mobile
SoCs, the ECX-1000 does not have a power overhead in terms of unnecessary on-chip resources and thus seems better suited for energy-efficient HPC. However, to the best of our knowledge, there are neither reported energy-efficiency numbers for HPL running in a cluster environment (only single-node executions) nor scalability tests of scientific applications for any of the aforementioned enclosures.
AppliedMicro announced an ARM server platform based on their own ARMv8-based SoC design, the X-Gene [8]. There are still no enclosures announced, and no benchmark reports, but we expect better performance than from ARMv7-based enclosures, due to an improved CPU core architecture and three levels of cache hierarchy.
7. Conclusions
In this paper we presented Tibidabo, the world’s first ARM-based HPC
cluster, for which we set up an HPC-ready software stack to execute HPC
applications widely used in scientific research such as SPECFEM3D and
GROMACS. Tibidabo was built using commodity off-the-shelf components
that are not designed for HPC. Nevertheless, our prototype cluster achieves 120 MFLOPS/W on HPL, competitive with AMD Opteron 6128 and Intel Xeon X5660-based systems. We identified a set of inefficiencies in our design, given that its components target mobile computing. The main inefficiency
is that the power taken by the components required to integrate small low-
power dual-core processors offsets the high energy efficiency of the cores
themselves. We performed a set of simulations to project the energy efficiency of our cluster had we been able to use chips featuring higher-performance ARM cores and integrating a larger number of them.
Based on these projections, a cluster configuration with 16-core Cortex-
A15 chips would be competitive with Sandy Bridge-based homogeneous sys-
tems and GPU-accelerated heterogeneous systems in the Green500 list.
We also explained the major issues and how they should evolve or be addressed for future clusters built from low-power ARM processors. These
issues include, apart from the aforementioned integration overhead, the lack
of optimized software, the use of mobile-targeted memories, the lack of
double-precision floating-point SIMD units, and the lack of support for high-
performance interconnects. Based on our recommendations, an HPC-ready
ARM processor design should include a larger number of cores per chip (e.g.,
16) and use a core microarchitecture suited for high-performance, like the
one in Cortex-A15. It should also include double-precision floating-point
SIMD units, support for multiple memory controllers servicing DDR3 or
DDR4 memory modules, and probably support for a higher-performance
network, such as InfiniBand, although Gigabit Ethernet may be sufficient for
many HPC applications. On the software side, libraries, compilers, drivers
and operating systems need tuning for high performance, and architecture-
dependent optimizations for ARM processor chips.
Recent announcements show an increasing interest in server-class low-
power systems that may benefit HPC. The new 64-bit ARMv8 ISA improves
some features that are important for HPC. First, using 64-bit addresses removes the 4 GB memory limitation per application. This allows more mem-
ory per node, so one process can compute more data locally, requiring less
network communication. Also, ARMv8 increases the size of the general-
purpose register file from 16 to 32 registers. This reduces register spilling
and provides more room for compiler optimization. It also improves floating-point performance by extending the NEON instructions with fused multiply-add and multiply-subtract, and with cross-lane vector operations. More importantly, double-precision floating-point is now part of NEON. Altogether, this provides a theoretical peak double-precision floating-point performance of 4 FLOPS/cycle for a fully-pipelined SIMD unit. As an example, ARM
Cortex-A57, the highest performance ARM implementation of the ARMv8
ISA, includes two NEON units, totalling 8 double-precision floating-point
FLOPS/cycle—this is 4 times better than ARM Cortex-A15 and equivalent
to Intel implementations with one AVX unit.
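The peak figures quoted above follow directly from the SIMD width and the fused multiply-add operation; a short worked check:

    # A 128-bit NEON unit holds two 64-bit double-precision lanes, and a fused
    # multiply-add counts as two FLOPs per lane.
    lanes = 128 // 64
    flops_per_fma = 2
    print(lanes * flops_per_fma)        # 4 FLOPS/cycle for one fully-pipelined unit
    print(2 * lanes * flops_per_fma)    # 8 FLOPS/cycle with Cortex-A57's two NEON units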
These encouraging industrial roadmaps, together with research initia-
tives such as the EU-funded Mont-Blanc project [62], may lead ARM-based
platforms to fulfil the recommendations given in this paper in the near future.
8. Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive comments. In addition, the authors would like to thank Bernard Ortiz de Montellano and Paul M. Carpenter for their help in improving the quality of this paper.
This project and the research leading to these results have received fund-
ing from the European Union’s Seventh Framework Programme [FP7/2007-
2013] under grant agreement no 288777. Part of this work is supported by
the PRACE project (European Union funding under grants RI-261557 and
RI-283493).
References
[1] D. Göddeke, D. Komatitsch, M. Geveler, D. Ribbrock, N. Rajovic,
N. Puzovic, A. Ramirez, Energy efficiency vs. performance of the numer-
ical solution of PDEs: an application study on a low-power ARM-based
cluster, Journal of Computational Physics 237 (2013) 132–150.
07-12/new_mexico_to_pull_plug_on_encanto_former_top_5_
supercomputer.html, accessed: 5-May-2013 (7 2012).
[3] HPCwire, Requiem for Roadrunner, http://www.hpcwire.com/
hpcwire/2013-04-01/requiem_for_roadrunner.html, accessed: 5-
May-2013 (4 2013).
[4] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J. Hiller, S. Karp, S. Keckler,
D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely,
T. Sterling, R. S. Williams, K. Yelick, P. Kogge, Exascale Comput-
ing Study: Technology Challenges in Achieving Exascale Systems, in:
DARPA Technical Report, 2008.
[5] ARM Ltd., Cortex-A9 Processor, http://www.arm.com/products/
processors/cortex-a/cortex-a9.php, accessed: 5-May-2013.
[6] Calxeda, EnergyCore™ processors, http://www.calxeda.com/technology/products/processors/, accessed: 5-May-2013.
[7] Marvell, Marvell Quad-Core ARMADA XP Series SoC,
http://www.marvell.com/embedded-processors/armada-
xp/assets/Marvell-ArmadaXP-SoC-product%20brief.pdf, accessed
5-May-2013.
[8] AppliedMicro, AppliedMicro X-Gene, http://www.apm.com/
products/x-gene/, accessed: 5-May-2013.
[9] HP, HP Labs research powers project Moonshot, HPs new archi-
tecture for extreme low-energy computing, http://www.hpl.hp.com/
news/2011/oct-dec/moonshot.html, accessed: 15-May-2013.
[10] Boston Ltd., Boston Viridis - ARM Microservers, http://www.boston.
co.uk/solutions/viridis/default.aspx, accessed: 5-May-2013.
[11] ARM Ltd., VFPv3 Floating Point Unit, http://www.arm.com/
products/processors/technologies/vector-floating-point.php,
accessed: 5-May-2013.
[12] ARM Ltd., The ARM® NEON™ general-purpose SIMD engine, http://www.arm.com/products/processors/technologies/neon.php, accessed: 5-May-2013.
[13] J. Turley, Cortex-A15 “Eagle” flies the coop, Microprocessor Report
24 (11) (2010) 1–11.
[15] G. Bell, Bell’s law for the birth and death of computer classes, Commu-
nications of ACM 51 (1) (2008) 86–94.
[21] A. Yoo, M. Jette, M. Grondona, SLURM: Simple Linux utility for resource management, in: Job Scheduling Strategies for Parallel Processing, Springer, 2003, pp. 44–60.
[25] N. Rajovic, L. Vilanova, C. Villavieja, N. Puzovic, A. Ramirez, The
low power architecture approach towards exascale computing, Journal
of Computational Science.
[26] J. D. McCalpin, Memory bandwidth and machine balance in current
high performance computers, IEEE Computer Society Technical Com-
mittee on Computer Architecture (TCCA) Newsletter (1995) 19–25.
[27] J. Dongarra, P. Luszczek, A. Petitet, The LINPACK Benchmark: past,
present and future, Concurrency and Computation: Practice and Expe-
rience 15 (9) (2003) 803–820.
[28] H. Berendsen, D. van der Spoel, R. van Drunen, Gromacs: A message-
passing parallel molecular dynamics implementation, Computer Physics
Communications 91 (1) (1995) 43–56.
[29] D. Komatitsch, J. Tromp, Introduction to the spectral element method
for three-dimensional seismic wave propagation, Geophysical Journal
International 139 (3) (1999) 806–822.
[30] DEISA 2: Distributed European Infrastructure for Supercomputing Applications; Maintenance of the DEISA Benchmark Suite in the Second Year, available online at: www.deisa.eu.
[31] ARM Ltd., ARM Announces 2GHz Capable Cortex-A9 Dual Core
Processor Implementation, http://www.arm.com/about/newsroom/
25922.php, accessed: 5-May-2013.
[32] Intel Corporation, Intel® 82574 GbE Controller Family, http://www.intel.com/content/dam/doc/datasheet/82574l-gbe-controller-datasheet.pdf, accessed: 5-May-2013.
[33] SMSC, LAN9514/LAN9514i: USB 2.0 Hub and 10/100 Eth-
ernet Controller, http://www.smsc.com/media/Downloads_Public/
Data_Sheets/9514.pdf, accessed: 5-May-2013.
[34] Micron, DDR2 SDRAM System-Power Calculator, http://www.
micron.com/support/dram/power_calc.html, accessed: 5-May-2013.
[35] R. Badia, J. Labarta, J. Gimenez, F. Escale, DIMEMAS: Predicting
MPI applications behavior in Grid environments, in: Workshop on Grid
Applications and Programming Tools (GGF8), Vol. 86, 2003.
[36] A. Ramirez, O. Prat, J. Labarta, M. Valero, Performance Impact of the
Interconnection Network on MareNostrum Applications, in: 1st Work-
shop on Interconnection Network Architectures: On-Chip, Multi-Chip,
2007.
[45] Calxeda, Calxeda Quad-Node EnergyCard, http://www.calxeda.com/
technology/products/energycards/quadnode, accessed: 14-May-
2013.
[46] V. Marjanović, J. Labarta, E. Ayguadé, M. Valero, Overlapping com-
munication and computation by using a hybrid mpi/smpss approach, in:
Proceedings of the 24th ACM International Conference on Supercom-
puting, ACM, 2010, pp. 5–16.
[47] M. Warren, E. Weigle, W. Feng, High-density computing: A 240-
processor beowulf in one cubic meter, in: Supercomputing, ACM/IEEE
2002 Conference, IEEE, 2002, pp. 61–61.
[48] H. Nakashima, H. Nakamura, M. Sato, T. Boku, S. Matsuoka, D. Takahashi, Y. Hotta, MegaProto: 1 TFLOPS/10 kW rack is feasible even with only commodity technology, in: Supercomputing, 2005. Proceedings of the ACM/IEEE SC 2005 Conference, IEEE, 2005, pp. 28–28.
[49] K. Barker, K. Davis, A. Hoisie, D. Kerbyson, M. Lang, S. Pakin, J. San-
cho, Entering the petaflop era: the architecture and performance of
roadrunner, in: Proceedings of the 2008 ACM/IEEE conference on Su-
percomputing, IEEE Press, 2008, p. 1.
[50] T. Chen, R. Raghavan, J. Dale, E. Iwata, Cell Broadband Engine Architecture and its first implementation: a performance view, IBM Journal of Research and Development 51 (5) (2007) 559–572.
[51] V. Vasudevan, D. Andersen, M. Kaminsky, L. Tan, J. Franklin,
I. Moraru, Energy-efficient cluster computing with FAWN: Workloads
and implications, in: Proceedings of the 1st International Conference
on Energy-Efficient Computing and Networking, ACM, 2010, pp. 195–
204.
[52] W. Lang, J. Patel, S. Shankar, Wimpy node clusters: What about non-
wimpy workloads?, in: Proceedings of the Sixth International Workshop
on Data Management on New Hardware, ACM, 2010, pp. 47–55.
[53] P. Stanley-Marbell, V. C. Cabezas, Performance, power, and thermal
analysis of low-power processors for scale-out systems, in: Parallel and
Distributed Processing Workshops and Phd Forum (IPDPSW), 2011
IEEE International Symposium on, IEEE, 2011, pp. 863–870.
[54] K. Fürlinger, C. Klausecker, D. Kranzlmüller, Towards energy efficient
parallel computing on consumer electronic devices, in: Information and
Communication on Technology for the Fight against Global Warming,
Springer, 2011, pp. 1–9.
[58] IBM Systems and Technology, IBM System Blue Gene/Q Data Sheet,
http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12345usen/
DCD12345USEN.PDF, accessed: 5-May-2013.
[61] HP, HP Project Moonshot and the Redstone Development Server Plat-
form, http://h10032.www1.hp.com/ctg/Manual/c03442116.pdf, ac-
cessed: 15-May-2013.