ARM Virtualization: Performance and Architectural Implications
Christoffer Dall, Shih-Wei Li, Jin Tack Lim, Jason Nieh, and Georgios Koloventzos
Department of Computer Science
Columbia University
New York, NY, United States
{cdall,shihwei,jintack,nieh,gkoloven}@cs.columbia.edu
Abstract—ARM servers are becoming increasingly common, making server technologies such as virtualization for ARM of growing importance. We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of two popular ARM and x86 hypervisors, KVM and Xen. We show how ARM hardware support for virtualization can enable much faster transitions between VMs and the hypervisor, a key hypervisor operation. However, current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to leverage this performance benefit for real application workloads. We discuss the reasons why and show that other factors related to hypervisor software design and implementation have a larger role in overall performance. Based on our measurements, we discuss changes to ARM's hardware virtualization support that can potentially bridge the gap to bring its faster VM-to-hypervisor transition mechanism to modern Type 2 hypervisors running real applications. These changes have been incorporated into the latest ARM architecture.

Keywords-computer architecture; hypervisors; operating systems; virtualization; multi-core; performance; ARM; x86

I. INTRODUCTION

ARM CPUs have become the platform of choice across mobile and embedded systems, leveraging their benefits in customizability and power efficiency in these markets. The release of the 64-bit ARM architecture, ARMv8 [1], with its improved computing capabilities is spurring an upward push of ARM CPUs into traditional server systems. A growing number of companies are deploying commercially available ARM servers to meet their computing infrastructure needs. As virtualization plays an important role for servers, ARMv8 provides hardware virtualization support. Major virtualization players, including KVM [2] and Xen [3], leverage ARM hardware virtualization extensions to support unmodified existing operating systems (OSes) and applications with improved hypervisor performance.

Despite these trends and the importance of ARM virtualization, little is known in practice regarding how well virtualized systems perform using ARM. There are no detailed studies of ARM virtualization performance on server hardware. Although KVM and Xen both have ARM and x86 virtualization solutions, there are substantial differences between their ARM and x86 approaches because of key architectural differences between the underlying ARM and x86 hardware virtualization mechanisms. It is unclear whether these differences have a material impact, positive or negative, on performance. The lack of clear performance data limits the ability of hardware and software architects to build efficient ARM virtualization solutions, and limits the ability of companies to evaluate how best to deploy ARM virtualization solutions to meet their infrastructure needs. The increasing demand for ARM-based solutions and growing investments in ARM server infrastructure make this problem one of key importance.

We present the first in-depth study of ARM virtualization performance on multi-core server hardware. We measure the performance of the two most popular ARM hypervisors, KVM and Xen, and compare them with their respective x86 counterparts. These hypervisors are important and useful to compare on ARM given their popularity and their different design choices. Xen is a standalone bare-metal hypervisor, commonly referred to as a Type 1 hypervisor. KVM is a hosted hypervisor integrated within an existing OS kernel, commonly referred to as a Type 2 hypervisor.

We have designed and run a number of microbenchmarks to analyze the performance of frequent low-level hypervisor operations, and we use these results to highlight differences in performance between Type 1 and Type 2 hypervisors on ARM. A key characteristic of hypervisor performance is the cost of transitioning from a virtual machine (VM) to the hypervisor, for example to process interrupts, allocate memory to the VM, or perform I/O. We show that Type 1 hypervisors, such as Xen, can transition between the VM and the hypervisor much faster than Type 2 hypervisors, such as KVM, on ARM. We show that ARM can enable significantly faster transitions between the VM and a Type 1 hypervisor compared to x86. On the other hand, Type 2 hypervisors, such as KVM, incur much higher overhead on ARM for VM-to-hypervisor transitions compared to x86. We also show that for some more complicated hypervisor operations, such as switching between VMs, Type 1 and Type 2 hypervisors perform equally fast on ARM.
Despite the performance benefit in VM transitions that ARM can provide, we show that current hypervisor designs, including both KVM and Xen on ARM, result in real application performance that cannot be easily correlated with the low-level virtualization operation performance. In fact, for many workloads, we show that KVM ARM, a Type 2 hypervisor, can meet or exceed the performance of Xen ARM, a Type 1 hypervisor, despite the faster transitions between the VM and hypervisor using Type 1 hypervisor designs on ARM. We show how other factors related to hypervisor software design and implementation play a larger role in overall performance. These factors include the hypervisor's virtual I/O model, the ability to perform zero copy I/O efficiently, and interrupt processing overhead. Although ARM hardware virtualization support incurs higher overhead on VM-to-hypervisor transitions for Type 2 hypervisors than x86, we show that both types of ARM hypervisors can achieve similar, and in some cases lower, performance overhead than their x86 counterparts on real application workloads.

To enable modern hypervisor designs to leverage the potentially faster VM transition costs when using ARM hardware, we discuss changes to the ARMv8 architecture that can benefit Type 2 hypervisors. These improvements potentially enable Type 2 hypervisor designs such as KVM to achieve faster VM-to-hypervisor transitions, including for hypervisor events involving I/O, resulting in reduced virtualization overhead on real application workloads. ARM has incorporated these changes into the latest ARMv8.1 architecture.

II. BACKGROUND

[Figure 1: Hypervisor Design. Native (App, Kernel, HW), Type 1 (VMs running on a Hypervisor on HW), and Type 2 (VMs and Apps running on a Host OS / Hypervisor on HW).]

Hypervisor Overview. Figure 1 depicts the two main hypervisor designs, Type 1 and Type 2. Type 1 hypervisors, like Xen, comprise a separate hypervisor software component, which runs directly on the hardware and provides a virtual machine abstraction to VMs running on top of the hypervisor. Type 2 hypervisors, like KVM, run an existing OS on the hardware and run both VMs and applications on top of the OS. Type 2 hypervisors typically modify the existing OS to facilitate running of VMs, either by integrating the Virtual Machine Monitor (VMM) into the existing OS source code base, or by installing the VMM as a driver into the OS. KVM integrates directly with Linux [4] where other solutions such as VMware Workstation [5] use a loadable driver in the existing OS kernel to monitor virtual machines. The OS integrated with a Type 2 hypervisor is commonly referred to as the host OS, as opposed to the guest OS which runs in a VM.

One advantage of Type 2 hypervisors over Type 1 hypervisors is the reuse of existing OS code, specifically device drivers for a wide range of available hardware. This is especially true for server systems with PCI where any commercially available PCI adapter can be used. Traditionally, a Type 1 hypervisor suffers from having to re-implement device drivers for all supported hardware. However, Xen [6], a Type 1 hypervisor, avoids this by only implementing a minimal amount of hardware support directly in the hypervisor and running a special privileged VM, Dom0, which runs an existing OS such as Linux and uses all the existing device drivers for that OS. Xen then uses Dom0 to perform I/O using existing device drivers on behalf of normal VMs, also known as DomUs.

Transitions from a VM to the hypervisor occur whenever the hypervisor exercises system control, such as processing interrupts or I/O. The hypervisor transitions back to the VM once it has completed its work managing the hardware, letting workloads in VMs continue executing. The cost of such transitions is pure overhead and can add significant latency in communication between the hypervisor and the VM. A primary goal in designing both hypervisor software and hardware support for virtualization is to reduce the frequency and cost of transitions as much as possible.

VMs can run guest OSes with standard device drivers for I/O, but because they do not have direct access to hardware, the hypervisor would need to emulate real I/O devices in software. This results in frequent transitions between the VM and the hypervisor, making each interaction with the emulated device an order of magnitude slower than communicating with real hardware. Alternatively, direct passthrough of I/O from a VM to the real I/O devices can be done using device assignment, but this requires more expensive hardware support and complicates VM migration. Instead, the most common approach is paravirtual I/O in which custom device drivers are used in VMs for virtual devices supported by the hypervisor. The interface between the VM device driver and the virtual device is specifically designed to optimize interactions between the VM and the hypervisor and facilitate fast I/O. KVM uses an implementation of the Virtio [7] protocol for disk and networking support, and Xen uses its own implementation referred to simply as Xen PV. In KVM, the virtual device backend is implemented in the host OS, and in Xen the virtual device backend is implemented in the Dom0 kernel. A key potential performance advantage for KVM is that the virtual device implementation in the KVM host kernel has full access to all of the machine's hardware resources, including VM memory. On the other hand, Xen provides stronger isolation between the virtual device implementation and the VM as the Xen virtual device implementation lives in a separate VM, Dom0, which only has access to memory and hardware resources specifically allocated to it by the Xen hypervisor.
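To make the paravirtual I/O model above concrete, the sketch below shows the flavor of a Virtio-style descriptor ring: the guest frontend places buffer descriptors in memory shared with the backend and then notifies the device once per batch, so data can be exchanged with few VM-to-hypervisor transitions. The descriptor field layout follows the Virtio ring format, but the queue structure and helper function are hypothetical simplifications, not KVM's or Xen's actual implementation.

    #include <stdint.h>

    /* Illustrative sketch of a Virtio-style descriptor ring entry used by
     * paravirtual split drivers: the guest frontend fills descriptors in
     * shared memory and notifies ("kicks") the backend once per batch,
     * rather than trapping to the hypervisor for every device register
     * access. The queue struct and helper below are simplified and
     * hypothetical. */
    struct vring_desc {
            uint64_t addr;   /* guest-physical address of the buffer */
            uint32_t len;    /* buffer length in bytes */
            uint16_t flags;  /* e.g. chained or write-only buffer flags */
            uint16_t next;   /* index of the chained descriptor, if any */
    };

    struct toy_virtqueue {
            struct vring_desc desc[256]; /* descriptor table shared with the backend */
            uint16_t next_free;          /* next unused descriptor slot */
    };

    /* Queue one buffer for the backend; the caller later notifies the device
     * once, amortizing the cost of the VM-to-hypervisor transition. */
    static uint16_t toy_queue_buffer(struct toy_virtqueue *vq,
                                     uint64_t guest_phys_addr, uint32_t len)
    {
            uint16_t idx = vq->next_free++ % 256;

            vq->desc[idx].addr  = guest_phys_addr;
            vq->desc[idx].len   = len;
            vq->desc[idx].flags = 0;
            vq->desc[idx].next  = 0;
            return idx;
    }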
ARM Virtualization Extensions. To enable hypervisors to efficiently run VMs with unmodified guest OSes, ARM introduced hardware virtualization extensions [1] to overcome the limitation that the ARM architecture was not classically virtualizable [8]. All server and networking class ARM hardware is expected to implement these extensions. We provide a brief overview of the ARM hardware virtualization extensions and how hypervisors leverage these extensions, focusing on ARM CPU virtualization support and contrasting it to how x86 works.

The ARM virtualization extensions are centered around a new CPU privilege level (also known as exception level), EL2, added to the existing user and kernel levels, EL0 and EL1, respectively. Software running in EL2 can configure the hardware to support VMs. To allow VMs to interact with an interface identical to the physical machine while isolating them from the rest of the system and preventing them from gaining full access to the hardware, a hypervisor enables the virtualization features in EL2 before switching to a VM. The VM will then execute normally in EL0 and EL1 until some condition is reached that requires intervention of the hypervisor. At this point, the hardware traps into EL2 giving control to the hypervisor, which can then interact directly with the hardware and eventually return to the VM again. When all virtualization features are disabled in EL2, software running in EL1 and EL0 works just like on a system without the virtualization extensions where software running in EL1 has full access to the hardware.

ARM hardware virtualization support enables traps to EL2 on certain operations, enables virtualized physical memory support, and provides virtual interrupt and timer support. ARM provides CPU virtualization by allowing software in EL2 to configure the CPU to trap to EL2 on sensitive instructions that cannot be safely executed by a VM. ARM provides memory virtualization by allowing software in EL2 to point to a set of page tables, Stage-2 page tables, used to translate the VM's view of physical addresses to machine addresses. When Stage-2 translation is enabled, the ARM architecture defines three address spaces: Virtual Addresses (VA), Intermediate Physical Addresses (IPA), and Physical Addresses (PA). Stage-2 translation, configured in EL2, translates from IPAs to PAs. ARM provides interrupt virtualization through a set of virtualization extensions to the ARM Generic Interrupt Controller (GIC) architecture, which allows a hypervisor to program the GIC to inject virtual interrupts to VMs, which VMs can acknowledge and complete without trapping to the hypervisor. However, enabling and disabling virtual interrupts must be done in EL2. Furthermore, all physical interrupts are taken to EL2 when running in a VM, and therefore must be handled by the hypervisor. Finally, ARM provides a virtual timer, which can be configured by the VM without trapping to the hypervisor. However, when the virtual timer fires, it raises a physical interrupt, which must be handled by the hypervisor and translated into a virtual interrupt.
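To illustrate the three address spaces defined when Stage-2 translation is enabled, the toy program below composes a guest-controlled Stage-1 mapping (VA to IPA) with a hypervisor-controlled Stage-2 mapping (IPA to PA). It is a deliberately simplified, single-level lookup for exposition only; real translations are multi-level page table walks performed by the hardware.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy illustration of two-stage address translation: Stage 1 is
     * controlled by the guest OS (VA -> IPA), Stage 2 by the hypervisor in
     * EL2 (IPA -> PA). Single-level 4 KB "page tables" are assumed purely
     * for illustration. */
    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1UL << PAGE_SHIFT) - 1)
    #define NPAGES     16

    static uint64_t stage1_table[NPAGES]; /* VA page -> IPA page (guest's view) */
    static uint64_t stage2_table[NPAGES]; /* IPA page -> PA page (hypervisor's view) */

    static uint64_t translate(const uint64_t *table, uint64_t addr)
    {
            uint64_t page = (addr >> PAGE_SHIFT) % NPAGES;
            return (table[page] << PAGE_SHIFT) | (addr & PAGE_MASK);
    }

    int main(void)
    {
            /* Guest maps VA page 1 to IPA page 3; hypervisor maps IPA page 3
             * to PA page 7. */
            stage1_table[1] = 3;
            stage2_table[3] = 7;

            uint64_t va  = (1UL << PAGE_SHIFT) | 0x2a;
            uint64_t ipa = translate(stage1_table, va);   /* Stage-1 walk */
            uint64_t pa  = translate(stage2_table, ipa);  /* Stage-2 walk */

            printf("VA 0x%llx -> IPA 0x%llx -> PA 0x%llx\n",
                   (unsigned long long)va, (unsigned long long)ipa,
                   (unsigned long long)pa);
            return 0;
    }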
ARM hardware virtualization support has some similarities to x86^1, including providing a means to trap on sensitive instructions and a nested set of page tables to virtualize physical memory. However, there are key differences in how they support Type 1 and Type 2 hypervisors. While ARM virtualization extensions are centered around a separate CPU mode, x86 support provides a mode switch, root vs. non-root mode, completely orthogonal from the CPU privilege rings. While ARM's EL2 is a strictly different CPU mode with its own set of features, x86 root mode supports the same full range of user and kernel mode functionality as its non-root mode. Both ARM and x86 trap into their respective EL2 and root modes, but transitions between root and non-root mode on x86 are implemented with a VM Control Structure (VMCS) residing in normal memory, to and from which hardware state is automatically saved and restored when switching to and from root mode, for example when the hardware traps from a VM to the hypervisor. ARM, being a RISC-style architecture, instead has a simpler hardware mechanism to transition between EL1 and EL2 but leaves it up to software to decide which state needs to be saved and restored. This provides more flexibility in the amount of work that needs to be done when transitioning between EL1 and EL2 compared to switching between root and non-root mode on x86, but poses different requirements on hypervisor software implementation.

^1 Since Intel's and AMD's hardware virtualization support are very similar, we limit our comparison to ARM and Intel.
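To illustrate what leaving the choice of saved and restored state to software means in practice, the sketch below shows code of the kind a hypervisor running in EL2 might use to swap a few EL1 system registers when changing execution contexts. It is a heavily simplified illustration, not actual KVM or Xen code: a real world switch saves far more state (general purpose, floating point, GIC, and timer registers among others), and the struct and function names are hypothetical.

    #include <stdint.h>

    /* Minimal sketch: code running in EL2 explicitly moves a chosen subset
     * of EL1 system registers between memory and the hardware when switching
     * between the host and a VM context. On x86, the equivalent state is
     * saved and restored automatically via the VMCS. */
    struct el1_sysregs {
            uint64_t sctlr_el1;   /* system control */
            uint64_t ttbr0_el1;   /* translation table base 0 */
            uint64_t ttbr1_el1;   /* translation table base 1 */
            uint64_t tcr_el1;     /* translation control */
            uint64_t vbar_el1;    /* exception vector base */
    };

    #define read_sysreg(r) ({ uint64_t __v; \
            asm volatile("mrs %0, " #r : "=r"(__v)); __v; })
    #define write_sysreg(v, r) \
            asm volatile("msr " #r ", %0" : : "r"((uint64_t)(v)))

    /* Runs in EL2: save the outgoing EL1 context, install the incoming one. */
    static void switch_el1_context(struct el1_sysregs *out,
                                   const struct el1_sysregs *in)
    {
            out->sctlr_el1 = read_sysreg(sctlr_el1);
            out->ttbr0_el1 = read_sysreg(ttbr0_el1);
            out->ttbr1_el1 = read_sysreg(ttbr1_el1);
            out->tcr_el1   = read_sysreg(tcr_el1);
            out->vbar_el1  = read_sysreg(vbar_el1);

            write_sysreg(in->sctlr_el1, sctlr_el1);
            write_sysreg(in->ttbr0_el1, ttbr0_el1);
            write_sysreg(in->ttbr1_el1, ttbr1_el1);
            write_sysreg(in->tcr_el1,   tcr_el1);
            write_sysreg(in->vbar_el1,  vbar_el1);
            asm volatile("isb"); /* make the new context visible before returning */
    }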
[Figures 2 and 3: Xen ARM architecture (Xen in EL2; Dom0 running backend device drivers and a VM running frontend drivers, with their kernels in EL1) and KVM ARM architecture (host OS kernel in EL1 containing KVM, vGIC support, and Virtio devices; VM kernel in EL1 containing Virtio drivers).]

ARM Hypervisor Implementations. As shown in Figures 2 and 3, Xen and KVM take different approaches to using ARM hardware virtualization support. Xen as a Type 1 hypervisor design maps easily to the ARM architecture, running the entire hypervisor in EL2 and running VM userspace and VM kernel in EL0 and EL1, respectively. However, existing OSes are designed to run in EL1, so a Type 2 hypervisor that leverages an existing OS such as Linux to interface with hardware does not map as easily to the ARM architecture. EL2 is strictly more privileged and a separate CPU mode with different registers than EL1, so running Linux in EL2 would require substantial changes to Linux that would not be acceptable in practice. KVM instead runs across both EL2 and EL1 using split-mode virtualization [2], sharing EL1 between the host OS and VMs and running a minimal set of hypervisor functionality in EL2 to be able to leverage the ARM virtualization extensions. KVM enables virtualization features in EL2 when switching from the host to a VM, and disables them when switching back, allowing the host full access to the hardware from EL1 and properly isolating VMs also running in EL1. As a result, transitioning between the VM and the hypervisor involves transitioning to EL2 to run the part of KVM running in EL2, then transitioning to EL1 to run the rest of KVM and the host kernel. However, because both the host and the VM run in EL1, the hypervisor must context switch all register state when switching between host and VM execution context, similar to a regular process context switch.

This difference on ARM between Xen and KVM does not exist on x86 because the root mode used by the hypervisor does not limit or change how CPU privilege levels are used. Running Linux in root mode does not require any changes to Linux, so KVM maps just as easily to the x86 architecture as Xen by running the hypervisor in root mode.

KVM only runs the minimal set of hypervisor functionality in EL2 to be able to switch between VMs and the host, and emulates all virtual devices in the host OS running in EL1 and EL0. When a KVM VM performs I/O, it involves trapping to EL2, switching to host EL1, and handling the I/O request in the host. Because Xen only emulates the GIC in EL2 and offloads all other I/O handling to Dom0, when a Xen VM performs I/O, it involves trapping to the hypervisor, signaling Dom0, scheduling Dom0, and handling the I/O request in Dom0.

III. EXPERIMENTAL DESIGN

To evaluate the performance of ARM virtualization, we ran both microbenchmarks and real application workloads on the most popular hypervisors on ARM server hardware. As a baseline for comparison, we also conducted the same experiments with corresponding x86 hypervisors and server hardware. We leveraged the CloudLab [9] infrastructure for both ARM and x86 hardware.

ARM measurements were done using HP Moonshot m400 servers, each with a 64-bit ARMv8-A 2.4 GHz Applied Micro Atlas SoC with 8 physical CPU cores. Each m400 node had 64 GB of RAM, a 120 GB SATA3 SSD for storage, and a Dual-port Mellanox ConnectX-3 10 GbE NIC.

x86 measurements were done using Dell PowerEdge r320 servers, each with a 64-bit Xeon 2.1 GHz E5-2450 with 8 physical CPU cores. Hyperthreading was disabled on the r320 nodes to provide a similar hardware configuration to the ARM servers. Each r320 node had 16 GB of RAM, a 4x500 GB 7200 RPM SATA RAID5 HD for storage, and a Dual-port Mellanox MX354A 10 GbE NIC. All servers are connected via 10 GbE, and the interconnecting network switch [10] easily handles multiple sets of nodes communicating with full 10 Gb bandwidth such that experiments involving networking between two nodes can be considered isolated and unaffected by other traffic in the system. Using 10 Gb Ethernet was important, as many benchmarks were unaffected by virtualization when run over 1 Gb Ethernet, because the network itself became the bottleneck.

To provide comparable measurements, we kept the software environments across all hardware platforms and all hypervisors the same as much as possible. We used the most recent stable versions available at the time of our experiments of the most popular hypervisors on ARM and their counterparts on x86: KVM in Linux 4.0-rc4 with QEMU 2.2.0, and Xen 4.5.0. KVM was configured with its standard VHOST networking feature, allowing data handling to occur in the kernel instead of userspace, and with cache=none for its block storage devices. Xen was configured with its in-kernel block and network backend drivers to provide best performance and reflect the most commonly used I/O configuration for Xen deployments. Xen x86 was configured to use HVM domains, except for Dom0 which was only supported as a PV instance. All hosts and VMs used Ubuntu 14.04 with the same Linux 4.0-rc4 kernel and software configuration for all machines. A few patches were applied to support the various hardware configurations, such as adding support for the APM X-Gene PCI bus for the HP m400 servers. All VMs used paravirtualized I/O, typical of cloud infrastructure deployments such as Amazon EC2, instead of device passthrough, due to the absence of an IOMMU in our test environment.

We ran benchmarks both natively on the hosts and in VMs. Each physical or virtual machine instance used for running benchmarks was configured as a 4-way SMP with 12 GB of RAM to provide a common basis for comparison. This involved three configurations: (1) running natively on Linux capped at 4 cores and 12 GB RAM, (2) running in a VM using KVM with 8 cores and 16 GB RAM with the VM capped at 4 virtual CPUs (VCPUs) and 12 GB RAM, and (3) running in a VM using Xen with Dom0, the privileged domain used by Xen with direct hardware access, capped at 4 VCPUs and 4 GB RAM and the VM capped at 4 VCPUs and 12 GB RAM. Because KVM configures the total hardware available while Xen configures the hardware dedicated to Dom0, the configuration parameters are different but the effect is the same, which is to leave the hypervisor with 4 cores and 4 GB RAM to use outside of what is used by the VM. We use and measure multi-core configurations to reflect real-world server deployments. The memory limit was used to ensure a fair comparison across all hardware configurations given the RAM available on the x86 servers and the need to also provide RAM for use by the hypervisor when running VMs. For benchmarks that involve clients interfacing with the server, the clients were run natively on Linux and configured to use the full hardware available.
To improve precision of our measurements and for our experimental setup to mimic recommended configuration best practices [11], we pinned each VCPU to a specific physical CPU (PCPU) and generally ensured that no other work was scheduled on that PCPU. In KVM, all of the host's device interrupts and processes were assigned to run on a specific set of PCPUs and each VCPU was pinned to a dedicated PCPU from a separate set of PCPUs. In Xen, we configured Dom0 to run on a set of PCPUs and DomU to run on a separate set of PCPUs. We further pinned each VCPU of both Dom0 and DomU to its own PCPU.
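In KVM, each VCPU is an ordinary host thread, so the pinning described above can be expressed with the standard Linux CPU affinity API; Xen exposes equivalent controls through its toolstack. The snippet below is an illustrative sketch of pinning the calling thread to one PCPU, not necessarily the exact tooling used for the experiments.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Illustrative sketch: pin the calling thread (e.g., a VCPU thread) to a
     * single physical CPU so no other PCPU runs it. Hypervisor toolstacks
     * (libvirt, xl) expose the same policy at a higher level. */
    static int pin_to_pcpu(int pcpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(pcpu, &set);
            /* pid 0 means "the calling thread" */
            if (sched_setaffinity(0, sizeof(set), &set) != 0) {
                    perror("sched_setaffinity");
                    return -1;
            }
            return 0;
    }

    int main(void)
    {
            return pin_to_pcpu(3) ? 1 : 0; /* example: dedicate PCPU 3 */
    }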
IV. MICROBENCHMARK RESULTS

We designed and ran a number of microbenchmarks to quantify important low-level interactions between the hypervisor and the ARM hardware support for virtualization. A primary performance cost in running in a VM is how much time must be spent outside the VM, which is time not spent running the workload in the VM and therefore is virtualization overhead compared to native execution. Therefore, our microbenchmarks are designed to measure time spent handling a trap from the VM to the hypervisor, including time spent on transitioning between the VM and the hypervisor, time spent processing interrupts, time spent switching between VMs, and latency added to I/O.

We designed a custom Linux kernel driver, which ran in the VM under KVM and Xen, on ARM and x86, and executed the microbenchmarks in the same way across all platforms. Measurements were obtained using cycle counters and ARM hardware timer counters to ensure consistency across multiple CPUs. Instruction barriers were used before and after taking timestamps to avoid out-of-order execution or pipelining from skewing our measurements.
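As an illustration of this measurement approach, the helper below reads the ARMv8 generic timer counter with instruction barriers on either side so that out-of-order execution cannot move work across the timestamp. This is a sketch of one common technique on ARM, not the actual driver code used for the paper, which also used per-CPU cycle counters and ran on x86.

    #include <stdint.h>

    /* Read the ARMv8 generic timer counter (cntvct_el0), serialized with ISB
     * so that preceding and following instructions cannot be reordered across
     * the timestamp. Sketch only. */
    static inline uint64_t read_counter(void)
    {
            uint64_t val;

            asm volatile("isb" ::: "memory");
            asm volatile("mrs %0, cntvct_el0" : "=r"(val));
            asm volatile("isb" ::: "memory");
            return val;
    }

    /* Usage sketch: time one operation in counter ticks. */
    static uint64_t time_operation(void (*op)(void))
    {
            uint64_t start = read_counter();

            op();
            return read_counter() - start;
    }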
Because these measurements were at the level of a few hundred to a few thousand cycles, it was important to minimize measurement variability, especially in the context of measuring performance on multi-core systems. Variations caused by interrupts and scheduling can skew measurements by thousands of cycles. To address this, we pinned and isolated VCPUs as described in Section III, and also ran these measurements from within VMs pinned to specific VCPUs, assigning all virtual interrupts to other VCPUs.

Using this framework, we ran seven microbenchmarks that measure various low-level aspects of hypervisor performance, as listed in Table I. Table II presents the results from running these microbenchmarks on both ARM and x86 server hardware. Measurements are shown in cycles instead of time to provide a useful comparison across server hardware with different CPU frequencies, but we focus our analysis on the ARM measurements.
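For reference, the Hypercall operation listed in Table I boils down to a single trapping instruction on ARM: the guest kernel executes hvc, the hardware enters EL2, and the hypervisor eventually resumes the VM. The sketch below is illustrative only; real hypercall conventions pass arguments in registers x0-x7 and may clobber additional registers.

    /* Sketch of how a VM issues a hypercall on ARM: from the guest kernel
     * (EL1), the hvc instruction traps to the hypervisor in EL2, which does
     * its work and resumes the VM. Simplified for illustration. */
    static inline unsigned long hvc_call(unsigned long function_id)
    {
            register unsigned long x0 asm("x0") = function_id;

            asm volatile("hvc #0" : "+r"(x0) : : "memory");
            return x0;
    }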
    Hypercall: Transition from VM to hypervisor and return to VM without doing any work in the hypervisor. Measures bidirectional base transition cost of hypervisor operations.
    Interrupt Controller Trap: Trap from VM to emulated interrupt controller then return to VM. Measures a frequent operation for many device drivers and baseline for accessing I/O devices emulated in the hypervisor.
    Virtual IPI: Issue a virtual IPI from a VCPU to another VCPU running on a different PCPU, both PCPUs executing VM code. Measures time between sending the virtual IPI until the receiving VCPU handles it, a frequent operation in multi-core OSes.
    Virtual IRQ Completion: VM acknowledging and completing a virtual interrupt. Measures a frequent operation that happens for every injected virtual interrupt.
    VM Switch: Switch from one VM to another on the same physical core. Measures a central cost when oversubscribing physical CPUs.
    I/O Latency Out: Measures latency between a driver in the VM signaling the virtual I/O device in the hypervisor and the virtual I/O device receiving the signal. For KVM, this traps to the host kernel. For Xen, this traps to Xen then raises a virtual interrupt to Dom0.
    I/O Latency In: Measures latency between the virtual I/O device in the hypervisor signaling the VM and the VM receiving the corresponding virtual interrupt. For KVM, this signals the VCPU thread and injects a virtual interrupt for the Virtio device. For Xen, this traps to Xen then raises a virtual interrupt to DomU.

    Table I: Microbenchmarks

    Microbenchmark              ARM KVM   ARM Xen   x86 KVM   x86 Xen
    Hypercall                     6,500       376     1,300     1,228
    Interrupt Controller Trap     7,370     1,356     2,384     1,734
    Virtual IPI                  11,557     5,978     5,230     5,562
    Virtual IRQ Completion           71        71     1,556     1,464
    VM Switch                    10,387     8,799     4,812    10,534
    I/O Latency Out               6,024    16,491       560    11,262
    I/O Latency In               13,872    15,650    18,923    10,050

    Table II: Microbenchmark Measurements (cycle counts)

    Register State              Save   Restore
    GP Regs                      152       184
    FP Regs                      282       310
    EL1 System Regs              230       511
    VGIC Regs                  3,250       181
    Timer Regs                   104       106
    EL2 Config Regs               92       107
    EL2 Virtual Memory Regs       92       107

    Table III: KVM ARM Hypercall Analysis (cycle counts)

The Hypercall microbenchmark shows that transitioning from a VM to the hypervisor on ARM can be significantly faster than x86, as shown by the Xen ARM measurement, which takes less than a third of the cycles that Xen or KVM on x86 take. As explained in Section II, the ARM architecture provides a separate CPU mode with its own register bank to run an isolated Type 1 hypervisor like Xen. Transitioning from a VM to a Type 1 hypervisor requires little more than context switching the general purpose registers as running the two separate execution contexts, VM and the hypervisor, is supported by the separate ARM hardware state for EL2. While ARM implements additional register state to support the different execution context of the hypervisor, x86 transitions from a VM to the hypervisor by switching from non-root to root mode which requires context switching the entire CPU register state to the VMCS in memory, which is much more expensive even with hardware support.

However, the Hypercall microbenchmark also shows that transitioning from a VM to the hypervisor is more than an order of magnitude more expensive for Type 2 hypervisors like KVM than for Type 1 hypervisors like Xen. This is
    [Application benchmark descriptions; the first entry is truncated in this excerpt]
    ... to benchmark the performance of the Java Runtime Environment. We used the 15.02 release of the Linaro AArch64 port of OpenJDK to run the benchmark.
    Netperf: netperf v2.6.0 starting netserver on the server and running with its default parameters on the client in three modes: TCP_RR, TCP_STREAM, and TCP_MAERTS, measuring latency and throughput, respectively.
    Apache: Apache v2.4.7 Web server running ApacheBench v2.3 on the remote client, which measures number of handled requests per second serving the 41 KB index file of the GCC 4.4 manual using 100 concurrent requests.
    Memcached: memcached v1.4.14 using the memtier benchmark v1.2.3 with its default parameters.
    MySQL: MySQL v14.14 (distrib 5.5.41) running SysBench v.0.4.12 using the default configuration with 200 parallel transactions.

    [Figure: application benchmark performance; y-axis: Normalized Performance (0.00 to 1.20), x-axis: Application Benchmarks]