ARM Virtualization: Performance and Architectural Implications

Christoffer Dall, Shih-Wei Li, Jin Tack Lim, Jason Nieh, and Georgios Koloventzos
Department of Computer Science
Columbia University
New York, NY, United States
{cdall,shihwei,jintack,nieh,gkoloven}@cs.columbia.edu

Abstract—ARM servers are becoming increasingly common, making server technologies such as virtualization for ARM of growing importance. We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of two popular ARM and x86 hypervisors, KVM and Xen. We show how ARM hardware support for virtualization can enable much faster transitions between VMs and the hypervisor, a key hypervisor operation. However, current hypervisor designs, including both Type 1 hypervisors such as Xen and Type 2 hypervisors such as KVM, are not able to leverage this performance benefit for real application workloads. We discuss the reasons why and show that other factors related to hypervisor software design and implementation have a larger role in overall performance. Based on our measurements, we discuss changes to ARM's hardware virtualization support that can potentially bridge the gap to bring its faster VM-to-hypervisor transition mechanism to modern Type 2 hypervisors running real applications. These changes have been incorporated into the latest ARM architecture.

Keywords-computer architecture; hypervisors; operating systems; virtualization; multi-core; performance; ARM; x86

I. INTRODUCTION

ARM CPUs have become the platform of choice across mobile and embedded systems, leveraging their benefits in customizability and power efficiency in these markets. The release of the 64-bit ARM architecture, ARMv8 [1], with its improved computing capabilities is spurring an upward push of ARM CPUs into traditional server systems. A growing number of companies are deploying commercially available ARM servers to meet their computing infrastructure needs. As virtualization plays an important role for servers, ARMv8 provides hardware virtualization support. Major virtualization players, including KVM [2] and Xen [3], leverage ARM hardware virtualization extensions to support unmodified existing operating systems (OSes) and applications with improved hypervisor performance.

Despite these trends and the importance of ARM virtualization, little is known in practice regarding how well virtualized systems perform using ARM. There are no detailed studies of ARM virtualization performance on server hardware. Although KVM and Xen both have ARM and x86 virtualization solutions, there are substantial differences between their ARM and x86 approaches because of key architectural differences between the underlying ARM and x86 hardware virtualization mechanisms. It is unclear whether these differences have a material impact, positive or negative, on performance. The lack of clear performance data limits the ability of hardware and software architects to build efficient ARM virtualization solutions, and limits the ability of companies to evaluate how best to deploy ARM virtualization solutions to meet their infrastructure needs. The increasing demand for ARM-based solutions and growing investments in ARM server infrastructure make this problem one of key importance.

We present the first in-depth study of ARM virtualization performance on multi-core server hardware. We measure the performance of the two most popular ARM hypervisors, KVM and Xen, and compare them with their respective x86 counterparts. These hypervisors are important and useful to compare on ARM given their popularity and their different design choices. Xen is a standalone bare-metal hypervisor, commonly referred to as a Type 1 hypervisor. KVM is a hosted hypervisor integrated within an existing OS kernel, commonly referred to as a Type 2 hypervisor.

We have designed and run a number of microbenchmarks to analyze the performance of frequent low-level hypervisor operations, and we use these results to highlight differences in performance between Type 1 and Type 2 hypervisors on ARM. A key characteristic of hypervisor performance is the cost of transitioning from a virtual machine (VM) to the hypervisor, for example to process interrupts, allocate memory to the VM, or perform I/O. We show that Type 1 hypervisors, such as Xen, can transition between the VM and the hypervisor much faster than Type 2 hypervisors, such as KVM, on ARM. We show that ARM can enable significantly faster transitions between the VM and a Type 1 hypervisor compared to x86. On the other hand, Type 2 hypervisors such as KVM incur much higher overhead on ARM for VM-to-hypervisor transitions compared to x86. We also show that for some more complicated hypervisor operations, such as switching between VMs, Type 1 and Type 2 hypervisors perform equally fast on ARM.

Despite the performance benefit in VM transitions that ARM can provide, we show that current hypervisor designs, including both KVM and Xen on ARM, result in real application performance that cannot be easily correlated with the low-level virtualization operation performance. In fact, for many workloads, we show that KVM ARM, a
Type 2 hypervisor, can meet or exceed the performance of Xen ARM, a Type 1 hypervisor, despite the faster transitions between the VM and hypervisor using Type 1 hypervisor designs on ARM. We show how other factors related to hypervisor software design and implementation play a larger role in overall performance. These factors include the hypervisor's virtual I/O model, the ability to perform zero copy I/O efficiently, and interrupt processing overhead. Although ARM hardware virtualization support incurs higher overhead on VM-to-hypervisor transitions for Type 2 hypervisors than x86, we show that both types of ARM hypervisors can achieve similar, and in some cases lower, performance overhead than their x86 counterparts on real application workloads.

To enable modern hypervisor designs to leverage the potentially faster VM transition costs when using ARM hardware, we discuss changes to the ARMv8 architecture that can benefit Type 2 hypervisors. These improvements potentially enable Type 2 hypervisor designs such as KVM to achieve faster VM-to-hypervisor transitions, including for hypervisor events involving I/O, resulting in reduced virtualization overhead on real application workloads. ARM has incorporated these changes into the latest ARMv8.1 architecture.

II. BACKGROUND

Hypervisor Overview. Figure 1 depicts the two main hypervisor designs, Type 1 and Type 2. Type 1 hypervisors, like Xen, comprise a separate hypervisor software component, which runs directly on the hardware and provides a virtual machine abstraction to VMs running on top of the hypervisor. Type 2 hypervisors, like KVM, run an existing OS on the hardware and run both VMs and applications on top of the OS. Type 2 hypervisors typically modify the existing OS to facilitate running of VMs, either by integrating the Virtual Machine Monitor (VMM) into the existing OS source code base, or by installing the VMM as a driver into the OS. KVM integrates directly with Linux [4] where other solutions such as VMware Workstation [5] use a loadable driver in the existing OS kernel to monitor virtual machines. The OS integrated with a Type 2 hypervisor is commonly referred to as the host OS, as opposed to the guest OS which runs in a VM.

[Figure 1: Hypervisor Design. Native: applications run on a kernel that runs directly on the hardware. Type 1: VMs run on a hypervisor that runs directly on the hardware. Type 2: VMs run alongside applications on top of a host OS/hypervisor.]

One advantage of Type 2 hypervisors over Type 1 hypervisors is the reuse of existing OS code, specifically device drivers for a wide range of available hardware. This is especially true for server systems with PCI where any commercially available PCI adapter can be used. Traditionally, a Type 1 hypervisor suffers from having to re-implement device drivers for all supported hardware. However, Xen [6], a Type 1 hypervisor, avoids this by only implementing a minimal amount of hardware support directly in the hypervisor and running a special privileged VM, Dom0, which runs an existing OS such as Linux and uses all the existing device drivers for that OS. Xen then uses Dom0 to perform I/O using existing device drivers on behalf of normal VMs, also known as DomUs.

Transitions from a VM to the hypervisor occur whenever the hypervisor exercises system control, such as processing interrupts or I/O. The hypervisor transitions back to the VM once it has completed its work managing the hardware, letting workloads in VMs continue executing. The cost of such transitions is pure overhead and can add significant latency in communication between the hypervisor and the VM. A primary goal in designing both hypervisor software and hardware support for virtualization is to reduce the frequency and cost of transitions as much as possible.

VMs can run guest OSes with standard device drivers for I/O, but because they do not have direct access to hardware, the hypervisor would need to emulate real I/O devices in software. This results in frequent transitions between the VM and the hypervisor, making each interaction with the emulated device an order of magnitude slower than communicating with real hardware. Alternatively, direct passthrough of I/O from a VM to the real I/O devices can be done using device assignment, but this requires more expensive hardware support and complicates VM migration. Instead, the most common approach is paravirtual I/O in which custom device drivers are used in VMs for virtual devices supported by the hypervisor. The interface between the VM device driver and the virtual device is specifically designed to optimize interactions between the VM and the hypervisor and facilitate fast I/O. KVM uses an implementation of the Virtio [7] protocol for disk and networking support, and Xen uses its own implementation referred to simply as Xen PV. In KVM, the virtual device backend is implemented in the host OS, and in Xen the virtual device backend is implemented in the Dom0 kernel. A key potential performance advantage for KVM is that the virtual device implementation in the KVM host kernel has full access to all of the machine's hardware resources, including VM memory. On the other hand, Xen provides stronger isolation between the virtual device implementation and the VM as the Xen virtual device implementation lives in a separate VM, Dom0, which only has access to memory and hardware resources specifically allocated to it by the Xen hypervisor.
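To make the paravirtual I/O model concrete, below is a deliberately simplified sketch of the kind of shared descriptor ring a paravirtual frontend driver and its backend communicate through. The structure and field names are illustrative only and do not reproduce the actual Virtio vring or Xen PV ring layouts.

    #include <stdint.h>

    /* Illustrative only: a minimal shared descriptor ring. Real Virtio
     * rings (vring) and Xen PV rings differ in layout, synchronization,
     * and naming. */
    #define RING_SIZE 256

    struct pv_desc {
        uint64_t guest_addr;   /* guest-physical address of the data buffer */
        uint32_t len;          /* buffer length in bytes */
        uint32_t flags;        /* e.g., direction: device-readable or -writable */
    };

    struct pv_ring {
        struct pv_desc desc[RING_SIZE];
        volatile uint32_t producer;   /* advanced by the guest driver */
        volatile uint32_t consumer;   /* advanced by the backend (host or Dom0) */
    };

    /* Guest side: publish a buffer and return 1 if the backend should be
     * notified ("kicked"); that notification is the VM-to-hypervisor
     * transition whose cost the rest of this paper examines. */
    static int ring_post(struct pv_ring *r, uint64_t gpa, uint32_t len)
    {
        uint32_t idx = r->producer % RING_SIZE;

        r->desc[idx].guest_addr = gpa;
        r->desc[idx].len = len;
        r->desc[idx].flags = 0;
        __sync_synchronize();          /* make the descriptor visible first */
        r->producer++;
        return 1;                      /* caller issues the notify/kick */
    }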
ARM Virtualization Extensions. To enable hypervisors to efficiently run VMs with unmodified guest OSes, ARM introduced hardware virtualization extensions [1] to overcome the limitation that the ARM architecture was not classically virtualizable [8]. All server and networking class ARM hardware is expected to implement these extensions. We provide a brief overview of the ARM hardware virtualization extensions and how hypervisors leverage these extensions, focusing on ARM CPU virtualization support and contrasting it to how x86 works.

The ARM virtualization extensions are centered around a new CPU privilege level (also known as exception level), EL2, added to the existing user and kernel levels, EL0 and EL1, respectively. Software running in EL2 can configure the hardware to support VMs. To allow VMs to interact with an interface identical to the physical machine while isolating them from the rest of the system and preventing them from gaining full access to the hardware, a hypervisor enables the virtualization features in EL2 before switching to a VM. The VM will then execute normally in EL0 and EL1 until some condition is reached that requires intervention of the hypervisor. At this point, the hardware traps into EL2 giving control to the hypervisor, which can then interact directly with the hardware and eventually return to the VM again. When all virtualization features are disabled in EL2, software running in EL1 and EL0 works just like on a system without the virtualization extensions where software running in EL1 has full access to the hardware.

ARM hardware virtualization support enables traps to EL2 on certain operations, enables virtualized physical memory support, and provides virtual interrupt and timer support. ARM provides CPU virtualization by allowing software in EL2 to configure the CPU to trap to EL2 on sensitive instructions that cannot be safely executed by a VM. ARM provides memory virtualization by allowing software in EL2 to point to a set of page tables, Stage-2 page tables, used to translate the VM's view of physical addresses to machine addresses. When Stage-2 translation is enabled, the ARM architecture defines three address spaces: Virtual Addresses (VA), Intermediate Physical Addresses (IPA), and Physical Addresses (PA). Stage-2 translation, configured in EL2, translates from IPAs to PAs. ARM provides interrupt virtualization through a set of virtualization extensions to the ARM Generic Interrupt Controller (GIC) architecture, which allows a hypervisor to program the GIC to inject virtual interrupts to VMs, which VMs can acknowledge and complete without trapping to the hypervisor. However, enabling and disabling virtual interrupts must be done in EL2. Furthermore, all physical interrupts are taken to EL2 when running in a VM, and therefore must be handled by the hypervisor. Finally, ARM provides a virtual timer, which can be configured by the VM without trapping to the hypervisor. However, when the virtual timer fires, it raises a physical interrupt, which must be handled by the hypervisor and translated into a virtual interrupt.

ARM hardware virtualization support has some similarities to x86 (since Intel's and AMD's hardware virtualization support are very similar, we limit our comparison to ARM and Intel), including providing a means to trap on sensitive instructions and a nested set of page tables to virtualize physical memory. However, there are key differences in how they support Type 1 and Type 2 hypervisors. While ARM virtualization extensions are centered around a separate CPU mode, x86 support provides a mode switch, root vs. non-root mode, completely orthogonal from the CPU privilege rings. While ARM's EL2 is a strictly different CPU mode with its own set of features, x86 root mode supports the same full range of user and kernel mode functionality as its non-root mode. Both ARM and x86 trap into their respective EL2 and root modes, but transitions between root and non-root mode on x86 are implemented with a VM Control Structure (VMCS) residing in normal memory, to and from which hardware state is automatically saved and restored when switching to and from root mode, for example when the hardware traps from a VM to the hypervisor. ARM, being a RISC-style architecture, instead has a simpler hardware mechanism to transition between EL1 and EL2 but leaves it up to software to decide which state needs to be saved and restored. This provides more flexibility in the amount of work that needs to be done when transitioning between EL1 and EL2 compared to switching between root and non-root mode on x86, but poses different requirements on hypervisor software implementation.

[Figure 2: Xen ARM Architecture. Dom0 and the VMs run userspace in EL0 and their kernels, with backend and frontend device drivers, in EL1; Xen and its vGIC emulation run in EL2, with I/O passing between frontend and backend drivers.]

[Figure 3: KVM ARM Architecture. The host OS, containing KVM and the Virtio device backends, and the VMs, containing Virtio frontend drivers, run in EL0 and EL1; a minimal KVM component runs in EL2.]

ARM Hypervisor Implementations. As shown in Figures 2 and 3, Xen and KVM take different approaches to using ARM hardware virtualization support. Xen as a Type 1 hypervisor design maps easily to the ARM architecture,
running the entire hypervisor in EL2 and running VM userspace and VM kernel in EL0 and EL1, respectively. However, existing OSes are designed to run in EL1, so a Type 2 hypervisor that leverages an existing OS such as Linux to interface with hardware does not map as easily to the ARM architecture. EL2 is strictly more privileged and a separate CPU mode with different registers than EL1, so running Linux in EL2 would require substantial changes to Linux that would not be acceptable in practice. KVM instead runs across both EL2 and EL1 using split-mode virtualization [2], sharing EL1 between the host OS and VMs and running a minimal set of hypervisor functionality in EL2 to be able to leverage the ARM virtualization extensions. KVM enables virtualization features in EL2 when switching from the host to a VM, and disables them when switching back, allowing the host full access to the hardware from EL1 and properly isolating VMs also running in EL1. As a result, transitioning between the VM and the hypervisor involves transitioning to EL2 to run the part of KVM running in EL2, then transitioning to EL1 to run the rest of KVM and the host kernel. However, because both the host and the VM run in EL1, the hypervisor must context switch all register state when switching between host and VM execution context, similar to a regular process context switch.

This difference on ARM between Xen and KVM does not exist on x86 because the root mode used by the hypervisor does not limit or change how CPU privilege levels are used. Running Linux in root mode does not require any changes to Linux, so KVM maps just as easily to the x86 architecture as Xen by running the hypervisor in root mode.

KVM only runs the minimal set of hypervisor functionality in EL2 to be able to switch between VMs and the host, and emulates all virtual devices in the host OS running in EL1 and EL0. When a KVM VM performs I/O, it involves trapping to EL2, switching to host EL1, and handling the I/O request in the host. Because Xen only emulates the GIC in EL2 and offloads all other I/O handling to Dom0, when a Xen VM performs I/O, it involves trapping to the hypervisor, signaling Dom0, scheduling Dom0, and handling the I/O request in Dom0.

III. EXPERIMENTAL DESIGN

To evaluate the performance of ARM virtualization, we ran both microbenchmarks and real application workloads on the most popular hypervisors on ARM server hardware. As a baseline for comparison, we also conducted the same experiments with corresponding x86 hypervisors and server hardware. We leveraged the CloudLab [9] infrastructure for both ARM and x86 hardware.

ARM measurements were done using HP Moonshot m400 servers, each with a 64-bit ARMv8-A 2.4 GHz Applied Micro Atlas SoC with 8 physical CPU cores. Each m400 node had 64 GB of RAM, a 120 GB SATA3 SSD for storage, and a Dual-port Mellanox ConnectX-3 10 GbE NIC.

x86 measurements were done using Dell PowerEdge r320 servers, each with a 64-bit 2.1 GHz Xeon E5-2450 with 8 physical CPU cores. Hyperthreading was disabled on the r320 nodes to provide a similar hardware configuration to the ARM servers. Each r320 node had 16 GB of RAM, a 4x500 GB 7200 RPM SATA RAID5 HD for storage, and a Dual-port Mellanox MX354A 10 GbE NIC. All servers are connected via 10 GbE, and the interconnecting network switch [10] easily handles multiple sets of nodes communicating with full 10 Gb bandwidth such that experiments involving networking between two nodes can be considered isolated and unaffected by other traffic in the system. Using 10 Gb Ethernet was important, as many benchmarks were unaffected by virtualization when run over 1 Gb Ethernet, because the network itself became the bottleneck.

To provide comparable measurements, we kept the software environments across all hardware platforms and all hypervisors the same as much as possible. We used the most recent stable versions available at the time of our experiments of the most popular hypervisors on ARM and their counterparts on x86: KVM in Linux 4.0-rc4 with QEMU 2.2.0, and Xen 4.5.0. KVM was configured with its standard VHOST networking feature, allowing data handling to occur in the kernel instead of userspace, and with cache=none for its block storage devices. Xen was configured with its in-kernel block and network backend drivers to provide best performance and reflect the most commonly used I/O configuration for Xen deployments. Xen x86 was configured to use HVM domains, except for Dom0 which was only supported as a PV instance. All hosts and VMs used Ubuntu 14.04 with the same Linux 4.0-rc4 kernel and software configuration for all machines. A few patches were applied to support the various hardware configurations, such as adding support for the APM X-Gene PCI bus for the HP m400 servers. All VMs used paravirtualized I/O, typical of cloud infrastructure deployments such as Amazon EC2, instead of device passthrough, due to the absence of an IOMMU in our test environment.

We ran benchmarks both natively on the hosts and in VMs. Each physical or virtual machine instance used for running benchmarks was configured as a 4-way SMP with 12 GB of RAM to provide a common basis for comparison. This involved three configurations: (1) running natively on Linux capped at 4 cores and 12 GB RAM, (2) running in a VM using KVM with 8 cores and 16 GB RAM with the VM capped at 4 virtual CPUs (VCPUs) and 12 GB RAM, and (3) running in a VM using Xen with Dom0, the privileged domain used by Xen with direct hardware access, capped at 4 VCPUs and 4 GB RAM and the VM capped at 4 VCPUs and 12 GB RAM. Because KVM configures the total hardware available while Xen configures the hardware dedicated to Dom0, the configuration parameters are different but the effect is the same, which is to leave the hypervisor with 4 cores and 4 GB RAM to use outside of what is used
by the VM. We use and measure multi-core configurations to reflect real-world server deployments. The memory limit was used to ensure a fair comparison across all hardware configurations given the RAM available on the x86 servers and the need to also provide RAM for use by the hypervisor when running VMs. For benchmarks that involve clients interfacing with the server, the clients were run natively on Linux and configured to use the full hardware available.

To improve precision of our measurements and for our experimental setup to mimic recommended configuration best practices [11], we pinned each VCPU to a specific physical CPU (PCPU) and generally ensured that no other work was scheduled on that PCPU. In KVM, all of the host's device interrupts and processes were assigned to run on a specific set of PCPUs and each VCPU was pinned to a dedicated PCPU from a separate set of PCPUs. In Xen, we configured Dom0 to run on a set of PCPUs and DomU to run on a separate set of PCPUs. We further pinned each VCPU of both Dom0 and DomU to its own PCPU.
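This pinning is normally done with standard tools (taskset, libvirt's vcpupin, or Xen's xl vcpu-pin). As a hedged illustration of the underlying mechanism on the KVM host, the fragment below pins a thread, such as a QEMU VCPU thread identified by its TID, to a single PCPU using the Linux affinity API; the helper name and the surrounding tooling are ours, not the paper's.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Pin thread 'tid' (0 = calling thread) to physical CPU 'pcpu'.
     * For KVM, 'tid' would be the VCPU thread of the QEMU process. */
    static int pin_to_pcpu(pid_t tid, int pcpu)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(pcpu, &set);
        if (sched_setaffinity(tid, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }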
IV. MICROBENCHMARK RESULTS

We designed and ran a number of microbenchmarks to quantify important low-level interactions between the hypervisor and the ARM hardware support for virtualization. A primary performance cost in running in a VM is how much time must be spent outside the VM, which is time not spent running the workload in the VM and therefore is virtualization overhead compared to native execution. Therefore, our microbenchmarks are designed to measure time spent handling a trap from the VM to the hypervisor, including time spent on transitioning between the VM and the hypervisor, time spent processing interrupts, time spent switching between VMs, and latency added to I/O.

We designed a custom Linux kernel driver, which ran in the VM under KVM and Xen, on ARM and x86, and executed the microbenchmarks in the same way across all platforms. Measurements were obtained using cycle counters and ARM hardware timer counters to ensure consistency across multiple CPUs. Instruction barriers were used before and after taking timestamps to avoid out-of-order execution or pipelining from skewing our measurements.

Because these measurements were at the level of a few hundred to a few thousand cycles, it was important to minimize measurement variability, especially in the context of measuring performance on multi-core systems. Variations caused by interrupts and scheduling can skew measurements by thousands of cycles. To address this, we pinned and isolated VCPUs as described in Section III, and also ran these measurements from within VMs pinned to specific VCPUs, assigning all virtual interrupts to other VCPUs.

Using this framework, we ran seven microbenchmarks that measure various low-level aspects of hypervisor performance, as listed in Table I. Table II presents the results from running these microbenchmarks on both ARM and x86 server hardware. Measurements are shown in cycles instead of time to provide a useful comparison across server hardware with different CPU frequencies, but we focus our analysis on the ARM measurements.

Hypercall: Transition from VM to hypervisor and return to VM without doing any work in the hypervisor. Measures bidirectional base transition cost of hypervisor operations.
Interrupt Controller Trap: Trap from VM to emulated interrupt controller then return to VM. Measures a frequent operation for many device drivers and baseline for accessing I/O devices emulated in the hypervisor.
Virtual IPI: Issue a virtual IPI from a VCPU to another VCPU running on a different PCPU, both PCPUs executing VM code. Measures time between sending the virtual IPI until the receiving VCPU handles it, a frequent operation in multi-core OSes.
Virtual IRQ Completion: VM acknowledging and completing a virtual interrupt. Measures a frequent operation that happens for every injected virtual interrupt.
VM Switch: Switch from one VM to another on the same physical core. Measures a central cost when oversubscribing physical CPUs.
I/O Latency Out: Measures latency between a driver in the VM signaling the virtual I/O device in the hypervisor and the virtual I/O device receiving the signal. For KVM, this traps to the host kernel. For Xen, this traps to Xen then raises a virtual interrupt to Dom0.
I/O Latency In: Measures latency between the virtual I/O device in the hypervisor signaling the VM and the VM receiving the corresponding virtual interrupt. For KVM, this signals the VCPU thread and injects a virtual interrupt for the Virtio device. For Xen, this traps to Xen then raises a virtual interrupt to DomU.

Table I: Microbenchmarks
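The paper does not include the source of its benchmark driver, but the sketch below illustrates the measurement idea for the Hypercall microbenchmark: read the ARM architected counter with instruction barriers around each timestamp, issue a null hypercall, and take the difference. The hvc immediate and argument registers are placeholders, the real hypercall ABIs of KVM and Xen differ, and the code must run at EL1 (i.e., inside a guest kernel driver), so treat this purely as an illustration.

    #include <stdint.h>

    /* Read the ARMv8 virtual counter, with ISBs so that neither the counter
     * read nor the surrounding code is reordered around the measurement. */
    static inline uint64_t read_cntvct(void)
    {
        uint64_t val;

        asm volatile("isb; mrs %0, cntvct_el0; isb" : "=r"(val) :: "memory");
        return val;
    }

    /* Time one VM-to-hypervisor-and-back round trip using a no-op hypercall.
     * The immediate below is a placeholder; the actual hypercall convention
     * is hypervisor-specific, and this must execute at EL1. */
    static inline uint64_t time_null_hypercall(void)
    {
        uint64_t start, end;

        start = read_cntvct();
        asm volatile("hvc #0" ::: "memory", "x0", "x1", "x2", "x3");
        end = read_cntvct();
        return end - start;   /* counter ticks; convert via cntfrq_el0 if needed */
    }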
Microbenchmark              ARM KVM   ARM Xen   x86 KVM   x86 Xen
Hypercall                     6,500       376     1,300     1,228
Interrupt Controller Trap     7,370     1,356     2,384     1,734
Virtual IPI                  11,557     5,978     5,230     5,562
Virtual IRQ Completion           71        71     1,556     1,464
VM Switch                    10,387     8,799     4,812    10,534
I/O Latency Out               6,024    16,491       560    11,262
I/O Latency In               13,872    15,650    18,923    10,050

Table II: Microbenchmark Measurements (cycle counts)

Register State             Save   Restore
GP Regs                     152       184
FP Regs                     282       310
EL1 System Regs             230       511
VGIC Regs                 3,250       181
Timer Regs                  104       106
EL2 Config Regs              92       107
EL2 Virtual Memory Regs      92       107

Table III: KVM ARM Hypercall Analysis (cycle counts)

The Hypercall microbenchmark shows that transitioning from a VM to the hypervisor on ARM can be significantly faster than x86, as shown by the Xen ARM measurement, which takes less than a third of the cycles that Xen or KVM on x86 take. As explained in Section II, the ARM architecture provides a separate CPU mode with its own register bank to run an isolated Type 1 hypervisor like Xen. Transitioning from a VM to a Type 1 hypervisor requires little more than context switching the general purpose registers as running the two separate execution contexts, VM and the hypervisor, is supported by the separate ARM hardware state for EL2. While ARM implements additional register state to support the different execution context of the hypervisor, x86 transitions from a VM to the hypervisor by switching from non-root to root mode which requires context switching the entire CPU register state to the VMCS in memory, which is much more expensive even with hardware support.

However, the Hypercall microbenchmark also shows that transitioning from a VM to the hypervisor is more than an order of magnitude more expensive for Type 2 hypervisors like KVM than for Type 1 hypervisors like Xen. This is because although all VM traps are handled in EL2, a Type 2 hypervisor is integrated with a host kernel and both run in EL1. This results in four additional sources of overhead. First, transitioning from the VM to the hypervisor involves not only trapping to EL2, but also returning to the host OS in EL1, as shown in Figure 3, incurring a double trap cost. Second, because the host OS and the VM both run in EL1 and ARM hardware does not provide any features to distinguish between the host OS running in EL1 and the VM running in EL1, software running in EL2 must context switch all the EL1 system register state between the VM guest OS and the Type 2 hypervisor host OS, incurring added cost of saving and restoring EL1 register state. Third, because the host OS runs in EL1 and needs full access to the hardware, the hypervisor must disable traps to EL2 and Stage-2 translation from EL2 while switching from the VM to the hypervisor, and enable them when switching back to the VM again. Fourth, because the Type 2 hypervisor runs in EL1 but needs to access VM control register state such as the VGIC state, which can only be accessed from EL2, there is additional overhead to read and write the VM control register state in EL2. There are two approaches. One, the hypervisor can jump back and forth between EL1 and EL2 to access the control register state when needed. Two, it can copy the full register state to memory while it is still in EL2, return to the host OS in EL1 and read and write the memory copy of the VM control state, and then finally copy the state from memory back to the EL2 control registers when the hypervisor is running in EL2 again. Both methods incur much overhead, but the first makes the software implementation complicated and difficult to maintain. KVM ARM currently takes the second approach of reading and writing all VM control registers in EL2 during each transition between the VM and the hypervisor.

While the cost of the trap between CPU modes itself is not very high as shown in previous work [2], our measurements show that there is a substantial cost associated with saving and restoring register state to switch between EL2 and the host in EL1. Table III provides a breakdown of the cost of context switching the relevant register state when performing the Hypercall microbenchmark measurement on KVM ARM. Context switching consists of saving register state to memory and restoring the new context's state from memory to registers. The cost of saving and restoring this state accounts for almost all of the Hypercall time, indicating that context switching state is the primary cost due to KVM ARM's design, not the cost of extra traps. Unlike Xen ARM, which only incurs the relatively small cost of saving and restoring the general-purpose (GP) registers, KVM ARM saves and restores much more register state at much higher cost. Note that for ARM, the overall cost of saving register state, when transitioning from a VM to the hypervisor, is much more expensive than restoring it, when returning back to the VM from the hypervisor, due to the cost of reading the VGIC register state.
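As a quick sanity check on Table III, the totals can be recomputed directly: saving costs about 4,200 cycles while restoring costs about 1,500, with the VGIC registers responsible for most of the asymmetry, and together they account for the bulk of the 6,500-cycle KVM ARM Hypercall figure in Table II.

    #include <stdio.h>

    /* Per-group context switch costs from Table III (cycle counts). */
    struct regs_cost { const char *group; int save; int restore; };

    int main(void)
    {
        static const struct regs_cost c[] = {
            { "GP Regs",                 152,  184 },
            { "FP Regs",                 282,  310 },
            { "EL1 System Regs",         230,  511 },
            { "VGIC Regs",              3250,  181 },
            { "Timer Regs",              104,  106 },
            { "EL2 Config Regs",          92,  107 },
            { "EL2 Virtual Memory Regs",  92,  107 },
        };
        int save = 0, restore = 0;

        for (unsigned i = 0; i < sizeof(c) / sizeof(c[0]); i++) {
            save += c[i].save;
            restore += c[i].restore;
        }
        /* Prints: save=4202 restore=1506 total=5708, most of the
         * 6,500-cycle KVM ARM Hypercall measurement. */
        printf("save=%d restore=%d total=%d\n", save, restore, save + restore);
        return 0;
    }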
Unlike on ARM, both x86 hypervisors spend a similar amount of time transitioning from the VM to the hypervisor. Since both KVM and Xen leverage the same x86 hardware mechanism for transitioning between the VM and the hypervisor, they have similar performance. Both x86 hypervisors run in root mode and run their VMs in non-root mode, and switching between the two modes involves switching a substantial portion of the CPU register state to the VMCS in memory. Switching this state to memory is fast on x86, because it is performed by hardware in the context of a trap or as a result of executing a single instruction. In contrast, ARM provides a separate CPU mode for the hypervisor with separate registers, and ARM only needs to switch state to memory when running a different execution context in EL1. ARM can be much faster, as in the case of Xen ARM which does its hypervisor work in EL2 and does not need to context switch much register state, or it can be much slower, as in the case of KVM ARM which context switches more register state without the benefit of hardware support like x86.

The large difference in the cost of transitioning between the VM and hypervisor between Type 1 and Type 2 hypervisors results in Xen ARM being significantly faster at handling interrupt related traps, because Xen ARM emulates the ARM GIC interrupt controller directly in the hypervisor running in EL2, as shown in Figure 2. In contrast, KVM ARM emulates the GIC in the part of the hypervisor running in EL1. Therefore, operations such as accessing registers in the emulated GIC, sending virtual IPIs, and receiving virtual interrupts are much faster on Xen ARM than KVM ARM. This is shown in Table II in the measurements for the Interrupt Controller Trap and Virtual IPI microbenchmarks, in which Xen ARM is faster than KVM ARM by roughly the same difference as for the Hypercall microbenchmark.

However, Table II shows that for the remaining microbenchmarks, Xen ARM does not enjoy a large performance advantage over KVM ARM and in fact does worse for some of the microbenchmarks. The reasons for this differ from one microbenchmark to another. For the Virtual IRQ Completion microbenchmark, both KVM ARM and Xen ARM are very fast because the ARM hardware includes support for completing interrupts directly in the VM without trapping to the hypervisor. The microbenchmark runs much faster on ARM than x86 because the latter has to trap to the hypervisor. More recently, vAPIC support has been added to x86 with similar functionality to avoid the need to trap to the hypervisor, so that newer x86 hardware with vAPIC support should perform more comparably to ARM [12].

For the VM Switch microbenchmark, Xen ARM is only slightly faster than KVM ARM because both hypervisor implementations have to context switch the state between the VM being switched out and the one being switched in. Unlike the Hypercall microbenchmark where only KVM ARM needed to context switch EL1 state and per-VM EL2 state, in this case both KVM and Xen ARM need to do this, and Xen ARM therefore does not directly benefit from its faster VM-to-hypervisor transition. Xen ARM is still slightly faster than KVM, however, because to switch between VMs, Xen ARM simply traps to EL2 and performs a single context switch of the EL1 state, while KVM ARM must switch the EL1 state from the VM to the host OS and then again from the host OS to the new VM. Finally, KVM ARM also has to disable and enable traps and Stage-2 translation on each transition, which Xen ARM does not have to do.

For the I/O Latency microbenchmarks, a surprising result is that Xen ARM is slower than KVM ARM in both directions. These microbenchmarks measure the time from when a network I/O event is initiated by a sender until the receiver is notified, not including additional time spent transferring data. I/O latency is an especially important metric for real-time sensitive operations and many networking applications. The key insight to understanding the results is to see that Xen ARM does not benefit from its faster VM-to-hypervisor transition mechanism in this case because Xen ARM must switch between two separate VMs, Dom0 and a DomU, to process network I/O. Type 1 hypervisors only implement a limited set of functionality in the hypervisor directly, namely scheduling, memory management, the interrupt controller, and timers for Xen ARM. All other functionality, for example network and storage drivers, is implemented in the special privileged VM, Dom0. Therefore, a VM performing I/O has to communicate with Dom0 and not just the Xen hypervisor, which means not just trapping to EL2, but also going to EL1 to run Dom0.

I/O Latency Out is much worse on Xen ARM than KVM ARM. When KVM ARM sends a network packet, it traps to the hypervisor, context switching the EL1 state, and then the host OS instance directly sends the data on the physical network. Xen ARM, on the other hand, traps from the VM to the hypervisor, which then signals a different VM, Dom0, and Dom0 then sends the data on the physical network. This signaling between VMs on Xen is slow for two main reasons. First, because the VM and Dom0 run on different physical CPUs, Xen must send a physical IPI from the CPU running the VM to the CPU running Dom0. Second, Xen actually switches from Dom0 to a special VM, called the idle domain, when Dom0 is idling and waiting for I/O. Thus, when Xen signals Dom0 to perform I/O on behalf of a VM, it must perform a VM switch from the idle domain to Dom0. We verified that changing the configuration of Xen to pin both the VM and Dom0 to the same physical CPU, or not specifying any pinning, resulted in similar or worse results than reported in Table II, so the qualitative results are not specific to our configuration.

It is interesting to note that KVM x86 is much faster than everything else on I/O Latency Out. KVM on both ARM and x86 involves the same control path of transitioning from the VM to the hypervisor. While the path is conceptually similar to half of the path for the Hypercall microbenchmark, the result for the I/O Latency Out microbenchmark is not 50% of the Hypercall cost on either platform. The reason is that for KVM x86, transitioning from the VM to the hypervisor accounts for only about 40% of the Hypercall cost, while transitioning from the hypervisor to the VM is the majority of the cost (a few cycles are spent handling the noop hypercall in the hypervisor). On ARM, it is much more expensive to transition from the VM to the hypervisor than from the hypervisor to the VM, because reading back the VGIC state is expensive, as shown in Table III.

I/O Latency In behaves more similarly between Xen and KVM on ARM because they perform similar low-level operations. Xen traps from Dom0 running in EL1 to the hypervisor running in EL2 and signals the receiving VM, the reverse of the procedure described above, thereby sending a physical IPI and switching from the idle domain to the receiving VM in EL1. For KVM ARM, the Linux host OS receives the network packet via VHOST on a separate CPU and wakes up the receiving VM's VCPU thread to run on another CPU, thereby sending a physical IPI. The VCPU thread traps to EL2, switches the EL1 state from the host to the VM, then switches to the VM in EL1. The end result is that the cost is similar across both hypervisors, with KVM being slightly faster. While KVM ARM is slower on I/O Latency In than I/O Latency Out because it performs more work on the incoming path, Xen has similar performance on both I/O Latency In and I/O Latency Out because it performs similar low-level operations for both microbenchmarks.

V. APPLICATION BENCHMARK RESULTS

We next ran a number of real application benchmark workloads to quantify how well the ARM virtualization extensions support different hypervisor software designs in the context of more realistic workloads. Table IV lists the
application workloads we used, which include a mix of widely-used CPU and I/O intensive benchmark workloads. For workloads involving a client and a server, we ran the client on a dedicated machine and the server on the configuration being measured, ensuring that the client was never saturated during any of our experiments. We ran these workloads natively and on both KVM and Xen on both ARM and x86, the latter to provide a baseline comparison.

Kernbench: Compilation of the Linux 3.17.0 kernel using the allnoconfig for ARM using GCC 4.8.2.
Hackbench: hackbench [14] using Unix domain sockets and 100 process groups running with 500 loops.
SPECjvm2008: SPECjvm2008 [15] benchmark running several real life applications and benchmarks specifically chosen to benchmark the performance of the Java Runtime Environment. We used the 15.02 release of the Linaro AArch64 port of OpenJDK to run the benchmark.
Netperf: netperf v2.6.0 starting netserver on the server and running with its default parameters on the client in three modes: TCP_RR, TCP_STREAM, and TCP_MAERTS, measuring latency and throughput, respectively.
Apache: Apache v2.4.7 Web server running ApacheBench v2.3 on the remote client, which measures the number of handled requests per second when serving the 41 KB index file of the GCC 4.4 manual using 100 concurrent requests.
Memcached: memcached v1.4.14 using the memtier benchmark v1.2.3 with its default parameters.
MySQL: MySQL v14.14 (distrib 5.5.41) running SysBench v0.4.12 using the default configuration with 200 parallel transactions.

Table IV: Application Benchmarks

Given the differences in hardware platforms, our focus was not on measuring absolute performance [13], but rather the relative performance differences between virtualized and native execution on each platform. Figure 4 shows the performance overhead of KVM and Xen on ARM and x86 compared to native execution on the respective platform. All numbers are normalized to 1 for native performance, so that lower numbers represent better performance. Unfortunately, the Apache benchmark could not run on Xen x86 because it caused a kernel panic in Dom0. We tried several versions of Xen and Linux, but faced the same problem. We reported this to the Xen developer community, and learned that this may be a Mellanox network driver bug exposed by Xen's I/O model. We also reported the issue to the Mellanox driver maintainers, but did not arrive at a solution.

[Figure 4: Application Benchmark Performance. Normalized performance of KVM ARM, Xen ARM, KVM x86, and Xen x86 for each application benchmark, relative to native execution (1.00; lower is better); bars exceeding the 2.00 scale are labeled with their values (2.06, 2.33, 2.75, 3.56, and 3.90).]

Figure 4 shows that the application performance on KVM and Xen on ARM and x86 is not well correlated with their respective microbenchmark performance shown in Table II. Xen ARM has by far the lowest VM-to-hypervisor transition costs and the best performance for most of the microbenchmarks, yet its performance lags behind KVM ARM on many of the application benchmarks. KVM ARM substantially outperforms Xen ARM on the various Netperf benchmarks, TCP_STREAM, TCP_MAERTS, and TCP_RR, as well as Apache and Memcached, and performs only slightly worse on the rest of the application benchmarks. Xen ARM also does generally worse than KVM x86. Clearly, the differences in microbenchmark performance do not result in the same differences in real application performance.

Xen ARM achieves its biggest performance gain versus KVM ARM on Hackbench. Hackbench involves running lots of threads that are sleeping and waking up, requiring frequent IPIs for rescheduling. Xen ARM performs virtual IPIs much faster than KVM ARM, roughly a factor of two. Despite this microbenchmark performance advantage on a workload that performs frequent virtual IPIs, the resulting difference in Hackbench performance overhead is small, only 5% of native performance. Overall, across CPU-intensive workloads such as Kernbench, Hackbench and SPECjvm2008, the performance differences among the different hypervisors across different architectures are small.

Figure 4 shows that the largest differences in performance are for the I/O-intensive workloads. We first take a closer look at the Netperf results. Netperf TCP_RR is an I/O latency benchmark, which sends a 1 byte packet from a client to the Netperf server running in the VM; the Netperf server sends the packet back to the client, and the process is repeated for 10 seconds. For the Netperf TCP_RR benchmark, both hypervisors show high overhead compared to native performance, but Xen is noticeably worse than KVM. To understand why, we analyzed the behavior of TCP_RR in further detail by using tcpdump [16] to capture timestamps on incoming and outgoing packets at the data link layer. We modified Linux's timestamping function to use the ARM architected counter, and took further steps to ensure that the counter values were synchronized across all PCPUs, VMs, and the hypervisor. This allowed us to analyze the latency between operations happening in the VM and the host. Table V shows the detailed measurements.

Table V shows that the time per transaction increases significantly from 41.8 µs when running natively to 86.3 µs
and 97.5 µs for KVM and Xen, respectively. The resulting overhead per transaction is 44.5 µs and 55.7 µs for KVM and Xen, respectively. To understand the source of this overhead, we decompose the time per transaction into separate steps. send to recv is the time between sending a packet from the physical server machine until a new response is received by the client, which is the time spent on the physical wire plus the client processing time. recv to send is the time spent at the physical server machine to receive a packet and send back a response, including potentially passing through the hypervisor and the VM in the virtualized configurations.

                         Native      KVM      Xen
Trans/s                  23,911   11,591   10,253
Time/trans (µs)            41.8     86.3     97.5
Overhead (µs)                 -     44.5     55.7
send to recv (µs)          29.7     29.8     33.9
recv to send (µs)          14.5     53.0     64.6
recv to VM recv (µs)          -     21.1     25.9
VM recv to VM send (µs)       -     16.9     17.4
VM send to send (µs)          -     15.0     21.4

Table V: Netperf TCP_RR Analysis on ARM

send to recv remains the same for KVM and native, because KVM does not interfere with normal Linux operations for sending or receiving network data. However, send to recv is slower on Xen, because the Xen hypervisor adds latency in handling incoming network packets. When a physical network packet arrives, the hardware raises an IRQ, which is handled in the Xen hypervisor, which translates the incoming physical IRQ to a virtual IRQ for Dom0, which runs the physical network device driver. However, since Dom0 is often idling when the network packet arrives, Xen must first switch from the idle domain to Dom0 before Dom0 can receive the incoming network packet, similar to the behavior of the I/O Latency benchmarks described in Section IV.

Since almost all the overhead is on the server for both KVM and Xen, we further decompose the recv to send time at the server into three components: the time from when the physical device driver receives the packet until it is delivered in the VM, recv to VM recv; the time from when the VM receives the packet until it sends a response, VM recv to VM send; and the time from when the VM delivers the response to the physical device driver, VM send to send. Table V shows that both KVM and Xen spend a similar amount of time receiving the packet inside the VM until being able to send a reply, and that this VM recv to VM send time is only slightly more than the recv to send time spent when Netperf is running natively to process a packet. This suggests that the dominant overhead for both KVM and Xen is due to the time required by the hypervisor to process packets, the Linux host for KVM and Dom0 for Xen.

Table V also shows that Xen spends noticeably more time than KVM in delivering packets between the physical device driver and the VM. KVM only delays the packet on recv to VM recv and VM send to send by a total of 36.1 µs, where Xen delays the packet by 47.3 µs, an extra 11.2 µs. There are two main reasons why Xen performs worse. First, Xen's I/O latency is higher than KVM's, as measured and explained by the I/O Latency In and Out microbenchmarks in Section IV. Second, Xen does not support zero-copy I/O, but instead must map a shared page between Dom0 and the VM using the Xen grant mechanism, and must copy data between the memory buffer used for DMA in Dom0 and the granted memory buffer from the VM. Each data copy incurs more than 3 µs of additional latency because of the complexities of establishing and utilizing the shared page via the grant mechanism across VMs, even though only a single byte of data needs to be copied.
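Pulling the Table V numbers together, the short program below recomputes the figures quoted above: per-transaction overhead is Time/trans minus the 41.8 µs native time, and the hypervisor-added delivery delay is recv to VM recv plus VM send to send.

    #include <stdio.h>

    int main(void)
    {
        /* Values from Table V (microseconds). */
        const double native_time = 41.8;
        const double kvm_time = 86.3, xen_time = 97.5;
        const double kvm_in = 21.1, kvm_out = 15.0;  /* recv to VM recv, VM send to send */
        const double xen_in = 25.9, xen_out = 21.4;

        printf("KVM overhead/trans: %.1f us\n", kvm_time - native_time);   /* 44.5 */
        printf("Xen overhead/trans: %.1f us\n", xen_time - native_time);   /* 55.7 */
        printf("KVM delivery delay: %.1f us\n", kvm_in + kvm_out);         /* 36.1 */
        printf("Xen delivery delay: %.1f us\n", xen_in + xen_out);         /* 47.3 */
        printf("Extra Xen delay:    %.1f us\n",
               (xen_in + xen_out) - (kvm_in + kvm_out));                   /* 11.2 */
        return 0;
    }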
Although Xen ARM can transition between the VM and hypervisor more quickly than KVM, Xen cannot utilize this advantage for the TCP_RR workload, because Xen must engage Dom0 to perform I/O on behalf of the VM, which results in several VM switches between idle domains and Dom0 or DomU, and because Xen must perform expensive page mapping operations to copy data between the VM and Dom0. This is a direct consequence of Xen's software architecture and I/O model based on domains and a strict I/O isolation policy. Xen ends up spending so much time communicating between the VM and Dom0 that it completely dwarfs its low Hypercall cost for the TCP_RR workload and ends up having more overhead than KVM ARM, due to its software architecture and I/O model in particular.

The hypervisor software architecture is also a dominant factor in other aspects of the Netperf results. For the Netperf TCP_STREAM benchmark, KVM has almost no overhead for x86 and ARM while Xen has more than 250% overhead. The reason for this large difference in performance is again due to Xen's lack of zero-copy I/O support, in this case particularly on the network receive path. The Netperf TCP_STREAM benchmark sends large quantities of data from a client to the Netperf server in the VM. Xen's Dom0, running Linux with the physical network device driver, cannot configure the network device to DMA the data directly into guest buffers, because Dom0 does not have access to the VM's memory. When Xen receives data, it must configure the network device to DMA the data into a Dom0 kernel memory buffer, signal the VM for incoming data, let Xen configure a shared memory buffer, and finally copy the incoming data from the Dom0 kernel buffer into the virtual device's shared buffer. KVM, on the other hand, has full access to the VM's memory and maintains shared memory buffers in the Virtio rings [7], such that the network device can DMA the data directly into a guest-visible buffer, resulting in significantly less overhead.

Furthermore, previous work [17] and discussions with the Xen maintainers confirm that supporting zero copy on x86 is problematic for Xen given its I/O model, because doing so requires signaling all physical CPUs to locally invalidate TLBs when removing grant table entries for shared pages, which proved more expensive than simply copying the data [18]. As a result, previous efforts to support zero copy on Xen x86 were abandoned. Xen ARM lacks the same zero copy support because the Dom0 network backend driver uses the same code as on x86. Whether zero copy support for Xen can be implemented efficiently on ARM, which has hardware support for broadcast TLB invalidate requests across multiple PCPUs, remains to be investigated.

For the Netperf TCP_MAERTS benchmark, Xen also has substantially higher overhead than KVM. The benchmark measures the network transmit path from the VM, the converse of the TCP_STREAM benchmark which measured the network receive path to the VM. It turns out that the Xen performance problem is due to a regression in Linux introduced in Linux v4.0-rc1 in an attempt to fight bufferbloat, and has not yet been fixed beyond manually tuning the Linux TCP configuration in the guest OS [19]. We confirmed that using an earlier version of Linux or tuning the TCP configuration in the guest using sysfs significantly reduced the overhead of Xen on the TCP_MAERTS benchmark.

Other than the Netperf workloads, the application workloads with the highest overhead were Apache and Memcached. We found that the performance bottleneck for KVM and Xen on ARM was due to network interrupt processing and delivery of virtual interrupts. Delivery of virtual interrupts is more expensive than handling physical IRQs on bare-metal, because it requires switching from the VM to the hypervisor, injecting a virtual interrupt to the VM, then switching back to the VM. Additionally, Xen and KVM both handle all virtual interrupts using a single VCPU, which, combined with the additional virtual interrupt delivery cost, fully utilizes the underlying PCPU. We verified this by distributing virtual interrupts across multiple VCPUs, which causes performance overhead to drop on KVM from 35% to 14% on Apache and from 26% to 8% on Memcached, and on Xen from 84% to 16% on Apache and from 32% to 9% on Memcached. Furthermore, we ran the workload natively with all physical interrupts assigned to a single physical CPU, and observed the same native performance, experimentally verifying that delivering virtual interrupts is more expensive than handling physical interrupts.
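The paper does not spell out how the virtual interrupts were spread across VCPUs, but on a Linux guest the usual knob is per-IRQ SMP affinity. The hedged sketch below steers a given IRQ, for example the Virtio or Xen PV network interrupt found in /proc/interrupts, to a chosen VCPU; the helper and its minimal error handling are illustrative.

    #include <stdio.h>

    /* Steer guest IRQ 'irq' to virtual CPU 'vcpu' by writing its affinity
     * list. The IRQ number of the paravirtual network device must be looked
     * up in /proc/interrupts first. */
    static int set_irq_affinity(int irq, int vcpu)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%d\n", vcpu);
        return fclose(f);
    }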
In summary, while the VM-to-hypervisor transition cost for a Type 1 hypervisor like Xen is much lower on ARM than for a Type 2 hypervisor like KVM, this difference is not easily observed for the application workloads. The reason is that Type 1 hypervisors typically only support CPU, memory, and interrupt virtualization directly in the hypervisor. CPU and memory virtualization has been highly optimized directly in hardware and, ignoring one-time page fault costs at start up, is performed largely without the hypervisor's involvement. That leaves only interrupt virtualization, which is indeed much faster for a Type 1 hypervisor on ARM, as confirmed by the Interrupt Controller Trap and Virtual IPI microbenchmarks shown in Section IV. While this contributes to Xen's slightly better Hackbench performance, the resulting application performance benefit overall is modest.

However, when VMs perform I/O operations such as sending or receiving network data, Type 1 hypervisors like Xen typically offload such handling to separate VMs to avoid having to re-implement all device drivers for the supported hardware and to avoid running a full driver and emulation stack directly in the Type 1 hypervisor, which would significantly increase the Trusted Computing Base and increase the attack surface of the hypervisor. Switching to a different VM to perform I/O on behalf of the VM has very similar costs on ARM compared to a Type 2 hypervisor approach of switching to the host on KVM. Additionally, KVM on ARM benefits from the hypervisor having privileged access to all physical resources, including the VM's memory, and from being directly integrated with the host OS, allowing for optimized physical interrupt handling, scheduling, and processing paths in some situations.

Despite the inability of both KVM and Xen to leverage the potential fast path of trapping from a VM running in EL1 to the hypervisor in EL2 without the need to run additional hypervisor functionality in EL1, our measurements show that both KVM and Xen on ARM can provide virtualization overhead similar to, and in some cases better than, their respective x86 counterparts.

VI. ARCHITECTURE IMPROVEMENTS

To make it possible for modern hypervisors to achieve low VM-to-hypervisor transition costs on real application workloads, some changes needed to be made to the ARM hardware virtualization support. Building on our experiences with the design, implementation, and performance measurement of KVM ARM, and working in conjunction with ARM, a set of improvements have been made to bring the fast VM-to-hypervisor transition costs, possible in limited circumstances with Type 1 hypervisors, to a broader range of application workloads when using Type 2 hypervisors. These improvements are the Virtualization Host Extensions (VHE), which are now part of a new revision of the ARM 64-bit architecture, ARMv8.1 [20]. VHE allows an OS designed to run in EL1 to run in EL2 without substantial modifications to the OS source code. We show how this allows KVM ARM and its Linux host kernel to run entirely in EL2 without substantial modifications to Linux.

VHE is provided through the addition of a new control bit, the E2H bit, which is set at system boot when installing a Type 2 hypervisor that uses VHE. If the bit is not set, ARMv8.1 behaves the same as ARMv8 in terms of hardware virtualization support, preserving backwards compatibility with existing hypervisors. When the bit is set, VHE enables three main features.
available in EL1 is also available in EL2. For example, EL1 has two registers, TTBR0_EL1 and TTBR1_EL1, the first used to look up the page tables for virtual addresses (VAs) in the lower VA range, and the second in the upper VA range. This provides a convenient and efficient method for splitting the VA space between userspace and the kernel. However, without VHE, EL2 only has one page table base register, TTBR0_EL2, making it problematic to support the split VA space of EL1 when running in EL2. With VHE, EL2 gets a second page table base register, TTBR1_EL2, making it possible to support split VA space in EL2 in the same way as provided in EL1. This enables a Type 2 hypervisor integrated with a host OS to support a split VA space in EL2, which is necessary to run the host OS in EL2 so it can manage the VA space between userspace and the kernel.
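As a concrete illustration of the split, a virtual address is routed to TTBR0 or TTBR1 based on its upper bits. The fragment below is a simplified sketch of that selection rule, assuming a 48-bit VA configuration for both ranges; it is illustrative only and not taken from any hypervisor's code.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified AArch64 VA-split rule for a 48-bit configuration: addresses
     * whose top 16 bits are all zero fall in the lower range and are translated
     * through TTBR0_ELx; addresses whose top 16 bits are all one fall in the
     * upper range and are translated through TTBR1_ELx. Linux typically places
     * userspace in the lower range and the kernel in the upper range. */
    #define VA_BITS 48

    static bool va_uses_ttbr1(uint64_t va)
    {
        uint64_t top = va >> VA_BITS;               /* the 16 tag bits */
        return top == (UINT64_MAX >> VA_BITS);      /* all ones => upper range */
    }

    /* Without VHE, EL2 has only TTBR0_EL2, so an OS relying on this split
     * cannot run there unmodified; with VHE, TTBR1_EL2 makes it possible. */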
Second, VHE provides a mechanism to access the extra EL2 register state transparently. Simply providing extra EL2 registers is not sufficient to run unmodified OSes in EL2, because existing OSes are written to access EL1 registers. For example, Linux is written to use TTBR1_EL1, which does not affect the translation system running in EL2. Providing the additional register TTBR1_EL2 would still require modifying Linux to use TTBR1_EL2 instead of TTBR1_EL1 when running in EL2 vs. EL1, respectively. To avoid forcing OS vendors to add this extra level of complexity to the software, VHE allows unmodified software to execute in EL2 and transparently access EL2 registers using the EL1 system register instruction encodings. For example, current OS software reads the TTBR1_EL1 register with the instruction mrs x1, ttbr1_el1. With VHE, the software still executes the same instruction, but the hardware actually accesses the TTBR1_EL2 register. As long as the E2H bit is set, accesses to EL1 registers performed in EL2 actually access EL2 registers, transparently rewriting register accesses to EL2, as described above. A new set of special instructions is added to access the EL1 registers in EL2, which the hypervisor can use to switch between VMs, which will run in EL1. For example, if the hypervisor wishes to access the guest's TTBR1_EL1, it will use the instruction mrs x1, ttbr1_el21.
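The fragment below sketches what this transparent redirection looks like from the hypervisor's point of view; it is an illustration under the assumptions above, not code from KVM. The same EL1 encoding is used regardless of the exception level the code runs in.

    #include <stdint.h>

    /* Sketch: read the translation table base register for the upper VA range.
     * Executed in EL1, this returns TTBR1_EL1. Executed in EL2 with the E2H
     * bit set, the hardware transparently redirects the access to TTBR1_EL2,
     * which is what lets an unmodified kernel run in EL2. A VHE hypervisor
     * that instead needs the guest's EL1 copy of the register uses the new
     * dedicated encodings mentioned above. */
    static inline uint64_t read_ttbr1(void)
    {
        uint64_t val;

        __asm__ volatile("mrs %0, ttbr1_el1" : "=r"(val));
        return val;
    }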
Third, VHE expands the memory translation capabilities of EL2. In ARMv8, EL2 and EL1 use different page table formats, so software written to run in EL1 must be modified to run in EL2. In ARMv8.1, the EL2 page table format is now compatible with the EL1 format when the E2H bit is set. As a result, an OS that previously ran in EL1 can now run in EL2 without being modified because it can use the same EL1 page table format.
[Figure 5: Virtualization Host Extensions (VHE). Left (Type 1, E2H bit clear): the Xen hypervisor runs in EL2 while VMs run in EL1/EL0. Right (Type 2, E2H bit set): the host kernel and KVM run in EL2, host applications in EL0 enter the kernel directly via syscalls and traps, and VMs run in EL1/EL0.]

Figure 5 shows how Type 1 and Type 2 hypervisors map to the architecture with VHE. Type 1 hypervisors do not set the E2H bit introduced with VHE, and EL2 behaves exactly as in ARMv8 and described in Section II. Type 2 hypervisors set the E2H bit when the system boots, and the host OS kernel runs exclusively in EL2, and never in EL1. The Type 2 hypervisor kernel can run unmodified in EL2, because VHE provides an equivalent EL2 register for every EL1 register and transparently rewrites EL1 register accesses from EL2 to EL2 register accesses, and because the page table formats between EL1 and EL2 are now compatible. Transitions from host userspace to host kernel happen directly from EL0 to EL2, for example to handle a system call, as indicated by the arrows in Figure 5. Transitions from the VM to the hypervisor now happen without having to context switch EL1 state, because EL1 is not used by the hypervisor.

ARMv8.1 differs from the x86 approach in two key ways. First, ARMv8.1 introduces more additional hardware state so that a VM running in EL1 does not need to save a substantial amount of state before switching to running the hypervisor in EL2, because the EL2 state is separate and backed by additional hardware registers. This minimizes the cost of VM-to-hypervisor transitions because trapping from EL1 to EL2 does not require saving and restoring state beyond general purpose registers to and from memory. In contrast, recall that the x86 approach adds CPU virtualization support by adding root and non-root mode as concepts orthogonal to the CPU privilege modes, but does not introduce additional hardware register state like ARM. As a result, switching between root and non-root modes requires transferring state between hardware registers and memory. The cost of this is ameliorated by implementing the state transfer in hardware, but while this avoids the need to do additional instruction fetch and decode, accessing memory is still expected to be more expensive than having extra hardware register state. Second, ARMv8.1 preserves the RISC-style approach of allowing software more fine-grained control over which state needs to be switched for which purposes instead of fixing this in hardware, potentially making it possible to build hypervisors with lower overhead compared to x86.

A Type 2 hypervisor originally designed for ARMv8 must be modified to benefit from VHE. A patch set has been developed to add VHE support to KVM ARM. This involves rewriting parts of the code to allow run-time adaptations of the hypervisor, such that the same kernel binary can run on both legacy ARMv8 hardware and benefit from VHE-enabled ARMv8.1 hardware. The code to support VHE has been developed using ARM software models, as ARMv8.1 hardware is not yet available. We were therefore not able to evaluate the performance of KVM ARM using VHE, but our
findings in Sections IV and V show that this addition to the hardware design could have a noticeable positive effect on KVM ARM performance, potentially improving Hypercall and I/O Latency Out performance by more than an order of magnitude, improving more realistic I/O workloads by 10% to 20%, and yielding superior performance to a Type 1 hypervisor such as Xen which must still rely on Dom0 running in EL1 for I/O operations.
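As a rough sketch of the run-time adaptation described above, a hypervisor can probe for VHE once at boot and select the appropriate world-switch path, so that a single kernel binary serves both kinds of hardware. The helper names below are hypothetical; this is not the actual KVM ARM patch set.

    #include <stdbool.h>

    /* Hypothetical probes and switch paths; the real KVM ARM code differs. */
    static bool cpu_supports_vhe(void)  { return false; /* e.g., probe ID registers */ }
    static void switch_to_guest_vhe(void)  { /* host already in EL2: save only what the guest clobbers */ }
    static void switch_to_guest_nvhe(void) { /* legacy split mode: also save/restore host EL1 state */ }

    static bool use_vhe;

    void hyp_init(void)
    {
        /* Decided once at boot so the same binary runs on ARMv8 and ARMv8.1. */
        use_vhe = cpu_supports_vhe();
    }

    void run_vcpu(void)
    {
        if (use_vhe)
            switch_to_guest_vhe();      /* no host EL1 context switch needed */
        else
            switch_to_guest_nvhe();
    }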
VII. RELATED WORK

Virtualization goes back to the 1970s [8], but saw a resurgence in the 2000s with the emergence of x86 hypervisors and later x86 hardware virtualization support [5], [6], [4]. Much work has been done on analyzing and improving the performance of x86 virtualization [21], [17], [22], [23], [24], [25], [26], [27], [28]. While some techniques such as nested page tables have made their way from x86 to ARM, much of the x86 virtualization work has limited applicability to ARM for two reasons. First, earlier work focused on techniques to overcome the absence of x86 hardware virtualization support. For example, studies of paravirtualized VM performance [17] are not directly applicable to systems optimized with hardware virtualization support.

Second, later work based on x86 hardware virtualization support leverages hardware features that are in many cases substantially different from ARM. For example, ELI [24] reduces the overhead of device passthrough I/O coming from interrupt processing by applying an x86-specific technique to directly deliver physical interrupts to VMs. This technique does not work on ARM, as ARM does not use Interrupt Descriptor Tables (IDTs), but instead reads the interrupt number from a single hardware register and performs lookups of interrupt service routines from a strictly software-managed table. In contrast, our work focuses on ARM-specific hardware virtualization support and its performance on modern hypervisors running multiprocessor VMs.
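To make the contrast with x86 IDTs concrete, the fragment below sketches the software-managed dispatch pattern described here. The interrupt-acknowledge read is abstracted behind a hypothetical helper (on a GIC-based system it would be a read of the interrupt acknowledge register); this is illustrative and not code from any of the cited systems.

    #include <stdint.h>

    #define NR_IRQS 1024

    /* On ARM there is no hardware-walked descriptor table: the kernel keeps its
     * own table of interrupt service routines and decides how it is organized. */
    typedef void (*irq_handler_t)(uint32_t irq);
    static irq_handler_t irq_table[NR_IRQS];

    /* Hypothetical helper: acknowledge the pending interrupt and return its
     * number, e.g. by reading the GIC's interrupt acknowledge register. */
    extern uint32_t ack_pending_irq(void);

    void handle_irq(void)
    {
        uint32_t irq = ack_pending_irq();       /* single register read */

        if (irq < NR_IRQS && irq_table[irq])
            irq_table[irq](irq);                /* software table lookup */
    }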
Full-system virtualization of the ARM architecture is a relatively unexplored research area. Early approaches were software only, could not run unmodified guest OSes, and often suffered from poor performance [29], [30], [31], [32]. More recent approaches leverage ARM hardware virtualization support. The earliest study of ARM hardware virtualization support was based on a software simulator and a simple hypervisor without SMP support, but due to the lack of hardware or a cycle-accurate simulator, no real performance evaluation was possible [33].

KVM ARM was the first hypervisor to use ARM hardware virtualization support to run unmodified guest OSes on multi-core hardware [2], [34]. We expand on our previous work by (1) measuring virtualization performance on ARM server hardware for the first time, (2) providing the first performance comparison between KVM and Xen on both ARM and x86, (3) quantifying the true cost of split-mode virtualization due to the need to save and restore more state to memory when transitioning from a VM to the hypervisor compared to Type 1 hypervisors on ARM, and (4) identifying the root causes of overhead for KVM and Xen on ARM for real application workloads, including those involving network I/O.

VIII. CONCLUSIONS

We present the first study of ARM virtualization performance on server hardware, including multi-core measurements of the two main ARM hypervisors, KVM and Xen. We introduce a suite of microbenchmarks to measure common hypervisor operations on multi-core systems. Using this suite, we show that ARM enables Type 1 hypervisors such as Xen to transition between a VM and the hypervisor much faster than on x86, but that this low transition cost does not extend to Type 2 hypervisors such as KVM because they cannot run entirely in the EL2 CPU mode ARM designed for running hypervisors. While this fast transition cost is useful for supporting virtual interrupts, it does not help with I/O performance because a Type 1 hypervisor like Xen has to communicate with I/O backends in a special Dom0 VM, requiring more complex interactions than simply transitioning to and from the EL2 CPU mode.

We show that current hypervisor designs cannot leverage ARM's potentially fast VM-to-hypervisor transition cost in practice for real application workloads. KVM ARM actually exceeds the performance of Xen ARM for most real application workloads involving I/O. This is due to differences in hypervisor software design and implementation that play a larger role than how the hardware supports low-level hypervisor operations. For example, KVM ARM easily provides zero copy I/O because its host OS has full access to all of the VM's memory, whereas Xen enforces a strict I/O isolation policy, resulting in poor performance despite Xen's much faster VM-to-hypervisor transition mechanism. We show that ARM hypervisors have similar overhead to their x86 counterparts on real applications. Finally, we show how improvements to the ARM architecture may allow Type 2 hypervisors to bring ARM's fast VM-to-hypervisor transition cost to real application workloads involving I/O.

ACKNOWLEDGMENTS

Marc Zyngier provided insights and implemented large parts of KVM ARM. Eric Auger helped add VHOST support to KVM ARM. Peter Maydell helped on QEMU internals and configuration. Ian Campbell and Stefano Stabellini helped on Xen internals and in developing our measurement frameworks for Xen ARM. Leigh Stoller, Mike Hibler, and Robert Ricci provided timely support for CloudLab. Martha Kim and Simha Sethumadhavan gave feedback on earlier drafts of this paper. This work was supported in part by Huawei Technologies, a Google Research Award, and NSF grants CNS-1162447, CNS-1422909, and CCF-1162021.
REFERENCES

[1] ARM Ltd., "ARM architecture reference manual ARMv8-A DDI0487A.a," Sep. 2013.
[2] C. Dall and J. Nieh, "KVM/ARM: The design and implementation of the Linux ARM hypervisor," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2014, pp. 333–348.
[3] "Xen ARM with virtualization extensions," http://wiki.xen.org/wiki/Xen_ARM_with_Virtualization_Extensions.
[4] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: The Linux virtual machine monitor," in Proceedings of the Ottawa Linux Symposium, vol. 1, Jun. 2007, pp. 225–230.
[5] E. Bugnion, S. Devine, M. Rosenblum, J. Sugerman, and E. Y. Wang, "Bringing virtualization to the x86 architecture with the original VMware workstation," ACM Transactions on Computer Systems, vol. 30, no. 4, pp. 12:1–12:51, Nov. 2012.
[6] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the art of virtualization," in Proceedings of the 19th Symposium on Operating Systems Principles, Oct. 2003, pp. 164–177.
[7] R. Russell, "virtio: Towards a de-facto standard for virtual I/O devices," SIGOPS Operating Systems Review, vol. 42, no. 5, pp. 95–103, Jul. 2008.
[8] G. J. Popek and R. P. Goldberg, "Formal requirements for virtualizable third generation architectures," Communications of the ACM, vol. 17, no. 7, pp. 412–421, Jul. 1974.
[9] "CloudLab," http://www.cloudlab.us.
[10] Hewlett-Packard, "HP Moonshot-45XGc switch module," http://www8.hp.com/us/en/products/moonshot-systems/product-detail.html?oid=7398915.
[11] "Tuning Xen for performance," http://wiki.xen.org/wiki/Tuning_Xen_for_Performance, accessed: Jul. 2015.
[12] Intel Corporation, "Intel 64 and IA-32 architectures software developer's manual, 325384-056US," Sep. 2015.
[13] C. Dall, S.-W. Li, J. T. Lim, and J. Nieh, "A measurement study of ARM virtualization performance," Department of Computer Science, Columbia University, Tech. Rep. CUCS-021-15, Nov. 2015.
[14] "Hackbench," Jan. 2008, http://people.redhat.com/mingo/cfs-scheduler/tools/hackbench.c.
[15] "SPECjvm2008," https://www.spec.org/jvm2008.
[16] "Tcpdump," http://www.tcpdump.org/tcpdump_man.html.
[17] J. R. Santos, Y. Turner, G. J. Janakiraman, and I. Pratt, "Bridging the gap between software and hardware techniques for I/O virtualization," in Proceedings of the 2008 USENIX Annual Technical Conference, Jun. 2008, pp. 29–42.
[18] Ian Campbell, personal communication, Apr. 2015.
[19] Linux ARM Kernel Mailing List, ""tcp: refine TSO autosizing" causes performance regression on Xen," Apr. 2015, http://lists.infradead.org/pipermail/linux-arm-kernel/2015-April/336497.html.
[20] D. Brash, "The ARMv8-A architecture and its ongoing development," Dec. 2014, http://community.arm.com/groups/processors/blog/2014/12/02/the-armv8-a-architecture-and-its-ongoing-development.
[21] J. Sugerman, G. Venkitachalam, and B.-H. Lim, "Virtualizing I/O devices on VMware workstation's hosted virtual machine monitor," in Proceedings of the 2001 USENIX Annual Technical Conference, Jun. 2001, pp. 1–14.
[22] K. Adams and O. Agesen, "A comparison of software and hardware techniques for x86 virtualization," in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 2006, pp. 2–13.
[23] O. Agesen, J. Mattson, R. Rugina, and J. Sheldon, "Software techniques for avoiding hardware virtualization exits," in Proceedings of the 2012 USENIX Annual Technical Conference, Jun. 2012, pp. 373–385.
[24] A. Gordon, N. Amit, N. Har'El, M. Ben-Yehuda, A. Landau, A. Schuster, and D. Tsafrir, "ELI: Bare-metal performance for I/O virtualization," in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2012, pp. 411–422.
[25] L. Cherkasova and R. Gardner, "Measuring CPU overhead for I/O processing in the Xen virtual machine monitor," in Proceedings of the 2005 USENIX Annual Technical Conference, Jun. 2005, pp. 387–390.
[26] K. Gamage and X. Kompella, "Opportunistic flooding to improve TCP transmit performance in virtualized clouds," in Proceedings of the 2nd Symposium on Cloud Computing, 2011, pp. 24:1–24:14.
[27] J. Heo and R. Taheri, "Virtualizing latency-sensitive applications: Where does the overhead come from?" VMware Technical Journal, vol. 2, no. 2, pp. 21–30, Dec. 2013.
[28] J. Buell, D. Hecht, J. Heo, K. Saladi, and H. R. Taheri, "Methodology for performance analysis of VMware vSphere under tier-1 applications," VMware Technical Journal, vol. 2, no. 1, pp. 19–28, Jun. 2013.
[29] J. Hwang, S. Suh, S. Heo, C. Park, J. Ryu, S. Park, and C. Kim, "Xen on ARM: System virtualization using Xen hypervisor for ARM-based secure mobile phones," in Proceedings of the 5th Consumer Communications and Networking Conference, Jan. 2008, pp. 257–261.
[30] C. Dall and J. Nieh, "KVM for ARM," in Proceedings of the Ottawa Linux Symposium, Jul. 2010, pp. 45–56.
[31] J.-H. Ding, C.-J. Lin, P.-H. Chang, C.-H. Tsang, W.-C. Hsu, and Y.-C. Chung, "ARMvisor: System virtualization for ARM," in Proceedings of the Ottawa Linux Symposium, Jul. 2012, pp. 93–107.
[32] K. Barr, P. Bungale, S. Deasy, V. Gyuris, P. Hung, C. Newell, H. Tuch, and B. Zoppis, "The VMware mobile virtualization platform: Is that a hypervisor in your pocket?" SIGOPS Operating Systems Review, vol. 44, no. 4, pp. 124–135, Dec. 2010.
[33] P. Varanasi and G. Heiser, "Hardware-supported virtualization on ARM," in Proceedings of the 2nd Asia-Pacific Workshop on Systems, Jul. 2011, pp. 11:1–11:5.
[34] C. Dall and J. Nieh, "KVM/ARM: Experiences building the Linux ARM hypervisor," Department of Computer Science, Columbia University, Tech. Rep. CUCS-010-13, Apr. 2013.