Increases Server VM Capacity
Technology Brief
Cloud
Virtualization
Intel® Optane™ SSD DC P4800X with Intel® Memory Drive Technology Measured Performance
The following charts show the performance of a system running 108 VMs concurrently and compare the two configurations
previously discussed:
1. Linux Swap: a system using 192GB DDR4 DRAM + swap on Intel® Optane™ SSDs
2. Intel Memory Drive Technology: a system with 192GB DDR4 DRAM + Intel Optane SSD DC P4800X with Intel Memory
Drive Technology to achieve 768GB of memory in a cost-effective setup
Target latency levels (50µs average and 300µs at the 99th percentile) are marked in red.
Intel® Optane™ DC SSDs with Intel® Memory Drive Technology are a true
memory replacement technology
Summary
Leveraging Intel Optane DC SSDs with Intel Memory Drive Technology provides a cost-effective, in-place, and transparent
upgrade path that lets server nodes support up to 8x⁴ more memory than the server's specification limit (or its current
memory configuration), enabling a cost-effective infrastructure for VMs with minimal impact on performance. Unlike Linux
Swap, Intel Optane DC SSDs with Intel Memory Drive Technology enable all this while maintaining commonly accepted
latency requirements.
vms-clone.sh
#!/bin/bash
vms=$1
vm_user="intel"
vm_master="ubuntu1604-good-backup"
virsh destroy $vm_master
virsh undefine $vm_master
for vm_name in `virsh list --name --all`; do
    echo $vm_name
    virsh destroy $vm_name
    virsh undefine $vm_name --remove-all-storage
done
vms=$((vms-1))
virsh create /home/swapstream/$vm_master.xml
virsh suspend $vm_master
for i in `seq 0 $vms`; do
    echo $i
    vm_name="ubuntu1604-$i"
    virt-clone -o $vm_master -n $vm_name --auto-clone
done
virsh destroy $vm_master
virsh undefine $vm_master
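As a usage sketch (assuming the master image and the XML definition under /home/swapstream/ exist as hard-coded above), cloning the 108 VMs used in this benchmark would be invoked as:
./vms-clone.sh 108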
vms-spawn.sh
#!/bin/bash
vms=$1
vm_user="intel"
vms=$((vms-1))
for i in `seq 0 $vms`; do
    echo $i
    vm_name="ubuntu1604-$i"
    virsh start $vm_name
    sleep 1
done
echo "Sleeping"
sleep 15
../get_ips.sh $((vms+1))
echo "Sleeping"
sleep 15
get_ips.sh
#!/bin/bash
vms=$1
vm_user="intel"
vms=$((vms-1))
while [ 1 ] ; do
    echo "Getting IPs"
    rm inventory
    for i in `seq 0 $vms`; do
        vm_name="ubuntu1604-$i"
        ip=$(virsh domifaddr "$vm_name" | tail -2 | awk '{print $4}' | cut -d/ -f1)
        echo "$ip ansible_user=$vm_user" >> inventory
    done
    if [ `grep -c 192.168 inventory` -gt $vms ] ; then
        break;
    fi
    echo "Waiting for all VM IP addresses..."
    sleep 10
done
cat inventory
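For reference, a plausible spawn-and-inventory sequence follows; the VM count of 108 matches the benchmark above, and the relative path to get_ips.sh is carried over from vms-spawn.sh as an assumption about the directory layout:
./vms-spawn.sh 108    # starts the clones and collects their addresses via get_ips.sh
cat inventory         # one line per VM: <IP> ansible_user=intel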
tune.sh
#!/bin/bash
echo "Set scaling governor"
for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do [ -f $CPUFREQ ] || continue; echo -n performance > $CPUFREQ; done
echo "Show scaling governor"
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo
for fname in `echo -n /usr/local/bin/????version | sed 's/^\(.*\)\(\/usr\/local\/bin\/vsmpversion\)\(.*\)$/\1\3 \2/'`; do
strings $fname 2>&1 | grep -iq "vsmpversion"
if [ $? == 0 ]; then UTILPX=`basename $fname | awk '{print substr($0,0,4)}'`; break; fi
done
echo "Running environment"
NATIVE=0
${UTILPX}version --long
[ $? -ne 0 ] && NATIVE=1
echo
echo "Remove all swap devices"
swapoff -a
echo
if [ $NATIVE -eq 1 ] ; then
if [ `free -g | grep Mem: | awk '{print $2}'` -le 500 ]; then
########### SWAP-SPECIFIC SETTINGS #####
# Specify the NVMe namespaces to use. Do not include the /dev prefix;
# partitions will be created automatically per namespace.
# WARNING: SET THIS CAREFULLY AS IT CAN DESTROY THE OS
swapdevlist="nvme0n1 nvme1n1"
# For all values below USE -1 FOR DEFAULT
partitions_per_device=-1
cluster=-1
watermark_scale_factor=400
max_sectors_kb=-1
io_queue_depth=-1
nomerges=-1
########################################
echo "Setting Optane as swap on $swapdevlist with $partitions_per_device partitions per device"
if [ ! -x /usr/sbin/partprobe -o ! -x /usr/sbin/parted ]; then
echo "please install partprobe and parted"
exit
fi
for dev in $swapdevlist; do echo $dev; dd if=/dev/zero of=/dev/${dev} bs=1M count=10 oflag=direct; done 2>&1
/usr/sbin/partprobe; sleep 1; /usr/sbin/partprobe; sleep 1
if [ $partitions_per_device -gt 1 ]; then
pchunk=$((100/partitions_per_device))
for dev in $swapdevlist; do
/usr/sbin/parted /dev/${dev} mklabel msdos
/usr/sbin/parted -a none /dev/${dev} mkpart extended 0 100%
for p in `seq 1 $partitions_per_device`; do
echo /dev/${dev} $(((p-1)*pchunk))% $((p*pchunk))%
/usr/sbin/parted -a none /dev/${dev} mkpart logical $(((p-1)*pchunk))% $((p*pchunk))%
done
done 2>&1
/usr/sbin/partprobe; sleep 1; /usr/sbin/partprobe; sleep 1
for dev in $swapdevlist; do for part in /dev/${dev}p*; do
if [[ $part =~ p1$ ]]; then echo "==== skipping ext partition $part"; continue; fi
/usr/sbin/mkswap -f $part; /usr/sbin/swapon -p 0 $part
done; done 2>&1
else
for dev in $swapdevlist; do /usr/sbin/mkswap -f $dev; /usr/sbin/swapon -p 0 $dev; done 2>&1
fi
swapon -s
echo
echo "Swap tuning"
grep -H ^ /proc/sys/vm/page-cluster
if [ $cluster -ge 0 ]; then
echo $cluster > /proc/sys/vm/page-cluster
grep -H ^ /proc/sys/vm/page-cluster
fi
grep -H ^ /proc/sys/vm/watermark_scale_factor
if [ $watermark_scale_factor -ge 0 ]; then
echo $watermark_scale_factor > /proc/sys/vm/watermark_scale_factor
grep -H ^ /proc/sys/vm/watermark_scale_factor
fi
grep -H ^ /sys/block/nvme*/queue/max_sectors_kb
if [ $max_sectors_kb -ge 0 ]; then
for dev in $swapdevlist; do echo $max_sectors_kb > /sys/block/${dev}/queue/max_sectors_kb; done
grep -H ^ /sys/block/nvme*/queue/max_sectors_kb
fi
grep -H ^ /sys/module/nvme/parameters/io_queue_depth
if [ $io_queue_depth -ge 0 ]; then
echo $io_queue_depth > /sys/module/nvme/parameters/io_queue_depth
grep -H ^ /sys/module/nvme/parameters/io_queue_depth
fi
grep -H ^ /sys/block/nvme*/queue/nomerges
if [ $nomerges -ge 0 ]; then
for dev in $swapdevlist; do echo $nomerges > /sys/block/${dev}/queue/nomerges; done
grep -H ^ /sys/block/nvme*/queue/nomerges
fi
echo
fi
fi
echo "Disable hugepages"
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
echo
echo "Show NUMA Topology"
numactl -H
echo
echo "Show Memory usage"
free -g
echo
echo "Show /proc/meminfo"
cat /proc/meminfo
echo
# Lastly, stop ksmtuned - as it is not effective with random data.
echo "Stop ksmtuned and ksm"
systemctl stop ksmtuned.service
echo 0 > /sys/kernel/mm/ksm/run
# If KSM is to be used, we have two options:
# 1) Start ksm manually. This is done to make sure KSM
# is running in both SWAP and IMDT, unrelated to memory size.
# 2) Change /etc/ksmtuned.conf from KSM_THRES_COEF=20 to
# KSM_THRES_COEF=100-(DRAM/SYSTEM_MEMORY)*(100-20)
#echo "Start ksm"
#echo 1 > /sys/kernel/mm/ksm/run
echo
echo "Show kernel mm tuning"
grep -r ^ /sys/kernel/mm/
ansible/provisioning/benchmark-random-bias-mixed-redis.yml
- hosts: all
  become: true
  tasks:
    - name: Make sure we can connect
      ping:
    - name: Remove swap partition
      command: swapoff -a
    - name: copy efficient to VM
      copy:
        src: /root/newswapstream/solotest
        dest: /root/intel-bench
        mode: 0755
    - name: Benchmark Redis
      command: ./solotest $REDIS_OP 6379 $REDIS_VALSIZE $REDIS_CUSTOMERS $REDIS_RANDOM_BENCHMARK_INFLUENCE $REDIS_RANDOM_BENCHMARK_RATE $REDIS_RANDOM_BENCHMARK_MIX
      register: benchmark
      args:
        chdir: /root/intel-bench
      environment:
        REDIS_OP: "{{ lookup('env','REDIS_OP') }}"
        REDIS_VALSIZE: "{{ lookup('env','REDIS_VALSIZE') }}"
        REDIS_CUSTOMERS: "{{ lookup('env','REDIS_CUSTOMERS') }}"
        REDIS_RANDOM_BENCHMARK_INFLUENCE: "{{ lookup('env','REDIS_RANDOM_BENCHMARK_INFLUENCE') }}"
        REDIS_RANDOM_BENCHMARK_RATE: "{{ lookup('env','REDIS_RANDOM_BENCHMARK_RATE') }}"
        REDIS_RANDOM_BENCHMARK_MIX: "{{ lookup('env','REDIS_RANDOM_BENCHMARK_MIX') }}"
    - name: Output benchmark results
      debug: msg="{{ benchmark.stdout }}"
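A hedged example of driving this playbook against the inventory produced by get_ips.sh; the environment variable values shown are illustrative assumptions (mirroring the defaults in solotest.c), not the values used for the published results:
REDIS_OP=test REDIS_VALSIZE=1000 REDIS_CUSTOMERS=20000000 \
REDIS_RANDOM_BENCHMARK_INFLUENCE=-1 REDIS_RANDOM_BENCHMARK_RATE=1000000 \
REDIS_RANDOM_BENCHMARK_MIX=100 \
ansible-playbook -i inventory ansible/provisioning/benchmark-random-bias-mixed-redis.yml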
ansible/stats
#!/usr/bin/env node
var fs = require('fs')
var file = process.argv[2];
var json = fs.readFileSync(file).toString("utf-8");
if(json[0] != "{") {
  var lines = json.split("\n");
  lines.shift()
  json = lines.join("\n");
}
var contents = JSON.parse(json)
contents.plays.forEach(function(play){
  var tasks = play.tasks
  var benchmarkResults = tasks.filter(function(task){
    return task.task.name == "Benchmark Redis"
  })
  var hosts = benchmarkResults[0].hosts
  var totalHosts = 0;
  var totalAvg = 0, totalP95 = 0, totalP99 = 0, totalMin = 99999999, totalMax = 0;
  for(var host in hosts) {
    var results = hosts[host]
    if(totalHosts == 0) {
      console.log()
    }
    if(results.stdout == undefined) continue;
    ++totalHosts;
    console.log(results.stdout)
    var avg = parseFloat(results.stdout.split("\n")[1]) * 1000
    var min = parseFloat(results.stdout.split("\n")[2]) * 1000
    var max = parseFloat(results.stdout.split("\n")[3]) * 1000
    var P99 = parseFloat(results.stdout.split("\n")[4]) * 1000
    var P95 = parseFloat(results.stdout.split("\n")[5]) * 1000
    if(isNaN(avg)) {
      --totalHosts
    } else {
      totalAvg += avg
      totalP95 += P95
      totalP99 += P99
      if (min < totalMin) {
        totalMin = min
      }
      if (max > totalMax) {
        totalMax = max
      }
    }
  }
  console.log()
  var avgAvg = totalAvg / totalHosts
  var avgP95 = totalP95 / totalHosts
  var avgP99 = totalP99 / totalHosts
  console.log("Hosts:", totalHosts)
  console.log("Avg:", avgAvg.toFixed(3))
  console.log("Min:", totalMin.toFixed(3))
  console.log("P95:", avgP95.toFixed(3))
  console.log("P99:", avgP99.toFixed(3))
  console.log("Max:", totalMax.toFixed(3))
})
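The script expects the JSON output of an ansible-playbook run (for example, captured with the json stdout callback) as its only argument and averages the per-VM latency figures. A usage sketch, with the output file name as an assumption:
ANSIBLE_STDOUT_CALLBACK=json ansible-playbook -i inventory ansible/provisioning/benchmark-random-bias-mixed-redis.yml > results.json
./ansible/stats results.json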
solotest.c
#define _GNU_SOURCE
#include <math.h>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "hiredis.h"
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/sysinfo.h>
// DEFAULTS:
#define DEF_PORT 6379
#define DEF_SIZE 1000
#define DEF_COUNT 20000000
#define DEF_BIAS (-1)
#define DEF_FREQ 1000000
#define DEF_READ 100
#define MAX_HISTOGRAM 1000000 // 1 sec
char *value;
char hostname[256] = { "127.0.0.1" };
int port = DEF_PORT;
int val_size = DEF_SIZE;
long key_count = DEF_COUNT;
long bias = DEF_BIAS;
long frequency = DEF_FREQ;
long readmix = DEF_READ;
unsigned int seed;
long *histogram;
int fill = 0;
int test = 0;
inline double gauss(double x, double D)
{
double a = 1;
double b = 50;
double c = D;
return a * exp(-(x - b) * (x - b) / (2 * c * c));
}
void select_mode(char *modestr)
{
fill = strcasestr(modestr, "fill") == NULL ? 0 : 1;
test = strcasestr(modestr, "test") == NULL ? 0 : 1;
}
long timestamp(void)
{
static long start_time = -1;
struct timeval tv;
gettimeofday(&tv, NULL);
if (start_time == -1)
start_time = tv.tv_sec;
return ((long)(tv.tv_sec - start_time)) * 1000000L + tv.tv_usec;
}
char *datestr(void)
{
time_t now;
time(&now);
return strtok(ctime(&now), "\n");
}
int get_seed()
{
struct sysinfo s_info;
int error = sysinfo(&s_info);
if(error != 0)
printf("code error = %d\n", error);
return s_info.uptime * timestamp();
}
char *prep_value(char *value, int val_size)
{
int i;
for (i = 0; i < val_size / 4; i += 1)
*((int *)value + i) = (rand_r(&seed) | 0x01010101);
value[val_size] = '\0';
return value;
}
int main(int argc, char *argv[])
{
// Arguments: HOST PORT SIZE COUNT BIAS FREQ MIX
if (argc > 1) select_mode(argv[1]);
if (argc > 2) port = atoi(argv[2]);
if (argc > 3) val_size = atoi(argv[3]);
if (argc > 4) key_count = atol(argv[4]);
if (argc > 5) bias = atol(argv[5]);
if (argc > 6) frequency = atol(argv[6]);
/* ... remainder of solotest.c omitted ... */
Appendix B: Tuning
• VMs:
o Minimize latency by adding "nohz=off highres=off lapic=notscdeadline" to the kernel (guest) command line (see the sketch after this list)
• Host:
o Scaling governor: performance
o Disable hugepages
o Disable KSM (random benchmark)
• Linux Swap:
o Equal priority devices
o Set watermark_scale_factor to 4%
o Increasing the number of partitions does not improve performance
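A minimal sketch of applying the guest kernel command-line options from the first bullet on an Ubuntu 16.04 guest; the GRUB file location and the use of update-grub are assumptions about the guest image:
# Run inside each guest VM (assumes a GRUB-based Ubuntu 16.04 guest).
sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 nohz=off highres=off lapic=notscdeadline"/' /etc/default/grub
update-grub    # regenerate grub.cfg
reboot         # the options take effect on the next boot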
When extending main memory using the Intel® Optane™ SSD DC P4800X and Linux Swap, it is highly recommended to set all
CPUs in the system as No-Callbacks (No-CBs) CPUs. A No-CBs CPU is a CPU whose RCU callbacks are allowed to run on CPUs
other than the CPU making the callback, which improves the page access quality of service (QoS) of the workload CPU but may
lower average page access throughput.
Read-Copy-Update (RCU) is a Linux Kernel locking mechanism for shared memory. When a shared object in memory is
updated, a new object is created in memory and all pointers to the old object are set to point to the new object. Old objects
then need to be periodically cleaned out (garbage collected).
The RCU callback is an implementation of garbage collection. On a default Linux kernel, RCU callbacks are run in softirq
context on the same CPU that performed the callback. This can cause jitter and poor page access QoS when RCU callbacks are
run on workload CPUs, especially when the workload makes heavy use of swap space. CPUs which are set to be “No Callbacks
CPUs,” or “No-CBs CPUs,” are allowed to offload their RCU callbacks to threads which can then run on other (non-workload)
CPUs. This allows for better page access QoS on the workload CPU.
1. Confirm that your existing Linux kernel configuration supports No-CBs CPUs.
No-CBs CPU support was introduced in Linux Kernel 3.10, but not all Linux distribution kernels enable No-CBs
CPU support.
CentOS 7.3 supports No-CBs CPUs – skip to Step 4.
Ubuntu 16.04.1 LTS does not support No-CBs CPUs – see Step 3.
For all other distributions, check for the CONFIG_RCU_NOCB_CPU config option in your kernel's config file.
• grep "CONFIG_RCU_NOCB_CPU" /boot/config-`uname -r`
If the command returns CONFIG_RCU_NOCB_CPU=y, your kernel supports No-CBs CPUs. Skip to Step 4.
• If CONFIG_RCU_NOCB_CPU_ALL=y, your kernel is already set to have all CPUs be No-CBs CPUs, and you
do not need to follow this guide.
Otherwise, if the command returns # CONFIG_RCU_NOCB_CPU is not set or does not return any text, your
kernel does not support No-CBs CPUs. See Step 3.
2. If your kernel does not support No-CBs CPUs, you will need to rebuild your kernel with No-CBs support enabled.
– Obtain the kernel source of your distribution, run make oldconfig, then edit the .config file and add the following
lines:
• CONFIG_RCU_NOCB_CPU=y
• CONFIG_RCU_NOCB_CPU_NONE=y
– Build and install the kernel.
– Set the rcu_nocbs kernel boot parameter in the GRUB config.
rcu_nocbs specifies which CPUs are to be No-CBs CPUs at boot time.
On a Red Hat/CentOS machine:
o Use grubby to add rcu_nocbs=0-X (where X is the highest CPU core number in the system) to the kernel
boot arguments list.
o For example, on a 32-core machine:
grubby --update-kernel=ALL --args=rcu_nocbs=0-31
On an Ubuntu machine:
o Edit /etc/default/grub and append rcu_nocbs=0-X (where X is the highest CPU core number in the
system) to GRUB_CMDLINE_LINUX_DEFAULT.
o For example, on a 32-core machine:
GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-31"
o Run update-grub, then reboot your machine.
rcu_nocbs is a tunable boot parameter which can also specify only specific CPUs to be No-CBs CPUs, instead of all CPUs.
Additionally, the RCU callback threads may be affinitized to specific CPUs. Such tweaks are beyond the scope of this guide.
The following configuration changes may further improve paging performance, but are not appropriate for all systems or
workloads.
• If benchmarking, follow the steps in “System Tuning for Optimum Performance.”
• Disable page clustering
By default, when reading a page from swap, multiple consecutive pages are read into DRAM at once. On Intel®
Optane™ Technology, page clustering may incur a latency penalty when reading multiple random
individual pages.
echo '0' > /proc/sys/vm/page-cluster
• Disable transparent hugepages
This will remove the overhead of coalescing pages into hugepages and then breaking them up in swap.
echo 'never' > /sys/kernel/mm/transparent_hugepage/enabled
echo 'never' > /sys/kernel/mm/transparent_hugepage/defrag
• Disable NUMA balancing.
Both swap and NUMA balancing frequently access LRU lists, which maintain the hot/cold pages in the system.
Disabling NUMA balancing prevents it from contending with swap over the LRU lists, improving performance.
echo '0' > /proc/sys/kernel/numa_balancing
• (Kernel 4.6 and later) Increase the watermark_scale_factor
Watermark_scale_factor is a tunable variable which affects the memory thresholds at which the swap daemon
wakes up or sleeps. We want kswapd to wake up earlier, so we make it wake up when there is 4% of available
memory left, instead of the default 0.1% of available memory.
echo '400' > /proc/sys/vm/watermark_scale_factor
• (Kernel 4.11 or later) Recompile the kernel with multi-queue deadline I/O scheduler support.
This scheduler prevents write I/O request merges from blocking read requests for a long time.
In the kernel .config, set CONFIG_MQ_IOSCHED_DEADLINE, then recompile the kernel. Use a short queue
depth of 64 and a small max_sectors_kb of 32 to prevent writes from blocking reads in swap I/O, improving
swap-in latency, as sketched below.
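The following is a minimal sketch of applying these block-layer settings at runtime to the swap devices; the device names (nvme0n1, nvme1n1) mirror the swapdevlist used in tune.sh, and the availability of the mq-deadline scheduler on the running kernel is an assumption:
# Assumes blk-mq is in use and mq-deadline is available (kernel 4.11 or later).
for dev in nvme0n1 nvme1n1; do
    echo mq-deadline > /sys/block/${dev}/queue/scheduler    # multi-queue deadline scheduler
    echo 64 > /sys/block/${dev}/queue/nr_requests           # short queue depth
    echo 32 > /sys/block/${dev}/queue/max_sectors_kb        # small maximum I/O size
done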
1 See the appendix for the solotest.c implementation of the Gaussian influence.
2 The benchmark results may need to be revised as additional testing is conducted. The results depend on the specific platform configurations and workloads utilized in the testing and may not
be applicable to any particular user's components, computer system, or workloads. The results are not necessarily representative of other benchmarks, and other benchmark results may show
greater or lesser impact from mitigations. Implementation details: System BIOS: 00.01.0013 (Link); Kernel 4.15.12; Mitigation was validated for variants 1 through 3 using a checker script (Link –
accessed June 21, 2018).
3 Source – Intel: MSRP for Intel® Memory Drive Technology license with Intel® Optane™ SSD (April 2018). Trendforce* – Server DRAM Price Report, Sep 2017 (Market Price).
4 Source – Intel Memory Drive Technology Set Up and Configuration Guide, page 43. https://nsgresources.intel.com/asset-library/intel-memory-drive-technology-set-up-and-configuration-guide/
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system
configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured
using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and
performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information
visit www.intel.com/benchmarks.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on
request.
Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any
warranty arising from course of performance, course of dealing, or usage in trade.
Intel, the Intel logo, and Intel Optane are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.