Linux Performance Tuning

Logistics
  Tutorial runs from 9:00am to 5:00pm

What happens when a server is put under a large amount of stress?
  My web server just got slashdotted!
  Typically the server behaves well until the load increases beyond a certain critical point; then it breaks down.
    Transaction latencies go through the roof
    The server may cease functioning altogether
  Measure the system when it is functioning normally, and then when it is under stress. What changes?

  Careful tuning of memory usage won't matter if the problem is caused by a shortage of disk bandwidth.
  Performance measurement tools are hugely important for diagnosing what is placing limits on the scalability or performance of your application.
  Start with large areas, then narrow down.
    Is your application I/O bound? CPU bound? Network bound?

Incremental Tuning

Measurement overhead

The top(1) command
  Good general place to start
  How is the CPU time (overall) being spent?
    User time, System time, Niced user time, I/O Wait, Hardware IRQ, Software IRQ, Stolen time

Questions to ask yourself when looking at top(1) output
  What are the top tasks running; should they be there?
  Are they running? Waiting for disk?
  How much memory are they taking up?

The iostat(1) command
  Part of the sysstat package; shows I/O statistics
  Use -k for kilobytes instead of 512-byte sectors

Advanced iostat(1)
  Many more details with the -x option
    rrqm/s, wrqm/s: read/write requests merged per second
    r/s, w/s: read/write requests per second
    rkB/s, wkB/s: kilobytes read/written per second
    avgrq-sz: average request size in 512-byte sectors
    avgqu-sz: average request queue length

Advanced iostat(1), continued
  Still more details revealed with the -x option
    await: average time (in ms) between when a request is issued and when it is completed (time in queue plus time for the device to service the request)
    svctm: average service time (in ms) for I/O requests that were issued to the device
    %util: percentage of CPU time during which the device was servicing requests (100% means the device is fully saturated)

Example of iostat -xk 1
  Workload: fs_mark -s 10240 -n 1000 -d /mnt
    Creates 1000 files, each 10k, in /mnt, with an fsync after writing each file
  Result: 33.7 files/second
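
A minimal way to reproduce this kind of measurement, assuming the fs_mark tool and the sysstat package are installed and /mnt is a scratch file system you can safely write to:

  # Watch per-device I/O statistics once a second while the workload runs
  iostat -xk 1 > /tmp/iostat.log &
  fs_mark -s 10240 -n 1000 -d /mnt
  kill %1      # stop the background iostat once the run finishes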

Conclusions we can draw from the iostat results
  Utilization: 98.48%
  The system is I/O bound
    Adding memory or speeding up the CPU clock won't help
  Solution: attack the I/O bottleneck
    Add more I/O bandwidth resources (use a faster disk or use a RAID array)
    Or, do less work!

Speeding up fs_mark
  If we mount the (ext4) file system with -o barrier=0, files/sec becomes 358.3
    But this risks fs corruption after a power failure
  Is the fsync() really needed? Without it, files/sec goes up to 17,010.30
    Depends on application requirements
  Better: use -o journal_async_commit (see the example below)
    Using journal checksums, it allows ext4 to safely use only one barrier per fsync() instead of two
    (Requires Linux 2.6.32)
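
A sketch of how these mount options might be applied, assuming the file system lives on /dev/sdb1 and is mounted on /mnt; barrier=0 is shown only for comparison, given the corruption risk noted above:

  # Risky: disable write barriers entirely (not recommended for data you care about)
  mount -o barrier=0 /dev/sdb1 /mnt

  # Safer: keep barriers, but issue only one per fsync() (ext4, Linux 2.6.32+)
  mount -o journal_async_commit /dev/sdb1 /mnt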

Using -o journal_async_commit
  Result: 49.2 files/sec, roughly 1000 barrier ops

Comparing the two results
  [comparison chart not reproduced here]

Before we leave fs_mark...
  How does fs_mark fare on other file systems? (files/sec)
    ext2      574.9 (no barriers)
    ext3      348.8 (no barriers)   30.8 (w/ barriers)
    ext4      358.3 (no barriers)   49.2 (w/ barriers)
    XFS       337.3 (no barriers)   29.0 (w/ barriers)
    reiserfs  210.0 (no barriers)   31.5 (w/ barriers)
  Important note: these numbers are specific to this workload (small files, fsync heavy) and are not a general figure of merit for these file systems

Lessons Learned So Far
  Measure, analyze, and then tweak
  Bottleneck analysis is critical
  It is very useful to understand how things work under the covers
  Adding more resources is one way to address a bottleneck
    But so is figuring out ways of doing less work!
    Sometimes you can achieve your goal by working smarter, not harder.

The snap script

Agenda
  Choosing the right storage devices
    Hard Drives
    SSD
    RAID
    NFS appliances
  File System Tuning
    General Tips
    File system specific

  Disks are probably the biggest potential bottleneck in your system
    Punched cards and paper tape having fallen out of favor...
  Critical performance specs you should examine
    Sustained Data Transfer Rate
    Rotational speed: 5400rpm, 7200rpm, 10,000rpm
    Areal density (max capacity in that product family)
    Seek time (actually 3 numbers: average, track-to-track, full stroke)

Transfer Rates
  The important number is the sustained data transfer rate (aka disk-to-buffer rate)
    Typically around 70-100 MB/s; slower for laptop drives
  Much less important: the I/O transfer rate
    At least for hard drives, whether you are using SATA I's 1.5 Gb/s or SATA II's 3.0 Gb/s won't matter except for rare cases when transferring data out of the track buffer
    SSDs might be a different story, of course...

Short stroking hard drives
  HDD performance is not uniform across the platter
    Up to 100% performance improvement on the outer edge of the disk
  Consider partitioning your disk to take this into account!
    If you don't need the full 1TB of space, partitioning your disk to only use the first 100GB or 300GB could speed things up!
  Also, when running benchmarks, use the same partitions for each file system tested.
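
One quick way to see both the headline transfer rate and the falloff across the platter, assuming hdparm is installed, /dev/sda is the drive in question, and your hdparm is recent enough to support --offset (the 400 GiB offset is an arbitrary illustration):

  # Buffered read throughput near the start (outer edge) of the disk
  hdparm -t /dev/sda

  # Compare against a region much further in
  hdparm -t --offset 400 /dev/sda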

What about SSDs?
  Advantages of SSDs
    Fast random access reads
    Fails usually when writing, not when reading
    Less susceptible to mechanical shock/vibration
    Most SSDs use less power than HDDs
  Disadvantages of SSDs
    Cost per GB much more expensive
    Limited number of write cycles
    Writes are slower than reads; random writes can be much slower (up to a ½ sec average, 2 sec worst case for 4k random writes for really bad SSDs!)

Getting the right SSD is important
  A really good website that goes into great detail about this is AnandTech: http://www.anandtech.com/storage/
  Many of the OEM SSDs included in laptops are not the good SSDs, and you pay the OEM markup to add insult to injury.

Should you use SSDs?

PCIe attached flash

Filesystem Tuning
  Most general purpose file systems work quite well for most workloads
  But some file systems are better for certain specialized workloads
    Reiserfs: small (< 4k) files
    XFS: very big RAID arrays, very large files
  Ext3 is a good general purpose filesystem that many people use by default
  Ext4 will be better at RAID and larger files, while still working well on small-to-medium sized files

Managing Access-time Updates
  POSIX requires that a file's last access time is updated each time its contents are accessed.
    This means a disk write for every single read
  The mount options noatime and relatime can reduce this overhead (see the example below).
    The relatime option will only update the atime if the mtime or ctime is newer than the last atime.
    Only saves approximately half the writes compared to noatime
  Some applications do depend on atime being updated
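
A minimal sketch of how these options might appear, assuming an ext4 root file system on /dev/sda1:

  # /etc/fstab: skip access-time writes entirely...
  /dev/sda1  /  ext4  noatime,errors=remount-ro  0  1

  # ...or try relatime on an already-mounted file system without rebooting
  mount -o remount,relatime /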

Tuning ext3/ext4 journals
  Sometimes increasing the journal size can help, especially if your workload is very metadata-intensive (lots of small files; lots of file creates/deletes/renames)
  Journal data modes
    data=ordered (default): data is written first before metadata is committed
    data=journal: data is written into the journal
    data=writeback: only metadata is logged; after a crash, uninitialized data can appear in newly allocated data blocks

Using ionice to control read/write priorities
  Like the nice command, but affects the priority of read/write requests issued by the process
  Three scheduling classes
    Idle: serviced only if there are no other higher-priority requests pending
    Best-effort: requests served round-robin (default)
    Real time: highest priority request always gets access
  For the best-effort and real time classes, there are 8 priorities, with 0 being the highest priority and 7 the lowest
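
Hedged examples of both knobs; they assume the file system is /dev/sdb1 and is unmounted while the journal is recreated, and the backup job is just an illustration of a low-priority bulk task:

  # Recreate the ext3/ext4 journal with a larger size (in megabytes)
  tune2fs -O ^has_journal /dev/sdb1    # remove the existing journal first
  tune2fs -J size=128 /dev/sdb1        # then create a 128MB journal

  # Run a bulk job in the idle I/O scheduling class (-c3),
  # or at the lowest best-effort priority (-c2 -n7)
  ionice -c3 tar cf /backup/home.tar /home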

Agenda
  Introduction to Performance Tuning
  Filesystem and storage tuning
  Network tuning
  NFS performance tuning
  Memory tuning
  Application tuning

Network Tuning
  Before you do anything else... check the basic health of the network (see the commands below)
    Speed, duplex, errors
    Tools: ethtool, ifconfig, ping
    Check TCP throughput: ttcp or nttcp
    Look for weird stuff using wireshark / tcpdump
  Network is a shared resource
    Who else is using it?
    What are the bottlenecks in the network topology?
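
A quick health-check sequence using the tools named above (eth0 and the peer address 192.168.1.10 are placeholders):

  ethtool eth0               # link speed, duplex, autonegotiation
  ifconfig eth0              # look for errors, dropped packets, overruns
  ping -c 10 192.168.1.10    # packet loss and round-trip latency to a peer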

Latency vs Throughput

Interrupt Coalescing

  Some device drivers don't enable hardware offload features by default
    You can check what is enabled using ethtool -k eth0
  TCP segmentation offload
    ethtool -K eth0 tso on
  Checksum offload
    ethtool -K eth0 tx on rx on
  Large Receive Offload (for throughput)
    ethtool -K eth0 lro on

  The Bandwidth Delay Product (BDP) is very important when optimizing for throughput, especially for high speed, long distance links
  It represents the amount of data that can be in flight at any particular point in time
    BDP = bandwidth * Round Trip Time (RTT), i.e., 2 * bandwidth * one-way delay
  Example: (100 Mbits/sec / 8 bits/byte) * 50 ms ping time = 625 kbytes

Why the BDP matters
  TCP has to be able to retransmit any dropped packets, so the kernel has to remember what data has been sent in case it needs to retransmit it.
  TCP Window
    Limits on the size of the TCP window control the kernel memory consumed by the networking stack

Using the BDP
  The BDP in bytes, plus some overhead room, should be used as [wmax] below when setting these parameters in /etc/sysctl.conf:
    net.core.rmem_max = [wmax]   Maximum socket receive buffer size
    net.core.wmem_max = [wmax]   Maximum socket send buffer size
  net.core.rmem_max is also known as /proc/sys/net/core/rmem_max
    e.g., set via echo 2097152 > /proc/sys/net/core/rmem_max
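
A hedged sketch for the 100 Mbit/s, 50 ms example above: the BDP is about 625 kbytes, rounded up to 1 MB here for headroom (the exact values depend on your link and workload):

  # /etc/sysctl.conf
  net.core.rmem_max = 1048576
  net.core.wmem_max = 1048576

  # apply without rebooting
  sysctl -p /etc/sysctl.conf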

Per-socket /etc/sysctl.conf settings
  net.ipv4.tcp_rmem = [wmin] [wstd] [wmax]
    Receive buffer sizing in bytes (per socket)
  net.ipv4.tcp_wmem = [wmin] [wstd] [wmax]
    Memory reserved for send buffers in bytes (per socket)
  Modern kernels do automatic tuning of the receive and send buffers, and the defaults are better; still, if your BDP is very high, you may need to boost [wstd] and [wmax]. Keep [wmin] small for out-of-memory situations.

For large numbers of TCP connections
  net.ipv4.tcp_mem = [pmin] [pdef] [pmax]
    Pages allowed to be used by TCP (for all sockets)
  For 32-bit x86 systems, kernel text & data (including TCP buffers) can only be in the low 896MB.
    So on 32-bit x86 systems, do not adjust these numbers, since they are needed to balance memory usage with other lowmem users.
    If this is a problem, the best bet is to switch to a 64-bit x86 system first.
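
Continuing the same hedged example, for a high-BDP link where the autotuning defaults are too small (the numbers are illustrative, not recommendations):

  # /etc/sysctl.conf: [wmin] [wstd] [wmax] in bytes, per socket
  net.ipv4.tcp_rmem = 4096 262144 1048576
  net.ipv4.tcp_wmem = 4096 262144 1048576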

Increase transmit queue length
  The ethernet default of 100 is good for most networks, where we need to balance interactive responsiveness with large transfers
  However, for high speed networks and bulk transfer, this should be increased to some value between 1000 and 50000
    ifconfig eth0 txqueuelen 2000
  Tradeoffs: more kernel memory used; interactive response may be impacted.
    Experiment with ttcp to find the smallest value that works for your network/application.

Optimizing for Low Latency TCP
  This can be very painful, because TCP is not really designed for low latency applications.
    TCP is engineered to worry about congestion control on wide-area networks, and to optimize for throughput on large data streams.
  If you are writing your own application from scratch, basing your own protocol on UDP is often a better bet.
    Do you really need a byte-oriented service?
    Do you only need automatic retransmission to deal with lost packets?
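
The same setting as a sketch, with both the classic net-tools command from the slide and the equivalent iproute2 syntax (eth0 and 2000 are just the example values):

  ifconfig eth0 txqueuelen 2000          # traditional net-tools syntax
  ip link set dev eth0 txqueuelen 2000   # equivalent iproute2 syntax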

Nagle Algorithm
  Goal: to make networking more efficient by batching small writes into a bigger packet
    When the OS gets a small amount of data (a single keystroke in a telnet connection), it delays a very small amount of time to see if more bytes will be coming.
  This naturally increases latency!
  Disabling it requires an application-level change:
    int on = 1;
    setsockopt(sockfd, SOL_TCP, TCP_NODELAY, &on, sizeof(on));

Delayed Acknowledgements
  On the receiver end, wait a small amount of time before sending a bare acknowledgement, to see if there's more data coming (or if the program will send a response upon which you can piggy-back your acknowledgement)
  This can interact with TCP slow-start to cause longer latencies when the send window is initially small.
    After congestion, or after the TCP connection has been idle, the send window (maximum bytes of unacked data) must be set down to the MSS value

Agenda
  Introduction to Performance Tuning
  Filesystem and storage tuning
  Network tuning
  NFS performance tuning
  Memory tuning
  Application tuning

NFS Performance Tuning
  Optimize both your network and your filesystem
  In addition, there are various client and server specific settings that we'll discuss now
  General hint: use dedicated NFS servers
    NFS file serving uses all parts of your system: CPU time, memory, disk bandwidth, network bandwidth, PCI bus bandwidth
    Trying to run applications on your NFS servers will make both NFS and the apps run slowly

Tuning an NFS Server
  If you only export file system mountpoints, use the no_subtree_check option in /etc/exports (see the example below)
    Subtree checking can burn large amounts of CPU for metadata intensive workloads
  Bump up the number of NFS threads to a large number (it doesn't hurt that much to have too many). Say, 128... instead of 4 or 8, which is way too little. How to do this is distro-specific:
    /etc/sysconfig/nfs
    /etc/default/nfs-kernel-server

PCI Bus tuning
  NFS serving puts heavy demands on both networking cards and host bus adapters
  If you have a system with multiple PCI buses, put the networking and storage cards on different buses
    Network cards tend to use lots of small DMA transfers, which tends to hog the bus
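
A hedged sketch of both server-side changes; the export path, client subnet, and the RPCNFSDCOUNT variable name are illustrative, so check your distro's file:

  # /etc/exports: export a whole mountpoint, so subtree checking isn't needed
  /export/home  192.168.1.0/24(rw,no_subtree_check)

  # /etc/sysconfig/nfs (Red Hat) or /etc/default/nfs-kernel-server (Debian)
  RPCNFSDCOUNT=128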

NFS client tuning
  Make sure you use NFSv3 and not NFSv2
  Make sure you use TCP and not UDP
  Use the largest rsize/wsize that the client/server kernels support
    Modern clients/servers can do a megabyte at a time
  Use the hard mount option, and not soft
  Use intr so you can recover if an NFS server is down
  All of these are the default except for intr
    Remove outdated fstab mount options; just use rw,intr (see the fstab example below)

Tuning your network config for NFS
  Tune the network for bulk transfers (throughput)
  Use the largest MTU size you can
    For ethernets, consider using jumbo frames if all of the intervening switches/routers support it
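
A minimal /etc/fstab sketch reflecting the advice above (server name, export path, and mount point are placeholders; modern client and server kernels already default to NFSv3 or later, TCP, and large rsize/wsize):

  # /etc/fstab
  server:/export/home  /home  nfs  rw,intr  0  0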

Agenda
  Introduction to Performance Tuning
  Filesystem and storage tuning
  Network tuning
  NFS performance tuning
  Memory tuning
  Application tuning

Memory Tuning
  Memory tuning problems can often look like other problems
    Unneeded I/O caused by excessive paging/swapping
    Extra CPU time caused by cache/TLB thrashing
    Extra CPU time caused by NUMA-induced memory access latencies
  These subtleties require using more sophisticated performance measurement tools

To measure swapping activity
  The top(1) and free(1) commands will both tell you if any swap space is in use
    To a first approximation, if there is any swap in use, the system can be made faster by adding more RAM.
  To see current swap activity, use the sar(8) program
    First use of a very handy (and rather complicated) system activity recorder program; reading through the man page is strongly recommended
    Part of the sysstat package

Using sar to obtain swapping information
  Use sar -W <interval> [<number of samples>]
    Reports the number of pages written to (swapped out) and read from (swapped in) the paging device, per second
    The first output is the average since the system was started.
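
For example, a sample run that reports swapping activity every 5 seconds, 10 times:

  sar -W 5 10    # pswpin/s and pswpout/s: pages swapped in/out per second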

Optimizing swapping

Swapping vs. Paging

  The Translation Lookaside Buffer (TLB) speeds up translation from a virtual address to a physical address
    A translation normally requires 2-3 lookups in the page tables; the TLB cache short-circuits this lookup process
  The x86info program will show the TLB cache layout
  Hugepages are a way to avoid consuming too many TLB cache entries

  Ways to use hugepages:
    Build a kernel that avoids using modules
      The core kernel text segment uses huge pages; modules do not
    Modify an application to use hugepages (or configure an application to use them if it already has provision for hugepages)
      Use shmget(2) with the flag SHM_HUGETLB
      mount -t hugetlbfs none /hugepages, then mmap pages in /hugepages (see the example below)
    On new qemu/kvm, you can use the option -mem-path /hugepages
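
A hedged sketch of setting hugepages up from the shell; the pool size of 512 pages and the /hugepages mount point are arbitrary (on x86, each huge page is typically 2MB):

  # Reserve a pool of huge pages (512 x 2MB = 1GB here)
  echo 512 > /proc/sys/vm/nr_hugepages

  # Mount hugetlbfs so applications can mmap files backed by huge pages
  mkdir -p /hugepages
  mount -t hugetlbfs none /hugepages

  # Verify the pool
  grep Huge /proc/meminfo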

Configuring hugepages

Agenda

Tools for investigating applications
  strace/ltrace
  valgrind
  gprof
  oprofile
  perf
  Most of these tools work better if you have source access
    But sometimes source is not absolutely required

  strace and ltrace are useful for seeing what the application is doing
    Especially useful when you don't have source
  System call tracing: strace
  Shared library tracing: ltrace
  Run a new command with tracing:
    strace /bin/ls /usr
  Attach to an already existing process:
    ltrace -p 12345
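
A couple of hedged variations that are often handy (the PID 12345 and output path are placeholders):

  # Summarize which system calls a command makes and how long they take
  strace -c /bin/ls /usr

  # Attach to a running process, follow its children, timestamp each call,
  # and log to a file for later analysis
  strace -f -tt -o /tmp/trace.out -p 12345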

Valgrind
  Used for finding memory leaks and other memory access bugs
    Best used with source access (compiled with -g), but not strictly necessary
  Works by emulating x86 on x86 and adding checks to pointer references and malloc/free calls
    Other architectures are supported
  Commercial alternative: Purify (uses object code insertion)

C/C++ profiling using gprof
  To use, compile your code using the -pg option
    This will add code to the compiled binary to track each function call and its caller
  In addition, the program counter is sampled by the kernel at some regular interval (e.g., 100Hz or 1kHz) to find the hot spots
  Demo time!
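
Hedged usage sketches for both tools (myprog and its source file are placeholders):

  # Valgrind: check for leaks and invalid memory accesses
  gcc -g -o myprog myprog.c
  valgrind --leak-check=full ./myprog

  # gprof: build with profiling, run the program, then read the report
  gcc -pg -o myprog myprog.c
  ./myprog                      # writes gmon.out in the current directory
  gprof ./myprog gmon.out | less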

System profiling using oprofile

Perf: the next generation

  One other application issue which can be a very big deal: userspace locking
    Rip out fancy multi-level locking (i.e., user-space spinlocks, sched_yield() calls, etc.)
    Just use pthread mutexes, and be happy
  Linux implements pthread mutexes using the futex(2) system call, which avoids a kernel context switch except in the contended case
    The fast path really is fast! (So there's no need for fancy/complex multi-level locking; just rip it out)

  Setting CPU affinity is rarely a good idea... but it can be used to improve response time for critical tasks
    Set CPU affinity for tasks using taskset(1)
    Set CPU affinity for interrupt handlers using /proc/irq/<nn>/smp_affinity
  Strategies (see the examples below)
    Put producer/consumer processes on the same CPU
    Move interrupt handlers to a different CPU
    Use mpstat(1) and /proc/interrupts to get processor-related statistics
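
A hedged illustration of both affinity knobs; the PID, IRQ number 19, and CPU masks are placeholders (masks are hexadecimal bitmaps of allowed CPUs):

  # Pin an existing process (PID 4321) to CPUs 0 and 1 (mask 0x3)
  taskset -p 0x3 4321

  # Launch a latency-critical task pinned to CPU 2 (mask 0x4)
  taskset 0x4 ./critical_task

  # Steer interrupt 19 (e.g., a NIC) to CPU 3 (mask 0x8)
  echo 8 > /proc/irq/19/smp_affinity

  # See where interrupts and CPU time are actually going
  cat /proc/interrupts
  mpstat -P ALL 1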
Conclusion