Scale2017perfanalysisbpf169 170304230834
Linux 4.x Tracing:
Performance Analysis with bcc/BPF
Brendan Gregg
Senior Performance Architect
Mar 2017
Linux tracing
in the last 3 years…
How do we
use these
superpowers?
Take aways
1. Understanding the value of Linux tracing superpowers
2. Upgrade to Linux 4.4+ (4.9 is better)
3. Ask for eBPF support in your perf analysis/monitoring tools
Ye Olde BPF
Berkeley Packet Filter
10 x 64-bit registers
maps (hashes)
actions
kprobes
Intrusion Detection
BPF uprobes
Container Security tracepoints
06:20:16
msecs : count distribution
0 -> 1 : 36 |**************************************|
2 -> 3 : 1 |* |
4 -> 7 : 3 |*** |
8 -> 15 : 17 |***************** |
16 -> 31 : 33 |********************************** |
32 -> 63 : 7 |******* |
64 -> 127 : 6 |****** |
[…]
These CLI tools may be useful even if you never use them, as examples of what to implement in GUIs
New Visualizations and GUIs
E.g., Netflix self-service UI:
Flame Graphs
Tracing Reports
…
BPF TRACING
A Linux Tracing Timeline
- 1990’s: Static tracers, prototype dynamic tracers
- 2000: LTT + DProbes (dynamic tracing; not integrated)
- 2004: kprobes (2.6.9)
- 2005: DTrace (not Linux), SystemTap (out-of-tree)
- 2008: ftrace (2.6.27)
- 2009: perf_events (2.6.31)
- 2009: tracepoints (2.6.32)
- 2010-2016: ftrace & perf_events enhancements
- 2012: uprobes (3.5)
- 2014-2017: enhanced BPF patches: supporting tracing events
- 2016-2017: ftrace hist triggers
also: LTTng, ktap, sysdig, ...
Linux Events & BPF Support
BPF output
Linux 4.4 Linux 4.7 Linux 4.9
BPF stacks
Linux 4.6
Linux 4.3
Linux 4.1
(version
BPF
support
arrived)
Linux 4.9
Event Tracing Efficiency
E.g., tracing TCP retransmits
Kernel
Old way: packet capture
send
tcpdump 1. read buffer
2. dump receive
Analyzer 1. read
2. process file system disks
3. print
per-event uprobes
data async
output sampling, PMCs
copy
statistics maps perf_events
Introducing bcc
samples/bpf/sock_example.c
87 lines truncated
C/BPF
samples/bpf/tracex1_kern.c
58 lines truncated
bcc/BPF (C & Python)
bcc examples/tracing/bitehist.py
entire program
ply/BPF
https://github.com/iovisor/ply/blob/master/README.md
entire program
The Tracing Landscape, Mar 2017
(my opinion)
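[Figure: tracers plotted by "Ease of use", from (brutal) to (less brutal), against the "Stage of" axis, from (alpha) to (mature). Labeled tools: dtrace4L., ply/BPF, ktap, sysdig, perf, stap, LTTng, ftrace (hist triggers), bcc/BPF, C/BPF. Other labels: (many), recent changes.]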
More reliable and complete indicator than measuring disk I/O latency
Also: btrfsslower, xfsslower, zfsslower
4. biolatency
• Identify multimodal latency and outliers with a histogram:
# biolatency -mT 1
Tracing block device I/O... Hit Ctrl-C to end.
(The "count" column is summarized in-kernel.)
06:20:16
msecs : count distribution
0 -> 1 : 36 |**************************************|
2 -> 3 : 1 |* |
4 -> 7 : 3 |*** |
8 -> 15 : 17 |***************** |
16 -> 31 : 33 |********************************** |
32 -> 63 : 7 |******* |
64 -> 127 : 6 |****** |
[…]
Average latency (iostat/sar) may not be representative with multiple modes or outliers
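To illustrate why an average can mislead here, the following is a minimal Python sketch, not biolatency's implementation. It groups latencies into the same power-of-2 buckets as the histogram above and compares them with the mean. The sample latencies and the helper names `log2_bucket` and `log2_hist` are invented for illustration.

```python
from collections import Counter


def log2_bucket(value):
    """Return the (low, high) power-of-2 bucket for a non-negative integer."""
    if value < 2:
        return (0, 1)
    low = 1 << (value.bit_length() - 1)
    return (low, 2 * low - 1)


def log2_hist(values):
    """Count values per power-of-2 bucket, sorted by bucket."""
    counts = Counter(log2_bucket(v) for v in values)
    return sorted(counts.items())


# Hypothetical latencies in ms: a fast mode, a slow mode, and a few outliers.
latencies_ms = [0] * 36 + [40] * 33 + [100] * 6

hist = log2_hist(latencies_ms)
mean_ms = sum(latencies_ms) / len(latencies_ms)

for (low, high), count in hist:
    print(f"{low:>4} -> {high:<4} : {count:<4} |{'*' * count}")
print(f"mean = {mean_ms:.1f} ms")
```

With this made-up data the mean lands in the 16 -> 31 bucket, which contains no I/O at all, while the histogram shows the two modes and the outliers directly.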
5. biosnoop
• Dump disk I/O events for detailed analysis. tcpdump for disks:
# biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74
0.000178002 supervise 1950 xvda1 W 13092432 4096 0.61
0.001469001 supervise 1956 xvda1 W 13092440 4096 1.24
0.001588002 supervise 1956 xvda1 W 13115128 4096 1.09
1.022346001 supervise 1950 xvda1 W 13115272 4096 0.98
1.022568002 supervise 1950 xvda1 W 13188496 4096 0.93
1.023534000 supervise 1956 xvda1 W 13188520 4096 0.79
1.023585003 supervise 1956 xvda1 W 13189512 4096 0.60
2.003920000 xfsaild/md0 456 xvdc W 62901512 8192 0.23
[…]
Can import this into a spreadsheet and do a scatter plot of time vs latency, etc.
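One possible way to prepare that import is sketched below in Python. It assumes the output is whitespace-separated and that COMM contains no spaces; the function names `biosnoop_to_rows` and `rows_to_csv` are hypothetical and not part of bcc. The sample lines are copied from the output above.

```python
import csv
import io

# Sample lines copied from the biosnoop output shown on this slide.
BIOSNOOP_OUTPUT = """\
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000004001 supervise 1950 xvda1 W 13092560 4096 0.74
0.000178002 supervise 1950 xvda1 W 13092432 4096 0.61
2.003920000 xfsaild/md0 456 xvdc W 62901512 8192 0.23
[...]
"""


def biosnoop_to_rows(text):
    """Parse biosnoop output into (time_s, latency_ms) pairs."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    t_idx = header.index("TIME(s)")
    lat_idx = header.index("LAT(ms)")
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip truncated or malformed lines such as "[...]"
        rows.append((float(fields[t_idx]), float(fields[lat_idx])))
    return rows


def rows_to_csv(rows):
    """Render the pairs as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["time_s", "latency_ms"])
    writer.writerows(rows)
    return out.getvalue()


rows = biosnoop_to_rows(BIOSNOOP_OUTPUT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The resulting two-column CSV can be opened in a spreadsheet and plotted as time (x) against latency (y).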
6. cachestat
• Measure file system cache hit ratio statistics:
# cachestat
HITS MISSES DIRTIES READ_HIT% WRITE_HIT% BUFFERS_MB CACHED_MB
170610 41607 33 80.4% 19.6% 11 288
157693 6149 33 96.2% 3.7% 11 311
174483 20166 26 89.6% 10.4% 12 389
434778 35 40 100.0% 0.0% 12 389
435723 28 36 100.0% 0.0% 12 389
846183 83800 332534 55.2% 4.5% 13 553
96387 21 24 100.0% 0.0% 13 553
120258 29 44 99.9% 0.0% 13 553
255861 24 33 100.0% 0.0% 13 553
191388 22 32 100.0% 0.0% 13 553
[…]
Efficient: dynamic tracing of TCP connect functions only; does not trace send/receive
8. tcpaccept
• Trace passive ("inbound") TCP connections:
# tcpaccept
PID COMM IP RADDR LADDR LPORT
2287 sshd 4 11.16.213.254 100.66.3.172 22
4057 redis-server 4 127.0.0.1 127.0.0.1 28527
4057 redis-server 4 127.0.0.1 127.0.0.1 28527
4057 redis-server 4 127.0.0.1 127.0.0.1 28527
4057 redis-server 4 127.0.0.1 127.0.0.1 28527
2287 sshd 6 ::1 ::1 22
4057 redis-server 4 127.0.0.1 127.0.0.1 28527
4057 redis-server 4 127.0.0.1 127.0.0.1 28527
2287 sshd 6 fe80::8a3:9dff:fed5:6b19 fe80::8a3:9dff:fed5:6b19 22
[…]
[…]
from Craig Hanson and Pat Crain, and the performance engineering community
trace
Trace custom events:
# trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
TIME PID COMM FUNC -
05:18:23 4490 dd sys_read read 1048576 bytes
05:18:23 4490 dd sys_read read 1048576 bytes
05:18:23 4490 dd sys_read read 1048576 bytes
^C
trace -h lists example one-liners:
# trace -h
[...]
trace -K blk_account_io_start
Trace this kernel function, and print info with a kernel stack trace
trace 'do_sys_open "%s", arg2'
Trace the open syscall and print the filename being opened
trace 'sys_read (arg3 > 20000) "read %d bytes", arg3'
Trace the read syscall and print a message for reads >20000 bytes
trace r::do_sys_return
Trace the return from the open syscall
trace 'c:open (arg2 == 42) "%s %d", arg1, arg2'
Trace the open() call from libc only if the flags (arg2) argument is 42
trace 't:block:block_rq_complete "sectors=%d", args->nr_sector'
Trace the block_rq_complete kernel tracepoint and print # of tx sectors
[...]
by Sasha Goldshtein
argdist
# argdist -H 'p::tcp_cleanup_rbuf(struct sock *sk, int copied):int:copied'
[15:34:45]
copied : count distribution
0 -> 1 : 15088 |********************************** |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 4786 |*********** |
128 -> 255 : 1 | |
256 -> 511 : 1 | |
512 -> 1023 : 4 | |
1024 -> 2047 : 11 | |
2048 -> 4095 : 5 | |
4096 -> 8191 : 27 | |
8192 -> 16383 : 105 | |
16384 -> 32767 : 0 | |
Shows function argument distributions
by Sasha Goldshtein
Coming to a GUI near you
BCC/BPF VISUALIZATIONS
Latency Heatmaps
CPU + Off-CPU Flame Graphs
• Can now be BPF optimized
http://www.brendangregg.com/flamegraphs.html
Conquer Performance
On-CPU + off-CPU
means we can
measure everything
Except sometimes
one off-CPU stack
isn't enough…
Off-Wake Flame
Graphs
• Shows blocking stack with
waker stack
– Better understand why blocked
– Merged in-kernel using BPF
– Include multiple waker stacks == chain graphs
2. …/docs/tutorial.md
3. …/docs/tutorial_bcc_python_developer.md
4. …/docs/reference_guide.md
5. .../CONTRIBUTING-SCRIPTS.md
bitehist.py Output
# ./bitehist.py
Tracing... Hit Ctrl-C to end.
^C
kbytes : count distribution
0 -> 1 : 3 | |
2 -> 3 : 0 | |
4 -> 7 : 211 |********** |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 1 | |
128 -> 255 : 800 |**************************************|
bitehist.py Code
bcc examples/tracing/bitehist.py
bitehist.py Annotated
Event
Statistics
"kprobe__" is a shortcut for BPF.attach_kprobe()
bcc examples/tracing/bitehist.py
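The slide's code did not survive extraction, so here is a hedged sketch of what a bitehist-style bcc program can look like. It is written from memory of the bcc example and may differ from the file in the repository. It assumes a kernel that exposes `blk_account_io_completion` for kprobes (this varies by kernel version) and needs bcc plus root privileges to run; `run()` is a wrapper name chosen here.

```python
import time

# BPF program text (C). Assumes blk_account_io_completion exists on this kernel.
bpf_text = r"""
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

// The "kprobe__" prefix asks bcc to attach this function as a kprobe on the
// kernel function named after the prefix, instead of calling attach_kprobe().
int kprobe__blk_account_io_completion(struct pt_regs *ctx, struct request *req)
{
    // Statistics: count I/O sizes in kbytes, in power-of-2 buckets, in-kernel.
    dist.increment(bpf_log2l(req->__data_len / 1024));
    return 0;
}
"""


def run():
    """Load the program, wait for Ctrl-C, then print the histogram."""
    from bcc import BPF  # imported lazily: needs bcc installed and root

    b = BPF(text=bpf_text)
    print("Tracing... Hit Ctrl-C to end.")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        print()
    b["dist"].print_log2_hist("kbytes")
```

Calling `run()` as root should produce output in the same shape as the bitehist.py output shown earlier. Only the histogram is copied to user space, not each event.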
Current Complications
• Initialize all variables
• Extra bpf_probe_read()s
• BPF_PERF_OUTPUT()
• Verifier errors
Bonus Round
PLY
File opens
# ply -c 'kprobe:do_sys_open { printf("opened: %s\n", mem(arg(1), "128s")); }'
1 probe active
opened: /sys/kernel/debug/tracing/events/enable
opened: /etc/ld.so.cache
opened: /lib/x86_64-linux-gnu/libselinux.so.1
opened: /lib/x86_64-linux-gnu/libc.so.6
opened: /lib/x86_64-linux-gnu/libpcre.so.3
opened: /lib/x86_64-linux-gnu/libdl.so.2
opened: /lib/x86_64-linux-gnu/libpthread.so.0
opened: /proc/filesystems
opened: /usr/lib/locale/locale-archive
opened: .
[...]
Count vfs calls
# ply -c 'kprobe:vfs_* { @[func()].count(); }'
WRN kprobe_attach_pattern: 'vfs_caches_init_early' will not be probed
WRN kprobe_attach_pattern: 'vfs_caches_init' will not be probed
49 probes active
^Cde-activating probes
@:
vfs_fstat 33
vfs_getattr 34
vfs_getattr_nosec 35
vfs_open 49
vfs_read 66
vfs_write 68
[...]
Read return size
# ply -c 'kretprobe:SyS_read { @ret.quantize(retval()); }'
1 probe active
^Cde-activating probes
@ret:
0 7
1 24
[ 2, 3] 5
[ 4, 7] 0
[ 8, 15] 1
[ 16, 31] 1
[ 32, 63] 3
[ 64, 127] 3
[ 128, 255] 2
[ 256, 511] 1
[ 512, 1k) 11
Read return size (ASCII)
# ply -A -c 'kretprobe:SyS_read { @ret.quantize(retval()); }'
1 probe active
^Cde-activating probes
@ret:
0 7 |################### |
1 12 |################################|
[ 2, 3] 7 |################### |
[ 4, 7] 0 | |
[ 8, 15] 1 |### |
[ 16, 31] 2 |##### |
[ 32, 63] 7 |################### |
[ 64, 127] 3 |######## |
[ 128, 255] 2 |##### |
[ 256, 511] 1 |### |
[ 512, 1k) 11 |############################# |
Read latency
# ply -A -c 'kprobe:SyS_read { @start[tid()] = nsecs(); }
kretprobe:SyS_read /@start[tid()]/ { @ns.quantize(nsecs() - @start[tid()]);
@start[tid()] = nil; }'
2 probes active
^Cde-activating probes
[...]
@ns:
@:
schedule+0x1
sys_exit+0x17
do_syscall_64+0x5e
return_from_SYSCALL_64 1
[...]
schedule+0x1
fuse_dev_read+0x63
new_sync_read+0xd2
__vfs_read+0x26
vfs_read+0x96
sys_read+0x55
do_syscall_64+0x5e
return_from_SYSCALL_64 1707
schedule+0x1
do_syscall_64+0xa2
return_from_SYSCALL_64 4647
ply One-Liners
# Trace file opens:
ply -c 'kprobe:do_sys_open { printf("opened: %s\n", mem(arg(1), "128s")); }'
• High-level language
– Simple one-liners
– Short scripts
• In development
– kprobes and tracepoints only, uprobes/perf_events not yet
– Successful so far as a proof of concept
– Not production tested yet (bcc is)
Future work
CHALLENGES
Challenges
• Marketing
• Documentation
• Training
• Community
Without these, we may have another ftrace: a built-in "secret" of Linux. Not good for adoption!
https://www.iovisor.org project helps, but tracing (observability) is only one part.
Take aways
1. Understanding the value of Linux tracing superpowers
2. Upgrade to Linux 4.4+ (4.9 is better)
3. Ask for eBPF support in your perf analysis/monitoring tools
– Questions?
– iovisor bcc: https://github.com/iovisor/bcc
– http://www.brendangregg.com
– http://slideshare.net/brendangregg
– bgregg@netflix.com
– @brendangregg