As a professional Linux developer and systems engineer, understanding how to deeply analyze a system's performance is a crucial skill. While utilities like top and ps provide surface-level overviews of utilization, uncovering true bottlenecks requires digging deeper into what the CPU cores, memory, caches, interrupts and other subsystem events reveal.

This is where the powerful perf profiling and tracing tool shines. Perf exposes fine-grained CPU and memory statistics, system call tracing, kernel profiling, storage latency analytics, custom events and more, all with overhead low enough for production environments.

In this comprehensive 3200+ word guide, you will gain expert techniques to drill into Linux performance, identify optimizations and fix a wide variety of bottlenecks.

Introduction to Perf

Perf is included in the Linux kernel source and exposes a multitude of performance events via a simple command line interface. As Brendan Gregg puts it concisely:

"perf is like a Swiss army knife, with specialized tools applicable in many situations, both software and hardware."

Some background on the four main capabilities:

Counting: perf can leverage hardware performance counters present in all modern CPUs to precisely measure events like cycles, cache misses, branches, faults. This helps quantify efficiency of code.

Profiling: In addition to metrics, perf also attributes utilization to specific processes, functions, instructions, threads to uncover optimization opportunities.

Tracing: The Linux kernel exposes a rich set of software tracepoints allowing perf to trace syscalls, scheduler events, page faults, interrupts with little overhead.

Custom Events: Perf supports custom kernel probes, user-level markers to build tailored analysis suited to your workload.

These capabilities enable targeted diagnosis of everything from lock contention to storage latency to hot code paths.

While utilities like sar and vmstat can indicate high CPU or I/O load at a surface level, gaining meaningful insight requires deeper profiling tools like perf. With its low overhead and rich output, perf has become an invaluable performance analysis tool for Linux.

Now let's jump in and go over some common perf techniques.

Listing Available Perf Events

Perf can capture hundreds of hardware, software, tracepoint and other events, organized into symbolic categories.

To view the events available on your system, use:

$ perf list

The output organizes events into functional groups like branches, faults, migrations, major page faults etc. You can grep for keywords like CPU to filter:

$ perf list | grep CPU

Or show just hardware events:

$ perf list hw

Additionally, high level groupings help filter events:

$ perf list 'sched:*' # scheduler tracepoints
$ perf list 'kmem:*'  # kernel memory events

Having awareness of these hardware, software, tracepoint metrics is key to leveraging perf for analysis.

Profiling Application Impact

A simple way to benchmark an application's system utilization is perf stat, which profiles CPU, memory, disk I/O and more while running a specified command.

For example, to understand the efficiency of a grep process searching logs:

$ perf stat grep "errors" /var/log/syslog

Outputs high level utilization:

         1.150 CPUs utilized
        93,932 context-switches
           174 page-faults
         4.718 M cycles
            65 billion instructions

This quickly shows heavy context switching, which usually indicates the process is stalling on I/O rather than using the CPU efficiently. Just a few runs establishes an efficiency baseline for comparison.
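One derived metric worth computing from these counters is instructions per cycle (IPC): higher IPC generally means better CPU efficiency. A minimal sketch, using hypothetical counter values in place of real `perf stat -x,` CSV output:

```shell
#!/bin/sh
# Compute instructions-per-cycle (IPC) from perf stat CSV-style output.
# The counter values below are hypothetical stand-ins for what
# `perf stat -x, -e cycles,instructions <cmd>` would emit.
perf_csv='4200000000,,cycles
8400000000,,instructions'

ipc=$(echo "$perf_csv" | awk -F, '
  $3 == "cycles"       { cycles = $1 }
  $3 == "instructions" { instr  = $1 }
  END { printf "%.2f", instr / cycles }')
echo "IPC: $ipc"
```

An IPC well below 1 on a modern superscalar CPU usually points to memory stalls or branch mispredictions.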

As another example, count the write() system calls tar makes while creating an archive:

$ perf stat -e syscalls:sys_enter_write tar cf backup.tar /work

Having quick access to hundreds of metrics lets you validate performance best practices, compare code changes and more.

Hotspot Profiling with Perf Top

While perf stat provides system level statistics, to understand consumption at a process, function or instruction level requires profiling CPU activity.

perf top offers a real-time live profile view akin to the classic top command:

$ sudo perf top 

It will continuously show the hottest code paths occupying the CPUs using various metric groupings:

  • Overhead – Percent of samples in symbol
  • Samples – Number of samples for symbol
  • Command – Process name
  • Symbol – Code path including shared libraries

So at a glance, heavy symbols either in the kernel or user space are identified.

Further, toggling the display menu allows grouping by threads, source files or assembly code to better attribute resource usage. The -e option selects different events like cache-misses or migrations, and -d sets the refresh delay in seconds to reduce overhead.

Perf top makes it very easy to quickly identify any outlier processes or code paths hogging precious CPU resources.

Recording for In-depth Analysis

While perf stat and perf top offer live analysis, deeper post-processing requires capturing profile data to disk.

The perf record subcommand profiles specified workloads, while minimizing impact to production environments. This allows detailed analysis of the saved recordings through interactive browsers, custom processing or sharing with other engineers.

The common recipe is:

1. Capture profiler data to disk

$ perf record -a -g -- <commands>

2. Inspect using interactive TUI browser

$ perf report 

For example, to analyze network code paths in a server:

$ perf record -e net:net_dev_xmit -ag -- nc -l 8080 # profile recording
<client traffic generation>
Ctrl + C  

$ perf report # tui browser

This highlights hotspots by network driver, kernel module, code path. Using the browser, developers can toggle views to source code, assembly, annotate lines, generate flame graphs and more. Having the ability to capture full-context system profiles with negligible overhead is extremely useful.
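For scripted summaries outside the TUI, report output can also be post-processed. Here is a hedged sketch that picks the hottest symbol from hypothetical report-style lines; in real use the sample data would be replaced by piping from `perf report --stdio -i perf.data`:

```shell
#!/bin/sh
# Pick the hottest symbol from perf-report-style stdio output.
# The report lines below are hypothetical samples; real input would
# come from `perf report --stdio -i perf.data`.
report='    42.10%  server  libc.so   [.] memcpy
    17.55%  server  server    [.] handle_request
     3.02%  server  [kernel]  [k] copy_user_generic'

top_symbol=$(echo "$report" | sort -rn | head -n 1 | awk '{ print $NF }')
echo "hottest symbol: $top_symbol"
```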

Perf integrates call graph support, allowing tracing across user/kernel boundaries:

$ perf record --call-graph dwarf -a -o perf.data sleep 60

This traces the whole system with stack unwinding enabled for 60 seconds, saving output to perf.data. The integrated callgraph visualizer then helps identify hot paths:

$ perf report --call-graph -i perf.data

In general, leveraging perf record for in-depth processing delivers insights not possible via live analysis alone. The saved recordings, custom post-processing and visualizations uncover serious bottlenecks.

Microbenchmarking Code Optimizations

Beyond system level profiling, Perf also allows targeted tracing of specific application or kernel functions. This is invaluable when optimizing code efficiency.

For example, confirming that vectorization speeds up a hash routine:

$ perf stat -e instructions:u,cycles:u \
  ./hash_table_insert input.txt    

$ perf stat -e instructions:u,cycles:u \
  ./hash_table_insert_vect input.txt   

Comparing instruction counts, cycles between the scalar and SIMD versions quickly validates optimization efficacy.

To further microbenchmark cycles per loop iteration:

$ perf stat -e cycles:u -I 1000 \
  ./multivariate_knn_search input.csv query.csv  

$ perf stat -e cycles:u -I 1000 \
  ./multivariate_knn_search_opt input.csv query.csv

The -I option prints counter values at the given interval in milliseconds, allowing cost-per-operation estimation over time.

This methodology provides empirical proof for improvements, and allows precise comparison between code variants.
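To turn two such perf stat runs into a single speedup figure, a small helper like this can be used. The cycle counts are hypothetical placeholders; substitute the values perf actually reports:

```shell
#!/bin/sh
# Compare cycle counts from two perf stat runs (scalar vs. vectorized).
# Both counts are hypothetical; paste in the values perf reports.
scalar_cycles=9000000000
simd_cycles=3000000000

speedup=$(awk -v a="$scalar_cycles" -v b="$simd_cycles" \
  'BEGIN { printf "%.1f", a / b }')
echo "speedup: ${speedup}x"
```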

Identifying Latency Issues

In addition to throughput, inconsistent latency also affects application health. Perf provides tools to quantify if latency meets SLAs.

For example, to analyze RPC server response time distribution:

$ perf buildid-cache --add `which server`

$ perf record -e cycles -a -g \
  -o server_latency.data ./server
<client load generation>
Ctrl + C

$ perf script | stackcollapse-perf.pl > out.folded  

$ flamegraph.pl out.folded > rpc_server_latency.svg

This captures the server execution profile, folds the stacks, and renders a flame graph visualization focusing just on server-side paths.

Any long tail is easily identified, pointing to optimization opportunities. Repeating this across builds produces latency distribution graphs that characterize each architecture.

For patch validation, tracking worst-case behavior is key:

$ perf stat -e cycles -I 1000 -a -x, -- ./server

The -x, flag emits CSV interval counts that can be post-processed into maximum and average figures, providing quantitative evidence of whether SLA violations are addressed in new builds.
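A worst-interval figure can be extracted from `perf stat -I 1000 -x,` CSV output with a short awk helper. The interval lines below are hypothetical sample data:

```shell
#!/bin/sh
# Find the worst (max) per-interval cycle count from
# `perf stat -e cycles -I 1000 -x, -a` style CSV output.
# The interval lines below are hypothetical sample data.
intervals='1.000100000,810000000,,cycles
2.000200000,1450000000,,cycles
3.000300000,930000000,,cycles'

max=$(echo "$intervals" | awk -F, '$2 + 0 > m { m = $2 } END { print m }')
echo "max cycles in one interval: $max"
```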

Identifying System Bottlenecks

In complex applications, performance issues can stem from the operating system itself. Tools like vmstat indicate high I/O or CPU load at a surface level, but pinpointing kernel bottlenecks requires internal profiling.

Perf allows instrumenting Linux kernel functions to uncover issues like lock contention, resource limitations, scalability bottlenecks and more. Of course kernel tracing requires elevated permissions and can capture sensitive data – so use care in leveraging such capabilities.

Common kernel profiling examples include:

Mutex Contention

$ sudo perf record -e lock:contention_begin -ag -- sleep 10

Traces lock contention system-wide (the lock:contention_begin tracepoint is available on kernels 5.19 and later; older kernels can probe lock functions via kprobes instead).

Scheduler Statistics

$ sudo perf record -e 'sched:*' -ag -- sleep 30

Captures context switches, CPU migrations indicating scheduler issues.
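The resulting trace can then be summarized per process. A minimal sketch counting sched_switch events, using hypothetical `perf script` lines as stand-in input:

```shell
#!/bin/sh
# Count sched_switch events per process from perf-script-style output.
# The trace lines below are hypothetical samples of what
# `perf script` prints after recording sched:sched_switch.
trace='server  1234 [000] 100.000100: sched:sched_switch: prev=...
server  1234 [001] 100.000200: sched:sched_switch: prev=...
worker  5678 [002] 100.000300: sched:sched_switch: prev=...'

summary=$(echo "$trace" \
  | awk '{ n[$1]++ } END { for (c in n) print n[c], c }' \
  | sort -rn)
echo "$summary"
```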

Filesystem Activity

$ sudo perf record -e 'ext4:*' -ag -- cp file1 file2

Profiles ext4 tracepoints during the workload, detecting filesystem overheads (substitute the tracepoint group for your filesystem).

These scenarios highlight perf's power in getting kernel level visibility to optimize system efficiency. Of course the Linux kernel supports 1000+ trace events making very comprehensive tracing possible.

Let's look at storage profiling in more depth.

Storage Profiling for Latency Analysis

In addition to computation bottlenecks, I/O subsystem issues also affect workload performance. While utilities like iotop show aggregate disk throughput, really analyzing latency requires tracing actual storage driver code paths.

Perf implements block-layer tracing via the kernel's block tracepoints:

$ sudo perf record -e 'block:*' -a

This instruments the storage stack to log all I/O events with microsecond accuracy. After stopping the workload, perf script dumps the trace:

$ sudo perf script > disk_trace

The output shows latency for every I/O:

   8,007,759,645,545 ns       aio_submit              8 KiB
                +19 μs        aio_complete
                +21 μs        callback preparation
               +131 μs        block plug
               +153 μs        block_bio_frontmerge

Latency spikes at each driver layer are visible, exposing issues in that component. This methodology accurately highlights areas for drive firmware patches or controller tuning without guesswork.
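Per-request latency can be derived by pairing issue and completion events in the trace. A minimal sketch, using hypothetical simplified trace lines keyed by sector; real `perf script` output carries more fields:

```shell
#!/bin/sh
# Pair block request issue/completion events to compute per-request
# latency in microseconds. The simplified trace lines below are
# hypothetical; real `perf script` output carries more fields.
trace='100.000100 block:block_rq_issue 2048
100.000900 block:block_rq_issue 4096
100.001300 block:block_rq_complete 2048
100.002900 block:block_rq_complete 4096'

latencies=$(echo "$trace" | awk '
  $2 == "block:block_rq_issue"    { start[$3] = $1 }
  $2 == "block:block_rq_complete" { printf "sector %s: %.0f us\n",
                                      $3, ($1 - start[$3]) * 1e6 }')
echo "$latencies"
```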

For large system-wide traces capturing multiple storage devices, the BCC/BPF biosnoop tool provides visually cleaner output.

The key strength here is Perf having built-in, low-level block tracing – while allowing scripting custom analytics on trace outputs.

Custom Events for Specialized Analysis

While Perf supports rich hardware event and static kernel tracing, the options keep expanding with kernel updates and hardware evolution.

To support fully custom analysis, perf integrates kernel dynamic tracing, a capability popularized by DTrace.

Kprobes allow registering probes on virtually any kernel routine, executing handlers that capture system state. This unlocks visibility into whatever domain is required.

For example, confirming efficacy of a new caching optimization by counting cache hits:

# perf probe --add filemap_map_pages

$ perf record -e probe:filemap_map_pages -aR ls

$ perf script | grep filemap_map_pages

This dynamically instruments, profiles and extracts metrics for the page cache routine.
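Counting how often the probed routine fired is then a one-liner over the perf script output. The trace lines below are hypothetical samples for the probe event:

```shell
#!/bin/sh
# Count probe hits from perf-script-style output.
# These lines are hypothetical samples for a probe:filemap_map_pages event.
trace='ls  4321 [000] 200.000100: probe:filemap_map_pages: (ffffffff81234560)
ls  4321 [000] 200.000150: probe:filemap_map_pages: (ffffffff81234560)
ls  4321 [000] 200.000210: probe:filemap_map_pages: (ffffffff81234560)'

hits=$(echo "$trace" | grep -c 'probe:filemap_map_pages')
echo "probe hits: $hits"
```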

Similarly, arbitrary markers can be added to userspace applications via uprobes. This helps capture metrics across layers, such as from database to storage.

BPF takes dynamic tracing to the next level, allowing injection of custom programs for analysis. Perf interoperates with BPF, enabling customized observability.

The key advantage here is the ability to build targeted, low-overhead tracing on top of perf even for new domains. This keeps perf evergreen, available for the next class of bottlenecks!

Summary

Perf provides a versatile toolkit combining static performance events, kernel tracing alongside dynamic probes – enabling precise profiling of the full system stack from hardware to software.

We covered common workflows like:

  • Quantifying application efficiency via perf stat
  • Identifying hot code paths consuming CPUs using perf top
  • Recording execution profiles for in-depth diagnosis via perf record
  • Microbenchmarking code optimizations
  • Discovering latency violations across layers
  • Kernel tracing to uncover OS bottlenecks
  • Custom dynamic probing for specialized tracing

Yet perf supports even more advanced analysis like frequency profiling, watchdog traces, PEBS sampling and parsed metrics output. The official documentation and examples cover these additional capabilities.

With its low overhead and rich metrics capture, perf has become an invaluable performance analysis tool. I encourage you to explore further how perf can help unlock bottlenecks and optimize efficiency in your environment. Let me know which perf techniques you find most useful!