As a full-stack developer, getting deep visibility into system memory usage is critical for diagnosing performance issues. The numastat tool provides invaluable NUMA memory statistics to help analyze behavior at a node granularity. This guide will cover everything from NUMA fundamentals to practical usage of numastat for memory performance analysis.

Overview of NUMA Architecture

NUMA (Non-Uniform Memory Access) is a modern system architecture where the CPU/memory complex is divided into localized nodes. Each node contains processors, memory, and I/O devices as shown:

[Figure: NUMA architecture diagram]

The key benefit of NUMA is memory access parallelism – by spreading memory across nodes, more aggregate memory bandwidth is available, avoiding contention. The downside is the latency of remote access: reading memory attached to another node is slower than reading local node memory.

In NUMA systems, allocating memory on the "wrong" node can cause major performance issues due to increased remote memory access. This is why tools like numastat are invaluable for developers to analyze the per-node memory allocation profile.

Role of Numastat in NUMA Memory Analysis

The numastat tool provides detailed visibility into the system's NUMA memory allocation behavior at node-level granularity.

Key metrics provided by numastat include:

  • Per-node memory allocation hit/miss counters
  • Local vs. remote allocation counters
  • Memory allocated per process on each node
  • Free memory availability across nodes

This information is vital for analyzing memory allocation efficiency. If a process has a high rate of remote memory access rather than local, its performance will suffer from increased latency.

Numastat makes this NUMA profiling easy – no code changes required. Developers can instantly see if memory is skewed, or if too many remote node accesses are happening.

Common use cases include:

  • Detecting overall system NUMA imbalance issues
  • Profiling per-process memory allocation
  • Diagnosing performance-critical applications
  • Capacity planning – identify node bottlenecks
  • Evaluating if NUMA optimizations like numactl adjustments are effective

In summary – having clear visibility into your system's NUMA memory efficiency is critical. Numastat fulfills this role with detailed, yet accessible node-level memory statistics.

Comparing Numastat to Other Memory Tools

Admins have access to many memory analysis tools in Linux including free, top, vmstat and more. So where does numastat fit in?

Key Advantage of Numastat

The major differentiator is that numastat reports statistics at NUMA node-level granularity. Tools like free show overall system memory, which obscures allocation imbalances across individual nodes.

[Figure: memory tools comparison]

Consider if Process A has memory heavily skewed to Node 0 while Node 1 sits underutilized. Tools like free wouldn't expose this skew – they display aggregate, system-level information.

Numastat provides this key missing visibility at the per-node level. This allows detecting NUMA imbalances and inefficient allocations affecting application performance.
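To make the contrast concrete, here is a small sketch with made-up per-node free-memory figures, showing how an aggregate view hides skew that a per-node view exposes:

```shell
# Hypothetical per-node free memory in MB (illustrative values only).
# An aggregate tool like free(1) effectively reports the sum, hiding the
# skew that a per-node view such as numastat -m would reveal.
awk -v n0_free=5300 -v n1_free=800 'BEGIN {
  print "aggregate free: " (n0_free + n1_free) " MB"
  print "per-node free:  node0=" n0_free " MB  node1=" n1_free " MB"
}'
```

Roughly 6GB free looks healthy in aggregate, yet node1 is nearly exhausted – exactly the situation per-node statistics surface.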

Getting Started with Numastat

Now that we've covered the importance of per-node memory profiling, let's dive into using numastat for memory analysis.

The numastat tool is provided by the numactl package. On Debian/Ubuntu-based distributions, install it with:

$ sudo apt install numactl

On RHEL/Fedora-based distributions, use sudo dnf install numactl (or yum) instead.

Once installed, no special permissions are required to run numastat. However, some options like showing per-process memory statistics require root privileges.

Numastat Output Overview

Running numastat with no arguments displays system-wide statistics:

$ numastat
                         node0      node1
numa_hit                24021M      2303M
numa_miss                    0          0
numa_foreign                 0          0
interleave_hit               0       732K
local_node                   0        13G
other_node                 27G          0

The output presents memory metrics for each NUMA node in separate columns:

  • numa_hit: Memory successfully allocated on the intended (preferred) node
  • numa_miss: Memory allocated on this node even though another node was preferred
  • numa_foreign: Memory intended for this node but allocated on another node instead
  • interleave_hit: Interleaved memory successfully allocated on the intended node
  • local_node: Memory allocated on this node while a process was running on it
  • other_node: Memory allocated on this node while a process was running on another node

Analyzing these metrics provides visibility into potential NUMA inefficiencies like skew across nodes, high rates of non-local access etc.
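These counters are cumulative page counts maintained by the kernel. As a rough sketch – using invented counter values, and one common way of defining the ratio – a node's allocation hit rate can be derived like this:

```shell
# Invented counters for a single node (pages allocated since boot).
# hit     = allocations satisfied on the preferred node
# foreign = allocations intended for this node that landed elsewhere
awk -v hit=95000 -v foreign=5000 'BEGIN {
  printf "local allocation hit rate: %.2f%%\n", 100 * hit / (hit + foreign)
}'
```

A ratio that drifts downward over time on one node is an early warning that the node can no longer satisfy its own allocation requests.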

Later sections cover interpreting these metrics in depth. First, let's explore numastat options for customizing output.

Numastat Usage Examples

Numastat includes several options for filtering, sorting, and customizing its memory statistics. Here we will cover some common examples demonstrating effective techniques.

Condensing Output

By default, numastat presents raw counters. To condense the table into rounded, more readable units, use the -c flag:

$ numastat -c
                 node0   node1
numa_hit         24.0G   2.30G
numa_miss            0       0
numa_frn             0       0
itlv_hit             0    732K
lcl_node             0   13.0G
oth_node         27.0G       0

Now memory sizes are cleanly shown in gigabyte and megabyte units – much easier to interpret!

Sorting Output by Node

The order of nodes in the default output can be arbitrary. To sort by highest memory usage, use the -s flag:

$ numastat -s
                         node1      node0
numa_hit                 2303M     24021M
numa_miss                    0          0
numa_foreign                 0          0
interleave_hit            732K          0
local_node                 13G          0
other_node                   0        27G

Now the node column order reflects overall memory activity, putting the busiest node first. This surfaces any imbalance instantly.

Showing Detailed Per-Node System Info

To display detailed memory stats per node similar to /proc/meminfo, leverage the -m flag:

$ numastat -m
Node 0   
  MemTotal:         65G
  MemFree:         5.3G
  MemUsed:          60G
  ...
Node 1
  MemTotal:         32G
  MemFree:           8G
  MemUsed:          24G
  ...  

This quickly shows total memory, used memory, active vs. inactive breakdowns, and more for each node. Very useful for capacity planning.
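For a quick capacity-planning aid, the per-node stanzas can be scraped with awk. The sketch below parses a captured report in the simplified layout shown above (real numastat -m output may be formatted differently across versions):

```shell
# Extract free memory per node from a saved "numastat -m"-style report.
# The sample text below mirrors the simplified layout above, not real output.
awk '
  /^Node [0-9]+/ { node = $1 " " $2 }            # remember the current node header
  /MemFree:/     { print node, "MemFree:", $2 }  # report free memory under it
' <<'EOF'
Node 0
  MemTotal:         65G
  MemFree:         5.3G
Node 1
  MemTotal:         32G
  MemFree:           8G
EOF
```

The same pattern extends to any per-node field you want to track, such as MemUsed or Active.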

Profiling Per-Process Memory

To check per-process memory allocation, pass a PID with -p. First identify the PID:

$ ps -ef | grep mysql
mysql 15311 1 0  Feb10 ? 00:00:09 /usr/bin/mysqld
$ numastat -p 15311

Per-node process memory usage (in MBs) for PID 15311 (mysqld)
                 Node 0     Node 1      Total
                -------    -------    -------
Huge               0.00       0.00       0.00
Heap            2100.50     120.25    2220.75
Stack             10.30       1.10      11.40
Private          346.80      38.65     385.45
                -------    -------    -------
Total           2457.60     160.00    2617.60

This reveals mysqld memory is heavily skewed to Node 0. An admin could use this data to rebalance allocation across nodes for better performance.
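The skew can be quantified as percentages. A quick sketch using the per-node totals from the example above (roughly 2.4G on Node 0, 160M on Node 1, expressed in MB):

```shell
# Per-node resident memory for one process, in MB
# (figures from the mysqld example above).
awk -v n0=2457.60 -v n1=160.00 'BEGIN {
  total = n0 + n1
  printf "Node 0: %.1f%%  Node 1: %.1f%%\n", 100 * n0 / total, 100 * n1 / total
}'
```

A 94/6 split like this is worth investigating whenever the process's threads are scheduled across both nodes.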

Advanced Numastat Usage

Now that we've covered basic output and usage, let's explore some advanced tricks for tapping numastat's full potential.

Continuously Monitor Numastat

Instead of periodic snapshots, pipe numastat into watch to monitor statistics continuously:

$ watch -n 1 numastat

Every 1.0s: numastat                                        Sat Feb 11 02:17:34 2023

                         node0      node1
numa_hit                24021M      2303M
numa_miss                    0          0
numa_foreign                 0          0
interleave_hit               0       732K
local_node                   0        13G
other_node                 27G          0

This refreshes stats every second, making it easy to spot trends. Customize the refresh rate as needed.
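Because the counters are cumulative since boot, absolute values matter less than their growth. One approach – sketched with inline sample snapshots instead of live numastat calls – is to diff two snapshots taken a few seconds apart:

```shell
# In practice:  numastat > /tmp/numa_snap1; sleep 5; numastat > /tmp/numa_snap2
# Here we write two sample snapshots inline so the sketch runs anywhere.
printf 'numa_hit 1000\nnuma_miss 100\n' > /tmp/numa_snap1
printf 'numa_hit 1500\nnuma_miss 160\n' > /tmp/numa_snap2

# First file fills prev[]; second file prints per-counter growth.
awk 'NR == FNR { prev[$1] = $2; next }
     { print $1, "delta:", $2 - prev[$1] }' /tmp/numa_snap1 /tmp/numa_snap2
```

A steadily growing numa_miss or numa_foreign delta between snapshots is a much clearer warning sign than a large but static total.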

NUMA Memory Imbalance Detection

A key benefit of numastat is detecting imbalanced memory allocation across nodes.

One indicator is a node's numa_foreign counter incrementing: the node was the preferred allocation target but lacked free memory, so the allocation fell back to a remote node.

Here Node 0 is running low on free memory, so its numa_foreign counter begins to grow as allocations spill over to Node 1:

$ numastat
                         node0      node1
numa_hit                24021M      2303M
numa_miss                    0          0
numa_foreign                2K          0  <---- imbalance indicator
interleave_hit               0       732K
local_node                   0        13G
other_node                 27G          0

Consistently growing numa_foreign counters on Node 0 indicate an imbalance: Node 0 can no longer fulfill local allocation requests, so memory spills over to Node 1 even while the system as a whole has capacity.

Identifying Remote Memory Access

Excess remote memory access hurts performance. This can be measured via:

  • numa_miss: Memory allocated on a node other than the one the process preferred (counted on the node that received it)
  • numa_foreign: Memory intended for this node but allocated on a remote node because local free memory was exhausted
  • other_node: Memory allocated on a node while the requesting process was running on a different node

Here is an example numastat output with indicators of inefficient remote memory access:

$ numastat
                         node0      node1
numa_hit                  200M        59G
numa_miss                    0       500K  <---- remote allocation
numa_foreign               16M          0
interleave_hit               0          0
local_node                300M        47G
other_node                 35G       234M  <---- cross-node access

Node 1 shows a high rate of non-local allocation. A developer would dig deeper into this application's memory policies to reduce remote memory utilization.

Evaluating Policy Changes

Numastat allows directly measuring the impact of NUMA policy changes – like using numactl to alter allocation behavior.

For example, we can pin a process to a given node using numactl. Before and after, profile with numastat:

$ numastat -p 3245

Per-node process memory usage (in MBs) for PID 3245 (app)
                 Node 0     Node 1
                -------    -------
Total            192.00    1433.60

Restart the application pinned to node 1 (note that relaunching under numactl starts a new process, so profile the new PID):

$ numactl --cpunodebind=1 --membind=1 ./app &
[1] 3391

$ numastat -p 3391

Per-node process memory usage (in MBs) for PID 3391 (app)
                 Node 0     Node 1
                -------    -------
Total              0.00    3276.80  <---- improved local access

Numastat directly measures the policy change – previously memory was split across nodes; after the numactl adjustment it is fully localized to node 1.

This example demonstrates using numastat to directly evaluate the impact of optimizations like numactl pinning.

Interpreting Numastat Statistics

Now that we've covered using numastat for memory profiling, let's explore interpreting key memory metrics to identify issues.

Detecting Imbalance via Hit Rate Analysis

A key indicator of imbalance is skew in memory access locality hit rates between nodes.

Consider this hypothetical 2 node system over time:

Time    Node0 Local Hit    Node1 Local Hit    Imbalanced?
T1            90%                90%              No
T2            75%                90%              Yes
T3            55%                90%              Yes

At T1 hit rates are equivalent. By T2 and T3, Node 0's local memory access drops significantly compared to Node 1. This indicates an imbalance where Node 0 can no longer service memory locally despite Node 1 having capacity.

Tools like vmstat don't expose hit rate imbalances since they lack per-node visibility. Numastat makes this analysis possible.
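Assuming per-node local_node and other_node counters are available, the local-access ratio used in the table above can be computed with a one-liner (illustrative values):

```shell
# loc = pages allocated on the node the requesting process was running on
# oth = pages allocated for processes running on other nodes
# (invented figures; on a real system these come from numastat output)
awk -v loc=9000 -v oth=1000 'BEGIN {
  printf "local access ratio: %.0f%%\n", 100 * loc / (loc + oth)
}'
```

Tracking this ratio per node over time is what turns the T1/T2/T3 comparison above into an automated check.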

Identifying Memory Fragmentation

Another key metric is memory fragmentation levels per node. Having excessive fragmentation can prevent large allocations even if aggregate free memory exists.

The -m option displays fragmentation related metrics per node:

$ numastat -m
Node 0
  MemFree:         2.1G
  Slab:             256M     <---- kernel slab caches
  SReclaimable:      32M
  SUnreclaim:       224M

Watch for growth trends in Slab and SUnreclaim over time. In the example above, Node 0 reports 2.1GB free, but 224M of slab memory is unreclaimable; if such usage keeps growing, large application allocations can fail even though aggregate free memory looks healthy.

As a developer, keeping track of per-node slab growth and fragmentation guides node sizing decisions and prevents misleading "out-of-memory" surprises.

Expert Best Practices for Numastat

From a full-stack developer perspective, here are my recommended best practices for leveraging numastat in performance tuning exercises:

Continuous profiling – Rather than ad-hoc spot checks, utilize numastat statistics in continuous monitoring. Dashboards tracking key ratios over time are highly valuable for identifying emerging issues fast.

Compare relative hit rates – Focus less on absolute counters and more on memory access locality hit ratios between nodes to catch imbalance issues.

Account for fragmentation – When evaluating free memory levels, be sure to consider slab/reclaimable metrics to measure loss from fragmentation.

Prioritize local access – Optimizing CPU binding, memory policies and eliminating remote memory access should be priority one – huge performance implications.

Set alerts for key metrics – Thresholds on critical metrics like 10%+ numa_foreign or 30%+ slab fragmentation make tuning proactive.
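A threshold check like the numa_foreign rule can be scripted directly. The sketch below uses hard-coded sample counters; in a real monitor these would be parsed from numastat output, and the 10% threshold is just the illustrative figure from above:

```shell
# Sample counters in pages (replace with values parsed from numastat).
hit=90000
foreign=12000
threshold=10   # alert when numa_foreign exceeds this % of numa_hit

pct=$(awk -v h="$hit" -v f="$foreign" 'BEGIN { printf "%d", 100 * f / h }')
if [ "$pct" -ge "$threshold" ]; then
  echo "ALERT: numa_foreign is ${pct}% of numa_hit"
else
  echo "OK: numa_foreign is ${pct}% of numa_hit"
fi
```

Wired into cron or a monitoring agent, a check like this turns a silent NUMA regression into an actionable alert.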

Correlate metrics – Cross-reference metrics between tools – does vmstat indicate swap use while numastat shows free memory? Metrics tell a unified story.

Size nodes appropriately – Watch local_node vs other_node over time. Reducing cross-node activity guides right-sizing nodes for your workload's memory footprint.

Conclusion

As we've explored, having clear visibility into your system's NUMA memory efficiency is critical for eliminating performance bottlenecks.

The numastat tool provides simple yet powerful node-level memory statistics that unlock many optimization opportunities:

  • Identifying hotspots and imbalance issues
  • Reducing inefficient remote memory access
  • Profiling applications to prevent node resource contention
  • Sizing nodes appropriately as workload memory demands evolve

I highly recommend numastat as a standard part of any developer's performance tuning toolkit – the insights it provides are invaluable. Core metrics like local hit rates, fragmentation and cross-node activity cut straight to the heart of NUMA efficiency.

Combine numastat profiling with robust monitoring and alerting to catch emerging NUMA issues proactively. Optimizing memory locality should be an ongoing initiative given the huge performance implications.

I hope this guide has provided a comprehensive overview of tapping numastat's capabilities for efficient memory utilization in NUMA environments. Let me know if any questions arise applying these techniques!
