Tracing-Driven Performance Analysis for ROS 2 Manipulation Stacks
Real-time robotic systems such as manipulator arms require predictable latency for safe operation. As robots share compute resources across perception, planning, and communication tasks, understanding how system load affects motion performance is critical to building reliable autonomous systems.
This harness measures how system load affects ROS 2 manipulation performance by running controlled experiments on a simulated Panda robot arm under different stress conditions.
- Quantified DDS vulnerability - First systematic measurement showing ROS 2 manipulation fails completely under message flood before showing any CPU degradation
- Manipulation domain coverage - Extends prior work on navigation robots to manipulation stacks (MoveIt + ros2_control)
- Reproducible methodology - Fully automated harness for CloudLab with kernel-level tracing
- Realistic CPU isolation - taskset (sched_setaffinity) CPU pinning simulates embedded systems (Jetson) more accurately than artificial stress. Kernel-enforced, verified via /proc/PID/status
- On-Device CPU Scheduling for Robot Systems
- PiCAS: Priority-Driven Chain-Aware Scheduling for ROS2
- PAAM: Priority-Driven Accelerator Management in ROS 2
- ros2_tracing: Multipurpose Low-Overhead Framework
| Finding | Detail |
|---|---|
| CPU contention has minimal impact | 80% CPU load does not degrade MoveIt planning (~15ms baseline vs ~14ms under load) |
| DDS message flooding is catastrophic | 4000 msg/s causes 100% execution failure |
| The bottleneck is communication, not compute | Hardening efforts should focus on DDS queue management and pub-sub throttling |
# 1. Clone and bootstrap (installs EVERYTHING)
git clone https://github.com/AayushBaniya2006/ldos_manip_tracing.git
cd ldos_manip_tracing
make bootstrap
# 2. IMPORTANT: Log out and back in for tracing group
exit
# SSH back in
# 3. Verify and run
cd ~/ldos_manip_tracing
make preflight # Explicit host/tool checks
make smoke_test # Verify setup works
make run_all NUM_TRIALS=10 # Run full experiment suite
# 4. Analyze results
make analyze_all
make report

A simulated Panda 7-DOF robot arm executes a simple reaching motion:
Start: Home position (arm upright)
Goal: [x=0.4, y=0.2, z=0.5] meters relative to base
Motion: MoveIt plans collision-free path, ros2_control executes trajectory
This motion is repeated under different load conditions:
| Scenario | Stress Applied | Expected Outcome |
|---|---|---|
| Baseline | None | All trials succeed |
| CPU Load | 4 workers x 80% CPU (stress-ng) | Measure planning degradation |
| Message Load | 4000 msg/s DDS flood | System failure |
| cpuset_limited | taskset CPU pinning (1-8 CPUs) | Realistic embedded constraint |
| Metric | Description | Source | Weight |
|---|---|---|---|
| T1 | Planning latency | MoveIt feedback | 1.0 |
| T2 | Execution latency | Action result | 1.0 |
| T3 | Control loop jitter | LTTng callbacks | 0.5 |
| T4 | Total end-to-end latency | Wall clock | 1.5 |
| T5 | Simulation step timing | Gazebo updates | 0.3 |
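A hypothetical sketch of how the T1-T5 weights might combine into a single score. The real aggregation lives in `objectives.yaml` and is not shown here; `weighted_score` is an illustrative name, only the weights come from the table above.

```python
# Weights from the metric table above (T1-T5)
WEIGHTS = {"T1": 1.0, "T2": 1.0, "T3": 0.5, "T4": 1.5, "T5": 0.3}

def weighted_score(metrics_ms: dict[str, float]) -> float:
    """Hypothetical aggregate: weighted sum of per-metric values (ms)."""
    return sum(WEIGHTS[name] * value for name, value in metrics_ms.items())

# Planning 10 ms (weight 1.0) + end-to-end 100 ms (weight 1.5) -> 160.0
assert weighted_score({"T1": 10.0, "T4": 100.0}) == 160.0
```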
| Scenario | Trials | Success | Planning (ms) | Execution (ms) | Total (ms) |
|---|---|---|---|---|---|
| Baseline | 11 | 100% | 15.6 +/- 4.1 | 656.2 +/- 500.4 | 673.7 +/- 501.5 |
| CPU Load | 11 | 100% | 13.6 +/- 1.2 | 529.4 +/- 93.1 | 545.0 +/- 93.8 |
| Msg Load | 10 | 0% | 13.5 +/- 4.0 | - | FAILED |
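The summary rows can be re-derived from the per-trial CSV in the appendix; for example, the baseline planning figure (11 trials, including baseline_profiled):

```python
from statistics import mean, stdev

# Baseline planning_latency_ms values from combined_summary.csv (11 trials)
planning = [12.95, 22.87, 13.00, 11.41, 23.12, 14.26,
            14.03, 14.15, 13.21, 13.67, 18.40]

assert round(mean(planning), 1) == 15.6   # matches "15.6 +/- 4.1" above
assert round(stdev(planning), 1) == 4.1
```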
Isolation mechanism: `taskset -c` (Linux `sched_setaffinity` syscall)

`taskset` sets a CPU affinity bitmask on the process. The Linux kernel scheduler will not run the process on any CPU outside this mask. The affinity is:

- Inherited across `fork(2)` - all child processes get the same restriction
- Preserved across `execve(2)` - launching new binaries keeps the restriction

This means `taskset -c 31 ros2 launch ...` restricts the launch process AND every ROS node it spawns to CPU 31. No cgroups needed.

Verification: Every trial checks `/proc/PID/status` -> `Cpus_allowed_list` for all ROS processes. An independent background CPU monitor continuously samples actual CPU usage. Both confirm correct pinning across all configs.
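The same verification can be scripted. A minimal sketch (Linux only, stdlib only) that parses `Cpus_allowed_list` and cross-checks it against the kernel's `sched_getaffinity` view of the same process:

```python
import os

def allowed_cpus(pid: int) -> set[int]:
    """Parse Cpus_allowed_list from /proc/<pid>/status into a set of CPU ids."""
    with open(f"/proc/{pid}/status") as f:
        line = next(l for l in f if l.startswith("Cpus_allowed_list"))
    cpus: set[int] = set()
    for part in line.split(":")[1].strip().split(","):
        if "-" in part:                      # a range like "30-31"
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:                                # a single CPU like "31"
            cpus.add(int(part))
    return cpus

# The /proc view and the syscall view must agree for a correctly pinned process
assert allowed_cpus(os.getpid()) == set(os.sched_getaffinity(0))
```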
| ROS CPUs | Trials | Success | Planning (ms) | Execution (ms) | Total (ms) | Isolation Verified |
|---|---|---|---|---|---|---|
| 1 | 5 | 100% | 7.75 +/- 3.65 | 761.33 +/- 349.98 | 771.12 +/- 348.58 | Yes (CPU monitor) |
| 2 | 5 | 100% | 7.40 +/- 5.10 | 490.68 +/- 109.32 | 500.02 +/- 110.68 | Yes |
| 4 | 5 | 100% | 8.90 +/- 5.35 | 490.73 +/- 144.42 | 501.66 +/- 142.86 | Yes |
| 8 | 5 | 100% | 8.86 +/- 4.33 | 510.69 +/- 157.99 | 521.60 +/- 154.49 | Yes |
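The 1-CPU penalty quoted in the findings follows directly from this table:

```python
# Execution latency (ms) from the sweep table: 1-CPU vs 2-CPU (the saturation point)
one_cpu, two_cpu = 761.33, 490.68
increase_pct = (one_cpu - two_cpu) / two_cpu * 100
assert round(increase_pct) == 55  # the "55% execution latency increase"
```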
CPU Monitor Confirmation (per-process CPU pinning):
| Config | robot_state_pub | parameter_bridge | move_group | Expected |
|---|---|---|---|---|
| 1-CPU | [31] | [31] | [31] | 31 |
| 2-CPU | [30,31] | [30,31] | [30,31] | 30-31 |
| 4-CPU | [28-31] | [28-31] | [28-31] | 28-31 |
| 8-CPU | [24-31] | [24-31] | [24-31] | 24-31 |
- CPU load does NOT degrade planning - Planning time is slightly lower under CPU stress (15.6ms -> 13.6ms), possibly due to CPU governor behavior or cache effects.
- CPU load REDUCES execution variance - Standard deviation drops from 500ms to 93ms under load, suggesting more predictable behavior under controlled contention.
- Message flood causes complete failure - All 10 trials failed (typically `control_failed`, MoveIt error `-4`). DDS middleware collapses under 4000 msg/s.
- 1-CPU restriction causes 55% execution latency increase - Single-core time-sharing between `move_group`, `robot_state_publisher`, and `parameter_bridge` degrades execution from ~490ms to ~761ms with significantly higher variance (std=350ms vs ~110ms).
- Workload saturates at 2 CPUs - 2-CPU and 4-CPU configs produce statistically identical execution latencies (490.68 vs 490.73 ms). Additional CPUs beyond 2 provide no benefit for this manipulation workload.
- Planning latency is CPU-independent - RRTConnect planning time (~7-9ms) is unaffected by CPU count. The single-threaded planner completes too fast for CPU restriction to matter on this hardware.
- Isolation mechanism: taskset works, cgroups fail from SSH - cgroups v2 cpuset enforcement consistently fails due to cross-branch migration restrictions from SSH session scopes. `taskset` (`sched_setaffinity`) provides complete isolation: affinity is inherited across `fork(2)` and preserved across `execve(2)`, covering all ROS 2 launch child processes.
stress-ng --cpu 4 --cpu-load 80 --cpu-method matrixprod --timeout 300s

| Parameter | Value | Notes |
|---|---|---|
| Workers | 4 | Pinned to CPUs 1-4 (CPU 0 reserved) |
| Load | 80% | Target utilization per worker |
| Method | matrixprod | CPU-bound, no I/O |
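`stress-ng --cpu-load` throttles each worker by alternating busy and idle phases. A toy Python model of one such duty cycle (an illustration of the idea, not the actual stress-ng implementation):

```python
import time

def duty_cycle(load: float, period: float = 0.1) -> float:
    """Approximate a --cpu-load worker: busy-spin for load*period seconds,
    then sleep the remainder. Returns the measured busy fraction."""
    start = time.perf_counter()
    busy_until = start + load * period
    while time.perf_counter() < busy_until:
        pass                                  # burn CPU
    busy = time.perf_counter() - start
    time.sleep((1.0 - load) * period)         # idle phase
    total = time.perf_counter() - start
    return busy / total

# One 100 ms cycle at 80% load spends roughly 80% of it busy
assert 0.6 < duty_cycle(0.8) < 0.95
```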
The message flood uses msg_flood_node.py to saturate DDS middleware:
# 4 publishers, each at 1000 Hz = 4000 msg/s total
ros2 run ldos_harness msg_flood_node.py --rate 1000 --topic /flood_topic_1 &
ros2 run ldos_harness msg_flood_node.py --rate 1000 --topic /flood_topic_2 &
ros2 run ldos_harness msg_flood_node.py --rate 1000 --topic /flood_topic_3 &
ros2 run ldos_harness msg_flood_node.py --rate 1000 --topic /flood_topic_4 &

| Parameter | Value | Notes |
|---|---|---|
| Publishers | 4 | Separate processes, separate topics |
| Rate | 1000 Hz each | 1 msg/ms per publisher |
| Payload | 1024 bytes | std_msgs/String filled with 'X' |
| QoS | BEST_EFFORT / VOLATILE | No backpressure, fire-and-forget |
| Total throughput | 4000 msg/s (~4 MB/s) | 4 publishers x 1000 Hz x 1024 B |
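The quoted throughput is just the product of the parameters above:

```python
# 4 publishers x 1000 Hz x 1024-byte payload
publishers, rate_hz, payload_bytes = 4, 1000, 1024
assert publishers * rate_hz == 4000                          # msg/s
assert publishers * rate_hz * payload_bytes / 1e6 == 4.096   # ~4 MB/s
```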
Why BEST_EFFORT breaks the system:
- No acknowledgments or retries - messages are fire-and-forget
- DDS discovery and transport queues overflow
- MoveIt action server cannot receive feedback from controller
- Controller returns `UNKNOWN` status
- MoveIt commonly reports `CONTROL_FAILED` (error code `-4`)
- Harness records this explicitly as `control_failed` (with raw MoveIt error code/label)
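The overflow dynamic can be illustrated with a toy model: a bounded fire-and-forget queue where an oversubscribed consumer silently loses traffic. This is a sketch of the failure mode, not actual DDS internals.

```python
from collections import deque

def delivered_fraction(rate_in: int, rate_out: int, depth: int, ticks: int = 1000) -> float:
    """Toy BEST_EFFORT pipe: each 1 ms tick the publisher enqueues and the
    subscriber drains; deque(maxlen) silently drops the oldest (no backpressure)."""
    q = deque(maxlen=depth)
    sent = delivered = 0
    for _ in range(ticks):
        for _ in range(rate_in // 1000):              # messages published this ms
            q.append(1)
            sent += 1
        for _ in range(min(rate_out // 1000, len(q))):  # messages drained this ms
            q.popleft()
            delivered += 1
    return delivered / sent

# A 4000 msg/s flood against a subscriber draining 1000 msg/s loses most traffic...
assert delivered_fraction(4000, 1000, depth=10) < 0.3
# ...while a matched rate delivers everything
assert delivered_fraction(1000, 1000, depth=10) == 1.0
```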
Instead of artificial CPU stress (stress-ng), this scenario uses real CPU resource limitation via Linux taskset (sched_setaffinity syscall). This simulates embedded systems (Jetson Nano, Xavier) more realistically by hard-limiting which CPUs the ROS stack can use.
┌─────────────────────────────────────────────────────────────────────┐
│ CloudLab Node (auto-detect CPUs) │
├───────────────────────────────┬─────────────────────────────────────┤
│ GAZEBO (taskset: CPUs 0-N) │ ROS STACK (taskset: CPUs N+1-31) │
│ ───────────────────────── │ ─────────────────────────── │
│ • gz-sim8 (Gazebo server) │ • move_group (MoveIt) │
│ │ • robot_state_publisher │
│ (Physics + gz_ros2_control │ • ros_gz_bridge (clock) │
│ may run in this scope, │ • benchmark_runner.py │
│ including controller mgr) │ │
│ │ Kernel enforces: these processes │
│ (Receives all CPUs except │ CANNOT run on Gazebo CPUs. │
│ the ROS-limited set) │ Verified via /proc/PID/status. │
└───────────────────────────────┴─────────────────────────────────────┘
How it works:
# This is the core of the isolation — one line:
taskset -c "30-31" ros2 launch ldos_harness ros_stack.launch.py
# The kernel's sched_setaffinity restricts this process AND all children
# (fork inherits, execve preserves) to CPUs 30-31. No cgroups needed.

Quick Start:
# Check system support
make check_cpuset
# Run with 2 CPUs for ROS stack (default)
make run_cpuset NUM_TRIALS=10
# Run CPU sweep (1, 2, 4, 8 CPUs) to find saturation point
make sweep_cpuset NUM_TRIALS=5
# Custom CPU count
ROS_CPU_COUNT=1 make run_cpuset NUM_TRIALS=10

Why taskset and not cgroups v2 cpuset?
- cgroups v2 cpuset consistently fails from SSH sessions due to cross-branch migration restrictions (`echo $$ > cgroup.procs` denied)
- `systemd-run --scope -p AllowedCPUs=X` silently ignores AllowedCPUs for child processes (systemd 255 bug)
- `taskset` calls `sched_setaffinity()` directly - no cgroup infrastructure needed, works from any context, and the kernel enforces it unconditionally
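The fork/exec inheritance claim is easy to verify from any language that exposes the syscall; a minimal sketch using Python's stdlib wrappers (Linux only):

```python
import os
import subprocess
import sys

original = os.sched_getaffinity(0)
cpu = min(original)
os.sched_setaffinity(0, {cpu})   # same syscall taskset -c uses

# A child interpreter (fork + execve under the hood) inherits the mask.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(sorted(os.sched_getaffinity(0)))"],
    capture_output=True, text=True, check=True,
)
os.sched_setaffinity(0, original)  # undo the pinning for anything running after
assert child.stdout.strip() == str([cpu])
```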
Interpretation caveat:
- `move_group`, bridges, and benchmark are constrained to the ROS-limited CPU set
- `gz_ros2_control`/`controller_manager` may execute inside the Gazebo process (Gazebo scope), depending on runtime embedding
- Treat the results as ROS userspace CPU-budget sensitivity under a split simulator/ROS scope design
# Clean stale state (recommended between runs)
make cleanup_processes
# Verify system support (taskset, CPU count, etc.)
make check_cpuset
# CPU-count experiments (disable tracing noise)
ENABLE_TRACING=0 ROS_CPU_COUNT=4 make run_cpuset NUM_TRIALS=10
make analyze_all && make report
make archive_latest TAG=cpuset_roscpu4_trials10 SCENARIO=cpuset_limited
ENABLE_TRACING=0 ROS_CPU_COUNT=2 make run_cpuset NUM_TRIALS=10
make analyze_all && make report
make archive_latest TAG=cpuset_roscpu2_trials10 SCENARIO=cpuset_limited
ENABLE_TRACING=0 ROS_CPU_COUNT=1 make run_cpuset NUM_TRIALS=10
make analyze_all && make report
make archive_latest TAG=cpuset_roscpu1_trials10 SCENARIO=cpuset_limited

Expected Results:
| ROS CPUs | Expected Outcome |
|---|---|
| 4 | Similar to baseline |
| 2 | Slight degradation, still passing |
| 1 | Significant degradation or failures |
LTTng provides kernel-level tracing with minimal overhead (~2-5%), capturing:
- Exact callback execution times (nanosecond precision)
- Kernel scheduler events (context switches, wake latency)
- ROS 2 middleware internals (executor, pub/sub timing)
ROS 2 Userspace (ros2:*)
- `callback_start` / `callback_end` - Callback execution timing
- `rclcpp_publish` / `rcl_take` - Message emission and reception
- `rclcpp_executor_*` - Executor scheduling behavior
Kernel Tracepoints
- `sched_switch` - Context switches
- `sched_wakeup` - Task wake events (scheduling latency)
- `sched_migrate_task` - CPU migration
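How these kernel events turn into a latency metric can be sketched: pair each task's `sched_wakeup` with the first `sched_switch` that runs it. The event tuples below are hypothetical, not the real babeltrace2 API.

```python
def wake_latencies(events):
    """Scheduling latency per task: time from sched_wakeup until the first
    sched_switch that runs it. `events` is (timestamp_ns, name, task)."""
    pending, latency = {}, {}
    for ts, name, task in events:
        if name == "sched_wakeup":
            pending.setdefault(task, ts)       # keep the earliest wakeup
        elif name == "sched_switch" and task in pending:
            latency[task] = ts - pending.pop(task)
    return latency

trace = [
    (1_000, "sched_wakeup", "move_group"),
    (1_060, "sched_switch", "move_group"),
    (2_000, "sched_wakeup", "gz-sim8"),
    (2_450, "sched_switch", "gz-sim8"),
]
assert wake_latencies(trace) == {"move_group": 60, "gz-sim8": 450}
```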
LTTng Session -> CTF Binary -> babeltrace2 -> analyze_trace.py -> CSV/JSON
make analyze_all always aggregates benchmark JSON results. If Python bt2/babeltrace2 bindings are unavailable, per-trace CTF analysis is skipped, but summary CSV/report generation still works.
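The core of the per-trace step can be sketched as pairing `ros2:callback_start`/`ros2:callback_end` events into durations. The tuples are hypothetical stand-ins; analyze_trace.py's actual parsing goes through the bt2 bindings.

```python
def callback_durations(events):
    """Pair ros2:callback_start / ros2:callback_end by callback handle and
    return execution durations in ns. `events` is (timestamp_ns, name, handle)."""
    open_cbs, durations = {}, []
    for ts, name, handle in events:
        if name == "ros2:callback_start":
            open_cbs[handle] = ts
        elif name == "ros2:callback_end" and handle in open_cbs:
            durations.append(ts - open_cbs.pop(handle))
    return durations

trace = [
    (10_000, "ros2:callback_start", 0xA),
    (10_500, "ros2:callback_end",   0xA),
    (20_000, "ros2:callback_start", 0xB),
    (20_700, "ros2:callback_end",   0xB),
]
assert callback_durations(trace) == [500, 700]
```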
# Profile during an experiment (recommended)
make profile_baseline # Runs experiment with 60s CPU profiling
make profile_cpu_load # CPU load with profiling
# Profile running stack
make profile_cpu # System-wide
make profile_moveit # MoveIt-specific

Interactive flamegraph showing CPU time distribution. Width = CPU time. Click to zoom in the SVG version.
Baseline (system mostly idle):
28.30% swapper [kernel] io_idle <- Benchmark completes fast
0.21% gz_ros2_control::PostUpdate <- Gazebo-ROS bridge
0.01% gz_hardware::read/write <- Hardware interface
CPU Load (stress-ng active):
29.62% stress-ng-cpu <- Injected artificial load
28.30% swapper io_idle <- Still idle despite stress
0.06% gz_ros2_control::PostUpdate <- ROS unchanged
Key insight: ROS 2 manipulation has minimal CPU footprint. The benchmark completes in ~2s, leaving most profiling time as idle. The bottleneck is I/O (message passing), not CPU.
- CPU isolation analysis - DONE. Sweep found saturation at 2 CPUs (see results above). Next: combine CPU restriction with LDoS load injection
- Find exact DDS breaking point - Sweep message rates (100, 500, 1000, 2000, 3000 msg/s) to identify failure threshold
- Test DDS tuning - Increase queue sizes, try shared memory transport, compare CycloneDDS vs FastDDS
- Extend to real hardware - Validate findings on physical Panda arm
- Compare profiling approaches - eBPF (lightweight) vs LTTng (detailed) tradeoffs
- Apply adaptive throttling - Integrate runtime pub-sub throttling to prevent failures
- Priority-driven scheduling - Implement PiCAS-style chain-aware scheduling for ROS 2 executors
make help # Show all targets
# === Setup ===
make bootstrap # Fresh CloudLab setup (installs everything)
make preflight # Explicit host/tool preflight checks
make setup # Build workspace only
make clean # Remove build artifacts
# === Validation ===
make smoke_test # Quick check (~60 sec)
make acceptance_test # Full component verification
# === Experiments ===
make run_baseline # Baseline trials (NUM_TRIALS=10)
make run_cpu_load # CPU load trials (stress-ng)
make run_msg_load # Message load trials (DDS flood)
make run_all # All scenarios
# === CPU Isolation (taskset / sched_setaffinity) ===
make check_cpuset # Verify taskset + system CPU support
make run_cpuset # Run CPU-limited experiments (uses taskset)
make sweep_cpuset # Sweep ROS stack CPUs (1, 2, 4, 8)
make cleanup_processes # Kill stale ROS/Gazebo/LTTng state
make archive_latest # Archive latest experiment artifacts (TAG=...)
# === Profiling ===
make profile_baseline # Baseline + CPU profiling
make profile_cpu_load # CPU load + profiling
make profile_cpu # Profile running stack
# === Analysis ===
make analyze_all # Process traces and aggregate
make report # Summary statistics
# Analysis/report prefer the latest experiment manifest written by
# run_experiment_suite.sh to avoid mixing stale trial files from older runs.
# === Parameter Sweeps ===
make sweep_cpu # Sweep CPU load 0-90% (stress-ng)
make sweep_msg # Sweep message rate
make find_breaking # Binary search for breaking point

ldos_manip_tracing/
├── Makefile # Entry point (make help)
├── scripts/
│ ├── bootstrap_cloudlab.sh # Complete CloudLab setup (10 phases)
│ ├── run_experiment_suite.sh # Main experiment orchestrator
│ ├── run_single_trial.sh # Single trial execution
│ ├── start_trace.sh # LTTng session start
│ ├── stop_trace.sh # LTTng session stop
│ ├── cpu_load.sh # stress-ng wrapper
│ ├── msg_load.sh # DDS flood launcher
│ ├── cpuset_launch_v2.sh # taskset (sched_setaffinity) CPU isolation wrapper
│ ├── cpuset_sweep.sh # CPU count parameter sweep (1, 2, 4, 8 CPUs)
│ ├── check_cpuset_support.sh # Verify taskset + system CPU support
│ ├── cpu_monitor.sh # Background CPU usage monitor with affinity verification
│ ├── cleanup_harness_processes.sh # Standard stale-process/session cleanup
│ ├── archive_latest_experiment.sh # Archive latest run artifacts for reporting
│ ├── cpu_profile.sh # perf + FlameGraph
│ └── analyze_traces.sh # Post-processing pipeline
├── src/ldos_harness/ # ROS 2 package
│ ├── launch/
│ │ ├── full_stack.launch.py # Complete stack (Gazebo + ROS)
│ │ ├── sim_only.launch.py # Gazebo only (for cpuset isolation)
│ │ ├── ros_stack.launch.py # ROS only (for cpuset isolation)
│ │ ├── sim_bringup.launch.py
│ │ └── moveit_bringup.launch.py
│ ├── config/ # URDF, SRDF, controller configs
│ └── scripts/
│ ├── benchmark_runner.py # Trial execution node
│ └── msg_flood_node.py # DDS flood publisher
├── configs/
│ ├── experiment_config.yaml # Trial counts, timeouts, goals
│ ├── tracing_config.yaml # LTTng event selection
│ └── objectives.yaml # Metric definitions (T1-T5)
├── analysis/
│ ├── analyze_trace.py # LTTng CTF processor
│ └── output/ # Results (CSVs, flamegraphs)
├── traces/ # Raw LTTng traces
└── results/ # Benchmark JSON results
Full results in analysis/output/combined_summary.csv:
Click to expand trial data
trial_id,scenario,status,planning_latency_ms,execution_latency_ms,total_latency_ms
baseline_001,baseline,success,12.95,510.86,525.71
baseline_002,baseline,success,22.87,561.46,586.35
baseline_003,baseline,success,13.00,510.28,525.28
baseline_004,baseline,success,11.41,311.83,325.49
baseline_005,baseline,success,23.12,460.29,485.35
baseline_006,baseline,success,14.26,560.35,576.52
baseline_007,baseline,success,14.03,660.40,676.38
baseline_008,baseline,success,14.15,710.63,726.74
baseline_009,baseline,success,13.21,260.32,275.40
baseline_010,baseline,success,13.67,560.59,576.12
baseline_profiled,baseline,success,18.40,2111.32,2131.77
cpu_load_001,cpu_load,success,14.18,560.58,576.79
cpu_load_002,cpu_load,success,12.62,411.51,426.09
cpu_load_003,cpu_load,success,12.47,461.03,475.55
cpu_load_004,cpu_load,success,15.38,661.51,678.84
cpu_load_005,cpu_load,success,12.27,511.08,525.34
cpu_load_006,cpu_load,success,13.74,410.52,426.34
cpu_load_007,cpu_load,success,13.25,511.45,526.74
cpu_load_008,cpu_load,success,13.47,511.39,527.09
cpu_load_009,cpu_load,success,12.37,511.25,525.59
cpu_load_010,cpu_load,success,13.92,711.74,727.72
cpu_load_profiled,cpu_load,success,16.00,560.92,579.01
msg_load_001,msg_load,execution_failed,14.87,10.26,27.37
msg_load_002,msg_load,execution_failed,12.52,11.21,25.75
msg_load_003,msg_load,execution_failed,12.41,10.57,24.98
msg_load_004,msg_load,execution_failed,12.62,11.17,26.05
msg_load_005,msg_load,execution_failed,14.38,11.60,28.10
msg_load_006,msg_load,execution_failed,12.55,10.31,24.87
msg_load_007,msg_load,execution_failed,21.90,10.45,35.01
msg_load_008,msg_load,execution_failed,13.67,11.27,27.11
msg_load_009,msg_load,execution_failed,5.42,10.75,18.26
msg_load_010,msg_load,execution_failed,14.15,10.35,26.69

Installed automatically by make bootstrap:
- Ubuntu 24.04 (Noble)
- ROS 2 Jazzy
- Gazebo Harmonic (gz-sim8)
- MoveIt 2
- ros2_control + ros2_controllers
- LTTng (lttng-tools, liblttng-ust-dev)
- ros2_tracing
- Linux perf + FlameGraph
- Python 3.12+ (pandas, scipy, matplotlib, babeltrace2)
- stress-ng
pkill -9 -f gazebo; pkill -9 -f ros2; pkill -9 -f gzserver
sleep 10
make smoke_test

# Check that taskset support and CPU detection work:
make check_cpuset
# During a run, verify processes are pinned:
grep Cpus_allowed_list /proc/$(pgrep -f move_group)/status
# Should show only the assigned CPUs (e.g., "31" for 1-CPU, "30-31" for 2-CPU)

This usually indicates stale state or synchronization drift, not a CPU-scaling result.
make cleanup_processes
ENABLE_TRACING=0 ROS_CPU_COUNT=4 make run_cpuset NUM_TRIALS=10

If repeated invalid starts occur, the harness captures per-trial diagnostics in:
results/cpuset_limited/*_diagnostics.txt
sudo usermod -aG tracing $USER
exit
# SSH back in

pkill -9 -f gazebo; pkill -9 -f ros2; pkill -9 -f gzserver; pkill -9 -f move_group
lttng list 2>/dev/null | grep ldos | awk '{print $1}' | xargs -I {} lttng destroy {} 2>/dev/null
sleep 10

MIT License
