heap corruption / segfault during Python GC inside capture_while (found in 1.13.0.dev20260422) #1385

@adenzler-nvidia

Summary

A deterministic SIGSEGV (or, on glibc, a "double free or corruption" abort) occurs during Python garbage collection while a warp.capture_while region is active on the CPU device. The same workload runs cleanly on 1.13.0.dev20260421 and crashes on 1.13.0.dev20260422, so this is a one-day regression.

The crash profile points at a missed reference-retention path in the new CPU graph-capture / APIC replay work added in #1349.

Reproducer

Any workload that drives mujoco-warp on CPU inside a graph capture reproduces the crash. Concrete example (from newton-physics/newton):

git clone https://github.com/newton-physics/newton.git
cd newton
# pin the bad nightly (re-resolves only warp-lang, to this exact version)
uv lock --upgrade-package warp-lang==1.13.0.dev20260422
CUDA_VISIBLE_DEVICES= uv run --extra dev -m newton.tests \
    -k test_selection.example_selection_cartpole_cpu

Expected: test passes.
Actual: SIGSEGV (-11) or SIGABRT (-6) partway through the simulation loop. Happens across Ubuntu x86_64, Ubuntu arm64, macOS (arm64), Windows, and on a CUDA box when CUDA is hidden.
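
For anyone scripting the reproducer, the negative return codes above can be mapped back to signal names. This is a minimal sketch, not part of Newton or Warp: the helper names `describe_returncode` and `run_and_classify` are hypothetical, and the mapping assumes a POSIX platform, where `subprocess` reports death-by-signal as a negative return code.

```python
import signal
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Turn a subprocess return code into a readable label.

    On POSIX, a negative return code -N means the child was killed by signal N.
    """
    if returncode >= 0:
        return f"exit {returncode}"
    try:
        return signal.Signals(-returncode).name
    except ValueError:
        # Signal number not known to this platform's signal module.
        return f"signal {-returncode}"


def run_and_classify(cmd: list[str]) -> str:
    """Run a command (hypothetical helper) and classify how it ended."""
    proc = subprocess.run(cmd, check=False)
    return describe_returncode(proc.returncode)
```

Passing the `uv run ...` command line from the reproducer as a list to `run_and_classify` would then report `SIGSEGV` or `SIGABRT` instead of a bare `-11` or `-6`.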

Stack traces

Two crash sites have been observed. In both, the interpreter reports Garbage-collecting at the top of the trace, which is characteristic of heap corruption rather than a single logic bug (GC just happens to be where the trap fires):

Site A — GC during array construction

Fatal Python error: Segmentation fault
Current thread 0x... (most recent call first):
  Garbage-collecting
  File ".../warp/_src/types.py", line 2304 in __init__
  File ".../warp/_src/types.py", line 3878 in __ctype__
  File ".../warp/_src/context.py", line 7610 in pack_arg
  File ".../warp/_src/context.py", line 8292 in pack_args
  File ".../warp/_src/context.py", line 8297 in launch
  File ".../mujoco_warp/_src/solver.py", line 3240 in _solver_iteration
  File ".../warp/_src/context.py", line 9743 in capture_while
  File ".../mujoco_warp/_src/solver.py", line 3335 in _solve
  ...

Site B — GC during kernel compilation

Fatal Python error: Segmentation fault
Current thread 0x... (most recent call first):
  Garbage-collecting
  File ".../ast.py", line 52 in parse
  File ".../warp/_src/codegen.py", line 1038 in __init__      # Adjoint.__init__
  File ".../warp/_src/context.py", line 781 in __init__       # Function.__init__
  File ".../warp/_src/context.py", line 1362 in wrapper
  File ".../mujoco_warp/_src/solver.py", line 1809 in update_constraint_efc
  File ".../warp/_src/context.py", line 9743 in capture_while
  ...

In both traces, warp.capture_while on the CPU path is an active caller of the crashing frame.

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, _warp_fastcall, _cbor2
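
One way to probe whether the cyclic collector is only the messenger is to pause it around the suspect region and see if the crash moves or disappears. The sketch below uses only the standard `gc` module. `gc_paused` and `run_with_gc_paused` are hypothetical helper names, and the callable passed in stands for whatever drives the capture_while loop. Note that this only defers cycle collection: objects whose reference count reaches zero are still freed immediately, so a crash that persists does not rule out a lifetime bug.

```python
import contextlib
import gc


@contextlib.contextmanager
def gc_paused():
    """Disable the cyclic garbage collector for the duration of the block.

    The previous enabled/disabled state is restored on exit, even on error.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def run_with_gc_paused(fn, *args, **kwargs):
    """Call fn with cycle collection paused (hypothetical diagnostic helper)."""
    with gc_paused():
        return fn(*args, **kwargs)
```

Running the interpreter with `python -X faulthandler` keeps the "Fatal Python error" tracebacks shown above available while experimenting.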

Bisect

| Nightly | Result |
| --- | --- |
| 1.13.0.dev20260421 | ✅ passes full Newton suite |
| 1.13.0.dev20260422 | ❌ segfault in test_selection.example_selection_cartpole_cpu |

Nothing between those nightlies touches the CPU execution path except the work landed for #1349 ("Add graph capture for APIC serialization and CPU replay"). The commit message explicitly calls out reference retention for CPU capture (base array in _regions, ModuleExec on the graph) and mentions a follow-up to "Retain ModuleExec on CPU APIC graphs to prevent use-after-unload". Our symptom looks like a retention site that was missed for the per-launch array views / packed kernel args that mujoco-warp creates.

Environment

  • OS: reproduced on Ubuntu 22.04 x86_64, Ubuntu 24.04 arm64, Windows, macOS (arm64), and a CUDA runner with CUDA_VISIBLE_DEVICES=""
  • Python: 3.12
  • warp-lang: 1.13.0.dev20260422 (installed from https://pypi.nvidia.com/)
  • mujoco-warp: 3.7.0.1
  • Device: CPU (the test forces --device cpu)

Workaround

Pin to warp-lang==1.13.0.dev20260421 until a fix lands.
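
Until then, a test suite could also guard against the known-bad build at runtime. This is a sketch under stated assumptions: the set `KNOWN_BAD_WARP` and both helper names are hypothetical, only the one nightly confirmed above is listed, and later nightlies may be affected too until a fix is confirmed.

```python
from importlib import metadata

# Only the nightly confirmed bad in this issue; extend if later builds also crash.
KNOWN_BAD_WARP = {"1.13.0.dev20260422"}


def installed_version(dist: str = "warp-lang"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def is_known_bad(version) -> bool:
    """True if the given version string is a build known to crash."""
    return version in KNOWN_BAD_WARP
```

A CPU graph-capture test could then be skipped when `is_known_bad(installed_version())` is true, rather than taking down the whole test process.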

Suspected cause

Heap/refcount bug in the new CPU APIC capture/replay machinery. Something a kernel launch needs to stay alive for the duration of capture_while (likely an array view, apic_array_t, a per-launch param record, or a function-pointer-carrying struct) is being reaped by Python GC before the recorded operation runs, and a later use of the dangling pointer segfaults.
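
The failure mode suspected here can be illustrated without Warp at all. The sketch below is a generic model, not Warp's implementation: `Buffer`, `UnsafeRecorder`, and `RetainingRecorder` are hypothetical stand-ins for an array view and a capture recorder. It shows that recording only a raw address lets the owning object die before replay, while keeping a reference on the recorder keeps it alive. The recorded addresses are never dereferenced, so the example is safe to run.

```python
import ctypes
import gc
import weakref


class Buffer:
    """Stand-in for an array view that owns native memory (hypothetical)."""

    def __init__(self, n: int):
        self.storage = (ctypes.c_float * n)()

    @property
    def address(self) -> int:
        return ctypes.addressof(self.storage)


class UnsafeRecorder:
    """Records only raw addresses, so owners can be freed before replay."""

    def __init__(self):
        self.ops = []

    def record(self, buf: Buffer) -> None:
        self.ops.append(buf.address)


class RetainingRecorder(UnsafeRecorder):
    """Also keeps a reference to each owner for the lifetime of the recorder."""

    def __init__(self):
        super().__init__()
        self._retained = []

    def record(self, buf: Buffer) -> None:
        super().record(buf)
        self._retained.append(buf)


def owner_survives(recorder) -> bool:
    """Record a temporary buffer, drop our reference, and check if it is alive."""
    buf = Buffer(16)
    ref = weakref.ref(buf)
    recorder.record(buf)
    del buf
    gc.collect()
    return ref() is not None
```

With `UnsafeRecorder`, `owner_survives` returns `False`: the recorded address now points at freed memory, which is the use-after-free shape described above. With `RetainingRecorder` it returns `True`.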

Labels: bug