Support deterministic kernels inside conditional body graphs #3

Open
johnnynunez wants to merge 2 commits into mmacklin:warp-deterministic from johnnynunez:det/warp-1355-fixes

@johnnynunez

Description

Companion fixes for NVIDIA#1355, unblocking mujoco_warp (which launches solver kernels inside wp.capture_while).

CUDA forbids memory-allocation nodes inside conditional body graphs, but launch_deterministic recorded stream-ordered allocations for scatter/counter buffers and CUB workspaces on the capturing stream. Capturing a deterministic kernel inside wp.capture_while / wp.capture_if therefore failed with "Conditional body graph contains an unsupported operation (memory allocation)". This implements the reusable-workspace direction anticipated in design/deterministic-execution.md:

  • During capture, deterministic buffer allocations are redirected to a dedicated non-capturing stream under a temporarily relaxed thread capture mode (cudaThreadExchangeStreamCaptureMode), and the allocation stream is synchronized before captured work consumes the memory. The captured graph contains zero allocation nodes; buffer lifetime stays tied to the graph via _deterministic_buffer_refs.
  • Allocation-time initialization no longer replays with the graph, so explicit fill_/zero_ resets are now recorded on the capturing stream. Each replay therefore starts from clean buffer state.
  • Adds the wp_cuda_thread_exchange_capture_mode native API and two capture_while regression tests (scatter and consumed-return counter paths) modeled on the mujoco_warp solver pattern.
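
To illustrate why the explicit resets matter once allocation moves out of the graph, here is a minimal pure-Python sketch. It is a toy model only: `ToyGraph`, `build`, and the node kinds are hypothetical names invented for this illustration and are not Warp or CUDA APIs. The "graph" is just a list of recorded callables, and the scratch counter is created outside it, as if on the side stream.

```python
class ToyGraph:
    """Toy stand-in for a captured graph: an ordered list of recorded callables."""

    def __init__(self, conditional_body=False):
        self.conditional_body = conditional_body
        self.nodes = []

    def record(self, kind, fn):
        # Mimic the restriction described above: a conditional body
        # may not contain allocation nodes.
        if self.conditional_body and kind == "alloc":
            raise RuntimeError("conditional body graph contains a memory allocation")
        self.nodes.append((kind, fn))

    def replay(self):
        for _, fn in self.nodes:
            fn()


def build(record_reset):
    # The scratch counter is allocated and zeroed OUTSIDE the graph,
    # so this initialization happens once and is never replayed.
    counter = [0]
    results = []
    graph = ToyGraph(conditional_body=True)
    if record_reset:
        # Explicit reset recorded in the graph: runs on every replay.
        graph.record("memset", lambda: counter.__setitem__(0, 0))
    graph.record("kernel", lambda: counter.__setitem__(0, counter[0] + 3))
    graph.record("kernel", lambda: results.append(counter[0]))
    return graph, results


stale_graph, stale = build(record_reset=False)
clean_graph, clean = build(record_reset=True)
for _ in range(3):
    stale_graph.replay()
    clean_graph.replay()

# Recording an allocation inside the conditional body is rejected.
try:
    ToyGraph(conditional_body=True).record("alloc", lambda: None)
    alloc_rejected = False
except RuntimeError:
    alloc_rejected = True

print(stale)  # [3, 6, 9]: state leaks from one replay into the next
print(clean)  # [3, 3, 3]: every replay starts from a zeroed counter
print(alloc_rejected)  # True
```

Without the recorded reset, each replay inherits the counter left behind by the previous one, which is the failure mode the fill_/zero_ resets are meant to prevent.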

Also includes a one-line build fix: fastcall.cpp must include Python.h before warp.h. Otherwise _XOPEN_SOURCE is redefined and the -Werror build fails on newer glibc, where _XOPEN_SOURCE defaults to 800.

Testing

On RTX PRO 6000 Blackwell (sm_120, CUDA 13.3):

  • Full determinism suite: 87 tests pass (including the 2 new capture_while tests)
  • 647 tests across CodeGen/Launch/Graph/Stream/Array suites pass
  • mujoco_warp step() under wp.ScopedCapture + capture_while with DeterministicMode.RUN_TO_RUN: previously crashed, now runs with bit-identical graph replays

uv run warp/tests/deterministic/test_deterministic_graph_capture.py
uvx pre-commit run --files warp/_src/deterministic.py warp/_src/context.py warp/native/warp.cu warp/native/warp.h warp/native/warp.cpp warp/tests/deterministic/test_deterministic_graph_capture.py
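
For readers who want to verify "bit-identical" replays themselves, one possible check is sketched below. It is an assumption about how such a comparison could be written, not a description of how the tests in this PR compare results. `bit_identical` is a hypothetical helper that uses only the standard library and compares raw float64 bit patterns rather than numeric equality.

```python
import struct


def bit_identical(a, b):
    """Return True only if both float sequences have identical float64 bit patterns."""
    if len(a) != len(b):
        return False
    # Pack as native doubles and compare the raw bytes.
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


# Floating-point addition is not associative, so accumulation order changes bits.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)                  # False
print(bit_identical([left], [left]))  # True
print(bit_identical([left], [right])) # False

# Numeric equality can hide bit-level differences: 0.0 == -0.0 is True.
print(0.0 == -0.0)                    # True
print(bit_identical([0.0], [-0.0]))   # False
```

A bitwise comparison is stricter than a tolerance-based one, which is the appropriate bar for a run-to-run determinism mode.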

Python.h defines feature-test macros such as _XOPEN_SOURCE and must be
included before standard headers. warp.h transitively includes glibc
<features.h> (crt.h -> assert.h), so including it before Python.h
triggers a macro redefinition warning that fails the -Werror build on
newer glibc where _XOPEN_SOURCE defaults to 800.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>

CUDA forbids memory-allocation nodes inside conditional body graphs
(wp.capture_while / wp.capture_if), but deterministic launches allocated
scatter/counter buffers and CUB workspaces with stream-ordered
allocations on the capturing stream. Capturing a deterministic kernel
inside a conditional body therefore failed with 'Conditional body graph
contains an unsupported operation (memory allocation)'. MuJoCo Warp hits
this on every solver step, which iterates via wp.capture_while.

Redirect deterministic buffer allocations to a dedicated non-capturing
stream while capture is active, under a temporarily relaxed thread
capture mode (cudaThreadExchangeStreamCaptureMode), and synchronize the
allocation stream before captured work consumes the memory. The captured
graph then contains no allocation nodes; buffer lifetime is still tied
to the graph via _deterministic_buffer_refs. Because allocation-time
initialization no longer replays with the graph, record explicit
fill_/zero_ resets on the capturing stream so replays start from clean
buffer state.

Adds wp_cuda_thread_exchange_capture_mode native API and capture_while
regression tests covering the scatter and consumed-return counter
paths.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>