Determinism support 2/N#1300

Draft
mar-yan24 wants to merge 7 commits into
google-deepmind:mainfrom
mar-yan24:mark/determinism2-draft

@mar-yan24

Contributor

Determinism support 2/N: deterministic constraint row allocation

For now I am just putting this in as a draft, opening it up early against main for review. Checks won't pass because 1/N changes are omitted.

Summary

This PR extends the opt-in opt.deterministic flag introduced in #1281 to make constraint row allocation reproducible across repeated runs of the same input. The wp.atomic_add(nefc_out, worldid, N) allocation used inside each constraint kernel is replaced with a deterministic count -> exclusive-scan -> emit pipeline. After 2/N, every constraint row should be at the same position on every run, so d.nefc, d.efc.*, and d.efc.J are bitwise stable.
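To make the difference concrete, here is a minimal plain-Python sketch of the two allocation strategies. It is not the Warp kernel code from this PR; `allocate_atomic`, `allocate_scan`, and `rows_per_thread` are hypothetical names used only for illustration. With atomic-style allocation a thread's first row depends on arrival order, while with count -> exclusive-scan -> emit it depends only on the thread index.

```python
from itertools import accumulate
import random


def allocate_atomic(rows_per_thread, arrival_order):
    """Mimic atomic_add allocation: slots are handed out in arrival order."""
    nefc = 0
    start = [0] * len(rows_per_thread)
    for tid in arrival_order:  # on a GPU this order can change from run to run
        start[tid] = nefc  # the value an atomic add would return
        nefc += rows_per_thread[tid]
    return start, nefc


def allocate_scan(rows_per_thread):
    """Count -> exclusive scan -> emit: slots depend only on thread index."""
    inclusive = list(accumulate(rows_per_thread))
    start = ([0] + inclusive[:-1]) if inclusive else []
    nefc = inclusive[-1] if inclusive else 0
    return start, nefc


# Count pass: number of rows each thread will emit (illustrative values).
rows_per_thread = [1, 0, 4, 1, 2]
scan_start, scan_nefc = allocate_scan(rows_per_thread)

# Simulate 20 runs with different thread arrival orders.
rng = random.Random(0)
atomic_starts = set()
atomic_totals = set()
for _ in range(20):
    order = list(range(len(rows_per_thread)))
    rng.shuffle(order)
    start, nefc = allocate_atomic(rows_per_thread, order)
    atomic_starts.add(tuple(start))
    atomic_totals.add(nefc)
```

Both strategies agree on the total row count; only the scan-based one fixes where each row lands, which is what makes per-row `d.efc.*` comparable across runs.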

Guarantees after 2/N (on top of 1/N)

  • Stable d.contact.* ordering (from 1/N).
  • Stable d.nefc across runs.
  • Stable per-row d.efc.* values.
  • Stable dense and sparse d.efc.J (including J_rownnz, J_rowadr, J_colind).

Not yet guaranteed (deliberately out of scope)

  • Bitwise qacc, qvel, qpos. Still gated on deterministic solver reductions.
  • CUDA-graph capture with opt.deterministic=True. Blocked by the host-side overflow readback.

Changes

  • mujoco_warp/_src/constraint.py: replaces atomic slot allocation with the deterministic count -> scan -> emit pipeline across all constraint families (equality, friction, limit, contact pyramidal, contact elliptic).
  • mujoco_warp/_src/constraint.py: persisted deterministic scratch buffers. All per-family counts, nnz_counts, offsets, nnz_offsets, nefc_base, and nnz_base, plus contact world_start / world_end, are allocated once per (m, d) and reused across steps.
  • mujoco_warp/_src/constraint.py: skip zero-size families. A Python-side early skip for families whose size == 0 avoids ~10–20 no-op kernel launches per step on models like humanoid.
  • mujoco_warp/_src/types.py: opt.deterministic docstring updated to reflect that the flag now also covers constraint row allocation.
  • mujoco_warp/_src/determinism_test.py: expanded determinism regression coverage for nefc, per-row efc.*, dense and sparse efc.J, canonicalized det-on vs det-off row-multiset equivalence, and benchmark-path smoke coverage for both solver paths.
  • mujoco_warp/_src/benchmark.py, mujoco_warp/testspeed.py: expose --use_cuda_graph at the CLI and pipe it through benchmark() so opt.deterministic=True can be benchmarked on the non-captured path while the host-side overflow readback remains (can remove this if unwanted, mainly for helping me test).
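The "row-multiset equivalence" check in the test bullet above can be pictured as follows. This is a hedged sketch, not the code in determinism_test.py: `canonical_rows` is a hypothetical helper, and the real test may canonicalize differently (for example by sorting). The idea is that a det-off run may emit the same constraint rows in a different order, so rows are compared as a multiset rather than position by position.

```python
from collections import Counter


def canonical_rows(efc_type, efc_id, efc_J):
    """Bundle per-row fields into hashable tuples and count them as a multiset."""
    return Counter(
        (t, i, tuple(j_row)) for t, i, j_row in zip(efc_type, efc_id, efc_J)
    )


# Illustrative det-on output: three rows in deterministic order.
on_type = [0, 3, 3]
on_id = [0, 5, 7]
on_J = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

# Illustrative det-off output: the same rows, emitted in a different order.
perm = [2, 0, 1]
off_type = [on_type[k] for k in perm]
off_id = [on_id[k] for k in perm]
off_J = [on_J[k] for k in perm]

same_order = (on_type, on_id, on_J) == (off_type, off_id, off_J)
same_multiset = canonical_rows(on_type, on_id, on_J) == canonical_rows(
    off_type, off_id, off_J
)
```

Using a `Counter` rather than a `set` keeps duplicate rows distinct, so a dropped or doubled row still fails the comparison.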

Benchmarks

I used the same benchmarking approach as in the last PR, just extended (thanks to Claude).

Environment:

  • GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8 GiB, sm_89)
  • Warp: 1.13.0.dev20260227, CUDA Toolkit 12.9, Driver 12.5
  • Methodology: 3 trials × 500 measured steps, 50 warmup steps, explicit sync around the timing window, us/step = 1e6 * run_duration / (nworld * nstep).
  • use_cuda_graph=False on both off and on runs (required for deterministic mode today, because the host-side overflow readback is not capture-safe).
  • Capacity margin applied to both off and on: njmax = baseline + 32, njmax_nnz = baseline + 32 * nv (deterministic overflow validation would otherwise trip after warmup).
  • collision.xml nworld=512 used nccdmax=8 to fit in 8 GiB, applied to both runs.
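The arithmetic behind the methodology can be written out as a small sketch. The function names are hypothetical; the formulas are the ones stated in the list above, and the overhead is assumed to be the relative increase of the on time over the off time. The `nv=27` and baseline capacities in the usage lines are made-up inputs for illustration only.

```python
def us_per_step(run_duration_s, nworld, nstep):
    """Microseconds per simulated step per world."""
    return 1e6 * run_duration_s / (nworld * nstep)


def overhead_pct(off_us, on_us):
    """Relative slowdown of deterministic mode (on) versus baseline (off)."""
    return 100.0 * (on_us / off_us - 1.0)


def padded_capacity(njmax_baseline, njmax_nnz_baseline, nv, margin_rows=32):
    """Capacity margin applied to both runs: +32 rows and +32 * nv nonzeros."""
    return njmax_baseline + margin_rows, njmax_nnz_baseline + margin_rows * nv


# A 2.0 s timing window over 64 worlds and 500 steps.
example_us = us_per_step(2.0, nworld=64, nstep=500)

# Off and on timings reported for humanoid.xml, nworld=1, Newton + Dense.
example_overhead = round(overhead_pct(4318.30, 6022.23), 1)

# Hypothetical baseline capacities and nv, for illustration only.
example_capacity = padded_capacity(64, 1024, nv=27)
```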

Newton + Dense

| model | nworld | mean ncon | mean nefc | off (us/step) | on (us/step) | overhead |
|---|---|---|---|---|---|---|
| humanoid/humanoid.xml | 1 | 9.53 | 30.63 | 4318.30 | 6022.23 | +39.5% |
| humanoid/humanoid.xml | 64 | 11.16 | 40.05 | 86.05 | 110.70 | +28.7% |
| humanoid/humanoid.xml | 512 | 11.22 | 44.95 | 12.23 | 15.17 | +24.1% |
| collision.xml | 1 | 10.82 | 23.47 | 4836.62 | 6363.59 | +31.6% |
| collision.xml | 64 | 10.81 | 23.57 | 77.49 | 100.23 | +29.3% |
| collision.xml | 512 | 10.82 | 24.19 | 11.87 | 15.02 | +26.5% |

CG + Sparse

| model | nworld | mean ncon | mean nefc | off (us/step) | on (us/step) | overhead |
|---|---|---|---|---|---|---|
| humanoid/humanoid.xml | 1 | 9.16 | 27.82 | 8374.90 | 10129.02 | +20.9% |
| humanoid/humanoid.xml | 64 | 11.18 | 38.88 | 128.68 | 155.16 | +20.6% |
| humanoid/humanoid.xml | 512 | 11.18 | 43.58 | 16.21 | 21.31 | +31.4% |
| collision.xml | 1 | 10.74 | 23.31 | 5421.41 | 7149.21 | +31.9% |
| collision.xml | 64 | 10.74 | 23.31 | 93.56 | 111.70 | +19.4% |
| collision.xml | 512 | 10.74 | 23.31 | 12.15 | 14.05 | +15.6% |

Overhead range across the matrix: +15.6% .. +39.5%.

The benchmarks are decent: not amazing, not horrible. There are several performance enhancements I have in mind (aside from CUDA graph support, of course) that might bring the overhead down quite a bit. I have already implemented two, which helped noticeably.

Luckily, I was able to benchmark this a couple of days ago. Since then I have installed the latest Windows updates, and I am 99.9% sure they caused a CUDA latency regression: I can no longer reproduce the earlier numbers for either this PR or the last one. The latency is bad enough that I cannot test on this Windows machine, so I may struggle to get better benchmarks until I find a workaround for this security update.

Performance enhancements in this PR

Two small perf enhancements are included so that 2/N is closer to the 1/N cost. Neither changes the determinism semantics; both are measured against the same matrix.

Persisted deterministic scratch buffers. Allocates every scratch buffer once per (m, d) and reuses it across steps instead of wp.empty(...) on every constraint family. Measured on this branch (humanoid.xml and collision.xml, 3 trials × 500 steps):

| model | nworld | overhead before | overhead after | change (pp) | relative change |
|---|---|---|---|---|---|
| humanoid/humanoid.xml | 1 | +36.6% | +20.3% | −16.3 | −45% |
| humanoid/humanoid.xml | 64 | +40.5% | +21.5% | −19.0 | −47% |
| humanoid/humanoid.xml | 512 | +42.5% | +36.0% | −6.5 | −15% |
| collision.xml | 64 | +29.1% | +12.4% | −16.7 | −57% |
| collision.xml | 512 | +23.1% | +13.6% | −9.5 | −41% |

Saves ~370–830 µs/step of device-side work.
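The allocate-once pattern can be sketched in plain Python as below. This is an assumption-laden illustration, not the code in constraint.py: `ScratchCache`, its `(id(m), id(d))` key, and the list-based buffers are hypothetical stand-ins for Warp arrays, and the family names and sizes are invented.

```python
class ScratchCache:
    """Allocate scratch buffers once per (model, data) pair and reuse them."""

    def __init__(self):
        self._buffers = {}
        self.allocations = 0  # counts how often buffers were actually created

    def get(self, m, d, nworld, family_sizes):
        # Keyed by object identity; assumes m and d outlive the cache entry.
        key = (id(m), id(d))
        if key not in self._buffers:
            self.allocations += 1
            self._buffers[key] = {
                name: {
                    "counts": [0] * (nworld * size),
                    "offsets": [0] * (nworld * size),
                }
                for name, size in family_sizes.items()
            }
        return self._buffers[key]


cache = ScratchCache()
m, d = object(), object()  # stand-ins for the model and data objects
sizes = {"equality": 0, "limit": 21, "contact": 64}  # illustrative sizes

step1 = cache.get(m, d, nworld=2, family_sizes=sizes)
step2 = cache.get(m, d, nworld=2, family_sizes=sizes)
```

Because the buffers persist, the count pass has to overwrite every entry each step; nothing can rely on them starting at zero.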

Skip zero-size families. Python-side early-skip for constraint families whose family-size is 0 (e.g. unused equality / friction / limit). Avoids ~10–20 no-op kernel launches per step on humanoid-like models. Small but consistent saving, largest impact at small nworld where launch overhead dominates.
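A minimal sketch of the early skip, assuming each family's size is known on the host before launch. `launch_families` and the family names are hypothetical; the `launch` callback stands in for the count, scan, and emit kernel launches of one family.

```python
def launch_families(family_sizes, launch):
    """Call launch(name, size) only for families with at least one candidate."""
    launched = []
    for name, size in family_sizes.items():
        if size == 0:
            continue  # skip the no-op kernel launches for this family
        launch(name, size)
        launched.append(name)
    return launched


calls = []
family_sizes = {
    "equality_connect": 0,  # unused in this illustrative model
    "friction_dof": 0,
    "limit_joint": 21,
    "contact": 64,
}
launched = launch_families(family_sizes, lambda name, size: calls.append((name, size)))
```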

mar-yan24 and others added 7 commits April 17, 2026 22:27
(cherry picked from commit 689867c)
(cherry picked from commit e5deba2)
(cherry picked from commit 7664301)
(cherry picked from commit 28270ce)
(cherry picked from commit 96c6e8a)
@johnnynunez


Status note for anyone picking this up (I'm continuing the determinism/differentiability work across the stack — see NVIDIA/warp#1355 and #1281).

I attempted a mechanical rebase of this branch onto current main: it's not feasible. This PR rewrites constraint.py (+3344/−2184) and upstream has since landed a competing rewrite of the same kernels (a23500c, "Optimization: efc_contact kernels", +574/−465) plus 4 more constraint.py changes (Jdotv correction 9e024df, flex eq_active 80bbba7, cache_kernel f0e2d81/6f235d4). The result is 22 conflict regions where both sides restructured the same functions — a textual merge would produce silently wrong physics.

The right path is a semantic re-implementation of the count→scan→emit row-allocation pipeline on top of today's constraint.py, preserving a23500c's optimizations. @mar-yan24 if you have notes on which parts of the rewrite were essential vs incidental, that would cut the effort considerably.

Meanwhile #1281 rebases cleanly (see my comment there) and delivers bitwise contact ordering + single-step determinism on its own, with ~9-12% overhead vs baseline in my benchmarks.

@johnnynunez


Update: the semantic port is done — mar-yan24#6 (branch johnnynunez:det/mjwarp-determinism2-port, stacked on the #1281 rebase).

The count→scan→emit pipeline and persisted scratch buffers are preserved as designed; adaptations for current main: contacts use a single _efc_contact_count mirroring the unified post-a23500c _efc_contact_init, eq_active honored in flex equality (80bbba7), all factories under @cache_kernel (f0e2d81), and — new — CUDA graph capture now works with opt.deterministic=True by skipping the host-side overflow readback while a capture is active.

Measured on RTX PRO 6000 Blackwell: full suite 1101 passed / 23 skipped; nefc + per-row efc.* + sparse efc.J structure/values bitwise stable across runs with graph capture on; overhead ~9% (1 world) to ~24% (256 worlds) on a 20-body contact pile, in line with the original branch's numbers.
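The capture-safe overflow check described above can be pictured with the following sketch. It is not the ported code: `validate_capacity` and `read_nefc_max` are hypothetical, and how the capture state is detected is reduced here to a plain boolean argument.

```python
def validate_capacity(read_nefc_max, njmax, capturing):
    """Check row capacity on the host, unless a CUDA graph capture is active.

    read_nefc_max: callable performing the device-to-host readback.
    Returns the value read, or None when the check was skipped.
    """
    if capturing:
        # A host readback forces a synchronization, which is not capture-safe.
        return None
    nefc_max = read_nefc_max()
    if nefc_max > njmax:
        raise RuntimeError(f"nefc overflow: {nefc_max} rows > njmax={njmax}")
    return nefc_max


readbacks = []


def fake_readback():
    """Stand-in for copying the per-world maximum nefc back to the host."""
    readbacks.append(1)
    return 40


skipped = validate_capacity(fake_readback, njmax=64, capturing=True)
checked = validate_capacity(fake_readback, njmax=64, capturing=False)
```

The trade-off is that an overflow during a captured run is no longer reported by this check, so capacity has to be sized conservatively beforehand.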

@johnnynunez


Heads-up: the root cause of the residual post-2/N drift is now isolated (full falsification methodology in #1281 (comment)), which defines the 3/N scope precisely:

  1. _update_constraint_init_qfrc_constraint_sparse — racing wp.atomic_add into qfrc_constraint[world, dof] across efc rows
  2. _JTDAJ_sparse / _update_gradient_h_incremental_sparse — racing wp.atomic_add into ctx.h[world, row, col]

Fixing both (order-deterministic gathers) makes 500-step rollouts bitwise on a 20-body contact pile, and an end-to-end Newton (newton-physics) SolverMuJoCo rollout bitwise over 200 steps. Dense-jacobian mode needs nothing — it's already gather-based and fully deterministic.
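Why racing atomic adds break bitwise reproducibility, and why an order-deterministic gather fixes it, can be shown with a small sketch. This is plain Python using float64, not the Warp kernels; `scatter_sum` and `gather_sum` are hypothetical names. Floating-point addition is not associative, so the accumulation order changes the result.

```python
def scatter_sum(contribs, order, ndof):
    """Atomic-add style: contributions accumulate in arrival order."""
    out = [0.0] * ndof
    for k in order:
        dof, value = contribs[k]
        out[dof] += value
    return out


def gather_sum(contribs, ndof):
    """Gather style: each dof sums its contributions in fixed row order."""
    out = [0.0] * ndof
    for dof in range(ndof):
        acc = 0.0
        for row_dof, value in contribs:  # always row 0, 1, 2, ...
            if row_dof == dof:
                acc += value
        out[dof] = acc
    return out


# Three efc rows all contributing to dof 0 (illustrative values).
contribs = [(0, 1e16), (0, 1.0), (0, -1e16)]

run_a = scatter_sum(contribs, [0, 1, 2], ndof=1)  # (1e16 + 1.0) - 1e16
run_b = scatter_sum(contribs, [0, 2, 1], ndof=1)  # (1e16 - 1e16) + 1.0
fixed = gather_sum(contribs, ndof=1)
```

Here `run_a` is `[0.0]` and `run_b` is `[1.0]`, while the gather always returns the same value because its summation order never depends on scheduling.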

I'm starting the production implementation of 3/N now (deterministic sparse qfrc_constraint + sparse H assembly, gated on opt.deterministic, same architecture as this PR), stacked on the 2/N port. Will link the branch here when it's green.

@johnnynunez


3/N is implemented and green: mar-yan24#8 (branch johnnynunez:det/mjwarp-determinism3, stacked on the 2/N port).

With it, opt.deterministic=True now delivers bitwise-identical qacc/qpos/qvel over 1000-step contact-rich rollouts — same-process, fresh-process, and CUDA-graph-captured — with no warp-level deterministic mode required. Full suite 1108 passed; new SolverDeterminismTest regression coverage included.

Scope ended up being exactly two sparse kernels (deterministic gathers for qfrc_constraint and JTDAJ, plus disabling the incremental-H scatter in det mode); dense was already deterministic. Current version is correctness-first (naive gathers, ~2.4x at 1 world / ~10x at 1024 worlds on the solver-heavy pile benchmark — still well under warp's automatic mode at low-to-mid world counts); a segmented-reduce optimization pass using 2/N's count/scan machinery is the natural follow-up.

johnnynunez added a commit to johnnynunez/mujoco_warp that referenced this pull request Jun 11, 2026
…signment

Semantic re-implementation of mar-yan24's count->scan->emit pipeline
(google-deepmind#1300) on top of the post-a23500c constraint
kernels, since the original branch predates the efc_contact rewrite and
95 other upstream commits and cannot be rebased mechanically.

When opt.deterministic=True, the racy wp.atomic_add(nefc/efc_nnz) slot
allocation in every constraint family is replaced with: a count kernel
that writes per-thread row/nnz counts, a per-world exclusive scan that
converts counts to offsets and bumps the totals, and emit kernels that
read base+offset instead of atomic results. Constraint rows therefore
land at identical positions on every run of the same input.

Differences from the original google-deepmind#1300 branch:
- Contact families use a single _efc_contact_count kernel mirroring the
  unified _efc_contact_init (upstream a23500c replaced the per-cone
  _contact_pyramidal/_contact_elliptic kernels, so this port no longer
  touches them).
- _equality_flex_count honors eq_active (upstream 80bbba7).
- All new kernel factories are wrapped in @cache_kernel (upstream
  f0e2d81) so repeated step() calls do not recreate kernels.
- Host-side overflow validation is skipped while a CUDA graph capture is
  active, making opt.deterministic capture-safe (the original branch
  documented capture as unsupported).
- Ports the expanded determinism regression tests, minus the benchmark
  coverage that targeted the pre-google-deepmind#1301 benchmark API.

Verified: full suite 1101 passed / 23 skipped; constraint rows (nefc,
efc.type/id, sparse J structure and values) bitwise stable across
same-process trials with CUDA graph capture; ~9% overhead at 1 world,
~24% at 256 worlds on RTX PRO 6000 Blackwell.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>

Co-authored-by: Mark Yang <markyang2005@gmail.com>