Determinism support 2/N by mar-yan24 · Pull Request #1300 · google-deepmind/mujoco_warp

mar-yan24 · 2026-04-19T04:24:57Z

Determinism support 2/N: deterministic constraint row allocation

For now I am just putting this in as a draft, opening it up early against main for review. Checks won't pass because 1/N changes are omitted.

Summary

This PR extends the opt-in opt.deterministic flag introduced in #1281 to make constraint row allocation reproducible across repeated runs of the same input. The wp.atomic_add(nefc_out, worldid, N) allocation used inside each constraint kernel is replaced with a deterministic count -> exclusive-scan -> emit pipeline. After 2/N, every constraint row should be at the same position on every run, so d.nefc, d.efc.*, and d.efc.J are bitwise stable.
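For readers new to the pattern, here is a minimal CPU sketch of count -> exclusive-scan -> emit for a single world. It is plain Python, not the Warp kernels in this PR, and the names `allocate_rows`, `emit`, `counts`, and `payloads` are made up for illustration. The point is that each thread's start offset depends only on the counts of the threads before it, never on which thread reaches an atomic counter first.

```python
from itertools import accumulate


def allocate_rows(counts):
    """Exclusive scan: offsets[i] = sum(counts[:i]). Also returns the total."""
    inclusive = [0] + list(accumulate(counts))
    return inclusive[:-1], inclusive[-1]


def emit(counts, payloads):
    """Emit phase: every 'thread' writes its rows starting at its own offset."""
    offsets, total = allocate_rows(counts)
    rows = [None] * total
    for tid, (count, start) in enumerate(zip(counts, offsets)):
        for k in range(count):
            rows[start + k] = (tid, payloads[tid][k])
    return rows, total


# Hypothetical output of the count phase: rows requested by each thread.
counts = [2, 0, 3, 1]
payloads = [["a0", "a1"], [], ["c0", "c1", "c2"], ["d0"]]
rows, nefc = emit(counts, payloads)
print(nefc, rows)
```

Because the offsets are a pure function of `counts`, the emit loop can run in any order (or in parallel) and still produce the same row layout.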

Guarantees after 2/N (on top of 1/N)

Stable d.contact.* ordering (from 1/N).
Stable d.nefc across runs.
Stable per-row d.efc.* values.
Stable dense and sparse d.efc.J (including J_rownnz, J_rowadr, J_colind).

Not yet guaranteed (deliberately out of scope)

Bitwise qacc, qvel, qpos. Still gated on deterministic solver reductions.
CUDA-graph capture with opt.deterministic=True. Blocked by the host-side overflow readback, which is not capture-safe.

Changes

mujoco_warp/_src/constraint.py: replaces atomic slot allocation with the deterministic count -> scan -> emit pipeline across all constraint families (equality, friction, limit, contact pyramidal, contact elliptic).
mujoco_warp/_src/constraint.py: persisted deterministic scratch buffers. All per-family counts, nnz_counts, offsets, nnz_offsets, nefc_base, and nnz_base buffers, plus the contact world_start / world_end buffers, are allocated once per (m, d) and reused across steps.
mujoco_warp/_src/constraint.py: skip zero-size families. A Python-side early skip for families whose size == 0 avoids ~10–20 no-op kernel launches per step on models like humanoid.
mujoco_warp/_src/types.py: opt.deterministic docstring updated to reflect that the flag now also covers constraint row allocation.
mujoco_warp/_src/determinism_test.py: expanded determinism regression coverage for nefc, per-row efc.*, dense and sparse efc.J, canonicalized det-on vs det-off row-multiset equivalence, and benchmark-path smoke coverage for both solver paths.
mujoco_warp/_src/benchmark.py, mujoco_warp/testspeed.py: expose --use_cuda_graph at the CLI and pipe it through benchmark() so opt.deterministic=True can be benchmarked on the non-captured path while the host-side overflow readback remains (can remove this if unwanted, mainly for helping me test).
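As a rough illustration of the "canonicalized row-multiset equivalence" check mentioned for determinism_test.py: with the flag off, rows can land in any order, so a det-on vs det-off comparison has to ignore row position. The sketch below is an assumption about the general idea, not the test's actual code; `canonical_rows`, `same_row_multiset`, and the row layout are hypothetical.

```python
def canonical_rows(rows):
    """Sort constraint rows so two runs can be compared as multisets."""
    return sorted(tuple(row) for row in rows)


def same_row_multiset(rows_a, rows_b):
    """True when both runs produced the same rows, regardless of row order."""
    return canonical_rows(rows_a) == canonical_rows(rows_b)


# Hypothetical rows: (efc type, efc id, Jacobian entries...).
det_on = [(0, 1, 0.5, -0.5), (3, 0, 1.0, 0.0), (3, 1, 0.0, 1.0)]
det_off = [(3, 1, 0.0, 1.0), (0, 1, 0.5, -0.5), (3, 0, 1.0, 0.0)]

print(det_on == det_off)                   # positional comparison
print(same_row_multiset(det_on, det_off))  # order-insensitive comparison
```

Sorting keeps duplicates, so two runs that differ only in how many copies of a row they emit still compare unequal.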

Benchmarks

I used the same benchmarking setup as in the last PR, just extended (thanks to Claude).

Environment:

GPU: NVIDIA GeForce RTX 4060 Laptop GPU (8 GiB, sm_89)
Warp: 1.13.0.dev20260227, CUDA Toolkit 12.9, Driver 12.5
Methodology: 3 trials × 500 measured steps, 50 warmup steps, explicit sync around the timing window, us/step = 1e6 * run_duration / (nworld * nstep).
use_cuda_graph=False on both off and on runs (required for deterministic mode today, host-side overflow readback is not capture-safe).
Capacity margin applied to both off and on: njmax = baseline + 32, njmax_nnz = baseline + 32 * nv (deterministic overflow validation would otherwise trip after warmup).
collision.xml nworld=512 used nccdmax=8 to fit in 8 GiB, applied to both runs.
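The two formulas in the methodology can be written out as below. This is only a sketch: the function names and the sample numbers are hypothetical, and `njmax_baseline` / `njmax_nnz_baseline` stand for whatever capacities the unpadded run used.

```python
def us_per_step(run_duration_s, nworld, nstep):
    """Microseconds per world-step, as defined in the methodology."""
    return 1e6 * run_duration_s / (nworld * nstep)


def padded_capacity(njmax_baseline, njmax_nnz_baseline, nv, margin_rows=32):
    """Add the 32-row margin to njmax and the matching 32 * nv margin to njmax_nnz."""
    njmax = njmax_baseline + margin_rows
    njmax_nnz = njmax_nnz_baseline + margin_rows * nv
    return njmax, njmax_nnz


# Hypothetical run: 500 measured steps over 512 worlds taking 3.2 s of wall time.
print(us_per_step(3.2, nworld=512, nstep=500))

# Hypothetical baselines for a model with nv = 27 degrees of freedom.
print(padded_capacity(njmax_baseline=64, njmax_nnz_baseline=1728, nv=27))
```

Dividing by `nworld * nstep` is why the per-step numbers shrink so sharply as nworld grows: the figure is per world, not per batched step.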

Newton + Dense

model	nworld	mean ncon	mean nefc	off (us/step)	on (us/step)	overhead
`humanoid/humanoid.xml`	1	9.53	30.63	4318.30	6022.23	+39.5%
`humanoid/humanoid.xml`	64	11.16	40.05	86.05	110.70	+28.7%
`humanoid/humanoid.xml`	512	11.22	44.95	12.23	15.17	+24.1%
`collision.xml`	1	10.82	23.47	4836.62	6363.59	+31.6%
`collision.xml`	64	10.81	23.57	77.49	100.23	+29.3%
`collision.xml`	512	10.82	24.19	11.87	15.02	+26.5%

CG + Sparse

model	nworld	mean ncon	mean nefc	off (us/step)	on (us/step)	overhead
`humanoid/humanoid.xml`	1	9.16	27.82	8374.90	10129.02	+20.9%
`humanoid/humanoid.xml`	64	11.18	38.88	128.68	155.16	+20.6%
`humanoid/humanoid.xml`	512	11.18	43.58	16.21	21.31	+31.4%
`collision.xml`	1	10.74	23.31	5421.41	7149.21	+31.9%
`collision.xml`	64	10.74	23.31	93.56	111.70	+19.4%
`collision.xml`	512	10.74	23.31	12.15	14.05	+15.6%

Overhead range across the matrix: +15.6% .. +39.5%.
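For anyone re-deriving the overhead column: it is the relative slowdown of the on run versus the off run. The snippet below recomputes the two extremes from the table values; `overhead_pct` is a name chosen here. Since the table timings are already rounded, a recomputed value can differ from the listed one in the last digit for some rows.

```python
def overhead_pct(off_us, on_us):
    """Relative slowdown of the deterministic run, in percent."""
    return 100.0 * (on_us / off_us - 1.0)


# (off, on) timings in us/step, copied from the tables above.
extremes = {
    "Newton + Dense, humanoid, nworld=1": (4318.30, 6022.23),
    "CG + Sparse, collision, nworld=512": (12.15, 14.05),
}

for label, (off_us, on_us) in extremes.items():
    print(f"{label}: {overhead_pct(off_us, on_us):+.1f}%")
```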

The benchmarks are decent: not amazing, not horrible. There are several performance enhancements I have in mind (aside from CUDA graph support, of course) that might bring the overhead down quite a bit. I have already implemented two, which helped noticeably.

Luckily, I was able to test this a couple of days ago. Since then I installed the latest Windows updates, and I am 99.9% sure they caused a CUDA latency regression: I can no longer reproduce the benchmark numbers for this PR or for the last one. The latency is so bad that I literally cannot test on this Windows machine, so I may struggle to get better benchmarks until I find a workaround for this security update.

Performance enhancements in this PR

Two small perf enhancements are included so that the 2/N cost stays closer to the 1/N cost. Neither changes the determinism semantics; both are measured against the same matrix.

Persisted deterministic scratch buffers. Every scratch buffer is allocated once per (m, d) and reused across steps, instead of calling wp.empty(...) for every constraint family. Measured on this branch (humanoid.xml and collision.xml, 3 trials × 500 steps):

model	nworld	overhead before	overhead after	pp reduction	% reduction
`humanoid/humanoid.xml`	1	+36.6%	+20.3%	−16.3	−45%
`humanoid/humanoid.xml`	64	+40.5%	+21.5%	−19.0	−47%
`humanoid/humanoid.xml`	512	+42.5%	+36.0%	−6.5	−15%
`collision.xml`	64	+29.1%	+12.4%	−16.7	−57%
`collision.xml`	512	+23.1%	+13.6%	−9.5	−41%

Saves ~370–830 µs/step of device-side work.
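The allocate-once idea can be sketched in plain Python as follows. This is not the PR's implementation: the real buffers are Warp arrays, whereas here a list stands in for a device buffer, and `ScratchCache`, `owner`, and the buffer names are hypothetical.

```python
class ScratchCache:
    """Caches scratch buffers so each one is allocated once and then reused."""

    def __init__(self, alloc):
        self._alloc = alloc   # allocation callback (stand-in for a device allocator)
        self._buffers = {}    # (owner key, buffer name) -> buffer
        self.alloc_calls = 0

    def get(self, owner, name, size):
        slot = (owner, name)
        buf = self._buffers.get(slot)
        if buf is None or len(buf) < size:
            # First request, or the cached buffer is too small: allocate.
            buf = self._alloc(size)
            self.alloc_calls += 1
            self._buffers[slot] = buf
        return buf


cache = ScratchCache(alloc=lambda n: [0] * n)
owner = ("model-0", "data-0")  # hypothetical stand-in for the (m, d) pair

for _step in range(500):
    for family in ("equality", "friction", "limit", "contact"):
        counts = cache.get(owner, family + "_counts", 128)
        offsets = cache.get(owner, family + "_offsets", 128)

print(cache.alloc_calls)  # 8 allocations instead of 4000
```

One thing to keep in mind with this pattern: a reused buffer still holds the previous step's values, so whichever kernel fills it has to overwrite every entry it later reads.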

Skip zero-size families. Python-side early-skip for constraint families whose family-size is 0 (e.g. unused equality / friction / limit). Avoids ~10–20 no-op kernel launches per step on humanoid-like models. Small but consistent saving, largest impact at small nworld where launch overhead dominates.
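The early skip amounts to a host-side size check before each launch, roughly like this. The helper `launch_families` and the family sizes are hypothetical; the `launch` callback stands in for whatever enqueues the count/scan/emit kernels for one family.

```python
def launch_families(family_sizes, launch):
    """Launch kernels only for families that actually have work to do."""
    launched = []
    for name, size in family_sizes.items():
        if size == 0:
            continue  # nothing to count, scan, or emit for this family
        launch(name, size)
        launched.append(name)
    return launched


# Hypothetical sizes for a model with no equality or friction constraints.
family_sizes = {"equality": 0, "friction": 0, "limit": 21, "contact": 64}
calls = []
launched = launch_families(family_sizes, lambda name, size: calls.append((name, size)))
print(launched)
```

The check only works when the family size is known on the host (for example from the model), which is why it costs nothing on the device.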

(cherry picked from commit 689867c)

(cherry picked from commit e5deba2)

(cherry picked from commit 7664301)

(cherry picked from commit 28270ce)

(cherry picked from commit 6fb877c)

(cherry picked from commit 96c6e8a)

(cherry picked from commit 56401ad)

johnnynunez · 2026-06-11T03:44:52Z

Status note for anyone picking this up (I'm continuing the determinism/differentiability work across the stack — see NVIDIA/warp#1355 and #1281).

I attempted a mechanical rebase of this branch onto current main: it's not feasible. This PR rewrites constraint.py (+3344/−2184) and upstream has since landed a competing rewrite of the same kernels (a23500c, "Optimization: efc_contact kernels", +574/−465) plus 4 more constraint.py changes (Jdotv correction 9e024df, flex eq_active 80bbba7, cache_kernel f0e2d81/6f235d4). The result is 22 conflict regions where both sides restructured the same functions — a textual merge would produce silently wrong physics.

The right path is a semantic re-implementation of the count→scan→emit row-allocation pipeline on top of today's constraint.py, preserving a23500c's optimizations. @mar-yan24 if you have notes on which parts of the rewrite were essential vs incidental, that would cut the effort considerably.

Meanwhile #1281 rebases cleanly (see my comment there) and delivers bitwise contact ordering + single-step determinism on its own, with ~9-12% overhead vs baseline in my benchmarks.

johnnynunez · 2026-06-11T04:32:37Z

Update: the semantic port is done — mar-yan24#6 (branch johnnynunez:det/mjwarp-determinism2-port, stacked on the #1281 rebase).

The count→scan→emit pipeline and persisted scratch buffers are preserved as designed; adaptations for current main: contacts use a single _efc_contact_count mirroring the unified post-a23500c _efc_contact_init, eq_active honored in flex equality (80bbba7), all factories under @cache_kernel (f0e2d81), and — new — CUDA graph capture now works with opt.deterministic=True by skipping the host-side overflow readback while a capture is active.
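A sketch of the capture guard, under the assumption that the overflow check compares a row total read back on the host against njmax. How the port detects an active capture is not shown in this thread, so `capturing` is a plain boolean here, and `validate_overflow` is a hypothetical name.

```python
def validate_overflow(nefc_total, njmax, capturing):
    """Host-side overflow check. Returns False when the check is skipped."""
    if capturing:
        # Reading a device value back on the host forces a synchronization,
        # which is not allowed while a CUDA graph is being captured.
        return False
    if nefc_total > njmax:
        raise RuntimeError(f"constraint rows overflow: nefc={nefc_total} > njmax={njmax}")
    return True


print(validate_overflow(40, 64, capturing=False))  # check runs and passes
print(validate_overflow(80, 64, capturing=True))   # check skipped during capture
```

The trade-off is that an overflow during a captured run is no longer reported by this check, so capacities need enough headroom up front.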

Measured on RTX PRO 6000 Blackwell: full suite 1101 passed / 23 skipped; nefc + per-row efc.* + sparse efc.J structure/values bitwise stable across runs with graph capture on; overhead ~9% (1 world) to ~24% (256 worlds) on a 20-body contact pile, in line with the original branch's numbers.

johnnynunez · 2026-06-11T09:02:14Z

Heads-up: the root cause of the residual post-2/N drift is now isolated (full falsification methodology in #1281 (comment)), which defines the 3/N scope precisely:

_update_constraint_init_qfrc_constraint_sparse — racing wp.atomic_add into qfrc_constraint[world, dof] across efc rows
_JTDAJ_sparse / _update_gradient_h_incremental_sparse — racing wp.atomic_add into ctx.h[world, row, col]

Fixing both (order-deterministic gathers) makes 500-step rollouts bitwise on a 20-body contact pile, and an end-to-end Newton (newton-physics) SolverMuJoCo rollout bitwise over 200 steps. Dense-jacobian mode needs nothing — it's already gather-based and fully deterministic.
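Why racing atomic adds break bitwise reproducibility even though every contribution is identical: floating-point addition is not associative, so the arrival order changes the rounded result. The toy below models that on the CPU with made-up values; `scatter_add` and `gather_sum` are illustrative names, not the kernels listed above.

```python
def scatter_add(contribs, arrival_order):
    """Models racing atomic adds: same contributions, arbitrary arrival order."""
    total = 0.0
    for i in arrival_order:
        total += contribs[i]
    return total


def gather_sum(contribs):
    """Order-deterministic gather: one owner accumulates in index order."""
    total = 0.0
    for c in contribs:
        total += c
    return total


# Contributions of very different magnitude, as can occur across efc rows.
contribs = [1e16, 1.0, -1e16, 1.0]

print(scatter_add(contribs, [0, 1, 2, 3]))  # 1.0: the first +1.0 is rounded away
print(scatter_add(contribs, [0, 2, 1, 3]))  # 2.0: the large terms cancel first
print(gather_sum(contribs))                 # 1.0 on every run
```

The gather is not more accurate than the scatter; it is only repeatable, which is the property the determinism flag is after.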

I'm starting the production implementation of 3/N now (deterministic sparse qfrc_constraint + sparse H assembly, gated on opt.deterministic, same architecture as this PR), stacked on the 2/N port. Will link the branch here when it's green.

johnnynunez · 2026-06-11T09:17:49Z

3/N is implemented and green: mar-yan24#8 (branch johnnynunez:det/mjwarp-determinism3, stacked on the 2/N port).

With it, opt.deterministic=True now delivers bitwise-identical qacc/qpos/qvel over 1000-step contact-rich rollouts — same-process, fresh-process, and CUDA-graph-captured — with no warp-level deterministic mode required. Full suite 1108 passed; new SolverDeterminismTest regression coverage included.

Scope ended up being exactly two sparse kernels (deterministic gathers for qfrc_constraint and JTDAJ, plus disabling the incremental-H scatter in det mode); dense was already deterministic. Current version is correctness-first (naive gathers, ~2.4x at 1 world / ~10x at 1024 worlds on the solver-heavy pile benchmark — still well under warp's automatic mode at low-to-mid world counts); a segmented-reduce optimization pass using 2/N's count/scan machinery is the natural follow-up.

…signment

Semantic re-implementation of mar-yan24's count->scan->emit pipeline (google-deepmind#1300) on top of the post-a23500c constraint kernels, since the original branch predates the efc_contact rewrite and 95 other upstream commits and cannot be rebased mechanically.

When opt.deterministic=True, the racy wp.atomic_add(nefc/efc_nnz) slot allocation in every constraint family is replaced with: a count kernel that writes per-thread row/nnz counts, a per-world exclusive scan that converts counts to offsets and bumps the totals, and emit kernels that read base+offset instead of atomic results. Constraint rows therefore land at identical positions on every run of the same input.

Differences from the original google-deepmind#1300 branch:

- Contact families use a single _efc_contact_count kernel mirroring the unified _efc_contact_init (upstream a23500c replaced the per-cone _contact_pyramidal/_contact_elliptic kernels this no longer touches).
- _equality_flex_count honors eq_active (upstream 80bbba7).
- All new kernel factories are wrapped in @cache_kernel (upstream f0e2d81) so repeated step() calls do not recreate kernels.
- Host-side overflow validation is skipped while a CUDA graph capture is active, making opt.deterministic capture-safe (the original branch documented capture as unsupported).
- Ports the expanded determinism regression tests, minus the benchmark coverage that targeted the pre-google-deepmind#1301 benchmark API.

Verified: full suite 1101 passed / 23 skipped; constraint rows (nefc, efc.type/id, sparse J structure and values) bitwise stable across same-process trials with CUDA graph capture; ~9% overhead at 1 world, ~24% at 256 worlds on RTX PRO 6000 Blackwell.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Co-authored-by: Mark Yang <markyang2005@gmail.com>

mar-yan24 and others added 7 commits April 17, 2026 22:27

beta determinism 2

7ce36c2

(cherry picked from commit 689867c)

testing for row allocation

06598f5

(cherry picked from commit e5deba2)

cleanup

0b2fecd

(cherry picked from commit 7664301)

collision test expansion

bd5a8cd

(cherry picked from commit 28270ce)

persisted deterministic scratch for performance

e57cd53

(cherry picked from commit 6fb877c)

get rid of colon

f64d5fb

(cherry picked from commit 96c6e8a)

performance: skip zero-size deterministic constraint families

afc9ef9

(cherry picked from commit 56401ad)

eric-heiden mentioned this pull request May 1, 2026

Determinism in Newton newton-physics/newton#2479

Open

johnnynunez mentioned this pull request Jun 11, 2026

Determinism support 1/N #1281

Open

This was referenced Jun 11, 2026

Auto Differentiation Implementation 1/3 #1226

Open

Port determinism 2/N (count->scan->emit) onto current google-deepmind/main mar-yan24/mujoco_warp#6

Open

mar-yan24 wants to merge 7 commits into google-deepmind:main from mar-yan24:mark/determinism2-draft
