Determinism support 2/N #1300
Status note for anyone picking this up (I'm continuing the determinism/differentiability work across the stack; see NVIDIA/warp#1355 and #1281). I attempted a mechanical rebase of this branch onto current main: it's not feasible. This PR rewrites […] The right path is a semantic re-implementation of the count→scan→emit row-allocation pipeline on top of today's constraint.py, preserving a23500c's optimizations. @mar-yan24, if you have notes on which parts of the rewrite were essential vs. incidental, that would cut the effort considerably. Meanwhile, #1281 rebases cleanly (see my comment there) and delivers bitwise contact ordering + single-step determinism on its own, with ~9-12% overhead vs. baseline in my benchmarks.
Update: the semantic port is done: mar-yan24#6 (branch […]). The count→scan→emit pipeline and persisted scratch buffers are preserved as designed; adaptations for current main: contacts use a single […] Measured on RTX PRO 6000 Blackwell: full suite 1101 passed / 23 skipped; […]
Heads-up: the root cause of the residual post-2/N drift is now isolated (full falsification methodology in #1281 (comment)), which defines the 3/N scope precisely:
Fixing both (order-deterministic gathers) makes 500-step rollouts bitwise stable on a 20-body contact pile, and an end-to-end Newton (newton-physics) […] I'm starting the production implementation of 3/N now (deterministic sparse […]).
3/N is implemented and green: mar-yan24#8 (branch […]). With it, […] Scope ended up being exactly two sparse kernels (deterministic gathers for […]).
…signment

Semantic re-implementation of mar-yan24's count->scan->emit pipeline (google-deepmind#1300) on top of the post-a23500c constraint kernels, since the original branch predates the efc_contact rewrite and 95 other upstream commits and cannot be rebased mechanically.

When opt.deterministic=True, the racy wp.atomic_add(nefc/efc_nnz) slot allocation in every constraint family is replaced with: a count kernel that writes per-thread row/nnz counts, a per-world exclusive scan that converts counts to offsets and bumps the totals, and emit kernels that read base+offset instead of atomic results. Constraint rows therefore land at identical positions on every run of the same input.

Differences from the original google-deepmind#1300 branch:
- Contact families use a single _efc_contact_count kernel mirroring the unified _efc_contact_init (upstream a23500c replaced the per-cone _contact_pyramidal/_contact_elliptic kernels this no longer touches).
- _equality_flex_count honors eq_active (upstream 80bbba7).
- All new kernel factories are wrapped in @cache_kernel (upstream f0e2d81) so repeated step() calls do not recreate kernels.
- Host-side overflow validation is skipped while a CUDA graph capture is active, making opt.deterministic capture-safe (the original branch documented capture as unsupported).
- Ports the expanded determinism regression tests, minus the benchmark coverage that targeted the pre-google-deepmind#1301 benchmark API.

Verified: full suite 1101 passed / 23 skipped; constraint rows (nefc, efc.type/id, sparse J structure and values) bitwise stable across same-process trials with CUDA graph capture; ~9% overhead at 1 world, ~24% at 256 worlds on RTX PRO 6000 Blackwell.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
Co-authored-by: Mark Yang <markyang2005@gmail.com>
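A minimal sketch of the count -> scan -> emit idea described in the commit message, written in plain Python rather than as Warp kernels. Threads are modeled as list indices, and the function names `allocate_atomic` and `allocate_deterministic` are hypothetical, not names from the branch. It only illustrates why atomic slot allocation depends on thread arrival order while scan-based offsets do not.

```python
from itertools import accumulate


def allocate_atomic(counts, arrival_order):
    """Model of racy allocation: each thread bumps a shared counter on arrival."""
    base = [0] * len(counts)
    total = 0
    for tid in arrival_order:  # on a GPU this order can differ between runs
        base[tid] = total      # the value an atomic add would hand back
        total += counts[tid]
    return base, total


def allocate_deterministic(counts):
    """Count -> exclusive scan -> emit: offsets depend only on the thread index."""
    inclusive = list(accumulate(counts))
    offsets = [0] + inclusive[:-1]  # exclusive scan of the per-thread counts
    total = inclusive[-1] if inclusive else 0
    return offsets, total


# Count pass: number of rows each (hypothetical) thread wants to emit.
counts = [2, 0, 3, 1]

# Scan pass: every thread gets a fixed base row, independent of scheduling.
offsets, total = allocate_deterministic(counts)

# The atomic model gives different row positions for different arrival orders.
base_forward, total_forward = allocate_atomic(counts, [0, 1, 2, 3])
base_reverse, total_reverse = allocate_atomic(counts, [3, 2, 1, 0])
```

Both approaches agree on the total row count; only the scan-based version also fixes where each thread's rows land, which is the property the emit kernels rely on.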
Determinism support 2/N: deterministic constraint row allocation
For now I am just putting this in as a draft, opening it up early against main for review. Checks won't pass because 1/N changes are omitted.
Summary
This PR extends the opt-in opt.deterministic flag introduced in #1281 to make constraint row allocation reproducible across repeated runs of the same input. The wp.atomic_add(nefc_out, worldid, N) allocation used inside each constraint kernel is replaced with a deterministic count -> exclusive-scan -> emit pipeline. After 2/N, every constraint row should be at the same position on every run, so d.nefc, d.efc.*, and d.efc.J are bitwise stable.
Guarantees after 2/N (on top of 1/N)
- d.contact.* ordering (from 1/N).
- d.nefc across runs.
- d.efc.* values.
- d.efc.J (including J_rownnz, J_rowadr, J_colind).
Not yet guaranteed (deliberately out of scope)
- qacc, qvel, qpos. Still gated on deterministic solver reductions.
- CUDA graph capture with opt.deterministic=True. Blocked by host-side overflow readback.
Changes
- mujoco_warp/_src/constraint.py: replaces atomic slot allocation with the deterministic count -> scan -> emit pipeline across all constraint families (equality, friction, limit, contact pyramidal, contact elliptic).
- mujoco_warp/_src/constraint.py: persisted deterministic scratch buffers. All per-family counts, nnz_counts, offsets, nnz_offsets, nefc_base, nnz_base, plus contact world_start/world_end, are allocated once per (m, d) and reused across steps.
- mujoco_warp/_src/constraint.py: skip zero-size families. A Python-side early-skip for families whose size == 0 avoids ~10–20 no-op kernel launches per step on models like humanoid.
- mujoco_warp/_src/types.py: opt.deterministic docstring updated to reflect that the flag now also covers constraint row allocation.
- mujoco_warp/_src/determinism_test.py: expanded determinism regression coverage for nefc, per-row efc.*, dense and sparse efc.J, canonicalized det-on vs det-off row-multiset equivalence, and benchmark-path smoke coverage for both solver paths.
- mujoco_warp/_src/benchmark.py, mujoco_warp/testspeed.py: expose --use_cuda_graph at the CLI and pipe it through benchmark() so opt.deterministic=True can be benchmarked on the non-captured path while the host-side overflow readback remains (can remove this if unwanted, mainly for helping me test).
Benchmarks
I used similar benchmarking, just extended, as in the last PR, thanks to claude lol.
Environment:
- NVIDIA GeForce RTX 4060 Laptop GPU (8 GiB, sm_89)
- 1.13.0.dev20260227, CUDA Toolkit 12.9, Driver 12.5
- us/step = 1e6 * run_duration / (nworld * nstep).
- use_cuda_graph=False on both off and on runs (required for deterministic mode today, host-side overflow readback is not capture-safe).
- njmax = baseline + 32, njmax_nnz = baseline + 32 * nv (deterministic overflow validation would otherwise trip after warmup).
- collision.xml nworld=512 used nccdmax=8 to fit in 8 GiB, applied to both runs.
Newton + Dense
(Table: three rows each for humanoid/humanoid.xml and collision.xml; values not preserved.)
CG + Sparse
(Table: three rows each for humanoid/humanoid.xml and collision.xml; values not preserved.)
Overhead range across the matrix: +15.6% .. +39.5%.
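For reference, the us/step metric from the environment notes and the overhead percentage quoted for the matrix can be computed as in the sketch below. The helper names and the sample inputs are hypothetical and are not measurements from this PR; the overhead definition (deterministic run relative to the non-deterministic baseline) is an assumption about how the percentages were derived.

```python
def us_per_step(run_duration_s, nworld, nstep):
    """Microseconds per world-step: 1e6 * run_duration / (nworld * nstep)."""
    return 1e6 * run_duration_s / (nworld * nstep)


def overhead_pct(off_us, on_us):
    """Relative cost of the deterministic run versus the baseline, in percent."""
    return 100.0 * (on_us - off_us) / off_us


# Made-up inputs, for illustration only.
example_us = us_per_step(run_duration_s=2.0, nworld=512, nstep=1000)
example_overhead = overhead_pct(off_us=4.0, on_us=5.0)
```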
The benchmarks are decent: not amazing, not horrible. There are several performance enhancements I have in mind (aside from CUDA graph support, of course) that might bring the overhead down quite a bit. I already implemented two, which helped a decent bit.
Luckily, I was able to test this a couple of days ago. Since then I installed the latest Windows updates, and I am 99.9% sure they caused a CUDA latency regression: I can no longer reproduce the same numbers on either this PR or the last one. The latency is so bad that I literally cannot test on this Windows slop machine, so I might struggle to get better benchmarks until I find a workaround for this security update.
Performance enhancements in this PR
Two small perf enhancements are included so that 2/N is closer to the 1/N cost. Neither changes the determinism semantics; both are measured against the same matrix.
Persisted deterministic scratch buffers. Allocates every scratch buffer once per (m, d) and reuses it across steps instead of wp.empty(...) on every constraint family. Measured on this branch (humanoid.xml and collision.xml, 3 trials × 500 steps):
(Table: three rows for humanoid/humanoid.xml and two for collision.xml; values not preserved.)
Saves ~370–830 µs/step of device-side work.
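The persisted-buffer idea can be sketched roughly as follows. This is an illustrative pattern in plain Python, not the code in constraint.py: `ScratchCache` and its methods are hypothetical names, and the allocation function stands in for whatever actually creates the device array (for example a wrapper around wp.empty).

```python
class ScratchCache:
    """Allocate each named scratch buffer once per (m, d) pair and reuse it."""

    def __init__(self, alloc):
        self._alloc = alloc    # allocation function; assumed to return a sized buffer
        self._buffers = {}
        self.alloc_calls = 0   # counter kept only to make the reuse visible

    def get(self, m, d, name, size):
        # Keyed by object identity; assumes m and d outlive the cache entries.
        key = (id(m), id(d), name)
        buf = self._buffers.get(key)
        if buf is None or len(buf) < size:
            buf = self._alloc(size)
            self.alloc_calls += 1
            self._buffers[key] = buf
        return buf


# Stand-ins for the model and data objects, with Python lists as fake buffers.
m, d = object(), object()
cache = ScratchCache(lambda n: [0] * n)

for _ in range(500):  # 500 simulated steps
    for name in ("counts", "offsets"):
        cache.get(m, d, name, 64)
```

After 500 steps the allocator has been called once per buffer name rather than once per step, which is where the per-step saving comes from.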
Skip zero-size families. Python-side early-skip for constraint families whose family-size is 0 (e.g. unused equality / friction / limit). Avoids ~10–20 no-op kernel launches per step on humanoid-like models. Small but consistent saving, largest impact at small nworld where launch overhead dominates.
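A rough sketch of the early-skip, under the assumption that per-family sizes are known on the host before launch. `launch_families` and the `launch` callback are hypothetical stand-ins for the real kernel-launch code, and the sizes are made up.

```python
def launch_families(family_sizes, launch):
    """Launch one kernel per non-empty constraint family; skip empty ones on the host."""
    launched = []
    for name, size in family_sizes.items():
        if size == 0:
            continue  # no rows to emit, so avoid a no-op kernel launch
        launch(name, size)
        launched.append(name)
    return launched


# Hypothetical sizes for a model with no equality or friction constraints.
sizes = {"equality": 0, "friction": 0, "limit": 12, "contact": 40}
calls = []
launched = launch_families(sizes, lambda name, size: calls.append((name, size)))
```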