Port determinism 2/N (count->scan->emit) onto current google-deepmind/main by johnnynunez · Pull Request #6 · mar-yan24/mujoco_warp

johnnynunez · 2026-06-11T04:32:20Z

Semantic port of this branch's deterministic constraint row allocation onto current upstream main (which is 96+ commits ahead, including the a23500c efc_contact rewrite that made a mechanical rebase impossible — see google-deepmind#1300 (comment)).

What's preserved from your design:

- The count -> per-world exclusive scan -> emit pipeline, per constraint family
- Persisted scratch buffers via _ensure_det_scratch (allocated once per (m, d))
- wp.static(deterministic) dispatch inside kernel factories so non-det mode compiles to the original atomic code
- Your expanded determinism regression tests

What's adapted to current main:

- Contacts: one _efc_contact_count kernel mirroring the unified _efc_contact_init from a23500c (replaces the per-cone _contact_pyramidal/_contact_elliptic count kernels, since those emit kernels no longer exist upstream)
- _equality_flex_count honors eq_active (upstream 80bbba7)
- All factories wrapped in @cache_kernel (upstream f0e2d81), which fixes kernel re-creation on every step
- CUDA graph capture now works with opt.deterministic=True: host-side overflow validation is skipped while a capture is active (device-side overflow warning still fires on replays)
- Benchmark-path tests dropped (the pre-google-deepmind/mujoco_warp#1301 benchmark() API no longer exists)

Verified on RTX PRO 6000 Blackwell (sm_120, CUDA 13.3):

- Full suite: 1101 passed, 23 skipped
- Constraint rows (nefc, efc.type/id, sparse J structure+values) bitwise stable across same-process trials with CUDA graph capture
- Overhead: ~9% at 1 world, ~24% at 256 worlds (graph captured, 20-body contact pile)

Stacks on top of #4 (the 1/N rebase).

Bumps [lxml](https://github.com/lxml/lxml) from 6.0.2 to 6.1.0. - [Release notes](https://github.com/lxml/lxml/releases) - [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt) - [Commits](lxml/lxml@lxml-6.0.2...lxml-6.1.0) --- updated-dependencies: - dependency-name: lxml dependency-version: 6.1.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…google-deepmind#1306)

1. Use wp.tile_cholesky_inplace on the diagonal block instead of wp.tile_cholesky. The inplace variant zeros the strict upper triangle and avoids allocating a separate L_kk tile, so the shared-memory footprint of the factor step is halved.
2. In backward substitution, read previously-computed x_j blocks as a tile_view into the resident rhs_tile rather than re-loading them from global x. The per-iteration wp.tile_store into x is hoisted to a single coalesced store of rhs_tile at the end of the kernel. This eliminates O((N/bs)^2) global loads plus a store per iteration.

Correctness validated by the existing test_block_cholesky and the full solver_test suite.

Kernel-level A/B on three_humanoids (nworld=8192, nstep=500):
- update_gradient_cholesky_blocked (factor+solve): avg 801 -> 711 us (-11.3%)
- update_gradient_cholesky_blocked_skip_unchanged (solve-mostly): median 153 -> 148 us (-3.1%), avg 287 -> 281 us (-2.0%)
- Combined cholesky: 1527 -> 1464 ms (-4.1%)
- Tail on factor+solve: stddev 276 -> 70 us, max 2236 -> 1809 us

* Add semantic segmentation render parity * Add semantic segmentation tutorial example * Format io test skip message * Fix API import ordering * Add dedicated semantic segmentation notebook * Fix notebook local import setup * Update segmentation API and tutorial * Address latest PR review comments * Revert synthetic zero-nefc test change * Restore io test to match main * Fix tutorial segmentation display dtype * Enable segmentation rendering in tutorial * address StafaH's reviews * add flex segmentation test * Use render buffer directly in segmentation test * Fix flex mesh group roots for segmentation * remove leading _

…ind#1301) * Initial stab at moving benchmark logic to python. Cleans up testspeed to be only human readable output. Moves benchmark running to benchmarks/run.py with example in humanoid. * Support for asset fetching from git repos. * Refactor CLI tools: centralize shared code in cli.py and add record binary * fix ruff * More changes and refactors. * Remove mediapy, introduces too many deps. * Update uv.lock to HEAD * Better rollout video for aloha_cloth * Address PR comments. * Ensure duration respects rollout length.

…nd#1326) * Narrow out-of-bounds test to speed up CI. * Fix warp kernel cache for stable dependency tests Add dep-type to the CI warp cache key so that stable and locked uv dependency tests each get their own kernel cache. Previously both shared the same cache keyed on uv.lock, which is irrelevant for stable (pip install), causing cache misses when warp versions differ.

…google-deepmind#1325) * Consolidate aloha benchmarks into single directory Merge aloha_pot, aloha_sdf, aloha_cloth into benchmarks/aloha/ with scene-specific XMLs (scene_pot.xml, scene_sdf.xml, scene_cloth.xml). Add aloha_clutter benchmark with YCB/GSO object assets from aloha_sim. Other changes: - Add glob pattern support in run.py asset specs for flexible asset mapping - Migrate run.py from os.path to pathlib - Auto-set nstep from replay trajectory length when not explicitly specified - Update benchmarks/README.md documentation - Generalize load_trajectory docstring and variable naming * Address PR comments. * Address PR comments. * Address PR comments. * Update XML after mujocolab/mjlab#970

* Add Implicit Integrator and RNE Derivative Support * refactor: scope PR to RNE derivative implementation only * fixed linting issues * Move RNE-specific derivative tests into `derivative_test.py` . * feat: implement analytical RNE derivatives and consolidate tests - derivative.py: Implemented RNE Term 2 using `motion_cross_force` to correctly calculate momentum derivatives. Cleaned up comments and TODOs. - derivative_test.py: Merged RNE stress tests into `test_rne_stress`, replacing `test_rne_vel_effect`. - types.py: Marked `IMPLICIT` integrator as unsupported. * fix comments * fix linting issues * feat: resolve conflicts and fix MuJoCo 3.5.0 compatibility * Merge upstream/main into PR google-deepmind#912 * update rne derivatives implementation * updates * update derivative_test.py --------- Co-authored-by: Taylor Howell <taylorhowell@google.com>

@thowell

…okup table (google-deepmind#1334) * Fix _qderiv_actuator_passive_actuation_sparse row indexing into qM_fullm The kernel introduced in google-deepmind#1243 searches qM_fullm for the (row, col) elemid using m.M_rownnz / m.M_rowadr. Those arrays describe MuJoCo's compact mass matrix sparsity, where joints whose internal block is treated as diagonal-only by mj_factorM (e.g. free joints) contribute one entry per internal dof. qM_fullm_i / qM_fullm_j, however, are built by walking dof_parentid and include the full chained internal block, so the two layouts diverge whenever a joint with diagonal-only compact storage precedes any actuated dof in qvel order. In that case the kernel reads the wrong slice of qMj, the inner `qMj[row_startk] == col` check never fires, and the actuator's contribution to qDeriv is silently dropped. Downstream factor_solve_i then sees (M - dt*qDeriv) without the actuator damping for the affected dof, and the implicit step diverges. The bug only surfaces with this specific topology (free joint at qvel start + actuated dof after it), which is why the existing serial-chain unit test (PR google-deepmind#1243's test_smooth_vel_sparse_tendon_coupled, no free joint) does not catch it. Repro: a single free body followed by an actuated hinge; with the buggy indexing, mjwarp's deriv_smooth_vel diagonal entry for the hinge is qM only (0.001), while MuJoCo's reference is qM + dt*(kp+kv) (0.003). Fix: build chain-aware row offsets qM_fullm_rownnz / qM_fullm_rowadr alongside qM_fullm_i / qM_fullm_j in io.py and pass them to the kernel instead of the compact M_rownnz / M_rowadr. Adds test_smooth_vel_sparse_free_joint_precedes_actuator covering the minimum reproducer. * Address review: drop unused body name and is_sparse assert Per @thowell's PR feedback (google-deepmind#1334), remove the redundant body name attribute and the is_sparse assertion in the new regression test. 
* Switch sparse qDeriv lookup to qM_fullm_elemid (Kenny's approach) Replace the chain-aware (qM_fullm_rownnz, qM_fullm_rowadr) row offsets plus linear search through qMj with a dense (nv x nv) qM_fullm_elemid lookup table built in io.py. The kernel does an O(1) reverse lookup instead of walking the row to find the matching column. For typical robot models (humanoid, three_humanoids) kernel time is unchanged within trial noise. For deep chains with high-rownnz actuators (e.g. tendons spanning a long serial chain) the kernel asymptotically drops from O(depth^3) to O(depth^2) per actuator thread; a 100-link chain with a tendon actuator runs ~8x faster end-to-end on a 5090. Memory cost: nv^2 * 4 bytes (40 KiB at nv=100, 1 MiB at nv=500).
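The reverse-lookup idea in the commit above can be sketched in plain Python. This is only an illustration under stated assumptions: `build_elemid`, `find_linear`, and `table_bytes` are hypothetical helper names, the (row, col) lists stand in for qM_fullm_i / qM_fullm_j, and the real table is built in io.py and read inside a Warp kernel.

```python
def build_elemid(nv, rows, cols):
    """Dense nv x nv table: elemid[r][c] is the element index, or -1 if absent."""
    elemid = [[-1] * nv for _ in range(nv)]
    for k, (r, c) in enumerate(zip(rows, cols)):
        elemid[r][c] = k
    return elemid


def find_linear(rows, cols, r, c):
    """Baseline for comparison: walk the element list until (r, c) matches."""
    for k, (ri, ci) in enumerate(zip(rows, cols)):
        if ri == r and ci == c:
            return k
    return -1


def table_bytes(nv, bytes_per_entry=4):
    """Memory cost of the dense table with 4-byte integer entries."""
    return nv * nv * bytes_per_entry


# Hypothetical 3-dof serial chain: every dof couples with all of its ancestors.
rows = [0, 1, 1, 2, 2, 2]
cols = [0, 0, 1, 0, 1, 2]
elemid = build_elemid(3, rows, cols)

# O(1) table read agrees with the linear search for present and absent entries.
hit = elemid[2][1]
miss = elemid[0][2]
```

The table trades nv^2 entries of memory for a constant-time read, which is the trade described in the commit message: `table_bytes(100)` is 40,000 bytes and `table_bytes(500)` is 1,000,000 bytes.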

* Plumb lighting params to warp renderer and utilize to reach parity with opengl renderer make background color controllable in create_render_context api make light params batchable make material params batchable pr comments PR comments PR comments add render sanity tests remove lighting field prop test make max shininess a constant * linter fix * use max shininess

* Fix line search cost precision Evaluate line-search candidates relative to alpha zero so small improvements are not lost when the absolute objective is large. Store the accepted line-search improvement directly for convergence instead of recomputing it from absolute costs after constraint update. * Remove unused solver cost tracking Solver convergence now uses the line-search improvement directly, so the accumulated objective cost is no longer consumed by runtime code. Drop the internal cost workspace and the cost atomics from constraint update. * Remove unused solver Gauss tracking Drop the stored Gauss cost from solver contexts now that line search tracks shifted objective improvements directly. This removes the per-iteration Gauss update kernels and keeps the line-search Gauss polynomial delta-only. * Fix inactive shifted line-search costs Preserve the zero-cost baseline for one-sided rows that start inactive and become active during line search. Keep the shifted quadratic form fast with an explicit offset, and use cost-only helpers where alpha-zero subtraction does not need derivatives. * Adapt context types to main * Adapt island line search to cleanup * Adapt line-search test to main * Fix island line-search cost rebase * Format solver module * Collapse shifted helpers and clarify offset Drops _eval_pt_direct_shifted (no-offset) into the offset variant with offset=0.0, and renames the offset variant to drop the redundant _offset suffix. Same for the 3-alpha pair. Adds an offset variable name and a comment at the four limit/contact branches that motivates why the offset is needed when alpha=0 was inactive. * Apply ruff-format to solver * Address line-search review comments Hoist the per-contact reads in the parallel line-search kernel so the impratio index, friction, and quad values are loaded once and shared by both elliptic-cost evaluations instead of being re-read for each call. This matches the iterative kernel. 
Rename the test helper update_mujoco_constraints to update_constraints and rename test_linesearch_accepts_sub_ulp_improvement to test_linesearch_accepts_sub_float32_improvement to avoid the unexplained "ulp" abbreviation.

…ogle-deepmind#1402) * Disable MathDx GEMM for blocked Cholesky tile_matmul The blocked Cholesky performs small (16x16) upper-convention (U^T U) rank-k updates. Warp's register-blocked scalar GEMM is faster than cuBLASDx for that left-transpose pattern at this tile size, and skipping cuBLASDx also avoids its LTO compile cost. Set enable_mathdx_gemm=False via per-kernel module_options on the blocked-Cholesky kernels. Also switch the existing dense-JTDAJ disable to the same per-kernel mechanism and remove the now-unused scoped_mathdx_gemm_disabled helper, so the option lives on the kernel definition rather than mutating the global warp config around each launch. * Tune JTDAJ_dense block_dim to 128 for the scalar GEMM The dense JTDAJ tile_matmul runs on Warp's scalar GEMM (MathDx disabled). Its optimal block_dim for the scalar path is 128, whereas 96 was tuned for the cuBLASDx path. Only affects dense models (nv <= 32).

The island line search was the last solver path still comparing absolute constraint costs. When absolute costs are large, the accepted improvement can fall below float32 resolution, so previous_cost - current_cost rounds to zero and the line search refuses an otherwise-valid step. Evaluate line-search candidates as deltas from alpha=0, matching the monolithic iterative and parallel kernels: drop the constant gauss cost, make the alpha=0 point the zero-cost reference, evaluate each constraint contribution as cost(alpha) - cost(0) via the shifted helpers, and compare candidate costs against zero. Gradient and hessian are offset-free and are left unchanged, so the Newton step and bracketing logic are untouched. Add a direct island line-search unit test that drives the kernel on a single island whose large gauss cost hides a sub-resolution improvement.
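The float32 rounding problem described above is easy to reproduce outside the solver. The sketch below emulates float32 with the standard library; the cost values are made up for illustration and `f32` is a hypothetical helper, not part of mujoco_warp.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


gauss = 1.0e8        # large constant cost shared by all candidates (illustrative)
improvement = 0.5    # true cost decrease of the candidate step (illustrative)

# Absolute costs: each is rounded to float32 before the subtraction.
previous_cost = f32(gauss)
current_cost = f32(gauss - improvement)
absolute_delta = f32(previous_cost - current_cost)

# Shifted costs: alpha = 0 is the zero reference and the constant is dropped,
# so the candidate cost is just cost(alpha) - cost(0).
shifted_cost = f32(-improvement)

accepted_absolute = absolute_delta > 0.0
accepted_shifted = shifted_cost < 0.0
```

Near 1e8 adjacent float32 values are 8 apart, so the 0.5 improvement disappears from the absolute comparison while the shifted comparison against zero keeps it.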

_update_gradient_init_h_sparse wrote the mass matrix into the lower triangle of h via M_elemid[i, j], which is only populated where the column is an ancestor of the row (the lower triangle). The constraint term from _JTDAJ_sparse and the Cholesky factorization both use the upper triangle, so the upper-triangle Hessian was assembled without the mass matrix, producing an incorrect Newton Hessian. Transpose the lookup to M_elemid[j, i] so M lands in the upper triangle, and skip writing the lower triangle entirely since it is never read. The solver still converges with the wrong Hessian (Newton degrades toward gradient descent), so existing tests pass; the symptom is a large performance regression on the sparse path.

The dense-Jacobian branch of _update_gradient_JTCJ_island skips an iteration when J[ic1, idof_i] is zero, but in the dim1 != dim2 case the swap-pair contribution hcone * J[ic2, idof_i] * J[ic1, idof_j] still needs to write — the swap-pair-only block under 'if dim1id != dim2id' only ran when J[ic1, i] AND J[ic2, j] were both nonzero. Cells where exactly one of those four J entries was zero silently dropped the corresponding cone contribution. Per cell H[a, b] for an off-diagonal cone pair (dim1, dim2) the contribution should be hcone * (J[ic1, a] * J[ic2, b] + J[ic2, a] * J[ic1, b]) which the monolithic _update_gradient_JTCJ_dense computes correctly (solver.py:2700-2704). This change keeps the original loop (which already handles the first term) and adds a second loop with the same shape but ic1/ic2 swapped, gated on dim1id != dim2id, for the second term. Both loops keep the aggressive J==0 early-skip from the original. The existing scalar _cholesky_factorize_solve_island absorbs the resulting indefinite per-island H via its mid-factorization clamp (s <= 1e-6 -> s = 1e-6), so no existing test fails. The bug becomes visible when the per-island solve uses any Cholesky variant without diagonal clamping (e.g. wp.tile_cholesky on the per-island H sub-block), producing NaN qacc on configurations such as constraints.xml keyframe 2 with elliptic cone + Newton + dense Jacobian.
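A scalar sketch of the per-cell contribution may make the dropped term easier to see. The three functions below are simplified stand-ins written for this note (hypothetical names, plain nested lists instead of Warp arrays); they model the skip conditions described above, not the actual kernel code.

```python
def cell_reference(J, ic1, ic2, a, b, hcone):
    """Full off-diagonal cone contribution to H[a, b]."""
    return hcone * (J[ic1][a] * J[ic2][b] + J[ic2][a] * J[ic1][b])


def cell_single_loop(J, ic1, ic2, a, b, hcone):
    """Model of the bug: both terms are gated on J[ic1][a] and J[ic2][b]."""
    if J[ic1][a] == 0.0 or J[ic2][b] == 0.0:
        return 0.0
    return hcone * (J[ic1][a] * J[ic2][b] + J[ic2][a] * J[ic1][b])


def cell_two_loops(J, ic1, ic2, a, b, hcone):
    """Model of the fix: a second pass with ic1/ic2 swapped, each with its own skip."""
    h = 0.0
    for r1, r2 in ((ic1, ic2), (ic2, ic1)):
        if J[r1][a] == 0.0:
            continue  # early skip only drops a term that is zero anyway
        h += hcone * J[r1][a] * J[r2][b]
    return h


# Two constraint rows, two dofs; J[0][0] and J[1][1] are zero.
J = [[0.0, 2.0],
     [3.0, 0.0]]
ref = cell_reference(J, 0, 1, 0, 1, 0.5)
bug = cell_single_loop(J, 0, 1, 0, 1, 0.5)
fix = cell_two_loops(J, 0, 1, 0, 1, 0.5)
```

Here the first term is zero but the swapped term is 0.5 * 3 * 2, so the single-loop model returns zero while the two-loop model matches the reference.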

* Apply changes from cl/925389681: Implement tiled kernels for CG solver helpers * Fix pre-submit failures: remove unused import, re-order solve_cg_finalize parameters and adjust launch arguments * Apply ruff formatting fixes to solver.py * Specialize CG block dimensions and resolve review comments * Fix solver context initialization for island solver * fixed formatting * add _ * Remove dead kernels, redundant zeroing; dynamic block_dim * fixed formatting * remove block_dim from else branch

…d#1413) The sparse elliptic-cone Hessian assembly (JTCJ) previously ran output-stationary: one thread per (contact, dense dof-pair), scanning the contact's column indices to locate each pair. When nefc >> nv the overwhelming majority of those dof-pairs do not appear in the contact's support, so the kernel spent almost all of its time scanning and skipping. Restructure it to be input-stationary: launch one thread per (contact, support-pair), decode the pair index directly into the two participating dofs, register-sum the cone block, and accumulate into H with a single atomic add. This touches only the pairs that actually contribute and exposes far more parallelism. The pair dimension is bounded by jtcj_max_pairs, derived once in io.py from the deepest geom-body dof chain. The math is unchanged. Co-authored-by: Alain Denzler <adenzler@nvidia.com>
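One way to picture the input-stationary launch is a flat pair index that decodes into two positions within a contact's support list. The commit message does not spell out the decoding, so the triangular-number scheme below is an assumption, and `decode_pair` and `accumulate_support_pairs` are hypothetical names used only for this sketch.

```python
import math


def decode_pair(p):
    """Map a flat index p to (i, j) with i >= j in lower-triangular order."""
    i = (math.isqrt(8 * p + 1) - 1) // 2
    j = p - i * (i + 1) // 2
    return i, j


def accumulate_support_pairs(H, support, values, hcone):
    """One loop iteration per support pair; returns how many cells were touched."""
    n = len(support)
    npairs = n * (n + 1) // 2
    for p in range(npairs):  # stands in for one GPU thread per (contact, pair)
        i, j = decode_pair(p)
        H[support[i]][support[j]] += hcone * values[i] * values[j]
    return npairs


# Hypothetical contact touching dofs 1 and 4 of a 6-dof model.
nv = 6
H = [[0.0] * nv for _ in range(nv)]
touched = accumulate_support_pairs(H, support=[1, 4], values=[2.0, 3.0], hcone=1.0)
dense_pairs = nv * (nv + 1) // 2
```

With two support dofs only 3 pairs are visited, against 21 dof-pairs for the dense enumeration of a 6-dof model, which is the gap the restructuring exploits when most dof-pairs are outside the contact's support.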

Semantic re-implementation of mar-yan24's count->scan->emit pipeline (google-deepmind#1300) on top of the post-a23500c constraint kernels, since the original branch predates the efc_contact rewrite and 95 other upstream commits and cannot be rebased mechanically.

When opt.deterministic=True, the racy wp.atomic_add(nefc/efc_nnz) slot allocation in every constraint family is replaced with:
- a count kernel that writes per-thread row/nnz counts,
- a per-world exclusive scan that converts counts to offsets and bumps the totals, and
- emit kernels that read base+offset instead of atomic results.

Constraint rows therefore land at identical positions on every run of the same input.

Differences from the original google-deepmind#1300 branch:
- Contact families use a single _efc_contact_count kernel mirroring the unified _efc_contact_init (upstream a23500c replaced the per-cone _contact_pyramidal/_contact_elliptic kernels, so this port no longer touches them).
- _equality_flex_count honors eq_active (upstream 80bbba7).
- All new kernel factories are wrapped in @cache_kernel (upstream f0e2d81) so repeated step() calls do not recreate kernels.
- Host-side overflow validation is skipped while a CUDA graph capture is active, making opt.deterministic capture-safe (the original branch documented capture as unsupported).
- Ports the expanded determinism regression tests, minus the benchmark coverage that targeted the pre-google-deepmind#1301 benchmark API.

Verified: full suite 1101 passed / 23 skipped; constraint rows (nefc, efc.type/id, sparse J structure and values) bitwise stable across same-process trials with CUDA graph capture; ~9% overhead at 1 world, ~24% at 256 worlds on RTX PRO 6000 Blackwell.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
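For readers new to the pattern, here is a minimal CPU sketch of count -> exclusive scan -> emit for a single world, next to an arrival-order allocation that mimics what an atomic counter does. All names are hypothetical and the thread order is simulated with a list; the real kernels are Warp kernels operating per world.

```python
from itertools import accumulate


def exclusive_scan(counts):
    """offsets[i] = sum(counts[:i]); also returns the total row count."""
    inclusive = list(accumulate(counts))
    total = inclusive[-1] if inclusive else 0
    return [0] + inclusive[:-1], total


def emit_deterministic(efc_rows, counts, payloads, thread_order):
    """Each thread writes at base + its scanned offset, whatever the run order."""
    base = len(efc_rows)
    offsets, total = exclusive_scan(counts)
    out = list(efc_rows) + [None] * total
    for tid in thread_order:
        for k in range(counts[tid]):
            out[base + offsets[tid] + k] = payloads[tid][k]
    return out


def emit_arrival_order(efc_rows, counts, payloads, thread_order):
    """Atomic-counter model: a thread's slot depends on when it happens to run."""
    out = list(efc_rows)
    for tid in thread_order:
        out.extend(payloads[tid][: counts[tid]])
    return out


counts = [1, 0, 2, 1]                        # rows each thread will emit
payloads = [["a0"], [], ["c0", "c1"], ["d0"]]
existing = ["eq0"]                           # rows already allocated by an earlier family
forward, shuffled = [0, 1, 2, 3], [2, 3, 0, 1]

det_a = emit_deterministic(existing, counts, payloads, forward)
det_b = emit_deterministic(existing, counts, payloads, shuffled)
racy_a = emit_arrival_order(existing, counts, payloads, forward)
racy_b = emit_arrival_order(existing, counts, payloads, shuffled)
```

The scanned layout is identical for both simulated schedules, while the arrival-order layout changes with the schedule; that difference is what the deterministic mode removes.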

johnnynunez · 2026-06-11T05:47:13Z

Same note as on #4 re: the CONFLICTING badge — this is a port stacked on the rebased determinism1, so a textual merge into mark/determinism2-draft (which predates the upstream constraint.py rewrite) isn't meaningful. To adopt: git reset --hard johnnynunez/det/mjwarp-determinism2-port on the branch behind google-deepmind#1300, or review here and I'll reshape it however you prefer.

One thing to highlight for review since it goes beyond the original branch: opt.deterministic is now CUDA-graph-capture safe (host overflow validation is skipped while capturing — wp.get_stream().is_capturing). That removes the biggest perf objection to the deterministic path, since the no-graph requirement was costing far more than the count→scan→emit overhead itself. With capture on, the measured overhead is ~9% @ 1 world, ~24% @ 256 worlds.
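The capture guard amounts to a host-side check that is bypassed while capturing. The sketch below is a plain-Python illustration, not the code in this PR: `validate_row_budget` is a hypothetical name, the `is_capturing` flag is passed in by the caller (in the port it comes from wp.get_stream().is_capturing), and comparing against njmax is an assumption about what the validation checks.

```python
def validate_row_budget(nefc_per_world, njmax, is_capturing):
    """Return the worst-case row count, or None when validation is skipped."""
    if is_capturing:
        # Skip the host-side check during capture; the PR description says the
        # device-side overflow warning still fires on replays.
        return None
    worst = max(nefc_per_world, default=0)
    if worst > njmax:
        raise ValueError(f"constraint rows overflow: {worst} > njmax={njmax}")
    return worst


skipped = validate_row_budget([3, 5], njmax=4, is_capturing=True)
checked = validate_row_budget([3, 4], njmax=4, is_capturing=False)
```

Outside of capture the same over-budget input raises, so eager runs keep the early failure while captured graphs rely on the device-side warning.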

thowell and others added 30 commits April 21, 2026 17:45

fill_(wp.inf) (google-deepmind#1302)

eedcb02

update efc_J reshape (google-deepmind#1304)

cc5fbf2

polynomial stiffness and damping (google-deepmind#1298)

80b0739

dc motor (google-deepmind#1293)

70533bf

flex contact parameters (google-deepmind#1310)

09a88e2

remove todos (google-deepmind#1311)

f9260c6

update _flex_bending (google-deepmind#1309)

a05b47e

Add skybox rendering to the Batch Renderer (google-deepmind#1308)

1e05a2d

* Add skybox rendering * Fix compilation bug when skybox disabled * Small fixes

Force upgrade mujoco_warp to MuJoCo 3.8.0 to accomodate breaking API change. (google-deepmind#1314)

5b65e59

Bump MuJoCo Warp to v3.8.0 (google-deepmind#1315)

cbd3ca0

compute nmaxcondim including flex (google-deepmind#1312)

ebcb493

restore cache_kernel (google-deepmind#1318)

6f235d4

v3.8.0.1 (google-deepmind#1319)

a50dbbd

Optimization: efc_contact kernels (google-deepmind#1295)

a23500c

* Optimize the efc_contact kernels * Formatting * Fix kernel analyzer error * Rename dof_affects_body to body_isdofancestor * Improve body_isdofancestor construction * Address some of the review comments * Fix formatting

flex equality constraint eq_active check (google-deepmind#1320)

80bbba7

update flex_stiffness (google-deepmind#1322)

6fa14d7

add benchmark for MyoSim MyoArm (google-deepmind#937)

0728005

* add benchmark for MyoSim MyoArm * update record.py

python 3.14 (google-deepmind#1327)

dab8152

update inside site sensor (google-deepmind#1330)

4da8881

flex_bending (google-deepmind#1328)

7ef10e7

Import from Piper (google-deepmind#1335)

e48bce0

* Import from Piper * version guard

v3.8.0.2 (google-deepmind#1336)

9a94d74

mhasek and others added 26 commits June 2, 2026 18:33

remove unused update_gradient_JTDAJ_sparse_tiled (google-deepmind#1400)

452712f

mj_fullM check version (google-deepmind#1401)

09e27cc

Optimization: refactor some memory allocation (google-deepmind#1356)

be83026

* Optimize h.zero_ call * Update memory allocation * Remove dangerous return --------- Co-authored-by: Taylor Howell <taylorhowell@google.com>

sleeping (google-deepmind#1369)

2e1a7e2

refactor solver (google-deepmind#1395)

2da62c5

Initialize linesearch_iterative block dimension dynamically based on model degrees of freedom (google-deepmind#1404)

53c6a09

Fix render test failures on machines with GPU display. (google-deepmind#1406)

fb56eb0

update sleep implementation (google-deepmind#1408)

c759bdf

remove parallel linesearch (google-deepmind#1410)

0615687

flex passive stiffness and bending adr checks (google-deepmind#1412)

083e5ef

determinism implementation 1

498e666

Update types.py

cb803ff

deprecated bracket switch

c27b697

add todo and remove opt-in comment

c579fc4

test _sort_contacts added

10be348

combine permute kernals

f52243f

ruff fix and contact param replacement

8da00d8

johnnynunez mentioned this pull request Jun 11, 2026

Determinism support 2/N google-deepmind/mujoco_warp#1300

Draft

This was referenced Jun 11, 2026

Determinism 3/N: deterministic sparse solver reductions (bitwise qacc/qpos/qvel) #8

Open

Port smooth contact autodifferentiation (3/3) onto current google-deepmind/main #11

Open

johnnynunez wants to merge 104 commits into mar-yan24:mark/determinism2-draft from johnnynunez:det/mjwarp-determinism2-port

16 participants
