
Port determinism 2/N (count->scan->emit) onto current google-deepmind/main #6

Open
johnnynunez wants to merge 104 commits into mar-yan24:mark/determinism2-draft from johnnynunez:det/mjwarp-determinism2-port


@johnnynunez


Semantic port of this branch's deterministic constraint row allocation onto current upstream main (which is 96+ commits ahead, including the a23500c efc_contact rewrite that made a mechanical rebase impossible — see google-deepmind#1300 (comment)).

What's preserved from your design:

  • The count -> per-world exclusive scan -> emit pipeline, per constraint family
  • Persisted scratch buffers via _ensure_det_scratch (allocated once per (m, d))
  • wp.static(deterministic) dispatch inside kernel factories so non-det mode compiles to the original atomic code
  • Your expanded determinism regression tests
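A minimal host-side sketch of the count -> per-world exclusive scan -> emit idea, in plain Python rather than Warp kernels. The function names, the list-of-lists layout, and the `base_per_world` argument are illustrative assumptions, not the names used in this branch.

```python
def exclusive_scan(counts):
    """Return (offsets, total) where offsets[i] == sum(counts[:i])."""
    offsets = []
    running = 0
    for c in counts:
        offsets.append(running)
        running += c
    return offsets, running


def allocate_rows(counts_per_world, base_per_world):
    """Count -> per-world exclusive scan -> emit, for one constraint family.

    counts_per_world[w][t]: rows requested by thread t in world w (count pass).
    base_per_world[w]: rows already allocated in world w by earlier families
    (a hypothetical stand-in for the running per-world total).
    """
    slots = []
    new_totals = []
    for counts, base in zip(counts_per_world, base_per_world):
        offsets, total = exclusive_scan(counts)
        # Emit pass: thread t writes its rows starting at base + offsets[t].
        slots.append([base + off for off in offsets])
        new_totals.append(base + total)
    return slots, new_totals


# Two worlds, four candidate threads each; a count of 0 means "inactive".
counts = [[1, 0, 4, 1], [0, 0, 2, 0]]
slots, totals = allocate_rows(counts, base_per_world=[3, 0])
print(slots)   # [[3, 4, 4, 8], [0, 0, 0, 2]]
print(totals)  # [9, 2]
```

Because each thread's start slot depends only on the counts of lower-indexed threads in the same world, the layout does not depend on the order in which threads run.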

What's adapted to current main:

  • Contacts: one _efc_contact_count kernel mirroring the unified _efc_contact_init from a23500c (replaces the per-cone _contact_pyramidal/_contact_elliptic count kernels — those emit kernels no longer exist upstream)
  • _equality_flex_count honors eq_active (upstream 80bbba7)
  • All factories wrapped in @cache_kernel (upstream f0e2d81) — fixes kernel re-creation on every step
  • CUDA graph capture now works with opt.deterministic=True: host-side overflow validation is skipped while a capture is active (device-side overflow warning still fires on replays)
  • Benchmark-path tests dropped (the benchmark() API that predates google-deepmind/mujoco_warp#1301, "Refactor CLI tools and generate benchmark documentation", no longer exists)

Verified on RTX PRO 6000 Blackwell (sm_120, CUDA 13.3):

  • Full suite: 1101 passed, 23 skipped
  • Constraint rows (nefc, efc.type/id, sparse J structure+values) bitwise stable across same-process trials with CUDA graph capture
  • Overhead: ~9% at 1 world, ~24% at 256 worlds (graph captured, 20-body contact pile)

Stacks on top of #4 (the 1/N rebase).

thowell and others added 30 commits April 21, 2026 17:45
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.2 to 6.1.0.
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](lxml/lxml@lxml-6.0.2...lxml-6.1.0)

---
updated-dependencies:
- dependency-name: lxml
  dependency-version: 6.1.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…google-deepmind#1306)

1. Use wp.tile_cholesky_inplace on the diagonal block instead of
   wp.tile_cholesky. The inplace variant zeros the strict upper triangle
   and avoids allocating a separate L_kk tile, so the shared-memory
   footprint of the factor step is halved.

2. In backward substitution, read previously-computed x_j blocks as a
   tile_view into the resident rhs_tile rather than re-loading them from
   global x. The per-iteration wp.tile_store into x is hoisted to a
   single coalesced store of rhs_tile at the end of the kernel. This
   eliminates O((N/bs)^2) global loads plus a store per iteration.

Correctness validated by the existing test_block_cholesky and the full
solver_test suite.

Kernel-level A/B on three_humanoids (nworld=8192, nstep=500):
- update_gradient_cholesky_blocked (factor+solve): avg 801 -> 711 us (-11.3%)
- update_gradient_cholesky_blocked_skip_unchanged (solve-mostly):
  median 153 -> 148 us (-3.1%), avg 287 -> 281 us (-2.0%)
- Combined cholesky: 1527 -> 1464 ms (-4.1%)
- Tail on factor+solve: stddev 276 -> 70 us, max 2236 -> 1809 us
* Add skybox rendering

* Fix compilation bug when skybox disabled

* Small fixes
* Add semantic segmentation render parity

* Add semantic segmentation tutorial example

* Format io test skip message

* Fix API import ordering

* Add dedicated semantic segmentation notebook

* Fix notebook local import setup

* Update segmentation API and tutorial

* Address latest PR review comments

* Revert synthetic zero-nefc test change

* Restore io test to match main

* Fix tutorial segmentation display dtype

* Enable segmentation rendering in tutorial

* address StafaH's reviews

* add flex segmentation test

* Use render buffer directly in segmentation test

* Fix flex mesh group roots for segmentation

* remove leading _
…ind#1301)

* Initial stab at moving benchmark logic to python.

Cleans up testspeed to be only human readable output.  Moves
benchmark running to benchmarks/run.py with example in humanoid.

* Support for asset fetching from git repos.

* Refactor CLI tools: centralize shared code in cli.py and add record binary

* fix ruff

* More changes and refactors.

* Remove mediapy, introduces too many deps.

* Update uv.lock to HEAD

* Better rollout video for aloha_cloth

* Address PR comments.

* Ensure duration respects rollout length.
* Optimize the efc_contact kernels

* Formatting

* Fix kernel analyzer error

* Rename dof_affects_body to body_isdofancestor

* Improve body_isdofancestor construction

* Address some of the review comments

* Fix formatting
* add benchmark for MyoSim MyoArm

* update record.py
…nd#1326)

* Narrow out-of-bounds test to speed up CI.

* Fix warp kernel cache for stable dependency tests

Add dep-type to the CI warp cache key so that stable and locked uv
dependency tests each get their own kernel cache. Previously both
shared the same cache keyed on uv.lock, which is irrelevant for
stable (pip install), causing cache misses when warp versions differ.
…google-deepmind#1325)

* Consolidate aloha benchmarks into single directory

Merge aloha_pot, aloha_sdf, aloha_cloth into benchmarks/aloha/ with
scene-specific XMLs (scene_pot.xml, scene_sdf.xml, scene_cloth.xml).

Add aloha_clutter benchmark with YCB/GSO object assets from aloha_sim.

Other changes:
- Add glob pattern support in run.py asset specs for flexible asset mapping
- Migrate run.py from os.path to pathlib
- Auto-set nstep from replay trajectory length when not explicitly specified
- Update benchmarks/README.md documentation
- Generalize load_trajectory docstring and variable naming

* Address PR comments.

* Address PR comments.

* Address PR comments.

* Update XML after mujocolab/mjlab#970
* Add Implicit Integrator and RNE Derivative Support

* refactor: scope PR to RNE derivative implementation only

* fixed linting issues

* Move RNE-specific derivative tests into `derivative_test.py` .

* feat: implement analytical RNE derivatives and consolidate tests

- derivative.py: Implemented RNE Term 2 using `motion_cross_force` to correctly calculate momentum derivatives. Cleaned up comments and TODOs.
- derivative_test.py: Merged RNE stress tests into `test_rne_stress`, replacing `test_rne_vel_effect`.
- types.py: Marked `IMPLICIT` integrator as unsupported.

* fix comments

* fix linting issues

* feat: resolve conflicts and fix MuJoCo 3.5.0 compatibility

* Merge upstream/main into PR google-deepmind#912

* update rne derivatives implementation

* updates

* update derivative_test.py

---------

Co-authored-by: Taylor Howell <taylorhowell@google.com>
* Import from Piper

* version guard
…okup table (google-deepmind#1334)

* Fix _qderiv_actuator_passive_actuation_sparse row indexing into qM_fullm

The kernel introduced in google-deepmind#1243 searches qM_fullm for the (row, col) elemid
using m.M_rownnz / m.M_rowadr. Those arrays describe MuJoCo's compact mass
matrix sparsity, where joints whose internal block is treated as
diagonal-only by mj_factorM (e.g. free joints) contribute one entry per
internal dof. qM_fullm_i / qM_fullm_j, however, are built by walking
dof_parentid and include the full chained internal block, so the two
layouts diverge whenever a joint with diagonal-only compact storage
precedes any actuated dof in qvel order.

In that case the kernel reads the wrong slice of qMj, the inner
`qMj[row_startk] == col` check never fires, and the actuator's
contribution to qDeriv is silently dropped. Downstream factor_solve_i
then sees (M - dt*qDeriv) without the actuator damping for the affected
dof, and the implicit step diverges. The bug only surfaces with this
specific topology (free joint at qvel start + actuated dof after it),
which is why the existing serial-chain unit test (PR google-deepmind#1243's
test_smooth_vel_sparse_tendon_coupled, no free joint) does not catch it.

Repro: a single free body followed by an actuated hinge; with the
buggy indexing, mjwarp's deriv_smooth_vel diagonal entry for the hinge
is qM only (0.001), while MuJoCo's reference is qM + dt*(kp+kv) (0.003).

Fix: build chain-aware row offsets qM_fullm_rownnz / qM_fullm_rowadr
alongside qM_fullm_i / qM_fullm_j in io.py and pass them to the kernel
instead of the compact M_rownnz / M_rowadr.

Adds test_smooth_vel_sparse_free_joint_precedes_actuator covering the
minimum reproducer.

* Address review: drop unused body name and is_sparse assert

Per @thowell's PR feedback (google-deepmind#1334), remove the redundant body name
attribute and the is_sparse assertion in the new regression test.

* Switch sparse qDeriv lookup to qM_fullm_elemid (Kenny's approach)

Replace the chain-aware (qM_fullm_rownnz, qM_fullm_rowadr) row offsets
plus linear search through qMj with a dense (nv x nv) qM_fullm_elemid
lookup table built in io.py. The kernel does an O(1) reverse lookup
instead of walking the row to find the matching column.

For typical robot models (humanoid, three_humanoids) kernel time is
unchanged within trial noise. For deep chains with high-rownnz actuators
(e.g. tendons spanning a long serial chain) the kernel asymptotically
drops from O(depth^3) to O(depth^2) per actuator thread; a 100-link
chain with a tendon actuator runs ~8x faster end-to-end on a 5090.

Memory cost: nv^2 * 4 bytes (40 KiB at nv=100, 1 MiB at nv=500).
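
A small sketch of the trade-off described above, assuming a toy 3-dof serial chain. `build_elemid`, `find_by_row_walk`, and `table_bytes` are hypothetical helpers written for illustration; they are not the io.py code.

```python
def build_elemid(nv, rows, cols):
    """Dense (nv x nv) table: elemid[i][j] is the element index, or -1 if absent."""
    elemid = [[-1] * nv for _ in range(nv)]
    for k, (i, j) in enumerate(zip(rows, cols)):
        elemid[i][j] = k
    return elemid


def find_by_row_walk(cols, rowadr, rownnz, i, j):
    """Linear search through row i for column j (the approach being replaced)."""
    for k in range(rowadr[i], rowadr[i] + rownnz[i]):
        if cols[k] == j:
            return k
    return -1


def table_bytes(nv, bytes_per_entry=4):
    """Memory for the dense lookup table with 4-byte entries."""
    return nv * nv * bytes_per_entry


# 3-dof serial chain: row i stores (i, i) followed by each ancestor of i.
rows = [0, 1, 1, 2, 2, 2]
cols = [0, 1, 0, 2, 1, 0]
rowadr = [0, 1, 3]
rownnz = [1, 2, 3]

elemid = build_elemid(3, rows, cols)
agree = all(
    elemid[i][j] == find_by_row_walk(cols, rowadr, rownnz, i, j)
    for i in range(3)
    for j in range(3)
)
print(agree)                               # True
print(table_bytes(100), table_bytes(500))  # 40000 1000000
```

The byte counts work out to about 39 KiB and about 0.95 MiB, in line with the rounded figures quoted above.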
mhasek and others added 26 commits June 2, 2026 18:33
* Plumb lighting params to warp renderer and utilize to reach parity with opengl renderer

make background color controllable in create_render_context api

make light params batchable

make material params batchable

pr comments

PR comments

PR comments

add render sanity tests

remove lighting field prop test

make max shininess a constant

* linter fix

* use max shininess
* Fix line search cost precision

Evaluate line-search candidates relative to alpha zero so small improvements are not lost when the absolute objective is large.

Store the accepted line-search improvement directly for convergence instead of recomputing it from absolute costs after constraint update.

* Remove unused solver cost tracking

Solver convergence now uses the line-search improvement directly, so the accumulated objective cost is no longer consumed by runtime code.

Drop the internal cost workspace and the cost atomics from constraint update.

* Remove unused solver Gauss tracking

Drop the stored Gauss cost from solver contexts now that line search tracks shifted objective improvements directly. This removes the per-iteration Gauss update kernels and keeps the line-search Gauss polynomial delta-only.

* Fix inactive shifted line-search costs

Preserve the zero-cost baseline for one-sided rows that start inactive and become active during line search. Keep the shifted quadratic form fast with an explicit offset, and use cost-only helpers where alpha-zero subtraction does not need derivatives.

* Adapt context types to main

* Adapt island line search to cleanup

* Adapt line-search test to main

* Fix island line-search cost rebase

* Format solver module

* Collapse shifted helpers and clarify offset

Drops _eval_pt_direct_shifted (no-offset) into the offset variant with offset=0.0, and renames the offset variant to drop the redundant _offset suffix. Same for the 3-alpha pair. Adds an offset variable name and a comment at the four limit/contact branches that motivates why the offset is needed when alpha=0 was inactive.

* Apply ruff-format to solver

* Address line-search review comments

Hoist the per-contact reads in the parallel line-search kernel so the
impratio index, friction, and quad values are loaded once and shared by
both elliptic-cost evaluations instead of being re-read for each call.
This matches the iterative kernel.

Rename the test helper update_mujoco_constraints to update_constraints
and rename test_linesearch_accepts_sub_ulp_improvement to
test_linesearch_accepts_sub_float32_improvement to avoid the unexplained
"ulp" abbreviation.
…ogle-deepmind#1402)

* Disable MathDx GEMM for blocked Cholesky tile_matmul

The blocked Cholesky performs small (16x16) upper-convention (U^T U)
rank-k updates. Warp's register-blocked scalar GEMM is faster than
cuBLASDx for that left-transpose pattern at this tile size, and skipping
cuBLASDx also avoids its LTO compile cost.

Set enable_mathdx_gemm=False via per-kernel module_options on the
blocked-Cholesky kernels. Also switch the existing dense-JTDAJ disable to
the same per-kernel mechanism and remove the now-unused
scoped_mathdx_gemm_disabled helper, so the option lives on the kernel
definition rather than mutating the global warp config around each launch.

* Tune JTDAJ_dense block_dim to 128 for the scalar GEMM

The dense JTDAJ tile_matmul runs on Warp's scalar GEMM (MathDx disabled).
Its optimal block_dim for the scalar path is 128, whereas 96 was tuned for
the cuBLASDx path. Only affects dense models (nv <= 32).
* Optimize h.zero_ call

* Update memory allocation

* Remove dangerous return

---------

Co-authored-by: Taylor Howell <taylorhowell@google.com>
The island line search was the last solver path still comparing absolute
constraint costs. When absolute costs are large, the accepted improvement
can fall below float32 resolution, so previous_cost - current_cost rounds
to zero and the line search refuses an otherwise-valid step.

Evaluate line-search candidates as deltas from alpha=0, matching the
monolithic iterative and parallel kernels: drop the constant gauss cost,
make the alpha=0 point the zero-cost reference, evaluate each constraint
contribution as cost(alpha) - cost(0) via the shifted helpers, and compare
candidate costs against zero. Gradient and hessian are offset-free and are
left unchanged, so the Newton step and bracketing logic are untouched.
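
The rounding problem can be reproduced on the host with a float32 round-trip. This is an illustrative sketch with made-up magnitudes (1e8 for the absolute cost, 1e-3 for the improvement); it is not the solver code.

```python
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


cost_at_zero = 1.0e8        # large absolute objective at alpha = 0
true_improvement = 1.0e-3   # small but real decrease at the candidate alpha

# Absolute comparison: both costs round to the same float32 value.
previous_cost = f32(cost_at_zero)
current_cost = f32(cost_at_zero - true_improvement)
absolute_improvement = f32(previous_cost - current_cost)

# Shifted comparison: evaluate cost(alpha) - cost(0) directly, compare to zero.
shifted_cost = f32(-true_improvement)

print(absolute_improvement)  # 0.0
print(shifted_cost < 0.0)    # True
```

Near 1e8 adjacent float32 values are 8 apart, so any improvement much smaller than that vanishes in the absolute form but survives in the shifted form.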

Add a direct island line-search unit test that drives the kernel on a
single island whose large gauss cost hides a sub-resolution improvement.
_update_gradient_init_h_sparse wrote the mass matrix into the lower
triangle of h via M_elemid[i, j], which is only populated where the
column is an ancestor of the row (the lower triangle). The constraint
term from _JTDAJ_sparse and the Cholesky factorization both use the
upper triangle, so the upper-triangle Hessian was assembled without the
mass matrix, producing an incorrect Newton Hessian.

Transpose the lookup to M_elemid[j, i] so M lands in the upper triangle,
and skip writing the lower triangle entirely since it is never read.

The solver still converges with the wrong Hessian (Newton degrades
toward gradient descent), so existing tests pass; the symptom is a large
performance regression on the sparse path.
The dense-Jacobian branch of _update_gradient_JTCJ_island skips an
iteration when J[ic1, idof_i] is zero, but in the dim1 != dim2 case
the swap-pair contribution hcone * J[ic2, idof_i] * J[ic1, idof_j]
still needs to write — the swap-pair-only block under 'if dim1id !=
dim2id' only ran when J[ic1, i] AND J[ic2, j] were both nonzero. Cells
where exactly one of those four J entries was zero silently dropped
the corresponding cone contribution.

Per cell H[a, b] for an off-diagonal cone pair (dim1, dim2) the
contribution should be
  hcone * (J[ic1, a] * J[ic2, b] + J[ic2, a] * J[ic1, b])
which the monolithic _update_gradient_JTCJ_dense computes correctly
(solver.py:2700-2704). This change keeps the original loop (which
already handles the first term) and adds a second loop with the same
shape but ic1/ic2 swapped, gated on dim1id != dim2id, for the second
term. Both loops keep the aggressive J==0 early-skip from the
original.
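
A simplified sketch of the two-loop structure in plain Python, assuming one off-diagonal cone pair with Jacobian rows `j1` and `j2`. The helper names are hypothetical, and the real kernel's indexing (islands, dof maps) is omitted.

```python
def cone_pair_reference(hcone, j1, j2):
    """H[a][b] = hcone * (j1[a] * j2[b] + j2[a] * j1[b]), with no skipping."""
    n = len(j1)
    return [
        [hcone * (j1[a] * j2[b] + j2[a] * j1[b]) for b in range(n)]
        for a in range(n)
    ]


def accumulate_term(H, hcone, ja, jb):
    """Add hcone * ja[a] * jb[b] to H, skipping rows where ja[a] == 0."""
    for a, va in enumerate(ja):
        if va == 0.0:
            continue  # every product in this row is zero for this term only
        for b, vb in enumerate(jb):
            H[a][b] += hcone * va * vb


def cone_pair_two_loops(hcone, j1, j2):
    n = len(j1)
    H = [[0.0] * n for _ in range(n)]
    accumulate_term(H, hcone, j1, j2)  # first term
    accumulate_term(H, hcone, j2, j1)  # swapped term for dim1 != dim2
    return H


j1 = [1.0, 0.0, 2.0]
j2 = [0.0, 3.0, 0.0]
hcone = 0.5

first_only = [[0.0] * 3 for _ in range(3)]
accumulate_term(first_only, hcone, j1, j2)

print(cone_pair_two_loops(hcone, j1, j2) == cone_pair_reference(hcone, j1, j2))  # True
print(first_only[1][0], cone_pair_reference(hcone, j1, j2)[1][0])  # 0.0 1.5
```

Here `j1[1]` is zero, so a single loop that skips on `j1` never writes `H[1][0]` even though the swapped term contributes 1.5 there.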

The existing scalar _cholesky_factorize_solve_island absorbs the
resulting indefinite per-island H via its mid-factorization clamp
(s <= 1e-6 -> s = 1e-6), so no existing test fails. The bug becomes
visible when the per-island solve uses any Cholesky variant without
diagonal clamping (e.g. wp.tile_cholesky on the per-island H
sub-block), producing NaN qacc on configurations such as
constraints.xml keyframe 2 with elliptic cone + Newton + dense
Jacobian.
* Apply changes from cl/925389681: Implement tiled kernels for CG solver helpers

* Fix pre-submit failures: remove unused import, re-order solve_cg_finalize parameters and adjust launch arguments

* Apply ruff formatting fixes to solver.py

* Specialize CG block dimensions and resolve review comments

* Fix solver context initialization for island solver

* fixed formatting

* add _

* Remove dead kernels, redundant zeroing; dynamic block_dim

* fixed formatting

* remove block_dim from else branch
…d#1413)

The sparse elliptic-cone Hessian assembly (JTCJ) previously ran
output-stationary: one thread per (contact, dense dof-pair), scanning
the contact's column indices to locate each pair. When nefc >> nv the
overwhelming majority of those dof-pairs do not appear in the contact's
support, so the kernel spent almost all of its time scanning and
skipping.

Restructure it to be input-stationary: launch one thread per
(contact, support-pair), decode the pair index directly into the two
participating dofs, register-sum the cone block, and accumulate into H
with a single atomic add. This touches only the pairs that actually
contribute and exposes far more parallelism. The pair dimension is
bounded by jtcj_max_pairs, derived once in io.py from the deepest
geom-body dof chain. The math is unchanged.
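
One way a linear pair index could be decoded into two support dofs is a lower-triangle mapping, sketched below. The commit does not spell out its decode scheme or how jtcj_max_pairs is computed, so `decode_pair`, `support_pairs`, and the triangular ordering are assumptions made for illustration.

```python
import math


def decode_pair(p):
    """Map a linear index p to (i, j) with 0 <= j <= i (lower-triangle order)."""
    i = (math.isqrt(8 * p + 1) - 1) // 2
    j = p - i * (i + 1) // 2
    return i, j


def support_pairs(support):
    """Enumerate dof pairs for one contact, one 'thread' per support pair."""
    n = len(support)
    return [
        (support[i], support[j])
        for i, j in (decode_pair(p) for p in range(n * (n + 1) // 2))
    ]


support = [2, 5, 7]  # dofs that one contact actually touches
nv = 30              # total dofs in the (hypothetical) model
pairs = support_pairs(support)

print(pairs)  # [(2, 2), (5, 2), (5, 5), (7, 2), (7, 5), (7, 7)]
# Support pairs versus dense lower-triangle dof pairs for the same contact.
print(len(pairs), nv * (nv + 1) // 2)  # 6 465
```

With only three support dofs, six threads cover every contributing pair, whereas a scan over all dof pairs would visit hundreds of entries that contribute nothing.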

Co-authored-by: Alain Denzler <adenzler@nvidia.com>
Semantic re-implementation of mar-yan24's count->scan->emit pipeline
(google-deepmind#1300) on top of the post-a23500c constraint
kernels, since the original branch predates the efc_contact rewrite and
95 other upstream commits and cannot be rebased mechanically.

When opt.deterministic=True, the racy wp.atomic_add(nefc/efc_nnz) slot
allocation in every constraint family is replaced with: a count kernel
that writes per-thread row/nnz counts, a per-world exclusive scan that
converts counts to offsets and bumps the totals, and emit kernels that
read base+offset instead of atomic results. Constraint rows therefore
land at identical positions on every run of the same input.
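
A toy simulation of why the atomic allocation is order-dependent while the scan-based one is not. Thread scheduling is modeled as a permutation of thread indices; both functions are illustrative stand-ins, not the Warp kernels.

```python
import random


def atomic_style_layout(counts, order):
    """Simulate atomic_add slot allocation: slots depend on arrival order."""
    nefc = 0
    start = [None] * len(counts)
    for t in order:          # 'order' stands in for GPU thread scheduling
        if counts[t]:
            start[t] = nefc  # value returned by the simulated atomic add
            nefc += counts[t]
    return start


def scan_layout(counts, order):
    """Offsets come from an exclusive scan, so arrival order is irrelevant."""
    offsets, running = [], 0
    for c in counts:
        offsets.append(running)
        running += c
    start = [None] * len(counts)
    for t in order:          # the emit pass may still run in any order
        if counts[t]:
            start[t] = offsets[t]
    return start


counts = [1, 0, 4, 1, 2]
n = len(counts)
rng = random.Random(0)
orders = [list(range(n)), list(reversed(range(n)))]
orders += [rng.sample(range(n), n) for _ in range(20)]

atomic_layouts = {tuple(atomic_style_layout(counts, o)) for o in orders}
scan_layouts = {tuple(scan_layout(counts, o)) for o in orders}
print(len(atomic_layouts) > 1, len(scan_layouts) == 1)  # True True
```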

Differences from the original google-deepmind#1300 branch:
- Contact families use a single _efc_contact_count kernel mirroring the
  unified _efc_contact_init (upstream a23500c replaced the per-cone
  _contact_pyramidal/_contact_elliptic kernels, so this port no longer
  touches them).
- _equality_flex_count honors eq_active (upstream 80bbba7).
- All new kernel factories are wrapped in @cache_kernel (upstream
  f0e2d81) so repeated step() calls do not recreate kernels.
- Host-side overflow validation is skipped while a CUDA graph capture is
  active, making opt.deterministic capture-safe (the original branch
  documented capture as unsupported).
- Ports the expanded determinism regression tests, minus the benchmark
  coverage that targeted the pre-google-deepmind#1301 benchmark API.

Verified: full suite 1101 passed / 23 skipped; constraint rows (nefc,
efc.type/id, sparse J structure and values) bitwise stable across
same-process trials with CUDA graph capture; ~9% overhead at 1 world,
~24% at 256 worlds on RTX PRO 6000 Blackwell.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
@johnnynunez

Author

Same note as on #4 re: the CONFLICTING badge — this is a port stacked on the rebased determinism1, so a textual merge into mark/determinism2-draft (which predates the upstream constraint.py rewrite) isn't meaningful. To adopt: git reset --hard johnnynunez/det/mjwarp-determinism2-port on the branch behind google-deepmind#1300, or review here and I'll reshape it however you prefer.

One thing to highlight for review since it goes beyond the original branch: opt.deterministic is now CUDA-graph-capture safe (host overflow validation is skipped while capturing — wp.get_stream().is_capturing). That removes the biggest perf objection to the deterministic path, since the no-graph requirement was costing far more than the count→scan→emit overhead itself. With capture on, the measured overhead is ~9% @ 1 world, ~24% @ 256 worlds.
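
A rough sketch of the guard, written as plain Python so it runs without a GPU. `check_overflow_on_host` and `read_counts` are hypothetical names; in the real code the flag would come from `wp.get_stream().is_capturing`, and the readback would be a device-to-host copy.

```python
def check_overflow_on_host(read_counts, capacity, is_capturing):
    """Host-side overflow validation that is skipped during graph capture.

    read_counts: hypothetical callable doing the device-to-host readback.
    is_capturing: in Warp this would be wp.get_stream().is_capturing.
    Returns the list of overflowing world indices, or None if skipped.
    """
    if is_capturing:
        # A blocking readback would synchronize the stream, which is not
        # permitted mid-capture, so rely on the device-side warning instead.
        return None
    counts = read_counts()
    return [w for w, n in enumerate(counts) if n > capacity]


readbacks = []


def fake_read_counts():
    readbacks.append(1)   # record that a readback happened
    return [10, 70, 64]   # per-world row counts


print(check_overflow_on_host(fake_read_counts, 64, is_capturing=False))  # [1]
print(check_overflow_on_host(fake_read_counts, 64, is_capturing=True))   # None
print(len(readbacks))  # 1
```

Only the non-capturing call touches the readback, which is the property that keeps the deterministic path capture-safe in this sketch.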
