Port determinism 2/N (count->scan->emit) onto current google-deepmind/main #6
Bumps [lxml](https://github.com/lxml/lxml) from 6.0.2 to 6.1.0. - [Release notes](https://github.com/lxml/lxml/releases) - [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt) - [Commits](lxml/lxml@lxml-6.0.2...lxml-6.1.0) --- updated-dependencies: - dependency-name: lxml dependency-version: 6.1.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…google-deepmind#1306)

1. Use wp.tile_cholesky_inplace on the diagonal block instead of wp.tile_cholesky. The inplace variant zeros the strict upper triangle and avoids allocating a separate L_kk tile, so the shared-memory footprint of the factor step is halved.
2. In backward substitution, read previously-computed x_j blocks as a tile_view into the resident rhs_tile rather than re-loading them from global x. The per-iteration wp.tile_store into x is hoisted to a single coalesced store of rhs_tile at the end of the kernel. This eliminates O((N/bs)^2) global loads plus a store per iteration.

Correctness validated by the existing test_block_cholesky and the full solver_test suite.

Kernel-level A/B on three_humanoids (nworld=8192, nstep=500):
- update_gradient_cholesky_blocked (factor+solve): avg 801 -> 711 us (-11.3%)
- update_gradient_cholesky_blocked_skip_unchanged (solve-mostly): median 153 -> 148 us (-3.1%), avg 287 -> 281 us (-2.0%)
- Combined cholesky: 1527 -> 1464 ms (-4.1%)
- Tail on factor+solve: stddev 276 -> 70 us, max 2236 -> 1809 us
* Add skybox rendering * Fix compilation bug when skybox disabled * Small fixes
* Add semantic segmentation render parity * Add semantic segmentation tutorial example * Format io test skip message * Fix API import ordering * Add dedicated semantic segmentation notebook * Fix notebook local import setup * Update segmentation API and tutorial * Address latest PR review comments * Revert synthetic zero-nefc test change * Restore io test to match main * Fix tutorial segmentation display dtype * Enable segmentation rendering in tutorial * address StafaH's reviews * add flex segmentation test * Use render buffer directly in segmentation test * Fix flex mesh group roots for segmentation * remove leading _
…ind#1301) * Initial stab at moving benchmark logic to python. Cleans up testspeed to be only human readable output. Moves benchmark running to benchmarks/run.py with example in humanoid. * Support for asset fetching from git repos. * Refactor CLI tools: centralize shared code in cli.py and add record binary * fix ruff * More changes and refactors. * Remove mediapy, introduces too many deps. * Update uv.lock to HEAD * Better rollout video for aloha_cloth * Address PR comments. * Ensure duration respects rollout length.
* Optimize the efc_contact kernels * Formatting * Fix kernel analyzer error * Rename dof_affects_body to body_isdofancestor * Improve body_isdofancestor construction * Address some of the review comments * Fix formatting
* add benchmark for MyoSim MyoArm * update record.py
…nd#1326) * Narrow out-of-bounds test to speed up CI. * Fix warp kernel cache for stable dependency tests Add dep-type to the CI warp cache key so that stable and locked uv dependency tests each get their own kernel cache. Previously both shared the same cache keyed on uv.lock, which is irrelevant for stable (pip install), causing cache misses when warp versions differ.
…google-deepmind#1325)
* Consolidate aloha benchmarks into single directory

  Merge aloha_pot, aloha_sdf, aloha_cloth into benchmarks/aloha/ with scene-specific XMLs (scene_pot.xml, scene_sdf.xml, scene_cloth.xml). Add aloha_clutter benchmark with YCB/GSO object assets from aloha_sim.

  Other changes:
  - Add glob pattern support in run.py asset specs for flexible asset mapping
  - Migrate run.py from os.path to pathlib
  - Auto-set nstep from replay trajectory length when not explicitly specified
  - Update benchmarks/README.md documentation
  - Generalize load_trajectory docstring and variable naming
* Address PR comments.
* Address PR comments.
* Address PR comments.
* Update XML after mujocolab/mjlab#970
* Add Implicit Integrator and RNE Derivative Support * refactor: scope PR to RNE derivative implementation only * fixed linting issues * Move RNE-specific derivative tests into `derivative_test.py` . * feat: implement analytical RNE derivatives and consolidate tests - derivative.py: Implemented RNE Term 2 using `motion_cross_force` to correctly calculate momentum derivatives. Cleaned up comments and TODOs. - derivative_test.py: Merged RNE stress tests into `test_rne_stress`, replacing `test_rne_vel_effect`. - types.py: Marked `IMPLICIT` integrator as unsupported. * fix comments * fix linting issues * feat: resolve conflicts and fix MuJoCo 3.5.0 compatibility * Merge upstream/main into PR google-deepmind#912 * update rne derivatives implementation * updates * update derivative_test.py --------- Co-authored-by: Taylor Howell <taylorhowell@google.com>
* Import from Piper * version guard
…okup table (google-deepmind#1334) * Fix _qderiv_actuator_passive_actuation_sparse row indexing into qM_fullm The kernel introduced in google-deepmind#1243 searches qM_fullm for the (row, col) elemid using m.M_rownnz / m.M_rowadr. Those arrays describe MuJoCo's compact mass matrix sparsity, where joints whose internal block is treated as diagonal-only by mj_factorM (e.g. free joints) contribute one entry per internal dof. qM_fullm_i / qM_fullm_j, however, are built by walking dof_parentid and include the full chained internal block, so the two layouts diverge whenever a joint with diagonal-only compact storage precedes any actuated dof in qvel order. In that case the kernel reads the wrong slice of qMj, the inner `qMj[row_startk] == col` check never fires, and the actuator's contribution to qDeriv is silently dropped. Downstream factor_solve_i then sees (M - dt*qDeriv) without the actuator damping for the affected dof, and the implicit step diverges. The bug only surfaces with this specific topology (free joint at qvel start + actuated dof after it), which is why the existing serial-chain unit test (PR google-deepmind#1243's test_smooth_vel_sparse_tendon_coupled, no free joint) does not catch it. Repro: a single free body followed by an actuated hinge; with the buggy indexing, mjwarp's deriv_smooth_vel diagonal entry for the hinge is qM only (0.001), while MuJoCo's reference is qM + dt*(kp+kv) (0.003). Fix: build chain-aware row offsets qM_fullm_rownnz / qM_fullm_rowadr alongside qM_fullm_i / qM_fullm_j in io.py and pass them to the kernel instead of the compact M_rownnz / M_rowadr. Adds test_smooth_vel_sparse_free_joint_precedes_actuator covering the minimum reproducer. * Address review: drop unused body name and is_sparse assert Per @thowell's PR feedback (google-deepmind#1334), remove the redundant body name attribute and the is_sparse assertion in the new regression test. 
* Switch sparse qDeriv lookup to qM_fullm_elemid (Kenny's approach) Replace the chain-aware (qM_fullm_rownnz, qM_fullm_rowadr) row offsets plus linear search through qMj with a dense (nv x nv) qM_fullm_elemid lookup table built in io.py. The kernel does an O(1) reverse lookup instead of walking the row to find the matching column. For typical robot models (humanoid, three_humanoids) kernel time is unchanged within trial noise. For deep chains with high-rownnz actuators (e.g. tendons spanning a long serial chain) the kernel asymptotically drops from O(depth^3) to O(depth^2) per actuator thread; a 100-link chain with a tendon actuator runs ~8x faster end-to-end on a 5090. Memory cost: nv^2 * 4 bytes (40 KiB at nv=100, 1 MiB at nv=500).
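The trade-off between the row walk and the dense table can be illustrated in plain Python. This is only a sketch, not the mjwarp kernel: `build_elemid` and `find_linear` are hypothetical helper names, and the small `rows`/`cols` arrays stand in for qM_fullm_i / qM_fullm_j on an assumed 3-dof serial chain.

```python
def build_elemid(nv, rows, cols):
    """Dense nv x nv table mapping (row, col) -> element index, -1 if absent."""
    elemid = [[-1] * nv for _ in range(nv)]
    for k, (i, j) in enumerate(zip(rows, cols)):
        elemid[i][j] = k
    return elemid


def find_linear(cols, rowadr, rownnz, i, j):
    """Walk row i's slice looking for column j; cost grows with row length."""
    for k in range(rowadr[i], rowadr[i] + rownnz[i]):
        if cols[k] == j:
            return k
    return -1


# Hypothetical chain-ordered pattern: each row lists its dof, then ancestors.
rows = [0, 1, 1, 2, 2, 2]
cols = [0, 1, 0, 2, 1, 0]
rowadr = [0, 1, 3]   # start of each row's slice
rownnz = [1, 2, 3]   # entries per row

elemid = build_elemid(3, rows, cols)

# Both lookups agree; the table answers with a single indexed read.
linear_hit = find_linear(cols, rowadr, rownnz, 2, 1)
table_hit = elemid[2][1]

# Memory cost of the dense int32 table, nv^2 * 4 bytes.
table_bytes_nv100 = 100 * 100 * 4
```

The table costs nv^2 entries regardless of sparsity, which is the memory trade-off the commit message quantifies.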
* Plumb lighting params to warp renderer and utilize to reach parity with opengl renderer make background color controllable in create_render_context api make light params batchable make material params batchable pr comments PR comments PR comments add render sanity tests remove lighting field prop test make max shininess a constant * linter fix * use max shininess
* Fix line search cost precision Evaluate line-search candidates relative to alpha zero so small improvements are not lost when the absolute objective is large. Store the accepted line-search improvement directly for convergence instead of recomputing it from absolute costs after constraint update. * Remove unused solver cost tracking Solver convergence now uses the line-search improvement directly, so the accumulated objective cost is no longer consumed by runtime code. Drop the internal cost workspace and the cost atomics from constraint update. * Remove unused solver Gauss tracking Drop the stored Gauss cost from solver contexts now that line search tracks shifted objective improvements directly. This removes the per-iteration Gauss update kernels and keeps the line-search Gauss polynomial delta-only. * Fix inactive shifted line-search costs Preserve the zero-cost baseline for one-sided rows that start inactive and become active during line search. Keep the shifted quadratic form fast with an explicit offset, and use cost-only helpers where alpha-zero subtraction does not need derivatives. * Adapt context types to main * Adapt island line search to cleanup * Adapt line-search test to main * Fix island line-search cost rebase * Format solver module * Collapse shifted helpers and clarify offset Drops _eval_pt_direct_shifted (no-offset) into the offset variant with offset=0.0, and renames the offset variant to drop the redundant _offset suffix. Same for the 3-alpha pair. Adds an offset variable name and a comment at the four limit/contact branches that motivates why the offset is needed when alpha=0 was inactive. * Apply ruff-format to solver * Address line-search review comments Hoist the per-contact reads in the parallel line-search kernel so the impratio index, friction, and quad values are loaded once and shared by both elliptic-cost evaluations instead of being re-read for each call. This matches the iterative kernel. 
Rename the test helper update_mujoco_constraints to update_constraints and rename test_linesearch_accepts_sub_ulp_improvement to test_linesearch_accepts_sub_float32_improvement to avoid the unexplained "ulp" abbreviation.
…ogle-deepmind#1402) * Disable MathDx GEMM for blocked Cholesky tile_matmul The blocked Cholesky performs small (16x16) upper-convention (U^T U) rank-k updates. Warp's register-blocked scalar GEMM is faster than cuBLASDx for that left-transpose pattern at this tile size, and skipping cuBLASDx also avoids its LTO compile cost. Set enable_mathdx_gemm=False via per-kernel module_options on the blocked-Cholesky kernels. Also switch the existing dense-JTDAJ disable to the same per-kernel mechanism and remove the now-unused scoped_mathdx_gemm_disabled helper, so the option lives on the kernel definition rather than mutating the global warp config around each launch. * Tune JTDAJ_dense block_dim to 128 for the scalar GEMM The dense JTDAJ tile_matmul runs on Warp's scalar GEMM (MathDx disabled). Its optimal block_dim for the scalar path is 128, whereas 96 was tuned for the cuBLASDx path. Only affects dense models (nv <= 32).
* Optimize h.zero_ call * Update memory allocation * Remove dangerous return --------- Co-authored-by: Taylor Howell <taylorhowell@google.com>
The island line search was the last solver path still comparing absolute constraint costs. When absolute costs are large, the accepted improvement can fall below float32 resolution, so previous_cost - current_cost rounds to zero and the line search refuses an otherwise-valid step. Evaluate line-search candidates as deltas from alpha=0, matching the monolithic iterative and parallel kernels: drop the constant gauss cost, make the alpha=0 point the zero-cost reference, evaluate each constraint contribution as cost(alpha) - cost(0) via the shifted helpers, and compare candidate costs against zero. Gradient and hessian are offset-free and are left unchanged, so the Newton step and bracketing logic are untouched. Add a direct island line-search unit test that drives the kernel on a single island whose large gauss cost hides a sub-resolution improvement.
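The float32 cancellation described above can be reproduced without Warp. The sketch below uses Python's `struct` module to round values to float32; the cost magnitudes are made-up numbers chosen only to trigger the effect, and `f32` is a hypothetical helper.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


base = 1.0e8           # large constant part of the cost (assumed value)
improvement = -1.0e-3  # true cost change along the search direction

# Absolute comparison: both costs round to the same float32 value,
# so the difference is exactly zero and the step looks useless.
cost0 = f32(base)
cost1 = f32(base + improvement)
absolute_delta = f32(cost1 - cost0)

# Shifted comparison: evaluate cost(alpha) - cost(0) directly, so the
# constant part never enters the arithmetic.
shifted_delta = f32(improvement)
```

With the shifted form the candidate is compared against zero, so the small negative delta survives and the step can be accepted.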
_update_gradient_init_h_sparse wrote the mass matrix into the lower triangle of h via M_elemid[i, j], which is only populated where the column is an ancestor of the row (the lower triangle). The constraint term from _JTDAJ_sparse and the Cholesky factorization both use the upper triangle, so the upper-triangle Hessian was assembled without the mass matrix, producing an incorrect Newton Hessian. Transpose the lookup to M_elemid[j, i] so M lands in the upper triangle, and skip writing the lower triangle entirely since it is never read. The solver still converges with the wrong Hessian (Newton degrades toward gradient descent), so existing tests pass; the symptom is a large performance regression on the sparse path.
…model degrees of freedom (google-deepmind#1404)
The dense-Jacobian branch of _update_gradient_JTCJ_island skips an iteration when J[ic1, idof_i] is zero, but in the dim1 != dim2 case the swap-pair contribution hcone * J[ic2, idof_i] * J[ic1, idof_j] still needs to write — the swap-pair-only block under 'if dim1id != dim2id' only ran when J[ic1, i] AND J[ic2, j] were both nonzero. Cells where exactly one of those four J entries was zero silently dropped the corresponding cone contribution. Per cell H[a, b] for an off-diagonal cone pair (dim1, dim2) the contribution should be hcone * (J[ic1, a] * J[ic2, b] + J[ic2, a] * J[ic1, b]) which the monolithic _update_gradient_JTCJ_dense computes correctly (solver.py:2700-2704). This change keeps the original loop (which already handles the first term) and adds a second loop with the same shape but ic1/ic2 swapped, gated on dim1id != dim2id, for the second term. Both loops keep the aggressive J==0 early-skip from the original. The existing scalar _cholesky_factorize_solve_island absorbs the resulting indefinite per-island H via its mid-factorization clamp (s <= 1e-6 -> s = 1e-6), so no existing test fails. The bug becomes visible when the per-island solve uses any Cholesky variant without diagonal clamping (e.g. wp.tile_cholesky on the per-island H sub-block), producing NaN qacc on configurations such as constraints.xml keyframe 2 with elliptic cone + Newton + dense Jacobian.
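A small dense example makes the dropped term visible. This is a simplified plain-Python model of the gating described above, not the actual kernel: `reference`, `gated_single_loop`, and `two_loops` are hypothetical names, and the 2-dof Jacobian rows are invented so that each row has exactly one nonzero entry.

```python
def reference(J, ic1, ic2, hcone, nv):
    """H[a][b] = hcone * (J[ic1][a]*J[ic2][b] + J[ic2][a]*J[ic1][b])."""
    return [[hcone * (J[ic1][a] * J[ic2][b] + J[ic2][a] * J[ic1][b])
             for b in range(nv)] for a in range(nv)]


def accumulate(H, J, r1, r2, hcone, nv):
    """Add hcone * J[r1][a] * J[r2][b], skipping zero Jacobian entries."""
    for a in range(nv):
        if J[r1][a] == 0.0:
            continue
        for b in range(nv):
            if J[r2][b] == 0.0:
                continue
            H[a][b] += hcone * J[r1][a] * J[r2][b]


def gated_single_loop(J, ic1, ic2, hcone, nv):
    """Old pattern: the swap term only runs where the first term's gates pass."""
    H = [[0.0] * nv for _ in range(nv)]
    for a in range(nv):
        if J[ic1][a] == 0.0:
            continue
        for b in range(nv):
            if J[ic2][b] == 0.0:
                continue
            H[a][b] += hcone * J[ic1][a] * J[ic2][b]
            if ic1 != ic2:
                H[a][b] += hcone * J[ic2][a] * J[ic1][b]
    return H


def two_loops(J, ic1, ic2, hcone, nv):
    """Fixed pattern: a second loop with ic1/ic2 swapped, each with its own gates."""
    H = [[0.0] * nv for _ in range(nv)]
    accumulate(H, J, ic1, ic2, hcone, nv)
    if ic1 != ic2:
        accumulate(H, J, ic2, ic1, hcone, nv)
    return H


J = [[1.0, 0.0],   # row ic1 = 0
     [0.0, 2.0]]   # row ic2 = 1
H_ref = reference(J, 0, 1, 0.5, 2)
H_old = gated_single_loop(J, 0, 1, 0.5, 2)
H_new = two_loops(J, 0, 1, 0.5, 2)
```

In this example the gated single loop leaves H[1][0] at zero while the reference value is 1.0, so the assembled block is no longer symmetric.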
* Apply changes from cl/925389681: Implement tiled kernels for CG solver helpers * Fix pre-submit failures: remove unused import, re-order solve_cg_finalize parameters and adjust launch arguments * Apply ruff formatting fixes to solver.py * Specialize CG block dimensions and resolve review comments * Fix solver context initialization for island solver * fixed formatting * add _ * Remove dead kernels, redundant zeroing; dynamic block_dim * fixed formatting * remove block_dim from else branch
…d#1413) The sparse elliptic-cone Hessian assembly (JTCJ) previously ran output-stationary: one thread per (contact, dense dof-pair), scanning the contact's column indices to locate each pair. When nefc >> nv the overwhelming majority of those dof-pairs do not appear in the contact's support, so the kernel spent almost all of its time scanning and skipping. Restructure it to be input-stationary: launch one thread per (contact, support-pair), decode the pair index directly into the two participating dofs, register-sum the cone block, and accumulate into H with a single atomic add. This touches only the pairs that actually contribute and exposes far more parallelism. The pair dimension is bounded by jtcj_max_pairs, derived once in io.py from the deepest geom-body dof chain. The math is unchanged. Co-authored-by: Alain Denzler <adenzler@nvidia.com>
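One way to picture the pair-index decode is a triangular enumeration over a contact's support dofs. The commit message does not state the actual encoding, so the scheme below is an assumption for illustration only; `decode_pair` and the `support` list are hypothetical.

```python
def decode_pair(p, n):
    """Map p in [0, n*(n+1)//2) to (i, j) with 0 <= i <= j < n.

    Assumed row-major enumeration of the upper triangle, decoded with
    integer arithmetic only.
    """
    i = 0
    row_len = n
    while p >= row_len:
        p -= row_len
        row_len -= 1
        i += 1
    return i, i + p


# Hypothetical support of one contact: the dofs its Jacobian rows touch.
support = [4, 7, 9]
n = len(support)
npairs = n * (n + 1) // 2

# One "thread" per (contact, pair): decode, then map to global dof indices.
local_pairs = [decode_pair(p, n) for p in range(npairs)]
dof_pairs = [(support[i], support[j]) for i, j in local_pairs]
```

Each pair index touches exactly one contributing dof pair, so no thread spends time scanning and skipping non-support pairs.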
Semantic re-implementation of mar-yan24's count->scan->emit pipeline (google-deepmind#1300) on top of the post-a23500c constraint kernels, since the original branch predates the efc_contact rewrite and 95 other upstream commits and cannot be rebased mechanically.

When opt.deterministic=True, the racy wp.atomic_add(nefc/efc_nnz) slot allocation in every constraint family is replaced with: a count kernel that writes per-thread row/nnz counts, a per-world exclusive scan that converts counts to offsets and bumps the totals, and emit kernels that read base+offset instead of atomic results. Constraint rows therefore land at identical positions on every run of the same input.

Differences from the original google-deepmind#1300 branch:
- Contact families use a single _efc_contact_count kernel mirroring the unified _efc_contact_init (upstream a23500c replaced the per-cone _contact_pyramidal/_contact_elliptic kernels this no longer touches).
- _equality_flex_count honors eq_active (upstream 80bbba7).
- All new kernel factories are wrapped in @cache_kernel (upstream f0e2d81) so repeated step() calls do not recreate kernels.
- Host-side overflow validation is skipped while a CUDA graph capture is active, making opt.deterministic capture-safe (the original branch documented capture as unsupported).
- Ports the expanded determinism regression tests, minus the benchmark coverage that targeted the pre-google-deepmind#1301 benchmark API.

Verified: full suite 1101 passed / 23 skipped; constraint rows (nefc, efc.type/id, sparse J structure and values) bitwise stable across same-process trials with CUDA graph capture; ~9% overhead at 1 world, ~24% at 256 worlds on RTX PRO 6000 Blackwell.

Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
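The count -> scan -> emit idea can be sketched for a single world in plain Python. This is a serial model under stated assumptions, not the Warp kernels: `exclusive_scan` and `emit` are hypothetical names, and the `order` argument only simulates threads finishing in different orders.

```python
def exclusive_scan(counts):
    """Convert per-thread counts to start offsets; also return the total."""
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    return offsets, total


def emit(counts, offsets, base, order):
    """Write each thread's rows at base + offset, in the given thread order."""
    rows = {}
    for tid in order:
        for k in range(counts[tid]):
            rows[base + offsets[tid] + k] = (tid, k)
    return rows


# Count step: each thread reports how many constraint rows it will produce.
counts = [2, 0, 3, 1]

# Scan step: offsets are a pure function of the counts.
offsets, total = exclusive_scan(counts)

# Emit step: row positions do not depend on which thread runs first.
rows_forward = emit(counts, offsets, 0, [0, 1, 2, 3])
rows_reverse = emit(counts, offsets, 0, [3, 2, 1, 0])
```

With an atomic counter the slot a thread receives depends on arrival order; here the slot depends only on the thread index and the counts, which is what makes the layout repeatable.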
Same note as on #4 about the CONFLICTING badge: this is a port stacked on the rebased determinism1, so a textual merge into …

One thing to highlight for review since it goes beyond the original branch: …
Semantic port of this branch's deterministic constraint row allocation onto current upstream main (which is 96+ commits ahead, including the a23500c efc_contact rewrite that made a mechanical rebase impossible — see google-deepmind#1300 (comment)).
What's preserved from your design:
- `_ensure_det_scratch` (allocated once per (m, d))
- `wp.static(deterministic)` dispatch inside kernel factories so non-det mode compiles to the original atomic code

What's adapted to current main:
- `_efc_contact_count` kernel mirroring the unified `_efc_contact_init` from a23500c (replaces the per-cone `_contact_pyramidal`/`_contact_elliptic` count kernels; those emit kernels no longer exist upstream)
- `_equality_flex_count` honors `eq_active` (upstream 80bbba7)
- `@cache_kernel` (upstream f0e2d81): fixes kernel re-creation on every step
- `opt.deterministic=True`: host-side overflow validation is skipped while a capture is active (device-side overflow warning still fires on replays)
- … `benchmark()` API no longer exists)

Verified on RTX PRO 6000 Blackwell (sm_120, CUDA 13.3):
Stacks on top of #4 (the 1/N rebase).