Optimize blocked Cholesky: inplace factor and view-based backward sub #1306

Merged: adenzler-nvidia merged 1 commit into google-deepmind:main from adenzler-nvidia:adenzler/cholesky-view-opt on Apr 22, 2026

@adenzler-nvidia

Collaborator

Summary

Two small changes to mujoco_warp/_src/block_cholesky.py that speed up the blocked Cholesky kernels without altering their contract:

  1. Use tile_cholesky_inplace on the diagonal block. Per Warp's contract the inplace variant zeros the strict upper triangle, so storing A_kk_tile as L[k:end, k:end] is correct, and the later tile_lower_solve_inplace calls read only the lower triangle. This halves shared memory for the factor step because no separate L_kk_tile is allocated.

  2. Read x_j via tile_view(rhs_tile, …) in backward substitution. Forward substitution already did this for y_j, but backward substitution was re-loading x_j from global memory even though it is resident in the rhs_tile shared-memory buffer. The per-iteration tile_store is also hoisted to a single coalesced final store. Together this eliminates O((N/bs)²) global loads plus one store per block iteration.
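
To make the two changes concrete, here is a rough NumPy sketch of the same idea. It is not the Warp kernel: the function names blocked_cholesky and blocked_solve are hypothetical, np.linalg.cholesky and np.linalg.solve stand in for the tile factor and triangular-solve calls, plain arrays stand in for shared-memory tiles, and a left-looking block layout is assumed.

```python
import numpy as np


def blocked_cholesky(A, bs):
    """Return lower-triangular L with L @ L.T == A, built block by block."""
    n = A.shape[0]
    L = np.zeros_like(A, dtype=float)
    for k in range(0, n, bs):
        e = min(k + bs, n)
        # Stand-in for loading the diagonal block into a shared-memory tile.
        A_kk = np.array(A[k:e, k:e], dtype=float)
        A_kk -= L[k:e, :k] @ L[k:e, :k].T  # remove already-factored columns
        # "In-place" factor: overwrite the same buffer instead of allocating
        # a second L_kk tile. The result has a zero strict upper triangle.
        A_kk[:] = np.linalg.cholesky(A_kk)
        L[k:e, k:e] = A_kk
        if e < n:
            # Panel below the diagonal: L_ik = (A_ik - sum_j L_ij L_kj^T) L_kk^-T
            A_ik = A[e:, k:e] - L[e:, :k] @ L[k:e, :k].T
            L[e:, k:e] = np.linalg.solve(A_kk, A_ik.T).T
    return L


def blocked_solve(L, b, bs):
    """Solve (L @ L.T) x = b, keeping all intermediates in one rhs buffer."""
    n = L.shape[0]
    rhs = np.array(b, dtype=float)  # stand-in for the resident rhs_tile
    # Forward substitution, L y = b: earlier y_j blocks are read as views.
    for i in range(0, n, bs):
        ie = min(i + bs, n)
        for j in range(0, i, bs):
            rhs[i:ie] -= L[i:ie, j:j + bs] @ rhs[j:j + bs]
        rhs[i:ie] = np.linalg.solve(L[i:ie, i:ie], rhs[i:ie])
    # Backward substitution, L^T x = y: x_j is also read from rhs,
    # not re-loaded from a separate output array.
    for i in reversed(range(0, n, bs)):
        ie = min(i + bs, n)
        for j in range(ie, n, bs):
            je = min(j + bs, n)
            rhs[i:ie] -= L[j:je, i:ie].T @ rhs[j:je]
        rhs[i:ie] = np.linalg.solve(L[i:ie, i:ie].T, rhs[i:ie])
    return rhs.copy()  # single final store of the whole buffer


rng = np.random.default_rng(0)
M = rng.standard_normal((10, 10))
A = M @ M.T + 10.0 * np.eye(10)  # symmetric positive definite test matrix
b = rng.standard_normal(10)
L = blocked_cholesky(A, bs=4)
x = blocked_solve(L, b, bs=4)
print(np.allclose(L @ L.T, A), np.allclose(A @ x, b))
```

With nb = N/bs blocks, the inner loop of the backward pass runs nb(nb − 1)/2 times, which is where the O((N/bs)²) count of avoided x_j loads comes from.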

Benchmarks

Measured with mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 --nstep=500 on an RTX PRO 6000 Blackwell. Kernel times are from nsys profile --cuda-graph-trace=node:

| Kernel | Invocations | Main median | Branch median | Δmed | Main avg | Branch avg | Δavg |
|---|---|---|---|---|---|---|---|
| update_gradient_cholesky_blocked (factor+solve) | 500 | 703 µs | 703 µs | 0% | 801 µs | 711 µs | −11.3% |
| update_gradient_cholesky_blocked_skip_unchanged (solve-mostly) | ~3930 | 153 µs | 148 µs | −3.1% | 287 µs | 281 µs | −2.0% |

Combined cholesky kernel time: 1527 → 1464 ms (−4.1%).
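
As a sanity check on the combined figure, the sketch below assumes it is the sum of invocations times average kernel time over both kernels; that interpretation is an assumption, not something stated by the profiler output. The inputs are the rounded values reported above, with the approximate invocation count taken as 3930.

```python
# Rounded figures as reported in the benchmark numbers above.
kernels = {
    "factor+solve": {"calls": 500, "main_avg_us": 801, "branch_avg_us": 711},
    # The invocation count is reported as approximately 3930.
    "solve-mostly": {"calls": 3930, "main_avg_us": 287, "branch_avg_us": 281},
}


def total_ms(key):
    """Sum calls * average time over all kernels, converted from us to ms."""
    return sum(k["calls"] * k[key] for k in kernels.values()) / 1000.0


def rel_change(before, after):
    """Relative change in percent; negative means faster."""
    return 100.0 * (after - before) / before


main_total = total_ms("main_avg_us")
branch_total = total_ms("branch_avg_us")
print(f"main   ~ {main_total:.0f} ms")
print(f"branch ~ {branch_total:.0f} ms")
print(f"reported combined change: {rel_change(1527, 1464):.1f}%")
```

The reconstructed totals land within a few milliseconds of the reported 1527 ms and 1464 ms. Percentages recomputed from the rounded per-kernel values can differ from the reported ones in the last digit, presumably because the reported deltas were computed from unrounded timings.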

The tail-latency drop on the factor+solve path is notable: stddev 276 → 70 µs, max 2236 → 1809 µs. A likely cause is that the extra L_kk_tile allocation was pushing occupancy down on some launches.

Correctness

  • uv run pytest mujoco_warp/_src/support_test.py -k test_block_cholesky — passes.
  • uv run pytest mujoco_warp/_src/solver_test.py — all 40 cases pass.

Test plan

  • CI passes.

1. Use wp.tile_cholesky_inplace on the diagonal block instead of
   wp.tile_cholesky. The inplace variant zeros the strict upper triangle
   and avoids allocating a separate L_kk tile, so the shared-memory
   footprint of the factor step is halved.

2. In backward substitution, read previously-computed x_j blocks as a
   tile_view into the resident rhs_tile rather than re-loading them from
   global x. The per-iteration wp.tile_store into x is hoisted to a
   single coalesced store of rhs_tile at the end of the kernel. This
   eliminates O((N/bs)^2) global loads plus a store per iteration.

Correctness validated by the existing test_block_cholesky and the full
solver_test suite.

Kernel-level A/B on three_humanoids (nworld=8192, nstep=500):
- update_gradient_cholesky_blocked (factor+solve): avg 801 -> 711 us (-11.3%)
- update_gradient_cholesky_blocked_skip_unchanged (solve-mostly):
  median 153 -> 148 us (-3.1%), avg 287 -> 281 us (-2.0%)
- Combined cholesky: 1527 -> 1464 ms (-4.1%)
- Tail on factor+solve: stddev 276 -> 70 us, max 2236 -> 1809 us

adenzler-nvidia merged commit 27f12f9 into google-deepmind:main on Apr 22, 2026
10 checks passed