Optimize blocked Cholesky: inplace factor and view-based backward sub by adenzler-nvidia · Pull Request #1306 · google-deepmind/mujoco_warp

adenzler-nvidia · 2026-04-22T10:23:01Z

Summary

Two small changes to mujoco_warp/_src/block_cholesky.py that speed up the blocked Cholesky kernels without altering their contract:

1. Use tile_cholesky_inplace on the diagonal block. Per Warp's contract, the inplace variant zeros the strict upper triangle, so storing A_kk_tile directly into L[k:end, k:end] is correct, and the later tile_lower_solve_inplace calls read only the lower triangle. With no separate L_kk_tile, the factor step needs half the shared memory.
2. Read x_j via tile_view(rhs_tile, …) in backward substitution. Forward substitution already did this for y_j, but backward substitution was re-loading x_j from global memory even though it is resident in the rhs_tile shared-memory buffer. The per-iteration tile_store is hoisted to a single coalesced final store. This eliminates O((N/bs)²) global loads plus one store per block iteration.
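To make the two changes concrete, here is a minimal plain-Python sketch. It is not the Warp kernel from this PR: `cholesky_inplace` and `backward_sub_blocked` are hypothetical helper names, the `rhs` list stands in for the resident rhs_tile, `x_global` stands in for the global output array, and the `loads`/`stores` counters are an assumed model of global-memory traffic.

```python
import math


def cholesky_inplace(a):
    """Factor an SPD matrix (list of lists) in place.

    On return the lower triangle holds L and the strict upper triangle is
    zeroed, so the same buffer can be used directly as the factor.
    """
    n = len(a)
    for j in range(n):
        d = a[j][j] - sum(a[j][k] ** 2 for k in range(j))
        a[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            a[i][j] = (a[i][j] - sum(a[i][k] * a[j][k] for k in range(j))) / a[j][j]
        for c in range(j + 1, n):
            a[j][c] = 0.0  # zero the strict upper triangle of row j
    return a


def backward_sub_blocked(L, y, bs, use_view):
    """Solve L^T x = y block by block, last block first.

    use_view=False models the old path: reload x_j from global memory and
    store x_i every iteration. use_view=True models the new path: read x_j
    in place from the resident buffer and store once at the end.
    """
    n = len(y)
    rhs = list(y)          # models the resident shared-memory rhs_tile
    x_global = [0.0] * n   # models the global output array
    loads = stores = 0
    for i0 in reversed(range(0, n, bs)):
        i1 = min(i0 + bs, n)
        # Subtract contributions of already-solved blocks j > i.
        for j0 in range(i1, n, bs):
            j1 = min(j0 + bs, n)
            if use_view:
                src = rhs          # x_j is already resident in rhs
            else:
                src = x_global     # reload x_j from global memory
                loads += 1
            for r in range(i0, i1):
                rhs[r] -= sum(L[c][r] * src[c] for c in range(j0, j1))
        # Solve the diagonal block L_ii^T x_i = rhs_i, bottom row first.
        for r in reversed(range(i0, i1)):
            s = rhs[r] - sum(L[c][r] * rhs[c] for c in range(r + 1, i1))
            rhs[r] = s / L[r][r]
        if not use_view:
            x_global[i0:i1] = rhs[i0:i1]  # one store per block iteration
            stores += 1
    if use_view:
        x_global[:] = rhs  # single final store
        stores += 1
    return x_global, loads, stores


n, bs = 6, 2
A = [[float(min(i, j) + 1) for j in range(n)] for i in range(n)]  # SPD test matrix
L = cholesky_inplace(A)
x_true = [float(i + 1) for i in range(n)]
y = [sum(L[c][r] * x_true[c] for c in range(r, n)) for r in range(n)]  # y = L^T x_true

x_reload, loads_reload, stores_reload = backward_sub_blocked(L, y, bs, use_view=False)
x_view, loads_view, stores_view = backward_sub_blocked(L, y, bs, use_view=True)
```

With nb = N/bs blocks, the reload path in this model issues nb(nb−1)/2 block loads and nb stores, while the view path issues no reloads and one store, which is where the O((N/bs)²) figure comes from.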

Benchmarks

mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 --nstep=500 on RTX PRO 6000 Blackwell, kernel times from nsys profile --cuda-graph-trace=node:

| Kernel | Invocations | Main median | Branch median | Δ median | Main avg | Branch avg | Δ avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `update_gradient_cholesky_blocked` (factor+solve) | 500 | 703 µs | 703 µs | 0% | 801 µs | 711 µs | −11.3% |
| `update_gradient_cholesky_blocked_skip_unchanged` (solve-mostly) | ~3930 | 153 µs | 148 µs | −3.1% | 287 µs | 281 µs | −2.0% |

Combined cholesky kernel time: 1527 → 1464 ms (−4.1%).

The tail-latency drop on the factor+solve path is notable: stddev 276 → 70 µs, max 2236 → 1809 µs. The extra L_kk_tile allocation was likely pushing occupancy down on some launches.

Correctness

uv run pytest mujoco_warp/_src/support_test.py -k test_block_cholesky — passes.
uv run pytest mujoco_warp/_src/solver_test.py — all 40 cases pass.

Test plan

CI passes.

1. Use wp.tile_cholesky_inplace on the diagonal block instead of wp.tile_cholesky. The inplace variant zeros the strict upper triangle and avoids allocating a separate L_kk tile, so the shared-memory footprint of the factor step is halved.
2. In backward substitution, read previously-computed x_j blocks as a tile_view into the resident rhs_tile rather than re-loading them from global x. The per-iteration wp.tile_store into x is hoisted to a single coalesced store of rhs_tile at the end of the kernel. This eliminates O((N/bs)^2) global loads plus a store per iteration.

Correctness validated by the existing test_block_cholesky and the full solver_test suite.

Kernel-level A/B on three_humanoids (nworld=8192, nstep=500):
- update_gradient_cholesky_blocked (factor+solve): avg 801 -> 711 us (-11.3%)
- update_gradient_cholesky_blocked_skip_unchanged (solve-mostly): median 153 -> 148 us (-3.1%), avg 287 -> 281 us (-2.0%)
- Combined cholesky: 1527 -> 1464 ms (-4.1%)
- Tail on factor+solve: stddev 276 -> 70 us, max 2236 -> 1809 us

thowell approved these changes Apr 22, 2026


adenzler-nvidia merged commit 27f12f9 into google-deepmind:main Apr 22, 2026
10 checks passed

adenzler-nvidia merged 1 commit into google-deepmind:main from adenzler-nvidia:adenzler/cholesky-view-opt
