Optimize blocked Cholesky: inplace factor and view-based backward sub#1306
Merged
adenzler-nvidia merged 1 commit intoApr 22, 2026
Merged
Conversation
1. Use wp.tile_cholesky_inplace on the diagonal block instead of wp.tile_cholesky. The inplace variant zeros the strict upper triangle and avoids allocating a separate L_kk tile, so the shared-memory footprint of the factor step is halved. 2. In backward substitution, read previously-computed x_j blocks as a tile_view into the resident rhs_tile rather than re-loading them from global x. The per-iteration wp.tile_store into x is hoisted to a single coalesced store of rhs_tile at the end of the kernel. This eliminates O((N/bs)^2) global loads plus a store per iteration. Correctness validated by the existing test_block_cholesky and the full solver_test suite. Kernel-level A/B on three_humanoids (nworld=8192, nstep=500): - update_gradient_cholesky_blocked (factor+solve): avg 801 -> 711 us (-11.3%) - update_gradient_cholesky_blocked_skip_unchanged (solve-mostly): median 153 -> 148 us (-3.1%), avg 287 -> 281 us (-2.0%) - Combined cholesky: 1527 -> 1464 ms (-4.1%) - Tail on factor+solve: stddev 276 -> 70 us, max 2236 -> 1809 us
thowell
approved these changes
Apr 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two small changes to
mujoco_warp/_src/block_cholesky.pythat speed up the blocked Cholesky kernels without altering their contract:tile_cholesky_inplaceon the diagonal block. The inplace variant zeros the strict upper triangle per Warp's contract, so storingA_kk_tileasL[k:end, k:end]is correct. Latertile_lower_solve_inplacecalls read only the lower triangle. Halves shared memory for the factor step (no separateL_kk_tile).Read
x_jviatile_view(rhs_tile, …)in backward substitution. Forward substitution already did this fory_j; backward substitution was re-loadingx_jfrom global memory even though it's resident in therhs_tileshared-memory buffer. The per-iterationtile_storeis hoisted to a single coalesced final store. Eliminates O((N/bs)²) global loads plus one store per block iteration.Benchmarks
mjwarp-testspeed benchmarks/humanoid/three_humanoids.xml --nworld=8192 --nconmax=100 --njmax=192 --nstep=500on RTX PRO 6000 Blackwell, kernel times fromnsys profile --cuda-graph-trace=node:update_gradient_cholesky_blocked(factor+solve)update_gradient_cholesky_blocked_skip_unchanged(solve-mostly)Combined cholesky kernel time: 1527 → 1464 ms (−4.1%).
Tail-latency drop on the factor+solve path is notable: stddev 276 → 70 µs, max 2236 → 1809 µs — likely the extra
L_kk_tileallocation was pushing occupancy down on some launches.Correctness
uv run pytest mujoco_warp/_src/support_test.py -k test_block_cholesky— passes.uv run pytest mujoco_warp/_src/solver_test.py— all 40 cases pass.Test plan