Improve blocked Cholesky solver performance in Kamino#2517
Conversation
On the sizes in benchmark_llt_blocked.py (n=32..1000), the prior defaults (block_size=16, solve_block_dim=64) left significant throughput on the table. Switching the LLTBlockedSolver defaults to block_size=32 with solve_block_dim=128 and aligning the solve launcher defaults / benchmark flags delivers ~2-5x faster factor and ~1.5-3x faster solve across the tested range: n=128: factor 72us -> 45us, solve 41us -> 23us n=320: factor 822us -> 413us, solve 217us -> 108us n=1000: factor 57ms -> 19ms, solve 4.0ms -> 2.0ms No algorithmic changes; kernels and padding logic are unchanged.
📝 WalkthroughWalkthroughReplaces temporary tile allocations with in-place operations in blocked Cholesky factorize and solve kernels (negative gemm accumulation, in-place Cholesky, in-place triangular solves). Hoists and pads diagonal tiles. Also updates solver defaults: Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 389-393: Comments referencing a removed file
(benchmark_llt_blocked.py) should be updated around the block_dim parameter and
the other similar comment blocks (references at the block_dim declaration and
the nearby comment blocks ~421-425 and ~457-459); remove the broken filename
link and replace it with a short, self-contained rationale: state that
empirically 128 threads (4 warps) gives best throughput for large tiles while
small matrices prefer 64, note callers may override, and/or point to a generic
"per-size benchmarks" record (or commit/CI artifact) rather than a file path;
update the comment text adjacent to the block_dim declaration and the two other
comment locations to reflect this concise rationale.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 323ea4b4-a7b4-4f00-98da-5b66b7a6149e
📒 Files selected for processing (2)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.pynewton/_src/solvers/kamino/_src/linalg/linear.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Guirec-Maloisel
left a comment
There was a problem hiding this comment.
Looks good to me, on my machine the max throughput for dr legs is increased by 25% (using the fastest solver configuration).
Please just quickly edit the commits that refer to the benchmark (cf code rabbit comments)
chschuma-disney
left a comment
There was a problem hiding this comment.
Fantastic work, looks good to me.
Since Guirec already did the extensive benchmarking, I did a quick informal test on a single-world interactive BDX example. I saw a 15-20% increase in the frame rate.
195866e
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py (1)
389-458:⚠️ Potential issue | 🟡 MinorAdd a CHANGELOG entry for the default retunings.
The launcher
block_dimdefaults here flip from 64 → 128 forllt_blocked_solveandllt_blocked_solve_inplace, and the PR also bumpsLLTBlockedSolver(block_size=…, solve_block_dim=…)defaults (16→32, 64→128) inlinalg/linear.py. These propagate to users who instantiateLLTBlockedSolverwithout overrides and change observed performance/resource characteristics, so aChangedentry inCHANGELOG.mdunder[Unreleased]with a one-line note on the new defaults (and how to restore the old values via explicit kwargs) would satisfy the repo policy.As per coding guidelines: "Check that any user-facing change includes a corresponding entry in CHANGELOG.md under the [Unreleased] section … For Deprecated, Changed, and Removed entries, verify the entry includes migration guidance telling users what replaces the old behavior."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py` around lines 389 - 458, Add a "Changed" entry under the [Unreleased] section of CHANGELOG.md noting that the default thread block sizes were retuned (llt_blocked_solve and llt_blocked_solve_inplace default block_dim changed from 64→128 and LLTBlockedSolver defaults solve_block_dim/block_size changed 64→128 and 16→32 respectively), and include one-line migration guidance that users can restore prior behavior by passing explicit kwargs (e.g., block_dim or solve_block_dim/block_size) when constructing LLTBlockedSolver or calling llt_blocked_solve/llt_blocked_solve_inplace.
🧹 Nitpick comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py (1)
246-289: Solve kernel hoist + in-place refactor looks good; one thing worth confirming.Hoisting
L_diagabove thejloop and switching totile_matmul(..., alpha=-1.0)+tile_lower_solve_inplace/tile_upper_solve_inplace(tile_transpose(L_diag), …)is correct and matches the pattern inllt_blocked_solve_inplace_kernel.Please double-check one asymmetry: in the backward sub you re-pad
L_diagto identity on OOB rows/cols (lines 268–279), but in the forward sub (lines 251–257)L_diagis used directly fromtile_loadwithout any padding branch. That's only safe becausemake_llt_blocked_factorize_kernelwrites identity into the padded diagonal ofL; if a future change ever skipped that (or a non-factorize producer fedLhere), the forwardtile_lower_solve_inplaceon the trailing block would divide by garbage. Consider either (a) dropping the backward re-pad (redundant given the factorize invariant) or (b) adding the same pad in the forward path, so the two are symmetric and the invariant is locally enforced rather than implicit.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py` around lines 246 - 289, Forward path currently uses L_diag from wp.tile_load without padding while backward pads the diagonal for trailing blocks; make them symmetric by adding the same OOB-padding logic before the forward wp.tile_lower_solve_inplace call. In the forward loop (the block that loads L_diag and then calls wp.tile_lower_solve_inplace), detect the trailing block condition (i + block_size > n_i) and run the same padding routine used in the backward path (using tid_block, num_threads_per_block, num_tile_elements, computing row/col, using wp.where to inject identity on OOB entries) so L_diag is safe for the lower solve; alternatively, if you prefer the other approach, remove the backward-pad block instead—pick one and apply the change where L_diag is loaded in the forward loop to ensure symmetry with the existing backward padding.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 389-458: Add a "Changed" entry under the [Unreleased] section of
CHANGELOG.md noting that the default thread block sizes were retuned
(llt_blocked_solve and llt_blocked_solve_inplace default block_dim changed from
64→128 and LLTBlockedSolver defaults solve_block_dim/block_size changed 64→128
and 16→32 respectively), and include one-line migration guidance that users can
restore prior behavior by passing explicit kwargs (e.g., block_dim or
solve_block_dim/block_size) when constructing LLTBlockedSolver or calling
llt_blocked_solve/llt_blocked_solve_inplace.
---
Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 246-289: Forward path currently uses L_diag from wp.tile_load
without padding while backward pads the diagonal for trailing blocks; make them
symmetric by adding the same OOB-padding logic before the forward
wp.tile_lower_solve_inplace call. In the forward loop (the block that loads
L_diag and then calls wp.tile_lower_solve_inplace), detect the trailing block
condition (i + block_size > n_i) and run the same padding routine used in the
backward path (using tid_block, num_threads_per_block, num_tile_elements,
computing row/col, using wp.where to inject identity on OOB entries) so L_diag
is safe for the lower solve; alternatively, if you prefer the other approach,
remove the backward-pad block instead—pick one and apply the change where L_diag
is loaded in the forward loop to ensure symmetry with the existing backward
padding.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: c4ecd031-6f47-4a54-9e43-cbec8e2a04df
📒 Files selected for processing (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py
The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))
Description
Improve blocked Cholesky solver performance in Kamino. The Cholesky kernels also seem to compile significantly faster when using their inplace variants. Below is the AI report but we should verify on real scenes and different PCs.
Commit breakdown
b32ee0eftile_matmulin-place accumulation,tile_cholesky_inplace,tile_lower_solve_inplacec10ad4727518c0efblock_size16→32,solve_block_dim64→128d72a2839Apples-to-apples
Isolates
b32ee0ef's kernel change, both atbs=32 dim=128.Kernel rewrite alone buys ~4–12%.
End-to-end user-facing
Each commit with its own production defaults:
block_size=16,solve_block_dim=64,factorize_block_dim=128block_size=32,solve_block_dim=128,factorize_block_dim=128Summary
b32ee0ef): modest ~5–10% win. Switching matmul to in-place accumulation and using the_inplacevariants ofcholesky/lower_solveremoves temporary tile allocations.7518c0ef): the dominant contribution.block_size=32cuts the number of outer iterations in half vs.bs=16, andsolve_block_dim=128doubles the threads hiding latency over=64.bs=32; solve is still notably faster.chol_err/res_normcolumns match exactly between main and HEAD at every size.Measurements taken on the RTX PRO 6000 with the benchmark recovered from
c10ad472, 100 iters × best-of-3 per size, Warp kernel cache cleared between commits, GPU idle.Checklist
CHANGELOG.mdhas been updated (if user-facing change)Test plan
Summary by CodeRabbit