Improve blocked Cholesky solver performance in Kamino by nvtw · Pull Request #2517 · newton-physics/newton

nvtw · 2026-04-21T11:58:58Z

Description

Improve blocked Cholesky solver performance in Kamino. The Cholesky kernels also seem to compile significantly faster when using their inplace variants. Below is the AI report but we should verify on real scenes and different PCs.

Commit breakdown

Commit	Change	Measured impact
`b32ee0ef`	Kernel-level perf: `tile_matmul` in-place accumulation, `tile_cholesky_inplace`, `tile_lower_solve_inplace`	4–12% kernel speedup (fixed args)
`c10ad472`	Add benchmark (no runtime change)	—
`7518c0ef`	Launcher default tuning: `block_size` 16→32, `solve_block_dim` 64→128	Biggest chunk of user-facing win
`d72a2839`	Remove benchmark file	— (tooling only)

Apples-to-apples

Isolates b32ee0ef's kernel change, both at bs=32 dim=128.

n	main factor → `b32ee0e`	main solve → `b32ee0e`
128	49.3 → 45.2 µs (1.09×)	24.8 → 22.7 µs (1.09×)
320	465 → 414 µs (1.12×)	117 → 108 µs (1.08×)
1000	20,227 → 19,385 µs (1.04×)	2,054 → 1,975 µs (1.04×)

Kernel rewrite alone buys ~4–12%.

End-to-end user-facing

Each commit with its own production defaults:

main: block_size=16, solve_block_dim=64, factorize_block_dim=128
HEAD: block_size=32, solve_block_dim=128, factorize_block_dim=128

n	factor main → HEAD	×	solve main → HEAD	×
32	7.5 → 7.4 µs	1.01×	9.0 → 9.0 µs	1.00×
70	28.8 → 30.8 µs	0.94×	24.7 → 18.6 µs	1.33×
128	76.0 → 45.2 µs	1.68×	44.0 → 22.7 µs	1.94×
192	226 → 113 µs	2.00×	92.2 → 45.1 µs	2.04×
257	769 → 485 µs	1.59×	302 → 162 µs	1.86×
320	940 → 413 µs	2.27×	248 → 108 µs	2.29×
401	2,689 → 1,427 µs	1.88×	690 → 342 µs	2.02×
512	3,608 → 1,527 µs	2.36×	602 → 262 µs	2.30×
768	11,945 → 4,864 µs	2.46×	1,414 → 576 µs	2.45×
1000	36,854 → 19,434 µs	1.90×	4,209 → 1,973 µs	2.13×

Summary

Kernel-level changes (b32ee0ef): modest ~5–10% win. Switching matmul to in-place accumulation and using the _inplace variants of cholesky/lower_solve removes temporary tile allocations.
Default tuning (7518c0ef): the dominant contribution. block_size=32 cuts the number of outer iterations in half vs. bs=16, and solve_block_dim=128 doubles the threads hiding latency over =64.
Net at typical sizes (n ≥ 128): ~1.7–2.5× faster factor, ~1.9–2.5× faster solve. At the extreme small end (n ≤ 70), factor is flat or marginally worse due to padding overhead at bs=32; solve is still notably faster.
Numerical accuracy identical — chol_err / res_norm columns match exactly between main and HEAD at every size.

Measurements taken on the RTX PRO 6000 with the benchmark recovered from c10ad472, 100 iters × best-of-3 per size, Warp kernel cache cleared between commits, GPU idle.

Checklist

New or existing tests cover these changes
The documentation is up to date with these changes
CHANGELOG.md has been updated (if user-facing change)

Test plan

Summary by CodeRabbit

Performance
- Improved Cholesky factorization and linear-solve routines using in-place matrix operations to reduce memory use and speed up solves.
Chores
- Tuned default block sizes for factorization and solve routines (larger blocks) to improve efficiency on common hardware.

On the sizes in benchmark_llt_blocked.py (n=32..1000), the prior defaults (block_size=16, solve_block_dim=64) left significant throughput on the table. Switching the LLTBlockedSolver defaults to block_size=32 with solve_block_dim=128 and aligning the solve launcher defaults / benchmark flags delivers ~2-5x faster factor and ~1.5-3x faster solve across the tested range: n=128: factor 72us -> 45us, solve 41us -> 23us n=320: factor 822us -> 413us, solve 217us -> 108us n=1000: factor 57ms -> 19ms, solve 4.0ms -> 2.0ms No algorithmic changes; kernels and padding logic are unchanged.

coderabbitai · 2026-04-21T11:59:13Z

📝 Walkthrough

Walkthrough

Replaces temporary tile allocations with in-place operations in blocked Cholesky factorize and solve kernels (negative gemm accumulation, in-place Cholesky, in-place triangular solves). Hoists and pads diagonal tiles. Also updates solver defaults: block_size → 32 and solve block_dim defaults → 128.

Changes

Cohort / File(s)	Summary
Kernel-level in-place optimizations `newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`	Refactored factorize and solve kernels to use in-place negative accumulation (`wp.tile_matmul(..., alpha=-1.0)`), in-place Cholesky (`wp.tile_cholesky_inplace(...)`), and in-place triangular solves (`wp.tile_lower_solve_inplace`, `wp.tile_upper_solve_inplace`). Hoists diagonal tile loads, pads diagonal tiles when needed, and removes intermediate solve/matmul temporaries. Updated solve launcher defaults from `block_dim=64` to `block_dim=128`.
Solver configuration defaults `newton/_src/solvers/kamino/_src/linalg/linear.py`	Adjusted `LLTBlockedSolver` constructor defaults: `block_size` increased from `16` to `32`, and `solve_block_dim` increased from `64` to `128`, aligning solver defaults with kernel launcher updates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly summarizes the main change: kernel-level optimizations and default tuning to improve blocked Cholesky solver performance in Kamino, which is clearly reflected in the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 389-393: Comments referencing a removed file
(benchmark_llt_blocked.py) should be updated around the block_dim parameter and
the other similar comment blocks (references at the block_dim declaration and
the nearby comment blocks ~421-425 and ~457-459); remove the broken filename
link and replace it with a short, self-contained rationale: state that
empirically 128 threads (4 warps) gives best throughput for large tiles while
small matrices prefer 64, note callers may override, and/or point to a generic
"per-size benchmarks" record (or commit/CI artifact) rather than a file path;
update the comment text adjacent to the block_dim declaration and the two other
comment locations to reflect this concise rationale.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 323ea4b4-a7b4-4f00-98da-5b66b7a6149e

📥 Commits

Reviewing files that changed from the base of the PR and between f954415 and d72a283.

📒 Files selected for processing (2)

newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py
newton/_src/solvers/kamino/_src/linalg/linear.py

codecov · 2026-04-21T12:16:40Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

camevor

nice speedup! lgtm

vastsoun

LGTM, amazing work @nvtw. These speedups are very significant. I'll give it a try on my machine later today and confirm.

Guirec-Maloisel

Looks good to me, on my machine the max throughput for dr legs is increased by 25% (using the fastest solver configuration).

Please just quickly edit the commits that refer to the benchmark (cf code rabbit comments)

chschuma-disney

Fantastic work, looks good to me.

Since Guirec already did the extensive benchmarking, I did a quick informal test on a single-world interactive BDX example. I saw a 15-20% increase in the frame rate.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py (1)
389-458: ⚠️ Potential issue | 🟡 Minor

Add a CHANGELOG entry for the default retunings.

The launcher block_dim defaults here flip from 64 → 128 for llt_blocked_solve and llt_blocked_solve_inplace, and the PR also bumps LLTBlockedSolver(block_size=…, solve_block_dim=…) defaults (16→32, 64→128) in linalg/linear.py. These propagate to users who instantiate LLTBlockedSolver without overrides and change observed performance/resource characteristics, so a Changed entry in CHANGELOG.md under [Unreleased] with a one-line note on the new defaults (and how to restore the old values via explicit kwargs) would satisfy the repo policy.

As per coding guidelines: "Check that any user-facing change includes a corresponding entry in CHANGELOG.md under the [Unreleased] section … For Deprecated, Changed, and Removed entries, verify the entry includes migration guidance telling users what replaces the old behavior."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py` around lines
389 - 458, Add a "Changed" entry under the [Unreleased] section of CHANGELOG.md
noting that the default thread block sizes were retuned (llt_blocked_solve and
llt_blocked_solve_inplace default block_dim changed from 64→128 and
LLTBlockedSolver defaults solve_block_dim/block_size changed 64→128 and 16→32
respectively), and include one-line migration guidance that users can restore
prior behavior by passing explicit kwargs (e.g., block_dim or
solve_block_dim/block_size) when constructing LLTBlockedSolver or calling
llt_blocked_solve/llt_blocked_solve_inplace.

🧹 Nitpick comments (1)

newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py (1)
246-289: Solve kernel hoist + in-place refactor looks good; one thing worth confirming.

Hoisting L_diag above the j loop and switching to tile_matmul(..., alpha=-1.0) + tile_lower_solve_inplace / tile_upper_solve_inplace(tile_transpose(L_diag), …) is correct and matches the pattern in llt_blocked_solve_inplace_kernel.

Please double-check one asymmetry: in the backward sub you re-pad L_diag to identity on OOB rows/cols (lines 268–279), but in the forward sub (lines 251–257) L_diag is used directly from tile_load without any padding branch. That's only safe because make_llt_blocked_factorize_kernel writes identity into the padded diagonal of L; if a future change ever skipped that (or a non-factorize producer fed L here), the forward tile_lower_solve_inplace on the trailing block would divide by garbage. Consider either (a) dropping the backward re-pad (redundant given the factorize invariant) or (b) adding the same pad in the forward path, so the two are symmetric and the invariant is locally enforced rather than implicit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py` around lines
246 - 289, Forward path currently uses L_diag from wp.tile_load without padding
while backward pads the diagonal for trailing blocks; make them symmetric by
adding the same OOB-padding logic before the forward wp.tile_lower_solve_inplace
call. In the forward loop (the block that loads L_diag and then calls
wp.tile_lower_solve_inplace), detect the trailing block condition (i +
block_size > n_i) and run the same padding routine used in the backward path
(using tid_block, num_threads_per_block, num_tile_elements, computing row/col,
using wp.where to inject identity on OOB entries) so L_diag is safe for the
lower solve; alternatively, if you prefer the other approach, remove the
backward-pad block instead—pick one and apply the change where L_diag is loaded
in the forward loop to ensure symmetry with the existing backward padding.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 389-458: Add a "Changed" entry under the [Unreleased] section of
CHANGELOG.md noting that the default thread block sizes were retuned
(llt_blocked_solve and llt_blocked_solve_inplace default block_dim changed from
64→128 and LLTBlockedSolver defaults solve_block_dim/block_size changed 64→128
and 16→32 respectively), and include one-line migration guidance that users can
restore prior behavior by passing explicit kwargs (e.g., block_dim or
solve_block_dim/block_size) when constructing LLTBlockedSolver or calling
llt_blocked_solve/llt_blocked_solve_inplace.

---

Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 246-289: Forward path currently uses L_diag from wp.tile_load
without padding while backward pads the diagonal for trailing blocks; make them
symmetric by adding the same OOB-padding logic before the forward
wp.tile_lower_solve_inplace call. In the forward loop (the block that loads
L_diag and then calls wp.tile_lower_solve_inplace), detect the trailing block
condition (i + block_size > n_i) and run the same padding routine used in the
backward path (using tid_block, num_threads_per_block, num_tile_elements,
computing row/col, using wp.where to inject identity on OOB entries) so L_diag
is safe for the lower solve; alternatively, if you prefer the other approach,
remove the backward-pad block instead—pick one and apply the change where L_diag
is loaded in the forward loop to ensure symmetry with the existing backward
padding.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: c4ecd031-6f47-4a54-9e43-cbec8e2a04df

📥 Commits

Reviewing files that changed from the base of the PR and between d72a283 and 195866e.

📒 Files selected for processing (1)

newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py

Done

The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))

nvtw added 4 commits April 21, 2026 12:06

Performance optimizations for blocked cholesky solver

b32ee0e

Add llt benchmark

c10ad47

Remove the benchmark file

d72a283

nvtw self-assigned this Apr 21, 2026

nvtw marked this pull request as draft April 21, 2026 11:59

coderabbitai Bot reviewed Apr 21, 2026

View reviewed changes

Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py Outdated

nvtw requested review from Guirec-Maloisel, camevor, chschuma-disney and vastsoun April 21, 2026 12:41

nvtw marked this pull request as ready for review April 21, 2026 12:49

camevor previously approved these changes Apr 21, 2026

View reviewed changes

vastsoun previously approved these changes Apr 22, 2026

View reviewed changes

Guirec-Maloisel previously requested changes Apr 22, 2026

View reviewed changes

chschuma-disney previously approved these changes Apr 22, 2026

View reviewed changes

Implement MR comment

195866e

nvtw dismissed stale reviews from chschuma-disney, vastsoun, and camevor via 195866e April 22, 2026 13:31

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

camevor approved these changes Apr 22, 2026

View reviewed changes

nvtw enabled auto-merge April 22, 2026 14:04

nvtw disabled auto-merge April 22, 2026 14:05

nvtw enabled auto-merge April 22, 2026 14:05

nvtw requested a review from Guirec-Maloisel April 22, 2026 14:05

nvtw added this pull request to the merge queue Apr 22, 2026

Merged via the queue into newton-physics:main with commit 5d3c869 Apr 22, 2026
25 checks passed

coderabbitai Bot mentioned this pull request Apr 23, 2026

Kamino Cholesky solver with RCM support #2554

Merged

3 tasks

preist-nvidia mentioned this pull request May 4, 2026

Clean up CHANGELOG before 1.2 release branch cut #2684

Merged

coderabbitai Bot mentioned this pull request May 8, 2026

Kamino - Propagate LLT performance improvements to semi-sparse LLT #2816

Merged

3 tasks

coderabbitai Bot mentioned this pull request May 29, 2026

Kamino solver speedups #2998

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve blocked Cholesky solver performance in Kamino#2517

Improve blocked Cholesky solver performance in Kamino#2517
nvtw merged 5 commits into
newton-physics:mainfrom
nvtw:dev/tw3/kamino_cholesky_speed

nvtw commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

codecov Bot commented Apr 21, 2026

Uh oh!

camevor left a comment

Uh oh!

vastsoun left a comment

Uh oh!

Guirec-Maloisel left a comment

Uh oh!

chschuma-disney left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

nvtw commented Apr 21, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Commit breakdown

Apples-to-apples

End-to-end user-facing

Summary

Checklist

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov Bot commented Apr 21, 2026

Codecov Report

Uh oh!

camevor left a comment

Choose a reason for hiding this comment

Uh oh!

vastsoun left a comment

Choose a reason for hiding this comment

Uh oh!

Guirec-Maloisel left a comment

Choose a reason for hiding this comment

Uh oh!

chschuma-disney left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

nvtw commented Apr 21, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 21, 2026 •

edited

Loading