Skip to content

Improve blocked Cholesky solver performance in Kamino#2517

Merged
nvtw merged 5 commits into
newton-physics:mainfrom
nvtw:dev/tw3/kamino_cholesky_speed
Apr 22, 2026
Merged

Improve blocked Cholesky solver performance in Kamino#2517
nvtw merged 5 commits into
newton-physics:mainfrom
nvtw:dev/tw3/kamino_cholesky_speed

Conversation

@nvtw

@nvtw nvtw commented Apr 21, 2026

Copy link
Copy Markdown
Member

Description

Improve blocked Cholesky solver performance in Kamino. The Cholesky kernels also seem to compile significantly faster when using their inplace variants. Below is the AI report but we should verify on real scenes and different PCs.

Commit breakdown

Commit Change Measured impact
b32ee0ef Kernel-level perf: tile_matmul in-place accumulation, tile_cholesky_inplace, tile_lower_solve_inplace 4–12% kernel speedup (fixed args)
c10ad472 Add benchmark (no runtime change)
7518c0ef Launcher default tuning: block_size 16→32, solve_block_dim 64→128 Biggest chunk of user-facing win
d72a2839 Remove benchmark file — (tooling only)

Apples-to-apples

Isolates b32ee0ef's kernel change, both at bs=32 dim=128.

n main factor → b32ee0e main solve → b32ee0e
128 49.3 → 45.2 µs (1.09×) 24.8 → 22.7 µs (1.09×)
320 465 → 414 µs (1.12×) 117 → 108 µs (1.08×)
1000 20,227 → 19,385 µs (1.04×) 2,054 → 1,975 µs (1.04×)

Kernel rewrite alone buys ~4–12%.

End-to-end user-facing

Each commit with its own production defaults:

  • main: block_size=16, solve_block_dim=64, factorize_block_dim=128
  • HEAD: block_size=32, solve_block_dim=128, factorize_block_dim=128
n factor main → HEAD × solve main → HEAD ×
32 7.5 → 7.4 µs 1.01× 9.0 → 9.0 µs 1.00×
70 28.8 → 30.8 µs 0.94× 24.7 → 18.6 µs 1.33×
128 76.0 → 45.2 µs 1.68× 44.0 → 22.7 µs 1.94×
192 226 → 113 µs 2.00× 92.2 → 45.1 µs 2.04×
257 769 → 485 µs 1.59× 302 → 162 µs 1.86×
320 940 → 413 µs 2.27× 248 → 108 µs 2.29×
401 2,689 → 1,427 µs 1.88× 690 → 342 µs 2.02×
512 3,608 → 1,527 µs 2.36× 602 → 262 µs 2.30×
768 11,945 → 4,864 µs 2.46× 1,414 → 576 µs 2.45×
1000 36,854 → 19,434 µs 1.90× 4,209 → 1,973 µs 2.13×

Summary

  • Kernel-level changes (b32ee0ef): modest ~5–10% win. Switching matmul to in-place accumulation and using the _inplace variants of cholesky/lower_solve removes temporary tile allocations.
  • Default tuning (7518c0ef): the dominant contribution. block_size=32 cuts the number of outer iterations in half vs. bs=16, and solve_block_dim=128 doubles the threads hiding latency over =64.
  • Net at typical sizes (n ≥ 128): ~1.7–2.5× faster factor, ~1.9–2.5× faster solve. At the extreme small end (n ≤ 70), factor is flat or marginally worse due to padding overhead at bs=32; solve is still notably faster.
  • Numerical accuracy identicalchol_err / res_norm columns match exactly between main and HEAD at every size.

Measurements taken on the RTX PRO 6000 with the benchmark recovered from c10ad472, 100 iters × best-of-3 per size, Warp kernel cache cleared between commits, GPU idle.

Checklist

  • New or existing tests cover these changes
  • The documentation is up to date with these changes
  • CHANGELOG.md has been updated (if user-facing change)

Test plan

Summary by CodeRabbit

  • Performance
    • Improved Cholesky factorization and linear-solve routines using in-place matrix operations to reduce memory use and speed up solves.
  • Chores
    • Tuned default block sizes for factorization and solve routines (larger blocks) to improve efficiency on common hardware.

nvtw added 4 commits April 21, 2026 12:06
On the sizes in benchmark_llt_blocked.py (n=32..1000), the prior defaults
(block_size=16, solve_block_dim=64) left significant throughput on the
table. Switching the LLTBlockedSolver defaults to block_size=32 with
solve_block_dim=128 and aligning the solve launcher defaults / benchmark
flags delivers ~2-5x faster factor and ~1.5-3x faster solve across the
tested range:

  n=128:  factor 72us -> 45us,   solve 41us -> 23us
  n=320:  factor 822us -> 413us, solve 217us -> 108us
  n=1000: factor 57ms -> 19ms,   solve 4.0ms -> 2.0ms

No algorithmic changes; kernels and padding logic are unchanged.
@nvtw nvtw self-assigned this Apr 21, 2026
@nvtw nvtw marked this pull request as draft April 21, 2026 11:59
@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

Replaces temporary tile allocations with in-place operations in blocked Cholesky factorize and solve kernels (negative gemm accumulation, in-place Cholesky, in-place triangular solves). Hoists and pads diagonal tiles. Also updates solver defaults: block_size → 32 and solve block_dim defaults → 128.

Changes

Cohort / File(s) Summary
Kernel-level in-place optimizations
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py
Refactored factorize and solve kernels to use in-place negative accumulation (wp.tile_matmul(..., alpha=-1.0)), in-place Cholesky (wp.tile_cholesky_inplace(...)), and in-place triangular solves (wp.tile_lower_solve_inplace, wp.tile_upper_solve_inplace). Hoists diagonal tile loads, pads diagonal tiles when needed, and removes intermediate solve/matmul temporaries. Updated solve launcher defaults from block_dim=64 to block_dim=128.
Solver configuration defaults
newton/_src/solvers/kamino/_src/linalg/linear.py
Adjusted LLTBlockedSolver constructor defaults: block_size increased from 16 to 32, and solve_block_dim increased from 64 to 128, aligning solver defaults with kernel launcher updates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly summarizes the main change: kernel-level optimizations and default tuning to improve blocked Cholesky solver performance in Kamino, which is clearly reflected in the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 389-393: Comments referencing a removed file
(benchmark_llt_blocked.py) should be updated around the block_dim parameter and
the other similar comment blocks (references at the block_dim declaration and
the nearby comment blocks ~421-425 and ~457-459); remove the broken filename
link and replace it with a short, self-contained rationale: state that
empirically 128 threads (4 warps) gives best throughput for large tiles while
small matrices prefer 64, note callers may override, and/or point to a generic
"per-size benchmarks" record (or commit/CI artifact) rather than a file path;
update the comment text adjacent to the block_dim declaration and the two other
comment locations to reflect this concise rationale.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 323ea4b4-a7b4-4f00-98da-5b66b7a6149e

📥 Commits

Reviewing files that changed from the base of the PR and between f954415 and d72a283.

📒 Files selected for processing (2)
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py
  • newton/_src/solvers/kamino/_src/linalg/linear.py

Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py Outdated
@codecov

codecov Bot commented Apr 21, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@nvtw nvtw marked this pull request as ready for review April 21, 2026 12:49
camevor
camevor previously approved these changes Apr 21, 2026

@camevor camevor left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice speedup! lgtm

vastsoun
vastsoun previously approved these changes Apr 22, 2026

@vastsoun vastsoun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, amazing work @nvtw. These speedups are very significant. I'll give it a try on my machine later today and confirm.

@Guirec-Maloisel Guirec-Maloisel left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, on my machine the max throughput for dr legs is increased by 25% (using the fastest solver configuration).

Please just quickly edit the commits that refer to the benchmark (cf code rabbit comments)

@chschuma-disney chschuma-disney left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, looks good to me.

Since Guirec already did the extensive benchmarking, I did a quick informal test on a single-world interactive BDX example. I saw a 15-20% increase in the frame rate.

@nvtw nvtw dismissed stale reviews from chschuma-disney, vastsoun, and camevor via 195866e April 22, 2026 13:31

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py (1)

389-458: ⚠️ Potential issue | 🟡 Minor

Add a CHANGELOG entry for the default retunings.

The launcher block_dim defaults here flip from 64 → 128 for llt_blocked_solve and llt_blocked_solve_inplace, and the PR also bumps LLTBlockedSolver(block_size=…, solve_block_dim=…) defaults (16→32, 64→128) in linalg/linear.py. These propagate to users who instantiate LLTBlockedSolver without overrides and change observed performance/resource characteristics, so a Changed entry in CHANGELOG.md under [Unreleased] with a one-line note on the new defaults (and how to restore the old values via explicit kwargs) would satisfy the repo policy.

As per coding guidelines: "Check that any user-facing change includes a corresponding entry in CHANGELOG.md under the [Unreleased] section … For Deprecated, Changed, and Removed entries, verify the entry includes migration guidance telling users what replaces the old behavior."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py` around lines
389 - 458, Add a "Changed" entry under the [Unreleased] section of CHANGELOG.md
noting that the default thread block sizes were retuned (llt_blocked_solve and
llt_blocked_solve_inplace default block_dim changed from 64→128 and
LLTBlockedSolver defaults solve_block_dim/block_size changed 64→128 and 16→32
respectively), and include one-line migration guidance that users can restore
prior behavior by passing explicit kwargs (e.g., block_dim or
solve_block_dim/block_size) when constructing LLTBlockedSolver or calling
llt_blocked_solve/llt_blocked_solve_inplace.
🧹 Nitpick comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py (1)

246-289: Solve kernel hoist + in-place refactor looks good; one thing worth confirming.

Hoisting L_diag above the j loop and switching to tile_matmul(..., alpha=-1.0) + tile_lower_solve_inplace / tile_upper_solve_inplace(tile_transpose(L_diag), …) is correct and matches the pattern in llt_blocked_solve_inplace_kernel.

Please double-check one asymmetry: in the backward sub you re-pad L_diag to identity on OOB rows/cols (lines 268–279), but in the forward sub (lines 251–257) L_diag is used directly from tile_load without any padding branch. That's only safe because make_llt_blocked_factorize_kernel writes identity into the padded diagonal of L; if a future change ever skipped that (or a non-factorize producer fed L here), the forward tile_lower_solve_inplace on the trailing block would divide by garbage. Consider either (a) dropping the backward re-pad (redundant given the factorize invariant) or (b) adding the same pad in the forward path, so the two are symmetric and the invariant is locally enforced rather than implicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py` around lines
246 - 289, Forward path currently uses L_diag from wp.tile_load without padding
while backward pads the diagonal for trailing blocks; make them symmetric by
adding the same OOB-padding logic before the forward wp.tile_lower_solve_inplace
call. In the forward loop (the block that loads L_diag and then calls
wp.tile_lower_solve_inplace), detect the trailing block condition (i +
block_size > n_i) and run the same padding routine used in the backward path
(using tid_block, num_threads_per_block, num_tile_elements, computing row/col,
using wp.where to inject identity on OOB entries) so L_diag is safe for the
lower solve; alternatively, if you prefer the other approach, remove the
backward-pad block instead—pick one and apply the change where L_diag is loaded
in the forward loop to ensure symmetry with the existing backward padding.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 389-458: Add a "Changed" entry under the [Unreleased] section of
CHANGELOG.md noting that the default thread block sizes were retuned
(llt_blocked_solve and llt_blocked_solve_inplace default block_dim changed from
64→128 and LLTBlockedSolver defaults solve_block_dim/block_size changed 64→128
and 16→32 respectively), and include one-line migration guidance that users can
restore prior behavior by passing explicit kwargs (e.g., block_dim or
solve_block_dim/block_size) when constructing LLTBlockedSolver or calling
llt_blocked_solve/llt_blocked_solve_inplace.

---

Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py`:
- Around line 246-289: Forward path currently uses L_diag from wp.tile_load
without padding while backward pads the diagonal for trailing blocks; make them
symmetric by adding the same OOB-padding logic before the forward
wp.tile_lower_solve_inplace call. In the forward loop (the block that loads
L_diag and then calls wp.tile_lower_solve_inplace), detect the trailing block
condition (i + block_size > n_i) and run the same padding routine used in the
backward path (using tid_block, num_threads_per_block, num_tile_elements,
computing row/col, using wp.where to inject identity on OOB entries) so L_diag
is safe for the lower solve; alternatively, if you prefer the other approach,
remove the backward-pad block instead—pick one and apply the change where L_diag
is loaded in the forward loop to ensure symmetry with the existing backward
padding.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: c4ecd031-6f47-4a54-9e43-cbec8e2a04df

📥 Commits

Reviewing files that changed from the base of the PR and between d72a283 and 195866e.

📒 Files selected for processing (1)
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py

@nvtw nvtw enabled auto-merge April 22, 2026 14:04
@nvtw nvtw disabled auto-merge April 22, 2026 14:05
@nvtw nvtw enabled auto-merge April 22, 2026 14:05
@nvtw nvtw requested a review from Guirec-Maloisel April 22, 2026 14:05
@nvtw nvtw added this pull request to the merge queue Apr 22, 2026
Merged via the queue into newton-physics:main with commit 5d3c869 Apr 22, 2026
25 checks passed
preist-nvidia added a commit to preist-nvidia/newton that referenced this pull request May 4, 2026
The pre-release audit found user-facing commits that landed since v1.1.0
without an [Unreleased] entry. Backfill five of them here. Kamino-related
commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is
still alpha and not currently tracked in CHANGELOG.

- Changed: contact-conversion overhead reduction in SolverMuJoCo
  ([newton-physicsGH-2393](newton-physics#2393))
- Fixed: target_voxel_size ignored on the texture-SDF path
  ([newton-physicsGH-2464](newton-physics#2464))
- Fixed: Newton-to-mujoco-warp material-combination mismatch
  ([newton-physicsGH-2439](newton-physics#2439))
- Fixed: eq_objtype mismatch on joint equality / mimic constraints in
  SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459))
- Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13
  ([newton-physicsGH-2546](newton-physics#2546))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants