Skip to content

Kamino Cholesky solver with RCM support#2554

Merged
nvtw merged 18 commits into
newton-physics:mainfrom
nvtw:dev/tw3/kamino_solver_experiments_minimal
Apr 24, 2026
Merged

Kamino Cholesky solver with RCM support#2554
nvtw merged 18 commits into
newton-physics:mainfrom
nvtw:dev/tw3/kamino_solver_experiments_minimal

Conversation

@nvtw

@nvtw nvtw commented Apr 23, 2026

Copy link
Copy Markdown
Member

Description

Experimental reverse Cuthill McKee reordering for Cholesky to reduce fill in.

Checklist

  • New or existing tests cover these changes
  • The documentation is up to date with these changes
  • CHANGELOG.md has been updated (if user-facing change)

Test plan

Summary by CodeRabbit

  • New Features

    • Added an opt-in "LLTBRCM" solver: RCM-reordered, semi-sparse blocked Cholesky with GPU-accelerated factorize/solve and batched reordering for better performance on blocked SPD systems.
    • New solver is exposed with runtime-accessible properties for use in workflows.
  • Configuration

    • New "LLTBRCM" option for linear_solver_type; default remains "LLTB".
  • Performance

    • GPU kernels and batched reordering optimized; several GPU kernels now run with backward/autograd disabled for leaner execution.

nvtw added 10 commits April 22, 2026 21:03
Previously each BFS iteration launched two kernels: bfs_step (expand frontier)
followed by post_step (tick current_level, reset discovered, check alive). On
small multi-block workloads (n=256, 8 blocks) this doubled the dominant kernel
launch cost of the reorder phase.

Merge the post-step bookkeeping into bfs_step itself:
- Replace the per-block current_level array with a single shared
  iter_counter[0] incremented by thread (0,0) at the end of each launch.
- Drop the alive flag entirely: saturated blocks naturally do no work
  (no thread sees level == cur) and kernel launch overhead is fixed.
- Drop the discovered flag: termination is governed by the fixed
  max_bfs_iters cap, same as before.

Result on n=256 / 8 blocks / banded-5% (uncaptured 1 factorize + 50 solves):
  reorder stage: 1.426 -> 1.325 ms  (~7% faster)
  total:         9.93  -> 7.78 ms   (~22% faster)
All tile_fill values are unchanged (identical ordering semantics).

14/14 RCM solver tests pass.

Made-with: Cursor
The factorize path previously launched three separate auxiliary kernels over
the same (num_blocks, max_dim, max_dim) thread grid:

  1. build_inv_P:       inv_P[P[r]] = r
  2. permute_matrix:    A_hat[r, c] = A[P[r], P[c]]
  3. build_tile_pattern: if |A_hat[r, c]| > tol: OR bit into tile_pattern

These are data-safe to fuse because every thread (b, r, c) only writes its
own A_hat[r, c] and reads only A[P[r], P[c]] (no overlap). inv_P is produced
by the c == 0 threads. Replace the three launches with a single
`fused_permute_and_tp_kernel`.

Size sweep (banded, 1 factorize + 50 solves, captured replay, best of 11):

  n=128  blocks=8 : neutral  (within +-0.1x)
  n=256  blocks=8 : neutral  (1.76-2.10x RCM speedup preserved)
  n=512  blocks=4 : neutral  (3.13x preserved)
  n=1000 blocks=2 : +0.20x at 10% density (4.19 -> 4.39x)
                    neutral at 5% (4.89x)

All 14 RCM solver tests still pass. Reduces kernel-launch count by two per
factorize and shrinks the solver surface area (the now-redundant
build_inv_P, permute_matrix and build_tile_pattern kernels are still
exported from llt_blocked_nd.py for any external callers, but the solver no
longer instantiates or calls them).

Made-with: Cursor
These two kernels both launch with dim=(num_blocks,) and do per-block serial
work back-to-back (one thread per block):
- select_root:  argmin over degree slice to pick a minimum-degree root
- init_frontier: seed BFS (level[root] = 0, push root into order_buf)

Merge them into `select_and_seed_kernel`. Saves one launch per reorder call
on top of the existing fused bfs_step (commit before previous).

Size sweep (banded, 1 factorize + 50 solves, captured replay, best of 11,
 sp = LLTB / RCM speedup, higher is better):

  n=128  blocks=8 , 5%  :  1.19x -> 1.41x  (+0.22x, launch-overhead dominated)
  n=128  blocks=8 , 10% :  1.26x -> 1.26x  (neutral)
  n=256  blocks=8 , 5%  :  2.09x -> 2.25x  (+0.16x)
  n=256  blocks=8 , 10% :  2.10x -> 2.02x  (noise)
  n=512  blocks=4 , 5%  :  3.13x -> 3.12x  (neutral)
  n=512  blocks=4 , 10% :  2.88x -> 2.93x  (noise)
  n=1000 blocks=2 , 5%  :  4.89x -> 4.79x  (noise)
  n=1000 blocks=2 , 10% :  4.39x -> 4.18x  (noise)

Net positive at small n (where reorder launches are the bottleneck),
neutral at large n (where factorize/solve dominates). 14/14 tests pass.

Made-with: Cursor
@nvtw nvtw self-assigned this Apr 23, 2026
@nvtw nvtw marked this pull request as draft April 23, 2026 07:14
@coderabbitai

coderabbitai Bot commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 828e7323-6216-45be-9399-7007107f13db

📥 Commits

Reviewing files that changed from the base of the PR and between 1a44c82 and ce038a6.

📒 Files selected for processing (1)
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py

📝 Walkthrough

Walkthrough

Adds a new LLTBlockedRCMSolver plus batched RCM reordering and semi-sparse blocked LLT Warp/CUDA kernels; integrates the solver into the linalg module and adds a new "LLTBRCM" config option.

Changes

Cohort / File(s) Summary
GPU kernels: RCM batch & blocked LLT
newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py, newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py
New Warp/CUDA modules implementing batched Reverse Cuthill–McKee capture/replay and semi-sparse blocked LLT: permute/vector kernels, fused permute+tile-pattern build, symbolic fill-in, numeric factorize and solve, launch wrappers and __all__ exports.
Solver implementation
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py
New LLTBlockedRCMSolver DirectSolver: manages per-block buffers (L, y, A_hat, b_hat, x_hat, P, inv_P, tile_pattern, tpo), binds/replays RCM callback, runs fused permute+tp build, symbolic inflation, numeric factorization, and tile-aware solves; reconstruction left unimplemented.
Module & config integration
newton/_src/solvers/kamino/_src/linalg/__init__.py, newton/_src/solvers/kamino/config.py
Registers and exports LLTBlockedRCMSolver under "LLTBRCM", extends LinearSolverType/__all__, and updates ConstrainedDynamicsConfig.linear_solver_type to accept "LLTBRCM" (default remains "LLTB").
Kernel decorator updates
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py, .../llt_blocked_semi_sparse.py, .../llt_sequential.py
Several Warp kernels changed to @wp.kernel(enable_backward=False) to disable backward/autograd support; no functional or API changes beyond decorator flags.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Solver as LLTBlockedRCMSolver
    participant RCM as rcm_batch
    participant Kernels as llt_blocked_rcm_kernels

    User->>Solver: allocate(operator)
    Solver->>RCM: create_rcm_batch_launch() / bind callback
    RCM-->>Solver: rcm_callback (per-block P)

    User->>Solver: factorize()
    Solver->>RCM: rcm_callback() [compute P per block]
    Solver->>Kernels: llt_blocked_rcm_fused_permute_and_tp()\n[A -> A_hat, inv_P, initial tile_pattern]
    Solver->>Kernels: llt_blocked_rcm_symbolic_fill_in()\n[inflate tile_pattern]
    Solver->>Kernels: llt_blocked_rcm_factorize()\n[compute L from A_hat & tile_pattern]

    User->>Solver: solve(b)
    Solver->>Kernels: llt_blocked_rcm_permute_vector()\n[b -> b_hat via P]
    Solver->>Kernels: llt_blocked_rcm_solve()\n[forward/back solve using L & tile_pattern]
    Solver->>Kernels: llt_blocked_rcm_permute_vector()\n[x_hat -> x via inv_P]
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • Guirec-Maloisel
  • camevor
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Kamino Cholesky solver with RCM support' directly and clearly summarizes the main changes: adding RCM (Reverse Cuthill-McKee) support to the Kamino Cholesky solver, which is evidenced by the new LLTBlockedRCMSolver class, RCM batch reordering implementation, and configuration updates throughout the codebase.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (7)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py (2)

256-262: Dead attributes: _reorder_callbacks, _A_views, _A_views_attached_to.

These are initialized at the end of _allocate_impl but never read or written anywhere else in the class — the actively-used members are the singular _reorder_callback and _reorder_attached_to set in _ensure_reorder_launches_bound. Looks like leftover from an earlier per-block-views design. Safe to remove to avoid confusing future readers.

🧹 Proposed cleanup
         # Per-block views/launches are (re)built lazily in
         # ``_ensure_reorder_launches_bound`` the first time a concrete A
         # buffer arrives, and rebound only if its device pointer changes.
-        self._reorder_callbacks = []
-        self._A_views = []
-        self._A_views_attached_to = None

         self._has_factors = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`
around lines 256 - 262, Remove the dead per-block attributes initialized in
_allocate_impl: _reorder_callbacks, _A_views, and _A_views_attached_to; they are
not used elsewhere (the class uses the singular _reorder_callback and
_reorder_attached_to in _ensure_reorder_launches_bound), so delete their
initialization and any other references to these plural attributes to avoid
confusion and keep the class consistent with the current per-matrix design.

265-275: _reset_impl does not invalidate the cached reorder callback.

If a user calls reset() and then passes a different A buffer on the next factorize, _ensure_reorder_launches_bound compares by identity against _reorder_attached_to and correctly rebinds — so that path is fine. But zeroing _P / _inv_P here without also clearing _reorder_attached_to means that if the same A buffer object is reused after reset(), the zeroed _P would be consumed by the fused permute/fill-in step before the reorder callback re-runs. Actually on inspection the flow always runs self._reorder_callback() inside _factorize_impl before the fused kernel reads _P, so the current sequence is safe — but it's fragile. Consider either:

  • setting self._reorder_attached_to = None here to force a rebind, or
  • leaving _P/_inv_P un-zeroed (they are overwritten each factorize anyway).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`
around lines 265 - 275, _reset_impl currently zeroes _P and _inv_P but does not
clear the cached reorder binding, which can lead to a fragile race if the same A
object is reused; modify _reset_impl to also set self._reorder_attached_to =
None so the next call into _factorize_impl will force _reorder_callback() to
rebind the reorder state (alternatively, you could skip zeroing _P/_inv_P, but
the safer change is clearing _reorder_attached_to).
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py (3)

125-142: Docstring says "atomic OR" but code uses wp.atomic_max.

Minor doc inconsistency: lines 132-133 describe the operation as atomic OR, but fused_permute_and_tp_kernel at line 185 uses wp.atomic_max(tile_pattern, ..., int(1)). For 0/1 slots this is functionally the same, but callers reading this docstring might assume tile_pattern is bit-packed and try to OR non-{0,1} values in later. Consider aligning the wording (e.g. "atomic max on 0/1 values") or switching to wp.atomic_or if Warp exposes it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` around
lines 125 - 142, The docstring for the fused kernel incorrectly states "atomic
OR" while the implementation in fused_permute_and_tp_kernel uses wp.atomic_max
on tile_pattern; update the code so docstring and implementation agree: either
change the docstring wording to explicitly say "atomic max on 0/1 values (uses
wp.atomic_max)" and note callers must only OR 0/1 semantics, or replace
wp.atomic_max with wp.atomic_or if Warp provides atomic_or and bit-packed OR
behavior is desired—adjust callers/zeroing semantics accordingly and ensure the
symbol fused_permute_and_tp_kernel and the tile_pattern description reflect the
chosen atomic operation.

178-185: Fused kernel writes tile-pattern bits for the full n×n grid including the upper triangle.

The symbolic fill-in and numeric factorize/solve kernels only query lower-triangle (tile_i, tile_j) with j ≤ i, so any bits written in the upper triangle here are pure waste (correctness is unaffected because they are never read for skipping). Gating the atomic on r >= c would roughly halve the atomic traffic on the tile-pattern array — worth it only if profiling shows the atomics are a hot spot.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` around
lines 178 - 185, The fused kernel is writing tile-pattern bits for the full
matrix; restrict writes to the lower triangle by gating the atomic write with a
check that the tile row index is >= tile column index (i.e., only when r >= c).
Concretely, in the block where av is computed and before calling
wp.atomic_max(tile_pattern, tp_off + tr * n_tiles + tc, int(1)), compute tr = r
// block_size and tc = c // block_size as you already do and add a guard if tr
>= tc (or if r >= c) around the wp.atomic_max call so the atomic_max is only
executed for lower-triangle tiles (tile_i >= tile_j).

224-238: Minor: no early exit once fill is detected.

symbolic_fill_in_kernel keeps scanning k after filled = int(1). For large n_tiles this is O(n_tiles^3) worst case per block and is serial within a thread. A break would not change semantics but trims pointless memory traffic. Safe to defer given the small-block target workload.

♻️ Proposed early-exit
                 if tile_pattern[tp_off + i * n_tiles + j] == int(0):
                     # Scan k = 0..j-1 for a filled (i,k) and (j,k) pair.
                     filled = int(0)
                     for k in range(j):
                         if tile_pattern[tp_off + i * n_tiles + k] != int(0) and tile_pattern[
                             tp_off + j * n_tiles + k
                         ] != int(0):
                             filled = int(1)
+                            break
                     if filled == int(1):
                         tile_pattern[tp_off + i * n_tiles + j] = int(1)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` around
lines 224 - 238, The loop in symbolic_fill_in_kernel that scans k (for j in
range(n_tiles): for i in range(j+1, n_tiles): ... for k in range(j): ...)
doesn't break after detecting fill, causing unnecessary work; modify the inner
k-loop so that when the condition (tile_pattern[tp_off + i * n_tiles + k] != 0
and tile_pattern[tp_off + j * n_tiles + k] != 0) is true you set filled = 1 and
immediately break out of the k-loop (or replace with a short-circuit check that
sets tile_pattern[tp_off + i * n_tiles + j] = 1 and breaks), keeping all other
logic the same and still using the same variables tile_pattern, tp_off, n_tiles,
i, j, k.
newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py (2)

393-402: Validate num_blocks against array shapes.

A quick sanity check that dims.shape[0] == mio.shape[0] == vio.shape[0] == num_blocks and that A_flat.device == perm_flat.device == dims.device would fail fast with a clear message instead of silently running with an undersized scratch (head/root sized num_blocks). Cheap, no runtime cost.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` around lines
393 - 402, Add explicit sanity checks that dims.shape[0], mio.shape[0], and
vio.shape[0] all equal num_blocks and that A_flat.device, perm_flat.device, and
dims.device are the same; if any check fails raise a ValueError or TypeError
with a clear message. Place these validations near the top of the function after
total_vec is computed (referencing variables num_blocks, dims, mio, vio,
A_flat.device, perm_flat.device, dims.device) so the function fails fast instead
of continuing with undersized scratch arrays (e.g., head/root sized num_blocks).

65-89: Reliance on Warp private APIs (wp._src.context.runtime.core.*).

create_cuda_graph_callback reaches into wp._src.context.runtime.core.wp_cuda_graph_create_exec and wp._src.context.runtime.get_error_string. These are not part of Warp's public surface and can break on any Warp update. Warp already exposes wp.capture_launch(graph) (you use it in graph_callback) after capture; the manual graph_exec plumbing is only needed if you want to reuse a captured graph handle across devices/streams. Consider either:

  • using the standard with wp.ScopedCapture(...) as capture: ... pattern and wp.capture_launch(capture.graph) directly, or
  • gating the private-API path behind a Warp-version check and falling back to the public API.
Does Warp provide a public API equivalent to wp._src.context.runtime.core.wp_cuda_graph_create_exec for binding a captured graph to a specific stream?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` around lines
65 - 89, The function create_cuda_graph_callback currently uses Warp private
APIs (wp._src.context.runtime.core.wp_cuda_graph_create_exec and
wp._src.context.runtime.get_error_string) to populate graph.graph_exec; change
it to avoid private plumbing by either: 1) removing the manual graph_exec
creation and simply returning a callback that calls wp.capture_launch(graph)
(use the existing graph variable and wp.capture_launch in graph_callback), or 2)
if reusing a captured graph handle across devices/streams is required, gate the
private-API path behind an explicit Warp-version check and add a safe fallback
that uses wp.capture_launch(capture.graph) when the public API is unavailable;
update places referencing graph.graph_exec accordingly so no code depends on
private runtime calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 97-143: The constructor and internal attribute have a misspelled
name: change the kwarg and attribute from factortize_block_dim to
factorize_block_dim (and _factortize_block_dim to _factorize_block_dim) and
update every downstream reference (e.g., the use at the solver initialization
and the usage referenced in the reorder/factorize logic such as the call site
around the earlier noted line 360) to the corrected symbol names so the
constructor parameter, stored attribute, and all accesses match.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 258-289: The kernel bfs_step_kernel has a data race on
iter_counter[0] because all threads read it while thread (0,0) writes it in the
same launch; fix by removing intra-kernel writes to iter_counter: make
iter_counter read-only in bfs_step_kernel (use cur = iter_counter[0] only) and
create a tiny single-thread kernel increment_counter_kernel that atomically
increments or writes iter_counter[0] (or simply sets iter_counter[0]=cur+1) and
launch that kernel between bfs_step_kernel launches (or alternatively pass cur
as a scalar kernel argument to bfs_step_kernel and re-record/replay launches);
update the launch sequence (where bfs_step_launch.launch() is called) to run
increment_counter_kernel after each BFS step so the counter advance is visible
only between launches and there is no intra-launch race.

---

Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 256-262: Remove the dead per-block attributes initialized in
_allocate_impl: _reorder_callbacks, _A_views, and _A_views_attached_to; they are
not used elsewhere (the class uses the singular _reorder_callback and
_reorder_attached_to in _ensure_reorder_launches_bound), so delete their
initialization and any other references to these plural attributes to avoid
confusion and keep the class consistent with the current per-matrix design.
- Around line 265-275: _reset_impl currently zeroes _P and _inv_P but does not
clear the cached reorder binding, which can lead to a fragile race if the same A
object is reused; modify _reset_impl to also set self._reorder_attached_to =
None so the next call into _factorize_impl will force _reorder_callback() to
rebind the reorder state (alternatively, you could skip zeroing _P/_inv_P, but
the safer change is clearing _reorder_attached_to).

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py`:
- Around line 125-142: The docstring for the fused kernel incorrectly states
"atomic OR" while the implementation in fused_permute_and_tp_kernel uses
wp.atomic_max on tile_pattern; update the code so docstring and implementation
agree: either change the docstring wording to explicitly say "atomic max on 0/1
values (uses wp.atomic_max)" and note callers must only OR 0/1 semantics, or
replace wp.atomic_max with wp.atomic_or if Warp provides atomic_or and
bit-packed OR behavior is desired—adjust callers/zeroing semantics accordingly
and ensure the symbol fused_permute_and_tp_kernel and the tile_pattern
description reflect the chosen atomic operation.
- Around line 178-185: The fused kernel is writing tile-pattern bits for the
full matrix; restrict writes to the lower triangle by gating the atomic write
with a check that the tile row index is >= tile column index (i.e., only when r
>= c). Concretely, in the block where av is computed and before calling
wp.atomic_max(tile_pattern, tp_off + tr * n_tiles + tc, int(1)), compute tr = r
// block_size and tc = c // block_size as you already do and add a guard if tr
>= tc (or if r >= c) around the wp.atomic_max call so the atomic_max is only
executed for lower-triangle tiles (tile_i >= tile_j).
- Around line 224-238: The loop in symbolic_fill_in_kernel that scans k (for j
in range(n_tiles): for i in range(j+1, n_tiles): ... for k in range(j): ...)
doesn't break after detecting fill, causing unnecessary work; modify the inner
k-loop so that when the condition (tile_pattern[tp_off + i * n_tiles + k] != 0
and tile_pattern[tp_off + j * n_tiles + k] != 0) is true you set filled = 1 and
immediately break out of the k-loop (or replace with a short-circuit check that
sets tile_pattern[tp_off + i * n_tiles + j] = 1 and breaks), keeping all other
logic the same and still using the same variables tile_pattern, tp_off, n_tiles,
i, j, k.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 393-402: Add explicit sanity checks that dims.shape[0],
mio.shape[0], and vio.shape[0] all equal num_blocks and that A_flat.device,
perm_flat.device, and dims.device are the same; if any check fails raise a
ValueError or TypeError with a clear message. Place these validations near the
top of the function after total_vec is computed (referencing variables
num_blocks, dims, mio, vio, A_flat.device, perm_flat.device, dims.device) so the
function fails fast instead of continuing with undersized scratch arrays (e.g.,
head/root sized num_blocks).
- Around line 65-89: The function create_cuda_graph_callback currently uses Warp
private APIs (wp._src.context.runtime.core.wp_cuda_graph_create_exec and
wp._src.context.runtime.get_error_string) to populate graph.graph_exec; change
it to avoid private plumbing by either: 1) removing the manual graph_exec
creation and simply returning a callback that calls wp.capture_launch(graph)
(use the existing graph variable and wp.capture_launch in graph_callback), or 2)
if reusing a captured graph handle across devices/streams is required, gate the
private-API path behind an explicit Warp-version check and add a safe fallback
that uses wp.capture_launch(capture.graph) when the public API is unavailable;
update places referencing graph.graph_exec accordingly so no code depends on
private runtime calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: fb8157f4-0ab7-439b-906d-b02ed65cda27

📥 Commits

Reviewing files that changed from the base of the PR and between 4b8293c and 9107a16.

📒 Files selected for processing (5)
  • newton/_src/solvers/kamino/_src/linalg/__init__.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
  • newton/_src/solvers/kamino/config.py

Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py Outdated
Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
@codecov

codecov Bot commented Apr 23, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 256-261: The plural attributes _reorder_callbacks, _A_views, and
_A_views_attached_to are dead state and should be removed; update the class
initializer (where _reorder_callback and _reorder_attached_to are set) to only
declare and use the singular attributes (_reorder_callback and
_reorder_attached_to), delete the unused plural attributes and any unused
references, and if the intent was to support multiple callbacks/views instead,
complete that refactor by replacing all uses of
_reorder_callback/_reorder_attached_to with the plural versions consistently —
do not leave both sets present.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 134-319: The kernel parameter annotations use the deprecated
parenthesized form (e.g., wp.array(dtype=...)) in init_and_degree_kernel,
select_and_seed_kernel, bfs_step_kernel, append_unreached_kernel, and
reverse_into_perm_kernel; update every wp.array(...) annotation to the bracket
form wp.array[T] for the correct element type (e.g., wp.int32 or dtype) for all
array parameters (A, dims, mio, vio, degree, level, head, root, order_buf,
perm), remove the now-unnecessary "# type: ignore[valid-type]" comments, and
ensure 1-D arrays use wp.array[T] while preserving existing kernel signatures
and semantics.
- Line 1: Update the SPDX copyright year in the file header from 2025 to 2026 by
editing the SPDX comment line (the line starting with "#
SPDX-FileCopyrightText:") so it reads 2026; ensure the rest of the
header/comment text remains unchanged.
- Around line 65-89: The create_cuda_graph_callback function currently performs
manual CUDA graph exec creation using the private symbol
wp._src.context.runtime.core.wp_cuda_graph_create_exec and assigns
graph.graph_exec (lines in the callback setup); remove that manual instantiation
block and rely on wp.capture_launch(graph) to lazily create and cache the
executable on first launch, keeping the existing stream/device validation
(stream.device vs graph.device) and leaving graph.graph_exec untouched so Warp's
public API handles exec creation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 5fa602fb-2fce-4dfe-b7ca-00c839d51389

📥 Commits

Reviewing files that changed from the base of the PR and between 9107a16 and b377f29.

📒 Files selected for processing (2)
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py

Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py Outdated
Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py Outdated
Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py (1)

336-336: Nit: drop the redundant int(...) wrap around math.ceil.

math.ceil already returns int in Python 3, so the cast is a no-op (this is the only RUF046 hit that applies outside kernel code — inside kernels the explicit int(...) literals are required by Warp and should stay).

♻️ Proposed tweak
-    return 2 * int(math.ceil(math.sqrt(max_dim))) + 4
+    return 2 * math.ceil(math.sqrt(max_dim)) + 4
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` at line 336,
Drop the redundant int(...) around math.ceil: locate the return statement
returning "2 * int(math.ceil(math.sqrt(max_dim))) + 4" and replace it with "2 *
math.ceil(math.sqrt(max_dim)) + 4" (math.ceil already returns int in Python 3);
ensure max_dim and math are unchanged.
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py (2)

362-364: _reconstruct_impl stub — track or remove.

Leaving this as NotImplementedError is fine for an experimental draft, but since the base DirectSolver._reconstruct_impl is already @abstractmethod with the same semantics, any caller that invokes reconstruct() on this solver will get a runtime error rather than a type-check failure. Consider opening a tracking issue so this doesn't linger once the solver graduates out of experimental.

Want me to open a follow-up issue to implement _reconstruct_impl (RCM-permuted A = P^T L L^T P reconstruction)?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`
around lines 362 - 364, This solver currently leaves _reconstruct_impl
unimplemented which causes reconstruct() callers to get a runtime
NotImplementedError; either implement the RCM-permuted reconstruction (A = P^T L
L^T P) inside the LLT_blocked_rcm_solver._reconstruct_impl or, if you can’t
implement now, open a tracking issue and update the stub to include a TODO and a
clear NotImplementedError message referencing that issue (and mention
_reconstruct_impl and DirectSolver._reconstruct_impl so future readers know this
is intentional and tracked).

145-147: Nit: add type annotation for consistency.

Every other __init__-set attribute in this block has an explicit annotation; _rcm_max_bfs_iters is the lone exception.

♻️ Proposed tweak
         self._reorder_tol: float = reorder_tol
-        self._rcm_max_bfs_iters = rcm_max_bfs_iters
+        self._rcm_max_bfs_iters: int | None = rcm_max_bfs_iters
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`
around lines 145 - 147, Add a type annotation for the attribute
_rcm_max_bfs_iters in the class __init__ (same block where self._reorder_tol:
float = reorder_tol is set) — e.g., declare self._rcm_max_bfs_iters: int =
rcm_max_bfs_iters — to match the explicit annotations used for other attributes
in the constructor (look for __init__ in llt_blocked_rcm_solver and the
self._reorder_tol assignment to place the change).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Line 1: Update the SPDX header year in the file by replacing the existing line
that reads "# SPDX-FileCopyrightText: Copyright (c) 2025 The Newton Developers"
so the year matches the file creation year (2026); ensure the header becomes "#
SPDX-FileCopyrightText: Copyright (c) 2026 The Newton Developers" to align with
the sibling rcm_batch.py and project guidelines.

---

Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 362-364: This solver currently leaves _reconstruct_impl
unimplemented which causes reconstruct() callers to get a runtime
NotImplementedError; either implement the RCM-permuted reconstruction (A = P^T L
L^T P) inside the LLT_blocked_rcm_solver._reconstruct_impl or, if you can’t
implement now, open a tracking issue and update the stub to include a TODO and a
clear NotImplementedError message referencing that issue (and mention
_reconstruct_impl and DirectSolver._reconstruct_impl so future readers know this
is intentional and tracked).
- Around line 145-147: Add a type annotation for the attribute
_rcm_max_bfs_iters in the class __init__ (same block where self._reorder_tol:
float = reorder_tol is set) — e.g., declare self._rcm_max_bfs_iters: int =
rcm_max_bfs_iters — to match the explicit annotations used for other attributes
in the constructor (look for __init__ in llt_blocked_rcm_solver and the
self._reorder_tol assignment to place the change).

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Line 336: Drop the redundant int(...) around math.ceil: locate the return
statement returning "2 * int(math.ceil(math.sqrt(max_dim))) + 4" and replace it
with "2 * math.ceil(math.sqrt(max_dim)) + 4" (math.ceil already returns int in
Python 3); ensure max_dim and math are unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 2970cd46-686d-4e14-9133-0fd572e619d3

📥 Commits

Reviewing files that changed from the base of the PR and between b377f29 and a65874a.

📒 Files selected for processing (2)
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py

Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Use the file creation year in the SPDX header.

This new file’s header still uses 2025; for this 2026-created file it should be 2026.

📝 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 The Newton Developers
+# SPDX-FileCopyrightText: Copyright (c) 2026 The Newton Developers

As per coding guidelines: "In SPDX copyright lines, use the year the file was first created. Do not create date ranges or update the year when modifying a file."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` at line
1, Update the SPDX copyright year in the file header from 2025 to 2026; locate
the SPDX header line beginning with "SPDX-FileCopyrightText" in
llt_blocked_rcm.py and change the year token "2025" to "2026" so the file’s
creation year is correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 263-273: The override of _reset_impl in llt_blocked_rcm_solver.py
must match the base signature by accepting (self, A, **kwargs); change the
method definition to def _reset_impl(self, A, **kwargs) -> None and keep the
existing attribute zeroing (_L, _y, _A_hat, _b_hat, _x_hat, _P, _inv_P,
_tile_pattern) and _has_factors reset as-is (you can ignore A if unused), so
calls that forward A/kwargs to _reset_impl won't raise TypeError; ensure the
method name and attributes (_reset_impl, _L, _y, _A_hat, _b_hat, _x_hat, _P,
_inv_P, _tile_pattern, _has_factors) are used exactly as in the diff.
- Around line 63-66: Replace the private `_src` cross-reference in the
llt_blocked_rcm_solver.py docstring with a public/short Sphinx target: find the
docstring referencing `newton._src...LLTBlockedSolver` and change it to a
public, shortest cross-reference such as
`:class:\`newton.solvers.kamino.linalg.linear.LLTBlockedSolver\`` or simply
`:class:\`LLTBlockedSolver\`` so the docstring in the LLTBlockedRCMSolver
(file/class: llt_blocked_rcm_solver.py / LLTBlockedRCMSolver) uses the public
API path and not the private `_src` module.
- Around line 92-167: The constructor allows arbitrary dtype but underlying
kernels (see make_llt_blocked_rcm_factorize_kernel,
make_llt_blocked_rcm_solve_kernel, make_llt_blocked_rcm_solve_inplace_kernel)
are hard-coded to float32 causing buffer/kernel dtype mismatch; in
LLTBlockedRCMSolver.__init__ (immediately after the docstring) add a guard that
raises NotImplementedError when dtype != float32 (message: "LLTBlockedRCMSolver
currently supports only float32.") and ensure _allocate_impl still uses
self._dtype; also add a short CHANGELOG entry noting the float32-only
limitation.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 37-41: The doc example imports a private `_src` path; update the
example to avoid private imports by either removing the import line or replacing
it with the public-facing symbol import (e.g., import create_rcm_batch_launch
from the package's public module that re-exports it) — locate the example in
rcm_batch.py (the code block referencing create_rcm_batch_launch) and change the
import to use the public API name or drop it from the docs so there are no
references to newton._src.

---

Duplicate comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py`:
- Line 1: Update the SPDX copyright year in the file header from 2025 to 2026;
locate the SPDX header line beginning with "SPDX-FileCopyrightText" in
llt_blocked_rcm.py and change the year token "2025" to "2026" so the file’s
creation year is correct.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: d031a504-dfa6-445e-9470-a77aebdd0d20

📥 Commits

Reviewing files that changed from the base of the PR and between a65874a and 1a44c82.

📒 Files selected for processing (6)
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_semi_sparse.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/llt_sequential.py
  • newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py

Comment thread newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py

@vastsoun vastsoun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work @nvtw. This is just great.

I think we should think about placement of this code, but besides that, and since it already works, let's proceed to undraft it and start a complete review.

@nvtw nvtw marked this pull request as ready for review April 24, 2026 05:50
@nvtw nvtw changed the title Experimental Kamino Cholesky solver with RCM support Kamino Cholesky solver with RCM support Apr 24, 2026
@vastsoun vastsoun self-requested a review April 24, 2026 09:24

@vastsoun vastsoun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's merge this.

@nvtw nvtw added this pull request to the merge queue Apr 24, 2026
Merged via the queue into newton-physics:main with commit 5cd4df9 Apr 24, 2026
25 checks passed
preist-nvidia added a commit to preist-nvidia/newton that referenced this pull request May 4, 2026
The pre-release audit found user-facing commits that landed since v1.1.0
without an [Unreleased] entry. Backfill five of them here. Kamino-related
commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is
still alpha and not currently tracked in CHANGELOG.

- Changed: contact-conversion overhead reduction in SolverMuJoCo
  ([newton-physicsGH-2393](newton-physics#2393))
- Fixed: target_voxel_size ignored on the texture-SDF path
  ([newton-physicsGH-2464](newton-physics#2464))
- Fixed: Newton-to-mujoco-warp material-combination mismatch
  ([newton-physicsGH-2439](newton-physics#2439))
- Fixed: eq_objtype mismatch on joint equality / mimic constraints in
  SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459))
- Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13
  ([newton-physicsGH-2546](newton-physics#2546))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants