Kamino Cholesky solver with RCM support#2554
Conversation
Previously each BFS iteration launched two kernels: bfs_step (expand frontier) followed by post_step (tick current_level, reset discovered, check alive). On small multi-block workloads (n=256, 8 blocks) this doubled the dominant kernel launch cost of the reorder phase. Merge the post-step bookkeeping into bfs_step itself: - Replace the per-block current_level array with a single shared iter_counter[0] incremented by thread (0,0) at the end of each launch. - Drop the alive flag entirely: saturated blocks naturally do no work (no thread sees level == cur) and kernel launch overhead is fixed. - Drop the discovered flag: termination is governed by the fixed max_bfs_iters cap, same as before. Result on n=256 / 8 blocks / banded-5% (uncaptured 1 factorize + 50 solves): reorder stage: 1.426 -> 1.325 ms (~7% faster) total: 9.93 -> 7.78 ms (~22% faster) All tile_fill values are unchanged (identical ordering semantics). 14/14 RCM solver tests pass. Made-with: Cursor
The factorize path previously launched three separate auxiliary kernels over
the same (num_blocks, max_dim, max_dim) thread grid:
1. build_inv_P: inv_P[P[r]] = r
2. permute_matrix: A_hat[r, c] = A[P[r], P[c]]
3. build_tile_pattern: if |A_hat[r, c]| > tol: OR bit into tile_pattern
These are data-safe to fuse because every thread (b, r, c) only writes its
own A_hat[r, c] and reads only A[P[r], P[c]] (no overlap). inv_P is produced
by the c == 0 threads. Replace the three launches with a single
`fused_permute_and_tp_kernel`.
Size sweep (banded, 1 factorize + 50 solves, captured replay, best of 11):
n=128 blocks=8 : neutral (within +-0.1x)
n=256 blocks=8 : neutral (1.76-2.10x RCM speedup preserved)
n=512 blocks=4 : neutral (3.13x preserved)
n=1000 blocks=2 : +0.20x at 10% density (4.19 -> 4.39x)
neutral at 5% (4.89x)
All 14 RCM solver tests still pass. Reduces kernel-launch count by two per
factorize and shrinks the solver surface area (the now-redundant
build_inv_P, permute_matrix and build_tile_pattern kernels are still
exported from llt_blocked_nd.py for any external callers, but the solver no
longer instantiates or calls them).
Made-with: Cursor
These two kernels both launch with dim=(num_blocks,) and do per-block serial work back-to-back (one thread per block): - select_root: argmin over degree slice to pick a minimum-degree root - init_frontier: seed BFS (level[root] = 0, push root into order_buf) Merge them into `select_and_seed_kernel`. Saves one launch per reorder call on top of the existing fused bfs_step (commit before previous). Size sweep (banded, 1 factorize + 50 solves, captured replay, best of 11, sp = LLTB / RCM speedup, higher is better): n=128 blocks=8 , 5% : 1.19x -> 1.41x (+0.22x, launch-overhead dominated) n=128 blocks=8 , 10% : 1.26x -> 1.26x (neutral) n=256 blocks=8 , 5% : 2.09x -> 2.25x (+0.16x) n=256 blocks=8 , 10% : 2.10x -> 2.02x (noise) n=512 blocks=4 , 5% : 3.13x -> 3.12x (neutral) n=512 blocks=4 , 10% : 2.88x -> 2.93x (noise) n=1000 blocks=2 , 5% : 4.89x -> 4.79x (noise) n=1000 blocks=2 , 10% : 4.39x -> 4.18x (noise) Net positive at small n (where reorder launches are the bottleneck), neutral at large n (where factorize/solve dominates). 14/14 tests pass. Made-with: Cursor
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a new LLTBlockedRCMSolver plus batched RCM reordering and semi-sparse blocked LLT Warp/CUDA kernels; integrates the solver into the linalg module and adds a new "LLTBRCM" config option. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Solver as LLTBlockedRCMSolver
participant RCM as rcm_batch
participant Kernels as llt_blocked_rcm_kernels
User->>Solver: allocate(operator)
Solver->>RCM: create_rcm_batch_launch() / bind callback
RCM-->>Solver: rcm_callback (per-block P)
User->>Solver: factorize()
Solver->>RCM: rcm_callback() [compute P per block]
Solver->>Kernels: llt_blocked_rcm_fused_permute_and_tp()\n[A -> A_hat, inv_P, initial tile_pattern]
Solver->>Kernels: llt_blocked_rcm_symbolic_fill_in()\n[inflate tile_pattern]
Solver->>Kernels: llt_blocked_rcm_factorize()\n[compute L from A_hat & tile_pattern]
User->>Solver: solve(b)
Solver->>Kernels: llt_blocked_rcm_permute_vector()\n[b -> b_hat via P]
Solver->>Kernels: llt_blocked_rcm_solve()\n[forward/back solve using L & tile_pattern]
Solver->>Kernels: llt_blocked_rcm_permute_vector()\n[x_hat -> x via inv_P]
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (7)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py (2)
256-262: Dead attributes:_reorder_callbacks,_A_views,_A_views_attached_to.These are initialized at the end of
_allocate_implbut never read or written anywhere else in the class — the actively-used members are the singular_reorder_callbackand_reorder_attached_toset in_ensure_reorder_launches_bound. Looks like leftover from an earlier per-block-views design. Safe to remove to avoid confusing future readers.🧹 Proposed cleanup
# Per-block views/launches are (re)built lazily in # ``_ensure_reorder_launches_bound`` the first time a concrete A # buffer arrives, and rebound only if its device pointer changes. - self._reorder_callbacks = [] - self._A_views = [] - self._A_views_attached_to = None self._has_factors = False🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py` around lines 256 - 262, Remove the dead per-block attributes initialized in _allocate_impl: _reorder_callbacks, _A_views, and _A_views_attached_to; they are not used elsewhere (the class uses the singular _reorder_callback and _reorder_attached_to in _ensure_reorder_launches_bound), so delete their initialization and any other references to these plural attributes to avoid confusion and keep the class consistent with the current per-matrix design.
265-275:_reset_impldoes not invalidate the cached reorder callback.If a user calls
reset()and then passes a differentAbuffer on the next factorize,_ensure_reorder_launches_boundcompares by identity against_reorder_attached_toand correctly rebinds — so that path is fine. But zeroing_P/_inv_Phere without also clearing_reorder_attached_tomeans that if the sameAbuffer object is reused afterreset(), the zeroed_Pwould be consumed by the fused permute/fill-in step before the reorder callback re-runs. Actually on inspection the flow always runsself._reorder_callback()inside_factorize_implbefore the fused kernel reads_P, so the current sequence is safe — but it's fragile. Consider either:
- setting
self._reorder_attached_to = Nonehere to force a rebind, or- leaving
_P/_inv_Pun-zeroed (they are overwritten each factorize anyway).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py` around lines 265 - 275, _reset_impl currently zeroes _P and _inv_P but does not clear the cached reorder binding, which can lead to a fragile race if the same A object is reused; modify _reset_impl to also set self._reorder_attached_to = None so the next call into _factorize_impl will force _reorder_callback() to rebind the reorder state (alternatively, you could skip zeroing _P/_inv_P, but the safer change is clearing _reorder_attached_to).newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py (3)
125-142: Docstring says "atomic OR" but code useswp.atomic_max.Minor doc inconsistency: lines 132-133 describe the operation as
atomic OR, butfused_permute_and_tp_kernelat line 185 useswp.atomic_max(tile_pattern, ..., int(1)). For 0/1 slots this is functionally the same, but callers reading this docstring might assumetile_patternis bit-packed and try to OR non-{0,1} values in later. Consider aligning the wording (e.g."atomic max on 0/1 values") or switching towp.atomic_orif Warp exposes it.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` around lines 125 - 142, The docstring for the fused kernel incorrectly states "atomic OR" while the implementation in fused_permute_and_tp_kernel uses wp.atomic_max on tile_pattern; update the code so docstring and implementation agree: either change the docstring wording to explicitly say "atomic max on 0/1 values (uses wp.atomic_max)" and note callers must only OR 0/1 semantics, or replace wp.atomic_max with wp.atomic_or if Warp provides atomic_or and bit-packed OR behavior is desired—adjust callers/zeroing semantics accordingly and ensure the symbol fused_permute_and_tp_kernel and the tile_pattern description reflect the chosen atomic operation.
178-185: Fused kernel writes tile-pattern bits for the full n×n grid including the upper triangle.The symbolic fill-in and numeric factorize/solve kernels only query lower-triangle
(tile_i, tile_j)withj ≤ i, so any bits written in the upper triangle here are pure waste (correctness is unaffected because they are never read for skipping). Gating the atomic onr >= cwould roughly halve the atomic traffic on the tile-pattern array — worth it only if profiling shows the atomics are a hot spot.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` around lines 178 - 185, The fused kernel is writing tile-pattern bits for the full matrix; restrict writes to the lower triangle by gating the atomic write with a check that the tile row index is >= tile column index (i.e., only when r >= c). Concretely, in the block where av is computed and before calling wp.atomic_max(tile_pattern, tp_off + tr * n_tiles + tc, int(1)), compute tr = r // block_size and tc = c // block_size as you already do and add a guard if tr >= tc (or if r >= c) around the wp.atomic_max call so the atomic_max is only executed for lower-triangle tiles (tile_i >= tile_j).
224-238: Minor: no early exit once fill is detected.
symbolic_fill_in_kernelkeeps scanningkafterfilled = int(1). For largen_tilesthis is O(n_tiles^3) worst case per block and is serial within a thread. Abreakwould not change semantics but trims pointless memory traffic. Safe to defer given the small-block target workload.♻️ Proposed early-exit
if tile_pattern[tp_off + i * n_tiles + j] == int(0): # Scan k = 0..j-1 for a filled (i,k) and (j,k) pair. filled = int(0) for k in range(j): if tile_pattern[tp_off + i * n_tiles + k] != int(0) and tile_pattern[ tp_off + j * n_tiles + k ] != int(0): filled = int(1) + break if filled == int(1): tile_pattern[tp_off + i * n_tiles + j] = int(1)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` around lines 224 - 238, The loop in symbolic_fill_in_kernel that scans k (for j in range(n_tiles): for i in range(j+1, n_tiles): ... for k in range(j): ...) doesn't break after detecting fill, causing unnecessary work; modify the inner k-loop so that when the condition (tile_pattern[tp_off + i * n_tiles + k] != 0 and tile_pattern[tp_off + j * n_tiles + k] != 0) is true you set filled = 1 and immediately break out of the k-loop (or replace with a short-circuit check that sets tile_pattern[tp_off + i * n_tiles + j] = 1 and breaks), keeping all other logic the same and still using the same variables tile_pattern, tp_off, n_tiles, i, j, k.newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py (2)
393-402: Validatenum_blocksagainst array shapes.A quick sanity check that
dims.shape[0] == mio.shape[0] == vio.shape[0] == num_blocksand thatA_flat.device == perm_flat.device == dims.devicewould fail fast with a clear message instead of silently running with an undersized scratch (head/rootsizednum_blocks). Cheap, no runtime cost.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` around lines 393 - 402, Add explicit sanity checks that dims.shape[0], mio.shape[0], and vio.shape[0] all equal num_blocks and that A_flat.device, perm_flat.device, and dims.device are the same; if any check fails raise a ValueError or TypeError with a clear message. Place these validations near the top of the function after total_vec is computed (referencing variables num_blocks, dims, mio, vio, A_flat.device, perm_flat.device, dims.device) so the function fails fast instead of continuing with undersized scratch arrays (e.g., head/root sized num_blocks).
65-89: Reliance on Warp private APIs (wp._src.context.runtime.core.*).
create_cuda_graph_callbackreaches intowp._src.context.runtime.core.wp_cuda_graph_create_execandwp._src.context.runtime.get_error_string. These are not part of Warp's public surface and can break on any Warp update. Warp already exposeswp.capture_launch(graph)(you use it ingraph_callback) after capture; the manualgraph_execplumbing is only needed if you want to reuse a captured graph handle across devices/streams. Consider either:
- using the standard
with wp.ScopedCapture(...) as capture: ...pattern andwp.capture_launch(capture.graph)directly, or- gating the private-API path behind a Warp-version check and falling back to the public API.
Does Warp provide a public API equivalent to wp._src.context.runtime.core.wp_cuda_graph_create_exec for binding a captured graph to a specific stream?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` around lines 65 - 89, The function create_cuda_graph_callback currently uses Warp private APIs (wp._src.context.runtime.core.wp_cuda_graph_create_exec and wp._src.context.runtime.get_error_string) to populate graph.graph_exec; change it to avoid private plumbing by either: 1) removing the manual graph_exec creation and simply returning a callback that calls wp.capture_launch(graph) (use the existing graph variable and wp.capture_launch in graph_callback), or 2) if reusing a captured graph handle across devices/streams is required, gate the private-API path behind an explicit Warp-version check and add a safe fallback that uses wp.capture_launch(capture.graph) when the public API is unavailable; update places referencing graph.graph_exec accordingly so no code depends on private runtime calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 97-143: The constructor and internal attribute have a misspelled
name: change the kwarg and attribute from factortize_block_dim to
factorize_block_dim (and _factortize_block_dim to _factorize_block_dim) and
update every downstream reference (e.g., the use at the solver initialization
and the usage referenced in the reorder/factorize logic such as the call site
around the earlier noted line 360) to the corrected symbol names so the
constructor parameter, stored attribute, and all accesses match.
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 258-289: The kernel bfs_step_kernel has a data race on
iter_counter[0] because all threads read it while thread (0,0) writes it in the
same launch; fix by removing intra-kernel writes to iter_counter: make
iter_counter read-only in bfs_step_kernel (use cur = iter_counter[0] only) and
create a tiny single-thread kernel increment_counter_kernel that atomically
increments or writes iter_counter[0] (or simply sets iter_counter[0]=cur+1) and
launch that kernel between bfs_step_kernel launches (or alternatively pass cur
as a scalar kernel argument to bfs_step_kernel and re-record/replay launches);
update the launch sequence (where bfs_step_launch.launch() is called) to run
increment_counter_kernel after each BFS step so the counter advance is visible
only between launches and there is no intra-launch race.
---
Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 256-262: Remove the dead per-block attributes initialized in
_allocate_impl: _reorder_callbacks, _A_views, and _A_views_attached_to; they are
not used elsewhere (the class uses the singular _reorder_callback and
_reorder_attached_to in _ensure_reorder_launches_bound), so delete their
initialization and any other references to these plural attributes to avoid
confusion and keep the class consistent with the current per-matrix design.
- Around line 265-275: _reset_impl currently zeroes _P and _inv_P but does not
clear the cached reorder binding, which can lead to a fragile race if the same A
object is reused; modify _reset_impl to also set self._reorder_attached_to =
None so the next call into _factorize_impl will force _reorder_callback() to
rebind the reorder state (alternatively, you could skip zeroing _P/_inv_P, but
the safer change is clearing _reorder_attached_to).
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py`:
- Around line 125-142: The docstring for the fused kernel incorrectly states
"atomic OR" while the implementation in fused_permute_and_tp_kernel uses
wp.atomic_max on tile_pattern; update the code so docstring and implementation
agree: either change the docstring wording to explicitly say "atomic max on 0/1
values (uses wp.atomic_max)" and note callers must only OR 0/1 semantics, or
replace wp.atomic_max with wp.atomic_or if Warp provides atomic_or and
bit-packed OR behavior is desired—adjust callers/zeroing semantics accordingly
and ensure the symbol fused_permute_and_tp_kernel and the tile_pattern
description reflect the chosen atomic operation.
- Around line 178-185: The fused kernel is writing tile-pattern bits for the
full matrix; restrict writes to the lower triangle by gating the atomic write
with a check that the tile row index is >= tile column index (i.e., only when r
>= c). Concretely, in the block where av is computed and before calling
wp.atomic_max(tile_pattern, tp_off + tr * n_tiles + tc, int(1)), compute tr = r
// block_size and tc = c // block_size as you already do and add a guard if tr
>= tc (or if r >= c) around the wp.atomic_max call so the atomic_max is only
executed for lower-triangle tiles (tile_i >= tile_j).
- Around line 224-238: The loop in symbolic_fill_in_kernel that scans k (for j
in range(n_tiles): for i in range(j+1, n_tiles): ... for k in range(j): ...)
doesn't break after detecting fill, causing unnecessary work; modify the inner
k-loop so that when the condition (tile_pattern[tp_off + i * n_tiles + k] != 0
and tile_pattern[tp_off + j * n_tiles + k] != 0) is true you set filled = 1 and
immediately break out of the k-loop (or replace with a short-circuit check that
sets tile_pattern[tp_off + i * n_tiles + j] = 1 and breaks), keeping all other
logic the same and still using the same variables tile_pattern, tp_off, n_tiles,
i, j, k.
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 393-402: Add explicit sanity checks that dims.shape[0],
mio.shape[0], and vio.shape[0] all equal num_blocks and that A_flat.device,
perm_flat.device, and dims.device are the same; if any check fails raise a
ValueError or TypeError with a clear message. Place these validations near the
top of the function after total_vec is computed (referencing variables
num_blocks, dims, mio, vio, A_flat.device, perm_flat.device, dims.device) so the
function fails fast instead of continuing with undersized scratch arrays (e.g.,
head/root sized num_blocks).
- Around line 65-89: The function create_cuda_graph_callback currently uses Warp
private APIs (wp._src.context.runtime.core.wp_cuda_graph_create_exec and
wp._src.context.runtime.get_error_string) to populate graph.graph_exec; change
it to avoid private plumbing by either: 1) removing the manual graph_exec
creation and simply returning a callback that calls wp.capture_launch(graph)
(use the existing graph variable and wp.capture_launch in graph_callback), or 2)
if reusing a captured graph handle across devices/streams is required, gate the
private-API path behind an explicit Warp-version check and add a safe fallback
that uses wp.capture_launch(capture.graph) when the public API is unavailable;
update places referencing graph.graph_exec accordingly so no code depends on
private runtime calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: fb8157f4-0ab7-439b-906d-b02ed65cda27
📒 Files selected for processing (5)
newton/_src/solvers/kamino/_src/linalg/__init__.pynewton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.pynewton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.pynewton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.pynewton/_src/solvers/kamino/config.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 256-261: The plural attributes _reorder_callbacks, _A_views, and
_A_views_attached_to are dead state and should be removed; update the class
initializer (where _reorder_callback and _reorder_attached_to are set) to only
declare and use the singular attributes (_reorder_callback and
_reorder_attached_to), delete the unused plural attributes and any unused
references, and if the intent was to support multiple callbacks/views instead,
complete that refactor by replacing all uses of
_reorder_callback/_reorder_attached_to with the plural versions consistently —
do not leave both sets present.
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 134-319: The kernel parameter annotations use the deprecated
parenthesized form (e.g., wp.array(dtype=...)) in init_and_degree_kernel,
select_and_seed_kernel, bfs_step_kernel, append_unreached_kernel, and
reverse_into_perm_kernel; update every wp.array(...) annotation to the bracket
form wp.array[T] for the correct element type (e.g., wp.int32 or dtype) for all
array parameters (A, dims, mio, vio, degree, level, head, root, order_buf,
perm), remove the now-unnecessary "# type: ignore[valid-type]" comments, and
ensure 1-D arrays use wp.array[T] while preserving existing kernel signatures
and semantics.
- Line 1: Update the SPDX copyright year in the file header from 2025 to 2026 by
editing the SPDX comment line (the line starting with "#
SPDX-FileCopyrightText:") so it reads 2026; ensure the rest of the
header/comment text remains unchanged.
- Around line 65-89: The create_cuda_graph_callback function currently performs
manual CUDA graph exec creation using the private symbol
wp._src.context.runtime.core.wp_cuda_graph_create_exec and assigns
graph.graph_exec (lines in the callback setup); remove that manual instantiation
block and rely on wp.capture_launch(graph) to lazily create and cache the
executable on first launch, keeping the existing stream/device validation
(stream.device vs graph.device) and leaving graph.graph_exec untouched so Warp's
public API handles exec creation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 5fa602fb-2fce-4dfe-b7ca-00c839d51389
📒 Files selected for processing (2)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.pynewton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py (1)
336-336: Nit: drop the redundantint(...)wrap aroundmath.ceil.
math.ceilalready returnsintin Python 3, so the cast is a no-op (this is the only RUF046 hit that applies outside kernel code — inside kernels the explicitint(...)literals are required by Warp and should stay).♻️ Proposed tweak
- return 2 * int(math.ceil(math.sqrt(max_dim))) + 4 + return 2 * math.ceil(math.sqrt(max_dim)) + 4🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` at line 336, Drop the redundant int(...) around math.ceil: locate the return statement returning "2 * int(math.ceil(math.sqrt(max_dim))) + 4" and replace it with "2 * math.ceil(math.sqrt(max_dim)) + 4" (math.ceil already returns int in Python 3); ensure max_dim and math are unchanged.newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py (2)
362-364:_reconstruct_implstub — track or remove.Leaving this as
NotImplementedErroris fine for an experimental draft, but since the baseDirectSolver._reconstruct_implis already@abstractmethodwith the same semantics, any caller that invokesreconstruct()on this solver will get a runtime error rather than a type-check failure. Consider opening a tracking issue so this doesn't linger once the solver graduates out of experimental.Want me to open a follow-up issue to implement
_reconstruct_impl(RCM-permutedA = P^T L L^T Preconstruction)?🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py` around lines 362 - 364, This solver currently leaves _reconstruct_impl unimplemented which causes reconstruct() callers to get a runtime NotImplementedError; either implement the RCM-permuted reconstruction (A = P^T L L^T P) inside the LLT_blocked_rcm_solver._reconstruct_impl or, if you can’t implement now, open a tracking issue and update the stub to include a TODO and a clear NotImplementedError message referencing that issue (and mention _reconstruct_impl and DirectSolver._reconstruct_impl so future readers know this is intentional and tracked).
145-147: Nit: add type annotation for consistency.Every other
__init__-set attribute in this block has an explicit annotation;_rcm_max_bfs_itersis the lone exception.♻️ Proposed tweak
self._reorder_tol: float = reorder_tol - self._rcm_max_bfs_iters = rcm_max_bfs_iters + self._rcm_max_bfs_iters: int | None = rcm_max_bfs_iters🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py` around lines 145 - 147, Add a type annotation for the attribute _rcm_max_bfs_iters in the class __init__ (same block where self._reorder_tol: float = reorder_tol is set) — e.g., declare self._rcm_max_bfs_iters: int = rcm_max_bfs_iters — to match the explicit annotations used for other attributes in the constructor (look for __init__ in llt_blocked_rcm_solver and the self._reorder_tol assignment to place the change).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Line 1: Update the SPDX header year in the file by replacing the existing line
that reads "# SPDX-FileCopyrightText: Copyright (c) 2025 The Newton Developers"
so the year matches the file creation year (2026); ensure the header becomes "#
SPDX-FileCopyrightText: Copyright (c) 2026 The Newton Developers" to align with
the sibling rcm_batch.py and project guidelines.
---
Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 362-364: This solver currently leaves _reconstruct_impl
unimplemented which causes reconstruct() callers to get a runtime
NotImplementedError; either implement the RCM-permuted reconstruction (A = P^T L
L^T P) inside the LLT_blocked_rcm_solver._reconstruct_impl or, if you can’t
implement now, open a tracking issue and update the stub to include a TODO and a
clear NotImplementedError message referencing that issue (and mention
_reconstruct_impl and DirectSolver._reconstruct_impl so future readers know this
is intentional and tracked).
- Around line 145-147: Add a type annotation for the attribute
_rcm_max_bfs_iters in the class __init__ (same block where self._reorder_tol:
float = reorder_tol is set) — e.g., declare self._rcm_max_bfs_iters: int =
rcm_max_bfs_iters — to match the explicit annotations used for other attributes
in the constructor (look for __init__ in llt_blocked_rcm_solver and the
self._reorder_tol assignment to place the change).
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Line 336: Drop the redundant int(...) around math.ceil: locate the return
statement returning "2 * int(math.ceil(math.sqrt(max_dim))) + 4" and replace it
with "2 * math.ceil(math.sqrt(max_dim)) + 4" (math.ceil already returns int in
Python 3); ensure max_dim and math are unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 2970cd46-686d-4e14-9133-0fd572e619d3
📒 Files selected for processing (2)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.pynewton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py (1)
1-1:⚠️ Potential issue | 🟡 MinorUse the file creation year in the SPDX header.
This new file’s header still uses
2025; for this 2026-created file it should be2026.📝 Proposed fix
-# SPDX-FileCopyrightText: Copyright (c) 2025 The Newton Developers +# SPDX-FileCopyrightText: Copyright (c) 2026 The Newton DevelopersAs per coding guidelines: "In SPDX copyright lines, use the year the file was first created. Do not create date ranges or update the year when modifying a file."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py` at line 1, Update the SPDX copyright year in the file header from 2025 to 2026; locate the SPDX header line beginning with "SPDX-FileCopyrightText" in llt_blocked_rcm.py and change the year token "2025" to "2026" so the file’s creation year is correct.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 263-273: The override of _reset_impl in llt_blocked_rcm_solver.py
must match the base signature by accepting (self, A, **kwargs); change the
method definition to def _reset_impl(self, A, **kwargs) -> None and keep the
existing attribute zeroing (_L, _y, _A_hat, _b_hat, _x_hat, _P, _inv_P,
_tile_pattern) and _has_factors reset as-is (you can ignore A if unused), so
calls that forward A/kwargs to _reset_impl won't raise TypeError; ensure the
method name and attributes (_reset_impl, _L, _y, _A_hat, _b_hat, _x_hat, _P,
_inv_P, _tile_pattern, _has_factors) are used exactly as in the diff.
- Around line 63-66: Replace the private `_src` cross-reference in the
llt_blocked_rcm_solver.py docstring with a public/short Sphinx target: find the
docstring referencing `newton._src...LLTBlockedSolver` and change it to a
public, shortest cross-reference such as
`:class:\`newton.solvers.kamino.linalg.linear.LLTBlockedSolver\`` or simply
`:class:\`LLTBlockedSolver\`` so the docstring in the LLTBlockedRCMSolver
(file/class: llt_blocked_rcm_solver.py / LLTBlockedRCMSolver) uses the public
API path and not the private `_src` module.
- Around line 92-167: The constructor allows arbitrary dtype but underlying
kernels (see make_llt_blocked_rcm_factorize_kernel,
make_llt_blocked_rcm_solve_kernel, make_llt_blocked_rcm_solve_inplace_kernel)
are hard-coded to float32 causing buffer/kernel dtype mismatch; in
LLTBlockedRCMSolver.__init__ (immediately after the docstring) add a guard that
raises NotImplementedError when dtype != float32 (message: "LLTBlockedRCMSolver
currently supports only float32.") and ensure _allocate_impl still uses
self._dtype; also add a short CHANGELOG entry noting the float32-only
limitation.
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 37-41: The doc example imports a private `_src` path; update the
example to avoid private imports by either removing the import line or replacing
it with the public-facing symbol import (e.g., import create_rcm_batch_launch
from the package's public module that re-exports it) — locate the example in
rcm_batch.py (the code block referencing create_rcm_batch_launch) and change the
import to use the public API name or drop it from the docs so there are no
references to newton._src.
---
Duplicate comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.py`:
- Line 1: Update the SPDX copyright year in the file header from 2025 to 2026;
locate the SPDX header line beginning with "SPDX-FileCopyrightText" in
llt_blocked_rcm.py and change the year token "2025" to "2026" so the file’s
creation year is correct.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: d031a504-dfa6-445e-9470-a77aebdd0d20
📒 Files selected for processing (6)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked.pynewton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm.pynewton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.pynewton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_semi_sparse.pynewton/_src/solvers/kamino/_src/linalg/factorize/llt_sequential.pynewton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))
Description
Experimental reverse Cuthill McKee reordering for Cholesky to reduce fill in.
Checklist
CHANGELOG.mdhas been updated (if user-facing change)Test plan
Summary by CodeRabbit
New Features
Configuration
Performance