Fix garbage collector issues in Kamino RCM solver by nvtw · Pull Request #2575 · newton-physics/newton

nvtw · 2026-04-24T13:16:35Z

Description

Fix garbage collector issues in Kamino RCM solver

Checklist

New or existing tests cover these changes
The documentation is up to date with these changes
CHANGELOG.md has been updated (if user-facing change)

Test plan

Manual testing

Bug fix

Steps to reproduce:

Hard to reproduce. GC frees memory but only in CPU mode. There is some inconsistency between cuda and CPU mode about GC handles are tracked.

Minimal reproduction:

import newton

# Code that demonstrates the bug

Summary by CodeRabbit

Refactor
- Improved internal memory management for linear algebra solvers by optimizing scratch buffer allocation and lifetime handling during matrix reordering operations. Enhanced robustness of buffer handling in solver callbacks to ensure correct memory access patterns.

coderabbitai · 2026-04-24T13:20:11Z

📝 Walkthrough

Walkthrough

The changes refactor scratch buffer ownership in RCM (Reverse Cuthill-McKee) reordering: the LLTBlockedRCMSolver now allocates and owns dedicated batched-RCM scratch buffers during initialization, passing them to create_rcm_batch_launch which accepts caller-allocated scratch instead of allocating internally.

Changes

Cohort / File(s)	Summary
Solver Scratch Allocation & Threading `newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`	Solver allocates dedicated batched-RCM scratch storage during `_allocate_impl` and threads it into the RCM launch callback creation via `_ensure_reorder_launches_bound`, ensuring callback-captured inputs include scratch buffers.
RCM Batch Launch Refactor `newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`	`create_rcm_batch_launch` signature updated to accept caller-allocated `scratch` dict parameter (placed after `vio: wp.array`); removes internal scratch allocation and updates docstring with explicit lifetime requirements for Warp CPU backend compatibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Kamino Cholesky solver with RCM support #2554: Directly modifies the same RCM batching and LLTBlockedRCMSolver code paths, changing create_rcm_batch_launch to accept caller-owned scratch buffers and updating LLTBlockedRCMSolver to allocate and thread that scratch into recorded RCM launches.

Suggested reviewers

vastsoun

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fix garbage collector issues in Kamino RCM solver' accurately and specifically describes the main change. It references the correct component (Kamino RCM solver) and the primary issue being addressed (garbage collector issues).
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py (1)
222-274: ⚠️ Potential issue | 🟡 Minor

Reset stale _reorder_callback and _reorder_attached_to after re-allocation.

_allocate_impl allocates new _P, _rcm_scratch, and other buffers but does not reset _reorder_callback / _reorder_attached_to. If _allocate_impl is called a second time and a later _factorize_impl receives an A object that is is-identical to the one seen before, _ensure_reorder_launches_bound will return early (line 302). The stale callback will replay and write into the previously-allocated buffers, now deallocated or repurposed. This is a pre-existing hazard for _P, but widens the captured surface to _rcm_scratch.
Suggested defensive reset
             self._rcm_scratch = _rcm_batch.allocate_rcm_batch_scratch(
                 total_vec=info.total_vec_size,
                 num_blocks=info.num_blocks,
                 device=self._device,
             )

+        # Any previously-recorded reorder callback captured the old _P /
+        # _rcm_scratch; force a rebind on the next factorize.
+        self._reorder_callback = None
+        self._reorder_attached_to = None
+
         # The batched-RCM launch callback (``self._reorder_callback``) is
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`
around lines 222 - 274, Allocate_impl currently recreates buffers like _P and
_rcm_scratch but does not clear the stale reorder state; reset _reorder_callback
and _reorder_attached_to to None (or their initial empty state) at the end of
_allocate_impl so subsequent calls won't early-return in
_ensure_reorder_launches_bound and replay callbacks into newly
allocated/repurposed buffers; update _allocate_impl (reference symbols:
_allocate_impl, _reorder_callback, _reorder_attached_to, _P, _rcm_scratch,
_ensure_reorder_launches_bound, _factorize_impl) to defensively clear those
fields after allocating the new buffers.

🧹 Nitpick comments (1)

newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py (1)
339-373: Ownership change and docstring look good.

Hoisting scratch allocation to the caller is the right fix: with record_cmd=True, the Warp CPU backend does not keep strong Python references to recorded inputs, so locally-allocated scratch was being reclaimed as soon as create_rcm_batch_launch returned. Calling out that contract explicitly in the docstring is very helpful.

Optional nit: tightening the annotation to dict[str, wp.array] (or a small TypedDict/dataclass returned by allocate_rcm_batch_scratch) would make the expected keys self-documenting and catch a missing "degree"/"level"/… at type-check time rather than at launch time.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` around lines
339 - 373, The scratch parameter for create_rcm_batch_launch is currently
untyped as dict; tighten it to a specific mapping so missing keys are caught
earlier: change the annotation to dict[str, wp.array] or better, define a small
TypedDict or dataclass (e.g., ScratchBuffers or AllocateRCMScratchTypedDict)
that reflects the keys produced by allocate_rcm_batch_scratch (include expected
keys like "degree", "level", etc.), update create_rcm_batch_launch's signature
to accept that TypedDict/dataclass, and use that type in callers and the
allocate_rcm_batch_scratch return so the contract is explicit and
type-checkable.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 222-274: Allocate_impl currently recreates buffers like _P and
_rcm_scratch but does not clear the stale reorder state; reset _reorder_callback
and _reorder_attached_to to None (or their initial empty state) at the end of
_allocate_impl so subsequent calls won't early-return in
_ensure_reorder_launches_bound and replay callbacks into newly
allocated/repurposed buffers; update _allocate_impl (reference symbols:
_allocate_impl, _reorder_callback, _reorder_attached_to, _P, _rcm_scratch,
_ensure_reorder_launches_bound, _factorize_impl) to defensively clear those
fields after allocating the new buffers.

---

Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 339-373: The scratch parameter for create_rcm_batch_launch is
currently untyped as dict; tighten it to a specific mapping so missing keys are
caught earlier: change the annotation to dict[str, wp.array] or better, define a
small TypedDict or dataclass (e.g., ScratchBuffers or
AllocateRCMScratchTypedDict) that reflects the keys produced by
allocate_rcm_batch_scratch (include expected keys like "degree", "level", etc.),
update create_rcm_batch_launch's signature to accept that TypedDict/dataclass,
and use that type in callers and the allocate_rcm_batch_scratch return so the
contract is explicit and type-checkable.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: e402c4f7-8a7f-4367-a2fc-eb33d9e8ac9d

📥 Commits

Reviewing files that changed from the base of the PR and between f9a39d7 and 0201c67.

📒 Files selected for processing (2)

newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py
newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py

codecov · 2026-04-24T13:36:19Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

vastsoun

LGTM

The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))

Fix garbage collector issues

0201c67

nvtw self-assigned this Apr 24, 2026

nvtw requested a review from vastsoun April 24, 2026 13:16

coderabbitai Bot reviewed Apr 24, 2026

View reviewed changes

vastsoun approved these changes Apr 27, 2026

View reviewed changes

nvtw added this pull request to the merge queue Apr 27, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 27, 2026

nvtw added this pull request to the merge queue Apr 27, 2026

Merged via the queue into newton-physics:main with commit 91d21b1 Apr 27, 2026
25 checks passed

preist-nvidia mentioned this pull request May 4, 2026

Clean up CHANGELOG before 1.2 release branch cut #2684

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix garbage collector issues in Kamino RCM solver#2575

Fix garbage collector issues in Kamino RCM solver#2575
nvtw merged 1 commit into
newton-physics:mainfrom
nvtw:dev/tw3/fix_kamino_gc_issue

nvtw commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented Apr 24, 2026

Uh oh!

vastsoun left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nvtw commented Apr 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist

Test plan

Bug fix

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Apr 24, 2026

Codecov Report

Uh oh!

vastsoun left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nvtw commented Apr 24, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 24, 2026 •

edited

Loading