Fix garbage collector issues in Kamino RCM solver#2575
Conversation
📝 WalkthroughWalkthroughThe changes refactor scratch buffer ownership in RCM (Reverse Cuthill-McKee) reordering: the LLTBlockedRCMSolver now allocates and owns dedicated batched-RCM scratch buffers during initialization, passing them to Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py (1)
222-274:⚠️ Potential issue | 🟡 MinorReset stale
_reorder_callbackand_reorder_attached_toafter re-allocation.
_allocate_implallocates new_P,_rcm_scratch, and other buffers but does not reset_reorder_callback/_reorder_attached_to. If_allocate_implis called a second time and a later_factorize_implreceives anAobject that isis-identical to the one seen before,_ensure_reorder_launches_boundwill return early (line 302). The stale callback will replay and write into the previously-allocated buffers, now deallocated or repurposed. This is a pre-existing hazard for_P, but widens the captured surface to_rcm_scratch.Suggested defensive reset
self._rcm_scratch = _rcm_batch.allocate_rcm_batch_scratch( total_vec=info.total_vec_size, num_blocks=info.num_blocks, device=self._device, ) + # Any previously-recorded reorder callback captured the old _P / + # _rcm_scratch; force a rebind on the next factorize. + self._reorder_callback = None + self._reorder_attached_to = None + # The batched-RCM launch callback (``self._reorder_callback``) is🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py` around lines 222 - 274, Allocate_impl currently recreates buffers like _P and _rcm_scratch but does not clear the stale reorder state; reset _reorder_callback and _reorder_attached_to to None (or their initial empty state) at the end of _allocate_impl so subsequent calls won't early-return in _ensure_reorder_launches_bound and replay callbacks into newly allocated/repurposed buffers; update _allocate_impl (reference symbols: _allocate_impl, _reorder_callback, _reorder_attached_to, _P, _rcm_scratch, _ensure_reorder_launches_bound, _factorize_impl) to defensively clear those fields after allocating the new buffers.
🧹 Nitpick comments (1)
newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py (1)
339-373: Ownership change and docstring look good.Hoisting scratch allocation to the caller is the right fix: with
record_cmd=True, the Warp CPU backend does not keep strong Python references to recorded inputs, so locally-allocated scratch was being reclaimed as soon ascreate_rcm_batch_launchreturned. Calling out that contract explicitly in the docstring is very helpful.Optional nit: tightening the annotation to
dict[str, wp.array](or a smallTypedDict/dataclass returned byallocate_rcm_batch_scratch) would make the expected keys self-documenting and catch a missing"degree"/"level"/… at type-check time rather than at launch time.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py` around lines 339 - 373, The scratch parameter for create_rcm_batch_launch is currently untyped as dict; tighten it to a specific mapping so missing keys are caught earlier: change the annotation to dict[str, wp.array] or better, define a small TypedDict or dataclass (e.g., ScratchBuffers or AllocateRCMScratchTypedDict) that reflects the keys produced by allocate_rcm_batch_scratch (include expected keys like "degree", "level", etc.), update create_rcm_batch_launch's signature to accept that TypedDict/dataclass, and use that type in callers and the allocate_rcm_batch_scratch return so the contract is explicit and type-checkable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.py`:
- Around line 222-274: Allocate_impl currently recreates buffers like _P and
_rcm_scratch but does not clear the stale reorder state; reset _reorder_callback
and _reorder_attached_to to None (or their initial empty state) at the end of
_allocate_impl so subsequent calls won't early-return in
_ensure_reorder_launches_bound and replay callbacks into newly
allocated/repurposed buffers; update _allocate_impl (reference symbols:
_allocate_impl, _reorder_callback, _reorder_attached_to, _P, _rcm_scratch,
_ensure_reorder_launches_bound, _factorize_impl) to defensively clear those
fields after allocating the new buffers.
---
Nitpick comments:
In `@newton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py`:
- Around line 339-373: The scratch parameter for create_rcm_batch_launch is
currently untyped as dict; tighten it to a specific mapping so missing keys are
caught earlier: change the annotation to dict[str, wp.array] or better, define a
small TypedDict or dataclass (e.g., ScratchBuffers or
AllocateRCMScratchTypedDict) that reflects the keys produced by
allocate_rcm_batch_scratch (include expected keys like "degree", "level", etc.),
update create_rcm_batch_launch's signature to accept that TypedDict/dataclass,
and use that type in callers and the allocate_rcm_batch_scratch return so the
contract is explicit and type-checkable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: e402c4f7-8a7f-4367-a2fc-eb33d9e8ac9d
📒 Files selected for processing (2)
newton/_src/solvers/kamino/_src/linalg/factorize/llt_blocked_rcm_solver.pynewton/_src/solvers/kamino/_src/linalg/factorize/rcm_batch.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The pre-release audit found user-facing commits that landed since v1.1.0 without an [Unreleased] entry. Backfill five of them here. Kamino-related commits (newton-physics#2280, newton-physics#2517, newton-physics#2554, newton-physics#2575) are intentionally omitted: Kamino is still alpha and not currently tracked in CHANGELOG. - Changed: contact-conversion overhead reduction in SolverMuJoCo ([newton-physicsGH-2393](newton-physics#2393)) - Fixed: target_voxel_size ignored on the texture-SDF path ([newton-physicsGH-2464](newton-physics#2464)) - Fixed: Newton-to-mujoco-warp material-combination mismatch ([newton-physicsGH-2439](newton-physics#2439)) - Fixed: eq_objtype mismatch on joint equality / mimic constraints in SolverMuJoCo ([newton-physicsGH-2459](newton-physics#2459)) - Fixed: _tiled_sum_kernel launch-dim handling under warp-lang 1.13 ([newton-physicsGH-2546](newton-physics#2546))
Description
Fix garbage collector issues in Kamino RCM solver
Checklist
CHANGELOG.mdhas been updated (if user-facing change)Test plan
Manual testing
Bug fix
Steps to reproduce:
Hard to reproduce. GC frees memory but only in CPU mode. There is some inconsistency between cuda and CPU mode about GC handles are tracked.
Minimal reproduction:
Summary by CodeRabbit