fix(memory): close local embedding providers on timeout#83858
fix(memory): close local embedding providers on timeout#83858brokemac79 wants to merge 1 commit into
Conversation
|
Codex review: passed. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. with high confidence from source plus supplied live proof: current main lacks a provider close path, and the linked issue/PR logs show the 2026.5.18 Telegram Active Memory timeout retaining a local GGUF mapping that the PR head no longer retains. PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. PR egg Rarity: 🥚 common. What is this egg doing here?
Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the scoped timeout cleanup after exact-head gates, and track broader idle-TTL, sub-agent removal, and config-change eviction policies separately if maintainers want them. Do we have a high-confidence way to reproduce the issue? Yes, with high confidence from source plus supplied live proof: current main lacks a provider close path, and the linked issue/PR logs show the 2026.5.18 Telegram Active Memory timeout retaining a local GGUF mapping that the PR head no longer retains. Is this the best way to solve the issue? Yes. The provider close lifecycle plus scoped timeout teardown is the narrow maintainable fix boundary; broader idle or config-change cleanup is useful follow-up work rather than a blocker for this timeout leak. Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 1d77170a305b. |
a0e1021 to
f932d64
Compare
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
hxy91819
left a comment
There was a problem hiding this comment.
LGTM.
Here is what I have dug on this PR tech background.
In OpenClaw's design, each sub-agent gets its own cached embedding model instance (via MemoryIndexManager / QMD manager) for memory search. When a sub-agent's memory search times out, the local node-llama-cpp model's VRAM/RSS was never released — even if that sub-agent never does another memory search, the resources stay allocated for the lifetime of the process. The existing closeAll* path only runs at process shutdown, so during runtime memory only grows. This PR adds a runtime release path by wiring provider.close() through a scoped teardown chain triggered on timeout.
Trade-off acknowledged: After timeout, the cache entry is evicted alongside the resources, so the next memory search for that agent cold-starts the model. This is acceptable because a timeout already signals something is wrong (model too large, machine too slow), and retaining potentially failing resources can worsen the problem. Cold start costs seconds; OOM is fatal. If smoother reactivation is needed later, an idle-TTL strategy can layer on top of the close() infrastructure this PR adds, without further changes to the provider lifecycle.
|
Reviewed this PR in depth. The lifecycle and scoped teardown approach is sound — this is standard practice for native resource management (close/abort/dispose, ownership guards, scoped teardown vs global shutdown). One area for future follow-up: The current trigger paths for resource release are timeout and process shutdown. There are additional lifecycle events that could also warrant cleanup, but these are not blockers for this PR:
None of these block this PR. The |
Gray-box verification reportRan two manual gray-box smoke suites against both Layer 1:
|
| Metric | main | PR |
|---|---|---|
provider.close method |
undefined |
function |
First embedQuery RSS gain |
+451 MB | +454 MB |
After close() RSS release |
N/A | -317 MB |
close() idempotent |
N/A | ✓ |
embedQuery after close |
N/A | "Local embedding provider has been closed" |
Layers 2–4: Manager + scoped teardown + facade chain
scripts/smoke/memory-lifecycle-graybox.mjs (in openclaw-e2e-go)
| Layer | What's tested | PR | main |
|---|---|---|---|
| L2 | MemoryIndexManager.close() resolves + idempotent |
✓ | ✓ |
| L3 | closeMemorySearchManager({ agentId }) — one agent closed, others untouched |
✓ | ❌ API nonexistent |
| L4 | closeActiveMemorySearchManager facade chain resolves |
✓ | ❌ API nonexistent |
What this proves
- Native memory is actually released — real RSS measurement shows
disposeResources([context, model, llama])frees ~317 MB. - Layers 3–4 don't exist on
main— the scoped teardown and facade APIs are entirely new, and their absence on main correctly reproduces the original leak path. - Manager lifecycle is healthy —
close()resolves and is idempotent on both branches.
Combined with the existing 16 unit tests (which cover INDEX_CACHE guard, QMD manager path, and active-memory timeout trigger chain), the fix is well-verified.
Scripts
Located at openclaw-e2e-go/scripts/smoke/:
memory-provider-close.mjs— Layer 1 (requiresnode-llama-cppinstalled)memory-lifecycle-graybox.mjs— Layers 2–4 (no native deps needed)REPORT-PR83858.md— full report with coverage matrix
83baa18 to
f932d64
Compare
|
/clawsweeper re-review |
|
/clawsweeper automerge |
|
🦞🔧 Source: I will update this PR branch, or open a safe credited replacement, if the repair worker finds a narrow CI fix. Automerge progress:
|
|
ClawSweeper 🐠 reef update Thanks for the work here. ClawSweeper could not write to the source branch, so it opened a replacement PR rather than letting the fix drift. attribution still points back here. Why replacement: ClawSweeper could not update the source PR branch directly; GitHub did not grant sufficient push rights to the bot for that branch.
fish notes: model gpt-5.5, reasoning high; reviewed against 70e7256. |
Fixes #83792
Summary
ClawSweeper guidance followed
Testing
node scripts/run-vitest.mjs packages/memory-host-sdk/src/host/embeddings.test.tsnode scripts/run-vitest.mjs extensions/active-memory/index.test.ts -t "releases memory search managers after active-memory timeouts"node scripts/run-vitest.mjs extensions/memory-core/src/memory/index.test.ts -t "closes embedding providers|evicts scoped|retries embedding provider close"node scripts/run-vitest.mjs extensions/memory-core/src/memory/search-manager.test.ts -t "qmd open-failure cooldown|requested agent"node scripts/run-vitest.mjs src/plugin-sdk/memory-host-search.test.ts src/plugins/memory-runtime.test.ts extensions/memory-core/index.test.ts -t "scoped|runtime cleanup|lazy runtime"tsgocore, extensions, package-test, and extension-test slicesoxlintbinary on touched core/package files and touched extension filesgit diff --checktsgoextensions + extension tests, directoxlinton touched files,git diff --checkReal behavior proof
Behavior or issue addressed: Active Memory Telegram preflight on the reporter's live VPS should not retain a local node-llama-cpp/GGUF embedding model mapping after an Active Memory timeout.
Real environment tested: brokemac79's live OpenClaw VPS instance (
polymarket-mc), Ubuntu user systemd serviceopenclaw-gateway.service, OpenClaw 2026.5.18 package with this PR build deployed at headf932d646b33fcfe09be1c0dcbdfae3fb7351c6b4; Telegram group/topic IDs redacted; config hadmemorySearch.provider: "local"andplugins.slots.memory: "memory-core".Exact steps or command run after this patch: Deployed the PR build to the live VPS package directory with root-owned plugin files, restarted
openclaw-gateway.service, sent a Telegram message to@Ant_clawd_bot, then monitored the live gateway PID withps,journalctl --user -u openclaw-gateway.service, and/proc/<pid>/mapsforembeddinggemma|gguf|node-llama|Q8_0for roughly five minutes after the timeout.Evidence after fix:
Observed result after fix: The live VPS reproduced the Active Memory timeout path from Telegram and still returned a Telegram reply. The process had zero matching GGUF/node-llama mappings before the run, during the timeout monitoring window, and in the final after snapshot (
MAP_COUNT_AFTER=0), so the local embedding model mapping was not retained after the timeout.Not tested: Long soak across many repeated Telegram topics, non-Linux hosts, every alternate memory provider, and a maintainer-run production package artifact. The proof above is from the reporter's live Linux VPS instance.
Note: the repo
scripts/run-oxlint.mjswrapper failed before lint on this Windows checkout becauseC:\Program Files\nodejs\node.exewas not shell-quoted while preparing extension boundary artifacts. Directoxlinton the touched files passed.