fix(subagents): honor archiveAfterMinutes for session-mode reaping #78263
Codex review: needs maintainer review before merge.

Reproducibility: yes. Current main's sweep path deletes completed no-archiveAtMs rows after the five-minute SESSION_RUN_TTL_MS, and the PR body supplies terminal before/after output for the subagents list symptom.

Review details

Best possible solution: Land the config-driven retention fix after maintainer review, keeping archiveAfterMinutes as the single retention control for completed subagent registry rows.

Do we have a high-confidence way to reproduce the issue? Yes. Current main's sweep path deletes completed no-archiveAtMs rows after the five-minute SESSION_RUN_TTL_MS, and the PR body supplies terminal before/after output for the subagents list symptom.

Is this the best way to solve the issue? Yes. Reusing resolveArchiveAfterMs in the sweep loop is the narrowest maintainable fix because run-mode already uses that helper and the public docs already expose archiveAfterMinutes as the subagent retention knob.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 69d446d1784c.
Session-mode subagent registry rows were reaped on a hardcoded 5-minute TTL instead of the configured `agents.defaults.subagents.archiveAfterMinutes` window (default 60 minutes) that run-mode already honors for `archiveAtMs`. That asymmetry meant `subagents list` and other registry-backed status surfaces lost completed runs five minutes after cleanup, even when the operator's configured retention was longer, and gave operators no way to tune session-mode retention at all. On slower messaging surfaces and when agent-to-agent transcript access is disabled, completed sub-agents appeared to silently disappear. Drop `SESSION_RUN_TTL_MS` and have the sweep loop call `resolveArchiveAfterMs` so both spawn modes reap on the same configured horizon. Setting `archiveAfterMinutes: 0` now disables session-mode reaping just like it disables run-mode `sessions.delete`. Tests scope a positive `archiveAfterMinutes` for the swept-context-engine fixture so the deletion path still fires under the new config-driven sweep.
Merged via squash.
Thanks @arniesaha!
…penclaw#78263) Merged via squash. Prepared head SHA: b415467 Co-authored-by: arniesaha <3646287+arniesaha@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
Summary
Have completed session-mode subagent registry rows reaped on `agents.defaults.subagents.archiveAfterMinutes`, the same retention knob run-mode already uses for `archiveAtMs`, instead of a separate hardcoded 5-minute TTL.

Background
Session-mode and run-mode subagent registry rows had two different retention horizons:

- Run-mode (`spawnMode: "run"`): the row carries `archiveAtMs = now + archiveAfterMs`, where `archiveAfterMs` is derived from `agents.defaults.subagents.archiveAfterMinutes` (default 60 minutes). At sweep time the row is removed and the child session is deleted via `sessions.delete`.
- Session-mode (`spawnMode: "session"`): the row carries no `archiveAtMs` (the child session is retained independently), and was instead reaped on a hardcoded `SESSION_RUN_TTL_MS = 5 minutes`, completely ignoring the configured `archiveAfterMinutes` window.

This asymmetry was a bug:

- When `tools.agentToAgent` is disabled (so the cross-agent history/send/status fallback isn't available), an operator asking "what happened to the sub-agents I just delegated to?" got an empty `subagents list` while the child sessions were still alive on disk. Completed sub-agents appeared to silently disappear.

Fix
Drop `SESSION_RUN_TTL_MS`. In the sweep loop, resolve `sessionRetentionMs = resolveArchiveAfterMs(cfg)` once per tick and use it as the absolute TTL for session-mode rows after `cleanupCompletedAt`. Run-mode behavior is unchanged.

- Defaults stay the same as run-mode: 60 minutes.
- `archiveAfterMinutes: 0` now disables session-mode reaping (registry row kept indefinitely) just like it already disables run-mode `sessions.delete`.

Behavior change for users
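To make the change concrete, here is a minimal TypeScript sketch of how a config-driven retention window could decide whether a completed session-mode row is reaped. It is not the code from this PR: the type names, `resolveRetentionMsSketch`, `shouldReapSessionRow`, and `sweepSessionRows` are hypothetical stand-ins, and the `>=` boundary comparison is an assumption. Only the 60-minute default, the "0 disables reaping" rule, and the use of `cleanupCompletedAt` as the reference time come from the description above.

```typescript
const MINUTE_MS = 60 * 1000;
// Default taken from the PR description (60 minutes).
const DEFAULT_ARCHIVE_AFTER_MINUTES = 60;

// Hypothetical, simplified shapes; the real config and registry types are not shown here.
interface SubagentsConfigSketch {
  archiveAfterMinutes?: number;
}

interface SessionModeRowSketch {
  runId: string;
  cleanupCompletedAt?: number; // epoch ms; unset while cleanup has not completed
}

// Stand-in for resolveArchiveAfterMs: undefined means "never reap".
function resolveRetentionMsSketch(cfg: SubagentsConfigSketch): number | undefined {
  const minutes = cfg.archiveAfterMinutes ?? DEFAULT_ARCHIVE_AFTER_MINUTES;
  if (!Number.isFinite(minutes) || minutes <= 0) return undefined;
  return minutes * MINUTE_MS;
}

// A row is reaped only once cleanup has completed and the window has elapsed.
function shouldReapSessionRow(
  row: SessionModeRowSketch,
  retentionMs: number | undefined,
  now: number,
): boolean {
  if (row.cleanupCompletedAt === undefined) return false;
  if (retentionMs === undefined) return false;
  return now - row.cleanupCompletedAt >= retentionMs; // assumed inclusive boundary
}

// One sweep tick: resolve the window once, then keep the rows that are not yet due.
function sweepSessionRows(
  rows: SessionModeRowSketch[],
  cfg: SubagentsConfigSketch,
  now: number,
): SessionModeRowSketch[] {
  const retentionMs = resolveRetentionMsSketch(cfg);
  return rows.filter((row) => !shouldReapSessionRow(row, retentionMs, now));
}
```

Under these assumptions, a row whose cleanup completed at time 0 survives a sweep at the 30-minute mark with the default config, is dropped by a sweep at the 70-minute mark, and is never dropped when `archiveAfterMinutes` is 0.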
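| Config | Session-mode retention before | Session-mode retention after |
| --- | --- | --- |
| `archiveAfterMinutes` unset (default) | 5 minutes | 60 minutes |
| `archiveAfterMinutes: 30` | 5 minutes | 30 minutes |
| `archiveAfterMinutes: 0` | 5 minutes | kept indefinitely |

Default-configured installs see the operator-visible retention extend from 5 → 60 minutes for session-mode runs. Users who explicitly set `archiveAfterMinutes` now get that value applied uniformly to both spawn modes.

Real behavior proof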
Behavior or issue addressed: Completed session-mode subagent runs disappeared from `subagents list` about five minutes after the child session finished, even though the child session itself was still alive on disk and the operator's configured `archiveAfterMinutes` was 60. Operators on slower messaging channels saw their delegated sub-agents as if they had silently exited.

Real environment tested: Local OpenClaw build from this branch on Linux (Node 22), parent agent running in a terminal session, two session-mode sub-agents spawned from the parent and allowed to complete normally. Reproduced first against `main` (5-minute hardcoded TTL) and then against this branch (config-driven, default 60 minutes) using the same setup. Also verified `archiveAfterMinutes: 0` keeps session-mode rows indefinitely after this change.

Exact steps or command run after this patch: Started the parent agent, used the spawn flow to start two session-mode sub-agents, waited for them to reach completion (`endedAt` set, `cleanupCompletedAt` set), then waited about 30 minutes wall-clock and ran `openclaw subagents list` from the parent operator surface. Repeated at the ~70-minute mark to confirm sweep still fires under the default. Repeated with `archiveAfterMinutes: 0` configured to confirm rows are kept indefinitely.

Evidence after fix: Terminal output captured below.
Before fix, against `main`, ~6 minutes after both sub-agents finished (default config):

After fix, same setup, ~30 minutes after the sub-agents finished:

After fix, ~70 minutes after the sub-agents finished (past the default 60-minute window):

After fix with `agents.defaults.subagents.archiveAfterMinutes: 0`, several hours after the sub-agents finished:

Observed result after fix: Completed session-mode runs stayed visible in `subagents list` for the full configured `archiveAfterMinutes` window after `cleanupCompletedAt`, then were swept exactly as before. Run-mode entries (still driven by `archiveAtMs`) were unchanged. `archiveAfterMinutes: 0` disabled session-mode reaping consistent with the existing run-mode semantic. No leftover state in `subagent-runs.json` after sweep.

What was not tested: Cross-host behavior with multiple operators sharing a gateway; very large registries (hundreds of completed runs simultaneously held under the longer default; sweep cost per entry is unchanged, but holding more entries in memory was not separately benchmarked).
Testing
- `pnpm test src/agents/subagent-registry.test.ts`
- `pnpm test src/agents/subagent-registry.persistence.test.ts`

Notes for reviewers

The test `passes stored agentDir through swept context-engine cleanup paths` scopes a positive `archiveAfterMinutes` value for that case, since the suite-wide mock config sets `archiveAfterMinutes: 0` (which under the new config-driven sweep is "never reap"). The run-mode part of the same test continues to use a directly-pinned `archiveAtMs` and is unaffected.