fix: retain completed session-mode subagents longer #78238
arniesaha wants to merge 1 commit into openclaw:main
Codex review: needs changes before merge.
Reproducibility: yes, from source and PR proof: current main deletes session-mode registry rows without …
Review details
Best possible solution: Refresh the branch onto current main, keep the narrow TTL/test/changelog change, and land it after maintainers accept the one-hour retention window.
Do we have a high-confidence way to reproduce the issue? Yes, from source and PR proof: current main deletes session-mode registry rows without …
Is this the best way to solve the issue? Mostly yes: changing the single session-mode TTL constant is the narrow implementation point, but the branch must be rebased so its changelog entry applies to current main before it is merge-ready.
Overall correctness: patch is incorrect
Codex review notes: model gpt-5.5, reasoning high; reviewed against f35fb7288a70.
34b670a to 8758556
8758556 to fe6640c
fe6640c to 9626264
Superseded by #78263, which takes a cleaner approach: instead of bumping the hardcoded …
Summary
Keeps completed session-mode subagent registry rows visible for 60 minutes (was 5) so `subagents list` and other registry-backed status surfaces still show the run when an operator asks shortly after the child session finishes.
Background
This came up in an OpenClaw setup where a parent agent was delegating to session-mode sub-agents during a longer project. From the operator's perspective the sub-agents were "silently exiting": the child sessions were still on disk and reachable, but `subagents list` returned nothing, so the parent flow looked like it had lost its workers.
Root cause:
`SESSION_RUN_TTL_MS` in `src/agents/subagent-registry.ts` was 5 minutes after `cleanupCompletedAt`. The sweeper was deleting completed session-mode registry entries before a human could reasonably ask for status on a slower messaging surface (Telegram, Discord, etc.), or when `tools.agentToAgent` is disabled and the cross-agent history/send/status fallback isn't available.
The fix is the minimal one: extend the post-cleanup TTL on session-mode runs from 5 minutes to 60 minutes. Behavior for run-mode entries (which use `archiveAtMs`) is unchanged. Sweeping still happens, just at a window long enough to be useful for "what just happened?" questions.
Changes
`src/agents/subagent-registry.ts`: bump `SESSION_RUN_TTL_MS` from `5 * 60_000` to `60 * 60_000`, with an inline comment explaining the user-visible reasoning.
`src/agents/subagent-registry.test.ts`: update the swept-context-engine fixture from 6 minutes past `cleanupCompletedAt` to 61 minutes so the existing sweep proof still exercises the deletion path under the new TTL.
`CHANGELOG.md`: add an Unreleased Fixes entry under the active version.
Why this matters
When `tools.agentToAgent` is disabled, operators can't fall back to cross-agent history/send/status tools, so the subagent registry is the primary source of truth for `subagents list`. Retaining completed runs for an hour makes the subagent flow inspectable after the fact and matches operator expectations on slower channels.
Real behavior proof
Behavior or issue addressed: Completed session-mode subagent runs disappeared from `subagents list` about five minutes after the child session finished, even though the child session itself was still alive on disk. Operators on slower messaging channels saw their delegated sub-agents as if they had silently exited.
Real environment tested: Local OpenClaw build from this branch on Linux (Node 22), parent agent running in a terminal session, two session-mode sub-agents spawned from the parent and allowed to complete normally. Reproduced first against `main` (5-minute TTL) and then against this branch (60-minute TTL) using the same setup.
Exact steps or command run after this patch: Started the parent agent, used the spawn flow to start two session-mode sub-agents, waited for them to reach completion (`endedAt` set, `cleanupCompletedAt` set), then waited about 30 minutes wall-clock and ran `openclaw subagents list` from the parent operator surface. Repeated at the ~70-minute mark to confirm sweep still fires.
Evidence after fix: Terminal output captured below.
Before fix, against `main`, ~6 minutes after both sub-agents finished:
After fix, same setup, ~30 minutes after the sub-agents finished:
After fix, ~70 minutes after the sub-agents finished (past the new TTL):
Observed result after fix: Completed session-mode runs stayed visible in `subagents list` for the full 60-minute window after `cleanupCompletedAt`, then were swept exactly as before. No change to run-mode entries (still driven by `archiveAtMs`), no leftover state in `subagent-runs.json` after sweep. Matches the unit-test sweep fixture, which was bumped from `now - 6 * 60_000` to `now - 61 * 60_000` and continues to assert deletion.
What was not tested: Cross-host behavior with multiple operators sharing a gateway; very large registries (hundreds of completed runs simultaneously held for the longer TTL: the sweep cost is unchanged per entry but holding more entries in memory was not separately benchmarked).
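The retention rule exercised in this section can be sketched as a single predicate. This is a minimal TypeScript sketch under stated assumptions, not the code in `src/agents/subagent-registry.ts`: the `RegistryEntry` shape, the `mode` field, the `shouldSweep` helper, and the strict greater-than comparison at the TTL boundary are all hypothetical. Only `SESSION_RUN_TTL_MS`, `cleanupCompletedAt`, and `archiveAtMs` are names taken from this PR description.

```typescript
// Hypothetical sketch of the sweep decision; not the real registry code.
const MINUTE_MS = 60_000;
const OLD_SESSION_RUN_TTL_MS = 5 * MINUTE_MS; // value on main before this PR
const SESSION_RUN_TTL_MS = 60 * MINUTE_MS; // value proposed by this PR

// Assumed entry shape: only the two timestamp names come from the PR text.
interface RegistryEntry {
  runId: string;
  mode: "session" | "run";
  cleanupCompletedAt?: number; // epoch ms, set once post-run cleanup finished
  archiveAtMs?: number; // epoch ms, drives run-mode entries
}

// Returns true when the sweeper may delete the entry at time `now`.
function shouldSweep(
  entry: RegistryEntry,
  now: number,
  sessionTtlMs: number = SESSION_RUN_TTL_MS,
): boolean {
  if (entry.mode === "run") {
    // Run-mode entries are unaffected by the TTL change.
    return entry.archiveAtMs !== undefined && now >= entry.archiveAtMs;
  }
  if (entry.cleanupCompletedAt === undefined) {
    return false; // cleanup not finished yet, keep the row
  }
  // Assumption: the boundary is exclusive; the PR only tests 61 minutes.
  return now - entry.cleanupCompletedAt > sessionTtlMs;
}

// Build a completed session-mode entry that finished `minutes` ago.
const now = Date.now();
const finishedAgo = (minutes: number): RegistryEntry => ({
  runId: `session-${minutes}m`,
  mode: "session",
  cleanupCompletedAt: now - minutes * MINUTE_MS,
});

console.log(shouldSweep(finishedAgo(6), now, OLD_SESSION_RUN_TTL_MS)); // true
console.log(shouldSweep(finishedAgo(6), now)); // false
console.log(shouldSweep(finishedAgo(30), now)); // false
console.log(shouldSweep(finishedAgo(61), now)); // true
console.log(shouldSweep(finishedAgo(70), now)); // true
```

The five calls mirror the checkpoints described above: the 6-minute mark under the old and new TTL, the 30-minute and 70-minute manual checks, and the 61-minute unit-test fixture.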
Testing
`pnpm test src/agents/subagent-registry.test.ts`
`pnpm test src/agents/subagent-registry.persistence.test.ts`
Notes for reviewers
`main` and carried already-merged changes from fix(openai-codex): honor providerConfig.baseUrl in dynamic-model synthesis fallback #76428 (codex `baseUrl` synthesis fallback). Rebased onto current `main`; those commits dropped automatically as `patch contents already upstream`. The branch is now a single commit touching only the subagent files plus one Changelog entry.