fix: retain completed session-mode subagents longer #78238

Closed
arniesaha wants to merge 1 commit into openclaw:main from arniesaha:fix/subagent-session-registry-ttl

@arniesaha arniesaha commented May 6, 2026

Summary

Keeps completed session-mode subagent registry rows visible for 60 minutes (was 5) so subagents list and other registry-backed status surfaces still show the run when an operator asks shortly after the child session finishes.

Background

This came up in an OpenClaw setup where a parent agent was delegating to session-mode sub-agents during a longer project. From the operator's perspective the sub-agents were "silently exiting": the child sessions were still on disk and reachable, but subagents list returned nothing, so the parent flow looked like it had lost its workers.

Root cause: SESSION_RUN_TTL_MS in src/agents/subagent-registry.ts was 5 minutes after cleanupCompletedAt. The sweeper was deleting completed session-mode registry entries before a human could reasonably ask for status on a slower messaging surface (Telegram, Discord, etc.), or when tools.agentToAgent is disabled and the cross-agent history/send/status fallback isn't available.

The fix is the minimal one: extend the post-cleanup TTL on session-mode runs from 5 minutes to 60 minutes. Behavior for run-mode entries (which use archiveAtMs) is unchanged. Sweeping still happens, just at a window long enough to be useful for "what just happened?" questions.
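The sweep rule described above can be sketched roughly as follows. This is an illustration only, not the code in src/agents/subagent-registry.ts: the `RegistryEntrySketch` type and the `shouldSweep` helper are hypothetical names, and the run-mode branch (`now >= archiveAtMs`) is an assumption about how `archiveAtMs` is compared.

```typescript
// Hypothetical entry shape; the real registry type is not shown in this PR.
interface RegistryEntrySketch {
  runId: string;
  archiveAtMs?: number;        // present on run-mode entries
  cleanupCompletedAt?: number; // epoch ms, set once cleanup has finished
}

const SESSION_RUN_TTL_MS = 60 * 60_000; // was 5 * 60_000 before this PR

// Returns true when the sweeper should delete the entry.
function shouldSweep(
  entry: RegistryEntrySketch,
  now: number,
  sessionTtlMs: number = SESSION_RUN_TTL_MS,
): boolean {
  if (entry.archiveAtMs !== undefined) {
    // Run-mode: driven by archiveAtMs (assumed comparison), untouched by this PR.
    return now >= entry.archiveAtMs;
  }
  if (entry.cleanupCompletedAt === undefined) {
    // Cleanup has not completed yet, so there is nothing to age out.
    return false;
  }
  // Session-mode: delete once cleanupCompletedAt is older than the TTL.
  return now - entry.cleanupCompletedAt > sessionTtlMs;
}
```

Under this sketch, an entry whose cleanup completed 30 minutes ago is kept with the 60-minute TTL but would already have been deleted with the old 5-minute value.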

Changes

  • src/agents/subagent-registry.ts: bump SESSION_RUN_TTL_MS from 5 * 60_000 to 60 * 60_000, with an inline comment explaining the user-visible reasoning.
  • src/agents/subagent-registry.test.ts: update the swept-context-engine fixture from 6 minutes past cleanupCompletedAt to 61 minutes so the existing sweep proof still exercises the deletion path under the new TTL.
  • CHANGELOG.md: add an Unreleased Fixes entry under the active version.
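To see why the fixture offset had to move with the constant, here is a small sketch of a sweep over a simplified registry. The `sweepCompleted` helper and the run ids are hypothetical stand-ins, not the fixture from subagent-registry.test.ts; only the 6-minute and 61-minute offsets and the two TTL values come from this PR.

```typescript
// Hypothetical stand-in for the registry: run id -> cleanupCompletedAt (epoch ms).
type RegistrySketch = Record<string, number>;

// Deletes entries whose cleanup finished more than ttlMs ago and returns their ids.
function sweepCompleted(registry: RegistrySketch, now: number, ttlMs: number): string[] {
  const deleted: string[] = [];
  for (const runId of Object.keys(registry)) {
    if (now - registry[runId] > ttlMs) {
      delete registry[runId];
      deleted.push(runId);
    }
  }
  return deleted;
}

const fixtureNow = Date.now();
const makeFixture = (): RegistrySketch => ({
  "run-6m": fixtureNow - 6 * 60_000,   // old fixture offset
  "run-61m": fixtureNow - 61 * 60_000, // new fixture offset
});

// Old 5-minute TTL: both offsets are past the window, so both are deleted.
console.log(sweepCompleted(makeFixture(), fixtureNow, 5 * 60_000));  // [ 'run-6m', 'run-61m' ]
// New 60-minute TTL: only the 61-minute offset still exercises the deletion path.
console.log(sweepCompleted(makeFixture(), fixtureNow, 60 * 60_000)); // [ 'run-61m' ]
```

A fixture left at 6 minutes would no longer be deleted under the new TTL, so the existing deletion assertion would stop proving anything.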

Why this matters

When tools.agentToAgent is disabled, operators can't fall back to cross-agent history/send/status tools, so the subagent registry is the primary source of truth for subagents list. Retaining completed runs for an hour makes the subagent flow inspectable after the fact and matches operator expectations on slower channels.

Real behavior proof

  • Behavior or issue addressed: Completed session-mode subagent runs disappeared from subagents list about five minutes after the child session finished, even though the child session itself was still alive on disk. Operators on slower messaging channels saw their delegated sub-agents as if they had silently exited.

  • Real environment tested: Local OpenClaw build from this branch on Linux (Node 22), parent agent running in a terminal session, two session-mode sub-agents spawned from the parent and allowed to complete normally. Reproduced first against main (5-minute TTL) and then against this branch (60-minute TTL) using the same setup.

  • Exact steps or command run after this patch: Started the parent agent, used the spawn flow to start two session-mode sub-agents, waited for them to reach completion (endedAt set, cleanupCompletedAt set), then waited about 30 minutes wall-clock and ran openclaw subagents list from the parent operator surface. Repeated at the ~70-minute mark to confirm sweep still fires.

  • Evidence after fix: Terminal output captured below.

    Before fix, against main, ~6 minutes after both sub-agents finished:

    $ openclaw subagents list
    (empty)
    

    After fix, same setup, ~30 minutes after the sub-agents finished:

    $ openclaw subagents list
    - run-…  session  agent:alt:session:child-…  ok
    - run-…  session  agent:alt:session:child-…  ok
    

    After fix, ~70 minutes after the sub-agents finished (past the new TTL):

    $ openclaw subagents list
    (empty)   # sweeper deleted the rows once cleanupCompletedAt was older than 60 min
    
  • Observed result after fix: Completed session-mode runs stayed visible in subagents list for the full 60-minute window after cleanupCompletedAt, then were swept exactly as before. No change to run-mode entries (still driven by archiveAtMs), no leftover state in subagent-runs.json after sweep. Matches the unit-test sweep fixture, which was bumped from now - 6 * 60_000 to now - 61 * 60_000 and continues to assert deletion.

  • What was not tested: Cross-host behavior with multiple operators sharing a gateway; very large registries (hundreds of completed runs simultaneously held for the longer TTL — the sweep cost is unchanged per entry but holding more entries in memory was not separately benchmarked).

Testing

  • pnpm test src/agents/subagent-registry.test.ts
  • pnpm test src/agents/subagent-registry.persistence.test.ts

Notes for reviewers

@openclaw-barnacle openclaw-barnacle Bot added triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. agents Agent runtime and tooling extensions: openai triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context. size: S labels May 6, 2026

clawsweeper Bot commented May 6, 2026

Codex review: needs changes before merge.

Summary
The branch increases completed session-mode subagent registry retention from 5 minutes to 60 minutes, updates the sweep fixture, and adds a changelog entry.

Reproducibility: yes, from source and PR proof. Current main deletes session-mode registry rows without archiveAtMs once cleanupCompletedAt is more than five minutes old, and the PR body reports the same failure in a live setup.

Real behavior proof
Sufficient (terminal): The PR body includes copied terminal output from a real Linux OpenClaw session-mode subagent setup showing the failure on main, retention after the fix, and sweep after the new TTL.

Next step before merge
A narrow automated repair can refresh the stale changelog hunk against current main without changing the proposed retention behavior; maintainer approval is still needed for the one-hour default.

Security
Cleared: The diff only changes a retention constant/comment, one unit-test fixture, and changelog text, with no security or supply-chain-sensitive surface touched.

Review findings

  • [P2] Rebase the changelog entry onto current main — CHANGELOG.md:101

Review details

Best possible solution:

Refresh the branch onto current main, keep the narrow TTL/test/changelog change, and land it after maintainers accept the one-hour retention window.

Do we have a high-confidence way to reproduce the issue?

Yes, from source and PR proof: current main deletes session-mode registry rows without archiveAtMs once cleanupCompletedAt is more than five minutes old, and the PR body reports the same failure in a live setup.

Is this the best way to solve the issue?

Mostly yes: changing the single session-mode TTL constant is the narrow implementation point, but the branch must be rebased so its changelog entry applies to current main before it is merge-ready.

Full review comments:

  • [P2] Rebase the changelog entry onto current main — CHANGELOG.md:101
    Current main has new ### Fixes entries above this hunk, and git apply --check --verbose fails on CHANGELOG.md even though the agent files still check. Rebase or refresh the branch and place the Agents/subagents entry under the current active Fixes section so the PR can merge cleanly.
    Confidence: 0.95

Overall correctness: patch is incorrect
Overall confidence: 0.9

Acceptance criteria:

  • git diff --check
  • pnpm test src/agents/subagent-registry.test.ts src/agents/subagent-registry.persistence.test.ts
  • pnpm check:changed

What I checked:

  • Current main still uses five-minute retention: Current main defines SESSION_RUN_TTL_MS as 5 * 60_000, so the requested behavior is not already implemented on main. (src/agents/subagent-registry.ts:200, f35fb7288a70)
  • Sweep path matches the report: Session-mode entries without archiveAtMs are deleted when cleanupCompletedAt is older than SESSION_RUN_TTL_MS, which makes the five-minute current-main behavior clear from source. (src/agents/subagent-registry.ts:816, f35fb7288a70)
  • Docs support registry-backed inspection: The subagents docs describe /subagents list as an inspection command and say ended children remain visible for a recent window, matching the affected status surface. Public docs: docs/tools/subagents.md. (docs/tools/subagents.md:328, f35fb7288a70)
  • PR diff is narrow: The remote patch changes only CHANGELOG.md, src/agents/subagent-registry.ts, and src/agents/subagent-registry.test.ts; no workflow, dependency, package-resolution, or secret-handling files are touched. (fe6640cbb3e8)
  • Branch does not apply to current main: git apply --check --verbose fails on CHANGELOG.md because current main has new Fixes entries above the PR's changelog context; the two agent files still reach patch checking. (CHANGELOG.md:102, f35fb7288a70)
  • Real behavior proof supplied: The PR body now reports a live Linux OpenClaw setup with two completed session-mode subagents: empty list on main at about six minutes, retained rows on the branch at about thirty minutes, and empty list again after about seventy minutes. (fe6640cbb3e8)

Likely related people:

  • steipete: Recent current-main history includes subagent cleanup, lifecycle, lazy-loader, yielded-run, and thread-bound session work in the affected files and docs. (role: recent maintainer; confidence: high; commits: 59fb9e5ca7fe, 2218ce46fe2e, 6f3b5f8666f3; files: src/agents/subagent-registry.ts, src/agents/subagent-registry.test.ts, docs/tools/subagents.md)
  • Takhoffman: Authored several March subagent cleanup and expiry fixes around deferred cleanup, killed runs, and attachment cleanup in the same registry/test surface. (role: cleanup-path contributor; confidence: medium; commits: 5e9ea804d4e4, 938f8f4d83e7, dd11bdd0036e; files: src/agents/subagent-registry.ts, src/agents/subagent-registry.test.ts)
  • jarimustonen: Authored the context-engine runtime-context change that added the swept context-engine fixture this PR updates. (role: adjacent feature author; confidence: medium; commits: d8a600f2ad01; files: src/agents/subagent-registry.test.ts, src/agents/subagent-registry.ts)
  • jalehman: Reviewed/co-authored adjacent context-engine and archived-sweep subagent registry work that overlaps the affected cleanup/status path. (role: reviewer and adjacent owner; confidence: medium; commits: d8a600f2ad01, 3b289c794290; files: src/agents/subagent-registry.ts, src/agents/subagent-registry.test.ts)

Remaining risk / open question:

  • The one-hour retention window is a user-visible default-retention choice and could retain more completed rows in high-volume setups; the PR body notes large registries were not separately benchmarked.

Codex review notes: model gpt-5.5, reasoning high; reviewed against f35fb7288a70.

@arniesaha arniesaha force-pushed the fix/subagent-session-registry-ttl branch from 34b670a to 8758556 Compare May 6, 2026 03:49
@arniesaha arniesaha force-pushed the fix/subagent-session-registry-ttl branch from 8758556 to fe6640c Compare May 6, 2026 03:54
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. and removed triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. labels May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@arniesaha arniesaha force-pushed the fix/subagent-session-registry-ttl branch from fe6640c to 9626264 Compare May 6, 2026 04:08
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@arniesaha arniesaha closed this May 6, 2026
@arniesaha arniesaha deleted the fix/subagent-session-registry-ttl branch May 6, 2026 04:15
@arniesaha (Contributor, Author) commented

Superseded by #78263, which takes a cleaner approach: instead of bumping the hardcoded SESSION_RUN_TTL_MS from 5 to 60 minutes, it has session-mode reaping honor agents.defaults.subagents.archiveAfterMinutes — the same retention knob run-mode already uses for archiveAtMs. Same default user-visible behavior (60 min), but no magic constant, no asymmetry between spawn modes, and operators can finally tune session-mode retention via config.
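A config-driven TTL of the kind described above might be resolved along these lines. This is a sketch under stated assumptions, not the implementation in #78263: only the key path agents.defaults.subagents.archiveAfterMinutes and the 60-minute default come from the comment; the `ConfigSketch` type, the `resolveSessionTtlMs` helper, and the fallback policy for unset or non-positive values are hypothetical.

```typescript
// Hypothetical config shape; only the key path is taken from the PR comment.
interface ConfigSketch {
  agents?: {
    defaults?: {
      subagents?: { archiveAfterMinutes?: number };
    };
  };
}

const DEFAULT_ARCHIVE_AFTER_MINUTES = 60; // default stated in the comment above

// Resolves the session-mode retention window in milliseconds.
function resolveSessionTtlMs(config: ConfigSketch): number {
  const minutes = config.agents?.defaults?.subagents?.archiveAfterMinutes;
  // Assumed policy: fall back to the default when the knob is unset,
  // non-finite, or not positive.
  if (typeof minutes !== "number" || !isFinite(minutes) || minutes <= 0) {
    return DEFAULT_ARCHIVE_AFTER_MINUTES * 60_000;
  }
  return minutes * 60_000;
}

console.log(resolveSessionTtlMs({})); // 3600000
console.log(resolveSessionTtlMs({ agents: { defaults: { subagents: { archiveAfterMinutes: 15 } } } })); // 900000
```

Reading the window from config rather than a constant is what removes the asymmetry between spawn modes that the comment mentions: both modes would then age out against the same operator-tunable value.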

Labels

agents Agent runtime and tooling proof: supplied External PR includes structured after-fix real behavior proof. size: XS triage: refactor-only Candidate: refactor/cleanup-only PR without maintainer context.
