
fix(subagents): honor archiveAfterMinutes for session-mode reaping #78263

Merged
jalehman merged 3 commits into openclaw:main from
arniesaha:fix/subagent-session-honor-archive-after-minutes
May 7, 2026

Conversation

@arniesaha
Contributor

Summary

Reap completed session-mode subagent registry rows on agents.defaults.subagents.archiveAfterMinutes, the same retention knob run-mode already uses for archiveAtMs, instead of on a separate hardcoded 5-minute TTL.

Background

Session-mode and run-mode subagent registry rows had two different retention horizons:

  • Run-mode (spawnMode: "run"): the row carries archiveAtMs = now + archiveAfterMs, where archiveAfterMs is derived from agents.defaults.subagents.archiveAfterMinutes (default 60 minutes). At sweep time the row is removed and the child session is deleted via sessions.delete.
  • Session-mode (spawnMode: "session"): the row carries no archiveAtMs (the child session is retained independently), and was instead reaped on a hardcoded SESSION_RUN_TTL_MS = 5 minutes — completely ignoring the configured archiveAfterMinutes window.
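The two registration paths described above can be sketched roughly as follows. This is a minimal illustration, not the registry's actual code: `RegistryRow`, `registerRow`, and the `runId` field are hypothetical names, and only `spawnMode`, `archiveAtMs`, and the `now + archiveAfterMs` formula come from the description.

```typescript
// Hypothetical row shape; only spawnMode and archiveAtMs are named in the PR text.
type SpawnMode = "run" | "session";

interface RegistryRow {
  runId: string;
  spawnMode: SpawnMode;
  archiveAtMs?: number; // present for run-mode rows only
}

// archiveAfterMs is assumed to be undefined when retention is disabled.
function registerRow(
  runId: string,
  spawnMode: SpawnMode,
  now: number,
  archiveAfterMs: number | undefined,
): RegistryRow {
  const row: RegistryRow = { runId, spawnMode };
  if (spawnMode === "run" && typeof archiveAfterMs === "number") {
    // Run-mode: the archive deadline is pinned at registration time.
    row.archiveAtMs = now + archiveAfterMs;
  }
  // Session-mode: no archiveAtMs, so the sweep needs another retention rule.
  return row;
}

const hourMs = 60 * 60_000;
const runRow = registerRow("run-a", "run", 1_000, hourMs);
const sessionRow = registerRow("run-b", "session", 1_000, hourMs);
```

Here `runRow.archiveAtMs` is `now + hourMs`, while `sessionRow` carries no deadline at all, which is why the session-mode sweep previously fell back to its own constant.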

This asymmetry was a bug:

  1. The 5-minute window was too short in practice. On slower messaging surfaces (Telegram, Discord, etc.) and when tools.agentToAgent is disabled (so the cross-agent history/send/status fallback isn't available), an operator asking "what happened to the sub-agents I just delegated to?" got an empty subagents list while the child sessions were still alive on disk — completed sub-agents appeared to silently disappear.
  2. There was no way for users to tune session-mode retention. The hardcoded constant ignored config entirely.
  3. The two spawn modes drifted apart for no architectural reason. The operator's mental model is "completed sub-agents stick around for X minutes," not "X for one mode and 5 for the other."

Fix

Drop SESSION_RUN_TTL_MS. In the sweep loop, resolve sessionRetentionMs = resolveArchiveAfterMs(cfg) once per tick and use it as the absolute TTL for session-mode rows after cleanupCompletedAt. Run-mode behavior is unchanged.

const sessionRetentionMs = resolveArchiveAfterMs(subagentRegistryDeps.getRuntimeConfig());

if (!entry.archiveAtMs) {
  if (
    typeof sessionRetentionMs === "number" &&
    typeof entry.cleanupCompletedAt === "number" &&
    now - entry.cleanupCompletedAt > sessionRetentionMs
  ) {
    // sweep: delete the registry row
  }
}

Defaults stay the same as run-mode: 60 minutes. archiveAfterMinutes: 0 now disables session-mode reaping (registry row kept indefinitely) just like it already disables run-mode sessions.delete.
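To make the default and the `0` case concrete, here is a self-contained sketch of the config-driven check. `resolveArchiveAfterMsSketch` is a hypothetical stand-in that takes the minutes value directly (the real `resolveArchiveAfterMs` takes the runtime config), and its handling of negative or non-finite values is an assumption. `SessionEntry` and `shouldSweepSessionRow` are likewise illustrative names.

```typescript
// Hypothetical stand-in for resolveArchiveAfterMs.
// Assumption: "disabled" is signalled by returning undefined.
function resolveArchiveAfterMsSketch(archiveAfterMinutes?: number): number | undefined {
  const minutes = archiveAfterMinutes ?? 60; // default: 60 minutes
  if (!Number.isFinite(minutes) || minutes <= 0) {
    return undefined; // 0 disables reaping
  }
  return minutes * 60_000;
}

// Only the two fields the sweep condition reads.
interface SessionEntry {
  archiveAtMs?: number;
  cleanupCompletedAt?: number;
}

// Mirrors the condition in the snippet above for rows without archiveAtMs.
function shouldSweepSessionRow(
  entry: SessionEntry,
  now: number,
  sessionRetentionMs: number | undefined,
): boolean {
  if (entry.archiveAtMs) {
    return false; // run-mode rows are handled through archiveAtMs instead
  }
  return (
    typeof sessionRetentionMs === "number" &&
    typeof entry.cleanupCompletedAt === "number" &&
    now - entry.cleanupCompletedAt > sessionRetentionMs
  );
}

const minuteMs = 60_000;
const completed: SessionEntry = { cleanupCompletedAt: 0 };
const at30 = shouldSweepSessionRow(completed, 30 * minuteMs, resolveArchiveAfterMsSketch());
const at70 = shouldSweepSessionRow(completed, 70 * minuteMs, resolveArchiveAfterMsSketch());
const disabled = shouldSweepSessionRow(completed, 70 * minuteMs, resolveArchiveAfterMsSketch(0));
```

Under these assumptions a row completed at time 0 survives the 30-minute check, is swept at the 70-minute check with the default, and is never swept when the setting is `0`.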

Behavior change for users

| Setting | Before this PR | After this PR |
| --- | --- | --- |
| archiveAfterMinutes unset (default) | run-mode 60 min, session-mode 5 min | run-mode 60 min, session-mode 60 min |
| archiveAfterMinutes: 30 | run-mode 30 min, session-mode 5 min | run-mode 30 min, session-mode 30 min |
| archiveAfterMinutes: 0 | run-mode never swept, session-mode 5 min | run-mode never swept, session-mode never swept |

Default-configured installs see the operator-visible retention extend from 5 → 60 minutes for session-mode runs. Users who explicitly set archiveAfterMinutes now get that value applied uniformly to both spawn modes.
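The before/after matrix can also be expressed as a small function, which may help when reasoning about a specific configuration. This is purely illustrative: `retentionMinutes` and its `patched` flag are hypothetical and do not exist in the codebase.

```typescript
type Mode = "run" | "session";

// Returns the registry-row retention window in minutes, or null for "never swept".
// `patched` selects the behavior after this PR; false models the old behavior.
function retentionMinutes(
  mode: Mode,
  archiveAfterMinutes: number | undefined,
  patched: boolean,
): number | null {
  if (mode === "session" && !patched) {
    return 5; // former hardcoded TTL, ignoring config entirely
  }
  const minutes = archiveAfterMinutes ?? 60; // default when unset
  return minutes > 0 ? minutes : null; // 0 means never swept
}

// Rebuild the session-mode column for the three settings discussed.
const settings: Array<number | undefined> = [undefined, 30, 0];
const sessionRows = settings.map((setting) => ({
  setting,
  before: retentionMinutes("session", setting, false),
  after: retentionMinutes("session", setting, true),
}));
```

For run-mode, the `patched` flag makes no difference, which matches the statement that run-mode behavior is unchanged.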

Real behavior proof

  • Behavior or issue addressed: Completed session-mode subagent runs disappeared from subagents list about five minutes after the child session finished, even though the child session itself was still alive on disk and the operator's configured archiveAfterMinutes was 60. Operators on slower messaging channels saw their delegated sub-agents as if they had silently exited.

  • Real environment tested: Local OpenClaw build from this branch on Linux (Node 22), parent agent running in a terminal session, two session-mode sub-agents spawned from the parent and allowed to complete normally. Reproduced first against main (5-minute hardcoded TTL) and then against this branch (config-driven, default 60 minutes) using the same setup. Also verified archiveAfterMinutes: 0 keeps session-mode rows indefinitely after this change.

  • Exact steps or command run after this patch: Started the parent agent, used the spawn flow to start two session-mode sub-agents, waited for them to reach completion (endedAt set, cleanupCompletedAt set), then waited about 30 minutes wall-clock and ran openclaw subagents list from the parent operator surface. Repeated at the ~70-minute mark to confirm sweep still fires under the default. Repeated with archiveAfterMinutes: 0 configured to confirm rows are kept indefinitely.

  • Evidence after fix: Terminal output captured below.

    Before fix, against main, ~6 minutes after both sub-agents finished (default config):

    $ openclaw subagents list
    (empty)
    

    After fix, same setup, ~30 minutes after the sub-agents finished:

    $ openclaw subagents list
    - run-…  session  agent:alt:session:child-…  ok
    - run-…  session  agent:alt:session:child-…  ok
    

    After fix, ~70 minutes after the sub-agents finished (past the default 60-minute window):

    $ openclaw subagents list
    (empty)   # sweeper deleted the rows once cleanupCompletedAt was older than archiveAfterMinutes
    

    After fix with agents.defaults.subagents.archiveAfterMinutes: 0, several hours after the sub-agents finished:

    $ openclaw subagents list
    - run-…  session  agent:alt:session:child-…  ok
    - run-…  session  agent:alt:session:child-…  ok
    
  • Observed result after fix: Completed session-mode runs stayed visible in subagents list for the full configured archiveAfterMinutes window after cleanupCompletedAt, then were swept exactly as before. Run-mode entries (still driven by archiveAtMs) were unchanged. archiveAfterMinutes: 0 disabled session-mode reaping consistent with the existing run-mode semantic. No leftover state in subagent-runs.json after sweep.

  • What was not tested: Cross-host behavior with multiple operators sharing a gateway; very large registries (hundreds of completed runs simultaneously held under the longer default — sweep cost per entry is unchanged but holding more entries in memory was not separately benchmarked).

Testing

  • pnpm test src/agents/subagent-registry.test.ts
  • pnpm test src/agents/subagent-registry.persistence.test.ts

Notes for reviewers

  • Supersedes #78238 ("fix: retain completed session-mode subagents longer"), which proposed bumping the hardcoded constant from 5 to 60 minutes. This is the smaller, more principled version: the existing config knob now actually controls retention for both spawn modes.
  • The unit test "passes stored agentDir through swept context-engine cleanup paths" scopes a positive archiveAfterMinutes value for that case, since the suite-wide mock config sets archiveAfterMinutes: 0 (which under the new config-driven sweep means "never reap"). The run-mode part of the same test continues to use a directly pinned archiveAtMs and is unaffected.

@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label May 6, 2026
@openclaw-barnacle openclaw-barnacle Bot added proof: supplied External PR includes structured after-fix real behavior proof. size: XS labels May 6, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 6, 2026

Codex review: needs maintainer review before merge.

Summary
This PR replaces completed no-archiveAtMs subagent registry reaping's hardcoded five-minute TTL with the existing archiveAfterMinutes resolver, adjusts one sweep fixture, and adds a changelog entry.

Reproducibility: yes. Current main's sweep path deletes completed no-archiveAtMs rows after the five-minute SESSION_RUN_TTL_MS, and the PR body supplies terminal before/after output for the subagents list symptom.

Real behavior proof
Sufficient (terminal): The PR body includes terminal output from a real Linux OpenClaw run showing current-main failure, after-fix retention, eventual sweep, and disabled reaping with archiveAfterMinutes: 0.

Next step before merge
No repair lane is needed because the latest patch has no blocking findings and exact-head proof/CI are sufficient; the remaining action is maintainer merge judgment.

Security
Cleared: The diff is limited to retention logic, one test fixture, and changelog text, with no dependency, workflow, secret, permission, or package-resolution changes.

Review details

Best possible solution:

Land the config-driven retention fix after maintainer review, keeping archiveAfterMinutes as the single retention control for completed subagent registry rows.

Do we have a high-confidence way to reproduce the issue?

Yes. Current main's sweep path deletes completed no-archiveAtMs rows after the five-minute SESSION_RUN_TTL_MS, and the PR body supplies terminal before/after output for the subagents list symptom.

Is this the best way to solve the issue?

Yes. Reusing resolveArchiveAfterMs in the sweep loop is the narrowest maintainable fix because run-mode already uses that helper and the public docs already expose archiveAfterMinutes as the subagent retention knob.

What I checked:

  • Current main hardcoded TTL: Current main defines SESSION_RUN_TTL_MS as five minutes and uses it to delete completed entries without archiveAtMs after cleanupCompletedAt. (src/agents/subagent-registry.ts:199, 69d446d1784c)
  • Configured retention contract: The subagents docs describe agents.defaults.subagents.archiveAfterMinutes as the auto-archive knob with default 60 minutes. Public docs: docs/tools/subagents.md. (docs/tools/subagents.md:281, 69d446d1784c)
  • Shared helper and run-mode behavior: resolveArchiveAfterMs implements default 60-minute retention and treats 0 as disabled; run-mode registration already derives archiveAtMs from that helper while session-mode rows skip archiveAtMs. (src/agents/subagent-registry-helpers.ts:307, 69d446d1784c)
  • PR implementation: The PR imports resolveArchiveAfterMs, computes sessionRetentionMs once per sweep, requires it before deleting no-archiveAtMs rows, scopes the affected sweep fixture to archiveAfterMinutes: 1, and adds a changelog entry. (src/agents/subagent-registry.ts:750, 0bcf95e9d906)
  • Real behavior proof: The PR body supplies copied terminal output from a Linux OpenClaw run showing current-main failure at about six minutes, after-fix retention at about 30 minutes, sweep after about 70 minutes, and disabled reaping with archiveAfterMinutes: 0. (0bcf95e9d906)
  • Exact-head checks: The latest head check runs are completed with the relevant check, build, lint, docs, security, and test lanes succeeding; skipped/neutral runs are non-applicable lanes such as CodeQL neutral and platform skips. (0bcf95e9d906)

Likely related people:

  • steipete: GitHub history links this handle to the original subagent archive setting/docs and repeated recent work in the registry helper, run-manager, and subagents docs paths. (role: original feature author and recent subagent maintainer; confidence: medium; commits: 75c66acfd828, 770c462c4730, cfbef8035dd1; files: docs/tools/subagents.md, src/agents/subagent-registry-helpers.ts, src/agents/subagent-registry-run-manager.ts)
  • vincentkoc: Current checkout blame and recent GitHub history show this handle on the current subagent registry snapshot and adjacent subagent behavior/docs work. (role: recent adjacent maintainer; confidence: medium; commits: 78b252682b0b, e80de466e5e1, 1427c3a78d80; files: src/agents/subagent-registry.ts, src/agents/subagent-registry-helpers.ts, src/agents/subagent-registry-run-manager.ts)
  • jalehman: This PR is assigned to this maintainer, and history links the handle to adjacent context-engine cleanup work overlapping the test fixture touched here. (role: assigned reviewer and adjacent owner; confidence: medium; commits: fee91fefceb4, bc2373fecc49, 0bcf95e9d906; files: src/agents/subagent-registry.ts, src/agents/subagent-registry.test.ts, CHANGELOG.md)

Remaining risk / open question:

  • Longer or disabled retention can leave more completed registry rows resident for high-volume subagent use; the PR body notes that large registries were not separately benchmarked.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 69d446d1784c.

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@arniesaha arniesaha force-pushed the fix/subagent-session-honor-archive-after-minutes branch from 85ce768 to f78d7ce Compare May 6, 2026 04:30
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@jalehman jalehman self-assigned this May 6, 2026
@arniesaha arniesaha force-pushed the fix/subagent-session-honor-archive-after-minutes branch from 52dbafc to ccc2ce3 Compare May 7, 2026 00:59
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
@openclaw-barnacle openclaw-barnacle Bot added channel: telegram Channel integration: telegram size: S and removed size: XS labels May 7, 2026
@jalehman jalehman force-pushed the fix/subagent-session-honor-archive-after-minutes branch from e57c04b to 0bcf95e Compare May 7, 2026 01:17
@openclaw-barnacle openclaw-barnacle Bot added size: XS and removed proof: sufficient ClawSweeper judged the real behavior proof convincing. channel: telegram Channel integration: telegram size: S labels May 7, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
arniesaha and others added 3 commits May 6, 2026 19:19
Session-mode subagent registry rows reaped on a hardcoded 5-minute TTL
instead of the configured `agents.defaults.subagents.archiveAfterMinutes`
window (default 60 minutes) that run-mode already honors for `archiveAtMs`.

That asymmetry meant `subagents list` and other registry-backed status
surfaces lost completed runs five minutes after cleanup, even when the
operator's configured retention was longer, and gave operators no way to
tune session-mode retention at all. On slower messaging surfaces and when
agent-to-agent transcript access is disabled, completed sub-agents
appeared to silently disappear.

Drop `SESSION_RUN_TTL_MS` and have the sweep loop call
`resolveArchiveAfterMs` so both spawn modes reap on the same configured
horizon. Setting `archiveAfterMinutes: 0` now disables session-mode
reaping just like it disables run-mode `sessions.delete`.

Tests scope a positive `archiveAfterMinutes` for the swept-context-engine
fixture so the deletion path still fires under the new config-driven
sweep.
@jalehman jalehman force-pushed the fix/subagent-session-honor-archive-after-minutes branch from 0bcf95e to b415467 Compare May 7, 2026 02:23
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
@jalehman jalehman merged commit 1c331a8 into openclaw:main May 7, 2026
89 of 90 checks passed
@jalehman
Contributor

jalehman commented May 7, 2026

Merged via squash.

Thanks @arniesaha!

rogerdigital pushed a commit to rogerdigital/openclaw that referenced this pull request May 7, 2026
…penclaw#78263)

Merged via squash.

Prepared head SHA: b415467
Co-authored-by: arniesaha <3646287+arniesaha@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
steipete pushed a commit that referenced this pull request May 7, 2026
…78263)

Merged via squash.

Prepared head SHA: b415467
Co-authored-by: arniesaha <3646287+arniesaha@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman

(cherry picked from commit 1c331a8)
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…penclaw#78263)

Merged via squash.

Prepared head SHA: b415467
Co-authored-by: arniesaha <3646287+arniesaha@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
rogerdigital pushed a commit to rogerdigital/openclaw that referenced this pull request May 9, 2026
…penclaw#78263)

Merged via squash.

Prepared head SHA: b415467
Co-authored-by: arniesaha <3646287+arniesaha@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman

Labels

agents Agent runtime and tooling proof: supplied External PR includes structured after-fix real behavior proof. size: XS


2 participants