fix(gateway): add TTL cleanup for 3 Maps that grow unbounded causing OOM#52731
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9b0958190
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Greptile SummaryThis PR fixes a real Gateway OOM issue under batch workloads by adding TTL-based cleanup to three Maps that previously grew without bound for session-mode runs. The approach is sound and consistent with the existing sweep/maintenance patterns in the codebase. Key changes:
Minor gaps worth noting:
Confidence Score: 4/5
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2acf732f01
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 731ae843ab
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
d7cb864 to
83538bd
Compare
|
Rebased onto latest main. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83538bd08c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| export function sweepStaleRunContexts(maxAgeMs = 30 * 60 * 1000): number { | ||
| const now = Date.now(); | ||
| let swept = 0; | ||
| for (const [runId, ctx] of state.runContextById.entries()) { |
There was a problem hiding this comment.
Read event state before sweeping stale run contexts
sweepStaleRunContexts iterates state.runContextById but never initializes state, so calling it throws ReferenceError: state is not defined. This is now invoked from the gateway maintenance timer every 60s (startGatewayMaintenanceTimers), so a live gateway hits this path repeatedly and can fail the maintenance tick or crash under uncaught exception handling.
Useful? React with 👍 / 👎.
83538bd to
2c9f538
Compare
e9e48c7 to
37731c3
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37731c3535
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (typeof entry.cleanupCompletedAt === "number" && now - entry.cleanupCompletedAt > SESSION_RUN_TTL_MS) { | ||
| clearPendingLifecycleError(runId); | ||
| void notifyContextEngineSubagentEnded({ | ||
| childSessionKey: entry.childSessionKey, | ||
| reason: "swept", |
There was a problem hiding this comment.
Avoid emitting a second subagent-ended callback on TTL sweep
The new no-archiveAtMs TTL path sweeps entries based on cleanupCompletedAt, but those entries have already passed through completeCleanupBookkeeping for cleanup: "keep", which already calls notifyContextEngineSubagentEnded(... reason: "completed"). Emitting another callback here with reason: "swept" causes duplicate terminal notifications for the same run, so context-engine implementations that perform non-idempotent cleanup can run teardown logic twice.
Useful? React with 👍 / 👎.
Gateway crashes with OOM after batch processing ~1000+ agent sessions. Three Maps accumulate entries without cleanup: 1. subagentRuns: session-mode runs have archiveAtMs=undefined, so the 60s sweeper skips them forever. Add 5-min absolute TTL for completed session runs. Also sweep orphaned pendingLifecycleError entries with 5-min TTL. 2. runContextById: only cleaned via manual clearAgentRunContext() calls. Add registeredAt timestamp and sweepStaleRunContexts() with 30-min TTL, called from the existing 60s maintenance timer. Also sweeps the companion seqByRun Map for the same runIds. 3. pendingLifecycleErrorByRunId: 15s retry timer but no absolute TTL. Now swept in sweepSubagentRuns() after 5 min. Pre-deploy entries without registeredAt are treated as infinitely old and swept immediately. Fixes openclaw#52725 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. clearAgentRunContext now also deletes seqByRun (Greptile P2) 2. TTL constants moved to module scope (Greptile P2) 3. Session-mode TTL uses cleanupCompletedAt instead of endedAt to avoid interrupting deferred cleanup flows (Codex P1) 4. Added lastActiveAt to AgentRunContext, refreshed on every emitAgentEvent — long-running active agents are not swept (Codex P1) 5. resetAgentRunContextForTest also clears seqByRun (P2 drive-by) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The sweeper was only started when archiveAtMs was truthy, so pure session-mode workloads (archiveAtMs=undefined) never triggered the 60s sweep interval. This meant the TTL cleanup for session-mode runs, pending lifecycle errors, and orphaned contexts never executed — defeating the OOM fix for the exact batch scenario it was designed for. Remove the if(archiveAtMs) guard at both registration sites so the sweeper runs for all workloads. startSweeper() is idempotent (returns immediately if already running) so this is safe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The new registeredAt and lastActiveAt fields on AgentRunContext cause toEqual to fail because it expects exact property counts. Switch to toMatchObject which validates expected properties without requiring exact match — the standard pattern for objects with auto-injected timestamps. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…per on restore 1. Guard safeRemoveAttachmentsDir with retainAttachmentsOnKeep check in the session-mode TTL sweep, matching the existing pattern in finalizeSubagentCleanup (Codex P1) 2. Start sweeper unconditionally in restoreSubagentRunsOnce(), matching the register paths — ensures TTL cleanup runs after restart even when all restored entries are session-mode (Codex P2) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
37731c3 to
4816a29
Compare
|
Merged via squash.
Thanks @artwalker! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4816a29de5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| void notifyContextEngineSubagentEnded({ | ||
| childSessionKey: entry.childSessionKey, | ||
| reason: "swept", | ||
| workspaceDir: entry.workspaceDir, |
There was a problem hiding this comment.
Avoid re-emitting subagent end callback after keep cleanup
This sweep path runs only when cleanupCompletedAt is already set, but keep-mode runs set cleanupCompletedAt via completeCleanupBookkeeping, which already calls notifyContextEngineSubagentEnded(... reason: "completed"). Emitting reason: "swept" again here introduces a second terminal callback for the same run ~5 minutes later, which is a compatibility regression for context-engine plugins that treat onSubagentEnded as a one-time teardown signal and can run non-idempotent cleanup twice.
Useful? React with 👍 / 👎.
…OOM (openclaw#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
…OOM (openclaw#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
…OOM (openclaw#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
…OOM (openclaw#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
…OOM (openclaw#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman
Summary
Change Type
Scope
Linked Issue
User-visible / Behavior Changes
Gateway no longer OOMs under batch workloads. Completed session-mode subagent runs are cleaned up after 5 minutes. Stale run contexts are cleaned up after 30 minutes.
Security Impact (required)
NoNoNoNoNoThree leak points fixed
1. Primary:
subagentRunsMap (subagent-registry.ts)Session-mode spawns set
archiveAtMs = undefined. The 60s sweeper skipped entries withoutarchiveAtMs. Added absolute 5-min TTL for completed session runs (keyed onendedAt).2. Secondary:
runContextById+seqByRunMaps (agent-events.ts)Only cleaned via manual
clearAgentRunContext()on lifecycle end/error. AddedregisteredAttimestamp toAgentRunContextandsweepStaleRunContexts()with 30-min TTL, called from the existing 60s maintenance timer. Also sweeps companionseqByRunentries. Pre-deploy entries withoutregisteredAtare treated as infinitely old.3. Secondary:
pendingLifecycleErrorByRunIdMap (subagent-registry.ts)Had 15s retry timer but no absolute TTL. Now swept in
sweepSubagentRuns()after 5 minutes.Repro + Verification
Environment
Steps
spawnMode: "session")ps -o rss= -p $(pgrep -f openclaw-gateway) | awk '{print $1/1024 "MB"}'Expected
Memory stabilizes after sessions complete (~< 2GB), cleaned up by sweep timers.
Actual (before fix)
Memory grows linearly, never reclaimed. OOM at ~8GB.
Evidence
npm run buildpassesHuman Verification (required)
sweepSubagentRuns()correctly handles 3 cases (no archiveAtMs + ended, no archiveAtMs + running, has archiveAtMs)registeredAtonly set on first registration, not updatesseqByRunswept alongsiderunContextByIdto prevent companion leakregisteredAttreated as infinitely oldReview Conversations
Compatibility / Migration
YesNoNoFailure Recovery
Risks and Mitigations
🤖 Generated with Claude Code