fix(gateway): add TTL cleanup for 3 Maps that grow unbounded causing OOM#52731

Merged
jalehman merged 8 commits into openclaw:main from artwalker:fix/gateway-memory-leak-session-runs on Apr 9, 2026

Conversation

@artwalker

Contributor

Summary

  • Problem: Gateway crashes with OOM after batch-processing roughly 1,000 or more agent sessions (8GB heap exhausted)
  • Why it matters: Any batch workload (e.g., RCAgent scanning 1400 tickets × 2 sessions each) kills the Gateway process
  • What changed: Added TTL-based cleanup for 3 Maps that lacked cleanup mechanisms for session-mode runs
  • What did NOT change: Existing archive-based cleanup for non-session runs, existing TTL mechanisms for chatRunState/agentRunSeq/toolEventRecipients

Change Type

  • Bug fix

Scope

  • Gateway / orchestration

Linked Issue

User-visible / Behavior Changes

Gateway no longer OOMs under batch workloads. Completed session-mode subagent runs are cleaned up after 5 minutes. Stale run contexts are cleaned up after 30 minutes.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Three leak points fixed

1. Primary: subagentRuns Map (subagent-registry.ts)

Session-mode spawns set archiveAtMs = undefined. The 60s sweeper skipped entries without archiveAtMs. Added absolute 5-min TTL for completed session runs (keyed on endedAt).
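
A minimal sketch of the sweep logic described above, assuming a simplified entry shape. The `SubagentRunEntry` interface, the injected `now` parameter, and the archive-path comparison are illustrative assumptions, not the actual code in subagent-registry.ts:

```typescript
// Hypothetical, simplified registry entry; the real type carries more fields.
interface SubagentRunEntry {
  endedAt?: number; // set when the run finishes
  archiveAtMs?: number; // undefined for session-mode spawns
}

const SESSION_RUN_TTL_MS = 5 * 60 * 1000; // 5-minute absolute TTL for completed session runs

// Returns the run IDs removed in this pass. `now` is injected so the logic is testable.
function sweepSubagentRuns(runs: Map<string, SubagentRunEntry>, now: number): string[] {
  const removed: string[] = [];
  for (const [runId, entry] of runs) {
    if (entry.archiveAtMs === undefined) {
      // Session-mode: sweep only once the run has ended and the TTL has elapsed.
      if (entry.endedAt !== undefined && now - entry.endedAt > SESSION_RUN_TTL_MS) {
        runs.delete(runId);
        removed.push(runId);
      }
      continue; // never fall through to the archive path
    }
    // Archive path (assumed): sweep once the archive deadline has passed.
    if (now >= entry.archiveAtMs) {
      runs.delete(runId);
      removed.push(runId);
    }
  }
  return removed;
}
```

The three branches correspond to the cases listed under Human Verification: no archiveAtMs and ended, no archiveAtMs and still running, and has archiveAtMs.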

2. Secondary: runContextById + seqByRun Maps (agent-events.ts)

Only cleaned via manual clearAgentRunContext() on lifecycle end/error. Added registeredAt timestamp to AgentRunContext and sweepStaleRunContexts() with 30-min TTL, called from the existing 60s maintenance timer. Also sweeps companion seqByRun entries. Pre-deploy entries without registeredAt are treated as infinitely old.
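
The registration and sweep behavior described above might look roughly like the following. This is a sketch under assumptions: the `state` object is reduced to the two Maps named in this PR, `registerAgentRunContext` is a stand-in name for whatever registers a context, and `now` is passed in explicitly (the actual `sweepStaleRunContexts` takes only `maxAgeMs` and reads the clock itself):

```typescript
interface AgentRunContext {
  sessionKey?: string;
  registeredAt?: number; // absent on entries created before this change
}

// Simplified stand-in for the module state in agent-events.ts.
const state = {
  runContextById: new Map<string, AgentRunContext>(),
  seqByRun: new Map<string, number>(),
};

// Hypothetical registration helper: stamps registeredAt on first registration only.
function registerAgentRunContext(runId: string, ctx: AgentRunContext, now: number): void {
  const existing = state.runContextById.get(runId);
  state.runContextById.set(runId, {
    ...existing,
    ...ctx,
    registeredAt: existing?.registeredAt ?? now,
  });
}

// Removes contexts older than maxAgeMs and their companion seqByRun entries.
function sweepStaleRunContexts(now: number, maxAgeMs = 30 * 60 * 1000): number {
  let swept = 0;
  for (const [runId, ctx] of state.runContextById) {
    // Entries without registeredAt are treated as infinitely old.
    const age = ctx.registeredAt === undefined ? Infinity : now - ctx.registeredAt;
    if (age > maxAgeMs) {
      state.runContextById.delete(runId);
      state.seqByRun.delete(runId);
      swept++;
    }
  }
  return swept;
}
```

Deleting from a Map while iterating it with `for...of` is well-defined in JavaScript, so the sweep does not need to collect keys first.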

3. Secondary: pendingLifecycleErrorByRunId Map (subagent-registry.ts)

Had 15s retry timer but no absolute TTL. Now swept in sweepSubagentRuns() after 5 minutes.
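
As a rough illustration of an absolute TTL layered on top of a retry timer: the entry shape below (a timer handle plus a `createdAt` timestamp) is a guess, since the PR description does not show the real fields, and `sweepPendingLifecycleErrors` is a hypothetical standalone helper rather than the code inside sweepSubagentRuns():

```typescript
// Hypothetical entry shape; the real fields are not shown in this PR description.
interface PendingLifecycleError {
  timer: ReturnType<typeof setTimeout>; // the 15s retry timer
  createdAt: number; // assumed timestamp used for the absolute TTL
}

const PENDING_ERROR_TTL_MS = 5 * 60 * 1000;

// Removes entries that outlived the absolute TTL, regardless of timer state.
function sweepPendingLifecycleErrors(
  pending: Map<string, PendingLifecycleError>,
  now: number,
): number {
  let swept = 0;
  for (const [runId, entry] of pending) {
    if (now - entry.createdAt > PENDING_ERROR_TTL_MS) {
      clearTimeout(entry.timer); // a late retry must not fire for a removed entry
      pending.delete(runId);
      swept++;
    }
  }
  return swept;
}
```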

Repro + Verification

Environment

  • OS: Linux (production server)
  • Runtime: Node.js, OpenClaw 2026.3.13+

Steps

  1. Run batch workload creating 1000+ agent sessions via Gateway RPC (spawnMode: "session")
  2. Monitor memory: ps -o rss= -p $(pgrep -f openclaw-gateway) | awk '{print $1/1024 "MB"}'

Expected

Memory stabilizes after sessions complete (below roughly 2GB) as the sweep timers clean up finished runs.

Actual (before fix)

Memory grows linearly, never reclaimed. OOM at ~8GB.

Evidence

  • Code analysis identifying 3 unbounded Maps
  • Verified existing cleanup mechanisms skip session-mode entries
  • npm run build passes
  • Memory monitoring under batch load (production verification pending)

Human Verification (required)

  • Verified: sweepSubagentRuns() correctly handles 3 cases (no archiveAtMs + ended, no archiveAtMs + running, has archiveAtMs)
  • Verified: registeredAt only set on first registration, not updates
  • Verified: seqByRun swept alongside runContextById to prevent companion leak
  • Verified: pre-deploy entries without registeredAt treated as infinitely old
  • Not verified: production load test (requires batch workload environment)

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery

  • Revert the single commit to restore previous behavior
  • Watch for: Gateway memory growth under batch workloads returning

Risks and Mitigations

  • Risk: 30-min TTL for runContextById could sweep contexts for legitimately long-running agents (>30 min). Mitigation: only metadata (sessionKey, verboseLevel) is lost, not the run itself — agent continues, just without enriched event routing.

🤖 Generated with Claude Code

@openclaw-barnacle openclaw-barnacle Bot added the gateway (Gateway runtime), agents (Agent runtime and tooling), and size: S labels Mar 23, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9b0958190

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread src/infra/agent-events.ts Outdated
Comment thread src/agents/subagent-registry.ts
@greptile-apps

greptile-apps Bot commented Mar 23, 2026

Contributor

Greptile Summary

This PR fixes a real Gateway OOM issue under batch workloads by adding TTL-based cleanup to three Maps that previously grew without bound for session-mode runs. The approach is sound and consistent with the existing sweep/maintenance patterns in the codebase.

Key changes:

  • subagentRuns entries with no archiveAtMs (session-mode) are now deleted 5 minutes after endedAt is set, matching the behavior of the existing archive-based sweep for non-session runs.
  • Orphaned pendingLifecycleErrorByRunId entries (whose 15-second retry timer may have unref()d before firing) are now also swept after 5 minutes in the same sweepSubagentRuns pass.
  • runContextById entries now carry a registeredAt timestamp and are swept after 30 minutes via a new sweepStaleRunContexts() call hooked into the existing 60-second gateway maintenance timer; companion seqByRun entries are cleaned up alongside them.

Minor gaps worth noting:

  • clearAgentRunContext (the direct lifecycle-end path) still only removes the entry from runContextById, leaving a seqByRun entry per run. The sweep covers orphaned contexts but normally-terminated runs accumulate seqByRun entries on the happy path. A one-line addition to clearAgentRunContext would close this.
  • SESSION_RUN_TTL_MS and PENDING_ERROR_TTL_MS are defined as local variables inside sweepSubagentRuns rather than as module-level constants, inconsistent with the existing style in the file.

Confidence Score: 4/5

  • Safe to merge; the fix is logically correct and addresses the OOM root cause, with one minor completeness gap in clearAgentRunContext and style nits that don't affect correctness.
  • All three identified leak points are correctly fixed. The control flow in sweepSubagentRuns is correct (the continue after the session-mode block prevents erroneous fall-through to the archive path). No new security surface, no behavioral change for non-session runs. The one unaddressed gap — clearAgentRunContext not cleaning seqByRun — is a pre-existing minor leak that's not blocking. Production load verification is still pending but the code analysis is thorough and the repro scenario is well-understood.
  • src/infra/agent-events.ts: clearAgentRunContext should also delete from seqByRun to fully close the companion leak.

Comments Outside Diff (1)

  1. src/infra/agent-events.ts, line 67-69 (link)

    P2 clearAgentRunContext doesn't clean up seqByRun

    sweepStaleRunContexts correctly cleans both runContextById and seqByRun together. But clearAgentRunContext — the direct path called on every normal lifecycle end/error — only deletes from runContextById, leaving an orphaned seqByRun entry per run on the happy path. Since this PR explicitly identifies seqByRun as a companion leak to runContextById, it would be more complete to clean both here too:

    Without this, the sweep only covers the "missed lifecycle event" case; normally-terminated runs still accumulate seqByRun entries indefinitely.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/infra/agent-events.ts
Line: 67-69

Comment:
**`clearAgentRunContext` doesn't clean up `seqByRun`**

`sweepStaleRunContexts` correctly cleans both `runContextById` and `seqByRun` together. But `clearAgentRunContext` — the direct path called on every normal lifecycle end/error — only deletes from `runContextById`, leaving an orphaned `seqByRun` entry per run on the happy path. Since this PR explicitly identifies `seqByRun` as a companion leak to `runContextById`, it would be more complete to clean both here too:

```suggestion
export function clearAgentRunContext(runId: string) {
  state.runContextById.delete(runId);
  state.seqByRun.delete(runId);
}
```

Without this, the sweep only covers the "missed lifecycle event" case; normally-terminated runs still accumulate `seqByRun` entries indefinitely.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/subagent-registry.ts
Line: 846

Comment:
**TTL constants should be module-level**

`SESSION_RUN_TTL_MS` (and `PENDING_ERROR_TTL_MS` at line 891) are defined as local variables and recreated on every sweep invocation (every 60 s). All other TTL constants in this file — `ANNOUNCE_EXPIRY_MS`, `ANNOUNCE_COMPLETION_HARD_EXPIRY_MS`, `LIFECYCLE_ERROR_RETRY_GRACE_MS` — are module-level constants. Moving these two to module scope keeps the pattern consistent and makes them easier to cross-reference:

```suggestion
const SESSION_RUN_TTL_MS = 5 * 60 * 1000; // 5 min absolute TTL for session-mode runs
```

(Same applies to `PENDING_ERROR_TTL_MS` at line 891.)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(gateway): add TTL cleanup for 3 Maps..." | Re-trigger Greptile

Comment thread src/agents/subagent-registry.ts Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 2acf732f01

Comment thread src/agents/subagent-registry.ts

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 731ae843ab

Comment thread src/agents/subagent-registry.ts Outdated
Comment thread src/agents/subagent-registry.ts
@artwalker artwalker force-pushed the fix/gateway-memory-leak-session-runs branch from d7cb864 to 83538bd Compare March 31, 2026 07:45
@artwalker

Contributor Author

Rebased onto latest main.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 83538bd08c

Comment thread src/infra/agent-events.ts
export function sweepStaleRunContexts(maxAgeMs = 30 * 60 * 1000): number {
  const now = Date.now();
  let swept = 0;
  for (const [runId, ctx] of state.runContextById.entries()) {


P0: Read event state before sweeping stale run contexts

sweepStaleRunContexts iterates state.runContextById but never initializes state, so calling it throws ReferenceError: state is not defined. This is now invoked from the gateway maintenance timer every 60s (startGatewayMaintenanceTimers), so a live gateway hits this path repeatedly and can fail the maintenance tick or crash under uncaught exception handling.


@artwalker artwalker force-pushed the fix/gateway-memory-leak-session-runs branch from 83538bd to 2c9f538 Compare April 1, 2026 23:54
@jalehman jalehman self-assigned this Apr 9, 2026
@jalehman jalehman force-pushed the fix/gateway-memory-leak-session-runs branch from e9e48c7 to 37731c3 Compare April 9, 2026 20:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 37731c3535

Comment on lines +488 to +492
if (typeof entry.cleanupCompletedAt === "number" && now - entry.cleanupCompletedAt > SESSION_RUN_TTL_MS) {
  clearPendingLifecycleError(runId);
  void notifyContextEngineSubagentEnded({
    childSessionKey: entry.childSessionKey,
    reason: "swept",


P2: Avoid emitting a second subagent-ended callback on TTL sweep

The new no-archiveAtMs TTL path sweeps entries based on cleanupCompletedAt, but those entries have already passed through completeCleanupBookkeeping for cleanup: "keep", which already calls notifyContextEngineSubagentEnded(... reason: "completed"). Emitting another callback here with reason: "swept" causes duplicate terminal notifications for the same run, so context-engine implementations that perform non-idempotent cleanup can run teardown logic twice.
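
One possible way to keep the terminal callback single-shot is to record on the entry itself that it has already been notified. This is only a sketch of the idea: the `endedHookEmitted` flag and the `notifyEndedOnce` helper are hypothetical and do not come from this PR, and the thread does not say whether the concern was addressed this way:

```typescript
type EndReason = "completed" | "swept";

// Hypothetical entry shape; `endedHookEmitted` is an invented flag for illustration.
interface RunEntry {
  childSessionKey: string;
  cleanupCompletedAt?: number;
  endedHookEmitted?: boolean;
}

interface EndedEvent {
  childSessionKey: string;
  reason: EndReason;
}

// Emits the terminal notification at most once per entry.
// Returns true if the callback was invoked, false if it was suppressed.
function notifyEndedOnce(
  entry: RunEntry,
  reason: EndReason,
  notify: (event: EndedEvent) => void,
): boolean {
  if (entry.endedHookEmitted) return false;
  entry.endedHookEmitted = true;
  notify({ childSessionKey: entry.childSessionKey, reason });
  return true;
}
```

Storing the flag on the entry, rather than in a separate Set of run IDs, means the bookkeeping disappears when the entry is swept and does not become another unbounded collection.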


artwalker and others added 8 commits April 9, 2026 13:57
Gateway crashes with OOM after batch processing ~1000+ agent sessions.
Three Maps accumulate entries without cleanup:

1. subagentRuns: session-mode runs have archiveAtMs=undefined, so the
   60s sweeper skips them forever. Add 5-min absolute TTL for completed
   session runs. Also sweep orphaned pendingLifecycleError entries with
   5-min TTL.

2. runContextById: only cleaned via manual clearAgentRunContext() calls.
   Add registeredAt timestamp and sweepStaleRunContexts() with 30-min
   TTL, called from the existing 60s maintenance timer. Also sweeps
   the companion seqByRun Map for the same runIds.

3. pendingLifecycleErrorByRunId: 15s retry timer but no absolute TTL.
   Now swept in sweepSubagentRuns() after 5 min.

Pre-deploy entries without registeredAt are treated as infinitely old
and swept immediately.

Fixes openclaw#52725

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. clearAgentRunContext now also deletes seqByRun (Greptile P2)
2. TTL constants moved to module scope (Greptile P2)
3. Session-mode TTL uses cleanupCompletedAt instead of endedAt to
   avoid interrupting deferred cleanup flows (Codex P1)
4. Added lastActiveAt to AgentRunContext, refreshed on every
   emitAgentEvent — long-running active agents are not swept (Codex P1)
5. resetAgentRunContextForTest also clears seqByRun (P2 drive-by)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
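
To illustrate item 4 above, here is a sketch of an activity-based sweep. It assumes the sweep compares against `lastActiveAt`; the function signatures are simplified (the real `emitAgentEvent` takes an event payload, and `registerRun` and `sweepIdleRunContexts` are hypothetical names):

```typescript
interface AgentRunContext {
  registeredAt: number;
  lastActiveAt: number; // refreshed on every emitted event
}

const runContextById = new Map<string, AgentRunContext>();

// Hypothetical registration helper for this sketch.
function registerRun(runId: string, now: number): void {
  runContextById.set(runId, { registeredAt: now, lastActiveAt: now });
}

// Simplified: the real function also delivers the event to listeners.
function emitAgentEvent(runId: string, now: number): void {
  const ctx = runContextById.get(runId);
  if (ctx) ctx.lastActiveAt = now;
}

// Sweeps contexts that have been idle, not merely old, for longer than maxIdleMs.
function sweepIdleRunContexts(now: number, maxIdleMs = 30 * 60 * 1000): number {
  let swept = 0;
  for (const [runId, ctx] of runContextById) {
    if (now - ctx.lastActiveAt > maxIdleMs) {
      runContextById.delete(runId);
      swept++;
    }
  }
  return swept;
}
```

With this shape, a run registered 40 minutes ago that emitted an event 15 minutes ago survives the sweep, while a run that has been silent for the whole period does not.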
The sweeper was only started when archiveAtMs was truthy, so pure
session-mode workloads (archiveAtMs=undefined) never triggered the
60s sweep interval. This meant the TTL cleanup for session-mode runs,
pending lifecycle errors, and orphaned contexts never executed —
defeating the OOM fix for the exact batch scenario it was designed for.

Remove the if(archiveAtMs) guard at both registration sites so the
sweeper runs for all workloads. startSweeper() is idempotent (returns
immediately if already running) so this is safe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
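
A small sketch of why the unconditional call is safe, assuming `startSweeper()` is guarded by a module-level timer handle. The `stopSweeper`, `registerRun`, and `intervalsCreated` names are hypothetical helpers added for illustration:

```typescript
let sweeper: ReturnType<typeof setInterval> | undefined;
let intervalsCreated = 0; // illustration only: counts real timer creations

// Idempotent: returns immediately if the sweeper is already running.
function startSweeper(sweep: () => void, intervalMs = 60 * 1000): void {
  if (sweeper !== undefined) return;
  intervalsCreated++;
  sweeper = setInterval(sweep, intervalMs);
}

function stopSweeper(): void {
  if (sweeper === undefined) return;
  clearInterval(sweeper);
  sweeper = undefined;
}

// Hypothetical registration path: the sweeper starts for every workload,
// including session-mode runs where archiveAtMs is undefined.
function registerRun(
  runs: Map<string, { archiveAtMs: number | undefined }>,
  runId: string,
  archiveAtMs: number | undefined,
  sweep: () => void,
): void {
  runs.set(runId, { archiveAtMs });
  startSweeper(sweep); // previously guarded by `if (archiveAtMs)`
}
```

Registering any number of runs creates at most one interval, so dropping the guard adds no timers for workloads that already had one.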
The new registeredAt and lastActiveAt fields on AgentRunContext cause
toEqual to fail because it expects exact property counts. Switch to
toMatchObject which validates expected properties without requiring
exact match — the standard pattern for objects with auto-injected
timestamps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…per on restore

1. Guard safeRemoveAttachmentsDir with retainAttachmentsOnKeep check
   in the session-mode TTL sweep, matching the existing pattern in
   finalizeSubagentCleanup (Codex P1)

2. Start sweeper unconditionally in restoreSubagentRunsOnce(), matching
   the register paths — ensures TTL cleanup runs after restart even
   when all restored entries are session-mode (Codex P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jalehman jalehman force-pushed the fix/gateway-memory-leak-session-runs branch from 37731c3 to 4816a29 Compare April 9, 2026 20:58
@jalehman jalehman merged commit 820dc38 into openclaw:main Apr 9, 2026
8 checks passed
@jalehman

jalehman commented Apr 9, 2026

Contributor

Merged via squash.

Thanks @artwalker!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Reviewed commit: 4816a29de5

Comment on lines +490 to +493
void notifyContextEngineSubagentEnded({
  childSessionKey: entry.childSessionKey,
  reason: "swept",
  workspaceDir: entry.workspaceDir,


P2: Avoid re-emitting subagent end callback after keep cleanup

This sweep path runs only when cleanupCompletedAt is already set, but keep-mode runs set cleanupCompletedAt via completeCleanupBookkeeping, which already calls notifyContextEngineSubagentEnded(... reason: "completed"). Emitting reason: "swept" again here introduces a second terminal callback for the same run ~5 minutes later, which is a compatibility regression for context-engine plugins that treat onSubagentEnded as a one-time teardown signal and can run non-idempotent cleanup twice.


steipete pushed a commit that referenced this pull request Apr 10, 2026
…OOM (#52731)

Merged via squash.

Prepared head SHA: 4816a29
Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com>
Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com>
Reviewed-by: @jalehman
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 24, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: Gateway OOM after batch agent sessions — 3 Maps grow unbounded
