fix(gateway): add TTL cleanup for 3 Maps that grow unbounded causing OOM by artwalker · Pull Request #52731 · openclaw/openclaw

artwalker · 2026-03-23T08:21:43Z

Summary

Problem: Gateway OOM crashes after batch processing ~1000+ agent sessions (8GB heap overflow)
Why it matters: Any batch workload (e.g., RCAgent scanning 1400 tickets × 2 sessions each) kills the Gateway process
What changed: Added TTL-based cleanup for 3 Maps that lacked cleanup mechanisms for session-mode runs
What did NOT change: Existing archive-based cleanup for non-session runs, existing TTL mechanisms for chatRunState/agentRunSeq/toolEventRecipients

Change Type

Bug fix

Scope

Gateway / orchestration

Linked Issue

Fixes [Bug]: Gateway OOM after batch agent sessions — 3 Maps grow unbounded #52725

User-visible / Behavior Changes

Gateway no longer OOMs under batch workloads. Completed session-mode subagent runs are cleaned up after 5 minutes. Stale run contexts are cleaned up after 30 minutes.

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No
Data access scope changed? No

Three leak points fixed

1. Primary: `subagentRuns` Map (subagent-registry.ts)

Session-mode spawns set archiveAtMs = undefined. The 60s sweeper skipped entries without archiveAtMs. Added absolute 5-min TTL for completed session runs (keyed on endedAt).
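The three cases this sweep has to distinguish (session-mode and ended, session-mode and still running, archive-based) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in subagent-registry.ts: `SubagentRunEntry`, `sweepSessionRuns`, the injected `now` parameter, and the archive comparison are hypothetical; only `archiveAtMs`, `endedAt`, and the 5-minute TTL come from the description. A later commit in this PR keys the TTL on `cleanupCompletedAt` instead of `endedAt`.

```typescript
// Hypothetical, simplified entry shape; the real registry entry has more fields.
interface SubagentRunEntry {
  archiveAtMs?: number; // undefined for session-mode spawns
  endedAt?: number; // set once the run has completed
}

const SESSION_RUN_TTL_MS = 5 * 60 * 1000; // 5-minute absolute TTL

// Returns the number of entries removed. `now` is injected so the sketch is testable.
function sweepSessionRuns(
  runs: Map<string, SubagentRunEntry>,
  now: number = Date.now(),
): number {
  let swept = 0;
  for (const [runId, entry] of runs) {
    if (entry.archiveAtMs === undefined) {
      // Session-mode: sweep only once the run has ended and the TTL has elapsed.
      if (entry.endedAt !== undefined && now - entry.endedAt > SESSION_RUN_TTL_MS) {
        runs.delete(runId);
        swept++;
      }
      continue; // never fall through to the archive path
    }
    // Archive-based path for non-session runs (comparison is an assumption).
    if (now >= entry.archiveAtMs) {
      runs.delete(runId);
      swept++;
    }
  }
  return swept;
}
```

Deleting from a `Map` while iterating it with `for...of` is well-defined in JavaScript, so no separate "to delete" list is needed.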

2. Secondary: `runContextById` + `seqByRun` Maps (agent-events.ts)

Only cleaned via manual clearAgentRunContext() on lifecycle end/error. Added registeredAt timestamp to AgentRunContext and sweepStaleRunContexts() with 30-min TTL, called from the existing 60s maintenance timer. Also sweeps companion seqByRun entries. Pre-deploy entries without registeredAt are treated as infinitely old.
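A rough sketch of the registration timestamp and stale-context sweep described above. The Map names, `registeredAt`, the function name `sweepStaleRunContexts`, and the 30-minute default come from this PR; the simplified `AgentRunContext` shape, the `registerAgentRunContext` helper, and the extra `now` parameter are assumptions made to keep the example self-contained.

```typescript
// Hypothetical, simplified context shape.
interface AgentRunContext {
  sessionKey?: string;
  registeredAt?: number; // missing on entries created before this change
}

const runContextById = new Map<string, AgentRunContext>();
const seqByRun = new Map<string, number>();

// Illustrative registration helper: registeredAt is set on first registration only.
function registerAgentRunContext(runId: string, ctx: AgentRunContext, now: number = Date.now()): void {
  const existing = runContextById.get(runId);
  runContextById.set(runId, {
    ...existing,
    ...ctx,
    registeredAt: existing?.registeredAt ?? now,
  });
}

// Removes contexts older than maxAgeMs along with their companion seqByRun entries.
function sweepStaleRunContexts(maxAgeMs: number = 30 * 60 * 1000, now: number = Date.now()): number {
  let swept = 0;
  for (const [runId, ctx] of runContextById) {
    // A missing registeredAt is treated as infinitely old.
    const age = ctx.registeredAt === undefined ? Infinity : now - ctx.registeredAt;
    if (age > maxAgeMs) {
      runContextById.delete(runId);
      seqByRun.delete(runId);
      swept++;
    }
  }
  return swept;
}
```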

3. Secondary: `pendingLifecycleErrorByRunId` Map (subagent-registry.ts)

Had 15s retry timer but no absolute TTL. Now swept in sweepSubagentRuns() after 5 minutes.

Repro + Verification

Environment

OS: Linux (production server)
Runtime: Node.js, OpenClaw 2026.3.13+

Steps

Run batch workload creating 1000+ agent sessions via Gateway RPC (spawnMode: "session")
Monitor memory: `ps -o rss= -p $(pgrep -f openclaw-gateway) | awk '{print $1/1024 "MB"}'`

Expected

Memory stabilizes below roughly 2GB once sessions complete and the sweep timers have cleaned them up.

Actual (before fix)

Memory grows linearly, never reclaimed. OOM at ~8GB.

Evidence

Code analysis identifying 3 unbounded Maps
Verified existing cleanup mechanisms skip session-mode entries
npm run build passes
Memory monitoring under batch load (production verification pending)

Human Verification (required)

Verified: sweepSubagentRuns() correctly handles 3 cases (no archiveAtMs + ended, no archiveAtMs + running, has archiveAtMs)
Verified: registeredAt only set on first registration, not updates
Verified: seqByRun swept alongside runContextById to prevent companion leak
Verified: pre-deploy entries without registeredAt treated as infinitely old
Not verified: production load test (requires batch workload environment)

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No

Failure Recovery

Revert the single commit to restore previous behavior
Watch for: Gateway memory growth under batch workloads returning

Risks and Mitigations

Risk: 30-min TTL for runContextById could sweep contexts for legitimately long-running agents (>30 min). Mitigation: only metadata (sessionKey, verboseLevel) is lost, not the run itself — agent continues, just without enriched event routing.

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9b0958190

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

greptile-apps · 2026-03-23T08:26:34Z

Greptile Summary

This PR fixes a real Gateway OOM issue under batch workloads by adding TTL-based cleanup to three Maps that previously grew without bound for session-mode runs. The approach is sound and consistent with the existing sweep/maintenance patterns in the codebase.

Key changes:

subagentRuns entries with no archiveAtMs (session-mode) are now deleted 5 minutes after endedAt is set, matching the behavior of the existing archive-based sweep for non-session runs.
Orphaned pendingLifecycleErrorByRunId entries (whose 15-second retry timer may have unref()d before firing) are now also swept after 5 minutes in the same sweepSubagentRuns pass.
runContextById entries now carry a registeredAt timestamp and are swept after 30 minutes via a new sweepStaleRunContexts() call hooked into the existing 60-second gateway maintenance timer; companion seqByRun entries are cleaned up alongside them.

Minor gaps worth noting:

clearAgentRunContext (the direct lifecycle-end path) still only removes the entry from runContextById, leaving a seqByRun entry per run. The sweep covers orphaned contexts but normally-terminated runs accumulate seqByRun entries on the happy path. A one-line addition to clearAgentRunContext would close this.
SESSION_RUN_TTL_MS and PENDING_ERROR_TTL_MS are defined as local variables inside sweepSubagentRuns rather than as module-level constants, inconsistent with the existing style in the file.

Confidence Score: 4/5

Safe to merge; the fix is logically correct and addresses the OOM root cause, with one minor completeness gap in clearAgentRunContext and style nits that don't affect correctness.
All three identified leak points are correctly fixed. The control flow in sweepSubagentRuns is correct (the continue after the session-mode block prevents erroneous fall-through to the archive path). No new security surface, no behavioral change for non-session runs. The one unaddressed gap — clearAgentRunContext not cleaning seqByRun — is a pre-existing minor leak that's not blocking. Production load verification is still pending but the code analysis is thorough and the repro scenario is well-understood.
src/infra/agent-events.ts — clearAgentRunContext should also delete from seqByRun to fully close the companion leak.

Comments Outside Diff (1)

src/infra/agent-events.ts, line 67-69

clearAgentRunContext doesn't clean up seqByRun

sweepStaleRunContexts correctly cleans both runContextById and seqByRun together. But clearAgentRunContext — the direct path called on every normal lifecycle end/error — only deletes from runContextById, leaving an orphaned seqByRun entry per run on the happy path. Since this PR explicitly identifies seqByRun as a companion leak to runContextById, it would be more complete to clean both here too:

Without this, the sweep only covers the "missed lifecycle event" case; normally-terminated runs still accumulate seqByRun entries indefinitely.

Prompt To Fix All With AI

This is a comment left during a code review.
Path: src/infra/agent-events.ts
Line: 67-69

Comment:
**`clearAgentRunContext` doesn't clean up `seqByRun`**

`sweepStaleRunContexts` correctly cleans both `runContextById` and `seqByRun` together. But `clearAgentRunContext` — the direct path called on every normal lifecycle end/error — only deletes from `runContextById`, leaving an orphaned `seqByRun` entry per run on the happy path. Since this PR explicitly identifies `seqByRun` as a companion leak to `runContextById`, it would be more complete to clean both here too:

```suggestion
export function clearAgentRunContext(runId: string) {
  state.runContextById.delete(runId);
  state.seqByRun.delete(runId);
}
```

Without this, the sweep only covers the "missed lifecycle event" case; normally-terminated runs still accumulate `seqByRun` entries indefinitely.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/subagent-registry.ts
Line: 846

Comment:
**TTL constants should be module-level**

`SESSION_RUN_TTL_MS` (and `PENDING_ERROR_TTL_MS` at line 891) are defined as local variables and recreated on every sweep invocation (every 60 s). All other TTL constants in this file — `ANNOUNCE_EXPIRY_MS`, `ANNOUNCE_COMPLETION_HARD_EXPIRY_MS`, `LIFECYCLE_ERROR_RETRY_GRACE_MS` — are module-level constants. Moving these two to module scope keeps the pattern consistent and makes them easier to cross-reference:

```suggestion
const SESSION_RUN_TTL_MS = 5 * 60 * 1000; // 5 min absolute TTL for session-mode runs
```

(Same applies to `PENDING_ERROR_TTL_MS` at line 891.)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "fix(gateway): add TTL cleanup for 3 Maps..."

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2acf732f01

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 731ae843ab

artwalker · 2026-03-31T07:46:10Z

Rebased onto latest main.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83538bd08c

chatgpt-codex-connector · 2026-03-31T07:50:37Z

+export function sweepStaleRunContexts(maxAgeMs = 30 * 60 * 1000): number {
+  const now = Date.now();
+  let swept = 0;
+  for (const [runId, ctx] of state.runContextById.entries()) {


Read event state before sweeping stale run contexts

sweepStaleRunContexts iterates state.runContextById but never initializes state, so calling it throws ReferenceError: state is not defined. This is now invoked from the gateway maintenance timer every 60s (startGatewayMaintenanceTimers), so a live gateway hits this path repeatedly and can fail the maintenance tick or crash under uncaught exception handling.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 37731c3535

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-09T20:55:36Z

+      if (typeof entry.cleanupCompletedAt === "number" && now - entry.cleanupCompletedAt > SESSION_RUN_TTL_MS) {
+        clearPendingLifecycleError(runId);
+        void notifyContextEngineSubagentEnded({
+          childSessionKey: entry.childSessionKey,
+          reason: "swept",


Avoid emitting a second subagent-ended callback on TTL sweep

The new no-archiveAtMs TTL path sweeps entries based on cleanupCompletedAt, but those entries have already passed through completeCleanupBookkeeping for cleanup: "keep", which already calls notifyContextEngineSubagentEnded(... reason: "completed"). Emitting another callback here with reason: "swept" causes duplicate terminal notifications for the same run, so context-engine implementations that perform non-idempotent cleanup can run teardown logic twice.

Gateway crashes with OOM after batch processing ~1000+ agent sessions. Three Maps accumulate entries without cleanup:

1. subagentRuns: session-mode runs have archiveAtMs=undefined, so the 60s sweeper skips them forever. Add 5-min absolute TTL for completed session runs. Also sweep orphaned pendingLifecycleError entries with 5-min TTL.
2. runContextById: only cleaned via manual clearAgentRunContext() calls. Add registeredAt timestamp and sweepStaleRunContexts() with 30-min TTL, called from the existing 60s maintenance timer. Also sweeps the companion seqByRun Map for the same runIds.
3. pendingLifecycleErrorByRunId: 15s retry timer but no absolute TTL. Now swept in sweepSubagentRuns() after 5 min.

Pre-deploy entries without registeredAt are treated as infinitely old and swept immediately.

Fixes openclaw#52725

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1. clearAgentRunContext now also deletes seqByRun (Greptile P2)
2. TTL constants moved to module scope (Greptile P2)
3. Session-mode TTL uses cleanupCompletedAt instead of endedAt to avoid interrupting deferred cleanup flows (Codex P1)
4. Added lastActiveAt to AgentRunContext, refreshed on every emitAgentEvent, so long-running active agents are not swept (Codex P1)
5. resetAgentRunContextForTest also clears seqByRun (P2 drive-by)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
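Item 4 above changes the sweep from "age since registration" to "time since last activity". A minimal sketch of that idea, assuming a simplified context shape: `lastActiveAt`, `registeredAt`, and `emitAgentEvent` are names from this PR, while `RunContext`, `contexts`, `register`, `sweepIdle`, and the explicit `now` arguments are hypothetical stand-ins for illustration.

```typescript
// Hypothetical, simplified context carrying both timestamps.
interface RunContext {
  registeredAt: number;
  lastActiveAt: number;
}

const contexts = new Map<string, RunContext>();

function register(runId: string, now: number): void {
  if (!contexts.has(runId)) {
    contexts.set(runId, { registeredAt: now, lastActiveAt: now });
  }
}

// Stand-in for emitAgentEvent: every emitted event refreshes the activity timestamp.
function emitAgentEvent(runId: string, now: number): void {
  const ctx = contexts.get(runId);
  if (ctx) ctx.lastActiveAt = now;
}

// Sweeps contexts that have been idle longer than maxIdleMs; returns the removed run IDs.
function sweepIdle(maxIdleMs: number, now: number): string[] {
  const removed: string[] = [];
  for (const [runId, ctx] of contexts) {
    if (now - ctx.lastActiveAt > maxIdleMs) {
      contexts.delete(runId);
      removed.push(runId);
    }
  }
  return removed;
}
```

With this shape, an agent that has been running for hours but emitted an event a few minutes ago keeps its context, while a context whose lifecycle end was missed still ages out.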

The sweeper was only started when archiveAtMs was truthy, so pure session-mode workloads (archiveAtMs=undefined) never triggered the 60s sweep interval. This meant the TTL cleanup for session-mode runs, pending lifecycle errors, and orphaned contexts never executed, defeating the OOM fix for the exact batch scenario it was designed for.

Remove the if(archiveAtMs) guard at both registration sites so the sweeper runs for all workloads. startSweeper() is idempotent (returns immediately if already running) so this is safe.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
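The reason the unconditional call is safe can be sketched as below. Only the name `startSweeper` and the 60s interval come from this PR; `stopSweeper`, `isSweeperRunning`, `registerRun`, and the `timersCreated` counter are hypothetical helpers added so the sketch is self-contained.

```typescript
let sweeper: ReturnType<typeof setInterval> | null = null;
let timersCreated = 0; // illustration only: counts how many timers were really created

// Idempotent: a second call while the timer is running is a no-op.
function startSweeper(sweep: () => void, intervalMs: number = 60 * 1000): void {
  if (sweeper !== null) return;
  timersCreated++;
  sweeper = setInterval(sweep, intervalMs);
}

function stopSweeper(): void {
  if (sweeper === null) return;
  clearInterval(sweeper);
  sweeper = null;
}

function isSweeperRunning(): boolean {
  return sweeper !== null;
}

// Hypothetical registration site: the sweeper is started for every run,
// including session-mode runs where archiveAtMs is undefined.
function registerRun(
  runs: Map<string, { archiveAtMs?: number }>,
  runId: string,
  archiveAtMs?: number,
): void {
  runs.set(runId, { archiveAtMs });
  startSweeper(() => {
    /* the real sweep would run here */
  });
}
```

Because the guard lives inside `startSweeper`, call sites do not need to know whether a timer already exists, which is what allows the `if (archiveAtMs)` check to be dropped.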

The new registeredAt and lastActiveAt fields on AgentRunContext cause toEqual to fail because it expects exact property counts. Switch to toMatchObject, which validates expected properties without requiring an exact match; this is the standard pattern for objects with auto-injected timestamps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…per on restore

1. Guard safeRemoveAttachmentsDir with retainAttachmentsOnKeep check in the session-mode TTL sweep, matching the existing pattern in finalizeSubagentCleanup (Codex P1)
2. Start sweeper unconditionally in restoreSubagentRunsOnce(), matching the register paths; this ensures TTL cleanup runs after restart even when all restored entries are session-mode (Codex P2)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jalehman · 2026-04-09T20:58:51Z

Merged via squash.

Prepared head SHA: 4816a29de50f91ea1ce4a98ea3a388bd537177de
Merge commit: 820dc3852530bffd2451bb1a6ddbcdd9f6324d66

Thanks @artwalker!

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4816a29de5

chatgpt-codex-connector · 2026-04-09T21:05:10Z

+        void notifyContextEngineSubagentEnded({
+          childSessionKey: entry.childSessionKey,
+          reason: "swept",
+          workspaceDir: entry.workspaceDir,


Avoid re-emitting subagent end callback after keep cleanup

This sweep path runs only when cleanupCompletedAt is already set, but keep-mode runs set cleanupCompletedAt via completeCleanupBookkeeping, which already calls notifyContextEngineSubagentEnded(... reason: "completed"). Emitting reason: "swept" again here introduces a second terminal callback for the same run ~5 minutes later, which is a compatibility regression for context-engine plugins that treat onSubagentEnded as a one-time teardown signal and can run non-idempotent cleanup twice.
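One possible way to address this concern, shown only as a sketch and not as what the merged PR does, is to record on the entry that the terminal callback has already fired and skip it on the sweep path. The `endedHookEmitted` flag, `RunEntry`, `notifyEnded`, and the `notifications` log are hypothetical; `notifyEnded` stands in for `notifyContextEngineSubagentEnded` and records calls instead of invoking plugins.

```typescript
type EndReason = "completed" | "swept";

// Hypothetical, simplified entry shape with a once-only flag.
interface RunEntry {
  childSessionKey: string;
  cleanupCompletedAt?: number;
  endedHookEmitted?: boolean;
}

const notifications: Array<{ key: string; reason: EndReason }> = [];

// Emits at most one terminal notification per entry, whichever path reaches it first.
function notifyEnded(entry: RunEntry, reason: EndReason): void {
  if (entry.endedHookEmitted) return;
  entry.endedHookEmitted = true;
  notifications.push({ key: entry.childSessionKey, reason });
}
```

Keeping the flag on the entry itself, rather than in a separate Set of run IDs, means the bookkeeping is dropped together with the entry when it is swept.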

@jalehman

…OOM (#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman

@jalehman

…OOM (openclaw#52731) Merged via squash. Prepared head SHA: 4816a29 Co-authored-by: artwalker <44759507+artwalker@users.noreply.github.com> Co-authored-by: jalehman <550978+jalehman@users.noreply.github.com> Reviewed-by: @jalehman

openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: S labels Mar 23, 2026

chatgpt-codex-connector Bot reviewed Mar 23, 2026

Comment thread src/infra/agent-events.ts Outdated

Comment thread src/agents/subagent-registry.ts

greptile-apps Bot reviewed Mar 23, 2026

Comment thread src/agents/subagent-registry.ts Outdated

chatgpt-codex-connector Bot reviewed Mar 23, 2026

Comment thread src/agents/subagent-registry.ts

chatgpt-codex-connector Bot reviewed Mar 23, 2026

Comment thread src/agents/subagent-registry.ts Outdated

Comment thread src/agents/subagent-registry.ts

artwalker force-pushed the fix/gateway-memory-leak-session-runs branch from d7cb864 to 83538bd Compare March 31, 2026 07:45

chatgpt-codex-connector Bot reviewed Mar 31, 2026

artwalker force-pushed the fix/gateway-memory-leak-session-runs branch from 83538bd to 2c9f538 Compare April 1, 2026 23:54

jalehman self-assigned this Apr 9, 2026

jalehman force-pushed the fix/gateway-memory-leak-session-runs branch from e9e48c7 to 37731c3 Compare April 9, 2026 20:49

chatgpt-codex-connector Bot reviewed Apr 9, 2026

artwalker and others added 8 commits April 9, 2026 13:57

fix: repair stale agent run context sweep

c8123c1

fix: document stale run context cleanup

606c2cc

fix: clean rebased changelog entry

4816a29

jalehman force-pushed the fix/gateway-memory-leak-session-runs branch from 37731c3 to 4816a29 Compare April 9, 2026 20:58

jalehman merged commit 820dc38 into openclaw:main Apr 9, 2026
8 checks passed

chatgpt-codex-connector Bot reviewed Apr 9, 2026

karanuppal mentioned this pull request Apr 9, 2026

fix: clean up seqByRun entry in clearAgentRunContext() #56411

Closed

github-actions Bot mentioned this pull request Apr 9, 2026

📡 Upstream Digest — 2026-04-09 22:33 UTC curtismercier/openclaw-mods#521

Open

github-actions Bot mentioned this pull request Apr 13, 2026

Update ghcr.io/openclaw/openclaw Docker tag to v2026.5.7 - autoclosed claytono/infra#1977

Closed

1 task

clawsweeper Bot mentioned this pull request Apr 26, 2026

fix(gateway): ignore stale post-lifecycle tails without hiding seq gaps #51080

Closed

20 tasks
