Bug type
Crash (process/app exits or hangs)
Summary
Gateway crashes with an out-of-memory (OOM) error once the heap reaches about 8GB, after processing 1000+ agent sessions in a batch. Root cause: three Maps in the Gateway process grow without bound because their cleanup mechanisms have gaps.
Environment: Running batch RCAgent workload — each scan-closed ticket creates 2 agent sessions (knowledge-extractor + qa-reviewer). ~1400 tickets = ~2800 sessions → Gateway OOM.
Three leak points identified
1. Primary: subagentRuns Map (subagent-registry.ts:65)
- When spawnMode === "session", archiveAtMs is set to undefined (L1118-1119)
- sweepSubagentRuns() (L726-759, runs every 60s) skips entries where !entry.archiveAtMs
- Result: session-mode runs are never cleaned up, so the Map grows indefinitely
2. Secondary: runContextById Map (agent-events.ts:25)
- Only cleaned via manual clearAgentRunContext(runId) calls when lifecycle reaches "end"/"error"
- If a run times out or Gateway OOM occurs before lifecycle end, context is orphaned forever
- No TTL-based cleanup exists
3. Secondary: pendingLifecycleErrorByRunId Map (subagent-registry.ts:254-261)
- Has a 15-second retry timer per entry, but no absolute TTL
- If the lifecycle error event never arrives, entries can accumulate
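The gap in leak point 1 can be illustrated with a minimal, self-contained sketch. This is not the actual code from subagent-registry.ts: the RunEntry shape, the "run" spawn mode, the register() helper and the 60-second archive delay are all assumptions made for illustration. Only the skip condition (!entry.archiveAtMs) and the undefined archiveAtMs for session mode come from the findings above.

```typescript
// Hypothetical, simplified entry shape; the real registry entry has more fields.
interface RunEntry {
  spawnMode: "run" | "session"; // "run" is an assumed name for the non-session mode
  archiveAtMs: number | undefined;
}

const runs = new Map<string, RunEntry>();

// Assumed helper: session-mode entries get no archive deadline, as reported above.
function register(runId: string, spawnMode: RunEntry["spawnMode"], now: number): void {
  runs.set(runId, {
    spawnMode,
    archiveAtMs: spawnMode === "session" ? undefined : now + 60 * 1000,
  });
}

// Mirrors the reported sweep behavior: entries without archiveAtMs are skipped.
function sweep(now: number): number {
  let removed = 0;
  runs.forEach((entry, runId) => {
    if (!entry.archiveAtMs || entry.archiveAtMs > now) return;
    runs.delete(runId);
    removed++;
  });
  return removed;
}

register("a", "run", 0);
register("b", "session", 0);

// Two minutes later only the non-session entry has been swept.
const removed = sweep(2 * 60 * 1000);
console.log(removed, runs.size, runs.has("b")); // prints: 1 1 true
```

With roughly 2800 session-mode runs per batch, every one of them would stay in the Map in the same way as entry "b" here.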
Already-working cleanup (for reference)
chatRunState.abortedRuns — 1 hour TTL ✅
agentRunSeq — prunes when >10,000 entries ✅
toolEventRecipients — 10 min TTL + prune ✅
Steps to reproduce
- Run a batch workload creating 1000+ agent sessions via Gateway RPC (e.g., using spawnMode: "session")
- Monitor Gateway memory:
ps -o rss= -p $(pgrep -f openclaw-gateway) | awk '{print $1/1024 "MB"}'
- Memory grows linearly with session count, never reclaimed
- Gateway OOM crashes around 8GB
Expected behavior
Gateway memory should stabilize after sessions complete — completed session-mode runs should be cleaned up by the existing sweep timer.
Actual behavior
subagentRuns Map grows indefinitely for session-mode spawns. Gateway eventually OOM crashes.
Proposed fix
- Add absolute TTL to sweepSubagentRuns() for session-mode runs (no archiveAtMs): clean up 5 min after completion
- Add TTL-based sweep for runContextById: clean up entries older than 30 min
- Add absolute TTL for pendingLifecycleErrorByRunId: force-finalize after 5 min
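A rough sketch of what these fixes could look like, under stated assumptions. The endedAtMs and createdAtMs fields, the function names and the onExpire callback are hypothetical; the real entries may track time differently, and the actual force-finalize logic is not shown. The TTL values are the ones proposed above.

```typescript
// Proposed TTLs from this issue.
const SESSION_RUN_TTL_MS = 5 * 60 * 1000;
const RUN_CONTEXT_TTL_MS = 30 * 60 * 1000;

// Hypothetical entry shape: endedAtMs is assumed to be set when a run completes.
interface RunEntry {
  endedAtMs: number | undefined;
  archiveAtMs: number | undefined; // undefined for session-mode runs
}

// Expiry time for an entry, or undefined while the run is still active.
function expiryOf(entry: RunEntry): number | undefined {
  if (entry.archiveAtMs !== undefined) return entry.archiveAtMs;
  if (entry.endedAtMs !== undefined) return entry.endedAtMs + SESSION_RUN_TTL_MS;
  return undefined;
}

// Sweep that falls back to a completion-based TTL when archiveAtMs is missing.
function sweepRunsWithFallback(runs: Map<string, RunEntry>, now: number): number {
  let removed = 0;
  runs.forEach((entry, runId) => {
    const expiresAt = expiryOf(entry);
    if (expiresAt !== undefined && expiresAt <= now) {
      runs.delete(runId);
      removed++;
    }
  });
  return removed;
}

// Generic age-based sweep, usable for runContextById-style maps (30 min) or
// pending-error maps (5 min, with onExpire standing in for force-finalize).
function sweepOlderThan<V extends { createdAtMs: number }>(
  map: Map<string, V>,
  ttlMs: number,
  now: number,
  onExpire?: (key: string, value: V) => void,
): number {
  let removed = 0;
  map.forEach((value, key) => {
    if (now - value.createdAtMs > ttlMs) {
      map.delete(key);
      if (onExpire) onExpire(key, value);
      removed++;
    }
  });
  return removed;
}

const runs = new Map<string, RunEntry>();
runs.set("session-done", { endedAtMs: 0, archiveAtMs: undefined });
runs.set("session-active", { endedAtMs: undefined, archiveAtMs: undefined });
const removedRuns = sweepRunsWithFallback(runs, SESSION_RUN_TTL_MS);

const contexts = new Map<string, { createdAtMs: number }>();
contexts.set("old", { createdAtMs: 0 });
contexts.set("fresh", { createdAtMs: RUN_CONTEXT_TTL_MS });
const expired: string[] = [];
const removedContexts = sweepOlderThan(
  contexts,
  RUN_CONTEXT_TTL_MS,
  RUN_CONTEXT_TTL_MS + 1,
  (key) => { expired.push(key); },
);

console.log(removedRuns, removedContexts, expired); // prints: 1 1 [ 'old' ]
```

Keying the fallback on completion time rather than creation time keeps long-lived but still active sessions in the Map, which is why "session-active" survives the sweep.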
OpenClaw version
2026.3.13+
Operating system
Linux (production server)
Impact and severity
High — Gateway process crashes under batch workloads, affecting all connected clients.