[Bug]: Gateway OOM after batch agent sessions — 3 Maps grow unbounded #52725

@artwalker

Bug type

Crash (process/app exits or hangs)

Summary

Gateway crashes with an out-of-memory (OOM) error once its heap reaches about 8GB, after processing 1000+ agent sessions in a batch. Root cause: three Maps in the Gateway process grow without bound because their cleanup mechanisms have gaps.

Environment: Running batch RCAgent workload — each scan-closed ticket creates 2 agent sessions (knowledge-extractor + qa-reviewer). ~1400 tickets = ~2800 sessions → Gateway OOM.

Three leak points identified

1. Primary: subagentRuns Map (subagent-registry.ts:65)

  • When spawnMode === "session", archiveAtMs is set to undefined (L1118-1119)
  • sweepSubagentRuns() (L726-759, runs every 60s) skips entries where !entry.archiveAtMs
  • Result: session-mode runs are never cleaned up, Map grows indefinitely
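
To make the gap concrete, here is a simplified model of the sweep behavior described above. It is a sketch, not the code from subagent-registry.ts: the entry shape, the "run" spawn mode value, and the endedAtMs field are assumptions made for illustration. Only archiveAtMs, spawnMode === "session", and the skip condition come from the analysis above.

```typescript
// Simplified model of the sweep gap; not the actual Gateway code.
// "run" as the non-session spawn mode and endedAtMs are hypothetical.
type SpawnMode = "run" | "session";

interface RunEntry {
  spawnMode: SpawnMode;
  endedAtMs?: number;
  archiveAtMs?: number; // left undefined for session-mode runs
}

const subagentRuns = new Map<string, RunEntry>();

function sweepSubagentRuns(now: number): number {
  let removedCount = 0;
  subagentRuns.forEach((entry, runId) => {
    // Entries without an archive deadline are skipped, so nothing ever removes them.
    if (!entry.archiveAtMs) return;
    if (entry.archiveAtMs <= now) {
      subagentRuns.delete(runId);
      removedCount++;
    }
  });
  return removedCount;
}

// Both runs finished at t=1000; only the one with an archive deadline is swept.
subagentRuns.set("a", { spawnMode: "run", endedAtMs: 1000, archiveAtMs: 2000 });
subagentRuns.set("b", { spawnMode: "session", endedAtMs: 1000 });
sweepSubagentRuns(10000);
console.log(subagentRuns.has("a"), subagentRuns.has("b"));
```

However long the completed session-mode entry sits in the Map, no later sweep will pick it up.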

2. Secondary: runContextById Map (agent-events.ts:25)

  • Only cleaned via manual clearAgentRunContext(runId) calls when lifecycle reaches "end"/"error"
  • If a run times out, or its lifecycle "end"/"error" event otherwise never arrives, its context is never removed and stays in the Map for the life of the process
  • No TTL-based cleanup exists
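
A TTL-based sweep for this Map could look roughly like the sketch below. The entry wrapper, the registeredAtMs field, and the function name are hypothetical; I have not checked how agent-events.ts stores contexts today, so treat this as the shape of the idea rather than a patch.

```typescript
// Hypothetical wrapper: the real runContextById value type may differ.
interface RunContextEntry<T> {
  context: T;
  registeredAtMs: number; // assumed timestamp recorded when the context is registered
}

const RUN_CONTEXT_TTL_MS = 30 * 60 * 1000; // 30 min

// Removes entries older than ttlMs and returns the run IDs that were dropped.
function sweepRunContexts<T>(
  contexts: Map<string, RunContextEntry<T>>,
  now: number,
  ttlMs: number = RUN_CONTEXT_TTL_MS,
): string[] {
  const expired: string[] = [];
  contexts.forEach((entry, runId) => {
    if (now - entry.registeredAtMs >= ttlMs) expired.push(runId);
  });
  for (const runId of expired) contexts.delete(runId);
  return expired;
}
```

One caveat with a registration-time TTL: a run that legitimately lasts longer than the TTL would lose its context while still active. Refreshing the timestamp on each event for that run would avoid this, at the cost of one extra write per event.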

3. Secondary: pendingLifecycleErrorByRunId Map (subagent-registry.ts:254-261)

  • Has a 15-second retry timer per entry, but no absolute TTL
  • If the lifecycle error event never arrives, entries can accumulate
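
An absolute deadline on top of the per-entry retry timer might be sketched as follows. The entry shape (firstSeenAtMs, retryTimer), the finalize callback, and the function name are assumptions for illustration, not the existing registry API.

```typescript
// Hypothetical entry shape; the real pendingLifecycleErrorByRunId values may differ.
interface PendingLifecycleError {
  firstSeenAtMs: number; // assumed: when the entry was first created
  retryTimer?: ReturnType<typeof setTimeout>;
}

const PENDING_ERROR_MAX_AGE_MS = 5 * 60 * 1000; // 5 min

// Force-finalizes entries older than maxAgeMs, regardless of retry state.
// Returns the number of entries that were finalized.
function forceFinalizeStale(
  pending: Map<string, PendingLifecycleError>,
  now: number,
  finalize: (runId: string) => void,
  maxAgeMs: number = PENDING_ERROR_MAX_AGE_MS,
): number {
  const stale: string[] = [];
  pending.forEach((entry, runId) => {
    if (now - entry.firstSeenAtMs >= maxAgeMs) stale.push(runId);
  });
  for (const runId of stale) {
    const entry = pending.get(runId);
    // Cancel the outstanding retry so it cannot fire after removal.
    if (entry && entry.retryTimer !== undefined) clearTimeout(entry.retryTimer);
    pending.delete(runId);
    finalize(runId);
  }
  return stale.length;
}
```

Calling this from the same 60s interval that drives sweepSubagentRuns() would avoid adding another timer.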

Already-working cleanup (for reference)

  • chatRunState.abortedRuns — 1 hour TTL ✅
  • agentRunSeq — prunes when >10,000 entries ✅
  • toolEventRecipients — 10 min TTL + prune ✅

Steps to reproduce

  1. Run a batch workload creating 1000+ agent sessions via Gateway RPC (e.g., using spawnMode: "session")
  2. Monitor Gateway memory: ps -o rss= -p $(pgrep -f openclaw-gateway) | awk '{print $1/1024 "MB"}'
  3. Memory grows linearly with session count, never reclaimed
  4. Gateway OOM crashes around 8GB
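
To check the "grows linearly" observation, the RSS readings from step 2 can be paired with the session count at the time of each reading and reduced to a per-session growth figure. This helper is a rough sketch; the MemorySample shape and function name are made up for illustration.

```typescript
// One reading from step 2, paired with the session count at that moment.
interface MemorySample {
  sessions: number;
  rssMb: number;
}

// Average RSS growth per session between the first and last sample.
function growthPerSessionMb(samples: MemorySample[]): number {
  if (samples.length < 2) throw new Error("need at least two samples");
  const first = samples[0];
  const last = samples[samples.length - 1];
  const deltaSessions = last.sessions - first.sessions;
  if (deltaSessions <= 0) throw new Error("session count must increase between samples");
  return (last.rssMb - first.rssMb) / deltaSessions;
}
```

If the figure stays roughly constant across different sample windows and does not drop after sessions complete, that is consistent with per-session state being retained.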

Expected behavior

Gateway memory should stabilize after sessions complete — completed session-mode runs should be cleaned up by the existing sweep timer.

Actual behavior

subagentRuns Map grows indefinitely for session-mode spawns. Gateway eventually OOM crashes.

Proposed fix

  1. Add absolute TTL to sweepSubagentRuns() for session-mode runs (no archiveAtMs) — clean up 5 min after completion
  2. Add TTL-based sweep for runContextById — clean up entries older than 30 min
  3. Add absolute TTL for pendingLifecycleErrorByRunId — force-finalize after 5 min
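
For item 1, the change to the sweep could be as small as a fallback branch for entries without archiveAtMs. The sketch below assumes each entry records a completion timestamp (called endedAtMs here, which is a hypothetical name) and takes the Map as a parameter to stay self-contained. The real sweepSubagentRuns() likely does more than this.

```typescript
// Hypothetical entry shape; only archiveAtMs is taken from the analysis above.
interface SubagentRunEntry {
  endedAtMs?: number;   // assumed: set when the run completes
  archiveAtMs?: number; // undefined for session-mode runs
}

const SESSION_RUN_TTL_MS = 5 * 60 * 1000; // 5 min after completion

function isExpired(entry: SubagentRunEntry, now: number, sessionTtlMs: number): boolean {
  // Existing behavior: honor the archive deadline when there is one.
  if (entry.archiveAtMs !== undefined) return entry.archiveAtMs <= now;
  // New fallback: no deadline, so expire a fixed time after completion.
  if (entry.endedAtMs === undefined) return false; // still running, keep it
  return now - entry.endedAtMs >= sessionTtlMs;
}

function sweepWithSessionTtl(
  runs: Map<string, SubagentRunEntry>,
  now: number,
  sessionTtlMs: number = SESSION_RUN_TTL_MS,
): number {
  const expired: string[] = [];
  runs.forEach((entry, runId) => {
    if (isExpired(entry, now, sessionTtlMs)) expired.push(runId);
  });
  for (const runId of expired) runs.delete(runId);
  return expired.length;
}
```

Running session-mode entries are left alone because they have no completion timestamp yet, so the TTL only starts counting once a run has finished.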

OpenClaw version

2026.3.13+

Operating system

Linux (production server)

Impact and severity

High — Gateway process crashes under batch workloads, affecting all connected clients.
