Bug: Preemptive context overflow silently kills embedded sessions without notifying user #84536

@najef1979-code

OpenClaw version: 2026.5.19-beta.1 (ba9034b)
Gateway: running (systemd)
Affected agent(s): agent:marcus:main, agent:jordan:main, agent:neon (multiple agents same day)
Channel: webchat
First seen: 2026-05-19 08:45 UTC
Last seen: 2026-05-20 12:20 UTC


Description

Embedded agent sessions enter a terminal error state (Context overflow: estimated context size exceeds safe threshold during tool loop) during tool loops. The session silently dies with no user notification. The error is logged but never delivered to the user's channel. Sessions remain stuck in a processing state for hours until manually restarted.

Expected: the user immediately sees the error message "Context overflow: prompt too large for the model. Try /reset (or /new) to start a fresh session, or use a larger-context model."
Actual: silence, frozen session, no message delivered, no self-healing.


Steps to Reproduce

  1. Have an embedded agent session engaged in a long tool-heavy task
  2. Session accumulates messages in a tool loop (context at ~50%, well within 200K token window)
  3. Preemptive context overflow check fires — before any model call
  4. Session enters embedded_run_agent_end with error
  5. Error is logged but never delivered to the channel
  6. Session stays frozen in processing state
  7. Stuck-session recovery fires hours later but only releases the lane, no notification
  8. User must manually restart gateway

Expected Behavior

  1. Error message delivered to the user's channel immediately
  2. Session auto-resets or clearly signals the user to reset
  3. No silently dead sessions for hours without notification
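
The expected behavior above can be sketched as a run-end handler that always forwards a terminal error to the channel and moves the session out of the processing state. This is only an illustration: `handleRunEnd`, `deliver`, `setState`, and the state names `idle` and `needs_reset` are hypothetical and are not taken from the OpenClaw source. Only the event shape (`isError`, `error`) comes from the logs in this report.

```javascript
// Hypothetical sketch: none of these function or state names come from OpenClaw.
// `deliver` stands in for whatever sends a message to the user's channel;
// `setState` stands in for whatever updates the session state.
function handleRunEnd(event, { sessionKey, deliver, setState }) {
  if (!event.isError) {
    setState(sessionKey, "idle");
    return { notified: false };
  }

  let notified = true;
  try {
    // Forward the terminal error text to the user instead of only logging it.
    deliver(sessionKey, event.error);
  } catch (deliveryError) {
    // Delivery can fail too; record that rather than leaving the session frozen.
    notified = false;
  }

  // Leave the processing state whether or not delivery succeeded.
  setState(sessionKey, "needs_reset");
  return { notified };
}
```

The point of the sketch is the ordering: notification is attempted first, and the state change happens regardless, so a failed delivery cannot leave the session stuck in processing.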

Actual Behavior

  1. Session ends with embedded_run_agent_end + error
  2. compactionAttempts=0 — preemptive check short-circuits before in-attempt compaction
  3. Error never reaches the channel
  4. Session stays frozen in processing state
  5. Stuck-session recovery fires hours later but only releases the lane
  6. No self-healing

Environment

| Field | Value |
| --- | --- |
| OS | Linux 7.0.0-15-generic (x64) |
| Node.js | v24.15.0 |
| OpenClaw | 2026.5.19-beta.1 (ba9034b) |
| Install location | /usr/local/bin/openclaw |
| Gateway bind | loopback (127.0.0.1:18789) |
| Provider | minimax |
| Model | minimax/MiniMax-M2.7 |
| Fallbacks | yes (MiniMax-M2.5) |
| Compaction mode | safeguard |

Relevant Logs

// PREEMPTIVE OVERFLOW — fires before model call, no token count measured
{"subsystem":"agent/embedded","1":"[context-overflow-diag] sessionKey=agent:marcus:main provider=minimax/MiniMax-M2.7 compactionAttempts=0 observedTokens=unknown error=Context overflow: estimated context size exceeds safe threshold during tool loop."}

// SESSION ENDS
{"subsystem":"agent/embedded","1":{"event":"embedded_run_agent_end","isError":true,"error":"Context overflow: prompt too large for the model. Try /reset (or /new) to start a fresh session, or use a larger-context model."}}

// COMPACTION RUNS BUT SESSION NEVER RESUMES
{"subsystem":"agent/embedded","1":"[compaction] rotated active transcript after compaction (sessionKey=agent:marcus:main)"}

// HOURS LATER — stuck session detected but recovery is insufficient
{"subsystem":"diagnostic","1":"stuck session: ...lastProgressAge=29351s terminalProgressStale=true recovery=checking"}
{"subsystem":"diagnostic","1":"stuck session recovery outcome: status=released action=release_lane ... released=0"}
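
For triage, the diagnostic line above can be picked out of a JSON-lines gateway log with a small parser. The sketch below assumes the log layout shown in this report (message text under the `"1"` key, `key=value` pairs separated by spaces, `error=` as the last field). The function names `parseOverflowDiag` and `looksPreemptive` are made up for this example.

```javascript
// Hypothetical helper: parses one JSON log line and returns the key=value
// fields of a [context-overflow-diag] message, or null for any other line.
function parseOverflowDiag(line) {
  let record;
  try {
    record = JSON.parse(line);
  } catch {
    return null; // not a JSON line (for example a "//" annotation)
  }
  if (record === null || typeof record !== "object") return null;

  const prefix = "[context-overflow-diag] ";
  const message = record["1"];
  if (typeof message !== "string" || !message.startsWith(prefix)) return null;

  const body = message.slice(prefix.length);
  // The error text contains spaces, so split it off before tokenizing.
  const marker = " error=";
  const errorIndex = body.indexOf(marker);
  const head = errorIndex === -1 ? body : body.slice(0, errorIndex);

  const fields = {};
  for (const pair of head.split(" ")) {
    const eq = pair.indexOf("=");
    if (eq > 0) fields[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  if (errorIndex !== -1) fields.error = body.slice(errorIndex + marker.length);
  return fields;
}

// Signature described in this report: no compaction attempted and no
// token count measured before the overflow error was raised.
function looksPreemptive(fields) {
  return fields.compactionAttempts === "0" && fields.observedTokens === "unknown";
}
```

Running each log line through `parseOverflowDiag` and filtering with `looksPreemptive` gives a quick count of how many overflow errors came from the preemptive path rather than from a model rejection.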

Root Cause Analysis

The issue is in the embedded Pi runner's preemptive context overflow handling (selection-BpjGe-Y0.js):

  1. PREEMPTIVE_OVERFLOW_RATIO = 0.9 is hardcoded — no config path, not tunable
  2. For MiniMax-M2.7 (200K token context): maxContextChars = 200,000 × 4 × 0.9 = 720,000 chars
  3. The preemptive check fires during a tool loop before sending to the model, even when actual token count is well within limits
  4. compactionAttempts=0 — preemptive check short-circuits before the in-attempt compaction path is reached
  5. Error is not delivered to the channel
  6. Session never resumes
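
The threshold arithmetic in item 2 can be checked with a short sketch. `PREEMPTIVE_OVERFLOW_RATIO` and the expression itself are the ones quoted in this report. The helper names, the `CHARS_PER_TOKEN_ESTIMATE` constant (a name for the literal `4`), and the `ratio` parameter are assumptions added for illustration. The second helper is an inference from the reported numbers, not a measurement: if the check is purely character-based and the session really held about 100,000 tokens, the transcript would need to average at least 7.2 characters per token to trip it.

```javascript
// Value quoted in this report.
const PREEMPTIVE_OVERFLOW_RATIO = 0.9;
// Assumed name for the literal 4 in the reported expression.
const CHARS_PER_TOKEN_ESTIMATE = 4;

// Same expression as reported, with the ratio exposed as a parameter
// to show what a tunable setting might look like (hypothetical).
function maxContextChars(contextWindowTokens, ratio = PREEMPTIVE_OVERFLOW_RATIO) {
  return Math.floor(contextWindowTokens * CHARS_PER_TOKEN_ESTIMATE * ratio);
}

// Average characters per token a transcript must have for a character-based
// check to reach the limit at a given real token count (hypothetical helper).
function impliedCharsPerToken(contextWindowTokens, actualTokens, ratio = PREEMPTIVE_OVERFLOW_RATIO) {
  return maxContextChars(contextWindowTokens, ratio) / actualTokens;
}

console.log(maxContextChars(200000)); // 720000
console.log(impliedCharsPerToken(200000, 100000)); // 7.2
```

Since the check fired at roughly 50% of the window, it is worth comparing the character estimate against a real token count for one of the affected transcripts.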

Code locations:

  • selection-BpjGe-Y0.js:9325 - PREEMPTIVE_OVERFLOW_RATIO = .9 hardcoded
  • selection-BpjGe-Y0.js:9495 - maxContextChars = Math.floor(contextWindowTokens * 4 * 0.9)
  • selection-BpjGe-Y0.js:9537-9538 - throws error without running compaction
  • pi-embedded-BpxGOwmb.js - in-attempt compaction never reached
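
One possible alternative to the control flow described above is to attempt compaction before throwing, so that `compactionAttempts` is never 0 when the error surfaces. This is a minimal sketch, not the runner's code: `guardContext` and its parameters are hypothetical, the comparison operator is a guess, and real compaction would likely be asynchronous.

```javascript
// Hypothetical sketch: compact first, throw only if the estimate is still
// over the limit after the allowed number of compaction attempts.
function guardContext({ estimateChars, maxChars, compact, maxCompactionAttempts = 1 }) {
  let compactionAttempts = 0;

  while (estimateChars() > maxChars) {
    if (compactionAttempts >= maxCompactionAttempts) {
      const error = new Error(
        "Context overflow: estimated context size exceeds safe threshold during tool loop."
      );
      // Attach the attempt count so diagnostics can tell the cases apart.
      error.compactionAttempts = compactionAttempts;
      throw error;
    }
    compact();
    compactionAttempts += 1;
  }

  return { compactionAttempts };
}
```

With this shape, a transcript that shrinks below the limit after one compaction continues the tool loop, and one that does not still fails, but with `compactionAttempts` of at least 1 in the diagnostics.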

Additional Context

Multiple agents hit this on 2026-05-19:

| Time (UTC) | Agent | Channel |
| --- | --- | --- |
| 08:45 | neon:telegram:direct | telegram |
| 17:44 | jordan:main | webchat |
| 18:09 | neon:telegram:direct | telegram |
| 22:52 | marcus:main | webchat |

Key evidence this is NOT actual context exhaustion:

  • compactionAttempts=0 — pre-emptive check fired, not model rejection
  • observedTokens=unknown — no token count was measured
  • Session at ~50% context — well within 200K token window
  • Error fires during tool loop, before model call

Severity

High — agents silently die, no user notification, requires manual restart. No self-healing. Recurring across multiple agents.
