[Bug]: Agent stall detector hard-coded 120s threshold kills legitimate long model calls on local vLLM #85826

@kiagentkronos-cell

Problem

The gateway's agent stall detector uses a hard-coded threshold of ~120s with no configuration knob. When a local vLLM model call takes longer (large context windows, heavy tool-use chains), the session is first reported as long_running and then classified as stalled_agent_run (at age=127s and age=183s respectively in the logs below). An EmbeddedAttemptSessionTakeoverError eventually follows and terminates the session with status: failed.

This causes:

  1. User-facing messages are never delivered (the session dies mid-turn)
  2. Subagent completions are lost (the announce retry limit is exhausted while the session is stalled)
  3. No graceful recovery: the session ends in a failed state that requires manual intervention

Version

OpenClaw 2026.5.20 (e510042)

Reproduction

  1. Local vLLM serving Qwen3.6-27B-FP8 on ARM64 (DGX Spark, NVIDIA GB10)
  2. Large context window session (~200+ messages, ~90K+ tokens)
  3. Agent enters extended tool-use chain or model call takes >120s
  4. Gateway logs:
[diagnostic] long-running session: age=127s activeWorkKind=model_call
[diagnostic] stalled session: age=183s reason=active_work_without_progress classification=stalled_agent_run
[diagnostic] lane task error: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released
[warn] Subagent announce give up (retry-limit) — gateway request timeout for agent
  5. Session status becomes failed, user sees no response in their channel

Gateway Logs (anonymized)

21:52:58 [diagnostic] long-running session: sessionId=REDACTED sessionKey=agent:main:discord:channel:REDACTED state=processing age=127s queueDepth=1 reason=queued_behind_active_work classification=long_running activeWorkKind=model_call lastProgress=model_call:started lastProgressAge=11s recovery=none

21:59:58 [diagnostic] stalled session: age=183s reason=active_work_without_progress classification=stalled_agent_run activeWorkKind=model_call lastProgress=model_call:started lastProgressAge=142s recovery=none

22:00:28 [diagnostic] stalled session: age=213s recovery=none

22:00:58 [diagnostic] stalled session: age=243s recovery=none

22:01:26 [diagnostic] lane task error: lane=session:agent:main:discord:channel:REDACTED durationMs=272885 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: REDACTED.jsonl"

22:02:57 [warn] Subagent announce give up (retry-limit) run=REDACTED child=agent:main:subagent:REDACTED retries=3 endedAgo=366s deliveryError="gateway request timeout for agent; direct-primary: gateway request timeout for agent"
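
For triage, the key=value fields in these diagnostic lines can be extracted mechanically. The following is a minimal TypeScript sketch that assumes the line format shown above; `parseDiagnosticLine` and `DiagnosticEntry` are hypothetical names for illustration, not part of OpenClaw.

```typescript
// Hypothetical helper (not an OpenClaw API): parse one gateway log line of the
// form "HH:MM:SS [level] kind: key=value key=value ..." as seen in this issue.
interface DiagnosticEntry {
  time: string;
  level: string;
  kind: string;
  fields: Record<string, string>;
}

function parseDiagnosticLine(line: string): DiagnosticEntry | null {
  // Lines without a "kind: " prefix (such as the [warn] line) return null.
  const m = /^(\d{2}:\d{2}:\d{2}) \[(\w+)\] ([^:]+): (.*)$/.exec(line);
  if (!m) return null;
  const [, time, level, kind, rest] = m;

  const fields: Record<string, string> = {};
  // A value is either a double-quoted string or a run of non-space characters.
  const pair = /(\w+)=("[^"]*"|\S+)/g;
  let p: RegExpExecArray | null;
  while ((p = pair.exec(rest)) !== null) {
    fields[p[1]] = p[2].replace(/^"|"$/g, "");
  }
  return { time, level, kind, fields };
}

// Convert a duration field such as "183s" to seconds; NaN if it does not match.
function seconds(value: string | undefined): number {
  const m = /^(\d+)s$/.exec(value ?? "");
  return m ? Number(m[1]) : NaN;
}
```

Applied to the 21:59:58 line, this yields `kind` "stalled session" with `fields.classification` set to "stalled_agent_run", and `seconds(fields.lastProgressAge)` gives 142.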

Root Cause

The stall detector uses hard-coded thresholds (long_running at ~120s, followed by stalled_agent_run) that cannot be tuned via openclaw.json. For local vLLM deployments with large context windows (here --max-model-len 262144 on a 27B-parameter model), model calls exceeding 120s are normal, especially with complex tool-use chains and heavy reasoning.

The config.schema contains no stallDetector, modelCallTimeout, sessionTimeout, or similar configurable field. The only stall-related config fields are cosmetic emoji status reactions (messages.statusReactions.emojis.stallSoft), not the detection logic itself.

Additionally, once stalled, the EmbeddedAttemptSessionTakeoverError destroys the session instead of gracefully waiting for the model call to complete or providing a "still working" status message to the user.

Expected Behavior

  1. Configurable thresholds: Allow tuning stall detection thresholds per-provider or per-session via config (e.g., gateway.agentStallTimeoutMs or agents.defaults.modelCallTimeoutMs)
  2. Graceful degradation: Instead of killing the session, show intermediate status to user ("still processing...")
  3. Recovery after stall: If model call completes before total timeout, resume normally instead of forcing EmbeddedAttemptSessionTakeoverError
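
To make the first request concrete, here is a minimal TypeScript sketch of what configurable thresholds could look like. Everything in it is an assumption for illustration: the field names, the ~120s defaults, and the classification rule (session age for long_running, time since last progress for stalled_agent_run) are inferred from the logs above, not taken from OpenClaw's source or config schema.

```typescript
// Hypothetical threshold shape; these fields do not exist in OpenClaw's
// config schema today and are named only for illustration.
interface StallThresholds {
  longRunningAfterMs: number; // session age before "long_running" is reported
  stalledAfterMs: number;     // time without progress before "stalled_agent_run"
}

type Classification = "ok" | "long_running" | "stalled_agent_run";

// Assumed defaults of about 120s, mirroring the behavior reported in this issue.
const DEFAULT_THRESHOLDS: StallThresholds = {
  longRunningAfterMs: 120_000,
  stalledAfterMs: 120_000,
};

// Merge a per-provider or per-session override over the defaults.
function resolveThresholds(
  defaults: StallThresholds,
  override: Partial<StallThresholds> = {},
): StallThresholds {
  return { ...defaults, ...override };
}

// Assumed rule: stall is judged on time since last progress, long-running on age.
function classifySession(
  ageMs: number,
  lastProgressAgeMs: number,
  t: StallThresholds,
): Classification {
  if (lastProgressAgeMs >= t.stalledAfterMs) return "stalled_agent_run";
  if (ageMs >= t.longRunningAfterMs) return "long_running";
  return "ok";
}

// Example: a local vLLM provider raises the stall threshold to 10 minutes.
const vllmThresholds = resolveThresholds(DEFAULT_THRESHOLDS, {
  stalledAfterMs: 600_000,
});
```

With the assumed defaults, the session state logged at 21:59:58 (age=183s, lastProgressAge=142s) classifies as stalled_agent_run; with the 10-minute override it would remain long_running and the model call could finish.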

Impact

Users with local vLLM deployments (especially large models, large context windows) experience silent message loss and session failures. This affects Discord, WhatsApp, and any channel where the user expects a response.

Environment

  • Host: ARM64 (DGX Spark, NVIDIA GB10)
  • Model: vLLM serving Qwen3.6-27B-FP8 (--gpu-memory-utilization 0.6, --max-model-len 262144)
  • Context: ~90K+ tokens, 200+ messages
  • Channels: Discord + WhatsApp

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
  • impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.