[Bug]: Agent stall detector hard-coded 120s threshold kills legitimate long model calls on local vLLM #85826
Open
Labels
- `P1`: High-priority user-facing bug, regression, or broken workflow.
- `clawsweeper:fix-shape-clear`: ClawSweeper found a clear likely implementation shape for this issue.
- `clawsweeper:needs-live-repro`: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
- `clawsweeper:needs-maintainer-review`: ClawSweeper marked this issue as needing maintainer review before automation.
- `clawsweeper:needs-product-decision`: ClawSweeper marked this issue as needing a product or behavior decision.
- `clawsweeper:no-new-fix-pr`: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- `impact:message-loss`: Channel message delivery can be lost, duplicated, or misrouted.
- `impact:session-state`: Session, memory, transcript, context, or agent state can drift or corrupt.
- `issue-rating: 🐚 platinum hermit`: Good issue quality with a plausible reproduction path needing some confirmation.
Problem
The gateway's agent stall detector fires at ~120s with no configuration knob. When a local vLLM model call takes longer (large context windows, heavy tool-use chains), the session is classified as `stalled_agent_run`, eventually triggering an `EmbeddedAttemptSessionTakeoverError` that terminates the session with `status: failed`.
This causes:
- […] `failed` state requiring manual intervention
Version
OpenClaw 2026.5.20 (e510042)
Reproduction
- […] `failed`, user sees no response in their channel
Gateway Logs (anonymized)
Root Cause
The stall detector has a hard-coded threshold (~120s for `long-running`, then `stalled_agent_run`) that cannot be tuned via `openclaw.json`. For local vLLM deployments with large context windows (262K max tokens, 27B parameter models), model calls exceeding 120s are completely normal, especially with complex tool-use chains and heavy reasoning.
The `config.schema` contains no `stallDetector`, `modelCallTimeout`, `sessionTimeout`, or similar configurable field. The only `stall`-related config fields are cosmetic emoji status reactions (`messages.statusReactions.emojis.stallSoft`), not the detection logic itself.
Additionally, once stalled, the `EmbeddedAttemptSessionTakeoverError` destroys the session instead of gracefully waiting for the model call to complete or providing a "still working" status message to the user.
Expected Behavior
- […] `gateway.agentStallTimeoutMs` or `agents.defaults.modelCallTimeoutMs`)
- […] `EmbeddedAttemptSessionTakeoverError`
Impact
Users with local vLLM deployments (especially large models, large context windows) experience silent message loss and session failures. This affects Discord, WhatsApp, and any channel where the user expects a response.
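To make the requested behavior concrete, here is a minimal TypeScript sketch of a stall classifier with tunable thresholds. It is not OpenClaw's actual implementation. The field names `agentStallTimeoutMs` and `modelCallTimeoutMs` are borrowed from the option names proposed in this issue, while `StallConfig`, `resolveStallConfig`, `classifyRun`, the `modelCallInFlight` flag, and the 600s default are hypothetical. The sketch assumes the gateway can tell whether a model call is still in flight, so a slow call is reported as long-running (a "still working" status) rather than taken over.

```typescript
// Hypothetical sketch: none of these names are confirmed OpenClaw APIs.
type RunState = "active" | "long_running" | "stalled_agent_run";

interface StallConfig {
  // Idle time after which a run with no model call in flight counts as stalled.
  agentStallTimeoutMs: number;
  // Hard upper bound for a single in-flight model call.
  modelCallTimeoutMs: number;
}

const DEFAULT_STALL_CONFIG: StallConfig = {
  agentStallTimeoutMs: 120_000, // the ~120s figure reported in this issue
  modelCallTimeoutMs: 600_000, // assumed default, chosen only for illustration
};

// Merge user overrides with the defaults and reject inconsistent values.
function resolveStallConfig(overrides: Partial<StallConfig> = {}): StallConfig {
  const merged: StallConfig = { ...DEFAULT_STALL_CONFIG, ...overrides };
  if (!Number.isFinite(merged.agentStallTimeoutMs) || merged.agentStallTimeoutMs <= 0) {
    throw new RangeError("agentStallTimeoutMs must be a positive number");
  }
  if (
    !Number.isFinite(merged.modelCallTimeoutMs) ||
    merged.modelCallTimeoutMs < merged.agentStallTimeoutMs
  ) {
    throw new RangeError("modelCallTimeoutMs must be >= agentStallTimeoutMs");
  }
  return merged;
}

// Classify a run by elapsed time and whether a model call is still pending.
function classifyRun(
  elapsedMs: number,
  modelCallInFlight: boolean,
  config: StallConfig,
): RunState {
  if (modelCallInFlight) {
    // A pending model call is only treated as stalled past the hard cap.
    if (elapsedMs >= config.modelCallTimeoutMs) return "stalled_agent_run";
    if (elapsedMs >= config.agentStallTimeoutMs) return "long_running";
    return "active";
  }
  return elapsedMs >= config.agentStallTimeoutMs ? "stalled_agent_run" : "active";
}

// Example: a local vLLM deployment that allows model calls up to 15 minutes.
const vllmConfig = resolveStallConfig({ modelCallTimeoutMs: 900_000 });
const state = classifyRun(300_000, true, vllmConfig); // "long_running"
```

With a split like this, a 5-minute model call would surface as `long_running`, which could drive a "still working" message to the channel, and only a call exceeding the configured cap would be handed to the takeover path.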
Environment