[Bug]: Agent stall detector hard-coded 120s threshold kills legitimate long model calls on local vLLM #85826

## Problem

The gateway's agent stall detector uses a fixed threshold of roughly 120s and exposes no configuration option to change it. When a local vLLM model call runs longer than that (large context windows, heavy tool-use chains), the session is classified as `stalled_agent_run`. This eventually triggers an `EmbeddedAttemptSessionTakeoverError`, which terminates the session with `status: failed`.

This causes:
1. User-facing messages are never delivered (the session dies mid-turn)
2. Subagent completions are lost (the announce retry limit is exhausted while the session is stalled)
3. No graceful recovery: the session ends in the `failed` state and requires manual intervention

## Version

OpenClaw 2026.5.20 (e510042)

## Reproduction

1. Local vLLM serving Qwen3.6-27B-FP8 on ARM64 (DGX Spark, NVIDIA GB10)
2. Large context window session (~200+ messages, ~90K+ tokens)
3. The agent enters an extended tool-use chain, or a single model call takes >120s
4. Gateway logs (abbreviated):
```
[diagnostic] long-running session: age=127s activeWorkKind=model_call
[diagnostic] stalled session: age=183s reason=active_work_without_progress classification=stalled_agent_run
[diagnostic] lane task error: EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released
[warn] Subagent announce give up (retry-limit) — gateway request timeout for agent
```
5. The session status becomes `failed` and the user sees no response in their channel

## Gateway Logs (anonymized)

```
21:52:58 [diagnostic] long-running session: sessionId=REDACTED sessionKey=agent:main:discord:channel:REDACTED state=processing age=127s queueDepth=1 reason=queued_behind_active_work classification=long_running activeWorkKind=model_call lastProgress=model_call:started lastProgressAge=11s recovery=none

21:59:58 [diagnostic] stalled session: age=183s reason=active_work_without_progress classification=stalled_agent_run activeWorkKind=model_call lastProgress=model_call:started lastProgressAge=142s recovery=none

22:00:28 [diagnostic] stalled session: age=213s recovery=none

22:00:58 [diagnostic] stalled session: age=243s recovery=none

22:01:26 [diagnostic] lane task error: lane=session:agent:main:discord:channel:REDACTED durationMs=272885 error="EmbeddedAttemptSessionTakeoverError: session file changed while embedded prompt lock was released: REDACTED.jsonl"

22:02:57 [warn] Subagent announce give up (retry-limit) run=REDACTED child=agent:main:subagent:REDACTED retries=3 endedAgo=366s deliveryError="gateway request timeout for agent; direct-primary: gateway request timeout for agent"
```
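For anyone triaging similar logs, here is a small sketch that pulls the `key=value` fields out of a `[diagnostic]` line so that `age` and `lastProgressAge` can be compared across entries. The field names come from the logs above; `parseDiagnosticLine` and `seconds` are illustrative helpers written for this report, not part of OpenClaw.

```typescript
// Illustrative helper (not an OpenClaw API): parse key=value fields from a gateway log line.
type DiagnosticFields = Record<string, string>;

function parseDiagnosticLine(line: string): DiagnosticFields {
  const fields: DiagnosticFields = {};
  // A value is either a double-quoted string or a run of non-space characters.
  const pattern = /(\w+)=("(?:[^"\\]|\\.)*"|\S+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    const raw = match[2];
    // Strip the surrounding quotes from quoted values such as error="...".
    fields[match[1]] = raw.charAt(0) === '"' ? raw.slice(1, -1) : raw;
  }
  return fields;
}

// Convert a duration like "142s" to 142; returns undefined if absent or malformed.
function seconds(value: string | undefined): number | undefined {
  if (value === undefined) return undefined;
  const m = /^(\d+)s$/.exec(value);
  return m ? Number(m[1]) : undefined;
}

const stalled = parseDiagnosticLine(
  "21:59:58 [diagnostic] stalled session: age=183s reason=active_work_without_progress " +
    "classification=stalled_agent_run activeWorkKind=model_call " +
    "lastProgress=model_call:started lastProgressAge=142s recovery=none"
);

// prints: stalled_agent_run 183 142
console.log(stalled.classification, seconds(stalled.age), seconds(stalled.lastProgressAge));
```

Run over the full log, this makes it easy to see that `lastProgress` stays at `model_call:started` for the whole stall window, i.e. the model call was still in flight.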

## Root Cause

The stall detector's thresholds are hard-coded and cannot be tuned via `openclaw.json`: a session is flagged `long_running` at roughly 120s and is later reclassified as `stalled_agent_run` (at `age=183s`, `lastProgressAge=142s` in the logs above). For local vLLM deployments with large context windows (262K max tokens, 27B-parameter models), model calls exceeding 120s are normal, especially with complex tool-use chains and heavy reasoning.

The `config.schema` contains no `stallDetector`, `modelCallTimeout`, `sessionTimeout`, or similar configurable field. The only `stall`-related config fields are cosmetic emoji status reactions (`messages.statusReactions.emojis.stallSoft`), not the detection logic itself.

Additionally, once a session is classified as stalled, the `EmbeddedAttemptSessionTakeoverError` destroys it instead of waiting for the model call to complete or sending the user a "still working" status message.
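I have not read the detector source, so the following is only a guess at its shape, reconstructed from the two log entries above (`age=127s`, `lastProgressAge=11s` gives `long_running`; `age=183s`, `lastProgressAge=142s` gives `stalled_agent_run`). The single 120s constant, the `classify` function, and the `healthy` label are all assumptions made for illustration.

```typescript
// HYPOTHETICAL reconstruction from the logs, not OpenClaw's actual code.
type Classification = "healthy" | "long_running" | "stalled_agent_run";

// ASSUMPTION: one fixed 120s threshold, inferred from the observed log ages.
const ASSUMED_THRESHOLD_S = 120;

function classify(
  ageS: number,
  lastProgressAgeS: number,
  thresholdS: number = ASSUMED_THRESHOLD_S
): Classification {
  // No progress event for longer than the threshold: treated as stalled.
  if (lastProgressAgeS >= thresholdS) return "stalled_agent_run";
  // Old session that is still reporting progress: only flagged as long-running.
  if (ageS >= thresholdS) return "long_running";
  return "healthy";
}

console.log(classify(127, 11));       // long_running
console.log(classify(183, 142));      // stalled_agent_run
console.log(classify(183, 142, 600)); // healthy
```

If the real logic is anywhere close to this, the problem is that a single slow model call emits no progress events between `model_call:started` and completion, so `lastProgressAge` crosses the threshold even though work is ongoing. The last call shows that the same session would not be flagged if the threshold could be raised.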

## Expected Behavior

1. **Configurable thresholds:** Allow tuning stall detection thresholds per-provider or per-session via config (e.g., `gateway.agentStallTimeoutMs` or `agents.defaults.modelCallTimeoutMs`)
2. **Graceful degradation:** Instead of killing the session, show an intermediate status to the user ("still processing...")
3. **Recovery after stall:** If the model call completes before the total timeout, resume normally instead of forcing `EmbeddedAttemptSessionTakeoverError`
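To make the request concrete, here is a minimal sketch of how the three points could fit together. None of this exists today: `gateway.agentStallTimeoutMs` and `agents.defaults.modelCallTimeoutMs` are the field names proposed above, and treating the first as a soft "stall" limit and the second as a hard total limit is a choice made for this sketch. `resolveTimeouts`, `decide`, and the action names are hypothetical.

```typescript
// HYPOTHETICAL config shape using the field names proposed in this issue.
interface StallConfig {
  gateway?: { agentStallTimeoutMs?: number };
  agents?: { defaults?: { modelCallTimeoutMs?: number } };
}

// ASSUMPTION: the current fixed value is about 120s, as observed in the logs.
const DEFAULT_STALL_TIMEOUT_MS = 120000;

type StallAction = "continue" | "notify_still_working" | "fail";

function resolveTimeouts(config: StallConfig): { stallMs: number; totalMs: number } {
  const stallMs = config.gateway?.agentStallTimeoutMs ?? DEFAULT_STALL_TIMEOUT_MS;
  // The hard limit falls back to the soft limit and is never allowed to be smaller than it.
  const totalMs = Math.max(config.agents?.defaults?.modelCallTimeoutMs ?? stallMs, stallMs);
  return { stallMs, totalMs };
}

function decide(msSinceProgress: number, config: StallConfig): StallAction {
  const { stallMs, totalMs } = resolveTimeouts(config);
  if (msSinceProgress < stallMs) return "continue";
  // Past the soft limit but within the total budget: tell the user, keep waiting.
  if (msSinceProgress < totalMs) return "notify_still_working";
  return "fail";
}

const tuned: StallConfig = {
  gateway: { agentStallTimeoutMs: 300000 },
  agents: { defaults: { modelCallTimeoutMs: 900000 } },
};

console.log(decide(142000, {}));    // fail (matches today's behavior)
console.log(decide(142000, tuned)); // continue
console.log(decide(400000, tuned)); // notify_still_working
```

With an empty config the sketch behaves like the current detector, so the change would be backward compatible. A per-provider override could be layered on top of the same resolution step.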

## Impact

Users with local vLLM deployments (especially large models, large context windows) experience silent message loss and session failures. This affects Discord, WhatsApp, and any channel where the user expects a response.

## Environment

- Host: ARM64 (DGX Spark, NVIDIA GB10)
- Model: vLLM serving Qwen3.6-27B-FP8 (--gpu-memory-utilization 0.6, --max-model-len 262144)
- Context: ~90K+ tokens, 200+ messages
- Channels: Discord + WhatsApp
