Liveness-based turn timeouts and progress notifications during long agent turns #77626

@willificent

Summary

The current timeoutSeconds applies as a single wall-clock budget for the entire agent turn — from message receipt to final reply. A tool-heavy turn might chain 4–6 LLM calls plus multiple tool executions. If any step runs slow, the whole turn fails silently. The user sees tool-call summaries appearing in chat (which is great), then suddenly "assistant turn failed" with no explanation.

This issue proposes an architectural shift: liveness-based per-step timeouts instead of one total-turn clock, plus progress notifications and graceful termination reporting. Each proposed change includes concrete code snippets written against the current codebase.

Current Behavior

  1. A user sends a message.
  2. The agent begins a turn that may involve multiple LLM calls and tool executions (web fetches, shell commands, memory searches, etc.).
  3. Tool-call summaries appear in chat as each step completes — this is genuinely excellent UX.
  4. If the cumulative wall-clock time exceeds timeoutSeconds, the entire turn is killed.
  5. The user sees a generic "assistant turn failed" with no indication of what was happening or why.

Example scenario (Telegram): An agent turn involves a web search → page fetch → shell command → second LLM call → memory lookup → final response. Each step completes in 10–15 seconds, but the total exceeds the configured timeout. The user watched 4 tool summaries scroll by, then gets a silent failure. They don't know if the agent crashed, the network dropped, or the timeout fired.

The Timeout Chain — How It Works Today

chat.send / agent RPC
  → resolveAgentTimeoutMs(cfg, overrideMs/overrideSeconds)  [src/agents/timeout.ts]
  → registerChatAbortController({ timeoutMs, expiresAtMs })  [src/gateway/chat-abort.ts]
  → AbortController.signal passed to embeddedRun()  [src/agents/pi-embedded-runner/run.ts]
  → Maintenance timer sweep: if Date.now() > entry.expiresAtMs → abortChatRunById()  [src/gateway/server-maintenance.ts]

Key finding: The turn timeout is a single wall-clock deadline set at turn start. It's enforced by a periodic maintenance sweep that checks expiresAtMs and aborts the run's AbortController if expired. There is no per-step reset and no liveness detection — the timer counts from turn start regardless of whether the agent is actively making progress.
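
To make the failure mode concrete, here is a minimal sketch of the model described above. It is a simplified stand-in, not the gateway code: RunEntry, registerRun, and sweep are hypothetical names, and only expiresAtMs and the sweep-then-abort pattern come from the description.

```typescript
// Hypothetical, simplified model of the current behaviour (not the real gateway code).
interface RunEntry {
  controller: AbortController;
  expiresAtMs: number; // fixed at turn start and never updated afterwards
}

const runs = new Map<string, RunEntry>();

function registerRun(runId: string, timeoutMs: number, now: number): RunEntry {
  const entry: RunEntry = {
    controller: new AbortController(),
    expiresAtMs: now + timeoutMs,
  };
  runs.set(runId, entry);
  return entry;
}

// Periodic sweep: abort every run whose fixed deadline has passed.
function sweep(now: number): string[] {
  const aborted: string[] = [];
  for (const [runId, entry] of runs) {
    if (now > entry.expiresAtMs) {
      entry.controller.abort();
      runs.delete(runId);
      aborted.push(runId);
    }
  }
  return aborted;
}
```

With a 60s budget, a run that completes a 15s step at 15s, 30s, 45s, and 60s is still aborted by the first sweep after the 60s mark, because nothing ever moves expiresAtMs.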

Files Involved

| File | Role |
| --- | --- |
| src/agents/timeout.ts | Resolves timeoutMs from config (agents.defaults.timeoutSeconds, default 48h) |
| src/gateway/chat-abort.ts | Registers AbortController + expiresAtMs, provides abortChatRunById() |
| src/gateway/server-maintenance.ts | Periodic sweep: aborts expired runs at line ~112 |
| src/gateway/server-methods/chat.ts | chat.send handler: calls registerChatAbortController, passes signal to dispatchInboundMessage |
| src/gateway/server-methods/agent.ts | agent RPC handler: same pattern, lines ~1268-1279 |
| src/agents/pi-embedded-runner/run.ts | The embedded run loop: receives abortSignal, relays to per-attempt controllers |
| src/agents/pi-embedded-runner/run/llm-idle-timeout.ts | LLM streaming idle watchdog (120s default for cloud providers, disabled for local) |
| src/agents/pi-embedded-runner/run/idle-timeout-breaker.ts | Circuit breaker for consecutive idle timeouts (cap: 5) |
| src/config/agent-timeout-defaults.ts | DEFAULT_LLM_IDLE_TIMEOUT_SECONDS = 120 |
| src/config/types.agent-defaults.ts | Config schema: agents.defaults.timeoutSeconds (line 346), heartbeat timeoutSeconds (line 390) |

What Exists vs What's Missing

Exists:

  • LLM idle timeout (per-stream, 120s) — detects when a single LLM streaming call goes silent
  • Idle-timeout circuit breaker — caps consecutive idle timeouts at 5 before failing over
  • Overall turn timeout — wall-clock deadline, default 48h, enforced by maintenance sweep

Missing:

  • Per-step liveness reset — the overall deadline never resets even though the agent is making progress (completing tool calls, receiving LLM responses)
  • Graceful timeout message — when the maintenance sweep aborts a run, it broadcasts "aborted" with stopReason: "timeout", but no human-readable message explaining what happened or what the user should do
  • Progress notifications — for long-running steps (>30s), no "still working" message is emitted to the channel

Related Issues

These issues are adjacent but propose different solutions or address different symptoms:


Proposed Changes

Change A: Per-Step Timeout Reset (Liveness-Based)

Concept: Instead of one fixed expiresAtMs set at turn start, reset the deadline each time the agent completes a "step" (LLM response received, tool execution completed). Keep a hard ceiling (maxTurnTimeSeconds) as a separate absolute deadline. Add a new stepTimeoutSeconds config key (default 0 = disabled) for backward compatibility.

A1. Add stepTimeoutSeconds and maxTurnTimeSeconds to config schema

File: src/config/types.agent-defaults.ts

After line 346 (timeoutSeconds?: number;), add:

  /**
   * Per-step liveness timeout in seconds. Each time the agent completes a step
   * (LLM response received or tool execution completed), this countdown resets.
   * If no step completes within this window, the run is aborted.
   * Default: 0 (disabled, fall back to wall-clock only). Set to 300 for 5-minute per-step window.
   */
  stepTimeoutSeconds?: number;

  /**
   * Hard wall-clock ceiling in seconds for a single turn, regardless of liveness.
   * This is the absolute maximum — even if the agent keeps making progress, the
   * turn is aborted after this duration. Default: same as timeoutSeconds (48h).
   */
  maxTurnTimeSeconds?: number;

A2. Add step-timeout resolvers

File: src/agents/timeout.ts

Add after resolveAgentTimeoutMs:

export function resolveStepTimeoutMs(opts: {
  cfg?: OpenClawConfig;
  // Accepted for parity with resolveAgentTimeoutMs; not consulted yet.
  overrideSeconds?: number | null;
}): number {
  const raw = normalizeNumber(opts.cfg?.agents?.defaults?.stepTimeoutSeconds);
  // Unset, zero, or negative: disabled (preserves current behavior)
  if (raw === undefined || raw <= 0) return 0;
  // Convert seconds to ms first, then clamp against the ms ceiling
  return Math.min(Math.max(raw, 1) * 1000, MAX_SAFE_TIMEOUT_MS);
}

export function resolveMaxTurnTimeMs(opts: {
  cfg?: OpenClawConfig;
  overrideSeconds?: number | null;
}): number {
  const raw = normalizeNumber(opts.cfg?.agents?.defaults?.maxTurnTimeSeconds);
  if (raw !== undefined) {
    if (raw <= 0) return MAX_SAFE_TIMEOUT_MS; // no ceiling
    // Convert seconds to ms first, then clamp against the ms ceiling
    return Math.min(raw * 1000, MAX_SAFE_TIMEOUT_MS);
  }
  // Default: same as the overall timeout
  return resolveAgentTimeoutMs({ cfg: opts.cfg });
}
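
The unit handling is the easiest part to get wrong, so here is a self-contained sketch of the seconds-to-milliseconds clamp. MAX_SAFE_TIMEOUT_MS is given an assumed value purely for illustration, and stepTimeoutMsFromSeconds is a hypothetical helper rather than an existing function.

```typescript
// Assumed ceiling for illustration only; the real constant lives in src/agents/timeout.ts.
const MAX_SAFE_TIMEOUT_MS = 2_147_000_000;

// Hypothetical helper: maps a configured seconds value to a step timeout in ms.
function stepTimeoutMsFromSeconds(seconds: number | undefined): number {
  // Unset, non-finite, zero, or negative all mean "disabled".
  if (seconds === undefined || !Number.isFinite(seconds) || seconds <= 0) return 0;
  // Enforce a 1s floor, convert to ms, then clamp against the ms ceiling.
  return Math.min(Math.max(seconds, 1) * 1000, MAX_SAFE_TIMEOUT_MS);
}

stepTimeoutMsFromSeconds(300);       // 300000
stepTimeoutMsFromSeconds(0);         // 0 (disabled)
stepTimeoutMsFromSeconds(0.5);       // 1000 (1s floor)
stepTimeoutMsFromSeconds(1e12);      // 2147000000 (clamped)
```

Clamping after the multiplication keeps both sides of Math.min in the same unit, so the result can never exceed the ceiling.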

A3. Extend ChatAbortControllerEntry with step-based deadline

File: src/gateway/chat-abort.ts

Add to ChatAbortControllerEntry type:

  /**
   * Per-step liveness deadline. Reset on each step completion.
   * If Date.now() exceeds this and stepTimeoutMs > 0, the run is aborted.
   */
  stepExpiresAtMs?: number;

  /** Configured per-step timeout in ms (0 = disabled). */
  stepTimeoutMs?: number;

  /** Absolute hard ceiling for the turn (ms epoch). */
  hardCeilingAtMs?: number;

  /** Description of last completed step, for timeout messages. */
  lastStepDescription?: string;

Add reset function:

export function resetStepTimeout(
  chatAbortControllers: Map<string, ChatAbortControllerEntry>,
  runId: string,
  stepDescription?: string,
): void {
  const entry = chatAbortControllers.get(runId);
  if (!entry || !entry.stepTimeoutMs) return;
  entry.stepExpiresAtMs = Date.now() + entry.stepTimeoutMs;
  if (stepDescription) {
    entry.lastStepDescription = stepDescription;
  }
}
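
As a rough illustration of how the reset behaves over a turn, the sketch below mirrors resetStepTimeout with an injected clock so the arithmetic is easy to follow. StepEntry and resetStep are hypothetical, trimmed-down stand-ins for ChatAbortControllerEntry and the function above.

```typescript
// Hypothetical, trimmed-down entry holding only the step-related fields.
interface StepEntry {
  stepTimeoutMs?: number;
  stepExpiresAtMs?: number;
  lastStepDescription?: string;
}

// Same logic as the proposed resetStepTimeout, with `now` passed in for determinism.
function resetStep(
  entries: Map<string, StepEntry>,
  runId: string,
  now: number,
  description?: string,
): void {
  const entry = entries.get(runId);
  if (!entry || !entry.stepTimeoutMs) return; // unknown run or step timeout disabled
  entry.stepExpiresAtMs = now + entry.stepTimeoutMs;
  if (description) entry.lastStepDescription = description;
}

const entries = new Map<string, StepEntry>([
  ["run-1", { stepTimeoutMs: 60_000, stepExpiresAtMs: 60_000 }],
]);

resetStep(entries, "run-1", 15_000, "web_search"); // deadline moves to 75000
resetStep(entries, "run-1", 30_000, "web_fetch");  // deadline moves to 90000
```

Each completed step pushes the liveness deadline one full window past the moment of completion, so a run only expires when a single step stalls for longer than stepTimeoutMs.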

A4. Update registerChatAbortController to accept step-timeout params

File: src/gateway/chat-abort.ts

Extend registerChatAbortController params:

  stepTimeoutMs?: number;
  hardCeilingAtMs?: number;

Inside the function, after constructing entry:

  if (stepTimeoutMs && stepTimeoutMs > 0) {
    entry.stepTimeoutMs = stepTimeoutMs;
    entry.stepExpiresAtMs = now + stepTimeoutMs;
  }
  if (hardCeilingAtMs) {
    entry.hardCeilingAtMs = hardCeilingAtMs;
  }

A5. Update maintenance sweep to check step timeout

File: src/gateway/server-maintenance.ts

Replace the existing abort check (around line 112):

    for (const [runId, entry] of params.chatAbortControllers) {
      // Check per-step liveness timeout
      const stepExpired = entry.stepExpiresAtMs && now > entry.stepExpiresAtMs;
      // Check hard ceiling
      const ceilingExpired = entry.hardCeilingAtMs && now > entry.hardCeilingAtMs;
      // Check original wall-clock expiry
      const wallExpired = now > entry.expiresAtMs;

      if (!stepExpired && !ceilingExpired && !wallExpired) {
        continue;
      }

      const stopReason = ceilingExpired ? "max_turn_time" : stepExpired ? "step_timeout" : "timeout";
      abortChatRunById(
        {
          chatAbortControllers: params.chatAbortControllers,
          chatRunBuffers: params.chatRunBuffers,
          chatDeltaSentAt: params.chatDeltaSentAt,
          chatDeltaLastBroadcastLen: params.chatDeltaLastBroadcastLen,
          chatAbortedRuns: params.chatRunState.abortedRuns,
          removeChatRun: params.removeChatRun,
          agentRunSeq: params.agentRunSeq,
          broadcast: params.broadcast,
          nodeSendToSession: params.nodeSendToSession,
        },
        { runId, sessionKey: entry.sessionKey, stopReason },
      );
    }
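
The decision logic in the sweep can be isolated as a pure function, which makes the precedence between the three deadlines easy to test. This is a sketch under the assumption that only the three timestamp fields matter; Deadlines and classifyExpiry are hypothetical names.

```typescript
type StopReason = "max_turn_time" | "step_timeout" | "timeout";

// Hypothetical subset of the entry fields that the sweep inspects.
interface Deadlines {
  expiresAtMs: number;
  stepExpiresAtMs?: number;
  hardCeilingAtMs?: number;
}

// Returns the stop reason for an expired run, or null if the run may continue.
function classifyExpiry(entry: Deadlines, now: number): StopReason | null {
  const ceilingExpired = entry.hardCeilingAtMs !== undefined && now > entry.hardCeilingAtMs;
  const stepExpired = entry.stepExpiresAtMs !== undefined && now > entry.stepExpiresAtMs;
  const wallExpired = now > entry.expiresAtMs;

  if (ceilingExpired) return "max_turn_time"; // hard ceiling wins over everything
  if (stepExpired) return "step_timeout";
  if (wallExpired) return "timeout";
  return null;
}
```

Note that the original wall-clock check still applies in this design: step completions extend stepExpiresAtMs but never expiresAtMs. A deployment that sets timeoutSeconds to a few minutes would therefore still see turns end at that mark unless timeoutSeconds is raised alongside stepTimeoutSeconds.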

A6. Wire step-timeout params from chat.send and agent RPC

File: src/gateway/server-methods/chat.ts

After the timeoutMs resolution (around line 1998), add:

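    const stepTimeoutMs = resolveStepTimeoutMs({ cfg });
    // hardCeilingAtMs is an epoch timestamp, so add the resolved duration to the current time
    const hardCeilingAtMs = Date.now() + resolveMaxTurnTimeMs({ cfg });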

Update the registerChatAbortController call (around line 2178) to pass:

      const activeRunAbort = registerChatAbortController({
        chatAbortControllers: context.chatAbortControllers,
        runId: clientRunId,
        sessionId: backingSessionId ?? clientRunId,
        sessionKey: rawSessionKey,
        timeoutMs,
        now,
        stepTimeoutMs,
        hardCeilingAtMs,
        ownerConnId: normalizeOptionalText(client?.connId),
        ownerDeviceId: normalizeOptionalText(client?.connect?.device?.id),
        kind: "chat-send",
      });

File: src/gateway/server-methods/agent.ts — same pattern around line 1272.

A7. Call resetStepTimeout on step boundaries

Step boundaries in the embedded runner are:

  1. LLM response received: when an LLM call's stream has been fully consumed
  2. Tool execution completed — when a tool call finishes

File: src/agents/pi-embedded-runner/run.ts

Add to run params type (around the top of run.ts):

  /** Called each time a step completes (LLM response or tool execution). */
  onStepComplete?: (stepInfo: { type: "llm_response" | "tool_complete"; description?: string }) => void;

At LLM response boundaries (after the stream is consumed, around where aborted is checked ~line 1249), add:

  params.onStepComplete?.({
    type: "llm_response",
    description: attemptResult.stopReason ?? "llm_response",
  });

At tool completion boundaries (after tool results are collected), add:

  params.onStepComplete?.({
    type: "tool_complete",
    description: toolName,
  });

Wire from chat.ts: In the dispatchInboundMessage call path, the abortSignal is available. We need access to chatAbortControllers to call resetStepTimeout. The simplest approach: pass the reset function via the reply options or context.

Add to the chat.send handler, before dispatchInboundMessage:

      const stepResetFn = (stepInfo: { type: string; description?: string }) => {
        resetStepTimeout(context.chatAbortControllers, clientRunId, stepInfo.description);
      };

This requires plumbing stepResetFn through the dispatch pipeline to the embedded runner's onStepComplete. The dispatch pipeline is complex, so the implementation path is:

  1. Add onStepComplete to the reply options or MsgContext
  2. The embedded runner calls it at step boundaries
  3. The chat.send handler provides the implementation that calls resetStepTimeout
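
The three steps above can be sketched end to end with stand-ins for both sides of the pipeline. Everything here is hypothetical scaffolding (fakeEmbeddedRun, handleChatSend, and the injected clock); only the onStepComplete shape comes from the proposal.

```typescript
type StepInfo = { type: "llm_response" | "tool_complete"; description?: string };

interface RunnerParams {
  onStepComplete?: (info: StepInfo) => void;
}

// Stand-in for the embedded runner: reports each step boundary as it passes it.
function fakeEmbeddedRun(params: RunnerParams): void {
  params.onStepComplete?.({ type: "llm_response", description: "tool_use" });
  params.onStepComplete?.({ type: "tool_complete", description: "web_fetch" });
}

// Stand-in for the chat.send handler: owns the deadline state and supplies the callback.
function handleChatSend(stepTimeoutMs: number, clock: () => number) {
  const state = { stepExpiresAtMs: clock() + stepTimeoutMs, lastStep: "none" };
  fakeEmbeddedRun({
    onStepComplete: (info) => {
      // Equivalent of resetStepTimeout: push the liveness deadline forward.
      state.stepExpiresAtMs = clock() + stepTimeoutMs;
      state.lastStep = info.description ?? info.type;
    },
  });
  return state;
}
```

Keeping the runner unaware of chatAbortControllers means it depends only on an optional callback, so callers that do not care about liveness can omit it.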

Change B: Graceful Timeout Message

Concept: When a timeout fires (step, ceiling, or wall-clock), emit a human-readable message to the channel instead of silent abort.

B1. Add timeout message to abortChatRunById broadcast

File: src/gateway/chat-abort.ts

Modify broadcastChatAborted to accept and include a user-facing message:

function broadcastChatAborted(
  ops: ChatAbortOps,
  params: {
    runId: string;
    sessionKey: string;
    stopReason?: string;
    partialText?: string;
    userMessage?: string; // NEW
  },
) {
  const { runId, sessionKey, stopReason, partialText, userMessage } = params;
  const payload = {
    runId,
    sessionKey,
    seq: (ops.agentRunSeq.get(runId) ?? 0) + 1,
    state: "aborted" as const,
    stopReason,
    userMessage, // NEW
    message: partialText
      ? {
          role: "assistant",
          content: [{ type: "text", text: partialText }],
          timestamp: Date.now(),
        }
      : undefined,
  };
  ops.broadcast("chat", payload);
  ops.nodeSendToSession(sessionKey, "chat", payload);
}

B2. Generate user-friendly timeout messages in the maintenance sweep

File: src/gateway/server-maintenance.ts

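Before calling abortChatRunById, construct the message and pass it along as userMessage (abortChatRunById would need to accept the field and forward it to broadcastChatAborted):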

      let userMessage: string | undefined;
      // Assumes the entry records startedAtMs; if it does not, set it at registration
      const elapsedMs = now - entry.startedAtMs;
      const elapsedMin = Math.floor(elapsedMs / 60_000);
      const elapsedSec = Math.floor((elapsedMs % 60_000) / 1000);
      const lastStep = entry.lastStepDescription ?? "unknown";

      if (ceilingExpired) {
        userMessage = `⚠️ Turn hit the hard time limit (${elapsedMin}m ${elapsedSec}s). Last action: ${lastStep}. Try a simpler approach or increase the limit.`;
      } else if (stepExpired) {
        userMessage = `⚠️ No progress for ${Math.floor((entry.stepTimeoutMs ?? 0) / 1000)}s — last action was ${lastStep}. The task may be stuck. Retry or ask for a simpler approach.`;
      } else {
        userMessage = `⚠️ Turn timed out after ${elapsedMin}m ${elapsedSec}s — last action was ${lastStep}. Retry or ask for a simpler approach.`;
      }
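
Pulling the message construction into a pure function makes the wording testable without a running gateway. The sketch below is one possible shape, with simplified wording; buildTimeoutMessage and formatElapsed are hypothetical helpers.

```typescript
type StopReason = "max_turn_time" | "step_timeout" | "timeout";

// Formats a duration in ms as "<minutes>m <seconds>s".
function formatElapsed(elapsedMs: number): string {
  const min = Math.floor(elapsedMs / 60_000);
  const sec = Math.floor((elapsedMs % 60_000) / 1000);
  return `${min}m ${sec}s`;
}

// Hypothetical helper: picks a user-facing message for each stop reason.
function buildTimeoutMessage(
  reason: StopReason,
  elapsedMs: number,
  stepTimeoutMs: number,
  lastStep = "unknown",
): string {
  switch (reason) {
    case "max_turn_time":
      return `Turn hit the hard time limit (${formatElapsed(elapsedMs)}). Last action: ${lastStep}.`;
    case "step_timeout":
      return `No progress for ${Math.floor(stepTimeoutMs / 1000)}s. Last action: ${lastStep}.`;
    case "timeout":
      return `Turn timed out after ${formatElapsed(elapsedMs)}. Last action: ${lastStep}.`;
  }
}

buildTimeoutMessage("timeout", 125_000, 0, "shell");
// "Turn timed out after 2m 5s. Last action: shell."
```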

B3. Deliver the message via the channel

The broadcastChatAborted event goes to connected gateway clients. For Telegram/other channels, we need to ensure the timeout message is delivered as a reply. The existing chat.aborted event is handled by the channel pipeline, but it doesn't currently send a text reply.

File: Wherever the chat.aborted event is consumed for channel delivery (likely in the reply dispatcher or channel adapter), add handling for userMessage:

  if (event.userMessage && event.state === "aborted") {
    // Deliver the user-facing timeout message through the channel
    channelSend(sessionKey, event.userMessage);
  }

The exact file depends on where channel delivery for aborted runs is handled. Search for chat.aborted or state: "aborted" in the channel dispatch code.


Change C: Progress Notifications

Concept: If a step takes >30s, emit a "still working" notification.

C1. Add a long-step detector in the embedded runner

File: src/agents/pi-embedded-runner/run.ts

Add a step timer that fires after 30s of inactivity within a step:

  /** Called when a step has been running for >30s without completing. */
  onStepLongRunning?: (stepInfo: { type: "llm_call" | "tool_execution"; description?: string; elapsedMs: number }) => void;

At the start of each LLM call or tool execution, set a 30s timer:

  let longStepTimer: NodeJS.Timeout | null = null;

  const clearLongStepTimer = () => {
    if (longStepTimer) {
      clearTimeout(longStepTimer);
      longStepTimer = null;
    }
  };

  const startLongStepTimer = (type: "llm_call" | "tool_execution", description?: string) => {
    clearLongStepTimer();
    // Capture the start time per step so elapsedMs measures this step, not the whole run
    const stepStartTime = Date.now();
    longStepTimer = setTimeout(() => {
      longStepTimer = null;
      params.onStepLongRunning?.({
        type,
        description,
        elapsedMs: Date.now() - stepStartTime,
      });
    }, 30_000);
  };

Call startLongStepTimer at:

  • Before each LLM call: startLongStepTimer("llm_call", modelRef)
  • Before each tool execution: startLongStepTimer("tool_execution", toolName)

Call clearLongStepTimer at:

  • After LLM response received
  • After tool execution completed
  • On abort
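
One way to keep the start and clear calls paired is to wrap the timer in a small factory. The sketch below also makes the timer functions injectable so the behavior can be checked without waiting 30 seconds; createLongStepTimer and TimerApi are hypothetical names, not part of the existing runner.

```typescript
type LongStepType = "llm_call" | "tool_execution";

interface LongStepInfo {
  type: LongStepType;
  description?: string;
  elapsedMs: number;
}

// Minimal timer abstraction so tests can substitute a fake clock.
interface TimerApi {
  set: (fn: () => void, ms: number) => unknown;
  clear: (handle: unknown) => void;
  now: () => number;
}

const realTimers: TimerApi = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle as ReturnType<typeof setTimeout>),
  now: () => Date.now(),
};

function createLongStepTimer(
  onLongRunning: (info: LongStepInfo) => void,
  thresholdMs = 30_000,
  timers: TimerApi = realTimers,
) {
  let handle: unknown = null;

  const clear = (): void => {
    if (handle !== null) {
      timers.clear(handle);
      handle = null;
    }
  };

  const start = (type: LongStepType, description?: string): void => {
    clear(); // at most one pending timer per run
    const startedAt = timers.now(); // measured per step, not per run
    handle = timers.set(() => {
      handle = null;
      onLongRunning({ type, description, elapsedMs: timers.now() - startedAt });
    }, thresholdMs);
  };

  return { start, clear };
}
```

At the call sites, placing clear() in a finally block around each LLM call or tool execution would cover the abort and error paths as well as the normal completion path.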

C2. Wire the long-running callback to emit a channel message

In the chat.send handler:

      const stepLongRunningFn = (stepInfo: { type: string; description?: string; elapsedMs: number }) => {
        const desc = stepInfo.description ?? stepInfo.type;
        const elapsed = Math.floor(stepInfo.elapsedMs / 1000);
        // Emit a "still working" message through the channel reply pipeline
        // Use the dispatcher or a direct channel send
        broadcastChatDelta(context.chatRunBuffers, context.chatDeltaSentAt, context.chatDeltaLastBroadcastLen, {
          runId: clientRunId,
          sessionKey: rawSessionKey,
          type: "progress",
          text: `⏳ Still working — ${desc} has been running for ${elapsed}s…`,
        });
      };

Wire stepLongRunningFn through the dispatch pipeline alongside onStepComplete, using the same context/reply-options mechanism.

C3. Add progressNotifyAfterSeconds to config (optional)

File: src/config/types.agent-defaults.ts

After the stepTimeoutSeconds addition:

  /**
   * Seconds of inactivity within a step before emitting a "still working" progress
   * notification to the channel. Default: 30. Set to 0 to disable.
   */
  progressNotifyAfterSeconds?: number;

Impact

Acknowledgment

The existing tool-call summary feature is excellent — seeing each step appear in chat as it happens is genuinely great UX. This proposal builds on that foundation rather than replacing it. The summaries prove the gateway already has visibility into per-step progress; the ask is to extend that visibility into timeout semantics and mid-turn liveness signals.


Recommended Implementation Order

1. Change A (Per-Step Reset) — Highest Value

This is the core architectural fix. Once the step-timeout plumbing is in place, Changes B and C are straightforward additions to the same code paths.

  • Default stepTimeoutSeconds to 0 (disabled) to preserve backward compatibility
  • Add the config keys, resolvers, and ChatAbortControllerEntry extensions (A1–A4)
  • Update the maintenance sweep (A5)
  • Wire from chat.send/agent RPC (A6)
  • Add onStepComplete callback and wire to resetStepTimeout (A7)

2. Change B (Graceful Message) — Quick Win

Once the maintenance sweep has stepExpired/ceilingExpired/wallExpired distinctions, generating user-facing messages is trivial. This is the highest-impact UX improvement per line of code changed.

  • Extend broadcastChatAborted with userMessage (B1)
  • Generate messages in the maintenance sweep (B2)
  • Wire channel delivery for the userMessage field (B3)

3. Change C (Progress Notifications) — Nice-to-Have

This requires more plumbing (the long-step timer, the onStepLongRunning callback, the progressNotifyAfterSeconds config) but fills a real UX gap. Can be deferred to a follow-up PR.

  • Add the long-step timer in the embedded runner (C1)
  • Wire to channel broadcast (C2)
  • Add optional config key (C3)

Open Questions

  1. Per-step timeout semantics: Should the liveness window reset on every streaming token from the LLM, or only on "step boundaries" (tool call → tool output → next LLM call)? Token-level resets might be too granular; step-level might miss slow-streaming LLM responses.
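  2. maxTurnTimeSeconds default: A1 defaults this to timeoutSeconds (48h), which preserves current behavior. A tighter default such as 30 minutes is a guess; some users may want shorter (5 min for interactive chat) or longer (60 min for deep research tasks). Should this be per-channel?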
  3. Progress notification frequency: One pulse per N seconds? One per step? Should it suppress if a tool summary was recently sent?
  4. Migration path: Should there be a config flag to opt into the new semantics, or is the backward-compatible shift acceptable?
  5. stepTimeoutSeconds default: The proposal defaults to 0 (disabled) for backward compat. Should the first release set a non-zero default (e.g., 300s) to give users the improved behavior immediately?

Environment

  • Primary channel: Telegram
  • Typical tool-heavy turn: 4–6 LLM calls + 5–10 tool executions
  • Current timeoutSeconds: varies, but commonly 120–300s
