Summary
The current timeoutSeconds applies as a single wall-clock budget for the entire agent turn — from message receipt to final reply. A tool-heavy turn might chain 4–6 LLM calls plus multiple tool executions. If any step runs slow, the whole turn fails silently. The user sees tool-call summaries appearing in chat (which is great), then suddenly "assistant turn failed" with no explanation.
This issue proposes an architectural shift: liveness-based per-step timeouts instead of one total-turn clock, plus progress notifications and graceful termination reporting. All proposed changes include concrete code diffs against the current codebase.
Current Behavior
- A user sends a message.
- The agent begins a turn that may involve multiple LLM calls and tool executions (web fetches, shell commands, memory searches, etc.).
- Tool-call summaries appear in chat as each step completes — this is genuinely excellent UX.
- If the cumulative wall-clock time exceeds timeoutSeconds, the entire turn is killed.
- The user sees a generic "assistant turn failed" with no indication of what was happening or why.
Example scenario (Telegram): An agent turn involves a web search → page fetch → shell command → second LLM call → memory lookup → final response. Each step completes in 10–15 seconds, but the total exceeds the configured timeout. The user watched 4 tool summaries scroll by, then gets a silent failure. They don't know if the agent crashed, the network dropped, or the timeout fired.
The Timeout Chain — How It Works Today
chat.send / agent RPC
→ resolveAgentTimeoutMs(cfg, overrideMs/overrideSeconds) [src/agents/timeout.ts]
→ registerChatAbortController({ timeoutMs, expiresAtMs }) [src/gateway/chat-abort.ts]
→ AbortController.signal passed to embeddedRun() [src/agents/pi-embedded-runner/run.ts]
→ Maintenance timer sweep: if Date.now() > entry.expiresAtMs → abortChatRunById() [src/gateway/server-maintenance.ts]
Key finding: The turn timeout is a single wall-clock deadline set at turn start. It's enforced by a periodic maintenance sweep that checks expiresAtMs and aborts the run's AbortController if expired. There is no per-step reset and no liveness detection — the timer counts from turn start regardless of whether the agent is actively making progress.
Files Involved
| File | Role |
|---|---|
| src/agents/timeout.ts | Resolves timeoutMs from config (agents.defaults.timeoutSeconds, default 48h) |
| src/gateway/chat-abort.ts | Registers AbortController + expiresAtMs, provides abortChatRunById() |
| src/gateway/server-maintenance.ts | Periodic sweep: aborts expired runs at line ~112 |
| src/gateway/server-methods/chat.ts | chat.send handler: calls registerChatAbortController, passes signal to dispatchInboundMessage |
| src/gateway/server-methods/agent.ts | agent RPC handler: same pattern, lines ~1268-1279 |
| src/agents/pi-embedded-runner/run.ts | The embedded run loop: receives abortSignal, relays to per-attempt controllers |
| src/agents/pi-embedded-runner/run/llm-idle-timeout.ts | LLM streaming idle watchdog (120s default for cloud providers, disabled for local) |
| src/agents/pi-embedded-runner/run/idle-timeout-breaker.ts | Circuit breaker for consecutive idle timeouts (cap: 5) |
| src/config/agent-timeout-defaults.ts | DEFAULT_LLM_IDLE_TIMEOUT_SECONDS = 120 |
| src/config/types.agent-defaults.ts | Config schema: agents.defaults.timeoutSeconds (line 346), heartbeat timeoutSeconds (line 390) |
What Exists vs What's Missing
Exists:
- LLM idle timeout (per-stream, 120s) — detects when a single LLM streaming call goes silent
- Idle-timeout circuit breaker — caps consecutive idle timeouts at 5 before failing over
- Overall turn timeout — wall-clock deadline, default 48h, enforced by maintenance sweep
Missing:
- Per-step liveness reset: the overall deadline never resets even though the agent is making progress (completing tool calls, receiving LLM responses)
- Graceful timeout message: when the maintenance sweep aborts a run, it broadcasts "aborted" with stopReason: "timeout", but no human-readable message explaining what happened or what the user should do
- Progress notifications: for long-running steps (>30s), no "still working" message is emitted to the channel
Related Issues
These issues are adjacent but propose different solutions or address different symptoms:
- timeoutSeconds doesn't control LLM HTTP timeout
Proposed Changes
Change A: Per-Step Timeout Reset (Liveness-Based)
Concept: Instead of one fixed expiresAtMs set at turn start, reset the deadline each time the agent completes a "step" (LLM response received, tool execution completed). Keep a hard ceiling (maxTurnTimeSeconds) as a separate absolute deadline. Add a new stepTimeoutSeconds config key (default 0 = disabled) for backward compatibility.
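To make the difference concrete, here is a minimal simulation sketch. It is not taken from the codebase: simulateTurn, Budget, and the durations are hypothetical, and it simply models the three deadlines described above (existing wall-clock, per-step window, hard ceiling) with the same precedence the sweep would report.

```typescript
// Sketch only: models the three deadlines as plain numbers (ms since turn start).
type Outcome = "completed" | "timeout" | "step_timeout" | "max_turn_time";

interface Budget {
  timeoutMs: number;     // existing wall-clock budget for the whole turn
  stepTimeoutMs: number; // per-step liveness window, 0 = disabled
  maxTurnTimeMs: number; // hard ceiling; Infinity = none
}

function simulateTurn(stepDurationsMs: number[], budget: Budget): Outcome {
  let now = 0;
  for (const duration of stepDurationsMs) {
    const end = now + duration;
    // The liveness window restarts at the beginning of every step.
    const stepDeadline = budget.stepTimeoutMs > 0 ? now + budget.stepTimeoutMs : Infinity;
    const first = Math.min(budget.timeoutMs, budget.maxTurnTimeMs, stepDeadline);
    if (end > first) {
      if (first === budget.maxTurnTimeMs) return "max_turn_time";
      if (first === stepDeadline) return "step_timeout";
      return "timeout";
    }
    now = end;
  }
  return "completed";
}

// Six steps of 12s each: 72s in total.
const scenario = [12_000, 12_000, 12_000, 12_000, 12_000, 12_000];
const wallClockOnly: Budget = { timeoutMs: 60_000, stepTimeoutMs: 0, maxTurnTimeMs: Infinity };
const liveness: Budget = { timeoutMs: 48 * 3_600_000, stepTimeoutMs: 30_000, maxTurnTimeMs: 30 * 60_000 };

console.log(simulateTurn(scenario, wallClockOnly));    // "timeout"
console.log(simulateTurn(scenario, liveness));         // "completed"
console.log(simulateTurn([12_000, 45_000], liveness)); // "step_timeout"
```

Note that in this model the existing wall-clock timeout still applies alongside the step window, so the liveness behavior only helps when timeoutSeconds is left large.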
A1. Add stepTimeoutSeconds and maxTurnTimeSeconds to config schema
File: src/config/types.agent-defaults.ts
After line 346 (timeoutSeconds?: number;), add:
  /**
   * Per-step liveness timeout in seconds. Each time the agent completes a step
   * (LLM response received or tool execution completed), this countdown resets.
   * If no step completes within this window, the run is aborted.
   * Default: 0 (disabled, fall back to wall-clock only). Set to 300 for 5-minute per-step window.
   */
  stepTimeoutSeconds?: number;
  /**
   * Hard wall-clock ceiling in seconds for a single turn, regardless of liveness.
   * This is the absolute maximum: even if the agent keeps making progress, the
   * turn is aborted after this duration. Default: same as timeoutSeconds (48h).
   */
  maxTurnTimeSeconds?: number;
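For orientation, a hypothetical settings object using the two new keys next to the existing one might look like the following. The object name and the chosen values are illustrative assumptions; only the agents.defaults path and the key names come from the proposal.

```typescript
// Hypothetical example values; not recommended defaults.
const exampleConfig = {
  agents: {
    defaults: {
      timeoutSeconds: 172_800,   // existing wall-clock budget (48h)
      stepTimeoutSeconds: 300,   // abort if no step completes for 5 minutes
      maxTurnTimeSeconds: 1_800, // never let a single turn run past 30 minutes
    },
  },
};
```

With these values a turn that keeps completing steps can run for up to 30 minutes, while a turn that stalls is stopped after 5 minutes of silence.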
A2. Add step-timeout resolvers
File: src/agents/timeout.ts
Add after resolveAgentTimeoutMs:
export function resolveStepTimeoutMs(opts: {
  cfg?: OpenClawConfig;
  // Accepted for parity with resolveAgentTimeoutMs; not consulted yet.
  overrideSeconds?: number | null;
}): number {
  const raw = normalizeNumber(opts.cfg?.agents?.defaults?.stepTimeoutSeconds);
  // Unset, zero, or negative: disabled (preserves current behavior)
  if (raw === undefined || raw <= 0) return 0;
  // Clamp in seconds before converting so the result never exceeds MAX_SAFE_TIMEOUT_MS
  const maxSeconds = Math.floor(MAX_SAFE_TIMEOUT_MS / 1000);
  return Math.min(Math.max(raw, 1), maxSeconds) * 1000;
}
export function resolveMaxTurnTimeMs(opts: {
  cfg?: OpenClawConfig;
  // Accepted for parity with resolveAgentTimeoutMs; not consulted yet.
  overrideSeconds?: number | null;
}): number {
  const raw = normalizeNumber(opts.cfg?.agents?.defaults?.maxTurnTimeSeconds);
  if (raw !== undefined) {
    if (raw <= 0) return MAX_SAFE_TIMEOUT_MS; // no ceiling
    return Math.min(raw, Math.floor(MAX_SAFE_TIMEOUT_MS / 1000)) * 1000;
  }
  // Default: same as the overall timeout
  return resolveAgentTimeoutMs({ cfg: opts.cfg });
}
A3. Extend ChatAbortControllerEntry with step-based deadline
File: src/gateway/chat-abort.ts
Add to ChatAbortControllerEntry type:
  /**
   * Per-step liveness deadline. Reset on each step completion.
   * If Date.now() exceeds this and stepTimeoutMs > 0, the run is aborted.
   */
  stepExpiresAtMs?: number;
  /** Configured per-step timeout in ms (0 = disabled). */
  stepTimeoutMs?: number;
  /** Absolute hard ceiling for the turn (ms epoch). */
  hardCeilingAtMs?: number;
  /** Description of last completed step, for timeout messages. */
  lastStepDescription?: string;
Add reset function:
export function resetStepTimeout(
  chatAbortControllers: Map<string, ChatAbortControllerEntry>,
  runId: string,
  stepDescription?: string,
): void {
  const entry = chatAbortControllers.get(runId);
  // Unknown run, or per-step timeout disabled: nothing to reset
  if (!entry || !entry.stepTimeoutMs) return;
  entry.stepExpiresAtMs = Date.now() + entry.stepTimeoutMs;
  if (stepDescription) {
    entry.lastStepDescription = stepDescription;
  }
}
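The intended effect of the reset can be checked in isolation. The sketch below is a reduced stand-in, not the real module: StepEntry keeps only the fields involved, resetStepDeadline takes the clock as a parameter (a deviation made for testability), and the run ids are made up.

```typescript
// Reduced stand-in for ChatAbortControllerEntry: only the fields this sketch needs.
interface StepEntry {
  stepTimeoutMs?: number;
  stepExpiresAtMs?: number;
  hardCeilingAtMs?: number;
  lastStepDescription?: string;
}

// Same logic as the proposed reset, with `now` injected instead of Date.now().
function resetStepDeadline(
  entries: Map<string, StepEntry>,
  runId: string,
  now: number,
  description?: string,
): void {
  const entry = entries.get(runId);
  if (!entry || !entry.stepTimeoutMs) return; // unknown run, or liveness disabled
  entry.stepExpiresAtMs = now + entry.stepTimeoutMs;
  if (description) entry.lastStepDescription = description;
}

const entries = new Map<string, StepEntry>([
  ["run-live", { stepTimeoutMs: 300_000, stepExpiresAtMs: 300_000, hardCeilingAtMs: 1_800_000 }],
  ["run-legacy", {}], // per-step timeout disabled: no step fields set
]);

// A step finishes two minutes into each turn.
resetStepDeadline(entries, "run-live", 120_000, "web_fetch");
resetStepDeadline(entries, "run-legacy", 120_000, "web_fetch");
```

After these calls the liveness deadline of run-live has moved to 420000 while its hard ceiling is unchanged, and run-legacy is untouched, which is the backward-compatible behavior the proposal relies on.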
A4. Update registerChatAbortController to accept step-timeout params
File: src/gateway/chat-abort.ts
Extend registerChatAbortController params:
stepTimeoutMs?: number;
hardCeilingAtMs?: number;
Inside the function, after constructing entry:
if (stepTimeoutMs && stepTimeoutMs > 0) {
  entry.stepTimeoutMs = stepTimeoutMs;
  entry.stepExpiresAtMs = now + stepTimeoutMs;
}
if (hardCeilingAtMs) {
  entry.hardCeilingAtMs = hardCeilingAtMs;
}
A5. Update maintenance sweep to check step timeout
File: src/gateway/server-maintenance.ts
Replace the existing abort check (around line 112):
for (const [runId, entry] of params.chatAbortControllers) {
  // Per-step liveness timeout (only set when stepTimeoutMs > 0)
  const stepExpired = entry.stepExpiresAtMs !== undefined && now > entry.stepExpiresAtMs;
  // Hard ceiling
  const ceilingExpired = entry.hardCeilingAtMs !== undefined && now > entry.hardCeilingAtMs;
  // Original wall-clock expiry
  const wallExpired = now > entry.expiresAtMs;
  if (!stepExpired && !ceilingExpired && !wallExpired) {
    continue;
  }
  // Precedence: hard ceiling, then step timeout, then the original wall-clock timeout
  const stopReason = ceilingExpired ? "max_turn_time" : stepExpired ? "step_timeout" : "timeout";
  abortChatRunById(
    {
      chatAbortControllers: params.chatAbortControllers,
      chatRunBuffers: params.chatRunBuffers,
      chatDeltaSentAt: params.chatDeltaSentAt,
      chatDeltaLastBroadcastLen: params.chatDeltaLastBroadcastLen,
      chatAbortedRuns: params.chatRunState.abortedRuns,
      removeChatRun: params.removeChatRun,
      agentRunSeq: params.agentRunSeq,
      broadcast: params.broadcast,
      nodeSendToSession: params.nodeSendToSession,
    },
    { runId, sessionKey: entry.sessionKey, stopReason },
  );
}
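The expiry decision itself is a pure function of the entry and the current time, so it could be pulled out and unit-tested separately from the abort plumbing. A minimal sketch, assuming hypothetical names classifyExpiry and DeadlineFields (neither exists in the codebase as far as this issue knows):

```typescript
type StopReason = "max_turn_time" | "step_timeout" | "timeout";

// Only the deadline fields of the entry matter for this decision.
interface DeadlineFields {
  expiresAtMs: number;
  stepExpiresAtMs?: number;
  hardCeilingAtMs?: number;
}

// Returns the stop reason for an expired entry, or null if the run may continue.
// Precedence mirrors the sweep: hard ceiling, then step timeout, then wall clock.
function classifyExpiry(entry: DeadlineFields, now: number): StopReason | null {
  if (entry.hardCeilingAtMs !== undefined && now > entry.hardCeilingAtMs) return "max_turn_time";
  if (entry.stepExpiresAtMs !== undefined && now > entry.stepExpiresAtMs) return "step_timeout";
  if (now > entry.expiresAtMs) return "timeout";
  return null;
}

// A legacy entry (no step fields) behaves exactly as it does today.
console.log(classifyExpiry({ expiresAtMs: 1_000 }, 999));   // null
console.log(classifyExpiry({ expiresAtMs: 1_000 }, 1_001)); // "timeout"
```

Keeping the classification separate also gives the graceful-message change a single place to read the reason from.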
A6. Wire step-timeout params from chat.send and agent RPC
File: src/gateway/server-methods/chat.ts
After the timeoutMs resolution (around line 1998), add:
const stepTimeoutMs = resolveStepTimeoutMs({ cfg });
// A duration, not a timestamp; converted to an absolute deadline at registration
const maxTurnTimeMs = resolveMaxTurnTimeMs({ cfg });
Update the registerChatAbortController call (around line 2178) to pass the step timeout and the ceiling. hardCeilingAtMs is an epoch timestamp, so the resolved duration is added to now:
const activeRunAbort = registerChatAbortController({
  chatAbortControllers: context.chatAbortControllers,
  runId: clientRunId,
  sessionId: backingSessionId ?? clientRunId,
  sessionKey: rawSessionKey,
  timeoutMs,
  now,
  stepTimeoutMs,
  hardCeilingAtMs: now + maxTurnTimeMs,
  ownerConnId: normalizeOptionalText(client?.connId),
  ownerDeviceId: normalizeOptionalText(client?.connect?.device?.id),
  kind: "chat-send",
});
File: src/gateway/server-methods/agent.ts — same pattern around line 1272.
A7. Call resetStepTimeout on step boundaries
Step boundaries in the embedded runner are:
- LLM response received: when an LLM call's stream has been fully consumed
- Tool execution completed: when a tool call finishes
File: src/agents/pi-embedded-runner/run.ts
Add to run params type (around the top of run.ts):
/** Called each time a step completes (LLM response or tool execution). */
onStepComplete?: (stepInfo: { type: "llm_response" | "tool_complete"; description?: string }) => void;
At LLM response boundaries (after the stream is consumed, around where aborted is checked ~line 1249), add:
params.onStepComplete?.({
  type: "llm_response",
  description: attemptResult.stopReason ?? "llm_response",
});
At tool completion boundaries (after tool results are collected), add:
params.onStepComplete?.({
  type: "tool_complete",
  description: toolName,
});
Wire from chat.ts: In the dispatchInboundMessage call path, the abortSignal is available. We need access to chatAbortControllers to call resetStepTimeout. The simplest approach: pass the reset function via the reply options or context.
Add to the chat.send handler, before dispatchInboundMessage:
const stepResetFn = (stepInfo: { type: string; description?: string }) => {
  resetStepTimeout(context.chatAbortControllers, clientRunId, stepInfo.description);
};
This requires plumbing stepResetFn through the dispatch pipeline to the embedded runner's onStepComplete. The dispatch pipeline is complex, so the implementation path is:
- Add onStepComplete to the reply options or MsgContext
- The embedded runner calls it at step boundaries
- The chat.send handler provides the implementation that calls resetStepTimeout
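The three plumbing steps above can be sketched end to end with stand-ins. Everything here is hypothetical scaffolding: runEmbedded replaces the real runner, handleChatSend replaces the real handler, and the clock is injected so the behavior is deterministic. Only the onStepComplete shape follows the proposal.

```typescript
type StepInfo = { type: "llm_response" | "tool_complete"; description?: string };

interface RunParams {
  steps: StepInfo[]; // stand-in for real LLM calls and tool executions
  onStepComplete?: (info: StepInfo) => void;
}

// Stand-in for the embedded runner: reports each finished step to the caller.
function runEmbedded(params: RunParams): void {
  for (const step of params.steps) {
    // ... the real LLM call or tool execution would happen here ...
    params.onStepComplete?.(step);
  }
}

// Stand-in for the chat.send handler: owns the deadline state and supplies the callback.
function handleChatSend(steps: StepInfo[], stepTimeoutMs: number, clock: () => number) {
  const state = {
    stepExpiresAtMs: clock() + stepTimeoutMs,
    lastStepDescription: "none",
    resets: 0,
  };
  runEmbedded({
    steps,
    onStepComplete: (info) => {
      // The runner never sees the deadline; it only signals progress.
      state.stepExpiresAtMs = clock() + stepTimeoutMs;
      state.lastStepDescription = info.description ?? info.type;
      state.resets += 1;
    },
  });
  return state;
}
```

The design point is that the runner stays ignorant of timeouts: it reports progress, and whoever owns the abort controller decides what progress means for the deadline.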
Change B: Graceful Timeout Message
Concept: When a timeout fires (step, ceiling, or wall-clock), emit a human-readable message to the channel instead of silent abort.
B1. Add timeout message to abortChatRunById broadcast
File: src/gateway/chat-abort.ts
Modify broadcastChatAborted to accept and include a user-facing message:
function broadcastChatAborted(
  ops: ChatAbortOps,
  params: {
    runId: string;
    sessionKey: string;
    stopReason?: string;
    partialText?: string;
    userMessage?: string; // NEW
  },
) {
  const { runId, sessionKey, stopReason, partialText, userMessage } = params;
  const payload = {
    runId,
    sessionKey,
    seq: (ops.agentRunSeq.get(runId) ?? 0) + 1,
    state: "aborted" as const,
    stopReason,
    userMessage, // NEW
    message: partialText
      ? {
          role: "assistant",
          content: [{ type: "text", text: partialText }],
          timestamp: Date.now(),
        }
      : undefined,
  };
  ops.broadcast("chat", payload);
  ops.nodeSendToSession(sessionKey, "chat", payload);
}
B2. Generate user-friendly timeout messages in the maintenance sweep
File: src/gateway/server-maintenance.ts
Before calling abortChatRunById, construct the message, then include it as userMessage in the abort params (this assumes abortChatRunById forwards the field to broadcastChatAborted):
let userMessage: string | undefined;
const elapsedMs = now - entry.startedAtMs;
const elapsedMin = Math.floor(elapsedMs / 60_000);
const elapsedSec = Math.floor((elapsedMs % 60_000) / 1000);
const lastStep = entry.lastStepDescription ?? "unknown";
if (ceilingExpired) {
  userMessage = `⚠️ Turn hit the hard time limit (${elapsedMin}m ${elapsedSec}s). Last action: ${lastStep}. Try a simpler approach or increase the limit.`;
} else if (stepExpired) {
  userMessage = `⚠️ No progress for ${Math.floor((entry.stepTimeoutMs ?? 0) / 1000)}s. Last action: ${lastStep}. The task may be stuck. Retry or ask for a simpler approach.`;
} else {
  userMessage = `⚠️ Turn timed out after ${elapsedMin}m ${elapsedSec}s. Last action: ${lastStep}. Retry or ask for a simpler approach.`;
}
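The minutes-and-seconds arithmetic is easy to get subtly wrong, so it may be worth isolating. A small sketch, where formatElapsed is a hypothetical helper name and the clamp for negative input is an added assumption rather than part of the proposal:

```typescript
// Formats a millisecond duration as "Xm Ys", using the same arithmetic as above.
function formatElapsed(elapsedMs: number): string {
  const safeMs = Math.max(0, elapsedMs); // guard against clock skew producing negatives
  const minutes = Math.floor(safeMs / 60_000);
  const seconds = Math.floor((safeMs % 60_000) / 1000);
  return `${minutes}m ${seconds}s`;
}

console.log(formatElapsed(125_000)); // "2m 5s"
console.log(formatElapsed(59_999));  // "0m 59s"
```

Seconds are floored rather than rounded so the message never reports a duration longer than what actually elapsed.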
B3. Deliver the message via the channel
The broadcastChatAborted event goes to connected gateway clients. For Telegram/other channels, we need to ensure the timeout message is delivered as a reply. The existing chat.aborted event is handled by the channel pipeline, but it doesn't currently send a text reply.
File: Wherever the chat.aborted event is consumed for channel delivery (likely in the reply dispatcher or channel adapter), add handling for userMessage:
if (event.userMessage && event.state === "aborted") {
  // Deliver the user-facing timeout message through the channel
  channelSend(sessionKey, event.userMessage);
}
The exact file depends on where channel delivery for aborted runs is handled. Search for chat.aborted or state: "aborted" in the channel dispatch code.
Change C: Progress Notifications
Concept: If a step takes >30s, emit a "still working" notification.
C1. Add a long-step detector in the embedded runner
File: src/agents/pi-embedded-runner/run.ts
Add a step timer that fires after 30s of inactivity within a step:
/** Called when a step has been running for >30s without completing. */
onStepLongRunning?: (stepInfo: { type: "llm_call" | "tool_execution"; description?: string; elapsedMs: number }) => void;
At the start of each LLM call or tool execution, set a 30s timer:
let longStepTimer: NodeJS.Timeout | null = null;
let stepStartTime = 0;
const clearLongStepTimer = () => {
  if (longStepTimer) {
    clearTimeout(longStepTimer);
    longStepTimer = null;
  }
};
const startLongStepTimer = (type: "llm_call" | "tool_execution", description?: string) => {
  clearLongStepTimer();
  // Measure from the start of this step, not from runner setup
  stepStartTime = Date.now();
  longStepTimer = setTimeout(() => {
    params.onStepLongRunning?.({
      type,
      description,
      elapsedMs: Date.now() - stepStartTime,
    });
  }, 30_000);
};
Call startLongStepTimer at:
- Before each LLM call: startLongStepTimer("llm_call", modelRef)
- Before each tool execution: startLongStepTimer("tool_execution", toolName)
Call clearLongStepTimer at:
- After LLM response received
- After tool execution completed
- On abort
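Because the start and clear calls have to be paired correctly at several call sites, a testable variant of the timer may help. The sketch below is an alternative shape, not the proposed diff: createLongStepWatch, Scheduler, and the injected clock are hypothetical, introduced so the behavior can be exercised without waiting 30 real seconds.

```typescript
type StepKind = "llm_call" | "tool_execution";
type LongRunningInfo = { type: StepKind; description?: string; elapsedMs: number };

// A scheduler runs fn after delayMs and returns a cancel function.
type Scheduler = (fn: () => void, delayMs: number) => () => void;

// Default scheduler backed by setTimeout.
const realScheduler: Scheduler = (fn, delayMs) => {
  const handle = setTimeout(fn, delayMs);
  return () => clearTimeout(handle);
};

function createLongStepWatch(
  onLongRunning: (info: LongRunningInfo) => void,
  thresholdMs: number = 30_000,
  schedule: Scheduler = realScheduler,
  now: () => number = () => Date.now(),
) {
  let cancel: (() => void) | null = null;

  const clear = (): void => {
    cancel?.();
    cancel = null;
  };

  const start = (type: StepKind, description?: string): void => {
    clear(); // at most one pending notification at a time
    const startedAt = now(); // measured per step
    cancel = schedule(() => {
      cancel = null;
      onLongRunning({ type, description, elapsedMs: now() - startedAt });
    }, thresholdMs);
  };

  return { start, clear };
}
```

In the runner, start would be called before each LLM call or tool execution and clear in a finally block, so an exception or abort cannot leave a stale timer behind.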
C2. Wire the long-running callback to emit a channel message
In the chat.send handler:
const stepLongRunningFn = (stepInfo: { type: string; description?: string; elapsedMs: number }) => {
  const desc = stepInfo.description ?? stepInfo.type;
  const elapsed = Math.floor(stepInfo.elapsedMs / 1000);
  // Emit a "still working" message through the channel reply pipeline
  // Use the dispatcher or a direct channel send
  broadcastChatDelta(context.chatRunBuffers, context.chatDeltaSentAt, context.chatDeltaLastBroadcastLen, {
    runId: clientRunId,
    sessionKey: rawSessionKey,
    type: "progress",
    text: `⏳ Still working: ${desc} has been running for ${elapsed}s…`,
  });
};
Wire stepLongRunningFn through the dispatch pipeline alongside onStepComplete, using the same context/reply-options mechanism.
C3. Add progressNotifyAfterSeconds to config (optional)
File: src/config/types.agent-defaults.ts
After the stepTimeoutSeconds addition:
  /**
   * Seconds of inactivity within a step before emitting a "still working" progress
   * notification to the channel. Default: 30. Set to 0 to disable.
   */
  progressNotifyAfterSeconds?: number;
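If this key is added, the hard-coded 30s in the long-step timer would need a resolver. A minimal sketch, where resolveProgressNotifyMs and the constant name are hypothetical, and treating negative or non-finite values as shown is an assumption the proposal does not specify:

```typescript
const DEFAULT_PROGRESS_NOTIFY_SECONDS = 30;

// Returns the notification delay in ms, or 0 when notifications are disabled.
function resolveProgressNotifyMs(raw?: number): number {
  // Unset or non-finite: fall back to the documented default of 30s.
  if (raw === undefined || !Number.isFinite(raw)) return DEFAULT_PROGRESS_NOTIFY_SECONDS * 1000;
  // 0 disables; negative values are treated the same way (assumption).
  if (raw <= 0) return 0;
  return Math.round(raw * 1000);
}

console.log(resolveProgressNotifyMs());   // 30000
console.log(resolveProgressNotifyMs(0));  // 0
console.log(resolveProgressNotifyMs(45)); // 45000
```

The runner would then skip starting the long-step timer entirely when the resolved value is 0.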
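Impact
- stepTimeoutSeconds defaults to 0 (disabled), preserving current wall-clock-only behavior. Users opt in by setting it.
- maxTurnTimeSeconds defaults to the existing timeoutSeconds value.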
Acknowledgment
The existing tool-call summary feature is excellent — seeing each step appear in chat as it happens is genuinely great UX. This proposal builds on that foundation rather than replacing it. The summaries prove the gateway already has visibility into per-step progress; the ask is to extend that visibility into timeout semantics and mid-turn liveness signals.
Recommended Implementation Order
1. Change A (Per-Step Reset) — Highest Value
This is the core architectural fix. Once the step-timeout plumbing is in place, Changes B and C are straightforward additions to the same code paths.
- Default stepTimeoutSeconds to 0 (disabled) to preserve backward compatibility
- Add the config keys, resolvers, and ChatAbortControllerEntry extensions (A1–A4)
- Update the maintenance sweep (A5)
- Wire from chat.send/agent RPC (A6)
- Add onStepComplete callback and wire to resetStepTimeout (A7)
2. Change B (Graceful Message) — Quick Win
Once the maintenance sweep has stepExpired/ceilingExpired/wallExpired distinctions, generating user-facing messages is trivial. This is the highest-impact UX improvement per line of code changed.
- Extend broadcastChatAborted with userMessage (B1)
- Generate messages in the maintenance sweep (B2)
- Wire channel delivery for the userMessage field (B3)
3. Change C (Progress Notifications) — Nice-to-Have
This requires more plumbing (the long-step timer, the onStepLongRunning callback, the progressNotifyAfterSeconds config) but fills a real UX gap. Can be deferred to a follow-up PR.
- Add the long-step timer in the embedded runner (C1)
- Wire to channel broadcast (C2)
- Add optional config key (C3)
Open Questions
- Per-step timeout semantics: Should the liveness window reset on every streaming token from the LLM, or only on "step boundaries" (tool call → tool output → next LLM call)? Token-level resets might be too granular; step-level might miss slow-streaming LLM responses.
- maxTurnTimeSeconds default: The snippets above default it to timeoutSeconds (48h); a tighter default such as 30 minutes is a guess. Some users may want shorter (5 min for interactive chat) or longer (60 min for deep research tasks). Should this be per-channel?
- Progress notification frequency: One pulse per N seconds? One per step? Should it suppress if a tool summary was recently sent?
- Migration path: Should there be a config flag to opt into the new semantics, or is the backward-compatible shift acceptable?
- stepTimeoutSeconds default: The proposal defaults to 0 (disabled) for backward compat. Should the first release set a non-zero default (e.g., 300s) to give users the improved behavior immediately?
Environment
- Primary channel: Telegram
- Typical tool-heavy turn: 4–6 LLM calls + 5–10 tool executions
- Current timeoutSeconds: varies, but commonly 120–300s