feat(agents): split LLM timeout into first-token and idle phases#65898
feat(agents): split LLM timeout into first-token and idle phases#65898kryptt wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99367e42e5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| } | ||
|
|
||
| // Effective first-token window. Positive override > fallback to idle. | ||
| const effectiveFirstTokenMs = firstTokenTimeoutMs > 0 ? firstTokenTimeoutMs : idleTimeoutMs; |
There was a problem hiding this comment.
Respect explicit first-token timeout disable
resolveLlmFirstTokenTimeoutMs intentionally returns 0 when firstTokenTimeoutSeconds: 0 is configured, but this wrapper immediately maps 0 back to idleTimeoutMs, so the first-token timer still fires. In practice, configs like idleTimeoutSeconds: 30 + firstTokenTimeoutSeconds: 0 will still abort after 30s before the first token, which contradicts the new config contract and schema text that says 0 disables first-token timeout.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 5d1e90d. firstTokenTimeoutMs === 0 now truly disables the first-token timer (waits indefinitely for the first chunk; the idle timer still kicks in once chunks start arriving). The resolver returns number | undefined: undefined means inherit-idle, 0 means disabled — preserving the contract the schema text now describes.
Greptile SummarySplits the single Confidence Score: 5/5Safe to merge; only P2 findings (doc clarity, parameter shadowing) with no logic or data-integrity issues. All remaining findings are P2: a misleading 'Set to 0 to disable' doc comment (actual behavior is fallback to idle, not true disable) and a parameter-name shadow introduced by the options-object refactor. Core logic, timer lifecycle, config schema, and test coverage are all correct. src/config/types.agent-defaults.ts (doc clarity on firstTokenTimeoutSeconds: 0 semantics) and src/agents/pi-embedded-runner/run/llm-idle-timeout.ts (inner options parameter shadowing). Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/config/types.agent-defaults.ts
Line: 333
Comment:
**Misleading "disable" wording for `firstTokenTimeoutSeconds: 0`**
Setting `firstTokenTimeoutSeconds: 0` does not disable the first-token timeout — it falls back to using `idleTimeoutSeconds` for that phase (because `streamWithIdleTimeout` treats `firstTokenTimeoutMs === 0` the same as "unset"). An operator wanting an unlimited first-token wait (e.g. for extreme cold starts) while keeping idle protection would set this to 0 expecting no cap, but still receive an `idleTimeoutSeconds`-bounded first-token window.
The description should clarify the actual semantics, e.g.:
```suggestion
* Set to 0 to remove the dedicated first-token window (the first-token phase
* will then use `idleTimeoutSeconds`, same as when this field is unset).
* To wait indefinitely for the first token, set a very large value instead.
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/llm-idle-timeout.ts
Line: 124
Comment:
**`options` parameter shadows the outer `options: StreamIdleTimeoutOptions`**
The inner function returned by `streamWithIdleTimeout` reuses `options` as the name of the `StreamFn` call-site parameter, shadowing the outer `options: StreamIdleTimeoutOptions`. There is no runtime bug (the outer `options` is fully destructured at the top of the function), but the shadowing is new — introduced because the outer positional arguments were collapsed into an `options` object in this PR. If someone adds code inside the inner closure expecting `options` to refer to timeout config, they'll silently pick up the stream-call options instead.
```suggestion
return (model, context, streamCallOptions) => {
```
and update the one use inside the body:
```ts
const maybeStream = baseFn(model, context, streamCallOptions);
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(agents): split LLM timeout into fir..." | Re-trigger Greptile |
| * may legitimately take much longer than mid-stream token gaps. | ||
| * When unset, the first-token wait inherits `idleTimeoutSeconds` | ||
| * (backwards-compatible behavior). | ||
| * Set to 0 to disable the first-token timeout without disabling idle. |
There was a problem hiding this comment.
Misleading "disable" wording for
firstTokenTimeoutSeconds: 0
Setting firstTokenTimeoutSeconds: 0 does not disable the first-token timeout — it falls back to using idleTimeoutSeconds for that phase (because streamWithIdleTimeout treats firstTokenTimeoutMs === 0 the same as "unset"). An operator wanting an unlimited first-token wait (e.g. for extreme cold starts) while keeping idle protection would set this to 0 expecting no cap, but still receive an idleTimeoutSeconds-bounded first-token window.
The description should clarify the actual semantics, e.g.:
| * Set to 0 to disable the first-token timeout without disabling idle. | |
| * Set to 0 to remove the dedicated first-token window (the first-token phase | |
| * will then use `idleTimeoutSeconds`, same as when this field is unset). | |
| * To wait indefinitely for the first token, set a very large value instead. |
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/config/types.agent-defaults.ts
Line: 333
Comment:
**Misleading "disable" wording for `firstTokenTimeoutSeconds: 0`**
Setting `firstTokenTimeoutSeconds: 0` does not disable the first-token timeout — it falls back to using `idleTimeoutSeconds` for that phase (because `streamWithIdleTimeout` treats `firstTokenTimeoutMs === 0` the same as "unset"). An operator wanting an unlimited first-token wait (e.g. for extreme cold starts) while keeping idle protection would set this to 0 expecting no cap, but still receive an `idleTimeoutSeconds`-bounded first-token window.
The description should clarify the actual semantics, e.g.:
```suggestion
* Set to 0 to remove the dedicated first-token window (the first-token phase
* will then use `idleTimeoutSeconds`, same as when this field is unset).
* To wait indefinitely for the first token, set a very large value instead.
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in 5d1e90d. firstTokenTimeoutSeconds: 0 now genuinely disables the first-token timer (waits indefinitely for the first chunk). Updated docstring to describe the new contract: leave unset to inherit idleTimeoutSeconds, set to a positive value for a dedicated window, set to 0 to wait indefinitely for the first chunk.
There was a problem hiding this comment.
Re-flagged on the rebased commit, but the underlying semantics were already corrected: firstTokenTimeoutSeconds: 0 now genuinely disables the first-token timer (waits indefinitely for the first chunk; idle timer still applies after the first chunk arrives). Tightened the docstring in 668f3dc to spell that out explicitly so it can't be re-misread.
|
|
||
| // Effective first-token window. Positive override > fallback to idle. | ||
| const effectiveFirstTokenMs = firstTokenTimeoutMs > 0 ? firstTokenTimeoutMs : idleTimeoutMs; | ||
|
|
||
| return (model, context, options) => { |
There was a problem hiding this comment.
options parameter shadows the outer options: StreamIdleTimeoutOptions
The inner function returned by streamWithIdleTimeout reuses options as the name of the StreamFn call-site parameter, shadowing the outer options: StreamIdleTimeoutOptions. There is no runtime bug (the outer options is fully destructured at the top of the function), but the shadowing is new — introduced because the outer positional arguments were collapsed into an options object in this PR. If someone adds code inside the inner closure expecting options to refer to timeout config, they'll silently pick up the stream-call options instead.
| return (model, context, options) => { | |
| return (model, context, streamCallOptions) => { |
and update the one use inside the body:
const maybeStream = baseFn(model, context, streamCallOptions);Prompt To Fix With AI
This is a comment left during a code review.
Path: src/agents/pi-embedded-runner/run/llm-idle-timeout.ts
Line: 124
Comment:
**`options` parameter shadows the outer `options: StreamIdleTimeoutOptions`**
The inner function returned by `streamWithIdleTimeout` reuses `options` as the name of the `StreamFn` call-site parameter, shadowing the outer `options: StreamIdleTimeoutOptions`. There is no runtime bug (the outer `options` is fully destructured at the top of the function), but the shadowing is new — introduced because the outer positional arguments were collapsed into an `options` object in this PR. If someone adds code inside the inner closure expecting `options` to refer to timeout config, they'll silently pick up the stream-call options instead.
```suggestion
return (model, context, streamCallOptions) => {
```
and update the one use inside the body:
```ts
const maybeStream = baseFn(model, context, streamCallOptions);
```
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fixed in 5d1e90d. Renamed the inner closure parameter to streamCallOptions and updated the body — no more shadowing.
99367e4 to
5d1e90d
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d1e90d6b8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| (error) => idleTimeoutTrigger?.(error), | ||
| ); | ||
| firstTokenTimeoutMs, | ||
| onIdleTimeout: (error) => idleTimeoutTrigger?.(error), |
There was a problem hiding this comment.
Preserve timeout phase when surfacing timeout errors
This callback collapses both timeout phases into a single idle signal: streamWithIdleTimeout now invokes it for first-token and mid-stream idle timeouts, but attempt.ts only records idleTimedOut, so downstream error handling cannot tell which phase actually failed. In practice, a run that times out on firstTokenTimeoutSeconds (for example idle=30s, first-token=10s) is still reported as an idle-timeout case and points users at agents.defaults.llm.idleTimeoutSeconds, which does not fix the configured first-token limit; propagate the timeout phase (or split callbacks/flags) so retries and user guidance target the correct setting.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Addressed in 668f3dc. Added llmTimeoutPhase: "first-token" | "idle" | null to EmbeddedRunAttemptResult. The idleTimeoutTrigger callback now records the phase by inspecting the error message (streamWithIdleTimeout already labels phase in the message text). The user-facing error in run.ts now points at agents.defaults.llm.firstTokenTimeoutSeconds for first-token timeouts and agents.defaults.llm.idleTimeoutSeconds for mid-stream idle timeouts. Retry/failover logic still keys off the idleTimedOut boolean so both phases stay eligible for the same-model retry — both warrant a retry. Added a parallel test in run.timeout-triggered-compaction.test.ts.
Adds a second, longer timeout window that covers the wait for the first
token separately from mid-stream idle gaps. This allows operators to
set a tight idle timeout for responsive mid-stream fault detection
without killing cold-loading models or upstream routers that only
produce their first byte after a long warm-up.
- New `agents.defaults.llm.firstTokenTimeoutSeconds` config knob.
When unset, the first-token wait inherits `idleTimeoutSeconds`
(backwards-compatible: existing configs behave identically).
- `streamWithIdleTimeout` now takes an options object
`{ idleTimeoutMs, firstTokenTimeoutMs?, onIdleTimeout? }` and tracks
a per-iterator `firstTokenSeen` flag; the appropriate window is
selected before each `next()` race.
- `idleTimeoutMs === 0` remains the master off-switch and now short-
circuits by returning `baseFn` unwrapped.
- Error messages distinguish the phase:
"LLM first-token timeout (Xs): ..." vs "LLM idle timeout (Xs): ...".
Motivation: deployments with local Ollama backends can have cold
model loads that legitimately take minutes, while mid-stream silence
almost always indicates a real fault. Previously a single idle timeout
had to be generous enough for the worst cold-load, which made
mid-stream hangs slow to detect.
Tests: covers first-token timeout firing with short idle, idle
timeout firing after first chunk with long first-token window, `0`
disable semantics, and `resolveLlmFirstTokenTimeoutMs` fallback rules.
5d1e90d to
668f3dc
Compare
When the agent streams from a model the backend hasn't loaded yet,
Ollama goes silent while it maps the weights — often for minutes.
Any downstream idle-timeout short enough to catch real mid-stream
hangs (~30s) will trip during this load. Raising the idle timeout
to cover cold-load worst-case (5m+) makes mid-stream fault detection
unbearably slow.
Fix: inject protocol-appropriate keepalive chunks during the load,
gated by positive evidence the model wasn't already hot.
Flow:
1. On streaming chat/completion requests, call /api/ps on the chosen
backend (PREFLIGHT_TIMEOUT=10s). If the model is loaded, proxy
normally — zero overhead.
2. If not loaded, commit to 200 OK and run the heartbeat path:
emit no-op keepalive chunks every LOADING_HEARTBEAT=15s while
waiting for upstream headers + first body byte, bounded by
LOADING_MAX_WAIT=300s.
3. Once real bytes flow, stop heartbeating and stream through.
4. If upstream errors *after* we committed 200, emit an in-band
error per protocol (Ollama: done:true+error; OpenAI SSE:
{error} + [DONE]; Anthropic: event: error).
Protocol coverage:
- Ollama /api/chat + /api/generate (NDJSON, empty-content chunks)
- OpenAI /v1/chat/completions + /v1/completions (SSE comment lines)
- Anthropic /v1/messages (SSE comment lines)
Heartbeat format is spec-compliant no-op in every case — SSE comments
are ignored by EventSource/httpx_sse/OpenAI+Anthropic SDKs; Ollama
NDJSON uses valid chunks with empty content and done:false so
python-ollama, ollama-js, etc. just skip them.
Knobs (all in seconds, env vars on the Deployment):
OLLAMA_ROUTER_LOADING_HEARTBEAT (default 15)
OLLAMA_ROUTER_PREFLIGHT_TIMEOUT (default 10)
OLLAMA_ROUTER_LOADING_MAX_WAIT (default 300)
Honesty property: heartbeats only ever fire when /api/ps said the
model wasn't loaded at request start. If the model then fails to load,
upstream returns an error body, the heartbeat stops naturally, and
the in-band error encoder surfaces the failure to the client. The
pathological "upstream hangs forever after preflight succeeds"
window is bounded by LOADING_MAX_WAIT.
17 new tests cover protocol classification, heartbeat payload shape,
preflight matching (name/model/prefix), and end-to-end heartbeat with
a mock backend.
Companion change on the client side (PR openclaw/openclaw#65898)
splits idleTimeoutSeconds into idleTimeoutSeconds + firstTokenTimeoutSeconds,
so deployments without this router still get a sane cold-load window.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9b4c65821
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| "The model did not produce its first token before the LLM first-token timeout. " + | ||
| "Please try again, or increase `agents.defaults.llm.firstTokenTimeoutSeconds` in your config (set to 0 to wait indefinitely for the first token)."; |
There was a problem hiding this comment.
Keep timeout leak detection aligned with first-token errors
This new first-token timeout text does not match the timeout leak pattern used by QA character eval (extensions/qa-lab/src/character-eval.ts only matches model did not produce a response before the LLM idle timeout). In scenarios where a run times out before the first token, the failure text can now pass through detectTranscriptFailure unclassified, so eval runs may be marked clean even though a timeout error leaked into the transcript. Please update the timeout detection/matching path to treat first-token and idle timeout messages consistently.
Useful? React with 👍 / 👎.
|
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Keep this PR open because the first-token/idle split is still a meaningful unsolved capability, but the current branch is not merge-ready: it conflicts with current timeout plumbing, reintroduces a retired config subtree, and still lacks real behavior proof. Canonical path: Close this stale PR. The latest review rated it F, the branch still lacks merge-ready proof, and there has been no human follow-up after the durable review. So I’m closing this here because the remaining work is already tracked in the canonical issue. Review detailsBest possible solution: Close this stale PR. The latest review rated it F, the branch still lacks merge-ready proof, and there has been no human follow-up after the durable review. Do we have a high-confidence way to reproduce the issue? Not applicable as a strict bug reproduction: this PR adds a new two-phase timeout capability. The claimed deployment scenario is plausible from source, but the PR does not include real slow-model/proxy proof. Is this the best way to solve the issue? No, not as currently shaped. The split is useful, but the implementation needs a current-main rebase, preserved timeout/abort contracts, and a maintainer-approved config surface before it is the best solution. Security review: Security review cleared: The diff does not touch dependency sources, workflows, secrets, package resolution, or other supply-chain surfaces; the concerns are compatibility and availability, not security. AGENTS.md: found and applied where relevant. What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e0018382eb00. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
|
This pull request has been automatically marked as stale due to inactivity. |
|
ClawSweeper applied the proposed close for this PR.
|
Summary
Adds a second, longer timeout window that covers the wait for the first token separately from mid-stream idle gaps. This lets operators set a tight idle timeout for responsive fault detection without killing cold-loading models or upstream routers whose first byte legitimately takes minutes.
agents.defaults.llm.firstTokenTimeoutSecondsidleTimeoutSeconds— fully backwards-compatible (existing configs behave identically).0to disable the first-token timeout without disabling idle.streamWithIdleTimeoutsignature change: now takes an options object{ idleTimeoutMs, firstTokenTimeoutMs?, onIdleTimeout? }. The per-iterator state tracks afirstTokenSeenflag and selects the appropriate window before eachnext()race.idleTimeoutMs === 0master off-switch: short-circuits to returnbaseFnunwrapped (zero overhead when timeouts are disabled)."LLM first-token timeout (Xs): no response from model""LLM idle timeout (Xs): no response from model"Motivation
Deployments with local Ollama backends can have cold model loads that legitimately take 2–4 minutes, while mid-stream silence almost always indicates a real fault (network stall, model crash). Previously a single
idleTimeoutSecondshad to be generous enough to cover the worst cold-load, which made mid-stream hangs slow to detect. Splitting the two lets both windows be correctly sized.Example config for a deployment with cold-loading backends + a heartbeat-emitting proxy:
Test plan
pnpm vitest run src/agents/pi-embedded-runner/run/llm-idle-timeout.test.ts— 23 tests pass (8 existing + 15 new/updated)firstTokenTimeoutMsidleTimeoutMsand longfirstTokenTimeoutMsidleTimeoutMs: 0with a positivefirstTokenTimeoutMsdisables both phases (master off-switch)resolveLlmFirstTokenTimeoutMsfalls back to idle when unset, respects explicit0, caps at max safetsc --noEmitcleanconfig:schema:gen) — new field appears underagents.defaults.llmattempt.ts) updated to the new options-object APIBreaking changes
streamWithIdleTimeout's signature changed from(baseFn, timeoutMs, onIdleTimeout?)to(baseFn, options). The only in-tree caller isattempt.ts(updated). External plugins that wrap the stream function would need to migrate — but this isn't a documented extension point, so in practice this is internal.The config surface is additive and backwards-compatible.