fix: defer LLM stream connect timeout to after HTTP request is sent#729
Conversation
There was a problem hiding this comment.
Suggested priority: P2 (includes user-path files (packages/app/src/components/terminal.tsx, packages/app/src/pages/layout/sidebar-workspace.tsx)).
P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.
📝 WalkthroughWalkthroughThis PR delivers security validations and robustness improvements across the CLI, LLM session, and data handling layers. Terminal font loading is now fault-tolerant, CLI commands validate user inputs (SQL read-only enforcement, URL protocols, config path traversal), LLM stream timeout is fixed to measure only network time (not setup), and data parsing now guards against invalid parameters. ChangesValidation and resilience improvements across CLI, session, and UI
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/opencode/src/cli/cmd/db.ts`:
- Around line 31-35: The current validation in db.ts allowing queries that start
with "with" is unsafe because SQLite permits CTEs before INSERT/UPDATE/DELETE;
update the check in the query validation (the normalized/query logic) to either
(A) remove "with" from the allowlist so only "select" and "pragma" are accepted,
or (B) implement deeper inspection: parse the normalized string after the CTE
block to locate the first real statement token and verify it is "select" or
"pragma" before allowing execution; update the UI.error message and keep the
existing process.exit(1) behavior on rejection.
In `@packages/opencode/src/cli/cmd/providers.ts`:
- Around line 300-303: The code path that validates the URL (the block checking
/^https?:\/\//i against url) logs an error and returns, which leaves the process
exit code as 0; change this to set a non-zero exit status before returning (for
example call process.exit(1) or throw an error) so failures are visible to
automation—update the branch that calls prompts.log.error("URL must use http or
https protocol") and prompts.outro("Done") to exit with a non-zero code (e.g.,
process.exit(1)) instead of just returning.
In `@packages/opencode/src/config/paths.ts`:
- Around line 122-125: The current lexical check using
path.resolve/path.relative can be bypassed via symlinks; update the validation
in the code that computes resolvedPath and relativeToConfig so both the config
root and the target file are canonicalized with fs.realpath (or
fs.promises.realpath) before comparing. Specifically, call realpath on configDir
(or configSource root) and on the resolvedPath, then use path.relative between
those real paths and throw the same JsonError (referencing JsonError and token)
if the real-relative path escapes; also ensure subsequent reading
(Filesystem.readText) uses the realpath to avoid following a symlink that points
outside. Handle realpath errors and convert them into the existing JsonError
flow.
In `@packages/ui/src/components/tool-info.ts`:
- Around line 42-58: The file declares duplicate top-level functions
enterWorktreeOwnerProject and enterWorktreeTarget (causing TS compilation
failure); remove the redundant implementations so only one definition of each
remains (keep the original/desired implementation and delete the duplicates you
found at lines 42–58), and ensure any exports or references still point to the
retained functions (enterWorktreeOwnerProject, enterWorktreeTarget) and that
helper utilities pickString/getFilename are available where used.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 91e5eab7-48e1-4889-8d0d-292139bed5cb
📒 Files selected for processing (9)
packages/app/src/components/terminal.tsxpackages/app/src/pages/layout/sidebar-workspace.tsxpackages/opencode/src/cli/cmd/db.tspackages/opencode/src/cli/cmd/providers.tspackages/opencode/src/config/paths.tspackages/opencode/src/provider/transform.tspackages/opencode/src/session/llm.tspackages/opencode/src/session/prompt.tspackages/ui/src/components/tool-info.ts
There was a problem hiding this comment.
Code Review
This pull request introduces several security and stability enhancements, including read-only query enforcement for the database CLI, URL protocol validation, and path traversal protection for configuration file references. It also refactors LLM request timeouts to measure only network wait time, improves terminal font loading resilience, and adds robust parsing for prompt range parameters. A critical issue was identified in the UI components where duplicate function definitions were introduced, which would cause redeclaration errors.
The connect timeout (30s) was started during request object creation (arm() call), but the HTTP request wasn't sent until after run() completed its async setup work (provider lookup, config, plugin hooks). This caused premature timeouts — users saw "timed out after 30000ms" before actually waiting 30 seconds for the provider. Move arm() into a new startTimeout() method on the request object, called after run() returns. The timeout now only measures actual network/provider wait time. Fixes Astro-Han#728
92a20bc to
bac1267
Compare
…758) ## Summary Add `ProviderTransform.streamTimeouts(model)` returning `{ connectTimeoutMs: 120_000 }` for reasoning-capable models (gated on `model.capabilities.reasoning`), and apply it at the two production `llm.stream()` call sites — `session/processor.ts` (main response) and `session/prompt.ts` (title generation) — with helper-first spread order so any caller-provided `StreamInput.connectTimeoutMs` still wins. The default `CONNECT_STREAM_TIMEOUT_MS` is untouched; non-reasoning models keep the 30s ceiling. ## Why The 30-second first-progress watchdog in `session/llm.ts` aborts reasoning-model streams whose first observable provider event arrives later than the ceiling. The incident recorded in #755 shows OpenAI `gpt-5.5` on a long session spending 30204ms with only `start: 1` and zero content events before the watchdog aborted. PR #729 fixed an earlier residual where the connect timer armed before the HTTP request was actually sent; the residual addressed here is the 30s ceiling itself, which #729's body explicitly deferred. The capability gate keys off `model.capabilities.reasoning` after verifying the current models.json catalog has correct `reasoning=true` labels on the gpt-5 family (22/23, only `gpt-5.3-chat-latest` excluded), all o-series, Claude haiku/sonnet/opus 4.x thinking variants, and the Gemini 2.5+/3.1 series — so no provider-id allowlist fallback is needed at this time. ## Related Issue Refs #755 (short-term path; the full deferred set — `SessionRetry.policy.retryable()` not classifying local timeouts, watchdog architectural rewrite, and the parallel mid-stream `terminated` failure mode — is documented in the issue body and left to follow-up PRs). ## Human Review Status Pending ## Review Focus - Helper-first spread order at `session/processor.ts:954` and `session/prompt.ts:465`. `processor.ts` receives `streamInput` from `process()` callers (`session/prompt.ts:2020`, `session/compaction.ts:480`); neither sets `connectTimeoutMs` today, so the spread defaults to the helper value. The order is helper-first specifically so that a future caller wishing to override does not need code changes here. - Capability-based gate vs model-id-pattern matching. Catalog cross-checked at PR-prep time and labels are reliable across the four target families; a second pair of eyes on whether `model.capabilities.reasoning === true` is the right axis is welcome. - The new contract test floor at `>= 90_000` in `transform.test.ts` codifies the policy direction (the lowest value considered for reasoning models during the issue discussion). Worth questioning whether a corresponding upper bound is also needed. - Title-generation path applies the helper to `mdl` from `provider.getSmallModel(input.providerID)` or the title agent's explicit override. The typical small model is non-reasoning so the helper returns `{}`; only when a user explicitly configures a reasoning small variant for the title agent does the 120s apply, and the failure mode is silent (caught at `prompt.ts:482-488`, logs and falls back to the default title). The deliberate choice was to avoid a per-mode branch inside the helper. ## Risk Notes - After the 120s ceiling is reached, a first-progress timeout still surfaces as a hard `UnknownError` because `SessionRetry.policy.retryable()` does not type-tag local timeouts and does not match the bare error message. The user-experience improvement here is the lower hit rate; no automatic retry is added in this PR. Retry classification is deferred until the parallel mid-stream `terminated` failure mode is analyzed, so retry can be designed once against both failure shapes rather than speculatively per-shape. - Explicit `connectTimeoutMs: undefined` from a caller would clobber the helper value during spread and fall through to the 30s default via `llm.ts:434-439`. No production path constructs `streamInput` this way today, but a future caller with conditional override should pass a positive number or omit the field rather than set it to `undefined`. - Behavior change is strictly scoped to reasoning-capable models. Non-reasoning models keep the 30s default. - No visible UI or copy changed; the visible-UI conditional checkbox is left unticked for that reason. - No platform / packaging / updater / signing / paths / shell / permissions surface was touched; the macOS/Windows conditional checkbox is left unticked for that reason. - No docs / release notes / dependencies / permissions / credentials / deletion behavior / generated content / local file changes; the related conditional checkbox is left unticked for that reason. ## How To Verify ```text typecheck (bun --cwd packages/opencode run typecheck): ok bun test packages/opencode/test/provider/transform.test.ts (streamTimeouts): 3 pass / 0 fail bun test packages/opencode/test/session/ packages/opencode/test/provider/: 940 pass / 4 skip / 1 todo / 0 fail internal cross-review (Claude Opus + Codex high, parallel): 0 P0 / 0 P1 / 1 P3 both reviewers flagged (policy floor too loose) fixed by tightening test to >= 90_000 ``` ## Screenshots or Recordings Not applicable (no UI change). ## Checklist - [ ] **Type label** — this PR carries exactly one of `bug`, `enhancement`, `task`, `documentation`. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this. - [ ] **Routing labels** — this PR carries at least one of `app`, `ui`, `platform`, `harness`, `ci`. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this. - [ ] **Priority label** — this PR carries exactly one of `P0`, `P1`, `P2`, `P3`. The priority-triage bot suggests one on PR open. Confirm or override, then tick this. - [x] Human Review Status above is set to `Pending`, `Approved by @<reviewer>`, or `Not required: <reason>` (default is `Pending`; "not required" is restricted to bot-authored low-risk PRs). - [x] I linked the related issue, or stated in Summary why there is no issue. - [x] I described the review focus and any meaningful risks. - [x] I replaced the example block in How To Verify with the real verification steps and the key result for each. - [x] I did not introduce unrelated refactors, dependencies, generated files, or file changes beyond the stated scope. - [ ] **(conditional)** I manually checked visible UI or copy changes when needed, with screenshots or recordings. Leave unticked only if no visible UI or copy changed. - [ ] **(conditional)** I considered macOS and Windows impact for platform, packaging, updater, signing, paths, shell, or permissions changes. Leave unticked only if no platform/packaging surface was touched. - [ ] **(conditional)** I called out docs, release notes, dependencies, permissions, credentials, deletion behavior, generated content, or local file changes when relevant. Leave unticked only if none of those surfaces was touched. - [x] I reviewed the final diff for unrelated changes and suspicious dependency changes. - [x] I am targeting `dev`, and my PR title and commit messages use Conventional Commits in English. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added intelligent timeout management for AI model streaming with extended timeouts for reasoning-capable models, improving stability during complex inference tasks. * Enhanced title generation streams with optimized timeout configuration for better performance on reasoning models. * **Tests** * Added test coverage for new timeout management functionality, validating behavior across different model types. <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/Astro-Han/pawwork/pull/758?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Summary
Defer the LLM stream connect timeout (
arm()call) until afterrun()returns, so the 30-second window measures only network/provider response time, not internal async setup (provider lookup, config, plugin hooks).Why
The connect timer was started during request-object construction, before the HTTP request was actually sent.
run()'s async setup work (provider lookup, config load,experimental.chat.system.transform/chat.params/chat.headersplugin hooks) ate into the 30s window, so users on slower providers (Kimi, Xiaomi token plan, other domestic-China endpoints) saw "timed out after 30000ms" before they had actually waited 30 seconds. Issue #728 documents 12 occurrences in a single user's database.Fix moves
arm()into a newstartTimeout()method on the request object and calls it afterrun()returns, right before stream consumption.Related Issue
Fixes #728.
Human Review Status
Pending.
Review Focus
packages/opencode/src/session/llm.ts: thearm()call site move. Confirm nothing else in the request lifecycle depends on the timer being armed during construction.resetTimeout()andtimeoutFailuresemantics are unchanged.Risk Notes
run()itself never returns (e.g., provider lookup hangs), the connect timer never starts. This is the same failure mode as before —run()hangs were never covered by the connect timer. No new regression surface.SessionRetry.policydoes not treat connect timeouts as retryable, (2)connectTimeoutMsis not configurable end-to-end. Both intentionally left for separate PRs.How To Verify
Screenshots or Recordings
Not applicable (no UI change).
Checklist
bug,platform, priority)dev, Conventional Commits in EnglishNote to @Spongeacer: The branch was rewritten by a maintainer to narrow scope to just the
llm.tsfix (the original PR also pulled in unrelated audit/dead-code changes that broke typecheck via a duplicate function intool-info.ts). The other changes can be reproposed individually if you still want them in. To resync your local branch:```
git fetch origin
git checkout fix/llm-stream-connect-timeout
git reset --hard origin/fix/llm-stream-connect-timeout
```