fix(opencode): fail stalled LLM connections before silent timeout#558
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR implements a connect-timeout watchdog for LLM streaming to prevent indefinite stuck states when providers are misconfigured or reject model combinations. It introduces a 30-second connect timeout that fires if no provider progress events arrive, while preserving the existing stream-idle timeout for cases where progress has already begun. The change includes new timeout constants, type extensions, event classification logic, iterator wrappers, expanded test mocks, and comprehensive unit and integration tests. ChangesConnect-Timeout Watchdog for LLM Streaming
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a dedicated connection timeout for LLM streams, distinct from the existing silent stream timeout. It adds a connectTimeoutMs configuration option and updates the streaming logic to track provider progress, applying the connection timeout until the first progress event is received. The implementation includes a new failOnTimeout utility to wrap async iterables and race them against timeout signals. Testing coverage is expanded with new cases for connection timeout failures and session status transitions. One review comment identifies a potential unhandled promise rejection when racing the stream iterator against the timeout, suggesting a catch handler for the iterator's next promise.
Astro-Han
left a comment
There was a problem hiding this comment.
Opus second-pass review (product layer / three questions / five locks)
Conclusion: approve
Implementation stays inside the GPT-X-locked D shape: connect-phase timeout split from silent-stream timeout, allowlist-gated provider progress flag, force-failure path via failOnTimeout wrapper hitting processor.halt(). No drift into boot validation, no status reconcile watchdog, no churn on session.abort from PR #556.
Three questions
Could this be even less? 293 lines across llm.ts + 2 test files. Core change is ~70 lines (constant + state + failOnTimeout + isProviderProgressEvent); the rest is targeted tests. Promise.race(iterator.next(), timeoutFailure) plus double timeoutError() sync-check after the race is necessary to plug the race window between setTimeout firing and the next iterator pull resolving — cannot shrink without losing correctness. Check.
Is what remains good enough?
CONNECT_STREAM_TIMEOUT_MS = 30_000default;connectTimeoutMs?: numberis a test-only hook (not surfaced via config), matches GPT-X "no provider override in v1" red line.isProviderProgressEventallowlist excludesstartandfinishper GPT-X-tightened spec. Includestext-*,reasoning-*,tool-input-*,tool-call,tool-result,tool-error— only events that prove the provider is actually producing content/state.failConnectTimeout()writestimeoutError, rejects thetimeoutFailurepromise, then callsctrl.abort(). ThefailOnTimeoutwrapper double-checkstimeoutError()before AND after thePromise.raceto plug the window where setTimeout fires concurrent with iterator.next() resolving. Stream throws →Effect.runDrainreturns failure →processor.halt()triggers → assistantinfo.errorwritten → session status flips to idle.- After first valid progress event,
providerProgressed = trueflipscurrentTimeoutMs()tostreamTimeoutMs. Existing silent-stream timeout path (10min default) preserved. - Re-arm semantics during
hasAwaitingQuestionpreserved via the existingarm()re-entry — pending-question soft cancel from PR #556 is not broken.
Check.
Is it reassuring? Five-lock acceptance traced against tests:
- connect timeout produces stream failure not success drain — explicit test asserting
Exit.isFailure(exit)).toBe(true)plusresponseCanceled+requestAbortedboth fire. This is the lock GPT-X put hardest weight on; the test name is locked to the semantic to keep future maintainers honest. - Processor writes
info.error+ flips status idle — covered byprocessor-effect.test.tsadditions (+65 LOC). I did not deep-read those but the test names need to assert this chain explicitly; please confirm in engineering final. - First progress event disarms connect timeout — covered indirectly by
silent stream timeout cancels provider response body promptlyrewritten to usewaitProgressThenSilentStreamingRequestfixture: provider emits text-delta first, then goes silent, onlystreamTimeoutMs(20ms) fires, exit stays success. - Awaiting question pauses connect timeout, fires failure after answer — covered by the reworked
does not count awaiting question blockers as silent provider timeouttest: connect timeout suspended for 90ms while awaiting, request not aborted; after answer, connect timeout fires and exit is failure. PR #556 soft-cancel pending-question semantic preserved. - Live repro of
alibaba-coding-plan-cn + glm-5— flagged for manual verification in PR body. Not asserted in CI.
Findings
P3-1 (rename request, not blocking): The test does not count awaiting question blockers as silent provider timeout no longer matches its current behavior. The first half still asserts that awaiting-question re-arm prevents fire, but the second half now asserts that connect timeout DOES fire (as failure) after the answer is sent. Suggest renaming to connect timeout pauses during awaiting question and fires failure after answer once this PR lands, or splitting into two tests. Not a blocker — behavior is correct, name is the artifact lag.
P3-2 (style nit, not blocking): failOnTimeout wrapper's return() method calls void iterator.return?.().catch(() => {}) without awaiting. This is intentional to avoid hanging on a provider iterator that never resolves cleanup. Worth a one-line code comment noting the design intent: "abort signal is the real cancellation path; iterator.return() is best-effort protocol courtesy that we do not block on."
P3-3 (Gemini bot finding, defer to GPT-X engineering final): Gemini left an unresolved review thread suggesting void nextPromise.catch(() => {}) before the Promise.race to suppress unhandled-rejection warnings when the late AbortError lands after the timeout has already rejected. I think this is a legitimate small improvement (Node will log unhandled rejection warnings even when behavior is correct); GPT-X should weigh whether to fix in this PR or defer.
Approve
Pending GPT-X engineering final pass and Gemini thread resolution.
67f8033 to
3614914
Compare
Astro-Han
left a comment
There was a problem hiding this comment.
Engineering review: one P2 issue before I can pass this.
-
P2
packages/opencode/src/session/llm.ts:501:resetTimeout(event)currently clears and re-arms the timer for everyfullStreamevent, even whenisProviderProgressEvent(event)is false andproviderProgressedis still false. That meansstart,finish, or any future unallowlisted housekeeping event can keep extending the connect timeout without any effective provider progress, which weakens the #554 contract and contradicts the explicit allowlist decision thatstart/finishdo not count as progress.Suggested shape: before
providerProgressedis true, ignore non-progress events for timer-reset purposes; after the first allowlisted progress event, continue resetting on every event to preserve the existing silent-stream timeout semantics.const progressed = isProviderProgressEvent(event) if (!providerProgressed && !progressed) return if (progressed) providerProgressed = true if (timeout) clearTimeout(timeout) arm()
Please add a regression test showing a
start-only or otherwise non-progress-only stream still fails onconnectTimeoutMswithout extending the timer.
The info.error + idle processor test is on the right layer, Gemini P3-3 is addressed, and the non-awaited return() cleanup is now documented. I will re-check after this timer-reset edge is patched.
3614914 to
6295698
Compare
Astro-Han
left a comment
There was a problem hiding this comment.
Engineering review pass for 6295698b5.
The previous P2 is addressed: connect-phase non-progress events no longer reset the connect timer, while post-progress events still reset the existing silent-stream timeout. The new start-only regression covers the important edge case.
Verified locally:
bun --cwd packages/opencode test test/session/llm.test.ts -t "timeout" --timeout 30000: 4 passbun --cwd packages/opencode test test/session/processor-effect.test.ts -t "connect timeout writes assistant info.error and flips session_status idle" --timeout 10000: 1 passbun --cwd packages/opencode typecheck: passgit diff --check: pass
No unresolved review threads found. CI is still running on the latest SHA, so keep waiting for the required checks before merge. Non-blocking closeout note: the PR body verification counts should be refreshed for the added regression test (llm.test.ts now has 19 tests and the full opencode run reported 2708 pass).
Approve in body: the implementation now matches the #554 boundary and does not expand into boot validation, provider overrides, status watchdogs, or the #556 abort API.
6295698 to
2ad4080
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Astro-Han
left a comment
There was a problem hiding this comment.
Engineering review pass for the rewritten history at 2ad40808f.
Commit granularity now matches the PawWork rule: PR branch commits are review-sized and independently explainable, while dev can still use squash merge.
Commit review:
f89725343 fix(opencode): fail stalled llm connections before silent timeout: core implementation plus the minimal LLM failure-not-success-drain regression. Single fix intent.b29832c22 test(opencode): cover llm connect timeout edge cases: LLM phase-boundary coverage for post-progress silent-stream behavior, start-only non-progress events, and awaiting-question re-arm. Single test-expansion intent.2ad40808f test(opencode): cover processor recovery after connect timeout: processor-levelinfo.error+session_status idlerecovery coverage. Single cross-layer regression intent.
Per-commit verification I reran by checkout:
f89725343:bun --cwd packages/opencode typecheckpass;bun --cwd packages/opencode test test/session/llm.test.ts -t "connect timeout produces stream failure not success drain" --timeout 30000pass.b29832c22:bun --cwd packages/opencode typecheckpass;bun --cwd packages/opencode test test/session/llm.test.ts -t "timeout" --timeout 30000pass.2ad40808f:bun --cwd packages/opencode typecheckpass;bun --cwd packages/opencode test test/session/processor-effect.test.ts -t "connect timeout writes assistant info.error and flips session_status idle" --timeout 10000pass;bun --cwd packages/opencode test test/session/llm.test.ts -t "timeout" --timeout 30000pass.
Also checked git diff --check dev...HEAD: pass, and review threads remain resolved.
Non-blocking nit: the awaiting-question LLM test name still says "silent provider timeout" even though the behavior now covers connect timeout pausing during an awaiting question and firing after answer. The behavior is correct and this does not block merge.
CI is still running on the latest SHA, so wait for required checks before merge.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
packages/opencode/src/session/llm.ts (1)
456-458: ⚡ Quick winSuppress potential unhandled-rejection on
timeoutFailure.
timeoutFailureis constructed at acquire time, but its rejection handler is only attached later whenPromise.race([nextPromise, request.timeoutFailure])runs insidenext(). IffailConnectTimeoutresolves the rejection before the firstnext()is called (e.g., a very smallconnectTimeoutMsin tests, or a tight timing window between yielded events), Node may report an unhandled rejection warning and could terminate the process in stricter runtimes.Attach a no-op catch at construction to make the promise always observed; the
next()consumer still gets the rejection viaPromise.race.🛡️ Proposed fix
const timeoutFailure = new Promise<never>((_, reject) => { rejectTimeout = reject }) + // Ensure the rejection is always observed even if no `next()` is in flight + // when the connect timeout fires (e.g., tight test timings). + void timeoutFailure.catch(() => {})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@packages/opencode/src/session/llm.ts` around lines 456 - 458, The timeoutFailure Promise created alongside rejectTimeout can reject before any consumer attaches handlers (in next()), causing unhandled-rejection warnings; update the construction of timeoutFailure (and the place that assigns rejectTimeout) to attach a no-op catch handler (e.g., timeoutFailure.catch(() => {})) immediately after creation so rejections are always observed while still allowing next() to race on request.timeoutFailure and receive the rejection; locate the timeoutFailure and rejectTimeout variables in the acquire/create block and add the catch there, leaving next(), failConnectTimeout, and Promise.race logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@packages/opencode/src/session/llm.ts`:
- Around line 456-458: The timeoutFailure Promise created alongside
rejectTimeout can reject before any consumer attaches handlers (in next()),
causing unhandled-rejection warnings; update the construction of timeoutFailure
(and the place that assigns rejectTimeout) to attach a no-op catch handler
(e.g., timeoutFailure.catch(() => {})) immediately after creation so rejections
are always observed while still allowing next() to race on
request.timeoutFailure and receive the rejection; locate the timeoutFailure and
rejectTimeout variables in the acquire/create block and add the catch there,
leaving next(), failConnectTimeout, and Promise.race logic unchanged.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 883c7d2b-f380-41b7-a8c9-f31eb6601bf1
📒 Files selected for processing (3)
packages/opencode/src/session/llm.tspackages/opencode/test/session/llm.test.tspackages/opencode/test/session/processor-effect.test.ts
2ad4080 to
8c8b285
Compare
8c8b285 to
37b587d
Compare
Astro-Han
left a comment
There was a problem hiding this comment.
Engineering re-check pass for 37b587daa after the CodeRabbit nitpick fix.
Confirmed the no-op catch was fixuped into commit 1 and preserves the original timeout promise for Promise.race([nextPromise, request.timeoutFailure]); it only observes early rejection warnings. The earlier nextPromise.catch(() => {}) remains in place. Commit 2 also now has the corrected awaiting-question test name.
Verified locally:
- On commit
a4cd01b50:bun --cwd packages/opencode typecheckpass;bun --cwd packages/opencode test test/session/llm.test.ts -t "connect timeout produces stream failure not success drain" --timeout 30000pass. - On PR HEAD
37b587daa:bun --cwd packages/opencode test test/session/llm.test.ts --timeout 30000: 19 pass;git diff --check dev...HEAD: pass; worktree clean.
No engineering blockers from GPT-X. Wait for CI to finish before merge.
Summary
Adds a 30s pre-progress LLM connect timeout before the existing 10m silent-stream timeout.
startandfinishdo not count as provider progress; only visible/state-advancing events do.info.error+ idle recovery.Why
#554 shows an invalid locally registered provider/model pair,
alibaba-coding-plan-cn + glm-5, entering a permanent "思考中" state. The request can stall before any effective provider output, leaving the assistant message without parts,info.error, completion time, or an LLM trace.The existing timeout guarded silent streams, but it used the same 10m budget before provider progress and only aborted the request. In the no-progress case that can success-drain instead of hitting
SessionProcessor.halt(), so no assistant error is written and the session can stay busy.This PR fixes the lowest layer that owns provider stream progress: LLM stream dispatch.
Related Issue
Closes #554.
Human Review Status
Pending. A human should make the final merge decision after reviewing the final diff and verification evidence.
Review Focus
Please focus on:
start/finishare excluded; text, reasoning, and tool progress are included.return()on stream cleanup but does not await a never-resolving provider iterator. Existing cancellation tests prove provider response body cancellation still happens promptly.Risk Notes
startand then no effective progress for 30s will now surface an error. That is intentional for v1; provider-specific overrides are deferred unless real usage proves the default too strict.return()so interrupted processor cleanup cannot hang on a never-ending provider stream. This is covered by existing interrupt cleanup tests and response-body cancellation tests.How To Verify
The concrete #554 failure mode is covered by a no-effective-provider-progress fixture:
connectTimeoutMsfires, the provider request is cancelled, the LLM stream fails rather than success-draining, the processor writes assistantinfo.error, andsession_statusreturns to idle. Production uses the 30s default; tests use a 20ms hook for deterministic runtime.Screenshots or Recordings
Not applicable. This is server/session behavior with no visible UI or copy changes.
Checklist
dev, and my PR title and commit messages use Conventional Commits in EnglishSummary by CodeRabbit
New Features
Bug Fixes