fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775)#2032
Open
pabloglzg wants to merge 1 commit into
Open
fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775)#2032pabloglzg wants to merge 1 commit into
pabloglzg wants to merge 1 commit into
Conversation
…e op handler (garrytan#1775) The 10s force-exit timer was installed before the op handler ran, making it a whole-run deadline: any shared op slower than 10s total (search/query against remote Postgres, a reranker pass, a query with LLM expansion) got process.exit() mid-handler — empty stdout, exit 0, and a misleading "engine.disconnect() did not return" warning for a disconnect that never ran. Arm it in the finally, immediately before the drain + disconnect it exists to bound — the same shape handleCliOnly's teardown already uses. Slow handlers stay bounded piecemeal by the layers underneath (6s query-embed deadline, 60s/300s AI gateway deadlines, Postgres statement_timeout default 5min) — there is deliberately no uniform whole-run wallclock after this change; the old one masked failures as exit-0 successes. Adds a source-shape regression test pinning the placement (same rationale as test/fix-wave-structural.test.ts) and updates the one KEY_FILES.md phrase that described the old whole-run behavior. Root-cause diagnosis by @danashburn in garrytan#1775. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1775.
The bug
The 10s "disconnect hang guard" in
src/cli.tsis armed at the top of the op-dispatch path, before the op handler runs, and only cleared in thefinallyafterengine.disconnect()returns. That makes it accidentally a whole-run deadline, not a disconnect guard: any shared op whose total wall time exceeds 10s getsprocess.exit(process.exitCode ?? 0)mid-handler. The user sees empty stdout, exit 0, and a misleadingengine.disconnect() did not return within 10000mswarning for a disconnect that never ran.This is why #1775 reproduces only on
search/query: they're the shared ops that routinely exceed 10s (remote Postgres over WAN, reranker passes, LLM query expansion), whilelist/get/statusfinish fast, and bulk commands (embed,sync) dispatch through the CLI-only path that never hits this timer. Root cause was diagnosed exactly by @danashburn in the issue thread — this PR is that one-move fix, plus a regression test.Note: #2015 (
endPoolBounded) fixes a different bug in the same blast radius — a genuinely hungpool.end()against PgBouncer. It doesn't touch this timer; with #2015 merged, an op handler slower than 10s still gets killed mid-run. The two compose.The fix
Arm the timer in the
finally, immediately beforedrainAllBackgroundWorkForCliExit()+engine.disconnect()— the teardown it exists to bound (its warning message already says so). The C13 defense-in-depth contract ("a hung disconnect cannot defeat the force-exit") is fully preserved: the timer is armed synchronously at teardown start, before the drain/disconnect awaits. This also brings the op-dispatch path to the same shapehandleCliOnly's teardown has used since v0.42.20.0 (#1762) — the pre-handler placement was the outlier, not the design.Trade-off, stated honestly: a hung op handler is no longer killed at 10s. There is deliberately no uniform whole-run wallclock after this change — handlers are bounded piecemeal by the layers underneath (6s query-embed deadline in hybrid.ts, 60s/300s AI gateway deadlines via
withDefaultTimeout, Postgresstatement_timeoutdefault 5min), so the worst-case wedge for a pathological case (e.g. client-side TCP blackhole mid-query) moves from 10s to ~minutes. The old behavior was strictly worse for that case too: it reported the failure as an exit-0 success with empty output. If a uniform backstop is wanted for unattended callers, a generous (minutes-scale) pre-handler deadline with a distinct warning and nonzero exit would be a separate, additive change — happy to file it.Tests
test/cli-disconnect-deadline-placement.test.ts— source-shape regression test pinning that the arming sits afterop.handler(...)and before the drain, with exactly one arming site inmain(). Placement is the load-bearing property, so this follows the pattern the repo endorses intest/fix-wave-structural.test.ts(a behavioral repro would need a >10s op against a live engine just to observe the timer fire).Also updates the one
KEY_FILES.mdphrase that described the old whole-run behavior ("hanging past the 10s force-exit") to the current truth.Verification
test/cli-should-force-exit.test.ts,test/cli.test.ts,test/cli-options.test.ts,test/cli-dispatch-thin-client.test.ts,test/fix-wave-structural.test.ts,test/build-llms.test.ts: all pass. The placement test fails on current master, passes with the fix.bun run typecheck: clean.gbrain querytakes 13–28s: before the fix, empty stdout + exit 0 + the misleading warning at exactly 10s; after, full ranked results print and the process exits cleanly.🤖 Generated with Claude Code