Skip to content

fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775)#2032

Open
pabloglzg wants to merge 1 commit into
garrytan:masterfrom
pabloglzg:fix/cli-force-exit-timer
Open

fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775)#2032
pabloglzg wants to merge 1 commit into
garrytan:masterfrom
pabloglzg:fix/cli-force-exit-timer

Conversation

@pabloglzg

Copy link
Copy Markdown
Contributor

Fixes #1775.

The bug

The 10s "disconnect hang guard" in src/cli.ts is armed at the top of the op-dispatch path, before the op handler runs, and only cleared in the finally after engine.disconnect() returns. That makes it accidentally a whole-run deadline, not a disconnect guard: any shared op whose total wall time exceeds 10s gets process.exit(process.exitCode ?? 0) mid-handler. The user sees empty stdout, exit 0, and a misleading engine.disconnect() did not return within 10000ms warning for a disconnect that never ran.

This is why #1775 reproduces only on search/query: they're the shared ops that routinely exceed 10s (remote Postgres over WAN, reranker passes, LLM query expansion), while list/get/status finish fast, and bulk commands (embed, sync) dispatch through the CLI-only path that never hits this timer. Root cause was diagnosed exactly by @danashburn in the issue thread — this PR is that one-move fix, plus a regression test.

Note: #2015 (endPoolBounded) fixes a different bug in the same blast radius — a genuinely hung pool.end() against PgBouncer. It doesn't touch this timer; with #2015 merged, an op handler slower than 10s still gets killed mid-run. The two compose.

The fix

Arm the timer in the finally, immediately before drainAllBackgroundWorkForCliExit() + engine.disconnect() — the teardown it exists to bound (its warning message already says so). The C13 defense-in-depth contract ("a hung disconnect cannot defeat the force-exit") is fully preserved: the timer is armed synchronously at teardown start, before the drain/disconnect awaits. This also brings the op-dispatch path to the same shape handleCliOnly's teardown has used since v0.42.20.0 (#1762) — the pre-handler placement was the outlier, not the design.

Trade-off, stated honestly: a hung op handler is no longer killed at 10s. There is deliberately no uniform whole-run wallclock after this change — handlers are bounded piecemeal by the layers underneath (6s query-embed deadline in hybrid.ts, 60s/300s AI gateway deadlines via withDefaultTimeout, Postgres statement_timeout default 5min), so the worst-case wedge for a pathological case (e.g. client-side TCP blackhole mid-query) moves from 10s to ~minutes. The old behavior was strictly worse for that case too: it reported the failure as an exit-0 success with empty output. If a uniform backstop is wanted for unattended callers, a generous (minutes-scale) pre-handler deadline with a distinct warning and nonzero exit would be a separate, additive change — happy to file it.

Tests

test/cli-disconnect-deadline-placement.test.ts — source-shape regression test pinning that the arming sits after op.handler(...) and before the drain, with exactly one arming site in main(). Placement is the load-bearing property, so this follows the pattern the repo endorses in test/fix-wave-structural.test.ts (a behavioral repro would need a >10s op against a live engine just to observe the timer fire).

Also updates the one KEY_FILES.md phrase that described the old whole-run behavior ("hanging past the 10s force-exit") to the current truth.

Verification

  • New test + test/cli-should-force-exit.test.ts, test/cli.test.ts, test/cli-options.test.ts, test/cli-dispatch-thin-client.test.ts, test/fix-wave-structural.test.ts, test/build-llms.test.ts: all pass. The placement test fails on current master, passes with the fix.
  • bun run typecheck: clean.
  • Manually verified on a Postgres brain with a local reranker where gbrain query takes 13–28s: before the fix, empty stdout + exit 0 + the misleading warning at exactly 10s; after, full ranked results print and the process exits cleanly.

🤖 Generated with Claude Code

…e op handler (garrytan#1775)

The 10s force-exit timer was installed before the op handler ran, making
it a whole-run deadline: any shared op slower than 10s total (search/query
against remote Postgres, a reranker pass, a query with LLM expansion) got
process.exit() mid-handler — empty stdout, exit 0, and a misleading
"engine.disconnect() did not return" warning for a disconnect that never
ran. Arm it in the finally, immediately before the drain + disconnect it
exists to bound — the same shape handleCliOnly's teardown already uses.

Slow handlers stay bounded piecemeal by the layers underneath (6s
query-embed deadline, 60s/300s AI gateway deadlines, Postgres
statement_timeout default 5min) — there is deliberately no uniform
whole-run wallclock after this change; the old one masked failures as
exit-0 successes.

Adds a source-shape regression test pinning the placement (same rationale
as test/fix-wave-structural.test.ts) and updates the one KEY_FILES.md
phrase that described the old whole-run behavior.

Root-cause diagnosis by @danashburn in garrytan#1775.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

search/query render no output + engine.disconnect() 10s force-exit on 0.42.8.0 (Postgres brain); 0.22.8 works

1 participant