fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775) by pabloglzg · Pull Request #2032 · garrytan/gbrain

pabloglzg · 2026-06-10T04:21:58Z

The bug

The 10s "disconnect hang guard" in src/cli.ts is armed at the top of the op-dispatch path, before the op handler runs, and only cleared in the finally after engine.disconnect() returns. That makes it accidentally a whole-run deadline, not a disconnect guard: any shared op whose total wall time exceeds 10s gets process.exit(process.exitCode ?? 0) mid-handler. The user sees empty stdout, exit 0, and a misleading engine.disconnect() did not return within 10000ms warning for a disconnect that never ran.

This is why #1775 reproduces only on search/query: they're the shared ops that routinely exceed 10s (remote Postgres over WAN, reranker passes, LLM query expansion), while list/get/status finish fast, and bulk commands (embed, sync) dispatch through the CLI-only path that never hits this timer. Root cause was diagnosed exactly by @danashburn in the issue thread — this PR is that one-move fix, plus a regression test.

Note: #2015 (endPoolBounded) fixes a different bug in the same blast radius — a genuinely hung pool.end() against PgBouncer. It doesn't touch this timer; with #2015 merged, an op handler slower than 10s still gets killed mid-run. The two compose.

The fix

Arm the timer in the finally, immediately before drainAllBackgroundWorkForCliExit() + engine.disconnect() — the teardown it exists to bound (its warning message already says so). The C13 defense-in-depth contract ("a hung disconnect cannot defeat the force-exit") is fully preserved: the timer is armed synchronously at teardown start, before the drain/disconnect awaits. This also brings the op-dispatch path to the same shape handleCliOnly's teardown has used since v0.42.20.0 (#1762) — the pre-handler placement was the outlier, not the design.

Trade-off, stated honestly: a hung op handler is no longer killed at 10s. There is deliberately no uniform whole-run wallclock after this change — handlers are bounded piecemeal by the layers underneath (6s query-embed deadline in hybrid.ts, 60s/300s AI gateway deadlines via withDefaultTimeout, Postgres statement_timeout default 5min), so the worst-case wedge for a pathological case (e.g. client-side TCP blackhole mid-query) moves from 10s to ~minutes. The old behavior was strictly worse for that case too: it reported the failure as an exit-0 success with empty output. If a uniform backstop is wanted for unattended callers, a generous (minutes-scale) pre-handler deadline with a distinct warning and nonzero exit would be a separate, additive change — happy to file it.

Tests

test/cli-disconnect-deadline-placement.test.ts — source-shape regression test pinning that the arming sits after op.handler(...) and before the drain, with exactly one arming site in main(). Placement is the load-bearing property, so this follows the pattern the repo endorses in test/fix-wave-structural.test.ts (a behavioral repro would need a >10s op against a live engine just to observe the timer fire).

Also updates the one KEY_FILES.md phrase that described the old whole-run behavior ("hanging past the 10s force-exit") to the current truth.

Verification

New test + test/cli-should-force-exit.test.ts, test/cli.test.ts, test/cli-options.test.ts, test/cli-dispatch-thin-client.test.ts, test/fix-wave-structural.test.ts, test/build-llms.test.ts: all pass. The placement test fails on current master, passes with the fix.
bun run typecheck: clean.
Manually verified on a Postgres brain with a local reranker where gbrain query takes 13–28s: before the fix, empty stdout + exit 0 + the misleading warning at exactly 10s; after, full ranked results print and the process exits cleanly.

🤖 Generated with Claude Code

@danashburn

…e op handler (garrytan#1775) The 10s force-exit timer was installed before the op handler ran, making it a whole-run deadline: any shared op slower than 10s total (search/query against remote Postgres, a reranker pass, a query with LLM expansion) got process.exit() mid-handler — empty stdout, exit 0, and a misleading "engine.disconnect() did not return" warning for a disconnect that never ran. Arm it in the finally, immediately before the drain + disconnect it exists to bound — the same shape handleCliOnly's teardown already uses. Slow handlers stay bounded piecemeal by the layers underneath (6s query-embed deadline, 60s/300s AI gateway deadlines, Postgres statement_timeout default 5min) — there is deliberately no uniform whole-run wallclock after this change; the old one masked failures as exit-0 successes. Adds a source-shape regression test pinning the placement (same rationale as test/fix-wave-structural.test.ts) and updates the one KEY_FILES.md phrase that described the old whole-run behavior. Root-cause diagnosis by @danashburn in garrytan#1775. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775)#2032

fix(cli): arm the disconnect hard-deadline at teardown, not before the op handler (#1775)#2032
pabloglzg wants to merge 1 commit into
garrytan:masterfrom
pabloglzg:fix/cli-force-exit-timer

pabloglzg commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pabloglzg commented Jun 10, 2026

The bug

The fix

Tests

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant