fix(serve): clean up stdio MCP server on client disconnect#676
fix(serve): clean up stdio MCP server on client disconnect#676seungsu-kr wants to merge 1 commit intogarrytan:masterfrom
Conversation
When a parent process disconnects from `gbrain serve` (stdio mode), the
PGLite write lock is leaked because:
1. `commands/serve.ts` never calls `engine.disconnect()` after
`startMcpServer(engine)` resolves.
2. `cli.ts` short-circuits the serve case (`// serve doesn't disconnect`).
3. The MCP SDK's `StdioServerTransport` does not register `'end'` or
`'close'` listeners on stdin (only `'data'` and `'error'`), so even a
clean stdin EOF never reaches the SDK.
4. There is no signal handler covering SIGTERM / SIGINT / SIGHUP.
5. There is no watchdog for the orphan case where the parent dies
without closing stdin or sending a signal (the kernel just
re-parents to PID 1; no event fires).
Net effect: the server keeps the engine and its
`<datadir>/.gbrain-lock/lock` acquired indefinitely after the parent
disconnects. The next `gbrain serve` either times out or waits for the
in-process 5-minute stale-lock check.
Patch (stdio path only — HTTP / OAuth path unchanged):
- Stdin EOF / close watchers (`once('end')` + `once('close')`), gated
on `!isTTY` so interactive use is unaffected. 'end' covers clean
parent disconnect; 'close' covers SIGKILL'd parent leaving the
handle destroyed.
- SIGTERM / SIGINT / SIGHUP handlers, all funneled into the same
idempotent path. SIGHUP matters in the wild — Claude
Desktop on macOS and hermes-agent restarts have been observed
sending it instead of closing stdin.
- Parent-process watchdog: 5s `setInterval` polling `process.ppid`.
If the captured initial parent (non-1) becomes 1, we got
re-parented to init → parent died → graceful exit. Catches
launchd / cron / certain MCP gateways that just terminate without
notification. Skipped when initial ppid is already 1 (legitimate
init-child like a systemd unit).
- Idempotent `beginShutdown()` — sync `shuttingDown` flag prevents
double-disconnect across racing events. Watchdog is cleared first
on cleanup so a coincident tick can't queue a redundant pass.
- 5-second cleanup deadline; if `engine.disconnect()` wedges we still
`process.exit(0)` (the abandoned lock dir is reclaimed by the
existing `process.kill(pid, 0)` stale check on the next attempt).
- Optional `--stdio-idle-timeout <sec>` opt-in (default OFF) for
parents that leak the pipe but never close it. Resets on every
stdin `'data'` chunk. Strict parsing — typos / blank values throw
instead of silently disabling the safety net.
Test seams (`ServeOptions { stdin, signals, exit, log, startMcpServer,
getParentPid, setInterval, clearInterval }`) let
`test/serve-stdio-lifecycle.test.ts` cover the lifecycle end-to-end
without spawning a real Bun child or booting the real MCP SDK. Fake
timers drive the watchdog deterministically without 5s of wall clock.
Verified:
- 20 new tests pass: signals (SIGTERM / SIGINT / SIGHUP), stdin EOF
(end + close), idempotency, cleanup-failure / deadline, TTY-skip,
watchdog (fires on ppid 0->1, NOT installed when initial ppid=1, no
false fire on healthy tick), strict idle-timeout parsing
(rejects 'abc' / '30junk' / '-1' / '1.5' / '' / missing).
- Adjacent suites (cli, cli-options, mcp-tool-defs, pglite-lock):
46/46 unchanged.
- `bun run typecheck` clean.
Related to garrytan#413 (stdin=/dev/null lock contention) and garrytan#446 (idle timeout
feature request). Materially overlaps with garrytan#591; this commit started as
an independent patch and integrates SIGHUP + stdin 'close' + the PPID
watchdog after discovering garrytan#591.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hey @seungsu-kr — thank you for this. It's a real chronic bug we've been hitting. We're deferring this PR from the v0.30.3 fix wave (#776) to keep that wave's size manageable, but it'll ship as its own focused PR right after. I'll ping you here when it's ready to land. Apologies for the delay. |
Annotate the v0.31.3 changes in the canonical Key Files section: new src/core/sql-query.ts adapter (#681), src/commands/serve.ts stdio cleanup (#676), v0.31.3 amendments to auth.ts / serve-http.ts / oauth-provider.ts surfaces, and migration v46 normalizer in migrate.ts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…closes #413, #446) (#801) * fix(serve): clean up stdio MCP server on client disconnect The PGLite write lock leaked indefinitely when the parent of `gbrain serve` disconnected. Three root causes: serve.ts never called engine.disconnect() after startMcpServer() resolved; cli.ts short-circuited with a "serve doesn't disconnect" comment; and the MCP SDK's StdioServerTransport only listens for 'data'/'error' on stdin, never 'end'/'close', so even a clean stdin EOF never reached the SDK. Net effect: the next `gbrain serve` waited for the in-process 5-minute stale- lock check or hung indefinitely. stdio path now installs a unified lifecycle: - SIGTERM/SIGINT/SIGHUP all funnel into one idempotent shutdown path (SIGHUP coverage matters for Claude Desktop on macOS / MCP gateway restarts; SIGINT for Ctrl-C; SIGTERM for daemon shutdown). - stdin 'end' (clean EOF) and 'close' (parent SIGKILL with pipe still open) both trigger the same graceful path. TTY stdin skips the watchers so interactive `gbrain serve` is unaffected. - Parent-process watchdog polls the live kernel parent PID via spawnSync ('ps','-o','ppid=','-p',PID) every 5s. process.ppid is cached at process creation by Bun (and Node) and never refreshes on re-parent — empirical evidence on macOS shows ps reports the new parent within one tick while process.ppid stays at the original PID indefinitely (oven-sh/bun#30305). - Watchdog fires on `getParentPid() !== initialParentPid` (any reparent), not just `=== 1`. Catches launchd / systemd / tmux / parent-shell-with- PR_SET_CHILD_SUBREAPER cases where the kernel re-anchors us to a non-1 subreaper PID. Codex review caught the original `=== 1` was incomplete. - One-shot startup probe verifies `spawnSync('ps')` actually works on this host. If the probe fails (stripped containers / busybox without procps), we skip installing the watchdog interval entirely AND emit a loud stderr line — the operator sees "watchdog disabled" instead of an installed- but-never-fires phantom that silently falls back to cached process.ppid. - 5-second cleanup deadline: if engine.disconnect() wedges (PGLite WASM stall, etc.), the process still calls process.exit(0). The abandoned lock dir is reclaimed on the next start by the existing stale-lock check in pglite-lock.ts. - Optional `--stdio-idle-timeout <sec>`: default OFF safety net for parents that leak the pipe but never close it. Strict parsing rejects `abc` / `30junk` / `-1` / `1.5` / blank values explicitly so a typo doesn't silently disable the safety net (closes #446). Test seam: ServeOptions { stdin, signals, exit, log, startMcpServer, getParentPid, setInterval, clearInterval, probeWatchdog } lets the lifecycle be unit-tested deterministically without spawning a real Bun child or booting the MCP SDK. 22 test cases covering signals, stdin EOF, TTY skip, watchdog reparent (both PID-1 and subreaper-PID-N cases), ps-unavailable degraded mode, idle timeout, idempotent shutdown, and cleanup-deadline behavior. Closes #413, #446. Supersedes #591. Co-Authored-By: Aragorn2046 <noreply@github.com> Co-Authored-By: seungsu-kr <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(auth): route HTTP auth/admin SQL through active engine `gbrain auth` and `gbrain serve --http` previously routed every SQL through the postgres.js singleton in src/core/db.ts, which silently fell back to a file-backed PGLite when DATABASE_URL was set but the config file disagreed. The HTTP transport's verbatim use of the singleton also made `gbrain serve --http` Postgres-only, even though the `access_tokens` and `mcp_request_log` tables exist in both engine schemas. Auth, OAuth, admin, file uploads, and HTTP-transport SQL now run through `engine.executeRaw` via a deliberately narrow tagged-template adapter (`src/core/sql-query.ts`). The contract is scalar-binds-only — adding JSONB or fragment composition would invite the adapter to drift into a partial postgres.js clone. JSONB writes use a separate `executeRawJsonb(engine, sql, scalarParams, jsonbParams)` helper that composes positional `$N::jsonb` casts and passes objects through `engine.executeRaw`. The CI guard at `scripts/check-jsonb-pattern.sh` doesn't fire because the helper is a method call, not the banned `${JSON.stringify(x)}::jsonb` template-literal interpolation, and the v0.12.0 double-encode bug class doesn't apply to positional binding via `postgres.js`'s `unsafe()` (verified by `test/e2e/auth-permissions.test.ts:67` on Postgres and the new `test/sql-query.test.ts` on PGLite). Migrated call sites: - src/commands/auth.ts: takes-holders writes (lines 52, 86) → executeRawJsonb. List, revoke, register-client, revoke-client → SqlQuery via withConfiguredSql() helper that opens an engine, runs the callback, disconnects. - src/commands/serve-http.ts: ~25 call sites including the four mcp_request_log.params INSERTs (now write real JSONB objects, not JSON-encoded strings — the read side `params->>'op'` returns the operation name, closing CLAUDE.md's outstanding "JSON-string-into- JSONB" note as a side effect). The /admin/api/requests dynamic filter pattern (postgres.js fragment composition) is rewritten as parametrized SQL string + params array. - src/mcp/http-transport.ts: legacy bearer-auth path. The Postgres-only fail-fast at startup is removed because both schemas now carry access_tokens + mcp_request_log. - src/core/oauth-provider.ts: SqlQuery / SqlValue types relocated from here to sql-query.ts as the canonical home (Codex finding #8). - src/commands/files.ts: all 5 db.getConnection() sites (lines 104, 139, 252, 326, 355). The line-256 INSERT into files.metadata uses executeRawJsonb; the other four are scalar-only SqlQuery (Codex finding #6 — scope was bigger than the plan's "lone INSERT" framing). - src/core/config.ts: env-var DATABASE_URL inference. When dbUrl is set, infer Postgres engine and clear the stale database_path. Engine-internal sql.json() sites in src/core/postgres-engine.ts (5 sites: lines 520, 1689, 1728, 1790, 2313) STAY UNCHANGED. They live inside PostgresEngine itself, where the postgres.js template-tag sql.json() pattern is correct — those methods are only loaded when Postgres is the active engine, so there's no PGLite-routing concern. Migration v45 (mcp_request_log_params_jsonb_normalize): one-shot UPDATE that lifts pre-v0.31 string-shaped JSONB rows to objects so the /admin/api/requests endpoint at serve-http.ts:605 returns one consistent shape to the admin SPA. Idempotent (subsequent runs find no rows where jsonb_typeof = 'string'). Closes the mixed-shape window that would otherwise have made post-deploy admin reads break. Tests: - test/sql-query.test.ts: 7 cases covering scalar binds, the .json() rejection (defense in depth — SqlQuery is scalar-only), JSONB round-trip with `jsonb_typeof = 'object'` and `->>` semantics, the v0.12.0 double-encode regression guard, null JSONB handling, and the scalars-then-jsonb call shape. - test/config-env.test.ts: migrated from PR's manual `restoreEnv()` in afterEach to the canonical `withEnv()` helper at test/helpers/with-env.ts (CLAUDE.md R1 / codex finding D3). Five cases covering DATABASE_URL precedence, GBRAIN_DATABASE_URL operator override, file-only config, env-only config, and the no-config null path. - test/e2e/auth-takes-holders-pglite.test.ts: 6 cases against in-memory PGLite (no DATABASE_URL gate). Covers create / update / read of access_tokens.permissions, mcp_request_log.params object + null writes, and the migration v45 normalizer (seed string-shaped row, run UPDATE, assert object shape; second-run no-op for idempotency). - test/http-transport.test.ts: mock updated to intercept engine.executeRaw (the new code path) instead of the postgres.js template tag. 24 cases pass. Plan reference: ~/.claude/plans/system-instruction-you-are-working-peppy-moore.md. Codex outside-voice review applied: D-codex-1, D-codex-2, D-codex-5, D-codex-8, D-codex-9, D-codex-10 (and D1, D5 reversed by codex). Closes the architectural intent of #681. Supersedes its branch. Co-Authored-By: codex-bot <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: update CLAUDE.md key files for v0.31.3 Annotate the v0.31.3 changes in the canonical Key Files section: new src/core/sql-query.ts adapter (#681), src/commands/serve.ts stdio cleanup (#676), v0.31.3 amendments to auth.ts / serve-http.ts / oauth-provider.ts surfaces, and migration v46 normalizer in migrate.ts. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: regenerate llms-full.txt for v0.31.3 docs sync CI's build-llms test asserts the committed llms.txt + llms-full.txt match what scripts/build-llms.ts produces from current source state. CLAUDE.md was amended by /document-release post-merge (new entries for src/core/sql-query.ts and src/commands/serve.ts; amended notes on auth.ts / serve-http.ts / migrate.ts), so the inlined-bundle fell out of sync. Regenerated via `bun run build:llms`. llms.txt unchanged (curated index — no new web URLs added). llms-full.txt updated to inline the new CLAUDE.md content. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Aragorn2046 <noreply@github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Closing as superseded by #801 (v0.31.3, merged 2026-05-10). The stdio cleanup from this branch landed verbatim, with two additional cases on top:
Credit lines in the v0.31.3 CHANGELOG (
The Verified locally: 21 stale Thanks @garrytan for the disposition and merge, and @Aragorn2046 for the original work in #591 + Shelby's path-B endorsement that unblocked this PR. |
Summary
gbrain serve(stdio mode) leaks the PGLite write lock when the parent process disconnects. Three structural gaps:src/commands/serve.tsnever callsengine.disconnect()afterstartMcpServer(engine)resolves.src/cli.ts:case 'serve'short-circuits with// serve doesn't disconnect.StdioServerTransport(@modelcontextprotocol/sdk) only registers'data'/'error'listeners on stdin — never'end'/'close'— so even a clean stdin EOF never reaches the SDK.Net effect: the server keeps the engine and its
<datadir>/.gbrain-lock/lockacquired indefinitely after the parent disconnects. The nextgbrain serveeither times out or waits for the in-process 5-minute stale-lock check.Scope
Stdio path only. The v0.26+ HTTP / OAuth path (
runServeHttpinserve-http.ts) is preserved verbatim.Patch
src/commands/serve.ts:once('end')andonce('close'), gated on!isTTYso interactivegbrain servein a terminal is unaffected.'end'covers clean parent disconnect;'close'covers a SIGKILL'd parent leaving the handle destroyed.SIGTERM/SIGINT/SIGHUPhandlers — funneled into the same idempotent shutdown path. SIGHUP coverage matters in real deployments (Claude Desktop on macOS, MCP gateway restarts).initialParentPid !== 1(so a process legitimately spawned under PID 1 like a systemd unit doesn't immediately self-terminate). Detects launchd / cron / orphan scenarios where the parent dies without closing stdin or sending a signal.engine.disconnect()wedges (PGLite WASM stall, etc.), the process still callsprocess.exit(0). Any abandoned lock dir is reclaimed on the next attempt by the existingprocess.kill(pid, 0)stale-lock check inpglite-lock.ts.--stdio-idle-timeout <sec>— default OFF safety net for parents that leak the pipe but never close it and never send another message. Strict parsing: typos / blank values throw immediately instead of silently disabling the safety net (closes Feature request:gbrain serveidle timeout (auto-exit after inactivity) #446).Test seam:
ServeOptions { stdin, signals, exit, log, startMcpServer, getParentPid, setInterval, clearInterval }lets the lifecycle be unit-tested without spawning a real Bun child or booting the real MCP SDK. Fake timers drive the watchdog deterministically without 5s of wall clock.process.ppidis cached on Bun (and never updates on re-parent)While building real-process repro evidence, I found that
process.ppidin Bun (and Node — see oven-sh/bun#30305 filed by @Aragorn2046) is captured at process creation and never refreshed when the kernel re-parents the child to PID 1. The kernel re-parents correctly within one second (ps -o ppid= -p $$returns1), butprocess.ppidstays at the original parent's PID for the entire process lifetime.So a naive
if (process.ppid === 1)watchdog will not actually fire in a real launchd / cron / orphaned-PPID scenario. The patch routes through areadLiveParentPid()helper that callsspawnSync('ps', ['-o', 'ppid=', '-p', String(process.pid)])per tick (~10ms / 5s, falls back toprocess.ppidifpsfails). This was the difference between the watchdog firing and not firing in the harness below.Tests
test/serve-stdio-lifecycle.test.ts— 20 cases, all passing on the rebased branch:SIGTERM/SIGINT/SIGHUPeach trigger graceful exit (3).'end'and'close'each trigger graceful exit (2).engine.disconnect()once (1).engine.disconnect()rejects, exit still happens withcode=0and a logged error (1).0 → 1, NOT installed when initial ppid is already 1, no false fire on healthy ticks (3).--stdio-idle-timeout: arms / resets on data / fires on idle / 0 disarms / strict parse rejectsabc/30junk/-1/1.5/ blank / missing (9).Adjacent unit suites (
cli,cli-options,mcp-tool-defs,pglite-lock,mcp-eval-capture): 77/77 unchanged.Full
bun teston the rebased branch: 3864 pass / 341 skip / 4 fail. The 4 failures (brain-registry.serial,cli-options skillpack-check timeout, twoe2e/dream-cycle-eight-phase-pglitecases) all reproduce identically on9c2dc4cmaster without this patch — pre-existing on upstream, no regression introduced by this PR.bun run typecheck: clean.Real-process repro
Harness (
run.sh+parent.py) drives 4 disconnect cases against two builds — baseline = upstream9c2dc4c, patched = this branch — withGBRAIN_HOMEpointing at a fresh PGLite per case. Lock state is captured BEFORE runninggbrain statssostats's own stale-lock cleanup can't skew the snapshot. Primary signal = child liveness + lock dir's PID JSON +kill -0;gbrain statslatency is secondary.stats(latency)watchdog-fifois the launchd / orphan case — child's stdin is held open by an external keeper (named FIFO) so the patch cannot fall through to stdin EOF; only the watchdog can detect parent death. That row was the one that would have silently failed without theprocess.ppid→spawnSync('ps')fix.Environment: macOS arm64, Bun 1.3.12.
Cross-references
'close'/ parent-process watchdog credit stays with @Aragorn2046's three features in the commit message.gbrain serveidle timeout (auto-exit after inactivity) #446 — "Feature request:gbrain serveidle timeout".process.ppidBun-cache bug that motivates thereadLiveParentPid()fallback (filed by @Aragorn2046 with confirmation that Node.js refreshes ppid correctly in the same scenario).cc @Aragorn2046
Patch, repro harness, and PR body drafted in a Claude Code (Opus 4.7) session. Submitted by @seungsu-kr after sign-off.
Need help on this PR? Tag
@codesmithwith what you need.