Skip to content

fix(serve): clean up stdio MCP server on client disconnect#676

Closed
seungsu-kr wants to merge 1 commit intogarrytan:masterfrom
seungsu-kr:feature/stdio-graceful-cleanup-rebased
Closed

fix(serve): clean up stdio MCP server on client disconnect#676
seungsu-kr wants to merge 1 commit intogarrytan:masterfrom
seungsu-kr:feature/stdio-graceful-cleanup-rebased

Conversation

@seungsu-kr
Copy link
Copy Markdown

@seungsu-kr seungsu-kr commented May 6, 2026

Summary

gbrain serve (stdio mode) leaks the PGLite write lock when the parent process disconnects. Three structural gaps:

  1. src/commands/serve.ts never calls engine.disconnect() after startMcpServer(engine) resolves.
  2. src/cli.ts:case 'serve' short-circuits with // serve doesn't disconnect.
  3. The MCP SDK's StdioServerTransport (@modelcontextprotocol/sdk) only registers 'data' / 'error' listeners on stdin — never 'end' / 'close' — so even a clean stdin EOF never reaches the SDK.

Net effect: the server keeps the engine and its <datadir>/.gbrain-lock/lock acquired indefinitely after the parent disconnects. The next gbrain serve either times out or waits for the in-process 5-minute stale-lock check.

Scope

Stdio path only. The v0.26+ HTTP / OAuth path (runServeHttp in serve-http.ts) is preserved verbatim.

Patch

src/commands/serve.ts:

  • Stdin EOF + close watchersonce('end') and once('close'), gated on !isTTY so interactive gbrain serve in a terminal is unaffected. 'end' covers clean parent disconnect; 'close' covers a SIGKILL'd parent leaving the handle destroyed.
  • SIGTERM / SIGINT / SIGHUP handlers — funneled into the same idempotent shutdown path. SIGHUP coverage matters in real deployments (Claude Desktop on macOS, MCP gateway restarts).
  • Parent-process watchdog — 5s polling, gated on initialParentPid !== 1 (so a process legitimately spawned under PID 1 like a systemd unit doesn't immediately self-terminate). Detects launchd / cron / orphan scenarios where the parent dies without closing stdin or sending a signal.
  • 5-second cleanup deadline — if engine.disconnect() wedges (PGLite WASM stall, etc.), the process still calls process.exit(0). Any abandoned lock dir is reclaimed on the next attempt by the existing process.kill(pid, 0) stale-lock check in pglite-lock.ts.
  • Optional --stdio-idle-timeout <sec> — default OFF safety net for parents that leak the pipe but never close it and never send another message. Strict parsing: typos / blank values throw immediately instead of silently disabling the safety net (closes Feature request: gbrain serve idle timeout (auto-exit after inactivity) #446).

Test seam: ServeOptions { stdin, signals, exit, log, startMcpServer, getParentPid, setInterval, clearInterval } lets the lifecycle be unit-tested without spawning a real Bun child or booting the real MCP SDK. Fake timers drive the watchdog deterministically without 5s of wall clock.

process.ppid is cached on Bun (and never updates on re-parent)

While building real-process repro evidence, I found that process.ppid in Bun (and Node — see oven-sh/bun#30305 filed by @Aragorn2046) is captured at process creation and never refreshed when the kernel re-parents the child to PID 1. The kernel re-parents correctly within one second (ps -o ppid= -p $$ returns 1), but process.ppid stays at the original parent's PID for the entire process lifetime.

startup process.ppid=87589
tick    process.ppid=87589  live(ps)=87589
tick    process.ppid=87589  live(ps)=1     ← kernel re-parented; process.ppid did NOT update
tick    process.ppid=87589  live(ps)=1
tick    process.ppid=87589  live(ps)=1

So a naive if (process.ppid === 1) watchdog will not actually fire in a real launchd / cron / orphaned-PPID scenario. The patch routes through a readLiveParentPid() helper that calls spawnSync('ps', ['-o', 'ppid=', '-p', String(process.pid)]) per tick (~10ms / 5s, falls back to process.ppid if ps fails). This was the difference between the watchdog firing and not firing in the harness below.

Tests

test/serve-stdio-lifecycle.test.ts — 20 cases, all passing on the rebased branch:

  • Signals: SIGTERM / SIGINT / SIGHUP each trigger graceful exit (3).
  • Stdin EOF: 'end' and 'close' each trigger graceful exit (2).
  • TTY stdin does NOT install end watcher (interactive use unaffected) (1).
  • Idempotency: multiple racing signals only engine.disconnect() once (1).
  • Cleanup deadline: when engine.disconnect() rejects, exit still happens with code=0 and a logged error (1).
  • Watchdog: fires on ppid 0 → 1, NOT installed when initial ppid is already 1, no false fire on healthy ticks (3).
  • --stdio-idle-timeout: arms / resets on data / fires on idle / 0 disarms / strict parse rejects abc / 30junk / -1 / 1.5 / blank / missing (9).

Adjacent unit suites (cli, cli-options, mcp-tool-defs, pglite-lock, mcp-eval-capture): 77/77 unchanged.

Full bun test on the rebased branch: 3864 pass / 341 skip / 4 fail. The 4 failures (brain-registry.serial, cli-options skillpack-check timeout, two e2e/dream-cycle-eight-phase-pglite cases) all reproduce identically on 9c2dc4c master without this patch — pre-existing on upstream, no regression introduced by this PR.

bun run typecheck: clean.

Real-process repro

Harness (run.sh + parent.py) drives 4 disconnect cases against two builds — baseline = upstream 9c2dc4c, patched = this branch — with GBRAIN_HOME pointing at a fresh PGLite per case. Lock state is captured BEFORE running gbrain stats so stats's own stale-lock cleanup can't skew the snapshot. Primary signal = child liveness + lock dir's PID JSON + kill -0; gbrain stats latency is secondary.

case side child PID lock state (pre-stats) stats (latency) lifecycle reason
stdin-close baseline alive live-lock pid=88224 timeout-8s (8s) (none)
stdin-close patched alive no-lock-dir ok (0s) graceful exit (stdin-end)
sigterm baseline alive live-lock pid=88331 timeout-8s (8s) (none)
sigterm patched alive no-lock-dir ok (0s) graceful exit (SIGTERM)
sigkill-parent baseline dead stale-lock pid=88415 ok (0s) (none)
sigkill-parent patched dead no-lock-dir ok (1s) graceful exit (stdin-end)
watchdog-fifo baseline alive live-lock pid=88573 timeout-8s (8s) (none)
watchdog-fifo patched dead no-lock-dir ok (1s) graceful exit (parent-died)

watchdog-fifo is the launchd / orphan case — child's stdin is held open by an external keeper (named FIFO) so the patch cannot fall through to stdin EOF; only the watchdog can detect parent death. That row was the one that would have silently failed without the process.ppidspawnSync('ps') fix.

Environment: macOS arm64, Bun 1.3.12.

Cross-references

cc @Aragorn2046


Patch, repro harness, and PR body drafted in a Claude Code (Opus 4.7) session. Submitted by @seungsu-kr after sign-off.


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

When a parent process disconnects from `gbrain serve` (stdio mode), the
PGLite write lock is leaked because:

  1. `commands/serve.ts` never calls `engine.disconnect()` after
     `startMcpServer(engine)` resolves.
  2. `cli.ts` short-circuits the serve case (`// serve doesn't disconnect`).
  3. The MCP SDK's `StdioServerTransport` does not register `'end'` or
     `'close'` listeners on stdin (only `'data'` and `'error'`), so even a
     clean stdin EOF never reaches the SDK.
  4. There is no signal handler covering SIGTERM / SIGINT / SIGHUP.
  5. There is no watchdog for the orphan case where the parent dies
     without closing stdin or sending a signal (the kernel just
     re-parents to PID 1; no event fires).

Net effect: the server keeps the engine and its
`<datadir>/.gbrain-lock/lock` acquired indefinitely after the parent
disconnects. The next `gbrain serve` either times out or waits for the
in-process 5-minute stale-lock check.

Patch (stdio path only — HTTP / OAuth path unchanged):

  - Stdin EOF / close watchers (`once('end')` + `once('close')`), gated
    on `!isTTY` so interactive use is unaffected. 'end' covers clean
    parent disconnect; 'close' covers SIGKILL'd parent leaving the
    handle destroyed.
  - SIGTERM / SIGINT / SIGHUP handlers, all funneled into the same
    idempotent path. SIGHUP matters in the wild — Claude
    Desktop on macOS and hermes-agent restarts have been observed
    sending it instead of closing stdin.
  - Parent-process watchdog: 5s `setInterval` polling `process.ppid`.
    If the captured initial parent (non-1) becomes 1, we got
    re-parented to init → parent died → graceful exit. Catches
    launchd / cron / certain MCP gateways that just terminate without
    notification. Skipped when initial ppid is already 1 (legitimate
    init-child like a systemd unit).
  - Idempotent `beginShutdown()` — sync `shuttingDown` flag prevents
    double-disconnect across racing events. Watchdog is cleared first
    on cleanup so a coincident tick can't queue a redundant pass.
  - 5-second cleanup deadline; if `engine.disconnect()` wedges we still
    `process.exit(0)` (the abandoned lock dir is reclaimed by the
    existing `process.kill(pid, 0)` stale check on the next attempt).
  - Optional `--stdio-idle-timeout <sec>` opt-in (default OFF) for
    parents that leak the pipe but never close it. Resets on every
    stdin `'data'` chunk. Strict parsing — typos / blank values throw
    instead of silently disabling the safety net.

Test seams (`ServeOptions { stdin, signals, exit, log, startMcpServer,
getParentPid, setInterval, clearInterval }`) let
`test/serve-stdio-lifecycle.test.ts` cover the lifecycle end-to-end
without spawning a real Bun child or booting the real MCP SDK. Fake
timers drive the watchdog deterministically without 5s of wall clock.

Verified:
  - 20 new tests pass: signals (SIGTERM / SIGINT / SIGHUP), stdin EOF
    (end + close), idempotency, cleanup-failure / deadline, TTY-skip,
    watchdog (fires on ppid 0->1, NOT installed when initial ppid=1, no
    false fire on healthy tick), strict idle-timeout parsing
    (rejects 'abc' / '30junk' / '-1' / '1.5' / '' / missing).
  - Adjacent suites (cli, cli-options, mcp-tool-defs, pglite-lock):
    46/46 unchanged.
  - `bun run typecheck` clean.

Related to garrytan#413 (stdin=/dev/null lock contention) and garrytan#446 (idle timeout
feature request). Materially overlaps with garrytan#591; this commit started as
an independent patch and integrates SIGHUP + stdin 'close' + the PPID
watchdog after discovering garrytan#591.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan
Copy link
Copy Markdown
Owner

garrytan commented May 9, 2026

Hey @seungsu-kr — thank you for this. It's a real chronic bug we've been hitting. We're deferring this PR from the v0.30.3 fix wave (#776) to keep that wave's size manageable, but it'll ship as its own focused PR right after. I'll ping you here when it's ready to land. Apologies for the delay.

garrytan added a commit that referenced this pull request May 10, 2026
Annotate the v0.31.3 changes in the canonical Key Files section:
new src/core/sql-query.ts adapter (#681), src/commands/serve.ts stdio
cleanup (#676), v0.31.3 amendments to auth.ts / serve-http.ts /
oauth-provider.ts surfaces, and migration v46 normalizer in migrate.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 10, 2026
…closes #413, #446) (#801)

* fix(serve): clean up stdio MCP server on client disconnect

The PGLite write lock leaked indefinitely when the parent of `gbrain serve`
disconnected. Three root causes: serve.ts never called engine.disconnect()
after startMcpServer() resolved; cli.ts short-circuited with a "serve doesn't
disconnect" comment; and the MCP SDK's StdioServerTransport only listens for
'data'/'error' on stdin, never 'end'/'close', so even a clean stdin EOF never
reached the SDK.

Net effect: the next `gbrain serve` waited for the in-process 5-minute stale-
lock check or hung indefinitely.

stdio path now installs a unified lifecycle:
- SIGTERM/SIGINT/SIGHUP all funnel into one idempotent shutdown path
  (SIGHUP coverage matters for Claude Desktop on macOS / MCP gateway
  restarts; SIGINT for Ctrl-C; SIGTERM for daemon shutdown).
- stdin 'end' (clean EOF) and 'close' (parent SIGKILL with pipe still
  open) both trigger the same graceful path. TTY stdin skips the watchers
  so interactive `gbrain serve` is unaffected.
- Parent-process watchdog polls the live kernel parent PID via spawnSync
  ('ps','-o','ppid=','-p',PID) every 5s. process.ppid is cached at process
  creation by Bun (and Node) and never refreshes on re-parent — empirical
  evidence on macOS shows ps reports the new parent within one tick while
  process.ppid stays at the original PID indefinitely (oven-sh/bun#30305).
- Watchdog fires on `getParentPid() !== initialParentPid` (any reparent),
  not just `=== 1`. Catches launchd / systemd / tmux / parent-shell-with-
  PR_SET_CHILD_SUBREAPER cases where the kernel re-anchors us to a non-1
  subreaper PID. Codex review caught the original `=== 1` was incomplete.
- One-shot startup probe verifies `spawnSync('ps')` actually works on this
  host. If the probe fails (stripped containers / busybox without procps),
  we skip installing the watchdog interval entirely AND emit a loud stderr
  line — the operator sees "watchdog disabled" instead of an installed-
  but-never-fires phantom that silently falls back to cached process.ppid.
- 5-second cleanup deadline: if engine.disconnect() wedges (PGLite WASM
  stall, etc.), the process still calls process.exit(0). The abandoned
  lock dir is reclaimed on the next start by the existing stale-lock
  check in pglite-lock.ts.
- Optional `--stdio-idle-timeout <sec>`: default OFF safety net for
  parents that leak the pipe but never close it. Strict parsing rejects
  `abc` / `30junk` / `-1` / `1.5` / blank values explicitly so a typo
  doesn't silently disable the safety net (closes #446).

Test seam: ServeOptions { stdin, signals, exit, log, startMcpServer,
getParentPid, setInterval, clearInterval, probeWatchdog } lets the
lifecycle be unit-tested deterministically without spawning a real Bun
child or booting the MCP SDK.

22 test cases covering signals, stdin EOF, TTY skip, watchdog reparent
(both PID-1 and subreaper-PID-N cases), ps-unavailable degraded mode,
idle timeout, idempotent shutdown, and cleanup-deadline behavior.

Closes #413, #446. Supersedes #591.

Co-Authored-By: Aragorn2046 <noreply@github.com>
Co-Authored-By: seungsu-kr <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(auth): route HTTP auth/admin SQL through active engine

`gbrain auth` and `gbrain serve --http` previously routed every SQL
through the postgres.js singleton in src/core/db.ts, which silently fell
back to a file-backed PGLite when DATABASE_URL was set but the config
file disagreed. The HTTP transport's verbatim use of the singleton also
made `gbrain serve --http` Postgres-only, even though the
`access_tokens` and `mcp_request_log` tables exist in both engine
schemas.

Auth, OAuth, admin, file uploads, and HTTP-transport SQL now run through
`engine.executeRaw` via a deliberately narrow tagged-template adapter
(`src/core/sql-query.ts`). The contract is scalar-binds-only — adding
JSONB or fragment composition would invite the adapter to drift into a
partial postgres.js clone. JSONB writes use a separate
`executeRawJsonb(engine, sql, scalarParams, jsonbParams)` helper that
composes positional `$N::jsonb` casts and passes objects through
`engine.executeRaw`. The CI guard at `scripts/check-jsonb-pattern.sh`
doesn't fire because the helper is a method call, not the banned
`${JSON.stringify(x)}::jsonb` template-literal interpolation, and the
v0.12.0 double-encode bug class doesn't apply to positional binding via
`postgres.js`'s `unsafe()` (verified by
`test/e2e/auth-permissions.test.ts:67` on Postgres and the new
`test/sql-query.test.ts` on PGLite).

Migrated call sites:
  - src/commands/auth.ts: takes-holders writes (lines 52, 86) →
    executeRawJsonb. List, revoke, register-client, revoke-client →
    SqlQuery via withConfiguredSql() helper that opens an engine, runs
    the callback, disconnects.
  - src/commands/serve-http.ts: ~25 call sites including the four
    mcp_request_log.params INSERTs (now write real JSONB objects, not
    JSON-encoded strings — the read side `params->>'op'` returns the
    operation name, closing CLAUDE.md's outstanding "JSON-string-into-
    JSONB" note as a side effect). The /admin/api/requests dynamic
    filter pattern (postgres.js fragment composition) is rewritten as
    parametrized SQL string + params array.
  - src/mcp/http-transport.ts: legacy bearer-auth path. The
    Postgres-only fail-fast at startup is removed because both schemas
    now carry access_tokens + mcp_request_log.
  - src/core/oauth-provider.ts: SqlQuery / SqlValue types relocated
    from here to sql-query.ts as the canonical home (Codex finding #8).
  - src/commands/files.ts: all 5 db.getConnection() sites (lines 104,
    139, 252, 326, 355). The line-256 INSERT into files.metadata uses
    executeRawJsonb; the other four are scalar-only SqlQuery (Codex
    finding #6 — scope was bigger than the plan's "lone INSERT" framing).
  - src/core/config.ts: env-var DATABASE_URL inference. When dbUrl is
    set, infer Postgres engine and clear the stale database_path.

Engine-internal sql.json() sites in src/core/postgres-engine.ts (5
sites: lines 520, 1689, 1728, 1790, 2313) STAY UNCHANGED. They live
inside PostgresEngine itself, where the postgres.js template-tag
sql.json() pattern is correct — those methods are only loaded when
Postgres is the active engine, so there's no PGLite-routing concern.

Migration v45 (mcp_request_log_params_jsonb_normalize): one-shot UPDATE
that lifts pre-v0.31 string-shaped JSONB rows to objects so the
/admin/api/requests endpoint at serve-http.ts:605 returns one
consistent shape to the admin SPA. Idempotent (subsequent runs find no
rows where jsonb_typeof = 'string'). Closes the mixed-shape window
that would otherwise have made post-deploy admin reads break.

Tests:
  - test/sql-query.test.ts: 7 cases covering scalar binds, the
    .json() rejection (defense in depth — SqlQuery is scalar-only),
    JSONB round-trip with `jsonb_typeof = 'object'` and `->>`
    semantics, the v0.12.0 double-encode regression guard, null
    JSONB handling, and the scalars-then-jsonb call shape.
  - test/config-env.test.ts: migrated from PR's manual `restoreEnv()`
    in afterEach to the canonical `withEnv()` helper at
    test/helpers/with-env.ts (CLAUDE.md R1 / codex finding D3).
    Five cases covering DATABASE_URL precedence, GBRAIN_DATABASE_URL
    operator override, file-only config, env-only config, and the
    no-config null path.
  - test/e2e/auth-takes-holders-pglite.test.ts: 6 cases against
    in-memory PGLite (no DATABASE_URL gate). Covers create / update /
    read of access_tokens.permissions, mcp_request_log.params object
    + null writes, and the migration v45 normalizer (seed
    string-shaped row, run UPDATE, assert object shape; second-run
    no-op for idempotency).
  - test/http-transport.test.ts: mock updated to intercept
    engine.executeRaw (the new code path) instead of the postgres.js
    template tag. 24 cases pass.

Plan reference: ~/.claude/plans/system-instruction-you-are-working-peppy-moore.md.
Codex outside-voice review applied: D-codex-1, D-codex-2, D-codex-5,
D-codex-8, D-codex-9, D-codex-10 (and D1, D5 reversed by codex).

Closes the architectural intent of #681. Supersedes its branch.

Co-Authored-By: codex-bot <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update CLAUDE.md key files for v0.31.3

Annotate the v0.31.3 changes in the canonical Key Files section:
new src/core/sql-query.ts adapter (#681), src/commands/serve.ts stdio
cleanup (#676), v0.31.3 amendments to auth.ts / serve-http.ts /
oauth-provider.ts surfaces, and migration v46 normalizer in migrate.ts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: regenerate llms-full.txt for v0.31.3 docs sync

CI's build-llms test asserts the committed llms.txt + llms-full.txt
match what scripts/build-llms.ts produces from current source state.
CLAUDE.md was amended by /document-release post-merge (new entries for
src/core/sql-query.ts and src/commands/serve.ts; amended notes on
auth.ts / serve-http.ts / migrate.ts), so the inlined-bundle fell out
of sync. Regenerated via `bun run build:llms`.

llms.txt unchanged (curated index — no new web URLs added).
llms-full.txt updated to inline the new CLAUDE.md content.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Aragorn2046 <noreply@github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@seungsu-kr
Copy link
Copy Markdown
Author

Closing as superseded by #801 (v0.31.3, merged 2026-05-10).

The stdio cleanup from this branch landed verbatim, with two additional cases on top:

  1. Watchdog reparent-to-PID-N (subreaper) — covers PR_SET_CHILD_SUBREAPER environments (systemd, tmux, parent-shell-as-subreaper). Original watchdog only checked ppid === 1.
  2. ps-unavailable startup probe — stripped containers (busybox without procps, etc.) get a loud stderr line + watchdog skips installing entirely, instead of silently running every 5s and never firing.

Credit lines in the v0.31.3 CHANGELOG (~/git/gbrain/CHANGELOG.md lines 7-13):

Two community PRs (#676 stdio cleanup, #681 engine-aware auth SQL) landed in one focused PR with two atomic commits.
Credit @Aragorn2046 (origin features in #591) and @seungsu-kr (rebased submitter, Bun ppid workaround).

The process.ppid Bun-cache workaround (spawnSync('ps', ['-o', 'ppid=', '-p', String(process.pid)]) per tick) reaching v0.31.3 as the canonical pattern is the part I'm most happy about — even though oven-sh/bun#30305 (filed by @Aragorn2046) is the upstream fix, every Bun consumer needs this fallback in the meantime.

Verified locally: 21 stale gbrain serve processes had accumulated on my machine (1-3 days old, all from pre-v0.31.3 Claude Desktop / Cursor disconnects) — exactly the leak this fix prevents going forward.

Thanks @garrytan for the disposition and merge, and @Aragorn2046 for the original work in #591 + Shelby's path-B endorsement that unblocked this PR.

@seungsu-kr seungsu-kr closed this May 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature request: gbrain serve idle timeout (auto-exit after inactivity) MCP serve: stdin=/dev/null causes lock contention after gateway restart

2 participants