v0.42.18.0 fix: sync orphan-pileup watchdog (#1633) + links-lag µs stamp (#1768) #1807

Merged: garrytan merged 10 commits into master from garrytan/santo-domingo-v4 on Jun 3, 2026

garrytan (Owner) commented Jun 3, 2026

Two unrelated bugs reported from real Postgres/Supabase brains.

#1633: gbrain sync spins, ignores SIGTERM, orphans pile up under cron

A sync --source <id> could enter a busy loop that pegs a CPU core and ignores SIGTERM/kill (only kill -9 stopped it). Under cron the stuck process was orphaned and the next tick spawned another; one reporter found 13 of them, 24h+ old. Root cause: a spinning sync starves its own event loop, so the SIGTERM handler and the --timeout that gbrain already had can never run.

Fix — out-of-band hard-deadline watchdog. src/core/process-watchdog.ts spawns a Bun worker_threads Worker (eval:true so it survives bun --compile) on a separate OS thread; at the deadline it SIGTERMs its own process, and at deadline+grace SIGKILLs it — fires even when the main loop is starved. Signals SELF, so no PID-reuse footgun. Empirically validated on Bun 1.3.13 (worker timer + SIGKILL killed a while(true){}-starved process; Codex outside-voice + a repo spike both confirmed).
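
A minimal sketch of the self-signalling watchdog idea described above, not the contents of src/core/process-watchdog.ts. The names `watchdogAction`, `armWatchdog`, and `WatchdogHandle` are hypothetical, and the sketch assumes `process.kill` is callable from inside a worker thread, as it is in Node-compatible runtimes:

```typescript
import { Worker } from "node:worker_threads";

type WatchdogAction = "none" | "SIGTERM" | "SIGKILL";

// Pure decision (hypothetical helper): which signal, if any, is due
// once `elapsedMs` have passed since the watchdog was armed.
function watchdogAction(elapsedMs: number, deadlineMs: number, graceMs: number): WatchdogAction {
  if (elapsedMs >= deadlineMs + graceMs) return "SIGKILL";
  if (elapsedMs >= deadlineMs) return "SIGTERM";
  return "none";
}

// Worker body kept as a string so it can be started with `eval: true`.
// Its timers run on the worker's own thread, so they fire even while the
// main thread is stuck in a busy loop. Both signals target the current
// process (process.pid), never a stored PID.
const WORKER_SOURCE = `
  const { workerData } = require("node:worker_threads");
  const { deadlineMs, graceMs } = workerData;
  setTimeout(() => process.kill(process.pid, "SIGTERM"), deadlineMs);
  setTimeout(() => process.kill(process.pid, "SIGKILL"), deadlineMs + graceMs);
`;

interface WatchdogHandle {
  dispose(): Promise<void>;
}

// Arms the watchdog; callers dispose the handle on a clean exit.
function armWatchdog(deadlineMs: number, graceMs: number): WatchdogHandle {
  const worker = new Worker(WORKER_SOURCE, {
    eval: true,
    workerData: { deadlineMs, graceMs },
  });
  worker.unref(); // the watchdog alone should not keep the process alive
  return {
    dispose: async () => {
      await worker.terminate();
    },
  };
}
```

The SIGTERM gives a responsive process the chance to run its normal shutdown path; the SIGKILL covers the starved case, where a registered SIGTERM handler would never get scheduled.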

  • Armed in cli.ts before connectEngine so a connect-phase hang is bounded too.
  • Non-TTY (cron) default 3600s; tune via GBRAIN_SYNC_MAX_RUNTIME_SECONDS, --hard-deadline <s>, or opt out with --no-hard-deadline. --timeout <s> auto-arms the hard backstop. Interactive TTY runs stay unbounded.
  • Part B: SIGINT graceful-cancel on single-source + --all (Ctrl-C returns a clean partial + releases the lock via the normal finally, instead of a hard cut that leaked the lock). withRefreshingLock timer unref'd.
  • The spin itself is not root-caused (needs a live repro; leading suspect is ReDoS in a schema-pack link rule, partly mitigated by v0.41.37.0's redos-guard). The watchdog makes the orphan-pileup/unkillable symptom impossible; #1633 stays open for the root cause. A [sync-watchdog] heartbeat + the existing [gbrain phase] lines pinpoint the next hang.
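
The deadline knobs listed above could resolve roughly as follows. This is a sketch under stated assumptions, not the PR's resolveSyncHardDeadline: the function name `resolveHardDeadlineSeconds`, the precedence order (opt-out, then flag, then env, then --timeout backstop, then TTY default), and the 60-second backstop margin are all assumptions. Only the 3600s non-TTY default, the flag and env names, and the unbounded TTY behaviour come from the description.

```typescript
interface DeadlineInputs {
  noHardDeadline: boolean;       // --no-hard-deadline
  hardDeadlineSeconds?: number;  // --hard-deadline <s>
  timeoutSeconds?: number;       // --timeout <s>
  envMaxRuntime?: string;        // GBRAIN_SYNC_MAX_RUNTIME_SECONDS
  isTTY: boolean;                // interactive run?
}

const NON_TTY_DEFAULT_SECONDS = 3600;        // stated in the PR description
const TIMEOUT_BACKSTOP_MARGIN_SECONDS = 60;  // assumption, for illustration only

// Returns the hard deadline in seconds, or null for "unbounded".
function resolveHardDeadlineSeconds(i: DeadlineInputs): number | null {
  // Explicit opt-out wins over everything else (assumed precedence).
  if (i.noHardDeadline) return null;

  // Explicit flag next.
  if (i.hardDeadlineSeconds !== undefined && i.hardDeadlineSeconds > 0) {
    return i.hardDeadlineSeconds;
  }

  // Then the environment variable, ignoring unparsable or non-positive values.
  if (i.envMaxRuntime !== undefined) {
    const fromEnv = Number(i.envMaxRuntime);
    if (Number.isFinite(fromEnv) && fromEnv > 0) return fromEnv;
  }

  // --timeout auto-arms a backstop slightly past the soft timeout.
  if (i.timeoutSeconds !== undefined && i.timeoutSeconds > 0) {
    return i.timeoutSeconds + TIMEOUT_BACKSTOP_MARGIN_SECONDS;
  }

  // Cron and other non-TTY runs get the default; interactive runs stay unbounded.
  return i.isTTY ? null : NON_TTY_DEFAULT_SECONDS;
}

// A cron run with no flags is bounded; the same run in a terminal is not.
console.log(resolveHardDeadlineSeconds({ noHardDeadline: false, isTTY: false })); // 3600
console.log(resolveHardDeadlineSeconds({ noHardDeadline: false, isTTY: true }));  // null
```

Keeping the resolution pure (inputs in, number or null out) is what makes a precedence test matrix cheap to write.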

#1768: links_extraction_lag stuck at 100% on Postgres

gbrain extract --stale stamped every page, yet gbrain doctor still reported every page as needing extraction — the remediation it recommends could never satisfy its own check. Root cause: the stamp went through a JS Date (millisecond-truncated) while the DB updated_at keeps microseconds, so updated_at > links_extracted_at stayed permanently true. (Not a trigger — there is no BEFORE UPDATE trigger on pages.)

Fix. Both engines' listStalePagesForExtraction SELECT now projects a deterministic full-µs UTC string (to_char(updated_at AT TIME ZONE 'UTC', '…US"Z"')), exposed as StalePageRow.updated_at_iso, and extractStaleFromDB stamps that string instead of a JS Date. The markPagesExtractedBatch SQL is unchanged, so backdated stamps (version-arm test) still work and the CDX-1 edited-since arm is strengthened to exact equality. The symptom was Postgres-only.
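
The precision mismatch can be reproduced without a database. The sketch below is illustrative only: `toMicros` and `isStale` are hypothetical helpers that mimic the `updated_at > links_extracted_at` predicate on ISO strings, and the sample timestamp is made up.

```typescript
// Parses a UTC ISO timestamp with up to 6 fractional digits into
// microseconds since the epoch (hypothetical helper).
function toMicros(iso: string): number {
  const m = /^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(?:\.(\d{1,6}))?Z$/.exec(iso);
  if (!m) throw new Error(`unsupported timestamp: ${iso}`);
  const wholeSecondsMs = Date.parse(`${m[1]}Z`);
  const fractionMicros = Number((m[2] ?? "").padEnd(6, "0"));
  return wholeSecondsMs * 1000 + fractionMicros;
}

// Mirrors the staleness predicate: updated_at > links_extracted_at.
function isStale(updatedAtIso: string, linksExtractedAtIso: string): boolean {
  return toMicros(updatedAtIso) > toMicros(linksExtractedAtIso);
}

// What the database holds: microsecond precision.
const dbUpdatedAt = "2026-06-03T07:30:00.123456Z";

// Old path: the value passes through a JS Date, which keeps milliseconds only.
const viaJsDate = new Date(Math.floor(toMicros(dbUpdatedAt) / 1000)).toISOString();

// New path: the full-µs string is carried through untouched and stamped as-is.
const viaIsoString = dbUpdatedAt;

console.log(viaJsDate);                        // 2026-06-03T07:30:00.123Z
console.log(isStale(dbUpdatedAt, viaJsDate));  // true  (456 µs were lost, so the page looks stale forever)
console.log(isStale(dbUpdatedAt, viaIsoString)); // false (exact equality clears the predicate)
```

Any row whose updated_at has a non-zero sub-millisecond part hits the first case, which would explain the lag sitting at 100% rather than at some partial figure.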

Tests

  • New: test/process-watchdog.test.ts (pure decision matrix + handle contract), test/process-watchdog.serial.test.ts (Bun-pinned: starved process IS killed ~deadline+grace; no-watchdog control does NOT self-exit; clean dispose never kills), test/sync-hard-deadline.test.ts (resolution precedence + composeAbortSignals).
  • New deterministic µs regression in test/extract-stale.test.ts (inject a µs updated_at, run --stale, assert lag → 0 and stays 0).
  • verify 29/29; typecheck clean; targeted post-merge suites 187 pass.
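
For readers unfamiliar with the signal-composition pattern that test/sync-hard-deadline.test.ts exercises, here is a minimal sketch. `composeSignals` is a hypothetical stand-in, not the PR's composeAbortSignals; it assumes the goal is a single AbortSignal that fires when any input (for example a --timeout signal or a SIGINT-driven one) fires.

```typescript
// Returns a signal that aborts as soon as any of the given signals aborts,
// forwarding the first abort reason. Undefined entries are skipped so
// optional sources can be passed straight through.
function composeSignals(...signals: Array<AbortSignal | undefined>): AbortSignal {
  const controller = new AbortController();
  for (const signal of signals) {
    if (!signal) continue;
    if (signal.aborted) {
      // An already-aborted input aborts the composite immediately.
      controller.abort(signal.reason);
      break;
    }
    signal.addEventListener("abort", () => controller.abort(signal.reason), { once: true });
  }
  return controller.signal;
}

// Usage: a timeout source and a Ctrl-C source feed one signal that the
// sync loop can poll between units of work.
const timeoutSource = new AbortController();
const sigintSource = new AbortController();
const combined = composeSignals(timeoutSource.signal, sigintSource.signal, undefined);

sigintSource.abort("SIGINT");
console.log(combined.aborted); // true
console.log(combined.reason);  // SIGINT
```

Polling one composed signal lets the loop return a clean partial result and release its lock through the normal finally path, whichever source fired.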

Pre-existing failures (NOT from this branch)

The full suite shows 4 failures in test/facts-classify.test.ts (×2) and test/mcp-eval-capture.test.ts (×2). Verified against a clean origin/master checkout: the same 4 fail there too — they predate this branch and live in areas this PR never touches (facts classifier, op-layer eval-capture). This PR adds 187+ passing tests and introduces zero new failures.

Reviews

/plan-eng-review CLEAR (9 decisions resolved) · /codex outside-voice CLEAR (validated both load-bearing bets: the to_char µs round-trip and the Bun worker-self-kill).

🤖 Generated with Claude Code

garrytan and others added 6 commits June 2, 2026 22:42
Stamp the full-microsecond updated_at (via to_char ... AT TIME ZONE UTC)
instead of the millisecond-truncated JS Date, so links_extracted_at equals
the DB updated_at exactly and the staleness predicate clears. Stamp SQL
unchanged: version-arm backdating still works, D4 preserved, CDX-1 strengthened.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bun eval-Worker that SIGTERM->grace->SIGKILLs its own process from a separate
OS thread, so a sync whose main event loop is starved (ReDoS spin) still dies.
Signals SELF (no PID-reuse footgun). Empirically validated on Bun 1.3.13.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
cli.ts installs the watchdog before connectEngine (bounds connect hangs);
resolveSyncHardDeadline + composeAbortSignals in sync.ts; SIGINT graceful
cancel on single-source + --all; withRefreshingLock timer unref'd. Non-TTY
default 3600s makes cron orphan-pileup structurally impossible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sibling workspaces claimed v0.42.13-v0.42.17; advance this branch's slot.
VERSION + package.json + CHANGELOG header + CLAUDE.md annotations + llms bundles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
garrytan changed the title from "v0.42.13.0 fix: sync orphan-pileup watchdog (#1633) + links-lag µs stamp (#1768)" to "v0.42.18.0 fix: sync orphan-pileup watchdog (#1633) + links-lag µs stamp (#1768)" on Jun 3, 2026
garrytan and others added 3 commits June 3, 2026 07:30
…ngo-v4

# Conflicts:
#	CHANGELOG.md
#	CLAUDE.md
#	VERSION
#	llms-full.txt
#	package.json
…check:doc-history)

The doc-history guard bans the bolded **v0.X release-clause marker in reference
docs (history belongs in CHANGELOG + git). Rewrote the extract.ts/sync.ts
additions as current-state prose and de-versioned the process-watchdog entry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ngo-v4

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	docs/architecture/KEY_FILES.md
#	package.json
garrytan merged commit bde11bb into master on Jun 3, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.23.0 feat(jobs): --nice scheduling-priority flag for jobs work/supervisor (garrytan#1815) (garrytan#1820)
  v0.42.22.0 fix(minions): supervisor progress watchdog + worker DB self-defense — alive-but-wedged worker self-heals (garrytan#1801) (garrytan#1824)
  v0.42.21.0 fix(postgres): module-singleton ownership — canonical landing for the dream-cycle "connect() has not been called" class (garrytan#1404/garrytan#1471/garrytan#1619) (garrytan#1805)
  v0.42.20.0 fix: reliability wave — PGLite capture lock-pin + Postgres reconnect race + search embed-hang (garrytan#1762 garrytan#1745 garrytan#1775) (garrytan#1810)
  v0.42.19.0 fix(skillopt): close the last gap in the AI SDK v6 tool-loop fix (write-capture mapper + regression test) (garrytan#1809)
  v0.42.18.0 fix: sync orphan-pileup watchdog (garrytan#1633) + links-lag µs stamp (garrytan#1768) (garrytan#1807)
  v0.42.17.0 fix(sync): resumable incremental sync — killed mid-import no longer loses progress (garrytan#1794) (garrytan#1808)
  v0.42.16.0 feat(doctor): brain health as a solved problem — cause-ranked doctor + OOM-loop line + auto-drain + pool-reap (garrytan#1685) (garrytan#1802)
  v0.42.15.0 fix: decouple CLI primary output from process.stdout.isTTY (garrytan#1784) (garrytan#1806)
  v0.42.14.0 fix(zero-config): code-* readiness signal + init embedding-key validation + lock self-heal (garrytan#1780) (garrytan#1804)
  v0.42.13.0 fix(search): archive/ content findable by default, demoted not hard-excluded (garrytan#1777) (garrytan#1797)
  v0.42.12.0 feat: self-upgrading gbrain — invocation-riding update check + opt-in auto-upgrade (garrytan#1798)
  v0.42.11.0 feat(skillopt): held-out eval gate, honest receipts, ENFORCE + ablation opts (garrytan#1759)
  v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes garrytan#972) (garrytan#1388)