Summary
gbrain sync --source <id> can enter an infinite CPU-spinning loop that never resolves, and the process ignores SIGTERM (requires SIGKILL to terminate). When the sync is triggered from a scheduler that has its own session timeout, the parent exits and leaves the sync process orphaned (PPID 1), still burning CPU indefinitely.
Environment
- gbrain v0.41.11.1 (bun-linked from source)
- macOS 26.3 arm64 (Mac mini M4 16 GB)
- Engine: Postgres (Supabase)
- Source: briefings → /Users/simona/.openclaw/workspace/memory (local git repo, ~119 pages, markdown strategy)
- Triggered by: OpenClaw cron → isolated agent session → exec gbrain sync --source briefings --no-pull --no-embed
What was observed
On 2026-05-29 ~20:00 CT:
ps -axo pid,ppid,etime,%cpu,rss,args | grep gbrain
13 concurrent processes, all identical:
bun /Users/simona/.bun/bin/gbrain sync --source briefings --no-pull --no-embed
PPID: 1 (orphaned — parent had exited)
Oldest elapsed time: 1-02:56:11 (>24 hours)
CPU per process: ~60–72%
RSS per process: ~250 MB
Combined load average: ~10
Free memory: 121 MB (of 16 GB); 8.3 GB in compressor (thrashing)
After kill -9 on all 13: load dropped from ~10 → ~5, free memory jumped 121 MB → 9 GB.
Two distinct bugs
Bug 1 — Individual run hangs in a busy loop (the underlying bug)
A single gbrain sync --source briefings --no-pull --no-embed run does not complete. It is not blocked-idle — it pegs ~65% CPU continuously, indicating a busy loop or retry storm rather than I/O wait. The oldest instance had been spinning for >24 hours.
Suspected location: The briefings source syncs a directory of markdown files (daily notes, memory files). The most likely candidates for the spin, based on the sync code path:
- extractLinksForSlugs / extractTimelineForSlugs: iterating over pagesAffected with expensive per-file operations
- runFactsBackstop loop: the per-slug loop that calls queue.add(...) for each affected page; if queue submission is retrying on a transient error without backoff, this would spin
- withRefreshingLock timer not refreshing because the event loop is saturated by synchronous/native work (git execFileSync calls with 30s timeouts)
- performFullSync triggered every run due to versionNeverSet (chunker_version never written for this source), causing a full reimport of all ~119 pages each cycle even when HEAD hasn't changed
Diagnostic: Running with stderr phase breadcrumbs would show where it hangs:
gbrain sync --source briefings --no-pull --no-embed 2>&1 | grep "\[gbrain phase\]"
Per the code comments in sync.ts (v0.41.8.0 / #1342), phase lines are emitted at each major boundary. The last line printed before the spin would identify the stuck phase.
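To make the "last line printed" check mechanical, captured stderr can be scanned for the final phase breadcrumb. The sketch below is only an illustration: it assumes nothing about the breadcrumb format beyond the literal `[gbrain phase]` marker used in the grep above, and `lastPhaseLine` is a hypothetical helper name, not part of gbrain.

```typescript
// Hypothetical helper (not part of gbrain): return the last captured line
// that carries the "[gbrain phase]" marker, or null if none was emitted.
function lastPhaseLine(stderrText: string): string | null {
  let last: string | null = null;
  for (const line of stderrText.split(/\r?\n/)) {
    if (line.includes("[gbrain phase]")) {
      last = line; // keep overwriting so the final match wins
    }
  }
  return last;
}
```

Feeding it the stderr captured from a stuck run (for example, redirected to a file before the process is killed) would give the phase the run entered last.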
Bug 2 — SIGTERM is ignored (requires SIGKILL)
pkill -f "gbrain sync" # did nothing
kill -9 <pid> # worked
gbrain sync (the CLI entry point, not the autopilot) has no process.on('SIGTERM') handler. The autopilot daemon registers one, but the bare runSync / performSync path does not. If the process is blocked inside a synchronous native call (e.g. execFileSync for a git command), SIGTERM will queue but not be delivered until the call returns. Since the git calls have 30s timeouts, SIGTERM should eventually be handled. In practice, however, these processes ran for >24h without ever exiting, suggesting that SIGTERM was either never delivered cleanly or that the process re-entered a blocking call immediately after each one returned.
Fix: Register a process.on('SIGTERM', ...) handler in the runSync CLI entry point (same pattern as autopilot.ts's shutdown() function) that sets a global abort flag checked between import iterations, then calls process.exit(0).
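A minimal sketch of that pattern follows. It is not gbrain's actual code: `ShutdownFlag`, `installSigtermFlag`, `importAll`, and the `importFile` callback are hypothetical names standing in for the real runSync internals, which are not shown in this report.

```typescript
// Hypothetical shape for the abort flag; gbrain's real state may differ.
interface ShutdownFlag {
  requested: boolean;
}

// Register a SIGTERM handler that only flips the flag. The loop below
// decides when it is safe to stop.
function installSigtermFlag(flag: ShutdownFlag): void {
  process.on("SIGTERM", () => {
    flag.requested = true;
  });
}

// Import files one at a time, checking the flag between iterations.
// Returns how many files were imported before stopping.
async function importAll(
  files: readonly string[],
  importFile: (path: string) => Promise<void>, // stand-in for the per-file import
  flag: ShutdownFlag,
): Promise<number> {
  let imported = 0;
  for (const file of files) {
    if (flag.requested) break; // stop cleanly between files
    await importFile(file);
    imported++;
  }
  return imported;
}
```

After `importAll` returns early, the caller would release any held lock and call process.exit(0). Note that the handler can only run when control returns to the event loop, so a long synchronous call (such as execFileSync) still delays it.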
Workaround applied
Added a wrapper script that:
- Overlap guard: pgrep -fl "gbrain sync" at entry; skip the run entirely if any prior sync is still alive
- Per-run timeout: Perl fork/alarm wrapper: 480s SIGTERM → 10s grace → SIGKILL per source
This prevents accumulation but doesn't fix the underlying spin or SIGTERM issues in gbrain itself.
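For reference, the per-run timeout half of the wrapper can be expressed in TypeScript as well. This is a rough equivalent of the escalation described above, not the Perl script that was actually deployed; `runWithTimeout` and `RunResult` are hypothetical names, and the overlap guard is omitted.

```typescript
import { spawn } from "node:child_process";

interface RunResult {
  code: number | null;   // exit code, or null if killed by a signal
  signal: string | null; // terminating signal, or null on a normal exit
}

// Run a command; send SIGTERM after timeoutMs, then SIGKILL if the child
// is still alive graceMs later.
function runWithTimeout(
  cmd: string,
  args: string[],
  timeoutMs: number,
  graceMs: number,
): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { stdio: "inherit" });
    let killTimer: ReturnType<typeof setTimeout> | undefined;

    const termTimer = setTimeout(() => {
      child.kill("SIGTERM");
      killTimer = setTimeout(() => child.kill("SIGKILL"), graceMs);
    }, timeoutMs);

    const clearTimers = (): void => {
      clearTimeout(termTimer);
      if (killTimer !== undefined) clearTimeout(killTimer);
    };

    child.on("error", (err) => {
      clearTimers();
      reject(err); // e.g. the command was not found
    });
    child.on("exit", (code, signal) => {
      clearTimers();
      resolve({ code, signal });
    });
  });
}

// Usage with the values from the workaround (480s, then 10s grace):
// await runWithTimeout("gbrain",
//   ["sync", "--source", "briefings", "--no-pull", "--no-embed"],
//   480_000, 10_000);
```

A resolved `signal` of "SIGKILL" would indicate that the run ignored SIGTERM for the full grace period, which is the behaviour described under Bug 2.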
Suggested fixes in gbrain
- runSync SIGTERM handler: process.on('SIGTERM', () => { shuttingDown = true; }), check shuttingDown between file imports, exit cleanly.
- Per-source CLI timeout flag: gbrain sync --timeout 300 that wraps performSync with an AbortController and a setTimeout(() => abort.abort(), ms). The sync handler in worker.ts already does this via job.timeout_ms; expose it on the CLI surface too.
- chunker_version gate diagnosis: If versionNeverSet (source has no chunker_version row) causes a full reimport on every run even when HEAD hasn't changed, that's a performance regression for the first N cycles after a source is registered. Worth logging a clear warning when this gate fires: [sync] chunker_version unset for source <id> — forcing full reimport (will not repeat after first successful write).
- --break-lock hint in stale-process scenario: When withRefreshingLock fails because a prior instance is holding the lock and that PID is dead, the error message already hints gbrain sync --break-lock. But since the PID is alive (just stuck), the break-lock safe path refuses. Documenting --force-break-lock in the runaway-process recovery guide would help.
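The timeout suggestion could look roughly like the sketch below. It assumes the wrapped work accepts an AbortSignal and checks it cooperatively. The real performSync signature is not shown in this report, so `withTimeout` and its `work` callback are hypothetical, and the unit of the proposed --timeout value is left to the implementer (the sketch takes milliseconds).

```typescript
// Hypothetical wrapper: abort the given work after `ms` milliseconds.
// The work must observe the signal itself (e.g. between file imports).
async function withTimeout<T>(
  ms: number,
  work: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const abort = new AbortController();
  const timer = setTimeout(() => abort.abort(), ms);
  try {
    return await work(abort.signal);
  } finally {
    clearTimeout(timer); // do not keep the process alive after completion
  }
}
```

As with the SIGTERM flag, this only helps if the sync loop yields to the event loop often enough for the timer to fire. A purely synchronous spin would still need an external watchdog such as the wrapper script.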
Reproduction
Set up a cron or scheduler that fires gbrain sync --source <id> every N minutes (N < single-run time). If any run gets stuck (for any reason), the next tick spawns another, and so on. After 24h you have (24*60)/N orphaned spinning processes.
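The accumulation estimate can be written out as a small function. It assumes the worst case described above, where every tick's run gets stuck and nothing reaps the orphans; `orphansAfter` is a hypothetical name used only for illustration.

```typescript
// Worst-case count of stuck processes after `hours`, with one run spawned
// every `intervalMinutes` and no run ever exiting.
function orphansAfter(hours: number, intervalMinutes: number): number {
  return Math.floor((hours * 60) / intervalMinutes);
}

// Example: a 30-minute schedule over 24 hours.
console.log(orphansAfter(24, 30)); // 48
```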