gbrain sync --source <id> spins indefinitely (busy loop, SIGTERM ignored) → orphaned processes pile up under scheduler #1633

@Simona-digital-ops

Summary

gbrain sync --source <id> can enter an infinite CPU-spinning loop that never resolves, and the process ignores SIGTERM (requires SIGKILL to terminate). When the sync is triggered from a scheduler that has its own session timeout, the parent exits and leaves the sync process orphaned (PPID 1), still burning CPU indefinitely.

Environment

  • gbrain v0.41.11.1 (bun-linked from source)
  • macOS 26.3 arm64 (Mac mini M4 16 GB)
  • Engine: Postgres (Supabase)
  • Source: briefings, at /Users/simona/.openclaw/workspace/memory (local git repo, ~119 pages, markdown strategy)
  • Triggered by: OpenClaw cron → isolated agent session → exec gbrain sync --source briefings --no-pull --no-embed

What was observed

On 2026-05-29 ~20:00 CT:

ps -axo pid,ppid,etime,%cpu,rss,args | grep gbrain
13 concurrent processes, all identical:
  bun /Users/simona/.bun/bin/gbrain sync --source briefings --no-pull --no-embed

  PPID: 1 (orphaned — parent had exited)
  Oldest elapsed time: 1-02:56:11 (>24 hours)
  CPU per process: ~60–72%
  RSS per process: ~250 MB
  Combined load average: ~10
  Free memory: 121 MB (of 16 GB); 8.3 GB in compressor (thrashing)

After kill -9 on all 13: load dropped from ~10 → ~5, free memory jumped 121 MB → 9 GB.

Two distinct bugs

Bug 1 — Individual run hangs in a busy loop (the underlying bug)

A single gbrain sync --source briefings --no-pull --no-embed run does not complete. It is not blocked-idle — it pegs ~65% CPU continuously, indicating a busy loop or retry storm rather than I/O wait. The oldest instance had been spinning for >24 hours.

Suspected location: The briefings source syncs a directory of markdown files (daily notes, memory files). The most likely candidates for the spin, based on the sync code path:

  1. extractLinksForSlugs / extractTimelineForSlugs — iterating over pagesAffected with expensive per-file operations
  2. runFactsBackstop loop — the per-slug loop that calls queue.add(...) for each affected page; if queue submission is retrying on a transient error without backoff this would spin
  3. withRefreshingLock timer not refreshing because the event loop is saturated by synchronous/native work (git execFileSync calls with 30s timeouts)
  4. performFullSync triggered every run due to versionNeverSet (chunker_version never written for this source), causing a full reimport of all ~119 pages each cycle even when HEAD hasn't changed
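To illustrate candidate 2, here is a minimal sketch of bounded retry with exponential backoff. It is not gbrain code: `retryWithBackoff`, its options, and the injectable `sleep` are hypothetical names, and whether the real queue.add(...) path retries at all is still unconfirmed. The point is only that a retry loop with an attempt cap and a growing delay cannot turn a transient error into a CPU-bound spin.

```typescript
// Hypothetical helper, not taken from the gbrain source.
// Retries `op` up to `maxAttempts` times, doubling the delay after each
// failure (capped at `maxDelayMs`), then rethrows the last error.
async function retryWithBackoff<T>(
  op: () => Promise<T>,
  opts: { maxAttempts: number; baseDelayMs: number; maxDelayMs: number },
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < opts.maxAttempts; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (attempt === opts.maxAttempts - 1) break;
      const delay = Math.min(opts.baseDelayMs * 2 ** attempt, opts.maxDelayMs);
      await sleep(delay);
    }
  }
  throw lastError;
}
```

With `baseDelayMs: 100` and `maxDelayMs: 1000`, the waits between attempts are 100 ms, 200 ms, 400 ms, 800 ms, then 1000 ms for every further attempt.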

Diagnostic: Running with stderr phase breadcrumbs would show where it hangs:

gbrain sync --source briefings --no-pull --no-embed 2>&1 | grep "\[gbrain phase\]"

Per the code comments in sync.ts (v0.41.8.0 / #1342), phase lines are emitted at each major boundary. The last line printed before the spin would identify the stuck phase.
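If the captured stderr is saved to a file, finding the stuck phase comes down to picking the last breadcrumb line. A minimal sketch, assuming only that each breadcrumb line contains the literal marker `[gbrain phase]` (the rest of the line format is not known here, and `lastPhaseLine` is a hypothetical helper):

```typescript
// Hypothetical helper: returns the last line of captured stderr that
// contains the phase marker, or undefined if no breadcrumb was printed.
function lastPhaseLine(
  stderrText: string,
  marker: string = "[gbrain phase]",
): string | undefined {
  let last: string | undefined;
  for (const line of stderrText.split(/\r?\n/)) {
    if (line.includes(marker)) last = line.trim();
  }
  return last;
}
```

An `undefined` result would itself be informative: it would mean the run spun before reaching the first phase boundary.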

Bug 2 — SIGTERM is ignored (requires SIGKILL)

pkill -f "gbrain sync"   # did nothing
kill -9 <pid>            # worked

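gbrain sync (the CLI entry point, not the autopilot) appears to have no process.on('SIGTERM') handler of its own: the autopilot daemon registers one, but the bare runSync / performSync path does not. One caveat on this diagnosis: with no handler installed anywhere in the process, the default SIGTERM disposition terminates the process immediately, so the observed behavior suggests that some handler is registered (or the signal is otherwise ignored) and its JavaScript callback never gets to run. Signal callbacks are dispatched from the event loop, so if the process is inside a synchronous call (e.g. execFileSync for a git command) or a synchronous busy loop, SIGTERM is queued but not handled until control returns to the loop. Since the git calls have 30s timeouts, SIGTERM should eventually be handled, but in practice these processes ran for >24h without ever exiting, suggesting that either the event loop never became idle or the process re-entered a blocking call immediately after each one returned.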

Fix: Register a process.on('SIGTERM', ...) handler in the runSync CLI entry point (same pattern as autopilot.ts's shutdown() function) that sets a global abort flag checked between import iterations, then calls process.exit(0).
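A minimal sketch of that pattern, under stated assumptions: `installSigtermHandler`, `importAll`, and `importOne` are hypothetical names, and the real import loop in runSync may be shaped differently. The loop yields to the event loop before each check, because a signal callback can only run once control returns there; a flag check inside a fully synchronous loop would never see the flag change.

```typescript
// Hypothetical sketch, not gbrain code.
let shuttingDown = false;

function installSigtermHandler(): void {
  process.on("SIGTERM", () => {
    // Only record the request; the import loop decides when it is safe to stop.
    shuttingDown = true;
  });
}

async function importAll(
  slugs: string[],
  importOne: (slug: string) => Promise<void>,
): Promise<{ completed: number; aborted: boolean }> {
  let completed = 0;
  for (const slug of slugs) {
    // Yield so that a pending SIGTERM callback gets a chance to run.
    await new Promise<void>((resolve) => setImmediate(resolve));
    if (shuttingDown) return { completed, aborted: true };
    await importOne(slug);
    completed++;
  }
  return { completed, aborted: false };
}
```

The caller can then release any held lock and exit once `aborted` is true. This only helps if the spin happens in code that passes through such a checkpoint; a stuck synchronous section still needs an external SIGKILL.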

Workaround applied

Added a wrapper script that:

  1. Overlap guard: pgrep -fl "gbrain sync" at entry; skip the run entirely if any prior sync is still alive
  2. Per-run timeout: Perl fork/alarm wrapper: 480s SIGTERM → 10s grace → SIGKILL per source

This prevents accumulation but doesn't fix the underlying spin or SIGTERM issues in gbrain itself.
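For schedulers that can run a Node or Bun script instead of Perl, the per-run timeout can be sketched as follows. This is an illustrative equivalent, not the wrapper actually deployed: `runWithDeadline` is a hypothetical name, and the overlap guard is not covered.

```typescript
import { spawn } from "node:child_process";

// Hypothetical sketch: run a command, send SIGTERM after `termAfterMs`,
// then SIGKILL if it is still alive `killAfterMs` later.
function runWithDeadline(
  cmd: string,
  args: string[],
  termAfterMs: number,
  killAfterMs: number,
): Promise<{ code: number | null; signal: string | null }> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { stdio: "inherit" });
    let killTimer: ReturnType<typeof setTimeout> | undefined;

    const termTimer = setTimeout(() => {
      child.kill("SIGTERM");
      // Grace period before escalating to an uncatchable signal.
      killTimer = setTimeout(() => child.kill("SIGKILL"), killAfterMs);
    }, termAfterMs);

    const clearTimers = (): void => {
      clearTimeout(termTimer);
      if (killTimer !== undefined) clearTimeout(killTimer);
    };

    child.on("error", (err) => {
      clearTimers();
      reject(err);
    });
    child.on("exit", (code, signal) => {
      clearTimers();
      resolve({ code, signal });
    });
  });
}
```

The values from the workaround would correspond to `runWithDeadline("gbrain", ["sync", "--source", "briefings", "--no-pull", "--no-embed"], 480_000, 10_000)`. The wrapper has to stay alive until the child exits, so it only helps if the scheduler's own session timeout is longer than the two deadlines combined.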

Suggested fixes in gbrain

  1. runSync SIGTERM handler: process.on('SIGTERM', () => { shuttingDown = true; }), check shuttingDown between file imports, exit cleanly.

  2. Per-source CLI timeout flag: gbrain sync --timeout 300, which wraps performSync with an AbortController and a setTimeout(() => abort.abort(), ms). The sync handler in worker.ts already does this via job.timeout_ms; expose it on the CLI surface too.

  3. chunker_version gate diagnosis — If versionNeverSet (source has no chunker_version row) causes a full reimport on every run even when HEAD hasn't changed, that's a performance regression for the first N cycles after a source is registered. Worth logging a clear warning when this gate fires: [sync] chunker_version unset for source <id> — forcing full reimport (will not repeat after first successful write).

  4. --break-lock hint in stale-process scenario: When withRefreshingLock fails because a prior instance is holding the lock and that PID is dead, the error message already hints gbrain sync --break-lock. In the runaway scenario described here the PID is still alive (just stuck), so the safe break-lock path refuses. Documenting --force-break-lock in the runaway-process recovery guide would help.
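The timeout in suggestion 2 could look roughly like the sketch below. `withTimeout` and `processItems` are hypothetical stand-ins; how performSync would actually accept and check an AbortSignal is an assumption.

```typescript
// Hypothetical sketch: abort a cooperative async task after `ms` milliseconds.
async function withTimeout<T>(
  run: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const abort = new AbortController();
  const timer = setTimeout(() => abort.abort(), ms);
  try {
    return await run(abort.signal);
  } finally {
    // Always clear the timer so a finished run does not keep the process alive.
    clearTimeout(timer);
  }
}

// Stand-in for the sync work: checks the signal between items.
async function processItems(
  items: string[],
  handle: (item: string) => Promise<void>,
  signal: AbortSignal,
): Promise<number> {
  let done = 0;
  for (const item of items) {
    if (signal.aborted) throw new Error("sync aborted: timeout reached");
    await handle(item);
    done++;
  }
  return done;
}
```

A call would look like `withTimeout((signal) => processItems(slugs, importOne, signal), 300_000)`. As with the SIGTERM flag, the timer callback runs on the event loop, so this bounds a run only if the work yields regularly; it cannot interrupt a synchronous spin.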

Reproduction

Set up a cron or scheduler that fires gbrain sync --source <id> every N minutes, with N shorter than the time a single run takes. If runs get stuck (for any reason), each tick spawns another process on top of the ones still spinning. If every run hangs, after 24h you have up to (24*60)/N orphaned spinning processes.
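The arithmetic can be written out as a small sketch. The function names are hypothetical, and the count is a worst case that assumes every scheduled run hangs:

```typescript
// Upper bound on orphaned processes after `hours`, if a run is started
// every `intervalMinutes` and none of them ever exits.
function maxOrphans(intervalMinutes: number, hours: number): number {
  if (intervalMinutes <= 0) throw new RangeError("intervalMinutes must be positive");
  return Math.floor((hours * 60) / intervalMinutes);
}

// Combined resident memory for `processCount` processes of similar size.
function totalRssMb(processCount: number, rssPerProcessMb: number): number {
  return processCount * rssPerProcessMb;
}
```

For example, a 15-minute interval gives `maxOrphans(15, 24)` = 96 processes after a day, and the 13 processes observed above at ~250 MB each come to `totalRssMb(13, 250)` = 3250 MB.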
