gbrain sync --source <id> spins indefinitely (busy loop, SIGTERM ignored) → orphaned processes pile up under scheduler

## Summary

`gbrain sync --source <id>` can enter a CPU-bound busy loop that never finishes, and the process does not exit on SIGTERM (only SIGKILL terminates it). When the sync is triggered from a scheduler that has its own session timeout, the parent exits and leaves the sync process orphaned (PPID 1), still burning CPU indefinitely.

## Environment

- gbrain v0.41.11.1 (bun-linked from source)
- macOS 26.3 arm64 (Mac mini M4 16 GB)
- Engine: Postgres (Supabase)
- Source: `briefings` → `/Users/simona/.openclaw/workspace/memory` (local git repo, ~119 pages, markdown strategy)
- Triggered by: OpenClaw cron → isolated agent session → `exec gbrain sync --source briefings --no-pull --no-embed`

## What was observed

On 2026-05-29 ~20:00 CT:

```
ps -axo pid,ppid,etime,%cpu,rss,args | grep gbrain
```

```
13 concurrent processes, all identical:
  bun /Users/simona/.bun/bin/gbrain sync --source briefings --no-pull --no-embed

  PPID: 1 (orphaned — parent had exited)
  Oldest elapsed time: 1-02:56:11 (>24 hours)
  CPU per process: ~60–72%
  RSS per process: ~250 MB
  Combined load average: ~10
  Free memory: 121 MB (of 16 GB); 8.3 GB in compressor (thrashing)
```

After `kill -9` on all 13: load dropped from ~10 → ~5, free memory jumped 121 MB → 9 GB.

## Two distinct bugs

### Bug 1 — Individual run hangs in a busy loop (the underlying bug)

A single `gbrain sync --source briefings --no-pull --no-embed` run does not complete. It is not blocked-idle — it pegs ~65% CPU continuously, indicating a busy loop or retry storm rather than I/O wait. The oldest instance had been spinning for >24 hours.

**Suspected location:** The `briefings` source syncs a directory of markdown files (daily notes, memory files). The most likely candidates for the spin, based on the sync code path:

1. `extractLinksForSlugs` / `extractTimelineForSlugs` — iterating over `pagesAffected` with expensive per-file operations
2. `runFactsBackstop` loop — the per-slug loop that calls `queue.add(...)` for each affected page; if queue submission is retrying on a transient error without backoff this would spin
3. `withRefreshingLock` timer not refreshing because the event loop is saturated by synchronous/native work (git `execFileSync` calls with 30s timeouts)
4. `performFullSync` triggered every run due to `versionNeverSet` (chunker_version never written for this source), causing a full reimport of all ~119 pages each cycle even when HEAD hasn't changed

**Diagnostic:** Running with stderr phase breadcrumbs would show where it hangs:
```bash
gbrain sync --source briefings --no-pull --no-embed 2>&1 | grep "\[gbrain phase\]"
```
Per the code comments in `sync.ts` (v0.41.8.0 / #1342), phase lines are emitted at each major boundary. The last line printed before the spin would identify the stuck phase.
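
Because a hung run never closes the pipe, it may be easier to redirect stderr to a file, kill the process after a while, and read the last breadcrumb from the file. A minimal sketch of that last step follows. The function name and the sample log are made up for illustration; only the `[gbrain phase]` marker comes from the command above.

```typescript
/**
 * Returns the last line that contains the phase marker,
 * or undefined if no phase line was printed at all.
 */
function lastPhaseLine(log: string, marker = "[gbrain phase]"): string | undefined {
  const phaseLines = log
    .split(/\r?\n/)
    .filter((line) => line.includes(marker));
  return phaseLines.length > 0 ? phaseLines[phaseLines.length - 1] : undefined;
}

// Made-up log content; real phase names will differ.
const sampleLog = [
  "starting",
  "[gbrain phase] phase-a",
  "some other stderr noise",
  "[gbrain phase] phase-b",
  "",
].join("\n");

console.log(lastPhaseLine(sampleLog)); // [gbrain phase] phase-b
```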

### Bug 2 — SIGTERM is ignored (requires SIGKILL)

```bash
pkill -f "gbrain sync"   # sends SIGTERM by default; did nothing
kill -9 <pid>            # SIGKILL; worked
```

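`gbrain sync` (the CLI entry point, not the autopilot) has no `process.on('SIGTERM')` handler of its own. The autopilot daemon registers one, but the bare `runSync` / `performSync` path does not. A missing handler alone would not normally explain the behavior, since the default action for an unhandled SIGTERM is to terminate the process, which suggests that a JS-level handler may be registered somewhere else on this path (for example by a shared module or a dependency). With such a handler installed, a SIGTERM that arrives while the process is blocked inside a synchronous native call (e.g. `execFileSync` for a git command) is queued and not acted on until the call returns. Since the git calls have 30s timeouts, SIGTERM should eventually be handled, but in practice these processes ran for >24h without ever exiting, suggesting the SIGTERM either never delivered cleanly or the process re-entered a blocking call immediately after.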

**Fix:** Register a `process.on('SIGTERM', ...)` handler in the `runSync` CLI entry point (same pattern as `autopilot.ts`'s `shutdown()` function) that sets a global abort flag checked between import iterations, then calls `process.exit(0)`.
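
A minimal sketch of that pattern, assuming the import loop is asynchronous and runs one item at a time. All names here (`StopSignal`, `installShutdownFlag`, `importUntilStopped`) are hypothetical and not taken from gbrain's source.

```typescript
// Hypothetical names throughout; this is not gbrain's actual code.
type StopSignal = { requested: boolean };

/** Registers SIGTERM/SIGINT handlers that only flip a flag. */
function installShutdownFlag(): StopSignal {
  const stop: StopSignal = { requested: false };
  const onSignal = (): void => {
    stop.requested = true;
  };
  process.on("SIGTERM", onSignal);
  process.on("SIGINT", onSignal);
  return stop;
}

/**
 * Imports items one at a time and checks the flag between iterations.
 * Returns how many items were completed before stopping.
 */
async function importUntilStopped<T>(
  items: readonly T[],
  importOne: (item: T) => Promise<void>,
  stop: StopSignal,
): Promise<number> {
  let done = 0;
  for (const item of items) {
    if (stop.requested) break; // leave cleanly between two imports
    await importOne(item);
    done += 1;
  }
  return done;
}
```

The caller would invoke `installShutdownFlag()` once at startup, pass the flag into the loop, and exit after the loop returns. Note that the handler only runs when the event loop gets a turn, so this helps with a slow or retrying async loop but not with a purely synchronous spin.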

## Workaround applied

Added a wrapper script that:
1. **Overlap guard** — `pgrep -fl "gbrain sync"` at entry; skip the run entirely if any prior sync is still alive
2. **Per-run timeout** — Perl `fork`/`alarm` wrapper: 480s SIGTERM → 10s grace → SIGKILL per source

This prevents accumulation but doesn't fix the underlying spin or SIGTERM issues in gbrain itself.
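
The actual wrapper uses Perl `fork`/`alarm`. For readers who would rather keep everything in the gbrain toolchain, the per-run timeout (step 2 only, without the overlap guard) could be sketched in TypeScript roughly as follows. `runWithDeadline` is a hypothetical helper, not part of gbrain.

```typescript
import { spawn } from "node:child_process";

interface RunResult {
  code: number | null;   // exit code, or null if killed by a signal
  signal: string | null; // terminating signal, or null on normal exit
}

/**
 * Runs a command, sends SIGTERM after `termAfterMs`, and escalates to
 * SIGKILL if the child is still alive `graceMs` later.
 */
function runWithDeadline(
  cmd: string,
  args: string[],
  termAfterMs: number,
  graceMs: number,
): Promise<RunResult> {
  return new Promise((resolve, reject) => {
    const child = spawn(cmd, args, { stdio: "inherit" });
    let killTimer: ReturnType<typeof setTimeout> | undefined;

    const termTimer = setTimeout(() => {
      child.kill("SIGTERM");
      killTimer = setTimeout(() => child.kill("SIGKILL"), graceMs);
    }, termAfterMs);

    const cleanup = (): void => {
      clearTimeout(termTimer);
      if (killTimer !== undefined) clearTimeout(killTimer);
    };

    child.once("error", (err) => {
      cleanup();
      reject(err);
    });
    child.once("exit", (code, signal) => {
      cleanup();
      resolve({ code, signal });
    });
  });
}
```

With the values from the workaround, a call would look like `runWithDeadline("gbrain", ["sync", "--source", "briefings", "--no-pull", "--no-embed"], 480_000, 10_000)`.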

## Suggested fixes in gbrain

1. **`runSync` SIGTERM handler** — `process.on('SIGTERM', () => { shuttingDown = true; })`, check `shuttingDown` between file imports, exit cleanly.

2. **Per-source CLI timeout flag** — `gbrain sync --timeout 300` that wraps `performSync` with an AbortController and a `setTimeout(() => abort.abort(), ms)`. The sync handler in `worker.ts` already does this via `job.timeout_ms`; expose it on the CLI surface too.

3. **`chunker_version` gate diagnosis** — If `versionNeverSet` (source has no `chunker_version` row) causes a full reimport on every run even when HEAD hasn't changed, that's a performance regression for the first N cycles after a source is registered. Worth logging a clear warning when this gate fires: `[sync] chunker_version unset for source <id> — forcing full reimport (will not repeat after first successful write)`.

4. **`--break-lock` hint in stale-process scenario** – When `withRefreshingLock` fails because a prior instance is holding the lock, the error message already hints at `gbrain sync --break-lock`, which covers the case where the holder PID is dead. In the runaway scenario the holder PID is still alive (just stuck), so the safe break-lock path refuses. Documenting `--force-break-lock` in the runaway-process recovery guide would help.
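
As an illustration of fix 2, the timeout wrapper could look roughly like the sketch below. `withTimeout` is a hypothetical helper, and it assumes `performSync` could be given an `AbortSignal` to check cooperatively, which is not confirmed by the source.

```typescript
/**
 * Runs `task` with an AbortSignal that fires after `timeoutMs`.
 * The task is responsible for observing the signal and stopping.
 */
async function withTimeout<T>(
  timeoutMs: number,
  task: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await task(controller.signal);
  } finally {
    // Always clear the timer so a finished run does not keep the process alive.
    clearTimeout(timer);
  }
}
```

A `--timeout 300` flag would then translate to something like `withTimeout(300_000, (signal) => performSync(opts, signal))`, where the extra `signal` parameter is the assumed part. As with the SIGTERM flag, the abort only takes effect at points where the sync code checks the signal.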

## Reproduction

Set up a cron job or other scheduler that fires `gbrain sync --source <id>` every N minutes, with N shorter than the time a single run takes. If a run gets stuck (for any reason), the next tick spawns another, and so on. If every run gets stuck, after 24h you have `(24*60)/N` orphaned spinning processes.
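
To put numbers on the pile-up, here is a small worked calculation. It assumes every scheduled run hangs, and the 30-minute interval is a hypothetical example; the ~250 MB per process is the RSS observed in this report.

```typescript
/** Upper bound on stuck processes after `hours`, assuming every tick's run hangs. */
function stuckProcessCount(hours: number, intervalMinutes: number): number {
  return Math.floor((hours * 60) / intervalMinutes);
}

/** Rough total resident memory in MB for that many stuck processes. */
function estimatedRssMb(processes: number, rssPerProcessMb = 250): number {
  return processes * rssPerProcessMb;
}

const stuck = stuckProcessCount(24, 30);
console.log(stuck, estimatedRssMb(stuck)); // 48 12000
```

At that rate a 16 GB machine would be close to exhausted within a day, which matches the memory pressure described under "What was observed".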

