
fix(sync): --skip-failed acks pre-existing unacked failures up-front #686

Closed
brandonlipman wants to merge 6 commits into garrytan:master from brandonlipman:fix/sync-skip-failed-acks-stale

Conversation

Contributor

@brandonlipman brandonlipman commented May 6, 2026

Bug

The recovery flow that gbrain doctor and printSyncResult both advertise is broken end-to-end:

  1. Files with bad YAML hit the failure log; sync stays blocked at last_commit.
  2. User fixes the YAML.
  3. User re-runs gbrain sync → succeeds, advances last_commit.
  4. gbrain doctor still reports N unacked failures from step 1 because ~/.gbrain/sync-failures.jsonl is append-only history, never auto-cleared.
  5. doctor message: "use gbrain sync --skip-failed to acknowledge."
  6. User runs gbrain sync --skip-failed → "Already up to date." → log unchanged.

The flag is a no-op exactly when the user needs it.
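The step-4 behavior follows directly from the log format. A minimal sketch of how doctor derives its count — the entry shape here is hypothetical; the real fields live in gbrain's failure-log code:

```typescript
// Sketch of why doctor keeps warning: sync-failures.jsonl is append-only,
// so "unacked" is computed by scanning history, not from current sync state.
// FailureEntry is a hypothetical shape, not gbrain's actual schema.
interface FailureEntry {
  file: string;
  error: string;
  acknowledged: boolean;
}

function countUnacked(jsonlText: string): number {
  return jsonlText
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as FailureEntry)
    .filter((e) => !e.acknowledged).length;
}

// Fixing the YAML and re-syncing appends nothing and flips no acknowledged
// flag, so the count from step 1 survives every later successful sync.
const log = [
  JSON.stringify({ file: "a.md", error: "bad YAML", acknowledged: false }),
  JSON.stringify({ file: "b.md", error: "bad YAML", acknowledged: false }),
].join("\n");

console.log(countUnacked(log)); // 2
```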

Root cause

--skip-failed only acknowledges failures from the current run. The ack path lives inside performSync and is gated on failedFiles.length > 0 (src/commands/sync.ts:695). When the diff is empty (because the user already fixed the bad files) or the sync is up to date, that branch never executes, and the flag never reaches acknowledgeSyncFailures(). The doctor and printSyncResult:1251 messages assume the flag handles stale entries — it doesn't.

Repro

$ gbrain doctor
  [WARN] sync_failures: 27 unacknowledged sync failure(s)...
         Fix the file(s) and re-run 'gbrain sync', or use
         'gbrain sync --skip-failed' to acknowledge.
$ # ... fix the YAML ...
$ gbrain sync
Already up to date.
$ gbrain sync --skip-failed
Already up to date.        # ← log untouched
$ gbrain doctor
  [WARN] sync_failures: 27 unacknowledged sync failure(s)...   # still!

Hit while debugging a real brain that had 27 unacked failures from 3 files (2 already-fixed, 1 fresh-broken).

Fix

At the top of runSync, when --skip-failed is set, eagerly ack any pre-existing unacked failures before any sync work runs. The flag now means "acknowledge whatever is currently flagged and move on" regardless of whether the current run produces new failures or finds nothing to do.

The inner per-run ack path stays — it still handles new failures from the current run. The two paths compose: gbrain sync --skip-failed clears stale + advances past anything new, all in one command, matching what the doctor message promises.
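A sketch of the new shape — runSync and acknowledgeSyncFailures are the real names; loadUnackedFailures stands in for however the log is actually read:

```typescript
// Sketch of the fix: runSync acks pre-existing unacked failures up-front,
// before any sync work. loadUnackedFailures is illustrative; the real code
// reads ~/.gbrain/sync-failures.jsonl.
function loadUnackedFailures(): string[] {
  return ["old-broken.md"]; // stale entry left over from an earlier run
}

const ackedFiles: string[] = [];
function acknowledgeSyncFailures(files: string[]): void {
  ackedFiles.push(...files);
}

function runSync(opts: { skipFailed: boolean }): void {
  if (opts.skipFailed) {
    const stale = loadUnackedFailures(); // pre-existing, from earlier runs
    if (stale.length > 0) {
      acknowledgeSyncFailures(stale);
      console.log(`Acknowledged ${stale.length} pre-existing failure(s).`);
    }
  }
  // performSync(...) still acks any NEW failures from this run via the
  // inner failedFiles.length > 0 path — the two paths compose.
}

runSync({ skipFailed: true });
```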

After

$ gbrain sync --skip-failed
Acknowledged 27 pre-existing failure(s).
Already up to date.
$ gbrain doctor
  [OK] sync_failures: N historical sync failure(s), all acknowledged

Tests

Two added in test/sync-failures.test.ts:

  • Source-string pin on the new gate (matches the file's existing CLI-flag test convention).
  • Behavioral test on the underlying acknowledgeSyncFailures path.
$ bun test test/sync-failures.test.ts
41 pass, 0 fail, 75 expect() calls

Behavior preserved

  • Without --skip-failed, the existing block-sync-on-failures behavior is unchanged.
  • Within a single sync run that produces new failures + is invoked with --skip-failed, the inner ack path still handles them (the new code only acks pre-existing entries; the inner path acks the run's own).
  • --retry-failed semantics are unchanged.


…otstrap

The forward-reference bootstrap (PostgresEngine + PGLiteEngine
applyForwardReferenceBootstrap) covered v0.18 + v0.19 + v0.26.5 columns
but missed two later groups. Brains upgrading from v0.14-era to current
master crash before the migration ladder runs:

1. v0.20 Cathedral II — content_chunks.search_vector,
   parent_symbol_path, doc_comment, symbol_name_qualified.
   `CREATE INDEX idx_chunks_search_vector` and
   `CREATE INDEX idx_chunks_symbol_qualified` in schema.sql/PGLITE_SCHEMA_SQL
   crash with "column search_vector does not exist" / "column
   symbol_name_qualified does not exist".

2. v0.26.3 — mcp_request_log.agent_name, params, error_message.
   `CREATE INDEX idx_mcp_log_agent_time ON mcp_request_log(agent_name,...)`
   crashes with "column agent_name does not exist".

Reproduces deterministically on a v0.13/v0.14 brain upgraded straight
to current master. The user hits the wall before any of v15-v36 can run.

Both engines now probe for these columns and pre-add them via
`ALTER TABLE ADD COLUMN IF NOT EXISTS` before SCHEMA_SQL runs. Migrations
v26, v27, v33 still run later via runMigrations and remain idempotent
(they handle backfill on top of the bootstrap-added columns).
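A sketch of what that bootstrap amounts to — the column list comes from the two groups above; `query` is a stand-in for the engine's SQL runner, and the column types are assumptions:

```typescript
// Sketch of the forward-reference bootstrap: before SCHEMA_SQL runs, pre-add
// every column that a later CREATE INDEX statement references. IF NOT EXISTS
// keeps this idempotent; migrations v26/v27/v33 still run afterwards and
// backfill on top of the empty bootstrap-added columns.
type Query = (sql: string) => Promise<void>;

const FORWARD_REFERENCED_COLUMNS: Array<[table: string, column: string, type: string]> = [
  // v0.20 Cathedral II
  ["content_chunks", "search_vector", "tsvector"],
  ["content_chunks", "parent_symbol_path", "text"],
  ["content_chunks", "doc_comment", "text"],
  ["content_chunks", "symbol_name_qualified", "text"],
  // v0.26.3
  ["mcp_request_log", "agent_name", "text"],
  ["mcp_request_log", "params", "jsonb"],
  ["mcp_request_log", "error_message", "text"],
];

async function applyForwardReferenceBootstrap(query: Query): Promise<void> {
  for (const [table, column, type] of FORWARD_REFERENCED_COLUMNS) {
    await query(`ALTER TABLE ${table} ADD COLUMN IF NOT EXISTS ${column} ${type}`);
  }
}
```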

Test coverage extended in test/schema-bootstrap-coverage.test.ts:
REQUIRED_BOOTSTRAP_COVERAGE now lists 6 new forward references; the
strip-and-rebuild block drops the corresponding indexes/triggers so the
test exercises a brain that pre-dates v0.20 + v0.26.3 migrations.

Repro: brain on schema v13/v14 + run `gbrain init --migrate-only` against
current master → fails. With this patch → succeeds; ladder runs to v36.

`package.json` declares `"bin": { "gbrain": "src/cli.ts" }`, and bun's
linker creates `~/.bun/bin/gbrain` as a symlink to the file. The shebang
`#!/usr/bin/env bun` works only when the target file is executable —
otherwise bun runs it as a script (because it sees the script via the
shebang interpreter), but executing the symlinked target itself fails:

  $ ls -la ~/.bun/bin/gbrain
  lrwxrwxrwx ... -> ../install/global/node_modules/gbrain/src/cli.ts
  $ ~/.bun/bin/gbrain --version
  /opt/homebrew/bin/bash: line 1: /Users/brandon/.bun/bin/gbrain: Permission denied

This bites the postinstall hook that calls `gbrain apply-migrations`
(masked by the `||` fallback) and any subprocess that invokes the
binary by absolute path (e.g., subagent_messages migration v0.16's
`execSync('gbrain init --migrate-only', ...)`).

Setting the mode in-tree to 755 fixes both. No content change.
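The underlying rule is the kernel's, not bun's: exec(2) checks the resolved target's mode bits before the shebang is ever read. A small probe, assuming Node-compatible `fs` (which bun provides):

```typescript
// Why the symlinked bin fails: exec resolves ~/.bun/bin/gbrain to
// src/cli.ts; with mode 644 the exec-permission check fails before the
// `#!/usr/bin/env bun` shebang is read, hence "Permission denied".
// Running `bun src/cli.ts` works regardless — bun opens the file as data.
import { statSync } from "node:fs";

function isExecutable(path: string): boolean {
  // statSync follows symlinks, so this inspects the link *target's* mode —
  // the same thing the kernel checks on exec.
  const mode = statSync(path).mode;
  return (mode & 0o111) !== 0; // any of user/group/other exec bits
}
```

With the mode committed as 755, the exec bits survive bun's global install symlink, so both direct invocation and subprocess calls by absolute path work.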
…able

`gbrain doctor` was the only consumer of `findRepoRoot` from
`core/repo-root.ts`. Every other consumer (check-resolvable.ts:145,
skillify.ts, etc.) uses `autoDetectSkillsDir`, which has the full
detection chain:
  1. $OPENCLAW_WORKSPACE
  2. ~/.openclaw/workspace
  3. findRepoRoot() walk from cwd
  4. ./skills

`findRepoRoot` only does step 3. Result: when the user runs `gbrain
doctor` from any directory outside the gbrain repo or the OpenClaw
workspace tree (e.g., a project's checkout), `resolver_health` reports
"Could not find skills directory" even though the dispatcher exists at
~/.openclaw/workspace/skills/RESOLVER.md.
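The gap can be sketched like this — autoDetectSkillsDir and findRepoRoot are the real helper names; the bodies and injected dependencies here are illustrative:

```typescript
// Sketch of autoDetectSkillsDir's four-step chain vs. findRepoRoot's single
// step. Dependencies are injected to keep the sketch self-contained; the
// real helpers read the environment and filesystem directly.
import { join } from "node:path";

interface Env {
  env: Record<string, string | undefined>;
  home: string;
  cwd: string;
  exists: (p: string) => boolean;
  findRepoRoot: (cwd: string) => string | null; // step 3: walk up from cwd
}

function autoDetectSkillsDir(e: Env): string {
  // 1. explicit override
  if (e.env.OPENCLAW_WORKSPACE) return join(e.env.OPENCLAW_WORKSPACE, "skills");
  // 2. conventional workspace location
  const ws = join(e.home, ".openclaw", "workspace", "skills");
  if (e.exists(ws)) return ws;
  // 3. repo-root walk — the ONLY step the findRepoRoot-based doctor had
  const root = e.findRepoRoot(e.cwd);
  if (root) return join(root, "skills");
  // 4. fallback
  return join(e.cwd, "skills");
}
```

From ~/Documents with the workspace present, step 2 fires before the repo-root walk is ever attempted — exactly the case doctor was missing.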

Reproduces in any directory other than ~/gbrain or its descendants on
a system with ~/.openclaw/workspace/skills/RESOLVER.md present:

    $ cd ~/Documents
    $ gbrain doctor
    [WARN] resolver_health: Could not find skills directory   # before
    [WARN] resolver_health: 5 issue(s): 0 error(s), 5 warning(s)  # after

Switching doctor to `autoDetectSkillsDir` brings it inline with the rest
of the codebase. The detected dir is also passed to
`checkSkillConformance` (step 2 of the resolver_health block), which
previously rebuilt the path from `repoRoot` — now uses the same
detected path for consistency.

All 15 existing tests in test/doctor.test.ts continue to pass.
@garrytan
Owner

Closing — your fix landed in master via the v0.30.3 fix-wave PR #776 (merged at ff53a4c9): "--skip-failed acks pre-existing unacked failures up-front".

Thank you for the contribution — credit is preserved in the v0.30.3 CHANGELOG entry. 🙏

@garrytan garrytan closed this May 10, 2026