
v0.26.4 test: parallel unit-test loop (12x speedup, failure-first logging) #605

Merged
garrytan merged 6 commits into master from garrytan/parallel-tests
May 4, 2026


@garrytan garrytan commented May 4, 2026

Summary

bun run test finishes in ~85 seconds. It was 18 minutes. 12x speedup via 8-shard parallel fan-out + dedicated failure-log file.

Test infra (commits 1-2):

  • New scripts/run-unit-parallel.sh (~340 LOC). Spawns N=min(8, cpu_count) shards via existing scripts/run-unit-shard.sh. Per-shard 600s gtimeout/timeout/bg-pid wallclock cap. Single-writer post-shard failure aggregation (no concurrent-write hazards). 10s heartbeat to stderr proving it isn't wedged.
  • New scripts/run-serial-tests.sh runs *.serial.test.ts files at --max-concurrency=1 after the parallel pass.
  • scripts/run-unit-shard.sh accepts --max-concurrency=N; excludes *.serial.test.ts alongside *.slow.test.ts.
  • package.json script tier split: test (fast loop) / verify (CI-narrow gates) / test:full (verify + parallel + slow + smart e2e) / test:serial / test:e2e / check:all.
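The shard-count rule above, N=min(8, cpu_count), can be sketched roughly as follows. This is an illustration only, not the contents of scripts/run-unit-parallel.sh: the helper names `cpu_count` and `shard_count` and the CPU-detection fallback order are assumptions.

```shell
# Hypothetical helpers; the real wrapper may detect CPUs differently.
cpu_count() {
  # Try nproc (GNU coreutils), then getconf, then macOS sysctl; else assume 1.
  nproc 2>/dev/null \
    || getconf _NPROCESSORS_ONLN 2>/dev/null \
    || sysctl -n hw.ncpu 2>/dev/null \
    || echo 1
}

# Print min(max, cpu_count); max defaults to 8 as in the PR description.
shard_count() {
  local max="${1:-8}"
  local cpus
  cpus="$(cpu_count)"
  if [ "$cpus" -lt "$max" ]; then
    echo "$cpus"
  else
    echo "$max"
  fi
}
```

On a 4-core laptop `shard_count` would print 4; on a 16-core workstation it would cap at 8.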

Failure-first logging (commits 1-3):

  • .context/test-failures.log — extracted failure blocks per shard, prefixed --- shard N: <test name> ---. Falls back to /tmp/ if .context/ is unwritable.
  • .context/test-summary.txt — one-line-per-shard pass=X fail=Y skip=Z rc=W.
  • Stderr banner with absolute path + tail-30 of failure log on any failure. Survives | head / | tail / agent log truncation.
  • .gitignore adds .context/.
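A minimal sketch of the failure-log contract described above, assuming hypothetical helper names (`pick_log_dir`, `append_failure`, `print_banner`) and a made-up banner wording. Only the `--- shard N: <test name> ---` prefix, the /tmp fallback, and the tail-30 on stderr come from this PR.

```shell
# Hypothetical helpers illustrating the log contract; not the wrapper's code.

# Prefer .context/, fall back to /tmp if it cannot be created or written.
pick_log_dir() {
  local preferred="${1:-.context}"
  if mkdir -p "$preferred" 2>/dev/null && [ -w "$preferred" ]; then
    echo "$preferred"
  else
    echo "/tmp"
  fi
}

# Append one failure block (read from stdin) under the documented prefix.
append_failure() {
  local log="$1" shard="$2" name="$3"
  {
    printf '%s shard %s: %s %s\n' '---' "$shard" "$name" '---'
    cat
    printf '\n'
  } >> "$log"
}

# Print the absolute log path plus the last 30 log lines to stderr.
print_banner() {
  local log="$1"
  local abs
  abs="$(cd "$(dirname "$log")" && pwd)/$(basename "$log")"
  {
    echo "==== TEST FAILURES: $abs ===="
    tail -n 30 "$log"
  } >&2
}
```

Because the banner goes to stderr and names an absolute path, piping stdout through `head` or `tail` still leaves a pointer to the full log.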

CI gate tightening (commit 3):

  • .github/workflows/test.yml now runs bun run verify (was: 4 specific scripts inlined). Privacy check now actually fires on every CI run; previously it ran only when somebody manually invoked the old bun run test chain. Caught two pre-existing Wintermute references in src/core/mounts-cache.ts and replaced with your OpenClaw per CLAUDE.md privacy rule.

Quarantine (commit 3):

  • test/brain-registry.test.ts → test/brain-registry.serial.test.ts (1 case fails under cross-file contention)
  • test/reconcile-links.test.ts → test/reconcile-links.serial.test.ts (beforeEach hook timeout under contention)

Both pass alone. The proper architectural fix (sweep ~58 PGLite + ~40 env-mutation + 2 mock.module sites + add --concurrent flag) is filed as a P0 TODO for v0.27+.

Regression tests (commit 4, 4 files, 13 cases):

  • test/scripts/run-unit-parallel.test.ts — exit-code propagation + failure-log contract (uses 4-fixture tempdir, ~500ms)
  • test/scripts/run-unit-shard.test.ts — exclusion symmetry (slow + serial + e2e all excluded)
  • test/scripts/serial-files.test.ts — discovery + concurrency=1 + disjoint from unit-shard set
  • test/privacy-script-wired.test.ts — updated to assert verify chains check:privacy AND test.yml calls bun run verify

Docs (commit 5):

  • CLAUDE.md Testing section rewritten with tier table, intentional CI-vs-local divergence section, failure-log contract, file taxonomy.
  • CHANGELOG.md ## [0.26.4] entry per voice rules.
  • TODOS.md adds P0: intra-file parallelism via --concurrent (~1-2 weeks; target bun run test <30s) and P1: E2E template-DB parallelism.
  • llms-full.txt regenerated.

Also merged master (v0.26.3 admin observability landed during this branch). verify now also runs master's check:admin-build.sh gate (vite build of the admin React app). Tests are still green at 88s post-merge.

Test Coverage

  • 4 new test files, 13 cases for the new wrapper / shard / serial / privacy contracts.
  • All ran green inline against a 4-fixture tempdir.
  • Tests: ~3650 → ~3650 + 13 (regression suite for v0.26.4 itself).

Pre-Landing Review

Eng review and Codex outside-voice review were run interactively during planning. Codex flagged 4 critical structural issues (Bun native --shard non-functional on file lists, parity test impossible by design, freshPglite() contradicts existing resetPglite() helper, verify was redefining the ship gate). All 4 resolved by user via AskUserQuestion. Plan file: ~/.claude/plans/system-instruction-you-are-working-tranquil-ladybug.md — full GSTACK REVIEW REPORT included.

TODOS

  • NEW P0: Intra-file parallelism via --concurrent flag. Sweep ~58 PGLite + ~40 env-mutation + 2 mock.module sites using existing test/helpers/reset-pglite.ts (do NOT introduce freshPglite() — codex correctly flagged that the repo already rejected that direction). Target bun run test <30s.
  • NEW P1: E2E parallelism via Postgres template databases. ~1-2 days.

Test plan

  • bun run verify green (privacy + jsonb + progress + wasm + admin-build + typecheck)
  • bun run test 88s, 3657 pass / 0 fail / 0 skip (8 parallel shards + 34 serial)
  • Failure-log contract verified: deliberately failing test produces --- shard N: prefix in .context/test-failures.log + loud stderr banner
  • All 13 regression tests green
  • CI workflow updated to call bun run verify (single-source-of-truth for ship gate)

🤖 Generated with Claude Code



garrytan and others added 6 commits May 3, 2026 17:05
Lay foundation for v0.26.4 parallel test loop:

- scripts/run-unit-parallel.sh: spawns N shards (default min(8, cpu_count))
  via run-unit-shard.sh, captures per-shard logs, post-shard single-writer
  failure-log aggregation at .context/test-failures.log, 10s heartbeat to
  stderr, per-shard 600s timeout (gtimeout/timeout/bg-pid fallback chain),
  loud final banner with absolute path + tail-30 of failures, summary file
  for at-a-glance status. Single writer eliminates concurrent-write hazards
  on the failure log.
- scripts/run-serial-tests.sh: discovers *.serial.test.ts files (concurrency-
  unsafe by design), runs them with --max-concurrency=1. Invoked after the
  parallel pass.
- scripts/run-unit-shard.sh: now accepts --max-concurrency=N (forwarded to
  bun test); --dry-run-list moved into argv parsing alongside; excludes
  *.serial.test.ts in addition to *.slow.test.ts.
- bunfig.toml: trim stale comment about typecheck-chained timeout.
- .gitignore: add .context/ (Conductor workspace artifacts directory; the
  failure log + summary + per-shard logs all live here).
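The gtimeout/timeout/bg-pid fallback chain mentioned above could look roughly like this. It is a sketch under assumptions: the function name `run_with_timeout` is hypothetical, and the actual wrapper may handle signals and exit codes differently.

```shell
# Hypothetical helper: run "$@" with a wallclock cap of $1 seconds.
run_with_timeout() {
  local secs="$1"
  shift
  if command -v gtimeout >/dev/null 2>&1; then
    # GNU coreutils installed via Homebrew on macOS.
    gtimeout "$secs" "$@"
  elif command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    # Last resort: background the command and arm a watchdog that kills it.
    "$@" &
    local pid=$!
    ( sleep "$secs"; kill -TERM "$pid" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog=$!
    local rc=0
    wait "$pid" || rc=$?
    # The command finished (or was killed); disarm the watchdog.
    kill "$watchdog" 2>/dev/null || true
    wait "$watchdog" 2>/dev/null || true
    return "$rc"
  fi
}
```

With GNU `timeout` a capped command exits 124; in the background-PID branch it exits with the signal status instead, so callers should treat any non-zero status as a failed shard rather than match a specific code.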

No package.json changes yet (commit 2). No test reorganization yet
(commits 4-7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…commit 2/8)

Per Codex Tension #4 (verify scope), distinguish three tiers cleanly:

- `bun run test` = fast loop, file-level parallel fan-out via the new wrapper
  (scripts/run-unit-parallel.sh). No pre-checks, no typecheck, no wasm
  compile in the hot path. ~15s of pre-test gates removed.
- `bun run verify` = CI's authoritative gate set: check:jsonb +
  check:progress + check:wasm + typecheck. Matches what
  .github/workflows/test.yml runs on shard 1, no scope drift. The 4
  checks not in CI (privacy, no-legacy-getconnection, trailing-newline,
  exports-count) move to `bun run check:all` for opt-in local use.
- `bun run test:full` = verify + parallel + slow + smart e2e (runs e2e
  only if DATABASE_URL is set; else loud skip notice to stderr per Open
  Item #7). The local equivalent of "everything CI runs."
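The "smart e2e" behaviour of test:full (run e2e only when DATABASE_URL is set, otherwise print a loud skip notice to stderr) can be sketched as below. The function name and message wording are assumptions, not the shipped script.

```shell
# Hypothetical gate: run the given command only if DATABASE_URL is set.
run_e2e_if_configured() {
  if [ -z "${DATABASE_URL:-}" ]; then
    # Loud notice on stderr so the skip is visible even if stdout is piped.
    echo "SKIP: DATABASE_URL is not set; e2e tests were NOT run" >&2
    return 0
  fi
  "$@"
}
```

A call such as `run_e2e_if_configured bun run test:e2e` would then be a no-op with a warning on machines without a database, and a normal e2e run elsewhere.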

Adds `bun run test:serial` for the *.serial.test.ts subset (concurrency-
unsafe files run with --max-concurrency=1).

Bumps VERSION + package.json to 0.26.4. Both move together per the CI
version-gate contract in CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wave: makes the new wrapper actually green and tightens the CI gate it
exposed.

Wrapper bug fixes (scripts/run-unit-parallel.sh):
- grep_count helper: avoids the `grep -c || echo 0` double-output bug
  where 0 matches yields a 2-line "0\n0" string and breaks arithmetic
  (`grep -c` prints 0 and also exits 1, so the fallback echo fires too).
- bun_summary_count helper: parses Bun's actual end-of-shard summary
  format (`N pass` / `N fail` / `N skip`), not the per-test markers
  (which are `✓` / `(fail)`, never `(pass)` / `(skip)`).
- Heartbeat now reads `^\s+✓` (Bun's per-test pass marker) for live
  progress mid-run; final summary still uses the summary-line counts
  for accuracy.
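To make the two helper fixes concrete, here is a sketch. The names `grep_count` and `bun_summary_count` come from the commit message, but the bodies below are assumptions about how they might be written; `buggy_count` is a hypothetical reproduction of the bug, and the exact whitespace of Bun's summary lines is assumed.

```shell
# The bug: with zero matches, `grep -c` prints "0" AND exits 1,
# so `|| echo 0` appends a second "0".
buggy_count() {
  grep -c -- "$1" "$2" || echo 0
}

# One possible fix: capture the output, ignore the exit status, default to 0.
grep_count() {
  local n
  n="$(grep -c -- "$1" "$2" 2>/dev/null)" || true
  echo "${n:-0}"
}

# Read a summary-line count such as " 12 pass" from a shard log.
# $1 is the keyword (pass, fail or skip), $2 is the log file.
bun_summary_count() {
  awk -v k="$1" \
    'NF == 2 && $2 == k && $1 ~ /^[0-9]+$/ { n = $1 } END { print n + 0 }' "$2"
}
```

The two-line output of `buggy_count` is what breaks `$(( ... ))` arithmetic downstream; `grep_count` always prints exactly one integer.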

Privacy gate tightening:
- Move scripts/check-privacy.sh into `bun run verify` (was previously
  only in the now-removed `bun run test` chain). Without this, after
  commit 2 the privacy check ran in nothing automatic.
- .github/workflows/test.yml now calls `bun run verify` instead of
  inlining the gate list. Single source of truth for "what's the ship
  gate." This is what verify == CI was supposed to mean per Codex T#4.
- Pre-existing `Wintermute` references in src/core/mounts-cache.ts:6
  and :324 caught by the now-running gate; replaced with `your OpenClaw`
  per CLAUDE.md privacy rule (verify gate now passes on master HEAD).
- test/privacy-script-wired.test.ts updated: regression guard now
  asserts verify includes check:privacy AND that test.yml runs
  `bun run verify`, replacing the obsolete "test script includes
  check-privacy.sh" assertion.

Quarantine 2 cross-file-contention flakes:
- test/brain-registry.test.ts: 28 tests pass alone (41ms); 1 test
  ("empty/null/undefined id routes to host") fails when run alongside
  other files in the same shard. Renamed → *.serial.test.ts so it
  runs in scripts/run-serial-tests.sh's serial pass after the parallel
  pass completes.
- test/reconcile-links.test.ts: 6 tests pass alone (1s); a beforeEach
  hook times out (~896s) under cross-file contention. Same treatment.

Both flakes are bun-process-level shared-state leaks (PGLite singletons
or top-level imports). Fixing them properly is the v0.27.0+ intra-file
parallelism project (TODO P0 — see commit 5).

Measurement after this commit:
  bun run test = 94s (was 18 min sequential)
  3639 pass, 0 fail, 0 skip across 8 parallel shards + 34 serial tests
  Failure-log + heartbeat + summary all working

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…commit 4/5)

Three regression suites pin the v0.26.4 contracts. Without these,
future refactors of the wrapper or shard scripts could silently
regress the work in commits 1-3.

test/scripts/run-unit-shard.test.ts (4 cases — gap b):
- Asserts the unit-shard `--dry-run-list` output excludes every
  *.slow.test.ts and *.serial.test.ts file, plus the test/e2e/ subtree.
- Catches a future `find` expression that drops one of the `-not -name`
  clauses and silently un-quarantines slow/serial files into the
  parallel pass.
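The kind of `find` expression this suite guards might look like the sketch below. The function name `list_unit_tests` is hypothetical and the real script's expression may differ; the point is that dropping any one exclusion clause silently widens the parallel set.

```shell
# Hypothetical listing of files eligible for the parallel unit pass.
list_unit_tests() {
  find "${1:-test}" -type f -name '*.test.ts' \
    -not -name '*.slow.test.ts' \
    -not -name '*.serial.test.ts' \
    -not -path '*/e2e/*' | sort
}
```

Note that `*.slow.test.ts` and `*.serial.test.ts` also match `*.test.ts`, which is why the exclusions have to be explicit.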

test/scripts/serial-files.test.ts (3 cases — gap e):
- Every checked-in *.serial.test.ts (via `git ls-files`) is listed by
  scripts/run-serial-tests.sh's `--dry-run-list`.
- The script's source contains `bun test --max-concurrency=1` (the
  serial-pass guarantee that quarantined files don't run intra-file
  concurrent and reintroduce the contention they were quarantined for).
- Disjoint set: a file is never in both the unit-shard list AND the
  serial list — pins the carve-out contract.

test/scripts/run-unit-parallel.test.ts (6 cases — gaps a + d):
- Exit-code propagation (a): wrapper exits non-zero when ANY shard
  has a failing test; exits zero when all pass. The hardest contract
  to silently break in a fan-out wrapper (in bash, a bare `wait` after
  `for ...; do cmd & done` returns 0, and `wait` with several PIDs
  returns only the LAST one's status, not any failure's).
- Failure-log contract (d): on failure, .context/test-failures.log
  exists, is non-empty, contains the `--- shard N:` prefix and the
  failing test's describe text. Stderr banner contains the absolute
  log path. On success, the log is cleared (no stale content).
- Summary file format: `shard N/M: pass=X fail=Y skip=Z rc=W` per
  shard, machine-parseable for future tooling.
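The exit-code propagation contract can be illustrated with a small fan-out sketch: remember every child PID, wait on each one individually, and fold the statuses together. The function name `run_shards` is hypothetical and each "shard" here is just a shell command string; the real wrapper spawns run-unit-shard.sh.

```shell
# Hypothetical fan-out: run each argument as a background command and
# return non-zero if ANY of them failed.
run_shards() {
  local pids="" rc=0 pid cmd
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  # Wait on each PID separately so no failure is masked by a later success.
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}
```

For example, `run_shards "exit 2" "exit 0"` returns 1 even though the last child succeeded.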

The wrapper test runs against a 4-file tempdir (3 pass + 1 fail) so
it executes in ~500ms; spawning the wrapper against the real test
suite would take ~90s and isn't worth the cost in a regression suite.

All 13 cases pass on first run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mmit 5/5)

Closes the v0.26.4 ship.

CLAUDE.md Testing section rewritten:
- New tier table: test (fast loop, 85s) / verify (CI gates, 12s) /
  test:full (everything local) / test:slow / test:serial / test:e2e /
  check:all. Each row names its scope, wallclock, and when to use.
- Intentional CI vs local divergence section: CI matrix (test-shard.sh,
  hash-bucketed, includes slow) vs local fast loop (run-unit-shard.sh,
  round-robin, excludes slow + serial). Codex correctly flagged that a
  parity test would always fail by design — this is the documentation
  that explains why.
- Failure-first logging contract: .context/test-failures.log format,
  stderr banner, summary file, wedge handling.
- File taxonomy: *.test.ts / *.slow.test.ts / *.serial.test.ts /
  test/e2e/. Names the two currently-quarantined files and points at the
  intra-file P0 TODO for the proper fix.
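To see why a CI-vs-local parity test would fail by design, compare the two assignment strategies in a sketch. Both functions are hypothetical, and `cksum` is an assumed stand-in for whatever hash test-shard.sh actually uses.

```shell
# Round-robin: reads a file list on stdin and prints the files for
# shard $1 (1-based) out of $2 shards, based on list position.
round_robin_shard() {
  awk -v k="$1" -v n="$2" '(NR - 1) % n == k - 1'
}

# Hash-bucketed: prints the shard index (1..$2) for path $1, based only
# on the path itself (cksum is an assumed stand-in hash).
hash_shard() {
  local sum
  sum="$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
  echo $(( sum % $2 + 1 ))
}
```

Round-robin placement depends on a file's position in the sorted list, so adding one file shifts its neighbours; hash bucketing depends only on the path. The two schemes therefore generally put the same file in different shards.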

CHANGELOG.md `## [0.26.4]` entry per voice rules:
- Two-line headline: "bun run test finishes in 85 seconds. Was 18
  minutes." + failure-log directive.
- Lead paragraph names what shipped and why.
- Numbers-that-matter table: BEFORE / AFTER / Δ for wallclock, pre-test
  gates, failure visibility, shards, pipe-survival.
- "What this means for you" closing tied to the inner-loop user.
- "To take advantage of v0.26.4" block per the v0.13+ self-repair
  template (gbrain upgrade + contributor steps).
- Itemized changes by area (new scripts, script extensions, package.json
  tier split, CI tightening, failure-first logging, quarantine, regression
  tests, bunfig).
- "What did NOT ship" section names the intra-file project + E2E
  template-DB project as P0/P1 follow-ups with concrete acceptance
  criteria.
- Process section names the codex review + scope-correction loop
  honestly: "snapped back to ship today once empirical measurement showed
  Bun's --max-concurrency does nothing on tests not marked
  test.concurrent()."
- For-contributors note on portability + single-writer + fallback paths.

TODOS.md adds two P-rated entries:
- P0: intra-file parallelism via --concurrent flag. Sweep ~58 PGLite
  sites + ~40 env mutations + 2 mock.module sites. Target: bun run test
  < 30s. ~1-2 weeks. Detailed acceptance criteria. References Codex
  findings and plan-file rationale.
- P1: E2E parallelism via Postgres template databases. CREATE DATABASE
  TEMPLATE gbrain_template per test file. ~1-2 days.

llms.txt + llms-full.txt regenerated via `bun run build:llms` to absorb
the CLAUDE.md changes (per CLAUDE.md's "After any release ship that
touches the Key Files annotations in CLAUDE.md, run bun run build:llms"
rule). The build-llms regression test was firing in shard 7 of the
parallel pass — caught the drift, regeneration cleared it. Final
measurement after fix: 94s wallclock, 3652 pass, 0 fail across 8
parallel shards + 34 serial tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	src/core/mounts-cache.ts
@garrytan garrytan merged commit d97f159 into master May 4, 2026
7 checks passed
garrytan added a commit that referenced this pull request May 4, 2026
Master added two more commits:
- d97f159 v0.26.4 test: parallel unit-test loop (12x speedup, #605)
- 0de9eb6 v0.26.5 feat: destructive operation guard end-to-end (#600)

Resolved three conflicts (all version bookkeeping):

- VERSION: kept 0.26.6 (this branch's version, ahead of master's 0.26.5)
- package.json: kept 0.26.6
- CHANGELOG.md: my v0.26.6 entry on top, then master's new v0.26.5 +
  v0.26.4 blocks below. Final order: 0.26.6 → 0.26.5 → 0.26.4 →
  0.26.3 → 0.26.2 → 0.26.1 → 0.26.0 (top to bottom, contiguous).

Schema-drift gate sanity check post-merge:
- Master's v0.26.5 destructive-guard work added pages.deleted_at
  (with partial index pages_deleted_at_purge_idx) and three columns
  on sources (archived, archived_at, archive_expires_at). All four
  are present in BOTH src/schema.sql AND src/core/pglite-schema.ts —
  master kept them in lockstep, so the gate is satisfied automatically.
- access_tokens.id is still UUID DEFAULT gen_random_uuid() in both
  engines (my v0.26.3 D6 fix preserved across the merge).
- Typecheck clean, schema-diff unit tests 17/17 pass, privacy script
  clean (master's v0.26.4 work fixed the Wintermute references in
  mounts-cache.ts that I had patched earlier — converged independently).

Regenerated llms-full.txt to match the merged CHANGELOG.