v0.26.4 test: parallel unit-test loop (12x speedup, failure-first logging) #605
Merged
Lay foundation for v0.26.4 parallel test loop:
- scripts/run-unit-parallel.sh: spawns N shards (default min(8, cpu_count)) via run-unit-shard.sh, captures per-shard logs, post-shard single-writer failure-log aggregation at .context/test-failures.log, 10s heartbeat to stderr, per-shard 600s timeout (gtimeout/timeout/bg-pid fallback chain), loud final banner with absolute path + tail-30 of failures, summary file for at-a-glance status. Single writer eliminates concurrent-write hazards on the failure log.
- scripts/run-serial-tests.sh: discovers *.serial.test.ts files (concurrency-unsafe by design), runs them with --max-concurrency=1. Invoked after the parallel pass.
- scripts/run-unit-shard.sh: now accepts --max-concurrency=N (forwarded to bun test); --dry-run-list moved into argv parsing alongside; excludes *.serial.test.ts in addition to *.slow.test.ts.
- bunfig.toml: trim stale comment about typecheck-chained timeout.
- .gitignore: add .context/ (Conductor workspace artifacts directory; the failure log + summary + per-shard logs all live here).
No package.json changes yet (commit 2). No test reorganization yet (commits 4-7).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
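The gtimeout/timeout/bg-pid fallback chain mentioned above could look roughly like the sketch below. This is not the code from scripts/run-unit-parallel.sh; the function name `run_with_timeout` and the watchdog layout are assumptions made for illustration.

```shell
# Hypothetical helper (not taken from the PR): run a command under a
# wallclock cap using whichever timeout mechanism is available.
run_with_timeout() {
  local secs=$1
  shift
  if command -v gtimeout >/dev/null 2>&1; then
    # Homebrew coreutils on macOS installs GNU timeout as gtimeout.
    gtimeout "$secs" "$@"
  elif command -v timeout >/dev/null 2>&1; then
    timeout "$secs" "$@"
  else
    # Last resort: background the command and start a watchdog that
    # sends SIGTERM once the cap is reached.
    "$@" &
    local cmd_pid=$!
    ( sleep "$secs"; kill -TERM "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    local watchdog_pid=$!
    local rc=0
    wait "$cmd_pid" || rc=$?
    # The command finished (or was killed); stop the watchdog.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true
    return "$rc"
  fi
}
```

A shard would then be launched as something like `run_with_timeout 600 scripts/run-unit-shard.sh ...`. The exit status on expiry differs between the branches (GNU timeout exits 124, the watchdog branch reports the signal status), so a caller should treat any non-zero status as a failed shard.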
…commit 2/8)
Per Codex Tension #4 (verify scope), distinguish three tiers cleanly:
- `bun run test` = fast loop, file-level parallel fan-out via the new wrapper (scripts/run-unit-parallel.sh). No pre-checks, no typecheck, no wasm compile in the hot path. ~15s of pre-test gates removed.
- `bun run verify` = CI's authoritative gate set: check:jsonb + check:progress + check:wasm + typecheck. Matches what .github/workflows/test.yml runs on shard 1, no scope drift. The 4 checks not in CI (privacy, no-legacy-getconnection, trailing-newline, exports-count) move to `bun run check:all` for opt-in local use.
- `bun run test:full` = verify + parallel + slow + smart e2e (runs e2e only if DATABASE_URL is set; else loud skip notice to stderr per Open Item #7). The local equivalent of "everything CI runs."
Adds `bun run test:serial` for the *.serial.test.ts subset (concurrency-unsafe files run with --max-concurrency=1).
Bumps VERSION + package.json to 0.26.4. Both move together per the CI version-gate contract in CLAUDE.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
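The "smart e2e" behaviour in `test:full` (run e2e only when DATABASE_URL is set, otherwise print a loud skip notice to stderr) can be sketched as follows. The function name `run_e2e_if_configured` and the wording of the notice are assumptions; the PR does not show how the step is actually wired.

```shell
# Hypothetical sketch of the "smart e2e" step: run the given command only
# when DATABASE_URL is set, otherwise skip loudly on stderr.
run_e2e_if_configured() {
  if [ -z "${DATABASE_URL:-}" ]; then
    echo "SKIP: DATABASE_URL is not set; e2e tests were NOT run" >&2
    return 0
  fi
  "$@"
}
```

It would be invoked as, for example, `run_e2e_if_configured bun run test:e2e`. Writing the notice to stderr keeps it visible even when stdout is piped or captured.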
Wave: makes the new wrapper actually green and tightens the CI gate it
exposed.
Wrapper bug fixes (scripts/run-unit-parallel.sh):
- grep_count helper: avoids the `grep -c || echo 0` double-output bug
where 0 matches yields a 2-line "0\n0" string (grep prints 0 and exits 1,
so the fallback echo also fires) and breaks arithmetic.
- bun_summary_count helper: parses Bun's actual end-of-shard summary
format (`N pass` / `N fail` / `N skip`), not the per-test markers
(which are `✓` / `(fail)`, never `(pass)` / `(skip)`).
- Heartbeat now reads `^\s+✓` (Bun's per-test pass marker) for live
progress mid-run; final summary still uses the summary-line counts
for accuracy.
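A minimal sketch of the two helpers described above, assuming the behaviour stated in this commit message. The function bodies are illustrative and are not copied from scripts/run-unit-parallel.sh; only the helper names and the `N pass` / `N fail` / `N skip` summary format come from the text.

```shell
# grep -c prints "0" and exits with status 1 when nothing matches, so a
# trailing fallback echo would add a second "0". Capture the output first
# and default only when it is empty (for example, when the file is missing).
grep_count() {
  # $1 = extended regex, $2 = file. Always prints exactly one integer.
  local n
  n=$(grep -cE -- "$1" "$2" 2>/dev/null) || true
  echo "${n:-0}"
}

# Read the count from an end-of-run summary line such as " 12 pass".
bun_summary_count() {
  # $1 = label (pass, fail or skip), $2 = shard log file.
  local n
  n=$(sed -nE "s/^[[:space:]]*([0-9]+) $1\$/\1/p" "$2" 2>/dev/null | tail -n 1)
  echo "${n:-0}"
}
```

A heartbeat could call `grep_count '^[[:space:]]+✓' "$shard_log"` for live progress, while the final summary would use `bun_summary_count pass "$shard_log"` and its siblings (`shard_log` is a placeholder name).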
Privacy gate tightening:
- Move scripts/check-privacy.sh into `bun run verify` (was previously
only in the now-removed `bun run test` chain). Without this, after
commit 2 the privacy check ran in nothing automatic.
- .github/workflows/test.yml now calls `bun run verify` instead of
inlining the gate list. Single source of truth for "what's the ship
gate." This is what verify == CI was supposed to mean per Codex T#4.
- Pre-existing `Wintermute` references in src/core/mounts-cache.ts:6
and :324 caught by the now-running gate; replaced with `your OpenClaw`
per CLAUDE.md privacy rule (verify gate now passes on master HEAD).
- test/privacy-script-wired.test.ts updated: regression guard now
asserts verify includes check:privacy AND that test.yml runs
`bun run verify`, replacing the obsolete "test script includes
check-privacy.sh" assertion.
Quarantine 2 cross-file-contention flakes:
- test/brain-registry.test.ts: 28 tests pass alone (41ms); 1 test
("empty/null/undefined id routes to host") fails when run alongside
other files in the same shard. Renamed → *.serial.test.ts so it
runs in scripts/run-serial-tests.sh's serial pass after the parallel
pass completes.
- test/reconcile-links.test.ts: 6 tests pass alone (1s); a beforeEach
hook times out (~896s) under cross-file contention. Same treatment.
Both flakes are bun-process-level shared-state leaks (PGLite singletons
or top-level imports). Fixing them properly is the v0.27.0+ intra-file
parallelism project (TODO P0 — see commit 5).
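The rename works because file discovery is driven by name patterns. A rough sketch of the two discovery passes, assuming a `find`-based implementation; the function names and the exact expressions are hypothetical, not taken from the scripts:

```shell
# Hypothetical discovery helpers. $1 is the test root (defaults to "test").
list_parallel_files() {
  # Everything named *.test.ts except slow files, serial files and e2e.
  find "${1:-test}" -type f -name '*.test.ts' \
    -not -name '*.slow.test.ts' \
    -not -name '*.serial.test.ts' \
    -not -path '*/e2e/*' | sort
}

list_serial_files() {
  # Only the quarantined, concurrency-unsafe files.
  find "${1:-test}" -type f -name '*.serial.test.ts' | sort
}
```

Dropping any one of the `-not` clauses would silently move slow, serial or e2e files back into the parallel pass, so the two lists should never share a file.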
Measurement after this commit:
bun run test = 94s (was 18 min sequential)
3639 pass, 0 fail, 0 skip across 8 parallel shards + 34 serial tests
Failure-log + heartbeat + summary all working
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…commit 4/5)
Three regression suites pin the v0.26.4 contracts. Without these, future refactors of the wrapper or shard scripts could silently regress the work in commits 1-3.
test/scripts/run-unit-shard.test.ts (4 cases, gap b):
- Asserts the unit-shard `--dry-run-list` output excludes every *.slow.test.ts and *.serial.test.ts file, plus the test/e2e/ subtree.
- Catches a future `find` expression that drops one of the `-not -name` clauses and silently un-quarantines slow/serial files into the parallel pass.
test/scripts/serial-files.test.ts (3 cases, gap e):
- Every checked-in *.serial.test.ts (via `git ls-files`) is listed by scripts/run-serial-tests.sh's `--dry-run-list`.
- The script's source contains `bun test --max-concurrency=1` (the serial-pass guarantee that quarantined files don't run intra-file concurrent and reintroduce the contention they were quarantined for).
- Disjoint set: a file is never in both the unit-shard list AND the serial list. This pins the carve-out contract.
test/scripts/run-unit-parallel.test.ts (6 cases, gaps a + d):
- Exit-code propagation (a): wrapper exits non-zero when ANY shard has a failing test; exits zero when all pass. The hardest contract to silently break in a fan-out wrapper (after a `for ... &` loop, a bare `wait` returns zero in bash, and `wait` with several PIDs returns only the LAST child's status, not any failure's).
- Failure-log contract (d): on failure, .context/test-failures.log exists, is non-empty, contains the `--- shard N:` prefix and the failing test's describe text. Stderr banner contains the absolute log path. On success, the log is cleared (no stale content).
- Summary file format: `shard N/M: pass=X fail=Y skip=Z rc=W` per shard, machine-parseable for future tooling.
The wrapper test runs against a 4-file tempdir (3 pass + 1 fail) so it executes in ~500ms; spawning the wrapper against the real test suite would take ~90s and isn't worth the cost in a regression suite.
All 13 cases pass on first run.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
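The exit-code propagation contract can be illustrated with a small fan-out sketch. It is a simplified stand-in for the wrapper, not its actual code: `run_all` is a hypothetical name, and each argument is a command string playing the role of one shard.

```shell
# Hypothetical fan-out: start every command in the background, then wait
# on each PID individually so that no failure is masked by a later success.
run_all() {
  local pids="" rc=0 pid cmd
  for cmd in "$@"; do
    sh -c "$cmd" &
    pids="$pids $!"
  done
  for pid in $pids; do
    # wait PID returns that child's exit status; remember any failure.
    wait "$pid" || rc=1
  done
  return "$rc"
}
```

With this shape, `run_all 'exit 0' 'exit 1' 'exit 0'` returns non-zero even though the last child succeeded, which is the property the regression test pins.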
…mmit 5/5)
Closes the v0.26.4 ship.
CLAUDE.md Testing section rewritten:
- New tier table: test (fast loop, 85s) / verify (CI gates, 12s) / test:full (everything local) / test:slow / test:serial / test:e2e / check:all. Each row names its scope, wallclock, and when to use.
- Intentional CI vs local divergence section: CI matrix (test-shard.sh, hash-bucketed, includes slow) vs local fast loop (run-unit-shard.sh, round-robin, excludes slow + serial). Codex correctly flagged that a parity test would always fail by design; this is the documentation that explains why.
- Failure-first logging contract: .context/test-failures.log format, stderr banner, summary file, wedge handling.
- File taxonomy: *.test.ts / *.slow.test.ts / *.serial.test.ts / test/e2e/. Names the two currently-quarantined files and points at the intra-file P0 TODO for the proper fix.
CHANGELOG.md `## [0.26.4]` entry per voice rules:
- Two-line headline: "bun run test finishes in 85 seconds. Was 18 minutes." + failure-log directive.
- Lead paragraph names what shipped and why.
- Numbers-that-matter table: BEFORE / AFTER / Δ for wallclock, pre-test gates, failure visibility, shards, pipe-survival.
- "What this means for you" closing tied to the inner-loop user.
- "To take advantage of v0.26.4" block per the v0.13+ self-repair template (gbrain upgrade + contributor steps).
- Itemized changes by area (new scripts, script extensions, package.json tier split, CI tightening, failure-first logging, quarantine, regression tests, bunfig).
- "What did NOT ship" section names the intra-file project + E2E template-DB project as P0/P1 follow-ups with concrete acceptance criteria.
- Process section names the codex review + scope-correction loop honestly: "snapped back to ship today once empirical measurement showed Bun's --max-concurrency does nothing on tests not marked test.concurrent()."
- For-contributors note on portability + single-writer + fallback paths.
TODOS.md adds two P-rated entries:
- P0: intra-file parallelism via --concurrent flag. Sweep ~58 PGLite sites + ~40 env mutations + 2 mock.module sites. Target: bun run test < 30s. ~1-2 weeks. Detailed acceptance criteria. References Codex findings and plan-file rationale.
- P1: E2E parallelism via Postgres template databases. CREATE DATABASE TEMPLATE gbrain_template per test file. ~1-2 days.
llms.txt + llms-full.txt regenerated via `bun run build:llms` to absorb the CLAUDE.md changes (per CLAUDE.md's "After any release ship that touches the Key Files annotations in CLAUDE.md, run bun run build:llms" rule). The build-llms regression test was firing in shard 7 of the parallel pass; it caught the drift, and regeneration cleared it.
Final measurement after fix: 94s wallclock, 3652 pass, 0 fail across 8 parallel shards + 34 serial tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ests
# Conflicts:
# CHANGELOG.md
# VERSION
# package.json
# src/core/mounts-cache.ts
garrytan added a commit that referenced this pull request May 4, 2026
Master added two more commits:
- d97f159 v0.26.4 test: parallel unit-test loop (12x speedup, #605)
- 0de9eb6 v0.26.5 feat: destructive operation guard end-to-end (#600)
Resolved three conflicts (all version bookkeeping):
- VERSION: kept 0.26.6 (this branch's version, ahead of master's 0.26.5)
- package.json: kept 0.26.6
- CHANGELOG.md: my v0.26.6 entry on top, then master's new v0.26.5 + v0.26.4 blocks below. Final order: 0.26.6 → 0.26.5 → 0.26.4 → 0.26.3 → 0.26.2 → 0.26.1 → 0.26.0 (top to bottom, contiguous).
Schema-drift gate sanity check post-merge:
- Master's v0.26.5 destructive-guard work added pages.deleted_at (with partial index pages_deleted_at_purge_idx) and three columns on sources (archived, archived_at, archive_expires_at). All four are present in BOTH src/schema.sql AND src/core/pglite-schema.ts. Master kept them in lockstep, so the gate is satisfied automatically.
- access_tokens.id is still UUID DEFAULT gen_random_uuid() in both engines (my v0.26.3 D6 fix preserved across the merge).
- Typecheck clean, schema-diff unit tests 17/17 pass, privacy script clean (master's v0.26.4 work fixed the Wintermute references in mounts-cache.ts that I had patched earlier; the two fixes converged independently).
Regenerated llms-full.txt to match the merged CHANGELOG.
Summary
`bun run test` finishes in ~85 seconds. It was 18 minutes. 12x speedup via 8-shard parallel fan-out + dedicated failure-log file.

Test infra (commits 1-2):
- `scripts/run-unit-parallel.sh` (~340 LOC). Spawns `N=min(8, cpu_count)` shards via existing `scripts/run-unit-shard.sh`. Per-shard 600s `gtimeout`/`timeout`/bg-pid wallclock cap. Single-writer post-shard failure aggregation (no concurrent-write hazards). 10s heartbeat to stderr proving it isn't wedged.
- `scripts/run-serial-tests.sh` runs `*.serial.test.ts` files at `--max-concurrency=1` after the parallel pass.
- `scripts/run-unit-shard.sh` accepts `--max-concurrency=N`; excludes `*.serial.test.ts` alongside `*.slow.test.ts`.
- `package.json` script tier split: `test` (fast loop) / `verify` (CI-narrow gates) / `test:full` (verify + parallel + slow + smart e2e) / `test:serial` / `test:e2e` / `check:all`.

Failure-first logging (commits 1-3):
- `.context/test-failures.log`: extracted failure blocks per shard, prefixed `--- shard N: <test name> ---`. Falls back to `/tmp/` if `.context/` is unwritable.
- `.context/test-summary.txt`: one-line-per-shard `pass=X fail=Y skip=Z rc=W`.
- … `| head` / `| tail` / agent log truncation.
- `.gitignore` adds `.context/`.

CI gate tightening (commit 3):
- `.github/workflows/test.yml` now runs `bun run verify` (was: 4 specific scripts inlined). Privacy check now actually fires on every CI run; previously it ran only when somebody manually invoked the old `bun run test` chain. Caught two pre-existing `Wintermute` references in `src/core/mounts-cache.ts` and replaced with `your OpenClaw` per CLAUDE.md privacy rule.

Quarantine (commit 3):
- `test/brain-registry.test.ts` → `test/brain-registry.serial.test.ts` (1 case fails under cross-file contention)
- `test/reconcile-links.test.ts` → `test/reconcile-links.serial.test.ts` (beforeEach hook timeout under contention)

Both pass alone. The proper architectural fix (sweep ~58 PGLite + ~40 env-mutation + 2 mock.module sites + add `--concurrent` flag) is filed as a P0 TODO for v0.27+.

Regression tests (commit 4, 4 files, 13 cases):
- `test/scripts/run-unit-parallel.test.ts`: exit-code propagation + failure-log contract (uses 4-fixture tempdir, ~500ms)
- `test/scripts/run-unit-shard.test.ts`: exclusion symmetry (slow + serial + e2e all excluded)
- `test/scripts/serial-files.test.ts`: discovery + concurrency=1 + disjoint from unit-shard set
- `test/privacy-script-wired.test.ts`: updated to assert verify chains check:privacy AND test.yml calls `bun run verify`

Docs (commit 5):
- `CLAUDE.md` Testing section rewritten with tier table, intentional CI-vs-local divergence section, failure-log contract, file taxonomy.
- `CHANGELOG.md` `## [0.26.4]` entry per voice rules.
- `TODOS.md` adds P0: intra-file parallelism via `--concurrent` (~1-2 weeks; target `bun run test` <30s) and P1: E2E template-DB parallelism.
- `llms-full.txt` regenerated.

Plus merge with master (v0.26.3 admin observability landed during this branch): Merged. `verify` extended to run master's `check:admin-build.sh` gate (vite build of admin React app). Tests still 88s green post-merge.

Test Coverage
Pre-Landing Review
Eng review and Codex outside-voice review were run interactively during planning. Codex flagged 4 critical structural issues (Bun native --shard non-functional on file lists, parity test impossible by design, `freshPglite()` contradicts existing `resetPglite()` helper, `verify` was redefining the ship gate). All 4 resolved by user via AskUserQuestion. Plan file: `~/.claude/plans/system-instruction-you-are-working-tranquil-ladybug.md` (full GSTACK REVIEW REPORT included).

TODOS
- … `--concurrent` flag. Sweep ~58 PGLite + ~40 env-mutation + 2 mock.module sites using existing `test/helpers/reset-pglite.ts` (do NOT introduce `freshPglite()`; codex correctly flagged that the repo already rejected that direction). Target `bun run test` <30s.

Test plan
- `bun run verify` green (privacy + jsonb + progress + wasm + admin-build + typecheck)
- `bun run test` 88s, 3657 pass / 0 fail / 0 skip (8 parallel shards + 34 serial)
- … `--- shard N:` prefix in `.context/test-failures.log` + loud stderr banner
- … `bun run verify` (single-source-of-truth for ship gate)

🤖 Generated with Claude Code