feat(build): teach synthesizer to emit cmd-dir testCmd for `package main` shims by anbangr · Pull Request #95 · anbangr/gstack

anbangr · 2026-05-25T04:10:13Z

Summary

Teaches the gstack build orchestrator's plan synthesizer that Go package main shim phases need their cmd-directory included in the per-phase  hint, not just the conventional test/ tree. Without this, verify-red exercises the wrong directory, sees existing unit tests all-pass, and halts with Gemini could not produce failing tests after N attempts (GSTACK_BUILD_RED_MAX_ITER) — the exact failure mode that stuck the polis-mesh-gap4-adapter run on 2026-05-25.

Three commits, bisectable:

117429e feat(build): teach synthesizer to emit cmd-dir testCmd for package main shims — initial bullet added to the planSynthesizer prompt section in build/SKILL.md.tmpl (+ autoregen build/SKILL.md).
fdabbc2 test(build): pin synthesizer package-main testCmd hint with static-grep regression — new test/build-synthesizer-package-main-hint.test.ts (3 tests) pinning the anchor phrase, failure-mode reference, concrete example, and section placement.
0fdebfb refactor(build): apply codex review findings + bump skill version — addresses Codex pre-landing findings: dropped misleading Python/Rust path-union over-generalization, clarified the rule applies even in monoglot Go repos, strengthened tests with two additional assertions for command shape and anti-generalization. Bumped build/SKILL.md.tmpl frontmatter version: 1.29.0 → 1.30.0 per the fork versioning rule.

Fork-local skill release: per the "Fork versioning rule" in CLAUDE.md, the top-level VERSION, package.json, and CHANGELOG.md are intentionally untouched. The skill-internal version: frontmatter is the release marker for custom-skill work in this fork.

Test Coverage

5 tests pinning the prompt change, 14 expects, 20ms:

Regression class	Pinned by
Anchor phrase removed/renamed	`ANCHOR` assert (.tmpl + generated)
Failure-mode reference dropped	`FAILURE_MODE_REFERENCE` assert (.tmpl + generated)
Concrete example dropped	`CONCRETE_EXAMPLE` assert (.tmpl + generated)
Bullet drifts out of synthesizer section	structural between-headers assert
Unioned go-test command shape weakened	`bullet pins the unioned go-test command shape` assert
Python/Rust over-generalization re-introduced	`bullet does not push Gemini toward Python/Rust path-union over-generalization` assert
tmpl edited but `gen:skill-docs` not run	generated-file mirror tests

Parser/resolver side of  is already pinned in build/orchestrator/__tests__/per-phase-testcmd.test.ts and not extended here.

Tests: 0 → 1 (+1 new file). Coverage gate: PASS (100%, 0 gaps for a prompt-only change).

Pre-Landing Review

Inline checklist: no blocking findings. SQL/data safety and LLM-output-trust-boundary are both N/A (the diff is a change to an LLM prompt, not consumption of LLM output).

Codex Adversarial Review

Codex found 3 actionable findings on commit fdabbc21, all addressed in 0fdebfb2:

Python/Rust over-generalization (Codex's recommended-changes finding) — the original bullet generalized the unioned-path syntax to "any phase whose tests can't live in test/" with Python and Rust examples. pytest does not take Go-style ./... paths and cargo test discovers tests via the crate, so that generalization could push Gemini toward invalid commands for those ecosystems. Fix: scoped the bullet to Go package main only, with an explicit "do not generalize" sentence calling out Python's and Rust's different runner mechanics.
Conflict with prior conditional bullet — the existing "Polyglot repo test-runner hint" says emit  only when the per-phase runner differs from the repo root. The new bullet's "MUST include" implied an unconditional emit, creating ambiguity. Fix: explicit phrasing "emit even in a monoglot Go repo where the previous bullet would otherwise let you skip it" so the synthesizer knows the cmd-dir override is required regardless.
Test doesn't pin command shape — original 3 assertions caught wording-level drift but not changes to the actual unioned-go-test pattern. Fix: added 2 new assertions for the literal go test ./.../cmd/.../... ./test/... shape and for the presence of the anti-generalization wording.

Two additional Codex findings (line 477 test-writer prompt coupling, line 426 phase skeleton lacks testCmd placement) were deferred as out-of-scope for this branch.

Plan Completion

Plan file: ~/.claude/plans/polis-mesh-gap4-adapter-20260525-074215-frolicking-frog.md

Step 1: Fresh branch off main (parent bb465b6f v1.45.0.0)
Step 2: build/SKILL.md.tmpl gains package main co-location bullet
Step 3: build/SKILL.md regenerated via bun run gen:skill-docs
Step 4: Static-grep regression test
Step 5: bun test green (exit 0, both pre-fix and post-fix runs)
Step 6: This handoff

TODOS

No TODOS.md items completed by this PR.

Verification Note: Eval-Store Fixture Noise

Reviewers running bun test may see this in the output:

Eval Results — ... — e2e
  test-1                                PASS    $0.10           2s
  test-2                                FAIL    $0.05           1s
  Total: 1/2 passed                     $0.15  3s
  ...
  Status: 2 unchanged

This is not a real eval failure. It is the EvalCollector printing its real-looking report while processing fixture data inside the passing test test/helpers/eval-store.test.ts:85 ('written JSON has correct schema fields'). That test deliberately constructs one passing entry (test-1) and one failing entry (test-2) to verify the failure-counting logic via expect(data.failed).toBe(1). Running the file in isolation: 31 pass / 0 fail / 67 expects / 537ms.

Filed as a separate DX cleanup candidate: the EvalCollector could suppress its real-looking report when invoked from a unit-test fixture context (e.g., NODE_ENV=test gate). Not in scope for this PR.

Documentation

No documentation files needed updating. This is an internal refinement to the planSynthesizer's prompt rules — no new public surface (no CLI flag, no env var, no command), so README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md/build/README are unchanged. The fork-local rule blocks CHANGELOG/VERSION/package.json changes.

Test plan

bun test passes (exit 0) on both pre-fix and post-fix commits
bun test test/build-synthesizer-package-main-hint.test.ts passes (5/5, 20ms)
bun run gen:skill-docs succeeds; build/SKILL.md mirrors build/SKILL.md.tmpl
Manual: next time gstack-build hits a Go package main phase via planSynthesizer, the emitted living plan's testCmd hint includes the binary directory (will be observable on the next polis-mesh-style run)

🤖 Generated with Claude Code

…ain` shims Without this, a Go shim phase (e.g. experiments/e1/cmd/polis-step) gets a testCmd pointing only at the conventional test/ tree. Go requires `package main` tests to live alongside main.go, so the test-writer correctly places main_test.go in the cmd dir — but verify-red exercises the wrong tree, sees the sibling unit tests all-pass, and halts the build with "Gemini could not produce failing tests after N attempts". The synthesizer prompt now instructs that any phase producing a `package main` binary emit a testCmd that unions the cmd dir and the standard test/ dir, with the co-located dir listed first. Generalises the principle to any phase where tests cannot live in the repo's standard test/ tree (Go package main, Python tests beside source, Rust integration tests). Pure prompt change. No orchestrator code changes; parser.ts:429-446 and phase-runner.ts:712-744 already honor the per-phase testCmd override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ep regression Three assertions: 1. SKILL.md.tmpl contains the "Co-located tests for `package main` shims" anchor + failure-mode reference + concrete example 2. Regenerated SKILL.md preserves the same content (catches gen-skill-docs drift and resolver pipeline regressions) 3. Bullet sits inside the plan-synthesizer prompt section, between the existing "Polyglot repo test-runner hint" bullet and the "Non-Coding Phase Templates" header — protects against accidental section reshuffles Tier 1 (free, <100ms). Pure file reads. No API spend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-landing codex adversarial flagged three issues on the package-main testCmd bullet. Addressing all three: 1. Removed the Python/Rust path-union over-generalization. `pytest` does not take Go-style `./...` paths and `cargo test` discovers tests via the crate rather than via paths, so generalizing the union syntax to those ecosystems would push Gemini toward invalid commands. The bullet now states the rule is Go-specific and tells the synthesizer to use the native runner for non-Go ecosystems with similar co-location. 2. Resolved the conflict with the prior "polyglot repo" conditional bullet. The new wording explicitly says to emit testCmd "even in a monoglot Go repo where the previous bullet would otherwise let you skip it" — the cmd-dir is not covered by a repo-root `go test ./...` for verify-red's purposes, so the per-phase override is required regardless. 3. Strengthened the regression test with two new assertions: one pins the unioned go-test command shape, the other pins the explicit anti-generalization wording for Python and Rust. Total: 5 tests, 14 expects, 20ms. Bumped build skill template version 1.29.0 → 1.30.0 per the fork versioning rule in CLAUDE.md — top-level VERSION and CHANGELOG stay untouched because this is fork-local custom-skill work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…igure.cm Roles changed: - testWriter: kimi/kimi-for-coding → gemini/gemini-3.5-flash (adds backup: codex/gpt-5.3-codex-spark, was no backup) - primaryImpl: kimi/kimi-for-coding → codex/gpt-5.5 (backup: gemini/gemini-3.1-pro-preview → gemini-3.5-flash) - testFixer: kimi/kimi-for-coding → gemini/gemini-3.5-flash (backup: gemini/gemini-3.1-pro-preview → codex/gpt-5.3-codex-spark) - ship.backupModel: gemini-3.1-pro-preview → gemini-3.5-flash - land.backupModel: gemini-3.1-pro-preview → gemini-3.5-flash In-flight WIP that was sitting in a stash since the start of PR #95's work. Applied cleanly on main after PR #95 landed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eflight (3 bugs from 7-report analysis) (#98) * fix(build/orchestrator): fault classifier rungs + subtree testCmd inference The FAIL handler's catch-all at cli.ts:7476-7484 hard-coded `capKind: 'codex'` and `phaseState.codexReview?.iterations ?? 0`, so hygiene failures, Gemini timeouts, red-spec exhaustion, and zero-stdout stall-kills all surfaced as "RETRY_CAP_HIT codex 0 iterations" — confirmed across 7 production reports in ~/.gstack/skill-faults/ (agnt2 p1/p7/p9/p10/p11/p12/p13/p17). The investigator-bot then ran against the wrong evidence and retry caps exhausted on operations that were actually correct. Three changes, one PR: 1. Halt-event taxonomy. halt-events.ts gains HYGIENE_FAIL and RED_SPEC_EXHAUSTED; the previously-declared-but-never-emitted STALL_KILLED gets wired up for zero-stdout watchdog kills. 2. FAIL classifier ladder. cli.ts rewires the cascade into five ordered rungs (failureRender provider verdict → scan ALL role logs, not just codex-review → hygiene fail → RED_SPEC_EXHAUSTED → RED_GATE_ZERO) with a gated catch-all that only emits RETRY_CAP_HIT when codexIterations actually >= cap; otherwise markPhaseFailed with the structured reason. phase-runner.ts:743 now emits the RED_SPEC_EXHAUSTED: prefix the classifier matches. 3. Subtree testCmd inference. PhaseState gains testWriterTouchedPaths, captured via git diff after each RUN_GEMINI_TEST_SPEC; resolveTestCmd ForPhase inserts a new priority rung between explicit annotation and global args that walks up from the deepest common path prefix looking for a subtree package.json with a test script (e.g. sidecar-v2/ when the test-writer's files all live there). Generalises the PR #95 Go cmd-dir pattern to TypeScript subtrees. Closes the agnt2 p7/p13 "root npm test skips sidecar-v2" trap. 29 new tests across 3 files cover the new recorders, the inference helper, the resolver priority, and lock in the pattern regexes against real production fault-report messages so future regressions surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator): auth preflight closes stdin, detects interactive prompts, adds Kimi probe Three preflight gaps surfaced in the agnt2 fault reports: 1. p9: Gemini exited 0 on an auth prompt and the --version fallback in assertGeminiAuth then returned ok:true — masking the real failure downstream. The orchestrator burned a phase budget on a provider that was never going to work. 2. probeAuthSync used stdio: ["ignore", "pipe", "ignore"] so stderr was discarded. An auth-prompt CLI that wrote "please log in" to stderr left no trace in the probe result; the only signal was a generic exit code that --version trivially masked. 3. No Kimi auth preflight at all (p17: Kimi+Gemini both stall-killed with zero stdout suggests Kimi was hitting an interactive prompt path nothing upstream of the watchdog was detecting). Changes: - probeAuthSync captures stderr (stdio piped + encoding utf8) and scans combined output for AUTH_REQUIRED_RE patterns. When matched, returns authRequired:true so the caller short-circuits — does NOT fall through to the --version probe (which exits 0 on broken auth). - assertGeminiAuth and assertCodexAuth gate their --version fallback on the new authRequired signal. - New assertKimiAuth mirrors the same shape (probe `auth status`, fall back to `--version`, gate fallback on auth-required signal). - New codexBin() helper resolves CODEX_BIN at call time so tests can swap in fake binaries (parity with existing geminiBin/kimiBin). - _resetAuthPreflightForTests clears the new Kimi cache too. 11 new tests cover all three providers and the kill-switch path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

anbangr and others added 3 commits May 25, 2026 11:31

anbangr merged commit 0e82008 into main May 25, 2026

anbangr deleted the feat/build-synthesizer-package-main-testcmd branch May 25, 2026 04:18

anbangr mentioned this pull request May 25, 2026

fix(build/orchestrator): fault classifier + subtree testCmd + auth preflight (3 bugs from 7-report analysis) #98

Merged

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(build): teach synthesizer to emit cmd-dir testCmd for `package main` shims#95

feat(build): teach synthesizer to emit cmd-dir testCmd for `package main` shims#95
anbangr merged 3 commits into
mainfrom
feat/build-synthesizer-package-main-testcmd

anbangr commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

anbangr commented May 25, 2026

Summary

Test Coverage

Pre-Landing Review

Codex Adversarial Review

Plan Completion

TODOS

Verification Note: Eval-Store Fixture Noise

Documentation

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant