feat(build): teach synthesizer to emit cmd-dir testCmd for package main shims#95
Merged
Merged
Conversation
…ain` shims Without this, a Go shim phase (e.g. experiments/e1/cmd/polis-step) gets a testCmd pointing only at the conventional test/ tree. Go requires `package main` tests to live alongside main.go, so the test-writer correctly places main_test.go in the cmd dir — but verify-red exercises the wrong tree, sees the sibling unit tests all-pass, and halts the build with "Gemini could not produce failing tests after N attempts". The synthesizer prompt now instructs that any phase producing a `package main` binary emit a testCmd that unions the cmd dir and the standard test/ dir, with the co-located dir listed first. Generalises the principle to any phase where tests cannot live in the repo's standard test/ tree (Go package main, Python tests beside source, Rust integration tests). Pure prompt change. No orchestrator code changes; parser.ts:429-446 and phase-runner.ts:712-744 already honor the per-phase testCmd override. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ep regression Three assertions: 1. SKILL.md.tmpl contains the "Co-located tests for `package main` shims" anchor + failure-mode reference + concrete example 2. Regenerated SKILL.md preserves the same content (catches gen-skill-docs drift and resolver pipeline regressions) 3. Bullet sits inside the plan-synthesizer prompt section, between the existing "Polyglot repo test-runner hint" bullet and the "Non-Coding Phase Templates" header — protects against accidental section reshuffles Tier 1 (free, <100ms). Pure file reads. No API spend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-landing codex adversarial flagged three issues on the package-main testCmd bullet. Addressing all three: 1. Removed the Python/Rust path-union over-generalization. `pytest` does not take Go-style `./...` paths and `cargo test` discovers tests via the crate rather than via paths, so generalizing the union syntax to those ecosystems would push Gemini toward invalid commands. The bullet now states the rule is Go-specific and tells the synthesizer to use the native runner for non-Go ecosystems with similar co-location. 2. Resolved the conflict with the prior "polyglot repo" conditional bullet. The new wording explicitly says to emit testCmd "even in a monoglot Go repo where the previous bullet would otherwise let you skip it" — the cmd-dir is not covered by a repo-root `go test ./...` for verify-red's purposes, so the per-phase override is required regardless. 3. Strengthened the regression test with two new assertions: one pins the unioned go-test command shape, the other pins the explicit anti-generalization wording for Python and Rust. Total: 5 tests, 14 expects, 20ms. Bumped build skill template version 1.29.0 → 1.30.0 per the fork versioning rule in CLAUDE.md — top-level VERSION and CHANGELOG stay untouched because this is fork-local custom-skill work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anbangr
added a commit
that referenced
this pull request
May 25, 2026
…igure.cm
Roles changed:
- testWriter: kimi/kimi-for-coding → gemini/gemini-3.5-flash
(adds backup: codex/gpt-5.3-codex-spark, was no backup)
- primaryImpl: kimi/kimi-for-coding → codex/gpt-5.5
(backup: gemini/gemini-3.1-pro-preview → gemini-3.5-flash)
- testFixer: kimi/kimi-for-coding → gemini/gemini-3.5-flash
(backup: gemini/gemini-3.1-pro-preview → codex/gpt-5.3-codex-spark)
- ship.backupModel: gemini-3.1-pro-preview → gemini-3.5-flash
- land.backupModel: gemini-3.1-pro-preview → gemini-3.5-flash
In-flight WIP that was sitting in a stash since the start of PR #95's
work. Applied cleanly on main after PR #95 landed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 tasks
anbangr
added a commit
that referenced
this pull request
May 25, 2026
…eflight (3 bugs from 7-report analysis) (#98) * fix(build/orchestrator): fault classifier rungs + subtree testCmd inference The FAIL handler's catch-all at cli.ts:7476-7484 hard-coded `capKind: 'codex'` and `phaseState.codexReview?.iterations ?? 0`, so hygiene failures, Gemini timeouts, red-spec exhaustion, and zero-stdout stall-kills all surfaced as "RETRY_CAP_HIT codex 0 iterations" — confirmed across 7 production reports in ~/.gstack/skill-faults/ (agnt2 p1/p7/p9/p10/p11/p12/p13/p17). The investigator-bot then ran against the wrong evidence and retry caps exhausted on operations that were actually correct. Three changes, one PR: 1. Halt-event taxonomy. halt-events.ts gains HYGIENE_FAIL and RED_SPEC_EXHAUSTED; the previously-declared-but-never-emitted STALL_KILLED gets wired up for zero-stdout watchdog kills. 2. FAIL classifier ladder. cli.ts rewires the cascade into five ordered rungs (failureRender provider verdict → scan ALL role logs, not just codex-review → hygiene fail → RED_SPEC_EXHAUSTED → RED_GATE_ZERO) with a gated catch-all that only emits RETRY_CAP_HIT when codexIterations actually >= cap; otherwise markPhaseFailed with the structured reason. phase-runner.ts:743 now emits the RED_SPEC_EXHAUSTED: prefix the classifier matches. 3. Subtree testCmd inference. PhaseState gains testWriterTouchedPaths, captured via git diff after each RUN_GEMINI_TEST_SPEC; resolveTestCmd ForPhase inserts a new priority rung between explicit annotation and global args that walks up from the deepest common path prefix looking for a subtree package.json with a test script (e.g. sidecar-v2/ when the test-writer's files all live there). Generalises the PR #95 Go cmd-dir pattern to TypeScript subtrees. Closes the agnt2 p7/p13 "root npm test skips sidecar-v2" trap. 29 new tests across 3 files cover the new recorders, the inference helper, the resolver priority, and lock in the pattern regexes against real production fault-report messages so future regressions surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator): auth preflight closes stdin, detects interactive prompts, adds Kimi probe Three preflight gaps surfaced in the agnt2 fault reports: 1. p9: Gemini exited 0 on an auth prompt and the --version fallback in assertGeminiAuth then returned ok:true — masking the real failure downstream. The orchestrator burned a phase budget on a provider that was never going to work. 2. probeAuthSync used stdio: ["ignore", "pipe", "ignore"] so stderr was discarded. An auth-prompt CLI that wrote "please log in" to stderr left no trace in the probe result; the only signal was a generic exit code that --version trivially masked. 3. No Kimi auth preflight at all (p17: Kimi+Gemini both stall-killed with zero stdout suggests Kimi was hitting an interactive prompt path nothing upstream of the watchdog was detecting). Changes: - probeAuthSync captures stderr (stdio piped + encoding utf8) and scans combined output for AUTH_REQUIRED_RE patterns. When matched, returns authRequired:true so the caller short-circuits — does NOT fall through to the --version probe (which exits 0 on broken auth). - assertGeminiAuth and assertCodexAuth gate their --version fallback on the new authRequired signal. - New assertKimiAuth mirrors the same shape (probe `auth status`, fall back to `--version`, gate fallback on auth-required signal). - New codexBin() helper resolves CODEX_BIN at call time so tests can swap in fake binaries (parity with existing geminiBin/kimiBin). - _resetAuthPreflightForTests clears the new Kimi cache too. 11 new tests cover all three providers and the kill-switch path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Teaches the gstack build orchestrator's plan synthesizer that Go
package mainshim phases need their cmd-directory included in the per-phase<!-- testCmd: -->hint, not just the conventionaltest/tree. Without this, verify-red exercises the wrong directory, sees existing unit tests all-pass, and halts withGemini could not produce failing tests after N attempts (GSTACK_BUILD_RED_MAX_ITER)— the exact failure mode that stuck the polis-mesh-gap4-adapter run on 2026-05-25.Three commits, bisectable:
package mainshims — initial bullet added to the planSynthesizer prompt section inbuild/SKILL.md.tmpl(+ autoregenbuild/SKILL.md).test/build-synthesizer-package-main-hint.test.ts(3 tests) pinning the anchor phrase, failure-mode reference, concrete example, and section placement.build/SKILL.md.tmplfrontmatterversion:1.29.0 → 1.30.0 per the fork versioning rule.Fork-local skill release: per the "Fork versioning rule" in CLAUDE.md, the top-level VERSION, package.json, and CHANGELOG.md are intentionally untouched. The skill-internal
version:frontmatter is the release marker for custom-skill work in this fork.Test Coverage
5 tests pinning the prompt change, 14 expects, 20ms:
ANCHORassert (.tmpl + generated)FAILURE_MODE_REFERENCEassert (.tmpl + generated)CONCRETE_EXAMPLEassert (.tmpl + generated)bullet pins the unioned go-test command shapeassertbullet does not push Gemini toward Python/Rust path-union over-generalizationassertgen:skill-docsnot runParser/resolver side of
<!-- testCmd: -->is already pinned inbuild/orchestrator/__tests__/per-phase-testcmd.test.tsand not extended here.Tests: 0 → 1 (+1 new file). Coverage gate: PASS (100%, 0 gaps for a prompt-only change).
Pre-Landing Review
Inline checklist: no blocking findings. SQL/data safety and LLM-output-trust-boundary are both N/A (the diff is a change to an LLM prompt, not consumption of LLM output).
Codex Adversarial Review
Codex found 3 actionable findings on commit
fdabbc21, all addressed in0fdebfb2:Python/Rust over-generalization (Codex's recommended-changes finding) — the original bullet generalized the unioned-path syntax to "any phase whose tests can't live in test/" with Python and Rust examples.
pytestdoes not take Go-style./...paths andcargo testdiscovers tests via the crate, so that generalization could push Gemini toward invalid commands for those ecosystems. Fix: scoped the bullet to Gopackage mainonly, with an explicit "do not generalize" sentence calling out Python's and Rust's different runner mechanics.Conflict with prior conditional bullet — the existing "Polyglot repo test-runner hint" says emit
<!-- testCmd: -->only when the per-phase runner differs from the repo root. The new bullet's "MUST include" implied an unconditional emit, creating ambiguity. Fix: explicit phrasing "emit even in a monoglot Go repo where the previous bullet would otherwise let you skip it" so the synthesizer knows the cmd-dir override is required regardless.Test doesn't pin command shape — original 3 assertions caught wording-level drift but not changes to the actual unioned-go-test pattern. Fix: added 2 new assertions for the literal
go test ./.../cmd/.../... ./test/...shape and for the presence of the anti-generalization wording.Two additional Codex findings (line 477 test-writer prompt coupling, line 426 phase skeleton lacks testCmd placement) were deferred as out-of-scope for this branch.
Plan Completion
Plan file:
~/.claude/plans/polis-mesh-gap4-adapter-20260525-074215-frolicking-frog.mdbb465b6fv1.45.0.0)build/SKILL.md.tmplgainspackage mainco-location bulletbuild/SKILL.mdregenerated viabun run gen:skill-docsbun testgreen (exit 0, both pre-fix and post-fix runs)TODOS
No TODOS.md items completed by this PR.
Verification Note: Eval-Store Fixture Noise
Reviewers running
bun testmay see this in the output:This is not a real eval failure. It is the
EvalCollectorprinting its real-looking report while processing fixture data inside the passing testtest/helpers/eval-store.test.ts:85('written JSON has correct schema fields'). That test deliberately constructs one passing entry (test-1) and one failing entry (test-2) to verify the failure-counting logic viaexpect(data.failed).toBe(1). Running the file in isolation: 31 pass / 0 fail / 67 expects / 537ms.Filed as a separate DX cleanup candidate: the
EvalCollectorcould suppress its real-looking report when invoked from a unit-test fixture context (e.g.,NODE_ENV=testgate). Not in scope for this PR.Documentation
No documentation files needed updating. This is an internal refinement to the planSynthesizer's prompt rules — no new public surface (no CLI flag, no env var, no command), so README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md/build/README are unchanged. The fork-local rule blocks CHANGELOG/VERSION/package.json changes.
Test plan
bun testpasses (exit 0) on both pre-fix and post-fix commitsbun test test/build-synthesizer-package-main-hint.test.tspasses (5/5, 20ms)bun run gen:skill-docssucceeds;build/SKILL.mdmirrorsbuild/SKILL.md.tmplpackage mainphase via planSynthesizer, the emitted living plan's testCmd hint includes the binary directory (will be observable on the next polis-mesh-style run)🤖 Generated with Claude Code