Skip to content

feat(build): teach synthesizer to emit cmd-dir testCmd for package main shims#95

Merged
anbangr merged 3 commits into
mainfrom
feat/build-synthesizer-package-main-testcmd
May 25, 2026
Merged

feat(build): teach synthesizer to emit cmd-dir testCmd for package main shims#95
anbangr merged 3 commits into
mainfrom
feat/build-synthesizer-package-main-testcmd

Conversation

@anbangr

@anbangr anbangr commented May 25, 2026

Copy link
Copy Markdown
Owner

Summary

Teaches the gstack build orchestrator's plan synthesizer that Go package main shim phases need their cmd-directory included in the per-phase <!-- testCmd: --> hint, not just the conventional test/ tree. Without this, verify-red exercises the wrong directory, sees existing unit tests all-pass, and halts with Gemini could not produce failing tests after N attempts (GSTACK_BUILD_RED_MAX_ITER) — the exact failure mode that stuck the polis-mesh-gap4-adapter run on 2026-05-25.

Three commits, bisectable:

  • 117429e feat(build): teach synthesizer to emit cmd-dir testCmd for package main shims — initial bullet added to the planSynthesizer prompt section in build/SKILL.md.tmpl (+ autoregen build/SKILL.md).
  • fdabbc2 test(build): pin synthesizer package-main testCmd hint with static-grep regression — new test/build-synthesizer-package-main-hint.test.ts (3 tests) pinning the anchor phrase, failure-mode reference, concrete example, and section placement.
  • 0fdebfb refactor(build): apply codex review findings + bump skill version — addresses Codex pre-landing findings: dropped misleading Python/Rust path-union over-generalization, clarified the rule applies even in monoglot Go repos, strengthened tests with two additional assertions for command shape and anti-generalization. Bumped build/SKILL.md.tmpl frontmatter version: 1.29.0 → 1.30.0 per the fork versioning rule.

Fork-local skill release: per the "Fork versioning rule" in CLAUDE.md, the top-level VERSION, package.json, and CHANGELOG.md are intentionally untouched. The skill-internal version: frontmatter is the release marker for custom-skill work in this fork.

Test Coverage

5 tests pinning the prompt change, 14 expects, 20ms:

Regression class Pinned by
Anchor phrase removed/renamed ANCHOR assert (.tmpl + generated)
Failure-mode reference dropped FAILURE_MODE_REFERENCE assert (.tmpl + generated)
Concrete example dropped CONCRETE_EXAMPLE assert (.tmpl + generated)
Bullet drifts out of synthesizer section structural between-headers assert
Unioned go-test command shape weakened bullet pins the unioned go-test command shape assert
Python/Rust over-generalization re-introduced bullet does not push Gemini toward Python/Rust path-union over-generalization assert
tmpl edited but gen:skill-docs not run generated-file mirror tests

Parser/resolver side of <!-- testCmd: --> is already pinned in build/orchestrator/__tests__/per-phase-testcmd.test.ts and not extended here.

Tests: 0 → 1 (+1 new file). Coverage gate: PASS (100%, 0 gaps for a prompt-only change).

Pre-Landing Review

Inline checklist: no blocking findings. SQL/data safety and LLM-output-trust-boundary are both N/A (the diff is a change to an LLM prompt, not consumption of LLM output).

Codex Adversarial Review

Codex found 3 actionable findings on commit fdabbc21, all addressed in 0fdebfb2:

  1. Python/Rust over-generalization (Codex's recommended-changes finding) — the original bullet generalized the unioned-path syntax to "any phase whose tests can't live in test/" with Python and Rust examples. pytest does not take Go-style ./... paths and cargo test discovers tests via the crate, so that generalization could push Gemini toward invalid commands for those ecosystems. Fix: scoped the bullet to Go package main only, with an explicit "do not generalize" sentence calling out Python's and Rust's different runner mechanics.

  2. Conflict with prior conditional bullet — the existing "Polyglot repo test-runner hint" says emit <!-- testCmd: --> only when the per-phase runner differs from the repo root. The new bullet's "MUST include" implied an unconditional emit, creating ambiguity. Fix: explicit phrasing "emit even in a monoglot Go repo where the previous bullet would otherwise let you skip it" so the synthesizer knows the cmd-dir override is required regardless.

  3. Test doesn't pin command shape — original 3 assertions caught wording-level drift but not changes to the actual unioned-go-test pattern. Fix: added 2 new assertions for the literal go test ./.../cmd/.../... ./test/... shape and for the presence of the anti-generalization wording.

Two additional Codex findings (line 477 test-writer prompt coupling, line 426 phase skeleton lacks testCmd placement) were deferred as out-of-scope for this branch.

Plan Completion

Plan file: ~/.claude/plans/polis-mesh-gap4-adapter-20260525-074215-frolicking-frog.md

  • Step 1: Fresh branch off main (parent bb465b6f v1.45.0.0)
  • Step 2: build/SKILL.md.tmpl gains package main co-location bullet
  • Step 3: build/SKILL.md regenerated via bun run gen:skill-docs
  • Step 4: Static-grep regression test
  • Step 5: bun test green (exit 0, both pre-fix and post-fix runs)
  • Step 6: This handoff

TODOS

No TODOS.md items completed by this PR.

Verification Note: Eval-Store Fixture Noise

Reviewers running bun test may see this in the output:

Eval Results — ... — e2e
  test-1                                PASS    $0.10           2s
  test-2                                FAIL    $0.05           1s
  Total: 1/2 passed                     $0.15  3s
  ...
  Status: 2 unchanged

This is not a real eval failure. It is the EvalCollector printing its real-looking report while processing fixture data inside the passing test test/helpers/eval-store.test.ts:85 ('written JSON has correct schema fields'). That test deliberately constructs one passing entry (test-1) and one failing entry (test-2) to verify the failure-counting logic via expect(data.failed).toBe(1). Running the file in isolation: 31 pass / 0 fail / 67 expects / 537ms.

Filed as a separate DX cleanup candidate: the EvalCollector could suppress its real-looking report when invoked from a unit-test fixture context (e.g., NODE_ENV=test gate). Not in scope for this PR.

Documentation

No documentation files needed updating. This is an internal refinement to the planSynthesizer's prompt rules — no new public surface (no CLI flag, no env var, no command), so README/ARCHITECTURE/CONTRIBUTING/CLAUDE.md/build/README are unchanged. The fork-local rule blocks CHANGELOG/VERSION/package.json changes.

Test plan

  • bun test passes (exit 0) on both pre-fix and post-fix commits
  • bun test test/build-synthesizer-package-main-hint.test.ts passes (5/5, 20ms)
  • bun run gen:skill-docs succeeds; build/SKILL.md mirrors build/SKILL.md.tmpl
  • Manual: next time gstack-build hits a Go package main phase via planSynthesizer, the emitted living plan's testCmd hint includes the binary directory (will be observable on the next polis-mesh-style run)

🤖 Generated with Claude Code

anbangr and others added 3 commits May 25, 2026 11:31
…ain` shims

Without this, a Go shim phase (e.g. experiments/e1/cmd/polis-step) gets a
testCmd pointing only at the conventional test/ tree. Go requires
`package main` tests to live alongside main.go, so the test-writer correctly
places main_test.go in the cmd dir — but verify-red exercises the wrong
tree, sees the sibling unit tests all-pass, and halts the build with
"Gemini could not produce failing tests after N attempts". The synthesizer
prompt now instructs that any phase producing a `package main` binary
emit a testCmd that unions the cmd dir and the standard test/ dir, with
the co-located dir listed first.

Generalises the principle to any phase where tests cannot live in the
repo's standard test/ tree (Go package main, Python tests beside source,
Rust integration tests).

Pure prompt change. No orchestrator code changes; parser.ts:429-446 and
phase-runner.ts:712-744 already honor the per-phase testCmd override.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ep regression

Three assertions:
1. SKILL.md.tmpl contains the "Co-located tests for `package main` shims"
   anchor + failure-mode reference + concrete example
2. Regenerated SKILL.md preserves the same content (catches gen-skill-docs
   drift and resolver pipeline regressions)
3. Bullet sits inside the plan-synthesizer prompt section, between the
   existing "Polyglot repo test-runner hint" bullet and the "Non-Coding
   Phase Templates" header — protects against accidental section reshuffles

Tier 1 (free, <100ms). Pure file reads. No API spend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-landing codex adversarial flagged three issues on the package-main testCmd
bullet. Addressing all three:

1. Removed the Python/Rust path-union over-generalization. `pytest` does not
   take Go-style `./...` paths and `cargo test` discovers tests via the crate
   rather than via paths, so generalizing the union syntax to those ecosystems
   would push Gemini toward invalid commands. The bullet now states the rule
   is Go-specific and tells the synthesizer to use the native runner for
   non-Go ecosystems with similar co-location.

2. Resolved the conflict with the prior "polyglot repo" conditional bullet.
   The new wording explicitly says to emit testCmd "even in a monoglot Go
   repo where the previous bullet would otherwise let you skip it" — the
   cmd-dir is not covered by a repo-root `go test ./...` for verify-red's
   purposes, so the per-phase override is required regardless.

3. Strengthened the regression test with two new assertions: one pins the
   unioned go-test command shape, the other pins the explicit
   anti-generalization wording for Python and Rust. Total: 5 tests, 14
   expects, 20ms.

Bumped build skill template version 1.29.0 → 1.30.0 per the fork
versioning rule in CLAUDE.md — top-level VERSION and CHANGELOG stay
untouched because this is fork-local custom-skill work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anbangr anbangr merged commit 0e82008 into main May 25, 2026
@anbangr anbangr deleted the feat/build-synthesizer-package-main-testcmd branch May 25, 2026 04:18
anbangr added a commit that referenced this pull request May 25, 2026
…igure.cm

Roles changed:
- testWriter:  kimi/kimi-for-coding  → gemini/gemini-3.5-flash
               (adds backup: codex/gpt-5.3-codex-spark, was no backup)
- primaryImpl: kimi/kimi-for-coding  → codex/gpt-5.5
               (backup: gemini/gemini-3.1-pro-preview → gemini-3.5-flash)
- testFixer:   kimi/kimi-for-coding  → gemini/gemini-3.5-flash
               (backup: gemini/gemini-3.1-pro-preview → codex/gpt-5.3-codex-spark)
- ship.backupModel:  gemini-3.1-pro-preview → gemini-3.5-flash
- land.backupModel:  gemini-3.1-pro-preview → gemini-3.5-flash

In-flight WIP that was sitting in a stash since the start of PR #95's
work. Applied cleanly on main after PR #95 landed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anbangr added a commit that referenced this pull request May 25, 2026
…eflight (3 bugs from 7-report analysis) (#98)

* fix(build/orchestrator): fault classifier rungs + subtree testCmd inference

The FAIL handler's catch-all at cli.ts:7476-7484 hard-coded `capKind: 'codex'`
and `phaseState.codexReview?.iterations ?? 0`, so hygiene failures, Gemini
timeouts, red-spec exhaustion, and zero-stdout stall-kills all surfaced as
"RETRY_CAP_HIT codex 0 iterations" — confirmed across 7 production reports
in ~/.gstack/skill-faults/ (agnt2 p1/p7/p9/p10/p11/p12/p13/p17). The
investigator-bot then ran against the wrong evidence and retry caps
exhausted on operations that were actually correct.

Three changes, one PR:

1. Halt-event taxonomy. halt-events.ts gains HYGIENE_FAIL and
   RED_SPEC_EXHAUSTED; the previously-declared-but-never-emitted
   STALL_KILLED gets wired up for zero-stdout watchdog kills.

2. FAIL classifier ladder. cli.ts rewires the cascade into five ordered
   rungs (failureRender provider verdict → scan ALL role logs, not just
   codex-review → hygiene fail → RED_SPEC_EXHAUSTED → RED_GATE_ZERO) with
   a gated catch-all that only emits RETRY_CAP_HIT when codexIterations
   actually >= cap; otherwise markPhaseFailed with the structured reason.
   phase-runner.ts:743 now emits the RED_SPEC_EXHAUSTED: prefix the
   classifier matches.

3. Subtree testCmd inference. PhaseState gains testWriterTouchedPaths,
   captured via git diff after each RUN_GEMINI_TEST_SPEC; resolveTestCmd
   ForPhase inserts a new priority rung between explicit annotation and
   global args that walks up from the deepest common path prefix looking
   for a subtree package.json with a test script (e.g. sidecar-v2/
   when the test-writer's files all live there). Generalises the PR #95
   Go cmd-dir pattern to TypeScript subtrees. Closes the agnt2 p7/p13
   "root npm test skips sidecar-v2" trap.

29 new tests across 3 files cover the new recorders, the inference
helper, the resolver priority, and lock in the pattern regexes against
real production fault-report messages so future regressions surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(build/orchestrator): auth preflight closes stdin, detects interactive prompts, adds Kimi probe

Three preflight gaps surfaced in the agnt2 fault reports:

1. p9: Gemini exited 0 on an auth prompt and the --version fallback in
   assertGeminiAuth then returned ok:true — masking the real failure
   downstream. The orchestrator burned a phase budget on a provider
   that was never going to work.

2. probeAuthSync used stdio: ["ignore", "pipe", "ignore"] so stderr was
   discarded. An auth-prompt CLI that wrote "please log in" to stderr
   left no trace in the probe result; the only signal was a generic
   exit code that --version trivially masked.

3. No Kimi auth preflight at all (p17: Kimi+Gemini both stall-killed
   with zero stdout suggests Kimi was hitting an interactive prompt
   path nothing upstream of the watchdog was detecting).

Changes:

- probeAuthSync captures stderr (stdio piped + encoding utf8) and
  scans combined output for AUTH_REQUIRED_RE patterns. When matched,
  returns authRequired:true so the caller short-circuits — does NOT
  fall through to the --version probe (which exits 0 on broken auth).
- assertGeminiAuth and assertCodexAuth gate their --version fallback
  on the new authRequired signal.
- New assertKimiAuth mirrors the same shape (probe `auth status`,
  fall back to `--version`, gate fallback on auth-required signal).
- New codexBin() helper resolves CODEX_BIN at call time so tests can
  swap in fake binaries (parity with existing geminiBin/kimiBin).
- _resetAuthPreflightForTests clears the new Kimi cache too.

11 new tests cover all three providers and the kill-switch path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant