v1.26.4.0 fix: use Codex-native build commands#9
Merged
Conversation
Configure /build to call local Codex skill names directly and skip the optional secondary review gate when unset. Bumps version to v1.26.4.0. Co-Authored-By: OpenAI Codex <noreply@openai.com>
anbangr
pushed a commit
that referenced
this pull request
May 7, 2026
… rename (garrytan#1351) * feat: gstack-gbrain-mcp-verify helper for remote MCP probe Probes a remote gbrain MCP endpoint with bearer auth. POSTs initialize, classifies failures into NETWORK / AUTH / MALFORMED with one-line remediation hints, and runs a tools/list capability probe to detect sources_add MCP support (forward-compat for when gbrain ships URL ingest). Token consumed from GBRAIN_MCP_TOKEN env, never argv. Required to set both 'application/json' AND 'text/event-stream' in Accept; that gotcha costs 10 minutes of debugging when missed (regression-tested). Live-verified against wintermute (gbrain v0.27.1). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: gstack-artifacts-init + gstack-artifacts-url helpers artifacts-init replaces brain-init with provider choice (gh / glab / manual), per-user gstack-artifacts-$USER repo, HTTPS-canonical storage in ~/.gstack-artifacts-remote.txt, and a "send this to your brain admin" hookup printout. Always prints the command, never auto-executes — gbrain v0.26.x has no admin-scope MCP probe (codex Finding #3). artifacts-url centralizes HTTPS↔SSH/host/owner-repo conversion so callers don't each string-mangle (codex Finding #10). The remote-conflict check in artifacts-init compares at the canonical level so re-running with HTTPS input doesn't trip on a stored SSH URL for the same logical repo. The "URL form not supported" branch prints a two-line clone-then-path form for gbrain v0.26.x; the supported branch is a one-liner with --url ready for when gbrain ships URL ingest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: extend gstack-gbrain-detect with mcp_mode + artifacts_remote Adds two new fields to detect's JSON output: - gbrain_mcp_mode: local-stdio | remote-http | none Resolved via 3-tier fallback (codex Finding D3): claude mcp get --json → claude mcp list text-grep → ~/.claude.json jq read. If Anthropic moves the file format, the first two tiers absorb it. - gstack_artifacts_remote: HTTPS URL from ~/.gstack-artifacts-remote.txt Falls back to ~/.gstack-brain-remote.txt during the v1.27.0.0 migration window so detect doesn't return empty between upgrade and migration. Existing detect tests still pass (15/15). New 19 tests cover every fallback tier independently, plus a schema regression for /sync-gbrain compat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: setup-gbrain Path 4 (remote MCP) + artifacts rename Path 4 lets users paste an HTTPS MCP URL + bearer token and registers it as an HTTP-transport MCP without needing a local gbrain CLI install. The flow: - Step 2 gains a fourth option (Remote gbrain MCP) - Step 4 adds Path 4 sub-flow: collect URL, secret-read bearer, verify via gstack-gbrain-mcp-verify (NETWORK / AUTH / MALFORMED classifier) - Step 5 (local doctor), Step 7.5 (transcript ingest), Step 5a's stdio branch all skip on Path 4 - Step 5a adds an HTTP+bearer registration form: claude mcp add --transport http --header "Authorization: Bearer ..." - Step 7 renamed "session memory sync" → "artifacts sync" and now calls gstack-artifacts-init (which always prints the brain-admin hookup command — no auto-execute, codex Finding #3) - Step 8 CLAUDE.md block branches: remote-http includes URL + server version (never the token); local-stdio keeps engine + config-file - Step 9 smoke test on Path 4 prints the curl-equivalent for post-restart verification (MCP tools aren't visible mid-session) - Step 10 verdict block has separate templates per mode Idempotency: re-running with gbrain_mcp_mode=remote-http already in detect output skips Step 2 entirely and goes to verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: rename gbrain_sync_mode → artifacts_sync_mode (v1.27.0.0 prep) Hard rename, no dual-read alias (codex Finding D4). The on-disk migration script (Phase C, separate commit) renames the config key in users' ~/.gstack/config.yaml and any CLAUDE.md blocks. Touched call sites: - bin/gstack-config defaults + validation + list/defaults output - bin/gstack-gbrain-detect (gstack_brain_sync_mode field still emitted with the same name for downstream-tool compat; reads new key) - bin/gstack-brain-sync, bin/gstack-brain-enqueue, bin/gstack-brain-uninstall - bin/gstack-timeline-log (comment ref) - scripts/resolvers/preamble/generate-brain-sync-block.ts: renames key, branches on gbrain_mcp_mode=remote-http to emit "ARTIFACTS_SYNC: remote-mode (managed by brain server <host>)" instead of the local mode/queue/last_push line (codex Finding #11) - bin/gstack-brain-restore + bin/gstack-gbrain-source-wireup: read ~/.gstack-artifacts-remote.txt with ~/.gstack-brain-remote.txt fallback during the migration window - bin/gstack-artifacts-init: tolerant of unrecognized URL forms (local paths, file://, self-hosted gitea) so test infrastructure and unusual remotes work without canonicalization - test/brain-sync.test.ts: gstack-brain-init → gstack-artifacts-init - test/skill-e2e-brain-privacy-gate.test.ts: artifacts_sync_mode keys - test/gen-skill-docs.test.ts: budget 35K → 36.5K for the new MCP-mode probe in the preamble resolver - health/SKILL.md.tmpl, sync-gbrain/SKILL.md.tmpl: comment + verdict line Hard delete: - bin/gstack-brain-init (replaced by bin/gstack-artifacts-init in v1.27.0.0) - test/gstack-brain-init-gh-mock.test.ts (replaced by gstack-artifacts-init.test.ts) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate SKILL.md files after artifacts-sync rename Mechanical regen via \`bun run gen:skill-docs --host all\`. All */SKILL.md files reflect the renamed config key (gbrain_sync_mode → artifacts_sync_mode), the renamed remote-helper file (~/.gstack-artifacts-remote.txt with brain fallback), the renamed init script (gstack-artifacts-init), and the new ARTIFACTS_SYNC: remote-mode status line that fires when a remote-http MCP is registered. Golden fixtures (test/fixtures/golden/*-ship-SKILL.md) refreshed to match the regenerated default-ship output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat: v1.27.0.0 migration — gstack-brain → gstack-artifacts rename Journaled, interruption-safe migration. Six steps, each writes to ~/.gstack/.migrations/v1.27.0.0.journal on success; re-entry resumes from the next un-done step. On final success, journal is replaced by ~/.gstack/.migrations/v1.27.0.0.done. Steps: 1. gh_repo_renamed gh/glab repo rename gstack-brain-$USER → gstack-artifacts-$USER (idempotent: detects already-renamed and skips) 2. remote_txt_renamed mv ~/.gstack-brain-remote.txt → artifacts file, rewriting URL path to match the new repo name 3. config_key_renamed sed -i in ~/.gstack/config.yaml flips gbrain_sync_mode → artifacts_sync_mode 4. claude_md_block sed flips "- Memory sync:" → "- Artifacts sync:" in cwd CLAUDE.md and ~/.gstack/CLAUDE.md 5. sources_swapped gbrain sources add NEW (verify) → remove OLD (codex Finding #6: add-before-remove ordering, no downtime window). On remote-MCP mode, prints commands for the brain admin instead of executing. 6. done touchfile + delete journal User opt-out: any "n" or "skip-for-now" answer at the initial prompt writes a marker file that prevents re-prompting; user can re-invoke via /setup-gbrain --rerun-migration. 11 unit tests cover: nothing-to-migrate, GitHub happy path, idempotent re-run, journal-resume mid-flight, remote-MCP print-only path, add-before-remove ordering verification, add-fail → old source stays registered, CLAUDE.md field rewrite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: regression suite + E2E for v1.27.0.0 rename Three new regression tests guard the rename's blast radius (per codex Findings #1, #8, #9, #12): - test/no-stale-gstack-brain-refs.test.ts: greps bin/, scripts/, *.tmpl, test/ for forbidden identifiers (gstack-brain-init, gbrain_sync_mode); fails CI if any non-allowlisted file references them. - test/post-rename-doc-regen.test.ts: confirms gen-skill-docs output has no stale references in any */SKILL.md (the cross-product blind spot). - test/setup-gbrain-path4-structure.test.ts: structural lint over the Path 4 prose contract — STOP gates after verify failure, never-write- token rules, mode-aware CLAUDE.md block, bearer always via env-var. Two new gate-tier E2E tests (deterministic stub HTTP server, fixed inputs): - test/skill-e2e-setup-gbrain-remote.test.ts: Path 4 happy path. Stubs an HTTP MCP server, drives the skill via Agent SDK with a stubbed bearer, asserts claude.json gets the http MCP entry, CLAUDE.md gets the remote-http block, the secret token NEVER leaks to CLAUDE.md. - test/skill-e2e-setup-gbrain-bad-token.test.ts: stub server returns 401; asserts the AUTH classifier hint surfaces, no MCP registration occurs, CLAUDE.md is unchanged. Regression guard for the "verify failed → STOP" rule. touchfiles.ts: setup-gbrain-remote and setup-gbrain-bad-token added at gate-tier so CI catches Path 4 regressions on every PR. Plus a few comment refs flipped: bin/gstack-jsonl-merge, bin/gstack-timeline-log (legacy gstack-brain-init mentions in headers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release: v1.27.0.0 — /setup-gbrain Path 4 + brain → artifacts rename Bumps VERSION 1.26.4.0 → 1.27.0.0 (MINOR per CLAUDE.md scale-aware bump guidance: ~1500 line net change including a new path in /setup-gbrain, two new bin helpers, a journaled migration, 59 new tests, and a config key rename across the codebase). CHANGELOG entry covers: Path 4 (Remote MCP) end-to-end, the brain → artifacts rename, the journaled migration, the verify-helper error classifier, the artifacts-init multi-host provider choice. Includes the canonical Garry-voice headline + numbers table + audience close per the release-summary format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: demote setup-gbrain Path 4 E2E to periodic-tier The Agent SDK E2E tests for Path 4 (skill-e2e-setup-gbrain-remote and skill-e2e-setup-gbrain-bad-token) are inherently non-deterministic — the model interprets "follow Path 4 only" prompts flexibly and can skip Step 8 (CLAUDE.md write) or shortcut past the verify helper, which makes the gate-tier assertions flaky. The deterministic gate coverage for Path 4 is in test/setup-gbrain-path4-structure.test.ts: a fast structural lint that catches AUQ-pacing regressions and prose contract drift in <200ms with zero token spend. That test is the right tool for catching the failure mode the gate-tier was meant to guard against. The Agent SDK E2E tests stay available on-demand for periodic-tier runs (EVALS=1 EVALS_TIER=periodic bun test test/skill-e2e-setup-gbrain-*.test.ts). Also tightened the verify-error assertion to the literal field shape ("error_class": "AUTH") instead of a substring match that false-matches the parent claude session's "needs-auth" MCP discovery markers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: sync package.json version to 1.27.0.0 VERSION was bumped to 1.27.0.0 in f6ec11e but package.json was not updated in the same commit. The gen-skill-docs.test.ts assertion "package.json version matches VERSION file" caught the drift. This is the DRIFT_STALE_PKG case the /ship Step 12 idempotency check is designed for; the fix is the documented sync-only repair (no re-bump, package.json synced to existing VERSION). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anbangr
added a commit
that referenced
this pull request
May 20, 2026
* docs: design for build plan-review convergence loop
Spec for /build's planSynthesizer ↔ planReviewer loop: in-process round
loop, mid-loop user triage gate, plan-file-as-ledger cross-round memory,
set-aware adaptive cap. Triggering case was bundle-1 (5→3→2→manual r4,
~$5-10) where rigor caught real bugs but the operator was locked out
until round 3. New default brings user in at round 1, makes each round
cheaper via in-process loop, and adapts the cap to actual convergence.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(spec): pin adaptive-cap exit_reason mapping in convergence design
Self-review caught a small ambiguity: the decision table listed two bail-out triggers (re-raises-only and regression) but the exit_reason enum had three adaptive-cap values without an explicit mapping. Fixed: enum now has exactly adaptive_cap_re_raises_only and adaptive_cap_regression, and the decision table rows reference which exit_reason each triggers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plan): implementation plan for build plan-review convergence
19-task TDD-structured plan for the convergence loop spec. Each task is
self-contained: failing test → minimal impl → passing test → commit.
Covers types (T1), annotation contract (T2), history JSONL (T3),
convergence aggregate (T4), adaptive cap (T5), TTY triage (T6), non-TTY
triage (T7), prompts (T8), main loop (T9), CLI wire-in (T10), disputed
counting (T11), three integration tests (T12-T14), E2E (T15), SKILL.md
shrink + version bump (T16), README (T17), CHANGELOG (T18), final
verification (T19).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/types): add convergence types for plan-review loop
Extends PlanReviewVerdict with optional triage_decisions, round_history_path,
convergence, interrupted_at_objection fields. Adds TriageDecision and
ConvergenceSnapshot interfaces and ROUND_HISTORY_FORMAT_VERSION constant.
All new fields are optional so existing call sites compile unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(build/types): camelCase convergence field names; drop NEW: prefix
Code-quality review flagged inconsistency: existing PlanReviewVerdict
fields use camelCase (reviewedBy, round) but the new convergence fields
landed as snake_case (triage_decisions, round_history_path, etc).
Converts all new TypeScript interface fields to camelCase to match the
file's established convention. JSONL wire formats in later tasks can
still use snake_case via JSON.stringify of manually-shaped objects --
TypeScript types do not need to mirror JSONL key shape.
Also drops the informal "NEW:" prefix from JSDoc comments and adds a
one-line doc for TriageDecision.decision.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-reviewer): add round-annotation read/write contract
Adds parseRoundAnnotations, writeRoundAnnotation, parseRoundHistoryHeader,
updateRoundHistoryHeader exported from plan-reviewer.ts. These implement
the cross-round memory contract: each round's triage decisions and synth
resolutions are written into the plan file as HTML comment blocks above
the matching '### Phase N' heading, plus a top-of-plan history block.
The next round's reviewer reads these to know what's already been decided.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-reviewer): ROUND N REVIEWER attaches to round N's entry
The original Task 2 commit had the parser attach ROUND N REVIEWER lines
to round N-1's entry. That choice was forced by a self-contradicting
test fixture (a ROUND 2 REVIEWER: line inside an annotation the test
required to have rounds.length === 1).
Both the design spec and Task 9's planned writer in runPlanReviewLoop
treat ROUND N REVIEWER as a round-N observation paired with round N's
USER decision (if any). The N-1 offset would have corrupted every
annotation round-trip in Tasks 9, 11, and the integration tests.
Fixes the parser to attach directly to round N. Updates the broken
test fixture to assert the realistic multi-round shape: round 1 carries
USER/RESOLUTION, round 2 carries only REVIEWER (because no round-2 USER
decision happened on this annotation -- the reviewer did not re-raise).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-reviewer): silent corruption + missing JSDoc on annotation writers
Three issues from code-quality review:
1. writeRoundAnnotation's two String.replace() call sites used string
replacements, which interpret \$&, \$', \$\` and \$N as substitution
patterns. A plan annotation whose field text contains those tokens
(plausible when reviewing regex / shell / template code) would have
produced silently corrupted output. Both call sites now use function
replacements which suppress the interpolation.
2. Misleading comment in parseRoundAnnotations claimed the header always
names "round 1". An annotation first written in round 2+ opens with
ROUND 2 CRITICAL [...], which the parser already handles correctly --
the comment was the only wrong thing.
3. RoundHistoryEntry, parseRoundHistoryHeader, and updateRoundHistoryHeader
gained JSDoc explaining purpose, the optional finalLine parameter
(used on loop exit for the 'final: APPROVED after N rounds, ...' line),
and the atomic-write invariant.
New regression test covers the \$& interpolation foot-gun by round-tripping
fields containing every replacement-pattern token.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): add round-history JSONL writer
New module plan-review-loop.ts will house the in-process round loop, triage
gate, and adaptive-cap. First commit establishes the per-build-state history
JSONL: append-only, corruption-tolerant reads, round-counter derivation.
appendHistoryEntry / readHistoryEntries / deriveRoundNumber pair with the
existing plan-reviewer.ts::readPlanReviewRound for cross-launch resume.
HistoryEntry uses camelCase field names (objectionCountRaw, noForwardProgress,
reRaises, newObjections) matching the Task 1 convention; the JSONL on disk
serializes the same camelCase, so jq queries can use the field names verbatim.
Also registers plan-review-loop.ts in coverage-matrix.test.ts so the
MODULE_TEST_OWNERS invariant stays green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): cross-build convergence aggregate writer
writeConvergenceAggregate appends one line per completed build to
~/.gstack/analytics/convergence.jsonl. Captures trajectory, exitReason,
total accept/reject/defer counts, wall time, and annotation parse
errors -- the tuning signal needed to validate MAX_ROUNDS=5 and the
adaptive cap rule over weeks of builds.
Best-effort write: aggregate analytics never block the build path.
ConvergenceAggregate uses camelCase fields matching the Task 1
convention; jq queries use the field names verbatim.
Adds the ExitReason union as a distinct exported type so Task 5's
adaptive-cap decision and Task 9's loop exit can return values
type-checked against it.
Also extends MODULE_TEST_OWNERS coverage entry for plan-review-loop.ts
to include the new test file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): set-aware adaptive cap rule
computeConvergenceSnapshot compares round-k accepted objections against
prior-round annotations parsed from the plan file. Classifies each as
re-raise (prior accepted-and-resolved, same (location, severity)) or
new objection. Rejected-prior matches are deliberately neither -- they
are reviewer-prompt-fidelity signals, not synth-failure signals.
shouldBailAdaptive implements the decision table from the design spec:
hard cap at MAX_ROUNDS triggers stalemate_gate, accepted-count regression
triggers adaptive_cap_regression, re-raises-with-no-new triggers
adaptive_cap_re_raises_only. Precedence is explicit and tested across
six scenarios covering round 1, mid-loop continue, both bail paths,
the hard cap, and adaptive-disabled.
camelCase fields throughout (priorRoundAccepted, reRaises, newObjections,
noForwardProgress) matching Task 1 convention. Exports ConvergenceSnapshotInput,
RoundConvergenceSnapshot, AdaptiveCapInput, AdaptiveCapDecision so Task 9's
runPlanReviewLoop can compose with them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): TTY triage gate for per-objection decisions
runTriageGateTTY prompts user per CRITICAL objection with the 8-key menu
(a/r/d/v/A/R/s/q + Enter), captures optional rationale, surfaces re-raises
with prior rejection context, and supports fast-path A/R or early quit.
Stream-based input/output so tests can drive it without a real TTY,
following the existing build/orchestrator/__tests__/feature-review-prompt.test.ts
pattern: Readable.push(Buffer.from(...)) to avoid object-mode readline pitfalls.
Uses a line-queue/waiter pattern to decouple readline event emission from the
sequential ask() calls — avoids the ERR_USE_AFTER_CLOSE trap that occurs when
readline output ownership conflicts with the injected output stream.
8 tests cover: per-objection accept with rationale capture, mixed
reject/defer/accept, [A]ccept-ALL and [R]eject-ALL fast paths, [s]top
remainder, [q]uit early, and re-raise framing with prior-rejection context.
Returns TriageGateResult with quitEarly / fastPathed flags so Task 9's loop
can route exit codes correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): non-TTY triage modes for CI / scripts / agents
runTriageGateNonTTY: synchronous, no readline. Three modes, picked via the
--plan-review-noninteractive CLI flag added in Task 10:
- auto-accept (default, extends the existing IMPORTANT-objection non-TTY
behavior to CRITICAL): accept every objection, re-synth, continue
- fail-fast: exit code 3 on the first round with CRITICAL — strict CI gate
- auto-reject: reject every objection, annotate as rejected, proceed —
escape hatch for known-noisy reviewer runs
Returns NonTTYTriageResult with decisions[] (one per input objection)
and shouldFailFast flag. camelCase fields throughout matching Task 1
convention.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-reviewer): annotation-aware reviewer + synth prompts
Extends PLAN_REVIEW_PROMPT with a paragraph teaching the reviewer to read
prior-round annotations (USER:accept/reject, RESOLUTION:pending/disputed,
REVIEWER:re-raised) and not re-raise settled concerns. Promotes the const
to an export so the new snapshot test can verify it without dynamic
import gymnastics.
Adds new exported SYNTH_REVISION_PROMPT (formerly inline in
build/SKILL.md.tmpl Step 5.5 -- moved to plan-reviewer.ts so Task 10's
cli.ts can use it from a typed import) instructing the synthesizer to
honor user triage decisions, write RESOLUTION lines, and mark disputes
when the user accepted something the synth thinks is wrong.
Snapshot-tested so unintended drift surfaces in CI; non-snapshot
assertions pin the core annotation contract phrases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): in-process round loop runPlanReviewLoop
Composes the reviewer call, triage gate (TTY or non-TTY), annotation
writes, convergence snapshot, adaptive-cap decision, history JSONL
append, top-of-plan history header update, and synth dispatch -- all
in-process so re-launch overhead between rounds is eliminated.
runStalemateGate handles the user-facing AskUser at adaptive-cap-bail
or MAX_ROUNDS exit. Uses the same line-queue/waiter readline pattern
as runTriageGateTTY (Task 6) to work around Bun's readline
ERR_USE_AFTER_CLOSE on multi-question sequences.
Single-readline-per-loop: when isTTY, runPlanReviewLoop stands up one
shared readline + ask function and threads it into runTriageGateTTY
and runStalemateGate via an optional askFn injection point. Per-call
readlines drain the entire buffered input stream on the first close,
starving later rounds. Existing standalone-call tests for the gates
keep working because askFn defaults to undefined and each gate opens
its own readline in that case.
Tests cover: APPROVE round 1 (no synth, fastest path), bundle-1
trajectory 5->3->2->0 with three synth invocations, adaptive bail on
re-raises stall at round 2.
disputed_resolutions per-round count is a placeholder zero in this
commit; Task 11 wires it to the real RESOLUTION: disputed annotation
detection after each synth call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/cli): wire runPlanReviewLoop into startup path
Replaces the single-shot runPlanReview + reconcilePlanReview call with the
new in-process loop. Adds three CLI flags:
--plan-review-max-rounds=N (default 5; range 1..20)
--plan-review-no-adaptive-cap (off; disables forward-progress bail)
--plan-review-noninteractive=<auto-accept|fail-fast|auto-reject>
(default auto-accept)
Exit codes preserved on the existing side (0 approve, 1 runtime, 2 test fail);
new contract:
3 = stalemate (user picked [m] or non-TTY fail-fast hit CRITICAL)
4 = user abort (user picked [q] at gate)
130 = SIGINT during triage (unchanged)
Legacy plan-review-report.json is still written from loopResult.finalVerdict
so SKILL.md.tmpl Step 5.5 stalemate handler keeps working without changes
until Task 16 shrinks it.
synthFn adapter dispatches a configured Claude role with the
SYNTH_REVISION_PROMPT from Task 8 against the now-annotated plan file.
Since the dedicated planSynthesizer role does not yet exist in
role-config.ts, the adapter falls back to planReviewer's role config and
the timeout falls back to BUILD_DEFAULTS.timeoutsMs.planReview. Both are
forward-compatible: once planSynthesizer lands, the `as any` accessors
pick it up automatically.
reconcilePlanReview and readPlanReviewRound are no longer imported in
cli.ts now that the loop owns reconciliation and round tracking; they
remain exported from plan-reviewer.ts for tests that exercise them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/plan-review-loop): count disputed resolutions per round
Replaces the Task 9 placeholder disputedResolutions.push(0) after the
synth call with a real count: re-parse plan annotations, find this
round's entries with RESOLUTION starting "disputed", tally them.
The synth writes RESOLUTION: disputed -- <reason> when it disagrees
with a user-accepted objection. Each disputed entry surfaces in the
next round's triage so the user can re-decide. The aggregate count
in convergence.jsonl is the tuning signal for "how often does synth
disagree with user accepts" -- high counts suggest reviewer prompt
fidelity issues or user triage rationale weakness.
Annotation parse errors during this re-parse increment
annotationParseErrors and log to console.warn; they do not crash the
loop. The placeholder pushes on adaptive-bail-continue and on APPROVE
paths stay 0 (no synth invocation means no resolutions to count).
Also fixes annotationParseErrors declaration from const to let, which
was previously a no-op placeholder but now receives increments inside
the error catch path.
New test covers the disputed-resolution path: round 1 has 2 accepted
CRITICAL, synth marks one disputed and one applied, aggregate's
disputedResolutions[0] is 1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(build/integration): bundle-1 trajectory converges 5→3→2→0
Integration test using stub reviewer (replaying fixture data from the
real bundle-1 trajectory: 5 CRITICAL → 3 new → 2 new → APPROVE) and
a stub synth that replaces RESOLUTION:pending with RESOLUTION:applied.
Verifies the end-to-end story:
- 4 rounds, 3 synth invocations, exit APPROVE
- history JSONL has 4 lines (one per round)
- convergence aggregate has correct trajectory and totals:
trajectoryRaw [5, 3, 2, 0], totalAccepted 10, finalVerdict APPROVE
- plan file accumulates ROUND 1/2/3 annotations + top-of-plan history block
Fixture data lives in test/fixtures/build-convergence/ for sharing
across Tasks 13, 14, 15 (the other integration + E2E tests).
Coverage-matrix updated: plan-review-loop.ts now lists the integration
test as an additional owner alongside the unit tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(build/integration): adaptive cap bails on re-raises-only
Integration test verifying the no-forward-progress bail path:
Round 1 raises 2 CRITICAL, user accepts both, synth marks RESOLUTION
non-pending (a real fix attempt). Round 2 raises the same 2 CRITICAL
with zero new objections — adaptive cap detects this as a true stall
(reRaises=2, newObjections=0). Bail-out gate fires. User picks
[m]anual mode → exit code 3, outcome user_manual, exit_reason
user_manual in convergence.jsonl.
Synth is invoked exactly once (round 1) — the bail-out fires before
the round-2 synth call, saving the wasted API spend the design was
built to prevent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(build/integration): synth disputes a user-accepted objection
Integration test for the synth-dispute escape hatch:
Round 1 reviewer raises 1 CRITICAL ("use bcrypt"). User accepts. Synth
disagrees -- instead of applying the fix, it writes
RESOLUTION: disputed -- bcrypt conflicts with FIPS in this build.
The dispute is captured in two telemetry surfaces:
- disputedResolutions[0] === 1 in convergence.jsonl
- The plan annotation preserves the synth's reasoning verbatim so the
next round's reviewer (or user, if it re-surfaces) sees WHY synth
declined to apply the fix.
Round 2 reviewer reads the disputed annotation and approves (validating
that a dispute does not force a re-raise -- the reviewer can decide).
Loop exits clean with outcome "approved".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(e2e): real Codex respects round-annotation contract
Layer 4 E2E from the convergence design spec. Drives runPlanReviewLoop
with the real Codex planReviewer against the bundle-1 fixture plan
created in Task 12. Stub synth rewrites RESOLUTION: pending so the
round-2 reviewer call sees a non-pending resolution annotation.
Asserts that once round 1's objections are accepted and annotated as
RESOLUTION: applied, round 2's Codex call does not re-raise them
(reRaises[1] === 0) -- proving the reviewer prompt addition (Task 8)
actually changes real-Codex behavior, not just our parser tests.
Classified gate tier per CLAUDE.md; ~$0.50/run. Gated by EVALS=1;
free tests run without invoking Codex.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(build/skill): shrink Step 5.5 to handle exit codes only
The in-process loop in plan-review-loop.ts now resolves most rounds
inside the CLI. Step 5.5 shrinks to a four-branch dispatch on exit
codes 0/3/4/130, plus the existing exit-1/2 paths. The cross-round
annotation history lives in the plan file, so manual edits between
launches just need to preserve those annotations so the next round's
reviewer has context.
Bumps skill version 1.24.0 -> 1.25.0 (MINOR -- new capability).
NOT bumping top-level VERSION per fork rule in CLAUDE.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(build/orchestrator): document the plan-review convergence loop
Adds a section to the orchestrator README covering the loop architecture,
exit code contract, flags, telemetry file layout, triage gate key map,
and module boundaries. Cross-references the design spec.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(changelog): build skill 1.25.0 — in-process plan-review loop
User-facing entry for the convergence loop change. NOT bumping top-level
VERSION per CLAUDE.md fork rule -- only the build skill frontmatter
bumped (Task 16). Entry covers what changed for the user, projected
metrics from the bundle-1 case study, itemized add/changed/contributor
changes, with spec cross-reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(build/orchestrator): clarify exit code 3 ambiguity + fix dropped indent
Pre-landing review surfaced two README findings:
1. Exit code 3 has two meanings since this PR: lock contention at startup
AND plan-review stalemate. The general exit-code table now reflects
both. Exit codes 4 (user abort) and 130 (SIGINT) added to the table
for completeness.
2. The troubleshooting bullet for --mark-phase-committed lost its
2-space leading indent during the table reformatting pass. Restored
so the bullet continuation renders correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(build/orchestrator): strengthen 8 test gaps from pre-landing review
Pre-landing review found 8 test-quality findings. Addressed here:
1. types-convergence.test.ts: trivial echo-assertions replaced with
behavioral tests that exercise types via real function calls.
2. plan-annotation-round-trip.test.ts: new test for the fallback-prepend
path in writeRoundAnnotation when the referenced Phase heading is
absent from the plan text.
3. loop-converge-bundle-1.test.ts: added structural placement assertion
to confirm annotations land in the expected position (the fixture
intentionally exercises the fallback-prepend path for rounds 2-3
which raise objections at non-existent phases).
4. plan-reviewer-triage-tty.test.ts: new test for the [v]iew prose key
path, verifying the assessment text is shown before the re-prompt.
5. skill-e2e-build-convergence.test.ts: tier-gate pattern aligned with
other E2E tests (EVALS=1 && EVALS_TIER==='gate', no 'all' fallback).
Core contract assertion no longer wrapped in a conditional that
would let the test pass vacuously if Codex approves round 1.
6. plan-reviewer-loop.test.ts: new test exercising the max-rounds
stalemate path (reviewer always REVISE, non-TTY auto-accept,
confirms rounds === maxRounds at exit).
7. adaptive-cap-set-aware.test.ts: new test for the
'fewer accepted with new objections' branch, confirming the loop
continues when accepted count decreases but new dimensions appeared.
8. History-entry expectations updated for tests where the production
fix (IMPORTANT/SUGGESTION counts now recorded) changes the asserted
shape. Most tests use CRITICAL-only fixtures and are unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-review-loop): restore IMPORTANT/SUGGESTION handling + 9 cleanups
Pre-landing review surfaced one CRITICAL contract violation and 9
informational items. All addressed in this commit:
CRITICAL — IMPORTANT/SUGGESTION objections silently dropped (regression).
The early-return at the APPROVE / zero-CRITICAL branch treated REVISE-with-
non-CRITICAL-only as APPROVE without annotating IMPORTANT/SUGGESTION
objections in the plan file or recording them in the history. The pre-
loop reconcilePlanReview path handled them via inline annotate-and-
proceed; that path is no longer reached because cli.ts dropped the
reconcilePlanReview call when wiring runPlanReviewLoop. Restored by
filtering important/suggestion, calling writeRoundAnnotation for each,
and recording real counts in HistoryEntry.
Maintainability cleanups:
- LoopOutcome aliased to ExitReason (byte-identical union); the as-cast
in writeAggregate removed.
- Dead sigint literal removed from ExitReason; SIGINT handling lives in
cli.ts signal handler, not the loop return type.
- ROUND_HISTORY_FORMAT_VERSION removed from types.ts; was never imported.
- Always-false interrupted field removed from ConvergenceAggregate; no
code path mutated it.
- Stale "Task 11 placeholder" comment on disputedResolutions replaced
with accurate description of what the array contains.
- makeReadlineAsk helper extracted; the ERR_USE_AFTER_CLOSE workaround
for Bun readline is now in one place instead of three duplicated
blocks (triage TTY standalone, loop shared, stalemate standalone).
Performance cleanups:
- Two redundant readFileSync calls per round eliminated: the plan-text
read for re-raise detection is reused for annotation-write, and the
in-memory updatedPlan is passed to updateRoundHistoryHeader instead of
re-reading from disk after writing.
- existsSync + statSync collapsed to statSync with try/catch.
Performance finding #4 (pre-parse annotations once before
writeRoundAnnotation loop) deferred — requires writeRoundAnnotation
signature change in plan-reviewer.ts; sub-ms cost at typical scale
(K=5-10 objections, 20-50KB plan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-reviewer): close 4 silent-data-loss annotation bugs
Red Team review verified 4 round-trip corruption modes in the annotation
parser/writer. Each fails silently with no warning.
1. '-->' in any user-supplied field closes the HTML comment early.
Subsequent lines bleed into plan text; parser silently drops rounds[].
2. '"' in rationale stops the [^"]* regex group at the first inner
quote. Rationale field silently dropped.
3. ']' in location stops the [^\]\n]+ regex group at the first inner
bracket. ENTIRE annotation silently lost.
4. Non-canonical whitespace in existing in-file annotations makes
writeRoundAnnotation's merge path a silent no-op. Round 2's
decision silently lost.
Fix #1-3: HTML-entity encoding on serialize, decode on parse. Encode
ampersand first (so user input containing literal escape sequences
round-trips correctly); decode ampersand last. Encoded sequences:
']' -> ']', '"' -> '"', '-->' -> '-->', '&' -> '&'.
Applied to userRationale, issue, suggestion, location, resolution.
userDecision is a fixed enum (no encoding). Operator warning logged
on first character requiring encoding per field so silent encoding
never lands without an audit trail.
Fix #4: writeRoundAnnotation merge path walks the regex over planText
directly to locate the actual byte range of each existing block
(handling non-canonical whitespace, CRLF, external-editor reformat).
A fresh local regex avoids the shared ANNOTATION_BLOCK_RE's lastIndex
being reset inside parseRoundAnnotations during the loop, which
otherwise caused infinite loops on the same span.
Twelve new tests pin the round-trip invariants:
- '-->' / '"' / ']' / '%' / '&' all round-trip correctly
- merge into non-canonical 8-space-indent block preserves round 2
- literal ']' does not decode to ']' on parse
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-review-loop, cli): close 8 Red Team findings
CRITICAL #1 (regression from round-1 fix): TTY mode now prompts per
IMPORTANT objection (y/skip/all) instead of unconditionally
auto-accepting. Restores the pre-loop reconcilePlanReview contract
that was lost when the loop went in-process. SUGGESTION objections
remain auto-annotated (no prompt). Non-TTY mode still auto-accepts
per existing CI contract.
CRITICAL #6: runPlanReviewLoop now calls deriveRoundNumber on
startup, reads prior history, and starts the round counter at the
next available number on resume. Trajectory arrays are hydrated
from history so shouldBailAdaptive has the right context. Resume
after exit-3 or exit-130 no longer duplicates round numbers in
history.jsonl or writes a second convergence aggregate record.
INFO #7: objectionCountRaw set to critical.length only (matches JSDoc
and other paths). Verdict written as verdict.verdict, not hardcoded
"APPROVE", in the REVISE-with-zero-CRITICAL branch.
INFO #8: Early-exit paths (user quit, non-TTY fail-fast) now push
sentinel -1 to trajectoryAccepted, reRaisesArr, reRejectedArr,
disputedResolutions to maintain length parity with trajectoryRaw.
User-quit path also writes an INTERRUPTED history entry.
INFO #9: synthFn rejection wrapped in try/catch. On failure, the
loop writes an INTERRUPTED history entry, writes the aggregate with
exitReason: reviewer_unavailable, closes shared resources, and exits
with code 1.
INFO #10: Plan-file writes use atomic tmp + rename pattern.
SIGINT during a write can no longer leave a half-written plan that
the parser would silently skip on resume.
INFO #11: HOME fallback uses os.homedir() with os.tmpdir() as
secondary. Containers without HOME no longer pollute the project
worktree.
INFO #12: [s]top key in runTriageGateTTY no longer prompts for the
current-objection rationale before exiting. The current decision is
pushed with empty rationale and the loop moves on.
Tests added: resume-from-history, synth-failure-handled, TTY
IMPORTANT prompt, array-length parity on quit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-review-loop): close adversarial review CRITICALs
Claude adversarial review found two real bugs the 3-specialist + Red
Team rounds missed. Both verified by reading the code.
C1 (CRITICAL): TTY infinite-loop on stdin close. The decision-loop
default case in runTriageGateTTY (and two siblings in the IMPORTANT
prompt + runStalemateGate) was `opts.output.write("Invalid input '${ans}'...")`
with no EOF handling. Once makeReadlineAsk's streamClosed sets the
internal flag, ask() resolves with "" forever, and the while-loop
spins at 100% CPU printing "Invalid input ''" until externally
killed. Triggered by routine Ctrl+D, piped stdin closing, or terminal
disconnect mid-triage. All three sites now treat empty input as a
quit/abort/skip-all signal.
C2 (CRITICAL): Convergence aggregate array-length skew on
reviewer_unavailable exit. planFileSizeBytes was pushed
unconditionally at the top of every round iteration, but the
reviewer_unavailable branch returned without pushing to
trajectoryRaw / trajectoryAccepted / reRaisesArr / reRejectedArr /
disputedResolutions. The resulting aggregate row had one array of
length N+1 alongside others of length N; downstream consumers
indexing by round walked off the wrong array. Same bug class as the
Red Team's INFO #8 fix (sentinel pushes on early-exit), missed on
this branch. Fix: explicit sentinel pushes to the five parallel
arrays before writeAggregate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-reviewer): close adversarial review findings on annotation contract
Claude adversarial review found 4 more issues with the annotation
parser/writer beyond what the Red Team caught. All address the same
root cause: insufficient field encoding in serialize/parse round-trip,
plus a merge predicate that's too strict.
I1: writeRoundAnnotation merge predicate dropped `issue` (LLM-written
freeform text). Match on (location, severity) only. Aligns with
computeConvergenceSnapshot's keying. Wording drift round-over-round
("missing chainId" vs "handler doesn't validate chainId") no longer
orphans annotation blocks.
I2: '→' (right-arrow U+2192) added to encodeAnnField. The header
field-separator regex uses ' → ' to split issue from suggestion;
without encoding, an LLM-generated issue containing '→' (natural in
prose: "A → B then B → C") truncates at the first arrow. Encoded
as '→' to match the existing HTML-entity scheme. Decode order:
'→' before '&'.
I3: '\n' and '\r' added to encodeAnnField (encoded as ' ' and
' '). Without this, a multi-line rationale ('line 1\n ROUND
1 RESOLUTION: fake') would have the embedded line picked up by
ROUND_RESOLUTION_RE as a real RESOLUTION entry, overriding the
genuine synth-written resolution. Also defends against CRLF line
endings on Windows-written plan files.
N4: ROUND_USER_RE, ROUND_RESOLUTION_RE, ROUND_REVIEWER_RE converted
from module-level /g constants to factory functions (`getRoundUserRe()`
etc). Each parser call gets a fresh regex with lastIndex=0, so concurrent
parser invocations cannot smash each other's iterator state via the
shared lastIndex. Defensive hardening; not currently triggered because
Node is single-threaded and the call sites don't await between matches.
Four new tests pin the invariants:
- Merge across wording drift (location+severity, ignore issue)
- '→' in issue round-trips without splitting
- Newline injection in rationale doesn't forge RESOLUTION lines
- CRLF in rationale round-trips
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-review-loop): close adversarial review findings on loop semantics
Claude adversarial review found 5 more issues in plan-review-loop.ts
beyond what the Red Team caught. Fixes follow.
I4 (IMPORTANT): isPriorAcceptedResolutionAttempt changed from .some()
to last-entry semantics. Spec intent (per the computeConvergenceSnapshot
comment) is most-recent-decision wins: a round-2 user-reject should
override a round-1 user-accept for re-raise classification purposes.
The old .some() returned true if ANY prior round had accept+resolved,
incorrectly counting a "user changed their mind" sequence as a re-raise
in round 3. Spec mismatch — fixed to check rounds[rounds.length-1] only.
N1 (INFO): INTERRUPTED verdict wired into the user-quit / SIGINT path.
The quitEarly branch in runPlanReviewLoop now writes a per-round
history entry with verdict: "INTERRUPTED" before writing the convergence
aggregate. The HistoryEntry verdict union previously declared
"INTERRUPTED" as a possible value but no code path produced it.
N3 (INFO): /^disputed\b/ regex made case-insensitive (i flag) so synth
output drift ("Disputed", "DISPUTED") is still counted as a disputed
resolution. Tuning signal robustness across model behavior changes.
N5 (INFO): priorRejectRationale Map now hydrates from prior plan
annotations on resume (startRound > 1). Without this, re-raise framing
on resumed runs showed empty strings for the user's earlier rejection
rationale even though the rationale was present in the plan file.
N6 (INFO): atomicWriteFile tmp filename suffix extended with
crypto.randomBytes(8) for cross-process collision safety. The
pid+ms approach was unique within a single Node process but
container-in-container scenarios could share both.
N7 (empty rationale to undefined on round-trip) is owned by
plan-reviewer.ts (parallel scope) — not addressed here.
Three new tests cover:
- last-round-wins re-raise classification (positive + negative)
- case-insensitive 'Disputed' counted as disputed resolution
- priorRejectRationale hydration on resume shows prior rationale
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/plan-review-loop, cli): close 3 Codex HIGH findings on loop semantics
H2: synthFn now propagates the SubAgentResult's exit-code as `ok`,
and runPlanReviewLoop checks the returned ok. A synth that hit a
timeout, model-not-found error, or non-zero exit no longer silently
masquerades as a successful round. Failure routes through the
synth_failure exit reason added in c5e7b3e.
H3: [c] Continue anyway at the adaptive bail gate now runs the synth
before looping. Previously the `continue` statement jumped straight
to the next round iteration, skipping synth dispatch entirely — the
next paid reviewer round then reviewed the same unresolved plan with
RESOLUTION: pending unchanged, virtually guaranteeing re-raises and
burning the round. Extracted the synth-call into a helper so the
normal-flow path and the continue path share one code path.
H4: Resume past maxRounds (startRound > maxRounds, e.g., user re-
launched after exiting manual mode at the cap) no longer falls
through the empty for-loop to a `lastVerdict!` non-null assertion
on a null value. Added a startRound > maxRounds guard that synthesizes
an APPROVE verdict ("Resume past maxRounds; review skipped.") and
returns with outcome: "approved", round: startRound - 1.
Three new tests pin the invariants:
- synthFn returning ok:false routes to synth_failure exit code 1
- [c] continue invokes synth before next reviewer round
- resume with startRound > maxRounds doesn't crash
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(build/cli, plan-review-loop): close 3 Codex structured review P1/P2 findings
Codex structured review (the formal --base main P0/P1/P2/P3 pass) found
three issues that bypass the plan-review gate. All closed here.
P1 #1: cli.ts:9935-9963 only handled exit codes 3 / 4 / 130 then fell
through to `state.planReview = loopResult.finalVerdict; saveState; ...`
and proceeded into implementation. exitCode 1 (synth_failure) silently
bypassed the gate. Now exitCode 1 persists state.planReview with
status: "synth_failure" and throws ExitError(1). The build stops; the
resume gate (see next fix) picks it up.
P1 #2: cli.ts:9814-9818 resume gate only re-ran the loop when
state.planReview was missing or status === "critical_exit_pending".
Status values "user_aborted" and "synth_failure" were treated as
"already reviewed" — the next resume saw a REVISE verdict and proceeded
to phases. Both pending statuses now route through the gate so resume
picks up the loop as the docs promise.
P2: plan-review-loop.ts:851 resume-past-maxRounds auto-approved with a
synthetic APPROVE verdict purely based on history length. A user who
typed `gstack-build resume` without actually editing the plan bypassed
the review gate. Now runs ONE more verification reviewer call on the
(potentially-edited) plan. APPROVE → proceed. REVISE with CRITICAL →
exit STALEMATE (3); the user must edit further or pass
--no-plan-review to explicitly override.
The existing H4 test ("returns approved with synthetic verdict") was
updated to assert the new behavior (one reviewer call, APPROVE result),
and a companion H4-P2 test covers the REVISE-from-verification path
that exits STALEMATE.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: post-ship documentation update for plan-review convergence (v1.40.5.0)
- CHANGELOG.md: remove stray merge-artifact line (> > > > > > > origin/main)
and fix stray # in Design spec bullet point in For contributors section
- build/orchestrator/README.md: add plan-reviewer.ts, plan-review-loop.ts,
drain-faults.ts, halt-event-helpers.ts, halt-events.ts to Architecture
module map (all added on this branch, were missing from the module list)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
anbangr
added a commit
that referenced
this pull request
May 26, 2026
…ndings + monitor + pytest-cov + Gemini model) + critical /ship hardening (#103) * feat(build/orchestrator): plumb pendingFeatureReviewOutputPath through FEATURE_REDO Plumbing-only commit for Bug A in ~/.claude/plans/fix-orchestrator-mitosis-oasis-may-26-faults.md. Three structural changes, no behavioral change yet: - New optional `pendingFeatureReviewOutputPath?: string` field on PhaseState (types.ts). Documents the FEATURE_REDO hygiene-deadlock bug class this field exists to close. - `resetPhaseStateForRedo(state, phaseIndex, opts?)` now accepts an optional `featureReviewOutputPath` option. When set, the path is written to phaseState.pendingFeatureReviewOutputPath. When unset (every other caller — startup --reset-phase, tests), the field is explicitly deleted to prevent stale-path bleed-through across unrelated resets. Backwards compatible: existing 2-arg callers compile and behave identically. - FEATURE_REDO dispatch (cli.ts:7499) now passes outputFilePath into the reset call for every named phase. The phase state now carries the findings path; the impl dispatch will read it in the next commit. No reader exists yet — the field is plumbed but unconsumed. Commit 2 (coming next) wires the primary-impl dispatch to read the field and thread the findings into buildGeminiPromptBody's reviewFeedback parameter, completing the fix. Closes the bug-class evidence: - ~/.gstack/skill-faults/manual-1779768463981/MANUAL_INVESTIGATION:0:b335a492.md - ~/.gstack/skill-faults/manual-1779766926082/MANUAL_INVESTIGATION:0:efe61aab.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator): thread feature-review findings into primary-impl on FEATURE_REDO Closes the FEATURE_REDO hygiene-deadlock bug class. Pairs with the plumbing commit (c09c75f) which added the pendingFeatureReviewOutputPath field; this commit is the reader. Before this commit: FEATURE_REDO reset a phase to pending, the orchestrator re-entered the phase, hit the RUN_GEMINI action, and called buildGeminiPromptBody WITHOUT the 4th `reviewFeedback` parameter. The impl agent saw the standard phase description, observed the existing worktree code looked fine (the prior pass already shipped work), made no commit, and the hygiene gate rejected "primary implementor did not create a new commit". The phase paused, requiring manual operator intervention: read the feature-review output by hand, make the code change, --mark-phase-committed. After this commit: 1. cli.ts RUN_GEMINI dispatch reads phaseState.pendingFeatureReviewOutputPath when set. If the file is readable, its contents are passed as the 4th argument to buildGeminiPromptBody. The impl agent now receives a "## Previous review findings" block (with the prompt-injection defenses buildGeminiPromptBody already implements) AND has access to the existing NO_CHANGES_NEEDED sentinel for the legitimate "work already done in HEAD" case. 2. cli.ts logs `↻ FEATURE_REDO: threading feature-review findings from <path>` when wiring fires. If the file is missing or unreadable, logs a warning and falls back to the standard prompt rather than crashing — best-effort delivery. 3. phase-runner.ts applyResult(RUN_GEMINI) unconditionally clears pendingFeatureReviewOutputPath on the returned next-state. The findings have been delivered to the impl call; whether impl succeeded or failed, the directive's purpose is fulfilled. The next impl attempt comes from either a fresh FEATURE_REDO cycle (which re-sets the field via resetPhaseStateForRedo) or from --reset-phase (which clears it via the else branch added in the plumbing commit). Existing tests pass unchanged: 179/179 in drift + hygiene-sibling + phase-runner suites. Behavioral regression test for the wiring lands in commit 7 (the consolidated test suite for this batch). Closes the bug-class evidence: - ~/.gstack/skill-faults/manual-1779768463981/MANUAL_INVESTIGATION:0:b335a492.md - ~/.gstack/skill-faults/manual-1779766926082/MANUAL_INVESTIGATION:0:efe61aab.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator): gate pytest --cov injection on pytest-cov availability (Bug C) Closes Bug C in ~/.claude/plans/fix-orchestrator-mitosis-oasis-may-26-faults.md. Before this commit: injectCoverageFlags unconditionally appended `--cov --cov-report term-missing` whenever the testCmd matched `\bpytest\b`. Projects that intentionally disable coverage in CI (mitosis-oasis CLAUDE.md:294 is the canonical case) hit `pytest: error: unrecognized arguments: --cov=...` and exit 4 during argparse, before any tests run. The test-fixer role then iterated 3 times unable to fix a missing system dependency, and the hygiene gate rejected the no-op fix attempts. After this commit: 1. New exported `hasPytestCov(cwd)` helper. Two cheap signals, either sufficient: pyproject.toml text grep for `pytest-cov`, or `python -c "import pytest_cov"` exit 0. Returns false on any error (no python, no pyproject.toml, timeout, etc.) — better to skip injection than crash pytest. 2. `injectCoverageFlags(testCmd: string, cwd: string)` — added required `cwd` parameter. Idempotent short-circuit (`testCmd.includes("--cov")`) fires first so the probe is only run when actually needed. When the probe returns false, the command is returned unchanged with a one-line console warning naming the missing dep and the bug ID. 3. Single existing call site at `cli.ts:8527` updated to pass `cwd` (already in scope). 4. Existing `injectCoverageFlags` tests updated for the new signature: beforeAll sets up two temp dirs (one with pytest-cov in pyproject.toml, one without). Pytest tests use the cov-available cwd to preserve the existing "appends --cov" behavior; new T-C2 asserts the missing-cov cwd skips injection. 235/235 pass in sub-agents.test.ts (was 234/234 plus the new T-C2). Closes the bug-class evidence: - ~/.gstack/skill-faults/manual-1779768446214/MANUAL_INVESTIGATION:0:a076e67f.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator/monitor): suppress stale alarm when subagent children are alive (Bug B leg 1) Closes Bug B leg 1 in ~/.claude/plans/fix-orchestrator-mitosis-oasis-may-26-faults.md. Before this commit: monitor.ts:713-715 declared `stale = true` based solely on `lastUpdatedAt` ageing past staleWindowMs (default 30s). Long Codex /qa or Claude /review calls (3-10 minutes is normal) produce no state.json writes during execution, so the timestamp aged out and the monitor fired USER_ACTION_REQUIRED → supervisor escalation. The supervisor's verdict was always "re-enter the monitor", which immediately fired the same false alarm again. Loop-guard caught the pattern after 3 escalations in 7 minutes but the underlying re-entry cycle burned a Codex supervisor call per loop until then. After this commit: 1. New exported helper `hasActiveSubagentChild(orchPid)` runs `ps -e -o ppid=,comm=`, filters lines whose ppid matches the orchestrator PID, and returns true if any `comm` basename contains `codex`, `claude`, `gemini`, or `kimi`. POSIX-portable ps invocation (avoids Linux-only --ppid). Defensive — returns false on any error (no ps, timeout, parse failure), so the caller falls back to the unmodified stale check rather than silently breaking the alarm. 2. `readRunSnapshot` calls hasActiveSubagentChild(pid) once per poll and ANDs the result (negated) into the `stale` boolean. When a subagent child is alive, stale is false regardless of how old `lastUpdatedAt` is. When no subagent is alive AND lastUpdatedAt is old, the alarm still fires (preserves real hang detection). 3. Leg 2 (next commit) adds a complementary status-aware threshold for in-flight subagent statuses, defense-in-depth in case `ps` returns empty (e.g., child wrapped in a shell). 43/43 monitor tests pass. Closes the bug-class evidence: - ~/.gstack/skill-faults/manual-1779768454543/MANUAL_INVESTIGATION:0:3f09e1bc.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator/monitor): status-aware stale threshold for in-flight subagent calls (Bug B leg 2) Defense-in-depth complement to leg 1 (a580461). The child-process check is the primary signal, but it returns false when: - subagent is wrapped in a shell that doesn't propagate the binary name - subagent runs inside a sandbox that hides the child from ps - ps itself is unavailable or times out In any of those cases, the unmodified stale check would fire spuriously. This leg adds a second signal: if any phase in state.phases is in a *_running status, multiply the stale window by 30x (default 30s → 15 min, matching the worst-case subagent timeout). Conservative: applies to ALL phases, not just the "current" one, since multiple phases could be in flight via parallel features. Status set: test_spec_running, gemini_running, test_fix_running, codex_running, dual_impl_running, dual_tests_running, dual_judge_running. Detection is `status.endsWith("_running")` so any future *_running state automatically participates. 43/43 monitor tests pass. Closes the bug-class evidence: - ~/.gstack/skill-faults/manual-1779768454543/MANUAL_INVESTIGATION:0:3f09e1bc.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build): refresh Gemini model in configure.cm + add 404 preflight probe (Bug D) Closes Bug D in ~/.claude/plans/fix-orchestrator-mitosis-oasis-may-26-faults.md. Two parts: 1. configure.cm: replace all 5 occurrences of "gemini-3.5-flash" with "gemini-2.5-flash". Confirmed valid via `echo "ping" | gemini -m gemini-2.5-flash` (returned "pong"). The 3.5-flash identifier returns HTTP 404 ModelNotFoundError on every primary call, burning ~10s before falling back to Codex. Affected roles: testWriter (primary + backup), primaryImpl (backup), testFixer (primary), ship (backup), land (backup). All Gemini-touching roles realigned at once — partial fixes invite confusion about which roles are still misconfigured. 2. New `parseGeminiModelProbeStderr(stderr)` pure function — parses Gemini CLI stderr for ModelNotFoundError / code: 404 / "Requested entity was not found" signals. Returns `{ok, reason?, model?}`. Pure, no I/O — testable without network. 3. New `assertGeminiModel(model)` async function — sends a 1-token ping prompt with `-m <model>` once per (model, session), parses stderr via the pure function, caches the result, emits ONE startup warning per invalid model with remediation pointer to configure.cm. Gated by GSTACK_DISABLE_AUTH_PREFLIGHT=1 env knob (consistent with the existing auth preflights). 4. runGeminiTask call site (sub-agents.ts:1930+) calls assertGeminiModel fire-and-forget (`void`) after auth succeeds. Non-blocking — the actual call proceeds regardless, falls back via the existing backup path on real failure. The probe cache ensures we don't burn an extra Gemini API call per phase; the warning fires once per session per invalid model. 5. _resetAuthPreflightForTests() now also clears the model probe cache + warned set, so the integration tests can re-assert wiring without stale state. 235/235 pass in sub-agents.test.ts. Unit tests for the parser land in commit 7. Closes the bug-class evidence: - ~/.gstack/skill-faults/manual-1779768437194/MANUAL_INVESTIGATION:0:856e5242.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(build/orchestrator): regression suites for the 4-fault mitosis-oasis batch Three new test files, 29 tests total, all passing. Pairs with commits c09c75f..cc1fafd. feature-redo-findings-injection.test.ts (12 tests): - T-A1a/b/c: resetPhaseStateForRedo persists / clears / explicit-undefined-clears the pendingFeatureReviewOutputPath field - T-A2a/b: buildGeminiPromptBody emits the findings block + prompt- injection defenses ONLY when the 4th arg is set - T-A3a/b/c: applyResult(RUN_GEMINI) clears the pending field on success / exit-fail / timeout - T-A4a: file-read contributes contents to prompt body (graceful fallback for missing files is exercised at the cli.ts dispatch site, covered by T-A5b's static-grep) - T-A5a/b/c: static-grep wiring guards — FEATURE_REDO dispatch passes outputFilePath, RUN_GEMINI dispatch reads + threads the field, applyResult clears the field. Future refactors that break any of these silently regress to the bug class. monitor-stale-subagent-child.test.ts (6 tests): - T-B1a: hasActiveSubagentChild handles null / non-positive pid - T-B1b: returns false when no subagent-named child exists - T-B1c: returns true when a child binary literally named "codex" exists. Uses the cp /bin/sleep → /tmp/.../codex trick because kernel exec semantics mean a shebang script reports comm="sh", not the script name. - T-B5a/b/c: static-grep guards for both legs — leg 1 calls hasActiveSubagentChild(pid), leg 1 ANDs !subagentChildAlive into the stale conditional, leg 2's anyPhaseRunning detection + baseStaleWindowMs * 30 multiplier survive future refactors. gemini-model-probe.test.ts (11 tests): - T-D1a-h: pure-function tests for parseGeminiModelProbeStderr. Empty / unrelated noise → ok:true. ModelNotFoundError / code: 404 / "Requested entity was not found" / case-insensitive matching → ok:false. Model name extraction when present, absent when not. - T-D2a: configure.cm grep — no stale "gemini-3.5-flash" refs, required "gemini-2.5-flash" present. - T-D2b: runGeminiTask call site invokes assertGeminiModel fire-and-forget. - T-D2c: _resetAuthPreflightForTests clears both the probe cache and the warned-set so test isolation survives. Bug C (pytest-cov gate) regression test landed inline with commit c28ca03 as T-C-NEW in sub-agents.test.ts — keeps the per-bug isolation while consolidating the helper coverage. 443/443 pass across all impacted suites. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(build/orchestrator): security + correctness follow-ups from /ship review (critical fixes) Six tightening fixes from the /ship-time pre-landing + adversarial reviews. All inline so the deliverable PR ships safe, none of them deferred to a follow-up branch. 1. **Bug A path containment + size cap (CRITICAL — credential exfil primitive).** `cli.ts:8005-8030`. state.json is operator-editable, so `pendingFeatureReviewOutputPath` is untrusted input at read time. A tampered state pointing the field at `~/.ssh/id_rsa`, `~/.gstack/security/device-salt`, `~/.aws/credentials` etc. would slurp the file via `fs.readFileSync`, run it through the existing sanitizeReviewFeedback (which only redacts GATE markers, NOT arbitrary secrets), and embed it in the impl prompt sent to the Gemini API — exfiltrating secrets. Fix: route the path through the existing `validateLogPathInScope(rawPath, state.slug)` helper (same pattern used at cli.ts:7115, 7625, 8113); reject out-of-scope paths with a warn. Add a 1 MiB size cap via fs.statSync before the read so a pathological huge findings file (or symlink to /dev/zero) can't OOM the orchestrator. T-A4c pins the wiring statically. 2. **Bug B exact-basename match instead of substring (HIGH-IMPACT FP-vector tightening).** `monitor.ts:341-348`. The original `comm.includes(name)` would match any process whose basename contained `codex` / `claude` / `gemini` / `kimi` as a substring — `mycodex-wrapper`, `kimichi`, `unkimi`, `notgemini` would all silently suppress the stale alarm forever. Tightened to exact- equality match against the basename (still handles the path-prefix case via the basename strip above). T-B1d pins the regression with a real `mycodex-wrapper` child process spawn. 3. **Bug C python3 fallback (silent feature loss on modern macOS).** `sub-agents.ts:3307-3322`. The original `spawnSync("python", ...)` ENOENT'd on macOS 12+ and many Linux distros (only `python3` ships by default), silently disabling the --cov injection for projects that DID install pytest-cov via pip/conda. Loop now tries `python3` first, falls back to `python`. 4. **Bug D async spawn (non-blocking probe).** `sub-agents.ts:443-490`. The original `spawnSync` inside `assertGeminiModel` blocked the orchestrator event loop for up to 15 seconds on the FIRST probe per unique model — visible as a stall before phase work started. The `void` modifier discards the Promise but does NOT make spawnSync non-blocking. Switched to async `spawn` + accumulated stderr with a 15s timeout. The probe is now genuinely fire-and-forget. 5. **Bug D Promise-cache (concurrent-probe race).** `sub-agents.ts: 416-422`. Caching the resolved RESULT (not the Promise) still raced: two `assertGeminiModel(model)` calls landing within the same tick would both miss the cache before the first spawn returned and both fire full probes. Caching the Promise (and inserting it BEFORE the spawn returns) means concurrent callers for the same model share one in-flight probe. 6. **Bug D .catch on call site (unhandled-rejection safety).** `sub-agents.ts:2056`. Changed `void assertGeminiModel(opts.model)` to `assertGeminiModel(opts.model).catch(() => {})`. If assertGeminiModel ever rejects (refactor, new code path), the .catch swallows it so the probe can never crash a phase. The internal try/catch already returns `{ok: true}` on probe-throw, but defense in depth is cheap. Tests: - T-A4b: cli.ts dispatch wraps fs.readFileSync in try/catch + warn message (static-grep wiring guard) - T-A4c: cli.ts dispatch calls validateLogPathInScope + applies FINDINGS_MAX_BYTES cap (static-grep wiring guard) - T-B1d: mycodex-wrapper child does NOT match (real-process regression test for the substring-FP tightening) - T-D2b updated: call site uses .catch(...) instead of `void` 446/446 pass across the 7 impacted suites. Review provenance: - Pre-landing review (Claude subagent): findings #1, #2, #3, #4 - Adversarial review (Claude subagent): findings #1 (CRITICAL), #3 (path containment), #4 (size cap), #5 (substring FP), #7 (python3), #8 (Promise cache), #9 (event-loop block) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anbangr
pushed a commit
that referenced
this pull request
May 31, 2026
…first floor covers all 51 skills (garrytan#1712) * docs(designs): add v2_PLAN.md — gstack v2 the lightest opinionated skill pack The approved plan from /plan-ceo-review → /plan-eng-review → /codex×2 → /plan-devex-review. Captures the v1.45/v2.0 hybrid release shape, cathedral parity-eval suite, sequential v1.45 execution, sections/*.md.tmpl pipeline, EVALS_BUDGET_HARD_CAP override path, and v2 launch copy specs. This commit just lands the design doc. Implementation follows in the rest of the v1.45.0.0 branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(parity): T0a — capture v1.44.1 baseline + capture helper + diff utility Cathedral parity-eval suite primitive. captureBaseline() walks every top-level SKILL.md and records bytes, lines, estimated tokens, frontmatter description length, and eval coverage. diffBaselines() reports per-skill delta + total corpus delta + catalog tokens delta. Locks the v1.44.1 reference snapshot at test/fixtures/parity-baseline-v1.44.1.json. After Phase A+B+C land, scripts/capture-baseline.ts --tag v1.45.0.0 produces a comparable snapshot; diff supplies the real numbers the v2 CHANGELOG quotes. Never invent baseline numbers; ship them only if they came from a real run. v1.44.1 numbers captured this commit: - 51 skills - 2,847 KB total corpus - ~9,319 catalog tokens (sum of description bytes / 4) - top 3: ship 160 KB, plan-ceo-review 128 KB, office-hours 108 KB Test plan: - bun test test/helpers/capture-parity-baseline.test.ts passes 4/4 - The baseline JSON file is committed so reviewers can audit v1→v2 numbers Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(resolvers): T2 — ResolverEntry + appliesTo gate infrastructure Adds the conditional-resolver-injection plumbing from the v2_PLAN A.1 step. Resolvers can now be either a bare ResolverFn (always fires, current behavior) or a ResolverEntry { resolve, appliesTo? } (gated; appliesTo returning false skips the resolver, substitutes empty string). Why infrastructure-only: the audit during T0a confirmed most resolvers don't need gating. The {{NAME}} placeholder system is already conditional at the template level — a resolver only fires for skills that reference it. The gate is for future use when a placeholder's audience needs a structural guardrail beyond social convention, or when a sub-resolver inside a larger composed resolver (e.g. preamble) needs per-skill skip. scripts/gen-skill-docs.ts:444 now uses unwrapResolver() to handle both shapes. RESOLVERS map signature widens from Record<string, ResolverFn> to Record<string, ResolverValue>. All existing resolvers stay bare functions and work unchanged. Test plan: - bun test test/resolver-entry.test.ts: 6 pass (gate plumbing + registry) - bun test test/gen-skill-docs.test.ts: 389 pass (no regression) - bun run gen:skill-docs --dry-run: all SKILL.md files FRESH (no diff) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(preamble): T3 — jargon dedup + terse-build flag (Phase A.2 + A.3) A.2 jargon dedup: generate-writing-style.ts replaces the inlined 80-term jargon list with a one-line pointer to scripts/jargon-list.json. The list was duplicated into every tier-2+ skill (48 of 51 skills); inlining cost was ~1.5 KB × 48 = ~70 KB across the corpus. Pointer cost is ~30 bytes per skill. Agents Read the JSON once per session on first jargon term encountered; thereafter the terms array is the canonical reference. A.3 terse build flag: --explain-level=terse compresses preamble prose at gen time. When the flag is set, writing-style collapses to a one-line terse directive and completeness-section + confusion-protocol + context-health are dropped entirely. The default build keeps the runtime-conditional behavior intact (sections still render; the model skips them when EXPLAIN_LEVEL: terse appears in the preamble echo). Terse build is opt-in for users who want shipped skills to match their runtime preference and avoid the per-session terse-mode dead prose. TemplateContext gains an optional `explainLevel: 'default' | 'terse'` field. Default builds set it to 'default'; --explain-level=terse sets 'terse'. Resolvers gate their output via `ctx?.explainLevel === 'terse'`. Measured impact (default build, post-T3): - Total corpus: 2,847 KB → 2,812 KB (saved 35 KB) - ship.md: 160 → 159 KB - plan-ceo-review.md: 128 → 127 KB - Top 10 heaviest: all slightly smaller from jargon pointer Larger compression lands in T4 (catalog trim) and T7 (atomic regen across the full Phase A pipeline). The terse build path further compresses to ~711K tokens vs default ~725K (saved ~14K tokens corpus-wide). Test plan: - bun test test/gen-skill-docs.test.ts: 389 pass (no regression) - bun test test/resolver-entry.test.ts: 6 pass - bun test test/helpers/capture-parity-baseline.test.ts: 4 pass - bun run gen:skill-docs --explain-level=terse: ship.md drops completeness + confusion-protocol + context-health sections; writing-style collapses to one-line terse directive 48 SKILL.md files updated (every tier-2+ skill picks up the jargon pointer). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(catalog): T4 — catalog trim + proactive-suggestions.json (Phase A.4) Shortens frontmatter `description:` in every Claude SKILL.md to a single lead sentence + (gstack) tag. The routing prose ("Use when asked to...", "Proactively suggest...") and voice triggers move to a "## When to invoke" body section so they remain discoverable inside the skill. A per-run registry at scripts/proactive-suggestions.json aggregates the routing/ voice text for all 52 skills so agents can pull guidance on demand without paying for it in the always-loaded catalog. Build flag --catalog-mode=full restores v1.44 legacy behavior (full multi-line descriptions in frontmatter). Default is trim. splitCatalogDescription() extracts: lead sentence, routing paragraphs, voice-triggers line, (gstack) tag presence. Short descriptions (<120 chars, already trimmed) are skipped via a guard so re-runs are idempotent. Measured impact (vs v1.44.1 baseline): - Catalog tokens (sum of description bytes / 4): 9,319 → 4,045 (-56.6%) - Total SKILL.md corpus bytes: 2,915 KB → 2,880 KB (-1.2%) - Routing prose preserved as in-skill "## When to invoke" sections - 52 skill entries in scripts/proactive-suggestions.json (on-demand registry) The corpus drop is small because catalog trim MOVES text from frontmatter to body, it doesn't delete it. The headline win is the catalog: the always-loaded system prompt surface drops by more than half. Test plan: - bun test test/gen-skill-docs.test.ts: 389 pass, 0 fail - Manual: ship/SKILL.md frontmatter description is now ONE line ending with `(gstack)`; allowed-tools field on next line (YAML well-formed) - Manual: scripts/proactive-suggestions.json contains 52 entries - bun run gen:skill-docs --catalog-mode=full restores legacy behavior 53 files changed (52 SKILL.md across hosts + the new proactive-suggestions.json). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(budget): T5 — hard token budgets + override audit trail (Phase A.6) Two new gate-tier guardrails for the v1.45.0.0 compression baseline: 1. test/skill-size-budget.test.ts (NEW) — per-skill SKILL.md size budget. Compares current state to test/fixtures/parity-baseline-v1.44.1.json. Three checks: per-skill (×1.05 default ratio), total corpus, and catalog token estimate (≤7000 for v1.45). The per-skill ratio is 1.05 not 1.0 because the T4 catalog trim moves text from frontmatter to a body section; small skills see a tiny body growth that's fine when offset by the much larger catalog-token win. 2. test/skill-budget-regression.test.ts EXTENDED — hard dollar cap on per-run eval cost. Per-tier defaults: gate $25, periodic $70. Umbrella EVALS_BUDGET_HARD_CAP=$30. Catches runaway eval costs (infinite retry, model price changes) before they amortize across PRs. Both checks support an override path with audit trail: GSTACK_SIZE_BUDGET_OVERRIDE_REASON="why this is OK" — size EVALS_BUDGET_OVERRIDE_REASON="why this is OK" — cost Overrides log to ~/.gstack/analytics/spend-overrides.jsonl with timestamp + scope + reason + CI provenance (runner, branch, commit) via test/helpers/budget-override.ts. Why the override audit: a hard cap with no escape valve becomes operationally hostile (legit price changes, longer transcripts, new required evals can all blow the cap). An override with no audit becomes "everyone overrides everything and the gate is theater." This module ships the audit half so reviewers can see what was waived and why. Codex 2nd-pass critique #3 absorbed: per-suite caps + override path with auditability + budget baselines checked into repo (parity-baseline-v1.44.1.json already in test/fixtures/). Test plan: - bun test test/skill-size-budget.test.ts: 4 pass (per-skill, corpus, catalog, baseline-exists) - bun test test/skill-budget-regression.test.ts: 4 pass (2 existing ratio checks + 2 new hard-cap checks) - Existing eval runs ($14.11 e2e, $0.02 llm-judge) sit well under the new caps Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(cso): T6 — pin must-preserve security phrases (Phase A.5) cso/SKILL.md is a content-heavy security audit skill (75 KB after T3+T4). Codex 2nd-pass critique #9: "cso exemption too broad ... should still get resolver dedup, catalog trim, sectioning if safe, and targeted evals around must-not-miss checks." T3 (jargon dedup) and T4 (catalog trim) already applied to cso the same way they applied to every other skill — confirmed by inspection: - jargon list NOT inlined (0 inline term lines) - catalog description trimmed to one line (74 bytes vs 774 bytes baseline) - "## When to invoke" body section present T6 work: lock in the security-prose preservation via a gate-tier test that fails CI if future compression strips load-bearing phrases: - OWASP, STRIDE positioning - daily / comprehensive mode discipline - confidence scoring language - active verification ("verif" prefix catches verify/verified/verification) - ## Preamble heading (preamble resolver still fires) Also guards cso against accidental over-stripping: SKILL.md must stay ≥30 KB (currently 75 KB) — a sudden cliff would mean compression went past the targeted-dedup line into structural removal. No structural change to cso. Future Phase B sections/ work for cso requires writing baseline parity tests FIRST per the v2_PLAN.md sequencing. Test plan: - bun test test/cso-preserved.test.ts: 5 pass Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(parity): T0b — cathedral parity-suite harness + invariant registry Adds the harness that the v2_PLAN.md cathedral parity-eval suite is built on. Compares CURRENT SKILL.md output to v1.44.1 baseline along three axes: STRUCTURE frontmatter shape (catalog trim landed, "## When to invoke" present) CONTENT must-preserve phrases per skill family (cso: OWASP/STRIDE; plan-ceo: SCOPE EXPANSION/HOLD SCOPE/REDUCTION; ship: VERSION/CHANGELOG/PR; etc.) SIZE per-skill byte budget (maxSizeRatio + minBytes guards) PARITY_INVARIANTS registry pins 10 load-bearing skills (cso, ship, plan-*- review, review, qa, investigate, office-hours, autoplan). Each entry declares what must NOT regress; future compression that strips these phrases or shrinks a skill past its minBytes cliff fails CI. Periodic-tier LLM-judge parity (paid, ~$0.20/skill) lands in v2.0.0.0 sections/ phase. Same registry, same harness, judge added on top. Test plan: - bun test test/parity-suite.test.ts: 10/10 invariants pass vs v1.44.1 - Per-skill failures get actionable per-line breakdown so a reviewer can see which phrase / heading / size limit went sideways Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(coverage): T1 — skill coverage matrix + structural-compliance floor Phase 0 deliverable — eval-first foundation. Two new test files plus the registry: 1. test/skill-coverage-matrix.ts — single source of truth mapping each skill to its gate-tier + periodic-tier test files. SKILL_COVERAGE record with 51 entries; every gstack skill on disk has at least one gate-tier entry. 2. test/skill-coverage-matrix.test.ts — CI gate. Asserts every skill on disk has a registry entry AND that gate[] is non-empty. Catches "skill added but eval not registered" the moment a new SKILL.md lands. 3. test/skill-coverage-floor.test.ts — per-skill structural compliance (FREE, file-IO only). For each of 51 skills, verifies: - SKILL.md exists - Frontmatter well-formed (name + description fields) - Catalog-trim contract (inline description ≤ 250 chars, or block form) - Generated header present (edit .tmpl, not .md) - Body ≥ 200 bytes (non-trivial content) - No unresolved {{TEMPLATE}} placeholders leaked The "floor" is the minimum eval that every skill ships with. Skills that need deeper behavioral testing get additional entries in their coverage record (e.g., ship has skill-e2e-ship-idempotency + workflow + floor). Future skills only need to add the floor entry and the matrix gate unblocks them. Codex 2nd-pass critique #1 mitigation: eval-first floor is structural compliance (the testable part) — judgment-skill behavior gets layered periodic-tier evals on top. We don't pretend the floor proves correctness, only that the skill structurally compiles. Test plan: - bun test test/skill-coverage-matrix.test.ts: 4 pass (matrix shape + coverage) - bun test test/skill-coverage-floor.test.ts: 309 pass (6 checks × 51 skills + 3 registry-level) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * build(skills): T7 — atomic regenerate + capture v1.45.0.0 baseline Final regen pass across all hosts after T1-T6 work landed. Captures the v1.45.0.0 parity baseline at test/fixtures/parity-baseline-v1.45.0.0.json for diffing against the v1.44.1 reference. Measured deltas (real numbers from test/helpers/capture-parity-baseline.ts): Total SKILL.md corpus 2,847 KB → 2,813 KB (-1.2%) Catalog tokens (always-loaded) ~9,319 → ~4,045 tokens (-56.6%) Top 10 heaviest skills 0.5-1.0% drop each The catalog token cut is the headline. It's the always-loaded surface, i.e. tokens charged on every session start. Per-skill SKILL.md sizes barely moved because T4 catalog trim MOVES routing prose from frontmatter to a body "## When to invoke" section rather than deleting it — the catalog wins without amputating discoverability. The bigger per-skill compression lands in v2.0.0.0 (Phase B sections/ pattern on the 5 heavyweights). v1.45 is the foundation: eval-first infrastructure + cheap wins. scripts/proactive-suggestions.json regenerated with the latest 52 skills listed (one-time write per gen-skill-docs run; aggregated catalog parts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * v1.45.0.0 — gstack v2 foundation: catalog tokens drop 56%, eval-first floor Bumps VERSION + package.json to 1.45.0.0. CHANGELOG entry covers what shipped between v1.44.1 and this release: the cathedral parity-eval foundation, conditional resolver injection plumbing, jargon dedup, terse build flag, catalog trim with one-line frontmatter descriptions, hard token + dollar budget gates with override audit, cso preservation pins, and the v1.44.1 ↔ v1.45.0.0 parity baselines committed to test/fixtures/. Numbers (measured, not estimated): - Catalog tokens: ~9,319 → ~4,045 (-56.6%) - Total corpus: 2,847 KB → 2,813 KB (-1.2%) - Skills with gate-tier eval coverage: 32/51 → 51/51 (floor achieved) This is the foundation release. v2.0.0.0 will ship the architectural break (sections/*.md.tmpl pattern + mechanical Read enforcement + eval-coverage annotations) as a coordinated marketing-grade launch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(catalog): refresh proactive-suggestions.json timestamp after v1.45 bump The generated_at field updates on every gen-skill-docs run; this is the T7 atomic-regenerate output landed alongside the v1.45.0.0 bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(catalog): deterministic proactive-suggestions.json (no per-run timestamp) Original implementation wrote a generated_at timestamp on every gen-skill-docs run. That made CI dry-run freshness checks flap because the file changed on every regeneration even when the actual content (skill descriptions, routing prose, voice triggers) was unchanged. Two fixes: 1. Drop the generated_at field. The file is purely a content registry now. 2. Only write the file when serialized content actually differs from disk. Reproducible test: bun run gen:skill-docs twice in a row now leaves scripts/proactive-suggestions.json unchanged on the second run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(catalog): preserve routing prose when first sentence exceeds 200 chars splitCatalogDescription truncated the lead BEFORE computing routing extraction, which meant skills whose first sentence was over 200 chars (design-consultation: 207 chars) had their entire routing prose silently dropped — the "## When to invoke" body section came out empty. Root cause: routing was extracted via `collapsed.indexOf(lead)` after lead was suffixed with "...". The "..." never appeared in the original string, so indexOf returned -1 and routingProse fell back to empty. Fix: compute routing from sentenceLead (the untruncated first sentence) BEFORE truncating the displayed lead. The displayed lead still gets "..." when over 200 chars, but the routing extraction uses the real boundary. Also: refresh golden snapshots for claude/codex/factory ship and update two unit tests that asserted v1.44 behavior: - skill-validation.test.ts: trigger-phrase + proactive-routing tests now search whole content, not just frontmatter (T4 moved them to a body "## When to invoke" section) - writing-style-resolver.test.ts: jargon-list assertion now expects the T3 reference pointer, not the inline list Test plan: - bun test test/skill-validation.test.ts test/writing-style-resolver.test.ts test/host-config.test.ts test/skill-size-budget.test.ts test/parity-suite.test.ts test/skill-coverage-matrix.test.ts test/skill-coverage-floor.test.ts test/cso-preserved.test.ts test/resolver-entry.test.ts test/helpers/capture-parity-baseline.test.ts test/gen-skill-docs.test.ts: 1134 pass, 0 fail - Manual verify: design-consultation/SKILL.md "## When to invoke this skill" body section now contains "Use when asked to..." + "Proactively suggest..." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(catalog): deterministic proactive-suggestions.json across machines CI check-freshness failed because scripts/proactive-suggestions.json serialized differently on local vs CI: 1. Root-skill key leaked the directory name. processTemplate's outer loop computed `dir = path.basename(path.dirname(tmplPath))`. For the root SKILL.md.tmpl at ROOT/SKILL.md.tmpl, that returns the repo-checkout directory name — "seville-v3" in a Conductor worktree, "gstack" on GitHub Actions, anything-else for a fork. Fix: detect root via `path.dirname(tmplPath) === ROOT` and hardcode the key to "gstack" for that one case. 2. Aggregate key order was filesystem-iteration order. discoverTemplates doesn't guarantee stable ordering across platforms, so the JSON `skills` object came out shuffled between machines. Fix: sort Object.keys(proactiveAggregate) alphabetically before serializing. After the fix, the generated file is identical on every machine and matches what's committed. CI freshness check (bun run gen:skill-docs && git diff --exit-code) now passes. Test plan: - bun run gen:skill-docs && bun run gen:skill-docs --dry-run: all FRESH - node -e 'verify keys sorted': sorted match: true - grep -c '"seville-v3"' scripts/proactive-suggestions.json: 0 - Focused test suite: 704 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(catalog): unit + regression coverage for catalog-trim helpers Four exported functions in scripts/gen-skill-docs.ts handle every skill's frontmatter rewrite at gen time but had zero unit tests. Both real bugs we shipped (and fixed) on this branch lived in these functions: v1.45.0.0 design-consultation: when the first sentence exceeded 200 chars, routing-prose extraction lost the entire tail (anchored on truncated lead with "..." that didn't substring-match the original). v1.45.0.0 CI freshness: root-skill key leaked the checkout directory name ("seville-v3" vs "gstack") and aggregate order was filesystem- iteration order. Both shapes are now regression-tested: - splitCatalogDescription: 7 tests covering simple multi-line, >200-char first sentence (design-consultation regression), voice-trigger extraction, no-(gstack) handling, embedded periods (documents known fallback), no-period fragments, and idempotency. - buildTrimmedDescription: 3 tests. - buildWhenToInvokeSection: 3 tests. - applyCatalogTrim: 4 tests covering the standard rewrite, no-op for already-short descriptions, the YAML-collision newline fix, and the malformed-frontmatter null return. - proactive-suggestions.json determinism: 3 tests asserting sorted keys, root keyed as "gstack" (not the worktree directory), and no timestamp/generated_at field that would flap CI freshness. Test plan: - bun test test/catalog-trim.test.ts: 20 pass, 0 fail Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(coverage): fill three remaining v1.46.0.0 test gaps Three untested surfaces from the v1.46.0.0 work. All three would have caught real bugs we shipped (and fixed) on this branch. 1. test/helpers/budget-override.test.ts — 7 tests pin the audit-trail contract for EVALS_BUDGET_OVERRIDE_REASON and GSTACK_SIZE_BUDGET_OVERRIDE_REASON. Without this, the audit logger could silently drop events and overrides become invisible. Tests cover: required fields per JSONL line, CI provenance capture (CI/GITHUB_ACTIONS/branch/commit), local-runner defaults, append-only behavior, missing-directory recovery, and unwritable- path resilience (logs warning instead of throwing). 2. test/terse-build.test.ts — 16 tests pin --explain-level=terse behavior across the 4 gated resolvers and the composed preamble. Default vs terse vs undefined-ctx all asserted. Without this, a refactor that breaks the explainLevel threading silently regresses the opt-in compression path; the runtime EXPLAIN_LEVEL: terse gate still works so users wouldn't notice. Tier-1 invariant pinned (terse-only-affects-tier-2+). 3. test/gen-skill-docs-idempotency.test.ts — 2 tests catch the class of bug behind the v1.45.0.0 timestamp flap. Two consecutive gen-skill-docs runs must produce byte-identical outputs across STABLE_OUTPUTS (proactive-suggestions.json, SKILL.md, ship/SKILL.md, plan-ceo-review/SKILL.md, office-hours/SKILL.md, gstack/llms.txt). --dry-run reports zero stale files after a fresh gen. CI freshness regressions surface as test failures BEFORE a PR is opened. Test plan: - bun test test/helpers/budget-override.test.ts: 7 pass - bun test test/terse-build.test.ts: 16 pass - bun test test/gen-skill-docs-idempotency.test.ts: 2 pass - Full focused suite (15 test files): 1179 pass, 0 fail (+45 new tests vs the pre-fill baseline of 1134) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(coverage): close 5 remaining v1.46.0.0 test gaps (A-E) Five behaviors that v1.46 ships but had no test coverage. All now pinned. A) --host all idempotency (test/gen-skill-docs-idempotency.test.ts) The default test ran Claude host only. Non-Claude hosts (Codex, Factory, Cursor, OpenClaw, GBrain, Slate, OpenCode, Hermes, Kiro) each have their own output paths and could carry their own non-deterministic fields. We hit a "--host all needed for freshness check" mid-/ship. Now: two consecutive `bun run gen:skill-docs --host all` runs must produce byte-identical outputs across a per-host sample (.agents/, .cursor/, .factory/, .gbrain/). Catches per-host adapter regressions before CI. B) --catalog-mode=full opt-out (test/catalog-mode-full.test.ts) The legacy escape hatch had zero tests. 6 new tests across two layers: static (CATALOG_MODE_ARG parsed; conditional gate present; default is "trim"; invalid value throws) + smoke (actual --catalog-mode=full run produces a multi-line `description: |` block + omits "## When to invoke" body section; mutates the working tree then restores in a finally block). C) parity-baseline-v1.44.1.json integrity (test/parity-baseline-integrity.test.ts) The baseline is the source of every v1→v2 number cited in the CHANGELOG v1.46.0.0 entry. Anyone could edit it without test failure until now. 8 new tests pin: existence, tag, capturedFromCommit allowlist, expected v1.44 numbers (51 skills, ~2,915 KB, ~9,319 catalog tokens), CHANGELOG references this file by path, per-skill shape, and a SHA256 byte-stability hash. Any edit fails with a clear "if intentional, update EXPECTED_HASH AND the CHANGELOG numbers" signal. D) Live appliesTo gate end-to-end (test/resolver-entry.test.ts extended) The unwrapResolver unit tests covered the function; the gen-skill-docs.ts substitution loop that USES the gate had no integration coverage. 6 new tests simulate the exact 4-line shape from gen-skill-docs.ts:457-467 against synthetic registries: plain-function fires unconditionally, gated fires when true / empty-string when false, mixed registries compose, parameterized resolvers respect gates, unknown resolvers throw. E) Per-skill min-size floor (test/skill-size-budget.test.ts extended) The existing 200-byte body coverage-floor is a noise floor — a skill that lost 99.75% of content still passes. 1 new test asserts every skill stays ≥80% of its v1.44.1 baseline size (the parity-suite content invariants only covered 10 of 51 skills; the remaining 41 were uncovered). SECTIONS_EXTRACTED hook in place for v2.0.0.0 when the sections/ pattern legitimately shrinks ship/plan-ceo/etc. past the floor. Test plan: - bun test focused 17-file suite: 1202 pass, 0 fail (+23 new tests vs the pre-fill 1179 baseline) - catalog-mode=full mutates working tree then restores cleanly - --host all idempotency runs two full gen passes in <1s on this machine Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
anbangr
pushed a commit
that referenced
this pull request
Jun 1, 2026
…always-loaded) (garrytan#1806) * feat(test): transcript-section-logger + ship-action fingerprint (T10) Pure-analysis module over a SkillTestResult/NDJSON transcript: - extractSectionReads(): which sections/*.md a run opened (post-carve check) - extractShipActions(): observable action fingerprint (merge/test/bump/ changelog/commit/push/pr) that works on the MONOLITH too, so a baseline captured before the carve can detect a sectioned-ship regression - baseline read/write + compareShipActions() for baseline-first dogf(T10) Baseline-first answers the Codex outside-voice critique that a logger in the same PR as the carve is post-failure telemetry without a pre-carve reference. 11 unit tests, all green. Paid monolith baseline capture runs separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(pipeline): section discovery + generation machinery (T9) - discover-skills.ts: discoverSectionTemplates() scans <skill>/sections/*.md.tmpl - gen-skill-docs.ts: extract resolvePlaceholders + applyHostRewrites + buildContext as shared helpers (processTemplate and the new processSectionTemplate both call them, so a sanitization/rewrite fix can't miss sections) [C1] - processSectionTemplate: body-fragment generation (no frontmatter/catalog/voice), parent-skill TemplateContext (skillName pinned to parent, not 'sections', so appliesTo gating + tier behave identically), per-host output routing - --host all now fails the build on ANY host failure, not just claude, so a stale external-host output can't slip the freshness gate [Codex outside-voice #9] Inert until a skill is carved (no sections/ dirs exist yet). Refactor is output-neutral: gen:skill-docs --dry-run --host all reports 0 STALE. 5 discovery unit tests + 389 gen-skill-docs tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(setup): install sections/ for cherry-pick targets (claude + kiro) (T9) Two install targets cherry-pick SKILL.md and would leave a carved skill's sections/ behind, 404ing a runtime 'Read sections/<name>.md': - link_claude_skill_dirs: link the sections/ subdir via _link_or_copy (windows gets a fresh copy on every ./setup) - kiro per-skill loop: sed-rewrite + copy each sections/* so paths resolve under ~/.kiro, not ~/.codex/~/.claude codex/factory/opencode link the whole generated dir, so sections ride free. Addresses Codex outside-voice #4/#6 (runtime pathing landmine). Inert until a skill is carved. Static-tripwire test + windows-fallback invariant green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(ship): gstack-version-bump CLI — tested idempotency classify + write (T9) Hybrid CLI extraction (CM1): the deterministic core of ship Step 12 becomes a tested CLI instead of bash prose the agent re-derives each run. - classify: FRESH/ALREADY_BUMPED/DRIFT_STALE_PKG/DRIFT_UNEXPECTED from VERSION vs origin/<base>:VERSION vs package.json.version (pure reader) - write: validated dual-write to VERSION + package.json (FRESH bump) - repair: DRIFT_STALE_PKG sync, no re-bump Bump-LEVEL choice + queue collision stay agent judgment; slot pick stays bin/gstack-next-version. This removes the re-bump-a-shipped-branch footgun from skippable prose into code that can't be skipped or misread. 15 tests (exhaustive state matrix + write/repair fs + real-git classify). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(parity): sectioned-skill parity capability — guards the carve (T9) Carved skills (skeleton + sections/*.md) need parity checks that see relocated content, or moving a phrase into a section reads as 'lost': - readSkillForParity(): union skeleton + all sections/*.md - checkSkillParity sectioned mode: content checks against the union; minBytes/ maxSizeRatio against union bytes (total behavior preserved); maxSkeletonBytes asserts the always-loaded skeleton actually shrank. Lowering minBytes to fit a small skeleton would otherwise make the size floor toothless [Codex #12]. Built + tested BEFORE the carve so ship's invariant can flip to sectioned in the same commit it lands. Monolith path byte-identical (verified: pre-existing investigate 1.053 ratio drift fails the same with this change stashed). 7 sectioned-parity tests + existing parity tests green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * refactor(ship): carve into skeleton + on-demand sections (Claude) (T9) ship/SKILL.md drops 167KB → 68.7KB (~59% of the always-loaded skill) by moving 8 prose-heavy steps into ship/sections/*.md, read on demand: tests, test-coverage, plan-completion, review-army, greptile, adversarial, changelog, pr-body. Step 12's version logic now calls the tested gstack-version-bump CLI instead of inline bash. Claude-first (S2): {{SECTION:id}} emits a STOP-Read pointer on Claude (skeleton + generated section files) and INLINES the content on every other host, so external hosts keep the full monolith — verified factory at 162KB with no sections dir. {{SECTION_INDEX:ship}} renders the situation→section table from the PASSIVE manifest (CM2 / v2_PLAN.md:663); required-reads live only in test fixtures. Multi-pass resolve expands inlined sections' own resolvers. Parity: ship invariant flipped to sectioned (union content checks + maxSkeletonBytes asserts the shrink). Carve-fallout fixed across gen-skill-docs/skill-validation/ golden/plan-completion/garrytan#1539/size-budget tests via skeleton+sections union reads. Free suite green except the pre-existing investigate parity drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): manifest-consistency + context-parity + requiredReads helper (T9) Free deterministic guards for the carve: - required-reads.ts + unit test: assertRequiredReads(run, requiredFiles) — the mechanical layer-5 check that the agent Read the sections its situation needs (required set comes from the fixture, not the passive manifest) - section-manifest-consistency: 3-tier orphan classification (generated orphan + hand-edited generated file → FAIL; manifest orphan → WARN per v2_PLAN.md) and pins the PASSIVE-manifest contract (no applies_when/required_for) - template-context-parity: generated sections have zero unresolved placeholders and gated resolvers (ADVERSARIAL_STEP/CONFIDENCE_CALIBRATION/CHANGELOG_WORKFLOW) rendered — proving sections resolve with the parent skillName, not 'sections' 16 tests, all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): section-loading E2E + idempotency CLI detection (T9) - skill-e2e-ship-section-loading.test.ts (new, periodic): runs real /ship in plan mode against a fresh version-changing fixture and asserts the agent Read the required sections (review-army + changelog). Runs against the INSTALLED skill (~/.claude/skills/gstack/ship), not repo paths, so install-layout 404s surface [Codex outside-voice #5]. Layer-5 mechanical guard against silent section-skip. - skill-e2e-ship-idempotency.test.ts: detection updated for the carve — Step 12 now runs gstack-version-bump classify (JSON "state":"ALREADY_BUMPED") instead of the inline bash echo (STATE: ALREADY_BUMPED). Accept both; add a gstack-version-bump-write re-bump regression signal. - touchfiles: register ship-section-loading (periodic) + extend idempotency deps with bin/gstack-version-bump + scripts/resolvers/sections.ts. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(ship): union-read redaction wiring test for the carve (T9) main's PR-body redaction-at-sink lives in sections/pr-body.md.tmpl after the carve, not the skeleton template. Read skeleton + section templates union so the redaction-wiring assertions follow the relocated content. 9/9 green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * v1.54.0.0 feat: carve /ship into skeleton + on-demand sections (-59% always-loaded) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
/builddefaults to call local skill names directly:/qa,/ship, and/land-and-deploy.reviewSecondary.commandunset.review/qacommands still fail when missing, while disabled secondary review is reported as skipped.VERSION/package.jsonto1.26.4.0.Test Coverage
Changed behavior is covered by role-config defaults tests and review gate planning tests. No new uncovered runtime path identified.
Pre-Landing Review
No issues found. The change is scoped to config, orchestration gate planning, tests, generated docs, and release metadata.
Design Review
No frontend files changed — design review skipped.
Eval Results
No prompt/eval source files changed that require paid eval suites.
Scope Drift
No drift found. The PR implements the requested Codex-dominant build command plan and leaves the separate dual-impl/judge-provider incompatibility for later work.
Plan Completion
All requested plan items completed:
reviewSecondaryis optional when commandless and emits a skipped report line.Verification Results
jq . build/configure.cmbun test build/orchestrator/__tests__/role-config.test.tsbun test build/orchestrator/__tests__/sub-agents.test.ts build/orchestrator/__tests__/cli.test.tsbun test test/host-config.test.tsbun test test/gbrain-supabase-provision.test.tsbun run skill:checkbun run buildgit diff --checkbun testDocumentation
Generated skill docs and golden fixtures were refreshed. A separate
/document-releasesubagent was not run because this Codex session only permits subagents when explicitly requested.TODOs
No TODO items completed in this PR.
Test plan
See Verification Results.