
fix(build): droppedPhasesCount + --print-only exit code for flat-task plans #31

Merged
anbangr merged 38 commits into main from fix/step-transition-clean
May 11, 2026
@anbangr anbangr commented May 11, 2026

Owner

Summary

  • droppedPhasesCount in ParseResult: exposes count of phases found but dropped due to missing labeled checkboxes (**Implementation**, **Review & QA**). Previously these silently vanished into 0 executable phases with no actionable error.
  • --print-only exit code fix: gstack-build plan.md --print-only now exits 2 when 0 phases are parsed (was incorrectly exiting 0, masking malformed plans from scripts).
  • Better error message: when droppedPhasesCount > 0, CLI prints a diagnostic hint showing the expected labeled-marker format vs. the flat-task format that caused the failure.
  • Regression test in parser.test.ts: structural-mirroring flat-task format emits 0 phases, droppedPhasesCount=3, 7 warnings.
  • CLI integration tests in integration.test.ts: --print-only exits 2 on malformed plan; stderr contains droppedPhasesCount hint.
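
The exit-code and hint behaviour listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `printOnlyExitCode` and `formatDroppedHint` are hypothetical names, `ParseResult` is reduced to the two fields mentioned in the summary, and both the hint wording and the exact labeled-checkbox syntax are invented for the example.

```typescript
// Hypothetical, reduced shape: only the fields this PR description mentions.
interface ParseResult {
  phases: unknown[];
  droppedPhasesCount: number;
}

// Exit 2 when nothing executable was parsed, 0 otherwise (assumed mapping).
function printOnlyExitCode(result: ParseResult): number {
  return result.phases.length === 0 ? 2 : 0;
}

// Illustrative hint text; the real CLI message may be worded differently.
function formatDroppedHint(result: ParseResult): string | null {
  if (result.droppedPhasesCount === 0) return null;
  return [
    `droppedPhasesCount=${result.droppedPhasesCount}: phases were found but dropped.`,
    "Expected labeled markers, for example:",
    "  - [ ] **Implementation**: ...",
    "  - [ ] **Review & QA**: ...",
    "Found flat tasks, for example:",
    "  - [ ] Write failing E2E test...",
  ].join("\n");
}
```

Keeping the exit code a pure function of the parse result makes the "exits 2 on malformed plan" case easy to assert without spawning the CLI.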

Root cause: AI synthesizers sometimes use "structural mirroring" — 3 separate ### Phase headings named after TDD steps (Test Specification, Implementation, Review & QA) with flat - [ ] Write failing E2E test... checkboxes. The parser only emits a phase when both implementationCheckboxLine and reviewCheckboxLine are non-null; flat checkboxes don't match the labeled-marker regexes, so all phases are dropped and gstack-build exits 2 with a confusing "no executable phases found" message.
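
A stripped-down sketch of the root cause may help. It assumes a much simpler parser than the real `parser.ts`: the regexes, `SketchPhase`, and `parsePlanSketch` are hypothetical, and only the rule "emit a phase when both labeled checkbox lines are non-null, otherwise count it as dropped" is taken from the description above.

```typescript
// Hypothetical patterns; the real labeled-marker regexes may differ.
const PHASE_HEADING = /^###\s+Phase\b/;
const IMPL_CHECKBOX = /^\s*-\s*\[[ xX]\]\s*\*\*Implementation\*\*/;
const REVIEW_CHECKBOX = /^\s*-\s*\[[ xX]\]\s*\*\*Review & QA\*\*/;

interface SketchPhase {
  headingLine: number;
  implementationCheckboxLine: number | null;
  reviewCheckboxLine: number | null;
}

interface SketchResult {
  phases: SketchPhase[];
  droppedPhasesCount: number;
}

function parsePlanSketch(markdown: string): SketchResult {
  const seen: SketchPhase[] = [];
  markdown.split("\n").forEach((line, index) => {
    if (PHASE_HEADING.test(line)) {
      seen.push({ headingLine: index, implementationCheckboxLine: null, reviewCheckboxLine: null });
      return;
    }
    const current = seen[seen.length - 1];
    if (current === undefined) return; // text before the first phase heading
    if (IMPL_CHECKBOX.test(line)) current.implementationCheckboxLine = index;
    else if (REVIEW_CHECKBOX.test(line)) current.reviewCheckboxLine = index;
  });
  // A phase is emitted only when both labeled checkboxes were found.
  const phases = seen.filter(
    (p) => p.implementationCheckboxLine !== null && p.reviewCheckboxLine !== null,
  );
  return { phases, droppedPhasesCount: seen.length - phases.length };
}
```

Feeding this sketch three `### Phase` headings that each contain only flat `- [ ]` tasks yields zero phases and a dropped count of three, which mirrors the regression case in the summary.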

Test plan

  • bun test build/orchestrator/__tests__/parser.test.ts: flat-task regression test passes (0 phases, droppedPhasesCount=3, 7 warnings)
  • bun test build/orchestrator/__tests__/integration.test.ts: --print-only exits 2 on malformed plan, stderr contains hint
  • bun test build/orchestrator/__tests__/: 825 pass, 1 pre-existing branch-sensitive failure
  • bin/gstack-build --help exits 0

🤖 Generated with Claude Code

anbangr and others added 30 commits May 11, 2026 22:27
New capability (non-code phase kinds, step-transition guardrails) and
concurrent-build fix warrant a MINOR bump per fork versioning convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Regenerate build/SKILL.md from template after step-transition and
  content-review gate changes
- Add --no-plan-review to integration test invocations that use
  --skip-ship and --skip-clean-check to prevent LLM calls in CI

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lidator

Captures the follow-on work from the step-transition / living plan parse
investigation: detect structural-mirroring flat-task format before agents
are spawned, rather than after parsing returns 0 executable phases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rdering

Address two findings from /review:
- Add stderr assertion to --print-only exit-2 integration test so a future
  unrelated exit-2 can't pass silently (T-1)
- Add comment on ensureFeature() call after finalize() in parser.ts explaining
  the ordering: idempotent for emitted phases, load-bearing for dropped phases (M-6)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- planSynthesizer prompt now requires a `#### Test Spec` section for
  every phase: coverage target (≥80%), scenario table (ID/Scenario/
  Given/When/Then), and explicit edge cases list. Test-writer becomes a
  pure implementor — receives a spec as quality floor, MAY add cases.

- `extractCoverageTarget(phaseBody)` parses `**Coverage target: ≥N%**`
  from phase body (defaults 80 when absent — backward compatible).

- `buildGeminiTestSpecPrompt` is now spec-aware: detects `#### Test Spec`
  in phase.body and switches from generic "write failing tests" to
  "implement ALL listed cases as minimum requirement" instructions.

- `parseCoveragePercent(stdout, testCmd)` parses coverage % from test
  runner stdout for Jest/Vitest, Bun, pytest, and Go; returns null for
  unknown frameworks (advisory-only).

- `PhaseState.coverageResult?: { actual, target }` field added to types.

- `PLAN_REVIEW_PROMPT` gains criterion 6 (TEST SPEC QUALITY): CRITICAL
  for inconsistent specs across phases, IMPORTANT for all-missing (legacy
  plans), SUGGESTION for missing coverage target line.

- Test suite: 12 new tests for extractCoverageTarget + spec-aware
  buildGeminiTestSpecPrompt in cli.test.ts; 12 new tests for
  parseCoveragePercent in sub-agents.test.ts. Version assertions and
  coverage-matrix ownership map updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
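
As a rough illustration of the `parseCoveragePercent(stdout, testCmd)` contract described in this commit, here is a sketch that covers three of the listed runners. It is not the committed implementation: the function name carries a `Sketch` suffix, the framework detection by command substring is an assumption, and the output patterns are assumptions about each runner's default text report. Bun is left out because its report layout is not described here.

```typescript
// Sketch only: the patterns below are assumed default text-report formats.
function parseCoveragePercentSketch(stdout: string, testCmd: string): number | null {
  let match: RegExpMatchArray | null = null;
  if (/\bgo test\b/.test(testCmd)) {
    // e.g. "coverage: 85.7% of statements"
    match = stdout.match(/coverage:\s*([\d.]+)% of statements/);
  } else if (/\bpytest\b/.test(testCmd)) {
    // e.g. "TOTAL      120     18    85%"
    match = stdout.match(/^TOTAL\s.*?(\d+(?:\.\d+)?)%\s*$/m);
  } else if (/\b(jest|vitest)\b/.test(testCmd)) {
    // e.g. "All files |   91.3 |   80 |   100 |   91.3 |" (first column: statements)
    match = stdout.match(/^\s*All files\s*\|\s*([\d.]+)\s*\|/m);
  }
  if (match === null) return null; // unknown framework or no report found: advisory only
  const value = Number.parseFloat(match[1] ?? "");
  return Number.isFinite(value) ? value : null;
}
```

Returning `null` rather than throwing keeps the result advisory, matching the "returns null for unknown frameworks" behaviour stated above.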
Add detectSkillFaults() with coverage for:
- CODEX_CONVERGENCE (iterations >= DEFAULT_MAX_CODEX_ITERATIONS)
- TEST_FIXER_LOOP (iterations >= DEFAULT_MAX_TEST_ITERATIONS)
- PREMATURE_COMPLETION (checked tasks for non-committed phases)
- PLAN_SYNTHESIS_INVALID (missing Origin trace: or Acceptance:)
- WORKTREE_LEAK (completed=true but worktree still exists)
- RED_SPEC_TRIVIAL (trivially-passing tests)
- PLAN_MUTATOR_MISMATCH (plan mutation failures)
- PLAN_REVIEW_STALEMATE (round>=3 with CRITICAL objections)
- FEATURE_VERIFIER_SCOPE (VERIFICATION: GAPS in stdout)

All detectors are wrapped in try/catch so bad inputs never throw.
Analytics are appended to GSTACK_HOME/analytics/skill-faults.jsonl
only when faults exist, and analytics failures are swallowed.
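
The "wrapped in try/catch so bad inputs never throw" pattern can be sketched as below. All names and shapes here are hypothetical (`SkillFaultSketch`, `Detector`, `runDetectors`, `toAnalyticsLine`); the real `detectSkillFaults()` and its `SkillFault` type are not shown in this PR description.

```typescript
// Hypothetical fault shape for illustration only.
interface SkillFaultSketch {
  category: string;
  detail: string;
}

type Detector<S> = (state: S) => SkillFaultSketch | null;

// Run every detector; a throwing detector counts as "no fault" instead of crashing the caller.
function runDetectors<S>(state: S, detectors: Detector<S>[]): SkillFaultSketch[] {
  const faults: SkillFaultSketch[] = [];
  for (const detect of detectors) {
    try {
      const fault = detect(state);
      if (fault !== null) faults.push(fault);
    } catch {
      // Bad input for one detector must not hide results from the others.
    }
  }
  return faults;
}

// Produce one JSONL line only when there is something to report.
function toAnalyticsLine(faults: SkillFaultSketch[]): string | null {
  return faults.length === 0 ? null : JSON.stringify({ faults }) + "\n";
}
```

Isolating each detector in its own `try` block means one malformed state field disables only the detector that reads it.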
…rEvaluation wiring

Red-phase test spec for Phase 2.1. Tests cover:
- SKILL_FAULT_DETECTED absent from MonitorEventName/MONITOR_EXIT_CODES (guard)
- MonitorEvaluation.skillFaultEvents field exists and is always an array
- evaluateMonitorOnce populates skillFaultEvents from detectSkillFaults
- each SkillFaultDetectedEvent has required shape fields + event: 'SKILL_FAULT_DETECTED'
- skillFaultEvents is [] when detector finds no faults or state is null
- monitor exit code is unaffected by skillFaultEvents presence

11 tests fail (Red), 3 guard tests pass. No implementation code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The test file was placed in build/orchestrator/__tests__/ (correct for
test:build-skill) but VERIFY_RED runs bun test test/ which does not scan
that directory. Move to test/skill-monitor-fault.test.ts with adjusted
import paths so the 11 RED tests are discovered and confirmed failing.

Tests: 11 fail, 3 pass (before implementation)
- Add SkillFaultDetectedEvent type in types.ts (imports SkillFault)
- Add skillFaultEvents field to MonitorEvaluation in monitor.ts
- Add stateDir to MonitorRunSnapshot for detectSkillFaults input
- Call detectSkillFaults per snapshot in evaluateMonitorOnce with try/catch
- Print skillFaultEvents as JSON lines before terminal events in monitor mode
- SKILL_FAULT_DETECTED is not a MonitorEventName and has no exit code
- Update Step M3 monitor launch to use set -o pipefail and
  ${PIPESTATUS[0]} while teeing output to monitor-output.log
- Add Step M3.5 that scans monitor output for SKILL_FAULT_DETECTED,
  dedupes by resolved path (readlink), reads fault_investigator_model
  from configure.cm, and dispatches either GSTACK_FAULT_INVESTIGATOR_COMMAND
  or one background agent per non-duplicate fault
- Add validation tests for Step M3.5 content in skill-md.test.ts
- Fix pre-existing hardcoded model name in cli.ts comment
…_RUN_ID

The previous Step M3.5 implementation had a critical silent-failure bug:
1. `sed -n 's/.*file:////p'` is a malformed sed expression (4 slashes = bad
   flag in substitute command). `_FAULT_FILE` was always empty and the
   `[ -z "$_FAULT_FILE" ] && continue` guard silently skipped every fault.
2. The expression also assumed a `file://` URI format that the monitor never
   emits — actual SKILL_FAULT_DETECTED events are JSON lines with a
   `faults[].sourceFiles[]` array (see build/orchestrator/cli.ts:5739-5741
   and build/orchestrator/types.ts:15-23). No investigator would ever spawn.

Switch to jq-based JSON parsing that flattens each event into TSV rows
(runId<TAB>category<TAB>file) and pass FAULT_CATEGORY + FAULT_RUN_ID env
vars to the investigator alongside FAULT_FILE. Dedupe key now includes
(runId, category, resolved-path) so unrelated faults across runs aren't
collapsed. Log filename is suffixed with category to avoid collisions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
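
The commit does this in shell with jq; the same flatten-and-dedupe logic is sketched here in TypeScript for readability. Only the `faults[].sourceFiles[]` array is confirmed by the message above. The `runId` and `category` field names, the `LooseFaultEvent` type, and the `resolvePath` callback (a stand-in for `readlink`) are assumptions.

```typescript
// Loose, assumed event shape; every field is optional so malformed lines are tolerated.
interface LooseFaultEvent {
  runId?: string;
  faults?: { category?: string; sourceFiles?: string[] }[];
}

interface FaultRow {
  runId: string;
  category: string;
  file: string;
}

function tryParseEvent(line: string): LooseFaultEvent | null {
  try {
    return JSON.parse(line) as LooseFaultEvent;
  } catch {
    return null; // not a JSON line
  }
}

// Flatten events to rows and dedupe on (runId, category, resolved path).
function flattenFaultRows(
  monitorOutput: string,
  resolvePath: (file: string) => string = (file) => file,
): FaultRow[] {
  const seen = new Set<string>();
  const rows: FaultRow[] = [];
  for (const line of monitorOutput.split("\n")) {
    if (!line.includes("SKILL_FAULT_DETECTED")) continue;
    const parsed = tryParseEvent(line);
    if (parsed === null) continue;
    const runId = parsed.runId ?? "unknown";
    for (const fault of parsed.faults ?? []) {
      const category = fault.category ?? "unknown";
      for (const file of fault.sourceFiles ?? []) {
        const resolved = resolvePath(file);
        const key = [runId, category, resolved].join("\t");
        if (seen.has(key)) continue;
        seen.add(key);
        rows.push({ runId, category, file: resolved });
      }
    }
  }
  return rows;
}
```

Including the run and category in the key is what keeps unrelated faults across runs from collapsing into one row.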
…atch

Adds test/skill-e2e-build-fault-investigator.test.ts (periodic tier) covering
the fault investigator E2E flow: mock gstack-build outputs SKILL_FAULT_DETECTED
JSON, Step M3.5 dispatches GSTACK_FAULT_INVESTIGATOR_COMMAND with fault env
vars, mock investigator writes report to $FAULT_PRIMARY, assertions verify
report exists with PLAN_SYNTHESIS_INVALID and no source files were edited.

Registers build-fault-investigator-e2e in touchfiles.ts — selected when
build/SKILL.md, skill-fault-detector.ts, or monitor.ts change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Adds mock configure.cm file to prevent jq from failing in Step M3.5 mock
1. plan-selection (6 tests): `defaultActiveRunRegistryDir()` hardcoded
   `~/.gstack/build-state/active-runs` and ignored `GSTACK_BUILD_STATE_DIR`,
   causing 11 real active-run records to leak into unit tests and inflate
   candidate counts (turning expected "selected" into "ambiguous"). Fix: honour
   the env var consistently, the same way `state.ts` already does.

2. integration (3 tests): plan review subprocess called `codex` with
   `OPENAI_API_KEY` from the inherited `process.env`, triggering a real ~30s
   API call against the LLM. These tests exercise feature lifecycle, not plan
   review. Fix: add `--no-plan-review` to each CLI invocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
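
The first fix amounts to preferring the environment override over the hardcoded home path. A minimal sketch follows, assuming the registry lives in an `active-runs` directory directly under the state directory; `activeRunRegistryDirSketch` is a hypothetical name, and the environment and home directory are passed in as parameters so the function stays pure.

```typescript
// Sketch: prefer GSTACK_BUILD_STATE_DIR, fall back to ~/.gstack/build-state.
// The "active-runs" suffix under the override is an assumption.
function activeRunRegistryDirSketch(
  env: Record<string, string | undefined>,
  homeDir: string,
): string {
  const override = env["GSTACK_BUILD_STATE_DIR"];
  const stateDir =
    override !== undefined && override !== "" ? override : `${homeDir}/.gstack/build-state`;
  return `${stateDir}/active-runs`;
}
```

With the override pointed at a temporary directory, unit tests no longer see active-run records from the developer's real home directory.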
…estSpec detection

Four improvements identified during code review of 3e2b8b2:

- Move `extractCoverageTarget` from cli.ts to sub-agents.ts (alongside
  parseCoveragePercent); re-export via import in cli.ts. Eliminates the
  circular-import risk when phase-runner.ts calls coverage functions.

- Fix decimal truncation in extractCoverageTarget: `(\d+)` only matched
  integers, silently returning 80 for targets like ≥90.5%. Changed to
  `([\d.]+)` + parseFloat.

- Fix `hasTestSpec` detection in buildGeminiTestSpecPrompt: was
  `phase.body.includes("#### Test Spec")` (fragile string match, false
  negative when body text differs). Now `phase.testSpecCheckboxLine !== -1`
  (parser already computes this — zero extra overhead).

- Wire coverage gate in RUN_TESTS handler: after GREEN tests pass and the
  phase has a test spec (`testSpecCheckboxLine !== -1`), call
  parseCoveragePercent(result.stdout, testCmd) and compare against
  extractCoverageTarget(phase.body). Below target → set coverageResult and
  route to test_fix_running. Unknown framework → log advisory, proceed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
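
The decimal fix is easiest to see in isolation. This sketch assumes the target line has exactly the form `**Coverage target: ≥N%**` and uses a hypothetical name, `extractCoverageTargetSketch`; the `([\d.]+)` capture, the `parseFloat` call, and the default of 80 come from the commit messages.

```typescript
// Assumed exact line format: **Coverage target: ≥N%**
const COVERAGE_TARGET_RE = /\*\*Coverage target:\s*≥\s*([\d.]+)%\*\*/;
const DEFAULT_COVERAGE_TARGET = 80;

function extractCoverageTargetSketch(phaseBody: string): number {
  const match = phaseBody.match(COVERAGE_TARGET_RE);
  if (match === null) return DEFAULT_COVERAGE_TARGET; // no target line: backward-compatible default
  const value = Number.parseFloat(match[1] ?? "");
  // "[\d.]+" also accepts strings like "..", so guard against NaN.
  return Number.isFinite(value) ? value : DEFAULT_COVERAGE_TARGET;
}
```

With the earlier integer-only `(\d+)` capture, a body containing `≥90.5%` would fail to match because of the `.` before the `%`, and the function would fall through to the default.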
Complete the coverage gate: `injectCoverageFlags(testCmd)` appends the
appropriate flag for the detected framework before the GREEN test run,
so `parseCoveragePercent` reliably finds coverage data in stdout even
when projects don't pre-configure coverage in their test script.

Framework → flag mapping:
  jest     → --coverage --coverageReporters text
  vitest   → --coverage
  bun test → --coverage
  pytest   → --cov --cov-report term-missing
  go test  → -cover
  unknown  → unchanged (advisory log, gate skips)

Injection is idempotent (no-op if flag already present) and only fires
when the phase has a test spec (testSpecCheckboxLine !== -1) — VERIFY_RED
and legacy phases use the bare test command unchanged.

11 unit tests added covering each framework, idempotency, and unknowns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
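
The mapping table above translates fairly directly into a lookup. In this sketch the flags are copied from the commit message, while the framework detection by command substring, the "already present" patterns, and the name `injectCoverageFlagsSketch` are assumptions.

```typescript
interface CoverageRule {
  detect: RegExp;         // how the framework is recognised in the command (assumed)
  flags: string;          // flags from the commit message
  alreadyPresent: RegExp; // idempotency check (assumed)
}

const COVERAGE_RULES: CoverageRule[] = [
  { detect: /\bvitest\b/, flags: "--coverage", alreadyPresent: /(^|\s)--coverage\b/ },
  { detect: /\bjest\b/, flags: "--coverage --coverageReporters text", alreadyPresent: /(^|\s)--coverage\b/ },
  { detect: /\bbun test\b/, flags: "--coverage", alreadyPresent: /(^|\s)--coverage\b/ },
  { detect: /\bpytest\b/, flags: "--cov --cov-report term-missing", alreadyPresent: /(^|\s)--cov(=|\s|$)/ },
  { detect: /\bgo test\b/, flags: "-cover", alreadyPresent: /(^|\s)-cover\b/ },
];

function injectCoverageFlagsSketch(testCmd: string): string {
  const rule = COVERAGE_RULES.find((r) => r.detect.test(testCmd));
  if (rule === undefined) return testCmd;                // unknown framework: unchanged
  if (rule.alreadyPresent.test(testCmd)) return testCmd; // idempotent: flag already there
  return `${testCmd} ${rule.flags}`;
}
```

Applying the function twice gives the same command as applying it once, which is the idempotency property the commit describes.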
`phase.kind !== "code" ? "" : ""` always evaluated to "" regardless
of the condition, and was silently filtered by .filter(Boolean).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…p (Bug D1)

Two failing tests document the bug:
1. After CRITICAL verdict, state.planReview must be persisted with status
   "critical_exit_pending" — currently cli.ts does not persist anything
   before process.exit(3), so planReview stays undefined on disk.
2. On resume with the sentinel set, the plan-review gate must still fire —
   the current guard (!state.planReview) is false when planReview is truthy,
   so the gate is skipped after the sentinel is introduced.

Two GREEN tests confirm baseline behavior: APPROVE verdict suppresses the
gate; undefined planReview (first run) fires the gate.

Tests MUST fail until Feature 4 implementation lands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Before this fix, a CRITICAL plan-review verdict caused process.exit(3)
without saving any sentinel to state. On resume, !state.planReview was
true → review ran again → CRITICAL again → infinite loop.

Fix:
1. Save state.planReview = { ...verdict, status: "critical_exit_pending" }
   before releaseLock + process.exit(3) so the sentinel survives on disk.
2. Widen the plan-review gate guard from !state.planReview to
   !state.planReview || state.planReview.status === "critical_exit_pending"
   so the gate re-fires on resume when the sentinel is present.

Tests: two new tests in phase-runner.test.ts cover both the sentinel
persistence and the widened gate; 90/90 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
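
A compact sketch of the sentinel and the widened guard, assuming a heavily reduced state shape. `BuildStateSketch`, `withCriticalSentinel`, and `shouldRunPlanReview` are hypothetical names, and `"approved"` in the usage note is a placeholder status; only the `"critical_exit_pending"` status and the guard expression come from the commit message.

```typescript
// Reduced, hypothetical state shape.
interface PlanReviewSketch {
  status: string;
  [key: string]: unknown;
}

interface BuildStateSketch {
  planReview?: PlanReviewSketch;
}

// Step 1 of the fix: record the sentinel before exiting so it survives on disk.
function withCriticalSentinel(
  state: BuildStateSketch,
  verdict: Record<string, unknown>,
): BuildStateSketch {
  return { ...state, planReview: { ...verdict, status: "critical_exit_pending" } };
}

// Step 2 of the fix: the widened gate guard.
function shouldRunPlanReview(state: BuildStateSketch): boolean {
  return !state.planReview || state.planReview.status === "critical_exit_pending";
}
```

On a first run (`planReview` undefined) and on resume after a critical exit the gate fires; with any other recorded status, such as a placeholder `"approved"`, it is skipped.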
…g D2)

Introduces ExitError (errors.ts) — thrown instead of process.exit(N)
inside try/finally blocks so the finally clause runs cleanup before
the process terminates.

Changes:
- errors.ts: new ExitError class (instanceof Error, numeric code field)
- cli.ts: import ExitError; replace critical_exit process.exit(3) with
  throw new ExitError(3); update main().catch to call process.exit(err.code)
  when err instanceof ExitError
- phase-runner.test.ts: 5 new tests (ExitError shape, propagation through
  finally, default and custom messages); 95/95 passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
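
The reason for throwing instead of calling `process.exit(N)` is that `finally` blocks run while an exception propagates. The sketch below shows the pattern with hypothetical names (`ExitErrorSketch`, `runWithCleanup`) and an assumed default message; it returns the exit code rather than terminating the process so it can be exercised in a test.

```typescript
// Sketch of an exit-carrying error; the default message format is an assumption.
class ExitErrorSketch extends Error {
  readonly code: number;

  constructor(code: number, message: string = `exit ${code}`) {
    super(message);
    this.name = "ExitError";
    this.code = code;
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, ExitErrorSketch.prototype);
  }
}

// Stand-in for main(): returns the exit code instead of calling process.exit.
function runWithCleanup(work: () => void, cleanup: () => void): number {
  try {
    try {
      work();
    } finally {
      cleanup(); // runs even when work() throws
    }
    return 0;
  } catch (err) {
    if (err instanceof ExitErrorSketch) return err.code;
    throw err; // unrelated errors keep propagating
  }
}
```

In a real CLI the outer `catch` would be the place to call `process.exit(err.code)`, after all cleanup has already run.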
…ature 6)

applyResult() now populates phaseState.coverageResult when:
- action is RUN_TESTS
- tests are GREEN (status = "tests_green")
- extra.phaseBody is provided
- parseCoveragePercent() returns a non-null value for the stdout

Coverage below target emits an advisory warning but keeps status
"tests_green" — not blocking. The target defaults to 80 when no
"**Coverage target: ≥N%**" line appears in the phase body.

6 new tests in phase-runner.test.ts; 101/101 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ics + test assertions

- Add errors.ts to MODULE_TEST_OWNERS in coverage-matrix.test.ts
- Fix analytics logActivity to emit "success" for exit code 13 (FINALIZATION_REQUIRED),
  which is a success state (pending ship), not a failure
- Fix integration test assertions: --skip-ship correctly exits 13, not 0, when
  features reach origin_verified (pre-existing test/impl mismatch)
…d [Phase 1.1]

RED phase TDD: 11 tests fail because the parser does not yet stamp kind: "code"
on emitted phases, and existing Phase literal construction sites have no kind
field (undefined fails the VALID_KINDS.includes runtime assertion).

11 tests pass immediately: direct Phase construction with explicit kind values,
and PhaseKind union membership checks (both already exist in types.ts).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add required kind: PhaseKind field to the parser factory init and to
every Phase literal construction site in tests/fixtures. This ensures
backward-compatible default of kind: "code" for all existing phases
while the type system enforces correctness going forward.

- parser.ts: stamp kind: "code" on every emitted Phase
- state.test.ts, cli.test.ts, phase-runner.test.ts,
  feature-review.test.ts, cli-guardrails.test.ts,
  phase-kind.test.ts: add kind: "code" to all helpers and inline literals
…tations

- Fix PHASE_HEADING regex to allow optional [kind] bracket between number and colon
- Add BODY_KIND_PATTERN for <!-- kind: X --> HTML comment fallback
- Add IMPL_LABELS_BY_KIND and REVIEW_LABELS_BY_KIND maps for all 5 PhaseKind values
- Parser now stamps kind from heading bracket (primary), body comment (fallback), or defaults to "code"
- Inline kind-comment detection ensures kind is set before checkbox processing
- Add implCheckboxRe/reviewCheckboxRe for kind-specific checkbox matching
- Add 16 new parser tests covering all bracket annotations, HTML fallback, checkbox recognition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
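
The kind-resolution order (heading bracket first, body comment as fallback, `"code"` as default) can be sketched as follows. The regexes are approximations rather than the actual `PHASE_HEADING` and `BODY_KIND_PATTERN`, the function name is hypothetical, and `docs` is a made-up kind used only for illustration since the five `PhaseKind` values are not listed here.

```typescript
// Approximate patterns; the real PHASE_HEADING and BODY_KIND_PATTERN may differ.
const PHASE_HEADING_SKETCH = /^###\s+Phase\s+(\d+(?:\.\d+)*)\s*(?:\[([\w-]+)\])?\s*:\s*(.*)$/;
const BODY_KIND_SKETCH = /<!--\s*kind:\s*([\w-]+)\s*-->/;

// Returns null when the line is not a phase heading at all.
function resolveKindSketch(headingLine: string, body: string): string | null {
  const heading = headingLine.match(PHASE_HEADING_SKETCH);
  if (heading === null) return null;
  const fromBracket: string | undefined = heading[2];
  if (fromBracket !== undefined) return fromBracket; // primary: [kind] in the heading
  const fromBody = body.match(BODY_KIND_SKETCH);
  return fromBody?.[1] ?? "code";                    // fallback: body comment, else default
}
```

Because the bracket is optional in the heading pattern, headings written before this change still match and resolve to `"code"`.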
anbangr and others added 8 commits May 11, 2026 22:38
- Add IMPL_MARKER_BY_KIND and REVIEW_MARKER_BY_KIND lookup tables
- Update flipPhaseCheckboxes signature to accept optional kind?: PhaseKind
- Derives implMarker/reviewMarker from kind ?? "code" (backward compat)
- Update reconcilePhaseCheckboxes to pass phase.kind
- Update both cli.ts call sites (lines ~3870, ~4282) to pass kind: phase.kind
- Add 9 kind-aware mutator tests covering all 5 kinds + error cases + backward compat

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-mutator.ts

The merge introduced exported constants at the top of the file while the
original local const declarations were still present below, causing a
"has already been declared" TypeScript error. Remove the duplicates.

Also regenerate SKILL.md files to pick up template changes from the merge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Post-rebase gen:skill-docs sync — templates updated in Phase 1.5 commits
produced SKILL.md drift. Regenerating now from authoritative templates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…geTarget import after rebase

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@anbangr anbangr force-pushed the fix/step-transition-clean branch from 5fc91a3 to fbdbb93 May 11, 2026 15:01
@anbangr anbangr merged commit cb26722 into main May 11, 2026
@anbangr anbangr deleted the fix/step-transition-clean branch May 11, 2026 22:32