test(build): regression tests for active-run registry exit-13 / paused status by anbangr · Pull Request #33 · anbangr/gstack

anbangr · 2026-05-11T23:06:40Z

Summary

Adds regression tests (T1–T5) for the active-run registry fix from PR #31 (fix(build): droppedPhasesCount + --print-only exit code for flat-task plans): `exit-13 (FINALIZATION_REQUIRED)` must write `"paused"` status, not `"failed"`.
T3: non-zero non-13 exit writes `"failed"` (verifies the ternary branches correctly)
T4: paused + dead PID → record is removed by stale-paused cleanup (Feature 3)
T5: paused + live PID → record kept, candidate still returned for resume
Bumps build skill version to `1.22.2` to reflect the added test coverage.
Fixes ship SKILL.md golden baselines (Phase 1.5 added Content Review gate language, so the baselines were stale).
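
The exit-code branching that T3 verifies can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's code: `statusForExit` and `RegistryStatus` are hypothetical names, and the `"completed"` status for exit 0 is an assumption, since the PR text only specifies the exit-13 and other non-zero cases.

```typescript
// Exit code named in the PR: 13 means FINALIZATION_REQUIRED (pending ship).
const FINALIZATION_REQUIRED = 13;

// Hypothetical status union; only "paused" and "failed" come from the PR text.
type RegistryStatus = "completed" | "paused" | "failed";

function statusForExit(exitCode: number): RegistryStatus {
  // Assumption: a clean exit is recorded as "completed".
  if (exitCode === 0) return "completed";
  // The ternary under test: exit 13 pauses the run, any other non-zero exit fails it.
  return exitCode === FINALIZATION_REQUIRED ? "paused" : "failed";
}
```

With this shape, `statusForExit(13)` yields `"paused"` and `statusForExit(1)` yields `"failed"`, which are the two branches the regression tests pin down.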

Net diff vs main: 8 files, 215 lines (orchestrator tests + skill version bump + golden baselines).

Test plan

`bun test` passes (all tests green, exit 0)
`bun test build/orchestrator/tests/` — all tests green including T3/T4/T5
`bun test test/host-config.test.ts` — all 73 pass including golden file checks
Living plan (fluffy-twilight) all checkboxes complete
Fork versioning rule respected — no top-level `VERSION` bump

🤖 Generated with Claude Code

- Regenerate build/SKILL.md from template after step-transition and content-review gate changes
- Add --no-plan-review to integration test invocations that use --skip-ship and --skip-clean-check to prevent LLM calls in CI

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…lidator Captures the follow-on work from the step-transition / living plan parse investigation: detect structural-mirroring flat-task format before agents are spawned, rather than after parsing returns 0 executable phases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rdering

Address two findings from /review:
- Add stderr assertion to --print-only exit-2 integration test so a future unrelated exit-2 can't pass silently (T-1)
- Add comment on ensureFeature() call after finalize() in parser.ts explaining the ordering: idempotent for emitted phases, load-bearing for dropped phases (M-6)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- planSynthesizer prompt now requires a `#### Test Spec` section for every phase: coverage target (≥80%), scenario table (ID/Scenario/Given/When/Then), and explicit edge cases list. Test-writer becomes a pure implementor: it receives a spec as quality floor and MAY add cases.
- `extractCoverageTarget(phaseBody)` parses `**Coverage target: ≥N%**` from phase body (defaults to 80 when absent, for backward compatibility).
- `buildGeminiTestSpecPrompt` is now spec-aware: detects `#### Test Spec` in phase.body and switches from generic "write failing tests" to "implement ALL listed cases as minimum requirement" instructions.
- `parseCoveragePercent(stdout, testCmd)` parses coverage % from test runner stdout for Jest/Vitest, Bun, pytest, and Go; returns null for unknown frameworks (advisory-only).
- `PhaseState.coverageResult?: { actual, target }` field added to types.
- `PLAN_REVIEW_PROMPT` gains criterion 6 (TEST SPEC QUALITY): CRITICAL for inconsistent specs across phases, IMPORTANT for all-missing (legacy plans), SUGGESTION for missing coverage target line.
- Test suite: 12 new tests for extractCoverageTarget + spec-aware buildGeminiTestSpecPrompt in cli.test.ts; 12 new tests for parseCoveragePercent in sub-agents.test.ts. Version assertions and coverage-matrix ownership map updated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add detectSkillFaults() with coverage for:
- CODEX_CONVERGENCE (iterations >= DEFAULT_MAX_CODEX_ITERATIONS)
- TEST_FIXER_LOOP (iterations >= DEFAULT_MAX_TEST_ITERATIONS)
- PREMATURE_COMPLETION (checked tasks for non-committed phases)
- PLAN_SYNTHESIS_INVALID (missing Origin trace: or Acceptance:)
- WORKTREE_LEAK (completed=true but worktree still exists)
- RED_SPEC_TRIVIAL (trivially-passing tests)
- PLAN_MUTATOR_MISMATCH (plan mutation failures)
- PLAN_REVIEW_STALEMATE (round>=3 with CRITICAL objections)
- FEATURE_VERIFIER_SCOPE (VERIFICATION: GAPS in stdout)

All detectors are wrapped in try/catch so bad inputs never throw. Analytics are appended to GSTACK_HOME/analytics/skill-faults.jsonl only when faults exist, and analytics failures are swallowed.
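
The "never throw" contract described above could look something like the sketch below. It is an illustration under assumptions, not the contents of skill-fault-detector.ts: the `SkillFault` shape, the `Detector` type, and the helper names `runDetectors` and `recordFaults` are all hypothetical; only the try/catch wrapping, the JSONL path, and the append-only-when-faults-exist rule come from the commit message.

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { join } from "node:path";

// Hypothetical fault shape; the real SkillFault type lives in types.ts.
interface SkillFault {
  category: string;
  detail: string;
}

type Detector = (state: unknown) => SkillFault[];

// Run every detector in isolation so one bad input cannot abort the rest.
function runDetectors(state: unknown, detectors: Detector[]): SkillFault[] {
  const faults: SkillFault[] = [];
  for (const detect of detectors) {
    try {
      faults.push(...detect(state));
    } catch {
      // A throwing detector contributes no faults; the error is dropped.
    }
  }
  return faults;
}

// Append one JSON line per evaluation, but only when there is something to report.
function recordFaults(gstackHome: string, faults: SkillFault[]): void {
  if (faults.length === 0) return;
  try {
    const dir = join(gstackHome, "analytics");
    mkdirSync(dir, { recursive: true });
    const line = JSON.stringify({ ts: new Date().toISOString(), faults });
    appendFileSync(join(dir, "skill-faults.jsonl"), line + "\n");
  } catch {
    // Analytics failures are swallowed so they never affect the build.
  }
}
```

Keeping detection and recording separate means a disk error in the analytics path cannot change which faults are reported to the caller.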

…rEvaluation wiring

Red-phase test spec for Phase 2.1. Tests cover:
- SKILL_FAULT_DETECTED absent from MonitorEventName/MONITOR_EXIT_CODES (guard)
- MonitorEvaluation.skillFaultEvents field exists and is always an array
- evaluateMonitorOnce populates skillFaultEvents from detectSkillFaults
- each SkillFaultDetectedEvent has required shape fields + event: 'SKILL_FAULT_DETECTED'
- skillFaultEvents is [] when detector finds no faults or state is null
- monitor exit code is unaffected by skillFaultEvents presence

11 tests fail (Red), 3 guard tests pass. No implementation code.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The test file was placed in build/orchestrator/__tests__/ (correct for test:build-skill) but VERIFY_RED runs bun test test/ which does not scan that directory. Move to test/skill-monitor-fault.test.ts with adjusted import paths so the 11 RED tests are discovered and confirmed failing. Tests: 11 fail, 3 pass (before implementation)

- Add SkillFaultDetectedEvent type in types.ts (imports SkillFault)
- Add skillFaultEvents field to MonitorEvaluation in monitor.ts
- Add stateDir to MonitorRunSnapshot for detectSkillFaults input
- Call detectSkillFaults per snapshot in evaluateMonitorOnce with try/catch
- Print skillFaultEvents as JSON lines before terminal events in monitor mode
- SKILL_FAULT_DETECTED is not a MonitorEventName and has no exit code

- Update Step M3 monitor launch to use set -o pipefail and ${PIPESTATUS[0]} while teeing output to monitor-output.log
- Add Step M3.5 that scans monitor output for SKILL_FAULT_DETECTED, dedupes by resolved path (readlink), reads fault_investigator_model from configure.cm, and dispatches either GSTACK_FAULT_INVESTIGATOR_COMMAND or one background agent per non-duplicate fault
- Add validation tests for Step M3.5 content in skill-md.test.ts
- Fix pre-existing hardcoded model name in cli.ts comment

…_RUN_ID

The previous Step M3.5 implementation had a critical silent-failure bug:
1. `sed -n 's/.*file:////p'` is a malformed sed expression (4 slashes = bad flag in substitute command). `_FAULT_FILE` was always empty and the `[ -z "$_FAULT_FILE" ] && continue` guard silently skipped every fault.
2. The expression also assumed a `file://` URI format that the monitor never emits. Actual SKILL_FAULT_DETECTED events are JSON lines with a `faults[].sourceFiles[]` array (see build/orchestrator/cli.ts:5739-5741 and build/orchestrator/types.ts:15-23).

No investigator would ever spawn.

Switch to jq-based JSON parsing that flattens each event into TSV rows (runId<TAB>category<TAB>file) and pass FAULT_CATEGORY + FAULT_RUN_ID env vars to the investigator alongside FAULT_FILE. Dedupe key now includes (runId, category, resolved-path) so unrelated faults across runs aren't collapsed. Log filename is suffixed with category to avoid collisions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
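
For readers who prefer to see the flattening and dedupe logic outside of shell, here is a rough TypeScript equivalent of what the jq step is described as doing. It is a sketch under assumptions: the real step runs in Bash via jq, the top-level `runId` field name is inferred from the TSV description, and path resolution (readlink) is omitted, so the dedupe key here uses the raw path.

```typescript
// Assumed event shape; only faults[].sourceFiles[] is confirmed by the commit message.
interface FaultEvent {
  event: "SKILL_FAULT_DETECTED";
  runId: string;
  faults: { category: string; sourceFiles: string[] }[];
}

interface FaultRow {
  runId: string;
  category: string;
  file: string;
}

function tryParse(line: string): Partial<FaultEvent> | undefined {
  try {
    return JSON.parse(line) as Partial<FaultEvent>;
  } catch {
    return undefined; // non-JSON monitor output is ignored
  }
}

// Flatten every SKILL_FAULT_DETECTED JSON line into (runId, category, file) rows,
// dropping rows whose full triple has already been seen.
function flattenFaultEvents(monitorOutput: string): FaultRow[] {
  const seen = new Set<string>();
  const rows: FaultRow[] = [];
  for (const line of monitorOutput.split("\n")) {
    if (!line.includes("SKILL_FAULT_DETECTED")) continue;
    const parsed = tryParse(line);
    if (!parsed || parsed.event !== "SKILL_FAULT_DETECTED") continue;
    if (typeof parsed.runId !== "string") continue;
    for (const fault of parsed.faults ?? []) {
      for (const file of fault.sourceFiles ?? []) {
        const key = [parsed.runId, fault.category, file].join("\t");
        if (seen.has(key)) continue;
        seen.add(key);
        rows.push({ runId: parsed.runId, category: fault.category, file });
      }
    }
  }
  return rows;
}
```

Because the key contains the run ID and category as well as the file, the same file flagged in two different runs produces two rows rather than one.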

…atch Adds test/skill-e2e-build-fault-investigator.test.ts (periodic tier) covering the fault investigator E2E flow: mock gstack-build outputs SKILL_FAULT_DETECTED JSON, Step M3.5 dispatches GSTACK_FAULT_INVESTIGATOR_COMMAND with fault env vars, mock investigator writes report to $FAULT_PRIMARY, assertions verify report exists with PLAN_SYNTHESIS_INVALID and no source files were edited. Registers build-fault-investigator-e2e in touchfiles.ts — selected when build/SKILL.md, skill-fault-detector.ts, or monitor.ts change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. plan-selection (6 tests): `defaultActiveRunRegistryDir()` hardcoded `~/.gstack/build-state/active-runs` and ignored `GSTACK_BUILD_STATE_DIR`, causing 11 real active-run records to leak into unit tests and inflate candidate counts (turning expected "selected" into "ambiguous"). Fix: honour the env var consistently, the same way `state.ts` already does.
2. integration (3 tests): plan review subprocess called `codex` with `OPENAI_API_KEY` from the inherited `process.env`, triggering a real ~30s API call against the LLM. These tests exercise feature lifecycle, not plan review. Fix: add `--no-plan-review` to each CLI invocation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
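
The first fix amounts to resolving the registry directory from the environment before falling back to the home directory. A minimal sketch follows; it assumes the registry sits in an `active-runs` subdirectory of whatever `GSTACK_BUILD_STATE_DIR` points at, which mirrors the hardcoded default but is not confirmed by the commit message. The `env` parameter is added here purely to make the function easy to test.

```typescript
import { homedir } from "node:os";
import { join } from "node:path";

// Resolve the active-run registry directory, preferring the env override.
function defaultActiveRunRegistryDir(
  env: Record<string, string | undefined> = process.env,
): string {
  // An unset or empty override falls back to ~/.gstack/build-state.
  const base = env.GSTACK_BUILD_STATE_DIR || join(homedir(), ".gstack", "build-state");
  // Assumption: the registry lives in an "active-runs" subdirectory of the state dir.
  return join(base, "active-runs");
}
```

A unit test can then point `GSTACK_BUILD_STATE_DIR` at a temporary directory and be sure no real records from the developer's machine are visible.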

…estSpec detection

Four improvements identified during code review of 3e2b8b2:
- Move `extractCoverageTarget` from cli.ts to sub-agents.ts (alongside parseCoveragePercent); re-export via import in cli.ts. Eliminates the circular-import risk when phase-runner.ts calls coverage functions.
- Fix decimal truncation in extractCoverageTarget: `(\d+)` only matched integers, silently returning 80 for targets like ≥90.5%. Changed to `([\d.]+)` + parseFloat.
- Fix `hasTestSpec` detection in buildGeminiTestSpecPrompt: was `phase.body.includes("#### Test Spec")` (fragile string match, false negative when body text differs). Now `phase.testSpecCheckboxLine !== -1` (the parser already computes this, so there is zero extra overhead).
- Wire coverage gate in RUN_TESTS handler: after GREEN tests pass and the phase has a test spec (`testSpecCheckboxLine !== -1`), call parseCoveragePercent(result.stdout, testCmd) and compare against extractCoverageTarget(phase.body). Below target → set coverageResult and route to test_fix_running. Unknown framework → log advisory, proceed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
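
The decimal fix is easiest to see in isolation. The sketch below reconstructs `extractCoverageTarget` from the description only; the exact regular expression in sub-agents.ts may differ, and the `DEFAULT_COVERAGE_TARGET` constant name is an assumption.

```typescript
// Default named in the commit message; the constant name is hypothetical.
const DEFAULT_COVERAGE_TARGET = 80;

// Parse "**Coverage target: ≥N%**" from a phase body, accepting decimals.
function extractCoverageTarget(phaseBody: string): number {
  // With (\d+) instead of ([\d.]+), "≥90.5%" fails to match at the "." and
  // the function silently falls through to the default.
  const match = phaseBody.match(/\*\*Coverage target:\s*≥\s*([\d.]+)%\*\*/);
  if (!match) return DEFAULT_COVERAGE_TARGET;
  const value = parseFloat(match[1]);
  // Guard against inputs like "≥..%" that match the class but are not numbers.
  return Number.isNaN(value) ? DEFAULT_COVERAGE_TARGET : value;
}
```

A body containing `**Coverage target: ≥90.5%**` now yields 90.5, while a body without the line still yields 80.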

Complete the coverage gate: `injectCoverageFlags(testCmd)` appends the appropriate flag for the detected framework before the GREEN test run, so `parseCoveragePercent` reliably finds coverage data in stdout even when projects don't pre-configure coverage in their test script.

Framework → flag mapping:
- jest → --coverage --coverageReporters text
- vitest → --coverage
- bun test → --coverage
- pytest → --cov --cov-report term-missing
- go test → -cover
- unknown → unchanged (advisory log, gate skips)

Injection is idempotent (no-op if flag already present) and only fires when the phase has a test spec (testSpecCheckboxLine !== -1); VERIFY_RED and legacy phases use the bare test command unchanged. 11 unit tests added covering each framework, idempotency, and unknowns.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
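
The mapping above translates fairly directly into a lookup table. The following is a sketch rather than the shipped implementation: how the real function detects the framework from `testCmd` is not stated, so the word-boundary regexes and the `marker` substring used for the idempotency check are assumptions.

```typescript
interface CoverageRule {
  detect: RegExp; // assumed detection heuristic for the framework
  flags: string;  // flags from the commit message's mapping
  marker: string; // substring whose presence makes injection a no-op
}

const COVERAGE_RULES: CoverageRule[] = [
  { detect: /\bjest\b/, flags: "--coverage --coverageReporters text", marker: "--coverage" },
  { detect: /\bvitest\b/, flags: "--coverage", marker: "--coverage" },
  { detect: /\bbun test\b/, flags: "--coverage", marker: "--coverage" },
  { detect: /\bpytest\b/, flags: "--cov --cov-report term-missing", marker: "--cov" },
  { detect: /\bgo test\b/, flags: "-cover", marker: "-cover" },
];

function injectCoverageFlags(testCmd: string): string {
  for (const rule of COVERAGE_RULES) {
    if (!rule.detect.test(testCmd)) continue;
    // Idempotent: leave the command alone if a coverage flag is already there.
    if (testCmd.includes(rule.marker)) return testCmd;
    return `${testCmd} ${rule.flags}`;
  }
  // Unknown framework: return the command unchanged.
  return testCmd;
}
```

Applying the function twice gives the same result as applying it once, which matters if the command is rebuilt on every retry of the GREEN run.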

…p (Bug D1)

Two failing tests document the bug:
1. After CRITICAL verdict, state.planReview must be persisted with status "critical_exit_pending"; currently cli.ts does not persist anything before process.exit(3), so planReview stays undefined on disk.
2. On resume with the sentinel set, the plan-review gate must still fire; the current guard (!state.planReview) is false when planReview is truthy, so the gate is skipped after the sentinel is introduced.

Two GREEN tests confirm baseline behavior: APPROVE verdict suppresses the gate; undefined planReview (first run) fires the gate. Tests MUST fail until Feature 4 implementation lands.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Before this fix, a CRITICAL plan-review verdict caused process.exit(3) without saving any sentinel to state. On resume, !state.planReview was true → review ran again → CRITICAL again → infinite loop.

Fix:
1. Save state.planReview = { ...verdict, status: "critical_exit_pending" } before releaseLock + process.exit(3) so the sentinel survives on disk.
2. Widen the plan-review gate guard from !state.planReview to !state.planReview || state.planReview.status === "critical_exit_pending" so the gate re-fires on resume when the sentinel is present.

Tests: two new tests in phase-runner.test.ts cover both the sentinel persistence and the widened gate; 90/90 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
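
The widened guard is small enough to show as a standalone predicate. This is a sketch only: `shouldRunPlanReview` and the trimmed-down `BuildState` type are hypothetical, and the `"approved"` status used in the usage note is an assumed value for an APPROVE verdict.

```typescript
// Minimal stand-ins for the orchestrator's state types (hypothetical shapes).
interface PlanReviewState {
  status: string;
}

interface BuildState {
  planReview?: PlanReviewState;
}

// Widened guard from the fix: the gate fires on a first run (no planReview yet)
// and again on resume when the critical-exit sentinel was persisted.
function shouldRunPlanReview(state: BuildState): boolean {
  return !state.planReview || state.planReview.status === "critical_exit_pending";
}
```

Under these assumptions, an empty state and a state carrying `critical_exit_pending` both return `true`, while a state whose review ended with a status such as `"approved"` returns `false`.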

…g D2)

Introduces ExitError (errors.ts), thrown instead of process.exit(N) inside try/finally blocks so the finally clause runs cleanup before the process terminates.

Changes:
- errors.ts: new ExitError class (instanceof Error, numeric code field)
- cli.ts: import ExitError; replace critical_exit process.exit(3) with throw new ExitError(3); update main().catch to call process.exit(err.code) when err instanceof ExitError
- phase-runner.test.ts: 5 new tests (ExitError shape, propagation through finally, default and custom messages); 95/95 passing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
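
A sketch of the pattern, assuming a class shape that matches the description (an `Error` subclass with a numeric `code`). The default message format and the `runWithCleanup` helper are hypothetical and exist only to show the exception passing through `finally`.

```typescript
// Error subclass carrying the intended process exit code.
class ExitError extends Error {
  readonly code: number;

  constructor(code: number, message: string = `exit ${code}`) {
    super(message);
    this.name = "ExitError";
    this.code = code;
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical helper: unlike process.exit(), a thrown ExitError unwinds
// through finally, so cleanup (e.g. releasing a lock) always runs.
function runWithCleanup(body: () => void, cleanup: () => void): void {
  try {
    body();
  } finally {
    cleanup();
  }
}

// Translate an error into an exit code at the top level.
function exitCodeFor(err: unknown): number {
  return err instanceof ExitError ? err.code : 1;
}
```

At the entry point, something along the lines of `main().catch((err) => process.exit(exitCodeFor(err)))` would then turn the exception back into the intended exit status after all `finally` blocks have run.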

…ature 6)

applyResult() now populates phaseState.coverageResult when:
- action is RUN_TESTS
- tests are GREEN (status = "tests_green")
- extra.phaseBody is provided
- parseCoveragePercent() returns a non-null value for the stdout

Coverage below target emits an advisory warning but keeps status "tests_green" (not blocking). The target defaults to 80 when no "**Coverage target: ≥N%**" line appears in the phase body. 6 new tests in phase-runner.test.ts; 101/101 passing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ics + test assertions

- Add errors.ts to MODULE_TEST_OWNERS in coverage-matrix.test.ts
- Fix analytics logActivity to emit "success" for exit code 13 (FINALIZATION_REQUIRED), which is a success state (pending ship), not a failure
- Fix integration test assertions: --skip-ship correctly exits 13, not 0, when features reach origin_verified (pre-existing test/impl mismatch)

…d [Phase 1.1] RED phase TDD: 11 tests fail because the parser does not yet stamp kind: "code" on emitted phases, and existing Phase literal construction sites have no kind field (undefined fails the VALID_KINDS.includes runtime assertion). 11 tests pass immediately: direct Phase construction with explicit kind values, and PhaseKind union membership checks (both already exist in types.ts). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add required kind: PhaseKind field to the parser factory init and to every Phase literal construction site in tests/fixtures. This ensures backward-compatible default of kind: "code" for all existing phases while the type system enforces correctness going forward.

- parser.ts: stamp kind: "code" on every emitted Phase
- state.test.ts, cli.test.ts, phase-runner.test.ts, feature-review.test.ts, cli-guardrails.test.ts, phase-kind.test.ts: add kind: "code" to all helpers and inline literals

…tations

- Fix PHASE_HEADING regex to allow optional [kind] bracket between number and colon
- Add BODY_KIND_PATTERN for HTML comment fallback
- Add IMPL_LABELS_BY_KIND and REVIEW_LABELS_BY_KIND maps for all 5 PhaseKind values
- Parser now stamps kind from heading bracket (primary), body comment (fallback), or defaults to "code"
- Inline kind-comment detection ensures kind is set before checkbox processing
- Add implCheckboxRe/reviewCheckboxRe for kind-specific checkbox matching
- Add 16 new parser tests covering all bracket annotations, HTML fallback, checkbox recognition

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add IMPL_MARKER_BY_KIND and REVIEW_MARKER_BY_KIND lookup tables
- Update flipPhaseCheckboxes signature to accept optional kind?: PhaseKind
- Derives implMarker/reviewMarker from kind ?? "code" (backward compat)
- Update reconcilePhaseCheckboxes to pass phase.kind
- Update both cli.ts call sites (lines ~3870, ~4282) to pass kind: phase.kind
- Add 9 kind-aware mutator tests covering all 5 kinds + error cases + backward compat

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…-mutator.ts The merge introduced exported constants at the top of the file while the original local const declarations were still present below, causing a "has already been declared" TypeScript error. Remove the duplicates. Also regenerate SKILL.md files to pick up template changes from the merge. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… "paused"

- T1 (integration): exit-13 --skip-ship writes "paused" record to active-run registry. Existing test at line 685 already covers this scenario.
- T3 (integration): non-zero non-13 exit writes "failed" record. Forces validateResumeLaunch to throw via pre-written state with mismatched projectRoot.
- T4 (plan-selection): paused + dead-pid (999999) record must be cleaned up by activeRunOnlyCandidates(); verifies Feature 3 auto-clean logic.
- T5 (plan-selection): paused + live-pid (process.pid) record must stay and return a candidate; verifies existing behavior preserved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
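
The T4/T5 pair hinges on telling a live PID from a dead one. One common way to do that in Node-compatible runtimes is to send signal 0, sketched below. Whether activeRunOnlyCandidates() uses this exact check is an assumption, and `ActiveRunRecord`, `isPidAlive`, and `pruneStalePaused` are hypothetical names.

```typescript
// Hypothetical record shape for an active-run registry entry.
interface ActiveRunRecord {
  runId: string;
  status: "running" | "paused" | "failed";
  pid: number;
}

// Signal 0 performs the existence/permission check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Drop paused records whose owning process is gone; keep everything else.
function pruneStalePaused(
  records: ActiveRunRecord[],
  alive: (pid: number) => boolean = isPidAlive,
): ActiveRunRecord[] {
  return records.filter((r) => !(r.status === "paused" && !alive(r.pid)));
}
```

Passing the liveness check as a parameter keeps the pruning logic testable without relying on a particular PID being unused on the test machine.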

anbangr and others added 30 commits May 11, 2026 22:27

chore(build): bump skill version to 1.22.0

32ae454

New capability (non-code phase kinds, step-transition guardrails) and concurrent-build fix warrant a MINOR bump per fork versioning convention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

qa(build): improve M3.5 path resolution, exit-code persistence, and e…

8e38f10

…nv passing

fix(build): complete M3.5 fault investigator report contract

45d9a64

test(e2e): complete build fault investigator test structure

dd7f428

- Adds mock configure.cm file to prevent jq from failing in Step M3.5 mock

qa(e2e): fix HOME isolation and report path in fault investigator test

e9b380d

chore: bump test phase timeout to 900000ms (suite grew past 5min budget)

499018f

fix(review): remove dead-code noop in buildCodexReviewBody

20f29e4

`phase.kind !== "code" ? "" : ""` always evaluated to "" regardless of the condition, and was silently filtered by .filter(Boolean). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(test): add build/orchestrator/__tests__/ to bun test path for TDD…

c9b7deb

… loop

anbangr and others added 13 commits May 11, 2026 22:38

feat(cli): Phase 1.4 — buildKindInstructions for kind-specific prompts

0af2fcf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: regenerate SKILL.md files after Phase 1.2-1.5 template updates

77cc427

feat(templates): Phase 1.5 — non-coding phase templates, CONTENT_REVI…

d07f55f

…EW gates, ship gate

chore: regenerate SKILL.md files after rebase onto fork/main

8f4b44a

Post-rebase gen:skill-docs sync — templates updated in Phase 1.5 commits produced SKILL.md drift. Regenerating now from authoritative templates. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(build): restore IMPL_MARKER_BY_KIND exports and fix extractCovera…

d4a8336

…geTarget import after rebase Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(build): update skill-md test to expect version 1.22.1 after rebase

fbdbb93

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge remote-tracking branch 'fork/main' into fix/step-transition-clean

956e6e9

chore(build): bump skill version to 1.22.2 after regression test addi…

53e3328

…tions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(tests): update skill-md version assertion to 1.22.2

9ba4963

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(tests): update ship SKILL.md golden baselines for Content Review …

d80b363

…addition Phase 1.5 added the Content Review gate to ship/SKILL.md.tmpl but the golden baselines weren't updated. Regenerated from current ship/SKILL.md. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

anbangr merged commit 49ab35c into main May 12, 2026

anbangr deleted the fix/step-transition-clean branch May 12, 2026 00:04

anbangr mentioned this pull request May 12, 2026

fix(build): exit-13 active-run registry fix — bisect-clean 5-commit sequence #34

Merged

2 tasks

anbangr merged 43 commits into main from fix/step-transition-clean
