v1.40.5.0 fix(build/orchestrator): drain-faults --queue self-enqueue + codex-hardened audit short-circuit#62
Merged
Conversation
…events Closes the drain-faults --queue self-enqueue loop. The queue consumer now moves any event with investigate:false directly to processed/ and records outcome:audit-skipped in analytics, without dispatching the codex investigator. Schema: HaltEvent gains optional investigate?: boolean. Absent OR true means dispatch (back-compat for rows filed before this field existed). The gstack-upgrade migration in a later commit retroactively flags legacy manual-recovery rows. Tests: 4 new cases in drain-halt-events-audit-skip.test.ts covering audit skip, normal dispatch, mixed queue, and legacy-row dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ked helper All three manual-recovery cli entry points (drain-faults, mark-shipped, --mark-phase-committed) now go through a single helper that pins investigate:false on every emit. Combined with the Phase 1 consumer short-circuit, this stops drain-faults --queue from paying codex to investigate its own invocation, and stops the analogous wasted spend on mark-shipped + --mark-phase-committed audit events that previously sat in pending-investigations/ awaiting drain. The cli sites still own the message and pointer composition; only the HaltEvent shape (kind, severity, snapshot defaults) is centralized. --mark-phase-committed keeps its richer state-aware snapshot via the new optional snapshot override. Static test (T7) asserts no raw `kind: "MANUAL_RECOVERY_INVOKED"` emit blocks remain in cli.ts, guarding against future contributors forgetting the flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ERY_INVOKED rows
Retroactively adds investigate:false to pre-existing legacy MANUAL_RECOVERY_INVOKED
rows under ~/.gstack/skill-faults/{pending-investigations,processed}/ so the
next drain-faults --queue short-circuits them instead of paying codex (~$0.30 each).
Migration body-only: kind unchanged, so computeFaultId() stays stable and
filenames don't need to rename. jq-based; atomic tmp+rename per file;
malformed rows logged and skipped without aborting the run; marker file at
~/.gstack/skill-faults/.migrations/v1.40.5.0.done gates re-runs.
Test matrix T9-T16 covers: pending+processed migration, already-flagged rows
untouched, idempotency on re-run, malformed-row skip, fresh-install no-op,
mid-flight crash resume, and non-MANUAL_RECOVERY rows untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-fix-drain-faults-self-enqueue
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… short-circuit Codex adversarial review surfaced 7 findings on the initial implementation. This commit fixes all 7 with regression tests for each: drain-faults.ts (consumer short-circuit): - M1: gate the short-circuit on `kind === "MANUAL_RECOVERY_INVOKED"`. A corrupted PHASE_FAILED row with investigate:false must NOT bypass investigation. - L1: honor `--dry-run` BEFORE moving the file or writing analytics. The short-circuit was running before the dry-run check, so dry-run leaked audit rows to processed/. - H3: only count shortCircuited and append the audit-skipped analytics row when markInvestigated actually succeeded (or returned ENOENT, which means a concurrent drain won the race and the skip is still effectively true). EACCES / EROFS / bad-queue-path failures now log the error and bump result.failed instead of silently lying about success. migrations/v1.40.5.0.sh: - H1: jq-missing exits 0 WITHOUT writing the marker so the next gstack-upgrade retries once jq lands. Previously the migration would exit successfully without doing the work but the wrapper would record the version as applied. - H2: track write_failures per file rewrite. If any failed, refuse to write the marker and exit non-zero. Stops a single EACCES from permanently locking out the migration. - M2: re-scan both dirs after rewriting. If a concurrent emitter wrote a new recovery-sink audit row during the run, refuse to write the marker so the next run picks them up. - M3: restrict scope to known recovery-sink message prefixes (drain-faults, mark-shipped, --mark-phase-committed). Custom MANUAL_RECOVERY_INVOKED rows from external emitters that legitimately need investigation are no longer silently reclassified as audit-only. - Malformed-row presence now also blocks the marker (don't lock out retries while the source of corruption is unfixed). Tests: - drain-halt-events-audit-skip.test.ts: 3 new cases (M1, L1, H3) — 7 total in the file, all green. - gstack-upgrade-migration-v1_40_5_0.test.ts: 5 new cases (H2, M3, M3 mark-shipped, M3 mark-phase-committed, malformed-no-marker) plus T13 updated to reflect the new "malformed blocks marker" contract. 13 total in the file, all green. Real-disk verified: hardened migration on ~/.gstack/skill-faults/ still flags legacy rows correctly + writes marker on clean run; dry-run drain-faults --queue reports shortCircuited:1 without moving the audit file from pending/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 20, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes the
gstack-build drain-faults --queueself-enqueue loop introduced by the collision of PR #57 (queue-mode consumer) and PR 2 / #54 (MANUAL_RECOVERY_INVOKEDobservability emits). Every priordrain-faults --queueinvocation paid codex (~$0.30) to investigate the recovery sink invoking itself.Performance & Cost
gstack-build drain-faults --queuenow short-circuits its own audit emit instead of dispatching codex on it. Cost per redundant audit drain: $0.30 → $0.00.drain-faultsself-invocation + anagnt2-prototype--mark-phase-committedfrom another build) verified end-to-end: short-circuited cleanly, codex untouched, analytics recordsoutcome: "audit-skipped".Architecture
HaltEvent.investigate?: booleanproperty inbuild/orchestrator/halt-events.ts. Absent ortrue= dispatch (back-compat);false= audit-only, short-circuit at consumer.emitManualRecoveryInvoked()helper inhalt-event-helpers.ts— single source of truth for the three cli manual-recovery emit sites (drain-faults, mark-shipped, --mark-phase-committed). Pinsinvestigate: falseby construction.drainFaultsFromHaltEventsQueuerecordsoutcome: "audit-skipped"analytics row and moves the event toprocessed/without dispatching. Gated onkind === "MANUAL_RECOVERY_INVOKED"so a corrupted PHASE_FAILED withinvestigate:falsewon't bypass investigation.Migration
gstack-upgrade/migrations/v1.40.5.0.sh. Flags legacyMANUAL_RECOVERY_INVOKEDrows in~/.gstack/skill-faults/{pending-investigations,processed}/withinvestigate: false. Scope restricted to known recovery-sink message prefixes (drain-faults / mark-shipped / --mark-phase-committed); custom MANUAL_RECOVERY rows from external emitters are left alone.jqtmp+rename per file; marker at~/.gstack/skill-faults/.migrations/v1.40.5.0.doneis written ONLY when the run completes without any rewrite failures, no concurrent emitter races detected, and no malformed rows skipped. Any of those conditions exits non-zero so the nextgstack-upgraderetries instead of permanently locking the migration "done."Codex adversarial review
Codex flagged 7 real issues on the initial implementation. All fixed in commit
dce471fawith regression tests:mvfailure still wrote markerwrite_failurescount; any failure refuses marker + exit non-zeromarkInvestigatedfailure swallowed AND still reported successshortCircuitedand analytics on actual move success (ENOENT counts as success for concurrent-drain races; EACCES/EROFS/etc bumpfailed)investigate:falseworked on any kind, not just MANUAL_RECOVERYkind === "MANUAL_RECOVERY_INVOKED"--dry-runstill moved audit rows + wrote analyticsReal-disk verified after the fixes: hardened migration ran clean, flagged the two legacy rows, wrote the marker;
drain-faults --queue --dry-runreportsshortCircuited: 1with the audit file unchanged inpending-investigations/.Test Coverage
20 new tests across 3 files. 7 in drain-halt-events-audit-skip.test.ts (T1-T4 + M1 + L1 + H3), 13 in gstack-upgrade-migration-v1_40_5_0.test.ts (T9-T16 + H2 + M3 happy-path + M3 mark-shipped + M3 mark-phase-committed + malformed-row-no-marker).
Tests: 126 pass across the 7 touched suites (drain-halt-events-audit-skip, manual-recovery-emit-site, drain-halt-events, drain-faults, auto-drain, release-daemon, migration) including the merged PR #61 release-daemon tests.
Pre-Landing Review
Pass 1 (CRITICAL) categories cleared on Claude's structured review: no SQL, no LLM trust-boundary changes, no shell injection in the migration (jq + pure path strings + arg arrays), the new
"audit-skipped"outcome union member has no consumer dependency. The Codex adversarial review surfaced the 7 issues listed above; all are fixed.Plan Completion
All 16 acceptance criteria and phase tasks from
inbox/2026-05-20-fix-drain-faults-self-enqueue-plan.mdshipped and verified. Two minor deviations: (a) plan said "4 emit sites in cli.ts", actual count is 3 (the investigator report counted recovery entry points across the system, but auto-drain doesn't have its own emit — it inherits via the shared consumer); (b) migration marker filename usesv1.40.5.0.doneto match existinggstack-upgrade/migrations/convention (v1.40.0.0.done, v1.38.1.0.done, etc.) rather than the plan's literalmanual-recovery-audit-done. Behaviorally identical.The plan's original scope expanded to "investigate:false flag + kind split + auto-startup migration" during eng-review (~80 lines source / ~120 tests / ~45 min CC), then codex outside-voice on the plan pushed back ("VERDICT: scope-down" — kind split is schema churn without justification, startup-hook migration is a bad side effect). Final synthesis: flag yes, kind split no, migration only at
/gstack-upgrade. This PR ships that synthesis.Test plan
bun test build/orchestrator/__tests__/drain-halt-events-audit-skip.test.ts— 7 pass (T1-T4 + M1 + L1 + H3)bun test build/orchestrator/__tests__/manual-recovery-emit-site.test.ts— 3 pass (T5-T7)bun test test/gstack-upgrade-migration-v1_40_5_0.test.ts— 13 pass (T9-T16 + H2 + M3 ×3 + malformed-no-marker)bun teston touched suites — 126 pass after merging origin/main (PR fix(release-daemon): verify PR is actually merged on GitHub before writing "landed" #61)~/.gstack/skill-faults/— 2 legacyMANUAL_RECOVERY_INVOKEDrows gainedinvestigate: false, marker written cleanlybun build/orchestrator/cli.ts drain-faults --queue --dry-run:shortCircuited: 1, processed: 4, failed: 0. Audit file stays in pending/ (L1 invariant); zero codex callsKnown pre-existing failure (not from this branch)
integration.test.ts > release_queued without shippedAt/prNumber is detected as manual patch and resettimes out at ~27s on both the pre-change baseline and the post-change worktree. Test file not in this diff; not related to halt-events. Flagged for separate investigation.🤖 Generated with Claude Code