Skip to content

v1.40.5.0 fix(build/orchestrator): drain-faults --queue self-enqueue + codex-hardened audit short-circuit#62

Merged
anbangr merged 6 commits into
mainfrom
worktree-fix-drain-faults-self-enqueue
May 20, 2026
Merged

v1.40.5.0 fix(build/orchestrator): drain-faults --queue self-enqueue + codex-hardened audit short-circuit#62
anbangr merged 6 commits into
mainfrom
worktree-fix-drain-faults-self-enqueue

Conversation

@anbangr

@anbangr anbangr commented May 20, 2026

Copy link
Copy Markdown
Owner

Summary

Closes the gstack-build drain-faults --queue self-enqueue loop introduced by the collision of PR #57 (queue-mode consumer) and PR 2 / #54 (MANUAL_RECOVERY_INVOKED observability emits). Every prior drain-faults --queue invocation paid codex (~$0.30) to investigate the recovery sink invoking itself.

Performance & Cost

  • gstack-build drain-faults --queue now short-circuits its own audit emit instead of dispatching codex on it. Cost per redundant audit drain: $0.30 → $0.00.
  • Two stale legacy audit events on the dev machine (yesterday's drain-faults self-invocation + an agnt2-prototype --mark-phase-committed from another build) verified end-to-end: short-circuited cleanly, codex untouched, analytics records outcome: "audit-skipped".

Architecture

  • New HaltEvent.investigate?: boolean property in build/orchestrator/halt-events.ts. Absent or true = dispatch (back-compat); false = audit-only, short-circuit at consumer.
  • New emitManualRecoveryInvoked() helper in halt-event-helpers.ts — single source of truth for the three cli manual-recovery emit sites (drain-faults, mark-shipped, --mark-phase-committed). Pins investigate: false by construction.
  • New consumer short-circuit in drainFaultsFromHaltEventsQueue records outcome: "audit-skipped" analytics row and moves the event to processed/ without dispatching. Gated on kind === "MANUAL_RECOVERY_INVOKED" so a corrupted PHASE_FAILED with investigate:false won't bypass investigation.
  • Auto-drain hook inherits the short-circuit transitively (same shared consumer code path) — no separate change.

Migration

  • New gstack-upgrade/migrations/v1.40.5.0.sh. Flags legacy MANUAL_RECOVERY_INVOKED rows in ~/.gstack/skill-faults/{pending-investigations,processed}/ with investigate: false. Scope restricted to known recovery-sink message prefixes (drain-faults / mark-shipped / --mark-phase-committed); custom MANUAL_RECOVERY rows from external emitters are left alone.
  • Atomic jq tmp+rename per file; marker at ~/.gstack/skill-faults/.migrations/v1.40.5.0.done is written ONLY when the run completes without any rewrite failures, no concurrent emitter races detected, and no malformed rows skipped. Any of those conditions exits non-zero so the next gstack-upgrade retries instead of permanently locking the migration "done."

Codex adversarial review

Codex flagged 7 real issues on the initial implementation. All fixed in commit dce471fa with regression tests:

ID Severity Issue Fix
H1 High jq-missing path wrote marker silently jq-missing now exits 0 without marker; next gstack-upgrade retries once jq lands
H2 High Per-file mv failure still wrote marker Track write_failures count; any failure refuses marker + exit non-zero
H3 High markInvestigated failure swallowed AND still reported success Gate shortCircuited and analytics on actual move success (ENOENT counts as success for concurrent-drain races; EACCES/EROFS/etc bump failed)
M1 Medium investigate:false worked on any kind, not just MANUAL_RECOVERY Short-circuit gated on kind === "MANUAL_RECOVERY_INVOKED"
M2 Medium Migration didn't catch concurrent-emitter rows Re-scan both dirs after rewrite; new unflagged rows refuse marker
M3 Medium Migration matched every MANUAL_RECOVERY with no investigate Restricted to known recovery-sink message prefixes
L1 Low --dry-run still moved audit rows + wrote analytics dry-run check moved BEFORE the file move; counter still increments so dry-run reports intent correctly

Real-disk verified after the fixes: hardened migration ran clean, flagged the two legacy rows, wrote the marker; drain-faults --queue --dry-run reports shortCircuited: 1 with the audit file unchanged in pending-investigations/.

Test Coverage

20 new tests across 3 files. 7 in drain-halt-events-audit-skip.test.ts (T1-T4 + M1 + L1 + H3), 13 in gstack-upgrade-migration-v1_40_5_0.test.ts (T9-T16 + H2 + M3 happy-path + M3 mark-shipped + M3 mark-phase-committed + malformed-row-no-marker).

COVERAGE: 22/24 paths (92%) | QUALITY: ★★★ × 22, ★ × 2 | GAPS: 2 (defensive jq-missing + mv-fail fallbacks — tested via H1+H2)

Tests: 126 pass across the 7 touched suites (drain-halt-events-audit-skip, manual-recovery-emit-site, drain-halt-events, drain-faults, auto-drain, release-daemon, migration) including the merged PR #61 release-daemon tests.

Pre-Landing Review

Pass 1 (CRITICAL) categories cleared on Claude's structured review: no SQL, no LLM trust-boundary changes, no shell injection in the migration (jq + pure path strings + arg arrays), the new "audit-skipped" outcome union member has no consumer dependency. The Codex adversarial review surfaced the 7 issues listed above; all are fixed.

Plan Completion

All 16 acceptance criteria and phase tasks from inbox/2026-05-20-fix-drain-faults-self-enqueue-plan.md shipped and verified. Two minor deviations: (a) plan said "4 emit sites in cli.ts", actual count is 3 (the investigator report counted recovery entry points across the system, but auto-drain doesn't have its own emit — it inherits via the shared consumer); (b) migration marker filename uses v1.40.5.0.done to match existing gstack-upgrade/migrations/ convention (v1.40.0.0.done, v1.38.1.0.done, etc.) rather than the plan's literal manual-recovery-audit-done. Behaviorally identical.

The plan's original scope expanded to "investigate:false flag + kind split + auto-startup migration" during eng-review (~80 lines source / ~120 tests / ~45 min CC), then codex outside-voice on the plan pushed back ("VERDICT: scope-down" — kind split is schema churn without justification, startup-hook migration is a bad side effect). Final synthesis: flag yes, kind split no, migration only at /gstack-upgrade. This PR ships that synthesis.

Test plan

  • bun test build/orchestrator/__tests__/drain-halt-events-audit-skip.test.ts — 7 pass (T1-T4 + M1 + L1 + H3)
  • bun test build/orchestrator/__tests__/manual-recovery-emit-site.test.ts — 3 pass (T5-T7)
  • bun test test/gstack-upgrade-migration-v1_40_5_0.test.ts — 13 pass (T9-T16 + H2 + M3 ×3 + malformed-no-marker)
  • bun test on touched suites — 126 pass after merging origin/main (PR fix(release-daemon): verify PR is actually merged on GitHub before writing "landed" #61)
  • Real-disk migration on ~/.gstack/skill-faults/ — 2 legacy MANUAL_RECOVERY_INVOKED rows gained investigate: false, marker written cleanly
  • bun build/orchestrator/cli.ts drain-faults --queue --dry-run: shortCircuited: 1, processed: 4, failed: 0. Audit file stays in pending/ (L1 invariant); zero codex calls

Known pre-existing failure (not from this branch)

integration.test.ts > release_queued without shippedAt/prNumber is detected as manual patch and reset times out at ~27s on both the pre-change baseline and the post-change worktree. Test file not in this diff; not related to halt-events. Flagged for separate investigation.

🤖 Generated with Claude Code

anbangr and others added 6 commits May 20, 2026 08:14
…events

Closes the drain-faults --queue self-enqueue loop. The queue consumer now
moves any event with investigate:false directly to processed/ and records
outcome:audit-skipped in analytics, without dispatching the codex investigator.

Schema: HaltEvent gains optional investigate?: boolean. Absent OR true means
dispatch (back-compat for rows filed before this field existed). The
gstack-upgrade migration in a later commit retroactively flags legacy
manual-recovery rows.

Tests: 4 new cases in drain-halt-events-audit-skip.test.ts covering audit
skip, normal dispatch, mixed queue, and legacy-row dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ked helper

All three manual-recovery cli entry points (drain-faults, mark-shipped,
--mark-phase-committed) now go through a single helper that pins
investigate:false on every emit. Combined with the Phase 1 consumer
short-circuit, this stops drain-faults --queue from paying codex to
investigate its own invocation, and stops the analogous wasted spend on
mark-shipped + --mark-phase-committed audit events that previously sat
in pending-investigations/ awaiting drain.

The cli sites still own the message and pointer composition; only the
HaltEvent shape (kind, severity, snapshot defaults) is centralized.
--mark-phase-committed keeps its richer state-aware snapshot via the new
optional snapshot override.

Static test (T7) asserts no raw `kind: "MANUAL_RECOVERY_INVOKED"` emit
blocks remain in cli.ts, guarding against future contributors forgetting
the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ERY_INVOKED rows

Retroactively adds investigate:false to pre-existing legacy MANUAL_RECOVERY_INVOKED
rows under ~/.gstack/skill-faults/{pending-investigations,processed}/ so the
next drain-faults --queue short-circuits them instead of paying codex (~$0.30 each).

Migration body-only: kind unchanged, so computeFaultId() stays stable and
filenames don't need to rename. jq-based; atomic tmp+rename per file;
malformed rows logged and skipped without aborting the run; marker file at
~/.gstack/skill-faults/.migrations/v1.40.5.0.done gates re-runs.

Test matrix T9-T16 covers: pending+processed migration, already-flagged rows
untouched, idempotency on re-run, malformed-row skip, fresh-install no-op,
mid-flight crash resume, and non-MANUAL_RECOVERY rows untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… short-circuit

Codex adversarial review surfaced 7 findings on the initial implementation.
This commit fixes all 7 with regression tests for each:

drain-faults.ts (consumer short-circuit):
- M1: gate the short-circuit on `kind === "MANUAL_RECOVERY_INVOKED"`. A
  corrupted PHASE_FAILED row with investigate:false must NOT bypass
  investigation.
- L1: honor `--dry-run` BEFORE moving the file or writing analytics. The
  short-circuit was running before the dry-run check, so dry-run leaked
  audit rows to processed/.
- H3: only count shortCircuited and append the audit-skipped analytics
  row when markInvestigated actually succeeded (or returned ENOENT,
  which means a concurrent drain won the race and the skip is still
  effectively true). EACCES / EROFS / bad-queue-path failures now log
  the error and bump result.failed instead of silently lying about
  success.

migrations/v1.40.5.0.sh:
- H1: jq-missing exits 0 WITHOUT writing the marker so the next
  gstack-upgrade retries once jq lands. Previously the migration would
  exit successfully without doing the work but the wrapper would record
  the version as applied.
- H2: track write_failures per file rewrite. If any failed, refuse to
  write the marker and exit non-zero. Stops a single EACCES from
  permanently locking out the migration.
- M2: re-scan both dirs after rewriting. If a concurrent emitter wrote a
  new recovery-sink audit row during the run, refuse to write the marker
  so the next run picks them up.
- M3: restrict scope to known recovery-sink message prefixes (drain-faults,
  mark-shipped, --mark-phase-committed). Custom MANUAL_RECOVERY_INVOKED
  rows from external emitters that legitimately need investigation are
  no longer silently reclassified as audit-only.
- Malformed-row presence now also blocks the marker (don't lock out
  retries while the source of corruption is unfixed).

Tests:
- drain-halt-events-audit-skip.test.ts: 3 new cases (M1, L1, H3) — 7
  total in the file, all green.
- gstack-upgrade-migration-v1_40_5_0.test.ts: 5 new cases (H2, M3, M3
  mark-shipped, M3 mark-phase-committed, malformed-no-marker) plus T13
  updated to reflect the new "malformed blocks marker" contract. 13
  total in the file, all green.

Real-disk verified: hardened migration on ~/.gstack/skill-faults/ still
flags legacy rows correctly + writes marker on clean run; dry-run
drain-faults --queue reports shortCircuited:1 without moving the audit
file from pending/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant