Skip to content

feat(continuation): #354 codex P1 fixes β€” retry-drain-widen + chain-state-persistence-fail#359

Closed
ronan-dandelion-cult wants to merge 27 commits intocael/325-canonical2from
frond-scribe/20260424/candidate-gpt2-354-codex-fixes
Closed

feat(continuation): #354 codex P1 fixes β€” retry-drain-widen + chain-state-persistence-fail#359
ronan-dandelion-cult wants to merge 27 commits intocael/325-canonical2from
frond-scribe/20260424/candidate-gpt2-354-codex-fixes

Conversation

@ronan-dandelion-cult
Copy link
Copy Markdown

Summary

  • Problem: PR [PARALLEL] feat(continuation): substrate-native chain-budget extraction (path-b)Β #354 had two confirmed-real P1 post-compaction delegate queue regressions: filtered drains stranded failed retries, and chain-state persistence failures still allowed queue ack.
  • Why it matters: transient spawn or session-store write failures could leave continuation delegates stuck until restart, or ack a spawned shard while durable chain state stayed stale.
  • What changed: fresh post-compaction drains now run a targeted pass followed by an unfiltered retry pass; chain-state persistence failures rethrow so queue recovery fails/retries; lifecycle released-delegate telemetry now counts accepted fresh drain deliveries.
  • What did NOT change (scope boundary): no queue storage format, config, retry cap, or gateway startup recovery contract changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: extraction moved post-compaction delegate delivery through the durable session-delivery queue, but dispatch only drained newly queued entry IDs and chain-state persistence errors were logged instead of propagated.
  • Missing detection / guardrail: no regression covered failed queued delegate retry across later drain cycles or session-store write failure after an accepted post-compaction spawn.
  • Contributing context (if known): pre-extraction inline dispatch counted accepted spawns directly and did not split enqueue/drain/ack across queue recovery.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/post-compaction-delegate-dispatch.test.ts
  • Scenario the test should lock in: failed post-compaction queue entries remain pending through backoff and are picked up by a later unfiltered drain; session-store write failure increments retry metadata instead of acking; lifecycle release count comes from accepted fresh drain deliveries.
  • Why this is the smallest reliable guardrail: the dispatch test exercises the actual queue drain and delivery seams without requiring gateway startup or external provider processes.
  • Existing test that already covers this (if any): existing queue storage/recovery tests cover retry metadata, backoff, and retry-budget exhaustion generically.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Operators should see post-compaction delegate retries recover during later drain cycles instead of waiting for restart, and lifecycle release counts no longer include queued-but-not-accepted delegate deliveries.

Diagram (if applicable)

Before:
enqueue fresh IDs -> targeted drain only -> failed old entries remain filtered out

After:
enqueue fresh IDs -> targeted drain -> lifecycle accepted count -> unfiltered retry drain -> eligible old failures retried

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 22+ workspace via pnpm
  • Model/provider: N/A (queue/delivery seam tests use mocks)
  • Integration/channel (if any): N/A
  • Relevant config (redacted): default test continuation runtime config

Steps

  1. Enqueue a post-compaction delegate and make the first spawn/persistence attempt fail.
  2. Run a later drain cycle without entryIds after retry backoff elapses.
  3. Observe the entry retry and ack only after successful delivery/persistence.

Expected

  • Failed entries remain queued with retry metadata and later eligible unfiltered drains retry them.
  • updateSessionStore failure causes queue failure/retry, not ack.
  • Lifecycle releasedDelegates reflects accepted fresh deliveries.

Actual

  • Fixed by this PR.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: pnpm tsgo; pnpm check; pnpm test src/auto-reply/reply/post-compaction-delegate-dispatch.test.ts; pnpm test src/infra/session-delivery-queue-storage.ts src/infra/session-delivery-queue-recovery.ts; pnpm check:additional; pnpm build; pnpm check:changed --base 3eb87a7 --head HEAD.
  • Edge cases checked: targeted drains still bypass fresh-ID backoff; unfiltered retry drains respect backoff; persistence failure records retry metadata; lifecycle counts accepted fresh deliveries only.
  • What you did not verify: live gateway/provider spawning outside the mocked seam.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: A retry after persistence failure can be at-least-once because the first spawn may already have been accepted.
    • Mitigation: durable queue now avoids acking stale chain state; retry cap and backoff still bound repeated attempts.

Test User and others added 27 commits April 26, 2026 15:10
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The mock at server-restart-sentinel.test.ts:108 used a zero-arg
`vi.fn(async () => ({...}))` lambda which locked TS to a
`() => Promise<...>` signature. The .mockImplementationOnce at line 294
then could not pass an opts param.

Type the vi.fn declaration to the full
RecoverPendingSessionDeliveriesParams shape via Parameters<typeof ...>[0]
so .mockImplementationOnce arg-shape stops getting rejected.
When the post-compaction delegate dispatch logic was extracted from
agent-runner.ts (cdc9b6e), the slice that enforced
maxDelegatesPerTurn (and accounted for any bracket-style delegate
already spawned this turn via bracketDelegateOffset) was dropped. The
new path mapped over the full persisted+staged list, so a turn could
exceed the configured per-turn delegate budget.

Caught by codex review on PR #354 (r3144284490) and byte-confirmed by
🌻 / 🌊. Restoration mirrors the pre-extraction behavior at
src/auto-reply/reply/agent-runner.ts pre-cdc9b6ecd54:

  bracketDelegateOffset = continuationSignalKind === "delegate" ? 1 : 0
  compactionBudget = max(0, maxDelegatesPerTurn - bracketDelegateOffset)
  released = allCompactionDelegates.slice(0, compactionBudget)

resolveContinuationRuntimeConfig is now a dispatch dep so tests can
inject runtime configs without process-wide loadConfig calls.

Adds three regression tests: cap drops overflow, bracket offset reduces
budget by one, zero-budget edge case enqueues nothing.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

πŸ’‘ Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1b5aef7b17

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with πŸ‘.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +487 to +489
if (result === "spawned") {
deliveredDelegates += 1;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include rejected delegates in dropped lifecycle accounting

This drain path only increments deliveredDelegates when a queued delegate is spawned, but it never tracks terminal "rejected" outcomes (for example, chain-length or cost-cap rejections). Because the lifecycle event later uses the precomputed droppedCompactionDelegates, those rejected delegates are acknowledged and removed from the queue without being counted as dropped, so operators can see Released 0 ... with no corresponding dropped count even though delegates were definitively not released.

Useful? React with πŸ‘Β / πŸ‘Ž.

@ronan-dandelion-cult
Copy link
Copy Markdown
Author

Author note: posted via gh CLI auth'd as ronan-dandelion-cult; content-author is frond-scribe (Claude Opus 4.7 / 1M context). Closing the frond-scribe-dispatched gpt-5.5-codex-fixes lane as superseded; dispatch-ownership is mine. Per operator's "stun by channel noise" diagnostic 2026-04-27 ~00:53Z + the parallel-track-as-evidence canon banked at #360 close (issuecomment-4323425139).

Closing as superseded by #354 (1870a84f32 on cael/325-canonical2)

PR #354 (🌊 amend lane on frond-scribe/20260424/candidate-gpt2-343-path-b) merged at 00:23Z with all three codex P1/P2 fixes folded in:

  • r3144331033 β€” drain entryIds filter strands retries (P1, retry-drain-widen)
  • r3144344309 β€” swallowed chain-state persistence failure (P1, persist-rethrow with caller propagation)
  • r3144344310 β€” premature lifecycle releasedDelegates count (P2, count-from-accepted-deliveries)

#354's quorum-2 was 🌫 + 🌻 byte-walking the repaired head. canonical2 advance via #354 β†’ 1870a84f32.

This PR's parallel impl (1b5aef7b171a)

The frond-scribe-dispatched gpt-5.5 worker on frond-scribe/20260424/candidate-gpt2-354-codex-fixes (sibling lane) produced +242/-39 focused fix-only patches for the same three findings. The two lanes converged on the same direction (drain widening, persist-rethrow, accepted-count for lifecycle) β€” independent-implementation cross-validation evidence that the fix shape is right.

The merged path (#354) is the canonical-track outcome; this PR's branch stays available as parallel-track-evidence per #326 savegame discipline (not force-pushed, not deleted), but no merge action on it. Closing.

🌿 frond-scribe

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant