Skip to content

fix(workflows): interactive loop resume uses fresh session on first iteration#1923

Merged
Wirasm merged 3 commits into
devfrom
archon/task-archon-fix-github-issue-1780923940901
Jun 8, 2026
Merged

fix(workflows): interactive loop resume uses fresh session on first iteration#1923
Wirasm merged 3 commits into
devfrom
archon/task-archon-fix-github-issue-1780923940901

Conversation

@Wirasm

@Wirasm Wirasm commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Problem: Interactive loop nodes (interactive: true) crash with error_during_execution on every iteration after the first approval gate. Iteration 1 runs fine; resuming at iteration 2+ always fails.
  • Why it matters: Blocks all interactive loop workflows (e.g. archon-piv-loop) from completing after the first human gate — the only workaround was fresh_context: true on every node.
  • What changed: One-line fix in dag-executor.ts — added (isLoopResume && i === startIteration) to the needsFreshSession condition so the first iteration of every interactive resume starts with a fresh SDK session instead of reusing a potentially stale one. Updated one test expectation and added a regression test.
  • What did NOT change: fresh_context: true loops, non-interactive loops, persist_session loops, and PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291's fail-loud throw are all untouched.

UX Journey

Before

User                  Archon (DAG executor)           Claude SDK
────                  ─────────────────────           ──────────
approves gate ──────▶ resume loop, i=2
                      needsFreshSession = false
                      resumeSessionId = 'session-from-hours-ago'
                      sendQuery(..., sessionId) ─────▶ session expired → error_during_execution
                      PR #1291 throw ◀──────────────── isError: true
workflow FAILS ◀────  node_failed event

After

User                  Archon (DAG executor)           Claude SDK
────                  ─────────────────────           ──────────
approves gate ──────▶ resume loop, i=2
                      [needsFreshSession = TRUE]      (isLoopResume && i === startIteration)
                      [resumeSessionId = undefined]
                      sendQuery(..., undefined) ──────▶ fresh session started
                      receives result ◀─────────────── sessionId: 'new-session'
                      [currentSessionId = 'new-session']
iteration 3+          i=3, needsFreshSession = false
                      resumeSessionId = 'new-session' (valid, from i=2)
                      ...continues normally

Architecture Diagram

Before

dag-executor.ts (executeLoop)
  └─ isLoopResume=true, startIteration=2
     needsFreshSession = fresh_context || i===1   ← i=2, always false on resume
     resumeSessionId   = currentSessionId          ← stale loopGateMeta.sessionId
     sendQuery(prompt, opts, staleSessionId)        ← crashes

After

dag-executor.ts (executeLoop)
  └─ isLoopResume=true, startIteration=2
     [~] needsFreshSession = fresh_context || i===1 || (isLoopResume && i===startIteration)
     resumeSessionId   = needsFreshSession ? undefined : currentSessionId
     sendQuery(prompt, opts, undefined)             ← fresh session, no crash
     currentSessionId ← msg.sessionId              ← updated after i=2 for i=3+

Connection inventory:

From To Status Notes
dag-executor.ts:executeLoop aiClient.sendQuery modified resumeSessionId is now undefined on first resumed iteration
dag-executor.ts:executeLoop loopGateMeta.sessionId unchanged Still read, but now discarded (not passed as resumeSessionId) on first resume

Label Snapshot

  • Risk: risk: low
  • Size: size: XS
  • Scope: workflows
  • Module: workflows:executor

Change Metadata

  • Change type: bug
  • Primary scope: workflows

Linked Issue

Validation Evidence (required)

bun run validate
Check Result
Bundled defaults ✅ 36 commands, 20 workflows up to date
Bundled skill ✅ 23 files across 2 skills up to date
Bundled schema ✅ Up to date
Type check ✅ No errors across all 10 packages
Lint ✅ 0 errors, 0 warnings
Format ✅ All files formatted
Tests ✅ 4721 passed, 0 failed

Directly relevant tests:

  • interactive loop resumes from stored iteration with user input — updated expectation (undefined instead of 'loop-session-1') ✅
  • interactive loop resume does not crash with error_during_execution on iteration 2 — new regression test ✅
  • loop iteration fails loudly when SDK returns error_during_execution — unchanged fail-loud path ✅

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — the only observable behavior change is that the first iteration of an interactive loop resume no longer passes a potentially stale session ID to the SDK.
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

  • Verified via test suite that iteration 2 of a resumed interactive loop receives resumeSessionId = undefined (fresh session).
  • Verified that subsequent iterations (i > startIteration) continue to use currentSessionId from the previous iteration, preserving multi-turn continuity within a resume.
  • Verified the fail-loud isError path (PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291) still triggers correctly for non-stale-session failures.
  • Not verified: live end-to-end with real Claude SDK against an actual expired session (would require waiting hours for real session expiry).

Side Effects / Blast Radius (required)

  • Affected: Interactive loop resume path only (isLoopResume && i === startIteration).
  • Not affected: fresh_context: true loops, non-interactive loops, persist_session loops, or any non-loop DAG nodes.
  • Potential unintended effect: If a session is still valid when the user approves (e.g. immediate approval), we now discard a valid session and start fresh. Trade-off is intentional: the cost is a slightly slower start; the benefit is eliminating the crash. User input is carried via $LOOP_USER_INPUT so session continuity is not required for context.
  • Guardrails: PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291's fail-loud throw remains the backstop for any other failure mode.

Rollback Plan (required)

  • Fast rollback: git revert c06173bd
  • No feature flags needed (one-line condition change).
  • Observable failure symptoms: error_during_execution in workflow run events for interactive loop nodes on iteration 2+.

Risks and Mitigations

  • Risk: Fresh session loses conversational context from iteration 1.
    • Mitigation: $LOOP_USER_INPUT carries the user's approval comment into the prompt; the previous iteration's output is available via $LOOP_PREV_OUTPUT if configured. Session continuity for multi-turn refinement is restored from iteration 2 onward (currentSessionId is updated after the fresh first iteration).

Summary by CodeRabbit

  • Bug Fixes

    • Interactive loops now use fresh AI sessions when resuming after user input, preventing errors from stale stored sessions.
  • Documentation

    • Clarified when loop sessions are always treated as fresh, including after interactive approval resumption.
  • Tests

    • Added regression test for stale session handling during resumed iterations.

When resuming a paused interactive loop gate (e.g. archon-piv-loop),
iteration 2+ always failed with error_during_execution because
needsFreshSession was (loop.fresh_context || i === 1) — but on resume
the loop begins at startIteration >= 2, so the condition was never
true and the executor blindly passed the stored gate sessionId to
the SDK. After hours of human review wait that session is typically
expired, which is the failure the issue reporter observed.

Changes:
- packages/workflows/src/dag-executor.ts: force a fresh session on
  the first iteration of every interactive loop resume by extending
  needsFreshSession with (isLoopResume && i === startIteration).
  Subsequent iterations in the same resume continue to thread
  currentSessionId from the fresh first iteration, preserving
  multi-turn continuity. User feedback is already carried via
  \$LOOP_USER_INPUT so session continuity is not required.
- packages/workflows/src/dag-executor.test.ts: update the existing
  "interactive loop resumes from stored iteration" test to expect
  an undefined sessionArg (the test was enforcing the buggy
  behavior).
- packages/workflows/src/dag-executor.test.ts: add a regression
  test that simulates a resumed interactive loop with a stale
  stored sessionId and asserts the SDK is invoked with undefined
  and no failure events are emitted.

Coordinates with #1291 (fail loudly on SDK isError) and #1294
(orchestrator-side stale-session recovery) without reintroducing
any retry loop — we simply avoid passing the stale id.

Fixes #1208
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c87a05c9-6f96-4ece-b789-cda58b21066d

📥 Commits

Reviewing files that changed from the base of the PR and between 282e9ff and 634af47.

📒 Files selected for processing (3)
  • packages/docs-web/src/content/docs/guides/loop-nodes.md
  • packages/workflows/src/dag-executor.test.ts
  • packages/workflows/src/dag-executor.ts

📝 Walkthrough

Walkthrough

Fix for interactive loop resume crashes by forcing fresh AI session on resumed iteration instead of reusing stale session stored at gate. Executor logic updated, existing test assertion corrected to expect fresh session, new regression test added, and documentation clarified.

Changes

Interactive Loop Resume Session Handling

Layer / File(s) Summary
Session threading fix and test validation
packages/workflows/src/dag-executor.ts, packages/workflows/src/dag-executor.test.ts
executeLoopNode now forces fresh session when resuming interactively (i === startIteration), preventing reuse of stale sessions. Test assertion updated to expect undefined (fresh session) on first resumed iteration. New regression test verifies no error_during_execution crash and no loop_iteration_failed or node_failed events when resuming with expired gate session id.
Documentation clarification
packages/docs-web/src/content/docs/guides/loop-nodes.md
fresh_context description expanded to clarify that first iteration after interactive gate resume also runs fresh, noting session expiry risk during user wait time and that resumed user feedback is provided via $LOOP_USER_INPUT.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • coleam00/Archon#1367: Both PRs modify executeLoopNode's interactive-resume behavior for the first resumed iteration (main PR forces a fresh session when iteration === startIteration, while the other PR ensures $LOOP_PREV_OUTPUT is empty on that same first resumed iteration), so they are code-path related.

Poem

🐰 A loop once stumbled, session stale,
Waiting for humans, it would fail,
Fresh context blooms when gates resume,
No more crashes in the loop's room! 🎉

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch archon/task-archon-fix-github-issue-1780923940901

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Wirasm

Wirasm commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

🔍 Comprehensive PR Review

PR: #1923 — fix(workflows): interactive loop resume uses fresh session on first iteration
Reviewed by: 5 specialized agents (code-review, error-handling, test-coverage, comment-quality, docs-impact)
Date: 2026-06-08


Summary

This is a surgical, correct fix for issue #1208. The one-line addition to needsFreshSession in dag-executor.ts precisely prevents a stale/expired session ID from being passed to the SDK on the first resumed iteration of an interactive loop — eliminating the error_during_execution crash. All five agents returned APPROVE. No critical or high issues found. The fix introduces no new error paths, no silent fallbacks, no type violations, and no scope creep.

Verdict: ✅ APPROVE

Severity Count
🔴 CRITICAL 0
🟠 HIGH 0
🟡 MEDIUM 0
🟢 LOW 5 (after deduplication)

🟢 Low Issues (All Optional)

LOW-1: currentSessionId initialized to stale value, then immediately discarded

📍 packages/workflows/src/dag-executor.ts:1994

currentSessionId is initialized to loopGateMeta.sessionId (the stale ID) on line 1994, but the fix correctly forces resumeSessionId = undefined for the first resumed iteration, so the stale value never reaches the SDK. After the first iteration, currentSessionId is overwritten at line 2142. Harmless; the 6-line comment explains the invariant. Recommendation: leave as-is.

Reported by: code-review + error-handling


LOW-2: Test event-filter cast doesn't guard against undefined argument

📍 packages/workflows/src/dag-executor.test.ts (new regression test)

The filter uses (call[0] as Record<string, unknown>).event_type — if call[0] were ever undefined, this would throw inside filter() obscuring the test failure. Not possible today with this mock. Recommendation: leave as-is.

Reported by: error-handling


LOW-3: No test for session threading on iteration startIteration+1 of a resume

📍 packages/workflows/src/dag-executor.test.ts:4958

Both new tests complete in one resumed iteration. The threading mechanism at line 2142 is unchanged and covered by the existing "fresh_context: false threads session" test (line 4411). Recommendation: leave as-is; optional follow-up.

Optional future test sketch
it('interactive loop resume threads fresh session into iterations after startIteration', async () => {
  let callCount = 0;
  mockSendQueryDag.mockImplementation(function* () {
    callCount++;
    if (callCount >= 2) {
      yield { type: 'assistant', content: 'Done. <promise>APPROVED</promise>' };
    } else {
      yield { type: 'assistant', content: 'Still working.' };
    }
    yield { type: 'result', sessionId: `fresh-session-${String(callCount)}` };
  });
  // ... assert calls[0][2] === undefined, calls[1][2] === 'fresh-session-1'
});

Reported by: code-review + test-coverage


LOW-4: Production comment names "Claude SDK" specifically, but fix is provider-agnostic

📍 packages/workflows/src/dag-executor.ts:2053-2057

The comment says "may reference a Claude SDK session that expired." The needsFreshSession guard applies to all providers that return a sessionId. A future maintainer adding a new provider with session resume could read "Claude SDK" and conclude the guard doesn't apply to them.

Suggested one-line reword (optional, trivial):

// Session threading. Force a fresh session on the first iteration of an
// interactive loop resume: the stored session id from the gate metadata
// may be stale after a human review wait — passing it to the SDK can
// trigger errors (e.g. error_during_execution on Claude SDK).
// User feedback is carried via $LOOP_USER_INPUT, so session continuity is
// not required for the first resumed iteration.

Reported by: comment-quality


LOW-5: loop-nodes.md understates the freshness guarantee for interactive loop resumes

📍 packages/docs-web/src/content/docs/guides/loop-nodes.md:135

Line 135 says "The first iteration is always fresh regardless of this setting." After this PR the guarantee also covers the first iteration after an interactive loop resume — even if its numeric index is 2+. A user designing a fresh_context: false interactive loop expecting full session continuity across an approval gate will be surprised.

Suggested one-sentence addition (optional):

The first iteration is always fresh regardless of this setting. The first
iteration executed after resuming from an interactive loop approval gate is
also always fresh — the stored session id may have expired during the human
review wait, and user feedback is carried via `$LOOP_USER_INPUT` instead.

Reported by: docs-impact


✅ What's Good


📋 Suggested Follow-up Issues (if not fixing in this PR)

Title Priority
Document interactive loop resume session freshness in loop-nodes.md P2
Add multi-iteration resume session-threading test P3

Next Steps

  1. Optional: Apply LOW-4 comment broadening (trivial one-liner — no block on merge)
  2. Optional: Apply LOW-5 docs update (one sentence in loop-nodes.md)
  3. Nothing blocks merge — mark ready for review when comfortable

Reviewed by Archon prp-core:prp-review-agents workflow

- Broaden comment in dag-executor.ts: remove false implication that the
  stale-session guard is Claude-only; fix applies to any provider with
  sessionResume support
- Update loop-nodes.md: document that the first iteration after resuming
  from an interactive loop approval gate is also always fresh, so users
  relying on fresh_context:false across a gate boundary are not surprised
@Wirasm

Wirasm commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to archon/task-archon-fix-github-issue-1780923940901
Commit: aea5f8f
Philosophy: Fix everything unless clearly a new concern


Fixes Applied (2 total)

Severity Count
🔴 CRITICAL 0
🟠 HIGH 0
🟡 MEDIUM 0
🟢 LOW 2
View all fixes
  • Comment broadened: "Claude SDK" → provider-agnostic (packages/workflows/src/dag-executor.ts:2053-2057) — The comment said "may reference a Claude SDK session that expired." Changed to "may be stale after a human review wait — passing it to the SDK can trigger errors (e.g. error_during_execution on Claude SDK)." Removes the false implication that the guard only applies to Claude; it covers any provider with sessionResume support.

  • loop-nodes.md: document interactive resume freshness (packages/docs-web/src/content/docs/guides/loop-nodes.md:135) — Added sentence after "The first iteration is always fresh regardless of this setting." to explicitly cover the resume case: users relying on fresh_context: false for session continuity across an interactive approval gate would otherwise be surprised.


Tests Added

(none)


Skipped (3)

All three were explicitly recommended as "leave as-is" or "skip" by the review agents:

Finding Reason
currentSessionId stale init YAGNI — existing comment explains invariant; both agents recommended Option A (leave as-is)
Test event-filter cast fragility Test-only risk; agents recommended leave as-is
Multi-iter resume session threading test Orthogonal to this fix; existing general threading test covers the mechanism

Suggested Follow-up Issues

(none)


Validation

✅ Type check | ✅ Lint | ✅ Tests (283 passed)


Self-fix by Archon · aggressive mode · fixes pushed to archon/task-archon-fix-github-issue-1780923940901

@Wirasm Wirasm marked this pull request as ready for review June 8, 2026 18:59
@Wirasm Wirasm merged commit ab4cde6 into dev Jun 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix: interactive loop resume crashes with error_during_execution (stale session)

1 participant