fix(workflows): interactive loop resume uses fresh session on first iteration by Wirasm · Pull Request #1923 · coleam00/Archon

Wirasm · 2026-06-08T13:23:14Z

Summary

Problem: Interactive loop nodes (interactive: true) crash with error_during_execution on every iteration after the first approval gate. Iteration 1 runs fine; resuming at iteration 2+ always fails.
Why it matters: Blocks all interactive loop workflows (e.g. archon-piv-loop) from completing after the first human gate — the only workaround was fresh_context: true on every node.
What changed: One-line fix in dag-executor.ts — added (isLoopResume && i === startIteration) to the needsFreshSession condition so the first iteration of every interactive resume starts with a fresh SDK session instead of reusing a potentially stale one. Updated one test expectation and added a regression test.
What did NOT change: fresh_context: true loops, non-interactive loops, persist_session loops, and PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291's fail-loud throw are all untouched.

UX Journey

Before

User                  Archon (DAG executor)           Claude SDK
────                  ─────────────────────           ──────────
approves gate ──────▶ resume loop, i=2
                      needsFreshSession = false
                      resumeSessionId = 'session-from-hours-ago'
                      sendQuery(..., sessionId) ─────▶ session expired → error_during_execution
                      PR #1291 throw ◀──────────────── isError: true
workflow FAILS ◀────  node_failed event

After

User                  Archon (DAG executor)           Claude SDK
────                  ─────────────────────           ──────────
approves gate ──────▶ resume loop, i=2
                      [needsFreshSession = TRUE]      (isLoopResume && i === startIteration)
                      [resumeSessionId = undefined]
                      sendQuery(..., undefined) ──────▶ fresh session started
                      receives result ◀─────────────── sessionId: 'new-session'
                      [currentSessionId = 'new-session']
iteration 3+          i=3, needsFreshSession = false
                      resumeSessionId = 'new-session' (valid, from i=2)
                      ...continues normally

Architecture Diagram

Before

dag-executor.ts (executeLoop)
  └─ isLoopResume=true, startIteration=2
     needsFreshSession = fresh_context || i===1   ← i=2, always false on resume
     resumeSessionId   = currentSessionId          ← stale loopGateMeta.sessionId
     sendQuery(prompt, opts, staleSessionId)        ← crashes

After

dag-executor.ts (executeLoop)
  └─ isLoopResume=true, startIteration=2
     [~] needsFreshSession = fresh_context || i===1 || (isLoopResume && i===startIteration)
     resumeSessionId   = needsFreshSession ? undefined : currentSessionId
     sendQuery(prompt, opts, undefined)             ← fresh session, no crash
     currentSessionId ← msg.sessionId              ← updated after i=2 for i=3+

Connection inventory:

From	To	Status	Notes
`dag-executor.ts:executeLoop`	`aiClient.sendQuery`	modified	`resumeSessionId` is now `undefined` on first resumed iteration
`dag-executor.ts:executeLoop`	`loopGateMeta.sessionId`	unchanged	Still read, but now discarded (not passed as `resumeSessionId`) on first resume

Label Snapshot

Risk: risk: low
Size: size: XS
Scope: workflows
Module: workflows:executor

Change Metadata

Change type: bug
Primary scope: workflows

Linked Issue

Fixes fix: interactive loop resume crashes with error_during_execution (stale session) #1208
Related fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291 (fail-loud on isError — what made this crash visible)
Related fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop #1294 (orchestrator-side equivalent fix)

Validation Evidence (required)

bun run validate

Check	Result
Bundled defaults	✅ 36 commands, 20 workflows up to date
Bundled skill	✅ 23 files across 2 skills up to date
Bundled schema	✅ Up to date
Type check	✅ No errors across all 10 packages
Lint	✅ 0 errors, 0 warnings
Format	✅ All files formatted
Tests	✅ 4721 passed, 0 failed

Directly relevant tests:

interactive loop resumes from stored iteration with user input — updated expectation (undefined instead of 'loop-session-1') ✅
interactive loop resume does not crash with error_during_execution on iteration 2 — new regression test ✅
loop iteration fails loudly when SDK returns error_during_execution — unchanged fail-loud path ✅

Security Impact (required)

New permissions/capabilities? No
New external network calls? No
Secrets/tokens handling changed? No
File system access scope changed? No

Compatibility / Migration

Backward compatible? Yes — the only observable behavior change is that the first iteration of an interactive loop resume no longer passes a potentially stale session ID to the SDK.
Config/env changes? No
Database migration needed? No

Human Verification (required)

Verified via test suite that iteration 2 of a resumed interactive loop receives resumeSessionId = undefined (fresh session).
Verified that subsequent iterations (i > startIteration) continue to use currentSessionId from the previous iteration, preserving multi-turn continuity within a resume.
Verified the fail-loud isError path (PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291) still triggers correctly for non-stale-session failures.
Not verified: live end-to-end with real Claude SDK against an actual expired session (would require waiting hours for real session expiry).

Side Effects / Blast Radius (required)

Affected: Interactive loop resume path only (isLoopResume && i === startIteration).
Not affected: fresh_context: true loops, non-interactive loops, persist_session loops, or any non-loop DAG nodes.
Potential unintended effect: If a session is still valid when the user approves (e.g. immediate approval), we now discard a valid session and start fresh. Trade-off is intentional: the cost is a slightly slower start; the benefit is eliminating the crash. User input is carried via $LOOP_USER_INPUT so session continuity is not required for context.
Guardrails: PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291's fail-loud throw remains the backstop for any other failure mode.

Rollback Plan (required)

Fast rollback: git revert c06173bd
No feature flags needed (one-line condition change).
Observable failure symptoms: error_during_execution in workflow run events for interactive loop nodes on iteration 2+.

Risks and Mitigations

Risk: Fresh session loses conversational context from iteration 1.
- Mitigation: $LOOP_USER_INPUT carries the user's approval comment into the prompt; the previous iteration's output is available via $LOOP_PREV_OUTPUT if configured. Session continuity for multi-turn refinement is restored from iteration 2 onward (currentSessionId is updated after the fresh first iteration).

Summary by CodeRabbit

Bug Fixes
- Interactive loops now use fresh AI sessions when resuming after user input, preventing errors from stale stored sessions.
Documentation
- Clarified when loop sessions are always treated as fresh, including after interactive approval resumption.
Tests
- Added regression test for stale session handling during resumed iterations.

When resuming a paused interactive loop gate (e.g. archon-piv-loop), iteration 2+ always failed with error_during_execution because needsFreshSession was (loop.fresh_context || i === 1) — but on resume the loop begins at startIteration >= 2, so the condition was never true and the executor blindly passed the stored gate sessionId to the SDK. After hours of human review wait that session is typically expired, which is the failure the issue reporter observed. Changes: - packages/workflows/src/dag-executor.ts: force a fresh session on the first iteration of every interactive loop resume by extending needsFreshSession with (isLoopResume && i === startIteration). Subsequent iterations in the same resume continue to thread currentSessionId from the fresh first iteration, preserving multi-turn continuity. User feedback is already carried via \$LOOP_USER_INPUT so session continuity is not required. - packages/workflows/src/dag-executor.test.ts: update the existing "interactive loop resumes from stored iteration" test to expect an undefined sessionArg (the test was enforcing the buggy behavior). - packages/workflows/src/dag-executor.test.ts: add a regression test that simulates a resumed interactive loop with a stale stored sessionId and asserts the SDK is invoked with undefined and no failure events are emitted. Coordinates with #1291 (fail loudly on SDK isError) and #1294 (orchestrator-side stale-session recovery) without reintroducing any retry loop — we simply avoid passing the stale id. Fixes #1208

coderabbitai · 2026-06-08T13:23:23Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c87a05c9-6f96-4ece-b789-cda58b21066d

📥 Commits

Reviewing files that changed from the base of the PR and between 282e9ff and 634af47.

📒 Files selected for processing (3)

packages/docs-web/src/content/docs/guides/loop-nodes.md
packages/workflows/src/dag-executor.test.ts
packages/workflows/src/dag-executor.ts

📝 Walkthrough

Walkthrough

Fix for interactive loop resume crashes by forcing fresh AI session on resumed iteration instead of reusing stale session stored at gate. Executor logic updated, existing test assertion corrected to expect fresh session, new regression test added, and documentation clarified.

Changes

Interactive Loop Resume Session Handling

Layer / File(s)	Summary
Session threading fix and test validation `packages/workflows/src/dag-executor.ts`, `packages/workflows/src/dag-executor.test.ts`	`executeLoopNode` now forces fresh session when resuming interactively (`i === startIteration`), preventing reuse of stale sessions. Test assertion updated to expect `undefined` (fresh session) on first resumed iteration. New regression test verifies no `error_during_execution` crash and no `loop_iteration_failed` or `node_failed` events when resuming with expired gate session id.
Documentation clarification `packages/docs-web/src/content/docs/guides/loop-nodes.md`	`fresh_context` description expanded to clarify that first iteration after interactive gate resume also runs fresh, noting session expiry risk during user wait time and that resumed user feedback is provided via `$LOOP_USER_INPUT`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

coleam00/Archon#1367: Both PRs modify executeLoopNode's interactive-resume behavior for the first resumed iteration (main PR forces a fresh session when iteration === startIteration, while the other PR ensures $LOOP_PREV_OUTPUT is empty on that same first resumed iteration), so they are code-path related.

Poem

🐰 A loop once stumbled, session stale,
Waiting for humans, it would fail,
Fresh context blooms when gates resume,
No more crashes in the loop's room! 🎉

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch archon/task-archon-fix-github-issue-1780923940901

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Wirasm · 2026-06-08T13:33:45Z

🔍 Comprehensive PR Review

PR: #1923 — fix(workflows): interactive loop resume uses fresh session on first iteration
Reviewed by: 5 specialized agents (code-review, error-handling, test-coverage, comment-quality, docs-impact)
Date: 2026-06-08

Summary

This is a surgical, correct fix for issue #1208. The one-line addition to needsFreshSession in dag-executor.ts precisely prevents a stale/expired session ID from being passed to the SDK on the first resumed iteration of an interactive loop — eliminating the error_during_execution crash. All five agents returned APPROVE. No critical or high issues found. The fix introduces no new error paths, no silent fallbacks, no type violations, and no scope creep.

Verdict: ✅ APPROVE

Severity	Count
🔴 CRITICAL	0
🟠 HIGH	0
🟡 MEDIUM	0
🟢 LOW	5 (after deduplication)

🟢 Low Issues (All Optional)

LOW-1: `currentSessionId` initialized to stale value, then immediately discarded

📍 packages/workflows/src/dag-executor.ts:1994

currentSessionId is initialized to loopGateMeta.sessionId (the stale ID) on line 1994, but the fix correctly forces resumeSessionId = undefined for the first resumed iteration, so the stale value never reaches the SDK. After the first iteration, currentSessionId is overwritten at line 2142. Harmless; the 6-line comment explains the invariant. Recommendation: leave as-is.

Reported by: code-review + error-handling

LOW-2: Test event-filter cast doesn't guard against `undefined` argument

📍 packages/workflows/src/dag-executor.test.ts (new regression test)

The filter uses (call[0] as Record<string, unknown>).event_type — if call[0] were ever undefined, this would throw inside filter() obscuring the test failure. Not possible today with this mock. Recommendation: leave as-is.

Reported by: error-handling

LOW-3: No test for session threading on iteration `startIteration+1` of a resume

📍 packages/workflows/src/dag-executor.test.ts:4958

Both new tests complete in one resumed iteration. The threading mechanism at line 2142 is unchanged and covered by the existing "fresh_context: false threads session" test (line 4411). Recommendation: leave as-is; optional follow-up.

Optional future test sketch

it('interactive loop resume threads fresh session into iterations after startIteration', async () => {
  let callCount = 0;
  mockSendQueryDag.mockImplementation(function* () {
    callCount++;
    if (callCount >= 2) {
      yield { type: 'assistant', content: 'Done. <promise>APPROVED</promise>' };
    } else {
      yield { type: 'assistant', content: 'Still working.' };
    }
    yield { type: 'result', sessionId: `fresh-session-${String(callCount)}` };
  });
  // ... assert calls[0][2] === undefined, calls[1][2] === 'fresh-session-1'
});

Reported by: code-review + test-coverage

LOW-4: Production comment names "Claude SDK" specifically, but fix is provider-agnostic

📍 packages/workflows/src/dag-executor.ts:2053-2057

The comment says "may reference a Claude SDK session that expired." The needsFreshSession guard applies to all providers that return a sessionId. A future maintainer adding a new provider with session resume could read "Claude SDK" and conclude the guard doesn't apply to them.

Suggested one-line reword (optional, trivial):

// Session threading. Force a fresh session on the first iteration of an
// interactive loop resume: the stored session id from the gate metadata
// may be stale after a human review wait — passing it to the SDK can
// trigger errors (e.g. error_during_execution on Claude SDK).
// User feedback is carried via $LOOP_USER_INPUT, so session continuity is
// not required for the first resumed iteration.

Reported by: comment-quality

LOW-5: `loop-nodes.md` understates the freshness guarantee for interactive loop resumes

📍 packages/docs-web/src/content/docs/guides/loop-nodes.md:135

Line 135 says "The first iteration is always fresh regardless of this setting." After this PR the guarantee also covers the first iteration after an interactive loop resume — even if its numeric index is 2+. A user designing a fresh_context: false interactive loop expecting full session continuity across an approval gate will be surprised.

Suggested one-sentence addition (optional):

The first iteration is always fresh regardless of this setting. The first
iteration executed after resuming from an interactive loop approval gate is
also always fresh — the stored session id may have expired during the human
review wait, and user feedback is carried via `$LOOP_USER_INPUT` instead.

Reported by: docs-impact

✅ What's Good

Surgical fix: Exactly one boolean subexpression added to needsFreshSession. Nothing else touched.
Comment quality: The 6-line block explains the expiry failure mode, why session continuity is safe to skip, and links to fix: interactive loop resume crashes with error_during_execution (stale session) #1208 — exactly the kind of non-obvious invariant that warrants a comment.
Fail-loud path preserved: The if (msg.isError) throw block from PR fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291 is untouched — the guard that made fix: interactive loop resume crashes with error_during_execution (stale session) #1208 visible instead of silent remains intact.
Test regression anchor: New test uses a clearly named stale ID ('stale-session-from-hours-ago') and checks for zero failure events — tests observable behavior, not internal state.
Issue anchoring: #1208 appears in three comment sites across the test file.
CLAUDE.md compliance: No any types, no new mock.module() calls, no scope creep, no infinite-retry risk, no silent swallowing.

📋 Suggested Follow-up Issues (if not fixing in this PR)

Title	Priority
Document interactive loop resume session freshness in `loop-nodes.md`	P2
Add multi-iteration resume session-threading test	P3

Next Steps

Optional: Apply LOW-4 comment broadening (trivial one-liner — no block on merge)
Optional: Apply LOW-5 docs update (one sentence in loop-nodes.md)
Nothing blocks merge — mark ready for review when comfortable

Reviewed by Archon prp-core:prp-review-agents workflow

- Broaden comment in dag-executor.ts: remove false implication that the stale-session guard is Claude-only; fix applies to any provider with sessionResume support - Update loop-nodes.md: document that the first iteration after resuming from an interactive loop approval gate is also always fresh, so users relying on fresh_context:false across a gate boundary are not surprised

Wirasm · 2026-06-08T13:36:21Z

⚡ Self-Fix Report (Aggressive)

Status: COMPLETE
Pushed: ✅ Changes pushed to archon/task-archon-fix-github-issue-1780923940901
Commit: aea5f8f
Philosophy: Fix everything unless clearly a new concern

Fixes Applied (2 total)

Severity	Count
🔴 CRITICAL	0
🟠 HIGH	0
🟡 MEDIUM	0
🟢 LOW	2

View all fixes

✅ Comment broadened: "Claude SDK" → provider-agnostic (packages/workflows/src/dag-executor.ts:2053-2057) — The comment said "may reference a Claude SDK session that expired." Changed to "may be stale after a human review wait — passing it to the SDK can trigger errors (e.g. error_during_execution on Claude SDK)." Removes the false implication that the guard only applies to Claude; it covers any provider with sessionResume support.
✅ loop-nodes.md: document interactive resume freshness (packages/docs-web/src/content/docs/guides/loop-nodes.md:135) — Added sentence after "The first iteration is always fresh regardless of this setting." to explicitly cover the resume case: users relying on fresh_context: false for session continuity across an interactive approval gate would otherwise be surprised.

Tests Added

(none)

Skipped (3)

All three were explicitly recommended as "leave as-is" or "skip" by the review agents:

Finding	Reason
`currentSessionId` stale init	YAGNI — existing comment explains invariant; both agents recommended Option A (leave as-is)
Test event-filter cast fragility	Test-only risk; agents recommended leave as-is
Multi-iter resume session threading test	Orthogonal to this fix; existing general threading test covers the mechanism

Suggested Follow-up Issues

(none)

Validation

✅ Type check | ✅ Lint | ✅ Tests (283 passed)

Self-fix by Archon · aggressive mode · fixes pushed to archon/task-archon-fix-github-issue-1780923940901

simplify: remove unused callCount tracking in regression test

634af47

Wirasm mentioned this pull request Jun 8, 2026

fix: interactive loop resume crashes with error_during_execution (stale session) #1208

Closed

Wirasm marked this pull request as ready for review June 8, 2026 18:59

Wirasm merged commit ab4cde6 into dev Jun 8, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(workflows): interactive loop resume uses fresh session on first iteration#1923

fix(workflows): interactive loop resume uses fresh session on first iteration#1923
Wirasm merged 3 commits into
devfrom
archon/task-archon-fix-github-issue-1780923940901

Wirasm commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Wirasm commented Jun 8, 2026

Uh oh!

Wirasm commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Wirasm commented Jun 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

UX Journey

Before

After

Architecture Diagram

Before

After

Label Snapshot

Change Metadata

Linked Issue

Validation Evidence (required)

Security Impact (required)

Compatibility / Migration

Human Verification (required)

Side Effects / Blast Radius (required)

Rollback Plan (required)

Risks and Mitigations

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

Wirasm commented Jun 8, 2026

🔍 Comprehensive PR Review

Summary

🟢 Low Issues (All Optional)

LOW-1: currentSessionId initialized to stale value, then immediately discarded

LOW-2: Test event-filter cast doesn't guard against undefined argument

LOW-3: No test for session threading on iteration startIteration+1 of a resume

LOW-4: Production comment names "Claude SDK" specifically, but fix is provider-agnostic

LOW-5: loop-nodes.md understates the freshness guarantee for interactive loop resumes

✅ What's Good

📋 Suggested Follow-up Issues (if not fixing in this PR)

Next Steps

Uh oh!

Wirasm commented Jun 8, 2026

⚡ Self-Fix Report (Aggressive)

Fixes Applied (2 total)

Tests Added

Skipped (3)

Suggested Follow-up Issues

Validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Wirasm commented Jun 8, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 8, 2026 •

edited

Loading

LOW-1: `currentSessionId` initialized to stale value, then immediately discarded

LOW-2: Test event-filter cast doesn't guard against `undefined` argument

LOW-3: No test for session threading on iteration `startIteration+1` of a resume

LOW-5: `loop-nodes.md` understates the freshness guarantee for interactive loop resumes