fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop by kagura-agent · Pull Request #1294 · coleam00/Archon

kagura-agent · 2026-04-19T08:23:59Z

Summary

Problem: After container restart, sending a message in an existing conversation silently fails and enters an infinite failure loop — the expired Claude API session ID is persisted even on error
Why it matters: Users are stuck with broken conversations after any restart, with no recovery path except manual DB intervention
What changed: updateSession() and tryPersistSessionId() now accept string | null; both handleStreamMode and handleBatchMode clear session ID on error_during_execution
What did not change: Conversation history (remote_agent_messages), session creation flow, normal (non-error) session persistence

UX Journey

Before

User                   Archon                   Claude API
────                   ──────                   ─────────
sends message ──────▶  loads stale session ID
                       tries Claude API ──────▶  rejects (expired)
                       persists SAME stale ID
  sees error ◀──────── returns error
sends again ─────────▶ loads SAME stale ID
                       tries Claude API ──────▶  rejects again
  (infinite loop)      never recovers

After

User                   Archon                   Claude API
────                   ──────                   ─────────
sends message ──────▶  loads stale session ID
                       tries Claude API ──────▶  rejects (expired)
                       [clears session ID → NULL]
  sees error ◀──────── returns error
sends again ─────────▶ [no stored ID → creates new session]
                       tries Claude API ──────▶  accepts (fresh)
  sees reply ◀──────── streams response

Architecture Diagram

Before

handleStreamMode / handleBatchMode
    │
    ▼
error_during_execution ──▶ updateSession(sessionId)  ──▶ DB (stale ID persisted)

After

handleStreamMode / handleBatchMode
    │
    ▼
error_during_execution ──▶ [updateSession(null)]  ──▶ DB (ID cleared)

Connection inventory:

From	To	Status	Notes
handleStreamMode	updateSession	modified	Now passes `null` on error
handleBatchMode	tryPersistSessionId	modified	Now passes `null` on error
updateSession	DB	modified	Accepts `string \| null`
tryPersistSessionId	DB	modified	Accepts `string \| null`

Label Snapshot

Risk: risk: low
Size: size: S
Scope: core
Module: core:orchestrator

Security Impact

No security impact. This change only affects internal session state management — no auth, no user data exposure, no new inputs.

Human Verification

Verified by restarting container and sending message to existing conversation — session recovers automatically
All 89 orchestrator-agent tests pass
All 28 sessions DB tests pass
TypeScript type-check clean (tsc --noEmit)

Side Effects / Blast Radius

Only affects error recovery path (error_during_execution)
Normal session flow is unchanged
Worst case: session is cleared unnecessarily, resulting in a fresh session (no data loss, just context reload)

Rollback Plan

Revert the single commit. The only behavioral change is in the error handler — reverting restores original (broken) behavior of persisting stale IDs.

Closes #1280

Summary by CodeRabbit

Release Notes

Bug Fixes
- Enhanced session management to automatically clear invalid sessions when execution errors occur, improving reliability.
- Expanded error logging to include additional details for better troubleshooting of result processing failures.
Tests
- Added comprehensive test coverage for session state clearing during error recovery scenarios.
- Improved test structure with named mock utilities for better maintainability.

coderabbitai · 2026-04-19T08:24:15Z

📝 Walkthrough

Walkthrough

Allow clearing persisted assistant session IDs by accepting null when updating sessions and change orchestrator result handling to clear (set NULL) the stored assistant_session_id when an error_during_execution result is received, preventing stale-session retry loops.

Changes

Cohort / File(s)	Summary
Session persistence `packages/core/src/db/sessions.ts`, `packages/core/src/db/sessions.test.ts`	`updateSession` signature changed to accept `sessionId: string \| null`; tests added/updated to verify NULL is bound and not-found behavior remains.
Orchestrator logic `packages/core/src/orchestrator/orchestrator-agent.ts`, `packages/core/src/orchestrator/orchestrator-agent.test.ts`	`tryPersistSessionId` accepts `string \| null`. `handleStreamMode`/`handleBatchMode` detect `msg.isError && msg.errorSubtype === 'error_during_execution'`, log clearing, persist `null` for assistant session, prevent using the failed sessionId for that turn; tests added to assert transition + `updateSession(..., null)` in both stream and batch.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Orch as Orchestrator
  participant AI as AIClient
  participant DB as Database

  Client->>Orch: send user message
  Orch->>DB: read assistant_session_id (may exist)
  Orch->>AI: forward message (include assistant_session_id if present)
  AI-->>Orch: result (isError=true, subtype=error_during_execution)
  Orch->>DB: tryPersistSessionId(session.id, NULL)
  DB-->>Orch: ack
  Orch->>AI: subsequent message uses no assistant_session_id (fresh session)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

fix(core): surface auth errors instead of silently dropping them #1089: Modifies orchestrator result handling and session persistence codepaths; overlaps on error/result handling and session-id updates.
fix(workflows): fail loudly on SDK isError results in DAG and loop nodes #1291: Alters handling of SDK result errors and execution failure flows; related to error_during_execution behaviour and persistence.

Poem

🐰 I nudge the stale ID into the night,
Clearing cobwebs so the convo's bright.
No looping echoes, just fresh, hopeful talk—
I hop, I clear, then skip away to walk. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 60.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title clearly and specifically describes the main change: clearing stale session IDs on error_during_execution to prevent infinite failure loops.
Description check	✅ Passed	The PR description comprehensively covers all required template sections with clear explanations, UX/architecture diagrams, validation evidence, security impact, and rollback plan.
Linked Issues check	✅ Passed	The code changes fully address issue `#1280`: updateSession and tryPersistSessionId now accept null; both handleStreamMode and handleBatchMode clear session ID on error_during_execution as required.
Out of Scope Changes check	✅ Passed	All changes are tightly scoped to the error recovery path for stale session IDs; conversation history, session creation flow, and normal session persistence remain unchanged as specified.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Review rate limit: 7/8 reviews remaining, refill in 7 minutes and 30 seconds.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Wirasm · 2026-04-20T06:36:48Z

@kagura-agent related to #1208 — overlapping area or partial fix.

Wirasm · 2026-04-20T06:38:02Z

Hi @kagura-agent — thanks for opening this PR.

This repository uses a PR template at .github/pull_request_template.md with several required sections. A few of them appear to be empty or placeholder here:

Security Impact (required)
Human Verification (required)
Side Effects / Blast Radius (required)
Rollback Plan (required)

Could you fill those out (even briefly)? The template helps reviewers understand scope, risk, and rollback — it speeds up review significantly.

If a section genuinely doesn't apply, just write "N/A" in it rather than leaving it blank.

kagura-agent · 2026-04-20T07:49:52Z

Thanks @Wirasm! Filled out all the template sections.

Re #1208 — I took a look and this PR addresses a different failure mode: #1208 is about initial session creation, while this one handles the case where a previously valid session expires after container restart. The fix is also in a different code path (error handler clearing stale IDs vs. session creation logic). Happy to add a note cross-referencing #1208 if that helps!

Wirasm · 2026-04-29T09:13:55Z

Review Summary

Verdict: minor-fixes-needed

This is a tightly-scoped bug fix that clears stale assistant_session_id values when error_during_execution errors occur, breaking infinite failure loops. The code is clean, type-safe, and CLAUDE.md-compliant — good work. The gap is test coverage: the new session-clearing branch has no tests, and updateSession's null path also needs one. Both are small, targeted additions.

Blocking issues

orchestrator-agent.ts:958–964 and orchestrator-agent.ts:1088–1094 — The new branch that calls await tryPersistSessionId(session.id, null) on error_during_execution is untested. Add a test for both handleStreamMode and handleBatchMode that yields a synthetic result event { type: 'result', isError: true, errorSubtype: 'error_during_execution', sessionId: 'stale-session-id' } and asserts updateSession is called with (sessionId, null).

Suggested fixes

sessions.ts:60–62 / sessions.test.ts — updateSession is now called with null, but the null path has no test. Add a test that verifies await updateSession('session-123', null) generates UPDATE remote_agent_sessions SET assistant_session_id = NULL WHERE id = $1 and does not throw.

Minor / nice-to-have

orchestrator-agent.ts:958–963 / orchestrator-agent.ts:1088–1093 — The 'clearing_stale_session_id' warning logs only conversationId and errorSubtype. Add other available fields from msg (e.g., toolName) to help diagnose the root cause of the execution error.
orchestrator-agent.ts:958–964 and orchestrator-agent.ts:1088–1094 — The interaction between the new clearing branch and the existing error-path persist call is not explicitly tested when a result carries both an error and a sessionId. Add a test asserting updateSession is called with null (clearing takes precedence).
orchestrator-agent.ts:333 — When tryPersistSessionId logs a failure, it uses newSessionId: null in the payload, which may mislead log readers into thinking the intent was to set null. Rename the field to persistedValue: assistantSessionId.

Compliments

Well-scoped PR with a clear before/after UX diagram, architecture diagram, connection inventory, and rollback plan — exactly the kind of documentation that makes future maintainers' lives easier.
The tryPersistSessionId error-swallowing is intentional and appropriate (non-critical side effect), and the failure is logged with context — good error handling design.
Both handleStreamMode and handleBatchMode are updated consistently — no subtle asymmetry.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

…o prevent infinite failure loop When a Claude API session expires (e.g. after container restart), the orchestrator persists the new (failed) session ID from the error result, causing every subsequent message in that conversation to hit the same error — an infinite failure loop. Fix: on error_during_execution result, set assistant_session_id to NULL instead of persisting the failed session ID. The next message starts a fresh session with full context rebuilt from the DB. Conversation history is unaffected since it lives in remote_agent_messages, independent of the Claude session. Changes: - updateSession() and tryPersistSessionId() now accept string | null - Both handleStreamMode and handleBatchMode clear session ID on error_during_execution Fixes coleam00#1280

… feedback Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

kagura-agent · 2026-04-29T09:32:22Z

Thanks for the thorough review @Wirasm! All items addressed:

Blocking — stale session clearing tests:

Added tests for both handleStreamMode and handleBatchMode that yield a synthetic error_during_execution result and assert updateSession is called with (sessionId, null)

Suggested — updateSession(id, null) SQL test:

Added test in sessions.test.ts verifying the NULL path generates correct SQL

Minor fixes:

Enriched warning logs with errors and stopReason fields (note: toolName doesn't exist on result type chunks per the MessageChunk type definition)
persistedValue rename was already in the original commit

All 100 orchestrator tests + 29 sessions tests pass. Rebased on latest dev.

coderabbitai

🧹 Nitpick comments (1)

packages/core/src/orchestrator/orchestrator-agent.ts (1)
983-992: Normalize result-path log event names to the repo convention.

The new/updated event names (clearing_stale_session_id, ai_result_error) don’t follow the required {domain}.{action}_{state} format.
♻️ Suggested naming adjustment
- 'clearing_stale_session_id'
+ 'orchestrator.session_clear_failed'

- 'ai_result_error'
+ 'orchestrator.ai_result_failed'
As per coding guidelines, "Structured logging with Pino: use event naming format {domain}.{action}_{state} (standard states: _started, _completed, _failed, _validated, _rejected)."

Also applies to: 999-1007, 1127-1136, 1143-1151
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/orchestrator/orchestrator-agent.ts` around lines 983 - 992,
The log event names like 'clearing_stale_session_id' and 'ai_result_error' do
not follow the repo convention `{domain}.{action}_{state}`; update the strings
passed to getLog().warn/getLog().error (and any other getLog() calls around the
same areas) to use that format (choose appropriate domain names such as
"session" or "ai.result", an action verb like "clear_stale_session" or "result",
and a standard state suffix like
`_started`/`_completed`/`_failed`/`_validated`/`_rejected`) so each call (e.g.,
the getLog().warn inside the clearing stale session flow and the getLog().error
for ai result errors) emits events matching `{domain}.{action}_{state}`; apply
the same renaming to the other occurrences referenced (lines ~999-1007,
~1127-1136, ~1143-1151).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/core/src/orchestrator/orchestrator-agent.ts`:
- Around line 983-992: The log event names like 'clearing_stale_session_id' and
'ai_result_error' do not follow the repo convention `{domain}.{action}_{state}`;
update the strings passed to getLog().warn/getLog().error (and any other
getLog() calls around the same areas) to use that format (choose appropriate
domain names such as "session" or "ai.result", an action verb like
"clear_stale_session" or "result", and a standard state suffix like
`_started`/`_completed`/`_failed`/`_validated`/`_rejected`) so each call (e.g.,
the getLog().warn inside the clearing stale session flow and the getLog().error
for ai result errors) emits events matching `{domain}.{action}_{state}`; apply
the same renaming to the other occurrences referenced (lines ~999-1007,
~1127-1136, ~1143-1151).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8d3dd9ef-3822-4b18-b28f-0d28548c5bb6

📥 Commits

Reviewing files that changed from the base of the PR and between d928859 and ec93193.

📒 Files selected for processing (4)

packages/core/src/db/sessions.test.ts
packages/core/src/db/sessions.ts
packages/core/src/orchestrator/orchestrator-agent.test.ts
packages/core/src/orchestrator/orchestrator-agent.ts

* fix(core/test): split connection.test.ts from DB-test batch to avoid mock pollution (coleam00#1269) messages.test.ts uses mock.module('./connection', ...) at module-load time. Per CLAUDE.md:131 (Bun issue oven-sh/bun#7823), mock.module() is process- global and irreversible. When Bun pre-loads all test files in a batch, the mock shadows the real connection module before connection.test.ts runs, causing getDatabaseType() to always return the mocked value regardless of DATABASE_URL. Move connection.test.ts into its own `bun test` invocation immediately after postgres.test.ts (which runs alone) and before the big DB/utils/ config/state batch that contains messages.test.ts. This follows the same isolation pattern already used for command-handler, clone, postgres, and path-validation tests. * fix(setup): align PORT default on 3090 across .env.example, wizard, and JSDoc (coleam00#1152) (coleam00#1271) The server's getPort() fallback changed from 3000 to 3090 in the Hono migration (coleam00#318), but .env.example, the setup wizard's generated .env, and the JSDoc describing the fallback were not updated — leaving three different sources of truth for "the default PORT." When the wizard writes PORT=3000 to ~/.archon/.env (which the Hono server loads with override: true, while Vite only reads repo-local .env), the two processes can land on different ports silently. That mismatch is the real mechanism behind the failure described in coleam00#1152. - .env.example: comment out PORT, document 3090 as the default - packages/cli/src/commands/setup.ts: wizard no longer writes PORT=3000 into the generated .env; fix the "Additional Options" note - packages/cli/src/commands/setup.test.ts: assert no bare PORT= line and the commented default is present - packages/core/src/utils/port-allocation.ts: fix stale JSDoc "default 3000" -> "default 3090" - deploy/.env.example: keep Docker default at 3000 (compose/Caddy target that) but annotate it so users don't copy it for local dev Single source of truth for the local-dev default is now basePort in port-allocation.ts. * fix(providers/claude): use || instead of ?? in hasExplicitTokens to handle empty-string env vars (coleam00#1028) Closes coleam00#1027 * chore(deps): bump claude-agent-sdk to 0.2.121, codex-sdk to 0.125.0 (coleam00#1460) Both SDKs were ~30 patch releases behind. Validation suite passes (type-check, lint, format, tests across all 10 packages) without code changes. The only sustained Claude SDK behavior change in the range — v0.2.111's options.env overlay/replace flap, since reverted to overlay — is a no-op for Archon, which already passes { ...process.env } as the SDK env. * fix(cli): lazy-import bundled skill files so non-setup commands don't crash on missing source (coleam00#1394) The 18 top-level `import … with { type: 'text' }` statements in `bundled-skill.ts` resolve at module load. For `bun build --compile` that's build time, so the binary embeds the strings and works regardless of any on-disk skill files. For `bun link` (linked-source) installs that's every `archon` invocation — including `archon --help`, which doesn't even use the skill content. If any of the 18 source files are missing or moved, the import fails and the CLI cannot start at all. The skill content is data the binary deploys via `archon setup`, not data the CLI needs at runtime. There's only one consumer in production code: `copyArchonSkill()` in `setup.ts`. Moving the import into that function as a dynamic import preserves the compiled-binary behavior (Bun's bundler statically analyses literal-string `import()` and embeds the chunk — verified by grepping the SKILL.md frontmatter out of a freshly compiled binary) while making the linked-source install resilient: only `archon setup` triggers the bundled-skill module load now. Verified: a known skill string appears in the compiled binary 1×, and `archon --help` no longer needs the source files to start. `copyArchonSkill()` becomes async because the dynamic import is a Promise. The single production call site is already in an async function and gets an `await`. The four `setup.test.ts` cases become async too. * fix(claude): stop passing --no-env-file to native binary in dev mode (coleam00#1461) * fix(claude): stop passing --no-env-file to native binary in dev mode The Claude Agent SDK switched from shipping `cli.js` inside the package to per-platform native binaries via optional deps somewhere in the 0.2.x series. As of 0.2.121 there is no `cli.js` in the SDK package; dev mode resolves to `@anthropic-ai/claude-agent-sdk-darwin-arm64/claude` (Mach-O). That native binary rejects `--no-env-file` with `error: unknown option '--no-env-file'` and the subprocess exits 1. `shouldPassNoEnvFile` was returning true on `cliPath === undefined` on the assumption that "dev mode = JS executable run via Bun". That assumption is dead. Tighten the predicate to only return true on an explicit `.js` suffix, so we only emit the flag when the SDK is going to spawn a Bun-runnable script. CWD `.env` leak protection is unaffected. `stripCwdEnv()` in `@archon/paths` (coleam00#1067) deletes Bun-auto-loaded `.env`/`.env.local`/ `.env.development`/`.env.production` keys from `process.env` at every Archon entry point before any subprocess is spawned. The native Claude binary does not auto-load `.env` from its cwd either. `--no-env-file` was belt-and-suspenders for the JS-via-Bun case only. Verified end-to-end with a sentinel: added a unique `ARCHON_LEAK_SENTINEL_$$` to Archon's `.env`, ran e2e-claude-smoke with a bash probe checking the subprocess env. stderr shows `[archon] stripped 23 keys from /Users/rasmus/Projects/cole/Archon (.env, .env.local)` — sentinel was deleted. Bash node prints `PASS: simple='4', no sentinel leak`. Workflow completes cleanly, no `--no-env-file` rejection from the SDK binary. bun run validate: green across all 10 packages. * fix(claude): address review on coleam00#1461 (stale docs + test gaps) Critical: file-level JSDoc at provider.ts:18 still claimed dev mode resolves cli.js. Updated to reflect SDK 0.2.x's switch to per-platform native binaries. Important: security.md still listed --no-env-file as item 2 of target-repo .env isolation. Scoped that bullet to legacy Bun-runnable JS entry points and called out that native binaries don't auto-load .env from cwd. Added an Unreleased Fixed entry to CHANGELOG.md. Updated binary-resolver.ts JSDoc title that referenced cli.js. Polish: widened the predicate to accept .mjs and .cjs (also Bun-runnable JS — matches the SDK's own internal extension list). Dropped the redundant `passesNoEnvFile` log field that mirrored `isJsExecutable`. Added unit cases for .mjs/.cjs (now true) and .ts/.tsx/.jsx (deliberately false — never SDK entry points). Added an integration test that mocks resolveClaudeBinaryPath to return a .js path and asserts executableArgs: ['--no-env-file'] flows through buildBaseClaudeOptions all the way to the SDK call — catches future regressions in the conditional spread. bun run validate: green across all 10 packages. * fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop (coleam00#1294) * fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop When a Claude API session expires (e.g. after container restart), the orchestrator persists the new (failed) session ID from the error result, causing every subsequent message in that conversation to hit the same error — an infinite failure loop. Fix: on error_during_execution result, set assistant_session_id to NULL instead of persisting the failed session ID. The next message starts a fresh session with full context rebuilt from the DB. Conversation history is unaffected since it lives in remote_agent_messages, independent of the Claude session. Changes: - updateSession() and tryPersistSessionId() now accept string | null - Both handleStreamMode and handleBatchMode clear session ID on error_during_execution Fixes coleam00#1280 * test(orchestrator): add stale session clearing tests + address review feedback Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com> --------- Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com> Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com> * fix(claude): honor CLAUDE_BIN_PATH in dev mode for libc-mismatch hosts (coleam00#1481) * fix(claude): honor CLAUDE_BIN_PATH in dev mode for libc-mismatch hosts The Claude Agent SDK auto-resolves its bundled native binary in [linux-x64-musl, linux-x64] order. On glibc Linux hosts (Ubuntu/Debian/ Fedora), Bun installs both via optionalDependencies and the musl variant is picked first; its ELF interpreter (/lib/ld-musl-x86_64.so.1) does not exist on glibc, so spawn fails and the SDK reports a misleading "binary not found" — the file is on disk, the loader is not. The documented escape hatch CLAUDE_BIN_PATH was dead code in dev mode: the resolver early-returned undefined when BUNDLED_IS_BINARY=false before ever reading the env var. The only workaround was patching node_modules. Move the env-var block above the BUNDLED_IS_BINARY return. Config-file path stays binary-mode-only — it's per-repo, not per-machine; env is the right knob for libc mismatches. Behavior preserved: - env unset → unchanged (undefined in dev, autodetect/throw in binary) - env set + file exists → resolved (was binary-only; now also dev) - env set + file missing → clear error (was binary-only; now also dev) Closes coleam00#1474 * chore(claude): address CodeRabbit review on coleam00#1481 - CHANGELOG entry under [Unreleased] / Fixed describing the dev-mode CLAUDE_BIN_PATH escape hatch (previously ignored). Notes that config-file path remains binary-mode-only and that env-loading + target-repo .env isolation are unchanged downstream. - Empty-string test pinning that CLAUDE_BIN_PATH='' falls through to undefined rather than throwing — protects against a future predicate typo that would treat empty as "set". - One-line note in ai-assistants.md "Binary path configuration" section pointing dev-mode users at the env-var override for the glibc/musl mismatch case. Skipped from the review: - The other two docs-page rewrites (configuration.md / troubleshooting.md): the error message itself names CLAUDE_BIN_PATH, and coleam00#1474 documents the use case publicly. One mention in ai-assistants.md is enough for discovery. - Type-style consistency tweaks in the test file: pure bikeshed. * fix(deps): bump hono to ^4.12.16 and @hono/node-server to ^1.19.13 (closes coleam00#1484) (coleam00#1499) * fix(orchestrator): create ~/.archon/workspaces before AI provider spawn (coleam00#1529) * fix(orchestrator): create ~/.archon/workspaces before AI provider spawn On a fresh install, ~/.archon/workspaces doesn't exist yet. The orchestrator passes that path as cwd to the AI provider, which calls spawn() — which raises ENOENT. The error is then misclassified as "binary not found" in the friendly-error path, surfacing as an incorrect "Claude binary not found" message. Add ensureArchonWorkspacesPath() in @archon/paths that mkdir -p's the directory and returns the path. Use it at the orchestrator's spawn-cwd site so the directory is guaranteed to exist before spawn(). Other call sites of getArchonWorkspacesPath() (workflow discovery, path-prefix comparisons) only consume the path string and don't need the directory to exist; they keep using the pure getter. Closes coleam00#1528 * test(orchestrator): assert ensureArchonWorkspacesPath is called Capture the @archon/paths mock as a named variable and assert it was called in the syncWorkspace handleMessage path. Without this, the test suite passes even if orchestrator-agent.ts:824 reverts to the non-ensuring getArchonWorkspacesPath() variant — exactly the regression that surfaced as 'Claude Code native binary not found' in coleam00#1528. * docs(changelog): add Tier 1 batch 2 cherry-pick entry Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com> Co-authored-by: Rasmus Widing <152263317+Wirasm@users.noreply.github.com> Co-authored-by: DIY Smart Code <thomas@thirty3.de> Co-authored-by: Cocoon-Break <54054995+kuishou68@users.noreply.github.com> Co-authored-by: Kagura <kagura.agent.ai@gmail.com> Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com> Co-authored-by: Yasser <116118149+YrFnS@users.noreply.github.com> Co-authored-by: Truffle <truffleagent@gmail.com> Co-authored-by: cjnprospa <sirhcle.j23@gmail.com>

…teration (#1923) * Fix: interactive loop resume crashes with error_during_execution (#1208) When resuming a paused interactive loop gate (e.g. archon-piv-loop), iteration 2+ always failed with error_during_execution because needsFreshSession was (loop.fresh_context || i === 1) — but on resume the loop begins at startIteration >= 2, so the condition was never true and the executor blindly passed the stored gate sessionId to the SDK. After hours of human review wait that session is typically expired, which is the failure the issue reporter observed. Changes: - packages/workflows/src/dag-executor.ts: force a fresh session on the first iteration of every interactive loop resume by extending needsFreshSession with (isLoopResume && i === startIteration). Subsequent iterations in the same resume continue to thread currentSessionId from the fresh first iteration, preserving multi-turn continuity. User feedback is already carried via \$LOOP_USER_INPUT so session continuity is not required. - packages/workflows/src/dag-executor.test.ts: update the existing "interactive loop resumes from stored iteration" test to expect an undefined sessionArg (the test was enforcing the buggy behavior). - packages/workflows/src/dag-executor.test.ts: add a regression test that simulates a resumed interactive loop with a stale stored sessionId and asserts the SDK is invoked with undefined and no failure events are emitted. Coordinates with #1291 (fail loudly on SDK isError) and #1294 (orchestrator-side stale-session recovery) without reintroducing any retry loop — we simply avoid passing the stale id. Fixes #1208 * fix: address review findings for #1923 - Broaden comment in dag-executor.ts: remove false implication that the stale-session guard is Claude-only; fix applies to any provider with sessionResume support - Update loop-nodes.md: document that the first iteration after resuming from an interactive loop approval gate is also always fresh, so users relying on fresh_context:false across a gate boundary are not surprised * simplify: remove unused callCount tracking in regression test

kagura-agent force-pushed the fix/1280-stale-session-id-loop branch from 553c6a2 to d928859 Compare April 19, 2026 09:09

Wirasm mentioned this pull request Apr 20, 2026

fix: interactive loop resume crashes with error_during_execution (stale session) #1208

Closed

kagura-agent and others added 2 commits April 29, 2026 17:24

test(orchestrator): add stale session clearing tests + address review…

ec93193

… feedback Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com> Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

kagura-agent force-pushed the fix/1280-stale-session-id-loop branch from d928859 to ec93193 Compare April 29, 2026 09:32

coderabbitai Bot reviewed Apr 29, 2026

View reviewed changes

Wirasm merged commit cbcca8c into coleam00:dev Apr 29, 2026
4 checks passed

Wirasm mentioned this pull request Apr 29, 2026

Release 0.3.10 #1488

Merged

prospapledge88 mentioned this pull request May 5, 2026

chore: cherry-pick Tier 1 upstream fixes (10 commits) — batch 2 prospapledge88/Archon#7

Merged

5 tasks

Wirasm mentioned this pull request Jun 8, 2026

fix(workflows): interactive loop resume uses fresh session on first iteration #1923

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop#1294

fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop#1294
Wirasm merged 2 commits into
coleam00:devfrom
kagura-agent:fix/1280-stale-session-id-loop

kagura-agent commented Apr 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 19, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

Wirasm commented Apr 20, 2026

Uh oh!

Wirasm commented Apr 20, 2026

Uh oh!

kagura-agent commented Apr 20, 2026

Uh oh!

Wirasm commented Apr 29, 2026

Uh oh!

kagura-agent commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kagura-agent commented Apr 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

UX Journey

Before

After

Architecture Diagram

Before

After

Label Snapshot

Security Impact

Human Verification

Side Effects / Blast Radius

Rollback Plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

Wirasm commented Apr 20, 2026

Uh oh!

Wirasm commented Apr 20, 2026

Uh oh!

kagura-agent commented Apr 20, 2026

Uh oh!

Wirasm commented Apr 29, 2026

Review Summary

Blocking issues

Suggested fixes

Minor / nice-to-have

Compliments

Uh oh!

kagura-agent commented Apr 29, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kagura-agent commented Apr 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 19, 2026 •

edited

Loading