Summary
A DAG-workflow command-node that invokes claude.exe as a subagent on a freshly-forked git worktree can complete in <15 seconds reporting success while never actually performing its task. The subagent appears to be invoked without task context, prints a "what would you like to do?" idle prompt, and exits. The downstream node (merge-to-main in our case) then runs against an unchanged branch tip and reports success because git push is a no-op when the branch is already at origin's tip.
Net effect: workflow reports green, no work actually lands on main, generator's diff sits uncommitted in the (now-orphaned) worktree.
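A pre-merge gate would turn this silent no-op into a loud failure. The sketch below is a hypothetical shell helper (`assert_branch_ahead` is a name made up for illustration, not an Archon API). It assumes the caller has already computed how many commits the sprint branch is ahead of main, for example with `git rev-list --count`.

```shell
# Hypothetical pre-merge gate: refuse to merge when the sprint branch
# carries no commits beyond main (the situation in which git push is a no-op).
# $1 = number of commits the branch is ahead of main, e.g. obtained with:
#   git rev-list --count origin/main..origin/archon/task-feat-<slug>
assert_branch_ahead() {
  local ahead="$1"
  case "$ahead" in
    ''|*[!0-9]*)
      echo "merge gate: invalid ahead count '$ahead'" >&2
      return 2
      ;;
  esac
  if [ "$ahead" -eq 0 ]; then
    echo "merge gate: branch has no commits beyond main; refusing no-op merge" >&2
    return 1
  fi
  return 0
}
```

Run before merge-to-main, a gate like this would have failed the node in both repros instead of letting it report success against an unchanged branch tip.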
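Environment
- … ~/.bun/bin/archon resolution; installed via Bun on Windows)
- … provider.claude in workflow runs)
- C:\Users\Dale\AppData\Roaming\npm\node_modules\@anthropic-ai\claude-code\bin\claude.exe
- … archon invocations through a small wrapper that strips CLAUDECODE + CLAUDE_CODE_ENTRYPOINT (closes 2 issues: v0.3.5: CLI workflow run silently hangs — dotenv loads .env from CWD instead of ~/.archon/.env, + archon serve hardcodes skipPlatformAdapters:true — Telegram/Discord/Slack adapters are unreachable #1067)
- … .archon/workflows/; one of the late nodes invokes a claude.exe subagent via command: lms-commit (a markdown prompt at .archon/commands/lms-commit.md) whose job is to stage + commit the worktree's diff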
Symptom
Workflow logs from today's repro (Phase 4b Sprint 3):
[lms-plan-feature] Completed (32.3s)
[exec-eval] Completed (25m12s)
[cancel-check] Completed (81ms)
[lms-commit] Started
[lms-commit] Completed (7s) ← bug fires here
[merge-to-main] Started
[merge-to-main] Completed (3.4s) ← reports success but origin/main unchanged
Healthy commit-node duration is 1m30s – 3m (the subagent reads the diff, drafts a commit message, runs git add + commit + push). 7 seconds means the subagent did none of that.
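The duration gap is wide enough to check mechanically. As a rough sketch (both function names are hypothetical, and the 15-second default floor is an assumption taken from the "<15 seconds" figure in the summary), a wrapper could parse the duration printed in the workflow log and flag a commit node that finished implausibly fast:

```shell
# Convert a workflow-log duration ("7s", "1m30s", "25m12s", "81ms", "32.3s")
# to whole milliseconds. Exits non-zero on input it does not recognise.
duration_to_ms() {
  awk -v d="$1" 'BEGIN {
    if (d == "") exit 1
    total = 0
    while (d != "") {
      if (match(d, /^[0-9.]+ms/))          unit = "ms"
      else if (match(d, /^[0-9.]+[hms]/))  unit = substr(d, RLENGTH, 1)
      else exit 1                          # unrecognised token
      num = substr(d, 1, RLENGTH - length(unit)) + 0
      d = substr(d, RLENGTH + 1)
      if (unit == "ms")      total += num
      else if (unit == "s")  total += num * 1000
      else if (unit == "m")  total += num * 60000
      else                   total += num * 3600000
    }
    printf "%.0f\n", total
  }'
}

# Succeed (exit 0) when the commit node finished faster than the floor.
# $1 = duration string from the log, $2 = floor in ms (assumed default: 15000).
commit_duration_suspicious() {
  local ms floor_ms="${2:-15000}"
  ms=$(duration_to_ms "$1") || return 2
  [ "$ms" -lt "$floor_ms" ]
}
```

With these assumptions, `commit_duration_suspicious 7s` succeeds (suspicious), while the healthy 2m50s and 1m47s runs do not trip it.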
Inspecting the subagent's output via task-output log shows the bug signature:
Context loaded — `/prime` not yet run this session, and no task given.
What would you like to do?
The subagent was spawned but received no task — it idled out into an interactive prompt and the parent workflow's executor treated the clean exit as task-completion success.
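Until the executor distinguishes "exited cleanly" from "did the task", the signature itself can be grepped for. A minimal sketch, assuming the task output is available as text on stdin; the function name is hypothetical and the two match strings are copied from the output quoted above:

```shell
# Succeed (exit 0) when subagent output on stdin contains the idle-prompt
# signature seen in the failing runs. Matching is case-insensitive and literal.
has_idle_prompt_signature() {
  grep -qiF \
    -e 'What would you like to do?' \
    -e 'no task given'
}
```

A commit node could pipe its captured subagent output through this check and fail the node on a match rather than reporting completion.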
Reproductions
Three documented variants of the same surface symptom, possibly sharing one upstream root cause, possibly only related:
Variant 1: worktree_reused (well-documented in our local rules)
Reported in Discovery Phase 0 era (early 2026). Trigger: a prior workflow run was Ctrl+C'd mid-flight; re-kicking the workflow with the same --branch reuses the existing worktree. The commit-node subagent then fires with bad context and silently no-ops. Workaround: force-clean the worktree (git worktree remove --force ...) before re-kicking.
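The force-clean step can be scripted so a stale worktree is found before re-kicking. The helper below is a hypothetical sketch (not part of Archon); it only parses the output of `git worktree list --porcelain` and leaves the actual removal to the caller:

```shell
# Print the path of the worktree checked out on branch $1, reading
# `git worktree list --porcelain` output from stdin. Prints nothing if
# no worktree is on that branch.
worktree_for_branch() {
  awk -v ref="refs/heads/$1" '
    $1 == "worktree" { path = substr($0, 10) }      # text after "worktree "
    $1 == "branch" && $2 == ref { print path; exit }
  '
}

# Example use before re-kicking a workflow (branch name is a placeholder):
#   stale=$(git worktree list --porcelain | worktree_for_branch "archon/task-feat-<slug>")
#   [ -n "$stale" ] && git worktree remove --force "$stale"
```

Keeping the parse separate from the removal means the destructive `--force` call only happens when a matching worktree was actually found.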
Variant 2: fresh-worktree silent-no-op (this issue's primary surface)
Trigger condition NOT isolated. Two confirmed repros:
- Phase 3a Sprint 1 (2026-05-17): 23m46s exec-eval, evaluator PASS all 7 criteria scores 8-10; commit-node fired in 8.3 seconds; merge-to-main reported success; no commit on main.
- Phase 4b Sprint 3 (2026-05-18, today): 25m12s exec-eval, evaluator PASS all 9 criteria scores 9-10; commit-node fired in 7 seconds; merge-to-main reported success; no commit on main.
Both runs:
- Worktree was freshly forked from origin/main at workflow start (verified by the workflow's own verify-worktree-fresh node passing).
- Generator's work was complete and in the worktree (recovered manually post-bug; verified with git status + git diff).
- No Ctrl+C, no prior interrupted run on the same branch.
- The verify-worktree-fresh node and allocate-ports node both completed normally; only the late commit-node fired the silent-no-op.
Variant 3: DAG cancellation after eval PASS but before commit (possibly related)
Phase 4a Sprint 3 (2026-05-17): exec-eval loop completed cleanly (evaluator PASS all 9 criteria scores 8-10 after 32m15s); DAG cancelled mid-flight before lms-commit could fire. Workflow finished without an anyFailed:true signal but no sprint commit landed. Distinct from variants 1+2 in that the commit-node never even attempts to run — the workflow cancels at the DAG layer between nodes. May be a different bug (DAG-executor cancellation race) but lumped here because the operator-facing recovery is identical: generator's work is in the worktree, manual stage + commit + push to recover.
Recovery (operator)
For variants 1 + 2 the recovery is mechanical and reliable:
cd <archon-worktree-path>
git status # confirm uncommitted contract-scoped diff
git add <contract.files_expected paths> # NOT `git add -A` — .archon/ drift contaminates
git commit -m "feat(<slug>): sprint N description"
git push origin archon/task-feat-<slug>
cd <parent-checkout>
git fetch && git merge --ff-only origin/archon/task-feat-<slug>
git push origin main
~2 minutes per recovery.
Why I think it's an Archon issue, not a prompt issue
The lms-commit.md prompt at .archon/commands/lms-commit.md is unchanged across all three reproductions. It works correctly the majority of the time (~70-80% of sprints land cleanly). Within the same Phase 4b pipeline, the same prompt fired successfully on Sprint 1 and Sprint 2 (commit-node durations 2m50s and 1m47s) and silent-no-op'd on Sprint 3 (7s). The intermittent failures plus the "what would you like to do?" output strongly suggest the subagent is being invoked without the task context being correctly passed in. That points to an Archon-side subagent-invocation issue, not anything the command prompt could fix.
Workaround we've shipped
- The recovery recipe above, documented in our repo at .claude/rules/archon.md.
- A phase-orchestrator workflow (lms-phase-pipeline.yaml) that snapshots origin/main SHA before each child sprint and re-checks after; if the SHA doesn't move, the phase halts with a recovery diagnostic instead of forking subsequent sprints from stale main. This caught today's repro cleanly.
Neither workaround prevents the bug — they just make it survivable.
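For anyone wanting the same safety net, the SHA check reduces to a comparison of two snapshots. This is a simplified sketch rather than the contents of lms-phase-pipeline.yaml; the function name is hypothetical, and it assumes the caller captures both SHAs itself (for example with `git fetch` followed by `git rev-parse origin/main`):

```shell
# Hypothetical guard: fail when origin/main did not advance across a sprint.
# $1 = SHA of origin/main before the sprint, $2 = SHA after the sprint.
assert_main_advanced() {
  local before="$1" after="$2"
  if [ -z "$before" ] || [ -z "$after" ]; then
    echo "sha guard: missing SHA snapshot" >&2
    return 2
  fi
  if [ "$before" = "$after" ]; then
    echo "sha guard: origin/main still at $after; halting phase for manual recovery" >&2
    return 1
  fi
  return 0
}

# Example use around a child sprint:
#   git fetch origin && before=$(git rev-parse origin/main)
#   ... run the sprint workflow ...
#   git fetch origin && after=$(git rev-parse origin/main)
#   assert_main_advanced "$before" "$after" || exit 1
```

The guard only detects that nothing landed; it cannot tell a silent-no-op commit node apart from a DAG cancellation, which is why the recovery recipe is the same for both.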
Precedent for upstream fix
A similar nested-spawn deadlock (CLAUDECODE=1 env inheriting into spawned claude.exe subprocesses) was filed as #1067 and fixed upstream; we recently retired our archon-nested.sh workaround in our repo's commit 2248a03 ("retire resume-key-guard workaround; upstream fix shipped"). Filing this in the hope of the same outcome.
What would help us reproduce
I don't have an isolated trigger. The bug fires on long-running exec-eval phases (roughly 24-32 min in the documented repros), on fresh worktrees, after the prior nodes (plan/exec/eval) complete normally. Repro rate is roughly 1 sprint in 4-5 across our Phase 3a / 4a / 4b runs. Happy to share full task-output logs from any of the three documented repros if useful; let me know and I'll attach.
Related local docs
For reference, our local documentation of the three variants + the recovery recipe lives at .claude/rules/archon.md in the LMS-Project repo (private). I can paste relevant sections here if helpful.