
lms-commit-style claude.exe subagent silent no-op on fresh worktree — commit-node reports success in <15s, no commit lands, generator's work uncommitted in worktree #1720

@DHenion-cyber


Summary

A DAG-workflow command-node that invokes claude.exe as a subagent on a freshly-forked git worktree can complete in <15 seconds reporting success while never actually performing its task. The subagent appears to be invoked without task context, prints a "what would you like to do?" idle prompt, and exits. The downstream node (merge-to-main in our case) then runs against an unchanged branch tip and reports success because git push is a no-op when the branch is already at origin's tip.

Net effect: workflow reports green, no work actually lands on main, generator's diff sits uncommitted in the (now-orphaned) worktree.

Environment

Symptom

Workflow logs from today's repro (Phase 4b Sprint 3):

[lms-plan-feature] Completed (32.3s)
[exec-eval] Completed (25m12s)
[cancel-check] Completed (81ms)
[lms-commit] Started
[lms-commit] Completed (7s)            ← bug fires here
[merge-to-main] Started
[merge-to-main] Completed (3.4s)        ← reports success but origin/main unchanged

Healthy commit-node duration is 1m30s – 3m (the subagent reads the diff, drafts a commit message, runs git add + commit + push). 7 seconds means the subagent did none of that.
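A duration floor is one cheap way to flag this from the workflow log alone. The following is a minimal sketch, assuming log lines keep the `[node] Completed (<duration>)` shape shown above; the function names and the 30-second threshold in the usage note are hypothetical and not part of Archon.

```shell
# Convert a duration such as "7s", "32.3s", "81ms", "2m50s" or "25m12s"
# to whole seconds (fractions are truncated).
duration_to_seconds() {
  printf '%s\n' "$1" | awk '
    {
      s = $0; total = 0
      if (s ~ /ms$/) { sub(/ms$/, "", s); print int(s / 1000); next }
      if (match(s, /^[0-9]+m/)) {
        total += substr(s, 1, RLENGTH - 1) * 60
        s = substr(s, RLENGTH + 1)
      }
      sub(/s$/, "", s)
      if (s != "") total += s
      print int(total)
    }'
}

# Scan log file $2 for "[<node $1>] Completed (<duration>)" lines and report
# any whose duration is below $3 seconds.
# Returns 0 if at least one suspicious line was found, 1 otherwise.
flag_fast_node() {
  ffn_node=$1 ffn_log=$2 ffn_min=$3 ffn_status=1
  while IFS= read -r ffn_line; do
    case $ffn_line in
      "[$ffn_node] Completed ("*) ;;
      *) continue ;;
    esac
    ffn_dur=${ffn_line#*"Completed ("}   # drop everything up to the "("
    ffn_dur=${ffn_dur%%")"*}             # drop the ")" and any trailing note
    ffn_secs=$(duration_to_seconds "$ffn_dur")
    if [ "$ffn_secs" -lt "$ffn_min" ]; then
      printf 'suspicious: %s took %s (< %ss)\n' "$ffn_node" "$ffn_dur" "$ffn_min"
      ffn_status=0
    fi
  done < "$ffn_log"
  return "$ffn_status"
}
```

For example, `flag_fast_node lms-commit workflow.log 30` would flag the 7-second run above while leaving the 1m47s and 2m50s runs alone. The threshold is a heuristic, so it is best treated as a tripwire rather than proof of the bug.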

Inspecting the subagent's output via task-output log shows the bug signature:

Context loaded — `/prime` not yet run this session, and no task given.
What would you like to do?

The subagent was spawned but received no task — it idled out into an interactive prompt and the parent workflow's executor treated the clean exit as task-completion success.
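Because the idle prompt is so distinctive, the captured subagent output can also be checked for it directly. This is a sketch only, assuming the subagent's output has been saved to a file; `is_idle_prompt_output` is a hypothetical helper, and the two match strings are taken from the output quoted above.

```shell
# Succeeds (exit 0) if file $1 contains the idle-prompt signature,
# i.e. the subagent was started without a task. Matching is
# case-insensitive and uses fixed strings, not regular expressions.
is_idle_prompt_output() {
  grep -qiF -e 'What would you like to do?' -e 'no task given' "$1"
}
```

A wrapper step could run this right after the commit node and exit non-zero on a match, so the clean idle exit is no longer read as task completion.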

Reproductions

Three documented variants of the same surface symptom, possibly sharing one upstream root cause, possibly only related:

Variant 1: worktree_reused (well-documented in our local rules)

Reported in the Discovery Phase 0 era (early 2026). Trigger: a prior workflow run was interrupted with Ctrl+C mid-flight; re-kicking the workflow with the same --branch reuses the existing worktree. The commit-node subagent then fires with bad context and silently no-ops. Workaround: force-clean the worktree (git worktree remove --force ...) before re-kicking.
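The force-clean step can be wrapped so it is safe to run before every re-kick. This is a minimal sketch; `force_clean_worktree` and its two arguments (the parent checkout and the worktree path) are hypothetical names, and `--force` discards any uncommitted changes in the worktree, so recover wanted work first.

```shell
# Remove the worktree at $2 (registered in the repository at $1), if it exists,
# then prune stale worktree metadata. Safe to call when the worktree is absent.
# WARNING: --force throws away uncommitted changes in the worktree.
force_clean_worktree() {
  fcw_repo=$1 fcw_path=$2
  if [ -d "$fcw_path" ]; then
    git -C "$fcw_repo" worktree remove --force "$fcw_path" || return 1
  fi
  git -C "$fcw_repo" worktree prune
}
```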

Variant 2: fresh-worktree silent-no-op (this issue's primary surface)

Trigger condition NOT isolated. Two confirmed repros:

  • Phase 3a Sprint 1 (2026-05-17): exec-eval ran 23m46s and the evaluator passed all 7 criteria (scores 8-10); the commit-node completed in 8.3 seconds; merge-to-main reported success; no commit on main.
  • Phase 4b Sprint 3 (2026-05-18, today): exec-eval ran 25m12s and the evaluator passed all 9 criteria (scores 9-10); the commit-node completed in 7 seconds; merge-to-main reported success; no commit on main.

Both runs:

  • Worktree was freshly forked from origin/main at workflow start (verified by the workflow's own verify-worktree-fresh node passing).
  • Generator's work was complete and in the worktree (recovered manually post-bug — verified with git status + git diff).
  • No Ctrl+C, no prior interrupted run on the same branch.
  • The verify-worktree-fresh node and allocate-ports node both completed normally; only the late commit-node fired the silent-no-op.
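The "work complete but uncommitted" state described in the list above can be detected mechanically after the commit node returns. This is a sketch under assumptions: `uncommitted_outside_archon` is a hypothetical helper, and it ignores paths under `.archon/` on the assumption that drift there is expected and not part of the sprint's diff.

```shell
# Print uncommitted changes (modified, staged, or untracked) in worktree $1,
# ignoring anything under .archon/.
# Exit status comes from grep: 0 if such changes exist, 1 if there are none.
uncommitted_outside_archon() {
  git -C "$1" status --porcelain | grep -v '^.. \.archon/'
}
```

If this succeeds immediately after a commit node that reported success, the generator's diff was never committed and the run should be treated as failed.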

Variant 3: DAG cancellation after eval PASS but before commit (possibly related)

Phase 4a Sprint 3 (2026-05-17): exec-eval loop completed cleanly (evaluator PASS all 9 criteria scores 8-10 after 32m15s); DAG cancelled mid-flight before lms-commit could fire. Workflow finished without an anyFailed:true signal but no sprint commit landed. Distinct from variants 1+2 in that the commit-node never even attempts to run — the workflow cancels at the DAG layer between nodes. May be a different bug (DAG-executor cancellation race) but lumped here because the operator-facing recovery is identical: generator's work is in the worktree, manual stage + commit + push to recover.

Recovery (operator)

For variants 1 + 2 the recovery is mechanical and reliable:

cd <archon-worktree-path>
git status                               # confirm uncommitted contract-scoped diff
git add <contract.files_expected paths>  # NOT `git add -A` — .archon/ drift contaminates
git commit -m "feat(<slug>): sprint N description"
git push origin archon/task-feat-<slug>
cd <parent-checkout>
git fetch && git merge --ff-only origin/archon/task-feat-<slug>
git push origin main

~2 minutes per recovery.

Why I think it's an Archon issue, not a prompt issue

The lms-commit.md prompt at .archon/commands/lms-commit.md is unchanged across all three reproductions. It works correctly most of the time (~70-80% of sprints land cleanly). Within the same Phase 4b pipeline, the prompt fired successfully on Sprint 1 and Sprint 2 (commit-node durations 2m50s and 1m47s) and silently no-oped on Sprint 3 (7s). The intermittent failures plus the "What would you like to do?" output strongly suggest the subagent is being invoked without the task context being passed in correctly. That is an Archon-side subagent-invocation issue, not something the command prompt could fix.

Workaround we've shipped

  • The recovery recipe above, documented in our repo at .claude/rules/archon.md.
  • A phase-orchestrator workflow (lms-phase-pipeline.yaml) that snapshots origin/main SHA before each child sprint and re-checks after; if the SHA doesn't move, the phase halts with a recovery diagnostic instead of forking subsequent sprints from stale main. This caught today's repro cleanly.
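The snapshot-and-recheck idea behind the phase orchestrator can be sketched in a few lines of shell. This illustrates the technique only and is not the contents of lms-phase-pipeline.yaml; the function names are hypothetical, and it assumes the remote is named origin and the target branch is main.

```shell
# Print the SHA that origin's main currently points at, queried live from the
# remote via checkout $1 (so a stale local origin/main ref cannot mislead us).
remote_main_sha() {
  git -C "$1" ls-remote origin refs/heads/main | cut -f1
}

# Succeed only if origin's main has moved away from the snapshot SHA in $2.
# On failure, print a diagnostic to stderr and return 1 so the caller can halt.
main_advanced_since() {
  mas_now=$(remote_main_sha "$1")
  if [ -z "$mas_now" ] || [ "$mas_now" = "$2" ]; then
    printf 'origin/main still at %s; halting before the next sprint\n' "${2:-unknown}" >&2
    return 1
  fi
}
```

An orchestrator would capture `before=$(remote_main_sha "$checkout")` before starting a child sprint and call `main_advanced_since "$checkout" "$before"` once it finishes, refusing to fork the next sprint when the check fails.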

Neither workaround prevents the bug — they just make it survivable.

Precedent for upstream fix

A similar nested-spawn deadlock (CLAUDECODE=1 env inheriting into spawned claude.exe subprocesses) was filed as #1067 and fixed upstream; we recently retired our archon-nested.sh workaround in our repo's commit 2248a03 ("retire resume-key-guard workaround; upstream fix shipped"). Filing this in the hope of the same outcome.

What would help us reproduce

I don't have an isolated trigger. The bug fires after long-running exec-eval phases (roughly 24-32 min in the documented runs), on fresh worktrees, after the prior nodes (plan/exec/eval) complete normally. Repro rate is roughly 1 sprint in 4-5 across our Phase 3a / 4a / 4b runs. Happy to share full task-output logs from any of the three documented repros if useful; let me know and I'll attach them.

Related local docs

For reference, our local documentation of the three variants + the recovery recipe lives at .claude/rules/archon.md in the LMS-Project repo (private). I can paste relevant sections here if helpful.
