lms-commit-style claude.exe subagent silent no-op on fresh worktree — commit-node reports success in <15s, no commit lands, generator's work uncommitted in worktree

## Summary

A DAG-workflow command-node that invokes `claude.exe` as a subagent on a freshly-forked git worktree can complete in <15 seconds reporting success while never actually performing its task. The subagent appears to be invoked without task context, prints a "what would you like to do?" idle prompt, and exits. The downstream node (`merge-to-main` in our case) then runs against an unchanged branch tip and reports success because `git push` is a no-op when the branch is already at origin's tip.

Net effect: workflow reports green, no work actually lands on `main`, generator's diff sits uncommitted in the (now-orphaned) worktree.

## Environment

- Archon CLI (per `~/.bun/bin/archon` resolution; installed via Bun on Windows)
- Claude Code provider (`provider.claude` in workflow runs)
- Windows 10, Git Bash shell, Node-side `claude.exe` at `C:\Users\Dale\AppData\Roaming\npm\node_modules\@anthropic-ai\claude-code\bin\claude.exe`
- Wrapping `archon` invocations through a small wrapper that strips `CLAUDECODE` + `CLAUDE_CODE_ENTRYPOINT` (closes #1067)
- Workflows are repo-scoped Archon DAG yaml under `.archon/workflows/`; one of the late nodes invokes a `claude.exe` subagent via `command: lms-commit` (a markdown prompt at `.archon/commands/lms-commit.md`) whose job is to stage + commit the worktree's diff

## Symptom

Workflow logs from today's repro (Phase 4b Sprint 3):

```
[lms-plan-feature] Completed (32.3s)
[exec-eval] Completed (25m12s)
[cancel-check] Completed (81ms)
[lms-commit] Started
[lms-commit] Completed (7s)            ← bug fires here
[merge-to-main] Started
[merge-to-main] Completed (3.4s)        ← reports success but origin/main unchanged
```

Healthy commit-node duration is 1m30s – 3m (the subagent reads the diff, drafts a commit message, runs git add + commit + push). 7 seconds means the subagent did **none** of that.
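Because a healthy commit node runs for minutes and the failing one for seconds, duration alone can serve as a tripwire. The sketch below is a hypothetical helper, not part of Archon: the names `duration_to_ms` and `is_suspiciously_fast` are invented here, the duration format is inferred only from the log excerpt above, and the 15-second default threshold is an assumption based on the failures seen so far.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not Archon APIs). Duration tokens such as "81ms",
# "7s", "32.3s" and "25m12s" are assumed from the log excerpt above.

# Print the duration token in whole milliseconds; non-zero exit if unparseable.
duration_to_ms() {
  printf '%s\n' "$1" | awk '
    {
      s = $0
      if (s == "") exit 1
      # Pure millisecond form, e.g. "81ms".
      if (s ~ /^[0-9]+ms$/) { sub(/ms$/, "", s); printf "%d\n", s; exit 0 }
      ms = 0
      # Optional leading minutes, e.g. the "25m" in "25m12s".
      if (match(s, /^[0-9]+m/)) {
        ms += substr(s, 1, RLENGTH - 1) * 60000
        s = substr(s, RLENGTH + 1)
      }
      # Optional, possibly fractional, seconds, e.g. "12s" or "32.3s".
      if (s ~ /^[0-9]+(\.[0-9]+)?s$/) { sub(/s$/, "", s); ms += s * 1000; s = "" }
      if (s != "") exit 1          # leftover text: unrecognised token
      printf "%d\n", ms + 0.5      # round to the nearest millisecond
    }'
}

# Succeed (exit 0) when the duration is below the threshold in milliseconds.
# The default of 15000 ms is an assumption, not a documented Archon value.
is_suspiciously_fast() {
  local ms
  ms=$(duration_to_ms "$1") || return 2
  [ "$ms" -lt "${2:-15000}" ]
}
```

For example, `is_suspiciously_fast 7s` succeeds, while `is_suspiciously_fast 1m47s` does not, so a wrapper could halt before `merge-to-main` whenever the commit node finishes this quickly.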

Inspecting the subagent's output via the task-output log shows the bug signature:

```
Context loaded — `/prime` not yet run this session, and no task given.
What would you like to do?
```

The subagent was spawned but received no task — it idled out into an interactive prompt and the parent workflow's executor treated the clean exit as task-completion success.
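The idle-prompt text gives a second, more direct signal than duration. A minimal sketch of a check for it follows, assuming the subagent's output has been saved to a file. The function name `has_idle_prompt_signature` and the log path in the usage comment are hypothetical, and the matched string is taken verbatim from the output above.

```bash
#!/usr/bin/env bash
# Hypothetical check (not an Archon feature): detect the idle-prompt
# signature in a saved task-output log.

# Exit 0 if the log contains the idle prompt, 1 if it does not.
has_idle_prompt_signature() {
  # -F matches a fixed string (the "?" is literal); -q suppresses output.
  grep -qF 'What would you like to do?' "$1"
}

# Usage sketch (the log path is a placeholder):
#   if has_idle_prompt_signature "<task-output-log>"; then
#     echo "commit subagent received no task; treat this run as failed" >&2
#     exit 1
#   fi
```

Matching on the prompt text is brittle if the wording ever changes, so this probably works best alongside a duration check rather than as the only guard.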

## Reproductions

Three documented variants of the same surface symptom. They may share one upstream root cause, or they may only be related:

### Variant 1: `worktree_reused` (well-documented in our local rules)

Reported in the Discovery Phase 0 era (early 2026). Trigger: a prior workflow run was Ctrl+C'd mid-flight, and re-kicking the workflow with the same `--branch` reuses the existing worktree. The commit-node subagent then fires with bad context and silently no-ops. Workaround: force-clean the worktree (`git worktree remove --force ...`) before re-kicking.

### Variant 2: fresh-worktree silent-no-op (this issue's primary surface)

Trigger condition NOT isolated. Two confirmed repros:

- **Phase 3a Sprint 1 (2026-05-17):** 23m46s exec-eval, evaluator PASS all 7 criteria scores 8-10; commit-node fired in 8.3 seconds; `merge-to-main` reported success; no commit on `main`.
- **Phase 4b Sprint 3 (2026-05-18, today):** 25m12s exec-eval, evaluator PASS all 9 criteria scores 9-10; commit-node fired in 7 seconds; `merge-to-main` reported success; no commit on `main`.

Both runs:
- Worktree was freshly forked from `origin/main` at workflow start (verified by the workflow's own `verify-worktree-fresh` node passing).
- Generator's work was complete and in the worktree (recovered manually post-bug — verified with `git status` + `git diff`).
- No Ctrl+C, no prior interrupted run on the same branch.
- The `verify-worktree-fresh` node and `allocate-ports` node both completed normally; only the late commit-node fired the silent-no-op.

### Variant 3: DAG cancellation after eval PASS but before commit (possibly related)

Phase 4a Sprint 3 (2026-05-17): exec-eval loop completed cleanly (evaluator PASS all 9 criteria scores 8-10 after 32m15s); DAG cancelled mid-flight before `lms-commit` could fire. Workflow finished without an `anyFailed:true` signal but no sprint commit landed. Distinct from variants 1+2 in that the commit-node never even attempts to run — the workflow cancels at the DAG layer between nodes. May be a different bug (DAG-executor cancellation race) but lumped here because the operator-facing recovery is identical: generator's work is in the worktree, manual stage + commit + push to recover.

## Recovery (operator)

For variants 1 + 2 the recovery is mechanical and reliable:

```bash
cd <archon-worktree-path>
git status                               # confirm uncommitted contract-scoped diff
git add <contract.files_expected paths>  # NOT `git add -A` — .archon/ drift contaminates
git commit -m "feat(<slug>): sprint N description"
git push origin archon/task-feat-<slug>
cd <parent-checkout>
git fetch && git merge --ff-only origin/archon/task-feat-<slug>
git push origin main
```

~2 minutes per recovery.

## Why I think it's an Archon issue, not a prompt issue

The `lms-commit.md` prompt at `.archon/commands/lms-commit.md` is unchanged across all three reproductions, and it works correctly most of the time (~70-80% of sprints land cleanly). Within the same Phase 4b pipeline, the same prompt succeeded on Sprint 1 and Sprint 2 (commit-node durations 2m50s and 1m47s) and silently no-op'd on Sprint 3 (7s). The intermittent failures, combined with the "what would you like to do?" output, strongly suggest the subagent is being invoked without the task context being passed in correctly. That points to an Archon-side subagent-invocation issue, not something the command prompt could fix.

## Workaround we've shipped

- The recovery recipe above, documented in our repo at `.claude/rules/archon.md`.
- A phase-orchestrator workflow (`lms-phase-pipeline.yaml`) that snapshots `origin/main` SHA before each child sprint and re-checks after; if the SHA doesn't move, the phase halts with a recovery diagnostic instead of forking subsequent sprints from stale main. This caught today's repro cleanly.

Neither workaround prevents the bug — they just make it survivable.
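The SHA check described above can be sketched roughly as follows. This is not the contents of `lms-phase-pipeline.yaml`: `main_tip` and `require_main_advanced` are hypothetical names, and the only assumption about the environment is a remote called `origin` with a `main` branch.

```bash
#!/usr/bin/env bash
# Hypothetical guard illustrating the before/after SHA comparison.

# Print the current tip of origin/main after refreshing it.
main_tip() {
  git fetch --quiet origin main && git rev-parse origin/main
}

# Exit 0 if the tip moved, 1 if it is unchanged, 2 if a SHA is missing.
require_main_advanced() {
  local before="$1" after="$2"
  if [ -z "$before" ] || [ -z "$after" ]; then
    echo "missing SHA; cannot verify that main advanced" >&2
    return 2
  fi
  if [ "$before" = "$after" ]; then
    echo "origin/main still at $before after sprint; halting for manual recovery" >&2
    return 1
  fi
}

# Usage sketch:
#   before=$(main_tip)
#   ... run the child sprint workflow here ...
#   after=$(main_tip)
#   require_main_advanced "$before" "$after" || exit 1
```

Comparing the remote tip rather than trusting node exit codes is what lets the orchestrator catch a run where every node reports success but nothing landed.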

## Precedent for upstream fix

A similar nested-spawn deadlock (`CLAUDECODE=1` env inheriting into spawned claude.exe subprocesses) was filed as #1067 and fixed upstream; we recently retired our `archon-nested.sh` workaround in our repo's commit `2248a03` ("retire resume-key-guard workaround; upstream fix shipped"). Filing this in the hope of the same outcome.

## What would help us reproduce

I don't have an isolated trigger. The bug fires on long-running exec-eval phases (25-32 min), on fresh worktrees, after the prior nodes (plan/exec/eval) complete normally. Repro rate is roughly 1 sprint in 4-5 across our Phase 3a / 4a / 4b runs. Happy to share full task-output logs from any of the three documented repros if useful — let me know and I'll attach.

## Related local docs

For reference, our local documentation of the three variants + the recovery recipe lives at `.claude/rules/archon.md` in the LMS-Project repo (private). I can paste relevant sections here if helpful.
