
fix(claude): stop passing --no-env-file to native binary in dev mode#1461

Merged
Wirasm merged 2 commits into dev from fix/sdk-no-env-file-native-binary
Apr 28, 2026


Wirasm (Collaborator) commented Apr 28, 2026

Summary

  • Problem: Every Claude SDK call from a dev-mode Archon (e.g. `bun run cli`, `bun run dev`) crashes the SDK subprocess with `error: unknown option '--no-env-file'`. This surfaced while smoke-testing #1460 (chore(deps): bump claude-agent-sdk to 0.2.121, codex-sdk to 0.125.0): `e2e-claude-smoke` failed before the redesign branch could be validated against Claude.
  • Why it matters: Dev mode is unusable for any Claude workflow today. CI smoke tests, the title-generator background service, and direct `/workflow run` invocations all hit it.
  • What changed: Tighten `shouldPassNoEnvFile` so it only returns true when the resolved executable path explicitly ends in a Bun-runnable JS extension (`.js` in the first commit; widened to `.js`/`.mjs`/`.cjs` after review). The historical `cliPath === undefined → true` heuristic was based on the SDK shipping `cli.js` inside the package; SDK 0.2.x switched to per-platform native binaries (e.g. `@anthropic-ai/claude-agent-sdk-darwin-arm64/claude`) and dev mode now resolves to one of those. Native binaries reject `--no-env-file`.
  • What did NOT change: CWD `.env` leak protection is unaffected. `stripCwdEnv()` in `@archon/paths` is the actual guard — it deletes Bun-auto-loaded `.env`/`.env.local`/`.env.development`/`.env.production` keys from `process.env` at every Archon entry point before any subprocess is spawned. The native Claude binary doesn't auto-load `.env` from its cwd either (verified end-to-end with sentinel keys). `--no-env-file` was belt-and-suspenders for the JS-via-Bun case only.
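For reviewers who want the predicate change spelled out, here is a minimal before/after sketch. Only the name `shouldPassNoEnvFile` comes from this PR. `shouldPassNoEnvFileBefore` is a hypothetical reconstruction of the old `cliPath === undefined → true` heuristic (assuming it also accepted explicit `.js` paths, as the compatibility notes imply), and neither body is the merged code.

```typescript
// Hypothetical reconstruction of the pre-fix heuristic: "no configured path"
// was assumed to mean "the SDK will spawn its bundled cli.js via Bun".
function shouldPassNoEnvFileBefore(cliPath: string | undefined): boolean {
  return cliPath === undefined || cliPath.endsWith(".js");
}

// Sketch of the tightened predicate from the first commit: only an explicit
// `.js` path (a Bun-runnable script) gets `--no-env-file`. An undefined path
// now yields false, because on SDK 0.2.x dev mode resolves a native binary
// that rejects the flag.
function shouldPassNoEnvFile(cliPath: string | undefined): boolean {
  return cliPath !== undefined && cliPath.endsWith(".js");
}
```

Suffix matching with `endsWith` is what keeps look-alikes such as `cli.json` and `cli.js.bak` out: neither string ends in `.js`.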

UX Journey

Before

```
$ bun run cli workflow run e2e-claude-smoke --no-worktree "smoke"
[simple] Started
{...,"stderr":"error: unknown option '--no-env-file'"...,"msg":"subprocess_error"}
{...,"err":...,"errorClass":"crash","attempt":3,"maxRetries":3,"msg":"query_error"}
[simple] Failed: Claude Code crash: ... (stderr: error: unknown option '--no-env-file')
❌ DAG workflow 'e2e-claude-smoke' completed with no successful nodes.
```

After

```
$ bun run cli workflow run e2e-claude-smoke --no-worktree "smoke"
[archon] stripped 23 keys from /Users/rasmus/Projects/cole/Archon (.env, .env.local)
[simple] Started
4
[simple] Completed (2.4s)
[assert] Started
[assert] Completed (6ms)
PASS: simple='4', no sentinel leak

Workflow completed successfully.
```

Architecture Diagram

No architectural change — single-function predicate fix in the Claude provider. The two-layer leak-defense model is unchanged:

```
                              ┌────────────────────────────────────┐
                              │ Layer 1: Archon process boot       │
bun run cli ────────────────▶ │ stripCwdEnv() deletes CWD .env     │
                              │ keys from process.env              │
                              └─────────────────┬──────────────────┘
                                                │
                                                ▼
                              ┌────────────────────────────────────┐
                              │ Layer 2: subprocess spawn          │
                              │ (was: --no-env-file for Bun-run    │
                              │ cli.js; SDK no longer ships JS,    │
                              │ so layer 2 is a no-op for dev)     │
                              └─────────────────┬──────────────────┘
                                                │
                                                ▼
                              ┌────────────────────────────────────┐
                              │ Claude Code subprocess             │
                              │ inherits already-cleaned env       │
                              └────────────────────────────────────┘
```

Connection inventory:

| From | To | Status | Notes |
|---|---|---|---|
| `shouldPassNoEnvFile` | SDK `executableArgs` | unchanged | only the predicate body changed |
| `stripCwdEnv()` | Archon `process.env` | unchanged | still the actual leak guard |

Label Snapshot

  • Risk: `risk: low`
  • Size: `size: XS`
  • Scope: `providers` (claude)
  • Module: `providers:claude`

Change Metadata

  • Change type: `bug`
  • Primary scope: `providers` (claude)

Linked Issue

Validation Evidence (required)

```bash
bun run validate

EXIT=0

```

All five gates pass — `check:bundled`, `type-check` (10 packages), `lint --max-warnings 0`, `format:check`, `test` (every package, every file `0 fail`).

End-to-end probe:

  1. Added a unique sentinel `ARCHON_LEAK_SENTINEL_$$=...` to Archon's `.env`.
  2. Extended `e2e-claude-smoke`'s bash assert node to grep `env` for any `ARCHON_LEAK_SENTINEL_` key in the spawned subprocess.
  3. Ran `bun packages/cli/src/cli.ts workflow run e2e-claude-smoke --no-worktree`.
  4. stderr: `[archon] stripped 23 keys from /Users/rasmus/Projects/cole/Archon (.env, .env.local) to prevent target repo env from leaking into Archon processes`.
  5. Bash node: `PASS: simple='4', no sentinel leak`.
  6. Workflow completes cleanly with no `unknown option` rejection.
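The assert node's check boils down to "does any env key start with the sentinel prefix?". The PR did this with `grep` in a bash node; the TypeScript below is an equivalent sketch, and `findLeakedSentinels` is a hypothetical helper name, not part of Archon.

```typescript
// Prefix used by the sentinel keys added to Archon's `.env` in the probe.
const SENTINEL_PREFIX = "ARCHON_LEAK_SENTINEL_";

// Hypothetical helper mirroring the bash `env | grep` probe: return every
// env key that starts with the sentinel prefix.
function findLeakedSentinels(
  env: Record<string, string | undefined>,
  prefix: string = SENTINEL_PREFIX,
): string[] {
  return Object.keys(env).filter((key) => key.startsWith(prefix));
}

// Inside the spawned subprocess you would pass `process.env`; an empty result
// corresponds to the "no sentinel leak" PASS line.
```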

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No (the leak guard is unchanged; this commit just stops emitting an unsupported flag).
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — existing configs that point at native installer binaries already had `shouldPassNoEnvFile === false` and continue to behave identically. Configs that explicitly point at a `cli.js` (legacy npm-installed SDK) still get `--no-env-file` (the predicate accepts `.js` paths). The only behavioral change is for dev mode where `cliPath` is `undefined`, which now correctly omits the flag.
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

  • Verified scenarios: Reproduced the SDK rejection by invoking the bundled native binary directly with `--no-env-file --print "hi"` — confirmed `error: unknown option '--no-env-file'`. Ran the e2e-claude-smoke after the fix; `[simple] Completed (2.4s)` and the workflow exits successfully. Title-generator background service also went from `title.generate_failed` → `title.generate_completed` in the same run.
  • Edge cases checked: Provider tests for the predicate cover undefined, explicit cli.js, native binary paths (Linux/macOS/Windows/Homebrew symlink), and suffix edge cases (`cli.json`, `cli.js.bak`).
  • What was not verified: Behavior on a host that explicitly configures `claudeBinaryPath: /path/to/cli.js` (legacy npm-installed SDK with the JS entry point) — the predicate still returns true for those, but I don't have a JS-cli installation to manually exercise.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Any code path that calls `ClaudeProvider.sendQuery` in dev mode (`bun run cli`, `bun run dev`, every workflow with a Claude node, the title-generator background service).
  • Potential unintended effects: Configurations where users had wrapped Bun/Node around a JS Claude entry point may have been relying on the prior `undefined → true` heuristic. With this change, those configs need an explicit `claudeBinaryPath: /path/to/cli.js` so the predicate matches the `.js` suffix. Stop-gap is one config line; the SDK no longer ships such an entry point in its own package, so this combination is rare.
  • Guardrails/monitoring for early detection: Existing `claude.subprocess_env_file_flag` debug log records the decision per request; `subprocess_error` log fires on rejection.

Rollback Plan (required)

  • Fast rollback: `git revert `. No data, config, or schema changes.
  • Feature flags or config toggles: None — pure predicate fix.
  • Observable failure symptoms: Any reintroduction of the prior bug surfaces as `error: unknown option '--no-env-file'` in `subprocess_error` logs and `Claude Code process exited with code 1` from the SDK retry loop.

Risks and Mitigations

  • Risk: A user explicitly configured `claudeBinaryPath` to a wrapper that needs `--no-env-file` and relied on the dev-mode default to provide it.
    • Mitigation: Such users only need to ensure the configured path ends in `.js` (which is the standard for the legacy npm cli.js entry point anyway). Documented in the updated comment on `shouldPassNoEnvFile`.

Summary by CodeRabbit

  • Bug Fixes

    • Fixed subprocess env-file flag behavior so the flag is only applied for legacy Bun-runnable JS CLI entrypoints; native dev-mode binaries no longer receive the unsupported flag.
  • Tests

    • Expanded tests to cover JS/TS entrypoint permutations and added an integration-style test verifying when the flag is passed.
  • Documentation / Changelog

    • Clarified subprocess .env isolation and updated docs/changelog to reflect native-binary handling in SDK 0.2.x.

The Claude Agent SDK switched from shipping `cli.js` inside the package
to per-platform native binaries via optional deps somewhere in the
0.2.x series. As of 0.2.121 there is no `cli.js` in the SDK package;
dev mode resolves to `@anthropic-ai/claude-agent-sdk-darwin-arm64/claude`
(Mach-O). That native binary rejects `--no-env-file` with
`error: unknown option '--no-env-file'` and the subprocess exits 1.

`shouldPassNoEnvFile` was returning true on `cliPath === undefined` on
the assumption that "dev mode = JS executable run via Bun". That
assumption is dead. Tighten the predicate to only return true on an
explicit `.js` suffix, so we only emit the flag when the SDK is going
to spawn a Bun-runnable script.

CWD `.env` leak protection is unaffected. `stripCwdEnv()` in
`@archon/paths` (#1067) deletes Bun-auto-loaded `.env`/`.env.local`/
`.env.development`/`.env.production` keys from `process.env` at every
Archon entry point before any subprocess is spawned. The native Claude
binary does not auto-load `.env` from its cwd either. `--no-env-file`
was belt-and-suspenders for the JS-via-Bun case only.
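A minimal sketch of the "delete keys Bun auto-loaded from the CWD" idea, assuming the guard works by reading the CWD env files and removing every key they define. The real `stripCwdEnv()` in `@archon/paths` is not shown in this PR, so `parseDotenvKeys` and `stripCwdEnvKeys` are hypothetical names, the parser handles only simple `KEY=value` lines, and file contents are passed in as strings to keep the example free of I/O.

```typescript
// Files Bun auto-loads from the CWD, per the description above.
const CWD_ENV_FILES = [".env", ".env.local", ".env.development", ".env.production"];

// Hypothetical minimal parser: collect key names from `KEY=value` lines,
// skipping blanks and `#` comments and tolerating an `export ` prefix.
// Multi-line quoted values are not handled.
function parseDotenvKeys(contents: string): string[] {
  const keys: string[] = [];
  for (const rawLine of contents.split(/\r?\n/)) {
    const line = rawLine.trim().replace(/^export\s+/, "");
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq > 0) keys.push(line.slice(0, eq).trim());
  }
  return keys;
}

// Hypothetical guard: delete every key defined in the CWD env files from
// `env` (normally `process.env`) and report which files contributed keys.
function stripCwdEnvKeys(
  env: Record<string, string | undefined>,
  files: Record<string, string>, // file name -> file contents
): { stripped: number; sources: string[] } {
  let stripped = 0;
  const sources: string[] = [];
  for (const name of CWD_ENV_FILES) {
    const contents = files[name];
    if (contents === undefined) continue;
    const keys = parseDotenvKeys(contents);
    if (keys.length > 0) sources.push(name);
    for (const key of keys) {
      // A key repeated across files is only deleted (and counted) once.
      if (key in env) {
        delete env[key];
        stripped++;
      }
    }
  }
  return { stripped, sources };
}
```

Running a guard like this before any subprocess is spawned is what makes `--no-env-file` redundant: a child that receives `{ ...process.env }` never sees the stripped keys.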

Verified end-to-end with a sentinel: added a unique
`ARCHON_LEAK_SENTINEL_$$` to Archon's `.env`, ran e2e-claude-smoke
with a bash probe checking the subprocess env. stderr shows
`[archon] stripped 23 keys from /Users/rasmus/Projects/cole/Archon
(.env, .env.local)` — sentinel was deleted. Bash node prints
`PASS: simple='4', no sentinel leak`. Workflow completes cleanly,
no `--no-env-file` rejection from the SDK binary.

bun run validate: green across all 10 packages.

coderabbitai Bot commented Apr 28, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0bbcc400-9e6e-4546-a641-207bdf6d3049

📥 Commits

Reviewing files that changed from the base of the PR and between bfcc107 and 75e5a3e.

📒 Files selected for processing (5)
  • CHANGELOG.md
  • packages/docs-web/src/content/docs/reference/security.md
  • packages/providers/src/claude/binary-resolver.ts
  • packages/providers/src/claude/provider.test.ts
  • packages/providers/src/claude/provider.ts
✅ Files skipped from review due to trivial changes (2)
  • packages/providers/src/claude/binary-resolver.ts
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/providers/src/claude/provider.test.ts

📝 Walkthrough

The Claude provider now only applies --no-env-file when the configured CLI path is an explicit Bun-runnable JS entrypoint (.js, .mjs, .cjs). cliPath === undefined no longer implies passing that flag. Tests, docs, and the changelog were updated to reflect this behavioral change.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Provider implementation | `packages/providers/src/claude/provider.ts` | Changed `shouldPassNoEnvFile(cliPath)` to return false when `cliPath` is undefined and to return true only for explicit Bun-runnable JS extensions (`.js`, `.mjs`, `.cjs`). Removed redundant debug field in `buildBaseClaudeOptions`. |
| Provider tests | `packages/providers/src/claude/provider.test.ts` | Updated expectations: `shouldPassNoEnvFile(undefined)` now false; expanded cases for `.mjs`/`.cjs` true and `.ts`/`.tsx`/`.jsx` false; changed subprocess invocation assertions to omit `executableArgs` when `cliPath` is undefined; added integration-style test mocking `resolveClaudeBinaryPath`. |
| Binary resolver docs | `packages/providers/src/claude/binary-resolver.ts` | JSDoc updated to clarify the resolution target is the SDK's native per-platform executable (SDK ≥0.2.x) while still noting legacy `cli.js` support. No runtime logic changes. |
| Security docs | `packages/docs-web/src/content/docs/reference/security.md` | Clarified the subprocess `.env` isolation flow and narrowed the description of when `executableArgs: ['--no-env-file']` is applied (only for legacy Bun-runnable JS entrypoints). |
| CHANGELOG | `CHANGELOG.md` | Added an Unreleased fix describing the change: restrict `--no-env-file` to explicit Bun-runnable JS entrypoints to avoid crashing dev-mode native binaries. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 I sniffed the CLI path on the run,
Native hops now skip the flag,
Only .js gets the little bun,
Tests aligned — no more snag.
A tidy patch, a carrot drag! 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main fix: stopping the `--no-env-file` flag from being passed to the native Claude binary in dev mode, which is the core issue causing crashes. |
| Description check | ✅ Passed | The description comprehensively covers the required template sections: problem statement, impact, scope boundaries, UX journey with before/after flows, architecture diagram with connection inventory, change metadata, validation evidence with end-to-end testing, security impact assessment, backward compatibility analysis, human verification details, side effects analysis, and rollback plan. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Wirasm (Collaborator, Author) commented Apr 28, 2026

PR Review Summary (multi-agent)

Ran code-reviewer, docs-impact, pr-test-analyzer, comment-analyzer against the diff. The fix itself is correct and minimal; the issues below are around the edges of the change — stale comments elsewhere in the same file, a missing changelog entry, stale security docs, and a redundant log field.

Critical Issues (1 found)

| Agent | Issue | Location |
|---|---|---|
| comment-analyzer | File-level JSDoc directly contradicts the new `shouldPassNoEnvFile` JSDoc: it still says "In dev mode the SDK resolves cli.js itself from node_modules", which is exactly the assumption this PR documents as broken on SDK 0.2.x. | `packages/providers/src/claude/provider.ts:18` |

Important Issues (3 found)

| Agent | Issue | Location |
|---|---|---|
| docs-impact | `security.md` still claims item 2 of target-repo `.env` isolation is "`executableArgs: ['--no-env-file']` prevents Bun from auto-loading `.env` in the Claude Code subprocess CWD." Post-#1461 the flag is only passed for explicit `.js` paths; the native binary (SDK 0.2.x default, including dev mode) never gets it. | `packages/docs-web/src/content/docs/reference/security.md:132` |
| docs-impact | No `CHANGELOG.md` entry for this fix in `[Unreleased]`; only `$LOOP_PREV_OUTPUT` is listed. Suggested entry: a `### Fixed` line covering the `unknown option '--no-env-file'` crash and noting the `.env` isolation guarantee is unchanged. | `CHANGELOG.md:8` |
| comment-analyzer | `binary-resolver.ts` JSDoc title still says "Resolve the path to the Claude Code SDK's cli.js." The function was generalised to native binaries long ago; reinforcing the stale framing alongside the new corrected JSDoc on `shouldPassNoEnvFile` is confusing. | `packages/providers/src/claude/binary-resolver.ts:56` |

Suggestions (4 found)

| Agent | Suggestion | Location |
|---|---|---|
| code-reviewer | The predicate matches `.js` only; `.mjs` and `.cjs` (also Bun-runnable JS) silently fall through to false. Practical impact is low (no operator points `claudeBinaryPath` at `sdk.mjs`, and `stripCwdEnv()` is the real guard), but the comment implies broader coverage than the code provides. Either widen to `['.js','.mjs','.cjs'].some(ext => cliPath?.endsWith(ext))` or tighten the comment to say `.js` only is intentional. | `packages/providers/src/claude/provider.ts:567` |
| pr-test-analyzer | No symmetric integration assertion that `executableArgs: ['--no-env-file']` is actually present when `resolveClaudeBinaryPath` returns an explicit `.js` path. The unit predicate test covers it, but a regression in `buildBaseClaudeOptions`'s conditional spread wouldn't be caught. Rating 6/10. | `packages/providers/src/claude/provider.test.ts:498-528` |
| pr-test-analyzer | Add a `shouldPassNoEnvFile('/path/to/cli.mjs') → false` test to document the decision about `.mjs`/`.cjs`, regardless of whether the implementation widens. Rating 5/10. | `packages/providers/src/claude/provider.test.ts:22-55` |
| comment-analyzer | Redundant log field: `{ cliPath, isJsExecutable, passesNoEnvFile: isJsExecutable }` logs the same boolean twice under two names. Drop `passesNoEnvFile`. | `packages/providers/src/claude/provider.ts:586` |

Strengths

  • Predicate change is genuinely one-line and KISS-aligned. No new abstractions, no config keys, no speculative branches.
  • stripCwdEnv() chain verified end-to-end: called synchronously via import '@archon/paths/strip-cwd-env-boot' at every Archon entry point (packages/cli/src/cli.ts:12, packages/server/src/index.ts:10) before any module reads process.env. The PR's safety claim ("CWD .env leak protection comes from stripCwdEnv(), not from --no-env-file") is accurate.
  • All five shouldPassNoEnvFile unit cases are self-consistent with the new behaviour (undefined → false; explicit .js → true; native paths on Linux/macOS/Windows/Homebrew → false; .json and .js.bak false-positive guards → false).
  • Integration assertion at provider.test.ts:516 (executableArgs is undefined in dev mode) correctly validates the fix through the real buildBaseClaudeOptions codepath via sendQuery.
  • Updated JSDoc on shouldPassNoEnvFile itself is a model of "WHY > WHAT" — explains the Bun-flag mechanism, the historical assumption, why it broke on 0.2.x, and the safety fallback.

Verdict

NEEDS FIXES — none of the issues block correctness of the predicate fix, but the stale file-level JSDoc at provider.ts:18 directly contradicts the new function JSDoc on the very same fact (whether dev mode = cli.js or native binary), which is the kind of inconsistency that wastes the next reader's time. The security.md and CHANGELOG.md updates are user-facing.

Recommended Actions

  1. Update provider.ts:18 to reflect that dev mode resolves a native binary on SDK 0.2.x (or remove the cli.js claim from that file-level comment).
  2. Update security.md:132 to scope item 2 to "JS cli.js only" and note that native binaries don't auto-load CWD .env.
  3. Add a ### Fixed entry under [Unreleased] in CHANGELOG.md.
  4. (Optional, polish) Update binary-resolver.ts:56 JSDoc title; drop the redundant passesNoEnvFile log field; decide on .mjs/.cjs and either widen the predicate or document the scoping explicitly.

Critical: file-level JSDoc at provider.ts:18 still claimed dev mode
resolves cli.js. Updated to reflect SDK 0.2.x's switch to per-platform
native binaries.

Important: security.md still listed --no-env-file as item 2 of
target-repo .env isolation. Scoped that bullet to legacy
Bun-runnable JS entry points and called out that native binaries
don't auto-load .env from cwd. Added an Unreleased Fixed entry to
CHANGELOG.md. Updated binary-resolver.ts JSDoc title that referenced
cli.js.

Polish: widened the predicate to accept .mjs and .cjs (also
Bun-runnable JS — matches the SDK's own internal extension list).
Dropped the redundant `passesNoEnvFile` log field that mirrored
`isJsExecutable`. Added unit cases for .mjs/.cjs (now true) and
.ts/.tsx/.jsx (deliberately false — never SDK entry points).

Added an integration test that mocks resolveClaudeBinaryPath to
return a .js path and asserts executableArgs: ['--no-env-file']
flows through buildBaseClaudeOptions all the way to the SDK call —
catches future regressions in the conditional spread.
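The regression that integration test guards against lives in how the predicate's result is spread into the SDK options. Below is a minimal sketch of that shape, assuming only the two SDK option names quoted in this PR (`pathToClaudeCodeExecutable`, `executableArgs`). `buildSpawnOptions` is a hypothetical stand-in for the relevant slice of `buildBaseClaudeOptions`, not its actual code, and the predicate is injected so the sketch does not depend on any particular `shouldPassNoEnvFile` body.

```typescript
// Local stand-in for the two SDK options this PR touches; the option names
// come from the PR, the interface itself is not the SDK's own type.
interface SpawnOptions {
  pathToClaudeCodeExecutable?: string;
  executableArgs?: string[];
}

type NoEnvFilePredicate = (cliPath: string | undefined) => boolean;

// Each conditional spread adds its key only when it applies, so dev mode
// (cliPath undefined) produces neither key and, in this sketch, the SDK is
// left to resolve its own bundled executable.
function buildSpawnOptions(
  cliPath: string | undefined,
  passNoEnvFile: NoEnvFilePredicate,
): SpawnOptions {
  return {
    ...(cliPath !== undefined ? { pathToClaudeCodeExecutable: cliPath } : {}),
    ...(passNoEnvFile(cliPath) ? { executableArgs: ["--no-env-file"] } : {}),
  };
}
```

Asserting on the built options object, rather than only on the predicate, is what catches a broken or inverted spread.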

bun run validate: green across all 10 packages.
Wirasm (Collaborator, Author) commented Apr 28, 2026

Thanks for the thorough review. Pushed 75e5a3e6 addressing every item.

Critical

  • provider.ts:18 file-level JSDoc rewritten — now states dev mode resolves the SDK's bundled per-platform native binary (Mach-O/ELF/PE from @anthropic-ai/claude-agent-sdk-<platform>) and explicitly notes the pre-0.2.x cli.js shape as historical context. Cross-references shouldPassNoEnvFile for the flag implications.

Important

  • security.md:132 rewritten. Item 2 now reads: "when the SDK is configured to spawn a Bun-runnable JS entry point (legacy npm-installed cli.js/cli.mjs/cli.cjs), Archon also passes executableArgs: ['--no-env-file'] so Bun skips its env autoload inside the spawned process. SDK 0.2.x ships per-platform native binaries instead — those don't auto-load .env from cwd, so the flag is unnecessary and is omitted." Item 1 also got a "This is the primary guard" qualifier so the stripCwdEnv()-as-real-defense story isn't buried.
  • CHANGELOG.md got an Unreleased ### Fixed entry covering the unknown option '--no-env-file' crash and explicitly noting the .env isolation guarantee is unchanged.
  • binary-resolver.ts:56 JSDoc title rewritten — now says "Resolve the path to the Claude Code executable (native binary in SDK 0.2.x; legacy cli.js is still accepted for operators pinned to npm-installed SDKs that ship a JS entry point)."

Polish

  • Widened the predicate to .js/.mjs/.cjs via a BUN_JS_EXTENSIONS const. .ts/.tsx/.jsx are deliberately excluded — the SDK has never shipped those as entry points, so accepting them would only widen misconfiguration. Added unit tests for both directions: .mjs/.cjs → true; .ts/.tsx/.jsx → false. (Matches the spirit of the SDK's own internal MR() predicate, scoped to legitimate runtime entry points.)
  • Dropped the redundant passesNoEnvFile: isJsExecutable log field. Now logs { cliPath, isJsExecutable }.
  • Added the symmetric integration test the test-analyzer flagged: spyOn(binaryResolver, 'resolveClaudeBinaryPath').mockResolvedValue('/usr/local/.../cli.js'), run sendQuery, assert callArgs.options.executableArgs is ['--no-env-file'] AND pathToClaudeCodeExecutable matches. This catches regressions in buildBaseClaudeOptions's conditional spread that the predicate-only unit tests would miss.

bun run validate green across all 10 packages (71 pass / 0 fail in provider.test.ts).
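A sketch of the widened predicate described in the Polish section. `BUN_JS_EXTENSIONS` is the constant name quoted above; the body is an assumption about its shape, not a copy of the merged code.

```typescript
// Extensions treated as Bun-runnable JS entry points. `.ts`, `.tsx`, and `.jsx`
// are deliberately absent: the SDK has never shipped them as entry points.
const BUN_JS_EXTENSIONS = [".js", ".mjs", ".cjs"] as const;

function shouldPassNoEnvFile(cliPath: string | undefined): boolean {
  // Dev mode (no configured path) resolves a native binary on SDK 0.2.x,
  // which rejects `--no-env-file`, so undefined means "don't pass it".
  if (cliPath === undefined) return false;
  const path: string = cliPath;
  return BUN_JS_EXTENSIONS.some((ext) => path.endsWith(ext));
}
```

Case by case, this maps `undefined` and native binary paths to false, `.js`/`.mjs`/`.cjs` to true, and `.ts`/`.tsx`/`.jsx` as well as `cli.json`/`cli.js.bak` to false.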

Wirasm merged commit ff90111 into dev Apr 28, 2026
4 checks passed
Wirasm deleted the fix/sdk-no-env-file-native-binary branch April 28, 2026 09:50
Wirasm mentioned this pull request Apr 29, 2026
prospapledge88 added a commit to prospapledge88/Archon that referenced this pull request May 5, 2026
* fix(core/test): split connection.test.ts from DB-test batch to avoid mock pollution (coleam00#1269)

messages.test.ts uses mock.module('./connection', ...) at module-load time.
Per CLAUDE.md:131 (Bun issue oven-sh/bun#7823), mock.module() is process-
global and irreversible. When Bun pre-loads all test files in a batch, the
mock shadows the real connection module before connection.test.ts runs,
causing getDatabaseType() to always return the mocked value regardless of
DATABASE_URL.

Move connection.test.ts into its own `bun test` invocation immediately
after postgres.test.ts (which runs alone) and before the big DB/utils/
config/state batch that contains messages.test.ts. This follows the same
isolation pattern already used for command-handler, clone, postgres, and
path-validation tests.

* fix(setup): align PORT default on 3090 across .env.example, wizard, and JSDoc (coleam00#1152) (coleam00#1271)

The server's getPort() fallback changed from 3000 to 3090 in the Hono
migration (coleam00#318), but .env.example, the setup wizard's generated .env,
and the JSDoc describing the fallback were not updated — leaving three
different sources of truth for "the default PORT."

When the wizard writes PORT=3000 to ~/.archon/.env (which the Hono
server loads with override: true, while Vite only reads repo-local
.env), the two processes can land on different ports silently. That
mismatch is the real mechanism behind the failure described in coleam00#1152.

- .env.example: comment out PORT, document 3090 as the default
- packages/cli/src/commands/setup.ts: wizard no longer writes PORT=3000
  into the generated .env; fix the "Additional Options" note
- packages/cli/src/commands/setup.test.ts: assert no bare PORT= line and
  the commented default is present
- packages/core/src/utils/port-allocation.ts: fix stale JSDoc "default
  3000" -> "default 3090"
- deploy/.env.example: keep Docker default at 3000 (compose/Caddy target
  that) but annotate it so users don't copy it for local dev

Single source of truth for the local-dev default is now basePort in
port-allocation.ts.

* fix(providers/claude): use || instead of ?? in hasExplicitTokens to handle empty-string env vars (coleam00#1028)

Closes coleam00#1027
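The difference this fix relies on: `??` only falls back on `null`/`undefined`, so an empty-string env var short-circuits it, while `||` also skips `''`. A minimal sketch of one plausible way an empty string breaks a `??` chain; `PRIMARY_TOKEN` and `FALLBACK_TOKEN` are hypothetical placeholders, not necessarily the variables `hasExplicitTokens` reads.

```typescript
type Env = Record<string, string | undefined>;

// Buggy shape: `''` is not nullish, so an empty first var hides a real token
// in the second one and the check reports "no tokens".
function hasExplicitTokensNullish(env: Env): boolean {
  return Boolean(env.PRIMARY_TOKEN ?? env.FALLBACK_TOKEN);
}

// Fixed shape: `||` treats `''` like "unset" and moves on to the next var.
function hasExplicitTokens(env: Env): boolean {
  return Boolean(env.PRIMARY_TOKEN || env.FALLBACK_TOKEN);
}
```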

* chore(deps): bump claude-agent-sdk to 0.2.121, codex-sdk to 0.125.0 (coleam00#1460)

Both SDKs were ~30 patch releases behind. Validation suite passes
(type-check, lint, format, tests across all 10 packages) without code
changes. The only sustained Claude SDK behavior change in the range —
v0.2.111's options.env overlay/replace flap, since reverted to overlay —
is a no-op for Archon, which already passes { ...process.env } as the
SDK env.

* fix(cli): lazy-import bundled skill files so non-setup commands don't crash on missing source (coleam00#1394)

The 18 top-level `import … with { type: 'text' }` statements in
`bundled-skill.ts` resolve at module load. For `bun build --compile` that's
build time, so the binary embeds the strings and works regardless of any
on-disk skill files. For `bun link` (linked-source) installs that's every
`archon` invocation — including `archon --help`, which doesn't even use the
skill content. If any of the 18 source files are missing or moved, the
import fails and the CLI cannot start at all.

The skill content is data the binary deploys via `archon setup`, not data
the CLI needs at runtime. There's only one consumer in production code:
`copyArchonSkill()` in `setup.ts`. Moving the import into that function as
a dynamic import preserves the compiled-binary behavior (Bun's bundler
statically analyses literal-string `import()` and embeds the chunk —
verified by grepping the SKILL.md frontmatter out of a freshly compiled
binary) while making the linked-source install resilient: only `archon
setup` triggers the bundled-skill module load now. Verified: a known skill
string appears in the compiled binary 1×, and `archon --help` no longer
needs the source files to start.

`copyArchonSkill()` becomes async because the dynamic import is a Promise.
The single production call site is already in an async function and gets
an `await`. The four `setup.test.ts` cases become async too.

* fix(claude): stop passing --no-env-file to native binary in dev mode (coleam00#1461)

* fix(claude): address review on coleam00#1461 (stale docs + test gaps)

* fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop (coleam00#1294)

* fix(orchestrator): clear stale session ID on error_during_execution to prevent infinite failure loop

When a Claude API session expires (e.g. after container restart), the orchestrator
persists the new (failed) session ID from the error result, causing every subsequent
message in that conversation to hit the same error — an infinite failure loop.

Fix: on error_during_execution result, set assistant_session_id to NULL instead of
persisting the failed session ID. The next message starts a fresh session with full
context rebuilt from the DB. Conversation history is unaffected since it lives in
remote_agent_messages, independent of the Claude session.

Changes:
- updateSession() and tryPersistSessionId() now accept string | null
- Both handleStreamMode and handleBatchMode clear session ID on error_during_execution

Fixes coleam00#1280
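A minimal sketch of the persistence decision described above. `ResultMessage` is a local stand-in modeled loosely on the SDK's result message (only `subtype` and `session_id` are used), and `sessionIdToPersist` is a hypothetical helper, not the orchestrator's actual code.

```typescript
// Local stand-in for an SDK result message; only the fields used here.
interface ResultMessage {
  type: "result";
  subtype: string; // e.g. "success" or "error_during_execution"
  session_id: string;
}

// Decide what to store as assistant_session_id after a turn. On
// error_during_execution we store null, so the next message starts a fresh
// session instead of resuming the one that just failed.
function sessionIdToPersist(result: ResultMessage): string | null {
  return result.subtype === "error_during_execution" ? null : result.session_id;
}
```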

* test(orchestrator): add stale session clearing tests + address review feedback

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>

---------

Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>

* fix(claude): honor CLAUDE_BIN_PATH in dev mode for libc-mismatch hosts (coleam00#1481)

* fix(claude): honor CLAUDE_BIN_PATH in dev mode for libc-mismatch hosts

The Claude Agent SDK auto-resolves its bundled native binary in
[linux-x64-musl, linux-x64] order. On glibc Linux hosts (Ubuntu/Debian/
Fedora), Bun installs both via optionalDependencies and the musl variant
is picked first; its ELF interpreter (/lib/ld-musl-x86_64.so.1) does not
exist on glibc, so spawn fails and the SDK reports a misleading "binary
not found" — the file is on disk, the loader is not.

The documented escape hatch CLAUDE_BIN_PATH was dead code in dev mode:
the resolver early-returned undefined when BUNDLED_IS_BINARY=false before
ever reading the env var. The only workaround was patching node_modules.

Move the env-var block above the BUNDLED_IS_BINARY return. Config-file
path stays binary-mode-only — it's per-repo, not per-machine; env is the
right knob for libc mismatches.

Behavior preserved:
- env unset                  → unchanged (undefined in dev, autodetect/throw in binary)
- env set + file exists      → resolved (was binary-only; now also dev)
- env set + file missing     → clear error (was binary-only; now also dev)
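The reordering can be sketched like this, assuming a resolver that takes its inputs as parameters so it is easy to test. The signature, the `autodetect` callback (standing in for the binary-mode config-file lookup plus autodetection) and the error messages are hypothetical; only the order of checks, and treating an empty `CLAUDE_BIN_PATH` as unset, follow the commit text.

```typescript
import { existsSync } from 'node:fs';

// Hypothetical sketch of the reordered resolver; only the check order
// follows the commit message.
function resolveClaudeBinaryPath(
  env: Record<string, string | undefined>,
  bundledIsBinary: boolean,
  autodetect: () => string | undefined,
): string | undefined {
  const fromEnv = env.CLAUDE_BIN_PATH;
  // Checked first, so it now applies in dev mode too.
  // A truthiness check means '' falls through as "unset".
  if (fromEnv) {
    if (!existsSync(fromEnv)) {
      throw new Error(`CLAUDE_BIN_PATH is set but not found: ${fromEnv}`);
    }
    return fromEnv;
  }
  // Dev mode: let the SDK pick its bundled native binary.
  if (!bundledIsBinary) return undefined;
  // Binary mode only: config-file path and autodetection.
  const detected = autodetect();
  if (detected === undefined) {
    throw new Error('Claude binary not found; set CLAUDE_BIN_PATH');
  }
  return detected;
}
```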

Closes coleam00#1474

* chore(claude): address CodeRabbit review on coleam00#1481

- CHANGELOG entry under [Unreleased] / Fixed describing the dev-mode
  CLAUDE_BIN_PATH escape hatch (previously ignored). Notes that
  config-file path remains binary-mode-only and that env-loading +
  target-repo .env isolation are unchanged downstream.
- Empty-string test pinning that CLAUDE_BIN_PATH='' falls through
  to undefined rather than throwing — protects against a future
  predicate typo that would treat empty as "set".
- One-line note in ai-assistants.md "Binary path configuration"
  section pointing dev-mode users at the env-var override for the
  glibc/musl mismatch case.

Skipped from the review:
- The other two docs-page rewrites (configuration.md /
  troubleshooting.md): the error message itself names CLAUDE_BIN_PATH,
  and coleam00#1474 documents the use case publicly. One mention in
  ai-assistants.md is enough for discovery.
- Type-style consistency tweaks in the test file: pure bikeshed.

* fix(deps): bump hono to ^4.12.16 and @hono/node-server to ^1.19.13 (closes coleam00#1484) (coleam00#1499)

* fix(orchestrator): create ~/.archon/workspaces before AI provider spawn (coleam00#1529)

* fix(orchestrator): create ~/.archon/workspaces before AI provider spawn

On a fresh install, ~/.archon/workspaces doesn't exist yet. The
orchestrator passes that path as cwd to the AI provider, which calls
spawn() — which raises ENOENT. The error is then misclassified as
"binary not found" in the friendly-error path, surfacing as an
incorrect "Claude binary not found" message.

Add ensureArchonWorkspacesPath() in @archon/paths that mkdir -p's the
directory and returns the path. Use it at the orchestrator's spawn-cwd
site so the directory is guaranteed to exist before spawn().

Other call sites of getArchonWorkspacesPath() (workflow discovery,
path-prefix comparisons) only consume the path string and don't need
the directory to exist; they keep using the pure getter.
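A minimal sketch of the pure-getter / ensuring-variant split, assuming the directory is `~/.archon/workspaces` as stated above. The `home` parameter is a hypothetical addition for testability; the real helpers in `@archon/paths` may resolve their base directory differently.

```typescript
import { existsSync, mkdirSync } from 'node:fs';
import { homedir, tmpdir } from 'node:os';
import { join } from 'node:path';

// Pure getter: builds the path string only; nothing is created on disk.
function getArchonWorkspacesPath(home: string = homedir()): string {
  return join(home, '.archon', 'workspaces');
}

// Ensuring variant: `mkdir -p` semantics, then return the same path so the
// caller can pass it straight to spawn() as cwd without hitting ENOENT.
function ensureArchonWorkspacesPath(home: string = homedir()): string {
  const dir = getArchonWorkspacesPath(home);
  mkdirSync(dir, { recursive: true }); // no error if it already exists
  return dir;
}

// Usage against a throwaway home under the OS temp dir.
const demoDir = ensureArchonWorkspacesPath(join(tmpdir(), 'archon-sketch-home'));
console.log(demoDir, existsSync(demoDir));
```

Keeping the getter side-effect free means discovery and prefix-comparison call sites never touch the filesystem; only the spawn site pays for the `mkdir`.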

Closes coleam00#1528

* test(orchestrator): assert ensureArchonWorkspacesPath is called

Capture the @archon/paths mock as a named variable and assert it was
called in the syncWorkspace handleMessage path. Without this, the test
suite passes even if orchestrator-agent.ts:824 reverts to the
non-ensuring getArchonWorkspacesPath() variant — exactly the regression
that surfaced as 'Claude Code native binary not found' in coleam00#1528.

* docs(changelog): add Tier 1 batch 2 cherry-pick entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Signed-off-by: kagura-agent <kagura.agent.ai@gmail.com>
Co-authored-by: Rasmus Widing <152263317+Wirasm@users.noreply.github.com>
Co-authored-by: DIY Smart Code <thomas@thirty3.de>
Co-authored-by: Cocoon-Break <54054995+kuishou68@users.noreply.github.com>
Co-authored-by: Kagura <kagura.agent.ai@gmail.com>
Co-authored-by: Claude Opus 4 (1M context) <noreply@anthropic.com>
Co-authored-by: Yasser <116118149+YrFnS@users.noreply.github.com>
Co-authored-by: Truffle <truffleagent@gmail.com>
Co-authored-by: cjnprospa <sirhcle.j23@gmail.com>