You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
IAssistantClient.sendQuery(prompt, cwd, resumeSessionId, options) has an implicit contract that "cwd corresponds to a registered codebase" — the pre-spawn env-leak check in both ClaudeClient and CodexClient does a findCodebaseByDefaultCwd(cwd) ?? findCodebaseByPathPrefix(cwd) lookup and assumes the result is non-null.
At least four upstream call sites violate this contract and pass unregistered cwd paths (including the literal string /workspace as a fallback). This is the upstream cause of #991 (the shallow fix). Until the contract is repaired, the pre-spawn gate has to fall back to "unregistered → skip scan" (#991), which is correct for the symptoms but hides the architectural drift.
This issue tracks the architectural cleanup: audit every call site, decide per-path whether it should always resolve to a registered codebase or is legitimately codebase-less, and change the sendQuery contract to make the distinction explicit instead of relying on lookup-by-cwd heuristics.
The problem: the codebase is looked up by cwd, but the cwd is not a stable key for "which codebase is this query for?". Workers, worktrees, codebase-less workflows, and fallback paths all pass cwd values that don't map cleanly to a single codebase row.
The four upstream paths that violate the contract
All four are in packages/core/src/ and packages/workflows/src/.
Path 1 — Orchestrator stale-cleaned fallback to /workspace
When it fires: the isolation resolver returns stale_cleaned (the stored worktree reference points to a path that no longer exists on disk) AND the conversation has no associated codebase row.
What gets passed: conversation.cwd if set, otherwise the literal string /workspace.
Why this is wrong: /workspace is not a registered codebase path on most deployments. The pre-spawn check calls findCodebaseByDefaultCwd('/workspace') which returns null, and the old logic runs the scanner on /workspace. On Cole's deployed server, this is exactly what fires because the docker container's working directory has a .env.
// No codebase — run in parent's cwd (no isolation needed for non-repo workflows)workerCwd=ctx.cwd;
When it fires: a conversation is created against a workflow that doesn't require a codebase (e.g., archon-assist "what time is it" without any project selected).
What gets passed: ctx.cwd — wherever the orchestrator was invoked from. On the CLI that's the user's shell cwd. On the server that's the docker container's working directory. Neither is a registered codebase by design.
Why this is "legitimately codebase-less": the workflow explicitly does not require a codebase. The user's intent is "run this assistant query with no project context". The pre-spawn check should NOT fire for this case — there's no codebase to consent against and no project-specific .env to scan. But the current code path is indistinguishable from a "tried to use a codebase but lookup failed" scenario.
Path 3 — DAG executor for workflow runs
File: packages/workflows/src/dag-executor.ts:840 and :1600
When it fires: every DAG node that spawns a Claude or Codex subprocess.
What gets passed: cwd from the workflow executor context, which is derived from workflowRun.working_path or the resolved isolation environment path.
Why this is a problem: the executor has access to the codebase object (it's passed in via WorkflowDeps), but it discards that context and passes only the cwd string to sendQuery. The pre-spawn check then has to re-derive "which codebase is this?" from the cwd, which may not match if:
The worktree path doesn't match findCodebaseByPathPrefix logic (e.g., atypical isolation layouts, symlinks, bind mounts)
The codebase was renamed or its default_cwd was updated since the worktree was created
The workflow is codebase-less and cwd is a shared working directory
exportasyncfunctiongenerateAndSetTitle(conversationDbId: string,userMessage: string,assistantType: string,cwd: string,// ← passed by caller without codebase contextworkflowName?: string): Promise<void>{// ...forawait(constchunkofclient.sendQuery(titlePrompt,cwd,undefined,{model: titleModel,tools: [],})){ ... }}
When it fires: every conversation — title generation runs after the first message in every conversation. See orchestrator-agent.ts:524.
What gets passed: whatever cwd the orchestrator resolved for that conversation, which includes the /workspace fallback from Path 1 and the codebase-less path from Path 2.
Why this path matters most: title generation is the hottestsendQuery call site by volume. Every new conversation hits it. Cole's deployed server is blocking every conversation creation because title generation fires the pre-spawn scan against an unregistered cwd.
Proposed architectural fix
Change the IAssistantClient.sendQuery contract to accept an explicit codebaseId parameter. The pre-spawn check looks up the codebase by ID instead of by cwd. Callers must pass null if the invocation is legitimately codebase-less.
New contract
sendQuery(prompt: string,cwd: string,codebaseId: string |null,// NEW — explicit, not inferredresumeSessionId?: string,options?: AssistantRequestOptions): AsyncGenerator<MessageChunk>;
New pre-spawn check
// No lookup-by-cwd heuristic. The caller tells us which codebase this is for.letcodebase: Codebase|null=null;if(codebaseId!==null){codebase=awaitcodebaseDb.getCodebase(codebaseId);if(!codebase){// Codebase ID was given but lookup failed → hard error.// Either the caller has stale state or the row was deleted mid-query.thrownewError(`Codebase ${codebaseId} not found — aborted sendQuery`);}}// Only scan for registered codebases without consent.if(codebase&&!codebase.allow_env_keys){constmerged=awaitloadConfig(cwd).catch(()=>({allowTargetRepoKeys: false}));if(!merged.allowTargetRepoKeys){constreport=scanPathForSensitiveKeys(cwd);if(report.findings.length>0){thrownewEnvLeakError(report,'spawn-existing');}}}// codebaseId === null → explicit codebase-less invocation, skip gate entirely
This makes the three cases explicit:
Caller's codebase state
Caller passes
Pre-spawn behavior
Has a registered codebase for this run
codebaseId: 'uuid-here'
Look up by ID, enforce gate based on allow_env_keys
Codebase-less by design (title gen, codebase-less assist)
codebaseId: null
Skip gate entirely
Codebase was deleted or lookup fails
N/A
getCodebase() returns null → hard error, no silent bypass
No lookup-by-cwd heuristic. No guessing. No silent fallthrough.
Per-path audit (what each caller should pass)
Path 1 — Orchestrator stale-cleaned fallback
Current: falls back to '/workspace' when codebase is null.
Proposed: the fallback is only hit when !codebase. That means codebaseId is also null at this point. Pass codebaseId: null. Pre-spawn skips.
Also: the /workspace fallback itself is suspicious. Is /workspace ever a valid cwd for a running subprocess? Probably not outside of the Docker container where archon happens to have /workspace set up. Worth a separate question: should this fallback throw instead of returning a bogus cwd?
Path 2 — Codebase-less orchestrator path
Current: workerCwd = ctx.cwd with no codebase association.
Proposed: pass codebaseId: null explicitly. The "codebase-less" nature becomes a first-class concept instead of inferred from "no codebase row matched". Downstream code (including sendQuery) can branch on it cleanly.
Path 3 — DAG executor
Current: aiClient.sendQuery(finalPrompt, cwd, resumeSessionId, nodeOptionsWithAbort) — no codebase passed.
Proposed: the executor has access to the codebase object via WorkflowDeps.codebase (or similar — needs audit). Thread codebase?.id ?? null through to sendQuery. When the executor was invoked for a codebase-less workflow run, codebase is null and codebaseId: null gets passed correctly.
Path 4 — Title generator
Current: generateAndSetTitle(conversationDbId, userMessage, assistantType, cwd, workflowName) with no codebase context.
Proposed: add codebaseId: string | null as a parameter:
Title generation passes skipEnvLeakGate: true. Codebase-less workflows pass it too. The pre-spawn check honors the flag and bypasses the scan.
Pros: smaller diff, no contract change, backwards compatible.
Cons: still relies on the lookup-by-cwd heuristic for the "normal" path, so the architectural drift persists. Callers can forget to set the flag. The flag is a workaround, not a fix — the contract is still broken.
My take: do the contract change. It's ~20-30 files touched (each caller) but eliminates the class of bug. The flag approach is a patch on a patch.
Attack vector analysis
After this fix, the "unregistered cwd" path becomes explicit (codebaseId: null). Is there an attack where a malicious caller passes null to bypass the gate intentionally?
Threat model:
Direct SDK caller outside archon (e.g., someone writes their own script importing ClaudeClient): they control codebaseId and can pass null. But they also control everything else — they can just not use archon at all and call the Claude SDK directly. Not a real attack vector.
Workflow that happens to be codebase-less: legitimate use. codebaseId: null is correct for these. No bypass.
Web UI user: can't reach sendQuery directly. Has to go through a conversation which has a codebase_id stored. The orchestrator threads it through. No bypass.
CLI user: archon workflow run --cwd /path/to/foo auto-registers first (creates the codebase row), THEN calls sendQuery with the new codebaseId. The registration scan has already fired at that point, so consent is already established. No bypass.
Conclusion: no attack vector for the explicit-null case. The registration gate remains authoritative.
Cleanup tasks within this issue
Once the contract change lands:
Delete the cwd-based lookup in pre-spawn — findCodebaseByDefaultCwd and findCodebaseByPathPrefix were added specifically for this buggy fallback (in bug: concurrent-run guard has TOCTOU race with pre-created pending rows #1036). After the fix they're only used for codebase resolution in other flows; audit whether they're still needed elsewhere. If not, delete them.
Delete the /workspace string fallback in orchestrator.ts:194 — should throw or return a typed "no working directory" error instead of a bogus path.
Tests — each migrated call site needs a test that exercises the codebaseId: null path and asserts the pre-spawn scan is not called.
Contract documentation — update the IAssistantClient interface doc to explicitly describe what codebaseId: null means.
Files to change
File
Change
packages/core/src/types/index.ts
Add codebaseId: string | null parameter to IAssistantClient.sendQuery
packages/core/src/clients/claude.ts
Implement new lookup-by-ID logic in sendQuery; delete cwd-based fallback
packages/core/src/clients/codex.ts
Same
packages/core/src/services/title-generator.ts
Add codebaseId parameter, pass through to sendQuery
Thread conversation.codebase_id to title generator and to sendQuery calls at lines 840, 959
packages/core/src/orchestrator/orchestrator.ts
Delete /workspace fallback; change to explicit "no codebase" return
packages/workflows/src/dag-executor.ts
Thread deps.codebase?.id ?? null to sendQuery at lines 840 and 1600
packages/core/src/db/codebases.ts
Audit findCodebaseByDefaultCwd and findCodebaseByPathPrefix — if no other callers, delete
Tests
Update every test that calls sendQuery directly to include the new parameter
Estimate: ~20-30 files touched, most of them test file updates. Core logic change is ~100 lines.
Validation
All tests pass after the parameter addition
Manual: create a conversation on a deployed instance, confirm title generation no longer hits the pre-spawn gate
Manual: register a codebase with allow_env_keys: false, run a workflow node, confirm pre-spawn scan fires (regression for registered path)
Manual: run archon-assist "hello" without any project context, confirm it works (codebase-less path)
Manual: delete a codebase mid-conversation, attempt to continue the conversation, confirm the hard error fires instead of silent fallback
Priority
P1 — not a direct demo blocker if #991 (the shallow fix) lands first. But the current design is architecturally broken and will keep producing bugs until it's fixed. Schedule for post-v0.3.2.
The pre-spawn env-leak check in ClaudeClient.sendQuery() and CodexClient.sendQuery() used if (!codebase?.allow_env_keys) which evaluates to true when codebase is null (unregistered). This made the scanner run on EVERY query for unregistered codebases, creating a circular dependency where registration fails (because of the .env) and the pre-spawn check also fails (because there's no registered codebase to check allow_env_keys against).
The shallow fix in #991 addresses the symptom. This issue tracks the architectural cleanup so the drift doesn't recur.
Summary
IAssistantClient.sendQuery(prompt, cwd, resumeSessionId, options)has an implicit contract that "cwdcorresponds to a registered codebase" — the pre-spawn env-leak check in bothClaudeClientandCodexClientdoes afindCodebaseByDefaultCwd(cwd) ?? findCodebaseByPathPrefix(cwd)lookup and assumes the result is non-null.At least four upstream call sites violate this contract and pass unregistered cwd paths (including the literal string
/workspaceas a fallback). This is the upstream cause of #991 (the shallow fix). Until the contract is repaired, the pre-spawn gate has to fall back to "unregistered → skip scan" (#991), which is correct for the symptoms but hides the architectural drift.This issue tracks the architectural cleanup: audit every call site, decide per-path whether it should always resolve to a registered codebase or is legitimately codebase-less, and change the
sendQuerycontract to make the distinction explicit instead of relying on lookup-by-cwd heuristics.Current contract (implicit, broken)
Inside
sendQuery, both clients do:The problem: the codebase is looked up by cwd, but the cwd is not a stable key for "which codebase is this query for?". Workers, worktrees, codebase-less workflows, and fallback paths all pass cwd values that don't map cleanly to a single codebase row.
The four upstream paths that violate the contract
All four are in
packages/core/src/andpackages/workflows/src/.Path 1 — Orchestrator stale-cleaned fallback to
/workspaceFile:
packages/core/src/orchestrator/orchestrator.ts:194When it fires: the isolation resolver returns
stale_cleaned(the stored worktree reference points to a path that no longer exists on disk) AND the conversation has no associated codebase row.What gets passed:
conversation.cwdif set, otherwise the literal string/workspace.Why this is wrong:
/workspaceis not a registered codebase path on most deployments. The pre-spawn check callsfindCodebaseByDefaultCwd('/workspace')which returns null, and the old logic runs the scanner on/workspace. On Cole's deployed server, this is exactly what fires because the docker container's working directory has a.env.Path 2 — Codebase-less orchestrator path
File:
packages/core/src/orchestrator/orchestrator.ts:303When it fires: a conversation is created against a workflow that doesn't require a codebase (e.g.,
archon-assist "what time is it"without any project selected).What gets passed:
ctx.cwd— wherever the orchestrator was invoked from. On the CLI that's the user's shell cwd. On the server that's the docker container's working directory. Neither is a registered codebase by design.Why this is "legitimately codebase-less": the workflow explicitly does not require a codebase. The user's intent is "run this assistant query with no project context". The pre-spawn check should NOT fire for this case — there's no codebase to consent against and no project-specific
.envto scan. But the current code path is indistinguishable from a "tried to use a codebase but lookup failed" scenario.Path 3 — DAG executor for workflow runs
File:
packages/workflows/src/dag-executor.ts:840and:1600When it fires: every DAG node that spawns a Claude or Codex subprocess.
What gets passed:
cwdfrom the workflow executor context, which is derived fromworkflowRun.working_pathor the resolved isolation environment path.Why this is a problem: the executor has access to the
codebaseobject (it's passed in viaWorkflowDeps), but it discards that context and passes only thecwdstring tosendQuery. The pre-spawn check then has to re-derive "which codebase is this?" from the cwd, which may not match if:findCodebaseByPathPrefixlogic (e.g., atypical isolation layouts, symlinks, bind mounts)default_cwdwas updated since the worktree was createdcwdis a shared working directoryPath 4 — Title generator
File:
packages/core/src/services/title-generator.ts:53When it fires: every conversation — title generation runs after the first message in every conversation. See
orchestrator-agent.ts:524.What gets passed: whatever
cwdthe orchestrator resolved for that conversation, which includes the/workspacefallback from Path 1 and the codebase-less path from Path 2.Why this path matters most: title generation is the hottest
sendQuerycall site by volume. Every new conversation hits it. Cole's deployed server is blocking every conversation creation because title generation fires the pre-spawn scan against an unregistered cwd.Proposed architectural fix
Change the
IAssistantClient.sendQuerycontract to accept an explicitcodebaseIdparameter. The pre-spawn check looks up the codebase by ID instead of by cwd. Callers must passnullif the invocation is legitimately codebase-less.New contract
New pre-spawn check
This makes the three cases explicit:
codebaseId: 'uuid-here'allow_env_keyscodebaseId: nullgetCodebase()returns null → hard error, no silent bypassNo lookup-by-cwd heuristic. No guessing. No silent fallthrough.
Per-path audit (what each caller should pass)
Path 1 — Orchestrator stale-cleaned fallback
Current: falls back to
'/workspace'whencodebaseis null.Proposed: the fallback is only hit when
!codebase. That meanscodebaseIdis also null at this point. PasscodebaseId: null. Pre-spawn skips.Also: the
/workspacefallback itself is suspicious. Is/workspaceever a valid cwd for a running subprocess? Probably not outside of the Docker container where archon happens to have/workspaceset up. Worth a separate question: should this fallback throw instead of returning a bogus cwd?Path 2 — Codebase-less orchestrator path
Current:
workerCwd = ctx.cwdwith no codebase association.Proposed: pass
codebaseId: nullexplicitly. The "codebase-less" nature becomes a first-class concept instead of inferred from "no codebase row matched". Downstream code (includingsendQuery) can branch on it cleanly.Path 3 — DAG executor
Current:
aiClient.sendQuery(finalPrompt, cwd, resumeSessionId, nodeOptionsWithAbort)— no codebase passed.Proposed: the executor has access to the
codebaseobject viaWorkflowDeps.codebase(or similar — needs audit). Threadcodebase?.id ?? nullthrough tosendQuery. When the executor was invoked for a codebase-less workflow run,codebaseis null andcodebaseId: nullgets passed correctly.Path 4 — Title generator
Current:
generateAndSetTitle(conversationDbId, userMessage, assistantType, cwd, workflowName)with no codebase context.Proposed: add
codebaseId: string | nullas a parameter:The caller in
orchestrator-agent.ts:524already has access to the conversation'scodebase_id. Thread it through.For conversations without a codebase (the codebase-less path), pass
null. Title generation scans nothing, runs nothing, completes.Alternative fix (lighter-weight)
If changing the
sendQuerycontract is too invasive for a near-term fix, an alternative is to add an explicitskipEnvLeakGate: booleanoption:Title generation passes
skipEnvLeakGate: true. Codebase-less workflows pass it too. The pre-spawn check honors the flag and bypasses the scan.Pros: smaller diff, no contract change, backwards compatible.
Cons: still relies on the lookup-by-cwd heuristic for the "normal" path, so the architectural drift persists. Callers can forget to set the flag. The flag is a workaround, not a fix — the contract is still broken.
My take: do the contract change. It's ~20-30 files touched (each caller) but eliminates the class of bug. The flag approach is a patch on a patch.
Attack vector analysis
After this fix, the "unregistered cwd" path becomes explicit (
codebaseId: null). Is there an attack where a malicious caller passesnullto bypass the gate intentionally?Threat model:
Direct SDK caller outside archon (e.g., someone writes their own script importing
ClaudeClient): they controlcodebaseIdand can pass null. But they also control everything else — they can just not use archon at all and call the Claude SDK directly. Not a real attack vector.Workflow that happens to be codebase-less: legitimate use.
codebaseId: nullis correct for these. No bypass.Web UI user: can't reach
sendQuerydirectly. Has to go through a conversation which has a codebase_id stored. The orchestrator threads it through. No bypass.CLI user:
archon workflow run --cwd /path/to/fooauto-registers first (creates the codebase row), THEN calls sendQuery with the newcodebaseId. The registration scan has already fired at that point, so consent is already established. No bypass.Conclusion: no attack vector for the explicit-null case. The registration gate remains authoritative.
Cleanup tasks within this issue
Once the contract change lands:
findCodebaseByDefaultCwdandfindCodebaseByPathPrefixwere added specifically for this buggy fallback (in bug: concurrent-run guard has TOCTOU race with pre-created pending rows #1036). After the fix they're only used for codebase resolution in other flows; audit whether they're still needed elsewhere. If not, delete them./workspacestring fallback inorchestrator.ts:194— should throw or return a typed "no working directory" error instead of a bogus path.codebaseId: nullpath and asserts the pre-spawn scan is not called.IAssistantClientinterface doc to explicitly describe whatcodebaseId: nullmeans.Files to change
packages/core/src/types/index.tscodebaseId: string | nullparameter toIAssistantClient.sendQuerypackages/core/src/clients/claude.tssendQuery; delete cwd-based fallbackpackages/core/src/clients/codex.tspackages/core/src/services/title-generator.tscodebaseIdparameter, pass through tosendQuerypackages/core/src/orchestrator/orchestrator-agent.tsconversation.codebase_idto title generator and tosendQuerycalls at lines 840, 959packages/core/src/orchestrator/orchestrator.ts/workspacefallback; change to explicit "no codebase" returnpackages/workflows/src/dag-executor.tsdeps.codebase?.id ?? nulltosendQueryat lines 840 and 1600packages/core/src/db/codebases.tsfindCodebaseByDefaultCwdandfindCodebaseByPathPrefix— if no other callers, deletesendQuerydirectly to include the new parameterEstimate: ~20-30 files touched, most of them test file updates. Core logic change is ~100 lines.
Validation
allow_env_keys: false, run a workflow node, confirm pre-spawn scan fires (regression for registered path)archon-assist "hello"without any project context, confirm it works (codebase-less path)Priority
P1 — not a direct demo blocker if #991 (the shallow fix) lands first. But the current design is architecturally broken and will keep producing bugs until it's fixed. Schedule for post-v0.3.2.
Related
loadConfigfail-closed, retroactive consent UI, context-aware errors — did not touch the contractCredit
Root cause identified by @coleam00 in Slack:
The shallow fix in #991 addresses the symptom. This issue tracks the architectural cleanup so the drift doesn't recur.