Skip to content

bug(orchestrator): multiple code paths call sendQuery with unregistered cwd, violating pre-spawn check assumptions #993

@Wirasm

Description

@Wirasm

Summary

IAssistantClient.sendQuery(prompt, cwd, resumeSessionId, options) has an implicit contract that "cwd corresponds to a registered codebase" — the pre-spawn env-leak check in both ClaudeClient and CodexClient does a findCodebaseByDefaultCwd(cwd) ?? findCodebaseByPathPrefix(cwd) lookup and assumes the result is non-null.

At least four upstream call sites violate this contract and pass unregistered cwd paths (including the literal string /workspace as a fallback). This is the upstream cause of #991 (the shallow fix). Until the contract is repaired, the pre-spawn gate has to fall back to "unregistered → skip scan" (#991), which is correct for the symptoms but hides the architectural drift.

This issue tracks the architectural cleanup: audit every call site, decide per-path whether it should always resolve to a registered codebase or is legitimately codebase-less, and change the sendQuery contract to make the distinction explicit instead of relying on lookup-by-cwd heuristics.

Current contract (implicit, broken)

sendQuery(
  prompt: string,
  cwd: string,
  resumeSessionId?: string,
  options?: AssistantRequestOptions
): AsyncGenerator<MessageChunk>;

Inside sendQuery, both clients do:

const codebase =
  (await codebaseDb.findCodebaseByDefaultCwd(cwd)) ??
  (await codebaseDb.findCodebaseByPathPrefix(cwd));
if (!codebase?.allow_env_keys) {
  // ... env-leak scan ...
}

The problem: the codebase is looked up by cwd, but the cwd is not a stable key for "which codebase is this query for?". Workers, worktrees, codebase-less workflows, and fallback paths all pass cwd values that don't map cleanly to a single codebase row.

The four upstream paths that violate the contract

All four are in packages/core/src/ and packages/workflows/src/.

Path 1 — Orchestrator stale-cleaned fallback to /workspace

File: packages/core/src/orchestrator/orchestrator.ts:194

if (!codebase) return { status: 'none', cwd: conversation.cwd ?? '/workspace', env: null };

When it fires: the isolation resolver returns stale_cleaned (the stored worktree reference points to a path that no longer exists on disk) AND the conversation has no associated codebase row.

What gets passed: conversation.cwd if set, otherwise the literal string /workspace.

Why this is wrong: /workspace is not a registered codebase path on most deployments. The pre-spawn check calls findCodebaseByDefaultCwd('/workspace') which returns null, and the old logic runs the scanner on /workspace. On Cole's deployed server, this is exactly what fires because the docker container's working directory has a .env.

Path 2 — Codebase-less orchestrator path

File: packages/core/src/orchestrator/orchestrator.ts:303

// No codebase — run in parent's cwd (no isolation needed for non-repo workflows)
workerCwd = ctx.cwd;

When it fires: a conversation is created against a workflow that doesn't require a codebase (e.g., archon-assist "what time is it" without any project selected).

What gets passed: ctx.cwd — wherever the orchestrator was invoked from. On the CLI that's the user's shell cwd. On the server that's the docker container's working directory. Neither is a registered codebase by design.

Why this is "legitimately codebase-less": the workflow explicitly does not require a codebase. The user's intent is "run this assistant query with no project context". The pre-spawn check should NOT fire for this case — there's no codebase to consent against and no project-specific .env to scan. But the current code path is indistinguishable from a "tried to use a codebase but lookup failed" scenario.

Path 3 — DAG executor for workflow runs

File: packages/workflows/src/dag-executor.ts:840 and :1600

aiClient.sendQuery(finalPrompt, cwd, resumeSessionId, nodeOptionsWithAbort)

When it fires: every DAG node that spawns a Claude or Codex subprocess.

What gets passed: cwd from the workflow executor context, which is derived from workflowRun.working_path or the resolved isolation environment path.

Why this is a problem: the executor has access to the codebase object (it's passed in via WorkflowDeps), but it discards that context and passes only the cwd string to sendQuery. The pre-spawn check then has to re-derive "which codebase is this?" from the cwd, which may not match if:

  • The worktree path doesn't match findCodebaseByPathPrefix logic (e.g., atypical isolation layouts, symlinks, bind mounts)
  • The codebase was renamed or its default_cwd was updated since the worktree was created
  • The workflow is codebase-less and cwd is a shared working directory

Path 4 — Title generator

File: packages/core/src/services/title-generator.ts:53

export async function generateAndSetTitle(
  conversationDbId: string,
  userMessage: string,
  assistantType: string,
  cwd: string,          // ← passed by caller without codebase context
  workflowName?: string
): Promise<void> {
  // ...
  for await (const chunk of client.sendQuery(titlePrompt, cwd, undefined, {
    model: titleModel,
    tools: [],
  })) { ... }
}

When it fires: every conversation — title generation runs after the first message in every conversation. See orchestrator-agent.ts:524.

What gets passed: whatever cwd the orchestrator resolved for that conversation, which includes the /workspace fallback from Path 1 and the codebase-less path from Path 2.

Why this path matters most: title generation is the hottest sendQuery call site by volume. Every new conversation hits it. Cole's deployed server is blocking every conversation creation because title generation fires the pre-spawn scan against an unregistered cwd.

Proposed architectural fix

Change the IAssistantClient.sendQuery contract to accept an explicit codebaseId parameter. The pre-spawn check looks up the codebase by ID instead of by cwd. Callers must pass null if the invocation is legitimately codebase-less.

New contract

sendQuery(
  prompt: string,
  cwd: string,
  codebaseId: string | null,    // NEW — explicit, not inferred
  resumeSessionId?: string,
  options?: AssistantRequestOptions
): AsyncGenerator<MessageChunk>;

New pre-spawn check

// No lookup-by-cwd heuristic. The caller tells us which codebase this is for.
let codebase: Codebase | null = null;
if (codebaseId !== null) {
  codebase = await codebaseDb.getCodebase(codebaseId);
  if (!codebase) {
    // Codebase ID was given but lookup failed → hard error.
    // Either the caller has stale state or the row was deleted mid-query.
    throw new Error(`Codebase ${codebaseId} not found — aborted sendQuery`);
  }
}

// Only scan for registered codebases without consent.
if (codebase && !codebase.allow_env_keys) {
  const merged = await loadConfig(cwd).catch(() => ({ allowTargetRepoKeys: false }));
  if (!merged.allowTargetRepoKeys) {
    const report = scanPathForSensitiveKeys(cwd);
    if (report.findings.length > 0) {
      throw new EnvLeakError(report, 'spawn-existing');
    }
  }
}
// codebaseId === null → explicit codebase-less invocation, skip gate entirely

This makes the three cases explicit:

Caller's codebase state Caller passes Pre-spawn behavior
Has a registered codebase for this run codebaseId: 'uuid-here' Look up by ID, enforce gate based on allow_env_keys
Codebase-less by design (title gen, codebase-less assist) codebaseId: null Skip gate entirely
Codebase was deleted or lookup fails N/A getCodebase() returns null → hard error, no silent bypass

No lookup-by-cwd heuristic. No guessing. No silent fallthrough.

Per-path audit (what each caller should pass)

Path 1 — Orchestrator stale-cleaned fallback

Current: falls back to '/workspace' when codebase is null.

Proposed: the fallback is only hit when !codebase. That means codebaseId is also null at this point. Pass codebaseId: null. Pre-spawn skips.

Also: the /workspace fallback itself is suspicious. Is /workspace ever a valid cwd for a running subprocess? Probably not outside of the Docker container where archon happens to have /workspace set up. Worth a separate question: should this fallback throw instead of returning a bogus cwd?

Path 2 — Codebase-less orchestrator path

Current: workerCwd = ctx.cwd with no codebase association.

Proposed: pass codebaseId: null explicitly. The "codebase-less" nature becomes a first-class concept instead of inferred from "no codebase row matched". Downstream code (including sendQuery) can branch on it cleanly.

Path 3 — DAG executor

Current: aiClient.sendQuery(finalPrompt, cwd, resumeSessionId, nodeOptionsWithAbort) — no codebase passed.

Proposed: the executor has access to the codebase object via WorkflowDeps.codebase (or similar — needs audit). Thread codebase?.id ?? null through to sendQuery. When the executor was invoked for a codebase-less workflow run, codebase is null and codebaseId: null gets passed correctly.

Path 4 — Title generator

Current: generateAndSetTitle(conversationDbId, userMessage, assistantType, cwd, workflowName) with no codebase context.

Proposed: add codebaseId: string | null as a parameter:

export async function generateAndSetTitle(
  conversationDbId: string,
  userMessage: string,
  assistantType: string,
  cwd: string,
  codebaseId: string | null,    // NEW
  workflowName?: string
): Promise<void>

The caller in orchestrator-agent.ts:524 already has access to the conversation's codebase_id. Thread it through.

For conversations without a codebase (the codebase-less path), pass null. Title generation scans nothing, runs nothing, completes.

Alternative fix (lighter-weight)

If changing the sendQuery contract is too invasive for a near-term fix, an alternative is to add an explicit skipEnvLeakGate: boolean option:

sendQuery(
  prompt: string,
  cwd: string,
  resumeSessionId?: string,
  options?: AssistantRequestOptions & { skipEnvLeakGate?: boolean }
): AsyncGenerator<MessageChunk>;

Title generation passes skipEnvLeakGate: true. Codebase-less workflows pass it too. The pre-spawn check honors the flag and bypasses the scan.

Pros: smaller diff, no contract change, backwards compatible.

Cons: still relies on the lookup-by-cwd heuristic for the "normal" path, so the architectural drift persists. Callers can forget to set the flag. The flag is a workaround, not a fix — the contract is still broken.

My take: do the contract change. It's ~20-30 files touched (each caller) but eliminates the class of bug. The flag approach is a patch on a patch.

Attack vector analysis

After this fix, the "unregistered cwd" path becomes explicit (codebaseId: null). Is there an attack where a malicious caller passes null to bypass the gate intentionally?

Threat model:

  1. Direct SDK caller outside archon (e.g., someone writes their own script importing ClaudeClient): they control codebaseId and can pass null. But they also control everything else — they can just not use archon at all and call the Claude SDK directly. Not a real attack vector.

  2. Workflow that happens to be codebase-less: legitimate use. codebaseId: null is correct for these. No bypass.

  3. Web UI user: can't reach sendQuery directly. Has to go through a conversation which has a codebase_id stored. The orchestrator threads it through. No bypass.

  4. CLI user: archon workflow run --cwd /path/to/foo auto-registers first (creates the codebase row), THEN calls sendQuery with the new codebaseId. The registration scan has already fired at that point, so consent is already established. No bypass.

Conclusion: no attack vector for the explicit-null case. The registration gate remains authoritative.

Cleanup tasks within this issue

Once the contract change lands:

  • Delete the cwd-based lookup in pre-spawnfindCodebaseByDefaultCwd and findCodebaseByPathPrefix were added specifically for this buggy fallback (in bug: concurrent-run guard has TOCTOU race with pre-created pending rows #1036). After the fix they're only used for codebase resolution in other flows; audit whether they're still needed elsewhere. If not, delete them.
  • Delete the /workspace string fallback in orchestrator.ts:194 — should throw or return a typed "no working directory" error instead of a bogus path.
  • Tests — each migrated call site needs a test that exercises the codebaseId: null path and asserts the pre-spawn scan is not called.
  • Contract documentation — update the IAssistantClient interface doc to explicitly describe what codebaseId: null means.

Files to change

File Change
packages/core/src/types/index.ts Add codebaseId: string | null parameter to IAssistantClient.sendQuery
packages/core/src/clients/claude.ts Implement new lookup-by-ID logic in sendQuery; delete cwd-based fallback
packages/core/src/clients/codex.ts Same
packages/core/src/services/title-generator.ts Add codebaseId parameter, pass through to sendQuery
packages/core/src/orchestrator/orchestrator-agent.ts Thread conversation.codebase_id to title generator and to sendQuery calls at lines 840, 959
packages/core/src/orchestrator/orchestrator.ts Delete /workspace fallback; change to explicit "no codebase" return
packages/workflows/src/dag-executor.ts Thread deps.codebase?.id ?? null to sendQuery at lines 840 and 1600
packages/core/src/db/codebases.ts Audit findCodebaseByDefaultCwd and findCodebaseByPathPrefix — if no other callers, delete
Tests Update every test that calls sendQuery directly to include the new parameter

Estimate: ~20-30 files touched, most of them test file updates. Core logic change is ~100 lines.

Validation

  • All tests pass after the parameter addition
  • Manual: create a conversation on a deployed instance, confirm title generation no longer hits the pre-spawn gate
  • Manual: register a codebase with allow_env_keys: false, run a workflow node, confirm pre-spawn scan fires (regression for registered path)
  • Manual: run archon-assist "hello" without any project context, confirm it works (codebase-less path)
  • Manual: delete a codebase mid-conversation, attempt to continue the conversation, confirm the hard error fires instead of silent fallback

Priority

P1 — not a direct demo blocker if #991 (the shallow fix) lands first. But the current design is architecturally broken and will keep producing bugs until it's fixed. Schedule for post-v0.3.2.

Related

Credit

Root cause identified by @coleam00 in Slack:

The pre-spawn env-leak check in ClaudeClient.sendQuery() and CodexClient.sendQuery() used if (!codebase?.allow_env_keys) which evaluates to true when codebase is null (unregistered). This made the scanner run on EVERY query for unregistered codebases, creating a circular dependency where registration fails (because of the .env) and the pre-spawn check also fails (because there's no registered codebase to check allow_env_keys against).

The shallow fix in #991 addresses the symptom. This issue tracks the architectural cleanup so the drift doesn't recur.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High priority - Address soon, next in queuearchitectureArchitectural changes and designarea: clientsAI assistant clientsbugSomething is broken

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions