bug(orchestrator): multiple code paths call sendQuery with unregistered cwd, violating pre-spawn check assumptions

## Summary

`IAssistantClient.sendQuery(prompt, cwd, resumeSessionId, options)` has an implicit contract that "`cwd` corresponds to a registered codebase" — the pre-spawn env-leak check in both `ClaudeClient` and `CodexClient` does a `findCodebaseByDefaultCwd(cwd) ?? findCodebaseByPathPrefix(cwd)` lookup and assumes the result is non-null.

**At least four upstream call sites violate this contract** and pass unregistered cwd paths (including the literal string `/workspace` as a fallback). This is the upstream cause of #991 (the shallow fix). Until the contract is repaired, the pre-spawn gate has to fall back to "unregistered → skip scan" (#991), which is correct for the symptoms but hides the architectural drift.

This issue tracks the architectural cleanup: audit every call site, decide per-path whether it should always resolve to a registered codebase or is legitimately codebase-less, and change the `sendQuery` contract to make the distinction explicit instead of relying on lookup-by-cwd heuristics.

## Current contract (implicit, broken)

```typescript
sendQuery(
  prompt: string,
  cwd: string,
  resumeSessionId?: string,
  options?: AssistantRequestOptions
): AsyncGenerator<MessageChunk>;
```

Inside `sendQuery`, both clients do:

```typescript
const codebase =
  (await codebaseDb.findCodebaseByDefaultCwd(cwd)) ??
  (await codebaseDb.findCodebaseByPathPrefix(cwd));
if (!codebase?.allow_env_keys) {
  // ... env-leak scan ...
}
```

**The problem**: the codebase is looked up by cwd, but the cwd is not a stable key for "which codebase is this query for?". Workers, worktrees, codebase-less workflows, and fallback paths all pass cwd values that don't map cleanly to a single codebase row.

## The four upstream paths that violate the contract

All four are in `packages/core/src/` and `packages/workflows/src/`.

### Path 1 — Orchestrator stale-cleaned fallback to `/workspace`

**File**: `packages/core/src/orchestrator/orchestrator.ts:194`

```typescript
if (!codebase) return { status: 'none', cwd: conversation.cwd ?? '/workspace', env: null };
```

**When it fires**: the isolation resolver returns `stale_cleaned` (the stored worktree reference points to a path that no longer exists on disk) AND the conversation has no associated codebase row.

**What gets passed**: `conversation.cwd` if set, otherwise the literal string `/workspace`.

**Why this is wrong**: `/workspace` is not a registered codebase path on most deployments. The pre-spawn check calls `findCodebaseByDefaultCwd('/workspace')` which returns null, and the old logic runs the scanner on `/workspace`. On Cole's deployed server, this is exactly what fires because the docker container's working directory has a `.env`.

### Path 2 — Codebase-less orchestrator path

**File**: `packages/core/src/orchestrator/orchestrator.ts:303`

```typescript
// No codebase — run in parent's cwd (no isolation needed for non-repo workflows)
workerCwd = ctx.cwd;
```

**When it fires**: a conversation is created against a workflow that doesn't require a codebase (e.g., `archon-assist "what time is it"` without any project selected).

**What gets passed**: `ctx.cwd` — wherever the orchestrator was invoked from. On the CLI that's the user's shell cwd. On the server that's the docker container's working directory. Neither is a registered codebase by design.

**Why this is "legitimately codebase-less"**: the workflow explicitly does not require a codebase. The user's intent is "run this assistant query with no project context". The pre-spawn check should NOT fire for this case — there's no codebase to consent against and no project-specific `.env` to scan. But the current code path is indistinguishable from a "tried to use a codebase but lookup failed" scenario.

### Path 3 — DAG executor for workflow runs

**File**: `packages/workflows/src/dag-executor.ts:840` and `:1600`

```typescript
aiClient.sendQuery(finalPrompt, cwd, resumeSessionId, nodeOptionsWithAbort)
```

**When it fires**: every DAG node that spawns a Claude or Codex subprocess.

**What gets passed**: `cwd` from the workflow executor context, which is derived from `workflowRun.working_path` or the resolved isolation environment path.

**Why this is a problem**: the executor has access to the `codebase` object (it's passed in via `WorkflowDeps`), but it discards that context and passes only the `cwd` string to `sendQuery`. The pre-spawn check then has to re-derive "which codebase is this?" from the cwd, which may not match if:

- The worktree path doesn't match `findCodebaseByPathPrefix` logic (e.g., atypical isolation layouts, symlinks, bind mounts)
- The codebase was renamed or its `default_cwd` was updated since the worktree was created
- The workflow is codebase-less and `cwd` is a shared working directory

### Path 4 — Title generator

**File**: `packages/core/src/services/title-generator.ts:53`

```typescript
export async function generateAndSetTitle(
  conversationDbId: string,
  userMessage: string,
  assistantType: string,
  cwd: string,          // ← passed by caller without codebase context
  workflowName?: string
): Promise<void> {
  // ...
  for await (const chunk of client.sendQuery(titlePrompt, cwd, undefined, {
    model: titleModel,
    tools: [],
  })) { ... }
}
```

**When it fires**: **every conversation** — title generation runs after the first message in every conversation. See `orchestrator-agent.ts:524`.

**What gets passed**: whatever `cwd` the orchestrator resolved for that conversation, which includes the `/workspace` fallback from Path 1 and the codebase-less path from Path 2.

**Why this path matters most**: title generation is the **hottest** `sendQuery` call site by volume. Every new conversation hits it. Cole's deployed server is blocking every conversation creation because title generation fires the pre-spawn scan against an unregistered cwd.

## Proposed architectural fix

Change the `IAssistantClient.sendQuery` contract to accept an explicit `codebaseId` parameter. The pre-spawn check looks up the codebase by ID instead of by cwd. Callers must pass `null` if the invocation is legitimately codebase-less.

### New contract

```typescript
sendQuery(
  prompt: string,
  cwd: string,
  codebaseId: string | null,    // NEW — explicit, not inferred
  resumeSessionId?: string,
  options?: AssistantRequestOptions
): AsyncGenerator<MessageChunk>;
```

### New pre-spawn check

```typescript
// No lookup-by-cwd heuristic. The caller tells us which codebase this is for.
let codebase: Codebase | null = null;
if (codebaseId !== null) {
  codebase = await codebaseDb.getCodebase(codebaseId);
  if (!codebase) {
    // Codebase ID was given but lookup failed → hard error.
    // Either the caller has stale state or the row was deleted mid-query.
    throw new Error(`Codebase ${codebaseId} not found — aborted sendQuery`);
  }
}

// Only scan for registered codebases without consent.
if (codebase && !codebase.allow_env_keys) {
  const merged = await loadConfig(cwd).catch(() => ({ allowTargetRepoKeys: false }));
  if (!merged.allowTargetRepoKeys) {
    const report = scanPathForSensitiveKeys(cwd);
    if (report.findings.length > 0) {
      throw new EnvLeakError(report, 'spawn-existing');
    }
  }
}
// codebaseId === null → explicit codebase-less invocation, skip gate entirely
```

This makes the three cases explicit:

| Caller's codebase state | Caller passes | Pre-spawn behavior |
|---|---|---|
| Has a registered codebase for this run | `codebaseId: 'uuid-here'` | Look up by ID, enforce gate based on `allow_env_keys` |
| Codebase-less by design (title gen, codebase-less assist) | `codebaseId: null` | Skip gate entirely |
| Codebase was deleted or lookup fails | N/A | `getCodebase()` returns null → hard error, no silent bypass |

No lookup-by-cwd heuristic. No guessing. No silent fallthrough.

## Per-path audit (what each caller should pass)

### Path 1 — Orchestrator stale-cleaned fallback

**Current**: falls back to `'/workspace'` when `codebase` is null.

**Proposed**: the fallback is only hit when `!codebase`. That means `codebaseId` is also null at this point. Pass `codebaseId: null`. Pre-spawn skips.

**Also**: the `/workspace` fallback itself is suspicious. Is `/workspace` ever a valid cwd for a running subprocess? Probably not outside of the Docker container where archon happens to have `/workspace` set up. Worth a separate question: should this fallback throw instead of returning a bogus cwd?

### Path 2 — Codebase-less orchestrator path

**Current**: `workerCwd = ctx.cwd` with no codebase association.

**Proposed**: pass `codebaseId: null` explicitly. The "codebase-less" nature becomes a first-class concept instead of inferred from "no codebase row matched". Downstream code (including `sendQuery`) can branch on it cleanly.

### Path 3 — DAG executor

**Current**: `aiClient.sendQuery(finalPrompt, cwd, resumeSessionId, nodeOptionsWithAbort)` — no codebase passed.

**Proposed**: the executor has access to the `codebase` object via `WorkflowDeps.codebase` (or similar — needs audit). Thread `codebase?.id ?? null` through to `sendQuery`. When the executor was invoked for a codebase-less workflow run, `codebase` is null and `codebaseId: null` gets passed correctly.

### Path 4 — Title generator

**Current**: `generateAndSetTitle(conversationDbId, userMessage, assistantType, cwd, workflowName)` with no codebase context.

**Proposed**: add `codebaseId: string | null` as a parameter:

```typescript
export async function generateAndSetTitle(
  conversationDbId: string,
  userMessage: string,
  assistantType: string,
  cwd: string,
  codebaseId: string | null,    // NEW
  workflowName?: string
): Promise<void>
```

The caller in `orchestrator-agent.ts:524` already has access to the conversation's `codebase_id`. Thread it through.

For conversations without a codebase (the codebase-less path), pass `null`. Title generation scans nothing, runs nothing, completes.

## Alternative fix (lighter-weight)

If changing the `sendQuery` contract is too invasive for a near-term fix, an alternative is to add an explicit `skipEnvLeakGate: boolean` option:

```typescript
sendQuery(
  prompt: string,
  cwd: string,
  resumeSessionId?: string,
  options?: AssistantRequestOptions & { skipEnvLeakGate?: boolean }
): AsyncGenerator<MessageChunk>;
```

Title generation passes `skipEnvLeakGate: true`. Codebase-less workflows pass it too. The pre-spawn check honors the flag and bypasses the scan.

**Pros**: smaller diff, no contract change, backwards compatible.

**Cons**: still relies on the lookup-by-cwd heuristic for the "normal" path, so the architectural drift persists. Callers can forget to set the flag. The flag is a workaround, not a fix — the contract is still broken.

**My take**: do the contract change. It's ~20-30 files touched (each caller) but eliminates the class of bug. The flag approach is a patch on a patch.

## Attack vector analysis

After this fix, the "unregistered cwd" path becomes explicit (`codebaseId: null`). Is there an attack where a malicious caller passes `null` to bypass the gate intentionally?

**Threat model**:

1. **Direct SDK caller outside archon** (e.g., someone writes their own script importing `ClaudeClient`): they control `codebaseId` and can pass null. But they also control everything else — they can just not use archon at all and call the Claude SDK directly. Not a real attack vector.

2. **Workflow that happens to be codebase-less**: legitimate use. `codebaseId: null` is correct for these. No bypass.

3. **Web UI user**: can't reach `sendQuery` directly. Has to go through a conversation which has a codebase_id stored. The orchestrator threads it through. No bypass.

4. **CLI user**: `archon workflow run --cwd /path/to/foo` auto-registers first (creates the codebase row), THEN calls sendQuery with the new `codebaseId`. The registration scan has already fired at that point, so consent is already established. No bypass.

Conclusion: **no attack vector** for the explicit-null case. The registration gate remains authoritative.

## Cleanup tasks within this issue

Once the contract change lands:

- **Delete the cwd-based lookup in pre-spawn** — `findCodebaseByDefaultCwd` and `findCodebaseByPathPrefix` were added specifically for this buggy fallback (in #1036). After the fix they're only used for codebase resolution in other flows; audit whether they're still needed elsewhere. If not, delete them.
- **Delete the `/workspace` string fallback in `orchestrator.ts:194`** — should throw or return a typed "no working directory" error instead of a bogus path.
- **Tests** — each migrated call site needs a test that exercises the `codebaseId: null` path and asserts the pre-spawn scan is not called.
- **Contract documentation** — update the `IAssistantClient` interface doc to explicitly describe what `codebaseId: null` means.

## Files to change

| File | Change |
|---|---|
| `packages/core/src/types/index.ts` | Add `codebaseId: string \| null` parameter to `IAssistantClient.sendQuery` |
| `packages/core/src/clients/claude.ts` | Implement new lookup-by-ID logic in `sendQuery`; delete cwd-based fallback |
| `packages/core/src/clients/codex.ts` | Same |
| `packages/core/src/services/title-generator.ts` | Add `codebaseId` parameter, pass through to `sendQuery` |
| `packages/core/src/orchestrator/orchestrator-agent.ts` | Thread `conversation.codebase_id` to title generator and to `sendQuery` calls at lines 840, 959 |
| `packages/core/src/orchestrator/orchestrator.ts` | Delete `/workspace` fallback; change to explicit "no codebase" return |
| `packages/workflows/src/dag-executor.ts` | Thread `deps.codebase?.id ?? null` to `sendQuery` at lines 840 and 1600 |
| `packages/core/src/db/codebases.ts` | Audit `findCodebaseByDefaultCwd` and `findCodebaseByPathPrefix` — if no other callers, delete |
| Tests | Update every test that calls `sendQuery` directly to include the new parameter |

Estimate: **~20-30 files touched**, most of them test file updates. Core logic change is ~100 lines.

## Validation

- [ ] All tests pass after the parameter addition
- [ ] Manual: create a conversation on a deployed instance, confirm title generation no longer hits the pre-spawn gate
- [ ] Manual: register a codebase with `allow_env_keys: false`, run a workflow node, confirm pre-spawn scan fires (regression for registered path)
- [ ] Manual: run `archon-assist "hello"` without any project context, confirm it works (codebase-less path)
- [ ] Manual: delete a codebase mid-conversation, attempt to continue the conversation, confirm the hard error fires instead of silent fallback

## Priority

**P1** — not a direct demo blocker if #991 (the shallow fix) lands first. But the current design is architecturally broken and will keep producing bugs until it's fixed. Schedule for post-v0.3.2.

## Related

- **#991** — shallow fix for the same symptom (null check in pre-spawn). Land first. This issue is the follow-up that repairs the contract properly.
- **#1036** — original env-leak gate that introduced the pre-spawn check with the buggy assumption
- **#983** — env-leak gate polish sweep that added `loadConfig` fail-closed, retroactive consent UI, context-aware errors — did not touch the contract

## Credit

Root cause identified by @coleam00 in Slack:

> The pre-spawn env-leak check in ClaudeClient.sendQuery() and CodexClient.sendQuery() used if (!codebase?.allow_env_keys) which evaluates to true when codebase is null (unregistered). This made the scanner run on EVERY query for unregistered codebases, creating a circular dependency where registration fails (because of the .env) and the pre-spawn check also fails (because there's no registered codebase to check allow_env_keys against).

The shallow fix in #991 addresses the symptom. This issue tracks the architectural cleanup so the drift doesn't recur.


File	Change
`packages/core/src/types/index.ts`	Add `codebaseId: string \| null` parameter to `IAssistantClient.sendQuery`
`packages/core/src/clients/claude.ts`	Implement new lookup-by-ID logic in `sendQuery`; delete cwd-based fallback
`packages/core/src/clients/codex.ts`	Same
`packages/core/src/services/title-generator.ts`	Add `codebaseId` parameter, pass through to `sendQuery`
`packages/core/src/orchestrator/orchestrator-agent.ts`	Thread `conversation.codebase_id` to title generator and to `sendQuery` calls at lines 840, 959
`packages/core/src/orchestrator/orchestrator.ts`	Delete `/workspace` fallback; change to explicit "no codebase" return
`packages/workflows/src/dag-executor.ts`	Thread `deps.codebase?.id ?? null` to `sendQuery` at lines 840 and 1600
`packages/core/src/db/codebases.ts`	Audit `findCodebaseByDefaultCwd` and `findCodebaseByPathPrefix` — if no other callers, delete
Tests	Update every test that calls `sendQuery` directly to include the new parameter

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(orchestrator): multiple code paths call sendQuery with unregistered cwd, violating pre-spawn check assumptions #993

Summary

Current contract (implicit, broken)

The four upstream paths that violate the contract

Path 1 — Orchestrator stale-cleaned fallback to `/workspace`

Path 2 — Codebase-less orchestrator path

Path 3 — DAG executor for workflow runs

Path 4 — Title generator

Proposed architectural fix

New contract

New pre-spawn check

Per-path audit (what each caller should pass)

Path 1 — Orchestrator stale-cleaned fallback

Path 2 — Codebase-less orchestrator path

Path 3 — DAG executor

Path 4 — Title generator

Alternative fix (lighter-weight)

Attack vector analysis

Cleanup tasks within this issue

Files to change

Validation

Priority

Related

Credit

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Caller's codebase state	Caller passes	Pre-spawn behavior
Has a registered codebase for this run	`codebaseId: 'uuid-here'`	Look up by ID, enforce gate based on `allow_env_keys`
Codebase-less by design (title gen, codebase-less assist)	`codebaseId: null`	Skip gate entirely
Codebase was deleted or lookup fails	N/A	`getCodebase()` returns null → hard error, no silent bypass

bug(orchestrator): multiple code paths call sendQuery with unregistered cwd, violating pre-spawn check assumptions #993

Description

Summary

Current contract (implicit, broken)

The four upstream paths that violate the contract

Path 1 — Orchestrator stale-cleaned fallback to /workspace

Path 2 — Codebase-less orchestrator path

Path 3 — DAG executor for workflow runs

Path 4 — Title generator

Proposed architectural fix

New contract

New pre-spawn check

Per-path audit (what each caller should pass)

Path 1 — Orchestrator stale-cleaned fallback

Path 2 — Codebase-less orchestrator path

Path 3 — DAG executor

Path 4 — Title generator

Alternative fix (lighter-weight)

Attack vector analysis

Cleanup tasks within this issue

Files to change

Validation

Priority

Related

Credit

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Path 1 — Orchestrator stale-cleaned fallback to `/workspace`