Skip to content

perf(models-config): targetProvider short-circuit with disk-vs-config validation#73261

Open
zeroaltitude wants to merge 26 commits intoopenclaw:mainfrom
zeroaltitude:perf/models-config-target-provider-short-circuit
Open

perf(models-config): targetProvider short-circuit with disk-vs-config validation#73261
zeroaltitude wants to merge 26 commits intoopenclaw:mainfrom
zeroaltitude:perf/models-config-target-provider-short-circuit

Conversation

@zeroaltitude
Copy link
Copy Markdown
Contributor

@zeroaltitude zeroaltitude commented Apr 28, 2026

Summary

Splits the targetProvider short-circuit half of #72869 into a standalone PR.

Adds a fast path to ensureOpenClawModelsJson for callers that know which provider/model they are about to use (e.g. the pi-embedded runner). When set, the implicit-provider-discovery pipeline can be skipped IF the on-disk models.json provider entry structurally matches what the current configuration would produce.

Why structural comparison

A naive "is configured" check (does any non-empty credential exist on disk for this provider) silently bypasses several real failure modes:

  • Rotated apiKey: cold start with new key in config, old key still on disk, presence-only check fires, all calls fail with 401 until something else invalidates.
  • Attacker-tampered baseUrl: redirect to exfil endpoint kept across restarts.
  • Attacker-injected headers: arbitrary auth material kept across restarts.

The new readExistingProviderMatchesConfig compares all four security-relevant fields:

Field Comparison
apiKey resolved through resolveSecretInputRef (env-ref expansion via createConfigRuntimeEnv) before string equality vs. disk
baseUrl exact string equality
headers stable structural equality (key-order independent)
auth stable structural equality

Any mismatch falls through to full planning + write. This is safe to use on cold start and after gateway restart.

API

Adds EnsureOpenClawModelsJsonOptions exported type with targetProvider and targetModel, layered on top of the existing options shape (kept all of pluginMetadataSnapshot, workspaceDir, providerDiscoveryProviderIds, providerDiscoveryTimeoutMs).

Tests

6 cases in models-config.target-provider-short-circuit.test.ts:

  • hit-on-match — full structural match short-circuits planning
  • miss-on-rotated-key — config apiKey change forces a full plan
  • miss-on-baseUrl-change — tampered disk baseUrl rejects short-circuit
  • miss-on-tampered-headers — any disk header drift rejects short-circuit
  • miss-on-cold-cache — empty in-memory cache + missing disk file forces a plan
  • hit-after-warm-fingerprint — warm in-memory cache reuse

Dependencies

This is one of two PRs split out of the original combined #72869. The cache-fingerprint work (auth-profiles content hash, models.json drift detection, fingerprint hashing, security hardening) is in #73260.

This branch can land independently of the cache PR; the two changes touch different code paths.

Related

Real behavior proof

  • Behavior or issue addressed: Target-provider models-config short-circuit safety from PR perf(models-config): targetProvider short-circuit with disk-vs-config validation #73261, including provider-level api drift and uncacheable models hash fail-closed behavior.

  • Real environment tested: Local OpenClaw topic branch perf/models-config-target-provider-short-circuit at 2f89c7b99a, including final perf(models-config): content-hash auth-profiles + models.json drift detection #73260 ContentHashOutcome contract, isolated temporary agent directory, production ensureOpenClawModelsJson invoked through a tsx runtime driver with no test harness or mocks.

  • Exact steps or command run after this patch: Ran a real proof driver after the patch covering steady-state targetProvider hit, provider api drift, and oversize models.json.

  • Evidence after fix: Full copied runtime output is in the PR comment: perf(models-config): targetProvider short-circuit with disk-vs-config validation #73261 (comment)

    Excerpt of copied live output:

    ensureOpenClawModelsJson targetProvider=... steady-state: wrote=false
    provider api changed on disk/config comparison: short-circuit refused; full plan wrote=true
    planner restored config-side provider api
    models.json > 1 MiB: scoped cache refused; full plan wrote=true
    planner shrank models.json below 1 MiB
    proof: PASS
    
  • Observed result after fix: The production path keeps the short-circuit only for structurally matching provider config, refuses provider-level api drift, and refuses uncacheable models hashes instead of treating them as stable cache hits.

  • What was not tested: Nothing else for this PR's changed behavior beyond the isolated local runtime proof and supplemental automated validation.

…etection

Replaces mtime-based fingerprinting with content-based hashes for the
implicit-provider-discovery cache key, plus a second-factor models.json
content hash that catches external edits / partial corruption / sibling
process writes between cache hits. Includes the security hardening
findings raised by Aisle on the original combined PR (openclaw#72869).

# Why content hashes

The previous mtime-based key invalidated on every OAuth token refresh
because auth-profiles.json gets rewritten with new access/refresh
timestamps even when the set of available providers doesn't change.
Same for models.json mtime: the file is the OUTPUT of this function,
so each call observed its own write and invalidated the next call.

Now:

- auth-profiles.json: SHA-256 over a stable serialization that strips
  volatile OAuth session fields (access, refresh, expires*, issuedAt,
  refreshed/lastChecked/lastRefresh/lastValidatedAt). Token rotation
  no longer invalidates; structural changes (added/removed profiles,
  rotated static type:token credentials) still do.

- models.json: NOT included in the input fingerprint (would cause
  self-invalidation). Instead its content hash is captured at write
  time and stored alongside the readyCache entry. Every cache check
  recomputes the file hash and compares; any external edit /
  corruption / tamper invalidates the cache and forces re-plan.

# Security hardening (Aisle review on PR openclaw#72869)

- 🟠 High #1 (CWE-59 symlink-following chmod): ensureModelsFileMode
  now lstats first and refuses to chmod symlinks or non-regular files.
  Closes the attack where a symlink at ${agentDir}/models.json points
  at a sensitive owned file and our best-effort chmod follows.

- 🟡 Med #3 (CWE-1321 prototype pollution): stripAuthProfilesVolatile
  now uses Object.create(null) for the result and explicitly filters
  __proto__ / prototype / constructor keys (belt-and-suspenders).

- 🟡 Med #4 (DoS via unbounded fingerprinting): MAX_AUTH_PROFILES_BYTES
  = 8 MiB cap (above which we hash raw bytes), MAX_AUTH_PROFILES_DEPTH
  = 64 with a depth-cap marker so deeply-nested JSON cannot stack-
  overflow the gateway.

- 🟡 Med #5 (CWE-312 secrets in cache): buildModelsJsonFingerprint now
  returns SHA-256 hex of the canonical payload instead of the raw
  stable-stringified payload. The readyCache holds no verbatim secret
  material; heap snapshots / debug telemetry / core dumps cannot leak
  apiKey strings via this path.

# Token-rotation correctness

"token" is intentionally NOT in AUTH_PROFILE_VOLATILE_FIELDS even
though OAuth-style 'token' fields rotate. Profiles with type: 'token'
use the literal token key as a long-lived static credential, and
stripping it would mask real auth-state changes when a user rotates
that credential. Documented inline.

# Tests

Two tests in models-config.fingerprint-cache.test.ts:
- 'does not invalidate the cache when OAuth session fields rotate'
  (rotates access/refresh/expires; cache stays valid)
- 'DOES invalidate the cache when a static type:token credential rotates'
  (rotates the literal token field; cache invalidates)

# Compatibility

MODELS_JSON_STATE.readyCache value shape extended with modelsJsonHash:
{ fingerprint, modelsJsonHash, result }. All three plan return paths
(skip/noop/write) capture the post-write hash. The refreshedFingerprint
re-key path forwards modelsJsonHash through unchanged.

This is one of two PRs split out of openclaw#72869. The targetProvider
short-circuit work landed separately.
… validation

Add a fast path to ensureOpenClawModelsJson for callers that know which
provider/model they're about to use (e.g. the pi-embedded runner).
When set, the implicit-provider-discovery pipeline can be skipped IF
the on-disk models.json provider entry STRUCTURALLY matches what the
current configuration would produce.

# Why structural comparison

A naive 'is configured' check (does any non-empty credential exist on
disk for this provider) silently bypasses several real failure modes:

- Rotated apiKey: cold start with new key in config, old key still on
  disk, presence-only check fires, all calls fail with 401 until
  something else invalidates.
- Attacker-tampered baseUrl: redirect to exfil endpoint kept across
  restarts.
- Attacker-injected headers: arbitrary auth material kept across
  restarts.

The new readExistingProviderMatchesConfig compares all four
security-relevant fields:

  apiKey  - resolved through resolveSecretInputRef (env-ref expansion
            via createConfigRuntimeEnv) before string equality vs. disk
  baseUrl - exact string equality
  headers - stable structural equality (key-order independent)
  auth    - stable structural equality

Any mismatch falls through to full planning + writes.  This is safe to
use on cold start and after gateway restart.

# API

Adds EnsureOpenClawModelsJsonOptions exported type with targetProvider
and targetModel, layered on top of the existing options shape (kept
all of pluginMetadataSnapshot, workspaceDir, providerDiscoveryProviderIds,
providerDiscoveryTimeoutMs).

# Tests

5 cases in models-config.target-provider-short-circuit.test.ts:
- hit-on-match: full structural match short-circuits planning
- miss-on-rotated-key: config apiKey change forces a full plan
- miss-on-baseUrl-change: tampered disk baseUrl rejects short-circuit
- miss-on-tampered-headers: any disk header drift rejects short-circuit
- miss-on-cold-cache: empty in-memory cache + missing disk file forces a plan
- hit-after-warm-fingerprint: warm in-memory cache reuse

# Dependencies

This is one of two PRs split out of the original combined openclaw#72869.
The cache-fingerprint work (auth-profiles content hash, models.json
drift detection, fingerprint hashing, security hardening) landed
separately as perf/models-config-cache-fingerprint.

This branch can land independently.  When both branches land, the
original PR openclaw#72869 can be closed as superseded.
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot Bot commented Apr 28, 2026

🔒 Aisle Security Analysis

We found 4 potential security issue(s) in this PR:

# Severity Title
1 🟠 High TOCTOU symlink swap enables arbitrary file read and chmod via models.json short-circuit path
2 🟠 High SSRF / credential exfiltration via models.json per-model baseUrl/api not validated in targetProvider short-circuit
3 🟡 Medium Potential stack/CPU DoS via recursive stableStringify on attacker-controlled models.json fields
4 🔵 Low TOCTOU race in models.json short-circuit integrity check leads to stale readyCache
1. 🟠 TOCTOU symlink swap enables arbitrary file read and chmod via models.json short-circuit path
Property Value
Severity High
CWE CWE-367
Location src/agents/models-config.ts:98-316

Description

In readExistingProviderMatchesConfig(), the code attempts to be symlink-safe by calling fs.lstat(targetPath) and rejecting symlinks and large files, but then performs a separate fs.readFile(targetPath, "utf8").

This creates a classic time-of-check/time-of-use window:

  • Check: lstat() verifies models.json is a regular file and <= 1 MiB
  • Use: a later readFile() re-opens the path and will follow symlinks
  • If an attacker can write to agentDir (or otherwise replace models.json), they can swap the file between these calls (or immediately after) with a symlink or different file, bypassing the earlier protections.

Additionally, ensureModelsFileModeForModelsJson() uses fs.chmod(pathname, 0o600) on the path with no symlink protection. If models.json is swapped to a symlink (e.g., to some other file writable by the process), the chmod can affect an unintended target.

Vulnerable code:

const lst = await fs.lstat(targetPath).catch(() => null);
if (!lst || lst.isSymbolicLink() || !lst.isFile()) {
  return false;
}
if (lst.size > MAX_MODELS_JSON_SHORT_CIRCUIT_BYTES) {
  return false;
}
...
raw = await fs.readFile(targetPath, "utf8");

and:

await fs.chmod(pathname, 0o600).catch(() => {// best-effort
});

Recommendation

Make the read and permission-setting operations symlink-safe and race-resistant by operating on an already-opened file descriptor and validating it with fstat, ideally with O_NOFOLLOW.

One approach (POSIX):

import { open } from "node:fs/promises";
import { constants as FS } from "node:fs";

const fh = await open(targetPath, FS.O_RDONLY | FS.O_NOFOLLOW);
try {
  const st = await fh.stat();
  if (!st.isFile()) return false;
  if (st.size > MAX_MODELS_JSON_SHORT_CIRCUIT_BYTES) return false;
  const raw = await fh.readFile({ encoding: "utf8" });
  ...
} finally {
  await fh.close();
}

For mode enforcement, avoid chmod(path) on a path that can be swapped; instead:

  • re-open with O_NOFOLLOW and use fh.chmod(0o600) (or fs.fchmod), or
  • at minimum, lstat() immediately before chmod and abort if symlink, acknowledging remaining race windows.

This removes the TOCTOU window and prevents symlink-following reads/chmods.

2. 🟠 SSRF / credential exfiltration via models.json per-model baseUrl/api not validated in targetProvider short-circuit
Property Value
Severity High
CWE CWE-918
Location src/agents/models-config.ts:260-279

Description

The ensureOpenClawModelsJson() targetProvider short-circuit uses readExistingProviderMatchesConfig() to decide whether an existing on-disk models.json can be trusted without re-planning/re-writing.

However, readExistingProviderMatchesConfig() only compares a subset of provider fields (apiKey, baseUrl, headers, auth) and explicitly does not compare other security-relevant fields such as the provider's models[] entries.

This is security-relevant because the runtime later uses per-model fields originating from models.json (not just provider-level fields). In particular, the embedded runner falls back to discoveredModel.baseUrl (from models.json) when no provider-level baseUrl override is configured:

  • If an attacker can tamper with models.json on disk (or if it is supplied from an untrusted location), they can inject a malicious per-model baseUrl / api (or other request-shaping fields) while keeping provider-level baseUrl/headers/auth unchanged.
  • The short-circuit will accept the disk state and skip rewriting, leaving the malicious per-model destination in place.
  • Requests can then be sent to attacker-controlled endpoints, potentially including provider credentials/headers, enabling SSRF and/or credential exfiltration.

Vulnerable logic (short-circuit trusts disk even if models[] differs):

// Other provider fields (models[], ... compat, etc.) are NOT compared.// ...
if (!stableEqual(configuredProvider.baseUrl, diskProvider.baseUrl)) return false;// apiKey, headers, auth comparisons...
return true;

Runtime sink consuming per-model baseUrl from models.json:

baseUrl: providerConfig.baseUrl ?? discoveredModel.baseUrl,
...
baseUrl: requestConfig.baseUrl ?? discoveredModel.baseUrl,

Recommendation

Do not short-circuit unless all request-destination and request-shaping fields that can affect runtime network calls are proven consistent between config and disk.

At minimum, include structural comparison of models[] entries for fields that can influence transport, such as per-model baseUrl, api, headers, and provider-level request options (if persisted), or simply reject short-circuit when diskProvider.models is present and configuredProvider does not explicitly define an identical set.

Example defensive approach (prefer correctness over speed):

// Pseudocode: extend comparison to cover model-level transport fields
if (!stableEqual(configuredProvider.models, diskProvider.models)) return false;
if (!stableEqual(configuredProvider.request, diskProvider.request)) return false;
if (!stableEqual(configuredProvider.api, diskProvider.api)) return false;

Alternatively, make the short-circuit conditional on cfg.models.mode and only allow it when the system is expected to fully manage models.json (no merge semantics), and/or always re-plan when models.json contains any per-model transport overrides.

Also consider validating that any resolved baseUrl targets are within an allowlist (or at least are HTTPS and not link-local/private IP ranges) before use to mitigate SSRF across the stack.

3. 🟡 Potential stack/CPU DoS via recursive stableStringify on attacker-controlled models.json fields
Property Value
Severity Medium
CWE CWE-674
Location src/agents/models-config.ts:31-44

Description

ensureOpenClawModelsJson() added a short-circuit path that reads and parses models.json from disk and then compares diskProvider.baseUrl/headers/auth with configured values using stableEqual(), which is implemented as stableStringify(a) === stableStringify(b).

In src/agents/models-config.ts, the local stableStringify() implementation is fully recursive and has no depth limit. While the file read is capped to 1 MiB, an attacker who can modify models.json can still craft extremely deep (but small) nested arrays/objects (e.g., [[[[...]]]]) that:

  • cause maximum call stack size exceeded crashes (uncontrolled recursion)
  • or cause high CPU time during stringification, stalling the process

Vulnerable code:

function stableEqual(a: unknown, b: unknown): boolean {
  return stableStringify(a) === stableStringify(b);
}

and the recursive serializer:

function stableStringify(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map((entry) => stableStringify(entry)).join(",")}]`;
  const entries = Object.entries(value as Record<string, unknown>).toSorted(([a], [b]) =>
    a.localeCompare(b),
  );
  return `{${entries.map(([key, entry]) => `${JSON.stringify(key)}:${stableStringify(entry)}`).join(",")}}`;
}

Even though diskProvider originates from JSON (so it can’t contain cycles), it can contain adversarial depth.

Recommendation

Avoid unbounded recursive stringification on untrusted data.

Options:

  1. Validate shapes and compare directly for the specific fields being checked:
  • baseUrl: must be string | undefined
  • headers: must be Record<string,string> | undefined (no nested objects)
  • auth: constrain to a known shallow schema
  1. If stable structural comparison is still required, use a non-recursive implementation or enforce a maximum depth and fail closed (return false to force full planning) when exceeded.

Example depth-limited stringify:

function stableStringifySafe(value: unknown, depth = 0, maxDepth = 200): string {
  if (depth > maxDepth) throw new Error("max depth exceeded");
  if (value === null || typeof value !== "object") return JSON.stringify(value) ?? "null";
  if (Array.isArray(value)) {
    return `[${value.map(v => stableStringifySafe(v, depth + 1, maxDepth)).join(",")}]`;
  }
  const rec = value as Record<string, unknown>;
  const keys = Object.keys(rec).sort();
  return `{${keys.map(k => `${JSON.stringify(k)}:${stableStringifySafe(rec[k], depth + 1, maxDepth)}`).join(",")}}`;
}

function stableEqual(a: unknown, b: unknown): boolean {
  try {
    return stableStringifySafe(a) === stableStringifySafe(b);
  } catch {
    return false; // fall through to full plan
  }
}

This keeps the short-circuit check robust while preventing stack/CPU DoS.

4. 🔵 TOCTOU race in models.json short-circuit integrity check leads to stale readyCache
Property Value
Severity Low
CWE CWE-367
Location src/agents/models-config.ts:422-456

Description

ensureOpenClawModelsJson() has a new "targetProvider short-circuit" path that:

  • Reads and parses models.json to check if the on-disk provider entry matches the current config
  • If it matches, it returns {wrote:false} and populates MODELS_JSON_STATE.readyCache
  • This happens outside withModelsJsonWriteLock() (which is only an in-process lock)

Because there is no locking or post-check revalidation of the file’s contents/identity, another process (or thread) with concurrent write access to models.json can modify the file immediately after the match check and before/after the cache is populated.

Impact:

  • The function can return success even though models.json has changed since it was verified
  • The result is cached in-memory (readyCache), delaying corrective rewrites until the cache key changes or the process restarts
  • If downstream code trusts models.json after calling ensureOpenClawModelsJson(), it may consume attacker-modified provider settings (e.g., baseUrl/headers/auth), depending on the broader threat model

Vulnerable code (short-circuit path):

const matches = await readExistingProviderMatchesConfig(...);
if (matches) {
  await ensureModelsFileModeForModelsJson(targetPath);
  const result = { agentDir, wrote: false };
  MODELS_JSON_STATE.readyCache.set(cacheKey, Promise.resolve({ fingerprint, result }));
  return result;
}

Recommendation

Make the short-circuit check robust against concurrent modification before caching/returning:

Option A (simplest): run the short-circuit path under the same withModelsJsonWriteLock(targetPath, ...) block so only one in-process caller can check+cache at a time.

Option B (stronger, cross-process TOCTOU mitigation): use an atomic check strategy and only cache if the file is unchanged:

  • fd = await fs.open(targetPath, 'r')
  • stat1 = await fd.stat() (capture ino, mtimeMs, size)
  • raw = await fd.readFile('utf8') and parse/compare
  • stat2 = await fd.stat() and ensure ino/mtimeMs/size unchanged
  • only then populate readyCache and return

Example sketch:

const fd = await fs.open(targetPath, 'r');
try {
  const s1 = await fd.stat();
  const raw = await fd.readFile('utf8');
  const parsed = JSON.parse(raw);
  const ok = /* compare parsed vs config */;
  const s2 = await fd.stat();
  if (ok && s1.ino === s2.ino && s1.mtimeMs === s2.mtimeMs && s1.size === s2.size) {
    MODELS_JSON_STATE.readyCache.set(cacheKey, Promise.resolve({ fingerprint, result }));
    return result;
  }
} finally {
  await fd.close();
}

Also consider avoiding caching the short-circuit result unless the check is performed under a lock or validated with a file-identity/mtime recheck.


Analyzed PR: #73261 at commit 10361a0

Last updated on: 2026-04-28T05:45:21Z

Re-review progress:

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 682c12624c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/models-config.ts Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 28, 2026

Greptile Summary

Adds a targetProvider fast path to ensureOpenClawModelsJson that short-circuits the implicit-provider-discovery pipeline when the on-disk models.json entry structurally matches current config. The implementation correctly addresses all previously-flagged security concerns (symmetric baseUrl comparison, env-ref apiKey resolution, stable headers/auth equality, prototype-chain injection guard, symlink protection, size cap) and populates readyCache after a successful short-circuit so subsequent calls take the warm in-memory path.

Confidence Score: 5/5

PR is safe to merge — no P0 or P1 findings; all previously flagged security issues confirmed resolved.

All four security-critical field comparisons (apiKey, baseUrl, headers, auth) are symmetric and handle the undefined-vs-string drift case. The readyCache ordering fix is verified by the new fs.readFile spy test. Prototype injection, symlink, and oversized-file vectors are guarded. No new attack surfaces introduced.

No files require special attention.

Reviews (2): Last reviewed commit: "fix(models-config): symmetric baseUrl, e..." | Re-trigger Greptile

Comment thread src/agents/models-config.ts
Comment thread src/agents/models-config.ts Outdated
Comment thread src/agents/models-config.target-provider-short-circuit.test.ts
Addresses Codex P1, Greptile P2, and Aisle Med #1/#2/#3 on openclaw#73260:

# Streaming bounded hash (Codex P1 / Aisle Med #2)

The previous oversize-branch in readAuthProfilesStableHash still did
fs.readFile(path) which loaded the entire file into memory \u2014 the
MAX_AUTH_PROFILES_BYTES cap was effectively unenforced.  Same problem
for readModelsJsonContentHash (no cap at all).

New helper safeHashRegularFile():
  - lstat first; reject symlinks and non-regular files
  - bail at lstat-time if size exceeds maxBytes (returns sentinel hash
    so size-change still invalidates the cache)
  - open with O_NOFOLLOW where supported (closes lstat\u2192open TOCTOU)
  - fstat after open to verify it's still a regular file
  - createReadStream(fd) with bounded highWaterMark; destroy if accumulated
    bytes exceed maxBytes (handles the case where attacker grows the
    file between lstat and stream completion)

Both readers now route through safeHashRegularFile.  Models.json gets
a 1 MiB cap (MAX_MODELS_JSON_BYTES); auth-profiles keeps its 8 MiB cap.

# O_NOFOLLOW chmod (Aisle Med #1, CWE-367)

The previous lstat-then-chmod sequence was racy: an attacker who
could rename/replace ${agentDir}/models.json between the two syscalls
could have chmod() follow a swapped-in symlink to an arbitrary file.

Replaced with fs.open(path, O_RDONLY | O_NOFOLLOW) \u2192 fchmod via the
file handle.  The open itself refuses symlinks atomically; the
fchmod operates on the validated fd, eliminating the race.  When
O_NOFOLLOW is unavailable (rare but possible), falls back to plain
O_RDONLY \u2014 the prior lstat protection was already best-effort.

# Symlink-safe reads (Aisle Med #3, CWE-59)

Previous fs.readFile in the hash readers followed symlinks.  With
the safeHashRegularFile refactor above, both auth-profiles.json and
models.json now reject symlinks before reading.

# JSDoc fix (Greptile P2)

readModelsJsonContentHash's prior comment said "forces a re-plan"
when the file is absent, but the call site does null === null
comparison which is a cache hit, not a re-plan (the steady-state
skip case for plan.action === 'skip').  Updated the comment to
explain the actual semantics: stable absence is a stable result;
drift is a re-plan.

# Tests

14/14 existing tests pass.  No new test cases needed \u2014 the existing
file-mode and fingerprint-cache tests cover both the symlink-rejection
paths (via lstat shortcut) and the modelsJsonHash second-factor.
Lint: 0 errors.
…tegration

Addresses Codex P2, Greptile P1+P2x2, and Aisle High #1 + Med #2/#3/#4
on PR openclaw#73261:

# Greptile P1 / Aisle High #1: asymmetric baseUrl (CWE-918, SSRF)

The previous guard:
  if (typeof configuredProvider.baseUrl === 'string' &&
      configuredProvider.baseUrl !== diskProvider.baseUrl)
short-circuited the check entirely when config omitted baseUrl
(common for providers with compiled-in defaults).  An attacker who
could tamper with on-disk models.json could set baseUrl to an exfil
endpoint and the short-circuit would accept it silently \u2014 exactly
the SSRF/credential-exfil vector this PR was meant to close.

Replaced with symmetric stableEqual() so config-undefined vs
disk-string is a mismatch and falls through to full planning, which
re-applies provider/plugin defaults and rewrites the file.

# Codex P2: env-ref API key comparison

resolveConfiguredApiKeyForCompare resolved env refs to the env-var
VALUE (env[ref.id]).  But planOpenClawModelsJson persists env-source
api keys to models.json as the env-var NAME (e.g. 'OPENAI_API_KEY'),
not the value \u2014 that's what resolveApiKeyFromCredential returns for
env source.  Comparing value-vs-name always mismatched, so the
short-circuit never fired for the most common config form
('apiKey: "${env.OPENAI_API_KEY}"').

Now the helper returns the env-var NAME for env-source refs, while
still verifying the env is currently populated (so we don't
short-circuit on a misconfigured environment).  Plaintext values
still compare directly.

# Greptile P2 (perf): readyCache integration

The short-circuit returned before the readyCache check, so warm
callers paid the disk read + JSON.parse cost on every call.
Reordered:
  1. compute fingerprint (cheap)
  2. check readyCache \u2014 warm hit returns immediately, no disk I/O
  3. if cold, attempt short-circuit on disk
  4. if short-circuit succeeds, populate readyCache so subsequent
     calls take the warm path

Net effect: warm callers now skip disk entirely; cold callers with
intact disk state still get the short-circuit benefit; full plan
fires only on real drift.

# Aisle Med #2 (CWE-59): symlink-following short-circuit reads

readExistingProviderMatchesConfig used fs.readFile on a possibly-
symlinked target.  Now lstat-checks first and refuses symlinks /
non-regular files.

# Aisle Med #3 (CWE-1321): prototype-chain key access

explicitProviders[targetProvider] could fall through to a prototype
key like '__proto__' / 'constructor' / 'prototype'.  Now uses
Object.hasOwn at both lookup sites (caller AND
readExistingProviderMatchesConfig) to refuse inherited keys.  Also
explicit string check rejects the three known dangerous keys.

# Aisle Med #4 (CWE-400): unbounded models.json read

Added MAX_MODELS_JSON_SHORT_CIRCUIT_BYTES = 1 MiB cap on the disk
read in the short-circuit path.  Files above the cap fall through
to full planning rather than being parsed.

# Tests

Updated hit-after-warm-fingerprint test (Greptile P2) to use
fs.readFile spy and assert no disk read occurs on the warm path.
Added short-circuit-populates-cache test that drops the in-memory
cache between runs, fires the short-circuit, then verifies the
third call takes the warm path with no disk read.

14/14 tests pass.  Lint: 0 errors.
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

All five findings landed in 10361a0569:

  • 🟠 fix: add @lid format support and allowFrom wildcard handling #1 High (CWE-918 SSRF via tampered baseUrl): symmetric stableEqual comparison; config-undefined vs disk-string now mismatches and falls through to full plan. See Greptile P1 thread for details.
  • 🟡 Login fails with 'WebSocket Error (socket hang up)' ECONNRESET #2 Med (CWE-59 symlink-following short-circuit reads): readExistingProviderMatchesConfig now lstats first and refuses symlinks / non-regular files before reading.
  • 🟡 WA business, groups & office hours  #3 Med (CWE-1321 prototype-chain key access): Object.hasOwn checks at both lookup sites (caller-side explicitProviders[targetProvider] and disk-side parsed.providers[targetProvider]); explicit string-equality rejection of __proto__ / constructor / prototype.
  • 🟡 Images not passed to Claude CLI - only path reference in text #4 Med (CWE-400 unbounded read): MAX_MODELS_JSON_SHORT_CIRCUIT_BYTES = 1 MiB cap enforced at lstat time before reading.
  • 🟡 CLI: add Opencode integration #5 Med (partial provider comparison): documented inline that the comparison covers apiKey/baseUrl/headers/auth (the security-critical subset). Tampering with non-security fields like models[] / maxTokens / contextWindow would change inference behavior but isn't an exfil vector \u2014 accepting that trade-off keeps the short-circuit reachable. Comment notes that adding fields here is the path forward if any of those becomes security-critical.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

@greptile-apps: please do re-review/re-score

@clawsweeper
Copy link
Copy Markdown
Contributor

clawsweeper Bot commented Apr 30, 2026

Codex review: needs maintainer review before merge.

Summary
This PR adds content-hash models-config cache validation plus a provider/model-scoped ensureOpenClawModelsJson fast path wired into pi-embedded run and compaction callers, with changelog, regression tests, and a runtime proof script.

Reproducibility: yes. Source inspection shows current main lacks target-provider/model hints and the embedded callers pass only workspaceDir, while PR head adds the fast path and caller wiring.

Real behavior proof
Sufficient (terminal): Latest-head terminal proof for d8dfa42082 shows a production ensureOpenClawModelsJson runtime driver covering validated-outcome caching, drift rejection, and oversize fail-closed behavior.

Next step before merge
The remaining action is maintainer/contributor branch reconciliation because GitHub reports the PR dirty and it overlaps with an open companion PR, not a narrow ClawSweeper repair defect.

Security
Cleared: No remaining concrete security or supply-chain regression was found in the inspected latest diff; the prior TOCTOU cache concern is addressed by caching the validated content outcome.

Review details

Best possible solution:

Rebase and reconcile the branch with current main and the companion cache-fingerprint work, then land after normal maintainer review and exact-head gates.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main lacks target-provider/model hints and the embedded callers pass only workspaceDir, while PR head adds the fast path and caller wiring.

Is this the best way to solve the issue?

Yes for the inspected implementation direction. The latest head binds scoped cache entries to the validated models.json outcome and fails closed on the previously flagged drift paths; the remaining blocker is branch reconciliation, not a discrete code defect from this pass.

Acceptance criteria:

  • pnpm test src/agents/models-config.target-provider-short-circuit.test.ts
  • pnpm test src/agents/models-config.fingerprint-cache.test.ts src/agents/models-config.write-serialization.test.ts
  • pnpm tsgo:core
  • pnpm tsgo:core:test
  • pnpm config:schema:check

What I checked:

Likely related people:

  • steipete: Recent GitHub path history shows repeated models-config and embedded-runner maintenance, including filesystem safety, model cache, provider discovery, and planning refactors. (role: recent maintainer; confidence: high; commits: 538605ff44d2, 35da7d2c992c, 947aae5a9920; files: src/agents/models-config.ts, src/agents/pi-embedded-runner/run.ts, src/agents/pi-embedded-runner/compact.ts)
  • shakkernerd: Recent history touches current-main models-config and embedded runner metadata/workspace plumbing adjacent to this fast path. (role: recent adjacent owner; confidence: high; commits: 1235f7f98168, 5655c2b0666d, 498af508d045; files: src/agents/models-config.ts, src/agents/pi-embedded-runner/run.ts, src/agents/pi-embedded-runner/compact.ts)
  • vincentkoc: Path history attributes the existing models.json readiness cache for embedded runs to prior merged work, which is the behavior this PR extends. (role: introduced related behavior; confidence: medium; commits: 2b210703a3b9, 1d935cce5155, 877eb1cbed04; files: src/agents/models-config.ts, src/agents/pi-embedded-runner/run.ts)
  • joshavant: Recent merged SecretRef/source-managed provider persistence hardening is adjacent to the config-vs-disk credential comparison surface touched here. (role: adjacent security owner; confidence: medium; commits: 0bcb95e8fa5e, 36d2ae2a2235; files: src/agents/models-config.ts, src/agents/models-config.providers.normalize.ts)

Remaining risk / open question:

  • GitHub reports mergeable_state: dirty, so the branch needs rebase or conflict reconciliation before merge.
  • The PR currently overlaps with open companion PR perf(models-config): content-hash auth-profiles + models.json drift detection #73260; landing order or diff ownership should be reconciled by a maintainer.
  • I did not run tests locally because this review is read-only; validation is based on source inspection, supplied terminal proof, and GitHub check metadata.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 4721ca8e459f.

@openclaw-barnacle openclaw-barnacle Bot added channel: line Channel integration: line commands Command implementations labels May 1, 2026
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation triage: needs-real-behavior-proof Candidate: external PR needs after-fix proof from a real setup. and removed commands Command implementations labels May 5, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9e4c01b6b8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/models-config.ts Outdated
Comment on lines +459 to +462
// Populate readyCache so the next call with identical inputs
// takes the warm-cache path above without re-reading disk.
MODELS_JSON_STATE.readyCache.set(cacheKey, Promise.resolve({ fingerprint, result }));
return result;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Scope ready-cache entries to short-circuit mode

When a targetProvider short-circuit succeeds, this stores a global readyCache entry keyed only by fingerprint, even though only one provider was validated. A later ensureOpenClawModelsJson(...) call without targetProvider can hit that cache and skip full planning, leaving other providers stale on disk. This is reproducible whenever one caller does targeted prewarm and another later expects full reconciliation in the same process.

Useful? React with 👍 / 👎.

Comment thread src/agents/models-config.ts Outdated
Comment on lines +367 to +368
} else if (typeof diskProvider.apiKey === "string" && diskProvider.apiKey.length > 0) {
return false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Treat non-string disk apiKey as mismatch

In the resolvedConfiguredApiKey === undefined branch, only non-empty string diskProvider.apiKey values are rejected. If models.json has a non-string apiKey (for example tampered object/number) while config omits apiKey, this function still returns a structural match and short-circuits planning. That preserves malformed or attacker-modified provider state instead of forcing a rewrite.

Useful? React with 👍 / 👎.

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
…rget-provider-short-circuit

# Conflicts:
#	CHANGELOG.md
#	docs/.generated/plugin-sdk-api-baseline.sha256
#	src/agents/models-config.ts
@openclaw-barnacle openclaw-barnacle Bot removed docs Improvements or additions to documentation channel: line Channel integration: line labels May 6, 2026
@chatgpt-codex-connector
Copy link
Copy Markdown

💡 Codex Review

} else if (
diskProvider.apiKey !== undefined &&
!(typeof diskProvider.apiKey === "string" && diskProvider.apiKey.length === 0)
) {

P2 Badge Accept implicit env apiKey markers in short-circuit compare

When configuredProvider.apiKey is unset, this branch rejects any non-empty diskProvider.apiKey, but planOpenClawModelsJson can legitimately persist env-derived markers even with no configured key (via resolveMissingProviderApiKey in src/agents/models-config.providers.normalize.ts). In that common setup (provider has models, auth comes from OPENAI_API_KEY/profile), targetProvider short-circuit will always miss after restart and re-run full implicit discovery every time, negating the new perf path despite disk/config being semantically aligned.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
…g short-circuit (review feedback)

Codex P2 round-6 on PR openclaw#73261: 'Accept implicit env apiKey markers in
short-circuit compare'

Paraphrased problem
-------------------
When configuredProvider.apiKey is unset, the round-5 unset-config
branch rejected ANY non-empty diskProvider.apiKey, but
planOpenClawModelsJson legitimately persists env-derived markers
even with no configured key (via resolveMissingProviderApiKey in
src/agents/models-config.providers.normalize.ts -> secret-helpers.ts).
In the dominant implicit-discovery setup (provider has models, auth
comes from OPENAI_API_KEY env or AWS SDK env chain), the planner
writes apiKey: "OPENAI_API_KEY" (the env-var name marker) to disk
on the first pass. With round-5's strict rejection, every subsequent
restart would miss the targetProvider short-circuit and re-run full
implicit discovery, negating the new perf path despite disk and
config being semantically aligned.

Solution
--------
Extend the unset-config branch to also accept disk apiKey values
that match what resolveMissingProviderApiKey would have persisted.
The decision tree mirrors the planner:

- providerAuth === "aws-sdk": call resolveAwsSdkApiKeyVarName(env);
  accept iff disk equals that env-var name AND env[awsEnvVar] is
  populated (liveness check matches the existing
  resolveConfiguredApiKeyForCompare contract for env-source refs).

- otherwise: call resolveEnvApiKeyVarName(providerKey, env); accept
  iff disk equals that env-var name AND env[envVarName] is
  populated.

- profile-plaintext / non-default providerApiKeyResolver paths are
  NOT covered by this round (we don't have profile or resolver in
  scope at the comparator). Bounded perf cost: one full plan per
  restart that produces an unchanged disk shape; can be extended
  symmetrically in a later round if profile context plumbing makes
  it cheap.

Anything else (number, null, object, array, string not derivable
from the planner's env/aws-sdk paths, name pointing at an
unrelated env var, env var no longer set) still falls through to
full planning.

Result
------
- src/agents/models-config.ts:
  + new helper diskApiKeyMatchesUnsetConfigPlannerOutput documents
    the planner-output decision tree and pins the env / aws-sdk
    branches with explicit liveness checks.
  + unset-config branch in readExistingProviderMatchesConfig now
    consults the helper before rejecting.
  + new imports for resolveAwsSdkApiKeyVarName /
    resolveEnvApiKeyVarName from
    models-config.providers.secret-helpers.

- src/agents/models-config.target-provider-short-circuit.test.ts:
  three new regression tests pin the contract:
  + hit-on-env-marker-with-unset-config: short-circuit accepts disk
    apiKey: 'OPENAI_API_KEY' when config has no apiKey and env
    var is set.
  + miss-on-env-marker-with-unset-env: same setup but with the env
    var deleted between passes; short-circuit rejects (liveness).
  + miss-on-env-marker-name-mismatch: disk apiKey points at an
    unrelated env var that doesn't map to the openai provider;
    short-circuit rejects.
  Pre-existing miss-on-malformed-disk-apiKey kept and still passing.

Validation gate
---------------
- pnpm tsgo:core: clean
- pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts:
  40/40 passed (37 prior + 3 new)
- pnpm oxlint <touched files>: 0 warnings, 0 errors
- pnpm plugin-sdk:api:check: OK
- git diff --check: clean
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Round 6 — addressing Codex P2 review feedback

Codex finding on 8f1c6dc4ab: [P2] Accept implicit env apiKey markers in short-circuit compare.

Paraphrased problem: When configuredProvider.apiKey is unset, the round-5 unset-config branch rejected ANY non-empty diskProvider.apiKey, but planOpenClawModelsJson legitimately persists env-derived markers even with no configured key via resolveMissingProviderApiKey (src/agents/models-config.providers.secret-helpers.ts). In the dominant implicit-discovery setup (provider has models, auth comes from OPENAI_API_KEY env or AWS SDK env chain), the planner writes apiKey: "OPENAI_API_KEY" (the env-var name marker) to disk on the first pass. With round-5's strict rejection, every restart missed the targetProvider short-circuit and re-ran full implicit discovery — negating the perf path even though disk and config were semantically aligned.

Solution approach: Extend the unset-config branch to also accept disk apiKey values that match what resolveMissingProviderApiKey would have persisted. The decision tree mirrors the planner:

  • providerAuth === "aws-sdk" → call resolveAwsSdkApiKeyVarName(env); accept iff disk equals that env-var name AND env[awsEnvVar] is populated (liveness check matches the existing resolveConfiguredApiKeyForCompare contract for env-source refs).
  • otherwise → call resolveEnvApiKeyVarName(providerKey, env); accept iff disk equals that env-var name AND env[envVarName] is populated.
  • profile-plaintext / non-default providerApiKeyResolver paths are explicitly not covered by this round (we don't have profile or resolver in scope at the comparator). Bounded perf cost: one full plan per restart that produces an unchanged disk shape; can be extended symmetrically in a later round if profile context plumbing makes it cheap.

Anything else (number, null, object, array, string not derivable from the planner's env/aws-sdk paths, name pointing at an unrelated env var, env var no longer set) still falls through to full planning.

Result: commit 32200a3170 on perf/models-config-target-provider-short-circuit.

  • src/agents/models-config.ts:
    • new helper diskApiKeyMatchesUnsetConfigPlannerOutput documents the planner-output decision tree and pins the env / aws-sdk branches with explicit liveness checks.
    • unset-config branch in readExistingProviderMatchesConfig now consults the helper before rejecting.
    • new imports for resolveAwsSdkApiKeyVarName / resolveEnvApiKeyVarName from models-config.providers.secret-helpers.
  • src/agents/models-config.target-provider-short-circuit.test.ts: three new regression tests pin the contract:
    • hit-on-env-marker-with-unset-config: short-circuit accepts disk apiKey: "OPENAI_API_KEY" when config has no apiKey and env var is set.
    • miss-on-env-marker-with-unset-env: same setup but with the env var deleted between passes; short-circuit rejects (liveness).
    • miss-on-env-marker-name-mismatch: disk apiKey points at an unrelated env var that doesn't map to the openai provider; short-circuit rejects.
    • Pre-existing miss-on-malformed-disk-apiKey retained and still passing.

Validation gate (last lines):

pnpm tsgo:core                                  -> clean
pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts
                                                -> Test Files: 2 passed (2) | Tests: 40 passed (40)
pnpm oxlint <touched files>                     -> Found 0 warnings and 0 errors.
pnpm plugin-sdk:api:check                       -> OK docs/.generated/plugin-sdk-api-baseline.sha256
git diff --check                                -> clean

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32200a3170

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +897 to +903
if (
!stableEqualBounded(
configuredProvider.headers,
diskProvider.headers,
SHORT_CIRCUIT_COMPARE_MAX_DEPTH,
)
) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize secret-ref headers before short-circuit compare

The target-provider fast path compares configuredProvider.headers directly against disk, but planner output normalizes secret-ref headers (e.g. ${ENV} / SecretRef objects) into marker strings via normalizeHeaderValues and source-managed enforcement. As a result, any provider configured with header SecretRefs will always fail this comparison and re-run full planning on every call, which defeats the intended targetProvider performance win for header-auth providers even when disk and config are semantically equivalent.

Useful? React with 👍 / 👎.

…ls (Clawsweeper P2 round-6 on openclaw#73261)

Both pi-embedded production callers (compact.ts and run.ts) were passing
only { workspaceDir } to ensureOpenClawModelsJson, so the targetProvider
short-circuit (options.targetProvider being undefined) never fired in
production embedded runs, negating the perf benefit of the entire PR.

Fix: forward the resolved `provider` as `targetProvider` in both call
sites so the scoped-cache short-circuit can skip a full plan when the
provider's disk config already matches.

Regression coverage: export ensureOpenClawModelsJsonMock from
compact.hooks.harness.ts; add two focused tests that pin the
targetProvider wiring for the compact direct path (default openai
provider and caller-supplied provider override).

tsgo:core, tsgo:core:test → clean
git diff --check → clean
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Round 7 — addressing Clawsweeper P2 (thread targetProvider into embedded callers)

Commit: 5fde6ac536

[P2] Pass target hints from embedded callers

Problem (paraphrased): Both compactEmbeddedPiSessionDirect (compact.ts) and the fallback model-resolution path in runEmbeddedPiSession (run.ts) called ensureOpenClawModelsJson with only { workspaceDir }. Because options.targetProvider was undefined, the if (targetProvider) guard at the start of the short-circuit path was never entered for these production callers, meaning the entire targetProvider short-circuit was dead code in practice — every embedded run still executed a full plan.

Fix: Forward the resolved provider string (already computed before the call in both files) as targetProvider in the options object. No new imports or data fetching needed — provider was already in scope.

compact.ts (compactEmbeddedPiSessionDirect):

await ensureOpenClawModelsJson(params.config, agentDir, {
  workspaceDir: resolvedWorkspace,
  // Thread the resolved provider so the targetProvider short-circuit can skip
  // a full plan when this provider's disk config already matches.
  targetProvider: provider,
});

run.ts (fallback path in runEmbeddedPiSession):

await ensureOpenClawModelsJson(params.config, agentDir, {
  workspaceDir: resolvedWorkspace,
  // Thread the resolved provider so the targetProvider short-circuit can skip
  // a full plan when this provider's disk config already matches.
  targetProvider: provider,
});

Regression coverage: Exported ensureOpenClawModelsJsonMock from compact.hooks.harness.ts (previously created inline in vi.doMock and not accessible for assertion). Added two focused tests in compact.hooks.test.ts:

  • threads targetProvider: provider into ensureOpenClawModelsJson so the short-circuit can fire — asserts ensureOpenClawModelsJsonMock is called with expect.objectContaining({ targetProvider: "openai" }) when the default provider resolves
  • threads the caller-supplied provider, not only the default, when a provider param is passed — passes provider: "anthropic" in args; asserts targetProvider: "anthropic" is forwarded

Validation gate:

pnpm tsgo:core      → clean (0 errors)
pnpm tsgo:core:test → clean (0 errors)
git diff --check    → clean

Note: the 2 no-underscore-dangle oxlint findings on __testing in compact.ts and compact.hooks.test.ts are pre-existing (present before this PR's first commit) and are not introduced by this change.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

@clawsweeper please re-review. Round 7 commit 5fde6ac536 addresses the [P2] Pass target hints from embedded callers finding: both pi-embedded production callers in compact.ts and run.ts now forward targetProvider: provider to ensureOpenClawModelsJson, so the scoped-cache short-circuit can actually fire in production runs. Regression coverage added via exported ensureOpenClawModelsJsonMock in compact.hooks.harness.ts and two focused caller-wiring tests in compact.hooks.test.ts.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 6, 2026
…t compare (review feedback)

Codex P2 round-7 on PR openclaw#73261: 'Normalize secret-ref headers before
short-circuit compare'

Paraphrased problem
-------------------
The targetProvider fast path compared configuredProvider.headers
directly against disk, but planner output normalizes secret-ref
headers (env refs like \${ENV_VAR} and SecretRef objects) into
marker strings via normalizeHeaderValues:
  - env refs   -> 'secretref-env:<ENV_VAR_NAME>'
  - non-env    -> 'secretref-managed'

As a result, any provider configured with header SecretRefs (the
common shape for header-auth providers) would always fail the
deep compare and re-run full implicit discovery on every call,
defeating the targetProvider perf win for that entire class of
providers.

Solution
--------
Pre-normalize configuredProvider.headers using the same
normalizeHeaderValues call the planner uses, threading
cfg.secrets?.defaults through readExistingProviderMatchesConfig so
the normalization produces identical markers to what the planner
persisted.

normalizeHeaderValues is idempotent for plain literal values
(returns headers unchanged when no entry resolves as a secret ref),
so this is a safe no-op for the common no-header / literal-header
case.  Anything else (header drift, attacker-injected headers,
markers pointing at unrelated env vars) still falls through to
full planning via the existing stableEqualBounded compare.

Out of scope (deliberately): source-managed enforcement
(enforceSourceManagedProviderSecrets) layers additional marker
rewrites when sourceProviders is set.  The short-circuit doesn't
have sourceProviders in scope today, and the perf cost of falling
through for source-managed providers is bounded.  Can be extended
in a later round if profile/source plumbing makes it cheap.

Result
------
- src/agents/models-config.ts:
  + readExistingProviderMatchesConfig now takes
    secretDefaults: SecretDefaults | undefined.
  + Before the headers compare, configuredProvider.headers (when
    record-shaped or undefined) is run through normalizeHeaderValues
    to produce the same marker strings the planner persists.
  + Call site in ensureOpenClawModelsJson passes
    cfg.secrets?.defaults.
  + New imports for normalizeHeaderValues and the SecretDefaults
    type from models-config.providers.secret-helpers.

- src/agents/models-config.target-provider-short-circuit.test.ts:
  two new regression tests:
  + hit-on-env-ref-header-normalization: short-circuit accepts disk
    headers shaped { 'X-Custom-Auth': 'secretref-env:OPENAI_API_KEY',
    'X-Static': 'literal-value' } when config holds
    { 'X-Custom-Auth': '\${OPENAI_API_KEY}', 'X-Static': 'literal-value' }.
  + miss-on-env-ref-header-mismatch: same env-ref shape in config,
    but disk marker points at UNRELATED_TOKEN \u2014 short-circuit
    rejects (deep compare catches the env-name mismatch).

Validation gate
---------------
- pnpm tsgo:core: clean
- pnpm tsgo:core:test: clean
- pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts:
  44/44 passed (40 prior + 2 new round-6 + 2 new round-7)
- pnpm oxlint <touched files>: 0 warnings, 0 errors
- pnpm plugin-sdk:api:check: OK
- git diff --check: clean
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Round 7 — addressing Codex P2 review feedback

Codex finding on 32200a3170 (round-6 commit), inline at src/agents/models-config.ts:903: [P2] Normalize secret-ref headers before short-circuit compare.

Paraphrased problem: The targetProvider fast path compared configuredProvider.headers directly against disk, but planner output normalizes secret-ref headers (env refs like ${ENV_VAR} and SecretRef objects) into marker strings via normalizeHeaderValues:

  • env refs → secretref-env:<ENV_VAR_NAME>
  • non-env → secretref-managed

As a result, any provider configured with header SecretRefs (the common shape for header-auth providers) would always fail the deep compare and re-run full implicit discovery on every call, defeating the targetProvider perf win for that entire class of providers.

Solution approach: Pre-normalize configuredProvider.headers using the same normalizeHeaderValues call the planner uses, threading cfg.secrets?.defaults through readExistingProviderMatchesConfig so the normalization produces identical markers to what the planner persisted. normalizeHeaderValues is idempotent for plain literal values (returns headers unchanged when no entry resolves as a secret ref), so this is a safe no-op for the common no-header / literal-header case. Anything else (header drift, attacker-injected headers, markers pointing at unrelated env vars) still falls through to full planning via the existing stableEqualBounded compare.

Out of scope (deliberately): source-managed enforcement (enforceSourceManagedProviderSecrets) layers additional marker rewrites when sourceProviders is set. The short-circuit doesn't have sourceProviders in scope today, and the perf cost of falling through for source-managed providers is bounded — can be extended in a later round if profile/source plumbing makes it cheap.

Result: commit 83b72886f9 on perf/models-config-target-provider-short-circuit.

  • src/agents/models-config.ts:

    • readExistingProviderMatchesConfig now takes secretDefaults: SecretDefaults | undefined.
    • Before the headers compare, configuredProvider.headers (when record-shaped or undefined) is run through normalizeHeaderValues to produce the same marker strings the planner persists.
    • Call site in ensureOpenClawModelsJson passes cfg.secrets?.defaults.
    • New imports for normalizeHeaderValues and the SecretDefaults type from models-config.providers.secret-helpers.
  • src/agents/models-config.target-provider-short-circuit.test.ts — two new regression tests:

    • hit-on-env-ref-header-normalization: short-circuit accepts disk headers shaped { "X-Custom-Auth": "secretref-env:OPENAI_API_KEY", "X-Static": "literal-value" } when config holds { "X-Custom-Auth": "${OPENAI_API_KEY}", "X-Static": "literal-value" }.
    • miss-on-env-ref-header-mismatch: same env-ref shape in config, but disk marker points at UNRELATED_TOKEN — short-circuit rejects (deep compare catches the env-name mismatch).

Validation gate (last lines):

pnpm tsgo:core                                  -> clean
pnpm tsgo:core:test                             -> clean
pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts
                                                -> Test Files: 2 passed (2) | Tests: 44 passed (44)
pnpm oxlint <touched files>                     -> Found 0 warnings and 0 errors.
pnpm plugin-sdk:api:check                       -> OK docs/.generated/plugin-sdk-api-baseline.sha256
git diff --check                                -> clean

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 83b72886f9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/agents/models-config.ts Outdated
Comment on lines +951 to +958
if (Array.isArray(diskProvider.models)) {
for (const m of diskProvider.models) {
if (!isRecord(m)) {
return false;
}
for (const f of PER_MODEL_TRANSPORT_FIELDS) {
if (Object.hasOwn(m, f)) {
return false;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Compare configured models before short-circuiting provider hit

This branch only checks whether disk models contain transport override fields and never validates configuredProvider.models against diskProvider.models, so a config edit that changes model IDs/entries (but not apiKey/baseUrl/api/headers/auth) still returns a short-circuit match. With this commit wiring targetProvider into the embedded runner paths, those calls can now skip planOpenClawModelsJson after model-list changes, leaving stale models.json and causing resolveModelAsync to miss newly configured models until some later full reconcile happens.

Useful? React with 👍 / 👎.

…re short-circuit (review feedback)

Codex P1 round-8 on PR openclaw#73261: 'Compare configured models before
short-circuiting provider hit'

Paraphrased problem
-------------------
The prior per-model loop in readExistingProviderMatchesConfig only
validated that disk-side models carried no transport-override
fields; it never compared configuredProvider.models against
diskProvider.models.  Result: a config edit that adds a new model
id (without touching apiKey / baseUrl / api / headers / auth)
hits the short-circuit, leaves models.json stale, and
resolveModelAsync misses the newly configured model until some
later full reconcile happens.  With the targetProvider path now
wired into the embedded runner, that staleness window is reachable
from gateway hot paths and is a real correctness bug, not just a
perf miss.

Solution
--------
Add a model-id subset check after the existing transport-only loop:

- collectShortCircuitModelIds normalizes a provider's models field
  into a Set<string> of ids.  Accepts the two runtime shapes:
  bare string entries ('gpt-5') and record entries with a string
  id field ({ id: 'gpt-5', ... }).  Returns null on adversarial /
  malformed shapes (non-array with non-undefined value, non-record
  non-string entries, missing id, prototype-key collisions) so
  the comparator can fail closed.

- When configuredProvider.models has any entries, every configured
  id MUST appear in diskProvider.models.  Disk may legitimately
  contain MORE models than config (implicit discovery /
  plugin-contributed entries), so subset check, not strict
  equality.

- Implicit-only mode (configured models is empty array OR omitted)
  legitimately defers to the planner's discovery output.  We skip
  the subset check in that case and rely on the transport-only
  comparisons \u2014 preserving the perf path for the dominant
  implicit-mode setup.

- Adversarial shapes (configuredIds === null OR diskIds === null)
  fail closed: refuse short-circuit, fall through to full planning.

Result
------
- src/agents/models-config.ts:
  + new helper collectShortCircuitModelIds documents the parser
    contract: returns null for fail-closed shapes, empty Set for
    legitimate implicit-only mode, populated Set otherwise.
  + readExistingProviderMatchesConfig adds the subset check after
    the existing per-model transport loop.

- src/agents/models-config.target-provider-short-circuit.test.ts:
  four new round-8 regression tests:
  + miss-on-newly-configured-model: config adds { id: 'gpt-6-newly-configured' };
    short-circuit rejects, full plan re-runs.
  + miss-on-config-only-bare-string-model-add: same shape but with
    bare-string entries; subset check still fires.
  + hit-on-implicit-only-models-list: empty configured.models with
    matching transport still short-circuits (perf path preserved
    for implicit-discovery mode).
  + miss-on-malformed-configured-model-entry: configured.models has
    an entry { id: 1234 }; collector returns null; short-circuit
    refuses (fail-closed for adversarial shape).

Validation gate
---------------
- pnpm tsgo:core: clean
- pnpm tsgo:core:test: clean
- pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts:
  52/52 passed (44 prior + 4 new)
- pnpm oxlint <touched files>: 0 warnings, 0 errors
- pnpm plugin-sdk:api:check: OK
- git diff --check: clean
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Round 8 — addressing Codex P1 review feedback

Codex finding on 83b72886f9 (round-7 commit), inline at src/agents/models-config.ts:958: [P1] Compare configured models before short-circuiting provider hit.

Paraphrased problem: The prior per-model loop in readExistingProviderMatchesConfig only validated that disk-side models carried no transport-override fields; it never compared configuredProvider.models against diskProvider.models. Result: a config edit that adds a new model id (without touching apiKey / baseUrl / api / headers / auth) hits the short-circuit, leaves models.json stale, and resolveModelAsync misses the newly configured model until some later full reconcile happens. With the targetProvider path now wired into the embedded runner, that staleness window is reachable from gateway hot paths and is a real correctness bug, not just a perf miss.

Solution approach: Add a model-id subset check after the existing transport-only loop:

  • New helper collectShortCircuitModelIds normalizes a provider's models field into a Set<string> of ids. Accepts the two runtime shapes: bare string entries ("gpt-5") and record entries with a string id field ({ id: "gpt-5", ... }). Returns null on adversarial / malformed shapes (non-array with non-undefined value, non-record non-string entries, missing id, prototype-key collisions) so the comparator can fail closed.
  • When configuredProvider.models has any entries, every configured id MUST appear in diskProvider.models. Disk may legitimately contain MORE models than config (implicit discovery / plugin-contributed entries), so subset check, not strict equality.
  • Implicit-only mode (configured models is empty array OR omitted) legitimately defers to the planner's discovery output. We skip the subset check in that case and rely on the transport-only comparisons — preserving the perf path for the dominant implicit-mode setup.
  • Adversarial shapes (configuredIds === null OR diskIds === null) fail closed: refuse short-circuit, fall through to full planning.

Result: commit fed465f466 on perf/models-config-target-provider-short-circuit.

  • src/agents/models-config.ts:

    • new helper collectShortCircuitModelIds documents the parser contract: returns null for fail-closed shapes, empty Set for legitimate implicit-only mode, populated Set otherwise.
    • readExistingProviderMatchesConfig adds the subset check after the existing per-model transport loop.
  • src/agents/models-config.target-provider-short-circuit.test.ts — four new round-8 regression tests:

    • miss-on-newly-configured-model: config adds { id: "gpt-6-newly-configured" }; short-circuit rejects, full plan re-runs.
    • miss-on-config-only-bare-string-model-add: same shape but with bare-string entries; subset check still fires.
    • hit-on-implicit-only-models-list: empty configured.models with matching transport still short-circuits (perf path preserved for implicit-discovery mode).
    • miss-on-malformed-configured-model-entry: configured.models has an entry { id: 1234 }; collector returns null; short-circuit refuses (fail-closed for adversarial shape).

Validation gate (last lines):

pnpm tsgo:core                                  -> clean
pnpm tsgo:core:test                             -> clean
pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts
                                                -> Test Files: 2 passed (2) | Tests: 52 passed (52)
pnpm oxlint <touched files>                     -> Found 0 warnings and 0 errors.
pnpm plugin-sdk:api:check                       -> OK docs/.generated/plugin-sdk-api-baseline.sha256
git diff --check                                -> clean

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fed465f466

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1280 to +1282
const configuredProvider = Object.hasOwn(explicitProviders, targetProvider)
? explicitProviders[targetProvider]
: undefined;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Normalize provider defaults before short-circuit comparison

Use the normalized provider shape (the same shape planOpenClawModelsJson writes) before calling readExistingProviderMatchesConfig; this currently compares disk against raw cfg.models.providers[targetProvider], so providers that inject defaults during normalization (for example extensions/ollama/provider-policy-api.ts sets a default baseUrl) will always miss the fast path even when disk is already correct. In those configs, every targeted call falls back to full planning and the new performance path is effectively disabled.

Useful? React with 👍 / 👎.

Comment thread src/agents/models-config.ts Outdated
Comment on lines +996 to +997
if (configuredIds.size > 0) {
const diskIds = collectShortCircuitModelIds(diskProvider.models);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Check requested model on implicit-only short-circuit hits

When configuredProvider.models is empty/omitted, this branch skips all model-id validation and can still return a short-circuit match, even if models.json does not contain the model the caller is about to resolve. After this commit, embedded runner callers pass targetProvider on cold start, so stale disk catalogs in implicit-discovery setups can bypass resolveImplicitProviders, then resolveModelAsync fails with Unknown model until another path forces a full reconcile.

Useful? React with 👍 / 👎.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

@clawsweeper please re-review. Round 8 follow-up commit fed465f466 addresses the [P1] Compare configured models before short-circuiting provider hit finding: readExistingProviderMatchesConfig now performs a model-id subset check after the existing transport-only loop (configured ids must be a subset of disk ids), with a new collectShortCircuitModelIds helper that fails closed on adversarial shapes and skips the check for legitimate implicit-only mode (empty/omitted configured.models) to preserve the discovery-path perf. Four new round-8 regression tests pin: newly-configured model add (record + bare-string shapes), implicit-only hit preservation, and adversarial-shape fail-closed. Full validation gate clean.

…ly short-circuit on targetModel (review feedback)

What
----
Round-9 review feedback on PR openclaw#73261 addressing two Codex P2 findings on
the round-8 commit `fed465f466`:

1. [P2] Normalize provider defaults before short-circuit comparison
   (`src/agents/models-config.ts:1282`). The structural compare ran
   `cfg.models.providers[targetProvider]` AS AUTHORED against
   `models.json` AS NORMALIZED — but `planOpenClawModelsJson` writes
   the normalized shape, so any provider whose policy hook injects
   defaults (e.g. `extensions/ollama/provider-policy-api.ts` defaults
   `baseUrl` to `http://localhost:11434`) would always miss the
   short-circuit even when disk was already correct. Every targeted
   call would fall through to a full plan, disabling the perf path
   for that provider.

2. [P2] Check requested model on implicit-only short-circuit hits
   (`src/agents/models-config.ts:997`). When
   `configuredProvider.models` is `[]` / omitted (implicit-discovery
   mode), the prior contract skipped ALL model-id validation and
   could bless a stale disk catalog. With this round wiring
   `targetProvider` into the embedded runner cold paths, a stale
   `models.json` would short-circuit and then `resolveModelAsync`
   would fail with `Unknown model: <provider>/<model>` until another
   path forced a full reconcile.

Why
---
1. Apply the same `normalizeProviderSpecificConfig(targetProvider, cfg)`
   pre-pass that `normalizeProviders` runs before plan writes. The hook
   is a no-op for providers without an opinion, so existing tests stay
   green; ollama and any future bundled provider with a normalize-config
   policy now hit the perf path on the second pass.

2. Thread the resolved model id through the embedded runner via the
   already-defined `EnsureOpenClawModelsJsonOptions.targetModel` (was
   reserved). In implicit-only mode the structural compare now also
   checks `targetModelId` is present on disk; if it isn't, refuse the
   short-circuit and let the planner rediscover. Explicit-mode
   contract (subset check) is unchanged. Folded `targetModelId` into
   the scoped readyCache key so a hit cached for model X cannot be
   reused as a "match" for model Y under the same
   (provider, fingerprint, models.json content) tuple when Y isn't
   on disk.

How
---
- `readExistingProviderMatchesConfig`:
  - Apply `normalizeProviderSpecificConfig` to the configured provider
    after the `isRecord` guard, then compare normalized vs disk.
  - Accept `targetModelId?: string` and, in the implicit-only branch
    (configured `models` empty), require `targetModelId` to be on
    disk; lazily compute `diskIds` so adversarial disk model rows
    still fail closed in both branches without parsing twice.
- `modelsJsonScopedShortCircuitCacheKey`: append `\\0m:<id>` when a
  `targetModelId` is provided so per-model cache scope follows the
  per-model contract above. Keys without `targetModel` keep the prior
  shape.
- `pi-embedded-runner/run.ts` and `pi-embedded-runner/compact.ts`:
  forward the resolved `modelId` as `targetModel` alongside
  `targetProvider`.
- `EnsureOpenClawModelsJsonOptions.targetModel` JSDoc updated from
  "reserved" to the implicit-only contract.

Tests
-----
4 new cases in
`src/agents/models-config.target-provider-short-circuit.test.ts`,
pinning the regressions:

- `hit-on-policy-injected-default-baseUrl` — ollama config without
  `baseUrl` short-circuits on second pass because policy normalizes
  both sides to the default. Without the round-9 fix this falls
  through.
- `miss-on-implicit-only-target-model-not-on-disk` — implicit mode
  with `targetModel` missing from disk now misses the short-circuit
  and triggers a re-plan.
- `hit-on-implicit-only-target-model-on-disk` — symmetric companion;
  short-circuit still hits when the model is on disk.
- `scoped-cache-per-target-model` — model X cached hit does not
  satisfy a follow-up call for model Y under the same fingerprint
  and models.json content when Y is not on disk.

Validation gate
---------------
- `pnpm tsgo:core` clean
- `pnpm tsgo:core:test` clean
- `pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts`:
  60 passed (30 × 2 contexts)
- `pnpm vitest run src/agents/models-config.fingerprint-cache.test.ts src/agents/models-config.providers.policy.test.ts`:
  24 passed
- `pnpm vitest run src/infra/changelog-unreleased.test.ts`: 3 passed
- `pnpm config:schema:check`: ok
- `pnpm plugin-sdk:api:check`: ok
- `pnpm oxlint <touched files>`: 0 new errors (the pre-existing
  `__testing` underscore-dangle finding in
  `pi-embedded-runner/compact.ts:1425` is upstream and unrelated)
- `git diff --check`: clean

Beads: openclaw-8zp
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Round 9 — addressing Codex P2 review feedback

Codex review on round-8 commit fed465f466 flagged two [P2] findings, both inline at production paths newly reachable from the embedded runner cold start now that round-7 wires targetProvider into compact.ts and run.ts. Both addressed in follow-up commit 721a5245a5 (not amended, not force-pushed).

[P2] Normalize provider defaults before short-circuit comparison — src/agents/models-config.ts:1282

Problem (paraphrased): The structural compare in ensureOpenClawModelsJson calls readExistingProviderMatchesConfig with cfg.models.providers[targetProvider] AS AUTHORED, but planOpenClawModelsJson writes the NORMALIZED shape to disk. Providers whose policy hook injects defaults during normalizeProviders (e.g. extensions/ollama/provider-policy-api.ts defaults baseUrl to http://localhost:11434 when omitted) would always miss the short-circuit even when disk was already correct, falling through to a full plan on every targeted call. Net effect: the perf path is silently disabled for that provider.

Solution: Apply the same normalizeProviderSpecificConfig(targetProvider, configuredProvider) pre-pass that normalizeProviders runs before plan writes, then compare normalized config against disk. The hook is a no-op for providers without an opinion, so the existing test suite stays green; ollama and any future bundled provider with a policy normalizeConfig now hit the perf path on the second pass.

Result: Round-9 commit 721a5245a5. Pinned by new test hit-on-policy-injected-default-baseUrl: ollama config without baseUrl short-circuits on the second ensureOpenClawModelsJson call (first call: resolveImplicitProvidersCallCount === 1; second call: === 0). Reverting the round-9 normalize pre-pass locally and re-running the new test reproduces the regression: it stays at 1 instead of 0.

[P2] Check requested model on implicit-only short-circuit hits — src/agents/models-config.ts:997

Problem (paraphrased): When configuredProvider.models is [] / omitted (implicit-discovery mode), the round-8 subset check skipped ALL model-id validation and could bless a stale disk catalog. With this PR wiring targetProvider through the embedded runner cold paths, that staleness window is now reachable: implicit-discovery setups with stale models.json short-circuit, then resolveModelAsync(provider, modelId, ...) fails with Unknown model: <provider>/<modelId> until another path forces a full reconcile.

Solution: Three coordinated changes:

  1. Thread the resolved model id from the embedded runner via the already-defined-but-reserved EnsureOpenClawModelsJsonOptions.targetModel. pi-embedded-runner/run.ts and pi-embedded-runner/compact.ts now forward targetModel: modelId alongside targetProvider: provider.
  2. In readExistingProviderMatchesConfig, when configuredProvider.models is empty AND targetModelId is provided, require targetModelId to be present in the disk model-id set. Anything else (malformed disk models, missing id) falls through to full planning. Implicit callers that don't yet know which model they will resolve (targetModelId omitted) keep the prior provider-shape-only contract — they never had per-model assertions to begin with. The disk-id parse runs lazily so adversarial disk model rows still fail closed in both explicit and implicit branches without duplicating the parse.
  3. Fold targetModelId into the scoped readyCache key (modelsJsonScopedShortCircuitCacheKey) so a hit cached for model X cannot be reused as a "match" for model Y under the same (provider, fingerprint, models.json content) tuple when Y isn't on disk. Keys without targetModel keep the prior shape (no scope drift for callers that haven't adopted the hint yet).

Updated EnsureOpenClawModelsJsonOptions.targetModel JSDoc from "reserved for future refinements" to the implicit-only contract above so future maintainers don't reinvent the contract.

Result: Round-9 commit 721a5245a5. Pinned by three new tests:

  • miss-on-implicit-only-target-model-not-on-disk — implicit mode with targetModel missing from disk now misses the short-circuit and triggers a re-plan (resolveImplicitProvidersCallCount === 1 on the second pass).
  • hit-on-implicit-only-target-model-on-disk — symmetric companion; short-circuit still hits when the model is on disk (=== 0).
  • scoped-cache-per-target-model — pin for the cache-key isolation: with model X cached as a hit, a follow-up call for model Y (not on disk) does NOT satisfy via the cached entry; it forces a fresh structural check that refuses, then re-plans (=== 1). Without the cache-key fold this would silently hit at === 0.

Validation gate (round-9, run from perf/models-config-target-provider-short-circuit @ 721a5245a5)

  • pnpm tsgo:core — clean
  • pnpm tsgo:core:test — clean
  • pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts60 passed (30 × 2 contexts), including the four new cases above
  • pnpm vitest run src/agents/models-config.fingerprint-cache.test.ts src/agents/models-config.providers.policy.test.ts — 24 passed (no incidental regressions in the adjacent surfaces)
  • pnpm vitest run src/infra/changelog-unreleased.test.ts — 3 passed (CHANGELOG entry kept under ## Unreleased, extended in place to describe the round-9 normalize pre-pass and implicit-mode targetModel gating)
  • pnpm config:schema:check — ok
  • pnpm plugin-sdk:api:check — ok
  • pnpm oxlint on touched files — 0 new errors (the pre-existing __testing underscore-dangle finding in pi-embedded-runner/compact.ts:1425 is upstream and unrelated to this round)
  • git diff --check — clean

@clawsweeper please re-review.

…rt-circuit cache (review feedback)

Round-10 follow-up on Clawsweeper P2 (outstanding since 2026-04-30T22:05,
missed in rounds 6-9): "Cache only the validated models.json snapshot."

Pre-fix, `readExistingProviderMatchesConfig` would safe-read +
structurally validate the on-disk `models.json` once via
`safeReadFileOutcome`, then return a bare `boolean`.  The caller at
models-config.ts:1397-1401 would then issue a SECOND read via
`readModelsJsonContentOutcome(targetPath)` and store THAT outcome's
hash in the provider-scoped readyCache.  Between the two reads, an
attacker (or any external editor) replacing `models.json` could land
hash(swappedBytes) in the scoped cache as "the validated snapshot."  A
later targeted call hitting the cache would compare current disk
against hash(swappedBytes) and accept it as a match, blessing
attacker-controlled provider transport (api / baseUrl / headers
consumed per-call by `pi-embedded-runner/model.ts`).  CWE-345.

This commit collapses the two reads into one by threading the
validated `ContentHashOutcome` straight back from the structural
check.  Concretely:

- `readExistingProviderMatchesConfig` now returns a discriminated
  union: `{ matches: true; validatedModelsJsonOutcome }` on success,
  `{ matches: false }` on every failure path.  The validated outcome
  is derived directly from the `safeReadFileOutcome` the function
  already runs (line 838) — same bytes, same hash, no second disk
  trip.
- The caller uses `matchOutcome.validatedModelsJsonOutcome` for the
  scoped-cache `set` and drops the redundant
  `readModelsJsonContentOutcome(targetPath)` call.
- The `kind !== "uncacheable"` guard is preserved as
  belt-and-suspenders even though, on the success path, the validated
  outcome is provably `hashed` (uncacheable cases early-return
  `{ matches: false }` before we get here).

Three new regression tests pin the contract from independent angles:

- `cache-stores-validated-outcome` — happy path: cold + stable disk +
  reset-cache + targeted call must take the disk-based short-circuit
  exactly once and a follow-up targeted call must hit the warm scoped
  cache without re-running discovery.
- `toctou-swap-cannot-bless-unvalidated-content` — security: after a
  cold short-circuit populates the scoped cache, swap on-disk
  `models.json` to a config-disagreeing baseUrl; the next targeted
  call MUST detect drift, refuse the cache, fail the structural
  check, and fall through to a full plan (resolveImplicit fires).
- `no-cache-on-uncacheable-validated-outcome` — defense-in-depth:
  oversize `models.json` forces a full plan and the post-plan disk
  size returns under cap, confirming the uncacheable branch is
  reachable end-to-end without the redundant second read.

Adds a real-runtime proof script
`scripts/proof-73261-validated-outcome-cache.ts` that drives the
production `ensureOpenClawModelsJson` against a real on-disk
`models.json` in a temp dir, instruments `fs.promises.lstat` to count
reads per scenario, and self-checks that the round-10 contract holds:

  [1/3] cache-stores-validated-outcome
      lstat(models.json) during short-circuit: 1
      lstat(models.json) during scoped-cache hit + drift-check: 1
      ok
  [2/3] toctou-swap-cannot-bless-unvalidated-content
      disk baseUrl after cold plan: https://api.openai.com/v1
      swap was not silently blessed by stale scoped cache
      ok
  [3/3] no-cache-on-uncacheable-validated-outcome
      bloated models.json size: 2102064 bytes (> 1 MiB cap)
      post-plan models.json size: 9105 bytes
      ok

  All runtime assertions passed.

Validation gate (full output captured for the PR comment):
  pnpm tsgo:core                                                   ok
  pnpm tsgo:core:test                                              ok
  pnpm vitest run src/agents/models-config.target-provider-...     66/66
  pnpm vitest run src/agents/models-config.fingerprint-cache,
                  src/agents/models-config.write-serialization     32/32
  pnpm config:schema:check                                         ok
  pnpm plugin-sdk:api:check                                        ok
  pnpm oxlint <touched files>                                      ok
  git diff --check                                                 clean
  pnpm tsx scripts/proof-73261-validated-outcome-cache.ts          ok

Beads: openclaw-8zp
@openclaw-barnacle openclaw-barnacle Bot added the scripts Repository scripts label May 7, 2026
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

zeroaltitude commented May 7, 2026

Round 10 — Clawsweeper P2 (long-outstanding) addressed

Acknowledging up front: this finding has been outstanding since the 2026-04-30T22:05 review pass and was missed across rounds 6, 7, 8, and 9. Apologies for the delay — round-10 (d8dfa42082) is targeted directly at it.

[P2] Cache only the validated models.json snapshot — src/agents/models-config.ts:1397-1401

Paraphrased finding. readExistingProviderMatchesConfig validates bytes from one safe read of models.json, then the calling block does a SECOND readModelsJsonContentOutcome(targetPath) and stores THAT outcome as the scoped cache entry. If models.json is replaced between the two reads (TOCTOU), the scoped cache stores a hash of UNVALIDATED bytes — and later targeted calls hitting the scoped cache skip drift checks for the swapped-in content, blessing it as "the validated snapshot." Security-relevant because models.json drives provider routing (api / baseUrl / headers) per pi-embedded-runner/model.ts. Reviewer's prescribed solution: "The fast path should return the validated content outcome from readExistingProviderMatchesConfig, or compare any second read against the validated hash before caching or returning."

Solution narrative. Took the first option — threading the validated outcome back is the cleaner contract; the fallback option (compare second read vs validated hash) reintroduces the same race surface in a different shape.

  • Changed readExistingProviderMatchesConfig's return type from Promise<boolean> to a discriminated union: Promise<{ matches: true; validatedModelsJsonOutcome: ContentHashOutcome } | { matches: false }>.
  • The function already runs safeReadFileOutcome(targetPath, MAX_MODELS_JSON_BYTES) at the top of its body to validate disk content; the success branch now constructs the ContentHashOutcome straight from those bytes ({ kind: "hashed", hash: safe.hash }) and returns it alongside matches: true.
  • Caller at the (formerly) line 1397-1401 block now uses matchOutcome.validatedModelsJsonOutcome directly and drops the redundant second readModelsJsonContentOutcome(targetPath) call entirely.
  • Preserved the kind !== "uncacheable" guard around the cache set even though, on the success path, the validated outcome is provably hashed (every uncacheable case early-returns { matches: false } before we reach the cache populate). Belt-and-suspenders against future shape drift in ContentHashOutcome.
  • Only one caller of readExistingProviderMatchesConfig; no other call sites needed updating.

Result.

Commit: d8dfa42082.

Three new regression tests in src/agents/models-config.target-provider-short-circuit.test.ts pin the contract from independent angles:

  • cache-stores-validated-outcome — cold + stable disk + reset-cache + targeted call takes the disk-based short-circuit; a follow-up targeted call hits the warm scoped cache without re-running discovery.
  • toctou-swap-cannot-bless-unvalidated-content — the security regression: after a cold short-circuit populates the scoped cache, swap on-disk models.json to a config-disagreeing baseUrl; the next targeted call MUST detect drift, refuse the cache, fail the structural check against the swapped baseUrl, and fall through to a full plan (resolveImplicit fires).
  • no-cache-on-uncacheable-validated-outcome — oversize models.json forces a full plan and the post-plan disk returns under the cap, confirming the uncacheable branch is reachable end-to-end without the redundant second read.

All 33 short-circuit tests × 2 vitest projects (66 total) pass.

Validation gate

$ pnpm tsgo:core
> openclaw@2026.5.6 tsgo:core ...
> node scripts/run-tsgo.mjs -p tsconfig.core.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core.tsbuildinfo
ok

$ pnpm tsgo:core:test
> openclaw@2026.5.6 tsgo:core:test ...
> node scripts/run-tsgo.mjs -p test/tsconfig/tsconfig.core.test.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/core-test.tsbuildinfo
ok

$ pnpm vitest run src/agents/models-config.target-provider-short-circuit.test.ts
 ✓  agents-core      ../../src/agents/models-config.target-provider-short-circuit.test.ts (33 tests) 2004ms
 ✓  agents-support   ../../src/agents/models-config.target-provider-short-circuit.test.ts (33 tests) 1674ms
 Test Files  2 passed (2)
      Tests  66 passed (66)

$ pnpm vitest run src/agents/models-config.fingerprint-cache.test.ts src/agents/models-config.write-serialization.test.ts
 Test Files  4 passed (4)
      Tests  32 passed (32)

$ pnpm config:schema:check
[base-config-schema] ok

$ pnpm plugin-sdk:api:check
OK docs/.generated/plugin-sdk-api-baseline.sha256

$ pnpm oxlint src/agents/models-config.ts src/agents/models-config.target-provider-short-circuit.test.ts scripts/proof-73261-validated-outcome-cache.ts
Found 0 warnings and 0 errors.
Finished in 35ms on 3 files with 185 rules using 16 threads.

$ git diff --check
(clean)

After-fix real behavior proof

Refreshed because the prior real-behavior proof (commit 2f89c7b99a) is now stale across rounds 6/7/8/9. The new proof script scripts/proof-73261-validated-outcome-cache.ts drives the production ensureOpenClawModelsJson against a real on-disk models.json in a temp dir — no vitest, no vi.mock of the seam under test, real safeReadFileOutcome, real readExistingProviderMatchesConfig, real scoped readyCache populate / drift-detect cycle. It instruments fs.promises.lstat to count reads per scenario and self-checks the round-10 contract:

$ pnpm tsx scripts/proof-73261-validated-outcome-cache.ts
[1/3] cache-stores-validated-outcome
Config warnings:
- plugins.entries.active-memory: plugin disabled (disabled in config) but config is present
    lstat(models.json) during short-circuit: 1
    lstat(models.json) during scoped-cache hit + drift-check: 1
    ok
[2/3] toctou-swap-cannot-bless-unvalidated-content
    disk baseUrl after cold plan: https://api.openai.com/v1
    swap was not silently blessed by stale scoped cache
    ok
[3/3] no-cache-on-uncacheable-validated-outcome
    bloated models.json size: 2102064 bytes (> 1 MiB cap)
    post-plan models.json size: 9105 bytes
    ok

All runtime assertions passed.

The lstat(models.json) during short-circuit: 1 line is the direct empirical proof that the redundant second read is gone — pre-round-10, that count was 2 (one in safeReadFileOutcome inside readExistingProviderMatchesConfig, one in the now-removed post-validation readModelsJsonContentOutcome).

@clawsweeper please re-review.

Re-review progress:

@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. scripts Repository scripts size: XL

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant