
perf(agents): stabilize models.json cache fingerprint and add targetProvider short-circuit [claude, human developer oversight]#72869

Closed
zeroaltitude wants to merge 9 commits into openclaw:main from zeroaltitude:perf/models-config-startup-cache-upstream

Conversation


@zeroaltitude zeroaltitude commented Apr 27, 2026

Problem

The in-memory cache in ensureOpenClawModelsJson was effectively disabled in practice. Every call recomputed a fingerprint that differed from the previous call's fingerprint, so the implicit-provider-discovery pipeline (which takes several seconds to scan installed plugins for provider contributions) ran on every invocation instead of once per real config change.

Two root causes:

  1. The fingerprint included auth-profiles.json mtime. That file is rewritten whenever any OAuth token in it is refreshed (access token, refresh token, expiry), even though the set of providers the user can actually use is unchanged. Every refresh invalidated the cache on the next call.

  2. The fingerprint included models.json mtime. That file is the output of this function. When a call wrote models.json, the next call observed a newer mtime and invalidated its own cached result — a self-defeating fingerprint.

Fix 1 — Stabilize the cache fingerprint (commit 918f3f9173)

Replace the mtime-based inputs with:

  • A content-based hash of auth-profiles.json that strips volatile OAuth fields (access, refresh, token, expires, expiresAt, expiresIn, issuedAt, refreshedAt, lastCheckedAt, lastRefreshAt, lastValidatedAt). Structural changes (adding/removing a profile, changing provider or accountId) still invalidate the cache.
  • Drop modelsFileMtimeMs entirely. External edits to models.json are already detected downstream in the plan layer by comparing existingRaw against the computed plan.
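As a sketch of the strip-then-hash approach (helper names and the JSON shape of auth-profiles.json are illustrative, not the PR's actual code):

```typescript
import { createHash } from "node:crypto";

// Volatile OAuth fields per the PR description. Note: "token" was later
// removed from the real set in d505fa064c (see the review threads below).
const VOLATILE = new Set([
  "access", "refresh", "expires", "expiresAt", "expiresIn",
  "issuedAt", "refreshedAt", "lastCheckedAt", "lastRefreshAt", "lastValidatedAt",
]);

// Recursively drop volatile fields and sort keys, then hash a stable encoding.
function stripVolatile(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(stripVolatile);
  if (value && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(value).sort()) {
      if (VOLATILE.has(key)) continue;
      out[key] = stripVolatile((value as Record<string, unknown>)[key]);
    }
    return out;
  }
  return value;
}

function stableAuthProfilesHash(profiles: unknown): string {
  return createHash("sha256")
    .update(JSON.stringify(stripVolatile(profiles)))
    .digest("hex");
}

// A token refresh touches only volatile fields: hash unchanged, cache kept.
// Adding a profile is a structural change: hash changes, cache invalidated.
const before = { "openai:default": { provider: "openai", access: "a1", expiresAt: 100 } };
const after  = { "openai:default": { provider: "openai", access: "a2", expiresAt: 200 } };
const added  = { ...before, "groq:default": { provider: "groq" } };
console.log(stableAuthProfilesHash(before) === stableAuthProfilesHash(after)); // true
console.log(stableAuthProfilesHash(before) === stableAuthProfilesHash(added)); // false
```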

Measured impact: on a workspace with ~37 discovery plugins, each call to runEmbeddedPiAgent drops from ~5s to sub-millisecond in the steady-state path.

Fix 2 — targetProvider short-circuit (commit 74e4d04771)

When a caller knows the exact provider it intends to use (e.g. the pi-embedded runner, which always has a resolved provider/modelId before calling ensureOpenClawModelsJson), there is no reason to walk the full implicit-provider-discovery pipeline — models.json just needs to contain that one provider with usable credentials.

Adds an optional EnsureOpenClawModelsJsonOptions.targetProvider hint. When set, and the provider is present in both the in-memory config and on-disk models.json with some form of usable credential material (string apiKey, unresolved-secret apiKey object, populated headers, or explicit auth config), skip the fingerprint + plan + discovery path entirely and return immediately.

Threaded through from compactEmbeddedPiSessionDirect and runEmbeddedPiAgent so every embedded-pi dispatch benefits.

This is on top of the cache-fingerprint stabilization above and covers the cold-cache case (first call per gateway start, per-subagent spawn with a fresh cache, etc.) where even the fingerprint computation can be avoided.
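A minimal sketch of the short-circuit predicate as described (the type shapes and helper names are assumptions, not the PR's actual code):

```typescript
// Hypothetical provider-entry shape covering the credential forms listed above.
type ProviderEntry = {
  apiKey?: string | { secret: string };      // string, or unresolved-secret object
  headers?: Record<string, string>;
  auth?: Record<string, unknown>;
};

// "Usable credential material" per the description: string apiKey,
// unresolved-secret apiKey object, populated headers, or explicit auth config.
function hasUsableCredentials(entry: ProviderEntry | undefined): boolean {
  if (!entry) return false;
  if (typeof entry.apiKey === "string" && entry.apiKey.length > 0) return true;
  if (entry.apiKey && typeof entry.apiKey === "object") return true;
  if (entry.headers && Object.keys(entry.headers).length > 0) return true;
  if (entry.auth && Object.keys(entry.auth).length > 0) return true;
  return false;
}

// Short-circuit only when the target provider has usable credentials in BOTH
// the in-memory config and the on-disk models.json; otherwise fall through
// to the fingerprint + plan + discovery path.
function canShortCircuit(
  targetProvider: string,
  configProviders: Record<string, ProviderEntry>,
  diskProviders: Record<string, ProviderEntry>,
): boolean {
  return (
    hasUsableCredentials(configProviders[targetProvider]) &&
    hasUsableCredentials(diskProviders[targetProvider])
  );
}

console.log(canShortCircuit("openai",
  { openai: { apiKey: "sk-live" } },
  { openai: { apiKey: "sk-live" } })); // true
console.log(canShortCircuit("openai",
  { openai: { apiKey: "sk-live" } },
  { openai: {} })); // false: no usable credentials on disk
```

Note this presence check says nothing about whether the disk entry still matches the config; that gap is what the review threads below focus on.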

Tests

src/agents/models-config.fingerprint-cache.test.ts adds 226 lines covering:

  • Volatile OAuth field stripping (auth-profiles.json content hash stability across token refresh).
  • Structural changes still invalidate (adding/removing a profile).
  • models.json writes do not invalidate the cache on next read.
  • targetProvider short-circuit hits when provider is configured with usable credentials.
  • targetProvider short-circuit fall-through when credentials are missing.
  • targetProvider ignored when not set (existing path preserved).

AI/Vibe-Coded disclosure

This PR was prepared with the assistance of Claude under human developer oversight. The diagnosis came from observing 5-second startup pauses on every runEmbeddedPiAgent call in production despite no actual config changes; instrumenting ensureOpenClawModelsJson revealed both fingerprint instabilities. Tested against my live workspace (37 plugins, several OAuth-using providers) over multiple gateway restarts.

Appreciate a sanity check on whether the cache layer is still the right place for this optimisation, or if there's a different layer you'd prefer.

…freshes

The in-memory cache in ensureOpenClawModelsJson was effectively disabled in
practice: every call recomputed a fingerprint that differed from the previous
call's fingerprint, so the implicit-provider-discovery pipeline (which takes
several seconds to scan installed plugins for provider contributions) ran on
every invocation instead of once per real config change.

Two root causes:

1. The fingerprint included mtime of auth-profiles.json. That file is
   rewritten whenever any OAuth token in it is refreshed (access token,
   refresh token, expiry), even though the set of providers the user can
   actually use is unchanged. Every refresh invalidated the cache on the next
   call.

2. The fingerprint included mtime of models.json. That file is the OUTPUT of
   this function. When a call wrote models.json, the NEXT call observed a
   newer mtime and invalidated its own cached result.

Replace the mtime-based inputs with a content-based hash of auth-profiles.json
that strips volatile OAuth fields (access, refresh, token, expires,
expiresAt, expiresIn, issuedAt, refreshedAt, lastCheckedAt, lastRefreshAt,
lastValidatedAt). Structural changes (adding/removing a profile, changing
provider or accountId) still invalidate the cache. Drop the models.json input
entirely: external edits to that file are already detected downstream in the
plan layer by comparing existingRaw against the computed plan.

Measured impact: on a workspace with ~37 discovery plugins, each call to
runEmbeddedPiAgent dropped from ~5s to sub-millisecond in the models.json
ensure step after the first call per gateway start. This affects every
inbound message, every subagent spawn, every cron-dispatched agent turn,
and every Active Memory lookup.

Add src/agents/models-config.fingerprint-cache.test.ts covering:
 - cache hit on identical inputs
 - cache hit across simulated OAuth token rotation
 - cache invalidation when a profile is added/removed
 - cache invalidation when config changes
…s already configured

When a caller knows the exact provider it intends to use (e.g. the pi-embedded
runner, which always has a resolved provider/modelId before it calls
ensureOpenClawModelsJson), there is no reason to walk the full implicit-
provider-discovery pipeline — models.json just needs to contain that one
provider with usable credentials.

Add an optional EnsureOpenClawModelsJsonOptions.targetProvider hint. When set,
and the provider is present in both the in-memory config and on-disk
models.json with some form of usable credential material (string apiKey,
unresolved-secret apiKey object, populated headers, or explicit auth config),
skip the fingerprint + plan + discovery path entirely and return immediately.

Thread the hint through from compactEmbeddedPiSessionDirect and
runEmbeddedPiAgent so every embedded-pi dispatch benefits.

This is on top of the cache-fingerprint stabilization in 49f7450 and covers
the cold-cache case (first call per gateway start, per-subagent spawn with a
fresh cache, etc.) where even the fingerprint path can be avoided.
# Conflicts:
#	src/agents/pi-embedded-runner/run.ts
@openclaw-barnacle Bot added the agents (Agent runtime and tooling) and size: M labels on Apr 27, 2026
@zeroaltitude zeroaltitude marked this pull request as ready for review April 27, 2026 16:53
aisle-research-bot Bot commented Apr 27, 2026

🔒 Aisle Security Analysis

We found 4 potential security issue(s) in this PR:

  1. 🟠 High: Symlink/hardlink attack via predictable temp file in atomic models.json write
  2. 🟡 Medium: Potential DoS via unbounded fs.readFile() of oversized auth-profiles.json during fingerprinting
  3. 🟡 Medium: Unbounded read + JSON.parse of models.json allows memory/CPU exhaustion (DoS)
  4. 🟡 Medium: Incomplete disk-vs-config validation in models.json short-circuit allows persistent tampering of provider behavior
1. 🟠 Symlink/hardlink attack via predictable temp file in atomic models.json write
Severity: High
CWE: CWE-377
Location: src/agents/models-config.ts:284-287

Description

writeModelsFileAtomicForModelsJson writes to a predictable temporary path in the agent directory using fs.writeFile(). If an attacker can create files inside ${agentDir} (e.g., via an environment override of OPENCLAW_AGENT_DIR pointing to a shared/writable directory, or other local write access), they can pre-create tempPath as a symlink (or hardlink) to another file. fs.writeFile() will follow symlinks and overwrite the link target, enabling arbitrary file overwrite as the gateway user (CWE-59/CWE-377).

Vulnerable flow:

  • Input/control: attacker can influence filesystem state in agentDir (place a symlink at the predicted *.tmp path)
  • Sink: fs.writeFile(tempPath, ...) follows symlinks and writes attacker-controlled JSON content into the symlink target
  • Result: arbitrary file clobbering (and subsequent fs.rename behavior may also replace models.json)

Vulnerable code:

const tempPath = `${targetPath}.${process.pid}.${Date.now()}.tmp`;
await fs.writeFile(tempPath, contents, { mode: 0o600 });
await fs.rename(tempPath, targetPath);

Recommendation

Avoid writing via a predictable path with writeFile() that can follow symlinks.

Mitigations (combine):

  1. Create the temp file securely with O_EXCL (fails if it already exists) and never follow existing symlinks:
import { open, rename } from "node:fs/promises";
import * as path from "node:path";
import * as crypto from "node:crypto";

export async function writeModelsFileAtomicForModelsJson(targetPath: string, contents: string) {
  const dir = path.dirname(targetPath);
  const name = `.${path.basename(targetPath)}.${crypto.randomUUID()}.tmp`;
  const tempPath = path.join(dir, name);

  const fh = await open(tempPath, "wx", 0o600); // "wx" = O_CREAT | O_EXCL: fails if the path already exists
  try {
    await fh.writeFile(contents, { encoding: "utf8" });
    await fh.sync();
  } finally {
    await fh.close();
  }
  await rename(tempPath, targetPath);
}
  2. Optionally lstat the directory and/or targetPath before and after to ensure it is not a symlink and is within the expected directory.

This prevents a local attacker from using symlinks/hardlinks to redirect writes to unintended files.

2. 🟡 Potential DoS via unbounded fs.readFile() of oversized auth-profiles.json during fingerprinting
Severity: Medium
CWE: CWE-400
Location: src/agents/models-config.ts:97-105

Description

readAuthProfilesStableHash() attempts to cap expensive parsing when auth-profiles.json exceeds MAX_AUTH_PROFILES_BYTES, but in the oversized-file branch it still reads the entire file into memory and hashes it.

  • Input: auth-profiles.json located under agentDir (default under the user-writable state dir, and can also be overridden via env), so a local/untrusted party who can write to that path can supply an arbitrarily large file.
  • Sink: await fs.readFile(pathname) reads the whole file into memory, and createHash(...).update(raw) performs CPU work proportional to file size.
  • Impact: memory exhaustion and CPU exhaustion (DoS) when this fingerprint function is invoked.

Vulnerable code:

if (stat.size > MAX_AUTH_PROFILES_BYTES) {
  let raw: Buffer;
  try {
    raw = await fs.readFile(pathname);
  } catch {
    return null;
  }
  return createHash("sha256").update(raw).digest("hex");
}

Recommendation

Enforce a hard maximum read size for auth-profiles.json even in the oversized-file path.

Options:

  1. Reject oversized files outright (simplest):
if (stat.size > MAX_AUTH_PROFILES_BYTES) {
  return null; // or return a fixed marker hash like sha256("[too-large]")
}
  2. Stream the hash with an upper bound and abort once the limit is exceeded:
import { createReadStream } from "node:fs";
import { createHash } from "node:crypto";

const stream = createReadStream(pathname, { highWaterMark: 64 * 1024 });
const hash = createHash("sha256");
let total = 0;
for await (const chunk of stream) {
  total += chunk.length;
  if (total > MAX_AUTH_PROFILES_BYTES) {
    stream.destroy(new Error("auth-profiles.json too large"));
    return null; // or marker hash
  }
  hash.update(chunk);
}
return hash.digest("hex");

This prevents allocating/hashing attacker-controlled multi-GB files and keeps the fingerprinting work bounded.

3. 🟡 Unbounded read + JSON.parse of models.json allows memory/CPU exhaustion (DoS)
Severity: Medium
CWE: CWE-400
Location: src/agents/models-config.ts:225-463

Description

ensureOpenClawModelsJson() introduces new helpers that read models.json fully into memory (sometimes multiple times) and may parse it as JSON without any file-size cap.

If an attacker (or untrusted caller) can influence agentDir (e.g. via RunEmbeddedPiAgentParams.agentDir) or can tamper with ${agentDir}/models.json, they can place an extremely large file causing:

  • Large Buffer/string allocations via fs.readFile()
  • CPU spike and event-loop blocking via JSON.parse()
  • Repeated reads (hashing + short-circuit comparison + plan layer) amplifying the impact

Vulnerable code paths:

  • readModelsJsonContentHash() reads the entire file as a Buffer.
  • readExistingProviderMatchesConfig() reads the entire file as UTF-8 and performs JSON.parse().

Vulnerable code:

// reads whole file into memory
const raw = await fs.readFile(pathname);

// reads whole file into a string and parses it
raw = await fs.readFile(targetPath, "utf8");
parsed = JSON.parse(raw);

Recommendation

Add a strict size cap (and ideally streaming) for models.json reads.

Suggested approach:

  1. stat() the file first and refuse / re-plan / truncate if above a reasonable bound (e.g. 1–8 MiB).
  2. Avoid JSON.parse() on oversized content; treat as drift/corruption and rewrite from a computed plan.
  3. For hashing, prefer streaming (createReadStream) to avoid allocating a full buffer.

Example (size cap + streaming hash):

const MAX_MODELS_JSON_BYTES = 8 * 1024 * 1024;

async function readModelsJsonContentHash(pathname: string): Promise<string | null> {
  const stat = await fs.stat(pathname).catch(() => null);
  if (!stat) return null;
  if (stat.size > MAX_MODELS_JSON_BYTES) return null; // force re-plan / treat as corrupted

  return await new Promise((resolve, reject) => {
    const hash = createHash("sha256");
    const stream = createReadStream(pathname);
    stream.on("data", (chunk) => hash.update(chunk));
    stream.on("error", () => resolve(null));
    stream.on("end", () => resolve(hash.digest("hex")));
  });
}

async function safeReadModelsJsonUtf8(pathname: string): Promise<string | null> {
  const stat = await fs.stat(pathname).catch(() => null);
  if (!stat || stat.size > MAX_MODELS_JSON_BYTES) return null;
  return fs.readFile(pathname, "utf8");
}

Then in readExistingProviderMatchesConfig(), use the capped reader and return false (no short-circuit) when the file is too large/unreadable so the normal plan/write path can restore a valid small file.

4. 🟡 Incomplete disk-vs-config validation in models.json short-circuit allows persistent tampering of provider behavior
Severity: Medium
CWE: CWE-345
Location: src/agents/models-config.ts:444-515

Description

ensureOpenClawModelsJson() introduces a short-circuit path that skips full planning/writing of models.json when a target provider “matches” config. However, the match function only compares a subset of provider fields (baseUrl, apiKey, headers, auth).

As a result, an attacker (or accidental corruption) that can modify ${agentDir}/models.json can change other provider fields that affect runtime behavior (e.g. api transport selection, models list/model metadata, or other provider-specific knobs) while keeping those four checked fields unchanged. The gateway will then:

  • treat the disk state as valid and return early
  • skip the plan step that would otherwise normalize and overwrite models.json

This permits persistence of tampered provider settings across runs even when the caller provides a correct config, because the early return happens before the later “modelsJsonHash” cache second-factor logic.

Vulnerable code (short-circuit match omits other provider fields):

// compares only baseUrl, apiKey, headers, auth
if (!stableEqual(configuredProvider.headers, diskProvider.headers)) return false;
if (!stableEqual(configuredProvider.auth, diskProvider.auth)) return false;
return true;

Recommendation

Tighten the short-circuit criteria to ensure the on-disk provider entry is fully consistent with what planOpenClawModelsJson() would produce.

Options:

  1. Compare the entire provider object (minus known volatile/secret-marker fields):
// simplest: require exact structural equality of the normalized provider config
// (ideally after running the same normalization used by the planner)
if (!stableEqual(configuredProvider, diskProvider)) return false;
  2. At minimum include additional behavior-affecting fields such as api and models (and any provider-specific routing/config keys):
if (!stableEqual(configuredProvider.api, diskProvider.api)) return false;
if (!stableEqual(configuredProvider.models, diskProvider.models)) return false;
  3. Additionally (defense-in-depth), include a models.json content hash check even in the short-circuit path (or call into the normal cache+plan path) so any tampering anywhere in the file prevents early return.

Analyzed PR: #72869 at commit 1fadcde

Last updated on: 2026-04-28T04:15:41Z


greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR fixes two real performance bugs in ensureOpenClawModelsJson: replacing mtime-based fingerprint inputs with a content hash (stripping volatile OAuth fields) and adding a targetProvider short-circuit that skips the discovery pipeline when the provider is already on disk. The fingerprint stabilization is well-reasoned and the implementation is clean.

  • P1 — short-circuit returns stale on-disk credentials: The fast path at lines 283–293 only checks that the provider key exists in both the in-memory config and models.json; it does not check whether the on-disk credentials match the current config. After a credential rotation and gateway restart (empty in-memory cache, stale models.json), the short-circuit fires and resolveModelAsync reads the old key — exactly the scenario the fingerprint cache would have caught.
  • P2 — "token" in volatile fields: type: "token" auth profiles use the token field as a long-lived API credential; stripping it means rotating that credential won't invalidate the fingerprint cache.
  • P2 — missing tests: Despite the description listing short-circuit tests, only fingerprint-cache tests appear in the added test file.

Confidence Score: 3/5

Not safe to merge as-is — the targetProvider short-circuit can silently serve stale credentials after a config rotation.

A single P1 is present on a hot path (every runEmbeddedPiAgent call). The fingerprint-cache stabilization is sound, but the short-circuit bypasses the same config-change detection the cache relies on, creating a regression for the credential-rotation scenario.

src/agents/models-config.ts lines 283–293 (short-circuit logic); src/agents/models-config.fingerprint-cache.test.ts (missing short-circuit tests).

Review comments

src/agents/models-config.ts, lines 283–293:
**Short-circuit fires on stale on-disk credentials**

`explicitHasTarget` only checks that the provider key exists in the current config (`Boolean(explicitProviders[targetProvider])`); it says nothing about whether the *content* of that config (e.g. `apiKey`, `baseUrl`) matches what was written to `models.json`. `readExistingProviderIsConfigured` only checks that *some* non-empty credential exists on disk — not that it matches the current config.

Concrete failure path: the user rotates their API key in the config, restarts the gateway (clearing the in-memory fingerprint cache), and the first call after restart has `provider = "openai"`. `explicitHasTarget` is `true` (openai still exists), `onDiskHasTarget` is `true` (old key is still a non-empty string), so the short-circuit returns `{ wrote: false }` — but `models.json` still contains the old, revoked key. `resolveModelAsync` then reads the stale file and all requests fail silently until something else forces a cache miss.

The fingerprint-based cache (lines 295–307) would have caught this because the fingerprint includes the full `config` object. The short-circuit bypasses that check entirely.

A minimal fix would be to also verify that the provider's entry in models.json structurally matches what the current config would produce — or at minimum, not short-circuit when the fingerprint-cache is empty (cold start), because that is precisely the scenario where stale disk state needs to be reconciled.
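A sketch of that structural-match guard (hypothetical names; the PR's `stableEqual` helper presumably plays this role in the real code):

```typescript
// Order-independent canonical JSON encoding, so key order on disk is irrelevant.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value && typeof value === "object") {
    const keys = Object.keys(value).sort();
    return `{${keys
      .map((k) => `${JSON.stringify(k)}:${stableStringify((value as Record<string, unknown>)[k])}`)
      .join(",")}}`;
  }
  return JSON.stringify(value);
}

// Only allow the short-circuit when the on-disk provider entry structurally
// matches what the current config would produce; a rotated key falls through
// to the plan path, which rewrites models.json.
function diskMatchesConfig(configEntry: unknown, diskEntry: unknown): boolean {
  return stableStringify(configEntry) === stableStringify(diskEntry);
}

console.log(diskMatchesConfig(
  { baseUrl: "https://api.example.test", apiKey: "sk-new" },
  { apiKey: "sk-old", baseUrl: "https://api.example.test" })); // false: rotated key
console.log(diskMatchesConfig(
  { baseUrl: "https://api.example.test", apiKey: "sk-new" },
  { apiKey: "sk-new", baseUrl: "https://api.example.test" })); // true
```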

src/agents/models-config.fingerprint-cache.test.ts, lines 101–226:
**`targetProvider` short-circuit is untested**

The PR description lists six test cases, three of which cover the `targetProvider` short-circuit ("hits when provider is configured with usable credentials", "fall-through when credentials are missing", "ignored when not set"). None of those tests appear in this file — only the four fingerprint-cache tests are present.

The missing case that matters most is *short-circuit fires with stale on-disk credentials*: after a config credential rotation the short-circuit should fall through to the fingerprint path, not return early with the old key still on disk.


Reviews (1): Last reviewed commit: "Merge main into perf/models-config-start..."

Comment on lines +24 to +36
const AUTH_PROFILE_VOLATILE_FIELDS: ReadonlySet<string> = new Set([
"access",
"refresh",
"token",
"expires",
"expiresAt",
"expiresIn",
"issuedAt",
"refreshedAt",
"lastCheckedAt",
"lastRefreshAt",
"lastValidatedAt",
]);
Contributor

P2 "token" in volatile fields strips long-lived credentials

"token" is included as a volatile field, which is appropriate for OAuth session tokens. However, type: "token" profiles in auth-profiles.json (e.g., the "anthropic:default" profile in the test) use token as the long-lived API credential — not as a refresh-cycle access token. When that credential is revoked and replaced in auth-profiles.json, the content hash won't change (the field is stripped), so the fingerprint-based cache won't invalidate, and models.json won't be updated with the new credential until a full cache reset.

If "token" here is indeed a short-lived session token that gets auto-refreshed by Anthropic's CLI auth flow (similar to access/refresh), this is intentional and should be documented on the field. If it can also be a manually-managed long-lived key, it should be excluded from the volatile set (or handled conditionally based on type).

Contributor Author

Fixed in d505fa064c — same fix as the Codex thread above. token removed from the volatile set.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6b2a26d37d


Comment on lines 118 to 122
config: params.config,
sourceConfigForSecrets: params.sourceConfigForSecrets,
envShape,
authProfilesMtimeMs,
modelsFileMtimeMs,
authProfilesHash,
});

P1: Include models.json drift in cache key

This fingerprint no longer includes any models.json state, so once MODELS_JSON_STATE.readyCache is populated, unchanged config/auth inputs return early and never re-run planOpenClawModelsJson. If models.json is edited externally (manual edit, other process, partial corruption) after the first run, ensureOpenClawModelsJson will keep returning cached success and leave the file stale for the rest of the process. The previous mtime-based key at least forced re-planning after file edits.


Contributor Author

Acknowledged. The models.json drift / external-edit issue overlaps with the targetProvider short-circuit semantics and the cache-key shape, so I'm holding it for a follow-up where we can settle the approach (always re-plan vs. validate-disk-matches-config vs. content-hash the file into the fingerprint). The security fixes that just landed (d505fa0) are independent and should be safe to land first.

Contributor Author

Fixed in 235e3c2dcf. The fingerprint cache now has a two-factor key: fingerprint (config + auth-profiles + plugin metadata, hashed) AND modelsJsonHash (SHA-256 of the file as captured immediately after the plan-and-write). On every cache check, we recompute the file hash and compare; any external edit / corruption / tamper invalidates. Captured at all three plan return paths (skip, noop, write) so the second factor is always recorded.

The earlier comment about 'never include models.json state' was about avoiding self-invalidation — the new design captures the hash AT WRITE TIME, not as part of the input fingerprint, so the loop that previously caused every run to invalidate its own cache is avoided.
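The two-factor key described here can be sketched as follows (cache shape and helper names are illustrative, not the code from 235e3c2dcf):

```typescript
import { createHash } from "node:crypto";

// Hypothetical cache shape for the two-factor key: input fingerprint AND the
// models.json content hash captured immediately after the plan-and-write.
type ReadyCache = { fingerprint: string; modelsJsonHash: string } | null;
let readyCache: ReadyCache = null;

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// Called at each plan return path (skip, noop, write): record both factors.
function recordCache(fingerprint: string, modelsJsonContents: string): void {
  readyCache = { fingerprint, modelsJsonHash: sha256(modelsJsonContents) };
}

// On every cache check: the input fingerprint must match AND the file on disk
// must still hash to what was captured, so any external edit invalidates.
function cacheHit(fingerprint: string, currentModelsJsonContents: string): boolean {
  return (
    readyCache !== null &&
    readyCache.fingerprint === fingerprint &&
    readyCache.modelsJsonHash === sha256(currentModelsJsonContents)
  );
}

recordCache("fp-1", `{"providers":{}}`);
console.log(cacheHit("fp-1", `{"providers":{}}`));       // true: steady state
console.log(cacheHit("fp-1", `{"providers":{"x":{}}}`)); // false: external edit
console.log(cacheHit("fp-2", `{"providers":{}}`));       // false: config changed
```

Because the file hash is an output captured at write time rather than an input to the fingerprint, writing models.json no longer invalidates the very cache entry that write created.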

Comment on lines +25 to +29
"access",
"refresh",
"token",
"expires",
"expiresAt",

P2: Keep token credentials in auth-profile fingerprint

Treating token as a volatile field masks real auth-state changes for type: "token" profiles. Provider discovery derives usable credentials from cred.token (via resolveApiKeyFromCredential), so token updates that do not add/remove profile IDs will no longer invalidate this cache. That can leave implicit providers stuck in an outdated configured/unconfigured state until the cache is reset or the process restarts.


Contributor Author

Fixed in d505fa064c. Removed 'token' from AUTH_PROFILE_VOLATILE_FIELDS. The volatile set now keeps only OAuth session fields (access, refresh) and timing fields. Static type: 'token' profiles' literal token key is now part of the fingerprint, so rotating a static API token invalidates the cache as expected. Documented the boundary inline.

…artup-cache-upstream

# Conflicts:
#	src/agents/models-config.ts
@openclaw-barnacle Bot added the app: web-ui, gateway (Gateway runtime), channel: feishu (Channel integration: feishu), and triage: dirty-candidate (broad unrelated surfaces; may need splitting or cleanup) labels on Apr 28, 2026
Five issues raised by Aisle / Codex / Greptile review on PR openclaw#72869,
addressed inline rather than deferred:

1. CWE-59 symlink-following chmod (Aisle high #1)
   ensureModelsFileModeForModelsJson called fs.chmod on a path that
   may be replaced by a symlink. If an attacker can write to the agent
   dir (or OPENCLAW_AGENT_DIR points there), the chmod followed the
   link and changed perms on an arbitrary owned file. Now lstat first
   and refuse to chmod symlinks or non-regular files.

2. Prototype pollution via JSON keys (Aisle medium #3 / CWE-1321)
   stripAuthProfilesVolatileFields() copied untrusted keys into a
   plain {} object. Special keys '__proto__', 'constructor',
   'prototype' could mutate the result's prototype chain. Now uses
   Object.create(null) for the result and explicitly filters those
   three keys (belt-and-suspenders).

3. DoS via unbounded auth-profiles fingerprinting (Aisle medium #4)
   readAuthProfilesStableHash had no size or depth limits.
   - Added MAX_AUTH_PROFILES_BYTES = 8 MiB. Above the cap we hash raw
     bytes instead of running JSON.parse + recursive transform +
     stable-stringify.
   - Added MAX_AUTH_PROFILES_DEPTH = 64 with a depth-cap marker so
     the recursive walk can't stack-overflow on pathologically nested
     input.

4. 'token' incorrectly stripped from fingerprint (Codex P2 / Greptile P2)
   AUTH_PROFILE_VOLATILE_FIELDS included 'token' to keep OAuth session
   token rotation from invalidating the cache. But profiles with
   type: 'token' use the literal 'token' key as a long-lived static
   credential — stripping it would mask real auth-state changes when
   a user rotates a static API token. Removed 'token' from the
   volatile set and documented the boundary inline. OAuth session
   fields ('access', 'refresh') and timing fields stay stripped.

Skipped from this commit (will reply on threads):
- Cache short-circuit on stale on-disk credentials (Codex/Greptile P1):
  separate concern from the security fixes; needs design discussion
  on whether to validate disk-vs-config or remove the short-circuit.
- models.json drift in cache key (Codex P1): same — touches the
  fingerprint shape and overlaps with the targetProvider short-circuit.
- targetProvider short-circuit untested (Greptile P2): test follow-up
  once the short-circuit semantics are settled.
- Aisle medium #5 (raw secrets in fingerprint cache): structurally
  larger refactor; needs to land separately to keep this commit's
  blast radius clear.

Lint: 0 errors. TS: clean.
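The caps in fix 3 can be sketched as follows. This is a hypothetical sketch, not the real readAuthProfilesStableHash: the constants mirror the commit message, while authProfilesStableHash and capDepth are illustrative names, and plain JSON.stringify stands in for the stable-stringify used in the real code:

```typescript
import { createHash } from "node:crypto";

const MAX_AUTH_PROFILES_BYTES = 8 * 1024 * 1024; // 8 MiB
const MAX_AUTH_PROFILES_DEPTH = 64;

// Replace anything nested deeper than the cap with a stable marker so
// the recursive walk cannot stack-overflow on pathological input.
function capDepth(value: unknown, depth = 0): unknown {
  if (depth >= MAX_AUTH_PROFILES_DEPTH) return "[depth-capped]";
  if (Array.isArray(value)) return value.map((v) => capDepth(v, depth + 1));
  if (value !== null && typeof value === "object") {
    const out: Record<string, unknown> = {};
    for (const [k, v] of Object.entries(value)) out[k] = capDepth(v, depth + 1);
    return out;
  }
  return value;
}

function authProfilesStableHash(raw: Buffer): string {
  // Above the byte cap, skip JSON.parse + transform entirely and hash raw bytes.
  if (raw.byteLength > MAX_AUTH_PROFILES_BYTES) {
    return createHash("sha256").update(raw).digest("hex");
  }
  const parsed = JSON.parse(raw.toString("utf8"));
  const canonical = JSON.stringify(capDepth(parsed)); // stand-in for stable-stringify
  return createHash("sha256").update(canonical).digest("hex");
}
```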
@zeroaltitude
Contributor Author

Three of the five Aisle findings landed in d505fa064c:

  • 🟠 Aisle #1 High, symlink-following chmod (CWE-59): now lstats the path first; refuses to chmod symlinks or non-regular files. The O_NOFOLLOW open+fchmod variant from your recommendation is even stronger but adds an open-then-close on every cache hit; the lstat check closes the same vulnerability with no extra fd traffic.
  • 🟡 Aisle #3 Medium, prototype pollution: stripAuthProfilesVolatileFields() now uses Object.create(null) for the result and explicitly filters __proto__, prototype, constructor (belt and suspenders).
  • 🟡 Aisle #4 Medium, unbounded hashing DoS: added MAX_AUTH_PROFILES_BYTES = 8 MiB (above the cap we hash raw bytes instead of parsing) and MAX_AUTH_PROFILES_DEPTH = 64 (the recursive walk caps with a stable depth marker).
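The lstat-then-chmod guard for the symlink finding can be sketched like this. A hypothetical sketch: chmodIfRegularFile stands in for ensureModelsFileModeForModelsJson, whose actual signature is not shown on this thread:

```typescript
import {
  lstatSync, chmodSync,
  // The remaining imports are only needed by the usage example below.
  statSync, writeFileSync, mkdtempSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

function chmodIfRegularFile(path: string, mode: number): boolean {
  let st;
  try {
    st = lstatSync(path); // lstat does NOT follow symlinks
  } catch {
    return false; // missing file: nothing to do
  }
  // Refuse symlinks and anything that is not a regular file, so a link
  // planted in the agent dir cannot redirect the chmod elsewhere.
  if (st.isSymbolicLink() || !st.isFile()) return false;
  chmodSync(path, mode);
  return true;
}
```

Usage on a regular file applies the mode; on a missing path (or a symlink) it returns false without touching anything.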

Holding two for follow-ups:

…ests

Three review-driven fixes per @zeroaltitude direction (b)+(c) + secret
hygiene + tests:

(b) Validate disk-vs-config before short-circuiting
    [Aisle High #2 / Codex P1 / Greptile P1 on PR openclaw#72869]

    The previous targetProvider short-circuit fired whenever the
    on-disk provider entry contained ANY non-empty credential. That
    silently bypassed:
      - rotated apiKey: cold start with new key, old key on disk,
        short-circuit fires, all calls fail until something else
        invalidates
      - attacker-tampered baseUrl: redirect to exfil endpoint kept
      - attacker-injected headers: arbitrary auth material kept

    New readExistingProviderMatchesConfig() does a strict structural
    comparison:
      apiKey  - resolved through resolveSecretInputRef (env-ref
                expansion via createConfigRuntimeEnv) before string
                equality vs. disk
      baseUrl - exact string equality
      headers - stable structural equality (key-order independent)
      auth    - stable structural equality

    Any mismatch (or any state we cannot conclusively verify, like
    a non-env secret ref) returns false and falls through to full
    planning. The short-circuit is now safe to use on cold start and
    after gateway restart.

(c) Hash models.json content into the cache key
    [Codex P1 on PR openclaw#72869]

    Previous fingerprint had no models.json input: once the cache
    was populated, unchanged config/auth returned cached success even
    after the file was edited externally / partially corrupted /
    manually tampered. Now readyCache stores both the input
    fingerprint AND the post-write models.json SHA-256. Cache hit
    requires both to match; any external edit invalidates.

    Captures the hash at three points (skip path, noop path, write
    path) so the second factor is always recorded.

Aisle medium #5: hash fingerprint before storage
    Raw stable-stringified config (including apiKey strings) used to
    sit verbatim in MODELS_JSON_STATE.readyCache. SHA-256 over the
    canonical payload is now the cache key: deterministic but not
    reversible, so heap snapshots / debug telemetry / core dumps
    can't leak secrets via the readyCache state.

Greptile P2: targetProvider short-circuit tests
    New file models-config.target-provider-short-circuit.test.ts
    with 6 cases:
      - hit-on-match (full structural match short-circuits)
      - miss-on-rotated-key (config apiKey change forces plan)
      - miss-on-baseUrl-change (tampered disk baseUrl rejects)
      - miss-on-tampered-headers (any header drift rejects)
      - miss-on-cold-cache (no disk file forces plan)
      - hit-after-warm-fingerprint + invalidation on external
        models.json edit (modelsJsonHash second-factor verified)

Existing fingerprint-cache test updated:
    The 'volatile fields rotate' test mixed type:oauth (correctly
    volatile) with type:token (now correctly NOT volatile after
    d505fa0). Split into two tests:
      - OAuth session-field rotation does NOT invalidate (existing
        intent, narrowed to oauth-only profiles)
      - Static type:token credential rotation DOES invalidate
        (Codex/Greptile P2 - new correct behavior)

State shape change:
    MODELS_JSON_STATE.readyCache value extended with modelsJsonHash:
      { fingerprint, modelsJsonHash, result }
    All three return paths in the plan closure capture this.

Tests: 13/13 (6 new + 7 existing fingerprint-cache + file-mode).
Lint: 0 errors. TS: clean.
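A minimal sketch of the disk-vs-config comparison in (b), with illustrative names: providerMatchesConfig stands in for readExistingProviderMatchesConfig, and secret-ref resolution (resolveSecretInputRef / createConfigRuntimeEnv) is assumed to have already happened on the `resolved` side:

```typescript
type ProviderEntry = {
  apiKey?: string;
  baseUrl?: string;
  headers?: Record<string, string>;
  auth?: Record<string, unknown>;
};

// Key-order-independent structural equality for plain JSON values.
function stableEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((v, i) => stableEqual(v, b[i]));
  }
  if (a && b && typeof a === "object" && typeof b === "object") {
    const ka = Object.keys(a).sort();
    const kb = Object.keys(b).sort();
    return stableEqual(ka, kb) &&
      ka.every((k) => stableEqual((a as Record<string, unknown>)[k], (b as Record<string, unknown>)[k]));
  }
  return false;
}

// Any mismatch => false => fall through to full planning.
function providerMatchesConfig(disk: ProviderEntry, resolved: ProviderEntry): boolean {
  return (
    disk.apiKey === resolved.apiKey &&   // apiKey: resolved secret vs. disk
    disk.baseUrl === resolved.baseUrl && // baseUrl: exact string equality
    stableEqual(disk.headers ?? {}, resolved.headers ?? {}) &&
    stableEqual(disk.auth ?? {}, resolved.auth ?? {})
  );
}
```

So a rotated apiKey, a tampered baseUrl, or any header drift on disk fails the check, while a byte-identical entry with reordered header keys still matches.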
@zeroaltitude
Contributor Author

🟡 #5 fixed in 235e3c2dcf. buildModelsJsonFingerprint now returns createHash('sha256').update(canonical).digest('hex') instead of the raw stable-stringified payload. The cache key only needs to be deterministic, not reversible, so the readyCache no longer holds verbatim config / sourceConfigForSecrets (including apiKey strings) in process memory. Heap snapshots / debug telemetry / core dumps can't leak secrets through this path anymore.

🟠 #2 (config drift bypass) also fixed in the same commit; see the Codex/Greptile P1 thread above for details (disk-vs-config structural comparison + models.json content hash as a second cache factor).
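The two ideas in this comment (hash the canonical fingerprint payload so readyCache never stores secrets verbatim, and require a models.json content hash as a second factor for a cache hit) combine as in this sketch. Names are illustrative and only mirror the discussion:

```typescript
import { createHash } from "node:crypto";

const sha256 = (s: string) => createHash("sha256").update(s).digest("hex");

// readyCache stores only digests: deterministic, not reversible.
type ReadyCache = { fingerprint: string; modelsJsonHash: string };

function cacheHit(
  cache: ReadyCache | undefined,
  canonicalPayload: string, // stable-stringified inputs (config + auth structure)
  modelsJsonRaw: string,    // current on-disk models.json content
): boolean {
  if (!cache) return false;
  // Both factors must match: inputs unchanged AND models.json not
  // edited externally since the cached result was written.
  return cache.fingerprint === sha256(canonicalPayload) &&
         cache.modelsJsonHash === sha256(modelsJsonRaw);
}
```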

…artup-cache-upstream

# Conflicts:
#	src/agents/models-config.ts
@zeroaltitude
Contributor Author

This PR has been split into two focused PRs for cleaner review:

Each new PR can land independently. The cache PR includes the security findings from the Aisle review on this thread; the short-circuit PR addresses Codex P1 / Greptile P1 / Aisle High #2 (the disk-trust-without-validation finding) with field-level comparison.

The branches are clean off origin/main (no merge cruft from this thread). Tests and lint are clean on both.

Closing this PR in favor of the split. Please move review attention to #73260 and #73261.

@zeroaltitude zeroaltitude deleted the perf/models-config-startup-cache-upstream branch April 28, 2026 05:19
@zeroaltitude
Contributor Author

Closing this PR. Branch archived as archive/perf/models-config-startup-cache-upstream (still recoverable from origin via that ref). Forward all review attention to #73260 (cache) and #73261 (short-circuit).


Labels

agents Agent runtime and tooling app: web-ui App: web-ui channel: feishu Channel integration: feishu channel: slack Channel integration: slack commands Command implementations gateway Gateway runtime size: XL triage: dirty-candidate Candidate: broad unrelated surfaces; may need splitting or cleanup.
