fix(auth-profiles): make post-success bookkeeping saves non-fatal#67077
fix(auth-profiles): make post-success bookkeeping saves non-fatal#67077ademczuk wants to merge 3 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryThis PR wraps the bodies of Confidence Score: 5/5Safe to merge — narrow, well-tested fix with no regressions in the changed surface. All three changed functions have clear try/catch coverage with explicit log.warn output. The intentional exclusion of clearAuthProfileCooldown and OAuth save paths is documented in the PR. Tests pass on 3/3 new cases. No P0 or P1 findings. No files require special attention. Reviews (1): Last reviewed commit: "fix(auth-profiles): make post-success bo..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c1805d3666
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
c1805d3 to
73decfb
Compare
d846663 to
69d44e7
Compare
|
Thanks for the review. All three |
69d44e7 to
1da884b
Compare
|
CI status note: The two red checks are pre-existing upstream failures that reproduce on clean
The PR already includes two upstream-CI-unblock commits:
34 other CI checks pass. Happy to rebase once the two unrelated red checks are resolved upstream. |
a53db25 to
d0f84c1
Compare
|
Thanks for the CWE-532 finding. Applied the preferred mitigation (option 1 from your suggestion) in commit ef36b77: the three |
d0f84c1 to
939794f
Compare
|
Update: rebased onto current main. Upstream shipped equivalents for both of the CI unblock commits I had carried, so they've been dropped:
PR is now a clean single-concern fix for #62099. All three |
4a98411 to
110d430
Compare
|
Codex review: needs maintainer review before merge. Summary Reproducibility: yes. Source reproduction is high confidence: #62099 supplies Windows steps and a matching stack trace, and current main awaits the affected mark paths while their direct save fallback can still throw. Real behavior proof Next step before merge Security Review detailsBest possible solution: Land this PR or an equivalent targeted patch after maintainer validation, keeping OAuth/token-refresh persistence failures fatal and preserving focused regression coverage for the lock-null plus direct-save EPERM path. Do we have a high-confidence way to reproduce the issue? Yes. Source reproduction is high confidence: #62099 supplies Windows steps and a matching stack trace, and current main awaits the affected mark paths while their direct save fallback can still throw. Is this the best way to solve the issue? Yes. The PR is the narrowest maintainable fix shape because it makes only post-completion auth-profile bookkeeping best-effort instead of weakening Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 7c13004883f6. |
|
Codex review: needs maintainer review before merge. What this changes: The PR wraps Maintainer follow-up before merge: This is an open contributor implementation PR for a concrete gateway failure, so the right next action is maintainer review/rebase/merge rather than spawning an automated replacement branch. Review detailsBest possible solution: Land this PR or an equivalent targeted fix that makes only post-completion auth-profile bookkeeping saves best-effort, keeps OAuth/token-refresh persistence failures fatal, and preserves focused regression coverage for the lock-null plus direct-save EPERM path. Broader ReadOnly attribute repair or context-error-loop bounds can remain separate follow-ups if maintainers want them. Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against acae48b790fa. |
110d430 to
67e96e7
Compare
67e96e7 to
63ab503
Compare
|
Acknowledged. Confirming the PR matches the bot's recommended approach: only the three post-completion mark functions (markAuthProfileGood, markAuthProfileUsed, markAuthProfileFailure) are wrapped in try/catch with structured logging, OAuth/token-refresh persistence remains fatal, and regression coverage exercises both the lock-null and direct-save EPERM paths. Just rebased onto current main (63ab503) and lint, format, and the auth-profiles vitest suite all pass locally (126/126). |
|
@vincentkoc when you've got a moment, I have put work into resolving this issue that appears to have persisted for the past 3 weeks of updates #67077? Auth-profiles EPERM fix from #62099. Rebased, CI's green. |
Summary
Fixes #62099. On Windows, concurrent config hot-reload can leave
auth-profiles.jsonwith a ReadOnly attribute. The atomic write insaveAuthProfileStorethen throwsEPERM, and becausemarkAuthProfileGood/markAuthProfileUsed/markAuthProfileFailurerun as post-completion bookkeeping, that throw used to cascade into the LLM request that had already succeeded. Fallback triggers, hits the same read-only file, fails the same way. The gateway becomes unresponsive; restarts don't help because the file attribute persists.The fix wraps the body of each
mark*function in try/catch, logging the persistence failure and continuing. Caller-visible behavior is unchanged on the happy path.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
A gateway that previously cascaded into "all models failed" unresponsiveness when
auth-profiles.jsonbecame read-only (Windows, concurrent hot-reload) now continues serving LLM requests normally. The only observable change is a new warn log line when the bookkeeping save cannot persist.Security Impact (required)
NoNoNoNoNosaveAuthProfileStoreitself still throws on failure. OAuth token refresh (inoauth.ts) depends on that behavior, since a silent token-save failure would be a security concern. Only the threemark*functions that run after a successful provider call now tolerate save errors.markAuthProfileCooldowndelegates tomarkAuthProfileFailureso it's covered transitively.Repro + Verification
Environment
auth-profiles.jsonwith Windows ReadOnly attribute setSteps
openclaw.jsonwhile the gateway is hot-reloadingauth-profiles.jsonduring the concurrent renameEPERM: operation not permitted, copyfileExpected
LLM requests complete. Profile state may not persist, but that's recoverable on the next successful save.
Actual
The entire gateway cascades into "all models failed" until the user runs
attrib -Rand restarts.Evidence
usage.persist-nonfatal.test.tsHuman Verification (required)
Verified scenarios:
usage.persist-nonfatal.test.ts, 3/3 pass) in a fresh Docker container (Node 22, pnpm install from scratch)src/agents/auth-profiles/suite: 110/112 pass. The 2 failures (session-override.test.tsandoauth.openai-codex-refresh-fallback.test.tsvariously) also fail on cleanmainwith identical counts, so they're pre-existing flakiness unrelated to this changepnpm tsgo --noEmitclean (exit 0)oxlint --type-awareon modified files: 0 warnings, 0 errorsoxfmt --checkon modified files: cleanEdge cases checked:
updateAuthProfileStoreWithLock(lock-guarded path) andsaveAuthProfileStore(direct path). Eachmark*function is asserted to resolve without throwing in both casesmarkAuthProfileCooldowncovered transitively viamarkAuthProfileFailuredelegationWhat I did not verify:
Compatibility / Migration
YesNoNoFailure Recovery (if this breaks)
If the new warn logs become noisy in production, the fix can be reverted cleanly since it's additive (new try/catch around existing logic). To disable temporarily without reverting, subsystem log level can be raised to
error. Files to restore:src/agents/auth-profiles/profiles.ts,src/agents/auth-profiles/usage.ts. Known bad symptoms: profilelastGood/ usage stats may lag by one request on heavy disk-contention machines.Risks and Mitigations
Risk: Silencing all throws in
mark*could mask a genuine bug in the usage-stats computationlog.warnwith the full error message so ops can still see themRisk:
saveAuthProfileStorebeing called from other paths (OAuth token refresh, store inheritance) might also hit EPERM and still throwReal behavior proof
Behavior or issue addressed:
markAuthProfileGood/markAuthProfileUsed/markAuthProfileFailurepropagatedEPERMthrown bysaveAuthProfileStore(issue #62099). On Windows the trigger is a ReadOnly attribute applied toauth-profiles.jsonduring concurrent hot-reload. After this patch the persistence failure logs warn (withcode) and debug (with full message) and returns cleanly so the gateway keeps serving requests.Real environment tested: Linux x64, Node 22.22.2, fresh
pnpm install(pnpm 10.33.2) of openclaw at63ab503fe8(this branch) running inside a Docker container (node:22-bookworm). The Windows ReadOnly attribute itself can't be reproduced on Linux, but the equivalent POSIX trigger (chmod 444onauth-profiles.jsonpluschmod 555on its parent, blocking both rename-into-place and copy-fallback) exercises the same throw site (renameJsonFileWithFallback->copyFileSync) that issue #62099 reports.Exact steps or command run after this patch:
Evidence after fix: Live terminal output captured from the
node tsxinvocation against the rebased branch (commit63ab503fe8). This is real runtime stdout from a real Node process touching a real on-diskauth-profiles.json, not unit tests:Observed result after fix:
markAuthProfileGoodreturnsundefinedcleanly in 98ms with the file and directory both blocked from writes. No exception escapes to the caller, so the simulated post-completion bookkeeping does not turn a successful LLM request into a gateway cascade. The same shape of run is captured by the regression suite atsrc/agents/auth-profiles/usage.persist-nonfatal.test.tsfor the threemark*functions.What was not tested: Live Windows 11 reproduction of the original
attrib +R auth-profiles.jsonrace plus a real Anthropic LLM round-trip. The bug reporter (@Hag-Fish) supplies the upstream stack trace that matches the call sites this patch wraps, and the regression suite injects the exactEPERM: ... copyfileError variant from that trace. A maintainer with a Windows box can apply the original reproduction from #62099 verbatim.