Skip to content

fix(doctor): clear stale runtime override pins#83149

Closed
leno23 wants to merge 1 commit into
openclaw:mainfrom
leno23:fix/83098-doctor-runtime-pins
Closed

fix(doctor): clear stale runtime override pins#83149
leno23 wants to merge 1 commit into
openclaw:mainfrom
leno23:fix/83098-doctor-runtime-pins

Conversation

@leno23

@leno23 leno23 commented May 17, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Teach doctor session route-state scanning to notice agentRuntimeOverride-only entries.
  • Treat plugin-owned agentRuntimeOverride values like stale pinned runtime state and clear them with agentHarnessId during doctor --fix repair.
  • Register Anthropic and Google doctor session route-state ownership metadata so stale claude-cli / google-gemini-cli pins can be detected beyond the legacy Codex path.

Change Type

  • Bug fix

Scope

  • CLI / doctor
  • Plugin contracts
  • Tests

Linked Issue/PR

Real behavior proof

Behavior or issue addressed: doctor --fix previously only cleaned legacy Codex stale route pins. A session containing only agentRuntimeOverride: "claude-cli" was not even scanned, so stale Claude/other CLI runtime pins could persist after the configured route moved back to PI/OpenAI.

Real environment tested: macOS source checkout of openclaw/openclaw, branch fix/83098-doctor-runtime-pins, Node/pnpm workspace with existing dependencies, rebased onto upstream/main at aaf85166, commit 94789d296944c442faf64e7ccb3b16c6971cbf34.

Exact steps or command run after this patch:

pnpm --config.verify-deps-before-run=false test src/commands/doctor-session-state-providers.test.ts -- --reporter=dot
pnpm --config.verify-deps-before-run=false test src/commands/doctor-session-state-providers.test.ts src/plugins/doctor-contract-registry.test.ts -- --reporter=dot
pnpm --config.verify-deps-before-run=false exec oxfmt --check --threads=1 src/commands/doctor-session-state-providers.ts src/commands/doctor-session-state-providers.test.ts extensions/anthropic/doctor-contract-api.ts extensions/google/doctor-contract-api.ts
git diff --cached --check
pnpm --config.verify-deps-before-run=false check:changed --staged

Evidence after fix:

$ pnpm --config.verify-deps-before-run=false test src/commands/doctor-session-state-providers.test.ts -- --reporter=dot
[test] passed 1 Vitest shard in 46.62s
Test Files  1 passed (1)
Tests  9 passed (9)

$ pnpm --config.verify-deps-before-run=false test src/commands/doctor-session-state-providers.test.ts src/plugins/doctor-contract-registry.test.ts -- --reporter=dot
[test] passed 2 Vitest shards in 79.70s
Test Files  1 passed (1) / 1 passed (1)
Tests  9 passed (9) / 9 passed (9)

$ pnpm --config.verify-deps-before-run=false exec oxfmt --check --threads=1 src/commands/doctor-session-state-providers.ts src/commands/doctor-session-state-providers.test.ts extensions/anthropic/doctor-contract-api.ts extensions/google/doctor-contract-api.ts
All matched files use the correct format.
Finished in 198ms on 4 files using 1 threads.

$ pnpm --config.verify-deps-before-run=false check:changed --staged
[check:changed] lanes=core, coreTests, extensions, extensionTests
[check:changed] typecheck core
[check:changed] typecheck core tests
[check:changed] typecheck extensions
[check:changed] typecheck extension tests
[check:changed] lint core
Found 0 warnings and 0 errors.
[check:changed] lint extensions
Found 0 warnings and 0 errors.
[check:changed] runtime import cycles
Import cycle check: 0 runtime value cycle(s).
(exit code 0)

Observed result after fix: the new regression proves an entry with only agentRuntimeOverride: "claude-cli" makes storeMayContainPluginSessionRouteState(...) return true, scans as a stale Anthropic pinned runtime when the configured route is PI/OpenAI, and applySessionRouteStateRepair(...) removes agentRuntimeOverride while updating updatedAt. A negative control keeps the runtime override when the route is still configured for the Anthropic/Claude runtime.

What was not tested: a live openclaw doctor --fix run against a user's real session-store file; the behavior is covered at the doctor session route-state repair seam with in-memory session entries and plugin owner metadata, without touching local user state.

Before evidence: in a temporary upstream-main red worktree, the new regression failed because runtime override-only entries were skipped and stale agentRuntimeOverride was not removed by pinned-runtime repair. After this patch the same regression passes in the focused Vitest command above.

Root Cause

  • storeMayContainPluginSessionRouteState(...) did not include agentRuntimeOverride, so runtime override-only session entries skipped plugin route-state scanning.
  • The stale pinned runtime detector looked only at agentHarnessId, not agentRuntimeOverride.
  • The repair path for pinned runtime cleared agentHarnessId but left agentRuntimeOverride behind.
  • Anthropic/Google plugins did not expose doctor sessionRouteStateOwners, leaving their CLI runtime pins outside the generic cleanup path.

Regression Test Plan

  • Coverage level that should have caught this:
    • Seam / integration test
  • Target test file: src/commands/doctor-session-state-providers.test.ts
  • Scenario locked in: runtime override-only stale pins are detected and repaired when their owner is no longer configured, while current configured runtime overrides are preserved.
  • Existing adjacent coverage: src/plugins/doctor-contract-registry.test.ts confirms plugin doctor contracts still load through the registry.

User-visible / Behavior Changes

openclaw doctor --fix can now clean stale plugin-owned CLI runtime overrides such as claude-cli / google-gemini-cli, not just legacy Codex agentHarnessId state, when the session route no longer allows that owner.

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Risks and Mitigations

  • Risk: doctor could over-clear a runtime override that is still valid for the current route.
    • Mitigation: the scan still checks configured provider/model/runtime ownership first; the negative regression keeps claude-cli when the route is configured for Anthropic/Claude.
  • Risk: new plugin owner metadata could classify aliases too broadly.
    • Mitigation: ownership is limited to the plugin's provider/runtime/session-key prefixes and uses the existing doctor route-state ownership API.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

@openclaw-barnacle openclaw-barnacle Bot added commands Command implementations extensions: anthropic extensions: google proof: supplied External PR includes structured after-fix real behavior proof. size: S labels May 17, 2026
@clawsweeper

clawsweeper Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by maintainer comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors can comment @clawsweeper re-review or @clawsweeper re-run on their own open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR extends doctor session route-state cleanup to scan agentRuntimeOverride, clear it with pinned runtime repairs, add Anthropic/Google session-route owner metadata, and add focused regression tests.

Reproducibility: yes. Source inspection and the PR's proposed regression show that a session entry containing only agentRuntimeOverride: "claude-cli" is skipped on current main and should be handled by the doctor route-state scan.

Real behavior proof
Needs real behavior proof before merge: The PR body includes focused Vitest and check output, but explicitly says a live openclaw doctor --fix run against a real session-store file was not tested. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, ask a maintainer to comment @clawsweeper re-review.

Next step before merge
The contributor needs to revise the owner-boundary clearing logic and add real behavior proof before merge; automation should not take over while contributor proof is missing.

Security
Cleared: The diff does not add dependencies, network calls, workflow changes, secret handling, or new code-execution paths; the concern is correctness of persisted session repair.

Review findings

  • [P1] Clear only the owner-matched runtime field — src/commands/doctor-session-state-providers.ts:404
Review details

Best possible solution:

Carry the matched pinned-runtime field through the scan result, or otherwise clear agentRuntimeOverride only when that field itself matches the repairing owner, then prove the fix with a mixed-owner regression.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection and the PR's proposed regression show that a session entry containing only agentRuntimeOverride: "claude-cli" is skipped on current main and should be handled by the doctor route-state scan.

Is this the best way to solve the issue?

No. Extending the existing doctor cleanup is the right direction, but this implementation should preserve owner boundaries when deciding which persisted runtime field to clear.

Full review comments:

  • [P1] Clear only the owner-matched runtime field — src/commands/doctor-session-state-providers.ts:404
    This applies every pinned runtime repair by deleting both runtime fields. A legacy entry can contain a stale agentHarnessId for one owner and a valid agentRuntimeOverride for the currently configured owner; the Codex repair would then erase the valid Claude/Gemini override because the repair only records the broad reason, not which field matched. Please carry field-level match data from the scan or re-check the field value before clearing it.
    Confidence: 0.87

Overall correctness: patch is incorrect
Overall confidence: 0.88

What I checked:

  • PR diff over-clears runtime override: The PR adds clear("agentRuntimeOverride") under the generic pinned runtime repair reason, so any owner repair with that reason clears the runtime override regardless of which field matched that owner. (src/commands/doctor-session-state-providers.ts:404, 94789d296944)
  • Current repair object lacks field-level match data: DoctorSessionRouteStateRepair carries owner id, reasons, and CLI session keys, but not whether agentHarnessId or agentRuntimeOverride matched the owner, so the apply step cannot safely clear both fields from the pinned runtime reason alone. (src/commands/doctor-session-state-providers.ts:136, 019dbcc7490e)
  • Session entries can contain both runtime fields: SessionEntry defines agentRuntimeOverride and agentHarnessId as independent persisted fields, which makes mixed legacy entries possible and makes broad clearing of both fields unsafe. (src/config/sessions/types.ts:268, 019dbcc7490e)
  • Doctor applies repairs per owner to the same current entry: runPluginSessionStateDoctorRepairs groups repairs by owner and applies each owner repair to the current session entry, so one stale owner repair can remove a runtime override that belongs to another configured owner. (src/commands/doctor-session-state-providers.ts:470, 019dbcc7490e)
  • Related issue source context: The linked issue and prior ClawSweeper comment identify the remaining bug as generic non-Codex stale agentRuntimeOverride cleanup, after Keep OpenAI Codex migrations on automatic runtime routing #79238 fixed live routing and Codex-specific cleanup.
  • Feature history: GitHub commit history shows the generic doctor session-state repair path was introduced by b17bb63b9eaa and later touched by 02fe0d8978db for Codex runtime routing cleanup. (src/commands/doctor-session-state-providers.ts:107, b17bb63b9eaa)

Likely related people:

  • steipete: GitHub file history and local blame tie the generic doctor session route-state repair path and current source lines to recent work by steipete. (role: recent area contributor; confidence: high; commits: b17bb63b9eaa, 16922649d215, 694ca50e9775; files: src/commands/doctor-session-state-providers.ts, src/plugins/doctor-contract-registry.ts)
  • pashpashpash: Authored the merged runtime-routing cleanup in Keep OpenAI Codex migrations on automatic runtime routing #79238, which the linked issue identifies as the adjacent fix that left this non-Codex cleanup gap. (role: prior related PR author; confidence: medium; commits: 02fe0d8978db; files: src/commands/doctor-session-state-providers.ts, src/commands/doctor/shared/codex-route-warnings.ts)
  • jalehman: Recent history on src/plugins/doctor-contract-registry.ts includes the load-path plugin doctor contract discovery change that owns the registry surface used by this PR's plugin metadata. (role: doctor contract registry contributor; confidence: medium; commits: b22c8998ca87; files: src/plugins/doctor-contract-registry.ts)

Remaining risk / open question:

  • The current patch can silently remove a valid session-level runtime override from legacy entries that also contain a stale agentHarnessId for a different owner.
  • The PR body supplies focused test output, but not a real openclaw doctor --fix run against a real session store after the patch.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 019dbcc7490e.

@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. impact:auth-provider Auth, provider routing, model choice, or SecretRef resolution may break. labels May 17, 2026
@leno23 leno23 marked this pull request as draft May 17, 2026 17:13
@leno23

leno23 commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

Closing stale contributor PR that has been open for over a week without merge progress. I am keeping the branch intact so it can be reopened or reused if maintainers want to continue it.

@leno23 leno23 closed this May 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

commands Command implementations extensions: anthropic extensions: google impact:auth-provider Auth, provider routing, model choice, or SecretRef resolution may break. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. P2 Normal backlog priority with limited blast radius. proof: supplied External PR includes structured after-fix real behavior proof. size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

doctor --fix: cleans only codex-route legacy agentRuntimeOverride pins; stale claude-cli/other harness pins persist forever

2 participants