fix: self-heal lane wedges + restore openai-codex OAuth on embedded path #84752
Totalsolutionsync wants to merge 5 commits into
…talls
Two related fixes for a recurring failure mode where the gateway's per-lane queue gets stuck with items waiting but no agent picks them up, observed in production today as Ghost going silent on Telegram for ~10 min at a time.
Layer 1 (lane pump on idle, src/logging/diagnostic.ts): `logSessionStateChange()` decrements `queueDepth` when the lane returns to idle but does not re-trigger the lane's dequeue. In normal operation `drainLane()` re-fires recursively after each task completes, so a fresh pump is not needed. In production we have seen lanes go `idle` with `queueDepth > 0` (typically after an embedded_run ends with terminal progress) and never dequeue, leaving queued user messages stranded.
Fix: on idle transition with `queueDepth > 0` and a known sessionKey, call `resetCommandLane(resolveEmbeddedSessionLane(sessionKey))`. This bumps the lane generation, clears any stale `activeTaskIds`, and re-invokes `drainLane`. It is a no-op when the lane queue is already empty, so it is safe as a belt-and-suspenders pump.
Layer 2 (stalled-session recovery for terminal active work, src/logging/diagnostic-session-attention.ts): `classifySessionAttention()` flags the `queued_behind_terminal_active_work` case (active embedded_run that emitted a terminal progress signal such as `rawResponseItem/completed` while `queueDepth > 0`) as `recoveryEligible: false`, so the existing recovery coordinator (`requestStuckSessionRecovery` at diagnostic.ts:1137) never fires: the detector logged `recovery=none` and the lane wedged forever.
Fix: mark this case `recoveryEligible: true`. The terminal progress signal indicates the active turn is effectively done, so the recovery coordinator's existing `release_lane` path is the right action: it releases the lane without aborting any healthy in-flight work. Widened the `session.stalled` discriminant's `recoveryEligible` type from `false` to `boolean` to allow future per-case overrides.
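A minimal sketch of the Layer 2 classification change, assuming a much simpler input shape than the real classifier takes. `ToySessionSnapshot`, `ToyAttention`, `classifyToySessionAttention`, and the `"ok"` variant are hypothetical stand-ins; only the `queued_behind_terminal_active_work` reason, the `session.stalled` discriminant, and the `recoveryEligible` field come from the description above.

```typescript
// Hypothetical, simplified model of the stalled-session classifier.
type ToySessionSnapshot = {
  queueDepth: number;
  activeWorkKind?: "embedded_run";
  // True once a terminal progress signal (e.g. `rawResponseItem/completed`) was seen.
  sawTerminalProgress: boolean;
};

type ToyAttention =
  | { kind: "ok" }
  | {
      kind: "session.stalled";
      reason: "queued_behind_terminal_active_work";
      // Widened from the literal type `false` to `boolean` so a case can opt in.
      recoveryEligible: boolean;
    };

function classifyToySessionAttention(s: ToySessionSnapshot): ToyAttention {
  const queuedBehindTerminalWork =
    s.queueDepth > 0 &&
    s.activeWorkKind === "embedded_run" &&
    s.sawTerminalProgress;
  if (!queuedBehindTerminalWork) {
    return { kind: "ok" };
  }
  // The active turn is effectively done, so recovery may release the lane.
  return {
    kind: "session.stalled",
    reason: "queued_behind_terminal_active_work",
    recoveryEligible: true,
  };
}
```

With the literal type `false`, the `recoveryEligible: true` line would not type-check, which is presumably why the type widening ships in the same commit.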
Test update: `diagnostic-session-attention.test.ts` case "queued behind terminal embedded progress" updated to expect `recoveryEligible: true`, pinning the new (correct) classification.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
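The Layer 1 idle pump from this commit can be illustrated with a toy lane. This is a sketch under stated assumptions, not the gateway's queue: `ToyLane`, `ToyTask`, and `pumpLaneOnIdle` are hypothetical names, and the wedge is modelled as a stale entry left in `activeTaskIds`.

```typescript
// Hypothetical toy lane; the real lane queue is more involved.
type ToyTask = { id: string; run: () => void };

class ToyLane {
  generation = 0;
  queue: ToyTask[] = [];
  activeTaskIds: string[] = [];

  enqueue(task: ToyTask): void {
    this.queue.push(task);
  }

  // Runs queued tasks one at a time. It refuses to start while any task is
  // recorded as active, which is how a stale id can wedge the lane.
  drain(): void {
    while (this.activeTaskIds.length === 0 && this.queue.length > 0) {
      const task = this.queue.shift();
      if (!task) break;
      this.activeTaskIds.push(task.id);
      try {
        task.run();
      } finally {
        this.activeTaskIds = this.activeTaskIds.filter((id) => id !== task.id);
      }
    }
  }

  // Mirrors the described reset: bump generation, drop stale ids, drain again.
  reset(): void {
    this.generation += 1;
    this.activeTaskIds = [];
    this.drain();
  }
}

// Idle-transition hook: pump only when work is still queued for a known session.
function pumpLaneOnIdle(lane: ToyLane, queueDepth: number, sessionKey?: string): boolean {
  if (queueDepth <= 0 || sessionKey === undefined) return false;
  lane.reset();
  return true;
}
```

In this toy the guard on `queueDepth` keeps the pump from touching a healthy, empty lane, which matches the "belt-and-suspenders" intent described in the commit message.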
The embedded agent runner (Telegram replies, cron invocations, any sub-agent dispatch) calls into the secrets runtime to resolve provider auth. That path goes through `loadAuthProfileStoreForSecretsRuntime`, which hardcoded `resolveLegacyOAuthSidecars: false`. As a result, OAuth profiles whose credential material lives in the legacy sidecar layout (`oauthRef.source: "openclaw-credentials"`, hash-named files under `<state>/credentials/auth-profiles/<id>.json`) were loaded without their access/refresh tokens, and `resolveApiKeyForProfile()` fell through to the "No API key found for provider" error.
The OAuth-manager-internal helper added in #83312 already sets this to `true`, but the secrets-runtime path is a parallel entry point: when the embedded agent resolves provider auth for a model turn, it loads the store through this helper, *before* the OAuth manager's own reload would have a chance to compensate. Direct CLI inference is unaffected because it routes through a different store-load path that still sees the material.
Repro (against v2026.5.19 stock):
1. Have an `openai-codex:default` profile with type=oauth and `oauthRef.source = "openclaw-credentials"` (typical for users who onboarded before the sidecar runtime was removed in #82777).
2. Send a Telegram message to the bot, or wait for any cron with an embedded payload to fire.
3. Gateway logs: `[diagnostic] lane task error: ... error="Error: No API key found for provider \"openai-codex\". Auth store: .../auth-profiles.json ... Configure auth for this agent (openclaw agents add <id>) or copy only portable static auth profiles from the main agentDir."`
4. Meanwhile, `openclaw infer model run --model openai/gpt-5.5 --prompt "say OK"` returns a normal completion using the same OAuth profile.
Fix: flip the hardcoded default in `loadAuthProfileStoreForSecretsRuntime` from `false` to `true`, matching the OAuth-manager helper's choice. Sidecar resolution is read-only and already gated by per-process feature gates downstream, so this is safe to enable unconditionally for the secrets-runtime load.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
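To make the effect of the flag concrete, here is a hedged sketch that models the store and the sidecar files as in-memory objects. `ToyOAuthProfile`, `ToySidecars`, `loadToyProfile`, and `resolveToyApiKey` are hypothetical; only the `resolveLegacyOAuthSidecars` option name, the `oauthRef.source` value, and the error wording are taken from the commit message.

```typescript
// Hypothetical in-memory stand-ins for the on-disk profile store and sidecars.
type ToyOAuthProfile = {
  type: "oauth";
  oauthRef?: { source: "openclaw-credentials"; id: string };
  access?: string;
  refresh?: string;
};

type ToySidecars = Record<string, { access: string; refresh: string }>;

// Read-only load: returns a copy, optionally merged with sidecar tokens.
function loadToyProfile(
  profile: ToyOAuthProfile,
  sidecars: ToySidecars,
  options: { resolveLegacyOAuthSidecars?: boolean } = {},
): ToyOAuthProfile {
  // The fix corresponds to this default being `true` instead of `false`.
  const resolve = options.resolveLegacyOAuthSidecars ?? true;
  if (!resolve || !profile.oauthRef) {
    return { ...profile };
  }
  const material = sidecars[profile.oauthRef.id];
  return material ? { ...profile, ...material } : { ...profile };
}

function resolveToyApiKey(provider: string, profile: ToyOAuthProfile): string {
  if (!profile.access) {
    throw new Error(`No API key found for provider "${provider}"`);
  }
  return profile.access;
}
```

Passing `{ resolveLegacyOAuthSidecars: false }` reproduces the pre-fix behaviour in this toy: the profile loads without tokens and key resolution throws.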
…zed state
After a network drop mid-cycle, the grammy bot can end up in a
"not initialized" state without the polling session noticing — every
subsequent spooled-update handler fails with `Bot not initialized!` in
a tight retry loop (observed at ~500ms cadence), and the only escape is
an external gateway restart.
Observed today 2026-05-20 on `The-Ghosts-Shell`: bot init succeeded at
17:14:14, WiFi dropped at 17:16:34 (`ath10k_pci DEAUTH_LEAVING by local
choice`), WiFi reconnected at 17:16:49, and by 17:24:14 the bot was
firing `Bot not initialized!` on every retry. The OUTER `runUntilAbort`
loop never got a chance to recreate the bot + re-run `bot.init()`
because nothing inside the cycle signaled "exit/continue" — the spool
worker just kept retrying the same dead update forever.
Fix:
1. Add a one-shot `#requestCycleRestartOnBotReinitNeeded` callback on
`TelegramPollingSession`. The active `#runIsolatedIngressCycle`
populates it on entry with a closure that sets the local
`restartRequested = true` and calls `worker.stop()`. Cleared in the
existing `finally` cleanup so a future cycle doesn't see a stale
handle.
2. In `#releaseFailedSpooledUpdate`, after logging the
"keeping for retry" line, detect the substring "Bot not initialized"
in the formatted error message. If present, invoke the registered
restart callback to ask the cycle to tear itself down cleanly.
The cycle's existing try/finally cleanup (worker.stop, drainOnce,
stopBot, unsubscribe, abort-listener removal) already does the right
teardown when `restartRequested` is true — this commit only adds the
detection + signaling. The outer `runUntilAbort` loop then creates a
fresh `TelegramBot` instance via `#createPollingBot()` and re-runs
`bot.init()` against the (now-stable) network on the next iteration.
The substring check is intentionally conservative — grammy throws this
specific message string from its `BotInfoCacheBase` when `botInfo` is
undefined, and we don't want to false-positive on unrelated errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
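The one-shot restart signal described in this commit can be sketched as follows. This is an illustration under assumptions, not the real `TelegramPollingSession`: `ToyPollingSession`, `runCycle`, and `releaseFailedUpdate` are hypothetical, TypeScript `private` is used in place of `#` fields, and breaking out of the loop stands in for `worker.stop()`.

```typescript
// Hypothetical stand-in for the polling session and its ingress cycle.
class ToyPollingSession {
  private requestCycleRestart: (() => void) | undefined;
  handled: string[] = [];

  // One ingress cycle: registers the restart hook, processes failures, cleans up.
  runCycle(failureMessages: string[]): { restartRequested: boolean } {
    let restartRequested = false;
    this.requestCycleRestart = () => {
      restartRequested = true;
    };
    try {
      for (const message of failureMessages) {
        this.releaseFailedUpdate(message);
        if (restartRequested) break; // stand-in for worker.stop()
      }
    } finally {
      // Cleared so a later cycle never sees a stale handle.
      this.requestCycleRestart = undefined;
    }
    return { restartRequested };
  }

  releaseFailedUpdate(errorMessage: string): void {
    this.handled.push(errorMessage);
    // Conservative substring check on the formatted error message.
    if (errorMessage.indexOf("Bot not initialized") !== -1) {
      const request = this.requestCycleRestart;
      this.requestCycleRestart = undefined; // one-shot
      if (request) request();
    }
  }
}
```

An outer loop would then see `restartRequested` and build a fresh bot before the next cycle; that part is left out of the sketch.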
…y points
L4 patched only `loadAuthProfileStoreForSecretsRuntime`, but
`ensureAuthProfileStoreWithoutExternalProfiles` (used by the embedded
runner via `pi-embedded-runner/run.ts:59` and by `model-provider-auth.ts`)
and `loadAuthProfileStoreWithoutExternalProfiles` (used by
`model-auth-label`, `pi-auth-discovery`, the models list command, the
OAuth manager) are parallel entry points that ALSO needed the same flag
flip. Without this follow-up, cron-isolated lanes (`lane=cron-nested`,
`lane=session:agent:main:cron:...:run:...`) keep hitting the legacy
"No API key found for provider \"openai-codex\"" error path even though
direct user-Telegram lanes resolve fine.
Observed live 2026-05-20 17:45 PDT on the L4-patched v2026.5.19 build:
direct Telegram replies worked, but the 15-minute AgentOS task-board
sweep cron (`9584014c`) fired at 17:45:24 and surfaced the same
`FailoverError: No API key found for provider "openai-codex"` to the
delivery channel.
Fix:
1. `loadAuthProfileStoreWithoutExternalProfiles`: default
`resolveLegacyOAuthSidecars` from `false` to `true` — matches L4's
reasoning for the secrets-runtime helper.
2. `ensureAuthProfileStoreWithoutExternalProfiles`: accept the
`resolveLegacyOAuthSidecars` option (was unsupported, hardcoded
`false` downstream), default to `true`, and forward it through
`resolveRuntimeAuthProfileStore` and both `loadAuthProfileStoreForAgent`
call sites (requested agentDir + main fallback merge).
These functions are read-only and do not mutate persisted state, so
flipping the default is safe — they just include the credential material
that's been on disk all along.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
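The second item of this fix, accepting the option, defaulting it to `true`, and forwarding it to both load call sites, might look roughly like the sketch below. `ToyLoadOptions`, `ToyStore`, `loadToyStoreForAgent`, and `ensureToyStore` are hypothetical simplifications; the real functions return full auth stores rather than a flag.

```typescript
// Hypothetical simplification of the loader chain.
type ToyLoadOptions = { resolveLegacyOAuthSidecars?: boolean };
type ToyStore = { agentDir: string; sidecarsResolved: boolean };

// Innermost loader: only resolves sidecars when explicitly told to.
function loadToyStoreForAgent(agentDir: string, options: ToyLoadOptions): ToyStore {
  return {
    agentDir,
    sidecarsResolved: options.resolveLegacyOAuthSidecars === true,
  };
}

// Entry point: accepts the option, defaults it to `true`, and forwards the
// same value to both the requested agentDir load and the main fallback load.
function ensureToyStore(
  requestedAgentDir: string,
  mainAgentDir: string,
  options: ToyLoadOptions = {},
): ToyStore[] {
  const forwarded: ToyLoadOptions = {
    resolveLegacyOAuthSidecars: options.resolveLegacyOAuthSidecars ?? true,
  };
  return [
    loadToyStoreForAgent(requestedAgentDir, forwarded),
    loadToyStoreForAgent(mainAgentDir, forwarded),
  ];
}
```

Resolving the default once at the entry point and passing the resolved value down avoids the situation described above, where an inner call site silently keeps its own hardcoded `false`.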
Codex review: needs maintainer review before merge.
Workflow note: Future ClawSweeper reviews update this same comment in place.
Summary
Reproducibility: yes. Source inspection shows current main still disables legacy sidecar resolution on the embedded auth-store paths, and the grammY dependency source confirms the exact not-initialized error the Telegram recovery handles; the PR body also supplies redacted live gateway proof.
Review details
Best possible solution: Keep this focused PR open for maintainer review and land it after the draft and required-check gates are satisfied.
Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main still disables legacy sidecar resolution on the embedded auth-store paths, and the grammY dependency source confirms the exact not-initialized error the Telegram recovery handles; the PR body also supplies redacted live gateway proof.
Is this the best way to solve the issue? Yes. The patch stays within existing recovery and auth-store seams, restores read-only compatibility for an already-supported legacy credential path, and avoids adding new config or product policy.
Codex review notes: model gpt-5.5, reasoning high; reviewed against e964987cd20e.
@Totalsolutionsync yes, if you could split 4624e34 and 85f36e8 into a separate PR, that would be great!
@Totalsolutionsync I've flagged those 2 commits as super high priority. If you're able to make a new PR in the next few minutes, that would be great. Otherwise I will cherry-pick them into a PR, get this merged for release, and give you credit.
…me loaders
The auto-migration introduced in #83312 only fires when a credential is loaded via a path that reads its sidecar tokens. The OAuth refresh manager's internal loader does (so direct CLI inference works and self-heals on first refresh). The embedded runner's secrets-runtime loaders did not:
- loadAuthProfileStoreForSecretsRuntime
- loadAuthProfileStoreWithoutExternalProfiles
- ensureAuthProfileStoreWithoutExternalProfiles
All three opted out of sidecar resolution. So for an upgraded user with a legacy oauthRef-backed openai-codex profile, the credential loaded with no access/refresh material, evaluateStoredCredentialEligibility marked it ineligible, resolveAuthProfileOrder filtered it out, and resolveApiKeyForProvider threw "No API key found for provider 'openai-codex'" before the OAuth manager (and its migration path) was ever consulted. CLI worked, Telegram/cron/embedded turns broke; only doctor-or-bust would fix it.
Flip the three embedded loaders to default resolveLegacyOAuthSidecars to true (matching loadStoredOAuthRefreshStore). The existing #83312 refresh-and-rewrite then fires on the first embedded turn for these users and persists tokens inline, removing the legacy sidecar from disk on the next doctor pass.
Cherry-picked and squashed from PR #84752 (commits 85f36e8 and 4624e34). Comments noting local-fork bookkeeping stripped per repo policy.
Co-authored-by: Will <totalsolutionspm@gmail.com>
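The failure chain in this commit message (credential loaded without material, marked ineligible, filtered out of the profile order, then the throw) can be traced with a small model. This is a sketch only: `ToyCredential` and the three `...Toy...` functions are hypothetical and merely echo the roles of `evaluateStoredCredentialEligibility`, `resolveAuthProfileOrder`, and `resolveApiKeyForProvider`; eligibility is reduced to "has an access token".

```typescript
// Hypothetical, heavily simplified credential shape.
type ToyCredential = {
  profileId: string;
  provider: string;
  access?: string;
  refresh?: string;
};

// Stand-in for the eligibility check: no token material means ineligible.
function isToyCredentialEligible(credential: ToyCredential): boolean {
  return typeof credential.access === "string" && credential.access.length > 0;
}

// Stand-in for profile ordering: ineligible credentials are filtered out.
function resolveToyProfileOrder(provider: string, credentials: ToyCredential[]): ToyCredential[] {
  return credentials.filter(
    (c) => c.provider === provider && isToyCredentialEligible(c),
  );
}

// Stand-in for key resolution: throws when no eligible profile remains.
function resolveToyApiKeyForProvider(provider: string, credentials: ToyCredential[]): string {
  const first = resolveToyProfileOrder(provider, credentials)[0];
  if (!first || !first.access) {
    throw new Error(`No API key found for provider '${provider}'`);
  }
  return first.access;
}
```

In this model the throw happens before any refresh logic could run, which mirrors why the migration path was never reached on the embedded loaders.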
…me loaders (#85074)
Summary
Four related fixes for failure modes that can take a self-hosted gateway's Telegram channel offline and require a manual restart to recover. All four were diagnosed and validated in production on a `v2026.5.19` build. Two are diagnostics/lane-queue self-healing; two restore OAuth resolution for the `openai-codex` provider on embedded-agent paths after a regression between `2026.5.12` and `2026.5.19`.

The four functional commits are independent and can be split if preferred, but they share one theme: make the gateway self-heal from transient infrastructure blips instead of wedging until a human restarts it. The final commit is review polish: it preserves the active-abort recovery flag for the newly recovery-eligible terminal embedded-run case and removes local emergency-patch wording from source comments.
1. `fix(diagnostic): pump lane on idle + recover from terminal-progress stalls`

Two issues in the per-lane command queue:

- Lane pump on idle (`src/logging/diagnostic.ts`): `logSessionStateChange()` decrements `queueDepth` when a lane returns to idle but never re-triggers the dequeue. Normally `drainLane()` re-fires recursively, but in production we observed lanes that go `idle` with `queueDepth > 0` and never dequeue, stranding queued messages. Fix: on idle transition with `queueDepth > 0` and a known `sessionKey`, call `resetCommandLane(resolveEmbeddedSessionLane(sessionKey))`. It is a no-op when the lane queue is already empty.
- Stall recovery for terminal active work (`src/logging/diagnostic-session-attention.ts`): `classifySessionAttention()` flagged `queued_behind_terminal_active_work` as `recoveryEligible: false`, so the existing recovery coordinator never fired and the lane wedged (`recovery=none`). Fix: mark it `recoveryEligible: true` so the existing recovery path runs.

The follow-up commit also keeps `allowActiveAbort: true` when the terminal embedded-run case has crossed the abort threshold; otherwise the recovery runtime can still skip because it sees an active embedded run.

2 & 3. `fix(auth): ... legacy OAuth sidecars in secrets-runtime store load` (+ follow-up for all entry points)

Regression: after upgrading `2026.5.12 -> 2026.5.19`, embedded agent turns (channel replies, cron-isolated runs) fail with `No API key found for provider "openai-codex"` even though the OAuth profile is valid and direct CLI inference works with the same profile.

Root cause: PR #82777 removed sidecar runtime support; follow-up #83312 reintroduced it via a helper used by the OAuth manager's refresh path. The parallel secrets-runtime store-load helpers were still defaulting `resolveLegacyOAuthSidecars: false`, so OAuth profiles whose credential material lives in the legacy sidecar layout (`oauthRef.source: "openclaw-credentials"`, hash-named files under `<state-dir>/credentials/auth-profiles/<id>.json`) were loaded without access/refresh tokens.

Fix (`src/agents/auth-profiles/store.ts`): default `resolveLegacyOAuthSidecars` to `true` in:

- `loadAuthProfileStoreForSecretsRuntime`
- `loadAuthProfileStoreWithoutExternalProfiles`
- `ensureAuthProfileStoreWithoutExternalProfiles`

These helpers are read-only and do not mutate persisted state; they only include credential material that is already on disk.
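A minimal sketch of the default flip, under simplified assumptions. `StoredProfile`, `SidecarReader`, `loadProfiles`, and `isEligible` are hypothetical stand-ins for illustration, not the actual helpers in `src/agents/auth-profiles/store.ts`; only the option name `resolveLegacyOAuthSidecars` and the `oauthRef.source` value come from the description above. It shows why a profile whose tokens live in a legacy sidecar loads without credential material when the option is `false`, and becomes eligible once the option defaults to `true`.

```typescript
// Hypothetical shapes for illustration only; not the real openclaw types.
interface StoredProfile {
  id: string;
  provider: string;
  access?: string;
  refresh?: string;
  oauthRef?: { source: string; id: string };
}

interface LoadOptions {
  resolveLegacyOAuthSidecars?: boolean;
}

type SidecarTokens = { access: string; refresh: string };

// Hypothetical reader: returns sidecar tokens for an id, or undefined if none exist.
type SidecarReader = (id: string) => SidecarTokens | undefined;

function loadProfiles(
  profiles: StoredProfile[],
  readSidecar: SidecarReader,
  options: LoadOptions = {},
): StoredProfile[] {
  // The fix described above corresponds to this default being true instead of false.
  const resolveSidecars = options.resolveLegacyOAuthSidecars ?? true;
  return profiles.map((profile) => {
    // Skip profiles that already carry inline tokens or have no sidecar reference.
    if (!resolveSidecars || !profile.oauthRef || profile.access) {
      return profile;
    }
    const tokens = readSidecar(profile.oauthRef.id);
    // Read-only: return a merged copy and never write anything back.
    return tokens ? { ...profile, ...tokens } : profile;
  });
}

// Simplified eligibility check: a profile needs both tokens to be usable.
function isEligible(profile: StoredProfile): boolean {
  return Boolean(profile.access && profile.refresh);
}
```

With the option left at its new default, a legacy `oauthRef`-backed profile passes `isEligible`; passing `{ resolveLegacyOAuthSidecars: false }` reproduces the regression, where the same profile is filtered out before any refresh path is consulted.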
4. `fix(telegram): restart isolated polling cycle when bot loses initialized state`

After a network drop mid-cycle, the grammy bot can end up `Bot not initialized!` without the polling session noticing. Every subsequent spooled-update handler can then fail with the same error in a tight retry loop, recoverable only by an external gateway restart.

Fix (`extensions/telegram/src/polling-session.ts`): the spool-failure path detects the `Bot not initialized` error and asks the active isolated ingress cycle to abort via a one-shot callback. The existing `try/finally` tears the cycle down, and the outer `runUntilAbort` loop creates a fresh bot and re-runs `bot.init()`.

Real behavior proof
Behavior or issue addressed: A self-hosted OpenClaw gateway could wedge after transient infrastructure failures: embedded OpenAI Codex OAuth profiles failed to resolve legacy sidecar tokens in cron/Telegram embedded paths, Telegram polling could loop on `Bot not initialized!` after a network drop, and queued lane work could remain stuck behind terminal embedded progress instead of recovering.

Real environment tested: Real self-hosted OpenClaw gateway running OpenClaw 2026.5.19 from the patched fork build `9d27317` on a user systemd gateway service, using Telegram direct messages plus isolated cron agent turns with provider `openai-codex` / model `openai/gpt-5.5`. Local paths, PIDs, account IDs, and user identifiers are redacted.

Exact steps or command run after this patch: Activated the patched fork build on the real gateway, confirmed the gateway process was active, sent Telegram messages through the live bot, forced/observed an isolated AgentOS task-board sweep cron using `openai/gpt-5.5`, then checked the current gateway PID logs for OAuth and bot-init recurrence.

Evidence after fix: Redacted terminal output from the live gateway after activation:

Redacted live cron result from the same gateway:

Observed result after fix: The gateway kept responding to direct Telegram messages, the isolated cron embedded-agent run completed and delivered with `openai-codex`, current-PID logs showed zero `No API key found for provider` occurrences after activation, and no `Bot not initialized!` retry storm recurred on the patched runtime.

What was not tested: No browser UI flow was exercised. The evidence is from the real gateway runtime, Telegram path, isolated cron path, service state, and current-PID runtime logs; targeted automated tests are listed separately below.
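The log check described above (scanning current-PID logs for recurrence of the two error strings) could be scripted roughly as follows. This is a sketch, not the procedure actually used for the evidence: `countMarkers` and `isClean` are hypothetical helpers, and the sketch assumes the relevant log output has already been captured into a string. Only the two marker strings come from the description above.

```typescript
// The two error strings named in the behavior proof.
const MARKERS: readonly string[] = [
  "No API key found for provider",
  "Bot not initialized!",
];

// Count how many log lines contain each marker (a line counts at most once per marker).
function countMarkers(
  logText: string,
  markers: readonly string[] = MARKERS,
): Map<string, number> {
  const counts = new Map<string, number>();
  for (const marker of markers) {
    counts.set(marker, 0);
  }
  for (const line of logText.split(/\r?\n/)) {
    for (const marker of markers) {
      if (line.includes(marker)) {
        counts.set(marker, (counts.get(marker) ?? 0) + 1);
      }
    }
  }
  return counts;
}

// True when no marker appeared in any line.
function isClean(counts: Map<string, number>): boolean {
  return Array.from(counts.values()).every((n) => n === 0);
}
```

A run counts as clean only when every marker has a count of zero, which matches the "zero occurrences" criterion stated in the observed result.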
Testing
Local targeted checks run on this branch:
Notes
`main` with no conflicts.