Summary
After upgrading a live macOS OpenClaw install to 2026.5.5 (Discord plugin 2026.5.6), Discord appeared healthy and inbound events had previously been observed, but Jarvis/main stopped producing normal Discord replies. Manual Discord send through the gateway still worked, so the Discord token/send permissions were not the root cause.
The failure appears to be a model/runtime migration + stale session/task recovery issue:
- upgrade/doctor rewrote working openai-codex/* model refs toward openai/* and stamped some agents/sessions with Codex runtime fields
- this install did not have a usable openai provider route for that rewritten config
- gateway logs showed model-provider resolution errors and Codex app-server fallback errors
- existing Discord channel sessions could remain stalled/queued, and old background tasks remained stale_running
- externally, the user-visible symptom was simply: Discord bot is connected, but no agent reply appears
Environment
- OpenClaw: 2026.5.5
- Discord plugin: 2026.5.6
- OS: macOS Darwin arm64
- Gateway: LaunchAgent, local loopback 127.0.0.1:18789
- Channel: Discord guild channels
- Main bot: Jarvis/main
- Install/config paths sanitized, but this was an npm/Homebrew global install with OpenClaw config under the user OpenClaw home
Observed symptoms
Before cleanup:
- openclaw gateway status reported gateway running and connectivity probe OK
- Discord accounts reported running/connected
- lastInboundAt had advanced during the failure window
- lastOutboundAt stayed null
- user saw no replies in Discord
- direct Discord app/channel send had not yet been isolated
Task audit showed three stale running tasks from earlier repair/subagent/heartbeat work:
openclaw tasks list --status running --json
count: 3
openclaw tasks audit --json classified them as:
stale_running: 3
errors: 3
Gateway logs around the failure window included errors of this shape:
CodexAppServerRpcError: failed to load configuration: Model provider `anthropic` not found
CodexAppServerRpcError: failed to load configuration: Model provider `crofai` not found
CodexAppServerRpcError: failed to load configuration: Model provider `minimax-direct` not found
Error: Codex app-server auth profile "zai:default" must belong to provider "openai-codex" or a supported alias.
FailoverError: LLM request timed out.
stalled session: sessionKey=agent:main:discord:channel:<channel-id> state=processing queueDepth=1
The important user-visible gap: none of these were surfaced as a clear Discord error/reply. The bot simply appeared silent.
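The silent state described above can be recognized from the two timestamps alone. The following is a minimal sketch, not OpenClaw code: the field names lastInboundAt and lastOutboundAt come from this report, while the `ChannelStatus` shape, the `connected` key, and the `looks_silent` helper are assumptions made for illustration.

```python
from typing import Optional, TypedDict


class ChannelStatus(TypedDict):
    # Hypothetical snapshot shape; only the two timestamp names come from the report.
    connected: bool
    lastInboundAt: Optional[str]
    lastOutboundAt: Optional[str]


def looks_silent(before: ChannelStatus, after: ChannelStatus) -> bool:
    """Return True when inbound activity advanced but no outbound was ever recorded."""
    if not after["connected"]:
        # A disconnected account is a different failure that status already shows.
        return False
    inbound_advanced = (
        after["lastInboundAt"] is not None
        and after["lastInboundAt"] != before["lastInboundAt"]
    )
    return inbound_advanced and after["lastOutboundAt"] is None
```

Comparing two snapshots taken a few minutes apart would flag exactly the pattern seen here: the account stays connected, inbound moves, outbound never appears.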
What fixed this install locally
- Restored the config to a working openai-codex/gpt-* model route for the affected agents instead of the upgrade/doctor-rewritten openai/gpt-* + Codex runtime path.
- Removed stale per-session Codex runtime/harness overrides from session stores.
- Cancelled the three stale running tasks with openclaw tasks cancel <taskId>.
- Restarted the gateway.
- Verified gateway and Discord accounts came back healthy.
- Verified direct Discord outbound send path:
openclaw message send --channel discord --account main --target channel:<channel-id> --message "OpenClaw upgrade diagnostic: Jarvis outbound Discord send path is live. Testing only." --json
Result:
{
  "ok": true,
  "result": {
    "messageId": "<discord-message-id>",
    "channelId": "<channel-id>"
  }
}
After cleanup:
openclaw tasks list --status running --json
count: 0
openclaw health --json showed:
ok: true
Discord main connected: true
Discord specialists connected: true
lastError: null
restartPending: false
Expected behavior
Upgrade/doctor migration should not leave a Discord install in a state where:
- Discord is connected
- inbound messages are accepted or were recently accepted
- manual Discord send path works
- but normal agent replies silently fail or stall due to model/runtime/session state
If model/runtime migration cannot be made safely, OpenClaw should do at least one of the following:
- preserve the known-good model/runtime route
- emit a hard validation warning before restart
- surface a clear channel-visible or status-visible error
- provide a targeted repair that does not also rewrite working model refs
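One way to picture the "preserve the known-good route" and "validate before restart" options is to plan the rewrite in memory and only accept a rewritten ref when its target provider is usable. This is a hedged sketch under stated assumptions, not the doctor implementation: `plan_model_migration`, its arguments, and the `gpt-example` model name are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_model_migration(
    agent_models: Dict[str, str],
    rewrites: Dict[str, str],
    usable_providers: Set[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Plan provider rewrites without touching config.

    A ref is rewritten only if the target provider is usable; otherwise the
    existing ref is kept and a warning is recorded for the operator.
    """
    planned: Dict[str, str] = {}
    warnings: List[str] = []
    for agent, ref in agent_models.items():
        provider, sep, model = ref.partition("/")
        # Refs without a "provider/model" shape are left untouched.
        target = rewrites.get(provider, provider) if sep else provider
        if target != provider and target in usable_providers:
            planned[agent] = f"{target}/{model}"
        else:
            planned[agent] = ref  # preserve the known-good route
            if target != provider:
                warnings.append(
                    f"{agent}: kept {ref}; provider {target!r} is not usable"
                )
    return planned, warnings


# Example: the rewrite target "openai" is not usable, so the ref is preserved.
planned, warnings = plan_model_migration(
    {"main": "openai-codex/gpt-example"},
    {"openai-codex": "openai"},
    {"openai-codex"},
)
```

Because the plan is computed before anything is written, a non-empty warnings list can block the restart or be shown to the operator instead of silently changing the route.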
Actual behavior
The install became Discord-silent. Gateway/channel status looked broadly healthy, but the agent reply path failed/stalled. The actionable errors were only discoverable by combining gateway logs, model status, task audit, and session-state inspection.
Why this is hard to debug
Several signals point in different directions:
- openclaw gateway status: OK
- Discord accounts: connected
- Discord token/send path: OK after manual send test
- channel security: no warnings
- doctor --fix: recommended, but in this install it was exactly the risky path because it would rewrite openai-codex/* back toward the route that broke the install
- stale task/session state: only obvious through tasks audit and session inspection
Suggested fixes
- Make doctor/upgrade model-runtime migration transactional and validate the post-migration provider route before writing config/session runtime overrides.
- If openai/* provider is not usable, do not rewrite a working openai-codex/* route into it.
- Add a targeted command to clear stale session runtime overrides without running all of doctor --fix.
- Include tasks audit stale-running findings in doctor/status when channel replies are stalled.
- When Discord inbound-to-agent fails before producing a reply, send a visible fallback error or at least update channel health with a clear lastReplyError.
- Consider a diagnostic that differentiates:
- Discord gateway connected
- Discord outbound API send works
- inbound message routed to agent session
- model turn completed
- Discord reply emitted
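The staged diagnostic in the last item could report the first stage that fails, so "connected but silent" resolves to a specific step. The sketch below is illustrative only; the stage identifiers and the `first_failed_stage` helper are assumptions, and how each stage would actually be probed is left open.

```python
from typing import Dict, List, Optional

# Hypothetical stage names, in the order a Discord message flows through the system.
STAGES: List[str] = [
    "gateway_connected",
    "outbound_send_ok",
    "inbound_routed",
    "model_turn_completed",
    "reply_emitted",
]


def first_failed_stage(results: Dict[str, bool]) -> Optional[str]:
    """Return the earliest stage that did not pass, or None if all passed.

    A stage missing from `results` is treated as failed, so an unprobed
    stage is never reported as healthy.
    """
    for stage in STAGES:
        if not results.get(stage, False):
            return stage
    return None
```

For an install in the state described in this report, the first three stages would pass and the result would point at the model turn rather than at Discord connectivity.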
Impact
High for live Discord installs: the bot appears present and healthy, but users get no replies. The repair requires knowing to inspect model-runtime migration state, stale tasks, session runtime overrides, and Discord send path separately.