[Bug]: No notification when all models fail auth — gateway silently dead for hours #20036

@nikolasdehor

Description

When all configured LLM providers fail authentication simultaneously, the gateway continues running (process stays alive, LaunchAgent healthy) but is completely unable to process any messages. No notification is sent to the user via any channel.

Steps to Reproduce

  1. Configure primary model (e.g., openai-codex/gpt-5.3-codex) with OAuth
  2. Configure fallback model (e.g., anthropic/claude-opus-4-6) with token auth
  3. Wait for the OAuth access token to expire (~60-90 min for Codex OAuth)
  4. If the auto-refresh fails silently, both the primary and the fallback return 401
  5. Observe: gateway process stays alive, LaunchAgent shows healthy, but no messages are processed and no alert is sent

What I Observed

From gateway.err.log (2026-02-18, UTC):

03:59:27 - Embedded agent failed: openai-codex: LLM request timed out | anthropic: HTTP 401 Invalid bearer token
05:16:12 - Embedded agent failed: openai-codex: HTTP 401 Invalid bearer token | anthropic: HTTP 401 Invalid bearer token
06:16:48 - Embedded agent failed: openai-codex: HTTP 401 | anthropic: HTTP 401
07:10:02 - Embedded agent failed: openai-codex: HTTP 401 | anthropic: HTTP 401
08:03:09 - Embedded agent failed: Codex cooldown | Anthropic cooldown

The gateway was effectively dead for ~4 hours with no user-facing notification. The only way to detect the issue was to manually inspect gateway.err.log.
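Until the gateway reports this itself, the manual log inspection can be approximated with a small script. The following is only a sketch: it assumes the entries in gateway.err.log have exactly the shape shown in the excerpt above (the real file may carry dates or extra fields), and the function names and the threshold of 2 are hypothetical choices, not part of OpenClaw.

```python
import re

# Assumed line shape, copied from the excerpt in this issue:
#   HH:MM:SS - Embedded agent failed: <provider>: <reason> | <provider>: <reason>
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}) - Embedded agent failed: (.+)$")


def all_providers_unauthorized(line):
    """True if the line is a failure entry in which every provider reports HTTP 401."""
    match = LINE_RE.match(line.strip())
    if not match:
        return False
    segments = [s.strip() for s in match.group(2).split("|")]
    return all("HTTP 401" in s for s in segments)


def first_total_auth_outage(lines, threshold=2):
    """Timestamp of the entry that completes `threshold` consecutive all-401
    failures, or None if no such run exists."""
    streak = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # ignore lines that are not agent-failure entries
        if all_providers_unauthorized(line):
            streak += 1
            if streak >= threshold:
                return match.group(1)
        else:
            streak = 0  # a non-auth failure (timeout, cooldown) breaks the run
    return None


sample = [
    "03:59:27 - Embedded agent failed: openai-codex: LLM request timed out | anthropic: HTTP 401 Invalid bearer token",
    "05:16:12 - Embedded agent failed: openai-codex: HTTP 401 Invalid bearer token | anthropic: HTTP 401 Invalid bearer token",
    "06:16:48 - Embedded agent failed: openai-codex: HTTP 401 | anthropic: HTTP 401",
    "07:10:02 - Embedded agent failed: openai-codex: HTTP 401 | anthropic: HTTP 401",
    "08:03:09 - Embedded agent failed: Codex cooldown | Anthropic cooldown",
]
print(first_total_auth_outage(sample))  # 06:16:48
```

On the excerpt above, an alert keyed on two consecutive all-401 entries would have fired at 06:16:48, roughly two hours before the cooldown entry.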

Expected Behavior

When all configured models fail auth (and the failure persists for, say, 2+ consecutive turns), the gateway should:

  1. Send a notification to the configured admin contact (e.g., via the first available channel, or log to a webhook)
  2. Increase the heartbeat frequency or emit a health event
  3. Surface the failure in openclaw status as an error state (not just "running")
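To make the request concrete, here is a minimal sketch of the kind of check I have in mind. It is not based on OpenClaw's internals: `AuthOutageMonitor`, `record_turn`, the `notify` callback, and the state names "running", "degraded", and "error" are all hypothetical, and only HTTP 401 is treated as an auth failure because that is what the log shows.

```python
from dataclasses import dataclass
from typing import Callable, Dict

AUTH_FAILURE_CODES = {401}  # assumption: only 401 counts as an auth failure


@dataclass
class AuthOutageMonitor:
    """Tracks consecutive turns in which every provider failed authentication."""

    notify: Callable[[str], None]  # hypothetical hook: admin channel, webhook, ...
    threshold: int = 2             # "2+ consecutive turns" from this issue
    consecutive_failures: int = 0
    alerted: bool = False

    def record_turn(self, provider_status: Dict[str, int]) -> str:
        """Record one turn's per-provider HTTP status and return a health state."""
        all_auth_failed = bool(provider_status) and all(
            code in AUTH_FAILURE_CODES for code in provider_status.values()
        )
        if not all_auth_failed:
            # At least one provider did not fail auth, so reset the streak.
            self.consecutive_failures = 0
            self.alerted = False
            return "running"
        self.consecutive_failures += 1
        if self.consecutive_failures < self.threshold:
            return "degraded"
        if not self.alerted:
            # Notify once per outage instead of on every failed turn.
            self.alerted = True
            providers = ", ".join(sorted(provider_status))
            self.notify(
                f"All models failed auth for {self.consecutive_failures} "
                f"consecutive turns: {providers}"
            )
        return "error"


sent = []
monitor = AuthOutageMonitor(notify=sent.append)
states = [
    monitor.record_turn({"openai-codex": 401, "anthropic": 401}),
    monitor.record_turn({"openai-codex": 401, "anthropic": 401}),
    monitor.record_turn({"openai-codex": 401, "anthropic": 401}),
    monitor.record_turn({"openai-codex": 200, "anthropic": 401}),
]
print(states)     # ['degraded', 'error', 'error', 'running']
print(len(sent))  # 1
```

The returned state is what a status command could surface in place of a plain "running". Sending the notification once per outage keeps an hours-long failure from flooding the admin channel.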

Environment

  • OpenClaw v2026.2.17
  • macOS, LaunchAgent
  • Primary: openai-codex/gpt-5.3-codex (OAuth)
  • Fallback: anthropic/claude-opus-4-6 (OAT token)
  • Root cause: Codex OAuth access token expired, auto-refresh failed silently

Labels: stale (marked as stale due to inactivity)
