
2026.4.8: Large-session overflow/compaction timeout can cascade into GatewayDrainingError + subagent announce loss; fallback chain stale until restart #63279

@EthanSK

Description

Summary

After upgrading to OpenClaw 2026.4.8, very large Telegram sessions repeatedly hit context overflow and compaction timeout. In this incident, that appeared to cascade into prolonged gateway draining/task rejection, repeated subagent announce failures, and surprising fallback behavior.

This does not look like just “Anthropic fallback is broken.” The stronger bug shape is a failure chain:

  1. huge-session overflow + compaction timeout
  2. gateway enters draining/restart state and rejects new tasks
  3. subagent completion announce retries fail/give up
  4. fallback decisions during drain can still route into stale runtime fallback candidates

The user reports this behavior did not happen before this version.

Environment

  • OpenClaw: 2026.4.8
  • OS: macOS arm64
  • Channel: Telegram (multiple accounts)
  • Large long-lived sessions (hundreds to >1300 messages)

Evidence (local)

1) Massive overflow + compaction timeout

From artifacts/openclaw-2026-04-08-incident-extract.txt:

  • estimatedPromptTokens=1014988 / overflowTokens=759372
  • messages=1331 / messages=1351+ on affected Telegram sessions
  • multiple compaction failures at ~900s:
    • outcome=failed reason=timeout durationMs=900342
    • outcome=failed reason=timeout durationMs=900533
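For scale: if overflowTokens is computed as estimated prompt size minus the effective context budget (an assumption — the formula is not shown in the logs), the extract's numbers imply a budget of roughly 256k tokens, with the prompt nearly 4x over it:

```python
# Figures taken from artifacts/openclaw-2026-04-08-incident-extract.txt.
estimated_prompt_tokens = 1_014_988
overflow_tokens = 759_372

# Assumption: overflow = estimated prompt tokens - effective context budget.
implied_budget = estimated_prompt_tokens - overflow_tokens
print(implied_budget)  # 255616 — close to a 256k-token window less some headroom
```

At ~4x the implied budget, a single compaction pass over 1300+ messages plausibly cannot finish inside the ~900s timeout seen above.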

2) During failure chain, gateway rejects tasks as draining

From incident extract and ~/.openclaw/logs/gateway.err.log:

  • GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted
  • repeated drain timeout reached; proceeding with restart
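These two log lines are consistent with a gateway that flips into a draining state, rejects all new submissions, and forces a restart once a drain deadline passes. A minimal hypothetical sketch of that shape (class, message, and timeout values mirror the log lines only; none of this is OpenClaw's actual code):

```python
import time


class GatewayDrainingError(RuntimeError):
    """Raised while the gateway refuses new work during drain."""


class Gateway:
    # drain_timeout_s is illustrative; the real drain deadline is not in the logs
    def __init__(self, drain_timeout_s: float = 60.0):
        self.draining = False
        self.drain_deadline = 0.0
        self.drain_timeout_s = drain_timeout_s

    def begin_drain(self) -> None:
        self.draining = True
        self.drain_deadline = time.monotonic() + self.drain_timeout_s

    def submit(self, task) -> None:
        if self.draining:
            # matches: "GatewayDrainingError: Gateway is draining for restart;
            # new tasks are not accepted"
            raise GatewayDrainingError(
                "Gateway is draining for restart; new tasks are not accepted")
        # ...normal dispatch would happen here

    def drain_expired(self) -> bool:
        # matches: "drain timeout reached; proceeding with restart"
        return self.draining and time.monotonic() >= self.drain_deadline
```

If compaction failures are what trigger `begin_drain()`, every downstream caller in that window sees hard rejections rather than backpressure — which would explain the cascade in steps 2–3 of the failure chain.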

3) Subagent completion announce failures/retries during this state

  • Subagent completion direct announce failed ... GatewayDrainingError
  • Subagent announce completion ... transient failure, retrying
  • Subagent announce give up (retry-limit)
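The announce loss matches a bounded-retry pattern in which GatewayDrainingError is treated as transient but the retry budget is smaller than the drain window, so the announce is silently dropped. A hypothetical sketch of that failure mode (function name and retry count are illustrative; the actual limit is not shown in the logs):

```python
class GatewayDrainingError(RuntimeError):
    """Stand-in for the error seen in the gateway logs."""


def announce_completion(send, max_retries: int = 3) -> bool:
    """Try to deliver a subagent completion announce; drop it after max_retries."""
    for attempt in range(1, max_retries + 1):
        try:
            send()
            return True
        except GatewayDrainingError:
            # "transient failure, retrying" — but if the drain outlasts
            # the retry budget, we fall through and lose the announce.
            continue
    # "Subagent announce give up (retry-limit)"
    return False
```

If the drain lasts minutes while retries are a handful of immediate attempts, every completion in that window is lost; queuing announces until the drain ends (or persisting them for post-restart delivery) would avoid the loss.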

4) Anthropic fallback still appeared in live fallback decisions during drain

Even after removing Anthropic fallback from config on disk, log lines during draining still showed:

  • next=anthropic/claude-haiku-4-5 detail=Gateway is draining for restart; new tasks are not accepted
  • and auth failure attempts:
    • candidate=anthropic/claude-haiku-4-5 reason=auth ... HTTP 401 authentication_error: invalid x-api-key

5) On-disk config had Anthropic removed, but runtime lagged until restart

  • Current ~/.openclaw/openclaw.json fallback list is only:
    • openai-codex/gpt-5.4 -> openai-codex/gpt-5.3-codex
  • Local commit removing Anthropic fallback:
    • 19172db Remove Anthropic model fallback config
  • openclaw.json metadata shows it was touched at 2026-04-08T17:10:23.793Z
  • But runtime logs still had next=anthropic/claude-haiku-4-5 at 2026-04-08T17:10:27.547+01:00

This suggests live runtime config/fallback chain can remain stale until gateway restart/reload.
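Note that the two timestamps above carry different UTC offsets (`Z` vs `+01:00`), so it is worth normalizing both to UTC before drawing ordering conclusions. A quick check with the values from the extract:

```python
from datetime import datetime, timezone

# Timestamps exactly as they appear in the file metadata and the runtime log.
config_touched = datetime.fromisoformat(
    "2026-04-08T17:10:23.793Z".replace("Z", "+00:00"))
runtime_log = datetime.fromisoformat("2026-04-08T17:10:27.547+01:00")

print(config_touched.astimezone(timezone.utc).isoformat())
# → 2026-04-08T17:10:23.793000+00:00
print(runtime_log.astimezone(timezone.utc).isoformat())
# → 2026-04-08T16:10:27.547000+00:00
```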

Doctor note

openclaw doctor --fix was run locally, but this alone did not reload the running gateway process.
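This is the classic symptom of config being snapshotted into process memory at startup: editing the file (or running a fixer against it) changes only the disk copy, and the running gateway keeps serving its stale snapshot until an explicit reload or restart. A minimal illustration of the pattern (hypothetical; not OpenClaw's actual loader):

```python
import json
import pathlib
import tempfile


class FallbackRouter:
    def __init__(self, config_path: pathlib.Path):
        self.config_path = config_path
        self.chain = self._load()   # snapshot taken once, at startup

    def _load(self) -> list:
        return json.loads(self.config_path.read_text())["fallbacks"]

    def reload(self) -> None:
        self.chain = self._load()   # only an explicit reload refreshes it


path = pathlib.Path(tempfile.gettempdir()) / "openclaw-fallback-sketch.json"
path.write_text(json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex",
                                          "anthropic/claude-haiku-4-5"]}))
router = FallbackRouter(path)

# Edit on disk: remove the Anthropic candidate (analogous to commit 19172db).
path.write_text(json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex"]}))

stale_has_anthropic = "anthropic/claude-haiku-4-5" in router.chain
print(stale_has_anthropic)   # True  — runtime still routes to the removed provider
router.reload()
fresh_has_anthropic = "anthropic/claude-haiku-4-5" in router.chain
print(fresh_has_anthropic)   # False — correct only after reload
```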

Expected behavior

  1. Very large-session overflow/compaction failure should degrade gracefully without cascading into prolonged drain/task rejection loops.
  2. Subagent completion announce should not be lost/give-up during gateway drain windows.
  3. Runtime fallback chain should not continue using removed fallback providers after config changes are applied to disk.
  4. If restart/reload is required for fallback-chain changes, surface this clearly in CLI/doctor output.
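For item 4, one cheap way a doctor-style check could surface the mismatch — a sketch under the assumption that the gateway records a hash of the config it loaded, which is not a documented OpenClaw mechanism — is to compare that startup hash against a hash of the current on-disk file:

```python
import hashlib
import json


def config_hash(raw: bytes) -> str:
    """Short content hash of a config blob."""
    return hashlib.sha256(raw).hexdigest()[:12]


def needs_restart(loaded_hash: str, on_disk: bytes) -> bool:
    """True if the on-disk config no longer matches what the gateway loaded."""
    return config_hash(on_disk) != loaded_hash


old = json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex",
                                "anthropic/claude-haiku-4-5"]}).encode()
new = json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex"]}).encode()

loaded = config_hash(old)              # hash recorded at gateway startup
print(needs_restart(loaded, old))      # False — disk matches runtime
print(needs_restart(loaded, new))      # True  — restart/reload required
```

A doctor run that found `needs_restart(...)` true could then print an explicit "config changed on disk; restart the gateway to apply" warning instead of silently fixing only the file.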

Actual behavior

  • Overflow + compaction timeout chain coincided with gateway draining errors and task rejection.
  • Subagent announce retries frequently failed/gave up.
  • Fallback routing still referenced Anthropic during draining, causing 401 auth errors, despite Anthropic fallback being removed on disk.

Request

Please investigate this as a possible 2026.4.8 regression/failure-chain interaction:

  • large-session overflow/compaction timeout
  • gateway drain/restart task rejection behavior
  • subagent announce resilience during drain
  • runtime config/fallback-chain reload semantics (especially after removing providers)

If useful, I can provide the extracted incident artifacts/log snippets listed above.
