
2026.4.8: Large-session overflow/compaction timeout can cascade into GatewayDrainingError + subagent announce loss; fallback chain stale until restart #63279

@EthanSK

Description

Summary

After upgrading to OpenClaw 2026.4.8, very large Telegram sessions repeatedly hit context overflow and compaction timeout. In this incident, that appeared to cascade into prolonged gateway draining/task rejection, repeated subagent announce failures, and surprising fallback behavior.

This does not look like just “Anthropic fallback is broken.” The stronger bug shape is a failure chain:

  1. huge-session overflow + compaction timeout
  2. gateway enters draining/restart state and rejects new tasks
  3. subagent completion announce retries fail/give up
  4. fallback decisions during drain can still route into stale runtime fallback candidates

The user reports this behavior did not happen before this version.

Environment

  • OpenClaw: 2026.4.8
  • OS: macOS arm64
  • Channel: Telegram (multiple accounts)
  • Large long-lived sessions (hundreds to >1300 messages)

Evidence (local)

1) Massive overflow + compaction timeout

From artifacts/openclaw-2026-04-08-incident-extract.txt:

  • estimatedPromptTokens=1014988 / overflowTokens=759372
  • messages=1331 / messages=1351+ on affected Telegram sessions
  • multiple compaction failures at ~900s:
    • outcome=failed reason=timeout durationMs=900342
    • outcome=failed reason=timeout durationMs=900533
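For scale: if overflowTokens is computed as estimated prompt size minus the effective context budget (an assumption — the formula is not shown in the logs), the extract's numbers imply a budget of roughly 256k tokens, with the prompt nearly 4x over it:

```python
# Figures taken from artifacts/openclaw-2026-04-08-incident-extract.txt.
estimated_prompt_tokens = 1_014_988
overflow_tokens = 759_372

# Assumption: overflow = estimated prompt tokens - effective context budget.
implied_budget = estimated_prompt_tokens - overflow_tokens
print(implied_budget)  # 255616 — close to a 256k-token window less some headroom
```

At ~4x the implied budget, a single compaction pass over 1300+ messages plausibly cannot finish inside the ~900s timeout seen above.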

2) During failure chain, gateway rejects tasks as draining

From incident extract and ~/.openclaw/logs/gateway.err.log:

  • GatewayDrainingError: Gateway is draining for restart; new tasks are not accepted
  • repeated drain timeout reached; proceeding with restart
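These two log lines are consistent with a gateway that flips into a draining state, rejects all new submissions, and forces a restart once a drain deadline passes. A minimal hypothetical sketch of that shape (class, message, and timeout values mirror the log lines only; none of this is OpenClaw's actual code):

```python
import time


class GatewayDrainingError(RuntimeError):
    """Raised while the gateway refuses new work during drain."""


class Gateway:
    # drain_timeout_s is illustrative; the real drain deadline is not in the logs
    def __init__(self, drain_timeout_s: float = 60.0):
        self.draining = False
        self.drain_deadline = 0.0
        self.drain_timeout_s = drain_timeout_s

    def begin_drain(self) -> None:
        self.draining = True
        self.drain_deadline = time.monotonic() + self.drain_timeout_s

    def submit(self, task) -> None:
        if self.draining:
            # matches: "GatewayDrainingError: Gateway is draining for restart;
            # new tasks are not accepted"
            raise GatewayDrainingError(
                "Gateway is draining for restart; new tasks are not accepted")
        # ...normal dispatch would happen here

    def drain_expired(self) -> bool:
        # matches: "drain timeout reached; proceeding with restart"
        return self.draining and time.monotonic() >= self.drain_deadline
```

If compaction failures are what trigger `begin_drain()`, every downstream caller in that window sees hard rejections rather than backpressure — which would explain the cascade in steps 2–3 of the failure chain.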

3) Subagent completion announce failures/retries during this state

  • Subagent completion direct announce failed ... GatewayDrainingError
  • Subagent announce completion ... transient failure, retrying
  • Subagent announce give up (retry-limit)
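The announce loss matches a bounded-retry pattern in which GatewayDrainingError is treated as transient but the retry budget is smaller than the drain window, so the announce is silently dropped. A hypothetical sketch of that failure mode (function name and retry count are illustrative; the actual limit is not shown in the logs):

```python
class GatewayDrainingError(RuntimeError):
    """Stand-in for the error seen in the gateway logs."""


def announce_completion(send, max_retries: int = 3) -> bool:
    """Try to deliver a subagent completion announce; drop it after max_retries."""
    for attempt in range(1, max_retries + 1):
        try:
            send()
            return True
        except GatewayDrainingError:
            # "transient failure, retrying" — but if the drain outlasts
            # the retry budget, we fall through and lose the announce.
            continue
    # "Subagent announce give up (retry-limit)"
    return False
```

If the drain lasts minutes while retries are a handful of immediate attempts, every completion in that window is lost; queuing announces until the drain ends (or persisting them for post-restart delivery) would avoid the loss.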

4) Anthropic fallback still appeared in live fallback decisions during drain

Even after removing Anthropic fallback from config on disk, log lines during draining still showed:

  • next=anthropic/claude-haiku-4-5 detail=Gateway is draining for restart; new tasks are not accepted
  • and auth failure attempts:
    • candidate=anthropic/claude-haiku-4-5 reason=auth ... HTTP 401 authentication_error: invalid x-api-key

5) On-disk config had Anthropic removed, but runtime lagged until restart

  • Current ~/.openclaw/openclaw.json fallback list is only:
    • openai-codex/gpt-5.4 -> openai-codex/gpt-5.3-codex
  • Local commit removing Anthropic fallback:
    • 19172db Remove Anthropic model fallback config
  • openclaw.json metadata shows it was touched at 2026-04-08T17:10:23.793Z
  • But runtime logs still had next=anthropic/claude-haiku-4-5 at 2026-04-08T17:10:27.547+01:00

This suggests live runtime config/fallback chain can remain stale until gateway restart/reload.
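Note that the two timestamps above carry different UTC offsets (`Z` vs `+01:00`), so it is worth normalizing both to UTC before drawing ordering conclusions. A quick check with the values from the extract:

```python
from datetime import datetime, timezone

# Timestamps exactly as they appear in the file metadata and the runtime log.
config_touched = datetime.fromisoformat(
    "2026-04-08T17:10:23.793Z".replace("Z", "+00:00"))
runtime_log = datetime.fromisoformat("2026-04-08T17:10:27.547+01:00")

print(config_touched.astimezone(timezone.utc).isoformat())
# → 2026-04-08T17:10:23.793000+00:00
print(runtime_log.astimezone(timezone.utc).isoformat())
# → 2026-04-08T16:10:27.547000+00:00
```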

Doctor note

openclaw doctor --fix was run locally, but this alone did not reload the running gateway process.
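This is the classic symptom of config being snapshotted into process memory at startup: editing the file (or running a fixer against it) changes only the disk copy, and the running gateway keeps serving its stale snapshot until an explicit reload or restart. A minimal illustration of the pattern (hypothetical; not OpenClaw's actual loader):

```python
import json
import pathlib
import tempfile


class FallbackRouter:
    def __init__(self, config_path: pathlib.Path):
        self.config_path = config_path
        self.chain = self._load()   # snapshot taken once, at startup

    def _load(self) -> list:
        return json.loads(self.config_path.read_text())["fallbacks"]

    def reload(self) -> None:
        self.chain = self._load()   # only an explicit reload refreshes it


path = pathlib.Path(tempfile.gettempdir()) / "openclaw-fallback-sketch.json"
path.write_text(json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex",
                                          "anthropic/claude-haiku-4-5"]}))
router = FallbackRouter(path)

# Edit on disk: remove the Anthropic candidate (analogous to commit 19172db).
path.write_text(json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex"]}))

stale_has_anthropic = "anthropic/claude-haiku-4-5" in router.chain
print(stale_has_anthropic)   # True  — runtime still routes to the removed provider
router.reload()
fresh_has_anthropic = "anthropic/claude-haiku-4-5" in router.chain
print(fresh_has_anthropic)   # False — correct only after reload
```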

Expected behavior

  1. Very large-session overflow/compaction failure should degrade gracefully without cascading into prolonged drain/task rejection loops.
  2. Subagent completion announce should not be lost/give-up during gateway drain windows.
  3. Runtime fallback chain should not continue using removed fallback providers after config changes are applied to disk.
  4. If restart/reload is required for fallback-chain changes, surface this clearly in CLI/doctor output.
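For item 4, one cheap way a doctor-style check could surface the mismatch — a sketch under the assumption that the gateway records a hash of the config it loaded, which is not a documented OpenClaw mechanism — is to compare that startup hash against a hash of the current on-disk file:

```python
import hashlib
import json


def config_hash(raw: bytes) -> str:
    """Short content hash of a config blob."""
    return hashlib.sha256(raw).hexdigest()[:12]


def needs_restart(loaded_hash: str, on_disk: bytes) -> bool:
    """True if the on-disk config no longer matches what the gateway loaded."""
    return config_hash(on_disk) != loaded_hash


old = json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex",
                                "anthropic/claude-haiku-4-5"]}).encode()
new = json.dumps({"fallbacks": ["openai-codex/gpt-5.3-codex"]}).encode()

loaded = config_hash(old)              # hash recorded at gateway startup
print(needs_restart(loaded, old))      # False — disk matches runtime
print(needs_restart(loaded, new))      # True  — restart/reload required
```

A doctor run that found `needs_restart(...)` true could then print an explicit "config changed on disk; restart the gateway to apply" warning instead of silently fixing only the file.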

Actual behavior

  • Overflow + compaction timeout chain coincided with gateway draining errors and task rejection.
  • Subagent announce retries frequently failed/gave up.
  • Fallback routing still referenced Anthropic during draining, causing 401 auth errors, despite Anthropic fallback being removed on disk.

Request

Please investigate this as a possible 2026.4.8 regression/failure-chain interaction:

  • large-session overflow/compaction timeout
  • gateway drain/restart task rejection behavior
  • subagent announce resilience during drain
  • runtime config/fallback-chain reload semantics (especially after removing providers)

If useful, I can provide the extracted incident artifacts/log snippets listed above.
