[Bug]: HTTP 429 from GitHub Copilot causes 10-minute silent hang instead of immediate failure

### Bug type

Regression (worked before, now fails)

### Beta release blocker

No

### Summary

When a primary LLM call to GitHub Copilot's `/responses` endpoint returns **HTTP 429** (e.g. weekly rate limit exhausted), OpenClaw does **not** detect the rate-limit response. The embedded run silently waits the full configured run timeout (default `agents.defaults.timeoutSeconds = 600`, i.e. **10 minutes**) before producing a `surface_error` and replying to the user. During this window:

- No log line is emitted for the HTTP 429.
- No `model.failed` / `model.completed` trajectory event is recorded.
- No fallback profile is attempted promptly.
- The user-facing channel (Feishu / TuiTui / cron) appears completely unresponsive.

The expected behaviour is to surface the 429 within a few seconds (Copilot returns the 429 in **~1.3 s**) and either trip the configured fallback or fail the run with a useful error message.

### Steps to reproduce

### 1. Confirm Copilot returns 429 immediately for an architect-shaped payload

```bash
TOKEN=$(jq -r .token ~/.openclaw/credentials/github-copilot.token.json)

# Build payload mimicking architect-agent shape
python3 << 'PY'
import json
payload = {
    "model": "gpt-5.4",
    "stream": True,
    "instructions": "x" * 38000,                       # ~38 KB system prompt
    "input": [{"role": "user", "content": "ping"}],
    "tools": [
        {"type": "function", "name": f"tool_{i}",
         "description": "noop", "parameters": {"type":"object","properties":{}}}
        for i in range(27)
    ],
    "max_output_tokens": 32,
    "tool_choice": "auto",
}
json.dump(payload, open("/tmp/big.json","w"))
PY

curl -sS -X POST 'https://api.individual.githubcopilot.com/responses' \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -H "Copilot-Integration-Id: vscode-chat" -H "Editor-Version: vscode/1.95.0" \
  --data-binary @/tmp/big.json -w "\nHTTP:%{http_code}  TIME:%{time_total}\n"
```

Observed:
```
Sorry, you've exceeded your weekly rate limit. Please review our [Terms of Service](...).
HTTP:429  TIME:1.305
```

The response has `Content-Type: text/plain; charset=utf-8`. The 429 body is **not** JSON and, importantly, the response does not include a `Retry-After` header.

### 2. Send the same prompt through OpenClaw

`openclaw.json` excerpt (relevant defaults):
```json
{
  "agents": {
    "defaults": {
      "model":   { "primary": "github-copilot/gpt-5.4", "fallbacks": [] },
      "timeoutSeconds": 600,
      "compaction": { "mode": "safeguard" }
    }
  }
}
```

Send any user message to an agent whose context puts it past the rate-limit threshold (in our case the `architect` agent triggered Copilot's weekly premium quota). Watch the gateway log:

```
17:50:23 [feishu] dispatching to agent (session=...)
17:50:27 [plugins] tuitui: Registered tools
17:52:47 [diagnostic] stuck session: state=processing age=129s queueDepth=1
…  (stuck warnings every 30 s for 10 minutes)  …
18:00:38 [agent/embedded] embedded run timeout: runId=… timeoutMs=600000
18:00:38 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
18:00:38 [agent/embedded] embedded run failover decision: stage=assistant decision=surface_error reason=timeout from=github-copilot/gpt-5.4
18:00:39 [feishu] dispatch complete (queuedFinal=true, replies=1)
```

Trajectory file (`/home/zzl/.openclaw/agents/architect/sessions/<sid>.trajectory.jsonl`) for the run contains exactly four events: `session.started`, `trace.metadata`, `context.compiled`, `prompt.submitted`. **There is no `model.failed`, no `model.completed`, no error event.**

End to end, the agent appears unresponsive for 10 minutes.

### Expected behavior

- The 429 should be detected within a few seconds.
- A trajectory `model.failed` (or equivalent) event should be emitted with the error message body.
- A log line at WARN/ERROR level should record the HTTP status and the rate-limit reason.
- If `agents.defaults.model.fallbacks` is configured, the run should immediately fail over.
- Even with no fallback, `decision=surface_error` should fire within seconds, not minutes.

### Actual behavior

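The run hangs for the full 10-minute run timeout before responding. No 429 is logged, no `model.failed` event is recorded, and the user only receives a reply after `decision=surface_error reason=timeout` fires.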

### OpenClaw version

2026.4.22

### Operating system

Ubuntu 24.04

### Install method

npm global

### Model

`github-copilot/gpt-5.4`

### Provider / routing chain

github copilot -> GPT-5.4

### Additional provider/model setup details

_No response_

### Logs, screenshots, and evidence

See the gateway log and trajectory excerpt under "Steps to reproduce", step 2.

### Impact and severity

_No response_

### Additional information

## Root cause

I traced the code path in the installed bundle (`dist/`) and the vendored OpenAI SDK. Three independent defects compound:

### Defect 1 — Provider-stream catch swallows the error

File: `dist/provider-stream-COLujAAo.js`

The `/responses` request site is `await client.responses.create(params, options?.signal ? { signal: options.signal } : void 0)` at **line 1214**. On throw, the catch block at **lines 1230–1240**:

```js
} catch (error) {
    output.stopReason = options?.signal?.aborted ? "aborted" : "error";
    output.errorMessage = error instanceof Error ? error.message : JSON.stringify(error);
    stream.push({
        type: "error",
        reason: output.stopReason,
        error: output
    });
    stream.end();
}
```

The catch:

- does **not** call `log$3.warn` / `log$3.error`,
- does **not** record a trajectory event,
- does **not** classify the error against `RateLimitError` / status 429,
- only pushes a `{type:"error"}` frame onto the assistant message event stream.

The same shape repeats at lines **1393–1403** and **1525–1535** for sibling provider stream factories.

For comparison, the OpenAI SDK at `node_modules/openai/client.js:354–397` correctly checks `response.ok`, reads the (text/plain) body, and throws `RateLimitError` (`core/error.js:62, 112`). So the error **is** raised; OpenClaw's catch is what loses it.
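A minimal sketch of what a less lossy catch could do, assuming the thrown error carries a numeric `status` the way the SDK's `RateLimitError` does. The names `classifyProviderError`, `handleProviderError`, `log`, and `recordTrajectory` are hypothetical stand-ins, not OpenClaw APIs:

```javascript
// Hypothetical helper: turn a thrown provider error into a structured failure record.
// Assumption: OpenAI-SDK-style errors expose a numeric `status` property.
function classifyProviderError(error, signal) {
    const status = typeof error?.status === "number" ? error.status : undefined;
    const errorMessage = error instanceof Error ? error.message : JSON.stringify(error);
    const aborted = Boolean(signal?.aborted);
    const rateLimited = status === 429;
    return {
        stopReason: aborted ? "aborted" : "error",
        errorMessage,
        status,
        rateLimited,
        // A rate limit on the primary profile is a reason to try a fallback right away.
        shouldFailover: !aborted && rateLimited,
    };
}

// Hypothetical usage from inside the catch block. `log` and `recordTrajectory`
// stand in for whatever logger and trajectory writer the runtime provides.
function handleProviderError(error, { signal, log, recordTrajectory }) {
    const failure = classifyProviderError(error, signal);
    log.warn(`provider request failed: status=${failure.status ?? "n/a"} message=${failure.errorMessage}`);
    recordTrajectory({ type: "model.failed", ...failure });
    return failure;
}
```

The existing `stream.push({ type: "error", ... })` frame could still be emitted afterwards; the point of the sketch is only that the status is logged and classified before it is reduced to a message string.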

### Defect 2 — `runAbortController.signal` is not propagated to the OpenAI SDK call

File: `dist/compaction-runtime-context-c6E9Op5Z.js`, function `resolveEmbeddedAgentStreamFn` at **line 5684**.

The factory wraps the inner stream function as:

```js
return async (m, context, options) => {
    const apiKey = await resolveEmbeddedAgentApiKey({ ... });
    return inner(m, normalizeContext(context), {
        ...options,
        apiKey: apiKey ?? options?.apiKey
    });
};
```

`params.signal` (which is `runAbortController.signal` per `dist/selection-DGLE6AvW.js:6440`) is **discarded**. The wrapper only forwards an `apiKey`. As a result, when `provider-stream-COLujAAo.js:1214` reads `options?.signal`, it is `undefined` for embedded-agent calls, so the SDK fetch is invoked **without an AbortSignal**. Even when the run-timeout timer at `selection-DGLE6AvW.js:6735` eventually fires `runAbortController.abort()`, there is no plumbing from that abort into the in-flight fetch. The HTTP request can only be torn down by Node.js closing the underlying socket on process exit or by the SDK's own retry/timeout (see "Why the OpenAI SDK doesn't fail faster on its own").
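The forwarding could look roughly like the following sketch. `wrapStreamFn`, `resolveApiKey`, and `runSignal` are hypothetical names used for illustration; the real factory has more parameters than shown here:

```javascript
// Sketch of a wrapper that passes the run-level AbortSignal through to the
// inner stream function instead of dropping it.
// `inner`, `resolveApiKey`, and `runSignal` are hypothetical stand-ins.
function wrapStreamFn(inner, resolveApiKey, runSignal) {
    return async (model, context, options) => {
        const apiKey = await resolveApiKey();
        return inner(model, context, {
            ...options,
            apiKey: apiKey ?? options?.apiKey,
            // Prefer a caller-supplied signal, otherwise fall back to the run-level one.
            signal: options?.signal ?? runSignal,
        });
    };
}
```

With the signal present in `options`, the existing `options?.signal ? { signal: options.signal } : void 0` expression at the request site would hand it to the SDK, so aborting the run controller would also abort the in-flight request.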

### Defect 3 — Idle-timeout collapses to run-timeout, providing no actual idle protection

File: `dist/selection-DGLE6AvW.js`

`resolveLlmIdleTimeoutMs` at **lines 5491–5505** falls through to `agents.defaults.timeoutSeconds` when no explicit `agents.defaults.llm.idleTimeoutSeconds` is set:

```js
const agentTimeoutSeconds = params?.cfg?.agents?.defaults?.timeoutSeconds;
if (typeof agentTimeoutSeconds === "number" && ... > 0)
    return clampTimeoutMs(agentTimeoutSeconds * 1e3);
```

The call site at **lines 6549–6554** computes `runTimeoutMs` as `params.timeoutMs !== configuredRunTimeoutMs ? params.timeoutMs : void 0`, which is `void 0` for normal runs because the run timeout equals the configured run timeout. So `streamWithIdleTimeout` is wired at `idleTimeoutMs == agentTimeoutMs == 600_000`, providing zero practical idle protection on top of the run timeout.

Combined effect: the only escape hatch is the run-timeout `scheduleAbortTimer` (`selection-DGLE6AvW.js:6735`) — and that fires the abort against `runAbortController`, which (per Defect 2) is not connected to the SDK fetch.
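One way to keep the idle timeout meaningfully shorter than the run timeout is to derive a capped default when no explicit value is configured. The sketch below is an assumption about a possible shape, not the current implementation; `resolveIdleTimeoutMs`, its parameter names, and the 60-second cap are illustrative:

```javascript
// Hypothetical resolver: prefer an explicit idle timeout, otherwise derive one
// from the run timeout instead of reusing the run timeout unchanged.
const DEFAULT_IDLE_CAP_MS = 60_000; // assumed cap, not a documented OpenClaw default

function resolveIdleTimeoutMs({ idleTimeoutSeconds, runTimeoutSeconds }) {
    if (typeof idleTimeoutSeconds === "number" && idleTimeoutSeconds > 0) {
        return idleTimeoutSeconds * 1000;
    }
    if (typeof runTimeoutSeconds === "number" && runTimeoutSeconds > 0) {
        const runTimeoutMs = runTimeoutSeconds * 1000;
        // A third of the run timeout, but never more than the cap.
        return Math.min(DEFAULT_IDLE_CAP_MS, Math.floor(runTimeoutMs / 3));
    }
    return DEFAULT_IDLE_CAP_MS;
}
```

For the default `timeoutSeconds = 600`, this yields an idle timeout of 60 000 ms rather than 600 000 ms, so a stream that never produces a chunk would be cut off after one minute instead of ten.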

## Why the OpenAI SDK doesn't fail faster on its own

`node_modules/openai/client.js:354` does throw `RateLimitError` on the 429, but only after retrying. `dist/transport-stream-shared-B2Os3U8j.js:29–36` (`shouldBypassLongSdkRetry`) only stamps `x-should-retry:false` when status ∈ {408, 409, 429, ≥500} **and** a `Retry-After` header is present **and** the retry-after value exceeds 60 seconds. Copilot's weekly-quota 429 is `text/plain` with no `Retry-After`, so OpenClaw never marks it non-retryable. The SDK then performs its default retry budget (a few seconds total) before re-raising — at which point Defect 1 silently absorbs it.

This explains why the Copilot 429 takes 1.3 s direct, but OpenClaw shows no error trace: the SDK does eventually raise within ~5–10 s, but Defect 1 ensures the error never surfaces beyond the in-memory event-stream frame.
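The retry decision described above could be extended to cover Copilot's header-less 429. The following is a sketch under stated assumptions: `isTerminalRateLimit` is a hypothetical predicate, `headers` is assumed to be a plain object with lower-cased keys, and only the delta-seconds form of `Retry-After` is handled:

```javascript
// Hypothetical predicate: should this 429 be treated as terminal (no SDK retry)?
// Assumption: `headers` is a plain object keyed by lower-cased header names.
const LONG_RETRY_AFTER_SECONDS = 60;

function isTerminalRateLimit(status, headers) {
    if (status !== 429) return false;
    const retryAfter = headers["retry-after"];
    if (retryAfter === undefined) {
        // No Retry-After: treat a text/plain 429 (such as the weekly-quota reply)
        // as quota exhaustion rather than a transient limit.
        const contentType = headers["content-type"] ?? "";
        return contentType.toLowerCase().startsWith("text/plain");
    }
    // Only the delta-seconds form is handled here; HTTP-date values yield NaN
    // and are treated as retryable.
    const seconds = Number(retryAfter);
    return Number.isFinite(seconds) && seconds > LONG_RETRY_AFTER_SECONDS;
}
```

Whether a `text/plain` 429 is always terminal is a judgment call; matching the specific Copilot message body would be narrower but more brittle.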

## Suggested fixes

1. **`provider-stream-COLujAAo.js:1230–1240` (and the two sibling catches)**: log the raised error at WARN; record a trajectory `model.failed` event with the HTTP status and body; classify `RateLimitError` (or `error.status === 429`) and flag for failover.
2. **`compaction-runtime-context-c6E9Op5Z.js:5684 resolveEmbeddedAgentStreamFn`**: forward `params.signal` into the inner `streamFn` options so that `client.responses.create` receives the abort signal. Same fix likely needed in the non-`authStorage` branch (line 5705).
3. **`selection-DGLE6AvW.js:5491 resolveLlmIdleTimeoutMs`**: when no explicit idle timeout is configured, default to a fraction of the run timeout (e.g. min(60_000, runTimeout/3)) instead of falling through to the run timeout. The current behaviour means the documented idle-timeout safety net does nothing for default configurations.
4. **`transport-stream-shared-B2Os3U8j.js:29–36 shouldBypassLongSdkRetry`**: treat 429 as non-retryable when the body matches Copilot's `Sorry, you've exceeded your weekly rate limit` pattern, OR more generally treat any 429 with a `text/plain` body as terminal so the SDK doesn't burn additional retries.

The minimum viable fix is **#1** alone — it would surface the error promptly even if abort plumbing remains broken — but the combination is what makes the user-facing symptom 10 minutes of silence.

## Workarounds for users hitting this in the meantime

- Set `agents.defaults.timeoutSeconds` to a low value (e.g. 60) so the silent hang is at most 1 minute instead of 10.
- Switch primary model to a non-premium model that still has weekly quota, e.g. `github-copilot/gpt-5.4-mini`.
- Avoid `openai-responses`-based Copilot models; `openai-completions` against `grok-code-fast-1` is unaffected during the same outage.

[Bug]: HTTP 429 from GitHub Copilot causes 10-minute silent hang instead of immediate failure #71120
