[Bug]: HTTP 429 from GitHub Copilot causes 10-minute silent hang instead of immediate failure #71120

@zzl360

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

When a primary LLM call to GitHub Copilot's /responses endpoint returns HTTP 429 (e.g. weekly rate limit exhausted), OpenClaw does not detect the rate-limit response. The embedded run silently waits the full configured run timeout (default agents.defaults.timeoutSeconds = 600, i.e. 10 minutes) before producing a surface_error and replying to the user. During this window:

  • No log line is emitted for the HTTP 429.
  • No model.failed / model.completed trajectory event is recorded.
  • No fallback profile is attempted promptly.
  • The user-facing channel (Feishu / TuiTui / cron) appears completely unresponsive.

The expected behaviour is to surface the 429 within a few seconds (Copilot returns the 429 in ~1.3 s) and either trip the configured fallback or fail the run with a useful error message.

Steps to reproduce

1. Confirm Copilot returns 429 immediately for an architect-shaped payload

TOKEN=$(jq -r .token ~/.openclaw/credentials/github-copilot.token.json)

# Build payload mimicking architect-agent shape
python3 << 'PY'
import json
payload = {
    "model": "gpt-5.4",
    "stream": True,
    "instructions": "x" * 38000,                       # ~38 KB system prompt
    "input": [{"role": "user", "content": "ping"}],
    "tools": [
        {"type": "function", "name": f"tool_{i}",
         "description": "noop", "parameters": {"type":"object","properties":{}}}
        for i in range(27)
    ],
    "max_output_tokens": 32,
    "tool_choice": "auto",
}
# Write the payload and close the file before curl reads it
with open("/tmp/big.json", "w") as f:
    json.dump(payload, f)
PY

curl -sS -X POST 'https://api.individual.githubcopilot.com/responses' \
  -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -H "Copilot-Integration-Id: vscode-chat" -H "Editor-Version: vscode/1.95.0" \
  --data-binary @/tmp/big.json -w "\nHTTP:%{http_code}  TIME:%{time_total}\n"

Observed:

Sorry, you've exceeded your weekly rate limit. Please review our [Terms of Service](...).
HTTP:429  TIME:1.305

The response has Content-Type: text/plain; charset=utf-8. The 429 body is not JSON and, importantly, the response does not include a Retry-After header.

2. Send the same prompt through OpenClaw

openclaw.json excerpt (relevant defaults):

{
  "agents": {
    "defaults": {
      "model":   { "primary": "github-copilot/gpt-5.4", "fallbacks": [] },
      "timeoutSeconds": 600,
      "compaction": { "mode": "safeguard" }
    }
  }
}

Send any user message to an agent whose context puts it past the rate-limit threshold (in our case the architect agent triggered Copilot's weekly premium quota). Watch the gateway log:

17:50:23 [feishu] dispatching to agent (session=...)
17:50:27 [plugins] tuitui: Registered tools
17:52:47 [diagnostic] stuck session: state=processing age=129s queueDepth=1
…  (stuck warnings every 30 s for 10 minutes)  …
18:00:38 [agent/embedded] embedded run timeout: runId=… timeoutMs=600000
18:00:38 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
18:00:38 [agent/embedded] embedded run failover decision: stage=assistant decision=surface_error reason=timeout from=github-copilot/gpt-5.4
18:00:39 [feishu] dispatch complete (queuedFinal=true, replies=1)

Trajectory file (/home/zzl/.openclaw/agents/architect/sessions/<sid>.trajectory.jsonl) for the run contains exactly four events: session.started, trace.metadata, context.compiled, prompt.submitted. There is no model.failed, no model.completed, no error event.

End to end, the agent appears unresponsive for 10 minutes.

Expected behavior

  • The 429 should be detected within a few seconds.
  • A trajectory model.failed (or equivalent) event should be emitted with the error message body.
  • A log line at WARN/ERROR level should record the HTTP status and the rate-limit reason.
  • If agents.defaults.model.fallbacks is configured, the run should immediately fail over.
  • Even with no fallback, decision=surface_error should fire within seconds, not minutes.

Actual behavior

The run hangs silently for 10 minutes before the agent responds with an error.

OpenClaw version

2026.4.22

Operating system

Ubuntu 24.04

Install method

npm global

Model

github-copilot/gpt-5.4

Provider / routing chain

github-copilot -> gpt-5.4

Additional provider/model setup details

No response

Logs, screenshots, and evidence

Impact and severity

No response

Additional information

Root cause

I traced the code path in the installed bundle (dist/) and the vendored OpenAI SDK. Three independent defects compound:

Defect 1 — Provider-stream catch swallows the error

File: dist/provider-stream-COLujAAo.js

The /responses request site is await client.responses.create(params, options?.signal ? { signal: options.signal } : void 0) at line 1214. On throw, the catch block at lines 1230–1240:

} catch (error) {
    output.stopReason = options?.signal?.aborted ? "aborted" : "error";
    output.errorMessage = error instanceof Error ? error.message : JSON.stringify(error);
    stream.push({
        type: "error",
        reason: output.stopReason,
        error: output
    });
    stream.end();
}

The catch:

  • does not call log$3.warn / log$3.error,
  • does not record a trajectory event,
  • does not classify the error against RateLimitError / status 429,
  • only pushes a {type:"error"} frame onto the assistant message event stream.

The same shape repeats at lines 1393–1403 and 1525–1535 for sibling provider stream factories.

For comparison, the OpenAI SDK at node_modules/openai/client.js:354–397 correctly checks response.ok, reads the (text/plain) body, and throws RateLimitError (core/error.js:62, 112). So the error is raised; OpenClaw's catch is what loses it.
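As a rough illustration of what the catch could do with the raised error, here is a minimal sketch. It assumes the thrown error carries a numeric status property, as the SDK's RateLimitError does. The names classifyProviderError and reportProviderError, and the injected log and recordEvent hooks, are hypothetical stand-ins and are not taken from the OpenClaw bundle.

```javascript
// Hypothetical helper: classify a thrown provider error so the caller can
// log it, record a trajectory event, and decide whether to fail over.
function classifyProviderError(error, signal) {
    const status = typeof error?.status === "number" ? error.status : undefined;
    const message = error instanceof Error ? error.message : JSON.stringify(error);
    let reason = "error";
    if (signal?.aborted) reason = "aborted";
    else if (status === 429) reason = "rate_limit";
    return { reason, status, message, failover: reason === "rate_limit" };
}

// Hypothetical reporting step for the catch body. `log` and `recordEvent`
// are injected stand-ins, not OpenClaw APIs.
function reportProviderError(error, { signal, log, recordEvent }) {
    const info = classifyProviderError(error, signal);
    log.warn(
        `provider call failed: status=${info.status ?? "n/a"} ` +
        `reason=${info.reason} message=${info.message}`
    );
    recordEvent({ type: "model.failed", ...info });
    return info;
}
```

The existing stream.push({ type: "error", ... }) frame could still be emitted afterwards; the sketch only adds the logging, trajectory, and classification steps that the list above reports as missing.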

Defect 2 — runAbortController.signal is not propagated to the OpenAI SDK call

File: dist/compaction-runtime-context-c6E9Op5Z.js, function resolveEmbeddedAgentStreamFn at line 5684.

The factory wraps the inner stream function as:

return async (m, context, options) => {
    const apiKey = await resolveEmbeddedAgentApiKey({ ... });
    return inner(m, normalizeContext(context), {
        ...options,
        apiKey: apiKey ?? options?.apiKey
    });
};

params.signal (which is runAbortController.signal per dist/selection-DGLE6AvW.js:6440) is discarded; the wrapper only forwards an apiKey. As a result, when provider-stream-COLujAAo.js:1214 reads options?.signal, it is undefined for embedded-agent calls, so the SDK fetch is invoked without an AbortSignal. Even when the run-timeout timer at selection-DGLE6AvW.js:6735 eventually fires runAbortController.abort(), nothing carries that abort into the in-flight fetch. The HTTP request can only be torn down by Node closing the underlying socket on process exit or by the SDK's own retry/timeout (see Defect 3).
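A minimal sketch of the missing plumbing, assuming the factory has access to the run's abort signal. wrapStreamFn, resolveApiKey, and runSignal are hypothetical names standing in for the bundle's internals; only the shape of the wrapper follows the snippet above.

```javascript
// Hypothetical wrapper factory: `inner`, `resolveApiKey`, and `runSignal`
// stand in for the bundle's internals.
function wrapStreamFn(inner, resolveApiKey, runSignal) {
    return async (model, context, options) => {
        const apiKey = await resolveApiKey();
        return inner(model, context, {
            ...options,
            apiKey: apiKey ?? options?.apiKey,
            // Prefer a caller-supplied signal, else fall back to the run's
            // signal so an abort can reach the in-flight request.
            signal: options?.signal ?? runSignal,
        });
    };
}
```

With the signal forwarded, the options?.signal check at the request site would see a defined AbortSignal, and aborting the run controller would be visible to whatever the inner function passes it to.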

Defect 3 — Idle-timeout collapses to run-timeout, providing no actual idle protection

File: dist/selection-DGLE6AvW.js

resolveLlmIdleTimeoutMs at lines 5491–5505 falls through to agents.defaults.timeoutSeconds when no explicit agents.defaults.llm.idleTimeoutSeconds is set:

const agentTimeoutSeconds = params?.cfg?.agents?.defaults?.timeoutSeconds;
if (typeof agentTimeoutSeconds === "number" && ... > 0)
    return clampTimeoutMs(agentTimeoutSeconds * 1e3);

The call site at lines 6549–6554 computes runTimeoutMs as params.timeoutMs !== configuredRunTimeoutMs ? params.timeoutMs : void 0, which is void 0 for normal runs because the run timeout equals the configured run timeout. So streamWithIdleTimeout is wired at idleTimeoutMs == agentTimeoutMs == 600_000, providing zero practical idle protection on top of the run timeout.

Combined effect: the only escape hatch is the run-timeout scheduleAbortTimer (selection-DGLE6AvW.js:6735) — and that fires the abort against runAbortController, which (per Defect 2) is not connected to the SDK fetch.
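To make the arithmetic concrete, here is a small sketch of a resolver that keeps the idle timeout below the run timeout when nothing is configured explicitly. The function name, its parameter shape, and the min(60 s, runTimeout / 3) default are assumptions for illustration; the bundle's clampTimeoutMs is left out because its bounds are not shown here.

```javascript
// Hypothetical resolver: an explicit idle timeout wins; otherwise derive a
// shorter default instead of reusing the full run timeout.
function resolveIdleTimeoutMs({ idleTimeoutSeconds, runTimeoutSeconds }) {
    if (typeof idleTimeoutSeconds === "number" && idleTimeoutSeconds > 0) {
        return idleTimeoutSeconds * 1e3;
    }
    if (typeof runTimeoutSeconds === "number" && runTimeoutSeconds > 0) {
        // Assumed default: min(60 s, one third of the run timeout).
        return Math.min(60_000, Math.floor((runTimeoutSeconds * 1e3) / 3));
    }
    return 60_000;
}
```

For the default timeoutSeconds = 600 this yields 60 000 ms rather than 600 000 ms, so an idle stream would be noticed after one minute instead of ten.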

Why the OpenAI SDK doesn't fail faster on its own

node_modules/openai/client.js:354 does throw RateLimitError on the 429, but only after retrying. dist/transport-stream-shared-B2Os3U8j.js:29–36 (shouldBypassLongSdkRetry) only stamps x-should-retry:false when status ∈ {408, 409, 429, ≥500} and a Retry-After header is present and the retry-after value exceeds 60 seconds. Copilot's weekly-quota 429 is text/plain with no Retry-After, so OpenClaw never marks it non-retryable. The SDK then performs its default retry budget (a few seconds total) before re-raising — at which point Defect 1 silently absorbs it.

This explains why the Copilot 429 takes 1.3 s direct, but OpenClaw shows no error trace: the SDK does eventually raise within ~5–10 s, but Defect 1 ensures the error never surfaces beyond the in-memory event-stream frame.
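The retry rule described above, plus one possible extra rule for Copilot-style responses, can be sketched as a predicate. This is a hedged illustration rather than the bundle's code: shouldBypassSdkRetry and its argument shape are hypothetical, and treating a text/plain 429 without Retry-After as terminal is an assumption about what a fix might look like.

```javascript
// Hypothetical predicate modelled on the described behaviour, plus a
// proposed rule for plain-text 429 responses. `headers` is a WHATWG Headers.
function shouldBypassSdkRetry({ status, headers }) {
    const retryable =
        status === 408 || status === 409 || status === 429 || status >= 500;
    if (!retryable) return false;

    const retryAfter = headers.get("retry-after");
    if (retryAfter !== null) {
        // Existing rule as described: only waits longer than 60 s are
        // bypassed. The HTTP-date form of Retry-After is not handled here.
        const seconds = Number(retryAfter);
        return Number.isFinite(seconds) && seconds > 60;
    }

    // Proposed rule (assumption): a 429 with a text/plain body and no
    // Retry-After header is treated as a terminal quota error.
    const contentType = headers.get("content-type") ?? "";
    return status === 429 && contentType.toLowerCase().startsWith("text/plain");
}
```

A caller that gets true back would then stamp x-should-retry:false on the response, as the existing code already does for long Retry-After values.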

Suggested fixes

  1. provider-stream-COLujAAo.js:1230–1240 (and the two sibling catches): log the raised error at WARN; record a trajectory model.failed event with the HTTP status and body; classify RateLimitError (or error.status === 429) and flag for failover.
  2. compaction-runtime-context-c6E9Op5Z.js:5684 resolveEmbeddedAgentStreamFn: forward params.signal into the inner streamFn options so that client.responses.create receives the abort signal. Same fix likely needed in the non-authStorage branch (line 5705).
  3. selection-DGLE6AvW.js:5491 resolveLlmIdleTimeoutMs: when no explicit idle timeout is configured, default to a fraction of the run timeout (e.g. min(60_000, runTimeout/3)) instead of falling through to the run timeout. The current behaviour means the documented idle-timeout safety net does nothing for default configurations.
  4. transport-stream-shared-B2Os3U8j.js:29–36 shouldBypassLongSdkRetry: treat 429 as non-retryable when the body matches Copilot's Sorry, you've exceeded your weekly rate limit pattern, OR more generally treat any 429 with a text/plain body as terminal so the SDK doesn't burn additional retries.

The minimum viable fix is fix 1 alone: it would surface the error promptly even if the abort plumbing remains broken. The combination of all the defects, however, is what turns the user-facing symptom into 10 minutes of silence.

Workarounds for users hitting this in the meantime

  • Set agents.defaults.timeoutSeconds to a low value (e.g. 60) so the silent hang is at most 1 minute instead of 10.
  • Switch primary model to a non-premium model that still has weekly quota, e.g. github-copilot/gpt-5.4-mini.
  • Avoid openai-responses-based Copilot models; openai-completions against grok-code-fast-1 is unaffected during the same outage.
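For reference, the first two workarounds could be combined in an openclaw.json fragment like the one below. It reuses the key layout from the excerpt in the reproduction steps; the specific values (60 seconds, the gpt-5.4-mini model) are only the examples mentioned above and may need adjusting for a given account.

```json
{
  "agents": {
    "defaults": {
      "model":   { "primary": "github-copilot/gpt-5.4-mini", "fallbacks": [] },
      "timeoutSeconds": 60
    }
  }
}
```

Lowering timeoutSeconds also shortens the budget for legitimately long runs, so this is a stopgap rather than a fix.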

Metadata

    Labels

    bug (Something isn't working), regression (Behavior that previously worked and now fails)
