Bug type
Regression (worked before, now fails)
Beta release blocker
No
Summary
When a primary LLM call to GitHub Copilot's /responses endpoint returns HTTP 429 (e.g. weekly rate limit exhausted), OpenClaw does not detect the rate-limit response. The embedded run silently waits the full configured run timeout (default agents.defaults.timeoutSeconds = 600, i.e. 10 minutes) before producing a surface_error and replying to the user. During this window:
- No log line is emitted for the HTTP 429.
- No model.failed / model.completed trajectory event is recorded.
- No fallback profile is attempted promptly.
- The user-facing channel (Feishu / TuiTui / cron) appears completely unresponsive.
The expected behaviour is to surface the 429 within a few seconds (Copilot returns the 429 in ~1.3 s) and either trip the configured fallback or fail the run with a useful error message.
Steps to reproduce
1. Confirm Copilot returns 429 immediately for an architect-shaped payload
TOKEN=$(jq -r .token ~/.openclaw/credentials/github-copilot.token.json)
# Build payload mimicking architect-agent shape
python3 << 'PY'
import json
payload = {
    "model": "gpt-5.4",
    "stream": True,
    "instructions": "x" * 38000,  # ~38 KB system prompt
    "input": [{"role": "user", "content": "ping"}],
    "tools": [
        {"type": "function", "name": f"tool_{i}",
         "description": "noop", "parameters": {"type": "object", "properties": {}}}
        for i in range(27)
    ],
    "max_output_tokens": 32,
    "tool_choice": "auto",
}
json.dump(payload, open("/tmp/big.json","w"))
PY
curl -sS -X POST 'https://api.individual.githubcopilot.com/responses' \
-H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
-H "Copilot-Integration-Id: vscode-chat" -H "Editor-Version: vscode/1.95.0" \
--data-binary @/tmp/big.json -w "\nHTTP:%{http_code} TIME:%{time_total}\n"
Observed:
Sorry, you've exceeded your weekly rate limit. Please review our [Terms of Service](...).
HTTP:429 TIME:1.305
The response has Content-Type: text/plain; charset=utf-8. The 429 body is not JSON and, importantly, the response does not include a Retry-After header.
2. Send the same prompt through OpenClaw
openclaw.json excerpt (relevant defaults):
{
"agents": {
"defaults": {
"model": { "primary": "github-copilot/gpt-5.4", "fallbacks": [] },
"timeoutSeconds": 600,
"compaction": { "mode": "safeguard" }
}
}
}
Send any user message to an agent whose context puts it past the rate-limit threshold (in our case the architect agent triggered Copilot's weekly premium quota). Watch the gateway log:
17:50:23 [feishu] dispatching to agent (session=...)
17:50:27 [plugins] tuitui: Registered tools
17:52:47 [diagnostic] stuck session: state=processing age=129s queueDepth=1
… (stuck warnings every 30 s for 10 minutes) …
18:00:38 [agent/embedded] embedded run timeout: runId=… timeoutMs=600000
18:00:38 [agent/embedded] Profile github-copilot:github timed out. Trying next account...
18:00:38 [agent/embedded] embedded run failover decision: stage=assistant decision=surface_error reason=timeout from=github-copilot/gpt-5.4
18:00:39 [feishu] dispatch complete (queuedFinal=true, replies=1)
Trajectory file (/home/zzl/.openclaw/agents/architect/sessions/<sid>.trajectory.jsonl) for the run contains exactly four events: session.started, trace.metadata, context.compiled, prompt.submitted. There is no model.failed, no model.completed, no error event.
End-to-end the agent appears non-responsive for 10 minutes.
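A quick way to confirm this state from the trajectory file is to list the event names and check for a terminal model event. The sketch below is only illustrative: it assumes each JSONL line is an object whose `type` field holds the event name, which this report does not confirm, and `inspectTrajectory` is a made-up helper, not part of OpenClaw.

```javascript
// Assumption: each trajectory line is a JSON object with the event name in
// `type`. Adjust the field name if the real schema differs.
function inspectTrajectory(jsonl) {
  const types = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line).type);
  const terminal = ["model.failed", "model.completed"];
  return {
    types,
    // "Stuck" here means: a prompt was submitted but no terminal event followed.
    stuck:
      types.includes("prompt.submitted") &&
      !types.some((t) => terminal.includes(t)),
  };
}

// The four events observed in this report, in the assumed line shape.
const sample = [
  '{"type":"session.started"}',
  '{"type":"trace.metadata"}',
  '{"type":"context.compiled"}',
  '{"type":"prompt.submitted"}',
].join("\n");

console.log(inspectTrajectory(sample).stuck); // true
```

To run it against a real session, read the .trajectory.jsonl file into a string and pass it in.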
Expected behavior
- The 429 should be detected within a few seconds.
- A trajectory model.failed (or equivalent) event should be emitted with the error message body.
- A log line at WARN/ERROR level should record the HTTP status and the rate-limit reason.
- If agents.defaults.model.fallbacks is configured, the run should immediately fail over.
- Even with no fallback, decision=surface_error should fire within seconds, not minutes.
Actual behavior
The run takes the full 10 minutes to respond; the user only gets a reply after the run timeout fires.
OpenClaw version
2026.4.22
Operating system
Ubuntu 24.04
Install method
npm global
Model
github copilot/GPT-5.4
Provider / routing chain
github copilot -> GPT-5.4
Additional provider/model setup details
No response
Logs, screenshots, and evidence
Impact and severity
No response
Additional information
Root cause
I traced the code path in the installed bundle (dist/) and the vendored OpenAI SDK. Three independent defects compound:
Defect 1 — Provider-stream catch swallows the error
File: dist/provider-stream-COLujAAo.js
The /responses request site is await client.responses.create(params, options?.signal ? { signal: options.signal } : void 0) at line 1214. On throw, the catch block at lines 1230–1240:
} catch (error) {
  output.stopReason = options?.signal?.aborted ? "aborted" : "error";
  output.errorMessage = error instanceof Error ? error.message : JSON.stringify(error);
  stream.push({
    type: "error",
    reason: output.stopReason,
    error: output
  });
  stream.end();
}
The catch:
- does not call log$3.warn / log$3.error,
- does not record a trajectory event,
- does not classify the error against RateLimitError / status 429,
- only pushes a {type:"error"} frame onto the assistant message event stream.
The same shape repeats at lines 1393–1403 and 1525–1535 for sibling provider stream factories.
For comparison, the OpenAI SDK at node_modules/openai/client.js:354–397 correctly checks response.ok, reads the (text/plain) body, and throws RateLimitError (core/error.js:62, 112). So the error is raised; OpenClaw's catch is what loses it.
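As a rough illustration of what the catch block could do before pushing the error frame, here is a minimal sketch. `classifyProviderError` and its return shape are hypothetical and do not exist in the bundle. The only thing it relies on is that the thrown error carries a numeric `status`, as the `error.status === 429` check suggested in this report assumes.

```javascript
// Hypothetical helper: turns whatever the SDK threw into a small record the
// catch block could log, write to the trajectory, and use for failover.
function classifyProviderError(error, aborted = false) {
  const status =
    error !== null && typeof error === "object" && typeof error.status === "number"
      ? error.status
      : undefined;
  // Same message extraction as the existing catch block.
  const message = error instanceof Error ? error.message : JSON.stringify(error);

  let kind = "error";
  if (aborted) kind = "aborted";
  else if (status === 429) kind = "rate_limit";

  return {
    kind,
    status,
    message,
    // Assumption: a rate limit should be eligible for immediate failover.
    failover: kind === "rate_limit",
  };
}

// Simulated error with the assumed shape: an Error plus a numeric `status`.
const simulated = Object.assign(
  new Error("429 Sorry, you've exceeded your weekly rate limit."),
  { status: 429 },
);
const classified = classifyProviderError(simulated);
console.warn(
  `[provider-stream] model call failed: kind=${classified.kind} status=${classified.status}`,
);
```

The existing catch could call such a helper first, emit the log line and trajectory event from the returned record, and then push the error frame as it does today.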
Defect 2 — runAbortController.signal is not propagated to the OpenAI SDK call
File: dist/compaction-runtime-context-c6E9Op5Z.js, function resolveEmbeddedAgentStreamFn at line 5684.
The factory wraps the inner stream function as:
return async (m, context, options) => {
  const apiKey = await resolveEmbeddedAgentApiKey({ ... });
  return inner(m, normalizeContext(context), {
    ...options,
    apiKey: apiKey ?? options?.apiKey
  });
};
params.signal (which is runAbortController.signal per dist/selection-DGLE6AvW.js:6440) is discarded. The wrapper only forwards an apiKey. As a result, when provider-stream-COLujAAo.js:1214 reads options?.signal, it is undefined for embedded-agent calls, so the SDK fetch is invoked without an AbortSignal. Even when the run-timeout timer at selection-DGLE6AvW.js:6735 eventually fires runAbortController.abort(), there is no plumbing from that abort into the in-flight fetch. The HTTP request can only be torn down by node closing the underlying socket on process exit or by the SDK's own retry/timeout (see Defect 3).
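The plumbing gap can be sketched in isolation. In the example below, `wrapStreamFn` is a hypothetical stand-in for resolveEmbeddedAgentStreamFn (credential lookup and context normalization are left out), and `fakeInner` simulates a provider call that settles only when its signal aborts. It is meant to show the shape of the change, not the actual patch.

```javascript
// Hypothetical stand-in for the wrapper; only the option forwarding is modeled.
function wrapStreamFn(inner, params) {
  return async (model, context, options) =>
    inner(model, context, {
      ...options,
      apiKey: params.apiKey ?? options?.apiKey,
      // The added line: prefer a per-call signal, else fall back to the run signal.
      signal: options?.signal ?? params.signal,
    });
}

// Fake provider call that resolves only once its signal aborts, like a
// request stuck in flight.
function fakeInner(_model, _context, options) {
  return new Promise((resolve) => {
    if (!options.signal) return resolve("no-signal");
    if (options.signal.aborted) return resolve("aborted");
    options.signal.addEventListener("abort", () => resolve("aborted"), { once: true });
  });
}

const runAbortController = new AbortController();
const streamFn = wrapStreamFn(fakeInner, {
  apiKey: "test-key",
  signal: runAbortController.signal,
});

const pending = streamFn("gpt-5.4", {}, {});
runAbortController.abort(); // what the run-timeout timer would do
pending.then((outcome) => console.log(outcome)); // prints "aborted"
```

Without the `signal` line the fake call would resolve with "no-signal" immediately here; in the real code path the equivalent is a fetch that never learns about the abort.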
Defect 3 — Idle-timeout collapses to run-timeout, providing no actual idle protection
File: dist/selection-DGLE6AvW.js
resolveLlmIdleTimeoutMs at lines 5491–5505 falls through to agents.defaults.timeoutSeconds when no explicit agents.defaults.llm.idleTimeoutSeconds is set:
const agentTimeoutSeconds = params?.cfg?.agents?.defaults?.timeoutSeconds;
if (typeof agentTimeoutSeconds === "number" && ... > 0)
  return clampTimeoutMs(agentTimeoutSeconds * 1e3);
The call site at lines 6549–6554 computes runTimeoutMs as params.timeoutMs !== configuredRunTimeoutMs ? params.timeoutMs : void 0, which is void 0 for normal runs because params.timeoutMs equals the configured run timeout. As a result, streamWithIdleTimeout is wired with idleTimeoutMs == agentTimeoutMs == 600_000, providing zero practical idle protection on top of the run timeout.
Combined effect: the only escape hatch is the run-timeout scheduleAbortTimer (selection-DGLE6AvW.js:6735) — and that fires the abort against runAbortController, which (per Defect 2) is not connected to the SDK fetch.
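For illustration, a default that keeps the idle timeout well below the run timeout could look like the sketch below. `resolveIdleTimeoutMs` is a hypothetical simplification (no clamping, flat parameters), and the 60-second cap and one-third fraction are just the values floated in the suggested fixes in this report, not existing configuration.

```javascript
const DEFAULT_IDLE_CAP_MS = 60_000; // assumed cap, taken from the suggestion in this report

// Hypothetical replacement for the fall-through: an explicit idle timeout wins,
// otherwise derive one from the run timeout instead of reusing it unchanged.
function resolveIdleTimeoutMs({ idleTimeoutSeconds, timeoutSeconds } = {}) {
  if (typeof idleTimeoutSeconds === "number" && idleTimeoutSeconds > 0) {
    return idleTimeoutSeconds * 1000;
  }
  if (typeof timeoutSeconds === "number" && timeoutSeconds > 0) {
    const runTimeoutMs = timeoutSeconds * 1000;
    return Math.min(DEFAULT_IDLE_CAP_MS, Math.floor(runTimeoutMs / 3));
  }
  return DEFAULT_IDLE_CAP_MS;
}

console.log(resolveIdleTimeoutMs({ timeoutSeconds: 600 })); // 60000
console.log(resolveIdleTimeoutMs({ timeoutSeconds: 60 })); // 20000
console.log(resolveIdleTimeoutMs({ idleTimeoutSeconds: 15, timeoutSeconds: 600 })); // 15000
```

With the default 600-second run timeout this yields a 60-second idle window, so a stalled stream would be noticed about ten times sooner than it is today.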
Why the OpenAI SDK doesn't fail faster on its own
node_modules/openai/client.js:354 does throw RateLimitError on the 429, but only after retrying. dist/transport-stream-shared-B2Os3U8j.js:29–36 (shouldBypassLongSdkRetry) only stamps x-should-retry:false when status ∈ {408, 409, 429, ≥500} and a Retry-After header is present and the retry-after value exceeds 60 seconds. Copilot's weekly-quota 429 is text/plain with no Retry-After, so OpenClaw never marks it non-retryable. The SDK then performs its default retry budget (a few seconds total) before re-raising — at which point Defect 1 silently absorbs it.
This explains why the Copilot 429 takes 1.3 s direct, but OpenClaw shows no error trace: the SDK does eventually raise within ~5–10 s, but Defect 1 ensures the error never surfaces beyond the in-memory event-stream frame.
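The retry condition described above, and one possible widening of it, can be written down as two small predicates. Both functions are hypothetical restatements based on the description in this report. They are not the bundled shouldBypassLongSdkRetry, and the flat `{ status, retryAfterSeconds, contentType }` input shape is an assumption made for the example.

```javascript
// Restatement of the described condition: retryable status AND a Retry-After
// value that is present and longer than 60 seconds.
function currentShouldBypass({ status, retryAfterSeconds }) {
  const retryable =
    status === 408 || status === 409 || status === 429 || status >= 500;
  return retryable && typeof retryAfterSeconds === "number" && retryAfterSeconds > 60;
}

// Possible widening: also treat a 429 with a text/plain body as terminal,
// even when no Retry-After header is present.
function proposedShouldBypass({ status, retryAfterSeconds, contentType = "" }) {
  if (currentShouldBypass({ status, retryAfterSeconds })) return true;
  return status === 429 && contentType.toLowerCase().startsWith("text/plain");
}

// The Copilot weekly-quota response from the reproduction: no Retry-After.
const copilot429 = { status: 429, contentType: "text/plain; charset=utf-8" };
console.log(currentShouldBypass(copilot429)); // false
console.log(proposedShouldBypass(copilot429)); // true
```

A JSON 429 without Retry-After still falls through to the SDK's normal retries under both predicates, which keeps the change narrow.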
Suggested fixes
1. provider-stream-COLujAAo.js:1230–1240 (and the two sibling catches): log the raised error at WARN; record a trajectory model.failed event with the HTTP status and body; classify RateLimitError (or error.status === 429) and flag for failover.
2. compaction-runtime-context-c6E9Op5Z.js:5684 resolveEmbeddedAgentStreamFn: forward params.signal into the inner streamFn options so that client.responses.create receives the abort signal. The same fix is likely needed in the non-authStorage branch (line 5705).
3. selection-DGLE6AvW.js:5491 resolveLlmIdleTimeoutMs: when no explicit idle timeout is configured, default to a fraction of the run timeout (e.g. min(60_000, runTimeout/3)) instead of falling through to the run timeout. The current behaviour means the documented idle-timeout safety net does nothing for default configurations.
4. transport-stream-shared-B2Os3U8j.js:29–36 shouldBypassLongSdkRetry: treat 429 as non-retryable when the body matches Copilot's "Sorry, you've exceeded your weekly rate limit" pattern, or more generally treat any 429 with a text/plain body as terminal so the SDK doesn't burn additional retries.
The minimum viable fix is #1 alone — it would surface the error promptly even if abort plumbing remains broken — but the combination is what makes the user-facing symptom 10 minutes of silence.
Workarounds for users hitting this in the meantime
- Set agents.defaults.timeoutSeconds to a low value (e.g. 60) so the silent hang is at most 1 minute instead of 10.
- Switch primary model to a non-premium model that still has weekly quota, e.g. github-copilot/gpt-5.4-mini.
- Avoid openai-responses-based Copilot models; openai-completions against grok-code-fast-1 is unaffected during the same outage.
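As a config fragment, the first two workarounds might look like the following. It reuses only the keys and model ID already mentioned in this report; whether gpt-5.4-mini still has quota on a given account is an assumption to verify.

```json
{
  "agents": {
    "defaults": {
      "model": { "primary": "github-copilot/gpt-5.4-mini", "fallbacks": [] },
      "timeoutSeconds": 60
    }
  }
}
```

Lowering timeoutSeconds also shortens legitimate long runs, so this is only a stopgap until the error is surfaced properly.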