[Bug]: Regression in 2026.4.29: WebUI chat/embedded agent becomes extremely slow with long model-resolution, auth, core-plugin-tools and session operations #76236

@andike73

Description

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

After upgrading from 2026.4.24 to 2026.4.29, WebUI chat responses became extremely slow or timed out, while the gateway stayed healthy and direct Codex CLI inside the same Docker container replied quickly. Rolling back to 2026.4.24 made WebUI chat usable again.

Steps to reproduce

  1. Run OpenClaw 2026.4.29 in Docker on Ubuntu 24.04 with OpenAI Codex OAuth and model openai-codex/gpt-5.4.
  2. Open the Control WebUI.
  3. Start a new chat session or use /new.
  4. Send a trivial prompt: "Antworte exakt mit OK. Keine Tools nutzen. Keine Erklärung." ("Reply exactly with OK. Use no tools. No explanation.")
  5. Observe that chat.send itself is fast, but embedded-agent startup/prep takes 90+ seconds or times out.
  6. Compare with direct Codex CLI inside the same container, which replies in about 7–11 seconds.
  7. Roll back to 2026.4.24 with the same setup; WebUI chat becomes usable again, around 10–20 seconds.

Expected behavior

A trivial WebUI chat prompt should not spend around 90 seconds in embedded-agent startup/prep stages when direct Codex CLI inside the same container replies in about 7–11 seconds. The WebUI should respond in a reasonable time and should not trigger repeated long sessions.list/node.list/device.pair.list delays or stuck-session diagnostics.

Actual behavior

On 2026.4.29, the WebUI remains reachable and the gateway reports healthy. WebUI chat submission itself is fast (chat.send around 100–200ms), but the embedded agent takes 90+ seconds before producing a reply, or fails with a timeout.

Observed slow stages for a trivial "reply OK" prompt:

startup stages totalMs=39182:

  • model-resolution: 22485ms
  • auth: 9317ms
  • attempt-dispatch: 7373ms

prep stages totalMs=56699:

  • core-plugin-tools: 30886ms
  • system-prompt: 9299ms
  • stream-setup: 10735ms

Other observed slow operations included:

  • sessions.list: 55985ms / 61307ms
  • node.list: 63236ms
  • device.pair.list: 63278ms
  • models.authStatus: 27357ms
  • sessions.usage: 27867ms

Also observed:

  • [fetch-timeout] fetch timeout reached; aborting operation
  • stuck session diagnostics
  • session-write-lock held for more than 15s
  • agent cleanup timed out at pi-trajectory-flush

After rollback to 2026.4.24, cleaning stale UI/session state, and letting the gateway settle, chat response time returned to about 10–20 seconds.

OpenClaw version

Bad: 2026.4.29
Good/workaround: 2026.4.24

Operating system

Ubuntu 24.04 VPS

Install method

Docker / Docker Compose. OpenClaw runs as openclaw-gateway in a custom Docker image based on ghcr.io/openclaw/openclaw:. The image additionally includes Codex CLI, jq, ripgrep, ffmpeg, GitHub CLI, and Python tooling.

Model

openai-codex/gpt-5.4

Provider / routing chain

OpenClaw -> OpenAI Codex OAuth -> gpt-5.4
No direct OpenAI API key is used.

Additional provider/model setup details

The default/main model is configured as openai-codex/gpt-5.4.

Direct Codex CLI inside the same openclaw-gateway container works quickly:

codex exec --cd /home/node/.openclaw/workspace --skip-git-repo-check --sandbox read-only -m gpt-5.4 "Antworte exakt mit OK. Keine Erklärung."

Observed result:

  • Codex CLI 0.128.0 replied "OK"
  • real time: 0m7.684s

Earlier with Codex CLI 0.125.0:

  • replied "OK"
  • real time: 0m11.443s

This suggests Docker networking, Codex OAuth, and the model provider are fundamentally working.
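Single runs are noisy, so when comparing the CLI baseline against WebUI latency it can help to repeat the call a few times and average. The helper below (`avg_runs` is a hypothetical name, not part of OpenClaw or Codex CLI) is a minimal POSIX-shell sketch:

```shell
# Hypothetical helper (not part of OpenClaw/Codex): run a command n times and
# report average wall-clock seconds, using only POSIX shell and date.
avg_runs() {
  n=$1; shift
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" >/dev/null 2>&1
    i=$((i + 1))
  done
  end=$(date +%s)
  echo "avg: $(( (end - start) / n ))s over $n runs"
}

# Example with the baseline command from above:
# avg_runs 3 codex exec --cd /home/node/.openclaw/workspace \
#   --skip-git-repo-check --sandbox read-only -m gpt-5.4 \
#   "Antworte exakt mit OK. Keine Erklärung."
```

Integer seconds are coarse, but enough to distinguish the ~7–11s CLI baseline from 90+ second WebUI runs.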

Logs, screenshots, and evidence

Example from 2026.4.29 for a trivial WebUI "reply OK" prompt:

[ws] ⇄ res ✓ chat.send 102ms runId=fe1e8869-5713-4ec5-b935-a8930f7b0259

[agent/embedded] [trace:embedded-run] startup stages:
runId=fe1e8869-5713-4ec5-b935-a8930f7b0259
sessionId=c7938d77-8971-4d9b-92ba-4db3125c123a
phase=attempt-dispatch
totalMs=39182
stages=workspace:2ms@2ms,
runtime-plugins:3ms@5ms,
hooks:0ms@5ms,
model-resolution:22485ms@22490ms,
auth:9317ms@31807ms,
context-engine:2ms@31809ms,
attempt-dispatch:7373ms@39182ms

[agent/embedded] [trace:embedded-run] prep stages:
runId=fe1e8869-5713-4ec5-b935-a8930f7b0259
sessionId=c7938d77-8971-4d9b-92ba-4db3125c123a
phase=stream-ready
totalMs=56699
stages=workspace-sandbox:141ms@141ms,
skills:2ms@143ms,
core-plugin-tools:30886ms@31029ms,
bootstrap-context:563ms@31592ms,
bundle-tools:2656ms@34248ms,
system-prompt:9299ms@43547ms,
session-resource-loader:2409ms@45956ms,
agent-session:8ms@45964ms,
stream-setup:10735ms@56699ms
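The `stages=` strings above are easier to read once split per stage. A small text-processing sketch (plain tr/sed/sort, nothing OpenClaw-specific; the stage string is the prep-stage line from this report) sorts stages by duration:

```shell
# Split an embedded-run "stages=" value into "name ms" pairs and sort
# descending by duration to surface the slowest stage.
stages='workspace-sandbox:141ms@141ms,skills:2ms@143ms,core-plugin-tools:30886ms@31029ms,bootstrap-context:563ms@31592ms,bundle-tools:2656ms@34248ms,system-prompt:9299ms@43547ms,session-resource-loader:2409ms@45956ms,agent-session:8ms@45964ms,stream-setup:10735ms@56699ms'
echo "$stages" | tr ',' '\n' | sed 's/ms@.*//; s/:/ /' | sort -k2,2 -rn
# top line: core-plugin-tools 30886
```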

Other observed slow calls:
[ws] ⇄ res ✓ sessions.list 55985ms
[ws] ⇄ res ✓ sessions.list 61307ms
[ws] ⇄ res ✓ node.list 63236ms
[ws] ⇄ res ✓ device.pair.list 63278ms
[ws] ⇄ res ✓ models.authStatus 27357ms
[ws] ⇄ res ✓ sessions.usage 27867ms

Observed diagnostics:
[fetch-timeout] fetch timeout reached; aborting operation

[diagnostic] stuck session:
sessionId=unknown
sessionKey=agent:main:main
state=processing
age=123s
queueDepth=1
reason=processing_with_queued_work

[session-write-lock] releasing lock held for 62016ms (max=15000ms):
/home/node/.openclaw/agents/main/sessions/sessions.json.lock

[agent/embedded] agent cleanup timed out:
step=pi-trajectory-flush
timeoutMs=10000

Session state observations:
- /home/node/.openclaw/agents/main/sessions/sessions.json was about 8.9M
- A temporary stale-looking .jsonl.lock file was observed and later disappeared
- After rollback, WebUI cache hard reload, and session state settling, 2026.4.24 became usable again
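Given the ~8.9M sessions.json and the stale-looking lock file above, a quick filesystem check can confirm whether an oversized session store or leftover locks are present before and after rollback. `check_session_state` is a hypothetical helper; the path in the usage example is the one from the logs:

```shell
# Hypothetical helper: flag an oversized session store (>5 MiB) and lock
# files that look stale (not touched for 5+ minutes) under a sessions dir.
check_session_state() {
  dir=$1
  find "$dir" -name 'sessions.json' -size +5M    # large session stores
  find "$dir" -name '*.lock' -mmin +5            # stale-looking lock files
}

# e.g. check_session_state /home/node/.openclaw/agents/main/sessions
```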

A rate limit was observed during debugging but was later ruled out as the main cause because direct Codex CLI calls inside the same container worked quickly after quota recovered.

Impact and severity

Affected: WebUI chat on my Docker-based OpenClaw setup using OpenAI Codex OAuth.
Severity: High for this setup because WebUI chat becomes effectively unusable on 2026.4.29.
Frequency: Reproduced repeatedly after upgrading to 2026.4.29.
Consequence: Simple prompts can take 90+ seconds or time out, while the same model via direct Codex CLI responds in about 7–11 seconds.
Workaround: Roll back to 2026.4.24.

Additional information

I also tested whether too many main-agent skills caused the issue. Reducing the main-agent skill list did not materially improve the core-plugin-tools latency:

Before reducing skills:
core-plugin-tools: 30886ms

After reducing main-agent skills:
core-plugin-tools: 29555ms

memory-core dreaming was disabled during debugging.

After rolling back from 2026.4.29 to 2026.4.24, I initially saw a WebUI/client cache mismatch:

models.list error:
invalid models.list params: at root: unexpected property 'view'

After hard-reloading the WebUI / using a fresh session, models.list became fast again.

I can provide additional logs or run specific debug commands if helpful.

Labels

bug (Something isn't working), regression (Behavior that previously worked and now fails)
