A production OpenClaw Docker gateway on 2026.5.18 shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing trace output makes the slow section hard to localize.
With OPENCLAW_LOG_LEVEL=trace, the [trace:embedded-run] startup stages summary ends at attempt-dispatch and reports only ~1.36s of startup/prep. However, session.started is recorded several seconds later (~6.3s in the sample below). That gap appears to be inside the Codex app-server thread lifecycle (startOrResumeThread / thread.resume or thread.start), but today it has no comparable stage breakdown.
This makes operators think prep is cheap while the user-visible turn is still slow, and it makes it hard to tell whether the time is spent in binding reads, plugin app config recovery/build, thread resume/start RPC, or binding writes.
Environment
OpenClaw: 2026.5.18
Runtime: Docker, derived runtime image
Channel tested: Dashboard/Webchat direct conversation
Model/provider: openai-codex/gpt-5.5
Gateway health: healthy during the clean sample
OPENCLAW_LOG_LEVEL=trace was enabled temporarily for the measurement, then restored
diagnostics.cacheTrace was not enabled
Observed clean sample
Run: 7141c10d-811a-49b8-a5e2-c3ca23f08b84
Timeline from gateway logs / trajectory:
08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms
Secondary observation: native hook relay address debug race
In the same trace window, this line appears repeatedly a few seconds after harness selection:
native hook relay bridge server address unavailable
The source appears to call writeNativeHookRelayBridgeRecordForRegistration(...) immediately after server.listen(...), before the listen callback is guaranteed to have a bound address.
This may be a harmless debug race because the callback writes the record later, so I am not claiming it is the main latency cause. But it can confuse latency investigations because it appears in the hidden gap. Consider either removing the eager write, making it explicitly best-effort/noisy only at trace, or exposing a bridge-ready timing if it matters for hook startup.
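One of the options mentioned above, removing the eager write, could look roughly like the following. This is a minimal sketch, not the OpenClaw source: ListeningServer, writeRecordWhenBound, and the writeRecord callback are hypothetical names, and the interface is only a structural stand-in that a Node-style server is expected to satisfy.

```typescript
// Hypothetical structural view of a Node-style server (assumption, not OpenClaw code).
interface ListeningServer {
  listen(port: number, onListening: () => void): unknown;
  address(): { port: number } | string | null;
}

// Write the bridge record only from the listen callback, where the address
// is expected to be bound, instead of immediately after calling listen().
function writeRecordWhenBound(
  server: ListeningServer,
  port: number,
  writeRecord: (address: string) => void,
): void {
  server.listen(port, () => {
    const addr = server.address();
    if (addr === null) {
      // Still unbound (for example, closed in the meantime): skip quietly.
      return;
    }
    writeRecord(typeof addr === "string" ? addr : `port:${addr.port}`);
  });
}
```

With this shape there is no early write attempt, so an "address unavailable" message would only appear if the server is genuinely unbound when the callback runs.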
Why this matters
For short turns like “reply only OK”, a user-visible 18s turn currently looks like:
queue: essentially free
embedded startup trace: ~1.3s
prompt submit after session start: immediate
model/tool work: ~6s
But the missing ~6s before session.started is not attributable from current logs. Adding thread lifecycle timing would make future reports actionable and would clarify whether the latency is caused by plugin app config recovery/build, app-server thread resume/start RPC, binding I/O, or something else.
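The size of the unattributed gap can be checked directly from the timestamps in the observed sample. A minimal TypeScript sketch, assuming the timestamps are transcribed by hand from the timeline (the TimelineEvent type and the helper names are illustrative, not part of OpenClaw):

```typescript
interface TimelineEvent {
  label: string;
  at: string; // HH:MM:SS.mmm, all on the same day
}

// Timestamps copied from the observed clean sample.
const events: TimelineEvent[] = [
  { label: "message queued", at: "08:37:16.717" },
  { label: "lane enqueue", at: "08:37:20.273" },
  { label: "main lane dequeue", at: "08:37:20.289" },
  { label: "strict-agentic execution contract active", at: "08:37:21.642" },
  { label: "startup stages trace", at: "08:37:21.658" },
  { label: "session.started", at: "08:37:27.981" },
  { label: "prompt.submitted", at: "08:37:27.992" },
  { label: "model.completed", at: "08:37:34.222" },
  { label: "message processed", at: "08:37:35.022" },
];

// Convert HH:MM:SS.mmm to milliseconds since midnight.
function toMs(stamp: string): number {
  const m = /^(\d{2}):(\d{2}):(\d{2})\.(\d{3})$/.exec(stamp);
  if (!m) throw new Error(`bad timestamp: ${stamp}`);
  return ((Number(m[1]) * 60 + Number(m[2])) * 60 + Number(m[3])) * 1000 + Number(m[4]);
}

// Elapsed milliseconds between each pair of consecutive events.
function deltas(rows: TimelineEvent[]): Array<{ from: string; to: string; ms: number }> {
  const out: Array<{ from: string; to: string; ms: number }> = [];
  let prev: TimelineEvent | undefined;
  for (const cur of rows) {
    if (prev) {
      out.push({ from: prev.label, to: cur.label, ms: toMs(cur.at) - toMs(prev.at) });
    }
    prev = cur;
  }
  return out;
}

for (const gap of deltas(events)) {
  console.log(`${gap.ms}ms  ${gap.from} -> ${gap.to}`);
}
```

For this run the largest single gap is the 6323ms between the startup stages line and session.started, which no existing trace stage covers.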
What was not tested
I did not test with diagnostics.cacheTrace because of the privacy footprint.
I did not patch runtime code locally for this report.
I did not isolate whether native Codex plugin app config (feat(codex): enable native plugin app support #78733) specifically contributes to the thread lifecycle cost; the requested change is primarily observability so that can be measured cleanly.
Key observation:
Lane queue wait is 5-6ms.
The embedded startup trace totals 1365ms.
session.started does not occur until ~6.3s after strict-agentic active / attempt-dispatch.
prompt.submitted is immediate after session.started (~11ms).
The unattributed time falls before session.started, likely in the Codex app-server thread lifecycle.
Relevant code paths
The current startup trace is marked and emitted in:
src/agents/pi-embedded-runner/run.ts
The next expensive path appears to be in:
extensions/codex/src/app-server/run-attempt.ts
extensions/codex/src/app-server/thread-lifecycle.ts
Specifically around and inside startOrResumeThread, including:
readCodexAppServerBinding(...)
pluginThreadConfig.build() / recoverable plugin binding recheck
client.request("thread/resume", ...) or client.request("thread/start", ...)
writeCodexAppServerBinding(...)
Suggested fix
Add a Codex app-server thread lifecycle stage summary, analogous to the embedded startup summary, around startOrResumeThread().
Suggested stages:
dynamic-tools-fingerprint
context-engine-binding
user-mcp-config
read-binding
binding-compat-checks
plugin-binding-stale-check
plugin-thread-config-build / plugin-thread-config-recheck
thread-resume-request
thread-start-request
write-binding
thread-ready
Suggested behavior:
Emit the summary at OPENCLAW_LOG_LEVEL=trace.
Emit when the total exceeds e.g. 2000ms, or any stage exceeds e.g. 1000ms.
Report the thread outcome (started vs resumed) and whether plugin app config was built/reused.
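A stage summary of this kind could be collected with a small mark-based timer. The following is a minimal sketch under stated assumptions: StageTimer, exceedsThreshold, and formatSummary are hypothetical names that do not exist in OpenClaw, the thresholds are the "e.g." values from the suggestion, and the output string is an illustrative format rather than the target log shape from this report.

```typescript
// Hypothetical helper: none of these names exist in OpenClaw today.
type Clock = () => number;

interface StageTiming {
  name: string;
  ms: number;
}

interface StageSummary {
  totalMs: number;
  stages: StageTiming[];
}

class StageTimer {
  private readonly now: Clock;
  private readonly startedAt: number;
  private lastMark: number;
  private readonly stages: StageTiming[] = [];

  // The clock is injectable so the timer can be tested deterministically.
  constructor(now: Clock = () => Date.now()) {
    this.now = now;
    this.startedAt = now();
    this.lastMark = this.startedAt;
  }

  // Record the time elapsed since the previous mark under the given stage name.
  mark(name: string): void {
    const t = this.now();
    this.stages.push({ name, ms: t - this.lastMark });
    this.lastMark = t;
  }

  summary(): StageSummary {
    return { totalMs: this.lastMark - this.startedAt, stages: [...this.stages] };
  }
}

// Default limits mirror the "e.g." values suggested above; both are assumptions.
function exceedsThreshold(s: StageSummary, totalLimitMs = 2000, stageLimitMs = 1000): boolean {
  return s.totalMs > totalLimitMs || s.stages.some((stage) => stage.ms > stageLimitMs);
}

// Illustrative one-line format only.
function formatSummary(action: "started" | "resumed", s: StageSummary): string {
  const parts = s.stages.map((stage) => `${stage.name}:${stage.ms}ms`).join(",");
  return `thread lifecycle stages action=${action} totalMs=${s.totalMs} stages=${parts}`;
}
```

In use, startOrResumeThread() would create one timer on entry and call mark("read-binding") once the binding read returns, mark("thread-resume-request") once the RPC resolves, and so on. It would then emit formatSummary(...) at trace level, or whenever exceedsThreshold(...) is true.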