Skip to content

Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started #84640

@crash2kx

Description

@crash2kx

Problem

A production OpenClaw Docker gateway on 2026.5.18 shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing trace output makes the slow section hard to localize.

With OPENCLAW_LOG_LEVEL=trace, [trace:embedded-run] startup stages ends at attempt-dispatch and reports only ~1.36s of startup/prep. However, session.started is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (startOrResumeThread / thread.resume or thread.start), but today it has no comparable stage breakdown.

This makes operators think prep is cheap while the user-visible turn is still slow, and it makes it hard to tell whether the time is spent in binding reads, plugin app config recovery/build, thread resume/start RPC, or binding writes.

Environment

  • OpenClaw: 2026.5.18
  • Runtime: Docker, derived runtime image
  • Channel tested: Dashboard/Webchat direct conversation
  • Model/provider: openai-codex/gpt-5.5
  • Gateway health: healthy during the clean sample
  • OPENCLAW_LOG_LEVEL=trace was enabled temporarily for the measurement, then restored
  • diagnostics.cacheTrace was not enabled

Observed clean sample

Run: 7141c10d-811a-49b8-a5e2-c3ca23f08b84

Timeline from gateway logs / trajectory:

08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms

Existing embedded startup trace:

[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms

Trajectory for the same run:

session.started   08:37:27.981
context.compiled  08:37:27.984  (+3ms)
prompt.submitted  08:37:27.992  (+11ms)
tool.call         08:37:32.186  (+4205ms)
tool.result       08:37:32.260  (+4279ms)
model.completed   08:37:34.222  (+6241ms)
session.ended     08:37:34.222  (+6241ms)

Key observation:

  • OpenClaw lane wait is tiny: 5-6ms.
  • Existing embedded startup trace reports only 1365ms.
  • session.started does not occur until ~6.3s after strict-agentic active / attempt-dispatch.
  • prompt.submitted is immediate after session.started (~11ms).
  • Therefore the largest unaccounted startup block is before session.started, likely in Codex app-server thread lifecycle.

Relevant code paths

The current startup trace is emitted in:

  • src/agents/pi-embedded-runner/run.ts

It marks and emits at:

startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
notifyExecutionPhase("attempt_dispatch", { provider, model: modelId });
emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
startupStagesEmitted = true;

The next expensive path appears to be in:

  • extensions/codex/src/app-server/run-attempt.ts
  • extensions/codex/src/app-server/thread-lifecycle.ts

Specifically around:

const startupThread = await startOrResumeThread(buildThreadLifecycleParams());

and inside startOrResumeThread, including:

  • readCodexAppServerBinding(...)
  • context-engine/user-MCP/MCP/plugin binding compatibility checks
  • optional pluginThreadConfig.build() / recoverable plugin binding recheck
  • client.request("thread/resume", ...) or client.request("thread/start", ...)
  • writeCodexAppServerBinding(...)

Suggested fix

Add a Codex app-server thread lifecycle stage summary, analogous to the embedded startup summary, around startOrResumeThread().

Suggested stages:

  • dynamic-tools-fingerprint
  • context-engine-binding
  • user-mcp-config
  • read-binding
  • binding-compat-checks
  • plugin-binding-stale-check
  • plugin-thread-config-build / plugin-thread-config-recheck
  • thread-resume-request
  • thread-start-request
  • write-binding
  • thread-ready

Suggested behavior:

  • Emit at trace level when OPENCLAW_LOG_LEVEL=trace.
  • Warn automatically if total thread lifecycle exceeds e.g. 2000ms, or any stage exceeds e.g. 1000ms.
  • Include lifecycle action (started vs resumed) and whether plugin app config was built/reused.

Example target log shape:

[trace:codex-app-server] thread lifecycle: runId=... sessionId=... action=resumed totalMs=6339 stages=read-binding:12ms@12ms,binding-compat-checks:18ms@30ms,plugin-thread-config-recheck:2100ms@2130ms,thread-resume-request:3900ms@6030ms,write-binding:220ms@6250ms,thread-ready:89ms@6339ms

Secondary observation: native hook relay address debug race

In the same trace window, this line appears repeatedly a few seconds after harness selection:

native hook relay bridge server address unavailable

The source appears to call writeNativeHookRelayBridgeRecordForRegistration(...) immediately after server.listen(...), before the listen callback is guaranteed to have a bound address:

server.listen(0, "127.0.0.1", () => {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
});
if (relayBridges.get(registration.relayId) === bridge) {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
}

This may be a harmless debug race because the callback writes the record later, so I am not claiming it is the main latency cause. But it can confuse latency investigations because it appears in the hidden gap. Consider either removing the eager write, making it explicitly best-effort/noisy only at trace, or exposing a bridge-ready timing if it matters for hook startup.

Why this matters

For short turns like “reply only OK”, a user-visible 18s turn currently looks like:

  • queue: essentially free
  • embedded startup trace: ~1.3s
  • prompt submit after session start: immediate
  • model/tool work: ~6s

But the missing ~6s before session.started is not attributable from current logs. Adding thread lifecycle timing would make future reports actionable and would clarify whether the latency is caused by plugin app config recovery/build, app-server thread resume/start RPC, binding I/O, or something else.

What was not tested

  • I did not test with diagnostics.cacheTrace because of the privacy footprint.
  • I did not patch runtime code locally for this report.
  • I did not isolate whether native Codex plugin app config (feat(codex): enable native plugin app support #78733) specifically contributes to the thread lifecycle cost; the requested change is primarily observability so that can be measured cleanly.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Normal backlog priority with limited blast radius.clawsweeper:fix-shape-clearClawSweeper found a clear likely implementation shape for this issue.clawsweeper:needs-live-reproClawSweeper needs live local, crabbox, or manual validation to confirm this issue.clawsweeper:queueable-fixClawSweeper marked this issue as an existing queue_fix_pr work candidate.issue-rating: 🐚 platinum hermitGood issue quality with a plausible reproduction path needing some confirmation.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions