Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started

## Problem

A production OpenClaw Docker gateway on `2026.5.18` shows noticeable latency in short Codex-backed Dashboard/Webchat turns, but the existing trace output makes the slow section hard to localize.

With `OPENCLAW_LOG_LEVEL=trace`, `[trace:embedded-run] startup stages` ends at `attempt-dispatch` and reports only ~1.36s of startup/prep. However, `session.started` is recorded several seconds later. That gap appears to be inside the Codex app-server thread lifecycle (`startOrResumeThread` / `thread.resume` or `thread.start`), but today it has no comparable stage breakdown.

This makes operators think prep is cheap while the user-visible turn is still slow, and it makes it hard to tell whether the time is spent in binding reads, plugin app config recovery/build, thread resume/start RPC, or binding writes.

## Environment

- OpenClaw: `2026.5.18`
- Runtime: Docker, derived runtime image
- Channel tested: Dashboard/Webchat direct conversation
- Model/provider: `openai-codex/gpt-5.5`
- Gateway health: healthy during the clean sample
- `OPENCLAW_LOG_LEVEL=trace` was enabled temporarily for the measurement, then restored
- `diagnostics.cacheTrace` was not enabled

## Observed clean sample

Run: `7141c10d-811a-49b8-a5e2-c3ca23f08b84`

Timeline from gateway logs / trajectory:

```text
08:37:16.717 message queued
08:37:20.273 lane enqueue
08:37:20.278 lane dequeue waitMs=5
08:37:20.282 main lane enqueue
08:37:20.289 main lane dequeue waitMs=6
08:37:21.642 strict-agentic execution contract active
08:37:21.658 [trace:embedded-run] startup stages ... totalMs=1365
08:37:27.981 session.started
08:37:27.992 prompt.submitted
08:37:32.186 tool.call message
08:37:32.260 tool.result message
08:37:34.222 model.completed
08:37:35.022 message processed duration=18334ms
```

Existing embedded startup trace:

```text
[trace:embedded-run] startup stages: runId=7141c10d-811a-49b8-a5e2-c3ca23f08b84 sessionId=18b56039-14d7-448f-95bd-edef2c59acd5 phase=attempt-dispatch totalMs=1365 stages=workspace:1ms@1ms,runtime-plugins:24ms@25ms,hooks:1ms@26ms,model-resolution:841ms@867ms,auth:482ms@1349ms,context-engine:7ms@1356ms,attempt-workspace:8ms@1364ms,attempt-prompt:0ms@1364ms,attempt-runtime-plan:1ms@1365ms,attempt-dispatch:0ms@1365ms
```

Trajectory for the same run:

```text
session.started   08:37:27.981
context.compiled  08:37:27.984  (+3ms)
prompt.submitted  08:37:27.992  (+11ms)
tool.call         08:37:32.186  (+4205ms)
tool.result       08:37:32.260  (+4279ms)
model.completed   08:37:34.222  (+6241ms)
session.ended     08:37:34.222  (+6241ms)
```

Key observation:

- OpenClaw lane wait is tiny: `5-6ms`.
- Existing embedded startup trace reports only `1365ms`.
- `session.started` does not occur until ~`6.3s` after `strict-agentic active` / `attempt-dispatch`.
- `prompt.submitted` is immediate after `session.started` (`~11ms`).
- Therefore the largest unaccounted startup block is before `session.started`, likely in Codex app-server thread lifecycle.

## Relevant code paths

The current startup trace is emitted in:

- `src/agents/pi-embedded-runner/run.ts`

It marks and emits at:

```ts
startupStages.mark(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
notifyExecutionPhase("attempt_dispatch", { provider, model: modelId });
emitStartupStageSummary(EMBEDDED_RUN_ATTEMPT_DISPATCH_STAGE.dispatch);
startupStagesEmitted = true;
```

The next expensive path appears to be in:

- `extensions/codex/src/app-server/run-attempt.ts`
- `extensions/codex/src/app-server/thread-lifecycle.ts`

Specifically around:

```ts
const startupThread = await startOrResumeThread(buildThreadLifecycleParams());
```

and inside `startOrResumeThread`, including:

- `readCodexAppServerBinding(...)`
- context-engine/user-MCP/MCP/plugin binding compatibility checks
- optional `pluginThreadConfig.build()` / recoverable plugin binding recheck
- `client.request("thread/resume", ...)` or `client.request("thread/start", ...)`
- `writeCodexAppServerBinding(...)`

## Suggested fix

Add a Codex app-server thread lifecycle stage summary, analogous to the embedded startup summary, around `startOrResumeThread()`.

Suggested stages:

- `dynamic-tools-fingerprint`
- `context-engine-binding`
- `user-mcp-config`
- `read-binding`
- `binding-compat-checks`
- `plugin-binding-stale-check`
- `plugin-thread-config-build` / `plugin-thread-config-recheck`
- `thread-resume-request`
- `thread-start-request`
- `write-binding`
- `thread-ready`

Suggested behavior:

- Emit at trace level when `OPENCLAW_LOG_LEVEL=trace`.
- Warn automatically if total thread lifecycle exceeds e.g. `2000ms`, or any stage exceeds e.g. `1000ms`.
- Include lifecycle action (`started` vs `resumed`) and whether plugin app config was built/reused.

Example target log shape:

```text
[trace:codex-app-server] thread lifecycle: runId=... sessionId=... action=resumed totalMs=6339 stages=read-binding:12ms@12ms,binding-compat-checks:18ms@30ms,plugin-thread-config-recheck:2100ms@2130ms,thread-resume-request:3900ms@6030ms,write-binding:220ms@6250ms,thread-ready:89ms@6339ms
```

## Secondary observation: native hook relay address debug race

In the same trace window, this line appears repeatedly a few seconds after harness selection:

```text
native hook relay bridge server address unavailable
```

The source appears to call `writeNativeHookRelayBridgeRecordForRegistration(...)` immediately after `server.listen(...)`, before the `listen` callback is guaranteed to have a bound address:

```ts
server.listen(0, "127.0.0.1", () => {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
});
if (relayBridges.get(registration.relayId) === bridge) {
  writeNativeHookRelayBridgeRecordForRegistration(registration, bridge);
}
```

This may be a harmless debug race because the callback writes the record later, so I am not claiming it is the main latency cause. But it can confuse latency investigations because it appears in the hidden gap. Consider either removing the eager write, making it explicitly best-effort/noisy only at trace, or exposing a bridge-ready timing if it matters for hook startup.

## Why this matters

For short turns like “reply only OK”, a user-visible `18s` turn currently looks like:

- queue: essentially free
- embedded startup trace: `~1.3s`
- prompt submit after session start: immediate
- model/tool work: `~6s`

But the missing `~6s` before `session.started` is not attributable from current logs. Adding thread lifecycle timing would make future reports actionable and would clarify whether the latency is caused by plugin app config recovery/build, app-server thread resume/start RPC, binding I/O, or something else.

## What was not tested

- I did not test with `diagnostics.cacheTrace` because of the privacy footprint.
- I did not patch runtime code locally for this report.
- I did not isolate whether native Codex plugin app config (#78733) specifically contributes to the thread lifecycle cost; the requested change is primarily observability so that can be measured cleanly.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started #84640

Problem

Environment

Observed clean sample

Relevant code paths

Suggested fix

Secondary observation: native hook relay address debug race

Why this matters

What was not tested

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Codex app-server thread lifecycle latency is hidden between attempt-dispatch and session.started #84640

Description

Problem

Environment

Observed clean sample

Relevant code paths

Suggested fix

Secondary observation: native hook relay address debug race

Why this matters

What was not tested

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions