Skip to content

Add production startup and TTFT telemetry#22198

Merged
mzeng-openai merged 4 commits into
mainfrom
dev/mzeng/startup_latency_telemetry
May 11, 2026
Merged

Add production startup and TTFT telemetry#22198
mzeng-openai merged 4 commits into
mainfrom
dev/mzeng/startup_latency_telemetry

Conversation

@mzeng-openai

@mzeng-openai mzeng-openai commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Why

While investigating codex exec hi startup latency, the useful questions were not "is startup slow?" but "which durable bucket is slow in production?"

The path we observed has a few distinct stages:

  1. thread/start creates the session
  2. startup prewarm builds the turn context, tools, and prompt
  3. startup prewarm warms the websocket
  4. the first real turn resolves the prewarm
  5. the model produces the first token

Before this PR, production telemetry had some of the raw measurements already:

  • aggregate startup-prewarm duration / age-at-first-turn metrics
  • TTFT as a metric
  • websocket request telemetry

But there was no coherent production event stream for the startup breakdown itself, and TTFT was metric-only. That made it hard to answer the same latency questions from OpenTelemetry-backed logs without adding one-off local instrumentation.

What changed

Add durable production telemetry on the existing SessionTelemetry path:

  • new codex.startup_phase OTel log/trace events plus codex.startup.phase.duration_ms
  • new codex.turn_ttft OTel log/trace events while preserving the existing TTFT metric

The startup phase event is emitted for the coarse buckets we actually observed while running exec hi:

  • thread_start_create_thread
  • startup_prewarm_total
  • startup_prewarm_create_turn_context
  • startup_prewarm_build_tools
  • startup_prewarm_build_prompt
  • startup_prewarm_websocket_warmup
  • startup_prewarm_resolve

These phases are intentionally low-cardinality so they remain safe as production telemetry tags.

Why this shape

This keeps the instrumentation on the same production path as the rest of the session telemetry instead of adding a local debug-only trace mode. It also avoids changing startup behavior:

  • prewarm still runs
  • no control flow changes
  • no extra remote calls
  • no user-visible behavior changes

One boundary is intentional: very early process bootstrap that happens before a session exists is not included here, because this PR uses session-scoped production telemetry. The expensive buckets we were trying to understand after thread/start are now covered durably.

Verification

  • cargo test -p codex-otel
  • cargo test -p codex-core turn_timing
  • cargo test -p codex-core regular_turn_emits_turn_started_without_waiting_for_startup_prewarm
  • cargo test -p codex-core interrupting_regular_turn_waiting_on_startup_prewarm_emits_turn_aborted
  • cargo test -p codex-app-server thread_start
  • just fix -p codex-otel -p codex-core -p codex-app-server

I also ran cargo test -p codex-core; it built successfully and then hit an existing unrelated stack overflow in tools::handlers::multi_agents::tests::tool_handlers_cascade_close_and_resume_and_keep_explicitly_closed_subtrees_closed.

@mzeng-openai mzeng-openai changed the title Add opt-in startup latency tracing Add production startup and TTFT telemetry May 11, 2026
@mzeng-openai mzeng-openai marked this pull request as ready for review May 11, 2026 20:41
@mzeng-openai mzeng-openai requested a review from a team as a code owner May 11, 2026 20:41
@mzeng-openai mzeng-openai requested review from alexsong-oai, owenlin0 and pakrym-oai and removed request for pakrym-oai May 11, 2026 20:41
"thread_start_create_thread",
create_thread_started_at.elapsed(),
Some("ready"),
);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why call it done here instead of at the bottom of the method?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. I kept this one scoped to the inner create-thread call, but added a separate thread_start_total phase at the end so we get both numbers.

}

#[test]
fn otel_export_routing_policy_routes_startup_and_ttft_observability() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this test useful?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair. This was mostly duplicating the existing routing-policy coverage for the same macro path, so I dropped it.

@owenlin0 owenlin0 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comments but preapproving

@mzeng-openai mzeng-openai enabled auto-merge (squash) May 11, 2026 23:35
@mzeng-openai mzeng-openai merged commit e15ecc9 into main May 11, 2026
27 checks passed
@mzeng-openai mzeng-openai deleted the dev/mzeng/startup_latency_telemetry branch May 11, 2026 23:58
@github-actions github-actions Bot locked and limited conversation to collaborators May 11, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants