Skip to content

feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric) #4413

@doudouOUC

Description

@doudouOUC

What would you like to be added?

Add LLM request timing decomposition to the qwen-code.llm_request span so operators can answer "why was this LLM call slow?" from a single trace, without guessing. Specifically:

  • TTFT (time-to-first-token) — wall-clock from successful attempt start to first stream chunk containing user-visible content.
  • Request setup duration — time spent in retries + backoff before the successful attempt started.
  • Sampling duration + output tokens/s — derived from TTFT + total duration.
  • Per-attempt retry telemetry — coverage of the retryWithBackoff retry path that is currently completely invisible (only debugLogger.warn, no OTel event/metric/span).
  • api.request.breakdown metric activation — populate 3 of the 4 ApiRequestPhase values from the LLM call site (currently the metric is defined and tested but has zero production callers).
  • GenAI semconv dual-emit — alongside qwen-code-native attribute names, write the corresponding OTel GenAI Semantic Convention attributes for backend interop. Same pattern as feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410.

This is the P3 line "LLM request decomposition / TTFT" in #3731. Phase 4 builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321); independent of Phase 3 (#4410, in review).

Design doc:
docs/design/telemetry-llm-request-timing-design.md
— full design: 7 decisions with "Why not X" justifications, lifecycle wiring (streaming / non-streaming / retry sites), files-to-change table with LOC estimates, test plan, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT measurement at all).

Why is this needed?

qwen-code.llm_request spans today carry only model, prompt_id, input_tokens, output_tokens, success, error, duration_ms. Operators reading a single trace cannot tell:

  1. How much of duration_ms was the model thinking vs the network setup. A 12-second duration_ms could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.
  2. When the user saw the first token. TTFT is the standard latency SLO for chat UIs. We can't compute it; we don't capture it.
  3. What happened during retries. retryWithBackoff (utils/retry.ts:285) only calls debugLogger.warn — no OTel event, no span attribute. The 4 LLM call sites that go through it (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039) have zero retry visibility in traces or metrics. ContentRetryEvent exists for content-recovery retries inside geminiChat.ts:806,830 but not for the more common rate-limit / 5xx retries from retryWithBackoff.
  4. That api.request.breakdown is dead code. The metric is defined at metrics.ts:242-251 with 4 ApiRequestPhase values, exported from index.ts:117, tested in metrics.test.ts:646-675 — but recordApiRequestBreakdown() has zero callers in production code.

These gaps make qwen-code.llm_request the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.

Suggested implementation

The design doc has 7 locked decisions; summary:

  • D1 (TTFT semantic): "first chunk containing user-visible content" — text / functionCall / inlineData / executableCode / thought. Diverges from claude-code's message_start (metadata-only) semantic, matches the literal "first token" intent. Cross-provider uniform via the normalized GenerateContentResponse.candidates[0].content.parts inspection (hasUserVisibleContent helper).
  • D2 (measurement site): method-local closure in LoggingContentGenerator.generateContentStream. NEVER instance fields — LoggingContentGenerator is one-per-ContentGenerator, shared across concurrent calls (subagent fan-out, warmup, side-queries).
  • D3 (request_setup_ms): entry-time vs successful-attempt-start. Goes on the OTel span (claude-code only puts it on Perfetto — qwen-code has no Perfetto, OTel attrs are the only channel).
  • D4 (retry telemetry): onRetry callback option on retryWithBackoff + new ApiRetryEvent. Opt-in per caller so non-LLM callers (channels/weixin/src/api.ts) stay silent.
  • D5 (parent span aggregation): scalar attrs only (no map-typed) — attempt, retry_total_delay_ms. Per-attempt status-code distribution lives in log-bridge spans from ApiRetryEvent.
  • D6 (breakdown metric): activate recordApiRequestBreakdown() for 3 of 4 phases (REQUEST_PREPARATION, NETWORK_LATENCY, RESPONSE_PROCESSING). Skip TOKEN_PROCESSING — qwen-code has no architecturally distinct post-stream phase.
  • D7 (idempotency): metric recording sits inside the existing endLLMRequestSpan double-end guard from Phase 1.5 (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302).
  • D8 (GenAI semconv dual-emit): write ttft_ms + gen_ai.server.time_to_first_token (Experimental), input_tokens + gen_ai.usage.input_tokens (Stable), etc. Private name is authoritative; semconv is a compat layer.

PR split (mirrors the existing pattern of #4126#4302#4321):

  • Phase 4a (~200 LOC) — TTFT capture + extended LLMRequestMetadata + dual-emit. Self-contained value (TTFT visibility from day one).
  • Phase 4b (~250 LOC) — onRetry callback + ApiRetryEvent + 4 caller wiring. Independently a bug fix for the retryWithBackoff telemetry gap.
  • Phase 4c (~160 LOC) — recordApiRequestBreakdown activation + parent span retry-aggregation attrs (attempt, retry_total_delay_ms). Depends on 4a + 4b.

Acceptance criteria

  • qwen-code.llm_request span carries ttft_ms for streaming requests with user-visible content, undefined otherwise (non-streaming, aborted before first chunk, content-empty streams).
  • Span carries request_setup_ms (== retries + backoff before successful attempt), attempt, retry_total_delay_ms populated by retry layer; defaults to undefined / 1 / 0 when no retries.
  • Span carries derived sampling_ms (clamped to >= 0) and output_tokens_per_second (omitted when sampling_ms == 0 to avoid divide-by-zero).
  • Span dual-emits gen_ai.server.time_to_first_token (seconds, double), gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_tokens, gen_ai.request.model alongside qwen-code-native names.
  • retryWithBackoff onRetry callback fires once per failed attempt at all 4 LLM call sites; emits ApiRetryEvent LogRecord; bridges to a child span via LogToSpanProcessor nested under the active LLM span.
  • Non-LLM callers (channels/weixin/src/api.ts) remain silent — no ApiRetryEvent for non-LLM retries.
  • recordApiRequestBreakdown() is called exactly once per LLM request (gated by endLLMRequestSpan double-end guard), with 3 phase values for streaming requests and 2 for non-streaming.
  • Concurrent generateContentStream calls on one shared LoggingContentGenerator produce per-call independent TTFT values (no cross-contamination from instance state).
  • Unit tests cover: TTFT semantics across part types (text / functionCall / inlineData / executableCode / thought, role-only, usage-only), TTFT capture in streaming, TTFT undefined in non-streaming / aborted-before-first-chunk, dual-emit for all gen_ai.* attrs, sampling_ms clamping, otps rounding + divide-by-zero, retry callback contract (called per attempt + silent without callback), metric idempotency.
  • Docs (design doc + relevant docs/developers/development/telemetry.md updates).

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions