feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric)

## What would you like to be added?

Add LLM request timing decomposition to the `qwen-code.llm_request` span so operators can answer "why was this LLM call slow?" from a single trace, without guessing. Specifically:

- **TTFT (time-to-first-token)** — wall-clock from successful attempt start to first stream chunk containing user-visible content.
- **Request setup duration** — time spent in retries + backoff before the successful attempt started.
- **Sampling duration + output tokens/s** — derived from TTFT + total duration.
- **Per-attempt retry telemetry** — coverage of the `retryWithBackoff` retry path that is currently completely invisible (only `debugLogger.warn`, no OTel event/metric/span).
- **`api.request.breakdown` metric activation** — populate 3 of the 4 `ApiRequestPhase` values from the LLM call site (currently the metric is defined and tested but has zero production callers).
- **GenAI semconv dual-emit** — alongside qwen-code-native attribute names, write the corresponding OTel GenAI Semantic Convention attributes for backend interop. Same pattern as #4410.

This is the P3 line *"LLM request decomposition / TTFT"* in #3731. Phase 4 builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321); independent of Phase 3 (#4410, in review).

**Design doc**:
[`docs/design/telemetry-llm-request-timing-design.md`](../blob/main/docs/design/telemetry-llm-request-timing-design.md)
— full design: 7 decisions with "Why not X" justifications, lifecycle wiring (streaming / non-streaming / retry sites), files-to-change table with LOC estimates, test plan, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT measurement at all).

## Why is this needed?

`qwen-code.llm_request` spans today carry only `model`, `prompt_id`, `input_tokens`, `output_tokens`, `success`, `error`, `duration_ms`. Operators reading a single trace cannot tell:

1. **How much of `duration_ms` was the model thinking vs the network setup**. A 12-second `duration_ms` could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.
2. **When the user saw the first token**. TTFT is the standard latency SLO for chat UIs. We can't compute it; we don't capture it.
3. **What happened during retries**. `retryWithBackoff` (`utils/retry.ts:285`) only calls `debugLogger.warn` — no OTel event, no span attribute. The 4 LLM call sites that go through it (`client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039`) have zero retry visibility in traces or metrics. `ContentRetryEvent` exists for content-recovery retries inside `geminiChat.ts:806,830` but not for the more common rate-limit / 5xx retries from `retryWithBackoff`.
4. **That `api.request.breakdown` is dead code**. The metric is defined at `metrics.ts:242-251` with 4 `ApiRequestPhase` values, exported from `index.ts:117`, tested in `metrics.test.ts:646-675` — but `recordApiRequestBreakdown()` has zero callers in production code.

These gaps make `qwen-code.llm_request` the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.

## Suggested implementation

The design doc has 7 locked decisions; summary:

- **D1 (TTFT semantic)**: "first chunk containing user-visible content" — text / functionCall / inlineData / executableCode / thought. Diverges from claude-code's `message_start` (metadata-only) semantic, matches the literal "first token" intent. Cross-provider uniform via the normalized `GenerateContentResponse.candidates[0].content.parts` inspection (`hasUserVisibleContent` helper).
- **D2 (measurement site)**: method-local closure in `LoggingContentGenerator.generateContentStream`. NEVER instance fields — `LoggingContentGenerator` is one-per-`ContentGenerator`, shared across concurrent calls (subagent fan-out, warmup, side-queries).
- **D3 (`request_setup_ms`)**: entry-time vs successful-attempt-start. Goes on the OTel span (claude-code only puts it on Perfetto — qwen-code has no Perfetto, OTel attrs are the only channel).
- **D4 (retry telemetry)**: `onRetry` callback option on `retryWithBackoff` + new `ApiRetryEvent`. Opt-in per caller so non-LLM callers (`channels/weixin/src/api.ts`) stay silent.
- **D5 (parent span aggregation)**: scalar attrs only (no map-typed) — `attempt`, `retry_total_delay_ms`. Per-attempt status-code distribution lives in log-bridge spans from `ApiRetryEvent`.
- **D6 (breakdown metric)**: activate `recordApiRequestBreakdown()` for 3 of 4 phases (`REQUEST_PREPARATION`, `NETWORK_LATENCY`, `RESPONSE_PROCESSING`). Skip `TOKEN_PROCESSING` — qwen-code has no architecturally distinct post-stream phase.
- **D7 (idempotency)**: metric recording sits inside the existing `endLLMRequestSpan` double-end guard from Phase 1.5 (#4302).
- **D8 (GenAI semconv dual-emit)**: write `ttft_ms` + `gen_ai.server.time_to_first_token` (Experimental), `input_tokens` + `gen_ai.usage.input_tokens` (Stable), etc. Private name is authoritative; semconv is a compat layer.

PR split (mirrors the existing pattern of #4126 → #4302 → #4321):

- **Phase 4a** (~200 LOC) — TTFT capture + extended `LLMRequestMetadata` + dual-emit. Self-contained value (TTFT visibility from day one).
- **Phase 4b** (~250 LOC) — `onRetry` callback + `ApiRetryEvent` + 4 caller wiring. Independently a bug fix for the `retryWithBackoff` telemetry gap.
- **Phase 4c** (~160 LOC) — `recordApiRequestBreakdown` activation + parent span retry-aggregation attrs (`attempt`, `retry_total_delay_ms`). Depends on 4a + 4b.

## Acceptance criteria

- `qwen-code.llm_request` span carries `ttft_ms` for streaming requests with user-visible content, undefined otherwise (non-streaming, aborted before first chunk, content-empty streams).
- Span carries `request_setup_ms` (== retries + backoff before successful attempt), `attempt`, `retry_total_delay_ms` populated by retry layer; defaults to undefined / 1 / 0 when no retries.
- Span carries derived `sampling_ms` (clamped to >= 0) and `output_tokens_per_second` (omitted when `sampling_ms == 0` to avoid divide-by-zero).
- Span dual-emits `gen_ai.server.time_to_first_token` (seconds, double), `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens`, `gen_ai.request.model` alongside qwen-code-native names.
- `retryWithBackoff` `onRetry` callback fires once per failed attempt at all 4 LLM call sites; emits `ApiRetryEvent` LogRecord; bridges to a child span via `LogToSpanProcessor` nested under the active LLM span.
- Non-LLM callers (`channels/weixin/src/api.ts`) remain silent — no `ApiRetryEvent` for non-LLM retries.
- `recordApiRequestBreakdown()` is called exactly once per LLM request (gated by `endLLMRequestSpan` double-end guard), with 3 phase values for streaming requests and 2 for non-streaming.
- Concurrent `generateContentStream` calls on one shared `LoggingContentGenerator` produce per-call independent TTFT values (no cross-contamination from instance state).
- Unit tests cover: TTFT semantics across part types (text / functionCall / inlineData / executableCode / thought, role-only, usage-only), TTFT capture in streaming, TTFT undefined in non-streaming / aborted-before-first-chunk, dual-emit for all gen_ai.* attrs, sampling_ms clamping, otps rounding + divide-by-zero, retry callback contract (called per attempt + silent without callback), metric idempotency.
- Docs (design doc + relevant `docs/developers/development/telemetry.md` updates).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric) #4413

What would you like to be added?

Why is this needed?

Suggested implementation

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric) #4413

Description

What would you like to be added?

Why is this needed?

Suggested implementation

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions