You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Add LLM request timing decomposition to the qwen-code.llm_request span so operators can answer "why was this LLM call slow?" from a single trace, without guessing. Specifically:
TTFT (time-to-first-token) — wall-clock from successful attempt start to first stream chunk containing user-visible content.
Request setup duration — time spent in retries + backoff before the successful attempt started.
Sampling duration + output tokens/s — derived from TTFT + total duration.
Per-attempt retry telemetry — coverage of the retryWithBackoff retry path that is currently completely invisible (only debugLogger.warn, no OTel event/metric/span).
api.request.breakdown metric activation — populate 3 of the 4 ApiRequestPhase values from the LLM call site (currently the metric is defined and tested but has zero production callers).
This is the P3 line "LLM request decomposition / TTFT" in #3731. Phase 4 builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321); independent of Phase 3 (#4410, in review).
Design doc: docs/design/telemetry-llm-request-timing-design.md
— full design: 7 decisions with "Why not X" justifications, lifecycle wiring (streaming / non-streaming / retry sites), files-to-change table with LOC estimates, test plan, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT measurement at all).
Why is this needed?
qwen-code.llm_request spans today carry only model, prompt_id, input_tokens, output_tokens, success, error, duration_ms. Operators reading a single trace cannot tell:
How much of duration_ms was the model thinking vs the network setup. A 12-second duration_ms could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.
When the user saw the first token. TTFT is the standard latency SLO for chat UIs. We can't compute it; we don't capture it.
What happened during retries. retryWithBackoff (utils/retry.ts:285) only calls debugLogger.warn — no OTel event, no span attribute. The 4 LLM call sites that go through it (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039) have zero retry visibility in traces or metrics. ContentRetryEvent exists for content-recovery retries inside geminiChat.ts:806,830 but not for the more common rate-limit / 5xx retries from retryWithBackoff.
That api.request.breakdown is dead code. The metric is defined at metrics.ts:242-251 with 4 ApiRequestPhase values, exported from index.ts:117, tested in metrics.test.ts:646-675 — but recordApiRequestBreakdown() has zero callers in production code.
These gaps make qwen-code.llm_request the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.
Suggested implementation
The design doc has 7 locked decisions; summary:
D1 (TTFT semantic): "first chunk containing user-visible content" — text / functionCall / inlineData / executableCode / thought. Diverges from claude-code's message_start (metadata-only) semantic, matches the literal "first token" intent. Cross-provider uniform via the normalized GenerateContentResponse.candidates[0].content.parts inspection (hasUserVisibleContent helper).
D2 (measurement site): method-local closure in LoggingContentGenerator.generateContentStream. NEVER instance fields — LoggingContentGenerator is one-per-ContentGenerator, shared across concurrent calls (subagent fan-out, warmup, side-queries).
D3 (request_setup_ms): entry-time vs successful-attempt-start. Goes on the OTel span (claude-code only puts it on Perfetto — qwen-code has no Perfetto, OTel attrs are the only channel).
D4 (retry telemetry): onRetry callback option on retryWithBackoff + new ApiRetryEvent. Opt-in per caller so non-LLM callers (channels/weixin/src/api.ts) stay silent.
D5 (parent span aggregation): scalar attrs only (no map-typed) — attempt, retry_total_delay_ms. Per-attempt status-code distribution lives in log-bridge spans from ApiRetryEvent.
D6 (breakdown metric): activate recordApiRequestBreakdown() for 3 of 4 phases (REQUEST_PREPARATION, NETWORK_LATENCY, RESPONSE_PROCESSING). Skip TOKEN_PROCESSING — qwen-code has no architecturally distinct post-stream phase.
D8 (GenAI semconv dual-emit): write ttft_ms + gen_ai.server.time_to_first_token (Experimental), input_tokens + gen_ai.usage.input_tokens (Stable), etc. Private name is authoritative; semconv is a compat layer.
PR split (mirrors the existing pattern of #4126 → #4302 → #4321):
Phase 4a (~200 LOC) — TTFT capture + extended LLMRequestMetadata + dual-emit. Self-contained value (TTFT visibility from day one).
Phase 4b (~250 LOC) — onRetry callback + ApiRetryEvent + 4 caller wiring. Independently a bug fix for the retryWithBackoff telemetry gap.
qwen-code.llm_request span carries ttft_ms for streaming requests with user-visible content, undefined otherwise (non-streaming, aborted before first chunk, content-empty streams).
Span carries request_setup_ms (== retries + backoff before successful attempt), attempt, retry_total_delay_ms populated by retry layer; defaults to undefined / 1 / 0 when no retries.
Span carries derived sampling_ms (clamped to >= 0) and output_tokens_per_second (omitted when sampling_ms == 0 to avoid divide-by-zero).
retryWithBackoffonRetry callback fires once per failed attempt at all 4 LLM call sites; emits ApiRetryEvent LogRecord; bridges to a child span via LogToSpanProcessor nested under the active LLM span.
Non-LLM callers (channels/weixin/src/api.ts) remain silent — no ApiRetryEvent for non-LLM retries.
recordApiRequestBreakdown() is called exactly once per LLM request (gated by endLLMRequestSpan double-end guard), with 3 phase values for streaming requests and 2 for non-streaming.
Concurrent generateContentStream calls on one shared LoggingContentGenerator produce per-call independent TTFT values (no cross-contamination from instance state).
Unit tests cover: TTFT semantics across part types (text / functionCall / inlineData / executableCode / thought, role-only, usage-only), TTFT capture in streaming, TTFT undefined in non-streaming / aborted-before-first-chunk, dual-emit for all gen_ai.* attrs, sampling_ms clamping, otps rounding + divide-by-zero, retry callback contract (called per attempt + silent without callback), metric idempotency.
What would you like to be added?
Add LLM request timing decomposition to the
qwen-code.llm_requestspan so operators can answer "why was this LLM call slow?" from a single trace, without guessing. Specifically:retryWithBackoffretry path that is currently completely invisible (onlydebugLogger.warn, no OTel event/metric/span).api.request.breakdownmetric activation — populate 3 of the 4ApiRequestPhasevalues from the LLM call site (currently the metric is defined and tested but has zero production callers).This is the P3 line "LLM request decomposition / TTFT" in #3731. Phase 4 builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321); independent of Phase 3 (#4410, in review).
Design doc:
docs/design/telemetry-llm-request-timing-design.md— full design: 7 decisions with "Why not X" justifications, lifecycle wiring (streaming / non-streaming / retry sites), files-to-change table with LOC estimates, test plan, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT measurement at all).
Why is this needed?
qwen-code.llm_requestspans today carry onlymodel,prompt_id,input_tokens,output_tokens,success,error,duration_ms. Operators reading a single trace cannot tell:duration_mswas the model thinking vs the network setup. A 12-secondduration_mscould be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.retryWithBackoff(utils/retry.ts:285) only callsdebugLogger.warn— no OTel event, no span attribute. The 4 LLM call sites that go through it (client.ts:1540,baseLlmClient.ts:193,282,geminiChat.ts:1039) have zero retry visibility in traces or metrics.ContentRetryEventexists for content-recovery retries insidegeminiChat.ts:806,830but not for the more common rate-limit / 5xx retries fromretryWithBackoff.api.request.breakdownis dead code. The metric is defined atmetrics.ts:242-251with 4ApiRequestPhasevalues, exported fromindex.ts:117, tested inmetrics.test.ts:646-675— butrecordApiRequestBreakdown()has zero callers in production code.These gaps make
qwen-code.llm_requestthe least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.Suggested implementation
The design doc has 7 locked decisions; summary:
message_start(metadata-only) semantic, matches the literal "first token" intent. Cross-provider uniform via the normalizedGenerateContentResponse.candidates[0].content.partsinspection (hasUserVisibleContenthelper).LoggingContentGenerator.generateContentStream. NEVER instance fields —LoggingContentGeneratoris one-per-ContentGenerator, shared across concurrent calls (subagent fan-out, warmup, side-queries).request_setup_ms): entry-time vs successful-attempt-start. Goes on the OTel span (claude-code only puts it on Perfetto — qwen-code has no Perfetto, OTel attrs are the only channel).onRetrycallback option onretryWithBackoff+ newApiRetryEvent. Opt-in per caller so non-LLM callers (channels/weixin/src/api.ts) stay silent.attempt,retry_total_delay_ms. Per-attempt status-code distribution lives in log-bridge spans fromApiRetryEvent.recordApiRequestBreakdown()for 3 of 4 phases (REQUEST_PREPARATION,NETWORK_LATENCY,RESPONSE_PROCESSING). SkipTOKEN_PROCESSING— qwen-code has no architecturally distinct post-stream phase.endLLMRequestSpandouble-end guard from Phase 1.5 (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302).ttft_ms+gen_ai.server.time_to_first_token(Experimental),input_tokens+gen_ai.usage.input_tokens(Stable), etc. Private name is authoritative; semconv is a compat layer.PR split (mirrors the existing pattern of #4126 → #4302 → #4321):
LLMRequestMetadata+ dual-emit. Self-contained value (TTFT visibility from day one).onRetrycallback +ApiRetryEvent+ 4 caller wiring. Independently a bug fix for theretryWithBackofftelemetry gap.recordApiRequestBreakdownactivation + parent span retry-aggregation attrs (attempt,retry_total_delay_ms). Depends on 4a + 4b.Acceptance criteria
qwen-code.llm_requestspan carriesttft_msfor streaming requests with user-visible content, undefined otherwise (non-streaming, aborted before first chunk, content-empty streams).request_setup_ms(== retries + backoff before successful attempt),attempt,retry_total_delay_mspopulated by retry layer; defaults to undefined / 1 / 0 when no retries.sampling_ms(clamped to >= 0) andoutput_tokens_per_second(omitted whensampling_ms == 0to avoid divide-by-zero).gen_ai.server.time_to_first_token(seconds, double),gen_ai.usage.input_tokens,gen_ai.usage.output_tokens,gen_ai.usage.cached_tokens,gen_ai.request.modelalongside qwen-code-native names.retryWithBackoffonRetrycallback fires once per failed attempt at all 4 LLM call sites; emitsApiRetryEventLogRecord; bridges to a child span viaLogToSpanProcessornested under the active LLM span.channels/weixin/src/api.ts) remain silent — noApiRetryEventfor non-LLM retries.recordApiRequestBreakdown()is called exactly once per LLM request (gated byendLLMRequestSpandouble-end guard), with 3 phase values for streaming requests and 2 for non-streaming.generateContentStreamcalls on one sharedLoggingContentGeneratorproduce per-call independent TTFT values (no cross-contamination from instance state).docs/developers/development/telemetry.mdupdates).