feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731)#4417
Conversation
…wenLM#3731) Adds time-to-first-token measurement and OTel GenAI semantic convention dual-emit to the qwen-code.llm_request span. First slice of Phase 4 LLM request timing decomposition (sub-issue QwenLM#4413). Builds on Phase 1 (QwenLM#4126), Phase 1.5 (QwenLM#4302), Phase 2 (QwenLM#4321). Self-contained — ships TTFT visibility without depending on the Phase 4b retry-callback work. Changes ------- - New hasUserVisibleContent(chunk) helper centralizes "is this chunk first user-visible output" detection across all 4 providers. Operates on the normalized GenerateContentResponse shape each provider emits before LoggingContentGenerator sees the stream. Recognizes text / functionCall / inlineData / executableCode / thought parts; skips role-only and usageMetadata-only chunks. - TTFT capture in LoggingContentGenerator.generateContentStream's stream wrapper. Method-local closure variable, NEVER instance field — the generator is one-per-ContentGenerator and shared across concurrent calls (subagent fan-out, warmup, side-queries). - Extends LLMRequestMetadata with: ttftMs, requestSetupMs, attempt, retryTotalDelayMs, cachedInputTokens. The first is populated in 4a; the middle three are declared as optional so Phase 4b's retry callback can populate them without touching the schema again. - endLLMRequestSpan writes new span attrs: ttft_ms, sampling_ms (derived: duration - ttft - requestSetup, clamped to >= 0), output_tokens_per_second (derived: outputTokens / (sampling_ms / 1000), omitted when sampling_ms == 0 to avoid divide-by-zero, rounded to 2 decimals), cached_input_tokens, plus Phase 4b placeholders (attempt, request_setup_ms, retry_total_delay_ms) when caller provides them. - GenAI semconv dual-emit. Private qwen-code names stay authoritative (dashboards, SLOs, ARMS queries reference these); semconv is a compat layer for spec-aware backends. Same pattern as Phase 3 (QwenLM#4410): qwen-code.model + gen_ai.request.model (Stable) input_tokens + gen_ai.usage.input_tokens (Stable) output_tokens + gen_ai.usage.output_tokens (Stable) cached_input_tokens + gen_ai.usage.cached_tokens (Experimental) ttft_ms (ms, int) + gen_ai.server.time_to_first_token (s, double, Experimental) TTFT semantic intentionally diverges from claude-code's ttftMs (which fires on Anthropic's message_start metadata event). qwen-code measures "first user-visible content" — uniform across Anthropic / OpenAI / Gemini / Qwen, matches the literal "first token" intent. See design doc D1 for rationale. Design doc ---------- docs/design/telemetry-llm-request-timing-design.md (~530 lines) covers all 7 locked decisions with "Why not X" justifications, lifecycle wiring for streaming + non-streaming + 4 retry sites, files-to-change table with LOC estimates, test strategy, edge cases, rollback plan, and a comparison with claude-code (full decomposition but only for Perfetto) and opencode (no TTFT measurement at all). Tests ----- - streamContentDetection.test.ts (13 tests) covers part types matrix. - session-tracing.test.ts +16 tests (new describe block) cover dual-emit for every gen_ai.* field, sampling_ms clamping at 0 for clock skew, output_tokens_per_second rounding + divide-by-zero, Phase 4b placeholder fields written when present, absent when not. - loggingContentGenerator.test.ts +4 tests cover TTFT capture in streaming, undefined when stream yields only metadata chunks, and cachedInputTokens forwarding from usageMetadata on both stream and non-stream paths. All 448 telemetry tests pass, tsc --noEmit clean, eslint clean. Phase 4b (retry callback + ApiRetryEvent) and Phase 4c (recordApiRequestBreakdown activation) follow in separate PRs per the design doc split. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…ndary tests (QwenLM#3731) Address review findings on commit 5db5f9b (Phase 4a) before opening the PR. 1. hasUserVisibleContent: tighten the `thought` Part check from `p.thought !== undefined` to `p.thought === true`. The codebase treats `thought` as a boolean flag (see loggingContentGenerator.ts:773 / :865 where `part.thought ? {...} : {}` is the canonical pattern). The original predicate would falsely fire TTFT on a degenerate `{ thought: false }` part (which means "explicitly not a thought") and on bare `{ thought: undefined }` shapes that some normalizers may emit. Affects Phase 4a only — TTFT semantics, not retry telemetry. Existing test for `thought: true` is kept; added 3 new cases: - `{ thought: false }` returns false - `{}` (no thought key) returns false - `{ text: '', thought: true }` (typical Anthropic <thinking> first chunk) returns true — confirms the design-doc D1 intent that "thinking-stream started" counts as user-visible TTFT. 2. session-tracing.test.ts: add boundary case for `cachedInputTokens === 0`. Providers reporting an explicit cache miss (0) is meaningful info distinct from undefined ("we don't know"). The test asserts both `cached_input_tokens` and `gen_ai.usage.cached_tokens` propagate the literal 0, not get filtered. 3. JSDoc fix: remove the misleading "guard with `in`" comment from `streamContentDetection.ts` that referred to a guard pattern not actually used in the implementation. All 129 tests pass (was 125; +4 new), tsc --noEmit clean, eslint clean. No production-code behavior change beyond the `thought` predicate fix — which by construction can only ever shrink the set of chunks treated as TTFT-triggering, so cannot regress any successful TTFT capture. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
There was a problem hiding this comment.
Pull request overview
Adds Phase 4a of LLM request timing decomposition by capturing TTFT (time-to-first-token) and dual-emitting OpenTelemetry GenAI semantic convention attributes on qwen-code.llm_request spans, with supporting helpers and unit tests.
Changes:
- Extend
LLMRequestMetadataandendLLMRequestSpanto emit TTFT/cached tokens + derived sampling metrics, and to dual-emitgen_ai.*attributes. - Add a centralized
hasUserVisibleContent()predicate to detect the first “user-visible” streaming chunk across providers. - Wire TTFT + cached token propagation through
LoggingContentGeneratorand add focused unit tests + design documentation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/telemetry/session-tracing.ts | Adds new LLM span metadata fields and emits new timing/token + GenAI dual-emit attributes (including derived sampling metrics). |
| packages/core/src/telemetry/session-tracing.test.ts | Adds unit coverage for Phase 4a attributes, derived metrics, and dual-emit behavior. |
| packages/core/src/core/loggingContentGenerator/streamContentDetection.ts | Introduces shared helper for “first user-visible chunk” detection used for TTFT. |
| packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts | Adds a part-type matrix test suite for user-visible chunk detection. |
| packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts | Captures TTFT during streaming and forwards cached token counts into LLM span metadata. |
| packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts | Adds tests for TTFT capture + cached token forwarding in stream and non-stream paths. |
| docs/design/telemetry-llm-request-timing-design.md | Adds the Phase 4 design doc describing decisions, semantics, and planned follow-ups. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // chunks, so TTFT reflects "model produced something the operator can | ||
| // attribute to user-perceived latency." | ||
| if (ttftMs === undefined && hasUserVisibleContent(response)) { | ||
| ttftMs = Date.now() - startTime; | ||
| } |
| p.inlineData !== undefined || | ||
| p.executableCode !== undefined || | ||
| // @ts-expect-error — `thought` is not on all SDK versions but providers emit it | ||
| p.thought !== undefined, |
| | [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events) | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent | | ||
| | Phase 3 design doc (`telemetry-subagent-spans-design.md`) | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields | | ||
|
|
||
| ## Design — seven decisions, each justified |
📋 Review SummaryThis PR (Phase 4a) implements TTFT (time-to-first-token) measurement and OTel GenAI semantic convention dual-emit for LLM request spans. The implementation is well-structured, thoroughly tested, and follows established patterns from the codebase. The design decisions are extensively documented with clear justifications. Overall assessment: strong implementation with minor suggestions for improvement. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI still running. Code is clean — tsc/eslint/build all pass, 129/129 tests green. Implementation correctly handles TTFT capture, GenAI semconv dual-emit, derived metrics, and edge cases (clock skew, divide-by-zero, cachedInputTokens 0 vs undefined). — qwen-latest-series-invite-beta-v36 via Qwen Code /review
pomelo-nwu
left a comment
There was a problem hiding this comment.
文件命名惯例:是否考虑统一为 kebab-case?
注意到本 PR 新增的 streamContentDetection.ts 位于 core/loggingContentGenerator/ 目录下,与该目录现有文件(loggingContentGenerator.ts)保持一致的 camelCase 命名,这没有问题。
不过借这个机会提一个更大范围的观察 —— packages/core/src 目前的文件命名存在两种风格混用:
- camelCase(185 个文件):集中在
core/、hooks/、memory/、followup/等较早的模块 - kebab-case(98 个文件):集中在
telemetry/、tools/、mcp/、providers/、skills/等较新的模块
从目录层面看分化更明显:
telemetry/几乎全部 kebab-case(session-tracing.ts、detailed-span-attributes.ts、log-to-span-processor.ts、resource-attributes.ts...)tools/全部 kebab-case(read-file.ts、write-file.ts、web-fetch.ts...)core/几乎全部 camelCase(baseLlmClient.ts、contentGenerator.ts、coreToolScheduler.ts...)- 唯一的例外如
core/permission-helpers.ts(kebab)和telemetry/uiTelemetry.ts(camel)
这不是本 PR 引入的问题,只是趁这个 PR 提一下:后续新增文件是否应该考虑逐步统一到 kebab-case? 一些考虑:
- kebab-case 在 Node.js 生态中更常见(npm 包名、URL path 友好)
- 较新模块已经自然收敛到 kebab-case
- 可以与 issue #4063 中的命名统一项(#10、#13)一起推进
当然这是一个全局性的风格决策,不应该卡在本 PR 上。只是想听一下作者和各位的看法 🤔
|
补充几点上面 review 没覆盖到的观察: cli 包同样存在混用
tools 目录的命名-语义脱钩问题
kebab-case 文件名与 toolName 有清晰的 1:1 机械映射(
但 camelCase 文件名与 toolName 的对应关系就模糊了:
如果统一到 kebab-case,文件名与 toolName 就能形成一致的转换规则( 这不是 blocking 问题,只是趁这个 PR 把观察到的全貌摆出来,方便后续做决策 🤔 |
完全同意「机械规则 > 人工审查」的原则。命名规范如果只靠 reviewer 人肉盯,迟早会漏。以下是我目前觉得最可行的两条路径: 1. ESLint
// eslint.config.js
import checkFile from 'eslint-plugin-check-file';
export default [
{
plugins: { 'check-file': checkFile },
rules: {
'check-file/filename': ['error', 'KEBAB_CASE', {
// 逐步迁移:先 allowlist 存量 camelCase 文件
// 每次重命名一批就从列表中移除
}],
},
},
];好处是零额外基础设施、CI 已有 ESLint pipeline,加一条规则就能跑。 对于存量文件,可以用一个 2. AGENTS.md 补充(AI agent 软约束) 在 - **File naming**: All new `.ts` files must use `kebab-case`
(e.g., `stream-content-detection.ts`, not `streamContentDetection.ts`).
Existing `camelCase` files are legacy and should be renamed to
`kebab-case` opportunistically when touched.这条主要约束 AI coding agent(Qwen Code 自身、Copilot 等)在生成新文件时自动遵守 kebab-case。配合 ESLint 硬卡,形成"agent 自觉遵守 + CI 兜底拦截"的双层防线。 建议拆成独立 issue 推进,不卡本 PR。可以归到 #4063 的 #10(命名统一)下作为一个子任务,包含:
@doudouOUC 要不要开一个 issue 来跟踪?我可以帮忙把 ESLint 规则的 PoC 先跑出来 🤔 |
关于「kebab-case 在 Node.js 生态中更常见」这个说法,确实说得过于笼统了 —— 不同子领域差异很大,web 框架圈和 infra/AI 圈的习惯完全不一样。这个论据站不住,收回。 但你后面提的可观测性管线兼容性这个角度是真正的硬论据,比我的「生态常见」说法有说服力得多:
这个思路其实和你在 PR 里做 GenAI semconv dual-emit 的逻辑是一脉相承的 —— 命名规范不是美学问题,是跨系统互操作的工程约束。 关于 AGENTS.md 的写法,你说得对,应该写成变换规则而不只是静态命名规范。建议这样措辞: - **File naming**: Use `kebab-case` for all new `.ts` files.
Conversion to downstream identifiers is mechanical:
- File → tool name: replace `-` with `_` (e.g., `task-stop.ts` → `task_stop`)
- File → OTel span attr: replace `-` with `.` prefix as needed
- File → Prometheus metric: replace `-` with `_`
Existing `camelCase` files are legacy; rename opportunistically when touched.这样写的好处是贡献者(和 AI agent)一看就知道为什么选 kebab-case、怎么变换到下游系统,而不是单纯记一条风格规则。 ESLint |
刚扫了一遍代码库,结论比较直接: 目前没有共享的 case 转换工具函数
也就是说,你提到的 ~1200 LOC 双写路径里,每一跳的命名都是手工写死的字面量,没有经过任何程序化的 case 变换。文件名 → tool name → OTel attr → Prometheus metric 之间不存在自动转换链路。 这对 kebab-case 迁移 issue 的 scope 意味着什么 文件名 kebab-case 化能解决的只是第一跳的一致性: 所以建议 kebab-case 迁移 issue 的 scope 收窄为:
关于你提的「变换规则文档」,与其做代码级的 要开 issue 的话,scope 就是 ESLint 规则 + AGENTS.md 更新 + 存量文件分批迁移,不涉及运行时代码改动。我可以先准备 PoC。 |
pomelo-nwu
left a comment
There was a problem hiding this comment.
✅ Approve
忽略文件命名风格问题(已在 separate comment thread 中跟踪,归入 #4063),本 PR 代码质量很高,可以合并。
核心生产代码 ~107 行(loggingContentGenerator.ts +19 行,session-tracing.ts +88/-2 行,新建 streamContentDetection.ts 57 行),改动集中在遥测层,不碰业务逻辑。
爆炸半径极小:
endLLMRequestSpan只有一个调用方(loggingContentGenerator.ts,5 个 call site)LLMRequestMetadata扩展全是 optional 字段 —— 现有调用者不传这些字段时行为完全不变,零 breaking changehasUserVisibleContent纯函数、无副作用,只在流式包装器的一处被调用- 不碰
ContentGenerator接口、不碰 provider 适配器、不碰retryWithBackoff(Phase 4b 的事) - TTFT 用 method-local 闭包变量而非 instance field,并发
generateContentStream调用间不会互相污染
边界条件覆盖充分:
thought: false误触发 →p.thought === true而非!== undefinedcachedInputTokens === 0被吞 →!== undefined判断sampling_ms负数 →Math.max(0, ...)output_tokens_per_second除零 →samplingMs > 0守卫- 双写 span 碰撞 → metric 写入在已有的 double-end guard 内
37 个新测试 + 538 行设计文档(每个决策点都有 "why not X")。教科书级的增量遥测 PR。
— qwen3.7-max via Qwen Code
| thought?: unknown; | ||
| }; | ||
| if (typeof p.text === 'string' && p.text.length > 0) return true; | ||
| if (p.functionCall !== undefined) return true; |
There was a problem hiding this comment.
[Suggestion] The three presence checks (p.functionCall !== undefined, p.inlineData !== undefined, p.executableCode !== undefined) treat null as present (null !== undefined is true). A part shaped { functionCall: null } would be classified as user-visible, potentially firing TTFT on an empty metadata chunk.
The text field is already guarded with a type check (typeof p.text === 'string'), but the other three fields are not. Using loose null equality (!= null) would cover both null and undefined in one check, matching the defensive posture of the text guard.
| if (p.functionCall !== undefined) return true; | |
| if (p.functionCall != null) return true; | |
| if (p.inlineData != null) return true; | |
| if (p.executableCode != null) return true; |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| // content. hasUserVisibleContent skips role-only / usageMetadata-only | ||
| // chunks, so TTFT reflects "model produced something the operator can | ||
| // attribute to user-perceived latency." | ||
| if (ttftMs === undefined && hasUserVisibleContent(response)) { |
There was a problem hiding this comment.
[Suggestion] This call is inside the for await loop that streams chunks to the user. If hasUserVisibleContent ever throws (e.g., an SDK update changes the Part type shape), the exception propagates into the catch (error) block below, aborting the user's response stream. The error trace would point at streamContentDetection internals — not obviously a telemetry bug.
Since TTFT detection is best-effort telemetry, it should never affect the user-visible response. A try/catch here isolates detection failures:
| if (ttftMs === undefined && hasUserVisibleContent(response)) { | |
| try { | |
| if (ttftMs === undefined && hasUserVisibleContent(response)) { | |
| ttftMs = Date.now() - startTime; | |
| } | |
| } catch { | |
| // Detection failure must not abort the user stream. | |
| } |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| if (metadata.ttftMs !== undefined) { | ||
| const samplingMs = Math.max( | ||
| 0, | ||
| duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0), |
There was a problem hiding this comment.
[Suggestion] The sampling_ms formula duration - ttftMs - (requestSetupMs ?? 0) assumes ttftMs and requestSetupMs share the same time origin as duration (all measured from startTime). When Phase 4b populates requestSetupMs, if ttftMs is still measured from startTime (not from the successful attempt's dispatch), the formula will double-subtract requestSetupMs and produce incorrect sampling_ms / output_tokens_per_second values.
This invariant exists only in the reader's head — not in the type system or an assertion. Consider documenting it explicitly, e.g.:
| duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0), | |
| // INVARIANT: ttftMs and requestSetupMs share the same time origin | |
| // as `duration` (generateContent[Stream] entry). If either measurement | |
| // base changes in a future phase, this formula must be updated. | |
| const samplingMs = Math.max( |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
Co-authored-by: Tommaso Sciortino <sciortino@gmail.com>
Summary
Phase 4a of P3 LLM request timing decomposition (#3731 / #4413). Adds TTFT (time-to-first-token) measurement and OTel GenAI semantic convention dual-emit attributes to the
qwen-code.llm_requestspan.hasUserVisibleContent(chunk)helper centralises first-token detection on the normalisedGenerateContentResponseshape — works uniformly across Anthropic / OpenAI / Gemini / Qwen providers.LoggingContentGenerator.generateContentStream's stream wrapper using method-local closure (NEVER instance fields —LoggingContentGeneratoris one-per-ContentGeneratorand shared across concurrent calls; subagent fan-out, warmup, side-queries).LLMRequestMetadataextended withttftMs,requestSetupMs,attempt,retryTotalDelayMs,cachedInputTokens. The 4b/4c placeholders are declared optional now so subsequent PRs don't have to revisit the schema.endLLMRequestSpanwrites new attrs (ttft_ms,cached_input_tokens, derivedsampling_msclamped at 0, derivedoutput_tokens_per_secondrounded to 2 decimals + omitted whensampling_ms == 0) plus Phase 4b placeholders when the caller provides them.gen_ai.request.model(Stable),gen_ai.usage.input_tokens(Stable),gen_ai.usage.output_tokens(Stable),gen_ai.usage.cached_tokens(Experimental),gen_ai.server.time_to_first_token(Experimental, seconds-as-double).TTFT semantic intentionally diverges from claude-code's
ttftMs(which fires on Anthropic'smessage_startmetadata event). qwen-code measures first user-visible content — uniform across providers, matches the literal "first token" intent. See design doc D1 for rationale.Closes
Design doc
docs/design/telemetry-llm-request-timing-design.md— 8 locked decisions (D1, D2, D3, D4, D5, D6, D7, D8) with "Why not X" justifications, lifecycle wiring (streaming + non-streaming + 4 retry sites for 4b), files-to-change table with LOC estimates, test strategy, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT at all).PR split
This is Phase 4a (~200 LOC of production code, self-contained).
retryWithBackoffonRetrycallback +ApiRetryEvent+ 4 LLM caller wiring (client.ts:1540,baseLlmClient.ts:193,282,geminiChat.ts:1039). Independently a bug fix for theretryWithBackofftelemetry blackout (currently onlydebugLogger.warn).recordApiRequestBreakdown()activation + parent-span retry-aggregation attrs. Depends on 4a + 4b.Test plan
hasUserVisibleContentpart-type matrix (16 cases): non-empty text, empty text, functionCall, inlineData, executableCode,thought: true,thought: false,thoughtmissing,text: '' + thought: true(Anthropic<thinking>first chunk), mixed parts, empty parts array, missing candidates, missing content, missing parts, usage-only / role-only chunks, defensive non-object parts.endLLMRequestSpanPhase 4a (timing decomposition + dual-emit) (17 cases): all gen_ai.* dual-emits present + value-equal + correct units (seconds for ttft);cached_input_tokensboundary cases (undefined, 40, 0 explicit cache miss);ttft_mswrites seconds-as-double for gen_ai counterpart; ttft undefined ⇒ no derived attrs;sampling_msderived correctly (no setup, with setup, clamped to 0 on clock skew);output_tokens_per_secondrounding (88.89), divide-by-zero guard (sampling_ms == 0), missingoutputTokensguard; Phase 4b placeholders written when caller provides them, omitted otherwise.LoggingContentGeneratorstream + non-stream paths (4 cases): TTFT captured on first user-visible chunk, undefined when stream yields only metadata chunks;cachedInputTokensforwarded fromusageMetadata.cachedContentTokenCounton both stream and non-stream success.tsc --noEmit) cleaneslint) cleanloggingContentGenerator.test.tsexisting 39 casesReview summary
This PR went through 4 specialised review agents (code-reviewer, silent-failure-hunter, type-design-analyzer, comment-analyzer) before opening:
aa66dc7aa:hasUserVisibleContentoriginally usedp.thought !== undefined, which falsely fires TTFT on{ thought: false }parts. The codebase treatsthoughtas a boolean flag (loggingContentGenerator.ts:773); fixed top.thought === trueand added 3 boundary test cases.attempt,requestSetupMs,retryTotalDelayMs) into a sub-object. Flat schema matches existinginputTokens/outputTokensprecedent and 4b will populate all three at the same callsite — sub-object adds ceremony without invariant gain.unknown-narrowing inisUserVisiblePartwith aPart-typed parameter. Defensive against schema drift is the explicit design intent in D1; switching loses that defensiveness without a clear benefit at runtime.不追求 0 unresolved threads.