Skip to content

feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731)#4417

Merged
doudouOUC merged 2 commits into
QwenLM:mainfrom
doudouOUC:feat/telemetry-phase-4a-ttft
May 22, 2026
Merged

feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731)#4417
doudouOUC merged 2 commits into
QwenLM:mainfrom
doudouOUC:feat/telemetry-phase-4a-ttft

Conversation

@doudouOUC

Copy link
Copy Markdown
Collaborator

Summary

Phase 4a of P3 LLM request timing decomposition (#3731 / #4413). Adds TTFT (time-to-first-token) measurement and OTel GenAI semantic convention dual-emit attributes to the qwen-code.llm_request span.

  • New hasUserVisibleContent(chunk) helper centralises first-token detection on the normalised GenerateContentResponse shape — works uniformly across Anthropic / OpenAI / Gemini / Qwen providers.
  • TTFT captured in LoggingContentGenerator.generateContentStream's stream wrapper using method-local closure (NEVER instance fields — LoggingContentGenerator is one-per-ContentGenerator and shared across concurrent calls; subagent fan-out, warmup, side-queries).
  • LLMRequestMetadata extended with ttftMs, requestSetupMs, attempt, retryTotalDelayMs, cachedInputTokens. The 4b/4c placeholders are declared optional now so subsequent PRs don't have to revisit the schema.
  • endLLMRequestSpan writes new attrs (ttft_ms, cached_input_tokens, derived sampling_ms clamped at 0, derived output_tokens_per_second rounded to 2 decimals + omitted when sampling_ms == 0) plus Phase 4b placeholders when the caller provides them.
  • GenAI semconv dual-emit (private name authoritative, semconv = compat layer per feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410 precedent): gen_ai.request.model (Stable), gen_ai.usage.input_tokens (Stable), gen_ai.usage.output_tokens (Stable), gen_ai.usage.cached_tokens (Experimental), gen_ai.server.time_to_first_token (Experimental, seconds-as-double).

TTFT semantic intentionally diverges from claude-code's ttftMs (which fires on Anthropic's message_start metadata event). qwen-code measures first user-visible content — uniform across providers, matches the literal "first token" intent. See design doc D1 for rationale.

Closes

Design doc

docs/design/telemetry-llm-request-timing-design.md — 8 locked decisions (D1, D2, D3, D4, D5, D6, D7, D8) with "Why not X" justifications, lifecycle wiring (streaming + non-streaming + 4 retry sites for 4b), files-to-change table with LOC estimates, test strategy, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT at all).

PR split

This is Phase 4a (~200 LOC of production code, self-contained).

  • Phase 4b (next PR) — retryWithBackoff onRetry callback + ApiRetryEvent + 4 LLM caller wiring (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039). Independently a bug fix for the retryWithBackoff telemetry blackout (currently only debugLogger.warn).
  • Phase 4crecordApiRequestBreakdown() activation + parent-span retry-aggregation attrs. Depends on 4a + 4b.

Test plan

  • Unit — hasUserVisibleContent part-type matrix (16 cases): non-empty text, empty text, functionCall, inlineData, executableCode, thought: true, thought: false, thought missing, text: '' + thought: true (Anthropic <thinking> first chunk), mixed parts, empty parts array, missing candidates, missing content, missing parts, usage-only / role-only chunks, defensive non-object parts.
  • Unit — endLLMRequestSpan Phase 4a (timing decomposition + dual-emit) (17 cases): all gen_ai.* dual-emits present + value-equal + correct units (seconds for ttft); cached_input_tokens boundary cases (undefined, 40, 0 explicit cache miss); ttft_ms writes seconds-as-double for gen_ai counterpart; ttft undefined ⇒ no derived attrs; sampling_ms derived correctly (no setup, with setup, clamped to 0 on clock skew); output_tokens_per_second rounding (88.89), divide-by-zero guard (sampling_ms == 0), missing outputTokens guard; Phase 4b placeholders written when caller provides them, omitted otherwise.
  • Unit — LoggingContentGenerator stream + non-stream paths (4 cases): TTFT captured on first user-visible chunk, undefined when stream yields only metadata chunks; cachedInputTokens forwarded from usageMetadata.cachedContentTokenCount on both stream and non-stream success.
  • Type-check (tsc --noEmit) clean
  • Lint (eslint) clean
  • All 448 telemetry tests pass; no regressions in loggingContentGenerator.test.ts existing 39 cases

Review summary

This PR went through 4 specialised review agents (code-reviewer, silent-failure-hunter, type-design-analyzer, comment-analyzer) before opening:

  • Critical fix applied in commit aa66dc7aa: hasUserVisibleContent originally used p.thought !== undefined, which falsely fires TTFT on { thought: false } parts. The codebase treats thought as a boolean flag (loggingContentGenerator.ts:773); fixed to p.thought === true and added 3 boundary test cases.
  • Type-design suggestion declined (push back): grouping the retry trio (attempt, requestSetupMs, retryTotalDelayMs) into a sub-object. Flat schema matches existing inputTokens/outputTokens precedent and 4b will populate all three at the same callsite — sub-object adds ceremony without invariant gain.
  • Type tightening declined (push back): replacing unknown-narrowing in isUserVisiblePart with a Part-typed parameter. Defensive against schema drift is the explicit design intent in D1; switching loses that defensiveness without a clear benefit at runtime.
  • Other suggestions declined as stylistic / non-actionable per 不追求 0 unresolved threads.

doudouOUC added 2 commits May 22, 2026 02:03
…wenLM#3731)

Adds time-to-first-token measurement and OTel GenAI semantic convention
dual-emit to the qwen-code.llm_request span. First slice of Phase 4 LLM
request timing decomposition (sub-issue QwenLM#4413).

Builds on Phase 1 (QwenLM#4126), Phase 1.5 (QwenLM#4302), Phase 2 (QwenLM#4321). Self-contained
— ships TTFT visibility without depending on the Phase 4b retry-callback work.

Changes
-------

- New hasUserVisibleContent(chunk) helper centralizes "is this chunk first
  user-visible output" detection across all 4 providers. Operates on the
  normalized GenerateContentResponse shape each provider emits before
  LoggingContentGenerator sees the stream. Recognizes text / functionCall /
  inlineData / executableCode / thought parts; skips role-only and
  usageMetadata-only chunks.

- TTFT capture in LoggingContentGenerator.generateContentStream's stream
  wrapper. Method-local closure variable, NEVER instance field — the
  generator is one-per-ContentGenerator and shared across concurrent calls
  (subagent fan-out, warmup, side-queries).

- Extends LLMRequestMetadata with: ttftMs, requestSetupMs, attempt,
  retryTotalDelayMs, cachedInputTokens. The first is populated in 4a; the
  middle three are declared as optional so Phase 4b's retry callback can
  populate them without touching the schema again.

- endLLMRequestSpan writes new span attrs: ttft_ms, sampling_ms (derived:
  duration - ttft - requestSetup, clamped to >= 0), output_tokens_per_second
  (derived: outputTokens / (sampling_ms / 1000), omitted when sampling_ms == 0
  to avoid divide-by-zero, rounded to 2 decimals), cached_input_tokens,
  plus Phase 4b placeholders (attempt, request_setup_ms, retry_total_delay_ms)
  when caller provides them.

- GenAI semconv dual-emit. Private qwen-code names stay authoritative
  (dashboards, SLOs, ARMS queries reference these); semconv is a compat
  layer for spec-aware backends. Same pattern as Phase 3 (QwenLM#4410):
    qwen-code.model            + gen_ai.request.model            (Stable)
    input_tokens               + gen_ai.usage.input_tokens       (Stable)
    output_tokens              + gen_ai.usage.output_tokens      (Stable)
    cached_input_tokens        + gen_ai.usage.cached_tokens      (Experimental)
    ttft_ms (ms, int)          + gen_ai.server.time_to_first_token
                                 (s, double, Experimental)

TTFT semantic intentionally diverges from claude-code's ttftMs (which fires
on Anthropic's message_start metadata event). qwen-code measures "first
user-visible content" — uniform across Anthropic / OpenAI / Gemini / Qwen,
matches the literal "first token" intent. See design doc D1 for rationale.

Design doc
----------

docs/design/telemetry-llm-request-timing-design.md (~530 lines) covers all
7 locked decisions with "Why not X" justifications, lifecycle wiring for
streaming + non-streaming + 4 retry sites, files-to-change table with LOC
estimates, test strategy, edge cases, rollback plan, and a comparison with
claude-code (full decomposition but only for Perfetto) and opencode
(no TTFT measurement at all).

Tests
-----

- streamContentDetection.test.ts (13 tests) covers part types matrix.
- session-tracing.test.ts +16 tests (new describe block) cover dual-emit
  for every gen_ai.* field, sampling_ms clamping at 0 for clock skew,
  output_tokens_per_second rounding + divide-by-zero, Phase 4b
  placeholder fields written when present, absent when not.
- loggingContentGenerator.test.ts +4 tests cover TTFT capture in
  streaming, undefined when stream yields only metadata chunks, and
  cachedInputTokens forwarding from usageMetadata on both stream and
  non-stream paths.

All 448 telemetry tests pass, tsc --noEmit clean, eslint clean.

Phase 4b (retry callback + ApiRetryEvent) and Phase 4c
(recordApiRequestBreakdown activation) follow in separate PRs per the
design doc split.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…ndary tests (QwenLM#3731)

Address review findings on commit 5db5f9b (Phase 4a) before opening the PR.

1. hasUserVisibleContent: tighten the `thought` Part check from
   `p.thought !== undefined` to `p.thought === true`. The codebase treats
   `thought` as a boolean flag (see loggingContentGenerator.ts:773 / :865
   where `part.thought ? {...} : {}` is the canonical pattern). The original
   predicate would falsely fire TTFT on a degenerate `{ thought: false }`
   part (which means "explicitly not a thought") and on bare `{ thought: undefined }`
   shapes that some normalizers may emit. Affects Phase 4a only — TTFT
   semantics, not retry telemetry.

   Existing test for `thought: true` is kept; added 3 new cases:
   - `{ thought: false }` returns false
   - `{}` (no thought key) returns false
   - `{ text: '', thought: true }` (typical Anthropic <thinking> first chunk)
     returns true — confirms the design-doc D1 intent that "thinking-stream
     started" counts as user-visible TTFT.

2. session-tracing.test.ts: add boundary case for `cachedInputTokens === 0`.
   Providers reporting an explicit cache miss (0) is meaningful info distinct
   from undefined ("we don't know"). The test asserts both `cached_input_tokens`
   and `gen_ai.usage.cached_tokens` propagate the literal 0, not get filtered.

3. JSDoc fix: remove the misleading "guard with `in`" comment from
   `streamContentDetection.ts` that referred to a guard pattern not actually
   used in the implementation.

All 129 tests pass (was 125; +4 new), tsc --noEmit clean, eslint clean.
No production-code behavior change beyond the `thought` predicate fix —
which by construction can only ever shrink the set of chunks treated as
TTFT-triggering, so cannot regress any successful TTFT capture.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Copilot AI review requested due to automatic review settings May 22, 2026 00:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Phase 4a of LLM request timing decomposition by capturing TTFT (time-to-first-token) and dual-emitting OpenTelemetry GenAI semantic convention attributes on qwen-code.llm_request spans, with supporting helpers and unit tests.

Changes:

  • Extend LLMRequestMetadata and endLLMRequestSpan to emit TTFT/cached tokens + derived sampling metrics, and to dual-emit gen_ai.* attributes.
  • Add a centralized hasUserVisibleContent() predicate to detect the first “user-visible” streaming chunk across providers.
  • Wire TTFT + cached token propagation through LoggingContentGenerator and add focused unit tests + design documentation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
packages/core/src/telemetry/session-tracing.ts Adds new LLM span metadata fields and emits new timing/token + GenAI dual-emit attributes (including derived sampling metrics).
packages/core/src/telemetry/session-tracing.test.ts Adds unit coverage for Phase 4a attributes, derived metrics, and dual-emit behavior.
packages/core/src/core/loggingContentGenerator/streamContentDetection.ts Introduces shared helper for “first user-visible chunk” detection used for TTFT.
packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts Adds a part-type matrix test suite for user-visible chunk detection.
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts Captures TTFT during streaming and forwards cached token counts into LLM span metadata.
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts Adds tests for TTFT capture + cached token forwarding in stream and non-stream paths.
docs/design/telemetry-llm-request-timing-design.md Adds the Phase 4 design doc describing decisions, semantics, and planned follow-ups.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +531 to +535
// chunks, so TTFT reflects "model produced something the operator can
// attribute to user-perceived latency."
if (ttftMs === undefined && hasUserVisibleContent(response)) {
ttftMs = Date.now() - startTime;
}
p.inlineData !== undefined ||
p.executableCode !== undefined ||
// @ts-expect-error — `thought` is not on all SDK versions but providers emit it
p.thought !== undefined,
| [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events) | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent |
| Phase 3 design doc (`telemetry-subagent-spans-design.md`) | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields |

## Design — seven decisions, each justified
@github-actions

Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR (Phase 4a) implements TTFT (time-to-first-token) measurement and OTel GenAI semantic convention dual-emit for LLM request spans. The implementation is well-structured, thoroughly tested, and follows established patterns from the codebase. The design decisions are extensively documented with clear justifications. Overall assessment: strong implementation with minor suggestions for improvement.

🔍 General Feedback

  • Excellent design documentation: The design doc (telemetry-llm-request-timing-design.md) is comprehensive with 8 locked decisions (D1-D8), each with "Why not X" justifications and cross-product comparisons (claude-code, opencode).
  • Clean separation of concerns: The hasUserVisibleContent helper centralizes first-token detection logic, avoiding per-provider duplication.
  • Correct concurrency handling: TTFT capture uses method-local closure variables (never instance fields) in LoggingContentGenerator.generateContentStream, which is critical since the class is one-per-ContentGenerator and shared across concurrent calls.
  • Comprehensive test coverage: 16 test cases for hasUserVisibleContent part-type matrix and 17 test cases for endLLMRequestSpan Phase 4a timing decomposition + dual-emit.
  • GenAI semantic convention dual-emit follows precedent: Private names (qwen-code.*) remain authoritative; gen_ai.* attributes serve as a compatibility layer for spec-aware backends (consistent with Phase 3 feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410).
  • Phase 4b/4c placeholders properly declared: New fields (requestSetupMs, attempt, retryTotalDelayMs) are optional now, preventing schema churn in follow-up PRs.

🎯 Specific Feedback

🟡 High

  • File: streamContentDetection.ts:53 - The thought field check uses === true which is correct per D1, but the comment says "Match strict === true rather than checking presence". Consider adding a defensive check for non-boolean thought values (e.g., if a provider emits "true" as string). Current code handles this via type coercion, but explicit validation would be safer:

    if (typeof p.thought === 'boolean' && p.thought === true) return true;
  • File: session-tracing.ts:450 - The derived sampling_ms calculation uses Math.max(0, ...) to clamp for clock skew, but this masks potential underlying issues. Consider adding a debug log when clamping occurs (similar to existing debugLogger.warn patterns in the file) to help diagnose clock skew in production:

    if (samplingMs === 0 && duration > metadata.ttftMs + (metadata.requestSetupMs ?? 0)) {
      debugLogger.warn(`sampling_ms clamped to 0 due to clock skew: duration=${duration}, ttftMs=${metadata.ttftMs}, requestSetupMs=${metadata.requestSetupMs}`);
    }

🟢 Medium

  • File: loggingContentGenerator.ts:533 - The TTFT capture condition if (ttftMs === undefined && hasUserVisibleContent(response)) is correct but could benefit from an inline comment explaining why we check ttftMs === undefined (to ensure we capture the first token only, not overwrite on subsequent chunks). The design doc D2 explains this, but a brief inline comment would help future maintainers:

    // Capture TTFT on first user-visible chunk only; subsequent chunks must not overwrite
    if (ttftMs === undefined && hasUserVisibleContent(response)) {
  • File: session-tracing.ts:420-423 - The dual-emit for gen_ai.usage.cached_tokens notes "Experimental — may rename before Stable". Consider adding a TODO comment with a reference to the OTel spec version or a tracking issue number so future maintainers know when to update:

    // TODO(#XXXX): Update when OTel GenAI semconv stabilizes (currently Experimental)
  • File: streamContentDetection.test.ts:116-121 - The test for non-object parts (null, undefined, 'string', 42) is excellent defensive testing. Consider adding one more edge case: an object with text as a non-string type (e.g., { text: 123 }) to verify the typeof p.text === 'string' guard works correctly.

🔵 Low

  • File: session-tracing.ts:66-76 - The JSDoc for ttftMs mentions "diverges from claude-code's ttftMs" which is important context. Consider linking directly to the design doc section (D1) for easier navigation:

     * See docs/design/telemetry-llm-request-timing-design.md (D1) for rationale.
  • File: streamContentDetection.ts:17 - The comment lists thought / reasoning content (provider-dependent; o1, qwen thinking, Anthropic <thinking>). Consider adding Gemini to this list since the design doc D1 explicitly mentions "Gemini's thought" as well.

  • File: loggingContentGenerator.ts:468-474 - The TTFT comment block is comprehensive but could reference the design doc D2 directly for the "Why never instance fields" explanation, making it easier for future maintainers to find the full rationale.

✅ Highlights

  • hasUserVisibleContent implementation (streamContentDetection.ts:34-57) - Clean, well-documented function with excellent type safety and defensive handling of edge cases. The explicit handling of thought: false vs thought: true vs missing thought shows deep understanding of the codebase conventions.

  • Test coverage for edge cases (streamContentDetection.test.ts) - The 16-case matrix thoroughly covers all part types including Anthropic's <thinking> first chunk pattern ({ text: '', thought: true }), usage-only chunks, and defensive non-object handling.

  • Phase 4a/4b/4c split discipline - The PR strictly implements Phase 4a (TTFT + dual-emit) while declaring Phase 4b/4c placeholders as optional fields. This prevents scope creep and keeps the PR reviewable at ~200 LOC of production code.

  • GenAI dual-emit consistency (session-tracing.ts:403-426) - All dual-emits follow the same pattern with clear comments about stability levels (Stable vs Experimental) and the "private name authoritative" rule from Phase 3.

  • Idle timeout handling (loggingContentGenerator.ts:493-517) - The span timeout mechanism correctly handles the "consumer abandoned the generator" case and sets spanEndedByTimeout to prevent redundant endLLMRequestSpan calls.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No review findings. Downgraded from Approve to Comment: CI still running. Code is clean — tsc/eslint/build all pass, 129/129 tests green. Implementation correctly handles TTFT capture, GenAI semconv dual-emit, derived metrics, and edge cases (clock skew, divide-by-zero, cachedInputTokens 0 vs undefined). — qwen-latest-series-invite-beta-v36 via Qwen Code /review

@pomelo-nwu pomelo-nwu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

文件命名惯例:是否考虑统一为 kebab-case?

注意到本 PR 新增的 streamContentDetection.ts 位于 core/loggingContentGenerator/ 目录下,与该目录现有文件(loggingContentGenerator.ts)保持一致的 camelCase 命名,这没有问题。

不过借这个机会提一个更大范围的观察 —— packages/core/src 目前的文件命名存在两种风格混用

  • camelCase(185 个文件):集中在 core/hooks/memory/followup/ 等较早的模块
  • kebab-case(98 个文件):集中在 telemetry/tools/mcp/providers/skills/ 等较新的模块

从目录层面看分化更明显:

  • telemetry/ 几乎全部 kebab-case(session-tracing.tsdetailed-span-attributes.tslog-to-span-processor.tsresource-attributes.ts...)
  • tools/ 全部 kebab-case(read-file.tswrite-file.tsweb-fetch.ts...)
  • core/ 几乎全部 camelCase(baseLlmClient.tscontentGenerator.tscoreToolScheduler.ts...)
  • 唯一的例外如 core/permission-helpers.ts(kebab)和 telemetry/uiTelemetry.ts(camel)

这不是本 PR 引入的问题,只是趁这个 PR 提一下:后续新增文件是否应该考虑逐步统一到 kebab-case? 一些考虑:

  • kebab-case 在 Node.js 生态中更常见(npm 包名、URL path 友好)
  • 较新模块已经自然收敛到 kebab-case
  • 可以与 issue #4063 中的命名统一项(#10#13)一起推进

当然这是一个全局性的风格决策,不应该卡在本 PR 上。只是想听一下作者和各位的看法 🤔

@pomelo-nwu

Copy link
Copy Markdown
Collaborator

补充几点上面 review 没覆盖到的观察:

cli 包同样存在混用

packages/cli/src 的情况类似:282 个 camelCase 文件 vs 31 个 kebab-case 文件。目录层面 dualOutput/nonInteractive/remoteInput/ 用 camelCase,acp-integration/prompt-processors/test-commands/ 用 kebab-case。只是 cli 包 kebab-case 占比更低(约 7%),不一致性比 core 包更明显。

tools 目录的命名-语义脱钩问题

packages/core/src/tools/ 里还有一个更具体的问题:文件名命名风格与 toolName 的对应关系不一致

kebab-case 文件名与 toolName 有清晰的 1:1 机械映射(-_):

  • task-stop.tstask_stop
  • web-fetch.tsweb_fetch
  • read-file.tsread_file
  • cron-create.tscron_create

但 camelCase 文件名与 toolName 的对应关系就模糊了:

  • askUserQuestion.tsask_user_question(语义对应但风格不同)
  • todoWrite.tstodo_write(同上)
  • ripGrep.tsgrep_search完全脱钩,文件名和工具名无直接关系)
  • syntheticOutput.tsstructured_output完全脱钩

如果统一到 kebab-case,文件名与 toolName 就能形成一致的转换规则(kebab-file.tskebab_file),降低新贡献者的认知负担。

这不是 blocking 问题,只是趁这个 PR 把观察到的全貌摆出来,方便后续做决策 🤔

@pomelo-nwu

Copy link
Copy Markdown
Collaborator

回复 @doudouOUC 关于 kebab-case 统一的想法:

完全同意「机械规则 > 人工审查」的原则。命名规范如果只靠 reviewer 人肉盯,迟早会漏。以下是我目前觉得最可行的两条路径:

1. ESLint check-file 规则(CI 硬卡)

eslint-plugin-check-file 可以直接在 lint 阶段拦截不符合 kebab-case 的新文件名:

// eslint.config.js
import checkFile from 'eslint-plugin-check-file';

export default [
  {
    plugins: { 'check-file': checkFile },
    rules: {
      'check-file/filename': ['error', 'KEBAB_CASE', {
        // 逐步迁移:先 allowlist 存量 camelCase 文件
        // 每次重命名一批就从列表中移除
      }],
    },
  },
];

好处是零额外基础设施、CI 已有 ESLint pipeline,加一条规则就能跑。KEBAB_CASE 是插件内置 pattern,不需要手写正则。

对于存量文件,可以用一个 allowedFilenamePatterns 列表做渐进式迁移 —— 不要求一次性改完 185 个文件,但新增文件必须是 kebab-case,每次有人重命名旧文件就从 allowlist 中删一条。

2. AGENTS.md 补充(AI agent 软约束)

AGENTS.mdCode Conventions 小节加一段:

- **File naming**: All new `.ts` files must use `kebab-case`
  (e.g., `stream-content-detection.ts`, not `streamContentDetection.ts`).
  Existing `camelCase` files are legacy and should be renamed to
  `kebab-case` opportunistically when touched.

这条主要约束 AI coding agent(Qwen Code 自身、Copilot 等)在生成新文件时自动遵守 kebab-case。配合 ESLint 硬卡,形成"agent 自觉遵守 + CI 兜底拦截"的双层防线。

建议拆成独立 issue 推进,不卡本 PR。可以归到 #4063#10(命名统一)下作为一个子任务,包含:

  • ESLint 规则 + allowlist 初始数据
  • AGENTS.md 更新
  • 存量文件分批迁移计划(按目录粒度,每个目录一个 PR)

@doudouOUC 要不要开一个 issue 来跟踪?我可以帮忙把 ESLint 规则的 PoC 先跑出来 🤔

@pomelo-nwu

Copy link
Copy Markdown
Collaborator

回复 @doudouOUC

关于「kebab-case 在 Node.js 生态中更常见」这个说法,确实说得过于笼统了 —— 不同子领域差异很大,web 框架圈和 infra/AI 圈的习惯完全不一样。这个论据站不住,收回。

但你后面提的可观测性管线兼容性这个角度是真正的硬论据,比我的「生态常见」说法有说服力得多:

  • 文件名 kebab-case → span/metric name snake_case,变换路径最短(-_,一一对应)
  • Prometheus 强制 snake_case,OTel semconv 用 . 分隔(gen_ai.usage.input_tokens
  • Arize Phoenix / Langfuse 的 span 树和 trace 视图确实对 kebab/snake 的可读性更好
  • 如果上游文件名就是 camelCase,到下游每个集成点都要做一次不规则的 case 转换,容易出错

这个思路其实和你在 PR 里做 GenAI semconv dual-emit 的逻辑是一脉相承的 —— 命名规范不是美学问题,是跨系统互操作的工程约束

关于 AGENTS.md 的写法,你说得对,应该写成变换规则而不只是静态命名规范。建议这样措辞:

- **File naming**: Use `kebab-case` for all new `.ts` files.
  Conversion to downstream identifiers is mechanical:
  - File → tool name: replace `-` with `_` (e.g., `task-stop.ts``task_stop`)
  - File → OTel span attr: replace `-` with `.` prefix as needed
  - File → Prometheus metric: replace `-` with `_`
  Existing `camelCase` files are legacy; rename opportunistically when touched.

这样写的好处是贡献者(和 AI agent)一看就知道为什么选 kebab-case、怎么变换到下游系统,而不是单纯记一条风格规则。

ESLint check-file + AGENTS.md 双层防线 + 独立 issue 归到 #4063 #10 下,这个路径不变。要开 issue 的话我可以先把 ESLint PoC 和 AGENTS.md diff 准备好。

@pomelo-nwu

Copy link
Copy Markdown
Collaborator

回复 @doudouOUC 关于共享工具函数和变换链路的问题

刚扫了一遍代码库,结论比较直接:

目前没有共享的 case 转换工具函数

  • toSnakeCase()toKebabCase()camelToSnake() 之类的 util
  • 无第三方 case 库引用(lodash、change-case 等都没有)
  • tool-names.ts 里的 tool name 全部是硬编码字面量('task_stop''web_fetch'
  • telemetry/constants.tssession-tracing.ts 里的 OTel attribute 也全部是硬编码字符串('qwen-code.tool_call''gen_ai.usage.input_tokens'
  • Prometheus export 层的 metric name 同样是手写的

也就是说,你提到的 ~1200 LOC 双写路径里,每一跳的命名都是手工写死的字面量,没有经过任何程序化的 case 变换。文件名 → tool name → OTel attr → Prometheus metric 之间不存在自动转换链路。

这对 kebab-case 迁移 issue 的 scope 意味着什么

文件名 kebab-case 化能解决的只是第一跳的一致性:task-stop.tstask_stop(tool name)。后面几跳(OTel attr 用 .、Prometheus 用 _)是各系统自己的规范,和文件名无关 —— 它们本来就该用各系统的 native convention。

所以建议 kebab-case 迁移 issue 的 scope 收窄为:

  • 文件名统一 kebab-case(ESLint 规则 + AGENTS.md 变换规则文档)
  • tool name 保持 snake_case,与文件名形成 -_ 的机械映射
  • OTel attr / Prometheus metric 命名不属于这个 issue 的范围(它们有自己的 semconv 规范)

关于你提的「变换规则文档」,与其做代码级的 snake_case() util(当前没有实际需求),不如在 AGENTS.md 里把映射关系写清楚就够了 —— 因为现阶段这些变换都是开发者/AI agent 在写代码时心算的,不是运行时计算的。AGENTS.md 的措辞上一轮已经给了草稿,可以直接用。

要开 issue 的话,scope 就是 ESLint 规则 + AGENTS.md 更新 + 存量文件分批迁移,不涉及运行时代码改动。我可以先准备 PoC。

@pomelo-nwu pomelo-nwu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approve

忽略文件命名风格问题(已在 separate comment thread 中跟踪,归入 #4063),本 PR 代码质量很高,可以合并。

核心生产代码 ~107 行loggingContentGenerator.ts +19 行,session-tracing.ts +88/-2 行,新建 streamContentDetection.ts 57 行),改动集中在遥测层,不碰业务逻辑。

爆炸半径极小:

  • endLLMRequestSpan 只有一个调用方(loggingContentGenerator.ts,5 个 call site)
  • LLMRequestMetadata 扩展全是 optional 字段 —— 现有调用者不传这些字段时行为完全不变,零 breaking change
  • hasUserVisibleContent 纯函数、无副作用,只在流式包装器的一处被调用
  • 不碰 ContentGenerator 接口、不碰 provider 适配器、不碰 retryWithBackoff(Phase 4b 的事)
  • TTFT 用 method-local 闭包变量而非 instance field,并发 generateContentStream 调用间不会互相污染

边界条件覆盖充分:

  • thought: false 误触发 → p.thought === true 而非 !== undefined
  • cachedInputTokens === 0 被吞 → !== undefined 判断
  • sampling_ms 负数 → Math.max(0, ...)
  • output_tokens_per_second 除零 → samplingMs > 0 守卫
  • 双写 span 碰撞 → metric 写入在已有的 double-end guard 内

37 个新测试 + 538 行设计文档(每个决策点都有 "why not X")。教科书级的增量遥测 PR。

— qwen3.7-max via Qwen Code

thought?: unknown;
};
if (typeof p.text === 'string' && p.text.length > 0) return true;
if (p.functionCall !== undefined) return true;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The three presence checks (p.functionCall !== undefined, p.inlineData !== undefined, p.executableCode !== undefined) treat null as present (null !== undefined is true). A part shaped { functionCall: null } would be classified as user-visible, potentially firing TTFT on an empty metadata chunk.

The text field is already guarded with a type check (typeof p.text === 'string'), but the other three fields are not. Using loose null equality (!= null) would cover both null and undefined in one check, matching the defensive posture of the text guard.

Suggested change
if (p.functionCall !== undefined) return true;
if (p.functionCall != null) return true;
if (p.inlineData != null) return true;
if (p.executableCode != null) return true;

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

// content. hasUserVisibleContent skips role-only / usageMetadata-only
// chunks, so TTFT reflects "model produced something the operator can
// attribute to user-perceived latency."
if (ttftMs === undefined && hasUserVisibleContent(response)) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] This call is inside the for await loop that streams chunks to the user. If hasUserVisibleContent ever throws (e.g., an SDK update changes the Part type shape), the exception propagates into the catch (error) block below, aborting the user's response stream. The error trace would point at streamContentDetection internals — not obviously a telemetry bug.

Since TTFT detection is best-effort telemetry, it should never affect the user-visible response. A try/catch here isolates detection failures:

Suggested change
if (ttftMs === undefined && hasUserVisibleContent(response)) {
try {
if (ttftMs === undefined && hasUserVisibleContent(response)) {
ttftMs = Date.now() - startTime;
}
} catch {
// Detection failure must not abort the user stream.
}

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

if (metadata.ttftMs !== undefined) {
const samplingMs = Math.max(
0,
duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The sampling_ms formula duration - ttftMs - (requestSetupMs ?? 0) assumes ttftMs and requestSetupMs share the same time origin as duration (all measured from startTime). When Phase 4b populates requestSetupMs, if ttftMs is still measured from startTime (not from the successful attempt's dispatch), the formula will double-subtract requestSetupMs and produce incorrect sampling_ms / output_tokens_per_second values.

This invariant exists only in the reader's head — not in the type system or an assertion. Consider documenting it explicitly, e.g.:

Suggested change
duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),
// INVARIANT: ttftMs and requestSetupMs share the same time origin
// as `duration` (generateContent[Stream] entry). If either measurement
// base changes in a future phase, this formula must be updated.
const samplingMs = Math.max(

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric)

5 participants