feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731) by doudouOUC · Pull Request #4417 · QwenLM/qwen-code

doudouOUC · 2026-05-22T00:31:05Z

Summary

Phase 4a of P3 LLM request timing decomposition (#3731 / #4413). Adds TTFT (time-to-first-token) measurement and OTel GenAI semantic convention dual-emit attributes to the qwen-code.llm_request span.

New hasUserVisibleContent(chunk) helper centralises first-token detection on the normalised GenerateContentResponse shape — works uniformly across Anthropic / OpenAI / Gemini / Qwen providers.
TTFT captured in LoggingContentGenerator.generateContentStream's stream wrapper using method-local closure (NEVER instance fields — LoggingContentGenerator is one-per-ContentGenerator and shared across concurrent calls; subagent fan-out, warmup, side-queries).
LLMRequestMetadata extended with ttftMs, requestSetupMs, attempt, retryTotalDelayMs, cachedInputTokens. The 4b/4c placeholders are declared optional now so subsequent PRs don't have to revisit the schema.
endLLMRequestSpan writes new attrs (ttft_ms, cached_input_tokens, derived sampling_ms clamped at 0, derived output_tokens_per_second rounded to 2 decimals + omitted when sampling_ms == 0) plus Phase 4b placeholders when the caller provides them.
GenAI semconv dual-emit (private name authoritative, semconv = compat layer per feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410 precedent): gen_ai.request.model (Stable), gen_ai.usage.input_tokens (Stable), gen_ai.usage.output_tokens (Stable), gen_ai.usage.cached_tokens (Experimental), gen_ai.server.time_to_first_token (Experimental, seconds-as-double).

TTFT semantic intentionally diverges from claude-code's ttftMs (which fires on Anthropic's message_start metadata event). qwen-code measures first user-visible content — uniform across providers, matches the literal "first token" intent. See design doc D1 for rationale.

Closes

Fixes part of feat(telemetry): harden OpenTelemetry configuration, HTTP OTLP behavior, and runtime safety #3731 (P3 Phase 4a)
Closes feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric) #4413 partially (Phase 4a only; 4b + 4c follow in separate PRs)

Design doc

docs/design/telemetry-llm-request-timing-design.md — 8 locked decisions (D1, D2, D3, D4, D5, D6, D7, D8) with "Why not X" justifications, lifecycle wiring (streaming + non-streaming + 4 retry sites for 4b), files-to-change table with LOC estimates, test strategy, edge cases, rollback plan, and a comparison with claude-code (which has the full decomposition but only for Perfetto) and opencode (which has no TTFT at all).

PR split

This is Phase 4a (~200 LOC of production code, self-contained).

Phase 4b (next PR) — retryWithBackoff onRetry callback + ApiRetryEvent + 4 LLM caller wiring (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039). Independently a bug fix for the retryWithBackoff telemetry blackout (currently only debugLogger.warn).
Phase 4c — recordApiRequestBreakdown() activation + parent-span retry-aggregation attrs. Depends on 4a + 4b.

Test plan

Unit — hasUserVisibleContent part-type matrix (16 cases): non-empty text, empty text, functionCall, inlineData, executableCode, thought: true, thought: false, thought missing, text: '' + thought: true (Anthropic <thinking> first chunk), mixed parts, empty parts array, missing candidates, missing content, missing parts, usage-only / role-only chunks, defensive non-object parts.
Unit — endLLMRequestSpan Phase 4a (timing decomposition + dual-emit) (17 cases): all gen_ai.* dual-emits present + value-equal + correct units (seconds for ttft); cached_input_tokens boundary cases (undefined, 40, 0 explicit cache miss); ttft_ms writes seconds-as-double for gen_ai counterpart; ttft undefined ⇒ no derived attrs; sampling_ms derived correctly (no setup, with setup, clamped to 0 on clock skew); output_tokens_per_second rounding (88.89), divide-by-zero guard (sampling_ms == 0), missing outputTokens guard; Phase 4b placeholders written when caller provides them, omitted otherwise.
Unit — LoggingContentGenerator stream + non-stream paths (4 cases): TTFT captured on first user-visible chunk, undefined when stream yields only metadata chunks; cachedInputTokens forwarded from usageMetadata.cachedContentTokenCount on both stream and non-stream success.
Type-check (tsc --noEmit) clean
Lint (eslint) clean
All 448 telemetry tests pass; no regressions in loggingContentGenerator.test.ts existing 39 cases

Review summary

This PR went through 4 specialised review agents (code-reviewer, silent-failure-hunter, type-design-analyzer, comment-analyzer) before opening:

Critical fix applied in commit aa66dc7aa: hasUserVisibleContent originally used p.thought !== undefined, which falsely fires TTFT on { thought: false } parts. The codebase treats thought as a boolean flag (loggingContentGenerator.ts:773); fixed to p.thought === true and added 3 boundary test cases.
Type-design suggestion declined (push back): grouping the retry trio (attempt, requestSetupMs, retryTotalDelayMs) into a sub-object. Flat schema matches existing inputTokens/outputTokens precedent and 4b will populate all three at the same callsite — sub-object adds ceremony without invariant gain.
Type tightening declined (push back): replacing unknown-narrowing in isUserVisiblePart with a Part-typed parameter. Defensive against schema drift is the explicit design intent in D1; switching loses that defensiveness without a clear benefit at runtime.
Other suggestions declined as stylistic / non-actionable per 不追求 0 unresolved threads.

…wenLM#3731) Adds time-to-first-token measurement and OTel GenAI semantic convention dual-emit to the qwen-code.llm_request span. First slice of Phase 4 LLM request timing decomposition (sub-issue QwenLM#4413). Builds on Phase 1 (QwenLM#4126), Phase 1.5 (QwenLM#4302), Phase 2 (QwenLM#4321). Self-contained — ships TTFT visibility without depending on the Phase 4b retry-callback work. Changes ------- - New hasUserVisibleContent(chunk) helper centralizes "is this chunk first user-visible output" detection across all 4 providers. Operates on the normalized GenerateContentResponse shape each provider emits before LoggingContentGenerator sees the stream. Recognizes text / functionCall / inlineData / executableCode / thought parts; skips role-only and usageMetadata-only chunks. - TTFT capture in LoggingContentGenerator.generateContentStream's stream wrapper. Method-local closure variable, NEVER instance field — the generator is one-per-ContentGenerator and shared across concurrent calls (subagent fan-out, warmup, side-queries). - Extends LLMRequestMetadata with: ttftMs, requestSetupMs, attempt, retryTotalDelayMs, cachedInputTokens. The first is populated in 4a; the middle three are declared as optional so Phase 4b's retry callback can populate them without touching the schema again. - endLLMRequestSpan writes new span attrs: ttft_ms, sampling_ms (derived: duration - ttft - requestSetup, clamped to >= 0), output_tokens_per_second (derived: outputTokens / (sampling_ms / 1000), omitted when sampling_ms == 0 to avoid divide-by-zero, rounded to 2 decimals), cached_input_tokens, plus Phase 4b placeholders (attempt, request_setup_ms, retry_total_delay_ms) when caller provides them. - GenAI semconv dual-emit. Private qwen-code names stay authoritative (dashboards, SLOs, ARMS queries reference these); semconv is a compat layer for spec-aware backends. Same pattern as Phase 3 (QwenLM#4410): qwen-code.model + gen_ai.request.model (Stable) input_tokens + gen_ai.usage.input_tokens (Stable) output_tokens + gen_ai.usage.output_tokens (Stable) cached_input_tokens + gen_ai.usage.cached_tokens (Experimental) ttft_ms (ms, int) + gen_ai.server.time_to_first_token (s, double, Experimental) TTFT semantic intentionally diverges from claude-code's ttftMs (which fires on Anthropic's message_start metadata event). qwen-code measures "first user-visible content" — uniform across Anthropic / OpenAI / Gemini / Qwen, matches the literal "first token" intent. See design doc D1 for rationale. Design doc ---------- docs/design/telemetry-llm-request-timing-design.md (~530 lines) covers all 7 locked decisions with "Why not X" justifications, lifecycle wiring for streaming + non-streaming + 4 retry sites, files-to-change table with LOC estimates, test strategy, edge cases, rollback plan, and a comparison with claude-code (full decomposition but only for Perfetto) and opencode (no TTFT measurement at all). Tests ----- - streamContentDetection.test.ts (13 tests) covers part types matrix. - session-tracing.test.ts +16 tests (new describe block) cover dual-emit for every gen_ai.* field, sampling_ms clamping at 0 for clock skew, output_tokens_per_second rounding + divide-by-zero, Phase 4b placeholder fields written when present, absent when not. - loggingContentGenerator.test.ts +4 tests cover TTFT capture in streaming, undefined when stream yields only metadata chunks, and cachedInputTokens forwarding from usageMetadata on both stream and non-stream paths. All 448 telemetry tests pass, tsc --noEmit clean, eslint clean. Phase 4b (retry callback + ApiRetryEvent) and Phase 4c (recordApiRequestBreakdown activation) follow in separate PRs per the design doc split. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

…ndary tests (QwenLM#3731) Address review findings on commit 5db5f9b (Phase 4a) before opening the PR. 1. hasUserVisibleContent: tighten the `thought` Part check from `p.thought !== undefined` to `p.thought === true`. The codebase treats `thought` as a boolean flag (see loggingContentGenerator.ts:773 / :865 where `part.thought ? {...} : {}` is the canonical pattern). The original predicate would falsely fire TTFT on a degenerate `{ thought: false }` part (which means "explicitly not a thought") and on bare `{ thought: undefined }` shapes that some normalizers may emit. Affects Phase 4a only — TTFT semantics, not retry telemetry. Existing test for `thought: true` is kept; added 3 new cases: - `{ thought: false }` returns false - `{}` (no thought key) returns false - `{ text: '', thought: true }` (typical Anthropic <thinking> first chunk) returns true — confirms the design-doc D1 intent that "thinking-stream started" counts as user-visible TTFT. 2. session-tracing.test.ts: add boundary case for `cachedInputTokens === 0`. Providers reporting an explicit cache miss (0) is meaningful info distinct from undefined ("we don't know"). The test asserts both `cached_input_tokens` and `gen_ai.usage.cached_tokens` propagate the literal 0, not get filtered. 3. JSDoc fix: remove the misleading "guard with `in`" comment from `streamContentDetection.ts` that referred to a guard pattern not actually used in the implementation. All 129 tests pass (was 125; +4 new), tsc --noEmit clean, eslint clean. No production-code behavior change beyond the `thought` predicate fix — which by construction can only ever shrink the set of chunks treated as TTFT-triggering, so cannot regress any successful TTFT capture. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

Copilot

Pull request overview

Adds Phase 4a of LLM request timing decomposition by capturing TTFT (time-to-first-token) and dual-emitting OpenTelemetry GenAI semantic convention attributes on qwen-code.llm_request spans, with supporting helpers and unit tests.

Changes:

Extend LLMRequestMetadata and endLLMRequestSpan to emit TTFT/cached tokens + derived sampling metrics, and to dual-emit gen_ai.* attributes.
Add a centralized hasUserVisibleContent() predicate to detect the first “user-visible” streaming chunk across providers.
Wire TTFT + cached token propagation through LoggingContentGenerator and add focused unit tests + design documentation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
packages/core/src/telemetry/session-tracing.ts	Adds new LLM span metadata fields and emits new timing/token + GenAI dual-emit attributes (including derived sampling metrics).
packages/core/src/telemetry/session-tracing.test.ts	Adds unit coverage for Phase 4a attributes, derived metrics, and dual-emit behavior.
packages/core/src/core/loggingContentGenerator/streamContentDetection.ts	Introduces shared helper for “first user-visible chunk” detection used for TTFT.
packages/core/src/core/loggingContentGenerator/streamContentDetection.test.ts	Adds a part-type matrix test suite for user-visible chunk detection.
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts	Captures TTFT during streaming and forwards cached token counts into LLM span metadata.
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts	Adds tests for TTFT capture + cached token forwarding in stream and non-stream paths.
docs/design/telemetry-llm-request-timing-design.md	Adds the Phase 4 design doc describing decisions, semantics, and planned follow-ups.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        // chunks, so TTFT reflects "model produced something the operator can
+        // attribute to user-perceived latency."
+        if (ttftMs === undefined && hasUserVisibleContent(response)) {
+          ttftMs = Date.now() - startTime;
+        }


+      p.inlineData !== undefined ||
+      p.executableCode !== undefined ||
+      // @ts-expect-error — `thought` is not on all SDK versions but providers emit it
+      p.thought !== undefined,


+| [OTel Trace Spec — Span Events](https://opentelemetry.io/docs/specs/otel/trace/api/#add-events)                             | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent                                                                                                                     |
+| Phase 3 design doc (`telemetry-subagent-spans-design.md`)                                                                   | Established the dual-emit pattern (`qwen-code.subagent.id` + `gen_ai.agent.id`) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields                                                                                                                                        |
+
+## Design — seven decisions, each justified


github-actions · 2026-05-22T00:36:22Z

📋 Review Summary

This PR (Phase 4a) implements TTFT (time-to-first-token) measurement and OTel GenAI semantic convention dual-emit for LLM request spans. The implementation is well-structured, thoroughly tested, and follows established patterns from the codebase. The design decisions are extensively documented with clear justifications. Overall assessment: strong implementation with minor suggestions for improvement.

🔍 General Feedback

Excellent design documentation: The design doc (telemetry-llm-request-timing-design.md) is comprehensive with 8 locked decisions (D1-D8), each with "Why not X" justifications and cross-product comparisons (claude-code, opencode).
Clean separation of concerns: The hasUserVisibleContent helper centralizes first-token detection logic, avoiding per-provider duplication.
Correct concurrency handling: TTFT capture uses method-local closure variables (never instance fields) in LoggingContentGenerator.generateContentStream, which is critical since the class is one-per-ContentGenerator and shared across concurrent calls.
Comprehensive test coverage: 16 test cases for hasUserVisibleContent part-type matrix and 17 test cases for endLLMRequestSpan Phase 4a timing decomposition + dual-emit.
GenAI semantic convention dual-emit follows precedent: Private names (qwen-code.*) remain authoritative; gen_ai.* attributes serve as a compatibility layer for spec-aware backends (consistent with Phase 3 feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410).
Phase 4b/4c placeholders properly declared: New fields (requestSetupMs, attempt, retryTotalDelayMs) are optional now, preventing schema churn in follow-up PRs.

🎯 Specific Feedback

🟡 High

File: streamContentDetection.ts:53 - The thought field check uses === true which is correct per D1, but the comment says "Match strict === true rather than checking presence". Consider adding a defensive check for non-boolean thought values (e.g., if a provider emits "true" as string). Current code handles this via type coercion, but explicit validation would be safer:
```
if (typeof p.thought === 'boolean' && p.thought === true) return true;
```
File: session-tracing.ts:450 - The derived sampling_ms calculation uses Math.max(0, ...) to clamp for clock skew, but this masks potential underlying issues. Consider adding a debug log when clamping occurs (similar to existing debugLogger.warn patterns in the file) to help diagnose clock skew in production:
```
if (samplingMs === 0 && duration > metadata.ttftMs + (metadata.requestSetupMs ?? 0)) {
  debugLogger.warn(`sampling_ms clamped to 0 due to clock skew: duration=${duration}, ttftMs=${metadata.ttftMs}, requestSetupMs=${metadata.requestSetupMs}`);
}
```

🟢 Medium

File: loggingContentGenerator.ts:533 - The TTFT capture condition if (ttftMs === undefined && hasUserVisibleContent(response)) is correct but could benefit from an inline comment explaining why we check ttftMs === undefined (to ensure we capture the first token only, not overwrite on subsequent chunks). The design doc D2 explains this, but a brief inline comment would help future maintainers:
```
// Capture TTFT on first user-visible chunk only; subsequent chunks must not overwrite
if (ttftMs === undefined && hasUserVisibleContent(response)) {
```
File: session-tracing.ts:420-423 - The dual-emit for gen_ai.usage.cached_tokens notes "Experimental — may rename before Stable". Consider adding a TODO comment with a reference to the OTel spec version or a tracking issue number so future maintainers know when to update:
```
// TODO(#XXXX): Update when OTel GenAI semconv stabilizes (currently Experimental)
```
File: streamContentDetection.test.ts:116-121 - The test for non-object parts (null, undefined, 'string', 42) is excellent defensive testing. Consider adding one more edge case: an object with text as a non-string type (e.g., { text: 123 }) to verify the typeof p.text === 'string' guard works correctly.

🔵 Low

File: session-tracing.ts:66-76 - The JSDoc for ttftMs mentions "diverges from claude-code's ttftMs" which is important context. Consider linking directly to the design doc section (D1) for easier navigation:
```
 * See docs/design/telemetry-llm-request-timing-design.md (D1) for rationale.
```
File: streamContentDetection.ts:17 - The comment lists thought / reasoning content (provider-dependent; o1, qwen thinking, Anthropic <thinking>). Consider adding Gemini to this list since the design doc D1 explicitly mentions "Gemini's thought" as well.
File: loggingContentGenerator.ts:468-474 - The TTFT comment block is comprehensive but could reference the design doc D2 directly for the "Why never instance fields" explanation, making it easier for future maintainers to find the full rationale.

✅ Highlights

hasUserVisibleContent implementation (streamContentDetection.ts:34-57) - Clean, well-documented function with excellent type safety and defensive handling of edge cases. The explicit handling of thought: false vs thought: true vs missing thought shows deep understanding of the codebase conventions.
Test coverage for edge cases (streamContentDetection.test.ts) - The 16-case matrix thoroughly covers all part types including Anthropic's <thinking> first chunk pattern ({ text: '', thought: true }), usage-only chunks, and defensive non-object handling.
Phase 4a/4b/4c split discipline - The PR strictly implements Phase 4a (TTFT + dual-emit) while declaring Phase 4b/4c placeholders as optional fields. This prevents scope creep and keeps the PR reviewable at ~200 LOC of production code.
GenAI dual-emit consistency (session-tracing.ts:403-426) - All dual-emits follow the same pattern with clear comments about stability levels (Stable vs Experimental) and the "private name authoritative" rule from Phase 3.
Idle timeout handling (loggingContentGenerator.ts:493-517) - The span timeout mechanism correctly handles the "consumer abandoned the generator" case and sets spanEndedByTimeout to prevent redundant endLLMRequestSpan calls.

wenshao

No review findings. Downgraded from Approve to Comment: CI still running. Code is clean — tsc/eslint/build all pass, 129/129 tests green. Implementation correctly handles TTFT capture, GenAI semconv dual-emit, derived metrics, and edge cases (clock skew, divide-by-zero, cachedInputTokens 0 vs undefined). — qwen-latest-series-invite-beta-v36 via Qwen Code /review

pomelo-nwu

文件命名惯例：是否考虑统一为 kebab-case？

注意到本 PR 新增的 streamContentDetection.ts 位于 core/loggingContentGenerator/ 目录下，与该目录现有文件（loggingContentGenerator.ts）保持一致的 camelCase 命名，这没有问题。

不过借这个机会提一个更大范围的观察 —— packages/core/src 目前的文件命名存在两种风格混用：

camelCase（185 个文件）：集中在 core/、hooks/、memory/、followup/ 等较早的模块
kebab-case（98 个文件）：集中在 telemetry/、tools/、mcp/、providers/、skills/ 等较新的模块

从目录层面看分化更明显：

telemetry/ 几乎全部 kebab-case（session-tracing.ts、detailed-span-attributes.ts、log-to-span-processor.ts、resource-attributes.ts...）
tools/ 全部 kebab-case（read-file.ts、write-file.ts、web-fetch.ts...）
core/ 几乎全部 camelCase（baseLlmClient.ts、contentGenerator.ts、coreToolScheduler.ts...）
唯一的例外如 core/permission-helpers.ts（kebab）和 telemetry/uiTelemetry.ts（camel）

这不是本 PR 引入的问题，只是趁这个 PR 提一下：后续新增文件是否应该考虑逐步统一到 kebab-case？ 一些考虑：

kebab-case 在 Node.js 生态中更常见（npm 包名、URL path 友好）
较新模块已经自然收敛到 kebab-case
可以与 issue #4063 中的命名统一项（#10、#13）一起推进

当然这是一个全局性的风格决策，不应该卡在本 PR 上。只是想听一下作者和各位的看法 🤔

pomelo-nwu · 2026-05-22T00:54:47Z

补充几点上面 review 没覆盖到的观察：

cli 包同样存在混用

packages/cli/src 的情况类似：282 个 camelCase 文件 vs 31 个 kebab-case 文件。目录层面 dualOutput/、nonInteractive/、remoteInput/ 用 camelCase，acp-integration/、prompt-processors/、test-commands/ 用 kebab-case。只是 cli 包 kebab-case 占比更低（约 7%），不一致性比 core 包更明显。

tools 目录的命名-语义脱钩问题

packages/core/src/tools/ 里还有一个更具体的问题：文件名命名风格与 toolName 的对应关系不一致。

kebab-case 文件名与 toolName 有清晰的 1:1 机械映射（- → _）：

task-stop.ts → task_stop
web-fetch.ts → web_fetch
read-file.ts → read_file
cron-create.ts → cron_create

但 camelCase 文件名与 toolName 的对应关系就模糊了：

askUserQuestion.ts → ask_user_question（语义对应但风格不同）
todoWrite.ts → todo_write（同上）
ripGrep.ts → grep_search（完全脱钩，文件名和工具名无直接关系）
syntheticOutput.ts → structured_output（完全脱钩）

如果统一到 kebab-case，文件名与 toolName 就能形成一致的转换规则（kebab-file.ts → kebab_file），降低新贡献者的认知负担。

这不是 blocking 问题，只是趁这个 PR 把观察到的全貌摆出来，方便后续做决策 🤔

pomelo-nwu · 2026-05-22T00:58:56Z

回复 @doudouOUC 关于 kebab-case 统一的想法：

完全同意「机械规则 > 人工审查」的原则。命名规范如果只靠 reviewer 人肉盯，迟早会漏。以下是我目前觉得最可行的两条路径：

1. ESLint check-file 规则（CI 硬卡）

eslint-plugin-check-file 可以直接在 lint 阶段拦截不符合 kebab-case 的新文件名：

// eslint.config.js
import checkFile from 'eslint-plugin-check-file';

export default [
  {
    plugins: { 'check-file': checkFile },
    rules: {
      'check-file/filename': ['error', 'KEBAB_CASE', {
        // 逐步迁移：先 allowlist 存量 camelCase 文件
        // 每次重命名一批就从列表中移除
      }],
    },
  },
];

好处是零额外基础设施、CI 已有 ESLint pipeline，加一条规则就能跑。KEBAB_CASE 是插件内置 pattern，不需要手写正则。

对于存量文件，可以用一个 allowedFilenamePatterns 列表做渐进式迁移 —— 不要求一次性改完 185 个文件，但新增文件必须是 kebab-case，每次有人重命名旧文件就从 allowlist 中删一条。

2. AGENTS.md 补充（AI agent 软约束）

在 AGENTS.md 的 Code Conventions 小节加一段：

- **File naming**: All new `.ts` files must use `kebab-case`
  (e.g., `stream-content-detection.ts`, not `streamContentDetection.ts`).
  Existing `camelCase` files are legacy and should be renamed to
  `kebab-case` opportunistically when touched.

这条主要约束 AI coding agent（Qwen Code 自身、Copilot 等）在生成新文件时自动遵守 kebab-case。配合 ESLint 硬卡，形成"agent 自觉遵守 + CI 兜底拦截"的双层防线。

建议拆成独立 issue 推进，不卡本 PR。可以归到 #4063 的 #10（命名统一）下作为一个子任务，包含：

ESLint 规则 + allowlist 初始数据
AGENTS.md 更新
存量文件分批迁移计划（按目录粒度，每个目录一个 PR）

@doudouOUC 要不要开一个 issue 来跟踪？我可以帮忙把 ESLint 规则的 PoC 先跑出来 🤔

pomelo-nwu · 2026-05-22T00:59:40Z

回复 @doudouOUC

关于「kebab-case 在 Node.js 生态中更常见」这个说法，确实说得过于笼统了 —— 不同子领域差异很大，web 框架圈和 infra/AI 圈的习惯完全不一样。这个论据站不住，收回。

但你后面提的可观测性管线兼容性这个角度是真正的硬论据，比我的「生态常见」说法有说服力得多：

文件名 kebab-case → span/metric name snake_case，变换路径最短（- → _，一一对应）
Prometheus 强制 snake_case，OTel semconv 用 . 分隔（gen_ai.usage.input_tokens）
Arize Phoenix / Langfuse 的 span 树和 trace 视图确实对 kebab/snake 的可读性更好
如果上游文件名就是 camelCase，到下游每个集成点都要做一次不规则的 case 转换，容易出错

这个思路其实和你在 PR 里做 GenAI semconv dual-emit 的逻辑是一脉相承的 —— 命名规范不是美学问题，是跨系统互操作的工程约束。

关于 AGENTS.md 的写法，你说得对，应该写成变换规则而不只是静态命名规范。建议这样措辞：

- **File naming**: Use `kebab-case` for all new `.ts` files.
  Conversion to downstream identifiers is mechanical:
  - File → tool name: replace `-` with `_` (e.g., `task-stop.ts` → `task_stop`)
  - File → OTel span attr: replace `-` with `.` prefix as needed
  - File → Prometheus metric: replace `-` with `_`
  Existing `camelCase` files are legacy; rename opportunistically when touched.

这样写的好处是贡献者（和 AI agent）一看就知道为什么选 kebab-case、怎么变换到下游系统，而不是单纯记一条风格规则。

ESLint check-file + AGENTS.md 双层防线 + 独立 issue 归到 #4063 #10 下，这个路径不变。要开 issue 的话我可以先把 ESLint PoC 和 AGENTS.md diff 准备好。

pomelo-nwu · 2026-05-22T01:00:47Z

回复 @doudouOUC 关于共享工具函数和变换链路的问题

刚扫了一遍代码库，结论比较直接：

目前没有共享的 case 转换工具函数

无 toSnakeCase()、toKebabCase()、camelToSnake() 之类的 util
无第三方 case 库引用（lodash、change-case 等都没有）
tool-names.ts 里的 tool name 全部是硬编码字面量（'task_stop'、'web_fetch'）
telemetry/constants.ts 和 session-tracing.ts 里的 OTel attribute 也全部是硬编码字符串（'qwen-code.tool_call'、'gen_ai.usage.input_tokens'）
Prometheus export 层的 metric name 同样是手写的

也就是说，你提到的 ~1200 LOC 双写路径里，每一跳的命名都是手工写死的字面量，没有经过任何程序化的 case 变换。文件名 → tool name → OTel attr → Prometheus metric 之间不存在自动转换链路。

这对 kebab-case 迁移 issue 的 scope 意味着什么

文件名 kebab-case 化能解决的只是第一跳的一致性：task-stop.ts → task_stop（tool name）。后面几跳（OTel attr 用 .、Prometheus 用 _）是各系统自己的规范，和文件名无关 —— 它们本来就该用各系统的 native convention。

所以建议 kebab-case 迁移 issue 的 scope 收窄为：

文件名统一 kebab-case（ESLint 规则 + AGENTS.md 变换规则文档）
tool name 保持 snake_case，与文件名形成 - → _ 的机械映射
OTel attr / Prometheus metric 命名不属于这个 issue 的范围（它们有自己的 semconv 规范）

关于你提的「变换规则文档」，与其做代码级的 snake_case() util（当前没有实际需求），不如在 AGENTS.md 里把映射关系写清楚就够了 —— 因为现阶段这些变换都是开发者/AI agent 在写代码时心算的，不是运行时计算的。AGENTS.md 的措辞上一轮已经给了草稿，可以直接用。

要开 issue 的话，scope 就是 ESLint 规则 + AGENTS.md 更新 + 存量文件分批迁移，不涉及运行时代码改动。我可以先准备 PoC。

pomelo-nwu

✅ Approve

忽略文件命名风格问题（已在 separate comment thread 中跟踪，归入 #4063），本 PR 代码质量很高，可以合并。

核心生产代码 ~107 行（loggingContentGenerator.ts +19 行，session-tracing.ts +88/-2 行，新建 streamContentDetection.ts 57 行），改动集中在遥测层，不碰业务逻辑。

爆炸半径极小：

endLLMRequestSpan 只有一个调用方（loggingContentGenerator.ts，5 个 call site）
LLMRequestMetadata 扩展全是 optional 字段 —— 现有调用者不传这些字段时行为完全不变，零 breaking change
hasUserVisibleContent 纯函数、无副作用，只在流式包装器的一处被调用
不碰 ContentGenerator 接口、不碰 provider 适配器、不碰 retryWithBackoff（Phase 4b 的事）
TTFT 用 method-local 闭包变量而非 instance field，并发 generateContentStream 调用间不会互相污染

边界条件覆盖充分：

thought: false 误触发 → p.thought === true 而非 !== undefined
cachedInputTokens === 0 被吞 → !== undefined 判断
sampling_ms 负数 → Math.max(0, ...)
output_tokens_per_second 除零 → samplingMs > 0 守卫
双写 span 碰撞 → metric 写入在已有的 double-end guard 内

37 个新测试 + 538 行设计文档（每个决策点都有 "why not X"）。教科书级的增量遥测 PR。

— qwen3.7-max via Qwen Code

wenshao · 2026-05-22T01:27:10Z

+    thought?: unknown;
+  };
+  if (typeof p.text === 'string' && p.text.length > 0) return true;
+  if (p.functionCall !== undefined) return true;


[Suggestion] The three presence checks (p.functionCall !== undefined, p.inlineData !== undefined, p.executableCode !== undefined) treat null as present (null !== undefined is true). A part shaped { functionCall: null } would be classified as user-visible, potentially firing TTFT on an empty metadata chunk.

The text field is already guarded with a type check (typeof p.text === 'string'), but the other three fields are not. Using loose null equality (!= null) would cover both null and undefined in one check, matching the defensive posture of the text guard.

Suggested change

if (p.functionCall !== undefined) return true;

if (p.functionCall != null) return true;

if (p.inlineData != null) return true;

if (p.executableCode != null) return true;

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

wenshao · 2026-05-22T01:27:11Z

+        // content. hasUserVisibleContent skips role-only / usageMetadata-only
+        // chunks, so TTFT reflects "model produced something the operator can
+        // attribute to user-perceived latency."
+        if (ttftMs === undefined && hasUserVisibleContent(response)) {


[Suggestion] This call is inside the for await loop that streams chunks to the user. If hasUserVisibleContent ever throws (e.g., an SDK update changes the Part type shape), the exception propagates into the catch (error) block below, aborting the user's response stream. The error trace would point at streamContentDetection internals — not obviously a telemetry bug.

Since TTFT detection is best-effort telemetry, it should never affect the user-visible response. A try/catch here isolates detection failures:

Suggested change

if (ttftMs === undefined && hasUserVisibleContent(response)) {

try {

if (ttftMs === undefined && hasUserVisibleContent(response)) {

ttftMs = Date.now() - startTime;

}

} catch {

// Detection failure must not abort the user stream.

}

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

wenshao · 2026-05-22T01:27:11Z

+      if (metadata.ttftMs !== undefined) {
+        const samplingMs = Math.max(
+          0,
+          duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),


[Suggestion] The sampling_ms formula duration - ttftMs - (requestSetupMs ?? 0) assumes ttftMs and requestSetupMs share the same time origin as duration (all measured from startTime). When Phase 4b populates requestSetupMs, if ttftMs is still measured from startTime (not from the successful attempt's dispatch), the formula will double-subtract requestSetupMs and produce incorrect sampling_ms / output_tokens_per_second values.

This invariant exists only in the reader's head — not in the type system or an assertion. Consider documenting it explicitly, e.g.:

Suggested change

duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),

// INVARIANT: ttftMs and requestSetupMs share the same time origin

// as `duration` (generateContent[Stream] entry). If either measurement

// base changes in a future phase, this formula must be updated.

const samplingMs = Math.max(

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

Co-authored-by: Tommaso Sciortino <sciortino@gmail.com>

doudouOUC added 2 commits May 22, 2026 02:03

Copilot AI review requested due to automatic review settings May 22, 2026 00:31

Copilot started reviewing on behalf of doudouOUC May 22, 2026 00:31 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

wenshao reviewed May 22, 2026

View reviewed changes

pomelo-nwu reviewed May 22, 2026

View reviewed changes

pomelo-nwu approved these changes May 22, 2026

View reviewed changes

wenshao reviewed May 22, 2026

View reviewed changes

pomelo-nwu mentioned this pull request May 22, 2026

Standardize file naming to kebab-case with ESLint enforcement and AGENTS.md documentation #4419

Open

github-actions Bot mentioned this pull request May 22, 2026

📊 AI CLI 工具社区动态日报 2026-05-22 ivanweng2077/big_model_radar#68

Open

doudouOUC requested a review from yiliang114 May 22, 2026 02:23

yiliang114 approved these changes May 22, 2026

View reviewed changes

github-actions Bot mentioned this pull request May 22, 2026

📊 AI CLI 工具社区动态日报 2026-05-22 ivanweng2077/big_model_radar#73

Open

doudouOUC merged commit fd75f77 into QwenLM:main May 22, 2026
22 checks passed

doudouOUC deleted the feat/telemetry-phase-4a-ttft branch May 22, 2026 02:54

doudouOUC mentioned this pull request May 22, 2026

feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432

Merged

7 tasks

wenshao mentioned this pull request May 24, 2026

chore(integration): sync main into daemon_mode_b_main (2026-05-24) #4469

Merged

5 tasks

github-actions Bot mentioned this pull request May 25, 2026

Upstream changes detected: ACP spec, Gemini CLI, Codex CLI (App Server Protocol), GitHub Copilot CLI, Qwen Code greenpolo/cc-multi-cli-plugin#5

Open

doudouOUC mentioned this pull request May 31, 2026

feat(telemetry): harden OpenTelemetry configuration, HTTP OTLP behavior, and runtime safety #3731

Open

53 tasks

xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026

Handle unhandled rejections more gracefully. (QwenLM#4417)

c337a2b

Co-authored-by: Tommaso Sciortino <sciortino@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731)#4417

feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731)#4417
doudouOUC merged 2 commits into
QwenLM:mainfrom
doudouOUC:feat/telemetry-phase-4a-ttft

doudouOUC commented May 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

wenshao left a comment

Uh oh!

pomelo-nwu left a comment

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu left a comment

Uh oh!

wenshao May 22, 2026

Uh oh!

wenshao May 22, 2026

Uh oh!

wenshao May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

-  if (p.functionCall !== undefined) return true;
+  if (p.functionCall != null) return true;
+  if (p.inlineData != null) return true;
+  if (p.executableCode != null) return true;

-          duration - metadata.ttftMs - (metadata.requestSetupMs ?? 0),
+        // INVARIANT: ttftMs and requestSetupMs share the same time origin
+        // as `duration` (generateContent[Stream] entry). If either measurement
+        // base changes in a future phase, this formula must be updated.
+        const samplingMs = Math.max(

Conversation

doudouOUC commented May 22, 2026

Summary

Closes

Design doc

PR split

Test plan

Review summary

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

github-actions Bot commented May 22, 2026

📋 Review Summary

🔍 General Feedback

🎯 Specific Feedback

🟡 High

🟢 Medium

🔵 Low

✅ Highlights

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

pomelo-nwu left a comment

Choose a reason for hiding this comment

文件命名惯例：是否考虑统一为 kebab-case？

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu commented May 22, 2026

Uh oh!

pomelo-nwu left a comment

Choose a reason for hiding this comment

✅ Approve

Uh oh!

wenshao May 22, 2026

Choose a reason for hiding this comment

Uh oh!

wenshao May 22, 2026

Choose a reason for hiding this comment

Uh oh!

wenshao May 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants