feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) by doudouOUC · Pull Request #4432 · QwenLM/qwen-code

doudouOUC · 2026-05-22T07:28:16Z

Summary

Phase 4b of P3 LLM request timing decomposition (#3731 / #4413). Adds per-attempt HTTP-status retry telemetry so operators can see retry behavior in traces, logs, and metrics — previously the retryWithBackoff path at the 4 LLM call sites was completely invisible to telemetry (only debugLogger.warn).

After Phase 4b, a request that goes 500 → 429 → success produces:

3 separate qwen-code.llm_request spans (each with its own duration_ms / ttft_ms / error / status code).
2 qwen-code.api_retry LogToSpanProcessor bridge spans sibling-parented under the caller's active span (interaction or tool).
2 qwen-code.api_retry LogRecords (OTel log signal + QwenLogger RUM).
2 increments on the new qwen-code.api.retry.count counter, tagged {model}.
The successful attempt's LLM span carries attempt: 3, request_setup_ms: ~1200 (cumulative retry overhead), retry_total_delay_ms: ~1500.

Architectural discovery (mid-planning)

The Phase 4 design doc assumed claude-code's "one LLM span owns the retry loop" pattern. Reading the 4 retry call sites carefully revealed qwen-code inverts that: retryWithBackoff sits above LoggingContentGenerator. Each retry attempt creates a fresh LLM span. The original "in-LCG accumulator" plan wouldn't work.

Resolution: propagate retry state via AsyncLocalStorage (retryContext), matching the existing promptIdContext / subagentNameContext / agent-context patterns. retryWithBackoff wraps each await fn() in retryContext.run(...); LoggingContentGenerator reads the ALS in its synchronous prelude (before the first await) and threads the snapshot to all endLLMRequestSpan callsites (success / error / idle-timeout / abort).

The plan went through 3 review rounds by Plan-agents finding 22 issues total — all addressed before implementation. Design doc updated to reflect the ALS-based approach (D4 has a "Phase 4b update" subsection at the top).

Bundled Phase 4a bug fix (sampling_ms formula)

Phase 4a's sampling_ms = duration_ms - ttft_ms - (requestSetupMs ?? 0) was silently wrong: duration_ms only covers ttft + sampling for the span (startTime is captured when startLLMRequestSpan runs, AFTER any setup phase), so subtracting setup again is double-counting. Phase 4a masked the bug because requestSetupMs was always undefined → 0. Phase 4b populates requestSetupMs with cumulative retry overhead — without this fix, sampling_ms would clamp to 0 for every retried request, wiping output-throughput data exactly when operators need it most.

Fix: sampling_ms = duration_ms - ttft_ms (drop the setup subtraction). Phase 4a tests updated.

Closes

Fixes part of feat(telemetry): harden OpenTelemetry configuration, HTTP OTLP behavior, and runtime safety #3731 (P3 Phase 4b)
Closes part of feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric) #4413 (Phase 4b only; 4c follows in a separate PR)

Design doc

docs/design/telemetry-llm-request-timing-design.md — D4 section updated with the Phase 4b discovery + ALS resolution.

Implementation overview

New (packages/core/src/utils/retryContext.ts): AsyncLocalStorage<RetryAttemptContext> with attempt / requestSetupMs / retryTotalDelayMs fields. Mirrors promptIdContext.ts style.

retry.ts:

onRetry?: (info: RetryAttemptInfo) => void option on RetryOptions (opt-in per caller).
Monotonic iterationCount decoupled from attempt (which gets clamped at maxAttempts - 1 in persistent mode). No flip-flopping for mixed-error sequences.
retryContext.run wrap around fn().
onRetry invocations wrapped in try/catch — telemetry exceptions never break the retry loop.
Keeps the existing logRetryAttempt debug log line (useful when OTel SDK isn't wired up).

types.ts: new ApiRetryEvent class with subagent_name field. JSDoc cross-references ContentRetryEvent (they cover different retry budgets and can both fire for one prompt).

loggers.ts: new logApiRetry function — three-sink fan-out matching logContentRetry: QwenLogger RUM, OTel logger.emit (bridged via LogToSpanProcessor), recordApiRetry Counter.

metrics.ts: new qwen-code.api.retry.count Counter tagged with {model}. Full COUNTER_DEFINITIONS entry + initialization + recording function + index.ts export.

qwen-logger.ts: logApiRetryEvent for RUM consistency.

4 LLM caller wiring sites: client.ts, baseLlmClient.ts (×2), geminiChat.ts — each opts in with onRetry callback that emits ApiRetryEvent with subagentName from subagentNameContext.getStore().

LoggingContentGenerator: snapshotRetryMetadata() helper called in the synchronous prelude of generateContent / generateContentStream — only point where retryContext is guaranteed active for the streaming path (returned AsyncGenerator is iterated AFTER retryWithBackoff resolves). Snapshot threaded as a parameter to loggingStreamWrapper. attempt defaults to 1 when no retry context (warmup / side-queries / direct calls).

session-tracing.ts: Phase 4a sampling_ms formula fix + LLMRequestMetadata.requestSetupMs / attempt JSDoc updates.

PR split

This is Phase 4b (~500 LOC production + ~520 LOC tests). Phase 4a (#4417, merged) shipped TTFT + GenAI semconv dual-emit. Phase 4c (next PR) — recordApiRequestBreakdown() activation for the 3 of 4 ApiRequestPhase values that qwen-code has real data for.

Test plan

retry.test.ts +9 cases: monotonic counter, requestSetupMs growth across attempts, first-try success (no telemetry), onRetry callback contract, absent-callback silence, callback-throws resilience, shouldRetryOnError mid-loop giveup, parallel-call ALS isolation, nested-retry inner-frame read.
loggers.test.ts +3 cases: 3-sink fan-out, subagent_name propagation, SDK-not-initialized path.
loggingContentGenerator.test.ts +4 cases: non-stream ALS propagation, non-stream default attempt=1, stream ALS propagation through wrapper closure, stream default attempt=1.
session-tracing.test.ts updates: 1 test rewritten + 1 renamed for the sampling_ms formula fix.
All 580 telemetry + retry + LCG tests pass
tsc --noEmit clean
eslint clean

End-to-end verification (real OTel SDK + outfile)

Beyond unit tests with mocked SDK, drove the full retry → telemetry → exporter chain with the real OTel SDK initialised in outfile mode to confirm the wiring works against actual @opentelemetry/sdk-* packages, not just mocks.

Method

One-shot vitest file under packages/core/src/utils/:

Construct a real Config with telemetry.enabled=true, target='local', outfile=<tmp>.
Call initializeTelemetry(config) — boots the real OTel SDK with FileSpanExporter + FileLogExporter.
Open an LLM span via startLLMRequestSpan.
Drive a contrived retry: retryWithBackoff(fn, opts) where fn throws status=500 for the first 2 calls then returns 'ok' on the 3rd.
onRetry callback emits ApiRetryEvent via logApiRetry(config, ...) for each failure.
Inside fn, snapshot retryContext.getStore() to verify ALS values flow correctly.
Close the LLM span via endLLMRequestSpan with the final attempt's retry context fields.
shutdownTelemetry() to force-flush exporters.
Read the outfile and assert the expected events + span attributes.

Inputs

Field	Value
Stream model	`phase4b-test-model`
`prompt_id`	`verify-prompt-1`
Failures before success	2 (`status=500` each)
`initialDelayMs` / `maxDelayMs`	30ms / 100ms
`maxAttempts`	5
LLM span fields	`inputTokens=100`, `outputTokens=20`, `ttftMs=80`, `durationMs=800`

Outputs (verbatim from outfile, formatting collapsed)

qwen-code.api_retry LogRecord 1:

{
  "session.id": "phase4b-verify-session",
  "event.name": "qwen-code.api_retry",
  "model": "phase4b-test-model",
  "prompt_id": "verify-prompt-1",
  "attempt_number": 1,
  "error_type": "Error",
  "error_message": "Simulated 500 #1",
  "status_code": 500,
  "retry_delay_ms": 28.80,
  "duration_ms": 28.80
}
// _body: "API retry attempt 1 for phase4b-test-model (status 500)."

qwen-code.api_retry LogRecord 2: same shape with attempt_number=2, retry_delay_ms=50.17.

qwen-code.llm_request span attributes:

{
  "qwen-code.model": "phase4b-test-model",
  "gen_ai.request.model": "phase4b-test-model",
  "duration_ms": 800,
  "input_tokens": 100,
  "gen_ai.usage.input_tokens": 100,
  "output_tokens": 20,
  "gen_ai.usage.output_tokens": 20,
  "ttft_ms": 80,
  "gen_ai.server.time_to_first_token": 0.08,
  "request_setup_ms": 81,
  "attempt": 3,
  "retry_total_delay_ms": 78.97,
  "sampling_ms": 720,
  "output_tokens_per_second": 27.78,
  "success": true
}

Expectations matrix

Check	Expected	Actual	✓
ApiRetryEvent record count	2	2	✓
`attempt_number` monotonic	1 → 2	1, 2	✓
`status_code` propagated	500	500	✓
`retry_delay_ms` ≈ jittered backoff	21–39 / 38–62	28.80 / 50.17	✓
LLM span `attempt`	3 (1 success + 2 prior failures)	3	✓
LLM span `retry_total_delay_ms` ≈ sum of backoffs	~78.97	78.97	✓
LLM span `request_setup_ms` ≥ `retry_total_delay_ms` (includes attempt durations)	> 78.97	81	✓
LLM span `sampling_ms` (Phase 4a fix: `duration - ttft`, NOT `... - setup`)	800 - 80 = 720	720	✓
`output_tokens_per_second`	20 / (720/1000) = 27.78	27.78	✓
GenAI semconv dual-emit `gen_ai.server.time_to_first_token` (seconds)	0.08	0.08	✓
3-sink fan-out via `logApiRetry`	OTel log + counter + RUM	log+counter visible in outfile, RUM via QwenLogger	✓
ALS isolation (each attempt sees its own ctx)	attempt 1: setup=0, total_delay=0; attempt 2: total_delay > 0; attempt 3: total_delay > attempt-2	as expected	✓

The Phase 4a sampling_ms formula fix is particularly important to call out: under the old (buggy) formula duration - ttft - setup, Phase 4b's non-zero request_setup_ms=81 would have produced sampling_ms = 800 - 80 - 81 = 639 (wrong) instead of the correct 720. Verified the fix produces the right value live.

Reproduction

The verification file was throwaway and removed after running. To re-run:

cd packages/core
# put the verification file back at src/utils/phase4b-e2e-verify.test.ts
npx vitest run src/utils/phase4b-e2e-verify.test.ts

Outfile written to /var/folders/.../telemetry.jsonl (test mirrors a copy to /tmp/phase4b-e2e-output.jsonl for inspection).

Out of scope (deferred / pre-existing)

Persistent retry mode emission cap (50+ ApiRetryEvent records under QWEN_CODE_UNATTENDED_RETRY). Aggregated attempt / retry_total_delay_ms on the LLM span remain accurate regardless.
SDK-internal retries (openai/google-genai maxRetries=3) remain invisible — operator awareness.
Stream-iteration errors (mid-stream network drop during for await) bypass retryWithBackoff entirely. Pre-existing behavior, not a Phase 4b regression.
shouldRetryOnContent content-retry path (retry.ts:184-193) skips onRetry. No caller in the codebase uses this path today — code path is dead. JSDoc documents the contract.

`ApiRetryEvent` vs `ContentRetryEvent` coexistence

Cleanly partitioned by failure mode — operators summing "total retry count for a prompt" must sum both event streams:

ContentRetryEvent (existing) — emitted by geminiChat's for-loop for InvalidStreamError retries via INVALID_CONTENT_RETRY_OPTIONS (NOT through retryWithBackoff).
ApiRetryEvent (new in 4b) — emitted by retryWithBackoff's onRetry callback for HTTP 429 / 5xx retries.

ApiRetryEvent JSDoc explicitly references ContentRetryEvent to document this.

Bridge span parentage

onRetry fires from inside retry.ts's catch block — at that moment, the failed attempt's LLM span has already ended, and the next attempt's span hasn't started. LogToSpanProcessor-bridged qwen-code.api_retry spans therefore parent to the caller's active span (typically interaction or tool), appearing as siblings of the failed LLM span, not children. This is by design and produces a coherent timeline: failed-LLM-span → api_retry-bridge-sibling → next-LLM-span reads correctly in trace viewers.

github-actions · 2026-05-22T07:31:28Z

📋 Review Summary

This PR (Phase 4b) adds per-attempt HTTP-status retry telemetry to make retry behavior visible in traces, logs, and metrics. The implementation uses AsyncLocalStorage to propagate retry state from retryWithBackoff down to LoggingContentGenerator, enabling each per-attempt LLM span to carry meaningful attempt, request_setup_ms, and retry_total_delay_ms attributes. Additionally, it fixes a Phase 4a bug where sampling_ms calculation would incorrectly subtract requestSetupMs, which would clamp sampling_ms to 0 for retried requests.

🔍 General Feedback

Positive aspects:

Excellent architectural discovery process - the PR description documents the mid-planning discovery that qwen-code's architecture differs from claude-code's, and the ALS-based solution is well-reasoned
Strong test coverage with comprehensive edge cases (parallel calls, nested retries, callback failures)
Follows existing patterns in the codebase (promptIdContext, subagentNameContext, agent-context)
The onRetry callback is opt-in, keeping retry.ts decoupled from telemetry concerns
Defensive programming: callback errors are caught and logged, never breaking the retry loop

Architectural decisions:

Using ALS for retry context propagation is the right choice given the architecture (retry layer sits above LoggingContentGenerator)
Snapshotting retry context in synchronous prelude before first await is critical for streaming path - well understood and documented
Decoupling iterationCount from attempt variable handles persistent retry mode correctly

🎯 Specific Feedback

🟡 High

File: packages/core/src/telemetry/session-tracing.ts:448 - The sampling_ms formula fix is correct, but there's a subtle issue: the comment says "requestSetupMs defaults to 0 when undefined" but the code uses (metadata.requestSetupMs ?? 0). This is correct for Phase 4b, but consider whether this subtraction should happen at all. The PR description states the fix is sampling_ms = duration_ms - ttft_ms (drop setup subtraction entirely), but the code still subtracts requestSetupMs. Recommendation: Clarify whether requestSetupMs should be subtracted or not. If the intent is that duration_ms only covers ttft + sampling, then subtracting requestSetupMs is correct for Phase 4b. But verify this matches the design doc.
File: packages/core/src/utils/retry.ts:224-227 - The retryContext.run() wraps fn(), but there's a subtle timing issue: requestSetupMs is captured immediately before retryContext.run(), but iterationCount is incremented at loop entry. If fn() is synchronous and throws immediately, the onRetry callback fires with the correct context. However, if there's any async setup before the first await in fn(), that time gets attributed to requestSetupMs for the next attempt. Recommendation: Add a comment clarifying that requestSetupMs includes any async setup time in fn() before the first await, or consider whether this is the intended behavior.

🟢 Medium

File: packages/core/src/utils/retryContext.ts:17-38 - The interface documentation is excellent, but consider adding a usage example in the JSDoc showing the typical pattern:
```
// In retryWithBackoff:
await retryContext.run({ attempt, requestSetupMs, retryTotalDelayMs }, fn);

// In LoggingContentGenerator:
const ctx = retryContext.getStore(); // May be undefined outside retry loop
```
This would help future maintainers understand the intended usage pattern.
File: packages/core/src/telemetry/types.ts - The ApiRetryEvent class is well-documented with the JSDoc explaining the difference from ContentRetryEvent. However, consider adding a readonly modifier to the 'event.name' and 'event.timestamp' fields for consistency with other event classes in the file (e.g., ToolOutputTruncatedEvent at line 695 uses readonly).
File: packages/core/src/utils/retry.ts:322-330 - The onRetry callback invocation in the persistent retry path wraps the callback in try/catch, which is correct. However, the debug log uses debugLogger.warn() - consider whether this should be debugLogger.log() instead, since a telemetry callback failure is not necessarily a warning condition (it's expected that telemetry might fail silently). The current approach is defensive, but the log level might generate false positives in monitoring.

🔵 Low

File: packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts:89-92 - The snapshotRetryMetadata() function comment mentions "Phase 4b" but doesn't explain why the synchronous prelude timing is critical. Consider adding a brief note: "Must be called synchronously before first await - ALS context is only active during retryWithBackoff's execution."
File: packages/core/src/utils/retry.ts:215-218 - The comment about iterationCount being decoupled from attempt is helpful, but consider adding a reference to the specific line where attempt gets clamped (attempt = maxAttempts - 1 at line ~344) so readers can quickly verify the claim.
File: packages/core/src/telemetry/types.ts:985-987 - The subagent_name field comment says "Read from subagentNameContext.getStore() at the caller site". Consider making this more explicit: "Populated by caller from subagentNameContext.getStore() - this event class does not read the context directly to maintain separation of concerns."
File: packages/core/src/core/baseLlmClient.ts:248-260 - The onRetry callback captures requestModel and promptId from outer scope. Consider adding a comment explaining that these values remain stable across retry attempts (they don't change mid-retry), so capturing them in the callback closure is correct.
File: packages/core/src/telemetry/loggers.ts - The PR diff shows logApiRetry function should be added, but I don't see it in the current codebase. Ensure this function is implemented following the pattern of logContentRetry (lines 715-735) with:
- QwenLogger event emission
- OTel log record with appropriate attributes
- Metric counter increment (qwen-code.api.retry.count)

✅ Highlights

Excellent test coverage: The retry context tests cover monotonic counters, parallel calls, nested retries, callback failures, and edge cases like shouldRetryOnError returning false mid-loop
Strong documentation: The PR description, design doc updates, and inline comments all clearly explain the architectural discovery and resolution
Defensive error handling: Callback errors are caught and logged without breaking the retry loop
Consistent with existing patterns: The ALS approach matches promptIdContext, subagentNameContext, and agent-context patterns already in the codebase
Phase 4a bug fix: The sampling_ms formula fix is critical and well-motivated - without it, retried requests would have sampling_ms clamped to 0, losing output-throughput data exactly when operators need it most
Opt-in telemetry: The onRetry callback is optional, keeping retry.ts decoupled from telemetry concerns and allowing non-LLM callers to remain silent

⚠️ Pre-Merge Checklist

Before merging, verify:

logApiRetry function exists in packages/core/src/telemetry/loggers.ts and follows the pattern of logContentRetry
ApiRetryEvent is exported from packages/core/src/telemetry/types.ts and included in the TelemetryEvent union type
EVENT_API_RETRY constant is defined in packages/core/src/telemetry/constants.ts
API_RETRY_COUNT metric is defined in packages/core/src/telemetry/metrics.ts with the counter registered
All four LLM call sites have the onRetry callback wired up:
- client.ts:2118-2135 (GeminiClient)
- baseLlmClient.ts:248-260 (two call sites)
- geminiChat.ts:2057-2070
TypeScript compilation passes with no new errors
Tests pass including the new Phase 4b tests in retry.test.ts and loggingContentGenerator.test.ts

Copilot

Pull request overview

Adds Phase 4b retry observability for LLM requests by propagating per-attempt retry state via AsyncLocalStorage and emitting a new api_retry telemetry event + metric, while also fixing a Phase 4a sampling_ms derivation bug.

Changes:

Introduces retryContext (ALS) and wires retryWithBackoff to populate per-attempt retry metadata and optionally invoke an onRetry callback.
Adds ApiRetryEvent + logApiRetry (RUM + OTel log + counter) and wires the 4 LLM retry call sites to emit per-attempt retry telemetry with subagent_name.
Fixes sampling_ms calculation in endLLMRequestSpan to avoid double-subtracting setup time; updates unit tests accordingly.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
packages/core/src/utils/retryContext.ts	Adds ALS container/type for per-attempt retry metadata propagation.
packages/core/src/utils/retry.ts	Adds `onRetry` callback, ALS propagation, and accumulators for retry delay/setup timing.
packages/core/src/utils/retry.test.ts	Adds unit coverage for ALS context propagation + `onRetry` behavior and monotonic attempt counting.
packages/core/src/telemetry/types.ts	Adds `ApiRetryEvent` and includes it in the union type.
packages/core/src/telemetry/session-tracing.ts	Updates LLM metadata docs and fixes `sampling_ms` derivation.
packages/core/src/telemetry/session-tracing.test.ts	Updates tests to match corrected `sampling_ms` formula.
packages/core/src/telemetry/qwen-logger/qwen-logger.ts	Adds RUM emission for API retry events.
packages/core/src/telemetry/metrics.ts	Defines and records new `qwen-code.api.retry.count` counter (tagged by model).
packages/core/src/telemetry/loggers.ts	Adds `logApiRetry` three-sink fan-out (RUM + OTel log + metric).
packages/core/src/telemetry/loggers.test.ts	Adds coverage for `logApiRetry` fan-out + `subagent_name` propagation + SDK-disabled behavior.
packages/core/src/telemetry/index.ts	Re-exports `recordApiRetry`.
packages/core/src/telemetry/constants.ts	Adds `EVENT_API_RETRY` constant.
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts	Snapshots ALS retry metadata in sync prelude and threads it to all LLM span end sites (stream + non-stream).
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts	Adds tests ensuring retry metadata propagates (and defaults) for stream/non-stream paths.
packages/core/src/core/geminiChat.ts	Wires `retryWithBackoff.onRetry` to emit `ApiRetryEvent` for Gemini streaming retries.
packages/core/src/core/client.ts	Wires `retryWithBackoff.onRetry` to emit `ApiRetryEvent` for a Gemini client LLM call site.
packages/core/src/core/baseLlmClient.ts	Wires `retryWithBackoff.onRetry` to emit `ApiRetryEvent` for two BaseLlmClient call sites.
docs/design/telemetry-llm-request-timing-design.md	Updates design doc D4 to reflect ALS-based propagation after implementation discovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

wenshao

Second opinion review (qwen3.7-max): no new issues found beyond the 9 inline comments from the previous review round. The AsyncLocalStorage propagation, streaming snapshot closure pattern, sampling_ms formula fix, and ALS isolation tests are all sound. CI green (24/24). — qwen3.7-max via Qwen Code /review

wenshao · 2026-05-25T16:58:15Z

维护者本地真实测试验证报告 (PR #4432)

在本地 worktree 中将 PR 合并到 latest main(e8b79d77) 后,使用 tmux session pr4432 完整跑了构建、类型检查、lint、单元测试,以及按 PR 描述自行复现的端到端 OTel 验证。结论是 可以合并 — 测试全绿,端到端实测的具体数值与 PR 描述完全吻合,Phase 4a sampling_ms 公式修复在真实 SDK 上得到证实。

合并/编译

步骤	结果
合并 base	`main @ e8b79d772`
合并提交	`397a139e` (merge clean, 无冲突)
`npm ci`	EXIT 0
`npm run build`	EXIT 0 (0 errors,15 与本 PR 无关的 lint warning)
`npm run typecheck` (全部 5 个 workspace)	EXIT 0
聚焦 lint(本 PR 修改的 11 个文件)	EXIT 0,无 warning/error

单元测试 (telemetry + retry + LCG,合计 20 个文件)

 Test Files  20 passed (20)
      Tests  587 passed (587)
   Duration  16.75s

逐文件覆盖(只列与本 PR 直接相关或被改动的):

文件	测试数
`src/utils/retry.test.ts`	65 ✓
`src/utils/retryContext.ts` (新增)	覆盖在 retry.test.ts 中
`src/core/loggingContentGenerator/loggingContentGenerator.test.ts`	47 ✓
`src/telemetry/loggers.test.ts`	48 ✓
`src/telemetry/session-tracing.test.ts`	70 ✓
`src/telemetry/metrics.test.ts`	40 ✓
`src/telemetry/sdk.test.ts`	55 ✓
`src/telemetry/log-to-span-processor.test.ts`	32 ✓
其余 telemetry 套件合计	226 ✓

PR 描述写的是 ~580,实测 587,且全绿。

端到端验证(真实 OTel SDK + outfile,无 mock)

按 PR 描述照搬方法,自己写了一个一次性 vitest 文件 src/utils/phase4b-e2e-verify.test.ts,真实初始化 NodeSDK + FileSpanExporter + FileLogExporter + PeriodicExportingMetricReader + FileMetricExporter,驱动 retryWithBackoff(fn, ...) 走 500 → 500 → ok 路径,onRetry 通过 logApiRetry() 发 ApiRetryEvent,最后 endLLMRequestSpan 注入 ALS 中的 attempt/requestSetupMs/retryTotalDelayMs,然后 shutdownTelemetry() 强制 flush 并解析 outfile JSON。

 ✓ src/utils/phase4b-e2e-verify.test.ts (4 tests) 147ms

四个用例分别校验:(1) outfile 里 2 条 qwen-code.api_retry LogRecord、(2) qwen-code.llm_request span 携带全部 Phase 4b 属性、(3) ALS 在三次 attempt 间正确传播、(4) span 上的 attempt/request_setup_ms/retry_total_delay_ms 等于最后一次 retryContext.getStore() 的快照。

实测从 outfile 解析出来的关键数值

qwen-code.api_retry LogRecord ×2

字段	第 1 条	第 2 条
`event.name`	`qwen-code.api_retry`	`qwen-code.api_retry`
`attempt_number`	1	2
`model`	`phase4b-test-model`	`phase4b-test-model`
`prompt_id`	`verify-prompt-1`	`verify-prompt-1`
`status_code`	500	500
`retry_delay_ms`	21.39	69.24
`_body`	`API retry attempt 1 for phase4b-test-model (status 500).`	`API retry attempt 2 for phase4b-test-model (status 500).`

qwen-code.llm_request span attributes

{
  "qwen-code.model": "phase4b-test-model",
  "gen_ai.request.model": "phase4b-test-model",
  "duration_ms": 800,
  "input_tokens": 100, "gen_ai.usage.input_tokens": 100,
  "output_tokens": 20, "gen_ai.usage.output_tokens": 20,
  "ttft_ms": 80, "gen_ai.server.time_to_first_token": 0.08,
  "request_setup_ms": 129,
  "attempt": 3,
  "retry_total_delay_ms": 90.63,
  "sampling_ms": 720,
  "output_tokens_per_second": 27.78,
  "success": true
}

qwen-code.api.retry.count Counter(经 PeriodicExportingMetricReader + FileMetricExporter 落盘)

metric: qwen-code.api.retry.count
  point: {"model": "phase4b-test-model"}  value: 2

Phase 4a `sampling_ms` 公式修复 — 实测验证

request_setup_ms = 129 是非零的(累计重试开销)。用旧公式 duration - ttft - setup 会得到 800 - 80 - 129 = 591(错);新公式 duration - ttft 得到 720(对)。实测 outfile 中确实写入了 720,且 output_tokens_per_second = 27.78 = 20 / (720 / 1000),前后自洽。如果不在本 PR 同时修这个公式,Phase 4b 的重试场景会让所有重试请求的吞吐量数据归零,所以这个 bundled fix 是必要的、不可拆的。

期望矩阵 vs 实测

检查项	期望	实测	✓
LogRecord 条数	2	2	✓
`attempt_number` 单调	1 → 2	1, 2	✓
`status_code` 透传	500	500	✓
LLM span `attempt`	3 (1 success + 2 prior)	3	✓
`request_setup_ms` ≥ `retry_total_delay_ms` (setup 包含 attempt 自身耗时)	true	129 ≥ 90.63	✓
`sampling_ms = duration - ttft`(Phase 4a 修复)	720	720	✓
旧公式会算成 591 (bug)	≠720	591 ≠ 720(已修复)	✓
`output_tokens_per_second`	27.78	27.78	✓
GenAI semconv `time_to_first_token` (秒)	0.08	0.08	✓
Counter `qwen-code.api.retry.count`	model 维度 = 2	`{model}: 2`	✓
ALS 隔离:attempt 1 setup=0, attempt 2 > 0, attempt 3 单调增	as expected	0 → 25ms → 80ms	✓

一处与 PR 描述需对照的实际行为(并非 bug)

PR 提到 "api_retry LogToSpanProcessor 桥接 span 以 sibling 形式挂在 caller active span 下"。这是 OTLP 路径 的行为(useOtlp 分支才 new LogToSpanProcessor(...));本次 outfile 模式的验证里没有桥接 span 产出,只有 LogRecord 本身写入了 outfile,这与 sdk.ts:259-342 的代码完全一致 — outfile 模式下 LogRecord 直接走 FileLogExporter,不经过 LogToSpanProcessor。这是预期行为,PR 单测 log-to-span-processor.test.ts (32 tests) 已单独覆盖该桥接逻辑。如果未来希望 outfile 模式也产出桥接 span 以方便本地观察,可在 4c 或后续 PR 里单独考虑(超出本 PR 范围)。

复现命令

git fetch origin pull/4432/head:pr-4432-test && git fetch origin main
git worktree add -b pr-4432-merge-test /tmp/pr4432-merged origin/main
cd /tmp/pr4432-merged && git merge pr-4432-test --no-edit
npm ci && npm run build && npm run typecheck

# 单元测试(20 个文件,587 tests)
cd packages/core && npx vitest run --coverage=false \
  src/telemetry/ src/utils/retry.test.ts \
  src/core/loggingContentGenerator/loggingContentGenerator.test.ts

# E2E:把验证文件 phase4b-e2e-verify.test.ts 放到 packages/core/src/utils/ 后
npx vitest run --coverage=false src/utils/phase4b-e2e-verify.test.ts

(验证用的 e2e 测试文件用完即弃,与 PR 描述一致;如有需要我可以贴完整源码到下方 thread。)

结论

✅ 与 latest main 合并干净,无冲突
✅ build / typecheck / lint 全绿
✅ 587 个单元测试全部通过
✅ 真实 OTel SDK 端到端验证:3 个 sink(LogRecord / Span / Counter)全部按预期写入 outfile,数值与 PR 描述一致
✅ Phase 4a sampling_ms 公式修复在真实场景下证实是必要的且生效
同意合并。

Fixes 6 of 9 inline review comments from wenshao + Copilot. The remaining 3 are pushback (duration_ms semantic = design intent per D5; persistent retry cap = explicitly deferred in PR description). 1. Fix JSDoc inaccuracy on `onRetry` contract (QwenLM#1+QwenLM#2): the comment incorrectly said "synchronous throws inside fn execute OUTSIDE the ALS frame." In fact fn() runs inside retryContext.run() so throws ARE inside the frame. What's outside the frame is the onRetry callback itself (it fires from the catch block). Rewritten per wenshao's suggestion: tells callers not to read retryContext.getStore() inside onRetry — all data comes via the RetryAttemptInfo parameter. 2. Add doc comment on content-retry delay inflation (QwenLM#3): retryTotalDelayMs accumulator includes content-retry delays (shouldRetryOnContent path) which don't fire onRetry. This is intentional — the LLM span attribute reports total user-perceived backoff time — but was undocumented. 3. Add signal?.aborted guard before onRetry invocations (QwenLM#6): if the abort signal fires between the catch and onRetry execution point, we now skip the callback to avoid phantom retry events that inflate the counter for retries that never actually proceeded. Applied to both persistent and normal retry paths. 4. Add persistent retry path test (status=429 + persistentMode) (QwenLM#4): the highest-volume production retry path had zero Phase 4b test coverage. Now verifies onRetry fires with monotonic attempt counter and that persistent-mode exponential backoff produces increasing delayMs. 5. Add Retry-After header path test (status=429 + retry-after: 2) (QwenLM#7): verifies that when the error carries a Retry-After header, onRetry.delayMs reflects the parsed header value (2000ms) instead of the exponential backoff calculation. 6. Add stream idle-timeout retry-attr propagation test (QwenLM#8): verifies that the closure-captured retrySnapshot reaches the setTimeout-fired endLLMRequestSpan call with correct retry context values (attempt=4, requestSetupMs=3000, retryTotalDelayMs=2500). All 186 affected tests pass (retry 68 + LCG 48 + session-tracing 70). tsc --noEmit clean. eslint clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

… in RUM (QwenLM#4432) Addresses 2 of 5 R3 review comments from wenshao (2026-05-26): 1. loggingContentGenerator.test.ts:2290 — replace `if (timeoutRecord)` guard with `expect(timeoutRecord).toBeDefined()` so the idle-timeout retry-attr test fails loudly instead of passing with 0 assertions when setTimeout doesn't fire. Also rewrote the test to use fake timers from the START (so the 5-min idle timeout is created under fake clock and can be advanced via vi.advanceTimersByTimeAsync), fixing the underlying reason it wasn't firing. 2. qwen-logger.ts:963 — add `prompt_id: event.prompt_id` to logApiRetryEvent RUM properties. Without this, RUM dashboards cannot correlate api_retry events with specific prompts, unlike the analogous logApiErrorEvent which already includes prompt_id. 165 affected tests pass. Remaining 3 R3 items (QwenLM#9 onRetry helper, QwenLM#10 error-path test coverage, QwenLM#11 caller integration assertions) deferred to follow-up PR — non-blocking refactor/test-hardening. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

wenshao

No review findings. Downgraded from Approve to Comment: CI still running. — qwen3.7-max via Qwen Code /review

wenshao · 2026-05-27T13:23:24Z

Maintainer Verification Report

Build

Result: PASS
Clean build after rm -rf packages/core/dist (stale .d.ts artifact conflict, not PR-related). Zero errors, 15 warnings (all pre-existing lint/browserslist warnings).

Tests — PR-Touched Files (Targeted)

Test File	Tests	Result
`retry.test.ts`	68	PASS
`loggingContentGenerator.test.ts`	48	PASS
`session-tracing.test.ts`	70	PASS
`loggers.test.ts`	48	PASS
Total	234	ALL PASS

Tests — Full Core Suite (Regression)

Metric	Value
Test Files	339 passed, 1 failed (340 total)
Tests	9291 passed, 1 failed, 3 skipped (9295 total)
Duration	141.69s

The single failure (anthropicContentGenerator.test.ts — User-Agent string mismatch) is pre-existing on main and unrelated to this PR.

Code Review Findings

Architecture: Clean AsyncLocalStorage-based retry context propagation, consistent with existing promptIdContext/subagentNameContext patterns.

Key observations:

retryContext.ts — Well-defined RetryAttemptContext interface with clear 1-based attempt, cumulative retryTotalDelayMs, and requestSetupMs fields. Correct use of AsyncLocalStorage.
retry.ts — iterationCount correctly decoupled from attempt variable (which gets clamped in persistent mode). onRetry callback fires in catch block OUTSIDE the ALS frame — data passed via RetryAttemptInfo parameter, not ALS read. Callback errors properly swallowed with debug logging. signal?.aborted guard prevents phantom retry events.
loggingContentGenerator.ts — snapshotRetryMetadata() called in synchronous prelude (before first await), correctly capturing ALS state before the streaming AsyncGenerator exits the retry frame. Default attempt=1 ensures warmup/side-query spans are dashboard-friendly.
session-tracing.ts — Phase 4a sampling_ms bug fix is correct: changed from duration - ttft - setup to duration - ttft. The duration_ms is computed from startLLMRequestSpan time (AFTER setup), so subtracting requestSetupMs was double-counting. Phase 4a masked this because requestSetupMs was always undefined.
4 LLM call sites (client.ts, geminiChat.ts, baseLlmClient.ts ×2) — All correctly wired with identical onRetry callback pattern emitting ApiRetryEvent with subagentNameContext.getStore().
Three-sink fan-out — QwenLogger RUM (logApiRetryEvent), OTel log signal (logger.emit → LogToSpanProcessor bridge), Counter metric (recordApiRetry tagged by model) — all present and correct.
New ApiRetryEvent type — duration_ms set equal to retry_delay_ms for LogToSpanProcessor bridge span visualization. Clear distinction from ContentRetryEvent documented in JSDoc.

No issues found. The implementation is thorough, well-documented, and matches the design doc.

Verified by: wenshao

QwenLM#3731) Adds per-attempt retry telemetry for HTTP-status retries (429/5xx) emitted by retryWithBackoff at the 4 LLM call sites. Second slice of Phase 4 (sub-issue Architectural discovery (mid-planning) -------------------------------------- The Phase 4 design doc assumed claude-code's "one LLM span owns the retry loop" pattern. Reading the 4 retryWithBackoff call sites revealed qwen-code inverts that: retryWithBackoff sits ABOVE LoggingContentGenerator. Each attempt creates a fresh LLM span. The original "in-LCG accumulator" plan wouldn't work. Resolution: propagate retry state via AsyncLocalStorage (`retryContext`). retryWithBackoff wraps each `await fn()` in `retryContext.run(...)`, and LoggingContentGenerator reads the ALS in its synchronous prelude (before the first await) and threads the snapshot into all endLLMRequestSpan callsites — success / error / idle-timeout / abort. Matches existing patterns (promptIdContext, subagentNameContext, agent-context). Plan went through 3 review rounds (Plan-agent reviews) finding 22 issues total — all addressed before implementation. Changes ------- - New retryContext.ts (AsyncLocalStorage<RetryAttemptContext>) with attempt + requestSetupMs + retryTotalDelayMs fields. Computed in retry.ts immediately before `await fn()` so values are anchored to the attempt's actual start, not derived downstream. - retry.ts: - New `onRetry?: (info: RetryAttemptInfo) => void` option on RetryOptions. Opt-in per caller: non-LLM callers stay silent. - Monotonic `iterationCount` decoupled from `attempt` (which is clamped at `maxAttempts - 1` in persistent mode). Always reflects "this is the Nth fn() call" — no flip-flopping for mixed-error sequences. - retryContext.run wrap around fn() so LCG can read the ALS. - onRetry invocations wrapped in try/catch: telemetry exceptions never break the retry loop (logged via debugLogger). - logRetryAttempt debug log line KEPT — useful when OTel SDK isn't wired up (local CLI debugging, integration tests, early-startup errors). - ApiRetryEvent telemetry event class (types.ts) with model + promptId + attempt_number + error fields + subagent_name. JSDoc cross-references ContentRetryEvent (they cover different retry budgets — HTTP-status vs invalid-stream — and can both fire for one prompt). - logApiRetry function in loggers.ts — three-sink fan-out matching logContentRetry: QwenLogger RUM, OTel log signal (bridged via LogToSpanProcessor), recordApiRetry metric counter. - recordApiRetry metric (metrics.ts) — `qwen-code.api.retry.count` Counter tagged with {model}. Full COUNTER_DEFINITIONS entry + initialization + recording function + index.ts export. - qwen-logger.ts adds logApiRetryEvent for RUM consistency. - 4 LLM caller wiring sites (client.ts, baseLlmClient.ts x2, geminiChat.ts) opt in with onRetry callback that emits ApiRetryEvent with subagentName from subagentNameContext.getStore(). - LoggingContentGenerator: snapshotRetryMetadata() helper called in the SYNCHRONOUS prelude of generateContent / generateContentStream — only point where retryContext is guaranteed active for the streaming path (the returned AsyncGenerator is iterated AFTER retryWithBackoff resolves). Snapshot threaded as parameter to loggingStreamWrapper so every endLLMRequestSpan callsite (success / error / idle-timeout / abort) sees the same values. `attempt` defaults to 1 when no retry context is present (warmup, side-queries, direct calls) so dashboards filtering WHERE attempt=1 include those. Bundled Phase 4a bug fix (sampling_ms formula) ----------------------------------------------- Phase 4a's `sampling_ms = duration_ms - ttft_ms - (requestSetupMs ?? 0)` was silently wrong. `duration_ms` only covers `ttft + sampling` for the span (startTime is captured when startLLMRequestSpan runs, AFTER any setup phase). Subtracting setup again is double-counting. Phase 4a masked the bug because requestSetupMs was always undefined → 0. Phase 4b populates requestSetupMs with cumulative retry overhead — without this fix, sampling_ms would clamp to 0 for every retried request, wiping output-throughput data exactly when operators need it most. Fix: `sampling_ms = duration_ms - ttft_ms` (drop the setup subtraction). Phase 4a tests updated accordingly: 1 test rewritten to use inputs that actually exercise the clamp under the new formula (ttft > duration = clock skew); 1 test renamed to assert the FIX (setup is NOT subtracted). Out of scope (deferred, noted in PR description) ------------------------------------------------ - Persistent retry mode emission cap (50+ events under QWEN_CODE_UNATTENDED_RETRY). Aggregated attempt/retry_total_delay_ms remain accurate regardless. - SDK-internal retries (openai/google-genai maxRetries=3) remain invisible — operator awareness only. - Stream-iteration errors (mid-stream network drop during for-await) bypass retryWithBackoff entirely. Pre-existing behavior, not a Phase 4b regression. - shouldRetryOnContent content-retry path (retry.ts:184-193) skips onRetry. No caller uses this path today — code path is dead. Tests ----- - retry.test.ts: 9 new cases (monotonic counter, requestSetupMs growth, first-try success, onRetry callback contract, absent-callback silence, callback-throws resilience, shouldRetryOnError mid-loop giveup, parallel-call ALS isolation, nested-retry inner-frame read). - loggers.test.ts: 3 new cases (3-sink fan-out, subagent_name propagation, SDK-not-initialized path). - loggingContentGenerator.test.ts: 4 new cases (non-stream ALS propagation, non-stream default attempt=1, stream ALS propagation through wrapper closure, stream default attempt=1). - session-tracing.test.ts: 1 test rewritten + 1 renamed for the sampling_ms fix. All 580 telemetry + retry + LCG tests pass. tsc --noEmit clean. eslint clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

Fixes 6 of 9 inline review comments from wenshao + Copilot. The remaining 3 are pushback (duration_ms semantic = design intent per D5; persistent retry cap = explicitly deferred in PR description). 1. Fix JSDoc inaccuracy on `onRetry` contract (QwenLM#1+QwenLM#2): the comment incorrectly said "synchronous throws inside fn execute OUTSIDE the ALS frame." In fact fn() runs inside retryContext.run() so throws ARE inside the frame. What's outside the frame is the onRetry callback itself (it fires from the catch block). Rewritten per wenshao's suggestion: tells callers not to read retryContext.getStore() inside onRetry — all data comes via the RetryAttemptInfo parameter. 2. Add doc comment on content-retry delay inflation (QwenLM#3): retryTotalDelayMs accumulator includes content-retry delays (shouldRetryOnContent path) which don't fire onRetry. This is intentional — the LLM span attribute reports total user-perceived backoff time — but was undocumented. 3. Add signal?.aborted guard before onRetry invocations (QwenLM#6): if the abort signal fires between the catch and onRetry execution point, we now skip the callback to avoid phantom retry events that inflate the counter for retries that never actually proceeded. Applied to both persistent and normal retry paths. 4. Add persistent retry path test (status=429 + persistentMode) (QwenLM#4): the highest-volume production retry path had zero Phase 4b test coverage. Now verifies onRetry fires with monotonic attempt counter and that persistent-mode exponential backoff produces increasing delayMs. 5. Add Retry-After header path test (status=429 + retry-after: 2) (QwenLM#7): verifies that when the error carries a Retry-After header, onRetry.delayMs reflects the parsed header value (2000ms) instead of the exponential backoff calculation. 6. Add stream idle-timeout retry-attr propagation test (QwenLM#8): verifies that the closure-captured retrySnapshot reaches the setTimeout-fired endLLMRequestSpan call with correct retry context values (attempt=4, requestSetupMs=3000, retryTotalDelayMs=2500). All 186 affected tests pass (retry 68 + LCG 48 + session-tracing 70). tsc --noEmit clean. eslint clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

… in RUM (QwenLM#4432) Addresses 2 of 5 R3 review comments from wenshao (2026-05-26): 1. loggingContentGenerator.test.ts:2290 — replace `if (timeoutRecord)` guard with `expect(timeoutRecord).toBeDefined()` so the idle-timeout retry-attr test fails loudly instead of passing with 0 assertions when setTimeout doesn't fire. Also rewrote the test to use fake timers from the START (so the 5-min idle timeout is created under fake clock and can be advanced via vi.advanceTimersByTimeAsync), fixing the underlying reason it wasn't firing. 2. qwen-logger.ts:963 — add `prompt_id: event.prompt_id` to logApiRetryEvent RUM properties. Without this, RUM dashboards cannot correlate api_retry events with specific prompts, unlike the analogous logApiErrorEvent which already includes prompt_id. 165 affected tests pass. Remaining 3 R3 items (QwenLM#9 onRetry helper, QwenLM#10 error-path test coverage, QwenLM#11 caller integration assertions) deferred to follow-up PR — non-blocking refactor/test-hardening. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

yiliang114

LGTM. The ALS-based retry context propagation is clean and matches existing patterns (promptIdContext, subagentNameContext). The sampling_ms formula fix is correct — duration_ms already excludes setup time since startTime is captured after startLLMRequestSpan. All prior review comments are addressed or explicitly deferred as non-blocking.

The 3 follow-up rounds of review (22 findings total) and the author's responses look thorough. Test coverage is solid — including the idle-timeout path and parallel ALS isolation.

Copilot AI review requested due to automatic review settings May 22, 2026 07:28

Copilot started reviewing on behalf of doudouOUC May 22, 2026 07:28 View session

Copilot AI reviewed May 22, 2026

View reviewed changes

Comment thread packages/core/src/utils/retry.ts Outdated

wenshao reviewed May 22, 2026

View reviewed changes

Comment thread packages/core/src/utils/retry.ts Outdated

Comment thread packages/core/src/telemetry/types.ts

Comment thread packages/core/src/utils/retry.ts

Comment thread packages/core/src/utils/retry.test.ts

wenshao reviewed May 22, 2026

View reviewed changes

Comment thread packages/core/src/utils/retry.ts

Comment thread packages/core/src/utils/retry.ts

Comment thread packages/core/src/utils/retry.ts

Comment thread packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts

github-actions Bot mentioned this pull request May 23, 2026

📊 AI CLI 工具社区动态日报 2026-05-23 gsscsd/big_model_radar#379

Open

wenshao previously approved these changes May 23, 2026

View reviewed changes

doudouOUC self-assigned this May 25, 2026

doudouOUC requested review from LaZzyMan and pomelo-nwu May 25, 2026 07:08

doudouOUC dismissed wenshao’s stale review via 7374976 May 26, 2026 17:43

wenshao reviewed May 26, 2026

View reviewed changes

wenshao previously approved these changes May 27, 2026

View reviewed changes

wenshao reviewed May 27, 2026

View reviewed changes

doudouOUC mentioned this pull request May 31, 2026

feat(telemetry): harden OpenTelemetry configuration, HTTP OTLP behavior, and runtime safety #3731

Open

53 tasks

doudouOUC requested a review from tanzhenxin June 2, 2026 08:50

doudouOUC added 3 commits June 3, 2026 18:36

doudouOUC dismissed wenshao’s stale review via 5458a9e June 3, 2026 10:36

doudouOUC force-pushed the feat/telemetry-phase-4b-retry-visibility branch from 479bbf3 to 5458a9e Compare June 3, 2026 10:36

doudouOUC requested a review from wenshao June 3, 2026 11:10

wenshao approved these changes Jun 3, 2026

View reviewed changes

yiliang114 approved these changes Jun 5, 2026

View reviewed changes

doudouOUC merged commit b3fa135 into QwenLM:main Jun 5, 2026
10 checks passed

doudouOUC deleted the feat/telemetry-phase-4b-retry-visibility branch June 5, 2026 05:45

Conversation

doudouOUC commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Architectural discovery (mid-planning)

Bundled Phase 4a bug fix (sampling_ms formula)

Closes

Design doc

Implementation overview

PR split

Test plan

End-to-end verification (real OTel SDK + outfile)

Method

Inputs

Outputs (verbatim from outfile, formatting collapsed)

Expectations matrix

Reproduction

Out of scope (deferred / pre-existing)

ApiRetryEvent vs ContentRetryEvent coexistence

Bridge span parentage

Uh oh!

github-actions Bot commented May 22, 2026

📋 Review Summary

🔍 General Feedback

🎯 Specific Feedback

🟡 High

🟢 Medium

🔵 Low

✅ Highlights

⚠️ Pre-Merge Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

wenshao commented May 25, 2026

维护者本地真实测试验证报告 (PR #4432)

合并/编译

单元测试 (telemetry + retry + LCG,合计 20 个文件)

端到端验证(真实 OTel SDK + outfile,无 mock)

实测从 outfile 解析出来的关键数值

Phase 4a sampling_ms 公式修复 — 实测验证

期望矩阵 vs 实测

一处与 PR 描述需对照的实际行为(并非 bug)

复现命令

结论

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wenshao left a comment

Choose a reason for hiding this comment

Uh oh!

wenshao commented May 27, 2026

Maintainer Verification Report

Build

Tests — PR-Touched Files (Targeted)

Tests — Full Core Suite (Regression)

Code Review Findings

Uh oh!

yiliang114 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

doudouOUC commented May 22, 2026 •

edited

Loading

`ApiRetryEvent` vs `ContentRetryEvent` coexistence

Phase 4a `sampling_ms` 公式修复 — 实测验证