Skip to content

feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731)#4432

Merged
doudouOUC merged 3 commits into
QwenLM:mainfrom
doudouOUC:feat/telemetry-phase-4b-retry-visibility
Jun 5, 2026
Merged

feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731)#4432
doudouOUC merged 3 commits into
QwenLM:mainfrom
doudouOUC:feat/telemetry-phase-4b-retry-visibility

Conversation

@doudouOUC

@doudouOUC doudouOUC commented May 22, 2026

Copy link
Copy Markdown
Collaborator

Summary

Phase 4b of P3 LLM request timing decomposition (#3731 / #4413). Adds per-attempt HTTP-status retry telemetry so operators can see retry behavior in traces, logs, and metrics — previously the retryWithBackoff path at the 4 LLM call sites was completely invisible to telemetry (only debugLogger.warn).

After Phase 4b, a request that goes 500 → 429 → success produces:

  • 3 separate qwen-code.llm_request spans (each with its own duration_ms / ttft_ms / error / status code).
  • 2 qwen-code.api_retry LogToSpanProcessor bridge spans sibling-parented under the caller's active span (interaction or tool).
  • 2 qwen-code.api_retry LogRecords (OTel log signal + QwenLogger RUM).
  • 2 increments on the new qwen-code.api.retry.count counter, tagged {model}.
  • The successful attempt's LLM span carries attempt: 3, request_setup_ms: ~1200 (cumulative retry overhead), retry_total_delay_ms: ~1500.

Architectural discovery (mid-planning)

The Phase 4 design doc assumed claude-code's "one LLM span owns the retry loop" pattern. Reading the 4 retry call sites carefully revealed qwen-code inverts that: retryWithBackoff sits above LoggingContentGenerator. Each retry attempt creates a fresh LLM span. The original "in-LCG accumulator" plan wouldn't work.

Resolution: propagate retry state via AsyncLocalStorage (retryContext), matching the existing promptIdContext / subagentNameContext / agent-context patterns. retryWithBackoff wraps each await fn() in retryContext.run(...); LoggingContentGenerator reads the ALS in its synchronous prelude (before the first await) and threads the snapshot to all endLLMRequestSpan callsites (success / error / idle-timeout / abort).

The plan went through 3 review rounds by Plan-agents finding 22 issues total — all addressed before implementation. Design doc updated to reflect the ALS-based approach (D4 has a "Phase 4b update" subsection at the top).

Bundled Phase 4a bug fix (sampling_ms formula)

Phase 4a's sampling_ms = duration_ms - ttft_ms - (requestSetupMs ?? 0) was silently wrong: duration_ms only covers ttft + sampling for the span (startTime is captured when startLLMRequestSpan runs, AFTER any setup phase), so subtracting setup again is double-counting. Phase 4a masked the bug because requestSetupMs was always undefined → 0. Phase 4b populates requestSetupMs with cumulative retry overhead — without this fix, sampling_ms would clamp to 0 for every retried request, wiping output-throughput data exactly when operators need it most.

Fix: sampling_ms = duration_ms - ttft_ms (drop the setup subtraction). Phase 4a tests updated.

Closes

Design doc

docs/design/telemetry-llm-request-timing-design.md — D4 section updated with the Phase 4b discovery + ALS resolution.

Implementation overview

New (packages/core/src/utils/retryContext.ts): AsyncLocalStorage<RetryAttemptContext> with attempt / requestSetupMs / retryTotalDelayMs fields. Mirrors promptIdContext.ts style.

retry.ts:

  • onRetry?: (info: RetryAttemptInfo) => void option on RetryOptions (opt-in per caller).
  • Monotonic iterationCount decoupled from attempt (which gets clamped at maxAttempts - 1 in persistent mode). No flip-flopping for mixed-error sequences.
  • retryContext.run wrap around fn().
  • onRetry invocations wrapped in try/catch — telemetry exceptions never break the retry loop.
  • Keeps the existing logRetryAttempt debug log line (useful when OTel SDK isn't wired up).

types.ts: new ApiRetryEvent class with subagent_name field. JSDoc cross-references ContentRetryEvent (they cover different retry budgets and can both fire for one prompt).

loggers.ts: new logApiRetry function — three-sink fan-out matching logContentRetry: QwenLogger RUM, OTel logger.emit (bridged via LogToSpanProcessor), recordApiRetry Counter.

metrics.ts: new qwen-code.api.retry.count Counter tagged with {model}. Full COUNTER_DEFINITIONS entry + initialization + recording function + index.ts export.

qwen-logger.ts: logApiRetryEvent for RUM consistency.

4 LLM caller wiring sites: client.ts, baseLlmClient.ts (×2), geminiChat.ts — each opts in with onRetry callback that emits ApiRetryEvent with subagentName from subagentNameContext.getStore().

LoggingContentGenerator: snapshotRetryMetadata() helper called in the synchronous prelude of generateContent / generateContentStream — only point where retryContext is guaranteed active for the streaming path (returned AsyncGenerator is iterated AFTER retryWithBackoff resolves). Snapshot threaded as a parameter to loggingStreamWrapper. attempt defaults to 1 when no retry context (warmup / side-queries / direct calls).

session-tracing.ts: Phase 4a sampling_ms formula fix + LLMRequestMetadata.requestSetupMs / attempt JSDoc updates.

PR split

This is Phase 4b (~500 LOC production + ~520 LOC tests). Phase 4a (#4417, merged) shipped TTFT + GenAI semconv dual-emit. Phase 4c (next PR) — recordApiRequestBreakdown() activation for the 3 of 4 ApiRequestPhase values that qwen-code has real data for.

Test plan

  • retry.test.ts +9 cases: monotonic counter, requestSetupMs growth across attempts, first-try success (no telemetry), onRetry callback contract, absent-callback silence, callback-throws resilience, shouldRetryOnError mid-loop giveup, parallel-call ALS isolation, nested-retry inner-frame read.
  • loggers.test.ts +3 cases: 3-sink fan-out, subagent_name propagation, SDK-not-initialized path.
  • loggingContentGenerator.test.ts +4 cases: non-stream ALS propagation, non-stream default attempt=1, stream ALS propagation through wrapper closure, stream default attempt=1.
  • session-tracing.test.ts updates: 1 test rewritten + 1 renamed for the sampling_ms formula fix.
  • All 580 telemetry + retry + LCG tests pass
  • tsc --noEmit clean
  • eslint clean

End-to-end verification (real OTel SDK + outfile)

Beyond unit tests with mocked SDK, drove the full retry → telemetry → exporter chain with the real OTel SDK initialised in outfile mode to confirm the wiring works against actual @opentelemetry/sdk-* packages, not just mocks.

Method

One-shot vitest file under packages/core/src/utils/:

  1. Construct a real Config with telemetry.enabled=true, target='local', outfile=<tmp>.
  2. Call initializeTelemetry(config) — boots the real OTel SDK with FileSpanExporter + FileLogExporter.
  3. Open an LLM span via startLLMRequestSpan.
  4. Drive a contrived retry: retryWithBackoff(fn, opts) where fn throws status=500 for the first 2 calls then returns 'ok' on the 3rd.
  5. onRetry callback emits ApiRetryEvent via logApiRetry(config, ...) for each failure.
  6. Inside fn, snapshot retryContext.getStore() to verify ALS values flow correctly.
  7. Close the LLM span via endLLMRequestSpan with the final attempt's retry context fields.
  8. shutdownTelemetry() to force-flush exporters.
  9. Read the outfile and assert the expected events + span attributes.

Inputs

Field Value
Stream model phase4b-test-model
prompt_id verify-prompt-1
Failures before success 2 (status=500 each)
initialDelayMs / maxDelayMs 30ms / 100ms
maxAttempts 5
LLM span fields inputTokens=100, outputTokens=20, ttftMs=80, durationMs=800

Outputs (verbatim from outfile, formatting collapsed)

qwen-code.api_retry LogRecord 1:

{
  "session.id": "phase4b-verify-session",
  "event.name": "qwen-code.api_retry",
  "model": "phase4b-test-model",
  "prompt_id": "verify-prompt-1",
  "attempt_number": 1,
  "error_type": "Error",
  "error_message": "Simulated 500 #1",
  "status_code": 500,
  "retry_delay_ms": 28.80,
  "duration_ms": 28.80
}
// _body: "API retry attempt 1 for phase4b-test-model (status 500)."

qwen-code.api_retry LogRecord 2: same shape with attempt_number=2, retry_delay_ms=50.17.

qwen-code.llm_request span attributes:

{
  "qwen-code.model": "phase4b-test-model",
  "gen_ai.request.model": "phase4b-test-model",
  "duration_ms": 800,
  "input_tokens": 100,
  "gen_ai.usage.input_tokens": 100,
  "output_tokens": 20,
  "gen_ai.usage.output_tokens": 20,
  "ttft_ms": 80,
  "gen_ai.server.time_to_first_token": 0.08,
  "request_setup_ms": 81,
  "attempt": 3,
  "retry_total_delay_ms": 78.97,
  "sampling_ms": 720,
  "output_tokens_per_second": 27.78,
  "success": true
}

Expectations matrix

Check Expected Actual
ApiRetryEvent record count 2 2
attempt_number monotonic 1 → 2 1, 2
status_code propagated 500 500
retry_delay_ms ≈ jittered backoff 21–39 / 38–62 28.80 / 50.17
LLM span attempt 3 (1 success + 2 prior failures) 3
LLM span retry_total_delay_ms ≈ sum of backoffs ~78.97 78.97
LLM span request_setup_msretry_total_delay_ms (includes attempt durations) > 78.97 81
LLM span sampling_ms (Phase 4a fix: duration - ttft, NOT ... - setup) 800 - 80 = 720 720
output_tokens_per_second 20 / (720/1000) = 27.78 27.78
GenAI semconv dual-emit gen_ai.server.time_to_first_token (seconds) 0.08 0.08
3-sink fan-out via logApiRetry OTel log + counter + RUM log+counter visible in outfile, RUM via QwenLogger
ALS isolation (each attempt sees its own ctx) attempt 1: setup=0, total_delay=0; attempt 2: total_delay > 0; attempt 3: total_delay > attempt-2 as expected

The Phase 4a sampling_ms formula fix is particularly important to call out: under the old (buggy) formula duration - ttft - setup, Phase 4b's non-zero request_setup_ms=81 would have produced sampling_ms = 800 - 80 - 81 = 639 (wrong) instead of the correct 720. Verified the fix produces the right value live.

Reproduction

The verification file was throwaway and removed after running. To re-run:

cd packages/core
# put the verification file back at src/utils/phase4b-e2e-verify.test.ts
npx vitest run src/utils/phase4b-e2e-verify.test.ts

Outfile written to /var/folders/.../telemetry.jsonl (test mirrors a copy to /tmp/phase4b-e2e-output.jsonl for inspection).

Out of scope (deferred / pre-existing)

  • Persistent retry mode emission cap (50+ ApiRetryEvent records under QWEN_CODE_UNATTENDED_RETRY). Aggregated attempt / retry_total_delay_ms on the LLM span remain accurate regardless.
  • SDK-internal retries (openai/google-genai maxRetries=3) remain invisible — operator awareness.
  • Stream-iteration errors (mid-stream network drop during for await) bypass retryWithBackoff entirely. Pre-existing behavior, not a Phase 4b regression.
  • shouldRetryOnContent content-retry path (retry.ts:184-193) skips onRetry. No caller in the codebase uses this path today — code path is dead. JSDoc documents the contract.

ApiRetryEvent vs ContentRetryEvent coexistence

Cleanly partitioned by failure mode — operators summing "total retry count for a prompt" must sum both event streams:

  • ContentRetryEvent (existing) — emitted by geminiChat's for-loop for InvalidStreamError retries via INVALID_CONTENT_RETRY_OPTIONS (NOT through retryWithBackoff).
  • ApiRetryEvent (new in 4b) — emitted by retryWithBackoff's onRetry callback for HTTP 429 / 5xx retries.

ApiRetryEvent JSDoc explicitly references ContentRetryEvent to document this.

Bridge span parentage

onRetry fires from inside retry.ts's catch block — at that moment, the failed attempt's LLM span has already ended, and the next attempt's span hasn't started. LogToSpanProcessor-bridged qwen-code.api_retry spans therefore parent to the caller's active span (typically interaction or tool), appearing as siblings of the failed LLM span, not children. This is by design and produces a coherent timeline: failed-LLM-span → api_retry-bridge-sibling → next-LLM-span reads correctly in trace viewers.

Copilot AI review requested due to automatic review settings May 22, 2026 07:28
@github-actions

Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR (Phase 4b) adds per-attempt HTTP-status retry telemetry to make retry behavior visible in traces, logs, and metrics. The implementation uses AsyncLocalStorage to propagate retry state from retryWithBackoff down to LoggingContentGenerator, enabling each per-attempt LLM span to carry meaningful attempt, request_setup_ms, and retry_total_delay_ms attributes. Additionally, it fixes a Phase 4a bug where sampling_ms calculation would incorrectly subtract requestSetupMs, which would clamp sampling_ms to 0 for retried requests.

🔍 General Feedback

Positive aspects:

  • Excellent architectural discovery process - the PR description documents the mid-planning discovery that qwen-code's architecture differs from claude-code's, and the ALS-based solution is well-reasoned
  • Strong test coverage with comprehensive edge cases (parallel calls, nested retries, callback failures)
  • Follows existing patterns in the codebase (promptIdContext, subagentNameContext, agent-context)
  • The onRetry callback is opt-in, keeping retry.ts decoupled from telemetry concerns
  • Defensive programming: callback errors are caught and logged, never breaking the retry loop

Architectural decisions:

  • Using ALS for retry context propagation is the right choice given the architecture (retry layer sits above LoggingContentGenerator)
  • Snapshotting retry context in synchronous prelude before first await is critical for streaming path - well understood and documented
  • Decoupling iterationCount from attempt variable handles persistent retry mode correctly

🎯 Specific Feedback

🟡 High

  1. File: packages/core/src/telemetry/session-tracing.ts:448 - The sampling_ms formula fix is correct, but there's a subtle issue: the comment says "requestSetupMs defaults to 0 when undefined" but the code uses (metadata.requestSetupMs ?? 0). This is correct for Phase 4b, but consider whether this subtraction should happen at all. The PR description states the fix is sampling_ms = duration_ms - ttft_ms (drop setup subtraction entirely), but the code still subtracts requestSetupMs. Recommendation: Clarify whether requestSetupMs should be subtracted or not. If the intent is that duration_ms only covers ttft + sampling, then subtracting requestSetupMs is correct for Phase 4b. But verify this matches the design doc.

  2. File: packages/core/src/utils/retry.ts:224-227 - The retryContext.run() wraps fn(), but there's a subtle timing issue: requestSetupMs is captured immediately before retryContext.run(), but iterationCount is incremented at loop entry. If fn() is synchronous and throws immediately, the onRetry callback fires with the correct context. However, if there's any async setup before the first await in fn(), that time gets attributed to requestSetupMs for the next attempt. Recommendation: Add a comment clarifying that requestSetupMs includes any async setup time in fn() before the first await, or consider whether this is the intended behavior.

🟢 Medium

  1. File: packages/core/src/utils/retryContext.ts:17-38 - The interface documentation is excellent, but consider adding a usage example in the JSDoc showing the typical pattern:

    // In retryWithBackoff:
    await retryContext.run({ attempt, requestSetupMs, retryTotalDelayMs }, fn);
    
    // In LoggingContentGenerator:
    const ctx = retryContext.getStore(); // May be undefined outside retry loop

    This would help future maintainers understand the intended usage pattern.

  2. File: packages/core/src/telemetry/types.ts - The ApiRetryEvent class is well-documented with the JSDoc explaining the difference from ContentRetryEvent. However, consider adding a readonly modifier to the 'event.name' and 'event.timestamp' fields for consistency with other event classes in the file (e.g., ToolOutputTruncatedEvent at line 695 uses readonly).

  3. File: packages/core/src/utils/retry.ts:322-330 - The onRetry callback invocation in the persistent retry path wraps the callback in try/catch, which is correct. However, the debug log uses debugLogger.warn() - consider whether this should be debugLogger.log() instead, since a telemetry callback failure is not necessarily a warning condition (it's expected that telemetry might fail silently). The current approach is defensive, but the log level might generate false positives in monitoring.

🔵 Low

  1. File: packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts:89-92 - The snapshotRetryMetadata() function comment mentions "Phase 4b" but doesn't explain why the synchronous prelude timing is critical. Consider adding a brief note: "Must be called synchronously before first await - ALS context is only active during retryWithBackoff's execution."

  2. File: packages/core/src/utils/retry.ts:215-218 - The comment about iterationCount being decoupled from attempt is helpful, but consider adding a reference to the specific line where attempt gets clamped (attempt = maxAttempts - 1 at line ~344) so readers can quickly verify the claim.

  3. File: packages/core/src/telemetry/types.ts:985-987 - The subagent_name field comment says "Read from subagentNameContext.getStore() at the caller site". Consider making this more explicit: "Populated by caller from subagentNameContext.getStore() - this event class does not read the context directly to maintain separation of concerns."

  4. File: packages/core/src/core/baseLlmClient.ts:248-260 - The onRetry callback captures requestModel and promptId from outer scope. Consider adding a comment explaining that these values remain stable across retry attempts (they don't change mid-retry), so capturing them in the callback closure is correct.

  5. File: packages/core/src/telemetry/loggers.ts - The PR diff shows logApiRetry function should be added, but I don't see it in the current codebase. Ensure this function is implemented following the pattern of logContentRetry (lines 715-735) with:

    • QwenLogger event emission
    • OTel log record with appropriate attributes
    • Metric counter increment (qwen-code.api.retry.count)

✅ Highlights

  • Excellent test coverage: The retry context tests cover monotonic counters, parallel calls, nested retries, callback failures, and edge cases like shouldRetryOnError returning false mid-loop
  • Strong documentation: The PR description, design doc updates, and inline comments all clearly explain the architectural discovery and resolution
  • Defensive error handling: Callback errors are caught and logged without breaking the retry loop
  • Consistent with existing patterns: The ALS approach matches promptIdContext, subagentNameContext, and agent-context patterns already in the codebase
  • Phase 4a bug fix: The sampling_ms formula fix is critical and well-motivated - without it, retried requests would have sampling_ms clamped to 0, losing output-throughput data exactly when operators need it most
  • Opt-in telemetry: The onRetry callback is optional, keeping retry.ts decoupled from telemetry concerns and allowing non-LLM callers to remain silent

⚠️ Pre-Merge Checklist

Before merging, verify:

  1. logApiRetry function exists in packages/core/src/telemetry/loggers.ts and follows the pattern of logContentRetry
  2. ApiRetryEvent is exported from packages/core/src/telemetry/types.ts and included in the TelemetryEvent union type
  3. EVENT_API_RETRY constant is defined in packages/core/src/telemetry/constants.ts
  4. API_RETRY_COUNT metric is defined in packages/core/src/telemetry/metrics.ts with the counter registered
  5. All four LLM call sites have the onRetry callback wired up:
    • client.ts:2118-2135 (GeminiClient)
    • baseLlmClient.ts:248-260 (two call sites)
    • geminiChat.ts:2057-2070
  6. TypeScript compilation passes with no new errors
  7. Tests pass including the new Phase 4b tests in retry.test.ts and loggingContentGenerator.test.ts

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Phase 4b retry observability for LLM requests by propagating per-attempt retry state via AsyncLocalStorage and emitting a new api_retry telemetry event + metric, while also fixing a Phase 4a sampling_ms derivation bug.

Changes:

  • Introduces retryContext (ALS) and wires retryWithBackoff to populate per-attempt retry metadata and optionally invoke an onRetry callback.
  • Adds ApiRetryEvent + logApiRetry (RUM + OTel log + counter) and wires the 4 LLM retry call sites to emit per-attempt retry telemetry with subagent_name.
  • Fixes sampling_ms calculation in endLLMRequestSpan to avoid double-subtracting setup time; updates unit tests accordingly.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
packages/core/src/utils/retryContext.ts Adds ALS container/type for per-attempt retry metadata propagation.
packages/core/src/utils/retry.ts Adds onRetry callback, ALS propagation, and accumulators for retry delay/setup timing.
packages/core/src/utils/retry.test.ts Adds unit coverage for ALS context propagation + onRetry behavior and monotonic attempt counting.
packages/core/src/telemetry/types.ts Adds ApiRetryEvent and includes it in the union type.
packages/core/src/telemetry/session-tracing.ts Updates LLM metadata docs and fixes sampling_ms derivation.
packages/core/src/telemetry/session-tracing.test.ts Updates tests to match corrected sampling_ms formula.
packages/core/src/telemetry/qwen-logger/qwen-logger.ts Adds RUM emission for API retry events.
packages/core/src/telemetry/metrics.ts Defines and records new qwen-code.api.retry.count counter (tagged by model).
packages/core/src/telemetry/loggers.ts Adds logApiRetry three-sink fan-out (RUM + OTel log + metric).
packages/core/src/telemetry/loggers.test.ts Adds coverage for logApiRetry fan-out + subagent_name propagation + SDK-disabled behavior.
packages/core/src/telemetry/index.ts Re-exports recordApiRetry.
packages/core/src/telemetry/constants.ts Adds EVENT_API_RETRY constant.
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts Snapshots ALS retry metadata in sync prelude and threads it to all LLM span end sites (stream + non-stream).
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts Adds tests ensuring retry metadata propagates (and defaults) for stream/non-stream paths.
packages/core/src/core/geminiChat.ts Wires retryWithBackoff.onRetry to emit ApiRetryEvent for Gemini streaming retries.
packages/core/src/core/client.ts Wires retryWithBackoff.onRetry to emit ApiRetryEvent for a Gemini client LLM call site.
packages/core/src/core/baseLlmClient.ts Wires retryWithBackoff.onRetry to emit ApiRetryEvent for two BaseLlmClient call sites.
docs/design/telemetry-llm-request-timing-design.md Updates design doc D4 to reflect ALS-based propagation after implementation discovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/utils/retry.ts Outdated
Comment thread packages/core/src/utils/retry.ts Outdated
Comment thread packages/core/src/telemetry/types.ts
Comment thread packages/core/src/utils/retry.ts
Comment thread packages/core/src/utils/retry.test.ts
Comment thread packages/core/src/utils/retry.ts
Comment thread packages/core/src/utils/retry.ts
Comment thread packages/core/src/utils/retry.ts
wenshao
wenshao previously approved these changes May 23, 2026

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second opinion review (qwen3.7-max): no new issues found beyond the 9 inline comments from the previous review round. The AsyncLocalStorage propagation, streaming snapshot closure pattern, sampling_ms formula fix, and ALS isolation tests are all sound. CI green (24/24). — qwen3.7-max via Qwen Code /review

@doudouOUC doudouOUC self-assigned this May 25, 2026
@doudouOUC doudouOUC requested review from LaZzyMan and pomelo-nwu May 25, 2026 07:08
@wenshao

wenshao commented May 25, 2026

Copy link
Copy Markdown
Collaborator

维护者本地真实测试验证报告 (PR #4432)

在本地 worktree 中将 PR 合并到 latest main(e8b79d77) 后,使用 tmux session pr4432 完整跑了构建、类型检查、lint、单元测试,以及按 PR 描述自行复现的端到端 OTel 验证。结论是 可以合并 — 测试全绿,端到端实测的具体数值与 PR 描述完全吻合,Phase 4a sampling_ms 公式修复在真实 SDK 上得到证实。

合并/编译

步骤 结果
合并 base main @ e8b79d772
合并提交 397a139e (merge clean, 无冲突)
npm ci EXIT 0
npm run build EXIT 0 (0 errors,15 与本 PR 无关的 lint warning)
npm run typecheck (全部 5 个 workspace) EXIT 0
聚焦 lint(本 PR 修改的 11 个文件) EXIT 0,无 warning/error

单元测试 (telemetry + retry + LCG,合计 20 个文件)

 Test Files  20 passed (20)
      Tests  587 passed (587)
   Duration  16.75s

逐文件覆盖(只列与本 PR 直接相关或被改动的):

文件 测试数
src/utils/retry.test.ts 65 ✓
src/utils/retryContext.ts (新增) 覆盖在 retry.test.ts 中
src/core/loggingContentGenerator/loggingContentGenerator.test.ts 47 ✓
src/telemetry/loggers.test.ts 48 ✓
src/telemetry/session-tracing.test.ts 70 ✓
src/telemetry/metrics.test.ts 40 ✓
src/telemetry/sdk.test.ts 55 ✓
src/telemetry/log-to-span-processor.test.ts 32 ✓
其余 telemetry 套件合计 226 ✓

PR 描述写的是 ~580,实测 587,且全绿。

端到端验证(真实 OTel SDK + outfile,无 mock)

按 PR 描述照搬方法,自己写了一个一次性 vitest 文件 src/utils/phase4b-e2e-verify.test.ts,真实初始化 NodeSDK + FileSpanExporter + FileLogExporter + PeriodicExportingMetricReader + FileMetricExporter,驱动 retryWithBackoff(fn, ...)500 → 500 → ok 路径,onRetry 通过 logApiRetry()ApiRetryEvent,最后 endLLMRequestSpan 注入 ALS 中的 attempt/requestSetupMs/retryTotalDelayMs,然后 shutdownTelemetry() 强制 flush 并解析 outfile JSON。

 ✓ src/utils/phase4b-e2e-verify.test.ts (4 tests) 147ms

四个用例分别校验:(1) outfile 里 2 条 qwen-code.api_retry LogRecord、(2) qwen-code.llm_request span 携带全部 Phase 4b 属性、(3) ALS 在三次 attempt 间正确传播、(4) span 上的 attempt/request_setup_ms/retry_total_delay_ms 等于最后一次 retryContext.getStore() 的快照。

实测从 outfile 解析出来的关键数值

qwen-code.api_retry LogRecord ×2

字段 第 1 条 第 2 条
event.name qwen-code.api_retry qwen-code.api_retry
attempt_number 1 2
model phase4b-test-model phase4b-test-model
prompt_id verify-prompt-1 verify-prompt-1
status_code 500 500
retry_delay_ms 21.39 69.24
_body API retry attempt 1 for phase4b-test-model (status 500). API retry attempt 2 for phase4b-test-model (status 500).

qwen-code.llm_request span attributes

{
  "qwen-code.model": "phase4b-test-model",
  "gen_ai.request.model": "phase4b-test-model",
  "duration_ms": 800,
  "input_tokens": 100, "gen_ai.usage.input_tokens": 100,
  "output_tokens": 20, "gen_ai.usage.output_tokens": 20,
  "ttft_ms": 80, "gen_ai.server.time_to_first_token": 0.08,
  "request_setup_ms": 129,
  "attempt": 3,
  "retry_total_delay_ms": 90.63,
  "sampling_ms": 720,
  "output_tokens_per_second": 27.78,
  "success": true
}

qwen-code.api.retry.count Counter(经 PeriodicExportingMetricReader + FileMetricExporter 落盘)

metric: qwen-code.api.retry.count
  point: {"model": "phase4b-test-model"}  value: 2

Phase 4a sampling_ms 公式修复 — 实测验证

request_setup_ms = 129 是非零的(累计重试开销)。用旧公式 duration - ttft - setup 会得到 800 - 80 - 129 = 591(错);新公式 duration - ttft 得到 720(对)。实测 outfile 中确实写入了 720,且 output_tokens_per_second = 27.78 = 20 / (720 / 1000),前后自洽。如果不在本 PR 同时修这个公式,Phase 4b 的重试场景会让所有重试请求的吞吐量数据归零,所以这个 bundled fix 是必要的、不可拆的。

期望矩阵 vs 实测

检查项 期望 实测
LogRecord 条数 2 2
attempt_number 单调 1 → 2 1, 2
status_code 透传 500 500
LLM span attempt 3 (1 success + 2 prior) 3
request_setup_msretry_total_delay_ms (setup 包含 attempt 自身耗时) true 129 ≥ 90.63
sampling_ms = duration - ttft(Phase 4a 修复) 720 720
旧公式会算成 591 (bug) ≠720 591 ≠ 720(已修复)
output_tokens_per_second 27.78 27.78
GenAI semconv time_to_first_token (秒) 0.08 0.08
Counter qwen-code.api.retry.count model 维度 = 2 {model}: 2
ALS 隔离:attempt 1 setup=0, attempt 2 > 0, attempt 3 单调增 as expected 0 → 25ms → 80ms

一处与 PR 描述需对照的实际行为(并非 bug)

PR 提到 "api_retry LogToSpanProcessor 桥接 span 以 sibling 形式挂在 caller active span 下"。这是 OTLP 路径 的行为(useOtlp 分支才 new LogToSpanProcessor(...));本次 outfile 模式的验证里没有桥接 span 产出,只有 LogRecord 本身写入了 outfile,这与 sdk.ts:259-342 的代码完全一致 — outfile 模式下 LogRecord 直接走 FileLogExporter,不经过 LogToSpanProcessor。这是预期行为,PR 单测 log-to-span-processor.test.ts (32 tests) 已单独覆盖该桥接逻辑。如果未来希望 outfile 模式也产出桥接 span 以方便本地观察,可在 4c 或后续 PR 里单独考虑(超出本 PR 范围)。

复现命令

git fetch origin pull/4432/head:pr-4432-test && git fetch origin main
git worktree add -b pr-4432-merge-test /tmp/pr4432-merged origin/main
cd /tmp/pr4432-merged && git merge pr-4432-test --no-edit
npm ci && npm run build && npm run typecheck

# 单元测试(20 个文件,587 tests)
cd packages/core && npx vitest run --coverage=false \
  src/telemetry/ src/utils/retry.test.ts \
  src/core/loggingContentGenerator/loggingContentGenerator.test.ts

# E2E:把验证文件 phase4b-e2e-verify.test.ts 放到 packages/core/src/utils/ 后
npx vitest run --coverage=false src/utils/phase4b-e2e-verify.test.ts

(验证用的 e2e 测试文件用完即弃,与 PR 描述一致;如有需要我可以贴完整源码到下方 thread。)

结论

  • ✅ 与 latest main 合并干净,无冲突
  • ✅ build / typecheck / lint 全绿
  • ✅ 587 个单元测试全部通过
  • ✅ 真实 OTel SDK 端到端验证:3 个 sink(LogRecord / Span / Counter)全部按预期写入 outfile,数值与 PR 描述一致
  • ✅ Phase 4a sampling_ms 公式修复在真实场景下证实是必要的且生效
  • 同意合并。

doudouOUC added a commit to doudouOUC/qwen-code that referenced this pull request May 26, 2026
Fixes 6 of 9 inline review comments from wenshao + Copilot. The remaining
3 are pushback (duration_ms semantic = design intent per D5; persistent
retry cap = explicitly deferred in PR description).

1. Fix JSDoc inaccuracy on `onRetry` contract (QwenLM#1+QwenLM#2): the comment
   incorrectly said "synchronous throws inside fn execute OUTSIDE the ALS
   frame." In fact fn() runs inside retryContext.run() so throws ARE inside
   the frame. What's outside the frame is the onRetry callback itself (it
   fires from the catch block). Rewritten per wenshao's suggestion: tells
   callers not to read retryContext.getStore() inside onRetry — all data
   comes via the RetryAttemptInfo parameter.

2. Add doc comment on content-retry delay inflation (QwenLM#3): retryTotalDelayMs
   accumulator includes content-retry delays (shouldRetryOnContent path)
   which don't fire onRetry. This is intentional — the LLM span attribute
   reports total user-perceived backoff time — but was undocumented.

3. Add signal?.aborted guard before onRetry invocations (QwenLM#6): if the abort
   signal fires between the catch and onRetry execution point, we now skip
   the callback to avoid phantom retry events that inflate the counter for
   retries that never actually proceeded. Applied to both persistent and
   normal retry paths.

4. Add persistent retry path test (status=429 + persistentMode) (QwenLM#4): the
   highest-volume production retry path had zero Phase 4b test coverage.
   Now verifies onRetry fires with monotonic attempt counter and that
   persistent-mode exponential backoff produces increasing delayMs.

5. Add Retry-After header path test (status=429 + retry-after: 2) (QwenLM#7):
   verifies that when the error carries a Retry-After header,
   onRetry.delayMs reflects the parsed header value (2000ms) instead of
   the exponential backoff calculation.

6. Add stream idle-timeout retry-attr propagation test (QwenLM#8): verifies that
   the closure-captured retrySnapshot reaches the setTimeout-fired
   endLLMRequestSpan call with correct retry context values (attempt=4,
   requestSetupMs=3000, retryTotalDelayMs=2500).

All 186 affected tests pass (retry 68 + LCG 48 + session-tracing 70).
tsc --noEmit clean. eslint clean.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Comment thread packages/core/src/core/baseLlmClient.ts
Comment thread packages/core/src/core/client.ts
Comment thread packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts Outdated
Comment thread packages/core/src/telemetry/qwen-logger/qwen-logger.ts
doudouOUC added a commit to doudouOUC/qwen-code that referenced this pull request May 27, 2026
… in RUM (QwenLM#4432)

Addresses 2 of 5 R3 review comments from wenshao (2026-05-26):

1. loggingContentGenerator.test.ts:2290 — replace `if (timeoutRecord)` guard
   with `expect(timeoutRecord).toBeDefined()` so the idle-timeout retry-attr
   test fails loudly instead of passing with 0 assertions when setTimeout
   doesn't fire. Also rewrote the test to use fake timers from the START
   (so the 5-min idle timeout is created under fake clock and can be advanced
   via vi.advanceTimersByTimeAsync), fixing the underlying reason it wasn't
   firing.

2. qwen-logger.ts:963 — add `prompt_id: event.prompt_id` to
   logApiRetryEvent RUM properties. Without this, RUM dashboards cannot
   correlate api_retry events with specific prompts, unlike the analogous
   logApiErrorEvent which already includes prompt_id.

165 affected tests pass. Remaining 3 R3 items (QwenLM#9 onRetry helper, QwenLM#10
error-path test coverage, QwenLM#11 caller integration assertions) deferred to
follow-up PR — non-blocking refactor/test-hardening.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
wenshao previously approved these changes May 27, 2026

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No review findings. Downgraded from Approve to Comment: CI still running. — qwen3.7-max via Qwen Code /review

@wenshao

wenshao commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Maintainer Verification Report

Build

  • Result: PASS
  • Clean build after rm -rf packages/core/dist (stale .d.ts artifact conflict, not PR-related). Zero errors, 15 warnings (all pre-existing lint/browserslist warnings).

Tests — PR-Touched Files (Targeted)

Test File Tests Result
retry.test.ts 68 PASS
loggingContentGenerator.test.ts 48 PASS
session-tracing.test.ts 70 PASS
loggers.test.ts 48 PASS
Total 234 ALL PASS

Tests — Full Core Suite (Regression)

Metric Value
Test Files 339 passed, 1 failed (340 total)
Tests 9291 passed, 1 failed, 3 skipped (9295 total)
Duration 141.69s

The single failure (anthropicContentGenerator.test.ts — User-Agent string mismatch) is pre-existing on main and unrelated to this PR.

Code Review Findings

Architecture: Clean AsyncLocalStorage-based retry context propagation, consistent with existing promptIdContext/subagentNameContext patterns.

Key observations:

  1. retryContext.ts — Well-defined RetryAttemptContext interface with clear 1-based attempt, cumulative retryTotalDelayMs, and requestSetupMs fields. Correct use of AsyncLocalStorage.

  2. retry.tsiterationCount correctly decoupled from attempt variable (which gets clamped in persistent mode). onRetry callback fires in catch block OUTSIDE the ALS frame — data passed via RetryAttemptInfo parameter, not ALS read. Callback errors properly swallowed with debug logging. signal?.aborted guard prevents phantom retry events.

  3. loggingContentGenerator.tssnapshotRetryMetadata() called in synchronous prelude (before first await), correctly capturing ALS state before the streaming AsyncGenerator exits the retry frame. Default attempt=1 ensures warmup/side-query spans are dashboard-friendly.

  4. session-tracing.tsPhase 4a sampling_ms bug fix is correct: changed from duration - ttft - setup to duration - ttft. The duration_ms is computed from startLLMRequestSpan time (AFTER setup), so subtracting requestSetupMs was double-counting. Phase 4a masked this because requestSetupMs was always undefined.

  5. 4 LLM call sites (client.ts, geminiChat.ts, baseLlmClient.ts ×2) — All correctly wired with identical onRetry callback pattern emitting ApiRetryEvent with subagentNameContext.getStore().

  6. Three-sink fan-out — QwenLogger RUM (logApiRetryEvent), OTel log signal (logger.emit → LogToSpanProcessor bridge), Counter metric (recordApiRetry tagged by model) — all present and correct.

  7. New ApiRetryEvent typeduration_ms set equal to retry_delay_ms for LogToSpanProcessor bridge span visualization. Clear distinction from ContentRetryEvent documented in JSDoc.

No issues found. The implementation is thorough, well-documented, and matches the design doc.


Verified by: wenshao

doudouOUC added 3 commits June 3, 2026 18:36
QwenLM#3731)

Adds per-attempt retry telemetry for HTTP-status retries (429/5xx) emitted by
retryWithBackoff at the 4 LLM call sites. Second slice of Phase 4 (sub-issue

Architectural discovery (mid-planning)
--------------------------------------

The Phase 4 design doc assumed claude-code's "one LLM span owns the retry
loop" pattern. Reading the 4 retryWithBackoff call sites revealed qwen-code
inverts that: retryWithBackoff sits ABOVE LoggingContentGenerator. Each
attempt creates a fresh LLM span. The original "in-LCG accumulator" plan
wouldn't work.

Resolution: propagate retry state via AsyncLocalStorage (`retryContext`).
retryWithBackoff wraps each `await fn()` in `retryContext.run(...)`, and
LoggingContentGenerator reads the ALS in its synchronous prelude (before
the first await) and threads the snapshot into all endLLMRequestSpan
callsites — success / error / idle-timeout / abort. Matches existing
patterns (promptIdContext, subagentNameContext, agent-context).

Plan went through 3 review rounds (Plan-agent reviews) finding 22 issues
total — all addressed before implementation.

Changes
-------

- New retryContext.ts (AsyncLocalStorage<RetryAttemptContext>) with
  attempt + requestSetupMs + retryTotalDelayMs fields. Computed in
  retry.ts immediately before `await fn()` so values are anchored to the
  attempt's actual start, not derived downstream.

- retry.ts:
  - New `onRetry?: (info: RetryAttemptInfo) => void` option on RetryOptions.
    Opt-in per caller: non-LLM callers stay silent.
  - Monotonic `iterationCount` decoupled from `attempt` (which is clamped at
    `maxAttempts - 1` in persistent mode). Always reflects "this is the Nth
    fn() call" — no flip-flopping for mixed-error sequences.
  - retryContext.run wrap around fn() so LCG can read the ALS.
  - onRetry invocations wrapped in try/catch: telemetry exceptions never
    break the retry loop (logged via debugLogger).
  - logRetryAttempt debug log line KEPT — useful when OTel SDK isn't wired
    up (local CLI debugging, integration tests, early-startup errors).

- ApiRetryEvent telemetry event class (types.ts) with model + promptId +
  attempt_number + error fields + subagent_name. JSDoc cross-references
  ContentRetryEvent (they cover different retry budgets — HTTP-status vs
  invalid-stream — and can both fire for one prompt).

- logApiRetry function in loggers.ts — three-sink fan-out matching
  logContentRetry: QwenLogger RUM, OTel log signal (bridged via
  LogToSpanProcessor), recordApiRetry metric counter.

- recordApiRetry metric (metrics.ts) — `qwen-code.api.retry.count` Counter
  tagged with {model}. Full COUNTER_DEFINITIONS entry + initialization +
  recording function + index.ts export.

- qwen-logger.ts adds logApiRetryEvent for RUM consistency.

- 4 LLM caller wiring sites (client.ts, baseLlmClient.ts x2,
  geminiChat.ts) opt in with onRetry callback that emits ApiRetryEvent
  with subagentName from subagentNameContext.getStore().

- LoggingContentGenerator: snapshotRetryMetadata() helper called in the
  SYNCHRONOUS prelude of generateContent / generateContentStream — only
  point where retryContext is guaranteed active for the streaming path
  (the returned AsyncGenerator is iterated AFTER retryWithBackoff
  resolves). Snapshot threaded as parameter to loggingStreamWrapper so
  every endLLMRequestSpan callsite (success / error / idle-timeout /
  abort) sees the same values. `attempt` defaults to 1 when no retry
  context is present (warmup, side-queries, direct calls) so dashboards
  filtering WHERE attempt=1 include those.

Bundled Phase 4a bug fix (sampling_ms formula)
-----------------------------------------------

Phase 4a's `sampling_ms = duration_ms - ttft_ms - (requestSetupMs ?? 0)`
was silently wrong. `duration_ms` only covers `ttft + sampling` for the
span (startTime is captured when startLLMRequestSpan runs, AFTER any
setup phase). Subtracting setup again is double-counting. Phase 4a
masked the bug because requestSetupMs was always undefined → 0. Phase
4b populates requestSetupMs with cumulative retry overhead — without
this fix, sampling_ms would clamp to 0 for every retried request,
wiping output-throughput data exactly when operators need it most.

Fix: `sampling_ms = duration_ms - ttft_ms` (drop the setup subtraction).
Phase 4a tests updated accordingly: 1 test rewritten to use inputs that
actually exercise the clamp under the new formula (ttft > duration =
clock skew); 1 test renamed to assert the FIX (setup is NOT subtracted).

Out of scope (deferred, noted in PR description)
------------------------------------------------

- Persistent retry mode emission cap (50+ events under
  QWEN_CODE_UNATTENDED_RETRY). Aggregated attempt/retry_total_delay_ms
  remain accurate regardless.
- SDK-internal retries (openai/google-genai maxRetries=3) remain
  invisible — operator awareness only.
- Stream-iteration errors (mid-stream network drop during for-await)
  bypass retryWithBackoff entirely. Pre-existing behavior, not a Phase 4b
  regression.
- shouldRetryOnContent content-retry path (retry.ts:184-193) skips
  onRetry. No caller uses this path today — code path is dead.

Tests
-----

- retry.test.ts: 9 new cases (monotonic counter, requestSetupMs growth,
  first-try success, onRetry callback contract, absent-callback silence,
  callback-throws resilience, shouldRetryOnError mid-loop giveup,
  parallel-call ALS isolation, nested-retry inner-frame read).
- loggers.test.ts: 3 new cases (3-sink fan-out, subagent_name
  propagation, SDK-not-initialized path).
- loggingContentGenerator.test.ts: 4 new cases (non-stream ALS
  propagation, non-stream default attempt=1, stream ALS propagation
  through wrapper closure, stream default attempt=1).
- session-tracing.test.ts: 1 test rewritten + 1 renamed for the
  sampling_ms fix.

All 580 telemetry + retry + LCG tests pass. tsc --noEmit clean.
eslint clean.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Fixes 6 of 9 inline review comments from wenshao + Copilot. The remaining
3 are pushback (duration_ms semantic = design intent per D5; persistent
retry cap = explicitly deferred in PR description).

1. Fix JSDoc inaccuracy on `onRetry` contract (QwenLM#1+QwenLM#2): the comment
   incorrectly said "synchronous throws inside fn execute OUTSIDE the ALS
   frame." In fact fn() runs inside retryContext.run() so throws ARE inside
   the frame. What's outside the frame is the onRetry callback itself (it
   fires from the catch block). Rewritten per wenshao's suggestion: tells
   callers not to read retryContext.getStore() inside onRetry — all data
   comes via the RetryAttemptInfo parameter.

2. Add doc comment on content-retry delay inflation (QwenLM#3): retryTotalDelayMs
   accumulator includes content-retry delays (shouldRetryOnContent path)
   which don't fire onRetry. This is intentional — the LLM span attribute
   reports total user-perceived backoff time — but was undocumented.

3. Add signal?.aborted guard before onRetry invocations (QwenLM#6): if the abort
   signal fires between the catch and onRetry execution point, we now skip
   the callback to avoid phantom retry events that inflate the counter for
   retries that never actually proceeded. Applied to both persistent and
   normal retry paths.

4. Add persistent retry path test (status=429 + persistentMode) (QwenLM#4): the
   highest-volume production retry path had zero Phase 4b test coverage.
   Now verifies onRetry fires with monotonic attempt counter and that
   persistent-mode exponential backoff produces increasing delayMs.

5. Add Retry-After header path test (status=429 + retry-after: 2) (QwenLM#7):
   verifies that when the error carries a Retry-After header,
   onRetry.delayMs reflects the parsed header value (2000ms) instead of
   the exponential backoff calculation.

6. Add stream idle-timeout retry-attr propagation test (QwenLM#8): verifies that
   the closure-captured retrySnapshot reaches the setTimeout-fired
   endLLMRequestSpan call with correct retry context values (attempt=4,
   requestSetupMs=3000, retryTotalDelayMs=2500).

All 186 affected tests pass (retry 68 + LCG 48 + session-tracing 70).
tsc --noEmit clean. eslint clean.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
… in RUM (QwenLM#4432)

Addresses 2 of 5 R3 review comments from wenshao (2026-05-26):

1. loggingContentGenerator.test.ts:2290 — replace `if (timeoutRecord)` guard
   with `expect(timeoutRecord).toBeDefined()` so the idle-timeout retry-attr
   test fails loudly instead of passing with 0 assertions when setTimeout
   doesn't fire. Also rewrote the test to use fake timers from the START
   (so the 5-min idle timeout is created under fake clock and can be advanced
   via vi.advanceTimersByTimeAsync), fixing the underlying reason it wasn't
   firing.

2. qwen-logger.ts:963 — add `prompt_id: event.prompt_id` to
   logApiRetryEvent RUM properties. Without this, RUM dashboards cannot
   correlate api_retry events with specific prompts, unlike the analogous
   logApiErrorEvent which already includes prompt_id.

165 affected tests pass. Remaining 3 R3 items (QwenLM#9 onRetry helper, QwenLM#10
error-path test coverage, QwenLM#11 caller integration assertions) deferred to
follow-up PR — non-blocking refactor/test-hardening.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
@doudouOUC doudouOUC force-pushed the feat/telemetry-phase-4b-retry-visibility branch from 479bbf3 to 5458a9e Compare June 3, 2026 10:36
@doudouOUC doudouOUC requested a review from wenshao June 3, 2026 11:10

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The ALS-based retry context propagation is clean and matches existing patterns (promptIdContext, subagentNameContext). The sampling_ms formula fix is correct — duration_ms already excludes setup time since startTime is captured after startLLMRequestSpan. All prior review comments are addressed or explicitly deferred as non-blocking.

The 3 follow-up rounds of review (22 findings total) and the author's responses look thorough. Test coverage is solid — including the idle-timeout path and parallel ALS isolation.

@doudouOUC doudouOUC merged commit b3fa135 into QwenLM:main Jun 5, 2026
10 checks passed
@doudouOUC doudouOUC deleted the feat/telemetry-phase-4b-retry-visibility branch June 5, 2026 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants