What would you like to be added?
Harden Qwen Code's OpenTelemetry implementation so it is production-ready, starting with configuration semantics, HTTP OTLP correctness, exporter safety, and shutdown reliability.
Why is this needed?
Qwen Code already has OTLP support, but the current implementation is still closer to a minimal SDK hookup than a production-ready telemetry subsystem.
A few gaps make rollout and troubleshooting harder than necessary:
This makes telemetry appear enabled without yet being predictable, enterprise-friendly, or operationally safe enough for broader production use.
Additional context
Completed sub-issues:
feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3734 Define HTTP OTLP endpoint behavior and signal routing — closed by feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779
fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute #3811 Add bounded shutdown timeout and fix service.version resource attribute — closed by fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute #3813
Remove dead useCollector setting and unreachable TelemetryTarget.QWEN enum value — closed by refactor(telemetry): remove dead useCollector setting and unreachable TelemetryTarget.QWEN #4061
follow-up(telemetry): Phase 1.5 polish — fallback order, exec span on abort-as-result, mock + log/span consistency #4212 Phase 1.5 polish from feat(telemetry): unify span creation paths for hierarchical trace tree #4126 review (fallback order, exec-span on abort-as-result, mock + log/span consistency) — closed by fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302
Phase 2 hierarchical session tracing (tool.blocked_on_user + hook spans, tool span lifecycle move) — closed by feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321
feat(telemetry): support custom resource attributes and add metric cardinality controls #4365 Custom resource attributes (OTEL_RESOURCE_ATTRIBUTES, OTEL_SERVICE_NAME, settings.json telemetry.resourceAttributes) + metric cardinality controls (session.id off Resource, opt-in toggle) — closed by feat(telemetry): support custom resource attributes and add metric cardinality controls #4367
Phase 4a TTFT capture + GenAI semconv dual-emit — closed by feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731) #4417
feat(telemetry): propagate W3C traceparent + X-Qwen-Code-Session-Id to LLM service calls #4384 (partial) Client-side HTTP span (@opentelemetry/instrumentation-undici) + opt-in W3C traceparent propagation — closed partially by feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 (traceparent done; X-Qwen-Code-Session-Id header deferred to follow-up under outboundCorrelation.* namespace)
Wire clearDetailedSpanState() into chat compression cleanup (feat(telemetry): add interaction span and detailed sensitive attributes #4097 follow-up) — closed by fix(telemetry): clear span dedup state after chat compression (#3731) #4660
Phase 4b retry visibility for qwen-code.llm_request — closed by feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432
Phase 3 subagent trace tree (qwen-code.subagent span, hybrid traceId, type-aware TTL) — closed by feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410
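The bounded shutdown timeout and idempotent shutdown promise mentioned above can be illustrated with a small sketch. This is not the code from #3779 or #3813: the helper names `shutdownWithTimeout` and `shutdownTelemetryOnce`, the outcome strings, and the 5-second default are all assumptions made for illustration.

```typescript
// Hypothetical sketch: race SDK shutdown against a timer so that process exit
// is never blocked indefinitely by an unreachable collector.
type ShutdownOutcome = "completed" | "timed_out" | "failed";

async function shutdownWithTimeout(
  shutdown: () => Promise<void>,
  timeoutMs: number,
): Promise<ShutdownOutcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  try {
    const timeout = new Promise<ShutdownOutcome>((resolve) => {
      timer = setTimeout(() => resolve("timed_out"), timeoutMs);
    });
    // Promise.resolve().then(...) also converts a synchronous throw into a rejection.
    const work = Promise.resolve()
      .then(shutdown)
      .then(
        (): ShutdownOutcome => "completed",
        (): ShutdownOutcome => "failed",
      );
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the event loop alive.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Idempotent wrapper: repeated calls share the same in-flight promise.
let shutdownPromise: Promise<ShutdownOutcome> | undefined;

function shutdownTelemetryOnce(
  shutdown: () => Promise<void>,
  timeoutMs = 5_000, // assumed default, not taken from the source
): Promise<ShutdownOutcome> {
  shutdownPromise ??= shutdownWithTimeout(shutdown, timeoutMs);
  return shutdownPromise;
}
```

Returning an outcome instead of throwing keeps the cleanup path simple: the caller can log a timeout to the debug log and continue exiting.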
Merged PRs (chronological):
feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779 (2026-05-01) — HTTP OTLP signal routing + per-signal endpoints + LogToSpanProcessor + console-exporter removal + Config.cleanup wiring
fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute #3813 (2026-05-05) — bounded shutdown timeout + service.version fix
feat(telemetry): add sensitive span attribute opt-in #3893 (2026-05-07) — sensitive span attribute opt-in (includeSensitiveSpanAttributes) for log-to-span bridge spans
feat(telemetry) suppress OpenTelemetry diagnostics from UI #3986 (2026-05-09) — suppress OpenTelemetry diagnostics from UI / route to debug log
feat(telemetry): inject traceId/spanId into debug log files for OTel correlation #3847 (2026-05-10) — inject traceId/spanId into debug log files for OTel correlation
refactor(telemetry): remove dead useCollector setting and unreachable TelemetryTarget.QWEN #4061 (2026-05-11) — remove dead useCollector setting and TelemetryTarget.QWEN enum value
feat(telemetry): add hierarchical session tracing spans #4071 (2026-05-12) — interaction span scaffolding (lifecycle in client.ts, span type constants, WeakRef + 30-min TTL cleanup)
fix(telemetry): address PR #3847 review follow-ups for trace correlation #4058 (2026-05-13) — feat(telemetry): inject traceId/spanId into debug log files for OTel correlation #3847 review follow-ups (abandoned generator idle timeout, autoOkOnSuccess option, sampler-aware trace flags, session fallback in LogToSpanProcessor)
docs(telemetry): align config and docs semantics for target, outfile, and CLI flags #4066 (2026-05-13) — align telemetry config and docs semantics for target / outfile / CLI flags
feat(telemetry): unify span creation paths for hierarchical trace tree #4126 (2026-05-16) — unify span creation paths for hierarchical trace tree (P3 Phase 1)
feat(telemetry): add interaction span and detailed sensitive attributes #4097 (2026-05-16) — interaction span + detailed sensitive attributes (verbatim user prompt / system prompt / tool I/O / model output, gated by includeSensitiveSpanAttributes)
fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302 (2026-05-19) — Phase 1.5 polish: fallback order, exec-span on abort-as-result, idle-timeout vs api log consistency, exec-span cancelled status
feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321 (2026-05-21) — Phase 2: tool.blocked_on_user + hook spans. Moves tool span lifecycle from executeSingleToolCall to _schedule so validating → awaiting_approval → executing is one span; adds 5 blocked-span end sites + 6 wrapped hook fire sites; class-level batch-listener cleanup for shared AbortSignals; TTL safety-net stamps + log context + try/catch separation; centralised truncateSpanError (1KB cap, surrogate-pair safe); hookError plumbing for runner-contract violations; signal.aborted re-check after for-loop awaits; handleConfirmationResponse outer-catch moved out of attemptExecutionOfScheduledCalls to prevent sister-tool failures from corrupting the confirmed tool's span.
feat(telemetry): support custom resource attributes and add metric cardinality controls #4367 (2026-05-21, merged commit 64401e1) — closes feat(telemetry): support custom resource attributes and add metric cardinality controls #4365. Custom resource attributes (OTEL_RESOURCE_ATTRIBUTES / OTEL_SERVICE_NAME env vars now respected per OTel spec, plus settings.json telemetry.resourceAttributes). Metric cardinality controls: session.id moved off the OTel Resource (it auto-attached to every metric data point and was fanning out Prometheus / ARMS Metric time-series), gated behind a new opt-in telemetry.metrics.includeSessionId toggle. Reserved keys (service.version, session.id) are stripped from env + settings with diag.warn; the SDK emits a one-time console summary at init when input is dropped. Keys are also percent-decoded, per the W3C Baggage spec. Spans and logs continue to carry session.id unconditionally for trace/log correlation. Design doc: docs/design/telemetry-resource-attributes-design.md.
feat(telemetry): Phase 4a — TTFT capture + GenAI semconv dual-emit (#3731) #4417 (2026-05-22) — Phase 4a: TTFT capture (hasUserVisibleContent cross-provider first-token detection, method-local closure in LoggingContentGenerator) + extended LLMRequestMetadata (ttftMs / requestSetupMs / attempt / retryTotalDelayMs / cachedInputTokens) + endLLMRequestSpan derived attrs (sampling_ms, output_tokens_per_second) + GenAI semconv dual-emit (gen_ai.request.model / gen_ai.usage.* / gen_ai.server.time_to_first_token). Design doc: docs/design/telemetry-llm-request-timing-design.md.
feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 (2026-05-25) — partially closes feat(telemetry): propagate W3C traceparent + X-Qwen-Code-Session-Id to LLM service calls #4384. Client-side HTTP span via @opentelemetry/instrumentation-undici (separates network latency from model processing time) + OTLP feedback-loop guard (ignoreOutgoingRequestHook skips configured OTLP endpoints) + opt-in W3C traceparent propagation gated by outboundCorrelation.propagateTraceContext (default false, NoopTextMapPropagator). X-Qwen-Code-Session-Id header removed from scope per reviewer request — deferred to follow-up under outboundCorrelation.* namespace.
feat(telemetry): trace daemon prompt lifecycle #4556 (2026-05-29, daemon_mode_b_main) — daemon OTel context propagation: route-level qwen-code.daemon.request spans + qwen-code.daemon.bridge span on ACP prompt dispatch + cross-process trace context via reserved qwen.telemetry.* prompt metadata + ACP child context restoration for interaction span parenting. Closes feat(telemetry): cover qwen serve daemon end-to-end with OpenTelemetry #4554.
feat(telemetry): add client_id attribute and permission route spans #4628 (2026-05-30, daemon_mode_b_main) — daemon qwen-code.client_id span attribute from X-Qwen-Client-Id header + permission vote route spans (POST /session/:id/permission/:requestId, POST /permission/:requestId) + addDaemonRequestAttribute() helper for post-rebase enrichment.
feat(telemetry): add tool spans and session.id to daemon/ACP path #4630 (2026-05-30, daemon_mode_b_main) — daemon/ACP tool span hierarchy: startToolSpan/endToolSpan + startToolExecutionSpan/endToolExecutionSpan wired into Session.ts runTool() + session.id attribute on all session-tracing spans + logConversationFinishedEvent at turn end + #executeCronPrompt wrapped in withInteractionSpan. Closes feat(telemetry): align daemon/ACP session tracing with CLI path — interaction, tool, and session.id spans missing #4602.
fix(telemetry): clear span dedup state after chat compression (#3731) #4660 (2026-06-03) — Wire clearDetailedSpanState() into GeminiChat.tryCompress() COMPRESSED branch. Clears seenHashes after chat compression so post-compaction spans re-emit full system prompt / tool schema content. Closes feat(telemetry): add interaction span and detailed sensitive attributes #4097 follow-up checklist item.
feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432 (2026-06-05) — Phase 4b: retry visibility for qwen-code.llm_request. Adds onRetry callback to retryWithBackoff (opt-in per caller), ApiRetryEvent LogRecord + logApiRetry 3-sink fan-out, qwen-code.api.retry.count Counter, wiring at 4 LLM call sites (client.ts, baseLlmClient.ts ×2, geminiChat.ts). Retry state propagated via AsyncLocalStorage<RetryAttemptContext> so LoggingContentGenerator can snapshot attempt/requestSetupMs/retryTotalDelayMs into LLMRequestMetadata. Also fixes Phase 4a sampling_ms formula bug (was double-subtracting requestSetupMs).
feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410 (2026-06-05) — Phase 3: qwen-code.subagent span. Wraps every subagent invocation so the LLM/tool/hook spans the subagent emits become a proper subtree instead of interleaving under the parent interaction. Hybrid traceId — foreground = child span, fork/background = linked root span (new traceId + OTel Link back to invoker, per spec recommendation for long-running async ops). Type-aware TTL (subagent fork/background = 4h, others stay 30 min) with qwen-code.subagent.terminate_reason='ttl_swept' sentinel. LogToSpanProcessor skip-list bypasses the existing qwen-code.subagent_execution bridge to avoid duplicate spans (LogRecord itself stays for RUM + metrics). OTel GenAI spec attrs dual-emitted (gen_ai.agent.id/gen_ai.agent.name alongside qwen-code.subagent.id/name). AgentContext.depth auto-incremented inside runWithAgentContext for recursion-bug detection. Design doc: docs/design/telemetry-subagent-spans-design.md.
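The HTTP OTLP signal routing from #3779 (a base endpoint expanded to a per-signal /v1/{signal} path, with the query string preserved) can be sketched roughly as follows. This is only an illustration of the idea: the function name `resolveHttpOtlpUrlSketch` is hypothetical, and the rule that an endpoint already ending in the signal path is left unchanged is an assumption, not confirmed behavior of the real resolveHttpOtlpUrl().

```typescript
type OtlpSignal = "traces" | "logs" | "metrics";

// Hypothetical sketch of per-signal URL resolution for OTLP over HTTP.
function resolveHttpOtlpUrlSketch(baseEndpoint: string, signal: OtlpSignal): string {
  const url = new URL(baseEndpoint);
  const suffix = `/v1/${signal}`;
  // Trim trailing slashes so "http://host:4318/" and "http://host:4318" behave the same.
  const basePath = url.pathname.replace(/\/+$/, "");
  // Assumption: an endpoint that already ends in the signal path is left alone.
  if (!basePath.endsWith(suffix)) {
    url.pathname = `${basePath}${suffix}`;
  }
  // url.search is never touched, so a query string such as "?token=abc" survives.
  return url.toString();
}

// Example usage with a made-up collector address.
const tracesUrl = resolveHttpOtlpUrlSketch("http://localhost:4318", "traces");
const logsUrl = resolveHttpOtlpUrlSketch("http://collector:4318/otlp/?token=abc", "logs");
```

Working on a parsed URL rather than concatenating strings is what keeps the query string intact when the path is rewritten.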
Open sub-issues:
The remaining work stays in this parent issue as a checklist until the scope is clearer.
Tracking checklist
Foundation
Runtime safety (P0)
Configuration semantics (P1)
Enterprise deployment (P2)
Deeper observability (P3)
Add resource attribute policy and cardinality controls — feat(telemetry): support custom resource attributes and add metric cardinality controls #4365 (closed by feat(telemetry): support custom resource attributes and add metric cardinality controls #4367)
Propagate traceparent to outgoing LLM service calls via @opentelemetry/instrumentation-undici + opt-in outboundCorrelation.propagateTraceContext — feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 (partial close of feat(telemetry): propagate W3C traceparent + X-Qwen-Code-Session-Id to LLM service calls #4384)
Propagate X-Qwen-Code-Session-Id to outgoing LLM service calls (deferred from feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 per reviewer request — needs own design under outboundCorrelation.* namespace with threat model + default-off)
Wire detailed sensitive span attributes onto hierarchical spans (feat(telemetry): add interaction span and detailed sensitive attributes #4097) — gated by includeSensitiveSpanAttributes; complements feat(telemetry): add sensitive span attribute opt-in #3893's bridge-span coverage with native-span coverage. Adds verbatim user prompt (new_context), system prompt + hash + preview + length (full text deduped per session via SHA-256), per-tool tool_schema events (also hash-deduped), response.model_output, and tool_input / tool_result on every tool exit path (success + pre-hook block + post-hook stop + tool error + try-block cancel + catch-block cancel + execution exception). All large content truncated at 60KB with *_truncated and *_original_length metadata. Heavy serialization (safeJsonStringify on tool I/O, partToString on user prompt) guarded at the call site so it doesn't run when telemetry is off.
Wire clearDetailedSpanState() into chat compression cleanup (feat(telemetry): add interaction span and detailed sensitive attributes #4097 follow-up) — closed by fix(telemetry): clear span dedup state after chat compression (#3731) #4660. Called in GeminiChat.tryCompress() COMPRESSED branch (the single convergence point for all compression paths). Clears seenHashes so post-compaction spans re-emit full system prompt / tool schema content.
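The two items above combine three mechanisms: size-capped attributes with `*_truncated` / `*_original_length` metadata, SHA-256 dedup of repeated content, and a reset of the dedup set after compression. A minimal sketch, assuming the 60KB cap is measured in UTF-16 code units; the attribute keys (`system_prompt`, `system_prompt_hash`, `system_prompt_length`) and the helper names are hypothetical:

```typescript
import { createHash } from "node:crypto";

// Assumption: the 60KB cap is applied to string length (UTF-16 code units).
const MAX_CONTENT_CHARS = 60 * 1024;

type SpanAttributes = Record<string, string | number | boolean>;

// Store a value, truncating oversized content and recording that it was truncated.
// Note: a plain slice can split a surrogate pair; a real helper may guard against that.
function setTruncated(attrs: SpanAttributes, key: string, value: string): void {
  if (value.length <= MAX_CONTENT_CHARS) {
    attrs[key] = value;
    return;
  }
  attrs[key] = value.slice(0, MAX_CONTENT_CHARS);
  attrs[`${key}_truncated`] = true;
  attrs[`${key}_original_length`] = value.length;
}

// Per-session set of content hashes that were already emitted in full.
const seenHashes = new Set<string>();

// Emit hash and length every time, but the full text only on first sight.
function systemPromptAttributes(prompt: string): SpanAttributes {
  const hash = createHash("sha256").update(prompt, "utf8").digest("hex");
  const attrs: SpanAttributes = {
    system_prompt_hash: hash,
    system_prompt_length: prompt.length,
  };
  if (!seenHashes.has(hash)) {
    seenHashes.add(hash);
    setTruncated(attrs, "system_prompt", prompt);
  }
  return attrs;
}

// Hypothetical stand-in for clearDetailedSpanState(): forget what was emitted,
// so spans after chat compression carry the full text again.
function clearDetailedSpanStateSketch(): void {
  seenHashes.clear();
}
```

Without the reset, a trace backend that only retains post-compaction spans would see hashes that point to text it never received.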
Wire hierarchical session tracing spans into runtime — see design doc
Interaction span lifecycle in client.ts sendMessageStream (feat(telemetry): add hierarchical session tracing spans #4071 )
Span type constants + typed helper functions defined and exported in session-tracing.ts (feat(telemetry): add hierarchical session tracing spans #4071 )
WeakRef + strongSpans + 30-min TTL cleanup + double-end protection (feat(telemetry): add hierarchical session tracing spans #4071 )
Phase 1 — Unify span creation paths and fix the trace tree structure (merged: feat(telemetry): unify span creation paths for hierarchical trace tree #4126)
Fix parent-child wiring — replace withSpan('api.*') / withSpan('tool.*') in runtime with session-tracing typed helpers (feat(telemetry): unify span creation paths for hierarchical trace tree #4126 )
Add toolContext ALS for tool sub-span parenting (feat(telemetry): unify span creation paths for hierarchical trace tree #4126 ; uses AsyncLocalStorage.run() not enterWith() for concurrent-safe scoping)
Wire startLLMRequestSpan / endLLMRequestSpan in loggingContentGenerator.ts (feat(telemetry): unify span creation paths for hierarchical trace tree #4126 )
Wire startToolSpan / endToolSpan in coreToolScheduler.ts (feat(telemetry): unify span creation paths for hierarchical trace tree #4126 )
Wire startToolExecutionSpan / endToolExecutionSpan in coreToolScheduler.executeSingleToolCall (feat(telemetry): unify span creation paths for hierarchical trace tree #4126 )
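The Phase 1 note about using AsyncLocalStorage.run() rather than enterWith() matters when tools execute concurrently: run() scopes the store to one callback and its async continuations, so parallel tool calls cannot see each other's context. A small sketch using Node's built-in AsyncLocalStorage; the `ToolContext` shape and the helper names are assumptions, not the project's actual types:

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical context shape used to parent tool sub-spans.
interface ToolContext {
  toolName: string;
  spanId: string;
}

const toolContext = new AsyncLocalStorage<ToolContext>();

function currentToolSpanId(): string | undefined {
  return toolContext.getStore()?.spanId;
}

// run() binds ctx only for the duration of body and everything it awaits.
function runTool<T>(ctx: ToolContext, body: () => Promise<T>): Promise<T> {
  return toolContext.run(ctx, body);
}

const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Two overlapping tool calls each observe their own span id; code outside
// either call observes no store at all.
async function demo(): Promise<string[]> {
  const results = await Promise.all([
    runTool({ toolName: "read_file", spanId: "span-a" }, async () => {
      await delay(10);
      return currentToolSpanId() ?? "none";
    }),
    runTool({ toolName: "shell", spanId: "span-b" }, async () => {
      await delay(1);
      return currentToolSpanId() ?? "none";
    }),
  ]);
  return [...results, currentToolSpanId() ?? "none"];
}
```

With enterWith(), the store would instead remain set for the rest of the calling execution context, so a later sibling call could inherit a context that was never meant for it.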
Phase 1.5 — polish from late review rounds (follow-up(telemetry): Phase 1.5 polish — fallback order, exec span on abort-as-result, mock + log/span consistency #4212, PR fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302)
startLLMRequestSpan/startToolSpan fallback order — prefer active OTel span before session-root fallback (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302 )
tool.execution span misreports cancellation as success when tool returns ToolResult on abort (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302 ; review follow-up added cancelled discriminator so exec sub-span ends UNSET like the parent instead of ERROR)
Test mock for loggingContentGenerator.test.ts missing API_CALL_ABORTED_SPAN_STATUS_MESSAGE constant (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302 )
Stream idle timeout → safelyLogApiResponse still logs success after span timed out (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302 ; review follow-up extended the gate to the catch path so safelyLogApiError is also skipped after timeout)
Phase 2 — Fill in workflow-stage spans (merged: feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321)
Add tool.blocked_on_user span type + helper; wire into approval state machine in coreToolScheduler._schedule (moved tool span start before approval to cover validating → awaiting_approval → executing in one span) (feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321 )
Add hook span type + helper; wire into pre/post hook execution in coreToolScheduler.executeSingleToolCall (6 fire sites wrapped via withHookSpan) (feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321 )
Phase 3 — Subagent trace tree (1 PR, depends on Phase 1; merged: feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410)
Add subagent root span with context.with() isolation for concurrent subagent trace trees — feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410 . Design landed independently (claude-code OTel surface is flat; opencode validates the context.with(trace.setSpan(active, span), fn) pattern). Hybrid traceId: foreground = child, fork/background = linked root + Link. Type-aware TTL exempts fork/background up to 4h.
Phase 4 — LLM request timing decomposition (split into 4a/4b/4c — feat(telemetry): Phase 4 — LLM request timing decomposition (TTFT, request setup, retry visibility, breakdown metric) #4413, design doc: docs/design/telemetry-llm-request-timing-design.md)
Phase 4a — TTFT capture + extended LLMRequestMetadata + GenAI semconv dual-emit (~200 LOC, self-contained)
Add hasUserVisibleContent helper for cross-provider first-token detection on normalized GenerateContentResponse (text / functionCall / inlineData / executableCode / thought) — Phase 4a
Capture ttftMs in LoggingContentGenerator.generateContentStream stream wrapper using method-local closure (NEVER instance fields — shared singleton concern) — Phase 4a
Extend LLMRequestMetadata with ttftMs / requestSetupMs / attempt / retryTotalDelayMs / cachedInputTokens (all optional; latter three populated by Phase 4b retry layer) — Phase 4a
endLLMRequestSpan writes ttft_ms, derived sampling_ms (clamped >= 0), output_tokens_per_second (rounded 2 decimals, omitted when sampling_ms == 0), cached_input_tokens, plus Phase 4b placeholders when provided — Phase 4a
GenAI semconv dual-emit (private name authoritative, semconv = compat layer per feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410 precedent): gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_tokens (Experimental), gen_ai.server.time_to_first_token (Experimental, seconds-as-double) — Phase 4a
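The derived attributes listed for Phase 4a can be illustrated with a short sketch. The clamping, rounding, and omission rules follow the checklist text; the formula `sampling_ms = durationMs - ttftMs` is an assumption (the issue does not spell it out), and `deriveTimingAttributes` and the `TimingInput` shape are hypothetical names.

```typescript
// Hypothetical input: total request duration plus optional TTFT and token count.
interface TimingInput {
  durationMs: number;
  ttftMs?: number;
  outputTokens?: number;
}

type NumericAttributes = Record<string, number>;

function deriveTimingAttributes(input: TimingInput): NumericAttributes {
  const attrs: NumericAttributes = {};
  if (input.ttftMs === undefined) return attrs; // no first token observed

  attrs["ttft_ms"] = input.ttftMs;
  // Semconv mirror of the private attribute, expressed in seconds as a double.
  attrs["gen_ai.server.time_to_first_token"] = input.ttftMs / 1000;

  // Assumed formula: time spent streaming after the first token, clamped at 0.
  const samplingMs = Math.max(0, input.durationMs - input.ttftMs);
  attrs["sampling_ms"] = samplingMs;

  // Throughput is omitted when sampling_ms is 0 to avoid dividing by zero.
  if (samplingMs > 0 && input.outputTokens !== undefined) {
    const tokensPerSecond = input.outputTokens / (samplingMs / 1000);
    attrs["output_tokens_per_second"] = Math.round(tokensPerSecond * 100) / 100;
  }
  return attrs;
}

// Example: 2.5 s total, first token after 0.5 s, 100 output tokens.
const timing = deriveTimingAttributes({ durationMs: 2500, ttftMs: 500, outputTokens: 100 });
```

In this example the request spends 2000 ms sampling, which works out to 50 output tokens per second.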
Phase 4b — retryWithBackoff onRetry callback + ApiRetryEvent + 4 LLM caller wiring (merged: feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432)
Add onRetry?: (info: RetryAttemptInfo) => void to retryWithBackoff options; opt-in per caller so non-LLM callers (channels/weixin/src/api.ts) stay silent — Phase 4b (feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432 )
Add ApiRetryEvent LogRecord class + logApiRetry emitter; bridges via existing LogToSpanProcessor to a child span under the active LLM span — Phase 4b (feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432 )
Wire onRetry callback at 4 LLM call sites: client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039 — Phase 4b (feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432 )
Populate attempt + retryTotalDelayMs + requestSetupMs on LLMRequestMetadata from retry-layer accumulator — Phase 4b (feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432 )
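The opt-in onRetry callback described in Phase 4b can be sketched as below. This is a simplified stand-in rather than the project's retryWithBackoff: the fields of `RetryAttemptInfo`, the option names, and the doubling backoff are assumptions made for illustration.

```typescript
// Hypothetical shape of the information passed to onRetry.
interface RetryAttemptInfo {
  attempt: number; // 1-based index of the attempt that just failed
  delayMs: number; // wait before the next attempt
  error: unknown;
}

interface RetryOptions {
  maxAttempts: number;
  initialDelayMs: number;
  // Opt-in: callers that do not pass a callback stay silent.
  onRetry?: (info: RetryAttemptInfo) => void;
}

async function retryWithBackoffSketch<T>(
  fn: () => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  let delayMs = opts.initialDelayMs;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= opts.maxAttempts) throw error; // out of attempts
      opts.onRetry?.({ attempt, delayMs, error });
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      delayMs *= 2; // assumed exponential backoff without jitter
    }
  }
}

// A caller can accumulate retry state for request metadata, for example:
function makeRetryAccumulator() {
  const state = { attempt: 1, retryTotalDelayMs: 0 };
  const onRetry = (info: RetryAttemptInfo): void => {
    state.attempt = info.attempt + 1;
    state.retryTotalDelayMs += info.delayMs;
  };
  return { state, onRetry };
}
```

Keeping the callback optional means the retry helper itself needs no telemetry dependency; only LLM call sites that care about retry visibility wire it in.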
Phase 4c — Activate recordApiRequestBreakdown() for 3 of 4 ApiRequestPhase values (~160 LOC, depends on 4a + 4b)
Call recordApiRequestBreakdown(model, [REQUEST_PREPARATION, NETWORK_LATENCY, RESPONSE_PROCESSING]) inside endLLMRequestSpan (gated by existing double-end guard so metric records exactly once per request) — Phase 4c
Skip TOKEN_PROCESSING phase (qwen-code has no architecturally distinct post-stream phase; enum value retained for future use) — Phase 4c
Governance and policy (P4)
A few gaps make rollout and troubleshooting harder than necessary:
packages/core/src/telemetry/sdk.ts currently passes the same HTTP endpoint directly to all three HTTP exporters, without Qwen Code itself making signal-specific /v1/{signal} behavior explicit. (Resolved by feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779)
target and useCollector exist in telemetry config resolution, but they do not yet clearly drive the exporter path in the SDK initialization flow. (Resolved by refactor(telemetry): remove dead useCollector setting and unreachable TelemetryTarget.QWEN #4061 — both dead settings removed entirely)
docs/developers/development/telemetry.md still says the feature requires corresponding code changes, which means docs and implementation are not fully aligned. (Resolved — feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779 added signal routing/per-signal endpoint docs; docs(telemetry): align config and docs semantics for target, outfile, and CLI flags #4066 aligned target/outfile/CLI flag semantics)
… sdk.shutdown() without explicit bounded flush/shutdown behavior. (Resolved — feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779 routed shutdown through Config.cleanup with idempotent shutdown promise; fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute #3813 added bounded flush/shutdown timeout at the SDK layer)
service.version resource attribute is set to process.version (Node.js version) instead of the application version. (Resolved by fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute #3813)
This makes telemetry appear enabled without yet being predictable, enterprise-friendly, or operationally safe enough for broader production use.
Open sub-issues:
Propagate traceparent ✅ (closed by feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 — undici instrumentation + opt-in outboundCorrelation.propagateTraceContext) + Propagate X-Qwen-Code-Session-Id ❌ (deferred from feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 — needs follow-up PR under outboundCorrelation.* namespace with threat model, host allowlist, and default-off semantics).
The remaining work stays in this parent issue as a checklist until the scope is clearer.
Tracking checklist
Foundation
… otlpTracesEndpoint, otlpLogsEndpoint, otlpMetricsEndpoint) and env var resolution (feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779)
resolveHttpOtlpUrl() for automatic signal path appending with query string preservation (feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779)
LogToSpanProcessor to bridge log records to spans for traces-only backends (feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779)
… target, useCollector, otlpEndpoint, otlpProtocol, and outfile (docs(telemetry): align config and docs semantics for target, outfile, and CLI flags #4066)
Runtime safety (P0)
service.version resource attribute (currently process.version instead of app version) (fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute #3813)
… process.on('SIGTERM'/'SIGINT'/'exit')) from telemetry init (feat(telemetry): define HTTP OTLP endpoint behavior and signal routing #3779)
… turn-spans.ts never existed as source code, only as stale dist artifacts; session-tracing.ts superseded it entirely with WeakRef + strongSpans + 30-min TTL cleanup + double-end protection + ALS clearing + try/finally safety net)
Configuration semantics (P1)
useCollector dead setting: either wire it into SDK init or remove from config/docs (refactor(telemetry): remove dead useCollector setting and unreachable TelemetryTarget.QWEN #4061)
target enum: QWEN value exists but is unreachable through config resolution (refactor(telemetry): remove dead useCollector setting and unreachable TelemetryTarget.QWEN #4061)
Add OTLP static headers support in .qwen/settings.json: won't fix. The env section in settings.json already supports setting OTEL_EXPORTER_OTLP_HEADERS (and per-signal variants like OTEL_EXPORTER_OTLP_TRACES_HEADERS), which the OTel JS SDK reads natively. Adding a dedicated telemetry.otlpHeaders field would create a redundant config path with merge-priority ambiguity. Verified working with SLS direct-ingest (gRPC + x-sls-otel-* auth headers) and ARMS HTTP endpoints.
Enterprise deployment (P2)
qwen-logger has proxy today
Deeper observability (P3)
Propagate traceparent to outgoing LLM service calls via @opentelemetry/instrumentation-undici + opt-in outboundCorrelation.propagateTraceContext: feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 (partial close of feat(telemetry): propagate W3C traceparent + X-Qwen-Code-Session-Id to LLM service calls #4384)
Propagate X-Qwen-Code-Session-Id to outgoing LLM service calls (deferred from feat(telemetry): client-side HTTP span + opt-in W3C traceparent propagation (#4384) #4390 per reviewer request; needs own design under outboundCorrelation.* namespace with threat model + default-off)
includeSensitiveSpanAttributes; complements feat(telemetry): add sensitive span attribute opt-in #3893's bridge-span coverage with native-span coverage. Adds verbatim user prompt (new_context), system prompt + hash + preview + length (full text deduped per session via SHA-256), per-tool tool_schema events (also hash-deduped), response.model_output, and tool_input / tool_result on every tool exit path (success + pre-hook block + post-hook stop + tool error + try-block cancel + catch-block cancel + execution exception). All large content truncated at 60KB with *_truncated and *_original_length metadata. Heavy serialization (safeJsonStringify on tool I/O, partToString on user prompt) guarded at the call site so it doesn't run when telemetry is off.
clearDetailedSpanState() into chat compression cleanup (feat(telemetry): add interaction span and detailed sensitive attributes #4097 follow-up), closed by fix(telemetry): clear span dedup state after chat compression (#3731) #4660. Called in GeminiChat.tryCompress() COMPRESSED branch (the single convergence point for all compression paths). Clears seenHashes so post-compaction spans re-emit full system prompt / tool schema content.
client.ts sendMessageStream (feat(telemetry): add hierarchical session tracing spans #4071)
session-tracing.ts (feat(telemetry): add hierarchical session tracing spans #4071)
withSpan('api.*') / withSpan('tool.*') in runtime with session-tracing typed helpers (feat(telemetry): unify span creation paths for hierarchical trace tree #4126)
toolContext ALS for tool sub-span parenting (feat(telemetry): unify span creation paths for hierarchical trace tree #4126; uses AsyncLocalStorage.run() not enterWith() for concurrent-safe scoping)
startLLMRequestSpan / endLLMRequestSpan in loggingContentGenerator.ts (feat(telemetry): unify span creation paths for hierarchical trace tree #4126)
startToolSpan / endToolSpan in coreToolScheduler.ts (feat(telemetry): unify span creation paths for hierarchical trace tree #4126)
startToolExecutionSpan / endToolExecutionSpan in coreToolScheduler.executeSingleToolCall (feat(telemetry): unify span creation paths for hierarchical trace tree #4126)
startLLMRequestSpan / startToolSpan fallback order: prefer active OTel span before session-root fallback (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302)
tool.execution span misreports cancellation as success when tool returns ToolResult on abort (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302; review follow-up added cancelled discriminator so exec sub-span ends UNSET like the parent instead of ERROR)
loggingContentGenerator.test.ts missing API_CALL_ABORTED_SPAN_STATUS_MESSAGE constant (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302)
safelyLogApiResponse still logs success after span timed out (fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency #4302; review follow-up extended the gate to the catch path so safelyLogApiError is also skipped after timeout)
tool.blocked_on_user span type + helper; wire into approval state machine in coreToolScheduler._schedule (moved tool span start before approval to cover validating → awaiting_approval → executing in one span) (feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321)
hook span type + helper; wire into pre/post hook execution in coreToolScheduler.executeSingleToolCall (6 fire sites wrapped via withHookSpan) (feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans (#3731) #4321)
subagent root span with context.with() isolation for concurrent subagent trace trees: feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410. Design landed independently (claude-code OTel surface is flat; opencode validates the context.with(trace.setSpan(active, span), fn) pattern). Hybrid traceId: foreground = child, fork/background = linked root + Link. Type-aware TTL exempts fork/background up to 4h.
docs/design/telemetry-llm-request-timing-design.md
LLMRequestMetadata + GenAI semconv dual-emit (~200 LOC, self-contained)
hasUserVisibleContent helper for cross-provider first-token detection on normalized GenerateContentResponse (text / functionCall / inlineData / executableCode / thought) (Phase 4a)
ttftMs in LoggingContentGenerator.generateContentStream stream wrapper using method-local closure (NEVER instance fields: shared singleton concern) (Phase 4a)
LLMRequestMetadata with ttftMs / requestSetupMs / attempt / retryTotalDelayMs / cachedInputTokens (all optional; latter three populated by Phase 4b retry layer) (Phase 4a)
endLLMRequestSpan writes ttft_ms, derived sampling_ms (clamped >= 0), output_tokens_per_second (rounded 2 decimals, omitted when sampling_ms == 0), cached_input_tokens, plus Phase 4b placeholders when provided (Phase 4a)
gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.usage.cached_tokens (Experimental), gen_ai.server.time_to_first_token (Experimental, seconds-as-double) (Phase 4a)
retryWithBackoff onRetry callback + ApiRetryEvent + 4 LLM caller wiring (merged: feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432)
onRetry?: (info: RetryAttemptInfo) => void to retryWithBackoff options; opt-in per caller so non-LLM callers (channels/weixin/src/api.ts) stay silent (Phase 4b; feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432)
ApiRetryEvent LogRecord class + logApiRetry emitter; bridges via existing LogToSpanProcessor to a child span under the active LLM span (Phase 4b; feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432)
onRetry callback at 4 LLM call sites: client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039 (Phase 4b; feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432)
attempt + retryTotalDelayMs + requestSetupMs on LLMRequestMetadata from retry-layer accumulator (Phase 4b; feat(telemetry): Phase 4b — retry visibility for qwen-code.llm_request (#3731) #4432)
recordApiRequestBreakdown() for 3 of 4 ApiRequestPhase values (~160 LOC, depends on 4a + 4b)
recordApiRequestBreakdown(model, [REQUEST_PREPARATION, NETWORK_LATENCY, RESPONSE_PROCESSING]) inside endLLMRequestSpan (gated by existing double-end guard so metric records exactly once per request) (Phase 4c)
TOKEN_PROCESSING phase (qwen-code has no architecturally distinct post-stream phase; enum value retained for future use) (Phase 4c)
Governance and policy (P4)