feat(telemetry): harden OpenTelemetry configuration, HTTP OTLP behavior, and runtime safety

## What would you like to be added?

Harden Qwen Code's OpenTelemetry implementation so it is production-ready, starting with configuration semantics, HTTP OTLP correctness, exporter safety, and shutdown reliability.

## Why is this needed?

Qwen Code already has OTLP support, but the current implementation is still closer to a minimal SDK hookup than a production-ready telemetry subsystem.

A few gaps make rollout and troubleshooting harder than necessary:

- `packages/core/src/telemetry/sdk.ts` currently passes the same HTTP endpoint directly to all three HTTP exporters, without Qwen Code itself making signal-specific `/v1/{signal}` behavior explicit. *(Resolved by #3779)*
- `target` and `useCollector` exist in telemetry config resolution, but they do not yet clearly drive the exporter path in the SDK initialization flow. *(Resolved by #4061 — both dead settings removed entirely)*
- `docs/developers/development/telemetry.md` still says the feature requires corresponding code changes, which means docs and implementation are not fully aligned. *(Resolved — #3779 added signal routing/per-signal endpoint docs; #4066 aligned target/outfile/CLI flag semantics)*
- Console exporters are still the last fallback when no OTLP endpoint or outfile is configured, which is risky in structured-output modes. *(Resolved by #3779 — console exporter fallback removed entirely)*
- Telemetry shutdown currently relies on plain `sdk.shutdown()` without explicit bounded flush/shutdown behavior. *(Resolved — #3779 routed shutdown through Config.cleanup with idempotent shutdown promise; #3813 added bounded flush/shutdown timeout at the SDK layer)*
- `service.version` resource attribute is set to `process.version` (Node.js version) instead of the application version. *(Resolved by #3813)*

This makes telemetry appear enabled without yet being predictable, enterprise-friendly, or operationally safe enough for broader production use.

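To make the first gap above concrete: a single HTTP OTLP base endpoint has to become one URL per signal. The sketch below is a minimal illustration of that routing, not the `resolveHttpOtlpUrl()` implementation from #3779. The helper name `appendSignalPath` and the decision to leave an already-suffixed URL untouched are assumptions made for illustration only.

```typescript
type Signal = 'traces' | 'logs' | 'metrics';

// Hypothetical helper (assumption): NOT the actual resolveHttpOtlpUrl() from #3779.
function appendSignalPath(baseEndpoint: string, signal: Signal): string {
  const url = new URL(baseEndpoint);
  const suffix = `/v1/${signal}`;
  // Assumption: a URL that already targets this signal is left as-is.
  if (!url.pathname.endsWith(suffix)) {
    // Trim trailing slashes so "http://host:4318/" does not yield "//v1/traces".
    url.pathname = url.pathname.replace(/\/+$/, '') + suffix;
  }
  // url.search is never modified, so any query string is preserved.
  return url.toString();
}
```

With this sketch, `appendSignalPath('https://otlp.example.com/base/?token=abc', 'logs')` keeps the `?token=abc` query string and only extends the path.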
## Additional context

Completed sub-issues:

- #3734 Define HTTP OTLP endpoint behavior and signal routing — closed by #3779
- #3811 Add bounded shutdown timeout and fix service.version resource attribute — closed by #3813
- Remove dead `useCollector` setting and unreachable `TelemetryTarget.QWEN` enum value — closed by #4061
- #4212 Phase 1.5 polish from #4126 review (fallback order, exec-span on abort-as-result, mock + log/span consistency) — closed by #4302
- Phase 2 hierarchical session tracing (`tool.blocked_on_user` + `hook` spans, tool span lifecycle move) — closed by #4321
- #4365 Custom resource attributes (`OTEL_RESOURCE_ATTRIBUTES`, `OTEL_SERVICE_NAME`, settings.json `telemetry.resourceAttributes`) + metric cardinality controls (`session.id` off Resource, opt-in toggle) — closed by #4367
- Phase 4a TTFT capture + GenAI semconv dual-emit — closed by #4417
- #4384 (partial) Client-side HTTP span (`@opentelemetry/instrumentation-undici`) + opt-in W3C `traceparent` propagation — closed partially by #4390 (traceparent done; `X-Qwen-Code-Session-Id` header deferred to follow-up under `outboundCorrelation.*` namespace)
- Wire `clearDetailedSpanState()` into chat compression cleanup (#4097 follow-up) — closed by #4660
- Phase 4b retry visibility for `qwen-code.llm_request` — closed by #4432
- Phase 3 subagent trace tree (`qwen-code.subagent` span, hybrid traceId, type-aware TTL) — closed by #4410

Merged PRs (chronological):

- #3779 (2026-05-01) — HTTP OTLP signal routing + per-signal endpoints + LogToSpanProcessor + console-exporter removal + Config.cleanup wiring
- #3813 (2026-05-05) — bounded shutdown timeout + `service.version` fix
- #3893 (2026-05-07) — sensitive span attribute opt-in (`includeSensitiveSpanAttributes`) for log-to-span bridge spans
- #3986 (2026-05-09) — suppress OpenTelemetry diagnostics from UI / route to debug log
- #3847 (2026-05-10) — inject traceId/spanId into debug log files for OTel correlation
- #4061 (2026-05-11) — remove dead `useCollector` setting and `TelemetryTarget.QWEN` enum value
- #4071 (2026-05-12) — interaction span scaffolding (lifecycle in `client.ts`, span type constants, WeakRef + 30-min TTL cleanup)
- #4058 (2026-05-13) — #3847 review follow-ups (abandoned generator idle timeout, `autoOkOnSuccess` option, sampler-aware trace flags, session fallback in LogToSpanProcessor)
- #4066 (2026-05-13) — align telemetry config and docs semantics for target / outfile / CLI flags
- #4126 (2026-05-16) — unify span creation paths for hierarchical trace tree (P3 Phase 1)
- #4097 (2026-05-16) — interaction span + detailed sensitive attributes (verbatim user prompt / system prompt / tool I/O / model output, gated by `includeSensitiveSpanAttributes`)
- #4302 (2026-05-19) — Phase 1.5 polish: fallback order, exec-span on abort-as-result, idle-timeout vs api log consistency, exec-span cancelled status
- #4321 (2026-05-21) — Phase 2: `tool.blocked_on_user` + `hook` spans. Moves tool span lifecycle from `executeSingleToolCall` to `_schedule` so `validating → awaiting_approval → executing` is one span; adds 5 blocked-span end sites + 6 wrapped hook fire sites; class-level batch-listener cleanup for shared AbortSignals; TTL safety-net stamps + log context + try/catch separation; centralised `truncateSpanError` (1KB cap, surrogate-pair safe); `hookError` plumbing for runner-contract violations; signal.aborted re-check after for-loop awaits; `handleConfirmationResponse` outer-catch moved out of `attemptExecutionOfScheduledCalls` to prevent sister-tool failures from corrupting the confirmed tool's span.
- #4367 (2026-05-21, merged commit `64401e1`) — closes #4365. Custom resource attributes (`OTEL_RESOURCE_ATTRIBUTES` / `OTEL_SERVICE_NAME` env vars now respected per OTel spec, plus settings.json `telemetry.resourceAttributes`). Metric cardinality controls: `session.id` moved off the OTel Resource (it auto-attached to every metric data point and was fanning out Prometheus / ARMS Metric time-series), gated behind a new opt-in `telemetry.metrics.includeSessionId` toggle. Reserved keys (`service.version`, `session.id`) stripped from env + settings with `diag.warn`; SDK emits a one-time console summary at init when input is dropped (per W3C Baggage spec keys are also percent-decoded). Spans and logs continue to carry `session.id` unconditionally for trace/log correlation. Design doc: `docs/design/telemetry-resource-attributes-design.md`.
- #4417 (2026-05-22) — Phase 4a: TTFT capture (`hasUserVisibleContent` cross-provider first-token detection, method-local closure in `LoggingContentGenerator`) + extended `LLMRequestMetadata` (ttftMs / requestSetupMs / attempt / retryTotalDelayMs / cachedInputTokens) + `endLLMRequestSpan` derived attrs (sampling_ms, output_tokens_per_second) + GenAI semconv dual-emit (gen_ai.request.model / gen_ai.usage.* / gen_ai.server.time_to_first_token). Design doc: `docs/design/telemetry-llm-request-timing-design.md`.
- #4390 (2026-05-25) — partially closes #4384. Client-side HTTP span via `@opentelemetry/instrumentation-undici` (separates network latency from model processing time) + OTLP feedback-loop guard (`ignoreOutgoingRequestHook` skips configured OTLP endpoints) + opt-in W3C `traceparent` propagation gated by `outboundCorrelation.propagateTraceContext` (default `false`, `NoopTextMapPropagator`). `X-Qwen-Code-Session-Id` header removed from scope per reviewer request — deferred to follow-up under `outboundCorrelation.*` namespace.
- #4556 (2026-05-29, `daemon_mode_b_main`) — daemon OTel context propagation: route-level `qwen-code.daemon.request` spans + `qwen-code.daemon.bridge` span on ACP prompt dispatch + cross-process trace context via reserved `qwen.telemetry.*` prompt metadata + ACP child context restoration for interaction span parenting. Closes #4554.
- #4628 (2026-05-30, `daemon_mode_b_main`) — daemon `qwen-code.client_id` span attribute from `X-Qwen-Client-Id` header + permission vote route spans (`POST /session/:id/permission/:requestId`, `POST /permission/:requestId`) + `addDaemonRequestAttribute()` helper for post-rebase enrichment.
- #4630 (2026-05-30, `daemon_mode_b_main`) — daemon/ACP tool span hierarchy: `startToolSpan`/`endToolSpan` + `startToolExecutionSpan`/`endToolExecutionSpan` wired into `Session.ts runTool()` + `session.id` attribute on all session-tracing spans + `logConversationFinishedEvent` at turn end + `#executeCronPrompt` wrapped in `withInteractionSpan`. Closes #4602.
- #4660 (2026-06-03) — Wire `clearDetailedSpanState()` into `GeminiChat.tryCompress()` COMPRESSED branch. Clears `seenHashes` after chat compression so post-compaction spans re-emit full system prompt / tool schema content. Closes #4097 follow-up checklist item.
- #4432 (2026-06-05) — Phase 4b: retry visibility for `qwen-code.llm_request`. Adds `onRetry` callback to `retryWithBackoff` (opt-in per caller), `ApiRetryEvent` LogRecord + `logApiRetry` 3-sink fan-out, `qwen-code.api.retry.count` Counter, wiring at 4 LLM call sites (`client.ts`, `baseLlmClient.ts` ×2, `geminiChat.ts`). Retry state propagated via `AsyncLocalStorage<RetryAttemptContext>` so `LoggingContentGenerator` can snapshot `attempt`/`requestSetupMs`/`retryTotalDelayMs` into `LLMRequestMetadata`. Also fixes Phase 4a `sampling_ms` formula bug (was double-subtracting `requestSetupMs`).
- #4410 (2026-06-05) — Phase 3: `qwen-code.subagent` span. Wraps every subagent invocation so the LLM/tool/hook spans the subagent emits become a proper subtree instead of interleaving under the parent interaction. Hybrid traceId — foreground = child span, fork/background = linked root span (new traceId + OTel `Link` back to invoker, per spec recommendation for long-running async ops). Type-aware TTL (subagent fork/background = 4h, others stay 30 min) with `qwen-code.subagent.terminate_reason='ttl_swept'` sentinel. `LogToSpanProcessor` skip-list bypasses the existing `qwen-code.subagent_execution` bridge to avoid duplicate spans (LogRecord itself stays for RUM + metrics). OTel GenAI spec attrs dual-emitted (`gen_ai.agent.id`/`gen_ai.agent.name` alongside `qwen-code.subagent.id`/`name`). `AgentContext.depth` auto-incremented inside `runWithAgentContext` for recursion-bug detection. Design doc: `docs/design/telemetry-subagent-spans-design.md`.

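As an illustration of the surrogate-pair-safe truncation mentioned for `truncateSpanError` in #4321, here is a minimal sketch. It is not the merged helper: the name `truncateSafely` is hypothetical, and it assumes the cap is counted in UTF-16 code units, whereas the real 1KB cap may well be measured in bytes. The point it shows is that a cut must not land between the two halves of a surrogate pair, otherwise the exported attribute ends in a lone, invalid code unit.

```typescript
// Hypothetical helper (assumption): the real truncateSpanError may count bytes instead.
function truncateSafely(text: string, maxUnits = 1024): string {
  if (text.length <= maxUnits) return text;
  let end = maxUnits;
  // If the last kept code unit is a high surrogate, its low partner would be
  // cut off. Drop the high surrogate too so the result stays well-formed.
  const lastKept = text.charCodeAt(end - 1);
  if (lastKept >= 0xd800 && lastKept <= 0xdbff) end -= 1;
  return text.slice(0, end);
}
```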
Open sub-issues:

- #4384 ~~Propagate `traceparent`~~ ✅ (closed by #4390 — undici instrumentation + opt-in `outboundCorrelation.propagateTraceContext`) + Propagate `X-Qwen-Code-Session-Id` ❌ (deferred from #4390 — needs follow-up PR under `outboundCorrelation.*` namespace with threat model, host allowlist, and default-off semantics).

The remaining work stays in this parent issue as a checklist until the scope is clearer.

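For readers unfamiliar with the header that #4390 made opt-in, the sketch below shows the W3C `traceparent` shape (`version-traceid-parentid-flags`) behind a default-off gate. It is an illustration only: per the PR summary the real implementation switches propagators (`NoopTextMapPropagator` by default) rather than building strings by hand, and `buildOutboundHeaders`, `SpanIds`, and `OutboundCorrelation` are hypothetical names.

```typescript
// Hypothetical types (assumption), loosely mirroring the outboundCorrelation.* settings.
interface SpanIds {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

interface OutboundCorrelation {
  propagateTraceContext?: boolean;
}

function buildOutboundHeaders(
  ids: SpanIds,
  cfg: OutboundCorrelation = {},
): Record<string, string> {
  // Default-off: nothing is attached unless the user explicitly opts in.
  if (cfg.propagateTraceContext !== true) return {};
  const flags = ids.sampled ? '01' : '00';
  // W3C Trace Context format, version "00".
  return { traceparent: `00-${ids.traceId}-${ids.spanId}-${flags}` };
}
```

Calling `buildOutboundHeaders(ids)` without settings returns an empty object, which matches the default-off semantics the deferred `X-Qwen-Code-Session-Id` work is also expected to follow.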
## Tracking checklist

### Foundation

- [x] #3734 Define HTTP OTLP endpoint behavior and signal routing (#3779)
- [x] Add per-signal endpoint overrides (`otlpTracesEndpoint`, `otlpLogsEndpoint`, `otlpMetricsEndpoint`) and env var resolution (#3779)
- [x] Add `resolveHttpOtlpUrl()` for automatic signal path appending with query string preservation (#3779)
- [x] Add `LogToSpanProcessor` to bridge log records to spans for traces-only backends (#3779)
- [x] Align telemetry config and docs semantics for `target`, `useCollector`, `otlpEndpoint`, `otlpProtocol`, and `outfile` (#4066)

### Runtime safety (P0)

- [x] #3811 Add bounded flush/shutdown timeout at SDK layer with fail-open behavior (#3813)
- [x] #3811 Fix `service.version` resource attribute (currently `process.version` instead of app version) (#3813)
- [x] Remove console exporter fallback when no OTLP endpoint or outfile is configured (#3779)
- [x] Restrict console/debug exporters in structured-output or non-interactive modes (#3986 — OTel SDK diagnostics routed to debug log instead of console; console exporter fallback already removed by #3779)
- [x] Route SDK shutdown through Config.cleanup with idempotent shutdown promise (#3779)
- [x] Remove async process signal handlers (`process.on('SIGTERM'/'SIGINT'/'exit')`) from telemetry init (#3779)
- [x] Add TTL cleanup and WeakRef for span lifecycle safety (#4071 — `turn-spans.ts` never existed as source code, only as stale dist artifacts; `session-tracing.ts` superseded it entirely with WeakRef + strongSpans + 30-min TTL cleanup + double-end protection + ALS clearing + try/finally safety net)
- [x] Inject traceId/spanId into debug log files for OTel correlation (#3847; review follow-ups in #4058)

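The bounded, idempotent shutdown behaviour ticked off above (#3779, #3813) can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged code: `createBoundedShutdown`, the `Shutdownable` interface, and the 5000 ms default are hypothetical, and the issue does not state the actual timeout value.

```typescript
// Hypothetical stand-in (assumption) for anything with an async shutdown(), e.g. an OTel SDK.
interface Shutdownable {
  shutdown(): Promise<void>;
}

function createBoundedShutdown(
  sdk: Shutdownable,
  timeoutMs = 5000, // assumed default, not taken from the source
): () => Promise<void> {
  let pending: Promise<void> | undefined;
  return () => {
    // Idempotent: repeated calls share the first shutdown attempt.
    if (pending) return pending;
    pending = new Promise<void>((resolve) => {
      // Bounded: resolve after timeoutMs even if the exporter hangs.
      const timer = setTimeout(resolve, timeoutMs);
      // The async wrapper turns a synchronous throw into a rejection.
      (async () => sdk.shutdown())()
        .catch(() => undefined) // fail open: telemetry errors never block exit
        .finally(() => {
          clearTimeout(timer);
          resolve();
        });
    });
    return pending;
  };
}
```

The returned promise never rejects, so a caller such as a cleanup routine can simply `await` it without its own error handling.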
### Configuration semantics (P1)

- [x] Clean up `useCollector` dead setting — either wire it into SDK init or remove from config/docs (#4061)
- [x] Clean up `target` enum — `QWEN` value exists but is unreachable through config resolution (#4061)
- [x] ~~Add OTLP static headers support in `.qwen/settings.json`~~ — **Won’t fix.** The `env` section in `settings.json` already supports setting `OTEL_EXPORTER_OTLP_HEADERS` (and per-signal variants like `OTEL_EXPORTER_OTLP_TRACES_HEADERS`), which the OTel JS SDK reads natively. Adding a dedicated `telemetry.otlpHeaders` field would create a redundant config path with merge-priority ambiguity. Verified working with SLS direct-ingest (gRPC + `x-sls-otel-*` auth headers) and ARMS HTTP endpoints.

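To illustrate the existing configuration path described in the won't-fix item, the sketch below models the `env` section as a plain object and shows how a `key=value,key=value` header string breaks down into individual headers. It is not the OTel JS SDK's parser. The `settings` shape, the header names and values, and `parseOtlpHeaders` are hypothetical, and the percent-decoding step is an assumption about how values are encoded.

```typescript
// Hypothetical excerpt (assumption) of what the `env` section could contain.
const settings = {
  env: {
    OTEL_EXPORTER_OTLP_HEADERS: 'authorization=Bearer%20abc123,x-tenant=team-a',
  },
};

// Illustrative parser (assumption): comma-separated key=value pairs.
function parseOtlpHeaders(raw: string): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const pair of raw.split(',')) {
    const idx = pair.indexOf('=');
    if (idx <= 0) continue; // skip entries with no key or no "="
    const key = pair.slice(0, idx).trim();
    const value = pair.slice(idx + 1).trim();
    try {
      headers[key] = decodeURIComponent(value);
    } catch {
      headers[key] = value; // keep the raw value if it is not valid percent-encoding
    }
  }
  return headers;
}
```

Because the variable is already honoured through `env`, a separate `telemetry.otlpHeaders` field would give the same headers a second source, which is the merge-priority ambiguity the item refers to.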
### Enterprise deployment (P2)

- [ ] Add proxy support for OTLP HTTP/gRPC exporters (only `qwen-logger` has proxy today)
- [ ] Add mTLS / custom CA certificate support

### Deeper observability (P3)

- [x] Add resource attribute policy and cardinality controls — #4365 (closed by #4367)
- [x] ~~Propagate `traceparent`~~ to outgoing LLM service calls via `@opentelemetry/instrumentation-undici` + opt-in `outboundCorrelation.propagateTraceContext` — #4390 (partial close of #4384)
- [ ] Propagate `X-Qwen-Code-Session-Id` to outgoing LLM service calls (deferred from #4390 per reviewer request — needs own design under `outboundCorrelation.*` namespace with threat model + default-off)
- [x] Wire detailed sensitive span attributes onto hierarchical spans (#4097) — gated by `includeSensitiveSpanAttributes`; complements #3893's bridge-span coverage with native-span coverage. Adds verbatim user prompt (`new_context`), system prompt + hash + preview + length (full text deduped per session via SHA-256), per-tool `tool_schema` events (also hash-deduped), `response.model_output`, and `tool_input` / `tool_result` on every tool exit path (success + pre-hook block + post-hook stop + tool error + try-block cancel + catch-block cancel + execution exception). All large content truncated at 60KB with `*_truncated` and `*_original_length` metadata. Heavy serialization (`safeJsonStringify` on tool I/O, `partToString` on user prompt) guarded at the call site so it doesn't run when telemetry is off.
- [x] Wire `clearDetailedSpanState()` into chat compression cleanup (#4097 follow-up) — closed by #4660. Called in `GeminiChat.tryCompress()` COMPRESSED branch (the single convergence point for all compression paths). Clears `seenHashes` so post-compaction spans re-emit full system prompt / tool schema content.
- [ ] Wire hierarchical session tracing spans into runtime — see [design doc](docs/design/workflow-tracing-gaps.md)
  - [x] Interaction span lifecycle in `client.ts` `sendMessageStream` (#4071)
  - [x] Span type constants + typed helper functions defined and exported in `session-tracing.ts` (#4071)
  - [x] WeakRef + strongSpans + 30-min TTL cleanup + double-end protection (#4071)
  - **Phase 1 — Unify span creation paths and fix the trace tree structure (merged: #4126)**
  - [x] Fix parent-child wiring — replace `withSpan('api.*')` / `withSpan('tool.*')` in runtime with session-tracing typed helpers (#4126)
  - [x] Add `toolContext` ALS for tool sub-span parenting (#4126; uses `AsyncLocalStorage.run()` not `enterWith()` for concurrent-safe scoping)
  - [x] Wire `startLLMRequestSpan` / `endLLMRequestSpan` in `loggingContentGenerator.ts` (#4126)
  - [x] Wire `startToolSpan` / `endToolSpan` in `coreToolScheduler.ts` (#4126)
  - [x] Wire `startToolExecutionSpan` / `endToolExecutionSpan` in `coreToolScheduler.executeSingleToolCall` (#4126)
  - **Phase 1.5 — polish from late review rounds (#4212, PR #4302)**
  - [x] `startLLMRequestSpan`/`startToolSpan` fallback order — prefer active OTel span before session-root fallback (#4302)
  - [x] `tool.execution` span misreports cancellation as success when tool returns ToolResult on abort (#4302; review follow-up added `cancelled` discriminator so exec sub-span ends UNSET like the parent instead of ERROR)
  - [x] Test mock for `loggingContentGenerator.test.ts` missing `API_CALL_ABORTED_SPAN_STATUS_MESSAGE` constant (#4302)
  - [x] Stream idle timeout → `safelyLogApiResponse` still logs success after span timed out (#4302; review follow-up extended the gate to the catch path so `safelyLogApiError` is also skipped after timeout)
  - **Phase 2 — Fill in the missing workflow-stage spans (merged: #4321)**
  - [x] Add `tool.blocked_on_user` span type + helper; wire into approval state machine in `coreToolScheduler._schedule` (moved tool span start before approval to cover validating → awaiting_approval → executing in one span) (#4321)
  - [x] Add `hook` span type + helper; wire into pre/post hook execution in `coreToolScheduler.executeSingleToolCall` (6 fire sites wrapped via `withHookSpan`) (#4321)
  - **Phase 3 — Subagent trace tree (1 PR, depends on Phase 1; merged: #4410)**
  - [x] Add `subagent` root span with `context.with()` isolation for concurrent subagent trace trees — #4410. Design landed independently (claude-code OTel surface is flat; opencode validates the `context.with(trace.setSpan(active, span), fn)` pattern). Hybrid traceId: foreground = child, fork/background = linked root + `Link`. Type-aware TTL exempts fork/background up to 4h.
  - **Phase 4 — LLM request timing breakdown (split into 4a/4b/4c — #4413, design doc: [`docs/design/telemetry-llm-request-timing-design.md`](../blob/main/docs/design/telemetry-llm-request-timing-design.md))**
  - **Phase 4a — TTFT capture + extended `LLMRequestMetadata` + GenAI semconv dual-emit (~200 LOC, self-contained)**
  - [x] Add `hasUserVisibleContent` helper for cross-provider first-token detection on normalized `GenerateContentResponse` (text / functionCall / inlineData / executableCode / thought) — Phase 4a
  - [x] Capture `ttftMs` in `LoggingContentGenerator.generateContentStream` stream wrapper using method-local closure (NEVER instance fields — shared singleton concern) — Phase 4a
  - [x] Extend `LLMRequestMetadata` with `ttftMs` / `requestSetupMs` / `attempt` / `retryTotalDelayMs` / `cachedInputTokens` (all optional; latter three populated by Phase 4b retry layer) — Phase 4a
  - [x] `endLLMRequestSpan` writes `ttft_ms`, derived `sampling_ms` (clamped >= 0), `output_tokens_per_second` (rounded 2 decimals, omitted when sampling_ms == 0), `cached_input_tokens`, plus Phase 4b placeholders when provided — Phase 4a
  - [x] GenAI semconv dual-emit (private name authoritative, semconv = compat layer per #4410 precedent): `gen_ai.request.model`, `gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`, `gen_ai.usage.cached_tokens` (Experimental), `gen_ai.server.time_to_first_token` (Experimental, seconds-as-double) — Phase 4a
  - **Phase 4b — `retryWithBackoff` `onRetry` callback + `ApiRetryEvent` + 4 LLM caller wiring (merged: #4432)**
  - [x] Add `onRetry?: (info: RetryAttemptInfo) => void` to `retryWithBackoff` options; opt-in per caller so non-LLM callers (`channels/weixin/src/api.ts`) stay silent — Phase 4b (#4432)
  - [x] Add `ApiRetryEvent` LogRecord class + `logApiRetry` emitter; bridges via existing `LogToSpanProcessor` to a child span under the active LLM span — Phase 4b (#4432)
  - [x] Wire `onRetry` callback at 4 LLM call sites: `client.ts:1540`, `baseLlmClient.ts:193,282`, `geminiChat.ts:1039` — Phase 4b (#4432)
  - [x] Populate `attempt` + `retryTotalDelayMs` + `requestSetupMs` on `LLMRequestMetadata` from retry-layer accumulator — Phase 4b (#4432)
  - **Phase 4c — Activate `recordApiRequestBreakdown()` for 3 of 4 `ApiRequestPhase` values (~160 LOC, depends on 4a + 4b)**
  - [ ] Call `recordApiRequestBreakdown(model, [REQUEST_PREPARATION, NETWORK_LATENCY, RESPONSE_PROCESSING])` inside `endLLMRequestSpan` (gated by existing double-end guard so metric records exactly once per request) — Phase 4c
  - [ ] Skip `TOKEN_PROCESSING` phase (qwen-code has no architecturally distinct post-stream phase; enum value retained for future use) — Phase 4c

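The derived timing attributes listed under Phase 4a can be sketched as below. The clamping, the two-decimal rounding, and the omission of the rate when `sampling_ms` is 0 come from the checklist. The formula `sampling_ms = durationMs - ttftMs` is an assumption, since this issue only says the value is derived and that #4432 corrected a double subtraction of `requestSetupMs`. `deriveTiming` and both interfaces are hypothetical names.

```typescript
// Hypothetical input/output shapes (assumption), using the attribute names from the checklist.
interface TimingInput {
  durationMs: number;
  ttftMs?: number;
  outputTokens?: number;
}

interface DerivedTiming {
  ttft_ms?: number;
  sampling_ms?: number;
  output_tokens_per_second?: number;
}

function deriveTiming(input: TimingInput): DerivedTiming {
  const out: DerivedTiming = {};
  if (input.ttftMs === undefined) return out; // non-streaming call: nothing to derive
  out.ttft_ms = input.ttftMs;
  // ASSUMPTION: sampling time is total duration minus time to first token.
  const samplingMs = Math.max(0, input.durationMs - input.ttftMs); // clamped >= 0
  out.sampling_ms = samplingMs;
  // Omit the rate when sampling_ms is 0 to avoid dividing by zero.
  if (samplingMs > 0 && input.outputTokens !== undefined) {
    const perSecond = input.outputTokens / (samplingMs / 1000);
    out.output_tokens_per_second = Math.round(perSecond * 100) / 100;
  }
  return out;
}
```

For example, a request with `durationMs: 2500`, `ttftMs: 500`, and `outputTokens: 100` yields `sampling_ms: 2000` and `output_tokens_per_second: 50` under this assumed formula.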
### Governance and policy (P4)

- [ ] Add privacy / non-essential traffic policy
- [ ] Clarify customer OTLP telemetry vs first-party usage reporting
- [ ] Add dynamic OTLP auth helper support if needed
