fix(telemetry): address PR #3847 review follow-ups for trace correlation#4058
Conversation
Addresses unresolved review feedback from PR #3847: - Respect OTEL_TRACES_SAMPLER env var when setting TraceFlags on the synthetic session root, so custom samplers (e.g. traceidratio) are not bypassed by forced SAMPLED flag - Store current session ID in session-context.ts and use it as a fallback in LogToSpanProcessor when the OTel Resource session.id attribute is stale after /clear or /resume - Wrap for-await loop body in runInSpan() so debug logs emitted during stream iteration see the stream span as active - Add autoOkOnSuccess option to withSpan, eliminating the need for the load-bearing setStatus(UNSET) hack in cancellation paths - Add defensive 5-minute timeout for stream spans to prevent leaks from abandoned generators 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
There was a problem hiding this comment.
Pull request overview
This PR follows up on PR #3847 to improve OpenTelemetry trace/log correlation and span lifecycle correctness across session changes, tool execution, and streaming API calls.
Changes:
- Add
withSpan(..., { autoOkOnSuccess })to avoid implicit OK statuses for multi-outcome spans (notably tool execution/cancellation). - Track current session ID in
session-contextand use it as a fallback for log→span trace correlation whensession.idis absent. - Improve streaming correlation by running chunk-processing inside the stream span context and adding a defensive timeout to prevent abandoned stream spans from leaking.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/telemetry/tracer.ts | Adds WithSpanOptions and sampler-aware session root traceFlags logic. |
| packages/core/src/telemetry/tracer.test.ts | Adds tests for autoOkOnSuccess and sampler-driven traceFlags. |
| packages/core/src/telemetry/session-context.ts | Stores and exposes currentSessionId alongside the session root context. |
| packages/core/src/telemetry/session-context.test.ts | Adds coverage for getCurrentSessionId(). |
| packages/core/src/telemetry/sdk.ts | Passes session ID into setSessionContext on init/refresh. |
| packages/core/src/telemetry/log-to-span-processor.ts | Falls back to getCurrentSessionId() when session.id is missing on a log record. |
| packages/core/src/telemetry/log-to-span-processor.test.ts | Adds tests validating the new session-id fallback behavior. |
| packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts | Wraps stream loop body in span context and adds a max-duration timeout for stream spans. |
| packages/core/src/core/coreToolScheduler.ts | Removes the prior “setStatus(UNSET)” workaround and explicitly sets OK on success while using autoOkOnSuccess: false. |
| packages/core/src/core/coreToolScheduler.test.ts | Updates span-status expectations for cancellation and the new autoOkOnSuccess behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
- Fix sdk.test.ts assertion to match new two-arg setSessionContext call - Add spanEnded guard to prevent double span.end() when timeout fires - Add .trim() to OTEL_TRACES_SAMPLER env var for robustness - Pass sessionId in debugLogger.test.ts for signature completeness - Clarify log-to-span-processor comment: fallback covers "missing" not "stale" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix sdk.test.ts assertion to match new two-arg setSessionContext call - Add spanEnded guard to prevent double span.end() when timeout fires - Add .trim() to OTEL_TRACES_SAMPLER env var for robustness - Pass sessionId in debugLogger.test.ts for signature completeness - Clarify log-to-span-processor comment: fallback covers "missing" not "stale" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
689fe8d to
7889584
Compare
- Fix shouldForceSampled to force SAMPLED for all parentbased_* samplers, not just parentbased_always_on — parentbased samplers delegate to localParentNotSampled (default AlwaysOff) when parent has NONE, which silently drops all traces - Convert stream span timeout from fixed wall-clock to idle timeout: reset timer on each chunk so legitimately long streams are never affected; timeout only fires when no chunks arrive for 5 minutes - Add test for parentbased_traceidratio → TraceFlags.SAMPLED
Prevent zombie timer accumulation when chunks arrive after idle timeout has already ended the span.
- shouldForceSampled() now returns false for parentbased_always_off, preventing silent over-sampling that contradicts user intent. - Updated JSDoc to document the always_on exception for non-parentbased samplers. - Added test for parentbased_always_off → TraceFlags.NONE. - Added test for success path resilience when safeSetStatus throws in coreToolScheduler.
- Swap spanEnded/span.end() order so finally block can retry if end() throws. - Clear idle timeout immediately after for-await loop exits, before post-loop processing. - Use || instead of ?? for session.id fallback to handle empty strings.
wenshao
left a comment
There was a problem hiding this comment.
Nice cleanup of the load-bearing setStatus(UNSET) hack — the autoOkOnSuccess option plus explicit success-path safeSetStatus is the right shape, and test coverage is thorough across all five concerns. A few inline notes on edge cases and minor nits; overall LGTM.
- Expand shouldForceSampled JSDoc: document env-var assumption, parentbased_traceidratio 100% sampling semantics. - Remove unnecessary runInSpan wrapper around chunk field assignments. - Add stream.timed_out span attribute when idle timeout fires. - Deduplicate config.getSessionId() call in initializeTelemetry. - Add tests for always_on and always_off sampler values.
- Move safeSetStatus before setAttribute in timeout callback so ERROR status is set even if setAttribute throws. - Set spanEnded=true before span.end() so finally block never overwrites ERROR with OK if end() throws. - Add comment explaining || vs ?? choice for session.id fallback.
| code: SpanStatusCode.ERROR, | ||
| message: 'Stream span timed out (idle)', | ||
| }); | ||
| spanEnded = true; |
There was a problem hiding this comment.
[Suggestion] spanEnded = true is set before span.setAttribute() and span.end() in the same try block. If span.setAttribute() throws, span.end() is unreachable, yet spanEnded is already true — so the finally block's if (!spanEnded) guard skips its own span?.end(), leaking the span.
| spanEnded = true; | |
| spanEndTimeout = setTimeout(() => { | |
| spanEnded = true; | |
| try { | |
| safeSetStatus(span, { | |
| code: SpanStatusCode.ERROR, | |
| message: 'Stream span timed out (idle)', | |
| }); | |
| } catch { | |
| // OTel errors must not interrupt the consumer. | |
| } | |
| try { | |
| span.setAttribute('stream.timed_out', true); | |
| } catch { | |
| // OTel errors must not interrupt the consumer. | |
| } | |
| try { | |
| span.end(); | |
| } catch { | |
| // OTel errors must not interrupt the consumer. | |
| } | |
| }, STREAM_IDLE_TIMEOUT_MS); |
— DeepSeek/deepseek-v4-pro via Qwen Code /review
| export function setSessionContext(ctx: Context | undefined): void { | ||
| export function setSessionContext( | ||
| ctx: Context | undefined, | ||
| sessionId?: string, |
There was a problem hiding this comment.
[Suggestion] The sessionId parameter is optional. If any future caller invokes setSessionContext(ctx) without the second argument, currentSessionId is silently set to undefined, breaking the LogToSpanProcessor fallback and scattering log events across random trace IDs. All current callers pass both arguments, but the type signature invites the mistake.
| sessionId?: string, | |
| export function setSessionContext( | |
| ctx: Context | undefined, | |
| sessionId?: string, | |
| ): void { | |
| sessionRootContext = ctx; | |
| if (arguments.length >= 2) { | |
| currentSessionId = sessionId; | |
| } | |
| } |
Alternatively, make sessionId required or add a separate setCurrentSessionId() function to decouple the two pieces of state.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
…rrelation (QwenLM#4058) - Respect OTEL_TRACES_SAMPLER env var when setting TraceFlags - Store current session ID in session-context.ts as fallback in LogToSpanProcessor - Wrap for-await loop body in runInSpan() for debug logs during stream iteration - Add autoOkOnSuccess option to withSpan - Add defensive 5-minute idle timeout for stream spans to prevent leaks Cherry-picked from upstream QwenLM/qwen-code commit 4bb8dc8. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- IDE context wrapped in <system-reminder> block and prepended to user request instead of inserting a separate user history entry - prependToFirstTextPart() and escapeClosingSystemReminderTags() added - escapeSystemReminderTags() neutralizes obfuscated tag variants - forceFullIdeContext reset on stream errors for correct resend - BaseLlmClient.clearPerModelGeneratorCache() restored on resetChat - New utilities: partUtils.ts (prependToFirstTextPart), xml.ts (escapeSystemReminderTags) Cherry-picked from upstream QwenLM/qwen-code commit 796de4d. Resolved conflict in coreToolScheduler.test.ts (merged with QwenLM#4058 changes). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Addresses unresolved review feedback from PR #3847 (feat(telemetry): inject traceId/spanId into debug log files for OTel correlation):
createSessionRootContextnow readsOTEL_TRACES_SAMPLERenv var — mostparentbased_*samplers getSAMPLED(parent flag gates children vialocalParentNotSampled),parentbased_always_offgetsNONEto respect user intent, non-parentbased samplers (e.g.traceidratio) getNONEso the sampler evaluates each span independentlygetCurrentSessionId()tosession-context.ts;LogToSpanProcessor.onEmitfalls back to it when the log record has nosession.idattribute (e.g. after a session change via/clearor/resume). Uses||(not??) so empty-stringsession.idalso triggers fallbackrunInSpan()so debug logs emitted during stream processing see the stream span as active. Chunk field-assignment block left unwrapped (no log emission)setStatus(UNSET)hack: AddedautoOkOnSuccessoption towithSpan; tool execution span usesautoOkOnSuccess: falseso callers explicitly set status in each path — eliminates the load-bearingsetStatus(UNSET)patternstream.timed_out=trueattribute + ERROR status. Includes multi-layer guards:spanEndedflag (set beforespan.end()), zombie timer prevention (if (spanEnded) return), immediateclearTimeoutafter for-await loop exitVerification
Test plan
tracer.test.ts: 32 tests (autoOkOnSuccess, sampler-aware trace flags forparentbased_always_on/off,parentbased_traceidratio,always_on,always_off,traceidratio)session-context.test.ts: 7 tests (getCurrentSessionId)log-to-span-processor.test.ts: 32 tests (session fallback withgetCurrentSessionId)coreToolScheduler.test.ts: 112 tests (cancellation status, success path resilience whensafeSetStatusthrows)loggingContentGenerator.test.ts: 30 testssdk.test.ts: 27 tests (setSessionContext two-arg signature)debugLogger.test.ts: 22 teststsc --noEmit— no new type errorsManual smoke test (requires OTel collector)
If a local OTel collector (e.g. Jaeger) is available:
TraceFlags.SAMPLEDparentbased_always_onOTEL_TRACES_SAMPLER=parentbased_always_onparentbased_always_offOTEL_TRACES_SAMPLER=parentbased_always_offparentbased_traceidratioOTEL_TRACES_SAMPLER=parentbased_traceidratioalways_offOTEL_TRACES_SAMPLER=always_off/clearthen new conversationSTREAM_IDLE_TIMEOUT_MSto 5s)stream.timed_out=true+ ERROR🤖 Generated with Qwen Code