fix(telemetry): attach interaction span to session root context#4499
Conversation
startInteractionSpan was the only startSpan call site missing the ctx argument. OTel responded by minting a fresh random trace id, splitting the interaction span away from its semantic children (llm_request / tool / tool.execution), which DO inherit the sessionId-derived trace id. The result: any session producing an interaction span ends up with two unrelated trace ids — one one-span trace for the interaction, one multi-span trace for everything else — and the hierarchy renders as disconnected items in trace viewers. Pin the interaction span to the session root context. Not via resolveParentContext() — that prefers any active OTel span over the session root, and interaction spans are turn boundaries that must always anchor at the session root, regardless of whatever wrapping span happens to be active in OTel context. Also reset session-context module state inside clearSessionTracingForTesting() so a test that sets a fake session root cannot leak into the next test. Adds regression coverage on interaction span parentContext, including the case where an unrelated OTel span is active to confirm we bypass resolveParentContext(). Fixes #4486
📋 Review SummaryThis PR fixes a critical telemetry bug where the 🔍 General Feedback
🎯 Specific Feedback🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
This PR fixes telemetry trace fragmentation by ensuring the qwen-code.interaction span is created with the session root OpenTelemetry context as its parent, so it shares the deterministic sessionId-derived trace id used by its child spans. It also hardens test isolation by resetting the session-context module state in the existing test cleanup helper.
Changes:
- Pass the session root context (
getSessionContext()) as the parent when starting theqwen-code.interactionspan so it stays in the same trace asllm_request/tool/tool.executionspans. - Reset
session-contextmodule state inclearSessionTracingForTesting()to prevent cross-test leakage. - Add regression tests asserting the interaction span parents to the session root even when an unrelated active OTel span exists.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/core/src/telemetry/session-tracing.ts | Anchors startInteractionSpan to the session root context and resets session-context state in test cleanup. |
| packages/core/src/telemetry/session-tracing.test.ts | Adds regression coverage ensuring interaction spans use the session root context as parent. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
|
Let me check a couple of things in the diff context before writing the review. </tool_calls> Qwen Code |
|
Thanks for the review — going through both Low items: Low #1 ( Low #2 (unused |
Qwen Code Review (LIGHT)What this PR does: Fixes a trace-context bug where No issues found in the visible diff at this tier. Qwen Code |
Per CLAUDE.md guidance on comment density: keep only the WHY a future maintainer cannot derive from the code or the issue link. - session-tracing.ts: condense the 8-line interaction-span comment to the single load-bearing point — resolveParentContext() is deliberately bypassed and should not be "simplified" back into it. - clearSessionTracingForTesting hygiene comment: one line stating the cross-module reach, no restatement of the test-leak rationale. - session-tracing.test.ts: drop the multi-line describe header and the inline rationale in the second test; the describe / it names + the production comment already convey intent. Move #4486 into the describe name to match the existing `(#NNNN review)` convention used elsewhere in this file. No behavior change. All 72 tests still pass.
|
Let me verify a few claims in the diff before producing the review. Qwen Code Review (LIGHT)What this PR does (1–2 sentences): Fixes No issues found in the visible diff at this tier. Notes:
Qwen Code |
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅
Surgical fix — the only startSpan call site missing the parent ctx argument is now pinned to the session root context via getSessionContext() ?? otelContext.active(). The inline comment correctly explains why resolveParentContext() is intentionally bypassed (interaction spans are turn boundaries). Two new tests verify both the happy path and the active-span-bypass case. clearSessionTracingForTesting cleanup widened to include setSessionContext(undefined) — defensive and correct.
All 72 tests pass, tsc and eslint clean. CI green 8/8.
— qwen3.7-max via Qwen Code /review
Qwen Code Review (LIGHT)What this PR does: Fixes a telemetry bug where No issues found in the visible diff at this tier. The fix is correct and well-scoped:
Qwen Code |
| }); | ||
| }); | ||
|
|
||
| describe('interaction span — trace context (#4486)', () => { |
There was a problem hiding this comment.
[Suggestion] No test covers the ?? otelContext.active() fallback branch in startInteractionSpan. Both new tests call setSessionContext(fakeRoot, 'test-session') before invoking startInteractionSpan, so getSessionContext() always returns a defined value.
This fallback fires during early startup (before session context is established) or after /clear / /resume. If otelContext.active() returns a context carrying a stale span from a previous session, the interaction span would silently attach to the wrong parent — the same class of bug this PR fixes, but in a different scenario.
Consider adding a third test case:
it('falls back to otelContext.active() when no session context is set', () => {
// clearSessionTracingForTesting() in beforeEach already left
// getSessionContext() === undefined — do NOT call setSessionContext.
mockState.activeOtelSpan = { name: 'active-session-span' };
startInteractionSpan(createMockConfig({ sessionId: 'test-session' }), {
promptId: 'p',
model: 'm',
messageType: 'userQuery',
});
const span = mockSpans.find((s) => s.name === 'qwen-code.interaction');
expect(span?.parentContext).toEqual(
expect.objectContaining({ __activeSpan: expect.anything() }),
);
});— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Good catch — added in e0b2ac6. The third test starts with getSessionContext() undefined, sets mockState.activeOtelSpan to a stale span, and asserts interaction.parentContext carries __activeSpan. So the ?? otelContext.active() branch is now pinned and a future refactor that drops it will fail the test instead of silently re-introducing #4486 in a /clear / /resume scenario.
…ionSpan The two existing #4486 regression tests both call setSessionContext() before invoking startInteractionSpan, so they never exercise the nullish-coalescing fallback `getSessionContext() ?? otelContext.active()`. This third case starts with no session context set (the beforeEach hook already leaves getSessionContext() === undefined), an unrelated active OTel span in the mock context, and asserts the interaction span attaches to that active span rather than silently being created with no parent. The fallback is essentially unreachable in production (sdk.ts seeds session context during SDK init before any interaction can run), but it exists as defense against startup edge cases / `/clear` / `/resume` paths where session context could be transiently undefined while a stale span lingers in the OTel async context. Without this test, a future refactor could delete the `?? otelContext.active()` clause and silently re-introduce the same class of bug #4486 fixed, in a different scenario. Suggestion from wenshao on PR #4499.
Three style nits caught by /simplify against the freshly-added test:
- Use `toMatchObject({ __activeSpan: fakeActive })` with a captured
reference instead of `toEqual(expect.objectContaining({ __activeSpan:
expect.anything() }))`. The latter would pass even if production code
returned `__activeSpan: undefined`; the former gives referential proof
that the actual active span propagated through. Matches the existing
three re-parent tests in this file (lines 395, 669, 741).
- Use `{ kind: 'fake-active-span' }` for the active span placeholder to
match the same convention.
- Collapse the 2-line setup comment to 1 line — the "no
setSessionContext call" directive is the load-bearing bit; the
"beforeEach left undefined" half was restating beforeEach.
No behavior change. 73 tests still pass.
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI still running. — qwen3.7-max via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
No high-confidence findings. 3 low-confidence suggestions left in terminal for human review. LGTM! ✅ — qwen3.7-max via Qwen Code /review
Local Verification Report — PR #4499Verdict: ✅ Verified — safe to merge. Tested against PR head 1. Bug reproduction with real production trace IDUsed the same The derived trace id 2. Unit tests — bug visibility testReverted only Restored the PR version → all 73 pass. So the new tests genuinely lock in this fix and would catch a regression. 3. Full PR test suite (merged tree)No regression in the rest of the telemetry surface. 4. Build / typecheck / lint
5. CI statusGitHub CI: Notes for reviewers
Recommend merge. |
LaZzyMan
left a comment
There was a problem hiding this comment.
Review
One-call-site semantic fix: the interaction-span path was the only startSpan in session-tracing.ts that didn't pass a parent context, so OTel was minting a fresh random trace id for the interaction and severing it from the rest of the sessionId-derived trace. Pinning it to the session root context (with a fallback to otelContext.active() for paths that haven't established one) puts the whole turn back under a single trace. The deliberate bypass of resolveParentContext is the right call given an interaction span is a turn boundary — chaining it under wrapping spans (subagents, user-code wrappers) would defeat the purpose. The new tests cover the three behaviors the fix introduces (session-root happy path, session-root wins over an unrelated active span, fallback when no session context is set), and the clearSessionTracingForTesting hygiene change closes a real cross-test leak. PR description scopes out the orthogonal ALS-propagation issue and the multi-session race tracked under #4410 cleanly.
Verdict
APPROVE — narrow, well-tested, with a real production-traced repro and explicit non-scope.
| attributes, | ||
| }); | ||
| // Pin to session root directly — resolveParentContext() would prefer | ||
| // any active OTel span, but interaction is a turn boundary (#4486). |
There was a problem hiding this comment.
[Suggestion] The comment explains what the bypass does but not why it matters. A future maintainer scanning this file sees resolveParentContext() used at every other span-creation site and may "simplify" this call to use it too — silently re-introducing the exact trace-tree detachment bug this fix addresses.
Consider expanding the comment to mention the consequence:
// IMPORTANT: Do NOT use resolveParentContext() here. Interaction spans
// are turn boundaries and must always be direct children of the session
// root. resolveParentContext() would nest the interaction under any
// active OTel span, detaching it from the session trace ID (#4486).— qwen3.7-max via Qwen Code /review
Summary
Fixes #4486.
startInteractionSpaninpackages/core/src/telemetry/session-tracing.tswas the onlystartSpancall site missing the parentctxargument. OTel responded by minting a fresh random trace id, splitting the interaction span off from its sessionId-derived children. Pin it to the session root context so the whole turn lands in one trace.Trace tree — before / after
Before (interaction span isolated in its own one-span trace):
After (interaction anchors at session root, single trace):
(
tool.blocked_on_userchains under its tool span;hookchains under whichever of tool / interaction is active. Their relative parentage is unchanged by this PR — they appear in the diagram as "...other session spans" because the only thing this fix changes is whether the whole tree lives in the sessionId-derived trace or splinters off into a random one.)Repro (real production data, from issue #4486)
Session
33878ff9-1467-43ef-a692-634a21bbb1bf(cn-beijing, 2026-05-25 09:53–09:56) produced two distinct trace ids for the same session:4acca441dba8071e51177f7c82fe0d07llm_request+toolspans54dbac5a0e1437165971820716388282qwen-code.interactionspan onlyBoth carry the same
session.idtag but share no OTel trace context. With this fix, the interaction span derives its trace id from the samesha256(sessionId)[:32]derivation asLogToSpanProcessor, so all spans in a session land under one trace.Why this is a 1-call-site fix
Six
getTracer().startSpancalls insession-tracing.ts. Five (llm_request,tool,tool.execution,tool.blocked_on_user,hook) already passctx. Onlyinteractionwas missing it — likely an oversight from PR #4126 which fixed the children's context propagation but didn't touch the parent.Why not
resolveParentContext()?resolveParentContextprefers any active OTel span over the session root (priority 2 vs 3). For sub-spans (LLM, tool) that's correct — a side-query LLM call nested inside a tool span should chain to the tool. But interaction spans are turn boundaries: they must always anchor at the session root, never under a wrapping span (e.g. user code wrapping the run in tracing, future subagent paths, etc.). So we deliberately bypassresolveParentContextand readgetSessionContext()directly.Hygiene change
clearSessionTracingForTesting()now also resetssession-context.tsmodule state. Without this, a test that sets a fake session root could leak into the next test if it forgot atry/finally. Defensive, no behavior change for production code.Test plan
npx vitest run packages/core/src/telemetry/session-tracing.test.ts— 72 tests pass (70 prior + 2 new)npx tsc --noEmit -p packages/core— zero new tsc errors on touched filesnpx prettier --check— cleannpx eslint— cleanWhat's not in this PR
interactionContext.getStore()returnsundefinedfor some LLM/tool calls in production (causing them to fall back to session root) — that's a separate ALS-propagation issue, not in scope here.sessionRootContextis module-global and races on concurrent sessions. Pre-existing, tracked under feat(telemetry): Phase 3 — qwen-code.subagent span with concurrent isolation (#3731) #4410.interaction.sequenceglobal counter semantics — pre-existing issue mentioned in bug(telemetry): qwen-code.interaction span has wrong trace id (escapes session root context) #4486 but orthogonal to this fix.🤖 Generated with Qwen Code