fix(telemetry): Phase 1.5 polish — fallback order, abort-as-result, log/span consistency#4302
Conversation
…og/span consistency Follow-up to #4126 (Phase 1: unify span creation paths). Four small fixes surfaced in the late review rounds; tracked in #4212. 1. session-tracing fallback order — add resolveParentContext() helper that mirrors tracer.ts:getParentContext(): when interactionContext / toolContext ALS is empty, prefer the active OTel span over the synthetic session-root fallback. Without this, an LLM or tool call nested inside another OTel span (subagent inside a tool, instrumented HTTP path, etc.) re-parented to session root and flattened the trace tree. Three sites unified. 2. tool.execution span misreporting cancellation as success — when a tool observes signal.aborted and resolves with a normal ToolResult (no .error), the execution sub-span used to close as success while the parent tool span ended cancelled. Snapshot signal.aborted once after await, mirror it into endToolExecutionSpan, and stamp a sanitized error reason (Tool execution cancelled by user / Tool execution failed) so trace backends can distinguish the two without cross-referencing the parent. 3. loggingContentGenerator test mock missing API_CALL_ABORTED_SPAN_STATUS_MESSAGE — production code imports it; Vitest returned undefined and quietly masked the abort branch in tests. 4. Stream idle timeout vs api_response success log — when the 5-min idle timeout closes the LLM span as failed, skip the post-loop safelyLogApiResponse / safelyLogOpenAIInteraction / addModelOutputAttributes so telemetry doesn't carry a "success" api_response log alongside a timed-out span. Closes #4212. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
📋 Review SummaryThis PR (#4302) implements four targeted telemetry fixes addressing trace consistency, span parentage, and logging accuracy. The changes are well-scoped, properly tested, and align with the Phase 1 tracing unification work (#4126). Overall assessment: strong implementation with clear rationale, comprehensive test coverage, and no breaking changes. 🔍 General Feedback
🎯 Specific Feedback🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
This PR polishes telemetry span behavior and tests after the span hierarchy unification work, focusing on parent fallback order, cancellation consistency, and avoiding contradictory timeout logs.
Changes:
- Adds shared parent-context resolution that prefers explicit ALS parent, then active OTel span, then session root.
- Updates tool execution telemetry to include abort/result metadata.
- Prevents successful response/OpenAI logs after stream idle timeout and expands tests/mocks.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
packages/core/src/telemetry/session-tracing.ts |
Adds unified parent context fallback logic for LLM/tool/tool-execution spans. |
packages/core/src/telemetry/session-tracing.test.ts |
Adds tests for active OTel span fallback behavior. |
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts |
Skips success-style logs/attributes after idle timeout ends the span. |
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts |
Adds missing abort mock constant and idle-timeout logging regression test. |
packages/core/src/core/coreToolScheduler.ts |
Records abort-as-result execution span as unsuccessful with sanitized reason. |
packages/core/src/core/coreToolScheduler.test.ts |
Updates cancellation/failure telemetry assertions for execution sub-spans. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
No new issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review
…ancelled status, sync tags Four follow-ups from the #4302 review on top of the Phase 1.5 polish: 1. loggingContentGenerator.ts catch block — gate safelyLogApiError + safelyLogOpenAIInteraction on !spanEndedByTimeout. The success path was gated in the original PR but the error path wasn't, so a downstream throw after idle timeout would still emit api_error alongside a timed-out span — the same contradictory-telemetry pair the timeout fix targets. 2. endToolExecutionSpan now accepts a `cancelled` discriminator. Without it, success: false unconditionally sets SpanStatusCode.ERROR while the parent tool span uses setToolSpanCancelled (UNSET) — child shows ERROR, parent shows cancelled, trace backends filtering for errors false-positive on user cancels. Cancelled exec spans now keep status UNSET like the parent while still recording success: false / error reason attributes. 3. coreToolScheduler.ts catch block now snapshots signal.aborted into a local `aborted` constant and passes it through endToolExecutionSpan + the if-branch — same idiom as the success path (#4212), eliminating the divergent style. 4. SYNC tags on tracer.ts:getParentContext and session-tracing.ts: resolveParentContext so future drift between the two parent-resolution paths surfaces in grep before it flattens the trace tree. Tests: - session-tracing.test.ts — endToolExecutionSpan cancelled: true keeps status UNSET; sanity check that the non-cancelled path still maps success: false to ERROR. - coreToolScheduler.test.ts — extends the live-output cancellation test and the catch-after-abort test to assert cancelled: true; updates the success/failed-result tests to expect cancelled: false; adds a cancelled-stays-falsy-on-real-exception test. - loggingContentGenerator.test.ts — new test that fires the idle timeout then throws downstream and verifies neither logApiError nor logInteraction fires. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
|
Reply to the 4 🔵 Low items in the bot review summary: 1. Won't fix in this PR — that line is not in the diff. The TTL was introduced in #4071 and isn't related to any of the 4 fixes here. Worth a separate cleanup PR if it surfaces during tuning, but folding it into this branch would muddy the scope and re-trigger reviewer attention. 2. Won't fix. The "why" (incident-response contradictions, not just gating logic) is non-obvious and worth the extra two lines — the suggested condensed form drops the 3. Won't fix. 4. Won't fix. The existing comment already explains the cause ("fake timers interact poorly with async-generator iteration") and the workaround. A bare TODO without a concrete plan or owner is the kind of comment that rots — better to leave the explanation as-is and let someone with a real reason to revisit (e.g. vitest fixes the interaction) open a follow-up. All 4 inline review threads (Copilot + wenshao) are now resolved as of 65b79dc. |
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — qwen-latest-series-invite-beta-v28 via Qwen Code /review
|
本地真实测试验证(worktree
4 个 review thread 全部已落地:
LGTM, merged. |
…4321) * feat(telemetry): Phase 2 — tool.blocked_on_user + hook spans Adds two OTel span types under the existing hierarchical session-tracing infrastructure (#3731 Phase 2; depends on Phase 1 #4126 and Phase 1.5 #4302): 1. `qwen-code.tool.blocked_on_user` — brackets the time a tool spends in awaiting_approval waiting for the user. Child of the tool span. Records decision (proceed_once / proceed_always / cancel / aborted / auto_approved) and source (cli / ide / hook / auto / system). Status stays UNSET — waiting is neither OK nor ERROR. 2. `qwen-code.hook` — wraps each pre/post-hook fire site so a slow hook can be told from a slow tool. Records hook_event (PreToolUse / PostToolUse / PostToolUseFailure), tool_name, shouldProceed, shouldStop, blockType, hasAdditionalContext. Status stays UNSET on intentional blocking decisions; ERROR only when the hook itself throws. To make blocked_on_user a child of the tool span, the tool span lifecycle moved from `executeSingleToolCall` to `_schedule`'s validating-loop — covering validating → awaiting_approval → executing in one span. Two new private Maps on CoreToolScheduler hold span refs across method boundaries (callId-keyed). Centralized cleanup via `finalizeToolSpan` / `finalizeBlockedSpan` private helpers ensures every terminal status path also ends the corresponding span. Eight terminal sites now finalize the tool span: signal.aborted at loop entry, hard deny, plan-mode block, non-interactive deny, permission-hook deny, background-agent deny, _schedule catch, executeSingleToolCall finally. Five blocked_on_user end sites: handleConfirmationResponse cancel and proceed branches, autoApproveCompatiblePendingTools, _schedule catch under signal.aborted, and the global-error catch. ModifyWithEditor stays inside one blocked_on_user span until the final proceed/cancel — the duration_ms reflects total user think-time including editor side trips. Six hook fire sites are wrapped: firePreToolUseHook, firePostToolUseHook, and four safelyFirePostToolUseFailureHook variants (success-path interrupt, toolResult.error path, catch-path interrupt, catch-path real exception). fireNotificationHook is intentionally NOT wrapped — it's fire-and-forget and the duration is meaningless. Mirrors claude-code's session-tracing pattern but deliberately diverges on one point: every end-helper takes the span object explicitly via `getSpanId(span)` lookup instead of `findLast`-by-type. Under concurrent tool calls, claude-code's findLast can end the wrong blocked span; passing the ref directly is concurrency-safe. Tests: - session-tracing.test.ts: 11 new tests covering parent resolution (explicit parent for blocked_on_user, ALS-based for hook), idempotent end, NOOP behavior, error-status mapping, and a concurrency regression test (two parallel blocked spans ended in reverse order). - coreToolScheduler.test.ts: mock extended with the four new helpers and two new metadata fields. New tests cover the tool span outliving a pre-hook deny path, blocked_on_user ending with cancel via the awaiting_approval flow, hook span recording shouldProceed=false / blockType='denied' on pre-hook block and shouldStop=true / blockType='stop' on post-hook stop, and a leak guard that asserts every recorded lifecycle span is ended after a successful tool call. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): address #4321 review — Copilot inline + code-reviewer + silent-failure-hunter Eight discrete fixes plus two new tests, all surfaced in the Phase 2 review rounds. Grouped here because they touch the same handful of code paths. Copilot inline (#4321 PR): 1. startToolSpan attrs naming: drop redundant `tool_name` (helper already sets `'tool.name'` from the first arg) and rename `call_id` to the namespaced `'tool.call_id'`. Two sites: `_schedule` validating-loop start, and the defensive fallback in executeSingleToolCall. Without this, traces emit non-namespaced `tool_name` / `call_id` attributes that consumers grepping for `tool.call_id` miss. 2. PreToolUse hook span: propagate the actual `preHookResult.blockType` ('denied' / 'ask' / 'stop') instead of collapsing every block to 'denied'. Also record `hasAdditionalContext` for parity with the PostToolUse / failure-hook spans. 3. blocked_on_user `source` detection: use `config.getIdeMode()` (best- effort) so IDE-driven decisions don't all show up as `'cli'`. Centralized in a new `getBlockedSource()` helper. silent-failure-hunter / code-reviewer: 4. Hook span error-tracking is dead code. firePreToolUseHook / firePostToolUseHook / safelyFirePostToolUseFailureHook all swallow throws internally — every `catch (e) { endMeta = { error, ... }; throw e }` block in the scheduler was unreachable. Simplify all 6 sites to `try { ... } finally { endHookSpan(...) }`. The default `endMeta = { success: false }` keeps the span sensible if a future hook impl decides to throw. 5. handleConfirmationResponse had no error handling. modifyWithEditor / _applyInlineModify / attemptExecutionOfScheduledCalls can throw and would otherwise leak both the tool span and the blocked_on_user span until the 30-min TTL fires. Wrap the body in a try/catch that finalizes both spans on rethrow. Extracted the body to `_handleConfirmationResponseInner` for clarity. 6. Add `'error'` to the `ToolBlockedDecision` union for system-error closes, so dashboards counting `decision: 'cancel'` don't get polluted by thrown exceptions. 7. _schedule's outer catch was labelling its non-aborted close as `'cancel'`. Switch to `'error'` (uses #6). 8. signal.aborted vs explicit user Cancel: when both are true, the old code reported `'aborted'/'system'` even though the user actually clicked Cancel. Reverse the precedence so `outcome === Cancel` wins, with `getBlockedSource()` for the source. Tests: - T1: extend the existing ProceedAlways auto-approve test to assert the two siblings' blocked spans end with `decision: 'auto_approved'`, `source: 'auto'`, while the first tool ends as `'proceed_always'`/cli. - T2: existing cancel-during-confirmation test now also asserts exactly one blocked span is recorded for the lifecycle — the same invariant ModifyWithEditor's intentional preservation across editor side trips must not break. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): close autoApprove blocked-span leak + cover three new behaviors Two follow-ups from the post-#6767469b2 review pass on PR #4321: 1. autoApproveCompatiblePendingTools error path was logging-only and leaving the sibling tool's blocked_on_user span open until the 30-min TTL fires. Symmetric with the success branch's finalizeBlockedSpan('auto_approved', 'auto'), the catch now finalizes with ('error', 'system') so the trace deterministically explains why the sibling didn't auto-approve. 2. Three behaviors introduced by 6767469 had no test coverage: - decision='error' from _schedule's outer catch when getConfirmationDetails throws (asserts tool span ends, no blocked span ever opens since the throw happens pre-awaiting_approval). - source='ide' when getBlockedSource() honors getIdeMode (Cancel path with getIdeMode: () => true). - Explicit Cancel takes precedence over a concurrent signal.aborted in the decision label — the bug the precedence flip was meant to fix is now regression-tested. Extracted a small `buildApprovalScheduler` helper for the two awaiting_approval-flow tests; the throw-on-confirmation test reuses StructuredErrorOnConfirmationTool. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): revert autoApprove catch finalizeBlockedSpan (#4321 codex P3) The previous commit 32f94d3 added a `finalizeBlockedSpan(callId, 'error', 'system')` to the autoApproveCompatiblePendingTools catch in the name of "symmetry with the success branch". Codex review pointed out the bug: that catch fires when evaluatePermissionFlow throws for a SIBLING tool, but the sibling itself is still in `awaiting_approval` — the user can still respond. By closing the blocked span at the catch, the eventual handleConfirmationResponse → finalizeBlockedSpan call becomes a no-op (Map.delete already cleared it), and the user's actual decision / source attributes are lost from the trace. Revert that line. The previous behavior was correct: log the error, leave the span open, let the user's eventual decision close it correctly. If the user never responds, the 30-min TTL in session-tracing.ts cleans up the orphan span — same fallback that already covered every other "user walks away" scenario. The "leak" the original change was trying to fix was a phantom: the span IS finalized once the user (or the abort signal) drives the tool to a terminal state. The TTL is just the safety net. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): split tool.failure_kind labels + cover proceed_once decision Two #4321 review comments from wenshao, both Critical: 1. `TOOL_FAILURE_KIND_PRE_HOOK_BLOCKED` was being emitted for FIVE distinct non-PreToolUse-hook deny paths in `_schedule`: - finalPermission === 'deny' (hard deny) - plan-mode block - non-interactive deny - permission_request hook deny - background-agent deny Dashboards filtering by `failure_kind = 'pre_hook_blocked'` were silently picking up all of these, undermining the attribute. Add distinct constants + status messages for each path. The original PRE_HOOK_BLOCKED label is now used at exactly one site — the actual PreToolUse hook deny in `_executeToolCallBody`. 2. `decision: 'proceed_once'` was untested. Existing tests covered 'cancel' and 'proceed_always' (auto-approve) but not the most common user interaction. Add a test that schedules an approval-required tool, confirms with ProceedOnce, and asserts the blocked span ends with `decision: 'proceed_once'`, `source: 'cli'`. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): address #4321 wenshao Critical + bot summary nits Three review items folded into one follow-up: 1. wenshao Critical (`coreToolScheduler.ts:1851`) — `ModifyWithEditor` path silently returned when `getPreferredEditor()` was undefined, leaking blocked + tool spans on user-walks-away. Add a `debugLogger.warn` so the silent failure is at least visible in debug telemetry. Deliberately do NOT finalize spans here, matching the Codex P3 / autoApprove decision: ModifyWithEditor stays inside one awaiting period, the user can still recover via Cancel/Proceed which closes the spans correctly, and the 30-min TTL is the safety net for give-up scenarios. Finalizing prematurely would make the user's eventual decision a no-op (Map already cleared) and lose the actual decision/source attributes. 2. Bot summary Medium (`session-tracing.ts:557-562`) — add a `debugLogger.debug` when `startToolBlockedOnUserSpan` falls back to `resolveParentContext` because the tool span isn't in `activeSpans` anymore. Helps diagnose unexpected ordering during development. 3. Bot summary Low (`constants.ts`) — JSDoc the two new span name constants. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * refactor(telemetry): extract withHookSpan helper + drop dead finalizeToolSpan param Two #4321 review Suggestions from wenshao: 1. The 6 hook fire sites (PreToolUse, PostToolUse, 4× PostToolUseFailure) each repeated the same try/finally + endMeta init + endHookSpan pattern. Future hook span protocol changes had to be made in lockstep. Extract a private generic helper: withHookSpan<T>(opts, fn, toEndMeta): Promise<T> Each fire site collapses from ~12 lines of try/finally scaffolding to ~3 lines passing in the fire callback + endMeta builder. The `let postHookResult!:` definite-assignment hack at the PostToolUse site is gone because the helper returns the awaited result directly. 2. `finalizeToolSpan(callId, metadata?)` had a dead `metadata` parameter — every caller pre-sets the span status via `setToolSpan{Failure,Cancelled}` and called `finalizeToolSpan` with no argument. Removed the parameter. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): hook span error tracking + TTL cleanup safety + call_id back-compat Three #4321 review threads from wenshao (#4321 codex P3-equivalent + two structural concerns): 1. **[Critical] Hook spans reported success on swallowed hook failures.** firePreToolUseHook / firePostToolUseHook / firePostToolUseFailureHook (and the safelyFire wrapper in coreToolScheduler) all catch transport / dispatch errors internally and return safe defaults. Before this fix, withHookSpan's `toEndMeta` ran on the safe default and recorded `success: true` — a crashing hook was indistinguishable from one that allowed execution. Add a `hookError?: string` field to the three result types, populate it in each catch, and have all 6 toEndMeta callbacks return `{ success: false, error: hookError }` when present. Existing "graceful error" tests updated to expect the new field. 2. **[Suggestion] ensureCleanupInterval not kicked from new helpers.** The 30-min TTL cleanup safety net for leaked spans only starts when `startInteractionSpan` is first called. Sub-agent or side-query code paths that call `startToolBlockedOnUserSpan` / `startHookSpan` without an interaction span first never trigger cleanup. Both helpers now call the (idempotent) `ensureCleanupInterval()` early. 3. **[Suggestion] `call_id` → `'tool.call_id'` rename is breaking for downstream consumers.** Phase 1's `startToolSpan(name, { tool_name, call_id })` shipped non-namespaced attribute keys. My Phase 2 #4321 review-fix dropped both. Dual-emit `call_id` (legacy alias) + `'tool.call_id'` for one release cycle so existing dashboards / alerts don't silently return zero. Comment notes the legacy key is removed in the next release. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): close hookError plumbing gaps from final pre-merge audit Final-pass review surfaced two gaps in the hookError contract added in eafe688: 1. **Real bug (silent-failure-hunter HIGH)**: The three fire helpers (firePreToolUseHook / firePostToolUseHook / firePostToolUseFailureHook) populate `hookError` only in their catch blocks. But the `if (!response.success || !response.output)` short-circuit at lines 121 / 220 / 299 silently dropped `response.error` from the runner layer (URL validation failures, fn exceptions, prompt-runner crashes). Hooks that never even threw — just had a failing runner — surfaced as "successful allow" in telemetry. Forward `response.error?.message` into hookError on the short-circuit path so the operator sees the actual cause. 2. **Defensive default in withHookSpan**: the initial `endMeta = { success: false }` produced UNSET status (no `error` field, so endHookSpan skips the setStatus(ERROR) branch). Today the only path that hits this default is "fn() throws before toEndMeta", which is unreachable because all hook helpers catch internally — but the contract should still map to ERROR if the invariant ever changes. Default now carries an explanatory error string. Test: new `coreToolScheduler.test.ts` case where messageBus.request resolves with success:false + a real Error; asserts the PreToolUse hook span's `hookMetadata.error` is the runner's message (instead of being silently absent). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * test(telemetry): cover #4321 rethrow path + 2 of the new failure_kind labels Two test gaps surfaced by wenshao [Suggestion] threads: 1. **handleConfirmationResponse outer catch was untested.** The defensive recovery path that finalizes both spans on originalOnConfirm / modifyWithEditor / attemptExecution throws had no coverage. New test calls handleConfirmationResponse directly with a throwing onConfirm, asserts: - blocked span ends with `decision: 'error'`, `source: 'system'` - tool span carries `tool.failure_kind: 'tool_exception'` - the original error is rethrown to the caller 2. **5 new permission-flow failure_kind labels had zero coverage.** Add representative tests for the two highest-volume paths: - `permission_denied` — PM hard-deny via a tool whose getDefaultPermission returns 'deny' - `non_interactive_denied` — `isInteractive: () => false` scheduling an edit-tool that needs confirmation The other three (plan_mode_blocked / permission_hook_denied / background_agent_denied) are covered transitively via the existing pre_hook_blocked + plan-mode tests; if they regress, the same code path's existing assertions would notice. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 4 wenshao Critical/Suggestion findings on PR #4321 Inline review findings: - coreToolScheduler.ts: signal.abort drains scheduler-local toolSpans/blockedSpans Maps via deferred setTimeout(0) — bridges the gap between session-tracing's 30-min TTL (which ends underlying spans but cannot reach the Maps) and walk-away-during-awaiting_approval. The drain is deferred so explicit Cancel via handleConfirmationResponse and mid-execution setToolSpanCancelled paths still win the race and set canonical labels. - coreToolScheduler.test.ts: regression test for permission_hook_denied (firePermissionRequestHook deny branch at _schedule:1683) and background_agent_denied (getShouldAvoidPermissionPrompts auto-deny at _schedule:1697). Both branches were untested — silently dropping setToolSpanFailure on either would lose attribution. - coreToolScheduler.ts: defensive-fallback span in executeSingleToolCall uses canonicalToolName(toolName) so dashboards grouping by span name don't see two entries for migrated/MCP tools whose canonical and raw names differ. Review-body finding: - session-tracing.ts: TTL safety net stamps qwen-code.span.ttl_expired + qwen-code.span.duration_ms attributes and emits a debug log before ending stale spans. Operators can now distinguish "abandoned and garbage-collected by the safety net" from "deliberately ended without status/attrs". Refactored cleanup loop into sweepStaleSpans(now) and exposed runTTLSweepForTesting for unit coverage. Tests: +3 scheduler tests (~220 LOC), +2 session-tracing tests (~36 LOC). 247/247 in affected files. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 7 DeepSeek /review findings on PR #4321 Adopted ([Critical]): - coreToolScheduler.ts: ModifyWithEditor `!editorType` path now sets `qwen-code.tool.modify_with_editor_unavailable: true` on the live tool span so operators can detect the silent-bail-out state in production traces without enabling debug logging. - coreToolScheduler.test.ts: regression test for plan_mode_blocked failure_kind path (ApprovalMode.PLAN + non-read-only confirmation tool). - coreToolScheduler.test.ts: regression test for the pre-aborted signal early-exit in `_schedule` — asserts setToolSpanCancelled (UNSET status) without entering execution. Adopted ([Suggestion]): - coreToolScheduler.ts: `withHookSpan` now `catch`-es and surfaces the actual thrown message instead of the hardcoded `'hook fn threw before toEndMeta'` sentinel. Currently unreachable (hook helpers swallow internally) but defensive against contract drift. - coreToolScheduler.ts: re-add `tool_name` (non-namespaced) as a legacy alias on both startToolSpan call sites, mirroring the `call_id` / `tool.call_id` dual-emit window so pre-Phase-2 dashboards filtering on `tool_name` don't silently stop matching during the rollout. - coreToolScheduler.test.ts: regression test for the `_schedule`-driven aborted decision label on the blocked_on_user span (companion to the existing tool-span drain test). - coreToolScheduler.ts: PreToolUse / PostToolUse `toEndMeta` now include `shouldProceed: true` / `shouldStop: false` when `hookError` is set, mirroring the runtime's allow-on-hook-failure semantics. Pushed back (separate PR-level reply): - "sibling failure prematurely closes confirmed tool span" — not reachable: `_executeToolCallBody` swallows execution errors so the only paths into `handleConfirmationResponse`'s catch are `originalOnConfirm` / `modifyWithEditor` / `_applyInlineModify`, none of which run after `attemptExecutionOfScheduledCalls` started any sibling. - "PostToolUseFailure hook spans not asserted" — broader scope, defer. - "finalizeToolSpan accept required metadata" — invariant-redesign, out of scope for this PR. Tests: +3 scheduler tests; 250/250 green in affected files (coreToolScheduler 154 + session-tracing 49 + toolHookTriggers 47). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 3 wenshao /review findings on PR #4321 - coreToolScheduler.ts: handleConfirmationResponse outer catch now branches on signal.aborted — a throw caused by the abort signal (e.g. ModifyWithEditor child interrupted by Ctrl+C) lands as decision:'aborted'/UNSET status instead of 'error'/tool_exception, matching the sister catch in `_schedule` and keeping dashboard abort-vs-error counts honest (Critical-shaped Suggestion). - coreToolScheduler.ts: drop the per-batch abort listener at the end of `_schedule` when no batch entries remain in toolSpans / blockedSpans. Prevents Node's MaxListenersExceededWarning in long-lived sessions where the same AbortSignal sees many _schedule batches without a real abort. Listeners that still cover awaiting_approval entries stay attached — the user's eventual decision closes the spans, and the listener becomes a no-op when it later fires (or auto-removes via `{ once: true }` on real abort). - coreToolScheduler.test.ts: 2 regression tests for PostToolUseFailure hook span variants — `is_interrupt:true` on user-abort vs `is_interrupt:false` on real-exception. Operators rely on this flag to separate user-initiated cancellations from system errors in dashboards; a copy-paste regression flipping the value across the 4 PostToolUseFailure call sites was previously invisible. Tests: 252/252 across affected files (coreToolScheduler 156 + session-tracing 49 + toolHookTriggers 47). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 7 wenshao /review round-3 findings on PR #4321 Adopted ([Critical]): - coreToolScheduler.ts: full per-batch abort listener cleanup. Replaced the closure-local Set + end-of-_schedule cleanup with a class-level callIdToBatch Map keyed off a shared BatchAbortState. The listener is now released by `finalizeToolSpan` → `releaseBatchListenerIfDrained` whenever the last live batch entry drains, regardless of whether finalize happens synchronously inside _schedule, later via handleConfirmationResponse, or via executeSingleToolCall. Closes the awaiting_approval-batches-leak-listeners gap from the previous partial fix. - coreToolScheduler.ts: re-check signal.aborted in the _schedule for-loop after `evaluatePermissionFlow`/`getConfirmationDetails`/ `firePermissionRequestHook` and BEFORE setting awaiting_approval + starting the blocked span. Without this, a signal that aborts during one of those awaits opens a blocked span on an already-aborted signal whose drainSpansForBatch may have already fired, leaving the new entry permanently orphaned. - session-tracing.ts: introduce truncateSpanError(s) (1KB cap) and apply it to every endXSpan site that writes metadata.error to span attributes / status messages (LLM, tool, tool execution, hook). Hook server responses, raw exception stacks, or hostile inputs can be unbounded; some OTel backends drop the entire span when any field exceeds their limit. Adopted ([Suggestion]): - coreToolScheduler.ts: per-callId try/catch inside drainSpansForBatch. One bad finalize no longer skips the rest of the batch; failures are logged via debugLogger.warn instead of bubbling up as an unhandled timer-callback exception. - session-tracing.ts: TTL sweep robustness — wraps setAttributes and span.end() in separate try/catch blocks so a setAttributes throw can't leak the OTel span; stamps `decision: 'aborted'`/ `source: 'system'` on TTL-expired blocked_on_user spans so dashboards filtering by decision count walk-aways consistently with explicit user aborts; includes tool.name + tool.call_id in the warn log so it's actionable in production without a trace-backend lookup. - coreToolScheduler.ts: extract the 4 byte-identical PostToolUseFailure toEndMeta lambdas into a single `postToolUseFailureEndMeta` member. Future protocol changes only need to touch one place. - coreToolScheduler.test.ts: 3 new tests * outer-catch aborted branch — pre-aborted signal + throwing onConfirm asserts decision='aborted'/source='system' and failure_kind='cancelled'. * ModifyWithEditor !editorType — uses a getModifyContext-shimmed MockEditTool to enter the modifiable branch and asserts qwen-code.tool.modify_with_editor_unavailable=true. * per-batch listener removed when batch drains synchronously — asserts AbortSignal listenerCount and `callIdToBatch` size. Pushed back (deferred): - "firePermissionRequestHook in withHookSpan + hookError field" — same as previous deferral. Touches the public PermissionRequestHookResult type re-exported from packages/core/src/index.ts; declined per the guardrail on public-API changes. Tests: 255/255 across affected files (coreToolScheduler 159 + session-tracing 49 + toolHookTriggers 47). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): polish 2 wenshao /review round-4 nits on PR #4321 - session-tracing.ts: rename `SPAN_ERROR_MAX_BYTES` → `SPAN_ERROR_MAX_CHARS` and update the JSDoc to be honest that `truncateSpanError` truncates by UTF-16 code units rather than bytes. CJK/emoji-heavy errors land in the ~2-3KB UTF-8 range under the same code-unit cap, but that's still well under all major OTel backends' per-attribute limits (Jaeger/Honeycomb ~64KB, OTLP default ~32KB), so we keep the simpler char-count bound rather than paying the encoder cost on every endXSpan. - coreToolScheduler.ts: move the `withHookSpan` JSDoc block to sit directly above the method. The previous order had two consecutive JSDoc blocks separated by `postToolUseFailureEndMeta`, which orphaned the `withHookSpan` doc — IDE hover tooltips would surface the wrong documentation. Tests: 208/208 in affected files; tsc --noEmit clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 4 wenshao /review round-5 findings on PR #4321 Adopted ([Suggestion]): - coreToolScheduler.ts: `setToolSpanFailure` now applies `truncateSpanError` to the status message at this single ingress point. Many of its 10+ call sites pass raw `error.message` which can be unbounded — the same backend-drop risk that drove `truncateSpanError` for the endXSpan attribute writes. Static- constant callers see no change since their messages are well under the 1024-char cap. Required exporting `truncateSpanError` from `session-tracing.ts` and re-exporting from `telemetry/index.ts`. - coreToolScheduler.ts: in `_schedule`, after the for-loop runs to completion, drop the abort listener if `batchState.callIds.size === 0`. Closes the all-error-batch leak path: if every newToolCall had `status !== 'validating'` (e.g., invalid params, tool not registered, queue full), no `finalizeToolSpan` ever fires for the batch and `releaseBatchListenerIfDrained` is never invoked. Without this drop, one dead listener accumulates per all-error batch. - coreToolScheduler.ts: `handleConfirmationResponse` outer catch now emits a `debugLogger.warn` before rethrowing. Without it, if the caller (CLI confirmation UI layer) doesn't log the rejection, the error disappears from application logs entirely — operators grepping by callId would see nothing despite the trace backend showing `failure_kind: tool_exception`. - session-tracing.test.ts: 4 new tests * `truncateSpanError` returns short strings unchanged * `truncateSpanError` truncates over 1024 chars + appends sentinel * `truncateSpanError` boundary at exactly 1024 chars * TTL sweep stamps `decision: 'aborted'` + `source: 'system'` on blocked_on_user spans (covers the branch added in review-3 round) Pushed back ([Suggestion]): - "TTL sweep can't reach scheduler-local Maps" — accurate but the fix is non-trivial: a parallel scheduler-side TTL sweep duplicates the session-tracing sweep's bookkeeping, and the practical impact is bounded (Maps die with the scheduler instance, which is per-session in CLI mode). The bigger leak (listener accumulation on shared signals) is already covered by `releaseBatchListenerIfDrained`. Marking as out-of-scope architectural follow-up. Tests: 259/259 across affected files (coreToolScheduler 159 + session-tracing 53 + toolHookTriggers 47). `tsc --noEmit` clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 1 wenshao /review round-6 finding on PR #4321 - coreToolScheduler.test.ts: convert the `truncateSpanError` mock from an inline identity function to `vi.fn(identity)` so individual tests can substitute a sentinel return. Added regression test `setToolSpanFailure forwards the truncateSpanError result to the span status (#4321)` that overrides the spy with `<<TRUNCATED-SENTINEL>>`, drives the scheduler through the pre-hook deny path, and asserts the span's ERROR status message equals the sentinel — locks the integration so a regression dropping the `truncateSpanError(message)` call inside `setToolSpanFailure` is caught at the scheduler boundary rather than only at the utility's unit test. Tests: 213/213 across `coreToolScheduler.test.ts` (160) + `session-tracing.test.ts` (53). `tsc --noEmit` clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): close 4 silent-failure + test-gap findings from final review on PR #4321 Comprehensive self-review (code-reviewer + silent-failure-hunter + type-design-analyzer + pr-test-analyzer agents) after 6 rounds of bot feedback turned up 4 remaining actionable items. Addressed: [silent-failure-hunter HIGH-1] toolHookTriggers.ts: when the hook runner returns `{ success: false }` (or missing output) with no `error.message`, the 3 fire helpers used to silently return the safe default — `{ shouldProceed: true }` / `{ shouldStop: false }` / `{}` — producing a hook span that reads `success: true` and looked like a clean allow in dashboards. Now synthesizes a sentinel hookError describing the contract violation so the span records the failure. Three existing test cases updated to assert the new sentinel-bearing shape. [silent-failure-hunter HIGH-2] coreToolScheduler.ts: synchronous throws in `_executeToolCallBody`'s prelude (addToolInputAttributes, getMessageBus, startToolExecutionSpan, etc.) propagated up to `executeSingleToolCall`'s `finally` without ever hitting setToolSpan*, so the tool span ended UNSET with no failure_kind AND the tool call stayed in 'executing' forever (checkAndNotifyCompletion never sees terminal state, scheduler hangs). Added a catch in executeSingleToolCall that pre-sets failure status + an error response before the finally finalizes — guards every prelude path the body's own try/catch doesn't cover. [silent-failure-hunter MEDIUM-3] session-tracing.ts: the empty catch on `sweepStaleSpans` `setAttributes` lost the `ttl_expired` + `decision: 'aborted'` sentinel attrs silently if setAttributes ever threw. Now matches the sibling `span.end()` catch and surfaces via `debugLogger.warn` — TTL-leaked blocked spans stay distinguishable from deliberately-UNSET ones in dashboards. [pr-test-analyzer Gap1, severity 7] coreToolScheduler.test.ts: the `signal.aborted` re-check at `_schedule:1834` (round-3 fix that prevents opening a blocked span on an already-aborted signal between the for-loop's await points and the awaiting_approval transition) had no regression test. Added one that uses a tool whose `getConfirmationDetails` aborts the signal before returning — top of loop check passes, getConfirmationDetails resolves and aborts, re-check fires the cancel path. Asserts `tool.failure_kind === 'cancelled'` AND that NO blocked_on_user span was ever started. Tests: 261/261 across affected files (coreToolScheduler 161 + session-tracing 53 + toolHookTriggers 47). `tsc --noEmit` clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 3 wenshao /review round-8 findings on PR #4321 All three from the same /review run; all valid (the Critical is a real bug in the SF-H2 fix from review-7 that this commit fixes). [Critical] coreToolScheduler.ts:2407 — the `c.status === 'executing'` guard on the prelude-throw catch was wrong. Prelude throws happen BEFORE the `scheduled → executing` transition in `_executeToolCallBody` (getMessageBus is called at line 2460, scheduled→executing flips at line 2522). The `find(... 'executing')` skipped the setStatusInternal, so the toolCall stayed in `scheduled` forever and checkAndNotifyCompletion never fired — exactly the stall the SF-H2 fix was supposed to prevent. Drop the guard; setStatusInternal already no-ops on terminal states (success/error/cancelled) so the unconditional call covers both scheduled-prelude and executing-body paths. Added regression test that makes getMessageBus throw and asserts onAllToolCallsComplete fires with status='error'. [Suggestion] session-tracing.ts:222 — truncateSpanError used `slice(0, 1024)` on UTF-16 code units, which splits surrogate pairs when an emoji (e.g. 🚀) or rare CJK character sits at the boundary. The result was a lone high surrogate followed by `'…[truncated]'` — strict OTLP/gRPC collectors reject batches with invalid UTF-8 (a lone high surrogate encodes to an invalid byte sequence). Back up one code unit when the cut lands on a high surrogate. Added regression test that constructs the boundary case (1023 'a' + 🚀 + padding) and asserts the truncated string is valid UTF-16. [Suggestion] toolHookTriggers.ts:133/240/319 — switched `||` to `??` in the 3 hookError sentinel sites. `||` treats empty string as falsy so a runner returning `{ error: { message: "" } }` triggered the sentinel instead of preserving the (unhelpful but distinct) empty message — a runner contract violation that's worth distinguishing from a missing-message case. `??` synthesizes only when the message is truly absent (undefined / null). Tests: 263/263 across affected files (coreToolScheduler 162 + session-tracing 54 + toolHookTriggers 47). `tsc --noEmit` clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): adopt 3 wenshao /review round-9 findings on PR #4321 [Critical] coreToolScheduler.ts — `handleConfirmationResponse`'s catch was misattributing sister-tool prelude throws to the confirmed tool's span. The catch wrapped `_handleConfirmationResponseInner`, which called `attemptExecutionOfScheduledCalls` at its tail. If the user proceeds tool A with ProceedAlways, `autoApproveCompatiblePendingTools` transitions sister tools B/C to `scheduled`, and B has a prelude throw, the SF-H2 catch in `executeSingleToolCall` re-throws → the throw propagates up through `attemptExecutionOfScheduledCalls` → into the outer catch keyed on A.callId, where `setToolSpanFailure(A.span, TOOL_EXCEPTION, B.error.message)` corrupts A's span and `finalizeToolSpan(A.callId)` ends A's span prematurely. A's actual result later disappears from telemetry. Fix: move `attemptExecutionOfScheduledCalls` out of `_handleConfirmationResponseInner` and into `handleConfirmationResponse` after the try/catch. The catch now covers only confirmation logic; each tool's `executeSingleToolCall` already handles its own span lifecycle via its own catch. [Suggestion] toolHookTriggers.ts — reverted the round-8 `??` change back to `||`. Downstream consumers in coreToolScheduler.ts gate on `r.hookError ? ...`, so an empty-string `hookError` preserved by `??` was silently dropped — the change defeated its own stated intent. Empty-string runner error messages carry no operator value; the sentinel ("hook runner returned ... without error detail") is more actionable, and `||` matches existing downstream truthiness semantics. [Suggestion] session-tracing.test.ts — replaced the vacuous `Buffer.from(truncated, 'utf16le')` assertion (which never throws because Node's Buffer copies raw 16-bit code units without validating surrogate pairs) with the suggested regex `/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/` that actually checks for orphan high surrogates anywhere in the string. Tests: 263/263 across affected files (coreToolScheduler 162 + session-tracing 54 + toolHookTriggers 47). `tsc --noEmit` clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * test(telemetry): pin empty-string runner error sentinel behavior on PR #4321 [Suggestion] gpt-5.5 review-10: the round-9 `??` → `||` revert was correct, but the existing tests only covered the missing-error case (`success: false` with no `error` field). A future regression back to `??` would still pass those tests while reintroducing the silent-drop behavior the revert was guarding against. Add 3 explicit tests — one per fire helper (PreToolUse, PostToolUse, PostToolUseFailure) — that pass `{ error: { message: '' } }` and assert the sentinel hookError is synthesized (not the empty string). Pins the `||` semantics so any future `??` change fails the suite. Tests: 50/50 in toolHookTriggers.test.ts (47 → 50). `tsc --noEmit` clean. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Summary
Follow-up to #4126 (Phase 1: unify span creation paths). Four small fixes
surfaced in the late review rounds, all tracked in #4212.
session-tracing.tsfallback order — newresolveParentContext()mirrorstracer.ts:getParentContext(), so when the ALS parent is missing we re-parent to the active OTel span before falling back to the synthetic session root. Three call sites unified (startLLMRequestSpan/startToolSpan/startToolExecutionSpan).coreToolScheduler.tsexec span — when a tool resolves with a normalToolResultafter observingsignal.aborted, snapshot the abort flag and pass it intoendToolExecutionSpanso the sub-span agrees with the cancelled parent tool span. Stamp a sanitizederrorreason (Tool execution cancelled by user/Tool execution failed) on the success path so trace backends can distinguish the two without cross-referencing the parent.loggingContentGenerator.test.ts— add the missingAPI_CALL_ABORTED_SPAN_STATUS_MESSAGEconstant to thetracer.jsmock; production code already imports it but Vitest returnedundefinedand quietly masked the abort branch in tests.loggingContentGenerator.tsidle-timeout — when the 5-min idle timeout has already closed the LLM span as failed, skip the post-loopsafelyLogApiResponse/addModelOutputAttributes/safelyLogOpenAIInteractionso we don't emit a "success" api_response log alongside a timed-out span.Closes #4212.
Test plan
pnpm test—coreToolScheduler.test.ts(134 tests),loggingContentGenerator.test.ts(38 tests),session-tracing.test.ts(34 tests)pnpm typechecksession-tracing.test.ts— three tests, one per span helper, asserting re-parent to active OTel span when the corresponding ALS parent is absentcoreToolScheduler.test.ts— extends the existing live-output cancellation test to assert the exec sub-span ends assuccess: falsewith the cancelled error reason; updates the existing failed-result test to expect the new sanitized error reasonloggingContentGenerator.test.ts— drives the idle-timer callback through asetTimeoutspy and verifies neitherlogApiResponsenorlogInteractionfires when the span has already timed out🤖 Generated with Qwen Code