feat(telemetry): add daemon OTel metrics and structured log records#4749
Conversation
📋 Review SummaryThis PR adds comprehensive OpenTelemetry metric instrumentation to the daemon serve path, implementing 11 metric instruments covering HTTP requests, session/channel lifecycle, prompt queue wait/duration, bridge errors, cancellations, and active SSE connections. The implementation is well-structured with proper separation of concerns between core metrics definitions and bridge integration points. Overall, this is a solid observability enhancement that follows OTel best practices. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
Adds OpenTelemetry metrics and structured daemon lifecycle log records for qwen serve’s daemon path, improving operational visibility into HTTP traffic, session/channel lifecycle, prompt timing, cancellations, bridge errors, active SSE connections, and daemon heap usage.
Changes:
- Adds daemon-specific metric instruments (counters, histograms, observable gauges) plus unit tests for recording/gauge behavior.
- Extends daemon bridge telemetry to optionally emit metric events and richer structured log records (custom event names + severity).
- Adds a metric pre-shutdown flush helper and emits
service.instance.idas a resource attribute for daemon process incarnation.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/core/src/telemetry/sdk.ts | Tracks the active metric reader and adds forceFlushMetrics() to flush metric exports on demand. |
| packages/core/src/telemetry/index.ts | Re-exports daemon metrics APIs and forceFlushMetrics from core telemetry. |
| packages/core/src/telemetry/daemon-tracing.ts | Extends daemon log emission options and allows daemon bridge telemetry to carry optional metrics hooks. |
| packages/core/src/telemetry/daemon-metrics.ts | Introduces daemon metrics instruments and recording helpers (HTTP, lifecycle, prompt timing, errors, cancel, gauges). |
| packages/core/src/telemetry/daemon-metrics.test.ts | Adds unit tests validating daemon metrics initialization, recording, normalization, and gauge callbacks. |
| packages/cli/src/serve/server.ts | Records daemon HTTP metrics and tracks active SSE connection count; records bridge error metrics on sendBridgeError. |
| packages/cli/src/serve/runQwenServe.ts | Initializes daemon telemetry+metrics, registers gauge callbacks, emits structured lifecycle logs, and flushes metrics pre-shutdown. |
| packages/acp-bridge/src/bridgeOptions.ts | Extends the bridge telemetry contract with an optional metrics interface for decoupled metric emission. |
| packages/acp-bridge/src/bridge.ts | Calls the optional bridge telemetry metrics hooks for channel/session lifecycle, queue wait, duration, and cancellation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] packages/acp-bridge/src/bridge.ts:4557 — sessionLifecycle('die') not recorded during daemon shutdown
The shutdown loop at line 4557 publishes session_died events for each entry but does not call telemetry.metrics?.sessionLifecycle('die'). The channel-exit handler at line 1163 normally records this metric, but it's bypassed during shutdown because byId.clear() runs first (line 4549), making byId.get(...) return undefined.
Sessions alive at daemon shutdown time are never counted as "died", causing the counter-derived net (spawns − closes − dies) to permanently drift per restart cycle.
Fix: add telemetry.metrics?.sessionLifecycle('die') in the for (const e of entries) loop:
for (const e of entries) {
telemetry.metrics?.sessionLifecycle('die');
try {
e.events.publish({
type: 'session_died',
data: { sessionId: e.sessionId, reason: 'daemon_shutdown' },
});
} catch {
/* bus already closed */
}
e.events.close();
}(Posted as a body-level comment because this line is not in the PR diff.)
— qwen3.7-max via Qwen Code /review
Review fix summary (e8b3949)
🤖 Generated with Qwen Code |
Review fix summary (49c64f3)
5 fixed, 4 deferred to follow-up. 🤖 Generated with Qwen Code |
Review fix summary R2 (77d3500)
3 fixed, 1 deferred. 🤖 Generated with Qwen Code |
Remaining threads resolved (3eae310)
2 fixed, 1 pushed back. All threads resolved (0 remaining). 🤖 Generated with Qwen Code |
3eae310 to
0062df9
Compare
Review Round 4 — wenshao suggestions (4 threads)
🤖 Generated with Qwen Code |
PR #4749 Local Verification ReportBranch: Changes Summary
Test Results
TypeCheck AnalysisThe new typecheck errors fall into two categories:
Verdict✅ Ready to merge. All 18 unit tests pass with thorough coverage (init, recording, gauge callbacks, error normalization, idempotency, no-op before init). Core typecheck clean. Lint clean. The cross-package typecheck errors are expected build-order artifacts that resolve after rebuilding core — no actual type mismatches. Verified by wenshao |
Review Round 5 — wenshao post-approval suggestion (1 thread)
wenshao APPROVED the PR. All threads resolved (0 unresolved). 🤖 Generated with Qwen Code |
| export async function forceFlushMetrics(): Promise<void> { | ||
| if (!telemetryInitialized || !activeMetricReader) return; | ||
| const flush = activeMetricReader.forceFlush(); | ||
| flush.catch(() => {}); |
There was a problem hiding this comment.
[Suggestion] flush.catch(() => {}) creates a detached promise chain that silently discards the actual error from activeMetricReader.forceFlush(). The original flush promise is still raced in Promise.race, but the no-op catch means when the flush rejects (collector down, TLS error, network partition), the real error is lost. The caller in runQwenServe.ts:1209 only sees either the generic timeout message "forceFlushMetrics timed out after 2000ms" or silent resolution — no diagnostic about why the flush failed.
At 3 AM when the OTLP collector is unreachable, the operator has no metric flush diagnostics to distinguish collector outage from DNS failure from exporter misconfiguration.
| flush.catch(() => {}); | |
| flush.catch((err) => { | |
| debugLogger.debug?.('[metrics] flush error:', err instanceof Error ? err.message : String(err)); | |
| }); |
— qwen3.7-max via Qwen Code /review
Adds 11 OTel metric instruments to the daemon serve path, covering: - HTTP request count/latency by route and status class - Session lifecycle (spawn/close/die) counter - Channel lifecycle (spawn/exit) counter - Prompt queue wait and end-to-end duration histograms - Bridge error counter with normalized error type allowlist - Cancel request counter - ObservableGauges for active sessions, SSE connections, heap usage Key design decisions: - ObservableGauge (not UpDownCounter) for gauge-like values — immune to +1/-1 drift across complex lifecycle paths - Error type normalization via allowlist (19 known types + 'unknown') prevents unbounded cardinality - Explicit histogram bucket boundaries tuned for daemon latency profiles - Bridge decoupled via BridgeTelemetry.metrics optional sub-object - emitDaemonLog generalized with optional eventName/severityNumber - service.instance.id added to Resource for process incarnation detection - Pre-shutdown forceFlushMetrics for best-effort final metric export Closes #4554 §6. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- service.instance.id now serves as a fallback default rather than
unconditional override, so operators can set a stable instance id
via telemetry.resourceAttributes
- channelLifecycle('spawn') log no longer carries the misleading
'expected' attribute (only meaningful for 'exit' events)
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
…down die metric
- BridgeTelemetryMetrics now re-exports DaemonBridgeTelemetryMetrics
from core instead of duplicating the interface definition
- Add sessionLifecycle('die') in bridge shutdown loop so sessions
alive at daemon shutdown are counted in the lifecycle counter
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- channelExitExpected now checks info.isDying in addition to shuttingDown, so deliberate channel kills from closeSession/ killSession are correctly recorded as expected=true - Add diag stub to the @opentelemetry/api mock in daemon-metrics tests for robustness 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- [Critical] forceFlushMetrics: void → await to prevent race with shutdownTelemetry tearing down the metric reader mid-flush - registerDaemonGaugeCallbacks: add idempotency guard (gaugesRegistered) to prevent duplicate ObservableGauge callbacks on re-entry - activeSseCount: add double-fire guard to prevent negative counter from abnormal close events - Non-null assertions (!) → optional chaining (?.) on all recording functions for resilience against SDK misconfiguration - expected ?? true vs !expected severity logic: use explicit expected === false to avoid contradictory signals when undefined 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- [Critical] Fix TS1308: await inside non-async Promise executor. Restructured to .then() chain so forceFlushMetrics completes before bridge.shutdown() starts, without requiring async executor. - forceFlushMetrics: add 5s timeout via Promise.race to prevent indefinite blocking on unreachable OTLP collector. - Add idempotency test for registerDaemonGaugeCallbacks. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- sessionLifecycle('die') no longer emits ERROR severity — unexpected
exits are already covered by channelLifecycle('exit', false) WARN
- gaugesRegistered = true moved to end of registerDaemonGaugeCallbacks
for consistency with initializeDaemonMetrics and retry resilience
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- Add flush.catch() in forceFlushMetrics to prevent unhandled rejection
when timeout wins the Promise.race (sdk.ts)
- Fix log body inconsistency: use expected ?? true to match attribute
(runQwenServe.ts)
- Guard channelLifecycle('exit') with handshakeComplete flag to prevent
exit count exceeding spawn count on handshake failures (bridge.ts)
Keeps worst-case shutdown budget under Kubernetes default 30s grace period. Healthy daemon flushes in <100ms; 2s is sufficient headroom.
d28a619 to
03a7310
Compare
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review
BZ-D
left a comment
There was a problem hiding this comment.
Found one blocking metrics correctness issue.
Summary
BridgeTelemetryinterface with optionalmetricssub-object for bridge decouplingemitDaemonLogto support custom event names and severity levels for structured lifecycle log recordsservice.instance.idto OTel Resource for process incarnation detectionforceFlushMetrics()for pre-shutdown metric snapshotCloses #4554 §6.
Design Highlights
daemon.http.request.countdaemon.http.request.durationdaemon.session.activedaemon.session.lifecycledaemon.channel.lifecycledaemon.prompt.queue_waitdaemon.prompt.durationdaemon.bridge.error.countdaemon.cancel.countdaemon.sse.activedaemon.process.heap_usedMax time-series cardinality: ~200 (all attributes bounded by design).
Test plan
daemon-metrics.test.tscovering initialization, recording, gauge callbacks, error normalization, and no-op before init🤖 Generated with Qwen Code