fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute#3813
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes two telemetry issues in packages/core to improve correctness and CLI shutdown reliability when OpenTelemetry exporters are slow or unreachable.
Changes:
- Set OpenTelemetry
service.versionresource attribute to the CLI/application version (config.getCliVersion() || 'unknown') instead of Node.jsprocess.version. - Add a bounded (10s) shutdown timeout in
shutdownTelemetry()by racingsdk.shutdown()against a timer so CLI exit can fail-open rather than hang. - Add tests covering the
service.versionresource attribute and the shutdown timeout behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| packages/core/src/telemetry/sdk.ts | Updates resource attributes to use CLI version and adds a timeout-bounded SDK shutdown flow. |
| packages/core/src/telemetry/sdk.test.ts | Extends the telemetry SDK tests to assert correct service.version and non-hanging shutdown behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
b1eab9f to
d8dbdbd
Compare
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
Additional test gaps (no specific diff line):
- Missing test for
sdk.shutdown()rejecting before timeout — thecatchblock error handling path is uncovered. - Missing test for
config.getCliVersion()returningundefined— the|| 'unknown'fallback is not exercised.
— deepseek-v4-pro via Qwen Code /review
d8dbdbd to
179f3ee
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
packages/core/src/telemetry/sdk.ts:320
- When shutdown times out,
currentSdkmay still be running (batch processors/metric readers may still hold resources), but the code clearstelemetryInitialized/sdk. This can allowinitializeTelemetry()to create a second SDK instance in the same process while the first one is still active, leading to duplicated instrumentation/export or lingering handles. Consider not clearingsdk/ not allowing re-init after a timeout (e.g., track a “shutdown timed out” state and keep telemetry disabled for the remainder of the process), or only clearing state once the underlying shutdown actually completes.
} finally {
telemetryInitialized = false;
sdk = undefined;
telemetryShutdownPromise = undefined;
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
179f3ee to
8f3f760
Compare
…resource attribute `shutdownTelemetry()` awaited `sdk.shutdown()` with no time bound, causing the CLI to hang indefinitely when the OTLP endpoint was unreachable. Wrap it in a `Promise.race` with a 10-second timeout that fails open (logs a warning and proceeds with cleanup). The `service.version` resource attribute was set to `process.version` (Node.js runtime version) instead of the actual application version. Replace it with `config.getCliVersion() || 'unknown'` to match the convention used throughout the codebase. Closes #3811 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
8f3f760 to
146b91e
Compare
…ng comment - Remove duplicate `diag.error()` in shutdown catch block — a rejected shutdown already surfaces as a thrown exception; only `debugLogger.error` is needed (consistent with pre-existing behavior). - Add comment to `.catch()` handler explaining why rejections are silently dropped when timeout hasn't fired (they're caught by the try/catch via Promise.race). - Update rejection test to remove stale `diag.error` assertion. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
….resolve wrap - Revert `??` back to `||` for `getCliVersion()` fallback to stay consistent with the 6 other call sites across the codebase and guard against empty-string misconfiguration. - Add comment explaining why `Promise.resolve()` wraps `shutdown()` (defensive measure for auto-mocked environments). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
- Re-add `diag.error()` to the shutdown rejection catch block so
the error is visible via OTel diagnostic channel, matching the
timeout path which uses `diag.warn`.
- Add `diagErrorSpy` assertion to the rejection test so the test
name ("should log error") is backed by actual log verification.
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — deepseek-v4-pro via Qwen Code /review
…resource attribute (#3813) * fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute `shutdownTelemetry()` awaited `sdk.shutdown()` with no time bound, causing the CLI to hang indefinitely when the OTLP endpoint was unreachable. Wrap it in a `Promise.race` with a 10-second timeout that fails open (logs a warning and proceeds with cleanup). The `service.version` resource attribute was set to `process.version` (Node.js runtime version) instead of the actual application version. Replace it with `config.getCliVersion() || 'unknown'` to match the convention used throughout the codebase. Closes #3811 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): remove redundant diag.error and add .catch() clarifying comment - Remove duplicate `diag.error()` in shutdown catch block — a rejected shutdown already surfaces as a thrown exception; only `debugLogger.error` is needed (consistent with pre-existing behavior). - Add comment to `.catch()` handler explaining why rejections are silently dropped when timeout hasn't fired (they're caught by the try/catch via Promise.race). - Update rejection test to remove stale `diag.error` assertion. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): use || for getCliVersion fallback and clarify Promise.resolve wrap - Revert `??` back to `||` for `getCliVersion()` fallback to stay consistent with the 6 other call sites across the codebase and guard against empty-string misconfiguration. - Add comment explaining why `Promise.resolve()` wraps `shutdown()` (defensive measure for auto-mocked environments). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): restore diag.error in catch block and add test assertion - Re-add `diag.error()` to the shutdown rejection catch block so the error is visible via OTel diagnostic channel, matching the timeout path which uses `diag.warn`. - Add `diagErrorSpy` assertion to the rejection test so the test name ("should log error") is backed by actual log verification. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) --------- Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
…resource attribute (QwenLM#3813) * fix(telemetry): add bounded shutdown timeout and fix service.version resource attribute `shutdownTelemetry()` awaited `sdk.shutdown()` with no time bound, causing the CLI to hang indefinitely when the OTLP endpoint was unreachable. Wrap it in a `Promise.race` with a 10-second timeout that fails open (logs a warning and proceeds with cleanup). The `service.version` resource attribute was set to `process.version` (Node.js runtime version) instead of the actual application version. Replace it with `config.getCliVersion() || 'unknown'` to match the convention used throughout the codebase. Closes QwenLM#3811 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): remove redundant diag.error and add .catch() clarifying comment - Remove duplicate `diag.error()` in shutdown catch block — a rejected shutdown already surfaces as a thrown exception; only `debugLogger.error` is needed (consistent with pre-existing behavior). - Add comment to `.catch()` handler explaining why rejections are silently dropped when timeout hasn't fired (they're caught by the try/catch via Promise.race). - Update rejection test to remove stale `diag.error` assertion. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): use || for getCliVersion fallback and clarify Promise.resolve wrap - Revert `??` back to `||` for `getCliVersion()` fallback to stay consistent with the 6 other call sites across the codebase and guard against empty-string misconfiguration. - Add comment explaining why `Promise.resolve()` wraps `shutdown()` (defensive measure for auto-mocked environments). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) * fix(telemetry): restore diag.error in catch block and add test assertion - Re-add `diag.error()` to the shutdown rejection catch block so the error is visible via OTel diagnostic channel, matching the timeout path which uses `diag.warn`. - Add `diagErrorSpy` assertion to the rejection test so the test name ("should log error") is backed by actual log verification. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code) --------- Co-authored-by: jinye.djy <jinye.djy@alibaba-inc.com>
Summary
shutdownTelemetry()now racessdk.shutdown()against a 10-second timeout. When the OTLP endpoint is unreachable, shutdown fails open (logs a warning and cleans up state) instead of hanging the CLI exit indefinitely.service.version: The resource attribute was incorrectly set toprocess.version(Node.js runtime version, e.g.v20.19.0). Replaced withconfig.getCliVersion() || 'unknown'to report the actual application version, matching the convention used throughout the codebase.Closes #3811
Test plan
1. All telemetry SDK tests pass (21/21)
2.
service.versionuses application version, notprocess.versionshould set service.version to the application version, not Node.js version— asserts the resource attribute equals'1.0.0-test'(from mockgetCliVersion()) and is NOTprocess.version3. Shutdown timeout completes even when
sdk.shutdown()never resolvesshould complete shutdown within timeout when SDK shutdown hangs— mockssdk.shutdown()to never resolve, advances fake timers past 10s, verifies shutdown completes andisTelemetrySdkInitialized()returnsfalse4. Shutdown rejection path logs error via
diag.errorshould log error when sdk.shutdown() rejects— mockssdk.shutdown()to reject, verifiesdiag.erroris called and state is cleaned up5.
getCliVersion()undefined falls back to'unknown'should fall back to "unknown" when getCliVersion returns undefined— mocksgetCliVersion()to returnundefined, verifiesservice.versionresource attribute is'unknown'6. Build and typecheck pass
🤖 Generated with Qwen Code