feat(diagnostics): trace context propagation + log transport isolation fix#27675
feat(diagnostics): trace context propagation + log transport isolation fix#27675sreis wants to merge 2 commits intoopenclaw:mainfrom
Conversation
The externalTransports Set in logger.ts is module-scoped, so plugin bundles that receive a separate copy of the module have their own isolated registry. registerLogTransport() from the OTel plugin may register on a different Set than the one the gateway logger reads. Fix: use globalThis.__openclawLogTransportState singleton, matching the pattern PR openclaw#12897 established for the diagnostic event bus. Related openclaw#12475
Add DiagnosticTraceContext type (traceId, spanId, parentSpanId) to the diagnostic event system and an AsyncLocalStorage-based store for propagating trace context through the async call chain without modifying function signatures. Wrap the two hot path entry points with thin wrapper functions: - dispatchReplyFromConfig: creates root trace context - runEmbeddedAttempt: creates child context (inherits parent traceId) The original function bodies are renamed to *Inner and left untouched, avoiding indentation churn in the diff. This is pure infrastructure -- no diagnostic events are emitted with trace context yet, and the OTel plugin does not consume it yet. Both are addressed in follow-up PRs. Related openclaw#8249
Greptile SummaryThis PR establishes trace context infrastructure for hierarchical OTel spans without changing existing behavior. The implementation adds Key changes:
Strengths:
No issues found - the implementation is correct and ready for the follow-up PR to consume the trace context. Confidence Score: 5/5
Last reviewed commit: 7bfbcd0 |
|
Thanks for the solid foundation here. I’m closing this as a duplicate of #28166, which is the consolidated and merge-ready path for trace-context propagation plus the log transport isolation fix. If I missed an edge case or you want a different split, tell me and I’ll reopen review right away. |
Summary
diagnostics-otelextension produces only disconnected pointspans with no parent-child hierarchy. Operators cannot see the end-to-end flow
of a message through routing, agent execution, and LLM calls. Additionally,
the log transport registry in
logger.tshas the same module-isolation bugthat PR fix(otel): complete diagnostics-otel OpenTelemetry v2 API migration #12897 fixed for the diagnostic event bus --
externalTransportsismodule-scoped and may not survive jiti/bundler splitting.
are a flat list of unrelated spans. This makes latency debugging and cost
attribution impossible. The log transport bug means OTel log export may
silently fail depending on how the bundler resolves modules.
globalThissingletonmatching the pattern PR fix(otel): complete diagnostics-otel OpenTelemetry v2 API migration #12897 established for the event bus. (2) New
DiagnosticTraceContexttype on all diagnostic events. (3) NewAsyncLocalStorage-based trace context store with factory functions. (4) Hotpath entry points (
dispatchReplyFromConfig,runEmbeddedAttempt) wrappedwith thin wrapper functions that establish trace context -- original function
bodies renamed to
*Innerand left untouched to avoid indentation churn.diagnostic event types emitted yet. The OTel plugin is not modified in this PR
-- it will consume the trace context in a follow-up PR. This is a pure
foundation PR.
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
No user-visible changes in this PR. This is infrastructure that enables
hierarchical OTel traces in a follow-up PR. The log transport fix may resolve
silent OTel log export failures for users who had the module isolation issue.
Security Impact (required)
NoNoNoNoNoRepro + Verification
Environment
Steps
pnpm install --frozen-lockfilepnpm build(verify compilation)pnpm test -- --run src/infra/diagnostic-trace-context.test.ts src/logging/(verify 54/54 tests pass)
pnpm lint(verify 0 warnings, 0 errors)Expected
All tests pass. Build succeeds. No lint errors.
Actual
All tests pass. Build succeeds. No lint errors.
Evidence
Test output (9 new tests, all passing):
Full targeted test run (54/54 passing, including all logging tests to verify
the globalThis change is safe):
Diff stats: 183 insertions, 1 deletion across 6 files. No indentation churn.
Human Verification (required)
lengths (32-char traceId, 16-char spanId). Child contexts correctly inherit
parent traceId and link via parentSpanId. AsyncLocalStorage propagates context
across async boundaries including
setTimeout,Promise.resolve().then().Nested
diagnosticTraceStore.run()calls correctly scope and restore context.createChildContext()without any parent (outside ALS)creates a new root.
currentTraceContext()returns undefined outside any run.All existing logging tests continue to pass after the globalThis migration.
emitted by this PR -- that is the follow-up). Multi-process globalThis
isolation (Vitest workers are isolated by default).
Compatibility / Migration
YesNoNoThe
traceCtxfield onDiagnosticBaseEventis optional. Existing diagnosticevent listeners (including the OTel plugin) will receive events with
traceCtx: undefineduntil Phase 2 adds emission. No existing behavior changes.Failure Recovery (if this breaks)
globalThissingleton degrades gracefully -- if removed, log transportreturns to module-scoped behavior. The AsyncLocalStorage wrapping is inert
(the trace context is created but not yet consumed by any event emission).
src/logging/logger.ts,src/infra/diagnostic-events.ts,src/auto-reply/reply/dispatch-from-config.ts,src/agents/pi-embedded-runner/run/attempt.tsAsyncLocalStoragecontextis lost across a specific async boundary (rare in Node 22+), child contexts
would fall back to creating new root traces. This would manifest as
disconnected spans once the OTel plugin consumes the context -- not a crash.
Risks and Mitigations
Risk:
globalThislog transport state could cause test interference if testsshare the same Vitest worker process.
resetLogger()test helper clears cached state.The globalThis Set is shared but
registerLogTransport()returns anunsubscribe function, and
attachExternalTransportchecks membership beforedispatching. Existing logging tests (11 files, 45 tests) all pass.
Risk:
AsyncLocalStorageoverhead on the hot path.getStore()is ~50ns per call (Node.js core, zero-allocationread). Negligible vs. seconds of LLM latency. No measurable impact.
AI disclosure: This PR was developed with AI assistance (Claude). All
changes were manually reviewed and verified with the test suite.