Skip to content

feat(diagnostics): trace context propagation + log transport isolation fix#27675

Closed
sreis wants to merge 2 commits intoopenclaw:mainfrom
triclambert:feat/diagnostics-trace-context
Closed

feat(diagnostics): trace context propagation + log transport isolation fix#27675
sreis wants to merge 2 commits intoopenclaw:mainfrom
triclambert:feat/diagnostics-trace-context

Conversation

@sreis
Copy link
Copy Markdown

@sreis sreis commented Feb 26, 2026

Summary

  • Problem: The diagnostics-otel extension produces only disconnected point
    spans with no parent-child hierarchy. Operators cannot see the end-to-end flow
    of a message through routing, agent execution, and LLM calls. Additionally,
    the log transport registry in logger.ts has the same module-isolation bug
    that PR fix(otel): complete diagnostics-otel OpenTelemetry v2 API migration #12897 fixed for the diagnostic event bus -- externalTransports is
    module-scoped and may not survive jiti/bundler splitting.
  • Why it matters: Without trace hierarchy, OTel traces in Jaeger/Fiddler/etc.
    are a flat list of unrelated spans. This makes latency debugging and cost
    attribution impossible. The log transport bug means OTel log export may
    silently fail depending on how the bundler resolves modules.
  • What changed: (1) Log transport registry now uses a globalThis singleton
    matching the pattern PR fix(otel): complete diagnostics-otel OpenTelemetry v2 API migration #12897 established for the event bus. (2) New
    DiagnosticTraceContext type on all diagnostic events. (3) New
    AsyncLocalStorage-based trace context store with factory functions. (4) Hot
    path entry points (dispatchReplyFromConfig, runEmbeddedAttempt) wrapped
    with thin wrapper functions that establish trace context -- original function
    bodies renamed to *Inner and left untouched to avoid indentation churn.
  • What did NOT change: No new dependencies, no config schema changes, no new
    diagnostic event types emitted yet. The OTel plugin is not modified in this PR
    -- it will consume the trace context in a follow-up PR. This is a pure
    foundation PR.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

No user-visible changes in this PR. This is infrastructure that enables
hierarchical OTel traces in a follow-up PR. The log transport fix may resolve
silent OTel log export failures for users who had the module isolation issue.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Darwin arm64)
  • Runtime: Node v22+
  • Test framework: Vitest 4.0.18

Steps

  1. pnpm install --frozen-lockfile
  2. pnpm build (verify compilation)
  3. pnpm test -- --run src/infra/diagnostic-trace-context.test.ts src/logging/
    (verify 54/54 tests pass)
  4. pnpm lint (verify 0 warnings, 0 errors)

Expected

All tests pass. Build succeeds. No lint errors.

Actual

All tests pass. Build succeeds. No lint errors.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets

Test output (9 new tests, all passing):

 ✓ src/infra/diagnostic-trace-context.test.ts (9 tests) 4ms
   ✓ createTraceContext > produces a valid root context with 32-char traceId and 16-char spanId
   ✓ createTraceContext > generates unique IDs across calls
   ✓ createChildContext > preserves parent traceId and sets parentSpanId
   ✓ createChildContext > reads parent from AsyncLocalStorage when no explicit parent given
   ✓ createChildContext > creates a new root context when no parent exists anywhere
   ✓ currentTraceContext > returns undefined outside of a store run
   ✓ currentTraceContext > returns the active context inside a store run
   ✓ currentTraceContext > returns nested child context after re-wrapping
   ✓ async propagation > propagates trace context across async boundaries

Full targeted test run (54/54 passing, including all logging tests to verify
the globalThis change is safe):

 Test Files  11 passed (11)
       Tests  54 passed (54)

Diff stats: 183 insertions, 1 deletion across 6 files. No indentation churn.

Human Verification (required)

  • Verified scenarios: Trace context factory produces valid hex IDs with correct
    lengths (32-char traceId, 16-char spanId). Child contexts correctly inherit
    parent traceId and link via parentSpanId. AsyncLocalStorage propagates context
    across async boundaries including setTimeout, Promise.resolve().then().
    Nested diagnosticTraceStore.run() calls correctly scope and restore context.
  • Edge cases checked: createChildContext() without any parent (outside ALS)
    creates a new root. currentTraceContext() returns undefined outside any run.
    All existing logging tests continue to pass after the globalThis migration.
  • What I did not verify: Live OTel collector integration (no new spans are
    emitted by this PR -- that is the follow-up). Multi-process globalThis
    isolation (Vitest workers are isolated by default).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

The traceCtx field on DiagnosticBaseEvent is optional. Existing diagnostic
event listeners (including the OTel plugin) will receive events with
traceCtx: undefined until Phase 2 adds emission. No existing behavior changes.

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert the two commits. The
    globalThis singleton degrades gracefully -- if removed, log transport
    returns to module-scoped behavior. The AsyncLocalStorage wrapping is inert
    (the trace context is created but not yet consumed by any event emission).
  • Files/config to restore: src/logging/logger.ts,
    src/infra/diagnostic-events.ts,
    src/auto-reply/reply/dispatch-from-config.ts,
    src/agents/pi-embedded-runner/run/attempt.ts
  • Known bad symptoms reviewers should watch for: If AsyncLocalStorage context
    is lost across a specific async boundary (rare in Node 22+), child contexts
    would fall back to creating new root traces. This would manifest as
    disconnected spans once the OTel plugin consumes the context -- not a crash.

Risks and Mitigations

  • Risk: globalThis log transport state could cause test interference if tests
    share the same Vitest worker process.

    • Mitigation: The existing resetLogger() test helper clears cached state.
      The globalThis Set is shared but registerLogTransport() returns an
      unsubscribe function, and attachExternalTransport checks membership before
      dispatching. Existing logging tests (11 files, 45 tests) all pass.
  • Risk: AsyncLocalStorage overhead on the hot path.

    • Mitigation: getStore() is ~50ns per call (Node.js core, zero-allocation
      read). Negligible vs. seconds of LLM latency. No measurable impact.

AI disclosure: This PR was developed with AI assistance (Claude). All
changes were manually reviewed and verified with the test suite.

The externalTransports Set in logger.ts is module-scoped, so plugin
bundles that receive a separate copy of the module have their own
isolated registry. registerLogTransport() from the OTel plugin may
register on a different Set than the one the gateway logger reads.

Fix: use globalThis.__openclawLogTransportState singleton, matching the
pattern PR openclaw#12897 established for the diagnostic event bus.

Related openclaw#12475
Add DiagnosticTraceContext type (traceId, spanId, parentSpanId) to the
diagnostic event system and an AsyncLocalStorage-based store for
propagating trace context through the async call chain without modifying
function signatures.

Wrap the two hot path entry points with thin wrapper functions:
- dispatchReplyFromConfig: creates root trace context
- runEmbeddedAttempt: creates child context (inherits parent traceId)

The original function bodies are renamed to *Inner and left untouched,
avoiding indentation churn in the diff.

This is pure infrastructure -- no diagnostic events are emitted with
trace context yet, and the OTel plugin does not consume it yet. Both
are addressed in follow-up PRs.

Related openclaw#8249
@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: S labels Feb 26, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 26, 2026

Greptile Summary

This PR establishes trace context infrastructure for hierarchical OTel spans without changing existing behavior. The implementation adds AsyncLocalStorage-based context propagation at two key entry points (dispatchReplyFromConfig creates root contexts, runEmbeddedAttempt creates child contexts) and fixes a log transport module isolation bug using the globalThis singleton pattern from #12897. The trace context (32-char traceId, 16-char spanId, optional parentSpanId) is properly typed and tested but not yet consumed by any event emission - this is intentional foundation work for a follow-up PR that will integrate with the OTel plugin.

Key changes:

  • New diagnostic-trace-context.ts module with factory functions and AsyncLocalStorage store
  • Optional traceCtx field added to DiagnosticBaseEvent (backward compatible)
  • Log transport registry migrated from module-scoped to globalThis.__openclawLogTransportState
  • Thin wrapper functions at hot-path entry points preserve original function bodies as *Inner to avoid indentation churn
  • 9 new tests covering root/child context creation, AsyncLocalStorage propagation, and async boundary crossing

Strengths:

No issues found - the implementation is correct and ready for the follow-up PR to consume the trace context.

Confidence Score: 5/5

  • This PR is safe to merge with no risk
  • Score reflects zero behavior changes (trace context infrastructure is created but not yet consumed), comprehensive test coverage (54/54 tests passing including 9 new trace context tests), backward compatibility (optional traceCtx field), and adherence to established patterns (globalThis singleton matching fix(otel): complete diagnostics-otel OpenTelemetry v2 API migration #12897). The wrapper pattern avoids indentation churn while preserving type safety. All edge cases are tested including async propagation, nested contexts, and fallback to root creation.
  • No files require special attention

Last reviewed commit: 7bfbcd0

@vincentkoc
Copy link
Copy Markdown
Member

Thanks for the solid foundation here.

I’m closing this as a duplicate of #28166, which is the consolidated and merge-ready path for trace-context propagation plus the log transport isolation fix.
Your work in #27675 is preserved in the attribution chain.

If I missed an edge case or you want a different split, tell me and I’ll reopen review right away.

@vincentkoc vincentkoc closed this Feb 27, 2026
@vincentkoc vincentkoc added dedupe:child Duplicate issue/PR child in dedupe cluster close:duplicate Closed as duplicate labels Feb 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling close:duplicate Closed as duplicate dedupe:child Duplicate issue/PR child in dedupe cluster size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants