Skip to content

Diagnostics: dedupe OTEL trace context + log transport fixes#28166

Closed
vincentkoc wants to merge 8 commits intoopenclaw:mainfrom
vincentkoc:vincentkoc-code/dedupe-otel-trace-context-cluster
Closed

Diagnostics: dedupe OTEL trace context + log transport fixes#28166
vincentkoc wants to merge 8 commits intoopenclaw:mainfrom
vincentkoc:vincentkoc-code/dedupe-otel-trace-context-cluster

Conversation

@vincentkoc
Copy link
Copy Markdown
Member

Summary

  • Problem: Three open PRs overlap on OTEL diagnostics/log transport and trace context wiring.
  • Why it matters: Duplicate PRs are blocking review/merge and make attribution unclear.
  • What changed: Consolidated the stable pieces into a single, clean branch with preserved authorship, plus a small test mock fix for OTEL context APIs.
  • What did NOT change (scope boundary): No config schema changes; no new dependencies; no behavior changes outside diagnostics/otel paths.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • Adds trace context propagation for diagnostics-otel spans and keeps log transport/global state stable across module bundles.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS 15.4
  • Runtime/container: Node 22+
  • Model/provider:
  • Integration/channel (if any): diagnostics-otel extension
  • Relevant config (redacted):

Steps

  1. pnpm install
  2. pnpm vitest run src/logger.test.ts src/infra/diagnostic-trace-context.test.ts src/infra/diagnostic-events.test.ts extensions/diagnostics-otel/src/service.test.ts

Expected

  • Tests pass.

Actual

  • Tests pass.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: trace context + logger/diagnostics unit tests.
  • Edge cases checked: trace context parent span handling in OTEL spans.
  • What you did not verify: end-to-end collector export.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: revert this PR.
  • Files/config to restore: diagnostics/otel + logging files only.
  • Known bad symptoms reviewers should watch for: missing parent span relationships in OTEL spans.

Risks and Mitigations

None.


Thanks for the groundwork and reviews:

AI-assisted: Yes
Testing level: targeted tests (see Repro + Verification)

leonnardo and others added 6 commits February 26, 2026 20:39
Add DiagnosticTraceContext type (traceId, spanId, parentSpanId) to the
diagnostic event system and an AsyncLocalStorage-based store for
propagating trace context through the async call chain without modifying
function signatures.

Wrap the two hot path entry points with thin wrapper functions:
- dispatchReplyFromConfig: creates root trace context
- runEmbeddedAttempt: creates child context (inherits parent traceId)

The original function bodies are renamed to *Inner and left untouched,
avoiding indentation churn in the diff.

This is pure infrastructure -- no diagnostic events are emitted with
trace context yet, and the OTel plugin does not consume it yet. Both
are addressed in follow-up PRs.

Related openclaw#8249
Without this, message.processed events (which have traceId but no
parentSpanId) get a fake parent with spanId '0000000000000000'.
Now message.processed is a true root span — parent context is only
created when both traceId AND parentSpanId exist.
- Fix TypeScript errors in service.test.ts (proper tuple type assertions)
- Remove parentSpanId from message.processed events — message spans are
  root spans in their trace, only model.usage spans are children
- Update tests to verify correct hierarchy: message.processed has no
  parentSpanId, model.usage has parentSpanId pointing to message span
@vincentkoc vincentkoc self-assigned this Feb 27, 2026
@leonnardo
Copy link
Copy Markdown

Makes sense to consolidate. The trace context work here goes past what #16865 covered.

The log transport sharing (logger.ts) was verified e2e in #16865 with a live OTLP collector + Loki, not just unit tests. The unit tests alone didn't catch the original module-isolation bug, so worth a runtime check before merging.

I can run the e2e validation on this branch if you want.

@vincentkoc
Copy link
Copy Markdown
Member Author

Makes sense to consolidate. The trace context work here goes past what #16865 covered.

The log transport sharing (logger.ts) was verified e2e in #16865 with a live OTLP collector + Loki, not just unit tests. The unit tests alone didn't catch the original module-isolation bug, so worth a runtime check before merging.

I can run the e2e validation on this branch if you want.

@leonnardo thanks, given how core this is i need to triple check these changes but your welcomed to test ans report back (ideally with screenshots)

@leonnardo
Copy link
Copy Markdown

leonnardo commented Feb 27, 2026

Spun up an isolated dev gateway with OTEL fully enabled.

Setup

OPENCLAW_CONFIG_PATH (isolated config)
OTLP HTTP collector → Victoria Metrics + Loki
service.name: openclaw-gateway-dev-otel-test
Node v24.13.0 / Linux 6.8.0-100-generic (x64)

Reproduce

git checkout vincentkoc-code/dedupe-otel-trace-context-cluster
pnpm install && pnpm build

OPENCLAW_CONFIG_PATH=/tmp/otel-test/openclaw.json \
OPENCLAW_SKIP_CHANNELS=1 \
node scripts/run-node.mjs --dev gateway

Config:

{
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "http://<collector>:4318",
      "serviceName": "openclaw-gateway-dev-otel-test",
      "traces": true, "metrics": true, "logs": true
    }
  }
}

LGTM on logs + metrics:
Grafana Explore — Loki logs

Grafana Explore — 478 metric series

@vincentkoc vincentkoc marked this pull request as ready for review February 27, 2026 15:53
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 27, 2026

Greptile Summary

This PR consolidates overlapping OTEL diagnostics work from three related PRs, adding trace context infrastructure and fixing log transport global state persistence.

Key Changes

Trace Context Infrastructure

  • Added AsyncLocalStorage-based trace context management (src/infra/diagnostic-trace-context.ts) for propagating trace IDs and span IDs across async boundaries
  • Entry points (dispatchReplyFromConfig, runEmbeddedAttempt) now establish trace contexts using wrapper functions
  • Added optional traceId and parentSpanId fields to diagnostic events for span correlation

OTEL Service Enhancements

  • OTEL spans now accept parent trace context to establish hierarchical relationships (extensions/diagnostics-otel/src/service.ts:369-394)
  • Added GenAI semantic convention attributes (gen_ai.operation.name, gen_ai.system, etc.) for model usage spans
  • Span names now include model information (e.g., "chat claude-opus-4-6")

Log Transport Fix

  • Fixed log transport registry to persist across module bundles using globalThis (src/logging/logger.ts:42-61)
  • Registry now tracks all active logger instances and ensures transports attach to all of them
  • Uses WeakMap to prevent duplicate transport attachments

Session State Tracking

  • Added traceId and currentSpanId fields to session state for tracking trace context across message lifecycle
  • logWebhookReceived now generates a trace ID at the start of each message flow
  • logMessageQueued stores the trace ID in session state
  • logMessageProcessed generates a span ID and stores it in session state

Test Coverage

Comprehensive tests added for:

  • AsyncLocalStorage trace context creation and propagation
  • Trace context fields passing through diagnostic events
  • Log transport registration with multiple logger instances
  • OTEL API mocks updated to support context.active(), trace.setSpanContext(), and TraceFlags.SAMPLED

Notes

The trace context infrastructure is foundational work that establishes the plumbing for hierarchical span relationships. The AsyncLocalStorage context is set up at entry points but appears to be intended for integration with code paths not modified in this PR (likely model usage emission). Session state provides a separate mechanism for tracking trace IDs across the webhook→message lifecycle.

The implementation is backward compatible and all tests pass.

Confidence Score: 4/5

  • Safe to merge - well-tested foundational work with backward compatible changes
  • Score reflects solid implementation of trace context infrastructure and log transport fixes, with comprehensive test coverage. The code is backward compatible and introduces no breaking changes. Deducting one point because the trace context integration appears partial - AsyncLocalStorage contexts are established but the integration with all diagnostic logging paths isn't fully visible in this PR, suggesting this may be foundational work with full integration coming from other consolidated PRs or follow-up work.
  • No files require special attention - changes are well-structured and tested

Last reviewed commit: 0d53d29

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0d53d29376

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/logging/diagnostic.ts
Comment on lines +122 to +124
if (params.traceId) {
state.traceId = params.traceId;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Read active trace context when queueing diagnostic messages

dispatchReplyFromConfig now wraps message handling in diagnosticTraceStore.run(...), but logMessageQueued only records state.traceId when callers explicitly pass params.traceId. Current call sites still invoke logMessageQueued without that field, so message.queued and subsequent message.processed events keep emitting with traceId unset, which means the new trace-context wrapper does not actually correlate OTEL spans in production flows.

Useful? React with 👍 / 👎.

Comment thread src/logging/diagnostic.ts
Comment on lines +174 to +175
const spanId = generateSpanId();
state.currentSpanId = spanId;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Emit parent span id for processed-message diagnostic events

logMessageProcessed creates a fresh span id and stores it in session state, but the emitted message.processed diagnostic event still carries only traceId and never publishes parentSpanId (or any span id field). Because extensions/diagnostics-otel/src/service.ts only links parent context when evt.parentSpanId is present, these spans remain root spans and cannot form the intended hierarchy.

Useful? React with 👍 / 👎.

@c3013
Copy link
Copy Markdown

c3013 commented Mar 13, 2026

This modification is very helpful. By the way, will the calls from the main agent to the sub-agents be correctly chained here?

@sreis
Copy link
Copy Markdown

sreis commented Apr 8, 2026

Hello! Any plans to move this work forward?

@rohitcloudsine
Copy link
Copy Markdown

Any updates on the otel fix?

@vincentkoc
Copy link
Copy Markdown
Member Author

Thanks for the work here. We split the OTEL/diagnostics foundation into smaller mainline commits instead of landing this branch as-is, so this is now resolved on main.

The relevant landed pieces are:

Closing this as superseded/resolved by the mainline commits. Appreciate the push here — it helped shape the final, safer split.

@vincentkoc vincentkoc closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling dedupe:parent Primary canonical item in dedupe cluster extensions: diagnostics-otel Extension: diagnostics-otel maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants