Diagnostics: dedupe OTEL trace context + log transport fixes#28166
Diagnostics: dedupe OTEL trace context + log transport fixes#28166vincentkoc wants to merge 8 commits intoopenclaw:mainfrom
Conversation
Add DiagnosticTraceContext type (traceId, spanId, parentSpanId) to the diagnostic event system and an AsyncLocalStorage-based store for propagating trace context through the async call chain without modifying function signatures. Wrap the two hot path entry points with thin wrapper functions: - dispatchReplyFromConfig: creates root trace context - runEmbeddedAttempt: creates child context (inherits parent traceId) The original function bodies are renamed to *Inner and left untouched, avoiding indentation churn in the diff. This is pure infrastructure -- no diagnostic events are emitted with trace context yet, and the OTel plugin does not consume it yet. Both are addressed in follow-up PRs. Related openclaw#8249
Without this, message.processed events (which have traceId but no parentSpanId) get a fake parent with spanId '0000000000000000'. Now message.processed is a true root span — parent context is only created when both traceId AND parentSpanId exist.
- Fix TypeScript errors in service.test.ts (proper tuple type assertions) - Remove parentSpanId from message.processed events — message spans are root spans in their trace, only model.usage spans are children - Update tests to verify correct hierarchy: message.processed has no parentSpanId, model.usage has parentSpanId pointing to message span
|
Makes sense to consolidate. The trace context work here goes past what #16865 covered. The log transport sharing (logger.ts) was verified e2e in #16865 with a live OTLP collector + Loki, not just unit tests. The unit tests alone didn't catch the original module-isolation bug, so worth a runtime check before merging. I can run the e2e validation on this branch if you want. |
@leonnardo thanks, given how core this is i need to triple check these changes but your welcomed to test ans report back (ideally with screenshots) |
|
Spun up an isolated dev gateway with OTEL fully enabled. SetupReproducegit checkout vincentkoc-code/dedupe-otel-trace-context-cluster
pnpm install && pnpm build
OPENCLAW_CONFIG_PATH=/tmp/otel-test/openclaw.json \
OPENCLAW_SKIP_CHANNELS=1 \
node scripts/run-node.mjs --dev gatewayConfig: {
"diagnostics": {
"enabled": true,
"otel": {
"enabled": true,
"endpoint": "http://<collector>:4318",
"serviceName": "openclaw-gateway-dev-otel-test",
"traces": true, "metrics": true, "logs": true
}
}
} |
Greptile SummaryThis PR consolidates overlapping OTEL diagnostics work from three related PRs, adding trace context infrastructure and fixing log transport global state persistence. Key ChangesTrace Context Infrastructure
OTEL Service Enhancements
Log Transport Fix
Session State Tracking
Test CoverageComprehensive tests added for:
NotesThe trace context infrastructure is foundational work that establishes the plumbing for hierarchical span relationships. The AsyncLocalStorage context is set up at entry points but appears to be intended for integration with code paths not modified in this PR (likely model usage emission). Session state provides a separate mechanism for tracking trace IDs across the webhook→message lifecycle. The implementation is backward compatible and all tests pass. Confidence Score: 4/5
Last reviewed commit: 0d53d29 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0d53d29376
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (params.traceId) { | ||
| state.traceId = params.traceId; | ||
| } |
There was a problem hiding this comment.
Read active trace context when queueing diagnostic messages
dispatchReplyFromConfig now wraps message handling in diagnosticTraceStore.run(...), but logMessageQueued only records state.traceId when callers explicitly pass params.traceId. Current call sites still invoke logMessageQueued without that field, so message.queued and subsequent message.processed events keep emitting with traceId unset, which means the new trace-context wrapper does not actually correlate OTEL spans in production flows.
Useful? React with 👍 / 👎.
| const spanId = generateSpanId(); | ||
| state.currentSpanId = spanId; |
There was a problem hiding this comment.
Emit parent span id for processed-message diagnostic events
logMessageProcessed creates a fresh span id and stores it in session state, but the emitted message.processed diagnostic event still carries only traceId and never publishes parentSpanId (or any span id field). Because extensions/diagnostics-otel/src/service.ts only links parent context when evt.parentSpanId is present, these spans remain root spans and cannot form the intended hierarchy.
Useful? React with 👍 / 👎.
|
This modification is very helpful. By the way, will the calls from the main agent to the sub-agents be correctly chained here? |
|
Hello! Any plans to move this work forward? |
|
Any updates on the otel fix? |
|
Thanks for the work here. We split the OTEL/diagnostics foundation into smaller mainline commits instead of landing this branch as-is, so this is now resolved on The relevant landed pieces are:
Closing this as superseded/resolved by the mainline commits. Appreciate the push here — it helped shape the final, safer split. |


Summary
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
Security Impact (required)
No)No)No)No)No)Yes, explain risk + mitigation:Repro + Verification
Environment
Steps
pnpm installpnpm vitest run src/logger.test.ts src/infra/diagnostic-trace-context.test.ts src/infra/diagnostic-events.test.ts extensions/diagnostics-otel/src/service.test.tsExpected
Actual
Evidence
Human Verification (required)
What you personally verified (not just CI), and how:
Compatibility / Migration
Yes)No)No)Failure Recovery (if this breaks)
Risks and Mitigations
None.
Thanks for the groundwork and reviews:
AI-assisted: Yes
Testing level: targeted tests (see Repro + Verification)