feat(diagnostics-otel): unify end-to-end tracing#70424
feat(diagnostics-otel): unify end-to-end tracing#70424jlapenna wants to merge 4 commits intoopenclaw:mainfrom
Conversation
…ation and context propagation - Move OTel initialization to an early-phase preload script (otel-preload.mjs) - Explicitly propagate active context in agent runner, sandbox backend, and tool execution - Add 'openclaw.exec' span for granular shell command tracing - Configure Dockerfile to include preload script
Greptile SummaryThis PR unifies end-to-end OTel tracing across OpenClaw, LiteLLM, and vLLM by introducing an early-phase
Confidence Score: 2/5Not safe to merge — the missing OTel import in pi-tools.abort.ts will crash every tool invocation at runtime. There is a clear P0 defect: wrapToolWithTracing uses trace, context, and SpanStatusCode that are not imported, guaranteeing a ReferenceError the first time any agent tool is executed. A P1 risk (double-init when OPENCLAW_OTEL_PRELOADED is not self-set) is also present. Both must be resolved before merge. src/agents/pi-tools.abort.ts (missing import — P0) and otel-preload.mjs (missing self-registration of env var — P1). Prompt To Fix All With AIThis is a comment left during a code review.
Path: src/agents/pi-tools.abort.ts
Line: 74-105
Comment:
**Missing OTel import — `ReferenceError` at runtime**
`wrapToolWithTracing` references `trace`, `context`, and `SpanStatusCode` (lines 82–94), but none of these are imported in this file. Every tool execution that goes through `wrapToolWithTracing` will throw `ReferenceError: trace is not defined`, breaking all wrapped tool calls.
```suggestion
import { trace, SpanStatusCode, context } from "@opentelemetry/api";
import { copyPluginToolMeta } from "../plugins/tools.js";
import { bindAbortRelay } from "../utils/fetch-timeout.js";
import { copyChannelAgentToolMeta } from "./channel-tools.js";
import type { AnyAgentTool } from "./pi-tools.types.js";
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: otel-preload.mjs
Line: 29
Comment:
**Preload script does not self-register `OPENCLAW_OTEL_PRELOADED`**
The `diagnostics-otel` plugin skips SDK initialization only when `process.env.OPENCLAW_OTEL_PRELOADED === "1"`, but the preload script never sets that variable. If a user sets `NODE_OPTIONS=--import /app/otel-preload.mjs` but omits `OPENCLAW_OTEL_PRELOADED=1` from their environment (an easy mistake), both the preload and the plugin will call `sdk.start()`, causing double-initialization of the OTel SDK with duplicate span processors and duplicate exports. Adding a single line here makes the preload self-declaring and eliminates this failure mode.
```suggestion
sdk.start();
process.env.OPENCLAW_OTEL_PRELOADED = "1";
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: otel-preload.mjs
Line: 31-36
Comment:
**`SIGINT` not handled — SDK may not flush on Ctrl-C**
Only `SIGTERM` is hooked for graceful shutdown. In interactive or development environments the process is typically stopped with `SIGINT` (Ctrl-C), which will bypass `sdk.shutdown()` and may drop buffered spans. Consider adding a matching `SIGINT` handler or sharing the shutdown logic.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(diagnostics-otel): unify end-to-end..." | Re-trigger Greptile |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1d58befd2b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Add missing @opentelemetry/api imports to fix ReferenceErrors - Set OPENCLAW_OTEL_PRELOADED environment variable in preload script - Add SIGINT handler to preload script for graceful shutdown - Refactor diagnostics-otel plugin to preserve diagnostic hooks in preloaded mode - Add OTel dependencies to root package.json for core access
Feedback AddressedI have addressed the critical issues identified in the automated reviews:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2829fd31e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… accuracy - Update pnpm-lock.yaml to include OTel dependencies - Exclude tokenjuice from pnpm minimumReleaseAge to unblock CI install - Ensure 'openclaw.exec' span covers the full duration of process execution by awaiting handle.promise
Update: Final Feedback AddressedI have addressed the second round of feedback to ensure CI stability and tracing accuracy:
These fixes complement the previous round of ReferenceError and initialization fixes. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ef936b1fc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| : undefined; | ||
|
|
||
| if (tracesEnabled || metricsEnabled) { | ||
| if (!preloaded && (tracesEnabled || metricsEnabled)) { |
There was a problem hiding this comment.
Honor traces-disabled config in preloaded OTel mode
When OPENCLAW_OTEL_PRELOADED=1, this guard skips SDK setup entirely, so diagnostics.otel.traces (and related trace settings like sampling) are no longer enforced by the plugin. In the same commit, otel-preload.mjs always creates an OTLPTraceExporter, so deployments that set diagnostics.otel.traces=false will still emit spans in preloaded mode, which is an unexpected config bypass and can export data users explicitly tried to suppress.
Useful? React with 👍 / 👎.
| sdk.shutdown() | ||
| .then(() => console.log("OTel SDK shut down")) | ||
| .catch((err) => console.log("Error shutting down OTel SDK", err)) | ||
| .finally(() => process.exit(0)); |
There was a problem hiding this comment.
Let app signal handlers control process exit
Calling process.exit(0) from the preload signal hook can cut off the host app’s own async SIGTERM/SIGINT shutdown work (e.g., graceful server close/flush), because the preload module is installed before app code and races those handlers. In preloaded deployments this can cause early termination and skipped cleanup even though the app already has dedicated signal handling.
Useful? React with 👍 / 👎.
| const tracer = trace.getTracer("openclaw"); | ||
| const span = tracer.startSpan("openclaw.exec", { | ||
| attributes: { | ||
| "openclaw.exec.command": opts.command, |
There was a problem hiding this comment.
Redact command text before adding span attributes
This exports full user command text to telemetry (openclaw.exec.command) with no redaction. Commands frequently contain secrets (tokens, passwords, auth headers), so enabling OTLP tracing can leak credentials to the collector. The diagnostics pipeline already redacts other outbound fields, so command attributes should be sanitized or replaced with non-sensitive metadata.
Useful? React with 👍 / 👎.
|
@codex address that feedback |
|
To use Codex here, create an environment for this repo. |
- Two Diagnostics/OTEL Changes entries credited issue #70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as #71451 (exec-process telemetry) and #71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue #71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
… tool-result pruning entries Three entries were missing co-credits I should have preserved: - Diagnostics/OTEL exec-process spans (#71451): @vincentkoc implemented, but @jlapenna's #70424 proposed the broader tracing work this entry builds on. Now credits both. - Diagnostics/OTEL preloaded SDK (#71450): same pattern — credits @vincentkoc and @jlapenna. - Agents/tool-result pruning (#51267): @cgdusek's PR explicitly built on prior work in #39331 by @alvinttang and #34980 by @coffeexcoin. Now credits all three.
- Two Diagnostics/OTEL Changes entries credited issue openclaw#70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as openclaw#71451 (exec-process telemetry) and openclaw#71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue openclaw#71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
… tool-result pruning entries Three entries were missing co-credits I should have preserved: - Diagnostics/OTEL exec-process spans (openclaw#71451): @vincentkoc implemented, but @jlapenna's openclaw#70424 proposed the broader tracing work this entry builds on. Now credits both. - Diagnostics/OTEL preloaded SDK (openclaw#71450): same pattern — credits @vincentkoc and @jlapenna. - Agents/tool-result pruning (openclaw#51267): @cgdusek's PR explicitly built on prior work in openclaw#39331 by @alvinttang and openclaw#34980 by @coffeexcoin. Now credits all three.
… Unreleased Three of my (vincentkoc) entries were missing closing PR refs, and several maintainer-fix entries were missing credit for the user who reported the underlying issue: - Diagnostics/OTEL outbound delivery: add (#71471) and credit @jlapenna whose #70424 framed the broader tracing work. - Cron malformed legacy jobs: add (#71509). - OpenAI/Codex OAuth region failures: add (#71501) and credit reporter @wulala-xjj (#51175). - Telegram duplicate pollers: credit reporter @Co-Messi (#56230). - MCP/CLI one-shot retire: credit reporter @spartoviMD (#71457). - OpenAI/Codex image baseUrl canonicalize: credit reporter @GodsBoy (#71460). - Feishu TTS Ogg/Opus: credit reporters @sg1416-zg (#61249) and @ycjlb2023-peteryi (#37868). - MiniMax TTS portal OAuth: credit reporter @zx15210404690-hash (#55017). - MCP config reload disposal: credit reporter @xieyuanqing (#60656).
|
Closing as this work is being handled centrally and rolling out in parts today and yesterdays releases. |
|
Follow-up: the OTEL/diagnostics foundation from this thread has now been split and landed on Key landed pieces include GenAI semantic metrics/spans, content-capture controls, trusted trace parenting guardrails, provider request-id hashes, memory/context/tool-loop diagnostics, and docs: Marking this thread as resolved on main. Thanks again for the original direction. |
|
happy to help!
…On Sat, Apr 25, 2026 at 1:00 PM Vincent Koc ***@***.***> wrote:
*vincentkoc* left a comment (openclaw/openclaw#70424)
<#70424 (comment)>
Follow-up: the OTEL/diagnostics foundation from this thread has now been
split and landed on main through narrower commits.
Key landed pieces include GenAI semantic metrics/spans, content-capture
controls, trusted trace parenting guardrails, provider request-id hashes,
memory/context/tool-loop diagnostics, and docs:
- d4d4a8c
<d4d4a8c>
- 7bbd473
<7bbd473>
- dc19069
<dc19069>
- cd7a8f8
<cd7a8f8>
- 5815ca9
<5815ca9>
- 4411432
<4411432>
- 89c5298
<89c5298>
- 64582bb
<64582bb>
Marking this thread as resolved on main. Thanks again for the original
direction.
—
Reply to this email directly, view it on GitHub
<#70424 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAALGVXESOUQ4TKJUI2HR2D4XUKONAVCNFSM6AAAAACYDBUUKWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DGMRQGQ2DEMRSGM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
- Two Diagnostics/OTEL Changes entries credited issue openclaw#70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as openclaw#71451 (exec-process telemetry) and openclaw#71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue openclaw#71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
… tool-result pruning entries Three entries were missing co-credits I should have preserved: - Diagnostics/OTEL exec-process spans (openclaw#71451): @vincentkoc implemented, but @jlapenna's openclaw#70424 proposed the broader tracing work this entry builds on. Now credits both. - Diagnostics/OTEL preloaded SDK (openclaw#71450): same pattern — credits @vincentkoc and @jlapenna. - Agents/tool-result pruning (openclaw#51267): @cgdusek's PR explicitly built on prior work in openclaw#39331 by @alvinttang and openclaw#34980 by @coffeexcoin. Now credits all three.
… Unreleased Three of my (vincentkoc) entries were missing closing PR refs, and several maintainer-fix entries were missing credit for the user who reported the underlying issue: - Diagnostics/OTEL outbound delivery: add (openclaw#71471) and credit @jlapenna whose openclaw#70424 framed the broader tracing work. - Cron malformed legacy jobs: add (openclaw#71509). - OpenAI/Codex OAuth region failures: add (openclaw#71501) and credit reporter @wulala-xjj (openclaw#51175). - Telegram duplicate pollers: credit reporter @Co-Messi (openclaw#56230). - MCP/CLI one-shot retire: credit reporter @spartoviMD (openclaw#71457). - OpenAI/Codex image baseUrl canonicalize: credit reporter @GodsBoy (openclaw#71460). - Feishu TTS Ogg/Opus: credit reporters @sg1416-zg (openclaw#61249) and @ycjlb2023-peteryi (openclaw#37868). - MiniMax TTS portal OAuth: credit reporter @zx15210404690-hash (openclaw#55017). - MCP config reload disposal: credit reporter @xieyuanqing (openclaw#60656).
- Two Diagnostics/OTEL Changes entries credited issue openclaw#70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as openclaw#71451 (exec-process telemetry) and openclaw#71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue openclaw#71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
… tool-result pruning entries Three entries were missing co-credits I should have preserved: - Diagnostics/OTEL exec-process spans (openclaw#71451): @vincentkoc implemented, but @jlapenna's openclaw#70424 proposed the broader tracing work this entry builds on. Now credits both. - Diagnostics/OTEL preloaded SDK (openclaw#71450): same pattern — credits @vincentkoc and @jlapenna. - Agents/tool-result pruning (openclaw#51267): @cgdusek's PR explicitly built on prior work in openclaw#39331 by @alvinttang and openclaw#34980 by @coffeexcoin. Now credits all three.
… Unreleased Three of my (vincentkoc) entries were missing closing PR refs, and several maintainer-fix entries were missing credit for the user who reported the underlying issue: - Diagnostics/OTEL outbound delivery: add (openclaw#71471) and credit @jlapenna whose openclaw#70424 framed the broader tracing work. - Cron malformed legacy jobs: add (openclaw#71509). - OpenAI/Codex OAuth region failures: add (openclaw#71501) and credit reporter @wulala-xjj (openclaw#51175). - Telegram duplicate pollers: credit reporter @Co-Messi (openclaw#56230). - MCP/CLI one-shot retire: credit reporter @spartoviMD (openclaw#71457). - OpenAI/Codex image baseUrl canonicalize: credit reporter @GodsBoy (openclaw#71460). - Feishu TTS Ogg/Opus: credit reporters @sg1416-zg (openclaw#61249) and @ycjlb2023-peteryi (openclaw#37868). - MiniMax TTS portal OAuth: credit reporter @zx15210404690-hash (openclaw#55017). - MCP config reload disposal: credit reporter @xieyuanqing (openclaw#60656).
- Two Diagnostics/OTEL Changes entries credited issue openclaw#70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as openclaw#71451 (exec-process telemetry) and openclaw#71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue openclaw#71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
- Two Diagnostics/OTEL Changes entries credited issue openclaw#70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as openclaw#71451 (exec-process telemetry) and openclaw#71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue openclaw#71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
… tool-result pruning entries Three entries were missing co-credits I should have preserved: - Diagnostics/OTEL exec-process spans (openclaw#71451): @vincentkoc implemented, but @jlapenna's openclaw#70424 proposed the broader tracing work this entry builds on. Now credits both. - Diagnostics/OTEL preloaded SDK (openclaw#71450): same pattern — credits @vincentkoc and @jlapenna. - Agents/tool-result pruning (openclaw#51267): @cgdusek's PR explicitly built on prior work in openclaw#39331 by @alvinttang and openclaw#34980 by @coffeexcoin. Now credits all three.
… Unreleased Three of my (vincentkoc) entries were missing closing PR refs, and several maintainer-fix entries were missing credit for the user who reported the underlying issue: - Diagnostics/OTEL outbound delivery: add (openclaw#71471) and credit @jlapenna whose openclaw#70424 framed the broader tracing work. - Cron malformed legacy jobs: add (openclaw#71509). - OpenAI/Codex OAuth region failures: add (openclaw#71501) and credit reporter @wulala-xjj (openclaw#51175). - Telegram duplicate pollers: credit reporter @Co-Messi (openclaw#56230). - MCP/CLI one-shot retire: credit reporter @spartoviMD (openclaw#71457). - OpenAI/Codex image baseUrl canonicalize: credit reporter @GodsBoy (openclaw#71460). - Feishu TTS Ogg/Opus: credit reporters @sg1416-zg (openclaw#61249) and @ycjlb2023-peteryi (openclaw#37868). - MiniMax TTS portal OAuth: credit reporter @zx15210404690-hash (openclaw#55017). - MCP config reload disposal: credit reporter @xieyuanqing (openclaw#60656).
- Two Diagnostics/OTEL Changes entries credited issue openclaw#70424 (jlapenna's open meta-tracing proposal) as the PR ref. The actual implementing PRs landed as openclaw#71451 (exec-process telemetry) and openclaw#71450 (preloaded SDK mode), both authored by @vincentkoc — corrected. - Telegram/webhook fix had no Thanks credit. Issue openclaw#71392 reporter @joelforsberg46-source identified the delivery-retry behaviour, so credit them on the entry.
… tool-result pruning entries Three entries were missing co-credits I should have preserved: - Diagnostics/OTEL exec-process spans (openclaw#71451): @vincentkoc implemented, but @jlapenna's openclaw#70424 proposed the broader tracing work this entry builds on. Now credits both. - Diagnostics/OTEL preloaded SDK (openclaw#71450): same pattern — credits @vincentkoc and @jlapenna. - Agents/tool-result pruning (openclaw#51267): @cgdusek's PR explicitly built on prior work in openclaw#39331 by @alvinttang and openclaw#34980 by @coffeexcoin. Now credits all three.
… Unreleased Three of my (vincentkoc) entries were missing closing PR refs, and several maintainer-fix entries were missing credit for the user who reported the underlying issue: - Diagnostics/OTEL outbound delivery: add (openclaw#71471) and credit @jlapenna whose openclaw#70424 framed the broader tracing work. - Cron malformed legacy jobs: add (openclaw#71509). - OpenAI/Codex OAuth region failures: add (openclaw#71501) and credit reporter @wulala-xjj (openclaw#51175). - Telegram duplicate pollers: credit reporter @Co-Messi (openclaw#56230). - MCP/CLI one-shot retire: credit reporter @spartoviMD (openclaw#71457). - OpenAI/Codex image baseUrl canonicalize: credit reporter @GodsBoy (openclaw#71460). - Feishu TTS Ogg/Opus: credit reporters @sg1416-zg (openclaw#61249) and @ycjlb2023-peteryi (openclaw#37868). - MiniMax TTS portal OAuth: credit reporter @zx15210404690-hash (openclaw#55017). - MCP config reload disposal: credit reporter @xieyuanqing (openclaw#60656).
Update: Feedback Addressed
@opentelemetry/apiimports inrun.tsandpi-tools.abort.ts.otel-preload.mjsnow automatically setsOPENCLAW_OTEL_PRELOADED=1upon successful startup.diagnostics-otelto skip only the SDK initialization while preserving log transports and diagnostic event hooks.SIGINThandler to the preload script for better development environment support.package.jsonto support core instrumentation.