Skip to content

feat(diagnostics): add trace context carrier#70924

Merged
vincentkoc merged 1 commit intomainfrom
fix/diagnostic-trace-context
Apr 24, 2026
Merged

feat(diagnostics): add trace context carrier#70924
vincentkoc merged 1 commit intomainfrom
fix/diagnostic-trace-context

Conversation

@vincentkoc
Copy link
Copy Markdown
Member

Summary

  • Problem: OTEL-related work needs a shared diagnostic trace carrier before exporter, hook, or log wiring can safely build nested spans.
  • Why it matters: Without a lightweight core trace context, later diagnostics-otel changes either flatten spans or pull OpenTelemetry SDK state into core paths too early.
  • What changed: Added a pure diagnostic trace context helper, optional trace on diagnostic events, plugin-sdk diagnostics exports, focused tests, and the plugin SDK baseline update.
  • What did NOT change (scope boundary): No OTEL SDK dependency in core, no global trace registry, no retained per-run maps, no exporter behavior changes, no log transport changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

N/A

  • Root cause:
  • Missing detection / guardrail:
  • Contributing context (if known):

Regression Test Plan (if applicable)

N/A

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/infra/diagnostic-trace-context.test.ts, src/infra/diagnostic-events.test.ts
  • Scenario the test should lock in: W3C traceparent parse/format, invalid id rejection, child context creation without self-parenting, and diagnostic event trace pass-through without retained trace state.
  • Why this is the smallest reliable guardrail: This PR only adds a core carrier and event field; exporter and runtime hook behavior will land in follow-up slices.
  • Existing test that already covers this (if any):
  • If no new test is added, why not:

User-visible / Behavior Changes

Adds plugin-sdk diagnostics helpers that bundled diagnostics-otel and future observability plugins can use for trace correlation. No runtime telemetry behavior changes yet.

Diagram (if applicable)

Before:
diagnostic event -> seq/ts only -> exporter has no shared parent context

After:
diagnostic event -> seq/ts + optional trace context -> later exporter can parent spans without core OTEL SDK state

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation:

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node via repo pnpm scripts
  • Model/provider: N/A
  • Integration/channel (if any): diagnostics-otel plugin SDK surface
  • Relevant config (redacted): N/A

Steps

  1. pnpm test src/infra/diagnostic-events.test.ts
  2. pnpm test src/infra/diagnostic-trace-context.test.ts
  3. pnpm format:check -- CHANGELOG.md docs/.generated/plugin-sdk-api-baseline.sha256 src/infra/diagnostic-trace-context.ts src/infra/diagnostic-trace-context.test.ts src/infra/diagnostic-events.ts src/infra/diagnostic-events.test.ts src/plugin-sdk/diagnostics-otel.ts
  4. pnpm plugin-sdk:api:check
  5. pnpm check:changed
  6. On latest origin/main: pnpm tsgo:core

Expected

  • Focused tests, formatting, and plugin SDK API check pass.
  • check:changed should get past local change checks; current unrelated main typecheck issue may still fail.

Actual

  • Focused tests passed.
  • Formatting passed.
  • pnpm plugin-sdk:api:check passed.
  • pnpm check:changed failed at tsgo:core with unrelated existing errors in src/agents/openai-transport-stream.ts and src/plugin-sdk/provider-tools.ts.
  • The same pnpm tsgo:core failure reproduces on latest origin/main.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios: W3C traceparent parse/format, malformed/zero trace id rejection, child trace context creation, diagnostic event trace pass-through, plugin SDK API baseline.
  • Edge cases checked: self-parenting is dropped; malformed traceparent values are rejected; no new global registry or retained trace map is introduced.
  • What you did not verify: End-to-end OTEL collector export. That is intentionally out of scope for this foundation slice.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps:

Risks and Mitigations

  • Risk: Trace ids are accepted from external carriers in later PRs.
    • Mitigation: This slice validates W3C id shape and rejects all-zero ids before context creation.
  • Risk: Foundation code grows hidden long-lived trace state.
    • Mitigation: This slice is pure helper/event payload plumbing only; no global trace registry, maps, timers, or OTEL SDK instances are added.

@vincentkoc vincentkoc self-assigned this Apr 24, 2026
@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation size: M maintainer Maintainer-authored PR labels Apr 24, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR adds a lightweight W3C-compatible diagnostic trace context carrier (DiagnosticTraceContext) to core, an optional trace field on DiagnosticBaseEvent, and re-exports all trace helpers through the plugin-SDK surface for future diagnostics-otel span correlation. No OTEL SDK dependency is introduced into core. The changes are well-scoped, additive, and backed by focused unit tests covering the main validation and lifecycle invariants.

Confidence Score: 5/5

Safe to merge; all remaining concerns are P2 style/edge-case notes, and prior P1 threads (random-ID zero-check, future-version traceparent) have already been surfaced to the author.

The core implementation is correct: byte counts in randomHex are right (16 bytes → 32 hex for traceId, 8 bytes → 16 hex for spanId), normalization and validation guards are consistent, the self-parenting check works as tested, and no global state is added. No new P0 or P1 issues were found beyond the two already commented in prior threads.

No files require special attention.

Reviews (2): Last reviewed commit: "test(agents): follow split MiniMax FAQ d..." | Re-trigger Greptile

Comment on lines +111 to +112
const traceId = normalizeTraceId(input.traceId) ?? parsed?.traceId ?? randomHex(16);
const spanId = normalizeSpanId(input.spanId) ?? parsed?.spanId ?? randomHex(8);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Random ID fallback bypasses non-zero validation

randomHex(16) and randomHex(8) produce raw hex strings that skip the isNonZeroHex guard applied by normalizeTraceId/normalizeSpanId. While the probability of crypto.randomBytes generating all-zero bytes is negligible (~2⁻¹²⁸ / 2⁻⁶⁴), if it did happen the context object would contain IDs that fail isValidDiagnosticTraceId/isValidDiagnosticSpanId and formatDiagnosticTraceparent would silently return undefined. A defensive wrapper makes the invariant explicit:

function randomTraceId(): string {
  let id: string;
  do { id = randomHex(16); } while (!isNonZeroHex(id));
  return id;
}
function randomSpanId(): string {
  let id: string;
  do { id = randomHex(8); } while (!isNonZeroHex(id));
  return id;
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/diagnostic-trace-context.ts
Line: 111-112

Comment:
**Random ID fallback bypasses non-zero validation**

`randomHex(16)` and `randomHex(8)` produce raw hex strings that skip the `isNonZeroHex` guard applied by `normalizeTraceId`/`normalizeSpanId`. While the probability of `crypto.randomBytes` generating all-zero bytes is negligible (~2⁻¹²⁸ / 2⁻⁶⁴), if it did happen the context object would contain IDs that fail `isValidDiagnosticTraceId`/`isValidDiagnosticSpanId` and `formatDiagnosticTraceparent` would silently return `undefined`. A defensive wrapper makes the invariant explicit:

```ts
function randomTraceId(): string {
  let id: string;
  do { id = randomHex(16); } while (!isNonZeroHex(id));
  return id;
}
function randomSpanId(): string {
  let id: string;
  do { id = randomHex(8); } while (!isNonZeroHex(id));
  return id;
}
```

How can I resolve this? If you propose a fix, please make it concise.

traceparent: string | undefined,
): DiagnosticTraceContext | undefined {
const parts = traceparent?.trim().toLowerCase().split("-");
if (!parts || parts.length !== 4) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 W3C spec: future-version traceparents silently dropped

The W3C Trace Context spec (§2.2) requires that implementations parsing a traceparent with an unknown version still attempt to extract traceId and parentId from the first four dash-delimited fields and continue the trace. The current strict parts.length !== 4 guard and the version !== "00" rejection together discard any header with a future version number without attempting extraction. This is acceptable for an internal carrier now, but is worth documenting if parseDiagnosticTraceparent will later be used to accept inbound headers from external callers.

Suggested change
if (!parts || parts.length !== 4) {
if (!parts || parts.length < 4) {
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/infra/diagnostic-trace-context.ts
Line: 72

Comment:
**W3C spec: future-version traceparents silently dropped**

The W3C Trace Context spec (§2.2) requires that implementations parsing a traceparent with an unknown version still attempt to extract `traceId` and `parentId` from the first four dash-delimited fields and continue the trace. The current strict `parts.length !== 4` guard and the `version !== "00"` rejection together discard any header with a future version number without attempting extraction. This is acceptable for an internal carrier now, but is worth documenting if `parseDiagnosticTraceparent` will later be used to accept inbound headers from external callers.

```suggestion
  if (!parts || parts.length < 4) {
```

How can I resolve this? If you propose a fix, please make it concise.

@vincentkoc vincentkoc requested a review from a team as a code owner April 24, 2026 04:01
@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label Apr 24, 2026
@vincentkoc vincentkoc force-pushed the fix/diagnostic-trace-context branch from c532f74 to efe515d Compare April 24, 2026 04:46
@vincentkoc vincentkoc marked this pull request as draft April 24, 2026 04:49
@vincentkoc vincentkoc marked this pull request as ready for review April 24, 2026 04:49
@vincentkoc vincentkoc force-pushed the fix/diagnostic-trace-context branch from efe515d to d3db4e4 Compare April 24, 2026 04:52
@openclaw-barnacle openclaw-barnacle Bot removed the agents Agent runtime and tooling label Apr 24, 2026
@vincentkoc vincentkoc force-pushed the fix/diagnostic-trace-context branch from d3db4e4 to b5e1671 Compare April 24, 2026 05:05
@aisle-research-bot
Copy link
Copy Markdown

aisle-research-bot Bot commented Apr 24, 2026

🔒 Aisle Security Analysis

We found 1 potential security issue(s) in this PR:

# Severity Title
1 🟡 Medium Potential DoS via unbounded traceparent header parsing
1. 🟡 Potential DoS via unbounded `traceparent` header parsing
Property Value
Severity Medium
CWE CWE-400
Location src/infra/diagnostic-trace-context.ts:68-73

Description

The parseDiagnosticTraceparent() helper performs multiple full-string operations on traceparent without any length cap:

  • trim() + toLowerCase() allocates a new string proportional to input size
  • split("-") can allocate a large array and substrings proportional to the number of dashes
  • This function is exported via src/plugin-sdk/diagnostics-otel.ts, so plugin or host code may call it with attacker-controlled values (e.g., inbound HTTP traceparent header)

An attacker can send an extremely large traceparent value to trigger excessive CPU and memory usage, potentially leading to resource exhaustion.

Vulnerable code:

const parts = traceparent?.trim().toLowerCase().split("-");

Recommendation

Add a strict maximum length check before any string transformations (the W3C traceparent header is fixed-format and typically ~55 chars). Reject/ignore values exceeding a small bound.

Example:

const MAX_TRACEPARENT_LEN = 128; // generous upper bound

export function parseDiagnosticTraceparent(traceparent?: string) {
  if (typeof traceparent !== "string") return undefined;
  if (traceparent.length > MAX_TRACEPARENT_LEN) return undefined;

  const parts = traceparent.trim().toLowerCase().split("-");// ... existing validation ...
}

Optionally, avoid split() by validating expected positions/lengths (constant-time parsing) to further reduce allocations.


Analyzed PR: #70924 at commit 1bd7a06

Last updated on: 2026-04-24T05:15:04Z

@vincentkoc vincentkoc force-pushed the fix/diagnostic-trace-context branch from b5e1671 to 1bd7a06 Compare April 24, 2026 05:09
@vincentkoc vincentkoc merged commit 0ad058a into main Apr 24, 2026
65 checks passed
@vincentkoc vincentkoc deleted the fix/diagnostic-trace-context branch April 24, 2026 05:18
Angfr95 pushed a commit to Angfr95/openclaw that referenced this pull request Apr 25, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Improvements or additions to documentation maintainer Maintainer-authored PR size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant