Skip to content

fix(diagnostics): share event bus and log transport across module bundles#1

Open
triclambert wants to merge 1 commit intomainfrom
fix/diagnostics-otel-event-bus-isolation
Open

fix(diagnostics): share event bus and log transport across module bundles#1
triclambert wants to merge 1 commit intomainfrom
fix/diagnostics-otel-event-bus-isolation

Conversation

@triclambert
Copy link
Copy Markdown
Owner

Summary

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

Users who configure diagnostics.otel.enabled: true with an OTLP endpoint will now actually receive traces, metrics, and logs in their collector. Previously this configuration was silently non-functional.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No (the OTel plugin already makes OTLP calls when enabled -- they just never fired due to the event bus bug)
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • OS: macOS (Darwin arm64)
  • Runtime: Node v25.6.1
  • Integration: diagnostics-otel extension + Jaeger OTLP collector

Steps

  1. Start Jaeger: docker run -d -p 127.0.0.1:4318:4318 -p 127.0.0.1:16686:16686 jaegertracing/all-in-one:latest
  2. Configure OpenClaw with diagnostics.enabled: true, diagnostics.otel.enabled: true, diagnostics.otel.endpoint: http://127.0.0.1:4318
  3. Emit diagnostic events (via gateway activity or test script)

Expected

Traces appear in Jaeger at http://localhost:16686

Actual

Before fix: zero traces (confirmed via ss -tn, matching openclaw#18794 repro).
After fix: 3 traces with 3 spans appear in Jaeger.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets

Unit tests: 18/18 pass (5 new tests added covering the exact bug scenarios).

Integration test output:

OTel service started, emitting test events...
Events emitted. Waiting for OTel batch flush...
OTel service stopped.
SUCCESS: Found 3 trace(s) with 3 span(s) in Jaeger.

URL resolution test covers the openclaw#13783 case directly:

"appends signal path when endpoint contains /v1/ in a non-terminal position"
  https://api.example.com/api/v1/private/otel -> .../api/v1/private/otel/v1/traces

Human Verification (required)

  • Verified scenarios: Event bus isolation fix confirmed with live Jaeger collector. URL resolution fix confirmed with Opik-style endpoint containing /v1/ in base path.
  • Edge cases checked: Endpoint already ending in /v1/traces (no double-append), resetDiagnosticEventsForTest() correctly clears shared state, unsubscribe removes only the target listener.
  • What I did not verify: gRPC protocol (intentionally unsupported), concurrent multi-process scenarios (Vitest workers are isolated).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert this single commit. The globalThis singletons degrade gracefully -- if removed, behavior returns to the pre-fix state (events don't cross module boundaries, but nothing crashes).
  • Files/config to restore: src/infra/diagnostic-events.ts, src/logging/logger.ts, extensions/diagnostics-otel/src/service.ts
  • Known bad symptoms reviewers should watch for: If globalThis symbols collide with another library (extremely unlikely with openclaw. prefix), diagnostic events could leak to unintended listeners.

Risks and Mitigations

  • Risk: globalThis shared state could cause test interference if tests in the same Vitest worker share the bus.
    • Mitigation: resetDiagnosticEventsForTest() already exists for this purpose and now clears the shared bus. Existing tests already call it in beforeEach. New test added to verify this works.

AI disclosure: This PR was developed with AI assistance (Claude). All changes were manually reviewed, understood, and verified with live integration testing against Jaeger.

Comment thread extensions/diagnostics-otel/src/service.ts Outdated
@triclambert triclambert force-pushed the fix/diagnostics-otel-event-bus-isolation branch 2 times, most recently from 1fda875 to db63c03 Compare February 18, 2026 14:07
…dles

The bundler gives the gateway and plugin-sdk their own copies of
diagnostic-events.ts, so emitDiagnosticEvent() and onDiagnosticEvent()
use separate listener Sets. Events emitted by the gateway never reach
the OTel plugin. Same issue affects the log transport registry in
logger.ts.

Fix: use globalThis[Symbol.for()] singletons for both registries,
matching the pattern already established in src/plugins/runtime.ts.

Also fix resolveOtelUrl() which used endpoint.includes('/v1/') to
decide whether to append the signal path. This breaks backends whose
base URL contains '/v1/' in a non-terminal position (e.g. Opik).
Replaced with endsWith checks against the three OTLP signal suffixes.

Closes openclaw#5190
Closes openclaw#18794
Closes openclaw#13783
@sreis sreis force-pushed the fix/diagnostics-otel-event-bus-isolation branch from db63c03 to 6f19cc0 Compare February 19, 2026 16:17
@vincentkoc
Copy link
Copy Markdown

see openclaw#5190 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment