fix(diagnostics-otel): share listeners/transports across module bundles #16865

Closed
leonnardo wants to merge 2 commits into openclaw:main from leonnardo:fix/otel-module-isolation-clean

Conversation

@leonnardo commented Feb 15, 2026

Summary

  • Problem: diagnostics-otel could miss diagnostic events and/or logs when OpenClaw had multiple module instances loaded (bundle/module isolation), because state was module-local.
  • Why it matters: OTEL export appeared enabled but logs/metrics could silently stop flowing in real gateway runtime.
  • What changed:
    • Shared diagnostic event state via globalThis in src/infra/diagnostic-events.ts
    • Shared external log transport registry via globalThis in src/logging/logger.ts
    • registerLogTransport(...) now attaches to all active logger instances with idempotent attach tracking
    • Added regression coverage in src/logger.test.ts for multi-logger transport attach + unsubscribe
  • Scope boundary: no config schema changes, no API/protocol changes.
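A minimal sketch of the shared-state pattern described above, for readers who want to see the shape of it. The global key matches the one named in the Greptile summary; the types and the function names `getState`, `onDiagnosticEvent`, and `emitDiagnosticEvent` are illustrative assumptions and may not match src/infra/diagnostic-events.ts.

```typescript
// Hypothetical event shapes; the real types in the PR may differ.
type DiagnosticEvent = { type: string; seq: number; ts: number };
type DiagnosticEventInput = { type: string };
type Listener = (event: DiagnosticEvent) => void;

interface DiagnosticEventsState {
  seq: number;
  listeners: Set<Listener>;
}

// Key name as reported in the review summary; the stored shape is assumed.
const STATE_KEY = "__openclaw_diagnostic_events_state__";

function getState(): DiagnosticEventsState {
  const g = globalThis as unknown as Record<string, DiagnosticEventsState | undefined>;
  // Created once per process; every later module copy reuses the same object.
  return (g[STATE_KEY] ??= { seq: 0, listeners: new Set<Listener>() });
}

function onDiagnosticEvent(listener: Listener): () => void {
  const state = getState();
  state.listeners.add(listener);
  // The returned function unsubscribes the listener.
  return () => {
    state.listeners.delete(listener);
  };
}

function emitDiagnosticEvent(input: DiagnosticEventInput): void {
  const state = getState();
  const event: DiagnosticEvent = { ...input, seq: ++state.seq, ts: Date.now() };
  for (const listener of state.listeners) {
    try {
      listener(event);
    } catch {
      // A failing listener must not break the emitter.
    }
  }
}
```

Because both functions resolve state through `globalThis` on every call, a listener registered by one bundled copy of the module still receives events emitted by another copy.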

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • OTEL diagnostics integration is reliable across module/bundle boundaries.
  • Logs and diagnostic events now consistently reach OTEL exporters when the plugin is enabled.

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)

Repro + Verification

Environment

  • OS: Linux
  • Runtime/container: Node 24 (local dev gateway)
  • Model/provider: OpenAI Codex (gpt-5.3-codex) for traffic generation
  • Integration/channel: diagnostics-otel extension + OTLP/HTTP collector
  • Relevant config (redacted): diagnostics-otel with logs/metrics enabled, OTLP endpoint configured

Steps

  1. Enable diagnostics-otel with OTLP logs + metrics.
  2. Start gateway and generate normal message traffic.
  3. Verify collector counters and backend queries.

Expected

  • Diagnostic events and logs are exported consistently even with module/bundle split.

Actual

  • Collector accepted/sent counters for logs and metrics increase.
  • Loki query for service_name="openclaw-gateway-dev" returns gateway logs.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What I personally verified:

  • Build/check pipeline passes locally:
    • pnpm build
    • pnpm check
  • Targeted test suite for touched areas passes:
    • pnpm vitest run src/logger.test.ts src/infra/infra-store.test.ts src/logging/diagnostic.test.ts extensions/diagnostics-otel/src/service.test.ts
  • Runtime OTEL validation:
    • collector log/metric counters increased
    • Loki query returned openclaw-gateway-dev logs

Edge cases checked:

  • multi-logger transport attach
  • transport unsubscribe behavior
  • defensive transport error handling (no throw propagation)
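For the last edge case, "no throw propagation" can be sketched roughly as below. `LogRecord`, `LogTransport`, and `dispatchToTransports` are hypothetical names for illustration, not the identifiers used in src/logging/logger.ts.

```typescript
// Hypothetical record and transport types, for illustration only.
type LogRecord = { level: string; message: string };
type LogTransport = (record: LogRecord) => void;

// Calls every transport and swallows individual failures so that
// a broken exporter can never make a log call throw.
function dispatchToTransports(
  transports: Iterable<LogTransport>,
  record: LogRecord,
): number {
  let failures = 0;
  for (const transport of transports) {
    try {
      transport(record);
    } catch {
      failures += 1; // counted for observability, never rethrown
    }
  }
  return failures;
}
```

Returning the failure count is a design choice of this sketch; it lets a caller surface dropped exports without changing the no-throw guarantee.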

What I did not verify:

  • full channel/platform matrix end-to-end

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)

Failure Recovery (if this breaks)

  • Revert this PR.
  • Symptoms to watch for: OTEL diagnostics exporter enabled but no log/diagnostic traffic reaching collector.

Risks and Mitigations

  • Risk: duplicate transport attachment or cross-instance confusion.
    • Mitigation: global registry + per-logger idempotent attach tracking (WeakMap).
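The mitigation can be illustrated with a small sketch, assuming a stand-in logger class. `SketchLogger`, `trackLogger`, and `attachOnce` are hypothetical; only `registerLogTransport` and the global key come from this PR, and their real signatures may differ. For brevity the logger set and the WeakMap are module-local here.

```typescript
type LogRecord = { level: string; message: string };
type LogTransport = (record: LogRecord) => void;

// Stand-in logger that would deliver twice if a transport were attached twice.
class SketchLogger {
  private sinks: LogTransport[] = [];
  attach(t: LogTransport): void { this.sinks.push(t); }
  detach(t: LogTransport): void { this.sinks = this.sinks.filter((s) => s !== t); }
  log(record: LogRecord): void { for (const s of this.sinks) s(record); }
}

const REGISTRY_KEY = "__openclaw_external_log_transports__";

// Process-wide registry shared through globalThis.
function getRegistry(): Set<LogTransport> {
  const g = globalThis as unknown as Record<string, Set<LogTransport> | undefined>;
  return (g[REGISTRY_KEY] ??= new Set<LogTransport>());
}

// Per-logger record of transports already attached; WeakMap keys do not
// keep discarded loggers alive.
const attached = new WeakMap<SketchLogger, Set<LogTransport>>();
const activeLoggers = new Set<SketchLogger>();

function attachOnce(logger: SketchLogger, transport: LogTransport): void {
  let seen = attached.get(logger);
  if (!seen) {
    seen = new Set<LogTransport>();
    attached.set(logger, seen);
  }
  if (seen.has(transport)) return; // idempotent: already attached
  seen.add(transport);
  logger.attach(transport);
}

// New loggers pick up every transport that is already registered.
function trackLogger(logger: SketchLogger): void {
  activeLoggers.add(logger);
  for (const t of getRegistry()) attachOnce(logger, t);
}

// Registers a transport on all active loggers and returns an unsubscribe.
function registerLogTransport(transport: LogTransport): () => void {
  getRegistry().add(transport);
  for (const logger of activeLoggers) attachOnce(logger, transport);
  return () => {
    getRegistry().delete(transport);
    for (const logger of activeLoggers) {
      attached.get(logger)?.delete(transport);
      logger.detach(transport);
    }
  };
}
```

Registering the same transport twice is a no-op for loggers that already have it, which is the duplicate-attachment risk the mitigation targets.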

AI-assisted: Yes
Testing level: fully tested for touched scope (targeted tests + runtime validation)

Greptile Summary

Moves diagnostic events and log transport state to globalThis to fix module/bundle isolation issues preventing OTEL export.

Key changes:

  • Diagnostic event listeners and sequence counter now shared via globalThis.__openclaw_diagnostic_events_state__
  • Log transport registry moved to globalThis.__openclaw_external_log_transports__ with per-logger idempotent attachment tracking via WeakMap
  • registerLogTransport now attaches to all active logger instances, not just the cached one
  • Added timestamp validation and defensive error handling in OTEL log export
  • New test coverage for multi-logger transport attachment and unsubscribe behavior

The implementation correctly addresses the stated problem (diagnostics-otel missing events across module boundaries) by sharing state globally while maintaining idempotency through WeakMap tracking.

Confidence Score: 4/5

  • Safe to merge with minor consideration for edge cases
  • The implementation correctly solves the module isolation problem using established patterns (globalThis with typed keys). The WeakMap approach for idempotent attachment is sound. Test coverage includes the critical multi-logger scenario. One minor consideration: timestamp validation silently drops invalid timestamps rather than logging warnings, but the try-catch ensures no exceptions propagate.
  • No files require special attention
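The timestamp handling mentioned above is not shown in this conversation, so the following is only one plausible shape for it. `parseLogTimestamp` is a hypothetical helper and its acceptance rules are assumptions.

```typescript
// Returns a valid Date for usable inputs, or undefined when the value
// should be dropped instead of being passed to the exporter.
function parseLogTimestamp(value: unknown): Date | undefined {
  if (value instanceof Date) {
    return Number.isNaN(value.getTime()) ? undefined : value;
  }
  if (typeof value === "number" || typeof value === "string") {
    const date = new Date(value);
    // new Date(...) never throws; invalid input yields a NaN time value.
    return Number.isNaN(date.getTime()) ? undefined : date;
  }
  return undefined;
}
```

A caller could omit the timestamp field when this returns `undefined`, which matches the "silently drops" behavior noted in the review; logging a warning at that point would address the reviewer's minor consideration.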

Last reviewed commit: 160e710

@vincentkoc
Member

Appreciate the effort here.

I’m closing this as a duplicate of #28166, which consolidates the shared log transport fix with the related trace-context work into a single, clean PR.
Your contribution in #16865 is kept in the attribution trail.

If that’s not the right call, point me to the missing piece and I’ll reopen review quickly.

@vincentkoc closed this Feb 27, 2026
@vincentkoc added the dedupe:child (Duplicate issue/PR child in dedupe cluster) and close:duplicate (Closed as duplicate) labels Feb 27, 2026
Labels

  • close:duplicate (Closed as duplicate)
  • dedupe:child (Duplicate issue/PR child in dedupe cluster)
  • size: S
