
fix(diagnostics): expose missing telemetry signals #86682

Merged
vincentkoc merged 1 commit into main from diagnostics-otel-prom-gap-audit
May 26, 2026

Conversation

@vincentkoc

Member

Summary

  • Fix shared OTLP endpoint resolution so /v1/traces, /v1/metrics, and /v1/logs are inserted before query strings or fragments.
  • Export alertable OTel and Prometheus signals for model failover, blocked tool executions, oversized payloads, webhook ingress/errors, stale sessions, and liveness warnings.
  • Carry core diagnostic provenance on dispatcher metadata so Prometheus records core gateway stability events while dropping plugin-emitted spoof events, then document the updated OTel/Prometheus behavior and changelog entry.
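
The endpoint fix in the first bullet can be illustrated with a minimal sketch. It assumes the standard WHATWG `URL` class; the function name `resolveSignalUrl` and the example host are hypothetical and not taken from the patch, whose actual implementation may differ.

```typescript
// Hypothetical helper: appends the OTLP signal path to the URL's pathname
// so that any query string or fragment stays at the end of the URL.
type OtlpSignal = "traces" | "metrics" | "logs";

function resolveSignalUrl(endpoint: string, signal: OtlpSignal): string {
  const url = new URL(endpoint); // throws TypeError on an invalid endpoint
  const suffix = `/v1/${signal}`;
  // Drop trailing slashes so "/otlp/" and "/otlp" resolve identically.
  const basePath = url.pathname.replace(/\/+$/, "");
  // Leave the path alone if the signal suffix is already present.
  url.pathname = basePath.endsWith(suffix) ? basePath : basePath + suffix;
  // Serialization keeps the original search and hash after the new path.
  return url.toString();
}

// Example with an assumed endpoint carrying a query string and a fragment.
const tracesUrl = resolveSignalUrl(
  "https://collector.example.com/otlp?token=abc#frag",
  "traces",
);
```

Naive string concatenation would instead produce `...?token=abc#frag/v1/traces`, which a collector would treat as a request to `/otlp`; editing `pathname` avoids that.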

Verification

  • Autoreview: AUTOREVIEW_AUTO_TESTS=0 AUTOREVIEW_OPENCLAW_MAINTAINER_VALIDATION=1 .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main -> clean, no accepted/actionable findings.
  • AWS Crabbox run_cf9cee226993, lease cbx_e9567e71c6b2, provider AWS c7a.8xlarge: pnpm test:serial src/infra/diagnostic-events.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts && pnpm check:changed && pnpm qa:observability:smoke && pnpm qa:observability:collector-smoke.
  • Focused tests passed: infra diagnostic events 26 tests, diagnostics-otel 65 tests, diagnostics-prometheus 17 tests.
  • pnpm check:changed passed core, coreTests, extensions, extensionTests, and docs lanes.
  • In-process observability smoke passed: OTel spans=18 metrics=37 logs=14 traces=2 metricRequests=7 logRequests=6; Prometheus smoke passed.
  • Collector-backed observability smoke passed against otel/opentelemetry-collector:0.104.0: OTel spans=18 metrics=28 logs=14 traces=2 metricRequests=6 logRequests=5; Prometheus smoke passed.

Real behavior proof

Behavior addressed: Operators now get correct OTLP signal URLs when a shared endpoint has query strings/fragments, explicit OTel metrics for blocked tools/model failover/large payloads, and Prometheus metrics for gateway stability signals without accepting plugin-spoofed untrusted diagnostics.
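
As a rough illustration of the provenance check described above, the sketch below counts stability events only when the dispatcher's own metadata marks them as core-emitted. Every name here (`DispatchMetadata`, `StabilityCounters`, the event type strings) is an assumption made for the example and does not come from the patch.

```typescript
// Hypothetical shapes; the repository's actual types may differ.
type DiagnosticSource = "core" | "plugin";

interface DispatchMetadata {
  // Set by the dispatcher itself, never read from the event payload.
  source: DiagnosticSource;
}

interface DiagnosticEvent {
  type: string;
  // A payload field a plugin could forge; the filter deliberately ignores it.
  claimedSource?: string;
}

// Assumed event names for gateway stability signals.
const CORE_ONLY_EVENT_TYPES = new Set<string>(["session.stale", "liveness.warning"]);

class StabilityCounters {
  private readonly counts = new Map<string, number>();

  // Returns true when the event was counted, false when it was dropped.
  record(event: DiagnosticEvent, meta: DispatchMetadata): boolean {
    if (CORE_ONLY_EVENT_TYPES.has(event.type) && meta.source !== "core") {
      return false; // plugin-emitted spoof of a core-only signal
    }
    this.counts.set(event.type, (this.counts.get(event.type) ?? 0) + 1);
    return true;
  }

  get(type: string): number {
    return this.counts.get(type) ?? 0;
  }
}

const counters = new StabilityCounters();
const accepted = counters.record({ type: "session.stale" }, { source: "core" });
const spoofed = counters.record(
  { type: "session.stale", claimedSource: "core" },
  { source: "plugin" },
);
```

The trust decision rests on metadata the plugin cannot write, so a forged field inside the event body has no effect on what gets counted.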

Real environment tested: AWS Crabbox Linux runner, provider aws, lease cbx_e9567e71c6b2, run run_cf9cee226993, including in-process OTLP/Prometheus QA and a Docker OpenTelemetry Collector path.

Exact steps or command run after this patch: pnpm test:serial src/infra/diagnostic-events.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts && pnpm check:changed && pnpm qa:observability:smoke && pnpm qa:observability:collector-smoke.

Evidence after fix: Focused regression tests covered endpoint query/fragment handling, new OTel instruments, Prometheus stability metrics, and plugin-spoofed untrusted diagnostic drops. QA smoke exported OTLP/Prometheus data successfully both directly and through otel/opentelemetry-collector:0.104.0.

Observed result after fix: The remote command exited 0. In-process OTel smoke observed spans=18, metrics=37, logs=14, traces=2, metricRequests=7, logRequests=6. Collector smoke observed spans=18, metrics=28, logs=14, traces=2, metricRequests=6, logRequests=5. Prometheus smoke passed in both lanes.

What was not tested: Live third-party Prometheus/Grafana scraping and live external provider traffic were not exercised; the proof used repository QA scenarios and a real OpenTelemetry Collector container.

vincentkoc self-assigned this May 26, 2026
openclaw-barnacle Bot added the docs, gateway, extensions: diagnostics-otel, extensions: diagnostics-prometheus, size: L, and maintainer labels May 26, 2026
@clawsweeper

clawsweeper Bot commented May 26, 2026

Contributor

ClawSweeper status: review started.

I am starting a fresh review of this pull request: fix(diagnostics): expose missing telemetry signals. This is item 1/1 in the current shard. Shard 0/1.

This placeholder means the worker is alive and reading the current context. I will edit this same comment with the actual review when the claws are done clicking.

Crustacean status: shell secured, claws on keyboard, evidence pebbles being sorted.

vincentkoc force-pushed the diagnostics-otel-prom-gap-audit branch from bab3ec4 to 607b9b8 May 26, 2026 00:04
vincentkoc merged commit ef8619d into main May 26, 2026
99 checks passed
vincentkoc deleted the diagnostics-otel-prom-gap-audit branch May 26, 2026 00:11
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 26, 2026
jameslcowan pushed a commit to jameslcowan/openclaw that referenced this pull request Jun 2, 2026
SYU8384 pushed a commit to SYU8384/openclaw that referenced this pull request Jun 3, 2026
sablehead pushed a commit to sablehead/openclaw that referenced this pull request Jun 10, 2026
