
Commit ef8619d

fix(diagnostics): expose missing telemetry signals (#86682)
1 parent 71e9eaa commit ef8619d

17 files changed

Lines changed: 666 additions & 64 deletions

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -117,6 +117,7 @@ Docs: https://docs.openclaw.ai
 - Checks: keep intentional Knip unused-file findings optional so full CI and sparse proof workspaces stay aligned.
 - Docker: restore writable `~/.config` in runtime images. Fixes #85968. Thanks @hkoessler and @Bartok9.
 - Plugin SDK: keep legacy root diagnostic subscriptions connected when built plugin SDK aliases resolve diagnostic helpers through a separate module graph.
+- Diagnostics: export alertable OTel and Prometheus signals for blocked tools, model failover, stale sessions, liveness warnings, oversized payloads, and webhook ingress while fixing shared OTLP endpoints with query strings.
 - Tests: normalize macOS canonical temp paths in exec allowlists, fs-safe trash assertions, installed plugin matching, Telegram topic-name stores, and built ACPX MCP server expectations so native macOS proof runners cover the intended behavior.
 - Codex/app-server: preserve message-tool-only source reply delivery mode on active runs so sub-agent completion wakeups can steer the active Codex turn instead of being rejected. (#86287) Thanks @ferminquant.
 - Tests: sample the Windows kitchen-sink RPC gateway directly and serialize RSS probes so native runs keep the memory guard active.
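
The Diagnostics entry above mentions a fix for shared OTLP endpoints that carry query strings. The code for that fix is not shown in this section, so the following TypeScript is only a sketch of one way such a problem can be handled: the standard OTLP/HTTP per-signal path (`/v1/traces`, `/v1/metrics`, `/v1/logs`) is appended to the URL path instead of to the raw string, so the query string stays at the end. `resolveSignalEndpoint` and the example collector URL are hypothetical and are not part of OpenClaw.

```typescript
// Hypothetical helper, not taken from the commit: derive a per-signal OTLP/HTTP
// URL from one shared base endpoint while keeping its query string intact.
type OtlpSignal = "traces" | "metrics" | "logs";

function resolveSignalEndpoint(baseEndpoint: string, signal: OtlpSignal): string {
  const url = new URL(baseEndpoint);
  // Drop trailing slashes so "/otlp" and "/otlp/" resolve to the same path.
  const basePath = url.pathname.replace(/\/+$/, "");
  // Only the path changes; url.search (the query string) is left untouched.
  url.pathname = `${basePath}/v1/${signal}`;
  return url.toString();
}

// Naive concatenation would yield ".../otlp?token=abc/v1/traces", which puts
// the signal path inside the query string.
const tracesUrl = resolveSignalEndpoint(
  "https://collector.example.com/otlp?token=abc",
  "traces",
);
console.log(tracesUrl); // https://collector.example.com/otlp/v1/traces?token=abc
```

Parsing with `URL` rather than splitting on `?` by hand also keeps any fragment and percent-encoding as the platform already normalizes them.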

docs/gateway/opentelemetry.md

Lines changed: 24 additions & 7 deletions
@@ -70,14 +70,15 @@ openclaw plugins enable diagnostics-otel

 ## Signals exported

-| Signal | What goes in it |
-| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| **Metrics** | Counters and histograms for token usage, cost, run duration, skill usage, message flow, Talk events, queue lanes, session state/recovery, tool execution, exec, and memory pressure. |
-| **Traces** | Spans for model usage, model calls, harness lifecycle, skill usage, tool execution, exec, webhook/message processing, context assembly, and tool loops. |
-| **Logs** | Structured `logging.file` records exported over OTLP when `diagnostics.otel.logs` is enabled; log bodies are withheld unless content capture is explicitly enabled. |
+| Signal | What goes in it |
+| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| **Metrics** | Counters and histograms for token usage, cost, run duration, failover, skill usage, message flow, Talk events, queue lanes, session state/recovery, tool execution, oversized payloads, exec, and memory pressure. |
+| **Traces** | Spans for model usage, model calls, harness lifecycle, skill usage, tool execution, exec, webhook/message processing, context assembly, and tool loops. |
+| **Logs** | Structured `logging.file` records exported over OTLP when `diagnostics.otel.logs` is enabled; log bodies are withheld unless content capture is explicitly enabled. |

-Toggle `traces`, `metrics`, and `logs` independently. All three default to on
-when `diagnostics.otel.enabled` is true.
+Toggle `traces`, `metrics`, and `logs` independently. Traces and metrics
+default to on when `diagnostics.otel.enabled` is true. Logs default to off and
+are exported only when `diagnostics.otel.logs` is explicitly `true`.

 ## Configuration reference

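
The revised defaults in the hunk above can be summarized in code. This is a minimal sketch, assuming a config object shaped like the `diagnostics.otel` keys named in the docs (`enabled`, `traces`, `metrics`, `logs`). The `OtelConfig` type and `resolveSignals` function are hypothetical and are not OpenClaw's implementation. The sketch also assumes that nothing is exported unless `enabled` is true.

```typescript
// Hypothetical shape of the `diagnostics.otel` block; only the four keys
// mentioned in the docs are modeled here.
interface OtelConfig {
  enabled?: boolean;
  traces?: boolean;
  metrics?: boolean;
  logs?: boolean;
}

interface ResolvedSignals {
  traces: boolean;
  metrics: boolean;
  logs: boolean;
}

function resolveSignals(otel: OtelConfig = {}): ResolvedSignals {
  // Assumption: no signal is exported unless the exporter itself is enabled.
  const enabled = otel.enabled === true;
  return {
    // Traces and metrics are opt-out: on unless explicitly set to false.
    traces: enabled && otel.traces !== false,
    metrics: enabled && otel.metrics !== false,
    // Logs are opt-in: on only when explicitly set to true.
    logs: enabled && otel.logs === true,
  };
}

console.log(resolveSignals({ enabled: true }));
// { traces: true, metrics: true, logs: false }
```

The asymmetry between `!== false` and `=== true` is the whole point of the docs change: an omitted `logs` key no longer turns log export on.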
@@ -189,6 +190,7 @@ message bodies are also approved for export.
 - `openclaw.model_call.request_bytes` (histogram, UTF-8 byte size of the final model request payload; no raw payload content)
 - `openclaw.model_call.response_bytes` (histogram, UTF-8 byte size of streamed model response events; no raw response content)
 - `openclaw.model_call.time_to_first_byte_ms` (histogram, elapsed time before the first streamed response event)
+- `openclaw.model.failover` (counter, attrs: `openclaw.provider`, `openclaw.model`, `openclaw.failover.to_provider`, `openclaw.failover.to_model`, `openclaw.failover.reason`, `openclaw.failover.suspended`, `openclaw.lane`)
 - `openclaw.skill.used` (counter, attrs: `openclaw.skill.name`, `openclaw.skill.source`, `openclaw.skill.activation`, optional `openclaw.agent`, optional `openclaw.toolName`)

 ### Message flow
@@ -260,16 +262,31 @@ unchanged, so dashboards should alert on sustained increases rather than every
 heartbeat tick. For the config knob and defaults, see
 [Configuration reference](/gateway/configuration-reference#diagnostics).

+Liveness warnings also emit:
+
+- `openclaw.liveness.warning` (counter, attrs: `openclaw.liveness.reason`)
+- `openclaw.liveness.event_loop_delay_p99_ms` (histogram, attrs: `openclaw.liveness.reason`)
+- `openclaw.liveness.event_loop_delay_max_ms` (histogram, attrs: `openclaw.liveness.reason`)
+- `openclaw.liveness.event_loop_utilization` (histogram, attrs: `openclaw.liveness.reason`)
+- `openclaw.liveness.cpu_core_ratio` (histogram, attrs: `openclaw.liveness.reason`)
+
 ### Harness lifecycle

 - `openclaw.harness.duration_ms` (histogram, attrs: `openclaw.harness.id`, `openclaw.harness.plugin`, `openclaw.outcome`, `openclaw.harness.phase` on errors)

+### Tool execution
+
+- `openclaw.tool.execution.duration_ms` (histogram, attrs: `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.tool.source`, `openclaw.tool.owner`, `openclaw.tool.params.kind`, plus `openclaw.errorCategory` on errors)
+- `openclaw.tool.execution.blocked` (counter, attrs: `gen_ai.tool.name`, `openclaw.toolName`, `openclaw.tool.source`, `openclaw.tool.owner`, `openclaw.tool.params.kind`, `openclaw.deniedReason`)
+
 ### Exec

 - `openclaw.exec.duration_ms` (histogram, attrs: `openclaw.exec.target`, `openclaw.exec.mode`, `openclaw.outcome`, `openclaw.failureKind`)

 ### Diagnostics internals (memory and tool loop)

+- `openclaw.payload.large` (counter, attrs: `openclaw.payload.surface`, `openclaw.payload.action`, `openclaw.channel`, `openclaw.plugin`, `openclaw.reason`)
+- `openclaw.payload.large_bytes` (histogram, attrs: same as `openclaw.payload.large`)
 - `openclaw.memory.heap_used_bytes` (histogram, attrs: `openclaw.memory.kind`)
 - `openclaw.memory.rss_bytes` (histogram)
 - `openclaw.memory.pressure` (counter, attrs: `openclaw.memory.level`)
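
The context at the top of this hunk advises alerting on sustained increases of a counter rather than on every heartbeat tick. As an illustration only, the sketch below checks whether a series of cumulative counter readings (for example, periodic scrapes of one of the counters listed above) has risen for several consecutive samples. The function name, the sampling model, and the threshold are assumptions and are not defined by this commit.

```typescript
// Hypothetical check, not part of OpenClaw: `samples` holds cumulative counter
// readings in chronological order, one per scrape interval.
function hasSustainedIncrease(samples: number[], minConsecutive: number): boolean {
  let streak = 0;
  // Walk backwards from the newest reading and count consecutive increases.
  for (let i = samples.length - 1; i > 0; i--) {
    if (samples[i] > samples[i - 1]) {
      streak++;
    } else {
      // A flat interval (or a counter reset) ends the streak.
      break;
    }
  }
  return streak >= minConsecutive;
}

// One isolated bump followed by flat readings does not qualify.
console.log(hasSustainedIncrease([0, 1, 1, 1], 2)); // false
// Three consecutive increases do.
console.log(hasSustainedIncrease([0, 1, 2, 3], 3)); // true
```

In a real deployment the same idea would more likely be expressed as an alert rule in the metrics backend; the function above only makes the "sustained versus single tick" distinction concrete.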
