feat(diagnostics): classify skill and tool usage#80370
Conversation
|
Codex review: needs real behavior proof before merge. Latest ClawSweeper review: 2026-05-23 06:22 UTC / May 23, 2026, 2:22 AM ET. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: not applicable. This is a feature PR, and current-main source search confirms the requested PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the centralized diagnostics contract after redacted live exporter proof and a maintainer decision to accept or adjust the expanded Prometheus metric-label contract. Do we have a high-confidence way to reproduce the issue? Not applicable. This is a feature PR, and current-main source search confirms the requested Is this the best way to solve the issue? Yes, with merge gates. The centralized diagnostics contract is the maintainable direction, but it needs live exporter proof and maintainer acceptance or redesign of the metric-label compatibility change. Label justifications:
What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 743fd4c9dbaa. |
73f40a9 to
bc8c2a7
Compare
Ruthwik-Data
left a comment
There was a problem hiding this comment.
Good addition — surfacing skill usage as a first-class diagnostic signal fills a real gap. Previously you could observe tool execution and model calls but had no way to track which skills were actually activated, which makes it hard to debug agents that silently fall back to generic behavior.
A few observations from reviewing the diff:
On the activation label in openclaw_skill_used_total: The counter tracks activation, agent, skill, source — worth documenting what valid values for activation are (e.g., command-dispatched vs read-from-SKILL.md) so operators building dashboards know what to filter on. Without that, the label is present but not actionable.
On the Prometheus query in the dashboard: sum by (skill, source) (increase(openclaw_skill_used_total[24h])) is useful for daily trends, but for debugging agent behavior in real time a shorter window (e.g., [1h] or [5m]) is more practical. Consider adding both in the example queries.
On cardinality: The skill label could become high-cardinality if users define many custom skills in SKILL.md. Worth noting in the docs whether there's any capping or truncation applied to the skill label value, similar to how other high-cardinality labels are handled elsewhere in the codebase.
Overall the scope is well-bounded and the test coverage in service.test.ts looks appropriate. Happy to see this move forward.
4b87bb9 to
e3d5a09
Compare
f65be5a to
b3bdfe2
Compare
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
|
Verification after maintainer rewrite/rebase:
Known proof gap: no live third-party MCP/plugin server was exercised; focused tests cover OpenClaw metadata classification and OTel/Prometheus exporter behavior. No local pnpm/Vitest/check gate was run. |
|
Added real OpenClaw observability proof after the proof gate rejected unit-only evidence:
I also updated the PR body’s |
Summary
skill.useddiagnostic events for successful skill reads and command-dispatched skill tools without exporting raw paths, params, run ids, or session keystool_source/tool_ownerlabels for core, plugin, MCP, and channel tool executionVerification
git diff --check origin/main...HEADAUTOREVIEW_AUTO_TESTS=0 AUTOREVIEW_OPENCLAW_MAINTAINER_VALIDATION=1 .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main-> clean, no accepted/actionable findingsrun_a025b130b08b/cbx_b6a6fe523e2bon c7a.8xlarge:pnpm test:serial src/agents/pi-tools.before-tool-call.e2e.test.ts src/auto-reply/reply/get-reply-inline-actions.skip-when-config-empty.test.ts src/agents/skills.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts-> 5 shards passed, 161 tests passedrun_607180ab0238/cbx_12408152a349on c7a.8xlarge:pnpm check:changed-> passedrun_81b7bfc3aece/cbx_858db1d32be3on c7a.8xlarge:pnpm qa:observability:smoke-> passedReal behavior proof
Behavior addressed: operators now get bounded skill usage and tool source/owner telemetry from the diagnostics exporters, so skill, MCP, plugin, channel, and core tool activity can be distinguished without leaking raw skill paths, tool params, run ids, or session keys.
Real environment tested: AWS Crabbox Linux c7a.8xlarge fresh checkout of
openclaw/openclaw#80370, running the OpenClaw QA-lab Gateway against a local OTLP receiver and the Prometheus diagnostics scrape path.Exact steps or command run after this patch:
pnpm qa:observability:smokefrom the fresh PR checkout.Evidence after fix: terminal console output from AWS Crabbox run
run_81b7bfc3aece/ leasecbx_858db1d32be3:Observed result after fix: the OpenClaw Gateway emitted OTLP traces, metrics, and logs to the receiver, then completed the Prometheus diagnostics smoke successfully from the same PR branch. The OTLP smoke reported 18 spans, 67 metrics, 12 logs, 2 trace requests, 15 metric requests, and 7 log requests; the Prometheus scenario reported
passed=1 failed=0 total=1.What was not tested: no live third-party MCP server or external plugin service was used in this smoke; those classifications are covered by OpenClaw tool metadata paths and exporter regression coverage.