feat(diagnostics): classify skill and tool usage #80370

Merged
vincentkoc merged 1 commit into openclaw:main from gauravprasadgp:feature/skill-usage-telemetry on May 23, 2026

Conversation

@gauravprasadgp gauravprasadgp commented May 10, 2026

Contributor

Summary

  • rewrites the original broad skill telemetry PR into one canonical diagnostics contract for skill and tool usage
  • adds trusted skill.used diagnostic events for successful skill reads and command-dispatched skill tools without exporting raw paths, params, run ids, or session keys
  • adds bounded tool_source / tool_owner labels for core, plugin, MCP, and channel tool execution
  • updates OpenTelemetry, Prometheus, docs, changelog, and focused regression coverage while keeping labels low-cardinality
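
As a rough illustration of the bounded classification described above, the sketch below maps tool metadata to a closed `tool_source` set and a sanitized `tool_owner` label. Every name here (`ToolMetadata`, `ToolIdentity`, `classifyTool`, `boundLabel`, the metadata fields, and the 64-character limit) is an assumption made for illustration and is not taken from the patch.

```typescript
// Hypothetical sketch: names and fields are illustrative, not the PR's actual types.
type ToolSource = "core" | "plugin" | "mcp" | "channel";

interface ToolMetadata {
  pluginId?: string;
  mcpServer?: string;
  channelId?: string;
}

interface ToolIdentity {
  toolSource: ToolSource;
  toolOwner?: string;
}

// Normalize a free-form owner into a short, predictable label value.
function boundLabel(value: string, maxLength: number = 64): string {
  const cleaned = value.toLowerCase().replace(/[^a-z0-9_.-]/g, "_");
  return cleaned.slice(0, maxLength) || "unknown";
}

// Pick exactly one source from the closed set. In this sketch, tools with no
// owner metadata are treated as core tools.
function classifyTool(meta: ToolMetadata): ToolIdentity {
  if (meta.mcpServer) return { toolSource: "mcp", toolOwner: boundLabel(meta.mcpServer) };
  if (meta.channelId) return { toolSource: "channel", toolOwner: boundLabel(meta.channelId) };
  if (meta.pluginId) return { toolSource: "plugin", toolOwner: boundLabel(meta.pluginId) };
  return { toolSource: "core" };
}

console.log(classifyTool({ mcpServer: "GitHub Server" }));
```

Because the source is a closed union and the owner is normalized and length-limited, the resulting label values stay predictable, and nothing from tool params, raw paths, run ids, or session keys is copied into the identity.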

Verification

  • git diff --check origin/main...HEAD
  • AUTOREVIEW_AUTO_TESTS=0 AUTOREVIEW_OPENCLAW_MAINTAINER_VALIDATION=1 .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main -> clean, no accepted/actionable findings
  • AWS Crabbox focused regression run run_a025b130b08b / cbx_b6a6fe523e2b on c7a.8xlarge: pnpm test:serial src/agents/pi-tools.before-tool-call.e2e.test.ts src/auto-reply/reply/get-reply-inline-actions.skip-when-config-empty.test.ts src/agents/skills.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts -> 5 shards passed, 161 tests passed
  • AWS Crabbox fresh-PR gate run_607180ab0238 / cbx_12408152a349 on c7a.8xlarge: pnpm check:changed -> passed
  • AWS Crabbox observability smoke run_81b7bfc3aece / cbx_858db1d32be3 on c7a.8xlarge: pnpm qa:observability:smoke -> passed

Real behavior proof

Behavior addressed: operators now get bounded skill usage and tool source/owner telemetry from the diagnostics exporters, so skill, MCP, plugin, channel, and core tool activity can be distinguished without leaking raw skill paths, tool params, run ids, or session keys.

Real environment tested: AWS Crabbox Linux c7a.8xlarge fresh checkout of openclaw/openclaw#80370, running the OpenClaw QA-lab Gateway against a local OTLP receiver and the Prometheus diagnostics scrape path.

Exact steps or command run after this patch: pnpm qa:observability:smoke from the fresh PR checkout.

Evidence after fix: terminal console output from AWS Crabbox run run_81b7bfc3aece / lease cbx_858db1d32be3:

[qa-suite] gateway ready: http://127.0.0.1:33625
[qa-suite] scenario pass (1/1): otel-trace-smoke
[qa-suite] run complete: passed=1 failed=0 total=1
qa-otel-smoke: passed spans=18 metrics=67 logs=12 traces=2 metricRequests=15 logRequests=7
[qa-suite] gateway ready: http://127.0.0.1:43289
[qa-suite] scenario pass (1/1): docker-prometheus-smoke
[qa-suite] run complete: passed=1 failed=0 total=1
run details provider=aws lease=cbx_858db1d32be3 slug=crimson-shrimp run=run_81b7bfc3aece type=c7a.8xlarge
{"provider":"aws","leaseId":"cbx_858db1d32be3","runId":"run_81b7bfc3aece","machineType":"c7a.8xlarge","exitCode":0}

Observed result after fix: the OpenClaw Gateway emitted OTLP traces, metrics, and logs to the receiver, then completed the Prometheus diagnostics smoke successfully from the same PR branch. The OTLP smoke reported 18 spans, 67 metrics, 12 logs, 2 trace requests, 15 metric requests, and 7 log requests; the Prometheus scenario reported passed=1 failed=0 total=1.

What was not tested: no live third-party MCP server or external plugin service was used in this smoke; those classifications are covered by OpenClaw tool metadata paths and exporter regression coverage.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation gateway Gateway runtime extensions: diagnostics-otel Extension: diagnostics-otel agents Agent runtime and tooling extensions: diagnostics-prometheus size: L proof: supplied External PR includes structured after-fix real behavior proof. labels May 10, 2026
@clawsweeper

clawsweeper Bot commented May 10, 2026

Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-23 06:22 UTC / May 23, 2026, 2:22 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds skill.used diagnostic events, bounded tool_source/tool_owner tool classification, Prometheus/OpenTelemetry export changes, docs, changelog, and focused regression coverage.

Reproducibility: not applicable. This is a feature PR, and current-main source search confirms the requested skill.used, tool_source, and openclaw_skill_used surfaces are not existing bug behavior to reproduce.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Summary: The patch reads as a reasonable diagnostics feature, but missing real behavior proof keeps it below merge-ready quality.

Rank-up moves:

  • Add redacted live Gateway plus Prometheus scrape or OTEL export proof showing skill.used and tool_source/tool_owner.
  • Resolve or dismiss the stale CodeQL overwritten-property thread if it no longer applies to the latest diff.
  • Get maintainer confirmation on whether changing the existing Prometheus tool metric labels is acceptable.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body explicitly says real environment testing was not completed; before merge it needs redacted live Gateway/exporter output such as terminal output, logs, a Prometheus scrape, or OTEL export proof with private details removed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • Real Gateway/exporter proof is missing; the PR body says live environment testing did not complete and focused tests were not run in the permitted environment.
  • Adding tool_owner and tool_source labels to existing Prometheus tool-execution metrics can change existing dashboard and alert query behavior for operators.
  • The stale CodeQL overwritten-property review thread should be resolved or explicitly dismissed before merge even though it appears to target an older diff.

Maintainer options:

  1. Gate on live exporter proof and accept labels (recommended)
    Require redacted Gateway plus Prometheus scrape or OTEL export proof, then intentionally accept the expanded tool metric label contract.
  2. Preserve old metric compatibility
    Keep the existing tool-execution metric labels and expose source/owner classification through a new metric or documented opt-in path.
  3. Pause until diagnostics scope is settled
    If first-class skill/tool-source telemetry is not ready for core diagnostics, pause or close the PR instead of landing a partial operator contract.

Next step before merge
Human review is needed for live proof, the stale CodeQL thread, and maintainer acceptance of the telemetry compatibility change rather than a narrow automation repair.

Security
Cleared: The patch changes diagnostics code, docs, and tests without adding dependencies, CI permissions, package resolution changes, secret handling changes, or executable supply-chain surfaces.

Review details

Best possible solution:

Land the centralized diagnostics contract after redacted live exporter proof and a maintainer decision to accept or adjust the expanded Prometheus metric-label contract.

Do we have a high-confidence way to reproduce the issue?

Not applicable. This is a feature PR, and current-main source search confirms the requested skill.used, tool_source, and openclaw_skill_used surfaces are not existing bug behavior to reproduce.

Is this the best way to solve the issue?

Yes, with merge gates. The centralized diagnostics contract is the maintainable direction, but it needs live exporter proof and maintainer acceptance or redesign of the metric-label compatibility change.

Label justifications:

  • P2: This is a normal-priority diagnostics feature with limited runtime blast radius but meaningful operator contract impact.
  • merge-risk: 🚨 compatibility: The PR changes the label set for existing Prometheus tool-execution metrics, which can affect saved dashboards and alert rules.
  • rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit; the patch reads as a reasonable diagnostics feature, but missing real behavior proof keeps it below merge-ready quality.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs real behavior proof before merge: The PR body explicitly says real environment testing was not completed; before merge it needs redacted live Gateway/exporter output such as terminal output, logs, a Prometheus scrape, or OTEL export proof with private details removed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

  • Current main lacks the requested diagnostics contract: Current main has tool.execution.* diagnostic event types without toolSource, toolOwner, or skill.used; source search also found no existing openclaw_skill_used or tool_source implementation. (src/infra/diagnostic-events.ts:376, 743fd4c9dbaa)
  • Current tool-hook emissions are source-unclassified: Current main emits tool.execution.started/completed/error/blocked from the before-tool-call wrapper using toolName, trace, tool call id, and parameter summary, but no source/owner or skill-usage event. (src/agents/pi-tools.before-tool-call.ts:806, 743fd4c9dbaa)
  • PR patch adds the central classification implementation: The public patch for head 8ce92cc adds DiagnosticToolSource, DiagnosticSkillUsedEvent, ToolDiagnosticIdentity, skill instruction-path matching, command-dispatch matching, and exporter handling for skill.used. (src/agents/pi-tools.before-tool-call.ts:194, 8ce92cc95f43)
  • Prometheus metric-label compatibility risk is real: Current docs list openclaw_tool_execution_total and duration metrics with four labels, while the PR patch adds tool_owner and tool_source labels plus a new openclaw_skill_used_total metric. Public docs: docs/gateway/prometheus.md. (docs/gateway/prometheus.md:99, 743fd4c9dbaa)
  • Real behavior proof is still absent: The PR body says real environment testing was not completed, remote Testbox/Crabbox attempts failed before test execution, and the observed proof is static diff check plus autoreview rather than live Gateway/exporter output. (8ce92cc95f43)
  • Existing review context still points to proof and compatibility gates: The existing ClawSweeper review comment asks for redacted live Gateway plus Prometheus scrape or OTEL export proof, and flags the metric-label contract as a maintainer acceptance point; the CodeQL thread appears stale against an older diff but should be resolved or dismissed before merge.

Likely related people:

  • vincentkoc: Recent current-main diagnostics commits touch the OTEL/Prometheus exporter area, and the latest PR head commit is also authored by this person after maintainer-side force-pushes. (role: recent diagnostics contributor; confidence: high; commits: 513195b462b7, e2501b2d6db2, 0f2e7510cbda; files: extensions/diagnostics-otel/src/service.ts, extensions/diagnostics-prometheus/src/service.ts)
  • steipete: Recent history for the before-tool-call and skills helper paths includes plugin SDK hook/path work and broad runtime refactors that overlap the affected hook context boundary. (role: recent tool-hook and runtime contributor; confidence: medium; commits: e4bae42d631e, d40dc8f025a3; files: src/agents/pi-tools.before-tool-call.ts, src/agents/skills/command-specs.ts)
  • gumadeiras: Recent skill command history includes inherited agent skill allowlist work on the same skill command surface that now carries skillSource. (role: skill command area contributor; confidence: medium; commits: ddd250d13075; files: src/agents/skills/command-specs.ts, src/agents/skills/types.ts)
  • shakkernerd: Recent history shows the split of skill command specs from workspace snapshot, which is adjacent to this PR's command-spec telemetry source addition. (role: skill command area contributor; confidence: medium; commits: 4499d572fa76; files: src/agents/skills/command-specs.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 743fd4c9dbaa.

@gauravprasadgp gauravprasadgp force-pushed the feature/skill-usage-telemetry branch 2 times, most recently from 73f40a9 to bc8c2a7 on May 11, 2026 17:45

@Ruthwik-Data Ruthwik-Data left a comment


Good addition — surfacing skill usage as a first-class diagnostic signal fills a real gap. Previously you could observe tool execution and model calls but had no way to track which skills were actually activated, which makes it hard to debug agents that silently fall back to generic behavior.

A few observations from reviewing the diff:

On the activation label in openclaw_skill_used_total: The counter tracks activation, agent, skill, source — worth documenting what valid values for activation are (e.g., command-dispatched vs read-from-SKILL.md) so operators building dashboards know what to filter on. Without that, the label is present but not actionable.

On the Prometheus query in the dashboard: sum by (skill, source) (increase(openclaw_skill_used_total[24h])) is useful for daily trends, but for debugging agent behavior in real time a shorter window (e.g., [1h] or [5m]) is more practical. Consider adding both in the example queries.
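
A small sketch of the multi-window idea: the helper below builds the same PromQL expression quoted above for several windows. The function name `skillUsageQuery` and the chosen window list are illustrative assumptions; only the metric name and the `skill` / `source` labels come from this thread.

```typescript
// Hypothetical helper: builds the PromQL from the comment above for a given range window.
function skillUsageQuery(window: string): string {
  return `sum by (skill, source) (increase(openclaw_skill_used_total[${window}]))`;
}

// Short windows for live debugging, a long one for daily trends.
const windows: string[] = ["5m", "1h", "24h"];
for (const w of windows) {
  console.log(skillUsageQuery(w));
}
```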

On cardinality: The skill label could become high-cardinality if users define many custom skills in SKILL.md. Worth noting in the docs whether there's any capping or truncation applied to the skill label value, similar to how other high-cardinality labels are handled elsewhere in the codebase.
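
One common way to cap a label like this is to admit a fixed number of distinct values and fold the rest into an overflow bucket. The sketch below shows that technique in isolation; `BoundedLabelSet`, the cap, and the `"other"` bucket are hypothetical, and nothing in this thread confirms whether the exporters apply any such cap.

```typescript
// Hypothetical sketch of label capping; not taken from the OpenClaw exporters.
class BoundedLabelSet {
  private readonly seen: Set<string> = new Set<string>();
  private readonly maxValues: number;
  private readonly overflow: string;

  constructor(maxValues: number, overflow: string = "other") {
    this.maxValues = maxValues;
    this.overflow = overflow;
  }

  // Return the value itself while under the cap, otherwise the overflow bucket.
  resolve(value: string): string {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.maxValues) {
      this.seen.add(value);
      return value;
    }
    return this.overflow;
  }
}

const skillLabels = new BoundedLabelSet(2);
for (const skill of ["deploy", "triage", "custom-skill", "deploy"]) {
  console.log(skill, "->", skillLabels.resolve(skill));
}
```

With a cap of 2, the third distinct skill is reported under the overflow bucket while previously admitted skills keep their own series, so the number of time series stays bounded regardless of how many custom skills exist.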

Overall the scope is well-bounded and the test coverage in service.test.ts looks appropriate. Happy to see this move forward.

@gauravprasadgp gauravprasadgp requested a review from a team as a code owner May 11, 2026 19:10
@openclaw-barnacle openclaw-barnacle Bot added channel: discord Channel integration: discord channel: matrix Channel integration: matrix channel: msteams Channel integration: msteams channel: signal Channel integration: signal channel: slack Channel integration: slack channel: telegram Channel integration: telegram channel: voice-call Channel integration: voice-call channel: whatsapp-web Channel integration: whatsapp-web channel: zalouser Channel integration: zalouser app: web-ui App: web-ui extensions: memory-core Extension: memory-core cli CLI command changes scripts Repository scripts commands Command implementations docker Docker and sandbox tooling channel: feishu Channel integration: feishu extensions: anthropic extensions: openai labels May 11, 2026
Comment thread src/agents/pi-embedded-runner/run/attempt.ts Fixed
@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. P2 Normal backlog priority with limited blast radius. rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. rating: 🌊 off-meta tidepool PR readiness rating does not apply to this item. labels May 18, 2026
@vincentkoc vincentkoc self-assigned this May 22, 2026
@vincentkoc vincentkoc force-pushed the feature/skill-usage-telemetry branch from f65be5a to b3bdfe2 on May 22, 2026 14:12
@clawsweeper

clawsweeper Bot commented May 22, 2026

Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@vincentkoc

Member

Verification after maintainer rewrite/rebase:

  • git diff --check origin/main...HEAD clean on 077a75685b
  • AUTOREVIEW_AUTO_TESTS=0 AUTOREVIEW_OPENCLAW_MAINTAINER_VALIDATION=1 .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main clean; no accepted/actionable findings
  • AWS Crabbox focused regression proof: run_a025b130b08b, lease cbx_b6a6fe523e2b, provider AWS c7a.8xlarge; pnpm test:serial src/agents/pi-tools.before-tool-call.e2e.test.ts src/auto-reply/reply/get-reply-inline-actions.skip-when-config-empty.test.ts src/agents/skills.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts passed 5 shards / 161 tests
  • AWS Crabbox fresh-PR proof: run_607180ab0238, lease cbx_12408152a349, provider AWS c7a.8xlarge; pnpm check:changed passed from a fresh checkout of openclaw/openclaw#80370

Known proof gap: no live third-party MCP/plugin server was exercised; focused tests cover OpenClaw metadata classification and OTel/Prometheus exporter behavior. No local pnpm/Vitest/check gate was run.

@vincentkoc

Member

Added real OpenClaw observability proof after the proof gate rejected unit-only evidence:

  • AWS Crabbox observability smoke: run_81b7bfc3aece, lease cbx_858db1d32be3, provider AWS c7a.8xlarge
  • Fresh PR checkout of openclaw/openclaw#80370
  • Command: pnpm qa:observability:smoke
  • Result: qa-otel-smoke passed against a QA-lab Gateway and local OTLP receiver with spans=18 metrics=67 logs=12 traces=2 metricRequests=15 logRequests=7; docker-prometheus-smoke passed with passed=1 failed=0 total=1
  • Final run summary: exitCode=0

I also updated the PR body’s ## Real behavior proof section with the copied console output so the gate has after-fix runtime evidence.


Labels

agents Agent runtime and tooling docs Improvements or additions to documentation extensions: diagnostics-otel Extension: diagnostics-otel extensions: diagnostics-prometheus gateway Gateway runtime merge-risk: 🚨 compatibility 🚨 May break existing users, config, migrations, defaults, or upgrade paths. P2 Normal backlog priority with limited blast radius. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. size: L status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.


4 participants