feat(diagnostics): classify skill and tool usage by gauravprasadgp · Pull Request #80370 · openclaw/openclaw

gauravprasadgp · 2026-05-10T17:54:04Z

Summary

Rewrites the original broad skill telemetry PR into one canonical diagnostics contract for skill and tool usage.
Adds trusted skill.used diagnostic events for successful skill reads and command-dispatched skill tools, without exporting raw paths, params, run ids, or session keys.
Adds bounded tool_source / tool_owner labels for core, plugin, MCP, and channel tool execution.
Updates OpenTelemetry, Prometheus, docs, changelog, and focused regression coverage while keeping labels low-cardinality.
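
The contract summarized above can be pictured roughly as follows. This is a minimal sketch, not the PR's actual code: the type names DiagnosticToolSource and DiagnosticSkillUsedEvent appear in the review of the patch, but the field names, the activation values, the InternalSkillUse record, and the helpers toSkillUsedEvent and toolLabels are assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; field names and literal values
// are assumptions, not copied from the PR diff.
type DiagnosticToolSource = "core" | "plugin" | "mcp" | "channel";
type SkillActivation = "read" | "command"; // assumed activation values

interface DiagnosticSkillUsedEvent {
  type: "skill.used";
  activation: SkillActivation;
  agent: string;
  skill: string;
  source: string;
}

// Internal record: carries both the bounded fields and sensitive context.
interface InternalSkillUse {
  activation: SkillActivation;
  agent: string;
  skill: string;
  source: string;
  skillPath: string;
  runId: string;
  sessionKey: string;
  params: Record<string, unknown>;
}

// Copy fields through an explicit allowlist (no object spread of the input),
// so raw paths, params, run ids, and session keys cannot reach an exporter.
function toSkillUsedEvent(use: InternalSkillUse): DiagnosticSkillUsedEvent {
  return {
    type: "skill.used",
    activation: use.activation,
    agent: use.agent,
    skill: use.skill,
    source: use.source,
  };
}

// Tool executions carry a bounded source plus an owner label.
function toolLabels(
  source: DiagnosticToolSource,
  owner: string,
): Record<string, string> {
  return { tool_source: source, tool_owner: owner };
}
```

Building the exported event field by field, rather than spreading the internal record and deleting keys, means a sensitive field added to the internal record later stays out of the export by default.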

Verification

git diff --check origin/main...HEAD
AUTOREVIEW_AUTO_TESTS=0 AUTOREVIEW_OPENCLAW_MAINTAINER_VALIDATION=1 .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main -> clean, no accepted/actionable findings
AWS Crabbox focused regression run run_a025b130b08b / cbx_b6a6fe523e2b on c7a.8xlarge: pnpm test:serial src/agents/pi-tools.before-tool-call.e2e.test.ts src/auto-reply/reply/get-reply-inline-actions.skip-when-config-empty.test.ts src/agents/skills.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts -> 5 shards passed, 161 tests passed
AWS Crabbox fresh-PR gate run_607180ab0238 / cbx_12408152a349 on c7a.8xlarge: pnpm check:changed -> passed
AWS Crabbox observability smoke run_81b7bfc3aece / cbx_858db1d32be3 on c7a.8xlarge: pnpm qa:observability:smoke -> passed

Real behavior proof

Behavior addressed: operators now get bounded skill usage and tool source/owner telemetry from the diagnostics exporters, so skill, MCP, plugin, channel, and core tool activity can be distinguished without leaking raw skill paths, tool params, run ids, or session keys.

Real environment tested: AWS Crabbox Linux c7a.8xlarge fresh checkout of openclaw/openclaw#80370, running the OpenClaw QA-lab Gateway against a local OTLP receiver and the Prometheus diagnostics scrape path.

Exact steps or command run after this patch: pnpm qa:observability:smoke from the fresh PR checkout.

Evidence after fix: terminal console output from AWS Crabbox run run_81b7bfc3aece / lease cbx_858db1d32be3:

[qa-suite] gateway ready: http://127.0.0.1:33625
[qa-suite] scenario pass (1/1): otel-trace-smoke
[qa-suite] run complete: passed=1 failed=0 total=1
qa-otel-smoke: passed spans=18 metrics=67 logs=12 traces=2 metricRequests=15 logRequests=7
[qa-suite] gateway ready: http://127.0.0.1:43289
[qa-suite] scenario pass (1/1): docker-prometheus-smoke
[qa-suite] run complete: passed=1 failed=0 total=1
run details provider=aws lease=cbx_858db1d32be3 slug=crimson-shrimp run=run_81b7bfc3aece type=c7a.8xlarge
{"provider":"aws","leaseId":"cbx_858db1d32be3","runId":"run_81b7bfc3aece","machineType":"c7a.8xlarge","exitCode":0}

Observed result after fix: the OpenClaw Gateway emitted OTLP traces, metrics, and logs to the receiver, then completed the Prometheus diagnostics smoke successfully from the same PR branch. The OTLP smoke reported 18 spans, 67 metrics, 12 logs, 2 trace requests, 15 metric requests, and 7 log requests; the Prometheus scenario reported passed=1 failed=0 total=1.

What was not tested: no live third-party MCP server or external plugin service was used in this smoke; those classifications are covered by OpenClaw tool metadata paths and exporter regression coverage.

clawsweeper · 2026-05-10T17:57:03Z

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-23 06:22 UTC / May 23, 2026, 2:22 AM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
The PR adds skill.used diagnostic events, bounded tool_source/tool_owner tool classification, Prometheus/OpenTelemetry export changes, docs, changelog, and focused regression coverage.

Reproducibility: not applicable. This is a feature PR, and current-main source search confirms the requested skill.used, tool_source, and openclaw_skill_used surfaces are not existing bug behavior to reproduce.

PR rating
Overall: 🧂 unranked krab
Proof: 🧂 unranked krab
Patch quality: 🐚 platinum hermit
Summary: The patch reads as a reasonable diagnostics feature, but missing real behavior proof keeps it below merge-ready quality.

Rank-up moves:

Add redacted live Gateway plus Prometheus scrape or OTEL export proof showing skill.used and tool_source/tool_owner.
Resolve or dismiss the stale CodeQL overwritten-property thread if it no longer applies to the latest diff.
Get maintainer confirmation on whether changing the existing Prometheus tool metric labels is acceptable.

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs real behavior proof before merge: The PR body explicitly says real environment testing was not completed; before merge it needs redacted live Gateway/exporter output such as terminal output, logs, a Prometheus scrape, or OTEL export proof with private details removed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

Real Gateway/exporter proof is missing; the PR body says live environment testing did not complete and focused tests were not run in the permitted environment.
Adding tool_owner and tool_source labels to existing Prometheus tool-execution metrics can change existing dashboard and alert query behavior for operators.
The stale CodeQL overwritten-property review thread should be resolved or explicitly dismissed before merge even though it appears to target an older diff.

Maintainer options:

Gate on live exporter proof and accept labels (recommended)
Require redacted Gateway plus Prometheus scrape or OTEL export proof, then intentionally accept the expanded tool metric label contract.
Preserve old metric compatibility
Keep the existing tool-execution metric labels and expose source/owner classification through a new metric or documented opt-in path.
Pause until diagnostics scope is settled
If first-class skill/tool-source telemetry is not ready for core diagnostics, pause or close the PR instead of landing a partial operator contract.
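
The "preserve old metric compatibility" option could be approached along the lines of the sketch below. Everything here is hypothetical: the base label names in the usage example, the ToolIdentity shape, and the includeClassification flag are illustrative stand-ins, since the thread only states that the existing metrics have four labels and that the PR adds tool_owner and tool_source.

```typescript
type Labels = Record<string, string>;

// Hypothetical classification attached to a tool execution.
interface ToolIdentity {
  toolSource: "core" | "plugin" | "mcp" | "channel";
  toolOwner: string;
}

// Return the label set for a tool-execution metric. With the opt-in flag
// off, the existing label set is returned unchanged, so saved dashboards
// and alert rules keep matching the same series.
function toolExecutionLabels(
  base: Labels,
  identity: ToolIdentity,
  includeClassification: boolean,
): Labels {
  if (!includeClassification) {
    return { ...base };
  }
  return {
    ...base,
    tool_source: identity.toolSource,
    tool_owner: identity.toolOwner,
  };
}

// Usage: the base labels below are placeholders, not the documented ones.
const legacy = toolExecutionLabels(
  { tool: "read_file", outcome: "completed" },
  { toolSource: "core", toolOwner: "openclaw" },
  false,
);
const expanded = toolExecutionLabels(
  { tool: "read_file", outcome: "completed" },
  { toolSource: "core", toolOwner: "openclaw" },
  true,
);
```

The trade-off is that operators who want source/owner breakdowns must enable the flag, while everyone else sees no change in series identity.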

Next step before merge
Human review is needed for live proof, the stale CodeQL thread, and maintainer acceptance of the telemetry compatibility change rather than a narrow automation repair.

Security
Cleared: The patch changes diagnostics code, docs, and tests without adding dependencies, CI permissions, package resolution changes, secret handling changes, or executable supply-chain surfaces.

Review details

Best possible solution:

Land the centralized diagnostics contract after redacted live exporter proof and a maintainer decision to accept or adjust the expanded Prometheus metric-label contract.

Do we have a high-confidence way to reproduce the issue?

Not applicable. This is a feature PR, and current-main source search confirms the requested skill.used, tool_source, and openclaw_skill_used surfaces are not existing bug behavior to reproduce.

Is this the best way to solve the issue?

Yes, with merge gates. The centralized diagnostics contract is the maintainable direction, but it needs live exporter proof and maintainer acceptance or redesign of the metric-label compatibility change.

Label justifications:

P2: This is a normal-priority diagnostics feature with limited runtime blast radius but meaningful operator contract impact.
merge-risk: 🚨 compatibility: The PR changes the label set for existing Prometheus tool-execution metrics, which can affect saved dashboards and alert rules.
rating: 🧂 unranked krab: Current PR rating is 🧂 unranked krab because proof is 🧂 unranked krab and patch quality is 🐚 platinum hermit. The patch reads as a reasonable diagnostics feature, but missing real behavior proof keeps it below merge-ready quality.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. The PR body explicitly says real environment testing was not completed; before merge it needs redacted live Gateway/exporter output such as terminal output, logs, a Prometheus scrape, or OTEL export proof with private details removed. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

Current main lacks the requested diagnostics contract: Current main has tool.execution.* diagnostic event types without toolSource, toolOwner, or skill.used; source search also found no existing openclaw_skill_used or tool_source implementation. (src/infra/diagnostic-events.ts:376, 743fd4c9dbaa)
Current tool-hook emissions are source-unclassified: Current main emits tool.execution.started/completed/error/blocked from the before-tool-call wrapper using toolName, trace, tool call id, and parameter summary, but no source/owner or skill-usage event. (src/agents/pi-tools.before-tool-call.ts:806, 743fd4c9dbaa)
PR patch adds the central classification implementation: The public patch for head 8ce92cc adds DiagnosticToolSource, DiagnosticSkillUsedEvent, ToolDiagnosticIdentity, skill instruction-path matching, command-dispatch matching, and exporter handling for skill.used. (src/agents/pi-tools.before-tool-call.ts:194, 8ce92cc95f43)
Prometheus metric-label compatibility risk is real: Current docs list openclaw_tool_execution_total and duration metrics with four labels, while the PR patch adds tool_owner and tool_source labels plus a new openclaw_skill_used_total metric. Public docs: docs/gateway/prometheus.md. (docs/gateway/prometheus.md:99, 743fd4c9dbaa)
Real behavior proof is still absent: The PR body says real environment testing was not completed, remote Testbox/Crabbox attempts failed before test execution, and the observed proof is static diff check plus autoreview rather than live Gateway/exporter output. (8ce92cc95f43)
Existing review context still points to proof and compatibility gates: The existing ClawSweeper review comment asks for redacted live Gateway plus Prometheus scrape or OTEL export proof, and flags the metric-label contract as a maintainer acceptance point; the CodeQL thread appears stale against an older diff but should be resolved or dismissed before merge.

Likely related people:

vincentkoc: Recent current-main diagnostics commits touch the OTEL/Prometheus exporter area, and the latest PR head commit is also authored by this person after maintainer-side force-pushes. (role: recent diagnostics contributor; confidence: high; commits: 513195b462b7, e2501b2d6db2, 0f2e7510cbda; files: extensions/diagnostics-otel/src/service.ts, extensions/diagnostics-prometheus/src/service.ts)
steipete: Recent history for the before-tool-call and skills helper paths includes plugin SDK hook/path work and broad runtime refactors that overlap the affected hook context boundary. (role: recent tool-hook and runtime contributor; confidence: medium; commits: e4bae42d631e, d40dc8f025a3; files: src/agents/pi-tools.before-tool-call.ts, src/agents/skills/command-specs.ts)
gumadeiras: Recent skill command history includes inherited agent skill allowlist work on the same skill command surface that now carries skillSource. (role: skill command area contributor; confidence: medium; commits: ddd250d13075; files: src/agents/skills/command-specs.ts, src/agents/skills/types.ts)
shakkernerd: Recent history shows the split of skill command specs from workspace snapshot, which is adjacent to this PR's command-spec telemetry source addition. (role: skill command area contributor; confidence: medium; commits: 4499d572fa76; files: src/agents/skills/command-specs.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 743fd4c9dbaa.

Ruthwik-Data

Good addition — surfacing skill usage as a first-class diagnostic signal fills a real gap. Previously you could observe tool execution and model calls but had no way to track which skills were actually activated, which makes it hard to debug agents that silently fall back to generic behavior.

A few observations from reviewing the diff:

On the activation label in openclaw_skill_used_total: The counter tracks activation, agent, skill, source — worth documenting what valid values for activation are (e.g., command-dispatched vs read-from-SKILL.md) so operators building dashboards know what to filter on. Without that, the label is present but not actionable.

On the Prometheus query in the dashboard: sum by (skill, source) (increase(openclaw_skill_used_total[24h])) is useful for daily trends, but for debugging agent behavior in real time a shorter window (e.g., [1h] or [5m]) is more practical. Consider adding both in the example queries.
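
Both windows suggested here can come from the same query template. A small sketch, where skillUsageQuery is a hypothetical helper and the accepted window format is deliberately limited to a single number plus unit (PromQL itself accepts more forms):

```typescript
// Build the skill-usage PromQL query for a given range window.
// Only simple durations such as "5m", "1h", or "24h" are accepted here.
function skillUsageQuery(window: string): string {
  if (!/^[0-9]+[smhdw]$/.test(window)) {
    throw new Error(`unsupported window: ${window}`);
  }
  return `sum by (skill, source) (increase(openclaw_skill_used_total[${window}]))`;
}

const dailyTrend = skillUsageQuery("24h"); // long window for daily trends
const liveDebug = skillUsageQuery("5m"); // short window for live debugging
```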

On cardinality: The skill label could become high-cardinality if users define many custom skills in SKILL.md. Worth noting in the docs whether there's any capping or truncation applied to the skill label value, similar to how other high-cardinality labels are handled elsewhere in the codebase.
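
The thread does not say whether the skill label is capped. One common way to bound such a label, sketched here under that uncertainty, is to admit a fixed number of distinct values and fold everything after that into a catch-all bucket. The class name, the limits, and the "other" bucket are assumptions, not documented OpenClaw behavior.

```typescript
// Bounds the number of distinct values a metric label can take.
class BoundedLabel {
  private readonly seen: Set<string> = new Set<string>();
  private readonly maxDistinct: number;
  private readonly maxLength: number;

  constructor(maxDistinct: number, maxLength: number = 64) {
    this.maxDistinct = maxDistinct;
    this.maxLength = maxLength;
  }

  // Truncate long values, then admit new values only while under the cap.
  normalize(raw: string): string {
    const value = raw.slice(0, this.maxLength);
    if (this.seen.has(value)) {
      return value;
    }
    if (this.seen.size >= this.maxDistinct) {
      return "other";
    }
    this.seen.add(value);
    return value;
  }
}

// Usage: with a cap of 2, the third distinct skill name is folded away.
const skillLabel = new BoundedLabel(2);
const labelValues = ["deploy", "review", "triage", "deploy"].map((name) =>
  skillLabel.normalize(name),
);
```

A first-come cap like this keeps series count predictable, at the cost of hiding late-arriving skills until the process restarts.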

Overall the scope is well-bounded and the test coverage in service.test.ts looks appropriate. Happy to see this move forward.

clawsweeper · 2026-05-22T14:20:57Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

vincentkoc · 2026-05-23T08:04:01Z

Verification after maintainer rewrite/rebase:

git diff --check origin/main...HEAD clean on 077a75685b
AUTOREVIEW_AUTO_TESTS=0 AUTOREVIEW_OPENCLAW_MAINTAINER_VALIDATION=1 .agents/skills/autoreview/scripts/autoreview --mode branch --base origin/main clean; no accepted/actionable findings
AWS Crabbox focused regression proof: run_a025b130b08b, lease cbx_b6a6fe523e2b, provider AWS c7a.8xlarge; pnpm test:serial src/agents/pi-tools.before-tool-call.e2e.test.ts src/auto-reply/reply/get-reply-inline-actions.skip-when-config-empty.test.ts src/agents/skills.test.ts extensions/diagnostics-otel/src/service.test.ts extensions/diagnostics-prometheus/src/service.test.ts passed 5 shards / 161 tests
AWS Crabbox fresh-PR proof: run_607180ab0238, lease cbx_12408152a349, provider AWS c7a.8xlarge; pnpm check:changed passed from a fresh checkout of openclaw/openclaw#80370

Known proof gap: no live third-party MCP/plugin server was exercised; focused tests cover OpenClaw metadata classification and OTel/Prometheus exporter behavior. No local pnpm/Vitest/check gate was run.

vincentkoc · 2026-05-23T08:08:06Z

Added real OpenClaw observability proof after the proof gate rejected unit-only evidence:

AWS Crabbox observability smoke: run_81b7bfc3aece, lease cbx_858db1d32be3, provider AWS c7a.8xlarge
Fresh PR checkout of openclaw/openclaw#80370
Command: pnpm qa:observability:smoke
Result: qa-otel-smoke passed against a QA-lab Gateway and local OTLP receiver with spans=18 metrics=67 logs=12 traces=2 metricRequests=15 logRequests=7; docker-prometheus-smoke passed with passed=1 failed=0 total=1
Final run summary: exitCode=0

I also updated the PR body’s ## Real behavior proof section with the copied console output so the gate has after-fix runtime evidence.

gauravprasadgp force-pushed the feature/skill-usage-telemetry branch 2 times, most recently from 73f40a9 to bc8c2a7, on May 11, 2026 17:45

Ruthwik-Data reviewed May 11, 2026

gauravprasadgp requested a review from a team as a code owner May 11, 2026 19:10

openclaw-barnacle Bot added the size: L label and removed the extensions: anthropic, extensions: openai, extensions: qa-lab, extensions: memory-wiki, extensions: codex, extensions: lmstudio, plugin: google-meet, and size: XL labels on May 11, 2026

gauravprasadgp force-pushed the feature/skill-usage-telemetry branch 2 times, most recently from 4b87bb9 to e3d5a09, on May 12, 2026 15:04

github-advanced-security AI found potential problems May 18, 2026

Comment thread on src/agents/pi-embedded-runner/run/attempt.ts: Fixed

This was referenced May 22, 2026

[Feature]: Skill usage telemetry — track which skills are actually being used #85263

Closed

feat: add skill usage telemetry — real-time exec-to-skill tracking #85307

Closed

friendfish mentioned this pull request May 22, 2026

feat: add skill usage telemetry — real-time exec-to-skill tracking #85354

Closed

vincentkoc self-assigned this May 22, 2026

vincentkoc force-pushed the feature/skill-usage-telemetry branch from f65be5a to b3bdfe2 on May 22, 2026 14:12

feat(diagnostics): classify skill and tool usage

077a756

github-actions Bot mentioned this pull request May 23, 2026

📡 Upstream Digest — 2026-05-23 09:44 UTC curtismercier/openclaw-mods#923

Open

This was referenced May 23, 2026

Add session-scoped continuation lease workflow API #85722

Closed

Clean up browser MCP subprocess tree #85832

Merged

vincentkoc merged 1 commit into openclaw:main from gauravprasadgp:feature/skill-usage-telemetry


4 participants
