Skip to content

feat(skills): add nemoclaw-maintainer-verify-stale#3327

Merged
cv merged 52 commits into
mainfrom
feat/verify-stale-skill-v2
May 27, 2026
Merged

feat(skills): add nemoclaw-maintainer-verify-stale#3327
cv merged 52 commits into
mainfrom
feat/verify-stale-skill-v2

Conversation

@prekshivyas

@prekshivyas prekshivyas commented May 11, 2026

Copy link
Copy Markdown
Contributor

Note: This PR supersedes #3063, which had unsigned commits in its history (a stale local git config in one of my clones — user.email=test@example.com, commit.gpgsign=false — produced 8 unverified commits I couldn't retroactively sign because the branch ruleset blocks force-push). Net diff vs main is unchanged. All commits on this branch are signed and authored by prekshiv@nvidia.com. The 8 post-review commits from #3063 are collapsed into a single signed commit; per-commit history for those is preserved in #3063 for reference.


Summary

Adds maintainer skill nemoclaw-maintainer-verify-stale — automates the manual loop of: spin up a Brev box → install latest NemoClaw → try to reproduce an old bug → comment with findings. Tag-only (never auto-closes); a maintainer pulls the trigger after reviewing.

The skill has been driven end-to-end against 9 real candidates during development, surfacing the gaps and skill changes summarised below.

Issues tackled

Issue Reported Verified Verdict Score What this run surfaced (skill change)
#2791 v0.0.21 v0.0.35 wontfix-by-design By-design prototype. Step 8.5 detection framework (3 signals: maintainer attribution / removal commit / symbol absent). Maintainer reviewed the analysis and closed the issue manually.
#2168 v0.0.21 v0.0.35 wontfix-by-design (tagged) Repo-label vocabulary mismatch (status: wont-fix, not wontfix); markdown-linked citations in comments.
#2007 v0.0.18 v0.0.35 fixed-on-latest (tagged) 84/100 Docker-group sg, brev exec PATH gotcha, sandbox-build-rot reframe, multi-axis architectural-drift check.
#2592 v0.0.28 v0.0.35 fixed-on-latest (tagged) 84/100 Sentinel-file SSH-drop guard, openshell sandbox exec -n NAME -- syntax footgun, reproducer-extraction regex extension (openclaw|openshell).
#2519 v0.0.26 v0.0.35 fixed-on-latest (closed by maintainer mid-run) 84/100 Step 5 provider classification + API-key prompt, install.sh env-var passthrough, sandbox-name validator.
#2604 v0.0.28 v0.0.36 fixed-on-latest (tagged) 84/100 File-based API key propagation (kill argv exposure local→Brev). 300-word hard ceiling for comments + comment-authoring principle in Step 10.
#2611 v0.0.28 v0.0.36 fixed-on-latest (tagged) 70/100 On-box subshell argv leak (extend file-based pattern to Brev→inner shell). First non-cap-at-84 score (the −30 LLM-synth penalty pushed raw below the cap).
#2513 v0.0.26 maintainer self-verified mid-batch Gap-queue: skill should re-check state == OPEN before posting. Second mid-batch close-by-maintainer (alongside #2519) → promote from queue to a real Step 1/3 check.
#2166 v0.0.21 still-reproduces (source-only) Gap-queue: when pre-flight is conclusive, allow source-only verdicts to skip Brev entirely (Verification mode: static analysis at <tag>).

Effective touched count: 7 issues with skill-rendered comments + 2 side-discoveries. Two of the seven landed wontfix-by-design, five landed fixed-on-latest. Sandbox-build rot capped four of the five at 84 (the modal verdict shape with reporter @-mention). #2611 was the first run to land below the cap, validating the −30 LLM-synth penalty path.

v1 scope

In Linux only (Brev), bug issues against CLI / Sandbox / OpenShell / Docker / Getting Started / Ubuntu / DGX Spark / GB10
Out macOS, Windows/WSL, integration bugs needing third-party tokens, service-account bots, auto-close

Verification flow (one-line per step)

Step Purpose
6.5 — Preconditions Brev auth + install-URL reachability fail-fast (no Brev cost on dead deps)
6.7 — Local-first Pure-CLI bugs (no sandbox/Docker/GPU) run on the maintainer's local install — same evidence, zero cost
7 — Reuse-or-provision brev ls → reuse a verify-stale-* box if running, else brev create with a runtime-picked CPU SKU
8 — Validate-on-baseline, verify-on-latest Install reported version first, confirm the reproducer actually exposes the bug, reset, then install latest
8.5 — By-design detection Three signals (maintainer attribution / removal commit / symbol absent) short-circuit Brev cost on intentional removals
9 — Confidence scoring +50 / +25 / +25 / −30 / −50, clamped to [0,100]. ≥85 silent · 60–84 + reporter @-mention · <60 inconclusive
10 — Comment + label Comprehensive redaction (JWT / NVAPI / PAT / base64 / internal hostnames / emails) with HTML→text pre-pass for QA bodies. 300-word hard ceiling for fixed/wontfix; 30–80 for still-reproduces.
11 — Infra failure Sandbox-build rot is the dominant failure mode for any version >5–7 patches behind; cap-at-84 with reporter @-mention is by design
12 — Activity log Append per-issue + per-session entries to ~/development/daily-rhythm/activity/nemoclaw-verify-stale-log.md

Idempotency

Every posted comment carries <!-- nemoclaw-verify-stale v1 YYYY-MM-DD -->. Candidate filter excludes any issue whose comments match <!-- nemoclaw-verify-stale v\d+ YYYY-MM-DD --> within the last 7 days OR carries fixed-on-latest / verify-inconclusive / status: wont-fix. Release sweep clears the first two on each tag cut.

Active-discussion handling (two-tier)

Maintainer's unanswered comment age Skill behaviour
≤ 7 days Skip — running on top of fresh discussion would conflict with the maintainer's framing
> 7 days Proceed with the unanswered-question variant: lead with a markdown link to the unanswered comment, dual @-mention of both maintainer and reporter

Companion changes

  • nemoclaw-skills-guide/SKILL.md — adds the new skill to the maintainer skills table.
  • nemoclaw-maintainer-cut-release-tag/SKILL.md — release-time sweep of fixed-on-latest and verify-inconclusive from every open issue. Without it, "latest" drifts and verifications go silently stale. Verification records stay in comments; only labels reset.

CI status

Check Status
commit-lint, dco-check, installer-hash-check, legacy-path-guard, pr/changes
markdown-links ✅ after 3abffd72 (placeholder [label](url) examples in SKILL.md were being parsed as real broken links — restructured as fenced-block templates) and dc5d7143 (merge of main brought in the deletion of docs/get-started/brev-web-ui-quickstart.md per #3050, resolving the merge-side missing-file failure)

Test plan

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

Summary by CodeRabbit

  • New Features

    • New maintainer skill to verify whether stale bug reports reproduce on the latest release, with interactive reproducer validation and evidence-backed comments/labels.
  • Enhancements

    • Release-tag workflow now conditionally clears verification labels from open issues when verification is stale or regression risk is detected (preserves labels in specified safe cases).
  • Documentation

    • Added comprehensive reference guides covering candidate selection, environment/reproducer setup, provisioning, reproduction rubrics, scoring, comments, and logging.

Review Change Stack

prekshivyas and others added 30 commits May 5, 2026 14:29
Adds a maintainer skill that automates verifying whether old bug reports
still reproduce against the latest NemoClaw release. The skill picks
candidate issues opened against older versions, reuses or provisions a
Brev Linux box (CPU or GPU based on the bug profile), runs the extracted
reproducer, scores confidence, and posts an evidence-backed comment with
either a `fixed-on-latest` or `verify-inconclusive` label. Tag-only —
never auto-closes.

v1 scope is intentionally narrow: Linux only, no third-party integration
credentials, no service-account bot. Each maintainer runs it under their
own gh credentials.

Companion changes:

- Adds the new skill to the maintainer table in `nemoclaw-skills-guide`
  (count 7 -> 8 maintainer skills, 18 -> 19 total).
- Adds Step 7 to `nemoclaw-maintainer-cut-release-tag` that sweeps
  `fixed-on-latest` and `verify-inconclusive` labels off all open issues
  at release time, so the next verify-stale run re-evaluates against the
  new latest. Verification records remain in comment history; only the
  labels are reset.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Three flaws surfaced when dry-running the Step 3 candidate filter
against the live open-issue list (141 open bugs).

1. Step 2 used `gh release view` to detect "latest". NemoClaw tags but
   does not publish GitHub releases, so this returned empty. Added a
   fallback that picks the highest semver from `git ls-remote --tags`,
   which is the load-bearing path today.

2. Step 3 said "2 or more minor versions behind". Wrong vocabulary for
   `0.0.x` repos — patch is what's iterating. Reworded to "at least 2
   versions behind in the rightmost-incrementing component", with a
   concrete `0.0.x` example and a forward-compat note for when NemoClaw
   moves to `0.1.x`.

3. Step 4 specified a naive `v?0\.0\.\d+` regex that matched IP-like
   strings in bug bodies (Ollama bind `0.0.0.0:11434`, loopback
   `127.0.0.1`, etc.) and produced phantom v0.0.0 / v0.0.1 candidates.
   Tightened to require `v` prefix + word boundaries first, with a
   context-anchored fallback that only matches `0.0.X` on lines
   containing `nemoclaw` or `version`. Added a clamp that rejects any
   parsed version greater than `$LATEST` (catches roadmap labels like
   `v0.0.35` from being treated as "reported on").

After these fixes the dry-run produces 33 plausible candidates out of
141 open bugs, with credible reported-version assignments verified
against issue bodies.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two more issues surfaced from a deeper second-pass dry-run.

1. The Step 4 regex was hardcoded to v0.0.x. NemoClaw is on that line
   today, but the skill should keep working when it moves to v0.1.x or
   higher. Generalized to `\bv\d+\.\d+\.\d+\b` with a context-anchored
   `\b\d+\.\d+\.\d+\b` fallback on lines containing nemoclaw or version
   (so the loose form doesn't match `0.0.0.0:11434` and `127.0.0.1`).

2. Some bug reports cite NemoClaw versions that were never tagged
   (e.g. `v0.1.0` × 3 issues, calver `2026.3.11` × 1). Without a tag
   existence check the skill would point Brev verification at a version
   that does not exist. Added a `git ls-remote --tags` validation pass
   that drops the version if no matching tag is found.

Also added an implementer note documenting a real failure mode caught
during the dry-run: a naive `[scan(primary)] | first | .[0] | tonumber
// [scan(fallback)] | ...` pipeline silently dropped 9 valid candidates
because `null | first` errors when scan returns empty, and the `//`
chain did not propagate cleanly to the fallback. The SKILL.md regex
itself was correct — the failure was implementation-level fragility in
how the two passes were composed. The note tells implementers to bind
each pass to a named variable, coalesce at the end, and explicitly test
the empty-match path.

After these changes the live dry-run produces 47 plausible candidates
(out of 141 open bugs) and the 4 residual drops are all reporter typos
that cite a version that was never released — correctly unverifiable.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Third dry-run round surfaced a precision issue. The previous regex
matched any `\bv\d+\.\d+\.\d+\b` anywhere in the body, then validated
against the tag list. That picked up versions belonging to *other*
products that share the v0.0.x line — most often OpenShell, but also
Node.js (v22.x.x), and other dependencies. Tag validation then
correctly rejected the non-NemoClaw ones, but on bodies where multiple
products were listed (e.g. `Environment: openshell 0.0.4, nemoclaw
0.1.0, Node.js v22.16.0`) the parser would land on OpenShell's
tag-valid v0.0.4 and report it as the NemoClaw version, producing a
false-positive candidate. 12 such collisions exist in the current
backlog.

Fix: require the version to follow `nemoclaw` (case-insensitive) within
80 non-letter, non-newline characters. The anchored regex
`(?i)nemoclaw[^a-z\n]{0,80}v?(\d+\.\d+\.\d+)` matches `NemoClaw
v0.0.32`, `nemoclaw 0.0.28`, `- NemoClaw: v0.0.16`, but not `openshell
0.0.4` followed later by `nemoclaw 0.1.0`. Combined with the existing
tag-list validation, this collapses the four error classes — typos,
calver dates, roadmap labels, and product collisions — into one
"version must be a real NemoClaw release tag, mentioned next to
NemoClaw" check.

Also expanded the implementer note to cover three concrete failure
modes hit during the dry-run:

- Empty-match handling (`[]` flowing into `first | .[0]` errors).
- Capture-group inconsistency between branches in a parser pipeline.
- Variable-scoping bug in `select($tags | index(.))` where `.` rebinds
  to `$tags` and silently passes invalid labels.

After this change the live dry-run produces 33 candidates out of 141
open bugs — same count as the very first round, but every candidate's
parsed version is now anchored to NemoClaw and validated against the
real tag list. The earlier wider counts (47, 48) included roughly 12-15
issues that would have been wasted Brev runs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…test

A clean run on latest is ambiguous on its own:
- Scenario A: bug really got fixed (script worked, latest passes)
- Scenario B: script was junk all along (script never triggered the
  bug; latest passes for the same reason baseline would)

These look identical from the latest pass alone, which means the
previous flow could land `fixed-on-latest` on a script that never had
a chance — a false positive that the +25 commit and +25 PR signals
only partially mitigate.

This change adds a baseline pass before the latest pass:

- Step 8a: install reported version on the box.
- Step 8b: run reproducer, compare output to issue's actual-result
  description. Match = script validated.
- Step 8c: if no match (script silent OR errored on the wrong thing),
  synth-repro using the issue body PLUS the baseline transcript so
  the LLM can react to the actual failure mode. Apply -30. Retry.
  Still no match -> verify-inconclusive, skip latest entirely.
- Step 8d: install latest, run validated reproducer. Score from this.

A single trigger drives synthesis: "the script didn't expose the bug
on baseline." Whether it was silent or noisy doesn't matter; both
mean the script needs work. This collapses the previous separate
"narrative-only -> synth" and "verbatim-failed -> ???" paths into
one rule.

Step 6 simplified accordingly: just extract verbatim if available,
otherwise carry the issue body forward to Step 8b for on-demand
synthesis. No more "give up before provisioning" branch — that
decision moves into Step 8c where it has more context to work with.

Step 9 adds a baseline-validation gating rule. If the reported-
version install fails (old releases rot — installer URLs, deps, OS
images drift), the score is capped at 84 unless commit-area or PR-
mention evidence also fires. That forces the @-mention-reporter band
when we lack independent corroboration, which is the honest position
when we couldn't validate the script ourselves.

Step 11 split into two failure types:
- Latest-install fail or harness error -> hard infra failure (no
  label, optional comment, move on).
- Baseline-install fail -> degraded mode (skip baseline gate, run
  latest anyway, apply Step 9 cap).

Wallclock budget bumped from 15 to 25 minutes per verification to
accommodate two installs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Tier-1 (blocking for E2E):

- Add the still-reproduces-on-latest path. If the latest run output
  matches the issue symptom, the bug is still live — the skill posts
  a "still reproducible" comment with both transcripts, applies no
  label, and includes a dated marker `<!-- ... v1 YYYY-MM-DD -->`.
  No new label per user direction; a 7-day TTL on the marker handles
  re-verification on the next weekly cron.
- Update the comment template for the two-pass flow. The previous
  template described a single transcript; the new one has Baseline
  and Latest sections plus a dedicated still-reproduces template,
  and the idempotency marker now carries today's date.
- Update the Step 12 log template to record baseline-install,
  baseline-match, latest-install, and latest-result independently.
- Clarify in Step 4 that REPORTED_VERSION must be the full tag string
  ("v0.0.32"), not just the patch number (32). Step 8a's installer
  expects the full tag.

Tier-2:

- Spell out an explicit match rubric in Step 8b: exit code agreement,
  symptom-phrase match (LLM-judged semantic equivalence counts), and
  distinguishing the bug from infra noise (DNS / auth / rate limits).
- Replace the minimal `rm -rf ~/.nemoclaw` reset with a comprehensive
  one that stops nemoclaw/openshell processes, removes spawned Docker
  containers, frees common ports (8080, 18789, 9119), and removes the
  installed binaries and lib paths. Run before each install in 8a and
  8d; idempotent so it's safe on a fresh box.
- Add interactive-subcommand handling in 8b: auto-detect `nemoclaw
  onboard` / `configure` and try `--non-interactive`, then
  `--dangerously-skip-prompts`, then stdin pre-feed. Fall through to
  Step 8c (synth) if none work.

Tier-3 / opportunistic:

- Fix the actual installer command (gap 9). The installer accepts
  the target ref via `NEMOCLAW_INSTALL_TAG` env var, NOT a `--version`
  flag (verified against install.sh source). Updated 8a accordingly.
- Update Step 3 idempotency: drop on either label OR a marker
  comment within the last 7 days. Previously a marker excluded the
  issue forever; the release sweep only clears labels, so the marker
  alone made re-verification impossible. The 7-day TTL also handles
  the still-reproduces case naturally.
- Extend wallclock budget to 60 min when issue body contains
  time-sensitive keywords (`after N minutes`, `eventually`, `memory
  leak`, etc.). Hard ceiling at 60 — bugs that require hours fall
  out of v1 scope.
- Keep provisioned boxes for 30 minutes on verify-inconclusive
  outcomes so a maintainer can `brev shell` in and triage. Reused
  boxes always stay.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five concrete fixes from the pre-E2E gap analysis, plus a verified
finding on gap 2 that lowered its risk profile.

Gap 1 — sudo behavior. The reset script's `sudo rm -f` and `sudo rm
-rf` could hang on a password prompt if the Brev image's default user
lacks passwordless sudo. Switched all sudo calls to `sudo -n` (non-
interactive) so they fail fast instead. Added a precondition note
explaining the assumption: Brev stock images allow passwordless sudo,
custom images may not. The user-local install path (~/.nemoclaw) is
fully reset regardless of sudo state, so the worst case is a stale
/usr/local/bin/nemoclaw symlink that the next install overwrites.

Gap 2 — verified the install architecture supports old-version pinning
natively. The bootstrap clones the requested ref and runs that tag's
own install.sh (with a payload-marker fallback for legacy ones), so
NEMOCLAW_INSTALL_TAG=v0.0.32 works architecturally. Risk reduces to
"does v0.0.32's own installer still work on a 2026 OS image" — handled
by Step 11's degraded-mode fall-through. No SKILL.md change needed.

Gap 4 — path extraction was hand-waved. The +25 commits-touched-area
weight referenced `git log v<reported>..$LATEST -- <path>` without
specifying how to determine `<path>`. Added a three-tier extraction
procedure: stack-trace path mentions parsed from the body and mapped
to repo paths, then a component-label-to-directory map (NemoClaw CLI →
bin/, Sandbox → src/lib/sandbox/, etc.), then title-keyword fallbacks.
If none yields a path, skip the +25 signal entirely rather than
guessing — floating the weight would inflate scores meaninglessly.

Gap 5 — PR-search query was hand-waved. The +25 PR-mention weight had
no concrete query. Added two-stage search: first `gh pr list --search
"$ISSUE_NUMBER"` filtered for actual `#NNNN` references in body or
title, then a symptom-phrase fallback if direct reference returns
nothing. PRs that merged before the issue was filed are excluded via
`mergedAt > tag-date(REPORTED_VERSION)`.

Gap 6 — issues without an "Actual result" section had no match path.
Behavioral / configuration bugs (e.g., #1242 "should default to a
stable released version") describe a wrong default rather than a
runtime error. Added a fallback to Step 8b's match rubric: use the
title + full body as the symptom, match if reproducer's outcome
contradicts the issue's expected behavior. If neither error string
nor expected-behavior contradiction can be identified, route to Step
8c (synth-repro) so the LLM produces a more diagnostic script.

Gap 7 — transcript redaction was thin. Step 10 only covered tokens,
paths, and basic-auth URLs. Real transcripts leak more: Authorization
headers, JWTs, GitHub PATs, AWS keys, NVIDIA API keys, internal
hostnames, emails (PII), long base64 blobs. Expanded the redaction
table with concrete patterns for each. Added an order note:
longest/most-specific patterns first so the generic base64 catchall
doesn't mask what was actually redacted. Made it explicit that
redaction runs on every quoted chunk in the comment, not just issue-
body excerpts.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…l SKU pick

Preflight Brev auth and install-URL reachability before paying any cost
(Step 6.5). For pure-CLI bugs with no sandbox, Docker, or GPU dependency,
try the reproducer locally on the maintainer's `nemoclaw` install before
provisioning a Brev box (Step 6.7) — same evidence, zero cost.

Replace the `<your-team's-CPU-SKU>` placeholder in Step 7 with a runtime
pick via `brev search cpu --sort price --json | jq` so the skill keeps
working as SKUs change, with a `VERIFY_STALE_CPU_TYPE` env override for
teams that need to pin.

Drop the spurious `--yes` flag from `brev delete` — it does not exist on
that subcommand and would error. `brev delete` is non-interactive by
default. Print the instance name before the trap registers so manual
cleanup is possible if the trap doesn't fire.

Thread `INSTALL_URL` through Step 8a/8d so the URL override from Step 6.5
applies to both the baseline and latest installs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Some bugs are filed against behavior that was deliberately removed or
changed in a merged PR. Running the standard rubric on these produces
a misleading verdict — the symptom "still reproduces" but the right
answer is "won't fix, see PR #X." Issue #2791 is the prototype:
`config set` was removed in #2227, the reporter tested a version that
already had it gone, and the standard rubric would have buried that
context under a low-confidence `verify-inconclusive` label.

Add Step 8.5 with three independent detection signals:

- Maintainer-attribution comment phrasing ("removed in #N", "by design",
  "wontfix", "intentional") from MEMBER/OWNER/COLLABORATOR authors.
- A removal commit between reported version and `$LATEST` whose subject
  matches `\b(remove|delete|drop|deprecate)\b` and whose diff deletes the
  symbol implicated by the reproducer.
- The implicated symbol absent from both reported version and `$LATEST`,
  meaning it was already gone when the issue was filed.

When any signal fires, skip Step 9 scoring and Brev provisioning, apply
a new `wontfix-by-design` label, and post a focused comment that links
to the responsible PR. Detection on signals 2 and 3 can run as soon as
the reported version is parsed, saving Brev cost on the entire class.

Generalize the Step 3 idempotency marker regex from `v1` to `v\d+` so
future skill versions can re-verify older-marked issues by tightening
the regex to require a specific marker version.

Add `wontfix-by-design` to the idempotency label list, the activity log
entry options, the session summary, and the release sweep in
`nemoclaw-maintainer-cut-release-tag`. Update the skill's frontmatter
description to mention the new label and the local-first short-circuit.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…var log path

The issue body that comes back from `gh issue view` for NV QA bugs is
HTML, not markdown — `<pre>...</pre>` blocks and tables, with HTML-encoded
entities. The previous Step 6 only mentioned `<pre>` in prose without an
extractor that actually parsed it. Add a Python extractor that handles
markdown fences (triple-backtick and tilde), HTML `<pre>`, and entity
unescape in one pass, so verbatim reproducers from QA-filed issues land
cleanly instead of falling through to LLM synthesis with a -30 penalty.

Step 10 already had a comprehensive redaction regex table, but the
patterns assumed plain text — tokens nested in HTML tags or attributes
slipped through unredacted. Add an HTML to text pre-pass for issue-body
excerpts before the regex table runs, so quoted excerpts from QA bodies
get the same redaction coverage transcripts already do.

Step 5's GPU keyword list flagged `inference` and `model serving`, both
of which false-positive on CPU bugs (`models.providers.inference.baseUrl`
is a config path, not a GPU need). Tighten to whole-word matches and
swap in `vllm`, `tensorrt`, `L40S`, `L4`, `T4`.

Step 12's activity log was hard-coded to one maintainer's personal
organizer path. Read from `VERIFY_STALE_LOG_DIR` with the existing path
as default, and require the skill to create the directory rather than
assuming it exists, so the skill works in CI / shared-volume / other
environments.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Path-extraction map in Step 9 was written from assumption rather than
verification — `src/lib/sandbox/`, `src/lib/openshell/`, and
`src/lib/policy/` don't exist in the current repo, and OpenShell lives
in a separate repo entirely. The +25 commits-touched-component signal
would have silently misfired on every Sandbox / OpenShell / policy
issue, underscoring real fixes. Replace with paths verified against
the current tree (`nemoclaw/src/blueprint/`, `nemoclaw-blueprint/`,
`nemoclaw/src/commands/`, etc.) and explicitly note the cross-repo
OpenShell case is out of scope for v1.

Step 8 reset wiped `~/.nemoclaw` but not `~/.openclaw`. Sandbox state
has been writable-by-default under `.openclaw` since #2227, so it
persists across the baseline → latest reinstall and contaminates the
latest run. Add it to the reset.

Anchor the `pkill -9 -f nemoclaw|openshell` patterns to a leading
slash (`/nemoclaw`, `/openshell`) so the kill matches actual installed
paths but not unrelated processes that mention these strings — including
the agent harness running this skill if its working directory contains
the word.

Reorder the Step 10 redaction table to match the stated "longest, most
specific patterns first" rule. The previous order had the generic
`(token|secret|password|...)` pattern executing before JWT/AWS/NVIDIA
patterns, which would have masked specific-token redactions with
generic ones — losing the signal of *what kind* of credential leaked.

Replace `git ls-remote git@github.com:NVIDIA/...` (Steps 2 and 4) with
`gh api repos/NVIDIA/NemoClaw/tags --paginate --jq '.[].name'`. The
SSH path required keys to be configured; `gh api` reuses whatever auth
the user already has.

Add a CLI-deps precondition check (`command -v gh brev jq python3 curl`)
to Step 6.5 so the skill fails fast if any required dependency is
missing, instead of failing late and confusingly mid-run.

Step 11's keep-box-on-inconclusive said "delay cleanup by 30 minutes"
but had no implementation. A backgrounded `sleep && brev delete`
doesn't survive session end. Replace with skip-the-trap-entirely plus
an explicit manual-cleanup reminder printed to the run output.

Out-of-Scope was contradicting itself on macOS — Step 6.7 explicitly
allows macOS for local-first runs. Clarify: Brev path is Linux-only,
the local-first short-circuit on a maintainer's laptop works on macOS
for manual single-issue runs.

Add a `local (no Brev — Step 6.7 short-circuit)` option to the Step 12
log entry's Box field so local-first runs round-trip the activity log
correctly.

Frontmatter description said "latest release" but NemoClaw uses tags,
not GitHub Releases — fix to "latest tag" to match Step 2.

Step 4's implementer note said "Two real failure modes" but listed
three; fix the off-by-one.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…-verification

Step 8.5 was producing comments that were correct in direction but
hand-wavy on substance. A side-by-side against an independent analysis
of #2168 surfaced the gap: that analysis cited specific file:line
locations, separated the literal bug-as-filed from a related failure
mode that still exists on latest, called out existing CI coverage of
the new workflow, and self-corrected an overstatement when prompted to
re-verify. Today the skill required none of that.

Restructure Step 8.5 into substeps so the rigor is mechanical, not
prompt-dependent:

- 8.5a: signal detection now requires verifiable evidence per signal —
  comment URL + quoted phrase (signal 1), commit SHA + diff range
  (signal 2), grep commands + outputs (signal 3). Add a sub-case for
  vestigial deprecation shims so a stub doesn't silently fail signal 3
  by appearing in latest.
- 8.5b: pre-check for related failure modes. Grep latest for the bug's
  symptom keywords (not the removed symbol) and require the comment to
  call out anything that surfaces. This is the section the side-by-side
  identified as missing — saying "the bug as filed can't reproduce" is
  not the same as "every bug shaped like this is fixed."
- 8.5c: check existing test coverage for the new workflow and cite up
  to three test paths if found. Strengthens the comment from "trust me,
  it was removed" to "the new workflow is exercised by these tests."
- 8.5d: explicit self-verification pass — re-run every cited command
  before composing the comment; bail to verify-inconclusive on any
  discrepancy. LLMs confidently overstate; mechanical re-verification
  catches it without needing a human prompt.

Replace the by-design comment template with one that has mandatory
sections matching the substeps: "What's structurally fixed", "Vestigial
references", "What's not literally the same bug", "Existing CI
coverage", "Recommendation". Each section either has concrete content
or is explicitly omitted — no hand-wavy claims slip through.

Add NVBugs cross-reference extraction to Step 4 (`grep -oE
'\[NVB#[0-9]+\]'`) so the by-design template can append the standard
"NVBugs#NNNNNNN will need a separate update; closing this GitHub issue
won't propagate" reminder when the issue body carries that footer (most
NV QA bugs do).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
The previous signal-2 candidate query narrowed by commit subject
(`remove|delete|drop|deprecate`) before checking whether the diff
deleted the implicated symbol. That filter excludes the common case
where a removal lands inside a `refactor(...)` or `feat(...)` commit —
e.g., PR #2227 removed `--dangerously-skip-permissions` under a
"refactor(sandbox): default to mutable config" subject. A literal
follower of the previous rule would have concluded signal 2 doesn't
fire on issue #2168 even though the flag was deliberately removed.

Switch the primary candidate lookup to `git log -S<symbol>` pickaxe
search, which finds every commit whose diff changes the count of the
symbol regardless of subject wording. Keep the subject-keyword query
as a supplementary lookup for narrowing when the pickaxe returns many
commits. Note explicitly that the commit's actual subject doesn't need
to mention removal.

Surfaced while testing the new Step 8.5 rigor against #2168.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…horing

Two changes that surfaced from running Step 8.5 against issue #2168.

Use the existing repo `wontfix` label for the by-design path instead
of creating a new `wontfix-by-design` label. The repo already has
`wontfix` and it's already in Step 3's issue-type skip list, so applying
it gives idempotency for free without proliferating labels. The
by-design verdict is encoded in the marker comment, not the label
vocabulary.

Important consequence: the release sweep in
`nemoclaw-maintainer-cut-release-tag` does NOT clear `wontfix` — that
label is also applied for non-skill reasons (scope, priority, dup
decisions made by maintainers), and sweeping it would erase human
triage work. Sweep stays scoped to `fixed-on-latest` and
`verify-inconclusive`, which are skill-only.

Add a `**Verification mode:**` line to the by-design comment template
that explicitly says "static analysis at the verified-on tag — no
runtime reproduction." This was the honesty an independent side-by-side
analysis of #2168 caught: the previous template implied runtime
confirmation even though Step 8.5 short-circuits Brev provisioning by
design.

Add a tag-anchoring rule above the template too — every `file:line`
cite must refer to the verified-on tag, not the maintainer's HEAD,
since line numbers drift between tags and main and we want comments
to stay reproducible by anyone reading them later.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ion check

Bare `file:line` paths in comments force the reader to navigate
manually — that's a usability bug, not a stylistic preference. Comments
posted by the skill should be self-serving artifacts: a maintainer or
QA reader should be able to click any citation and land on the exact
content at the verified-on tag.

Extend the tag-anchoring rule above the by-design template into a
"tag-anchoring + linking rule" with concrete formats for file blobs,
file:line, file:line-range, commit SHAs, and test paths. Note that
bare `#NNNN` PR/issue references already auto-link in same-repo
comments.

Strengthen Step 8.5d into two passes: evidence (re-run cited commands)
and link (resolve at least one rendered link per evidence section via
`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`). A 404 on a
citation suggests verification that didn't actually happen — worse
than no link at all. Bail to verify-inconclusive on any failure.

Surfaced while reviewing the rendered #2168 comment: the citations
were correct and tag-anchored but not clickable. Better caught now
than after the first batch run posts a wall of bare paths.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Discovered while applying the by-design label to issue #2168: the skill
referenced `wontfix` and `needs-info` as bare strings, but the repo's
canonical labels are `status: wont-fix` and `status: needs-info` (with
prefix and hyphen). Same shape in Step 3's issue-type skip list — issues
labelled `status: wont-fix` or `status: needs-info` were NOT being
filtered out at candidate-time as the spec intended, because the skip
list keys didn't match the live labels.

Three coordinated fixes:

1. Step 3 issue-type skip list now uses `status: wont-fix` and
   `status: needs-info`. Annotate the rule so future readers know not to
   substitute the bare forms.

2. Step 8.5 by-design action applies `status: wont-fix` (with quoting
   on the CLI). Step 12 log entry, session summary, frontmatter
   description, and Companion Behavior section in both the verify-stale
   skill and `nemoclaw-maintainer-cut-release-tag` updated to match.

3. Step 6.5 preconditions now verifies all expected labels exist on the
   repo with `gh label list` before any later step tries to apply them.
   This is the gap that bit us on #2168 — the comment posted but the
   label couldn't be applied because the spec name didn't match. Failing
   fast in 6.5 means the run aborts before any Brev cost or comment
   posting can happen against a misconfigured label vocabulary.

Bare `wontfix` / `won't fix` / etc. are still in Signal 1's detection
regex — that's intentional, those are SEARCH PATTERNS for what
maintainers might write in free-text comments, not label names.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…s for faithfulness

Two changes that landed before the first real Brev E2E run (#2007).

The previous wallclock cap split — 25 min default with a 60 min extension
keyed off "memory leak" / "over time" / "eventually" keywords — optimised
for the wrong constraint. Most issues fit comfortably under 60 min once
you account for two full install passes plus comprehensive resets, and
the keyword-based extension forced re-runs whenever a real install or
bootstrap took longer than the optimistic 25 min budget. Bump the cap
to a single 60 min default and drop the keyword extension; bugs that
genuinely require more than an hour fall out of v1 scope.

Add a new Step 8a.5 for bootstrapping reproducer dependencies that the
stock Brev image doesn't ship — local model servers (Ollama, vLLM),
provider runtimes, third-party CLIs. Default policy is maximum
faithfulness: install the actual dependency the reporter used rather
than substituting a stub. Substituting trades faithfulness for speed,
and on a 60 min budget that trade is rarely worth it; it almost always
introduces a confound that makes the verdict less trustworthy.

Document the canonical Ollama and vLLM bootstraps with specific
attention to daemon survival between `brev exec` calls (Ollama's
installer registers a systemd service; vLLM uses nohup + log file).
Bootstrap runs once before Step 8b's baseline and is reused for Step
8d's latest run — model downloads are expensive and external state, so
the comprehensive reset must explicitly leave them alone.

Carve out one substitution case: providers that require API keys the
skill cannot safely supply (NIM, OpenAI, Anthropic). Stubbing a key
defeats faithfulness anyway, and a real key shouldn't sit in a
verify-stale run. Substitution applies the same -30 LLM-synth penalty
as Step 8b and must be documented in the rendered comment.

Bootstrap failure escalates to Step 11 infra failure — do NOT silently
substitute, since the maintainer opted into faithfulness for a reason.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…get on every comment

Two requirements surfaced from the #2007 e2e run that should be enforced
by the skill, not by per-session memory or by the prompter remembering.

Mandatory reporter @-mention with confirmation language. The skill cannot
independently confirm a closed-as-fixed verdict — only the reporter
knows whether their original symptom is gone in their environment. The
@-mention is what converts a "skill says it is fixed" claim into
actionable confirmation work for QA. Add the explicit closing block
(canonical wording: "please confirm the symptom is gone on a recent
build and reopen with a fresh reproducer if you observe otherwise") to
all three Step 10 templates: fixed-on-latest, still-reproduces, and the
Step 8.5 by-design template. The previous "If this verification is
wrong, please reopen..." line was passive; the new line names the
reporter and asks them to act.

Length target. Default rendered comments to 400-500 words. The evidence
table or by-design fixed/vestigial sections are the hero; everything
else has to either change the reader's mind about the verdict or be
deleted. The first #2007 draft ran ~750 words and the user pushed back
explicitly; encoding the cap into Step 10 means future agents do not
start from scratch on length each run.

Surfaced from: e2e Brev verification run on issue #2007 (the first
real Brev-track exercise of the skill end to end).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six concrete gaps that surfaced during the first end-to-end Brev run
(issue #2007). Each almost produced a wrong verdict or a wasted run; each
is now encoded so the next agent doesn't have to re-discover.

1. Step 9 baseline-validation cap-removal rule was too permissive. The
   `unless commits-touched OR PR-mention also fires` escape hatch let
   inferred fix evidence override the absence of a baseline run, which
   produced a misleading 100/100 on #2007 despite zero baseline
   confirmation. Tightened: cap at 84 holds regardless of corroboration;
   PR-search signals raise the score within the cap, never past it.
   Step 10 now requires an explicit one-line caveat naming the cap and
   the reason in the rendered Verdict section.

2. Step 6.5 install URL default switched from `nemoclaw.nvidia.com`
   (NVIDIA-internal, doesn't resolve from Brev) to `www.nvidia.com/
   nemoclaw.sh` (public Akamai 301 redirect). Brev runs were dying at
   bootstrap on the wrong host.

3. Step 7 CPU SKU picker now biases by reproducer-implied memory needs.
   On #2007 the cheapest 2 GB SKU couldn't load a 4.8 GiB Ollama probe;
   onboard failed at provider validation and we burned ~25 min before
   re-provisioning a 16 GB box. Adds a `CPU_RAM_FLOOR` env var and a
   memory-floor heuristic (16 GB when reproducer references a model
   server, 8 GB for sandbox onboarding without a model, 4 GB for
   pure-CLI bugs).

4. Step 11 failure taxonomy now has a "baseline-build rot" bucket
   distinct from "binary install rot." Same `BASELINE_INSTALL_FAILED=1`
   flag and same downstream cap-and-degrade behavior, but failures at
   the in-image Dockerfile build phase (what we hit on v0.0.18 — the
   `.openclaw-data/workspace/media` symlink layer, removed entirely by
   #2227) get a separate label so reviewers can see *why* the old
   image no longer builds without re-running.

5. New Step 8a.5b documents the two non-obvious `brev exec` quirks that
   reproducer scripts have to handle every time: PATH does not include
   `~/.local/bin` in non-login shells (so reproducers must `export PATH`
   at the top, or callers must wrap with `bash -lc`); and the docker
   group requires `sg docker -c '...'` because adding the user via
   `usermod -aG` doesn't take effect within the same Brev session.

6. New Step 8d.5 architectural-drift check. When the diff between
   `$REPORTED_VERSION` and `$LATEST` touches the *tool* the reproducer's
   expected output depends on (e.g. `openshell forward` between v0.0.18
   and v0.0.35), don't trust the reproducer's surface alone — multi-axis
   verification on OS-level surfaces (host listeners, NAT rules, docker
   ports, SSH tunnels, etc.) is required before claiming fixed-on-latest.
   This is the five-axis pattern we used to confirm #2007 wasn't a
   false positive; codifies it as a check the skill applies whenever
   pickaxe shows the reproducer's tool was reworked.

Surfaced from: end-to-end Brev verification of issue #2007. Build the
skill, exercise it on a real issue, fix what breaks, repeat — every
fix here came from a concrete failure mode in one real run.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ll host

Two small UX fixes surfaced from a thorough re-read.

Step 6.5's brev-auth error printed three numbered options without a
clear recommendation. From a non-TTY agent harness (Claude Code or any
unattended runner) the right answer is option 2 — open a separate
terminal, run `brev login`, complete the browser flow, come back. The
previous message buried that path inside option 2's inline comment.
Replace with a directive recipe that names the recommended flow first
(open separate terminal, browser auth, re-run the skill) and lists the
headless alternatives below as "when option 1 isn't available." Also
explicitly notes that credentials persist to ~/.brev/credentials.json,
so re-running the skill picks them up automatically.

Step 6.5's install-URL error still suggested checking
`https://nemoclaw.nvidia.com` even after the default was switched to
`https://www.nvidia.com/nemoclaw.sh` in commit 296f7cd. Update the
suggestion to match the new default and add an explicit
"then re-run this skill" so the maintainer knows the flow is
fix-and-retry, not abort.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five gaps verified against the source during a thorough re-read pass.
The audit also caught two false-positive claims I'd made earlier
(`NEMOCLAW_INSTALL_TAG` env var and `comments` field in batch fetch)
that didn't survive verification — those are not landed because there
was nothing to fix.

Step 1 batch discovery query bumped from `--limit 100` to `--limit 1000`
so the skill doesn't silently drop issues beyond the page. The recent
candidate triage on this repo found 129 open bugs; the previous limit
would have lost 29 of them. The 20-issue per-run processing cap is
downstream of Step 3/4 filters and unaffected.

Step 3's 7-day comment-marker TTL had the rule but no implementation.
Add a portable `gh issue view --json comments --jq` snippet that
extracts marker comments newer than the cutoff. macOS and Linux
date(1) syntax differ for relative-date math, so the cutoff line
tries both. Without this, the first cron run after any posting will
silently re-verify everything previously marked, posting duplicate
comments every Monday.

Step 6.5 preconditions now checks gh identity via `gh api user --jq
.login` and prints the resolved login before any later step runs.
Comments posted by Step 10 land under this account; surfacing it
explicitly catches the multi-token / wrong-tab / mid-session-reauth
case before a public comment lands under the wrong handle.

Step 10 now requires a `**Verification mode:**` header line in every
template (was previously by-design only). Reader should never have to
guess whether a verdict came from real install logs or from static
analysis. Filled in concrete defaults for the standard
fixed/inconclusive template ("runtime reproduction; baseline + latest
installed and run") and the still-reproduces template ("runtime
reproduction; bug confirmed live").

Step 10 also requires a link-pass self-verification on every
template, not just by-design's Step 8.5d. The "Tag-anchoring + linking
rule" already declared every citation must be a clickable markdown
link to the verified-on tag, but the verification step that resolves
those links (`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`)
was scoped only to the by-design path. Same 404-cite risk applies to
the standard template — broken citation links advertise verification
work that didn't happen, and that's worse than no citation at all.

Surfaced from: end-to-end audit after the #2007 e2e run, with the
specific `gh issue view`/`grep` commands run against the SKILL.md to
confirm each claim before listing it as a gap.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two changes that together turn the per-run batch cap from policy into code.

Lower the per-batch processing cap from 20 to 15. Sequential execution
with 1-2 reused Brev boxes works fine for runs in this size range; ~2-3
hours wallclock for a full 15-issue batch is the comfortable budget
before either the per-plan approval gate breaks down or the maintainer
session needs to span multiple sittings.

Add the slicing code that actually enforces the cap. Previously Step 1
declared "Cap at N issues per run" as policy and nothing applied it —
candidate sets larger than the cap would just process to completion
silently. Move the slice to the end of Step 4 (after Step 3 label
filters and the Step 4 version+candidate-rule filters narrow the pool)
and sort by `(-versions_behind, -age_days)` so the most stale come
first. Spillover beyond 15 stays eligible for the next run via the
marker-comment 7-day TTL added in 8976176.

Single-issue mode bypasses the cap entirely; the maintainer named the
issue explicitly.

Cadence section updated from "≤20 issues" to "≤15 issues" to match.

Surfaced from the post-#2007 audit: of the 13 gaps initially called
out, the cap-enforcement one was operational toil (skill could chew
through cost on a runaway batch). Lowering and enforcing closes that
toil path.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Looked at the 24 candidates surfaced by the recent triage run by *kind*
of bug rather than by version, and found six gaps where the standard
rubric would produce a wrong verdict or hang. All six landed in this
commit.

Step 3 platform skip: drop `Platform: Jetson AGX Thor/Orin`. Brev has
no equivalent silicon (embedded/edge ARM with integrated GPU is not
in the SKU catalog), so any Brev verification of a Jetson-only bug
produces a misleading "fixed-on-x86" verdict. `Platform: DGX Spark`
and `Platform: GB10` stay in scope but Step 10 now requires a
hardware-substitution caveat naming the Brev SKU we substituted.

Step 3 TUI / interactive-UI skip: drop issues whose title contains
TUI / dashboard UI / chat UI / keystroke / key press, or whose body
describes interactive UI behavior without a non-interactive reproducer.
`brev exec` does not allocate a real TTY by default; TUI reproducers
hang or silently fail at the first prompt. v1 documents this as
out-of-scope; v1.1 may add a script(1) / expect / tmux harness.

Step 5 now classifies bug-class in addition to CPU/GPU. Four classes:
`performance`, `rebuild-cycle`, `log-only`, `functional` (default).
Detection heuristics from the issue body (latency thresholds, "across
rebuilds" / "after restart", "see lots of error in <X> log"). Each
class routes to a different Step 8 rubric.

Step 8b: log-scraping. When `BUG_CLASS=log-only`, also pull
`~/.openclaw/logs/*.log` and `/var/log/nemoclaw/*.log` from inside the
sandbox after the reproducer runs and search them for the issue's
symptom phrase. Some bugs describe symptoms in internal log files,
not the reproducer's stdout; the previous match rubric only checked
the transcript.

Step 8b: flake-detection retry. For functional bugs, run baseline
three times if the first run shows the symptom inconsistently. Mixed
results (1 or 2 of 3 reproduce) trigger a "flake suspected" caveat,
−25 score adjustment, and downgrade `+50 latest clean` to `+25` so
a lucky-clean-latest-run on an intermittent bug doesn't silently
become a fixed-on-latest verdict.

New Step 8e: performance-bug verification. Multi-run latency
distribution rubric — N=10 runs each side, compute p50/p90, parse
SLA from issue body, match latest's distribution against the SLA
rather than against a symptom phrase. Cap at 60 unless the bug is
silicon-independent because Brev SKUs aren't faithful to DGX Spark
or GB10 hardware for performance-shape bugs.

New Step 8f: rebuild-cycle verification. Run-rebuild-rerun harness
for bugs that only manifest across destroy/recreate boundaries
(#2701-shape). Capture artifacts pre-rebuild, trigger
`nemoclaw destroy --all --force && nemoclaw onboard`, re-capture
post-rebuild, diff to determine whether the artifact persisted as the
issue expects.

Step 10 mandatory hardware-substitution caveat: when DGX Spark or
GB10 is in the issue's labels and Step 7 substituted with a different
silicon class, the comment metadata block must name the substitution
explicitly so silicon-shape bugs get the right reader expectation.

Surfaced from: stress-testing the skill against the actual shapes of
bugs in the 24-candidate set, not just against the bugs we already
verified.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six gaps surfaced from the second e2e run (issue #2592) and the broader
audit of how the skill handles provider classification.

Step 5 now classifies provider in addition to CPU/GPU and bug-class.
Detection signals from labels and body keywords identify NIM, Gemini,
Anthropic, Bedrock, or Ollama. When the reproducer references a
non-Ollama provider AND actually exercises inference, the skill stops
at Step 5 and prompts the maintainer interactively before any Brev
cost — three options: provide the API key, accept Ollama substitution
with the -30 penalty, or skip to verify-inconclusive. Pure-CLI /
pure-sandbox bugs are exempt because the provider doesn't matter.

Step 6 reproducer-extraction regex extended to also match `openclaw`
and `openshell` invocations as anchor words. Issue #2592's reproducer
was `openclaw channels add telegram` run inside the sandbox; the
previous `nemoclaw`-only regex would have missed the verbatim block.

Step 8a now passes the provider env vars through to install.sh's
bundled onboard step, so it doesn't fall back to the default `build`
(NIM) provider and fail with the misleading NVIDIA_API_KEY missing
error. NEMOCLAW_PROVIDER=ollama (the default case) makes the bundled
onboard use the local Ollama set up in Step 8a.5; NVIDIA_API_KEY
only flows through if the maintainer provided one at Step 5's prompt.
The bundled onboard creates a throwaway sandbox that gets destroyed
before the reproducer runs. The reproducer's own onboard should pass
`--fresh` so a half-built install-sandbox doesn't trip the "previous
session failed" guard.

Step 8a.5 now has an Ollama-coverage table making explicit which bug
classes Ollama covers faithfully (CLI, sandbox, networking) vs which
ones it doesn't (provider-specific behavior, model-specific behavior,
silicon-dependent perf). Step 5's prompt keys off this table.

Step 8a.5b documents the openshell-sandbox-exec syntax footgun. The
correct form is `openshell sandbox exec -n NAME -- CMD`; the wrong
form silently auto-detects the sandbox by "last used" and stuffs the
leftover positional into bash's $0. Same section gets a brev-exec
re-execution guard via a sentinel file at ~/.verify-stale-running to
prevent the double-onboard scenario when SSH drops mid-run.

Step 11 reframes baseline-build-rot as the dominant failure mode for
any reported version >5-7 patches behind, not an edge case. Both
Brev e2e runs hit it. Cap-at-84 with reporter @-mention is the modal
verdict shape, not the exception — pre-flight carries more weight
than baseline runtime evidence for older bugs.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…sandbox name

Two fixes from a rot-debugging investigation that nearly produced a
wrong conclusion.

When testing whether old NemoClaw versions actually rot under the
literal end-user invocation (`https://www.nvidia.com/nemoclaw.sh`),
the first test run was botched by a shell-scoping mistake:
`NEMOCLAW_INSTALL_TAG=v0.0.26 curl ... | bash` scopes the env var to
curl, not to the downstream bash reading from the pipe. The bash side
defaulted to `latest`, resolved to v0.0.36, and the install proceeded
for several minutes producing convincing-looking output before the
mistake surfaced. The install script had no log line saying which
version it actually resolved, so the silent failure looked like a
working install of v0.0.26.

Skill-side fix: after each install (Step 8a baseline and Step 8d
latest), echo the resolved `nemoclaw --version` and case-match it
against the requested version. Mismatch in baseline sets
BASELINE_INSTALL_FAILED=1 to prevent verifying against the wrong
version. Mismatch in latest logs a WARN that gets surfaced in the
final comment. Cheapest possible guard against the class "I asked for
X, got Y" — and the principle generalizes: print the resolved state,
never trust the requested state.

Also fix the sandbox name in the install.sh-passthrough block: change
`NEMOCLAW_SANDBOX_NAME=__verify_stale_install__` (rejected by NemoClaw's
name validator: "Allowed format: lowercase, starts with a letter,
letters/numbers/internal hyphens only, ends with letter/number") to
`NEMOCLAW_SANDBOX_NAME=verify-stale-install`. Surfaced during the #2519
e2e run; net-zero impact on that run because install.sh fell through
on name validation rather than reaching the buggy code path, but the
spec was wrong.

The deeper context for the resolved-version-echo fix: when the rot
hypothesis was being tested, an independent agent's logical analysis
correctly identified that the public install URL DOES support
`NEMOCLAW_INSTALL_TAG=<ref>` and demanded an empirical re-test. The
re-test (with the env var on the bash side of the pipe) installed
v0.0.26 correctly and STILL hit `Patch 4 (replaceConfigFile EACCES)
not applied` — so the cap-at-84 framing held empirically, just not
for the original "git-clone is the only path" reason. The full
investigation is documented in findings.md Part 6's closing section
("Closing investigation: is baseline-build rot real, or our setup
artifact?").

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Add a Step 3 skip rule for issues where a MEMBER/OWNER/COLLABORATOR
comment was posted within the last 7 days AND the original reporter
hasn't replied since. The skill is for *stale* issues; actively-
discussed ones are in-flight, and posting a verify-stale verdict on
top of an open clarifying question from a maintainer would conflict
with their framing and confuse the reporter (now they have two
simultaneous asks).

Surfaced during pre-flight on issue #2757. The maintainer @cjagwani
had just commented questioning the bug's premise — specifically that
the reporter's "kill -9 took the parent down" framing didn't line up
with how the gateway is launched (`nohup ... &` detaches it) — and
asked the reporter to confirm what they actually observed on the
Station. Running verify-stale would have posted a "by-design, close
as wontfix" verdict on top of an active "wait, did this really
happen?" exchange. Wrong move; the skill should detect this state
and skip.

Implementation reuses the SEVEN_DAYS_AGO cutoff already computed for
the marker-TTL check, fetches the reporter via `gh issue view --json
author`, then jq-walks the comments array: find the most recent
maintainer comment within the cutoff, check whether any reporter
comment exists after that timestamp; if maintainer commented recently
AND no reporter reply since, skip.

Single-issue mode prints the skip reason and exits friendlily so the
maintainer knows why the skill bailed. Batch mode just moves on.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…argv

The post-#2592 commit (1d625dd) added a Step 5 prompt asking the
maintainer to "Export NVIDIA_API_KEY=<key>" and propagate it via
`brev exec`. During the #2604 e2e run that played out as
`NVIDIA_API_KEY=<value> brev exec ...` — which puts the literal key
in the brev exec process's argv, visible in `ps -ef` to anyone with
shell access on either the maintainer's laptop or the Brev box for
the duration of the run. The skill's stated promise ("never logged")
got violated by argv visibility.

Fix: switch propagation to file-based. Maintainer writes the key to
~/.nvidia-api-key with 600 perms on their laptop; Step 6.5 brev-copy's
the file to the Brev box (encrypted SSH); install / reproducer scripts
inside the box read with `NVIDIA_API_KEY=$(cat ~/.nvidia-api-key)`,
which sets the env var in the script's own process — never on a
command line and never visible in `ps -ef`.

Update Step 5's option-1 prompt to instruct the maintainer to use the
file form and to `rm ~/.nvidia-api-key` after the run. Add a new
"API-key propagation pattern" section after Step 5 that documents the
file-based mechanism explicitly. Update Step 8a (baseline install) and
Step 8d (latest install) to source the key from the file inside the
Brev box's exec context, not from the local shell's env.

Note in the docs: if the key was previously propagated via the cmdline
(pre-fix), treat it as exposed and rotate. The #2604 run did this; the
maintainer was reminded to rotate the NIM key after the run.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Replace Step 10's narrow "Length target" rule with a richer "Comment
authoring principle" block that captures lessons accumulated across
the e2e runs.

The core rule: every section in a rendered comment must either change
a reader's mind about the verdict or be cut. Word counts follow from
that — 500 is a ceiling, not a target. Most comments land in 200-400.
Simple cases land under 200.

Document why this matters: comments posted by the skill compete for a
maintainer's attention against every other in-flight thread on the
repo, and AI-slop prose actively reduces signal-to-noise.

Add a worked-examples block citing two real iterations:
- #2007 first draft was 750 words; cut to 371.
- #2604 took THREE drafts before settling on 190 words because each
  draft padded with prose that didn't ground the verdict — which
  caused the verdict itself to drift. Rule learned: name the verdict
  in one sentence first, then cut any section that doesn't support it.

Add per-verdict length targets:
- fixed-on-latest: 200-400 words
- wontfix (by-design): 250-500 words
- verify-inconclusive: 100-200 words
- Still-reproduces (no label): 30-80 words — and no transcripts
  (issue body has them), no @-mention (reporter knows), no
  architectural prose. One sentence + marker.

Add an explicit "cut, by default" list naming the patterns that
historically padded comments without changing verdicts.

Generalizes the existing memory note into the skill body so future
agents reading SKILL.md inherit the principle directly.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
… ceiling

Followup to fe363cf. The 200-400 / 250-500 ranges in the per-verdict
table were still too permissive. Tighten to 200-300 for both
fixed-on-latest and wontfix; verify-inconclusive stays at 100-200;
still-reproduces stays at 30-80.

Update the principle preamble: "300 is a hard ceiling for the main
verdicts." Simple cases land under 200. If a draft is past 300, it's
padding — cut before re-reading.

Memory note (feedback_verify_stale_comment_length.md) updated to match.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…d-question variant

Previously skipped any issue with a maintainer comment in the last 7 days the
reporter hadn't replied to. After 7 days the question is no longer "active
discussion" — it's a stuck thread, and the skill can be the unsticking voice
rather than a clueless interruption.

Step 3: classifies the most-recent-unanswered-maintainer-comment as recent
(within 7d → skip) or stale (>7d → proceed with variant), exporting
UNANSWERED_MAINT_LOGIN/URL/DATE for the templater.

Step 10: new mandatory block describing the variant — prepend an
"@<maint>'s comment from <date> is still unanswered" lead paragraph and
swap the closing reporter-only @-mention for a dual maintainer+reporter
@-mention that flags the open question.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- brev-provisioning: append `// empty` to the CPU_TYPE jq filter so an
  empty result yields an empty string instead of the literal "null",
  which `[ -n "$CPU_TYPE" ]` would otherwise accept and pass to
  `brev create --type "null"`.
- by-design: fix "greping" -> "grepping".
- candidate-selection: drop the hardcoded "8 prefixed variants" count
  (the inline list enumerates 11) and point readers at `gh label list
  --search enhancement` for the live set.
- environment-and-reproducer (API key entry): replace the inline
  `printf '%s' '<your-key>' > ~/.nvidia-api-key` example with a
  no-echo, no-history `IFS= read -rs` flow; the inline form left the
  key in shell history.
- environment-and-reproducer (local-first match): split the conjoined
  predicate into two explicit branches — matches reported symptom ->
  still-reproduces, matches expected-fixed behavior -> fixed-on-latest;
  third branch falls through to Brev as before.
- reproduction-rubrics (Step 8c rerun): prepend the same `export
  PATH="$HOME/.local/bin:$PATH"` guard 8b and 8d use, so a synth-repro
  rerun doesn't misread as `verify-inconclusive` because the nemoclaw
  binary fell off PATH.
- reproduction-rubrics (drift-check loop): switch from `for t in
  $TOOL` to `mapfile -t TOOLS` + `for t in "${TOOLS[@]}"`, so
  multi-word tool strings ("openshell forward") stay intact under
  pickaxe instead of word-splitting into separate searches.
- reproduction-rubrics (Step 8e perf rubric): compute p50 as the mean
  of the 5th and 6th sorted values for N=10 (standard median), and
  add the missing p90 line as the nearest-rank 9th value. Match
  rubric grows a p90 backstop that flips a within-p50 verdict to
  still-reproduces when an issue-declared p90 SLA is missed.
- scoring-comments-and-logging (idempotency marker): replace the two
  hardcoded `2026-05-12` template dates with the `YYYY-MM-DD`
  placeholder the surrounding prose already calls for.
- scoring-comments-and-logging (Markdown capitalization): capitalize
  "Markdown" in two prose mentions.
- scoring-comments-and-logging (still-reproduces contract): make the
  per-verdict table on L174 canonical — strip the closing reporter
  @-mention, baseline transcript, and latest transcript from the
  template body so it actually fits the 30–80-word target. Scope the
  unanswered-question dual @-mention rule to fixed-on-latest and
  by-design only (still-reproduces has no closing @-mention to
  replace); the lead-paragraph half of the rule still applies to all
  three templates.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md (1)

172-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Document version-prefix normalization between extraction and validation.

The body/comment regex (line 173) captures only the digit portion (\d+\.\d+\.\d+) without the v prefix, but:

  • Labels include the v (line 172: ^v\d+\.\d+\.\d+$)
  • Tag list entries include the v (line 49: ^v[0-9]+)
  • Validation uses grep -Fxq "$V" against tags with v prefix (line 189)
  • Output requires full tag format with v (line 196: REPORTED_VERSION="v0.0.32")

Without an explicit normalization step to prepend v to body/comment captures before validation, the grep -Fxq check will fail for versions extracted from body/comments, incorrectly dropping valid candidates.

📝 Recommended addition

Add a normalization note after line 176:

**Normalize captured versions.** Body and comment captures lack the `v` prefix (the capture group excludes it), but labels and tags include it. Before validation, prepend `v` to any version that doesn't already start with it:

\```bash
# Normalize each candidate version
for V in "${CANDIDATE_VERSIONS[@]}"; do
  [[ "$V" =~ ^v ]] || V="v$V"
  # Now validate against tag list...
done
\```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
around lines 172 - 196, The extraction regex captures only digits (e.g.,
(\d+\.\d+\.\d+)) but validation uses tag names with a leading v and sets
REPORTED_VERSION to the full tag; add a normalization step before validation
that ensures every candidate in your CANDIDATE_VERSIONS list is converted to the
full tag form (prepend "v" when the string does not already start with "v") so
that grep -Fxq "$V" against the tag list and the final REPORTED_VERSION
assignment operate on matching tag strings; update the validation flow that
invokes grep -Fxq and the logic that selects the smallest surviving version to
use the normalized/tag-prefixed values.
🧹 Nitpick comments (1)
.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md (1)

114-119: ⚡ Quick win

Consider breaking the question-detection regex into multiple filters.

The single regex pattern on line 118 combines multiple question-detection heuristics in one long expression, making it harder to maintain or extend. Consider using multiple select() clauses or extracting the pattern to a shell variable with documentation.

♻️ Refactor option: Multiple select filters
 UNANSWERED_MAINT=$(gh issue view "$ISSUE_NUMBER" --repo NVIDIA/NemoClaw --json comments \
   --jq --arg reporter "$REPORTER" --arg cutoff "$SEVEN_DAYS_AGO" '
     (.comments
      | map(select((.authorAssociation == "MEMBER" or .authorAssociation == "OWNER" or .authorAssociation == "COLLABORATOR")
-         and (.body | test("\\?|(?i)\\bplease (confirm|share|provide|clarify|tell|verify|check|let me know|let us know)|(?i)\\b(could|can|would) you\\b|(?i)\\bdo you (have|know|see|use)\\b"))))
+         and (.body | test("\\?")
+                   or test("(?i)\\bplease (confirm|share|provide|clarify|tell|verify|check|let me know|let us know)")
+                   or test("(?i)\\b(could|can|would) you\\b")
+                   or test("(?i)\\bdo you (have|know|see|use)\\b"))))
      | sort_by(.createdAt) | last) as $maint
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
around lines 114 - 119, The combined question-detection regex inside the gh
issue view pipeline (used to set UNANSWERED_MAINT) is too long and hard to
maintain; refactor by splitting the single select(...) that contains the big
test(...) into multiple select() clauses (e.g., one for authorAssociation
membership and separate selects for question patterns like a simple "?" test,
polite-request keywords, modal verbs, and "do you ..." patterns) or move each
regex into a clearly named shell variable and reference those variables in
separate select() calls so the logic in the pipeline (the gh issue view ... --jq
expression that produces $maint) is easier to read and extend.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md:
- Around line 219-234: The measurements from /usr/bin/time are in seconds but
the SLA vars are milliseconds, so convert the computed p50/p90 to milliseconds
before comparing (or alternatively rename SLA vars to seconds); specifically
update the P50_MS and P90_MS calculations (and the corresponding baseline
variables) that read from latest-perf.log/baseline-perf.log so the awk/shell
math multiplies the second values by 1000 and formats as integers (e.g., compute
P50_MS = (mean*1000) and P90_MS = (value*1000)), update the echo suffix to "ms",
and ensure comparisons use SLA_P50_MS and SLA_P90_MS as millisecond integers.

---

Outside diff comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
- Around line 172-196: The extraction regex captures only digits (e.g.,
(\d+\.\d+\.\d+)) but validation uses tag names with a leading v and sets
REPORTED_VERSION to the full tag; add a normalization step before validation
that ensures every candidate in your CANDIDATE_VERSIONS list is converted to the
full tag form (prepend "v" when the string does not already start with "v") so
that grep -Fxq "$V" against the tag list and the final REPORTED_VERSION
assignment operate on matching tag strings; update the validation flow that
invokes grep -Fxq and the logic that selects the smallest surviving version to
use the normalized/tag-prefixed values.

---

Nitpick comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
- Around line 114-119: The combined question-detection regex inside the gh issue
view pipeline (used to set UNANSWERED_MAINT) is too long and hard to maintain;
refactor by splitting the single select(...) that contains the big test(...)
into multiple select() clauses (e.g., one for authorAssociation membership and
separate selects for question patterns like a simple "?" test, polite-request
keywords, modal verbs, and "do you ..." patterns) or move each regex into a
clearly named shell variable and reference those variables in separate select()
calls so the logic in the pipeline (the gh issue view ... --jq expression that
produces $maint) is easier to read and extend.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b4bcfcff-fe48-424e-9219-6822388d6445

📥 Commits

Reviewing files that changed from the base of the PR and between 4962da5 and 33a350c.

📒 Files selected for processing (6)
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md
✅ Files skipped from review due to trivial changes (4)
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
  • .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md

Comment thread .agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md Outdated
@cv

cv commented May 26, 2026

Copy link
Copy Markdown
Collaborator

@CodeRabbit let's look at these skills considering Anthropic's best practices:

Skill authoring best practices Good Skills are concise, well-structured, and tested with real usage. This guide provides practical authoring decisions to help you write Skills that Claude can discover and use effectively.

For conceptual background on how Skills work, see the Skills overview.

Core principles

Concise is key

The context window is a public good. Your Skill shares the context window with everything else Claude needs to know, including:

  • The system prompt
  • Conversation history
  • Other Skills' metadata
  • Your actual request

Not every token in your Skill has an immediate cost. At startup, only the metadata (name and description) from all Skills is pre-loaded. Claude reads SKILL.md only when the Skill becomes relevant, and reads additional files only as needed. However, being concise in SKILL.md still matters: once Claude loads it, every token competes with conversation history and other context.

Default assumption: Claude is already very smart

Only add context Claude doesn't already have. Challenge each piece of information:

  • "Does Claude really need this explanation?"
  • "Can I assume Claude knows this?"
  • "Does this paragraph justify its token cost?"

Good example: Concise (approximately 50 tokens):

## Extract PDF text

Use pdfplumber for text extraction:

```python
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

Bad example: Too verbose (approximately 150 tokens):

## Extract PDF text

PDF (Portable Document Format) files are a common file format that contains
text, images, and other content. To extract text from a PDF, you'll need to
use a library. There are many libraries available for PDF processing, but
pdfplumber is recommended because it's easy to use and handles most cases well.
First, you'll need to install it using pip. Then you can use the code below...

The concise version assumes Claude knows what PDFs are and how libraries work.

Set appropriate degrees of freedom

Match the level of specificity to the task's fragility and variability.

High freedom (text-based instructions):

Use when:

  • Multiple approaches are valid
  • Decisions depend on context
  • Heuristics guide the approach

Example:

## Code review process

1. Analyze the code structure and organization
2. Check for potential bugs or edge cases
3. Suggest improvements for readability and maintainability
4. Verify adherence to project conventions

Medium freedom (pseudocode or scripts with parameters):

Use when:

  • A preferred pattern exists
  • Some variation is acceptable
  • Configuration affects behavior

Example:

## Generate report

Use this template and customize as needed:

```python
def generate_report(data, format="markdown", include_charts=True):
    # Process data
    # Generate output in specified format
    # Optionally include visualizations
```

Low freedom (specific scripts, few or no parameters):

Use when:

  • Operations are fragile and error-prone
  • Consistency is critical
  • A specific sequence must be followed

Example:

## Database migration

Run exactly this script:

```bash
python scripts/migrate.py --verify --backup
```

Do not modify the command or add additional flags.

Analogy: Think of Claude as a robot exploring a path:

  • Narrow bridge with cliffs on both sides: There's only one safe way forward. Provide specific guardrails and exact instructions (low freedom). Example: database migrations that must run in exact sequence.
  • Open field with no hazards: Many paths lead to success. Give general direction and trust Claude to find the best route (high freedom). Example: code reviews where context determines the best approach.

Test with all models you plan to use

Skills act as additions to models, so effectiveness depends on the underlying model. Test your Skill with all the models you plan to use it with.

Testing considerations by model:

  • Claude Haiku (fast, economical): Does the Skill provide enough guidance?
  • Claude Sonnet (balanced): Is the Skill clear and efficient?
  • Claude Opus (powerful reasoning): Does the Skill avoid over-explaining?

What works perfectly for Opus might need more detail for Haiku. If you plan to use your Skill across multiple models, aim for instructions that work well with all of them.

Skill structure

**YAML Frontmatter:** The SKILL.md frontmatter requires two fields:

name:

  • Maximum 64 characters
  • Must contain only lowercase letters, numbers, and hyphens
  • Cannot contain XML tags
  • Cannot contain reserved words: "anthropic", "claude"

description:

  • Must be non-empty
  • Maximum 1024 characters
  • Cannot contain XML tags
  • Should describe what the Skill does and when to use it

For complete Skill structure details, see the Skills overview.

Naming conventions

Use consistent naming patterns to make Skills easier to reference and discuss. Consider using gerund form (verb + -ing) for Skill names, as this clearly describes the activity or capability the Skill provides.

Remember that the name field must use lowercase letters, numbers, and hyphens only.

Good naming examples (gerund form):

  • processing-pdfs
  • analyzing-spreadsheets
  • managing-databases
  • testing-code
  • writing-documentation

Acceptable alternatives:

  • Noun phrases: pdf-processing, spreadsheet-analysis
  • Action-oriented: process-pdfs, analyze-spreadsheets

Avoid:

  • Vague names: helper, utils, tools
  • Overly generic: documents, data, files
  • Reserved words: anthropic-helper, claude-tools
  • Inconsistent patterns within your skill collection

Consistent naming makes it easier to:

  • Reference Skills in documentation and conversations
  • Understand what a Skill does at a glance
  • Organize and search through multiple Skills
  • Maintain a professional, cohesive skill library

Writing effective descriptions

The description field enables Skill discovery and should include both what the Skill does and when to use it.

**Always write in third person**. The description is injected into the system prompt, and inconsistent point-of-view can cause discovery problems.
  • Good: "Processes Excel files and generates reports"
  • Avoid: "I can help you process Excel files"
  • Avoid: "You can use this to process Excel files"

Be specific and include key terms. Include both what the Skill does and specific triggers/contexts for when to use it.

Each Skill has exactly one description field. The description is critical for skill selection: Claude uses it to choose the right Skill from potentially 100+ available Skills. Your description must provide enough detail for Claude to know when to select this Skill, while the rest of SKILL.md provides the implementation details.

Effective examples:

PDF Processing skill:

description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

Excel Analysis skill:

description: Analyze Excel spreadsheets, create pivot tables, generate charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files.

Git Commit Helper skill:

description: Generate descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes.

Avoid vague descriptions like these:

description: Helps with documents
description: Processes data
description: Does stuff with files

Progressive disclosure patterns

SKILL.md serves as an overview that points Claude to detailed materials as needed, like a table of contents in an onboarding guide. For an explanation of how progressive disclosure works, see How Skills work in the overview.

Practical guidance:

  • Keep SKILL.md body under 500 lines for optimal performance
  • Split content into separate files when approaching this limit
  • Use the patterns below to organize instructions, code, and resources effectively

Visual overview: From simple to complex

A basic Skill starts with just a SKILL.md file containing metadata and instructions:

Simple SKILL.md file showing YAML frontmatter and markdown body

As your Skill grows, you can bundle additional content that Claude loads only when needed:

Bundling additional reference files like reference.md and forms.md.

The complete Skill directory structure might look like this:

pdf/
├── SKILL.md              # Main instructions (loaded when triggered)
├── FORMS.md              # Form-filling guide (loaded as needed)
├── reference.md          # API reference (loaded as needed)
├── examples.md           # Usage examples (loaded as needed)
└── scripts/
    ├── analyze_form.py   # Utility script (executed, not loaded)
    ├── fill_form.py      # Form filling script
    └── validate.py       # Validation script

Pattern 1: High-level guide with references

---
name: pdf-processing
description: Extracts text and tables from PDF files, fills forms, and merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Extract text with pdfplumber:
```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

## Advanced features

**Form filling**: See [FORMS.md](FORMS.md) for complete guide
**API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
**Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns

Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.

Pattern 2: Domain-specific organization

For Skills with multiple domains, organize content by domain to avoid loading irrelevant context. When a user asks about sales metrics, Claude only needs to read sales-related schemas, not finance or marketing data. This keeps token usage low and context focused.

bigquery-skill/
├── SKILL.md (overview and navigation)
└── reference/
    ├── finance.md (revenue, billing metrics)
    ├── sales.md (opportunities, pipeline)
    ├── product.md (API usage, features)
    └── marketing.md (campaigns, attribution)
# BigQuery Data Analysis

## Available datasets

**Finance**: Revenue, ARR, billing → See [reference/finance.md](reference/finance.md)
**Sales**: Opportunities, pipeline, accounts → See [reference/sales.md](reference/sales.md)
**Product**: API usage, features, adoption → See [reference/product.md](reference/product.md)
**Marketing**: Campaigns, attribution, email → See [reference/marketing.md](reference/marketing.md)

## Quick search

Find specific metrics using grep:

```bash
grep -i "revenue" reference/finance.md
grep -i "pipeline" reference/sales.md
grep -i "api usage" reference/product.md
```

Pattern 3: Conditional details

Show basic content, link to advanced content:

# DOCX Processing

## Creating documents

Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).

## Editing documents

For simple edits, modify the XML directly.

**For tracked changes**: See [REDLINING.md](REDLINING.md)
**For OOXML details**: See [OOXML.md](OOXML.md)

Claude reads REDLINING.md or OOXML.md only when the user needs those features.

Avoid deeply nested references

Claude may partially read files when they're referenced from other referenced files. When encountering nested references, Claude might use commands like head -100 to preview content rather than reading entire files, resulting in incomplete information.

Keep references one level deep from SKILL.md. All reference files should link directly from SKILL.md to ensure Claude reads complete files when needed.

Bad example: Too deep:

# SKILL.md
See [advanced.md](advanced.md)...

# advanced.md
See [details.md](details.md)...

# details.md
Here's the actual information...

Good example: One level deep:

# SKILL.md

**Basic usage**: [instructions in SKILL.md]
**Advanced features**: See [advanced.md](advanced.md)
**API reference**: See [reference.md](reference.md)
**Examples**: See [examples.md](examples.md)

Structure longer reference files with table of contents

For reference files longer than 100 lines, include a table of contents at the top. This ensures Claude can see the full scope of available information even when previewing with partial reads.

Example:

# API Reference

## Contents
- Authentication and setup
- Core methods (create, read, update, delete)
- Advanced features (batch operations, webhooks)
- Error handling patterns
- Code examples

## Authentication and setup
...

## Core methods
...

Claude can then read the complete file or jump to specific sections as needed.

For details on how this filesystem-based architecture enables progressive disclosure, see the Runtime environment section in the Advanced section below.

Workflows and feedback loops

Use workflows for complex tasks

Break complex operations into clear, sequential steps. For particularly complex workflows, provide a checklist that Claude can copy into its response and check off as it progresses.

Example 1: Research synthesis workflow (for Skills without code):

## Research synthesis workflow

Copy this checklist and track your progress:

```
Research Progress:
- [ ] Step 1: Read all source documents
- [ ] Step 2: Identify key themes
- [ ] Step 3: Cross-reference claims
- [ ] Step 4: Create structured summary
- [ ] Step 5: Verify citations
```

**Step 1: Read all source documents**

Review each document in the `sources/` directory. Note the main arguments and supporting evidence.

**Step 2: Identify key themes**

Look for patterns across sources. What themes appear repeatedly? Where do sources agree or disagree?

**Step 3: Cross-reference claims**

For each major claim, verify it appears in the source material. Note which source supports each point.

**Step 4: Create structured summary**

Organize findings by theme. Include:
- Main claim
- Supporting evidence from sources
- Conflicting viewpoints (if any)

**Step 5: Verify citations**

Check that every claim references the correct source document. If citations are incomplete, return to Step 3.

This example shows how workflows apply to analysis tasks that don't require code. The checklist pattern works for any complex, multi-step process.

Example 2: PDF form filling workflow (for Skills with code):

## PDF form filling workflow

Copy this checklist and check off items as you complete them:

```
Task Progress:
- [ ] Step 1: Analyze the form (run analyze_form.py)
- [ ] Step 2: Create field mapping (edit fields.json)
- [ ] Step 3: Validate mapping (run validate_fields.py)
- [ ] Step 4: Fill the form (run fill_form.py)
- [ ] Step 5: Verify output (run verify_output.py)
```

**Step 1: Analyze the form**

Run: `python scripts/analyze_form.py input.pdf`

This extracts form fields and their locations, saving to `fields.json`.

**Step 2: Create field mapping**

Edit `fields.json` to add values for each field.

**Step 3: Validate mapping**

Run: `python scripts/validate_fields.py fields.json`

Fix any validation errors before continuing.

**Step 4: Fill the form**

Run: `python scripts/fill_form.py input.pdf fields.json output.pdf`

**Step 5: Verify output**

Run: `python scripts/verify_output.py output.pdf`

If verification fails, return to Step 2.

Clear steps prevent Claude from skipping critical validation. The checklist helps both Claude and you track progress through multi-step workflows.

Implement feedback loops

Common pattern: Run validator → fix errors → repeat

This pattern greatly improves output quality.

Example 1: Style guide compliance (for Skills without code):

## Content review process

1. Draft your content following the guidelines in STYLE_GUIDE.md
2. Review against the checklist:
   - Check terminology consistency
   - Verify examples follow the standard format
   - Confirm all required sections are present
3. If issues found:
   - Note each issue with specific section reference
   - Revise the content
   - Review the checklist again
4. Only proceed when all requirements are met
5. Finalize and save the document

This shows the validation loop pattern using reference documents instead of scripts. The "validator" is STYLE_GUIDE.md, and Claude performs the check by reading and comparing.

Example 2: Document editing process (for Skills with code):

## Document editing process

1. Make your edits to `word/document.xml`
2. **Validate immediately**: `python ooxml/scripts/validate.py unpacked_dir/`
3. If validation fails:
   - Review the error message carefully
   - Fix the issues in the XML
   - Run validation again
4. **Only proceed when validation passes**
5. Rebuild: `python ooxml/scripts/pack.py unpacked_dir/ output.docx`
6. Test the output document

The validation loop catches errors early.

Content guidelines

Avoid time-sensitive information

Don't include information that will become outdated:

Bad example: Time-sensitive (will become wrong):

If you're doing this before August 2025, use the old API.
After August 2025, use the new API.

Good example (use "old patterns" section):

## Current method

Use the v2 API endpoint: `api.example.com/v2/messages`

## Old patterns

<details>
<summary>Legacy v1 API (deprecated 2025-08)</summary>

The v1 API used: `api.example.com/v1/messages`

This endpoint is no longer supported.
</details>

The old patterns section provides historical context without cluttering the main content.

Use consistent terminology

Choose one term and use it throughout the Skill:

Good - Consistent:

  • Always "API endpoint"
  • Always "field"
  • Always "extract"

Bad - Inconsistent:

  • Mix "API endpoint", "URL", "API route", "path"
  • Mix "field", "box", "element", "control"
  • Mix "extract", "pull", "get", "retrieve"

Consistency helps Claude understand and follow instructions.

Common patterns

Template pattern

Provide templates for output format. Match the level of strictness to your needs.

For strict requirements (like API responses or data formats):

## Report structure

ALWAYS use this exact template structure:

```markdown
# [Analysis Title]

## Executive summary
[One-paragraph overview of key findings]

## Key findings
- Finding 1 with supporting data
- Finding 2 with supporting data
- Finding 3 with supporting data

## Recommendations
1. Specific actionable recommendation
2. Specific actionable recommendation
```

For flexible guidance (when adaptation is useful):

## Report structure

Here is a sensible default format, but use your best judgment based on the analysis:

```markdown
# [Analysis Title]

## Executive summary
[Overview]

## Key findings
[Adapt sections based on what you discover]

## Recommendations
[Tailor to the specific context]
```

Adjust sections as needed for the specific analysis type.

Examples pattern

For Skills where output quality depends on seeing examples, provide input/output pairs just like in regular prompting:

## Commit message format

Generate commit messages following these examples:

**Example 1:**
Input: Added user authentication with JWT tokens
Output:
```
feat(auth): implement JWT-based authentication

Add login endpoint and token validation middleware
```

**Example 2:**
Input: Fixed bug where dates displayed incorrectly in reports
Output:
```
fix(reports): correct date formatting in timezone conversion

Use UTC timestamps consistently across report generation
```

**Example 3:**
Input: Updated dependencies and refactored error handling
Output:
```
chore: update dependencies and refactor error handling

- Upgrade lodash to 4.17.21
- Standardize error response format across endpoints
```

Follow this style: type(scope): brief description, then detailed explanation.

Examples help Claude understand the desired style and level of detail more clearly than descriptions alone.

Conditional workflow pattern

Guide Claude through decision points:

## Document modification workflow

1. Determine the modification type:

   **Creating new content?** → Follow "Creation workflow" below
   **Editing existing content?** → Follow "Editing workflow" below

2. Creation workflow:
   - Use docx-js library
   - Build document from scratch
   - Export to .docx format

3. Editing workflow:
   - Unpack existing document
   - Modify XML directly
   - Validate after each change
   - Repack when complete
If workflows become large or complicated with many steps, consider pushing them into separate files and tell Claude to read the appropriate file based on the task at hand.

Evaluation and iteration

Build evaluations first

Create evaluations BEFORE writing extensive documentation. This ensures your Skill solves real problems rather than documenting imagined ones.

Evaluation-driven development:

  1. Identify gaps: Run Claude on representative tasks without a Skill. Document specific failures or missing context
  2. Create evaluations: Build three scenarios that test these gaps
  3. Establish baseline: Measure Claude's performance without the Skill
  4. Write minimal instructions: Create just enough content to address the gaps and pass evaluations
  5. Iterate: Execute evaluations, compare against baseline, and refine

This approach ensures you're solving actual problems rather than anticipating requirements that may never materialize.

Evaluation structure:

{
  "skills": ["pdf-processing"],
  "query": "Extract all text from this PDF file and save it to output.txt",
  "files": ["test-files/document.pdf"],
  "expected_behavior": [
    "Successfully reads the PDF file using an appropriate PDF processing library or command-line tool",
    "Extracts text content from all pages in the document without missing any pages",
    "Saves the extracted text to a file named output.txt in a clear, readable format"
  ]
}
This example demonstrates a data-driven evaluation with a simple testing rubric. There is not currently a built-in way to run these evaluations. Users can create their own evaluation system. Evaluations are your source of truth for measuring Skill effectiveness.

Develop Skills iteratively with Claude

The most effective Skill development process involves Claude itself. Work with one instance of Claude ("Claude A") to create a Skill that is used by other instances ("Claude B"). Claude A helps you design and refine instructions, while Claude B tests them in real tasks. This works because Claude models understand both how to write effective agent instructions and what information agents need.

Creating a new Skill:

  1. Complete a task without a Skill: Work through a problem with Claude A using normal prompting. As you work, you'll naturally provide context, explain preferences, and share procedural knowledge. Notice what information you repeatedly provide.

  2. Identify the reusable pattern: After completing the task, identify what context you provided that would be useful for similar future tasks.

    Example: If you worked through a BigQuery analysis, you might have provided table names, field definitions, filtering rules (like "always exclude test accounts"), and common query patterns.

  3. Ask Claude A to create a Skill: "Create a Skill that captures this BigQuery analysis pattern we just used. Include the table schemas, naming conventions, and the rule about filtering test accounts."

    Claude models understand the Skill format and structure natively. You don't need special system prompts or a "writing skills" skill to get Claude to help create Skills. Simply ask Claude to create a Skill and it generates properly structured SKILL.md content with appropriate frontmatter and body content.
  4. Review for conciseness: Check that Claude A hasn't added unnecessary explanations. Ask: "Remove the explanation about what win rate means - Claude already knows that."

  5. Improve information architecture: Ask Claude A to organize the content more effectively. For example: "Organize this so the table schema is in a separate reference file. We might add more tables later."

  6. Test on similar tasks: Use the Skill with Claude B (a fresh instance with the Skill loaded) on related use cases. Observe whether Claude B finds the right information, applies rules correctly, and handles the task successfully.

  7. Iterate based on observation: If Claude B struggles or misses something, return to Claude A with specifics: "When Claude used this Skill, it forgot to filter by date for Q4. Should we add a section about date filtering patterns?"

Iterating on existing Skills:

The same hierarchical pattern continues when improving Skills. You alternate between:

  • Working with Claude A (the expert who helps refine the Skill)
  • Testing with Claude B (the agent using the Skill to perform real work)
  • Observing Claude B's behavior and bringing insights back to Claude A
  1. Use the Skill in real workflows: Give Claude B (with the Skill loaded) actual tasks, not test scenarios

  2. Observe Claude B's behavior: Note where it struggles, succeeds, or makes unexpected choices

    Example observation: "When I asked Claude B for a regional sales report, it wrote the query but forgot to filter out test accounts, even though the Skill mentions this rule."

  3. Return to Claude A for improvements: Share the current SKILL.md and describe what you observed. Ask: "I noticed Claude B forgot to filter test accounts when I asked for a regional report. The Skill mentions filtering, but maybe it's not prominent enough?"

  4. Review Claude A's suggestions: Claude A might suggest reorganizing to make rules more prominent, using stronger language like "MUST filter" instead of "always filter", or restructuring the workflow section.

  5. Apply and test changes: Update the Skill with Claude A's refinements, then test again with Claude B on similar requests

  6. Repeat based on usage: Continue this observe-refine-test cycle as you encounter new scenarios. Each iteration improves the Skill based on real agent behavior, not assumptions.

Gathering team feedback:

  1. Share Skills with teammates and observe their usage
  2. Ask: Does the Skill activate when expected? Are instructions clear? What's missing?
  3. Incorporate feedback to address blind spots in your own usage patterns

Why this approach works: Claude A understands agent needs, you provide domain expertise, Claude B reveals gaps through real usage, and iterative refinement improves Skills based on observed behavior rather than assumptions.

Observe how Claude navigates Skills

As you iterate on Skills, pay attention to how Claude actually uses them in practice. Watch for:

  • Unexpected exploration paths: Does Claude read files in an order you didn't anticipate? This might indicate your structure isn't as intuitive as you thought
  • Missed connections: Does Claude fail to follow references to important files? Your links might need to be more explicit or prominent
  • Overreliance on certain sections: If Claude repeatedly reads the same file, consider whether that content should be in the main SKILL.md instead
  • Ignored content: If Claude never accesses a bundled file, it might be unnecessary or poorly signaled in the main instructions

Iterate based on these observations rather than assumptions. The 'name' and 'description' in your Skill's metadata are particularly critical. Claude uses these when deciding whether to trigger the Skill in response to the current task. Make sure they clearly describe what the Skill does and when it should be used.

Anti-patterns to avoid

Avoid Windows-style paths

Always use forward slashes in file paths, even on Windows:

  • Good: scripts/helper.py, reference/guide.md
  • Avoid: scripts\helper.py, reference\guide.md

Unix-style paths work across all platforms, while Windows-style paths cause errors on Unix systems.

Avoid offering too many options

Don't present multiple approaches unless necessary:

**Bad example: Too many choices** (confusing):
"You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image, or..."

**Good example: Provide a default** (with escape hatch):
"Use pdfplumber for text extraction:
```python
import pdfplumber
```

For scanned PDFs requiring OCR, use pdf2image with pytesseract instead."

Advanced: Skills with executable code

The sections below focus on Skills that include executable scripts. If your Skill uses only markdown instructions, skip to Checklist for effective Skills.

Solve, don't punt

When writing scripts for Skills, handle error conditions rather than punting to Claude.

Good example: Handle errors explicitly:

def process_file(path):
    """Process a file, creating it if it doesn't exist."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # Create file with default content instead of failing
        print(f"File {path} not found, creating default")
        with open(path, "w") as f:
            f.write("")
        return ""
    except PermissionError:
        # Provide alternative instead of failing
        print(f"Cannot access {path}, using default")
        return ""

Bad example: Punt to Claude:

def process_file(path):
    # Just fail and let Claude figure it out
    return open(path).read()

Configuration parameters should also be justified and documented to avoid "voodoo constants" (Ousterhout's law). If you don't know the right value, how will Claude determine it?

Good example: Self-documenting:

# HTTP requests typically complete within 30 seconds
# Longer timeout accounts for slow connections
REQUEST_TIMEOUT = 30

# Three retries balances reliability vs speed
# Most intermittent failures resolve by the second retry
MAX_RETRIES = 3

Bad example: Magic numbers:

TIMEOUT = 47  # Why 47?
RETRIES = 5  # Why 5?

Provide utility scripts

Even if Claude could write a script, pre-made scripts offer advantages:

Benefits of utility scripts:

  • More reliable than generated code
  • Save tokens (no need to include code in context)
  • Save time (no code generation required)
  • Ensure consistency across uses

Bundling executable scripts alongside instruction files

The diagram above shows how executable scripts work alongside instruction files. The instruction file (forms.md) references the script, and Claude can execute it without loading its contents into context.

Important distinction: Make clear in your instructions whether Claude should:

  • Execute the script (most common): "Run analyze_form.py to extract fields"
  • Read it as reference (for complex logic): "See analyze_form.py for the field extraction algorithm"

For most utility scripts, execution is preferred because it's more reliable and efficient. See the Runtime environment section below for details on how script execution works.

Example:

## Utility scripts

**analyze_form.py**: Extract all form fields from PDF

```bash
python scripts/analyze_form.py input.pdf > fields.json
```

Output format:
```json
{
  "field_name": {"type": "text", "x": 100, "y": 200},
  "signature": {"type": "sig", "x": 150, "y": 500}
}
```

**validate_boxes.py**: Check for overlapping bounding boxes

```bash
python scripts/validate_boxes.py fields.json
# Returns: "OK" or lists conflicts
```

**fill_form.py**: Apply field values to PDF

```bash
python scripts/fill_form.py input.pdf fields.json output.pdf
```

Use visual analysis

When inputs can be rendered as images, have Claude analyze them:

## Form layout analysis

1. Convert PDF to images:
   ```bash
   python scripts/pdf_to_images.py form.pdf
   ```

2. Analyze each page image to identify form fields
3. Claude can see field locations and types visually
In this example, you'd need to write the `pdf_to_images.py` script.

Claude's vision capabilities help understand layouts and structures.

Create verifiable intermediate outputs

When Claude performs complex, open-ended tasks, it can make mistakes. The "plan-validate-execute" pattern catches errors early by having Claude first create a plan in a structured format, then validate that plan with a script before executing it.

Example: Imagine asking Claude to update 50 form fields in a PDF based on a spreadsheet. Without validation, Claude might reference non-existent fields, create conflicting values, miss required fields, or apply updates incorrectly.

Solution: Use the workflow pattern shown above (PDF form filling), but add an intermediate changes.json file that gets validated before applying changes. The workflow becomes: analyze → create plan filevalidate plan → execute → verify.

Why this pattern works:

  • Catches errors early: Validation finds problems before changes are applied
  • Machine-verifiable: Scripts provide objective verification
  • Reversible planning: Claude can iterate on the plan without touching originals
  • Clear debugging: Error messages point to specific problems

When to use: Batch operations, destructive changes, complex validation rules, high-stakes operations.

Implementation tip: Make validation scripts verbose with specific error messages like "Field 'signature_date' not found. Available fields: customer_name, order_total, signature_date_signed" to help Claude fix issues.

Package dependencies

Skills run in the code execution environment with platform-specific limitations:

  • claude.ai: Can install packages from npm and PyPI and pull from GitHub repositories
  • Claude API: Has no network access and no runtime package installation

List required packages in your SKILL.md and verify they're available in the code execution tool documentation.

Runtime environment

Skills run in a code execution environment with filesystem access, bash commands, and code execution capabilities. For the conceptual explanation of this architecture, see The Skills architecture in the overview.

How this affects your authoring:

How Claude accesses Skills:

  1. Metadata pre-loaded: At startup, the name and description from all Skills' YAML frontmatter are loaded into the system prompt
  2. Files read on-demand: Claude uses bash Read tools to access SKILL.md and other files from the filesystem when needed
  3. Scripts executed efficiently: Utility scripts can be executed via bash without loading their full contents into context. Only the script's output consumes tokens
  4. No context penalty for large files: Reference files, data, or documentation don't consume context tokens until actually read
  • File paths matter: Claude navigates your skill directory like a filesystem. Use forward slashes (reference/guide.md), not backslashes
  • Name files descriptively: Use names that indicate content: form_validation_rules.md, not doc2.md
  • Organize for discovery: Structure directories by domain or feature
    • Good: reference/finance.md, reference/sales.md
    • Bad: docs/file1.md, docs/file2.md
  • Bundle comprehensive resources: Include complete API docs, extensive examples, large datasets; no context penalty until accessed
  • Prefer scripts for deterministic operations: Write validate_form.py rather than asking Claude to generate validation code
  • Make execution intent clear:
    • "Run analyze_form.py to extract fields" (execute)
    • "See analyze_form.py for the extraction algorithm" (read as reference)
  • Test file access patterns: Verify Claude can navigate your directory structure by testing with real requests

Example:

bigquery-skill/
├── SKILL.md (overview, points to reference files)
└── reference/
    ├── finance.md (revenue metrics)
    ├── sales.md (pipeline data)
    └── product.md (usage analytics)

When the user asks about revenue, Claude reads SKILL.md, sees the reference to reference/finance.md, and invokes bash to read just that file. The sales.md and product.md files remain on the filesystem, consuming zero context tokens until needed. This filesystem-based model is what enables progressive disclosure. Claude can navigate and selectively load exactly what each task requires.

For complete details on the technical architecture, see How Skills work in the Skills overview.

MCP tool references

If your Skill uses MCP (Model Context Protocol) tools, always use fully qualified tool names to avoid "tool not found" errors.

Format: ServerName:tool_name

Example:

Use the BigQuery:bigquery_schema tool to retrieve table schemas.
Use the GitHub:create_issue tool to create issues.

Where:

  • BigQuery and GitHub are MCP server names
  • bigquery_schema and create_issue are the tool names within those servers

Without the server prefix, Claude may fail to locate the tool, especially when multiple MCP servers are available.

Avoid assuming tools are installed

Don't assume packages are available:

**Bad example: Assumes installation**:
"Use the pdf library to process the file."

**Good example: Explicit about dependencies**:
"Install required package: `pip install pypdf`

Then use it:
```python
from pypdf import PdfReader
reader = PdfReader("file.pdf")
```"

Technical notes

YAML frontmatter requirements

The SKILL.md frontmatter requires name and description fields with specific validation rules:

  • name: Maximum 64 characters, lowercase letters/numbers/hyphens only, no XML tags, no reserved words
  • description: Maximum 1024 characters, non-empty, no XML tags

See the Skills overview for complete structure details.

Token budgets

Keep SKILL.md body under 500 lines for optimal performance. If your content exceeds this, split it into separate files using the progressive disclosure patterns described earlier. For architectural details, see the Skills overview.

Checklist for effective Skills

Before sharing a Skill, verify:

Core quality

  • Description is specific and includes key terms
  • Description includes both what the Skill does and when to use it
  • SKILL.md body is under 500 lines
  • Additional details are in separate files (if needed)
  • No time-sensitive information (or in "old patterns" section)
  • Consistent terminology throughout
  • Examples are concrete, not abstract
  • File references are one level deep
  • Progressive disclosure used appropriately
  • Workflows have clear steps

Code and scripts

  • Scripts solve problems rather than punt to Claude
  • Error handling is explicit and helpful
  • No "voodoo constants" (all values justified)
  • Required packages listed in instructions and verified as available
  • Scripts have clear documentation
  • No Windows-style paths (all forward slashes)
  • Validation/verification steps for critical operations
  • Feedback loops included for quality-critical tasks

Testing

  • At least three evaluations created
  • Tested with Haiku, Sonnet, and Opus
  • Tested with real usage scenarios
  • Team feedback incorporated (if applicable)

@coderabbitai

coderabbitai Bot commented May 26, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== SKILL.md ==="
cat .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
echo ""
echo "=== Line counts for all reference files ==="
wc -l .agents/skills/nemoclaw-maintainer-verify-stale/reference/*.md
echo ""
echo "=== scoring-comments-and-logging.md (first 50 lines) ==="
head -50 .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md
echo ""
echo "=== candidate-selection.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
echo ""
echo "=== brev-provisioning.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
echo ""
echo "=== by-design.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
echo ""
echo "=== environment-and-reproducer.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
echo ""
echo "=== reproduction-rubrics.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md
echo ""
echo "=== nemoclaw-skills-guide SKILL.md ==="
cat .agents/skills/nemoclaw-skills-guide/SKILL.md | head -40
echo ""
echo "=== nemoclaw-maintainer-cut-release-tag SKILL.md (tail, new Step 7) ==="
tail -100 .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md

Length of output: 26997



@cv — here's a structured gap analysis against the Anthropic best practices you shared.


✅ What's already aligned

Practice Status
SKILL.md body under 500 lines ✅ ~61 lines
Third-person, specific description with triggers ✅ Solid; covers invocation verbs and label names
Progressive disclosure — one level deep ✅ All 6 reference docs are linked directly from SKILL.md
TOC on every reference file > 100 lines ✅ Every reference file has a ## Contents block
Progress checklist pattern ✅ Copy-paste checklist in SKILL.md
Low-freedom instructions for fragile ops ✅ Specific bash scripts for Brev provisioning, gh commands; natural language for classification decisions
Feedback loop / self-verification pass by-design.md Step 8.5d forces re-running every evidence command before posting
No deeply nested references ✅ SKILL.md → reference/*.md, nothing deeper
Naming consistent within the collection ✅ Matches nemoclaw-maintainer-* pattern throughout

⚠️ Gaps and recommendations

1. scoring-comments-and-logging.md is 479 lines — split candidate

At 479 lines this is the only reference file approaching unwieldy territory. The doc covers four distinct concerns: scoring (Step 9), comment/label/project (Step 10), infra failures (Step 11), and logging/cadence (Step 12). Consider splitting into two files to keep each focused and reduce partial-read risk:

reference/
  scoring-and-comment.md   # Steps 9–11 (scoring, redaction, templates, race check, project move, infra)
  logging-and-cadence.md   # Step 12 + Cadence + Out of scope + Companion behavior

Update the SKILL.md reference map to point to both, which remains one level deep.

2. Time-sensitive operational fact hardcoded in candidate-selection.md

Line referencing the candidate triage run:

"the candidate triage run found 129 open bugs; an earlier --limit 100 would have missed 29"

This is the "bad example" from the best practices — a concrete count that will become stale. The rationale (don't use --limit 100 because the real pool exceeds that) is worth keeping; the specific count is not. Suggested rewrite:

Use `--limit 1000` so the skill doesn't silently drop issues beyond the page size.
The open-bug pool has historically exceeded 100 issues; a lower limit silently
truncates the candidate pool before the per-run processing cap is applied.

3. No structured evaluations

The best practices say to build evaluations before writing extensive documentation. The PR documents e2e runs against 9 candidates but there are no machine-readable eval fixtures. Even a lightweight evals/ directory with three scenario JSON files would satisfy the checklist:

// evals/01-single-issue-fixed-on-latest.json
{
  "skills": ["nemoclaw-maintainer-verify-stale"],
  "query": "Verify issue `#2791`",
  "expected_behavior": [
    "Reads candidate-selection.md to establish latest tag",
    "Detects by-design signal in comments before provisioning Brev",
    "Posts a wont-fix comment with commit SHA evidence",
    "Applies status: wont-fix label, does not auto-close"
  ]
}

Suggested minimum coverage: one fixed-on-latest, one still-reproduces, one by-design / wont-fix scenario.

4. Magic number: 14-day sweep threshold in cut-release-tag Step 7

The bash script uses [ "$AGE_DAYS" -ge 14 ] with no adjacent comment explaining why 14 days. Per the "no voodoo constants" principle:

# 14-day window: matches the 7-day idempotency TTL (candidate-selection.md Step 3)
# plus a 7-day reporter confirmation window.  Chosen so a freshly-applied label
# from the previous week's batch run is never swept on the next release cut.
AGE_DAYS_THRESHOLD=14

5. Naming: gerund form not used (low priority)

The best practices prefer gerund form (verifying-stale-issues) but explicitly list action-oriented names as acceptable, and note that consistency within the collection matters most. Since all eight nemoclaw-maintainer-* skills follow the <noun>-<verb>-<object> pattern, no change is needed — just flagging it as an intentional deviation that's justified by collection consistency.

6. Minor: reproducer vs repro terminology drift

SKILL.md and most reference files use reproducer consistently, but reproduction-rubrics.md uses repro as shorthand in a few places (e.g., "synth-repro"). Consider standardizing to reproducer throughout, or explicitly defining repro as the shorthand on first use in each file.


Summary checklist delta

Item Current Recommended action
SKILL.md < 500 lines
TOC on all large reference files
One-level-deep references
No time-sensitive data Remove "129 open bugs" count from candidate-selection.md
At least 3 evaluations Add evals/ directory with 3 scenario files
No voodoo constants ⚠️ Document 14-day rationale in cut-release-tag Step 7
Consistent terminology ⚠️ Standardize reproducer vs repro
File size manageable ⚠️ Consider splitting scoring-comments-and-logging.md

CodeRabbit follow-up on PR #3327: the prior fix computed P50_MS / P90_MS
straight from `/usr/bin/time -f '%e'` output, which is seconds, while
the match rubric compares against SLA_P50_MS / SLA_P90_MS parsed from
the issue body as milliseconds. A 1.5-second p50 would have falsely
beaten a 200 ms SLA. Multiply by 1000 in the awk, format as integer ms,
and update the echo suffix accordingly.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…on regex

Two CodeRabbit follow-ups on PR #3327's candidate-selection.md:

- Step 4 tag validation now normalizes each candidate to the full tag
  form (prepend `v` when absent) before `grep -Fxq` against the tag
  list. The body/comment capture group is digits-only by design, but
  tags carry the `v` — without the prepend, every body-sourced
  candidate was being dropped. Label-sourced candidates already carry
  the `v`, so the prepend is idempotent.
- Split the long alternation regex behind the unanswered-maintainer
  question detector into four named test() clauses (literal "?",
  polite imperative, modal interrogative, "do you ..."). Future
  heuristics drop in as a single test() append rather than a regex
  patch.

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@cv cv added v0.0.51 Release target v0.0.52 Release target and removed v0.0.51 Release target labels May 26, 2026
@cv cv added v0.0.53 Release target and removed v0.0.52 Release target labels May 27, 2026
@cv cv disabled auto-merge May 27, 2026 03:18
@cv cv merged commit b10e785 into main May 27, 2026
16 checks passed
@prekshivyas prekshivyas deleted the feat/verify-stale-skill-v2 branch May 27, 2026 03:19
cv pushed a commit that referenced this pull request May 27, 2026
## Summary
- Add the v0.0.53 release notes with the user-facing onboarding,
inference, policy, runtime, Hermes, and maintainer-tooling changes from
the release range.
- Refresh generated `nemoclaw-user-*` skills from the current Fern docs,
including already-merged policy, inference, troubleshooting, and
command-reference updates.
- Remove skipped experimental shield wording from generated-doc source
so the release-prep skip-term gate stays clean.

## Source summary
- #4197 -> `docs/about/release-notes.mdx`,
`docs/reference/commands.mdx`: Document pre-recreate workspace backup,
abort-on-partial-backup behavior, and
`NEMOCLAW_RECREATE_WITHOUT_BACKUP`.
- #4273 -> `docs/about/release-notes.mdx`,
`docs/reference/troubleshooting.mdx`: Document the under-provisioned
runtime prompt defaulting to abort in interactive onboarding.
- #4220 -> `docs/about/release-notes.mdx`,
`docs/network-policy/customize-network-policy.mdx`,
`docs/network-policy/integration-policy-examples.mdx`: Include the
`openclaw-pricing` preset and generated skill refresh.
- #4253 -> `docs/about/release-notes.mdx`,
`docs/inference/use-local-inference.mdx`,
`docs/inference/switch-inference-providers.mdx`: Carry the Ollama
runtime context-window docs into generated skills.
- #4298 -> `docs/about/release-notes.mdx`,
`docs/reference/troubleshooting.mdx`: Carry WSL Docker Desktop GPU
guidance into generated skills and release notes.
- #4297, #4210, #4221, #4225, #4288, #4306, #4311, #4319, #4342, #4284,
#3327 -> `docs/about/release-notes.mdx`: Summarize release-range fixes
and maintainer tooling changes that did not need new standalone docs
pages.

## Verification
- `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix
nemoclaw-user --doc-platform fern-mdx`
- `rg "permissive mode|shields down|shields up|shields status|config
rotate-token|rotate-token" docs .agents/skills` returned no matches
outside `docs/.docs-skip`.
- `npm run docs` passes with full network access. Fern reports 0 errors
and one existing light-mode accent contrast warning.
- `FERN_VERSION=$(node -p "require('./fern/fern.config.json').version")
&& (cd fern && npx --yes "fern-api@${FERN_VERSION}" check --warnings)`
reports 0 errors and the same contrast warning.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Added v0.0.53 release notes with updates to onboarding, sandbox
recreation, and gateway handling
* Introduced `openclaw-pricing` preset for model pricing endpoint
management
* Clarified Ollama context window configuration and local model
validation behavior
* Updated sandbox recreation workflow documentation with backup/restore
details
* Enhanced interactive onboarding defaults for under-provisioned runtime
warnings
* Revised security guidance for configuration directory permissions and
immutability verification

<!-- review_stack_entry_start -->

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4360?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

<!-- review_stack_entry_end -->

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
prekshivyas added a commit that referenced this pull request Jun 1, 2026
Resolves conflicts in four files after #3327 (verify-stale skill) squash-
merged onto main while the dogfood branch carried the unsquashed history
plus three blueprint commits on top.

Resolution:
- .agents/skills/nemoclaw-skills-guide/SKILL.md: take main (skill counts
  evolved on main from 8 to 12 maintainer skills, 9 to 10 user skills, etc.).
- .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md: take HEAD
  (preserves the Step 3 self-check wording, the self-check.md reference-map
  row, and the Automation environment variables table — all added by the
  dogfood feat commit).
- .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
  take HEAD (preserves VERIFY_STALE_FORCE_OLLAMA_ONLY provider-credential skip,
  the two gh --jq --arg pipe rewrites that fix the never-fired idempotency
  and unanswered-question gates, and ${VERIFY_STALE_BATCH_CAP:-15} prose).
- .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md:
  take HEAD (preserves VERIFY_STALE_DRY_RUN closed-mid-run branch and the
  dry-run short-circuit before live comment-post + label-apply).

Outside the dogfood-touched regions the verify-stale skill files are
identical between branches (git diff main..pr-4279 -- on the other four
reference files returns empty).

Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
@wscurran wscurran added bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality area: skills Skills, agent behaviors, prompts, or skill packaging and removed fix bug-fix PR fixes a bug or regression labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: skills Skills, agent behaviors, prompts, or skill packaging feature PR adds or expands user-visible functionality v0.0.53 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants