feat(skills): add nemoclaw-maintainer-verify-stale#3327
Conversation
Adds a maintainer skill that automates verifying whether old bug reports still reproduce against the latest NemoClaw release. The skill picks candidate issues opened against older versions, reuses or provisions a Brev Linux box (CPU or GPU based on the bug profile), runs the extracted reproducer, scores confidence, and posts an evidence-backed comment with either a `fixed-on-latest` or `verify-inconclusive` label. Tag-only — never auto-closes. v1 scope is intentionally narrow: Linux only, no third-party integration credentials, no service-account bot. Each maintainer runs it under their own gh credentials. Companion changes: - Adds the new skill to the maintainer table in `nemoclaw-skills-guide` (count 7 -> 8 maintainer skills, 18 -> 19 total). - Adds Step 7 to `nemoclaw-maintainer-cut-release-tag` that sweeps `fixed-on-latest` and `verify-inconclusive` labels off all open issues at release time, so the next verify-stale run re-evaluates against the new latest. Verification records remain in comment history; only the labels are reset. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Three flaws surfaced when dry-running the Step 3 candidate filter against the live open-issue list (141 open bugs). 1. Step 2 used `gh release view` to detect "latest". NemoClaw tags but does not publish GitHub releases, so this returned empty. Added a fallback that picks the highest semver from `git ls-remote --tags`, which is the load-bearing path today. 2. Step 3 said "2 or more minor versions behind". Wrong vocabulary for `0.0.x` repos — patch is what's iterating. Reworded to "at least 2 versions behind in the rightmost-incrementing component", with a concrete `0.0.x` example and a forward-compat note for when NemoClaw moves to `0.1.x`. 3. Step 4 specified a naive `v?0\.0\.\d+` regex that matched IP-like strings in bug bodies (Ollama bind `0.0.0.0:11434`, loopback `127.0.0.1`, etc.) and produced phantom v0.0.0 / v0.0.1 candidates. Tightened to require `v` prefix + word boundaries first, with a context-anchored fallback that only matches `0.0.X` on lines containing `nemoclaw` or `version`. Added a clamp that rejects any parsed version greater than `$LATEST` (catches roadmap labels like `v0.0.35` from being treated as "reported on"). After these fixes the dry-run produces 33 plausible candidates out of 141 open bugs, with credible reported-version assignments verified against issue bodies. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two more issues surfaced from a deeper second-pass dry-run. 1. The Step 4 regex was hardcoded to v0.0.x. NemoClaw is on that line today, but the skill should keep working when it moves to v0.1.x or higher. Generalized to `\bv\d+\.\d+\.\d+\b` with a context-anchored `\b\d+\.\d+\.\d+\b` fallback on lines containing nemoclaw or version (so the loose form doesn't match `0.0.0.0:11434` and `127.0.0.1`). 2. Some bug reports cite NemoClaw versions that were never tagged (e.g. `v0.1.0` × 3 issues, calver `2026.3.11` × 1). Without a tag existence check the skill would point Brev verification at a version that does not exist. Added a `git ls-remote --tags` validation pass that drops the version if no matching tag is found. Also added an implementer note documenting a real failure mode caught during the dry-run: a naive `[scan(primary)] | first | .[0] | tonumber // [scan(fallback)] | ...` pipeline silently dropped 9 valid candidates because `null | first` errors when scan returns empty, and the `//` chain did not propagate cleanly to the fallback. The SKILL.md regex itself was correct — the failure was implementation-level fragility in how the two passes were composed. The note tells implementers to bind each pass to a named variable, coalesce at the end, and explicitly test the empty-match path. After these changes the live dry-run produces 47 plausible candidates (out of 141 open bugs) and the 4 residual drops are all reporter typos that cite a version that was never released — correctly unverifiable. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Third dry-run round surfaced a precision issue. The previous regex
matched any `\bv\d+\.\d+\.\d+\b` anywhere in the body, then validated
against the tag list. That picked up versions belonging to *other*
products that share the v0.0.x line — most often OpenShell, but also
Node.js (v22.x.x), and other dependencies. Tag validation then
correctly rejected the non-NemoClaw ones, but on bodies where multiple
products were listed (e.g. `Environment: openshell 0.0.4, nemoclaw
0.1.0, Node.js v22.16.0`) the parser would land on OpenShell's
tag-valid v0.0.4 and report it as the NemoClaw version, producing a
false-positive candidate. 12 such collisions exist in the current
backlog.
Fix: require the version to follow `nemoclaw` (case-insensitive) within
80 non-letter, non-newline characters. The anchored regex
`(?i)nemoclaw[^a-z\n]{0,80}v?(\d+\.\d+\.\d+)` matches `NemoClaw
v0.0.32`, `nemoclaw 0.0.28`, `- NemoClaw: v0.0.16`, but not `openshell
0.0.4` followed later by `nemoclaw 0.1.0`. Combined with the existing
tag-list validation, this collapses the four error classes — typos,
calver dates, roadmap labels, and product collisions — into one
"version must be a real NemoClaw release tag, mentioned next to
NemoClaw" check.
Also expanded the implementer note to cover three concrete failure
modes hit during the dry-run:
- Empty-match handling (`[]` flowing into `first | .[0]` errors).
- Capture-group inconsistency between branches in a parser pipeline.
- Variable-scoping bug in `select($tags | index(.))` where `.` rebinds
to `$tags` and silently passes invalid labels.
After this change the live dry-run produces 33 candidates out of 141
open bugs — same count as the very first round, but every candidate's
parsed version is now anchored to NemoClaw and validated against the
real tag list. The earlier wider counts (47, 48) included roughly 12-15
issues that would have been wasted Brev runs.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…test A clean run on latest is ambiguous on its own: - Scenario A: bug really got fixed (script worked, latest passes) - Scenario B: script was junk all along (script never triggered the bug; latest passes for the same reason baseline would) These look identical from the latest pass alone, which means the previous flow could land `fixed-on-latest` on a script that never had a chance — a false positive that the +25 commit and +25 PR signals only partially mitigate. This change adds a baseline pass before the latest pass: - Step 8a: install reported version on the box. - Step 8b: run reproducer, compare output to issue's actual-result description. Match = script validated. - Step 8c: if no match (script silent OR errored on the wrong thing), synth-repro using the issue body PLUS the baseline transcript so the LLM can react to the actual failure mode. Apply -30. Retry. Still no match -> verify-inconclusive, skip latest entirely. - Step 8d: install latest, run validated reproducer. Score from this. A single trigger drives synthesis: "the script didn't expose the bug on baseline." Whether it was silent or noisy doesn't matter; both mean the script needs work. This collapses the previous separate "narrative-only -> synth" and "verbatim-failed -> ???" paths into one rule. Step 6 simplified accordingly: just extract verbatim if available, otherwise carry the issue body forward to Step 8b for on-demand synthesis. No more "give up before provisioning" branch — that decision moves into Step 8c where it has more context to work with. Step 9 adds a baseline-validation gating rule. If the reported- version install fails (old releases rot — installer URLs, deps, OS images drift), the score is capped at 84 unless commit-area or PR- mention evidence also fires. That forces the @-mention-reporter band when we lack independent corroboration, which is the honest position when we couldn't validate the script ourselves. Step 11 split into two failure types: - Latest-install fail or harness error -> hard infra failure (no label, optional comment, move on). - Baseline-install fail -> degraded mode (skip baseline gate, run latest anyway, apply Step 9 cap). Wallclock budget bumped from 15 to 25 minutes per verification to accommodate two installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Tier-1 (blocking for E2E):
- Add the still-reproduces-on-latest path. If the latest run output
matches the issue symptom, the bug is still live — the skill posts
a "still reproducible" comment with both transcripts, applies no
label, and includes a dated marker `<!-- ... v1 YYYY-MM-DD -->`.
No new label per user direction; a 7-day TTL on the marker handles
re-verification on the next weekly cron.
- Update the comment template for the two-pass flow. The previous
template described a single transcript; the new one has Baseline
and Latest sections plus a dedicated still-reproduces template,
and the idempotency marker now carries today's date.
- Update the Step 12 log template to record baseline-install,
baseline-match, latest-install, and latest-result independently.
- Clarify in Step 4 that REPORTED_VERSION must be the full tag string
("v0.0.32"), not just the patch number (32). Step 8a's installer
expects the full tag.
Tier-2:
- Spell out an explicit match rubric in Step 8b: exit code agreement,
symptom-phrase match (LLM-judged semantic equivalence counts), and
distinguishing the bug from infra noise (DNS / auth / rate limits).
- Replace the minimal `rm -rf ~/.nemoclaw` reset with a comprehensive
one that stops nemoclaw/openshell processes, removes spawned Docker
containers, frees common ports (8080, 18789, 9119), and removes the
installed binaries and lib paths. Run before each install in 8a and
8d; idempotent so it's safe on a fresh box.
- Add interactive-subcommand handling in 8b: auto-detect `nemoclaw
onboard` / `configure` and try `--non-interactive`, then
`--dangerously-skip-prompts`, then stdin pre-feed. Fall through to
Step 8c (synth) if none work.
Tier-3 / opportunistic:
- Fix the actual installer command (gap 9). The installer accepts
the target ref via `NEMOCLAW_INSTALL_TAG` env var, NOT a `--version`
flag (verified against install.sh source). Updated 8a accordingly.
- Update Step 3 idempotency: drop on either label OR a marker
comment within the last 7 days. Previously a marker excluded the
issue forever; the release sweep only clears labels, so the marker
alone made re-verification impossible. The 7-day TTL also handles
the still-reproduces case naturally.
- Extend wallclock budget to 60 min when issue body contains
time-sensitive keywords (`after N minutes`, `eventually`, `memory
leak`, etc.). Hard ceiling at 60 — bugs that require hours fall
out of v1 scope.
- Keep provisioned boxes for 30 minutes on verify-inconclusive
outcomes so a maintainer can `brev shell` in and triage. Reused
boxes always stay.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five concrete fixes from the pre-E2E gap analysis, plus a verified finding on gap 2 that lowered its risk profile. Gap 1 — sudo behavior. The reset script's `sudo rm -f` and `sudo rm -rf` could hang on a password prompt if the Brev image's default user lacks passwordless sudo. Switched all sudo calls to `sudo -n` (non- interactive) so they fail fast instead. Added a precondition note explaining the assumption: Brev stock images allow passwordless sudo, custom images may not. The user-local install path (~/.nemoclaw) is fully reset regardless of sudo state, so the worst case is a stale /usr/local/bin/nemoclaw symlink that the next install overwrites. Gap 2 — verified the install architecture supports old-version pinning natively. The bootstrap clones the requested ref and runs that tag's own install.sh (with a payload-marker fallback for legacy ones), so NEMOCLAW_INSTALL_TAG=v0.0.32 works architecturally. Risk reduces to "does v0.0.32's own installer still work on a 2026 OS image" — handled by Step 11's degraded-mode fall-through. No SKILL.md change needed. Gap 4 — path extraction was hand-waved. The +25 commits-touched-area weight referenced `git log v<reported>..$LATEST -- <path>` without specifying how to determine `<path>`. Added a three-tier extraction procedure: stack-trace path mentions parsed from the body and mapped to repo paths, then a component-label-to-directory map (NemoClaw CLI → bin/, Sandbox → src/lib/sandbox/, etc.), then title-keyword fallbacks. If none yields a path, skip the +25 signal entirely rather than guessing — floating the weight would inflate scores meaninglessly. Gap 5 — PR-search query was hand-waved. The +25 PR-mention weight had no concrete query. Added two-stage search: first `gh pr list --search "$ISSUE_NUMBER"` filtered for actual `#NNNN` references in body or title, then a symptom-phrase fallback if direct reference returns nothing. PRs that merged before the issue was filed are excluded via `mergedAt > tag-date(REPORTED_VERSION)`. Gap 6 — issues without an "Actual result" section had no match path. Behavioral / configuration bugs (e.g., #1242 "should default to a stable released version") describe a wrong default rather than a runtime error. Added a fallback to Step 8b's match rubric: use the title + full body as the symptom, match if reproducer's outcome contradicts the issue's expected behavior. If neither error string nor expected-behavior contradiction can be identified, route to Step 8c (synth-repro) so the LLM produces a more diagnostic script. Gap 7 — transcript redaction was thin. Step 10 only covered tokens, paths, and basic-auth URLs. Real transcripts leak more: Authorization headers, JWTs, GitHub PATs, AWS keys, NVIDIA API keys, internal hostnames, emails (PII), long base64 blobs. Expanded the redaction table with concrete patterns for each. Added an order note: longest/most-specific patterns first so the generic base64 catchall doesn't mask what was actually redacted. Made it explicit that redaction runs on every quoted chunk in the comment, not just issue- body excerpts. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…l SKU pick Preflight Brev auth and install-URL reachability before paying any cost (Step 6.5). For pure-CLI bugs with no sandbox, Docker, or GPU dependency, try the reproducer locally on the maintainer's `nemoclaw` install before provisioning a Brev box (Step 6.7) — same evidence, zero cost. Replace the `<your-team's-CPU-SKU>` placeholder in Step 7 with a runtime pick via `brev search cpu --sort price --json | jq` so the skill keeps working as SKUs change, with a `VERIFY_STALE_CPU_TYPE` env override for teams that need to pin. Drop the spurious `--yes` flag from `brev delete` — it does not exist on that subcommand and would error. `brev delete` is non-interactive by default. Print the instance name before the trap registers so manual cleanup is possible if the trap doesn't fire. Thread `INSTALL_URL` through Step 8a/8d so the URL override from Step 6.5 applies to both the baseline and latest installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Some bugs are filed against behavior that was deliberately removed or changed in a merged PR. Running the standard rubric on these produces a misleading verdict — the symptom "still reproduces" but the right answer is "won't fix, see PR #X." Issue #2791 is the prototype: `config set` was removed in #2227, the reporter tested a version that already had it gone, and the standard rubric would have buried that context under a low-confidence `verify-inconclusive` label. Add Step 8.5 with three independent detection signals: - Maintainer-attribution comment phrasing ("removed in #N", "by design", "wontfix", "intentional") from MEMBER/OWNER/COLLABORATOR authors. - A removal commit between reported version and `$LATEST` whose subject matches `\b(remove|delete|drop|deprecate)\b` and whose diff deletes the symbol implicated by the reproducer. - The implicated symbol absent from both reported version and `$LATEST`, meaning it was already gone when the issue was filed. When any signal fires, skip Step 9 scoring and Brev provisioning, apply a new `wontfix-by-design` label, and post a focused comment that links to the responsible PR. Detection on signals 2 and 3 can run as soon as the reported version is parsed, saving Brev cost on the entire class. Generalize the Step 3 idempotency marker regex from `v1` to `v\d+` so future skill versions can re-verify older-marked issues by tightening the regex to require a specific marker version. Add `wontfix-by-design` to the idempotency label list, the activity log entry options, the session summary, and the release sweep in `nemoclaw-maintainer-cut-release-tag`. Update the skill's frontmatter description to mention the new label and the local-first short-circuit. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…var log path The issue body that comes back from `gh issue view` for NV QA bugs is HTML, not markdown — `<pre>...</pre>` blocks and tables, with HTML-encoded entities. The previous Step 6 only mentioned `<pre>` in prose without an extractor that actually parsed it. Add a Python extractor that handles markdown fences (triple-backtick and tilde), HTML `<pre>`, and entity unescape in one pass, so verbatim reproducers from QA-filed issues land cleanly instead of falling through to LLM synthesis with a -30 penalty. Step 10 already had a comprehensive redaction regex table, but the patterns assumed plain text — tokens nested in HTML tags or attributes slipped through unredacted. Add an HTML to text pre-pass for issue-body excerpts before the regex table runs, so quoted excerpts from QA bodies get the same redaction coverage transcripts already do. Step 5's GPU keyword list flagged `inference` and `model serving`, both of which false-positive on CPU bugs (`models.providers.inference.baseUrl` is a config path, not a GPU need). Tighten to whole-word matches and swap in `vllm`, `tensorrt`, `L40S`, `L4`, `T4`. Step 12's activity log was hard-coded to one maintainer's personal organizer path. Read from `VERIFY_STALE_LOG_DIR` with the existing path as default, and require the skill to create the directory rather than assuming it exists, so the skill works in CI / shared-volume / other environments. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Path-extraction map in Step 9 was written from assumption rather than verification — `src/lib/sandbox/`, `src/lib/openshell/`, and `src/lib/policy/` don't exist in the current repo, and OpenShell lives in a separate repo entirely. The +25 commits-touched-component signal would have silently misfired on every Sandbox / OpenShell / policy issue, underscoring real fixes. Replace with paths verified against the current tree (`nemoclaw/src/blueprint/`, `nemoclaw-blueprint/`, `nemoclaw/src/commands/`, etc.) and explicitly note the cross-repo OpenShell case is out of scope for v1. Step 8 reset wiped `~/.nemoclaw` but not `~/.openclaw`. Sandbox state has been writable-by-default under `.openclaw` since #2227, so it persists across the baseline → latest reinstall and contaminates the latest run. Add it to the reset. Anchor the `pkill -9 -f nemoclaw|openshell` patterns to a leading slash (`/nemoclaw`, `/openshell`) so the kill matches actual installed paths but not unrelated processes that mention these strings — including the agent harness running this skill if its working directory contains the word. Reorder the Step 10 redaction table to match the stated "longest, most specific patterns first" rule. The previous order had the generic `(token|secret|password|...)` pattern executing before JWT/AWS/NVIDIA patterns, which would have masked specific-token redactions with generic ones — losing the signal of *what kind* of credential leaked. Replace `git ls-remote git@github.com:NVIDIA/...` (Steps 2 and 4) with `gh api repos/NVIDIA/NemoClaw/tags --paginate --jq '.[].name'`. The SSH path required keys to be configured; `gh api` reuses whatever auth the user already has. Add a CLI-deps precondition check (`command -v gh brev jq python3 curl`) to Step 6.5 so the skill fails fast if any required dependency is missing, instead of failing late and confusingly mid-run. Step 11's keep-box-on-inconclusive said "delay cleanup by 30 minutes" but had no implementation. A backgrounded `sleep && brev delete` doesn't survive session end. Replace with skip-the-trap-entirely plus an explicit manual-cleanup reminder printed to the run output. Out-of-Scope was contradicting itself on macOS — Step 6.7 explicitly allows macOS for local-first runs. Clarify: Brev path is Linux-only, the local-first short-circuit on a maintainer's laptop works on macOS for manual single-issue runs. Add a `local (no Brev — Step 6.7 short-circuit)` option to the Step 12 log entry's Box field so local-first runs round-trip the activity log correctly. Frontmatter description said "latest release" but NemoClaw uses tags, not GitHub Releases — fix to "latest tag" to match Step 2. Step 4's implementer note said "Two real failure modes" but listed three; fix the off-by-one. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…-verification Step 8.5 was producing comments that were correct in direction but hand-wavy on substance. A side-by-side against an independent analysis of #2168 surfaced the gap: that analysis cited specific file:line locations, separated the literal bug-as-filed from a related failure mode that still exists on latest, called out existing CI coverage of the new workflow, and self-corrected an overstatement when prompted to re-verify. Today the skill required none of that. Restructure Step 8.5 into substeps so the rigor is mechanical, not prompt-dependent: - 8.5a: signal detection now requires verifiable evidence per signal — comment URL + quoted phrase (signal 1), commit SHA + diff range (signal 2), grep commands + outputs (signal 3). Add a sub-case for vestigial deprecation shims so a stub doesn't silently fail signal 3 by appearing in latest. - 8.5b: pre-check for related failure modes. Grep latest for the bug's symptom keywords (not the removed symbol) and require the comment to call out anything that surfaces. This is the section the side-by-side identified as missing — saying "the bug as filed can't reproduce" is not the same as "every bug shaped like this is fixed." - 8.5c: check existing test coverage for the new workflow and cite up to three test paths if found. Strengthens the comment from "trust me, it was removed" to "the new workflow is exercised by these tests." - 8.5d: explicit self-verification pass — re-run every cited command before composing the comment; bail to verify-inconclusive on any discrepancy. LLMs confidently overstate; mechanical re-verification catches it without needing a human prompt. Replace the by-design comment template with one that has mandatory sections matching the substeps: "What's structurally fixed", "Vestigial references", "What's not literally the same bug", "Existing CI coverage", "Recommendation". Each section either has concrete content or is explicitly omitted — no hand-wavy claims slip through. Add NVBugs cross-reference extraction to Step 4 (`grep -oE '\[NVB#[0-9]+\]'`) so the by-design template can append the standard "NVBugs#NNNNNNN will need a separate update; closing this GitHub issue won't propagate" reminder when the issue body carries that footer (most NV QA bugs do). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
The previous signal-2 candidate query narrowed by commit subject (`remove|delete|drop|deprecate`) before checking whether the diff deleted the implicated symbol. That filter excludes the common case where a removal lands inside a `refactor(...)` or `feat(...)` commit — e.g., PR #2227 removed `--dangerously-skip-permissions` under a "refactor(sandbox): default to mutable config" subject. A literal follower of the previous rule would have concluded signal 2 doesn't fire on issue #2168 even though the flag was deliberately removed. Switch the primary candidate lookup to `git log -S<symbol>` pickaxe search, which finds every commit whose diff changes the count of the symbol regardless of subject wording. Keep the subject-keyword query as a supplementary lookup for narrowing when the pickaxe returns many commits. Note explicitly that the commit's actual subject doesn't need to mention removal. Surfaced while testing the new Step 8.5 rigor against #2168. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…horing Two changes that surfaced from running Step 8.5 against issue #2168. Use the existing repo `wontfix` label for the by-design path instead of creating a new `wontfix-by-design` label. The repo already has `wontfix` and it's already in Step 3's issue-type skip list, so applying it gives idempotency for free without proliferating labels. The by-design verdict is encoded in the marker comment, not the label vocabulary. Important consequence: the release sweep in `nemoclaw-maintainer-cut-release-tag` does NOT clear `wontfix` — that label is also applied for non-skill reasons (scope, priority, dup decisions made by maintainers), and sweeping it would erase human triage work. Sweep stays scoped to `fixed-on-latest` and `verify-inconclusive`, which are skill-only. Add a `**Verification mode:**` line to the by-design comment template that explicitly says "static analysis at the verified-on tag — no runtime reproduction." This was the honesty an independent side-by-side analysis of #2168 caught: the previous template implied runtime confirmation even though Step 8.5 short-circuits Brev provisioning by design. Add a tag-anchoring rule above the template too — every `file:line` cite must refer to the verified-on tag, not the maintainer's HEAD, since line numbers drift between tags and main and we want comments to stay reproducible by anyone reading them later. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ion check Bare `file:line` paths in comments force the reader to navigate manually — that's a usability bug, not a stylistic preference. Comments posted by the skill should be self-serving artifacts: a maintainer or QA reader should be able to click any citation and land on the exact content at the verified-on tag. Extend the tag-anchoring rule above the by-design template into a "tag-anchoring + linking rule" with concrete formats for file blobs, file:line, file:line-range, commit SHAs, and test paths. Note that bare `#NNNN` PR/issue references already auto-link in same-repo comments. Strengthen Step 8.5d into two passes: evidence (re-run cited commands) and link (resolve at least one rendered link per evidence section via `gh api .../contents/<path>?ref=<tag>` or `curl -fsI`). A 404 on a citation suggests verification that didn't actually happen — worse than no link at all. Bail to verify-inconclusive on any failure. Surfaced while reviewing the rendered #2168 comment: the citations were correct and tag-anchored but not clickable. Better caught now than after the first batch run posts a wall of bare paths. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Discovered while applying the by-design label to issue #2168: the skill referenced `wontfix` and `needs-info` as bare strings, but the repo's canonical labels are `status: wont-fix` and `status: needs-info` (with prefix and hyphen). Same shape in Step 3's issue-type skip list — issues labelled `status: wont-fix` or `status: needs-info` were NOT being filtered out at candidate-time as the spec intended, because the skip list keys didn't match the live labels. Three coordinated fixes: 1. Step 3 issue-type skip list now uses `status: wont-fix` and `status: needs-info`. Annotate the rule so future readers know not to substitute the bare forms. 2. Step 8.5 by-design action applies `status: wont-fix` (with quoting on the CLI). Step 12 log entry, session summary, frontmatter description, and Companion Behavior section in both the verify-stale skill and `nemoclaw-maintainer-cut-release-tag` updated to match. 3. Step 6.5 preconditions now verifies all expected labels exist on the repo with `gh label list` before any later step tries to apply them. This is the gap that bit us on #2168 — the comment posted but the label couldn't be applied because the spec name didn't match. Failing fast in 6.5 means the run aborts before any Brev cost or comment posting can happen against a misconfigured label vocabulary. Bare `wontfix` / `won't fix` / etc. are still in Signal 1's detection regex — that's intentional, those are SEARCH PATTERNS for what maintainers might write in free-text comments, not label names. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…s for faithfulness Two changes that landed before the first real Brev E2E run (#2007). The previous wallclock cap split — 25 min default with a 60 min extension keyed off "memory leak" / "over time" / "eventually" keywords — optimised for the wrong constraint. Most issues fit comfortably under 60 min once you account for two full install passes plus comprehensive resets, and the keyword-based extension forced re-runs whenever a real install or bootstrap took longer than the optimistic 25 min budget. Bump the cap to a single 60 min default and drop the keyword extension; bugs that genuinely require more than an hour fall out of v1 scope. Add a new Step 8a.5 for bootstrapping reproducer dependencies that the stock Brev image doesn't ship — local model servers (Ollama, vLLM), provider runtimes, third-party CLIs. Default policy is maximum faithfulness: install the actual dependency the reporter used rather than substituting a stub. Substituting trades faithfulness for speed, and on a 60 min budget that trade is rarely worth it; it almost always introduces a confound that makes the verdict less trustworthy. Document the canonical Ollama and vLLM bootstraps with specific attention to daemon survival between `brev exec` calls (Ollama's installer registers a systemd service; vLLM uses nohup + log file). Bootstrap runs once before Step 8b's baseline and is reused for Step 8d's latest run — model downloads are expensive and external state, so the comprehensive reset must explicitly leave them alone. Carve out one substitution case: providers that require API keys the skill cannot safely supply (NIM, OpenAI, Anthropic). Stubbing a key defeats faithfulness anyway, and a real key shouldn't sit in a verify-stale run. Substitution applies the same -30 LLM-synth penalty as Step 8b and must be documented in the rendered comment. Bootstrap failure escalates to Step 11 infra failure — do NOT silently substitute, since the maintainer opted into faithfulness for a reason. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…get on every comment Two requirements surfaced from the #2007 e2e run that should be enforced by the skill, not by per-session memory or by the prompter remembering. Mandatory reporter @-mention with confirmation language. The skill cannot independently confirm a closed-as-fixed verdict — only the reporter knows whether their original symptom is gone in their environment. The @-mention is what converts a "skill says it is fixed" claim into actionable confirmation work for QA. Add the explicit closing block (canonical wording: "please confirm the symptom is gone on a recent build and reopen with a fresh reproducer if you observe otherwise") to all three Step 10 templates: fixed-on-latest, still-reproduces, and the Step 8.5 by-design template. The previous "If this verification is wrong, please reopen..." line was passive; the new line names the reporter and asks them to act. Length target. Default rendered comments to 400-500 words. The evidence table or by-design fixed/vestigial sections are the hero; everything else has to either change the reader's mind about the verdict or be deleted. The first #2007 draft ran ~750 words and the user pushed back explicitly; encoding the cap into Step 10 means future agents do not start from scratch on length each run. Surfaced from: e2e Brev verification run on issue #2007 (the first real Brev-track exercise of the skill end to end). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six concrete gaps that surfaced during the first end-to-end Brev run (issue #2007). Each almost produced a wrong verdict or a wasted run; each is now encoded so the next agent doesn't have to re-discover. 1. Step 9 baseline-validation cap-removal rule was too permissive. The `unless commits-touched OR PR-mention also fires` escape hatch let inferred fix evidence override the absence of a baseline run, which produced a misleading 100/100 on #2007 despite zero baseline confirmation. Tightened: cap at 84 holds regardless of corroboration; PR-search signals raise the score within the cap, never past it. Step 10 now requires an explicit one-line caveat naming the cap and the reason in the rendered Verdict section. 2. Step 6.5 install URL default switched from `nemoclaw.nvidia.com` (NVIDIA-internal, doesn't resolve from Brev) to `www.nvidia.com/ nemoclaw.sh` (public Akamai 301 redirect). Brev runs were dying at bootstrap on the wrong host. 3. Step 7 CPU SKU picker now biases by reproducer-implied memory needs. On #2007 the cheapest 2 GB SKU couldn't load a 4.8 GiB Ollama probe; onboard failed at provider validation and we burned ~25 min before re-provisioning a 16 GB box. Adds a `CPU_RAM_FLOOR` env var and a memory-floor heuristic (16 GB when reproducer references a model server, 8 GB for sandbox onboarding without a model, 4 GB for pure-CLI bugs). 4. Step 11 failure taxonomy now has a "baseline-build rot" bucket distinct from "binary install rot." Same `BASELINE_INSTALL_FAILED=1` flag and same downstream cap-and-degrade behavior, but failures at the in-image Dockerfile build phase (what we hit on v0.0.18 — the `.openclaw-data/workspace/media` symlink layer, removed entirely by #2227) get a separate label so reviewers can see *why* the old image no longer builds without re-running. 5. New Step 8a.5b documents the two non-obvious `brev exec` quirks that reproducer scripts have to handle every time: PATH does not include `~/.local/bin` in non-login shells (so reproducers must `export PATH` at the top, or callers must wrap with `bash -lc`); and the docker group requires `sg docker -c '...'` because adding the user via `usermod -aG` doesn't take effect within the same Brev session. 6. New Step 8d.5 architectural-drift check. When the diff between `$REPORTED_VERSION` and `$LATEST` touches the *tool* the reproducer's expected output depends on (e.g. `openshell forward` between v0.0.18 and v0.0.35), don't trust the reproducer's surface alone — multi-axis verification on OS-level surfaces (host listeners, NAT rules, docker ports, SSH tunnels, etc.) is required before claiming fixed-on-latest. This is the five-axis pattern we used to confirm #2007 wasn't a false positive; codifies it as a check the skill applies whenever pickaxe shows the reproducer's tool was reworked. Surfaced from: end-to-end Brev verification of issue #2007. Build the skill, exercise it on a real issue, fix what breaks, repeat — every fix here came from a concrete failure mode in one real run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…ll host Two small UX fixes surfaced from a thorough re-read. Step 6.5's brev-auth error printed three numbered options without a clear recommendation. From a non-TTY agent harness (Claude Code or any unattended runner) the right answer is option 2 — open a separate terminal, run `brev login`, complete the browser flow, come back. The previous message buried that path inside option 2's inline comment. Replace with a directive recipe that names the recommended flow first (open separate terminal, browser auth, re-run the skill) and lists the headless alternatives below as "when option 1 isn't available." Also explicitly notes that credentials persist to ~/.brev/credentials.json, so re-running the skill picks them up automatically. Step 6.5's install-URL error still suggested checking `https://nemoclaw.nvidia.com` even after the default was switched to `https://www.nvidia.com/nemoclaw.sh` in commit 296f7cd. Update the suggestion to match the new default and add an explicit "then re-run this skill" so the maintainer knows the flow is fix-and-retry, not abort. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Five gaps verified against the source during a thorough re-read pass.
The audit also caught two false-positive claims I'd made earlier
(`NEMOCLAW_INSTALL_TAG` env var and `comments` field in batch fetch)
that didn't survive verification — those are not landed because there
was nothing to fix.
Step 1 batch discovery query bumped from `--limit 100` to `--limit 1000`
so the skill doesn't silently drop issues beyond the page. The recent
candidate triage on this repo found 129 open bugs; the previous limit
would have lost 29 of them. The 20-issue per-run processing cap is
downstream of Step 3/4 filters and unaffected.
Step 3's 7-day comment-marker TTL had the rule but no implementation.
Add a portable `gh issue view --json comments --jq` snippet that
extracts marker comments newer than the cutoff. macOS and Linux
date(1) syntax differ for relative-date math, so the cutoff line
tries both. Without this, the first cron run after any posting will
silently re-verify everything previously marked, posting duplicate
comments every Monday.
Step 6.5 preconditions now checks gh identity via `gh api user --jq
.login` and prints the resolved login before any later step runs.
Comments posted by Step 10 land under this account; surfacing it
explicitly catches the multi-token / wrong-tab / mid-session-reauth
case before a public comment lands under the wrong handle.
Step 10 now requires a `**Verification mode:**` header line in every
template (was previously by-design only). Reader should never have to
guess whether a verdict came from real install logs or from static
analysis. Filled in concrete defaults for the standard
fixed/inconclusive template ("runtime reproduction; baseline + latest
installed and run") and the still-reproduces template ("runtime
reproduction; bug confirmed live").
Step 10 also requires a link-pass self-verification on every
template, not just by-design's Step 8.5d. The "Tag-anchoring + linking
rule" already declared every citation must be a clickable markdown
link to the verified-on tag, but the verification step that resolves
those links (`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`)
was scoped only to the by-design path. Same 404-cite risk applies to
the standard template — broken citation links advertise verification
work that didn't happen, and that's worse than no citation at all.
Surfaced from: end-to-end audit after the #2007 e2e run, with the
specific `gh issue view`/`grep` commands run against the SKILL.md to
confirm each claim before listing it as a gap.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Two changes that together turn the per-run batch cap from policy into code. Lower the per-batch processing cap from 20 to 15. Sequential execution with 1-2 reused Brev boxes works fine for runs in this size range; ~2-3 hours wallclock for a full 15-issue batch is the comfortable budget before either the per-plan approval gate breaks down or the maintainer session needs to span multiple sittings. Add the slicing code that actually enforces the cap. Previously Step 1 declared "Cap at N issues per run" as policy and nothing applied it — candidate sets larger than the cap would just process to completion silently. Move the slice to the end of Step 4 (after Step 3 label filters and the Step 4 version+candidate-rule filters narrow the pool) and sort by `(-versions_behind, -age_days)` so the most stale come first. Spillover beyond 15 stays eligible for the next run via the marker-comment 7-day TTL added in 8976176. Single-issue mode bypasses the cap entirely; the maintainer named the issue explicitly. Cadence section updated from "≤20 issues" to "≤15 issues" to match. Surfaced from the post-#2007 audit: of the 13 gaps initially called out, the cap-enforcement one was operational toil (skill could chew through cost on a runaway batch). Lowering and enforcing closes that toil path. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Looked at the 24 candidates surfaced by the recent triage run by *kind* of bug rather than by version, and found six gaps where the standard rubric would produce a wrong verdict or hang. All six landed in this commit. Step 3 platform skip: drop `Platform: Jetson AGX Thor/Orin`. Brev has no equivalent silicon (embedded/edge ARM with integrated GPU is not in the SKU catalog), so any Brev verification of a Jetson-only bug produces a misleading "fixed-on-x86" verdict. `Platform: DGX Spark` and `Platform: GB10` stay in scope but Step 10 now requires a hardware-substitution caveat naming the Brev SKU we substituted. Step 3 TUI / interactive-UI skip: drop issues whose title contains TUI / dashboard UI / chat UI / keystroke / key press, or whose body describes interactive UI behavior without a non-interactive reproducer. `brev exec` does not allocate a real TTY by default; TUI reproducers hang or silently fail at the first prompt. v1 documents this as out-of-scope; v1.1 may add a script(1) / expect / tmux harness. Step 5 now classifies bug-class in addition to CPU/GPU. Four classes: `performance`, `rebuild-cycle`, `log-only`, `functional` (default). Detection heuristics from the issue body (latency thresholds, "across rebuilds" / "after restart", "see lots of error in <X> log"). Each class routes to a different Step 8 rubric. Step 8b: log-scraping. When `BUG_CLASS=log-only`, also pull `~/.openclaw/logs/*.log` and `/var/log/nemoclaw/*.log` from inside the sandbox after the reproducer runs and search them for the issue's symptom phrase. Some bugs describe symptoms in internal log files, not the reproducer's stdout; the previous match rubric only checked the transcript. Step 8b: flake-detection retry. For functional bugs, run baseline three times if the first run shows the symptom inconsistently. Mixed results (1 or 2 of 3 reproduce) trigger a "flake suspected" caveat, −25 score adjustment, and downgrade `+50 latest clean` to `+25` so a lucky-clean-latest-run on an intermittent bug doesn't silently become a fixed-on-latest verdict. New Step 8e: performance-bug verification. Multi-run latency distribution rubric — N=10 runs each side, compute p50/p90, parse SLA from issue body, match latest's distribution against the SLA rather than against a symptom phrase. Cap at 60 unless the bug is silicon-independent because Brev SKUs aren't faithful to DGX Spark or GB10 hardware for performance-shape bugs. New Step 8f: rebuild-cycle verification. Run-rebuild-rerun harness for bugs that only manifest across destroy/recreate boundaries (#2701-shape). Capture artifacts pre-rebuild, trigger `nemoclaw destroy --all --force && nemoclaw onboard`, re-capture post-rebuild, diff to determine whether the artifact persisted as the issue expects. Step 10 mandatory hardware-substitution caveat: when DGX Spark or GB10 is in the issue's labels and Step 7 substituted with a different silicon class, the comment metadata block must name the substitution explicitly so silicon-shape bugs get the right reader expectation. Surfaced from: stress-testing the skill against the actual shapes of bugs in the 24-candidate set, not just against the bugs we already verified. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Six gaps surfaced from the second e2e run (issue #2592) and the broader audit of how the skill handles provider classification. Step 5 now classifies provider in addition to CPU/GPU and bug-class. Detection signals from labels and body keywords identify NIM, Gemini, Anthropic, Bedrock, or Ollama. When the reproducer references a non-Ollama provider AND actually exercises inference, the skill stops at Step 5 and prompts the maintainer interactively before any Brev cost — three options: provide the API key, accept Ollama substitution with the -30 penalty, or skip to verify-inconclusive. Pure-CLI / pure-sandbox bugs are exempt because the provider doesn't matter. Step 6 reproducer-extraction regex extended to also match `openclaw` and `openshell` invocations as anchor words. Issue #2592's reproducer was `openclaw channels add telegram` run inside the sandbox; the previous `nemoclaw`-only regex would have missed the verbatim block. Step 8a now passes the provider env vars through to install.sh's bundled onboard step, so it doesn't fall back to the default `build` (NIM) provider and fail with the misleading NVIDIA_API_KEY missing error. NEMOCLAW_PROVIDER=ollama (the default case) makes the bundled onboard use the local Ollama set up in Step 8a.5; NVIDIA_API_KEY only flows through if the maintainer provided one at Step 5's prompt. The bundled onboard creates a throwaway sandbox that gets destroyed before the reproducer runs. The reproducer's own onboard should pass `--fresh` so a half-built install-sandbox doesn't trip the "previous session failed" guard. Step 8a.5 now has an Ollama-coverage table making explicit which bug classes Ollama covers faithfully (CLI, sandbox, networking) vs which ones it doesn't (provider-specific behavior, model-specific behavior, silicon-dependent perf). Step 5's prompt keys off this table. Step 8a.5b documents the openshell-sandbox-exec syntax footgun. The correct form is `openshell sandbox exec -n NAME -- CMD`; the wrong form silently auto-detects the sandbox by "last used" and stuffs the leftover positional into bash's $0. Same section gets a brev-exec re-execution guard via a sentinel file at ~/.verify-stale-running to prevent the double-onboard scenario when SSH drops mid-run. Step 11 reframes baseline-build-rot as the dominant failure mode for any reported version >5-7 patches behind, not an edge case. Both Brev e2e runs hit it. Cap-at-84 with reporter @-mention is the modal verdict shape, not the exception — pre-flight carries more weight than baseline runtime evidence for older bugs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…sandbox name Two fixes from a rot-debugging investigation that nearly produced a wrong conclusion. When testing whether old NemoClaw versions actually rot under the literal end-user invocation (`https://www.nvidia.com/nemoclaw.sh`), the first test run was botched by a shell-scoping mistake: `NEMOCLAW_INSTALL_TAG=v0.0.26 curl ... | bash` scopes the env var to curl, not to the downstream bash reading from the pipe. The bash side defaulted to `latest`, resolved to v0.0.36, and the install proceeded for several minutes producing convincing-looking output before the mistake surfaced. The install script had no log line saying which version it actually resolved, so the silent failure looked like a working install of v0.0.26. Skill-side fix: after each install (Step 8a baseline and Step 8d latest), echo the resolved `nemoclaw --version` and case-match it against the requested version. Mismatch in baseline sets BASELINE_INSTALL_FAILED=1 to prevent verifying against the wrong version. Mismatch in latest logs a WARN that gets surfaced in the final comment. Cheapest possible guard against the class "I asked for X, got Y" — and the principle generalizes: print the resolved state, never trust the requested state. Also fix the sandbox name in the install.sh-passthrough block: change `NEMOCLAW_SANDBOX_NAME=__verify_stale_install__` (rejected by NemoClaw's name validator: "Allowed format: lowercase, starts with a letter, letters/numbers/internal hyphens only, ends with letter/number") to `NEMOCLAW_SANDBOX_NAME=verify-stale-install`. Surfaced during the #2519 e2e run; net-zero impact on that run because install.sh fell through on name validation rather than reaching the buggy code path, but the spec was wrong. The deeper context for the resolved-version-echo fix: when the rot hypothesis was being tested, an independent agent's logical analysis correctly identified that the public install URL DOES support `NEMOCLAW_INSTALL_TAG=<ref>` and demanded an empirical re-test. The re-test (with the env var on the bash side of the pipe) installed v0.0.26 correctly and STILL hit `Patch 4 (replaceConfigFile EACCES) not applied` — so the cap-at-84 framing held empirically, just not for the original "git-clone is the only path" reason. The full investigation is documented in findings.md Part 6's closing section ("Closing investigation: is baseline-build rot real, or our setup artifact?"). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Add a Step 3 skip rule for issues where a MEMBER/OWNER/COLLABORATOR comment was posted within the last 7 days AND the original reporter hasn't replied since. The skill is for *stale* issues; actively- discussed ones are in-flight, and posting a verify-stale verdict on top of an open clarifying question from a maintainer would conflict with their framing and confuse the reporter (now they have two simultaneous asks). Surfaced during pre-flight on issue #2757. The maintainer @cjagwani had just commented questioning the bug's premise — specifically that the reporter's "kill -9 took the parent down" framing didn't line up with how the gateway is launched (`nohup ... &` detaches it) — and asked the reporter to confirm what they actually observed on the Station. Running verify-stale would have posted a "by-design, close as wontfix" verdict on top of an active "wait, did this really happen?" exchange. Wrong move; the skill should detect this state and skip. Implementation reuses the SEVEN_DAYS_AGO cutoff already computed for the marker-TTL check, fetches the reporter via `gh issue view --json author`, then jq-walks the comments array: find the most recent maintainer comment within the cutoff, check whether any reporter comment exists after that timestamp; if maintainer commented recently AND no reporter reply since, skip. Single-issue mode prints the skip reason and exits friendlily so the maintainer knows why the skill bailed. Batch mode just moves on. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…argv The post-#2592 commit (1d625dd) added a Step 5 prompt asking the maintainer to "Export NVIDIA_API_KEY=<key>" and propagate it via `brev exec`. During the #2604 e2e run that played out as `NVIDIA_API_KEY=<value> brev exec ...` — which puts the literal key in the brev exec process's argv, visible in `ps -ef` to anyone with shell access on either the maintainer's laptop or the Brev box for the duration of the run. The skill's stated promise ("never logged") got violated by argv visibility. Fix: switch propagation to file-based. Maintainer writes the key to ~/.nvidia-api-key with 600 perms on their laptop; Step 6.5 brev-copy's the file to the Brev box (encrypted SSH); install / reproducer scripts inside the box read with `NVIDIA_API_KEY=$(cat ~/.nvidia-api-key)`, which sets the env var in the script's own process — never on a command line and never visible in `ps -ef`. Update Step 5's option-1 prompt to instruct the maintainer to use the file form and to `rm ~/.nvidia-api-key` after the run. Add a new "API-key propagation pattern" section after Step 5 that documents the file-based mechanism explicitly. Update Step 8a (baseline install) and Step 8d (latest install) to source the key from the file inside the Brev box's exec context, not from the local shell's env. Note in the docs: if the key was previously propagated via the cmdline (pre-fix), treat it as exposed and rotate. The #2604 run did this; the maintainer was reminded to rotate the NIM key after the run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
Replace Step 10's narrow "Length target" rule with a richer "Comment authoring principle" block that captures lessons accumulated across the e2e runs. The core rule: every section in a rendered comment must either change a reader's mind about the verdict or be cut. Word counts follow from that — 500 is a ceiling, not a target. Most comments land in 200-400. Simple cases land under 200. Document why this matters: comments posted by the skill compete for a maintainer's attention against every other in-flight thread on the repo, and AI-slop prose actively reduces signal-to-noise. Add a worked-examples block citing two real iterations: - #2007 first draft was 750 words; cut to 371. - #2604 took THREE drafts before settling on 190 words because each draft padded with prose that didn't ground the verdict — which caused the verdict itself to drift. Rule learned: name the verdict in one sentence first, then cut any section that doesn't support it. Add per-verdict length targets: - fixed-on-latest: 200-400 words - wontfix (by-design): 250-500 words - verify-inconclusive: 100-200 words - Still-reproduces (no label): 30-80 words — and no transcripts (issue body has them), no @-mention (reporter knows), no architectural prose. One sentence + marker. Add an explicit "cut, by default" list naming the patterns that historically padded comments without changing verdicts. Generalizes the existing memory note into the skill body so future agents reading SKILL.md inherit the principle directly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
… ceiling Followup to fe363cf. The 200-400 / 250-500 ranges in the per-verdict table were still too permissive. Tighten to 200-300 for both fixed-on-latest and wontfix; verify-inconclusive stays at 100-200; still-reproduces stays at 30-80. Update the principle preamble: "300 is a hard ceiling for the main verdicts." Simple cases land under 200. If a draft is past 300, it's padding — cut before re-reading. Memory note (feedback_verify_stale_comment_length.md) updated to match. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…d-question variant Previously skipped any issue with a maintainer comment in the last 7 days the reporter hadn't replied to. After 7 days the question is no longer "active discussion" — it's a stuck thread, and the skill can be the unsticking voice rather than a clueless interruption. Step 3: classifies the most-recent-unanswered-maintainer-comment as recent (within 7d → skip) or stale (>7d → proceed with variant), exporting UNANSWERED_MAINT_LOGIN/URL/DATE for the templater. Step 10: new mandatory block describing the variant — prepend an "@<maint>'s comment from <date> is still unanswered" lead paragraph and swap the closing reporter-only @-mention for a dual maintainer+reporter @-mention that flags the open question. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- brev-provisioning: append `// empty` to the CPU_TYPE jq filter so an
empty result yields an empty string instead of the literal "null",
which `[ -n "$CPU_TYPE" ]` would otherwise accept and pass to
`brev create --type "null"`.
- by-design: fix "greping" -> "grepping".
- candidate-selection: drop the hardcoded "8 prefixed variants" count
(the inline list enumerates 11) and point readers at `gh label list
--search enhancement` for the live set.
- environment-and-reproducer (API key entry): replace the inline
`printf '%s' '<your-key>' > ~/.nvidia-api-key` example with a
no-echo, no-history `IFS= read -rs` flow; the inline form left the
key in shell history.
- environment-and-reproducer (local-first match): split the conjoined
predicate into two explicit branches — matches reported symptom ->
still-reproduces, matches expected-fixed behavior -> fixed-on-latest;
third branch falls through to Brev as before.
- reproduction-rubrics (Step 8c rerun): prepend the same `export
PATH="$HOME/.local/bin:$PATH"` guard 8b and 8d use, so a synth-repro
rerun doesn't misread as `verify-inconclusive` because the nemoclaw
binary fell off PATH.
- reproduction-rubrics (drift-check loop): switch from `for t in
$TOOL` to `mapfile -t TOOLS` + `for t in "${TOOLS[@]}"`, so
multi-word tool strings ("openshell forward") stay intact under
pickaxe instead of word-splitting into separate searches.
- reproduction-rubrics (Step 8e perf rubric): compute p50 as the mean
of the 5th and 6th sorted values for N=10 (standard median), and
add the missing p90 line as the nearest-rank 9th value. Match
rubric grows a p90 backstop that flips a within-p50 verdict to
still-reproduces when an issue-declared p90 SLA is missed.
- scoring-comments-and-logging (idempotency marker): replace the two
hardcoded `2026-05-12` template dates with the `YYYY-MM-DD`
placeholder the surrounding prose already calls for.
- scoring-comments-and-logging (Markdown capitalization): capitalize
"Markdown" in two prose mentions.
- scoring-comments-and-logging (still-reproduces contract): make the
per-verdict table on L174 canonical — strip the closing reporter
@-mention, baseline transcript, and latest transcript from the
template body so it actually fits the 30–80-word target. Scope the
unanswered-question dual @-mention rule to fixed-on-latest and
by-design only (still-reproduces has no closing @-mention to
replace); the lead-paragraph half of the rule still applies to all
three templates.
Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md (1)
172-196:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDocument version-prefix normalization between extraction and validation.
The body/comment regex (line 173) captures only the digit portion
(\d+\.\d+\.\d+)without thevprefix, but:
- Labels include the
v(line 172:^v\d+\.\d+\.\d+$)- Tag list entries include the
v(line 49:^v[0-9]+)- Validation uses
grep -Fxq "$V"against tags withvprefix (line 189)- Output requires full tag format with
v(line 196:REPORTED_VERSION="v0.0.32")Without an explicit normalization step to prepend
vto body/comment captures before validation, thegrep -Fxqcheck will fail for versions extracted from body/comments, incorrectly dropping valid candidates.📝 Recommended addition
Add a normalization note after line 176:
**Normalize captured versions.** Body and comment captures lack the `v` prefix (the capture group excludes it), but labels and tags include it. Before validation, prepend `v` to any version that doesn't already start with it: \```bash # Normalize each candidate version for V in "${CANDIDATE_VERSIONS[@]}"; do [[ "$V" =~ ^v ]] || V="v$V" # Now validate against tag list... done \```🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md around lines 172 - 196, The extraction regex captures only digits (e.g., (\d+\.\d+\.\d+)) but validation uses tag names with a leading v and sets REPORTED_VERSION to the full tag; add a normalization step before validation that ensures every candidate in your CANDIDATE_VERSIONS list is converted to the full tag form (prepend "v" when the string does not already start with "v") so that grep -Fxq "$V" against the tag list and the final REPORTED_VERSION assignment operate on matching tag strings; update the validation flow that invokes grep -Fxq and the logic that selects the smallest surviving version to use the normalized/tag-prefixed values.
🧹 Nitpick comments (1)
.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md (1)
114-119: ⚡ Quick winConsider breaking the question-detection regex into multiple filters.
The single regex pattern on line 118 combines multiple question-detection heuristics in one long expression, making it harder to maintain or extend. Consider using multiple
select()clauses or extracting the pattern to a shell variable with documentation.♻️ Refactor option: Multiple select filters
UNANSWERED_MAINT=$(gh issue view "$ISSUE_NUMBER" --repo NVIDIA/NemoClaw --json comments \ --jq --arg reporter "$REPORTER" --arg cutoff "$SEVEN_DAYS_AGO" ' (.comments | map(select((.authorAssociation == "MEMBER" or .authorAssociation == "OWNER" or .authorAssociation == "COLLABORATOR") - and (.body | test("\\?|(?i)\\bplease (confirm|share|provide|clarify|tell|verify|check|let me know|let us know)|(?i)\\b(could|can|would) you\\b|(?i)\\bdo you (have|know|see|use)\\b")))) + and (.body | test("\\?") + or test("(?i)\\bplease (confirm|share|provide|clarify|tell|verify|check|let me know|let us know)") + or test("(?i)\\b(could|can|would) you\\b") + or test("(?i)\\bdo you (have|know|see|use)\\b")))) | sort_by(.createdAt) | last) as $maint🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md around lines 114 - 119, The combined question-detection regex inside the gh issue view pipeline (used to set UNANSWERED_MAINT) is too long and hard to maintain; refactor by splitting the single select(...) that contains the big test(...) into multiple select() clauses (e.g., one for authorAssociation membership and separate selects for question patterns like a simple "?" test, polite-request keywords, modal verbs, and "do you ..." patterns) or move each regex into a clearly named shell variable and reference those variables in separate select() calls so the logic in the pipeline (the gh issue view ... --jq expression that produces $maint) is easier to read and extend.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md:
- Around line 219-234: The measurements from /usr/bin/time are in seconds but
the SLA vars are milliseconds, so convert the computed p50/p90 to milliseconds
before comparing (or alternatively rename SLA vars to seconds); specifically
update the P50_MS and P90_MS calculations (and the corresponding baseline
variables) that read from latest-perf.log/baseline-perf.log so the awk/shell
math multiplies the second values by 1000 and formats as integers (e.g., compute
P50_MS = (mean*1000) and P90_MS = (value*1000)), update the echo suffix to "ms",
and ensure comparisons use SLA_P50_MS and SLA_P90_MS as millisecond integers.
---
Outside diff comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
- Around line 172-196: The extraction regex captures only digits (e.g.,
(\d+\.\d+\.\d+)) but validation uses tag names with a leading v and sets
REPORTED_VERSION to the full tag; add a normalization step before validation
that ensures every candidate in your CANDIDATE_VERSIONS list is converted to the
full tag form (prepend "v" when the string does not already start with "v") so
that grep -Fxq "$V" against the tag list and the final REPORTED_VERSION
assignment operate on matching tag strings; update the validation flow that
invokes grep -Fxq and the logic that selects the smallest surviving version to
use the normalized/tag-prefixed values.
---
Nitpick comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
- Around line 114-119: The combined question-detection regex inside the gh issue
view pipeline (used to set UNANSWERED_MAINT) is too long and hard to maintain;
refactor by splitting the single select(...) that contains the big test(...)
into multiple select() clauses (e.g., one for authorAssociation membership and
separate selects for question patterns like a simple "?" test, polite-request
keywords, modal verbs, and "do you ..." patterns) or move each regex into a
clearly named shell variable and reference those variables in separate select()
calls so the logic in the pipeline (the gh issue view ... --jq expression that
produces $maint) is easier to read and extend.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b4bcfcff-fe48-424e-9219-6822388d6445
📒 Files selected for processing (6)
.agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md.agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md
✅ Files skipped from review due to trivial changes (4)
- .agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
- .agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
- .agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
- .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md
|
@CodeRabbit let's look at these skills considering Anthropic's best practices: Skill authoring best practicesGood Skills are concise, well-structured, and tested with real usage. This guide provides practical authoring decisions to help you write Skills that Claude can discover and use effectively.For conceptual background on how Skills work, see the Skills overview. Core principlesConcise is keyThe context window is a public good. Your Skill shares the context window with everything else Claude needs to know, including:
Not every token in your Skill has an immediate cost. At startup, only the metadata (name and description) from all Skills is pre-loaded. Claude reads SKILL.md only when the Skill becomes relevant, and reads additional files only as needed. However, being concise in SKILL.md still matters: once Claude loads it, every token competes with conversation history and other context. Default assumption: Claude is already very smart Only add context Claude doesn't already have. Challenge each piece of information:
Good example: Concise (approximately 50 tokens): ## Extract PDF text
Use pdfplumber for text extraction:
```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
text = pdf.pages[0].extract_text()
```Bad example: Too verbose (approximately 150 tokens): ## Extract PDF text
PDF (Portable Document Format) files are a common file format that contains
text, images, and other content. To extract text from a PDF, you'll need to
use a library. There are many libraries available for PDF processing, but
pdfplumber is recommended because it's easy to use and handles most cases well.
First, you'll need to install it using pip. Then you can use the code below...The concise version assumes Claude knows what PDFs are and how libraries work. Set appropriate degrees of freedomMatch the level of specificity to the task's fragility and variability. High freedom (text-based instructions): Use when:
Example: ## Code review process
1. Analyze the code structure and organization
2. Check for potential bugs or edge cases
3. Suggest improvements for readability and maintainability
4. Verify adherence to project conventionsMedium freedom (pseudocode or scripts with parameters): Use when:
Example: ## Generate report
Use this template and customize as needed:
```python
def generate_report(data, format="markdown", include_charts=True):
# Process data
# Generate output in specified format
# Optionally include visualizations
```Low freedom (specific scripts, few or no parameters): Use when:
Example: ## Database migration
Run exactly this script:
```bash
python scripts/migrate.py --verify --backup
```
Do not modify the command or add additional flags.Analogy: Think of Claude as a robot exploring a path:
Test with all models you plan to useSkills act as additions to models, so effectiveness depends on the underlying model. Test your Skill with all the models you plan to use it with. Testing considerations by model:
What works perfectly for Opus might need more detail for Haiku. If you plan to use your Skill across multiple models, aim for instructions that work well with all of them. Skill structure**YAML Frontmatter:** The SKILL.md frontmatter requires two fields:
For complete Skill structure details, see the Skills overview. Naming conventionsUse consistent naming patterns to make Skills easier to reference and discuss. Consider using gerund form (verb + -ing) for Skill names, as this clearly describes the activity or capability the Skill provides. Remember that the Good naming examples (gerund form):
Acceptable alternatives:
Avoid:
Consistent naming makes it easier to:
Writing effective descriptionsThe
Be specific and include key terms. Include both what the Skill does and specific triggers/contexts for when to use it. Each Skill has exactly one description field. The description is critical for skill selection: Claude uses it to choose the right Skill from potentially 100+ available Skills. Your description must provide enough detail for Claude to know when to select this Skill, while the rest of SKILL.md provides the implementation details. Effective examples: PDF Processing skill: description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.Excel Analysis skill: description: Analyze Excel spreadsheets, create pivot tables, generate charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files.Git Commit Helper skill: description: Generate descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes.Avoid vague descriptions like these: description: Helps with documentsdescription: Processes datadescription: Does stuff with filesProgressive disclosure patternsSKILL.md serves as an overview that points Claude to detailed materials as needed, like a table of contents in an onboarding guide. For an explanation of how progressive disclosure works, see How Skills work in the overview. Practical guidance:
Visual overview: From simple to complexA basic Skill starts with just a SKILL.md file containing metadata and instructions: As your Skill grows, you can bundle additional content that Claude loads only when needed: The complete Skill directory structure might look like this: Pattern 1: High-level guide with references---
name: pdf-processing
description: Extracts text and tables from PDF files, fills forms, and merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---
# PDF Processing
## Quick start
Extract text with pdfplumber:
```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
text = pdf.pages[0].extract_text()
```
## Advanced features
**Form filling**: See [FORMS.md](FORMS.md) for complete guide
**API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
**Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patternsClaude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed. Pattern 2: Domain-specific organizationFor Skills with multiple domains, organize content by domain to avoid loading irrelevant context. When a user asks about sales metrics, Claude only needs to read sales-related schemas, not finance or marketing data. This keeps token usage low and context focused. # BigQuery Data Analysis
## Available datasets
**Finance**: Revenue, ARR, billing → See [reference/finance.md](reference/finance.md)
**Sales**: Opportunities, pipeline, accounts → See [reference/sales.md](reference/sales.md)
**Product**: API usage, features, adoption → See [reference/product.md](reference/product.md)
**Marketing**: Campaigns, attribution, email → See [reference/marketing.md](reference/marketing.md)
## Quick search
Find specific metrics using grep:
```bash
grep -i "revenue" reference/finance.md
grep -i "pipeline" reference/sales.md
grep -i "api usage" reference/product.md
```Pattern 3: Conditional detailsShow basic content, link to advanced content: # DOCX Processing
## Creating documents
Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).
## Editing documents
For simple edits, modify the XML directly.
**For tracked changes**: See [REDLINING.md](REDLINING.md)
**For OOXML details**: See [OOXML.md](OOXML.md)Claude reads REDLINING.md or OOXML.md only when the user needs those features. Avoid deeply nested referencesClaude may partially read files when they're referenced from other referenced files. When encountering nested references, Claude might use commands like Keep references one level deep from SKILL.md. All reference files should link directly from SKILL.md to ensure Claude reads complete files when needed. Bad example: Too deep: # SKILL.md
See [advanced.md](advanced.md)...
# advanced.md
See [details.md](details.md)...
# details.md
Here's the actual information...Good example: One level deep: # SKILL.md
**Basic usage**: [instructions in SKILL.md]
**Advanced features**: See [advanced.md](advanced.md)
**API reference**: See [reference.md](reference.md)
**Examples**: See [examples.md](examples.md)Structure longer reference files with table of contentsFor reference files longer than 100 lines, include a table of contents at the top. This ensures Claude can see the full scope of available information even when previewing with partial reads. Example: # API Reference
## Contents
- Authentication and setup
- Core methods (create, read, update, delete)
- Advanced features (batch operations, webhooks)
- Error handling patterns
- Code examples
## Authentication and setup
...
## Core methods
...Claude can then read the complete file or jump to specific sections as needed. For details on how this filesystem-based architecture enables progressive disclosure, see the Runtime environment section in the Advanced section below. Workflows and feedback loopsUse workflows for complex tasksBreak complex operations into clear, sequential steps. For particularly complex workflows, provide a checklist that Claude can copy into its response and check off as it progresses. Example 1: Research synthesis workflow (for Skills without code): ## Research synthesis workflow
Copy this checklist and track your progress:
```
Research Progress:
- [ ] Step 1: Read all source documents
- [ ] Step 2: Identify key themes
- [ ] Step 3: Cross-reference claims
- [ ] Step 4: Create structured summary
- [ ] Step 5: Verify citations
```
**Step 1: Read all source documents**
Review each document in the `sources/` directory. Note the main arguments and supporting evidence.
**Step 2: Identify key themes**
Look for patterns across sources. What themes appear repeatedly? Where do sources agree or disagree?
**Step 3: Cross-reference claims**
For each major claim, verify it appears in the source material. Note which source supports each point.
**Step 4: Create structured summary**
Organize findings by theme. Include:
- Main claim
- Supporting evidence from sources
- Conflicting viewpoints (if any)
**Step 5: Verify citations**
Check that every claim references the correct source document. If citations are incomplete, return to Step 3.This example shows how workflows apply to analysis tasks that don't require code. The checklist pattern works for any complex, multi-step process. Example 2: PDF form filling workflow (for Skills with code): ## PDF form filling workflow
Copy this checklist and check off items as you complete them:
```
Task Progress:
- [ ] Step 1: Analyze the form (run analyze_form.py)
- [ ] Step 2: Create field mapping (edit fields.json)
- [ ] Step 3: Validate mapping (run validate_fields.py)
- [ ] Step 4: Fill the form (run fill_form.py)
- [ ] Step 5: Verify output (run verify_output.py)
```
**Step 1: Analyze the form**
Run: `python scripts/analyze_form.py input.pdf`
This extracts form fields and their locations, saving to `fields.json`.
**Step 2: Create field mapping**
Edit `fields.json` to add values for each field.
**Step 3: Validate mapping**
Run: `python scripts/validate_fields.py fields.json`
Fix any validation errors before continuing.
**Step 4: Fill the form**
Run: `python scripts/fill_form.py input.pdf fields.json output.pdf`
**Step 5: Verify output**
Run: `python scripts/verify_output.py output.pdf`
If verification fails, return to Step 2.Clear steps prevent Claude from skipping critical validation. The checklist helps both Claude and you track progress through multi-step workflows. Implement feedback loopsCommon pattern: Run validator → fix errors → repeat This pattern greatly improves output quality. Example 1: Style guide compliance (for Skills without code): ## Content review process
1. Draft your content following the guidelines in STYLE_GUIDE.md
2. Review against the checklist:
- Check terminology consistency
- Verify examples follow the standard format
- Confirm all required sections are present
3. If issues found:
- Note each issue with specific section reference
- Revise the content
- Review the checklist again
4. Only proceed when all requirements are met
5. Finalize and save the documentThis shows the validation loop pattern using reference documents instead of scripts. The "validator" is STYLE_GUIDE.md, and Claude performs the check by reading and comparing. Example 2: Document editing process (for Skills with code): ## Document editing process
1. Make your edits to `word/document.xml`
2. **Validate immediately**: `python ooxml/scripts/validate.py unpacked_dir/`
3. If validation fails:
- Review the error message carefully
- Fix the issues in the XML
- Run validation again
4. **Only proceed when validation passes**
5. Rebuild: `python ooxml/scripts/pack.py unpacked_dir/ output.docx`
6. Test the output documentThe validation loop catches errors early. Content guidelinesAvoid time-sensitive informationDon't include information that will become outdated: Bad example: Time-sensitive (will become wrong): If you're doing this before August 2025, use the old API.
After August 2025, use the new API.Good example (use "old patterns" section): ## Current method
Use the v2 API endpoint: `api.example.com/v2/messages`
## Old patterns
<details>
<summary>Legacy v1 API (deprecated 2025-08)</summary>
The v1 API used: `api.example.com/v1/messages`
This endpoint is no longer supported.
</details>The old patterns section provides historical context without cluttering the main content. Use consistent terminologyChoose one term and use it throughout the Skill: Good - Consistent:
Bad - Inconsistent:
Consistency helps Claude understand and follow instructions. Common patternsTemplate patternProvide templates for output format. Match the level of strictness to your needs. For strict requirements (like API responses or data formats): ## Report structure
ALWAYS use this exact template structure:
```markdown
# [Analysis Title]
## Executive summary
[One-paragraph overview of key findings]
## Key findings
- Finding 1 with supporting data
- Finding 2 with supporting data
- Finding 3 with supporting data
## Recommendations
1. Specific actionable recommendation
2. Specific actionable recommendation
```For flexible guidance (when adaptation is useful): ## Report structure
Here is a sensible default format, but use your best judgment based on the analysis:
```markdown
# [Analysis Title]
## Executive summary
[Overview]
## Key findings
[Adapt sections based on what you discover]
## Recommendations
[Tailor to the specific context]
```
Adjust sections as needed for the specific analysis type.Examples patternFor Skills where output quality depends on seeing examples, provide input/output pairs just like in regular prompting: ## Commit message format
Generate commit messages following these examples:
**Example 1:**
Input: Added user authentication with JWT tokens
Output:
```
feat(auth): implement JWT-based authentication
Add login endpoint and token validation middleware
```
**Example 2:**
Input: Fixed bug where dates displayed incorrectly in reports
Output:
```
fix(reports): correct date formatting in timezone conversion
Use UTC timestamps consistently across report generation
```
**Example 3:**
Input: Updated dependencies and refactored error handling
Output:
```
chore: update dependencies and refactor error handling
- Upgrade lodash to 4.17.21
- Standardize error response format across endpoints
```
Follow this style: type(scope): brief description, then detailed explanation.Examples help Claude understand the desired style and level of detail more clearly than descriptions alone. Conditional workflow patternGuide Claude through decision points: ## Document modification workflow
1. Determine the modification type:
**Creating new content?** → Follow "Creation workflow" below
**Editing existing content?** → Follow "Editing workflow" below
2. Creation workflow:
- Use docx-js library
- Build document from scratch
- Export to .docx format
3. Editing workflow:
- Unpack existing document
- Modify XML directly
- Validate after each change
- Repack when completeEvaluation and iterationBuild evaluations firstCreate evaluations BEFORE writing extensive documentation. This ensures your Skill solves real problems rather than documenting imagined ones. Evaluation-driven development:
This approach ensures you're solving actual problems rather than anticipating requirements that may never materialize. Evaluation structure: {
"skills": ["pdf-processing"],
"query": "Extract all text from this PDF file and save it to output.txt",
"files": ["test-files/document.pdf"],
"expected_behavior": [
"Successfully reads the PDF file using an appropriate PDF processing library or command-line tool",
"Extracts text content from all pages in the document without missing any pages",
"Saves the extracted text to a file named output.txt in a clear, readable format"
]
}Develop Skills iteratively with ClaudeThe most effective Skill development process involves Claude itself. Work with one instance of Claude ("Claude A") to create a Skill that is used by other instances ("Claude B"). Claude A helps you design and refine instructions, while Claude B tests them in real tasks. This works because Claude models understand both how to write effective agent instructions and what information agents need. Creating a new Skill:
Iterating on existing Skills: The same hierarchical pattern continues when improving Skills. You alternate between:
Gathering team feedback:
Why this approach works: Claude A understands agent needs, you provide domain expertise, Claude B reveals gaps through real usage, and iterative refinement improves Skills based on observed behavior rather than assumptions. Observe how Claude navigates SkillsAs you iterate on Skills, pay attention to how Claude actually uses them in practice. Watch for:
Iterate based on these observations rather than assumptions. The 'name' and 'description' in your Skill's metadata are particularly critical. Claude uses these when deciding whether to trigger the Skill in response to the current task. Make sure they clearly describe what the Skill does and when it should be used. Anti-patterns to avoidAvoid Windows-style pathsAlways use forward slashes in file paths, even on Windows:
Unix-style paths work across all platforms, while Windows-style paths cause errors on Unix systems. Avoid offering too many optionsDon't present multiple approaches unless necessary: **Bad example: Too many choices** (confusing):
"You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image, or..."
**Good example: Provide a default** (with escape hatch):
"Use pdfplumber for text extraction:
```python
import pdfplumber
```
For scanned PDFs requiring OCR, use pdf2image with pytesseract instead."Advanced: Skills with executable codeThe sections below focus on Skills that include executable scripts. If your Skill uses only markdown instructions, skip to Checklist for effective Skills. Solve, don't puntWhen writing scripts for Skills, handle error conditions rather than punting to Claude. Good example: Handle errors explicitly: def process_file(path):
"""Process a file, creating it if it doesn't exist."""
try:
with open(path) as f:
return f.read()
except FileNotFoundError:
# Create file with default content instead of failing
print(f"File {path} not found, creating default")
with open(path, "w") as f:
f.write("")
return ""
except PermissionError:
# Provide alternative instead of failing
print(f"Cannot access {path}, using default")
return ""Bad example: Punt to Claude: def process_file(path):
# Just fail and let Claude figure it out
return open(path).read()Configuration parameters should also be justified and documented to avoid "voodoo constants" (Ousterhout's law). If you don't know the right value, how will Claude determine it? Good example: Self-documenting: # HTTP requests typically complete within 30 seconds
# Longer timeout accounts for slow connections
REQUEST_TIMEOUT = 30
# Three retries balances reliability vs speed
# Most intermittent failures resolve by the second retry
MAX_RETRIES = 3Bad example: Magic numbers: TIMEOUT = 47 # Why 47?
RETRIES = 5 # Why 5?Provide utility scriptsEven if Claude could write a script, pre-made scripts offer advantages: Benefits of utility scripts:
The diagram above shows how executable scripts work alongside instruction files. The instruction file (forms.md) references the script, and Claude can execute it without loading its contents into context. Important distinction: Make clear in your instructions whether Claude should:
For most utility scripts, execution is preferred because it's more reliable and efficient. See the Runtime environment section below for details on how script execution works. Example: ## Utility scripts
**analyze_form.py**: Extract all form fields from PDF
```bash
python scripts/analyze_form.py input.pdf > fields.json
```
Output format:
```json
{
"field_name": {"type": "text", "x": 100, "y": 200},
"signature": {"type": "sig", "x": 150, "y": 500}
}
```
**validate_boxes.py**: Check for overlapping bounding boxes
```bash
python scripts/validate_boxes.py fields.json
# Returns: "OK" or lists conflicts
```
**fill_form.py**: Apply field values to PDF
```bash
python scripts/fill_form.py input.pdf fields.json output.pdf
```Use visual analysisWhen inputs can be rendered as images, have Claude analyze them: ## Form layout analysis
1. Convert PDF to images:
```bash
python scripts/pdf_to_images.py form.pdf
```
2. Analyze each page image to identify form fields
3. Claude can see field locations and types visuallyClaude's vision capabilities help understand layouts and structures. Create verifiable intermediate outputsWhen Claude performs complex, open-ended tasks, it can make mistakes. The "plan-validate-execute" pattern catches errors early by having Claude first create a plan in a structured format, then validate that plan with a script before executing it. Example: Imagine asking Claude to update 50 form fields in a PDF based on a spreadsheet. Without validation, Claude might reference non-existent fields, create conflicting values, miss required fields, or apply updates incorrectly. Solution: Use the workflow pattern shown above (PDF form filling), but add an intermediate Why this pattern works:
When to use: Batch operations, destructive changes, complex validation rules, high-stakes operations. Implementation tip: Make validation scripts verbose with specific error messages like "Field 'signature_date' not found. Available fields: customer_name, order_total, signature_date_signed" to help Claude fix issues. Package dependenciesSkills run in the code execution environment with platform-specific limitations:
List required packages in your SKILL.md and verify they're available in the code execution tool documentation. Runtime environmentSkills run in a code execution environment with filesystem access, bash commands, and code execution capabilities. For the conceptual explanation of this architecture, see The Skills architecture in the overview. How this affects your authoring: How Claude accesses Skills:
Example: When the user asks about revenue, Claude reads SKILL.md, sees the reference to For complete details on the technical architecture, see How Skills work in the Skills overview. MCP tool referencesIf your Skill uses MCP (Model Context Protocol) tools, always use fully qualified tool names to avoid "tool not found" errors. Format: Example: Use the BigQuery:bigquery_schema tool to retrieve table schemas.
Use the GitHub:create_issue tool to create issues.Where:
Without the server prefix, Claude may fail to locate the tool, especially when multiple MCP servers are available. Avoid assuming tools are installedDon't assume packages are available: **Bad example: Assumes installation**:
"Use the pdf library to process the file."
**Good example: Explicit about dependencies**:
"Install required package: `pip install pypdf`
Then use it:
```python
from pypdf import PdfReader
reader = PdfReader("file.pdf")
```"Technical notesYAML frontmatter requirementsThe SKILL.md frontmatter requires
See the Skills overview for complete structure details. Token budgetsKeep SKILL.md body under 500 lines for optimal performance. If your content exceeds this, split it into separate files using the progressive disclosure patterns described earlier. For architectural details, see the Skills overview. Checklist for effective SkillsBefore sharing a Skill, verify: Core quality
Code and scripts
Testing
|
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
echo "=== SKILL.md ==="
cat .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
echo ""
echo "=== Line counts for all reference files ==="
wc -l .agents/skills/nemoclaw-maintainer-verify-stale/reference/*.md
echo ""
echo "=== scoring-comments-and-logging.md (first 50 lines) ==="
head -50 .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md
echo ""
echo "=== candidate-selection.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
echo ""
echo "=== brev-provisioning.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
echo ""
echo "=== by-design.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
echo ""
echo "=== environment-and-reproducer.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
echo ""
echo "=== reproduction-rubrics.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md
echo ""
echo "=== nemoclaw-skills-guide SKILL.md ==="
cat .agents/skills/nemoclaw-skills-guide/SKILL.md | head -40
echo ""
echo "=== nemoclaw-maintainer-cut-release-tag SKILL.md (tail, new Step 7) ==="
tail -100 .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.mdLength of output: 26997
✅ What's already aligned
|
| Item | Current | Recommended action |
|---|---|---|
| SKILL.md < 500 lines | ✅ | — |
| TOC on all large reference files | ✅ | — |
| One-level-deep references | ✅ | — |
| No time-sensitive data | ❌ | Remove "129 open bugs" count from candidate-selection.md |
| At least 3 evaluations | ❌ | Add evals/ directory with 3 scenario files |
| No voodoo constants | Document 14-day rationale in cut-release-tag Step 7 |
|
| Consistent terminology | Standardize reproducer vs repro |
|
| File size manageable | Consider splitting scoring-comments-and-logging.md |
CodeRabbit follow-up on PR #3327: the prior fix computed P50_MS / P90_MS straight from `/usr/bin/time -f '%e'` output, which is seconds, while the match rubric compares against SLA_P50_MS / SLA_P90_MS parsed from the issue body as milliseconds. A 1.5-second p50 would have falsely beaten a 200 ms SLA. Multiply by 1000 in the awk, format as integer ms, and update the echo suffix accordingly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
…on regex Two CodeRabbit follow-ups on PR #3327's candidate-selection.md: - Step 4 tag validation now normalizes each candidate to the full tag form (prepend `v` when absent) before `grep -Fxq` against the tag list. The body/comment capture group is digits-only by design, but tags carry the `v` — without the prepend, every body-sourced candidate was being dropped. Label-sourced candidates already carry the `v`, so the prepend is idempotent. - Split the long alternation regex behind the unanswered-maintainer question detector into four named test() clauses (literal "?", polite imperative, modal interrogative, "do you ..."). Future heuristics drop in as a single test() append rather than a regex patch. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>
## Summary - Add the v0.0.53 release notes with the user-facing onboarding, inference, policy, runtime, Hermes, and maintainer-tooling changes from the release range. - Refresh generated `nemoclaw-user-*` skills from the current Fern docs, including already-merged policy, inference, troubleshooting, and command-reference updates. - Remove skipped experimental shield wording from generated-doc source so the release-prep skip-term gate stays clean. ## Source summary - #4197 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Document pre-recreate workspace backup, abort-on-partial-backup behavior, and `NEMOCLAW_RECREATE_WITHOUT_BACKUP`. - #4273 -> `docs/about/release-notes.mdx`, `docs/reference/troubleshooting.mdx`: Document the under-provisioned runtime prompt defaulting to abort in interactive onboarding. - #4220 -> `docs/about/release-notes.mdx`, `docs/network-policy/customize-network-policy.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Include the `openclaw-pricing` preset and generated skill refresh. - #4253 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`, `docs/inference/switch-inference-providers.mdx`: Carry the Ollama runtime context-window docs into generated skills. - #4298 -> `docs/about/release-notes.mdx`, `docs/reference/troubleshooting.mdx`: Carry WSL Docker Desktop GPU guidance into generated skills and release notes. - #4297, #4210, #4221, #4225, #4288, #4306, #4311, #4319, #4342, #4284, #3327 -> `docs/about/release-notes.mdx`: Summarize release-range fixes and maintainer tooling changes that did not need new standalone docs pages. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs .agents/skills` returned no matches outside `docs/.docs-skip`. - `npm run docs` passes with full network access. Fern reports 0 errors and one existing light-mode accent contrast warning. - `FERN_VERSION=$(node -p "require('./fern/fern.config.json').version") && (cd fern && npx --yes "fern-api@${FERN_VERSION}" check --warnings)` reports 0 errors and the same contrast warning. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Documentation** * Added v0.0.53 release notes with updates to onboarding, sandbox recreation, and gateway handling * Introduced `openclaw-pricing` preset for model pricing endpoint management * Clarified Ollama context window configuration and local model validation behavior * Updated sandbox recreation workflow documentation with backup/restore details * Enhanced interactive onboarding defaults for under-provisioned runtime warnings * Revised security guidance for configuration directory permissions and immutability verification <!-- review_stack_entry_start --> [](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4360?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack) <!-- review_stack_entry_end --> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Resolves conflicts in four files after #3327 (verify-stale skill) squash- merged onto main while the dogfood branch carried the unsquashed history plus three blueprint commits on top. Resolution: - .agents/skills/nemoclaw-skills-guide/SKILL.md: take main (skill counts evolved on main from 8 to 12 maintainer skills, 9 to 10 user skills, etc.). - .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md: take HEAD (preserves the Step 3 self-check wording, the self-check.md reference-map row, and the Automation environment variables table — all added by the dogfood feat commit). - .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md: take HEAD (preserves VERIFY_STALE_FORCE_OLLAMA_ONLY provider-credential skip, the two gh --jq --arg pipe rewrites that fix the never-fired idempotency and unanswered-question gates, and ${VERIFY_STALE_BATCH_CAP:-15} prose). - .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md: take HEAD (preserves VERIFY_STALE_DRY_RUN closed-mid-run branch and the dry-run short-circuit before live comment-post + label-apply). Outside the dogfood-touched regions the verify-stale skill files are identical between branches (git diff main..pr-4279 -- on the other four reference files returns empty). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>



Summary
Adds maintainer skill
nemoclaw-maintainer-verify-stale— automates the manual loop of: spin up a Brev box → install latest NemoClaw → try to reproduce an old bug → comment with findings. Tag-only (never auto-closes); a maintainer pulls the trigger after reviewing.The skill has been driven end-to-end against 9 real candidates during development, surfacing the gaps and skill changes summarised below.
Issues tackled
wontfix-by-designwontfix-by-design(tagged)status: wont-fix, notwontfix); markdown-linked citations in comments.fixed-on-latest(tagged)sg,brev execPATH gotcha, sandbox-build-rot reframe, multi-axis architectural-drift check.fixed-on-latest(tagged)openshell sandbox exec -n NAME --syntax footgun, reproducer-extraction regex extension (openclaw|openshell).fixed-on-latest(closed by maintainer mid-run)fixed-on-latest(tagged)fixed-on-latest(tagged)state == OPENbefore posting. Second mid-batch close-by-maintainer (alongside #2519) → promote from queue to a real Step 1/3 check.Verification mode: static analysis at <tag>).Effective touched count: 7 issues with skill-rendered comments + 2 side-discoveries. Two of the seven landed
wontfix-by-design, five landedfixed-on-latest. Sandbox-build rot capped four of the five at 84 (the modal verdict shape with reporter @-mention). #2611 was the first run to land below the cap, validating the −30 LLM-synth penalty path.v1 scope
bugissues against CLI / Sandbox / OpenShell / Docker / Getting Started / Ubuntu / DGX Spark / GB10Verification flow (one-line per step)
brev ls→ reuse averify-stale-*box if running, elsebrev createwith a runtime-picked CPU SKU~/development/daily-rhythm/activity/nemoclaw-verify-stale-log.mdIdempotency
Every posted comment carries
<!-- nemoclaw-verify-stale v1 YYYY-MM-DD -->. Candidate filter excludes any issue whose comments match<!-- nemoclaw-verify-stale v\d+ YYYY-MM-DD -->within the last 7 days OR carriesfixed-on-latest/verify-inconclusive/status: wont-fix. Release sweep clears the first two on each tag cut.Active-discussion handling (two-tier)
Companion changes
nemoclaw-skills-guide/SKILL.md— adds the new skill to the maintainer skills table.nemoclaw-maintainer-cut-release-tag/SKILL.md— release-time sweep offixed-on-latestandverify-inconclusivefrom every open issue. Without it, "latest" drifts and verifications go silently stale. Verification records stay in comments; only labels reset.CI status
3abffd72(placeholder[label](url)examples in SKILL.md were being parsed as real broken links — restructured as fenced-block templates) anddc5d7143(merge of main brought in the deletion ofdocs/get-started/brev-web-ui-quickstart.mdper #3050, resolving the merge-side missing-file failure)Test plan
wontfix-by-design, maintainer agreed and closedwontfix-by-design, tagged + open awaiting closeopenclaw channels add/removenot implemented — no host-side hint shown #2592, [Linux][Brev]Ollama-local provider returns HTTP 401 Unauthorized via inference.local — wizard caches OPENAI_API_KEY env var into credentials.json and not refreshable via env unset #2519, [NemoClaw][DGX Spark][Ubuntu 24.04][CLI] nemoclaw status omits Connected/Inference fields and shows cloudflared stopped with no context #2604, [Linux][Policy] sandbox logs repeatedly emit os.networkInterfaces guard errors after NVIDIA endpoint onboard #2611) — provision → baseline → reset → latest → repro → comment + labelPlatform: DGX Sparklabel, x86_64 actual)f0ec6b8e) but untested against a live candidate (no qualifying issue in the candidate set yet)cut-release-tagtested at design level onlySigned-off-by: Prekshi Vyas prekshiv@nvidia.com
Summary by CodeRabbit
New Features
Enhancements
Documentation