feat(skills): add nemoclaw-maintainer-verify-stale by prekshivyas · Pull Request #3327 · NVIDIA/NemoClaw

prekshivyas · 2026-05-11T03:20:47Z

Note: This PR supersedes #3063, which had unsigned commits in its history (a stale local git config in one of my clones — user.email=test@example.com, commit.gpgsign=false — produced 8 unverified commits I couldn't retroactively sign because the branch ruleset blocks force-push). Net diff vs main is unchanged. All commits on this branch are signed and authored by prekshiv@nvidia.com. The 8 post-review commits from #3063 are collapsed into a single signed commit; per-commit history for those is preserved in #3063 for reference.

Summary

Adds maintainer skill nemoclaw-maintainer-verify-stale — automates the manual loop of: spin up a Brev box → install latest NemoClaw → try to reproduce an old bug → comment with findings. Tag-only (never auto-closes); a maintainer pulls the trigger after reviewing.

The skill has been driven end-to-end against 9 real candidates during development, surfacing the gaps and skill changes summarised below.

Issues tackled

Issue	Reported	Verified	Verdict	Score	What this run surfaced (skill change)
#2791	v0.0.21	v0.0.35	`wontfix-by-design`	—	By-design prototype. Step 8.5 detection framework (3 signals: maintainer attribution / removal commit / symbol absent). Maintainer reviewed the analysis and closed the issue manually.
#2168	v0.0.21	v0.0.35	`wontfix-by-design` (tagged)	—	Repo-label vocabulary mismatch (`status: wont-fix`, not `wontfix`); markdown-linked citations in comments.
#2007	v0.0.18	v0.0.35	`fixed-on-latest` (tagged)	84/100	Docker-group `sg`, `brev exec` PATH gotcha, sandbox-build-rot reframe, multi-axis architectural-drift check.
#2592	v0.0.28	v0.0.35	`fixed-on-latest` (tagged)	84/100	Sentinel-file SSH-drop guard, `openshell sandbox exec -n NAME --` syntax footgun, reproducer-extraction regex extension (`openclaw\|openshell`).
#2519	v0.0.26	v0.0.35	`fixed-on-latest` (closed by maintainer mid-run)	84/100	Step 5 provider classification + API-key prompt, install.sh env-var passthrough, sandbox-name validator.
#2604	v0.0.28	v0.0.36	`fixed-on-latest` (tagged)	84/100	File-based API key propagation (kill argv exposure local→Brev). 300-word hard ceiling for comments + comment-authoring principle in Step 10.
#2611	v0.0.28	v0.0.36	`fixed-on-latest` (tagged)	70/100	On-box subshell argv leak (extend file-based pattern to Brev→inner shell). First non-cap-at-84 score (the −30 LLM-synth penalty pushed raw below the cap).
#2513	v0.0.26	—	maintainer self-verified mid-batch	—	Gap-queue: skill should re-check `state == OPEN` before posting. Second mid-batch close-by-maintainer (alongside #2519) → promote from queue to a real Step 1/3 check.
#2166	v0.0.21	—	still-reproduces (source-only)	—	Gap-queue: when pre-flight is conclusive, allow source-only verdicts to skip Brev entirely (`Verification mode: static analysis at <tag>`).

Effective touched count: 7 issues with skill-rendered comments + 2 side-discoveries. Two of the seven landed wontfix-by-design, five landed fixed-on-latest. Sandbox-build rot capped four of the five at 84 (the modal verdict shape with reporter @-mention). #2611 was the first run to land below the cap, validating the −30 LLM-synth penalty path.

v1 scope


In	Linux only (Brev), `bug` issues against CLI / Sandbox / OpenShell / Docker / Getting Started / Ubuntu / DGX Spark / GB10
Out	macOS, Windows/WSL, integration bugs needing third-party tokens, service-account bots, auto-close

Verification flow (one-line per step)

Step	Purpose
6.5 — Preconditions	Brev auth + install-URL reachability fail-fast (no Brev cost on dead deps)
6.7 — Local-first	Pure-CLI bugs (no sandbox/Docker/GPU) run on the maintainer's local install — same evidence, zero cost
7 — Reuse-or-provision	`brev ls` → reuse a `verify-stale-*` box if running, else `brev create` with a runtime-picked CPU SKU
8 — Validate-on-baseline, verify-on-latest	Install reported version first, confirm the reproducer actually exposes the bug, reset, then install latest
8.5 — By-design detection	Three signals (maintainer attribution / removal commit / symbol absent) short-circuit Brev cost on intentional removals
9 — Confidence scoring	+50 / +25 / +25 / −30 / −50, clamped to [0,100]. ≥85 silent · 60–84 + reporter @-mention · <60 inconclusive
10 — Comment + label	Comprehensive redaction (JWT / NVAPI / PAT / base64 / internal hostnames / emails) with HTML→text pre-pass for QA bodies. 300-word hard ceiling for fixed/wontfix; 30–80 for still-reproduces.
11 — Infra failure	Sandbox-build rot is the dominant failure mode for any version >5–7 patches behind; cap-at-84 with reporter @-mention is by design
12 — Activity log	Append per-issue + per-session entries to `~/development/daily-rhythm/activity/nemoclaw-verify-stale-log.md`

Idempotency

Every posted comment carries . Candidate filter excludes any issue whose comments match  within the last 7 days OR carries fixed-on-latest / verify-inconclusive / status: wont-fix. Release sweep clears the first two on each tag cut.

Active-discussion handling (two-tier)

Maintainer's unanswered comment age	Skill behaviour
≤ 7 days	Skip — running on top of fresh discussion would conflict with the maintainer's framing
> 7 days	Proceed with the unanswered-question variant: lead with a markdown link to the unanswered comment, dual @-mention of both maintainer and reporter

Companion changes

nemoclaw-skills-guide/SKILL.md — adds the new skill to the maintainer skills table.
nemoclaw-maintainer-cut-release-tag/SKILL.md — release-time sweep of fixed-on-latest and verify-inconclusive from every open issue. Without it, "latest" drifts and verifications go silently stale. Verification records stay in comments; only labels reset.

CI status

Check	Status
commit-lint, dco-check, installer-hash-check, legacy-path-guard, pr/changes	✅
markdown-links	✅ after `3abffd72` (placeholder `[label](url)` examples in SKILL.md were being parsed as real broken links — restructured as fenced-block templates) and `dc5d7143` (merge of main brought in the deletion of `docs/get-started/brev-web-ui-quickstart.md` per #3050, resolving the merge-side missing-file failure)

Test plan

Signed-off-by: Prekshi Vyas prekshiv@nvidia.com

Summary by CodeRabbit

New Features
- New maintainer skill to verify whether stale bug reports reproduce on the latest release, with interactive reproducer validation and evidence-backed comments/labels.
Enhancements
- Release-tag workflow now conditionally clears verification labels from open issues when verification is stale or regression risk is detected (preserves labels in specified safe cases).
Documentation
- Added comprehensive reference guides covering candidate selection, environment/reproducer setup, provisioning, reproduction rubrics, scoring, comments, and logging.

Adds a maintainer skill that automates verifying whether old bug reports still reproduce against the latest NemoClaw release. The skill picks candidate issues opened against older versions, reuses or provisions a Brev Linux box (CPU or GPU based on the bug profile), runs the extracted reproducer, scores confidence, and posts an evidence-backed comment with either a `fixed-on-latest` or `verify-inconclusive` label. Tag-only — never auto-closes. v1 scope is intentionally narrow: Linux only, no third-party integration credentials, no service-account bot. Each maintainer runs it under their own gh credentials. Companion changes: - Adds the new skill to the maintainer table in `nemoclaw-skills-guide` (count 7 -> 8 maintainer skills, 18 -> 19 total). - Adds Step 7 to `nemoclaw-maintainer-cut-release-tag` that sweeps `fixed-on-latest` and `verify-inconclusive` labels off all open issues at release time, so the next verify-stale run re-evaluates against the new latest. Verification records remain in comment history; only the labels are reset. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Three flaws surfaced when dry-running the Step 3 candidate filter against the live open-issue list (141 open bugs). 1. Step 2 used `gh release view` to detect "latest". NemoClaw tags but does not publish GitHub releases, so this returned empty. Added a fallback that picks the highest semver from `git ls-remote --tags`, which is the load-bearing path today. 2. Step 3 said "2 or more minor versions behind". Wrong vocabulary for `0.0.x` repos — patch is what's iterating. Reworded to "at least 2 versions behind in the rightmost-incrementing component", with a concrete `0.0.x` example and a forward-compat note for when NemoClaw moves to `0.1.x`. 3. Step 4 specified a naive `v?0\.0\.\d+` regex that matched IP-like strings in bug bodies (Ollama bind `0.0.0.0:11434`, loopback `127.0.0.1`, etc.) and produced phantom v0.0.0 / v0.0.1 candidates. Tightened to require `v` prefix + word boundaries first, with a context-anchored fallback that only matches `0.0.X` on lines containing `nemoclaw` or `version`. Added a clamp that rejects any parsed version greater than `$LATEST` (catches roadmap labels like `v0.0.35` from being treated as "reported on"). After these fixes the dry-run produces 33 plausible candidates out of 141 open bugs, with credible reported-version assignments verified against issue bodies. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Two more issues surfaced from a deeper second-pass dry-run. 1. The Step 4 regex was hardcoded to v0.0.x. NemoClaw is on that line today, but the skill should keep working when it moves to v0.1.x or higher. Generalized to `\bv\d+\.\d+\.\d+\b` with a context-anchored `\b\d+\.\d+\.\d+\b` fallback on lines containing nemoclaw or version (so the loose form doesn't match `0.0.0.0:11434` and `127.0.0.1`). 2. Some bug reports cite NemoClaw versions that were never tagged (e.g. `v0.1.0` × 3 issues, calver `2026.3.11` × 1). Without a tag existence check the skill would point Brev verification at a version that does not exist. Added a `git ls-remote --tags` validation pass that drops the version if no matching tag is found. Also added an implementer note documenting a real failure mode caught during the dry-run: a naive `[scan(primary)] | first | .[0] | tonumber // [scan(fallback)] | ...` pipeline silently dropped 9 valid candidates because `null | first` errors when scan returns empty, and the `//` chain did not propagate cleanly to the fallback. The SKILL.md regex itself was correct — the failure was implementation-level fragility in how the two passes were composed. The note tells implementers to bind each pass to a named variable, coalesce at the end, and explicitly test the empty-match path. After these changes the live dry-run produces 47 plausible candidates (out of 141 open bugs) and the 4 residual drops are all reporter typos that cite a version that was never released — correctly unverifiable. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Third dry-run round surfaced a precision issue. The previous regex matched any `\bv\d+\.\d+\.\d+\b` anywhere in the body, then validated against the tag list. That picked up versions belonging to *other* products that share the v0.0.x line — most often OpenShell, but also Node.js (v22.x.x), and other dependencies. Tag validation then correctly rejected the non-NemoClaw ones, but on bodies where multiple products were listed (e.g. `Environment: openshell 0.0.4, nemoclaw 0.1.0, Node.js v22.16.0`) the parser would land on OpenShell's tag-valid v0.0.4 and report it as the NemoClaw version, producing a false-positive candidate. 12 such collisions exist in the current backlog. Fix: require the version to follow `nemoclaw` (case-insensitive) within 80 non-letter, non-newline characters. The anchored regex `(?i)nemoclaw[^a-z\n]{0,80}v?(\d+\.\d+\.\d+)` matches `NemoClaw v0.0.32`, `nemoclaw 0.0.28`, `- NemoClaw: v0.0.16`, but not `openshell 0.0.4` followed later by `nemoclaw 0.1.0`. Combined with the existing tag-list validation, this collapses the four error classes — typos, calver dates, roadmap labels, and product collisions — into one "version must be a real NemoClaw release tag, mentioned next to NemoClaw" check. Also expanded the implementer note to cover three concrete failure modes hit during the dry-run: - Empty-match handling (`[]` flowing into `first | .[0]` errors). - Capture-group inconsistency between branches in a parser pipeline. - Variable-scoping bug in `select($tags | index(.))` where `.` rebinds to `$tags` and silently passes invalid labels. After this change the live dry-run produces 33 candidates out of 141 open bugs — same count as the very first round, but every candidate's parsed version is now anchored to NemoClaw and validated against the real tag list. The earlier wider counts (47, 48) included roughly 12-15 issues that would have been wasted Brev runs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…test A clean run on latest is ambiguous on its own: - Scenario A: bug really got fixed (script worked, latest passes) - Scenario B: script was junk all along (script never triggered the bug; latest passes for the same reason baseline would) These look identical from the latest pass alone, which means the previous flow could land `fixed-on-latest` on a script that never had a chance — a false positive that the +25 commit and +25 PR signals only partially mitigate. This change adds a baseline pass before the latest pass: - Step 8a: install reported version on the box. - Step 8b: run reproducer, compare output to issue's actual-result description. Match = script validated. - Step 8c: if no match (script silent OR errored on the wrong thing), synth-repro using the issue body PLUS the baseline transcript so the LLM can react to the actual failure mode. Apply -30. Retry. Still no match -> verify-inconclusive, skip latest entirely. - Step 8d: install latest, run validated reproducer. Score from this. A single trigger drives synthesis: "the script didn't expose the bug on baseline." Whether it was silent or noisy doesn't matter; both mean the script needs work. This collapses the previous separate "narrative-only -> synth" and "verbatim-failed -> ???" paths into one rule. Step 6 simplified accordingly: just extract verbatim if available, otherwise carry the issue body forward to Step 8b for on-demand synthesis. No more "give up before provisioning" branch — that decision moves into Step 8c where it has more context to work with. Step 9 adds a baseline-validation gating rule. If the reported- version install fails (old releases rot — installer URLs, deps, OS images drift), the score is capped at 84 unless commit-area or PR- mention evidence also fires. That forces the @-mention-reporter band when we lack independent corroboration, which is the honest position when we couldn't validate the script ourselves. Step 11 split into two failure types: - Latest-install fail or harness error -> hard infra failure (no label, optional comment, move on). - Baseline-install fail -> degraded mode (skip baseline gate, run latest anyway, apply Step 9 cap). Wallclock budget bumped from 15 to 25 minutes per verification to accommodate two installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Tier-1 (blocking for E2E): - Add the still-reproduces-on-latest path. If the latest run output matches the issue symptom, the bug is still live — the skill posts a "still reproducible" comment with both transcripts, applies no label, and includes a dated marker ``. No new label per user direction; a 7-day TTL on the marker handles re-verification on the next weekly cron. - Update the comment template for the two-pass flow. The previous template described a single transcript; the new one has Baseline and Latest sections plus a dedicated still-reproduces template, and the idempotency marker now carries today's date. - Update the Step 12 log template to record baseline-install, baseline-match, latest-install, and latest-result independently. - Clarify in Step 4 that REPORTED_VERSION must be the full tag string ("v0.0.32"), not just the patch number (32). Step 8a's installer expects the full tag. Tier-2: - Spell out an explicit match rubric in Step 8b: exit code agreement, symptom-phrase match (LLM-judged semantic equivalence counts), and distinguishing the bug from infra noise (DNS / auth / rate limits). - Replace the minimal `rm -rf ~/.nemoclaw` reset with a comprehensive one that stops nemoclaw/openshell processes, removes spawned Docker containers, frees common ports (8080, 18789, 9119), and removes the installed binaries and lib paths. Run before each install in 8a and 8d; idempotent so it's safe on a fresh box. - Add interactive-subcommand handling in 8b: auto-detect `nemoclaw onboard` / `configure` and try `--non-interactive`, then `--dangerously-skip-prompts`, then stdin pre-feed. Fall through to Step 8c (synth) if none work. Tier-3 / opportunistic: - Fix the actual installer command (gap 9). The installer accepts the target ref via `NEMOCLAW_INSTALL_TAG` env var, NOT a `--version` flag (verified against install.sh source). Updated 8a accordingly. - Update Step 3 idempotency: drop on either label OR a marker comment within the last 7 days. Previously a marker excluded the issue forever; the release sweep only clears labels, so the marker alone made re-verification impossible. The 7-day TTL also handles the still-reproduces case naturally. - Extend wallclock budget to 60 min when issue body contains time-sensitive keywords (`after N minutes`, `eventually`, `memory leak`, etc.). Hard ceiling at 60 — bugs that require hours fall out of v1 scope. - Keep provisioned boxes for 30 minutes on verify-inconclusive outcomes so a maintainer can `brev shell` in and triage. Reused boxes always stay. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Five concrete fixes from the pre-E2E gap analysis, plus a verified finding on gap 2 that lowered its risk profile. Gap 1 — sudo behavior. The reset script's `sudo rm -f` and `sudo rm -rf` could hang on a password prompt if the Brev image's default user lacks passwordless sudo. Switched all sudo calls to `sudo -n` (non- interactive) so they fail fast instead. Added a precondition note explaining the assumption: Brev stock images allow passwordless sudo, custom images may not. The user-local install path (~/.nemoclaw) is fully reset regardless of sudo state, so the worst case is a stale /usr/local/bin/nemoclaw symlink that the next install overwrites. Gap 2 — verified the install architecture supports old-version pinning natively. The bootstrap clones the requested ref and runs that tag's own install.sh (with a payload-marker fallback for legacy ones), so NEMOCLAW_INSTALL_TAG=v0.0.32 works architecturally. Risk reduces to "does v0.0.32's own installer still work on a 2026 OS image" — handled by Step 11's degraded-mode fall-through. No SKILL.md change needed. Gap 4 — path extraction was hand-waved. The +25 commits-touched-area weight referenced `git log v<reported>..$LATEST -- <path>` without specifying how to determine `<path>`. Added a three-tier extraction procedure: stack-trace path mentions parsed from the body and mapped to repo paths, then a component-label-to-directory map (NemoClaw CLI → bin/, Sandbox → src/lib/sandbox/, etc.), then title-keyword fallbacks. If none yields a path, skip the +25 signal entirely rather than guessing — floating the weight would inflate scores meaninglessly. Gap 5 — PR-search query was hand-waved. The +25 PR-mention weight had no concrete query. Added two-stage search: first `gh pr list --search "$ISSUE_NUMBER"` filtered for actual `#NNNN` references in body or title, then a symptom-phrase fallback if direct reference returns nothing. PRs that merged before the issue was filed are excluded via `mergedAt > tag-date(REPORTED_VERSION)`. Gap 6 — issues without an "Actual result" section had no match path. Behavioral / configuration bugs (e.g., #1242 "should default to a stable released version") describe a wrong default rather than a runtime error. Added a fallback to Step 8b's match rubric: use the title + full body as the symptom, match if reproducer's outcome contradicts the issue's expected behavior. If neither error string nor expected-behavior contradiction can be identified, route to Step 8c (synth-repro) so the LLM produces a more diagnostic script. Gap 7 — transcript redaction was thin. Step 10 only covered tokens, paths, and basic-auth URLs. Real transcripts leak more: Authorization headers, JWTs, GitHub PATs, AWS keys, NVIDIA API keys, internal hostnames, emails (PII), long base64 blobs. Expanded the redaction table with concrete patterns for each. Added an order note: longest/most-specific patterns first so the generic base64 catchall doesn't mask what was actually redacted. Made it explicit that redaction runs on every quoted chunk in the comment, not just issue- body excerpts. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…l SKU pick Preflight Brev auth and install-URL reachability before paying any cost (Step 6.5). For pure-CLI bugs with no sandbox, Docker, or GPU dependency, try the reproducer locally on the maintainer's `nemoclaw` install before provisioning a Brev box (Step 6.7) — same evidence, zero cost. Replace the `<your-team's-CPU-SKU>` placeholder in Step 7 with a runtime pick via `brev search cpu --sort price --json | jq` so the skill keeps working as SKUs change, with a `VERIFY_STALE_CPU_TYPE` env override for teams that need to pin. Drop the spurious `--yes` flag from `brev delete` — it does not exist on that subcommand and would error. `brev delete` is non-interactive by default. Print the instance name before the trap registers so manual cleanup is possible if the trap doesn't fire. Thread `INSTALL_URL` through Step 8a/8d so the URL override from Step 6.5 applies to both the baseline and latest installs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Some bugs are filed against behavior that was deliberately removed or changed in a merged PR. Running the standard rubric on these produces a misleading verdict — the symptom "still reproduces" but the right answer is "won't fix, see PR #X." Issue #2791 is the prototype: `config set` was removed in #2227, the reporter tested a version that already had it gone, and the standard rubric would have buried that context under a low-confidence `verify-inconclusive` label. Add Step 8.5 with three independent detection signals: - Maintainer-attribution comment phrasing ("removed in #N", "by design", "wontfix", "intentional") from MEMBER/OWNER/COLLABORATOR authors. - A removal commit between reported version and `$LATEST` whose subject matches `\b(remove|delete|drop|deprecate)\b` and whose diff deletes the symbol implicated by the reproducer. - The implicated symbol absent from both reported version and `$LATEST`, meaning it was already gone when the issue was filed. When any signal fires, skip Step 9 scoring and Brev provisioning, apply a new `wontfix-by-design` label, and post a focused comment that links to the responsible PR. Detection on signals 2 and 3 can run as soon as the reported version is parsed, saving Brev cost on the entire class. Generalize the Step 3 idempotency marker regex from `v1` to `v\d+` so future skill versions can re-verify older-marked issues by tightening the regex to require a specific marker version. Add `wontfix-by-design` to the idempotency label list, the activity log entry options, the session summary, and the release sweep in `nemoclaw-maintainer-cut-release-tag`. Update the skill's frontmatter description to mention the new label and the local-first short-circuit. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…var log path The issue body that comes back from `gh issue view` for NV QA bugs is HTML, not markdown — `<pre>...</pre>` blocks and tables, with HTML-encoded entities. The previous Step 6 only mentioned `<pre>` in prose without an extractor that actually parsed it. Add a Python extractor that handles markdown fences (triple-backtick and tilde), HTML `<pre>`, and entity unescape in one pass, so verbatim reproducers from QA-filed issues land cleanly instead of falling through to LLM synthesis with a -30 penalty. Step 10 already had a comprehensive redaction regex table, but the patterns assumed plain text — tokens nested in HTML tags or attributes slipped through unredacted. Add an HTML to text pre-pass for issue-body excerpts before the regex table runs, so quoted excerpts from QA bodies get the same redaction coverage transcripts already do. Step 5's GPU keyword list flagged `inference` and `model serving`, both of which false-positive on CPU bugs (`models.providers.inference.baseUrl` is a config path, not a GPU need). Tighten to whole-word matches and swap in `vllm`, `tensorrt`, `L40S`, `L4`, `T4`. Step 12's activity log was hard-coded to one maintainer's personal organizer path. Read from `VERIFY_STALE_LOG_DIR` with the existing path as default, and require the skill to create the directory rather than assuming it exists, so the skill works in CI / shared-volume / other environments. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Path-extraction map in Step 9 was written from assumption rather than verification — `src/lib/sandbox/`, `src/lib/openshell/`, and `src/lib/policy/` don't exist in the current repo, and OpenShell lives in a separate repo entirely. The +25 commits-touched-component signal would have silently misfired on every Sandbox / OpenShell / policy issue, underscoring real fixes. Replace with paths verified against the current tree (`nemoclaw/src/blueprint/`, `nemoclaw-blueprint/`, `nemoclaw/src/commands/`, etc.) and explicitly note the cross-repo OpenShell case is out of scope for v1. Step 8 reset wiped `~/.nemoclaw` but not `~/.openclaw`. Sandbox state has been writable-by-default under `.openclaw` since #2227, so it persists across the baseline → latest reinstall and contaminates the latest run. Add it to the reset. Anchor the `pkill -9 -f nemoclaw|openshell` patterns to a leading slash (`/nemoclaw`, `/openshell`) so the kill matches actual installed paths but not unrelated processes that mention these strings — including the agent harness running this skill if its working directory contains the word. Reorder the Step 10 redaction table to match the stated "longest, most specific patterns first" rule. The previous order had the generic `(token|secret|password|...)` pattern executing before JWT/AWS/NVIDIA patterns, which would have masked specific-token redactions with generic ones — losing the signal of *what kind* of credential leaked. Replace `git ls-remote git@github.com:NVIDIA/...` (Steps 2 and 4) with `gh api repos/NVIDIA/NemoClaw/tags --paginate --jq '.[].name'`. The SSH path required keys to be configured; `gh api` reuses whatever auth the user already has. Add a CLI-deps precondition check (`command -v gh brev jq python3 curl`) to Step 6.5 so the skill fails fast if any required dependency is missing, instead of failing late and confusingly mid-run. Step 11's keep-box-on-inconclusive said "delay cleanup by 30 minutes" but had no implementation. A backgrounded `sleep && brev delete` doesn't survive session end. Replace with skip-the-trap-entirely plus an explicit manual-cleanup reminder printed to the run output. Out-of-Scope was contradicting itself on macOS — Step 6.7 explicitly allows macOS for local-first runs. Clarify: Brev path is Linux-only, the local-first short-circuit on a maintainer's laptop works on macOS for manual single-issue runs. Add a `local (no Brev — Step 6.7 short-circuit)` option to the Step 12 log entry's Box field so local-first runs round-trip the activity log correctly. Frontmatter description said "latest release" but NemoClaw uses tags, not GitHub Releases — fix to "latest tag" to match Step 2. Step 4's implementer note said "Two real failure modes" but listed three; fix the off-by-one. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…-verification Step 8.5 was producing comments that were correct in direction but hand-wavy on substance. A side-by-side against an independent analysis of #2168 surfaced the gap: that analysis cited specific file:line locations, separated the literal bug-as-filed from a related failure mode that still exists on latest, called out existing CI coverage of the new workflow, and self-corrected an overstatement when prompted to re-verify. Today the skill required none of that. Restructure Step 8.5 into substeps so the rigor is mechanical, not prompt-dependent: - 8.5a: signal detection now requires verifiable evidence per signal — comment URL + quoted phrase (signal 1), commit SHA + diff range (signal 2), grep commands + outputs (signal 3). Add a sub-case for vestigial deprecation shims so a stub doesn't silently fail signal 3 by appearing in latest. - 8.5b: pre-check for related failure modes. Grep latest for the bug's symptom keywords (not the removed symbol) and require the comment to call out anything that surfaces. This is the section the side-by-side identified as missing — saying "the bug as filed can't reproduce" is not the same as "every bug shaped like this is fixed." - 8.5c: check existing test coverage for the new workflow and cite up to three test paths if found. Strengthens the comment from "trust me, it was removed" to "the new workflow is exercised by these tests." - 8.5d: explicit self-verification pass — re-run every cited command before composing the comment; bail to verify-inconclusive on any discrepancy. LLMs confidently overstate; mechanical re-verification catches it without needing a human prompt. Replace the by-design comment template with one that has mandatory sections matching the substeps: "What's structurally fixed", "Vestigial references", "What's not literally the same bug", "Existing CI coverage", "Recommendation". Each section either has concrete content or is explicitly omitted — no hand-wavy claims slip through. Add NVBugs cross-reference extraction to Step 4 (`grep -oE '\[NVB#[0-9]+\]'`) so the by-design template can append the standard "NVBugs#NNNNNNN will need a separate update; closing this GitHub issue won't propagate" reminder when the issue body carries that footer (most NV QA bugs do). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

The previous signal-2 candidate query narrowed by commit subject (`remove|delete|drop|deprecate`) before checking whether the diff deleted the implicated symbol. That filter excludes the common case where a removal lands inside a `refactor(...)` or `feat(...)` commit — e.g., PR #2227 removed `--dangerously-skip-permissions` under a "refactor(sandbox): default to mutable config" subject. A literal follower of the previous rule would have concluded signal 2 doesn't fire on issue #2168 even though the flag was deliberately removed. Switch the primary candidate lookup to `git log -S<symbol>` pickaxe search, which finds every commit whose diff changes the count of the symbol regardless of subject wording. Keep the subject-keyword query as a supplementary lookup for narrowing when the pickaxe returns many commits. Note explicitly that the commit's actual subject doesn't need to mention removal. Surfaced while testing the new Step 8.5 rigor against #2168. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…horing Two changes that surfaced from running Step 8.5 against issue #2168. Use the existing repo `wontfix` label for the by-design path instead of creating a new `wontfix-by-design` label. The repo already has `wontfix` and it's already in Step 3's issue-type skip list, so applying it gives idempotency for free without proliferating labels. The by-design verdict is encoded in the marker comment, not the label vocabulary. Important consequence: the release sweep in `nemoclaw-maintainer-cut-release-tag` does NOT clear `wontfix` — that label is also applied for non-skill reasons (scope, priority, dup decisions made by maintainers), and sweeping it would erase human triage work. Sweep stays scoped to `fixed-on-latest` and `verify-inconclusive`, which are skill-only. Add a `**Verification mode:**` line to the by-design comment template that explicitly says "static analysis at the verified-on tag — no runtime reproduction." This was the honesty an independent side-by-side analysis of #2168 caught: the previous template implied runtime confirmation even though Step 8.5 short-circuits Brev provisioning by design. Add a tag-anchoring rule above the template too — every `file:line` cite must refer to the verified-on tag, not the maintainer's HEAD, since line numbers drift between tags and main and we want comments to stay reproducible by anyone reading them later. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…ion check Bare `file:line` paths in comments force the reader to navigate manually — that's a usability bug, not a stylistic preference. Comments posted by the skill should be self-serving artifacts: a maintainer or QA reader should be able to click any citation and land on the exact content at the verified-on tag. Extend the tag-anchoring rule above the by-design template into a "tag-anchoring + linking rule" with concrete formats for file blobs, file:line, file:line-range, commit SHAs, and test paths. Note that bare `#NNNN` PR/issue references already auto-link in same-repo comments. Strengthen Step 8.5d into two passes: evidence (re-run cited commands) and link (resolve at least one rendered link per evidence section via `gh api .../contents/<path>?ref=<tag>` or `curl -fsI`). A 404 on a citation suggests verification that didn't actually happen — worse than no link at all. Bail to verify-inconclusive on any failure. Surfaced while reviewing the rendered #2168 comment: the citations were correct and tag-anchored but not clickable. Better caught now than after the first batch run posts a wall of bare paths. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Discovered while applying the by-design label to issue #2168: the skill referenced `wontfix` and `needs-info` as bare strings, but the repo's canonical labels are `status: wont-fix` and `status: needs-info` (with prefix and hyphen). Same shape in Step 3's issue-type skip list — issues labelled `status: wont-fix` or `status: needs-info` were NOT being filtered out at candidate-time as the spec intended, because the skip list keys didn't match the live labels. Three coordinated fixes: 1. Step 3 issue-type skip list now uses `status: wont-fix` and `status: needs-info`. Annotate the rule so future readers know not to substitute the bare forms. 2. Step 8.5 by-design action applies `status: wont-fix` (with quoting on the CLI). Step 12 log entry, session summary, frontmatter description, and Companion Behavior section in both the verify-stale skill and `nemoclaw-maintainer-cut-release-tag` updated to match. 3. Step 6.5 preconditions now verifies all expected labels exist on the repo with `gh label list` before any later step tries to apply them. This is the gap that bit us on #2168 — the comment posted but the label couldn't be applied because the spec name didn't match. Failing fast in 6.5 means the run aborts before any Brev cost or comment posting can happen against a misconfigured label vocabulary. Bare `wontfix` / `won't fix` / etc. are still in Signal 1's detection regex — that's intentional, those are SEARCH PATTERNS for what maintainers might write in free-text comments, not label names. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…s for faithfulness Two changes that landed before the first real Brev E2E run (#2007). The previous wallclock cap split — 25 min default with a 60 min extension keyed off "memory leak" / "over time" / "eventually" keywords — optimised for the wrong constraint. Most issues fit comfortably under 60 min once you account for two full install passes plus comprehensive resets, and the keyword-based extension forced re-runs whenever a real install or bootstrap took longer than the optimistic 25 min budget. Bump the cap to a single 60 min default and drop the keyword extension; bugs that genuinely require more than an hour fall out of v1 scope. Add a new Step 8a.5 for bootstrapping reproducer dependencies that the stock Brev image doesn't ship — local model servers (Ollama, vLLM), provider runtimes, third-party CLIs. Default policy is maximum faithfulness: install the actual dependency the reporter used rather than substituting a stub. Substituting trades faithfulness for speed, and on a 60 min budget that trade is rarely worth it; it almost always introduces a confound that makes the verdict less trustworthy. Document the canonical Ollama and vLLM bootstraps with specific attention to daemon survival between `brev exec` calls (Ollama's installer registers a systemd service; vLLM uses nohup + log file). Bootstrap runs once before Step 8b's baseline and is reused for Step 8d's latest run — model downloads are expensive and external state, so the comprehensive reset must explicitly leave them alone. Carve out one substitution case: providers that require API keys the skill cannot safely supply (NIM, OpenAI, Anthropic). Stubbing a key defeats faithfulness anyway, and a real key shouldn't sit in a verify-stale run. Substitution applies the same -30 LLM-synth penalty as Step 8b and must be documented in the rendered comment. Bootstrap failure escalates to Step 11 infra failure — do NOT silently substitute, since the maintainer opted into faithfulness for a reason. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…get on every comment Two requirements surfaced from the #2007 e2e run that should be enforced by the skill, not by per-session memory or by the prompter remembering. Mandatory reporter @-mention with confirmation language. The skill cannot independently confirm a closed-as-fixed verdict — only the reporter knows whether their original symptom is gone in their environment. The @-mention is what converts a "skill says it is fixed" claim into actionable confirmation work for QA. Add the explicit closing block (canonical wording: "please confirm the symptom is gone on a recent build and reopen with a fresh reproducer if you observe otherwise") to all three Step 10 templates: fixed-on-latest, still-reproduces, and the Step 8.5 by-design template. The previous "If this verification is wrong, please reopen..." line was passive; the new line names the reporter and asks them to act. Length target. Default rendered comments to 400-500 words. The evidence table or by-design fixed/vestigial sections are the hero; everything else has to either change the reader's mind about the verdict or be deleted. The first #2007 draft ran ~750 words and the user pushed back explicitly; encoding the cap into Step 10 means future agents do not start from scratch on length each run. Surfaced from: e2e Brev verification run on issue #2007 (the first real Brev-track exercise of the skill end to end). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Six concrete gaps that surfaced during the first end-to-end Brev run (issue #2007). Each almost produced a wrong verdict or a wasted run; each is now encoded so the next agent doesn't have to re-discover. 1. Step 9 baseline-validation cap-removal rule was too permissive. The `unless commits-touched OR PR-mention also fires` escape hatch let inferred fix evidence override the absence of a baseline run, which produced a misleading 100/100 on #2007 despite zero baseline confirmation. Tightened: cap at 84 holds regardless of corroboration; PR-search signals raise the score within the cap, never past it. Step 10 now requires an explicit one-line caveat naming the cap and the reason in the rendered Verdict section. 2. Step 6.5 install URL default switched from `nemoclaw.nvidia.com` (NVIDIA-internal, doesn't resolve from Brev) to `www.nvidia.com/ nemoclaw.sh` (public Akamai 301 redirect). Brev runs were dying at bootstrap on the wrong host. 3. Step 7 CPU SKU picker now biases by reproducer-implied memory needs. On #2007 the cheapest 2 GB SKU couldn't load a 4.8 GiB Ollama probe; onboard failed at provider validation and we burned ~25 min before re-provisioning a 16 GB box. Adds a `CPU_RAM_FLOOR` env var and a memory-floor heuristic (16 GB when reproducer references a model server, 8 GB for sandbox onboarding without a model, 4 GB for pure-CLI bugs). 4. Step 11 failure taxonomy now has a "baseline-build rot" bucket distinct from "binary install rot." Same `BASELINE_INSTALL_FAILED=1` flag and same downstream cap-and-degrade behavior, but failures at the in-image Dockerfile build phase (what we hit on v0.0.18 — the `.openclaw-data/workspace/media` symlink layer, removed entirely by #2227) get a separate label so reviewers can see *why* the old image no longer builds without re-running. 5. New Step 8a.5b documents the two non-obvious `brev exec` quirks that reproducer scripts have to handle every time: PATH does not include `~/.local/bin` in non-login shells (so reproducers must `export PATH` at the top, or callers must wrap with `bash -lc`); and the docker group requires `sg docker -c '...'` because adding the user via `usermod -aG` doesn't take effect within the same Brev session. 6. New Step 8d.5 architectural-drift check. When the diff between `$REPORTED_VERSION` and `$LATEST` touches the *tool* the reproducer's expected output depends on (e.g. `openshell forward` between v0.0.18 and v0.0.35), don't trust the reproducer's surface alone — multi-axis verification on OS-level surfaces (host listeners, NAT rules, docker ports, SSH tunnels, etc.) is required before claiming fixed-on-latest. This is the five-axis pattern we used to confirm #2007 wasn't a false positive; codifies it as a check the skill applies whenever pickaxe shows the reproducer's tool was reworked. Surfaced from: end-to-end Brev verification of issue #2007. Build the skill, exercise it on a real issue, fix what breaks, repeat — every fix here came from a concrete failure mode in one real run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…ll host Two small UX fixes surfaced from a thorough re-read. Step 6.5's brev-auth error printed three numbered options without a clear recommendation. From a non-TTY agent harness (Claude Code or any unattended runner) the right answer is option 2 — open a separate terminal, run `brev login`, complete the browser flow, come back. The previous message buried that path inside option 2's inline comment. Replace with a directive recipe that names the recommended flow first (open separate terminal, browser auth, re-run the skill) and lists the headless alternatives below as "when option 1 isn't available." Also explicitly notes that credentials persist to ~/.brev/credentials.json, so re-running the skill picks them up automatically. Step 6.5's install-URL error still suggested checking `https://nemoclaw.nvidia.com` even after the default was switched to `https://www.nvidia.com/nemoclaw.sh` in commit 296f7cd. Update the suggestion to match the new default and add an explicit "then re-run this skill" so the maintainer knows the flow is fix-and-retry, not abort. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Five gaps verified against the source during a thorough re-read pass. The audit also caught two false-positive claims I'd made earlier (`NEMOCLAW_INSTALL_TAG` env var and `comments` field in batch fetch) that didn't survive verification — those are not landed because there was nothing to fix. Step 1 batch discovery query bumped from `--limit 100` to `--limit 1000` so the skill doesn't silently drop issues beyond the page. The recent candidate triage on this repo found 129 open bugs; the previous limit would have lost 29 of them. The 20-issue per-run processing cap is downstream of Step 3/4 filters and unaffected. Step 3's 7-day comment-marker TTL had the rule but no implementation. Add a portable `gh issue view --json comments --jq` snippet that extracts marker comments newer than the cutoff. macOS and Linux date(1) syntax differ for relative-date math, so the cutoff line tries both. Without this, the first cron run after any posting will silently re-verify everything previously marked, posting duplicate comments every Monday. Step 6.5 preconditions now checks gh identity via `gh api user --jq .login` and prints the resolved login before any later step runs. Comments posted by Step 10 land under this account; surfacing it explicitly catches the multi-token / wrong-tab / mid-session-reauth case before a public comment lands under the wrong handle. Step 10 now requires a `**Verification mode:**` header line in every template (was previously by-design only). Reader should never have to guess whether a verdict came from real install logs or from static analysis. Filled in concrete defaults for the standard fixed/inconclusive template ("runtime reproduction; baseline + latest installed and run") and the still-reproduces template ("runtime reproduction; bug confirmed live"). Step 10 also requires a link-pass self-verification on every template, not just by-design's Step 8.5d. The "Tag-anchoring + linking rule" already declared every citation must be a clickable markdown link to the verified-on tag, but the verification step that resolves those links (`gh api .../contents/<path>?ref=<tag>` or `curl -fsI`) was scoped only to the by-design path. Same 404-cite risk applies to the standard template — broken citation links advertise verification work that didn't happen, and that's worse than no citation at all. Surfaced from: end-to-end audit after the #2007 e2e run, with the specific `gh issue view`/`grep` commands run against the SKILL.md to confirm each claim before listing it as a gap. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Two changes that together turn the per-run batch cap from policy into code. Lower the per-batch processing cap from 20 to 15. Sequential execution with 1-2 reused Brev boxes works fine for runs in this size range; ~2-3 hours wallclock for a full 15-issue batch is the comfortable budget before either the per-plan approval gate breaks down or the maintainer session needs to span multiple sittings. Add the slicing code that actually enforces the cap. Previously Step 1 declared "Cap at N issues per run" as policy and nothing applied it — candidate sets larger than the cap would just process to completion silently. Move the slice to the end of Step 4 (after Step 3 label filters and the Step 4 version+candidate-rule filters narrow the pool) and sort by `(-versions_behind, -age_days)` so the most stale come first. Spillover beyond 15 stays eligible for the next run via the marker-comment 7-day TTL added in 8976176. Single-issue mode bypasses the cap entirely; the maintainer named the issue explicitly. Cadence section updated from "≤20 issues" to "≤15 issues" to match. Surfaced from the post-#2007 audit: of the 13 gaps initially called out, the cap-enforcement one was operational toil (skill could chew through cost on a runaway batch). Lowering and enforcing closes that toil path. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Looked at the 24 candidates surfaced by the recent triage run by *kind* of bug rather than by version, and found six gaps where the standard rubric would produce a wrong verdict or hang. All six landed in this commit. Step 3 platform skip: drop `Platform: Jetson AGX Thor/Orin`. Brev has no equivalent silicon (embedded/edge ARM with integrated GPU is not in the SKU catalog), so any Brev verification of a Jetson-only bug produces a misleading "fixed-on-x86" verdict. `Platform: DGX Spark` and `Platform: GB10` stay in scope but Step 10 now requires a hardware-substitution caveat naming the Brev SKU we substituted. Step 3 TUI / interactive-UI skip: drop issues whose title contains TUI / dashboard UI / chat UI / keystroke / key press, or whose body describes interactive UI behavior without a non-interactive reproducer. `brev exec` does not allocate a real TTY by default; TUI reproducers hang or silently fail at the first prompt. v1 documents this as out-of-scope; v1.1 may add a script(1) / expect / tmux harness. Step 5 now classifies bug-class in addition to CPU/GPU. Four classes: `performance`, `rebuild-cycle`, `log-only`, `functional` (default). Detection heuristics from the issue body (latency thresholds, "across rebuilds" / "after restart", "see lots of error in <X> log"). Each class routes to a different Step 8 rubric. Step 8b: log-scraping. When `BUG_CLASS=log-only`, also pull `~/.openclaw/logs/*.log` and `/var/log/nemoclaw/*.log` from inside the sandbox after the reproducer runs and search them for the issue's symptom phrase. Some bugs describe symptoms in internal log files, not the reproducer's stdout; the previous match rubric only checked the transcript. Step 8b: flake-detection retry. For functional bugs, run baseline three times if the first run shows the symptom inconsistently. Mixed results (1 or 2 of 3 reproduce) trigger a "flake suspected" caveat, −25 score adjustment, and downgrade `+50 latest clean` to `+25` so a lucky-clean-latest-run on an intermittent bug doesn't silently become a fixed-on-latest verdict. New Step 8e: performance-bug verification. Multi-run latency distribution rubric — N=10 runs each side, compute p50/p90, parse SLA from issue body, match latest's distribution against the SLA rather than against a symptom phrase. Cap at 60 unless the bug is silicon-independent because Brev SKUs aren't faithful to DGX Spark or GB10 hardware for performance-shape bugs. New Step 8f: rebuild-cycle verification. Run-rebuild-rerun harness for bugs that only manifest across destroy/recreate boundaries (#2701-shape). Capture artifacts pre-rebuild, trigger `nemoclaw destroy --all --force && nemoclaw onboard`, re-capture post-rebuild, diff to determine whether the artifact persisted as the issue expects. Step 10 mandatory hardware-substitution caveat: when DGX Spark or GB10 is in the issue's labels and Step 7 substituted with a different silicon class, the comment metadata block must name the substitution explicitly so silicon-shape bugs get the right reader expectation. Surfaced from: stress-testing the skill against the actual shapes of bugs in the 24-candidate set, not just against the bugs we already verified. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Six gaps surfaced from the second e2e run (issue #2592) and the broader audit of how the skill handles provider classification. Step 5 now classifies provider in addition to CPU/GPU and bug-class. Detection signals from labels and body keywords identify NIM, Gemini, Anthropic, Bedrock, or Ollama. When the reproducer references a non-Ollama provider AND actually exercises inference, the skill stops at Step 5 and prompts the maintainer interactively before any Brev cost — three options: provide the API key, accept Ollama substitution with the -30 penalty, or skip to verify-inconclusive. Pure-CLI / pure-sandbox bugs are exempt because the provider doesn't matter. Step 6 reproducer-extraction regex extended to also match `openclaw` and `openshell` invocations as anchor words. Issue #2592's reproducer was `openclaw channels add telegram` run inside the sandbox; the previous `nemoclaw`-only regex would have missed the verbatim block. Step 8a now passes the provider env vars through to install.sh's bundled onboard step, so it doesn't fall back to the default `build` (NIM) provider and fail with the misleading NVIDIA_API_KEY missing error. NEMOCLAW_PROVIDER=ollama (the default case) makes the bundled onboard use the local Ollama set up in Step 8a.5; NVIDIA_API_KEY only flows through if the maintainer provided one at Step 5's prompt. The bundled onboard creates a throwaway sandbox that gets destroyed before the reproducer runs. The reproducer's own onboard should pass `--fresh` so a half-built install-sandbox doesn't trip the "previous session failed" guard. Step 8a.5 now has an Ollama-coverage table making explicit which bug classes Ollama covers faithfully (CLI, sandbox, networking) vs which ones it doesn't (provider-specific behavior, model-specific behavior, silicon-dependent perf). Step 5's prompt keys off this table. Step 8a.5b documents the openshell-sandbox-exec syntax footgun. The correct form is `openshell sandbox exec -n NAME -- CMD`; the wrong form silently auto-detects the sandbox by "last used" and stuffs the leftover positional into bash's $0. Same section gets a brev-exec re-execution guard via a sentinel file at ~/.verify-stale-running to prevent the double-onboard scenario when SSH drops mid-run. Step 11 reframes baseline-build-rot as the dominant failure mode for any reported version >5-7 patches behind, not an edge case. Both Brev e2e runs hit it. Cap-at-84 with reporter @-mention is the modal verdict shape, not the exception — pre-flight carries more weight than baseline runtime evidence for older bugs. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…sandbox name Two fixes from a rot-debugging investigation that nearly produced a wrong conclusion. When testing whether old NemoClaw versions actually rot under the literal end-user invocation (`https://www.nvidia.com/nemoclaw.sh`), the first test run was botched by a shell-scoping mistake: `NEMOCLAW_INSTALL_TAG=v0.0.26 curl ... | bash` scopes the env var to curl, not to the downstream bash reading from the pipe. The bash side defaulted to `latest`, resolved to v0.0.36, and the install proceeded for several minutes producing convincing-looking output before the mistake surfaced. The install script had no log line saying which version it actually resolved, so the silent failure looked like a working install of v0.0.26. Skill-side fix: after each install (Step 8a baseline and Step 8d latest), echo the resolved `nemoclaw --version` and case-match it against the requested version. Mismatch in baseline sets BASELINE_INSTALL_FAILED=1 to prevent verifying against the wrong version. Mismatch in latest logs a WARN that gets surfaced in the final comment. Cheapest possible guard against the class "I asked for X, got Y" — and the principle generalizes: print the resolved state, never trust the requested state. Also fix the sandbox name in the install.sh-passthrough block: change `NEMOCLAW_SANDBOX_NAME=__verify_stale_install__` (rejected by NemoClaw's name validator: "Allowed format: lowercase, starts with a letter, letters/numbers/internal hyphens only, ends with letter/number") to `NEMOCLAW_SANDBOX_NAME=verify-stale-install`. Surfaced during the #2519 e2e run; net-zero impact on that run because install.sh fell through on name validation rather than reaching the buggy code path, but the spec was wrong. The deeper context for the resolved-version-echo fix: when the rot hypothesis was being tested, an independent agent's logical analysis correctly identified that the public install URL DOES support `NEMOCLAW_INSTALL_TAG=<ref>` and demanded an empirical re-test. The re-test (with the env var on the bash side of the pipe) installed v0.0.26 correctly and STILL hit `Patch 4 (replaceConfigFile EACCES) not applied` — so the cap-at-84 framing held empirically, just not for the original "git-clone is the only path" reason. The full investigation is documented in findings.md Part 6's closing section ("Closing investigation: is baseline-build rot real, or our setup artifact?"). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

@cjagwani

Add a Step 3 skip rule for issues where a MEMBER/OWNER/COLLABORATOR comment was posted within the last 7 days AND the original reporter hasn't replied since. The skill is for *stale* issues; actively- discussed ones are in-flight, and posting a verify-stale verdict on top of an open clarifying question from a maintainer would conflict with their framing and confuse the reporter (now they have two simultaneous asks). Surfaced during pre-flight on issue #2757. The maintainer @cjagwani had just commented questioning the bug's premise — specifically that the reporter's "kill -9 took the parent down" framing didn't line up with how the gateway is launched (`nohup ... &` detaches it) — and asked the reporter to confirm what they actually observed on the Station. Running verify-stale would have posted a "by-design, close as wontfix" verdict on top of an active "wait, did this really happen?" exchange. Wrong move; the skill should detect this state and skip. Implementation reuses the SEVEN_DAYS_AGO cutoff already computed for the marker-TTL check, fetches the reporter via `gh issue view --json author`, then jq-walks the comments array: find the most recent maintainer comment within the cutoff, check whether any reporter comment exists after that timestamp; if maintainer commented recently AND no reporter reply since, skip. Single-issue mode prints the skip reason and exits friendlily so the maintainer knows why the skill bailed. Batch mode just moves on. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…argv The post-#2592 commit (1d625dd) added a Step 5 prompt asking the maintainer to "Export NVIDIA_API_KEY=<key>" and propagate it via `brev exec`. During the #2604 e2e run that played out as `NVIDIA_API_KEY=<value> brev exec ...` — which puts the literal key in the brev exec process's argv, visible in `ps -ef` to anyone with shell access on either the maintainer's laptop or the Brev box for the duration of the run. The skill's stated promise ("never logged") got violated by argv visibility. Fix: switch propagation to file-based. Maintainer writes the key to ~/.nvidia-api-key with 600 perms on their laptop; Step 6.5 brev-copy's the file to the Brev box (encrypted SSH); install / reproducer scripts inside the box read with `NVIDIA_API_KEY=$(cat ~/.nvidia-api-key)`, which sets the env var in the script's own process — never on a command line and never visible in `ps -ef`. Update Step 5's option-1 prompt to instruct the maintainer to use the file form and to `rm ~/.nvidia-api-key` after the run. Add a new "API-key propagation pattern" section after Step 5 that documents the file-based mechanism explicitly. Update Step 8a (baseline install) and Step 8d (latest install) to source the key from the file inside the Brev box's exec context, not from the local shell's env. Note in the docs: if the key was previously propagated via the cmdline (pre-fix), treat it as exposed and rotate. The #2604 run did this; the maintainer was reminded to rotate the NIM key after the run. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

Replace Step 10's narrow "Length target" rule with a richer "Comment authoring principle" block that captures lessons accumulated across the e2e runs. The core rule: every section in a rendered comment must either change a reader's mind about the verdict or be cut. Word counts follow from that — 500 is a ceiling, not a target. Most comments land in 200-400. Simple cases land under 200. Document why this matters: comments posted by the skill compete for a maintainer's attention against every other in-flight thread on the repo, and AI-slop prose actively reduces signal-to-noise. Add a worked-examples block citing two real iterations: - #2007 first draft was 750 words; cut to 371. - #2604 took THREE drafts before settling on 190 words because each draft padded with prose that didn't ground the verdict — which caused the verdict itself to drift. Rule learned: name the verdict in one sentence first, then cut any section that doesn't support it. Add per-verdict length targets: - fixed-on-latest: 200-400 words - wontfix (by-design): 250-500 words - verify-inconclusive: 100-200 words - Still-reproduces (no label): 30-80 words — and no transcripts (issue body has them), no @-mention (reporter knows), no architectural prose. One sentence + marker. Add an explicit "cut, by default" list naming the patterns that historically padded comments without changing verdicts. Generalizes the existing memory note into the skill body so future agents reading SKILL.md inherit the principle directly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

… ceiling Followup to fe363cf. The 200-400 / 250-500 ranges in the per-verdict table were still too permissive. Tighten to 200-300 for both fixed-on-latest and wontfix; verify-inconclusive stays at 100-200; still-reproduces stays at 30-80. Update the principle preamble: "300 is a hard ceiling for the main verdicts." Simple cases land under 200. If a draft is past 300, it's padding — cut before re-reading. Memory note (feedback_verify_stale_comment_length.md) updated to match. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…d-question variant Previously skipped any issue with a maintainer comment in the last 7 days the reporter hadn't replied to. After 7 days the question is no longer "active discussion" — it's a stuck thread, and the skill can be the unsticking voice rather than a clueless interruption. Step 3: classifies the most-recent-unanswered-maintainer-comment as recent (within 7d → skip) or stale (>7d → proceed with variant), exporting UNANSWERED_MAINT_LOGIN/URL/DATE for the templater. Step 10: new mandatory block describing the variant — prepend an "@<maint>'s comment from <date> is still unanswered" lead paragraph and swap the closing reporter-only @-mention for a dual maintainer+reporter @-mention that flags the open question. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- brev-provisioning: append `// empty` to the CPU_TYPE jq filter so an empty result yields an empty string instead of the literal "null", which `[ -n "$CPU_TYPE" ]` would otherwise accept and pass to `brev create --type "null"`. - by-design: fix "greping" -> "grepping". - candidate-selection: drop the hardcoded "8 prefixed variants" count (the inline list enumerates 11) and point readers at `gh label list --search enhancement` for the live set. - environment-and-reproducer (API key entry): replace the inline `printf '%s' '<your-key>' > ~/.nvidia-api-key` example with a no-echo, no-history `IFS= read -rs` flow; the inline form left the key in shell history. - environment-and-reproducer (local-first match): split the conjoined predicate into two explicit branches — matches reported symptom -> still-reproduces, matches expected-fixed behavior -> fixed-on-latest; third branch falls through to Brev as before. - reproduction-rubrics (Step 8c rerun): prepend the same `export PATH="$HOME/.local/bin:$PATH"` guard 8b and 8d use, so a synth-repro rerun doesn't misread as `verify-inconclusive` because the nemoclaw binary fell off PATH. - reproduction-rubrics (drift-check loop): switch from `for t in $TOOL` to `mapfile -t TOOLS` + `for t in "${TOOLS[@]}"`, so multi-word tool strings ("openshell forward") stay intact under pickaxe instead of word-splitting into separate searches. - reproduction-rubrics (Step 8e perf rubric): compute p50 as the mean of the 5th and 6th sorted values for N=10 (standard median), and add the missing p90 line as the nearest-rank 9th value. Match rubric grows a p90 backstop that flips a within-p50 verdict to still-reproduces when an issue-declared p90 SLA is missed. - scoring-comments-and-logging (idempotency marker): replace the two hardcoded `2026-05-12` template dates with the `YYYY-MM-DD` placeholder the surrounding prose already calls for. - scoring-comments-and-logging (Markdown capitalization): capitalize "Markdown" in two prose mentions. - scoring-comments-and-logging (still-reproduces contract): make the per-verdict table on L174 canonical — strip the closing reporter @-mention, baseline transcript, and latest transcript from the template body so it actually fits the 30–80-word target. Scope the unanswered-question dual @-mention rule to fixed-on-latest and by-design only (still-reproduces has no closing @-mention to replace); the lead-paragraph half of the rule still applies to all three templates. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md (1)
172-196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Document version-prefix normalization between extraction and validation.

The body/comment regex (line 173) captures only the digit portion (\d+\.\d+\.\d+) without the v prefix, but:

Labels include the v (line 172: ^v\d+\.\d+\.\d+$)

Tag list entries include the v (line 49: ^v[0-9]+)

Validation uses grep -Fxq "$V" against tags with v prefix (line 189)

Output requires full tag format with v (line 196: REPORTED_VERSION="v0.0.32")

Without an explicit normalization step to prepend v to body/comment captures before validation, the grep -Fxq check will fail for versions extracted from body/comments, incorrectly dropping valid candidates.
📝 Recommended addition

Add a normalization note after line 176:
**Normalize captured versions.** Body and comment captures lack the `v` prefix (the capture group excludes it), but labels and tags include it. Before validation, prepend `v` to any version that doesn't already start with it:

\```bash
# Normalize each candidate version
for V in "${CANDIDATE_VERSIONS[@]}"; do
  [[ "$V" =~ ^v ]] || V="v$V"
  # Now validate against tag list...
done
\```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
around lines 172 - 196, The extraction regex captures only digits (e.g.,
(\d+\.\d+\.\d+)) but validation uses tag names with a leading v and sets
REPORTED_VERSION to the full tag; add a normalization step before validation
that ensures every candidate in your CANDIDATE_VERSIONS list is converted to the
full tag form (prepend "v" when the string does not already start with "v") so
that grep -Fxq "$V" against the tag list and the final REPORTED_VERSION
assignment operate on matching tag strings; update the validation flow that
invokes grep -Fxq and the logic that selects the smallest surviving version to
use the normalized/tag-prefixed values.

🧹 Nitpick comments (1)

.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md (1)

114-119: ⚡ Quick win

Consider breaking the question-detection regex into multiple filters.

The single regex pattern on line 118 combines multiple question-detection heuristics in one long expression, making it harder to maintain or extend. Consider using multiple select() clauses or extracting the pattern to a shell variable with documentation.

♻️ Refactor option: Multiple select filters

 UNANSWERED_MAINT=$(gh issue view "$ISSUE_NUMBER" --repo NVIDIA/NemoClaw --json comments \
   --jq --arg reporter "$REPORTER" --arg cutoff "$SEVEN_DAYS_AGO" '
     (.comments
      | map(select((.authorAssociation == "MEMBER" or .authorAssociation == "OWNER" or .authorAssociation == "COLLABORATOR")
-         and (.body | test("\\?|(?i)\\bplease (confirm|share|provide|clarify|tell|verify|check|let me know|let us know)|(?i)\\b(could|can|would) you\\b|(?i)\\bdo you (have|know|see|use)\\b"))))
+         and (.body | test("\\?")
+                   or test("(?i)\\bplease (confirm|share|provide|clarify|tell|verify|check|let me know|let us know)")
+                   or test("(?i)\\b(could|can|would) you\\b")
+                   or test("(?i)\\bdo you (have|know|see|use)\\b"))))
      | sort_by(.createdAt) | last) as $maint

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
around lines 114 - 119, The combined question-detection regex inside the gh
issue view pipeline (used to set UNANSWERED_MAINT) is too long and hard to
maintain; refactor by splitting the single select(...) that contains the big
test(...) into multiple select() clauses (e.g., one for authorAssociation
membership and separate selects for question patterns like a simple "?" test,
polite-request keywords, modal verbs, and "do you ..." patterns) or move each
regex into a clearly named shell variable and reference those variables in
separate select() calls so the logic in the pipeline (the gh issue view ... --jq
expression that produces $maint) is easier to read and extend.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md:
- Around line 219-234: The measurements from /usr/bin/time are in seconds but
the SLA vars are milliseconds, so convert the computed p50/p90 to milliseconds
before comparing (or alternatively rename SLA vars to seconds); specifically
update the P50_MS and P90_MS calculations (and the corresponding baseline
variables) that read from latest-perf.log/baseline-perf.log so the awk/shell
math multiplies the second values by 1000 and formats as integers (e.g., compute
P50_MS = (mean*1000) and P90_MS = (value*1000)), update the echo suffix to "ms",
and ensure comparisons use SLA_P50_MS and SLA_P90_MS as millisecond integers.

---

Outside diff comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
- Around line 172-196: The extraction regex captures only digits (e.g.,
(\d+\.\d+\.\d+)) but validation uses tag names with a leading v and sets
REPORTED_VERSION to the full tag; add a normalization step before validation
that ensures every candidate in your CANDIDATE_VERSIONS list is converted to the
full tag form (prepend "v" when the string does not already start with "v") so
that grep -Fxq "$V" against the tag list and the final REPORTED_VERSION
assignment operate on matching tag strings; update the validation flow that
invokes grep -Fxq and the logic that selects the smallest surviving version to
use the normalized/tag-prefixed values.

---

Nitpick comments:
In
@.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md:
- Around line 114-119: The combined question-detection regex inside the gh issue
view pipeline (used to set UNANSWERED_MAINT) is too long and hard to maintain;
refactor by splitting the single select(...) that contains the big test(...)
into multiple select() clauses (e.g., one for authorAssociation membership and
separate selects for question patterns like a simple "?" test, polite-request
keywords, modal verbs, and "do you ..." patterns) or move each regex into a
clearly named shell variable and reference those variables in separate select()
calls so the logic in the pipeline (the gh issue view ... --jq expression that
produces $maint) is easier to read and extend.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b4bcfcff-fe48-424e-9219-6822388d6445

📥 Commits

Reviewing files that changed from the base of the PR and between 4962da5 and 33a350c.

📒 Files selected for processing (6)

.agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md

✅ Files skipped from review due to trivial changes (4)

.agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
.agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md

cv · 2026-05-26T16:35:08Z

@CodeRabbit let's look at these skills considering Anthropic's best practices:

Skill authoring best practices

Good Skills are concise, well-structured, and tested with real usage. This guide provides practical authoring decisions to help you write Skills that Claude can discover and use effectively.

For conceptual background on how Skills work, see the Skills overview.

Core principles

Concise is key

The context window is a public good. Your Skill shares the context window with everything else Claude needs to know, including:

The system prompt
Conversation history
Other Skills' metadata
Your actual request

Not every token in your Skill has an immediate cost. At startup, only the metadata (name and description) from all Skills is pre-loaded. Claude reads SKILL.md only when the Skill becomes relevant, and reads additional files only as needed. However, being concise in SKILL.md still matters: once Claude loads it, every token competes with conversation history and other context.

Default assumption: Claude is already very smart

Only add context Claude doesn't already have. Challenge each piece of information:

"Does Claude really need this explanation?"
"Can I assume Claude knows this?"
"Does this paragraph justify its token cost?"

Good example: Concise (approximately 50 tokens):

## Extract PDF text

Use pdfplumber for text extraction:

```python
import pdfplumber

with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

Bad example: Too verbose (approximately 150 tokens):

## Extract PDF text

PDF (Portable Document Format) files are a common file format that contains
text, images, and other content. To extract text from a PDF, you'll need to
use a library. There are many libraries available for PDF processing, but
pdfplumber is recommended because it's easy to use and handles most cases well.
First, you'll need to install it using pip. Then you can use the code below...

The concise version assumes Claude knows what PDFs are and how libraries work.

Set appropriate degrees of freedom

Match the level of specificity to the task's fragility and variability.

High freedom (text-based instructions):

Use when:

Multiple approaches are valid
Decisions depend on context
Heuristics guide the approach

Example:

## Code review process

1. Analyze the code structure and organization
2. Check for potential bugs or edge cases
3. Suggest improvements for readability and maintainability
4. Verify adherence to project conventions

Medium freedom (pseudocode or scripts with parameters):

Use when:

A preferred pattern exists
Some variation is acceptable
Configuration affects behavior

Example:

## Generate report

Use this template and customize as needed:

```python
def generate_report(data, format="markdown", include_charts=True):
    # Process data
    # Generate output in specified format
    # Optionally include visualizations
```

Low freedom (specific scripts, few or no parameters):

Use when:

Operations are fragile and error-prone
Consistency is critical
A specific sequence must be followed

Example:

## Database migration

Run exactly this script:

```bash
python scripts/migrate.py --verify --backup
```

Do not modify the command or add additional flags.

Analogy: Think of Claude as a robot exploring a path:

Narrow bridge with cliffs on both sides: There's only one safe way forward. Provide specific guardrails and exact instructions (low freedom). Example: database migrations that must run in exact sequence.
Open field with no hazards: Many paths lead to success. Give general direction and trust Claude to find the best route (high freedom). Example: code reviews where context determines the best approach.

Test with all models you plan to use

Skills act as additions to models, so effectiveness depends on the underlying model. Test your Skill with all the models you plan to use it with.

Testing considerations by model:

Claude Haiku (fast, economical): Does the Skill provide enough guidance?
Claude Sonnet (balanced): Is the Skill clear and efficient?
Claude Opus (powerful reasoning): Does the Skill avoid over-explaining?

What works perfectly for Opus might need more detail for Haiku. If you plan to use your Skill across multiple models, aim for instructions that work well with all of them.

Skill structure

**YAML Frontmatter:** The SKILL.md frontmatter requires two fields:

name:

Maximum 64 characters
Must contain only lowercase letters, numbers, and hyphens
Cannot contain XML tags
Cannot contain reserved words: "anthropic", "claude"

description:

Must be non-empty
Maximum 1024 characters
Cannot contain XML tags
Should describe what the Skill does and when to use it

For complete Skill structure details, see the Skills overview.

Naming conventions

Use consistent naming patterns to make Skills easier to reference and discuss. Consider using gerund form (verb + -ing) for Skill names, as this clearly describes the activity or capability the Skill provides.

Remember that the name field must use lowercase letters, numbers, and hyphens only.

Good naming examples (gerund form):

processing-pdfs
analyzing-spreadsheets
managing-databases
testing-code
writing-documentation

Acceptable alternatives:

Noun phrases: pdf-processing, spreadsheet-analysis
Action-oriented: process-pdfs, analyze-spreadsheets

Avoid:

Vague names: helper, utils, tools
Overly generic: documents, data, files
Reserved words: anthropic-helper, claude-tools
Inconsistent patterns within your skill collection

Consistent naming makes it easier to:

Reference Skills in documentation and conversations
Understand what a Skill does at a glance
Organize and search through multiple Skills
Maintain a professional, cohesive skill library

Writing effective descriptions

The description field enables Skill discovery and should include both what the Skill does and when to use it.

**Always write in third person**. The description is injected into the system prompt, and inconsistent point-of-view can cause discovery problems.

Good: "Processes Excel files and generates reports"
Avoid: "I can help you process Excel files"
Avoid: "You can use this to process Excel files"

Be specific and include key terms. Include both what the Skill does and specific triggers/contexts for when to use it.

Each Skill has exactly one description field. The description is critical for skill selection: Claude uses it to choose the right Skill from potentially 100+ available Skills. Your description must provide enough detail for Claude to know when to select this Skill, while the rest of SKILL.md provides the implementation details.

Effective examples:

PDF Processing skill:

description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.

Excel Analysis skill:

description: Analyze Excel spreadsheets, create pivot tables, generate charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files.

Git Commit Helper skill:

description: Generate descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes.

Avoid vague descriptions like these:

description: Helps with documents

description: Processes data

description: Does stuff with files

Progressive disclosure patterns

SKILL.md serves as an overview that points Claude to detailed materials as needed, like a table of contents in an onboarding guide. For an explanation of how progressive disclosure works, see How Skills work in the overview.

Practical guidance:

Keep SKILL.md body under 500 lines for optimal performance
Split content into separate files when approaching this limit
Use the patterns below to organize instructions, code, and resources effectively

Visual overview: From simple to complex

A basic Skill starts with just a SKILL.md file containing metadata and instructions:

As your Skill grows, you can bundle additional content that Claude loads only when needed:

The complete Skill directory structure might look like this:

pdf/
├── SKILL.md              # Main instructions (loaded when triggered)
├── FORMS.md              # Form-filling guide (loaded as needed)
├── reference.md          # API reference (loaded as needed)
├── examples.md           # Usage examples (loaded as needed)
└── scripts/
    ├── analyze_form.py   # Utility script (executed, not loaded)
    ├── fill_form.py      # Form filling script
    └── validate.py       # Validation script

Pattern 1: High-level guide with references

---
name: pdf-processing
description: Extracts text and tables from PDF files, fills forms, and merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Extract text with pdfplumber:
```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

## Advanced features

**Form filling**: See [FORMS.md](FORMS.md) for complete guide
**API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
**Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns

Claude loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.

Pattern 2: Domain-specific organization

For Skills with multiple domains, organize content by domain to avoid loading irrelevant context. When a user asks about sales metrics, Claude only needs to read sales-related schemas, not finance or marketing data. This keeps token usage low and context focused.

bigquery-skill/
├── SKILL.md (overview and navigation)
└── reference/
    ├── finance.md (revenue, billing metrics)
    ├── sales.md (opportunities, pipeline)
    ├── product.md (API usage, features)
    └── marketing.md (campaigns, attribution)

# BigQuery Data Analysis

## Available datasets

**Finance**: Revenue, ARR, billing → See [reference/finance.md](reference/finance.md)
**Sales**: Opportunities, pipeline, accounts → See [reference/sales.md](reference/sales.md)
**Product**: API usage, features, adoption → See [reference/product.md](reference/product.md)
**Marketing**: Campaigns, attribution, email → See [reference/marketing.md](reference/marketing.md)

## Quick search

Find specific metrics using grep:

```bash
grep -i "revenue" reference/finance.md
grep -i "pipeline" reference/sales.md
grep -i "api usage" reference/product.md
```

Pattern 3: Conditional details

Show basic content, link to advanced content:

# DOCX Processing

## Creating documents

Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).

## Editing documents

For simple edits, modify the XML directly.

**For tracked changes**: See [REDLINING.md](REDLINING.md)
**For OOXML details**: See [OOXML.md](OOXML.md)

Claude reads REDLINING.md or OOXML.md only when the user needs those features.

Avoid deeply nested references

Claude may partially read files when they're referenced from other referenced files. When encountering nested references, Claude might use commands like head -100 to preview content rather than reading entire files, resulting in incomplete information.

Keep references one level deep from SKILL.md. All reference files should link directly from SKILL.md to ensure Claude reads complete files when needed.

Bad example: Too deep:

# SKILL.md
See [advanced.md](advanced.md)...

# advanced.md
See [details.md](details.md)...

# details.md
Here's the actual information...

Good example: One level deep:

# SKILL.md

**Basic usage**: [instructions in SKILL.md]
**Advanced features**: See [advanced.md](advanced.md)
**API reference**: See [reference.md](reference.md)
**Examples**: See [examples.md](examples.md)

Structure longer reference files with table of contents

For reference files longer than 100 lines, include a table of contents at the top. This ensures Claude can see the full scope of available information even when previewing with partial reads.

Example:

# API Reference

## Contents
- Authentication and setup
- Core methods (create, read, update, delete)
- Advanced features (batch operations, webhooks)
- Error handling patterns
- Code examples

## Authentication and setup
...

## Core methods
...

Claude can then read the complete file or jump to specific sections as needed.

For details on how this filesystem-based architecture enables progressive disclosure, see the Runtime environment section in the Advanced section below.

Workflows and feedback loops

Use workflows for complex tasks

Break complex operations into clear, sequential steps. For particularly complex workflows, provide a checklist that Claude can copy into its response and check off as it progresses.

Example 1: Research synthesis workflow (for Skills without code):

## Research synthesis workflow

Copy this checklist and track your progress:

```
Research Progress:
- [ ] Step 1: Read all source documents
- [ ] Step 2: Identify key themes
- [ ] Step 3: Cross-reference claims
- [ ] Step 4: Create structured summary
- [ ] Step 5: Verify citations
```

**Step 1: Read all source documents**

Review each document in the `sources/` directory. Note the main arguments and supporting evidence.

**Step 2: Identify key themes**

Look for patterns across sources. What themes appear repeatedly? Where do sources agree or disagree?

**Step 3: Cross-reference claims**

For each major claim, verify it appears in the source material. Note which source supports each point.

**Step 4: Create structured summary**

Organize findings by theme. Include:
- Main claim
- Supporting evidence from sources
- Conflicting viewpoints (if any)

**Step 5: Verify citations**

Check that every claim references the correct source document. If citations are incomplete, return to Step 3.

This example shows how workflows apply to analysis tasks that don't require code. The checklist pattern works for any complex, multi-step process.

Example 2: PDF form filling workflow (for Skills with code):

## PDF form filling workflow

Copy this checklist and check off items as you complete them:

```
Task Progress:
- [ ] Step 1: Analyze the form (run analyze_form.py)
- [ ] Step 2: Create field mapping (edit fields.json)
- [ ] Step 3: Validate mapping (run validate_fields.py)
- [ ] Step 4: Fill the form (run fill_form.py)
- [ ] Step 5: Verify output (run verify_output.py)
```

**Step 1: Analyze the form**

Run: `python scripts/analyze_form.py input.pdf`

This extracts form fields and their locations, saving to `fields.json`.

**Step 2: Create field mapping**

Edit `fields.json` to add values for each field.

**Step 3: Validate mapping**

Run: `python scripts/validate_fields.py fields.json`

Fix any validation errors before continuing.

**Step 4: Fill the form**

Run: `python scripts/fill_form.py input.pdf fields.json output.pdf`

**Step 5: Verify output**

Run: `python scripts/verify_output.py output.pdf`

If verification fails, return to Step 2.

Clear steps prevent Claude from skipping critical validation. The checklist helps both Claude and you track progress through multi-step workflows.

Implement feedback loops

Common pattern: Run validator → fix errors → repeat

This pattern greatly improves output quality.

Example 1: Style guide compliance (for Skills without code):

## Content review process

1. Draft your content following the guidelines in STYLE_GUIDE.md
2. Review against the checklist:
   - Check terminology consistency
   - Verify examples follow the standard format
   - Confirm all required sections are present
3. If issues found:
   - Note each issue with specific section reference
   - Revise the content
   - Review the checklist again
4. Only proceed when all requirements are met
5. Finalize and save the document

This shows the validation loop pattern using reference documents instead of scripts. The "validator" is STYLE_GUIDE.md, and Claude performs the check by reading and comparing.

Example 2: Document editing process (for Skills with code):

## Document editing process

1. Make your edits to `word/document.xml`
2. **Validate immediately**: `python ooxml/scripts/validate.py unpacked_dir/`
3. If validation fails:
   - Review the error message carefully
   - Fix the issues in the XML
   - Run validation again
4. **Only proceed when validation passes**
5. Rebuild: `python ooxml/scripts/pack.py unpacked_dir/ output.docx`
6. Test the output document

The validation loop catches errors early.

Content guidelines

Avoid time-sensitive information

Don't include information that will become outdated:

Bad example: Time-sensitive (will become wrong):

If you're doing this before August 2025, use the old API.
After August 2025, use the new API.

Good example (use "old patterns" section):

## Current method

Use the v2 API endpoint: `api.example.com/v2/messages`

## Old patterns

<details>
<summary>Legacy v1 API (deprecated 2025-08)</summary>

The v1 API used: `api.example.com/v1/messages`

This endpoint is no longer supported.
</details>

The old patterns section provides historical context without cluttering the main content.

Use consistent terminology

Choose one term and use it throughout the Skill:

Good - Consistent:

Always "API endpoint"
Always "field"
Always "extract"

Bad - Inconsistent:

Mix "API endpoint", "URL", "API route", "path"
Mix "field", "box", "element", "control"
Mix "extract", "pull", "get", "retrieve"

Consistency helps Claude understand and follow instructions.

Common patterns

Template pattern

Provide templates for output format. Match the level of strictness to your needs.

For strict requirements (like API responses or data formats):

## Report structure

ALWAYS use this exact template structure:

```markdown
# [Analysis Title]

## Executive summary
[One-paragraph overview of key findings]

## Key findings
- Finding 1 with supporting data
- Finding 2 with supporting data
- Finding 3 with supporting data

## Recommendations
1. Specific actionable recommendation
2. Specific actionable recommendation
```

For flexible guidance (when adaptation is useful):

## Report structure

Here is a sensible default format, but use your best judgment based on the analysis:

```markdown
# [Analysis Title]

## Executive summary
[Overview]

## Key findings
[Adapt sections based on what you discover]

## Recommendations
[Tailor to the specific context]
```

Adjust sections as needed for the specific analysis type.

Examples pattern

For Skills where output quality depends on seeing examples, provide input/output pairs just like in regular prompting:

## Commit message format

Generate commit messages following these examples:

**Example 1:**
Input: Added user authentication with JWT tokens
Output:
```
feat(auth): implement JWT-based authentication

Add login endpoint and token validation middleware
```

**Example 2:**
Input: Fixed bug where dates displayed incorrectly in reports
Output:
```
fix(reports): correct date formatting in timezone conversion

Use UTC timestamps consistently across report generation
```

**Example 3:**
Input: Updated dependencies and refactored error handling
Output:
```
chore: update dependencies and refactor error handling

- Upgrade lodash to 4.17.21
- Standardize error response format across endpoints
```

Follow this style: type(scope): brief description, then detailed explanation.

Examples help Claude understand the desired style and level of detail more clearly than descriptions alone.

Conditional workflow pattern

Guide Claude through decision points:

## Document modification workflow

1. Determine the modification type:

   **Creating new content?** → Follow "Creation workflow" below
   **Editing existing content?** → Follow "Editing workflow" below

2. Creation workflow:
   - Use docx-js library
   - Build document from scratch
   - Export to .docx format

3. Editing workflow:
   - Unpack existing document
   - Modify XML directly
   - Validate after each change
   - Repack when complete

If workflows become large or complicated with many steps, consider pushing them into separate files and tell Claude to read the appropriate file based on the task at hand.

Evaluation and iteration

Build evaluations first

Create evaluations BEFORE writing extensive documentation. This ensures your Skill solves real problems rather than documenting imagined ones.

Evaluation-driven development:

Identify gaps: Run Claude on representative tasks without a Skill. Document specific failures or missing context
Create evaluations: Build three scenarios that test these gaps
Establish baseline: Measure Claude's performance without the Skill
Write minimal instructions: Create just enough content to address the gaps and pass evaluations
Iterate: Execute evaluations, compare against baseline, and refine

This approach ensures you're solving actual problems rather than anticipating requirements that may never materialize.

Evaluation structure:

{
  "skills": ["pdf-processing"],
  "query": "Extract all text from this PDF file and save it to output.txt",
  "files": ["test-files/document.pdf"],
  "expected_behavior": [
    "Successfully reads the PDF file using an appropriate PDF processing library or command-line tool",
    "Extracts text content from all pages in the document without missing any pages",
    "Saves the extracted text to a file named output.txt in a clear, readable format"
  ]
}

This example demonstrates a data-driven evaluation with a simple testing rubric. There is not currently a built-in way to run these evaluations. Users can create their own evaluation system. Evaluations are your source of truth for measuring Skill effectiveness.

Develop Skills iteratively with Claude

The most effective Skill development process involves Claude itself. Work with one instance of Claude ("Claude A") to create a Skill that is used by other instances ("Claude B"). Claude A helps you design and refine instructions, while Claude B tests them in real tasks. This works because Claude models understand both how to write effective agent instructions and what information agents need.

Creating a new Skill:

Complete a task without a Skill: Work through a problem with Claude A using normal prompting. As you work, you'll naturally provide context, explain preferences, and share procedural knowledge. Notice what information you repeatedly provide.
Identify the reusable pattern: After completing the task, identify what context you provided that would be useful for similar future tasks.

Example: If you worked through a BigQuery analysis, you might have provided table names, field definitions, filtering rules (like "always exclude test accounts"), and common query patterns.
Ask Claude A to create a Skill: "Create a Skill that captures this BigQuery analysis pattern we just used. Include the table schemas, naming conventions, and the rule about filtering test accounts."
Claude models understand the Skill format and structure natively. You don't need special system prompts or a "writing skills" skill to get Claude to help create Skills. Simply ask Claude to create a Skill and it generates properly structured SKILL.md content with appropriate frontmatter and body content.
Review for conciseness: Check that Claude A hasn't added unnecessary explanations. Ask: "Remove the explanation about what win rate means - Claude already knows that."
Improve information architecture: Ask Claude A to organize the content more effectively. For example: "Organize this so the table schema is in a separate reference file. We might add more tables later."
Test on similar tasks: Use the Skill with Claude B (a fresh instance with the Skill loaded) on related use cases. Observe whether Claude B finds the right information, applies rules correctly, and handles the task successfully.
Iterate based on observation: If Claude B struggles or misses something, return to Claude A with specifics: "When Claude used this Skill, it forgot to filter by date for Q4. Should we add a section about date filtering patterns?"

Iterating on existing Skills:

The same hierarchical pattern continues when improving Skills. You alternate between:

Working with Claude A (the expert who helps refine the Skill)
Testing with Claude B (the agent using the Skill to perform real work)
Observing Claude B's behavior and bringing insights back to Claude A

Use the Skill in real workflows: Give Claude B (with the Skill loaded) actual tasks, not test scenarios
Observe Claude B's behavior: Note where it struggles, succeeds, or makes unexpected choices

Example observation: "When I asked Claude B for a regional sales report, it wrote the query but forgot to filter out test accounts, even though the Skill mentions this rule."
Return to Claude A for improvements: Share the current SKILL.md and describe what you observed. Ask: "I noticed Claude B forgot to filter test accounts when I asked for a regional report. The Skill mentions filtering, but maybe it's not prominent enough?"
Review Claude A's suggestions: Claude A might suggest reorganizing to make rules more prominent, using stronger language like "MUST filter" instead of "always filter", or restructuring the workflow section.
Apply and test changes: Update the Skill with Claude A's refinements, then test again with Claude B on similar requests
Repeat based on usage: Continue this observe-refine-test cycle as you encounter new scenarios. Each iteration improves the Skill based on real agent behavior, not assumptions.

Gathering team feedback:

Share Skills with teammates and observe their usage
Ask: Does the Skill activate when expected? Are instructions clear? What's missing?
Incorporate feedback to address blind spots in your own usage patterns

Why this approach works: Claude A understands agent needs, you provide domain expertise, Claude B reveals gaps through real usage, and iterative refinement improves Skills based on observed behavior rather than assumptions.

Observe how Claude navigates Skills

As you iterate on Skills, pay attention to how Claude actually uses them in practice. Watch for:

Unexpected exploration paths: Does Claude read files in an order you didn't anticipate? This might indicate your structure isn't as intuitive as you thought
Missed connections: Does Claude fail to follow references to important files? Your links might need to be more explicit or prominent
Overreliance on certain sections: If Claude repeatedly reads the same file, consider whether that content should be in the main SKILL.md instead
Ignored content: If Claude never accesses a bundled file, it might be unnecessary or poorly signaled in the main instructions

Iterate based on these observations rather than assumptions. The 'name' and 'description' in your Skill's metadata are particularly critical. Claude uses these when deciding whether to trigger the Skill in response to the current task. Make sure they clearly describe what the Skill does and when it should be used.

Anti-patterns to avoid

Avoid Windows-style paths

Always use forward slashes in file paths, even on Windows:

✓ Good: scripts/helper.py, reference/guide.md
✗ Avoid: scripts\helper.py, reference\guide.md

Unix-style paths work across all platforms, while Windows-style paths cause errors on Unix systems.

Avoid offering too many options

Don't present multiple approaches unless necessary:

**Bad example: Too many choices** (confusing):
"You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image, or..."

**Good example: Provide a default** (with escape hatch):
"Use pdfplumber for text extraction:
```python
import pdfplumber
```

For scanned PDFs requiring OCR, use pdf2image with pytesseract instead."

Advanced: Skills with executable code

The sections below focus on Skills that include executable scripts. If your Skill uses only markdown instructions, skip to Checklist for effective Skills.

Solve, don't punt

When writing scripts for Skills, handle error conditions rather than punting to Claude.

Good example: Handle errors explicitly:

def process_file(path):
    """Process a file, creating it if it doesn't exist."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # Create file with default content instead of failing
        print(f"File {path} not found, creating default")
        with open(path, "w") as f:
            f.write("")
        return ""
    except PermissionError:
        # Provide alternative instead of failing
        print(f"Cannot access {path}, using default")
        return ""

Bad example: Punt to Claude:

def process_file(path):
    # Just fail and let Claude figure it out
    return open(path).read()

Configuration parameters should also be justified and documented to avoid "voodoo constants" (Ousterhout's law). If you don't know the right value, how will Claude determine it?

Good example: Self-documenting:

# HTTP requests typically complete within 30 seconds
# Longer timeout accounts for slow connections
REQUEST_TIMEOUT = 30

# Three retries balances reliability vs speed
# Most intermittent failures resolve by the second retry
MAX_RETRIES = 3

Bad example: Magic numbers:

TIMEOUT = 47  # Why 47?
RETRIES = 5  # Why 5?

Provide utility scripts

Even if Claude could write a script, pre-made scripts offer advantages:

Benefits of utility scripts:

More reliable than generated code
Save tokens (no need to include code in context)
Save time (no code generation required)
Ensure consistency across uses

The diagram above shows how executable scripts work alongside instruction files. The instruction file (forms.md) references the script, and Claude can execute it without loading its contents into context.

Important distinction: Make clear in your instructions whether Claude should:

Execute the script (most common): "Run analyze_form.py to extract fields"
Read it as reference (for complex logic): "See analyze_form.py for the field extraction algorithm"

For most utility scripts, execution is preferred because it's more reliable and efficient. See the Runtime environment section below for details on how script execution works.

Example:

## Utility scripts

**analyze_form.py**: Extract all form fields from PDF

```bash
python scripts/analyze_form.py input.pdf > fields.json
```

Output format:
```json
{
  "field_name": {"type": "text", "x": 100, "y": 200},
  "signature": {"type": "sig", "x": 150, "y": 500}
}
```

**validate_boxes.py**: Check for overlapping bounding boxes

```bash
python scripts/validate_boxes.py fields.json
# Returns: "OK" or lists conflicts
```

**fill_form.py**: Apply field values to PDF

```bash
python scripts/fill_form.py input.pdf fields.json output.pdf
```

Use visual analysis

When inputs can be rendered as images, have Claude analyze them:

## Form layout analysis

1. Convert PDF to images:
   ```bash
   python scripts/pdf_to_images.py form.pdf
   ```

2. Analyze each page image to identify form fields
3. Claude can see field locations and types visually

In this example, you'd need to write the `pdf_to_images.py` script.

Claude's vision capabilities help understand layouts and structures.

Create verifiable intermediate outputs

When Claude performs complex, open-ended tasks, it can make mistakes. The "plan-validate-execute" pattern catches errors early by having Claude first create a plan in a structured format, then validate that plan with a script before executing it.

Example: Imagine asking Claude to update 50 form fields in a PDF based on a spreadsheet. Without validation, Claude might reference non-existent fields, create conflicting values, miss required fields, or apply updates incorrectly.

Solution: Use the workflow pattern shown above (PDF form filling), but add an intermediate changes.json file that gets validated before applying changes. The workflow becomes: analyze → create plan file → validate plan → execute → verify.

Why this pattern works:

Catches errors early: Validation finds problems before changes are applied
Machine-verifiable: Scripts provide objective verification
Reversible planning: Claude can iterate on the plan without touching originals
Clear debugging: Error messages point to specific problems

When to use: Batch operations, destructive changes, complex validation rules, high-stakes operations.

Implementation tip: Make validation scripts verbose with specific error messages like "Field 'signature_date' not found. Available fields: customer_name, order_total, signature_date_signed" to help Claude fix issues.

Package dependencies

Skills run in the code execution environment with platform-specific limitations:

claude.ai: Can install packages from npm and PyPI and pull from GitHub repositories
Claude API: Has no network access and no runtime package installation

List required packages in your SKILL.md and verify they're available in the code execution tool documentation.

Runtime environment

Skills run in a code execution environment with filesystem access, bash commands, and code execution capabilities. For the conceptual explanation of this architecture, see The Skills architecture in the overview.

How this affects your authoring:

How Claude accesses Skills:

Metadata pre-loaded: At startup, the name and description from all Skills' YAML frontmatter are loaded into the system prompt
Files read on-demand: Claude uses bash Read tools to access SKILL.md and other files from the filesystem when needed
Scripts executed efficiently: Utility scripts can be executed via bash without loading their full contents into context. Only the script's output consumes tokens
No context penalty for large files: Reference files, data, or documentation don't consume context tokens until actually read

File paths matter: Claude navigates your skill directory like a filesystem. Use forward slashes (reference/guide.md), not backslashes
Name files descriptively: Use names that indicate content: form_validation_rules.md, not doc2.md
Organize for discovery: Structure directories by domain or feature
- Good: reference/finance.md, reference/sales.md
- Bad: docs/file1.md, docs/file2.md
Bundle comprehensive resources: Include complete API docs, extensive examples, large datasets; no context penalty until accessed
Prefer scripts for deterministic operations: Write validate_form.py rather than asking Claude to generate validation code
Make execution intent clear:
- "Run analyze_form.py to extract fields" (execute)
- "See analyze_form.py for the extraction algorithm" (read as reference)
Test file access patterns: Verify Claude can navigate your directory structure by testing with real requests

Example:

bigquery-skill/
├── SKILL.md (overview, points to reference files)
└── reference/
    ├── finance.md (revenue metrics)
    ├── sales.md (pipeline data)
    └── product.md (usage analytics)

When the user asks about revenue, Claude reads SKILL.md, sees the reference to reference/finance.md, and invokes bash to read just that file. The sales.md and product.md files remain on the filesystem, consuming zero context tokens until needed. This filesystem-based model is what enables progressive disclosure. Claude can navigate and selectively load exactly what each task requires.

For complete details on the technical architecture, see How Skills work in the Skills overview.

MCP tool references

If your Skill uses MCP (Model Context Protocol) tools, always use fully qualified tool names to avoid "tool not found" errors.

Format: ServerName:tool_name

Example:

Use the BigQuery:bigquery_schema tool to retrieve table schemas.
Use the GitHub:create_issue tool to create issues.

Where:

BigQuery and GitHub are MCP server names
bigquery_schema and create_issue are the tool names within those servers

Without the server prefix, Claude may fail to locate the tool, especially when multiple MCP servers are available.

Avoid assuming tools are installed

Don't assume packages are available:

**Bad example: Assumes installation**:
"Use the pdf library to process the file."

**Good example: Explicit about dependencies**:
"Install required package: `pip install pypdf`

Then use it:
```python
from pypdf import PdfReader
reader = PdfReader("file.pdf")
```"

Technical notes

YAML frontmatter requirements

The SKILL.md frontmatter requires name and description fields with specific validation rules:

name: Maximum 64 characters, lowercase letters/numbers/hyphens only, no XML tags, no reserved words
description: Maximum 1024 characters, non-empty, no XML tags

See the Skills overview for complete structure details.

Token budgets

Keep SKILL.md body under 500 lines for optimal performance. If your content exceeds this, split it into separate files using the progressive disclosure patterns described earlier. For architectural details, see the Skills overview.

Checklist for effective Skills

Before sharing a Skill, verify:

Core quality

Code and scripts

Scripts solve problems rather than punt to Claude
Error handling is explicit and helpful
No "voodoo constants" (all values justified)
Required packages listed in instructions and verified as available
Scripts have clear documentation
No Windows-style paths (all forward slashes)
Validation/verification steps for critical operations
Feedback loops included for quality-critical tasks

Testing

At least three evaluations created
Tested with Haiku, Sonnet, and Opus
Tested with real usage scenarios
Team feedback incorporated (if applicable)

coderabbitai · 2026-05-26T16:37:01Z

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== SKILL.md ==="
cat .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md
echo ""
echo "=== Line counts for all reference files ==="
wc -l .agents/skills/nemoclaw-maintainer-verify-stale/reference/*.md
echo ""
echo "=== scoring-comments-and-logging.md (first 50 lines) ==="
head -50 .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md
echo ""
echo "=== candidate-selection.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md
echo ""
echo "=== brev-provisioning.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/brev-provisioning.md
echo ""
echo "=== by-design.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/by-design.md
echo ""
echo "=== environment-and-reproducer.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/environment-and-reproducer.md
echo ""
echo "=== reproduction-rubrics.md (first 30 lines) ==="
head -30 .agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md
echo ""
echo "=== nemoclaw-skills-guide SKILL.md ==="
cat .agents/skills/nemoclaw-skills-guide/SKILL.md | head -40
echo ""
echo "=== nemoclaw-maintainer-cut-release-tag SKILL.md (tail, new Step 7) ==="
tail -100 .agents/skills/nemoclaw-maintainer-cut-release-tag/SKILL.md

Length of output: 26997

@cv — here's a structured gap analysis against the Anthropic best practices you shared.

✅ What's already aligned

Practice	Status
SKILL.md body under 500 lines	✅ ~61 lines
Third-person, specific description with triggers	✅ Solid; covers invocation verbs and label names
Progressive disclosure — one level deep	✅ All 6 reference docs are linked directly from SKILL.md
TOC on every reference file > 100 lines	✅ Every reference file has a `## Contents` block
Progress checklist pattern	✅ Copy-paste checklist in SKILL.md
Low-freedom instructions for fragile ops	✅ Specific bash scripts for Brev provisioning, gh commands; natural language for classification decisions
Feedback loop / self-verification pass	✅ `by-design.md` Step 8.5d forces re-running every evidence command before posting
No deeply nested references	✅ SKILL.md → `reference/*.md`, nothing deeper
Naming consistent within the collection	✅ Matches `nemoclaw-maintainer-*` pattern throughout

⚠️ Gaps and recommendations

1. `scoring-comments-and-logging.md` is 479 lines — split candidate

At 479 lines this is the only reference file approaching unwieldy territory. The doc covers four distinct concerns: scoring (Step 9), comment/label/project (Step 10), infra failures (Step 11), and logging/cadence (Step 12). Consider splitting into two files to keep each focused and reduce partial-read risk:

reference/
  scoring-and-comment.md   # Steps 9–11 (scoring, redaction, templates, race check, project move, infra)
  logging-and-cadence.md   # Step 12 + Cadence + Out of scope + Companion behavior

Update the SKILL.md reference map to point to both, which remains one level deep.

2. Time-sensitive operational fact hardcoded in `candidate-selection.md`

Line referencing the candidate triage run:

"the candidate triage run found 129 open bugs; an earlier --limit 100 would have missed 29"

This is the "bad example" from the best practices — a concrete count that will become stale. The rationale (don't use --limit 100 because the real pool exceeds that) is worth keeping; the specific count is not. Suggested rewrite:

Use `--limit 1000` so the skill doesn't silently drop issues beyond the page size.
The open-bug pool has historically exceeded 100 issues; a lower limit silently
truncates the candidate pool before the per-run processing cap is applied.

3. No structured evaluations

The best practices say to build evaluations before writing extensive documentation. The PR documents e2e runs against 9 candidates but there are no machine-readable eval fixtures. Even a lightweight evals/ directory with three scenario JSON files would satisfy the checklist:

// evals/01-single-issue-fixed-on-latest.json
{
  "skills": ["nemoclaw-maintainer-verify-stale"],
  "query": "Verify issue `#2791`",
  "expected_behavior": [
    "Reads candidate-selection.md to establish latest tag",
    "Detects by-design signal in comments before provisioning Brev",
    "Posts a wont-fix comment with commit SHA evidence",
    "Applies status: wont-fix label, does not auto-close"
  ]
}

Suggested minimum coverage: one fixed-on-latest, one still-reproduces, one by-design / wont-fix scenario.

4. Magic number: 14-day sweep threshold in `cut-release-tag` Step 7

The bash script uses [ "$AGE_DAYS" -ge 14 ] with no adjacent comment explaining why 14 days. Per the "no voodoo constants" principle:

# 14-day window: matches the 7-day idempotency TTL (candidate-selection.md Step 3)
# plus a 7-day reporter confirmation window.  Chosen so a freshly-applied label
# from the previous week's batch run is never swept on the next release cut.
AGE_DAYS_THRESHOLD=14

5. Naming: gerund form not used (low priority)

The best practices prefer gerund form (verifying-stale-issues) but explicitly list action-oriented names as acceptable, and note that consistency within the collection matters most. Since all eight nemoclaw-maintainer-* skills follow the <noun>-<verb>-<object> pattern, no change is needed — just flagging it as an intentional deviation that's justified by collection consistency.

6. Minor: `reproducer` vs `repro` terminology drift

SKILL.md and most reference files use reproducer consistently, but reproduction-rubrics.md uses repro as shorthand in a few places (e.g., "synth-repro"). Consider standardizing to reproducer throughout, or explicitly defining repro as the shorthand on first use in each file.

Summary checklist delta

Item	Current	Recommended action
SKILL.md < 500 lines	✅	—
TOC on all large reference files	✅	—
One-level-deep references	✅	—
No time-sensitive data	❌	Remove "129 open bugs" count from `candidate-selection.md`
At least 3 evaluations	❌	Add `evals/` directory with 3 scenario files
No voodoo constants	⚠️	Document 14-day rationale in `cut-release-tag` Step 7
Consistent terminology	⚠️	Standardize `reproducer` vs `repro`
File size manageable	⚠️	Consider splitting `scoring-comments-and-logging.md`

CodeRabbit follow-up on PR #3327: the prior fix computed P50_MS / P90_MS straight from `/usr/bin/time -f '%e'` output, which is seconds, while the match rubric compares against SLA_P50_MS / SLA_P90_MS parsed from the issue body as milliseconds. A 1.5-second p50 would have falsely beaten a 200 ms SLA. Multiply by 1000 in the awk, format as integer ms, and update the echo suffix accordingly. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

…on regex Two CodeRabbit follow-ups on PR #3327's candidate-selection.md: - Step 4 tag validation now normalizes each candidate to the full tag form (prepend `v` when absent) before `grep -Fxq` against the tag list. The body/comment capture group is digits-only by design, but tags carry the `v` — without the prepend, every body-sourced candidate was being dropped. Label-sourced candidates already carry the `v`, so the prepend is idempotent. - Split the long alternation regex behind the unanswered-maintainer question detector into four named test() clauses (literal "?", polite imperative, modal interrogative, "do you ..."). Future heuristics drop in as a single test() append rather than a regex patch. Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

## Summary - Add the v0.0.53 release notes with the user-facing onboarding, inference, policy, runtime, Hermes, and maintainer-tooling changes from the release range. - Refresh generated `nemoclaw-user-*` skills from the current Fern docs, including already-merged policy, inference, troubleshooting, and command-reference updates. - Remove skipped experimental shield wording from generated-doc source so the release-prep skip-term gate stays clean. ## Source summary - #4197 -> `docs/about/release-notes.mdx`, `docs/reference/commands.mdx`: Document pre-recreate workspace backup, abort-on-partial-backup behavior, and `NEMOCLAW_RECREATE_WITHOUT_BACKUP`. - #4273 -> `docs/about/release-notes.mdx`, `docs/reference/troubleshooting.mdx`: Document the under-provisioned runtime prompt defaulting to abort in interactive onboarding. - #4220 -> `docs/about/release-notes.mdx`, `docs/network-policy/customize-network-policy.mdx`, `docs/network-policy/integration-policy-examples.mdx`: Include the `openclaw-pricing` preset and generated skill refresh. - #4253 -> `docs/about/release-notes.mdx`, `docs/inference/use-local-inference.mdx`, `docs/inference/switch-inference-providers.mdx`: Carry the Ollama runtime context-window docs into generated skills. - #4298 -> `docs/about/release-notes.mdx`, `docs/reference/troubleshooting.mdx`: Carry WSL Docker Desktop GPU guidance into generated skills and release notes. - #4297, #4210, #4221, #4225, #4288, #4306, #4311, #4319, #4342, #4284, #3327 -> `docs/about/release-notes.mdx`: Summarize release-range fixes and maintainer tooling changes that did not need new standalone docs pages. ## Verification - `python3 scripts/docs-to-skills.py docs/ .agents/skills/ --prefix nemoclaw-user --doc-platform fern-mdx` - `rg "permissive mode|shields down|shields up|shields status|config rotate-token|rotate-token" docs .agents/skills` returned no matches outside `docs/.docs-skip`. - `npm run docs` passes with full network access. Fern reports 0 errors and one existing light-mode accent contrast warning. - `FERN_VERSION=$(node -p "require('./fern/fern.config.json').version") && (cd fern && npx --yes "fern-api@${FERN_VERSION}" check --warnings)` reports 0 errors and the same contrast warning.  ## Summary by CodeRabbit * **Documentation** * Added v0.0.53 release notes with updates to onboarding, sandbox recreation, and gateway handling * Introduced `openclaw-pricing` preset for model pricing endpoint management * Clarified Ollama context window configuration and local model validation behavior * Updated sandbox recreation workflow documentation with backup/restore details * Enhanced interactive onboarding defaults for under-provisioned runtime warnings * Revised security guidance for configuration directory permissions and immutability verification  [![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NemoClaw/pull/4360?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

Resolves conflicts in four files after #3327 (verify-stale skill) squash- merged onto main while the dogfood branch carried the unsquashed history plus three blueprint commits on top. Resolution: - .agents/skills/nemoclaw-skills-guide/SKILL.md: take main (skill counts evolved on main from 8 to 12 maintainer skills, 9 to 10 user skills, etc.). - .agents/skills/nemoclaw-maintainer-verify-stale/SKILL.md: take HEAD (preserves the Step 3 self-check wording, the self-check.md reference-map row, and the Automation environment variables table — all added by the dogfood feat commit). - .agents/skills/nemoclaw-maintainer-verify-stale/reference/candidate-selection.md: take HEAD (preserves VERIFY_STALE_FORCE_OLLAMA_ONLY provider-credential skip, the two gh --jq --arg pipe rewrites that fix the never-fired idempotency and unanswered-question gates, and ${VERIFY_STALE_BATCH_CAP:-15} prose). - .agents/skills/nemoclaw-maintainer-verify-stale/reference/scoring-comments-and-logging.md: take HEAD (preserves VERIFY_STALE_DRY_RUN closed-mid-run branch and the dry-run short-circuit before live comment-post + label-apply). Outside the dogfood-touched regions the verify-stale skill files are identical between branches (git diff main..pr-4279 -- on the other four reference files returns empty). Signed-off-by: Prekshi Vyas <prekshiv@nvidia.com>

prekshivyas and others added 30 commits May 5, 2026 14:29

coderabbitai Bot reviewed May 26, 2026

View reviewed changes

Comment thread .agents/skills/nemoclaw-maintainer-verify-stale/reference/reproduction-rubrics.md Outdated

prekshivyas added 2 commits May 26, 2026 10:35

cv added v0.0.51 Release target v0.0.52 Release target and removed v0.0.51 Release target labels May 26, 2026

prekshivyas added 3 commits May 26, 2026 14:50

Merge branch 'main' into feat/verify-stale-skill-v2

64d7129

Merge branch 'main' into feat/verify-stale-skill-v2

222ca42

Merge branch 'main' into feat/verify-stale-skill-v2

bc2b1ab

prekshivyas mentioned this pull request May 27, 2026

feat(blueprint): dogfood-verify-stale blueprint + autonomous-mode env vars #4279

Closed

4 tasks

cv added v0.0.53 Release target and removed v0.0.52 Release target labels May 27, 2026

cv approved these changes May 27, 2026

View reviewed changes

cv disabled auto-merge May 27, 2026 03:18

cv merged commit b10e785 into main May 27, 2026
16 checks passed

prekshivyas deleted the feat/verify-stale-skill-v2 branch May 27, 2026 03:19

miyoungc mentioned this pull request May 27, 2026

docs: refresh v0.0.53 notes and skill links #4360

Merged

This was referenced May 28, 2026

chore: skills signing batch 1 #4438

Merged

feat(skills): add release notes maintainer workflow #4490

Merged

wscurran added bug-fix PR fixes a bug or regression feature PR adds or expands user-visible functionality area: skills Skills, agent behaviors, prompts, or skill packaging and removed fix bug-fix PR fixes a bug or regression labels Jun 3, 2026

Conversation

prekshivyas commented May 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Issues tackled

v1 scope

Verification flow (one-line per step)

Idempotency

Active-discussion handling (two-tier)

Companion changes

CI status

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cv commented May 26, 2026

Core principles

Concise is key

Set appropriate degrees of freedom

Test with all models you plan to use

Skill structure

Naming conventions

Writing effective descriptions

Progressive disclosure patterns

Visual overview: From simple to complex

Pattern 1: High-level guide with references

Pattern 2: Domain-specific organization

Pattern 3: Conditional details

Avoid deeply nested references

Structure longer reference files with table of contents

Workflows and feedback loops

Use workflows for complex tasks

Implement feedback loops

Content guidelines

Avoid time-sensitive information

Use consistent terminology

Common patterns

Template pattern

Examples pattern

Conditional workflow pattern

Evaluation and iteration

Build evaluations first

Develop Skills iteratively with Claude

Observe how Claude navigates Skills

Anti-patterns to avoid

Avoid Windows-style paths

Avoid offering too many options

Advanced: Skills with executable code

Solve, don't punt

Provide utility scripts

Use visual analysis

Create verifiable intermediate outputs

Package dependencies

Runtime environment

MCP tool references

Avoid assuming tools are installed

Technical notes

YAML frontmatter requirements

Token budgets

Checklist for effective Skills

Core quality

Code and scripts

Testing

Uh oh!

coderabbitai Bot commented May 26, 2026

✅ What's already aligned

⚠️ Gaps and recommendations

1. scoring-comments-and-logging.md is 479 lines — split candidate

2. Time-sensitive operational fact hardcoded in candidate-selection.md

3. No structured evaluations

4. Magic number: 14-day sweep threshold in cut-release-tag Step 7

5. Naming: gerund form not used (low priority)

6. Minor: reproducer vs repro terminology drift

Summary checklist delta

Uh oh!

Uh oh!

Reviewers

prekshivyas commented May 11, 2026 •

edited by coderabbitai Bot

Loading

1. `scoring-comments-and-logging.md` is 479 lines — split candidate

2. Time-sensitive operational fact hardcoded in `candidate-selection.md`

4. Magic number: 14-day sweep threshold in `cut-release-tag` Step 7

6. Minor: `reproducer` vs `repro` terminology drift