feat(skills): add autonomous PR comparator for issues with competing fixes#3052
Conversation
…cisions Adds nemoclaw-maintainer-pr-comparator: an autonomous skill that picks the best PR among multiple candidates targeting the same issue. Runs a 3-tier check pipeline (plumbing gates, correctness, code quality) plus deterministic comparative scoring with happy-path and degraded modes. Validated against 5 historical NemoClaw cases (#2681, #2947, #893, #2636, refactor chain #2087/#2489/#2495); spec patches added for PR-state-OPEN gate, supersession detection, and workaround-vs-root-cause flag based on validation findings.
…r-comparator Slim SKILL.md (orchestration only); tier checks moved to one-level-deep reference files; output template, repo policy, and validation guidance extracted; five utility scripts added for the deterministic work that previously required Claude to inspect raw gh JSON. Substance changes per Claude best-practices review: - Description rewritten in third person, scope no longer overpromises - CodeRabbit thread check uses GraphQL on reviewThreads.isResolved (REST /comments lacks resolution state) - Candidate discovery follows a single default order with explicit stop conditions instead of "plus optionally" expansions - Author-merge-ratio tiebreaker dropped (process-noisy) - "Earlier PR" demoted to final deterministic fallback - Repo-specific assumptions pulled into repo-policy.md
📝 WalkthroughWalkthroughA new ChangesPR Comparator Skill System
Sequence DiagramsequenceDiagram
actor User
participant Skill as PR Comparator<br/>Skill
participant GitHub as GitHub API
participant Scripts as Collection<br/>Scripts
participant LLM as LLM<br/>Evaluator
participant Renderer as Verdict<br/>Renderer
User->>Skill: invoke(issue_number)
Skill->>GitHub: fetch issue (acceptance criteria + comments)
GitHub-->>Skill: issue body + comments
Skill->>Scripts: find-candidates.sh (issue_number)
Scripts->>GitHub: search PRs by body/files/title
GitHub-->>Scripts: candidate PR numbers
Scripts-->>Skill: [PR#...]
Skill->>Scripts: parse-supersession.sh (PR#...)
Scripts->>GitHub: fetch PR bodies
GitHub-->>Scripts: PR body text
Scripts-->>Skill: supersession edges
Skill->>Scripts: collect-gates.sh (PR#n)
Scripts->>GitHub: gh pr view (state, CI, mergeable, reviewDecision)
GitHub-->>Scripts: gate data
Scripts-->>Skill: gate results per PR
Skill->>Scripts: check-coderabbit-threads.sh (PR#n)
Scripts->>GitHub: GraphQL query (reviewThreads)
GitHub-->>Scripts: thread resolution state
Scripts-->>Skill: thread gate status
Skill->>LLM: evaluate Tier 1 (diffs, tests, issue criteria)
LLM-->>Skill: correctness scores + evidence
Skill->>LLM: evaluate Tier 2 (description, docs, migrations)
LLM-->>Skill: quality scores + evidence
Skill->>Skill: Tier 3: rank via tiebreakers or degraded mode
Skill->>Renderer: render-verdict.py (comparison JSON)
Renderer-->>Skill: Markdown verdict
Skill-->>User: verdict + scorecard + reasoning + evidence
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (3)
.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/parse-supersession.sh (1)
24-26: 💤 Low valueMissing validation for
--repoargument value.If
--repois passed as the last argument without a value (e.g.,parse-supersession.sh 123 --repo), the script will fail with an unbound variable error on$2due toset -u, orshift 2will fail if there's only one argument left. Adding validation improves error messaging.♻️ Proposed defensive check
--repo) + if [ -z "${2:-}" ]; then + echo "Error: --repo requires OWNER/REPO argument" >&2 + exit 64 + fi repo_args=(--repo "$2") shift 2 ;;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/parse-supersession.sh around lines 24 - 26, The handler for the --repo option in parse-supersession.sh uses "$2" and shift 2 without validating that a value exists; update the --repo case to check that a non-empty argument is present (e.g., verify $# -ge 2 or that "${2:-}" is non-empty) before assigning repo_args=(--repo "$2") and performing shift 2, and if the check fails print a clear error/usage message and exit with non-zero status so the script doesn't hit an unbound-variable or broken shift..agents/skills/nemoclaw-maintainer-pr-comparator/scripts/check-coderabbit-threads.sh (1)
98-105: 💤 Low valueConsider JSON-escaping
$bot_loginin heredoc output.If
$bot_logincontains quotes or backslashes (unlikely but possible with custom bot names), the JSON output would be malformed. Usingjqfor the final output would ensure valid JSON.♻️ Safer JSON construction with jq
-cat <<JSON -{ - "pr": $pr, - "bot_login": "$bot_login", - "gate_coderabbit_threads_resolved": $gate_pass, - "details": $counts -} -JSON +jq -n \ + --argjson pr "$pr" \ + --arg bot "$bot_login" \ + --argjson gate "$gate_pass" \ + --argjson details "$counts" \ + '{pr: $pr, bot_login: $bot, gate_coderabbit_threads_resolved: $gate, details: $details}'🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/check-coderabbit-threads.sh around lines 98 - 105, The heredoc that emits JSON uses the unescaped variable $bot_login which can break JSON if it contains quotes/backslashes; update the script (check-coderabbit-threads.sh) to build the JSON using a safe tool like jq or to explicitly JSON-escape $bot_login before emitting (e.g., construct the object with jq --arg pr "$pr" --arg bot_login "$bot_login" --argjson gate_coderabbit_threads_resolved "$gate_pass" --argjson details "$counts" '.pr = ($pr|tonumber?) | .bot_login = $bot_login | .gate_coderabbit_threads_resolved = $gate_coderabbit_threads_resolved | .details = $details' or equivalent) so the output is always valid JSON and references the same variables ($pr, $bot_login, $gate_pass, $counts)..agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py (1)
181-186: ⚡ Quick winAdd defensive handling for missing required keys.
spec["issue"]will raiseKeyErrorif the key is missing, producing a cryptic traceback. Consider validating required keys upfront or using.get()with appropriate defaults/error messages.♻️ Proposed validation
+ required_keys = ["issue", "prs"] + for key in required_keys: + if key not in spec: + print(f"Missing required key in spec: {key}", file=sys.stderr) + return 64 + issue = spec["issue"] criteria = spec.get("criteria", []) prs = spec.get("prs", [])🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py around lines 181 - 186, spec["issue"] can raise a KeyError if the input dict lacks required keys; update the code that reads spec (the block assigning issue, criteria, prs, winner, mode, tiebreaker) to validate required keys up front (e.g., check for "issue" and any other required fields) and either use spec.get(...) with sensible defaults for optional fields (criteria, prs, winner, mode, tiebreaker) or raise a clear ValueError/TypeError with a descriptive message when a required key is missing so callers get an actionable error instead of a cryptic traceback.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/collect-gates.sh:
- Around line 38-39: The jq expressions computing ci_failure_count and
ci_pending_count fail when .statusCheckRollup is null; update both queries to
null-coalesce .statusCheckRollup to an empty array (e.g. use .statusCheckRollup
// []) so the select/iteration always runs on an array; modify the expressions
that set ci_failure_count and ci_pending_count to use (.statusCheckRollup // [])
in place of .statusCheckRollup.
In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/find-candidates.sh:
- Around line 36-38: The gh CLI search invocation assigned to the candidates
variable uses an invalid flag (--in:body); move the body qualifier into the
query string so gh search prs receives the full query (e.g., include "in:body
#${issue_number}" with repo_args), retry the search without suppressing stderr
while preserving piping to sort/head, and ensure the command still outputs JSON
numbers for the jq filter; update the command that builds candidates (using gh
search prs, repo_args, issue_number, MAX_CANDIDATES) accordingly so it actually
finds PRs containing the issue number in the body.
In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py:
- Around line 113-118: tier_2_keys currently contains "workaround_vs_root_cause"
but there is no corresponding check in checks/tier-2-quality.md; fix by either
removing "workaround_vs_root_cause" from the tier_2_keys list in
render-verdict.py (so the code and docs align) or add a formal definition and
scoring for the workaround_vs_root_cause check in checks/tier-2-quality.md (and
update any max-score calculations/templates accordingly) — locate the symbol
tier_2_keys in
.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py and
make the change consistently (removal from the list OR a documented check entry
and score in tier-2-quality.md).
- Around line 205-209: The current prints use the variable winner directly and
produce confusing output when winner is None; update the code that handles mode
and winner (variables mode and winner) to validate winner first and substitute a
clear fallback (e.g., "no PR selected" or "no winner") before formatting
messages, so in the "happy" branch you print "VERDICT: MERGE PR #<id>" only when
winner is not None and otherwise print "VERDICT: no PR selected", and in the
degraded branch print "Neither mergeable yet" and "No PR is closer" (or similar)
when winner is None; ensure all f-strings using winner are guarded or replaced
with the fallback string.
In @.agents/skills/nemoclaw-maintainer-pr-comparator/templates/verdict.md:
- Line 34: The example in the verdict template shows an incorrect max score
(21.0); update the "Weighted score" row in
.agents/skills/nemoclaw-maintainer-pr-comparator/templates/verdict.md to reflect
the true maximum based on actual check counts and weights (Tier 1 = 2.0× per
check, Tier 2 = 1.0× per check). Reconcile counts with render-verdict.py (which
currently includes the Tier‑2 check workaround_vs_root_cause) and compute the
correct max (e.g., 6 Tier‑1 checks ×2 + 4 Tier‑2 checks ×1 = 16.0, or adjust if
checks differ), then replace both the numerator and denominator values so the
example matches the code.
In @.agents/skills/nemoclaw-maintainer-pr-comparator/validation/backtest.md:
- Around line 22-24: The gh search prs command uses search qualifiers inside the
query string, not as flags; replace the incorrect snippet gh search prs --repo
OWNER/REPO --merged --in:body "supersedes" --limit 30 by moving the qualifier
into the quoted query, e.g. gh search prs --repo OWNER/REPO --merged "in:body
supersedes" --limit 30 (or include repo/is:merged qualifiers inside the query as
needed) so the in:body qualifier is part of the query string.
---
Nitpick comments:
In
@.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/check-coderabbit-threads.sh:
- Around line 98-105: The heredoc that emits JSON uses the unescaped variable
$bot_login which can break JSON if it contains quotes/backslashes; update the
script (check-coderabbit-threads.sh) to build the JSON using a safe tool like jq
or to explicitly JSON-escape $bot_login before emitting (e.g., construct the
object with jq --arg pr "$pr" --arg bot_login "$bot_login" --argjson
gate_coderabbit_threads_resolved "$gate_pass" --argjson details "$counts" '.pr =
($pr|tonumber?) | .bot_login = $bot_login | .gate_coderabbit_threads_resolved =
$gate_coderabbit_threads_resolved | .details = $details' or equivalent) so the
output is always valid JSON and references the same variables ($pr, $bot_login,
$gate_pass, $counts).
In
@.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/parse-supersession.sh:
- Around line 24-26: The handler for the --repo option in parse-supersession.sh
uses "$2" and shift 2 without validating that a value exists; update the --repo
case to check that a non-empty argument is present (e.g., verify $# -ge 2 or
that "${2:-}" is non-empty) before assigning repo_args=(--repo "$2") and
performing shift 2, and if the check fails print a clear error/usage message and
exit with non-zero status so the script doesn't hit an unbound-variable or
broken shift.
In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py:
- Around line 181-186: spec["issue"] can raise a KeyError if the input dict
lacks required keys; update the code that reads spec (the block assigning issue,
criteria, prs, winner, mode, tiebreaker) to validate required keys up front
(e.g., check for "issue" and any other required fields) and either use
spec.get(...) with sensible defaults for optional fields (criteria, prs, winner,
mode, tiebreaker) or raise a clear ValueError/TypeError with a descriptive
message when a required key is missing so callers get an actionable error
instead of a cryptic traceback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7f7e4796-b974-4736-bfdb-2899699a803c
📒 Files selected for processing (13)
.agents/skills/nemoclaw-maintainer-pr-comparator/SKILL.md.agents/skills/nemoclaw-maintainer-pr-comparator/checks/tier-0-gates.md.agents/skills/nemoclaw-maintainer-pr-comparator/checks/tier-1-correctness.md.agents/skills/nemoclaw-maintainer-pr-comparator/checks/tier-2-quality.md.agents/skills/nemoclaw-maintainer-pr-comparator/repo-policy.md.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/check-coderabbit-threads.sh.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/collect-gates.sh.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/find-candidates.sh.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/parse-supersession.sh.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py.agents/skills/nemoclaw-maintainer-pr-comparator/templates/verdict.md.agents/skills/nemoclaw-maintainer-pr-comparator/tiebreakers.md.agents/skills/nemoclaw-maintainer-pr-comparator/validation/backtest.md
- collect-gates.sh: null-coalesce statusCheckRollup so jq doesn't fail when a PR has no status checks configured (CR major) - find-candidates.sh + backtest.md: `in:body` is a search qualifier, not a flag — move it inside the query string (CR critical) - tier-2-quality.md: add 2.4 workaround-vs-root-cause check that was in the locked v1 spec but dropped during the refactor; render-verdict.py was already counting it (CR major — code/docs alignment) - render-verdict.py: guard winner=None in both happy and degraded modes to avoid 'MERGE PR #None' output (CR minor) - templates/verdict.md: correct example weighted-score denominator to 16.0 (4 Tier-2 checks ×1 + 6 Tier-1 ×2) instead of stale 21.0 (CR minor)
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py:
- Around line 129-134: The current total uses all keys from
pr.get("tier_1"/"tier_2") which lets extra/hidden keys change rankings; change
the summation to iterate over the canonical key lists (tier_1_keys and
tier_2_keys) and call score_for(pr.get("tier_1", {}).get(key), TIER_1_WEIGHT)
and similarly for tier_2, so only the fixed keys affect total; keep max_total
as-is (len(tier_1_keys)*TIER_1_WEIGHT + len(tier_2_keys)*TIER_2_WEIGHT) and
ignore any extra keys in pr when computing total.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 48d51831-226d-4535-b7f4-9565c77dd699
📒 Files selected for processing (6)
.agents/skills/nemoclaw-maintainer-pr-comparator/checks/tier-2-quality.md.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/collect-gates.sh.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/find-candidates.sh.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py.agents/skills/nemoclaw-maintainer-pr-comparator/templates/verdict.md.agents/skills/nemoclaw-maintainer-pr-comparator/validation/backtest.md
✅ Files skipped from review due to trivial changes (2)
- .agents/skills/nemoclaw-maintainer-pr-comparator/validation/backtest.md
- .agents/skills/nemoclaw-maintainer-pr-comparator/templates/verdict.md
🚧 Files skipped from review as they are similar to previous changes (2)
- .agents/skills/nemoclaw-maintainer-pr-comparator/scripts/collect-gates.sh
- .agents/skills/nemoclaw-maintainer-pr-comparator/scripts/find-candidates.sh
| total = 0.0 | ||
| for status in pr.get("tier_1", {}).values(): | ||
| total += score_for(status, TIER_1_WEIGHT) | ||
| for status in pr.get("tier_2", {}).values(): | ||
| total += score_for(status, TIER_2_WEIGHT) | ||
| max_total = len(tier_1_keys) * TIER_1_WEIGHT + len(tier_2_keys) * TIER_2_WEIGHT |
There was a problem hiding this comment.
Score calculation can include hidden checks and skew ranking.
total sums all values present in tier_1/tier_2, while the table and max_total are based on fixed key lists. Extra keys in input can silently change ranking without appearing in the rendered scorecard.
🔧 Proposed fix
score_row = ["**Weighted score**"]
for pr in prs:
total = 0.0
- for status in pr.get("tier_1", {}).values():
- total += score_for(status, TIER_1_WEIGHT)
- for status in pr.get("tier_2", {}).values():
- total += score_for(status, TIER_2_WEIGHT)
+ tier_1 = pr.get("tier_1", {})
+ tier_2 = pr.get("tier_2", {})
+ for key in tier_1_keys:
+ total += score_for(tier_1.get(key, "fail"), TIER_1_WEIGHT)
+ for key in tier_2_keys:
+ total += score_for(tier_2.get(key, "fail"), TIER_2_WEIGHT)
max_total = len(tier_1_keys) * TIER_1_WEIGHT + len(tier_2_keys) * TIER_2_WEIGHT
score_row.append(f"{total:.1f} / {max_total:.1f}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| total = 0.0 | |
| for status in pr.get("tier_1", {}).values(): | |
| total += score_for(status, TIER_1_WEIGHT) | |
| for status in pr.get("tier_2", {}).values(): | |
| total += score_for(status, TIER_2_WEIGHT) | |
| max_total = len(tier_1_keys) * TIER_1_WEIGHT + len(tier_2_keys) * TIER_2_WEIGHT | |
| total = 0.0 | |
| tier_1 = pr.get("tier_1", {}) | |
| tier_2 = pr.get("tier_2", {}) | |
| for key in tier_1_keys: | |
| total += score_for(tier_1.get(key, "fail"), TIER_1_WEIGHT) | |
| for key in tier_2_keys: | |
| total += score_for(tier_2.get(key, "fail"), TIER_2_WEIGHT) | |
| max_total = len(tier_1_keys) * TIER_1_WEIGHT + len(tier_2_keys) * TIER_2_WEIGHT |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.agents/skills/nemoclaw-maintainer-pr-comparator/scripts/render-verdict.py
around lines 129 - 134, The current total uses all keys from
pr.get("tier_1"/"tier_2") which lets extra/hidden keys change rankings; change
the summation to iterate over the canonical key lists (tier_1_keys and
tier_2_keys) and call score_for(pr.get("tier_1", {}).get(key), TIER_1_WEIGHT)
and similarly for tier_2, so only the fixed keys affect total; keep max_total
as-is (len(tier_1_keys)*TIER_1_WEIGHT + len(tier_2_keys)*TIER_2_WEIGHT) and
ignore any extra keys in pr when computing total.
Summary
Adds
nemoclaw-maintainer-pr-comparator: an autonomous skill that picks the merge winner among competing PRs targeting the same issue. Runs Tier 0 plumbing gates → Tier 1 correctness checks → Tier 2 quality checks → Tier 3 deterministic ranking. Degraded mode handles the case where no PR passes gates.Why
When two contributors fix the same issue, deciding which to merge is judgment-heavy. This skill makes that judgment deterministic, explainable, and reusable across repos.
What's clever (catches what CI can't)
Validation
Backtested against 5 historical NemoClaw cases (#2681, #2947, #893, #2636, refactor chain #2087/#2489/#2495). Initial v1: 4/5. After patches (PR-state-OPEN gate, supersession detection, workaround-vs-root-cause flag): 5/5.
Skill structure (per Claude best-practices)
Slim SKILL.md (orchestration), tier checks in one-level-deep reference files, output template extracted, five utility scripts for the deterministic work, repo-specific assumptions in repo-policy.md for cross-repo reuse.
Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation