feat(skills): add verify-code-changes skill #4459
Implements #406 as a bundled skill following the proposed phased design. Happy to adjust based on feedback.
…requesting-code-review
Merge the passive code-review checklist and the automated verification pipeline (from PR #4459 by @MorAlekss) into a single requesting-code-review skill. This eliminates model confusion between three overlapping skills.
Now includes:
- Static security scan (grep on diff lines)
- Baseline-aware quality gates (only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Independent reviewer subagent with fail-closed JSON verdict
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpoint and [verified] commit convention
Deletes: skills/software-development/code-review/ (absorbed)
Closes: #406 (independent code verification)
Thanks for this thorough work @MorAlekss — the static security scan, baseline-aware quality gates, auto-fix loop, and fail-closed reviewer design are all excellent patterns. We've incorporated your key contributions into the existing
Your 19-scenario test matrix was particularly impressive: the SQL injection via f-string case (caught by the reviewer but missed by the static scan) directly informed keeping the dual-layer approach. Closing in favor of #4854. Your contribution is preserved with attribution.
* chore: release v0.7.0 (2026.4.3)
168 merged PRs, 223 commits, 46 resolved issues, 40+ contributors. Highlights: pluggable memory providers, credential pools, Camofox browser, inline diff previews, API server session continuity, ACP MCP registration, gateway hardening, secret exfiltration blocking.
* refactor(skills): consolidate code-review + verify-code-changes into requesting-code-review
Merge the passive code-review checklist and the automated verification pipeline (from PR #4459 by @MorAlekss) into a single requesting-code-review skill. This eliminates model confusion between three overlapping skills.
Now includes:
- Static security scan (grep on diff lines)
- Baseline-aware quality gates (only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Independent reviewer subagent with fail-closed JSON verdict
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpoint and [verified] commit convention
Deletes: skills/software-development/code-review/ (absorbed)
Closes: #406 (independent code verification)
Closes #406
Summary
Implements the bundled verify-code-changes skill from #406.
Adds an independent, fail-closed verification pipeline using delegate_task, with:
Files added
skills/autonomous-ai-agents/verify-code-changes/SKILL.md - contains the full workflow, reviewer prompt template, baseline comparison logic, quality gate orchestration, auto-fix loop, git checkpointing convention, and session-scoped result caching
Implementation (aligned with #406)
Phase 1 - Independent reviewer:
delegate_task is called directly by the agent, not via a script, because testing revealed that delegate_task is unavailable inside execute_code sandboxes. The reviewer receives only the git diff, wrapped in XML tags for injection protection, and returns a fail-closed JSON verdict:
{
  "passed": false,
  "security_concerns": ["Possible hardcoded secret (DB_PASSWORD)"],
  "logic_errors": [],
  "suggestions": [],
  "summary": "Hardcoded credential detected"
}
Fail-closed: non-empty security_concerns or logic_errors → passed must be false. Unparseable response → false.
Verification results are cached per session using sha256(git diff), so identical diffs skip re-verification, which reduces cost and latency.
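The fail-closed rule and the per-session cache can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the skill's actual code: parse_verdict, cache_key, verify, and the in-memory _session_cache dict are hypothetical names, the reviewer's reply is assumed to arrive as a raw string, and treating missing fields as a failure is this sketch's own strict reading of "fail-closed".

```python
import hashlib
import json

# Hypothetical per-session cache: sha256(diff) -> verdict dict.
_session_cache: dict[str, dict] = {}


def cache_key(diff: str) -> str:
    """Key a verification result by the SHA-256 of the diff text."""
    return hashlib.sha256(diff.encode("utf-8")).hexdigest()


def parse_verdict(raw: str) -> dict:
    """Parse the reviewer's reply, failing closed on anything unexpected."""
    failed = {
        "passed": False,
        "security_concerns": [],
        "logic_errors": [],
        "suggestions": [],
        "summary": "Unparseable reviewer response",
    }
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return failed
    if not isinstance(data, dict):
        return failed

    concerns = data.get("security_concerns")
    errors = data.get("logic_errors")
    # Assumption: both fields must be present as lists, otherwise fail.
    well_formed = isinstance(concerns, list) and isinstance(errors, list)
    # A claimed pass is overridden by any reported concern or error.
    passed = (
        well_formed
        and data.get("passed") is True
        and not concerns
        and not errors
    )
    return {**data, "passed": passed}


def verify(diff: str, ask_reviewer) -> dict:
    """Return a cached verdict for an identical diff, else ask the reviewer.

    ask_reviewer is a stand-in callable (diff -> raw reply string); in the
    skill this role is played by the delegate_task reviewer.
    """
    key = cache_key(diff)
    if key not in _session_cache:
        _session_cache[key] = parse_verdict(ask_reviewer(diff))
    return _session_cache[key]
```

Checking `passed is True` rather than truthiness means a reply such as `"passed": "yes"` is also rejected, which keeps the default outcome a failure.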
Phase 2 - Baseline-aware quality gates:
The quality gates take a baseline snapshot before the changes and re-run after them. Only NEW failures block the commit; pre-existing issues are ignored.
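The baseline comparison reduces to a set difference over failure identifiers. A minimal sketch, assuming each gate's output has already been normalized into a set of strings; the "tool:file:rule" identifier format and the names new_failures and gate_blocks_commit are illustrative, not taken from the skill.

```python
def new_failures(baseline: set[str], current: set[str]) -> set[str]:
    """Failures present after the change that were absent from the baseline."""
    return current - baseline


def gate_blocks_commit(baseline: set[str], current: set[str]) -> bool:
    """Block only when the change introduced at least one new failure."""
    return bool(new_failures(baseline, current))


# Identifiers deliberately omit line numbers, which shift when code is
# edited and would make an old failure look new.
before = {"ruff:app.py:F401"}
after = {"ruff:app.py:F401", "pytest:test_math.py::test_multiply"}

introduced = new_failures(before, after)
blocked = gate_blocks_commit(before, after)
```

Here the unused-import warning existed before the change and is ignored, while the newly failing test is the only entry in `introduced` and causes the block.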
os.system(), subprocess shell=True, eval()/exec(), pickle, path traversal, raw IP HTTP calls, base64 decode
Phase 3 - Auto-fix loop + git checkpointing:
When verification fails, a fresh delegate_task fix agent is spawned - not the implementer, not the reviewer. It fixes ONLY the reported issues. Maximum 2 attempts. If verification still fails, the agent escalates to the user with rollback instructions.
Git checkpointing:
- [auto-checkpoint] commit before changes - enables rollback
- [verified] commit after successful verification
Test coverage
19 scenarios across security, regression, auto-fix, and multi-language - all triggered the delegate_task reviewer. Notable: SQL injection via f-string and path traversal were missed by the static scan but caught by the reviewer, demonstrating the value of independent LLM review over grep-only approaches.
Tested on macOS, Hermes v0.5.0, across Python, Node.js, Rust, and Go projects:
Security
- DB_PASSWORD = "super-secret-123" staged
- os.system() with user input
- subprocess.run
- subprocess.run(cmd, shell=True)
- shell=False + shlex.split()
- eval(expression) with user input
Correctness / regressions
- multiply changed to a + b, existing test broke
- -> int returns str
- import sys unused, baseline was clean
- add changed to a - b, existing test broke
Auto-fix behavior
- [verified] commit suggested
- [verified] commit created
Multi-language coverage
- API_KEY
- process.env.API_KEY, [verified] commit
- -> number returns string)
- &Vec instead of &[String])
- i32 + i32 assigned to String)
delegate_task called in all 19 scenarios - confirmed via 🔀 delegate in tool activity feed.
Usage
Non-goals
Why this approach
Matches the design direction in #406:
delegate_task, terminal, git)