feat(skills): add verify-code-changes skill by MorAlekss · Pull Request #4459 · NousResearch/hermes-agent

MorAlekss · 2026-04-01T14:25:22Z

Closes #406

Summary

Implements the bundled verify-code-changes skill from #406.

Adds an independent, fail-closed verification pipeline using delegate_task, with:

- isolated reviewer (no shared context)
- baseline-aware quality gates
- auto-fix loop + git checkpointing

Files added

skills/autonomous-ai-agents/verify-code-changes/SKILL.md - contains the full workflow, reviewer prompt template, baseline comparison logic, quality gate orchestration, auto-fix loop, git checkpointing convention, and session-scoped result caching

Implementation (aligned with #406)

Phase 1 — Independent reviewer:

delegate_task is called directly by the agent rather than from a script, because testing showed that delegate_task is unavailable inside execute_code sandboxes. The reviewer receives only the git diff, wrapped in XML tags for injection protection, and returns a fail-closed JSON verdict:

{
  "passed": false,
  "security_concerns": ["Possible hardcoded secret (DB_PASSWORD)"],
  "logic_errors": [],
  "suggestions": [],
  "summary": "Hardcoded credential detected"
}

Fail-closed: non-empty security_concerns or logic_errors → passed must be false. Unparseable response → false.
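The fail-closed rule can be sketched as a small parser. This is only an illustration under stated assumptions: the function name `parse_verdict` and the fallback summary text are hypothetical and do not come from SKILL.md; only the JSON field names are taken from the verdict shown above.

```python
import json
from typing import Any


def parse_verdict(raw: str) -> dict[str, Any]:
    """Parse a reviewer reply, failing closed on anything unexpected.

    Hypothetical helper: the field names follow the verdict in the PR
    description, everything else is an assumption.
    """
    failed: dict[str, Any] = {
        "passed": False,
        "security_concerns": [],
        "logic_errors": [],
        "suggestions": [],
        "summary": "Unparseable reviewer response",
    }
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return failed  # not JSON at all -> treated as a failure
    if not isinstance(data, dict):
        return failed

    verdict = dict(failed)
    verdict["summary"] = str(data.get("summary", ""))
    for key in ("security_concerns", "logic_errors", "suggestions"):
        value = data.get(key, [])
        if not isinstance(value, list):
            return failed  # wrong shape -> treated as a failure
        verdict[key] = value

    # "passed" survives only if it is literally true AND there are no
    # blocking findings; a reviewer cannot pass a diff it also flagged.
    verdict["passed"] = (
        data.get("passed") is True
        and not verdict["security_concerns"]
        and not verdict["logic_errors"]
    )
    return verdict
```

With this shape, a reply that claims `"passed": true` while listing a security concern is still reported as failed.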

Verification results are cached per session using sha256(git diff), allowing identical diffs to skip re-verification and reducing cost/latency.
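A minimal sketch of that caching idea follows. The in-memory dictionary and the names `diff_key` and `verify_cached` are assumptions made for illustration; the PR description only states that results are keyed by sha256 of the git diff within a session.

```python
import hashlib
from typing import Callable

# Session-scoped cache: lives only as long as the process (assumption).
_verdict_cache: dict[str, dict] = {}


def diff_key(diff_text: str) -> str:
    """Return the sha256 hex digest of the diff text."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def verify_cached(diff_text: str, verify: Callable[[str], dict]) -> dict:
    """Run `verify` once per distinct diff and reuse the stored verdict."""
    key = diff_key(diff_text)
    if key not in _verdict_cache:
        _verdict_cache[key] = verify(diff_text)
    return _verdict_cache[key]
```

Any change to the diff, even whitespace, produces a different digest and therefore a fresh verification.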

Phase 2 — Baseline-aware quality gates:

A baseline snapshot of the gates is taken before the changes, and the gates are re-run afterward. Only NEW failures block the commit; pre-existing issues are ignored.
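The baseline comparison amounts to a set difference. The sketch below assumes each failure can be reduced to a stable string identifier (for example a test ID or a lint rule plus file name); that representation and the function names are hypothetical.

```python
def new_failures(baseline: set[str], current: set[str]) -> set[str]:
    """Failures present after the change that were not in the baseline."""
    return current - baseline


def gate_blocks_commit(baseline: set[str], current: set[str]) -> bool:
    """A gate blocks only when the change introduced at least one failure."""
    return bool(new_failures(baseline, current))


# Example: one test was already failing before the change.
before = {"tests/test_io.py::test_legacy"}
after = {"tests/test_io.py::test_legacy", "tests/test_math.py::test_multiply"}
introduced = new_failures(before, after)
```

Here `introduced` contains only `tests/test_math.py::test_multiply`, so the legacy failure does not block the commit. Identifiers that embed line numbers would need normalizing first, since unrelated edits can shift them.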

- Static scan via grep on added diff lines: hardcoded secrets, os.system(), subprocess shell=True, eval()/exec(), pickle, path traversal, raw IP HTTP calls, base64 decode
- Linting: ruff (Python), eslint (Node), cargo clippy (Rust), golangci-lint (Go)
- Type checking: mypy (Python), tsc (Node), cargo check (Rust), go vet (Go)
- Tests: pytest / npm test / cargo test / go test, auto-detected by project files
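The static scan can be approximated in a few lines. The skill itself uses grep; the Python version below is a stand-in for illustration, and the regular expressions are assumptions covering only part of the listed categories, not the patterns shipped in SKILL.md.

```python
import re
from typing import Iterator

# Illustrative heuristics only (assumed patterns, not the skill's own).
PATTERNS: dict[str, re.Pattern[str]] = {
    "hardcoded secret": re.compile(
        r"\b\w*(password|secret|api_key|token)\w*\s*=\s*['\"][^'\"]+['\"]",
        re.IGNORECASE,
    ),
    "os.system": re.compile(r"\bos\.system\s*\("),
    "shell=True": re.compile(r"shell\s*=\s*True"),
    "eval/exec": re.compile(r"\b(eval|exec)\s*\("),
    "pickle": re.compile(r"\bpickle\.loads?\s*\("),
}


def added_lines(diff_text: str) -> Iterator[str]:
    """Yield lines added by a unified diff, skipping '+++' file headers."""
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def scan_diff(diff_text: str) -> list[tuple[str, str]]:
    """Return (label, offending line) pairs found on added lines only."""
    findings = []
    for line in added_lines(diff_text):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((label, line.strip()))
    return findings
```

Because only added lines are scanned, risky code that already existed or was removed by the diff is not reported, which matches the baseline-aware intent of this phase.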

Phase 3 — Auto-fix loop + git checkpointing:

When verification fails, a fresh delegate_task fix agent is spawned; it is neither the implementer nor the reviewer. It fixes ONLY the reported issues, with a maximum of 2 attempts. If verification still fails, the skill escalates to the user with rollback instructions.
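The control flow of the loop can be sketched as follows. The callables `verify` and `fix` stand in for the reviewer and the fresh fixer agent; how they call delegate_task is not shown here, and the function name and return shape are assumptions.

```python
from typing import Callable

MAX_FIX_ATTEMPTS = 2  # limit stated in the PR description


def verify_with_autofix(
    verify: Callable[[], dict],
    fix: Callable[[dict], None],
) -> dict:
    """Verify, then let a separate fixer try at most MAX_FIX_ATTEMPTS times."""
    verdict = verify()
    attempts = 0
    while verdict.get("passed") is not True and attempts < MAX_FIX_ATTEMPTS:
        fix(verdict)        # the fixer sees only the reported issues
        attempts += 1
        verdict = verify()  # re-verify after every fix attempt
    passed = verdict.get("passed") is True
    return {
        "passed": passed,
        "attempts": attempts,
        "escalate": not passed,  # hand over to the user with rollback steps
        "verdict": verdict,
    }
```

Re-verifying after each attempt keeps the fixer from marking its own work as done.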

Git checkpointing:

- [auto-checkpoint] commit before changes (enables rollback)
- [verified] commit after successful verification
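As a rough sketch of the convention, the two commit kinds could be produced like this. The exact git invocation used by the skill is not given in this PR, so the command below is an assumption: `--allow-empty` is included so that a checkpoint can be recorded even when nothing is staged, and staging is left to the caller.

```python
import subprocess

ALLOWED_TAGS = ("auto-checkpoint", "verified")


def commit_command(tag: str, message: str) -> list[str]:
    """Build the git argv for a tagged commit (hypothetical helper)."""
    if tag not in ALLOWED_TAGS:
        raise ValueError(f"unknown checkpoint tag: {tag}")
    return ["git", "commit", "--allow-empty", "-m", f"[{tag}] {message}"]


def run_commit(tag: str, message: str, repo_dir: str = ".") -> None:
    """Create the commit in repo_dir; raises CalledProcessError on failure."""
    subprocess.run(commit_command(tag, message), cwd=repo_dir, check=True)
```

Passing the arguments as a list avoids the shell entirely, in line with the shell-injection checks the skill itself enforces.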

Test coverage

The 19 scenarios span security, regression, auto-fix, and multi-language cases, and every one triggered the delegate_task reviewer. Notably, SQL injection via f-string and path traversal were missed by the static scan but caught by the reviewer, which demonstrates the value of independent LLM review over grep-only approaches.
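To make the SQL injection case concrete, here is a small self-contained illustration using the standard-library sqlite3 module. The table and function names are invented for the example and are not taken from the tested project; the point is that the unsafe line contains no keyword a grep heuristic would look for.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is interpolated into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized: the driver binds the value, so it is never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"
leaked = find_user_unsafe(conn, payload)   # condition is always true
blocked = find_user_safe(conn, payload)    # no user has this literal name
```

With the payload above, the unsafe query returns every row while the parameterized one returns none.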

Tested on macOS, Hermes v0.5.0, across Python, Node.js, Rust, and Go projects:

Security

| # | Scenario | Result |
|---|----------|--------|
| 1 | Hardcoded secret: `DB_PASSWORD = "super-secret-123"` staged | ❌ blocked (static scan + reviewer) |
| 2 | Shell injection: `os.system()` with user input | ❌ blocked, auto-fixed with safe `subprocess.run` |
| 3 | Shell injection: `subprocess.run(cmd, shell=True)` | ❌ blocked, auto-fixed with `shell=False` + `shlex.split()` |
| 4 | Dangerous `eval(expression)` with user input | ❌ blocked, auto-fixed with AST-based evaluator |
| 5 | SQL injection via f-string (two functions) | ❌ static scan missed it; reviewer caught both, auto-fixed with parameterized queries |
| 6 | Path traversal + 5 additional issues in multi-file project | ❌ reviewer found all 6, 2 auto-fix cycles, production-quality fixes applied |

Notable — scenario #6: A multi-layer Python service (DB + API + utils) where path traversal was hidden across 3 files. Static scan missed it entirely. The reviewer independently caught path traversal plus 5 additional issues: missing error handling, unused import, world-readable file permissions (/tmp/exports), error disclosure to clients, and a logic error in empty CSV handling. Two auto-fix cycles produced production-quality fixes including os.chmod(0o600) and structured error responses.
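For readers unfamiliar with the two fixes named in scenario #6, the sketch below shows one common way to reject path traversal and to write a file with owner-only permissions. It is not the code produced by the fix agent; the function names are hypothetical, and `Path.is_relative_to` requires Python 3.9 or newer.

```python
import os
from pathlib import Path


def safe_export_path(base_dir: str, filename: str) -> Path:
    """Resolve filename under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes export directory: {filename}")
    return candidate


def write_private(path: Path, data: str) -> None:
    """Write data so that only the owner can read or write the file."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(data)
    os.chmod(path, 0o600)  # also tightens a file that already existed
```

Resolving both paths before comparing them handles `..` segments, absolute file names, and symlinked base directories in the same check.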

Correctness / regressions

| # | Scenario | Result |
|---|----------|--------|
| 7 | Logic regression: `multiply` changed to `a + b`, existing test broke | ❌ blocked (test regression + reviewer flagged logic error) |
| 8 | Type mismatch: `-> int` returns `str` | ❌ blocked (mypy + reviewer) |
| 9 | New lint error: `import sys` unused, baseline was clean | ❌ blocked (ruff) |
| 10 | Go regression: `add` changed to `a - b`, existing test broke | ❌ blocked (go test + reviewer) |

Auto-fix behavior

| # | Scenario | Result |
|---|----------|--------|
| 11 | Clean diff: simple function added | ✅ passed, `[verified]` commit suggested |
| 12 | Auto-fix success: hardcoded secret | ✅ fixed in 1 attempt, `[verified]` commit created |
| 13 | Auto-fix exhausted: two unfixable security issues | ✅ escalated after 2 attempts with rollback instructions |

Multi-language coverage

| # | Scenario | Result |
|---|----------|--------|
| 14 | Node.js: clean diff | ✅ npm test + eslint detected automatically |
| 15 | Node.js: hardcoded `API_KEY` | ✅ auto-fixed with `process.env.API_KEY`, `[verified]` commit |
| 16 | Node.js: type error (`-> number` returns string) | ✅ blocked by tsc + reviewer, auto-fixed |
| 17 | Rust: shell injection + hardcoded secret | ✅ escalated after 2 attempts with rollback instructions |
| 18 | Rust: cargo clippy errors (`&Vec` instead of `&[String]`) | ✅ auto-fixed with idiomatic Rust |
| 19 | Rust: type error (`i32 + i32` assigned to `String`) | ✅ blocked by cargo check + reviewer, auto-fixed |
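The automatic tool detection seen in these scenarios can be pictured as a lookup on marker files. The marker file chosen per language and the exact command strings are assumptions for illustration; the PR only names the tools and says detection is based on project files.

```python
from pathlib import Path

# language -> (assumed marker file, gate commands named in the PR)
TOOLCHAINS: dict[str, tuple[str, dict[str, str]]] = {
    "Python": ("pyproject.toml",
               {"lint": "ruff check .", "types": "mypy .", "tests": "pytest"}),
    "Node": ("package.json",
             {"lint": "npx eslint .", "types": "npx tsc --noEmit",
              "tests": "npm test"}),
    "Rust": ("Cargo.toml",
             {"lint": "cargo clippy", "types": "cargo check",
              "tests": "cargo test"}),
    "Go": ("go.mod",
           {"lint": "golangci-lint run", "types": "go vet ./...",
            "tests": "go test ./..."}),
}


def detect_toolchains(project_dir: str) -> dict[str, dict[str, str]]:
    """Return the gate commands for every language whose marker file exists."""
    root = Path(project_dir)
    return {
        language: commands
        for language, (marker, commands) in TOOLCHAINS.items()
        if (root / marker).is_file()
    }
```

A repository containing several marker files gets one entry per language, so mixed projects run every applicable gate.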

delegate_task called in all 19 scenarios - confirmed via 🔀 delegate in tool activity feed.

Usage

verify my recent code changes using the verify-code-changes skill

Non-goals

- Not a full security scanner (heuristics only)
- Not a replacement for CI/CD pipelines
- Does not modify repository configuration or tooling setup

Why this approach

Matches the design direction in #406:

- Implemented as a bundled skill, not a custom tool
- Reuses existing primitives (delegate_task, terminal, git)
- Enforces separation of concerns between agents
- Introduces a standardized verification pipeline without modifying the core runtime
- No custom Python integration needed: all gates use standard CLI tools

MorAlekss · 2026-04-01T14:31:57Z

Implements #406 as a bundled skill following the proposed phased design. Happy to adjust based on feedback.

@MorAlekss

…requesting-code-review

Merge the passive code-review checklist and the automated verification pipeline (from PR #4459 by @MorAlekss) into a single requesting-code-review skill. This eliminates model confusion between three overlapping skills.

Now includes:
- Static security scan (grep on diff lines)
- Baseline-aware quality gates (only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Independent reviewer subagent with fail-closed JSON verdict
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpoint and [verified] commit convention

Deletes: skills/software-development/code-review/ (absorbed)
Closes: #406 (independent code verification)

teknium1 · 2026-04-03T20:28:19Z

Thanks for this thorough work @MorAlekss — the static security scan, baseline-aware quality gates, auto-fix loop, and fail-closed reviewer design are all excellent patterns.

We've incorporated your key contributions into the existing requesting-code-review skill (PR #4854) rather than adding a separate skill, to avoid model confusion between overlapping skills (code-review, requesting-code-review, and verify-code-changes all triggered on similar prompts). The consolidated skill credits you in the author field and includes:

- Your static grep-based security scan on diff lines
- Baseline-aware quality gates (snapshot before/after, only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Fail-closed JSON verdict from independent reviewer
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpointing convention

Your 19-scenario test matrix was particularly impressive — the SQL injection via f-string case (caught by reviewer but missed by static scan) directly informed keeping the dual-layer approach.

Closing in favor of #4854. Your contribution is preserved with attribution.

@MorAlekss

* chore: release v0.7.0 (2026.4.3)

168 merged PRs, 223 commits, 46 resolved issues, 40+ contributors. Highlights: pluggable memory providers, credential pools, Camofox browser, inline diff previews, API server session continuity, ACP MCP registration, gateway hardening, secret exfiltration blocking.

* refactor(skills): consolidate code-review + verify-code-changes into requesting-code-review

Merge the passive code-review checklist and the automated verification pipeline (from PR #4459 by @MorAlekss) into a single requesting-code-review skill. This eliminates model confusion between three overlapping skills.

Now includes:
- Static security scan (grep on diff lines)
- Baseline-aware quality gates (only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Independent reviewer subagent with fail-closed JSON verdict
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpoint and [verified] commit convention

Deletes: skills/software-development/code-review/ (absorbed)
Closes: #406 (independent code verification)


feat(skills): add verify-code-changes skill

7cd2503

teknium1 mentioned this pull request Apr 3, 2026

refactor(skills): consolidate code verification skills into one #4854

Merged

teknium1 closed this Apr 3, 2026

ai-ag2026 mentioned this pull request May 21, 2026

feat: add required artifact checker helper #23414

Closed

3 tasks

MorAlekss wants to merge 1 commit into NousResearch:main from MorAlekss:feat/verify-code-changes-406
