feat(skills): add verify-code-changes skill#4459

Closed
MorAlekss wants to merge 1 commit into NousResearch:main from MorAlekss:feat/verify-code-changes-406

Conversation

@MorAlekss

Contributor

Closes #406

Summary

Implements the bundled verify-code-changes skill from #406.

Adds an independent, fail-closed verification pipeline using delegate_task, with:

  • isolated reviewer (no shared context)
  • baseline-aware quality gates
  • auto-fix loop + git checkpointing

Files added

  • skills/autonomous-ai-agents/verify-code-changes/SKILL.md - contains the full workflow, reviewer prompt template, baseline comparison logic, quality gate orchestration, auto-fix loop, git checkpointing convention, and session-scoped result caching

Implementation (aligned with #406)

Phase 1 — Independent reviewer:

delegate_task is called directly by the agent - not via a script, as testing revealed delegate_task is unavailable inside execute_code sandboxes. The reviewer receives only the git diff, wrapped in XML tags for injection protection, and returns a fail-closed JSON verdict:

{
  "passed": false,
  "security_concerns": ["Possible hardcoded secret (DB_PASSWORD)"],
  "logic_errors": [],
  "suggestions": [],
  "summary": "Hardcoded credential detected"
}

Fail-closed: non-empty security_concerns or logic_errors → passed must be false. Unparseable response → false.
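A minimal sketch of how the fail-closed rule could be enforced on the agent side, assuming the reviewer's reply arrives as a raw string. The helper names `fail_closed` and `parse_verdict` are hypothetical and not part of the skill; treating missing findings lists as a failure is an extra assumption in the same spirit.

```python
import json
from typing import Any

FINDING_KEYS = ("security_concerns", "logic_errors")


def fail_closed(summary: str) -> dict[str, Any]:
    """Build a failing verdict in the same shape as the reviewer's JSON."""
    return {
        "passed": False,
        "security_concerns": [],
        "logic_errors": [],
        "suggestions": [],
        "summary": summary,
    }


def parse_verdict(raw: str) -> dict[str, Any]:
    """Parse the reviewer's reply; anything ambiguous counts as a failure."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return fail_closed("Unparseable reviewer response")
    if not isinstance(data, dict):
        return fail_closed("Reviewer response is not a JSON object")
    # Assumption: both findings fields must be present and must be lists.
    if any(not isinstance(data.get(key), list) for key in FINDING_KEYS):
        return fail_closed("Reviewer response is missing its findings lists")

    has_findings = any(data[key] for key in FINDING_KEYS)
    # Only an explicit boolean true with no findings is allowed to pass.
    passed = data.get("passed") is True and not has_findings
    return {**data, "passed": passed}
```

With this shape, a reviewer that reports `"passed": true` alongside a non-empty `security_concerns` list is still treated as a failure.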

Verification results are cached per session using sha256(git diff), allowing identical diffs to skip re-verification and reducing cost/latency.
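The caching idea can be sketched as follows, assuming the cache is a plain in-memory dict that lives for the session. `diff_key`, `verify_cached`, and the `verify` callable are hypothetical names for illustration; the skill itself describes this step in prose only.

```python
import hashlib
from typing import Any, Callable


def diff_key(diff_text: str) -> str:
    """Return the SHA-256 hex digest of the diff, used as the cache key."""
    return hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def verify_cached(
    diff_text: str,
    verify: Callable[[str], dict[str, Any]],
    cache: dict[str, dict[str, Any]],
) -> dict[str, Any]:
    """Run `verify` only when this exact diff has not been seen before."""
    key = diff_key(diff_text)
    if key not in cache:
        cache[key] = verify(diff_text)
    return cache[key]
```

Any change to the diff, even whitespace, produces a different digest and therefore a fresh verification run.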

Phase 2 — Baseline-aware quality gates:

A baseline snapshot of each gate is taken before the changes, and the same gates are re-run afterwards. Only NEW failures block the commit - pre-existing issues are ignored.
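The comparison itself reduces to a set difference. The sketch below assumes each failure can be reduced to a stable string identifier (for example a test ID or a `tool:file:rule` triple); that identifier format and the function names are illustrative assumptions, not the skill's actual representation.

```python
def new_failures(baseline: set[str], after: set[str]) -> set[str]:
    """Failures present after the change that were not in the baseline."""
    return after - baseline


def blocks_commit(baseline: set[str], after: set[str]) -> bool:
    """A gate blocks the commit only if the change introduced a failure."""
    return bool(new_failures(baseline, after))
```

Identifiers that embed line numbers would make pre-existing issues look new whenever code shifts, so a real implementation would probably need keys that survive such moves.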

  • Static scan via grep on added diff lines: hardcoded secrets, os.system(), subprocess shell=True, eval()/exec(), pickle, path traversal, raw IP HTTP calls, base64 decode
  • Linting: ruff (Python), eslint (Node), cargo clippy (Rust), golangci-lint (Go)
  • Type checking: mypy (Python), tsc (Node), cargo check (Rust), go vet (Go)
  • Tests: pytest / npm test / cargo test / go test - auto-detected by project files
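As an illustration of the static scan, here is a rough Python equivalent of grepping only the added lines of a unified diff. The skill uses grep; the regular expressions below are simplified stand-ins chosen for this sketch and cover only part of the list above.

```python
import re
from typing import Iterator

# Illustrative patterns only; not the patterns shipped in SKILL.md.
PATTERNS = {
    "hardcoded secret": re.compile(
        r"(?i)\b\w*(password|secret|api_key|token)\w*\s*=\s*['\"][^'\"]+['\"]"
    ),
    "os.system": re.compile(r"\bos\.system\s*\("),
    "shell=True": re.compile(r"\bshell\s*=\s*True\b"),
    "eval/exec": re.compile(r"\b(eval|exec)\s*\("),
    "pickle": re.compile(r"\bpickle\.loads?\s*\("),
}


def added_lines(diff_text: str) -> Iterator[str]:
    """Yield lines added by the diff, skipping the '+++' file headers."""
    for line in diff_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def scan_diff(diff_text: str) -> list[tuple[str, str]]:
    """Return (label, line) pairs for every pattern hit on an added line."""
    findings = []
    for line in added_lines(diff_text):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((label, line.strip()))
    return findings
```

A pattern scan like this sees no problem in a line such as `cursor.execute(f"SELECT ... {uid}")`, which is consistent with the SQL injection scenario below being caught only by the reviewer.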

Phase 3 — Auto-fix loop + git checkpointing:

When verification fails, a fresh delegate_task fix agent is spawned - not the implementer, not the reviewer. It fixes ONLY the reported issues. Maximum 2 attempts. If verification still fails after the second attempt, the skill escalates to the user with rollback instructions.
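The control flow of the loop can be sketched like this, with the reviewer and the fix agent passed in as plain callables. Both callables, the function name, and the returned dict shape are hypothetical; in the skill these steps are separate delegate_task calls made by the agent.

```python
from typing import Any, Callable

MAX_FIX_ATTEMPTS = 2  # limit stated in the PR description


def verify_with_autofix(
    verify: Callable[[], dict[str, Any]],
    fix: Callable[[dict[str, Any]], None],
    max_attempts: int = MAX_FIX_ATTEMPTS,
) -> dict[str, Any]:
    """Verify, then alternate fix and re-verify until pass or limit."""
    verdict = verify()
    attempts = 0
    while verdict.get("passed") is not True and attempts < max_attempts:
        fix(verdict)  # the fixer sees only the reported issues
        attempts += 1
        verdict = verify()
    status = "verified" if verdict.get("passed") is True else "escalate"
    return {"status": status, "attempts": attempts, "verdict": verdict}
```

A clean diff returns with zero attempts, and a diff that never passes is verified three times in total (once initially, once after each fix) before escalation.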

Git checkpointing:

  • [auto-checkpoint] commit before changes — enables rollback
  • [verified] commit after successful verification
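For illustration, the two commit conventions could be driven from Python as below. This is a sketch under assumptions: the skill runs git through the terminal, and whether it stages everything with `git add -A` or allows an empty checkpoint commit is not stated in this PR. The function names are hypothetical.

```python
import subprocess


def commit_argv(tag: str, message: str, allow_empty: bool = False) -> list[str]:
    """Build a git commit command whose subject carries the bracketed tag."""
    argv = ["git", "commit"]
    if allow_empty:
        argv.append("--allow-empty")
    return argv + ["-m", f"[{tag}] {message}"]


def checkpoint(message: str, cwd: str = ".") -> None:
    """Record the pre-change state so a later rollback has a target."""
    # Assumption: stage all changes before the checkpoint commit.
    subprocess.run(["git", "add", "-A"], cwd=cwd, check=True)
    # allow_empty keeps the checkpoint even when the tree is already clean.
    subprocess.run(
        commit_argv("auto-checkpoint", message, allow_empty=True),
        cwd=cwd,
        check=True,
    )


def mark_verified(message: str, cwd: str = ".") -> None:
    """Commit the staged, verified changes with the [verified] tag."""
    subprocess.run(commit_argv("verified", message), cwd=cwd, check=True)
```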

Test coverage

19 scenarios across security, regression, auto-fix, and multi-language coverage; all of them triggered the delegate_task reviewer. Notable: SQL injection via f-string and path traversal were missed by the static scan but caught by the reviewer, demonstrating the value of independent LLM review over grep-only approaches.

Tested on macOS, Hermes v0.5.0, across Python, Node.js, Rust, and Go projects:

Security

| # | Scenario | Result |
|---|----------|--------|
| 1 | Hardcoded secret - DB_PASSWORD = "super-secret-123" staged | ❌ blocked - static scan + reviewer |
| 2 | Shell injection - os.system() with user input | ❌ blocked, auto-fixed with safe subprocess.run |
| 3 | Shell injection - subprocess.run(cmd, shell=True) | ❌ blocked, auto-fixed with shell=False + shlex.split() |
| 4 | Dangerous eval(expression) with user input | ❌ blocked, auto-fixed with AST-based evaluator |
| 5 | SQL injection via f-string (two functions) | ❌ static scan missed it; reviewer caught both, auto-fixed with parameterized queries |
| 6 | Path traversal + 5 additional issues in multi-file project | ❌ reviewer found all 6, 2 auto-fix cycles, production-quality fixes applied |

Notable — scenario #6: A multi-layer Python service (DB + API + utils) where path traversal was hidden across 3 files. Static scan missed it entirely. The reviewer independently caught path traversal plus 5 additional issues: missing error handling, unused import, world-readable file permissions (/tmp/exports), error disclosure to clients, and a logic error in empty CSV handling. Two auto-fix cycles produced production-quality fixes including os.chmod(0o600) and structured error responses.

Correctness / regressions

| # | Scenario | Result |
|---|----------|--------|
| 7 | Logic regression - multiply changed to a + b, existing test broke | ❌ blocked - test regression + reviewer flagged logic error |
| 8 | Type mismatch - -> int returns str | ❌ blocked - mypy + reviewer |
| 9 | New lint error - import sys unused, baseline was clean | ❌ blocked - ruff |
| 10 | Go regression - add changed to a - b, existing test broke | ❌ blocked - go test + reviewer |

Auto-fix behavior

| # | Scenario | Result |
|---|----------|--------|
| 11 | Clean diff - simple function added | ✅ passed, [verified] commit suggested |
| 12 | Auto-fix success - hardcoded secret | ✅ fixed in 1 attempt, [verified] commit created |
| 13 | Auto-fix exhausted - two unfixable security issues | ✅ escalated after 2 attempts with rollback instructions |

Multi-language coverage

| # | Scenario | Result |
|---|----------|--------|
| 14 | Node.js - clean diff | ✅ npm test + eslint detected automatically |
| 15 | Node.js - hardcoded API_KEY | ✅ auto-fixed with process.env.API_KEY, [verified] commit |
| 16 | Node.js - type error (-> number returns string) | ✅ blocked by tsc + reviewer, auto-fixed |
| 17 | Rust - shell injection + hardcoded secret | ✅ escalated after 2 attempts with rollback instructions |
| 18 | Rust - cargo clippy errors (&Vec instead of &[String]) | ✅ auto-fixed with idiomatic Rust |
| 19 | Rust - type error (i32 + i32 assigned to String) | ✅ blocked by cargo check + reviewer, auto-fixed |

delegate_task called in all 19 scenarios - confirmed via 🔀 delegate in tool activity feed.

Usage

verify my recent code changes using the verify-code-changes skill

Non-goals

  • Not a full security scanner (heuristics only)
  • Not a replacement for CI/CD pipelines
  • Does not modify repository configuration or tooling setup

Why this approach

Matches the design direction in #406:

  • Implemented as a bundled skill, not a custom tool
  • Reuses existing primitives (delegate_task, terminal, git)
  • Enforces separation of concerns between agents
  • Introduces standardized verification pipeline without modifying core runtime
  • No custom Python integration needed - all gates use standard CLI tools
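To make the "standard CLI tools" point concrete, the sketch below maps a project marker file to the gate commands named in Phase 2. The marker files (`pyproject.toml`, `package.json`, `Cargo.toml`, `go.mod`), the exact command arguments, and the `detect_gates` helper are assumptions for illustration; the PR only says the tools are auto-detected by project files.

```python
from pathlib import Path

# Assumed marker files and invocations; the tools come from the Phase 2 list.
TOOLCHAINS = {
    "pyproject.toml": {
        "lint": ["ruff", "check", "."],
        "types": ["mypy", "."],
        "tests": ["pytest"],
    },
    "package.json": {
        "lint": ["npx", "eslint", "."],
        "types": ["npx", "tsc", "--noEmit"],
        "tests": ["npm", "test"],
    },
    "Cargo.toml": {
        "lint": ["cargo", "clippy"],
        "types": ["cargo", "check"],
        "tests": ["cargo", "test"],
    },
    "go.mod": {
        "lint": ["golangci-lint", "run"],
        "types": ["go", "vet", "./..."],
        "tests": ["go", "test", "./..."],
    },
}


def detect_gates(project_dir: str) -> dict[str, dict[str, list[str]]]:
    """Return the gate commands for every marker file found in the project."""
    root = Path(project_dir)
    return {
        marker: gates
        for marker, gates in TOOLCHAINS.items()
        if (root / marker).is_file()
    }
```

A repository containing more than one marker file would get gates for each ecosystem, which is one possible reading of multi-language support.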

@MorAlekss

Contributor Author

Implements #406 as a bundled skill following the proposed phased design. Happy to adjust based on feedback.

teknium1 added a commit that referenced this pull request Apr 3, 2026
…requesting-code-review

Merge the passive code-review checklist and the automated verification
pipeline (from PR #4459 by @MorAlekss) into a single requesting-code-review
skill. This eliminates model confusion between three overlapping skills.

Now includes:
- Static security scan (grep on diff lines)
- Baseline-aware quality gates (only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Independent reviewer subagent with fail-closed JSON verdict
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpoint and [verified] commit convention

Deletes: skills/software-development/code-review/ (absorbed)
Closes: #406 (independent code verification)
@teknium1

teknium1 commented Apr 3, 2026

Contributor

Thanks for this thorough work @MorAlekss — the static security scan, baseline-aware quality gates, auto-fix loop, and fail-closed reviewer design are all excellent patterns.

We've incorporated your key contributions into the existing requesting-code-review skill (PR #4854) rather than adding a separate skill, to avoid model confusion between overlapping skills (code-review, requesting-code-review, and verify-code-changes all triggered on similar prompts). The consolidated skill credits you in the author field and includes:

  • Your static grep-based security scan on diff lines
  • Baseline-aware quality gates (snapshot before/after, only flag NEW failures)
  • Multi-language tool detection (Python, Node, Rust, Go)
  • Fail-closed JSON verdict from independent reviewer
  • Auto-fix loop with separate fixer agent (max 2 attempts)
  • Git checkpointing convention

Your 19-scenario test matrix was particularly impressive — the SQL injection via f-string case (caught by reviewer but missed by static scan) directly informed keeping the dual-layer approach.

Closing in favor of #4854. Your contribution is preserved with attribution.

@teknium1 teknium1 closed this Apr 3, 2026
teknium1 added a commit that referenced this pull request Apr 3, 2026
* chore: release v0.7.0 (2026.4.3)

168 merged PRs, 223 commits, 46 resolved issues, 40+ contributors.

Highlights: pluggable memory providers, credential pools, Camofox browser,
inline diff previews, API server session continuity, ACP MCP registration,
gateway hardening, secret exfiltration blocking.

* refactor(skills): consolidate code-review + verify-code-changes into requesting-code-review

Merge the passive code-review checklist and the automated verification
pipeline (from PR #4459 by @MorAlekss) into a single requesting-code-review
skill. This eliminates model confusion between three overlapping skills.

Now includes:
- Static security scan (grep on diff lines)
- Baseline-aware quality gates (only flag NEW failures)
- Multi-language tool detection (Python, Node, Rust, Go)
- Independent reviewer subagent with fail-closed JSON verdict
- Auto-fix loop with separate fixer agent (max 2 attempts)
- Git checkpoint and [verified] commit convention

Deletes: skills/software-development/code-review/ (absorbed)
Closes: #406 (independent code verification)

Development

Successfully merging this pull request may close these issues.

Feature: Independent Code Verification & Quality Gates — Fail-Closed Review, Baseline Regression Detection, and Auto-Fix Loop (inspired by Nightwire)

2 participants