MonkeyClaw

Autonomous red / purple / blue security agent for NVIDIA NemoClaw.


Inspiration

Coding agents are the most privileged software most people now run. They read your source, execute shell commands, call MCP tools, and reach the network — all on your behalf, often unattended. One injected instruction, or one mistaken action, is enough to read a secret and exfiltrate it with ordinary, unremarkable commands.

The unsettling part isn't that this is possible. It's that the agent, its tools, and its prompts change constantly — so a one-time security audit is stale the day after it ships. We didn't want a report. We wanted a system that keeps testing forever, and gets better at it over time.

And we kept circling one uncomfortable realization:

A blocked attack is not the same as a working control.

A sandbox that quietly stops an exploit but emits no telemetry works today and can regress tomorrow, undetected. Most security tooling celebrates the block and moves on. We decided that wasn't good enough.


What it does

MonkeyClaw is an OpenClaw agent that continuously attacks NemoClaw's security controls — sandbox isolation, privacy routing, permission enforcement, skill-pipeline integrity — and then proves the fixes hold.

It runs a five-stage autonomous loop with no human in the inner loop:

red → judge → repro → blue → purple

Stage | What it does
Red | Generates attack ideas across 18 NemoClaw attack-surface zones using five ideation modes (including a research-grounded mode seeded by a 35-skill attack corpus and a systematic walk over MITRE ATLAS / OWASP-LLM techniques), then executes multi-turn attacks against live NemoClaw sandboxes.
Judge | Scores results in two tiers: fast programmatic checks, then a five-role Nemotron judge ensemble with Elo weighting and a frontier-model appeal path.
Repro | Replays each confirmed finding on fresh victims, delta-minimizes it, and computes a root cause from the path that actually executed (an anchor-to-sink BFS over a SQLite code graph), then hands the root-cause doc to a fresh, context-free agent to cold-verify it.
Blue | Triages, writes a candidate patch, and runs it through 8 verifier gates. Among other things, the gates check that the diff applies, the vulnerability is blocked, it stays blocked under mutated variants, legitimate functionality is preserved, the full regression suite is green, the control plane is not weakened, and telemetry still fires.
Purple | Scores whether each defense is real, not just whether it happened to block today.
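
To make the repro stage's anchor-to-sink search concrete, here is a minimal sketch of a breadth-first search over a call graph stored in SQLite. It is an illustration, not MonkeyClaw's actual code: the `edges` table, its columns, and every node name are hypothetical, and a real executed-path analysis would presumably restrict the graph to edges observed during the replay.

```python
import sqlite3
from collections import deque


def build_demo_graph() -> sqlite3.Connection:
    """Create a tiny in-memory code graph (hypothetical schema and node names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src TEXT NOT NULL, dst TEXT NOT NULL)")
    conn.executemany(
        "INSERT INTO edges (src, dst) VALUES (?, ?)",
        [
            ("handle_prompt", "plan_tools"),
            ("plan_tools", "run_shell"),
            ("plan_tools", "read_file"),
            ("read_file", "open_path"),
            ("run_shell", "subprocess_exec"),
        ],
    )
    return conn


def anchor_to_sink_path(conn, anchor, sink):
    """Return the shortest anchor-to-sink path as a list of nodes, or None."""
    if anchor == sink:
        return [anchor]
    parents = {anchor: None}  # doubles as the visited set
    queue = deque([anchor])
    while queue:
        node = queue.popleft()
        rows = conn.execute(
            "SELECT dst FROM edges WHERE src = ? ORDER BY dst", (node,)
        ).fetchall()
        for (nxt,) in rows:
            if nxt in parents:
                continue
            parents[nxt] = node
            if nxt == sink:
                # Walk parent links back to the anchor, then reverse.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

Calling `anchor_to_sink_path(build_demo_graph(), "handle_prompt", "subprocess_exec")` walks from the entry point to the dangerous call; an unreachable sink yields `None`, which a caller could treat as "no root cause found on this graph".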

The orchestrator always steers the next cycle toward the lowest-coverage zone, weighted by purple's detection-coverage gaps — so the system attacks its own blind spots. The whole thing runs as a zero-credential, one-command demo: one real pipeline cycle against a planted-vulnerability victim feeds an eleven-panel live dashboard.
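
One way to picture the steering rule is a priority score per zone that grows as attack coverage and detection coverage shrink. The sketch below is only an assumption about how such a weighting could look: `ZoneStats`, `zone_priority`, `gap_weight`, and the additive formula are all hypothetical, not the orchestrator's real scoring.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ZoneStats:
    name: str
    attack_coverage: float     # fraction of the zone red has exercised, 0..1
    detection_coverage: float  # fraction of executions purple saw telemetry for, 0..1


def zone_priority(zone: ZoneStats, gap_weight: float = 1.0) -> float:
    """Higher means more urgent: low attack coverage plus a weighted detection gap."""
    attack_gap = 1.0 - zone.attack_coverage
    detection_gap = 1.0 - zone.detection_coverage
    return attack_gap + gap_weight * detection_gap


def pick_next_zone(zones: Sequence[ZoneStats], gap_weight: float = 1.0) -> ZoneStats:
    """Pick the zone with the highest priority; ties are broken by name."""
    if not zones:
        raise ValueError("no zones to choose from")
    return max(zones, key=lambda z: (zone_priority(z, gap_weight), z.name))
```

With `gap_weight=0` this degenerates to "lowest attack coverage wins"; raising the weight pulls red toward zones where controls block attacks without being observed.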


A red team that trains, not just prompts

A general model asked to "play attacker" hedges, refuses, and drifts off-schema. So we built the red team its own model: a custom Nemotron-3-Nano-4B, LoRA-finetuned on NVIDIA Brev (A100) to be a specialist adversarial-idea generator for the red_ideation role.

What makes it work is the data pipeline, not the GPU hours. We generate the entire supervised-finetuning set from MonkeyClaw's own corpora — 1,032 examples spanning all 18 attack zones × 5 ideation modes, every one grounded in our 35-skill attack-skill library and the vendored MITRE ATLAS / OWASP-LLM technique catalog. Because each example is built directly against MonkeyClaw's IdeaObject contract, the training set is schema-valid by construction — no external API calls, no label noise — and ~12% of examples carry "loaded operator" notes that explicitly train against refusal.

The result, on a held-out eval: a 0% refusal rate and 0 request failures, reliably emitting parseable attack-idea JSON the live IdeationEngine accepts end-to-end (verified by an integration smoke test through the real production code path). It's a 4B hybrid Mamba-Transformer — small and cheap enough to drive the high-volume red loop — and it's already wired in: configs/monkeyclaw.yaml points the red_ideation role at it, served over an OpenAI-compatible endpoint. A red team that thinks like an attacker by training, not by prompt.
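
For readers who want to see what "refusal rate" and "parseable attack-idea JSON" could mean operationally, here is a rough sketch of an eval tally. The required field names and the refusal markers are hypothetical stand-ins; the real IdeaObject contract and the actual eval harness are not shown in this writeup.

```python
import json
from typing import Dict, Iterable

# Hypothetical stand-ins for the IdeaObject contract and refusal heuristics.
REQUIRED_FIELDS = ("zone", "mode", "title", "steps")
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def classify_output(text: str) -> str:
    """Label one model output as 'valid', 'refusal', or 'malformed'."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        lowered = text.lower()
        if any(marker in lowered for marker in REFUSAL_MARKERS):
            return "refusal"
        return "malformed"
    if not isinstance(obj, dict):
        return "malformed"
    if any(field not in obj for field in REQUIRED_FIELDS):
        return "malformed"
    return "valid"


def summarize(outputs: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of outputs in each class (all zeros for empty input)."""
    counts = {"valid": 0, "refusal": 0, "malformed": 0}
    total = 0
    for text in outputs:
        counts[classify_output(text)] += 1
        total += 1
    return {label: (n / total if total else 0.0) for label, n in counts.items()}
```

A run with a 0% refusal rate would show `summarize(outputs)["refusal"] == 0.0`; keeping "malformed" as a separate class avoids counting schema drift as refusal.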


The purple team: proving the fix is real

Purple is the part most security tooling skips, and it's the heart of MonkeyClaw. After blue patches a vulnerability, purple answers a harder question than "is it blocked?" — "is the defense real, observable, and durable?"

Every cycle, purple runs a full pipeline over each execution:

  1. Detection oracle: scores every execution on two axes (was the attack blocked, and did the runtime emit telemetry?), placing it in a prevention × observability quadrant (PASS / PARTIAL / WEAK / FAIL). A control passes only when the attack is both blocked and observed. We call this detection-as-pass: a blocked-but-silent control is a latent regression, not a win.
  2. Coverage model — tracks per-zone detection coverage, separate from attack coverage, so the orchestrator can aim red at observability blind spots.
  3. Control validator — re-runs the policy corpus against live builds to catch silent control-plane drift.
  4. Detection synthesizer — turns each confirmed finding into a reusable detection rule.
  5. Generalization loop — mutates the freshly patched attack and re-tests the patched victim; a surviving bypass bounces straight back to blue for re-patching (bounded rounds), so a patch isn't trusted because it blocked one phrasing.
  6. Report card — regenerates a seven-dimension security scorecard.
  7. Feedback router — boosts red's priority on blind spots and pushes regressions and PARTIALs back onto the blue queue.
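
Steps 1 and 2 of the pipeline above can be sketched in a few lines. Only the PASS rule (blocked and observed) comes from the description; the assignment of PARTIAL, WEAK, and FAIL to the other three quadrants is our assumption for illustration, as are the function names and the tuple layout.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, Iterable, Tuple


class Verdict(Enum):
    PASS = "PASS"
    PARTIAL = "PARTIAL"
    WEAK = "WEAK"
    FAIL = "FAIL"


def detection_verdict(blocked: bool, observed: bool) -> Verdict:
    """Map the prevention x observability quadrant to a verdict."""
    if blocked and observed:
        return Verdict.PASS      # from the text: blocked AND observed
    if blocked:
        return Verdict.PARTIAL   # assumption: blocked but silent
    if observed:
        return Verdict.WEAK      # assumption: seen but not stopped
    return Verdict.FAIL          # assumption: neither blocked nor seen


def detection_coverage(
    executions: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, float]:
    """Per-zone share of executions that reached PASS.

    Each execution is a (zone, blocked, observed) tuple.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for zone, blocked, observed in executions:
        total[zone] += 1
        if detection_verdict(blocked, observed) is Verdict.PASS:
            passed[zone] += 1
    return {zone: passed[zone] / total[zone] for zone in total}
```

The point of the two-axis verdict is visible in the second branch: a plain pass/fail harness would score a blocked-but-silent run as green, while this mapping keeps it out of the PASS count.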

Purple even runs a self-governance audit pointed at MonkeyClaw's own agents — the security tool holds itself to the standard it enforces.
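
The bounded re-patch cycle in the generalization loop (step 5 above) is easy to get subtly wrong, so here is a small sketch of its control flow. The callables `mutate`, `is_blocked`, and `repatch`, and the default of three rounds, are hypothetical placeholders for the real mutation engine, victim harness, and blue queue.

```python
from typing import Callable, List


def generalization_loop(
    attack: str,
    mutate: Callable[[str], List[str]],
    is_blocked: Callable[[str], bool],
    repatch: Callable[[List[str]], None],
    max_rounds: int = 3,
) -> bool:
    """Return True once every mutated variant is blocked, False if rounds run out."""
    for _ in range(max_rounds):
        bypasses = [v for v in mutate(attack) if not is_blocked(v)]
        if not bypasses:
            return True
        repatch(bypasses)  # bounce surviving bypasses back for another patch
    # One final check so the last re-patch is also verified.
    return all(is_blocked(v) for v in mutate(attack))


def demo(patching_enabled: bool = True) -> bool:
    """Toy victim: a blocklist that only knows the original phrasing."""
    blocked = {"cat /etc/secret"}

    def mutate(a: str) -> List[str]:
        return [a, a.upper(), a.replace(" ", "  ")]

    def repatch(bypasses: List[str]) -> None:
        if patching_enabled:
            blocked.update(bypasses)

    return generalization_loop(
        "cat /etc/secret", mutate, lambda v: v in blocked, repatch, max_rounds=2
    )
```

In the toy demo, the first round finds two bypasses, the re-patch closes them, and the second round passes; with patching disabled the loop gives up after its bounded rounds instead of spinning forever.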


How we built it

The engine is NVIDIA Nemotron, end to end, with per-role model routing — a tiered setup mapping cheap extraction, heavy ideation/judging, our custom red-team model, and a dedicated nemotron-content-safety-reasoning model to the roles that need each. NemoClaw is both the platform we run on and the target we probe; in-sandbox runs use NemoClaw's managed inference route, so no credentials ever touch the sandbox.
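
Per-role routing boils down to a lookup from role name to a model and endpoint. The table below is purely illustrative: only `red_ideation` and `nemotron-content-safety-reasoning` are names from this writeup, while the other role names, the placeholder model IDs, and both URLs are hypothetical and do not reflect the contents of configs/monkeyclaw.yaml.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Route:
    model: str
    base_url: str  # an OpenAI-compatible endpoint


# Hypothetical endpoints; real deployments would read these from config.
MANAGED_ROUTE = "http://managed-inference.invalid/v1"
CUSTOM_RED_ENDPOINT = "http://red-model.invalid/v1"

ROUTES: Dict[str, Route] = {
    "extraction": Route("placeholder-small-model", MANAGED_ROUTE),
    "judge": Route("placeholder-large-model", MANAGED_ROUTE),
    "red_ideation": Route("placeholder-custom-red-lora", CUSTOM_RED_ENDPOINT),
    "content_safety": Route("nemotron-content-safety-reasoning", MANAGED_ROUTE),
}


def route_for(role: str) -> Route:
    """Look up the model route for a role, failing loudly on unknown roles."""
    try:
        return ROUTES[role]
    except KeyError:
        raise ValueError(f"no model route configured for role {role!r}") from None
```

Failing loudly on an unknown role is a deliberate choice in this sketch: silently falling back to a default model would hide misconfiguration in an unattended loop.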

The codebase is built for trust:

  • 1000+ tests across ~146 test files
  • 20 forward-only, validated DB migrations
  • An MCP control plane wiring the agents together
  • A strict interfaces/ contract layer, so three of us could build red, blue, and purple in parallel without merge wars


Challenges we ran into

The hardest problem was epistemic: how do you trust an autonomous judgment? Our answer was layered — programmatic gates before semantic ones, an ensemble before a verdict, a frontier appeal for split votes, and a fresh context-free agent to cold-verify.
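
The layering described above can be sketched as a short decision function: cheap programmatic checks first, then an Elo-weighted ensemble vote, with split votes routed to an appeal. Everything specific here is an assumption for illustration: the `executed` and `canary_leaked` keys, the standard Elo-style weight formula, and the 0.2 split margin are not taken from MonkeyClaw's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Vote:
    role: str        # hypothetical judge role name
    confirmed: bool  # did this judge consider the attack successful?
    elo: float       # the judge's current rating


def programmatic_verdict(result: dict) -> Optional[str]:
    """Tier 1: deterministic checks. None means 'undecided, escalate'."""
    if not result.get("executed", False):
        return "rejected"
    if result.get("canary_leaked", False):
        return "confirmed"
    return None


def elo_weight(elo: float, base: float = 1500.0, scale: float = 400.0) -> float:
    """Turn a rating into a positive voting weight (10x per `scale` points)."""
    return 10.0 ** ((elo - base) / scale)


def ensemble_verdict(votes: List[Vote], margin: float = 0.2) -> str:
    """Tier 2: weighted vote; results inside the margin go to appeal."""
    if not votes:
        raise ValueError("need at least one vote")
    weights = [elo_weight(v.elo) for v in votes]
    yes = sum(w for w, v in zip(weights, votes) if v.confirmed)
    share = yes / sum(weights)
    if share >= 0.5 + margin:
        return "confirmed"
    if share <= 0.5 - margin:
        return "rejected"
    return "appeal"  # split vote: hand off to the frontier-model appeal


def judge(result: dict, votes: List[Vote]) -> str:
    early = programmatic_verdict(result)
    return early if early is not None else ensemble_verdict(votes)
```

With equal ratings, a 4-to-1 vote is decisive and a 3-to-2 vote goes to appeal; a single highly rated judge can outweigh several default-rated ones, which is the intended effect of Elo weighting.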

Finetuning the red-team model surfaced real systems work too. The 4B Nemotron is a hybrid Mamba-Transformer, so we worked around missing Mamba CUDA kernels with an exact torch tiling path, and we fell back to a deterministic manual LoRA merge when library versions disagreed. We still landed a model the production pipeline accepts.

The third hard problem was determinism — an autonomous loop with this many models is hard to demo reliably, so we gated every nondeterministic feature off by default and built a seeded-DB fallback.


Accomplishments we're proud of

The full red → judge → repro → blue → purple loop runs end to end, autonomously, across all 18 zones. Detection-as-pass actually catches blocked-but-silent controls that a normal pass/fail harness would have green-lit. And the red team runs on a model we trained ourselves, grounded entirely in the project's own attack knowledge.


What we learned

Security isn't a state — it's a rate.

The win condition isn't "zero vulnerabilities today." It's a loop where red finds it, blue proves and fixes it, purple confirms it's observable and stays fixed, and a growing regression suite makes sure it never comes back.

Observability isn't a nice-to-have bolted onto a control — it is the control.


What's next for MonkeyClaw

  • Expand the red-team model's training set to the full corpus and harden schema conformance further
  • Train the learned attack ranker on accumulated cycle data
  • Export purple's synthesized detection rules to real SIEMs

The goal is for MonkeyClaw to guard production agent fleets, not just a hackathon victim.

