Feature: Acceptance Criteria & Independent Judge for Sub-agent Delegation (inspired by OpenPlanter) #356

@teknium1

Overview

OpenPlanter implements a quality-gating pattern for recursive sub-agent delegation called IMPLEMENT-THEN-VERIFY. When a parent agent delegates work via subtask or execute, it specifies acceptance criteria (a clear statement of what "done" looks like). Once the sub-agent finishes, a separate cheap "judge" model (lowest tier, e.g. Haiku) evaluates the output against those criteria and returns PASS or FAIL with specific feedback. The key insight is that implementation and verification must be uncorrelated: the agent that does the work should not be its sole verifier.

Hermes Agent's delegate_task currently has no quality gating. Sub-agent results are returned to the parent as-is, with no verification of whether the output actually meets the original goal. This means the parent agent must manually evaluate every sub-agent result, consuming tokens and adding latency. Adding acceptance criteria with independent verification would significantly improve sub-agent output quality while reducing parent workload.

This is related to but distinct from #344 (Workflow Formulas), which mentions acceptance criteria in workflow step definitions but doesn't detail the independent verification mechanism. This feature is independently valuable — it applies to ALL delegate_task calls, not just structured workflows.


Research Findings

How OpenPlanter's Acceptance Criteria System Works

The core pattern, implemented in engine.py (~1011 lines), works as follows:

  1. When dispatching a subtask or execute tool call, the parent specifies:

    • objective: what to accomplish
    • acceptance_criteria: explicit success conditions (e.g., "Output file exists and contains at least 3 entity matches with confidence > 0.7")
  2. The sub-agent runs its full solve loop (potentially recursively), producing a result.

  3. After completion, if acceptance criteria were provided, a judge model evaluates:

    • The judge is always the LOWEST TIER model (cheapest/fastest — e.g., Haiku-class)
    • The judge receives: the original objective, the acceptance criteria, and the sub-agent's final output
    • The judge returns: PASS or FAIL with specific reasoning
  4. On FAIL:

    • The failure feedback is returned to the parent agent
    • The parent can decide to retry, modify the approach, or accept partial results
    • OpenPlanter does NOT auto-retry — it lets the parent decide

Tiered model routing reinforces this:

  • Sub-agents can only delegate to EQUAL OR LOWER tier models (never escalate)
  • Opus (tier 1) → Sonnet (tier 2) → Haiku (tier 3)
  • The judge always uses tier 3 (cheapest), ensuring verification cost is negligible
  • This prevents a cheap sub-agent from spawning an expensive verifier
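
The routing rule above can be sketched in a few lines. This is a minimal illustration, not OpenPlanter's actual code: the TIERS table and the function names can_delegate and judge_model are assumptions made for the example.

```python
# Hypothetical tier table: a lower number means a more capable, more expensive model.
TIERS = {"opus": 1, "sonnet": 2, "haiku": 3}


def can_delegate(parent_model: str, child_model: str) -> bool:
    """A parent may delegate only to an equal or lower tier (same or larger number)."""
    return TIERS[child_model] >= TIERS[parent_model]


def judge_model() -> str:
    """The judge always runs on the lowest tier, regardless of who delegated."""
    return max(TIERS, key=TIERS.get)
```

Because the judge tier is fixed rather than derived from the caller, a tier-3 sub-agent cannot end up paying for a tier-1 verifier.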

Key code patterns from engine.py:

# Acceptance criteria in subtask definition
subtask_tool = {
    "objective": "Cross-reference vendor names against OFAC SDN list",
    "acceptance_criteria": "Output CSV contains columns: vendor_name, sdn_match, match_score, match_type. At least one row per input vendor.",
    "model": "haiku",  # optional: request specific model tier
    "reasoning_effort": "low"  # optional: control reasoning depth
}

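# After the sub-agent completes, the judge evaluates its output
objective = subtask_tool["objective"]
acceptance_criteria = subtask_tool["acceptance_criteria"]
sub_agent_result = "..."  # placeholder for the sub-agent's final output
judge_prompt = f"""Evaluate whether this output meets the acceptance criteria.
Objective: {objective}
Acceptance Criteria: {acceptance_criteria}
Output: {sub_agent_result}
Respond with PASS or FAIL and explain why."""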
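
Since the prompt asks for a free-text "PASS or FAIL" reply, the caller still has to turn that reply into a decision. The sketch below shows one possible way to do it. The Verdict type and parse_verdict function are hypothetical, and treating an unparseable reply as FAIL is an assumption chosen so that a confused judge never produces a silent PASS.

```python
import re
from typing import NamedTuple


class Verdict(NamedTuple):
    passed: bool
    reasoning: str


def parse_verdict(judge_reply: str) -> Verdict:
    """Read the first standalone uppercase PASS or FAIL token; anything else is a FAIL."""
    match = re.search(r"\b(PASS|FAIL)\b", judge_reply)
    if match is None:
        # Assumption: fail closed when the judge does not follow the format.
        return Verdict(False, "Judge reply contained no PASS/FAIL verdict: " + judge_reply.strip())
    # Everything after the token is kept as the judge's explanation.
    reasoning = judge_reply[match.end():].lstrip(" :.-\n").strip()
    return Verdict(match.group(1) == "PASS", reasoning)
```

The match is case-sensitive on purpose, so ordinary prose such as "this would pass if..." is not mistaken for a verdict.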

The "Think" Tool with Acceptance Criteria

OpenPlanter also has a think tool that uses acceptance criteria for self-verification within a single agent:

{
    "name": "think",
    "description": "Think through a problem step by step. Optionally specify acceptance_criteria to self-check your reasoning.",
    "parameters": {
        "thought": "string - your reasoning",
        "acceptance_criteria": "string - optional criteria to verify your reasoning against"
    }
}

This is lighter-weight — the agent checks its own reasoning rather than delegating to a judge. Less robust than independent verification, but useful for single-agent reasoning quality.

Key Design Decisions

  1. Cheap judge model — Verification should be much cheaper than implementation. Using the lowest-tier model keeps overhead minimal.
  2. No auto-retry — The parent decides how to handle failures. This prevents infinite retry loops and preserves parent agency.
  3. Independent verification — The implementing agent doesn't evaluate its own work. This addresses self-evaluation bias (agents tend to rate their own output favorably).
  4. Acceptance criteria are explicit — Not just "do this well" but "output must contain X, Y, Z with these properties." This forces the parent to think about what success looks like before delegating.

Current State in Hermes Agent

delegate_task (tools/delegate_tool.py):

  • Spawns child AIAgent instances with isolated context
  • Two modes: single task (goal parameter) or batch (tasks array, up to 3 parallel)
  • Child returns a summary string to the parent
  • No acceptance criteria, no verification, no quality gating
  • Parent must manually evaluate output quality

Context compression (agent/context_compressor.py):

  • Has cheap-model summarization (Gemini Flash) — shows the pattern of using a cheap model for meta-tasks already exists in the codebase

What's missing:

  • No way to specify "what does success look like" when delegating
  • No independent verification of sub-agent output
  • No PASS/FAIL signal — parent gets raw output with no quality indicator
  • No structured feedback on failures (what specifically was missing/wrong)

Implementation Plan

Skill vs. Tool Classification

This is a codebase change to tools/delegate_tool.py (and potentially agent/ai_agent.py). It is not a skill, because it modifies the core delegation mechanism, and it is not a new tool, because it extends an existing one.

What We'd Need

  1. New acceptance_criteria parameter on delegate_task — Optional string describing success conditions
  2. Judge evaluation function — Takes criteria + output, returns PASS/FAIL + reasoning
  3. Judge model selection — Use cheapest available model (similar to how context_compressor uses Gemini Flash)
  4. Result enrichment — Return both the sub-agent output AND the judge's verdict to the parent

Phased Rollout

Phase 1: Basic Acceptance Criteria + Judge

  • Add optional acceptance_criteria parameter to delegate_task (both single and batch modes)
  • After sub-agent completes, if criteria provided, run a cheap-model judge
  • Judge returns: {"verdict": "PASS"|"FAIL", "reasoning": "...", "output": "..."}
  • Parent receives enriched result with the verdict
  • If no criteria provided, behavior is unchanged (backward compatible)
  • Use the same cheap-model approach as context_compressor (Gemini Flash or configurable)
  • Deliverable: Quality-gated delegation with independent verification
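
A rough sketch of the Phase 1 behavior, under stated assumptions: delegate_with_criteria, run_child, and run_judge are hypothetical names, and the two callables stand in for the real sub-agent run and cheap-model judge call, neither of which is shown here. It only illustrates the optional parameter and the enriched result shape listed above.

```python
from typing import Callable, Optional, Union

# Hypothetical stand-ins for the real sub-agent run and the judge call.
RunChild = Callable[[str], str]             # goal -> summary string
RunJudge = Callable[[str, str, str], dict]  # (goal, criteria, output) -> verdict dict


def delegate_with_criteria(
    goal: str,
    run_child: RunChild,
    run_judge: RunJudge,
    acceptance_criteria: Optional[str] = None,
) -> Union[str, dict]:
    """Run the child; attach a judge verdict only when criteria were supplied."""
    output = run_child(goal)
    if acceptance_criteria is None:
        return output  # unchanged behavior for existing callers
    judged = run_judge(goal, acceptance_criteria, output)
    return {
        "verdict": judged["verdict"],      # "PASS" or "FAIL"
        "reasoning": judged["reasoning"],
        "output": output,
    }
```

Returning a plain string in one case and a dict in the other is only one option. Whether the verdict should instead be folded into the summary string is still undecided.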

Phase 2: Think Tool for Self-Verification

  • Add a lightweight think tool (or extend the existing reasoning mechanism)
  • Allows the agent to explicitly reason through a problem with optional self-check
  • Less overhead than full delegation — useful for complex reasoning steps within a single agent
  • Think tool output can optionally be hidden from the final response (scratchpad mode)
  • Deliverable: Structured reasoning with optional acceptance criteria

Phase 3: Integration with Workflow Formulas (#344)


Pros & Cons

Pros

  • Directly improves sub-agent output quality — Independent verification catches errors the implementing agent misses
  • Cheap overhead — Judge uses lowest-tier model; verification cost is <5% of implementation cost
  • Reduces parent token consumption — Parent gets a PASS/FAIL signal instead of having to evaluate raw output
  • Forces clearer delegation — Writing acceptance criteria makes the parent think about what success looks like
  • Backward compatible — acceptance_criteria is optional; existing delegate_task calls unchanged
  • Pattern already exists in codebase — context_compressor already uses cheap models for meta-tasks
  • Proven in production — OpenPlanter uses this successfully with recursive 4-depth delegation

Cons / Risks

  • Additional API call per delegation — One extra LLM call for the judge, even though it's cheap
  • False negatives — Judge model may incorrectly FAIL valid output, causing unnecessary retries
  • False positives — Judge model may incorrectly PASS invalid output, giving false confidence
  • Criteria quality matters — Vague criteria ("do a good job") produce meaningless verdicts. The feature is only as good as the criteria written
  • Judge model availability — Needs a cheap model configured; may not work if only one expensive model is available
  • Scope creep risk — Could grow into a complex verification framework; should stay simple

Open Questions

  1. Which model for the judge? Use context_compressor's Gemini Flash approach, or let the user configure a judge model?
  2. Should the judge have access to the sub-agent's tool calls? OpenPlanter passes only the final output. Including intermediate steps would improve judgment but increase cost.
  3. Should we auto-retry on FAIL? OpenPlanter doesn't. But a simple retry_on_fail=1 parameter could be useful for straightforward tasks.
  4. How to handle batch mode? Should each task in a batch have its own acceptance criteria, or one set for the whole batch?
  5. Should the verdict be part of the summary string or a structured field? Structured is cleaner but changes the return format.
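
To make question 3 concrete, here is one way a bounded retry_on_fail could behave. This is an illustration of the option and not a proposal for the final API: delegate_with_retry and the attempt callable (assumed to return passed, output, and judge feedback) are hypothetical.

```python
from typing import Callable, Tuple


def delegate_with_retry(
    attempt: Callable[[str], Tuple[bool, str, str]],
    goal: str,
    retry_on_fail: int = 0,
) -> Tuple[bool, str, int]:
    """Call `attempt` at most 1 + retry_on_fail times, feeding judge feedback forward.

    Returns (passed, last_output, attempts_used).
    """
    attempts = max(retry_on_fail, 0) + 1  # always run at least once
    prompt = goal
    for used in range(1, attempts + 1):
        passed, output, feedback = attempt(prompt)
        if passed:
            return True, output, used
        # Give the next attempt the judge's specific complaint.
        prompt = f"{goal}\n\nPrevious attempt failed the judge: {feedback}"
    return False, output, used
```

With the default of 0 this matches OpenPlanter's behavior of one attempt with the verdict handed back to the parent, and the hard cap avoids the infinite retry loops that the no-auto-retry decision was meant to prevent.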
