Overview
OpenPlanter implements a quality-gating pattern for recursive sub-agent delegation called IMPLEMENT-THEN-VERIFY. When a parent agent delegates work via subtask or execute, it specifies acceptance criteria (a clear statement of what "done" looks like). When the sub-agent finishes, a separate cheap "judge" model (lowest tier, e.g. Haiku) evaluates the output against those criteria and returns PASS or FAIL with specific feedback. The key insight is that implementation and verification must be uncorrelated: the agent that does the work should not be its sole verifier.
Hermes Agent's delegate_task currently has no quality gating. Sub-agent results are returned to the parent as-is, with no check that the output actually meets the original goal. The parent agent must therefore evaluate every sub-agent result itself, which consumes tokens and adds latency. Adding acceptance criteria with independent verification should improve sub-agent output quality while reducing parent workload.
This is related to but distinct from #344 (Workflow Formulas), which mentions acceptance criteria in workflow step definitions but doesn't detail the independent verification mechanism. This feature is independently valuable — it applies to ALL delegate_task calls, not just structured workflows.
Research Findings
How OpenPlanter's Acceptance Criteria System Works
In engine.py (~1011 lines), the core pattern:
- When dispatching a subtask or execute tool call, the parent specifies:
  - objective: what to accomplish
  - acceptance_criteria: explicit success conditions (e.g., "Output file exists and contains at least 3 entity matches with confidence > 0.7")
- The sub-agent runs its full solve loop (potentially recursively), producing a result.
- After completion, if acceptance criteria were provided, a judge model evaluates:
  - The judge is always the LOWEST TIER model (cheapest/fastest, e.g., Haiku-class)
  - The judge receives: the original objective, the acceptance criteria, and the sub-agent's final output
  - The judge returns: PASS or FAIL with specific reasoning
- On FAIL:
  - The failure feedback is returned to the parent agent
  - The parent can decide to retry, modify the approach, or accept partial results
  - OpenPlanter does NOT auto-retry; it lets the parent decide
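The steps above can be sketched roughly as follows. This is a minimal illustration, not OpenPlanter's actual code: `delegate_with_verification`, `run_sub_agent`, `judge`, `Verdict`, and `DelegationResult` are all hypothetical names, and the sub-agent and judge are passed in as plain callables so the control flow is visible on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    reasoning: str


@dataclass
class DelegationResult:
    output: str
    verdict: Optional[Verdict]  # None when no criteria were supplied


def delegate_with_verification(
    objective: str,
    run_sub_agent: Callable[[str], str],        # hypothetical: runs the sub-agent's solve loop
    judge: Callable[[str, str, str], Verdict],  # hypothetical: cheap-model judge call
    acceptance_criteria: Optional[str] = None,
) -> DelegationResult:
    """Run the sub-agent once, then judge its output if criteria were given."""
    output = run_sub_agent(objective)
    if acceptance_criteria is None:
        # No criteria: skip the judge entirely.
        return DelegationResult(output=output, verdict=None)
    verdict = judge(objective, acceptance_criteria, output)
    # No retry here: a FAIL is reported back and the parent decides what to do.
    return DelegationResult(output=output, verdict=verdict)
```

The judge sees only the objective, the criteria, and the final output, which keeps it independent of whatever reasoning the sub-agent used to get there.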
Tiered model routing reinforces this:
- Sub-agents can only delegate to EQUAL OR LOWER tier models (never escalate)
- Opus (tier 1) → Sonnet (tier 2) → Haiku (tier 3)
- The judge always uses tier 3 (cheapest), ensuring verification cost is negligible
- This prevents a cheap sub-agent from spawning an expensive verifier
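A small sketch of the routing rule described above, assuming tiers are plain integers where 1 is the most capable and 3 the cheapest. The `TIERS` table and both function names are hypothetical and only illustrate the "equal or lower tier" check and the fixed judge tier.

```python
# Tier numbers follow the text: 1 is the most capable, 3 the cheapest.
TIERS = {"opus": 1, "sonnet": 2, "haiku": 3}
JUDGE_TIER = max(TIERS.values())  # the judge always runs on the cheapest tier


def can_delegate(parent_model: str, child_model: str) -> bool:
    """A parent may delegate only to an equal or cheaper tier, never a more capable one."""
    return TIERS[child_model] >= TIERS[parent_model]


def judge_model() -> str:
    """Return the name of the model used for verification."""
    return next(name for name, tier in TIERS.items() if tier == JUDGE_TIER)
```

Under this table, `can_delegate("sonnet", "opus")` is rejected, while `can_delegate("sonnet", "haiku")` is allowed.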
Key code patterns from engine.py:
# Acceptance criteria in subtask definition
subtask_tool = {
    "objective": "Cross-reference vendor names against OFAC SDN list",
    "acceptance_criteria": "Output CSV contains columns: vendor_name, sdn_match, match_score, match_type. At least one row per input vendor.",
    "model": "haiku",  # optional: request specific model tier
    "reasoning_effort": "low"  # optional: control reasoning depth
}
# After sub-agent completes, judge evaluates
judge_prompt = f"""Evaluate whether this output meets the acceptance criteria.
Objective: {objective}
Acceptance Criteria: {acceptance_criteria}
Output: {sub_agent_result}
Respond with PASS or FAIL and explain why."""
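Because the prompt above asks for PASS or FAIL in free text, the caller still has to extract a verdict from the reply. One possible way to do that is sketched below. `parse_verdict` is a hypothetical helper, and treating an unparseable reply as FAIL is an assumption of this sketch, not something the source confirms.

```python
import re
from typing import Tuple

# Matches the first standalone, upper-case PASS or FAIL in the judge's reply.
_VERDICT_RE = re.compile(r"\b(PASS|FAIL)\b")


def parse_verdict(judge_response: str) -> Tuple[str, str]:
    """Return (verdict, reasoning); an unparseable reply is treated as FAIL."""
    match = _VERDICT_RE.search(judge_response)
    if match is None:
        return "FAIL", f"Judge reply had no PASS/FAIL verdict: {judge_response.strip()}"
    # Everything after the verdict word is taken as the reasoning.
    reasoning = judge_response[match.end():].lstrip(" :.-\n")
    return match.group(1), reasoning
```

Failing closed means a malformed judge reply can never be mistaken for a PASS, at the cost of an occasional spurious FAIL that the parent has to look at.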
The "Think" Tool with Acceptance Criteria
OpenPlanter also has a think tool that uses acceptance criteria for self-verification within a single agent:
{
  "name": "think",
  "description": "Think through a problem step by step. Optionally specify acceptance_criteria to self-check your reasoning.",
  "parameters": {
    "thought": "string - your reasoning",
    "acceptance_criteria": "string - optional criteria to verify your reasoning against"
  }
}
This is lighter-weight — the agent checks its own reasoning rather than delegating to a judge. Less robust than independent verification, but useful for single-agent reasoning quality.
Key Design Decisions
- Cheap judge model — Verification should be much cheaper than implementation. Using the lowest-tier model keeps overhead minimal.
- No auto-retry — The parent decides how to handle failures. This prevents infinite retry loops and preserves parent agency.
- Independent verification — The implementing agent doesn't evaluate its own work. This addresses self-evaluation bias (agents tend to rate their own output favorably).
- Acceptance criteria are explicit — Not just "do this well" but "output must contain X, Y, Z with these properties." This forces the parent to think about what success looks like before delegating.
Current State in Hermes Agent
delegate_task (tools/delegate_tool.py):
- Spawns child AIAgent instances with isolated context
- Two modes: single task (goal parameter) or batch (tasks array, up to 3 parallel)
- Child returns a summary string to the parent
- No acceptance criteria, no verification, no quality gating
- Parent must manually evaluate output quality
Context compression (agent/context_compressor.py):
- Has cheap-model summarization (Gemini Flash) — shows the pattern of using a cheap model for meta-tasks already exists in the codebase
What's missing:
- No way to specify "what does success look like" when delegating
- No independent verification of sub-agent output
- No PASS/FAIL signal — parent gets raw output with no quality indicator
- No structured feedback on failures (what specifically was missing/wrong)
Implementation Plan
Skill vs. Tool Classification
This is a codebase change to tools/delegate_tool.py (and potentially agent/ai_agent.py). Not a skill — it modifies the core delegation mechanism. Not a new tool — it extends an existing one.
What We'd Need
- New acceptance_criteria parameter on delegate_task: Optional string describing success conditions
- Judge evaluation function — Takes criteria + output, returns PASS/FAIL + reasoning
- Judge model selection — Use cheapest available model (similar to how context_compressor uses Gemini Flash)
- Result enrichment — Return both the sub-agent output AND the judge's verdict to the parent
Phased Rollout
Phase 1: Basic Acceptance Criteria + Judge
- Add optional acceptance_criteria parameter to delegate_task (both single and batch modes)
- After sub-agent completes, if criteria provided, run a cheap-model judge
- Judge returns: {"verdict": "PASS"|"FAIL", "reasoning": "...", "output": "..."}
- Parent receives enriched result with the verdict
- If no criteria provided, behavior is unchanged (backward compatible)
- Use the same cheap-model approach as context_compressor (Gemini Flash or configurable)
- Deliverable: Quality-gated delegation with independent verification
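A rough sketch of how the Phase 1 return shape could stay backward compatible. Nothing here is taken from the real tools/delegate_tool.py: `enrich_result` and `JudgeFn` are hypothetical, and the judge is injected as a callable so the sketch does not assume any particular model client.

```python
from typing import Callable, Dict, Optional, Union

# Hypothetical judge signature: (criteria, output) -> {"verdict": ..., "reasoning": ...}
JudgeFn = Callable[[str, str], Dict[str, str]]


def enrich_result(
    summary: str,
    acceptance_criteria: Optional[str],
    judge_fn: JudgeFn,
) -> Union[str, Dict[str, str]]:
    """Attach a judge verdict to a child summary only when criteria were supplied."""
    if not acceptance_criteria:
        # Backward compatible path: the parent receives the plain summary string.
        return summary
    judged = judge_fn(acceptance_criteria, summary)
    return {
        "verdict": judged["verdict"],
        "reasoning": judged["reasoning"],
        "output": summary,
    }
```

Returning a string in one case and a dict in the other is the simplest way to leave existing callers untouched. Always returning a structured result would be cleaner but changes the return format for everyone.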
Phase 2: Think Tool for Self-Verification
- Add a lightweight think tool (or extend the existing reasoning mechanism)
- Allows the agent to explicitly reason through a problem with optional self-check
- Less overhead than full delegation — useful for complex reasoning steps within a single agent
- Think tool output can optionally be hidden from the final response (scratchpad mode)
- Deliverable: Structured reasoning with optional acceptance criteria
Phase 3: Integration with Workflow Formulas (#344)
Pros & Cons
Pros
- Directly improves sub-agent output quality — Independent verification catches errors the implementing agent misses
- Cheap overhead — Judge uses lowest-tier model; verification cost is <5% of implementation cost
- Reduces parent token consumption — Parent gets a PASS/FAIL signal instead of having to evaluate raw output
- Forces clearer delegation — Writing acceptance criteria makes the parent think about what success looks like
- Backward compatible — acceptance_criteria is optional; existing delegate_task calls unchanged
- Pattern already exists in codebase — context_compressor already uses cheap models for meta-tasks
- Proven in production — OpenPlanter uses this successfully with recursive 4-depth delegation
Cons / Risks
- Additional API call per delegation — One extra LLM call for the judge, even though it's cheap
- False negatives — Judge model may incorrectly FAIL valid output, causing unnecessary retries
- False positives — Judge model may incorrectly PASS invalid output, giving false confidence
- Criteria quality matters — Vague criteria ("do a good job") produce meaningless verdicts. The feature is only as good as the criteria written
- Judge model availability — Needs a cheap model configured; may not work if only one expensive model is available
- Scope creep risk — Could grow into a complex verification framework; should stay simple
Open Questions
- Which model for the judge? Use context_compressor's Gemini Flash approach, or let the user configure a judge model?
- Should the judge have access to the sub-agent's tool calls? OpenPlanter passes only the final output. Including intermediate steps would improve judgment but increase cost.
- Should we auto-retry on FAIL? OpenPlanter doesn't. But a simple retry_on_fail=1 parameter could be useful for straightforward tasks.
- How to handle batch mode? Should each task in a batch have its own acceptance criteria, or one set for the whole batch?
- Should the verdict be part of the summary string or a structured field? Structured is cleaner but changes the return format.
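To make the auto-retry question concrete, here is one possible shape for a bounded retry_on_fail option. It is a sketch under assumptions: `delegate_with_retry` and `run_and_judge` are hypothetical, and feeding the judge's reasoning back into the retried goal is a design choice of this example rather than anything OpenPlanter does.

```python
from typing import Callable, Dict

# Hypothetical: runs one delegation plus judge pass and returns
# {"verdict": "PASS"|"FAIL", "reasoning": "...", "output": "..."}.
RunAndJudge = Callable[[str, str], Dict[str, str]]


def delegate_with_retry(
    goal: str,
    acceptance_criteria: str,
    run_and_judge: RunAndJudge,
    retry_on_fail: int = 0,
) -> Dict[str, object]:
    """Retry a failed delegation at most retry_on_fail times, then return the last result."""
    attempts = 0
    feedback = ""
    while True:
        attempts += 1
        prompt = goal
        if feedback:
            # Pass the judge's feedback along so the retry can address it.
            prompt = f"{goal}\n\nPrevious attempt failed verification: {feedback}"
        result = run_and_judge(prompt, acceptance_criteria)
        if result["verdict"] == "PASS" or attempts > retry_on_fail:
            return {**result, "attempts": attempts}
        feedback = result["reasoning"]
```

With the default of 0 this behaves like OpenPlanter (no auto-retry), and the hard cap keeps the loop from running indefinitely when the criteria cannot be met.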
References
- delegate_tool.py: Current delegation implementation
- context_compressor.py: Existing cheap-model meta-task pattern