feat(delegate): add acceptance criteria and independent judge to dele…#17980
Open
MorAlekss wants to merge 1 commit into
Summary
Implements Phase 1 of #356: acceptance criteria and independent judge for delegate_task.
Root cause
Sub-agent results returned by `delegate_task` are self-reports with no independent verification. The parent agent must manually evaluate every result, consuming tokens and adding latency. There is no way to specify what "done" looks like when delegating, and no quality signal on whether the output actually meets the goal.
Fix
Added an optional `acceptance_criteria` parameter to `delegate_task`. When provided, a new `_judge_output()` function evaluates the sub-agent's output against the criteria using a cheap auxiliary LLM and returns a PASS/FAIL verdict with reasoning. The result is enriched with a `judge` field. If no criteria are provided, behavior is unchanged.
What changed
- `tools/delegate_tool.py`: added `acceptance_criteria: Optional[str] = None` to the `delegate_task` signature
- `tools/delegate_tool.py`: added `acceptance_criteria` to `DELEGATE_TASK_SCHEMA`, both top-level and per-task in batch mode
- `tools/delegate_tool.py`: added the `_judge_output()` function, which calls `call_llm(task="judge")`, parses the PASS/FAIL JSON, and degrades gracefully on failure
- `tools/delegate_tool.py`: wired judge evaluation into `_run_single_child`, which enriches `entry["judge"]` when criteria are provided
- `tests/tools/test_delegate.py`: added `TestJudgeOutput` with 5 tests
What is not affected
- `delegate_task` behavior when `acceptance_criteria` is not provided: unchanged, zero breaking changes
- `acceptance_criteria` provided but the judge LLM call fails: the judge returns `{"verdict": "FAIL", "reasoning": "Judge unavailable: ..."}`; no exception is raised
Behavioral change
When `acceptance_criteria` is provided, `delegate_task` results now include a `judge` field:

    {
      "task_index": 0,
      "status": "completed",
      "summary": "...",
      "judge": {
        "verdict": "PASS",
        "reasoning": "Output contains required columns"
      }
    }

Tests
5 new tests in `TestJudgeOutput`, all pass. No regressions in existing tests.
- `test_judge_pass` - judge returns PASS when criteria are met
- `test_judge_fail` - judge returns FAIL when criteria are not met
- `test_no_criteria_no_judge` - no criteria skips judge, returns PASS
- `test_judge_unavailable_skips` - LLM failure returns FAIL with reasoning, no exception
- `test_delegate_task_passes_acceptance_criteria` - integration: `acceptance_criteria` flows through to `_run_single_child`

Phase 2 (think tool for self-verification) is planned as a follow-up PR.
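
For reviewers, the judge behavior that the tests exercise can be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the name `judge_output_sketch`, the injected `llm` callable (standing in for `call_llm(task="judge")`, whose real signature is not shown here), the prompt wording, and the "No acceptance criteria provided" reasoning string are all hypothetical.

```python
import json
from typing import Callable, Optional


def judge_output_sketch(
    output: str,
    acceptance_criteria: Optional[str],
    llm: Callable[[str], str],
) -> dict:
    """Hypothetical stand-in for _judge_output(); not the PR's actual code."""
    # No criteria: skip the judge entirely and treat the output as passing.
    if not acceptance_criteria:
        return {"verdict": "PASS", "reasoning": "No acceptance criteria provided"}

    # Assumed prompt wording; the real prompt may differ.
    prompt = (
        "Evaluate the output against the criteria. "
        'Reply with JSON: {"verdict": "PASS" or "FAIL", "reasoning": "..."}\n\n'
        f"Criteria:\n{acceptance_criteria}\n\nOutput:\n{output}"
    )
    try:
        raw = llm(prompt)  # assumed: returns the model's text reply
        parsed = json.loads(raw)
        verdict = str(parsed.get("verdict", "")).upper()
        if verdict not in ("PASS", "FAIL"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        return {"verdict": verdict, "reasoning": str(parsed.get("reasoning", ""))}
    except Exception as exc:
        # Graceful degradation: report FAIL with a reason instead of raising.
        return {"verdict": "FAIL", "reasoning": f"Judge unavailable: {exc}"}
```

Injecting the LLM call as a parameter is a choice made only for this sketch; it keeps the example runnable without a model and makes the failure path easy to trigger in a test.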
This is a prerequisite for the simplify skill (#379): independent judge evaluation directly addresses the need for quality-gating of parallel review agents.
Part of #356 (Phase 1/3)
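
As an illustration of the quality-gating use case mentioned above, a parent could partition batch results on the `judge` field. This is a minimal sketch, assuming results are plain dicts shaped like the example under "Behavioral change"; `split_by_verdict` is a hypothetical helper that is not part of this PR, and the sample entries are made up.

```python
from typing import Dict, Iterable, List, Tuple


def split_by_verdict(results: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Hypothetical helper: partition delegate_task results by judge verdict."""
    accepted, rejected = [], []
    for entry in results:
        judge = entry.get("judge")
        # Entries without a "judge" field were delegated without criteria;
        # this sketch lets them through unchanged.
        if judge is None or judge.get("verdict") == "PASS":
            accepted.append(entry)
        else:
            rejected.append(entry)
    return accepted, rejected


# Illustrative values only; not output from a real run.
results = [
    {"task_index": 0, "status": "completed", "summary": "...",
     "judge": {"verdict": "PASS", "reasoning": "Output contains required columns"}},
    {"task_index": 1, "status": "completed", "summary": "...",
     "judge": {"verdict": "FAIL", "reasoning": "Judge unavailable: ..."}},
    {"task_index": 2, "status": "completed", "summary": "..."},
]
accepted, rejected = split_by_verdict(results)
print([e["task_index"] for e in accepted])  # [0, 2]
print([e["task_index"] for e in rejected])  # [1]
```

Because a judge failure is reported as FAIL rather than raised, a gate like this treats "judge unavailable" the same as a genuine rejection; a caller that wants to distinguish the two would need to inspect the `reasoning` text.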