Problem
Workflows currently let one agent perform the main implementation path, but there is no first-class way to run a separate model as a post-code, pre-PR judge. That makes it hard to enforce an independent review pass before a workflow pushes, opens a PR, or marks the run done.
We want workflows to support multi-agent stages where one stage can implement and another stage can review the result using a different model/provider.
Goal
Add workflow review stages that can run after code changes and tests, before PR creation. A review stage should be able to inspect constrained inputs, produce a structured verdict, and optionally block PR creation or send findings back to the implementing agent.
Example use cases:
- Post-code review with a different model family than the implementer.
- Security or regression judge before PR creation.
- Test adequacy review after unit/E2E output is available.
- Policy enforcement before workflows mark themselves complete.
Proposed workflow shape
    stages:
      - name: implement
        agent:
          model: gpt-5
          instructions: |
            Implement the issue. Commit changes.
      - name: test
        run:
          command: go test ./...
      - name: review
        judge:
          model: claude-sonnet-4
          inputs:
            - issue
            - git_diff
            - test_output
          require:
            verdict: pass
          instructions: |
            Review the diff for correctness, security, regressions, and missing tests.
            Return pass/fail with specific required fixes.
      - name: pr
        run:
          command: gh pr create ...
        if: stages.review.verdict == "pass"
The exact YAML can change, but the concept should be explicit: a judge/review stage is different from the main implementing agent stage.
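To make the `if:` line in the shape above concrete, here is a minimal sketch of how a condition such as `stages.review.verdict == "pass"` might be resolved against prior stage outputs. Everything here is an assumption for illustration: the function names `lookup` and `evaluate_condition`, the nested-dict layout of stage outputs, and the restriction to `==` / `!=` against a quoted string are hypothetical, not an existing workflow API.

```python
import re

# Hypothetical grammar: <dotted.path> (== | !=) "<string literal>"
_CONDITION = re.compile(r'^\s*([A-Za-z_][\w.]*)\s*(==|!=)\s*"([^"]*)"\s*$')


def lookup(context: dict, path: str):
    """Walk a dotted path such as 'stages.review.verdict'; missing keys give None."""
    node = context
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def evaluate_condition(expr: str, context: dict) -> bool:
    """Evaluate one comparison against stage outputs; reject anything else."""
    match = _CONDITION.match(expr)
    if match is None:
        raise ValueError(f"unsupported condition: {expr!r}")
    path, op, literal = match.groups()
    actual = lookup(context, path)
    return (actual == literal) if op == "==" else (actual != literal)


# Assumed layout: each stage's structured output is stored under its name.
outputs = {"stages": {"review": {"verdict": "pass"}}}
print(evaluate_condition('stages.review.verdict == "pass"', outputs))
```

In this sketch a reference to a stage that never ran resolves to None, so an `==` check against "pass" is false and the PR stage would be skipped rather than run by accident.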
Behavioral requirements
- A judge stage can choose a different model/provider from the implementer.
- Judge inputs should be constrained and explicit, such as issue text, git diff, changed files, test output, CI output, or prior stage artifacts.
- Judge output should be structured, not just free-form prose. At minimum:
  - verdict: pass or fail
  - summary
  - findings: file/line/comment/severity when available
  - required_fixes
- Workflows should be able to decide whether a failed judge blocks PR creation, opens a PR with a warning, or sends findings back to the implementer.
- Judge output should be visible in the claw chat and persisted as a workflow/stage artifact.
- The workflow must have hard bounds: max attempts, timeout, and token/cost limits.
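As one possible reading of the structured-output requirement above, the sketch below parses a judge response and rejects anything that is not a well-formed verdict. It assumes the judge is asked to reply in JSON; the class names `Finding` and `JudgeVerdict` and the function `parse_verdict` are hypothetical, and only the field names (verdict, summary, findings, required_fixes) come from this issue.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    comment: str
    file: Optional[str] = None      # optional: "when available"
    line: Optional[int] = None
    severity: Optional[str] = None


@dataclass
class JudgeVerdict:
    verdict: str                    # "pass" or "fail"
    summary: str
    findings: list = field(default_factory=list)
    required_fixes: list = field(default_factory=list)


def parse_verdict(raw: str) -> JudgeVerdict:
    """Parse judge output (assumed to be JSON) and reject unstructured replies."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    verdict = data.get("verdict")
    if verdict not in ("pass", "fail"):
        raise ValueError(f"verdict must be 'pass' or 'fail', got {verdict!r}")
    summary = data.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("summary is required")
    findings = [
        Finding(
            comment=item.get("comment", ""),
            file=item.get("file"),
            line=item.get("line"),
            severity=item.get("severity"),
        )
        for item in data.get("findings", [])
    ]
    return JudgeVerdict(
        verdict=verdict,
        summary=summary,
        findings=findings,
        required_fixes=list(data.get("required_fixes", [])),
    )
```

Treating an unparseable reply as an error rather than as a pass keeps a malformed judge response from silently unblocking PR creation; whether that error should count as a failed verdict is left to the workflow's failure behavior.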
MVP proposal
Start with a bounded review loop:
- Implementer runs.
- Tests run.
- Judge reviews issue + diff + test output.
- If judge passes, continue to PR creation.
- If judge fails, send findings back to the same implementer once.
- Re-run tests.
- Run one final judge pass.
- If still failing, stop or follow the workflow's configured failure behavior.
Avoid unbounded agent debate in the first version.
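The steps above can be sketched as a small loop with a hard cap on retries. This is an illustration under stated assumptions, not a proposed implementation: `run_review_loop`, `LoopResult`, and the three callables (`implement`, `run_tests`, `judge`) are hypothetical stand-ins for the real stages, and the verdict is modeled as a plain dict.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    passed: bool
    judge_runs: int
    last_verdict: dict


def run_review_loop(
    implement: Callable[[Optional[dict]], None],
    run_tests: Callable[[], str],
    judge: Callable[[str], dict],
    max_retries: int = 1,
) -> LoopResult:
    """Implement, test, judge; on failure feed findings back at most max_retries times."""
    feedback: Optional[dict] = None
    judge_runs = 0
    verdict: dict = {}
    for _ in range(max_retries + 1):
        implement(feedback)            # the first call receives no feedback
        test_output = run_tests()
        verdict = judge(test_output)
        judge_runs += 1
        if verdict.get("verdict") == "pass":
            return LoopResult(True, judge_runs, verdict)
        feedback = verdict             # hand the findings to the next attempt
    # Retries exhausted: the caller applies the configured failure behavior.
    return LoopResult(False, judge_runs, verdict)
```

With the default `max_retries=1` the judge runs at most twice, matching the one-retry MVP. Timeouts and token/cost limits are not modeled here and would need to wrap each callable.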
Open questions
- Should judge stages be allowed to edit files, or should they be read-only by default?
- Should judge findings become normal chat messages, stage artifacts, PR comments, or all three?
- Should the implementer receive the judge's full reasoning or only structured findings?
- How should workflows reference prior stage outputs in conditions?
- Do we need built-in judge presets such as code_review, security_review, and test_review?
Acceptance criteria
- A workflow can define a model-backed review/judge stage after implementation and tests.
- The judge can use a different model from the implementer.
- The judge receives explicit bounded inputs, including at least issue context, git diff, and test output.
- The judge returns a structured verdict that can block PR creation.
- The review result is visible in the UI/chat and stored as a stage artifact.
- The implementation supports a bounded one-retry feedback loop from judge to implementer.
- Tests cover pass, fail, retry-once, and block-PR behavior.