Context
PR #368 adds LLM judge stages, which are useful for subjective/model review: "Does this diff look correct?" That is a different capability from deterministic tool review: "Did this required external validation pass?"
For project workflows, we need a first-class way to run specific validation commands before commit or after commit, capture their outputs, and use those results to drive workflow transitions without asking an LLM to reinterpret the output.
Problem
Workflows can run commands and capture outputs, and PR #368 adds LLM judge stages. However, there is no first-class way to treat a deterministic tool result (for example from CodeBuild, E2E tests, security scanners, deploy previews, or platform-specific validation scripts) as a review gate with declarative pass/fail transitions.
Today, this kind of review has to be either encoded indirectly in imperative scripts or reinterpreted by an LLM judge. That makes deterministic validation harder to reason about, harder to display, and harder to use as a strict blocker for later workflow steps like commit or PR creation.
Desired Capability
Add declarative tool review gates for workflow run outputs.
A workflow should be able to:
- Run a command or script before commit or after commit.
- Persist structured output from that command.
- Declaratively inspect the output.
- Transition to pass, fail, fix, commit, or PR stages based on output values.
- Block PR creation when required gates fail.
- Treat skipped gates as pass when configured.
- Show tool gate results in chat/UI.
- Persist tool gate results as stage artifacts.
- Work without requiring an LLM judge to reinterpret tool output.
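As a rough illustration of the "run a command and persist structured output" items above, here is a minimal Python sketch. It assumes the command prints a single JSON object on stdout; that convention, the function name `run_gate_command`, and the `exit_code` and `detail` fields are hypothetical and not part of the existing workflow engine.

```python
import json
import subprocess
from pathlib import Path


def run_gate_command(command: list[str], artifact_path: Path) -> dict:
    """Run a validation command and persist its structured output.

    Assumption: the command writes one JSON object to stdout.
    """
    proc = subprocess.run(command, capture_output=True, text=True)
    try:
        output = json.loads(proc.stdout)
    except json.JSONDecodeError:
        output = None
    if not isinstance(output, dict):
        # Unparseable output is recorded as an error rather than guessed at.
        output = {"status": "error", "detail": "stdout was not a JSON object"}
    # Keep the raw exit code alongside the tool's own fields.
    output["exit_code"] = proc.returncode
    # Persist the result so it can be attached as a stage artifact.
    artifact_path.write_text(json.dumps(output, indent=2))
    return output
```

The result is stored exactly as the tool reported it, plus the exit code, so later gate evaluation never needs to re-run or reinterpret the command.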
Example Shape
- id: android_validation
  label: Android Validation
  on_enter:
    run:
      command: python3 scripts/run_android_codebuild.py --source-dir next_mobile
      output: android_validation
  gate:
    output: android_validation
    pass:
      status:
        - passed
        - skipped
    fail:
      status:
        - failed
        - error
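One way the gate block in the example could be evaluated is sketched below in Python. The function name `evaluate_gate` and the fail-closed rule (anything that does not match the `pass` rule counts as a failure) are assumptions made for illustration, not decided behavior.

```python
from typing import Any, Mapping

# Hypothetical in-memory form of the gate from the example.
GATE = {
    "output": "android_validation",
    "pass": {"status": ["passed", "skipped"]},
    "fail": {"status": ["failed", "error"]},
}


def evaluate_gate(gate: Mapping[str, Any],
                  outputs: Mapping[str, Mapping[str, Any]]) -> str:
    """Return "pass" or "fail" for a gate using only the stored tool output."""
    result = outputs.get(gate["output"], {})
    pass_rule = gate.get("pass", {})
    # Every field named in the pass rule must hold one of its allowed values.
    if pass_rule and all(result.get(field) in allowed
                         for field, allowed in pass_rule.items()):
        return "pass"
    # Assumption: missing output or an unknown status fails closed.
    return "fail"
```

For instance, `evaluate_gate(GATE, {"android_validation": {"status": "skipped"}})` returns `"pass"`, while a missing or unrecognized status returns `"fail"`.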
Then transitions could use explicit gate results:
- id: create_pr
  triggers:
    - gate_result:
        stage: android_validation
        verdict: pass
- id: fix_android
  triggers:
    - gate_result:
        stage: android_validation
        verdict: fail
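The following Python sketch shows how `gate_result` triggers like these might select the next stage. The `next_stages` helper and the `verdicts` mapping (stage id to `"pass"` or `"fail"`) are hypothetical names used only for this illustration.

```python
# Hypothetical in-memory form of the two stages from the example.
STAGES = [
    {"id": "create_pr",
     "triggers": [{"gate_result": {"stage": "android_validation", "verdict": "pass"}}]},
    {"id": "fix_android",
     "triggers": [{"gate_result": {"stage": "android_validation", "verdict": "fail"}}]},
]


def next_stages(stages: list[dict], verdicts: dict[str, str]) -> list[str]:
    """Return ids of stages with at least one satisfied gate_result trigger."""
    ready = []
    for stage in stages:
        for trigger in stage.get("triggers", []):
            wanted = trigger.get("gate_result")
            if wanted and verdicts.get(wanted["stage"]) == wanted["verdict"]:
                ready.append(stage["id"])
                break  # one matching trigger is enough for this stage
    return ready
```

With no recorded verdict for `android_validation`, neither stage is selected, which is how a required gate would block PR creation in this sketch.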
A more general primitive could be:
triggers:
  - output_matches:
      output: android_validation
      path: status
      any_of: [passed, skipped]
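A possible reading of `output_matches` is sketched below in Python. Treating `path` as a dot-separated key path into the stored output is an assumption, as are the helper names `resolve_path` and `output_matches`; the issue does not define the path syntax.

```python
from typing import Any


def resolve_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path through nested dicts; None if absent."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def output_matches(trigger: dict, outputs: dict) -> bool:
    """True when the value at trigger["path"] is one of trigger["any_of"]."""
    value = resolve_path(outputs.get(trigger["output"], {}), trigger["path"])
    return value in trigger["any_of"]


# Hypothetical in-memory form of the trigger from the example.
TRIGGER = {"output": "android_validation", "path": "status",
           "any_of": ["passed", "skipped"]}
```

This primitive is more flexible than a gate verdict, but it pushes the pass/fail definition into every trigger that uses it.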
The gate / gate_result vocabulary may be preferable because it clearly represents review-stage semantics rather than generic template plumbing.
Acceptance Criteria
- Workflows can run a command/script and persist structured output.
- Workflows can run these gates before commit and after commit.
- Workflows can declaratively inspect structured output values.
- Workflows can transition to pass/fail/fix/commit/PR stages based on output values.
- Failed required tool gates can block PR creation.
- Skipped gates can be treated as pass when configured.
- Tool gate results are visible in chat/UI.
- Tool gate results are persisted as stage artifacts.
- Tool gates work without requiring an LLM judge.
Notes
This should remain separate from LLM-as-a-judge behavior. LLM judge stages answer subjective review questions. Tool gates should answer deterministic validation questions and should preserve the exact external validation result that caused a pass or fail verdict.