Add declarative tool review gates for workflow run outputs #369

@marccampbell

Description

Context

PR #368 adds LLM judge stages, which are useful for subjective/model review: "Does this diff look correct?" That is a different capability from deterministic tool review: "Did this required external validation pass?"

For project workflows, we need a first-class way to run specific validation commands before commit or after commit, capture their outputs, and use those results to drive workflow transitions without asking an LLM to reinterpret the output.

Problem

Workflows can run commands and capture outputs, and PR #368 adds LLM judge stages. However, there is no first-class way to treat a deterministic tool result (for example from CodeBuild, E2E tests, security scanners, deploy previews, or platform-specific validation scripts) as a review gate with declarative pass/fail transitions.

Today, this kind of review either has to be encoded indirectly in imperative scripts or reinterpreted by an LLM judge. That makes deterministic validation harder to reason about, harder to display, and harder to use as a strict blocker for later workflow steps like commit or PR creation.

Desired Capability

Add declarative tool review gates for workflow run outputs.

A workflow should be able to:

  • Run a command or script before commit or after commit.
  • Persist structured output from that command.
  • Declaratively inspect the output.
  • Transition to pass, fail, fix, commit, or PR stages based on output values.
  • Block PR creation when required gates fail.
  • Treat skipped gates as pass when configured.
  • Show tool gate results in chat/UI.
  • Persist tool gate results as stage artifacts.
  • Work without requiring an LLM judge to reinterpret tool output.
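
As a rough illustration of the first two capabilities (run a command, persist structured output), here is a minimal Python sketch. It assumes the command prints a single JSON object on stdout; that contract, the function name run_and_capture, and the exit_code and raw_stdout keys are all hypothetical and not part of any existing workflow API.

```python
import json
import subprocess
import sys


def run_and_capture(command: list[str], timeout: float = 600.0) -> dict:
    """Run a command and return its structured output as a dict.

    Assumption: the command prints exactly one JSON object on stdout.
    """
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    try:
        output = json.loads(proc.stdout)
    except json.JSONDecodeError:
        output = None
    if not isinstance(output, dict):
        # No parseable structured output: report an error and keep the raw text.
        output = {"status": "error", "raw_stdout": proc.stdout}
    # Record the exit code unless the tool already reported one itself.
    output.setdefault("exit_code", proc.returncode)
    return output


if __name__ == "__main__":
    # Stand-in for a real validation script: a child process that prints JSON.
    fake_tool = "import json; print(json.dumps({'status': 'passed'}))"
    print(run_and_capture([sys.executable, "-c", fake_tool]))
```

Falling back to an "error" status when the output cannot be parsed keeps a broken tool from being mistaken for a passing one.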

Example Shape

- id: android_validation
  label: Android Validation
  on_enter:
    run:
      command: python3 scripts/run_android_codebuild.py --source-dir next_mobile
      output: android_validation
  gate:
    output: android_validation
    pass:
      status:
        - passed
        - skipped
    fail:
      status:
        - failed
        - error
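
One possible reading of the gate block above, sketched in Python: look up the named output, read its status, and map it to a verdict. The evaluate_gate name and the fail-closed handling of unlisted statuses are assumptions for illustration, not a description of an existing implementation.

```python
def evaluate_gate(gate: dict, outputs: dict) -> str:
    """Return "pass" or "fail" for a gate config shaped like the example.

    Assumption: a status that appears in neither list (or a missing output)
    is treated as "fail", so an unexpected value never unblocks later stages.
    """
    output = outputs.get(gate["output"], {})
    status = output.get("status")
    if status in gate.get("pass", {}).get("status", []):
        return "pass"
    return "fail"


# The gate from the example, as it would look after YAML parsing.
ANDROID_GATE = {
    "output": "android_validation",
    "pass": {"status": ["passed", "skipped"]},
    "fail": {"status": ["failed", "error"]},
}

if __name__ == "__main__":
    print(evaluate_gate(ANDROID_GATE, {"android_validation": {"status": "skipped"}}))
```

Because "skipped" is listed under pass in the config, the skipped-as-pass behavior stays a per-gate choice rather than a global default.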

Then transitions could use explicit gate results:

- id: create_pr
  triggers:
    - gate_result:
        stage: android_validation
        verdict: pass

- id: fix_android
  triggers:
    - gate_result:
        stage: android_validation
        verdict: fail
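
To make the transition semantics concrete, the following Python sketch selects the stages whose gate_result trigger matches a recorded verdict. The ready_stages function and the verdicts mapping (stage id to "pass" or "fail") are hypothetical names used only for this illustration.

```python
def ready_stages(stages: list[dict], verdicts: dict[str, str]) -> list[str]:
    """Return ids of stages with at least one matching gate_result trigger.

    Assumption: a gate with no recorded verdict matches nothing, so a stage
    such as create_pr stays blocked until its gate has actually run.
    """
    ready = []
    for stage in stages:
        for trigger in stage.get("triggers", []):
            wanted = trigger.get("gate_result")
            if wanted and verdicts.get(wanted["stage"]) == wanted["verdict"]:
                ready.append(stage["id"])
                break
    return ready


# The two stages from the example, as they would look after YAML parsing.
STAGES = [
    {"id": "create_pr",
     "triggers": [{"gate_result": {"stage": "android_validation", "verdict": "pass"}}]},
    {"id": "fix_android",
     "triggers": [{"gate_result": {"stage": "android_validation", "verdict": "fail"}}]},
]

if __name__ == "__main__":
    print(ready_stages(STAGES, {"android_validation": "fail"}))
```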

A more general primitive could be:

triggers:
  - output_matches:
      output: android_validation
      path: status
      any_of: [passed, skipped]
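
A minimal Python sketch of how output_matches could be evaluated follows. The example only shows a single-key path (status); treating path as a dot-separated list of keys is an assumption made here for illustration, as is the function name.

```python
def output_matches(trigger: dict, outputs: dict) -> bool:
    """Check whether the value at trigger["path"] is one of trigger["any_of"].

    Assumption: path is a dot-separated sequence of dict keys, for example
    "status" or "summary.status". A missing key never matches.
    """
    value = outputs.get(trigger["output"])
    for key in trigger["path"].split("."):
        if not isinstance(value, dict) or key not in value:
            return False
        value = value[key]
    return value in trigger["any_of"]


# The trigger from the example, as it would look after YAML parsing.
TRIGGER = {
    "output": "android_validation",
    "path": "status",
    "any_of": ["passed", "skipped"],
}

if __name__ == "__main__":
    print(output_matches(TRIGGER, {"android_validation": {"status": "passed"}}))
```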

The gate / gate_result vocabulary may be preferable because it clearly represents review-stage semantics rather than generic template plumbing.

Acceptance Criteria

  • Workflows can run a command/script and persist structured output.
  • Workflows can run these gates before commit and after commit.
  • Workflows can declaratively inspect structured output values.
  • Workflows can transition to pass/fail/fix/commit/PR stages based on output values.
  • Failed required tool gates can block PR creation.
  • Skipped gates can be treated as pass when configured.
  • Tool gate results are visible in chat/UI.
  • Tool gate results are persisted as stage artifacts.
  • Tool gates work without requiring an LLM judge.

Notes

This should remain separate from LLM-as-a-judge behavior. LLM judge stages answer subjective review questions. Tool gates should answer deterministic validation questions and should preserve the exact external validation result that caused a pass or fail verdict.
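
One way to preserve the exact external result alongside the verdict is to store both in a single artifact record. The sketch below is illustrative only: the file naming scheme, the record fields, and the write_gate_artifact function are assumptions, not an existing artifact format.

```python
import json
from pathlib import Path


def write_gate_artifact(artifact_dir: str, stage_id: str, verdict: str, output: dict) -> Path:
    """Persist a gate verdict together with the unmodified tool output.

    Assumption: one JSON file per stage, named "<stage_id>.gate.json".
    """
    path = Path(artifact_dir) / f"{stage_id}.gate.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # The tool output is stored verbatim so the cause of the verdict is auditable.
    record = {"stage": stage_id, "verdict": verdict, "output": output}
    path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return path


def read_gate_artifact(path: Path) -> dict:
    """Load a previously written gate artifact."""
    return json.loads(Path(path).read_text())
```

Keeping the raw output next to the verdict means the chat/UI view and any later audit read the same record, with no LLM summary in between.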

Metadata

Labels: done (The issue is complete)
