feat: DAG loop-back nodes — conditional retry and loop-within-DAG #972

@Wirasm

Migrated from dynamous-community/remote-coding-agent#792 — Archon active development has moved to coleam00/Archon. Original issue retained as historical reference.


Summary

Enable test→fix→retest and other "re-run until condition" patterns in workflows. Rescoped from a sub-DAG node to a composition-based design after a primitives review (see History below).

This issue now tracks two pieces of work:

  1. Prerequisite: a real expression evaluator for when: / until: conditions
  2. Feature: workflow invocation node + workflow-level loop_until

Problem

The DAG topology is computed once at load time (dag-executor.ts) and back-edges are rejected by cycle detection (loader.ts). There's no way to express "run these nodes until tests pass." The only existing iteration primitive is PR #785's loop: node, which iterates a single prompt — not a multi-step graph.

The naive workaround is duplicating nodes (run-tests, fix, run-tests-2, fix-2), which is fragile and caps at whatever count is hardcoded.
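
To make the limitation concrete, here is a minimal sketch of the kind of load-time check that rejects a back-edge. It is not the code in loader.ts; the `DagNode` shape and the `topoOrder` name are assumptions for illustration, and Kahn's algorithm is just one common way to do this. A `test` node that depends on `fix` while `fix` depends on `test` never reaches in-degree zero, so the graph is rejected before anything runs.

```typescript
// Hypothetical node shape; only `id` and `depends_on` mirror the YAML in this issue.
interface DagNode {
  id: string;
  depends_on?: string[];
}

// Kahn's algorithm: returns a topological order, or throws if the graph has a cycle.
function topoOrder(nodes: DagNode[]): string[] {
  const indegree = new Map<string, number>();
  const dependents = new Map<string, string[]>();
  for (const n of nodes) {
    indegree.set(n.id, 0);
    dependents.set(n.id, []);
  }
  for (const n of nodes) {
    for (const dep of n.depends_on ?? []) {
      if (!indegree.has(dep)) throw new Error(`unknown dependency: ${dep}`);
      indegree.set(n.id, (indegree.get(n.id) ?? 0) + 1);
      dependents.get(dep)!.push(n.id);
    }
  }
  // Start with every node that has no unmet dependencies.
  const ready = nodes.filter((n) => indegree.get(n.id) === 0).map((n) => n.id);
  const order: string[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    order.push(id);
    for (const next of dependents.get(id) ?? []) {
      const remaining = (indegree.get(next) ?? 0) - 1;
      indegree.set(next, remaining);
      if (remaining === 0) ready.push(next);
    }
  }
  // Any node left unordered sits on a cycle (a back-edge).
  if (order.length !== nodes.length) throw new Error("cycle detected");
  return order;
}
```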

Why not a sub-DAG node (the original proposal)

The original proposal added a loop_node containing an inner nodes: list with its own edges. From a primitives standpoint this introduces a composite-with-nested-scope concept that brings real design debt:

  • Nested topology (a node that has internal edges)
  • Scoped variable resolution — what does $test.output mean across iterations? Not specified.
  • Recursive validation (cycle detection, ref checks, model compat all need to recurse)
  • Recursive execution (nested cancellation, timeouts, events, JSONL logging)
  • A second meaning for nodes: (outer DAG nodes vs. inner loop-body nodes)
  • Once nesting is allowed one level, two-level nesting becomes the next ask

The motivating examples don't require any of that.

Proposed Approach: composition over nesting

Reuse a primitive that already exists — a workflow is a unit of execution with its own scope, validation, cancellation, events, and logs. Add two small things:

1. Workflow-level loop_until

A workflow can declare itself as a loop body:

# .archon/workflows/test-fix.yaml
loop_until: "$test.exit_code == 0"
max_iterations: 5
nodes:
  - id: test
    bash: |
      bun test
      echo "exit_code=$?"
  - id: fix
    prompt: "Tests failed. Fix: $test.output"
    depends_on: [test]
    when: "$test.exit_code != 0"

The entire workflow re-runs until the condition is met or max_iterations is exhausted.
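
The re-run semantics can be sketched as a small driver around a single DAG pass. This is a sketch under assumptions, not the executor's API: `runWorkflowLoop`, `NodeOutputs`, and `LoopResult` are hypothetical names, the `until` callback stands in for the `loop_until` evaluator, and the function is synchronous only for brevity (a real executor would be async). What happens when `max_iterations` is exhausted (fail the run, or succeed with a flag) is not specified in this issue; the sketch simply reports `satisfied: false`.

```typescript
// All names here are hypothetical; only `loop_until` and `max_iterations` come from the proposal.
type NodeOutputs = Record<string, { output: string; exit_code?: number }>;

interface LoopResult {
  iterations: number;   // how many full passes were executed
  satisfied: boolean;   // true if the condition held before the cap was hit
  outputs: NodeOutputs; // outputs of the final pass only
}

function runWorkflowLoop(
  runDag: (iteration: number) => NodeOutputs, // one full pass over the workflow's DAG
  until: (outputs: NodeOutputs) => boolean,   // stands in for the loop_until evaluator
  maxIterations: number,
): LoopResult {
  if (!Number.isInteger(maxIterations) || maxIterations < 1) {
    throw new Error("max_iterations must be a positive integer");
  }
  let outputs: NodeOutputs = {};
  for (let i = 1; i <= maxIterations; i++) {
    outputs = runDag(i); // each pass starts from a fresh output scope
    if (until(outputs)) return { iterations: i, satisfied: true, outputs };
  }
  return { iterations: maxIterations, satisfied: false, outputs };
}
```

Because each pass replaces `outputs` wholesale, nothing from iteration N leaks into iteration N+1 unless the workflow writes it somewhere external (for example, the working tree that `fix` edits).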

2. Workflow invocation node

A parent workflow can invoke a child workflow as a single node:

# .archon/workflows/ship.yaml
nodes:
  - id: build
    bash: "bun run build"

  - id: test-fix-loop
    workflow: test-fix       # invokes the workflow above
    depends_on: [build]

  - id: pr
    command: archon-create-pr
    depends_on: [test-fix-loop]

The parent sees test-fix-loop as a single node with a single output (the child's final state). Downstream nodes use $test-fix-loop.output as usual.
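
A rough sketch of the function-call semantics described here, with hypothetical names throughout (`WorkflowDef`, `WorkflowNode`, `dispatchWorkflowNode`, and the in-memory `registry` are illustrative assumptions, not existing Archon types). The child's `run` stands in for a complete child workflow execution and is synchronous only to keep the example short.

```typescript
// Hypothetical shapes; the proposal only specifies the `workflow:` key and a single output.
type Outputs = Record<string, { output: string }>;

interface WorkflowDef {
  name: string;
  run: () => { output: string }; // stands in for a full child run; returns its final state
}

interface WorkflowNode {
  id: string;
  workflow: string; // name of the child workflow to invoke
}

function dispatchWorkflowNode(
  node: WorkflowNode,
  registry: Map<string, WorkflowDef>,
  parentOutputs: Outputs,
): void {
  const child = registry.get(node.workflow);
  if (!child) throw new Error(`workflow not found: ${node.workflow}`);
  // The child keeps its own node outputs internally; none of them reach the parent.
  const result = child.run();
  // The parent records exactly one entry, keyed by the invoking node's id.
  parentOutputs[node.id] = { output: result.output };
}
```

The point of the design is visible in the last line: whatever the child did across its iterations, the parent's scope gains one key, so `$test-fix-loop.output` has an unambiguous meaning.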

Why this is better

| Concern | Sub-DAG node | Workflow composition |
| --- | --- | --- |
| New container primitive | Yes (composite node + inner scope) | No (workflow already exists) |
| Variable scoping across iterations | Undefined / needs design | Solved by function-call semantics: each invocation is its own scope |
| Recursive validation | Required | Each workflow validates itself (already does) |
| Cancellation / timeouts / events | Need nested versions | Reuses existing per-workflow infra |
| Reuses #785 iteration machinery | At node layer | At workflow layer (cleaner) |
| Naming collision with #785's `loop:` | Yes | No: different layers (node vs workflow) |
| "Loop a fragment of a workflow inline" | Supported | Requires extraction into named workflow |

The last row is the only tradeoff, and it's a feature: forcing extraction mirrors how functions discipline iteration scope in code.

Prerequisite: expression evaluator

Both #785 and this issue hand-wave the condition evaluator. Today `when:` only does basic substring matching — `$test.exit_code == '0'` doesn't actually work. No loop design ships without this.

This should be done as its own self-contained piece of work:

  • Numeric and string comparisons (`==`, `!=`, `<`, `>`, `<=`, `>=`)
  • Boolean operators (`&&`, `||`, `!`)
  • Path access on captured outputs (`$node.field`)
  • Used by both `when:` (existing) and `loop_until:` / `until:` (new)

It benefits existing `when:` users immediately and unblocks both #785 follow-ups and this issue.
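
To gauge the size of this prerequisite, here is a sketch of a recursive-descent evaluator for the operators listed above. Everything in it is an assumption rather than a spec: the names (`evaluate`, `Scope`), the restriction to one-level `$node.field` references, no negative number literals, throwing on a missing field, and the coercion rule (compare numerically when both sides parse as numbers, otherwise as strings). Those last two are exactly the decisions the real piece of work needs to pin down.

```typescript
// Sketch only: grammar and coercion rules are assumptions, not the project's spec.
type Value = string | number | boolean;
type Scope = Record<string, Record<string, Value>>;

const TOKEN = /\s*(\$[A-Za-z_][\w-]*\.[A-Za-z_]\w*|\d+(?:\.\d+)?|'[^']*'|"[^"]*"|==|!=|<=|>=|&&|\|\||[<>!()])/y;
const COMPARATORS = new Set(["==", "!=", "<", ">", "<=", ">="]);

function tokenize(src: string): string[] {
  const tokens: string[] = [];
  let pos = 0;
  while (pos < src.length) {
    if (/^\s*$/.test(src.slice(pos))) break; // only trailing whitespace left
    TOKEN.lastIndex = pos;
    const m = TOKEN.exec(src);
    if (!m) throw new Error(`unexpected input at ${pos}: ${src.slice(pos)}`);
    tokens.push(m[1]!);
    pos = TOKEN.lastIndex;
  }
  return tokens;
}

function truthy(v: Value): boolean {
  return typeof v === "boolean" ? v : typeof v === "number" ? v !== 0 : v !== "";
}

function compare(a: Value, op: string, b: Value): boolean {
  // Assumed coercion: numeric if both sides parse as numbers, else string comparison.
  const na = Number(a), nb = Number(b);
  const numeric = a !== "" && b !== "" && !Number.isNaN(na) && !Number.isNaN(nb);
  const sa = String(a), sb = String(b);
  const c = numeric ? Math.sign(na - nb) : sa < sb ? -1 : sa > sb ? 1 : 0;
  switch (op) {
    case "==": return c === 0;
    case "!=": return c !== 0;
    case "<": return c < 0;
    case ">": return c > 0;
    case "<=": return c <= 0;
    case ">=": return c >= 0;
    default: throw new Error(`unknown operator: ${op}`);
  }
}

function evaluate(src: string, scope: Scope): boolean {
  const tokens = tokenize(src);
  let i = 0;
  const peek = () => tokens[i];
  const next = () => tokens[i++];

  // operand := '(' or ')' | '!' operand | $node.field | string | number
  function operand(): Value {
    const t = next();
    if (t === undefined) throw new Error("unexpected end of expression");
    if (t === "(") {
      const v = or();
      if (next() !== ")") throw new Error("expected )");
      return v;
    }
    if (t === "!") return !truthy(operand());
    if (t.startsWith("$")) {
      const dot = t.indexOf(".");
      const v = scope[t.slice(1, dot)]?.[t.slice(dot + 1)];
      if (v === undefined) throw new Error(`unknown reference: ${t}`);
      return v;
    }
    if (t.startsWith("'") || t.startsWith('"')) return t.slice(1, -1);
    const n = Number(t);
    if (Number.isNaN(n)) throw new Error(`unexpected token: ${t}`);
    return n;
  }
  function comparison(): Value {
    const left = operand();
    const op = peek();
    if (op === undefined || !COMPARATORS.has(op)) return left;
    next();
    return compare(left, op, operand());
  }
  function and(): Value {
    let v = comparison();
    while (peek() === "&&") { next(); const r = comparison(); v = truthy(v) && truthy(r); }
    return v;
  }
  function or(): Value {
    let v = and();
    while (peek() === "||") { next(); const r = and(); v = truthy(v) || truthy(r); }
    return v;
  }

  const result = or();
  if (i < tokens.length) throw new Error(`unexpected token: ${tokens[i]}`);
  return truthy(result);
}
```

Under these assumed rules, `$test.exit_code == 0` and `$test.exit_code == '0'` both hold when the captured value is `0` or `"0"`, which removes the quoting ambiguity in today's `when:` strings. A stricter design could reject mixed-type comparisons instead.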

Implementation Sketch

Phase 1 — Expression evaluator (prerequisite)

| File | Change |
| --- | --- |
| `packages/workflows/src/condition-evaluator.ts` | New: parser + evaluator for the expression grammar above |
| `packages/workflows/src/dag-executor.ts` | Replace current substring-based `when:` check with evaluator |
| Tests | Cover comparisons, booleans, missing-field handling, type coercion rules |

Phase 2 — Workflow-level loop

| File | Change |
| --- | --- |
| `packages/workflows/src/schemas/workflow.ts` | Add optional `loop_until` + `max_iterations` to workflow root schema |
| `packages/workflows/src/executor.ts` | After a run completes, evaluate `loop_until`; if false and under `max_iterations`, re-execute the DAG with a fresh `nodeOutputs` scope. Reuse PR #785's iteration events at workflow level. |
| `packages/workflows/src/event-emitter.ts` | `workflow.iteration_started` / `workflow.iteration_completed` events |

Phase 3 — Workflow invocation node

| File | Change |
| --- | --- |
| `packages/workflows/src/schemas/dag-node.ts` | New `workflow:` node variant (mutually exclusive with `command:`/`prompt:`/`bash:`/`loop:`) |
| `packages/workflows/src/loader.ts` | Parse + validate `workflow:` node; reject self-reference cycles across workflows |
| `packages/workflows/src/dag-executor.ts` | Dispatch `workflow:` nodes by spawning a child workflow run, awaiting completion, mapping its final output back to the parent's `nodeOutputs` |
| `packages/core/src/orchestrator/` | Child runs share parent's conversation/isolation environment (no second worktree) |
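
The "reject self-reference cycles across workflows" check in the loader could look roughly like the sketch below. It assumes the loader can first build a map from each workflow name to the workflows it invokes via `workflow:` nodes; that map (`WorkflowRefs`) and the `findInvocationCycle` name are hypothetical. The technique is a plain depth-first search that tracks which workflows are on the current call path.

```typescript
// Hypothetical input: workflow name -> names of workflows it invokes via `workflow:` nodes.
type WorkflowRefs = Map<string, string[]>;

// Depth-first search; returns the first invocation cycle found as a path, or null.
function findInvocationCycle(refs: WorkflowRefs): string[] | null {
  const state = new Map<string, "visiting" | "done">();
  const stack: string[] = []; // workflows on the current invocation path

  function visit(name: string): string[] | null {
    if (state.get(name) === "done") return null;
    if (state.get(name) === "visiting") {
      // `name` is already on the path, so the path from it back to itself is a cycle.
      return stack.slice(stack.indexOf(name)).concat(name);
    }
    state.set(name, "visiting");
    stack.push(name);
    for (const callee of refs.get(name) ?? []) {
      const cycle = visit(callee);
      if (cycle) return cycle;
    }
    stack.pop();
    state.set(name, "done");
    return null;
  }

  for (const name of Array.from(refs.keys())) {
    const cycle = visit(name);
    if (cycle) return cycle;
  }
  return null;
}
```

Returning the path rather than a boolean lets the loader report which chain of workflows is at fault, for example `a -> b -> a`.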

Coordination with #785

PR #785 ships a node-level `loop:` for iterating a single prompt (Ralph-style). Under this rescope there is no naming collision — `loop:` stays at the node layer, `loop_until:` lives at the workflow layer. Different concepts, different scopes, both first-class.

Out of Scope

Related

History

Originally proposed a `loop_node` containing an inner sub-DAG. Rescoped after a first-principles review concluded that workflow composition gives the same capability without introducing a nested-scope composite primitive. Original proposal preserved in earlier comments.

Metadata

Labels

  • P2 (Medium priority - Backlog, when time permits)
  • architecture (Architectural changes and design)
  • area: workflows (Workflow engine)
  • effort/high (Cross-cutting changes, multiple domains, requires design decisions)
  • feature (New functionality (planned))
