Orchestration: DagScheduler busy-spins at 250ms when SubAgentManager is saturated #1618

@bug-ops

Description

When the SubAgentManager concurrency limit is reached (e.g. with the default [agents] max_concurrent = 1), the DagScheduler enters a retry loop that fires every 250ms. The log is flooded with:

ERROR zeph_core::agent: spawn_for_task failed error=concurrency limit reached (active: 1, max: 1) task_id=1
ERROR zeph_core::agent: spawn_for_task failed error=concurrency limit reached (active: 1, max: 1) task_id=1

...repeated every ~250ms indefinitely.

Root Cause

In DagScheduler::wait_event():

if self.running.is_empty() {
    tokio::time::sleep(self.deferral_backoff).await;
    return;
}

self.running is empty because no DAG tasks were successfully registered (every spawn failed), so wait_event() falls back to sleeping for deferral_backoff = 250ms. DagScheduler.running and SubAgentManager's active-agent count are tracked separately, so the DagScheduler cannot tell that the sub-agent pool is occupied; it keeps retrying every 250ms until the external sub-agent completes.

Reproduction

  1. Set [agents] max_concurrent = 1 (the default), or have 1 sub-agent already running
  2. Create a plan with 2+ tasks that require sub-agent spawning
  3. Confirm the plan
  4. Watch the logs for a flood of ERROR messages every 250ms

Config used: /tmp/testing-orch-cancel.toml (no [agents] section → max_concurrent defaults to 1)

Expected

The DagScheduler should back off for significantly longer (e.g., 1–5 seconds) when all spawn attempts fail due to the concurrency limit, or ideally wait for a signal that a slot has been freed. Retrying every 250ms is too aggressive and floods the logs.
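
To illustrate the "wait for a signal that a slot has freed" option, here is a minimal sketch using only the standard library. SlotGate and all of its methods are hypothetical names invented for this example; they are not part of zeph_core, and the real SubAgentManager may track its slots quite differently. The sketch blocks on a Condvar, which would not be appropriate inside an async task; in the actual tokio-based scheduler an async primitive such as tokio::sync::Notify or a semaphore would presumably take its place.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical stand-in for the sub-agent pool's slot accounting.
pub struct SlotGate {
    active: Mutex<usize>,
    freed: Condvar,
    max: usize,
}

impl SlotGate {
    pub fn new(max: usize) -> Self {
        Self { active: Mutex::new(0), freed: Condvar::new(), max }
    }

    /// Take a slot if one is available; returns false at the concurrency limit.
    pub fn try_acquire(&self) -> bool {
        let mut active = self.active.lock().unwrap();
        if *active < self.max {
            *active += 1;
            true
        } else {
            false
        }
    }

    /// Give a slot back and wake one waiter.
    pub fn release(&self) {
        let mut active = self.active.lock().unwrap();
        let remaining = active.saturating_sub(1);
        *active = remaining;
        self.freed.notify_one();
    }

    /// Block until a slot is free or `timeout` elapses.
    /// Returns true if a slot was free when the wait ended.
    pub fn wait_for_slot(&self, timeout: Duration) -> bool {
        let max = self.max;
        let guard = self.active.lock().unwrap();
        let (guard, _timed_out) = self
            .freed
            .wait_timeout_while(guard, timeout, |active| *active >= max)
            .unwrap();
        *guard < max
    }
}
```

With a gate like this, the scheduler would sleep until release() is called rather than polling on a fixed interval; the timeout only acts as a safety net in case a wake-up is missed.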

Actual

Tight 250ms spin loop with ERROR log on every iteration until the session is killed.

Severity: Medium

No data loss, but it floods the log and wastes CPU, which makes diagnosing actual errors harder.

Suggested Fix

Increase deferral_backoff from 250ms to at least 1–2s, or implement exponential backoff up to a cap. Alternatively, subscribe to SubAgentManager slot-freed events to wake the DagScheduler exactly when a spawn is possible.
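
As a rough sketch of the capped exponential backoff option, the helper below doubles the delay after each deferred attempt and resets it once a spawn succeeds. DeferralBackoff, next_delay, and reset are hypothetical names made up for illustration; only the 250ms starting value comes from the current deferral_backoff, and the 2s cap used in the note below is an assumption.

```rust
use std::time::Duration;

/// Hypothetical helper: tracks the delay between deferred spawn attempts.
#[derive(Debug, Clone)]
pub struct DeferralBackoff {
    base: Duration,
    cap: Duration,
    current: Duration,
}

impl DeferralBackoff {
    /// `base` is clamped to `cap` so the first delay never exceeds the cap.
    pub fn new(base: Duration, cap: Duration) -> Self {
        let base = base.min(cap);
        Self { base, cap, current: base }
    }

    /// Returns the delay to sleep for now, then doubles the next one up to `cap`.
    pub fn next_delay(&mut self) -> Duration {
        let delay = self.current;
        self.current = self.current.saturating_mul(2).min(self.cap);
        delay
    }

    /// Call after a successful spawn so the next deferral starts at `base` again.
    pub fn reset(&mut self) {
        self.current = self.base;
    }
}
```

Starting from 250ms with a 2s cap, successive calls to next_delay() yield 250ms, 500ms, 1s, 2s, 2s, and so on. In wait_event(), the fixed sleep(self.deferral_backoff) could then sleep for next_delay() instead, with reset() called whenever a task is registered in self.running.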
