Description
When the SubAgentManager concurrency limit is reached (e.g. the default [agents] max_concurrent = 1), the DagScheduler enters a tight retry loop at 250ms intervals, flooding the log with:
ERROR zeph_core::agent: spawn_for_task failed error=concurrency limit reached (active: 1, max: 1) task_id=1
ERROR zeph_core::agent: spawn_for_task failed error=concurrency limit reached (active: 1, max: 1) task_id=1
...repeated every ~250ms indefinitely.
Root Cause
In DagScheduler::wait_event():
```rust
if self.running.is_empty() {
    tokio::time::sleep(self.deferral_backoff).await;
    return;
}
```
self.running is empty because no DAG tasks were successfully registered (all spawns failed), so the scheduler falls back to deferral_backoff = 250ms. Because DagScheduler.running and the SubAgentManager's active-agent count are tracked separately, the DagScheduler cannot tell that the sub-agent pool is occupied and keeps retrying every 250ms until the external sub-agent completes.
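For illustration, a self-contained sketch of the failure mode, assuming a tokio runtime with the time and macros features; the loop body and names below are hypothetical stand-ins for the DagScheduler/SubAgentManager interaction, not actual zeph_core code:

```rust
use std::time::Duration;

// Hypothetical stand-in for the scheduler loop: every spawn hits the
// concurrency limit, nothing is added to `running`, and the only wait
// between retries is the fixed 250ms deferral_backoff.
#[tokio::main]
async fn main() {
    let deferral_backoff = Duration::from_millis(250);
    let running: Vec<u64> = Vec::new(); // stays empty: no spawn ever succeeds
    let (active, max) = (1, 1); // an external sub-agent holds the only slot

    for _ in 0..10 {
        // stand-in for SubAgentManager::spawn_for_task()
        if active >= max {
            eprintln!("ERROR spawn_for_task failed error=concurrency limit reached (active: {active}, max: {max}) task_id=1");
        }
        // stand-in for DagScheduler::wait_event(): nothing to await but the sleep
        if running.is_empty() {
            tokio::time::sleep(deferral_backoff).await;
        }
    }
}
```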
Reproduction
- [agents] max_concurrent = 1 (the default), or 1 sub-agent already running
- Create a plan with 2+ tasks that require sub-agent spawning
- Confirm the plan
- Watch logs for flood of ERROR messages every 250ms
Config used: /tmp/testing-orch-cancel.toml (no [agents] section → max_concurrent defaults to 1)
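For reference, a minimal config sketch, assuming the [agents] table only needs the max_concurrent key referenced above; raising the limit only sidesteps the repro and does not fix the backoff behavior:

```toml
# Equivalent to the repro config: with no [agents] section, max_concurrent
# defaults to 1, so any second sub-agent spawn triggers the retry loop.
[agents]
max_concurrent = 1  # raise to 2+ to work around the log flood (not a fix)
```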
Expected
The DagScheduler should back off significantly longer (e.g., 1–5 seconds) when all spawn attempts fail due to concurrency limits, or ideally wait for a signal that a slot has freed. 250ms is too aggressive and floods logs.
Actual
Tight 250ms spin loop with ERROR log on every iteration until the session is killed.
Severity: Medium
No data loss, but the log flood and CPU waste make it harder to diagnose genuine errors.
Suggested Fix
Increase deferral_backoff from 250ms to at least 1–2s, or implement exponential backoff up to a cap. Alternatively, subscribe to SubAgentManager slot-freed events to wake the DagScheduler exactly when a spawn is possible.
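A minimal sketch of both options, assuming tokio; the function names, the shared Notify handle, and the 5s cap are illustrative choices, not the existing zeph_core API:

```rust
use std::time::Duration;
use tokio::sync::Notify;

/// Option 1: exponential backoff with a cap instead of a fixed 250ms sleep.
async fn deferral_sleep(consecutive_failures: u32) {
    // 250ms, 500ms, 1s, 2s, 4s, then capped at 5s.
    let delay = Duration::from_millis(250)
        .saturating_mul(1u32 << consecutive_failures.min(5))
        .min(Duration::from_secs(5));
    tokio::time::sleep(delay).await;
}

/// Option 2: wake exactly when the SubAgentManager frees a slot. `slot_freed`
/// would be a Notify shared with the manager, which calls `notify_one()`
/// whenever an agent finishes; the timeout is a safety net against missed
/// notifications.
async fn wait_for_slot(slot_freed: &Notify) {
    let _ = tokio::time::timeout(Duration::from_secs(5), slot_freed.notified()).await;
}
```

Either way the first retry can stay fast; the point is that repeated failures stop hammering the log at a fixed 250ms cadence.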