Bug
When `DagScheduler` spawns tasks on a tick and hits the sub-agent concurrency limit (`max_concurrent = 1` by default), it calls `record_spawn_failure()`, which unconditionally sets the task to `Failed` and propagates the failure to dependents.
For transient errors like "concurrency limit reached", the task should remain `Pending` and be retried on the next scheduler tick — not permanently failed.
Reproduction
- Create a plan with 3+ tasks (e.g. LlmPlanner creates 3 independent root tasks)
- Config has the default `max_concurrent = 1`
- `/plan confirm` → DagScheduler tick → task 0 spawned → task 1 "concurrency limit 1 reached" → `Failed` → task 2 skipped
Expected
Task 1 should stay `Pending` and be picked up on the next tick after task 0 completes.
Actual
Task 1 permanently fails with `spawn failed: concurrency limit 1 reached`, propagating failure to dependents.
Root cause
`record_spawn_failure()` at `scheduler.rs:484` unconditionally marks the task `Failed`. It needs to distinguish transient errors (concurrency limit, temporary resource unavailability) from permanent errors (invalid agent definition, configuration error).
Suggested fix
In the `spawn_for_task()` caller (or in `record_spawn_failure()` itself), check whether the error is "concurrency limit reached" and revert the task to `Pending` instead of `Failed`. Alternatively, check concurrency availability before attempting the spawn.
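A minimal sketch of the transient/permanent split, assuming a hypothetical `SpawnError` enum (the variant names and `TaskState` type are illustrative, not the actual codebase types):

```rust
// Hypothetical error type for spawn failures; the real scheduler may use
// a different error representation.
#[derive(Debug)]
enum SpawnError {
    /// Transient: the task should be retried on a later tick.
    ConcurrencyLimitReached { limit: usize },
    /// Permanent: the task should fail and propagate to dependents.
    InvalidAgentDefinition(String),
}

impl SpawnError {
    fn is_transient(&self) -> bool {
        matches!(self, SpawnError::ConcurrencyLimitReached { .. })
    }
}

#[derive(Debug, PartialEq)]
enum TaskState {
    Pending,
    Failed,
}

/// Caller-side check: revert to `Pending` on transient errors instead of
/// unconditionally recording a spawn failure.
fn state_after_spawn_error(err: &SpawnError) -> TaskState {
    if err.is_transient() {
        // Left in the queue; the next scheduler tick retries the spawn.
        TaskState::Pending
    } else {
        TaskState::Failed
    }
}
```

With this split, only the `Failed` branch would call `record_spawn_failure()` and propagate to dependents.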
Related
Live test evidence
spawn_for_task failed error=spawn failed: concurrency limit 1 reached task_id=1
scheduler: spawn failed, marking task failed task_id=1
Plan failed. 1/3 tasks failed: