fix(orchestration): exponential backoff in DagScheduler when concurrency limit hit #1620

Merged
bug-ops merged 1 commit into main from orchestration-dagscheduler-bus on Mar 13, 2026

@bug-ops (Owner) commented Mar 13, 2026

Summary

  • When SubAgentManager was saturated, DagScheduler::wait_event() slept a fixed 250ms and then retried, producing a busy-retry loop that flooded logs with "ERROR spawn_for_task failed: concurrency limit reached" every 250ms
  • Replaced the fixed sleep with exponential backoff: 250ms → 500ms → 1s → 2s → 4s, capped at 5 seconds per retry cycle
  • The failure counter resets to zero on the first successful spawn, so normal operation is not affected
  • Added consecutive_spawn_failures: u32 field to DagScheduler; helper current_deferral_backoff() computes the scaled duration
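
For readers who want to see the shape of the calculation, here is a minimal sketch of a doubling-with-cap backoff. It is not the code from this PR: scaled_backoff and its parameters are hypothetical names, and whether the first failure maps to the base delay or to the doubled delay is an assumption (the sketch treats a failure count of 0 as the base 250ms).

```rust
use std::time::Duration;

/// Hypothetical helper: doubles `base` once per consecutive failure and
/// clamps the result to `cap`. The real current_deferral_backoff() may
/// differ in naming, indexing, and where the cap is stored.
fn scaled_backoff(base: Duration, consecutive_failures: u32, cap: Duration) -> Duration {
    // 2^n as a u32 multiplier; checked_shl returns None once n >= 32.
    let factor = 1u32.checked_shl(consecutive_failures);
    factor
        // checked_mul returns None if the Duration would overflow.
        .and_then(|f| base.checked_mul(f))
        // Any overflow is treated as "at least the cap".
        .map_or(cap, |delay| delay.min(cap))
}
```

With a 250ms base and a 5s cap this yields 250ms, 500ms, 1s, 2s, 4s, and then 5s for every later failure count, which matches the sequence listed in the summary.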

Test plan

  • test_current_deferral_backoff_exponential_growth — verifies doubling per failure, cap at 5s
  • test_record_spawn_resets_consecutive_failures — verifies counter reset on success
  • test_record_spawn_failure_increments_consecutive_failures — verifies counter increment per ConcurrencyLimit failure
  • test_wait_event_sleeps_deferral_backoff_when_running_empty — existing regression test still passes
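
The two counter tests in this plan could look roughly like the following. This is a sketch against a hypothetical stand-in type, not the PR's test code: SpawnFailureCounter is invented for illustration, and the method names record_spawn and record_spawn_failure are inferred from the test names above, with assumed signatures.

```rust
/// Hypothetical stand-in holding only the counter; the real DagScheduler
/// carries other state that this PR description does not show.
#[derive(Debug, Default)]
struct SpawnFailureCounter {
    consecutive_spawn_failures: u32,
}

impl SpawnFailureCounter {
    /// Assumed to run when a spawn is rejected with ConcurrencyLimit.
    fn record_spawn_failure(&mut self) {
        // saturating_add keeps the counter from overflowing during a long stall.
        self.consecutive_spawn_failures = self.consecutive_spawn_failures.saturating_add(1);
    }

    /// Assumed to run after a successful spawn.
    fn record_spawn(&mut self) {
        self.consecutive_spawn_failures = 0;
    }
}

#[test]
fn record_spawn_failure_increments_consecutive_failures() {
    let mut counter = SpawnFailureCounter::default();
    counter.record_spawn_failure();
    counter.record_spawn_failure();
    assert_eq!(counter.consecutive_spawn_failures, 2);
}

#[test]
fn record_spawn_resets_consecutive_failures() {
    let mut counter = SpawnFailureCounter::default();
    counter.record_spawn_failure();
    counter.record_spawn();
    assert_eq!(counter.consecutive_spawn_failures, 0);
}
```

Keeping the counter behind two small methods makes both properties checkable without driving the async wait loop, which is presumably why the plan tests them separately from wait_event().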

Closes #1618.

@github-actions (bot) added the labels bug, size/M, documentation, rust, and core, and removed the size/M label, on Mar 13, 2026
…ncy limit hit

When all SubAgentManager slots are occupied, DagScheduler::wait_event()
was sleeping the fixed deferral_backoff (250ms) and retrying immediately,
producing a tight spin that flooded logs with ERROR messages every 250ms.

Replace the fixed sleep with exponential backoff: each consecutive
ConcurrencyLimit failure doubles the delay (250ms → 500ms → 1s → 2s → 4s),
capped at 5 seconds. The counter resets to zero on the first successful spawn.

Adds three regression tests covering backoff growth, counter reset, and
counter increment. Closes #1618.
@bug-ops force-pushed the orchestration-dagscheduler-bus branch from c2a0672 to f10f874 on March 13, 2026 14:58
@github-actions (bot) added the size/M label on Mar 13, 2026
@bug-ops enabled auto-merge (squash) on March 13, 2026 15:05
@bug-ops merged commit c46ccfc into main on Mar 13, 2026
15 checks passed
@bug-ops deleted the orchestration-dagscheduler-bus branch on March 13, 2026 15:06

Labels

  • bug: Something isn't working
  • core: zeph-core crate
  • documentation: Improvements or additions to documentation
  • rust: Rust code changes
  • size/M: Medium PR (51-200 lines)

Development

Successfully merging this pull request may close these issues.

Orchestration: DagScheduler busy-spins at 250ms when SubAgentManager is saturated
