fix(orchestration): prevent DagScheduler deadlock on concurrency exhaustion (#1619)#1624

Merged
bug-ops merged 3 commits into main from orchestration-dagscheduler-dea
Mar 13, 2026

bug-ops commented Mar 13, 2026

Summary

  • Raises max_concurrent default from 1 → 5 (must be ≥ max_parallel + 1 when orchestration is active)
  • Adds slot reservation in SubAgentManager so planning-phase sub-agents cannot starve orchestration task spawning
  • Replaces fixed 250 ms spin with exponential backoff (100 ms base, ×2/retry, 5 s cap) on ConcurrencyLimit; downgrades log level from ERROR to WARN with structured fields
  • Emits startup warn! when max_concurrent < max_parallel + 1
  • 6 regression tests covering reservation blocking/release and backoff behaviour
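The backoff schedule in the third bullet can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `deferral_backoff` and the convention that the first deferral waits exactly the base delay are assumptions.

```rust
use std::time::Duration;

/// Hypothetical helper (name and indexing are assumptions): delay to wait
/// after `prior_deferrals` consecutive ConcurrencyLimit results have
/// already been seen.
fn deferral_backoff(prior_deferrals: u32) -> Duration {
    const BASE_MS: u64 = 100; // 100 ms base
    const CAP_MS: u64 = 5_000; // 5 s cap

    // Clamp the exponent so the shift cannot overflow a u64.
    let exp = prior_deferrals.min(16);
    // Double the base once per prior deferral, then apply the cap.
    let ms = BASE_MS.saturating_mul(1u64 << exp).min(CAP_MS);
    Duration::from_millis(ms)
}
```

Under these assumptions the delays run 100 ms, 200 ms, 400 ms, 800 ms, 1.6 s, 3.2 s, and then stay at 5 s. The counter would be reset to zero after a successful spawn, as the `test_consecutive_deferrals_resets_on_success` test name suggests.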

Closes #1619

Test plan

  • cargo +nightly fmt --check passes
  • cargo clippy --workspace --features full -- -D warnings passes
  • cargo nextest run --config-file .github/nextest.toml --workspace --features full --lib --bins — 5174 tests pass
  • New tests: test_reserve_slots_blocks_spawn, test_release_reservation_allows_spawn, test_reservation_with_zero_active_blocks_spawn, test_consecutive_deferrals_increments_on_concurrency_limit, test_consecutive_deferrals_resets_on_success, test_exponential_backoff_duration

…ustion (#1619)

When a sub-agent spawned during planning occupies the only concurrency slot,
DagScheduler tasks were silently failing with ConcurrencyLimitReached and the
plan never progressed.
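The failure mode above comes down to a capacity inequality: the limit has to cover max_parallel orchestration tasks plus one planning-phase sub-agent. A minimal sketch of such a startup check follows; the function names and the message text are hypothetical, and the real code emits the warning through warn! rather than returning a string.

```rust
/// Smallest `max_concurrent` that leaves room for `max_parallel`
/// orchestration tasks plus one planning-phase sub-agent.
fn min_required_concurrency(max_parallel: usize) -> usize {
    max_parallel.saturating_add(1)
}

/// Returns a warning message when the configured limit is too low,
/// or `None` when the configuration satisfies the inequality.
/// (Hypothetical helper; message wording is illustrative only.)
fn concurrency_warning(max_concurrent: usize, max_parallel: usize) -> Option<String> {
    let required = min_required_concurrency(max_parallel);
    if max_concurrent < required {
        Some(format!(
            "max_concurrent ({max_concurrent}) < max_parallel + 1 ({required}); orchestration may stall"
        ))
    } else {
        None
    }
}
```

With the old default of max_concurrent = 1, this check fires for any max_parallel of 1 or more, which matches the situation described in the linked issue.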

- Raise default max_concurrent from 1 to 5 (must be >= max_parallel + 1 when
  orchestration is active)
- Add reserved_slots to SubAgentManager so orchestration reserves capacity
  before entering the scheduler loop; reservation is released unconditionally
  on scheduler exit
- Replace fixed 250 ms spin with exponential backoff (100 ms base, x2 per
  retry, 5 s cap) when ConcurrencyLimit is returned; log at WARN level with
  active/max/consecutive_deferrals fields instead of DEBUG
- Emit startup warning when max_concurrent < max_parallel + 1
- Add 6 regression tests covering reservation blocking, release, and
  backoff behaviour
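The reservation idea can be illustrated with a small counter model. This is a sketch under stated assumptions, not the SubAgentManager implementation: the `SlotTracker` type, the two separate spawn paths, and all method names other than the `reserved_slots` field are hypothetical.

```rust
/// Illustrative stand-in for the manager's bookkeeping; only counters are modelled.
#[derive(Debug)]
struct SlotTracker {
    max_concurrent: usize,
    active: usize,
    reserved_slots: usize,
}

#[derive(Debug, PartialEq, Eq)]
enum SpawnError {
    ConcurrencyLimit { active: usize, max: usize },
}

impl SlotTracker {
    fn new(max_concurrent: usize) -> Self {
        Self { max_concurrent, active: 0, reserved_slots: 0 }
    }

    /// Set capacity aside for orchestration before the scheduler loop starts.
    fn reserve_slots(&mut self, n: usize) {
        self.reserved_slots = n.min(self.max_concurrent);
    }

    /// Give the reserved capacity back; intended to run on every scheduler exit path.
    fn release_reservation(&mut self) {
        self.reserved_slots = 0;
    }

    /// Assumed spawn path for planning-phase sub-agents: reserved slots count as taken.
    fn try_spawn_planning(&mut self) -> Result<(), SpawnError> {
        if self.active + self.reserved_slots >= self.max_concurrent {
            return Err(SpawnError::ConcurrencyLimit {
                active: self.active,
                max: self.max_concurrent,
            });
        }
        self.active += 1;
        Ok(())
    }

    /// Assumed spawn path for orchestration tasks: only running agents count.
    fn try_spawn_orchestrated(&mut self) -> Result<(), SpawnError> {
        if self.active >= self.max_concurrent {
            return Err(SpawnError::ConcurrencyLimit {
                active: self.active,
                max: self.max_concurrent,
            });
        }
        self.active += 1;
        Ok(())
    }
}
```

In this model, a tracker with max_concurrent = 5 and four reserved slots admits one planning sub-agent and rejects the second, while orchestration tasks can still claim the reserved capacity. One way to make the release unconditional is a guard type whose Drop calls `release_reservation`; whether the PR does it that way is not stated here.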
github-actions bot added the documentation, rust, core, bug, and size/L labels Mar 13, 2026
Use consecutive_spawn_failures field name and current_deferral_backoff()
helper from #1618; keep warn! log level with active/max fields from #1619.
bug-ops enabled auto-merge (squash) March 13, 2026 15:56
bug-ops merged commit edc9447 into main Mar 13, 2026
15 checks passed
bug-ops deleted the orchestration-dagscheduler-dea branch March 13, 2026 16:10

Labels

  • bug: Something isn't working
  • core: zeph-core crate
  • documentation: Improvements or additions to documentation
  • rust: Rust code changes
  • size/L: Large PR (201-500 lines)

Development

Successfully merging this pull request may close these issues.

Orchestration: DagScheduler deadlock when SubAgentManager concurrency exhausted before plan execution