
fix: resolve orchestrator deadlock on parallel task completion#178

Merged
rileyhilliard merged 2 commits into main from fix/177-parallel-orchestrator-deadlock
Feb 14, 2026

Conversation

@rileyhilliard (Owner) commented Feb 13, 2026

Closes #177

Summary

Parallel tasks hang indefinitely after all tasks complete. The reported symptom is correct, but the root cause is different from what the issue describes.

Root cause analysis

#177 attributes the hang to a signal handler goroutine leak (signal.Notify without signal.Stop). That's not it. A goroutine blocked on <-sigChan can't prevent Go from exiting, because Go terminates all goroutines when main() returns. The workflow.go cleanup pattern is good hygiene but isn't what keeps the process alive.

The actual problem is a channel deadlock in the orchestrator's shutdown sequence (internal/parallel/orchestrator.go). After all tasks finish:

  1. Workers complete their tasks, loop back, block reading from taskQueue
  2. Dispatcher blocks on for task := range requeueChan, waiting for requeued tasks that will never come
  3. Cleanup goroutine blocks on wg.Wait(), waiting for workers to exit

Circular dependency: workers wait on taskQueue (closed by dispatcher) while the dispatcher waits on requeueChan (closed only after workers exit). Nobody can make progress.

Reproduction

Built the binary, ran rr quick-check (parallel: [vet, lint]). Process hung after both tasks completed. Used kill -SIGABRT to get goroutine dumps confirming workers stuck in [select] at the taskQueue read and the dispatcher stuck on requeueChan.

Fix

Added an allDone signal channel that gets closed (via sync.Once) when all expected results are collected. The dispatcher now selects on allDone alongside requeueChan and ctx.Done(), so it can exit once everything's accounted for. This closes taskQueue, which unblocks workers, and the whole shutdown chain completes.

Changes

  • Add allDone channel to break the circular shutdown dependency in orchestrator.Run()
  • Guard close(allDone) with sync.Once to prevent double-close if a future code path produces unexpected extra results
  • Dispatcher selects on allDone in both the initial dispatch loop and the post-dispatch requeue loop
  • Track expectedResults and collected count in the result-gathering loop

Testing

  • Added TestOrchestrator_RemotePath_CompletesWithoutDeadlock regression test with 4 scenarios (single/multi task, single/multi host, more tasks than hosts)
  • Test uses a 5-second context timeout as a deadlock detector
  • All existing tests pass, race detector clean

Notes

The signal handler cleanup the reporter flagged (signal.Stop/close in parallel.go, run.go, task.go) is a real gap worth fixing for hygiene, but it's a separate concern. It doesn't cause the hang and should be addressed in a follow-up.

Summary by CodeRabbit

  • Bug Fixes

    • Improved shutdown coordination for parallel task orchestration to avoid deadlocks when remote hosts are unreachable; unconsumed tasks are now handled and failures are reported reliably.
  • Tests

    • Added a regression test to verify parallel execution completes without deadlock when remote hosts cannot be reached.

coderabbitai bot commented Feb 13, 2026

Caution

Review failed

The pull request is closed.

Walkthrough

Adds an allDone shutdown signal and result-tracking to the parallel orchestrator to break a circular wait between dispatcher, workers, and result collection; dispatcher now drains unconsumed requeue entries and emits failure results when all hosts are unavailable, and workers requeue before marking hosts unavailable.

Changes

  • Orchestrator Shutdown Coordination (internal/parallel/orchestrator.go): add allDone channel and expected/collected result tracking; introduce drainUnconsumedTasks(taskQueue <-chan TaskInfo, resultChan chan<- TaskResult); dispatcher now selects on allDone, drains requeueChan on exit, emits failure results when all hosts are unavailable, and reorders the worker requeue → mark-unavailable path.
  • Regression Tests (internal/parallel/orchestrator_test.go): add TestOrchestrator_RemotePath_CompletesWithoutDeadlock (appears twice in the file) to assert Run completes without deadlock and returns failure results when hosts are unreachable within the timeout.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Dispatcher
    participant Workers
    participant ResultCollector
    participant allDone as allDone (chan)

    Client->>Dispatcher: Run(tasks, hosts)
    Dispatcher->>Workers: populate taskQueue / send tasks
    loop per-task processing
        Workers->>Workers: execute task
        alt success or terminal failure
            Workers->>ResultCollector: send TaskResult
            ResultCollector->>ResultCollector: increment collected
        else retryable / host unreachable
            Workers->>Dispatcher: send TaskInfo to requeueChan
            Dispatcher->>Dispatcher: requeue or mark host unavailable
        end
        Dispatcher->>Dispatcher: monitor allDone in select
    end

    Note over ResultCollector, allDone: when collected == expected
    ResultCollector->>allDone: close allDone
    allDone-->>Dispatcher: unblock dispatcher select
    Dispatcher->>Dispatcher: drain requeueChan, emit failures if all hosts unavailable
    Dispatcher->>Client: return final aggregated results

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #179 — Modifies the same orchestrator path to add allDone/shutdown signaling and related drain logic; closely related.
  • #168 — Changes requeue and host-unavailable dispatcher behavior that this PR extends and hardens.
  • #89 — Earlier orchestrator coordination changes that this PR builds upon.
🚥 Pre-merge checks: ✅ 3 passed | ❌ 3 failed

❌ Failed checks (1 warning, 2 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Linked Issues check: ❓ Inconclusive. The PR fixes the orchestrator deadlock (the primary cause) but does not address the signal handler cleanup mentioned in #177, leaving that for follow-up work. Resolution: clarify whether the signal handler cleanup (#177's suggested fix) is intentionally deferred or should be included in this PR.
  • Out of Scope Changes check: ❓ Inconclusive. Changes are scoped to orchestrator synchronization logic and related tests; the test duplication in orchestrator_test.go warrants review. Resolution: verify whether the duplicated test is intentional or a merge artifact that should be removed.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the main fix: resolving a deadlock in the orchestrator by adding an allDone signaling mechanism.
  • Merge Conflict Detection: ✅ Passed. No merge conflicts detected when merging into main.


…el tasks complete (#177)

Workers blocked on taskQueue while the dispatcher blocked on requeueChan,
creating a circular dependency that prevented shutdown. Added an allDone
signal channel (guarded by sync.Once) that fires when all expected results
are collected, letting the dispatcher exit and unblock the workers.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/parallel/orchestrator.go (1)

168-199: ⚠️ Potential issue | 🟠 Major

Requeued tasks can be lost when allHostsUnavailable() triggers mid-dispatch.

If some tasks have already been sent to taskQueue (removed from pending) before allHostsUnavailable() returns true, those tasks may be picked up by workers, fail, get sent to requeueChan, and then be silently discarded — the dispatcher has already returned, and the cleanup goroutine just closes requeueChan without draining it.

This means those tasks produce no TaskResult, so collected never reaches expectedResults, and allDone never fires. The function still returns (because resultChan gets closed by the cleanup goroutine), but with incomplete results.

Consider draining requeueChan after wg.Wait() and emitting failure results for any orphaned tasks before closing resultChan:

Proposed fix in the cleanup goroutine (lines 258-263)
 go func() {
     wg.Wait()
     close(requeueChan)
+    // Drain any orphaned requeued tasks and emit failure results
+    for task := range requeueChan {
+        resultChan <- TaskResult{
+            TaskName:  task.Name,
+            TaskIndex: task.Index,
+            Command:   task.Command,
+            Host:      "none",
+            ExitCode:  1,
+            Error:     fmt.Errorf("all hosts unavailable"),
+            StartTime: time.Now(),
+            EndTime:   time.Now(),
+        }
+    }
     <-dispatcherDone
     close(resultChan)
 }()

Note on ordering: because close(requeueChan) runs before the drain loop, range requeueChan yields the remaining buffered items and then exits, so the drain works as written. Only items sent before the close are drained, but since workers have already exited (wg.Wait() has returned), no new sends can occur and the drain is complete.

🤖 Fix all issues with AI agents
In `@internal/parallel/orchestrator_test.go`:
- Around lines 870-875: replace the weak "at least one" assertion on result.TaskResults with a strict equality check: assert that len(result.TaskResults) equals the expected count (len(o.tasks) in this test), so the test fails if any task result is missing. Keep the existing loop that asserts each tr.Success() is false, ensuring all returned results are failures.
🧹 Nitpick comments (1)
internal/parallel/orchestrator.go (1)

206-243: Dispatcher dropping requeued tasks on allDone (lines 237-238) is safe only if the invariant holds.

When allDone fires in the inner select (line 237), the requeued task read at line 212 is silently dropped. This is correct only if collected >= expectedResults guarantees every task already has a result. Since requeued tasks explicitly skip sending a result (line 385-402 in hostWorkerWithRequeue), a requeued task in-flight here should mean collected hasn't counted it yet — which contradicts allDone having fired.

In practice this path is likely unreachable under normal conditions, but it's worth a brief comment explaining why the drop is safe (or adding a defensive failure result).

…iple tasks, only 1 result was being produced instead of N
@rileyhilliard rileyhilliard merged commit 23962cc into main Feb 14, 2026
5 of 6 checks passed


Development

Successfully merging this pull request may close these issues.

Parallel tasks don't release terminal after completion — signal handler goroutine leak
