fix: resolve orchestrator deadlock on parallel task completion #178

rileyhilliard merged 2 commits into main
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

Adds an `allDone` signal channel so the orchestrator's dispatcher can shut down once all expected results are collected, instead of deadlocking.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Dispatcher
    participant Workers
    participant ResultCollector
    participant allDone as allDone (chan)
    Client->>Dispatcher: Run(tasks, hosts)
    Dispatcher->>Workers: populate taskQueue / send tasks
    loop per-task processing
        Workers->>Workers: execute task
        alt success or terminal failure
            Workers->>ResultCollector: send TaskResult
            ResultCollector->>ResultCollector: increment collected
        else retryable / host unreachable
            Workers->>Dispatcher: send TaskInfo to requeueChan
            Dispatcher->>Dispatcher: requeue or mark host unavailable
        end
        Dispatcher->>Dispatcher: monitor allDone in select
    end
    Note over ResultCollector, allDone: when collected == expected
    ResultCollector->>allDone: close allDone
    allDone-->>Dispatcher: unblock dispatcher select
    Dispatcher->>Dispatcher: drain requeueChan, emit failures if all hosts unavailable
    Dispatcher->>Client: return final aggregated results
```
…el tasks complete (#177) Workers blocked on taskQueue while the dispatcher blocked on requeueChan, creating a circular dependency that prevented shutdown. Added an allDone signal channel (guarded by sync.Once) that fires when all expected results are collected, letting the dispatcher exit and unblock the workers.
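The shutdown pattern the commit message describes can be sketched in miniature. This is a hypothetical, simplified model (names like `runTasks` and the string-typed channels are illustrative only; the real orchestrator in `internal/parallel/orchestrator.go` adds retries, host tracking, and context cancellation): a collector closes `allDone` once every expected result has arrived, which lets the dispatcher stop waiting on `requeueChan`, close `taskQueue`, and unblock the workers.

```go
package main

import (
	"fmt"
	"sync"
)

// runTasks sketches the fixed shutdown sequence. Without allDone, the
// dispatcher blocks on requeueChan while workers block on taskQueue,
// and neither side can ever close the other's channel.
func runTasks(tasks []string) int {
	taskQueue := make(chan string, len(tasks))
	results := make(chan string, len(tasks))
	requeueChan := make(chan string, len(tasks))
	allDone := make(chan struct{})
	var once sync.Once

	var wg sync.WaitGroup
	for w := 0; w < 2; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range taskQueue { // exits only when taskQueue is closed
				results <- t + ": ok"
			}
		}()
	}

	// collector: fire allDone when all expected results are in
	collected := 0
	done := make(chan int)
	go func() {
		for range results {
			collected++
			if collected == len(tasks) {
				once.Do(func() { close(allDone) })
			}
		}
		done <- collected
	}()

	for _, t := range tasks {
		taskQueue <- t
	}

	// dispatcher: wait for requeued tasks OR completion
	for running := true; running; {
		select {
		case t := <-requeueChan:
			taskQueue <- t
		case <-allDone: // breaks the circular wait
			running = false
		}
	}
	close(taskQueue) // unblocks workers
	wg.Wait()
	close(requeueChan)
	close(results) // lets the collector loop end
	return <-done
}

func main() {
	fmt.Println(runTasks([]string{"vet", "lint"})) // prints 2
}
```

The `sync.Once` guard matters because `close` on an already-closed channel panics; wrapping it makes a second "all collected" trigger a harmless no-op.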
c78dbb2 to 223ee9f
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/parallel/orchestrator.go (1)
168-199: ⚠️ Potential issue | 🟠 Major

Requeued tasks can be lost when `allHostsUnavailable()` triggers mid-dispatch.

If some tasks have already been sent to `taskQueue` (removed from `pending`) before `allHostsUnavailable()` returns true, those tasks may be picked up by workers, fail, get sent to `requeueChan`, and then be silently discarded: the dispatcher has already returned, and the cleanup goroutine just closes `requeueChan` without draining it.

This means those tasks produce no `TaskResult`, so `collected` never reaches `expectedResults`, and `allDone` never fires. The function still returns (because `resultChan` gets closed by the cleanup goroutine), but with incomplete results.

Consider draining `requeueChan` after `wg.Wait()` and emitting failure results for any orphaned tasks before closing `resultChan`. Proposed fix in the cleanup goroutine (lines 258-263):

```diff
 go func() {
 	wg.Wait()
 	close(requeueChan)
+	// Drain any orphaned requeued tasks and emit failure results
+	for task := range requeueChan {
+		resultChan <- TaskResult{
+			TaskName:  task.Name,
+			TaskIndex: task.Index,
+			Command:   task.Command,
+			Host:      "none",
+			ExitCode:  1,
+			Error:     fmt.Errorf("all hosts unavailable"),
+			StartTime: time.Now(),
+			EndTime:   time.Now(),
+		}
+	}
 	<-dispatcherDone
 	close(resultChan)
 }()
```

Note that `close(requeueChan)` makes `range requeueChan` drain the remaining buffered items: it only delivers items sent before the close, and since the workers have already exited by the time `wg.Wait()` returns, no new sends can happen.
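The closing observation rests on general Go channel semantics: closing a buffered channel does not discard buffered values; a `range` receiver drains everything sent before the close, then exits. A minimal standalone demonstration:

```go
package main

import "fmt"

// drainAfterClose shows that values buffered before close(ch) are still
// delivered to a ranging receiver; the range loop exits once the buffer
// is empty and the channel is closed.
func drainAfterClose() []int {
	ch := make(chan int, 3)
	ch <- 1
	ch <- 2
	close(ch) // forbids further sends, but buffered values remain readable

	var out []int
	for v := range ch {
		out = append(out, v)
	}
	return out
}

func main() {
	fmt.Println(drainAfterClose()) // prints [1 2]
}
```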
🤖 Fix all issues with AI agents
In `@internal/parallel/orchestrator_test.go`:
- Around line 870-875: Replace the weak "at least one" assertion on
result.TaskResults with a strict equality check against the expected count:
assert that len(result.TaskResults) equals the expectedResults (which is
len(o.tasks) in this test) so the test fails if any task result is missing; keep
the existing loop that asserts each tr.Success() is false to ensure all returned
results are failures.
🧹 Nitpick comments (1)
internal/parallel/orchestrator.go (1)
206-243: Dispatcher dropping requeued tasks on `allDone` (lines 237-238) is safe only if the invariant holds.

When `allDone` fires in the inner select (line 237), the requeued `task` read at line 212 is silently dropped. This is correct only if `collected >= expectedResults` guarantees every task already has a result. Since requeued tasks explicitly skip sending a result (lines 385-402 in `hostWorkerWithRequeue`), a requeued task in flight here should mean `collected` hasn't counted it yet, which contradicts `allDone` having fired.

In practice this path is likely unreachable under normal conditions, but it's worth a brief comment explaining why the drop is safe (or adding a defensive failure result).
…iple tasks, only 1 result was being produced instead of N
Closes #177
Summary
Parallel tasks hang indefinitely after all tasks complete. The reported symptom is correct, but the root cause is different from what the issue describes.
Root cause analysis
#177 attributes the hang to a signal handler goroutine leak (`signal.Notify` without `signal.Stop`). That's not it. A goroutine blocked on `<-sigChan` can't prevent Go from exiting, because Go terminates all goroutines when `main()` returns. The `workflow.go` cleanup pattern is good hygiene but isn't what keeps the process alive.

The actual problem is a channel deadlock in the orchestrator's shutdown sequence (`internal/parallel/orchestrator.go`). After all tasks finish:

- Workers block reading from `taskQueue`
- The dispatcher blocks in `for task := range requeueChan`, waiting for requeued tasks that will never come
- The cleanup goroutine blocks on `wg.Wait()`, waiting for workers to exit

Circular dependency: workers wait on `taskQueue` (closed by the dispatcher) while the dispatcher waits on `requeueChan` (closed only after workers exit). Nobody can make progress.

Reproduction
Built the binary, ran `rr quick-check` (parallel: [vet, lint]). Process hung after both tasks completed. Used `kill -SIGABRT` to get goroutine dumps confirming workers stuck in `[select]` at the `taskQueue` read and the dispatcher stuck on `requeueChan`.

Fix
Added an `allDone` signal channel that gets closed (via `sync.Once`) when all expected results are collected. The dispatcher now selects on `allDone` alongside `requeueChan` and `ctx.Done()`, so it can exit once everything's accounted for. This closes `taskQueue`, which unblocks workers, and the whole shutdown chain completes.

Changes

- Added an `allDone` channel to break the circular shutdown dependency in `orchestrator.Run()`
- Guarded `close(allDone)` with `sync.Once` to prevent a double-close if a future code path produces unexpected extra results
- The dispatcher selects on `allDone` in both the initial dispatch loop and the post-dispatch requeue loop
- Tracked `expectedResults` and the `collected` count in the result-gathering loop

Testing
- Added `TestOrchestrator_RemotePath_CompletesWithoutDeadlock` regression test with 4 scenarios (single/multi task, single/multi host, more tasks than hosts)
The signal handler cleanup the reporter flagged (
signal.Stop/closeinparallel.go,run.go,task.go) is a real gap worth fixing for hygiene, but it's a separate concern. It doesn't cause the hang and should be addressed in a follow-up.Summary by CodeRabbit