
Fix deadlock in multi-level job dependency failures #141

Merged
linyows merged 3 commits into main from fix-multi-level-dependency-deadlock
Jan 11, 2026


@linyows linyows commented Jan 11, 2026

Summary

This PR fixes a deadlock issue that occurred when jobs with multi-level dependencies (3+ levels) failed in sequence.

Problem

When running workflows with multi-level job dependencies where early jobs failed:

  • Jobs at level 2 would be correctly marked as JobFailed
  • Jobs at level 3+ would remain in JobPending status indefinitely
  • This caused the workflow to deadlock and never complete

Example scenario (testdata/verify-examples.yml):

setup (fails) → http_server/grpc_server/... (JobFailed) → http/grpc/... (stuck in JobPending) 

Root Cause

The MarkJobsWithFailedDependencies() method in scheduler.go only checked for dependencies with:

status[dep] == JobCompleted && !results[dep]

This missed dependencies that were already marked as JobFailed, preventing failure propagation beyond 2 levels.
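The gap can be reproduced in isolation. The following is a minimal sketch, not the project's actual code: the `scheduler` struct, its fields, and the `markOld` and `newChain` helpers are hypothetical stand-ins, and only the status names and the condition itself come from this PR. It applies the old check to a 3-level chain taken from the example scenario above.

```go
package main

import "fmt"

// JobStatus mirrors the status names used in this PR; the concrete
// type and values are assumptions made for this sketch.
type JobStatus int

const (
	JobPending JobStatus = iota
	JobCompleted
	JobFailed
)

// scheduler is a hypothetical stand-in for the real scheduler type.
type scheduler struct {
	order   []string            // jobs in a fixed order, for deterministic passes
	needs   map[string][]string // job -> its dependencies
	status  map[string]JobStatus
	results map[string]bool // true if a completed job succeeded
}

// markOld applies the pre-fix check once to every pending job and
// returns the names of the jobs it marked as failed.
func (s *scheduler) markOld() []string {
	var marked []string
	for _, job := range s.order {
		if s.status[job] != JobPending {
			continue
		}
		for _, dep := range s.needs[job] {
			// Old condition: a dependency already in JobFailed is not matched.
			if s.status[dep] == JobCompleted && !s.results[dep] {
				s.status[job] = JobFailed
				marked = append(marked, job)
				break
			}
		}
	}
	return marked
}

// newChain builds setup -> http_server -> http, with setup already
// completed unsuccessfully.
func newChain() *scheduler {
	return &scheduler{
		order: []string{"setup", "http_server", "http"},
		needs: map[string][]string{
			"http_server": {"setup"},
			"http":        {"http_server"},
		},
		status: map[string]JobStatus{
			"setup":       JobCompleted,
			"http_server": JobPending,
			"http":        JobPending,
		},
		results: map[string]bool{"setup": false},
	}
}

func main() {
	s := newChain()
	fmt.Println(s.markOld())                    // [http_server]
	fmt.Println(s.markOld())                    // []
	fmt.Println(s.status["http"] == JobPending) // true
}
```

No matter how many passes run, `http` is never marked, because its only dependency sits in `JobFailed` rather than `JobCompleted`.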

Solution

Modified scheduler.go:257 to also check for JobFailed status:

if js.status[dep] == JobFailed || (js.status[dep] == JobCompleted && !js.results[dep]) {
    hasFailedDependency = true
}

Now failure status correctly propagates through all dependency levels.

Test Plan

  • ✅ Added unit tests in scheduler_test.go for multi-level dependency failures
  • ✅ Enhanced examples/needs-literal.yml with 5-level dependency chain
  • ✅ All existing tests pass
  • ✅ Verified fix with testdata/verify-examples.yml in docker context failure scenario

Example output (5-level chain):

⏺ 3rd job after 2nd job-B (Should Fail) (Failed in 0.01s)
⏺ 4th job after 3rd job (Should not run) (Failed in 0.00s)
⏺ 5th job after 4th job (Should not run) (Failed in 0.00s)
Total workflow time: 3.05s ✗ 3 job(s) failed

All jobs complete immediately without deadlock.

🤖 Generated with Claude Code

This commit fixes a deadlock issue that occurred when jobs with
multi-level dependencies (3+ levels) failed.

Problem:
When a job at level 1 failed, jobs at level 2 would be marked as
JobFailed. However, jobs at level 3+ that depended on level 2 jobs
would remain in JobPending status indefinitely, causing a deadlock.

Root Cause:
The MarkJobsWithFailedDependencies() method only checked for
dependencies with status JobCompleted && !results, but did not
check for dependencies with status JobFailed.

Solution:
Modified scheduler.go:257 to also check for JobFailed status:
- Before: if status[dep] == JobCompleted && !results[dep]
- After: if status[dep] == JobFailed || (status[dep] == JobCompleted && !results[dep])

This allows failure status to propagate through all dependency levels.

Changes:
- scheduler.go: Fix dependency failure detection
- scheduler_test.go: Add unit tests for multi-level dependency failures
- examples/needs-literal.yml: Add 5-level dependency chain to verify fix

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Add proper error handling for scheduler.AddJob() calls in tests
- Use t.Fatalf() to immediately fail tests if job addition fails

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@github-actions

Code Metrics Report

  |                     | main (1b59337) | #141 (4923d7a) |  +/-  |
  |---------------------|----------------|----------------|-------|
+ | Coverage            |          48.2% |          48.5% | +0.3% |
  |   Files             |             52 |             52 |     0 |
  |   Lines             |           5244 |           5244 |     0 |
+ |   Covered           |           2530 |           2546 |   +16 |
+ | Code to Test Ratio  |          1:1.1 |          1:1.1 |  +0.0 |
  |   Code              |          10564 |          10564 |     0 |
+ |   Test              |          11856 |          11955 |   +99 |
+ | Test Execution Time |            24s |            13s |  -11s |

Code coverage of files in pull request scope (76.5% → 90.9%)

  | Files        | Coverage |  +/-   | Status   |
  |--------------|----------|--------|----------|
  | scheduler.go |    90.9% | +14.4% | modified |

Reported by octocov

@linyows linyows merged commit 9938fa9 into main Jan 11, 2026
7 checks passed
@linyows linyows deleted the fix-multi-level-dependency-deadlock branch January 11, 2026 14:27