Skip to content

Treat PID 0 as dead in supervisor liveness check#4431

Merged
amirejaz merged 2 commits intostacklok:mainfrom
gkatz2:fix/supervisor-alive-pid-zero-4429
Mar 30, 2026
Merged

Treat PID 0 as dead in supervisor liveness check#4431
amirejaz merged 2 commits intostacklok:mainfrom
gkatz2:fix/supervisor-alive-pid-zero-4429

Conversation

@gkatz2
Copy link
Copy Markdown
Contributor

@gkatz2 gkatz2 commented Mar 30, 2026

Summary

  • isSupervisorProcessAlive only checked err from GetWorkloadPID, not the PID value. During transport restart, ResetWorkloadPID sets process_id to 0, so GetWorkloadPID returns (0, nil). The function returned true, causing thv restart to silently no-op during the 5-60s window where PID is 0.
  • Add pid <= 0 guard consistent with the pattern established in Guard against invalid PIDs in process operations #4401 for KillProcess and FindProcess

Fixes #4429

Type of change

  • Bug fix

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Special notes for reviewers

stopProcess has the same err-only check at line 844. When PID is 0 it proceeds to call KillProcess(0), which is harmless (that function has its own pid <= 0 guard from #4401), but adding a guard there too would be a logical consistency improvement. Left out of this PR to keep the scope minimal.

Generated with Claude Code

During transport restart, ResetWorkloadPID sets process_id to 0 in
the status file. isSupervisorProcessAlive only checked for an error
from GetWorkloadPID, so (0, nil) was treated as a live supervisor.
This caused thv restart to silently no-op during the transient
window where PID is 0.

Fixes stacklok#4429

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Greg Katz <gkatz@indeed.com>
@github-actions github-actions bot added the size/XS Extra small PR: < 100 lines changed label Mar 30, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.53%. Comparing base (eb0d405) to head (b6fd5a1).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4431      +/-   ##
==========================================
- Coverage   69.54%   69.53%   -0.01%     
==========================================
  Files         487      487              
  Lines       50138    50138              
==========================================
- Hits        34870    34865       -5     
- Misses      12585    12589       +4     
- Partials     2683     2684       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions bot added size/XS Extra small PR: < 100 lines changed and removed size/XS Extra small PR: < 100 lines changed labels Mar 30, 2026
@amirejaz amirejaz merged commit 2c27346 into stacklok:main Mar 30, 2026
36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XS Extra small PR: < 100 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

thv restart skips supervisor restart when PID is transiently zero

2 participants