-
Notifications
You must be signed in to change notification settings - Fork 198
Dying process clobbers replacement process PID in status file #4324
Description
Bug description
When replacing a running MCP server (thv rm followed by thv run), the old process's cleanup routine overwrites the new process's PID in the status file with 0. This causes monitoring tools to report the server as "desynced" even though it is running correctly.
Steps to reproduce
thv rm glean && thv run glean
# Immediately check status file: process_id = <new PID>
# Wait ~35 seconds
# Check again: process_id = 0 (clobbered by old process)
# Server is still running fineReproduction is 100% reliable with the Glean MCP server because its backend holds SSE connections open indefinitely, causing http.Server.Shutdown() to block for the full 30-second timeout before cleanup runs.
Expected behavior
After thv rm + thv run, the status file should retain the new process's PID indefinitely.
Actual behavior
The old process's SIGTERM handler eventually runs ResetWorkloadPID, which unconditionally writes process_id: 0 to the status file. By this time, the new process has already written its own PID to that same file, so the old process clobbers it.
The sequence:
thv rmsends SIGTERM but does not wait for the process to exitthv rmdeletes the status filethv runstarts the new process and creates a fresh status file with the new PID- The old process's shutdown blocks in
http.Server.Shutdown()waiting for active connections (up to 30s) - After the timeout, the old process calls
ResetWorkloadPIDwhich unconditionally writesprocess_id: 0, clobbering the new PID
The new process binds successfully because SO_REUSEADDR is set on the listener socket. The bug is only in step 5.
Additional context
Both ResetWorkloadPID call sites in pkg/runner/runner.go (the stopMCPServer closure and the doneCh path) have this problem. The fix is to add a PID comparison guard so the dying process only resets PID if the status file still contains its own PID, not a replacement's.
The bug affects all servers in theory, but only manifests when the old process's shutdown takes long enough for the new process to start first. In practice, only servers whose backends hold SSE streams open indefinitely (like Glean) reliably trigger the 30-second shutdown delay.