Dying process clobbers replacement process PID in status file

## Bug description

When replacing a running MCP server (`thv rm` followed by `thv run`), the old process's cleanup routine overwrites the new process's PID in the status file with 0. This causes monitoring tools to report the server as "desynced" even though it is running correctly.

## Steps to reproduce

```bash
thv rm glean && thv run glean
# Immediately check status file: process_id = <new PID>
# Wait ~35 seconds
# Check again: process_id = 0 (clobbered by old process)
# Server is still running fine
```

Reproduction is 100% reliable with the Glean MCP server because its backend holds SSE connections open indefinitely, causing `http.Server.Shutdown()` to block for the full 30-second timeout before cleanup runs.

## Expected behavior

After `thv rm` + `thv run`, the status file should retain the new process's PID indefinitely.

## Actual behavior

The old process's SIGTERM handler eventually runs `ResetWorkloadPID`, which unconditionally writes `process_id: 0` to the status file. By this time, the new process has already written its own PID to that same file, so the old process clobbers it.

The sequence:

1. `thv rm` sends SIGTERM but does not wait for the process to exit
2. `thv rm` deletes the status file
3. `thv run` starts the new process and creates a fresh status file with the new PID
4. The old process's shutdown blocks in `http.Server.Shutdown()` waiting for active connections (up to 30s)
5. After the timeout, the old process calls `ResetWorkloadPID` which unconditionally writes `process_id: 0`, clobbering the new PID

The new process binds successfully because `SO_REUSEADDR` is set on the listener socket. The bug is only in step 5.

## Additional context

Both `ResetWorkloadPID` call sites in `pkg/runner/runner.go` (the `stopMCPServer` closure and the `doneCh` path) have this problem. The fix is to add a PID comparison guard so the dying process only resets PID if the status file still contains its own PID, not a replacement's.

The bug affects all servers in theory, but only manifests when the old process's shutdown takes long enough for the new process to start first. In practice, only servers whose backends hold SSE streams open indefinitely (like Glean) reliably trigger the 30-second shutdown delay.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dying process clobbers replacement process PID in status file #4324

Bug description

Steps to reproduce

Expected behavior

Actual behavior

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Dying process clobbers replacement process PID in status file #4324

Description

Bug description

Steps to reproduce

Expected behavior

Actual behavior

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions