Skip to content

Dying process clobbers replacement process PID in status file #4324

@gkatz2

Description

@gkatz2

Bug description

When replacing a running MCP server (thv rm followed by thv run), the old process's cleanup routine overwrites the new process's PID in the status file with 0. This causes monitoring tools to report the server as "desynced" even though it is running correctly.

Steps to reproduce

thv rm glean && thv run glean
# Immediately check status file: process_id = <new PID>
# Wait ~35 seconds
# Check again: process_id = 0 (clobbered by old process)
# Server is still running fine

Reproduction is 100% reliable with the Glean MCP server because its backend holds SSE connections open indefinitely, causing http.Server.Shutdown() to block for the full 30-second timeout before cleanup runs.

Expected behavior

After thv rm + thv run, the status file should retain the new process's PID indefinitely.

Actual behavior

The old process's SIGTERM handler eventually runs ResetWorkloadPID, which unconditionally writes process_id: 0 to the status file. By this time, the new process has already written its own PID to that same file, so the old process clobbers it.

The sequence:

  1. thv rm sends SIGTERM but does not wait for the process to exit
  2. thv rm deletes the status file
  3. thv run starts the new process and creates a fresh status file with the new PID
  4. The old process's shutdown blocks in http.Server.Shutdown() waiting for active connections (up to 30s)
  5. After the timeout, the old process calls ResetWorkloadPID which unconditionally writes process_id: 0, clobbering the new PID

The new process binds successfully because SO_REUSEADDR is set on the listener socket. The bug is only in step 5.

Additional context

Both ResetWorkloadPID call sites in pkg/runner/runner.go (the stopMCPServer closure and the doneCh path) have this problem. The fix is to add a PID comparison guard so the dying process only resets PID if the status file still contains its own PID, not a replacement's.

The bug affects all servers in theory, but only manifests when the old process's shutdown takes long enough for the new process to start first. In practice, only servers whose backends hold SSE streams open indefinitely (like Glean) reliably trigger the 30-second shutdown delay.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingcliChanges that impact CLI functionalitygoPull requests that update go code

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions