feat: periodic health monitoring and auto-restart for backend servers (spec §8)#3022
Conversation
Implement MCP Gateway Specification §8 requirements:

- Add `HealthMonitor` that runs a 30-second periodic ticker checking all backend server states via `GetServerState()`
- Automatically restart servers in error state by clearing cached state and calling `GetOrLaunch` to re-establish connections
- Cap consecutive restart failures at 3 per server to avoid infinite retry loops; log each attempt and the final give-up
- Wire the health monitor lifecycle into `UnifiedServer`: started after tool registration, stopped before connection teardown during shutdown
- Add `clearServerForRestart` to `Launcher` to safely remove error records and stale connections before retry

Includes unit tests for start/stop, failure counting, the max-retry cap, context cancellation, and state clearing.

Closes #2988

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Adds a launcher-level health watchdog to periodically evaluate backend states and automatically attempt relaunches for backends in an "error" state, wiring the monitor into unified server startup/shutdown to meet MCP Gateway Spec §8 “SHOULD” guidance.
Changes:
- Introduces `HealthMonitor` (30s ticker) to poll backend state and trigger capped restart attempts.
- Adds `Launcher.clearServerForRestart()` to reset cached state/connection before a relaunch.
- Wires monitor lifecycle into `NewUnified()` startup and `InitiateShutdown()` teardown, with unit tests for monitor behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/server/unified.go | Starts/stops the new health monitor alongside unified server lifecycle. |
| internal/launcher/launcher.go | Adds state/connection clearing helper to support restarts. |
| internal/launcher/health_monitor.go | Implements periodic polling + capped restart attempts. |
| internal/launcher/health_monitor_test.go | Adds unit tests for monitor lifecycle and retry behavior. |
```go
close(hm.stopCh)
<-hm.doneCh
logHealth.Print("Health monitor stopped")
logger.LogInfo("shutdown", "Health monitor stopped")
```
Stop() can block shutdown for up to the backend startupTimeout (default 60s) per restart attempt, because the goroutine may be inside checkAll()/GetOrLaunch() and won’t observe stopCh until it returns. Consider making Stop() time-bounded (e.g., select on doneCh with a timeout) and/or making restart attempts cancellable so shutdown isn’t held hostage by a hung backend launch.
Suggested change:

```diff
-close(hm.stopCh)
-<-hm.doneCh
-logHealth.Print("Health monitor stopped")
-logger.LogInfo("shutdown", "Health monitor stopped")
+// If already stopped, return immediately to avoid closing an already-closed channel.
+select {
+case <-hm.doneCh:
+	logHealth.Print("Health monitor already stopped")
+	logger.LogInfo("shutdown", "Health monitor already stopped")
+	return
+default:
+}
+// Signal the run loop to stop.
+close(hm.stopCh)
+// Bound the time we wait for the run loop to exit, so shutdown isn't blocked indefinitely
+// by a stuck health check or backend restart attempt.
+timeout := hm.interval
+if timeout <= 0 {
+	timeout = DefaultHealthCheckInterval
+}
+select {
+case <-hm.doneCh:
+	logHealth.Print("Health monitor stopped")
+	logger.LogInfo("shutdown", "Health monitor stopped")
+case <-time.After(timeout):
+	logHealth.Printf("Health monitor stop timed out after %s", timeout)
+	logger.LogInfo("shutdown", "Health monitor stop timed out after %s", timeout)
+}
```
```go
// Start begins periodic health checks in a background goroutine.
func (hm *HealthMonitor) Start() {
	log.Printf("[HEALTH] Starting health monitor (interval=%s)", hm.interval)
	logger.LogInfo("startup", "Health monitor started (interval=%s)", hm.interval)
	go hm.run()
}

// Stop signals the health monitor to stop and waits for it to finish.
func (hm *HealthMonitor) Stop() {
	close(hm.stopCh)
	<-hm.doneCh
```
Start()/Stop() aren’t safe to call more than once: Start() can spawn multiple goroutines that will both close(hm.doneCh) (panic), and Stop() will close(hm.stopCh) (panic) if invoked twice. Consider guarding lifecycle with a sync.Once (or an atomic state) so start/stop are idempotent and safe under repeated shutdown paths.
```go
conn.Close()
delete(l.connections, serverID)
```
clearServerForRestart calls conn.Close() while holding l.mu. Since Close() can block (e.g., waiting on SDK session shutdown), this can stall unrelated launcher operations that need the same mutex. Prefer removing the connection from the map under the lock, then closing it after unlocking.
Suggested change:

```diff
-conn.Close()
-delete(l.connections, serverID)
+delete(l.connections, serverID)
+go conn.Close()
```
```go
func (hm *HealthMonitor) checkAll() {
	for _, serverID := range hm.launcher.ServerIDs() {
		state := hm.launcher.GetServerState(serverID)

		switch state.Status {
		case "error":
			hm.handleErrorState(serverID, state)
		case "running":
			// Reset consecutive failure counter on healthy server.
			if hm.consecutiveFailures[serverID] > 0 {
				hm.consecutiveFailures[serverID] = 0
			}
		}
```
checkAll() only consults GetServerState(), which is currently derived from cached serverStartTimes/serverErrors (i.e., it won’t detect a backend that crashes after a successful launch). If the intent is true periodic health monitoring, consider adding an active probe (e.g., a lightweight RPC like tools/list/initialize on the cached connection) and recording an error when that probe fails so auto-restart can trigger on real liveness failures.
```go
us.healthMonitor = launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
us.healthMonitor.Start()
```
NewUnified() now starts the health monitor, but (*UnifiedServer).Close() does not stop it (many tests use defer us.Close() without calling InitiateShutdown()). This can leak a ticker goroutine and may cause the monitor to keep calling into a closed launcher. Consider stopping the monitor in Close() (or making Close() delegate to InitiateShutdown()), and/or cancelling the shared context used by the monitor/launcher.
Suggested change:

```diff
-us.healthMonitor = launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
-us.healthMonitor.Start()
+hm := launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
+us.healthMonitor = hm
+hm.Start()
+// Ensure the health monitor stops when the context is cancelled to avoid goroutine leaks.
+go func() {
+	<-ctx.Done()
+	hm.Stop()
+}()
```
Summary
Implements the two remaining SHOULD requirements from MCP Gateway Specification §8 (Health Monitoring) identified in #2988:
- Servers in `"error"` state are automatically relaunched

Changes
New: `internal/launcher/health_monitor.go`

- `HealthMonitor` struct with a 30-second ticker goroutine
- `checkAll()` iterates all configured backends via `GetServerState()`
- `handleErrorState()` clears stale state and calls `GetOrLaunch` to retry
- Logs via `logger.LogWarn`/`LogError`/`LogInfo`
- Honors the `Stop()` signal and context cancellation

Modified: `internal/launcher/launcher.go`

- `clearServerForRestart(serverID)` — atomically removes error records, start times, and stale connections so `GetOrLaunch` can retry from scratch

Modified: `internal/server/unified.go`

- `NewUnified()`: starts the health monitor after tool registration
- `InitiateShutdown()`: stops the health monitor before closing connections

New: `internal/launcher/health_monitor_test.go`

Design Decisions
- Restarts go through `GetOrLaunch` on each call, so a successfully relaunched backend is immediately usable

Spec Reference
MCP Gateway Specification §8 — Health Monitoring
Closes #2988