
feat: periodic health monitoring and auto-restart for backend servers (spec §8) #3022

Merged
lpcox merged 1 commit into main from feat/health-monitor-2988 on Apr 1, 2026

Conversation

Collaborator

@lpcox lpcox commented Apr 1, 2026

Summary

Implements the two remaining SHOULD requirements from MCP Gateway Specification §8 (Health Monitoring) identified in #2988:

  1. Periodic health checks — background goroutine polls all backend server states every 30 seconds
  2. Automatic restart — servers in "error" state are automatically relaunched

Changes

New: internal/launcher/health_monitor.go

  • HealthMonitor struct with a 30-second ticker goroutine
  • checkAll() iterates all configured backends via GetServerState()
  • handleErrorState() clears stale state and calls GetOrLaunch to retry
  • Caps consecutive restart failures at 3 per server (avoids infinite retry)
  • Logs all restart attempts and outcomes via logger.LogWarn/LogError/LogInfo
  • Respects both Stop() signal and context cancellation

Modified: internal/launcher/launcher.go

  • Added clearServerForRestart(serverID) — atomically removes error records, start times, and stale connections so GetOrLaunch can retry from scratch

Modified: internal/server/unified.go

  • NewUnified(): starts health monitor after tool registration
  • InitiateShutdown(): stops health monitor before closing connections

New: internal/launcher/health_monitor_test.go

  • Tests for start/stop lifecycle, default interval, failure counter reset, max-retry cap, state clearing, and context cancellation

Design Decisions

  • Watchdog in launcher package: Health monitoring is a launcher concern (server lifecycle), not a server concern (request routing)
  • Max 3 retries: Prevents log spam and wasted resources for persistently broken backends (e.g., missing Docker image)
  • No HTTP restart: HTTP backends already reconnect lazily; the monitor only actively restarts stdio backends that failed launch
  • Tool re-registration not needed: Tool handlers use GetOrLaunch on each call, so a successfully relaunched backend is immediately usable

Spec Reference

MCP Gateway Specification §8 — Health Monitoring

Closes #2988

…vers

Implement MCP Gateway Specification §8 requirements:

- Add HealthMonitor that runs a 30-second periodic ticker checking all
  backend server states via GetServerState()
- Automatically restart servers in error state by clearing cached state
  and calling GetOrLaunch to re-establish connections
- Cap consecutive restart failures at 3 per server to avoid infinite
  retry loops; log each attempt and final give-up
- Wire health monitor lifecycle into UnifiedServer: started after tool
  registration, stopped before connection teardown during shutdown
- Add clearServerForRestart to Launcher to safely remove error records
  and stale connections before retry

Includes unit tests for start/stop, failure counting, max-retry cap,
context cancellation, and state clearing.

Closes #2988

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 22:06
@lpcox lpcox merged commit fde9c63 into main Apr 1, 2026
22 checks passed
@lpcox lpcox deleted the feat/health-monitor-2988 branch April 1, 2026 22:09
Contributor

Copilot AI left a comment


Pull request overview

Adds a launcher-level health watchdog to periodically evaluate backend states and automatically attempt relaunches for backends in an "error" state, wiring the monitor into unified server startup/shutdown to meet MCP Gateway Spec §8 “SHOULD” guidance.

Changes:

  • Introduces HealthMonitor (30s ticker) to poll backend state and trigger capped restart attempts.
  • Adds Launcher.clearServerForRestart() to reset cached state/connection before a relaunch.
  • Wires monitor lifecycle into NewUnified() startup and InitiateShutdown() teardown, with unit tests for monitor behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File                                      Description
internal/server/unified.go                Starts/stops the new health monitor alongside unified server lifecycle.
internal/launcher/launcher.go             Adds state/connection clearing helper to support restarts.
internal/launcher/health_monitor.go       Implements periodic polling + capped restart attempts.
internal/launcher/health_monitor_test.go  Adds unit tests for monitor lifecycle and retry behavior.


Comment on lines +56 to +59
	close(hm.stopCh)
	<-hm.doneCh
	logHealth.Print("Health monitor stopped")
	logger.LogInfo("shutdown", "Health monitor stopped")

Copilot AI Apr 1, 2026


Stop() can block shutdown for up to the backend startupTimeout (default 60s) per restart attempt, because the goroutine may be inside checkAll()/GetOrLaunch() and won’t observe stopCh until it returns. Consider making Stop() time-bounded (e.g., select on doneCh with a timeout) and/or making restart attempts cancellable so shutdown isn’t held hostage by a hung backend launch.

Suggested change

-	close(hm.stopCh)
-	<-hm.doneCh
-	logHealth.Print("Health monitor stopped")
-	logger.LogInfo("shutdown", "Health monitor stopped")
+	// If already stopped, return immediately to avoid closing an already-closed channel.
+	select {
+	case <-hm.doneCh:
+		logHealth.Print("Health monitor already stopped")
+		logger.LogInfo("shutdown", "Health monitor already stopped")
+		return
+	default:
+	}
+	// Signal the run loop to stop.
+	close(hm.stopCh)
+	// Bound the time we wait for the run loop to exit, so shutdown isn't blocked indefinitely
+	// by a stuck health check or backend restart attempt.
+	timeout := hm.interval
+	if timeout <= 0 {
+		timeout = DefaultHealthCheckInterval
+	}
+	select {
+	case <-hm.doneCh:
+		logHealth.Print("Health monitor stopped")
+		logger.LogInfo("shutdown", "Health monitor stopped")
+	case <-time.After(timeout):
+		logHealth.Printf("Health monitor stop timed out after %s", timeout)
+		logger.LogInfo("shutdown", "Health monitor stop timed out after %s", timeout)
+	}

Comment on lines +47 to +57
// Start begins periodic health checks in a background goroutine.
func (hm *HealthMonitor) Start() {
	log.Printf("[HEALTH] Starting health monitor (interval=%s)", hm.interval)
	logger.LogInfo("startup", "Health monitor started (interval=%s)", hm.interval)
	go hm.run()
}

// Stop signals the health monitor to stop and waits for it to finish.
func (hm *HealthMonitor) Stop() {
	close(hm.stopCh)
	<-hm.doneCh

Copilot AI Apr 1, 2026


Start()/Stop() aren’t safe to call more than once: Start() can spawn multiple goroutines that will both close(hm.doneCh) (panic), and Stop() will close(hm.stopCh) (panic) if invoked twice. Consider guarding lifecycle with a sync.Once (or an atomic state) so start/stop are idempotent and safe under repeated shutdown paths.

Comment on lines +336 to +337
		conn.Close()
		delete(l.connections, serverID)

Copilot AI Apr 1, 2026


clearServerForRestart calls conn.Close() while holding l.mu. Since Close() can block (e.g., waiting on SDK session shutdown), this can stall unrelated launcher operations that need the same mutex. Prefer removing the connection from the map under the lock, then closing it after unlocking.

Suggested change

-	conn.Close()
-	delete(l.connections, serverID)
+	delete(l.connections, serverID)
+	go conn.Close()

Comment on lines +82 to +94
func (hm *HealthMonitor) checkAll() {
	for _, serverID := range hm.launcher.ServerIDs() {
		state := hm.launcher.GetServerState(serverID)

		switch state.Status {
		case "error":
			hm.handleErrorState(serverID, state)
		case "running":
			// Reset consecutive failure counter on healthy server.
			if hm.consecutiveFailures[serverID] > 0 {
				hm.consecutiveFailures[serverID] = 0
			}
		}

Copilot AI Apr 1, 2026


checkAll() only consults GetServerState(), which is currently derived from cached serverStartTimes/serverErrors (i.e., it won’t detect a backend that crashes after a successful launch). If the intent is true periodic health monitoring, consider adding an active probe (e.g., a lightweight RPC like tools/list/initialize on the cached connection) and recording an error when that probe fails so auto-restart can trigger on real liveness failures.

Comment on lines +199 to +201
	us.healthMonitor = launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
	us.healthMonitor.Start()

Copilot AI Apr 1, 2026


NewUnified() now starts the health monitor, but (*UnifiedServer).Close() does not stop it (many tests use defer us.Close() without calling InitiateShutdown()). This can leak a ticker goroutine and may cause the monitor to keep calling into a closed launcher. Consider stopping the monitor in Close() (or making Close() delegate to InitiateShutdown()), and/or cancelling the shared context used by the monitor/launcher.

Suggested change

-	us.healthMonitor = launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
-	us.healthMonitor.Start()
+	hm := launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
+	us.healthMonitor = hm
+	hm.Start()
+	// Ensure the health monitor stops when the context is cancelled to avoid goroutine leaks.
+	go func() {
+		<-ctx.Done()
+		hm.Stop()
+	}()



Development

Successfully merging this pull request may close these issues.

[compliance] Compliance Gap: Health Monitoring Missing Periodic Checks and Auto-Restart (Section 8)
