
feat: periodic health monitoring and auto-restart for backend servers (spec §8) #3022

Merged
lpcox merged 1 commit into main from feat/health-monitor-2988 on Apr 1, 2026

Conversation

Collaborator

@lpcox lpcox commented Apr 1, 2026

Summary

Implements the two remaining SHOULD requirements from MCP Gateway Specification §8 (Health Monitoring) identified in #2988:

  1. Periodic health checks — background goroutine polls all backend server states every 30 seconds
  2. Automatic restart — servers in "error" state are automatically relaunched

Changes

New: internal/launcher/health_monitor.go

  • HealthMonitor struct with a 30-second ticker goroutine
  • checkAll() iterates all configured backends via GetServerState()
  • handleErrorState() clears stale state and calls GetOrLaunch to retry
  • Caps consecutive restart failures at 3 per server (avoids infinite retry)
  • Logs all restart attempts and outcomes via logger.LogWarn/LogError/LogInfo
  • Respects both Stop() signal and context cancellation

Modified: internal/launcher/launcher.go

  • Added clearServerForRestart(serverID) — atomically removes error records, start times, and stale connections so GetOrLaunch can retry from scratch

Modified: internal/server/unified.go

  • NewUnified(): starts health monitor after tool registration
  • InitiateShutdown(): stops health monitor before closing connections

New: internal/launcher/health_monitor_test.go

  • Tests for start/stop lifecycle, default interval, failure counter reset, max-retry cap, state clearing, and context cancellation

Design Decisions

  • Watchdog in launcher package: Health monitoring is a launcher concern (server lifecycle), not a server concern (request routing)
  • Max 3 retries: Prevents log spam and wasted resources for persistently broken backends (e.g., missing Docker image)
  • No HTTP restart: HTTP backends already reconnect lazily; the monitor only actively restarts stdio backends that failed launch
  • Tool re-registration not needed: Tool handlers use GetOrLaunch on each call, so a successfully relaunched backend is immediately usable

Spec Reference

MCP Gateway Specification §8 — Health Monitoring

Closes #2988

…vers

Implement MCP Gateway Specification §8 requirements:

- Add HealthMonitor that runs a 30-second periodic ticker checking all
  backend server states via GetServerState()
- Automatically restart servers in error state by clearing cached state
  and calling GetOrLaunch to re-establish connections
- Cap consecutive restart failures at 3 per server to avoid infinite
  retry loops; log each attempt and final give-up
- Wire health monitor lifecycle into UnifiedServer: started after tool
  registration, stopped before connection teardown during shutdown
- Add clearServerForRestart to Launcher to safely remove error records
  and stale connections before retry

Includes unit tests for start/stop, failure counting, max-retry cap,
context cancellation, and state clearing.

Closes #2988

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 1, 2026 22:06
@lpcox lpcox merged commit fde9c63 into main Apr 1, 2026
22 checks passed
@lpcox lpcox deleted the feat/health-monitor-2988 branch April 1, 2026 22:09
Contributor

Copilot AI left a comment


Pull request overview

Adds a launcher-level health watchdog to periodically evaluate backend states and automatically attempt relaunches for backends in an "error" state, wiring the monitor into unified server startup/shutdown to meet MCP Gateway Spec §8 “SHOULD” guidance.

Changes:

  • Introduces HealthMonitor (30s ticker) to poll backend state and trigger capped restart attempts.
  • Adds Launcher.clearServerForRestart() to reset cached state/connection before a relaunch.
  • Wires monitor lifecycle into NewUnified() startup and InitiateShutdown() teardown, with unit tests for monitor behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File                                      Description
internal/server/unified.go                Starts/stops the new health monitor alongside unified server lifecycle.
internal/launcher/launcher.go             Adds state/connection clearing helper to support restarts.
internal/launcher/health_monitor.go       Implements periodic polling + capped restart attempts.
internal/launcher/health_monitor_test.go  Adds unit tests for monitor lifecycle and retry behavior.


Comment on lines +56 to +59
	close(hm.stopCh)
	<-hm.doneCh
	logHealth.Print("Health monitor stopped")
	logger.LogInfo("shutdown", "Health monitor stopped")

Copilot AI Apr 1, 2026


Stop() can block shutdown for up to the backend startupTimeout (default 60s) per restart attempt, because the goroutine may be inside checkAll()/GetOrLaunch() and won’t observe stopCh until it returns. Consider making Stop() time-bounded (e.g., select on doneCh with a timeout) and/or making restart attempts cancellable so shutdown isn’t held hostage by a hung backend launch.

Suggested change

-	close(hm.stopCh)
-	<-hm.doneCh
-	logHealth.Print("Health monitor stopped")
-	logger.LogInfo("shutdown", "Health monitor stopped")
+	// If already stopped, return immediately to avoid closing an already-closed channel.
+	select {
+	case <-hm.doneCh:
+		logHealth.Print("Health monitor already stopped")
+		logger.LogInfo("shutdown", "Health monitor already stopped")
+		return
+	default:
+	}
+	// Signal the run loop to stop.
+	close(hm.stopCh)
+	// Bound the time we wait for the run loop to exit, so shutdown isn't blocked indefinitely
+	// by a stuck health check or backend restart attempt.
+	timeout := hm.interval
+	if timeout <= 0 {
+		timeout = DefaultHealthCheckInterval
+	}
+	select {
+	case <-hm.doneCh:
+		logHealth.Print("Health monitor stopped")
+		logger.LogInfo("shutdown", "Health monitor stopped")
+	case <-time.After(timeout):
+		logHealth.Printf("Health monitor stop timed out after %s", timeout)
+		logger.LogInfo("shutdown", "Health monitor stop timed out after %s", timeout)
+	}

Comment on lines +47 to +57
// Start begins periodic health checks in a background goroutine.
func (hm *HealthMonitor) Start() {
	log.Printf("[HEALTH] Starting health monitor (interval=%s)", hm.interval)
	logger.LogInfo("startup", "Health monitor started (interval=%s)", hm.interval)
	go hm.run()
}

// Stop signals the health monitor to stop and waits for it to finish.
func (hm *HealthMonitor) Stop() {
	close(hm.stopCh)
	<-hm.doneCh

Copilot AI Apr 1, 2026


Start()/Stop() aren’t safe to call more than once: Start() can spawn multiple goroutines that will both close(hm.doneCh) (panic), and Stop() will close(hm.stopCh) (panic) if invoked twice. Consider guarding lifecycle with a sync.Once (or an atomic state) so start/stop are idempotent and safe under repeated shutdown paths.

Comment on lines +336 to +337
		conn.Close()
		delete(l.connections, serverID)

Copilot AI Apr 1, 2026


clearServerForRestart calls conn.Close() while holding l.mu. Since Close() can block (e.g., waiting on SDK session shutdown), this can stall unrelated launcher operations that need the same mutex. Prefer removing the connection from the map under the lock, then closing it after unlocking.

Suggested change

-	conn.Close()
-	delete(l.connections, serverID)
+	delete(l.connections, serverID)
+	go conn.Close()

Comment on lines +82 to +94
func (hm *HealthMonitor) checkAll() {
	for _, serverID := range hm.launcher.ServerIDs() {
		state := hm.launcher.GetServerState(serverID)

		switch state.Status {
		case "error":
			hm.handleErrorState(serverID, state)
		case "running":
			// Reset consecutive failure counter on healthy server.
			if hm.consecutiveFailures[serverID] > 0 {
				hm.consecutiveFailures[serverID] = 0
			}
		}

Copilot AI Apr 1, 2026


checkAll() only consults GetServerState(), which is currently derived from cached serverStartTimes/serverErrors (i.e., it won’t detect a backend that crashes after a successful launch). If the intent is true periodic health monitoring, consider adding an active probe (e.g., a lightweight RPC like tools/list/initialize on the cached connection) and recording an error when that probe fails so auto-restart can trigger on real liveness failures.

Comment on lines +199 to +201
	us.healthMonitor = launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
	us.healthMonitor.Start()

Copilot AI Apr 1, 2026


NewUnified() now starts the health monitor, but (*UnifiedServer).Close() does not stop it (many tests use defer us.Close() without calling InitiateShutdown()). This can leak a ticker goroutine and may cause the monitor to keep calling into a closed launcher. Consider stopping the monitor in Close() (or making Close() delegate to InitiateShutdown()), and/or cancelling the shared context used by the monitor/launcher.

Suggested change

-	us.healthMonitor = launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
-	us.healthMonitor.Start()
+	hm := launcher.NewHealthMonitor(l, launcher.DefaultHealthCheckInterval)
+	us.healthMonitor = hm
+	hm.Start()
+	// Ensure the health monitor stops when the context is cancelled to avoid goroutine leaks.
+	go func() {
+		<-ctx.Done()
+		hm.Stop()
+	}()



Development

Successfully merging this pull request may close these issues.

[compliance] Compliance Gap: Health Monitoring Missing Periodic Checks and Auto-Restart (Section 8)
