fix: track real server status and uptime in health endpoint (#2938)
Conversation
Replace hardcoded `'running'`/`0` in `GetServerStatus()` with actual runtime state tracking:

- Add `ServerState` struct and `serverStartTimes`/`serverErrors` maps to `Launcher` for tracking connection outcomes
- Record start time on successful launch, error on failure
- Add `GetServerState()` method to query real server state
- Update `GetServerStatus()` to return actual status and uptime
- Servers report `'stopped'` before launch, `'running'` with real uptime after successful connection, `'error'` on failure
- `BuildHealthResponse` now correctly marks the gateway `'unhealthy'` when servers have errors

Adds 7 launcher state tests and 2 health response tests. Fixes Section 8 (Health Monitoring) compliance gap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR addresses the MCP Gateway health-monitoring compliance gap by replacing the `GetServerStatus()` stub with launcher-tracked server state and computed uptime, so `/health` can reflect non-running backends.
Changes:
- Track backend server "started at" timestamps and most-recent launch errors in `internal/launcher`.
- Update `UnifiedServer.GetServerStatus()` to return launcher-derived status and uptime (seconds since recorded start).
- Add unit tests for launcher state reporting and health becoming unhealthy when a backend launch fails.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| internal/server/unified.go | Uses launcher state to report per-server status and uptime instead of hardcoded values. |
| internal/server/health_test.go | Adds tests asserting error status/uptime and unhealthy gateway status on backend launch failure. |
| internal/launcher/launcher.go | Introduces basic server state tracking (start time + last error) and exposes GetServerState. |
| internal/launcher/launcher_test.go | Adds tests for new server state tracking behavior. |
```go
// recordError records a server launch failure.
func (l *Launcher) recordError(serverID string, errMsg string) {
	l.serverErrors[serverID] = errMsg
	logLauncher.Printf("Recorded server error: serverID=%s, error=%s", serverID, errMsg)
}
```
`recordError` logs the raw `errMsg` into logs. Since `errMsg` can include untrusted/secret-bearing content (e.g., backend stderr or URLs with tokens), this can leak sensitive data into launcher logs. Sanitize/truncate the error string before logging (or log a stable error code/type and keep the full error only in memory).
Suggested change:

```go
const maxLoggedErrorLength = 512

// truncateErrorForLog returns a version of the error message that is safe to log by
// limiting its length. The full error is still kept in memory (e.g., in serverErrors).
func truncateErrorForLog(errMsg string) string {
	if len(errMsg) <= maxLoggedErrorLength {
		return errMsg
	}
	return errMsg[:maxLoggedErrorLength] + "...(truncated)"
}

// recordError records a server launch failure.
func (l *Launcher) recordError(serverID string, errMsg string) {
	// Store the full error message in memory for internal use.
	l.serverErrors[serverID] = errMsg
	// Log only a truncated version to reduce the risk of leaking sensitive data.
	safeErrMsg := truncateErrorForLog(errMsg)
	logLauncher.Printf("Recorded server error: serverID=%s, error=%s", serverID, safeErrMsg)
}
```
```go
// GetServerState returns the observed runtime state for a single server.
func (l *Launcher) GetServerState(serverID string) ServerState {
	l.mu.RLock()
	defer l.mu.RUnlock()

	if errMsg, hasErr := l.serverErrors[serverID]; hasErr {
		return ServerState{
			Status:    "error",
			LastError: errMsg,
		}
	}

	if startedAt, ok := l.serverStartTimes[serverID]; ok {
		return ServerState{
			Status:    "running",
			StartedAt: startedAt,
		}
	}

	return ServerState{Status: "stopped"}
}
```
`GetServerState` derives "running" solely from `serverStartTimes`, but nothing clears/updates that state when connections are closed (e.g., `Launcher.Close()` or session pool cleanup). This can cause `/health` to report servers as running with increasing uptime even after shutdown/close. Consider clearing `serverStartTimes`/`serverErrors` in `Close()`, and/or recording a "stopped" state when a connection is closed so the reported status reflects reality.
```go
firstStart := l.serverStartTimes["test-server"]

// Second call should be a no-op
l.recordStart("test-server")
secondStart := l.serverStartTimes["test-server"]
```
This test reaches into `l.serverStartTimes` directly to assert behavior. That couples the test to the Launcher's internal representation and bypasses the public surface (`GetServerState`). Prefer asserting via `GetServerState().StartedAt` across two `recordStart` calls so internal refactors (e.g., switching to a different storage/locking strategy) don't break the test unnecessarily.
Suggested change:

```go
firstState := l.GetServerState("test-server")
firstStart := firstState.StartedAt

// Second call should be a no-op
l.recordStart("test-server")
secondState := l.GetServerState("test-server")
secondStart := secondState.StartedAt
```
```go
	cfg := &config.Config{
		Servers: map[string]*config.ServerConfig{
			"server-a": {Type: "stdio", Command: "docker", Args: []string{"run", "test"}},
		},
	}

	ctx := context.Background()
	us, err := NewUnified(ctx, cfg)
	require.NoError(err)
	t.Cleanup(func() { us.Close() })

	status := us.GetServerStatus()
	require.Contains(status, "server-a")
	assert.Equal("error", status["server-a"].Status, "Server should report 'error' after failed launch")
	assert.Equal(0, status["server-a"].Uptime, "Uptime should be 0 for errored server")
}

// TestBuildHealthResponse_UnhealthyOnError verifies that failed servers make the gateway unhealthy
func TestBuildHealthResponse_UnhealthyOnError(t *testing.T) {
	assert := assert.New(t)
	require := require.New(t)

	cfg := &config.Config{
		Servers: map[string]*config.ServerConfig{
			"server-a": {Type: "stdio", Command: "docker", Args: []string{"run", "test"}},
		},
	}

	ctx := context.Background()
	us, err := NewUnified(ctx, cfg)
	require.NoError(err)
	t.Cleanup(func() { us.Close() })

	response := BuildHealthResponse(us)
	assert.Equal("unhealthy", response.Status, "Gateway should be unhealthy when a server has errors")
}
```
These tests call `NewUnified()`, which triggers tool registration and will attempt to launch the configured stdio backend. Using `docker run test` can hang/pull images and then wait up to the default 60s startup timeout, making the unit test slow/flaky depending on the environment. Use a deterministic fast-fail command (e.g., a clearly nonexistent binary, like other launcher tests do) and/or set `cfg.Gateway.StartupTimeout` to a very small value in the test.
Problem
`GetServerStatus()` hardcoded every configured server as `"running"` with `Uptime: 0`, regardless of actual connection state. The `/health` endpoint always reported all servers as healthy, even if they had crashed or never been contacted.

Spec reference: Section 8 — Health Monitoring requires accurate server status and uptime tracking.
Fix
- `ServerState` struct with `serverStartTimes` and `serverErrors` maps
- `recordStart()` on successful launch, `recordError()` on failure, across both `GetOrLaunch` and `GetOrLaunchForSession` paths
- `GetServerState()` returns `"stopped"` (never launched), `"running"` (successfully connected), or `"error"` (launch failed)
- Uptime computed via `time.Since(startedAt).Seconds()` instead of hardcoded 0
- `BuildHealthResponse()` now correctly marks the gateway as `"unhealthy"` when servers have errors

Files Changed
- `internal/launcher/launcher.go` — `ServerState`, state tracking maps, `recordStart`/`recordError`/`GetServerState`
- `internal/server/unified.go` — `GetServerStatus()` queries launcher for real state + uptime
- `internal/launcher/launcher_test.go`
- `internal/server/health_test.go`

Testing
`make agent-finished` passes (format, build, lint, all tests).