Skip to content

fix: track real server status and uptime in health endpoint#2938

Merged
lpcox merged 2 commits intomainfrom
copilot/fix-get-server-status-uptime
Mar 31, 2026
Merged

fix: track real server status and uptime in health endpoint#2938
lpcox merged 2 commits intomainfrom
copilot/fix-get-server-status-uptime

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 31, 2026

Problem

GetServerStatus() hardcoded every configured server as "running" with Uptime: 0, regardless of actual connection state. The /health endpoint always reported all servers as healthy, even if they had crashed or never been contacted.

Spec reference: Section 8 — Health Monitoring requires accurate server status and uptime tracking.

Fix

  • Track server state in the Launcher — Added ServerState struct with serverStartTimes and serverErrors maps
  • Record outcomesrecordStart() on successful launch, recordError() on failure, across both GetOrLaunch and GetOrLaunchForSession paths
  • Query real stateGetServerState() returns "stopped" (never launched), "running" (successfully connected), or "error" (launch failed)
  • Compute real uptimetime.Since(startedAt).Seconds() instead of hardcoded 0
  • Accurate health responseBuildHealthResponse() now correctly marks gateway as "unhealthy" when servers have errors

Files Changed

File Change
internal/launcher/launcher.go Added ServerState, state tracking maps, recordStart/recordError/GetServerState
internal/server/unified.go GetServerStatus() queries launcher for real state + uptime
internal/launcher/launcher_test.go 7 new tests: stopped default, unknown server, running state, first-launch-only, error state, error clearing, error precedence
internal/server/health_test.go 2 new tests: error on failed launch, unhealthy on error

Testing

make agent-finished passes (format, build, lint, all tests).

Replace hardcoded 'running'/0 in GetServerStatus() with actual
runtime state tracking:

- Add ServerState struct and serverStartTimes/serverErrors maps
  to Launcher for tracking connection outcomes
- Record start time on successful launch, error on failure
- Add GetServerState() method to query real server state
- Update GetServerStatus() to return actual status and uptime
- Servers report 'stopped' before launch, 'running' with real
  uptime after successful connection, 'error' on failure
- BuildHealthResponse now correctly marks gateway 'unhealthy'
  when servers have errors

Adds 7 launcher state tests and 2 health response tests.

Fixes Section 8 (Health Monitoring) compliance gap.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox lpcox marked this pull request as ready for review March 31, 2026 18:55
Copilot AI review requested due to automatic review settings March 31, 2026 18:55
@lpcox lpcox changed the title [WIP] Fix server status reporting to include uptime fix: track real server status and uptime in health endpoint Mar 31, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses the MCP Gateway health-monitoring compliance gap by replacing the GetServerStatus() stub with launcher-tracked server state and computed uptime, so /health can reflect non-running backends.

Changes:

  • Track backend server “started at” timestamps and most-recent launch errors in internal/launcher.
  • Update UnifiedServer.GetServerStatus() to return launcher-derived status and uptime (seconds since recorded start).
  • Add unit tests for launcher state reporting and health becoming unhealthy when a backend launch fails.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
internal/server/unified.go Uses launcher state to report per-server status and uptime instead of hardcoded values.
internal/server/health_test.go Adds tests asserting error status/uptime and unhealthy gateway status on backend launch failure.
internal/launcher/launcher.go Introduces basic server state tracking (start time + last error) and exposes GetServerState.
internal/launcher/launcher_test.go Adds tests for new server state tracking behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +319 to +322
// recordError records a server launch failure.
func (l *Launcher) recordError(serverID string, errMsg string) {
l.serverErrors[serverID] = errMsg
logLauncher.Printf("Recorded server error: serverID=%s, error=%s", serverID, errMsg)
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recordError logs the raw errMsg into logs. Since errMsg can include untrusted/secret-bearing content (e.g., backend stderr or URLs with tokens), this can leak sensitive data into launcher logs. Sanitize/truncate the error string before logging (or log a stable error code/type and keep the full error only in memory).

Suggested change
// recordError records a server launch failure.
func (l *Launcher) recordError(serverID string, errMsg string) {
l.serverErrors[serverID] = errMsg
logLauncher.Printf("Recorded server error: serverID=%s, error=%s", serverID, errMsg)
const maxLoggedErrorLength = 512
// truncateErrorForLog returns a version of the error message that is safe to log by
// limiting its length. The full error is still kept in memory (e.g., in serverErrors).
func truncateErrorForLog(errMsg string) string {
if len(errMsg) <= maxLoggedErrorLength {
return errMsg
}
return errMsg[:maxLoggedErrorLength] + "...(truncated)"
}
// recordError records a server launch failure.
func (l *Launcher) recordError(serverID string, errMsg string) {
// Store the full error message in memory for internal use.
l.serverErrors[serverID] = errMsg
// Log only a truncated version to reduce the risk of leaking sensitive data.
safeErrMsg := truncateErrorForLog(errMsg)
logLauncher.Printf("Recorded server error: serverID=%s, error=%s", serverID, safeErrMsg)

Copilot uses AI. Check for mistakes.
Comment on lines +325 to +345
// GetServerState returns the observed runtime state for a single server.
func (l *Launcher) GetServerState(serverID string) ServerState {
l.mu.RLock()
defer l.mu.RUnlock()

if errMsg, hasErr := l.serverErrors[serverID]; hasErr {
return ServerState{
Status: "error",
LastError: errMsg,
}
}

if startedAt, ok := l.serverStartTimes[serverID]; ok {
return ServerState{
Status: "running",
StartedAt: startedAt,
}
}

return ServerState{Status: "stopped"}
}
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetServerState derives "running" solely from serverStartTimes, but nothing clears/updates that state when connections are closed (e.g., Launcher.Close() or session pool cleanup). This can cause /health to report servers as running with increasing uptime even after shutdown/close. Consider clearing serverStartTimes/serverErrors in Close(), and/or recording a "stopped" state when a connection is closed so the reported status reflects reality.

Copilot uses AI. Check for mistakes.
Comment on lines +822 to +826
firstStart := l.serverStartTimes["test-server"]

// Second call should be a no-op
l.recordStart("test-server")
secondStart := l.serverStartTimes["test-server"]
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test reaches into l.serverStartTimes directly to assert behavior. That couples the test to the Launcher's internal representation and bypasses the public surface (GetServerState). Prefer asserting via GetServerState().StartedAt across two recordStart calls so internal refactors (e.g., switching to a different storage/locking strategy) don't break the test unnecessarily.

Suggested change
firstStart := l.serverStartTimes["test-server"]
// Second call should be a no-op
l.recordStart("test-server")
secondStart := l.serverStartTimes["test-server"]
firstState := l.GetServerState("test-server")
firstStart := firstState.StartedAt
// Second call should be a no-op
l.recordStart("test-server")
secondState := l.GetServerState("test-server")
secondStart := secondState.StartedAt

Copilot uses AI. Check for mistakes.
Comment on lines +474 to +509
cfg := &config.Config{
Servers: map[string]*config.ServerConfig{
"server-a": {Type: "stdio", Command: "docker", Args: []string{"run", "test"}},
},
}

ctx := context.Background()
us, err := NewUnified(ctx, cfg)
require.NoError(err)
t.Cleanup(func() { us.Close() })

status := us.GetServerStatus()
require.Contains(status, "server-a")
assert.Equal("error", status["server-a"].Status, "Server should report 'error' after failed launch")
assert.Equal(0, status["server-a"].Uptime, "Uptime should be 0 for errored server")
}

// TestBuildHealthResponse_UnhealthyOnError verifies that failed servers make the gateway unhealthy
func TestBuildHealthResponse_UnhealthyOnError(t *testing.T) {
assert := assert.New(t)
require := require.New(t)

cfg := &config.Config{
Servers: map[string]*config.ServerConfig{
"server-a": {Type: "stdio", Command: "docker", Args: []string{"run", "test"}},
},
}

ctx := context.Background()
us, err := NewUnified(ctx, cfg)
require.NoError(err)
t.Cleanup(func() { us.Close() })

response := BuildHealthResponse(us)
assert.Equal("unhealthy", response.Status, "Gateway should be unhealthy when a server has errors")
}
Copy link

Copilot AI Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests call NewUnified(), which triggers tool registration and will attempt to launch the configured stdio backend. Using docker run test can hang/pull images and then wait up to the default 60s startup timeout, making the unit test slow/flaky depending on the environment. Use a deterministic fast-fail command (e.g., a clearly nonexistent binary like other launcher tests do) and/or set cfg.Gateway.StartupTimeout to a very small value in the test.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[compliance] Compliance Gap: Health Monitoring - Server Status Always Reports "running" with Zero Uptime

3 participants