Add health checks, metrics, and continuous monitoring to Metal agent #205
Merged
…171)

Add an HTTP health/metrics server on 127.0.0.1:9090, continuous process health polling with automatic restart, and 6 Prometheus metrics for agent observability.

New endpoints:
- GET /healthz (liveness, always 200)
- GET /readyz (readiness, 200 if any process healthy or none exist)
- GET /metrics (Prometheus text format)

New metrics:
- llmkube_metal_agent_managed_processes
- llmkube_metal_agent_process_healthy
- llmkube_metal_agent_process_restarts_total
- llmkube_metal_agent_health_check_duration_seconds
- llmkube_metal_agent_memory_budget_bytes
- llmkube_metal_agent_memory_estimated_bytes

The health monitor polls each managed llama-server every 30s via its /health endpoint. On failure, the process is marked unhealthy and restarted via ensureProcess(). On recovery, it is marked healthy again.

Security hardening applied per audit review:
- Health server binds to 127.0.0.1 only (not 0.0.0.0)
- Full HTTP server timeouts (Read, Write, Idle, MaxHeaderBytes)
- All per-process metric labels cleaned up on delete (no cardinality leak)
- Health checker drains response body for TCP connection reuse
- Endpoints restricted to GET method only

Signed-off-by: Christopher Maher <chris@mahercode.io>
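For readers who want a concrete picture of the endpoint and hardening rules in the commit message, here is a minimal sketch using only the Go standard library. It is not the code from this PR: `readyStatus`, `getOnly`, `newHealthMux`, `newHealthServer`, `probe`, and the `snapshot` callback are hypothetical names, the timeout values are placeholders, and returning 503 when no process is healthy is an assumption (the description only states when `/readyz` returns 200). `/metrics` is left out because it depends on the Prometheus client library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// readyStatus applies the readiness rule from the description: ready when no
// processes are managed, or when at least one managed process is healthy.
// The 503 for the remaining case is an assumption.
func readyStatus(healthy map[string]bool) int {
	if len(healthy) == 0 {
		return http.StatusOK
	}
	for _, ok := range healthy {
		if ok {
			return http.StatusOK
		}
	}
	return http.StatusServiceUnavailable
}

// getOnly rejects every method other than GET with 405.
func getOnly(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.Header().Set("Allow", http.MethodGet)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		next(w, r)
	}
}

// newHealthMux wires /healthz and /readyz. snapshot is a hypothetical callback
// that returns the current health state keyed by process name.
func newHealthMux(snapshot func() map[string]bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", getOnly(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok") // liveness: always 200
	}))
	mux.HandleFunc("/readyz", getOnly(func(w http.ResponseWriter, _ *http.Request) {
		if code := readyStatus(snapshot()); code != http.StatusOK {
			http.Error(w, "not ready", code)
			return
		}
		fmt.Fprint(w, "ready")
	}))
	return mux
}

// newHealthServer binds to loopback only and sets all four limits named in the
// hardening list. The concrete values are placeholders, not the PR's values.
func newHealthServer(h http.Handler) *http.Server {
	return &http.Server{
		Addr:           "127.0.0.1:9090",
		Handler:        h,
		ReadTimeout:    5 * time.Second,
		WriteTimeout:   10 * time.Second,
		IdleTimeout:    60 * time.Second,
		MaxHeaderBytes: 1 << 16,
	}
}

// probe sends one in-memory request to the handler and returns the status code.
func probe(h http.Handler, method, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	mux := newHealthMux(func() map[string]bool {
		return map[string]bool{"model-a": false}
	})
	fmt.Println(probe(mux, http.MethodGet, "/healthz"))
	fmt.Println(probe(mux, http.MethodGet, "/readyz"))
	fmt.Println(probe(mux, http.MethodPost, "/healthz"))
	_ = newHealthServer(mux) // a real agent would call ListenAndServe() here
}
```

Keeping the readiness rule in a pure function makes the "none exist" edge case easy to test without starting a listener.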
This was referenced Mar 4, 2026
Merged
Summary
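- HTTP health/metrics server on `127.0.0.1:9090` with `/healthz`, `/readyz`, and `/metrics` endpoints
- Continuous process health polling with automatic restart
- 6 Prometheus metrics (`llmkube_metal_agent_*`)

Closes #171 (health checks, metrics, and monitoring portion)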
Details
New files:
- `pkg/agent/agentmetrics.go`: standalone Prometheus registry with 6 metrics + Go/process collectors
- `pkg/agent/health.go`: `ProcessHealthChecker` interface, `HealthMonitor`, `HealthServer`
- `pkg/agent/agentmetrics_test.go`: 7 metrics tests
- `pkg/agent/health_test.go`: 11 health server/monitor tests

Modified files:
- `pkg/agent/agent.go`: wires health server + monitor into `Start()`, adds `scheduleRestart()`, instruments `ensureProcess()`/`deleteProcess()` with metrics
- `deployment/macos/README.md`: new "Health Checks & Monitoring" section
- `examples/metal-quickstart/README.md`: health endpoint verification steps
- `config/grafana/SETUP.md`: Metal agent scrape config and metrics table

Not in scope (separate issues): backpressure/queueing, restart backoff (#171 remainder), runtime memory pressure (#186), CRD fields (#187), ServiceMonitor/Grafana dashboard.
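As a rough illustration of the checker and monitor listed for `pkg/agent/health.go`, the sketch below polls a process's `/health` endpoint, drains the response body, and reports which processes failed. Only the `ProcessHealthChecker` name comes from this PR. Its method signature, `httpChecker`, `pollOnce`, `demo`, and the timeout values are assumptions made for the example, and treating any non-200 status as unhealthy is also an assumption.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// ProcessHealthChecker is named in the PR; this method signature is assumed.
type ProcessHealthChecker interface {
	Check(ctx context.Context, baseURL string) error
}

// httpChecker is a hypothetical implementation backed by an HTTP client.
type httpChecker struct{ client *http.Client }

func (c httpChecker) Check(ctx context.Context, baseURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/health", nil)
	if err != nil {
		return err
	}
	resp, err := c.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain the body so the transport can reuse the TCP connection.
	_, _ = io.Copy(io.Discard, resp.Body)
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unhealthy: status %d", resp.StatusCode)
	}
	return nil
}

// pollOnce checks every process (name -> base URL), records the result in
// healthy, and returns the names that failed so the caller can restart them.
func pollOnce(ctx context.Context, c ProcessHealthChecker, procs map[string]string, healthy map[string]bool) []string {
	var failed []string
	for name, url := range procs {
		ok := c.Check(ctx, url) == nil
		healthy[name] = ok
		if !ok {
			failed = append(failed, name)
		}
	}
	return failed
}

// demo starts a stand-in server that answers every request with the given
// status and runs one poll cycle against it.
func demo(status int) []string {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	c := httpChecker{client: &http.Client{Timeout: 2 * time.Second}}
	return pollOnce(ctx, c, map[string]string{"model-a": srv.URL}, map[string]bool{})
}

func main() {
	fmt.Println(demo(http.StatusOK))
	fmt.Println(demo(http.StatusServiceUnavailable))
}
```

In an agent, a loop driven by a 30-second `time.Ticker` would call something like `pollOnce` and hand each failed name to the restart path. Hiding the HTTP call behind an interface lets tests substitute a fake checker.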
Test plan
- `make test`: all existing + 18 new tests pass
- `make lint`: 0 issues
- `make build`: builds successfully
- `GOOS=darwin GOARCH=arm64 go build ./cmd/metal-agent`: cross-compile check
- `curl localhost:9090/healthz` → "ok"
- `curl localhost:9090/readyz` → "ready"
- `curl localhost:9090/metrics` → contains `llmkube_metal_agent_managed_processes`
- `process_restarts_total` increments
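The last two checks in the test plan can also be scripted instead of eyeballing `curl` output. The helper below is a hypothetical sketch, not part of this PR: `sumMetric` scans Prometheus text-format output and totals every sample of one metric. The `name` label and the sample values are invented for illustration, and the parser is deliberately simplified (it ignores optional timestamps and assumes no `}` appears after the label set).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in Prometheus text
// exposition output. The second return value reports whether any sample of
// that metric was found.
func sumMetric(body, name string) (float64, bool) {
	var total float64
	found := false
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.HasPrefix(line, name) {
			continue
		}
		rest := strings.TrimPrefix(line, name)
		// The name must be followed by a label set or a space; otherwise this
		// is a different metric that merely shares the prefix.
		if rest == "" || (rest[0] != '{' && rest[0] != ' ') {
			continue
		}
		// Everything after the closing brace (or the whole remainder when
		// there are no labels) starts with the sample value.
		fields := strings.Fields(rest[strings.LastIndex(rest, "}")+1:])
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		total += v
		found = true
	}
	return total, found
}

// sample is invented output; the label name "name" is an assumption.
const sample = `# TYPE llmkube_metal_agent_process_restarts_total counter
llmkube_metal_agent_process_restarts_total{name="model-a"} 2
llmkube_metal_agent_process_restarts_total{name="model-b"} 1
# TYPE llmkube_metal_agent_managed_processes gauge
llmkube_metal_agent_managed_processes 2
`

func main() {
	restarts, ok := sumMetric(sample, "llmkube_metal_agent_process_restarts_total")
	fmt.Println(restarts, ok)
	managed, ok := sumMetric(sample, "llmkube_metal_agent_managed_processes")
	fmt.Println(managed, ok)
}
```

Calling it before and after killing a managed process, on the body fetched from `/metrics`, gives a simple way to confirm the restart counter went up.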