Add health checks, metrics, and continuous monitoring to Metal agent #205

Merged
Defilan merged 1 commit into main from feat/metal-agent-health-metrics on Mar 4, 2026


Defilan commented Mar 4, 2026

Summary

  • Adds HTTP health/metrics server on 127.0.0.1:9090 with /healthz, /readyz, and /metrics endpoints
  • Adds continuous health monitoring (30s polling) with automatic process restart on failure
  • Adds 6 Prometheus metrics for agent observability (llmkube_metal_agent_*)
  • Security hardened: localhost-only binding, full HTTP timeouts, GET-only endpoints, metric label cleanup, response body draining

Closes #171 (health checks, metrics, and monitoring portion)

Details

New files:

  • pkg/agent/agentmetrics.go — standalone Prometheus registry with 6 metrics + Go/process collectors
  • pkg/agent/health.go — ProcessHealthChecker interface, HealthMonitor, HealthServer
  • pkg/agent/agentmetrics_test.go — 7 metrics tests
  • pkg/agent/health_test.go — 11 health server/monitor tests

Modified files:

  • pkg/agent/agent.go — wires health server + monitor into Start(), adds scheduleRestart(), instruments ensureProcess()/deleteProcess() with metrics
  • deployment/macos/README.md — new "Health Checks & Monitoring" section
  • examples/metal-quickstart/README.md — health endpoint verification steps
  • config/grafana/SETUP.md — Metal agent scrape config and metrics table

Not in scope (separate issues): backpressure/queueing, restart backoff (#171 remainder), runtime memory pressure (#186), CRD fields (#187), ServiceMonitor/Grafana dashboard.

Test plan

  • make test — all existing + 18 new tests pass
  • make lint — 0 issues
  • make build — builds successfully
  • GOOS=darwin GOARCH=arm64 go build ./cmd/metal-agent — cross-compile check
  • Manual: start agent, curl localhost:9090/healthz → "ok"
  • Manual: curl localhost:9090/readyz → "ready"
  • Manual: curl localhost:9090/metrics → contains llmkube_metal_agent_managed_processes
  • Manual: kill a llama-server, verify restart within 30s and process_restarts_total increments

…171)

Add an HTTP health/metrics server on 127.0.0.1:9090, continuous process
health polling with automatic restart, and 6 Prometheus metrics for
agent observability.

New endpoints:
- GET /healthz (liveness, always 200)
- GET /readyz (readiness, 200 if any process is healthy or none exist)
- GET /metrics (Prometheus text format)
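
The liveness and readiness rules above can be sketched with the standard library alone. This is a minimal illustration, not the code in pkg/agent/health.go: the `processState` type, the `getOnly` wrapper, and the 503 status for "not ready" are assumptions made for the example. Only the paths, the "ok"/"ready" bodies, the GET-only restriction, and the readiness rule come from this PR. The /metrics endpoint is left out because it depends on the Prometheus client library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// processState is a hypothetical stand-in for the agent's per-process
// health bookkeeping; the PR's real types are not shown here.
type processState struct {
	mu      sync.RWMutex
	healthy map[string]bool // process name -> last known health
}

func (s *processState) set(name string, ok bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.healthy[name] = ok
}

// ready applies the stated rule: ready when nothing is managed yet,
// or when at least one managed process is healthy.
func (s *processState) ready() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if len(s.healthy) == 0 {
		return true
	}
	for _, ok := range s.healthy {
		if ok {
			return true
		}
	}
	return false
}

// getOnly rejects every method except GET with 405.
func getOnly(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.Header().Set("Allow", http.MethodGet)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		next(w, r)
	}
}

func newMux(s *processState) *http.ServeMux {
	mux := http.NewServeMux()
	// Liveness: always 200 while the agent can serve requests.
	mux.HandleFunc("/healthz", getOnly(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	}))
	// Readiness: 503 is an assumed status for the "not ready" case.
	mux.HandleFunc("/readyz", getOnly(func(w http.ResponseWriter, _ *http.Request) {
		if !s.ready() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "ready")
	}))
	return mux
}

// status runs one in-process request and returns the response code.
func status(h http.Handler, method, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	s := &processState{healthy: map[string]bool{}}
	mux := newMux(s)
	fmt.Println(status(mux, http.MethodGet, "/readyz")) // no processes yet
	s.set("model-a", false)
	fmt.Println(status(mux, http.MethodGet, "/readyz")) // only unhealthy processes
	fmt.Println(status(mux, http.MethodPost, "/healthz")) // non-GET request
}
```

Treating "no managed processes" as ready keeps a freshly started agent from reporting unready before any model has been scheduled onto it.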

New metrics:
- llmkube_metal_agent_managed_processes
- llmkube_metal_agent_process_healthy
- llmkube_metal_agent_process_restarts_total
- llmkube_metal_agent_health_check_duration_seconds
- llmkube_metal_agent_memory_budget_bytes
- llmkube_metal_agent_memory_estimated_bytes

The health monitor polls each managed llama-server every 30s via its
/health endpoint. On failure, the process is marked unhealthy and
restarted via ensureProcess(). On recovery, it is marked healthy again.
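
The polling behaviour described above could look roughly like the following sketch. The `ProcessHealthChecker` name comes from this PR, but its method signature, the `monitor` struct, the `flakyChecker` test double, and the `restart` callback (standing in for the ensureProcess() path) are all assumptions made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ProcessHealthChecker is named in the PR; this method signature is assumed.
type ProcessHealthChecker interface {
	Check(ctx context.Context, port int) error
}

// monitor is a hypothetical, single-goroutine version of a health monitor.
type monitor struct {
	checker  ProcessHealthChecker
	ports    map[string]int    // process name -> llama-server port
	healthy  map[string]bool   // process name -> last observed health
	restarts map[string]int    // process name -> restart count
	restart  func(name string) // stands in for the ensureProcess() path
}

func newMonitor(c ProcessHealthChecker, ports map[string]int, restart func(string)) *monitor {
	return &monitor{
		checker:  c,
		ports:    ports,
		healthy:  map[string]bool{},
		restarts: map[string]int{},
		restart:  restart,
	}
}

// pollOnce checks every process once. A failure marks the process unhealthy
// and requests a restart; a later success marks it healthy again.
func (m *monitor) pollOnce(ctx context.Context) {
	for name, port := range m.ports {
		if err := m.checker.Check(ctx, port); err != nil {
			m.healthy[name] = false
			m.restarts[name]++
			m.restart(name)
			continue
		}
		m.healthy[name] = true
	}
}

// run polls on a fixed interval until ctx is cancelled.
func (m *monitor) run(ctx context.Context, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			m.pollOnce(ctx)
		}
	}
}

// flakyChecker fails for the ports listed in down; it exists only for the demo.
type flakyChecker struct{ down map[int]bool }

func (f *flakyChecker) Check(_ context.Context, port int) error {
	if f.down[port] {
		return errors.New("health check failed")
	}
	return nil
}

func main() {
	fc := &flakyChecker{down: map[int]bool{8080: true}}
	m := newMonitor(fc, map[string]int{"model-a": 8080}, func(name string) {
		fmt.Println("restart requested for", name)
	})
	ctx := context.Background()

	m.pollOnce(ctx) // check fails: marked unhealthy, restart requested
	fmt.Println(m.healthy["model-a"], m.restarts["model-a"])

	fc.down[8080] = false // the restarted process now answers
	m.pollOnce(ctx)
	fmt.Println(m.healthy["model-a"], m.restarts["model-a"])
}
```

In a long-running agent the loop would be started with something like `go m.run(ctx, 30*time.Second)`. Because restart backoff is listed as out of scope for this PR, the sketch also restarts on every failed poll without any delay.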

Security hardening applied per audit review:
- Health server binds to 127.0.0.1 only (not 0.0.0.0)
- Full HTTP server limits (Read, Write, and Idle timeouts, plus MaxHeaderBytes)
- All per-process metric labels cleaned up on delete (no cardinality leak)
- Health checker drains response body for TCP connection reuse
- Endpoints restricted to GET method only
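
Two of the hardening items above, the bounded loopback-only server and the drained health-check response, can be sketched as follows. The address 127.0.0.1:9090 and the /health path come from this PR. The concrete timeout values, the header limit, the helper names, and the choice to treat any non-200 status as unhealthy are assumptions for the example.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// newHealthServer returns a server bound to loopback with explicit limits.
// The durations and header size below are assumed values, not the PR's.
func newHealthServer(h http.Handler) *http.Server {
	return &http.Server{
		Addr:           "127.0.0.1:9090", // loopback only, never 0.0.0.0
		Handler:        h,
		ReadTimeout:    5 * time.Second,
		WriteTimeout:   10 * time.Second,
		IdleTimeout:    60 * time.Second,
		MaxHeaderBytes: 1 << 16,
	}
}

// checkHealth GETs url and reports whether it answered 200. Reading the body
// to EOF before closing lets the HTTP transport reuse the TCP connection.
func checkHealth(ctx context.Context, c *http.Client, url string) (bool, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	_, _ = io.Copy(io.Discard, resp.Body) // drain whatever the server sent
	return resp.StatusCode == http.StatusOK, nil
}

// probeFake starts a throwaway backend that answers with the given status
// code and checks it once; it stands in for a llama-server in this demo.
func probeFake(code int) (bool, error) {
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(code)
		fmt.Fprint(w, "status body")
	}))
	defer ts.Close()
	c := &http.Client{Timeout: 2 * time.Second}
	return checkHealth(context.Background(), c, ts.URL+"/health")
}

func main() {
	srv := newHealthServer(http.NotFoundHandler())
	fmt.Println(srv.Addr, srv.ReadTimeout, srv.WriteTimeout, srv.IdleTimeout)

	ok, err := probeFake(http.StatusOK)
	fmt.Println(ok, err)
	ok, err = probeFake(http.StatusServiceUnavailable)
	fmt.Println(ok, err)
}
```

With a 30s polling interval across several managed processes, connection reuse keeps the monitor from opening a fresh socket for every check.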

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan merged commit a113fd1 into main Mar 4, 2026
15 checks passed
Defilan deleted the feat/metal-agent-health-metrics branch March 4, 2026 07:55
Linked issue that merging this pull request may close: Metal agent: health checks, backpressure, and observability