Add health checks, metrics, and continuous monitoring to Metal agent by Defilan · Pull Request #205 · defilantech/LLMKube

Defilan · 2026-03-04T06:51:00Z

Summary

Adds HTTP health/metrics server on 127.0.0.1:9090 with /healthz, /readyz, and /metrics endpoints
Adds continuous health monitoring (30s polling) with automatic process restart on failure
Adds 6 Prometheus metrics for agent observability (llmkube_metal_agent_*)
Security hardened: localhost-only binding, full HTTP timeouts, GET-only endpoints, metric label cleanup, response body draining
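The hardening points in the summary can be sketched as follows. This is a minimal illustration, not the code in pkg/agent/health.go: the helper names (`getOnly`, `newHealthMux`, `newHealthServer`, `probe`), the `ready` callback, the timeout values, and the 503 status for a not-ready agent are all assumptions. /metrics is omitted because it needs the Prometheus client library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// getOnly rejects every method except GET with 405 Method Not Allowed.
func getOnly(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			w.Header().Set("Allow", http.MethodGet)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		next(w, r)
	}
}

// newHealthMux wires /healthz and /readyz. The ready callback is a
// hypothetical hook; the real agent derives readiness from process state.
func newHealthMux(ready func() bool) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", getOnly(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok") // liveness: always 200
	}))
	mux.HandleFunc("/readyz", getOnly(func(w http.ResponseWriter, _ *http.Request) {
		if !ready() {
			// 503 is an assumed status for "not ready".
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		fmt.Fprint(w, "ready")
	}))
	return mux
}

// newHealthServer binds to loopback only and sets the server limits the PR
// lists (Read, Write, Idle, MaxHeaderBytes). The values are assumptions.
func newHealthServer(h http.Handler) *http.Server {
	return &http.Server{
		Addr:           "127.0.0.1:9090",
		Handler:        h,
		ReadTimeout:    5 * time.Second,
		WriteTimeout:   10 * time.Second,
		IdleTimeout:    60 * time.Second,
		MaxHeaderBytes: 1 << 16,
	}
}

// probe sends one in-memory request to the mux and returns the status code.
func probe(method, path string, ready bool) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, path, nil)
	newHealthMux(func() bool { return ready }).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	srv := newHealthServer(newHealthMux(func() bool { return true }))
	fmt.Println("configured address:", srv.Addr)
	fmt.Println("GET /healthz:", probe("GET", "/healthz", false))
	fmt.Println("POST /healthz:", probe("POST", "/healthz", true))
	fmt.Println("GET /readyz (not ready):", probe("GET", "/readyz", false))
}
```

In a running agent the server would be started with `srv.ListenAndServe()`; the sketch only exercises the handlers in memory so that it terminates.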

Closes #171 (health checks, metrics, and monitoring portion)

Details

New files:

pkg/agent/agentmetrics.go — standalone Prometheus registry with 6 metrics + Go/process collectors
pkg/agent/health.go — ProcessHealthChecker interface, HealthMonitor, HealthServer
pkg/agent/agentmetrics_test.go — 7 metrics tests
pkg/agent/health_test.go — 11 health server/monitor tests
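For orientation, one polling pass of a health monitor might look like the sketch below. The PR names a ProcessHealthChecker interface and a HealthMonitor but does not show their definitions, so the method set, the `monitor` struct, its fields, and the `restart` callback (standing in for scheduleRestart()) are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ProcessHealthChecker is a guess at the interface named in health.go;
// the real method set is not shown in this PR description.
type ProcessHealthChecker interface {
	Check(ctx context.Context, port int) error
}

// monitor holds the state one polling pass needs. All fields are illustrative.
type monitor struct {
	checker ProcessHealthChecker
	ports   map[string]int    // process name -> llama-server port
	healthy map[string]bool   // last observed state per process
	restart func(name string) // stand-in for scheduleRestart()
}

// pollOnce checks every managed process once, records its state, and
// requests a restart for each failure. It returns the number of restarts
// requested during this pass.
func (m *monitor) pollOnce(ctx context.Context) int {
	restarts := 0
	for name, port := range m.ports {
		if err := m.checker.Check(ctx, port); err != nil {
			m.healthy[name] = false
			m.restart(name)
			restarts++
			continue
		}
		m.healthy[name] = true // healthy, or recovered since the last pass
	}
	return restarts
}

// fakeChecker fails for the ports listed in down.
type fakeChecker struct{ down map[int]bool }

func (f fakeChecker) Check(_ context.Context, port int) error {
	if f.down[port] {
		return errors.New("health endpoint unreachable")
	}
	return nil
}

// demoPoll runs one pass with one healthy and one failed process and
// returns the restart count plus the name that was restarted.
func demoPoll() (int, string) {
	var restarted string
	m := &monitor{
		checker: fakeChecker{down: map[int]bool{8081: true}},
		ports:   map[string]int{"model-a": 8080, "model-b": 8081},
		healthy: map[string]bool{},
		restart: func(name string) { restarted = name },
	}
	n := m.pollOnce(context.Background())
	return n, restarted
}

func main() {
	n, name := demoPoll()
	fmt.Println("restarts requested:", n, "for", name)
}
```

A continuous monitor would call `pollOnce` from a loop driven by a `time.Ticker` with a 30s interval and stop when its context is cancelled. Keeping a single pass as its own method makes the behavior testable without waiting on the ticker.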

Modified files:

pkg/agent/agent.go — wires health server + monitor into Start(), adds scheduleRestart(), instruments ensureProcess()/deleteProcess() with metrics
deployment/macos/README.md — new "Health Checks & Monitoring" section
examples/metal-quickstart/README.md — health endpoint verification steps
config/grafana/SETUP.md — Metal agent scrape config and metrics table

Not in scope (separate issues): backpressure/queueing, restart backoff (#171 remainder), runtime memory pressure (#186), CRD fields (#187), ServiceMonitor/Grafana dashboard.

Test plan

make test — all existing + 18 new tests pass
make lint — 0 issues
make build — builds successfully
GOOS=darwin GOARCH=arm64 go build ./cmd/metal-agent — cross-compile check
Manual: start agent, curl localhost:9090/healthz → "ok"
Manual: curl localhost:9090/readyz → "ready"
Manual: curl localhost:9090/metrics → contains llmkube_metal_agent_managed_processes
Manual: kill a llama-server, verify restart within 30s and process_restarts_total increments
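The three curl checks above could also be scripted. The sketch below is an illustration only: `checkBody`, `smoke`, and `smokeAgainstStub` are hypothetical helpers, and the stub server stands in for a running agent so the example terminates on its own. Against a real agent, `smoke("http://127.0.0.1:9090")` would perform the same checks.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// checkBody GETs url and verifies a 200 status and that the body contains want.
func checkBody(client *http.Client, url, want string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(io.LimitReader(resp.Body, 1<<20))
	if err != nil {
		return err
	}
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: status %d", url, resp.StatusCode)
	}
	if !strings.Contains(string(body), want) {
		return fmt.Errorf("%s: body does not contain %q", url, want)
	}
	return nil
}

// smoke runs the three manual checks against base and returns the number
// of failed checks.
func smoke(base string) int {
	client := &http.Client{Timeout: 5 * time.Second}
	checks := []struct{ path, want string }{
		{"/healthz", "ok"},
		{"/readyz", "ready"},
		{"/metrics", "llmkube_metal_agent_managed_processes"},
	}
	failures := 0
	for _, c := range checks {
		if err := checkBody(client, base+c.path, c.want); err != nil {
			fmt.Println("FAIL:", err)
			failures++
		}
	}
	return failures
}

// smokeAgainstStub serves canned responses on a loopback test server and
// runs smoke against it. The stub is not the agent.
func smokeAgainstStub() int {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ok")
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "ready")
	})
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, "llmkube_metal_agent_managed_processes 0\n")
	})
	srv := httptest.NewServer(mux)
	defer srv.Close()
	return smoke(srv.URL)
}

func main() {
	fmt.Println("failed checks:", smokeAgainstStub())
}
```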

…171)

Add an HTTP health/metrics server on 127.0.0.1:9090, continuous process health polling with automatic restart, and 6 Prometheus metrics for agent observability.

New endpoints:
- GET /healthz (liveness, always 200)
- GET /readyz (readiness, 200 if any process healthy or none exist)
- GET /metrics (Prometheus text format)

New metrics:
- llmkube_metal_agent_managed_processes
- llmkube_metal_agent_process_healthy
- llmkube_metal_agent_process_restarts_total
- llmkube_metal_agent_health_check_duration_seconds
- llmkube_metal_agent_memory_budget_bytes
- llmkube_metal_agent_memory_estimated_bytes

The health monitor polls each managed llama-server every 30s via its /health endpoint. On failure, the process is marked unhealthy and restarted via ensureProcess(). On recovery, it is marked healthy again.

Security hardening applied per audit review:
- Health server binds to 127.0.0.1 only (not 0.0.0.0)
- Full HTTP server timeouts (Read, Write, Idle, MaxHeaderBytes)
- All per-process metric labels cleaned up on delete (no cardinality leak)
- Health checker drains response body for TCP connection reuse
- Endpoints restricted to GET method only

Signed-off-by: Christopher Maher <chris@mahercode.io>
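Two details from the commit message, the /readyz rule and the body draining in the health checker, can be sketched as below. The function names, the 127.0.0.1 host for llama-server, the 64 KiB drain limit, and the client timeout are assumptions made for illustration; the PR does not show the actual code.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// readyFromStates applies the readiness rule from the commit message:
// ready when no processes are managed, or when at least one is healthy.
func readyFromStates(healthy map[string]bool) bool {
	if len(healthy) == 0 {
		return true
	}
	for _, ok := range healthy {
		if ok {
			return true
		}
	}
	return false
}

// checkLlamaServer probes a llama-server /health endpoint once.
// The loopback host is an assumption; the PR only names the path.
func checkLlamaServer(ctx context.Context, client *http.Client, port int) error {
	url := fmt.Sprintf("http://127.0.0.1:%d/health", port)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Drain a bounded amount of the body before closing so the transport
	// can return the connection to its idle pool for reuse.
	_, _ = io.Copy(io.Discard, io.LimitReader(resp.Body, 64<<10))
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check on port %d: status %d", port, resp.StatusCode)
	}
	return nil
}

func main() {
	// A client like this would be passed to checkLlamaServer by the monitor.
	client := &http.Client{Timeout: 2 * time.Second}
	_ = client

	fmt.Println("no processes:", readyFromStates(nil))
	fmt.Println("all unhealthy:", readyFromStates(map[string]bool{"a": false, "b": false}))
	fmt.Println("one healthy:", readyFromStates(map[string]bool{"a": false, "b": true}))
}
```

Treating an empty process set as ready means a freshly started agent with nothing scheduled does not report itself as failing readiness.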

Defilan merged commit a113fd1 into main Mar 4, 2026
15 checks passed

Defilan deleted the feat/metal-agent-health-metrics branch March 4, 2026 07:55

This was referenced Mar 4, 2026

chore: release 0.5.0 #191

Merged

fix: correct CHANGELOG entry from 0.4.21 to 0.5.0 #212

Merged

Defilan merged 1 commit into main from feat/metal-agent-health-metrics
