Metal agent: health checks, backpressure, and observability #171

@Defilan

Background

Got a great question from a LinkedIn commenter on the Metal agent post:

"How are you handling health checks + backpressure (queueing, timeouts, retries) when the Mac agent is saturated?"

Honest answer: we're not, really. This issue tracks closing that gap.

Current State

  • Agent polls llama.cpp /health on startup (30s timeout, 500ms intervals)
  • Healthy flag is set once and never re-evaluated after startup
  • No continuous health monitoring of managed processes
  • No backpressure, queueing, or circuit breakers
  • No metrics endpoint from the agent itself
  • Process health is local only, not synced back to Kubernetes
  • Port 9090 is configured but never actually used
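
For reference, the startup-only check described above has roughly this shape. This is a sketch in Python for illustration, not the agent's actual code: `wait_until_healthy` and `http_probe` are hypothetical names, and the 2s per-request timeout is an assumption. Only the 30s overall timeout and 500ms interval come from the list above.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on `url` answers 200. Hypothetical helper."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, non-2xx status, malformed response.
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() + interval_s > deadline:
            return False  # the next attempt would land past the deadline
        sleep(interval_s)
```

The gap is what happens after this returns: the result is stored once and nothing calls the probe again.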

Proposed Work

Health Checks

  • Expose HTTP health endpoint on agent port (9090)
  • Continuous liveness checks on managed llama.cpp processes (periodic /health polling)
  • Update Healthy flag dynamically, restart unhealthy processes
  • Report process health back to Kubernetes via Service/Endpoints status
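
A minimal sketch of what continuous checking could look like, in Python for illustration. Everything here is an assumption rather than a decided design: `HealthMonitor`, `health_response`, the failure threshold of 3, and the choice to flip the flag on the first failed probe but only restart after several consecutive failures. `probe` would be an HTTP GET against the process's /health, and `restart` whatever the agent uses to relaunch the process.

```python
import threading
from typing import Callable, Dict, Tuple


class HealthMonitor:
    """Re-evaluates one managed process on every tick, not just at startup."""

    def __init__(self, probe: Callable[[], bool], restart: Callable[[], None],
                 failure_threshold: int = 3) -> None:
        self._probe = probe
        self._restart = restart
        self._failure_threshold = failure_threshold
        self._failures = 0
        self._lock = threading.Lock()
        self.healthy = False

    def tick(self) -> bool:
        """Run one probe, update `healthy`, restart after repeated failures."""
        try:
            ok = bool(self._probe())
        except Exception:
            ok = False  # a probe that raises counts as a failed check
        with self._lock:
            if ok:
                self._failures = 0
                self.healthy = True
                return True
            self._failures += 1
            self.healthy = False
            should_restart = self._failures >= self._failure_threshold
            if should_restart:
                self._failures = 0  # give the restarted process a fresh count
        if should_restart:
            self._restart()  # called outside the lock; a restart may be slow
        return False

    def run(self, interval_s: float, stop: threading.Event) -> None:
        """Tick until `stop` is set; meant to run in a background thread."""
        while not stop.is_set():
            self.tick()
            stop.wait(interval_s)


def health_response(monitors: Dict[str, HealthMonitor]) -> Tuple[int, str]:
    """Status code and body for the agent's own health endpoint."""
    unhealthy = sorted(name for name, m in monitors.items() if not m.healthy)
    if unhealthy:
        return 503, "unhealthy: " + ", ".join(unhealthy)
    return 200, "ok"
```

`health_response` is kept as a pure function so the handler on port 9090 and whatever reports status back to Kubernetes can share the same source of truth.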

Backpressure

  • Concurrency limits on inference requests (configurable max concurrent)
  • Request queueing with configurable depth and timeout
  • Circuit breaker for failed operations (K8s API calls, process starts)
  • Retry logic with exponential backoff for transient failures
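
To make the first, second, and fourth bullets concrete, here is a rough Python sketch. It is not a committed design: `AdmissionGate`, `Saturated`, `backoff_delays`, and `retry` are hypothetical names, and the defaults are placeholders. The gate caps concurrent requests, lets a bounded number wait for a slot, and rejects the rest. The retry helper doubles its delay per attempt up to a cap and applies random jitter.

```python
import random
import threading
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


class Saturated(Exception):
    """Raised when a request is rejected instead of being admitted."""


class AdmissionGate:
    def __init__(self, max_concurrent: int, max_queue: int,
                 queue_timeout_s: float) -> None:
        self._slots = threading.Semaphore(max_concurrent)
        self._max_queue = max_queue
        self._queue_timeout_s = queue_timeout_s
        self._waiting = 0
        self._lock = threading.Lock()

    @property
    def queue_depth(self) -> int:
        return self._waiting

    def run(self, fn: Callable[[], T]) -> T:
        # Fast path: take a free slot without queueing.
        if not self._slots.acquire(blocking=False):
            with self._lock:
                if self._waiting >= self._max_queue:
                    raise Saturated("queue full")
                self._waiting += 1
            try:
                acquired = self._slots.acquire(timeout=self._queue_timeout_s)
            finally:
                with self._lock:
                    self._waiting -= 1
            if not acquired:
                raise Saturated("timed out waiting for a slot")
        try:
            return fn()
        finally:
            self._slots.release()


def backoff_delays(base_s: float, cap_s: float, attempts: int) -> List[float]:
    """Delays between attempts: base, 2*base, 4*base, ... capped at cap_s."""
    return [min(cap_s, base_s * (2 ** i)) for i in range(attempts - 1)]


def retry(fn: Callable[[], T], attempts: int = 4, base_s: float = 0.1,
          cap_s: float = 5.0,
          retryable: Tuple[Type[BaseException], ...] = (OSError,),
          sleep: Callable[[float], None] = time.sleep,
          rng: Callable[[], float] = random.random) -> T:
    """Call `fn`, retrying `retryable` errors with jittered exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delays = backoff_delays(base_s, cap_s, attempts)
    for i in range(attempts):
        try:
            return fn()
        except retryable:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[i] * rng())  # jitter avoids synchronized retries
    raise AssertionError("unreachable")
```

Rejecting with `Saturated` rather than blocking indefinitely lets the caller turn saturation into an explicit response (for example an HTTP 429 or 503) instead of a hung request. Note that a new arrival can take a freed slot ahead of queued waiters; strict FIFO ordering would need a real queue.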

Observability

  • /metrics endpoint (Prometheus format) exposing:
    • Inference request count, latency, errors
    • Process health status
    • Memory/GPU utilization
    • Queue depth (if implemented)
  • ServiceMonitor for Prometheus scraping
  • Grafana dashboard for Metal agent metrics

Saturation Handling

  • Detect when llama.cpp is overloaded (response time degradation, error rate)
  • Adaptive polling interval on K8s watcher (back off under load)
  • Port recycling and conflict detection for managed processes
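
One possible shape for the three items above, sketched in Python. All names and thresholds are assumptions for illustration (`SaturationDetector`, `next_poll_interval`, `port_is_free`, the 5s latency threshold, the 20% error rate, the 1s to 30s interval bounds). The detector keeps a rolling window of recent requests and flags saturation when the window's 95th-percentile latency or error rate crosses a threshold.

```python
import socket
from collections import deque
from typing import Deque, Tuple


class SaturationDetector:
    def __init__(self, window: int = 50, latency_threshold_s: float = 5.0,
                 error_rate_threshold: float = 0.2) -> None:
        # Rolling window of (latency_s, ok); old samples fall off automatically.
        self._samples: Deque[Tuple[float, bool]] = deque(maxlen=window)
        self._latency_threshold_s = latency_threshold_s
        self._error_rate_threshold = error_rate_threshold

    def record(self, latency_s: float, ok: bool) -> None:
        self._samples.append((latency_s, ok))

    def saturated(self) -> bool:
        if not self._samples:
            return False
        latencies = sorted(latency for latency, _ in self._samples)
        p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
        errors = sum(1 for _, ok in self._samples if not ok)
        error_rate = errors / len(self._samples)
        return (p95 > self._latency_threshold_s
                or error_rate > self._error_rate_threshold)


def next_poll_interval(current_s: float, saturated: bool,
                       min_s: float = 1.0, max_s: float = 30.0) -> float:
    """Double the interval under load; halve it back toward min_s otherwise."""
    if saturated:
        return min(max_s, current_s * 2)
    return max(min_s, current_s / 2)


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Best-effort conflict check: try to bind the port and release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False
        return True
```

`port_is_free` is inherently racy: another process can claim the port between the check and the llama.cpp launch, so a failed start would still need to be handled.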

Related

  • Sprint 4-5: Production Hardening
  • Existing observability stack (Prometheus + Grafana + DCGM) from Sprint 1
