Closed
Labels: area/observability (monitoring, metrics, logging, tracing), component/metal-agent (related to the Metal agent for macOS), enhancement, kind/feature (new feature or request), priority/high, size/large (> 3 days of effort)
Description
Background
Got a great question from a LinkedIn commenter on the Metal agent post:
How are you handling health checks + backpressure (queueing, timeouts, retries) when the Mac agent is saturated?
Honest answer: we're not, really. This issue tracks closing that gap.
Current State
- Agent polls llama.cpp `/health` on startup (30s timeout, 500ms intervals)
- `Healthy` flag is set once and never re-evaluated after startup
- No continuous health monitoring of managed processes
- No backpressure, queueing, or circuit breakers
- No metrics endpoint from the agent itself
- Process health is local only, not synced back to Kubernetes
- Port 9090 is configured but never actually used
Proposed Work
Health Checks
- Expose HTTP health endpoint on agent port (9090)
- Continuous liveness checks on managed llama.cpp processes (periodic `/health` polling)
- Update `Healthy` flag dynamically; restart unhealthy processes
- Report process health back to Kubernetes via Service/Endpoints status
Backpressure
- Concurrency limits on inference requests (configurable max concurrent)
- Request queueing with configurable depth and timeout
- Circuit breaker for failed operations (K8s API calls, process starts)
- Retry logic with exponential backoff for transient failures
Observability
- `/metrics` endpoint (Prometheus format) exposing:
- Inference request count, latency, errors
- Process health status
- Memory/GPU utilization
- Queue depth (if implemented)
- ServiceMonitor for Prometheus scraping
- Grafana dashboard for Metal agent metrics
Saturation Handling
- Detect when llama.cpp is overloaded (response time degradation, error rate)
- Adaptive polling interval on K8s watcher (back off under load)
- Port recycling and conflict detection for managed processes
Related
- Sprint 4-5: Production Hardening
- Existing observability stack (Prometheus + Grafana + DCGM) from Sprint 1