Closed
Labels: area/observability (monitoring, metrics, logging, tracing), component/metal-agent (related to the Metal agent for macOS), enhancement, kind/feature (new feature or request), priority/high, size/large (> 3 days of effort)
Description
Background
Got a great question from a LinkedIn commenter on the Metal agent post:
How are you handling health checks + backpressure (queueing, timeouts, retries) when the Mac agent is saturated?
Honest answer: we're not, really. This issue tracks closing that gap.
Current State
- Agent polls llama.cpp `/health` on startup (30s timeout, 500ms intervals)
- `Healthy` flag is set once and never re-evaluated after startup
- No continuous health monitoring of managed processes
- No backpressure, queueing, or circuit breakers
- No metrics endpoint from the agent itself
- Process health is local only, not synced back to Kubernetes
- Port 9090 is configured but never actually used
Proposed Work
Health Checks
- Expose HTTP health endpoint on agent port (9090)
- Continuous liveness checks on managed llama.cpp processes (periodic `/health` polling)
- Update `Healthy` flag dynamically; restart unhealthy processes
- Report process health back to Kubernetes via Service/Endpoints status
Backpressure
- Concurrency limits on inference requests (configurable max concurrent)
- Request queueing with configurable depth and timeout
- Circuit breaker for failed operations (K8s API calls, process starts)
- Retry logic with exponential backoff for transient failures
Observability
- `/metrics` endpoint (Prometheus format) exposing:
- Inference request count, latency, errors
- Process health status
- Memory/GPU utilization
- Queue depth (if implemented)
- ServiceMonitor for Prometheus scraping
- Grafana dashboard for Metal agent metrics
Saturation Handling
- Detect when llama.cpp is overloaded (response time degradation, error rate)
- Adaptive polling interval on K8s watcher (back off under load)
- Port recycling and conflict detection for managed processes
Related
- Sprint 4-5: Production Hardening
- Existing observability stack (Prometheus + Grafana + DCGM) from Sprint 1