
feat: add Prometheus metrics, OpenTelemetry tracing, and inference observability#189

Merged
Defilan merged 3 commits into main from feat/ai-ops-observability on Feb 28, 2026

Conversation


@Defilan commented on Feb 28, 2026

Summary

  • Add 10 custom Prometheus metrics for model lifecycle and inference service management (download duration, model status, time-to-ready, GPU queue depth, reconciliation tracking)
  • Enable llama.cpp --metrics flag and health probes (startup + liveness + readiness) on inference pods
  • Add PodMonitor template for scraping inference pod metrics via Prometheus
  • Initialize OpenTelemetry TracerProvider with OTLP gRPC exporter for distributed tracing (opt-in via OTEL_EXPORTER_OTLP_ENDPOINT env var)
  • Add lifecycle alert rules: ModelDownloadSlow, InferenceServiceNotReady, GPUQueueBacklog
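As a sketch only, the lifecycle alerts in config/prometheus/llmkube-alerts.yaml might take roughly this shape — the expression, threshold, and labels below are illustrative assumptions, not the PR's actual rule definitions (only the llmkube_model_download_duration_seconds metric name is confirmed elsewhere in this PR):

```yaml
groups:
  - name: llmkube-lifecycle
    rules:
      - alert: ModelDownloadSlow
        # Illustrative: p99 download latency above 15 minutes over the last half hour
        expr: histogram_quantile(0.99, rate(llmkube_model_download_duration_seconds_bucket[30m])) > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model download is taking longer than expected"
```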

Related Issues

Changes

New files

  • internal/metrics/metrics.go — Custom metrics registered via controller-runtime registry
  • internal/metrics/metrics_test.go — Comprehensive tests for all 10 metrics
  • charts/llmkube/templates/inference-podmonitor.yaml — PodMonitor for inference pods

Modified files

  • internal/controller/model_controller.go — Download duration, status, and reconcile metrics
  • internal/controller/inferenceservice_controller.go — Phase tracking, ready duration, --metrics flag, three-probe pattern (startup/liveness/readiness)
  • cmd/main.go — OTEL TracerProvider initialization
  • config/prometheus/llmkube-alerts.yaml — 3 new lifecycle alerts
  • charts/llmkube/values.yaml — Inference PodMonitor configuration
  • go.mod — Promote OTel dependencies from indirect to direct

Design Decisions

Startup probe for model loading tolerance

llama.cpp /health returns 503 during model loading and 200 when ready. Large models (30B+) take 5-30 minutes to load. Instead of relying on initialDelaySeconds (which gave only 75s tolerance), we use Kubernetes' three-probe pattern:

  • StartupProbe: failureThreshold=180 × periodSeconds=10s = 30 min budget for model loading. Gates liveness/readiness until the model is loaded.
  • LivenessProbe: Deadlock detection after startup (tight 15s interval)
  • ReadinessProbe: Traffic routing after startup (tight 10s interval)
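In pod-spec terms, the three probes might look like the following sketch. The port and exact timings are assumptions to be checked against the generated Deployment; llama.cpp's server exposes /health on its HTTP port, and the failureThreshold/periodSeconds values follow the budget described above:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080           # assumption: the inference container's HTTP port
  periodSeconds: 10
  failureThreshold: 180  # 180 x 10s = 30 min model-loading budget
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15      # tight interval for deadlock detection after startup
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10      # tight interval gating traffic routing
```

Because the startup probe gates the other two, the 75-second ceiling of the old initialDelaySeconds approach no longer applies: liveness and readiness only begin once /health has returned 200.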

--metrics always enabled

llama.cpp --metrics has negligible overhead (read-only endpoint). The PodMonitor is enabled: false by default, so metrics aren't scraped unless the user opts in.
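Opting in might look like the following values override. The key names here are a guess for illustration; check charts/llmkube/values.yaml for the actual structure:

```yaml
# Hypothetical values override -- key names are illustrative
inferencePodMonitor:
  enabled: true   # default false; nothing is scraped until this is set
  interval: 30s
```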

Test plan

  • All existing tests pass (go test ./...)
  • New metrics package tests pass (10 tests covering registration, observations, phase transitions, histogram buckets)
  • go vet ./... passes with no issues
  • Deploy and verify metrics endpoint: curl -k https://llmkube-controller:8443/metrics | grep llmkube_
  • Deploy a model and verify llmkube_model_download_duration_seconds is observed
  • Verify PodMonitor scrapes inference pod metrics when enabled
  • Verify OTEL traces appear in Tempo when OTEL_EXPORTER_OTLP_ENDPOINT is configured
  • Verify startup probe tolerates large model loading (>75s)

…servability

Add custom Prometheus metrics for model lifecycle and inference service
management: download duration, model status, time-to-ready, GPU queue
depth, and reconciliation tracking. Enable llama.cpp metrics endpoint
and health probes on inference pods. Add PodMonitor for scraping
inference pod metrics. Initialize OpenTelemetry TracerProvider with
OTLP gRPC exporter for distributed tracing. Add lifecycle alert rules
for slow downloads, stuck services, and GPU queue backlog.

- internal/metrics: 10 custom metrics registered via controller-runtime
- model_controller: download duration, status, and reconcile metrics
- inferenceservice_controller: phase tracking, ready duration, --metrics flag, probes
- cmd/main.go: OTEL TracerProvider init when OTEL_EXPORTER_OTLP_ENDPOINT is set
- charts: PodMonitor template, values for inference pod monitoring
- config: lifecycle alert rules (ModelDownloadSlow, InferenceServiceNotReady, GPUQueueBacklog)
- tests: comprehensive metrics package tests

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan force-pushed the feat/ai-ops-observability branch from 7709327 to 685ada4 on February 28, 2026 at 22:53
Replace initialDelaySeconds-based liveness/readiness probes with a
three-probe pattern. The previous configuration gave a maximum startup
tolerance of 75 seconds (30s delay + 3 failures * 15s), which would
cause restart loops for any model taking longer to load.

llama.cpp /health returns 503 during model loading and 200 when ready.
Large models (30B+) routinely take 5-30 minutes to load onto GPU. The
new startupProbe allows up to 30 minutes (failureThreshold=180 *
periodSeconds=10) for model initialization before failing. Once startup
succeeds, tighter liveness (deadlock detection) and readiness (traffic
routing) probes take over for steady-state monitoring.

Signed-off-by: Christopher Maher <chris@mahercode.io>
The ctrl.Result return is always zero-valued, but callers take &result
to signal the status-update path was taken (pointer non-nil vs nil).
Changing the return type would cascade through multiple helper functions
for no behavioral change.

Signed-off-by: Christopher Maher <chris@mahercode.io>
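The convention described above — a nil pointer meaning "no status update", a non-nil pointer (even to a zero value) meaning the status-update path was taken — can be illustrated in isolation. Result and the helper below are stand-ins, not the controller's actual types:

```go
package main

import "fmt"

// Result mirrors the shape of ctrl.Result: its value is always zero here;
// only the pointer's nil-ness carries information.
type Result struct{ Requeue bool }

// reconcileStatus is a stand-in helper: it returns &result when the
// status-update path was taken, and nil when nothing changed.
func reconcileStatus(changed bool) *Result {
	var result Result // always zero-valued
	if changed {
		return &result // non-nil signals "status was updated"
	}
	return nil // nil signals "no status update"
}

func main() {
	fmt.Println(reconcileStatus(true) != nil)  // true
	fmt.Println(reconcileStatus(false) != nil) // false
}
```

This keeps the helper signatures unchanged while still letting callers branch on whether the status path ran, which is the trade-off the commit message describes.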
@Defilan merged commit c653ff1 into main on Feb 28, 2026
15 checks passed
@Defilan deleted the feat/ai-ops-observability branch on February 28, 2026 at 23:49