
feat: add Prometheus metrics, OpenTelemetry tracing, and inference observability#189

Merged
Defilan merged 3 commits into main from feat/ai-ops-observability on Feb 28, 2026

Conversation


@Defilan commented on Feb 28, 2026

Summary

  • Add 10 custom Prometheus metrics for model lifecycle and inference service management (download duration, model status, time-to-ready, GPU queue depth, reconciliation tracking)
  • Enable llama.cpp --metrics flag and health probes (startup + liveness + readiness) on inference pods
  • Add PodMonitor template for scraping inference pod metrics via Prometheus
  • Initialize OpenTelemetry TracerProvider with OTLP gRPC exporter for distributed tracing (opt-in via OTEL_EXPORTER_OTLP_ENDPOINT env var)
  • Add lifecycle alert rules: ModelDownloadSlow, InferenceServiceNotReady, GPUQueueBacklog
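As a sketch only, the lifecycle alerts in config/prometheus/llmkube-alerts.yaml might take roughly this shape — the expression, threshold, and labels below are illustrative assumptions, not the PR's actual rule definitions (only the llmkube_model_download_duration_seconds metric name is confirmed elsewhere in this PR):

```yaml
groups:
  - name: llmkube-lifecycle
    rules:
      - alert: ModelDownloadSlow
        # Illustrative: p99 download latency above 15 minutes over the last half hour
        expr: histogram_quantile(0.99, rate(llmkube_model_download_duration_seconds_bucket[30m])) > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Model download is taking longer than expected"
```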

Related Issues

Changes

New files

  • internal/metrics/metrics.go — Custom metrics registered via controller-runtime registry
  • internal/metrics/metrics_test.go — Comprehensive tests for all 10 metrics
  • charts/llmkube/templates/inference-podmonitor.yaml — PodMonitor for inference pods

Modified files

  • internal/controller/model_controller.go — Download duration, status, and reconcile metrics
  • internal/controller/inferenceservice_controller.go — Phase tracking, ready duration, --metrics flag, three-probe pattern (startup/liveness/readiness)
  • cmd/main.go — OTEL TracerProvider initialization
  • config/prometheus/llmkube-alerts.yaml — 3 new lifecycle alerts
  • charts/llmkube/values.yaml — Inference PodMonitor configuration
  • go.mod — Promote OTel dependencies from indirect to direct

Design Decisions

Startup probe for model loading tolerance

llama.cpp /health returns 503 during model loading and 200 when ready. Large models (30B+) take 5-30 minutes to load. Instead of relying on initialDelaySeconds (which gave only 75s tolerance), we use Kubernetes' three-probe pattern:

  • StartupProbe: failureThreshold=180 × periodSeconds=10s = 30 min budget for model loading. Gates liveness/readiness until the model is loaded.
  • LivenessProbe: Deadlock detection after startup (tight 15s interval)
  • ReadinessProbe: Traffic routing after startup (tight 10s interval)
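In pod-spec terms, the three probes might look like the following sketch. The port and exact timings are assumptions to be checked against the generated Deployment; llama.cpp's server exposes /health on its HTTP port, and the failureThreshold/periodSeconds values follow the budget described above:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080           # assumption: the inference container's HTTP port
  periodSeconds: 10
  failureThreshold: 180  # 180 x 10s = 30 min model-loading budget
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15      # tight interval for deadlock detection after startup
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10      # tight interval gating traffic routing
```

Because the startup probe gates the other two, the 75-second ceiling of the old initialDelaySeconds approach no longer applies: liveness and readiness only begin once /health has returned 200.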

--metrics always enabled

llama.cpp --metrics has negligible overhead (read-only endpoint). The PodMonitor is enabled: false by default, so metrics aren't scraped unless the user opts in.
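Opting in might look like the following values override. The key names here are a guess for illustration; check charts/llmkube/values.yaml for the actual structure:

```yaml
# Hypothetical values override -- key names are illustrative
inferencePodMonitor:
  enabled: true   # default false; nothing is scraped until this is set
  interval: 30s
```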

Test plan

  • All existing tests pass (go test ./...)
  • New metrics package tests pass (10 tests covering registration, observations, phase transitions, histogram buckets)
  • go vet ./... passes with no issues
  • Deploy and verify metrics endpoint: curl -k https://llmkube-controller:8443/metrics | grep llmkube_
  • Deploy a model and verify llmkube_model_download_duration_seconds is observed
  • Verify PodMonitor scrapes inference pod metrics when enabled
  • Verify OTEL traces appear in Tempo when OTEL_EXPORTER_OTLP_ENDPOINT is configured
  • Verify startup probe tolerates large model loading (>75s)

…servability

Add custom Prometheus metrics for model lifecycle and inference service
management: download duration, model status, time-to-ready, GPU queue
depth, and reconciliation tracking. Enable llama.cpp metrics endpoint
and health probes on inference pods. Add PodMonitor for scraping
inference pod metrics. Initialize OpenTelemetry TracerProvider with
OTLP gRPC exporter for distributed tracing. Add lifecycle alert rules
for slow downloads, stuck services, and GPU queue backlog.

- internal/metrics: 10 custom metrics registered via controller-runtime
- model_controller: download duration, status, and reconcile metrics
- inferenceservice_controller: phase tracking, ready duration, --metrics flag, probes
- cmd/main.go: OTEL TracerProvider init when OTEL_EXPORTER_OTLP_ENDPOINT is set
- charts: PodMonitor template, values for inference pod monitoring
- config: lifecycle alert rules (ModelDownloadSlow, InferenceServiceNotReady, GPUQueueBacklog)
- tests: comprehensive metrics package tests

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan force-pushed the feat/ai-ops-observability branch from 7709327 to 685ada4 on February 28, 2026 at 22:53
Replace initialDelaySeconds-based liveness/readiness probes with a
three-probe pattern. The previous configuration gave a maximum startup
tolerance of 75 seconds (30s delay + 3 failures * 15s), which would
cause restart loops for any model taking longer to load.

llama.cpp /health returns 503 during model loading and 200 when ready.
Large models (30B+) routinely take 5-30 minutes to load onto GPU. The
new startupProbe allows up to 30 minutes (failureThreshold=180 *
periodSeconds=10) for model initialization before failing. Once startup
succeeds, tighter liveness (deadlock detection) and readiness (traffic
routing) probes take over for steady-state monitoring.

Signed-off-by: Christopher Maher <chris@mahercode.io>
The ctrl.Result return is always zero-valued, but callers take &result
to signal the status-update path was taken (pointer non-nil vs nil).
Changing the return type would cascade through multiple helper functions
for no behavioral change.

Signed-off-by: Christopher Maher <chris@mahercode.io>
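The convention described above — a nil pointer meaning "no status update", a non-nil pointer (even to a zero value) meaning the status-update path was taken — can be illustrated in isolation. Result and the helper below are stand-ins, not the controller's actual types:

```go
package main

import "fmt"

// Result mirrors the shape of ctrl.Result: its value is always zero here;
// only the pointer's nil-ness carries information.
type Result struct{ Requeue bool }

// reconcileStatus is a stand-in helper: it returns &result when the
// status-update path was taken, and nil when nothing changed.
func reconcileStatus(changed bool) *Result {
	var result Result // always zero-valued
	if changed {
		return &result // non-nil signals "status was updated"
	}
	return nil // nil signals "no status update"
}

func main() {
	fmt.Println(reconcileStatus(true) != nil)  // true
	fmt.Println(reconcileStatus(false) != nil) // false
}
```

This keeps the helper signatures unchanged while still letting callers branch on whether the status path ran, which is the trade-off the commit message describes.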
@Defilan merged commit c653ff1 into main on Feb 28, 2026
15 checks passed
@Defilan deleted the feat/ai-ops-observability branch on February 28, 2026 at 23:49