feat: add Prometheus metrics, OpenTelemetry tracing, and inference observability #189
Merged
Conversation
…servability

Add custom Prometheus metrics for model lifecycle and inference service management: download duration, model status, time-to-ready, GPU queue depth, and reconciliation tracking. Enable llama.cpp metrics endpoint and health probes on inference pods. Add PodMonitor for scraping inference pod metrics. Initialize OpenTelemetry TracerProvider with OTLP gRPC exporter for distributed tracing. Add lifecycle alert rules for slow downloads, stuck services, and GPU queue backlog.

- internal/metrics: 10 custom metrics registered via controller-runtime
- model_controller: download duration, status, and reconcile metrics
- inferenceservice_controller: phase tracking, ready duration, --metrics flag, probes
- cmd/main.go: OTEL TracerProvider init when OTEL_EXPORTER_OTLP_ENDPOINT is set
- charts: PodMonitor template, values for inference pod monitoring
- config: lifecycle alert rules (ModelDownloadSlow, InferenceServiceNotReady, GPUQueueBacklog)
- tests: comprehensive metrics package tests

Signed-off-by: Christopher Maher <chris@mahercode.io>
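The lifecycle alert rules named in the commit could be expressed as a PrometheusRule fragment along these lines. This is only an illustrative sketch: the metric name `llmkube_gpu_queue_depth`, the threshold, and the durations are assumptions, not the actual contents of `config/prometheus/llmkube-alerts.yaml`.

```yaml
# Illustrative sketch of one lifecycle alert; the gauge name,
# threshold, and durations below are assumptions.
groups:
  - name: llmkube-lifecycle
    rules:
      - alert: GPUQueueBacklog
        # Fires when the (assumed) queue-depth gauge stays above the
        # (assumed) threshold for 10 minutes.
        expr: llmkube_gpu_queue_depth > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU scheduling queue has been backed up for 10 minutes"
```

The `for:` clause keeps a transient spike in queue depth from paging anyone; only a sustained backlog fires the alert.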
7709327 to 685ada4
Replace initialDelaySeconds-based liveness/readiness probes with a three-probe pattern. The previous configuration gave a maximum startup tolerance of 75 seconds (30s delay + 3 failures * 15s), which would cause restart loops for any model taking longer to load. llama.cpp /health returns 503 during model loading and 200 when ready. Large models (30B+) routinely take 5-30 minutes to load onto GPU. The new startupProbe allows up to 30 minutes (failureThreshold=180 * periodSeconds=10) for model initialization before failing. Once startup succeeds, tighter liveness (deadlock detection) and readiness (traffic routing) probes take over for steady-state monitoring. Signed-off-by: Christopher Maher <chris@mahercode.io>
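In pod-spec terms, the three-probe pattern described above might look roughly like the sketch below. The startup budget (`failureThreshold: 180`, `periodSeconds: 10`) comes from the commit message; the port and the liveness/readiness timings are assumptions for illustration.

```yaml
# Sketch only: port and steady-state timings are assumptions.
startupProbe:
  httpGet:
    path: /health      # llama.cpp: 503 while loading, 200 when ready
    port: 8080         # assumed container port
  periodSeconds: 10
  failureThreshold: 180  # 180 * 10s = 30 min budget for model loading
livenessProbe:           # steady-state deadlock detection (assumed timings)
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:          # gates traffic routing (assumed timings)
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

Kubernetes holds off on running the liveness and readiness probes until the startup probe succeeds, which is what makes the tight steady-state thresholds safe for slow-loading models.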
The ctrl.Result return is always zero-valued, but callers take &result to signal the status-update path was taken (pointer non-nil vs nil). Changing the return type would cascade through multiple helper functions for no behavioral change. Signed-off-by: Christopher Maher <chris@mahercode.io>
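The pointer convention described above can be sketched in isolation. `Result` and `updateStatus` here are hypothetical stand-ins for `ctrl.Result` and the PR's helper functions, not the actual controller code.

```go
package main

import (
	"fmt"
	"time"
)

// Result is a stand-in for ctrl.Result from
// sigs.k8s.io/controller-runtime; both fields stay zero-valued.
type Result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// updateStatus is a hypothetical helper. The returned Result is
// always zero-valued; the signal the caller cares about is whether
// the pointer is nil (path skipped) or non-nil (path taken).
func updateStatus(tookPath bool) *Result {
	if !tookPath {
		return nil // status-update path not taken
	}
	result := Result{} // always zero-valued
	return &result     // non-nil pointer signals "path taken"
}

func main() {
	if r := updateStatus(true); r != nil {
		fmt.Println("status-update path taken; result zero-valued:", *r == Result{})
	}
	if r := updateStatus(false); r == nil {
		fmt.Println("status-update path skipped")
	}
}
```

Changing the helper to return a bool alongside the result would carry the same information, but as the commit notes, that would ripple through every call site for no behavioral gain.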
This was referenced Feb 28, 2026
Summary
- Custom Prometheus metrics for model lifecycle and inference service management
- `--metrics` flag and health probes (startup + liveness + readiness) on inference pods
- PodMonitor for scraping inference pod metrics
- OpenTelemetry tracing with OTLP gRPC exporter (enabled when the `OTEL_EXPORTER_OTLP_ENDPOINT` env var is set)
- Lifecycle alert rules: `ModelDownloadSlow`, `InferenceServiceNotReady`, `GPUQueueBacklog`

Related Issues

- `internal/metrics` package tests, improving controller test coverage

Changes
New files
- `internal/metrics/metrics.go` — Custom metrics registered via controller-runtime registry
- `internal/metrics/metrics_test.go` — Comprehensive tests for all 10 metrics
- `charts/llmkube/templates/inference-podmonitor.yaml` — PodMonitor for inference pods

Modified files
- `internal/controller/model_controller.go` — Download duration, status, and reconcile metrics
- `internal/controller/inferenceservice_controller.go` — Phase tracking, ready duration, `--metrics` flag, three-probe pattern (startup/liveness/readiness)
- `cmd/main.go` — OTEL TracerProvider initialization
- `config/prometheus/llmkube-alerts.yaml` — 3 new lifecycle alerts
- `charts/llmkube/values.yaml` — Inference PodMonitor configuration
- `go.mod` — Promote OTel dependencies from indirect to direct

Design Decisions
Startup probe for model loading tolerance
llama.cpp `/health` returns 503 during model loading and 200 when ready. Large models (30B+) take 5-30 minutes to load. Instead of relying on `initialDelaySeconds` (which gave only 75s of tolerance), we use Kubernetes' three-probe pattern: `failureThreshold=180 * periodSeconds=10` gives a 30-minute budget for model loading and gates liveness/readiness until the model is loaded.

`--metrics` always enabled

llama.cpp `--metrics` has negligible overhead (read-only endpoint). The PodMonitor is `enabled: false` by default, so metrics aren't scraped unless the user opts in.

Test plan
- Unit tests pass (`go test ./...`)
- `go vet ./...` passes with no issues
- `curl -k https://llmkube-controller:8443/metrics | grep llmkube_` — `llmkube_model_download_duration_seconds` is observed
- Tracing verified when `OTEL_EXPORTER_OTLP_ENDPOINT` is configured
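For reference, the opt-in inference PodMonitor described in the design decisions might look roughly like the sketch below. The selector labels, port name, and scrape interval are assumptions about the chart template, not its actual contents.

```yaml
# Illustrative sketch of the inference PodMonitor; selector labels,
# port name, and interval are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: llmkube-inference
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: inference  # assumed pod label
  podMetricsEndpoints:
    - port: http       # assumed named container port
      path: /metrics   # llama.cpp --metrics endpoint
      interval: 30s
```

Because the chart ships with this template behind `enabled: false`, nothing is scraped until the user flips the flag in `values.yaml`.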