Problem
The ai-service-metrics conformance validator does a single instant Prometheus query for DCGM_FI_DEV_GPU_UTIL and fails immediately if the result is empty. This causes flaky failures in CI when the metrics pipeline hasn't fully warmed up yet.
Evidence from GPU Training Test run: https://github.com/NVIDIA/aicr/actions/runs/23025287577/job/66871557300?pr=383
- `ai-service-metrics` failed with: `[NOT_FOUND] no DCGM_FI_DEV_GPU_UTIL time series in Prometheus`
- In the same run, `accelerator-metrics` passed and `pod-autoscaling` observed custom metrics shortly after
Root Cause
A single-shot query with no retry at `ai_service_metrics_check.go:124`. Because of the metrics scrape interval, the time series may not exist for the first few seconds after the DCGM exporter starts.
Proposed Fix
- Add retry/backoff before failing on empty Prometheus result
- Add constants in `pkg/defaults/timeouts.go`:
  - `AIServiceMetricsWaitTimeout = 2 * time.Minute`
  - `AIServiceMetricsPollInterval = 10 * time.Second`
- Keep current instant query, poll until timeout if result is empty
- Optionally use `last_over_time(DCGM_FI_DEV_GPU_UTIL[5m])` on retries to reduce scrape-timing flakes
This keeps the check strict (still fails if truly missing) but removes startup-race flakes.
Files
- `validators/conformance/ai_service_metrics_check.go` (lines 99, 124)
- `pkg/defaults/timeouts.go`