
fix(validator): ai-service-metrics flakes on metrics scrape startup race #386

@yuanchen8911

Description


Problem

The ai-service-metrics conformance validator does a single instant Prometheus query for DCGM_FI_DEV_GPU_UTIL and fails immediately if the result is empty. This causes flaky failures in CI when the metrics pipeline hasn't fully warmed up yet.

Evidence from GPU Training Test run: https://github.com/NVIDIA/aicr/actions/runs/23025287577/job/66871557300?pr=383

  • ai-service-metrics failed with: [NOT_FOUND] no DCGM_FI_DEV_GPU_UTIL time series in Prometheus
  • In the same run, accelerator-metrics passed and pod-autoscaling observed custom metrics shortly after

Root Cause

Single-shot query with no retry in ai_service_metrics_check.go:124. Because of the metrics scrape interval, the time series may not exist for the first few seconds after the DCGM exporter starts.

Proposed Fix

  1. Add retry/backoff before failing on an empty Prometheus result
  2. Add constants in pkg/defaults/timeouts.go:
    • AIServiceMetricsWaitTimeout = 2 * time.Minute
    • AIServiceMetricsPollInterval = 10 * time.Second
  3. Keep the current instant query, polling until the timeout if the result is empty
  4. Optionally use last_over_time(DCGM_FI_DEV_GPU_UTIL[5m]) on retries to reduce scrape-timing flakes

This keeps the check strict (still fails if truly missing) but removes startup-race flakes.

Files

  • validators/conformance/ai_service_metrics_check.go (lines 99, 124)
  • pkg/defaults/timeouts.go
