fix(validator): add retry for ai-service-metrics Prometheus query by yuanchen8911 · Pull Request #393 · NVIDIA/aicr

yuanchen8911 · 2026-03-13T00:36:14Z

Summary

Add a retry loop to the ai-service-metrics conformance check so it waits up to 2 minutes for DCGM_FI_DEV_GPU_UTIL time series to appear in Prometheus, instead of failing immediately on the first empty result.

Motivation / Context

DCGM exporter may not have been scraped yet when the validator runs on fresh deployments, causing flaky CI failures.

Fixes: #386
Related: N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/aicr, pkg/cli)
API server (cmd/aicrd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: pkg/defaults (new timeout constants), validators/conformance

Implementation Notes

Added AIServiceMetricsWaitTimeout (2min) and AIServiceMetricsPollInterval (10s) to pkg/defaults/timeouts.go
Replaced single-shot Prometheus query with a retry loop that polls until results appear or the deadline expires
Checks ctx.Done() between retries to respect context cancellation

Testing

go build ./validators/conformance/... && go build ./pkg/defaults/...

Build verified locally.

Risk Assessment

Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — backward compatible, only adds retry behavior

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality
I updated docs if user-facing behavior changed
Changes follow existing patterns in the codebase
Commits are cryptographically signed (git commit -S) — GPG signing info

mchmarny

Review

The approach is sound — retrying for DCGM exporter metrics on fresh deployments is the right fix for flaky CI.

Issues

1. recordRawTextArtifact called on every retry iteration

The artifact recording runs inside the loop, so you'll get duplicate/overwritten artifacts on each poll. This is noisy and may confuse evidence output. Move the recordRawTextArtifact calls after the break (outside the loop), recording only the final successful response.

2. time.Now() deadline pattern vs context.WithTimeout

Using time.Now().Add() + time.Now().After() works but creates a subtle issue: the elapsed time includes the HTTP request duration. If httpGet is slow (e.g., 30s timeout), fewer retries happen than expected. A context.WithTimeout on the retry scope would be more precise and idiomatic:

retryCtx, retryCancel := context.WithTimeout(ctx.Ctx, defaults.AIServiceMetricsWaitTimeout)
defer retryCancel()

Then check retryCtx.Done() instead of time.Now().After(deadline).

3. Missing test coverage

The PR checklist shows "I added/updated tests for new functionality" unchecked. The retry logic is the core change — a unit test with a mock Prometheus that returns empty results N times then succeeds would validate the behavior and prevent regressions.

Minor

The promResp struct is declared outside the loop but only used inside — could move it inside for tighter scope, though this is cosmetic.

Summary

Issue #1 (artifact duplication) should be fixed before merge. Issue #2 is a suggested improvement. Issue #3 (tests) would strengthen confidence but isn't blocking.

mchmarny

recordRawTextArtifact called on every retry iteration
time.Now() deadline pattern vs context.WithTimeout

yuanchen8911 · 2026-03-16T16:46:29Z

Addressed the comments as follows.

#	Comment	Fix
1	`recordRawTextArtifact` called on every retry (duplicate artifacts)	Moved both `recordRawTextArtifact` calls after the retry loop, recording only the final successful response
2	`time.Now()` deadline pattern vs `context.WithTimeout`	Replaced `time.Now().Add()` / `time.Now().After()` with `context.WithTimeout(ctx.Ctx, ...)` — more precise and idiomatic
3	Missing test coverage for retry logic	Added 3 unit tests: retry→success, timeout expiry, parent-context cancellation

yuanchen8911 · 2026-03-16T16:47:39Z

Fixed additional issues.

#	Issue	Fix
1	`retryCtx` timeout during `httpGet` misclassified as "Prometheus unreachable"	After `httpGet` error, check `retryCtx.Err()` first — if set, route to timeout error instead of connectivity error
2	`retryCtx.Done()` conflates parent cancellation with retry timeout	New `metricsWaitTimeoutError()` helper checks `parentCtx.Err()` — returns `ErrCodeTimeout` for upstream cancellation, `ErrCodeNotFound` for retry exhaustion

DCGM exporter may not have scraped yet when the validator runs on fresh deployments, causing flaky CI failures. Retry the Prometheus query with a configurable timeout (2m) and poll interval (10s). - Use context.WithTimeout instead of time.Now() deadline for precision - Move recordRawTextArtifact outside retry loop (record only final result) - Distinguish retry-timeout from connectivity failure in httpGet errors - Distinguish parent-context cancellation from retry exhaustion - Add unit tests: retry→success, timeout expiry, parent cancellation Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

…IDIA#393) Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

yuanchen8911 added bug Something isn't working area/validator labels Mar 13, 2026

github-actions bot added size/M and removed area/validator labels Mar 13, 2026

yuanchen8911 mentioned this pull request Mar 13, 2026

fix(validator): ai-service-metrics flakes on metrics scrape startup race #386

Closed

yuanchen8911 marked this pull request as ready for review March 13, 2026 00:53

yuanchen8911 requested a review from a team as a code owner March 13, 2026 00:53

mchmarny reviewed Mar 13, 2026

View reviewed changes

mchmarny requested changes Mar 13, 2026

View reviewed changes

mchmarny assigned yuanchen8911 Mar 13, 2026

mchmarny added this to the M2 - KubeCon EU milestone Mar 13, 2026

yuanchen8911 force-pushed the fix/ai-service-metrics-retry branch from 86b6548 to 3991124 Compare March 16, 2026 16:49

github-actions bot added size/L and removed size/M labels Mar 16, 2026

yuanchen8911 requested a review from mchmarny March 16, 2026 16:49

mchmarny approved these changes Mar 16, 2026

View reviewed changes

yuanchen8911 force-pushed the fix/ai-service-metrics-retry branch from 3991124 to 9495255 Compare March 16, 2026 18:05

yuanchen8911 merged commit 8a65335 into NVIDIA:main Mar 16, 2026
13 checks passed

xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026

fix(validator): add retry for ai-service-metrics Prometheus query (NV…

97f8be9

…IDIA#393) Signed-off-by: Yuan Chen <yuanchen97@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(validator): add retry for ai-service-metrics Prometheus query#393

fix(validator): add retry for ai-service-metrics Prometheus query#393
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
yuanchen8911:fix/ai-service-metrics-retry

yuanchen8911 commented Mar 13, 2026

Uh oh!

mchmarny left a comment

Uh oh!

mchmarny left a comment

Uh oh!

yuanchen8911 commented Mar 16, 2026 •

edited

Loading

Uh oh!

yuanchen8911 commented Mar 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yuanchen8911 commented Mar 13, 2026

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Review

Issues

Minor

Summary

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

yuanchen8911 commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yuanchen8911 commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yuanchen8911 commented Mar 16, 2026 •

edited

Loading

yuanchen8911 commented Mar 16, 2026 •

edited

Loading