Skip to content

fix(validator): add retry for ai-service-metrics Prometheus query#393

Merged
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
yuanchen8911:fix/ai-service-metrics-retry
Mar 16, 2026
Merged

fix(validator): add retry for ai-service-metrics Prometheus query#393
yuanchen8911 merged 1 commit intoNVIDIA:mainfrom
yuanchen8911:fix/ai-service-metrics-retry

Conversation

@yuanchen8911
Copy link
Copy Markdown
Contributor

Summary

Add a retry loop to the ai-service-metrics conformance check so it waits up to 2 minutes for DCGM_FI_DEV_GPU_UTIL time series to appear in Prometheus, instead of failing immediately on the first empty result.

Motivation / Context

DCGM exporter may not have been scraped yet when the validator runs on fresh deployments, causing flaky CI failures.

Fixes: #386
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: pkg/defaults (new timeout constants), validators/conformance

Implementation Notes

  • Added AIServiceMetricsWaitTimeout (2min) and AIServiceMetricsPollInterval (10s) to pkg/defaults/timeouts.go
  • Replaced single-shot Prometheus query with a retry loop that polls until results appear or the deadline expires
  • Checks ctx.Done() between retries to respect context cancellation

Testing

go build ./validators/conformance/... && go build ./pkg/defaults/...

Build verified locally.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert

Rollout notes: N/A — backward compatible, only adds retry behavior

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@yuanchen8911 yuanchen8911 added bug Something isn't working area/validator labels Mar 13, 2026
@yuanchen8911 yuanchen8911 marked this pull request as ready for review March 13, 2026 00:53
@yuanchen8911 yuanchen8911 requested a review from a team as a code owner March 13, 2026 00:53
Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The approach is sound — retrying for DCGM exporter metrics on fresh deployments is the right fix for flaky CI.

Issues

1. recordRawTextArtifact called on every retry iteration

The artifact recording runs inside the loop, so you'll get duplicate/overwritten artifacts on each poll. This is noisy and may confuse evidence output. Move the recordRawTextArtifact calls after the break (outside the loop), recording only the final successful response.

2. time.Now() deadline pattern vs context.WithTimeout

Using time.Now().Add() + time.Now().After() works but creates a subtle issue: the elapsed time includes the HTTP request duration. If httpGet is slow (e.g., 30s timeout), fewer retries happen than expected. A context.WithTimeout on the retry scope would be more precise and idiomatic:

retryCtx, retryCancel := context.WithTimeout(ctx.Ctx, defaults.AIServiceMetricsWaitTimeout)
defer retryCancel()

Then check retryCtx.Done() instead of time.Now().After(deadline).

3. Missing test coverage

The PR checklist shows "I added/updated tests for new functionality" unchecked. The retry logic is the core change — a unit test with a mock Prometheus that returns empty results N times then succeeds would validate the behavior and prevent regressions.

Minor

  • The promResp struct is declared outside the loop but only used inside — could move it inside for tighter scope, though this is cosmetic.

Summary

Issue #1 (artifact duplication) should be fixed before merge. Issue #2 is a suggested improvement. Issue #3 (tests) would strengthen confidence but isn't blocking.

Copy link
Copy Markdown
Member

@mchmarny mchmarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • recordRawTextArtifact called on every retry iteration
  • time.Now() deadline pattern vs context.WithTimeout

@yuanchen8911
Copy link
Copy Markdown
Contributor Author

yuanchen8911 commented Mar 16, 2026

Addressed the comments as follows.

# Comment Fix
1 recordRawTextArtifact called on every retry (duplicate artifacts) Moved both recordRawTextArtifact calls after the retry loop, recording only the final successful response
2 time.Now() deadline pattern vs context.WithTimeout Replaced time.Now().Add() / time.Now().After() with context.WithTimeout(ctx.Ctx, ...) — more precise and idiomatic
3 Missing test coverage for retry logic Added 3 unit tests: retry→success, timeout expiry, parent-context cancellation

@yuanchen8911
Copy link
Copy Markdown
Contributor Author

yuanchen8911 commented Mar 16, 2026

Fixed additional issues.

# Issue Fix
1 retryCtx timeout during httpGet misclassified as "Prometheus unreachable" After httpGet error, check retryCtx.Err() first — if set, route to timeout error instead of connectivity error
2 retryCtx.Done() conflates parent cancellation with retry timeout New metricsWaitTimeoutError() helper checks parentCtx.Err() — returns ErrCodeTimeout for upstream cancellation, ErrCodeNotFound for retry exhaustion

@yuanchen8911 yuanchen8911 force-pushed the fix/ai-service-metrics-retry branch from 86b6548 to 3991124 Compare March 16, 2026 16:49
@github-actions github-actions bot added size/L and removed size/M labels Mar 16, 2026
@yuanchen8911 yuanchen8911 requested a review from mchmarny March 16, 2026 16:49
DCGM exporter may not have scraped yet when the validator runs on fresh
deployments, causing flaky CI failures. Retry the Prometheus query with
a configurable timeout (2m) and poll interval (10s).

- Use context.WithTimeout instead of time.Now() deadline for precision
- Move recordRawTextArtifact outside retry loop (record only final result)
- Distinguish retry-timeout from connectivity failure in httpGet errors
- Distinguish parent-context cancellation from retry exhaustion
- Add unit tests: retry→success, timeout expiry, parent cancellation

Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
@yuanchen8911 yuanchen8911 force-pushed the fix/ai-service-metrics-retry branch from 3991124 to 9495255 Compare March 16, 2026 18:05
@yuanchen8911 yuanchen8911 merged commit 8a65335 into NVIDIA:main Mar 16, 2026
13 checks passed
xdu31 pushed a commit to xdu31/aicr that referenced this pull request Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(validator): ai-service-metrics flakes on metrics scrape startup race

2 participants