fix(validator): add retry for ai-service-metrics Prometheus query#393
Conversation
mchmarny
left a comment
There was a problem hiding this comment.
Review
The approach is sound — retrying for DCGM exporter metrics on fresh deployments is the right fix for flaky CI.
Issues
1. recordRawTextArtifact called on every retry iteration
The artifact recording runs inside the loop, so you'll get duplicate/overwritten artifacts on each poll. This is noisy and may confuse evidence output. Move the recordRawTextArtifact calls after the break (outside the loop), recording only the final successful response.
2. time.Now() deadline pattern vs context.WithTimeout
Using time.Now().Add() + time.Now().After() works but creates a subtle issue: the elapsed time includes the HTTP request duration. If httpGet is slow (e.g., 30s timeout), fewer retries happen than expected. A context.WithTimeout on the retry scope would be more precise and idiomatic:
retryCtx, retryCancel := context.WithTimeout(ctx.Ctx, defaults.AIServiceMetricsWaitTimeout)
defer retryCancel()Then check retryCtx.Done() instead of time.Now().After(deadline).
3. Missing test coverage
The PR checklist shows "I added/updated tests for new functionality" unchecked. The retry logic is the core change — a unit test with a mock Prometheus that returns empty results N times then succeeds would validate the behavior and prevent regressions.
Minor
- The
promRespstruct is declared outside the loop but only used inside — could move it inside for tighter scope, though this is cosmetic.
Summary
Issue #1 (artifact duplication) should be fixed before merge. Issue #2 is a suggested improvement. Issue #3 (tests) would strengthen confidence but isn't blocking.
mchmarny
left a comment
There was a problem hiding this comment.
- recordRawTextArtifact called on every retry iteration
- time.Now() deadline pattern vs context.WithTimeout
|
Addressed the comments as follows.
|
|
Fixed additional issues.
|
86b6548 to
3991124
Compare
DCGM exporter may not have scraped yet when the validator runs on fresh deployments, causing flaky CI failures. Retry the Prometheus query with a configurable timeout (2m) and poll interval (10s). - Use context.WithTimeout instead of time.Now() deadline for precision - Move recordRawTextArtifact outside retry loop (record only final result) - Distinguish retry-timeout from connectivity failure in httpGet errors - Distinguish parent-context cancellation from retry exhaustion - Add unit tests: retry→success, timeout expiry, parent cancellation Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
3991124 to
9495255
Compare
…IDIA#393) Signed-off-by: Yuan Chen <yuanchen97@gmail.com>
Summary
Add a retry loop to the ai-service-metrics conformance check so it waits up to 2 minutes for DCGM_FI_DEV_GPU_UTIL time series to appear in Prometheus, instead of failing immediately on the first empty result.
Motivation / Context
DCGM exporter may not have been scraped yet when the validator runs on fresh deployments, causing flaky CI failures.
Fixes: #386
Related: N/A
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)pkg/defaults(new timeout constants),validators/conformanceImplementation Notes
AIServiceMetricsWaitTimeout(2min) andAIServiceMetricsPollInterval(10s) topkg/defaults/timeouts.goctx.Done()between retries to respect context cancellationTesting
go build ./validators/conformance/... && go build ./pkg/defaults/...Build verified locally.
Risk Assessment
Rollout notes: N/A — backward compatible, only adds retry behavior
Checklist
make testwith-race)make lint)git commit -S) — GPG signing info