Skip to content

Refactor/e2e waiter package#526

Merged
oleg-kushniriov merged 6 commits into
ai-dynamo:mainfrom
oleg-kushniriov:refactor/e2e-waiter-package
Apr 19, 2026
Merged

Refactor/e2e waiter package#526
oleg-kushniriov merged 6 commits into
ai-dynamo:mainfrom
oleg-kushniriov:refactor/e2e-waiter-package

Conversation

@oleg-kushniriov

@oleg-kushniriov oleg-kushniriov commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature
/kind cleanup

What this PR does / why we need it:

Creates a dedicated e2e/waiter package that provides a single, generic Waiter[T] struct driven by user-supplied fetch functions and predicates. This replaces the scattered wait/poll
logic that was previously spread across k8s.PollForCondition, manager methods, and inline closures with inconsistent signatures.

Key changes:

  • e2e/waiter/waiter.go — Generic Waiter[T] struct with builder-pattern configuration (New[T]().WithTimeout().WithInterval().WithLogger().WithRetryOnError()), WaitFor (returns value) and
    WaitUntil (discards value) methods, typed FetchFunc[T]/Predicate[T] types, built-in predicates (IsNotZero, IsZero, AlwaysTrue), helper converters (ToFetchFunc1, ToFetchFunc2,
    FetchByName), and resource lifecycle helpers (WaitForResource, WaitForResourceDeletion)
  • e2e/k8s/pods/predicates.go — Reusable pod predicates: AllReady, AllPending, MatchPhases, ReadyCount, CountEquals, ExcludesPod, HasUnschedulableEvents
  • Migrated all ~30 call sites from k8s.PollForCondition and wait.PollUntilContextTimeout to the new waiter package with properly typed fetch functions and predicates
  • Deleted e2e/k8s/polling.go — the old PollForCondition is fully replaced
  • Removed Option functional-options pattern in favor of builder-style method chaining
  • Removed PodWaiter()/FetchPods()/WaitForCondition() from TestContext — all pod wait logic now delegates to PodManager methods

Which issue(s) this PR fixes:

Fixes #518

Special notes for your reviewer:

  • Every waiter uses the natural K8s type of what's being fetched (e.g., *v1.PodList, *PodCliqueSet, *v1.Node, *v1.Secret) rather than bool. This eliminates the unsafe closure-capture
    pattern where callers declared var result T outside PollForCondition and assigned it inside.
  • WaitForResource and WaitForResourceDeletion are free functions (not methods) because they require T comparable for IsNotZero/IsZero, which can't be added as a constraint on Waiter[T
    any] methods.
  • NotFound handling is now explicit in each fetch function rather than blanket WithRetryOnError() — this preserves the original behavior where NotFound keeps polling but other errors
    fail immediately.
  • The waiter package now imports k8s.io/apimachinery for GetFunc[T] and errors.IsNotFound.

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

See e2e/waiter/waiter.go for the full API. Usage example:

  // Wait for a resource to exist
  w := waiter.New[*v1.Node]().
             WithTimeout(timeout).
             WithInterval(interval).
             WithLogger(logger)                                                                                               
  return waiter.WaitForResource(ctx, w, nodeName, clientset.CoreV1().Nodes().Get)                                                                                                          
                                                                                                                                                                                           
  // Wait with custom predicate                                                                                                                                                            
  w := waiter.New[*v1.PodList]().
             WithTimeout(timeout).
             WithInterval(interval)                                                                                                               
  pods, err := w.WaitFor(ctx, pm.FetchFunc(ctx, ns, selector), pods.AllReady(expectedCount))       
  
    // Wait with custom predicate                                                                                                                                                            
  w := waiter.New[*v1.PodList]().
             WithTimeout(timeout).
             WithInterval(interval)                                                                                                               
  err := w.WaitUntil(ctx, pm.FetchFunc(ctx, ns, selector), pods.AllReady(expectedCount))                                                                                                  
                                                                                                                                                                                           
  // Wait for deletion                                                                                                                                                                     
  w := waiter.New[*PodCliqueSet]().
             WithTimeout(timeout).
             WithInterval(interval)                                                                                                             
  return waiter.WaitForResourceDeletion(ctx, w, name, client.PodCliqueSets(ns).Get)

@copy-pr-bot

copy-pr-bot Bot commented Apr 14, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@oleg-kushniriov oleg-kushniriov self-assigned this Apr 14, 2026
@oleg-kushniriov oleg-kushniriov marked this pull request as ready for review April 15, 2026 10:05
@oleg-kushniriov oleg-kushniriov force-pushed the refactor/e2e-waiter-package branch 3 times, most recently from 7f74579 to ecf5b0c Compare April 16, 2026 10:01
Comment thread operator/e2e/waiter/waiter.go Outdated
Comment thread operator/e2e/waiter/waiter.go
danbar2
danbar2 previously approved these changes Apr 16, 2026

@danbar2 danbar2 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work !

@oleg-kushniriov oleg-kushniriov force-pushed the refactor/e2e-waiter-package branch from e119051 to 81b6ede Compare April 19, 2026 07:06
Comment thread operator/e2e/grove/workload/workload.go Outdated
Comment thread operator/e2e/grove/workload/workload.go Outdated
Comment thread operator/e2e/grove/workload/workload.go Outdated
Comment thread operator/e2e/k8s/nodes/nodes.go Outdated
Comment thread operator/e2e/tests/auto-mnnvl/unsupported_but_enabled_test.go Outdated
Comment thread operator/e2e/tests/auto-mnnvl/unsupported_but_enabled_test.go Outdated
Comment thread operator/e2e/k8s/nodes/nodes.go
@oleg-kushniriov oleg-kushniriov merged commit db656cc into ai-dynamo:main Apr 19, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Create waiter package with predicate-based generic API for e2e polling

3 participants