
feat(e2e): add scale test measurement infrastructure#484

Merged
Ronkahn21 merged 19 commits into
ai-dynamo:mainfrom
Ronkahn21:feature/scale-test-infra
Mar 18, 2026

@Ronkahn21 Ronkahn21 commented Mar 12, 2026

Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a reusable measurement framework for Grove e2e scale tests and a first
1000-pod scale test (Test_ScaleTest_1000_MoE).

Timeline measurement framework (operator/e2e/utils/measurement/):

The core abstraction is TimelineTracker — it records an ordered sequence of
named phases, each with an action and a set of milestone conditions. After all
phases complete, it returns a TrackerResult with wall-clock timestamps for
every milestone, suitable for archiving and diffing across runs.

  • Phases represent logical test stages (e.g. deploy, delete). Each phase has an ActionFn that fires once (e.g. apply YAML), followed by polling until all its milestones are reached.
  • Milestones are named checkpoints within a phase (e.g. pods-created, pods-ready, pcs-available). Each milestone carries a MilestoneCondition (Met(ctx) (bool, error)) that is polled at a configurable interval. Conditions that also implement ProgressReporter emit periodic progress logs.
  • Exporters write TrackerResult to one or more destinations via a ResultExporter interface: SummaryExporter (human-readable stdout), JSONFileExporter (archived benchmark artifact), MultiExporter (fan-out).
  • OperatorMetadata (Grove image, K8s client config, controller concurrency) is passed to the tracker at construction via WithOperatorMetadata and embedded into the result by buildResult(), so result assembly stays inside the tracker instead of being scattered across the test.

Operator metadata enrichment (operator/e2e/utils/grove_config_k8s.go):

  • ReadGroveMetadata fetches the operator Deployment once and returns both the manager container image and the live operator config, so a single GET yields both data points.
  • Grove image is included in the JSON artifact and human-readable summary as the primary version identifier for correlating benchmarks across releases.

Operator config (operator/api/config/v1alpha1/decode.go):

  • DecodeOperatorConfig([]byte) is extracted into the API package and reused by both the operator CLI and the e2e tests; it needs no cluster.
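
To illustrate why a bytes-in, config-out function is easy to unit test, here is a small sketch. It is not the real DecodeOperatorConfig: the commit history says the actual function decodes through the operator's scheme codec, whereas this example substitutes encoding/json. The operatorConfig struct, its field names, and decodeConfig are hypothetical. The sample values are the ones shown in the test output further down the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// operatorConfig is a hypothetical, trimmed-down shape; the real type lives in
// operator/api/config/v1alpha1 and is not reproduced here.
type operatorConfig struct {
	ClientQPS       float32        `json:"clientQPS"`
	ClientBurst     int            `json:"clientBurst"`
	ConcurrentSyncs map[string]int `json:"concurrentSyncs"`
}

// decodeConfig turns raw bytes into a config value without touching a cluster.
// encoding/json stands in for the scheme codec the PR mentions.
func decodeConfig(data []byte) (*operatorConfig, error) {
	if len(data) == 0 {
		return nil, fmt.Errorf("operator config is empty")
	}
	var cfg operatorConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("decode operator config: %w", err)
	}
	return &cfg, nil
}

func main() {
	raw := []byte(`{"clientQPS":500,"clientBurst":1000,"concurrentSyncs":{"pcs":3,"pcsg":40,"pclq":40}}`)
	cfg, err := decodeConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("QPS=%v Burst=%d pcs=%d\n", cfg.ClientQPS, cfg.ClientBurst, cfg.ConcurrentSyncs["pcs"])
}
```

Because the function takes only a byte slice, both a CLI that reads a file and a test that reads data from the cluster can share it.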

Supporting infrastructure:

  • SharedClusterManager singleton for cluster connection and node management
  • ApplyYAMLFile, DeletePodCliqueSet utilities

Sample test output

=== Test: ScaleTest_1000_MoE (run: run-20260315-163145) ===
Grove image:      registry:5001/grove-operator:E2E_TESTS@sha256:6022bde69e95c8296fd25a99638df79b579e0e763a0c28aabc7ee704f7f2d788
Namespace:        default
PCS count:        1
K8s client:       QPS=500 Burst=1000
Max reconcile:    pcs=3 pcsg=40 pclq=40
Total test time:  191.689s
Timeline:
  Phase: deploy (started +0.000s)
    pods-created  +30.567s
    pods-ready    +83.148s
    pcs-available +107.930s
  Phase: delete (started +107.930s)
    pcs-deleted   +83.758s
--- PASS: Test_ScaleTest_1000_MoE (211.95s)

Which issue(s) this PR fixes:

Part of #483

Special notes for your reviewer:

  • operator/api/config/v1alpha1/decode.go is a new reusable function extracted from operator/cmd/cli/cli.go — no operator behavior changes
  • The scale test requires a KWOK cluster and the e2e build tag (//go:build e2e); the unit tests in measurement/ and api/config/v1alpha1/ run without a cluster

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@Ronkahn21 Ronkahn21 marked this pull request as ready for review March 15, 2026 08:26
Comment thread operator/e2e/tests/scale_test.go Outdated
Comment thread operator/e2e/tests/scale_test.go Outdated
Comment thread operator/e2e/utils/measurement/condition/pod.go Outdated
Comment thread operator/e2e/utils/measurement/measurement.go
Comment thread operator/e2e/utils/measurement/measurement.go
Comment thread operator/e2e/utils/grove_config_k8s.go Outdated
shayasoolin
shayasoolin previously approved these changes Mar 16, 2026
danbar2
danbar2 previously approved these changes Mar 17, 2026
@Ronkahn21 Ronkahn21 dismissed stale reviews from danbar2 and shayasoolin via 014aceb March 17, 2026 19:40
@Ronkahn21 Ronkahn21 force-pushed the feature/scale-test-infra branch from e851168 to 014aceb Compare March 17, 2026 19:40
Add PhaseDefinition struct with AddPhase/Run methods so phases are
defined upfront then executed together, replacing direct RunPhase calls.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add PCSDeletedCondition and PCSAvailableCondition milestone conditions.
Consolidate pod conditions to use controller-runtime client instead of
client-go kubernetes.Interface for consistency across all conditions.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tracker

- Rename ScaleTestResult → TrackerResult, drop scale-specific fields
- NewTimelineTracker now takes testName, runID, namespace, pcsCount
- Run returns (*TrackerResult, error) with complete result
- Remove Phases() method; phases accessible via TrackerResult.Phases
- Update exporter to use TrackerResult

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove obsolete ScaleTestResult, TimelineTracker, and related files
- Update documentation to reflect removal of scale test utilities

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add 5000-pod MoE scale test that measures deploy and delete
phases with milestones, exports results via summary and JSON.
Includes CR client utility for controller-runtime typed access.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… handling

- Change initial pod count in scale test to 1000 for larger cluster initialization.
- Add `client-go` scheme to CR client for improved compatibility.
…date scale test

- Add configurable logger to TimelineTracker for detailed phase and milestone progress output.
- Introduce ProgressReporter interface for milestone conditions to provide human-readable progress updates.
- Update PCSAvailableCondition, PodsCreatedCondition, and PodsReadyCondition to implement progress tracking.
- Refactor scale test to utilize
- Move `scaleTestPollInterval` and `scaleTestTimeout` definitions to setup.go for reuse and clarity.
- Remove unused `controller-runtime` import from setup.go.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Create and own crClient alongside other clients in connectToCluster(),
expose via GetCRClient() accessor, and remove duplicate creation from
prepareTestCluster().

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reads operator ConfigMap at test start and enriches TrackerResult
with K8sClientConfig (QPS/Burst) and ControllerMaxReconcile
(ConcurrentSyncs per controller). SummaryExporter prints both
sections when populated. ParseGroveConfig is unit-testable without
a cluster via scheme-codec matching operator startup.

Closes ai-dynamo#424

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…etter consistency

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Moves the scheme-codec parsing logic from a thin e2e wrapper into
a reusable DecodeOperatorConfig function in operator/api/config/v1alpha1.
cli.go and grove_config_k8s.go now delegate to it, eliminating
duplication. Unit tests live next to the function in decode_test.go.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Capture f.Close() error to avoid masking write failures
- Fix misleading log message: cluster uses 100 nodes not 1000

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…nvention

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reduce worker_nodes to 0 and introduce pcs_syncs, pcsg_syncs, and pclq_syncs values for grove profiling in scale test configuration.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the feature/scale-test-infra branch from 014aceb to 58e84a2 Compare March 18, 2026 09:08
@Ronkahn21 Ronkahn21 merged commit ddeb96d into ai-dynamo:main Mar 18, 2026
11 checks passed
danbar2 pushed a commit to danbar2/grove that referenced this pull request Mar 18, 2026
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
3 participants