feat(e2e): add scale test measurement infrastructure #484
Merged
shayasoolin reviewed Mar 15, 2026
shayasoolin previously approved these changes Mar 16, 2026
danbar2 previously approved these changes Mar 17, 2026
e851168 to 014aceb
Add PhaseDefinition struct with AddPhase/Run methods so phases are defined upfront then executed together, replacing direct RunPhase calls. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add PCSDeletedCondition and PCSAvailableCondition milestone conditions. Consolidate pod conditions to use controller-runtime client instead of client-go kubernetes.Interface for consistency across all conditions. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tracker
- Rename ScaleTestResult → TrackerResult, drop scale-specific fields
- NewTimelineTracker now takes testName, runID, namespace, pcsCount
- Run returns (*TrackerResult, error) with complete result
- Remove Phases() method; phases accessible via TrackerResult.Phases
- Update exporter to use TrackerResult
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove obsolete ScaleTestResult, TimelineTracker, and related files
- Update documentation to reflect removal of scale test utilities
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add 5000-pod MoE scale test that measures deploy and delete phases with milestones, exports results via summary and JSON. Includes CR client utility for controller-runtime typed access. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… handling
- Change initial pod count in scale test to 1000 for larger cluster initialization.
- Add `client-go` scheme to CR client for improved compatibility.
…date scale test
- Add configurable logger to TimelineTracker for detailed phase and milestone progress output.
- Introduce ProgressReporter interface for milestone conditions to provide human-readable progress updates.
- Update PCSAvailableCondition, PodsCreatedCondition, and PodsReadyCondition to implement progress tracking.
- Refactor scale test to utilize
…e diagnostics file handling
- Move `scaleTestPollInterval` and `scaleTestTimeout` definitions to setup.go for reuse and clarity.
- Remove unused `controller-runtime` import from setup.go.
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Create and own crClient alongside other clients in connectToCluster(), expose via GetCRClient() accessor, and remove duplicate creation from prepareTestCluster(). Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reads operator ConfigMap at test start and enriches TrackerResult with K8sClientConfig (QPS/Burst) and ControllerMaxReconcile (ConcurrentSyncs per controller). SummaryExporter prints both sections when populated. ParseGroveConfig is unit-testable without a cluster via scheme-codec matching operator startup. Closes ai-dynamo#424 Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…etter consistency Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Moves the scheme-codec parsing logic from a thin e2e wrapper into a reusable DecodeOperatorConfig function in operator/api/config/v1alpha1. cli.go and grove_config_k8s.go now delegate to it, eliminating duplication. Unit tests live next to the function in decode_test.go. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Capture f.Close() error to avoid masking write failures
- Fix misleading log message: cluster uses 100 nodes, not 1000
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…nvention Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reduce worker_nodes to 0 and introduce pcs_syncs, pcsg_syncs, and pclq_syncs values for grove profiling in scale test configuration. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
014aceb to 58e84a2
danbar2 approved these changes Mar 18, 2026
shayasoolin approved these changes Mar 18, 2026
danbar2 pushed a commit to danbar2/grove that referenced this pull request Mar 18, 2026
enoodle pushed a commit to enoodle/grove that referenced this pull request Mar 24, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a reusable measurement framework for Grove e2e scale tests and a first 1000-pod scale test (`Test_ScaleTest_1000_MoE`).

Timeline measurement framework (`operator/e2e/utils/measurement/`):

The core abstraction is `TimelineTracker`: it records an ordered sequence of named phases, each with an action and a set of milestone conditions. After all phases complete, it returns a `TrackerResult` with wall-clock timestamps for every milestone, suitable for archiving and diffing across runs.

- Phases (e.g. `deploy`, `delete`): each phase has an `ActionFn` that fires once (e.g. apply YAML), followed by polling until all its milestones are reached.
- Milestones (e.g. `pods-created`, `pods-ready`, `pcs-available`): each milestone has a `MilestoneCondition` interface (`Met(ctx) (bool, error)`) polled on a configurable interval. Conditions that implement `ProgressReporter` emit periodic progress logs.
- Exporters write `TrackerResult` to one or more destinations via a `ResultExporter` interface: `SummaryExporter` (human-readable stdout), `JSONFileExporter` (archived benchmark artifact), `MultiExporter` (fan-out).
- `OperatorMetadata` (grove image, K8s client config, controller concurrency) is passed to the tracker at construction via `WithOperatorMetadata` and embedded into the result by `buildResult()`, which keeps result assembly inside the tracker rather than scattered across the test.

Operator metadata enrichment (`operator/e2e/utils/grove_config_k8s.go`):

- `ReadGroveMetadata` fetches the operator Deployment once and returns both the manager container image and the live operator config: a single GET, two data points.

Operator config (`operator/api/config/v1alpha1/decode.go`):

- `DecodeOperatorConfig([]byte)` is extracted to the API package and reused by both the operator CLI and the e2e tests; no cluster needed.

Supporting infrastructure:

- `SharedClusterManager` singleton for cluster connection and node management
- `ApplyYAMLFile`, `DeletePodCliqueSet` utilities
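The phase and milestone mechanics described above (an action that fires once, then polling until every milestone condition is met) can be illustrated with a minimal, standard-library-only sketch. Only the `Met(ctx) (bool, error)` method shape comes from this description; `countdownCondition`, `milestone`, `runPhase`, and `demo` are hypothetical names invented for the illustration and are not the types in `operator/e2e/utils/measurement/`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// MilestoneCondition follows the method shape named in the PR description.
type MilestoneCondition interface {
	Met(ctx context.Context) (bool, error)
}

// countdownCondition is a hypothetical stand-in for a real condition such as
// a pods-ready check: it reports "met" after a fixed number of polls.
type countdownCondition struct{ remaining int }

func (c *countdownCondition) Met(ctx context.Context) (bool, error) {
	if c.remaining > 0 {
		c.remaining--
		return false, nil
	}
	return true, nil
}

// milestone pairs a name with its condition (hypothetical type).
type milestone struct {
	Name string
	Cond MilestoneCondition
}

// runPhase fires action once, then polls every unmet milestone on the given
// interval, recording the elapsed time since phase start when each is reached.
func runPhase(ctx context.Context, action func(context.Context) error, ms []milestone, interval time.Duration) (map[string]time.Duration, error) {
	start := time.Now()
	if err := action(ctx); err != nil {
		return nil, fmt.Errorf("action failed: %w", err)
	}
	reached := make(map[string]time.Duration, len(ms))
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for len(reached) < len(ms) {
		for _, m := range ms {
			if _, done := reached[m.Name]; done {
				continue
			}
			met, err := m.Cond.Met(ctx)
			if err != nil {
				return reached, fmt.Errorf("milestone %q: %w", m.Name, err)
			}
			if met {
				reached[m.Name] = time.Since(start)
			}
		}
		if len(reached) == len(ms) {
			break
		}
		// Wait for the next poll tick, or give up when the context expires.
		select {
		case <-ctx.Done():
			return reached, ctx.Err()
		case <-ticker.C:
		}
	}
	return reached, nil
}

// demo runs one phase with two fake milestones and reports whether the action
// fired, how many milestones were reached, and any error.
func demo() (bool, int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	ms := []milestone{
		{Name: "pods-created", Cond: &countdownCondition{remaining: 1}},
		{Name: "pods-ready", Cond: &countdownCondition{remaining: 3}},
	}
	applied := false
	action := func(context.Context) error { applied = true; return nil }
	got, err := runPhase(ctx, action, ms, 5*time.Millisecond)
	return applied, len(got), err
}

func main() {
	applied, n, err := demo()
	fmt.Println(applied, n, err) // prints: true 2 <nil>
}
```

The sketch records elapsed durations for brevity, whereas the description says the actual result carries wall-clock timestamps; a context deadline stands in for whatever timeout handling the tracker uses.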
Which issue(s) this PR fixes:
Part of #483
Special notes for your reviewer:
- `operator/api/config/v1alpha1/decode.go` adds a reusable function extracted from `operator/cmd/cli/cli.go`; no operator behavior changes
- `//go:build e2e` gates the e2e tests; unit tests in `measurement/` and `api/config/v1alpha1/` run without a cluster
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.:
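As a usage-style illustration of the exporter fan-out mentioned in the description, the following is a rough sketch under stated assumptions. The `Export(*TrackerResult) error` method signature, the trimmed `TrackerResult` fields, and the lower-case `summaryExporter`, `jsonExporter`, and `multiExporter` types are all hypothetical; the real `ResultExporter`, `SummaryExporter`, `JSONFileExporter`, and `MultiExporter` in this PR may differ. Writers are plain `io.Writer` values here so the example runs without touching the filesystem.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"strings"
)

// TrackerResult is a trimmed, hypothetical stand-in for the real result type.
type TrackerResult struct {
	TestName   string             `json:"testName"`
	RunID      string             `json:"runID"`
	Milestones map[string]float64 `json:"milestonesSeconds"`
}

// ResultExporter is named in the PR description; this method signature is an
// assumption made for the sketch.
type ResultExporter interface {
	Export(r *TrackerResult) error
}

// summaryExporter writes a one-line human-readable summary.
type summaryExporter struct{ w io.Writer }

func (e summaryExporter) Export(r *TrackerResult) error {
	_, err := fmt.Fprintf(e.w, "test=%s run=%s milestones=%d\n", r.TestName, r.RunID, len(r.Milestones))
	return err
}

// jsonExporter writes the result as a single JSON document.
type jsonExporter struct{ w io.Writer }

func (e jsonExporter) Export(r *TrackerResult) error {
	return json.NewEncoder(e.w).Encode(r)
}

// multiExporter fans one result out to every wrapped exporter and joins any
// errors, so a failure in one destination does not skip the others.
type multiExporter []ResultExporter

func (m multiExporter) Export(r *TrackerResult) error {
	var errs []error
	for _, e := range m {
		if err := e.Export(r); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...) // nil when no exporter failed
}

// exportDemo sends one example result to both exporters and returns what
// each of them wrote.
func exportDemo() (string, string, error) {
	var summary, js strings.Builder
	r := &TrackerResult{
		TestName:   "Test_ScaleTest_1000_MoE",
		RunID:      "example-run", // placeholder value, not a real run ID
		Milestones: map[string]float64{"pods-ready": 12.5},
	}
	err := multiExporter{summaryExporter{&summary}, jsonExporter{&js}}.Export(r)
	return summary.String(), js.String(), err
}

func main() {
	s, j, err := exportDemo()
	fmt.Print(s)
	fmt.Print(j)
	if err != nil {
		fmt.Println("export error:", err)
	}
}
```

Joining errors rather than returning on the first failure is one possible design choice for a fan-out exporter; it requires Go 1.20 or later for `errors.Join`.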