feat(e2e): scale test ergonomics with pprof and timeline tracking #528
Merged
danbar2 merged 29 commits on Apr 16, 2026
Add PhaseDefinition struct with AddPhase/Run methods so phases are defined upfront then executed together, replacing direct RunPhase calls. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
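The "define phases upfront, then run them together" pattern from this commit can be sketched as follows. This is a minimal illustration, not the PR's code: the field names on `PhaseDefinition`, the `PhaseResult` type, and the `Tracker` type are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// PhaseDefinition is a hypothetical shape: a name plus the work to do.
type PhaseDefinition struct {
	Name string
	Run  func(ctx context.Context) error
}

// PhaseResult is a hypothetical record of one executed phase.
type PhaseResult struct {
	Name     string
	Duration time.Duration
}

// Tracker collects phase definitions first and executes them later.
type Tracker struct {
	phases []PhaseDefinition
}

// AddPhase registers a phase without running it.
func (t *Tracker) AddPhase(p PhaseDefinition) {
	t.phases = append(t.phases, p)
}

// Run executes the registered phases in order and stops at the first error.
func (t *Tracker) Run(ctx context.Context) ([]PhaseResult, error) {
	results := make([]PhaseResult, 0, len(t.phases))
	for _, p := range t.phases {
		start := time.Now()
		if err := p.Run(ctx); err != nil {
			return results, fmt.Errorf("phase %q: %w", p.Name, err)
		}
		results = append(results, PhaseResult{Name: p.Name, Duration: time.Since(start)})
	}
	return results, nil
}

// runDemo registers two no-op phases and returns the names in execution order.
func runDemo() ([]string, error) {
	t := &Tracker{}
	noop := func(context.Context) error { return nil }
	t.AddPhase(PhaseDefinition{Name: "deploy", Run: noop})
	t.AddPhase(PhaseDefinition{Name: "delete", Run: noop})
	results, err := t.Run(context.Background())
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(results))
	for _, r := range results {
		names = append(names, r.Name)
	}
	return names, nil
}

func main() {
	names, err := runDemo()
	fmt.Println(names, err)
}
```

Declaring phases before running them lets a tracker know the full plan in advance, which is one plausible reason to prefer it over direct per-phase calls.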
Add PCSDeletedCondition and PCSAvailableCondition milestone conditions. Consolidate pod conditions to use controller-runtime client instead of client-go kubernetes.Interface for consistency across all conditions. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tracker - Rename ScaleTestResult → TrackerResult, drop scale-specific fields - NewTimelineTracker now takes testName, runID, namespace, pcsCount - Run returns (*TrackerResult, error) with complete result - Remove Phases() method; phases accessible via TrackerResult.Phases - Update exporter to use TrackerResult Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove obsolete ScaleTestResult, TimelineTracker, and related files - Update documentation to reflect removal of scale test utilities Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add 5000-pod MoE scale test that measures deploy and delete phases with milestones, exports results via summary and JSON. Includes CR client utility for controller-runtime typed access. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… handling - Change initial pod count in scale test to 1000 for larger cluster initialization. - Add `client-go` scheme to CR client for improved compatibility.
…date scale test - Add configurable logger to TimelineTracker for detailed phase and milestone progress output. - Introduce ProgressReporter interface for milestone conditions to provide human-readable progress updates. - Update PCSAvailableCondition, PodsCreatedCondition, and PodsReadyCondition to implement progress tracking. - Refactor scale test to utilize
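An optional progress interface of the kind this commit describes is commonly wired up with a type assertion. The sketch below assumes hypothetical method names (`Met`, `Progress`) and a hypothetical `podsReady` condition; the PR's actual signatures are not shown in the commit message.

```go
package main

import "fmt"

// Condition is a hypothetical milestone condition.
type Condition interface {
	Met() (bool, error)
}

// ProgressReporter is an optional interface a condition may also implement.
// The method name is an assumption for this sketch.
type ProgressReporter interface {
	Progress() string
}

// podsReady is an illustrative condition that reports progress.
type podsReady struct {
	ready, want int
}

func (p podsReady) Met() (bool, error) { return p.ready >= p.want, nil }

func (p podsReady) Progress() string {
	return fmt.Sprintf("%d/%d pods ready", p.ready, p.want)
}

// alwaysMet is an illustrative condition without progress reporting.
type alwaysMet struct{}

func (alwaysMet) Met() (bool, error) { return true, nil }

// describe returns a progress string when the condition provides one.
func describe(c Condition) string {
	if pr, ok := c.(ProgressReporter); ok {
		return pr.Progress()
	}
	return "no progress details"
}

func main() {
	fmt.Println(describe(podsReady{ready: 3, want: 5}))
	fmt.Println(describe(alwaysMet{}))
}
```

Keeping progress reporting optional means existing conditions keep working unchanged, while conditions that can describe their state get richer log output.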
…e diagnostics file handling
- Move `scaleTestPollInterval` and `scaleTestTimeout` definitions to setup.go for reuse and clarity. - Remove unused `controller-runtime` import from setup.go. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Create and own crClient alongside other clients in connectToCluster(), expose via GetCRClient() accessor, and remove duplicate creation from prepareTestCluster(). Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reads operator ConfigMap at test start and enriches TrackerResult with K8sClientConfig (QPS/Burst) and ControllerMaxReconcile (ConcurrentSyncs per controller). SummaryExporter prints both sections when populated. ParseGroveConfig is unit-testable without a cluster via scheme-codec matching operator startup. Closes ai-dynamo#424 Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Moves the scheme-codec parsing logic from a thin e2e wrapper into a reusable DecodeOperatorConfig function in operator/api/config/v1alpha1. cli.go and grove_config_k8s.go now delegate to it, eliminating duplication. Unit tests live next to the function in decode_test.go. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Capture f.Close() error to avoid masking write failures - Fix misleading log message: cluster uses 100 nodes not 1000 Signed-off-by: Ron Kahn <rkahn@nvidia.com>
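The `f.Close()` fix refers to a common Go pitfall: a bare `defer f.Close()` discards the close error, which can hide a failed flush. A minimal sketch of the usual idiom follows; the function names `writeFile` and `roundTrip` and the file name are made up for the example and are not taken from the PR.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeFile writes data to path and reports a Close error when the write
// itself succeeded, so a failed flush is not silently dropped.
func writeFile(path string, data []byte) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		// Keep the first error: a write error takes priority over a close error.
		if cerr := f.Close(); err == nil {
			err = cerr
		}
	}()
	_, err = f.Write(data)
	return err
}

// roundTrip writes s to a temporary file and reads it back.
func roundTrip(s string) (string, error) {
	dir, err := os.MkdirTemp("", "close-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "result.json")
	if err := writeFile(path, []byte(s)); err != nil {
		return "", err
	}
	got, err := os.ReadFile(path)
	return string(got), err
}

func main() {
	got, err := roundTrip(`{"ok":true}`)
	fmt.Println(got, err)
}
```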
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Adopt upstream API changes lost during rebase: ReadGroveMetadata with GroveImage support, OperatorMetadata type, per-phase timeouts, and simplified pod condition progress. Preserves branch pprof hooks. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Introduce goroutine and mutex profiles to pprof downloader. - Optimize polling interval for scale test (100ms). - Replace REST-style pprof query with gRPC-based Connect RPC. - Convert Pyroscope JSON profiles to gzip-compressed binary pprof format. - Adjust default Pyroscope namespace to "pyroscope" from "monitoring".
…and optimize pod condition checks - Refine scaleTestPollInterval definition comment for clarity. - Replace scheme-based decoding with yaml.Unmarshal in DecodeOperatorConfig. - Introduce parsedSelector for caching label selector parsing, reducing redundant parsing in pod condition checks.
…ecoding - Replace `yaml.Unmarshal` with `runtime.NewScheme` and `serializer.UniversalDecoder` for consistency with Kubernetes API conventions. - Add scheme initialization via `AddToScheme` to support defaulting and validation. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Pass parent context to async after-phase hooks so they aren't cancelled when the per-phase timeout fires - Add tracker.Wait() after Run() to drain async hooks before exit - Fix KWOK toleration key in scale-test-1000-moe.yaml - Add nil guard on err.Error() in exporter test Signed-off-by: Ron Kahn <rkahn@nvidia.com>
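The first two items describe a context-lifetime bug and its fix. As a rough sketch under assumed names (`tracker`, `runPhase`, and the hook signature are hypothetical), an async after-phase hook receives the parent context rather than the per-phase timeout context, and `Wait()` drains outstanding hooks:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// tracker is a hypothetical stand-in for the PR's timeline tracker.
type tracker struct {
	wg sync.WaitGroup
}

// runPhase runs fn under a per-phase timeout, then starts hook in the
// background with the parent context so the hook is not cancelled when the
// phase context is cancelled or times out.
func (t *tracker) runPhase(
	parent context.Context,
	timeout time.Duration,
	fn func(context.Context) error,
	hook func(context.Context),
) error {
	phaseCtx, cancel := context.WithTimeout(parent, timeout)
	defer cancel() // cancels phaseCtx only, never the parent

	err := fn(phaseCtx)

	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		hook(parent)
	}()
	return err
}

// Wait blocks until all async hooks have finished.
func (t *tracker) Wait() { t.wg.Wait() }

// demo reports whether the hook's context was still live after the phase
// context had been cancelled.
func demo() bool {
	t := &tracker{}
	alive := false
	_ = t.runPhase(context.Background(), 10*time.Millisecond,
		func(context.Context) error { return nil },
		func(ctx context.Context) {
			time.Sleep(30 * time.Millisecond) // outlive the phase timeout
			alive = ctx.Err() == nil
		})
	t.Wait() // also orders the write to alive before the read below
	return alive
}

func main() {
	fmt.Println("hook context still live:", demo())
}
```

Without the `Wait()` call, the process could exit while a hook is still running, which matches the motivation given for adding `tracker.Wait()` after `Run()`.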
- Remove `pprof/convert.go` as Pyroscope now serves binary protobuf directly. - Eliminate JSON-to-gzip pprof conversion logic from the downloader. - Refactor downloader to gzip raw protobuf data from Pyroscope. - Remove unused mutex profile type, query prefix, and related test cases. - Simplify select-merge request by switching to binary protobuf encoding. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
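Compressing raw profile bytes with gzip needs only the standard library. The sketch below is generic: `gzipBytes` and `gunzipBytes` are illustrative helper names, and the payload is a placeholder rather than real data returned by Pyroscope.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// gzipBytes compresses raw and returns the gzip stream.
func gzipBytes(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes remaining data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gunzipBytes decompresses a gzip stream, used here to verify the round trip.
func gunzipBytes(compressed []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	raw := []byte("placeholder profile bytes")
	compressed, err := gzipBytes(raw)
	if err != nil {
		fmt.Println("compress:", err)
		return
	}
	back, err := gunzipBytes(compressed)
	fmt.Println(bytes.Equal(raw, back), err)
}
```

Checking the `Close` error on the gzip writer matters because the footer, including the checksum, is only written at that point.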
…es.yaml Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…scale test suite - Remove `scale_test.go` in favor of the new `tests/scale` package. - Add `TestMain` for shared cluster setup in scale tests. - Replace `prepareTestCluster` calls with `PrepareTestCluster` to align with refactored clients. - Delete unused constants and helper functions like `toOperatorMetadata`. - Expand profiling with pprof improvements, including data size logging. - Add configurable profiling ports to deployment templates (pprofBindPort). Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Scale test moved to its own package but lost access to symbols from the tests package and pprof hook wiring during rebase conflict resolution. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Merge pprof_hook.go into main_test.go (single test infra file) - Delete unused cr_client.go and grove_config_k8s.go (duplicate utils) - Extract magic strings/ints to named constants - Fix parsedSelector hot-path mutation in condition/pod.go - Add MkdirAll for output directory, use test-named subdirectory Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Output directory is now <testName>/<runID>/ so each run is isolated. JSON results and pprof profiles land in the same directory. Also fixes lint issues: exported const comment, unused request params, and parsedSelector hot-path optimization. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
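The `<testName>/<runID>/` layout can be sketched with `filepath.Join` and `os.MkdirAll`. The function name `ensureRunDir`, the base directory, and the sample test and run identifiers are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureRunDir creates base/<testName>/<runID>/ if needed and returns its path.
// MkdirAll creates missing parents and is a no-op when the directory exists.
func ensureRunDir(base, testName, runID string) (string, error) {
	dir := filepath.Join(base, testName, runID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	return dir, nil
}

// demo creates a run directory under a temporary base and checks it exists.
func demo() (bool, error) {
	base, err := os.MkdirTemp("", "scale-out")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(base)

	dir, err := ensureRunDir(base, "example-test", "run-001")
	if err != nil {
		return false, err
	}
	info, err := os.Stat(dir)
	if err != nil {
		return false, err
	}
	return info.IsDir(), nil
}

func main() {
	ok, err := demo()
	fmt.Println(ok, err)
}
```

Because each run gets its own leaf directory, JSON results and profiles from different runs of the same test cannot overwrite each other.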
danbar2
approved these changes
Apr 15, 2026
shayasoolin
reviewed
Apr 15, 2026
shayasoolin
approved these changes
Apr 15, 2026
What type of PR is this?
/kind feature
What this PR does / why we need it:
Overhauls the e2e scale test infrastructure to improve observability and developer ergonomics:
- `Wait()` for graceful completion
- `tests/scale/` package with self-contained setup, constants, and test-named output directories
- `scale-cluster-up`, `scale-cluster-down`, `run-scale-test` for streamlined scale test workflow
Which issue(s) this PR fixes:
Fixes #483
Special notes for your reviewer:
- `GROVE_E2E_PYROSCOPE_DISABLED=true` to explicitly disable profiling
Does this PR introduce an API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: