feat(e2e): scale test ergonomics with pprof and timeline tracking #528

Merged
danbar2 merged 29 commits into ai-dynamo:main from Ronkahn21:feature/scale-test-ergonomics
Apr 16, 2026
Ronkahn21 commented Apr 14, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Overhauls the e2e scale test infrastructure to improve observability and developer ergonomics:

  • TimelineTracker enhancements: Phase-level logging, progress reporting, async after-phase hooks, and Wait() for graceful completion
  • Pprof profiling: Automatic CPU/memory/goroutine profile collection from Pyroscope after each test phase via port-forwarding
  • Scale test package: Dedicated tests/scale/ package with self-contained setup, constants, and test-named output directories
  • Makefile targets: scale-cluster-up, scale-cluster-down, run-scale-test for streamlined scale test workflow
  • Pod condition optimization: Cached label selector parsing to avoid re-parsing on every 100ms poll
  • Helm chart: Expose pprof port when profiling is enabled
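
The tracker behaviour listed above (phases defined upfront, async after-phase hooks, and Wait() for graceful completion) could look roughly like the sketch below. The `tracker` and `phase` types and their fields are hypothetical stand-ins written for illustration, not the PR's actual TimelineTracker API; only the overall shape (per-phase timeout, hooks started in the background with the parent context, a final Wait) follows the description.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// phase is a hypothetical stand-in for a phase definition.
type phase struct {
	name    string
	timeout time.Duration
	run     func(ctx context.Context) error
}

// tracker is a hypothetical, simplified stand-in for TimelineTracker.
type tracker struct {
	phases []phase
	after  func(ctx context.Context, phaseName string) // async after-phase hook
	wg     sync.WaitGroup
}

// AddPhase registers a phase; nothing runs until Run is called.
func (t *tracker) AddPhase(p phase) { t.phases = append(t.phases, p) }

// Run executes the phases in order. Each phase gets its own timeout context,
// while the after-phase hook receives the parent context so that it is not
// cancelled when the phase context is released.
func (t *tracker) Run(parent context.Context) error {
	for _, p := range t.phases {
		phaseCtx, cancel := context.WithTimeout(parent, p.timeout)
		start := time.Now()
		err := p.run(phaseCtx)
		cancel()
		if err != nil {
			return fmt.Errorf("phase %q: %w", p.name, err)
		}
		fmt.Printf("phase %s finished in %s\n", p.name, time.Since(start))
		if t.after != nil {
			t.wg.Add(1)
			go func(name string) {
				defer t.wg.Done()
				t.after(parent, name)
			}(p.name)
		}
	}
	return nil
}

// Wait blocks until every async hook has returned.
func (t *tracker) Wait() { t.wg.Wait() }

// runDemo runs two no-op phases and returns the phase names seen by the hook.
func runDemo() []string {
	var mu sync.Mutex
	var collected []string
	t := &tracker{after: func(ctx context.Context, name string) {
		if ctx.Err() != nil {
			return // parent already cancelled; skip collection
		}
		mu.Lock()
		collected = append(collected, name)
		mu.Unlock()
	}}
	noop := func(ctx context.Context) error { return ctx.Err() }
	t.AddPhase(phase{name: "deploy", timeout: time.Second, run: noop})
	t.AddPhase(phase{name: "delete", timeout: time.Second, run: noop})
	if err := t.Run(context.Background()); err != nil {
		panic(err)
	}
	t.Wait() // drain hooks before reading their results
	return collected
}

func main() {
	fmt.Println("hooks completed:", len(runDemo()))
}
```

Because the hooks run concurrently, the order of the collected names is not guaranteed; only the count is stable once Wait() has returned.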

Which issue(s) this PR fixes:

Fixes #483

Special notes for your reviewer:

  • The scale test was validated against a 100-node KWOK cluster with 1000 pods (completes in ~110s)
  • Pprof collection is best-effort: it degrades gracefully when Pyroscope is not installed
  • Set GROVE_E2E_PYROSCOPE_DISABLED=true to explicitly disable profiling

Does this PR introduce an API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

New Makefile targets:
- `make scale-cluster-up` — create k3d cluster with KWOK nodes + Pyroscope
- `make scale-cluster-down` — tear down scale test cluster
- `make run-scale-test` — run scale tests against existing cluster

Add PhaseDefinition struct with AddPhase/Run methods so phases are
defined upfront then executed together, replacing direct RunPhase calls.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add PCSDeletedCondition and PCSAvailableCondition milestone conditions.
Consolidate pod conditions to use controller-runtime client instead of
client-go kubernetes.Interface for consistency across all conditions.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tracker

- Rename ScaleTestResult → TrackerResult, drop scale-specific fields
- NewTimelineTracker now takes testName, runID, namespace, pcsCount
- Run returns (*TrackerResult, error) with complete result
- Remove Phases() method; phases accessible via TrackerResult.Phases
- Update exporter to use TrackerResult

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove obsolete ScaleTestResult, TimelineTracker, and related files
- Update documentation to reflect removal of scale test utilities

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Add 5000-pod MoE scale test that measures deploy and delete
phases with milestones, exports results via summary and JSON.
Includes CR client utility for controller-runtime typed access.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… handling

- Change initial pod count in scale test to 1000 for larger cluster initialization.
- Add `client-go` scheme to CR client for improved compatibility.
…date scale test

- Add configurable logger to TimelineTracker for detailed phase and milestone progress output.
- Introduce ProgressReporter interface for milestone conditions to provide human-readable progress updates.
- Update PCSAvailableCondition, PodsCreatedCondition, and PodsReadyCondition to implement progress tracking.
- Refactor scale test to utilize
- Move `scaleTestPollInterval` and `scaleTestTimeout` definitions to setup.go for reuse and clarity.
- Remove unused `controller-runtime` import from setup.go.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Create and own crClient alongside other clients in connectToCluster(),
expose via GetCRClient() accessor, and remove duplicate creation from
prepareTestCluster().

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Reads operator ConfigMap at test start and enriches TrackerResult
with K8sClientConfig (QPS/Burst) and ControllerMaxReconcile
(ConcurrentSyncs per controller). SummaryExporter prints both
sections when populated. ParseGroveConfig uses the same scheme codec as
operator startup, so it is unit-testable without a cluster.

Closes ai-dynamo#424

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Moves the scheme-codec parsing logic from a thin e2e wrapper into
a reusable DecodeOperatorConfig function in operator/api/config/v1alpha1.
cli.go and grove_config_k8s.go now delegate to it, eliminating
duplication. Unit tests live next to the function in decode_test.go.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Capture f.Close() error to avoid masking write failures
- Fix misleading log message: cluster uses 100 nodes not 1000

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Adopt upstream API changes lost during rebase: ReadGroveMetadata with
GroveImage support, OperatorMetadata type, per-phase timeouts, and
simplified pod condition progress. Preserves branch pprof hooks.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Introduce goroutine and mutex profiles to pprof downloader.
- Optimize polling interval for scale test (100ms).
- Replace REST-style pprof query with gRPC-based Connect RPC.
- Convert Pyroscope JSON profiles to gzip-compressed binary pprof format.
- Adjust default Pyroscope namespace to "pyroscope" from "monitoring".
…and optimize pod condition checks

- Refine scaleTestPollInterval definition comment for clarity.
- Replace scheme-based decoding with yaml.Unmarshal in DecodeOperatorConfig.
- Introduce parsedSelector for caching label selector parsing, reducing redundant parsing in pod condition checks.
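
The selector-caching idea in the last item can be illustrated with a small sketch: parse the selector string once and reuse the result on every poll tick. The `cachedSelector` type and the simplified `parseSelector` function below are hypothetical stand-ins (the stand-in parser only understands comma-separated `key=value` terms); the PR's parsedSelector and the real Kubernetes label selector parser are not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// parseCalls counts how often the stand-in parser runs.
var parseCalls int

// parseSelector is a stand-in for a real label selector parser. It only
// understands "key=value" terms joined by commas.
func parseSelector(s string) (map[string]string, error) {
	parseCalls++
	out := map[string]string{}
	for _, part := range strings.Split(s, ",") {
		k, v, ok := strings.Cut(part, "=")
		if !ok {
			return nil, fmt.Errorf("invalid selector term %q", part)
		}
		out[strings.TrimSpace(k)] = strings.TrimSpace(v)
	}
	return out, nil
}

// cachedSelector parses its raw string at most once and reuses the result.
type cachedSelector struct {
	raw    string
	once   sync.Once
	parsed map[string]string
	err    error
}

// get returns the parsed selector, parsing on first use only.
func (c *cachedSelector) get() (map[string]string, error) {
	c.once.Do(func() { c.parsed, c.err = parseSelector(c.raw) })
	return c.parsed, c.err
}

// matches reports whether podLabels satisfies every term of the selector.
// It only reads the cached map, so the hot path does not mutate shared state.
func (c *cachedSelector) matches(podLabels map[string]string) (bool, error) {
	sel, err := c.get()
	if err != nil {
		return false, err
	}
	for k, v := range sel {
		if podLabels[k] != v {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	sel := &cachedSelector{raw: "app=grove,role=worker"}
	pod := map[string]string{"app": "grove", "role": "worker"}
	for i := 0; i < 10; i++ { // simulate ten poll ticks
		if ok, err := sel.matches(pod); err != nil || !ok {
			panic("expected the pod to match")
		}
	}
	fmt.Println("parse calls:", parseCalls)
}
```

With a short poll interval, this keeps the per-tick cost down to a map comparison rather than a string parse.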
…ecoding

- Replace `yaml.Unmarshal` with `runtime.NewScheme` and `serializer.UniversalDecoder` for consistency with Kubernetes API conventions.
- Add scheme initialization via `AddToScheme` to support defaulting and validation.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Pass parent context to async after-phase hooks so they aren't
  cancelled when the per-phase timeout fires
- Add tracker.Wait() after Run() to drain async hooks before exit
- Fix KWOK toleration key in scale-test-1000-moe.yaml
- Add nil guard on err.Error() in exporter test

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove `pprof/convert.go` as Pyroscope now serves binary protobuf directly.
- Eliminate JSON-to-gzip pprof conversion logic from the downloader.
- Refactor downloader to gzip raw protobuf data from Pyroscope.
- Remove unused mutex profile type, query prefix, and related test cases.
- Simplify select-merge request by switching to binary protobuf encoding.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…es.yaml

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…scale test suite

- Remove `scale_test.go` in favor of the new `tests/scale` package.
- Add `TestMain` for shared cluster setup in scale tests.
- Replace `prepareTestCluster` calls with `PrepareTestCluster` to align with refactored clients.
- Delete unused constants and helper functions like `toOperatorMetadata`.
- Expand profiling with pprof improvements, including data size logging.
- Add configurable profiling ports to deployment templates (pprofBindPort).

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Scale test moved to its own package but lost access to symbols from
the tests package and pprof hook wiring during rebase conflict resolution.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Merge pprof_hook.go into main_test.go (single test infra file)
- Delete unused cr_client.go and grove_config_k8s.go (duplicate utils)
- Extract magic strings/ints to named constants
- Fix parsedSelector hot-path mutation in condition/pod.go
- Add MkdirAll for output directory, use test-named subdirectory

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Ronkahn21 marked this pull request as ready for review April 14, 2026 10:14
Output directory is now <testName>/<runID>/ so each run is isolated.
JSON results and pprof profiles land in the same directory. Also fixes
lint issues: exported const comment, unused request params, and
parsedSelector hot-path optimization.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/Makefile
danbar2 merged commit ded9a63 into ai-dynamo:main Apr 16, 2026
14 checks passed
Development

Successfully merging this pull request may close these issues.

feat(e2e): add scale test measurement infrastructure

3 participants