What would you like to be added?
Add comprehensive scale testing infrastructure for the Grove operator, including:
- KWOK/Kubemark cluster setup: Automated deployment of simulated node clusters (1000+ nodes)
- Observability integration: Prometheus metrics collection, pprof profiling, optional Pyroscope continuous profiling
- Performance baseline tracking: Store and compare metrics across versions with regression detection
- CI integration: Automated nightly scale tests with benchstat comparison
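To make the regression-detection item concrete, here is a minimal Go sketch of a threshold check between a stored baseline and a fresh run. Everything in it is an assumption for illustration: the `Regression` type, the `FindRegressions` helper, the metric names, and the 10% threshold are hypothetical and not part of Grove. It also assumes every metric is "lower is better" (latency, memory); throughput would need the comparison inverted. It is a plain relative-change check, not the statistical comparison that benchstat performs.

```go
package main

import (
	"fmt"
	"sort"
)

// Regression describes one metric that got worse than the allowed threshold.
// Hypothetical type, used only for this sketch.
type Regression struct {
	Metric   string
	Baseline float64
	Current  float64
	Change   float64 // relative change; 0.25 means 25% worse than baseline
}

// String renders a regression as a one-line summary for CI logs.
func (r Regression) String() string {
	return fmt.Sprintf("%s: %.2f -> %.2f (+%.1f%%)",
		r.Metric, r.Baseline, r.Current, r.Change*100)
}

// FindRegressions compares current against baseline and returns every metric
// whose relative increase exceeds threshold. It assumes lower values are
// better. Metrics missing from current, or with a non-positive baseline,
// are skipped. Results are sorted by metric name for stable output.
func FindRegressions(baseline, current map[string]float64, threshold float64) []Regression {
	var out []Regression
	for name, base := range baseline {
		cur, ok := current[name]
		if !ok || base <= 0 {
			continue
		}
		change := (cur - base) / base
		if change > threshold {
			out = append(out, Regression{
				Metric:   name,
				Baseline: base,
				Current:  cur,
				Change:   change,
			})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Metric < out[j].Metric })
	return out
}
```

A nightly job could load both maps from stored results, call `FindRegressions(baseline, current, 0.10)`, and fail the run when the returned slice is non-empty.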
Why is this needed?
Grove orchestrates complex multi-node workloads with multi-layer gang scheduling (PodGangSet → ScalingGroup → PodClique → Pods). To ensure reliable operation at scale:
- Validate performance characteristics: Understand operator behavior from 100 pods to more than 10,000 pods
- Detect regressions: Automatically track performance metrics (reconciliation latency, memory usage, throughput) across git commits
- Identify bottlenecks: Profile operator under load to optimize critical code paths (cascade sync, gang coordination, topology evaluation)
- Build confidence: Demonstrate that Grove handles production workloads at scale before customers deploy it