GREP: E2E Scale Test Infrastructure #402

@Ronkahn21

What you would like to be added?

Add comprehensive scale-testing infrastructure for the Grove operator, including:

  • KWOK/Kubemark cluster setup: Automated deployment of simulated node clusters (1000+ nodes)
  • Observability integration: Prometheus metrics collection, pprof profiling, optional Pyroscope continuous profiling
  • Performance baseline tracking: Store and compare metrics across versions with regression detection
  • CI integration: Automated nightly scale tests with benchstat comparison
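
As a rough illustration of the baseline-comparison idea, here is a minimal Go sketch that flags metrics which got worse than a relative threshold. It is not Grove code: the `Metrics` and `Regression` types, the `DetectRegressions` function, the metric names, and the 10% threshold are all hypothetical, and it assumes every metric is "lower is better". It is a plain threshold check, not the statistical comparison that benchstat performs.

```go
package main

import (
	"fmt"
	"sort"
)

// Metrics maps a metric name to a value where lower is better
// (for example latency in seconds or memory in MiB).
// The type and the metric names used below are hypothetical.
type Metrics map[string]float64

// Regression describes one metric that worsened beyond the allowed threshold.
type Regression struct {
	Name     string
	Baseline float64
	Current  float64
	Change   float64 // relative change; 0.25 means 25% worse than baseline
}

// DetectRegressions returns every metric whose current value exceeds its
// baseline by more than threshold (as a fraction of the baseline).
// Metrics missing from current, or with a non-positive baseline, are skipped.
func DetectRegressions(baseline, current Metrics, threshold float64) []Regression {
	var out []Regression
	for name, base := range baseline {
		cur, ok := current[name]
		if !ok || base <= 0 {
			continue
		}
		change := (cur - base) / base
		if change > threshold {
			out = append(out, Regression{Name: name, Baseline: base, Current: cur, Change: change})
		}
	}
	// Sort by name so the report is stable across runs.
	sort.Slice(out, func(i, j int) bool { return out[i].Name < out[j].Name })
	return out
}

func main() {
	// Illustrative numbers only; not measured results.
	baseline := Metrics{"reconcile_latency_p99_seconds": 2.0, "operator_memory_mib": 400}
	current := Metrics{"reconcile_latency_p99_seconds": 2.6, "operator_memory_mib": 410}

	for _, r := range DetectRegressions(baseline, current, 0.10) {
		fmt.Printf("REGRESSION %s: %.2f -> %.2f (%+.0f%%)\n",
			r.Name, r.Baseline, r.Current, r.Change*100)
	}
}
```

A nightly CI job could run a check like this against the stored baseline for the previous release and fail the run when the returned slice is non-empty. Noisy metrics would likely need several samples per run rather than a single value.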

Why is this needed?

Grove orchestrates complex multi-node workloads with multi-layer gang scheduling (PodGangSet → ScalingGroup → PodClique → Pods). To ensure reliable operation at scale:

  1. Validate performance characteristics: Understand operator behavior with workloads ranging from 100 to 10k+ pods
  2. Detect regressions: Automatically track performance metrics (reconciliation latency, memory usage, throughput) across git commits
  3. Identify bottlenecks: Profile the operator under load to optimize critical code paths (cascade sync, gang coordination, topology evaluation)
  4. Build confidence: Demonstrate that Grove handles production workloads at scale before customers deploy it

Labels: enhancement (New feature or request)